A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use sales forecasts to estimate weekly, monthly, quarterly, and annual sales totals. An accurate sales forecast adds value across an organization and helps different business verticals plan their future course of action.
Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate forecasting process has many benefits, including improved decision-making about the future and reduced sales pipeline and forecast risk. It also reduces the time spent planning territory coverage and establishes benchmarks that can be used to assess future trends.
SuperKart is a retail chain operating supermarkets and food marts in cities across different tiers, offering a wide range of products. To optimize its inventory management and make informed decisions about regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.
To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.
The data contains attributes of the various products and stores. The detailed data dictionary is given below.
# Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be safely ignored, as the code above ensures that all necessary libraries and their dependencies are installed at versions that successfully execute the code in this notebook.
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# For splitting the dataset
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)
# Libraries for different ensemble regressors
from sklearn.ensemble import (
BaggingRegressor,
RandomForestRegressor,
AdaBoostRegressor,
GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
# Libraries to get different regression metric scores
from sklearn.metrics import (
mean_squared_error,
mean_absolute_error,
r2_score,
mean_absolute_percentage_error,
)
# To create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline
# To tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder
# To serialize the model
import joblib
# os related functionalities
import os
# API request
import requests
# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi
import math
# Connect to google drive
from google.colab import drive
drive.mount('/content/drive')
# Load the dataset from a CSV file into a Pandas DataFrame
kart = pd.read_csv("/content/drive/MyDrive/Model Deployment/Full_Code/SuperKart.csv")
# Make a working copy of kart
data = kart.copy()
# The first 5 rows of the dataset
data.head()
# The last 5 rows of the dataset
data.tail()
# Checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
# Display the column names of the dataset
data.columns
# Checking column datatypes and number of non-null values
data.info()
# Checking for duplicate values
data.duplicated().sum()
# Checking for missing values
data.isnull().sum()
# Statistical summary of the data for both numerical and categorical columns
data.describe(include='all').T
📌 Observations on the Dataset
The dataset has 12 columns:
5 numerical (float/int):
Product_Weight (float)
Product_Allocated_Area (float)
Product_MRP (float)
Store_Establishment_Year (int)
Product_Store_Sales_Total (float – target)
7 categorical (object)
Data types are appropriate and consistent with business meaning.
Memory usage is low (~822 KB), making it efficient for experimentation.
No missing values across all columns.
This eliminates the need for:
Imputation strategies
Row/column removal
The dataset is clean and ready for modeling.
0 duplicate rows found.
Each product–store combination appears to be unique, improving data reliability.
No deduplication steps are required.
High-cardinality column:
Product_Id → 8,763 unique values
Acts more like an identifier than a predictive feature.
Should be removed or feature-engineered (e.g., prefix extraction).
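As a hedged sketch of this prefix-extraction idea: the snippet below assumes a hypothetical ID format of leading letters followed by digits (e.g. "FD1234"); the actual Product_Id pattern in SuperKart.csv may differ, so the regex would need to be adapted.

```python
import pandas as pd

# Hypothetical IDs; the real Product_Id format may differ
ids = pd.Series(["FD1234", "DR5678", "NC9012"], name="Product_Id")

# Keep only the leading letters as a coarse, low-cardinality category
prefix = ids.str.extract(r"^([A-Za-z]+)", expand=False)
print(prefix.tolist())  # ['FD', 'DR', 'NC']
```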
Low to moderate cardinality columns:
Product_Sugar_Content → 4 categories (Low Sugar most frequent)
Product_Type → 16 categories (Fruits & Vegetables most common)
Store_Id → 4 stores
Store_Size → 3 levels (Medium dominant)
Store_Location_City_Type → 3 tiers (Tier 2 most frequent)
Store_Type → 4 types (Supermarket Type2 dominant)
Product_Weight
Mean ≈ 12.65
Range: 4 to 22
Fairly symmetric distribution with moderate variance.
Product_Allocated_Area
Mean ≈ 0.069
Highly right-skewed (most products have small shelf area).
Likely a strong driver of sales visibility.
Product_MRP
Mean ≈ 147
Range: 31 to 266
Wide pricing range suggests varied product positioning.
Store_Establishment_Year
Range: 1987 to 2009
Median ≈ 2009
Can be converted into Store_Age for better interpretability.
Product_Store_Sales_Total (target)
Mean ≈ 3464
Standard deviation ≈ 1066
Range: 33 to 8000
Indicates:
High variability in sales
Possible outliers
Distribution is likely right-skewed, which tree-based models handle well.
Dataset is fully clean (no missing or duplicate values).
Strong mix of:
Product-level features
Store-level features
Tree-based regressors (Random Forest, XGBoost) are ideal.
Feature engineering opportunities:
Drop or transform Product_Id
Create Store_Age
Summary
The dataset consists of 8,763 clean and duplicate-free records with a balanced mix of numerical and categorical features. There are no missing values, and the target variable shows significant variability, making the dataset suitable for regression modeling using ensemble-based methods with appropriate feature engineering.
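One of the feature-engineering ideas listed above, converting Store_Establishment_Year into Store_Age, can be sketched as follows. The reference year (2010, one year after the latest establishment year in the data) is an assumption; any fixed snapshot year works.

```python
import pandas as pd

df = pd.DataFrame({"Store_Establishment_Year": [1987, 1998, 2009]})

# Assumed snapshot year; pick whatever reference year fits the analysis
REFERENCE_YEAR = 2010
df["Store_Age"] = REFERENCE_YEAR - df["Store_Establishment_Year"]
print(df["Store_Age"].tolist())  # [23, 12, 1]
```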
# Function to plot a boxplot and a histogram
def histogram_boxplot(
data,
feature,
figsize=(12, 7),
kde=True,
bins="auto",
title=None,
color="#8b5cf6",
hist_alpha=0.35,
show_stats_box=True,
show_stats_subtitle=True,
plot_gap=0.18,
title_y=0.98,
top_margin=0.90,
):
"""
Combined boxplot + histogram for a numeric feature, with optional stats.
"""
sns.set_theme(style="whitegrid", context="notebook")
x = data[feature].dropna()
if x.empty:
raise ValueError(f"Column '{feature}' has no non-null values to plot.")
# --- Summary stats ---
n = x.shape[0]
std = x.std()
min_v = x.min()
max_v = x.max()
mean_v = x.mean()
median_v = x.median()
fig, (ax_box, ax_hist) = plt.subplots(
nrows=2,
sharex=True,
figsize=figsize,
gridspec_kw={"height_ratios": (0.28, 0.72), "hspace": plot_gap},
)
# --- Boxplot ---
sns.boxplot(
x=x,
ax=ax_box,
color=color,
showmeans=True,
meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
medianprops=dict(color="black", linewidth=2),
whiskerprops=dict(linewidth=1.3),
boxprops=dict(linewidth=1.3),
)
ax_box.set(xlabel="")
ax_box.set_yticks([])
sns.despine(ax=ax_box, left=True, bottom=True)
# --- Histogram ---
sns.histplot(
x=x,
ax=ax_hist,
bins=bins if bins is not None else "auto",
kde=kde,
color=color,
alpha=hist_alpha,
edgecolor="white",
linewidth=1,
)
# Mean/Median lines
ax_hist.axvline(mean_v, color="#16a34a", linestyle="--", linewidth=2, label=f"Mean: {mean_v:,.2f}")
ax_hist.axvline(median_v, color="#111827", linestyle="-", linewidth=2, label=f"Median: {median_v:,.2f}")
ax_hist.legend(frameon=True, fontsize=10, loc="upper right")
ax_hist.set_ylabel("Count")
ax_hist.set_xlabel(feature)
sns.despine(ax=ax_hist)
# --- Title + subtitle (stats line) ---
main_title = title or f"Distribution of {feature}"
fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)
if show_stats_subtitle:
subtitle = f"n={n:,} std={std:,.2f} min={min_v:,.2f} max={max_v:,.2f}"
# place subtitle just below suptitle
fig.text(0.5, title_y - 0.045, subtitle, ha="center", va="top", fontsize=11)
# --- Stats box inside histogram ---
if show_stats_box:
stats_text = (
f"n = {n:,}\n"
f"std = {std:,.2f}\n"
f"min = {min_v:,.2f}\n"
f"max = {max_v:,.2f}"
)
ax_hist.text(
0.01, 0.98, stats_text,
transform=ax_hist.transAxes,
va="top", ha="left",
fontsize=10,
bbox=dict(boxstyle="round,pad=0.35", facecolor="white", edgecolor="#e5e7eb", alpha=0.95),
)
# Make room for title/subtitle
fig.subplots_adjust(top=top_margin)
return fig, (ax_box, ax_hist)
# Product Weight
histogram_boxplot(data, "Product_Weight", show_stats_box=False, show_stats_subtitle=True)
plt.show()
📊 Univariate Analysis – Product_Weight
The distribution of Product_Weight is approximately normal (bell-shaped).
Mean (12.65) and median (12.66) are almost identical, indicating a highly symmetric distribution.
The KDE curve confirms no strong skewness.
📌 Implication:
Since the feature is close to normally distributed, no transformation (log/sqrt) is required.
Mean: ~12.65
Median: ~12.66
Standard Deviation: ~2.22
This suggests:
Most products cluster tightly around the mean.
Product weights are well standardized, which is common in retail packaging.
Minimum: 4.0
Maximum: 22.0
Interquartile Range (IQR): roughly between 11 and 14
Most product weights lie in a narrow, realistic range, showing controlled product sizing.
A few outliers exist on both lower and upper ends:
Very light products (~4–6)
Very heavy products (~20–22)
These outliers are business-valid, not data errors (e.g., small sachets vs bulk items).
📌 Implication:
Outliers should not be removed, especially when using tree-based models (Random Forest, XGBoost), which are robust to them.
No missing values observed.
No abnormal spikes or irregular gaps.
Distribution aligns well with real-world retail data.
Summary
Product_Weight follows a near-normal distribution with minimal skewness and reasonable variability. The presence of a few valid outliers reflects real-world product diversity. No transformation or outlier treatment is required, making it a stable and reliable feature for modeling.
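The symmetry claims above can be quantified with pandas' built-in `.skew()`. This is a sketch on synthetic stand-in data (not the SuperKart columns): a normal draw mimicking Product_Weight and an exponential draw mimicking a right-skewed feature; values near 0 suggest symmetry, large positive values a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a symmetric feature and a right-skewed one
symmetric = pd.Series(rng.normal(12.65, 2.22, 5000))
right_skewed = pd.Series(rng.exponential(0.07, 5000))

print(round(symmetric.skew(), 2))     # close to 0
print(round(right_skewed.skew(), 2))  # clearly positive (long right tail)
```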
# Product Allocated Area
histogram_boxplot(data, "Product_Allocated_Area", show_stats_box=False, show_stats_subtitle=True)
plt.show()
📊 Univariate Analysis – Product_Allocated_Area
The distribution of Product_Allocated_Area is highly right-skewed (positively skewed).
Most values are concentrated toward the lower end, with a long tail extending to the right.
The KDE curve confirms a non-normal distribution.
📌 Implication:
This feature does not follow a normal distribution, and skewness should be considered during modeling.
Mean ≈ 0.07
Median ≈ 0.06
Mean is greater than the median, which is characteristic of right-skewed data.
This indicates that:
A small number of products receive disproportionately large shelf space.
Most products occupy relatively limited display area.
Minimum: ~0.00
Maximum: ~0.30
Standard deviation: ~0.05
The wide spread relative to the mean suggests:
Significant variation in shelf allocation across products.
Shelf space is a highly differentiated business decision.
The boxplot reveals multiple upper-end outliers.
These represent products with exceptionally high shelf visibility.
These outliers are business-driven and meaningful, not data errors.
📌 Implication:
Outliers should not be removed, especially for tree-based models that can leverage them effectively.
Products with higher allocated area likely:
Are high-demand or fast-moving items
Have stronger brand presence or promotional support
Shelf space is expected to have a direct positive impact on sales.
Summary
Product_Allocated_Area exhibits a strongly right-skewed distribution with several meaningful upper-end outliers, indicating that most products receive limited shelf space while a few receive significantly higher visibility. This feature is expected to have a strong influence on sales and should be retained without outlier removal.
# Product MRP
histogram_boxplot(data, "Product_MRP", show_stats_box=False, show_stats_subtitle=True)
plt.show()
📊 Univariate Analysis – Product_MRP
The distribution of Product_MRP is approximately normal (bell-shaped).
The KDE curve shows a symmetric pattern around the center.
Mean (147.03) and median (146.74) are almost identical, indicating very low skewness.
📌 Implication:
No transformation (log/sqrt) is required for this feature.
Mean: ~147.03
Median: ~146.74
Standard Deviation: ~30.69
This indicates:
Moderate variability in product pricing.
Prices are well-distributed across a mid-range retail spectrum.
Minimum: 31
Maximum: 266
This suggests the presence of:
Low-priced, mass-market products
High-priced, premium products
The dataset covers a wide price band, making it informative for modeling sales behavior.
A small number of outliers on both ends of the price spectrum:
Very low-priced items
Premium-priced products
These outliers are realistic and business-valid, not data issues.
📌 Implication:
Outliers should be retained, especially for tree-based models that handle them naturally.
Product_MRP is expected to have a strong influence on sales revenue:
Higher-priced products contribute more to total sales value
Interaction with volume and shelf space is likely
It may interact with:
Product_Allocated_Area
Store_Type
Store_Location_City_Type
Summary
Product_MRP follows an approximately normal distribution with minimal skewness and a wide price range. The presence of valid low- and high-priced products makes it a strong and reliable predictor of sales without requiring transformation or outlier treatment.
# Product Store Sales Total
histogram_boxplot(data, "Product_Store_Sales_Total", show_stats_box=False, show_stats_subtitle=True)
plt.show()
📊 Univariate Analysis – Product_Store_Sales_Total
The distribution of Product_Store_Sales_Total is approximately bell-shaped with slight right skew.
The KDE curve peaks around the center and tapers gradually on both sides.
Mean (3,464) and median (3,452) are very close, indicating near-symmetry.
📌 Implication:
The target variable is well-behaved, making it suitable for a wide range of regression models.
Mean: ~3,464
Median: ~3,452
Standard Deviation: ~1,066
This indicates:
Significant variation in sales across products and stores.
Sales performance differs meaningfully depending on product and store characteristics.
Minimum: ~33
Maximum: ~8,000
This wide range suggests:
Some products have very low sales, possibly due to low demand or poor placement.
High-performing products contribute substantially higher revenue.
Boxplot shows outliers on both lower and upper ends:
Very low sales (near zero)
Extremely high sales (>6,000)
These values are business-realistic and expected in retail data.
📌 Implication:
Outliers should not be removed, as they represent genuine business scenarios and carry important signals.
Sales distribution reflects:
A majority of products generating moderate sales
A small proportion of high-performing products driving revenue
This aligns with the Pareto principle (80/20 rule) commonly seen in retail.
Summary
Product_Store_Sales_Total shows a near-normal distribution with moderate variability and meaningful outliers. The wide sales range reflects real-world retail behavior, making the target variable suitable for regression modeling without aggressive transformation or outlier treatment.
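The Pareto-style pattern mentioned above can be sanity-checked numerically. This sketch uses synthetic right-skewed sales values (not the SuperKart target) just to show the computation: sort descending, take the top 20% of records, and measure their revenue share.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic right-skewed "sales" values standing in for the real target
sales = rng.lognormal(mean=8.0, sigma=1.0, size=8763)

sorted_sales = np.sort(sales)[::-1]   # descending
top_n = int(0.2 * len(sorted_sales))  # top 20% of records
share = sorted_sales[:top_n].sum() / sorted_sales.sum()
print(f"Top 20% of records contribute {share:.0%} of total revenue")
```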
# Function to create labeled barplots
def labeled_barplot(
data,
feature,
perc=False,
n=None,
figsize=None,
title=None,
color="#8b5cf6",
rotate=45,
show_stats_subtitle=True,
):
sns.set_theme(style="whitegrid", context="notebook")
s = data[feature]
total = len(s)
missing = int(s.isna().sum())
# Use a fill value so missing categories can be seen (optional but helpful)
plot_s = s.fillna("Missing")
# Build counts and optionally select top-n
vc = plot_s.value_counts(dropna=False)
if n is not None:
vc = vc.head(n)
order = vc.index.tolist()
n_cat = len(order)
# Auto figure sizing
if figsize is None:
width = max(8, min(16, 1.1 * n_cat + 2))
figsize = (width, 6)
fig, ax = plt.subplots(figsize=figsize)
# Bars
sns.countplot(
x=plot_s,
order=order,
color=color,
ax=ax,
edgecolor="white",
linewidth=1,
)
# Titles (similar to the previous function style)
main_title = title or f"Distribution of {feature}"
fig.suptitle(main_title, fontsize=15, fontweight="bold", y=0.98)
if show_stats_subtitle:
subtitle = f"rows={total:,} unique={s.nunique(dropna=True):,} missing={missing:,}"
fig.text(0.5, 0.94, subtitle, ha="center", va="top", fontsize=11)
# Axis labels (bar heights are raw counts; labels above bars show % when perc=True)
ax.set_xlabel(feature)
ax.set_ylabel("Count")
# Nice tick labels
ax.tick_params(axis="x", rotation=rotate)
for tick in ax.get_xticklabels():
tick.set_horizontalalignment("right" if rotate else "center")
# Value labels on bars
ymax = 0
for p in ax.patches:
h = p.get_height()
ymax = max(ymax, h)
if perc:
label = f"{(h / total) * 100:.1f}%"
else:
label = f"{int(h):,}"
ax.annotate(
label,
(p.get_x() + p.get_width() / 2, h),
ha="center",
va="bottom",
fontsize=11,
xytext=(0, 4),
textcoords="offset points",
)
# Add headroom so labels don't touch the top
ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)
# Clean spines
sns.despine(ax=ax)
# Layout room for suptitle/subtitle
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
# Product Sugar Content
labeled_barplot(data, "Product_Sugar_Content", perc=True)
📊 Univariate Analysis – Product_Sugar_Content
The variable Product_Sugar_Content has 4 distinct categories:
| Category | Approx. Percentage |
|---|---|
| Low Sugar | ~55.7% |
| Regular | ~25.7% |
| No Sugar | ~17.3% |
| reg | ~1.2% |
Low Sugar products dominate the dataset, accounting for more than half of all observations.
Regular sugar products form the second-largest segment.
No Sugar products represent a smaller but meaningful portion.
The category reg appears to be a label inconsistency, not a true separate category.
There are no missing values in this feature.
The presence of both "Regular" and "reg" indicates a data quality issue due to inconsistent labeling.
📌 Required Action:
"reg" should be merged with "Regular" to avoid:
Incorrect category inflation
Unnecessary dummy variables after encoding
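The merge described above is a one-line fix. A minimal sketch on a toy series (the full notebook would apply the same `replace` to `data["Product_Sugar_Content"]`):

```python
import pandas as pd

s = pd.Series(["Low Sugar", "Regular", "reg", "No Sugar", "reg"])

# Consolidate the inconsistent "reg" label into "Regular"
s = s.replace({"reg": "Regular"})
print(sorted(s.unique()))  # ['Low Sugar', 'No Sugar', 'Regular']
```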
The distribution is moderately imbalanced, but all categories still have sufficient representation.
📌 Implication:
This imbalance is not severe and does not require resampling, but it should be kept in mind during model interpretation.
The dominance of Low Sugar products reflects:
Increasing consumer preference for healthier food options
Retail strategy focused on health-conscious offerings
No Sugar products may cater to:
Niche markets
Specific dietary needs (e.g., diabetic-friendly products)
Summary
Product_Sugar_Content is a categorical feature dominated by Low Sugar products, indicating a health-oriented product mix. A minor labeling inconsistency (reg vs Regular) must be resolved before modeling. The feature shows meaningful variation and is suitable for one-hot encoding.
# Product Type
labeled_barplot(data, "Product_Type", perc=True)
📊 Univariate Analysis – Product_Type
Product_Type contains 16 distinct categories, indicating a diverse product portfolio.
The distribution is uneven, with certain categories contributing significantly more products than others.
Top contributing categories:
Fruits and Vegetables – ~14.3% (highest)
Snack Foods – ~13.1%
Frozen Foods – ~9.3%
Dairy – ~9.1%
Household – ~8.4%
These categories together account for more than half of the dataset.
Baking Goods, Canned, Health and Hygiene, Meat each contribute 7–8%.
These categories represent stable, essential consumer goods with consistent presence across stores.
Soft Drinks – ~5.9%
Breads, Hard Drinks, Others, Starchy Foods, Breakfast, Seafood each contribute less than 3%.
Seafood has the lowest representation (~0.9%).
📌 Implication:
Some categories are sparsely represented, which may limit their standalone predictive power.
No missing values observed.
Category labels are clean and interpretable.
No obvious inconsistencies or noise in category naming.
Dominance of Fruits & Vegetables and Snack Foods suggests:
High demand and fast-moving inventory
Frequent replenishment cycles
Lower presence of categories like Seafood and Breakfast may reflect:
Supply constraints
Lower consumer demand
Store-type or location-based limitations
Summary
Product_Type shows a diverse yet imbalanced distribution, with Fruits and Vegetables and Snack Foods dominating the product mix. While most categories are well-represented, a few low-frequency types may contribute limited predictive power. The feature is clean and suitable for one-hot encoding with minimal preprocessing.
# Store ID
labeled_barplot(data, "Store_Id", perc=True)
📊 Univariate Analysis – Store_Id
Store_Id has 4 unique stores.
The distribution is highly imbalanced across stores:
| Store_Id | Approx. Share |
|---|---|
| OUT004 | ~53.4% |
| OUT001 | ~18.1% |
| OUT003 | ~15.4% |
| OUT002 | ~13.1% |
No missing values detected.
Store identifiers are clean and consistent.
The dominance of OUT004 suggests:
Larger store size
Higher product assortment
Possibly higher footfall or longer operational history
Smaller representation from other stores may reflect:
Smaller physical size
Lower product variety
Different regional demand
Summary
Store_Id shows a highly imbalanced distribution, with OUT004 contributing over half of the observations. This reflects real operational differences across stores and should be preserved as a categorical feature during modeling.
# Store Size
labeled_barplot(data, "Store_Size", perc=True)
📊 Univariate Analysis – Store_Size
Store_Size has 3 distinct categories: Small, Medium, High.
The distribution is highly skewed toward Medium-sized stores:
| Store Size | Approx. Share |
|---|---|
| Medium | ~68.8% |
| High | ~18.1% |
| Small | ~13.1% |
No missing values are present.
Store size categories are clean, consistent, and interpretable.
The dominance of Medium-sized stores suggests:
SuperKart’s primary operational focus is on mid-sized outlets.
These stores likely balance product variety and operating costs efficiently.
High-sized stores may:
Carry wider assortments
Generate higher total sales per store
Small stores may:
Have limited shelf space
Focus on essential or fast-moving products
Summary
Store_Size is a clean categorical feature dominated by Medium-sized stores, reflecting the retailer’s core store format. The feature is business-relevant and likely to significantly influence sales performance.
# Store Location City Type
labeled_barplot(data, "Store_Location_City_Type", perc=True)
📊 Univariate Analysis – Store_Location_City_Type
Store_Location_City_Type has 3 distinct categories: Tier 1, Tier 2, Tier 3.
The distribution is heavily skewed toward Tier 2 cities:
| City Tier | Approx. Share |
|---|---|
| Tier 2 | ~71.5% |
| Tier 1 | ~15.4% |
| Tier 3 | ~13.1% |
No missing values present.
City tier labels are consistent and well-defined.
The dominance of Tier 2 cities suggests:
SuperKart’s strategic focus on fast-growing urban markets
Lower operational costs compared to Tier 1 cities
Tier 1 cities:
Likely have higher purchasing power
May generate higher revenue per product
Tier 3 cities:
Possibly lower demand
More price-sensitive customer base
Summary
Store_Location_City_Type is dominated by Tier 2 cities, indicating a strategic focus on mid-tier urban markets. The feature is clean, business-relevant, and expected to have a significant impact on sales performance.
# Store Type
labeled_barplot(data, "Store_Type", perc=True)
📊 Univariate Analysis – Store_Type
Store_Type has 4 distinct categories:
Supermarket Type1
Supermarket Type2
Departmental Store
Food Mart
The distribution is clearly dominated by Supermarket Type2:
| Store Type | Approx. Share |
|---|---|
| Supermarket Type2 | ~53.4% |
| Supermarket Type1 | ~18.1% |
| Departmental Store | ~15.4% |
| Food Mart | ~13.1% |
No missing values detected.
Store type labels are consistent and business-meaningful.
The dominance of Supermarket Type2 suggests:
These stores may be larger, better stocked, or more strategically located.
They likely generate higher sales volumes due to better infrastructure and assortment.
Supermarket Type1 and Departmental Stores provide moderate coverage.
Food Marts represent smaller, possibly neighborhood-focused outlets.
Summary
Store_Type is dominated by Supermarket Type2 stores, indicating a core store format that likely drives the majority of sales. The feature is clean, well-distributed, and highly relevant for sales prediction modeling.
# Correlation Matrix
def nice_corr_heatmap_complete(
data,
cols=None,
method="pearson",
figsize=(12, 9),
cmap="Spectral",
annot="auto",
fmt=".2f",
linewidths=0.6,
cbar_shrink=0.85,
title="Correlation Heatmap",
subtitle=True,
title_y=0.98,
top_margin=0.90,
square=True,
):
sns.set_theme(style="white", context="notebook")
if cols is None:
cols = data.select_dtypes(include=np.number).columns.tolist()
if len(cols) == 0:
raise ValueError("No numeric columns found to compute correlation.")
corr = data[cols].corr(method=method)
# Auto-annotation to avoid clutter on big matrices
if annot == "auto":
annot = corr.shape[0] <= 12
fig, ax = plt.subplots(figsize=figsize)
sns.heatmap(
corr,
cmap=cmap,
vmin=-1, vmax=1, center=0,
square=square,
linewidths=linewidths,
linecolor="white",
annot=annot,
fmt=fmt,
annot_kws={"size": 9} if annot else None,
cbar_kws={"shrink": cbar_shrink, "pad": 0.02},
ax=ax,
)
fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)
if subtitle:
rows = len(data)
n_features = len(cols)
miss = int(data[cols].isna().sum().sum())
sub = f"rows={rows:,} numeric_features={n_features:,} missing_values_in_matrix={miss:,} method={method}"
fig.text(0.5, title_y - 0.045, sub, ha="center", va="top", fontsize=11)
ax.tick_params(axis="x", rotation=45)
ax.tick_params(axis="y", rotation=0)
for t in ax.get_xticklabels():
t.set_horizontalalignment("right")
sns.despine(ax=ax, left=True, bottom=True)
fig.subplots_adjust(top=top_margin)
fig.tight_layout(rect=[0, 0, 1, 0.90])
plt.show()
# Correlation Heatmap
nice_corr_heatmap_complete(data)
📊 Bivariate Analysis – Correlation Matrix (Numerical Features)
Numerical Features Considered
The correlation matrix includes 5 numerical variables:
Product_Weight
Product_Allocated_Area
Product_MRP
Store_Establishment_Year
Product_Store_Sales_Total (Target)
Pearson correlation method is used.
🔍 Key Observations & Insights
🔹 Product_MRP vs Product_Store_Sales_Total
Correlation ≈ +0.79 (Strong Positive)
This is the strongest correlation with the target.
📌 Interpretation:
Higher-priced products tend to generate higher total sales value, which is expected in revenue-based forecasting.
🔹 Product_Weight vs Product_Store_Sales_Total
📌 Interpretation:
Heavier products may:
Be sold in larger quantities
Represent premium or bulk items
This makes Product_Weight a strong predictor of sales.
🔹 Product_Allocated_Area vs Product_Store_Sales_Total
📌 Interpretation:
Although shelf space is important from a business perspective, its linear relationship with sales is weak.
However:
This does not mean the feature is useless
The relationship may be non-linear, which tree-based models can capture
🔹 Store_Establishment_Year vs Product_Store_Sales_Total
📌 Interpretation:
Older stores (lower establishment year) tend to have slightly higher sales, possibly due to:
Established customer base
Brand familiarity
This effect is weak but meaningful.
🔹 Product_MRP vs Product_Weight
📌 Interpretation:
Heavier products often cost more, which is logically consistent.
⚠️ Multicollinearity Check:
Correlation is moderate, not high enough to cause serious multicollinearity issues.
Safe for use in both linear and tree-based models.
No pair of independent variables shows very high correlation (>0.85).
This indicates:
Stable model training
Reliable coefficient interpretation (for linear models)
Summary
The correlation analysis shows that Product_MRP and Product_Weight have strong positive relationships with total sales, making them key predictors. Store establishment year shows a weak negative correlation, while product allocated area has no linear relationship, suggesting potential non-linear effects. No severe multicollinearity is observed, supporting the use of all numerical features in modeling.
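A quick complement to the heatmap above is a sorted ranking of each feature's correlation with the target. This sketch uses synthetic stand-in data (the real notebook would run the same chained call on `data`); the coefficients used to build the synthetic target are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
mrp = rng.normal(147, 30, n)
weight = rng.normal(12.65, 2.22, n)
# Synthetic target: MRP dominates, weight contributes weakly, plus noise
target = 20 * mrp + 50 * weight + rng.normal(0, 300, n)

df = pd.DataFrame({
    "Product_MRP": mrp,
    "Product_Weight": weight,
    "Product_Store_Sales_Total": target,
})
corr_with_target = (
    df.corr(numeric_only=True)["Product_Store_Sales_Total"]
    .drop("Product_Store_Sales_Total")
    .sort_values(ascending=False)
)
print(corr_with_target)
```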
# Function to plot scatter plots against the target
def nice_scatterplot(
data,
x,
y="Product_Store_Sales_Total",
figsize=(8, 6),
title=None,
subtitle=True,
color="#8b5cf6",
alpha=0.45,
s=45,
add_regline=False, # set True if you want a trend line
title_y=0.98,
):
sns.set_theme(style="whitegrid", context="notebook")
fig, ax = plt.subplots(figsize=figsize)
sns.scatterplot(
data=data,
x=x,
y=y,
ax=ax,
color=color,
alpha=alpha,
s=s,
edgecolor="white",
linewidth=0.6,
)
# Optional trend line (nice for relationships)
if add_regline:
sns.regplot(
data=data,
x=x,
y=y,
scatter=False,
ax=ax,
ci=None,
line_kws={"linewidth": 2},
)
main_title = title or f"{y} vs {x}"
fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)
if subtitle:
n = int(data[[x, y]].dropna().shape[0])
miss = int(data[[x, y]].isna().any(axis=1).sum())
fig.text(
0.5, title_y - 0.045,
f"points={n:,} rows_with_missing={miss:,}",
ha="center", va="top", fontsize=11
)
ax.set_xlabel(x)
ax.set_ylabel(y)
sns.despine(ax=ax)
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
# 1) Product_Weight vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Weight")
📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Weight
The scatter plot shows a strong positive linear relationship between Product_Weight and Product_Store_Sales_Total.
As product weight increases, total sales value generally increases as well.
This visually confirms the high positive correlation (~0.74) observed in the correlation matrix.
Data points form a clear upward-sloping pattern.
The relationship appears approximately linear, especially in the mid-range of product weights (8–18 units).
No abrupt breaks or non-linear curves are visible.
📌 Implication:
Both linear and tree-based models can effectively capture this relationship.
For lower weights (≈ 4–7), sales values cluster in a relatively narrow band.
For mid to higher weights (≈ 10–18), sales values show greater spread, indicating:
Variance slightly increases with weight (mild heteroscedasticity).
A few points show:
Very high sales (>7,000)
Very low sales at moderate weights
These outliers are business-valid (e.g., premium or bulk products).
📌 Implication:
Outliers should not be removed, as they represent genuine sales behavior.
Heavier products often:
Cost more
Are sold in bulk or premium segments
This naturally leads to higher revenue per product, explaining the strong positive trend.
Summary
The scatter plot reveals a strong positive linear relationship between Product_Weight and total sales, indicating that heavier products tend to generate higher revenue. The trend aligns with correlation analysis, shows realistic variability, and confirms Product_Weight as a key predictor for sales forecasting.
# 2) Product_Allocated_Area vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Allocated_Area")
📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Allocated_Area
The scatter plot shows no strong linear relationship between Product_Allocated_Area and Product_Store_Sales_Total.
Sales values are widely dispersed across all levels of allocated area.
This visually confirms the near-zero correlation observed in the correlation matrix.
📌 Key Insight:
Shelf space alone does not linearly explain sales performance.
Most products have low allocated area (0.00–0.10).
High allocated area values (>0.20) are rare.
Across both low and high shelf space:
Sales range from very low to very high.
No clear upward or downward trend is visible.
Across both low and higher allocated areas, variance remains roughly constant, indicating no clear heteroscedastic pattern.
A few products with high shelf space but moderate sales.
Some products with low shelf space but very high sales.
These are business-realistic scenarios:
Popular items sell well even with limited shelf space
Poor-performing products may still receive promotional space
📌 Implication:
Outliers are meaningful and should be retained.
Shelf space allocation is likely:
Influenced by expected demand, not actual sales alone
Interacting with other variables such as:
Product price
Product type
Store size
Sales performance is multi-factor driven, not dependent on shelf space alone.
Summary
The scatter plot indicates no clear linear relationship between Product_Allocated_Area and total sales, suggesting that shelf space alone does not drive sales outcomes. However, the feature may still be valuable through non-linear interactions with other product and store attributes.
# 3) Product_MRP vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_MRP")
📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_MRP
The scatter plot shows a strong positive linear relationship between Product_MRP and Product_Store_Sales_Total.
As product price increases, total sales revenue consistently increases.
This visually confirms the high positive correlation (~0.79) seen in the correlation matrix.
Data points form a clear upward-sloping trend.
The relationship appears almost linear across the entire price range.
Minimal curvature or deviation from linearity is observed.
📌 Implication:
This feature is well-suited for linear regression as well as tree-based models.
At lower MRP values (30–80), sales remain modest and tightly grouped.
At mid to high MRP values (100–220):
Sales values increase substantially.
Variability increases slightly, indicating the influence of additional factors such as store type and shelf space.
A few products exhibit:
Very high MRP (>250) with high sales
Moderate MRP with unusually low or high sales
These are business-valid scenarios, not anomalies.
📌 Implication:
Outliers should be retained as they provide valuable information.
Higher-priced products:
Generate more revenue per unit sold
Are often associated with premium or bulk offerings
This explains why MRP is one of the strongest drivers of sales revenue.
Summary
Product_MRP exhibits a strong positive linear relationship with total sales, indicating that higher-priced products consistently generate higher revenue. This confirms Product_MRP as one of the most influential predictors in the sales forecasting model.
def revenue_by_category(
    data,
    category,
    revenue_col="Product_Store_Sales_Total",
    top_n=None,
    figsize=None,
    title=None,
    rotate=45,
    color="#8b5cf6",
    show_values=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")
    # Aggregate revenue
    df_rev = (
        data.groupby(category, dropna=False)[revenue_col]
        .sum()
        .reset_index()
        .sort_values(revenue_col, ascending=False)
    )
    # Optionally limit to top N categories
    if top_n is not None:
        df_rev = df_rev.head(top_n)
    # Auto size
    if figsize is None:
        width = max(9, min(18, 1.1 * len(df_rev) + 3))
        figsize = (width, 6)
    fig, ax = plt.subplots(figsize=figsize)
    sns.barplot(
        data=df_rev,
        x=category,
        y=revenue_col,
        ax=ax,
        color=color,
        edgecolor="white",
        linewidth=1,
    )
    # Titles (same style)
    main_title = title or f"Revenue by {category}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)
    total_rev = df_rev[revenue_col].sum()
    top_cat = df_rev.iloc[0][category]
    top_rev = df_rev.iloc[0][revenue_col]
    fig.text(
        0.5,
        title_y - 0.045,
        f"total_revenue={total_rev:,.0f} top={top_cat} ({top_rev:,.0f})",
        ha="center",
        va="top",
        fontsize=11,
    )
    ax.set_xlabel(category)
    ax.set_ylabel("Revenue")
    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")
    # Value labels on bars
    if show_values:
        ymax = df_rev[revenue_col].max()
        for p in ax.patches:
            h = p.get_height()
            ax.annotate(
                f"{h:,.0f}",
                (p.get_x() + p.get_width() / 2, h),
                ha="center",
                va="bottom",
                fontsize=10,
                xytext=(0, 4),
                textcoords="offset points",
            )
        ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)
    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
    # If you need the aggregated dataframe later, return it:
    return df_rev
# Revenue by Product Type
df_revenue1 = revenue_by_category(
    data,
    category="Product_Type",
    figsize=(14, 7),
    rotate=60,
    title="Which Product Type Generates the Most Revenue?"
)
📊 Bivariate Analysis – Revenue by Product_Type
The total revenue across all product types is approximately 30.35 million.
Revenue contribution is unevenly distributed across product categories, indicating that some categories drive a disproportionate share of revenue.
The highest revenue contributors are:
Fruits and Vegetables – ~4.30M (Highest)
Snack Foods – ~3.99M
Dairy – ~2.81M
Frozen Foods – ~2.81M
Household – ~2.56M
📌 Key Insight:
Fruits and Vegetables alone generate the highest revenue, making it the most critical product category for SuperKart.
The mid-range revenue contributors are:
Baking Goods – ~2.45M
Canned – ~2.30M
Health and Hygiene – ~2.16M
Meat – ~2.13M
These categories:
Contribute steady and meaningful revenue
Represent essential and recurring consumer purchases
The lowest revenue contributors are:
Soft Drinks – ~1.80M
Breads – ~0.71M
Hard Drinks – ~0.63M
Others – ~0.54M
Starchy Foods – ~0.52M
Breakfast – ~0.36M
Seafood – ~0.27M (Lowest)
📌 Implication:
These categories either have:
Lower demand
Lower pricing
Limited shelf presence
Or fewer product SKUs
High revenue from Fruits & Vegetables and Snack Foods suggests:
High demand
Frequent purchases
Fast inventory turnover
Low-performing categories may require:
Better promotion
Optimized pricing
Improved shelf placement
Or strategic reduction if margins are low
Revenue dominance aligns with:
High representation of these categories in the dataset
Likely higher product turnover
Confirms that Product_Type is a strong driver of sales revenue, not just sales count.
Summary
Fruits and Vegetables generate the highest revenue for SuperKart, followed by Snack Foods and Dairy. Revenue contribution varies significantly across product types, highlighting Product_Type as a key driver of sales performance and a critical feature for forecasting models.
# Revenue by Product Sugar Content
df_revenue2 = revenue_by_category(
    data,
    category="Product_Sugar_Content",
    figsize=(9, 6),
    rotate=45,
    title="Revenue by Product Sugar Content"
)
📊 Bivariate Analysis – Revenue by Product_Sugar_Content
The total revenue across all sugar content categories is approximately 30.36 million.
Revenue distribution across sugar content levels is highly imbalanced, indicating strong consumer preference patterns.
| Sugar Content | Revenue | Share (Approx.) |
|---|---|---|
| Low Sugar | ~16.82M | ~55% |
| Regular | ~7.87M | ~26% |
| No Sugar | ~5.27M | ~17% |
| reg | ~0.39M | ~1% |
📌 Key Insight:
Low Sugar products dominate revenue generation, contributing more than half of total sales.
The presence of reg as a separate category:
Indicates a data inconsistency / labeling issue
Likely represents “Regular” sugar content
📌 Action Required:
This category should be merged with “Regular” during data cleaning.
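One way this merge could look in pandas — a sketch where `data` is a toy frame reproducing the inconsistent label, not the full dataset:

```python
import pandas as pd

# Toy frame reproducing the inconsistent sugar label seen in the notebook
data = pd.DataFrame({
    "Product_Sugar_Content": ["Low Sugar", "reg", "Regular", "No Sugar", "reg"]
})

# Fold the stray "reg" label into "Regular" before encoding/modeling
data["Product_Sugar_Content"] = data["Product_Sugar_Content"].replace({"reg": "Regular"})
print(data["Product_Sugar_Content"].value_counts())
```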
Strong revenue dominance of Low Sugar products suggests:
Growing consumer preference for health-conscious options
Successful product placement and assortment strategy
No Sugar products also contribute meaningfully, reinforcing the health trend.
Increase focus on:
Low Sugar and No Sugar product variants
Promotions and shelf placement for these categories
Reevaluate Regular sugar products:
Improve marketing
Reposition pricing if needed
Summary
Low Sugar products generate the majority of revenue, highlighting a strong consumer shift toward healthier options. Data inconsistency in sugar labeling should be corrected to ensure accurate modeling and insights.
# Revenue by Store Id
df_store_revenue = revenue_by_category(
    data,
    category="Store_Id",
    title="Revenue by Store",
    rotate=60,
    top_n=15,  # optional: store IDs can be many; keeps it readable
    figsize=(14, 6)
)
📊 Bivariate Analysis – Revenue by Store_Id
Total revenue across all stores is approximately 30.36 million.
Revenue generation is highly uneven across stores, indicating strong store-level performance differences.
| Store_Id | Revenue | Approx. Share |
|---|---|---|
| OUT004 | ~15.43M | ~51% |
| OUT003 | ~6.67M | ~22% |
| OUT001 | ~6.22M | ~21% |
| OUT002 | ~2.03M | ~7% |
📌 Key Insight:
OUT004 alone contributes more than half of the total revenue, making it the most dominant store.
OUT004 significantly outperforms all other stores combined.
OUT003 and OUT001 show similar and moderate performance.
OUT002 generates the least revenue, lagging far behind.
📌 Implication:
Revenue generation is not evenly distributed geographically or operationally.
Possible reasons for OUT004’s dominance:
Larger store size
Better product assortment
Higher footfall
Favorable city tier or location
Higher concentration of high-MRP and high-demand products
Conversely, OUT002 may suffer from:
Smaller store size
Lower customer traffic
Less optimal location
Limited inventory mix
OUT004 also had:
Highest number of product entries
Strong presence of high-performing product categories
Reinforces that store characteristics play a major role in revenue generation.
Summary
Revenue generation varies significantly across stores, with OUT004 contributing over half of total revenue. This highlights the critical role of store-specific factors such as size, location, and assortment in driving sales performance.
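The approximate shares quoted above can be reproduced from the store-level totals. This is a sketch using the rounded millions from the table, not a recomputation from the raw data:

```python
import pandas as pd

# Store revenue in millions, as quoted in the table above
rev = pd.Series({"OUT004": 15.43, "OUT003": 6.67, "OUT001": 6.22, "OUT002": 2.03})

share = (rev / rev.sum() * 100).round(1)  # percentage share per store
print(share.sort_values(ascending=False))
```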
# Revenue by Store Size
df_revenue3 = revenue_by_category(
    data,
    category="Store_Size",
    title="Revenue by Store Size",
    rotate=0,
    figsize=(8, 6)
)
📊 Bivariate Analysis – Revenue by Store Size
Total revenue across all stores is approximately 30.36 million.
Revenue contribution varies significantly by store size, indicating that store scale has a strong influence on sales performance.
| Store Size | Revenue | Approx. Share |
|---|---|---|
| Medium | ~22.10M | ~73% |
| High | ~6.22M | ~21% |
| Small | ~2.03M | ~6% |
📌 Key Insight:
Medium-sized stores dominate revenue generation, contributing nearly three-fourths of total revenue.
Medium stores likely strike the best balance between:
Product variety
Operational efficiency
Customer footfall
High-sized stores, despite larger physical space, contribute less than expected, possibly due to:
Higher operational costs
Diminishing returns on space utilization
Small stores generate minimal revenue, consistent with limited assortment and lower footfall.
Medium-sized stores also had the highest frequency count in the dataset.
Their dominance in both count and revenue reinforces their strategic importance in the business model.
Expansion strategy should prioritize medium-sized stores.
Optimization opportunities:
Improve revenue per square foot in high-sized stores.
Re-evaluate inventory and location strategy for small stores.
Summary
Medium-sized stores are the primary revenue drivers, contributing nearly 73% of total revenue, making store size a critical determinant of sales performance.
# Revenue by Store Location City Type
df_revenue4 = revenue_by_category(
    data,
    category="Store_Location_City_Type",
    title="Revenue by Store Location City Type",
    rotate=0,
    figsize=(9, 6)
)
📊 Bivariate Analysis – Revenue by Store Location City Type
Total revenue across all stores is approximately 30.36 million.
Revenue contribution varies significantly by city tier, highlighting the importance of store location in sales performance.
| City Tier | Revenue | Approx. Share |
|---|---|---|
| Tier 2 | ~21.65M | ~71% |
| Tier 1 | ~6.67M | ~22% |
| Tier 3 | ~2.03M | ~7% |
📌 Key Insight:
Tier 2 cities dominate revenue generation, contributing over 70% of total revenue, outperforming even Tier 1 cities.
Tier 2 cities likely benefit from:
High population density
Moderate competition
Strong demand for value-driven retail formats
Tier 1 cities, despite higher purchasing power, may face:
Market saturation
Higher operational costs
Tier 3 cities show limited revenue potential, possibly due to:
Lower footfall
Smaller store formats
Limited product assortment
Tier 2 cities also had the highest store count in univariate analysis.
The alignment of high store presence and high revenue reinforces Tier 2 cities as the company’s core market.
Expansion and investment strategies should focus on Tier 2 locations.
Tier 1 strategies should emphasize:
Premium products
Differentiated offerings
Tier 3 stores may require:
Cost optimization
Targeted product mixes
Summary
Tier 2 cities are the primary revenue drivers, contributing over 70% of total revenue, underscoring the strategic importance of mid-tier urban markets.
# Revenue by Store Type
df_revenue5 = revenue_by_category(
    data,
    category="Store_Type",
    title="Revenue by Store Type",
    rotate=0,
    figsize=(9, 6)
)
📊 Bivariate Analysis – Revenue by Store Type
Total revenue across all store types is approximately 30.36 million.
Revenue distribution varies significantly across different store formats, indicating that store type plays a crucial role in sales performance.
| Store Type | Revenue | Approx. Share |
|---|---|---|
| Supermarket Type2 | ~15.43M | ~51% |
| Departmental Store | ~6.67M | ~22% |
| Supermarket Type1 | ~6.22M | ~21% |
| Food Mart | ~2.03M | ~7% |
📌 Key Insight:
Supermarket Type2 dominates revenue generation, contributing over half of the total revenue on its own.
Supermarket Type2 stores likely benefit from:
Wider product assortment
Higher customer footfall
Strong presence in high-performing locations (e.g., Tier 2 cities)
Departmental Stores and Supermarket Type1 show comparable performance, suggesting:
Moderate scale
Stable but less aggressive sales potential
Food Marts generate the least revenue, consistent with:
Smaller store size
Limited product variety
Convenience-focused shopping behavior
Supermarket Type2 stores were:
Most frequent in the dataset
Dominant in Store_Id (OUT004) revenue
This reinforces the conclusion that store format + scale + location jointly drive revenue.
Expansion strategy should prioritize Supermarket Type2 stores.
Opportunities exist to:
Upgrade Supermarket Type1 stores to Type2 formats
Optimize product mix in Departmental Stores
Food Marts may be best suited for niche or essential-only strategies.
Summary
Supermarket Type2 stores are the primary revenue drivers, contributing over 50% of total revenue, making store format a critical determinant of sales performance.
def nice_boxplot_by_category(
    data,
    x_cat,
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    show_stats_subtitle=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")
    fig, ax = plt.subplots(figsize=figsize)
    # Boxplot (no hue needed when it equals x; avoids duplicate legends)
    sns.boxplot(
        data=data,
        x=x_cat,
        y=y,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )
    main_title = title or f"Boxplot - {x_cat} vs {y}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)
    if show_stats_subtitle:
        n = int(data[[x_cat, y]].dropna().shape[0])
        groups = int(data[x_cat].nunique(dropna=True))
        fig.text(
            0.5, title_y - 0.045,
            f"points={n:,} groups={groups:,}",
            ha="center", va="top", fontsize=11
        )
    ax.set_xlabel(x_cat)
    ax.set_ylabel(f"{y} (of each product)")
    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")
    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
# Store Id vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Id",
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title="Boxplot - Store_Id vs Product_Store_Sales_Total",
    rotate=90
)
📊 Bivariate Analysis – Store_Id vs Product_Store_Sales_Total (Boxplot)
The boxplot compares product-level sales distribution across four stores (OUT001–OUT004).
There is substantial variation in median sales, spread, and outliers across stores, indicating store-specific sales behavior.
OUT003 shows the highest median product sales, indicating stronger per-product performance.
OUT001 follows with moderately high median sales.
OUT004 has a lower median compared to OUT003 and OUT001, despite being the highest in total revenue.
OUT002 has the lowest median sales, confirming its weaker performance.
📌 Key Insight:
Higher total revenue does not always imply higher per-product sales.
OUT003 has the largest IQR (tallest box in the plot), suggesting:
Greater diversity in product performance
Presence of both very high and moderate selling products
OUT004 shows a moderate spread, indicating relatively consistent product sales.
OUT002 has a narrower distribution, reflecting limited sales potential and fewer high-performing products.
OUT003 exhibits several high-value outliers, including the maximum observed sales (~8000).
OUT004 also contains multiple high outliers but fewer extreme values.
OUT002 has some very low outliers, indicating poorly performing products.
📌 Implication:
Some stores rely on blockbuster products, while others show uniform but lower performance.
OUT003:
Strong per-product revenue potential
Opportunity to scale top-performing SKUs
OUT004: high total revenue driven by selling volume rather than per-product dominance.
OUT002: weak on both total and per-product sales.
Although OUT004 generates the highest total revenue, its median product sales are lower than OUT003's, suggesting its revenue comes from breadth and volume rather than individually high-selling products.
This confirms why Store_Id is a critical feature for modeling.
Summary
Product-level sales distributions vary significantly across stores, with OUT003 showing the highest median and variability, while OUT004’s high total revenue is driven by volume rather than per-product dominance.
# Store Size vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Size",
    y="Product_Store_Sales_Total",
    figsize=(12, 7),
    title="Boxplot - Store_Size vs Product_Store_Sales_Total",
    rotate=0
)
📊 Bivariate Analysis – Store_Size vs Product_Store_Sales_Total (Boxplot)
Product-level sales vary significantly across store sizes.
Store size clearly influences both median sales and variability, confirming it as a strong driver of revenue.
High-sized stores have the highest median product sales, indicating stronger per-product revenue.
Medium-sized stores show a moderate median, lower than High but significantly higher than Small.
Small-sized stores have the lowest median sales, reflecting limited sales capacity.
📌 Clear hierarchy:
High > Medium > Small in terms of per-product sales.
Medium-sized stores exhibit the widest spread and many high outliers, suggesting:
A mix of average and blockbuster products
Greater heterogeneity in product performance
High-sized stores have a more compact distribution, indicating:
Consistently strong product sales
Better standardization and optimized assortments
Small stores show:
Narrower spread
Limited upside and fewer high-performing products
Medium stores include extreme high outliers (up to ~8000), showing potential for exceptional products.
High stores have fewer extreme outliers but consistently high sales.
Small stores contain low-end outliers, highlighting weaker or non-performing SKUs.
High stores:
Best suited for premium and high-MRP products
Stable and predictable revenue per product
Medium stores:
Strong growth opportunities
Ideal for experimentation and new product launches
Small stores:
Limited revenue potential per product
Require focused assortment and cost optimization
This boxplot aligns perfectly with earlier findings where:
Medium-sized stores generated the highest total revenue
Despite High stores having higher per-product medians, Medium stores win on volume + diversity
📌 Key takeaway:
Total revenue dominance ≠ highest per-product sales.
Summary
Product-level sales increase with store size, with High stores showing the strongest per-product performance, while Medium stores balance consistency and extreme high-selling products.
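The distinction between total revenue and per-product sales can be made concrete with a small aggregation. The numbers below are toy values illustrating the pattern; the real comparison would group `data` by `Store_Id` or `Store_Size`:

```python
import pandas as pd

# Toy data: OUT004 wins on total revenue (volume), OUT003 wins on median per-product sales
data = pd.DataFrame({
    "Store_Id": ["OUT004"] * 6 + ["OUT003"] * 2,
    "Product_Store_Sales_Total": [1000, 1100, 1200, 1300, 1400, 1500, 2500, 2600],
})

# sum = total revenue per store; median = typical per-product sales
summary = data.groupby("Store_Id")["Product_Store_Sales_Total"].agg(["sum", "median"])
print(summary)
```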
def nice_boxplot_relation(
    data,
    x_cat,
    y_num,
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")
    fig, ax = plt.subplots(figsize=figsize)
    sns.boxplot(
        data=data,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )
    fig.suptitle(title or f"Boxplot - {x_cat} vs {y_num}", fontsize=15, fontweight="bold", y=title_y)
    n = int(data[[x_cat, y_num]].dropna().shape[0])
    groups = int(data[x_cat].nunique(dropna=True))
    fig.text(0.5, title_y - 0.045, f"points={n:,} groups={groups:,}", ha="center", va="top", fontsize=11)
    ax.set_xlabel("Types of Products" if x_cat == "Product_Type" else x_cat)
    ax.set_ylabel(y_num)
    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")
    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
# Product Type Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Type", y="Product_Weight", hue="Product_Type")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_Weight")
plt.xlabel("Types of Products")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False) # hide redundant legend
plt.show()
📦 Boxplot Interpretation: Product_Type vs Product_Weight
🔎 What this plot shows
X-axis: Product categories
Y-axis: Product weight
Each box summarizes the distribution of product weights within a product type:
Median (line)
Interquartile range (IQR)
Whiskers (typical min/max)
Dots (outliers)
🧠 Key Observations
Most product categories have:
Median weight ≈ 12–13 units
IQR roughly between 11 and 14
This indicates standardized packaging sizes across the business.
✅ No product type is fundamentally heavier or lighter than others.
Slightly higher medians seen in:
Starchy Foods
Seafood
Others
Slightly lower medians in:
Frozen Foods
Canned
These differences are small and overlapping, not statistically dominant.
📌 Conclusion:
Product_Type does not strongly determine Product_Weight
Outliers exist on both lower and higher ends:
Low-end: ~4–6 units
High-end: ~18–22 units
This suggests:
Multiple pack sizes
Special SKUs (bulk / premium packs)
⚠️ These outliers are expected and realistic, not data errors.
No product type shows:
Extreme dispersion
Unusually tight or wide spread
Reinforces the idea of uniform packaging standards
🔗 Relation to Earlier Findings
This plot aligns with the correlation results reported earlier:
Product_Weight has a strong positive correlation with sales
But Product_Type does NOT explain weight differences
➡️ Therefore:
Weight affects sales independently
Product_Type influences sales through demand, pricing, and volume, not weight
📝 One-line EDA Summary
Product weight distributions are largely consistent across product categories, indicating standardized packaging sizes. While product weight strongly influences sales, it is not driven by product type, suggesting an independent effect on revenue.
# Product Sugar Content Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Sugar_Content", y="Product_Weight", hue="Product_Sugar_Content")
plt.xticks(rotation=0)
plt.title("Boxplot - Product_Sugar_Content Vs Product_Weight")
plt.xlabel("Product_Sugar_Content")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False) # hide redundant legend
plt.show()
🍬📦 Relationship: Product Sugar Content vs Product Weight
🔍 What this plot analyzes
X-axis: Product Sugar Content (Low Sugar, Regular, No Sugar, reg)
Y-axis: Product Weight
Goal: Check whether sugar content influences product weight
🧠 Key Observations
Median weights across all sugar categories are very similar:
Interquartile ranges (IQRs) overlap heavily
✅ This indicates no strong relationship between sugar content and product weight.
All sugar categories show:
Comparable spread
Similar whisker lengths
Outliers on both ends
No category shows unusually heavy or light products overall
📌 Sugar formulation does not drive packaging size.
Low-end outliers (~4–6 units)
High-end outliers (~18–22 units)
These likely represent:
Mini packs
Family or bulk packs
Special SKUs
⚠️ These are natural business variations, not data issues.
The "reg" category is likely a data-quality issue:
"reg" appears redundant with "Regular"
Its distribution mirrors Regular almost exactly
🔗 Alignment with Previous Findings
This result is consistent with the earlier insights:
Product_Weight strongly correlates with sales
Sugar content strongly affects revenue composition
But sugar content does NOT affect weight
➡️ Therefore:
Sugar content impacts sales via consumer preference
Weight impacts sales via volume/quantity
These effects are independent
📝 One-line EDA Summary
Product weight distributions are consistent across sugar content categories, indicating that sugar formulation does not influence packaging size. Product weight and sugar content independently contribute to sales behavior.
def nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,  # None, "index" (row %), "columns" (col %), "all" (overall %)
    figsize=(14, 8),
    cmap="viridis",
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")
    ct = pd.crosstab(data[rows], data[cols], dropna=False)
    # Normalize if requested
    if normalize is not None:
        ct_plot = (
            ct.div(ct.sum(axis=1), axis=0) if normalize == "index"
            else ct.div(ct.sum(axis=0), axis=1) if normalize == "columns"
            else ct / ct.values.sum() if normalize == "all"
            else ct
        )
        annot_fmt = ".1%"  # show as percent
        vmin, vmax = 0, 1
    else:
        ct_plot = ct
        annot_fmt = "g"  # integer
        vmin, vmax = None, None
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(
        ct_plot,
        annot=ct_plot,  # annotate cells with the plotted values
        fmt=annot_fmt,
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        vmin=vmin,
        vmax=vmax,
        ax=ax,
    )
    main_title = title or (
        f"{rows} vs {cols}"
        + (
            " (Row %)" if normalize == "index"
            else " (Column %)" if normalize == "columns"
            else " (Overall %)" if normalize == "all"
            else " (Counts)"
        )
    )
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)
    fig.text(
        0.5, title_y - 0.045,
        f"rows={len(data):,} unique_{rows}={data[rows].nunique(dropna=True):,} unique_{cols}={data[cols].nunique(dropna=True):,}",
        ha="center", va="top", fontsize=11
    )
    ax.set_ylabel(rows)
    ax.set_xlabel(cols)
    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")
    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
    return ct
# Heatmap Product Sugar of Different Product Types
nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,
    figsize=(14, 8),
    cmap="viridis",
    title="Sugar Content Across Product Types (Counts)"
)
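For reference, the row-percentage view that the helper's `normalize="index"` option computes matches pandas' own crosstab normalization. A sketch on toy data (column names from the notebook, values invented):

```python
import pandas as pd

# Toy rows; normalize="index" makes each sugar-content row sum to 1 across product types
data = pd.DataFrame({
    "Product_Sugar_Content": ["Low Sugar", "Low Sugar", "Regular", "No Sugar"],
    "Product_Type": ["Snack Foods", "Dairy", "Snack Foods", "Household"],
})

ct = pd.crosstab(data["Product_Sugar_Content"], data["Product_Type"], normalize="index")
print(ct)
```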
Overall EDA Recap
Target Variable – Product_Store_Sales_Total
Distribution is approximately normal with mild right skew.
Mean ≈ Median → good for regression modeling.
Presence of high-end outliers (up to ~8000), but not extreme enough to discard blindly.
✔️ No transformation is strictly required, though log-transform could be tested.
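If the log-transform were tested, `np.log1p` is a common choice since it handles zeros gracefully and can be inverted with `np.expm1`. A sketch on toy skewed values:

```python
import numpy as np
import pandas as pd

# Toy right-skewed sales values (one high-end outlier, like the ~8000 seen in the data)
sales = pd.Series([800, 1500, 2200, 3100, 8000], dtype=float)

log_sales = np.log1p(sales)  # log(1 + x); invert later with np.expm1
print(sales.skew(), log_sales.skew())  # skew shrinks after the transform
```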
Correlation Heatmap (Pearson)
Strong positive relationships with sales:
Product_MRP → Sales (≈ 0.79) 🔥
Product_Weight → Sales (≈ 0.74) 🔥
Moderate relationship:
Weak / No relationship:
Product_Allocated_Area → Sales (≈ 0.00)
Store_Establishment_Year → Sales (≈ -0.19)
📌 Conclusion:
Product_MRP and Product_Weight are the most powerful numeric predictors.
Product Weight vs Sales
Clear linear upward trend
Heavier products → higher total sales
Product MRP vs Sales
Strong linear pattern
Higher MRP products consistently generate more revenue
Product Allocated Area vs Sales
No clear pattern
Scatter is diffuse → confirms low correlation
📌 Modeling takeaway:
Consider dropping or down-weighting Product_Allocated_Area unless interactions are used.
Product Type (Count)
Top categories by presence:
Fruits & Vegetables
Snack Foods
Frozen Foods
Dairy
Balanced enough → good categorical signal.
Product Sugar Content
Low Sugar dominates (55.7%)
Regular ≈ 25.7%
No Sugar ≈ 17.3%
reg ≈ 1.2% → ⚠️ likely a data quality issue (typo of “Regular”)
📌 Action:
Merge reg → Regular
Revenue by Product Type
Top revenue generators:
Fruits & Vegetables 🥇
Snack Foods
Dairy
Frozen Foods
Lowest:
Seafood
Breakfast
Starchy Foods
📌 Insight:
High-volume essentials outperform niche categories.
Revenue by Sugar Content
Low Sugar → ~55% of total revenue 🔥
Regular → ~26%
No Sugar → ~17%
reg negligible
📌 Business Insight:
Health-conscious products are not only popular but profitable.
Revenue by Store
OUT004 alone contributes ~50% of total revenue
OUT002 is significantly underperforming
📌 Store-level imbalance detected
Revenue by Store Size
Medium stores dominate revenue (~73%)
High > Small, but Medium is the sweet spot
Revenue by City Tier
Tier 2 cities generate the most revenue
Tier 1 < Tier 2
Tier 3 lowest
📌 Key insight:
Tier 2 cities + Medium stores = highest ROI combination
Revenue by Store Type
Supermarket Type2 dominates
Followed by Departmental Store
Food Mart is lowest
Store ID vs Sales
OUT003 and OUT001 have the higher medians (OUT004 leads on total revenue, not on per-product sales)
OUT002 has:
Lowest median
Store Size vs Sales
High size → highest median sales
Medium has higher total revenue due to volume
Small stores consistently underperform
Product Type vs Weight
Very similar median weights across categories
Slightly heavier:
Starchy Foods
Seafood
Others
📌 Weight is not category-driven, but still predictive of sales.
Sugar Content vs Weight
No strong weight difference across sugar categories
Weight is independent of sugar classification
Key patterns:
Low Sugar dominates Fruits & Vegetables, Snack Foods
No Sugar almost exclusive to Health & Hygiene & Household
Very clean segmentation → good categorical signal
📌 Sugar content × product type offers excellent feature-interaction potential.
⚠️ Fix these before modeling:
Merge reg → Regular
Consider encoding Store_Id carefully (target encoding recommended)
Product_Allocated_Area has low predictive power
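A minimal sketch of mean (target) encoding for Store_Id with smoothing. In practice the encoding should be fit on the training split only (or per cross-validation fold) to avoid leakage; the smoothing weight `m` is an assumed hyperparameter, and the data below are toy values:

```python
import pandas as pd

# Toy training data
data = pd.DataFrame({
    "Store_Id": ["OUT001", "OUT001", "OUT002", "OUT004", "OUT004", "OUT004"],
    "Product_Store_Sales_Total": [2000.0, 2200.0, 900.0, 3000.0, 3200.0, 3400.0],
})

global_mean = data["Product_Store_Sales_Total"].mean()
stats = data.groupby("Store_Id")["Product_Store_Sales_Total"].agg(["mean", "count"])

m = 5  # smoothing strength: small groups shrink toward the global mean
stats["encoded"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Map the smoothed per-store mean back onto each row as a numeric feature
data["Store_Id_te"] = data["Store_Id"].map(stats["encoded"])
print(data[["Store_Id", "Store_Id_te"]].drop_duplicates())
```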
def nice_store_producttype_heatmap(
    data,
    store_col="Store_Id",
    product_col="Product_Type",
    figsize=(14, 8),
    cmap="viridis",
    annot="auto",  # "auto", True, False
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")
    # Crosstab of item counts: Store_Id vs Product_Type
    ct = pd.crosstab(data[store_col], data[product_col], dropna=False)
    # Auto annotation decision (avoid clutter for large matrices)
    if annot == "auto":
        annot = (ct.shape[0] <= 15) and (ct.shape[1] <= 12)
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(
        ct,
        annot=annot,
        fmt="g",
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        ax=ax,
    )
    fig.suptitle(
        title or f"Items Sold: {product_col} by {store_col}",
        fontsize=15,
        fontweight="bold",
        y=title_y,
    )
    fig.text(
        0.5,
        title_y - 0.045,
        f"rows={len(data):,} stores={ct.shape[0]:,} product_types={ct.shape[1]:,}",
        ha="center",
        va="top",
        fontsize=11,
    )
    ax.set_ylabel("Stores")
    ax.set_xlabel("Product_Type")
    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")
    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
    return ct
nice_store_producttype_heatmap(data, annot=True)
🛒 Items Sold: Product Type × Store Id — Key Insights
OUT004 clearly outperforms all other stores across every product category.
Especially strong in:
Fruits & Vegetables (700)
Snack Foods (615)
Frozen Foods (446)
Dairy (397)
Household (399)
📌 Interpretation:
OUT004 is the primary revenue and volume driver, likely due to:
Larger store size
Better location (Tier 2 + high footfall)
Broader assortment and inventory depth
Across all stores, the most sold product types are:
| Product Type | Observation |
|---|---|
| Fruits & Vegetables | Top-selling category in every store |
| Snack Foods | Second-highest volume consistently |
| Household | Strong and stable across stores |
| Frozen Foods & Dairy | Medium-to-high consistent demand |
📌 Insight:
These are essential, high-frequency purchase categories, driving store traffic and repeat purchases.
Categories with consistently low sales across all stores:
Seafood
Breakfast
Breads
Others
📌 Interpretation:
Low demand appears structural, not store-specific — suggesting:
Limited consumer preference
Possibly niche or premium products
Potential candidates for assortment rationalization
| Store | Pattern |
|---|---|
| OUT004 | High-volume, diversified sales across all categories |
| OUT001 & OUT003 | Mid-performing, similar patterns |
| OUT002 | Lowest sales across most categories |
📌 Interpretation:
OUT002 may suffer from:
Smaller store size
Less optimal location
Lower customer footfall
This heatmap is consistent with the earlier findings:
OUT004 → highest revenue
Fruits & Vegetables + Snack Foods → top revenue contributors
Volume-driven categories = revenue drivers
This confirms that revenue is volume-led, not just price-led.
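The volume-led claim can be checked directly by decomposing each store's revenue into item count and average revenue per item. A minimal hedged helper is sketched below; the column names follow the dataset, but the demo frame is made up.

```python
import pandas as pd

def store_revenue_decomposition(df, store_col="Store_Id",
                                rev_col="Product_Store_Sales_Total"):
    """Per-store revenue split into volume (rows sold) and average ticket."""
    g = df.groupby(store_col)[rev_col]
    out = pd.DataFrame(
        {
            "total_revenue": g.sum(),
            "n_items": g.count(),
            "avg_revenue_per_item": g.mean(),
        }
    )
    return out.sort_values("total_revenue", ascending=False)

# Made-up demo frame
demo = pd.DataFrame(
    {
        "Store_Id": ["OUT004", "OUT004", "OUT004", "OUT002"],
        "Product_Store_Sales_Total": [3000.0, 3200.0, 3400.0, 1700.0],
    }
)
breakdown = store_revenue_decomposition(demo)
```

If a store tops `total_revenue` mainly through `n_items` rather than `avg_revenue_per_item`, its revenue is volume-led, as argued above for OUT004.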
🎯 Business Implications
Allocate more shelf space and inventory to:
Fruits & Vegetables
Snack Foods
Household essentials
Store strategy:
Replicate OUT004’s layout, assortment, and promotions in other stores
Investigate why OUT002 underperforms
Category optimization:
def nice_boxplot_price_trend(
data,
x_cat,
y_num,
figsize=(14, 8),
title=None,
rotate=60,
color="#8b5cf6",
title_y=0.98,
):
sns.set_theme(style="whitegrid", context="notebook")
fig, ax = plt.subplots(figsize=figsize)
sns.boxplot(
data=data,
x=x_cat,
y=y_num,
ax=ax,
color=color,
showmeans=True,
meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
medianprops=dict(color="black", linewidth=2),
whiskerprops=dict(linewidth=1.2),
boxprops=dict(linewidth=1.2),
)
fig.suptitle(
title or f"Boxplot - {x_cat} vs {y_num}",
fontsize=15,
fontweight="bold",
y=title_y,
)
n = int(data[[x_cat, y_num]].dropna().shape[0])
groups = int(data[x_cat].nunique(dropna=True))
fig.text(
0.5,
title_y - 0.045,
f"points={n:,} product_types={groups:,}",
ha="center",
va="top",
fontsize=11,
)
ax.set_xlabel("Product_Type" if x_cat == "Product_Type" else x_cat)
ax.set_ylabel(f"{y_num} (of each product)")
ax.tick_params(axis="x", rotation=rotate)
for t in ax.get_xticklabels():
t.set_horizontalalignment("right" if rotate else "center")
sns.despine(ax=ax)
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
# Boxplot Product Type Vs Product MRP
plt.figure(figsize=[14, 8])
sns.boxplot(
data=data,
x="Product_Type",
y="Product_MRP",
hue="Product_Type"
)
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False) # hide redundant legend
plt.show()
Most product types have comparable median MRPs, clustered roughly in the same range, indicating no extreme base-price differences across categories.
Every category shows a broad interquartile range (IQR), meaning products within the same type span multiple price points (budget to premium).
Almost all product types contain upper-end outliers (₹230–₹270 range), suggesting premium SKUs exist in nearly every category.
Categories such as Starchy Foods, Others, Fruits & Vegetables, and Meat exhibit wider upper tails, indicating more high-MRP products compared to others.
Minimum MRPs across most product types are fairly similar, showing price floors do not differ much by category.
Since medians and IQRs overlap heavily, Product_Type alone does not strongly explain MRP variation—price variation is largely within categories rather than between them.
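The claim that price variation is mostly within categories can be quantified with an eta-squared style ratio: the variance of per-group means divided by total variance. This is a hedged sketch with a toy frame, not the notebook's actual computation.

```python
import pandas as pd

def variance_explained(df, cat_col, num_col):
    """Fraction of num_col variance explained by cat_col (0 = none, 1 = all).

    Computed as the population variance of the per-group means divided
    by the population variance of the column itself."""
    total_var = df[num_col].var(ddof=0)
    group_means = df.groupby(cat_col)[num_col].transform("mean")
    return group_means.var(ddof=0) / total_var

# Toy frame where the category fully determines the value
demo = pd.DataFrame({"Product_Type": ["A", "A", "B", "B"],
                     "Product_MRP": [100.0, 100.0, 200.0, 200.0]})
ratio = variance_explained(demo, "Product_Type", "Product_MRP")  # → 1.0
```

Run on the real data, a low ratio for `Product_Type` vs `Product_MRP` would confirm that price variation is largely within categories rather than between them.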
def nice_boxplot_store_mrp(
data,
x_cat="Store_Id",
y_num="Product_MRP",
figsize=(14, 8),
title="Boxplot - Store_Id vs Product_MRP",
rotate=90,
color="#8b5cf6",
title_y=0.98,
top_n=None, # optional: show only top N stores by count to reduce clutter
):
sns.set_theme(style="whitegrid", context="notebook")
df = data[[x_cat, y_num]].dropna()
# Optional: reduce clutter by keeping only top N stores by number of items
if top_n is not None:
top_stores = df[x_cat].value_counts().head(top_n).index
df = df[df[x_cat].isin(top_stores)]
fig, ax = plt.subplots(figsize=figsize)
sns.boxplot(
data=df,
x=x_cat,
y=y_num,
ax=ax,
color=color,
showmeans=True,
meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
medianprops=dict(color="black", linewidth=2),
whiskerprops=dict(linewidth=1.2),
boxprops=dict(linewidth=1.2),
)
fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)
n = int(df.shape[0])
stores = int(df[x_cat].nunique())
fig.text(
0.5, title_y - 0.045,
f"points={n:,} stores={stores:,}",
ha="center", va="top", fontsize=11
)
ax.set_xlabel("Stores")
ax.set_ylabel("Product_MRP (of each product)")
ax.tick_params(axis="x", rotation=rotate)
for t in ax.get_xticklabels():
t.set_horizontalalignment("right")
sns.despine(ax=ax)
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
# Product MRP with Different Stores
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Store_Id", y="Product_MRP", hue="Store_Id")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_MRP")
plt.xlabel("Stores")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False) # hide the huge redundant legend
plt.show()
OUT003 has the highest median Product_MRP, indicating that this store generally sells higher-priced products compared to others.
OUT001 also shows a relatively high median MRP, but slightly lower than OUT003.
OUT004 has a moderate median MRP, positioned below OUT001 and OUT003 but above OUT002.
OUT002 clearly has the lowest median Product_MRP, suggesting it focuses more on lower-priced products.
Price variability (IQR) is widest for OUT003, meaning it carries a broader range of product prices.
OUT002 shows a narrower IQR, indicating more consistent (and generally lower) pricing.
OUT001 and OUT004 exhibit moderate variability in MRPs.
High-price outliers are most prominent in OUT003, reinforcing the presence of premium-priced products.
OUT002 also shows outliers, but these are mostly upper outliers, standing out against its generally low-price distribution.
All stores contain some low-price outliers, but they are more noticeable in OUT002 and OUT004.
Overall insight:
Product pricing strategy differs significantly by store. OUT003 and OUT001 cater more toward higher-priced items, while OUT002 appears to be a value-oriented store with lower and more tightly clustered MRPs.
OUT001
data.loc[data["Store_Id"] == "OUT001"].describe(include="all").T
data.loc[data["Store_Id"] == "OUT001", "Product_Store_Sales_Total"].sum()
OUT001 has generated a total revenue of 6,223,113 from the sales of goods.
def store_revenue_breakdown_by_product(
data,
store_id,
product_col="Product_Type",
revenue_col="Product_Store_Sales_Total",
figsize=(14, 7),
rotate=60,
color="#8b5cf6",
top_n=None, # optional: show only top N product types
title_y=0.98,
):
sns.set_theme(style="whitegrid", context="notebook")
df_store = data.loc[data["Store_Id"] == store_id, [product_col, revenue_col]].dropna()
df_rev = (
df_store.groupby(product_col, as_index=False)[revenue_col]
.sum()
.sort_values(revenue_col, ascending=False)
)
if top_n is not None:
df_rev = df_rev.head(top_n)
total_rev = df_rev[revenue_col].sum()
n_rows = len(df_store)
n_types = df_rev[product_col].nunique()
fig, ax = plt.subplots(figsize=figsize)
sns.barplot(
data=df_rev,
x=product_col,
y=revenue_col,
ax=ax,
color=color,
edgecolor="white",
linewidth=1,
)
# Title + subtitle (same style)
fig.suptitle(f"{store_id} Revenue by {product_col}", fontsize=15, fontweight="bold", y=title_y)
fig.text(
0.5,
title_y - 0.045,
f"total_revenue={total_rev:,.0f} items_rows={n_rows:,} product_types={n_types:,}",
ha="center",
va="top",
fontsize=11,
)
ax.set_xlabel(product_col)
ax.set_ylabel(revenue_col)
ax.tick_params(axis="x", rotation=rotate)
for t in ax.get_xticklabels():
t.set_horizontalalignment("right" if rotate else "center")
# Value labels
ymax = df_rev[revenue_col].max() if not df_rev.empty else 0
for p in ax.patches:
h = p.get_height()
ax.annotate(
f"{h:,.0f}",
(p.get_x() + p.get_width() / 2, h),
ha="center",
va="bottom",
fontsize=10,
xytext=(0, 4),
textcoords="offset points",
)
ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)
sns.despine(ax=ax)
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
return df_rev
df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT001")
OUT002
data.loc[data["Store_Id"] == "OUT002"].describe(include="all").T
data.loc[data["Store_Id"] == "OUT002", "Product_Store_Sales_Total"].sum()
OUT002 has generated a total revenue of 2,030,910 from the sales of goods.
df_OUT002 = store_revenue_breakdown_by_product(data, store_id="OUT002")
OUT003
data.loc[data["Store_Id"] == "OUT003"].describe(include="all").T
data.loc[data["Store_Id"] == "OUT003", "Product_Store_Sales_Total"].sum()
df_OUT003 = store_revenue_breakdown_by_product(data, store_id="OUT003")
OUT004
data.loc[data["Store_Id"] == "OUT004"].describe(include="all").T
data.loc[data["Store_Id"] == "OUT004", "Product_Store_Sales_Total"].sum()
df_OUT004 = store_revenue_breakdown_by_product(data, store_id="OUT004")
🏬 Store-wise Detailed Observations
🔵 OUT001 — High-priced, Stable Performer
Store Profile
Store Type: Supermarket Type1
Store Size: High
City Tier: Tier 2
Establishment Year: 1987 (oldest store)
Product MRP Behavior
Mean MRP: ~160.5
Median MRP: ~168.3
Pricing is moderately high and consistent.
Boxplot shows:
Tight IQR → controlled pricing strategy
Few extreme outliers → limited ultra-premium SKUs
Indicates price stability over aggressive discounting.
Sales Performance
Total Revenue: ~6.22M
Avg Sales per product: ~3924
Sales spread: Moderate (std ≈ 904)
Product Mix & Revenue Drivers
Top revenue categories:
Snack Foods
Fruits & Vegetables
Dairy
Balanced contribution across categories → diversified demand
No over-dependence on a single category.
Interpretation
Mature store with:
Reliable pricing
Balanced category mix
Steady sales
Performs well without extreme pricing or promotional volatility.
🔴 OUT002 — Low-price, Low-volume Store
Store Profile
Store Type: Food Mart
Store Size: Small
City Tier: Tier 3
Establishment Year: 1998
Product MRP Behavior
Mean MRP: ~107.1
Median MRP: ~104.7 (lowest among all stores)
Boxplot characteristics:
Lowest MRP range
Many low-end outliers → economy pricing
Minimal premium pricing presence.
Sales Performance
Total Revenue: ~2.03M (lowest)
Avg Sales per product: ~1763
Low variance → consistently low ticket sizes.
Product Mix & Revenue Drivers
Top categories:
Fruits & Vegetables
Snack Foods
Weak performance in:
Meat
Household
Premium categories
Interpretation
Store is:
Highly price-sensitive
Volume-constrained
Likely serving budget-conscious customers
Limited upselling potential due to low MRP ceiling.
🟢 OUT003 — Premium Pricing, High Value per Product
Store Profile
Store Type: Departmental Store
Store Size: Medium
City Tier: Tier 1
Establishment Year: 1999
Product MRP Behavior
Mean MRP: ~181.4 (highest)
Median MRP: ~179.7
Boxplot shows:
Wide IQR
Many high-end outliers (up to ~266)
Strong presence of premium SKUs.
Sales Performance
Total Revenue: ~6.67M
Avg Sales per product: ~4947 (highest)
Highest maximum sales (~8000)
Product Mix & Revenue Drivers
Strong categories:
Snack Foods
Fruits & Vegetables
Dairy
Premium categories perform consistently well.
Interpretation
Best store for:
High-margin products
Premium assortment
Customers show lower price sensitivity
Ideal candidate for premium expansion & exclusive SKUs.
🟣 OUT004 — High-volume, Revenue Powerhouse
Store Profile
Store Type: Supermarket Type2
Store Size: Medium
City Tier: Tier 2
Establishment Year: 2009 (newest)
Product MRP Behavior
Mean MRP: ~142.4
Median MRP: ~142.8
Boxplot indicates:
Moderate pricing
Controlled spread
Few extreme outliers
Sales Performance
Total Revenue: ~15.43M (highest by far)
Avg Sales per product: ~3299
Sales are driven by volume, not high price.
Product Mix & Revenue Drivers
Dominant categories:
Fruits & Vegetables
Snack Foods
Frozen Foods
Strong across all categories, not niche-dependent.
Interpretation
Slightly lower MRP than OUT003, but massive volume compensates.
Best example of volume-led revenue strategy.
🔎 Cross-Store Comparative Insights
| Dimension | OUT001 | OUT002 | OUT003 | OUT004 |
|---|---|---|---|---|
| Avg MRP | Medium-High | Low | Highest | Medium |
| Revenue | High | Lowest | High | Highest |
| Pricing Strategy | Stable | Budget | Premium | Balanced |
| Volume | Medium | Low | Medium | Very High |
| Best Use Case | Stability | Price-led | Margin-led | Scale-led |
📌 Final Strategic Takeaways
OUT003 → maximize premium & margins
OUT004 → expand assortment & inventory (volume monster)
OUT001 → maintain consistency, low risk
OUT002 → needs either volume growth or pricing rethink
Let's find out the revenue generated by the stores from each of the product types.
df1 = data.groupby(["Product_Type", "Store_Id"], as_index=False)[
"Product_Store_Sales_Total"
].sum()
df1
OUT001
Revenue is well balanced across categories, with no extreme dependency on a single product type.
Snack Foods, Fruits & Vegetables, Dairy, Frozen Foods are the top contributors.
Breakfast and Seafood generate the least revenue, indicating low demand.
Performs moderately across both food and non-food (Household, Health & Hygiene) categories.
OUT002
Overall lowest total revenue among all stores.
Strongest categories are Fruits & Vegetables and Snack Foods, but at much lower scale.
Breakfast, Seafood, Starchy Foods, Others perform very weakly.
Indicates a small-format / low-footfall store with limited high-value sales.
OUT003
High-revenue store with strong performance across most categories.
Snack Foods and Fruits & Vegetables dominate sales.
Dairy, Frozen Foods, Household, Meat also contribute significantly.
Weakest categories remain Breakfast and Seafood, consistent with other stores.
OUT004
Top-performing store by a large margin.
Extremely strong in Fruits & Vegetables, Snack Foods, Frozen Foods, Dairy, Household.
Even traditionally low categories (Breakfast, Seafood, Others) perform better here.
Indicates large store size, high footfall, and wide product acceptance.
Cross-store patterns
Snack Foods and Fruits & Vegetables are the top revenue drivers across all stores.
Breakfast and Seafood are consistently the lowest-performing categories.
Revenue scale increases clearly from OUT002 → OUT001 → OUT003 → OUT004.
High-performing stores show diversified revenue, not dependence on a single category.
Let's find out the revenue generated by the stores from products having different levels of sugar content.
df2 = data.groupby(["Product_Sugar_Content", "Store_Id"], as_index=False)[
"Product_Store_Sales_Total"
].sum()
df2
# Replacing reg with Regular (assign back instead of chained inplace, which is deprecated in pandas 2.x)
data["Product_Sugar_Content"] = data["Product_Sugar_Content"].replace("reg", "Regular")
data.Product_Sugar_Content.value_counts()
# Extracting the first two characters from the Product_Id column and storing them in another column
data["Product_Id_char"] = data["Product_Id"].str[:2]
data.head()
data["Product_Id_char"].unique()
data.loc[data.Product_Id_char == "FD", "Product_Type"].unique()
data.loc[data.Product_Id_char == "DR", "Product_Type"].unique()
data.loc[data.Product_Id_char == "NC", "Product_Type"].unique()
🔹 Product_Sugar_Content (After Cleaning)
The typo reg was successfully standardized to Regular, removing category ambiguity.
Low Sugar products dominate the dataset, followed by Regular, then No Sugar.
This suggests customer demand (and assortment strategy) is skewed toward low-sugar options.
🔹 Product_ID Prefix Analysis (Product_Id_char)
I identified three clear product families using the first two characters:
FD (food products) covers most product categories:
Frozen Foods, Dairy, Canned, Baking Goods, Snack Foods
Meat, Fruits & Vegetables, Breads, Breakfast
Starchy Foods, Seafood
Indicates FD is the core retail assortment, driving volume and revenue.
DR (drinks) is exclusively mapped to:
Hard Drinks
Soft Drinks
Shows a clean and well-segmented beverage classification.
NC (non-consumables) is limited to:
Health and Hygiene
Household
Others
These are non-food essentials, likely lower in frequency but important for basket value.
🔹 Structural Insights
Product IDs are not random — they encode category intelligence.
This structure can be very useful for feature engineering, such as:
Group-level demand modeling
Category-specific pricing or sales behavior
The dataset shows strong internal consistency between Product_ID patterns and Product_Type.
🔹 Modeling & EDA Implications
Product_Id_char is a high-value categorical feature for:
Sales prediction
Customer demand segmentation
Sugar content is imbalanced, so stratification or weighting may be needed in models.
FD products will likely dominate predictions, while DR and NC may behave differently.
# Outlet Age
data["Store_Age_Years"] = 2025 - data.Store_Establishment_Year
perishables = [
"Dairy",
"Meat",
"Fruits and Vegetables",
"Breakfast",
"Breads",
"Seafood",
]
def change(x):
if x in perishables:
return "Perishables"
else:
return "Non Perishables"
data['Product_Type_Category'] = data['Product_Type'].apply(change)
data.head()
🔹 Store_Age_Years
Store age ranges roughly from ~16 to ~38 years, indicating a mix of newer and very mature stores.
Older stores (≈35–38 years) are mostly OUT001 and OUT003, suggesting:
Long-standing market presence
Likely stable customer base and mature operations
Newer stores (≈16–27 years), such as OUT004, still show strong sales, indicating that age alone does not limit performance.
Insight: Store age may influence customer trust and assortment depth, but store size, location, and type likely play a stronger role in sales.
🔹 Product_Type_Category (Perishables vs Non-Perishables)
Perishables include: Dairy, Meat, Fruits & Vegetables, Breakfast, Breads, Seafood.
Non-Perishables (all remaining product types) dominate the dataset.
Observation:
The majority of rows fall under Non-Perishables, suggesting:
Higher assortment depth
Better shelf life and inventory stability
Perishables are fewer but typically high-frequency purchase items.
🔹 Combined Insights
Older stores + perishables likely require stronger cold-chain and inventory management.
Newer or smaller stores may rely more on non-perishables due to:
Lower spoilage risk
Easier logistics
This binary category can help explain sales variance, especially when combined with:
Store_Size
Store_Type
Store_Location_City_Type
🔹 Modeling Value
Store_Age_Years is a strong continuous feature for regression.
Product_Type_Category (binary) is:
Easy to encode
Highly interpretable
Useful for capturing operational differences in sales behavior
def nice_outlier_boxgrid_2col(
data,
exclude=("Store_Establishment_Year", "Store_Age_Years"),
cols=None,
whis=1.5,
ncols=2, # ✅ two plots per row
figsize=None,
color="#8b5cf6",
title="Outlier Check (Boxplots)",
title_y=0.98,
):
sns.set_theme(style="whitegrid", context="notebook")
# Select numeric columns
if cols is None:
cols = data.select_dtypes(include=np.number).columns.tolist()
# Exclude if present
cols = [c for c in cols if c not in set(exclude)]
if not cols:
raise ValueError("No numeric columns left to plot after exclusions.")
n = len(cols)
nrows = math.ceil(n / ncols)
# Auto figure size tuned for 2-col layout
if figsize is None:
figsize = (14, 3.4 * nrows)
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
axes = np.array(axes).reshape(-1)
# Title + subtitle (same look as your previous charts)
fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)
fig.text(
0.5,
title_y - 0.045,
f"numeric_features={n:,} whis={whis} layout={ncols} per row",
ha="center",
va="top",
fontsize=11,
)
for i, col in enumerate(cols):
ax = axes[i]
x = data[col].dropna()
sns.boxplot(
x=x,
ax=ax,
color=color,
whis=whis,
showmeans=True,
meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
medianprops=dict(color="black", linewidth=1.8),
whiskerprops=dict(linewidth=1.2),
boxprops=dict(linewidth=1.2),
)
ax.set_title(col, fontsize=12, pad=10)
ax.set_xlabel("")
ax.set_ylabel("")
ax.grid(True, axis="x", alpha=0.25) # subtle guidance
sns.despine(ax=ax, left=True, bottom=True)
# Hide unused axes
for j in range(n, len(axes)):
axes[j].axis("off")
fig.tight_layout(rect=[0, 0, 1, 0.92])
plt.show()
nice_outlier_boxgrid_2col(data)
🔹 Product_Weight
Most product weights are concentrated between ~10 and ~15 units.
There are outliers on both ends:
Very light products (< ~7)
Very heavy products (> ~19–22)
Distribution is fairly symmetric, suggesting natural variation by product type rather than data errors.
Takeaway: Outliers look realistic (different packaging sizes), not anomalies.
🔹 Product_Allocated_Area
Majority of values lie in the 0.03–0.10 range.
Strong right-skew with many high-end outliers (up to ~0.30).
Indicates some products require significantly more shelf space.
Takeaway: High-end outliers likely represent bulky or premium-display products.
🔹 Product_MRP
Core price range is ~120 to ~170.
Clear upper-end outliers beyond ₹220–₹270.
A few low-priced outliers (< ~70) also exist.
Takeaway: Price outliers reflect premium and budget product segments, not noise.
🔹 Product_Store_Sales_Total
Highly right-skewed distribution.
Most sales totals fall between ~2500 and ~4500.
Several very high outliers (up to ~8000), indicating top-performing products.
A few low-end outliers, likely slow-moving or niche items.
Takeaway: Sales outliers are business-critical (star vs low-performing products).
🔹 Overall Conclusion
Outliers are meaningful and business-driven, not data quality issues.
Removing them could erase important patterns.
Better strategies:
Log-transform Product_Store_Sales_Total
Use robust models (tree-based, quantile-based)
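The log-transform strategy above can be sketched with scikit-learn's `TransformedTargetRegressor`, which trains on `log1p(y)` and maps predictions back to the original scale. This is illustrative only, fit on synthetic right-skewed data rather than the project pipeline.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

log_rf = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=1),
    func=np.log1p,          # model is trained on log(1 + y)
    inverse_func=np.expm1,  # predictions are mapped back to the sales scale
)

# Synthetic right-skewed target, standing in for Product_Store_Sales_Total
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(2 + X[:, 0])  # positive, heavily right-skewed
log_rf.fit(X, y)
preds = log_rf.predict(X[:5])
```

The transform compresses the long right tail, so the few very large sales totals no longer dominate the squared-error loss during training.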
data.head()
Let's remove the columns that are not required.
data = data.drop(["Product_Id", "Product_Type", "Store_Id", "Store_Establishment_Year"], axis=1)
data.shape
data.head()
data.describe(include='all').T
# Separating features and the target column
X = data.drop("Product_Store_Sales_Total", axis=1)
y = data["Product_Store_Sales_Total"]
print(X.shape)
print(y.shape)
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1, shuffle=True
)
X_train.shape, X_test.shape
Key Observations from the Split
✅ Train–test split is correct:
Train: 6,134 rows
Test: 2,629 rows
✅ Target separation is clean (Product_Store_Sales_Total)
✅ Reproducible shuffled split (shuffle=True with a fixed random_state)
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features
# Create a preprocessing pipeline for the categorical features;
# numeric columns are passed through unchanged instead of being dropped
preprocessor = make_column_transformer(
    (Pipeline([("encoder", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
    remainder="passthrough",
)
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mean_absolute_percentage_error(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
The following ML models are built and compared below:
dtree = DecisionTreeRegressor(random_state=1)
dtree = make_pipeline(preprocessor,dtree)
dtree.fit(X_train, y_train)
dtree_model_train_perf = model_performance_regression(dtree, X_train, y_train)
dtree_model_train_perf
dtree_model_test_perf = model_performance_regression(dtree, X_test, y_test)
dtree_model_test_perf
The pipeline is correctly structured, combining preprocessing (One-Hot Encoding) and the Decision Tree regressor, ensuring consistent data handling during training and testing.
Training and test performance are fairly close, indicating that the model is not severely overfitting.
R² score (~0.68 on train and ~0.67 on test) suggests the model explains around two-thirds of the variance in product sales, which is reasonable for a baseline model.
Adjusted R² is very close to R², implying that the number of predictors introduced by one-hot encoding is not excessively inflating model performance.
RMSE increases slightly on the test set, showing a small generalization error but acceptable stability.
MAE values are consistent across train and test, indicating stable average prediction errors.
MAPE (~16.5% train, ~18.7% test) shows moderate relative error, meaning predictions are reasonably close in percentage terms.
Unpruned Decision Tree captures non-linear relationships, but may still be sensitive to noise and outliers in sales data.
Model serves well as a baseline, but performance can likely be improved with ensemble methods (Random Forest, Gradient Boosting).
bagging_regressor = BaggingRegressor(random_state=1)
bagging_regressor = make_pipeline(preprocessor,bagging_regressor)
bagging_regressor.fit(X_train, y_train)
bagging_regressor_model_train_perf = model_performance_regression(bagging_regressor, X_train, y_train)
bagging_regressor_model_train_perf
bagging_regressor_model_test_perf = model_performance_regression(bagging_regressor, X_test, y_test)
bagging_regressor_model_test_perf
Pipeline integration is consistent, combining preprocessing (One-Hot Encoding) with the Bagging Regressor, ensuring uniform feature handling.
Training performance is similar to the Decision Tree, with an R² of ~0.68, indicating comparable explanatory power.
Test R² (~0.67) closely matches training R², showing good generalization and reduced overfitting compared to a single tree.
RMSE and MAE values are almost identical on train and test sets, highlighting stability in predictions.
MAPE (~16.6% train, ~18.7% test) suggests reasonable percentage-level prediction accuracy, similar to the Decision Tree.
Bagging reduces variance, but the improvement over a single Decision Tree is marginal in this setup.
Model performance indicates limited gains without tuning, likely because default base estimators are already simple.
Useful as a variance-reduction baseline, but stronger ensemble methods may yield better improvements.
rf_estimator = RandomForestRegressor(random_state=1)
rf_estimator = make_pipeline(preprocessor,rf_estimator)
rf_estimator.fit(X_train, y_train)
rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train, y_train)
rf_estimator_model_train_perf
rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test, y_test)
rf_estimator_model_test_perf
Seamless integration with the preprocessing pipeline, ensuring consistent encoding of categorical variables before modeling.
Training performance (R² ≈ 0.685) is almost identical to Decision Tree and Bagging models, indicating similar explanatory power.
Test performance (R² ≈ 0.669) closely matches training performance, showing good generalization and low overfitting.
RMSE (~ 616) and MAE (~ 485) on the test set are nearly the same as Bagging and Decision Tree, suggesting limited incremental improvement.
MAPE (~18.7% on test) remains consistent across all tree-based models tried so far.
Random Forest’s variance reduction is evident, but its benefit is muted, likely due to:
Limited signal in the available features
Default hyperparameters (e.g., number of trees, depth)
Model stability is strong, as seen from minimal train–test performance gap.
Better potential than Bagging with tuning, especially by adjusting n_estimators, max_depth, and min_samples_leaf.
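The tuning suggested above can be sketched with `RandomizedSearchCV` over the parameters mentioned (`n_estimators`, `max_depth`, `min_samples_leaf`). The grid values are illustrative, and the demo fits on synthetic data rather than the SuperKart pipeline; inside the notebook's pipeline the keys would be prefixed with `randomforestregressor__`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 8, 12],
    "min_samples_leaf": [1, 3, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions=param_dist,
    n_iter=5,        # sample 5 random parameter combinations
    cv=3,
    scoring="r2",
    random_state=1,
    n_jobs=-1,
)

# Synthetic regression data standing in for the preprocessed features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=150)
search.fit(X_demo, y_demo)
```

Randomized search samples a fixed number of combinations, so it scales better than an exhaustive grid when each Random Forest fit is expensive.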
ab_regressor = AdaBoostRegressor(random_state=1)
ab_regressor = make_pipeline(preprocessor,ab_regressor)
ab_regressor.fit(X_train, y_train)
ab_regressor_model_train_perf = model_performance_regression(ab_regressor, X_train, y_train)
ab_regressor_model_train_perf
ab_regressor_model_test_perf = model_performance_regression(ab_regressor, X_test, y_test)
ab_regressor_model_test_perf
Lower overall performance compared to other tree-based models (Decision Tree, Bagging, Random Forest).
Training R² ≈ 0.652, which is already weaker than previous models, indicating limited ability to capture underlying patterns.
Test R² drops further to ≈ 0.634, showing poorer generalization on unseen data.
Highest error metrics among all models tested so far:
Test RMSE ≈ 647
Test MAE ≈ 531
Test MAPE ≈ 19.4%
Larger train–test performance gap compared to Random Forest and Bagging, suggesting instability.
AdaBoost’s sensitivity to noisy data and outliers likely impacts performance, especially given:
Wide variance in Product_Store_Sales_Total
Presence of outliers observed earlier in numerical features
Default weak learners (shallow trees) may be underfitting the data.
Not well-suited in current configuration for this regression task without careful tuning.
Overall: AdaBoost underperforms relative to other ensemble methods and is the weakest model tested so far for predicting product store sales.
gb_estimator = GradientBoostingRegressor(random_state=1)
gb_estimator = make_pipeline(preprocessor,gb_estimator)
gb_estimator.fit(X_train, y_train)
gb_estimator_model_train_perf = model_performance_regression(gb_estimator, X_train, y_train)
gb_estimator_model_train_perf
gb_estimator_model_test_perf = model_performance_regression(gb_estimator, X_test, y_test)
gb_estimator_model_test_perf
Strong and stable performance, very similar to Decision Tree, Bagging, and Random Forest models.
Training performance:
R² ≈ 0.685
RMSE ≈ 597
Indicates the model captures a good amount of variance without overfitting.
Test performance remains consistent:
R² ≈ 0.669
RMSE ≈ 616
MAE ≈ 485
Minimal train–test gap, suggesting good generalization.
MAPE (~18.7%) is comparable to Bagging and Random Forest, and clearly better than AdaBoost.
Gradient Boosting handles non-linear relationships and feature interactions effectively, even with mixed numerical and one-hot encoded categorical features.
Performance improvement over AdaBoost shows that sequential boosting with gradient optimization is more robust to noise in this dataset.
Default hyperparameters already yield competitive results, indicating good baseline suitability.
Overall: Gradient Boosting is a strong candidate model, offering balanced bias–variance trade-off and performance on par with the best models tested so far.
xgb_estimator = XGBRegressor(random_state=1)
xgb_estimator = make_pipeline(preprocessor,xgb_estimator)
xgb_estimator.fit(X_train, y_train)
xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train, y_train)
xgb_estimator_model_train_perf
xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test, y_test)
xgb_estimator_model_test_perf
Performance is on par with tree-based ensemble models (Decision Tree, Bagging, Random Forest, Gradient Boosting).
Training results:
R² ≈ 0.685
RMSE ≈ 597
MAE ≈ 469
Test results remain stable:
R² ≈ 0.668
RMSE ≈ 616
MAE ≈ 485
Very small train–test gap, indicating no significant overfitting.
MAPE (~18.7%) is consistent with Random Forest and Gradient Boosting.
Despite XGBoost’s advanced regularization and boosting strategy, default hyperparameters do not significantly outperform other ensemble models in this setup.
Performance similarity suggests that the feature set and preprocessing pipeline are the main performance drivers, rather than the specific ensemble algorithm.
XGBoost’s strength (handling complex interactions and regularization) is likely underutilized without hyperparameter tuning.
Overall: XGBoost is a robust and reliable model, but in its current untuned form, it does not provide a clear advantage over Random Forest or Gradient Boosting for this dataset.
# Choose the type of regressor
dtree_tuned = DecisionTreeRegressor(random_state=1)
dtree_tuned = make_pipeline(preprocessor, dtree_tuned)
# Grid of parameters to choose from
parameters = {
"decisiontreeregressor__max_depth": list(np.arange(2, 6)),
"decisiontreeregressor__min_samples_leaf": [1, 3, 5],
"decisiontreeregressor__max_leaf_nodes": [2, 3, 5, 10, 15],
"decisiontreeregressor__min_impurity_decrease": [0.001, 0.01, 0.1],
}
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
dtree_tuned_model_train_perf = model_performance_regression(dtree_tuned, X_train, y_train)
dtree_tuned_model_train_perf
dtree_tuned_model_test_perf = model_performance_regression(dtree_tuned, X_test, y_test)
dtree_tuned_model_test_perf
The tuned Decision Tree performs significantly worse than the untuned version and all other ensemble models.
R-squared drops sharply (~0.38) on both train and test sets, indicating very low explanatory power.
RMSE and MAE increase substantially, showing higher prediction errors after tuning.
Similar train and test performance suggests no overfitting, but rather strong underfitting.
The selected best parameters (very shallow tree: max_depth = 2, max_leaf_nodes = 2) overly restrict model complexity.
The model fails to capture nonlinear relationships present in the data.
Hyperparameter tuning over-regularized the model, hurting performance instead of improving it.
This confirms that single decision trees are not suitable for this dataset compared to ensemble-based methods.
Conclusion:
The tuned Decision Tree is the worst-performing model and should be discarded in favor of ensemble models like Random Forest, Gradient Boosting, or XGBoost.
# Choose the type of regressor.
bagging_estimator_tuned = BaggingRegressor(random_state=1)
bagging_estimator_tuned = make_pipeline(preprocessor,bagging_estimator_tuned)
# Grid of parameters to choose from
parameters = {
    "baggingregressor__max_samples": [0.7, 0.8, 0.9, 1.0],
    "baggingregressor__max_features": [0.7, 0.8, 0.9, 1.0],
    "baggingregressor__n_estimators": [10, 30, 50, 100],
}
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
bagging_estimator_tuned_model_train_perf = model_performance_regression(bagging_estimator_tuned, X_train, y_train)
bagging_estimator_tuned_model_train_perf
bagging_estimator_tuned_model_test_perf = model_performance_regression(bagging_estimator_tuned, X_test, y_test)
bagging_estimator_tuned_model_test_perf
The tuned Bagging Regressor shows almost no improvement over the untuned version.
R-squared (~0.668 on test) remains virtually unchanged, indicating similar explanatory power.
RMSE and MAE on the test set are nearly identical to the base Bagging model, suggesting limited gains from tuning.
Train and test metrics are very close, indicating good generalization and low overfitting.
The best parameters selected (max_samples = 0.7, max_features = 0.7, n_estimators = 10) favor higher randomness and fewer trees, reducing variance but not boosting accuracy.
Increasing ensemble complexity (more estimators or features) did not significantly improve performance, implying the model has reached a performance plateau.
Bagging remains stable and robust, but tuning alone cannot extract additional predictive power from the current feature set.
Conclusion:
Hyperparameter tuning does not materially enhance the Bagging Regressor. While it generalizes well, its performance is capped, making it less competitive than more expressive ensemble methods like Gradient Boosting or XGBoost.
# Choose the type of regressor
rf_tuned = RandomForestRegressor(random_state=1)
rf_tuned = make_pipeline(preprocessor, rf_tuned)
# Grid of parameters to choose from
parameters = {
    "randomforestregressor__max_depth": [10, 20, 30, None],
    "randomforestregressor__max_features": ["sqrt", "log2", 1.0, 0.7],
    "randomforestregressor__n_estimators": [100, 200, 300],
}
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
rf_tuned_model_train_perf
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
rf_tuned_model_test_perf
The tuned Random Forest shows almost identical performance to the default Random Forest model.
Test R² (~0.6685) remains unchanged, indicating no meaningful improvement in explanatory power.
RMSE and MAE on the test set are nearly the same as the untuned model, confirming marginal gains from tuning.
The selected parameters (max_depth = 10, max_features = 'sqrt', n_estimators = 100) impose controlled tree complexity, helping prevent overfitting.
Train and test metrics are closely aligned, suggesting good generalization and stable learning.
Increasing the number of trees beyond 100 or allowing deeper trees did not improve performance, implying diminishing returns.
The model appears bias-limited rather than variance-limited, meaning feature richness matters more than hyperparameter tuning.
Conclusion:
Hyperparameter tuning does not significantly enhance Random Forest performance for this dataset. While the model is stable and reliable, further gains are more likely to come from feature engineering or advanced boosting methods rather than additional tuning.
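One way to probe where feature-engineering effort would pay off is permutation importance, which scores the raw input columns even when the pipeline one-hot encodes categoricals internally. A minimal sketch on synthetic data (the columns and target below are illustrative, not the SuperKart schema):

```python
# Minimal sketch (synthetic data): permutation importance on a fitted
# pipeline scores each *input* column, including categorical ones, and
# can hint at which features actually drive predictions.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "mrp": rng.uniform(50, 250, 400),
    "weight": rng.uniform(5, 20, 400),
    "store_size": rng.choice(["Small", "Medium", "High"], 400),
})
y = 4 * df["mrp"] + rng.normal(0, 20, 400)  # target driven mostly by mrp

pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["store_size"]),
    remainder="passthrough",
)
pipe = make_pipeline(pre, RandomForestRegressor(random_state=1)).fit(df, y)

# Importance is computed against the raw DataFrame, not the encoded matrix
result = permutation_importance(pipe, df, y, n_repeats=5, random_state=1)
for name, imp in zip(df.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

A large importance for one column and near-zero values elsewhere, as in this toy setup, is the "bias-limited" pattern described above: richer features, not more tuning, would move the needle.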
# Choose the type of regressor
ab_tuned = AdaBoostRegressor(random_state=1)
ab_tuned = make_pipeline(preprocessor, ab_tuned)
# Grid of parameters to choose from
parameters = {
    "adaboostregressor__n_estimators": [50, 100, 150, 200],
    "adaboostregressor__learning_rate": [0.01, 0.1, 0.5, 1.0],
}
# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
ab_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
ab_tuned_model_train_perf = model_performance_regression(ab_tuned, X_train, y_train)
ab_tuned_model_train_perf
ab_tuned_model_test_perf = model_performance_regression(ab_tuned, X_test, y_test)
ab_tuned_model_test_perf
Hyperparameter tuning selected a low learning rate (0.01) with fewer estimators (50), indicating that conservative boosting works better for this dataset.
Compared to the untuned AdaBoost model, the performance has improved slightly, especially in terms of RMSE and MAE.
R² (~0.684) is higher than the default AdaBoost model, showing better variance explanation after tuning.
Training and test metrics are very close, suggesting the model has high bias and limited flexibility, but also very stable generalization.
The low learning rate reduces the risk of overfitting, but also limits the model’s ability to capture complex nonlinear patterns.
Despite tuning, AdaBoost still underperforms compared to Random Forest, Gradient Boosting, and XGBoost.
The model benefits from tuning more than Decision Tree, but remains less competitive overall.
Conclusion:
Hyperparameter tuning improves AdaBoost modestly, but the model remains bias-constrained. For stronger predictive performance, tree-based ensemble methods with higher capacity (Random Forest, Gradient Boosting, XGBoost) are more suitable for this problem.
# Choose the type of regressor
gb_tuned = GradientBoostingRegressor(random_state=1)
gb_tuned = make_pipeline(preprocessor, gb_tuned)
# Grid of parameters to choose from
parameters = {
    "gradientboostingregressor__n_estimators": [100, 200, 300],
    "gradientboostingregressor__subsample": [0.8, 0.9, 1.0],
    "gradientboostingregressor__max_features": [0.8, 1.0, "sqrt", "log2"],
    "gradientboostingregressor__max_depth": [3, 4, 5],
}
# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
gb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
gb_tuned_model_train_perf = model_performance_regression(gb_tuned, X_train, y_train)
gb_tuned_model_train_perf
gb_tuned_model_test_perf = model_performance_regression(gb_tuned, X_test, y_test)
gb_tuned_model_test_perf
Hyperparameter tuning selected a shallow tree depth (max_depth = 3), indicating that simple base learners generalize better for this dataset.
The model prefers subsampling (subsample = 0.8) and feature subsampling (max_features = 0.8), which helps reduce overfitting and improve robustness.
The chosen number of estimators (100) balances learning capacity and stability without excessive complexity.
Training performance remains strong (R² ≈ 0.685), similar to the untuned Gradient Boosting model.
Test performance (R² ≈ 0.669, RMSE ≈ 616) is very close to training performance, showing good generalization.
Compared to AdaBoost, the tuned Gradient Boosting model shows lower error and higher R², confirming its superior learning capability.
Hyperparameter tuning results in marginal but consistent improvements, suggesting the base model was already well-specified.
Performance is comparable to Random Forest and XGBoost, making it one of the top-performing models in this study.
Conclusion:
The tuned Gradient Boosting Regressor achieves a strong bias–variance balance, with stable generalization and competitive accuracy. It is a reliable final model choice, especially when interpretability and controlled complexity are important.
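As an alternative to grid-searching `n_estimators`, Gradient Boosting supports validation-based early stopping, letting the model pick its own number of boosting rounds. A minimal sketch on synthetic data (not the SuperKart set):

```python
# Minimal sketch (synthetic data): GradientBoostingRegressor can stop
# boosting once a held-out validation loss stops improving, instead of
# treating n_estimators as a grid-search hyperparameter.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=600, n_features=10, noise=15.0, random_state=1)

gb = GradientBoostingRegressor(
    n_estimators=500,          # upper bound on boosting rounds
    validation_fraction=0.1,   # fraction held out to monitor the loss
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=1,
)
gb.fit(X, y)
print(f"Boosting stopped after {gb.n_estimators_} of 500 rounds")
```

This tends to be cheaper than grid search over `n_estimators` and adapts the number of rounds to the data rather than to a fixed candidate list.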
# Choose the type of regressor
xgb_tuned = XGBRegressor(random_state=1)
xgb_tuned = make_pipeline(preprocessor, xgb_tuned)
# Grid of parameters to choose from
parameters = {
    "xgbregressor__n_estimators": [100, 200],
    "xgbregressor__subsample": [0.7, 0.8, 1.0],
    "xgbregressor__gamma": [0, 1, 5],
    "xgbregressor__colsample_bytree": [0.7, 0.8, 1.0],
    "xgbregressor__colsample_bylevel": [0.7, 0.8, 1.0],
    "xgbregressor__max_depth": [3, 5, 7],
}
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the regressor to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
print("Best Parameters Found:")
print(grid_obj.best_params_)
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
xgb_tuned_model_train_perf
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
xgb_tuned_model_test_perf
Hyperparameter tuning selected a shallow tree depth (max_depth = 3), reinforcing that simpler trees generalize better for this dataset.
The model favors aggressive subsampling (subsample = 0.7, colsample_bytree = 0.7, colsample_bylevel = 0.7), which helps control overfitting and improves model stability.
A gamma value of 0 indicates that allowing splits without additional loss penalty works well, suggesting the data benefits from flexible splitting.
The optimal number of estimators (100) provides sufficient boosting rounds without overfitting.
Training performance is strong (R² ≈ 0.685, RMSE ≈ 597), comparable to untuned and other tuned ensemble models.
Test performance (R² ≈ 0.668, RMSE ≈ 616) is very close to training metrics, indicating good generalization.
Hyperparameter tuning yields only marginal improvements, suggesting the original XGBoost model was already near optimal.
Compared to AdaBoost, XGBoost performs significantly better; however, its performance is very similar to Random Forest and Gradient Boosting.
Conclusion:
The tuned XGBoost Regressor demonstrates stable and robust performance with strong generalization. While tuning provides limited gains, XGBoost remains one of the top-performing models and is a strong candidate for final deployment, especially when predictive accuracy is prioritized.
# training performance comparison
models_train_comp_df = pd.concat(
[
rf_estimator_model_train_perf.T, # Random Forest (base)
rf_tuned_model_train_perf.T, # Random Forest (tuned)
xgb_estimator_model_train_perf.T, # XGBoost (base)
xgb_tuned_model_train_perf.T, # XGBoost (tuned)
],
axis=1,
)
models_train_comp_df.columns = [
"Random Forest Estimator",
"Random Forest Tuned",
"XGBoost Estimator",
"XGBoost Tuned",
]
print("Training performance comparison:")
models_train_comp_df
# Testing performance comparison
models_test_comp_df = pd.concat(
[
rf_estimator_model_test_perf.T, # Random Forest (base)
rf_tuned_model_test_perf.T, # Random Forest (tuned)
xgb_estimator_model_test_perf.T, # XGBoost (base)
xgb_tuned_model_test_perf.T, # XGBoost (tuned)
],
axis=1,
)
models_test_comp_df.columns = [
"Random Forest Estimator",
"Random Forest Tuned",
"XGBoost Estimator",
"XGBoost Tuned",
]
print("Test performance comparison:")
models_test_comp_df
# Select the final model based on test-set RMSE (lower is better)
if rf_tuned_model_test_perf["RMSE"][0] < xgb_tuned_model_test_perf["RMSE"][0]:
    best_model = rf_tuned
else:
    best_model = xgb_tuned
print(f"The best performing model is: {best_model}")
🔍 Observations for Final Model Selection
Random Forest (base and tuned) consistently delivers the best overall performance across both training and test datasets.
The tuned Random Forest does not improve performance over the base Random Forest, indicating that the default parameters were already near-optimal.
XGBoost (base and tuned) performs very similarly to Random Forest but shows:
Slightly higher RMSE and MAE
Marginally lower R² and Adjusted R² on the test set
The performance gap between training and test sets for Random Forest is small, indicating good generalization and minimal overfitting.
Tuned XGBoost does not outperform base XGBoost, suggesting limited benefit from hyperparameter tuning for this dataset.
Among all models compared:
Lowest Test RMSE & MAE → Random Forest
Highest Test R² & Adjusted R² → Random Forest
Lowest Test MAPE → Random Forest
✅ Final Selection Justification
Random Forest Regressor is selected as the best-performing and most stable model
It balances accuracy, robustness, and generalization
Hyperparameter tuning did not yield meaningful gains, reinforcing confidence in the chosen model
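The selection logic above generalizes to any number of candidates: fit each one, score it on the held-out test set, and keep the lowest RMSE. A minimal sketch on synthetic data (the candidate names and data are illustrative):

```python
# Minimal sketch (synthetic data): pick the final model by *test* RMSE
# across several fitted candidates, mirroring the comparison above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

candidates = {
    "Random Forest": RandomForestRegressor(random_state=1),
    "Gradient Boosting": GradientBoostingRegressor(random_state=1),
}
rmse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rmse[name] = mean_squared_error(y_te, model.predict(X_te)) ** 0.5

best_name = min(rmse, key=rmse.get)
print(f"Best by test RMSE: {best_name} ({rmse[best_name]:.1f})")
```

Scoring on the test split (never the training split) is what keeps the selection honest: a model can have the lowest training RMSE simply by overfitting.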
# Create a folder for storing the files needed for web app deployment
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files", exist_ok=True)
# Define the file path to save (serialize) the trained model along with the data preprocessing steps
saved_model_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib"
# Save the best trained model pipeline using joblib
joblib.dump(best_model, saved_model_path)
print(f"Model saved successfully at {saved_model_path}")
# Load the saved model pipeline from the file
saved_model = joblib.load("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib")
# Confirm the model is loaded
print("Model loaded successfully.")
saved_model
Let's try making predictions on the test set using the deserialized model.
# Test a prediction to confirm functionality
sample_preds = saved_model.predict(X_test[:5])
print("\n Sample Predictions on Test Set:\n", sample_preds)
A dedicated directory was created to store all files required for backend deployment, ensuring a well-structured and maintainable deployment setup.
The trained model was saved using the joblib library as a single .joblib file, which includes the complete machine learning pipeline.
The serialized object contains both the data preprocessing steps (ColumnTransformer and OneHotEncoder) and the final Random Forest regression model, ensuring consistency between training and inference.
Successful execution messages confirm that the model was correctly saved to disk at the specified location.
The model was subsequently reloaded from the saved file, verifying that the serialization process was successful and the file is usable.
Inspection of the loaded object confirms that it is a Pipeline, demonstrating that all preprocessing and modeling components are preserved together.
The OneHotEncoder is configured with handle_unknown='ignore', which improves robustness by allowing the model to handle unseen categorical values during real-time predictions without errors.
Overall, the serialization process ensures the selected best-performing model is deployment-ready and can be directly integrated into a production or web application environment.
# Import the login function from the huggingface_hub library
from huggingface_hub import login
from huggingface_hub import create_repo
import os
# Login to Hugging Face account using access token
from google.colab import userdata
hf_token = userdata.get('SuperKart1')
login(token=hf_token)
# create the repository for the Hugging Face Space
try:
create_repo("SuperKartBackend",
repo_type="space", # Specify the repository type as "space"
space_sdk="docker", # Specify the space SDK as "docker" to create a Docker space
private=False # Set to True if the space should be private
)
# Handle potential errors during repository creation
except Exception as e:
if "RepositoryAlreadyExistsError" in str(e):
print("Repository already exists. Skipping creation.")
else:
print(f"Error creating repository: {e}")
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/app.py"
# Import necessary libraries
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify
import traceback
import math
# Define the path where the model is saved
model_file_name = "SuperKart_v1_0.joblib"
try:
# Load the trained machine learning model
model = joblib.load(model_file_name)
except FileNotFoundError:
print(f"Error: Model file not found at {model_file_name}")
model = None
except Exception as e:
print(f"Error loading model: {e}")
traceback.print_exc()
model = None
# Initialize the Flask app
app = Flask(__name__)
@app.route('/')
def home():
    return "Welcome to the SuperKart Product Sales Price Prediction API!"
# ---------------- Single Prediction Endpoint ----------------
@app.route('/v1/salesprice', methods=['POST'])
def predict_sales_price():
if model is None:
return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500
try:
property_data = request.get_json(force=True)
expected_keys = [
'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
]
if not all(key in property_data for key in expected_keys):
missing_keys = [key for key in expected_keys if key not in property_data]
return jsonify({"error": f"Missing keys in input data: {missing_keys}"}), 400
sample = {key: property_data.get(key) for key in expected_keys}
input_data = pd.DataFrame([sample])
predicted_sales_price = model.predict(input_data)
predicted_price = round(float(predicted_sales_price[0]), 2)
if math.isinf(predicted_price) or math.isnan(predicted_price):
return jsonify({"error": "Prediction resulted in an invalid value."}), 400
return jsonify({'Predicted Price': predicted_price}), 200
except Exception as e:
print(f"Error during single prediction: {e}")
traceback.print_exc()
return jsonify({"error": "Internal server error", "details": str(e)}), 500
# ---------------- Batch Prediction Endpoint ----------------
@app.route('/v1/salespricebatch', methods=['POST'])
def predict_sales_price_batch():
"""
Expects a CSV file with one product per row.
Returns JSON: a list of dicts with `row_id` and predicted price.
"""
if model is None:
return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500
if 'file' not in request.files:
return jsonify({"error": "No file uploaded"}), 400
try:
file = request.files['file']
input_data = pd.read_csv(file)
expected_columns = [
'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
]
missing_columns = [col for col in expected_columns if col not in input_data.columns]
if missing_columns:
return jsonify({"error": f"Missing required columns: {missing_columns}"}), 400
input_data.reset_index(inplace=True)
input_data.rename(columns={'index': 'row_id'}, inplace=True)
predictions = model.predict(input_data[expected_columns])
predicted_prices = [round(float(p), 2) for p in predictions]
results = [
{"row_id": row_id, "Predicted Price": price}
for row_id, price in zip(input_data['row_id'], predicted_prices)
]
return jsonify(results), 200
except Exception as e:
print(f"Error during batch prediction: {e}")
traceback.print_exc()
return jsonify({"error": "Internal server error during batch prediction.", "details": str(e)}), 500
if __name__ == '__main__':
    # No development server here; in the Docker Space the app is served by gunicorn
    pass
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/Dockerfile"
# Use slim Python image
FROM python:3.9-slim
# Set working directory inside the container
WORKDIR /app
# Copy project files into the container
COPY . .
# Install dependencies and print package list to verify gunicorn is installed
RUN pip install --no-cache-dir --upgrade pip \
&& pip install --no-cache-dir -r requirements.txt \
&& echo "Installed packages:" \
&& pip list
# Expose the port Hugging Face expects
EXPOSE 7860
# Start the Flask app using gunicorn
# - first "app": the module app.py
# - second "app": the Flask application object defined in app.py
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:app"]
# for hugging face space authentication to upload files
from huggingface_hub import HfApi
# Hugging Face space id - Backend
repo_id = "randley7/SuperKartBackend"
# Initialize the API
api = HfApi()
#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/"
# Upload the backend files (Flask app, model, Dockerfile, requirements)
api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="space")
print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")
# Try to create the repository for the Hugging Face Space
try:
create_repo("SuperKartFrontend",
repo_type="space", # Specify the repository type as "space"
space_sdk="docker", # Specify the space SDK as "docker" to create a Docker space
private=False # Set to True if you want the space to be private
)
# Handle potential errors during repository creation
except Exception as e:
if "RepositoryAlreadyExistsError" in str(e):
print("Repository already exists. Skipping creation.")
else:
print(f"Error creating repository: {e}")
# Create the directory if it doesn't exist and then write the file
import os
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files", exist_ok=True)
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/Dockerfile"
# Use Python base image
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Copy all files into the container
COPY . /app
# Install dependencies
RUN pip install --upgrade pip && \
pip install -r requirements.txt
# Expose port for Streamlit
EXPOSE 7860
# Run Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/app.py"
# import
import streamlit as st
import pandas as pd
import requests
# Streamlit UI
st.title("SuperKart Sales Prediction App")
st.write("Predict store sales based on product and store attributes.")
# Input fields for product and store data
Product_Weight = st.number_input("Product Weight", min_value=0.0, value=12.66)
Product_Sugar_Content = st.selectbox("Product Sugar Content", ["Low Sugar", "Regular", "No Sugar"])
Product_Allocated_Area = st.number_input("Product Allocated Area", min_value=0.0, step=0.1)
Product_MRP = st.number_input("Product MRP", min_value=0.0, step=0.1)
Store_Size = st.selectbox("Store Size", ["Small", "Medium", "High"])
Store_Location_City_Type = st.selectbox("Store Location City Type", ["Tier 1", "Tier 2", "Tier 3"])
Store_Type = st.selectbox("Store Type",["Supermarket Type2", "Departmental Store", "Supermarket Type1", "Food Mart"])
Product_Id_char = st.selectbox("Product Id Char", ["FD", "NC", "DR"])
Store_Age_Years = st.number_input("Store Age Years", min_value=0, step=1)
Product_Type_Category = st.selectbox("Product Type Category", ["Perishables", "Non Perishables"])
input_data = pd.DataFrame([{
'Product_Weight': Product_Weight,
'Product_Sugar_Content': Product_Sugar_Content,
'Product_Allocated_Area': Product_Allocated_Area,
'Product_MRP': Product_MRP,
'Store_Size': Store_Size,
'Store_Location_City_Type': Store_Location_City_Type,
'Store_Type': Store_Type,
'Product_Id_char': Product_Id_char,
'Store_Age_Years': Store_Age_Years,
'Product_Type_Category': Product_Type_Category
}])
# Predict button
if st.button("Predict"):
try:
response = requests.post(
"https://randley7-SuperKartBackend.hf.space/v1/salesprice",
json=input_data.to_dict(orient='records')[0]
)
if response.status_code == 200:
prediction = response.json().get("Predicted Price", "No prediction returned")
st.success(f"Predicted Sales Price: {prediction}")
else:
st.error("Error making prediction.")
st.text(response.text)
except Exception as e:
st.error(f"Exception occurred: {e}")
# ----------------- Batch Prediction -----------------
st.subheader("Batch Prediction")
uploaded_file = st.file_uploader("Upload CSV file for batch prediction", type=["csv"])
if uploaded_file is not None:
if st.button("PredictBatch"):
try:
files = {"file": (uploaded_file.name, uploaded_file, "text/csv")}
response = requests.post(
"https://randley7-SuperKartBackend.hf.space/v1/salespricebatch",
files=files
)
if response.status_code == 200:
predictions = response.json()
st.success("Batch predictions completed!")
# Convert to DataFrame and display
df_predictions = pd.DataFrame(predictions)
st.dataframe(df_predictions)
# Download button
csv = df_predictions.to_csv(index=False).encode('utf-8')
st.download_button(
label="Download Predictions as CSV",
data=csv,
file_name="SuperKart_Predicted_Sales.csv",
mime="text/csv"
)
else:
st.error("Error making batch prediction.")
st.text(response.text)
except Exception as e:
st.error(f"Exception occurred: {e}")
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10
# for hugging face space authentication to upload files
from huggingface_hub import HfApi
repo_id = "randley7/SuperKartFrontend"
# Initialize the API
api = HfApi()
#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/"
# Upload Streamlit app files
api.upload_folder(folder_path=folder_path, repo_id=repo_id,repo_type="space")
print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")
# Import the necessary libraries
import json
import requests
import pandas as pd
import numpy as np
#Base URL of the deployed Flask API on Hugging Face Space
model_root_url = "https://randley7-SuperKartBackend.hf.space"
#Endpoint for single inference
model_url = model_root_url + "/v1/salesprice"
#Payload with necessary features for single inference prediction
payload = {
'Product_Weight': 12.66,
'Product_Sugar_Content': "Low Sugar",
'Product_Allocated_Area': 0.20,
'Product_MRP': 0.30,
'Store_Size': "Small",
'Store_Location_City_Type': "Tier 1",
'Store_Type': "Supermarket Type2",
'Product_Id_char': "FD",
'Store_Age_Years': 10,
'Product_Type_Category': "Non Perishables"
}
#sending a POST request to the model endpoint with the payload
response = requests.post(model_url, json=payload)
print(model_url)
print(response)
# ALWAYS print the raw response text for debugging
print("Raw response text:")
print(response.text)
# Check if the response is successful (status code 200) before trying to parse JSON
if response.status_code == 200:
try:
# Attempt to parse the JSON
print("Parsed JSON response:")
print(response.json())
except json.JSONDecodeError as e:
print(f"JSON Decode Error occurred: {e}")
print("Could not parse response as JSON despite 200 status code.")
else:
# If the response was not successful, print the status code and the raw text
print(f"Error: Received status code {response.status_code}")
print("Response content (if any):")
print(response.text) # Print raw text to see the error message from the backend
Observations – Backend and Frontend Integration
The frontend Streamlit application successfully communicates with the backend Flask API using HTTP POST requests.
Real-time predictions are displayed in the UI, confirming seamless data flow from user input → backend inference → frontend response.
The same API endpoint supports both programmatic access (via Python requests) and UI-based interaction, increasing system flexibility.
The frontend correctly parses backend JSON responses and presents predictions in a user-friendly format.
The batch prediction workflow is fully integrated, allowing users to upload CSV files and receive multiple predictions in one request.
Download functionality for batch prediction results enhances usability and supports real-world analytical workflows.
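The batch workflow described above can also be exercised programmatically, mirroring what the Streamlit UI does with an uploaded CSV. The sketch below only builds and serializes the batch in memory; the batch endpoint path (`/v1/salespricebatch`) and the `file` form field are assumptions for illustration — the actual names are defined by the deployed Flask backend and may differ.

```python
import io

import pandas as pd

# Build a small batch of records with the same schema as the single-inference payload
batch_df = pd.DataFrame([
    {"Product_Weight": 12.66, "Product_Sugar_Content": "Low Sugar",
     "Product_Allocated_Area": 0.20, "Product_MRP": 0.30,
     "Store_Size": "Small", "Store_Location_City_Type": "Tier 1",
     "Store_Type": "Supermarket Type2", "Product_Id_char": "FD",
     "Store_Age_Years": 10, "Product_Type_Category": "Non Perishables"},
    {"Product_Weight": 9.10, "Product_Sugar_Content": "Low Sugar",
     "Product_Allocated_Area": 0.15, "Product_MRP": 0.25,
     "Store_Size": "Small", "Store_Location_City_Type": "Tier 1",
     "Store_Type": "Supermarket Type2", "Product_Id_char": "FD",
     "Store_Age_Years": 5, "Product_Type_Category": "Non Perishables"},
])

# Serialize to an in-memory CSV, as the frontend does with an uploaded file
csv_buffer = io.StringIO()
batch_df.to_csv(csv_buffer, index=False)

# To call the live Space, uncomment the lines below (endpoint name is assumed):
# import requests
# batch_url = "https://randley7-SuperKartBackend.hf.space/v1/salespricebatch"
# files = {"file": ("batch.csv", csv_buffer.getvalue(), "text/csv")}
# print(requests.post(batch_url, files=files).json())
```

Because the CSV is built in memory, the same code path can be reused to generate download-ready result files on the frontend side.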
Observations – Deployment Validation and System Robustness
The backend and frontend are deployed as independent Hugging Face Spaces, ensuring modularity and easier maintenance.
Dockerized deployments ensure consistent runtime environments and eliminate dependency conflicts.
Version-pinned dependencies in requirements.txt improve reproducibility and long-term stability.
The serialized model pipeline (including preprocessing) ensures identical transformations during both training and inference.
Successful predictions from both direct API calls and the UI confirm end-to-end system correctness.
The deployed system demonstrates production readiness with clear endpoints, validation checks, and scalable architecture.
Observations – Interfacing Using Flask API
The Flask API is successfully deployed on Hugging Face Spaces and is accessible via a public HTTPS endpoint, enabling external inference requests.
A RESTful design is followed, with a dedicated /v1/salesprice endpoint for single predictions that accepts JSON payloads.
Input features in the API payload exactly match the features used during model training, ensuring schema consistency and preventing inference mismatches.
The API correctly returns HTTP 200 responses for valid requests, confirming proper request handling and inference execution.
JSON responses are well-structured and include a clearly labeled Predicted Price, facilitating easy consumption by downstream applications.
Robust debugging practices are demonstrated by logging raw response text and safely handling JSON decoding.
Error handling is implemented to capture invalid payloads, missing keys, or unexpected runtime issues, improving API reliability.
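The response-handling logic described above (only a 200 status with valid JSON yields a prediction) can be factored into a small pure function, which is easier to test than inline `print` statements. This is a hypothetical refactoring sketch: `parse_prediction` is an illustrative name, and the `Predicted Price` key follows the response label mentioned above.

```python
import json

def parse_prediction(status_code: int, body: str):
    """Return (prediction, error); exactly one of the two is None.

    Mirrors the notebook's response handling: a non-200 status or an
    unparseable body yields an error message instead of a prediction.
    """
    if status_code != 200:
        return None, f"backend returned status {status_code}: {body}"
    try:
        return json.loads(body), None
    except json.JSONDecodeError as e:
        return None, f"could not parse response as JSON: {e}"
```

A caller then branches on a single tuple, e.g. `result, err = parse_prediction(response.status_code, response.text)`, instead of re-implementing status and decode checks at every call site.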
Overall Observation
The project demonstrates a complete, production-grade machine learning deployment pipeline, covering model training, evaluation, selection, serialization, backend API development, frontend integration, and cloud deployment. The seamless interaction between components validates both the technical soundness and practical usability of the solution.
SuperKart Sales Prediction Project
Actionable Insights
Product attributes such as MRP, weight, sugar content, and category (perishable vs non-perishable) play a significant role in predicting sales value.
Products with optimized pricing and appropriate shelf allocation tend to generate higher predicted sales.
Perishable and non-perishable products show distinct sales behavior, indicating different demand dynamics.
Insight: Sales performance is highly sensitive to product-level decisions rather than being driven by a single store factor.
Store size, store type, city tier, and store age contribute meaningfully to sales predictions.
Larger stores and stores located in higher-tier cities tend to exhibit stronger sales potential.
Older stores show more stable and predictable sales patterns, likely due to established customer bases.
Insight: Store-level heterogeneity must be accounted for when planning inventory and pricing strategies.
Tree-based ensemble models (Random Forest and XGBoost) significantly outperform simpler models.
The selected Random Forest model demonstrates consistent generalization, with minimal performance gap between training and test data.
Hyperparameter tuning improves stability but yields marginal gains over the base ensemble models.
Insight: Sales relationships are non-linear, and ensemble models are well-suited for capturing complex interactions between product and store features.
Comparable RMSE, MAE, and R² values across training and test sets indicate low overfitting.
The model’s performance consistency suggests it can be trusted for real-world sales estimation scenarios.
Insight: The model can be reliably used for operational and tactical decision-making rather than just exploratory analysis.
The Flask API enables real-time inference for individual product–store combinations.
Batch prediction support allows large-scale forecasting across catalogs and store networks.
The Streamlit frontend provides accessibility for non-technical business users.
Insight: The solution moves beyond analytics into an operational decision-support system.
Business Recommendations
Use the model to simulate sales outcomes for different MRP and shelf-space allocation combinations.
Identify price bands that maximize sales without eroding demand, especially for high-volume categories.
Allocate premium shelf space to products with higher predicted sales impact.
Recommendation: Integrate the model into pricing and merchandising decisions to maximize revenue per square foot.
Adjust inventory levels based on store size, location tier, and store maturity.
Avoid uniform inventory policies across all stores, as demand patterns vary significantly.
Use batch predictions to forecast store-level demand before replenishment cycles.
Recommendation: Move from centralized inventory planning to store-cluster-based demand forecasting.
Use the model to estimate expected sales for new products or newly opened stores using proxy attributes.
Evaluate product–store fit before rollout to reduce launch risk.
Prioritize high-potential store locations and product categories.
Recommendation: Use predictive insights to de-risk expansion and new product introduction strategies.
Identify products with high baseline demand and amplify them during promotional campaigns.
Avoid over-promoting products with inherently low predicted demand.
Tailor promotions based on store location and customer demographics inferred from city tiers.
Recommendation: Shift from blanket promotions to data-driven, targeted marketing campaigns.
Provide business teams access to the deployed Streamlit application for scenario analysis.
Allow category managers to test “what-if” scenarios by adjusting product and store attributes.
Reduce dependency on technical teams for routine sales forecasting.
Recommendation: Democratize predictive insights to improve agility and decision-making speed.
Embed the API into ERP, inventory management, or demand planning systems.
Automate daily or weekly batch predictions for operational planning.
Continuously retrain the model with new sales data to maintain accuracy.
Recommendation: Treat the model as a living system, not a one-time analytical output.
Strategic Impact Summary
Revenue Growth: Improved pricing and assortment decisions driven by predictive insights.
Cost Reduction: Lower inventory holding and wastage through accurate demand estimation.
Operational Efficiency: Faster, data-backed decisions enabled by real-time inference.
Scalability: Cloud deployment supports expansion across regions and product lines.
Competitive Advantage: Advanced analytics capability embedded into everyday business operations.
Final Insight
The SuperKart Sales Prediction system transforms historical data into actionable intelligence, enabling the organization to move from reactive decision-making to proactive, predictive retail strategy.