By Randley Morales

Problem Statement

Business Context

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description

The data contains the different attributes of the various products and stores.The detailed data dictionary is given below.

  • Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
  • Product_Weight - weight of each product
  • Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
  • Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
  • Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
  • Product_MRP - maximum retail price of each product
  • Store_Id - unique identifier of each store
  • Store_Establishment_Year - year in which the store was established
  • Store_Size - size of the store depending on sq. feet like high, medium and low
  • Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
  • Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
  • Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

Installing and Importing the necessary libraries

In [ ]:
# Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 kB 7.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223.6/223.6 MB 5.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 558.7/558.7 kB 25.4 MB/s eta 0:00:00

Note:

  • After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.

  • On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

In [ ]:
import warnings
warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For splitting the dataset
from sklearn.model_selection import train_test_split

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries different ensemble classifiers
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Libraries to get different metric scores
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)

# To create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# To tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# To serialize the model
import joblib

# os related functionalities
import os

# API request
import requests

# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

import math

Loading the dataset

In [ ]:
# Connect to google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [ ]:
# Load the dataset from a CSV file into a Pandas DataFrame
kart = pd.read_csv("/content/drive/MyDrive/Model Deployment/Full_Code/SuperKart.csv")
In [ ]:
# Make a copy of Kart
data = kart.copy()

Data Overview

View the first and last 5 rows of the dataset

In [ ]:
# The first 5 rows of the dataset
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36
In [ ]:
# The last 5 rows of the dataset
data.tail()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
8758 NC7546 14.80 No Sugar 0.016 Health and Hygiene 140.53 OUT004 2009 Medium Tier 2 Supermarket Type2 3806.53
8759 NC584 14.06 No Sugar 0.142 Household 144.51 OUT004 2009 Medium Tier 2 Supermarket Type2 5020.74
8760 NC2471 13.48 No Sugar 0.017 Health and Hygiene 88.58 OUT001 1987 High Tier 2 Supermarket Type1 2443.42
8761 NC7187 13.89 No Sugar 0.193 Household 168.44 OUT001 1987 High Tier 2 Supermarket Type1 4171.82
8762 FD306 14.73 Low Sugar 0.177 Snack Foods 224.93 OUT002 1998 Small Tier 3 Food Mart 2186.08

Understand the shape of the dataset

In [ ]:
# Checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 8763 rows and 12 columns.
In [ ]:
# Display the column names of the dataset
data.columns
Out[ ]:
Index(['Product_Id', 'Product_Weight', 'Product_Sugar_Content',
       'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Id',
       'Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type',
       'Store_Type', 'Product_Store_Sales_Total'],
      dtype='object')

Check the data types of the columns for the dataset

In [ ]:
# Checking column datatypes and number of non-null values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Checking for duplicate values

In [ ]:
# Checking for duplicate values
data.duplicated().sum()
Out[ ]:
np.int64(0)

Checking for missing values

In [ ]:
# Checking for missing values
data.isnull().sum()
Out[ ]:
0
Product_Id 0
Product_Weight 0
Product_Sugar_Content 0
Product_Allocated_Area 0
Product_Type 0
Product_MRP 0
Store_Id 0
Store_Establishment_Year 0
Store_Size 0
Store_Location_City_Type 0
Store_Type 0
Product_Store_Sales_Total 0

Let's check the statistical summary of the data.

In [ ]:
# Statistical summary of the data for both numerical and categorical columns
data.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Id 8763 8763 FD306 1 NaN NaN NaN NaN NaN NaN NaN
Product_Weight 8763.0 NaN NaN NaN 12.653792 2.21732 4.0 11.15 12.66 14.18 22.0
Product_Sugar_Content 8763 4 Low Sugar 4885 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 8763.0 NaN NaN NaN 0.068786 0.048204 0.004 0.031 0.056 0.096 0.298
Product_Type 8763 16 Fruits and Vegetables 1249 NaN NaN NaN NaN NaN NaN NaN
Product_MRP 8763.0 NaN NaN NaN 147.032539 30.69411 31.0 126.16 146.74 167.585 266.0
Store_Id 8763 4 OUT004 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Establishment_Year 8763.0 NaN NaN NaN 2002.032751 8.388381 1987.0 1998.0 2009.0 2009.0 2009.0
Store_Size 8763 3 Medium 6025 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 8763 3 Tier 2 6262 NaN NaN NaN NaN NaN NaN NaN
Store_Type 8763 4 Supermarket Type2 4676 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 8763.0 NaN NaN NaN 3464.00364 1065.630494 33.0 2761.715 3452.34 4145.165 8000.0

Observations:

📌 Observations on the Dataset

  1. Data Types & Structure
  • The dataset has 12 columns:

    • 4 numerical (float/int)

      • Product_Weight (float)

      • Product_Allocated_Area (float)

      • Product_MRP (float)

      • Store_Establishment_Year (int)

      • Product_Store_Sales_Total (float – target)

    • 7 categorical (object)

      • Product and store identifiers and descriptors.
  • Data types are appropriate and consistent with business meaning.

  • Memory usage is low (~822 KB), making it efficient for experimentation.


  1. Missing Values
  • No missing values across all columns.

  • This eliminates the need for:

    • Imputation strategies

    • Row/column removal

  • The dataset is clean and ready for modeling.


  1. Duplicate Records
  • 0 duplicate rows found.

  • Each product–store combination appears to be unique, improving data reliability.

  • No deduplication steps are required.


  1. Categorical Feature Insights
  • High-cardinality column:

    • Product_Id → 8,763 unique values

    • Acts more like an identifier than a predictive feature.

    • Should be removed or feature-engineered (e.g., prefix extraction).

  • Low to moderate cardinality columns:

    • Product_Sugar_Content → 4 categories (Low Sugar most frequent)

    • Product_Type → 16 categories (Fruits & Vegetables most common)

    • Store_Id → 4 stores

    • Store_Size → 3 levels (Medium dominant)

    • Store_Location_City_Type → 3 tiers (Tier 2 most frequent)

    • Store_Type → 4 types (Supermarket Type2 dominant)


  1. Numerical Feature Distribution

Product_Weight

  • Mean ≈ 12.65

  • Range: 4 to 22

  • Fairly symmetric distribution with moderate variance.

Product_Allocated_Area

  • Mean ≈ 0.069

  • Highly right-skewed (most products have small shelf area).

  • Likely a strong driver of sales visibility.

Product_MRP

  • Mean ≈ 147

  • Range: 31 to 266

  • Wide pricing range suggests varied product positioning.

Store_Establishment_Year

  • Range: 1987 to 2009

  • Median ≈ 2009

  • Can be converted into Store_Age for better interpretability.


  1. Target Variable (Product_Store_Sales_Total)
  • Mean ≈ 3464

  • Standard deviation ≈ 1066

  • Range: 33 to 8000

  • Indicates:

    • High variability in sales

    • Possible outliers

  • Distribution is likely right-skewed, which tree-based models handle well.


  1. Modeling Implications
  • Dataset is fully clean (no missing or duplicate values).

  • Strong mix of:

    • Product-level features

    • Store-level features

  • Tree-based regressors (Random Forest, XGBoost) are ideal.

  • Feature engineering opportunities:

    • Drop or transform Product_Id

    • Create Store_Age


Summary

The dataset consists of 8,763 clean and duplicate-free records with a balanced mix of numerical and categorical features. There are no missing values, and the target variable shows significant variability, making the dataset suitable for regression modeling using ensemble-based methods with appropriate feature engineering.

Exploratory Data Analysis (EDA)

Univariate Analysis

In [ ]:
#Function to plot a boxplot and a histogram
def histogram_boxplot(
    data,
    feature,
    figsize=(12, 7),
    kde=True,
    bins="auto",
    title=None,
    color="#8b5cf6",
    hist_alpha=0.35,
    show_stats_box=True,
    show_stats_subtitle=True,
    plot_gap=0.18,
    title_y=0.98,
    top_margin=0.90,
):
    """
    Combined boxplot + histogram for a numeric feature, with optional stats.
    """
    sns.set_theme(style="whitegrid", context="notebook")

    x = data[feature].dropna()
    if x.empty:
        raise ValueError(f"Column '{feature}' has no non-null values to plot.")

    # --- Summary stats ---
    n = x.shape[0]
    std = x.std()
    min_v = x.min()
    max_v = x.max()
    mean_v = x.mean()
    median_v = x.median()

    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,
        figsize=figsize,
        gridspec_kw={"height_ratios": (0.28, 0.72), "hspace": plot_gap},
    )

    # --- Boxplot ---
    sns.boxplot(
        x=x,
        ax=ax_box,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.3),
        boxprops=dict(linewidth=1.3),
    )
    ax_box.set(xlabel="")
    ax_box.set_yticks([])
    sns.despine(ax=ax_box, left=True, bottom=True)

    # --- Histogram ---
    sns.histplot(
        x=x,
        ax=ax_hist,
        bins=bins if bins is not None else "auto",
        kde=kde,
        color=color,
        alpha=hist_alpha,
        edgecolor="white",
        linewidth=1,
    )

    # Mean/Median lines
    ax_hist.axvline(mean_v, color="#16a34a", linestyle="--", linewidth=2, label=f"Mean: {mean_v:,.2f}")
    ax_hist.axvline(median_v, color="#111827", linestyle="-", linewidth=2, label=f"Median: {median_v:,.2f}")
    ax_hist.legend(frameon=True, fontsize=10, loc="upper right")

    ax_hist.set_ylabel("Count")
    ax_hist.set_xlabel(feature)
    sns.despine(ax=ax_hist)

    # --- Title + subtitle (stats line) ---
    main_title = title or f"Distribution of {feature}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if show_stats_subtitle:
        subtitle = f"n={n:,}   std={std:,.2f}   min={min_v:,.2f}   max={max_v:,.2f}"
        # place subtitle just below suptitle
        fig.text(0.5, title_y - 0.045, subtitle, ha="center", va="top", fontsize=11)

    # --- Stats box inside histogram ---
    if show_stats_box:
        stats_text = (
            f"n = {n:,}\n"
            f"std = {std:,.2f}\n"
            f"min = {min_v:,.2f}\n"
            f"max = {max_v:,.2f}"
        )
        ax_hist.text(
            0.01, 0.98, stats_text,
            transform=ax_hist.transAxes,
            va="top", ha="left",
            fontsize=10,
            bbox=dict(boxstyle="round,pad=0.35", facecolor="white", edgecolor="#e5e7eb", alpha=0.95),
        )

    # Make room for title/subtitle
    fig.subplots_adjust(top=top_margin)

    return fig, (ax_box, ax_hist)

Product Weight

In [ ]:
# Product Weight
histogram_boxplot(data, "Product_Weight", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:

📊 Univariate Analysis – Product_Weight

  1. Distribution Shape
  • The distribution of Product_Weight is approximately normal (bell-shaped).

  • Mean (12.65) and median (12.66) are almost identical, indicating a highly symmetric distribution.

  • The KDE curve confirms no strong skewness.

📌 Implication:

Since the feature is close to normally distributed, no transformation (log/sqrt) is required.


  1. Central Tendency & Spread
  • Mean: ~12.65

  • Median: ~12.66

  • Standard Deviation: ~2.22

This suggests:

  • Most products cluster tightly around the mean.

  • Product weights are well standardized, which is common in retail packaging.


  1. Range & Variability
  • Minimum: 4.0

  • Maximum: 22.0

  • Interquartile Range (IQR): roughly between 11 and 14

Most product weights lie in a narrow, realistic range, showing controlled product sizing.


  1. Outliers
  • A few outliers exist on both lower and upper ends:

    • Very light products (~4–6)

    • Very heavy products (~20–22)

  • These outliers are business-valid, not data errors (e.g., small sachets vs bulk items).

📌 Implication:

Outliers should not be removed, especially when using tree-based models (Random Forest, XGBoost), which are robust to them.


  1. Data Quality
  • No missing values observed.

  • No abnormal spikes or irregular gaps.

  • Distribution aligns well with real-world retail data.


Summary

Product_Weight follows a near-normal distribution with minimal skewness and reasonable variability. The presence of a few valid outliers reflects real-world product diversity. No transformation or outlier treatment is required, making it a stable and reliable feature for modeling.

Product Allocated Area

In [ ]:
# Product Allocated Area
histogram_boxplot(data, "Product_Allocated_Area", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:

📊 Univariate Analysis – Product_Allocated_Area

  1. Distribution Shape
  • The distribution of Product_Allocated_Area is highly right-skewed (positively skewed).

  • Most values are concentrated toward the lower end, with a long tail extending to the right.

  • The KDE curve confirms a non-normal distribution.

📌 Implication:

This feature does not follow a normal distribution, and skewness should be considered during modeling.


  1. Central Tendency
  • Mean ≈ 0.07

  • Median ≈ 0.06

  • Mean is greater than the median, which is characteristic of right-skewed data.

This indicates that:

  • A small number of products receive disproportionately large shelf space.

  • Most products occupy relatively limited display area.


  1. Range & Variability
  • Minimum: ~0.00

  • Maximum: ~0.30

  • Standard deviation: ~0.05

The wide spread relative to the mean suggests:

  • Significant variation in shelf allocation across products.

  • Shelf space is a highly differentiated business decision.


  1. Outliers
  • The boxplot reveals multiple upper-end outliers.

  • These represent products with exceptionally high shelf visibility.

  • These outliers are business-driven and meaningful, not data errors.

📌 Implication:

Outliers should not be removed, especially for tree-based models that can leverage them effectively.


  1. Business Interpretation
  • Products with higher allocated area likely:

    • Are high-demand or fast-moving items

    • Have stronger brand presence or promotional support

  • Shelf space is expected to have a direct positive impact on sales.


Summary

Product_Allocated_Area exhibits a strongly right-skewed distribution with several meaningful upper-end outliers, indicating that most products receive limited shelf space while a few receive significantly higher visibility. This feature is expected to have a strong influence on sales and should be retained without outlier removal.

Product MRP

In [ ]:
# Product MRP
histogram_boxplot(data, "Product_MRP", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:

📊 Univariate Analysis – Product_MRP

  1. Distribution Shape
  • The distribution of Product_MRP is approximately normal (bell-shaped).

  • The KDE curve shows a symmetric pattern around the center.

  • Mean (147.03) and median (146.74) are almost identical, indicating very low skewness.

📌 Implication:

No transformation (log/sqrt) is required for this feature.


  1. Central Tendency & Spread
  • Mean: ~147.03

  • Median: ~146.74

  • Standard Deviation: ~30.69

This indicates:

  • Moderate variability in product pricing.

  • Prices are well-distributed across a mid-range retail spectrum.


  1. Range & Pricing Segments
  • Minimum: 31

  • Maximum: 266

This suggests the presence of:

  • Low-priced, mass-market products

  • High-priced, premium products

The dataset covers a wide price band, making it informative for modeling sales behavior.


  1. Outliers
  • A small number of outliers on both ends of the price spectrum:

    • Very low-priced items

    • Premium-priced products

  • These outliers are realistic and business-valid, not data issues.

📌 Implication:

Outliers should be retained, especially for tree-based models that handle them naturally.


  1. Business Interpretation
  • Product_MRP is expected to have a strong influence on sales revenue:

    • Higher-priced products contribute more to total sales value

    • Interaction with volume and shelf space is likely

  • It may interact with:

    • Product_Allocated_Area

    • Store_Type

    • Store_Location_City_Type


Summary

Product_MRP follows an approximately normal distribution with minimal skewness and a wide price range. The presence of valid low- and high-priced products makes it a strong and reliable predictor of sales without requiring transformation or outlier treatment.

Product Store Sales Total

In [ ]:
# Product Store Sales Total
histogram_boxplot(data, "Product_Store_Sales_Total", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:

📊 Univariate Analysis – Product_Store_Sales_Total

  1. Distribution Shape
  • The distribution of Product_Store_Sales_Total is approximately bell-shaped with slight right skew.

  • The KDE curve peaks around the center and tapers gradually on both sides.

  • Mean (3,464) and median (3,452) are very close, indicating near-symmetry.

📌 Implication:

The target variable is well-behaved, making it suitable for a wide range of regression models.


  1. Central Tendency & Variability
  • Mean: ~3,464

  • Median: ~3,452

  • Standard Deviation: ~1,066

This indicates:

  • Significant variation in sales across products and stores.

  • Sales performance differs meaningfully depending on product and store characteristics.


  1. Range of Sales
  • Minimum: ~33

  • Maximum: ~8,000

This wide range suggests:

  • Some products have very low sales, possibly due to low demand or poor placement.

  • High-performing products contribute substantially higher revenue.


  1. Outliers
  • Boxplot shows outliers on both lower and upper ends:

    • Very low sales (near zero)

    • Extremely high sales (>6,000)

  • These values are business-realistic and expected in retail data.

📌 Implication:

Outliers should not be removed, as they represent genuine business scenarios and carry important signals.


  1. Business Interpretation
  • Sales distribution reflects:

    • A majority of products generating moderate sales

    • A small proportion of high-performing products driving revenue

  • This aligns with the Pareto principle (80/20 rule) commonly seen in retail.


Summary

Product_Store_Sales_Total shows a near-normal distribution with moderate variability and meaningful outliers. The wide sales range reflects real-world retail behavior, making the target variable suitable for regression modeling without aggressive transformation or outlier treatment.

In [ ]:
# Function to create labeled barplots
def labeled_barplot(
    data,
    feature,
    perc=False,
    n=None,
    figsize=None,
    title=None,
    color="#8b5cf6",
    rotate=45,
    show_stats_subtitle=True,
):

    sns.set_theme(style="whitegrid", context="notebook")

    s = data[feature]
    total = len(s)
    missing = int(s.isna().sum())

    # Use a fill value so missing categories can be seen (optional but helpful)
    plot_s = s.fillna("Missing")

    # Build counts and optionally select top-n
    vc = plot_s.value_counts(dropna=False)
    if n is not None:
        vc = vc.head(n)

    order = vc.index.tolist()
    n_cat = len(order)

    # Auto figure sizing
    if figsize is None:
        width = max(8, min(16, 1.1 * n_cat + 2))
        figsize = (width, 6)

    fig, ax = plt.subplots(figsize=figsize)

    # Bars
    sns.countplot(
        x=plot_s,
        order=order,
        color=color,
        ax=ax,
        edgecolor="white",
        linewidth=1,
    )

    # Titles (similar to the previous function style)
    main_title = title or f"Distribution of {feature}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=0.98)

    if show_stats_subtitle:
        subtitle = f"rows={total:,}   unique={s.nunique(dropna=True):,}   missing={missing:,}"
        fig.text(0.5, 0.94, subtitle, ha="center", va="top", fontsize=11)

    # Axis labels
    ax.set_xlabel(feature)
    ax.set_ylabel("Percentage" if perc else "Count")

    # Nice tick labels
    ax.tick_params(axis="x", rotation=rotate)
    for tick in ax.get_xticklabels():
        tick.set_horizontalalignment("right" if rotate else "center")

    # Value labels on bars
    ymax = 0
    for p in ax.patches:
        h = p.get_height()
        ymax = max(ymax, h)

        if perc:
            label = f"{(h / total) * 100:.1f}%"
        else:
            label = f"{int(h):,}"

        ax.annotate(
            label,
            (p.get_x() + p.get_width() / 2, h),
            ha="center",
            va="bottom",
            fontsize=11,
            xytext=(0, 4),
            textcoords="offset points",
        )

    # Add headroom so labels don't touch the top
    ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    # Clean spines
    sns.despine(ax=ax)

    # Layout room for suptitle/subtitle
    fig.tight_layout(rect=[0, 0, 1, 0.92])

    plt.show()

Product Sugar Content

In [ ]:
# Product Sugar Content
labeled_barplot(data, "Product_Sugar_Content", perc=True)

Observations:

📊 Univariate Analysis – Product_Sugar_Content

  1. Category Distribution

The variable Product_Sugar_Content has 4 distinct categories:

Category Approx. Percentage
Low Sugar ~55.7%
Regular ~25.7%
No Sugar ~17.3%
reg ~1.2%
  • Low Sugar products dominate the dataset, accounting for more than half of all observations.

  • Regular sugar products form the second-largest segment.

  • No Sugar products represent a smaller but meaningful portion.

  • The category reg appears to be a label inconsistency, not a true separate category.


  1. Data Quality Insight
  • There are no missing values in this feature.

  • The presence of both "Regular" and "reg" indicates a data quality issue due to inconsistent labeling.

📌 Required Action:

"reg" should be merged with "Regular" to avoid:

  • Incorrect category inflation

  • Unnecessary dummy variables after encoding


  1. Class Imbalance
  • The distribution is moderately imbalanced:

    • Low Sugar products heavily outweigh other categories.
  • However, all categories still have sufficient representation.

📌 Implication:

This imbalance is not severe and does not require resampling, but it should be kept in mind during model interpretation.


  1. Business Interpretation
  • The dominance of Low Sugar products reflects:

    • Increasing consumer preference for healthier food options

    • Retail strategy focused on health-conscious offerings

  • No Sugar products may cater to:

    • Niche markets

    • Specific dietary needs (e.g., diabetic-friendly products)


Summary

Product_Sugar_Content is a categorical feature dominated by Low Sugar products, indicating a health-oriented product mix. A minor labeling inconsistency (reg vs Regular) must be resolved before modeling. The feature shows meaningful variation and is suitable for one-hot encoding.

Product Type

In [ ]:
# Product Type
labeled_barplot(data, "Product_Type", perc=True)

Observations:

📊 Univariate Analysis – Product_Type

  1. Category Distribution
  • Product_Type contains 16 distinct categories, indicating a diverse product portfolio.

  • The distribution is uneven, with certain categories contributing significantly more products than others.

Top contributing categories:

  • Fruits and Vegetables – ~14.3% (highest)

  • Snack Foods – ~13.1%

  • Frozen Foods – ~9.3%

  • Dairy – ~9.1%

  • Household – ~8.4%

These categories together account for more than half of the dataset.


  1. Mid-Tier Categories
  • Baking Goods, Canned, Health and Hygiene, Meat each contribute 7–8%.

  • These categories represent stable, essential consumer goods with consistent presence across stores.


  1. Low-Frequency Categories
  • Soft Drinks – ~5.9%

  • Breads, Hard Drinks, Others, Starchy Foods, Breakfast, Seafood each contribute less than 3%.

  • Seafood has the lowest representation (~0.9%).

📌 Implication:

Some categories are sparsely represented, which may limit their standalone predictive power.


  1. Data Quality & Completeness
  • No missing values observed.

  • Category labels are clean and interpretable.

  • No obvious inconsistencies or noise in category naming.


  1. Business Interpretation
  • Dominance of Fruits & Vegetables and Snack Foods suggests:

    • High demand and fast-moving inventory

    • Frequent replenishment cycles

  • Lower presence of categories like Seafood and Breakfast may reflect:

    • Supply constraints

    • Lower consumer demand

    • Store-type or location-based limitations


Summary

Product_Type shows a diverse yet imbalanced distribution, with Fruits and Vegetables and Snack Foods dominating the product mix. While most categories are well-represented, a few low-frequency types may contribute limited predictive power. The feature is clean and suitable for one-hot encoding with minimal preprocessing.

Store Id

In [ ]:
# Store ID
labeled_barplot(data, "Store_Id", perc=True)

Observations:

📊 Univariate Analysis – Store_Id

  1. Category Distribution
  • Store_Id has 4 unique stores.

  • The distribution is highly imbalanced across stores:

Store_Id Approx. Share
OUT004 ~53.4%
OUT001 ~18.1%
OUT003 ~15.4%
OUT002 ~13.1%
  • OUT004 alone contributes more than half of all observations.

  1. Data Quality
  • No missing values detected.

  • Store identifiers are clean and consistent.


  1. Business Interpretation
  • The dominance of OUT004 suggests:

    • Larger store size

    • Higher product assortment

    • Possibly higher footfall or longer operational history

  • Smaller representation from other stores may reflect:

    • Smaller physical size

    • Lower product variety

    • Different regional demand


Summary

Store_Id shows a highly imbalanced distribution, with OUT004 contributing over half of the observations. This reflects real operational differences across stores and should be preserved as a categorical feature during modeling.

Store Size

In [ ]:
# Store Size
labeled_barplot(data, "Store_Size", perc=True)

Observations:

📊 Univariate Analysis – Store_Size

  1. Category Distribution
  • Store_Size has 3 distinct categories: Small, Medium, High.

  • The distribution is highly skewed toward Medium-sized stores:

Store Size Approx. Share
Medium ~68.8%
High ~18.1%
Small ~13.1%
  • Nearly two-thirds of all observations come from Medium-sized stores.

  1. Data Quality
  • No missing values are present.

  • Store size categories are clean, consistent, and interpretable.


  1. Business Interpretation
  • The dominance of Medium-sized stores suggests:

    • SuperKart’s primary operational focus is on mid-sized outlets.

    • These stores likely balance product variety and operating costs efficiently.

  • High-sized stores may:

    • Carry wider assortments

    • Generate higher total sales per store

  • Small stores may:

    • Have limited shelf space

    • Focus on essential or fast-moving products


Summary

Store_Size is a clean categorical feature dominated by Medium-sized stores, reflecting the retailer’s core store format. The feature is business-relevant and likely to significantly influence sales performance.

Store Location City Type

In [ ]:
# Store Location City Type
labeled_barplot(data, "Store_Location_City_Type", perc=True)

Observations:

📊 Univariate Analysis – Store_Location_City_Type

  1. Category Distribution
  • Store_Location_City_Type has 3 distinct categories: Tier 1, Tier 2, Tier 3.

  • The distribution is heavily skewed toward Tier 2 cities:

City Tier Approx. Share
Tier 2 ~71.5%
Tier 1 ~15.4%
Tier 3 ~13.1%
  • Over 70% of all observations come from Tier 2 locations.

  1. Data Quality
  • No missing values present.

  • City tier labels are consistent and well-defined.


  1. Business Interpretation
  • The dominance of Tier 2 cities suggests:

    • SuperKart’s strategic focus on fast-growing urban markets

    • Lower operational costs compared to Tier 1 cities

  • Tier 1 cities:

    • Likely have higher purchasing power

    • May generate higher revenue per product

  • Tier 3 cities:

    • Possibly lower demand

    • More price-sensitive customer base


Summary

Store_Location_City_Type is dominated by Tier 2 cities, indicating a strategic focus on mid-tier urban markets. The feature is clean, business-relevant, and expected to have a significant impact on sales performance.

Store Type

In [ ]:
# Store Type
labeled_barplot(data, "Store_Type", perc=True)

Observations:

📊 Univariate Analysis – Store_Type

  1. Category Distribution
  • Store_Type has 4 distinct categories:

    • Supermarket Type1

    • Supermarket Type2

    • Departmental Store

    • Food Mart

  • The distribution is clearly dominated by Supermarket Type2:

Store Type Approx. Share
Supermarket Type2 ~53.4%
Supermarket Type1 ~18.1%
Departmental Store ~15.4%
Food Mart ~13.1%
  • More than half of all observations come from Supermarket Type2 stores.

  1. Data Quality
  • No missing values detected.

  • Store type labels are consistent and business-meaningful.


  1. Business Interpretation
  • The dominance of Supermarket Type2 suggests:

    • These stores may be larger, better stocked, or more strategically located.

    • They likely generate higher sales volumes due to better infrastructure and assortment.

  • Supermarket Type1 and Departmental Stores provide moderate coverage.

  • Food Marts represent smaller, possibly neighborhood-focused outlets.


Summary

Store_Type is dominated by Supermarket Type2 stores, indicating a core store format that likely drives the majority of sales. The feature is clean, well-distributed, and highly relevant for sales prediction modeling.

Bivariate Analysis

Correlation matrix

In [ ]:
# Correlation Matrix
def nice_corr_heatmap_complete(
    data,
    cols=None,
    method="pearson",
    figsize=(12, 9),
    cmap="Spectral",
    annot="auto",
    fmt=".2f",
    linewidths=0.6,
    cbar_shrink=0.85,
    title="Correlation Heatmap",
    subtitle=True,
    title_y=0.98,
    top_margin=0.90,
    square=True,
):
    sns.set_theme(style="white", context="notebook")

    if cols is None:
        cols = data.select_dtypes(include=np.number).columns.tolist()
    if len(cols) == 0:
        raise ValueError("No numeric columns found to compute correlation.")

    corr = data[cols].corr(method=method)

    # Auto-annotation to avoid clutter on big matrices
    if annot == "auto":
        annot = corr.shape[0] <= 12

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        corr,
        cmap=cmap,
        vmin=-1, vmax=1, center=0,
        square=square,
        linewidths=linewidths,
        linecolor="white",
        annot=annot,
        fmt=fmt,
        annot_kws={"size": 9} if annot else None,
        cbar_kws={"shrink": cbar_shrink, "pad": 0.02},
        ax=ax,
    )

    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)

    if subtitle:
        rows = len(data)
        n_features = len(cols)
        miss = int(data[cols].isna().sum().sum())
        sub = f"rows={rows:,}   numeric_features={n_features:,}   missing_values_in_matrix={miss:,}   method={method}"
        fig.text(0.5, title_y - 0.045, sub, ha="center", va="top", fontsize=11)

    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)

    fig.subplots_adjust(top=top_margin)
    fig.tight_layout(rect=[0, 0, 1, 0.90])

    plt.show()
In [ ]:
# Correlation Heatmap
nice_corr_heatmap_complete(data)

Observations:

📊 Bivariate Analysis – Correlation Matrix (Numerical Features)

Numerical Features Considered

The correlation matrix includes 5 numerical variables:

  • Product_Weight

  • Product_Allocated_Area

  • Product_MRP

  • Store_Establishment_Year

  • Product_Store_Sales_Total (Target)

Pearson correlation method is used.


🔍 Key Observations & Insights

  1. Strong Positive Correlation with Target

🔹 Product_MRP vs Product_Store_Sales_Total

  • Correlation ≈ +0.79 (Strong Positive)

  • This is the strongest correlation with the target.

📌 Interpretation:

Higher-priced products tend to generate higher total sales value, which is expected in revenue-based forecasting.

🔹 Product_Weight vs Product_Store_Sales_Total

  • Correlation ≈ +0.74 (Strong Positive)

📌 Interpretation:

Heavier products may:

  • Be sold in larger quantities

  • Represent premium or bulk items This makes Product_Weight a strong predictor of sales.


  1. Weak / No Correlation with Target

🔹 Product_Allocated_Area vs Product_Store_Sales_Total

  • Correlation ≈ 0.00 (No Linear Relationship)

📌 Interpretation:

Although shelf space is important from a business perspective, its linear relationship with sales is weak.

However:

  • This does not mean the feature is useless

  • The relationship may be non-linear, which tree-based models can capture


  1. Store Age Effect

🔹 Store_Establishment_Year vs Product_Store_Sales_Total

  • Correlation ≈ −0.19 (Weak Negative)

📌 Interpretation:

Older stores (lower establishment year) tend to have slightly higher sales, possibly due to:

  • Established customer base

  • Brand familiarity

This effect is weak but meaningful.


  1. Inter-Feature Correlations

🔹 Product_MRP vs Product_Weight

  • Correlation ≈ +0.53 (Moderate Positive)

📌 Interpretation:

Heavier products often cost more, which is logically consistent.

⚠️ Multicollinearity Check:

  • Correlation is moderate, not high enough to cause serious multicollinearity issues.

  • Safe for use in both linear and tree-based models.


  1. Low Risk of Multicollinearity
  • No pair of independent variables shows very high correlation (>0.85).

  • This indicates:

    • Stable model training

    • Reliable coefficient interpretation (for linear models)


Summary

The correlation analysis shows that Product_MRP and Product_Weight have strong positive relationships with total sales, making them key predictors. Store establishment year shows a weak negative correlation, while product allocated area has no linear relationship, suggesting potential non-linear effects. No severe multicollinearity is observed, supporting the use of all numerical features in modeling.

Let's check the distribution of our target variable i.e Product_Store_Sales_Total with the numeric columns

In [ ]:
# Function to plot scatterz
def nice_scatterplot(
    data,
    x,
    y="Product_Store_Sales_Total",
    figsize=(8, 6),
    title=None,
    subtitle=True,
    color="#8b5cf6",
    alpha=0.45,
    s=45,
    add_regline=False,   # set True if you want a trend line
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.scatterplot(
        data=data,
        x=x,
        y=y,
        ax=ax,
        color=color,
        alpha=alpha,
        s=s,
        edgecolor="white",
        linewidth=0.6,
    )

    # Optional trend line (nice for relationships)
    if add_regline:
        sns.regplot(
            data=data,
            x=x,
            y=y,
            scatter=False,
            ax=ax,
            ci=None,
            line_kws={"linewidth": 2},
        )

    main_title = title or f"{y} vs {x}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if subtitle:
        n = int(data[[x, y]].dropna().shape[0])
        miss = int(data[[x, y]].isna().any(axis=1).sum())
        fig.text(
            0.5, title_y - 0.045,
            f"points={n:,}   rows_with_missing={miss:,}",
            ha="center", va="top", fontsize=11
        )

    ax.set_xlabel(x)
    ax.set_ylabel(y)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
# 1) Product_Weight vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Weight")

Observations:

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Weight

  1. Nature of Relationship
  • The scatter plot shows a strong positive linear relationship between Product_Weight and Product_Store_Sales_Total.

  • As product weight increases, total sales value generally increases as well.

  • This visually confirms the high positive correlation (~0.74) observed in the correlation matrix.


  1. Trend Pattern
  • Data points form a clear upward-sloping pattern.

  • The relationship appears approximately linear, especially in the mid-range of product weights (8–18 units).

  • No abrupt breaks or non-linear curves are visible.

📌 Implication:

Both linear and tree-based models can effectively capture this relationship.


  1. Variability & Spread
  • For lower weights (≈ 4–7):

    • Sales values are generally lower and less dispersed.
  • For mid to higher weights (≈ 10–18):

    • Sales values show greater spread, indicating:

      • Influence of other factors such as price, store type, or shelf space.
  • Variance slightly increases with weight (mild heteroscedasticity).


  1. Outliers
  • A few points show:

    • Very high sales (>7,000)

    • Very low sales at moderate weights

  • These outliers are business-valid (e.g., premium or bulk products).

📌 Implication:

Outliers should not be removed, as they represent genuine sales behavior.


  1. Business Interpretation
  • Heavier products often:

    • Cost more

    • Are sold in bulk or premium segments

  • This naturally leads to higher revenue per product, explaining the strong positive trend.


Summary

The scatter plot reveals a strong positive linear relationship between Product_Weight and total sales, indicating that heavier products tend to generate higher revenue. The trend aligns with correlation analysis, shows realistic variability, and confirms Product_Weight as a key predictor for sales forecasting.

In [ ]:
# 2) Product_Allocated_Area vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Allocated_Area")

Observations:

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Allocated_Area

  1. Nature of Relationship
  • The scatter plot shows no strong linear relationship between Product_Allocated_Area and Product_Store_Sales_Total.

  • Sales values are widely dispersed across all levels of allocated area.

  • This visually confirms the near-zero correlation observed in the correlation matrix.

📌 Key Insight:

Shelf space alone does not linearly explain sales performance.


  1. Distribution Pattern
  • Most products have low allocated area (0.00–0.10).

  • High allocated area values (>0.20) are rare.

  • Across both low and high shelf space:

    • Sales range from very low to very high.

    • No clear upward or downward trend is visible.


  1. Variability & Spread
  • For low allocated area:

    • Sales show very high variability.
  • For higher allocated area:

    • Sales remain scattered with no consistent increase.
  • Variance remains roughly constant, indicating no clear heteroscedastic pattern.


  1. Outliers
  • A few products with high shelf space but moderate sales.

  • Some products with low shelf space but very high sales.

  • These are business-realistic scenarios:

    • Popular items sell well even with limited shelf space

    • Poor-performing products may still receive promotional space

📌 Implication:

Outliers are meaningful and should be retained.


  1. Business Interpretation
  • Shelf space allocation is likely:

    • Influenced by expected demand, not actual sales alone

    • Interacting with other variables such as:

      • Product price

      • Product type

      • Store size

  • Sales performance is multi-factor driven, not dependent on shelf space alone.


Summary

The scatter plot indicates no clear linear relationship between Product_Allocated_Area and total sales, suggesting that shelf space alone does not drive sales outcomes. However, the feature may still be valuable through non-linear interactions with other product and store attributes.

In [ ]:
# 3) Product_MRP vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_MRP")

Observations:

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_MRP

  1. Nature of Relationship
  • The scatter plot shows a strong positive linear relationship between Product_MRP and Product_Store_Sales_Total.

  • As product price increases, total sales revenue consistently increases.

  • This visually confirms the high positive correlation (~0.79) seen in the correlation matrix.


  1. Trend Pattern
  • Data points form a clear upward-sloping trend.

  • The relationship appears almost linear across the entire price range.

  • Minimal curvature or deviation from linearity is observed.

📌 Implication:

This feature is well-suited for linear regression as well as tree-based models.


  1. Variability & Spread
  • At lower MRP values (30–80):

    • Sales values are generally lower and less dispersed.
  • At mid to high MRP values (100–220):

    • Sales values increase substantially.

    • Variability increases slightly, indicating influence of additional factors such as store type and shelf space.


  1. Outliers
  • A few products exhibit:

    • Very high MRP (>250) with high sales

    • Moderate MRP with unusually low or high sales

  • These are business-valid scenarios, not anomalies.

📌 Implication:

Outliers should be retained as they provide valuable information.


  1. Business Interpretation
  • Higher-priced products:

    • Generate more revenue per unit sold

    • Are often associated with premium or bulk offerings

  • This explains why MRP is one of the strongest drivers of sales revenue.


Summary

Product_MRP exhibits a strong positive linear relationship with total sales, indicating that higher-priced products consistently generate higher revenue. This confirms Product_MRP as one of the most influential predictors in the sales forecasting model.

Let us see from which product type the company is generating most of the revenue

In [ ]:
def revenue_by_category(
    data,
    category,
    revenue_col="Product_Store_Sales_Total",
    top_n=None,
    figsize=None,
    title=None,
    rotate=45,
    color="#8b5cf6",
    show_values=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    # Aggregate revenue
    df_rev = (
        data.groupby(category, dropna=False)[revenue_col]
        .sum()
        .reset_index()
        .sort_values(revenue_col, ascending=False)
    )

    # Optionally limit to top N categories
    if top_n is not None:
        df_rev = df_rev.head(top_n)

    # Auto size
    if figsize is None:
        width = max(9, min(18, 1.1 * len(df_rev) + 3))
        figsize = (width, 6)

    fig, ax = plt.subplots(figsize=figsize)

    sns.barplot(
        data=df_rev,
        x=category,
        y=revenue_col,
        ax=ax,
        color=color,
        edgecolor="white",
        linewidth=1,
    )

    # Titles (same style)
    main_title = title or f"Revenue by {category}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    total_rev = df_rev[revenue_col].sum()
    top_cat = df_rev.iloc[0][category]
    top_rev = df_rev.iloc[0][revenue_col]
    fig.text(
        0.5,
        title_y - 0.045,
        f"total_revenue={total_rev:,.0f}   top={top_cat} ({top_rev:,.0f})",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel(category)
    ax.set_ylabel("Revenue")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    # Value labels on bars
    if show_values:
        ymax = df_rev[revenue_col].max()
        for p in ax.patches:
            h = p.get_height()
            ax.annotate(
                f"{h:,.0f}",
                (p.get_x() + p.get_width() / 2, h),
                ha="center",
                va="bottom",
                fontsize=10,
                xytext=(0, 4),
                textcoords="offset points",
            )
        ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    # If you need the aggregated dataframe later, return it:
    return df_rev
In [ ]:
# Revenue by Product Type
df_revenue1 = revenue_by_category(
    data,
    category="Product_Type",
    figsize=(14, 7),
    rotate=60,
    title="Which Product Type Generates the Most Revenue?"
)

Observations:

📊 Bivariate Analysis – Revenue by Product_Type

  1. Overall Revenue Contribution
  • The total revenue across all product types is approximately 30.35 million.

  • Revenue contribution is unevenly distributed across product categories, indicating that some categories drive a disproportionate share of revenue.


  1. Top Revenue-Generating Product Types

The highest revenue contributors are:

  • Fruits and Vegetables – ~4.30M (Highest)

  • Snack Foods – ~3.99M

  • Dairy – ~2.81M

  • Frozen Foods – ~2.81M

  • Household – ~2.56M

📌 Key Insight:

Fruits and Vegetables alone generate the highest revenue, making it the most critical product category for SuperKart.


  1. Mid-Tier Revenue Categories
  • Baking Goods – ~2.45M

  • Canned – ~2.30M

  • Health and Hygiene – ~2.16M

  • Meat – ~2.13M

These categories:

  • Contribute steady and meaningful revenue

  • Represent essential and recurring consumer purchases


  1. Low Revenue-Generating Categories

The lowest revenue contributors are:

  • Soft Drinks – ~1.80M

  • Breads – ~0.71M

  • Hard Drinks – ~0.63M

  • Others – ~0.54M

  • Starchy Foods – ~0.52M

  • Breakfast – ~0.36M

  • Seafood – ~0.27M (Lowest)

📌 Implication:

These categories either have:

  • Lower demand

  • Lower pricing

  • Limited shelf presence

  • Or fewer product SKUs


  1. Business Interpretation
  • High revenue from Fruits & Vegetables and Snack Foods suggests:

    • High demand

    • Frequent purchases

    • Fast inventory turnover

  • Low-performing categories may require:

    • Better promotion

    • Optimized pricing

    • Improved shelf placement

    • Or strategic reduction if margins are low


  1. Relationship with Earlier EDA
  • Revenue dominance aligns with:

    • High representation of these categories in the dataset

    • Likely higher product turnover

  • Confirms that Product_Type is a strong driver of sales revenue, not just sales count.


Summary

Fruits and Vegetables generate the highest revenue for SuperKart, followed by Snack Foods and Dairy. Revenue contribution varies significantly across product types, highlighting Product_Type as a key driver of sales performance and a critical feature for forecasting models.

In [ ]:
# Revenue by Product Sugar Content
df_revenue2 = revenue_by_category(
    data,
    category="Product_Sugar_Content",
    figsize=(9, 6),
    rotate=45,
    title="Revenue by Product Sugar Content"
)

Observations:

📊 Bivariate Analysis – Revenue by Product_Sugar_Content

  1. Overall Revenue Contribution
  • The total revenue across all sugar content categories is approximately 30.36 million.

  • Revenue distribution across sugar content levels is highly imbalanced, indicating strong consumer preference patterns.


  1. Top Revenue-Generating Sugar Category
Sugar Content Revenue Share (Approx.)
Low Sugar ~16.82M ~55%
Regular ~7.87M ~26%
No Sugar ~5.27M ~17%
reg ~0.39M ~1%

📌 Key Insight:

Low Sugar products dominate revenue generation, contributing more than half of total sales.


  1. Data Quality Observation
  • The presence of reg as a separate category:

    • Indicates a data inconsistency / labeling issue

    • Likely represents “Regular” sugar content

📌 Action Required:

This category should be merged with “Regular” during data cleaning.


  1. Business Interpretation
  • Strong revenue dominance of Low Sugar products suggests:

    • Growing consumer preference for health-conscious options

    • Successful product placement and assortment strategy

  • No Sugar products also contribute meaningfully, reinforcing the health trend.


  1. Strategic Implications
  • Increase focus on:

    • Low Sugar and No Sugar product variants

    • Promotions and shelf placement for these categories

  • Reevaluate Regular sugar products:

    • Improve marketing

    • Reposition pricing if needed


Summary

Low Sugar products generate the majority of revenue, highlighting a strong consumer shift toward healthier options. Data inconsistency in sugar labeling should be corrected to ensure accurate modeling and insights.

Let us see from which type of stores and locations the revenue generation is more.

In [ ]:
# Revenue by Store Id
df_store_revenue = revenue_by_category(
    data,
    category="Store_Id",
    title="Revenue by Store",
    rotate=60,
    top_n=15,          # optional: store IDs can be many; keeps it readable
    figsize=(14, 6)
)

Observations:

📊 Bivariate Analysis – Revenue by Store_Id

  1. Overall Revenue Distribution
  • Total revenue across all stores is approximately 30.36 million.

  • Revenue generation is highly uneven across stores, indicating strong store-level performance differences.


  1. Store-wise Revenue Contribution
Store_Id Revenue Approx. Share
OUT004 ~15.43M ~51%
OUT003 ~6.67M ~22%
OUT001 ~6.22M ~21%
OUT002 ~2.03M ~7%

📌 Key Insight:

OUT004 alone contributes more than half of the total revenue, making it the most dominant store.


  1. *Performance Gap Analysis
  • OUT004 significantly outperforms all other stores combined.

  • OUT003 and OUT001 show similar and moderate performance.

  • OUT002 generates the least revenue, lagging far behind.

📌 Implication:

Revenue generation is not evenly distributed geographically or operationally.


  1. Business Interpretation

Possible reasons for OUT004’s dominance:

  • Larger store size

  • Better product assortment

  • Higher footfall

  • Favorable city tier or location

  • Higher concentration of high-MRP and high-demand products

Conversely, OUT002 may suffer from:

  • Smaller store size

  • Lower customer traffic

  • Less optimal location

  • Limited inventory mix


  1. Relationship with Earlier EDA
  • OUT004 also had:

    • Highest number of product entries

    • Strong presence of high-performing product categories

  • Reinforces that store characteristics play a major role in revenue generation.


Summary

Revenue generation varies significantly across stores, with OUT004 contributing over half of total revenue. This highlights the critical role of store-specific factors such as size, location, and assortment in driving sales performance.

In [ ]:
# Revenue by Store Size
df_revenue3 = revenue_by_category(
    data,
    category="Store_Size",
    title="Revenue by Store Size",
    rotate=0,
    figsize=(8, 6)
)

Observations:

📊 Bivariate Analysis – Revenue by Store Size

  1. Overall Revenue Context
  • Total revenue across all stores is approximately 30.36 million.

  • Revenue contribution varies significantly by store size, indicating that store scale has a strong influence on sales performance.


  1. Store Size-wise Revenue Contribution
Store Size Revenue Approx. Share
Medium ~22.10M ~73%
High ~6.22M ~21%
Small ~2.03M ~6%

📌 Key Insight:

Medium-sized stores dominate revenue generation, contributing nearly three-fourths of total revenue.


  1. Interpretation of Store Size Impact
  • Medium stores likely strike the best balance between:

    • Product variety

    • Operational efficiency

    • Customer footfall

  • High-sized stores, despite larger physical space, contribute less than expected, possibly due to:

    • Higher operational costs

    • Diminishing returns on space utilization

  • Small stores generate minimal revenue, consistent with limited assortment and lower footfall.


  1. Relationship with Univariate EDA
  • Medium-sized stores also had the highest frequency count in the dataset.

  • Their dominance in both count and revenue reinforces their strategic importance in the business model.


  1. Business Implications
  • Expansion strategy should prioritize medium-sized stores.

  • Optimization opportunities:

    • Improve revenue per square foot in high-sized stores.

    • Re-evaluate inventory and location strategy for small stores.


Summary

Medium-sized stores are the primary revenue drivers, contributing nearly 73% of total revenue, making store size a critical determinant of sales performance.

In [ ]:
# Revenue by Store Location City Type
df_revenue4 = revenue_by_category(
    data,
    category="Store_Location_City_Type",
    title="Revenue by Store Location City Type",
    rotate=0,
    figsize=(9, 6)
)

Observations:

📊 Bivariate Analysis – Revenue by Store Location City Type

  1. Overall Revenue Context
  • Total revenue across all stores is approximately 30.36 million.

  • Revenue contribution varies significantly by city tier, highlighting the importance of store location in sales performance.


  1. City Tier-wise Revenue Contribution
City Tier Revenue Approx. Share
Tier 2 ~21.65M ~71%
Tier 1 ~6.67M ~22%
Tier 3 ~2.03M ~7%

📌 Key Insight:

Tier 2 cities dominate revenue generation, contributing over 70% of total revenue, outperforming even Tier 1 cities.


  1. Interpretation of Location Impact
  • Tier 2 cities likely benefit from:

    • High population density

    • Moderate competition

    • Strong demand for value-driven retail formats

  • Tier 1 cities, despite higher purchasing power, may face:

    • Market saturation

    • Higher operational costs

  • Tier 3 cities show limited revenue potential, possibly due to:

    • Lower footfall

    • Smaller store formats

    • Limited product assortment


  1. Consistency with Previous EDA
  • Tier 2 cities also had the highest store count in univariate analysis.

  • The alignment of high store presence and high revenue reinforces Tier 2 cities as the company’s core market.


  1. Business Implications
  • Expansion and investment strategies should focus on Tier 2 locations.

  • Tier 1 strategies should emphasize:

    • Premium products

    • Differentiated offerings

  • Tier 3 stores may require:

    • Cost optimization

    • Targeted product mixes


Summary

Tier 2 cities are the primary revenue drivers, contributing over 70% of total revenue, underscoring the strategic importance of mid-tier urban markets.

In [ ]:
# Revenue by Store Type
df_revenue5 = revenue_by_category(
    data,
    category="Store_Type",
    title="Revenue by Store Type",
    rotate=0,
    figsize=(9, 6)
)

Observations:

📊 Bivariate Analysis – Revenue by Store Type

  1. Overall Revenue Context
  • Total revenue across all store types is approximately 30.36 million.

  • Revenue distribution varies significantly across different store formats, indicating that store type plays a crucial role in sales performance.


  1. Store Type-wise Revenue Contribution
Store Type Revenue Approx. Share
Supermarket Type2 ~15.43M ~51%
Departmental Store ~6.67M ~22%
Supermarket Type1 ~6.22M ~21%
Food Mart ~2.03M ~7%

📌 Key Insight:

Supermarket Type2 dominates revenue generation, contributing over half of the total revenue on its own.


  1. Interpretation of Store Type Impact
  • Supermarket Type2 stores likely benefit from:

    • Wider product assortment

    • Higher customer footfall

    • Strong presence in high-performing locations (e.g., Tier 2 cities)

  • Departmental Stores and Supermarket Type1 show comparable performance, suggesting:

    • Moderate scale

    • Stable but less aggressive sales potential

  • Food Marts generate the least revenue, consistent with:

    • Smaller store size

    • Limited product variety

    • Convenience-focused shopping behavior


  1. Consistency with Earlier EDA Findings
  • Supermarket Type2 stores were:

    • Most frequent in the dataset

    • Dominant in Store_Id (OUT004) revenue

  • This reinforces the conclusion that store format + scale + location jointly drive revenue.


  1. Business Implications
  • Expansion strategy should prioritize Supermarket Type2 stores.

  • Opportunities exist to:

    • Upgrade Supermarket Type1 stores to Type2 formats

    • Optimize product mix in Departmental Stores

  • Food Marts may be best suited for niche or essential-only strategies.


Summary

Supermarket Type2 stores are the primary revenue drivers, contributing over 50% of total revenue, making store format a critical determinant of sales performance.

Let's check the distribution of our target variable i.e Product_Store_Sales_Total with the other categorical columns

In [ ]:
def nice_boxplot_by_category(
    data,
    x_cat,
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    show_stats_subtitle=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    # Boxplot (no hue needed when it equals x; avoids duplicate legends)
    sns.boxplot(
        data=data,
        x=x_cat,
        y=y,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    main_title = title or f"Boxplot - {x_cat} vs {y}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if show_stats_subtitle:
        n = int(data[[x_cat, y]].dropna().shape[0])
        groups = int(data[x_cat].nunique(dropna=True))
        fig.text(
            0.5, title_y - 0.045,
            f"points={n:,}   groups={groups:,}",
            ha="center", va="top", fontsize=11
        )

    ax.set_xlabel(x_cat)
    ax.set_ylabel(f"{y} (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
# Store Id vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Id",
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title="Boxplot - Store_Id vs Product_Store_Sales_Total",
    rotate=90
)

Observations:

📊 Bivariate Analysis – Store_Id vs Product_Store_Sales_Total (Boxplot)

  1. Overall Distribution Insight
  • The boxplot compares product-level sales distribution across four stores (OUT001–OUT004).

  • There is substantial variation in median sales, spread, and outliers across stores, indicating store-specific sales behavior.


  1. Median Sales Comparison
  • OUT003 shows the highest median product sales, indicating stronger per-product performance.

  • OUT001 follows with moderately high median sales.

  • OUT004 has a lower median compared to OUT003 and OUT001, despite being the highest in total revenue.

  • OUT002 has the lowest median sales, confirming its weaker performance.

📌 Key Insight:

Higher total revenue does not always imply higher per-product sales.


  1. Variability & Spread
  • OUT003 has the widest IQR (box width), suggesting:

    • Greater diversity in product performance

    • Presence of both very high and moderate selling products

  • OUT004 shows a moderate spread, indicating relatively consistent product sales.

  • OUT002 has a narrower distribution, reflecting limited sales potential and fewer high-performing products.


  1. Outlier Analysis
  • OUT003 exhibits several high-value outliers, including the maximum observed sales (~8000).

  • OUT004 also contains multiple high outliers but fewer extreme values.

  • OUT002 has some very low outliers, indicating poorly performing products.

📌 Implication:

Some stores rely on blockbuster products, while others show uniform but lower performance.


  1. Business Interpretation
  • OUT003:

    • Strong per-product revenue potential

    • Opportunity to scale top-performing SKUs

  • OUT004:

    • High total revenue driven by volume and breadth, not necessarily high per-product sales
  • OUT002:

    • Requires product assortment and pricing strategy review

  1. Alignment with Previous Revenue Analysis
  • Although OUT004 generates the highest total revenue, its median product sales are lower than OUT003, suggesting:

    • Revenue dominance comes from more products sold, not higher sales per product
  • Confirms why Store_Id is a critical feature for modeling.


Summary

Product-level sales distributions vary significantly across stores, with OUT003 showing the highest median and variability, while OUT004’s high total revenue is driven by volume rather than per-product dominance.

In [ ]:
# Store Size vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Size",
    y="Product_Store_Sales_Total",
    figsize=(12, 7),
    title="Boxplot - Store_Size vs Product_Store_Sales_Total",
    rotate=0
)

Observations:

📊 Bivariate Analysis – Store_Size vs Product_Store_Sales_Total (Boxplot)

  1. Overall Pattern
  • Product-level sales vary significantly across store sizes.

  • Store size clearly influences both median sales and variability, confirming it as a strong driver of revenue.


  1. Median Sales Comparison
  • High-sized stores have the highest median product sales, indicating stronger per-product revenue.

  • Medium-sized stores show a moderate median, lower than High but significantly higher than Small.

  • Small-sized stores have the lowest median sales, reflecting limited sales capacity.

📌 Clear hierarchy:

High > Medium > Small in terms of per-product sales.


  1. Variability & Distribution
  • Medium-sized stores exhibit the widest spread and many high outliers, suggesting:

    • A mix of average and blockbuster products

    • Greater heterogeneity in product performance

  • High-sized stores have a more compact distribution, indicating:

    • Consistently strong product sales

    • Better standardization and optimized assortments

  • Small stores show:

    • Narrower spread

    • Limited upside and fewer high-performing products


  1. Outlier Behavior
  • Medium stores include extreme high outliers (up to ~8000), showing potential for exceptional products.

  • High stores have fewer extreme outliers but consistently high sales.

  • Small stores contain low-end outliers, highlighting weaker or non-performing SKUs.


  1. Business Interpretation
  • High stores:

    • Best suited for premium and high-MRP products

    • Stable and predictable revenue per product

  • Medium stores:

    • Strong growth opportunities

    • Ideal for experimentation and new product launches

  • Small stores:

    • Limited revenue potential per product

    • Require focused assortment and cost optimization


  1. Consistency with Revenue Aggregates
  • This boxplot aligns perfectly with earlier findings where:

    • Medium-sized stores generated the highest total revenue

    • Despite High stores having higher per-product medians, Medium stores win on volume + diversity

📌 Key takeaway:

Total revenue dominance ≠ highest per-product sales.


Summary

Product-level sales increase with store size, with High stores showing the strongest per-product performance, while Medium stores balance consistency and extreme high-selling products.

Let's now try to find out some relationship between the other columns

In [ ]:
def nice_boxplot_relation(
    data,
    x_cat,
    y_num,
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=data,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(title or f"Boxplot - {x_cat} vs {y_num}", fontsize=15, fontweight="bold", y=title_y)

    n = int(data[[x_cat, y_num]].dropna().shape[0])
    groups = int(data[x_cat].nunique(dropna=True))
    fig.text(0.5, title_y - 0.045, f"points={n:,}   groups={groups:,}", ha="center", va="top", fontsize=11)

    ax.set_xlabel("Types of Products" if x_cat == "Product_Type" else x_cat)
    ax.set_ylabel(y_num)

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
# Product Type Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Type", y="Product_Weight", hue="Product_Type")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_Weight")
plt.xlabel("Types of Products")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:

📦 Boxplot Interpretation: Product_Type vs Product_Weight

🔎 What this plot shows

  • X-axis: Product categories

  • Y-axis: Product weight

  • Each box summarizes the distribution of product weights within a product type:

    • Median (line)

    • Interquartile range (IQR)

    • Whiskers (typical min/max)

    • Dots (outliers)


🧠 Key Observations

  1. Weight distributions are remarkably consistent across product types
  • Most product categories have:

    • Median weight ≈ 12–13 units

    • IQR roughly between 11 and 14

  • This indicates standardized packaging sizes across the business.

✅ No product type is fundamentally heavier or lighter than others.


  1. Minor category-level variations (but not strong)
  • Slightly higher medians seen in:

    • Starchy Foods

    • Seafood

    • Others

  • Slightly lower medians in:

    • Frozen Foods

    • Canned

  • These differences are small and overlapping, not statistically dominant.

📌 Conclusion:

Product_Type does not strongly determine Product_Weight


  1. Presence of outliers across all categories
  • Outliers exist on both lower and higher ends:

    • Low-end: ~4–6 units

    • High-end: ~18–22 units

  • This suggests:

    • Multiple pack sizes

    • Special SKUs (bulk / premium packs)

⚠️ These outliers are expected and realistic, not data errors.


  1. Variability is similar across categories
  • No product type shows:

    • Extreme dispersion

    • Unusually tight or wide spread

  • Reinforces the idea of uniform packaging standards


🔗 Relation to Your Earlier Findings

This plot aligns perfectly with your correlation results:

  • Product_Weight has a strong positive correlation with sales

  • But Product_Type does NOT explain weight differences

➡️ Therefore:

  • Weight affects sales independently

  • Product_Type influences sales through demand, pricing, and volume, not weight


📝 One-line EDA Summary

Product weight distributions are largely consistent across product categories, indicating standardized packaging sizes. While product weight strongly influences sales, it is not driven by product type, suggesting an independent effect on revenue.

Let's find out whether there is some relationship between the weight of the product and its sugar content

In [ ]:
# Product Sugar Content Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Sugar_Content", y="Product_Weight", hue="Product_Sugar_Content")
plt.xticks(rotation=0)
plt.title("Boxplot - Product_Sugar_Content Vs Product_Weight")
plt.xlabel("Product_Sugar_Content")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:

🍬📦 Relationship: Product Sugar Content vs Product Weight

🔍 What this plot analyzes

  • X-axis: Product Sugar Content (Low Sugar, Regular, No Sugar, reg)

  • Y-axis: Product Weight

  • Goal: Check whether sugar content influences product weight


🧠 Key Observations

  1. Product weight is largely independent of sugar content
  • Median weights across all sugar categories are very similar:

    • Roughly 12–13 units for all groups
  • Interquartile ranges (IQRs) overlap heavily

✅ This indicates no strong relationship between sugar content and product weight.


  1. Variability is consistent across categories
  • All sugar categories show:

    • Comparable spread

    • Similar whisker lengths

    • Outliers on both ends

  • No category shows unusually heavy or light products overall

📌 Sugar formulation does not drive packaging size.


  1. Outliers exist in every group (expected behavior)
  • Low-end outliers (~4–6 units)

  • High-end outliers (~18–22 units)

These likely represent:

  • Mini packs

  • Family or bulk packs

  • Special SKUs

⚠️ These are natural business variations, not data issues.


4 The "reg" category is likely a data-quality issue

  • "reg" appears redundant with "Regular"

  • Its distribution mirrors Regular almost exactly


🔗 Alignment with Your Previous Findings

This result is consistent with earlier insights:

  • Product_Weight strongly correlates with sales

  • Sugar content strongly affects revenue composition

  • But sugar content does NOT affect weight

➡️ Therefore:

  • Sugar content impacts sales via consumer preference

  • Weight impacts sales via volume/quantity

  • These effects are independent


📝 One-line EDA Summary

Product weight distributions are consistent across sugar content categories, indicating that sugar formulation does not influence packaging size. Product weight and sugar content independently contribute to sales behavior.

Let's analyze the sugar content of different product types

In [ ]:
def nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,          # None, "index" (row %), "columns" (col %), "all" (overall %)
    figsize=(14, 8),
    cmap="viridis",
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")

    ct = pd.crosstab(data[rows], data[cols], dropna=False)

    # Normalize if requested
    if normalize is not None:
        ct_plot = ct.div(ct.sum(axis=1), axis=0) if normalize == "index" else \
                  ct.div(ct.sum(axis=0), axis=1) if normalize == "columns" else \
                  ct / ct.values.sum() if normalize == "all" else ct
        annot_fmt = ".1%"  # show as percent
        annot_data = ct_plot
        vmin, vmax = 0, 1
    else:
        annot_fmt = "g"    # integer
        annot_data = ct
        vmin, vmax = None, None

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        ct_plot if normalize is not None else ct,
        annot=annot_data,
        fmt=annot_fmt,
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        vmin=vmin,
        vmax=vmax,
        ax=ax,
    )

    main_title = title or (
        f"{rows} vs {cols}" + (" (Row %)" if normalize == "index" else
                              " (Column %)" if normalize == "columns" else
                              " (Overall %)" if normalize == "all" else
                              " (Counts)")
    )
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    fig.text(
        0.5, title_y - 0.045,
        f"rows={len(data):,}   unique_{rows}={data[rows].nunique(dropna=True):,}   unique_{cols}={data[cols].nunique(dropna=True):,}",
        ha="center", va="top", fontsize=11
    )

    ax.set_ylabel(rows)
    ax.set_xlabel(cols)

    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return ct
In [ ]:
# Heatmap Product Sugar of Different Product Types
nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,
    figsize=(14, 8),
    cmap="viridis",
    title="Sugar Content Across Product Types (Counts)"
)
Out[ ]:
Product_Type Baking Goods Breads Breakfast Canned Dairy Frozen Foods Fruits and Vegetables Hard Drinks Health and Hygiene Household Meat Others Seafood Snack Foods Soft Drinks Starchy Foods
Product_Sugar_Content
Low Sugar 462 148 65 402 590 531 864 128 0 0 377 0 47 804 370 97
No Sugar 0 0 0 0 0 0 0 0 628 740 0 151 0 0 0 0
Regular 240 49 38 264 199 264 372 52 0 0 232 0 26 334 141 40
reg 14 3 3 11 7 16 13 6 0 0 9 0 3 11 8 4

Observations:

  1. Target Variable: Product_Store_Sales_Total
  • Distribution is approximately normal with mild right skew.

  • Mean ≈ Median → good for regression modeling.

  • Presence of high-end outliers (up to ~8000), but not extreme enough to discard blindly.

  • ✔️ No transformation is strictly required, though log-transform could be tested.


  1. Numeric Feature Relationships

Correlation Heatmap (Pearson)

Strong positive relationships with sales:

  • Product_MRP → Sales (≈ 0.79) 🔥

  • Product_Weight → Sales (≈ 0.74) 🔥

Moderate relationship:

  • Product_Weight ↔ Product_MRP (≈ 0.53)

Weak / No relationship:

  • Product_Allocated_Area → Sales (≈ 0.00)

  • Store_Establishment_Year → Sales (≈ -0.19)

📌 Conclusion:

Product_MRP and Product_Weight are the most powerful numeric predictors.


  1. Scatter Plot Insights

Product Weight vs Sales

  • Clear linear upward trend

  • Heavier products → higher total sales

Product MRP vs Sales

  • Strong linear pattern

  • Higher MRP products consistently generate more revenue

Product Allocated Area vs Sales

  • No clear pattern

  • Scatter is diffuse → confirms low correlation

📌 Modeling takeaway:

Consider dropping or down-weighting Product_Allocated_Area unless interactions are used.


  1. Categorical Variable Distributions

Product Type (Count)

Top categories by presence:

  1. Fruits & Vegetables

  2. Snack Foods

  3. Frozen Foods

  4. Dairy

Balanced enough → good categorical signal.

Product Sugar Content

  • Low Sugar dominates (55.7%)

  • Regular ≈ 25.7%

  • No Sugar ≈ 17.3%

  • reg ≈ 1.2% → ⚠️ likely a data quality issue (typo of “Regular”)

📌 Action:

Merge reg → Regular


  1. Revenue Analysis (Most Important Business Insight)

Revenue by Product Type

Top revenue generators:

  1. Fruits & Vegetables 🥇

  2. Snack Foods

  3. Dairy

  4. Frozen Foods

Lowest:

  • Seafood

  • Breakfast

  • Starchy Foods

📌 Insight:

High-volume essentials outperform niche categories.


Revenue by Sugar Content

  • Low Sugar → ~55% of total revenue 🔥

  • Regular → ~26%

  • No Sugar → ~17%

  • reg negligible

📌 Business Insight:

Health-conscious products are not only popular but profitable.


Revenue by Store

  • OUT004 alone contributes ~50% of total revenue

  • OUT002 is significantly underperforming

📌 Store-level imbalance detected


Revenue by Store Size

  • Medium stores dominate revenue (~73%)

  • High > Small, but Medium is the sweet spot


Revenue by City Tier

  • Tier 2 cities generate the most revenue

  • Tier 1 < Tier 2

  • Tier 3 lowest

📌 Key insight:

Tier 2 cities + Medium stores = highest ROI combination


Revenue by Store Type

  • Supermarket Type2 dominates

  • Followed by Departmental Store

  • Food Mart is lowest


  1. Boxplot Insights (Sales Distributions)

Store ID vs Sales

  • OUT003 and OUT004 have higher medians

  • OUT002 has:

    *Lowest median

    • More low-end outliers

Store Size vs Sales

  • High size → highest median sales

  • Medium has higher total revenue due to volume

  • Small stores consistently underperform


  1. Product Characteristics Analysis

Product Type vs Weight

  • Very similar median weights across categories

  • Slightly heavier:

    • Snack Foods

    • Starchy Foods

    • Household

📌 Weight is not category-driven, but still predictive of sales.


Sugar Content vs Weight

  • No strong weight difference across sugar categories

  • Weight is independent of sugar classification


  1. Sugar Content × Product Type (Crosstab Heatmap)

Key patterns:

  • Low Sugar dominates Fruits & Vegetables, Snack Foods

  • No Sugar almost exclusive to Health & Hygiene & Household

  • Very clean segmentation → good categorical signal

📌 Excellent feature interaction potential:

  • Product_Type × Sugar_Content

  1. Data Quality Notes (Important)

⚠️ Fix these before modeling:

  • Merge reg → Regular

  • Consider encoding Store_Id carefully (target encoding recommended)

  • Product_Allocated_Area has low predictive power

Let's find out how many items of each product type has been sold in each of the stores

In [ ]:
def nice_store_producttype_heatmap(
    data,
    store_col="Store_Id",
    product_col="Product_Type",
    figsize=(14, 8),
    cmap="viridis",
    annot="auto",          # "auto", True, False
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")

    # ✅ Completed crosstab: Store_Id vs Product_Type
    ct = pd.crosstab(data[store_col], data[product_col], dropna=False)

    # Auto annotation decision (avoid clutter for large matrices)
    if annot == "auto":
        annot = (ct.shape[0] <= 15) and (ct.shape[1] <= 12)

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        ct,
        annot=annot,
        fmt="g",
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        ax=ax,
    )

    fig.suptitle(
        title or f"Items Sold: {product_col} by {store_col}",
        fontsize=15,
        fontweight="bold",
        y=title_y,
    )
    fig.text(
        0.5,
        title_y - 0.045,
        f"rows={len(data):,}   stores={ct.shape[0]:,}   product_types={ct.shape[1]:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_ylabel("Stores")
    ax.set_xlabel("Product_Type")

    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return ct
In [ ]:
nice_store_producttype_heatmap(data, annot=True)
Out[ ]:
Product_Type Baking Goods Breads Breakfast Canned Dairy Frozen Foods Fruits and Vegetables Hard Drinks Health and Hygiene Household Meat Others Seafood Snack Foods Soft Drinks Starchy Foods
Store_Id
OUT001 136 30 10 119 150 142 199 38 114 134 130 31 13 202 106 32
OUT002 96 23 15 88 104 101 168 30 91 100 87 19 10 146 62 12
OUT003 99 34 19 90 145 122 182 23 89 107 106 32 13 186 74 28
OUT004 385 113 62 380 397 446 700 95 334 399 295 69 40 615 277 69

Observations:

🛒 Items Sold: Product Type × Store Id — Key Insights

  1. Dominance of OUT004 Across Product Types
  • OUT004 clearly outperforms all other stores across every product category.

  • Especially strong in:

    • Fruits & Vegetables (700)

    • Snack Foods (615)

    • Frozen Foods (446)

    • Dairy (397)

    • Household (399)

📌 Interpretation:

OUT004 is the primary revenue and volume driver, likely due to:

  • Larger store size

  • Better location (Tier 2 + high footfall)

  • Broader assortment and inventory depth


  1. Consistent Category Leaders Across Stores

Across all stores, the most sold product types are:

Product Type Observation
Fruits & Vegetables Top-selling category in every store
Snack Foods Second-highest volume consistently
Household Strong and stable across stores
Frozen Foods & Dairy Medium-to-high consistent demand

📌 Insight:

These are essential, high-frequency purchase categories, driving store traffic and repeat purchases.


  1. Low-Volume Categories Are Universally Low

Categories with consistently low sales across all stores:

  • Seafood

  • Breakfast

  • Breads

  • Others

📌 Interpretation:

Low demand appears structural, not store-specific — suggesting:

  • Limited consumer preference

  • Possibly niche or premium products

  • Potential candidates for assortment rationalization


  1. Store-wise Performance Pattern
Store Pattern
OUT004 High-volume, diversified sales across all categories
OUT001 & OUT003 Mid-performing, similar patterns
OUT002 Lowest sales across most categories

📌 Interpretation:

OUT002 may suffer from:

  • Smaller store size

  • Less optimal location

  • Lower customer footfall


  1. Strong Alignment With Revenue Analysis

This heatmap perfectly explains your earlier findings:

  • OUT004 → highest revenue

  • Fruits & Vegetables + Snack Foods → top revenue contributors

  • Volume-driven categories = revenue drivers

This confirms that revenue is volume-led, not just price-led.


🎯 Business Implications

  • Inventory prioritization:

Allocate more shelf space and inventory to:

  • Fruits & Vegetables

  • Snack Foods

  • Household essentials

  • Store strategy:

    • Replicate OUT004’s layout, assortment, and promotions in other stores

    • Investigate why OUT002 underperforms

  • Category optimization:

    • Review low-performing categories (Seafood, Breakfast) for SKU reduction

Different product types have different prices. Let's analyze the trend.

In [ ]:
def nice_boxplot_price_trend(
    data,
    x_cat,
    y_num,
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=data,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(
        title or f"Boxplot - {x_cat} vs {y_num}",
        fontsize=15,
        fontweight="bold",
        y=title_y,
    )

    n = int(data[[x_cat, y_num]].dropna().shape[0])
    groups = int(data[x_cat].nunique(dropna=True))
    fig.text(
        0.5,
        title_y - 0.045,
        f"points={n:,}   product_types={groups:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel("Product_Type" if x_cat == "Product_Type" else x_cat)
    ax.set_ylabel(f"{y_num} (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
# Boxplot Product Type Vs Product MRP
plt.figure(figsize=[14, 8])
sns.boxplot(
    data=data,
    x="Product_Type",
    y="Product_MRP",
    hue="Product_Type"
)
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:

  • Similar central pricing across categories:

Most product types have comparable median MRPs, clustered roughly in the same range, indicating no extreme base-price differences across categories.

  • Wide price dispersion within each product type:

Every category shows a broad interquartile range (IQR), meaning products within the same type span multiple price points (budget to premium).

  • Presence of high-price outliers across many categories:

Almost all product types contain upper-end outliers (₹230–₹270 range), suggesting premium SKUs exist in nearly every category.

  • Some categories show slightly higher upper spread:

Categories such as Starchy Foods, Others, Fruits & Vegetables, and Meat exhibit wider upper tails, indicating more high-MRP products compared to others.

  • Lower-end pricing consistency:

Minimum MRPs across most product types are fairly similar, showing price floors do not differ much by category.

  • Product type is not a strong standalone price separator:

Since medians and IQRs overlap heavily, Product_Type alone does not strongly explain MRP variation—price variation is largely within categories rather than between them.

Let's find out how the Product_MRP varies with the different stores

In [ ]:
def nice_boxplot_store_mrp(
    data,
    x_cat="Store_Id",
    y_num="Product_MRP",
    figsize=(14, 8),
    title="Boxplot - Store_Id vs Product_MRP",
    rotate=90,
    color="#8b5cf6",
    title_y=0.98,
    top_n=None,  # optional: show only top N stores by count to reduce clutter
):
    sns.set_theme(style="whitegrid", context="notebook")

    df = data[[x_cat, y_num]].dropna()

    # Optional: reduce clutter by keeping only top N stores by number of items
    if top_n is not None:
        top_stores = df[x_cat].value_counts().head(top_n).index
        df = df[df[x_cat].isin(top_stores)]

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=df,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)

    n = int(df.shape[0])
    stores = int(df[x_cat].nunique())
    fig.text(
        0.5, title_y - 0.045,
        f"points={n:,}   stores={stores:,}",
        ha="center", va="top", fontsize=11
    )

    ax.set_xlabel("Stores")
    ax.set_ylabel("Product_MRP (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
# Product MRP with Different Stores
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Store_Id", y="Product_MRP", hue="Store_Id")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_MRP")
plt.xlabel("Stores")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False)  # hide the huge redundant legend
plt.show()

Observations:

  • OUT003 has the highest median Product_MRP, indicating that this store generally sells higher-priced products compared to others.

  • OUT001 also shows a relatively high median MRP, but slightly lower than OUT003.

  • OUT004 has a moderate median MRP, positioned below OUT001 and OUT003 but above OUT002.

  • OUT002 clearly has the lowest median Product_MRP, suggesting it focuses more on lower-priced products.

  • Price variability (IQR) is widest for OUT003, meaning it carries a broader range of product prices.

  • OUT002 shows a narrower IQR, indicating more consistent (and generally lower) pricing.

  • OUT001 and OUT004 exhibit moderate variability in MRPs.

  • High-price outliers are most prominent in OUT003, reinforcing the presence of premium-priced products.

  • OUT002 also shows outliers, but these are mostly upper outliers, standing out against its generally low-price distribution.

  • All stores contain some low-price outliers, but they are more noticeable in OUT002 and OUT004.

Overall insight:

Product pricing strategy differs significantly by store. OUT003 and OUT001 cater more toward higher-priced items, while OUT002 appears to be a value-oriented store with lower and more tightly clustered MRPs.

Let's delve deeper and do a detailed analysis of each of the stores.

OUT001

In [ ]:
data.loc[data["Store_Id"] == "OUT001"].describe(include="all").T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Id 1586 1586 NC7187 1 NaN NaN NaN NaN NaN NaN NaN
Product_Weight 1586.0 NaN NaN NaN 13.458865 2.064975 6.16 12.0525 13.96 14.95 17.97
Product_Sugar_Content 1586 4 Low Sugar 845 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 1586.0 NaN NaN NaN 0.068768 0.047131 0.004 0.033 0.0565 0.094 0.295
Product_Type 1586 16 Snack Foods 202 NaN NaN NaN NaN NaN NaN NaN
Product_MRP 1586.0 NaN NaN NaN 160.514054 30.359059 71.35 141.72 168.32 182.9375 226.59
Store_Id 1586 1 OUT001 1586 NaN NaN NaN NaN NaN NaN NaN
Store_Establishment_Year 1586.0 NaN NaN NaN 1987.0 0.0 1987.0 1987.0 1987.0 1987.0 1987.0
Store_Size 1586 1 High 1586 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1586 1 Tier 2 1586 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1586 1 Supermarket Type1 1586 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 1586.0 NaN NaN NaN 3923.778802 904.62901 2300.56 3285.51 4139.645 4639.4 4997.63
In [ ]:
data.loc[data["Store_Id"] == "OUT001", "Product_Store_Sales_Total"].sum()
Out[ ]:
np.float64(6223113.18)

OUT001 has generated total revenue of 6223113 from the sales of goods.

In [ ]:
def store_revenue_breakdown_by_product(
    data,
    store_id="",
    product_col="Product_Type",
    revenue_col="Product_Store_Sales_Total",
    figsize=(14, 7),
    rotate=60,
    color="#8b5cf6",
    top_n=None,          # optional: show only top N product types
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    df_store = data.loc[data["Store_Id"] == store_id, [product_col, revenue_col]].dropna()

    df_rev = (
        df_store.groupby(product_col, as_index=False)[revenue_col]
        .sum()
        .sort_values(revenue_col, ascending=False)
    )

    if top_n is not None:
        df_rev = df_rev.head(top_n)

    total_rev = df_rev[revenue_col].sum()
    n_rows = len(df_store)
    n_types = df_rev[product_col].nunique()

    fig, ax = plt.subplots(figsize=figsize)

    sns.barplot(
        data=df_rev,
        x=product_col,
        y=revenue_col,
        ax=ax,
        color=color,
        edgecolor="white",
        linewidth=1,
    )

    # Title + subtitle (same style)
    fig.suptitle(f"{store_id} Revenue by {product_col}", fontsize=15, fontweight="bold", y=title_y)
    fig.text(
        0.5,
        title_y - 0.045,
        f"total_revenue={total_rev:,.0f}   items_rows={n_rows:,}   product_types={n_types:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel(product_col)
    ax.set_ylabel(revenue_col)

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    # Value labels
    ymax = df_rev[revenue_col].max() if not df_rev.empty else 0
    for p in ax.patches:
        h = p.get_height()
        ax.annotate(
            f"{h:,.0f}",
            (p.get_x() + p.get_width() / 2, h),
            ha="center",
            va="bottom",
            fontsize=10,
            xytext=(0, 4),
            textcoords="offset points",
        )
    ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return df_rev
In [ ]:
df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT001")

OUT002

In [ ]:
data.loc[data["Store_Id"] == "OUT002"].describe(include="all").T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Id 1152 1152 NC2769 1 NaN NaN NaN NaN NaN NaN NaN
Product_Weight 1152.0 NaN NaN NaN 9.911241 1.799846 4.0 8.7675 9.795 10.89 19.82
Product_Sugar_Content 1152 4 Low Sugar 658 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 1152.0 NaN NaN NaN 0.067747 0.047567 0.006 0.031 0.0545 0.09525 0.292
Product_Type 1152 16 Fruits and Vegetables 168 NaN NaN NaN NaN NaN NaN NaN
Product_MRP 1152.0 NaN NaN NaN 107.080634 24.912333 31.0 92.8275 104.675 117.8175 224.93
Store_Id 1152 1 OUT002 1152 NaN NaN NaN NaN NaN NaN NaN
Store_Establishment_Year 1152.0 NaN NaN NaN 1998.0 0.0 1998.0 1998.0 1998.0 1998.0 1998.0
Store_Size 1152 1 Small 1152 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1152 1 Tier 3 1152 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1152 1 Food Mart 1152 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 1152.0 NaN NaN NaN 1762.942465 462.862431 33.0 1495.4725 1889.495 2133.6225 2299.63
In [ ]:
data.loc[data["Store_Id"] == "OUT002", "Product_Store_Sales_Total"].sum()
Out[ ]:
np.float64(2030909.72)

OUT002 has generated total revenue of 2030910 from the sales of goods.

In [ ]:
df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT002")

OUT003

In [ ]:
data.loc[data["Store_Id"] == "OUT003"].describe(include="all").T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Id 1349 1349 NC522 1 NaN NaN NaN NaN NaN NaN NaN
Product_Weight 1349.0 NaN NaN NaN 15.103692 1.893531 7.35 14.02 15.18 16.35 22.0
Product_Sugar_Content 1349 4 Low Sugar 750 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 1349.0 NaN NaN NaN 0.068637 0.048708 0.004 0.031 0.057 0.094 0.298
Product_Type 1349 16 Snack Foods 186 NaN NaN NaN NaN NaN NaN NaN
Product_MRP 1349.0 NaN NaN NaN 181.358725 24.796429 85.88 166.92 179.67 198.07 266.0
Store_Id 1349 1 OUT003 1349 NaN NaN NaN NaN NaN NaN NaN
Store_Establishment_Year 1349.0 NaN NaN NaN 1999.0 0.0 1999.0 1999.0 1999.0 1999.0 1999.0
Store_Size 1349 1 Medium 1349 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1349 1 Tier 1 1349 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1349 1 Departmental Store 1349 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 1349.0 NaN NaN NaN 4946.966323 677.539953 3069.24 4355.39 4958.29 5366.59 8000.0
In [ ]:
data.loc[data["Store_Id"] == "OUT003", "Product_Store_Sales_Total"].sum()
Out[ ]:
np.float64(6673457.57)
In [ ]:
df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT003")

OUT004

In [ ]:
data.loc[data["Store_Id"] == "OUT004"].describe(include="all").T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Id 4676 4676 NC584 1 NaN NaN NaN NaN NaN NaN NaN
Product_Weight 4676.0 NaN NaN NaN 12.349613 1.428199 7.34 11.37 12.37 13.3025 17.79
Product_Sugar_Content 4676 4 Low Sugar 2632 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 4676.0 NaN NaN NaN 0.069092 0.048584 0.004 0.031 0.056 0.097 0.297
Product_Type 4676 16 Fruits and Vegetables 700 NaN NaN NaN NaN NaN NaN NaN
Product_MRP 4676.0 NaN NaN NaN 142.399709 17.513973 83.04 130.54 142.82 154.1925 197.66
Store_Id 4676 1 OUT004 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Establishment_Year 4676.0 NaN NaN NaN 2009.0 0.0 2009.0 2009.0 2009.0 2009.0 2009.0
Store_Size 4676 1 Medium 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 4676 1 Tier 2 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Type 4676 1 Supermarket Type2 4676 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 4676.0 NaN NaN NaN 3299.312111 468.271692 1561.06 2942.085 3304.18 3646.9075 5462.86
In [ ]:
data.loc[data["Store_Id"] == "OUT004", "Product_Store_Sales_Total"].sum()
Out[ ]:
np.float64(15427583.43)
In [ ]:
df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT004")

Observations:

🏬 Store-wise Detailed Observations


🔵 OUT001 — High-priced, Stable Performer

Store Profile

  • Store Type: Supermarket Type1

  • Store Size: High

  • City Tier: Tier 2

  • Establishment Year: 1987 (oldest store)

Product MRP Behavior

  • Mean MRP: ~160.5

  • Median MRP: ~168.3

  • Pricing is moderately high and consistent.

  • Boxplot shows:

    • Tight IQR → controlled pricing strategy

    • Few extreme outliers → limited ultra-premium SKUs

  • Indicates price stability over aggressive discounting.

Sales Performance

  • Total Revenue: ~6.22M

  • Avg Sales per product: ~3924

  • Sales spread: Moderate (std ≈ 904)

Product Mix & Revenue Drivers

  • Top revenue categories:

    • Snack Foods

    • Fruits & Vegetables

    • Dairy

  • Balanced contribution across categories → diversified demand

  • No over-dependence on a single category.

Interpretation

  • Mature store with:

    • Reliable pricing

    • Balanced category mix

    • Steady sales

  • Performs well without extreme pricing or promotional volatility.


🔴 OUT002 — Low-price, Low-volume Store

Store Profile

  • Store Type: Food Mart

  • Store Size: Small

  • City Tier: Tier 3

  • Establishment Year: 1998

Product MRP Behavior

  • Mean MRP: ~107.1

  • Median MRP: ~104.7 (lowest among all stores)

  • Boxplot characteristics:

    • Lowest MRP range

    • Many low-end outliers → economy pricing

  • Minimal premium pricing presence.

Sales Performance

  • Total Revenue: ~2.03M (lowest)

  • Avg Sales per product: ~1763

  • Low variance → consistently low ticket sizes.

Product Mix & Revenue Drivers

  • Top categories:

    • Fruits & Vegetables

    • Snack Foods

  • Weak performance in:

    • Meat

    • Household

    • Premium categories

Interpretation

  • Store is:

    • Highly price-sensitive

    • Volume-constrained

  • Likely serving budget-conscious customers

  • Limited upselling potential due to low MRP ceiling.


🟢 OUT003 — Premium Pricing, High Value per Product

Store Profile

  • Store Type: Departmental Store

  • Store Size: Medium

  • City Tier: Tier 1

  • Establishment Year: 1999

Product MRP Behavior

  • Mean MRP: ~181.4 (highest)

  • Median MRP: ~179.7

  • Boxplot shows:

    • Wide IQR

    • Many high-end outliers (up to ~266)

  • Strong presence of premium SKUs.

Sales Performance

  • Total Revenue: ~6.67M

  • Avg Sales per product: ~4947 (highest)

  • Highest maximum sales (~8000)

Product Mix & Revenue Drivers

  • Strong categories:

    • Snack Foods

    • Fruits & Vegetables

    • Dairy

  • Premium categories perform consistently well.

Interpretation

  • Best store for:

    • High-margin products

    • Premium assortment

  • Customers show lower price sensitivity

  • Ideal candidate for premium expansion & exclusive SKUs.


🟣 OUT004 — High-volume, Revenue Powerhouse

Store Profile

  • Store Type: Supermarket Type2

  • Store Size: Medium

  • City Tier: Tier 2

  • Establishment Year: 2009 (newest)

Product MRP Behavior

  • Mean MRP: ~142.4

  • Median MRP: ~142.8

  • Boxplot indicates:

    • Moderate pricing

    • Controlled spread

    • Few extreme outliers

Sales Performance

  • Total Revenue: ~15.43M (highest by far)

  • Avg Sales per product: ~3299

  • Sales are driven by volume, not high price.

Product Mix & Revenue Drivers

  • Dominant categories:

    • Fruits & Vegetables

    • Snack Foods

    • Frozen Foods

  • Strong across all categories, not niche-dependent.

Interpretation

  • Slightly lower MRP than OUT003, but massive volume compensates.

  • Best example of volume-led revenue strategy.


🔎 Cross-Store Comparative Insights

Dimension OUT001 OUT002 OUT003 OUT004
Avg MRP Medium-High Low Highest Medium
Revenue High Lowest High Highest
Pricing Strategy Stable Budget Premium Balanced
Volume Medium Low Medium Very High
Best Use Case Stability Price-led Margin-led Scale-led

📌 Final Strategic Takeaways

  • OUT003 → maximize premium & margins

  • OUT004 → expand assortment & inventory (volume monster)

  • OUT001 → maintain consistency, low risk

  • OUT002 → needs either volume growth or pricing rethink

Let's find out the revenue generated by the stores from each of the product types.

In [ ]:
df1 = data.groupby(["Product_Type", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df1
Out[ ]:
Product_Type Store_Id Product_Store_Sales_Total
0 Baking Goods OUT001 525131.04
1 Baking Goods OUT002 169860.50
2 Baking Goods OUT003 491908.20
3 Baking Goods OUT004 1266086.26
4 Breads OUT001 121274.09
5 Breads OUT002 43419.47
6 Breads OUT003 175391.93
7 Breads OUT004 374856.75
8 Breakfast OUT001 38161.10
9 Breakfast OUT002 23396.10
10 Breakfast OUT003 95634.08
11 Breakfast OUT004 204939.13
12 Canned OUT001 449016.38
13 Canned OUT002 151467.66
14 Canned OUT003 452445.17
15 Canned OUT004 1247153.50
16 Dairy OUT001 598767.62
17 Dairy OUT002 178888.18
18 Dairy OUT003 715814.94
19 Dairy OUT004 1318447.30
20 Frozen Foods OUT001 558556.81
21 Frozen Foods OUT002 180295.95
22 Frozen Foods OUT003 597608.42
23 Frozen Foods OUT004 1473519.65
24 Fruits and Vegetables OUT001 792992.59
25 Fruits and Vegetables OUT002 298503.56
26 Fruits and Vegetables OUT003 897437.46
27 Fruits and Vegetables OUT004 2311899.66
28 Hard Drinks OUT001 152920.74
29 Hard Drinks OUT002 54281.85
30 Hard Drinks OUT003 110760.30
31 Hard Drinks OUT004 307851.73
32 Health and Hygiene OUT001 435005.31
33 Health and Hygiene OUT002 164660.81
34 Health and Hygiene OUT003 439139.18
35 Health and Hygiene OUT004 1124901.91
36 Household OUT001 531371.38
37 Household OUT002 184665.65
38 Household OUT003 523981.64
39 Household OUT004 1324721.50
40 Meat OUT001 505867.28
41 Meat OUT002 151800.01
42 Meat OUT003 520939.68
43 Meat OUT004 950604.97
44 Others OUT001 123977.09
45 Others OUT002 32835.73
46 Others OUT003 159963.75
47 Others OUT004 224719.73
48 Seafood OUT001 52936.84
49 Seafood OUT002 17663.35
50 Seafood OUT003 65337.48
51 Seafood OUT004 136466.37
52 Snack Foods OUT001 806142.24
53 Snack Foods OUT002 255317.57
54 Snack Foods OUT003 918510.44
55 Snack Foods OUT004 2009026.70
56 Soft Drinks OUT001 410548.69
57 Soft Drinks OUT002 103808.35
58 Soft Drinks OUT003 365046.30
59 Soft Drinks OUT004 917641.38
60 Starchy Foods OUT001 120443.98
61 Starchy Foods OUT002 20044.98
62 Starchy Foods OUT003 143538.60
63 Starchy Foods OUT004 234746.89

Observations:

OUT001

  • Revenue is well balanced across categories, with no extreme dependency on a single product type.

  • Snack Foods, Fruits & Vegetables, Dairy, Frozen Foods are the top contributors.

  • Breakfast and Seafood generate the least revenue, indicating low demand.

  • Performs moderately across both food and non-food (Household, Health & Hygiene) categories.


OUT002

  • Overall lowest total revenue among all stores.

  • Strongest categories are Fruits & Vegetables and Snack Foods, but at much lower scale.

  • Breakfast, Seafood, Starchy Foods, Others perform very weakly.

  • Indicates a small-format / low-footfall store with limited high-value sales.


OUT003

  • High-revenue store with strong performance across most categories.

  • Snack Foods and Fruits & Vegetables dominate sales.

  • Dairy, Frozen Foods, Household, Meat also contribute significantly.

  • Weakest categories remain Breakfast and Seafood, consistent with other stores.


OUT004

  • Top-performing store by a large margin.

  • Extremely strong in Fruits & Vegetables, Snack Foods, Frozen Foods, Dairy, Household.

  • Even traditionally low categories (Breakfast, Seafood, Others) perform better here.

  • Indicates large store size, high footfall, and wide product acceptance.


Cross-store patterns

  • Snack Foods and Fruits & Vegetables are the top revenue drivers across all stores.

  • Breakfast and Seafood are consistently the lowest-performing categories.

  • Revenue scale increases clearly from OUT002 → OUT001 → OUT003 → OUT004.

  • High-performing stores show diversified revenue, not dependence on a single category.

Let's find out the revenue generated by the stores from products having different levels of sugar content.

In [ ]:
df2 = data.groupby(["Product_Sugar_Content", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df2
Out[ ]:
Product_Sugar_Content Store_Id Product_Store_Sales_Total
0 Low Sugar OUT001 3300834.93
1 Low Sugar OUT002 1156758.85
2 Low Sugar OUT003 3706903.24
3 Low Sugar OUT004 8658908.78
4 No Sugar OUT001 1090353.78
5 No Sugar OUT002 382162.19
6 No Sugar OUT003 1123084.57
7 No Sugar OUT004 2674343.14
8 Regular OUT001 1749444.51
9 Regular OUT002 472112.50
10 Regular OUT003 1743566.35
11 Regular OUT004 3902547.93
12 reg OUT001 82479.96
13 reg OUT002 19876.18
14 reg OUT003 99903.41
15 reg OUT004 191783.58

Data Preprocessing

Replacing the values in the Product_Sugar_Content column

In [ ]:
# Replacing reg with Regular
data.Product_Sugar_Content.replace(to_replace=["reg"], value=["Regular"], inplace=True)
In [ ]:
data.Product_Sugar_Content.value_counts()
Out[ ]:
count
Product_Sugar_Content
Low Sugar 4885
Regular 2359
No Sugar 1519

Exploring Patterns in Product_IDs

In [ ]:
## Extracting the first two characters from the Product_Id column and storing it in another column
data["Product_Id_char"] = data["Product_Id"].str[:2]
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Product_Id_char
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40 FD
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02 FD
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16 FD
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18 FD
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36 NC
In [ ]:
data["Product_Id_char"].unique()
Out[ ]:
array(['FD', 'NC', 'DR'], dtype=object)
In [ ]:
data.loc[data.Product_Id_char == "FD", "Product_Type"].unique()
Out[ ]:
array(['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Snack Foods',
       'Meat', 'Fruits and Vegetables', 'Breads', 'Breakfast',
       'Starchy Foods', 'Seafood'], dtype=object)
In [ ]:
data.loc[data.Product_Id_char == "DR", "Product_Type"].unique()
Out[ ]:
array(['Hard Drinks', 'Soft Drinks'], dtype=object)
In [ ]:
data.loc[data.Product_Id_char == "NC", "Product_Type"].unique()
Out[ ]:
array(['Health and Hygiene', 'Household', 'Others'], dtype=object)

Observations:

🔹 Product_Sugar_Content (After Cleaning)

  • The typo reg was successfully standardized to Regular, removing category ambiguity.

  • Low Sugar products dominate the dataset, followed by Regular, then No Sugar.

  • This suggests customer demand (and assortment strategy) is skewed toward low-sugar options.


🔹 Product_ID Prefix Analysis (Product_Id_char)

I identified three clear product families using the first two characters:

  1. FD (Food Products)
  • Covers most product categories:

    • Frozen Foods, Dairy, Canned, Baking Goods, Snack Foods

    • Meat, Fruits & Vegetables, Breads, Breakfast

    • Starchy Foods, Seafood

  • Indicates FD is the core retail assortment, driving volume and revenue.


  1. DR (Drinks)
  • Exclusively mapped to:

    • Hard Drinks

    • Soft Drinks

  • Shows a clean and well-segmented beverage classification.


  1. NC (Non-Consumables)
  • Limited to:

    • Health and Hygiene

    • Household

    • Others

  • These are non-food essentials, likely lower in frequency but important for basket value.


🔹 Structural Insights

  • Product IDs are not random — they encode category intelligence.

  • This structure can be very useful for feature engineering, such as:

    • Group-level demand modeling

    • Category-specific pricing or sales behavior

  • The dataset shows strong internal consistency between Product_ID patterns and Product_Type.


🔹 Modeling & EDA Implications

  • Product_Id_char is a high-value categorical feature for:

    • Sales prediction

    • Customer demand segmentation

  • Sugar content is imbalanced, so stratification or weighting may be needed in models.

  • FD products will likely dominate predictions, while DR and NC may behave differently.

Store's Age

In [ ]:
# Outlet Age
data["Store_Age_Years"] = 2025 - data.Store_Establishment_Year

Grouping Product Types into Perishables and Non-Perishables.

In [ ]:
perishables = [
    "Dairy",
    "Meat",
    "Fruits and Vegetables",
    "Breakfast",
    "Breads",
    "Seafood",
]
In [ ]:
def change(x):
    if x in perishables:
        return "Perishables"
    else:
        return "Non Perishables"
In [ ]:
data['Product_Type_Category'] = data['Product_Type'].apply(change)
In [ ]:
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Product_Id_char Store_Age_Years Product_Type_Category
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40 FD 16 Non Perishables
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02 FD 26 Perishables
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16 FD 38 Non Perishables
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18 FD 38 Non Perishables
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36 NC 27 Non Perishables

Observations:

🔹 Store_Age_Years

  • Store age ranges roughly from ~16 to ~38 years, indicating a mix of newer and very mature stores.

  • Older stores (≈35–38 years) are mostly OUT001 and OUT003, suggesting:

    • Long-standing market presence

    • Likely stable customer base and mature operations

  • Newer stores (≈16–27 years), such as OUT004, still show strong sales, indicating that age alone does not limit performance.

Insight: Store age may influence customer trust and assortment depth, but store size, location, and type likely play a stronger role in sales.


🔹 Product_Type_Category (Perishables vs Non-Perishables)

  • Perishables include: Dairy, Meat, Fruits & Vegetables, Breakfast, Breads, Seafood.

  • Non-Perishables dominate the dataset, including:

    • Frozen Foods, Canned, Baking Goods, Snacks, Beverages, Household, Health & Hygiene.

Observation:

  • The majority of rows fall under Non-Perishables, suggesting:

    • Higher assortment depth

    • Better shelf life and inventory stability

  • Perishables are fewer but typically high-frequency purchase items.


🔹 Combined Insights

  • Older stores + perishables likely require stronger cold-chain and inventory management.

  • Newer or smaller stores may rely more on non-perishables due to:

    • Lower spoilage risk

    • Easier logistics

  • This binary category can help explain sales variance, especially when combined with:

    • Store_Size

    • Store_Type

    • Store_Location_City_Type


🔹 Modeling Value

  • Store_Age_Years is a strong continuous feature for regression.

  • Product_Type_Category (binary) is:

    • Easy to encode

    • Highly interpretable

    • Useful for capturing operational differences in sales behavior

Outlier Check

In [ ]:
def nice_outlier_boxgrid_2col(
    data,
    exclude=("Store_Establishment_Year", "Store_Age_Years"),
    cols=None,
    whis=1.5,
    ncols=2,                 # ✅ two plots per row
    figsize=None,
    color="#8b5cf6",
    title="Outlier Check (Boxplots)",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    # Select numeric columns
    if cols is None:
        cols = data.select_dtypes(include=np.number).columns.tolist()

    # Exclude if present
    cols = [c for c in cols if c not in set(exclude)]
    if not cols:
        raise ValueError("No numeric columns left to plot after exclusions.")

    n = len(cols)
    nrows = math.ceil(n / ncols)

    # Auto figure size tuned for 2-col layout
    if figsize is None:
        figsize = (14, 3.4 * nrows)

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    axes = np.array(axes).reshape(-1)

    # Title + subtitle (same look as your previous charts)
    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)
    fig.text(
        0.5,
        title_y - 0.045,
        f"numeric_features={n:,}   whis={whis}   layout={ncols} per row",
        ha="center",
        va="top",
        fontsize=11,
    )

    for i, col in enumerate(cols):
        ax = axes[i]
        x = data[col].dropna()

        sns.boxplot(
            x=x,
            ax=ax,
            color=color,
            whis=whis,
            showmeans=True,
            meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
            medianprops=dict(color="black", linewidth=1.8),
            whiskerprops=dict(linewidth=1.2),
            boxprops=dict(linewidth=1.2),
        )

        ax.set_title(col, fontsize=12, pad=10)
        ax.set_xlabel("")
        ax.set_ylabel("")
        ax.grid(True, axis="x", alpha=0.25)  # subtle guidance
        sns.despine(ax=ax, left=True, bottom=True)

    # Hide unused axes
    for j in range(n, len(axes)):
        axes[j].axis("off")

    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()
In [ ]:
nice_outlier_boxgrid_2col(data)

Observations:

🔹 Product_Weight

  • Most product weights are concentrated between ~10 and ~15 units.

  • There are outliers on both ends:

    • Very light products (< ~7)

    • Very heavy products (> ~19–22)

  • Distribution is fairly symmetric, suggesting natural variation by product type rather than data errors.

Takeaway: Outliers look realistic (different packaging sizes), not anomalies.


🔹 Product_Allocated_Area

  • Majority of values lie in the 0.03–0.10 range.

  • Strong right-skew with many high-end outliers (up to ~0.30).

  • Indicates some products require significantly more shelf space.

Takeaway: High-end outliers likely represent bulky or premium-display products.


🔹 Product_MRP

  • Core price range is ~120 to ~170.

  • Clear upper-end outliers beyond ₹220–₹270.

  • A few low-priced outliers (< ~70) also exist.

Takeaway: Price outliers reflect premium and budget product segments, not noise.


🔹 Product_Store_Sales_Total

  • Highly right-skewed distribution.

  • Most sales totals fall between ~2500 and ~4500.

  • Several very high outliers (up to ~8000), indicating top-performing products.

  • A few low-end outliers, likely slow-moving or niche items.

Takeaway: Sales outliers are business-critical (star vs low-performing products).


🔹 Overall Conclusion

  • Outliers are meaningful and business-driven, not data quality issues.

  • Removing them could erase important patterns.

  • Better strategies:

    • Log-transform Product_Store_Sales_Total

    • Use robust models (tree-based, quantile-based)

Data Preparation for Modeling

In [ ]:
data.head()
Out[ ]:
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Product_Id_char Store_Age_Years Product_Type_Category
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40 FD 16 Non Perishables
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02 FD 26 Perishables
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16 FD 38 Non Perishables
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18 FD 38 Non Perishables
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36 NC 27 Non Perishables

Let's remove the columns that are not required.

In [ ]:
data = data.drop(["Product_Id", "Product_Type", "Store_Id", "Store_Establishment_Year"], axis=1)
In [ ]:
data.shape
Out[ ]:
(8763, 11)
In [ ]:
data.head()
Out[ ]:
Product_Weight Product_Sugar_Content Product_Allocated_Area Product_MRP Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Product_Id_char Store_Age_Years Product_Type_Category
0 12.66 Low Sugar 0.027 117.08 Medium Tier 2 Supermarket Type2 2842.40 FD 16 Non Perishables
1 16.54 Low Sugar 0.144 171.43 Medium Tier 1 Departmental Store 4830.02 FD 26 Perishables
2 14.28 Regular 0.031 162.08 High Tier 2 Supermarket Type1 4130.16 FD 38 Non Perishables
3 12.10 Low Sugar 0.112 186.31 High Tier 2 Supermarket Type1 4132.18 FD 38 Non Perishables
4 9.57 No Sugar 0.010 123.67 Small Tier 3 Food Mart 2279.36 NC 27 Non Perishables
In [ ]:
data.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Product_Weight 8763.0 NaN NaN NaN 12.653792 2.21732 4.0 11.15 12.66 14.18 22.0
Product_Sugar_Content 8763 3 Low Sugar 4885 NaN NaN NaN NaN NaN NaN NaN
Product_Allocated_Area 8763.0 NaN NaN NaN 0.068786 0.048204 0.004 0.031 0.056 0.096 0.298
Product_MRP 8763.0 NaN NaN NaN 147.032539 30.69411 31.0 126.16 146.74 167.585 266.0
Store_Size 8763 3 Medium 6025 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 8763 3 Tier 2 6262 NaN NaN NaN NaN NaN NaN NaN
Store_Type 8763 4 Supermarket Type2 4676 NaN NaN NaN NaN NaN NaN NaN
Product_Store_Sales_Total 8763.0 NaN NaN NaN 3464.00364 1065.630494 33.0 2761.715 3452.34 4145.165 8000.0
Product_Id_char 8763 3 FD 6539 NaN NaN NaN NaN NaN NaN NaN
Store_Age_Years 8763.0 NaN NaN NaN 22.967249 8.388381 16.0 16.0 16.0 27.0 38.0
Product_Type_Category 8763 2 Non Perishables 5718 NaN NaN NaN NaN NaN NaN NaN
In [ ]:
# Separating features and the target column
X = data.drop("Product_Store_Sales_Total", axis=1)
y = data["Product_Store_Sales_Total"]
In [ ]:
print(X.shape)
print(y.shape)
(8763, 10)
(8763,)
In [ ]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, shuffle=True
)
In [ ]:
X_train.shape, X_test.shape
Out[ ]:
((6134, 10), (2629, 10))

Observations:

Key Observations from the Split

  • ✅ Train–test split is correct:

    • Train: 6,134 rows

    • Test: 2,629 rows

  • ✅ Target separation is clean (Product_Store_Sales_Total)

  • ✅ No row leakage (shuffle=True, fixed random_state)

Data Pre-processing Pipeline

In [ ]:
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features
Out[ ]:
['Product_Sugar_Content',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Id_char',
 'Product_Type_Category']
In [ ]:
# Create a preprocessing pipeline for the categorical features
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
)

Model Building

Define functions for Model Evaluation

In [ ]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

The ML models to be built can be any two out of the following:

  1. Decision Tree
  2. Bagging
  3. Random Forest
  4. AdaBoost
  5. Gradient Boosting
  6. XGBoost

Decision Tree Model

In [ ]:
dtree = DecisionTreeRegressor(random_state=1)
dtree = make_pipeline(preprocessor,dtree)
dtree.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('decisiontreeregressor',
                 DecisionTreeRegressor(random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
dtree_model_train_perf = model_performance_regression(dtree, X_train, y_train)
dtree_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 596.978222 468.965498 0.685033 0.684519 0.16569

Checking model performance on test set

In [ ]:
dtree_model_test_perf = model_performance_regression(dtree, X_test, y_test)
dtree_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.933034 485.429583 0.668482 0.667215 0.187421

Observations:

  • The pipeline is correctly structured, combining preprocessing (One-Hot Encoding) and the Decision Tree regressor, ensuring consistent data handling during training and testing.

  • Training and test performance are fairly close, indicating that the model is not severely overfitting.

  • R² score (~0.68 on train and ~0.67 on test) suggests the model explains around two-thirds of the variance in product sales, which is reasonable for a baseline model.

  • Adjusted R² is very close to R², implying that the number of predictors introduced by one-hot encoding is not excessively inflating model performance.

  • RMSE increases slightly on the test set, showing a small generalization error but acceptable stability.

  • MAE values are consistent across train and test, indicating stable average prediction errors.

  • MAPE (~16.5% train, ~18.7% test) shows moderate relative error, meaning predictions are reasonably close in percentage terms.

  • Unpruned Decision Tree captures non-linear relationships, but may still be sensitive to noise and outliers in sales data.

  • Model serves well as a baseline, but performance can likely be improved with ensemble methods (Random Forest, Gradient Boosting).

Bagging Regressor

In [ ]:
bagging_regressor = BaggingRegressor(random_state=1)
bagging_regressor = make_pipeline(preprocessor,bagging_regressor)
bagging_regressor.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('baggingregressor', BaggingRegressor(random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
bagging_regressor_model_train_perf = model_performance_regression(bagging_regressor, X_train, y_train)
bagging_regressor_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.064588 469.426208 0.684942 0.684428 0.165799

Checking model performance on test set

In [ ]:
bagging_regressor_model_test_perf = model_performance_regression(bagging_regressor, X_test, y_test)
bagging_regressor_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.866125 485.588892 0.668554 0.667288 0.18735

Observations:

  • Pipeline integration is consistent, combining preprocessing (One-Hot Encoding) with the Bagging Regressor, ensuring uniform feature handling.

  • Training performance is similar to the Decision Tree, with an R² of ~0.68, indicating comparable explanatory power.

  • Test R² (~0.67) closely matches training R², showing good generalization and reduced overfitting compared to a single tree.

  • RMSE and MAE values are almost identical on train and test sets, highlighting stability in predictions.

  • MAPE (~16.6% train, ~18.7% test) suggests reasonable percentage-level prediction accuracy, similar to the Decision Tree.

  • Bagging reduces variance, but the improvement over a single Decision Tree is marginal in this setup.

  • Model performance indicates limited gains without tuning, likely because default base estimators are already simple.

  • Useful as a variance-reduction baseline, but stronger ensemble methods may yield better improvements.

Random Forest Model

In [ ]:
rf_estimator = RandomForestRegressor(random_state=1)
rf_estimator = make_pipeline(preprocessor,rf_estimator)
rf_estimator.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train, y_train)
rf_estimator_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 596.994959 468.87585 0.685016 0.684501 0.165674

Checking model performance on test set

In [ ]:
rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test, y_test)
rf_estimator_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.906846 485.311027 0.66851 0.667244 0.187394

Observations:

  • Seamless integration with the preprocessing pipeline, ensuring consistent encoding of categorical variables before modeling.

  • Training performance (R² ≈ 0.685) is almost identical to Decision Tree and Bagging models, indicating similar explanatory power.

  • Test performance (R² ≈ 0.669) closely matches training performance, showing good generalization and low overfitting.

  • RMSE (~ 616) and MAE (~ 485) on the test set are nearly the same as Bagging and Decision Tree, suggesting limited incremental improvement.

  • MAPE (~18.7% on test) remains consistent across all tree-based models tried so far.

  • Random Forest’s variance reduction is evident, but its benefit is muted, likely due to:

  • Limited signal in the available features

  • Default hyperparameters (e.g., number of trees, depth)

  • Model stability is strong, as seen from minimal train–test performance gap.

  • Better potential than Bagging with tuning, especially by adjusting n_estimators, max_depth, and min_samples_leaf.

AdaBoost Regressor

In [ ]:
ab_regressor = AdaBoostRegressor(random_state=1)
ab_regressor = make_pipeline(preprocessor,ab_regressor)
ab_regressor.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('adaboostregressor', AdaBoostRegressor(random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
ab_regressor_model_train_perf = model_performance_regression(ab_regressor, X_train, y_train)
ab_regressor_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 627.778405 512.191947 0.651694 0.651126 0.172981

Checking model performance on test set

In [ ]:
ab_regressor_model_test_perf = model_performance_regression(ab_regressor, X_test, y_test)
ab_regressor_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 647.483018 530.557049 0.633649 0.63225 0.193723

Observations:

  • Lower overall performance compared to other tree-based models (Decision Tree, Bagging, Random Forest).

  • Training R² ≈ 0.652, which is already weaker than previous models, indicating limited ability to capture underlying patterns.

  • Test R² drops further to ≈ 0.634, showing poorer generalization on unseen data.

  • Highest error metrics among all models tested so far:

    • Test RMSE ≈ 647

    • Test MAE ≈ 531

    • Test MAPE ≈ 19.4%

  • Larger train–test performance gap compared to Random Forest and Bagging, suggesting instability.

  • AdaBoost’s sensitivity to noisy data and outliers likely impacts performance, especially given:

    • Wide variance in Product_Store_Sales_Total

    • Presence of outliers observed earlier in numerical features

  • Default weak learners (shallow trees) may be underfitting the data.

  • Not well-suited in current configuration for this regression task without careful tuning.

Overall: AdaBoost underperforms relative to other ensemble methods and is the weakest model tested so far for predicting product store sales.

Gradient Boosting Regressor

In [ ]:
gb_estimator = GradientBoostingRegressor(random_state=1)
gb_estimator = make_pipeline(preprocessor,gb_estimator)
gb_estimator.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
gb_estimator_model_train_perf = model_performance_regression(gb_estimator, X_train, y_train)
gb_estimator_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.006101 469.061969 0.685004 0.684489 0.16573

Checking model performance on test set

In [ ]:
gb_estimator_model_test_perf = model_performance_regression(gb_estimator, X_test, y_test)
gb_estimator_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.902369 485.444821 0.668515 0.667249 0.187447

Observations:

  • Strong and stable performance, very similar to Decision Tree, Bagging, and Random Forest models.

  • Training performance:

    • R² ≈ 0.685

    • RMSE ≈ 597

    • Indicates the model captures a good amount of variance without overfitting.

  • Test performance remains consistent:

    • R² ≈ 0.669

    • RMSE ≈ 616

    • MAE ≈ 485

  • Minimal train–test gap, suggesting good generalization.

  • MAPE (~18.7%) is comparable to Bagging and Random Forest, and clearly better than AdaBoost.

  • Gradient Boosting handles non-linear relationships and feature interactions effectively, even with mixed numerical and one-hot encoded categorical features.

  • Performance improvement over AdaBoost shows that sequential boosting with gradient optimization is more robust to noise in this dataset.

  • Default hyperparameters already yield competitive results, indicating good baseline suitability.

Overall: Gradient Boosting is a strong candidate model, offering balanced bias–variance trade-off and performance on par with the best models tested so far.

XGBoost Regressor

In [ ]:
xgb_estimator = XGBRegressor(random_state=1)
xgb_estimator = make_pipeline(preprocessor,xgb_estimator)
xgb_estimator.fit(X_train, y_train)
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              co...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=1, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Checking model performance on training set

In [ ]:
xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train, y_train)
xgb_estimator_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 596.978222 468.965507 0.685033 0.684519 0.16569

Checking model performance on test set

In [ ]:
xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test, y_test)
xgb_estimator_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.933034 485.429585 0.668482 0.667215 0.187421

Observations:

  • Performance is on par with tree-based ensemble models (Decision Tree, Bagging, Random Forest, Gradient Boosting).

  • Training results:

    • R² ≈ 0.685

    • RMSE ≈ 597

    • MAE ≈ 469

  • Test results remain stable:

    • R² ≈ 0.668

    • RMSE ≈ 616

    • MAE ≈ 485

  • Very small train–test gap, indicating no significant overfitting.

  • MAPE (~18.7%) is consistent with Random Forest and Gradient Boosting.

  • Despite XGBoost’s advanced regularization and boosting strategy, default hyperparameters do not significantly outperform other ensemble models in this setup.

  • Performance similarity suggests that the feature set and preprocessing pipeline are the main performance drivers, rather than the specific ensemble algorithm.

  • XGBoost’s strength (handling complex interactions and regularization) is likely underutilized without hyperparameter tuning.

Overall: XGBoost is a robust and reliable model, but in its current untuned form, it does not provide a clear advantage over Random Forest or Gradient Boosting for this dataset.

Model Performance Improvement - Hyperparameter Tuning

Hyperparameter Tuning - Decision Tree

In [ ]:
# Uncomment the below snippet of code if decision tree regressor is to be used

# Choose the type of classifier.
dtree_tuned = DecisionTreeRegressor(random_state=1)
dtree_tuned = make_pipeline(preprocessor,dtree_tuned)

# Grid of parameters to choose from
parameters = {
     "decisiontreeregressor__max_depth": list(np.arange(2, 6)),
     "decisiontreeregressor__min_samples_leaf": [1, 3, 5],
     "decisiontreeregressor__max_leaf_nodes": [2, 3, 5, 10, 15],
     "decisiontreeregressor__min_impurity_decrease": [0.001, 0.01, 0.1],
 }

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=r2_score, cv=3, n_jobs =-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'decisiontreeregressor__max_depth': np.int64(2), 'decisiontreeregressor__max_leaf_nodes': 2, 'decisiontreeregressor__min_impurity_decrease': 0.001, 'decisiontreeregressor__min_samples_leaf': 1}

Checking model performance on training set

In [ ]:
dtree_tuned_model_train_perf = model_performance_regression(dtree_tuned, X_train, y_train)
dtree_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 830.838204 656.388225 0.389929 0.388932 0.214436

Checking model performance on test set

In [ ]:
dtree_tuned_model_test_perf = model_performance_regression(dtree_tuned, X_test, y_test)
dtree_tuned_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 845.130586 668.489477 0.375851 0.373467 0.234787

Observations:

  • The tuned Decision Tree performs significantly worse than the untuned version and all other ensemble models.

  • R-squared drops sharply (~0.38) on both train and test sets, indicating very low explanatory power.

  • RMSE and MAE increase substantially, showing higher prediction errors after tuning.

  • Similar train and test performance suggests no overfitting, but rather strong underfitting.

  • The selected best parameters (very shallow tree: max_depth = 2, max_leaf_nodes = 2) overly restrict model complexity.

  • The model fails to capture nonlinear relationships present in the data.

  • Hyperparameter tuning over-regularized the model, hurting performance instead of improving it.

  • This confirms that single decision trees are not suitable for this dataset compared to ensemble-based methods.

Conclusion:

The tuned Decision Tree is the worst-performing model and should be discarded in favor of ensemble models like Random Forest, Gradient Boosting, or XGBoost.

Hyperparameter Tuning - Bagging Regressor

In [ ]:
# Choose the type of regressor.
bagging_estimator_tuned = BaggingRegressor(random_state=1)
bagging_estimator_tuned = make_pipeline(preprocessor,bagging_estimator_tuned)

# Grid of parameters to choose from
parameters = {
     "baggingregressor__max_samples": [0.7, 0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "baggingregressor__max_features": [0.7, 0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "baggingregressor__n_estimators": [10, 30, 50, 100] #Complete the code to define the list of values to be tuned
}

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'baggingregressor__max_features': 0.7, 'baggingregressor__max_samples': 0.7, 'baggingregressor__n_estimators': 10}

Checking model performance on training set

In [ ]:
bagging_estimator_tuned_model_train_perf = model_performance_regression(bagging_estimator_tuned, X_train, y_train)
bagging_estimator_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.209398 469.235665 0.684789 0.684274 0.165821

Checking model performance on test set

In [ ]:
bagging_estimator_tuned_model_test_perf = model_performance_regression(bagging_estimator_tuned, X_test, y_test)
bagging_estimator_tuned_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 616.149152 485.593111 0.668249 0.666982 0.187425

Observations:

  • The tuned Bagging Regressor shows almost no improvement over the untuned version.

  • R-squared (~0.668 on test) remains virtually unchanged, indicating similar explanatory power.

  • RMSE and MAE on the test set are nearly identical to the base Bagging model, suggesting limited gains from tuning.

  • Train and test metrics are very close, indicating good generalization and low overfitting.

  • The best parameters selected (max_samples = 0.7, max_features = 0.7, n_estimators = 10) favor higher randomness and fewer trees, reducing variance but not boosting accuracy.

  • Increasing ensemble complexity (more estimators or features) did not significantly improve performance, implying the model has reached a performance plateau.

  • Bagging remains stable and robust, but tuning alone cannot extract additional predictive power from the current feature set.

Conclusion:

Hyperparameter tuning does not materially enhance the Bagging Regressor. While it generalizes well, its performance is capped, making it less competitive than more expressive ensemble methods like Gradient Boosting or XGBoost.

Hyperparameter Tuning - Random Forest

In [ ]:
# Choose the type of classifier.
rf_tuned = RandomForestRegressor(random_state=1)
rf_tuned = make_pipeline(preprocessor,rf_tuned)

# Grid of parameters to choose from
parameters = {
     "randomforestregressor__max_depth": [10, 20, 30, None], #Complete the code to define the list of values to be tuned
     "randomforestregressor__max_features": ['sqrt', 'log2', 1.0, 0.7], #Complete the code to define the list of values to be tuned
     "randomforestregressor__n_estimators": [100, 200, 300], #Complete the code to define the list of values to be tuned
}

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'randomforestregressor__max_depth': 10, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__n_estimators': 100}

Checking model performance on training set

In [ ]:
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
rf_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 596.994959 468.87585 0.685016 0.684501 0.165674

Checking model performance on test set

In [ ]:
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
rf_tuned_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.906846 485.311027 0.66851 0.667244 0.187394

Observations:

  • The tuned Random Forest shows almost identical performance to the default Random Forest model.

  • Test R² (~0.6685) remains unchanged, indicating no meaningful improvement in explanatory power.

  • RMSE and MAE on the test set are nearly the same as the untuned model, confirming marginal gains from tuning.

  • The selected parameters (max_depth = 10, max_features = 'sqrt', n_estimators = 100) impose controlled tree complexity, helping prevent overfitting.

  • Train and test metrics are closely aligned, suggesting good generalization and stable learning.

  • Increasing the number of trees beyond 100 or allowing deeper trees did not improve performance, implying diminishing returns.

  • The model appears bias-limited rather than variance-limited, meaning feature richness matters more than hyperparameter tuning.

Conclusion:

Hyperparameter tuning does not significantly enhance Random Forest performance for this dataset. While the model is stable and reliable, further gains are more likely to come from feature engineering or advanced boosting methods rather than additional tuning.

Hyperparameter Tuning - AdaBoost Regressor

In [ ]:
# Choose the type of classifier.
ab_tuned = AdaBoostRegressor(random_state=1)
ab_tuned = make_pipeline(preprocessor,ab_tuned)
# Grid of parameters to choose from
parameters = {
     "adaboostregressor__n_estimators": [50, 100, 150, 200], #Complete the code to define the list of values to be tuned
     "adaboostregressor__learning_rate": [0.01, 0.1, 0.5, 1.0], #Complete the code to define the list of values to be tuned
}


# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
ab_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'adaboostregressor__learning_rate': 0.01, 'adaboostregressor__n_estimators': 50}

Checking model performance on training set

In [ ]:
ab_tuned_model_train_perf = model_performance_regression(ab_tuned, X_train, y_train)
ab_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.816706 473.070121 0.684148 0.683632 0.166248

Checking model performance on test set

In [ ]:
ab_tuned_model_test_perf = model_performance_regression(ab_tuned, X_test, y_test)
ab_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.816706 473.070121 0.684148 0.683632 0.166248

Observations:

  • Hyperparameter tuning selected a low learning rate (0.01) with fewer estimators (50), indicating that conservative boosting works better for this dataset.

  • Compared to the untuned AdaBoost model, the performance has improved slightly, especially in terms of RMSE and MAE.

  • R² (~0.684) is higher than the default AdaBoost model, showing better variance explanation after tuning.

  • Training and test metrics are identical, suggesting the model has high bias and limited flexibility, but also very stable generalization.

  • The low learning rate reduces the risk of overfitting, but also limits the model’s ability to capture complex nonlinear patterns.

  • Despite tuning, AdaBoost still underperforms compared to Random Forest, Gradient Boosting, and XGBoost.

  • The model benefits from tuning more than Decision Tree, but remains less competitive overall.

Conclusion:

Hyperparameter tuning improves AdaBoost modestly, but the model remains bias-constrained. For stronger predictive performance, tree-based ensemble methods with higher capacity (Random Forest, Gradient Boosting, XGBoost) are more suitable for this problem.

Hyperparameter Tuning - Gradient Boosting Regressor

In [ ]:
# Choose the type of classifier.
gb_tuned = GradientBoostingRegressor(random_state=1)
gb_tuned = make_pipeline(preprocessor,gb_tuned)

# Grid of parameters to choose from
parameters = {
     "gradientboostingregressor__n_estimators": [100, 200, 300], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__subsample": [0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__max_features": [0.8, 1.0, 'sqrt', 'log2'], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__max_depth": [3, 4, 5] #Complete the code to define the list of values to be tuned
}


# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'gradientboostingregressor__max_depth': 3, 'gradientboostingregressor__max_features': 0.8, 'gradientboostingregressor__n_estimators': 100, 'gradientboostingregressor__subsample': 0.8}

Checking model performance on training set

In [ ]:
gb_tuned_model_train_perf = model_performance_regression(gb_tuned, X_train, y_train)
gb_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.002307 469.083177 0.685008 0.684493 0.165728

Checking model performance on test set

In [ ]:
gb_tuned_model_test_perf = model_performance_regression(gb_tuned, X_test, y_test)
gb_tuned_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 615.872215 485.466649 0.668547 0.667281 0.187474

Observations:

  • Hyperparameter tuning selected a shallow tree depth (max_depth = 3), indicating that simple base learners generalize better for this dataset.

  • The model prefers subsampling (subsample = 0.8) and feature subsampling (max_features = 0.8), which helps reduce overfitting and improve robustness.

  • The chosen number of estimators (100) balances learning capacity and stability without excessive complexity.

  • Training performance remains strong (R² ≈ 0.685), similar to the untuned Gradient Boosting model.

  • Test performance (R² ≈ 0.669, RMSE ≈ 616) is very close to training performance, showing good generalization.

  • Compared to AdaBoost, the tuned Gradient Boosting model shows lower error and higher R², confirming its superior learning capability.

  • Hyperparameter tuning results in marginal but consistent improvements, suggesting the base model was already well-specified.

  • Performance is comparable to Random Forest and XGBoost, making it one of the top-performing models in this study.

Conclusion:

The tuned Gradient Boosting Regressor achieves a strong bias–variance balance, with stable generalization and competitive accuracy. It is a reliable final model choice, especially when interpretability and controlled complexity are important.

Hyperparameter Tuning - XGBoost Regressor

In [ ]:
# Choose the type of classifier.
xgb_tuned = XGBRegressor(random_state=1)
xgb_tuned = make_pipeline(preprocessor,xgb_tuned)

# Grid of parameters to choose from
parameters = {
     "xgbregressor__n_estimators": [100, 200], #Complete the code to define the list of values to be tuned
     "xgbregressor__subsample": [0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__gamma": [0, 1, 5], #Complete the code to define the list of values to be tuned
     "xgbregressor__colsample_bytree": [0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__colsample_bylevel":[0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__max_depth": [3, 5, 7]
}

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)
Best Parameters Found:
{'xgbregressor__colsample_bylevel': 0.7, 'xgbregressor__colsample_bytree': 0.7, 'xgbregressor__gamma': 0, 'xgbregressor__max_depth': 3, 'xgbregressor__n_estimators': 100, 'xgbregressor__subsample': 0.7}

Checking model performance on training set

In [ ]:
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
xgb_tuned_model_train_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 597.081727 468.747993 0.684924 0.684409 0.16565

Checking model performance on test set

In [ ]:
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
xgb_tuned_model_test_perf
Out[ ]:
RMSE MAE R-squared Adj. R-squared MAPE
0 616.05301 485.055015 0.668353 0.667086 0.187289

Observations:

  • Hyperparameter tuning selected a shallow tree depth (max_depth = 3), reinforcing that simpler trees generalize better for this dataset.

  • The model favors aggressive subsampling (subsample = 0.7, colsample_bytree = 0.7, colsample_bylevel = 0.7), which helps control overfitting and improves model stability.

  • A gamma value of 0 indicates that allowing splits without additional loss penalty works well, suggesting the data benefits from flexible splitting.

  • The optimal number of estimators (100) provides sufficient boosting rounds without overfitting.

  • Training performance is strong (R² ≈ 0.685, RMSE ≈ 597), comparable to untuned and other tuned ensemble models.

  • Test performance (R² ≈ 0.668, RMSE ≈ 616) is very close to training metrics, indicating good generalization.

  • Hyperparameter tuning yields only marginal improvements, suggesting the original XGBoost model was already near optimal.

  • Compared to AdaBoost, XGBoost performs significantly better; however, its performance is very similar to Random Forest and Gradient Boosting.

Conclusion:

The tuned XGBoost Regressor demonstrates stable and robust performance with strong generalization. While tuning provides limited gains, XGBoost remains one of the top-performing models and is a strong candidate for final deployment, especially when predictive accuracy is prioritized.

Model Performance Comparison, Final Model Selection, and Serialization

In [ ]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        rf_estimator_model_train_perf.T,      # Random Forest (base)
        rf_tuned_model_train_perf.T,          # Random Forest (tuned)
        xgb_estimator_model_train_perf.T,     # XGBoost (base)
        xgb_tuned_model_train_perf.T,         # XGBoost (tuned)
    ],
    axis=1,
)

models_train_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost Estimator",
    "XGBoost Tuned",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Random Forest Estimator Random Forest Tuned XGBoost Estimator XGBoost Tuned
RMSE 596.994959 596.994959 596.978222 597.081727
MAE 468.875850 468.875850 468.965507 468.747993
R-squared 0.685016 0.685016 0.685033 0.684924
Adj. R-squared 0.684501 0.684501 0.684519 0.684409
MAPE 0.165674 0.165674 0.165690 0.165650
In [ ]:
# Testing performance comparison

models_test_comp_df = pd.concat(
    [
        rf_estimator_model_test_perf.T,      # Random Forest (base)
        rf_tuned_model_test_perf.T,          # Random Forest (tuned)
        xgb_estimator_model_test_perf.T,     # XGBoost (base)
        xgb_tuned_model_test_perf.T,         # XGBoost (tuned)
    ],
    axis=1,
)

models_test_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost Estimator",
    "XGBoost Tuned",
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[ ]:
Random Forest Estimator Random Forest Tuned XGBoost Estimator XGBoost Tuned
RMSE 615.906846 615.906846 615.933034 616.053010
MAE 485.311027 485.311027 485.429585 485.055015
R-squared 0.668510 0.668510 0.668482 0.668353
Adj. R-squared 0.667244 0.667244 0.667215 0.667086
MAPE 0.187394 0.187394 0.187421 0.187289
In [ ]:

In [ ]:
if rf_tuned_model_train_perf["RMSE"][0] < xgb_tuned_model_train_perf["RMSE"][0]:
    best_model = rf_tuned
else:
    best_model = xgb_tuned

print(f"The best performing model is: {best_model}")
The best performing model is: Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features='sqrt',
                                       random_state=1))])

Observations:

🔍 Observations for Final Model Selection

  • Random Forest (base and tuned) consistently delivers the best overall performance across both training and test datasets.

  • The tuned Random Forest does not improve performance over the base Random Forest, indicating that the default parameters were already near-optimal.

  • XGBoost (base and tuned) performs very similarly to Random Forest but shows:

    • Slightly higher RMSE and MAE

    • Marginally lower R² and Adjusted R² on the test set

  • The performance gap between training and test sets for Random Forest is small, indicating good generalization and minimal overfitting.

  • Tuned XGBoost does not outperform base XGBoost, suggesting limited benefit from hyperparameter tuning for this dataset.

  • Among all models compared:

    • Lowest Test RMSE & MAE → Random Forest

    • Highest Test R² & Adjusted R² → Random Forest

    • Lowest Test MAPE → Random Forest


Final Selection Justification

  • Random Forest Regressor is selected as the best-performing and most stable model

  • It balances accuracy, robustness, and generalization

  • Hyperparameter tuning did not yield meaningful gains, reinforcing confidence in the chosen model

Model Serialization

In [ ]:
# Create a folder for storing the files needed for web app deployment
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files", exist_ok=True)
In [ ]:
# Define the file path to save (serialize) the trained model along with the data preprocessing steps
saved_model_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib"
In [ ]:
# Save the best trained model pipeline using joblib
joblib.dump(best_model, saved_model_path) #Complete the code to pass the variable name of the best model

print(f"Model saved successfully at {saved_model_path}")
Model saved successfully at /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib
In [ ]:
# Load the saved model pipeline from the file
saved_model = joblib.load("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib") #Complete the code to define the name of the saved model

# Confirm the model is loaded
print("Model loaded successfully.")
Model loaded successfully.
In [ ]:
saved_model
Out[ ]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features='sqrt',
                                       random_state=1))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Let's try making predictions on the test set using the deserialized model.

In [ ]:
# Test a prediction to confirm functionality
sample_preds = saved_model.predict(X_test[:5])
print("\n Sample Predictions on Test Set:\n", sample_preds)
 Sample Predictions on Test Set:
 [3301.4740379  4861.77518803 4858.00647634 3294.17251579 3951.95954778]

Observations:

  • A dedicated directory was created to store all files required for backend deployment, ensuring a well-structured and maintainable deployment setup.

  • The trained model was saved using the joblib library as a single .joblib file, which includes the complete machine learning pipeline.

  • The serialized object contains both the data preprocessing steps (ColumnTransformer and OneHotEncoder) and the final Random Forest regression model, ensuring consistency between training and inference.

  • Successful execution messages confirm that the model was correctly saved to disk at the specified location.

  • The model was subsequently reloaded from the saved file, verifying that the serialization process was successful and the file is usable.

  • Inspection of the loaded object confirms that it is a Pipeline, demonstrating that all preprocessing and modeling components are preserved together.

  • The OneHotEncoder is configured with handle_unknown='ignore', which improves robustness by allowing the model to handle unseen categorical values during real-time predictions without errors.

  • Overall, the serialization process ensures the selected best-performing model is deployment-ready and can be directly integrated into a production or web application environment.

Deployment - Backend

Setting up a Hugging Face Docker Space for the Backend

In [ ]:
# Import the login function from the huggingface_hub library
from huggingface_hub import login
from huggingface_hub import create_repo
import os

# Login to Hugging Face account using access token
from google.colab import userdata
hf_token = userdata.get('SuperKart1')
login(token=hf_token)
In [ ]:
# create the repository for the Hugging Face Space

try:
  create_repo("SuperKartBackend",
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker" to create a Docker space
        private=False  # Set to True if the space to be private
    )

# Handle potential errors during repository creation
except Exception as e:
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")

Flask Web Framework - app.py

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/app.py"

# Import necessary libraries
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify
import traceback
import math

# Define the path where the model is saved
model_file_name = "SuperKart_v1_0.joblib"

try:
    # Load the trained machine learning model
    model = joblib.load(model_file_name)
except FileNotFoundError:
    print(f"Error: Model file not found at {model_file_name}")
    model = None
except Exception as e:
    print(f"Error loading model: {e}")
    traceback.print_exc()
    model = None

# Initialize the Flask app
app = Flask(__name__)

@app.route('/')
def home():
    return "Welcome to the Super Kart Product Sales Price Prediction API!"

# ---------------- single Prediction Endpoint ----------------
@app.route('/v1/salesprice', methods=['POST'])
def predict_sales_price():
    if model is None:
        return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500

    try:
        property_data = request.get_json(force=True)

        expected_keys = [
            'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
            'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
            'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
        ]
        if not all(key in property_data for key in expected_keys):
            missing_keys = [key for key in expected_keys if key not in property_data]
            return jsonify({"error": f"Missing keys in input data: {missing_keys}"}), 400

        sample = {key: property_data.get(key) for key in expected_keys}
        input_data = pd.DataFrame([sample])

        predicted_sales_price = model.predict(input_data)
        predicted_price = round(float(predicted_sales_price[0]), 2)

        if math.isinf(predicted_price) or math.isnan(predicted_price):
            return jsonify({"error": "Prediction resulted in an invalid value."}), 400

        return jsonify({'Predicted Price': predicted_price}), 200

    except Exception as e:
        print(f"Error during single prediction: {e}")
        traceback.print_exc()
        return jsonify({"error": "Internal server error", "details": str(e)}), 500

# ---------------- Batch Prediction Endpoint ----------------
@app.route('/v1/salespricebatch', methods=['POST'])
def predict_sales_price_batch():
    """
    Expects a CSV file with one product per row.
    Returns JSON: a list of dicts with `row_id` and predicted price.
    """
    if model is None:
        return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500

    if 'file' not in request.files:
        return jsonify({"error": "No file uploaded"}), 400

    try:
        file = request.files['file']
        input_data = pd.read_csv(file)

        expected_columns = [
            'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
            'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
            'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
        ]
        missing_columns = [col for col in expected_columns if col not in input_data.columns]
        if missing_columns:
            return jsonify({"error": f"Missing required columns: {missing_columns}"}), 400

        input_data.reset_index(inplace=True)
        input_data.rename(columns={'index': 'row_id'}, inplace=True)

        predictions = model.predict(input_data[expected_columns])
        predicted_prices = [round(float(p), 2) for p in predictions]

        results = [
            {"row_id": row_id, "Predicted Price": price}
            for row_id, price in zip(input_data['row_id'], predicted_prices)
        ]

        return jsonify(results), 200

    except Exception as e:
        print(f"Error during batch prediction: {e}")
        traceback.print_exc()
        return jsonify({"error": "Internal server error during batch prediction.", "details": str(e)}), 500

if __name__ == '__main__':
    pass
Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/app.py

Dependency File

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10
Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/requirements.txt

Dockerfile

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/Dockerfile"
# Use slim Python image
FROM python:3.9-slim

# Set working directory inside the container
WORKDIR /app

# Copy project files into the container
COPY . .

# Install dependencies and print package list to verify gunicorn is installed
RUN pip install --no-cache-dir --upgrade pip \
 && pip install --no-cache-dir -r requirements.txt \
 && echo "Installed packages:" \
 && pip list

# Expose the port Hugging Face expects
EXPOSE 7860

# Start the Flask app using gunicorn
# - app: refers to app.py
# - app: the Flask app object in app.py (corrected from rental_price_predictor_api)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:app"]
Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/Dockerfile

Uploading files to Hugging Face Space for Backend

In [ ]:
# for hugging face space authentication to upload files
from huggingface_hub import HfApi

# Hugging Face space id - Backend
repo_id = "randley7/SuperKartBackend"

# Initialize the API
api = HfApi()

#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/"

# Upload Streamlit app files
api.upload_folder(folder_path=folder_path,repo_id=repo_id,repo_type="space")

print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")
Files from /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/ successfully uploaded to the Hugging Face Space: randley7/SuperKartBackend

Deployment - Frontend

Setting up Hugging Face streamlit space for Frontend

In [ ]:
# Try to create the repository for the Hugging Face Space

try:
    create_repo("SuperKartFrontend",
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "streamlit" to create a Streamlit space
        private=False  # Set to True if you want the space to be private
    )

# Handle potential errors during repository creation
except Exception as e:
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")
In [ ]:
# Create the directory if it doesn't exist and then write the file
import os
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files", exist_ok=True)

Dockerfile

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/Dockerfile"
# Use Python base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy all files into the container
COPY . /app

# Install dependencies
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# Expose port for Streamlit
EXPOSE 7860

# Run Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
Writing /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/Dockerfile

Streamlit UI - app.py

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/app.py"

# import
import streamlit as st
import pandas as pd
import requests

# Streamlit UI
st.title("SuperKart Sales Prediction App")
st.write("Predict store sales based on product and store attributes.")

# Input fields for product and store data
Product_Weight = st.number_input("Product Weight", min_value=0.0, value=12.66)
Product_Sugar_Content = st.selectbox("Product Sugar Content", ["Low Sugar", "Regular", "No Sugar"])
Product_Allocated_Area = st.number_input("Product Allocated Area", min_value=0.0, step=0.1)
Product_MRP = st.number_input("Product MRP", min_value=0.0, step=0.1)
Store_Size = st.selectbox("Store Size", ["Small", "Medium", "High"])
Store_Location_City_Type = st.selectbox("Store Location City Type", ["Tier 1", "Tier 2", "Tier 3"])
Store_Type = st.selectbox("Store Type",["Supermarket Type2", "Departmental Store", "Supermarket Type1", "Food Mart"])
Product_Id_char = st.selectbox("Product Id Char", ["FD", "NC", "DR"])
Store_Age_Years = st.number_input("Store Age Years", min_value=0, step=1)
Product_Type_Category = st.selectbox("Product Type Category", ["Perishables", "Non Perishables"])

input_data = pd.DataFrame([{
    'Product_Weight': Product_Weight,
    'Product_Sugar_Content': Product_Sugar_Content,
    'Product_Allocated_Area': Product_Allocated_Area,
    'Product_MRP': Product_MRP,
    'Store_Size': Store_Size,
    'Store_Location_City_Type': Store_Location_City_Type,
    'Store_Type': Store_Type,
    'Product_Id_char': Product_Id_char,
    'Store_Age_Years': Store_Age_Years,
    'Product_Type_Category': Product_Type_Category
}])


# Predict button
if st.button("Predict"):
    try:
        response = requests.post(
            "https://randley7-SuperKartBackend.hf.space/v1/salesprice",
            json=input_data.to_dict(orient='records')[0]
        )
        if response.status_code == 200:
            prediction = response.json().get("Predicted Price", "No prediction returned")
            st.success(f"Predicted Sales Price: {prediction}")
        else:
            st.error("Error making prediction.")
            st.text(response.text)
    except Exception as e:
        st.error(f"Exception occurred: {e}")

# ----------------- Batch Prediction -----------------
st.subheader("Batch Prediction")

uploaded_file = st.file_uploader("Upload CSV file for batch prediction", type=["csv"])

if uploaded_file is not None:
    if st.button("PredictBatch"):
        try:
            files = {"file": (uploaded_file.name, uploaded_file, "text/csv")}
            response = requests.post(
                "https://randley7-SuperKartBackend.hf.space/v1/salespricebatch",
                files=files
            )
            if response.status_code == 200:
                predictions = response.json()
                st.success("Batch predictions completed!")

                # Convert to DataFrame and display
                df_predictions = pd.DataFrame(predictions)
                st.dataframe(df_predictions)

                # Download button
                csv = df_predictions.to_csv(index=False).encode('utf-8')
                st.download_button(
                    label="Download Predictions as CSV",
                    data=csv,
                    file_name="SuperKart_Predicted_Sales.csv",
                    mime="text/csv"
                )
            else:
                st.error("Error making batch prediction.")
                st.text(response.text)
        except Exception as e:
            st.error(f"Exception occurred: {e}")
Overwriting /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/app.py

Dependencies File

In [ ]:
%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10
Overwriting /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/requirements.txt

Uploading files for Hugging Face Space for the Frontend

In [ ]:
# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "randley7/SuperKartFrontend"

# Initialize the API
api = HfApi()

#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/"

# Upload Streamlit app files
api.upload_folder(folder_path=folder_path, repo_id=repo_id,repo_type="space")

print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")
Files from /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/ successfully uploaded to the Hugging Face Space: randley7/SuperKartFrontend

Interfacing using Flask API

Prediction

In [ ]:
# Import the necessary libraries
import json
import requests
import pandas as pd
import numpy as np

#Base URL of the deployed Flask API on Hugging Face Space
model_root_url = "https://randley7-SuperKartBackend.hf.space"

#Endpoint for single inference
model_url = model_root_url + "/v1/salesprice"

#Payload with necessary features for single inference prediction
payload = {
    'Product_Weight': 12.66,
    'Product_Sugar_Content': "Low Sugar",
    'Product_Allocated_Area': 0.20,
    'Product_MRP': 0.30,
    'Store_Size': "Small",
    'Store_Location_City_Type': "Tier 1",
    'Store_Type': "Supermarket Type2",
    'Product_Id_char': "FD",
    'Store_Age_Years': 10,
    'Product_Type_Category': "Non Perishables"
}

#sending a POST request to the model endpoint with the payload
response = requests.post(model_url, json=payload)

print(model_url)
print(response)

# ALWAYS print the raw response text for debugging
print("Raw response text:")
print(response.text)

# Check if the response is successful (status code 200) before trying to parse JSON
if response.status_code == 200:
    try:
        # Attempt to parse the JSON
        print("Parsed JSON response:")
        print(response.json())
    except json.JSONDecodeError as e:
        print(f"JSON Decode Error occurred: {e}")
        print("Could not parse response as JSON despite 200 status code.")
else:
    # If the response was not successful, print the status code and the raw text
    print(f"Error: Received status code {response.status_code}")
    print("Response content (if any):")
    print(response.text) # Print raw text to see the error message from the backend
https://randley7-SuperKartBackend.hf.space/v1/salesprice
<Response [200]>
Raw response text:
{"Predicted Price":3547.64}

Parsed JSON response:
{'Predicted Price': 3547.64}

Observations:

Observations – Backend and Frontend Integration

  • The frontend Streamlit application successfully communicates with the backend Flask API using HTTP POST requests.

  • Real-time predictions are displayed in the UI, confirming seamless data flow from user input → backend inference → frontend response.

  • The same API endpoint supports both programmatic access (via Python requests) and UI-based interaction, increasing system flexibility.

  • The frontend correctly parses backend JSON responses and presents predictions in a user-friendly format.

  • The batch prediction workflow is fully integrated, allowing users to upload CSV files and receive multiple predictions in one request.

  • Download functionality for batch prediction results enhances usability and supports real-world analytical workflows.


Observations – Deployment Validation and System Robustness

  • The backend and frontend are deployed as independent Hugging Face Spaces, ensuring modularity and easier maintenance.

  • Dockerized deployments ensure consistent runtime environments and eliminate dependency conflicts.

  • Version-pinned dependencies in requirements.txt improve reproducibility and long-term stability.

  • The serialized model pipeline (including preprocessing) ensures identical transformations during both training and inference.

  • Successful predictions from both direct API calls and the UI confirm end-to-end system correctness.

  • The deployed system demonstrates production readiness with clear endpoints, validation checks, and scalable architecture.


Observations – Interfacing Using Flask API

  • The Flask API is successfully deployed on Hugging Face Spaces and is accessible via a public HTTPS endpoint, enabling external inference requests.

  • A RESTful design is followed, with a dedicated /v1/salesprice endpoint for single predictions that accepts JSON payloads.

  • Input features in the API payload exactly match the features used during model training, ensuring schema consistency and preventing inference mismatches.

  • The API correctly returns HTTP 200 responses for valid requests, confirming proper request handling and inference execution.

  • JSON responses are well-structured and include a clearly labeled Predicted Price, facilitating easy consumption by downstream applications.

  • Robust debugging practices are demonstrated by logging raw response text and safely handling JSON decoding.

  • Error handling is implemented to capture invalid payloads, missing keys, or unexpected runtime issues, improving API reliability.


Overall Observation

The project demonstrates a complete, production-grade machine learning deployment pipeline, covering model training, evaluation, selection, serialization, backend API development, frontend integration, and cloud deployment. The seamless interaction between components validates both the technical soundness and practical usability of the solution.

Actionable Insights and Business Recommendations

SuperKart Sales Prediction Project

Actionable Insights

  1. Product Characteristics Strongly Influence Sales Outcomes
  • Product attributes such as MRP, weight, sugar content, and category (perishable vs non-perishable) play a significant role in predicting sales value.

  • Products with optimized pricing and appropriate shelf allocation tend to generate higher predicted sales.

  • Perishable and non-perishable products show distinct sales behavior, indicating different demand dynamics.

Insight: Sales performance is highly sensitive to product-level decisions rather than being driven by a single store factor.


  1. Store Attributes Drive Demand Variability
  • Store size, store type, city tier, and store age contribute meaningfully to sales predictions.

  • Larger stores and stores located in higher-tier cities tend to exhibit stronger sales potential.

  • Older stores show more stable and predictable sales patterns, likely due to established customer bases.

Insight: Store-level heterogeneity must be accounted for when planning inventory and pricing strategies.


  1. Machine Learning Models Capture Non-Linear Sales Drivers
  • Tree-based ensemble models (Random Forest and XGBoost) significantly outperform simpler models.

  • The selected Random Forest model demonstrates consistent generalization, with minimal performance gap between training and test data.

  • Hyperparameter tuning improves stability but yields marginal gains over the base ensemble models.

Insight: Sales relationships are non-linear, and ensemble models are well-suited for capturing complex interactions between product and store features.


  1. Model Generalization Indicates Reliable Forecasting
  • Comparable RMSE, MAE, and R² values across training and test sets indicate low overfitting.

  • The model’s performance consistency suggests it can be trusted for real-world sales estimation scenarios.

Insight: The model can be reliably used for operational and tactical decision-making rather than just exploratory analysis.


  1. Deployment Enables Real-Time and Scalable Decision Support
  • The Flask API enables real-time inference for individual product–store combinations.

  • Batch prediction support allows large-scale forecasting across catalogs and store networks.

  • The Streamlit frontend provides accessibility for non-technical business users.

Insight: The solution moves beyond analytics into an operational decision-support system.



Business Recommendations

  1. Optimize Product Pricing and Placement Strategy
  • Use the model to simulate sales outcomes for different MRP and shelf-space allocation combinations.

  • Identify price bands that maximize sales without eroding demand, especially for high-volume categories.

  • Allocate premium shelf space to products with higher predicted sales impact.

Recommendation: Integrate the model into pricing and merchandising decisions to maximize revenue per square foot.


  1. Implement Store-Specific Inventory Planning
  • Adjust inventory levels based on store size, location tier, and store maturity.

  • Avoid uniform inventory policies across all stores, as demand patterns vary significantly.

  • Use batch predictions to forecast store-level demand before replenishment cycles.

Recommendation: Move from centralized inventory planning to store-cluster-based demand forecasting.


  1. Support New Store and Product Launch Decisions
  • Use the model to estimate expected sales for new products or newly opened stores using proxy attributes.

  • Evaluate product–store fit before rollout to reduce launch risk.

  • Prioritize high-potential store locations and product categories.

Recommendation: Use predictive insights to de-risk expansion and new product introduction strategies.


  1. Enhance Promotional and Marketing Effectiveness
  • Identify products with high baseline demand and amplify them during promotional campaigns.

  • Avoid over-promoting products with inherently low predicted demand.

  • Tailor promotions based on store location and customer demographics inferred from city tiers.

Recommendation: Shift from blanket promotions to data-driven, targeted marketing campaigns.


  1. Enable Sales and Category Teams with Self-Service Analytics
  • Provide business teams access to the deployed Streamlit application for scenario analysis.

  • Allow category managers to test “what-if” scenarios by adjusting product and store attributes.

  • Reduce dependency on technical teams for routine sales forecasting.

Recommendation: Democratize predictive insights to improve agility and decision-making speed.


  1. Integrate the Model into Core Business Systems
  • Embed the API into ERP, inventory management, or demand planning systems.

  • Automate daily or weekly batch predictions for operational planning.

  • Continuously retrain the model with new sales data to maintain accuracy.

Recommendation: Treat the model as a living system, not a one-time analytical output.


Strategic Impact Summary

  • Revenue Growth: Improved pricing and assortment decisions driven by predictive insights.

  • Cost Reduction: Lower inventory holding and wastage through accurate demand estimation.

  • Operational Efficiency: Faster, data-backed decisions enabled by real-time inference.

  • Scalability: Cloud deployment supports expansion across regions and product lines.

  • Competitive Advantage: Advanced analytics capability embedded into everyday business operations.


Final Insight

The SuperKart Sales Prediction system transforms historical data into actionable intelligence, enabling the organization to move from reactive decision-making to proactive, predictive retail strategy.

Links of the Hugging Face spaces:

Screenshot 2026-01-09 164241.png