By Randley Morales¶

Problem Statement¶

Business Context¶

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective¶

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description¶

The data contains the different attributes of the various products and stores.The detailed data dictionary is given below.

Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
Product_Weight - weight of each product
Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
Product_MRP - maximum retail price of each product
Store_Id - unique identifier of each store
Store_Establishment_Year - year in which the store was established
Store_Size - size of the store depending on sq. feet like high, medium and low
Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

Installing and Importing the necessary libraries¶

# Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q

   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 kB 7.0 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223.6/223.6 MB 5.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 558.7/558.7 kB 25.4 MB/s eta 0:00:00

Note:

After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

import warnings
warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For splitting the dataset
from sklearn.model_selection import train_test_split

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries different ensemble classifiers
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Libraries to get different metric scores
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)

# To create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# To tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# To serialize the model
import joblib

# os related functionalities
import os

# API request
import requests

# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

import math

Loading the dataset¶

# Connect to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# Load the dataset from a CSV file into a Pandas DataFrame
kart = pd.read_csv("/content/drive/MyDrive/Model Deployment/Full_Code/SuperKart.csv")

# Make a copy of Kart
data = kart.copy()

Data Overview¶

View the first and last 5 rows of the dataset¶

# The first 5 rows of the dataset
data.head()

# The last 5 rows of the dataset
data.tail()

Understand the shape of the dataset¶

# Checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")

There are 8763 rows and 12 columns.

# Display the column names of the dataset
data.columns

Index(['Product_Id', 'Product_Weight', 'Product_Sugar_Content',
       'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Id',
       'Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type',
       'Store_Type', 'Product_Store_Sales_Total'],
      dtype='object')

Check the data types of the columns for the dataset¶

# Checking column datatypes and number of non-null values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Checking for duplicate values¶

# Checking for duplicate values
data.duplicated().sum()

np.int64(0)

Checking for missing values¶

# Checking for missing values
data.isnull().sum()

Let's check the statistical summary of the data.¶

# Statistical summary of the data for both numerical and categorical columns
data.describe(include='all').T

Observations:¶

📌 Observations on the Dataset

Data Types & Structure

The dataset has 12 columns:
- 4 numerical (float/int)
  - Product_Weight (float)
  - Product_Allocated_Area (float)
  - Product_MRP (float)
  - Store_Establishment_Year (int)
  - Product_Store_Sales_Total (float – target)
- 7 categorical (object)
  - Product and store identifiers and descriptors.
Data types are appropriate and consistent with business meaning.
Memory usage is low (~822 KB), making it efficient for experimentation.

Missing Values

No missing values across all columns.
This eliminates the need for:
- Imputation strategies
- Row/column removal
The dataset is clean and ready for modeling.

Duplicate Records

0 duplicate rows found.
Each product–store combination appears to be unique, improving data reliability.
No deduplication steps are required.

Categorical Feature Insights

High-cardinality column:
- Product_Id → 8,763 unique values
- Acts more like an identifier than a predictive feature.
- Should be removed or feature-engineered (e.g., prefix extraction).
Low to moderate cardinality columns:
- Product_Sugar_Content → 4 categories (Low Sugar most frequent)
- Product_Type → 16 categories (Fruits & Vegetables most common)
- Store_Id → 4 stores
- Store_Size → 3 levels (Medium dominant)
- Store_Location_City_Type → 3 tiers (Tier 2 most frequent)
- Store_Type → 4 types (Supermarket Type2 dominant)

Numerical Feature Distribution

Product_Weight

Mean ≈ 12.65
Range: 4 to 22
Fairly symmetric distribution with moderate variance.

Product_Allocated_Area

Mean ≈ 0.069
Highly right-skewed (most products have small shelf area).
Likely a strong driver of sales visibility.

Product_MRP

Mean ≈ 147
Range: 31 to 266
Wide pricing range suggests varied product positioning.

Store_Establishment_Year

Range: 1987 to 2009
Median ≈ 2009
Can be converted into Store_Age for better interpretability.

Target Variable (Product_Store_Sales_Total)

Mean ≈ 3464
Standard deviation ≈ 1066
Range: 33 to 8000
Indicates:
- High variability in sales
- Possible outliers
Distribution is likely right-skewed, which tree-based models handle well.

Modeling Implications

Dataset is fully clean (no missing or duplicate values).
Strong mix of:
- Product-level features
- Store-level features
Tree-based regressors (Random Forest, XGBoost) are ideal.
Feature engineering opportunities:
- Drop or transform Product_Id
- Create Store_Age

Summary

The dataset consists of 8,763 clean and duplicate-free records with a balanced mix of numerical and categorical features. There are no missing values, and the target variable shows significant variability, making the dataset suitable for regression modeling using ensemble-based methods with appropriate feature engineering.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

#Function to plot a boxplot and a histogram
def histogram_boxplot(
    data,
    feature,
    figsize=(12, 7),
    kde=True,
    bins="auto",
    title=None,
    color="#8b5cf6",
    hist_alpha=0.35,
    show_stats_box=True,
    show_stats_subtitle=True,
    plot_gap=0.18,
    title_y=0.98,
    top_margin=0.90,
):
    """
    Combined boxplot + histogram for a numeric feature, with optional stats.
    """
    sns.set_theme(style="whitegrid", context="notebook")

    x = data[feature].dropna()
    if x.empty:
        raise ValueError(f"Column '{feature}' has no non-null values to plot.")

    # --- Summary stats ---
    n = x.shape[0]
    std = x.std()
    min_v = x.min()
    max_v = x.max()
    mean_v = x.mean()
    median_v = x.median()

    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,
        figsize=figsize,
        gridspec_kw={"height_ratios": (0.28, 0.72), "hspace": plot_gap},
    )

    # --- Boxplot ---
    sns.boxplot(
        x=x,
        ax=ax_box,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.3),
        boxprops=dict(linewidth=1.3),
    )
    ax_box.set(xlabel="")
    ax_box.set_yticks([])
    sns.despine(ax=ax_box, left=True, bottom=True)

    # --- Histogram ---
    sns.histplot(
        x=x,
        ax=ax_hist,
        bins=bins if bins is not None else "auto",
        kde=kde,
        color=color,
        alpha=hist_alpha,
        edgecolor="white",
        linewidth=1,
    )

    # Mean/Median lines
    ax_hist.axvline(mean_v, color="#16a34a", linestyle="--", linewidth=2, label=f"Mean: {mean_v:,.2f}")
    ax_hist.axvline(median_v, color="#111827", linestyle="-", linewidth=2, label=f"Median: {median_v:,.2f}")
    ax_hist.legend(frameon=True, fontsize=10, loc="upper right")

    ax_hist.set_ylabel("Count")
    ax_hist.set_xlabel(feature)
    sns.despine(ax=ax_hist)

    # --- Title + subtitle (stats line) ---
    main_title = title or f"Distribution of {feature}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if show_stats_subtitle:
        subtitle = f"n={n:,}   std={std:,.2f}   min={min_v:,.2f}   max={max_v:,.2f}"
        # place subtitle just below suptitle
        fig.text(0.5, title_y - 0.045, subtitle, ha="center", va="top", fontsize=11)

    # --- Stats box inside histogram ---
    if show_stats_box:
        stats_text = (
            f"n = {n:,}\n"
            f"std = {std:,.2f}\n"
            f"min = {min_v:,.2f}\n"
            f"max = {max_v:,.2f}"
        )
        ax_hist.text(
            0.01, 0.98, stats_text,
            transform=ax_hist.transAxes,
            va="top", ha="left",
            fontsize=10,
            bbox=dict(boxstyle="round,pad=0.35", facecolor="white", edgecolor="#e5e7eb", alpha=0.95),
        )

    # Make room for title/subtitle
    fig.subplots_adjust(top=top_margin)

    return fig, (ax_box, ax_hist)

Product Weight¶

# Product Weight
histogram_boxplot(data, "Product_Weight", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:¶

📊 Univariate Analysis – Product_Weight

Distribution Shape

The distribution of Product_Weight is approximately normal (bell-shaped).
Mean (12.65) and median (12.66) are almost identical, indicating a highly symmetric distribution.
The KDE curve confirms no strong skewness.

📌 Implication:

Since the feature is close to normally distributed, no transformation (log/sqrt) is required.

Central Tendency & Spread

Mean: ~12.65
Median: ~12.66
Standard Deviation: ~2.22

This suggests:

Most products cluster tightly around the mean.
Product weights are well standardized, which is common in retail packaging.

Range & Variability

Minimum: 4.0
Maximum: 22.0
Interquartile Range (IQR): roughly between 11 and 14

Most product weights lie in a narrow, realistic range, showing controlled product sizing.

Outliers

A few outliers exist on both lower and upper ends:
- Very light products (~4–6)
- Very heavy products (~20–22)
These outliers are business-valid, not data errors (e.g., small sachets vs bulk items).

📌 Implication:

Outliers should not be removed, especially when using tree-based models (Random Forest, XGBoost), which are robust to them.

Data Quality

No missing values observed.
No abnormal spikes or irregular gaps.
Distribution aligns well with real-world retail data.

Summary

Product_Weight follows a near-normal distribution with minimal skewness and reasonable variability. The presence of a few valid outliers reflects real-world product diversity. No transformation or outlier treatment is required, making it a stable and reliable feature for modeling.

Product Allocated Area¶

# Product Allocated Area
histogram_boxplot(data, "Product_Allocated_Area", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:¶

📊 Univariate Analysis – Product_Allocated_Area

Distribution Shape

The distribution of Product_Allocated_Area is highly right-skewed (positively skewed).
Most values are concentrated toward the lower end, with a long tail extending to the right.
The KDE curve confirms a non-normal distribution.

📌 Implication:

This feature does not follow a normal distribution, and skewness should be considered during modeling.

Central Tendency

Mean ≈ 0.07
Median ≈ 0.06
Mean is greater than the median, which is characteristic of right-skewed data.

This indicates that:

A small number of products receive disproportionately large shelf space.
Most products occupy relatively limited display area.

Range & Variability

Minimum: ~0.00
Maximum: ~0.30
Standard deviation: ~0.05

The wide spread relative to the mean suggests:

Significant variation in shelf allocation across products.
Shelf space is a highly differentiated business decision.

Outliers

The boxplot reveals multiple upper-end outliers.
These represent products with exceptionally high shelf visibility.
These outliers are business-driven and meaningful, not data errors.

📌 Implication:

Outliers should not be removed, especially for tree-based models that can leverage them effectively.

Business Interpretation

Products with higher allocated area likely:
- Are high-demand or fast-moving items
- Have stronger brand presence or promotional support
Shelf space is expected to have a direct positive impact on sales.

Summary

Product_Allocated_Area exhibits a strongly right-skewed distribution with several meaningful upper-end outliers, indicating that most products receive limited shelf space while a few receive significantly higher visibility. This feature is expected to have a strong influence on sales and should be retained without outlier removal.

Product MRP¶

# Product MRP
histogram_boxplot(data, "Product_MRP", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:¶

📊 Univariate Analysis – Product_MRP

Distribution Shape

The distribution of Product_MRP is approximately normal (bell-shaped).
The KDE curve shows a symmetric pattern around the center.
Mean (147.03) and median (146.74) are almost identical, indicating very low skewness.

📌 Implication:

No transformation (log/sqrt) is required for this feature.

Central Tendency & Spread

Mean: ~147.03
Median: ~146.74
Standard Deviation: ~30.69

This indicates:

Moderate variability in product pricing.
Prices are well-distributed across a mid-range retail spectrum.

Range & Pricing Segments

Minimum: 31
Maximum: 266

This suggests the presence of:

Low-priced, mass-market products
High-priced, premium products

The dataset covers a wide price band, making it informative for modeling sales behavior.

Outliers

A small number of outliers on both ends of the price spectrum:
- Very low-priced items
- Premium-priced products
These outliers are realistic and business-valid, not data issues.

📌 Implication:

Outliers should be retained, especially for tree-based models that handle them naturally.

Business Interpretation

Product_MRP is expected to have a strong influence on sales revenue:
- Higher-priced products contribute more to total sales value
- Interaction with volume and shelf space is likely
It may interact with:
- Product_Allocated_Area
- Store_Type
- Store_Location_City_Type

Summary

Product_MRP follows an approximately normal distribution with minimal skewness and a wide price range. The presence of valid low- and high-priced products makes it a strong and reliable predictor of sales without requiring transformation or outlier treatment.

Product Store Sales Total¶

# Product Store Sales Total
histogram_boxplot(data, "Product_Store_Sales_Total", show_stats_box=False, show_stats_subtitle=True)
plt.show()

Observations:¶

📊 Univariate Analysis – Product_Store_Sales_Total

Distribution Shape

The distribution of Product_Store_Sales_Total is approximately bell-shaped with slight right skew.
The KDE curve peaks around the center and tapers gradually on both sides.
Mean (3,464) and median (3,452) are very close, indicating near-symmetry.

📌 Implication:

The target variable is well-behaved, making it suitable for a wide range of regression models.

Central Tendency & Variability

Mean: ~3,464
Median: ~3,452
Standard Deviation: ~1,066

This indicates:

Significant variation in sales across products and stores.
Sales performance differs meaningfully depending on product and store characteristics.

Range of Sales

Minimum: ~33
Maximum: ~8,000

This wide range suggests:

Some products have very low sales, possibly due to low demand or poor placement.
High-performing products contribute substantially higher revenue.

Outliers

Boxplot shows outliers on both lower and upper ends:
- Very low sales (near zero)
- Extremely high sales (>6,000)
These values are business-realistic and expected in retail data.

📌 Implication:

Outliers should not be removed, as they represent genuine business scenarios and carry important signals.

Business Interpretation

Sales distribution reflects:
- A majority of products generating moderate sales
- A small proportion of high-performing products driving revenue
This aligns with the Pareto principle (80/20 rule) commonly seen in retail.

Summary

Product_Store_Sales_Total shows a near-normal distribution with moderate variability and meaningful outliers. The wide sales range reflects real-world retail behavior, making the target variable suitable for regression modeling without aggressive transformation or outlier treatment.

# Function to create labeled barplots
def labeled_barplot(
    data,
    feature,
    perc=False,
    n=None,
    figsize=None,
    title=None,
    color="#8b5cf6",
    rotate=45,
    show_stats_subtitle=True,
):

    sns.set_theme(style="whitegrid", context="notebook")

    s = data[feature]
    total = len(s)
    missing = int(s.isna().sum())

    # Use a fill value so missing categories can be seen (optional but helpful)
    plot_s = s.fillna("Missing")

    # Build counts and optionally select top-n
    vc = plot_s.value_counts(dropna=False)
    if n is not None:
        vc = vc.head(n)

    order = vc.index.tolist()
    n_cat = len(order)

    # Auto figure sizing
    if figsize is None:
        width = max(8, min(16, 1.1 * n_cat + 2))
        figsize = (width, 6)

    fig, ax = plt.subplots(figsize=figsize)

    # Bars
    sns.countplot(
        x=plot_s,
        order=order,
        color=color,
        ax=ax,
        edgecolor="white",
        linewidth=1,
    )

    # Titles (similar to the previous function style)
    main_title = title or f"Distribution of {feature}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=0.98)

    if show_stats_subtitle:
        subtitle = f"rows={total:,}   unique={s.nunique(dropna=True):,}   missing={missing:,}"
        fig.text(0.5, 0.94, subtitle, ha="center", va="top", fontsize=11)

    # Axis labels
    ax.set_xlabel(feature)
    ax.set_ylabel("Percentage" if perc else "Count")

    # Nice tick labels
    ax.tick_params(axis="x", rotation=rotate)
    for tick in ax.get_xticklabels():
        tick.set_horizontalalignment("right" if rotate else "center")

    # Value labels on bars
    ymax = 0
    for p in ax.patches:
        h = p.get_height()
        ymax = max(ymax, h)

        if perc:
            label = f"{(h / total) * 100:.1f}%"
        else:
            label = f"{int(h):,}"

        ax.annotate(
            label,
            (p.get_x() + p.get_width() / 2, h),
            ha="center",
            va="bottom",
            fontsize=11,
            xytext=(0, 4),
            textcoords="offset points",
        )

    # Add headroom so labels don't touch the top
    ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    # Clean spines
    sns.despine(ax=ax)

    # Layout room for suptitle/subtitle
    fig.tight_layout(rect=[0, 0, 1, 0.92])

    plt.show()

Product Sugar Content¶

# Product Sugar Content
labeled_barplot(data, "Product_Sugar_Content", perc=True)

Observations:¶

📊 Univariate Analysis – Product_Sugar_Content

Category Distribution

The variable Product_Sugar_Content has 4 distinct categories:

Category	Approx. Percentage
Low Sugar	~55.7%
Regular	~25.7%
No Sugar	~17.3%
reg	~1.2%

Low Sugar products dominate the dataset, accounting for more than half of all observations.
Regular sugar products form the second-largest segment.
No Sugar products represent a smaller but meaningful portion.
The category reg appears to be a label inconsistency, not a true separate category.

Data Quality Insight

There are no missing values in this feature.
The presence of both "Regular" and "reg" indicates a data quality issue due to inconsistent labeling.

📌 Required Action:

"reg" should be merged with "Regular" to avoid:

Incorrect category inflation
Unnecessary dummy variables after encoding

Class Imbalance

The distribution is moderately imbalanced:
- Low Sugar products heavily outweigh other categories.
However, all categories still have sufficient representation.

📌 Implication:

This imbalance is not severe and does not require resampling, but it should be kept in mind during model interpretation.

Business Interpretation

The dominance of Low Sugar products reflects:
- Increasing consumer preference for healthier food options
- Retail strategy focused on health-conscious offerings
No Sugar products may cater to:
- Niche markets
- Specific dietary needs (e.g., diabetic-friendly products)

Summary

Product_Sugar_Content is a categorical feature dominated by Low Sugar products, indicating a health-oriented product mix. A minor labeling inconsistency (reg vs Regular) must be resolved before modeling. The feature shows meaningful variation and is suitable for one-hot encoding.

Product Type¶

# Product Type
labeled_barplot(data, "Product_Type", perc=True)

Observations:¶

📊 Univariate Analysis – Product_Type

Category Distribution

Product_Type contains 16 distinct categories, indicating a diverse product portfolio.
The distribution is uneven, with certain categories contributing significantly more products than others.

Top contributing categories:

Fruits and Vegetables – ~14.3% (highest)
Snack Foods – ~13.1%
Frozen Foods – ~9.3%
Dairy – ~9.1%
Household – ~8.4%

These categories together account for more than half of the dataset.

Mid-Tier Categories

Baking Goods, Canned, Health and Hygiene, Meat each contribute 7–8%.
These categories represent stable, essential consumer goods with consistent presence across stores.

Low-Frequency Categories

Soft Drinks – ~5.9%
Breads, Hard Drinks, Others, Starchy Foods, Breakfast, Seafood each contribute less than 3%.
Seafood has the lowest representation (~0.9%).

📌 Implication:

Some categories are sparsely represented, which may limit their standalone predictive power.

Data Quality & Completeness

No missing values observed.
Category labels are clean and interpretable.
No obvious inconsistencies or noise in category naming.

Business Interpretation

Dominance of Fruits & Vegetables and Snack Foods suggests:
- High demand and fast-moving inventory
- Frequent replenishment cycles
Lower presence of categories like Seafood and Breakfast may reflect:
- Supply constraints
- Lower consumer demand
- Store-type or location-based limitations

Summary

Product_Type shows a diverse yet imbalanced distribution, with Fruits and Vegetables and Snack Foods dominating the product mix. While most categories are well-represented, a few low-frequency types may contribute limited predictive power. The feature is clean and suitable for one-hot encoding with minimal preprocessing.

Store Id¶

# Store ID
labeled_barplot(data, "Store_Id", perc=True)

Observations:¶

📊 Univariate Analysis – Store_Id

Category Distribution

Store_Id has 4 unique stores.
The distribution is highly imbalanced across stores:

Store_Id	Approx. Share
OUT004	~53.4%
OUT001	~18.1%
OUT003	~15.4%
OUT002	~13.1%

OUT004 alone contributes more than half of all observations.

Data Quality

No missing values detected.
Store identifiers are clean and consistent.

Business Interpretation

The dominance of OUT004 suggests:
- Larger store size
- Higher product assortment
- Possibly higher footfall or longer operational history
Smaller representation from other stores may reflect:
- Smaller physical size
- Lower product variety
- Different regional demand

Summary

Store_Id shows a highly imbalanced distribution, with OUT004 contributing over half of the observations. This reflects real operational differences across stores and should be preserved as a categorical feature during modeling.

Store Size¶

# Store Size
labeled_barplot(data, "Store_Size", perc=True)

Observations:¶

📊 Univariate Analysis – Store_Size

Category Distribution

Store_Size has 3 distinct categories: Small, Medium, High.
The distribution is highly skewed toward Medium-sized stores:

Store Size	Approx. Share
Medium	~68.8%
High	~18.1%
Small	~13.1%

Nearly two-thirds of all observations come from Medium-sized stores.

Data Quality

No missing values are present.
Store size categories are clean, consistent, and interpretable.

Business Interpretation

The dominance of Medium-sized stores suggests:
- SuperKart’s primary operational focus is on mid-sized outlets.
- These stores likely balance product variety and operating costs efficiently.
High-sized stores may:
- Carry wider assortments
- Generate higher total sales per store
Small stores may:
- Have limited shelf space
- Focus on essential or fast-moving products

Summary

Store_Size is a clean categorical feature dominated by Medium-sized stores, reflecting the retailer’s core store format. The feature is business-relevant and likely to significantly influence sales performance.

Store Location City Type¶

# Store Location City Type
labeled_barplot(data, "Store_Location_City_Type", perc=True)

Observations:¶

📊 Univariate Analysis – Store_Location_City_Type

Category Distribution

Store_Location_City_Type has 3 distinct categories: Tier 1, Tier 2, Tier 3.
The distribution is heavily skewed toward Tier 2 cities:

City Tier	Approx. Share
Tier 2	~71.5%
Tier 1	~15.4%
Tier 3	~13.1%

Over 70% of all observations come from Tier 2 locations.

Data Quality

No missing values present.
City tier labels are consistent and well-defined.

Business Interpretation

The dominance of Tier 2 cities suggests:
- SuperKart’s strategic focus on fast-growing urban markets
- Lower operational costs compared to Tier 1 cities
Tier 1 cities:
- Likely have higher purchasing power
- May generate higher revenue per product
Tier 3 cities:
- Possibly lower demand
- More price-sensitive customer base

Summary

Store_Location_City_Type is dominated by Tier 2 cities, indicating a strategic focus on mid-tier urban markets. The feature is clean, business-relevant, and expected to have a significant impact on sales performance.

Store Type¶

# Store Type
labeled_barplot(data, "Store_Type", perc=True)

Observations:¶

📊 Univariate Analysis – Store_Type

Category Distribution

Store_Type has 4 distinct categories:
- Supermarket Type1
- Supermarket Type2
- Departmental Store
- Food Mart
The distribution is clearly dominated by Supermarket Type2:

Store Type	Approx. Share
Supermarket Type2	~53.4%
Supermarket Type1	~18.1%
Departmental Store	~15.4%
Food Mart	~13.1%

More than half of all observations come from Supermarket Type2 stores.

Data Quality

No missing values detected.
Store type labels are consistent and business-meaningful.

Business Interpretation

The dominance of Supermarket Type2 suggests:
- These stores may be larger, better stocked, or more strategically located.
- They likely generate higher sales volumes due to better infrastructure and assortment.
Supermarket Type1 and Departmental Stores provide moderate coverage.
Food Marts represent smaller, possibly neighborhood-focused outlets.

Summary

Store_Type is dominated by Supermarket Type2 stores, indicating a core store format that likely drives the majority of sales. The feature is clean, well-distributed, and highly relevant for sales prediction modeling.

Bivariate Analysis¶

Correlation matrix¶

# Correlation Matrix
def nice_corr_heatmap_complete(
    data,
    cols=None,
    method="pearson",
    figsize=(12, 9),
    cmap="Spectral",
    annot="auto",
    fmt=".2f",
    linewidths=0.6,
    cbar_shrink=0.85,
    title="Correlation Heatmap",
    subtitle=True,
    title_y=0.98,
    top_margin=0.90,
    square=True,
):
    sns.set_theme(style="white", context="notebook")

    if cols is None:
        cols = data.select_dtypes(include=np.number).columns.tolist()
    if len(cols) == 0:
        raise ValueError("No numeric columns found to compute correlation.")

    corr = data[cols].corr(method=method)

    # Auto-annotation to avoid clutter on big matrices
    if annot == "auto":
        annot = corr.shape[0] <= 12

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        corr,
        cmap=cmap,
        vmin=-1, vmax=1, center=0,
        square=square,
        linewidths=linewidths,
        linecolor="white",
        annot=annot,
        fmt=fmt,
        annot_kws={"size": 9} if annot else None,
        cbar_kws={"shrink": cbar_shrink, "pad": 0.02},
        ax=ax,
    )

    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)

    if subtitle:
        rows = len(data)
        n_features = len(cols)
        miss = int(data[cols].isna().sum().sum())
        sub = f"rows={rows:,}   numeric_features={n_features:,}   missing_values_in_matrix={miss:,}   method={method}"
        fig.text(0.5, title_y - 0.045, sub, ha="center", va="top", fontsize=11)

    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)

    fig.subplots_adjust(top=top_margin)
    fig.tight_layout(rect=[0, 0, 1, 0.90])

    plt.show()

# Correlation Heatmap
nice_corr_heatmap_complete(data)

Observations:¶

📊 Bivariate Analysis – Correlation Matrix (Numerical Features)

Numerical Features Considered

The correlation matrix includes 5 numerical variables:

Product_Weight
Product_Allocated_Area
Product_MRP
Store_Establishment_Year
Product_Store_Sales_Total (Target)

Pearson correlation method is used.

🔍 Key Observations & Insights

Strong Positive Correlation with Target

🔹 Product_MRP vs Product_Store_Sales_Total

Correlation ≈ +0.79 (Strong Positive)
This is the strongest correlation with the target.

📌 Interpretation:

Higher-priced products tend to generate higher total sales value, which is expected in revenue-based forecasting.

🔹 Product_Weight vs Product_Store_Sales_Total

Correlation ≈ +0.74 (Strong Positive)

📌 Interpretation:

Heavier products may:

Be sold in larger quantities
Represent premium or bulk items This makes Product_Weight a strong predictor of sales.

Weak / No Correlation with Target

🔹 Product_Allocated_Area vs Product_Store_Sales_Total

Correlation ≈ 0.00 (No Linear Relationship)

📌 Interpretation:

Although shelf space is important from a business perspective, its linear relationship with sales is weak.

However:

This does not mean the feature is useless
The relationship may be non-linear, which tree-based models can capture

Store Age Effect

🔹 Store_Establishment_Year vs Product_Store_Sales_Total

Correlation ≈ −0.19 (Weak Negative)

📌 Interpretation:

Older stores (lower establishment year) tend to have slightly higher sales, possibly due to:

Established customer base
Brand familiarity

This effect is weak but meaningful.

Inter-Feature Correlations

🔹 Product_MRP vs Product_Weight

Correlation ≈ +0.53 (Moderate Positive)

📌 Interpretation:

Heavier products often cost more, which is logically consistent.

⚠️ Multicollinearity Check:

Correlation is moderate, not high enough to cause serious multicollinearity issues.
Safe for use in both linear and tree-based models.

Low Risk of Multicollinearity

No pair of independent variables shows very high correlation (>0.85).
This indicates:
- Stable model training
- Reliable coefficient interpretation (for linear models)

Summary

The correlation analysis shows that Product_MRP and Product_Weight have strong positive relationships with total sales, making them key predictors. Store establishment year shows a weak negative correlation, while product allocated area has no linear relationship, suggesting potential non-linear effects. No severe multicollinearity is observed, supporting the use of all numerical features in modeling.

Let's check the distribution of our target variable i.e Product_Store_Sales_Total with the numeric columns¶

# Function to plot scatterz
def nice_scatterplot(
    data,
    x,
    y="Product_Store_Sales_Total",
    figsize=(8, 6),
    title=None,
    subtitle=True,
    color="#8b5cf6",
    alpha=0.45,
    s=45,
    add_regline=False,   # set True if you want a trend line
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.scatterplot(
        data=data,
        x=x,
        y=y,
        ax=ax,
        color=color,
        alpha=alpha,
        s=s,
        edgecolor="white",
        linewidth=0.6,
    )

    # Optional trend line (nice for relationships)
    if add_regline:
        sns.regplot(
            data=data,
            x=x,
            y=y,
            scatter=False,
            ax=ax,
            ci=None,
            line_kws={"linewidth": 2},
        )

    main_title = title or f"{y} vs {x}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if subtitle:
        n = int(data[[x, y]].dropna().shape[0])
        miss = int(data[[x, y]].isna().any(axis=1).sum())
        fig.text(
            0.5, title_y - 0.045,
            f"points={n:,}   rows_with_missing={miss:,}",
            ha="center", va="top", fontsize=11
        )

    ax.set_xlabel(x)
    ax.set_ylabel(y)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

# 1) Product_Weight vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Weight")

Observations:¶

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Weight

Nature of Relationship

The scatter plot shows a strong positive linear relationship between Product_Weight and Product_Store_Sales_Total.
As product weight increases, total sales value generally increases as well.
This visually confirms the high positive correlation (~0.74) observed in the correlation matrix.

Trend Pattern

Data points form a clear upward-sloping pattern.
The relationship appears approximately linear, especially in the mid-range of product weights (8–18 units).
No abrupt breaks or non-linear curves are visible.

📌 Implication:

Both linear and tree-based models can effectively capture this relationship.

Variability & Spread

For lower weights (≈ 4–7):
- Sales values are generally lower and less dispersed.
For mid to higher weights (≈ 10–18):
- Sales values show greater spread, indicating:
  - Influence of other factors such as price, store type, or shelf space.
Variance slightly increases with weight (mild heteroscedasticity).

Outliers

A few points show:
- Very high sales (>7,000)
- Very low sales at moderate weights
These outliers are business-valid (e.g., premium or bulk products).

📌 Implication:

Outliers should not be removed, as they represent genuine sales behavior.

Business Interpretation

Heavier products often:
- Cost more
- Are sold in bulk or premium segments
This naturally leads to higher revenue per product, explaining the strong positive trend.

Summary

The scatter plot reveals a strong positive linear relationship between Product_Weight and total sales, indicating that heavier products tend to generate higher revenue. The trend aligns with correlation analysis, shows realistic variability, and confirms Product_Weight as a key predictor for sales forecasting.

# 2) Product_Allocated_Area vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_Allocated_Area")

Observations:¶

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_Allocated_Area

Nature of Relationship

The scatter plot shows no strong linear relationship between Product_Allocated_Area and Product_Store_Sales_Total.
Sales values are widely dispersed across all levels of allocated area.
This visually confirms the near-zero correlation observed in the correlation matrix.

📌 Key Insight:

Shelf space alone does not linearly explain sales performance.

Distribution Pattern

Most products have low allocated area (0.00–0.10).
High allocated area values (>0.20) are rare.
Across both low and high shelf space:
- Sales range from very low to very high.
- No clear upward or downward trend is visible.

Variability & Spread

For low allocated area:
- Sales show very high variability.
For higher allocated area:
- Sales remain scattered with no consistent increase.
Variance remains roughly constant, indicating no clear heteroscedastic pattern.

Outliers

A few products with high shelf space but moderate sales.
Some products with low shelf space but very high sales.
These are business-realistic scenarios:
- Popular items sell well even with limited shelf space
- Poor-performing products may still receive promotional space

📌 Implication:

Outliers are meaningful and should be retained.

Business Interpretation

Shelf space allocation is likely:
- Influenced by expected demand, not actual sales alone
- Interacting with other variables such as:
  - Product price
  - Product type
  - Store size
Sales performance is multi-factor driven, not dependent on shelf space alone.

Summary

The scatter plot indicates no clear linear relationship between Product_Allocated_Area and total sales, suggesting that shelf space alone does not drive sales outcomes. However, the feature may still be valuable through non-linear interactions with other product and store attributes.

# 3) Product_MRP vs Product_Store_Sales_Total
nice_scatterplot(data, x="Product_MRP")

Observations:¶

📊 Bivariate Analysis – Product_Store_Sales_Total vs Product_MRP

Nature of Relationship

The scatter plot shows a strong positive linear relationship between Product_MRP and Product_Store_Sales_Total.
As product price increases, total sales revenue consistently increases.
This visually confirms the high positive correlation (~0.79) seen in the correlation matrix.

Trend Pattern

Data points form a clear upward-sloping trend.
The relationship appears almost linear across the entire price range.
Minimal curvature or deviation from linearity is observed.

📌 Implication:

This feature is well-suited for linear regression as well as tree-based models.

Variability & Spread

At lower MRP values (30–80):
- Sales values are generally lower and less dispersed.
At mid to high MRP values (100–220):
- Sales values increase substantially.
- Variability increases slightly, indicating influence of additional factors such as store type and shelf space.

Outliers

A few products exhibit:
- Very high MRP (>250) with high sales
- Moderate MRP with unusually low or high sales
These are business-valid scenarios, not anomalies.

📌 Implication:

Outliers should be retained as they provide valuable information.

Business Interpretation

Higher-priced products:
- Generate more revenue per unit sold
- Are often associated with premium or bulk offerings
This explains why MRP is one of the strongest drivers of sales revenue.

Summary

Product_MRP exhibits a strong positive linear relationship with total sales, indicating that higher-priced products consistently generate higher revenue. This confirms Product_MRP as one of the most influential predictors in the sales forecasting model.

Let us see from which product type the company is generating most of the revenue¶

def revenue_by_category(
    data,
    category,
    revenue_col="Product_Store_Sales_Total",
    top_n=None,
    figsize=None,
    title=None,
    rotate=45,
    color="#8b5cf6",
    show_values=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    # Aggregate revenue
    df_rev = (
        data.groupby(category, dropna=False)[revenue_col]
        .sum()
        .reset_index()
        .sort_values(revenue_col, ascending=False)
    )

    # Optionally limit to top N categories
    if top_n is not None:
        df_rev = df_rev.head(top_n)

    # Auto size
    if figsize is None:
        width = max(9, min(18, 1.1 * len(df_rev) + 3))
        figsize = (width, 6)

    fig, ax = plt.subplots(figsize=figsize)

    sns.barplot(
        data=df_rev,
        x=category,
        y=revenue_col,
        ax=ax,
        color=color,
        edgecolor="white",
        linewidth=1,
    )

    # Titles (same style)
    main_title = title or f"Revenue by {category}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    total_rev = df_rev[revenue_col].sum()
    top_cat = df_rev.iloc[0][category]
    top_rev = df_rev.iloc[0][revenue_col]
    fig.text(
        0.5,
        title_y - 0.045,
        f"total_revenue={total_rev:,.0f}   top={top_cat} ({top_rev:,.0f})",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel(category)
    ax.set_ylabel("Revenue")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    # Value labels on bars
    if show_values:
        ymax = df_rev[revenue_col].max()
        for p in ax.patches:
            h = p.get_height()
            ax.annotate(
                f"{h:,.0f}",
                (p.get_x() + p.get_width() / 2, h),
                ha="center",
                va="bottom",
                fontsize=10,
                xytext=(0, 4),
                textcoords="offset points",
            )
        ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    # If you need the aggregated dataframe later, return it:
    return df_rev

# Revenue by Product Type
df_revenue1 = revenue_by_category(
    data,
    category="Product_Type",
    figsize=(14, 7),
    rotate=60,
    title="Which Product Type Generates the Most Revenue?"
)

Observations:¶

📊 Bivariate Analysis – Revenue by Product_Type

Overall Revenue Contribution

The total revenue across all product types is approximately 30.35 million.
Revenue contribution is unevenly distributed across product categories, indicating that some categories drive a disproportionate share of revenue.

Top Revenue-Generating Product Types

The highest revenue contributors are:

Fruits and Vegetables – ~4.30M (Highest)
Snack Foods – ~3.99M
Dairy – ~2.81M
Frozen Foods – ~2.81M
Household – ~2.56M

📌 Key Insight:

Fruits and Vegetables alone generate the highest revenue, making it the most critical product category for SuperKart.

Mid-Tier Revenue Categories

Baking Goods – ~2.45M
Canned – ~2.30M
Health and Hygiene – ~2.16M
Meat – ~2.13M

These categories:

Contribute steady and meaningful revenue
Represent essential and recurring consumer purchases

Low Revenue-Generating Categories

The lowest revenue contributors are:

Soft Drinks – ~1.80M
Breads – ~0.71M
Hard Drinks – ~0.63M
Others – ~0.54M
Starchy Foods – ~0.52M
Breakfast – ~0.36M
Seafood – ~0.27M (Lowest)

📌 Implication:

These categories either have:

Lower demand
Lower pricing
Limited shelf presence
Or fewer product SKUs

Business Interpretation

High revenue from Fruits & Vegetables and Snack Foods suggests:
- High demand
- Frequent purchases
- Fast inventory turnover
Low-performing categories may require:
- Better promotion
- Optimized pricing
- Improved shelf placement
- Or strategic reduction if margins are low

Relationship with Earlier EDA

Revenue dominance aligns with:
- High representation of these categories in the dataset
- Likely higher product turnover
Confirms that Product_Type is a strong driver of sales revenue, not just sales count.

Summary

Fruits and Vegetables generate the highest revenue for SuperKart, followed by Snack Foods and Dairy. Revenue contribution varies significantly across product types, highlighting Product_Type as a key driver of sales performance and a critical feature for forecasting models.

# Revenue by Product Sugar Content
df_revenue2 = revenue_by_category(
    data,
    category="Product_Sugar_Content",
    figsize=(9, 6),
    rotate=45,
    title="Revenue by Product Sugar Content"
)

Observations:¶

📊 Bivariate Analysis – Revenue by Product_Sugar_Content

Overall Revenue Contribution

The total revenue across all sugar content categories is approximately 30.36 million.
Revenue distribution across sugar content levels is highly imbalanced, indicating strong consumer preference patterns.

Top Revenue-Generating Sugar Category

Sugar Content	Revenue	Share (Approx.)
Low Sugar	~16.82M	~55%
Regular	~7.87M	~26%
No Sugar	~5.27M	~17%
reg	~0.39M	~1%

📌 Key Insight:

Low Sugar products dominate revenue generation, contributing more than half of total sales.

Data Quality Observation

The presence of reg as a separate category:
- Indicates a data inconsistency / labeling issue
- Likely represents “Regular” sugar content

📌 Action Required:

This category should be merged with “Regular” during data cleaning.

Business Interpretation

Strong revenue dominance of Low Sugar products suggests:
- Growing consumer preference for health-conscious options
- Successful product placement and assortment strategy
No Sugar products also contribute meaningfully, reinforcing the health trend.

Strategic Implications

Increase focus on:
- Low Sugar and No Sugar product variants
- Promotions and shelf placement for these categories
Reevaluate Regular sugar products:
- Improve marketing
- Reposition pricing if needed

Summary

Low Sugar products generate the majority of revenue, highlighting a strong consumer shift toward healthier options. Data inconsistency in sugar labeling should be corrected to ensure accurate modeling and insights.

Let us see from which type of stores and locations the revenue generation is more.¶

# Revenue by Store Id
df_store_revenue = revenue_by_category(
    data,
    category="Store_Id",
    title="Revenue by Store",
    rotate=60,
    top_n=15,          # optional: store IDs can be many; keeps it readable
    figsize=(14, 6)
)

Observations:¶

📊 Bivariate Analysis – Revenue by Store_Id

Overall Revenue Distribution

Total revenue across all stores is approximately 30.36 million.
Revenue generation is highly uneven across stores, indicating strong store-level performance differences.

Store-wise Revenue Contribution

Store_Id	Revenue	Approx. Share
OUT004	~15.43M	~51%
OUT003	~6.67M	~22%
OUT001	~6.22M	~21%
OUT002	~2.03M	~7%

📌 Key Insight:

OUT004 alone contributes more than half of the total revenue, making it the most dominant store.

*Performance Gap Analysis

OUT004 significantly outperforms all other stores combined.
OUT003 and OUT001 show similar and moderate performance.
OUT002 generates the least revenue, lagging far behind.

📌 Implication:

Revenue generation is not evenly distributed geographically or operationally.

Business Interpretation

Possible reasons for OUT004’s dominance:

Larger store size
Better product assortment
Higher footfall
Favorable city tier or location
Higher concentration of high-MRP and high-demand products

Conversely, OUT002 may suffer from:

Smaller store size
Lower customer traffic
Less optimal location
Limited inventory mix

Relationship with Earlier EDA

OUT004 also had:
- Highest number of product entries
- Strong presence of high-performing product categories
Reinforces that store characteristics play a major role in revenue generation.

Summary

Revenue generation varies significantly across stores, with OUT004 contributing over half of total revenue. This highlights the critical role of store-specific factors such as size, location, and assortment in driving sales performance.

# Revenue by Store Size
df_revenue3 = revenue_by_category(
    data,
    category="Store_Size",
    title="Revenue by Store Size",
    rotate=0,
    figsize=(8, 6)
)

Observations:¶

📊 Bivariate Analysis – Revenue by Store Size

Overall Revenue Context

Total revenue across all stores is approximately 30.36 million.
Revenue contribution varies significantly by store size, indicating that store scale has a strong influence on sales performance.

Store Size-wise Revenue Contribution

Store Size	Revenue	Approx. Share
Medium	~22.10M	~73%
High	~6.22M	~21%
Small	~2.03M	~6%

📌 Key Insight:

Medium-sized stores dominate revenue generation, contributing nearly three-fourths of total revenue.

Interpretation of Store Size Impact

Medium stores likely strike the best balance between:
- Product variety
- Operational efficiency
- Customer footfall
High-sized stores, despite larger physical space, contribute less than expected, possibly due to:
- Higher operational costs
- Diminishing returns on space utilization
Small stores generate minimal revenue, consistent with limited assortment and lower footfall.

Relationship with Univariate EDA

Medium-sized stores also had the highest frequency count in the dataset.
Their dominance in both count and revenue reinforces their strategic importance in the business model.

Business Implications

Expansion strategy should prioritize medium-sized stores.
Optimization opportunities:
- Improve revenue per square foot in high-sized stores.
- Re-evaluate inventory and location strategy for small stores.

Summary

Medium-sized stores are the primary revenue drivers, contributing nearly 73% of total revenue, making store size a critical determinant of sales performance.

# Revenue by Store Location City Type
df_revenue4 = revenue_by_category(
    data,
    category="Store_Location_City_Type",
    title="Revenue by Store Location City Type",
    rotate=0,
    figsize=(9, 6)
)

Observations:¶

📊 Bivariate Analysis – Revenue by Store Location City Type

Overall Revenue Context

Total revenue across all stores is approximately 30.36 million.
Revenue contribution varies significantly by city tier, highlighting the importance of store location in sales performance.

City Tier-wise Revenue Contribution

City Tier	Revenue	Approx. Share
Tier 2	~21.65M	~71%
Tier 1	~6.67M	~22%
Tier 3	~2.03M	~7%

📌 Key Insight:

Tier 2 cities dominate revenue generation, contributing over 70% of total revenue, outperforming even Tier 1 cities.

Interpretation of Location Impact

Tier 2 cities likely benefit from:
- High population density
- Moderate competition
- Strong demand for value-driven retail formats
Tier 1 cities, despite higher purchasing power, may face:
- Market saturation
- Higher operational costs
Tier 3 cities show limited revenue potential, possibly due to:
- Lower footfall
- Smaller store formats
- Limited product assortment

Consistency with Previous EDA

Tier 2 cities also had the highest store count in univariate analysis.
The alignment of high store presence and high revenue reinforces Tier 2 cities as the company’s core market.

Business Implications

Expansion and investment strategies should focus on Tier 2 locations.
Tier 1 strategies should emphasize:
- Premium products
- Differentiated offerings
Tier 3 stores may require:
- Cost optimization
- Targeted product mixes

Summary

Tier 2 cities are the primary revenue drivers, contributing over 70% of total revenue, underscoring the strategic importance of mid-tier urban markets.

# Revenue by Store Type
df_revenue5 = revenue_by_category(
    data,
    category="Store_Type",
    title="Revenue by Store Type",
    rotate=0,
    figsize=(9, 6)
)

Observations:¶

📊 Bivariate Analysis – Revenue by Store Type

Overall Revenue Context

Total revenue across all store types is approximately 30.36 million.
Revenue distribution varies significantly across different store formats, indicating that store type plays a crucial role in sales performance.

Store Type-wise Revenue Contribution

Store Type	Revenue	Approx. Share
Supermarket Type2	~15.43M	~51%
Departmental Store	~6.67M	~22%
Supermarket Type1	~6.22M	~21%
Food Mart	~2.03M	~7%

📌 Key Insight:

Supermarket Type2 dominates revenue generation, contributing over half of the total revenue on its own.

Interpretation of Store Type Impact

Supermarket Type2 stores likely benefit from:
- Wider product assortment
- Higher customer footfall
- Strong presence in high-performing locations (e.g., Tier 2 cities)
Departmental Stores and Supermarket Type1 show comparable performance, suggesting:
- Moderate scale
- Stable but less aggressive sales potential
Food Marts generate the least revenue, consistent with:
- Smaller store size
- Limited product variety
- Convenience-focused shopping behavior

Consistency with Earlier EDA Findings

Supermarket Type2 stores were:
- Most frequent in the dataset
- Dominant in Store_Id (OUT004) revenue
This reinforces the conclusion that store format + scale + location jointly drive revenue.

Business Implications

Expansion strategy should prioritize Supermarket Type2 stores.
Opportunities exist to:
- Upgrade Supermarket Type1 stores to Type2 formats
- Optimize product mix in Departmental Stores
Food Marts may be best suited for niche or essential-only strategies.

Summary

Supermarket Type2 stores are the primary revenue drivers, contributing over 50% of total revenue, making store format a critical determinant of sales performance.

Let's check the distribution of our target variable i.e Product_Store_Sales_Total with the other categorical columns¶

def nice_boxplot_by_category(
    data,
    x_cat,
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    show_stats_subtitle=True,
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    # Boxplot (no hue needed when it equals x; avoids duplicate legends)
    sns.boxplot(
        data=data,
        x=x_cat,
        y=y,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    main_title = title or f"Boxplot - {x_cat} vs {y}"
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    if show_stats_subtitle:
        n = int(data[[x_cat, y]].dropna().shape[0])
        groups = int(data[x_cat].nunique(dropna=True))
        fig.text(
            0.5, title_y - 0.045,
            f"points={n:,}   groups={groups:,}",
            ha="center", va="top", fontsize=11
        )

    ax.set_xlabel(x_cat)
    ax.set_ylabel(f"{y} (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

# Store Id vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Id",
    y="Product_Store_Sales_Total",
    figsize=(14, 8),
    title="Boxplot - Store_Id vs Product_Store_Sales_Total",
    rotate=90
)

Observations:¶

📊 Bivariate Analysis – Store_Id vs Product_Store_Sales_Total (Boxplot)

Overall Distribution Insight

The boxplot compares product-level sales distribution across four stores (OUT001–OUT004).
There is substantial variation in median sales, spread, and outliers across stores, indicating store-specific sales behavior.

Median Sales Comparison

OUT003 shows the highest median product sales, indicating stronger per-product performance.
OUT001 follows with moderately high median sales.
OUT004 has a lower median compared to OUT003 and OUT001, despite being the highest in total revenue.
OUT002 has the lowest median sales, confirming its weaker performance.

📌 Key Insight:

Higher total revenue does not always imply higher per-product sales.

Variability & Spread

OUT003 has the widest IQR (box width), suggesting:
- Greater diversity in product performance
- Presence of both very high and moderate selling products
OUT004 shows a moderate spread, indicating relatively consistent product sales.
OUT002 has a narrower distribution, reflecting limited sales potential and fewer high-performing products.

Outlier Analysis

OUT003 exhibits several high-value outliers, including the maximum observed sales (~8000).
OUT004 also contains multiple high outliers but fewer extreme values.
OUT002 has some very low outliers, indicating poorly performing products.

📌 Implication:

Some stores rely on blockbuster products, while others show uniform but lower performance.

Business Interpretation

OUT003:
- Strong per-product revenue potential
- Opportunity to scale top-performing SKUs
OUT004:
- High total revenue driven by volume and breadth, not necessarily high per-product sales
OUT002:
- Requires product assortment and pricing strategy review

Alignment with Previous Revenue Analysis

Although OUT004 generates the highest total revenue, its median product sales are lower than OUT003, suggesting:
- Revenue dominance comes from more products sold, not higher sales per product
Confirms why Store_Id is a critical feature for modeling.

Summary

Product-level sales distributions vary significantly across stores, with OUT003 showing the highest median and variability, while OUT004’s high total revenue is driven by volume rather than per-product dominance.

# Store Size vs Product Store Sales Total
nice_boxplot_by_category(
    data,
    x_cat="Store_Size",
    y="Product_Store_Sales_Total",
    figsize=(12, 7),
    title="Boxplot - Store_Size vs Product_Store_Sales_Total",
    rotate=0
)

Observations:¶

📊 Bivariate Analysis – Store_Size vs Product_Store_Sales_Total (Boxplot)

Overall Pattern

Product-level sales vary significantly across store sizes.
Store size clearly influences both median sales and variability, confirming it as a strong driver of revenue.

Median Sales Comparison

High-sized stores have the highest median product sales, indicating stronger per-product revenue.
Medium-sized stores show a moderate median, lower than High but significantly higher than Small.
Small-sized stores have the lowest median sales, reflecting limited sales capacity.

📌 Clear hierarchy:

High > Medium > Small in terms of per-product sales.

Variability & Distribution

Medium-sized stores exhibit the widest spread and many high outliers, suggesting:
- A mix of average and blockbuster products
- Greater heterogeneity in product performance
High-sized stores have a more compact distribution, indicating:
- Consistently strong product sales
- Better standardization and optimized assortments
Small stores show:
- Narrower spread
- Limited upside and fewer high-performing products

Outlier Behavior

Medium stores include extreme high outliers (up to ~8000), showing potential for exceptional products.
High stores have fewer extreme outliers but consistently high sales.
Small stores contain low-end outliers, highlighting weaker or non-performing SKUs.

Business Interpretation

High stores:
- Best suited for premium and high-MRP products
- Stable and predictable revenue per product
Medium stores:
- Strong growth opportunities
- Ideal for experimentation and new product launches
Small stores:
- Limited revenue potential per product
- Require focused assortment and cost optimization

Consistency with Revenue Aggregates

This boxplot aligns perfectly with earlier findings where:
- Medium-sized stores generated the highest total revenue
- Despite High stores having higher per-product medians, Medium stores win on volume + diversity

📌 Key takeaway:

Total revenue dominance ≠ highest per-product sales.

Summary

Product-level sales increase with store size, with High stores showing the strongest per-product performance, while Medium stores balance consistency and extreme high-selling products.

Let's now try to find out some relationship between the other columns¶

def nice_boxplot_relation(
    data,
    x_cat,
    y_num,
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=data,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(title or f"Boxplot - {x_cat} vs {y_num}", fontsize=15, fontweight="bold", y=title_y)

    n = int(data[[x_cat, y_num]].dropna().shape[0])
    groups = int(data[x_cat].nunique(dropna=True))
    fig.text(0.5, title_y - 0.045, f"points={n:,}   groups={groups:,}", ha="center", va="top", fontsize=11)

    ax.set_xlabel("Types of Products" if x_cat == "Product_Type" else x_cat)
    ax.set_ylabel(y_num)

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

# Product Type Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Type", y="Product_Weight", hue="Product_Type")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_Weight")
plt.xlabel("Types of Products")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:¶

📦 Boxplot Interpretation: Product_Type vs Product_Weight

🔎 What this plot shows

X-axis: Product categories
Y-axis: Product weight
Each box summarizes the distribution of product weights within a product type:
- Median (line)
- Interquartile range (IQR)
- Whiskers (typical min/max)
- Dots (outliers)

🧠 Key Observations

Weight distributions are remarkably consistent across product types

Most product categories have:
- Median weight ≈ 12–13 units
- IQR roughly between 11 and 14
This indicates standardized packaging sizes across the business.

✅ No product type is fundamentally heavier or lighter than others.

Minor category-level variations (but not strong)

Slightly higher medians seen in:
- Starchy Foods
- Seafood
- Others
Slightly lower medians in:
- Frozen Foods
- Canned
These differences are small and overlapping, not statistically dominant.

📌 Conclusion:

Product_Type does not strongly determine Product_Weight

Presence of outliers across all categories

Outliers exist on both lower and higher ends:
- Low-end: ~4–6 units
- High-end: ~18–22 units
This suggests:
- Multiple pack sizes
- Special SKUs (bulk / premium packs)

⚠️ These outliers are expected and realistic, not data errors.

Variability is similar across categories

No product type shows:
- Extreme dispersion
- Unusually tight or wide spread
Reinforces the idea of uniform packaging standards

🔗 Relation to Your Earlier Findings

This plot aligns perfectly with your correlation results:

Product_Weight has a strong positive correlation with sales
But Product_Type does NOT explain weight differences

➡️ Therefore:

Weight affects sales independently
Product_Type influences sales through demand, pricing, and volume, not weight

📝 One-line EDA Summary

Product weight distributions are largely consistent across product categories, indicating standardized packaging sizes. While product weight strongly influences sales, it is not driven by product type, suggesting an independent effect on revenue.

Let's find out whether there is some relationship between the weight of the product and its sugar content¶

# Product Sugar Content Vs Product Weight
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Product_Sugar_Content", y="Product_Weight", hue="Product_Sugar_Content")
plt.xticks(rotation=0)
plt.title("Boxplot - Product_Sugar_Content Vs Product_Weight")
plt.xlabel("Product_Sugar_Content")
plt.ylabel("Product_Weight")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:¶

🍬📦 Relationship: Product Sugar Content vs Product Weight

🔍 What this plot analyzes

X-axis: Product Sugar Content (Low Sugar, Regular, No Sugar, reg)
Y-axis: Product Weight
Goal: Check whether sugar content influences product weight

🧠 Key Observations

Product weight is largely independent of sugar content

Median weights across all sugar categories are very similar:
- Roughly 12–13 units for all groups
Interquartile ranges (IQRs) overlap heavily

✅ This indicates no strong relationship between sugar content and product weight.

Variability is consistent across categories

All sugar categories show:
- Comparable spread
- Similar whisker lengths
- Outliers on both ends
No category shows unusually heavy or light products overall

📌 Sugar formulation does not drive packaging size.

Outliers exist in every group (expected behavior)

Low-end outliers (~4–6 units)
High-end outliers (~18–22 units)

These likely represent:

Mini packs
Family or bulk packs
Special SKUs

⚠️ These are natural business variations, not data issues.

4 The "reg" category is likely a data-quality issue

"reg" appears redundant with "Regular"
Its distribution mirrors Regular almost exactly

🔗 Alignment with Your Previous Findings

This result is consistent with earlier insights:

Product_Weight strongly correlates with sales
Sugar content strongly affects revenue composition
But sugar content does NOT affect weight

➡️ Therefore:

Sugar content impacts sales via consumer preference
Weight impacts sales via volume/quantity
These effects are independent

📝 One-line EDA Summary

Product weight distributions are consistent across sugar content categories, indicating that sugar formulation does not influence packaging size. Product weight and sugar content independently contribute to sales behavior.

Let's analyze the sugar content of different product types¶

def nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,          # None, "index" (row %), "columns" (col %), "all" (overall %)
    figsize=(14, 8),
    cmap="viridis",
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")

    ct = pd.crosstab(data[rows], data[cols], dropna=False)

    # Normalize if requested
    if normalize is not None:
        ct_plot = ct.div(ct.sum(axis=1), axis=0) if normalize == "index" else \
                  ct.div(ct.sum(axis=0), axis=1) if normalize == "columns" else \
                  ct / ct.values.sum() if normalize == "all" else ct
        annot_fmt = ".1%"  # show as percent
        annot_data = ct_plot
        vmin, vmax = 0, 1
    else:
        annot_fmt = "g"    # integer
        annot_data = ct
        vmin, vmax = None, None

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        ct_plot if normalize is not None else ct,
        annot=annot_data,
        fmt=annot_fmt,
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        vmin=vmin,
        vmax=vmax,
        ax=ax,
    )

    main_title = title or (
        f"{rows} vs {cols}" + (" (Row %)" if normalize == "index" else
                              " (Column %)" if normalize == "columns" else
                              " (Overall %)" if normalize == "all" else
                              " (Counts)")
    )
    fig.suptitle(main_title, fontsize=15, fontweight="bold", y=title_y)

    fig.text(
        0.5, title_y - 0.045,
        f"rows={len(data):,}   unique_{rows}={data[rows].nunique(dropna=True):,}   unique_{cols}={data[cols].nunique(dropna=True):,}",
        ha="center", va="top", fontsize=11
    )

    ax.set_ylabel(rows)
    ax.set_xlabel(cols)

    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return ct

# Heatmap Product Sugar of Different Product Types
nice_crosstab_heatmap(
    data,
    rows="Product_Sugar_Content",
    cols="Product_Type",
    normalize=None,
    figsize=(14, 8),
    cmap="viridis",
    title="Sugar Content Across Product Types (Counts)"
)

Observations:¶

Target Variable: Product_Store_Sales_Total

Distribution is approximately normal with mild right skew.
Mean ≈ Median → good for regression modeling.
Presence of high-end outliers (up to ~8000), but not extreme enough to discard blindly.
✔️ No transformation is strictly required, though log-transform could be tested.

Numeric Feature Relationships

Correlation Heatmap (Pearson)

Strong positive relationships with sales:

Product_MRP → Sales (≈ 0.79) 🔥
Product_Weight → Sales (≈ 0.74) 🔥

Moderate relationship:

Product_Weight ↔ Product_MRP (≈ 0.53)

Weak / No relationship:

Product_Allocated_Area → Sales (≈ 0.00)
Store_Establishment_Year → Sales (≈ -0.19)

📌 Conclusion:

Product_MRP and Product_Weight are the most powerful numeric predictors.

Scatter Plot Insights

Product Weight vs Sales

Clear linear upward trend
Heavier products → higher total sales

Product MRP vs Sales

Strong linear pattern
Higher MRP products consistently generate more revenue

Product Allocated Area vs Sales

No clear pattern
Scatter is diffuse → confirms low correlation

📌 Modeling takeaway:

Consider dropping or down-weighting Product_Allocated_Area unless interactions are used.

Categorical Variable Distributions

Product Type (Count)

Top categories by presence:

Fruits & Vegetables
Snack Foods
Frozen Foods
Dairy

Balanced enough → good categorical signal.

Product Sugar Content

Low Sugar dominates (55.7%)
Regular ≈ 25.7%
No Sugar ≈ 17.3%
reg ≈ 1.2% → ⚠️ likely a data quality issue (typo of “Regular”)

📌 Action:

Merge reg → Regular

Revenue Analysis (Most Important Business Insight)

Revenue by Product Type

Top revenue generators:

Fruits & Vegetables 🥇
Snack Foods
Dairy
Frozen Foods

Lowest:

Seafood
Breakfast
Starchy Foods

📌 Insight:

High-volume essentials outperform niche categories.

Revenue by Sugar Content

Low Sugar → ~55% of total revenue 🔥
Regular → ~26%
No Sugar → ~17%
reg negligible

📌 Business Insight:

Health-conscious products are not only popular but profitable.

Revenue by Store

OUT004 alone contributes ~50% of total revenue
OUT002 is significantly underperforming

📌 Store-level imbalance detected

Revenue by Store Size

Medium stores dominate revenue (~73%)
High > Small, but Medium is the sweet spot

Revenue by City Tier

Tier 2 cities generate the most revenue
Tier 1 < Tier 2
Tier 3 lowest

📌 Key insight:

Tier 2 cities + Medium stores = highest ROI combination

Revenue by Store Type

Supermarket Type2 dominates
Followed by Departmental Store
Food Mart is lowest

Boxplot Insights (Sales Distributions)

Store ID vs Sales

OUT003 and OUT004 have higher medians
OUT002 has:

*Lowest median
- More low-end outliers

Store Size vs Sales

High size → highest median sales
Medium has higher total revenue due to volume
Small stores consistently underperform

Product Characteristics Analysis

Product Type vs Weight

Very similar median weights across categories
Slightly heavier:
- Snack Foods
- Starchy Foods
- Household

📌 Weight is not category-driven, but still predictive of sales.

Sugar Content vs Weight

No strong weight difference across sugar categories
Weight is independent of sugar classification

Sugar Content × Product Type (Crosstab Heatmap)

Key patterns:

Low Sugar dominates Fruits & Vegetables, Snack Foods
No Sugar almost exclusive to Health & Hygiene & Household
Very clean segmentation → good categorical signal

📌 Excellent feature interaction potential:

Product_Type × Sugar_Content

Data Quality Notes (Important)

⚠️ Fix these before modeling:

Merge reg → Regular
Consider encoding Store_Id carefully (target encoding recommended)
Product_Allocated_Area has low predictive power

Let's find out how many items of each product type has been sold in each of the stores¶

def nice_store_producttype_heatmap(
    data,
    store_col="Store_Id",
    product_col="Product_Type",
    figsize=(14, 8),
    cmap="viridis",
    annot="auto",          # "auto", True, False
    title=None,
    title_y=0.98,
):
    sns.set_theme(style="white", context="notebook")

    # ✅ Completed crosstab: Store_Id vs Product_Type
    ct = pd.crosstab(data[store_col], data[product_col], dropna=False)

    # Auto annotation decision (avoid clutter for large matrices)
    if annot == "auto":
        annot = (ct.shape[0] <= 15) and (ct.shape[1] <= 12)

    fig, ax = plt.subplots(figsize=figsize)

    sns.heatmap(
        ct,
        annot=annot,
        fmt="g",
        cmap=cmap,
        linewidths=0.6,
        linecolor="white",
        cbar=True,
        ax=ax,
    )

    fig.suptitle(
        title or f"Items Sold: {product_col} by {store_col}",
        fontsize=15,
        fontweight="bold",
        y=title_y,
    )
    fig.text(
        0.5,
        title_y - 0.045,
        f"rows={len(data):,}   stores={ct.shape[0]:,}   product_types={ct.shape[1]:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_ylabel("Stores")
    ax.set_xlabel("Product_Type")

    # Make labels readable
    ax.tick_params(axis="x", rotation=45)
    ax.tick_params(axis="y", rotation=0)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax, left=True, bottom=True)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return ct

nice_store_producttype_heatmap(data, annot=True)

Observations:¶

🛒 Items Sold: Product Type × Store Id — Key Insights

Dominance of OUT004 Across Product Types

OUT004 clearly outperforms all other stores across every product category.
Especially strong in:
- Fruits & Vegetables (700)
- Snack Foods (615)
- Frozen Foods (446)
- Dairy (397)
- Household (399)

📌 Interpretation:

OUT004 is the primary revenue and volume driver, likely due to:

Larger store size
Better location (Tier 2 + high footfall)
Broader assortment and inventory depth

Consistent Category Leaders Across Stores

Across all stores, the most sold product types are:

Product Type	Observation
Fruits & Vegetables	Top-selling category in every store
Snack Foods	Second-highest volume consistently
Household	Strong and stable across stores
Frozen Foods & Dairy	Medium-to-high consistent demand

📌 Insight:

These are essential, high-frequency purchase categories, driving store traffic and repeat purchases.

Low-Volume Categories Are Universally Low

Categories with consistently low sales across all stores:

Seafood
Breakfast
Breads
Others

📌 Interpretation:

Low demand appears structural, not store-specific — suggesting:

Limited consumer preference
Possibly niche or premium products
Potential candidates for assortment rationalization

Store-wise Performance Pattern

Store	Pattern
OUT004	High-volume, diversified sales across all categories
OUT001 & OUT003	Mid-performing, similar patterns
OUT002	Lowest sales across most categories

📌 Interpretation:

OUT002 may suffer from:

Smaller store size
Less optimal location
Lower customer footfall

Strong Alignment With Revenue Analysis

This heatmap perfectly explains your earlier findings:

OUT004 → highest revenue
Fruits & Vegetables + Snack Foods → top revenue contributors
Volume-driven categories = revenue drivers

This confirms that revenue is volume-led, not just price-led.

🎯 Business Implications

Inventory prioritization:

Allocate more shelf space and inventory to:

Fruits & Vegetables
Snack Foods
Household essentials

Store strategy:
- Replicate OUT004’s layout, assortment, and promotions in other stores
- Investigate why OUT002 underperforms
Category optimization:
- Review low-performing categories (Seafood, Breakfast) for SKU reduction

Different product types have different prices. Let's analyze the trend.¶

def nice_boxplot_price_trend(
    data,
    x_cat,
    y_num,
    figsize=(14, 8),
    title=None,
    rotate=60,
    color="#8b5cf6",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=data,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(
        title or f"Boxplot - {x_cat} vs {y_num}",
        fontsize=15,
        fontweight="bold",
        y=title_y,
    )

    n = int(data[[x_cat, y_num]].dropna().shape[0])
    groups = int(data[x_cat].nunique(dropna=True))
    fig.text(
        0.5,
        title_y - 0.045,
        f"points={n:,}   product_types={groups:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel("Product_Type" if x_cat == "Product_Type" else x_cat)
    ax.set_ylabel(f"{y_num} (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

# Boxplot Product Type Vs Product MRP
plt.figure(figsize=[14, 8])
sns.boxplot(
    data=data,
    x="Product_Type",
    y="Product_MRP",
    hue="Product_Type"
)
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False)  # hide redundant legend
plt.show()

Observations:¶

Similar central pricing across categories:

Most product types have comparable median MRPs, clustered roughly in the same range, indicating no extreme base-price differences across categories.

Wide price dispersion within each product type:

Every category shows a broad interquartile range (IQR), meaning products within the same type span multiple price points (budget to premium).

Presence of high-price outliers across many categories:

Almost all product types contain upper-end outliers (₹230–₹270 range), suggesting premium SKUs exist in nearly every category.

Some categories show slightly higher upper spread:

Categories such as Starchy Foods, Others, Fruits & Vegetables, and Meat exhibit wider upper tails, indicating more high-MRP products compared to others.

Lower-end pricing consistency:

Minimum MRPs across most product types are fairly similar, showing price floors do not differ much by category.

Product type is not a strong standalone price separator:

Since medians and IQRs overlap heavily, Product_Type alone does not strongly explain MRP variation—price variation is largely within categories rather than between them.

Let's find out how the Product_MRP varies with the different stores¶

def nice_boxplot_store_mrp(
    data,
    x_cat="Store_Id",
    y_num="Product_MRP",
    figsize=(14, 8),
    title="Boxplot - Store_Id vs Product_MRP",
    rotate=90,
    color="#8b5cf6",
    title_y=0.98,
    top_n=None,  # optional: show only top N stores by count to reduce clutter
):
    sns.set_theme(style="whitegrid", context="notebook")

    df = data[[x_cat, y_num]].dropna()

    # Optional: reduce clutter by keeping only top N stores by number of items
    if top_n is not None:
        top_stores = df[x_cat].value_counts().head(top_n).index
        df = df[df[x_cat].isin(top_stores)]

    fig, ax = plt.subplots(figsize=figsize)

    sns.boxplot(
        data=df,
        x=x_cat,
        y=y_num,
        ax=ax,
        color=color,
        showmeans=True,
        meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
        medianprops=dict(color="black", linewidth=2),
        whiskerprops=dict(linewidth=1.2),
        boxprops=dict(linewidth=1.2),
    )

    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)

    n = int(df.shape[0])
    stores = int(df[x_cat].nunique())
    fig.text(
        0.5, title_y - 0.045,
        f"points={n:,}   stores={stores:,}",
        ha="center", va="top", fontsize=11
    )

    ax.set_xlabel("Stores")
    ax.set_ylabel("Product_MRP (of each product)")

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right")

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

# Product MRP with Different Stores
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Store_Id", y="Product_MRP", hue="Store_Id")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_MRP")
plt.xlabel("Stores")
plt.ylabel("Product_MRP (of each product)")
plt.legend([], [], frameon=False)  # hide the huge redundant legend
plt.show()

Observations:¶

OUT003 has the highest median Product_MRP, indicating that this store generally sells higher-priced products compared to others.
OUT001 also shows a relatively high median MRP, but slightly lower than OUT003.
OUT004 has a moderate median MRP, positioned below OUT001 and OUT003 but above OUT002.
OUT002 clearly has the lowest median Product_MRP, suggesting it focuses more on lower-priced products.
Price variability (IQR) is widest for OUT003, meaning it carries a broader range of product prices.
OUT002 shows a narrower IQR, indicating more consistent (and generally lower) pricing.
OUT001 and OUT004 exhibit moderate variability in MRPs.
High-price outliers are most prominent in OUT003, reinforcing the presence of premium-priced products.
OUT002 also shows outliers, but these are mostly upper outliers, standing out against its generally low-price distribution.
All stores contain some low-price outliers, but they are more noticeable in OUT002 and OUT004.

Overall insight:

Product pricing strategy differs significantly by store. OUT003 and OUT001 cater more toward higher-priced items, while OUT002 appears to be a value-oriented store with lower and more tightly clustered MRPs.

Let's delve deeper and do a detailed analysis of each of the stores.¶

OUT001¶

data.loc[data["Store_Id"] == "OUT001"].describe(include="all").T

data.loc[data["Store_Id"] == "OUT001", "Product_Store_Sales_Total"].sum()

np.float64(6223113.18)

OUT001 has generated total revenue of 6223113 from the sales of goods.

def store_revenue_breakdown_by_product(
    data,
    store_id="",
    product_col="Product_Type",
    revenue_col="Product_Store_Sales_Total",
    figsize=(14, 7),
    rotate=60,
    color="#8b5cf6",
    top_n=None,          # optional: show only top N product types
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    df_store = data.loc[data["Store_Id"] == store_id, [product_col, revenue_col]].dropna()

    df_rev = (
        df_store.groupby(product_col, as_index=False)[revenue_col]
        .sum()
        .sort_values(revenue_col, ascending=False)
    )

    if top_n is not None:
        df_rev = df_rev.head(top_n)

    total_rev = df_rev[revenue_col].sum()
    n_rows = len(df_store)
    n_types = df_rev[product_col].nunique()

    fig, ax = plt.subplots(figsize=figsize)

    sns.barplot(
        data=df_rev,
        x=product_col,
        y=revenue_col,
        ax=ax,
        color=color,
        edgecolor="white",
        linewidth=1,
    )

    # Title + subtitle (same style)
    fig.suptitle(f"{store_id} Revenue by {product_col}", fontsize=15, fontweight="bold", y=title_y)
    fig.text(
        0.5,
        title_y - 0.045,
        f"total_revenue={total_rev:,.0f}   items_rows={n_rows:,}   product_types={n_types:,}",
        ha="center",
        va="top",
        fontsize=11,
    )

    ax.set_xlabel(product_col)
    ax.set_ylabel(revenue_col)

    ax.tick_params(axis="x", rotation=rotate)
    for t in ax.get_xticklabels():
        t.set_horizontalalignment("right" if rotate else "center")

    # Value labels
    ymax = df_rev[revenue_col].max() if not df_rev.empty else 0
    for p in ax.patches:
        h = p.get_height()
        ax.annotate(
            f"{h:,.0f}",
            (p.get_x() + p.get_width() / 2, h),
            ha="center",
            va="bottom",
            fontsize=10,
            xytext=(0, 4),
            textcoords="offset points",
        )
    ax.set_ylim(0, ymax * 1.12 if ymax > 0 else 1)

    sns.despine(ax=ax)
    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

    return df_rev

df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT001")

OUT002¶

data.loc[data["Store_Id"] == "OUT002"].describe(include="all").T

data.loc[data["Store_Id"] == "OUT002", "Product_Store_Sales_Total"].sum()

np.float64(2030909.72)

OUT002 has generated total revenue of 2030910 from the sales of goods.

df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT002")

OUT003

data.loc[data["Store_Id"] == "OUT003"].describe(include="all").T

data.loc[data["Store_Id"] == "OUT003", "Product_Store_Sales_Total"].sum()

np.float64(6673457.57)

df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT003")

OUT004

data.loc[data["Store_Id"] == "OUT004"].describe(include="all").T

data.loc[data["Store_Id"] == "OUT004", "Product_Store_Sales_Total"].sum()

np.float64(15427583.43)

df_OUT001 = store_revenue_breakdown_by_product(data, store_id="OUT004")

Observations:¶

🏬 Store-wise Detailed Observations

🔵 OUT001 — High-priced, Stable Performer

Store Profile

Store Type: Supermarket Type1
Store Size: High
City Tier: Tier 2
Establishment Year: 1987 (oldest store)

Product MRP Behavior

Mean MRP: ~160.5
Median MRP: ~168.3
Pricing is moderately high and consistent.
Boxplot shows:
- Tight IQR → controlled pricing strategy
- Few extreme outliers → limited ultra-premium SKUs
Indicates price stability over aggressive discounting.

Sales Performance

Total Revenue: ~6.22M
Avg Sales per product: ~3924
Sales spread: Moderate (std ≈ 904)

Product Mix & Revenue Drivers

Top revenue categories:
- Snack Foods
- Fruits & Vegetables
- Dairy
Balanced contribution across categories → diversified demand
No over-dependence on a single category.

Interpretation

Mature store with:
- Reliable pricing
- Balanced category mix
- Steady sales
Performs well without extreme pricing or promotional volatility.

🔴 OUT002 — Low-price, Low-volume Store

Store Profile

Store Type: Food Mart
Store Size: Small
City Tier: Tier 3
Establishment Year: 1998

Product MRP Behavior

Mean MRP: ~107.1
Median MRP: ~104.7 (lowest among all stores)
Boxplot characteristics:
- Lowest MRP range
- Many low-end outliers → economy pricing
Minimal premium pricing presence.

Sales Performance

Total Revenue: ~2.03M (lowest)
Avg Sales per product: ~1763
Low variance → consistently low ticket sizes.

Product Mix & Revenue Drivers

Top categories:
- Fruits & Vegetables
- Snack Foods
Weak performance in:
- Meat
- Household
- Premium categories

Interpretation

Store is:
- Highly price-sensitive
- Volume-constrained
Likely serving budget-conscious customers
Limited upselling potential due to low MRP ceiling.

🟢 OUT003 — Premium Pricing, High Value per Product

Store Profile

Store Type: Departmental Store
Store Size: Medium
City Tier: Tier 1
Establishment Year: 1999

Product MRP Behavior

Mean MRP: ~181.4 (highest)
Median MRP: ~179.7
Boxplot shows:
- Wide IQR
- Many high-end outliers (up to ~266)
Strong presence of premium SKUs.

Sales Performance

Total Revenue: ~6.67M
Avg Sales per product: ~4947 (highest)
Highest maximum sales (~8000)

Product Mix & Revenue Drivers

Strong categories:
- Snack Foods
- Fruits & Vegetables
- Dairy
Premium categories perform consistently well.

Interpretation

Best store for:
- High-margin products
- Premium assortment
Customers show lower price sensitivity
Ideal candidate for premium expansion & exclusive SKUs.

🟣 OUT004 — High-volume, Revenue Powerhouse

Store Profile

Store Type: Supermarket Type2
Store Size: Medium
City Tier: Tier 2
Establishment Year: 2009 (newest)

Product MRP Behavior

Mean MRP: ~142.4
Median MRP: ~142.8
Boxplot indicates:
- Moderate pricing
- Controlled spread
- Few extreme outliers

Sales Performance

Total Revenue: ~15.43M (highest by far)
Avg Sales per product: ~3299
Sales are driven by volume, not high price.

Product Mix & Revenue Drivers

Dominant categories:
- Fruits & Vegetables
- Snack Foods
- Frozen Foods
Strong across all categories, not niche-dependent.

Interpretation

Slightly lower MRP than OUT003, but massive volume compensates.
Best example of volume-led revenue strategy.

🔎 Cross-Store Comparative Insights

Dimension	OUT001	OUT002	OUT003	OUT004
Avg MRP	Medium-High	Low	Highest	Medium
Revenue	High	Lowest	High	Highest
Pricing Strategy	Stable	Budget	Premium	Balanced
Volume	Medium	Low	Medium	Very High
Best Use Case	Stability	Price-led	Margin-led	Scale-led

📌 Final Strategic Takeaways

OUT003 → maximize premium & margins
OUT004 → expand assortment & inventory (volume monster)
OUT001 → maintain consistency, low risk
OUT002 → needs either volume growth or pricing rethink

Let's find out the revenue generated by the stores from each of the product types.

df1 = data.groupby(["Product_Type", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df1

Observations:¶

OUT001

Revenue is well balanced across categories, with no extreme dependency on a single product type.
Snack Foods, Fruits & Vegetables, Dairy, Frozen Foods are the top contributors.
Breakfast and Seafood generate the least revenue, indicating low demand.
Performs moderately across both food and non-food (Household, Health & Hygiene) categories.

OUT002

Overall lowest total revenue among all stores.
Strongest categories are Fruits & Vegetables and Snack Foods, but at much lower scale.
Breakfast, Seafood, Starchy Foods, Others perform very weakly.
Indicates a small-format / low-footfall store with limited high-value sales.

OUT003

High-revenue store with strong performance across most categories.
Snack Foods and Fruits & Vegetables dominate sales.
Dairy, Frozen Foods, Household, Meat also contribute significantly.
Weakest categories remain Breakfast and Seafood, consistent with other stores.

OUT004

Top-performing store by a large margin.
Extremely strong in Fruits & Vegetables, Snack Foods, Frozen Foods, Dairy, Household.
Even traditionally low categories (Breakfast, Seafood, Others) perform better here.
Indicates large store size, high footfall, and wide product acceptance.

Cross-store patterns

Snack Foods and Fruits & Vegetables are the top revenue drivers across all stores.
Breakfast and Seafood are consistently the lowest-performing categories.
Revenue scale increases clearly from OUT002 → OUT001 → OUT003 → OUT004.
High-performing stores show diversified revenue, not dependence on a single category.

Let's find out the revenue generated by the stores from products having different levels of sugar content.

df2 = data.groupby(["Product_Sugar_Content", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df2

Data Preprocessing¶

Replacing the values in the Product_Sugar_Content column¶

# Replacing reg with Regular
data.Product_Sugar_Content.replace(to_replace=["reg"], value=["Regular"], inplace=True)

data.Product_Sugar_Content.value_counts()

Exploring Patterns in Product_IDs¶

## Extracting the first two characters from the Product_Id column and storing it in another column
data["Product_Id_char"] = data["Product_Id"].str[:2]
data.head()

data["Product_Id_char"].unique()

array(['FD', 'NC', 'DR'], dtype=object)

data.loc[data.Product_Id_char == "FD", "Product_Type"].unique()

array(['Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Snack Foods',
       'Meat', 'Fruits and Vegetables', 'Breads', 'Breakfast',
       'Starchy Foods', 'Seafood'], dtype=object)

data.loc[data.Product_Id_char == "DR", "Product_Type"].unique()

array(['Hard Drinks', 'Soft Drinks'], dtype=object)

data.loc[data.Product_Id_char == "NC", "Product_Type"].unique()

array(['Health and Hygiene', 'Household', 'Others'], dtype=object)

Observations:¶

🔹 Product_Sugar_Content (After Cleaning)

The typo reg was successfully standardized to Regular, removing category ambiguity.
Low Sugar products dominate the dataset, followed by Regular, then No Sugar.
This suggests customer demand (and assortment strategy) is skewed toward low-sugar options.

🔹 Product_ID Prefix Analysis (Product_Id_char)

I identified three clear product families using the first two characters:

FD (Food Products)

Covers most product categories:
- Frozen Foods, Dairy, Canned, Baking Goods, Snack Foods
- Meat, Fruits & Vegetables, Breads, Breakfast
- Starchy Foods, Seafood
Indicates FD is the core retail assortment, driving volume and revenue.

DR (Drinks)

Exclusively mapped to:
- Hard Drinks
- Soft Drinks
Shows a clean and well-segmented beverage classification.

NC (Non-Consumables)

Limited to:
- Health and Hygiene
- Household
- Others
These are non-food essentials, likely lower in frequency but important for basket value.

🔹 Structural Insights

Product IDs are not random — they encode category intelligence.
This structure can be very useful for feature engineering, such as:
- Group-level demand modeling
- Category-specific pricing or sales behavior
The dataset shows strong internal consistency between Product_ID patterns and Product_Type.

🔹 Modeling & EDA Implications

Product_Id_char is a high-value categorical feature for:
- Sales prediction
- Customer demand segmentation
Sugar content is imbalanced, so stratification or weighting may be needed in models.
FD products will likely dominate predictions, while DR and NC may behave differently.

Store's Age¶

# Outlet Age
data["Store_Age_Years"] = 2025 - data.Store_Establishment_Year

Grouping Product Types into Perishables and Non-Perishables.¶

perishables = [
    "Dairy",
    "Meat",
    "Fruits and Vegetables",
    "Breakfast",
    "Breads",
    "Seafood",
]

def change(x):
    if x in perishables:
        return "Perishables"
    else:
        return "Non Perishables"

data['Product_Type_Category'] = data['Product_Type'].apply(change)

data.head()

Observations:¶

🔹 Store_Age_Years

Store age ranges roughly from ~16 to ~38 years, indicating a mix of newer and very mature stores.
Older stores (≈35–38 years) are mostly OUT001 and OUT003, suggesting:
- Long-standing market presence
- Likely stable customer base and mature operations
Newer stores (≈16–27 years), such as OUT004, still show strong sales, indicating that age alone does not limit performance.

Insight: Store age may influence customer trust and assortment depth, but store size, location, and type likely play a stronger role in sales.

🔹 Product_Type_Category (Perishables vs Non-Perishables)

Perishables include: Dairy, Meat, Fruits & Vegetables, Breakfast, Breads, Seafood.
Non-Perishables dominate the dataset, including:
- Frozen Foods, Canned, Baking Goods, Snacks, Beverages, Household, Health & Hygiene.

Observation:

The majority of rows fall under Non-Perishables, suggesting:
- Higher assortment depth
- Better shelf life and inventory stability
Perishables are fewer but typically high-frequency purchase items.

🔹 Combined Insights

Older stores + perishables likely require stronger cold-chain and inventory management.
Newer or smaller stores may rely more on non-perishables due to:
- Lower spoilage risk
- Easier logistics
This binary category can help explain sales variance, especially when combined with:
- Store_Size
- Store_Type
- Store_Location_City_Type

🔹 Modeling Value

Store_Age_Years is a strong continuous feature for regression.
Product_Type_Category (binary) is:
- Easy to encode
- Highly interpretable
- Useful for capturing operational differences in sales behavior

Outlier Check¶

def nice_outlier_boxgrid_2col(
    data,
    exclude=("Store_Establishment_Year", "Store_Age_Years"),
    cols=None,
    whis=1.5,
    ncols=2,                 # ✅ two plots per row
    figsize=None,
    color="#8b5cf6",
    title="Outlier Check (Boxplots)",
    title_y=0.98,
):
    sns.set_theme(style="whitegrid", context="notebook")

    # Select numeric columns
    if cols is None:
        cols = data.select_dtypes(include=np.number).columns.tolist()

    # Exclude if present
    cols = [c for c in cols if c not in set(exclude)]
    if not cols:
        raise ValueError("No numeric columns left to plot after exclusions.")

    n = len(cols)
    nrows = math.ceil(n / ncols)

    # Auto figure size tuned for 2-col layout
    if figsize is None:
        figsize = (14, 3.4 * nrows)

    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    axes = np.array(axes).reshape(-1)

    # Title + subtitle (same look as your previous charts)
    fig.suptitle(title, fontsize=15, fontweight="bold", y=title_y)
    fig.text(
        0.5,
        title_y - 0.045,
        f"numeric_features={n:,}   whis={whis}   layout={ncols} per row",
        ha="center",
        va="top",
        fontsize=11,
    )

    for i, col in enumerate(cols):
        ax = axes[i]
        x = data[col].dropna()

        sns.boxplot(
            x=x,
            ax=ax,
            color=color,
            whis=whis,
            showmeans=True,
            meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black", markersize=6),
            medianprops=dict(color="black", linewidth=1.8),
            whiskerprops=dict(linewidth=1.2),
            boxprops=dict(linewidth=1.2),
        )

        ax.set_title(col, fontsize=12, pad=10)
        ax.set_xlabel("")
        ax.set_ylabel("")
        ax.grid(True, axis="x", alpha=0.25)  # subtle guidance
        sns.despine(ax=ax, left=True, bottom=True)

    # Hide unused axes
    for j in range(n, len(axes)):
        axes[j].axis("off")

    fig.tight_layout(rect=[0, 0, 1, 0.92])
    plt.show()

nice_outlier_boxgrid_2col(data)

Observations:¶

🔹 Product_Weight

Most product weights are concentrated between ~10 and ~15 units.
There are outliers on both ends:
- Very light products (< ~7)
- Very heavy products (> ~19–22)
Distribution is fairly symmetric, suggesting natural variation by product type rather than data errors.

Takeaway: Outliers look realistic (different packaging sizes), not anomalies.

🔹 Product_Allocated_Area

Majority of values lie in the 0.03–0.10 range.
Strong right-skew with many high-end outliers (up to ~0.30).
Indicates some products require significantly more shelf space.

Takeaway: High-end outliers likely represent bulky or premium-display products.

🔹 Product_MRP

Core price range is ~120 to ~170.
Clear upper-end outliers beyond ₹220–₹270.
A few low-priced outliers (< ~70) also exist.

Takeaway: Price outliers reflect premium and budget product segments, not noise.

🔹 Product_Store_Sales_Total

Highly right-skewed distribution.
Most sales totals fall between ~2500 and ~4500.
Several very high outliers (up to ~8000), indicating top-performing products.
A few low-end outliers, likely slow-moving or niche items.

Takeaway: Sales outliers are business-critical (star vs low-performing products).

🔹 Overall Conclusion

Outliers are meaningful and business-driven, not data quality issues.
Removing them could erase important patterns.
Better strategies:
- Log-transform Product_Store_Sales_Total
- Use robust models (tree-based, quantile-based)

Data Preparation for Modeling¶

data.head()

Let's remove the columns that are not required.

data = data.drop(["Product_Id", "Product_Type", "Store_Id", "Store_Establishment_Year"], axis=1)

data.shape

(8763, 11)

data.head()

data.describe(include='all').T

# Separating features and the target column
X = data.drop("Product_Store_Sales_Total", axis=1)
y = data["Product_Store_Sales_Total"]

print(X.shape)
print(y.shape)

(8763, 10)
(8763,)

# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, shuffle=True
)

X_train.shape, X_test.shape

((6134, 10), (2629, 10))

Observations:¶

Key Observations from the Split

✅ Train–test split is correct:
- Train: 6,134 rows
- Test: 2,629 rows
✅ Target separation is clean (Product_Store_Sales_Total)
✅ No row leakage (shuffle=True, fixed random_state)

Data Pre-processing Pipeline¶

categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features

['Product_Sugar_Content',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Id_char',
 'Product_Type_Category']

# Create a preprocessing pipeline for the categorical features
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
)

Model Building¶

Define functions for Model Evaluation¶

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

The ML models to be built can be any two out of the following:

Decision Tree
Bagging
Random Forest
AdaBoost
Gradient Boosting
XGBoost

Decision Tree Model¶

dtree = DecisionTreeRegressor(random_state=1)
dtree = make_pipeline(preprocessor,dtree)
dtree.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('decisiontreeregressor',
                 DecisionTreeRegressor(random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('decisiontreeregressor',
                 DecisionTreeRegressor(random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

DecisionTreeRegressor(random_state=1)

Checking model performance on training set¶

dtree_model_train_perf = model_performance_regression(dtree, X_train, y_train)
dtree_model_train_perf

Checking model performance on test set¶

dtree_model_test_perf = model_performance_regression(dtree, X_test, y_test)
dtree_model_test_perf

Observations:¶

The pipeline is correctly structured, combining preprocessing (One-Hot Encoding) and the Decision Tree regressor, ensuring consistent data handling during training and testing.
Training and test performance are fairly close, indicating that the model is not severely overfitting.
R² score (~0.68 on train and ~0.67 on test) suggests the model explains around two-thirds of the variance in product sales, which is reasonable for a baseline model.
Adjusted R² is very close to R², implying that the number of predictors introduced by one-hot encoding is not excessively inflating model performance.
RMSE increases slightly on the test set, showing a small generalization error but acceptable stability.
MAE values are consistent across train and test, indicating stable average prediction errors.
MAPE (~16.5% train, ~18.7% test) shows moderate relative error, meaning predictions are reasonably close in percentage terms.
Unpruned Decision Tree captures non-linear relationships, but may still be sensitive to noise and outliers in sales data.
Model serves well as a baseline, but performance can likely be improved with ensemble methods (Random Forest, Gradient Boosting).

Bagging Regressor¶

bagging_regressor = BaggingRegressor(random_state=1)
bagging_regressor = make_pipeline(preprocessor,bagging_regressor)
bagging_regressor.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('baggingregressor', BaggingRegressor(random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('baggingregressor', BaggingRegressor(random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

BaggingRegressor(random_state=1)

Checking model performance on training set¶

bagging_regressor_model_train_perf = model_performance_regression(bagging_regressor, X_train, y_train)
bagging_regressor_model_train_perf

Checking model performance on test set¶

bagging_regressor_model_test_perf = model_performance_regression(bagging_regressor, X_test, y_test)
bagging_regressor_model_test_perf

Observations:¶

Pipeline integration is consistent, combining preprocessing (One-Hot Encoding) with the Bagging Regressor, ensuring uniform feature handling.
Training performance is similar to the Decision Tree, with an R² of ~0.68, indicating comparable explanatory power.
Test R² (~0.67) closely matches training R², showing good generalization and reduced overfitting compared to a single tree.
RMSE and MAE values are almost identical on train and test sets, highlighting stability in predictions.
MAPE (~16.6% train, ~18.7% test) suggests reasonable percentage-level prediction accuracy, similar to the Decision Tree.
Bagging reduces variance, but the improvement over a single Decision Tree is marginal in this setup.
Model performance indicates limited gains without tuning, likely because default base estimators are already simple.
Useful as a variance-reduction baseline, but stronger ensemble methods may yield better improvements.

Random Forest Model¶

rf_estimator = RandomForestRegressor(random_state=1)
rf_estimator = make_pipeline(preprocessor,rf_estimator)
rf_estimator.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

RandomForestRegressor(random_state=1)

Checking model performance on training set¶

rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train, y_train)
rf_estimator_model_train_perf

Checking model performance on test set¶

rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test, y_test)
rf_estimator_model_test_perf

Observations:¶

Seamless integration with the preprocessing pipeline, ensuring consistent encoding of categorical variables before modeling.
Training performance (R² ≈ 0.685) is almost identical to Decision Tree and Bagging models, indicating similar explanatory power.
Test performance (R² ≈ 0.669) closely matches training performance, showing good generalization and low overfitting.
RMSE (~ 616) and MAE (~ 485) on the test set are nearly the same as Bagging and Decision Tree, suggesting limited incremental improvement.
MAPE (~18.7% on test) remains consistent across all tree-based models tried so far.
Random Forest’s variance reduction is evident, but its benefit is muted, likely due to:
Limited signal in the available features
Default hyperparameters (e.g., number of trees, depth)
Model stability is strong, as seen from minimal train–test performance gap.
Better potential than Bagging with tuning, especially by adjusting n_estimators, max_depth, and min_samples_leaf.

AdaBoost Regressor¶

ab_regressor = AdaBoostRegressor(random_state=1)
ab_regressor = make_pipeline(preprocessor,ab_regressor)
ab_regressor.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('adaboostregressor', AdaBoostRegressor(random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('adaboostregressor', AdaBoostRegressor(random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

AdaBoostRegressor(random_state=1)

Checking model performance on training set¶

ab_regressor_model_train_perf = model_performance_regression(ab_regressor, X_train, y_train)
ab_regressor_model_train_perf

Checking model performance on test set¶

ab_regressor_model_test_perf = model_performance_regression(ab_regressor, X_test, y_test)
ab_regressor_model_test_perf

Observations:¶

Lower overall performance compared to other tree-based models (Decision Tree, Bagging, Random Forest).
Training R² ≈ 0.652, which is already weaker than previous models, indicating limited ability to capture underlying patterns.
Test R² drops further to ≈ 0.634, showing poorer generalization on unseen data.
Highest error metrics among all models tested so far:
- Test RMSE ≈ 647
- Test MAE ≈ 531
- Test MAPE ≈ 19.4%
Larger train–test performance gap compared to Random Forest and Bagging, suggesting instability.
AdaBoost’s sensitivity to noisy data and outliers likely impacts performance, especially given:
- Wide variance in Product_Store_Sales_Total
- Presence of outliers observed earlier in numerical features
Default weak learners (shallow trees) may be underfitting the data.
Not well-suited in current configuration for this regression task without careful tuning.

Overall: AdaBoost underperforms relative to other ensemble methods and is the weakest model tested so far for predicting product store sales.

Gradient Boosting Regressor¶

gb_estimator = GradientBoostingRegressor(random_state=1)
gb_estimator = make_pipeline(preprocessor,gb_estimator)
gb_estimator.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('gradientboostingregressor',
                 GradientBoostingRegressor(random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

GradientBoostingRegressor(random_state=1)

Checking model performance on training set¶

gb_estimator_model_train_perf = model_performance_regression(gb_estimator, X_train, y_train)
gb_estimator_model_train_perf

Checking model performance on test set¶

gb_estimator_model_test_perf = model_performance_regression(gb_estimator, X_test, y_test)
gb_estimator_model_test_perf

Observations:¶

Strong and stable performance, very similar to Decision Tree, Bagging, and Random Forest models.
Training performance:
- R² ≈ 0.685
- RMSE ≈ 597
- Indicates the model captures a good amount of variance without overfitting.
Test performance remains consistent:
- R² ≈ 0.669
- RMSE ≈ 616
- MAE ≈ 485
Minimal train–test gap, suggesting good generalization.
MAPE (~18.7%) is comparable to Bagging and Random Forest, and clearly better than AdaBoost.
Gradient Boosting handles non-linear relationships and feature interactions effectively, even with mixed numerical and one-hot encoded categorical features.
Performance improvement over AdaBoost shows that sequential boosting with gradient optimization is more robust to noise in this dataset.
Default hyperparameters already yield competitive results, indicating good baseline suitability.

Overall: Gradient Boosting is a strong candidate model, offering balanced bias–variance trade-off and performance on par with the best models tested so far.

XGBoost Regressor¶

xgb_estimator = XGBRegressor(random_state=1)
xgb_estimator = make_pipeline(preprocessor,xgb_estimator)
xgb_estimator.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              co...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=1, ...))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              co...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=1, ...))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=1, ...)

Checking model performance on training set¶

xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train, y_train)
xgb_estimator_model_train_perf

Checking model performance on test set¶

xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test, y_test)
xgb_estimator_model_test_perf

Observations:¶

Performance is on par with tree-based ensemble models (Decision Tree, Bagging, Random Forest, Gradient Boosting).
Training results:
- R² ≈ 0.685
- RMSE ≈ 597
- MAE ≈ 469
Test results remain stable:
- R² ≈ 0.668
- RMSE ≈ 616
- MAE ≈ 485
Very small train–test gap, indicating no significant overfitting.
MAPE (~18.7%) is consistent with Random Forest and Gradient Boosting.
Despite XGBoost’s advanced regularization and boosting strategy, default hyperparameters do not significantly outperform other ensemble models in this setup.
Performance similarity suggests that the feature set and preprocessing pipeline are the main performance drivers, rather than the specific ensemble algorithm.
XGBoost’s strength (handling complex interactions and regularization) is likely underutilized without hyperparameter tuning.

Overall: XGBoost is a robust and reliable model, but in its current untuned form, it does not provide a clear advantage over Random Forest or Gradient Boosting for this dataset.

Model Performance Improvement - Hyperparameter Tuning¶

Hyperparameter Tuning - Decision Tree¶

# Uncomment the below snippet of code if decision tree regressor is to be used

# Choose the type of classifier.
dtree_tuned = DecisionTreeRegressor(random_state=1)
dtree_tuned = make_pipeline(preprocessor,dtree_tuned)

# Grid of parameters to choose from
parameters = {
     "decisiontreeregressor__max_depth": list(np.arange(2, 6)),
     "decisiontreeregressor__min_samples_leaf": [1, 3, 5],
     "decisiontreeregressor__max_leaf_nodes": [2, 3, 5, 10, 15],
     "decisiontreeregressor__min_impurity_decrease": [0.001, 0.01, 0.1],
 }

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=r2_score, cv=3, n_jobs =-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'decisiontreeregressor__max_depth': np.int64(2), 'decisiontreeregressor__max_leaf_nodes': 2, 'decisiontreeregressor__min_impurity_decrease': 0.001, 'decisiontreeregressor__min_samples_leaf': 1}

Checking model performance on training set¶

dtree_tuned_model_train_perf = model_performance_regression(dtree_tuned, X_train, y_train)
dtree_tuned_model_train_perf

Checking model performance on test set¶

dtree_tuned_model_test_perf = model_performance_regression(dtree_tuned, X_test, y_test)
dtree_tuned_model_test_perf

Observations:¶

The tuned Decision Tree performs significantly worse than the untuned version and all other ensemble models.
R-squared drops sharply (~0.38) on both train and test sets, indicating very low explanatory power.
RMSE and MAE increase substantially, showing higher prediction errors after tuning.
Similar train and test performance suggests no overfitting, but rather strong underfitting.
The selected best parameters (very shallow tree: max_depth = 2, max_leaf_nodes = 2) overly restrict model complexity.
The model fails to capture nonlinear relationships present in the data.
Hyperparameter tuning over-regularized the model, hurting performance instead of improving it.
This confirms that single decision trees are not suitable for this dataset compared to ensemble-based methods.

Conclusion:

The tuned Decision Tree is the worst-performing model and should be discarded in favor of ensemble models like Random Forest, Gradient Boosting, or XGBoost.

Hyperparameter Tuning - Bagging Regressor¶

# Choose the type of regressor.
bagging_estimator_tuned = BaggingRegressor(random_state=1)
bagging_estimator_tuned = make_pipeline(preprocessor,bagging_estimator_tuned)

# Grid of parameters to choose from
parameters = {
     "baggingregressor__max_samples": [0.7, 0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "baggingregressor__max_features": [0.7, 0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "baggingregressor__n_estimators": [10, 30, 50, 100] #Complete the code to define the list of values to be tuned
}

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'baggingregressor__max_features': 0.7, 'baggingregressor__max_samples': 0.7, 'baggingregressor__n_estimators': 10}

Checking model performance on training set¶

bagging_estimator_tuned_model_train_perf = model_performance_regression(bagging_estimator_tuned, X_train, y_train)
bagging_estimator_tuned_model_train_perf

Checking model performance on test set¶

bagging_estimator_tuned_model_test_perf = model_performance_regression(bagging_estimator_tuned, X_test, y_test)
bagging_estimator_tuned_model_test_perf

Observations:¶

The tuned Bagging Regressor shows almost no improvement over the untuned version.
R-squared (~0.668 on test) remains virtually unchanged, indicating similar explanatory power.
RMSE and MAE on the test set are nearly identical to the base Bagging model, suggesting limited gains from tuning.
Train and test metrics are very close, indicating good generalization and low overfitting.
The best parameters selected (max_samples = 0.7, max_features = 0.7, n_estimators = 10) favor higher randomness and fewer trees, reducing variance but not boosting accuracy.
Increasing ensemble complexity (more estimators or features) did not significantly improve performance, implying the model has reached a performance plateau.
Bagging remains stable and robust, but tuning alone cannot extract additional predictive power from the current feature set.

Conclusion:

Hyperparameter tuning does not materially enhance the Bagging Regressor. While it generalizes well, its performance is capped, making it less competitive than more expressive ensemble methods like Gradient Boosting or XGBoost.

Hyperparameter Tuning - Random Forest¶

# Choose the type of classifier.
rf_tuned = RandomForestRegressor(random_state=1)
rf_tuned = make_pipeline(preprocessor,rf_tuned)

# Grid of parameters to choose from
parameters = {
     "randomforestregressor__max_depth": [10, 20, 30, None], #Complete the code to define the list of values to be tuned
     "randomforestregressor__max_features": ['sqrt', 'log2', 1.0, 0.7], #Complete the code to define the list of values to be tuned
     "randomforestregressor__n_estimators": [100, 200, 300], #Complete the code to define the list of values to be tuned
}

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'randomforestregressor__max_depth': 10, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__n_estimators': 100}

Checking model performance on training set¶

rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
rf_tuned_model_train_perf

Checking model performance on test set¶

rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
rf_tuned_model_test_perf

Observations:¶

The tuned Random Forest shows almost identical performance to the default Random Forest model.
Test R² (~0.6685) remains unchanged, indicating no meaningful improvement in explanatory power.
RMSE and MAE on the test set are nearly the same as the untuned model, confirming marginal gains from tuning.
The selected parameters (max_depth = 10, max_features = 'sqrt', n_estimators = 100) impose controlled tree complexity, helping prevent overfitting.
Train and test metrics are closely aligned, suggesting good generalization and stable learning.
Increasing the number of trees beyond 100 or allowing deeper trees did not improve performance, implying diminishing returns.
The model appears bias-limited rather than variance-limited, meaning feature richness matters more than hyperparameter tuning.

Conclusion:

Hyperparameter tuning does not significantly enhance Random Forest performance for this dataset. While the model is stable and reliable, further gains are more likely to come from feature engineering or advanced boosting methods rather than additional tuning.

Hyperparameter Tuning - AdaBoost Regressor¶

# Choose the type of classifier.
ab_tuned = AdaBoostRegressor(random_state=1)
ab_tuned = make_pipeline(preprocessor,ab_tuned)
# Grid of parameters to choose from
parameters = {
     "adaboostregressor__n_estimators": [50, 100, 150, 200], #Complete the code to define the list of values to be tuned
     "adaboostregressor__learning_rate": [0.01, 0.1, 0.5, 1.0], #Complete the code to define the list of values to be tuned
}


# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
ab_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'adaboostregressor__learning_rate': 0.01, 'adaboostregressor__n_estimators': 50}

Checking model performance on training set¶

ab_tuned_model_train_perf = model_performance_regression(ab_tuned, X_train, y_train)
ab_tuned_model_train_perf

Checking model performance on test set¶

ab_tuned_model_test_perf = model_performance_regression(ab_tuned, X_test, y_test)
ab_tuned_model_train_perf

Observations:¶

Hyperparameter tuning selected a low learning rate (0.01) with fewer estimators (50), indicating that conservative boosting works better for this dataset.
Compared to the untuned AdaBoost model, the performance has improved slightly, especially in terms of RMSE and MAE.
R² (~0.684) is higher than the default AdaBoost model, showing better variance explanation after tuning.
Training and test metrics are identical, suggesting the model has high bias and limited flexibility, but also very stable generalization.
The low learning rate reduces the risk of overfitting, but also limits the model’s ability to capture complex nonlinear patterns.
Despite tuning, AdaBoost still underperforms compared to Random Forest, Gradient Boosting, and XGBoost.
The model benefits from tuning more than Decision Tree, but remains less competitive overall.

Conclusion:

Hyperparameter tuning improves AdaBoost modestly, but the model remains bias-constrained. For stronger predictive performance, tree-based ensemble methods with higher capacity (Random Forest, Gradient Boosting, XGBoost) are more suitable for this problem.

Hyperparameter Tuning - Gradient Boosting Regressor¶

# Choose the type of classifier.
gb_tuned = GradientBoostingRegressor(random_state=1)
gb_tuned = make_pipeline(preprocessor,gb_tuned)

# Grid of parameters to choose from
parameters = {
     "gradientboostingregressor__n_estimators": [100, 200, 300], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__subsample": [0.8, 0.9, 1.0], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__max_features": [0.8, 1.0, 'sqrt', 'log2'], #Complete the code to define the list of values to be tuned
     "gradientboostingregressor__max_depth": [3, 4, 5] #Complete the code to define the list of values to be tuned
}


# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'gradientboostingregressor__max_depth': 3, 'gradientboostingregressor__max_features': 0.8, 'gradientboostingregressor__n_estimators': 100, 'gradientboostingregressor__subsample': 0.8}

Checking model performance on training set¶

gb_tuned_model_train_perf = model_performance_regression(gb_tuned, X_train, y_train)
gb_tuned_model_train_perf

Checking model performance on test set¶

gb_tuned_model_test_perf = model_performance_regression(gb_tuned, X_test, y_test)
gb_tuned_model_test_perf

Observations:¶

Hyperparameter tuning selected a shallow tree depth (max_depth = 3), indicating that simple base learners generalize better for this dataset.
The model prefers subsampling (subsample = 0.8) and feature subsampling (max_features = 0.8), which helps reduce overfitting and improve robustness.
The chosen number of estimators (100) balances learning capacity and stability without excessive complexity.
Training performance remains strong (R² ≈ 0.685), similar to the untuned Gradient Boosting model.
Test performance (R² ≈ 0.669, RMSE ≈ 616) is very close to training performance, showing good generalization.
Compared to AdaBoost, the tuned Gradient Boosting model shows lower error and higher R², confirming its superior learning capability.
Hyperparameter tuning results in marginal but consistent improvements, suggesting the base model was already well-specified.
Performance is comparable to Random Forest and XGBoost, making it one of the top-performing models in this study.

Conclusion:

The tuned Gradient Boosting Regressor achieves a strong bias–variance balance, with stable generalization and competitive accuracy. It is a reliable final model choice, especially when interpretability and controlled complexity are important.

Hyperparameter Tuning - XGBoost Regressor¶

# Choose the type of classifier.
xgb_tuned = XGBRegressor(random_state=1)
xgb_tuned = make_pipeline(preprocessor,xgb_tuned)

# Grid of parameters to choose from
parameters = {
     "xgbregressor__n_estimators": [100, 200], #Complete the code to define the list of values to be tuned
     "xgbregressor__subsample": [0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__gamma": [0, 1, 5], #Complete the code to define the list of values to be tuned
     "xgbregressor__colsample_bytree": [0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__colsample_bylevel":[0.7, 0.8, 1.0], #Complete the code to define the list of values to be tuned
     "xgbregressor__max_depth": [3, 5, 7]
}

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=r2_score, cv=3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)

print("Best Parameters Found:")
print(grid_obj.best_params_)

Best Parameters Found:
{'xgbregressor__colsample_bylevel': 0.7, 'xgbregressor__colsample_bytree': 0.7, 'xgbregressor__gamma': 0, 'xgbregressor__max_depth': 3, 'xgbregressor__n_estimators': 100, 'xgbregressor__subsample': 0.7}

Checking model performance on training set¶

xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
xgb_tuned_model_train_perf

Checking model performance on test set¶

xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
xgb_tuned_model_test_perf

Observations:¶

Hyperparameter tuning selected a shallow tree depth (max_depth = 3), reinforcing that simpler trees generalize better for this dataset.
The model favors aggressive subsampling (subsample = 0.7, colsample_bytree = 0.7, colsample_bylevel = 0.7), which helps control overfitting and improves model stability.
A gamma value of 0 indicates that allowing splits without additional loss penalty works well, suggesting the data benefits from flexible splitting.
The optimal number of estimators (100) provides sufficient boosting rounds without overfitting.
Training performance is strong (R² ≈ 0.685, RMSE ≈ 597), comparable to untuned and other tuned ensemble models.
Test performance (R² ≈ 0.668, RMSE ≈ 616) is very close to training metrics, indicating good generalization.
Hyperparameter tuning yields only marginal improvements, suggesting the original XGBoost model was already near optimal.
Compared to AdaBoost, XGBoost performs significantly better; however, its performance is very similar to Random Forest and Gradient Boosting.

Conclusion:

The tuned XGBoost Regressor demonstrates stable and robust performance with strong generalization. While tuning provides limited gains, XGBoost remains one of the top-performing models and is a strong candidate for final deployment, especially when predictive accuracy is prioritized.

Model Performance Comparison, Final Model Selection, and Serialization¶

# training performance comparison

models_train_comp_df = pd.concat(
    [
        rf_estimator_model_train_perf.T,      # Random Forest (base)
        rf_tuned_model_train_perf.T,          # Random Forest (tuned)
        xgb_estimator_model_train_perf.T,     # XGBoost (base)
        xgb_tuned_model_train_perf.T,         # XGBoost (tuned)
    ],
    axis=1,
)

models_train_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost Estimator",
    "XGBoost Tuned",
]

print("Training performance comparison:")
models_train_comp_df

Training performance comparison:

# Testing performance comparison

models_test_comp_df = pd.concat(
    [
        rf_estimator_model_test_perf.T,      # Random Forest (base)
        rf_tuned_model_test_perf.T,          # Random Forest (tuned)
        xgb_estimator_model_test_perf.T,     # XGBoost (base)
        xgb_tuned_model_test_perf.T,         # XGBoost (tuned)
    ],
    axis=1,
)

models_test_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost Estimator",
    "XGBoost Tuned",
]

print("Test performance comparison:")
models_test_comp_df

Test performance comparison:

if rf_tuned_model_train_perf["RMSE"][0] < xgb_tuned_model_train_perf["RMSE"][0]:
    best_model = rf_tuned
else:
    best_model = xgb_tuned

print(f"The best performing model is: {best_model}")

The best performing model is: Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features='sqrt',
                                       random_state=1))])

Observations:¶

🔍 Observations for Final Model Selection

Random Forest (base and tuned) consistently delivers the best overall performance across both training and test datasets.
The tuned Random Forest does not improve performance over the base Random Forest, indicating that the default parameters were already near-optimal.
XGBoost (base and tuned) performs very similarly to Random Forest but shows:
- Slightly higher RMSE and MAE
- Marginally lower R² and Adjusted R² on the test set
The performance gap between training and test sets for Random Forest is small, indicating good generalization and minimal overfitting.
Tuned XGBoost does not outperform base XGBoost, suggesting limited benefit from hyperparameter tuning for this dataset.
Among all models compared:
- Lowest Test RMSE & MAE → Random Forest
- Highest Test R² & Adjusted R² → Random Forest
- Lowest Test MAPE → Random Forest

✅ Final Selection Justification

Random Forest Regressor is selected as the best-performing and most stable model
It balances accuracy, robustness, and generalization
Hyperparameter tuning did not yield meaningful gains, reinforcing confidence in the chosen model

Model Serialization¶

# Create a folder for storing the files needed for web app deployment
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files", exist_ok=True)

# Define the file path to save (serialize) the trained model along with the data preprocessing steps
saved_model_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib"

# Save the best trained model pipeline using joblib
joblib.dump(best_model, saved_model_path) #Complete the code to pass the variable name of the best model

print(f"Model saved successfully at {saved_model_path}")

Model saved successfully at /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib

# Load the saved model pipeline from the file
saved_model = joblib.load("/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/SuperKart_v1_0.joblib") #Complete the code to define the name of the saved model

# Confirm the model is loaded
print("Model loaded successfully.")

Model loaded successfully.

saved_model

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features='sqrt',
                                       random_state=1))])

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type',
                                                   'Product_Id_char',
                                                   'Product_Type_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=10, max_features='sqrt',
                                       random_state=1))])

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Id_char',
                                  'Product_Type_Category'])])

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Id_char', 'Product_Type_Category']

OneHotEncoder(handle_unknown='ignore')

RandomForestRegressor(max_depth=10, max_features='sqrt', random_state=1)

Let's try making predictions on the test set using the deserialized model.

# Test a prediction to confirm functionality
sample_preds = saved_model.predict(X_test[:5])
print("\n Sample Predictions on Test Set:\n", sample_preds)

 Sample Predictions on Test Set:
 [3301.4740379  4861.77518803 4858.00647634 3294.17251579 3951.95954778]

Observations:¶

A dedicated directory was created to store all files required for backend deployment, ensuring a well-structured and maintainable deployment setup.
The trained model was saved using the joblib library as a single .joblib file, which includes the complete machine learning pipeline.
The serialized object contains both the data preprocessing steps (ColumnTransformer and OneHotEncoder) and the final Random Forest regression model, ensuring consistency between training and inference.
Successful execution messages confirm that the model was correctly saved to disk at the specified location.
The model was subsequently reloaded from the saved file, verifying that the serialization process was successful and the file is usable.
Inspection of the loaded object confirms that it is a Pipeline, demonstrating that all preprocessing and modeling components are preserved together.
The OneHotEncoder is configured with handle_unknown='ignore', which improves robustness by allowing the model to handle unseen categorical values during real-time predictions without errors.
Overall, the serialization process ensures the selected best-performing model is deployment-ready and can be directly integrated into a production or web application environment.

Deployment - Backend¶

Setting up a Hugging Face Docker Space for the Backend¶

# Import the login function from the huggingface_hub library
from huggingface_hub import login
from huggingface_hub import create_repo
import os

# Login to Hugging Face account using access token
from google.colab import userdata
hf_token = userdata.get('SuperKart1')
login(token=hf_token)

# create the repository for the Hugging Face Space

try:
  create_repo("SuperKartBackend",
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker" to create a Docker space
        private=False  # Set to True if the space to be private
    )

# Handle potential errors during repository creation
except Exception as e:
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")

Flask Web Framework - app.py¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/app.py"

# Import necessary libraries
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify
import traceback
import math

# Define the path where the model is saved
model_file_name = "SuperKart_v1_0.joblib"

try:
    # Load the trained machine learning model
    model = joblib.load(model_file_name)
except FileNotFoundError:
    print(f"Error: Model file not found at {model_file_name}")
    model = None
except Exception as e:
    print(f"Error loading model: {e}")
    traceback.print_exc()
    model = None

# Initialize the Flask app
app = Flask(__name__)

@app.route('/')
def home():
    return "Welcome to the Super Kart Product Sales Price Prediction API!"

# ---------------- single Prediction Endpoint ----------------
@app.route('/v1/salesprice', methods=['POST'])
def predict_sales_price():
    if model is None:
        return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500

    try:
        property_data = request.get_json(force=True)

        expected_keys = [
            'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
            'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
            'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
        ]
        if not all(key in property_data for key in expected_keys):
            missing_keys = [key for key in expected_keys if key not in property_data]
            return jsonify({"error": f"Missing keys in input data: {missing_keys}"}), 400

        sample = {key: property_data.get(key) for key in expected_keys}
        input_data = pd.DataFrame([sample])

        predicted_sales_price = model.predict(input_data)
        predicted_price = round(float(predicted_sales_price[0]), 2)

        if math.isinf(predicted_price) or math.isnan(predicted_price):
            return jsonify({"error": "Prediction resulted in an invalid value."}), 400

        return jsonify({'Predicted Price': predicted_price}), 200

    except Exception as e:
        print(f"Error during single prediction: {e}")
        traceback.print_exc()
        return jsonify({"error": "Internal server error", "details": str(e)}), 500

# ---------------- Batch Prediction Endpoint ----------------
@app.route('/v1/salespricebatch', methods=['POST'])
def predict_sales_price_batch():
    """
    Expects a CSV file with one product per row.
    Returns JSON: a list of dicts with `row_id` and predicted price.
    """
    if model is None:
        return jsonify({"error": "Model not loaded. Cannot make predictions."}), 500

    if 'file' not in request.files:
        return jsonify({"error": "No file uploaded"}), 400

    try:
        file = request.files['file']
        input_data = pd.read_csv(file)

        expected_columns = [
            'Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area',
            'Product_MRP', 'Store_Size', 'Store_Location_City_Type',
            'Store_Type', 'Product_Id_char', 'Store_Age_Years', 'Product_Type_Category'
        ]
        missing_columns = [col for col in expected_columns if col not in input_data.columns]
        if missing_columns:
            return jsonify({"error": f"Missing required columns: {missing_columns}"}), 400

        input_data.reset_index(inplace=True)
        input_data.rename(columns={'index': 'row_id'}, inplace=True)

        predictions = model.predict(input_data[expected_columns])
        predicted_prices = [round(float(p), 2) for p in predictions]

        results = [
            {"row_id": row_id, "Predicted Price": price}
            for row_id, price in zip(input_data['row_id'], predicted_prices)
        ]

        return jsonify(results), 200

    except Exception as e:
        print(f"Error during batch prediction: {e}")
        traceback.print_exc()
        return jsonify({"error": "Internal server error during batch prediction.", "details": str(e)}), 500

if __name__ == '__main__':
    pass

Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/app.py

Dependency File¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10

Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/requirements.txt

Dockerfile¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/Dockerfile"
# Use slim Python image
FROM python:3.9-slim

# Set working directory inside the container
WORKDIR /app

# Copy project files into the container
COPY . .

# Install dependencies and print package list to verify gunicorn is installed
RUN pip install --no-cache-dir --upgrade pip \
 && pip install --no-cache-dir -r requirements.txt \
 && echo "Installed packages:" \
 && pip list

# Expose the port Hugging Face expects
EXPOSE 7860

# Start the Flask app using gunicorn
# - app: refers to app.py
# - app: the Flask app object in app.py (corrected from rental_price_predictor_api)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:app"]

Writing /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/Dockerfile

Uploading files to Hugging Face Space for Backend¶

# for hugging face space authentication to upload files
from huggingface_hub import HfApi

# Hugging Face space id - Backend
repo_id = "randley7/SuperKartBackend"

# Initialize the API
api = HfApi()

#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/backend_files/"

# Upload Streamlit app files
api.upload_folder(folder_path=folder_path,repo_id=repo_id,repo_type="space")

print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")

Files from /content/drive/MyDrive/Model Deployment/Full_Code/backend_files/ successfully uploaded to the Hugging Face Space: randley7/SuperKartBackend

Link of Hugging face space for Backend app¶

Backend Link

https://huggingface.co/spaces/randley7/SuperKartBackend

Deployment - Frontend¶

Setting up Hugging Face streamlit space for Frontend¶

# Try to create the repository for the Hugging Face Space

try:
    create_repo("SuperKartFrontend",
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "streamlit" to create a Streamlit space
        private=False  # Set to True if you want the space to be private
    )

# Handle potential errors during repository creation
except Exception as e:
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")

# Create the directory if it doesn't exist and then write the file
import os
os.makedirs("/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files", exist_ok=True)

Dockerfile¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/Dockerfile"
# Use Python base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy all files into the container
COPY . /app

# Install dependencies
RUN pip install --upgrade pip && \
    pip install -r requirements.txt

# Expose port for Streamlit
EXPOSE 7860

# Run Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]

Writing /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/Dockerfile

Streamlit UI - app.py¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/app.py"

# import
import streamlit as st
import pandas as pd
import requests

# Streamlit UI
st.title("SuperKart Sales Prediction App")
st.write("Predict store sales based on product and store attributes.")

# Input fields for product and store data
Product_Weight = st.number_input("Product Weight", min_value=0.0, value=12.66)
Product_Sugar_Content = st.selectbox("Product Sugar Content", ["Low Sugar", "Regular", "No Sugar"])
Product_Allocated_Area = st.number_input("Product Allocated Area", min_value=0.0, step=0.1)
Product_MRP = st.number_input("Product MRP", min_value=0.0, step=0.1)
Store_Size = st.selectbox("Store Size", ["Small", "Medium", "High"])
Store_Location_City_Type = st.selectbox("Store Location City Type", ["Tier 1", "Tier 2", "Tier 3"])
Store_Type = st.selectbox("Store Type",["Supermarket Type2", "Departmental Store", "Supermarket Type1", "Food Mart"])
Product_Id_char = st.selectbox("Product Id Char", ["FD", "NC", "DR"])
Store_Age_Years = st.number_input("Store Age Years", min_value=0, step=1)
Product_Type_Category = st.selectbox("Product Type Category", ["Perishables", "Non Perishables"])

input_data = pd.DataFrame([{
    'Product_Weight': Product_Weight,
    'Product_Sugar_Content': Product_Sugar_Content,
    'Product_Allocated_Area': Product_Allocated_Area,
    'Product_MRP': Product_MRP,
    'Store_Size': Store_Size,
    'Store_Location_City_Type': Store_Location_City_Type,
    'Store_Type': Store_Type,
    'Product_Id_char': Product_Id_char,
    'Store_Age_Years': Store_Age_Years,
    'Product_Type_Category': Product_Type_Category
}])


# Predict button
if st.button("Predict"):
    try:
        response = requests.post(
            "https://randley7-SuperKartBackend.hf.space/v1/salesprice",
            json=input_data.to_dict(orient='records')[0]
        )
        if response.status_code == 200:
            prediction = response.json().get("Predicted Price", "No prediction returned")
            st.success(f"Predicted Sales Price: {prediction}")
        else:
            st.error("Error making prediction.")
            st.text(response.text)
    except Exception as e:
        st.error(f"Exception occurred: {e}")

# ----------------- Batch Prediction -----------------
st.subheader("Batch Prediction")

uploaded_file = st.file_uploader("Upload CSV file for batch prediction", type=["csv"])

if uploaded_file is not None:
    if st.button("PredictBatch"):
        try:
            files = {"file": (uploaded_file.name, uploaded_file, "text/csv")}
            response = requests.post(
                "https://randley7-SuperKartBackend.hf.space/v1/salespricebatch",
                files=files
            )
            if response.status_code == 200:
                predictions = response.json()
                st.success("Batch predictions completed!")

                # Convert to DataFrame and display
                df_predictions = pd.DataFrame(predictions)
                st.dataframe(df_predictions)

                # Download button
                csv = df_predictions.to_csv(index=False).encode('utf-8')
                st.download_button(
                    label="Download Predictions as CSV",
                    data=csv,
                    file_name="SuperKart_Predicted_Sales.csv",
                    mime="text/csv"
                )
            else:
                st.error("Error making batch prediction.")
                st.text(response.text)
        except Exception as e:
            st.error(f"Exception occurred: {e}")

Overwriting /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/app.py

Dependencies File¶

%%writefile "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/requirements.txt"
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
streamlit==1.43.2
flask-cors==3.0.10

Overwriting /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/requirements.txt

Uploading files for Hugging Face Space for the Frontend¶

# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "randley7/SuperKartFrontend"

# Initialize the API
api = HfApi()

#Mention the folder path explicitly
folder_path = "/content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/"

# Upload Streamlit app files
api.upload_folder(folder_path=folder_path, repo_id=repo_id,repo_type="space")

print(f"Files from {folder_path} successfully uploaded to the Hugging Face Space: {repo_id}")

Files from /content/drive/MyDrive/Model Deployment/Full_Code/frontend_files/ successfully uploaded to the Hugging Face Space: randley7/SuperKartFrontend

Link of Hugging face space for Frontend¶

Frontend Link

https://huggingface.co/spaces/randley7/SuperKartFrontend

Interfacing using Flask API¶

Prediction¶

# Import the necessary libraries
import json
import requests
import pandas as pd
import numpy as np

#Base URL of the deployed Flask API on Hugging Face Space
model_root_url = "https://randley7-SuperKartBackend.hf.space"

#Endpoint for single inference
model_url = model_root_url + "/v1/salesprice"

#Payload with necessary features for single inference prediction
payload = {
    'Product_Weight': 12.66,
    'Product_Sugar_Content': "Low Sugar",
    'Product_Allocated_Area': 0.20,
    'Product_MRP': 0.30,
    'Store_Size': "Small",
    'Store_Location_City_Type': "Tier 1",
    'Store_Type': "Supermarket Type2",
    'Product_Id_char': "FD",
    'Store_Age_Years': 10,
    'Product_Type_Category': "Non Perishables"
}

#sending a POST request to the model endpoint with the payload
response = requests.post(model_url, json=payload)

print(model_url)
print(response)

# ALWAYS print the raw response text for debugging
print("Raw response text:")
print(response.text)

# Check if the response is successful (status code 200) before trying to parse JSON
if response.status_code == 200:
    try:
        # Attempt to parse the JSON
        print("Parsed JSON response:")
        print(response.json())
    except json.JSONDecodeError as e:
        print(f"JSON Decode Error occurred: {e}")
        print("Could not parse response as JSON despite 200 status code.")
else:
    # If the response was not successful, print the status code and the raw text
    print(f"Error: Received status code {response.status_code}")
    print("Response content (if any):")
    print(response.text) # Print raw text to see the error message from the backend

https://randley7-SuperKartBackend.hf.space/v1/salesprice
<Response [200]>
Raw response text:
{"Predicted Price":3547.64}

Parsed JSON response:
{'Predicted Price': 3547.64}

Observations:¶

Observations – Backend and Frontend Integration

The frontend Streamlit application successfully communicates with the backend Flask API using HTTP POST requests.
Real-time predictions are displayed in the UI, confirming seamless data flow from user input → backend inference → frontend response.
The same API endpoint supports both programmatic access (via Python requests) and UI-based interaction, increasing system flexibility.
The frontend correctly parses backend JSON responses and presents predictions in a user-friendly format.
The batch prediction workflow is fully integrated, allowing users to upload CSV files and receive multiple predictions in one request.
Download functionality for batch prediction results enhances usability and supports real-world analytical workflows.

Observations – Deployment Validation and System Robustness

The backend and frontend are deployed as independent Hugging Face Spaces, ensuring modularity and easier maintenance.
Dockerized deployments ensure consistent runtime environments and eliminate dependency conflicts.
Version-pinned dependencies in requirements.txt improve reproducibility and long-term stability.
The serialized model pipeline (including preprocessing) ensures identical transformations during both training and inference.
Successful predictions from both direct API calls and the UI confirm end-to-end system correctness.
The deployed system demonstrates production readiness with clear endpoints, validation checks, and scalable architecture.

Observations – Interfacing Using Flask API

The Flask API is successfully deployed on Hugging Face Spaces and is accessible via a public HTTPS endpoint, enabling external inference requests.
A RESTful design is followed, with a dedicated /v1/salesprice endpoint for single predictions that accepts JSON payloads.
Input features in the API payload exactly match the features used during model training, ensuring schema consistency and preventing inference mismatches.
The API correctly returns HTTP 200 responses for valid requests, confirming proper request handling and inference execution.
JSON responses are well-structured and include a clearly labeled Predicted Price, facilitating easy consumption by downstream applications.
Robust debugging practices are demonstrated by logging raw response text and safely handling JSON decoding.
Error handling is implemented to capture invalid payloads, missing keys, or unexpected runtime issues, improving API reliability.

Overall Observation

The project demonstrates a complete, production-grade machine learning deployment pipeline, covering model training, evaluation, selection, serialization, backend API development, frontend integration, and cloud deployment. The seamless interaction between components validates both the technical soundness and practical usability of the solution.

Actionable Insights and Business Recommendations¶

SuperKart Sales Prediction Project

Actionable Insights

Product Characteristics Strongly Influence Sales Outcomes

Product attributes such as MRP, weight, sugar content, and category (perishable vs non-perishable) play a significant role in predicting sales value.
Products with optimized pricing and appropriate shelf allocation tend to generate higher predicted sales.
Perishable and non-perishable products show distinct sales behavior, indicating different demand dynamics.

Insight: Sales performance is highly sensitive to product-level decisions rather than being driven by a single store factor.

Store Attributes Drive Demand Variability

Store size, store type, city tier, and store age contribute meaningfully to sales predictions.
Larger stores and stores located in higher-tier cities tend to exhibit stronger sales potential.
Older stores show more stable and predictable sales patterns, likely due to established customer bases.

Insight: Store-level heterogeneity must be accounted for when planning inventory and pricing strategies.

Machine Learning Models Capture Non-Linear Sales Drivers

Tree-based ensemble models (Random Forest and XGBoost) significantly outperform simpler models.
The selected Random Forest model demonstrates consistent generalization, with minimal performance gap between training and test data.
Hyperparameter tuning improves stability but yields marginal gains over the base ensemble models.

Insight: Sales relationships are non-linear, and ensemble models are well-suited for capturing complex interactions between product and store features.

Model Generalization Indicates Reliable Forecasting

Comparable RMSE, MAE, and R² values across training and test sets indicate low overfitting.
The model’s performance consistency suggests it can be trusted for real-world sales estimation scenarios.

Insight: The model can be reliably used for operational and tactical decision-making rather than just exploratory analysis.

Deployment Enables Real-Time and Scalable Decision Support

The Flask API enables real-time inference for individual product–store combinations.
Batch prediction support allows large-scale forecasting across catalogs and store networks.
The Streamlit frontend provides accessibility for non-technical business users.

Insight: The solution moves beyond analytics into an operational decision-support system.

Business Recommendations

Optimize Product Pricing and Placement Strategy

Use the model to simulate sales outcomes for different MRP and shelf-space allocation combinations.
Identify price bands that maximize sales without eroding demand, especially for high-volume categories.
Allocate premium shelf space to products with higher predicted sales impact.

Recommendation: Integrate the model into pricing and merchandising decisions to maximize revenue per square foot.

Implement Store-Specific Inventory Planning

Adjust inventory levels based on store size, location tier, and store maturity.
Avoid uniform inventory policies across all stores, as demand patterns vary significantly.
Use batch predictions to forecast store-level demand before replenishment cycles.

Recommendation: Move from centralized inventory planning to store-cluster-based demand forecasting.

Support New Store and Product Launch Decisions

Use the model to estimate expected sales for new products or newly opened stores using proxy attributes.
Evaluate product–store fit before rollout to reduce launch risk.
Prioritize high-potential store locations and product categories.

Recommendation: Use predictive insights to de-risk expansion and new product introduction strategies.

Enhance Promotional and Marketing Effectiveness

Identify products with high baseline demand and amplify them during promotional campaigns.
Avoid over-promoting products with inherently low predicted demand.
Tailor promotions based on store location and customer demographics inferred from city tiers.

Recommendation: Shift from blanket promotions to data-driven, targeted marketing campaigns.

Enable Sales and Category Teams with Self-Service Analytics

Provide business teams access to the deployed Streamlit application for scenario analysis.
Allow category managers to test “what-if” scenarios by adjusting product and store attributes.
Reduce dependency on technical teams for routine sales forecasting.

Recommendation: Democratize predictive insights to improve agility and decision-making speed.

Integrate the Model into Core Business Systems

Embed the API into ERP, inventory management, or demand planning systems.
Automate daily or weekly batch predictions for operational planning.
Continuously retrain the model with new sales data to maintain accuracy.

Recommendation: Treat the model as a living system, not a one-time analytical output.

Strategic Impact Summary

Revenue Growth: Improved pricing and assortment decisions driven by predictive insights.
Cost Reduction: Lower inventory holding and wastage through accurate demand estimation.
Operational Efficiency: Faster, data-backed decisions enabled by real-time inference.
Scalability: Cloud deployment supports expansion across regions and product lines.
Competitive Advantage: Advanced analytics capability embedded into everyday business operations.

Final Insight

The SuperKart Sales Prediction system transforms historical data into actionable intelligence, enabling the organization to move from reactive decision-making to proactive, predictive retail strategy.

Links of the Hugging Face spaces:¶

Screenshot 2026-01-09 164241.png

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Product_Id	8763	8763	FD306	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Weight	8763.0	NaN	NaN	NaN	12.653792	2.21732	4.0	11.15	12.66	14.18	22.0
Product_Sugar_Content	8763	4	Low Sugar	4885	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Allocated_Area	8763.0	NaN	NaN	NaN	0.068786	0.048204	0.004	0.031	0.056	0.096	0.298
Product_Type	8763	16	Fruits and Vegetables	1249	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_MRP	8763.0	NaN	NaN	NaN	147.032539	30.69411	31.0	126.16	146.74	167.585	266.0
Store_Id	8763	4	OUT004	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Establishment_Year	8763.0	NaN	NaN	NaN	2002.032751	8.388381	1987.0	1998.0	2009.0	2009.0	2009.0
Store_Size	8763	3	Medium	6025	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	8763	3	Tier 2	6262	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	8763	4	Supermarket Type2	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Store_Sales_Total	8763.0	NaN	NaN	NaN	3464.00364	1065.630494	33.0	2761.715	3452.34	4145.165	8000.0

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Product_Id	1586	1586	NC7187	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Weight	1586.0	NaN	NaN	NaN	13.458865	2.064975	6.16	12.0525	13.96	14.95	17.97
Product_Sugar_Content	1586	4	Low Sugar	845	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Allocated_Area	1586.0	NaN	NaN	NaN	0.068768	0.047131	0.004	0.033	0.0565	0.094	0.295
Product_Type	1586	16	Snack Foods	202	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_MRP	1586.0	NaN	NaN	NaN	160.514054	30.359059	71.35	141.72	168.32	182.9375	226.59
Store_Id	1586	1	OUT001	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Establishment_Year	1586.0	NaN	NaN	NaN	1987.0	0.0	1987.0	1987.0	1987.0	1987.0	1987.0
Store_Size	1586	1	High	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1586	1	Tier 2	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1586	1	Supermarket Type1	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Store_Sales_Total	1586.0	NaN	NaN	NaN	3923.778802	904.62901	2300.56	3285.51	4139.645	4639.4	4997.63

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Product_Id	1152	1152	NC2769	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Weight	1152.0	NaN	NaN	NaN	9.911241	1.799846	4.0	8.7675	9.795	10.89	19.82
Product_Sugar_Content	1152	4	Low Sugar	658	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Allocated_Area	1152.0	NaN	NaN	NaN	0.067747	0.047567	0.006	0.031	0.0545	0.09525	0.292
Product_Type	1152	16	Fruits and Vegetables	168	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_MRP	1152.0	NaN	NaN	NaN	107.080634	24.912333	31.0	92.8275	104.675	117.8175	224.93
Store_Id	1152	1	OUT002	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Establishment_Year	1152.0	NaN	NaN	NaN	1998.0	0.0	1998.0	1998.0	1998.0	1998.0	1998.0
Store_Size	1152	1	Small	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1152	1	Tier 3	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1152	1	Food Mart	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Store_Sales_Total	1152.0	NaN	NaN	NaN	1762.942465	462.862431	33.0	1495.4725	1889.495	2133.6225	2299.63

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Product_Id	1349	1349	NC522	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Weight	1349.0	NaN	NaN	NaN	15.103692	1.893531	7.35	14.02	15.18	16.35	22.0
Product_Sugar_Content	1349	4	Low Sugar	750	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Allocated_Area	1349.0	NaN	NaN	NaN	0.068637	0.048708	0.004	0.031	0.057	0.094	0.298
Product_Type	1349	16	Snack Foods	186	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_MRP	1349.0	NaN	NaN	NaN	181.358725	24.796429	85.88	166.92	179.67	198.07	266.0
Store_Id	1349	1	OUT003	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Establishment_Year	1349.0	NaN	NaN	NaN	1999.0	0.0	1999.0	1999.0	1999.0	1999.0	1999.0
Store_Size	1349	1	Medium	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1349	1	Tier 1	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1349	1	Departmental Store	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Store_Sales_Total	1349.0	NaN	NaN	NaN	4946.966323	677.539953	3069.24	4355.39	4958.29	5366.59	8000.0

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Product_Id	4676	4676	NC584	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Weight	4676.0	NaN	NaN	NaN	12.349613	1.428199	7.34	11.37	12.37	13.3025	17.79
Product_Sugar_Content	4676	4	Low Sugar	2632	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Allocated_Area	4676.0	NaN	NaN	NaN	0.069092	0.048584	0.004	0.031	0.056	0.097	0.297
Product_Type	4676	16	Fruits and Vegetables	700	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_MRP	4676.0	NaN	NaN	NaN	142.399709	17.513973	83.04	130.54	142.82	154.1925	197.66
Store_Id	4676	1	OUT004	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Establishment_Year	4676.0	NaN	NaN	NaN	2009.0	0.0	2009.0	2009.0	2009.0	2009.0	2009.0
Store_Size	4676	1	Medium	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	4676	1	Tier 2	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	4676	1	Supermarket Type2	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Product_Store_Sales_Total	4676.0	NaN	NaN	NaN	3299.312111	468.271692	1561.06	2942.085	3304.18	3646.9075	5462.86

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total
0	FD6114	12.66	Low Sugar	0.027	Frozen Foods	117.08	OUT004	2009	Medium	Tier 2	Supermarket Type2	2842.40
1	FD7839	16.54	Low Sugar	0.144	Dairy	171.43	OUT003	1999	Medium	Tier 1	Departmental Store	4830.02
2	FD5075	14.28	Regular	0.031	Canned	162.08	OUT001	1987	High	Tier 2	Supermarket Type1	4130.16
3	FD8233	12.10	Low Sugar	0.112	Baking Goods	186.31	OUT001	1987	High	Tier 2	Supermarket Type1	4132.18
4	NC1180	9.57	No Sugar	0.010	Health and Hygiene	123.67	OUT002	1998	Small	Tier 3	Food Mart	2279.36

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total
8758	NC7546	14.80	No Sugar	0.016	Health and Hygiene	140.53	OUT004	2009	Medium	Tier 2	Supermarket Type2	3806.53
8759	NC584	14.06	No Sugar	0.142	Household	144.51	OUT004	2009	Medium	Tier 2	Supermarket Type2	5020.74
8760	NC2471	13.48	No Sugar	0.017	Health and Hygiene	88.58	OUT001	1987	High	Tier 2	Supermarket Type1	2443.42
8761	NC7187	13.89	No Sugar	0.193	Household	168.44	OUT001	1987	High	Tier 2	Supermarket Type1	4171.82
8762	FD306	14.73	Low Sugar	0.177	Snack Foods	224.93	OUT002	1998	Small	Tier 3	Food Mart	2186.08

	Product_Type	Store_Id	Product_Store_Sales_Total
0	Baking Goods	OUT001	525131.04
1	Baking Goods	OUT002	169860.50
2	Baking Goods	OUT003	491908.20
3	Baking Goods	OUT004	1266086.26
4	Breads	OUT001	121274.09
5	Breads	OUT002	43419.47
6	Breads	OUT003	175391.93
7	Breads	OUT004	374856.75
8	Breakfast	OUT001	38161.10
9	Breakfast	OUT002	23396.10
10	Breakfast	OUT003	95634.08
11	Breakfast	OUT004	204939.13
12	Canned	OUT001	449016.38
13	Canned	OUT002	151467.66
14	Canned	OUT003	452445.17
15	Canned	OUT004	1247153.50
16	Dairy	OUT001	598767.62
17	Dairy	OUT002	178888.18
18	Dairy	OUT003	715814.94
19	Dairy	OUT004	1318447.30
20	Frozen Foods	OUT001	558556.81
21	Frozen Foods	OUT002	180295.95
22	Frozen Foods	OUT003	597608.42
23	Frozen Foods	OUT004	1473519.65
24	Fruits and Vegetables	OUT001	792992.59
25	Fruits and Vegetables	OUT002	298503.56
26	Fruits and Vegetables	OUT003	897437.46
27	Fruits and Vegetables	OUT004	2311899.66
28	Hard Drinks	OUT001	152920.74
29	Hard Drinks	OUT002	54281.85
30	Hard Drinks	OUT003	110760.30
31	Hard Drinks	OUT004	307851.73
32	Health and Hygiene	OUT001	435005.31
33	Health and Hygiene	OUT002	164660.81
34	Health and Hygiene	OUT003	439139.18
35	Health and Hygiene	OUT004	1124901.91
36	Household	OUT001	531371.38
37	Household	OUT002	184665.65
38	Household	OUT003	523981.64
39	Household	OUT004	1324721.50
40	Meat	OUT001	505867.28
41	Meat	OUT002	151800.01
42	Meat	OUT003	520939.68
43	Meat	OUT004	950604.97
44	Others	OUT001	123977.09
45	Others	OUT002	32835.73
46	Others	OUT003	159963.75
47	Others	OUT004	224719.73
48	Seafood	OUT001	52936.84
49	Seafood	OUT002	17663.35
50	Seafood	OUT003	65337.48
51	Seafood	OUT004	136466.37
52	Snack Foods	OUT001	806142.24
53	Snack Foods	OUT002	255317.57
54	Snack Foods	OUT003	918510.44
55	Snack Foods	OUT004	2009026.70
56	Soft Drinks	OUT001	410548.69
57	Soft Drinks	OUT002	103808.35
58	Soft Drinks	OUT003	365046.30
59	Soft Drinks	OUT004	917641.38
60	Starchy Foods	OUT001	120443.98
61	Starchy Foods	OUT002	20044.98
62	Starchy Foods	OUT003	143538.60
63	Starchy Foods	OUT004	234746.89

	Product_Sugar_Content	Store_Id	Product_Store_Sales_Total
0	Low Sugar	OUT001	3300834.93
1	Low Sugar	OUT002	1156758.85
2	Low Sugar	OUT003	3706903.24
3	Low Sugar	OUT004	8658908.78
4	No Sugar	OUT001	1090353.78
5	No Sugar	OUT002	382162.19
6	No Sugar	OUT003	1123084.57
7	No Sugar	OUT004	2674343.14
8	Regular	OUT001	1749444.51
9	Regular	OUT002	472112.50
10	Regular	OUT003	1743566.35
11	Regular	OUT004	3902547.93
12	reg	OUT001	82479.96
13	reg	OUT002	19876.18
14	reg	OUT003	99903.41
15	reg	OUT004	191783.58

	Random Forest Estimator	Random Forest Tuned	XGBoost Estimator	XGBoost Tuned
RMSE	596.994959	596.994959	596.978222	597.081727
MAE	468.875850	468.875850	468.965507	468.747993
R-squared	0.685016	0.685016	0.685033	0.684924
Adj. R-squared	0.684501	0.684501	0.684519	0.684409
MAPE	0.165674	0.165674	0.165690	0.165650

	Random Forest Estimator	Random Forest Tuned	XGBoost Estimator	XGBoost Tuned
RMSE	615.906846	615.906846	615.933034	616.053010
MAE	485.311027	485.311027	485.429585	485.055015
R-squared	0.668510	0.668510	0.668482	0.668353
Adj. R-squared	0.667244	0.667244	0.667215	0.667086
MAPE	0.187394	0.187394	0.187421	0.187289