Data Modeling

data engineering

data modeling

drawing

What Is Data Modelling?

Data modelling is the process used to structure how data is stored, as well as modelling relationships within the data. The goal is to create a visual data map that accurately describes the data structure, how data will flow through the system whilst highlighting important data relationships. This can involve the data input itself, the data infrastructure and output, whether that’s predictive models, ML algorithms, AI or other products/services.

Project Overview

Problem Statement and Objective

The main goal is to store and provide access to both actual and historical price data with granularity for region, SKU, Channel, and bottom prices for selling and buying. This data-mart will support analytics, reporting, and business intelligence needs.

This mart will help team data to analyze more deeper such as:

Price prediction
Recommendation system for High velocity and margin for each area, channel and SKU
Part of matchmaking recommendation to generate transaction with high margin
etc

issues From product FOV

drawing

issues From Business FOV

drawing

scope

drawing

Success Criteria

Define how the datamart’s success will be measured:

All key metrics are available with the required granularity.
Data is accurate, consistent, and available for reporting.
The datamart supports required performance and scalability.

Timeline

Define a high-level timeline with phases: drawing

Comparing analysis

drawing

Gap Analysis

Assess gaps between current systems and the desired state: drawing

Data Sources and Input Tables

drawing

Source Systems:

List the source systems (e.g., transactional sales systems, procurement systems, ERP, third-party APIs).

Input Tables:

Specify the tables and fields that will feed into the datamart (e.g., sales_transactions, procurement_orders, product_master, region_hierarchy).

Data Frequency:

Define how often the data will be updated (e.g., daily, weekly, real-time).

Data Model Design

drawing

Entities:

Define key entities: Region, SKU, Time, parameters.

Granularity:

Data will be aggregated at the parameters and regional levels with time-based dimensions (daily, weekly, or monthly granularity).

Data Modeling Approach

Star Schema: The data model will use a star schema design. This structure will include:

Fact Table: parameter_fact — contains quantitative data, representing the main transactions.
Dimension Tables: sku_dimension, region_dimension, time_dimension, and potentially others depending on business needs (e.g., supplier_dimension).

The star schema is chosen for its simplicity and performance, optimized for fast query response in reporting and analytics.

Fact Table

Fact Table: parameter_fact The fact table stores the main transactional data. It will be the central table of the datamart, containing measures (e.g., actual parameter) and foreign keys to dimension tables.

Schema Definition

Column Name	Data Type	Description
parameter_fact_id	INT	Primary key (auto-increment)
sku_id	INT	Foreign key to sku_dimension
region_id	INT	Foreign key to region_dimension
date_id	INT	Foreign key to time_dimension
…	…	…

Design Considerations

Granularity: The lowest level of detail is at the SKU, region, and daily level (based on date_id).
Currency Handling: parameters fields should accommodate different currencies; conversions might be applied if needed.
Historical Tracking: Each record represents a snapshot of parameters at a specific time, ensuring a historical trail.

Dimension Tables

sku_dimension Table Stores metadata about the products (SKUs).

Column Name	Data Type	Description
sku_id	INT	Primary key
sku_name	VARCHAR	Name of the SKU/product
category	VARCHAR	Product category (e.g., Electronics).
…	…	…

region_dimension Table Contains information about regions or markets.

Column Name	Data Type	Description
region_id	INT	Primary key
region_name	VARCHAR	Name of the region (e.g., West Java)
region_code	VARCHAR	Code or abbreviation for the region
…	…	…

time_dimension Table Provides a time-based reference for daily, weekly, and monthly aggregations.

Column Name	Data Type	Description
date_id	INT	Primary key (YYYYMMDD format)
date	DATE	Full date (YYYY-MM-DD)
day	INT	Day of the month
…	…	…

Relationships

The tables are connected via foreign keys:

prarameter_fact.sku_id → sku_dimension.sku_id
prarameter_fact.region_id → region_dimension.region_id
prarameter_fact.date_id → time_dimension.date_id
sku_dimension.supplier_id → supplier_dimension.supplier_id

Data Flow and ETL Considerations

Source to Target Mapping:

Extract actual parameters and historical records from transactional tables.
Transform data by aggregating parameters and calculating bottom parameters.
Load into the parameters_fact table and update relevant dimensions.

Incremental Loading:

Daily updates for new parameters and changes in historical records.
Implement Slowly Changing Dimensions (SCD Type 2) for tracking changes in dimensions like sku_dimension or region_dimension.

Transformation Logic

Data Cleansing:

Standardize parameters, remove outliers, handle missing values.

Aggregation Rules:

Define how data will be aggregated at different levels (e.g., averaging for time periods, lowest parameters).

Historical Tracking:

Ensure historical changes in parameters are tracked using slowly changing dimensions or versioning.

Performance Considerations

Indexing:

Index SKU, region, and date columns to optimize query performance.

Partitioning:

Partition data by region and time period for efficient data retrieval.

Example Queries

Query 1: Get the historical trend of the bottom selling price for a specific SKU in a region.

SELECT date, bottom_selling_price
FROM price_fact
JOIN sku_dimension ON price_fact.sku_id = sku_dimension.sku_id
JOIN region_dimension ON price_fact.region_id = region_dimension.region_id
WHERE sku_name = 'Product A' AND region_name = 'West Java'
ORDER BY date;

Query 2: Calculate the average selling price for each SKU in a given month.

SELECT sku_name, AVG(actual_selling_price) AS avg_selling_price
FROM price_fact
JOIN sku_dimension ON price_fact.sku_id = sku_dimension.sku_id
JOIN time_dimension ON price_fact.date_id = time_dimension.date_id
WHERE month = 10 AND year = 2024
GROUP BY sku_name;