Link Search Menu Expand Document

Data Modeling

data engineering

data modeling

drawing

What Is Data Modelling?

Data modelling is the process used to structure how data is stored, as well as modelling relationships within the data. The goal is to create a visual data map that accurately describes the data structure, how data will flow through the system whilst highlighting important data relationships. This can involve the data input itself, the data infrastructure and output, whether that’s predictive models, ML algorithms, AI or other products/services.

Project Overview

Problem Statement and Objective

The main goal is to store and provide access to both actual and historical price data with granularity for region, SKU, Channel, and bottom prices for selling and buying. This data-mart will support analytics, reporting, and business intelligence needs.

This mart will help team data to analyze more deeper such as:

  • Price prediction
  • Recommendation system for High velocity and margin for each area, channel and SKU
  • Part of matchmaking recommendation to generate transaction with high margin
  • etc

issues From product FOV

drawing

issues From Business FOV

drawing

scope

drawing

Success Criteria

Define how the datamart’s success will be measured:

  • All key metrics are available with the required granularity.
  • Data is accurate, consistent, and available for reporting.
  • The datamart supports required performance and scalability.

Timeline

Define a high-level timeline with phases: drawing


Comparing analysis

drawing


Gap Analysis

Assess gaps between current systems and the desired state: drawing


Data Sources and Input Tables

drawing

Source Systems:

  • List the source systems (e.g., transactional sales systems, procurement systems, ERP, third-party APIs).

Input Tables:

  • Specify the tables and fields that will feed into the datamart (e.g., sales_transactions, procurement_orders, product_master, region_hierarchy).

Data Frequency:

  • Define how often the data will be updated (e.g., daily, weekly, real-time).

Data Model Design

drawing

Entities:

  • Define key entities: Region, SKU, Time, parameters.

Granularity:

  • Data will be aggregated at the parameters and regional levels with time-based dimensions (daily, weekly, or monthly granularity).

Data Modeling Approach

Star Schema: The data model will use a star schema design. This structure will include:

  • Fact Table: parameter_fact — contains quantitative data, representing the main transactions.
  • Dimension Tables: sku_dimension, region_dimension, time_dimension, and potentially others depending on business needs (e.g., supplier_dimension).

The star schema is chosen for its simplicity and performance, optimized for fast query response in reporting and analytics.

Fact Table

Fact Table: parameter_fact The fact table stores the main transactional data. It will be the central table of the datamart, containing measures (e.g., actual parameter) and foreign keys to dimension tables.

Schema Definition

Column Name Data Type Description
parameter_fact_id INT Primary key (auto-increment)
sku_id INT Foreign key to sku_dimension
region_id INT Foreign key to region_dimension
date_id INT Foreign key to time_dimension

Design Considerations

  • Granularity: The lowest level of detail is at the SKU, region, and daily level (based on date_id).
  • Currency Handling: parameters fields should accommodate different currencies; conversions might be applied if needed.
  • Historical Tracking: Each record represents a snapshot of parameters at a specific time, ensuring a historical trail.

Dimension Tables

sku_dimension Table Stores metadata about the products (SKUs).

Column Name Data Type Description
sku_id INT Primary key
sku_name VARCHAR Name of the SKU/product
category VARCHAR Product category (e.g., Electronics).

region_dimension Table Contains information about regions or markets.

Column Name Data Type Description
region_id INT Primary key
region_name VARCHAR Name of the region (e.g., West Java)
region_code VARCHAR Code or abbreviation for the region

time_dimension Table Provides a time-based reference for daily, weekly, and monthly aggregations.

Column Name Data Type Description
date_id INT Primary key (YYYYMMDD format)
date DATE Full date (YYYY-MM-DD)
day INT Day of the month

Relationships

The tables are connected via foreign keys:

  • prarameter_fact.sku_idsku_dimension.sku_id
  • prarameter_fact.region_idregion_dimension.region_id
  • prarameter_fact.date_idtime_dimension.date_id
  • sku_dimension.supplier_idsupplier_dimension.supplier_id

Data Flow and ETL Considerations

Source to Target Mapping:

  • Extract actual parameters and historical records from transactional tables.
  • Transform data by aggregating parameters and calculating bottom parameters.
  • Load into the parameters_fact table and update relevant dimensions.

Incremental Loading:

  • Daily updates for new parameters and changes in historical records.
  • Implement Slowly Changing Dimensions (SCD Type 2) for tracking changes in dimensions like sku_dimension or region_dimension.

Transformation Logic

Data Cleansing:

  • Standardize parameters, remove outliers, handle missing values.

Aggregation Rules:

  • Define how data will be aggregated at different levels (e.g., averaging for time periods, lowest parameters).

Historical Tracking:

  • Ensure historical changes in parameters are tracked using slowly changing dimensions or versioning.

Performance Considerations

Indexing:

  • Index SKU, region, and date columns to optimize query performance.

Partitioning:

  • Partition data by region and time period for efficient data retrieval.

Example Queries

Query 1: Get the historical trend of the bottom selling price for a specific SKU in a region.

SELECT date, bottom_selling_price
FROM price_fact
JOIN sku_dimension ON price_fact.sku_id = sku_dimension.sku_id
JOIN region_dimension ON price_fact.region_id = region_dimension.region_id
WHERE sku_name = 'Product A' AND region_name = 'West Java'
ORDER BY date;

Query 2: Calculate the average selling price for each SKU in a given month.

SELECT sku_name, AVG(actual_selling_price) AS avg_selling_price
FROM price_fact
JOIN sku_dimension ON price_fact.sku_id = sku_dimension.sku_id
JOIN time_dimension ON price_fact.date_id = time_dimension.date_id
WHERE month = 10 AND year = 2024
GROUP BY sku_name;

Security and Access Control

Data Access:

  • Define user roles and permissions (e.g., read-only access for analysts, write access for data engineers).

Data Sensitivity:

  • Handling of sensitive pricing information, especially PII data.

Testing and Validation

Test Scenarios:

  • Data consistency checks, historical parameters accuracy validation, region-wise aggregations.

Acceptance Criteria:

  • Data accuracy should meet business requirements, performance benchmarks for query speed.

MART_xx SCHEMA

Define Key Metrics and Definitions drawing


Recommendation and Added Values

drawing


Risks and Mitigation

drawing


Stakeholders

drawing