Kedro 101

Introduction

Kedro is an open-source Python framework that helps data scientists and engineers build reliable, scalable, and maintainable data pipelines. It brings best practices from software engineering into data science projects, making it easier to move from exploration to production.

Why Use Kedro?

  • Modularity & Reusability – Organizes code in a structured, reusable way.
  • Pipeline Management – Helps design and visualize complex workflows.
  • Configuration Management – Keeps secrets and parameters cleanly separated.
  • Testing & Versioning – Encourages best practices like unit testing and reproducibility.
  • Seamless Deployment – Works well with cloud platforms and production environments.

Step-by-Step Guide to Using Kedro

  • Install Kedro

    Make sure we have Python installed, then run:

      pip install kedro
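
    To confirm the installation, we can check the installed version:

      kedro --version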
    
  • Create a New Kedro Project

    To create a new project, we can run kedro new.

      kedro new --telemetry=no
    

    There are several starter templates that we can use, such as:

    • Default spaceflights starter (spaceflights-pandas): Added if we selected any combination of linting, testing, custom logging, documentation, and data structure, unless we also selected PySpark or Kedro Viz.
    • PySpark spaceflights starter (spaceflights-pyspark): Added if we selected PySpark with any other tools, unless we also selected Kedro Viz.
    • Kedro Viz spaceflights starter (spaceflights-pandas-viz): Added if Kedro Viz was one of our tools choices, unless we also selected PySpark.
    • Full feature spaceflights starter (spaceflights-pyspark-viz): Added if we selected all available tools, including PySpark and Kedro Viz.

    These templates can be used with the --starter option:

      kedro new --starter=spaceflights-pandas-viz --telemetry=no
    

    Next, we will be asked to type our project and folder names.


  • Install all libraries listed in requirements.txt

      pip install -r requirements.txt
    
  • Understand the Project Structure

    A typical Kedro project looks like this:

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └── tests/                      # Unit tests
    
  • Additionally, we can add a notebook/ folder to store all Jupyter notebooks used for the modeling

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └── tests/                      # Unit tests
    
  • Prepare our Jupyter notebook for the usual data science work (exploration, cleaning, modeling).

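    Kedro can also launch Jupyter with the project context (catalog, parameters, session) already loaded, which is handy for this step, assuming Jupyter is installed in the environment:

      kedro jupyter notebook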

  • Convert each process into a function to make it easier to implement in Kedro, as in the sketch below

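    As a rough illustration (bersihkan_data below is a hypothetical example, not from the original notebook), an inline notebook cell becomes small functions with explicit inputs and outputs:

      import pandas as pd

      # Notebook style (before): everything runs inline at the top level
      #   raw_df = pd.read_csv(url)
      #   cleaned_df = raw_df.dropna()

      # Kedro style (after): each step becomes a function a node can call
      def baca_file(url):
          # read the raw data from the given URL
          return pd.read_csv(url)

      def bersihkan_data(raw_df):
          # hypothetical cleaning step: drop rows with missing values
          return raw_df.dropna()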

  • Build Our First Data Pipeline

    Kedro uses nodes (functions) and pipelines (workflow sequences).

    Let's say we have two processes, data preprocessing and modeling. We can split them into two pipeline folders.

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   │   ├── data_preprocessing  # Data processing pipelines
      │   │   │   ├── pipeline.py     # Pipeline process for data preprocessing
      │   │   │   └── nodes.py        # Data transformation functions
      │   │   └── modeling            # Modeling pipelines   
      │   │       ├── pipeline.py     # Pipeline process for modeling
      │   │       └── nodes.py        # Modeling functions
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └ ...
    
    • nodes.py

      Put all the functions (and all library imports) from the notebook in the previous step into this file.

        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import OneHotEncoder, StandardScaler
        from sklearn.compose import ColumnTransformer
      
        def baca_file(url):
            # read the raw data from the given URL
            raw_df = pd.read_csv(url)
            return raw_df
        ...
      
    • pipeline.py

      This file assembles the nodes from nodes.py into a Pipeline that Kedro can run.

        from kedro.pipeline import Pipeline, node, pipeline
        from .nodes import *
      
        def create_pipeline(**kwargs) -> Pipeline:
            return pipeline(
                [
                    node(
                        func= baca_file,
                        inputs='params:url',
                        outputs='raw_df',
                        name='baca_file'
                    )
                ]
            )            
        ...
      

      We can add all of our nodes to the list passed to pipeline(). The pipelines themselves are collected in pipeline_registry.py, sketched below.
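    • pipeline_registry.py

      For kedro run to find our pipelines, they must be returned by register_pipelines() in this file. Recent Kedro templates generate it with find_pipelines(), roughly as in this sketch:

        from kedro.framework.project import find_pipelines
        from kedro.pipeline import Pipeline


        def register_pipelines() -> dict[str, Pipeline]:
            """Collect every create_pipeline() found under the pipelines folder."""
            pipelines = find_pipelines()
            # "__default__" runs when we call `kedro run` without --pipeline
            pipelines["__default__"] = sum(pipelines.values())
            return pipelines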

  • Define the Data Catalog

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      │   └── base/                   # base folder
      │       ├── parameters_data_preprocessing.yml
      │       ├── catalog.yml
      │       └── parameters_modeling.yml 
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   │   ├── data_preprocessing  # Data processing pipelines
      │   │   │   ├── pipeline.py     # Pipeline process for data preprocessing
      │   │   │   └── nodes.py        # Data transformation functions
      │   │   └── modeling            # Modeling pipelines   
      │   │       ├── pipeline.py     # Pipeline process for modeling
      │   │       └── nodes.py        # Modeling functions
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └ ...
    

    Here we define all input and output data used by the pipelines.

    • Define input data: create a file at conf/base/parameters_data_preprocessing.yml to specify where the data comes from. Let's say the data comes from a URL; we can define it as:

        url: https://docs.google.com/spreadsheets/d/e/ ... pub?gid=0&single=true&output=csv
      

      We can also use parameters_modeling.yml for any parameters used in the modeling process.
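
      For example, a minimal parameters_modeling.yml could look like this (the parameter names and values here are purely illustrative):

        test_size: 0.2
        random_state: 42

      These values would then be available to nodes as 'params:test_size' and 'params:random_state', just like 'params:url' above.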

    • Define output data: edit conf/base/catalog.yml to specify where each dataset is stored:

        raw_data:
          type: pandas.CSVDataSet
          filepath: data/01_raw/raw_data.csv

        cleaned_data:
          type: pandas.CSVDataSet
          filepath: data/02_intermediate/cleaned_data.csv

      Each dataset will be written to the filepath defined for it.
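
      The entry names in catalog.yml are what connect datasets to nodes: when a node's inputs or outputs match a catalog entry, Kedro loads and saves those files for us. A hypothetical node using the entries above (bersihkan_data is only an example name) could be added to pipeline.py like this:

        from kedro.pipeline import node
        from .nodes import bersihkan_data  # hypothetical cleaning function

        # this node would go inside the list passed to pipeline()
        node(
            func=bersihkan_data,
            inputs='raw_data',       # loaded from data/01_raw/raw_data.csv
            outputs='cleaned_data',  # saved to data/02_intermediate/cleaned_data.csv
            name='bersihkan_data'
        )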

  • Run the Pipeline

      kedro run
    

    Kedro will automatically execute the pipeline and store the processed data in the specified locations.
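
    While iterating, we can also run a single registered pipeline by name, for example (assuming the data_preprocessing pipeline from the structure above):

      kedro run --pipeline data_preprocessing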


  • Visualize the Pipeline

    To see how your data flows through the pipeline:

      kedro viz
    

    This launches an interactive graph in your browser.


What’s Next?

Kedro is a powerful tool that can help you move from messy notebooks to well-structured, production-ready projects. Once comfortable, explore:

  • Advanced Pipelines (Branching, Dependencies)
  • Integrations with MLflow, Airflow, and cloud services
  • Custom Hooks & Plugins

Kedro makes data science cleaner, more efficient, and scalable.

References

  • https://docs.kedro.org/en/stable/index.html
  • https://kedro.org/