Kedro 101

Introduction

Kedro is an open-source Python framework that helps data scientists and engineers build reliable, scalable, and maintainable data pipelines. It brings best practices from software engineering into data science projects, making it easier to move from exploration to production.

Why Use Kedro?

  • Modularity & Reusability – Organizes code in a structured, reusable way.
  • Pipeline Management – Helps design and visualize complex workflows.
  • Configuration Management – Keeps secrets and parameters cleanly separated.
  • Testing & Versioning – Encourages best practices like unit testing and reproducibility.
  • Seamless Deployment – Works well with cloud platforms and production environments.

Step-by-Step Guide to Using Kedro

  • Install Kedro

    Make sure we have Python installed, then run:

      pip install kedro
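
    To confirm the installation, we can check the installed version:

      kedro --version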
    
  • Create a New Kedro Project

    To create a new project, we can run kedro new.

      kedro new --telemetry=no
    

    There are several starter templates that we can use, such as:

    • Default spaceflights starter (spaceflights-pandas): Added if we selected any combination of linting, testing, custom logging, documentation, and data structure, unless we also selected PySpark or Kedro Viz.
    • PySpark spaceflights starter (spaceflights-pyspark): Added if we selected PySpark with any other tools, unless we also selected Kedro Viz.
    • Kedro Viz spaceflights starter (spaceflights-pandas-viz): Added if Kedro Viz was one of our tools choices, unless we also selected PySpark.
    • Full feature spaceflights starter (spaceflights-pyspark-viz): Added if we selected all available tools, including PySpark and Kedro Viz.

    These templates can be used with the --starter option:

      kedro new --starter=spaceflights-pandas-viz --telemetry=no
    

    Next, we will be asked to type our project and folder names.


  • Install all libraries listed in requirements.txt

      pip install -r requirements.txt
    
  • Understand the Project Structure

    A typical Kedro project looks like this:

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └── tests/                      # Unit tests
    
  • Additionally, we can add a notebook/ folder to store all Jupyter notebooks used for the modeling

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └── tests/                      # Unit tests
    
  • Prepare our Jupyter notebook for the usual data science work (exploration, cleaning, modeling).

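    Kedro can also launch Jupyter with the project context (catalog, parameters, session) already loaded, which is handy for this step, assuming Jupyter is installed in the environment:

      kedro jupyter notebook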

  • Convert each process into a function to make it easier to implement in Kedro, as in the sketch below

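    As a rough illustration (bersihkan_data below is a hypothetical example, not from the original notebook), an inline notebook cell becomes small functions with explicit inputs and outputs:

      import pandas as pd

      # Notebook style (before): everything runs inline at the top level
      #   raw_df = pd.read_csv(url)
      #   cleaned_df = raw_df.dropna()

      # Kedro style (after): each step becomes a function a node can call
      def baca_file(url):
          # read the raw data from the given URL
          return pd.read_csv(url)

      def bersihkan_data(raw_df):
          # hypothetical cleaning step: drop rows with missing values
          return raw_df.dropna()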

  • Build Our First Data Pipeline

    Kedro uses nodes (functions) and pipelines (workflow sequences).

    Let's say we have two processes, data preprocessing and modeling. We can split them into two pipeline folders.

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   │   ├── data_preprocessing  # Data processing pipelines
      │   │   │   ├── pipeline.py     # Pipeline process for data preprocessing
      │   │   │   └── nodes.py        # Data transformation functions
      │   │   └── modeling            # Modeling pipelines   
      │   │       ├── pipeline.py     # Pipeline process for modeling
      │   │       └── nodes.py        # Modeling functions
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └ ...
    
    • nodes.py

      Put all the functions (and all library imports) from the notebook in the previous step into this file.

        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import OneHotEncoder, StandardScaler
        from sklearn.compose import ColumnTransformer
      
        def baca_file(url):
            # read the raw data from the given URL
            raw_df = pd.read_csv(url)
            return raw_df
        ...
      
    • pipeline.py

      This file assembles the nodes from nodes.py into a Pipeline that Kedro can run.

        from kedro.pipeline import Pipeline, node, pipeline
        from .nodes import *
      
        def create_pipeline(**kwargs) -> Pipeline:
            return pipeline(
                [
                    node(
                        func= baca_file,
                        inputs='params:url',
                        outputs='raw_df',
                        name='baca_file'
                    )
                ]
            )            
        ...
      

      We can add all of our nodes to the list passed to pipeline(). The pipelines themselves are collected in pipeline_registry.py, sketched below.
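    • pipeline_registry.py

      For kedro run to find our pipelines, they must be returned by register_pipelines() in this file. Recent Kedro templates generate it with find_pipelines(), roughly as in this sketch:

        from kedro.framework.project import find_pipelines
        from kedro.pipeline import Pipeline


        def register_pipelines() -> dict[str, Pipeline]:
            """Collect every create_pipeline() found under the pipelines folder."""
            pipelines = find_pipelines()
            # "__default__" runs when we call `kedro run` without --pipeline
            pipelines["__default__"] = sum(pipelines.values())
            return pipelines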

  • Define the Data Catalog

      my_kedro_project/
      ├── data/                       # Data storage
      ├── conf/                       # Configuration files
      │   └── base/                   # base folder
      │       ├── parameters_data_preprocessing.yml
      │       ├── catalog.yml
      │       └── parameters_modeling.yml 
      ├── notebook/                   # for jupyter notebook files
      ├── src/                        # Source code
      │   ├── pipelines/              # Pipelines folder
      │   │   ├── data_preprocessing  # Data processing pipelines
      │   │   │   ├── pipeline.py     # Pipeline process for data preprocessing
      │   │   │   └── nodes.py        # Data transformation functions
      │   │   └── modeling            # Modeling pipelines   
      │   │       ├── pipeline.py     # Pipeline process for modeling
      │   │       └── nodes.py        # Modeling functions
      │   ├── pipeline_registry.py    # Registers the pipelines
      │   ├── settings.py             # Project settings
      └ ...
    

    Here we define all input and output data used by the pipelines.

    • Define input data: create a file at conf/base/parameters_data_preprocessing.yml to specify where the data comes from. Let's say the data comes from a URL; we can define it as:

        url: https://docs.google.com/spreadsheets/d/e/ ... pub?gid=0&single=true&output=csv
      

      We can also use parameters_modeling.yml for any parameters used in the modeling process.
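
      For example, a minimal parameters_modeling.yml could look like this (the parameter names and values here are purely illustrative):

        test_size: 0.2
        random_state: 42

      These values would then be available to nodes as 'params:test_size' and 'params:random_state', just like 'params:url' above.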

    • Define output data: edit conf/base/catalog.yml to specify where each dataset is stored:

        raw_data:
          type: pandas.CSVDataSet
          filepath: data/01_raw/raw_data.csv

        cleaned_data:
          type: pandas.CSVDataSet
          filepath: data/02_intermediate/cleaned_data.csv

      Each dataset will be written to the filepath defined for it.
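
      The entry names in catalog.yml are what connect datasets to nodes: when a node's inputs or outputs match a catalog entry, Kedro loads and saves those files for us. A hypothetical node using the entries above (bersihkan_data is only an example name) could be added to pipeline.py like this:

        from kedro.pipeline import node
        from .nodes import bersihkan_data  # hypothetical cleaning function

        # this node would go inside the list passed to pipeline()
        node(
            func=bersihkan_data,
            inputs='raw_data',       # loaded from data/01_raw/raw_data.csv
            outputs='cleaned_data',  # saved to data/02_intermediate/cleaned_data.csv
            name='bersihkan_data'
        )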

  • Run the Pipeline

      kedro run
    

    Kedro will automatically execute the pipeline and store the processed data in the specified locations.
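
    While iterating, we can also run a single registered pipeline by name, for example (assuming the data_preprocessing pipeline from the structure above):

      kedro run --pipeline data_preprocessing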


  • Visualize the Pipeline

    To see how your data flows through the pipeline:

      kedro viz
    

    This launches an interactive graph in your browser.


What’s Next?

Kedro is a powerful tool that can help you move from messy notebooks to well-structured, production-ready projects. Once comfortable, explore:

  • Advanced Pipelines (Branching, Dependencies)
  • Integrations with MLflow, Airflow, and cloud services
  • Custom Hooks & Plugins

Kedro makes data science cleaner, more efficient, and scalable.

References

  • https://docs.kedro.org/en/stable/index.html
  • https://kedro.org/