Waylon Walker

Posted on Aug 19, 2021 • Originally published at waylonwalker.com

What is Kedro

#python #kedro #datascience

Kedro is an unopinionated Data Engineering framework that comes with a somewhat opinionated template. It gives the user a way to build pipelines that automatically take care of io through abstract DataSets, which the user specifies through Catalog entries. Nodes load these Catalog entries, run them through a function, and save the results. The order in which the Nodes execute is determined by the Pipeline, which is a DAG. It's the runner's job to manage the execution of the Nodes.

https://waylonwalker.com/what-is-kedro-1/

This is an updated version of my original what-is-kedro article

Hot Take

If you are doing a series of operations on data with python, especially if you are using something as well supported as pandas, you should be using a framework that gives you a pipeline as a DAG and abstracts io.

Orchestrators

Like I said, kedro is unopinionated: it does not determine where or how your data should be run. The kedro team does support the following orchestrators with very little added on to the base template.

DataSets

Did I say kedro is unopinionated? DataSets are what allow kedro to be so flexible across a number of different python objects. Any python object can be made into a kedro dataset. Kedro comes out of the box with many purpose-built DataSets, such as ones that store pandas DataFrames as parquet, csv, or a sql table. If kedro does not come with support for the type of python objects you work with, don't worry: you can fork the closest option they support and build your own. Or, if you do not want to build your own, you can use a PickleDataSet for anything that can be pickled.
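To make the idea of a dataset a bit more concrete, here is a minimal sketch of an object that only knows how to load and save. It does not import kedro: the class name ToyJSONDataSet and its load/save methods are hypothetical and only mimic the general contract. In a real project you would build on kedro's own abstract dataset base class rather than this.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class ToyJSONDataSet:
    """Hypothetical stand-in for a dataset: it only knows how to load and save."""

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def load(self) -> Any:
        # Read the file and hand back a plain python object.
        with self._filepath.open() as f:
            return json.load(f)

    def save(self, data: Any) -> None:
        # Create the parent folder if needed, then write the object to disk.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("w") as f:
            json.dump(data, f)


# Usage: the calling code never opens the file itself.
tmp_dir = tempfile.mkdtemp()
cars = ToyJSONDataSet(str(Path(tmp_dir) / "cars.json"))
cars.save([{"make": "ford", "year": 2017}])
loaded = cars.load()
```

The point of the design is that the functions doing the real work only ever see python objects, while everything about files lives in one small class.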

Catalog

You will not often be creating your own datasets, since most of what you need is already taken care of by the kedro framework. What you will need to do is use the existing DataSets to build your data catalog.

Kedro takes care of all of the file io for you. You simply use the catalog to tell kedro what type of DataSet to use and any extra information that DataSet needs. Much of the time this is simply a filepath.

Typically the catalog is specified in yaml format. If you are not familiar with yaml, I suggest learnxinyminutes.com/docs/yaml/ as a resource of examples.

test:
  type: pandas.CSVDataSet
  filepath: s3://your_bucket/test.csv

Here is the most basic yaml catalog entry taken from the kedro docs

cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .

Here is a slightly more complex example from the docs that takes in load_args and save_args.
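As a rough sketch of what happens to catalog entries like the ones above, the snippet below turns a config dictionary (what a yaml loader would hand back) into dataset objects. This is not kedro's implementation: ToyCSVDataSet, DATASET_TYPES, build_catalog, and the "toy.CSVDataSet" type string are all made up for illustration, and it uses the standard library csv module instead of pandas.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


class ToyCSVDataSet:
    """Hypothetical CSV dataset; not kedro's pandas.CSVDataSet."""

    def __init__(self, filepath: str, load_args: Optional[Dict[str, Any]] = None) -> None:
        self._filepath = Path(filepath)
        self._load_args = load_args or {}

    def load(self) -> List[Dict[str, str]]:
        # load_args are forwarded to the csv reader, e.g. {"delimiter": ","}.
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f, **self._load_args))


# Assumed registry that maps the "type" string of an entry to a class.
DATASET_TYPES = {"toy.CSVDataSet": ToyCSVDataSet}


def build_catalog(config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Turn {name: {type: ..., other kwargs}} into {name: dataset instance}."""
    catalog = {}
    for name, entry in config.items():
        kwargs = dict(entry)  # copy so the config itself is not mutated
        dataset_class = DATASET_TYPES[kwargs.pop("type")]
        catalog[name] = dataset_class(**kwargs)
    return catalog


# Write a small file so the example runs end to end.
csv_path = Path(tempfile.mkdtemp()) / "cars.csv"
csv_path.write_text("make,year\nford,2017\n")

config = {
    "cars": {
        "type": "toy.CSVDataSet",
        "filepath": str(csv_path),
        "load_args": {"delimiter": ","},
    }
}
catalog = build_catalog(config)
rows = catalog["cars"].load()
```

Everything besides type is passed straight through as keyword arguments, which is why a catalog entry can carry whatever extra settings a given dataset needs.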

Nodes

Nodes are a core part of kedro and the building blocks of the DAG. Each node defines which catalog entries get passed into which function, and which catalog entries the outputs are saved to.

from typing import Dict, List

import numpy as np
import pandas as pd
from kedro.pipeline import node

def clean_data(cars: pd.DataFrame,
               boats: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    return dict(cars_df=cars.dropna(), boats_df=boats.dropna())

def halve_dataframe(data: pd.DataFrame) -> List[pd.DataFrame]:
    return np.array_split(data, 2)

nodes = [
    node(clean_data,
         inputs=['cars2017', 'boats2017'],
         outputs=dict(cars_df='clean_cars2017',
                      boats_df='clean_boats2017')),
    node(halve_dataframe,
         'clean_cars2017',
         ['train_cars2017', 'test_cars2017']),
    node(halve_dataframe,
         dict(data='clean_boats2017'),
         ['train_boats2017', 'test_boats2017'])
]

Here is an example of three nodes taken from the kedro docs.
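If it helps to see what a node boils down to, here is a minimal sketch, assuming a node is nothing more than a function plus the names of its inputs and outputs. ToyNode, drop_missing, and halve are hypothetical names, plain lists of dicts stand in for DataFrames, and a dict stands in for the catalog. kedro's real node also accepts dict inputs and outputs, which this sketch leaves out.

```python
from typing import Any, Callable, Dict, List


class ToyNode:
    """Hypothetical node: a function plus the names of its inputs and outputs."""

    def __init__(self, func: Callable, inputs: List[str], outputs: List[str]) -> None:
        self.func = func
        self.inputs = inputs
        self.outputs = outputs

    def run(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Look up each input by name and pass them positionally.
        result = self.func(*(data[name] for name in self.inputs))
        # A single output name means the function returns a single value.
        if len(self.outputs) == 1:
            result = [result]
        # Pair the returned values with the output names.
        return dict(zip(self.outputs, result))


def drop_missing(rows: List[dict]) -> List[dict]:
    """Keep only rows that have no None values."""
    return [row for row in rows if None not in row.values()]


def halve(rows: List[dict]) -> List[List[dict]]:
    """Split a list of rows into two halves."""
    middle = len(rows) // 2
    return [rows[:middle], rows[middle:]]


data = {"cars2017": [{"make": "ford"}, {"make": None}, {"make": "audi"}]}
clean = ToyNode(drop_missing, ["cars2017"], ["clean_cars2017"])
split = ToyNode(halve, ["clean_cars2017"], ["train_cars2017", "test_cars2017"])

# Each run reads its inputs by name and returns its outputs by name.
data.update(clean.run(data))
data.update(split.run(data))
```

Notice that drop_missing and halve know nothing about names or storage. That separation is what lets the same function be reused in several nodes.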

Pipeline

The kedro Pipeline is a DAG (Directed Acyclic Graph): a graph object that flows in one direction. You can slice into the pipeline using a few built-in graph methods: to_nodes, from_nodes, to_outputs, and from_inputs. You can chain these method calls, since each one returns a new Pipeline object. You can also ask a pipeline for its edges with inputs and outputs, and list every dataset along the way with all_inputs or all_outputs. Lastly, you can convert it back into a list of nodes with nodes.

from kedro.pipeline import Pipeline, node

# using our nodes from the Nodes section above
Pipeline(nodes)
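Here is a small sketch of how an execution order can fall out of nothing but input and output names. It is not how kedro implements Pipeline. It uses graphlib from the standard library (Python 3.9+), and node_specs and execution_order are hypothetical names. The nodes are deliberately listed out of order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Tuple

# Each hypothetical node is (name, input dataset names, output dataset names).
NodeSpec = Tuple[str, List[str], List[str]]

node_specs: List[NodeSpec] = [
    ("halve_cars", ["clean_cars2017"], ["train_cars2017", "test_cars2017"]),
    ("halve_boats", ["clean_boats2017"], ["train_boats2017", "test_boats2017"]),
    ("clean_data", ["cars2017", "boats2017"], ["clean_cars2017", "clean_boats2017"]),
]


def execution_order(specs: List[NodeSpec]) -> List[str]:
    """Order nodes so every node runs after the nodes that produce its inputs."""
    # Which node produces each dataset?
    producer: Dict[str, str] = {}
    for name, _, outputs in specs:
        for dataset in outputs:
            producer[dataset] = name

    # A node depends on the producers of its inputs; raw inputs have no producer.
    graph = {
        name: {producer[dataset] for dataset in inputs if dataset in producer}
        for name, inputs, _ in specs
    }
    # static_order raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(node_specs)

# Datasets that no node produces have to come from outside the pipeline.
produced = {d for _, _, outputs in node_specs for d in outputs}
free_inputs = {d for _, inputs, _ in node_specs for d in inputs} - produced
```

Because the order is derived from the data dependencies, you never have to write it down yourself, and the two halve nodes are free to run in either order.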

Runner

The runner is the bridge between kedro and the orchestrators. The kedro team provides some basic runners, built right into the framework, for running pipelines locally. Support for a different orchestrator is added by adding a new runner to your project.
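To sketch why the runner is a separate piece, here is a toy example where the same list of nodes is handed to two different runners. None of this is kedro's runner API: ToyRunner, InProcessRunner, and DryRunner are hypothetical, and the nodes are assumed to be in execution order already. A runner for an orchestrator would, under the same assumption, submit each node to that system instead of calling it directly.

```python
from typing import Any, Callable, Dict, List, Tuple

# A hypothetical node: (function, input name, output name), already in run order.
ToyNode = Tuple[Callable[[Any], Any], str, str]


class ToyRunner:
    """Hypothetical base runner: subclasses decide how each node is executed."""

    def run(self, nodes: List[ToyNode], data: Dict[str, Any]) -> Dict[str, Any]:
        data = dict(data)  # work on a copy of the starting data
        for node in nodes:
            self.run_node(node, data)
        return data

    def run_node(self, node: ToyNode, data: Dict[str, Any]) -> None:
        raise NotImplementedError


class InProcessRunner(ToyRunner):
    """Runs every node right here, one after another."""

    def run_node(self, node: ToyNode, data: Dict[str, Any]) -> None:
        func, input_name, output_name = node
        data[output_name] = func(data[input_name])


class DryRunner(ToyRunner):
    """Runs nothing; only records which step would have run."""

    def __init__(self) -> None:
        self.plan: List[str] = []

    def run_node(self, node: ToyNode, data: Dict[str, Any]) -> None:
        func, input_name, output_name = node
        self.plan.append(f"{func.__name__}: {input_name} -> {output_name}")


def double(x: int) -> int:
    return x * 2


def add_one(x: int) -> int:
    return x + 1


nodes = [(double, "raw", "doubled"), (add_one, "doubled", "final")]

result = InProcessRunner().run(nodes, {"raw": 5})
dry = DryRunner()
dry.run(nodes, {"raw": 5})
```

The nodes and functions stay exactly the same in both cases. Only the object deciding how to execute them changes.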

Hooks

Kedro allows you to hook into a number of lifecycle methods through the use of the pluggy framework. Yes, the one that pytest is built on. These lifecycle methods let us hook in around the points where kedro is running, such as before_pipeline_run or after_catalog_created.
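For a feel of what hooking in means, here is a very simplified sketch of the pattern. It does not use pluggy or kedro: ToyHookManager, LoggingHooks, and run_pipeline are hypothetical, and the hook signature here (a single pipeline_name argument) is an assumption made for the example, not the signature kedro passes.

```python
from typing import Any, List


class ToyHookManager:
    """Hypothetical, much-simplified stand-in for what a plugin manager does."""

    def __init__(self) -> None:
        self._plugins: List[Any] = []

    def register(self, plugin: Any) -> None:
        self._plugins.append(plugin)

    def call(self, hook_name: str, **kwargs: Any) -> List[Any]:
        # Call the hook on every plugin that implements it and collect results.
        results = []
        for plugin in self._plugins:
            hook = getattr(plugin, hook_name, None)
            if callable(hook):
                results.append(hook(**kwargs))
        return results


class LoggingHooks:
    """A plugin that only cares about the start of a pipeline run."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def before_pipeline_run(self, pipeline_name: str) -> None:
        self.events.append(f"starting {pipeline_name}")


def run_pipeline(manager: ToyHookManager, pipeline_name: str) -> None:
    # The framework fires lifecycle hooks around its own work.
    manager.call("before_pipeline_run", pipeline_name=pipeline_name)
    # ... the nodes would run here ...
    manager.call("after_pipeline_run", pipeline_name=pipeline_name)


hooks = LoggingHooks()
manager = ToyHookManager()
manager.register(hooks)
run_pipeline(manager, "default")
```

A plugin only implements the hooks it cares about. Here after_pipeline_run is fired too, but LoggingHooks does not define it, so that call is simply skipped.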
