Josue Luzardo Gebrim

Kedro: The Best Python Framework for Data Science!!!

Note: this post was first published on Medium.

This post is a summary, with examples, of the Kedro Python framework: an open-source framework created by QuantumBlack, widely used to write reproducible, maintainable, and modular Python code for building "batch" pipelines composed of several "steps".

The framework has been steadily gaining ground and community adoption, especially where a sequential execution chain with distinct steps is needed. This is particularly common in data science code, thanks to Kedro's ease of use alongside Python, its rich documentation, and its simple, intuitive design.

Installation:

Using pip:

pip install kedro

Using Anaconda:

conda install -c conda-forge kedro

Elements of Kedro:

  • Node:

"It is a wrapper for a Python function that names the inputs and outputs of that function"; in other words, it is a block of code that can direct the execution of a given sequence of code, or even of other blocks.

# importing the library
from kedro.pipeline import node

# Preparing the first "node"
def return_greeting():
    return "Hello"

# Defining the node that will return the greeting
return_greeting_node = node(func=return_greeting, inputs=None, outputs="my_salutation")
  • Pipeline:

"A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies, and does not necessarily run the nodes in the order in which they are passed in."

#importing the library
from kedro.pipeline import Pipeline

# Assigning "nodes" to "pipeline"
pipeline = Pipeline([return_greeting_node, join_statements_node])
  • DataCatalog:

"A DataCatalog is a Kedro concept. It is the registry of all data sources that the project can use. It maps the names of node inputs and outputs as keys in a DataSet, which is a Kedro class that can be specialized for different types of data storage. Kedro uses a MemoryDataSet for data that is simply stored in memory."

#importing the library
from kedro.io import DataCatalog, MemoryDataSet

# Preparing the "data catalog"
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})
  • Runner:

The Runner is an object that runs the pipeline. Kedro resolves the order in which the nodes are executed:

  1. Kedro first executes return_greeting_node. This runs return_greeting, which receives no input but produces the string "Hello".

  2. The output string is stored in the MemoryDataSet called my_salutation. Kedro then executes the second node, join_statements_node.

  3. This loads the my_salutation dataset and injects it into the join_statements function.

  4. The function joins the incoming greeting with "Kedro!" to form the output string "Hello Kedro!"

  5. The pipeline output is returned in a dictionary with the key my_message.

Example: β€œhello_kedro.py”

"""Contents of the hello_kedro.py file"""
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import node, Pipeline
from kedro.runner import SequentialRunner

# Prepare the "data catalog"
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})

# Prepare the first "node"
def return_greeting():
    return "Hello"

return_greeting_node = node(return_greeting, inputs=None, outputs="my_salutation")

# Prepare the second "node"
def join_statements(greeting):
    return f"{greeting} Kedro!"

join_statements_node = node(
    join_statements, inputs="my_salutation", outputs="my_message"
)

# Assign "nodes" to a "pipeline"
pipeline = Pipeline([return_greeting_node, join_statements_node])

# Create a "runner" to run the "pipeline"
runner = SequentialRunner()

# Execute a pipeline
print(runner.run(pipeline, data_catalog))

To execute the code above, just use the command below in the terminal:

python hello_kedro.py

The following message will appear on the console:

{'my_message': 'Hello Kedro!'}

Another way to visualize the execution of Kedro pipelines is to use the kedro-viz plugin:

To learn more visit: https://github.com/quantumblacklabs/kedro-viz

Happy Integrations:

In addition to being very easy to use with Python, Kedro offers integrations with other tools, solutions, and environments, including:

  • Kedro + Airflow:

Using Astronomer, it is possible to deploy and manage Kedro pipelines with Apache Airflow, as if they were DAGs:

https://medium.com/quantumblack/kedro-airflow-0-4-0-orchestrating-kedro-pipelines-with-airflow-23fb1283f24d

https://kedro.readthedocs.io/en/stable/10_deployment/11_airflow_astronomer.html

  • Kedro + Prefect:

You can also deploy and manage a Kedro pipeline in a Prefect Core environment:

https://kedro.readthedocs.io/en/stable/10_deployment/05_prefect.html

  • Kedro + Google Cloud DataProc:

Kedro can also be used within Google Cloud Dataproc:

https://medium.com/@zhongchen/kedro-in-jupyter-notebooks-on-google-gcp-dataproc-31d5f45ad235

  • Kedro + AWS BATCH:

To deploy in the AWS environment, you can use AWS Batch.

  • Kedro + Amazon SageMaker:

It is possible to integrate Kedro with Amazon SageMaker:

https://kedro.readthedocs.io/en/stable/10_deployment/09_aws_sagemaker.html

  • Kedro + PySpark:

With Kedro, we can simplify the deployment of pipelines that use Apache Spark via PySpark: centralizing Spark configuration such as memory usage, managing SparkSessions and contexts, and even controlling where populated DataFrames are stored:

https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html
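For illustration only, a Spark dataset can be declared in the project's conf/base/catalog.yml; the dataset name and filepath below are hypothetical, while the spark.SparkDataSet type is described in the docs linked above:

```yaml
# Hypothetical catalog.yml entry declaring a CSV file read through Spark
weather:
  type: spark.SparkDataSet
  filepath: data/01_raw/weather.csv
  file_format: csv
```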

  • Kedro + Databricks:

It is possible to easily deploy a Kedro pipeline and use it within a Databricks cluster:

https://kedro.readthedocs.io/en/stable/10_deployment/08_databricks.html

  • Kedro + Argo Workflows:

Automated deployment to container environments such as Red Hat OpenShift, Kubernetes, and many others is possible using Argo Workflows:

https://kedro.readthedocs.io/en/stable/10_deployment/04_argo.html

  • Kedro + Kubeflow:

To deploy in container environments, there is also the option of using the Kubeflow integration:

https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html#deployment-with-kubeflow-pipelines

This publication was a brief summary of Kedro, its components, and some interesting integrations. What did you think? Have you used it?

Buy Me A Coffee

Follow me on Medium :)
