kedro 0.16.2 just dropped last week with a long-awaited feature... catalog search! I had gone as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets, and being able to quickly search for what I need is so useful.
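For context, here is a rough sketch of the kind of backport I had been hacking in before 0.16.2. This is not kedro's implementation, just a wrapper that filters the old catalog.list() output with Python's built-in re module:
import re
from kedro.io import DataCatalog

_original_list = DataCatalog.list

def _searchable_list(self, regex_search=None):
    """List dataset names, optionally filtered by a regex pattern."""
    names = _original_list(self)
    if regex_search is None:
        return names
    pattern = re.compile(regex_search)
    return [name for name in names if pattern.search(name)]

DataCatalog.list = _searchable_list  # the monkey patch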
The Catalog
The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you, and it is configurable and hackable. Having all your data connections listed in one place makes it so easy to pick your project up and move it to a completely new environment. That sweet declarative loading style saves so much read/write boilerplate. I can load any of my data with a single command whether it's in Amazon S3, Google Cloud Platform, or a local file.
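As a minimal sketch of what that looks like (the bucket name below is made up, and I'm assuming a kedro 0.16.x install with the optional s3 dependencies):
from kedro.io import DataCatalog

# two entries that differ only in where the file lives
catalog = DataCatalog.from_config({
    "local_sales": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/sales.csv",  # local file
    },
    "cloud_sales": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://my-made-up-bucket/sales.csv",  # hypothetical bucket
    },
})

df = catalog.load("local_sales")  # the call looks the same for either entry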
Kick start a toy project
Just like with most of these articles, I am going to create a conda environment so that I don't break any existing projects and scaffold up a toy project to learn from.
conda create -n kedro0162 python=3.8 -y
conda activate kedro0162
pip install kedro
kedro new # call it Kedro 0162 and click-through
cd kedro-0162
kedro install
Expect this set of commands to take a few minutes depending on your system, connection speed, and the number of packages already in your local cache.
Create some catalog entries
Now the power of catalog search really starts to shine when your project grows legs. You end up with groups of many datasets whose names share patterns, such as a layer or a source, among other things.
vim conf/base/catalog.yml
In the catalog, you will see a few lines of instructions followed by
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv
This gives us one stored catalog entry called example_iris_data; it is a CSV file stored at data/01_raw/iris.csv.
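Assuming you kept the default example data when running kedro new, you can already load that entry from a kedro ipython session:
>>> iris = catalog.load('example_iris_data')  # returns a pandas DataFrame
>>> iris.head()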
Let's make up a transportation company that is siloed into three different divisions, and it is our job to bring their sales and product metadata into a single report. This company makes lifted-trucks, primium-scoots, and luxy-yahts. We know that we will want raw, int, pri, and modin layers to start our project, so let's scaffold up that catalog real quick.
# ----- lifted-truck -----
raw_lifted_truck_sales:
  type: pandas.CSVDataSet
  filepath: data/01_raw/sales/lifted-truck.csv
int_lifted_truck_sales:
  type: pandas.CSVDataSet
  filepath: data/01_int/sales/lifted-truck.csv
pri_lifted_truck_sales:
  type: pandas.CSVDataSet
  filepath: data/01_pri/sales/lifted-truck.csv
raw_lifted_truck_info:
  type: pandas.CSVDataSet
  filepath: data/01_raw/info/lifted-truck.csv
int_lifted_truck_info:
  type: pandas.CSVDataSet
  filepath: data/01_int/info/lifted-truck.csv
pri_lifted_truck_info:
  type: pandas.CSVDataSet
  filepath: data/01_pri/info/lifted-truck.csv

# ----- primium-scoot -----
raw_primium_scoot_sales:
  type: pandas.CSVDataSet
  filepath: data/01_raw/sales/primium-scoot.csv
int_primium_scoot_sales:
  type: pandas.CSVDataSet
  filepath: data/01_int/sales/primium-scoot.csv
pri_primium_scoot_sales:
  type: pandas.CSVDataSet
  filepath: data/01_pri/sales/primium-scoot.csv
raw_primium_scoot_info:
  type: pandas.CSVDataSet
  filepath: data/01_raw/info/primium-scoot.csv
int_primium_scoot_info:
  type: pandas.CSVDataSet
  filepath: data/01_int/info/primium-scoot.csv
pri_primium_scoot_info:
  type: pandas.CSVDataSet
  filepath: data/01_pri/info/primium-scoot.csv

# ----- luxy-yaht -----
raw_luxy_yaht_sales:
  type: pandas.CSVDataSet
  filepath: data/01_raw/sales/luxy-yaht.csv
int_luxy_yaht_sales:
  type: pandas.CSVDataSet
  filepath: data/01_int/sales/luxy-yaht.csv
pri_luxy_yaht_sales:
  type: pandas.CSVDataSet
  filepath: data/01_pri/sales/luxy-yaht.csv
raw_luxy_yaht_info:
  type: pandas.CSVDataSet
  filepath: data/01_raw/info/luxy-yaht.csv
int_luxy_yaht_info:
  type: pandas.CSVDataSet
  filepath: data/01_int/info/luxy-yaht.csv
pri_luxy_yaht_info:
  type: pandas.CSVDataSet
  filepath: data/01_pri/info/luxy-yaht.csv

# ----- combined -----
pri_combined_sales:
  type: pandas.CSVDataSet
  filepath: data/01_pri/sales/combined.csv
pri_combined_info:
  type: pandas.CSVDataSet
  filepath: data/01_pri/info/combined.csv

# ----- modin -----
modin_main:
  type: pandas.CSVDataSet
  filepath: data/01_pri/info/combined.csv
Some examples of common regex uses
regex gets really complicated fast, but these basic patterns cover very common use cases and will get you a long way.
- term - all catalog entries that include term anywhere in the entry
- ^term - all catalog entries that include term at the beginning of the entry
- term$ - all catalog entries that include term at the end of the entry
- term1.*term2 - all catalog entries that include term1 and term2 with anything in between
- term1|term2 - all catalog entries that include term1 or term2
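If you want to see how these patterns behave outside of kedro, here is the same idea with Python's built-in re module (which, as far as I can tell, is essentially what regex_search does with the entry names):
import re

names = ['raw_luxy_yaht_sales', 'pri_combined_info']

re.search('pri', names[0])         # no match: 'pri' is not in 'raw_luxy_yaht_sales'
re.search('^pri', names[1])        # match: the entry starts with 'pri'
re.search('info$', names[1])       # match: the entry ends with 'info'
re.search('raw.*sales', names[0])  # match: 'raw', then anything, then 'sales'
re.search('truck|yaht', names[0])  # match: the entry contains 'yaht'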
Let's Search this thing
kedro has long included the catalog.list() method, which returns a list of all datasets. Now list takes a regex_search keyword argument. By default, it is empty and returns the entire catalog.
kedro ipython
List out all of the luxy-yahts
>>> catalog.list('luxy_yaht')
['raw_luxy_yaht_sales',
'int_luxy_yaht_sales',
'pri_luxy_yaht_sales',
'raw_luxy_yaht_info',
'int_luxy_yaht_info',
'pri_luxy_yaht_info']
List out data by layer
Easy, just search for the layer name.
raw
>>> catalog.list('raw')
['raw_lifted_truck_sales',
'raw_lifted_truck_info',
'raw_primium_scoot_sales',
'raw_primium_scoot_info',
'raw_luxy_yaht_sales',
'raw_luxy_yaht_info']
pri
>>> catalog.list('pri')
['pri_lifted_truck_sales',
'pri_lifted_truck_info',
'raw_primium_scoot_sales',
'int_primium_scoot_sales',
'pri_primium_scoot_sales',
'raw_primium_scoot_info',
'int_primium_scoot_info',
'pri_primium_scoot_info',
'pri_luxy_yaht_sales',
'pri_luxy_yaht_info',
'pri_combined_sales',
'pri_combined_info']
😲 We just included every primium-scoot dataset!
Here we just encountered our first need for real regex. I'll be the first to admit that I am really bad at regex; it's incredibly confusing and quickly becomes hard to read as it grows in complexity, but it is super powerful and used in a lot of places.
^term - beginning of catalog entry
The ^ regex operator searches for catalog entries that include the search term at the very beginning.
>>> catalog.list('^pri')
['pri_lifted_truck_sales',
'pri_lifted_truck_info',
'pri_primium_scoot_sales',
'pri_primium_scoot_info',
'pri_luxy_yaht_sales',
'pri_luxy_yaht_info',
'pri_combined_sales',
'pri_combined_info']
term$ - end of catalog entry
The $ operator is the opposite of the ^ operator. It gives me all the entries where the match occurs at the end of the catalog entry.
>>> catalog.list('info$')
['raw_lifted_truck_info',
'int_lifted_truck_info',
'pri_lifted_truck_info',
'raw_primium_scoot_info',
'int_primium_scoot_info',
'pri_primium_scoot_info',
'raw_luxy_yaht_info',
'int_luxy_yaht_info',
'pri_luxy_yaht_info',
'pri_combined_info']
term1.*term2 - anything in between
The .* operator in regex gives me all the datasets that include the two terms, no matter what is between them. There is also .? to allow at most one character between them. More often than not I really just want the two patterns to exist somewhere in the dataset entry.
>>> catalog.list('raw.*info$')
['raw_lifted_truck_info',
'raw_primium_scoot_info',
'raw_luxy_yaht_info']
Some real things that we can do with search
Let's look at a few examples beyond the obvious of just searching for the dataset that we want to load.
Check Raw Data
While migrating pipelines between environments it's important to know if your raw datasets are available. I will argue that you should also consider looking at pipeline.inputs(), as it cannot lie and gives you a true reading of what the pipeline actually needs. But another easy check is to look at all of the datasets that the Data Engineers have labeled raw.
>>> {dataset: catalog.exists(dataset) for dataset in catalog.list('^raw')}
{'raw_lifted_truck_sales': False,
'raw_lifted_truck_info': False,
'raw_primium_scoot_sales': False,
'raw_primium_scoot_info': False,
'raw_luxy_yaht_sales': False,
'raw_luxy_yaht_info': False}
Since we just created a dummy catalog the data does not exist in this example.
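Here is a sketch of the pipeline.inputs() version of that same check, assuming a kedro ipython session where context is available and every free input (other than parameters) is defined in the catalog:
>>> inputs = context.pipeline.inputs()
>>> {ds: catalog.exists(ds) for ds in inputs if not ds.startswith('param')}  # skip params:* entries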
Create a new catalog
Let's say that we have someone on the team who is from the land division of our company and they want a simplified catalog readily available that does not include any marine data.
To do this we will need to reach a bit into the kedro internals for the DataCatalog class and utilize a new regex operator, | (or).
>>> from kedro.io import DataCatalog
>>> land_catalog = DataCatalog(
...     {
...         dataset: getattr(catalog.datasets, dataset)
...         for dataset in catalog.list('truck|scoot')
...     }
... )
>>> land_catalog.list()
['raw_lifted_truck_sales',
'int_lifted_truck_sales',
'pri_lifted_truck_sales',
'raw_lifted_truck_info',
'int_lifted_truck_info',
'pri_lifted_truck_info',
'raw_primium_scoot_sales',
'int_primium_scoot_sales',
'pri_primium_scoot_sales',
'raw_primium_scoot_info',
'int_primium_scoot_info',
'pri_primium_scoot_info']
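If you find yourself doing this often, you could wrap it in a tiny helper. sub_catalog is just a name I made up here; it does not ship with kedro:
from kedro.io import DataCatalog

def sub_catalog(catalog, pattern):
    """Build a new DataCatalog containing only the entries whose names match pattern."""
    return DataCatalog({
        name: getattr(catalog.datasets, name)
        for name in catalog.list(pattern)
    })

land_catalog = sub_catalog(catalog, 'truck|scoot')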
regex recap
- ^term - beginning
- term$ - end
- term1.*term2 - anything in between
- term1|term2 - or
I have been writing short snippets about my mentality breaking into the tech/data industry in my newsletter. Check it out and let's get the conversation started.