Pizofreude

Study Note dlt Fundamentals Course - Lesson 2: dlt Sources and Resources, Create First dlt Pipeline

Overview

In this lesson, you learned how to declare dlt resources, group them into a source, and run them through a dlt pipeline. You also learned about dlt transformers and how to use them to perform additional steps in the pipeline.

Key Concepts

  • dlt Resources: A resource is a logical grouping of data within a data source, typically holding data of similar structure and origin.
  • dlt Sources: A source is a logical grouping of resources, e.g., endpoints of a single API.
  • dlt Transformers: Special dlt resources that can be fed data from another resource to perform additional steps in the pipeline.

Creating a dlt Pipeline

A dlt pipeline loads data from resources, which can be grouped into a source. The first building block is a resource; here's an example of declaring one from a list of dictionaries:

import dlt

@dlt.resource
def my_dict_list():
    # A simple resource returning two Pokémon records
    return [
        {"id": 1, "name": "Pikachu"},
        {"id": 2, "name": "Charizard"}
    ]

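To actually load this data, you create a pipeline with dlt.pipeline and pass the resource to pipeline.run. A minimal sketch, where the pipeline name, the duckdb destination, and the dataset name are illustrative choices:

pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",  # illustrative name
    destination="duckdb",
    dataset_name="pokemon_data",
)

# Run the resource through the pipeline and print the load summary
load_info = pipeline.run(my_dict_list())
print(load_info)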

Using dlt Sources

A source is a logical grouping of resources. You can declare a source by decorating a function that returns or yields one or more resources with @dlt.source.

@dlt.source
def my_source():
    return [
        my_dict_list(),
        other_resource()  # another @dlt.resource-decorated function
    ]

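You can then pass the whole source to pipeline.run, and each resource it yields is loaded as its own table. A minimal sketch, reusing the pipeline object from above:

# Load all resources grouped by the source in one run
load_info = pipeline.run(my_source())
print(load_info)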

Using dlt Transformers

dlt transformers are special resources that can be fed data from another resource to perform additional steps in the pipeline.

import requests

@dlt.transformer
def get_pokemon_info(data):
    # Enrich each record with details fetched from the PokeAPI
    for pokemon in data:
        response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon['id']}")
        pokemon['info'] = response.json()
    return data

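To wire the transformer to its parent, pipe the resource into it with the | operator (you can also declare the dependency with the data_from argument of @dlt.transformer). A minimal sketch, reusing the pipeline object from above:

# Pipe the parent resource into the transformer and load the result
load_info = pipeline.run(my_dict_list() | get_pokemon_info)
print(load_info)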

Exercise 1: Create a Pipeline for GitHub API - Repos Endpoint

  • Explore the GitHub API and understand the endpoint that lists public repositories for an organization.
  • Build the pipeline using dlt.pipeline, dlt.resource, and dlt.source to extract and load the data into a destination (a sketch follows this list).
  • Use a duckdb connection, the sql_client, or pipeline.dataset() to check the number of columns in the github_repos table.
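
One possible solution, sketched minimally: the pipeline settings and the unauthenticated, single-page request are illustrative assumptions (the real endpoint paginates and benefits from an access token), and the column check assumes a DBAPI-style cursor.

import dlt
import requests

@dlt.resource(name="github_repos")
def github_repos(org="dlt-hub"):
    # List public repositories for an organization (first page only)
    response = requests.get(f"https://api.github.com/orgs/{org}/repos")
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",  # illustrative name
    destination="duckdb",
    dataset_name="github_data",
)
print(pipeline.run(github_repos()))

# Count columns in the loaded table via the sql_client
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM github_repos LIMIT 1") as cursor:
        print(len(cursor.description))  # assumes a DBAPI-style cursor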

Exercise 2: Create a Pipeline for GitHub API - Stargazers Endpoint

  • Create a dlt.transformer for the "stargazers" endpoint for the dlt-hub organization (a sketch follows this list).
  • Use the github_repos resource as the main resource feeding the transformer.
  • Use a duckdb connection, the sql_client, or pipeline.dataset() to check the number of columns in the github_stargazer table.
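
A minimal sketch of the transformer, assuming the github_repos resource and pipeline object from the Exercise 1 sketch; only the first page of stargazers per repository is fetched, and the table name here simply falls back to the function name:

@dlt.transformer(data_from=github_repos)
def github_stargazers(repos):
    # For each repository yielded by the parent resource, fetch its stargazers
    for repo in repos:
        response = requests.get(
            f"https://api.github.com/repos/dlt-hub/{repo['name']}/stargazers"
        )
        response.raise_for_status()
        yield response.json()

print(pipeline.run(github_stargazers))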

Reducing Nesting Level of Generated Tables

You can limit how deep dlt goes when generating nested tables and flattening dicts into columns. By default, the library will descend and generate nested tables for all nested lists, without limit.

@dlt.source(max_table_nesting=1)
def my_source():
    # Generate nested tables only one level deep; deeper data is loaded as JSON
    return [
        my_dict_list()
    ]


Typical Settings

  • max_table_nesting: the maximum depth at which dlt still generates nested tables; data nested deeper than this is loaded as JSON.

Next Steps

  • Proceed to the next lesson to learn more about dlt pipelines and how to use them to extract and load data into a destination.
