Introduction:
In this article, we'll debug a Python data pipeline that fetches and stars GitHub repositories related to data engineering. The pipeline uses the GitHub API to fetch repository information, process the data, and star the repositories.
Step 1: Setting Up Logging and Debugging Messages
To begin, let's set up the logging module in Python to get valuable insights into our data pipeline's execution. We'll create a data_pipeline.py file and include the necessary imports and basic configuration for logging.
# data_pipeline.py
import logging
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
# Your GitHub API credentials
GITHUB_API_TOKEN = 'YOUR_GITHUB_API_TOKEN'
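A token hardcoded in source is easy to commit by accident. One alternative is to read it from the environment at startup. The sketch below assumes an environment variable named GITHUB_API_TOKEN; both that variable name and the load_github_token helper are illustrative and not part of the original pipeline.

```python
import os


def load_github_token(env_var="GITHUB_API_TOKEN"):
    """Return the token stored in env_var, or raise if it is missing or blank."""
    # env_var is a hypothetical variable name chosen for this example
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline"
        )
    return token
```

Failing early with a clear message tends to be easier to debug than a 401 response several calls later.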
Step 2: Fetching Data from GitHub API
Next, we'll implement the function to fetch GitHub repositories related to data engineering. We'll use the popular requests library to make API calls.
import requests
def fetch_data_from_github():
    """Return repositories matching the search query, or [] if the request fails."""
    url = 'https://api.github.com/search/repositories'
    params = {'q': 'dataengineering', 'sort': 'stars', 'order': 'desc'}
    headers = {'Authorization': f'token {GITHUB_API_TOKEN}'}
    try:
        # A timeout keeps the pipeline from hanging on a stalled connection
        response = requests.get(url, params=params, headers=headers, timeout=10)
        response.raise_for_status()
        data = response.json()
        return data['items']
    except requests.exceptions.RequestException as e:
        logger.error(f"Failed to fetch data from GitHub API: {e}")
        return []
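When a fetch fails intermittently, rate limiting is a common cause, and GitHub reports the current quota in the X-RateLimit-Limit and X-RateLimit-Remaining response headers. A minimal sketch of logging them follows; the log_rate_limit helper and its warn_below threshold are assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def log_rate_limit(headers, warn_below=5):
    """Log GitHub's rate-limit headers and return the remaining request count.

    Returns None when the headers are absent or not numeric.
    """
    try:
        remaining = int(headers["X-RateLimit-Remaining"])
        limit = int(headers["X-RateLimit-Limit"])
    except (KeyError, ValueError):
        logger.debug("No rate-limit headers in response")
        return None
    # Escalate to WARNING when the quota is nearly exhausted
    level = logging.WARNING if remaining < warn_below else logging.INFO
    logger.log(level, "GitHub rate limit: %d of %d requests remaining", remaining, limit)
    return remaining
```

Inside fetch_data_from_github this could be called as log_rate_limit(response.headers) right after the request returns. The headers object in requests is case-insensitive, whereas a plain dict passed in a test is not.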
Step 3: Unit Testing for GitHub API
To ensure the GitHub API function behaves correctly, let's write some unit tests using pytest.
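# test_data_pipeline.py
import data_pipeline
def test_fetch_data_from_github(monkeypatch):
    # Replace the token and the HTTP call so the test never touches the network;
    # pytest's monkeypatch fixture undoes both changes when the test finishes
    monkeypatch.setattr(data_pipeline, 'GITHUB_API_TOKEN', 'TEST_TOKEN')
    monkeypatch.setattr(data_pipeline.requests, 'get', lambda *args, **kwargs: MockApiResponse())
    repositories = data_pipeline.fetch_data_from_github()
    assert len(repositories) == 2
class MockApiResponse:
    """Minimal stand-in for the response object returned by requests.get."""
    def __init__(self):
        self.status_code = 200
    def raise_for_status(self):
        # fetch_data_from_github calls this; a 200 response has nothing to raise
        pass
    def json(self):
        return {
            'items': [
                {'name': 'around-dataengineering', 'html_url': 'https://github.com/around-dataengineering'},
                {'name': 'dataengineering', 'html_url': 'https://github.com/dataengineering'}
            ]
        }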
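The failure path deserves a test too: when the request fails, the function should return an empty list and log an error. The sketch below shows the idea using only the standard library. fetch_items is a simplified stand-in for fetch_data_from_github that takes the HTTP call as a parameter, and the logger name and ConnectionError are assumptions made so the example runs without the network or the requests package.

```python
import logging
import unittest

logger = logging.getLogger("data_pipeline_sketch")  # hypothetical logger name


def fetch_items(http_get):
    """Stand-in for fetch_data_from_github: http_get is any callable returning a dict."""
    try:
        return http_get()["items"]
    except ConnectionError as e:
        logger.error("Failed to fetch data from GitHub API: %s", e)
        return []


def failing_get():
    """Fake HTTP call that always fails."""
    raise ConnectionError("simulated network failure")


class FetchFailureTest(unittest.TestCase):
    def test_returns_empty_list_and_logs_error(self):
        # assertLogs fails the test if no ERROR record is emitted inside the block
        with self.assertLogs("data_pipeline_sketch", level="ERROR") as captured:
            result = fetch_items(failing_get)
        self.assertEqual(result, [])
        self.assertIn("simulated network failure", captured.output[0])
```

pytest collects unittest.TestCase classes, so this can sit alongside the function-style test without extra configuration.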
Step 4: Star the GitHub Repositories
Now, we'll implement the function to star the fetched GitHub repositories. We'll use the PyGithub library, which simplifies working with the GitHub API. Note that get_repo expects the full owner/repo name, which each search result provides in its full_name field.
from github import Github
def star_repositories(repositories):
    try:
        github_client = Github(GITHUB_API_TOKEN)
        user = github_client.get_user()
        for repo in repositories:
            # get_repo needs the "owner/repo" form, not the bare repository name
            repo_obj = github_client.get_repo(repo['full_name'])
            user.add_to_starred(repo_obj)
            logger.info(f"Starred repository: {repo['full_name']}")
    except Exception as e:
        logger.error(f"Failed to star repositories: {e}")
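Because the try block wraps the whole loop, one failing repository stops every repository after it from being starred. A possible variation handles each repository on its own and reports which ones failed. In this sketch, star_each and its star_fn parameter are hypothetical names, and star_fn stands in for the PyGithub calls so the logic can run without the network.

```python
import logging

logger = logging.getLogger(__name__)


def star_each(repositories, star_fn):
    """Call star_fn(full_name) for every repository, continuing past failures.

    Returns a (starred, failed) pair of lists of repository full names.
    """
    starred, failed = [], []
    for repo in repositories:
        full_name = repo.get("full_name")
        if not full_name:
            logger.warning("Skipping entry without 'full_name': %r", repo)
            continue
        try:
            star_fn(full_name)
        except Exception:
            # logger.exception records the traceback, which helps when debugging
            logger.exception("Failed to star %s", full_name)
            failed.append(full_name)
        else:
            logger.info("Starred repository: %s", full_name)
            starred.append(full_name)
    return starred, failed
```

With PyGithub, star_fn could be something like lambda name: user.add_to_starred(github_client.get_repo(name)).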
Step 5: Debugging with Interactive Debugger (pdb)
Now that we have our main functions implemented, let's use the interactive debugger pdb to trace and inspect the pipeline's execution. We'll add a breakpoint in the star_repositories function and run the pipeline.
import pdb
def star_repositories(repositories):
    try:
        github_client = Github(GITHUB_API_TOKEN)
        user = github_client.get_user()
        for repo in repositories:
            # get_repo needs the "owner/repo" form, not the bare repository name
            repo_obj = github_client.get_repo(repo['full_name'])
            pdb.set_trace()  # Execution pauses here on every iteration
            user.add_to_starred(repo_obj)
            logger.info(f"Starred repository: {repo['full_name']}")
    except Exception as e:
        logger.error(f"Failed to star repositories: {e}")
Step 6: Running the Pipeline and Debugging
Finally, let's run the pipeline and debug it using the pdb interactive debugger. We'll execute the fetch_data_from_github and star_repositories functions in sequence.
if __name__ == '__main__':
    repositories = fetch_data_from_github()
    star_repositories(repositories)
When the pdb breakpoint is hit, you can inspect variable values, step through the code, and identify any issues. Use commands like next (n), step (s), and continue (c) to navigate through the code.
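Pausing on every loop iteration gets tedious when only one repository misbehaves. A minimal sketch of a conditional breakpoint follows, using the built-in breakpoint() function (Python 3.7+), which opens pdb by default. The inspect_repo helper and its target_name parameter are hypothetical names used for illustration.

```python
def inspect_repo(repo, target_name):
    """Pause in the debugger only for the repository named target_name."""
    if repo["name"] == target_name:
        # breakpoint() honours the PYTHONBREAKPOINT environment variable;
        # setting PYTHONBREAKPOINT=0 turns this call into a no-op
        breakpoint()
    return repo["name"]
```

Setting PYTHONBREAKPOINT=0 in the environment lets the same code run unattended, for example in CI, without removing the call.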
Conclusion:
Debugging Python data pipelines is essential to keeping them reliable. By combining logging, unit tests that mock the API response, and interactive debugging with pdb, you can identify and resolve issues effectively.