I recently put together a daily data pipeline to pull and process financial data. Simple, right? Grab data, clean it, load it.
Of course, the devil's in the details. And one of the biggest headaches? Market holidays.
The pipeline needed to run daily, grabbing data from the prior day's trading. Weekdays are easy. Weekends? Obvious. But what about holidays when the markets are closed? If the pipeline blindly tries to grab data for a non-trading day, it errors out. The whole thing grinds to a halt. Nobody wants that.
Here's how I tackled this, focusing on making the pipeline solid and avoiding those failures.
First, I needed a way to determine if a given date was a market holiday. Hardcoding a list? Brittle. Constant updating? No thanks. Instead, I decided to calculate market holidays dynamically.
There are several ways to do this, depending on the markets you care about. I wrote a function that checks if a given date is a business day. If it isn't, it flags it as a holiday.
The core of this function? Checking the day of the week (weekends are obviously holidays) and then iterating through a list of known holidays for the current year. This list is built based on the country and market.
Here's a simplified Python snippet:
import datetime


def is_market_holiday(date):
    """
    Checks if a given date is a market holiday.
    """
    # Check if it's a weekend
    if date.weekday() >= 5:  # 5 and 6 are Saturday and Sunday
        return True

    # Placeholder for a more sophisticated holiday calculation.
    # In a real implementation, you'd generate a list of holidays
    # for the given year based on the specific market.
    # This example just checks for a hardcoded Independence Day.
    if date.month == 7 and date.day == 4:
        return True

    return False


# Example usage:
today = datetime.date.today()
if is_market_holiday(today):
    print(f"{today} is a market holiday.")
else:
    print(f"{today} is a trading day.")
This is basic. Expand it to include your actual holiday calculation logic. The key is to wrap this logic in a function that's easily called from the pipeline.
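As a rough sketch of what that holiday calculation could look like, here is one way to generate a few rule-based holidays for a given year. The helper names (nth_weekday, last_weekday, observed, example_holidays) are my own for illustration, the weekend-shift rule is an assumption, and the list is deliberately incomplete. Check your exchange's published calendar before relying on anything like this.

```python
import datetime


def nth_weekday(year, month, weekday, n):
    """Return the n-th given weekday (Monday=0) of a month."""
    first = datetime.date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + datetime.timedelta(days=offset + 7 * (n - 1))


def last_weekday(year, month, weekday):
    """Return the last given weekday (Monday=0) of a month."""
    # Find the last day of the month, then step back to the weekday.
    if month == 12:
        next_month = datetime.date(year + 1, 1, 1)
    else:
        next_month = datetime.date(year, month + 1, 1)
    last = next_month - datetime.timedelta(days=1)
    offset = (last.weekday() - weekday) % 7
    return last - datetime.timedelta(days=offset)


def observed(day):
    """Shift a weekend holiday to the nearest weekday (assumed rule)."""
    if day.weekday() == 5:  # Saturday -> Friday
        return day - datetime.timedelta(days=1)
    if day.weekday() == 6:  # Sunday -> Monday
        return day + datetime.timedelta(days=1)
    return day


def example_holidays(year):
    """Illustrative subset of US holidays; NOT a full exchange calendar."""
    return {
        nth_weekday(year, 1, 0, 3),             # third Monday of January
        last_weekday(year, 5, 0),               # last Monday of May
        observed(datetime.date(year, 7, 4)),    # Independence Day
        nth_weekday(year, 9, 0, 1),             # first Monday of September
        nth_weekday(year, 11, 3, 4),            # fourth Thursday of November
        observed(datetime.date(year, 12, 25)),  # Christmas Day
    }
```

With a set like this, the hardcoded July 4 check becomes a membership test such as `date in example_holidays(date.year)`, and fixed-date holidays that land on a weekend get shifted instead of silently missed.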
Next, I dropped this holiday check into the pipeline's scheduling logic. Instead of running every day no matter what, the pipeline now checks if the target date is a trading day. If it's a holiday, the pipeline skips the data-grabbing step. It either waits until the next scheduled run or backfills the data for the previous trading day.
This "backfilling" strategy is important. A holiday itself has no data to lose, but if the pipeline skips a run, you don't want to miss the session that came before it. Instead, a later run grabs the data for the last actual trading day.
Here's how this might look in a pipeline orchestration tool:
daily_pipeline:
  schedule: "0 0 * * *"  # Run every day at midnight
  steps:
    - name: Check if today is a market holiday
      task: is_market_holiday
      date: "{{ execution_date }}"
      output: holiday_flag
    - name: Fetch data
      task: fetch_market_data
      date: "{{ execution_date - 1 }}"  # Fetch data for *yesterday*
      when: "{{ holiday_flag == False }}"  # Only run if not a holiday
    - name: Process data
      task: process_data
      input_data: "{{ fetch_market_data.output_data }}"
    - name: Load data
      task: load_data
      input_data: "{{ process_data.output_data }}"
execution_date is the date the pipeline is running for. The is_market_holiday task decides if that date is a holiday. If it's not, the fetch_market_data task grabs data for the previous day (which should be a trading day).
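This isn't perfect. It assumes the day before the run date is always a trading day. Not always true: on a Monday, or the day after a holiday, the previous calendar day has no data. And when a run is skipped, the prior session's data never gets fetched. You might need to tweak the logic to step back to the most recent trading day, including across multiple consecutive holidays.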
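One way to handle consecutive non-trading days is to walk backwards from the run date until you hit a trading day, rather than subtracting a fixed one day. The sketch below assumes you already have a set of holiday dates. The names is_trading_day, previous_trading_day, and max_lookback are hypothetical, and the holiday set is made up for the example.

```python
import datetime


def is_trading_day(day, holidays):
    """True if day is a weekday and not in the holiday set."""
    return day.weekday() < 5 and day not in holidays


def previous_trading_day(day, holidays, max_lookback=10):
    """Walk back from day (exclusive) to the most recent trading day."""
    candidate = day - datetime.timedelta(days=1)
    for _ in range(max_lookback):
        if is_trading_day(candidate, holidays):
            return candidate
        candidate -= datetime.timedelta(days=1)
    # Guard against a bad holiday set causing an endless walk.
    raise ValueError(f"no trading day within {max_lookback} days before {day}")


# Hypothetical holiday set: a Thursday and Friday closed back to back.
holidays = {datetime.date(2024, 11, 28), datetime.date(2024, 11, 29)}

# A Monday run skips the weekend and both holidays.
print(previous_trading_day(datetime.date(2024, 12, 2), holidays))  # 2024-11-27
```

The fetch step would then use this date instead of the run date minus one, so a Monday run or a run after a long weekend still targets a day that has data.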
Finally, I set up monitoring and alerting. The pipeline sends notifications if it hits an unexpected error or if the data quality checks fail. This lets me quickly spot and fix issues, including those related to holiday handling.
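To give a feel for the alerting piece, here is a minimal sketch of a runner that stops on the first failing step and sends a notification. The notify callable is a stand-in for whatever channel you use (email, chat webhook, pager), and run_with_alerts and check_row_count are hypothetical names, not part of any particular orchestration tool.

```python
import logging

logger = logging.getLogger("daily_pipeline")


def run_with_alerts(steps, notify):
    """Run (name, callable) steps in order; notify and stop on first failure."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Log the full traceback, then send a short alert message.
            logger.exception("step %r failed", name)
            notify(f"daily_pipeline: step '{name}' failed: {exc}")
            return False
    return True


def check_row_count(rows, minimum=1):
    """Example data quality check: fail if too few rows were loaded."""
    if len(rows) < minimum:
        raise ValueError(f"expected at least {minimum} rows, got {len(rows)}")
```

A data quality check is just another step here: if check_row_count raises because a fetch returned nothing, the same alert path fires, which is often how a missed holiday shows up first.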
Building a solid data pipeline? It means thinking about edge cases like market holidays. By handling these proactively, you make sure your pipeline runs reliably and keeps delivering good data, even when the markets are closed.