Building a Daily Data Pipeline That Survives Market Holidays

#dataengineering #automation #python #tutorial

So, I had to build this data pipeline that grabs financial market data. Daily. Sounds simple, right?

Not so fast. Market holidays threw a wrench in the works. On those days the data source doesn't update, and I didn't want the pipeline to choke just because there was nothing new to fetch.

I needed the pipeline to do one of two things on a holiday: run smoothly on whatever data was available (even if it was the same as yesterday's), or skip the run entirely, with no errors and no alerts.

Here's the simple stupid strategy I landed on:

Get a holiday calendar.
Check the date at the start.
Run or exit based on that check.
Log the decision.

Let's look at some Python code.

First, the holiday calendar. Hardcoding? No thanks. I'm not into maintenance nightmares. Instead, I generate it with code:

from datetime import date
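# 'market_holidays' is a placeholder here: assume it returns the market
# holidays for the given year as datetime.date objects, so that the
# 'today in holiday_list' check further down compares like with like.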
holiday_list = market_holidays(date.today().year)
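
If you don't have a `market_holidays` function yet, here's a minimal sketch of what one could look like using only the standard library. This is my own illustration, not an official exchange calendar: it covers just four US holidays, and the weekend rule (Saturday moves to Friday, Sunday moves to Monday) is a simplification. The helper names `observed` and `nth_weekday` are made up for this example. Real exchanges publish their own closure lists, so check against those before trusting it.

```python
from datetime import date, timedelta


def observed(d: date) -> date:
    """Shift a fixed-date holiday off the weekend (simplified rule)."""
    if d.weekday() == 5:  # Saturday -> observed on Friday
        return d - timedelta(days=1)
    if d.weekday() == 6:  # Sunday -> observed on Monday
        return d + timedelta(days=1)
    return d


def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Return the n-th occurrence of a weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def market_holidays(year: int) -> set[date]:
    """Illustrative, incomplete holiday set for one year."""
    return {
        observed(date(year, 1, 1)),    # New Year's Day
        observed(date(year, 7, 4)),    # Independence Day
        nth_weekday(year, 11, 3, 4),   # Thanksgiving: 4th Thursday of November
        observed(date(year, 12, 25)),  # Christmas Day
    }
```

Returning a set instead of a list keeps the `in` check cheap, and it works with the rest of the code unchanged.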

Now, before the actual processing, check if today's a holiday:

import sys
from datetime import date

today = date.today()
if today in holiday_list:
    print("Today is a market holiday. Skipping data processing.")
    # Optionally, log this decision to a file or database.
    # Here's a basic logging example:
    with open("pipeline_log.txt", "a") as log_file:
        log_file.write(f"{today}: Skipped due to holiday\n")
    # Exit with status 0 so the scheduler doesn't treat the skip as a failure
    sys.exit(0)

If it's not a holiday, the script just keeps going.
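
One variation worth considering: pull the decision into a small function and pass the date in, so the check can be tested without waiting for an actual holiday. The sketch below is an assumption on my part, not what the original script did. `should_run` and the `"pipeline"` logger name are hypothetical, and it uses the standard `logging` module instead of writing to the file by hand.

```python
import logging
from datetime import date

logger = logging.getLogger("pipeline")


def should_run(today: date, holidays) -> bool:
    """Return False (and log why) when 'today' is a market holiday."""
    if today in holidays:
        logger.info("%s: skipped due to holiday", today)
        return False
    logger.info("%s: running", today)
    return True
```

The calling script would then do something like `if not should_run(date.today(), holiday_list): sys.exit(0)`.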

But what about the next business day? How do we ensure the pipeline runs then and processes the data correctly? Initially, I let the scheduler handle it. That works as long as the scheduler simply fires again on the next scheduled day.

But the data source only provides data for trading days, and sometimes I still needed a run on a holiday. That's when I added a "lookback" mechanism. If today's a holiday, use the previous business day's data instead.

This meant tweaking the date checking:

from datetime import date, timedelta

today = date.today()
data_date = today  # Default to today

if today in holiday_list:
    print("Today is a market holiday. Checking previous business day.")
    data_date = today - timedelta(days=1)
    # Keep stepping back until we land on a weekday that is not a holiday
    # (weekday() returns 5 for Saturday and 6 for Sunday)
    while data_date in holiday_list or data_date.weekday() >= 5:
        data_date -= timedelta(days=1)

    print(f"Using data from {data_date}")
    # Log this decision
    with open("pipeline_log.txt", "a") as log_file:
        log_file.write(f"{today}: Using data from {data_date} (holiday fallback)\n")

# Now, fetch data using the 'data_date' variable
# data = fetch_market_data(data_date)

This ensures the pipeline grabs data from the most recent trading day, even on holidays.
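
One edge case the snippet above doesn't cover: `holiday_list` only holds the current year, so a lookback that crosses into the previous December checks against the wrong calendar. Here's a rough sketch of a function version that loads each year's holidays on demand. `latest_trading_day` is a hypothetical name, and `holidays_for_year` stands for whatever callable produces your calendar (for example, `market_holidays`).

```python
from datetime import date, timedelta
from typing import Callable, Iterable


def latest_trading_day(
    today: date,
    holidays_for_year: Callable[[int], Iterable[date]],
) -> date:
    """Walk back from 'today' to the most recent weekday that is not a holiday."""
    cache: dict[int, set[date]] = {}

    def is_closed(d: date) -> bool:
        if d.weekday() >= 5:  # Saturday or Sunday
            return True
        if d.year not in cache:  # load each year's calendar only once
            cache[d.year] = set(holidays_for_year(d.year))
        return d in cache[d.year]

    d = today
    while is_closed(d):
        d -= timedelta(days=1)
    return d
```

On a normal trading day this returns `today` unchanged, so it can replace the `if` block entirely.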

A few more things to remember.

Error handling is vital. What if the data source is down for other reasons? Add retry logic and logging, so you can diagnose issues quickly.
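
For the retry part, a minimal sketch might look like the following. Everything here is an assumption for illustration: `fetch_with_retry` is a hypothetical wrapper, `fetch` is whatever function you use to pull the data (such as the `fetch_market_data` placeholder above), and the attempt count and delays are arbitrary starting points.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def fetch_with_retry(fetch, data_date, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(data_date), retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(data_date)
        except Exception as exc:
            logger.warning("fetch failed (attempt %d/%d): %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries: let the failure surface
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is only there so tests can swap in a stub instead of actually waiting. In real code you'd probably catch a narrower exception type than `Exception`.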

And monitoring? Gotta track performance and spot anomalies. If the pipeline keeps skipping runs, that's a problem.
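
As a starting point for spotting a pipeline that keeps skipping, here's a small sketch that counts how many of the most recent log entries are skips. It assumes the log format from the skip example earlier (`<date>: Skipped due to holiday`). The function name `trailing_skips` and any alert threshold you pick are my own invention.

```python
def trailing_skips(log_lines) -> int:
    """Count how many of the most recent log entries are holiday skips."""
    count = 0
    entries = [line for line in log_lines if line.strip()]  # ignore blank lines
    for line in reversed(entries):
        if "Skipped due to holiday" not in line:
            break
        count += 1
    return count
```

You could feed it the lines of `pipeline_log.txt` and raise an alert when the count goes above whatever is plausible for your market, since long runs of consecutive holidays are rare.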

Test everything. Simulate holidays to see if it behaves as expected.
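
Simulating a holiday gets much easier once the date is passed in rather than read from `date.today()` inside the logic. Here's a sketch using `unittest`. The function `pick_data_date` is a stand-in I wrote for the lookback logic, and the holiday dates are just test fixtures.

```python
import unittest
from datetime import date, timedelta


def pick_data_date(today, holidays):
    """Stand-in for the pipeline's lookback logic, with 'today' injected."""
    d = today
    while d in holidays or d.weekday() >= 5:
        d -= timedelta(days=1)
    return d


class HolidayLookbackTest(unittest.TestCase):
    HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

    def test_regular_day_uses_today(self):
        self.assertEqual(pick_data_date(date(2024, 12, 24), self.HOLIDAYS), date(2024, 12, 24))

    def test_holiday_falls_back_one_day(self):
        self.assertEqual(pick_data_date(date(2024, 12, 25), self.HOLIDAYS), date(2024, 12, 24))

    def test_monday_holiday_skips_the_weekend(self):
        # 2024-01-01 is a Monday, so the fallback should be Friday 2023-12-29
        self.assertEqual(pick_data_date(date(2024, 1, 1), self.HOLIDAYS), date(2023, 12, 29))
```

Save it in a test file and run it with `python -m unittest`.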

Building a robust data pipeline that can handle market holidays takes thought and planning. A holiday calendar, conditional execution, and a lookback mechanism? This is the way.

Now I can sleep soundly, knowing my pipeline won't bother me on Christmas.
