People cannot spell for s***. And datetimes are notoriously difficult to format consistently. Date inference is genuinely hard: there's no universal standard, and everyone has their own idea of what "01/02/2023" means (January 2nd or February 1st?). This is my personal battle.
There's a reason this meme is one of the most upvoted posts on r/dataengineering. We've all been there.
The AWS Glue Migration That Wasn't
Here's something I dealt with recently. We were migrating 50+ tables from MySQL to Postgres using AWS Glue. Should be straightforward—Glue's built for this. You set up the connection, map the tables, hit run, grab coffee.
Table #37 fails. Then #41. Then #44.
The error message is useless: ERROR: invalid input syntax for type timestamp. Cool. Which timestamp? Which row? Which of the 47 datetime columns in this table?
You're dealing with 10 million rows. And ONE row—ONE!!!—is causing the entire migration to fail.
You dig into the source data. The created_at column has entries like:
2023-01-15 14:30:00 -- Standard MySQL datetime
01/15/2023 2:30 PM -- Someone's Excel export
January 15, 2023 -- Marketing team entry
2023-1-15 -- Missing zero padding
15-Jan-23 -- European contractor
NULL -- Actually fine
"" -- Empty string (NOT fine)
Q1 2023 -- Why? Just... why?
MySQL doesn't care. It's got loose type coercion. Postgres? Postgres will not accept this nonsense.
PySpark Won't Save You
"Fine," you think. "I'll just use PySpark to clean this first."
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.getOrCreate()
df = spark.read.jdbc(url="jdbc:mysql://...", table="orders")

# Try to cast it
df = df.withColumn("created_at", F.col("created_at").cast(TimestampType()))
PySpark can't infer it. Which is bull****, by the way. It just returns NULL for anything it doesn't understand. No error. No warning. Just silent data loss.
So now you're writing custom parsing logic.
What I Actually Tried
Attempt 1: "Spark's built-in functions will handle this"
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType
df = df.withColumn("created_at", F.col("created_at").cast(TimestampType()))
Result: 30% of values become NULL. No error. No warning. Just gone.
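In hindsight, the bare minimum here is to keep the raw string around instead of casting in place, so the silent failures are at least countable. A minimal sketch of that idea (the created_at_raw and created_at_ts column names are mine):

# Cast into new columns instead of overwriting created_at in place,
# so the rows the cast gives up on stay visible
df = (df
    .withColumn("created_at_raw", F.col("created_at").cast("string"))
    .withColumn("created_at_ts", F.col("created_at_raw").cast(TimestampType())))

silently_lost = df.filter(
    F.col("created_at_raw").isNotNull() & F.col("created_at_ts").isNull())

print(silently_lost.count())  # rows the cast quietly turned into NULL
silently_lost.select("created_at_raw").distinct().show(20, truncate=False)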
"Fine, I'll try to_timestamp() with format strings:"
df = df.withColumn("created_at",
    F.to_timestamp(F.col("created_at"), "yyyy-MM-dd HH:mm:ss"))
Result: 70% become NULL because they're not in that exact format.
"I'll try multiple formats!"
df = df.withColumn("created_at",
    F.coalesce(
        F.to_timestamp(F.col("created_at"), "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp(F.col("created_at"), "MM/dd/yyyy"),
        F.to_timestamp(F.col("created_at"), "dd/MM/yyyy"),
        F.to_timestamp(F.col("created_at"), "yyyy-MM-dd")
    ))
Now I have 8% NULL. The other 92% parsed! But which format matched which row? No idea. And "January 15, 2023" is still NULL.
Added 12 more formats to the coalesce(). The query plan is now unreadable. Execution time: 45 minutes for 10 million rows.
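If you go down this road anyway, one thing that helps is building the coalesce in a loop and recording which format won for each row, so the leftovers can be triaged instead of guessed at. A sketch of the idea, not what I actually shipped:

# Try formats in order; keep both the parsed value and the format that matched
formats = ["yyyy-MM-dd HH:mm:ss", "MM/dd/yyyy", "dd/MM/yyyy", "yyyy-MM-dd"]

parsed = F.lit(None).cast(TimestampType())
matched = F.lit(None).cast("string")
for fmt in formats:
    attempt = F.to_timestamp(F.col("created_at"), fmt)
    matched = F.when(parsed.isNull() & attempt.isNotNull(), F.lit(fmt)).otherwise(matched)
    parsed = F.coalesce(parsed, attempt)

df = df.withColumn("created_at_parsed", parsed).withColumn("matched_format", matched)

# At least now the failures (and the MM/dd vs dd/MM ambiguity) are visible
df.groupBy("matched_format").count().show()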
Attempt 2: "I'll just write a UDF"
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import dateutil.parser

@udf(returnType=StringType())
def parse_datetime(value):
    if not value:
        return None
    try:
        return dateutil.parser.parse(value).strftime('%Y-%m-%d %H:%M:%S')
    except Exception:
        return None

df = df.withColumn("created_at", parse_datetime(F.col("created_at")))
Execution time: 6 hours.
Why? Because a plain Python UDF serializes every row out to a Python worker, parses it, and serializes it back. For 10 million rows. On distributed data. The Spark UI shows I'm not even really using the cluster; it's all bottlenecked on Python serialization.
"Fine, I'll use pandas UDFs, those are faster:"
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def parse_datetime_pandas(s: pd.Series) -> pd.Series:
    return pd.to_datetime(s, errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
Execution time: 2 hours. Better! But still 8% NULL. Still no idea which rows failed or why.
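The question worth answering at this point is not how many rows are NULL but what the stragglers actually look like. A quick pattern profile of the failing rows at least turns them into a finite list. A sketch, assuming the parsed output goes into a separate created_at_clean column so the raw string survives:

# Rows where the raw value exists but parsing still failed
failures = df.filter(F.col("created_at").isNotNull() & F.col("created_at_clean").isNull())

# Collapse digits/letters so "01/15/2023" and "03/02/2024" count as one pattern
failure_patterns = (failures
    .withColumn("pattern", F.regexp_replace(F.col("created_at"), r"\d", "9"))
    .withColumn("pattern", F.regexp_replace(F.col("pattern"), r"[A-Za-z]+", "a"))
    .groupBy("pattern")
    .count()
    .orderBy(F.desc("count")))

failure_patterns.show(50, truncate=False)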
Attempt 3: "I'll do multiple passes with different UDFs"
# Pass 1: Standard formats
df = df.withColumn("created_at_clean", parse_udf_standard(F.col("created_at")))

# Pass 2: Fix nulls with named months
df = df.withColumn("created_at_clean",
    F.when(F.col("created_at_clean").isNull(),
           parse_udf_named_months(F.col("created_at")))
     .otherwise(F.col("created_at_clean")))

# Pass 3: Fix nulls with European formats
df = df.withColumn("created_at_clean",
    F.when(F.col("created_at_clean").isNull(),
           parse_udf_european(F.col("created_at")))
     .otherwise(F.col("created_at_clean")))

# Pass 4: ...you get the idea
- Execution time: 4 hours (multiple full scans of 10 million rows)
- Code length: 600 lines across 8 different UDFs
- Success rate: 99.1%
- My mental state: Broken
Attempt 4: "Maybe I can use regex_extract?"
# Extract ISO format dates
df = df.withColumn("extracted",
    F.regexp_extract(F.col("created_at"), r'(\d{4}-\d{2}-\d{2})', 1))

# Extract MM/DD/YYYY
df = df.withColumn("extracted",
    F.when(F.col("extracted") == "",
           F.regexp_extract(F.col("created_at"), r'(\d{2})/(\d{2})/(\d{4})', 0))
     .otherwise(F.col("extracted")))
This extracted dates but didn't parse them into the right format. 01/15/2023 stayed as 01/15/2023. I need it as 2023-01-15.
Now I need to parse what I extracted. Back to F.to_timestamp(). Back to NULLs.
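The missing piece is reassembling the captured groups into something ISO-shaped before handing it back to to_timestamp / to_date. A sketch of that, assuming US-style MM/DD/YYYY (which is, of course, itself a guess):

# Rebuild MM/DD/YYYY into yyyy-MM-dd from the captured groups
mdY = r'^(\d{2})/(\d{2})/(\d{4})$'
df = df.withColumn("extracted",
    F.when(F.col("created_at").rlike(mdY),
           F.concat_ws("-",
               F.regexp_extract(F.col("created_at"), mdY, 3),   # year
               F.regexp_extract(F.col("created_at"), mdY, 1),   # month (assuming US order)
               F.regexp_extract(F.col("created_at"), mdY, 2)))  # day
     .otherwise(F.col("extracted")))

df = df.withColumn("created_at_parsed", F.to_date(F.col("extracted"), "yyyy-MM-dd"))

It handles exactly the shapes you wrote regexes for and nothing else, which is how the coalesce() mess started in the first place.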
Attempt 5: "I'll just dump to pandas and fix it there"
# Collect to pandas
pdf = df.toPandas()
# Fix in pandas
pdf['created_at'] = pd.to_datetime(pdf['created_at'], errors='coerce')
# Back to Spark
df = spark.createDataFrame(pdf)
Memory error. 10 million rows don't fit in memory on the driver node. Typical.
"Fine, I'll do it in partitions:"
def fix_partition(iterator):
    # mapInPandas hands each partition over as an iterator of pandas DataFrames
    for pdf in iterator:
        pdf['created_at'] = pd.to_datetime(pdf['created_at'], errors='coerce')
        yield pdf

df = df.mapInPandas(fix_partition, schema=df.schema)
This works, but it takes 3 hours and still leaves 8% NULL values.
Attempt 6: The 2AM Abomination
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd
from dateutil import parser
import re

@pandas_udf(StringType())
def parse_datetime_nuclear_option(s: pd.Series) -> pd.Series:
    def parse_single(value):
        if pd.isna(value) or value == "":
            return None

        # Try pandas first (fast)
        try:
            return pd.to_datetime(value).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            pass

        # Try dateutil (slow but flexible)
        try:
            return parser.parse(str(value), fuzzy=True).strftime('%Y-%m-%d %H:%M:%S')
        except Exception:
            pass

        # Try regex extraction for ISO
        match = re.search(r'\d{4}-\d{2}-\d{2}', str(value))
        if match:
            return match.group(0) + ' 00:00:00'

        # Try named months
        months = {'january': '01', 'jan': '01', 'february': '02', ...}
        # ... 40 more lines of string manipulation

        # Try quarter notation
        if 'Q' in str(value):
            # ... 20 more lines
            pass

        return None

    return s.apply(parse_single)
This is so frustrating. Just kill me.
df = df.withColumn("created_at", parse_datetime_nuclear_option(F.col("created_at")))
- Execution time: 8 hours
- Code length: 200 lines in a single UDF
- Success rate: 99.4%
- The remaining 0.6%: Truly cursed data like "FY Q3 2023", "sometime in january", and my personal favorite: "2023-13-45" (month 13, day 45—someone just mashing numbers)
- Cost: $47 in AWS Glue DPU hours
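(For the morbidly curious: the quarter-notation branch elided above might look something like this. An illustrative sketch, not the code I actually ran; it maps a quarter to the first day of its first month and gives up on anything weirder.)

import re

def parse_quarter(value):
    # Map "Q1 2023" / "2023 Q1" style strings to the first day of that quarter.
    # Illustrative only; the real data had variants like "FY Q3 2023" too.
    match = re.search(r'\bQ([1-4])\b.*?(\d{4})|(\d{4}).*?\bQ([1-4])\b', str(value))
    if not match:
        return None
    quarter = int(match.group(1) or match.group(4))
    year = int(match.group(2) or match.group(3))
    month = (quarter - 1) * 3 + 1
    return f"{year:04d}-{month:02d}-01 00:00:00"

parse_quarter("Q1 2023")   # '2023-01-01 00:00:00'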
At this point it's 4am. The migration is still failing. I have a 200-line UDF that takes 8 hours to run and still doesn't work for all rows. And I still have 12 other tables with datetime columns to fix.
Read the rest at Full Blog Post Here
