TL;DR —
pip install python-dateutil-rs, change one import, get 5x–94x faster date parsing, recurrence rules, and timezone lookups. It's a line-by-line Rust port via PyO3 — no code changes required. GitHub | PyPI
python-dateutil Is Everywhere — and It's Pure Python
python-dateutil is one of the most depended-on packages in the Python ecosystem. With 300M+ monthly downloads on PyPI, it powers date parsing, relative deltas, recurrence rules, and timezone handling across countless applications.
But it's written in pure Python. For hot paths — parsing thousands of ISO timestamps in a data pipeline, expanding recurring calendar events, computing relative dates in a loop — that means leaving significant performance on the table.
What if you could make it up to 94x faster without changing a single line of your application code?
```shell
pip install python-dateutil-rs
```
That's python-dateutil-rs — a Rust-backed drop-in replacement I built using PyO3 and maturin. Same API, same behavior, dramatically faster.
And here's the thing: this is a naive, line-by-line port. The Rust code mirrors the original Python logic almost 1:1, with no Rust-specific optimizations yet. The speedups come purely from Rust's compiled nature — no dynamic dispatch, minimal GC pressure, no interpreter overhead.
The Approach: Module-by-Module Rewrite via PyO3
A full rewrite of a battle-tested library is risky. Instead, I took an incremental approach: rewrite each module in Rust independently, validate against the original test suite (~13,000 lines of tests, all passing), and ship value at every step.
The architecture uses maturin's mixed layout — a Rust extension module and Python wrapper live in the same package:
```
dateutil_rs (Python package)
├── _native (Rust extension via PyO3/maturin)  ← all modules
└── Python wrappers                            ← API compatibility layer
```
Every module is now implemented in Rust: parser, relativedelta, rrule, tz, easter, utils, and weekday constants. The Python layer handles API compatibility — for example, serializing custom parserinfo lookup tables and forwarding them to the Rust parser:
```python
from dateutil_rs._native import isoparse
from dateutil_rs._native import parse as _parse_rs

def parse(timestr, parserinfo=None, **kwargs):
    if parserinfo is not None:
        # Serialize custom lookup tables → forward to Rust
        config = parserinfo._to_rust_config()
        return _parse_rs(timestr, parserinfo_config=config, **kwargs)
    return _parse_rs(timestr, **kwargs)
```
The API is identical to python-dateutil — just change the import:
```python
# Before (python-dateutil)
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY
from dateutil.tz import gettz, tzutc

# After (python-dateutil-rs) — same code
from dateutil_rs.parser import parse
from dateutil_rs.relativedelta import relativedelta
from dateutil_rs.rrule import rrule, MONTHLY
from dateutil_rs.tz import gettz, tzutc
```
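For libraries that want to adopt the faster backend opportunistically, a guarded import also works. This shim is a sketch, not part of either package; the stdlib fallback handles only strict ISO 8601 strings:

```python
# Hypothetical compatibility shim: prefer the Rust-backed package when
# available, fall back to stock python-dateutil, then to the stdlib.
try:
    from dateutil_rs.parser import parse  # Rust-backed drop-in
except ImportError:
    try:
        from dateutil.parser import parse  # pure-Python original
    except ImportError:
        # Last-resort stdlib fallback: strict ISO 8601 only.
        from datetime import datetime
        parse = datetime.fromisoformat
```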
Benchmark Results
All benchmarks: Python 3.13, macOS Apple Silicon M3 (arm64), Rust 1.86 stable, release build (--release), measured with pytest-benchmark. Compared against python-dateutil 2.9.0.
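The project's suite uses pytest-benchmark; as a rough illustration of the methodology, here is a minimal stdlib `timeit` harness for this kind of micro-benchmark (numbers on your machine will differ from the tables below):

```python
import timeit
from datetime import datetime

def bench(fn, number=10_000):
    """Best-of-5 per-call time in microseconds for a zero-arg callable."""
    runs = timeit.repeat(fn, number=number, repeat=5)
    return min(runs) / number * 1e6

# Example: time stdlib ISO parsing as a stand-in workload.
us = bench(lambda: datetime.fromisoformat("2024-01-15T10:30:00.123456"))
print(f"stdlib fromisoformat: {us:.2f} µs/call")
```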
At a Glance
| Module | Speedup Range | Highlight |
|---|---|---|
| Timezone | 1.7x – 94.3x | Batch gettz() with cache: 94.3x |
| ISO Parser | 5.1x – 23.5x | With microseconds: 23.5x |
| RRule | 3.1x – 20.0x | rrulestr with dtstart: 20.0x |
| RelativeDelta | 4.9x – 18.7x | Multiply by scalar: 18.7x |
| General Parser | 1.3x – 3.5x | Fuzzy parsing: 3.5x |
| Easter | 3.2x – 6.2x | 1000 years batch: 6.2x |
Note on the 94.3x headline number: The timezone benchmark includes a Rust-side `RwLock<HashMap>` cache (similar to python-dateutil's `_TzFactory`). Without caching, pure timezone operations still see 1.7x–3.8x speedups. The most representative single-operation speedup is 23.5x on ISO parsing.
ISO Parsing — Up to 23.5x Faster
| Benchmark | python-dateutil | dateutil-rs | Speedup |
|---|---|---|---|
| Date only | 0.81 µs | 0.08 µs | 10.7x |
| Datetime | 2.06 µs | 0.10 µs | 20.9x |
| Datetime + TZ | 3.11 µs | 0.61 µs | 5.1x |
| With microseconds | 2.67 µs | 0.11 µs | 23.5x |
| 7 various formats | 23.10 µs | 3.10 µs | 7.4x |
RelativeDelta — Up to 18.7x Faster
| Benchmark | python-dateutil | dateutil-rs | Speedup |
|---|---|---|---|
| Create simple | 0.94 µs | 0.19 µs | 4.9x |
| Add months | 1.62 µs | 0.13 µs | 12.7x |
| Subtract | 3.03 µs | 0.18 µs | 16.5x |
| Multiply by scalar | 1.49 µs | 0.08 µs | 18.7x |
| Sequential add ×12 | 19.41 µs | 1.70 µs | 11.4x |
| Month-end overflow | 1.74 µs | 0.13 µs | 13.3x |
RRule (Recurrence Rules) — Up to 20x Faster
| Benchmark | python-dateutil | dateutil-rs | Speedup |
|---|---|---|---|
| Weekly ×52 | 105.33 µs | 34.44 µs | 3.1x |
| Monthly ×120 | 448.46 µs | 114.12 µs | 3.9x |
| Yearly ×100 | 2,144.58 µs | 235.02 µs | 9.1x |
| rrulestr simple | 3.14 µs | 0.53 µs | 6.0x |
| rrulestr complex | 6.60 µs | 0.90 µs | 7.4x |
| rrulestr with dtstart | 15.61 µs | 0.78 µs | 20.0x |
Timezone — Up to 94.3x Faster (Cache-Inclusive)
| Benchmark | python-dateutil | dateutil-rs | Speedup |
|---|---|---|---|
| gettz various (×10) | 714.93 µs | 7.59 µs | 94.3x |
| gettz offset | 20.20 µs | 5.26 µs | 3.8x |
| resolve_imaginary | 6.54 µs | 3.19 µs | 2.0x |
| datetime_exists | 3.15 µs | 1.62 µs | 1.9x |
| convert chain | 6.74 µs | 4.08 µs | 1.7x |
The 94.3x speedup on batch gettz() comes from a process-global RwLock<HashMap> cache — similar to python-dateutil's _TzFactory, but with Rust's near-zero-cost Arc clone for cached timezone objects. Even without the cache hit, individual timezone operations see 1.7x–3.8x improvements from compiled execution alone.
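In Python terms, the caching strategy looks roughly like this. It is a sketch of the idea only, not the actual implementation (which lives in Rust behind a `RwLock`), and `_load_tz` is a placeholder for the expensive TZif load:

```python
import threading

_tz_cache: dict[str, object] = {}
_tz_lock = threading.Lock()  # stands in for Rust's RwLock<HashMap>

def _load_tz(name: str) -> object:
    # Placeholder for the slow part: reading and parsing tzdata from disk.
    return ("tzdata", name)

def gettz_cached(name: str) -> object:
    # Fast path returns the shared cached object; the Rust version hands
    # back a cheap Arc clone instead of reconstructing the timezone.
    with _tz_lock:
        tz = _tz_cache.get(name)
        if tz is None:
            tz = _load_tz(name)
            _tz_cache[name] = tz
        return tz
```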
Easter — Up to 6.2x Faster
| Benchmark | python-dateutil | dateutil-rs | Speedup |
|---|---|---|---|
| Single call | 0.51 µs | 0.13 µs | 3.9x |
| 1000 years | 437.88 µs | 70.46 µs | 6.2x |
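For reference, Western Easter comes down to a handful of integer operations, which is why this module ports to Rust so directly. Here is the standard anonymous Gregorian computus (Meeus/Jones/Butcher), shown as a generic reference algorithm rather than a copy of dateutil's code:

```python
from datetime import date

def easter_western(year: int) -> date:
    """Anonymous Gregorian computus: pure integer arithmetic."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)
```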
Why Is It Fast? It's Just Rust Being Rust
Here's the surprising part: the Rust code is a faithful, almost line-by-line translation of the Python source. I didn't design clever algorithms or exploit Rust-specific data structures. The speedups come from the fundamental differences between compiled and interpreted execution.
Profiling revealed that Python's overhead isn't in any single place — it's everywhere. Every function call, every attribute access, every integer comparison, every object allocation pays an interpreter tax. In a tight loop, these micro-costs compound dramatically. That's why even a naive port sees 5x–94x improvements on batch operations.
Let's look at three modules to see exactly where the performance comes from:
isoparse: Byte-Level Parsing (23.5x)
The original isoparser in Python uses string slicing and int() conversion. The Rust version does the same logic — but on byte arrays instead of Python strings:
```rust
fn parse_isodate_common(&self, dt_str: &[u8]) -> Result<([i32; 3], usize), String> {
    let mut c = [1i32; 3]; // [year, month, day]

    // Year: parse 4 ASCII bytes directly
    c[0] = parse_int(&dt_str[..4])?;
    let mut pos = 4;

    let has_sep = dt_str[pos] == b'-';
    if has_sep { pos += 1; }

    // Month: 2 bytes
    c[1] = parse_int(&dt_str[pos..pos + 2])?;
    pos += 2;

    // Day: 2 bytes
    // ...
}
```
Same algorithm as the Python version. No string allocation, no regex, no dynamic dispatch — just the inherent overhead of interpretation removed. The parse_int helper converts ASCII bytes to integers with a multiply-and-add loop. For a microsecond field like 123456, this is dramatically faster than Python's int("123456").
Result: "2024-01-15T10:30:00.123456+09:00" goes from 2.67 µs to 0.11 µs.
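The multiply-and-add conversion described above is simple enough to sketch in Python (the real helper operates on Rust byte slices; this version exists only to show the loop):

```python
def parse_int(digits: bytes) -> int:
    """ASCII digit bytes -> int via multiply-and-add, no string allocation."""
    value = 0
    for byte in digits:
        if not 0x30 <= byte <= 0x39:  # outside b'0'..b'9'
            raise ValueError(f"non-digit byte: {byte!r}")
        value = value * 10 + (byte - 0x30)
    return value
```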
RelativeDelta: Fixed Struct vs Dynamic Attributes (18.7x)
Python's relativedelta stores fields as instance attributes with getattr/setattr patterns. The Rust version uses a fixed struct — same fields, same logic:
```rust
pub struct RelativeDelta {
    // Relative (plural) fields — offsets
    years: i32,
    months: i32,
    days: f64,
    hours: f64,
    minutes: f64,
    seconds: f64,
    microseconds: f64,

    // Absolute (singular) fields — "set to this value"
    year: Option<i32>,
    month: Option<i32>,
    day: Option<i32>,
    // ...
}
```
Multiplication is just multiplying each field — no dictionary lookups, no attribute resolution, no type coercion overhead. Same algorithm, 18.7x faster.
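A Python dataclass with the same fixed-field shape makes the contrast concrete. This is a toy with a few fields, not dateutil's full semantics (the real `relativedelta.__mul__` also normalizes fractional fields after multiplying):

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Delta:
    years: int = 0
    months: int = 0
    days: float = 0.0
    hours: float = 0.0

    def __mul__(self, factor: float) -> "Delta":
        # One multiply per field. The Rust struct does this with direct
        # field access; here we iterate the dataclass fields for brevity.
        return Delta(**{f.name: getattr(self, f.name) * factor
                        for f in fields(self)})
```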
RRule: Same Algorithm, Less Overhead (9.1x–20x)
Recurrence rules (RFC 5545) are the most complex module — they expand patterns like "every 3rd Tuesday of the month" into concrete dates. The Rust implementation follows the same expansion algorithm as python-dateutil:
- Stack-allocated date arithmetic via `chrono` instead of Python `datetime` objects
- No per-iteration object allocation or GC pressure
- Integer comparisons instead of Python object protocol (`__lt__`, `__eq__`)
For yearly rules over 100 occurrences, the overhead compounds: 2,144 µs → 235 µs. For rrulestr parsing, the gap is even wider — 20x faster because string-to-rule conversion involves heavy tokenization that benefits enormously from compiled execution.
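For a flavor of what the expansion does, here is a pure-Python sketch of the pattern mentioned above, "every 3rd Tuesday of the month". It is a toy illustration, not dateutil's general BYDAY machinery:

```python
from datetime import date, timedelta

def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """Date of the n-th given weekday (Mon=0 .. Sun=6) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7  # days to first occurrence
    return first + timedelta(days=offset + 7 * (n - 1))

# Expand "every 3rd Tuesday" across a year:
third_tuesdays = [nth_weekday(2024, m, 1, 3) for m in range(1, 13)]
```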
What's Next: v1 — Actually Optimized Rust
Everything above is v0 — a compatibility-first port that proves Rust's floor is Python's ceiling. The Rust code deliberately mirrors Python's logic to ensure behavioral fidelity against 13,000 lines of inherited tests.
v1 is where things get interesting. It will be a clean Rust-native implementation with real optimizations:
| Optimization | Technique | Target Impact |
|---|---|---|
| Zero-copy parsing | `&str` slices instead of `String` clones | Parser: 5x–50x |
| Compile-time hash maps | `phf` perfect hash for weekday/month lookup | Parser: faster tokenization |
| Buffer reuse | Pre-allocated buffers cleared across iterations | RRule: 5x–15x |
| Minimal allocations | `SmallVec`, stack buffers, arena patterns | All modules |
| Streamlined API | Drop rarely-used features (fuzzy parse, POSIX tz) | Smaller, faster core |
v1 will ship as:
- `dateutil` on crates.io — pure Rust crate, no PyO3 dependency
- `dateutil` on PyPI — thin PyO3 binding layer wrapping the Rust core
The v0 → v1 migration path:
- v0.x: faithful python-dateutil port (current — 1.3x–94x faster)
- v1.x: Rust-optimized core (target — 5x–100x faster)
If a naive port already delivers up to 94x speedups, imagine what happens when the Rust code is actually written like Rust.
What I Learned Building This
PyO3 Is Remarkably Ergonomic
PyO3's derive macros make exposing Rust types to Python straightforward:
```rust
#[pyclass(name = "relativedelta")]
pub struct RelativeDelta { /* ... */ }

#[pymethods]
impl RelativeDelta {
    #[new]
    fn new(/* 15+ optional kwargs */) -> Self { /* ... */ }

    fn __add__(&self, other: &Bound<'_, PyAny>) -> PyResult<PyObject> {
        // Handle datetime + relativedelta
    }
}
```
The trickiest part was datetime ↔ chrono conversion. PyO3's chrono feature handles the basics, but timezone-aware datetimes require careful handling of the Python tzinfo protocol.
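The tzinfo protocol the binding has to honor is small but strict: `utcoffset`, `dst`, and `tzname`. This minimal fixed-offset example uses only the standard `datetime` API, nothing project-specific:

```python
from datetime import datetime, timedelta, tzinfo

class FixedOffset(tzinfo):
    """Minimal tzinfo: the three methods any aware-datetime consumer calls."""
    def __init__(self, minutes: int):
        self._offset = timedelta(minutes=minutes)
    def utcoffset(self, dt): return self._offset
    def dst(self, dt): return timedelta(0)
    def tzname(self, dt): return f"UTC+{self._offset}"

dt = datetime(2024, 1, 15, 10, 30, tzinfo=FixedOffset(540))  # UTC+09:00
```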
A Faithful Port Is a Valid Strategy
The instinct when rewriting in Rust is to redesign everything. I resisted that. By keeping the logic 1:1 with Python, I could validate correctness against the original test suite — and still ship massive speedups. The optimization pass (v1) comes later, built on a foundation that is already proven correct.
Where the Overhead Actually Is
The takeaway is clear: CPython's per-operation overhead is small in isolation, but it compounds multiplicatively in loops. A date parser that touches 10 fields per call, called 10,000 times, pays the interpreter tax 100,000 times. Rust pays it zero times.
Try It
```shell
pip install python-dateutil-rs
```
- GitHub: wakita181009/dateutil-rs
- PyPI: python-dateutil-rs
- Python: 3.10–3.14 on Linux and macOS (Windows support planned)
The project is open source (MIT). If your application uses python-dateutil, switching is a one-line import change. v0 is stable, fully tested, and API-compatible. v1 — the Rust-optimized core with zero-copy parsing and phf hash maps — is coming.
The numbers speak for themselves. And once zero-copy parsing and arena allocation land in v1, they should get a lot bigger.