The Problem Nobody Talks About
You're building a production Python application. Everything's humming: API handlers are fast, database queries are optimized, caching is in place.
Then you flip on logging.
Suddenly, throughput tanks. CPU spikes. Latency goes haywire.
Why? Your app logs from 4 worker threads. Each thread competes for the same lock in logging.Handler. While one thread serializes JSON, the others sit blocked waiting for the lock. You're no longer bottlenecked on I/O - you're bottlenecked on a single lock.
This is why I built rapidlog.
The Benchmark That Started It All
Here's what I found comparing Python's stdlib logging against other libraries (all with identical JSON output, 4 threads, 100K logs per thread):
| Library | Throughput | vs stdlib |
|---|---|---|
| rapidlog | 20,133 logs/sec | 3.1x faster |
| structlog | 12,101 logs/sec | 1.86x faster |
| stdlib-json | 6,487 logs/sec | baseline |
| loguru | 3,248 logs/sec | 0.50x (slower!) |
That 3.1x difference? In production, that's 13.6K extra events per second you can handle without scaling up servers.
For a company logging 100M events/day across 4 worker threads per server, with traffic bursts far above the daily average, that can be the difference between 10 servers and 3 servers.
Why Is stdlib So Slow Under Load?
Let's trace through what happens when you call logger.info():
With stdlib logging:

```
Thread 1: logger.info() → acquire lock
Thread 2: logger.info() → WAIT (lock held by Thread 1)
Thread 3: logger.info() → WAIT
Thread 4: logger.info() → WAIT
Thread 1: serialize JSON → format record → write to stdout → release lock
Threads 2–4: race to acquire lock (one succeeds, others wait again)
```
Every single call hits the lock. JSON serialization happens inside the lock. You've made your hot path serialization-bound AND lock-bound.
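You can see this with nothing but the standard library. A minimal sketch (thread and log counts here are illustrative, and `JsonFormatter` is a stand-in, not my benchmark code) -- the key point is that `format()` runs inside `Handler.emit`, i.e. while the handler's lock is held:

```python
import json
import logging
import os
import threading
import time

class JsonFormatter(logging.Formatter):
    # Serialization happens here, inside Handler.emit, while the
    # handler's lock is held -- exactly the hot-path cost at issue.
    def format(self, record):
        return json.dumps({"level": record.levelname, "msg": record.getMessage()})

handler = logging.FileHandler(os.devnull)  # real write() calls, discarded by the OS
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("contention-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

N_THREADS, N_LOGS = 4, 10_000

def worker():
    for i in range(N_LOGS):
        logger.info("user_create %d", i)

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"{N_THREADS * N_LOGS / elapsed:,.0f} logs/sec")
```

Run it with 1 thread and then with 4: on most machines the 4-thread number is nowhere near 4x the 1-thread number, because the threads spend their time queued on the handler's lock.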
With rapidlog:

```
Thread 1: logger.info() → append to thread-local buffer (no lock)
Thread 2: logger.info() → append to thread-local buffer (no lock)
Thread 3: logger.info() → append to thread-local buffer (no lock)
Thread 4: logger.info() → append to thread-local buffer (no lock)
Writer thread: drain all buffers → serialize JSON → write to stdout
```
The hot path is buffer append only. No locks. No serialization. Then a background thread handles the expensive stuff (JSON, I/O) in batches.
Architecture: Per-Thread Buffers + Async Writer
Here's the design in detail:
Layer 1: Hot Path (Per-Thread)
```python
def _log(self, level: str, msg: str, **kwargs):
    if level not in _LEVELS or _LEVELS[level] < self.level_value:
        return  # Quick exit
    # Append to thread-local buffer (single append, no lock)
    self._thread_local.buffer.append([
        time.time_ns(),
        level,
        msg,
        kwargs,
        threading.current_thread().ident,
    ])
```
No dict creation. No serialization. Just a fast append to a pre-allocated list.
Layer 2: Cross-Thread Handoff (RingQueue)
When the thread-local buffer fills, it flushes to a bounded RingQueue:
```python
class RingQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity  # put() below checks against this
        self.buffer = [None] * capacity
        self.write_pos = 0
        self.read_pos = 0
        self.lock = threading.Lock()

    def put(self, item):
        with self.lock:
            if self.write_pos - self.read_pos >= self.capacity:
                # Queue full: caller waits or drops
                return False
            self.buffer[self.write_pos % self.capacity] = item
            self.write_pos += 1
            return True
```
This is a multi-producer/single-consumer design. Multiple threads append records, one writer thread drains.
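The writer loop below drains this queue with a `get_many` call that isn't shown above. A plausible sketch of it, extending the same class (the condition-variable timeout handling here is my assumption, not necessarily rapidlog's implementation):

```python
import threading

class RingQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = [None] * capacity
        self.write_pos = 0
        self.read_pos = 0
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.lock:
            if self.write_pos - self.read_pos >= self.capacity:
                return False  # full: caller waits or drops
            self.buffer[self.write_pos % self.capacity] = item
            self.write_pos += 1
            self.not_empty.notify()  # wake the writer thread if it's sleeping
            return True

    def get_many(self, batch_size, timeout):
        # Single consumer: drain up to batch_size items, waiting at most
        # `timeout` seconds if the queue is currently empty.
        with self.lock:
            if self.read_pos == self.write_pos:
                self.not_empty.wait(timeout)
            batch = []
            while self.read_pos < self.write_pos and len(batch) < batch_size:
                batch.append(self.buffer[self.read_pos % self.capacity])
                self.read_pos += 1
            return batch
```

Since only one thread ever advances `read_pos`, the consumer side needs no coordination beyond the shared lock it already takes to inspect `write_pos`.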
Layer 3: Writer Thread (Background)
```python
def _writer_loop(self):
    while self.running:
        batch = self.queue.get_many(batch_size=256, timeout=0.01)
        if not batch:
            continue
        # Serialize all JSON in the writer thread
        output = []
        for record in batch:
            json_str = json.dumps({
                "ts_ns": record[0],
                "level": record[1],
                "msg": record[2],
                **record[3],
                "thread": record[4],
            })
            output.append(json_str)
        # Single I/O operation for the whole batch
        self.stdout.buffer.write(("\n".join(output) + "\n").encode())
```
All JSON serialization happens here, outside the hot path. Batching reduces I/O syscalls.
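The batching claim is easy to sanity-check in isolation: joining a batch and issuing one `write()` produces byte-identical output to writing line by line, while cutting the number of I/O calls from N to 1. A small demonstration, with `io.BytesIO` standing in for stdout:

```python
import io

records = ['{"i": %d}' % i for i in range(256)]

# One write() call per record: 256 potential syscalls
per_line = io.BytesIO()
for r in records:
    per_line.write(r.encode() + b"\n")

# One write() call for the whole batch: a single syscall
batched = io.BytesIO()
batched.write(("\n".join(records) + "\n").encode())

assert per_line.getvalue() == batched.getvalue()
```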
The Trade-Off: Memory vs Throughput
This design uses memory to buy speed:
Low-memory preset:
- Buffer size: 2,048 records per thread
- Peak memory: ~2–4 MiB
- Use when: Lambda, containers with tight memory
Balanced preset (default):
- Buffer size: 32,768 records per thread
- Peak memory: ~5–10 MiB
- Use when: General-purpose apps
Throughput preset:
- Buffer size: 131,072 records per thread
- Peak memory: ~10–20 MiB
- Use when: High-volume logging (100K+ logs/sec)
This is an intentional trade-off. Standard library makes the opposite choice (minimal memory, multiple locks). rapidlog assumes "memory is cheaper than CPU under load" and bets accordingly.
Code Example: Real-World Usage
Before (stdlib + structlog/python-json-logger)
```python
import logging
from pythonjsonlogger import jsonlogger

# Setup is tedious
handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(fmt='%(timestamp)s %(level)s %(name)s %(message)s')
handler.setFormatter(formatter)
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# API is awkward
@app.post("/users")
def create_user(user_id: int, email: str):
    logger.info("user_create", extra={"fields": {"user_id": user_id, "email": email}})
    # ...
    return {"status": "ok"}
```
After (rapidlog)
```python
from rapidlog import get_logger

# One line
logger = get_logger(level="INFO")

# Clean API
@app.post("/users")
def create_user(user_id: int, email: str):
    logger.info("user_create", user_id=user_id, email=email)
    # ...
    return {"status": "ok"}
```
Both output the same JSON:
```json
{"ts_ns": 1739462130123456789, "level": "INFO", "msg": "user_create", "user_id": 123, "email": "bob@example.com", "thread": 12345}
```
When to Use rapidlog (And When NOT To)
Use rapidlog when:
✅ You're logging 10K+ events/sec
✅ You have 4+ worker threads logging concurrently
✅ You need structured JSON by default
✅ You want zero external dependencies
✅ Latency matters (e.g., fintech, gaming, real-time APIs)
Don't use rapidlog when:
❌ You're logging <1K events/sec (stdlib is fine, no contention)
❌ You need file rotation (coming in v2, but not there yet)
❌ You need colors/pretty output for development (use stdlib)
❌ Memory is extremely constrained and you can't spare 2+ MiB
❌ You're on Python < 3.10
Benchmarking Methodology (Why I Trust These Numbers)
I benchmarked against 6+ libraries under fair conditions:
- Same output format across all (structured JSON, ~100 bytes/log)
  - Except fastlogging, which doesn't support structured JSON
- Real I/O (logs written to a file, not discarded)
  - In-memory benchmarks are meaningless for logging
- Configurable thread counts (1, 4, 8 workers)
- Documented trade-offs (memory, dependencies, features)

All the benchmark code is in the GitHub repo if you want to reproduce the numbers.
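Shaped roughly, the harness drives every library through the same callable under identical thread counts and volumes. A sketch, not the repo's actual code (`bench` and `log_fn` are hypothetical names):

```python
import threading
import time

def bench(log_fn, n_threads=4, n_logs=100_000):
    # Same thread count and per-thread volume for every library under test;
    # only the logging call itself differs between runs.
    def worker():
        for i in range(n_logs):
            log_fn("user_create", user_id=i)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads * n_logs / (time.perf_counter() - start)
```

Each library then gets a thin adapter, e.g. `bench(lambda msg, **kw: logger.info(msg, **kw))`, so the measured work is identical apart from the call under test.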
v2 Roadmap (What's Coming)
v1 is intentionally minimal. v2 will add:
- Multiple sinks (file, network, cloud)
- Sampling (drop 1-in-N records for very high volume)
- Custom encoders (MessagePack, Protobuf)
- OpenTelemetry integration (correlation IDs, trace context)
- Datadog/Honeycomb SDK examples
Why Open-Source?
Why share this instead of keeping it proprietary?
Selfish reasons:
- I care deeply about performance engineering, and open-sourcing helps pressure-test ideas in the real world.
- Feedback from real users (you!) makes the library better, faster.
- It's a long-term engineering project where I can share trade-offs, benchmarking methodology, and design decisions transparently.
Altruistic reasons:
- This is a universal problem. Lock contention under load affects every multi-threaded Python app.
- stdlib logging is good enough for 95% of cases. The other 5% shouldn't have to ship custom solutions.
Questions I Expect (And Answers)
Q: Why not just use async/await for logging?
A: Async adds overhead (context switching, event loop overhead). Pre-allocated buffers are simpler and faster for this narrow use case.
Q: What about the GIL?
A: GIL + stdlib's lock is a double hit. rapidlog works within GIL constraints by deferring expensive work (serialization) to a writer thread. This is a well-known producer/consumer pattern (the same handoff that async logging appenders use in other ecosystems).
Q: How does this compare to Loguru?
A: Loguru is more feature-rich and easier to use. rapidlog is faster for high-volume multi-threaded scenarios. Pick based on your constraints.
Q: Can I use this with Django/Flask?
A: Yes! Examples coming in the repo. Anywhere you'd use stdlib logging, rapidlog works.
Q: Is this production-ready?
A: v1.0 is stable (37 comprehensive tests). It's intentionally minimal, but what's there is solid.
Try It Out
```shell
pip install rapidlog
```

```python
from rapidlog import get_logger

logger = get_logger()
logger.info("Hello, high-performance logging!", request_id="abc123", latency_ms=42)
logger.close()
```
Full docs: https://github.com/sid19991/rapidlog
What I Learned Building This
- Lock contention is invisible until you benchmark. Profilers don't always show it clearly.
- Pre-allocation trades memory for speed. This is unfashionable in today's auto-scaling world, but it works.
- The GIL is a constraint, not a blocker. You can build fast Python within its limits.
- Benchmarking is hard. Fair comparison requires controlling for output format, I/O, thread count, and more.
- Simple designs win. Per-thread buffers plus an async writer is a decades-old pattern (real databases use it). Innovation isn't always new.
Let's Talk
What do you think? Are you logging 10K+ events/sec and hitting lock contention? Would you use this?
Hit me up:
- GitHub issues: https://github.com/sid19991/rapidlog/issues
- Twitter/X: siddharthpogul
Links:
- GitHub: https://github.com/sid19991/rapidlog
- PyPI: https://pypi.org/project/rapidlog/
- Benchmarks: In README + GitHub repo