The Problem Nobody Talks About
You're building a production Python application. Everything's humming: API handlers are fast, database queries are optimized, caching is in place.
Then you flip on logging.
Suddenly, throughput tanks. CPU spikes. Latency goes haywire.
Why? Your app logs from 4 worker threads. Each thread competes for the same lock in logging.Handler. While one thread serializes JSON, the others sit blocked waiting for the lock. You're no longer bottlenecked on I/O - you're bottlenecked on a single lock.
This is why I built rapidlog.
The Benchmark That Started It All
Here's what I found comparing Python's stdlib logging against other libraries (all with identical JSON output, 4 threads, 100K logs per thread):
| Library | Throughput | vs stdlib |
|---|---|---|
| rapidlog | 20,133 logs/sec | 3.1x faster |
| structlog | 12,101 logs/sec | 1.86x faster |
| stdlib-json | 6,487 logs/sec | baseline |
| loguru | 3,248 logs/sec | 0.50x (slower!) |
That 3.1x difference? In production, that's 13.6K extra events per second you can handle without scaling up servers.
For a company logging 100M events/day across 4 worker threads per server, with traffic bursts far above the daily average, that can be the difference between 10 servers and 3 servers.
Why Is stdlib So Slow Under Load?
Let's trace through what happens when you call logger.info():
With stdlib logging:

```
Thread 1: logger.info() → acquire lock
Thread 2: logger.info() → WAIT (lock held by Thread 1)
Thread 3: logger.info() → WAIT
Thread 4: logger.info() → WAIT
Thread 1: serialize JSON → format record → write to stdout → release lock
Threads 2–4: race to acquire lock (one succeeds, others wait again)
```
Every single call hits the lock. JSON serialization happens inside the lock. You've made your hot path serialization-bound AND lock-bound.
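You can see this with nothing but the standard library. A minimal sketch (thread and log counts here are illustrative, and `JsonFormatter` is a stand-in, not my benchmark code) -- the key point is that `format()` runs inside `Handler.emit`, i.e. while the handler's lock is held:

```python
import json
import logging
import os
import threading
import time

class JsonFormatter(logging.Formatter):
    # Serialization happens here, inside Handler.emit, while the
    # handler's lock is held -- exactly the hot-path cost at issue.
    def format(self, record):
        return json.dumps({"level": record.levelname, "msg": record.getMessage()})

handler = logging.FileHandler(os.devnull)  # real write() calls, discarded by the OS
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("contention-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

N_THREADS, N_LOGS = 4, 10_000

def worker():
    for i in range(N_LOGS):
        logger.info("user_create %d", i)

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"{N_THREADS * N_LOGS / elapsed:,.0f} logs/sec")
```

Run it with 1 thread and then with 4: on most machines the 4-thread number is nowhere near 4x the 1-thread number, because the threads spend their time queued on the handler's lock.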
With rapidlog:

```
Thread 1: logger.info() → append to thread-local buffer (no lock)
Thread 2: logger.info() → append to thread-local buffer (no lock)
Thread 3: logger.info() → append to thread-local buffer (no lock)
Thread 4: logger.info() → append to thread-local buffer (no lock)
Writer thread: drain all buffers → serialize JSON → write to stdout
```
The hot path is buffer append only. No locks. No serialization. Then a background thread handles the expensive stuff (JSON, I/O) in batches.
Architecture: Per-Thread Buffers + Async Writer
Here's the design in detail:
Layer 1: Hot Path (Per-Thread)
```python
def _log(self, level: str, msg: str, **kwargs):
    if level not in _LEVELS or _LEVELS[level] < self.level_value:
        return  # Quick exit
    # Append to thread-local buffer (single append, no lock)
    self._thread_local.buffer.append([
        time.time_ns(),
        level,
        msg,
        kwargs,
        threading.current_thread().ident,
    ])
```
No dict creation. No serialization. Just a fast append to a pre-allocated list.
Layer 2: Cross-Thread Handoff (RingQueue)
When the thread-local buffer fills, it flushes to a bounded RingQueue:
```python
class RingQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity  # put() below checks against this
        self.buffer = [None] * capacity
        self.write_pos = 0
        self.read_pos = 0
        self.lock = threading.Lock()

    def put(self, item):
        with self.lock:
            if self.write_pos - self.read_pos >= self.capacity:
                # Queue full: caller waits or drops
                return False
            self.buffer[self.write_pos % self.capacity] = item
            self.write_pos += 1
            return True
```
This is a multi-producer/single-consumer design. Multiple threads append records, one writer thread drains.
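The writer loop below drains this queue with a `get_many` call that isn't shown above. A plausible sketch of it, extending the same class (the condition-variable timeout handling here is my assumption, not necessarily rapidlog's implementation):

```python
import threading

class RingQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = [None] * capacity
        self.write_pos = 0
        self.read_pos = 0
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.lock:
            if self.write_pos - self.read_pos >= self.capacity:
                return False  # full: caller waits or drops
            self.buffer[self.write_pos % self.capacity] = item
            self.write_pos += 1
            self.not_empty.notify()  # wake the writer thread if it's sleeping
            return True

    def get_many(self, batch_size, timeout):
        # Single consumer: drain up to batch_size items, waiting at most
        # `timeout` seconds if the queue is currently empty.
        with self.lock:
            if self.read_pos == self.write_pos:
                self.not_empty.wait(timeout)
            batch = []
            while self.read_pos < self.write_pos and len(batch) < batch_size:
                batch.append(self.buffer[self.read_pos % self.capacity])
                self.read_pos += 1
            return batch
```

Since only one thread ever advances `read_pos`, the consumer side needs no coordination beyond the shared lock it already takes to inspect `write_pos`.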
Layer 3: Writer Thread (Background)
```python
def _writer_loop(self):
    while self.running:
        batch = self.queue.get_many(batch_size=256, timeout=0.01)
        if not batch:
            continue
        # Serialize all JSON in the writer thread
        output = []
        for record in batch:
            json_str = json.dumps({
                "ts_ns": record[0],
                "level": record[1],
                "msg": record[2],
                **record[3],
                "thread": record[4],
            })
            output.append(json_str)
        # Single I/O operation for the whole batch
        self.stdout.buffer.write(("\n".join(output) + "\n").encode())
```
All JSON serialization happens here, outside the hot path. Batching reduces I/O syscalls.
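The batching claim is easy to sanity-check in isolation: joining a batch and issuing one `write()` produces byte-identical output to writing line by line, while cutting the number of I/O calls from N to 1. A small demonstration, with `io.BytesIO` standing in for stdout:

```python
import io

records = ['{"i": %d}' % i for i in range(256)]

# One write() call per record: 256 potential syscalls
per_line = io.BytesIO()
for r in records:
    per_line.write(r.encode() + b"\n")

# One write() call for the whole batch: a single syscall
batched = io.BytesIO()
batched.write(("\n".join(records) + "\n").encode())

assert per_line.getvalue() == batched.getvalue()
```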
The Trade-Off: Memory vs Throughput
This design uses memory to buy speed:
Low-memory preset:
- Buffer size: 2,048 records per thread
- Peak memory: ~2–4 MiB
- Use when: Lambda, containers with tight memory
Balanced preset (default):
- Buffer size: 32,768 records per thread
- Peak memory: ~5–10 MiB
- Use when: General-purpose apps
Throughput preset:
- Buffer size: 131,072 records per thread
- Peak memory: ~10–20 MiB
- Use when: High-volume logging (100K+ logs/sec)
This is an intentional trade-off. Standard library makes the opposite choice (minimal memory, multiple locks). rapidlog assumes "memory is cheaper than CPU under load" and bets accordingly.
Code Example: Real-World Usage
Before (stdlib + structlog/python-json-logger)
```python
import logging
from pythonjsonlogger import jsonlogger

# Setup is tedious
handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(fmt='%(timestamp)s %(level)s %(name)s %(message)s')
handler.setFormatter(formatter)
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# API is awkward
@app.post("/users")
def create_user(user_id: int, email: str):
    logger.info("user_create", extra={"fields": {"user_id": user_id, "email": email}})
    # ...
    return {"status": "ok"}
```
After (rapidlog)
```python
from rapidlog import get_logger

# One line
logger = get_logger(level="INFO")

# Clean API
@app.post("/users")
def create_user(user_id: int, email: str):
    logger.info("user_create", user_id=user_id, email=email)
    # ...
    return {"status": "ok"}
```
Both output the same JSON:
```json
{"ts_ns": 1739462130123456789, "level": "INFO", "msg": "user_create", "user_id": 123, "email": "bob@example.com", "thread": 12345}
```
When to Use rapidlog (And When NOT To)
Use rapidlog when:
✅ You're logging 10K+ events/sec
✅ You have 4+ worker threads logging concurrently
✅ You need structured JSON by default
✅ You want zero external dependencies
✅ Latency matters (e.g., fintech, gaming, real-time APIs)
Don't use rapidlog when:
❌ You're logging <1K events/sec (stdlib is fine, no contention)
❌ You need file rotation (coming in v2, but not there yet)
❌ You need colors/pretty output for development (use stdlib)
❌ Memory is extremely constrained and you can't spare 2+ MiB
❌ You're on Python < 3.10
Benchmarking Methodology (Why I Trust These Numbers)
I benchmarked against 6+ libraries under fair conditions:
- Same output format across all (structured JSON, ~100 bytes/log)
  - Except fastlogging, which doesn't support structured JSON
- Real I/O (logs written to a file, not discarded)
  - In-memory benchmarks are meaningless for logging
- Configurable thread counts (1, 4, 8 workers)
- Documented trade-offs (memory, dependencies, features)

All the benchmark code is in the GitHub repo if you want to reproduce the numbers.
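Shaped roughly, the harness drives every library through the same callable under identical thread counts and volumes. A sketch, not the repo's actual code (`bench` and `log_fn` are hypothetical names):

```python
import threading
import time

def bench(log_fn, n_threads=4, n_logs=100_000):
    # Same thread count and per-thread volume for every library under test;
    # only the logging call itself differs between runs.
    def worker():
        for i in range(n_logs):
            log_fn("user_create", user_id=i)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads * n_logs / (time.perf_counter() - start)
```

Each library then gets a thin adapter, e.g. `bench(lambda msg, **kw: logger.info(msg, **kw))`, so the measured work is identical apart from the call under test.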
v2 Roadmap (What's Coming)
v1 is intentionally minimal. v2 will add:
- Multiple sinks (file, network, cloud)
- Sampling (drop 1-in-N records for very high volume)
- Custom encoders (MessagePack, Protobuf)
- OpenTelemetry integration (correlation IDs, trace context)
- Datadog/Honeycomb SDK examples
Why Open-Source?
Why share this instead of keeping it proprietary?
Selfish reasons:
- I care deeply about performance engineering, and open-sourcing helps pressure-test ideas in the real world.
- Feedback from real users (you!) makes the library better, faster.
- It's a long-term engineering project where I can share trade-offs, benchmarking methodology, and design decisions transparently.
Altruistic reasons:
- This is a universal problem. Lock contention under load affects every multi-threaded Python app.
- stdlib logging is good enough for 95% of cases. The other 5% shouldn't have to ship custom solutions.
Questions I Expect (And Answers)
Q: Why not just use async/await for logging?
A: Async adds overhead (context switching, event loop overhead). Pre-allocated buffers are simpler and faster for this narrow use case.
Q: What about the GIL?
A: GIL + stdlib's lock is a double hit. rapidlog works within GIL constraints by deferring expensive work (serialization) to a writer thread. This is a well-known producer/consumer pattern (the same handoff that async logging appenders use in other ecosystems).
Q: How does this compare to Loguru?
A: Loguru is more feature-rich and easier to use. rapidlog is faster for high-volume multi-threaded scenarios. Pick based on your constraints.
Q: Can I use this with Django/Flask?
A: Yes! Examples coming in the repo. Anywhere you'd use stdlib logging, rapidlog works.
Q: Is this production-ready?
A: v1.0 is stable (37 comprehensive tests). It's intentionally minimal, but what's there is solid.
Try It Out
```shell
pip install rapidlog
```

```python
from rapidlog import get_logger

logger = get_logger()
logger.info("Hello, high-performance logging!", request_id="abc123", latency_ms=42)
logger.close()
```
Full docs: https://github.com/sid19991/rapidlog
What I Learned Building This
- Lock contention is invisible until you benchmark. Profilers don't always show it clearly.
- Pre-allocation trades memory for speed. This is unfashionable in today's auto-scaling world, but it works.
- The GIL is a constraint, not a blocker. You can build fast Python within its limits.
- Benchmarking is hard. Fair comparison requires controlling for output format, I/O, thread count, and more.
- Simple designs win. Per-thread buffers plus an async writer is a decades-old pattern (real databases use it). Innovation isn't always new.
Let's Talk
What do you think? Are you logging 10K+ events/sec and hitting lock contention? Would you use this?
Hit me up:
- GitHub issues: https://github.com/sid19991/rapidlog/issues
- Twitter/X: siddharthpogul
Links:
- GitHub: https://github.com/sid19991/rapidlog
- PyPI: https://pypi.org/project/rapidlog/
- Benchmarks: In README + GitHub repo