Urvil Joshi

Posted on May 27 • Originally published at Medium on Mar 17

Canonical Log Lines Stripe Brilliant Technique for Production Observability

#backend #devops #monitoring #sre

Stripe published a technique they call canonical log lines one fat, structured log line emitted at the end of every request that contains all the important telemetry in one place. Sounds too simple right but it fundamentally changes how you query, debug, and understand production systems.

🚨The Problem With Traditional Logging

Let’s say your payment API receives a single request. Internally, it will hit every layer authentication, rate limiting, database queries, and response generation. Traditional logging looks something like this:

[2024-03-18 22:48:32.990] Request started http_method=POST http_path=/v1/charges request_id=req_123
[2024-03-18 22:48:32.991] User authenticated auth_type=api_key key_id=mk_123 user_id=usr_123
[2024-03-18 22:48:32.992] Rate limiting ran rate_allowed=true rate_quota=100 rate_remaining=99
[2024-03-18 22:48:32.998] Charge created charge_id=ch_123 team=acquiring
[2024-03-18 22:48:32.999] Request finished duration=0.009 http_status=200 database_queries=34

This looks perfectly fine. And for simple questions, it works:

# Was anything rate limited recently?
"Rate limiting ran" rate_allowed=false

# Duration stats over the last hour
"Request finished" earliest=-1h | stats count p50(duration) p95(duration) p99(duration)

But now imagine for a incident and someone asks: “Which users are being rate limited the most?”

Suddenly you have a problem. The user_id field is in the authentication line. The rate_allowed field is in the rate limiting line. They're completely separate. To link them, you need to join across lines using request_id as a common bridge field.

*PROBLEM * : Each log line only knows what its own module knows. The rate limiter doesn’t know the user ID. The auth module does not know if rate limiting passed. The information you need is almost always spread across multiple lines of log which you is too much processing when we consider millions or billions of requests.

✨The Solution: One Canonical Line With All Importent Data

The canonical log line is Stripe’s answer: at the end of every request, create a single structured fat log line that contains all the key data together. Keep in mind not instead of regular logs but in addition to them.

[2024-03-18 22:48:32.999] canonical-log-line
  alloc_count=9123
  auth_type=api_key
  database_queries=34
  duration=0.009
  http_method=POST
  http_path=/v1/charges
  http_status=200
  key_id=mk_123
  permissions_used=account_write
  rate_allowed=true
  rate_quota=100
  rate_remaining=99
  request_id=req_123
  user_id=usr_123

Every key piece of information about this request like auth, rate limiting, HTTP details, database stats lives in a single readable line. Now the query for “who is being rate limited most?” becomes:

Canonical lines are an ergonomic feature for engineers. By collecting everything that is important and accessible through queries that are easy to write, makes production incident easy to debug and analyse_._

🔍More Real Query Examples

The power of canonical log lines is not just rate limiting. Because every key field is together, you can ask almost any operational question in a single line of query syntax.

Example 1 :Performance by Endpoint

During an incident, you want to see latency percentiles for the /v1/charges endpoint for a specific user, while filtering out client errors :

Example 2 — Detect a Bug vs. Legitimate Rate Limiting

Is rate limiting hitting few users or is it a bug affecting everyone?

Example 3 — TLS Version Adoption (Real Stripe Use Case)

Stripe needed to migrate users from TLS 1.0/1.1 to TLS 1.2. They could query this instantly with canonical lines:

🍥Architecture Used By Stripe

The beauty of canonical log lines is that the implementation is simple.Idea is to do it in middleware, which makes it one completely automatic and second we have have strict control over the it.

During a request’s lifecycle, each module decorates a shared environment object with relevant fields. The canonical logger sits at the very end of the middleware chain and drains all those fields into one log line.

Stripe wraps the canonical line emission in a Ruby ensure block which means it runs even if an exception was thrown mid-request. This guarantees you always have observability, especially during the incidents when you need it most.

Stripe’s approach for canonical line storage:

Kafka serialize canonical lines as Protocol Buffers and push asynchronously to a Kafka topic. This keeps the request path fast
Splunk will be ingested almost in real-time. Perfect when you need answers in seconds. Great for the last hour of data.
A consumer reads from Kafka, batches the data, and writes it to S3. Periodic jobs ingest it into Redshift for SQL-based long-term analytics over months of history
Stripe’s Developer Dashboard is powered by MapReduce jobs over these S3 archives.

Long-Term Analytics in SQL (Redshift)

Stripe archives canonical lines to Redshift. This lets them run months of historical queries in standard SQL

🧰Canonical vs. Other Observability Tools

Canonical log lines do not replace metrics or distributed tracing. They occupy a unique place in observability:

Ideally we should use all four, with canonical log lines as the first view in any debugging process because they are the fast to query and the cheap to aggregate.

✍️Conclusion

Canonical log lines are not a new technology. They are a simple convention that makes your existing logging infrastructure dramatically more powerful:

The beauty of this pattern is that it scales from a small start-up to Stripe’s global payment infrastructure. The implementation is simple yet powerful.

DEV Community