James Rivers

Why Your Kubernetes Log Parsing Is Silently Dropping Events (And How to Fix It)

You're losing logs right now. Probably 5–15% of them. No errors, no alerts — they just vanish into the void. You'll find out in a postmortem.

I've spent the last few years debugging production logging pipelines across EKS, GKE, and AKS clusters, and the same failure modes come up again and again. Here are the real culprits — with the regex patterns that actually fix them.


Failure Mode 1: IPv6 Pod Addresses Break Your IP Regex

The naive IP pattern everyone copies from Stack Overflow:

(\d{1,3}\.){3}\d{1,3}

This works fine until your cluster runs dual-stack (IPv4 + IPv6), or you're on AWS EKS with VPC CNI in IPv6 mode. A log line like:

2024-01-15T10:23:44Z pod/frontend-7d9f8b [::ffff:10.0.1.42]:8080 GET /api/health 200

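...will not give you the address. An unanchored search pulls out only the embedded 10.0.1.42 and drops the ::ffff: prefix; an anchored parser, or a native IPv6 address such as [2001:db8::1], matches nothing at all. Your aggregator skips the IP extraction without complaint, the field is null or wrong, and your alert never fires.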

The fix:

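(?:::ffff:(?:[0-9]{1,3}\.){3}[0-9]{1,3}|(?:[0-9a-fA-F]{1,4}:){1,6}(?::[0-9a-fA-F]{1,4}){1,6}|(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})

Ugly? Yes. Does it handle IPv4-mapped IPv6 (::ffff:10.0.1.42)? Also yes, as long as that branch comes first: alternation is tried left to right, and with the bare :: branch ahead of it the match stops at ::ffff:10. The second branch covers addresses compressed in the middle (2001:db8::1), and the full form requires exactly eight groups so that a timestamp fragment like 10:23:44 is not mistaken for an address. Treat it as an extraction pattern, not a validator: it does not range-check octets or count groups strictly.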
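
To see the difference on the sample line, here is a minimal Python sketch. It is an illustration, not part of any aggregator config: the extract_ip helper and the 2001:db8::1 log line are hypothetical, and the combined pattern puts the IPv4-mapped branch first so it wins over the bare :: branch.

```python
import re

# Naive IPv4-only pattern from the top of this section.
NAIVE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Branches are tried left to right, so the IPv4-mapped form goes first.
IP = re.compile(
    r"(?:::ffff:(?:[0-9]{1,3}\.){3}[0-9]{1,3}"                # IPv4-mapped IPv6
    r"|(?:[0-9a-fA-F]{1,4}:){1,6}(?::[0-9a-fA-F]{1,4}){1,6}"  # "::" in the middle
    r"|(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"              # full 8-group IPv6
    r"|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4}"          # leading "::"
    r"|(?:[0-9]{1,3}\.){3}[0-9]{1,3})"                        # plain IPv4
)

def extract_ip(line):
    """Return the first address-like token in the line, or None."""
    m = IP.search(line)
    return m.group(0) if m else None

# Sample line from this section, plus a hypothetical native-IPv6 variant.
mapped = "2024-01-15T10:23:44Z pod/frontend-7d9f8b [::ffff:10.0.1.42]:8080 GET /api/health 200"
pure_v6 = "2024-01-15T10:23:44Z pod/frontend-7d9f8b [2001:db8::1]:8080 GET /api/health 200"

print(extract_ip(mapped))     # ::ffff:10.0.1.42
print(extract_ip(pure_v6))    # 2001:db8::1
print(NAIVE.search(pure_v6))  # None
```

Running your real log samples through a harness like this before shipping a parser change is cheaper than finding the null field in a postmortem.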


Failure Mode 2: CRI-O Fragments Your Java Stack Traces

Docker's json-file driver wraps each log line in a JSON object. Clean, predictable. But CRI runtimes write a different format. CRI-O (the default on OpenShift) and containerd (the default on current EKS, GKE, and AKS node images) both emit the CRI logging format:

2024-01-15T10:23:44.123456789Z stdout P java.lang.NullPointerException
2024-01-15T10:23:44.123456790Z stdout P     at com.example.Service.handle(Service.java:42)
2024-01-15T10:23:44.123456791Z stdout F     at com.example.Main.main(Main.java:10)

The P tag means "partial": the runtime split a long line, and more fragments follow. F means "full": this record completes the message.

If your Fluent Bit or Vector config doesn't handle the P/F tags, each fragment becomes a separate log event. Add ordinary line-per-event tailing on top, and a 30-line Java stack trace becomes 30 or more individual entries, none of them actionable.
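
To make the P/F rule concrete, here is a minimal Python sketch of the reassembly logic, independent of any aggregator. The reassemble function and its sep parameter are hypothetical names. The default newline separator mirrors a newline-joining merge; pass sep="" when the fragments are pieces of a single physical line.

```python
import re

# CRI log line: <timestamp> <stream> <P|F> <content>
CRI_LINE = re.compile(r"^(\S+) (stdout|stderr) ([PF]) (.*)$")

def reassemble(lines, sep="\n"):
    """Buffer P records per stream and emit one event at each F record."""
    pending = {}   # stream name -> fragments still waiting for an F
    events = []
    unparsed = 0   # count bad lines instead of dropping them silently
    for raw in lines:
        m = CRI_LINE.match(raw.rstrip("\n"))
        if m is None:
            unparsed += 1
            continue
        _ts, stream, tag, content = m.groups()
        pending.setdefault(stream, []).append(content)
        if tag == "F":
            events.append((stream, sep.join(pending.pop(stream))))
    return events, unparsed

sample = [
    "2024-01-15T10:23:44.123456789Z stdout P java.lang.NullPointerException",
    "2024-01-15T10:23:44.123456790Z stdout P     at com.example.Service.handle(Service.java:42)",
    "2024-01-15T10:23:44.123456791Z stdout F     at com.example.Main.main(Main.java:10)",
]

events, unparsed = reassemble(sample)
print(len(events), unparsed)  # 1 0
print(events[0][1])
```

Keeping a separate buffer per stream matters because stdout and stderr records interleave in the same file; a production version would also key on the source file and flush stale fragments after a timeout.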

Vector config to reassemble CRI-O multiline:

[transforms.reassemble_crio]
type = "reduce"
inputs = ["kubernetes_logs"]
group_by = ["kubernetes.pod_name", "kubernetes.container_name"]
merge_strategies.message = "concat_newline"

[transforms.reassemble_crio.ends_when]
type = "vrl"
source = '''
match(.stream, r'stdout|stderr') && match(.message, r'^(?:\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) (?:stdout|stderr) F ')
'''

Failure Mode 3: Timestamp Skew Breaks Event Ordering

Kubernetes logs carry at least three timestamps:

  1. The timestamp the application wrote to stdout
  2. The timestamp the container runtime attached when buffering
  3. The timestamp Fluent Bit/Vector added when it read the file

When your pipeline uses the wrong one, events appear out of order in Elasticsearch/Loki. Queries for "what happened between 10:00 and 10:01" return incomplete results.

The pattern for Fluent Bit to prefer the application timestamp:

[PARSER]
    Name        k8s_app_timestamp
    Format      regex
    Regex       ^(?<app_time>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?)[\s\t]+(?<log>.*)$
    Time_Key    app_time
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    Time_Keep   On

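Setting Time_Keep On keeps the parsed app_time field in the record after Fluent Bit has used it as the event timestamp; without it, the field is removed. That lets you compare the application time (used for indexing) against a collection time you record separately, which is useful for lag monitoring. Note that Time_Format has to match what the application really writes: the regex above also accepts a space separator and timestamps without fractional seconds or a zone, which this format string will not parse.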
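
As an illustration of the lag check, here is a minimal Python sketch that pulls the application timestamp off a line and compares it with a collection time. It is not Fluent Bit code: split_line, the sample messages, and the choice to assume UTC when no zone is present are all assumptions made for the example. Python spells named groups (?P<name>...), where the Fluent Bit parser uses (?<name>...).

```python
import re
from datetime import datetime, timezone

# Same shape as the parser regex, split into named parts for easier handling.
LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})[T ](?P<clock>\d{2}:\d{2}:\d{2})"
    r"(?:\.(?P<frac>\d+))?(?P<zone>Z|[+-]\d{2}:?\d{2})?\s+(?P<log>.*)$"
)

def split_line(line, collected_at):
    """Return app time, message, and collection lag; None if no timestamp leads the line."""
    m = LINE.match(line)
    if m is None:
        return None  # caller has to fall back to the collection time
    # datetime stores microseconds at most, so trim nanosecond precision.
    micros = (m.group("frac") or "")[:6].ljust(6, "0")
    zone = m.group("zone")
    zone = "+0000" if zone in (None, "Z") else zone.replace(":", "")  # assumes UTC if absent
    app_time = datetime.strptime(
        f"{m.group('date')} {m.group('clock')}.{micros}{zone}",
        "%Y-%m-%d %H:%M:%S.%f%z",
    )
    return {
        "app_time": app_time,
        "log": m.group("log"),
        "lag_seconds": (collected_at - app_time).total_seconds(),
    }

collected = datetime(2024, 1, 15, 10, 23, 46, 250000, tzinfo=timezone.utc)
record = split_line("2024-01-15T10:23:44.250Z payment accepted", collected)
print(record["log"], record["lag_seconds"])  # payment accepted 2.0
```

A lag that keeps growing usually points at a collector falling behind, while a negative lag points at clock skew between nodes.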


Failure Mode 4: Buffer Overflows Silently Drop Long Lines

Most log aggregators have a default line length limit:

  • Fluent Bit: 32KB
  • Vector: file-based sources cap line length with max_line_bytes, and memory pressure can cause drops on top of that
  • Logstash: configurable, often 1MB

A long stack trace, a large JSON blob, or a noisy debug log can exceed these limits. What happens? The line is silently truncated or dropped depending on your config.

In Fluent Bit, set this explicitly:

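[SERVICE]
    Flush         1
    Log_Level     info
    # Expose the built-in HTTP endpoint for monitoring metrics
    HTTP_Server   On

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    # Per-file read buffer: starts at Buffer_Chunk_Size, grows up to Buffer_Max_Size
    Buffer_Chunk_Size 32k
    Buffer_Max_Size   256k
    Skip_Long_Lines   Off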

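Skip_Long_Lines Off (also the default) means that a line longer than Buffer_Max_Size makes Fluent Bit log an error and stop monitoring that file, instead of quietly skipping the line and carrying on. That is much easier to spot, but everything after the long line in that file is lost too, so size Buffer_Max_Size above your longest expected line.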
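
To pick a sensible limit, it helps to know how long your longest lines really are. A minimal Python sketch, assuming you can read the container log files directly; oversized_lines is a hypothetical helper, and the limit is passed in plain bytes so it does not depend on how any aggregator interprets size suffixes.

```python
import io

def oversized_lines(stream, limit_bytes):
    """Yield (line_number, byte_length) for every line longer than limit_bytes."""
    for number, raw in enumerate(stream, start=1):
        length = len(raw.rstrip(b"\r\n"))  # measure the line without its terminator
        if length > limit_bytes:
            yield number, length

# In-memory stand-in for a log file opened with open(path, "rb").
data = b"short line\n" + b"x" * 300 + b"\n" + b"ok\n"
for number, length in oversized_lines(io.BytesIO(data), 256):
    print(number, length)  # 2 300
```

Run it over a day of logs from your noisiest service and set the buffer limit with headroom above the largest value it reports.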


The Broader Problem

These are just four patterns. In production I've catalogued 50+ failure modes across:

  • Docker json-file, containerd, CRI-O, and journald log formats
  • Vector, Fluent Bit, Logstash, and rsyslog aggregators
  • AWS EKS, GCP GKE, Azure AKS, and self-hosted Kubernetes
  • Mixed runtime clusters (common during upgrades)

If you're building or maintaining a production logging pipeline, I've packaged all of these into a Production Log Parsing Pack — 50+ copy-paste regex patterns and complete aggregator configs, each with the real log line that breaks the naive version and the production-safe replacement.

It's the reference I wish I'd had when I started. Available on Gumroad for £9.50 (~$12).


Quick Diagnostic: Are You Losing Logs?

Run this against your logging pipeline to check:

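# Emit a unique marker on the container's stdout, then look for it in your backend.
# Output from `kubectl exec` returns to your terminal, not to the container log,
# so write to PID 1's stdout instead (assumes the exec user may write to /proc/1/fd/1).
MARKER="test-marker-$(date +%s)"
kubectl exec -n myapp deploy/frontend -- sh -c "echo '$MARKER' > /proc/1/fd/1"
echo "Search for: $MARKER"

# In Loki/Elasticsearch: search for the marker within the next 30 seconds
# If it doesn't appear, you have a silent drop somewhere in the pipeline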

If the test marker disappears, work backwards through your pipeline stages — it's almost always a regex parse failure causing the event to be filtered before indexing.
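
A single marker only tells you whether the path works at all. To estimate a drop rate, one option is to emit numbered markers and compare what arrives. A minimal Python sketch of that comparison, assuming you have already collected the sequence numbers your backend returned; drop_report and the numbering scheme are hypothetical.

```python
def drop_report(sent_count, received_ids):
    """Compare markers numbered 0..sent_count-1 against the IDs the backend returned."""
    expected = set(range(sent_count))
    seen = set(received_ids) & expected   # ignore duplicates and unknown IDs
    missing = sorted(expected - seen)
    rate = len(missing) / sent_count if sent_count else 0.0
    return {"missing": missing, "drop_rate": rate}

# Ten markers sent; marker 4 never arrived and marker 9 was delivered twice.
report = drop_report(10, [0, 1, 2, 3, 5, 6, 7, 8, 9, 9])
print(report)  # {'missing': [4], 'drop_rate': 0.1}
```

The missing IDs are the useful part: replay exactly those lines through each pipeline stage to find where they stop.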


Questions about your specific stack? Drop them in the comments. Happy to help debug specific CRI-O or containerd configs.
