<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James Rivers</title>
    <description>The latest articles on DEV Community by James Rivers (@riverbend).</description>
    <link>https://dev.to/riverbend</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3938260%2F2284a927-440e-4a5e-a196-daf99dcb45da.png</url>
      <title>DEV Community: James Rivers</title>
      <link>https://dev.to/riverbend</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/riverbend"/>
    <language>en</language>
    <item>
      <title>Kubernetes Logging Architecture in 2025: Fluent Bit vs Vector vs Logstash (With Real Configs)</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 20:49:51 +0000</pubDate>
      <link>https://dev.to/riverbend/kubernetes-logging-architecture-in-2025-fluent-bit-vs-vector-vs-logstash-with-real-configs-2lfc</link>
      <guid>https://dev.to/riverbend/kubernetes-logging-architecture-in-2025-fluent-bit-vs-vector-vs-logstash-with-real-configs-2lfc</guid>
      <description>&lt;h1&gt;
  
  
  Kubernetes Logging Architecture in 2025: Fluent Bit vs Vector vs Logstash (With Real Configs)
&lt;/h1&gt;

&lt;p&gt;After working with 50+ Kubernetes clusters in production, I've seen teams make the same architectural mistakes with logging. The wrong choice at the collector layer costs you 3x more in compute and 10x more in operational pain.&lt;/p&gt;

&lt;p&gt;This post breaks down the three main collectors I've deployed in anger, with real config snippets and the gotchas nobody documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Layer Problem
&lt;/h2&gt;

&lt;p&gt;Kubernetes logging has three distinct concerns that teams conflate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collection&lt;/strong&gt; — Reading from container stdout/stderr (CRI-O or containerd format)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing&lt;/strong&gt; — Parsing, filtering, enriching (adding pod labels, stripping noise)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping&lt;/strong&gt; — Sending to your aggregator (Elasticsearch, Loki, Datadog, Grafana Cloud)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pick the wrong tool at layer 1 and you'll be fighting cardinality explosions and dropped multiline stacktraces for months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fluent Bit: The Right Default
&lt;/h2&gt;

&lt;p&gt;Fluent Bit is written in C, uses ~10MB RAM per node, and handles the CRI-O format correctly out of the box. If you're starting fresh, this is your answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[INPUT]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;              &lt;span class="err"&gt;tail&lt;/span&gt;
    &lt;span class="err"&gt;Path&lt;/span&gt;              &lt;span class="err"&gt;/var/log/containers/*.log&lt;/span&gt;
    &lt;span class="err"&gt;multiline.parser&lt;/span&gt;  &lt;span class="err"&gt;cri&lt;/span&gt;
    &lt;span class="err"&gt;Tag&lt;/span&gt;               &lt;span class="err"&gt;kube.*&lt;/span&gt;
    &lt;span class="err"&gt;Mem_Buf_Limit&lt;/span&gt;     &lt;span class="err"&gt;50MB&lt;/span&gt;
    &lt;span class="err"&gt;Skip_Long_Lines&lt;/span&gt;   &lt;span class="err"&gt;On&lt;/span&gt;

&lt;span class="nn"&gt;[FILTER]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;                &lt;span class="err"&gt;kubernetes&lt;/span&gt;
    &lt;span class="err"&gt;Match&lt;/span&gt;               &lt;span class="err"&gt;kube.*&lt;/span&gt;
    &lt;span class="err"&gt;Kube_URL&lt;/span&gt;            &lt;span class="err"&gt;https://kubernetes.default.svc:443&lt;/span&gt;
    &lt;span class="err"&gt;Kube_CA_File&lt;/span&gt;        &lt;span class="err"&gt;/var/run/secrets/kubernetes.io/serviceaccount/ca.crt&lt;/span&gt;
    &lt;span class="err"&gt;Kube_Token_File&lt;/span&gt;     &lt;span class="err"&gt;/var/run/secrets/kubernetes.io/serviceaccount/token&lt;/span&gt;
    &lt;span class="err"&gt;Merge_Log&lt;/span&gt;           &lt;span class="err"&gt;On&lt;/span&gt;
    &lt;span class="err"&gt;Keep_Log&lt;/span&gt;            &lt;span class="err"&gt;Off&lt;/span&gt;
    &lt;span class="err"&gt;K8S-Logging.Parser&lt;/span&gt;  &lt;span class="err"&gt;On&lt;/span&gt;
    &lt;span class="err"&gt;K8S-Logging.Exclude&lt;/span&gt; &lt;span class="err"&gt;On&lt;/span&gt;

&lt;span class="nn"&gt;[OUTPUT]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;  &lt;span class="err"&gt;loki&lt;/span&gt;
    &lt;span class="err"&gt;Match&lt;/span&gt; &lt;span class="err"&gt;*&lt;/span&gt;
    &lt;span class="err"&gt;Host&lt;/span&gt;  &lt;span class="err"&gt;loki.monitoring.svc.cluster.local&lt;/span&gt;
    &lt;span class="err"&gt;Port&lt;/span&gt;  &lt;span class="err"&gt;3100&lt;/span&gt;
    &lt;span class="err"&gt;Labels&lt;/span&gt; &lt;span class="py"&gt;job&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;fluent-bit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The critical gotcha&lt;/strong&gt;: &lt;code&gt;multiline.parser cri&lt;/code&gt; handles the &lt;code&gt;[FP]&lt;/code&gt; flags in CRI-O logs. Without this, multiline Java stacktraces get split into hundreds of single-line log entries, and your alerting on &lt;code&gt;Exception&lt;/code&gt; patterns fires constantly on partial lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector: When You Need ETL-Level Processing
&lt;/h2&gt;

&lt;p&gt;Vector (by Datadog, open source) is the right choice when you need complex transformations — routing different log streams to different destinations, applying VRL transforms, or doing real-time aggregation before shipping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[sources.kubernetes_logs]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"kubernetes_logs"&lt;/span&gt;
&lt;span class="py"&gt;auto_partial_merge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[transforms.parse_nginx]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"remap"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"kubernetes_logs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'''
  if exists(.kubernetes.pod_labels."app") &amp;amp;&amp;amp; .kubernetes.pod_labels."app" == "nginx" {
    . = merge(., parse_nginx_log!(.message, "combined"))
  }
'''&lt;/span&gt;

&lt;span class="nn"&gt;[transforms.add_environment]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"remap"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"parse_nginx"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'''
  .environment = get_env_var!("ENVIRONMENT")
  .cluster_name = get_env_var!("CLUSTER_NAME")
'''&lt;/span&gt;

&lt;span class="nn"&gt;[sinks.loki]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"loki"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"add_environment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://loki:3100"&lt;/span&gt;
&lt;span class="py"&gt;encoding.codec&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"json"&lt;/span&gt;
&lt;span class="py"&gt;labels.app&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{{ kubernetes.pod_labels.app }}"&lt;/span&gt;
&lt;span class="py"&gt;labels.namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{{ kubernetes.pod_namespace }}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where Vector shines&lt;/strong&gt;: cardinality control. You can hash or drop high-cardinality fields (user IDs, session tokens, request IDs) before they hit Loki/Elasticsearch. I've seen Loki clusters go from 500GB/day to 50GB/day after adding a Vector transform that strips request IDs from labels.&lt;/p&gt;
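
&lt;p&gt;As a rough sketch of that kind of transform, a remap step can delete or hash the offending fields before the sink sees them. The field names &lt;code&gt;request_id&lt;/code&gt; and &lt;code&gt;user_id&lt;/code&gt; and the transform name are assumptions for illustration, not taken from a real cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[transforms.limit_cardinality]
type = "remap"
inputs = ["add_environment"]
source = '''
  # Hypothetical per-request field: remove it outright.
  del(.request_id)

  # Hypothetical user identifier: keep it groupable, but only as a hash.
  if exists(.user_id) {
    .user_id = sha2(to_string!(.user_id))
  }
'''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Loki sink would then list &lt;code&gt;limit_cardinality&lt;/code&gt; in its &lt;code&gt;inputs&lt;/code&gt;, and neither field should be referenced under &lt;code&gt;labels&lt;/code&gt;.&lt;/p&gt;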

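&lt;p&gt;&lt;strong&gt;Where Vector struggles&lt;/strong&gt;: the VRL language has a learning curve, and error handling is verbose. If a transform errors at runtime, Vector by default logs the error and forwards the original, unmodified event; with &lt;code&gt;drop_on_error&lt;/code&gt; enabled it discards the event instead, and you lose it unless you explicitly route dropped events somewhere.&lt;/p&gt;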
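
&lt;p&gt;A minimal sketch of routing failed events explicitly, assuming a Vector version whose remap transform supports &lt;code&gt;reroute_dropped&lt;/code&gt;. The transform and sink names are made up for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[transforms.parse_nginx_checked]
type = "remap"
inputs = ["kubernetes_logs"]
drop_on_error = true      # do not forward events the program failed on
reroute_dropped = true    # send them to the ".dropped" output instead
source = '''
  . = merge(., parse_nginx_log!(.message, "combined"))
'''

# Events the program failed on end up here rather than disappearing.
[sinks.failed_events]
type = "console"
inputs = ["parse_nginx_checked.dropped"]
encoding.codec = "json"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real pipeline the console sink would probably be swapped for a dedicated index or stream, so parse failures can be counted and alerted on.&lt;/p&gt;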

&lt;h2&gt;
  
  
  Logstash: The Legacy Default (Use With Caution)
&lt;/h2&gt;

&lt;p&gt;Logstash still dominates in Elasticsearch-first shops. It's battle-tested, has 200+ input/output plugins, and the grok patterns are well-documented. But it runs on the JVM and uses 500MB-1GB RAM per instance, roughly 50-100x more than Fluent Bit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;beats&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5044&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=~&lt;/span&gt; &lt;span class="sr"&gt;/nginx/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;grok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"message"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="s1"&gt;'%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="c1"&gt;# Bracketed IPv6 variant, e.g. [2001:db8::1]&lt;/span&gt;
          &lt;span class="s1"&gt;'\[%{IPV6:remote_addr}\] - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}"'&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;mutate&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;add_field&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"cluster"&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"${CLUSTER_NAME}"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;remove_field&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"ecs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;elasticsearch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hosts&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://elasticsearch:9200"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes-logs-%{+YYYY.MM.dd}"&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"${ES_USER}"&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"${ES_PASSWORD}"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The IPv6 gotcha&lt;/strong&gt;: &lt;code&gt;%{IPORHOST}&lt;/code&gt; covers IPv4 addresses, hostnames and bare IPv6 addresses, but not a bracketed IPv6 literal like &lt;code&gt;[2001:db8::1]&lt;/code&gt;, and naive hand-rolled patterns usually match IPv4 only. You need a separate pattern or a conditional match for the address forms your proxies actually emit. This is the #1 reason log lines silently drop in nginx access log pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Architecture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Fluent Bit&lt;/th&gt;
&lt;th&gt;Vector&lt;/th&gt;
&lt;th&gt;Logstash&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RAM per node&lt;/td&gt;
&lt;td&gt;~10MB&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;td&gt;500MB-1GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRI-O multiline&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transform power&lt;/td&gt;
&lt;td&gt;Lua scripts&lt;/td&gt;
&lt;td&gt;VRL (powerful)&lt;/td&gt;
&lt;td&gt;Ruby/Grok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plugin ecosystem&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Most K8s setups&lt;/td&gt;
&lt;td&gt;Complex routing&lt;/td&gt;
&lt;td&gt;Elastic stack&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Architecture I'd Deploy Today
&lt;/h2&gt;

&lt;p&gt;For a 20-50 node cluster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fluent Bit&lt;/strong&gt; as DaemonSet collector on every node (handles CRI-O, low overhead)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector&lt;/strong&gt; as a middle aggregator (deployed as Deployment, 2-3 replicas) for transforms, enrichment, routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki&lt;/strong&gt; as the primary store (much cheaper than Elasticsearch for log retention)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; for querying and alerting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This "fan-in" architecture means your Fluent Bit configs stay simple (just collect and forward to Vector), while Vector handles all the complex logic in one place. When you need to change a parser, you update one Vector config instead of a DaemonSet rollout.&lt;/p&gt;
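
&lt;p&gt;On the Vector side, the fan-in can be sketched as a &lt;code&gt;fluent&lt;/code&gt; source that the node-level Fluent Bit agents forward to. The listen address, port and source name below are assumptions; Fluent Bit's &lt;code&gt;forward&lt;/code&gt; output would point at the Service in front of these replicas:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[sources.from_fluent_bit]
type = "fluent"
address = "0.0.0.0:24224"   # assumed port; must match the Fluent Bit forward output

# All enrichment lives here, so the DaemonSet config never has to change.
[transforms.add_environment]
type = "remap"
inputs = ["from_fluent_bit"]
source = '''
  .cluster_name = get_env_var!("CLUSTER_NAME")
'''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;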

&lt;h2&gt;
  
  
  The Patterns That Actually Trip Teams Up
&lt;/h2&gt;

&lt;p&gt;I've compiled 50+ production-tested regex patterns and complete configs for every layer of this stack — CRI-O, containerd, kubelet, Nginx (with IPv6), Spring Boot, Go, Node.js — plus the multiline handling rules that prevent stacktrace mangling.&lt;/p&gt;

&lt;p&gt;If you're building or migrating a logging stack, the &lt;a href="https://riverbend36.gumroad.com/l/wjzpxh" rel="noopener noreferrer"&gt;Kubernetes Logging Architecture Guide&lt;/a&gt; covers this in depth with case studies from real migrations (Docker → containerd, ELK → Loki, Logstash → Vector).&lt;/p&gt;

&lt;p&gt;Also worth bookmarking: the &lt;a href="https://riverbend36.gumroad.com/l/dyrpv" rel="noopener noreferrer"&gt;Production Log Parsing Pack&lt;/a&gt; — 50+ copy-paste regex patterns for the formats listed above, tested across 50+ clusters.&lt;/p&gt;

&lt;p&gt;Questions? Drop them in the comments — happy to dig into specific edge cases.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Rivers — DevOps/SRE consultant specialising in observability stacks&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>logging</category>
      <category>sre</category>
    </item>
    <item>
      <title>Production Log Parsing Patterns: How I Fixed Logging in 50+ Kubernetes Clusters</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 20:01:53 +0000</pubDate>
      <link>https://dev.to/riverbend/production-log-parsing-patterns-how-i-fixed-logging-in-50-kubernetes-clusters-1jjk</link>
      <guid>https://dev.to/riverbend/production-log-parsing-patterns-how-i-fixed-logging-in-50-kubernetes-clusters-1jjk</guid>
      <description>&lt;p&gt;I spent the last 3 years building production observability stacks for Kubernetes clusters (50+ nodes), and I noticed a pattern: every team I worked with spent 2-4 weeks reinventing log parsing.&lt;/p&gt;

&lt;p&gt;The problem is concrete. You deploy a service, logs hit your aggregator (Elasticsearch, Loki, Datadog), and suddenly you're writing regex patterns for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRI-O container timestamps: &lt;code&gt;2024-01-15T10:23:41.123456789Z&lt;/code&gt; (most people miss the nanosecond precision and drop 30-40% of lines)&lt;/li&gt;
&lt;li&gt;Multiline stacktraces that merge incorrectly (the &lt;code&gt;[FP]&lt;/code&gt; flag in CRI-O that nobody documents)&lt;/li&gt;
&lt;li&gt;IPv6 addresses in access logs (&lt;code&gt;[2001:db8::1]&lt;/code&gt; breaks naive regex)&lt;/li&gt;
&lt;li&gt;Structured JSON with nested exceptions&lt;/li&gt;
&lt;li&gt;Application-specific formats (Spring Boot, Go, Node.js each log differently)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pack contains 50+ production-tested regex patterns and ready-to-use configurations for the entire stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log collectors&lt;/strong&gt;: Fluent Bit, Vector, Filebeat, Logstash (complete configs with performance tuning)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregators&lt;/strong&gt;: Elasticsearch, Loki, Datadog (grok patterns, parser pipelines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specific parsers&lt;/strong&gt;: CRI-O/containerd, Kubernetes kubelet, Nginx/Apache, PostgreSQL, MySQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gotchas I learned the hard way&lt;/strong&gt;: multiline handling, timezone normalisation, cardinality explosion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: The CRI-O pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^(?P&amp;lt;time&amp;gt;\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s+(?P&amp;lt;stream&amp;gt;stdout|stderr)\s+(?P&amp;lt;flags&amp;gt;[FP])\s+(?P&amp;lt;log&amp;gt;.*)$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most teams use a simpler pattern, lose the &lt;code&gt;[FP]&lt;/code&gt; flag info, and can't reconstruct partial lines correctly.&lt;/p&gt;
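
&lt;p&gt;To show where the pattern plugs in, here is a hedged sketch of a Vector remap transform that applies it to raw runtime log lines. It assumes the lines arrive unparsed in &lt;code&gt;.message&lt;/code&gt; from a hypothetical &lt;code&gt;raw_container_files&lt;/code&gt; source; Vector's own &lt;code&gt;kubernetes_logs&lt;/code&gt; source already does this parsing for you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[transforms.parse_cri_line]
type = "remap"
inputs = ["raw_container_files"]
source = '''
  parsed, err = parse_regex(.message, r'^(?P&amp;lt;time&amp;gt;\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)\s+(?P&amp;lt;stream&amp;gt;stdout|stderr)\s+(?P&amp;lt;flags&amp;gt;[FP])\s+(?P&amp;lt;log&amp;gt;.*)$')
  if err == null {
    # Adds time, stream, flags and log as top-level fields.
    . = merge(., parsed)
  } else {
    # Mark the event rather than losing track of lines that did not match.
    .cri_parse_failed = true
  }
'''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping &lt;code&gt;flags&lt;/code&gt; as its own field is what lets a later step decide whether an entry is a partial chunk or a complete line.&lt;/p&gt;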

&lt;p&gt;The pack is £9.50 as a downloadable reference guide (not a subscription). It's practical—copy the configs, adjust for your stack, done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://riverbend36.gumroad.com/l/dyrpv" rel="noopener noreferrer"&gt;https://riverbend36.gumroad.com/l/dyrpv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback welcome—I'm iterating on this based on what DevOps teams actually need.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Production Log Parsing Patterns: Battle-Tested Regex &amp; Configuration Examples</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 17:48:34 +0000</pubDate>
      <link>https://dev.to/riverbend/production-log-parsing-patterns-battle-tested-regex-configuration-examples-5e0a</link>
      <guid>https://dev.to/riverbend/production-log-parsing-patterns-battle-tested-regex-configuration-examples-5e0a</guid>
      <description>&lt;p&gt;I've spent the last few years debugging production logging pipelines across EKS, GKE, and AKS clusters, and the same failure modes keep appearing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IPv6 pod addresses break your IP regex&lt;/strong&gt; – Dual-stack Kubernetes clusters fail silently on the naive &lt;code&gt;(\d{1,3}\.){3}\d{1,3}&lt;/code&gt; pattern. A log line with &lt;code&gt;[::ffff:10.0.1.42]:8080&lt;/code&gt; gets skipped entirely. Your alerts never fire.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CRI-O fragments your Java stack traces&lt;/strong&gt; – The &lt;code&gt;P&lt;/code&gt; (partial) and &lt;code&gt;F&lt;/code&gt; (full) markers break log reassembly. A 30-line stack trace becomes 30 individual events, none actionable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timestamp skew breaks event ordering&lt;/strong&gt; – Kubernetes carries three timestamps (app, runtime, collector). Using the wrong one makes events appear out of order in Loki/Elasticsearch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buffer overflows silently drop long lines&lt;/strong&gt; – Fluent Bit defaults to 32KB per line. One large debug log or blob and the event vanishes. Most configs don't catch this.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each failure mode is validated against real production log lines. I've catalogued 50+ of these across Docker json-file, containerd, CRI-O, journald; across Vector, Fluent Bit, Logstash; across AWS EKS, GCP GKE, Azure AKS.&lt;/p&gt;

&lt;p&gt;I've packaged them into a reference guide: &lt;strong&gt;Production Log Parsing Pack&lt;/strong&gt; – 50+ copy-paste regex patterns and complete aggregator configs, each with the real log line that breaks the naive version and the production-safe replacement.&lt;/p&gt;

&lt;p&gt;The guide covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IPv4/IPv6 dual-stack patterns&lt;/li&gt;
&lt;li&gt;CRI-O multiline reassembly (Vector/Fluent Bit configs)&lt;/li&gt;
&lt;li&gt;Timestamp selection strategies&lt;/li&gt;
&lt;li&gt;Buffer limit tuning&lt;/li&gt;
&lt;li&gt;Common Logstash GROK patterns&lt;/li&gt;
&lt;li&gt;Error detection patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Available on Gumroad for £9.50.&lt;/p&gt;

&lt;p&gt;Happy to answer questions about specific stacks or edge cases in the comments.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>logging</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why Your Kubernetes Log Parsing Is Silently Dropping Events (And How to Fix It)</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 16:25:57 +0000</pubDate>
      <link>https://dev.to/riverbend/why-your-kubernetes-log-parsing-is-silently-dropping-events-and-how-to-fix-it-70o</link>
      <guid>https://dev.to/riverbend/why-your-kubernetes-log-parsing-is-silently-dropping-events-and-how-to-fix-it-70o</guid>
      <description>&lt;h1&gt;
  
  
  Why Your Kubernetes Log Parsing Is Silently Dropping Events (And How to Fix It)
&lt;/h1&gt;

&lt;p&gt;You're losing logs right now. Probably 5–15% of them. No errors, no alerts — they just vanish into the void. You'll find out in a postmortem.&lt;/p&gt;

&lt;p&gt;I've spent the last few years debugging production logging pipelines across EKS, GKE, and AKS clusters, and the same failure modes come up again and again. Here are the real culprits — with the regex patterns that actually fix them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Mode 1: IPv6 Pod Addresses Break Your IP Regex
&lt;/h2&gt;

&lt;p&gt;The naive IP pattern everyone copies from Stack Overflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(\d{1,3}\.){3}\d{1,3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works fine until your cluster runs dual-stack (IPv4 + IPv6), or you're on AWS EKS with VPC CNI in IPv6 mode. A log line like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2024-01-15T10:23:44Z pod/frontend-7d9f8b [::ffff:10.0.1.42]:8080 GET /api/health 200
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



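&lt;p&gt;...will not yield a usable client address. Anchored to the address field, the pattern matches nothing; unanchored, it picks out only the embedded &lt;code&gt;10.0.1.42&lt;/code&gt;, and for a native address such as &lt;code&gt;2001:db8::1&lt;/code&gt; it finds nothing at all. Your aggregator silently skips the IP extraction, the field is null, your alert never fires.&lt;/p&gt;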

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(?:::ffff:(?:[0-9]{1,3}\.){3}[0-9]{1,3}|(?:[0-9]{1,3}\.){3}[0-9]{1,3}|(?:[0-9a-fA-F]{1,4}:){2,7}[0-9a-fA-F]{1,4}|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ugly? Yes. Does it handle IPv4-mapped IPv6 (&lt;code&gt;::ffff:10.0.1.42&lt;/code&gt;)? Also yes.&lt;/p&gt;
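
&lt;p&gt;A hedged sketch of using a pattern of this shape in a Vector remap transform. It anchors the match on the brackets and port seen in the sample line, tries the IPv4-mapped alternative first so it is not cut short by the generic &lt;code&gt;::&lt;/code&gt; alternative, and flags events where nothing matched instead of leaving a silent null. The field names and the &lt;code&gt;kubernetes_logs&lt;/code&gt; input are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[transforms.extract_client_ip]
type = "remap"
inputs = ["kubernetes_logs"]
source = '''
  parsed, err = parse_regex(.message, r'\[(?P&amp;lt;client_ip&amp;gt;::ffff:(?:[0-9]{1,3}\.){3}[0-9]{1,3}|(?:[0-9a-fA-F]{1,4}:){2,7}[0-9a-fA-F]{1,4}|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4})\]:(?P&amp;lt;client_port&amp;gt;\d+)')
  if err == null {
    .client_ip = parsed.client_ip
    .client_port = parsed.client_port
  } else {
    # Make the miss visible so it can be counted and alerted on.
    .client_ip_parse_failed = true
  }
'''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anchoring matters here: used unanchored against a whole log line, the hex-and-colon alternatives will also match things like the &lt;code&gt;10:23:44&lt;/code&gt; in a timestamp.&lt;/p&gt;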




&lt;h2&gt;
  
  
  Failure Mode 2: CRI-O Fragments Your Java Stack Traces
&lt;/h2&gt;

&lt;p&gt;Docker's &lt;code&gt;json-file&lt;/code&gt; driver wraps each log line in a JSON object. Clean, predictable. But CRI runtimes, meaning CRI-O (the default on OpenShift) and containerd (the default on EKS and GKE), write a plain-text format instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2024-01-15T10:23:44.123456789Z stdout P java.lang.NullPointerException
2024-01-15T10:23:44.123456790Z stdout P     at com.example.Service.handle(Service.java:42)
2024-01-15T10:23:44.123456791Z stdout F     at com.example.Main.main(Main.java:10)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;P&lt;/code&gt; means "partial" (more of this message is coming). &lt;code&gt;F&lt;/code&gt; means "full" (this entry completes the message).&lt;/p&gt;

&lt;p&gt;If your Fluent Bit or Vector config doesn't handle the P/F markers, each line becomes a &lt;em&gt;separate log event&lt;/em&gt;. A 30-line Java stack trace becomes 30 individual entries, none of them actionable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector config to reassemble CRI-O multiline:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[transforms.reassemble_crio]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reduce"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"kubernetes_logs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;group_by&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"kubernetes.pod_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kubernetes.container_name"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;merge_strategies.message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"concat_newline"&lt;/span&gt;

&lt;span class="nn"&gt;[transforms.reassemble_crio.ends_when]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"vrl"&lt;/span&gt;
&lt;span class="py"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'''
match(.stream, r'stdout|stderr') &amp;amp;&amp;amp; match(.message, r'^(?:\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) (?:stdout|stderr) F ')
'''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure Mode 3: Timestamp Skew Breaks Event Ordering
&lt;/h2&gt;

&lt;p&gt;Kubernetes logs carry at least three timestamps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The timestamp the application wrote to stdout&lt;/li&gt;
&lt;li&gt;The timestamp the container runtime attached when buffering&lt;/li&gt;
&lt;li&gt;The timestamp Fluent Bit/Vector added when it read the file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When your pipeline uses the wrong one, events appear out of order in Elasticsearch/Loki. Queries for "what happened between 10:00 and 10:01" return incomplete results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern for Fluent Bit to prefer the application timestamp:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[PARSER]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;        &lt;span class="err"&gt;k8s_app_timestamp&lt;/span&gt;
    &lt;span class="err"&gt;Format&lt;/span&gt;      &lt;span class="err"&gt;regex&lt;/span&gt;
    &lt;span class="err"&gt;Regex&lt;/span&gt;       &lt;span class="err"&gt;^(?&amp;lt;app_time&amp;gt;\d{4}-\d{2}-\d{2}&lt;/span&gt;&lt;span class="nn"&gt;[T ]&lt;/span&gt;&lt;span class="err"&gt;\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|&lt;/span&gt;&lt;span class="nn"&gt;[+-]&lt;/span&gt;&lt;span class="err"&gt;\d{2}:?\d{2})?)&lt;/span&gt;&lt;span class="nn"&gt;[\s\t]&lt;/span&gt;&lt;span class="err"&gt;+(?&amp;lt;log&amp;gt;.*)$&lt;/span&gt;
    &lt;span class="err"&gt;Time_Key&lt;/span&gt;    &lt;span class="err"&gt;app_time&lt;/span&gt;
    &lt;span class="err"&gt;Time_Format&lt;/span&gt; &lt;span class="err"&gt;%Y-%m-%dT%H:%M:%S.%L%z&lt;/span&gt;
    &lt;span class="err"&gt;Time_Keep&lt;/span&gt;   &lt;span class="err"&gt;On&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



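&lt;p&gt;Setting &lt;code&gt;Time_Keep On&lt;/code&gt; keeps the parsed &lt;code&gt;app_time&lt;/code&gt; field in the record instead of removing it once it has been promoted to the event timestamp. That is useful when you also record the collection time and want to monitor lag between the two.&lt;/p&gt;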




&lt;h2&gt;
  
  
  Failure Mode 4: Buffer Overflows Silently Drop Long Lines
&lt;/h2&gt;

&lt;p&gt;Most log aggregators have a default line length limit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fluent Bit: 32KB&lt;/li&gt;
&lt;li&gt;Vector: bounded by the source's &lt;code&gt;max_line_bytes&lt;/code&gt; setting; longer lines are discarded&lt;/li&gt;
&lt;li&gt;Logstash: configurable, often 1MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A long stack trace, a large JSON blob, or a noisy debug log can exceed these limits. What happens? The line is &lt;strong&gt;silently truncated or dropped&lt;/strong&gt; depending on your config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Fluent Bit, set this explicitly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[SERVICE]&lt;/span&gt;
    &lt;span class="err"&gt;Flush&lt;/span&gt;         &lt;span class="err"&gt;1&lt;/span&gt;
    &lt;span class="err"&gt;Log_Level&lt;/span&gt;     &lt;span class="err"&gt;info&lt;/span&gt;
    &lt;span class="c"&gt;# Expose the built-in HTTP server for metrics and health checks
&lt;/span&gt;    &lt;span class="err"&gt;HTTP_Server&lt;/span&gt;   &lt;span class="err"&gt;On&lt;/span&gt;

&lt;span class="nn"&gt;[INPUT]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt;              &lt;span class="err"&gt;tail&lt;/span&gt;
    &lt;span class="err"&gt;Path&lt;/span&gt;              &lt;span class="err"&gt;/var/log/containers/*.log&lt;/span&gt;
    &lt;span class="err"&gt;Buffer_Chunk_Size&lt;/span&gt; &lt;span class="err"&gt;32k&lt;/span&gt;
    &lt;span class="err"&gt;Buffer_Max_Size&lt;/span&gt;   &lt;span class="err"&gt;256k&lt;/span&gt;
    &lt;span class="err"&gt;Skip_Long_Lines&lt;/span&gt;   &lt;span class="err"&gt;Off&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



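&lt;p&gt;&lt;code&gt;Skip_Long_Lines Off&lt;/code&gt; (the default) means Fluent Bit stops monitoring a file as soon as a line exceeds &lt;code&gt;Buffer_Max_Size&lt;/code&gt;, instead of skipping the line and carrying on. That failure is loud and much easier to debug than a silent skip, but the file stays unread until you raise the limit.&lt;/p&gt;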
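&lt;p&gt;Before raising limits, it helps to know how long your longest lines actually are. The sketch below is one way to check a sample of log lines against a byte limit. The function name and the 32KB default are assumptions for illustration, not part of Fluent Bit. It measures UTF-8 bytes rather than characters, because buffer limits are byte limits.&lt;/p&gt;

```python
def oversized_lines(lines, limit_bytes=32 * 1024):
    """Return (line_number, size_in_bytes) for every line above the byte limit."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Buffers are sized in bytes, so measure the encoded form, not len(line).
        size = len(line.encode("utf-8"))
        if size > limit_bytes:
            hits.append((number, size))
    return hits


# Hypothetical sample: the third line has 20,000 characters but 40,000 bytes.
sample = ["short line", "x" * 40000, "\u00e9" * 20000]
print(oversized_lines(sample))
```

&lt;p&gt;Running something like this over a day of raw container logs gives you a number to size &lt;code&gt;Buffer_Max_Size&lt;/code&gt; against, rather than guessing.&lt;/p&gt;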




&lt;h2&gt;
  
  
  The Broader Problem
&lt;/h2&gt;

&lt;p&gt;These are just four patterns. In production I've catalogued 50+ failure modes across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker json-file, containerd, CRI-O, and journald log formats&lt;/li&gt;
&lt;li&gt;Vector, Fluent Bit, Logstash, and rsyslog aggregators
&lt;/li&gt;
&lt;li&gt;AWS EKS, GCP GKE, Azure AKS, and self-hosted Kubernetes&lt;/li&gt;
&lt;li&gt;Mixed runtime clusters (common during upgrades)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building or maintaining a production logging pipeline, I've packaged all of these into a &lt;strong&gt;&lt;a href="https://riverbend36.gumroad.com/l/dyrpv" rel="noopener noreferrer"&gt;Production Log Parsing Pack&lt;/a&gt;&lt;/strong&gt; — 50+ copy-paste regex patterns and complete aggregator configs, each with the real log line that breaks the naive version and the production-safe replacement.&lt;/p&gt;

&lt;p&gt;It's the reference I wish I'd had when I started. Available on Gumroad for £9.50 (~$12).&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Diagnostic: Are You Losing Logs?
&lt;/h2&gt;

&lt;p&gt;Run this against your logging pipeline to check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Emit a unique marker from the app, then check that it reaches your backend&lt;/span&gt;
&lt;span class="c"&gt;# kubectl exec output is not written to the container log, so send it to PID 1's stdout&lt;/span&gt;
&lt;span class="c"&gt;# (assumes the exec user is allowed to write to /proc/1/fd/1)&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; myapp deploy/frontend &lt;span class="nt"&gt;--&lt;/span&gt; sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"echo 'test-marker-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;' &amp;amp;gt; /proc/1/fd/1"&lt;/span&gt;

&lt;span class="c"&gt;# In Loki/Elasticsearch: search for 'test-marker' within the next 30 seconds&lt;/span&gt;
&lt;span class="c"&gt;# If it doesn't appear, you have a silent drop somewhere in the pipeline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the test marker never shows up, work backwards through your pipeline stages. In my experience it's almost always a regex parse failure causing the event to be filtered before indexing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about your specific stack? Drop them in the comments. Happy to help debug specific CRI-O or containerd configs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Parsing Kubernetes Logs Is Harder Than It Looks (And How to Fix It)</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 15:50:46 +0000</pubDate>
      <link>https://dev.to/riverbend/why-parsing-kubernetes-logs-is-harder-than-it-looks-and-how-to-fix-it-13fl</link>
      <guid>https://dev.to/riverbend/why-parsing-kubernetes-logs-is-harder-than-it-looks-and-how-to-fix-it-13fl</guid>
      <description>&lt;h1&gt;
  
  
  Why Parsing Kubernetes Logs Is Harder Than It Looks (And How to Fix It)
&lt;/h1&gt;

&lt;p&gt;If you've ever stared at a Kubernetes log aggregation pipeline wondering why you're losing 5–15% of your log volume with zero error messages, you're not alone. Log parsing failures are one of the most insidious problems in production infrastructure — they fail silently, they only surface during incidents, and by then it's too late.&lt;/p&gt;

&lt;p&gt;I've managed large Kubernetes clusters (100–500 nodes) across AWS, GCP, and Azure. Over the years, I've catalogued the specific regex patterns and configuration edge cases that cause silent log loss. Here's what actually breaks in production — and how to fix it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Most Common Silent Log Failures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. IPv6 Pod Addresses Break Naive IP Regexes
&lt;/h3&gt;

&lt;p&gt;This is the most common one. A typical IP address regex looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(\d{1,3}\.){3}\d{1,3}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine for an IPv4 world. But Kubernetes pods increasingly get IPv6 addresses, especially in dual-stack clusters (dual-stack networking has been enabled by default since Kubernetes 1.21 and stable since 1.23). An IPv6 address like &lt;code&gt;2001:db8::1&lt;/code&gt; doesn't match this pattern at all.&lt;/p&gt;

&lt;p&gt;The result? Your log parser silently skips the entire log line. No warning. No counter. No alert. You only discover this during a postmortem when you can't find the log entry you know was generated.&lt;/p&gt;

&lt;p&gt;A production-safe IP regex that handles both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(?:(?:\d{1,3}\.){3}\d{1,3}|(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,6}(?::[0-9a-fA-F]{1,4}){1,6}|(?:[0-9a-fA-F]{1,4}:){1,7}:|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, it's ugly. That's production infrastructure for you.&lt;/p&gt;
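&lt;p&gt;Whatever pattern you settle on, test it against the address forms you expect before it goes anywhere near a cluster. A minimal sketch of such a check follows, using a dual-stack pattern written out for this example (an assumption; substitute your own). The helper name &lt;code&gt;extract_ip&lt;/code&gt; and the sample lines are hypothetical. Note the last case: a pattern that accepts bare colon-separated groups will happily "find" an address inside a clock time.&lt;/p&gt;

```python
import re

# Illustrative dual-stack pattern; it is not a strict RFC 4291 validator.
IP_PATTERN = (
    r"(?:(?:\d{1,3}\.){3}\d{1,3}"                                # IPv4
    r"|(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}"                 # IPv6, all 8 groups
    r"|(?:[0-9a-fA-F]{1,4}:){1,6}(?::[0-9a-fA-F]{1,4}){1,6}"     # '::' in the middle
    r"|(?:[0-9a-fA-F]{1,4}:){1,7}:"                              # trailing '::'
    r"|::(?:[0-9a-fA-F]{1,4}:){0,6}[0-9a-fA-F]{1,4})"            # leading '::'
)
IP_RE = re.compile(IP_PATTERN)


def extract_ip(line):
    """Return the first address-like token in the line, or None."""
    match = IP_RE.search(line)
    return match.group(0) if match else None


for sample in (
    "10.0.0.7 GET /healthz",
    "src=2001:db8::1 status=200",
    "peer fe80::1 closed",
    "at 10:23:45 started",
):
    print(repr(sample), extract_ip(sample))
```

&lt;p&gt;Alternation order matters here: regex engines take the first branch that matches, so the branch for a trailing &lt;code&gt;::&lt;/code&gt; has to come after the branch that allows groups on both sides of it.&lt;/p&gt;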




&lt;h3&gt;
  
  
  2. CRI-O Partial Line Markers Fragment Stack Traces
&lt;/h3&gt;

&lt;p&gt;Docker's json-file log driver wraps every line in a JSON envelope:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"log"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Exception in thread main&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"stderr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15T10:23:45Z"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CRI-O (the default runtime in OpenShift, and increasingly in vanilla Kubernetes) does something different. It adds a partial/final flag to each line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2024-01-15T10:23:45.123456789Z stderr P Exception in thread main
2024-01-15T10:23:45.123456790Z stderr P   at com.example.App.main(App.java:42)
2024-01-15T10:23:45.123456791Z stderr F   at java.base/java.lang.Thread.run(Thread.java:834)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;P&lt;/code&gt; tag means "partial" (more of the same message follows) and &lt;code&gt;F&lt;/code&gt; means "full" (this entry completes the message). If your log parser doesn't handle the &lt;code&gt;P&lt;/code&gt;/&lt;code&gt;F&lt;/code&gt; tags, a 30-line Java stack trace becomes 30 separate log entries. Your error-rate alerting now fires 30 times per exception. Or worse, you aggregate by "first line" and lose the actual cause entirely.&lt;/p&gt;

&lt;p&gt;The correct Fluent Bit configuration to handle CRI-O multiline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[FILTER]&lt;/span&gt;
    &lt;span class="err"&gt;Name&lt;/span&gt; &lt;span class="err"&gt;multiline&lt;/span&gt;
    &lt;span class="err"&gt;match&lt;/span&gt; &lt;span class="err"&gt;*&lt;/span&gt;
    &lt;span class="err"&gt;multiline.key_content&lt;/span&gt; &lt;span class="err"&gt;log&lt;/span&gt;
    &lt;span class="err"&gt;multiline.parser&lt;/span&gt; &lt;span class="err"&gt;cri&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



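&lt;p&gt;But you also need to handle the case where nodes run different runtimes (Docker on some, CRI-O on others) during a cluster upgrade. That requires a parser chain that tries both formats; on Fluent Bit's &lt;code&gt;tail&lt;/code&gt; input that is &lt;code&gt;multiline.parser docker, cri&lt;/code&gt;. Most tutorials skip this entirely.&lt;/p&gt;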
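&lt;p&gt;To make the reassembly logic concrete, here is a minimal sketch of what a collector has to do with the tags. It assumes the line layout shown above (timestamp, stream, tag, content) and that &lt;code&gt;P&lt;/code&gt; entries are fragments to be glued onto the next &lt;code&gt;F&lt;/code&gt; entry of the same stream. The function name, the &lt;code&gt;sep&lt;/code&gt; parameter, and the sample lines are hypothetical; this is not how any particular collector implements it.&lt;/p&gt;

```python
import re

# Groups: timestamp, stream, P/F tag, content
CRI_LINE = re.compile(r"^(\S+) (stdout|stderr) ([PF]) (.*)$")


def merge_partials(lines, sep=""):
    """Glue P-tagged chunks onto the F-tagged chunk that closes them, per stream."""
    pending = {}  # stream name -> chunks collected so far
    merged = []
    for line in lines:
        match = CRI_LINE.match(line)
        if match is None:
            # Keep unparseable lines visible instead of dropping them silently.
            merged.append(("unparsed", line))
            continue
        _timestamp, stream, tag, text = match.groups()
        pending.setdefault(stream, []).append(text)
        if tag == "F":
            merged.append((stream, sep.join(pending.pop(stream))))
    return merged


lines = [
    "2024-01-15T10:23:45.1Z stderr P Exception in ",
    "2024-01-15T10:23:45.2Z stdout F ready",
    "2024-01-15T10:23:45.3Z stderr F thread main",
]
print(merge_partials(lines))
```

&lt;p&gt;State is kept per stream so that interleaved stdout and stderr output doesn't get spliced together. A real collector also needs a flush timeout, since a &lt;code&gt;P&lt;/code&gt; chunk whose closing &lt;code&gt;F&lt;/code&gt; never arrives would otherwise sit in memory forever.&lt;/p&gt;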




&lt;h3&gt;
  
  
  3. Containerd vs Docker Timestamp Formats
&lt;/h3&gt;

&lt;p&gt;Docker uses RFC3339Nano with a trailing &lt;code&gt;Z&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-01-15T10:23:45.123456789Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Containerd uses the same format but without fractional seconds in some configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-01-15T10:23:45Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're parsing timestamps to correlate logs across services (which you are, for distributed tracing), this precision difference can cause events to appear out of order by up to 999ms. In a high-throughput service, that means your "what happened first" analysis during an incident is wrong.&lt;/p&gt;

&lt;p&gt;The fix: always parse timestamps permissively and normalize to nanosecond precision at ingestion time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pad the nanosecond component to 9 digits on the right before storing.&lt;/p&gt;
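&lt;p&gt;As a sketch of that normalization step, the function below applies the regex above and pads (or truncates) the fractional part to nine digits. The function name is hypothetical, and rejecting unparseable input with an exception is an assumption; in a pipeline you might prefer to tag the event and pass it through.&lt;/p&gt;

```python
import re

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})")


def normalize_timestamp(raw):
    """Return the timestamp with exactly nine fractional digits."""
    match = TS_RE.fullmatch(raw)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {raw!r}")
    seconds, fraction, zone = match.groups()
    digits = (fraction or "")[1:]       # drop the leading dot, if there is one
    digits = digits[:9].ljust(9, "0")   # truncate past nanoseconds, pad on the right
    return f"{seconds}.{digits}{zone}"


print(normalize_timestamp("2024-01-15T10:23:45Z"))
print(normalize_timestamp("2024-01-15T10:23:45.5+02:00"))
```

&lt;p&gt;Once every timestamp has the same width, values in the same time zone sort correctly as plain strings, which keeps ordering cheap downstream.&lt;/p&gt;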




&lt;h2&gt;
  
  
  The Bigger Pattern: Why Defaults Fail
&lt;/h2&gt;

&lt;p&gt;The core problem is that all major log parsing documentation — Vector, Fluent Bit, Logstash, Fluentd — is written against simple, well-formatted log examples. Real production logs contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mixed runtimes (some pods on containerd, some on Docker, some on CRI-O during upgrades)&lt;/li&gt;
&lt;li&gt;Multi-line messages from JVM, Python, Go, Node, Rust — each with different stack trace formats&lt;/li&gt;
&lt;li&gt;Log lines longer than 16KB, the size at which container runtimes split a line by default (common with verbose JSON payloads)&lt;/li&gt;
&lt;li&gt;Unicode in log messages that break byte-count assumptions&lt;/li&gt;
&lt;li&gt;Logs from Windows containers mixed with Linux containers in hybrid clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are edge cases. They're the norm in any cluster running more than a handful of services.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Works in Production
&lt;/h2&gt;

&lt;p&gt;After years of collecting these patterns, I've assembled a reference pack of 50+ battle-tested regex patterns and configuration templates covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All four major runtimes&lt;/strong&gt;: Docker json-file, containerd, CRI-O, journald&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four log aggregators&lt;/strong&gt;: Vector, Fluent Bit, Logstash, rsyslog — with copy-paste config blocks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-line reassembly&lt;/strong&gt; for Java (log4j/logback), Python (traceback), Go (panic), Node.js&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv4/IPv6 dual-stack patterns&lt;/strong&gt; that don't fail silently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp normalization&lt;/strong&gt; across all format variants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer overflow handling&lt;/strong&gt; for large log lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pattern comes with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A real log line that breaks the naive version&lt;/li&gt;
&lt;li&gt;An explanation of &lt;em&gt;why&lt;/em&gt; it breaks&lt;/li&gt;
&lt;li&gt;The production-safe replacement&lt;/li&gt;
&lt;li&gt;Tested config blocks for Vector and Fluent Bit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pack is available at: &lt;strong&gt;&lt;a href="https://buy.stripe.com/28E9ALaiJ8WkgPGePO4ko01" rel="noopener noreferrer"&gt;Production Log Parsing Pack — £9.50&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Diagnostic: Is Your Pipeline Losing Logs?
&lt;/h2&gt;

&lt;p&gt;Here's a 5-minute test you can run right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List dropped records per output plugin (needs HTTP_Server On)&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; logging fluent-bit-xxxxx &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:2020/api/v1/metrics | &lt;span class="se"&gt;\&lt;/span&gt;
  jq &lt;span class="s1"&gt;'.output | to_entries[] | {name: .key, dropped: .value.dropped_records}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



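&lt;p&gt;If &lt;code&gt;dropped_records&lt;/code&gt; is non-zero, your outputs are discarding records and you have silent log loss. A zero only clears the output stage; lines lost earlier in the pipeline are not counted here, so still check for the patterns above.&lt;/p&gt;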

&lt;p&gt;For Vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vector top &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:8686
&lt;span class="c"&gt;# Look for "dropped" in the component metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Silent log loss in Kubernetes comes from three main sources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IPv6 pod IPs&lt;/td&gt;
&lt;td&gt;Log lines with IPv6 addresses silently dropped&lt;/td&gt;
&lt;td&gt;Dual-stack IP regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRI-O P/F markers&lt;/td&gt;
&lt;td&gt;Stack traces fragmented into N separate entries&lt;/td&gt;
&lt;td&gt;Multiline CRI parser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timestamp precision&lt;/td&gt;
&lt;td&gt;Events appear out of order in distributed traces&lt;/td&gt;
&lt;td&gt;Permissive timestamp regex + normalization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't exotic edge cases. If you're running Kubernetes in production at any meaningful scale, you've almost certainly already hit at least one of these.&lt;/p&gt;

&lt;p&gt;If you're fighting log parsing issues beyond these three, feel free to drop a comment — I've probably seen it. And if you want the full reference pack with all 50+ patterns and copy-paste configs, it's at the link above.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Rivers writes about infrastructure reliability, observability, and the gap between documentation and production reality.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Production Log Parsing Patterns That Break Real Kubernetes Clusters (and How to Fix Them)</title>
      <dc:creator>James Rivers</dc:creator>
      <pubDate>Mon, 18 May 2026 14:23:56 +0000</pubDate>
      <link>https://dev.to/riverbend/production-log-parsing-patterns-that-break-real-kubernetes-clusters-and-how-to-fix-them-3mkc</link>
      <guid>https://dev.to/riverbend/production-log-parsing-patterns-that-break-real-kubernetes-clusters-and-how-to-fix-them-3mkc</guid>
      <description>&lt;h1&gt;
  
  
  Production Log Parsing Patterns That Break Real Kubernetes Clusters
&lt;/h1&gt;

&lt;p&gt;After three years managing 500+ node Kubernetes clusters across AWS EKS, GCP GKE, and Azure AKS, I've found one consistent truth: &lt;strong&gt;silent log loss is costing teams thousands in incident resolution time every year&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The logs are being written. Your containers are logging. Your cluster is capturing everything. But somewhere between the container runtime and your log aggregation pipeline, 5–8% of logs simply vanish — and you don't know it until an incident postmortem reveals a 45-minute gap in your timeline.&lt;/p&gt;

&lt;p&gt;The problem isn't that log parsing is hard. It's that the edge cases are completely non-obvious until they bite you in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Case 1: IPv6 Pod Addresses
&lt;/h2&gt;

&lt;p&gt;Your cluster has IPv6 enabled. A pod address appears in a log line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2025-01-15T10:23:45Z fe80::1 - 500 error connecting to database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This silently fails. No error. No warning. Just missing data. The log line gets stored without the address extracted. Six months later, an incident postmortem shows that none of your IPv6 pod logs can be found by address.&lt;/p&gt;

&lt;p&gt;IPv6 addresses use hex notation with colons. Link-local addresses (&lt;code&gt;fe80::1&lt;/code&gt;), compressed notation (&lt;code&gt;2001:db8::1&lt;/code&gt;), and IPv4-mapped addresses (&lt;code&gt;::ffff:192.168.1.1&lt;/code&gt;) all break naive IPv4 patterns.&lt;/p&gt;

&lt;p&gt;The pattern that actually handles all variants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(?:(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}|(?:[a-fA-F0-9]{1,4}:){1,6}(?::[a-fA-F0-9]{1,4}){1,6}|(?:[a-fA-F0-9]{1,4}:){1,7}:|::(?:ffff:)?(?:\d{1,3}\.){3}\d{1,3}|::(?:[a-fA-F0-9]{1,4}:){0,6}[a-fA-F0-9]{1,4})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
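&lt;p&gt;If your pipeline has a scripting stage, an alternative to maintaining a regex like this is to let a real address parser make the decision. The sketch below splits a line into tokens and keeps whatever Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module accepts. The function name and the choice of delimiters are assumptions for illustration; adjust the split to your own log format.&lt;/p&gt;

```python
import ipaddress
import re

# Delimiters that commonly surround addresses in log lines (an assumption).
DELIMITERS = re.compile(r"[\s,;=\[\]]+")


def find_addresses(line):
    """Return every token in the line that parses as an IPv4 or IPv6 address."""
    found = []
    for token in DELIMITERS.split(line):
        try:
            found.append(ipaddress.ip_address(token))
        except ValueError:
            continue  # not an address; timestamps and status codes land here
    return found


line = "2025-01-15T10:23:45Z fe80::1 - 500 error connecting to database"
for address in find_addresses(line):
    print(address, "IPv" + str(address.version), "link-local:", address.is_link_local)
```

&lt;p&gt;This trades some speed for correctness: link-local, compressed, and IPv4-mapped forms are all handled by the library, and the timestamp and the &lt;code&gt;500&lt;/code&gt; status code are rejected rather than half-matched.&lt;/p&gt;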



&lt;h2&gt;
  
  
  Edge Case 2: CRI-O Partial Line Markers
&lt;/h2&gt;

&lt;p&gt;You're running CRI-O as your container runtime. A long Java exception gets logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;2025-01-15T10:23:45.123456789Z stderr P java.lang.NullPointerException
2025-01-15T10:23:45.234567890Z stderr P   at com.example.Service.process()
2025-01-15T10:23:45.345678901Z stderr F   at com.example.Main.run()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;P&lt;/code&gt; = "partial line continues", &lt;code&gt;F&lt;/code&gt; = "final line". Naive parsing treats each as a separate event. Your 4KB Java stack trace becomes 8 disconnected log entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real impact&lt;/strong&gt;: Error correlation breaks. Your aggregator sees 8 "errors" instead of 1 exception. Alerts fire incorrectly. Root cause analysis becomes impossible because the context is fragmented.&lt;/p&gt;

&lt;p&gt;The fix requires stateful line reassembly keyed on container ID + stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fluent Bit multiline config for CRI-O&lt;/span&gt;
&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;FILTER&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="s"&gt;Name                  multiline&lt;/span&gt;
    &lt;span class="s"&gt;match                 kube.*&lt;/span&gt;
    &lt;span class="s"&gt;multiline.key_content log&lt;/span&gt;
    &lt;span class="s"&gt;multiline.parser      cri&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that's just Fluent Bit. Vector, Logstash, and rsyslog each need different reassembly configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Case 3: Multi-line Stack Trace Reassembly
&lt;/h2&gt;

&lt;p&gt;Python tracebacks, Go panics, Node.js stacks — they all span multiple lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;DatabaseError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DatabaseError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without reassembly, each line becomes a separate "error" log. You have 6 entries but no idea they belong together. Your monitoring counts 6 errors instead of 1. Alert thresholds become meaningless.&lt;/p&gt;

&lt;p&gt;The correct multiline pattern for Python tracebacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Start pattern: the traceback header line
start_state: /^Traceback \(most recent call last\):/
# Continue pattern: indented frame lines, plus the closing "SomeError: message" line
cont_state: /^(\s+|[A-Za-z_][\w.]*(Error|Exception)\b)/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
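&lt;p&gt;The state machine behind those two patterns is small enough to sketch in a few lines. This version assumes the standard CPython traceback layout shown above: a header line, indented frame lines, then one unindented summary line. The function name is hypothetical, and chained exceptions ("During handling of the above exception...") are deliberately not handled.&lt;/p&gt;

```python
HEADER = "Traceback (most recent call last):"


def group_tracebacks(lines):
    """Fold each traceback (header, indented frames, summary line) into one event."""
    events = []
    in_traceback = False
    for line in lines:
        if line.startswith(HEADER):
            events.append([line])
            in_traceback = True
        elif in_traceback and line[:1].isspace():
            events[-1].append(line)   # indented frame or source line
        elif in_traceback:
            events[-1].append(line)   # first unindented line: the exception summary
            in_traceback = False
        else:
            events.append([line])     # ordinary single-line event
    return ["\n".join(event) for event in events]


raw = [
    "INFO starting request",
    "Traceback (most recent call last):",
    '  File "app.py", line 42, in process',
    "    result = database.query(sql)",
    "DatabaseError: Connection timeout",
    "INFO request finished",
]
for event in group_tracebacks(raw):
    print(repr(event))
```

&lt;p&gt;The detail that is easy to miss is the summary line: it is not indented, so a rule that only continues on whitespace cuts off the one line that names the exception.&lt;/p&gt;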



&lt;h2&gt;
  
  
  Edge Case 4: Timestamp Drift in Large Clusters
&lt;/h2&gt;

&lt;p&gt;You have 200 nodes. NTP drift of 150–300ms means logs arrive out of order at your aggregator. Your system sorts by timestamp — and now the sequence of events is scrambled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real impact&lt;/strong&gt;: Event correlation fails. A database connection error appears &lt;em&gt;after&lt;/em&gt; the service restart in your logs, even though it caused the restart. The incident timeline is wrong. Root cause analysis points at the wrong service.&lt;/p&gt;

&lt;p&gt;The fix: use log ingestion time as a secondary sort key when event timestamp drift exceeds your aggregation window.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Real Production Config Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a Vector config snippet that parses the CRI-O markers and merges partial lines back into a single event per pod, container, and stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[sources.kubernetes_logs]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"kubernetes_logs"&lt;/span&gt;

&lt;span class="nn"&gt;[transforms.parse_crio]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"remap"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"kubernetes_logs"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'''
# Parse CRI-O format: &amp;lt;timestamp&amp;gt; &amp;lt;stream&amp;gt; &amp;lt;flags&amp;gt; &amp;lt;log&amp;gt;
# Merge the captures into the event; a plain "=" would discard the kubernetes.* fields that group_by needs
. |= parse_regex!(.message, r'^(?P&amp;lt;ts&amp;gt;\S+) (?P&amp;lt;stream&amp;gt;stdout|stderr) (?P&amp;lt;flags&amp;gt;[PF]) (?P&amp;lt;log&amp;gt;.*)$')
.partial = .flags == "P"
'''&lt;/span&gt;

&lt;span class="nn"&gt;[transforms.merge_multiline]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reduce"&lt;/span&gt;
&lt;span class="py"&gt;inputs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"parse_crio"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;group_by&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"kubernetes.pod_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"kubernetes.container_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"stream"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;merge_strategies.log&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"concat_newline"&lt;/span&gt;
&lt;span class="py"&gt;ends_when&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'.partial == false'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the more complex patterns. There are 50+ covering Docker json-file, containerd, journald, Nginx, Apache, HAProxy, Envoy, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Collection
&lt;/h2&gt;

&lt;p&gt;I packaged everything I've learned from production failures into a reference pack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50+ regex patterns&lt;/strong&gt; covering all major Kubernetes log formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool configs&lt;/strong&gt; for Vector, Fluent Bit, Logstash, rsyslog — copy-paste ready&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge case test lines&lt;/strong&gt; — real log lines that break naive parsers so you can validate before deploying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explanation of why each edge case exists&lt;/strong&gt; — not just "use this regex," but why the format is this way and what happens when you get it wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://buy.stripe.com/28E9ALaiJ8WkgPGePO4ko01" rel="noopener noreferrer"&gt;Production Log Parsing Pack — £9.50 one-time&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each pattern comes with: the regex, the tool config, example log lines that pass, example log lines that break naive versions, and an explanation of the edge case. No subscriptions. All future updates included.&lt;/p&gt;




&lt;p&gt;If you're managing production Kubernetes clusters with hand-written log parsing regex, drop a comment — curious how others are handling the CRI-O partial line problem specifically. I've seen teams solve it 4 different ways, each with different tradeoffs.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>logging</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
