Parse Complex Logs at Collection Time with CLS LogListener Pipelines

#devops #observability #cloud #ai

Complex production logs rarely arrive in one clean format. A single line may start with a timestamp, continue with delimiter-separated fields, embed a JSON object, and end with extra wrapper text. If the collector can only apply one parser, the downstream log platform receives either a lossy structure or a raw string that operators must clean later.

The source article presents Tencent Cloud CLS LogListener composite parsing as a collector-side pipeline for this situation. LogListener can run one or more processors in sequence, so a log line can be split, decoded, filtered, enriched, and reassembled before it is uploaded to CLS.

When composite parsing is useful

The source article describes three scenarios where composite parsing applies:

Scenario	What happens in the log line	What the pipeline does
Multiple parsing modes are needed	One part is delimiter-separated, while another part is JSON or key-value text	Apply different processors to different fields after the first split
Some fields need post-processing	Parsed fields include values that should be dropped, renamed, or supplemented	Run processors such as field dropping or metadata extraction
Both patterns appear together	A line needs multiple extraction steps and field-level transformation	Chain processors in order inside the LogListener configuration

The flow is simple: split the original log into segments, process each segment with the right processor, and output only the fields that should become structured log content.
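As a rough mental model only (this is not LogListener code), the flow can be sketched in Python as a list of functions applied in order to a dictionary of fields. The helpers `split_step`, `drop_step`, and `run_pipeline`, and the placeholder field name `raw`, are hypothetical names invented for this illustration.

```python
from typing import Callable, Dict, List

Fields = Dict[str, str]
Step = Callable[[Fields], Fields]


def split_step(source_key: str, delimiter: str, keys: List[str]) -> Step:
    """Hypothetical step: split one field into named segments, removing the source."""
    def step(fields: Fields) -> Fields:
        out = dict(fields)
        parts = out.pop(source_key).split(delimiter)
        out.update(zip(keys, parts))
        return out
    return step


def drop_step(*keys: str) -> Step:
    """Hypothetical step: remove the named fields."""
    def step(fields: Fields) -> Fields:
        return {k: v for k, v in fields.items() if k not in keys}
    return step


def run_pipeline(raw_line: str, steps: List[Step]) -> Fields:
    """Apply each step in list order, starting from a single raw field."""
    fields: Fields = {"raw": raw_line}
    for step in steps:
        fields = step(fields)
    return fields


result = run_pipeline(
    "a,b,c",
    [
        split_step("raw", ",", ["first", "second", "third"]),
        drop_step("second"),
    ],
)
print(result)
```

Each step only sees the fields produced by the steps before it, which is why the order of the list matters.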

Processor map from the source article

The source article's screenshot lists the processors available for this workflow:

Function	Processor	Use described in the source
Extract fields	`processor_log_string`	Multi-character line parsing, usually for advanced single-line logs
Extract fields	`processor_multiline`	First-line regex parsing for multiline logs
Extract fields	`processor_multiline_fullregex`	First-line regex plus full multiline regex extraction
Extract fields	`processor_fullregex`	Regex extraction for a single-line field
Extract fields	`processor_json`	Expand a field value as JSON
Extract fields	`processor_split_delimiter`	Split fields by one or more delimiter characters
Extract fields	`processor_split_key_value`	Extract key-value pairs
Process fields	`processor_drop`	Drop selected fields
Process fields	`processor_timeformat`	Parse a source time field, convert the time format, and set the log timestamp

Pattern 1: drop fields before upload

If a raw log contains three key-value pairs but only key2 is useful, the source article uses processor_drop to remove key1 and key3.

Input:

key1:value1
key2:value2
key3:value3

LogListener configuration:

{
  "processors": [
    {
      "type": "processor_drop",
      "detail": {
        "SourceKey": ["key1", "key3"]
      }
    }
  ]
}

Output:

key2:value2

Dropping fields at collection time is a low-effort optimization: fields that do not need to be indexed or analyzed never leave the host, which reduces payload size and storage cost.
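To make the effect concrete, here is a minimal Python sketch that models what the drop step does to the parsed fields and compares a rough size before and after. The helper `drop_fields` is a hypothetical name, and the JSON-encoded byte count is only a stand-in for payload size; it is not the format LogListener uses on the wire.

```python
import json


def drop_fields(fields, keys_to_drop):
    """Return a copy of fields without the listed keys (models the effect of processor_drop)."""
    unwanted = set(keys_to_drop)
    return {k: v for k, v in fields.items() if k not in unwanted}


parsed = {"key1": "value1", "key2": "value2", "key3": "value3"}
kept = drop_fields(parsed, ["key1", "key3"])

# Rough proxy for payload size: bytes of the JSON-encoded fields.
size_before = len(json.dumps(parsed).encode("utf-8"))
size_after = len(json.dumps(kept).encode("utf-8"))

print(kept)
print(size_before, size_after)
```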

Pattern 2: enrich logs from metadata

The article also shows a metadata-enrichment case. A log body such as value1,value2 is collected from a file, and the collector extracts the application name, version, log directory, and log file name from that file's path. The source notes that meta_processor requires LogListener 2.7.4 or later.

Input:

value1,value2

Path:

/usr/local/loglistener-2.7.4/testdir/test.log

Configuration shape:

{
  "processors": [
    {
      "type": "processor_split_delimiter",
      "detail": {
        "Delimiter": ",",
        "ExtractKeys": ["msg1", "msg2"]
      }
    },
    {
      "type": "meta_processor",
      "detail": {
        "ExtractKeys": ["FILENAME"]
      },
      "processors": [
        {
          "type": "processor_fullregex",
          "detail": {
            "KeepSource": false,
            "SourceKey": "FILENAME",
            "ExtractRegex": "/\\w+/\\w+/(\\w+)-([^/]+)/(\\w+)/(\\w+).*",
            "ExtractKeys": ["app", "ver", "logdir", "logname"]
          }
        }
      ]
    }
  ]
}

Output fields shown by the source:

msg1:value1
msg2:value2
__TAG__.app: loglistener
__TAG__.ver: 2.7.4
__TAG__.logname: test
__TAG__.logdir: testdir
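The path regex can be checked outside the collector. The sketch below runs the same pattern (with the JSON escaping removed) through Python's `re` module. It assumes Python's regex syntax behaves the same as LogListener's for this pattern, which the source does not state, and it does not model the `__TAG__.` prefix that appears in the output above.

```python
import re

PATH = "/usr/local/loglistener-2.7.4/testdir/test.log"

# Same pattern as ExtractRegex in the configuration, without JSON string escaping.
PATTERN = r"/\w+/\w+/(\w+)-([^/]+)/(\w+)/(\w+).*"
KEYS = ["app", "ver", "logdir", "logname"]

match = re.fullmatch(PATTERN, PATH)

# Pair each capture group with its key, in the order listed in ExtractKeys.
tags = dict(zip(KEYS, match.groups())) if match else {}
print(tags)
```

The capture groups are positional, so the order of ExtractKeys must follow the order of the groups in the regex.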

Pattern 3: parse nested fields with child processors

The custom parsing example starts with one comma-separated line:

1571394459,http://127.0.0.1/my/course/4|10.135.46.111|200,status:DEAD,

The pipeline first splits the line into time, msg1, and msg2. Child processors then convert the Unix timestamp, split msg1 by |, and parse msg2 as key-value content.

{
  "processors": [
    {
      "type": "processor_split_delimiter",
      "detail": {
        "Delimiter": ",",
        "ExtractKeys": ["time", "msg1", "msg2"]
      },
      "processors": [
        {
          "type": "processor_timeformat",
          "detail": {
            "KeepSource": true,
            "TimeFormat": "%s",
            "SourceKey": "time"
          }
        },
        {
          "type": "processor_split_delimiter",
          "detail": {
            "KeepSource": false,
            "Delimiter": "|",
            "SourceKey": "msg1",
            "ExtractKeys": ["submsg1", "submsg2", "submsg3"]
          }
        },
        {
          "type": "processor_split_key_value",
          "detail": {
            "KeepSource": false,
            "Delimiter": ":",
            "SourceKey": "msg2"
          }
        }
      ]
    }
  ]
}

Output:

time: 1571394459
submsg1: http://127.0.0.1/my/course/4
submsg2: 10.135.46.111
submsg3: 200
status: DEAD
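The same sequence of steps can be traced in plain Python. This is a sketch that mirrors the configuration above; it is not how LogListener is implemented. It assumes the empty segment after the trailing comma is ignored, and the variable `log_time` is a hypothetical stand-in for the log timestamp that the time-format step sets.

```python
from datetime import datetime, timezone

LINE = "1571394459,http://127.0.0.1/my/course/4|10.135.46.111|200,status:DEAD,"

# Step 1: first split by "," into time, msg1, msg2.
# The trailing comma leaves an empty fourth segment, which is ignored here.
time_raw, msg1, msg2 = LINE.split(",")[:3]

# Step 2: interpret the Unix timestamp; the source value is kept (KeepSource: true).
log_time = datetime.fromtimestamp(int(time_raw), tz=timezone.utc)

# Step 3: split msg1 by "|" into three sub-fields; msg1 itself is not kept.
submsg1, submsg2, submsg3 = msg1.split("|")

# Step 4: parse msg2 as a single key:value pair; msg2 itself is not kept.
key, value = msg2.split(":", 1)

fields = {
    "time": time_raw,
    "submsg1": submsg1,
    "submsg2": submsg2,
    "submsg3": submsg3,
    key: value,
}
print(fields)
```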

Pattern 4: unwrap a real access log with JSON inside

The final source example is a log line in which a timestamp is followed by a JSON payload wrapped between /log_start/ and /log_end/ markers:

2016-01-02 12:59:59/log_start/{"remote_ip":"10.135.46.111","body_sent":23,"responsetime":0.232,"upstreamtime":"0.232","upstreamhost":"unix:/tmp/php-cgi.sock","http_host":"127.0.0.1","method":"POST","url":"/event/dispatch","request":"POST /event/dispatch HTTP/1.1","xff":"-","referer":"http://127.0.0.1/my/course/4","agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0","response_code":"200"}/log_end/

Expected behavior:

Split the log by / into five segments: time, msg2, msg3, msg4, and msg5.
Keep the first segment as time.
Drop the wrapper fields msg2, msg4, and msg5.
Expand the JSON segment (msg3) into individual fields.

{
  "processors": [
    {
      "type": "processor_split_delimiter",
      "detail": {
        "KeepSource": false,
        "Delimiter": "/",
        "ExtractKeys": ["time", "msg2", "msg3", "msg4", "msg5"]
      },
      "processors": [
        {
          "type": "processor_drop",
          "detail": {
            "SourceKey": "msg2"
          }
        },
        {
          "type": "processor_json",
          "detail": {
            "KeepSource": false,
            "SourceKey": "msg3"
          }
        },
        {
          "type": "processor_drop",
          "detail": {
            "SourceKey": "msg4"
          }
        },
        {
          "type": "processor_drop",
          "detail": {
            "SourceKey": "msg5"
          }
        }
      ]
    }
  ]
}

The final structured fields include time, agent, body_sent, http_host, method, referer, remote_ip, request, response_code, responsetime, upstreamhost, upstreamtime, url, and xff.
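The unwrapping steps can be sketched in Python on a shortened version of the sample line (only a few of the JSON keys are kept, for brevity). One caveat: the JSON values themselves contain / characters, so splitting on every slash would cut through the payload. The source does not explain how LogListener arrives at five segments, so this sketch makes its own assumption and splits twice from the left and twice from the right.

```python
import json

# Shortened version of the sample line; the JSON keys are a subset of the original.
LINE = (
    '2016-01-02 12:59:59/log_start/'
    '{"remote_ip":"10.135.46.111","body_sent":23,"url":"/event/dispatch",'
    '"referer":"http://127.0.0.1/my/course/4","response_code":"200"}'
    '/log_end/'
)

# A split on every "/" yields far more than five segments.
naive_segment_count = len(LINE.split("/"))

# Assumption of this sketch: peel two segments off the left and two off the right,
# so the JSON payload in the middle stays intact.
time_value, msg2, rest = LINE.split("/", 2)
msg3, msg4, msg5 = rest.rsplit("/", 2)

# Keep time, expand the JSON segment, and discard the wrapper segments.
fields = {"time": time_value}
fields.update(json.loads(msg3))

print(naive_segment_count)
print(fields)
```

Here msg2 and msg4 hold the wrapper markers and msg5 is the empty string after the final slash, which is why all three are dropped.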

Practical checklist

Use composite parsing when a single parser cannot express the source log format.
Keep the first split simple, then apply child processors to specific fields.
Drop wrapper or low-value fields before upload when they are not needed for search or analysis.
Convert image-only configuration examples into selectable JSON so future operators can copy and review them.
Keep the pipeline order explicit; LogListener runs processors in the order they appear in the configuration.