Semgrep is a great open-source security and code-validation tool. It revolves around rules like this:
```yaml
rules:
  - id: print-to-logger
    pattern: print($VAR)
    message: Use logging.info() instead of print()
    languages: [python]
    severity: MEDIUM
    fix: logger.info($VAR)
```
The rule above raises a MEDIUM severity finding every time print() is used in your Python code. It also provides the recommended fix, capturing the argument of the print call in the $VAR metavariable and reusing it in the fix content. Thus print("Hello world!") becomes logger.info("Hello world!")
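To make that concrete, here's a minimal sketch of what a target file might look like after the autofix has been applied (the logger setup is an assumption: Semgrep's fix only rewrites the print call, it won't add the logging configuration for you):

```python
import logging

# Semgrep's autofix only rewrites print($VAR) -> logger.info($VAR);
# the logger itself still has to exist, so we set one up here.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Before the fix this line read: print("Hello world!")
logger.info("Hello world!")
```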
The rules.yaml file is then used to validate one or more Python files (one, in this case):

```shell
semgrep scan -f rules.yaml app.py
```
Capturing Semgrep Output using the OpenTelemetry Collector
Semgrep is capable of producing JSON output, which makes it really easy to grab using the OpenTelemetry Collector. Let's re-run the previous command with a few more flags to produce JSON:

```shell
semgrep scan -f rules.yaml --json -o out.json app.py
```
It produces single-line JSON (JSONL) like this (expanded here for readability):
```json
{
  "version": "1.152.0",
  "results": [
    // Rule violations are listed here...
    {
      "check_id": "print-to-logger",
      "path": "app.py",
      "start": {
        "line": 4,
        "col": 5,
        "offset": 46
      },
      "end": {
        "line": 4,
        "col": 18,
        "offset": 59
      },
      "extra": {
        "message": "Use logging.info() instead of print()",
        "fix": "logger.info(\"blah\")",
        "metadata": {},
        "severity": "MEDIUM",
        "fingerprint": "requires login",
        "lines": "requires login",
        "validation_state": "NO_VALIDATOR",
        "engine_kind": "OSS"
      }
    },
    ...
  ],
  "rules": [],
  "rules_parse_time": 0.0018589496612548828,
  "profiling_times": {
    "config_time": 0.1376628875732422,
    "core_time": 0.2947859764099121,
    "ignores_time": 4.291534423828125e-05,
    "total_time": 0.43769073486328125
  },
  "parsing_time": {
    "total_time": 0.0,
    "per_file_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_files": []
  },
  "scanning_time": {
    "total_time": 0.009443998336791992,
    "per_file_time": {
      "mean": 0.009443998336791992,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_files": []
  },
  "matching_time": {
    "total_time": 0.0,
    "per_file_and_rule_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_rules_on_files": []
  },
  "tainting_time": {
    "total_time": 0.0,
    "per_def_and_rule_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_rules_on_defs": []
  },
  "fixpoint_timeouts": [],
  "prefiltering": {
    "project_level_time": 0.0,
    "file_level_time": 0.0,
    "rules_with_project_prefilters_ratio": 0.0,
    "rules_with_file_prefilters_ratio": 1.0,
    "rules_selected_ratio": 1.0,
    "rules_matched_ratio": 1.0
  },
  "targets": [],
  "total_bytes": 0,
  "max_memory_bytes": 120384832
}
```
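To get a feel for what the collector will be working with, here's a small sketch that pulls the key fields out of a result. The sample string below is a trimmed-down version of the output above:

```python
import json

# A trimmed-down sample of Semgrep's JSON output (see above).
sample = """
{
  "version": "1.152.0",
  "results": [
    {
      "check_id": "print-to-logger",
      "path": "app.py",
      "start": {"line": 4, "col": 5},
      "extra": {"severity": "MEDIUM", "message": "Use logging.info() instead of print()"}
    }
  ]
}
"""

report = json.loads(sample)
for result in report["results"]:
    # Each result identifies the rule, the file, and where the match starts.
    print(f'{result["check_id"]}: {result["path"]}:{result["start"]["line"]} '
          f'[{result["extra"]["severity"]}]')
```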
Configure the OpenTelemetry Collector to:
- Monitor `out.json`
- Parse the body text as JSON
The transform processor can also be used to process the JSONL lines as they transit through the collector. In this case the rules:
- Set both the `time` and `observed_time` to the current time (since the log line doesn't explicitly state a timestamp)
- Add a new key/value attribute pair of `tool: semgrep` to each log record (this is useful for filtering when the log line hits your observability backend)
- Add another new key/value attribute pair to each log record where the key is `results_found` and the value is the length of the `results` array (again useful for backend processing: your O11y system may be able to compute lengths from an input array, but you can add it here to offload the processing / cost)
- The final two rules effectively rename the `version` key to `semgrep_version`

Note: the collector cannot rename attribute keys, so you actually take the current value of `version` (i.e. "1.152.0"), create a new attribute called `semgrep_version`, set it to that value, and finally delete the existing `version` attribute.
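In plain Python terms, the rules amount to something like this sketch (the dict simply stands in for a log record; the collector does this via OTTL statements):

```python
import time

def transform(record):
    # Mirrors the OTTL rules: stamp times, tag the tool,
    # count results, and rename version -> semgrep_version.
    now = time.time_ns()
    record["time"] = now
    record["observed_time"] = now
    attrs = record["attributes"]
    attrs["tool"] = "semgrep"
    attrs["results_found"] = len(attrs["results"])
    # Rename workaround: copy the value, then delete the old key.
    attrs["semgrep_version"] = attrs["version"]
    del attrs["version"]
    return record

record = {"attributes": {"version": "1.152.0",
                         "results": [{"check_id": "print-to-logger"}]}}
transform(record)
```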
```yaml
receivers:
  filelog:
    include: [out.json]
    start_at: beginning
    operators:
      - type: json_parser
        parse_from: body
processors:
  transform:
    error_mode: ignore
    log_statements:
      - statements:
          - set(log.time, Now())
          - set(log.observed_time, Now())
          - set(log.attributes["tool"], "semgrep")
          - set(log.attributes["results_found"], Len(log.attributes["results"]))
          - set(log.attributes["semgrep_version"], log.attributes["version"])
          - delete_key(log.attributes, "version")
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug]
```
Save money by shrinking output when there are no violations
The output can be a bit wordy even when there are no violations. We can shrink it using the transform processor with an additional rule:

```yaml
- set(log.body, "Semgrep scan finished. No issues found.") where Len(log.attributes["results"]) == 0
```
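The equivalent logic, sketched in Python (the dicts again stand in for log records):

```python
def shrink(record):
    # Mirror of the OTTL rule above: when no violations were found,
    # replace the verbose body with a short summary line.
    if len(record["attributes"].get("results", [])) == 0:
        record["body"] = "Semgrep scan finished. No issues found."
    return record

clean = shrink({"body": "{...}", "attributes": {"results": []}})
dirty = shrink({"body": "{...}",
                "attributes": {"results": [{"check_id": "print-to-logger"}]}})
```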
Create metrics from log content
Notice there are lots of metric fields in the JSON, so use the signal_to_metrics connector to turn log content into real OpenTelemetry metrics.
Add this content to the collector YAML (connectors sits at the same level as receivers, processors, and exporters).
Then add signal_to_metrics as both an exporter of the logs pipeline and a receiver of a metrics pipeline (which you need to define).
The idea here is that logs flow into the connector, are transformed into metrics, and continue on through the metrics pipeline.
```yaml
connectors:
  signal_to_metrics:
    logs:
      - name: max_memory_bytes
        description: Peak memory used by the scan
        gauge:
          value: Double(log.attributes["max_memory_bytes"])
      - name: profiling_times.config_time
        description: Time spent loading configuration
        gauge:
          value: Double(log.attributes["profiling_times"]["config_time"])
      # ...
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug, signal_to_metrics]
    metrics:
      receivers: [signal_to_metrics]
      processors: []
      exporters: [debug]
```
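As a sanity check, the values the connector extracts are just numeric fields from the parsed JSON. A rough Python equivalent (the attribute values below are taken from the sample output earlier):

```python
def extract_gauges(attrs):
    # The connector reads numeric fields straight out of the parsed
    # Semgrep JSON attributes and emits them as gauge data points.
    return {
        "max_memory_bytes": float(attrs["max_memory_bytes"]),
        "profiling_times.config_time": float(attrs["profiling_times"]["config_time"]),
    }

gauges = extract_gauges({
    "max_memory_bytes": 120384832,
    "profiling_times": {"config_time": 0.1376628875732422},
})
```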
Summary
Semgrep is a great security and validation tool, and its results are really easy to process using the OpenTelemetry Collector.
Subscribe to me on YouTube for more Observability and OpenTelemetry content.