DEV Community

Adam Gardner

Semgrep Observability with OpenTelemetry

Semgrep is a great open source security and code validation tool. Semgrep revolves around rules like this:

rules:
  - id: print-to-logger
    pattern: print($VAR)
    message: Use logging.info() instead of print()
    languages: [python]
    severity: MEDIUM
    fix: logger.info($VAR)

The rule above raises a MEDIUM severity finding every time print() is used in your Python code. It also provides a recommended fix: the $VAR metavariable captures the argument passed to print() and substitutes it into the fix template. Thus print("Hello world!") becomes logger.info("Hello world!")
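The substitution mechanic can be sketched in a few lines of Python. This is only a toy illustration of how the captured argument is spliced into the fix template; Semgrep itself matches the parsed AST, not raw text with regexes:

```python
import re

# Toy stand-in for the rule's pattern/fix pair:
# pattern: print($VAR)  ->  fix: logger.info($VAR)
line = 'print("Hello world!")'
fixed = re.sub(r'print\((.*)\)', r'logger.info(\1)', line)
print(fixed)  # logger.info("Hello world!")
```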

The rules.yaml file is then used to validate one or more files (in this case, a single Python file):

semgrep scan -f rules.yaml app.py

Capturing Semgrep Output using the OpenTelemetry Collector

Semgrep can produce JSON output, which makes its results easy to ingest with the OpenTelemetry Collector. Let's re-run the previous command with a few more flags to produce JSON:

semgrep scan -f rules.yaml --json -o out.json app.py

It produces a single line of JSON (JSONL) like this (expanded here for readability):

{
    "version": "1.152.0",
    "results": [
       // Rule violations are listed here...
        {
            "check_id": "print-to-logger",
            "path": "app.py",
            "start": {
                "line": 4,
                "col": 5,
                "offset": 46
            },
            "end": {
                "line": 4,
                "col": 18,
                "offset": 59
            },
            "extra": {
                "message": "Use logging.info() instead of print()",
                "fix": "logger.info(\"blah\")",
                "metadata": {},
                "severity": "MEDIUM",
                "fingerprint": "requires login",
                "lines": "requires login",
                "validation_state": "NO_VALIDATOR",
                "engine_kind": "OSS"
            }
        },
        ...
    ],
    "time": {
        "rules": [],
        "rules_parse_time": 0.0018589496612548828,
        "profiling_times": {
            "config_time": 0.1376628875732422,
            "core_time": 0.2947859764099121,
            "ignores_time": 4.291534423828125e-05,
            "total_time": 0.43769073486328125
        },
        "parsing_time": {
            "total_time": 0.0,
            "per_file_time": {
                "mean": 0.0,
                "std_dev": 0.0
            },
            "very_slow_stats": {
                "time_ratio": 0.0,
                "count_ratio": 0.0
            },
            "very_slow_files": []
        },
        "scanning_time": {
            "total_time": 0.009443998336791992,
            "per_file_time": {
                "mean": 0.009443998336791992,
                "std_dev": 0.0
            },
            "very_slow_stats": {
                "time_ratio": 0.0,
                "count_ratio": 0.0
            },
            "very_slow_files": []
        },
        "matching_time": {
            "total_time": 0.0,
            "per_file_and_rule_time": {
                "mean": 0.0,
                "std_dev": 0.0
            },
            "very_slow_stats": {
                "time_ratio": 0.0,
                "count_ratio": 0.0
            },
            "very_slow_rules_on_files": []
        },
        "tainting_time": {
            "total_time": 0.0,
            "per_def_and_rule_time": {
                "mean": 0.0,
                "std_dev": 0.0
            },
            "very_slow_stats": {
                "time_ratio": 0.0,
                "count_ratio": 0.0
            },
            "very_slow_rules_on_defs": []
        },
        "fixpoint_timeouts": [],
        "prefiltering": {
            "project_level_time": 0.0,
            "file_level_time": 0.0,
            "rules_with_project_prefilters_ratio": 0.0,
            "rules_with_file_prefilters_ratio": 1.0,
            "rules_selected_ratio": 1.0,
            "rules_matched_ratio": 1.0
        },
        "targets": [],
        "total_bytes": 0,
        "max_memory_bytes": 120384832
    }
}
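Before wiring up the collector, you can sanity-check which fields the pipeline below relies on with a few lines of Python. The sample document here is a truncated, hypothetical stand-in for out.json:

```python
import json

# Pull out the two fields the transform processor will work with:
# the length of "results" and the "version" string.
sample = '{"version": "1.152.0", "results": [{"check_id": "print-to-logger", "extra": {"severity": "MEDIUM"}}]}'
scan = json.loads(sample)
print(len(scan["results"]))  # becomes the results_found attribute
print(scan["version"])       # becomes the semgrep_version attribute
```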

Configure the OpenTelemetry collector to:

  1. Monitor out.json
  2. Parse the body text as JSON

The transform processor can also be used to process the JSONL lines as they transit the collector. In this case the rules:

  1. Set both time and observed_time to the current time (since the log line doesn't explicitly state a timestamp).
  2. Add a new key/value attribute pair of tool: semgrep to each log record (useful for filtering once the log line reaches your observability backend).
  3. Add another attribute, results_found, whose value is the length of the results array (again useful for backend processing; your o11y system may be able to compute the length of an input array itself, but adding it here offloads that processing / cost).
  4. The final two statements effectively rename the version key to semgrep_version.

Note: The transform processor cannot rename attribute keys directly, so you take the current value of version (i.e. "1.152.0"), create a new attribute called semgrep_version, set its value from the existing one, and finally delete the original version attribute.
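The copy-then-delete "rename" is easier to see on a plain Python dict standing in for the log record's attributes. This is a sketch of the idea, not collector code:

```python
# Mirror of the two OTTL statements:
#   set(log.attributes["semgrep_version"], log.attributes["version"])
#   delete_key(log.attributes, "version")
attrs = {"version": "1.152.0", "tool": "semgrep"}
attrs["semgrep_version"] = attrs["version"]  # copy under the new key
del attrs["version"]                         # drop the old key
print(attrs)  # {'tool': 'semgrep', 'semgrep_version': '1.152.0'}
```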

receivers:
  filelog:
    include: [out.json]
    start_at: beginning
    operators:
      - type: json_parser
        parse_from: body

processors:
  transform:
    error_mode: ignore
    log_statements:
      - statements:
        - set(log.time, Now()) 
        - set(log.observed_time, Now())
        - set(log.attributes["tool"], "semgrep")
        - set(log.attributes["results_found"], Len(log.attributes["results"]))
        - set(log.attributes["semgrep_version"], log.attributes["version"])
        - delete_key(log.attributes, "version")

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug]

Save money by shrinking output when there are no violations

The output can be a bit wordy even when there are no violations. We can shrink it using the transform processor with one additional statement:

- set(log.body, "Semgrep scan finished. No issues found.") where Len(log.attributes["results"]) == 0
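The conditional behaves like a guarded assignment. Sketched on plain Python values standing in for the log record (not collector code):

```python
# Mirror of: set(log.body, "...") where Len(log.attributes["results"]) == 0
attrs = {"results": []}
body = '{"version": "1.152.0", "results": [], "time": {}}'  # the original wordy body
if len(attrs["results"]) == 0:
    body = "Semgrep scan finished. No issues found."
print(body)  # Semgrep scan finished. No issues found.
```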

Create metrics from log content

Notice there are lots of metric fields in the JSON, so use the signal_to_metrics connector to turn log content into real OpenTelemetry metrics.

Add this content to the collector YAML (connectors sit at the same level as receivers and processors).

Then add signal_to_metrics as both an exporter of the logs pipeline and a receiver of a metrics pipeline (which you need to define).

The idea is that logs flow into the connector, are transformed into metrics, and continue on through the metrics pipeline.

connectors:
  signal_to_metrics:
    logs:
      - name: max_memory_bytes
        description: Peak memory used by the scan, in bytes
        gauge:
          value: Double(log.attributes["time"]["max_memory_bytes"])
      - name: profiling_times.config_time
        description: Time spent loading rule configuration, in seconds
        gauge:
          value: Double(log.attributes["time"]["profiling_times"]["config_time"])

...

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug, signal_to_metrics]
    metrics:
      receivers: [signal_to_metrics]
      processors: []
      exporters: [debug]
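You can preview the exact values those two gauges will carry by walking the same paths in Python. The sample document is a truncated, hypothetical stand-in for a real scan result, with field names matching the JSON shown earlier:

```python
import json

# Walk the same attribute paths the connector config uses.
sample = '{"time": {"max_memory_bytes": 120384832, "profiling_times": {"config_time": 0.1376628875732422}}}'
attrs = json.loads(sample)
print(float(attrs["time"]["max_memory_bytes"]))                # gauge: max_memory_bytes
print(float(attrs["time"]["profiling_times"]["config_time"]))  # gauge: profiling_times.config_time
```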

Summary

Semgrep is a great security and code validation tool, and its results are easy to process with the OpenTelemetry Collector.

Subscribe to me on YouTube for more Observability and OpenTelemetry content.
