Semgrep is a great open-source security and code-validation tool. It revolves around rules like this:
```yaml
rules:
  - id: print-to-logger
    pattern: print($VAR)
    message: Use logging.info() instead of print()
    languages: [python]
    severity: MEDIUM
    fix: logger.info($VAR)
```
The rule above raises a MEDIUM severity finding every time print() is used in your Python code. It also provides the recommended fix, capturing the argument of the print call in the $VAR metavariable and reusing it in the fix content. Thus print("Hello world!") becomes logger.info("Hello world!")
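To make that concrete, here's a minimal sketch of what a target file might look like after the autofix has been applied (the logger setup is an assumption: Semgrep's fix only rewrites the print call, it won't add the logging configuration for you):

```python
import logging

# Semgrep's autofix only rewrites print($VAR) -> logger.info($VAR);
# the logger itself still has to exist, so we set one up here.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Before the fix this line read: print("Hello world!")
logger.info("Hello world!")
```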
The rules.yaml file is then used to validate one or more Python files (one, in this case):

```shell
semgrep scan -f rules.yaml app.py
```
Capturing Semgrep Output using the OpenTelemetry Collector
Semgrep is capable of producing JSON output, which makes it really easy to grab using the OpenTelemetry Collector. Let's re-run the previous command with a few more flags to produce JSON:

```shell
semgrep scan -f rules.yaml --json -o out.json app.py
```
It produces single-line JSON (JSONL) like this (expanded here for readability):
```json
{
  "version": "1.152.0",
  "results": [
    // Rule violations are listed here...
    {
      "check_id": "print-to-logger",
      "path": "app.py",
      "start": {
        "line": 4,
        "col": 5,
        "offset": 46
      },
      "end": {
        "line": 4,
        "col": 18,
        "offset": 59
      },
      "extra": {
        "message": "Use logging.info() instead of print()",
        "fix": "logger.info(\"blah\")",
        "metadata": {},
        "severity": "MEDIUM",
        "fingerprint": "requires login",
        "lines": "requires login",
        "validation_state": "NO_VALIDATOR",
        "engine_kind": "OSS"
      }
    },
    ...
  ],
  "rules": [],
  "rules_parse_time": 0.0018589496612548828,
  "profiling_times": {
    "config_time": 0.1376628875732422,
    "core_time": 0.2947859764099121,
    "ignores_time": 4.291534423828125e-05,
    "total_time": 0.43769073486328125
  },
  "parsing_time": {
    "total_time": 0.0,
    "per_file_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_files": []
  },
  "scanning_time": {
    "total_time": 0.009443998336791992,
    "per_file_time": {
      "mean": 0.009443998336791992,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_files": []
  },
  "matching_time": {
    "total_time": 0.0,
    "per_file_and_rule_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_rules_on_files": []
  },
  "tainting_time": {
    "total_time": 0.0,
    "per_def_and_rule_time": {
      "mean": 0.0,
      "std_dev": 0.0
    },
    "very_slow_stats": {
      "time_ratio": 0.0,
      "count_ratio": 0.0
    },
    "very_slow_rules_on_defs": []
  },
  "fixpoint_timeouts": [],
  "prefiltering": {
    "project_level_time": 0.0,
    "file_level_time": 0.0,
    "rules_with_project_prefilters_ratio": 0.0,
    "rules_with_file_prefilters_ratio": 1.0,
    "rules_selected_ratio": 1.0,
    "rules_matched_ratio": 1.0
  },
  "targets": [],
  "total_bytes": 0,
  "max_memory_bytes": 120384832
}
```
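To get a feel for what the collector will be working with, here's a small sketch that pulls the key fields out of a result. The sample string below is a trimmed-down version of the output above:

```python
import json

# A trimmed-down sample of Semgrep's JSON output (see above).
sample = """
{
  "version": "1.152.0",
  "results": [
    {
      "check_id": "print-to-logger",
      "path": "app.py",
      "start": {"line": 4, "col": 5},
      "extra": {"severity": "MEDIUM", "message": "Use logging.info() instead of print()"}
    }
  ]
}
"""

report = json.loads(sample)
for result in report["results"]:
    # Each result identifies the rule, the file, and where the match starts.
    print(f'{result["check_id"]}: {result["path"]}:{result["start"]["line"]} '
          f'[{result["extra"]["severity"]}]')
```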
Configure the OpenTelemetry Collector to:
- Monitor `out.json`
- Parse the body text as JSON
The transform processor can also be used to process the JSONL lines as they transit through the collector. In this case the rules:
- Set both the `time` and `observed_time` to the current time (since the log line doesn't explicitly state a timestamp)
- Add a new key/value attribute pair of `tool: semgrep` to each log record (this is useful for filtering when the log line hits your observability backend)
- Add another new key/value attribute pair to each log record where the key is `results_found` and the value is the length of the `results` array (again useful for backend processing: your O11y system may be able to compute lengths from an input array, but you can add it here to offload the processing / cost)
- The final two rules effectively rename the `version` key to `semgrep_version`

Note: the collector cannot rename attribute keys, so you actually take the current value of `version` (i.e. "1.152.0"), create a new attribute called `semgrep_version`, set it to that value, and finally delete the existing `version` attribute.
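In plain Python terms, the rules amount to something like this sketch (the dict simply stands in for a log record; the collector does this via OTTL statements):

```python
import time

def transform(record):
    # Mirrors the OTTL rules: stamp times, tag the tool,
    # count results, and rename version -> semgrep_version.
    now = time.time_ns()
    record["time"] = now
    record["observed_time"] = now
    attrs = record["attributes"]
    attrs["tool"] = "semgrep"
    attrs["results_found"] = len(attrs["results"])
    # Rename workaround: copy the value, then delete the old key.
    attrs["semgrep_version"] = attrs["version"]
    del attrs["version"]
    return record

record = {"attributes": {"version": "1.152.0",
                         "results": [{"check_id": "print-to-logger"}]}}
transform(record)
```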
```yaml
receivers:
  filelog:
    include: [out.json]
    start_at: beginning
    operators:
      - type: json_parser
        parse_from: body
processors:
  transform:
    error_mode: ignore
    log_statements:
      - statements:
          - set(log.time, Now())
          - set(log.observed_time, Now())
          - set(log.attributes["tool"], "semgrep")
          - set(log.attributes["results_found"], Len(log.attributes["results"]))
          - set(log.attributes["semgrep_version"], log.attributes["version"])
          - delete_key(log.attributes, "version")
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug]
```
Save money by shrinking output when there are no violations
The output can be a bit wordy even when there are no violations. We can shrink it using the transform processor with an additional rule:

```yaml
- set(log.body, "Semgrep scan finished. No issues found.") where Len(log.attributes["results"]) == 0
```
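The equivalent logic, sketched in Python (the dicts again stand in for log records):

```python
def shrink(record):
    # Mirror of the OTTL rule above: when no violations were found,
    # replace the verbose body with a short summary line.
    if len(record["attributes"].get("results", [])) == 0:
        record["body"] = "Semgrep scan finished. No issues found."
    return record

clean = shrink({"body": "{...}", "attributes": {"results": []}})
dirty = shrink({"body": "{...}",
                "attributes": {"results": [{"check_id": "print-to-logger"}]}})
```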
Create metrics from log content
Notice there are lots of metric fields in the JSON, so use the signal_to_metrics connector to turn log content into real OpenTelemetry metrics.
Add this content to the collector YAML (connectors sits at the same level as receivers, processors, and exporters).
Then add signal_to_metrics as both an exporter of the logs pipeline and a receiver of a metrics pipeline (which you need to define).
The idea here is that logs flow into the connector, are transformed into metrics, and continue on through the metrics pipeline.
```yaml
connectors:
  signal_to_metrics:
    logs:
      - name: max_memory_bytes
        description: Peak memory used by the scan
        gauge:
          value: Double(log.attributes["max_memory_bytes"])
      - name: profiling_times.config_time
        description: Time spent loading configuration
        gauge:
          value: Double(log.attributes["profiling_times"]["config_time"])
      # ...
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [transform]
      exporters: [debug, signal_to_metrics]
    metrics:
      receivers: [signal_to_metrics]
      processors: []
      exporters: [debug]
```
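As a sanity check, the values the connector extracts are just numeric fields from the parsed JSON. A rough Python equivalent (the attribute values below are taken from the sample output earlier):

```python
def extract_gauges(attrs):
    # The connector reads numeric fields straight out of the parsed
    # Semgrep JSON attributes and emits them as gauge data points.
    return {
        "max_memory_bytes": float(attrs["max_memory_bytes"]),
        "profiling_times.config_time": float(attrs["profiling_times"]["config_time"]),
    }

gauges = extract_gauges({
    "max_memory_bytes": 120384832,
    "profiling_times": {"config_time": 0.1376628875732422},
})
```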
Summary
Semgrep is a great security and validation tool, and its results are really easy to process using the OpenTelemetry Collector.
Subscribe to me on YouTube for more Observability and OpenTelemetry content.