Bala Paranj

Posted on May 23

The contract is the interface: agent-driven Steampipe→Stave in one command

#aws #security #cloud #devops

Consider a typical cloud-security tool's onboarding flow. A customer installs the tool. The tool's collector tries to authenticate to AWS and fails because the role isn't there yet. The customer follows three pages of setup docs and creates the role. The collector authenticates, runs, and finds nothing, because the tool only knows about S3 and IAM and the customer's workload is on EKS. End of week one.

We don't ship a collector. Stave evaluates obs.v0.1 JSON snapshots — whatever produces them. That decision sounds extreme until you've watched the same "the collector doesn't see our environment" conversation play out three times. So instead of a collector, Stave ships a contract: per-asset JSON Schemas, per-asset Steampipe→Stave column mappings, and one command (stave contract show) that emits everything an agent needs to author its own ingest. The customer's preferred source (Steampipe, AWS Config, Terraform state, an internal inventory API) plugs in by satisfying the contract.

This post walks through the steps that close the pipeline.

What the customer sees

$ stave contract show --asset-type aws_s3_bucket
Contract: aws_s3_bucket
Schema:   schemas/observation/v1/asset-types/aws_s3_bucket.schema.json
Controls: 102 | Chains: 15

Property paths (catalog reads these — sorted by chain unlock, then control unlock):

  PATH                                                          CONTROLS  CHAINS  SEVERITY  NOTE
  ────                                                          ────────  ──────  ────────  ────
  storage.kind                                                  91        15      critical
  storage.tags.data-classification                              14        2       critical  intent
  storage.access.public_read                                    8         2       critical
  storage.controls.public_access_fully_blocked                  3         1       critical
  ...

Steampipe mapping: contracts/steampipe/aws_s3_bucket.yaml

That output names everything the customer's ingest agent needs:

The schema — the JSON Schema the agent's output must satisfy
The property paths — what fields the catalog actually reads on this asset type, ranked by how many controls and chains they unlock
The mapping — a ready-to-run YAML telling the agent which Steampipe column maps to which Stave property path

For the 17 most catalog-impactful asset types, the mapping is committed. For the rest, the customer's agent has the schema; it can author its own.

The YAML mapping format

The Steampipe→Stave mapping is one ordered list of operations per asset type. Four operation kinds cover every transform shape:

field — direct column → property mapping with optional coerce/default
static — a fixed value (e.g. properties.storage.kind: bucket)
extract — pull a nested JSON value from a JSON-shaped column
computed — derive from already-set property paths (all / any reduction)

Operations run in YAML order; later ops can read paths written by earlier ones. The first mapping we wrote — contracts/steampipe/aws_s3_bucket.yaml — replaced a Python function with a declarative file. The loader changes are 100 lines; the resulting observation is byte-identical to what the imperative function produced.

operations:
  - kind: static
    path: properties.storage.kind
    value: bucket

  - kind: field
    path: properties.storage.tags
    column: tags
    default: {}
    type: dict

  - kind: extract
    path: properties.storage.encryption.algorithm
    column: server_side_encryption_configuration
    json_path: "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm"
    key_variants:
      Rules: rules
      SSEAlgorithm: sse_algorithm
    default: "none"

  - kind: computed
    path: properties.storage.controls.public_access_fully_blocked
    op: all
    inputs:
      - properties.storage.controls.public_access_block.block_public_acls
      - properties.storage.controls.public_access_block.block_public_policy
      - properties.storage.controls.public_access_block.ignore_public_acls
      - properties.storage.controls.public_access_block.restrict_public_buckets

The format is the contract. Any agent in any language can parse the YAML and produce conforming observations.
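To make the four operation kinds concrete, here is a minimal sketch of a loader that applies an already-parsed operations list to one Steampipe row. It is not the reference loader in examples/agents/stave_transform.py; the function names (get_path, set_path, extract, apply_operations) are hypothetical, the type: coercion is left out, and the sample row and the block_public_* column names are assumptions made for illustration.

```python
from typing import Any


def get_path(doc: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path such as 'properties.storage.kind' from nested dicts."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def set_path(doc: dict, dotted: str, value: Any) -> None:
    """Write a value at a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = dotted.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def extract(value: Any, json_path: str, key_variants: dict, default: Any) -> Any:
    """Walk nested JSON. Digit tokens index lists; key_variants gives an
    alternate spelling to try when the primary key is absent."""
    node = value
    for token in json_path.split("."):
        if isinstance(node, list) and token.isdigit() and int(token) < len(node):
            node = node[int(token)]
        elif isinstance(node, dict) and token in node:
            node = node[token]
        elif isinstance(node, dict) and key_variants.get(token) in node:
            node = node[key_variants[token]]
        else:
            return default
    return node


def apply_operations(row: dict, operations: list) -> dict:
    """Run operations in order; computed ops read paths written earlier."""
    obs: dict = {}
    for op in operations:
        kind = op["kind"]
        if kind == "static":
            value = op["value"]
        elif kind == "field":
            value = row.get(op["column"])
            if value is None:
                value = op.get("default")
        elif kind == "extract":
            value = extract(row.get(op["column"]), op["json_path"],
                            op.get("key_variants", {}), op.get("default"))
        elif kind == "computed":
            reducer = all if op["op"] == "all" else any
            value = reducer(bool(get_path(obs, p, False)) for p in op["inputs"])
        else:
            raise ValueError(f"unknown operation kind: {kind}")
        set_path(obs, op["path"], value)
    return obs


# Assumed sample row; a real one would come from a Steampipe query.
row = {
    "tags": None,
    "server_side_encryption_configuration": {
        "rules": [{"ApplyServerSideEncryptionByDefault": {"sse_algorithm": "aws:kms"}}]
    },
    "block_public_acls": True,
    "block_public_policy": False,
}
PAB = "properties.storage.controls.public_access_block"
operations = [
    {"kind": "static", "path": "properties.storage.kind", "value": "bucket"},
    {"kind": "field", "path": "properties.storage.tags", "column": "tags", "default": {}},
    {"kind": "extract", "path": "properties.storage.encryption.algorithm",
     "column": "server_side_encryption_configuration",
     "json_path": "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm",
     "key_variants": {"Rules": "rules", "SSEAlgorithm": "sse_algorithm"},
     "default": "none"},
    {"kind": "field", "path": f"{PAB}.block_public_acls",
     "column": "block_public_acls", "default": False},
    {"kind": "field", "path": f"{PAB}.block_public_policy",
     "column": "block_public_policy", "default": False},
    {"kind": "computed", "path": "properties.storage.controls.public_access_fully_blocked",
     "op": "all",
     "inputs": [f"{PAB}.block_public_acls", f"{PAB}.block_public_policy"]},
]
obs = apply_operations(row, operations)
```

Because the computed operation reads from the observation being built rather than from the row, moving it above the two field operations would change its result. That is the ordering guarantee the format relies on.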

Per-asset JSON Schemas

The catalog ships 3,957 controls; together they declare applicable_asset_types for 109 distinct asset types. To validate that a mapping's target paths are real, we needed a JSON Schema per asset type. Hand-authoring 109 schemas is a Tuesday lost; the schema generator already existed (it walks every control's predicate AST and infers the property paths + types), but defaulted to the top-3 most-used types.

go run ./internal/tools/genassetschemas/... -top 200
make sync-schemas

Output: 109 per-asset schemas under schemas/observation/v1/asset-types/. Every level is additionalProperties: true — the schemas are discoverability artifacts, not restrictive gates. A schema that lists one property (security_hub.enabled on aws_securityhub_account, for example) tells an agent "this asset type matters to the catalog; here is the one property to populate." Thin schemas are still useful.
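Checking that a mapping's target paths are real can be sketched as a walk over the schema's nested properties. This is an illustration only: the helper names (schema_paths, unknown_targets) are hypothetical, and the sample schema shape, with a top-level field literally called properties, is an assumption inferred from the property paths shown in this post rather than a copy of a generated schema.

```python
def schema_paths(schema: dict, prefix: str = "") -> set:
    """Collect the dotted path of every property declared in a JSON Schema tree."""
    paths = set()
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        paths.add(path)
        if isinstance(sub, dict):
            paths |= schema_paths(sub, path)
    return paths


def unknown_targets(operations: list, schema: dict) -> list:
    """Return mapping target paths that the schema does not declare."""
    known = schema_paths(schema)
    return [op["path"] for op in operations if op["path"] not in known]


# Assumed, heavily trimmed schema for illustration.
sample_schema = {
    "type": "object",
    "additionalProperties": True,
    "properties": {
        "properties": {
            "type": "object",
            "additionalProperties": True,
            "properties": {
                "storage": {
                    "type": "object",
                    "additionalProperties": True,
                    "properties": {"kind": {"type": "string"}},
                }
            },
        }
    },
}
sample_ops = [
    {"kind": "static", "path": "properties.storage.kind", "value": "bucket"},
    {"kind": "field", "path": "properties.storage.knd", "column": "name"},  # typo on purpose
]
typos = unknown_targets(sample_ops, sample_schema)
```

Since every level allows additional properties, a standard JSON Schema validator would accept the misspelled path without complaint. A path-level comparison like this one is what catches it.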

Ten hand-authored mappings

The next 10 asset types by control coverage — aws_iam_role, aws_lambda_function, aws_cognito_user_pool, aws_cloudtrail_trail, aws_kms_key, aws_ec2_instance, aws_sqs_queue, aws_iam_user, aws_opensearch_domain, aws_stepfunctions_state_machine — got hand-authored mappings. They served two purposes: actual coverage for the most-asked-for types, and a ground-truth corpus to validate Iter 5's auto-generator against.

Every mapping carries a derived_properties: block listing the catalog-read properties that cannot come from a single Steampipe column. Example from aws_iam_role.yaml:

derived_properties:
  - path: properties.identity.role.cross_account_trust_without_external_id
    source: "Parse trust_policy — detect external Account in Principal without sts:ExternalId condition"
  - path: properties.identity.permission_categories.has_incompatible_categories
    source: Policy analysis against controldata/taxonomy/permission_categories.yaml
  - path: properties.identity.access_advisor.available
    source: iam:GenerateServiceLastAccessedDetails + iam:GetServiceLastAccessedDetails (separate API call per role)

That block is the agent's TODO list. Silently producing an observation without those derived properties is the failure mode the derived_properties: section prevents — Stave's controls don't see the property, the catalog finds nothing wrong, the breach happens anyway.
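An agent can turn that TODO list into a loud check before handing the observation to Stave. The sketch below assumes the derived_properties: block has already been parsed into a list of dicts; the function name missing_derived and the sample observation are hypothetical.

```python
def missing_derived(observation: dict, derived_properties: list) -> list:
    """Return the derived property paths that the observation does not contain."""
    missing = []
    for entry in derived_properties:
        node = observation
        for key in entry["path"].split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(entry["path"])
                break
            node = node[key]
    return missing


derived = [
    {"path": "properties.identity.role.cross_account_trust_without_external_id",
     "source": "Parse trust_policy"},
    {"path": "properties.identity.access_advisor.available",
     "source": "separate API call per role"},
]
# Assumed observation: the agent computed the first property but not the second.
observation = {
    "properties": {
        "identity": {
            "kind": "role",
            "role": {"cross_account_trust_without_external_id": False},
        }
    }
}
gaps = missing_derived(observation, derived)
if gaps:
    print("observation is missing derived properties:", ", ".join(gaps))
```

A property that is present with the value False counts as populated here. Only an absent path is reported, which matches the failure mode described above: the control never sees the property at all.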

The contract show command

The three sources — schema, predicate index, mapping file — already existed. Joining them required three separate file reads. The new command joins them once:

stave contract show --asset-type aws_iam_role --format json

{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
  "controls_count": 198,
  "chains_count": 38,
  "property_paths": [
    {
      "path": "properties.identity.kind",
      "controls_count": 196,
      "chains_count": 35,
      "max_severity": "critical",
      "is_intent_property": false
    },
    ...
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}

Or:

stave contract show --list

Asset types with controls: 109 (schema: 109, steampipe mapping: 17)

  TYPE                              SCHEMA  CONTROLS  CHAINS  MAPPING
  ────                              ──────  ────────  ──────  ───────
  aws_iam_role                      yes     198       38      steampipe
  aws_s3_bucket                     yes     102       15      steampipe
  aws_lambda_function               yes     169       12      steampipe
  aws_bedrock_agent                 yes     24        5       -
  ...

The implementation reuses everything already in the codebase: compose.LoadControlsFrom, compose.LoadChainDefinitions, predindex.Build (the same index the stave gaps command uses), and a 50-line helper in internal/contracts/schema/load.go to access the embedded per-asset schemas. The command is ~330 lines; nothing is new data — it's projection over existing data.
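On the agent side, consuming the --format json output could look like the sketch below. It only uses the field names shown in the JSON above. The helper names (rank_paths, ingest_plan) are hypothetical, and the second property_paths entry carries invented counts purely so the ordering has something to sort.

```python
import json

# Sample shaped like the documented output; the second entry's counts are invented.
SAMPLE = """
{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
  "controls_count": 198,
  "chains_count": 38,
  "property_paths": [
    {"path": "properties.identity.access_advisor.available",
     "controls_count": 2, "chains_count": 1,
     "max_severity": "high", "is_intent_property": false},
    {"path": "properties.identity.kind",
     "controls_count": 196, "chains_count": 35,
     "max_severity": "critical", "is_intent_property": false}
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}
"""


def rank_paths(contract: dict) -> list:
    """Order paths by chains unlocked, then controls unlocked, then name."""
    entries = contract.get("property_paths", [])
    ordered = sorted(
        entries,
        key=lambda e: (-e["chains_count"], -e["controls_count"], e["path"]),
    )
    return [e["path"] for e in ordered]


def ingest_plan(contract: dict) -> dict:
    """Reduce the contract to the three things an ingest agent acts on."""
    return {
        "schema": contract["schema_path"] if contract.get("has_schema") else None,
        "mapping": contract.get("steampipe_mapping"),
        "paths": rank_paths(contract),
    }


plan = ingest_plan(json.loads(SAMPLE))
```

In practice the JSON string would come from the command's standard output rather than a literal. The agent populates plan["paths"] in order and stops wherever its data source runs out.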

Auto-generator

The remaining ~98 asset types could be hand-authored or auto-generated. We tried auto. The generator joins the cached Steampipe column catalog with each per-asset schema's property paths, applies a four-rule matching priority (per-asset overrides, schema-path lookup with multi-token scoring, tags convention, fallback to properties.<ns>.<col>), and emits a YAML in the same operations-list format Iter 1 established.

make gen-steampipe-mappings           # generate, skip existing
make gen-steampipe-mappings-validate  # measure accuracy
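The four-rule priority can be sketched as a single function that tries each rule in turn. This is not the code in scripts/gen-steampipe-mappings.py: the name match_column, the token-overlap score, and the rule that a schema match must cover every token of the column name are assumptions standing in for the real multi-token scoring.

```python
import re


def tokens(name: str) -> set:
    """Split a column name or dotted path into lowercase tokens."""
    return {t for t in re.split(r"[._]", name.lower()) if t}


def match_column(column: str, schema_paths: set, namespace: str, overrides: dict) -> tuple:
    """Return (target_path, rule_name) for one Steampipe column."""
    # Rule 1: a per-asset override always wins.
    if column in overrides:
        return overrides[column], "override"

    # Rule 2: schema-path lookup, scored by shared tokens.
    # Shorter paths are visited first, so they win ties.
    col_tokens = tokens(column)
    best_path, best_score = None, 0
    for path in sorted(schema_paths, key=lambda p: (len(p), p)):
        score = len(col_tokens & tokens(path))
        if score > best_score:
            best_path, best_score = path, score
    if best_path is not None and best_score == len(col_tokens):
        return best_path, "schema"

    # Rule 3: the tags convention.
    if column == "tags":
        return f"properties.{namespace}.tags", "tags"

    # Rule 4: fall back to properties.<ns>.<col>.
    return f"properties.{namespace}.{column}", "fallback"


known_paths = {
    "properties.storage.kind",
    "properties.storage.controls.public_access_block.block_public_acls",
    "properties.storage.controls.public_access_block.block_public_policy",
}
overrides = {"name": "properties.storage.bucket_name"}  # hypothetical override

results = {
    col: match_column(col, known_paths, "storage", overrides)
    for col in ["name", "block_public_acls", "tags", "arn"]
}
```

Returning the rule name next to the path lets a generator count how many columns fell through to the fallback, which is the kind of number a review marker can be built from.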

Validation runs the generator against the 11 hand-authored YAMLs (Iter 1 + Iter 3) and compares the auto-generated (column, path) tuples against the ground truth:

Overall: 149/177 = 84% accuracy across 11 type(s)

84% — past the 80% target. The remaining 16% are the multi-target JSON-path extracts the brief flagged as inherently manual (one column → two property paths is not something a name-similarity heuristic can synthesise). Auto-generated YAMLs carry _auto_generated: true + _review_required: N + _unmatched_paths: [...] so the reviewer's surface is bounded.
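The accuracy figure is a set comparison over (column, path) tuples. A minimal sketch follows, assuming the denominator is the number of ground-truth tuples; the function name score and the toy data are hypothetical, and the validator may count differently.

```python
def score(generated: dict, truth: dict) -> tuple:
    """Compare per-type sets of (column, path) tuples against ground truth.

    Returns (matched, total, percent) where total counts ground-truth tuples.
    """
    matched = sum(
        len(generated.get(asset_type, set()) & pairs)
        for asset_type, pairs in truth.items()
    )
    total = sum(len(pairs) for pairs in truth.values())
    percent = round(100 * matched / total) if total else 0
    return matched, total, percent


# Toy data: two of three hand-authored tuples are reproduced by the generator.
truth = {
    "aws_s3_bucket": {
        ("tags", "properties.storage.tags"),
        ("name", "properties.storage.bucket_name"),
        ("versioning_enabled", "properties.storage.versioning.enabled"),
    }
}
generated = {
    "aws_s3_bucket": {
        ("tags", "properties.storage.tags"),
        ("name", "properties.storage.bucket_name"),
        ("versioning_enabled", "properties.storage.versioning_enabled"),  # wrong path
    }
}
toy = score(generated, truth)

# The reported figure: 149 matched out of 177 rounds to 84%.
reported_percent = round(100 * 149 / 177)
```

A tuple counts only on an exact match of both column and path. That is why one column feeding two property paths costs the heuristic at least one tuple every time.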

The detailed story of the heuristic — and how it went from 8% accuracy on the first pass to 84% on the fourth — is its own post. The point here is what's committed: 17 total mappings (11 hand-authored, 6 auto-generated), every one of them an artifact a customer's agent can read in any language.

Why the contract sits where it does

The architecture choice that makes this work: extractors are client-owned. Stave does not ship a collector. The contracts/steampipe/ directory contains instructions, not code. An agent reads the schema and the mapping; the agent produces the observation; Stave evaluates the observation. The collector boundary is a file, not a process.

This decision has been in our architecture docs since the project started, but until now there was no single command that surfaced the contract to an agent. An agent that wanted to author a Steampipe ingest for a new asset type had to:

Find the per-asset schema (one of several embedded directories)
Decide what property paths to populate (no canonical list — derive from controls)
Map Steampipe columns to those paths (no template — invent it)

Now the agent runs stave contract show and gets all three. It runs make gen-steampipe-mappings and gets a starting-point YAML it can refine. Integration shrinks from three manual lookups to two commands.

What stayed out of Stave

Nothing in the Stave Go binary changed across the five iterations except the new cmd/contract/ directory (one file, ~330 LOC). The agent infrastructure is:

examples/agents/stave_transform.py — reference loader (Python)
contracts/steampipe/*.yaml — 17 mappings (committed)
scripts/gen-steampipe-mappings.py — auto-generator (Python, ~280 LOC)
scripts/steampipe-columns.json — cached column catalog (refreshable from a live Steampipe install)

The deterministic policy engine is unchanged. The contract evolves; the engine doesn't.

The generic pipeline shape

Replace Steampipe with any external data source — AWS Config, Terraform state, your internal inventory, Salesforce, OpenAPI specs — and the pipeline shape is the same:

Define the canonical target contract. For Stave it's obs.v0.1 JSON with per-asset-type sub-schemas. For your tool, it's whatever shape your engine reads.
Author one mapping per source per asset type. YAML is fine. Operations list with field/static/extract/computed semantics covers most transform shapes.
Ship a discovery command. One CLI that joins the schema + the path list + the mapping into a single agent-readable output. The agent stops needing your team's docs.
Auto-generate the boring half. Most column→path mappings are name-similarity. The exceptions are rare enough to hand-author. Use the hand-authored set as a ground-truth corpus to measure your generator's accuracy.
Mark uncertainty explicitly. _review_required, _unmatched_paths, derived_properties:. Silent gaps are worse than loud ones.

Five points, one functioning pipeline. The customer who needed three pages of collector setup now needs make gen-steampipe-mappings and an agent that can read a YAML.
