Consider a typical cloud-security tool's onboarding flow. A customer installs the tool. The tool's collector tries to authenticate to AWS, fails because the role isn't there yet, the customer follows three pages of setup docs, the role gets created, the collector authenticates, the collector runs, the collector finds nothing because the tool only knows about S3 and IAM and the customer's workload is on EKS. End of week one.
We don't ship a collector. Stave evaluates obs.v0.1 JSON snapshots — whatever produces them. That decision sounds extreme until you've watched the same "the collector doesn't see our environment" conversation play out three times. So instead of a collector, Stave ships a contract: per-asset JSON Schemas, per-asset Steampipe→Stave column mappings, and one command (stave contract show) that emits everything an agent needs to author its own ingest. The customer's preferred source (Steampipe, AWS Config, Terraform state, an internal inventory API) plugs in by satisfying the contract.
This post walks through the steps that close the pipeline.
What the customer sees
$ stave contract show --asset-type aws_s3_bucket
Contract: aws_s3_bucket
Schema: schemas/observation/v1/asset-types/aws_s3_bucket.schema.json
Controls: 102 | Chains: 15
Property paths (catalog reads these — sorted by chain unlock, then control unlock):
PATH                                          CONTROLS  CHAINS  SEVERITY  NOTE
────                                          ────────  ──────  ────────  ────
storage.kind                                  91        15      critical
storage.tags.data-classification              14        2       critical  intent
storage.access.public_read                    8         2       critical
storage.controls.public_access_fully_blocked  3         1       critical
...
Steampipe mapping: contracts/steampipe/aws_s3_bucket.yaml
That output names everything the customer's ingest agent needs:
- The schema — the JSON Schema the agent's output must satisfy
- The property paths — what fields the catalog actually reads on this asset type, ranked by how many controls and chains they unlock
- The mapping — a ready-to-run YAML telling the agent which Steampipe column maps to which Stave property path
For the 17 most catalog-impactful asset types, the mapping is committed. For the rest, the customer's agent has the schema; it can author its own.
The YAML mapping format
The Steampipe→Stave mapping is one ordered list of operations per asset type. Four operation kinds cover every transform shape:
- field: direct column → property mapping with optional coerce/default
- static: a fixed value (e.g. properties.storage.kind: bucket)
- extract: pull a nested JSON value from a JSON-shaped column
- computed: derive from already-set property paths (all/any reduction)
Operations run in YAML order; later ops can read paths written by earlier ones. The first mapping we wrote — contracts/steampipe/aws_s3_bucket.yaml — replaced a Python function with a declarative file. The loader changes are 100 lines; the resulting observation is byte-identical to what the imperative function produced.
operations:
  - kind: static
    path: properties.storage.kind
    value: bucket
  - kind: field
    path: properties.storage.tags
    column: tags
    default: {}
    type: dict
  - kind: extract
    path: properties.storage.encryption.algorithm
    column: server_side_encryption_configuration
    json_path: "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm"
    key_variants:
      Rules: rules
      SSEAlgorithm: sse_algorithm
    default: "none"
  - kind: computed
    path: properties.storage.controls.public_access_fully_blocked
    op: all
    inputs:
      - properties.storage.controls.public_access_block.block_public_acls
      - properties.storage.controls.public_access_block.block_public_policy
      - properties.storage.controls.public_access_block.ignore_public_acls
      - properties.storage.controls.public_access_block.restrict_public_buckets
The format is the contract. Any agent in any language can parse the YAML and produce conforming observations.
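To make the four operation kinds concrete, here is a minimal Python sketch of a loader that applies an operations list to one Steampipe row. It is not the reference loader in examples/agents/stave_transform.py; the function names (set_path, get_path, extract, apply_operations) are hypothetical, the operations are passed as already-parsed dicts, and the type: coercion step is left out. It also assumes key_variants maps a json_path segment to an alternate spelling of the same key.

```python
import copy
from typing import Any


def set_path(obj: dict, dotted: str, value: Any) -> None:
    """Write value at a dotted path, creating intermediate dicts as needed."""
    parts = dotted.split(".")
    for key in parts[:-1]:
        obj = obj.setdefault(key, {})
    obj[parts[-1]] = value


def get_path(obj: Any, dotted: str, default: Any = None) -> Any:
    """Read a dotted path from nested dicts; return default if a key is missing."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def extract(value: Any, json_path: str, key_variants: dict, default: Any) -> Any:
    """Walk json_path through dicts and lists, trying each key's variant spelling."""
    for key in json_path.split("."):
        if isinstance(value, list):
            if not key.isdigit() or int(key) >= len(value):
                return default
            value = value[int(key)]
        elif isinstance(value, dict):
            variant = key_variants.get(key)
            if key in value:
                value = value[key]
            elif variant is not None and variant in value:
                value = value[variant]
            else:
                return default
        else:
            return default
    return value


def apply_operations(operations: list, row: dict) -> dict:
    """Apply operations in order; later ops can read paths written by earlier ones."""
    obs: dict = {}
    for op in operations:
        kind = op["kind"]
        if kind == "static":
            value = op["value"]
        elif kind == "field":
            value = row.get(op["column"])
            if value is None:
                # Copy the default so observations never share a mutable object.
                value = copy.deepcopy(op.get("default"))
        elif kind == "extract":
            value = extract(row.get(op["column"]), op["json_path"],
                            op.get("key_variants", {}), op.get("default"))
        elif kind == "computed":
            reducer = all if op["op"] == "all" else any
            value = reducer(bool(get_path(obs, p, False)) for p in op["inputs"])
        else:
            raise ValueError(f"unknown operation kind: {kind}")
        set_path(obs, op["path"], value)
    return obs
```

Because computed reads from the observation being built rather than from the row, its input paths have to be written by earlier operations, which is why YAML order matters.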
Per-asset JSON Schemas
The catalog ships 3,957 controls; together they declare applicable_asset_types for 109 distinct asset types. To validate that a mapping's target paths are real, we needed a JSON Schema per asset type. Hand-authoring 109 schemas is a Tuesday lost; the schema generator already existed (it walks every control's predicate AST and infers the property paths + types), but defaulted to the top-3 most-used types.
go run ./internal/tools/genassetschemas/... -top 200
make sync-schemas
Output: 109 per-asset schemas under schemas/observation/v1/asset-types/. Every level is additionalProperties: true — the schemas are discoverability artifacts, not restrictive gates. A schema that lists one property (security_hub.enabled on aws_securityhub_account, for example) tells an agent "this asset type matters to the catalog; here is the one property to populate." Thin schemas are still useful.
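Checking that a mapping's target paths are real can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's validator: the helper names are hypothetical, the schema literal is a guess at what a thin aws_securityhub_account schema might look like, and paths are given relative to the asset's property object.

```python
def schema_has_path(schema: dict, dotted: str) -> bool:
    """Return True if every segment of dotted is declared under nested 'properties'."""
    node = schema
    for key in dotted.split("."):
        props = node.get("properties") if isinstance(node, dict) else None
        if not isinstance(props, dict) or key not in props:
            return False
        node = props[key]
    return True


def unknown_paths(schema: dict, paths: list) -> list:
    """List the mapping target paths that the schema does not declare."""
    return [p for p in paths if not schema_has_path(schema, p)]


# Hypothetical thin schema: one declared property, open at every level.
example_schema = {
    "type": "object",
    "additionalProperties": True,
    "properties": {
        "security_hub": {
            "type": "object",
            "additionalProperties": True,
            "properties": {"enabled": {"type": "boolean"}},
        }
    },
}
```

With additionalProperties: true at every level, a standard JSON Schema validator would accept an undeclared path, so a walk like this is what actually catches a mapping that targets a property no control reads.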
Ten hand-authored mappings
The next 10 asset types by control coverage — aws_iam_role, aws_lambda_function, aws_cognito_user_pool, aws_cloudtrail_trail, aws_kms_key, aws_ec2_instance, aws_sqs_queue, aws_iam_user, aws_opensearch_domain, aws_stepfunctions_state_machine — got hand-authored mappings. They served two purposes: actual coverage for the most-asked-for types, and a ground-truth corpus to validate Iter 5's auto-generator against.
Every mapping carries a derived_properties: block listing the catalog-read properties that cannot come from a single Steampipe column. Example from aws_iam_role.yaml:
derived_properties:
  - path: properties.identity.role.cross_account_trust_without_external_id
    source: "Parse trust_policy — detect external Account in Principal without sts:ExternalId condition"
  - path: properties.identity.permission_categories.has_incompatible_categories
    source: Policy analysis against controldata/taxonomy/permission_categories.yaml
  - path: properties.identity.access_advisor.available
    source: iam:GenerateServiceLastAccessedDetails + iam:GetServiceLastAccessedDetails (separate API call per role)
That block is the agent's TODO list. Silently producing an observation without those derived properties is the failure mode the derived_properties: section prevents — Stave's controls don't see the property, the catalog finds nothing wrong, the breach happens anyway.
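An agent can turn that TODO list into a loud check before it emits anything. The sketch below is one possible shape, assuming the observation is a nested dict whose top-level key is properties (matching the paths in the block) and that derived_properties has already been parsed into a list of dicts; the helper names are hypothetical.

```python
_MISSING = object()  # sentinel so that a stored False or None still counts as "set"


def lookup(obj, dotted: str):
    """Return the value at a dotted path, or _MISSING if any segment is absent."""
    for key in dotted.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return _MISSING
        obj = obj[key]
    return obj


def missing_derived(derived_properties: list, observation: dict) -> list:
    """List the derived property paths the observation has not populated."""
    return [
        entry["path"]
        for entry in derived_properties
        if lookup(observation, entry["path"]) is _MISSING
    ]
```

An agent could refuse to write the snapshot, or at least log the returned paths, whenever the list is non-empty.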
The Contract Show Command
The three sources (schema, predicate index, mapping file) already existed, but joining them by hand meant three separate file reads. The new command does the join once:
stave contract show --asset-type aws_iam_role --format json
{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
  "controls_count": 198,
  "chains_count": 38,
  "property_paths": [
    {
      "path": "properties.identity.kind",
      "controls_count": 196,
      "chains_count": 35,
      "max_severity": "critical",
      "is_intent_property": false
    },
    ...
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}
Or:
stave contract show --list
Asset types with controls: 109 (schema: 109, steampipe mapping: 17)
TYPE                 SCHEMA  CONTROLS  CHAINS  MAPPING
────                 ──────  ────────  ──────  ───────
aws_iam_role         yes     198       38      steampipe
aws_s3_bucket        yes     102       15      steampipe
aws_lambda_function  yes     169       12      steampipe
aws_bedrock_agent    yes     24        5       -
...
The implementation reuses everything already in the codebase: compose.LoadControlsFrom, compose.LoadChainDefinitions, predindex.Build (the same index the stave gaps command uses), and a 50-line helper in internal/contracts/schema/load.go to access the embedded per-asset schemas. The command is ~330 lines; nothing is new data — it's projection over existing data.
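On the consuming side, an agent only needs to parse the JSON form and decide what to populate first. The following Python sketch works on an already-captured JSON string rather than invoking the CLI; the field names come from the sample output above, while the function names and the second property path in the sample are hypothetical.

```python
import json


def populate_order(contract: dict) -> list:
    """Order property paths by chains unlocked, then controls unlocked (both descending)."""
    paths = contract.get("property_paths", [])
    ranked = sorted(
        paths,
        key=lambda p: (-p["chains_count"], -p["controls_count"], p["path"]),
    )
    return [p["path"] for p in ranked]


def needs_own_mapping(contract: dict) -> bool:
    """True when no Steampipe mapping is listed and the agent must author one."""
    return not contract.get("steampipe_mapping")


sample = json.loads("""
{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "property_paths": [
    {"path": "properties.identity.example_flag", "controls_count": 3,
     "chains_count": 1, "max_severity": "high", "is_intent_property": false},
    {"path": "properties.identity.kind", "controls_count": 196,
     "chains_count": 35, "max_severity": "critical", "is_intent_property": false}
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}
""")
```

Sorting by chains and then controls mirrors the ordering the text output describes, so an agent with limited effort can start with the paths that unlock the most.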
Auto-generator
The remaining ~98 asset types could be hand-authored or auto-generated. We tried auto. The generator joins the cached Steampipe column catalog with each per-asset schema's property paths, applies a four-rule matching priority (per-asset overrides, schema-path lookup with multi-token scoring, tags convention, fallback to properties.<ns>.<col>), and emits a YAML in the same operations-list format Iter 1 established.
make gen-steampipe-mappings # generate, skip existing
make gen-steampipe-mappings-validate # measure accuracy
Validation runs the generator against the 11 hand-authored YAMLs (Iter 1 + Iter 3) and compares the auto-generated (column, path) tuples against the ground truth:
Overall: 149/177 = 84% accuracy across 17 type(s)
84% — past the 80% target. The remaining 16% are the multi-target JSON-path extracts the brief flagged as inherently manual (one column → two property paths is not something a name-similarity heuristic can synthesise). Auto-generated YAMLs carry _auto_generated: true + _review_required: N + _unmatched_paths: [...] so the reviewer's surface is bounded.
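The accuracy figure is a set comparison over (column, path) tuples. A minimal Python sketch of that measurement follows; it assumes the denominator is the number of ground-truth tuples and that only operations carrying a column key contribute a tuple. Both are assumptions about the validator, and the function names are hypothetical.

```python
def tuples_from_ops(operations: list) -> set:
    """Collect (column, path) pairs from operations that read a source column."""
    return {(op["column"], op["path"]) for op in operations if "column" in op}


def percent(matched: int, total: int) -> int:
    """Whole-number percentage, with an empty corpus reported as 0."""
    return round(100 * matched / total) if total else 0


def accuracy_line(truth_ops: list, generated_ops: list) -> str:
    """Compare generated tuples against hand-authored ones and format a summary."""
    truth = tuples_from_ops(truth_ops)
    generated = tuples_from_ops(generated_ops)
    matched = len(truth & generated)
    return f"{matched}/{len(truth)} = {percent(matched, len(truth))}% accuracy"
```

Under this definition a generated tuple that is not in the ground truth does not lower the score; counting those as well would turn the figure into a precision/recall pair instead of a single number.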
The detailed story of the heuristic — and how it went from 8% accuracy on the first pass to 84% on the fourth — is its own post. The point here is what's committed: 17 total mappings (11 hand-authored, 6 auto-generated), every one of them an artifact a customer's agent can read in any language.
Why the contract sits where it does
The architecture choice that makes this work: extractors are client-owned. Stave does not ship a collector. The contracts/steampipe/ directory contains instructions, not code. An agent reads the schema and the mapping; the agent produces the observation; Stave evaluates the observation. The collector boundary is a file, not a process.
This decision has been in our architecture docs since the project started, but until now there was no single command that surfaced the contract to an agent. An agent that wanted to author a Steampipe ingest for a new asset type had to:
- Find the per-asset schema (one of several embedded directories)
- Decide what property paths to populate (no canonical list — derive from controls)
- Map Steampipe columns to those paths (no template — invent it)
Now the agent runs one command and gets all three. It runs make gen-steampipe-mappings and gets a starting-point YAML it can refine. What used to be three manual steps is now two commands.
What stayed out of Stave
Nothing in the Stave Go binary changed across the five iterations except the new cmd/contract/ directory (one file, ~330 LOC). The agent infrastructure is:
- examples/agents/stave_transform.py: reference loader (Python)
- contracts/steampipe/*.yaml: 17 mappings (committed)
- scripts/gen-steampipe-mappings.py: auto-generator (Python, ~280 LOC)
- scripts/steampipe-columns.json: cached column catalog (refreshable from a live Steampipe install)
The deterministic policy engine is unchanged. The contract evolves; the engine doesn't.
The Generic Pipeline Shape
Replace Steampipe with any external data source — AWS Config, Terraform state, your internal inventory, Salesforce, OpenAPI specs — and the pipeline shape is the same:
- Define the canonical target contract. For Stave it's obs.v0.1 JSON with per-asset-type sub-schemas. For your tool, it's whatever shape your engine reads.
- Author one mapping per source per asset type. YAML is fine. Operations list with field/static/extract/computed semantics covers most transform shapes.
- Ship a discovery command. One CLI that joins the schema + the path list + the mapping into a single agent-readable output. The agent stops needing your team's docs.
- Auto-generate the boring half. Most column→path mappings are name-similarity. The exceptions are rare enough to hand-author. Use the hand-authored set as a ground-truth corpus to measure your generator's accuracy.
- Mark uncertainty explicitly. _review_required, _unmatched_paths, derived_properties:. Silent gaps are worse than loud ones.
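As an illustration of the last point, a generator can compute its own review surface before writing a mapping. The Python sketch below assumes that _review_required counts the schema paths no operation targets; the post does not define N, so treat that as a guess, and the function name annotate is hypothetical.

```python
def annotate(operations: list, schema_paths: list) -> dict:
    """Wrap generated operations with explicit uncertainty markers."""
    targeted = {op["path"] for op in operations}
    # Schema paths that no operation writes are the reviewer's worklist.
    unmatched = sorted(p for p in schema_paths if p not in targeted)
    return {
        "_auto_generated": True,
        "_review_required": len(unmatched),  # assumption: N = unmatched path count
        "_unmatched_paths": unmatched,
        "operations": operations,
    }
```

Serialising the returned dict to YAML gives a file whose gaps are visible at the top rather than discovered later as silent misses.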
Five points, one functioning pipeline. The customer who needed three pages of collector setup now needs make gen-steampipe-mappings and an agent that can read a YAML.