
Vijay K Joseph

The Zero-Trust Data Stack: Building Deterministic Agents with Google ADK

The "Trusted Analyst" Problem

When business stakeholders ask, “What was our Q3 revenue?” a hallucinating agent can invent numbers or misread columns. In an enterprise context that becomes legal and financial risk, not a cute demo problem. We need answers that are correct, explainable, reproducible, and auditable. Google's Agent Development Kit (ADK) is designed for building code-first multi-agent systems that make these properties achievable.

The Architecture: A 6-Agent Assembly Line

Design idea: split responsibility across small, well-specified agents. Each agent has a single purpose and uses deterministic tools wherever possible.
The six agents (a wiring sketch follows the list):

  1. Ingestor — accepts file/database input, infers schema, profiles data (row counts, types, nulls).
  2. Schema & Typing Validator — verifies column types, mappings (e.g., date, currency, customer_id), rejects ambiguous or malformed inputs; produces a canonical schema.
  3. Deterministic Query Generator — produces SQL using templates and a deterministic engine (no freeform LLM SQL). Uses parameterized templates, typed placeholders, and a SQL AST builder.
  4. Executor (Safe Runner) — runs SQL queries in a sandboxed environment (read-only, statement limits, sampling) and returns raw results plus execution plan, row counts, and checksums.
  5. Statistical Checker / Data Profiler — runs deterministic checks on results (aggregates vs. expected ranges, null ratios, duplicates, join cardinality checks), flags anomalies.
  6. Synthesizer/Reporter (+ Provenance) — creates the human answer, attaches provenance (original query, SQL text, execution plan, checksums, dataset snapshot index), and a confidence score with specific reason string(s).
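
Here is a minimal wiring sketch of that assembly line using ADK's LlmAgent and SequentialAgent. The agent names, instructions, model string, and tool stub are illustrative assumptions; only the composition pattern is the point, and class/parameter names should be checked against the ADK docs.

```python
# Minimal sketch: composing the assembly line with ADK's SequentialAgent.
# Agent names, instructions, the model string, and the tool stub below are
# illustrative assumptions; verify parameter names against the ADK docs.
from google.adk.agents import LlmAgent, SequentialAgent


def profile_dataset(path: str) -> dict:
    """Deterministic profiling tool stub: row count, column types, null ratios."""
    # A real implementation would read the file/table and compute the profile.
    return {"rows": 0, "columns": {}, "null_ratios": {}}


ingestor = LlmAgent(
    name="ingestor",
    model="gemini-2.0-flash",
    instruction="Load the input, call profile_dataset, and return the data profile.",
    tools=[profile_dataset],
    output_key="profile",
)

schema_validator = LlmAgent(
    name="schema_validator",
    model="gemini-2.0-flash",
    instruction="Check column types against the profile; reject ambiguous mappings "
                "and emit a canonical schema.",
    output_key="canonical_schema",
)

# query_generator, executor, checker, and synthesizer are defined the same way,
# each with deterministic tools and an output_key the next agent reads.

pipeline = SequentialAgent(
    name="trusted_analyst_pipeline",
    sub_agents=[ingestor, schema_validator],  # ...plus the remaining four agents
)
```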

Visualizing the Flow

The pipeline runs left to right, each agent handing its validated output (plus provenance) to the next:
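
```
Ingestor → Schema & Typing Validator → Deterministic Query Generator →
Executor (Safe Runner) → Statistical Checker / Profiler → Synthesizer / Reporter (+ Provenance)
```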

The "Secret Sauce": Deterministic Tools

The single biggest defense against hallucination is to remove nondeterminism wherever possible. Examples:

  • SQL templating + AST builders: Build queries by composing AST nodes (no freeform LLM text). This guarantees syntactic correctness and limits injection and bad reads (see the sketch after this list).

  • Linters and static analyzers (e.g., SQLFluff or a SQL AST validator) to ensure queries follow best practices.

  • Deterministic transform functions for parsing dates/numbers: use typed parsers with fallback rules, not heuristic prose.

  • Unit tests & snapshot tests: For every analytic endpoint, run a deterministic suite (e.g., expected aggregate for a fixed seed dataset).

  • Checksums and sampling: the executor returns a hash of the resultset (or top-N canonical sample) so consumers can verify the same result is being used.
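
As a concrete sketch of the templating and checksum ideas above: a fixed template with allow-listed identifiers and typed, driver-bound parameters, plus a canonical hash of the resultset. The table/column names, allow-lists, and date range are illustrative assumptions.

```python
# Sketch of deterministic SQL generation + resultset checksums.
# The ALLOWED_* sets, the template, and the Q3 date range are illustrative;
# the point is: fixed templates, allow-listed identifiers, typed bound parameters.
import hashlib
import json

ALLOWED_TABLES = {"orders"}
ALLOWED_COLUMNS = {"revenue", "order_date", "customer_id"}


def build_quarter_revenue_sql(table: str, amount_col: str, date_col: str) -> tuple[str, dict]:
    """Compose SQL from a fixed template; identifiers must pass the allow-list,
    and values are bound as typed parameters by the database driver."""
    if table not in ALLOWED_TABLES or {amount_col, date_col} - ALLOWED_COLUMNS:
        raise ValueError("identifier not in allow-list")
    sql = (
        f"SELECT SUM({amount_col}) AS total "
        f"FROM {table} "
        f"WHERE {date_col} >= :start AND {date_col} < :end"
    )
    params = {"start": "2024-07-01", "end": "2024-10-01"}
    return sql, params


def resultset_checksum(rows: list[dict]) -> str:
    """Canonicalize rows (sorted keys, sorted rows) and hash them so any downstream
    consumer can verify it is using exactly the result the Executor produced."""
    canonical = json.dumps(
        sorted(rows, key=lambda r: json.dumps(r, sort_keys=True, default=str)),
        sort_keys=True,
        default=str,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()
```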

Handling the "Messy Reality" of Data

Real data is rarely clean. Practical pieces to implement:

  • Auto profiling: detect suspicious columns (e.g., numeric stored as string), outlier rates, rate of parsing failures.

  • Column mapping UI / schema prompts: when the validator finds ambiguity (e.g., “is this column revenue or discount?”), surface a small human-in-the-loop verification step or a concise confirmation prompt.

  • Robust parsing rules: explicit locale/format rules for dates, currencies, and thousands separators; fallback strategies that are logged (see the sketch after this list).

  • Progressive aggregation: for very large datasets, run sampling + incremental aggregations and compare to full-run results.

  • Explainable imputations: if imputing missing data, log method and effect size (e.g., “imputed 12% of revenue rows with median per-customer”).

  • Escape hatches: allow a user to pin schema or upload a small canonical sample to avoid repeated ambiguity.
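
A sketch of the robust-parsing bullet: explicit, ordered formats and logged fallbacks instead of heuristic guessing. The format list and the dollar-only currency handling are illustrative assumptions.

```python
# Sketch of typed, locale-aware parsing with logged fallbacks.
# The accepted formats and the "$"-only currency handling are illustrative.
import logging
from datetime import datetime
from decimal import Decimal, InvalidOperation

log = logging.getLogger("parsing")

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]  # explicit and ordered; no guessing


def parse_date(raw: str) -> datetime | None:
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    log.warning("date parse failure: %r", raw)  # every failure is logged, not hidden
    return None


def parse_currency(raw: str) -> Decimal | None:
    """Strip a known currency symbol and thousands separators, then parse as Decimal."""
    cleaned = raw.strip().lstrip("$").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        log.warning("currency parse failure: %r", raw)
        return None
```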

The "Evaluator" Loop: AI Checking AI

An evaluator agent runs a suite of automated checks after the initial run (a sketch of these checks follows the list):

  1. Replay deterministic steps: re-run SQL templates with the same parameters and confirm checksum matches the Executor’s output.

  2. Sanity tests: compare top-line aggregates against last known stable values (e.g., month-over-month bounds), check for impossible values (negative revenue), ensure join cardinalities make sense.

  3. Counterfactual queries: run alternate queries that test assumptions (e.g., run with and without particular filters, or with an extra grouping to see if aggregation collapses).

  4. Explainability check: ensure every synthesized sentence references one or more provenance items (e.g., “Q3 revenue = $X (SQL: …, checksum: …)”).

  5. Score and veto: if checks fail, the evaluator returns a veto with a clear failure reason and suggested remedial action (human review, improved schema mapping, or re-run with different filters).
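
A sketch of those checks condensed into one deterministic function. The field names, the 50% month-over-month bound, and the rule that the answer must quote the checksum are illustrative assumptions, not part of ADK.

```python
# Sketch of the evaluator's deterministic checks: replay, sanity bounds,
# provenance reference, then score-and-veto. Thresholds and field names
# are illustrative assumptions.
def evaluate(run: dict, replayed_checksum: str, last_stable_total: float) -> dict:
    failures = []

    # 1. Replay check: same template + params must hash to the same resultset.
    if replayed_checksum != run["checksum"]:
        failures.append("replay checksum mismatch")

    # 2. Sanity tests: impossible values and month-over-month bounds.
    if run["total"] < 0:
        failures.append("negative revenue")
    if last_stable_total and abs(run["total"] - last_stable_total) / last_stable_total > 0.5:
        failures.append("total moved >50% vs last stable value")

    # 3. Explainability: the synthesized answer must cite its provenance.
    if run["checksum"] not in run["answer_text"]:
        failures.append("answer does not reference the resultset checksum")

    # 4. Score and veto: any failure blocks the answer and carries a reason.
    return {"veto": bool(failures), "reasons": failures}
```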
