DEV Community

kasi viswanath vandanapu
kasi viswanath vandanapu

Posted on

SQL Comparison Library Architecture

Purpose

Design a deterministic-first library that compares SQL and results on one database instance, explains mismatches, and optionally adds AI judgment.

The library evaluates:

  • user_query
  • actual_sql
  • expected_sql
  • actual_result
  • expected_result

The library does not claim universal semantic equivalence. It reports behavior on the evaluated database context.

Design Principles

  • Deterministic before AI: deterministic metrics are the source of truth; AI is advisory.
  • Structure and result are separate: SQL form and output correctness are scored independently.
  • Diagnostics over raw score: every important score must include a reason and evidence.
  • Configurable semantics: set, multiset, ordered, and numeric tolerance modes are first-class.
  • Production-safe outputs: return machine-friendly schema and human-readable explanations.

High-Level Architecture

The system is organized into five layers:

  1. Validation
  2. Structural comparison
  3. Result comparison
  4. Diagnostic attribution
  5. Optional AI judge
┌─────────────────────────────────────────────────────────────────────┐
│  INPUT                                                              │
│  user_query · actual_sql · expected_sql                             │
│  actual_result · expected_result                                    │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1 · Validation                                               │
│  ├── Parse actual_sql → AST                                         │
│  ├── Parse expected_sql → AST                                       │
│  ├── Canonicalize SQL strings                                       │
│  ├── Execute both queries in a read-only sandbox                    │
│  ├── Capture DML errors · wall-clock timings · schemas              │
│  └── Materialize result sets                                        │
└───────────┬─────────────────────────────┬───────────────────────────┘
            │ parse + exec ok                 │ parse or exec fail
            ▼                                 │
┌───────────────────────────────────────────╒─────────────────────────┐
│  LAYER 2 · Structural Comparison           ┃                         │
│  ├── normalized_sql_match  [boolean gate]   ┃                         │
│  ├── clause_match          [diagnostic]     ┃                         │
│  └── clause_weighted_distance  [score]      ┃                         │
└───────────┬────────────────────────────────┘                         │
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 3 · Result Comparison                                         │
│  ├── result_equality_family  (exact / order-insensitive / ordered)   │
│  ├── schema_match                                                    │
│  ├── row_overlap · cell_overlap  [graded quality]                    │
│  ├── cardinality_match                                               │
│  ├── numeric_tolerance_match  (atol / rtol)                          │
│  └── null_handling_match  (3VL)                                      │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 4 · Diagnostic Attribution                                    │
│  ├── projection / filter / join error scores                         │
│  ├── grouping / aggregate function error scores                      │
│  ├── ordering flag · top-k score · cardinality explosion flag        │
│  ├── per_column_mismatch_map  [explanation layer]                    │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 5 · Optional AI Judge                                         │
│  ├── intent_adequacy_judgment                                        │
│  ├── database_scoped_equivalence_judgment                            │
│  ├── acceptable_deviation_judgment                                   │
│  ├── short_rationale · evidence_bullets · decision_summary           │
│  ├── severity_classification                                         │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┘
            │─────────────────────────────► parse or exec fail path ►┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  FINAL OUTPUT                                                        │
│  deterministic_verdict · ai_advisory_verdict                        │
│  verdict_consistency_flag · overall_score · severity                │
│  structure_score · result_score · diagnostic_scores                 │
│  explanations · mismatch_map · warnings                             │
└─────────────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

1) Validation Layer

Responsibilities:

  • Validate the request payload and runtime config.
  • Parse actual_sql and expected_sql into ASTs using a dialect-aware SQL parser.
  • Canonicalize SQL strings for structural comparison.
  • Execute both queries in a read-only execution sandbox.
  • Capture DML runtime errors, wall-clock timings, materialized result sets, and output relation schemas.

Output type: booleans + errors

Internal intermediates vs external outputs:

  • ast_actual, ast_expected, actual_result_table, expected_result_table, actual_schema, and expected_schema are internal intermediates consumed by downstream layers. They are not included in the final report unless explicitly requested via config.
  • All other Validation outputs (parse_success, execution_success, execution_error, execution_time, normalized_sql) are external-facing and appear in the final report.

Primary outputs:

  • parse_success_actual: True if actual_sql produces a valid AST with no syntax errors.
  • parse_success_expected: True if expected_sql produces a valid AST with no syntax errors.
  • normalized_sql_actual: Canonicalized SQL string of actual_sql after normalization.
  • normalized_sql_expected: Canonicalized SQL string of expected_sql after normalization.
  • ast_actual: Abstract Syntax Tree (AST) of actual_sql.
  • ast_expected: Abstract Syntax Tree (AST) of expected_sql.
  • execution_success_actual: True if actual_sql executes without a DML runtime error.
  • execution_success_expected: True if expected_sql executes without a DML runtime error.
  • execution_error_actual: Error taxonomy category + raw DB error message for actual_sql, if any.
  • execution_error_expected: Error taxonomy category + raw DB error message for expected_sql, if any.
  • execution_time_actual_ms: Wall-clock execution time of actual_sql in milliseconds.
  • execution_time_expected_ms: Wall-clock execution time of expected_sql in milliseconds.
  • actual_result_table: Materialized result set (tuples) returned by actual_sql.
  • expected_result_table: Materialized result set (tuples) returned by expected_sql.
  • actual_schema: Output relation schema of actual_sql — attribute names and data types.
  • expected_schema: Output relation schema of expected_sql — attribute names and data types.

2) Structural Comparison Layer

Responsibilities:

  • Compare canonicalized SQL strings as a fast boolean gate.
  • Perform per-clause AST comparison across all SQL clauses.
  • Compute a clause-weighted structural divergence score.

Output type: scores + clause diffs

Metric role hierarchy:

  • normalized_sql_match: fast boolean gate — if true, deeper structural comparison can be skipped.
  • clause_match: diagnostic detail — reveals per-clause divergence regardless of string match outcome.
  • clause_weighted_distance: structural score — aggregated from per-clause weighted comparison.

Primary outputs:

  • normalized_sql_match: Boolean gate — True if canonicalized SQL strings are identical after normalization.
  • clause_match: Per-clause AST comparison across SELECT (projection), WHERE (selection predicate), HAVING (post-aggregation predicate), JOIN (join graph and join type), GROUP BY (aggregation grain), ORDER BY, LIMIT/OFFSET (top-k), DISTINCT, WINDOW/OVER, and set operators (UNION/INTERSECT/EXCEPT).
  • clause_weighted_distance: Clause-weighted structural divergence score; 0 = identical, higher = more divergent.

3) Result Comparison Layer

Responsibilities:

  • Compare materialized result sets under configurable set, bag (multiset), or ordered semantics.
  • Evaluate relation schema alignment, graded tuple-level and cell-level overlap, cardinality delta, three-valued logic (3VL) NULL consistency, and numeric tolerance.

Output type: scores + mismatch maps

Result equality family:

  • Use one metric family, result_equality_family, with an explicit comparison_mode.
  • comparison_mode selects evaluation semantics: exact / order-insensitive / order-sensitive.
  • result_equality_family.mode_pass is the single boolean pass/fail for the selected mode.
  • Optional mode_details can include non-primary mode outcomes for debugging, but they do not act as independent peer metrics.
  • row_overlap and cell_overlap are graded quality metrics measuring partial match degree.
  • per_column_mismatch_map is the explanation layer and lives in Layer 4 Diagnostic Attribution.

Primary outputs:

  • result_equality_family: Mode-driven result comparison object:
    • comparison_mode: exact | order-insensitive | order-sensitive.
    • mode_pass: Boolean verdict for the selected comparison_mode.
    • mode_details (optional): non-primary mode outcomes retained for diagnostics only.
  • schema_match: Relation schema comparison — attribute count, attribute names, data types, and optional attribute ordering.
  • row_overlap: Graded quality — tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples).
  • cell_overlap: Graded quality — value-level (attribute × tuple) overlap after alignment; measures localized mismatch granularity.
  • cardinality_match: Cardinality comparison — absolute row-count delta and relative cardinality ratio between result sets.
  • numeric_tolerance_match: Floating-point and decimal comparison using atol and rtol thresholds.
  • null_handling_match: True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets.

4) Diagnostic Attribution Layer

Responsibilities:

  • Attribute result mismatch to likely root SQL clauses using heuristic diagnostic scoring.
  • Emit clause-level error scores with confidence and evidence strength.
  • Detect ordering-only mismatch and cardinality fanout or compression anomalies.
  • Produce the per-attribute mismatch explanation map.

Output type: error categories + evidence

Every diagnostic output includes: likely source clause, confidence (0.0–1.0), evidence_strength (low / medium / high), and ambiguous_case flag when multiple root clauses are plausible.

Primary outputs:

  • projection_error_score: Estimated impact of differences in the SELECT projection list — missing, extra, or transformed attributes.
  • filter_error_score: Estimated impact of predicate differences in WHERE (selection) and HAVING (post-aggregation filter).
  • join_error_score: Estimated impact of join graph topology, join predicate (ON condition), or join type differences (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI).
  • grouping_error_score: Estimated impact of incorrect aggregation grain in the GROUP BY clause.
  • aggregate_function_error_score: Estimated impact of wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV).
  • ordering_error_flag: True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal.
  • limit_topk_error_score: Estimated impact of LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences.
  • cardinality_explosion_flag: True when result set cardinality deviates significantly from expected due to join fanout or over-filtering.
  • per_column_mismatch_map: Explanation layer — attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap.
  • confidence: 0.0–1.0 confidence in the primary diagnostic attribution.
  • evidence_strength: low / medium / high — based on corroborating signal count and quality.
  • ambiguous_case: True when multiple clauses are plausible root causes with similar evidence weight.

5) Optional AI Judge Layer

Responsibilities:

  • Apply LLM-as-judge evaluation grounded strictly in deterministic metric outputs.
  • Assess natural-language intent adequacy relative to the user query.
  • Judge database-scoped equivalence as observed on this database instance — not universal semantic equivalence.
  • Classify deviation acceptability using domain and business context.
  • Never issue a verdict that contradicts the deterministic verdict from Layers 1–4.
  • Consume the optional context field from the Request Model (schema metadata and business constraints) to ground judgment in domain-specific rules.

Output type: advisory label + rationale

Verdict structure — the final report separates three fields:

  • deterministic_verdict: authoritative pass/fail derived from Layers 1–4.
  • ai_advisory_verdict: the AI judge's advisory classification.
  • verdict_consistency_flag: True if both verdicts agree; False when they diverge, which triggers a consistency warning.

AI output format — internal reasoning traces are never exposed; instead:

  • short_rationale: one-to-two sentence summary of the AI judge's reasoning.
  • evidence_bullets: ordered list of key deterministic signals the AI judge used.
  • decision_summary: final label and justification only.

Primary outputs:

  • intent_adequacy_judgment: LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql.
  • database_scoped_equivalence_judgment: LLM assessment of observed equivalence on this database instance for the given task — does not claim universal semantic equivalence.
  • acceptable_deviation_judgment: Grounded classification of whether mismatch is harmless (e.g., ordering-only), acceptable (minor rounding), or critical (wrong data).
  • short_rationale: One-to-two sentence summary of the AI judge's reasoning.
  • evidence_bullets: Ordered list of key deterministic signals used in the AI judgment.
  • decision_summary: Final label and justification without internal reasoning traces.
  • severity_classification: Severity triage label: pass / minor issue / moderate issue / major issue / critical failure.
  • confidence: 0.0–1.0 confidence in the AI judgment.
  • evidence_strength: low / medium / high.
  • ambiguous_case: True when the AI judge cannot distinguish between two plausible classifications.

Core Components

A. Request Model

Input contract for one evaluation run.

Suggested fields:

  • user_query: string
  • actual_sql: string
  • expected_sql: string
  • config: evaluation options
  • context: optional schema metadata and business constraints

B. Execution Adapter

DB adapter abstraction to support one configured database engine per run.

Interface shape:

  • execute(sql) -> execution metadata + materialized result
  • explain(sql) -> optional plan metadata

Sandbox isolation:

  • Queries run inside a read-only transaction (or equivalent read-only session mode) to prevent side-effects.
  • A configurable query timeout aborts long-running executions.
  • On completion or timeout, the transaction is rolled back to ensure no persistent state changes.

C. SQL Parser and Normalizer

Responsibilities:

  • Parse SQL into an Abstract Syntax Tree (AST) using a dialect-aware parser (e.g., sqlglot).
  • Canonicalize SQL string: lowercase keywords, collapse whitespace, normalize alias formatting.
  • Normalize commutative expressions (e.g., AND operand ordering) where semantics are preserved.

D. Structural Comparator

Responsibilities:

  • Per-clause AST matcher across projection (SELECT), selection predicate (WHERE), join graph, aggregation grain (GROUP BY), post-aggregation predicate (HAVING), ordering (ORDER BY), top-k (LIMIT/OFFSET), and set operators.
  • Clause-weighted distance calculator producing a normalized structural divergence score.

E. Result Comparator

Responsibilities:

  • Relation schema comparator (attribute names, data types, attribute order).
  • Ordered and unordered tuple comparators for set and bag (multiset) semantics.
  • Cell-level (attribute × tuple) alignment and value matching.
  • Floating-point comparator using atol/rtol tolerance thresholds.
  • NULL-aware comparator respecting three-valued logic (3VL) semantics.

F. Diagnostic Engine

Responsibilities:

  • Convert raw structural and result divergence signals into clause-attributed error scores.
  • Build per-attribute mismatch profiles from the output relation.
  • Generate deterministic, evidence-backed explanations without LLM involvement.

G. Scoring and Severity Engine

Responsibilities:

  • Compute category scores.
  • Aggregate weighted overall score.
  • Map to severity labels.

H. Report Builder

Responsibilities:

  • Emit final output schema.
  • Include evidence, warnings, and plain-language explanations.

I. Optional AI Judge Adapter

Responsibilities:

  • Construct grounded LLM-as-judge prompts that embed deterministic metric evidence.
  • Return structured AI judgments with confidence scores.
  • Record model identifier, prompt version, and token usage for reproducibility.

Metric Processing Pipeline

Recommended sequence:

  1. Validation (parse checks + execution)
  2. Structural comparison metrics
  3. Result comparison metrics
  4. Diagnostic attribution metrics (computed after structural and result evidence is available)
  5. Optional performance metrics
  6. Optional AI judge metrics
  7. Final synthesis — compute deterministic_verdict, run AI judge, emit verdict_consistency_flag

If either query fails parse or execution, continue producing partial diagnostics where possible and return explicit blocked-reason fields.

Failure-State Precedence Rules

Apply deterministic precedence in this exact order during final synthesis:

  1. Parse failure precedence: if parse_success_actual = false or parse_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark downstream structural/result comparisons as blocked.
  2. Execution failure precedence: if both parse checks pass but execution_success_actual = false or execution_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark result comparison fields as blocked.
  3. Result-family precedence: if validation passes, evaluate result_equality_family.mode_pass for the configured comparison_mode. A false mode_pass sets deterministic_verdict = fail regardless of advisory AI output.
  4. Structural/diagnostic precedence: when result equality fails, use structure and diagnostic signals only for attribution and severity refinement; they never override deterministic fail to pass.
  5. Advisory precedence: ai_advisory_verdict can only annotate. It cannot overturn deterministic_verdict in any failure state.
  6. Performance metric precedence: if execution_success_actual = false or execution_success_expected = false, performance metrics (execution_time_ratio, resource_cost_score, plan_complexity_score) are blocked and set to null.

Configuration Model

Key runtime options:

  • comparison_mode: exact / order-insensitive / order-sensitive
  • numeric_abs_tolerance: float
  • numeric_rel_tolerance: float
  • null_equality_mode: strict/sql_semantic
  • require_column_name_match: true/false
  • require_column_order_match: true/false
  • max_rows_for_cell_alignment: integer
  • query_timeout_ms: integer (maximum wall-clock time per query execution before abort)
  • enable_ai_judge: true/false
  • enable_performance_metrics: true/false

Comparison mode mapping:

  • exact: schema, values, and row order must all match.
  • order-insensitive: bag (multiset) semantics — duplicates count, row order ignored.
  • order-sensitive: full ordered comparison including tuple sequence; required for top-k and pagination queries.

Suggested Output Contract

Top-level fields:

  • deterministic_verdict: authoritative pass/fail derived from Layers 1–4.
  • ai_advisory_verdict: advisory classification from the AI judge (optional).
  • verdict_consistency_flag: True if both verdicts agree; False triggers a consistency warning.
  • overall_score: 0.0 when deterministic_verdict is fail due to parse or execution failure; weighted aggregate otherwise.
  • structure_score
  • result_score
  • diagnostic_scores
  • performance_scores
  • severity
  • explanations
  • mismatch_map
  • error_types: list of Error Taxonomy categories triggered during execution (maps to the DML error taxonomy).
  • blocked_reason: present when downstream layers are blocked by an upstream failure; indicates which precedence rule triggered the block (e.g., "parse_failure" or "execution_failure").
  • warnings
  • validity: parse and execution outcomes
  • evidence: clause-level and attribute-level details
  • run_metadata: timings, db identifier, version, config hash

Scoring Strategy

Use weighted aggregation with explicit defaults:

  • validity gate: parse or execution failure sets deterministic_verdict to fail and overall_score to 0.0 before any weighted scoring runs.
  • result gate: result_equality_family.mode_pass acts as a boolean gate — a false mode_pass sets deterministic_verdict to fail. It does not contribute a numeric weight.
  • structure_score: weighted blend of normalized match, clause_match, and derived clause-weighted similarity (computed from clause_weighted_distance at scoring time).
  • result_score: weighted blend of schema, overlap, cardinality, tolerance, and NULL handling (mode_pass is the gate, not a blend input).
  • diagnostic_scores: not just penalties; include confidence and evidence count.

Derived metric note:

  • clause_weighted_similarity is not stored as a primary output; it is derived when needed from clause_weighted_distance.
  • execution_time_ratio is derived from execution_time_actual_ms / execution_time_expected_ms and is not independently stored in Validation.

Severity mapping example:

  • pass
  • minor issue
  • moderate issue
  • major issue
  • critical failure

Error Taxonomy

DML runtime execution error categories:

  • missing_table: Referenced relation does not exist in the database catalog.
  • missing_column: Referenced attribute does not exist in the relation schema.
  • type_mismatch: Operand data types are incompatible for the operation or predicate.
  • division_by_zero: Arithmetic expression produces a division-by-zero runtime error.
  • invalid_aggregation: Aggregate function used in an invalid context (e.g., non-grouped attribute in SELECT without GROUP BY).
  • ambiguous_reference: Attribute reference is ambiguous across joined relations without qualification.
  • permission_error: Executing principal lacks read permission on the referenced relation or schema.
  • unknown_error: Uncategorized runtime error; preserve raw DB error message for inspection.

Always store both the mapped taxonomy category and the raw database error message.

Deterministic Diagnostics Strategy

Detailed heuristic rules for the Diagnostic Engine (Component F). Layer 4 defines the output schema; this section defines the attribution logic.

Heuristic clause-attribution signals:

  • Projection mismatch: output relation schema or SELECT expression differs while tuple cardinality is close.
  • Predicate (filter) mismatch: high tuple overlap asymmetry correlated with WHERE/HAVING predicate divergence in the AST.
  • Join mismatch: cardinality fanout or compression combined with join graph topology or join predicate divergence.
  • Aggregation grain mismatch: aggregate score drift paired with incorrect or missing GROUP BY attributes.
  • Ordering-only mismatch: bag-semantics result sets are equal but ordered result sets differ — ordering_error_flag is set.
  • Top-k mismatch: result set divergence concentrated near the LIMIT/OFFSET truncation boundary.

Every diagnostic output must include:

  • what differed (clause and attribute level)
  • likely source clause (projection / selection predicate / join / aggregation grain / ordering / top-k)
  • confidence score
  • supporting evidence (cardinality delta, shared attribute overlap, predicate diff summary)

Performance and Explainability

Optional performance metrics should be collected only when stable and comparable:

  • execution_time_ratio
  • resource_cost_score
  • plan_complexity_score

Include a warning when noisy conditions reduce comparability (cold cache, variable load, missing plan stats).

AI Judge Guardrails

Normative rules for AI judge behavior. Verdict structure and output format are defined in Layer 5.

  • AI judge is optional and additive.
  • AI output is always advisory — deterministic_verdict from Layers 1–4 takes precedence.
  • AI cannot flip deterministic pass/fail flags.
  • When ai_advisory_verdict disagrees with deterministic_verdict, verdict_consistency_flag is set to False and a consistency warning is included in the report.
  • AI responses expose only: short_rationale, evidence_bullets, decision_summary. No internal reasoning trace is returned.
  • Include model identifier, prompt version, and confidence score.
  • Support policy-based disablement for production pipelines.

Implementation Roadmap

Phase 1: Core MVP

Build first:

  • parse success
  • execution success and error type
  • normalized SQL match
  • clause match
  • result_equality_family (mode-driven)
  • schema match
  • row overlap
  • cardinality match

Phase 2: Debugging Expansion

Add:

  • projection/filter/join/grouping/aggregate diagnostic scores
  • ordering-only mismatch flag
  • per-column mismatch map
  • numeric tolerance match
  • NULL handling match

Phase 3: Production and Research Features

Add:

  • performance metrics
  • AI judge metrics
  • richer severity policy profiles

Metrics Glossary

The Metrics Glossary is the normative reference for metric names, types, and descriptions. Layer sections provide architectural context and processing logic.

Validation Metrics

Metric Layer Type Description
parse_success_actual Validation bool True if actual_sql produces a valid AST with no syntax errors
parse_success_expected Validation bool True if expected_sql produces a valid AST with no syntax errors
normalized_sql_actual Validation string Canonicalized SQL string of actual_sql after normalization
normalized_sql_expected Validation string Canonicalized SQL string of expected_sql after normalization
ast_actual Validation AST Abstract Syntax Tree (AST) of actual_sql
ast_expected Validation AST Abstract Syntax Tree (AST) of expected_sql
execution_success_actual Validation bool True if actual_sql executes without a DML runtime error
execution_success_expected Validation bool True if expected_sql executes without a DML runtime error
execution_error_actual Validation object Error taxonomy category + raw DB error message for actual_sql
execution_error_expected Validation object Error taxonomy category + raw DB error message for expected_sql
execution_time_actual_ms Validation float Wall-clock runtime of actual_sql in milliseconds
execution_time_expected_ms Validation float Wall-clock runtime of expected_sql in milliseconds
actual_result_table Validation relation Materialized result set (tuples) returned by actual_sql
expected_result_table Validation relation Materialized result set (tuples) returned by expected_sql
actual_schema Validation object Output relation schema of actual_sql - attribute names and data types
expected_schema Validation object Output relation schema of expected_sql - attribute names and data types

Structural Comparison Metrics

Metric Role Type Description
normalized_sql_match Boolean gate bool True if canonicalized SQL strings are identical; if true, deeper structural comparison can be skipped
clause_match Diagnostic detail object Per-clause AST comparison: projection (SELECT), selection predicate (WHERE), post-aggregation predicate (HAVING), join graph and join type, aggregation grain (GROUP BY), ordering (ORDER BY), top-k (LIMIT/OFFSET), set operators (UNION/INTERSECT/EXCEPT)
clause_weighted_distance Structural score float Clause-weighted structural divergence; 0 = identical, higher = more divergent

Diagnostic Attribution Metrics

Metric Layer Type Description
projection_error_score Diagnostic float Estimated impact from differences in the SELECT projection list - missing, extra, or transformed attributes
filter_error_score Diagnostic float Estimated impact from predicate differences in WHERE (selection) and HAVING (post-aggregation filter)
join_error_score Diagnostic float Estimated impact from join graph topology, join predicate (ON condition), or join type (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI)
grouping_error_score Diagnostic float Estimated impact from incorrect aggregation grain in the GROUP BY clause
aggregate_function_error_score Diagnostic float Estimated impact from wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV)
ordering_error_flag Diagnostic bool True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal
limit_topk_error_score Diagnostic float Estimated impact from LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences
cardinality_explosion_flag Diagnostic bool True when result set cardinality deviates significantly from expected due to join fanout or over-filtering
per_column_mismatch_map Diagnostic object Explanation layer - attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap
confidence Diagnostic float 0.0-1.0 confidence in the primary diagnostic attribution
evidence_strength Diagnostic label low / medium / high - based on corroborating signal count and quality
ambiguous_case Diagnostic bool True when multiple clauses are plausible root causes with similar evidence weight

Result Comparison Metrics

Metric Role Type Description
result_equality_family Equality family object Single mode-driven equality family with comparison_mode (exact / order-insensitive / order-sensitive), mode_pass, and optional mode_details for diagnostics
schema_match Schema object Relation schema comparison: attribute count, attribute names, data types, and optional attribute ordering
row_overlap Graded quality float Tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples)
cell_overlap Graded quality float Value-level (attribute x tuple) overlap after alignment; measures localized mismatch granularity
cardinality_match Cardinality object Cardinality comparison - absolute row-count delta and relative cardinality ratio between result sets
numeric_tolerance_match Tolerance object Floating-point comparison using atol and rtol: match if
null_handling_match NULL semantics bool True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets

Optional Performance Metrics

Metric Layer Type Description
execution_time_ratio Performance float Derived: ratio of execution_time_actual_ms to execution_time_expected_ms; not independently stored in Validation
resource_cost_score Performance float Relative cost estimate from the query optimizer execution plan (EXPLAIN) if available
plan_complexity_score Performance float Weighted count of expensive query plan operators: full scans, hash joins, sorts, nested loops

Final Output Fields

Metric Layer Type Description
deterministic_verdict Final output label Authoritative pass/fail derived from Layers 1-4 deterministic metrics
ai_advisory_verdict Final output label Advisory classification from the AI judge; never overrides deterministic_verdict
verdict_consistency_flag Final output bool True if both verdicts agree; False triggers a consistency warning
blocked_reason Final output string Present when downstream layers are blocked by an upstream failure; indicates the triggering precedence rule

AI Judge Metrics

Metric Layer Type Description
intent_adequacy_judgment AI Judge label LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql
database_scoped_equivalence_judgment AI Judge label LLM assessment of observed equivalence on this database instance - does not claim universal semantic equivalence
acceptable_deviation_judgment AI Judge label Grounded classification: harmless (ordering-only), acceptable (minor rounding), or critical (wrong data)
short_rationale AI Judge string One-to-two sentence summary of the AI judge's reasoning
evidence_bullets AI Judge list Ordered list of key deterministic signals used in the AI judgment
decision_summary AI Judge string Final label and justification without internal reasoning traces
severity_classification AI Judge label Severity triage label: pass / minor issue / moderate issue / major issue / critical failure
confidence AI Judge float 0.0-1.0 confidence in the AI judgment
evidence_strength AI Judge label low / medium / high - based on quality of deterministic evidence available
ambiguous_case AI Judge bool True when the AI judge cannot distinguish between two plausible classifications

Top comments (0)