kasi viswanath vandanapu

Posted on Apr 1

SQL Comparison Library Architecture

#sql #ai #evaluation #llm

Purpose

Design a deterministic-first library that compares SQL and results on one database instance, explains mismatches, and optionally adds AI judgment.

The library evaluates:

user_query
actual_sql
expected_sql
actual_result
expected_result

The library does not claim universal semantic equivalence. It reports behavior on the evaluated database context.

Design Principles

Deterministic before AI: deterministic metrics are the source of truth; AI is advisory.
Structure and result are separate: SQL form and output correctness are scored independently.
Diagnostics over raw score: every important score must include a reason and evidence.
Configurable semantics: set, multiset, ordered, and numeric tolerance modes are first-class.
Production-safe outputs: return machine-friendly schema and human-readable explanations.

High-Level Architecture

The system is organized into five layers:

Validation
Structural comparison
Result comparison
Diagnostic attribution
Optional AI judge

┌─────────────────────────────────────────────────────────────────────┐
│  INPUT                                                              │
│  user_query · actual_sql · expected_sql                             │
│  actual_result · expected_result                                    │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  LAYER 1 · Validation                                               │
│  ├── Parse actual_sql → AST                                         │
│  ├── Parse expected_sql → AST                                       │
│  ├── Canonicalize SQL strings                                       │
│  ├── Execute both queries in a read-only sandbox                    │
│  ├── Capture DML errors · wall-clock timings · schemas              │
│  └── Materialize result sets                                        │
└───────────┬─────────────────────────────┬───────────────────────────┘
            │ parse + exec ok                 │ parse or exec fail
            ▼                                 │
┌───────────────────────────────────────────╒─────────────────────────┐
│  LAYER 2 · Structural Comparison           ┃                         │
│  ├── normalized_sql_match  [boolean gate]   ┃                         │
│  ├── clause_match          [diagnostic]     ┃                         │
│  └── clause_weighted_distance  [score]      ┃                         │
└───────────┬────────────────────────────────┘                         │
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 3 · Result Comparison                                         │
│  ├── result_equality_family  (exact / order-insensitive / ordered)   │
│  ├── schema_match                                                    │
│  ├── row_overlap · cell_overlap  [graded quality]                    │
│  ├── cardinality_match                                               │
│  ├── numeric_tolerance_match  (atol / rtol)                          │
│  └── null_handling_match  (3VL)                                      │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 4 · Diagnostic Attribution                                    │
│  ├── projection / filter / join error scores                         │
│  ├── grouping / aggregate function error scores                      │
│  ├── ordering flag · top-k score · cardinality explosion flag        │
│  ├── per_column_mismatch_map  [explanation layer]                    │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┤
            ▼                                                           │
┌─────────────────────────────────────────────────────────────────────┤
│  LAYER 5 · Optional AI Judge                                         │
│  ├── intent_adequacy_judgment                                        │
│  ├── database_scoped_equivalence_judgment                            │
│  ├── acceptable_deviation_judgment                                   │
│  ├── short_rationale · evidence_bullets · decision_summary           │
│  ├── severity_classification                                         │
│  └── confidence · evidence_strength · ambiguous_case                │
└───────────┬─────────────────────────────────────────────────────────┘
            │─────────────────────────────► parse or exec fail path ►┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  FINAL OUTPUT                                                        │
│  deterministic_verdict · ai_advisory_verdict                        │
│  verdict_consistency_flag · overall_score · severity                │
│  structure_score · result_score · diagnostic_scores                 │
│  explanations · mismatch_map · warnings                             │
└─────────────────────────────────────────────────────────────────────┘

1) Validation Layer

Responsibilities:

Validate the request payload and runtime config.
Parse actual_sql and expected_sql into ASTs using a dialect-aware SQL parser.
Canonicalize SQL strings for structural comparison.
Execute both queries in a read-only execution sandbox.
Capture DML runtime errors, wall-clock timings, materialized result sets, and output relation schemas.

Output type: booleans + errors

Internal intermediates vs external outputs:

ast_actual, ast_expected, actual_result_table, expected_result_table, actual_schema, and expected_schema are internal intermediates consumed by downstream layers. They are not included in the final report unless explicitly requested via config.
All other Validation outputs (parse_success, execution_success, execution_error, execution_time, normalized_sql) are external-facing and appear in the final report.

Primary outputs:

parse_success_actual: True if actual_sql produces a valid AST with no syntax errors.
parse_success_expected: True if expected_sql produces a valid AST with no syntax errors.
normalized_sql_actual: Canonicalized SQL string of actual_sql after normalization.
normalized_sql_expected: Canonicalized SQL string of expected_sql after normalization.
ast_actual: Abstract Syntax Tree (AST) of actual_sql.
ast_expected: Abstract Syntax Tree (AST) of expected_sql.
execution_success_actual: True if actual_sql executes without a DML runtime error.
execution_success_expected: True if expected_sql executes without a DML runtime error.
execution_error_actual: Error taxonomy category + raw DB error message for actual_sql, if any.
execution_error_expected: Error taxonomy category + raw DB error message for expected_sql, if any.
execution_time_actual_ms: Wall-clock execution time of actual_sql in milliseconds.
execution_time_expected_ms: Wall-clock execution time of expected_sql in milliseconds.
actual_result_table: Materialized result set (tuples) returned by actual_sql.
expected_result_table: Materialized result set (tuples) returned by expected_sql.
actual_schema: Output relation schema of actual_sql — attribute names and data types.
expected_schema: Output relation schema of expected_sql — attribute names and data types.

2) Structural Comparison Layer

Responsibilities:

Compare canonicalized SQL strings as a fast boolean gate.
Perform per-clause AST comparison across all SQL clauses.
Compute a clause-weighted structural divergence score.

Output type: scores + clause diffs

Metric role hierarchy:

normalized_sql_match: fast boolean gate — if true, deeper structural comparison can be skipped.
clause_match: diagnostic detail — reveals per-clause divergence regardless of string match outcome.
clause_weighted_distance: structural score — aggregated from per-clause weighted comparison.

Primary outputs:

normalized_sql_match: Boolean gate — True if canonicalized SQL strings are identical after normalization.
clause_match: Per-clause AST comparison across SELECT (projection), WHERE (selection predicate), HAVING (post-aggregation predicate), JOIN (join graph and join type), GROUP BY (aggregation grain), ORDER BY, LIMIT/OFFSET (top-k), DISTINCT, WINDOW/OVER, and set operators (UNION/INTERSECT/EXCEPT).
clause_weighted_distance: Clause-weighted structural divergence score; 0 = identical, higher = more divergent.

3) Result Comparison Layer

Responsibilities:

Compare materialized result sets under configurable set, bag (multiset), or ordered semantics.
Evaluate relation schema alignment, graded tuple-level and cell-level overlap, cardinality delta, three-valued logic (3VL) NULL consistency, and numeric tolerance.

Output type: scores + mismatch maps

Result equality family:

Use one metric family, result_equality_family, with an explicit comparison_mode.
comparison_mode selects evaluation semantics: exact / order-insensitive / order-sensitive.
result_equality_family.mode_pass is the single boolean pass/fail for the selected mode.
Optional mode_details can include non-primary mode outcomes for debugging, but they do not act as independent peer metrics.
row_overlap and cell_overlap are graded quality metrics measuring partial match degree.
per_column_mismatch_map is the explanation layer and lives in Layer 4 Diagnostic Attribution.

Primary outputs:

result_equality_family: Mode-driven result comparison object:
- comparison_mode: exact | order-insensitive | order-sensitive.
- mode_pass: Boolean verdict for the selected comparison_mode.
- mode_details (optional): non-primary mode outcomes retained for diagnostics only.
schema_match: Relation schema comparison — attribute count, attribute names, data types, and optional attribute ordering.
row_overlap: Graded quality — tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples).
cell_overlap: Graded quality — value-level (attribute × tuple) overlap after alignment; measures localized mismatch granularity.
cardinality_match: Cardinality comparison — absolute row-count delta and relative cardinality ratio between result sets.
numeric_tolerance_match: Floating-point and decimal comparison using atol and rtol thresholds.
null_handling_match: True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets.

4) Diagnostic Attribution Layer

Responsibilities:

Attribute result mismatch to likely root SQL clauses using heuristic diagnostic scoring.
Emit clause-level error scores with confidence and evidence strength.
Detect ordering-only mismatch and cardinality fanout or compression anomalies.
Produce the per-attribute mismatch explanation map.

Output type: error categories + evidence

Every diagnostic output includes: likely source clause, confidence (0.0–1.0), evidence_strength (low / medium / high), and ambiguous_case flag when multiple root clauses are plausible.

Primary outputs:

projection_error_score: Estimated impact of differences in the SELECT projection list — missing, extra, or transformed attributes.
filter_error_score: Estimated impact of predicate differences in WHERE (selection) and HAVING (post-aggregation filter).
join_error_score: Estimated impact of join graph topology, join predicate (ON condition), or join type differences (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI).
grouping_error_score: Estimated impact of incorrect aggregation grain in the GROUP BY clause.
aggregate_function_error_score: Estimated impact of wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV).
ordering_error_flag: True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal.
limit_topk_error_score: Estimated impact of LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences.
cardinality_explosion_flag: True when result set cardinality deviates significantly from expected due to join fanout or over-filtering.
per_column_mismatch_map: Explanation layer — attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap.
confidence: 0.0–1.0 confidence in the primary diagnostic attribution.
evidence_strength: low / medium / high — based on corroborating signal count and quality.
ambiguous_case: True when multiple clauses are plausible root causes with similar evidence weight.

5) Optional AI Judge Layer

Responsibilities:

Apply LLM-as-judge evaluation grounded strictly in deterministic metric outputs.
Assess natural-language intent adequacy relative to the user query.
Judge database-scoped equivalence as observed on this database instance — not universal semantic equivalence.
Classify deviation acceptability using domain and business context.
Never issue a verdict that contradicts the deterministic verdict from Layers 1–4.
Consume the optional context field from the Request Model (schema metadata and business constraints) to ground judgment in domain-specific rules.

Output type: advisory label + rationale

Verdict structure — the final report separates three fields:

deterministic_verdict: authoritative pass/fail derived from Layers 1–4.
ai_advisory_verdict: the AI judge's advisory classification.
verdict_consistency_flag: True if both verdicts agree; False when they diverge, which triggers a consistency warning.

AI output format — internal reasoning traces are never exposed; instead:

short_rationale: one-to-two sentence summary of the AI judge's reasoning.
evidence_bullets: ordered list of key deterministic signals the AI judge used.
decision_summary: final label and justification only.

Primary outputs:

intent_adequacy_judgment: LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql.
database_scoped_equivalence_judgment: LLM assessment of observed equivalence on this database instance for the given task — does not claim universal semantic equivalence.
acceptable_deviation_judgment: Grounded classification of whether mismatch is harmless (e.g., ordering-only), acceptable (minor rounding), or critical (wrong data).
short_rationale: One-to-two sentence summary of the AI judge's reasoning.
evidence_bullets: Ordered list of key deterministic signals used in the AI judgment.
decision_summary: Final label and justification without internal reasoning traces.
severity_classification: Severity triage label: pass / minor issue / moderate issue / major issue / critical failure.
confidence: 0.0–1.0 confidence in the AI judgment.
evidence_strength: low / medium / high.
ambiguous_case: True when the AI judge cannot distinguish between two plausible classifications.

Core Components

A. Request Model

Input contract for one evaluation run.

Suggested fields:

user_query: string
actual_sql: string
expected_sql: string
config: evaluation options
context: optional schema metadata and business constraints

B. Execution Adapter

DB adapter abstraction to support one configured database engine per run.

Interface shape:

execute(sql) -> execution metadata + materialized result
explain(sql) -> optional plan metadata

Sandbox isolation:

Queries run inside a read-only transaction (or equivalent read-only session mode) to prevent side-effects.
A configurable query timeout aborts long-running executions.
On completion or timeout, the transaction is rolled back to ensure no persistent state changes.

C. SQL Parser and Normalizer

Responsibilities:

Parse SQL into an Abstract Syntax Tree (AST) using a dialect-aware parser (e.g., sqlglot).
Canonicalize SQL string: lowercase keywords, collapse whitespace, normalize alias formatting.
Normalize commutative expressions (e.g., AND operand ordering) where semantics are preserved.

D. Structural Comparator

Responsibilities:

Per-clause AST matcher across projection (SELECT), selection predicate (WHERE), join graph, aggregation grain (GROUP BY), post-aggregation predicate (HAVING), ordering (ORDER BY), top-k (LIMIT/OFFSET), and set operators.
Clause-weighted distance calculator producing a normalized structural divergence score.

E. Result Comparator

Responsibilities:

Relation schema comparator (attribute names, data types, attribute order).
Ordered and unordered tuple comparators for set and bag (multiset) semantics.
Cell-level (attribute × tuple) alignment and value matching.
Floating-point comparator using atol/rtol tolerance thresholds.
NULL-aware comparator respecting three-valued logic (3VL) semantics.

F. Diagnostic Engine

Responsibilities:

Convert raw structural and result divergence signals into clause-attributed error scores.
Build per-attribute mismatch profiles from the output relation.
Generate deterministic, evidence-backed explanations without LLM involvement.

G. Scoring and Severity Engine

Responsibilities:

Compute category scores.
Aggregate weighted overall score.
Map to severity labels.

H. Report Builder

Responsibilities:

Emit final output schema.
Include evidence, warnings, and plain-language explanations.

I. Optional AI Judge Adapter

Responsibilities:

Construct grounded LLM-as-judge prompts that embed deterministic metric evidence.
Return structured AI judgments with confidence scores.
Record model identifier, prompt version, and token usage for reproducibility.

Metric Processing Pipeline

Recommended sequence:

Validation (parse checks + execution)
Structural comparison metrics
Result comparison metrics
Diagnostic attribution metrics (computed after structural and result evidence is available)
Optional performance metrics
Optional AI judge metrics
Final synthesis — compute deterministic_verdict, run AI judge, emit verdict_consistency_flag

If either query fails parse or execution, continue producing partial diagnostics where possible and return explicit blocked-reason fields.

Failure-State Precedence Rules

Apply deterministic precedence in this exact order during final synthesis:

Parse failure precedence: if parse_success_actual = false or parse_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark downstream structural/result comparisons as blocked.
Execution failure precedence: if both parse checks pass but execution_success_actual = false or execution_success_expected = false, set deterministic_verdict = fail, set severity = critical failure, and mark result comparison fields as blocked.
Result-family precedence: if validation passes, evaluate result_equality_family.mode_pass for the configured comparison_mode. A false mode_pass sets deterministic_verdict = fail regardless of advisory AI output.
Structural/diagnostic precedence: when result equality fails, use structure and diagnostic signals only for attribution and severity refinement; they never override deterministic fail to pass.
Advisory precedence: ai_advisory_verdict can only annotate. It cannot overturn deterministic_verdict in any failure state.
Performance metric precedence: if execution_success_actual = false or execution_success_expected = false, performance metrics (execution_time_ratio, resource_cost_score, plan_complexity_score) are blocked and set to null.

Configuration Model

Key runtime options:

comparison_mode: exact / order-insensitive / order-sensitive
numeric_abs_tolerance: float
numeric_rel_tolerance: float
null_equality_mode: strict/sql_semantic
require_column_name_match: true/false
require_column_order_match: true/false
max_rows_for_cell_alignment: integer
query_timeout_ms: integer (maximum wall-clock time per query execution before abort)
enable_ai_judge: true/false
enable_performance_metrics: true/false

Comparison mode mapping:

exact: schema, values, and row order must all match.
order-insensitive: bag (multiset) semantics — duplicates count, row order ignored.
order-sensitive: full ordered comparison including tuple sequence; required for top-k and pagination queries.

Suggested Output Contract

Top-level fields:

deterministic_verdict: authoritative pass/fail derived from Layers 1–4.
ai_advisory_verdict: advisory classification from the AI judge (optional).
verdict_consistency_flag: True if both verdicts agree; False triggers a consistency warning.
overall_score: 0.0 when deterministic_verdict is fail due to parse or execution failure; weighted aggregate otherwise.
structure_score
result_score
diagnostic_scores
performance_scores
severity
explanations
mismatch_map
error_types: list of Error Taxonomy categories triggered during execution (maps to the DML error taxonomy).
blocked_reason: present when downstream layers are blocked by an upstream failure; indicates which precedence rule triggered the block (e.g., "parse_failure" or "execution_failure").
warnings
validity: parse and execution outcomes
evidence: clause-level and attribute-level details
run_metadata: timings, db identifier, version, config hash

Scoring Strategy

Use weighted aggregation with explicit defaults:

validity gate: parse or execution failure sets deterministic_verdict to fail and overall_score to 0.0 before any weighted scoring runs.
result gate: result_equality_family.mode_pass acts as a boolean gate — a false mode_pass sets deterministic_verdict to fail. It does not contribute a numeric weight.
structure_score: weighted blend of normalized match, clause_match, and derived clause-weighted similarity (computed from clause_weighted_distance at scoring time).
result_score: weighted blend of schema, overlap, cardinality, tolerance, and NULL handling (mode_pass is the gate, not a blend input).
diagnostic_scores: not just penalties; include confidence and evidence count.

Derived metric note:

clause_weighted_similarity is not stored as a primary output; it is derived when needed from clause_weighted_distance.
execution_time_ratio is derived from execution_time_actual_ms / execution_time_expected_ms and is not independently stored in Validation.

Severity mapping example:

pass
minor issue
moderate issue
major issue
critical failure

Error Taxonomy

DML runtime execution error categories:

missing_table: Referenced relation does not exist in the database catalog.
missing_column: Referenced attribute does not exist in the relation schema.
type_mismatch: Operand data types are incompatible for the operation or predicate.
division_by_zero: Arithmetic expression produces a division-by-zero runtime error.
invalid_aggregation: Aggregate function used in an invalid context (e.g., non-grouped attribute in SELECT without GROUP BY).
ambiguous_reference: Attribute reference is ambiguous across joined relations without qualification.
permission_error: Executing principal lacks read permission on the referenced relation or schema.
unknown_error: Uncategorized runtime error; preserve raw DB error message for inspection.

Always store both the mapped taxonomy category and the raw database error message.

Deterministic Diagnostics Strategy

Detailed heuristic rules for the Diagnostic Engine (Component F). Layer 4 defines the output schema; this section defines the attribution logic.

Heuristic clause-attribution signals:

Projection mismatch: output relation schema or SELECT expression differs while tuple cardinality is close.
Predicate (filter) mismatch: high tuple overlap asymmetry correlated with WHERE/HAVING predicate divergence in the AST.
Join mismatch: cardinality fanout or compression combined with join graph topology or join predicate divergence.
Aggregation grain mismatch: aggregate score drift paired with incorrect or missing GROUP BY attributes.
Ordering-only mismatch: bag-semantics result sets are equal but ordered result sets differ — ordering_error_flag is set.
Top-k mismatch: result set divergence concentrated near the LIMIT/OFFSET truncation boundary.

Every diagnostic output must include:

what differed (clause and attribute level)
likely source clause (projection / selection predicate / join / aggregation grain / ordering / top-k)
confidence score
supporting evidence (cardinality delta, shared attribute overlap, predicate diff summary)

Performance and Explainability

Optional performance metrics should be collected only when stable and comparable:

execution_time_ratio
resource_cost_score
plan_complexity_score

Include a warning when noisy conditions reduce comparability (cold cache, variable load, missing plan stats).

AI Judge Guardrails

Normative rules for AI judge behavior. Verdict structure and output format are defined in Layer 5.

AI judge is optional and additive.
AI output is always advisory — deterministic_verdict from Layers 1–4 takes precedence.
AI cannot flip deterministic pass/fail flags.
When ai_advisory_verdict disagrees with deterministic_verdict, verdict_consistency_flag is set to False and a consistency warning is included in the report.
AI responses expose only: short_rationale, evidence_bullets, decision_summary. No internal reasoning trace is returned.
Include model identifier, prompt version, and confidence score.
Support policy-based disablement for production pipelines.

Implementation Roadmap

Phase 1: Core MVP

Build first:

parse success
execution success and error type
normalized SQL match
clause match
result_equality_family (mode-driven)
schema match
row overlap
cardinality match

Phase 2: Debugging Expansion

Add:

projection/filter/join/grouping/aggregate diagnostic scores
ordering-only mismatch flag
per-column mismatch map
numeric tolerance match
NULL handling match

Phase 3: Production and Research Features

Add:

performance metrics
AI judge metrics
richer severity policy profiles

Metrics Glossary

The Metrics Glossary is the normative reference for metric names, types, and descriptions. Layer sections provide architectural context and processing logic.

Validation Metrics

Metric	Layer	Type	Description
parse_success_actual	Validation	bool	True if actual_sql produces a valid AST with no syntax errors
parse_success_expected	Validation	bool	True if expected_sql produces a valid AST with no syntax errors
normalized_sql_actual	Validation	string	Canonicalized SQL string of actual_sql after normalization
normalized_sql_expected	Validation	string	Canonicalized SQL string of expected_sql after normalization
ast_actual	Validation	AST	Abstract Syntax Tree (AST) of actual_sql
ast_expected	Validation	AST	Abstract Syntax Tree (AST) of expected_sql
execution_success_actual	Validation	bool	True if actual_sql executes without a DML runtime error
execution_success_expected	Validation	bool	True if expected_sql executes without a DML runtime error
execution_error_actual	Validation	object	Error taxonomy category + raw DB error message for actual_sql
execution_error_expected	Validation	object	Error taxonomy category + raw DB error message for expected_sql
execution_time_actual_ms	Validation	float	Wall-clock runtime of actual_sql in milliseconds
execution_time_expected_ms	Validation	float	Wall-clock runtime of expected_sql in milliseconds
actual_result_table	Validation	relation	Materialized result set (tuples) returned by actual_sql
expected_result_table	Validation	relation	Materialized result set (tuples) returned by expected_sql
actual_schema	Validation	object	Output relation schema of actual_sql - attribute names and data types
expected_schema	Validation	object	Output relation schema of expected_sql - attribute names and data types

Structural Comparison Metrics

Metric	Role	Type	Description
normalized_sql_match	Boolean gate	bool	True if canonicalized SQL strings are identical; if true, deeper structural comparison can be skipped
clause_match	Diagnostic detail	object	Per-clause AST comparison: projection (SELECT), selection predicate (WHERE), post-aggregation predicate (HAVING), join graph and join type, aggregation grain (GROUP BY), ordering (ORDER BY), top-k (LIMIT/OFFSET), set operators (UNION/INTERSECT/EXCEPT)
clause_weighted_distance	Structural score	float	Clause-weighted structural divergence; 0 = identical, higher = more divergent

Diagnostic Attribution Metrics

Metric	Layer	Type	Description
projection_error_score	Diagnostic	float	Estimated impact from differences in the SELECT projection list - missing, extra, or transformed attributes
filter_error_score	Diagnostic	float	Estimated impact from predicate differences in WHERE (selection) and HAVING (post-aggregation filter)
join_error_score	Diagnostic	float	Estimated impact from join graph topology, join predicate (ON condition), or join type (INNER / LEFT / RIGHT / FULL / CROSS / ANTI / SEMI)
grouping_error_score	Diagnostic	float	Estimated impact from incorrect aggregation grain in the GROUP BY clause
aggregate_function_error_score	Diagnostic	float	Estimated impact from wrong aggregate function (SUM vs COUNT vs AVG vs MIN vs MAX vs STDDEV)
ordering_error_flag	Diagnostic	bool	True when mismatch is caused solely by tuple ordering; bag-semantics result sets are equal
limit_topk_error_score	Diagnostic	float	Estimated impact from LIMIT, TOP, FETCH FIRST, OFFSET, or ROW_NUMBER top-k differences
cardinality_explosion_flag	Diagnostic	bool	True when result set cardinality deviates significantly from expected due to join fanout or over-filtering
per_column_mismatch_map	Diagnostic	object	Explanation layer - attribute-level mismatch rate and severity breakdown explaining row_overlap and cell_overlap
confidence	Diagnostic	float	0.0-1.0 confidence in the primary diagnostic attribution
evidence_strength	Diagnostic	label	low / medium / high - based on corroborating signal count and quality
ambiguous_case	Diagnostic	bool	True when multiple clauses are plausible root causes with similar evidence weight

Result Comparison Metrics

Metric	Role	Type	Description
result_equality_family	Equality family	object	Single mode-driven equality family with comparison_mode (exact / order-insensitive / order-sensitive), mode_pass, and optional mode_details for diagnostics
schema_match	Schema	object	Relation schema comparison: attribute count, attribute names, data types, and optional attribute ordering
row_overlap	Graded quality	float	Tuple-level overlap expressed as Jaccard similarity or F1 score (precision + recall over shared tuples)
cell_overlap	Graded quality	float	Value-level (attribute x tuple) overlap after alignment; measures localized mismatch granularity
cardinality_match	Cardinality	object	Cardinality comparison - absolute row-count delta and relative cardinality ratio between result sets
numeric_tolerance_match	Tolerance	object	Floating-point comparison using atol and rtol: match if
null_handling_match	NULL semantics	bool	True if NULL placement and three-valued logic (3VL) behavior is consistent across result sets

Optional Performance Metrics

Metric	Layer	Type	Description
execution_time_ratio	Performance	float	Derived: ratio of execution_time_actual_ms to execution_time_expected_ms; not independently stored in Validation
resource_cost_score	Performance	float	Relative cost estimate from the query optimizer execution plan (EXPLAIN) if available
plan_complexity_score	Performance	float	Weighted count of expensive query plan operators: full scans, hash joins, sorts, nested loops

Final Output Fields

Metric	Layer	Type	Description
deterministic_verdict	Final output	label	Authoritative pass/fail derived from Layers 1-4 deterministic metrics
ai_advisory_verdict	Final output	label	Advisory classification from the AI judge; never overrides deterministic_verdict
verdict_consistency_flag	Final output	bool	True if both verdicts agree; False triggers a consistency warning
blocked_reason	Final output	string	Present when downstream layers are blocked by an upstream failure; indicates the triggering precedence rule

AI Judge Metrics

Metric	Layer	Type	Description
intent_adequacy_judgment	AI Judge	label	LLM-as-judge assessment of whether actual_sql satisfies user_query intent as well as expected_sql
database_scoped_equivalence_judgment	AI Judge	label	LLM assessment of observed equivalence on this database instance - does not claim universal semantic equivalence
acceptable_deviation_judgment	AI Judge	label	Grounded classification: harmless (ordering-only), acceptable (minor rounding), or critical (wrong data)
short_rationale	AI Judge	string	One-to-two sentence summary of the AI judge's reasoning
evidence_bullets	AI Judge	list	Ordered list of key deterministic signals used in the AI judgment
decision_summary	AI Judge	string	Final label and justification without internal reasoning traces
severity_classification	AI Judge	label	Severity triage label: pass / minor issue / moderate issue / major issue / critical failure
confidence	AI Judge	float	0.0-1.0 confidence in the AI judgment
evidence_strength	AI Judge	label	low / medium / high - based on quality of deterministic evidence available
ambiguous_case	AI Judge	bool	True when the AI judge cannot distinguish between two plausible classifications