A technical deep-dive on the architecture of modern data quality systems
The False Positive Problem
Your pipeline reports success. Schema validation passes. Record counts match. NULL constraints hold. Yet your downstream systems are making decisions on garbage data.
This isn't a monitoring failure—it's an architectural blind spot. Traditional data quality systems validate structure while semantic correctness goes unchecked.
The cost? Financial institutions lose an average of $15 million annually to poor data quality, and Citibank alone faced $536 million in fines between 2020 and 2024 for inadequate data governance.
The Three Layers of Data Validation
- Layer 1, structural: schema, types, NULL constraints
- Layer 2, statistical: volumes, distributions, outliers
- Layer 3, semantic: whether values make sense in their business context
Most systems stop at Layer 2. They catch type errors and statistical outliers but miss semantic invalidity: data that is structurally perfect but contextually wrong.
Why Current Architectures Fail
1. The Extract-and-Inspect Bottleneck
Traditional data quality platforms follow an extract-and-inspect model where data is pulled from sources into the quality platform for validation. This creates:
- Scalability issues: Full table scans don't scale to modern data volumes
- Latency problems: Data moves through multiple hops before validation
- Resource constraints: Compute and storage costs explode with data growth
2. The Metadata-Only Trap
Data observability vendors addressed scalability by leveraging metadata to monitor quality without scanning entire datasets. Smart move for performance, but:
The trade-off: observability platforms sacrifice depth of monitoring for scalability.
Metadata tells you record counts changed. It doesn't tell you the records contain test data.
3. The Rule Explosion
Organizations pile up checks for minor issues like null values while critical errors go undetected; audits show 40-60% of checks target basic problems that rarely occur.
The pattern repeats:
- Edge case discovered in production
- New rule written to catch it
- Rule maintenance burden grows
- Coverage remains incomplete
The fundamental problem: Rules require knowing failure modes in advance. You can't write rules for unknowns.
The Architecture Shift: Push-Down + Semantic Validation
Modern solutions combine the strengths of both architectures by pushing queries down into the data platform while adding semantic understanding:
Key principle: Leverage platform-native compute for structural checks, use LLMs for semantic validation.
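As a minimal sketch of the push-down half (the `run_query` callable, table, and column names are placeholders rather than a specific vendor API, and `COUNTIF` is BigQuery syntax), structural checks compile to SQL that executes inside the platform, so only aggregate counts ever leave it:

from typing import Callable, Dict, List, Tuple

def build_pushdown_checks(table: str, not_null_cols: List[str],
                          range_checks: Dict[str, Tuple[float, float]]) -> str:
    """Compile structural checks into one SQL query that runs in the warehouse."""
    selects = ["COUNT(*) AS total_rows"]
    for col in not_null_cols:
        selects.append(f"COUNTIF({col} IS NULL) AS {col}_nulls")
    for col, (lo, hi) in range_checks.items():
        selects.append(f"COUNTIF({col} < {lo} OR {col} > {hi}) AS {col}_out_of_range")
    return f"SELECT {', '.join(selects)} FROM {table}"

def run_structural_layer(run_query: Callable[[str], dict], table: str) -> dict:
    # run_query is whatever client the platform provides (BigQuery, Snowflake, ...)
    sql = build_pushdown_checks(
        table,
        not_null_cols=["symbol", "published_at"],
        range_checks={"sentiment_score": (-1, 1)},
    )
    return run_query(sql)  # e.g. {"total_rows": 50000, "symbol_nulls": 0, ...}

The semantic half, covered next, samples a small slice of the data and hands it to an LLM.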
LLM-Based Semantic Validation
Why LLMs Work for Data Quality
An LLM-based workflow for automated tabular data validation uses both semantic meaning and statistical properties to define validation rules.
Unlike statistical methods that see "TEST_STOCK" as just another string, LLMs understand:
- NYSE/NASDAQ ticker patterns
- Test data conventions
- Domain-specific terminology
- Temporal relationships
- Reference validity
The Embedding Architecture
Critical implementation details:
- Embedding model selection: transformer-based, instruction-tuned embeddings achieve top performance in 2025, with models like Gemini Embedding setting new records
- Semantic similarity for validation: pre-trained embedding models compare meaning rather than literal strings
- Context engineering: LLMs can evaluate content against complex, subjective, and contextual criteria that would be difficult to express as traditional rules
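A minimal sketch of the semantic-matching idea, using sentence-transformers as a stand-in for whichever embedding model you deploy; the reference list, candidate values, and 0.5 threshold are illustrative:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Reference values the field is expected to resemble (e.g. known-good tickers)
reference = ["AAPL", "MSFT", "GOOGL", "JPM", "XOM"]
candidates = ["NVDA", "TEST_STOCK", "lorem ipsum", "TSLA"]

ref_vecs = model.encode(reference, normalize_embeddings=True)
cand_vecs = model.encode(candidates, normalize_embeddings=True)

# With normalized vectors, the dot product is cosine similarity
nearest = (cand_vecs @ ref_vecs.T).max(axis=1)  # best match per candidate

for value, score in zip(candidates, nearest):
    flag = "SUSPECT" if score < 0.5 else "ok"
    print(f"{value:12s} nearest-reference similarity={score:.2f} -> {flag}")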
Prompt Engineering for Consistency
The challenge: LLMs are probabilistic. You need deterministic validation.
Solution pattern:
import json  # json.dumps is used to inline the sample below

# sample_data: a small list[dict] sample drawn from the batch under validation
prompt = f"""
You are a data quality validator analyzing financial news data.
TASK: Identify semantic anomalies in the sample below.
DATA SAMPLE:
{json.dumps(sample_data)}
CHECK FOR:
1. Test Data Patterns
- Prefixes: test_, fake_, dummy_, placeholder_
- Suspicious values: "test_user", "lorem ipsum"
- Sequential or generated IDs
2. Domain Validity
- Stock symbols must exist on NYSE/NASDAQ/AMEX
- Sentiment scores must be in [-1, 1]
- Dates must be <= current date
3. Statistical Coherence
- Sentiment distribution should be natural (not all 0.5)
- Publication times should vary (not all midnight)
- Author count should match typical patterns
OUTPUT FORMAT (JSON only):
{{
"has_anomalies": boolean,
"confidence": float (0.0-1.0),
"anomalies": [
{{
"type": "test_data|invalid_reference|temporal_error|statistical_outlier",
"field": "column_name",
"evidence": ["specific example 1", "specific example 2"],
"severity": "LOW|MEDIUM|HIGH|CRITICAL",
"affected_rows": int
}}
],
"summary": "brief explanation"
}}
CONSTRAINTS:
- Only flag anomalies with >70% confidence
- Provide specific evidence for each finding
- Return valid JSON only (no markdown formatting)
"""
Key techniques:
- Low temperature (0.1) for consistency
- Structured JSON output with schema
- Explicit confidence thresholds
- Fallback handling for parsing failures
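A minimal sketch of wiring these techniques together, assuming the OpenAI Python client and the `prompt` built above; the model name and thresholds are illustrative, and the fallback treats unparseable output as inconclusive rather than as a pass:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def semantic_validate(prompt: str) -> dict:
    """Run the validation prompt with settings tuned for consistency."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,                          # low temperature for consistency
        response_format={"type": "json_object"},  # force structured JSON output
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.choices[0].message.content
    try:
        result = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Fallback handling: never silently pass on a parsing failure
        return {"has_anomalies": None, "confidence": 0.0,
                "anomalies": [], "summary": "unparseable validator output"}
    # Enforce the explicit confidence threshold from the prompt contract
    if result.get("confidence", 0.0) < 0.7:
        result["anomalies"] = []
        result["has_anomalies"] = False
    return result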
Multi-Agent Architecture: Beyond Single-Point Detection
2025 has been called the Year of Agentic AI, with 82% of organizations planning to integrate AI agents within 1-3 years.
Why Multi-Agent for Data Quality?
Single-model approaches have blind spots. Coordinated agents provide:
- Specialization: Each agent optimizes for specific validation types
- Redundancy: Multiple validation paths increase coverage
- Coordination: Orchestrator synthesizes findings and makes decisions
- Autonomy: System acts without human intervention
Reference Architecture
A practical baseline pairs a schema-monitoring agent and a semantic-validation agent with an orchestrator that synthesizes their findings and triggers actions (pause, quarantine, rollback). The agents coordinate over a shared message bus.
Agent Communication Protocol
Agent2Agent protocol gives agents a common, open language to collaborate—no matter which framework or vendor they are built on.
Implementation pattern:
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

@dataclass
class AgentMessage:
    """Structured message between agents"""
    sender: str            # agent_id
    recipient: str         # target agent or "broadcast"
    message_type: str      # "alert", "query", "action"
    payload: dict          # alert data or action request
    correlation_id: str    # trace related messages
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class MessageBus:
    """Central coordination (in-memory stand-in for a Firestore/Redis queue)"""
    def __init__(self):
        self._queues: Dict[str, List[AgentMessage]] = defaultdict(list)

    def publish(self, message: AgentMessage) -> None:
        # Store in a persistent queue (Firestore, Redis), route to recipient(s),
        # and log for observability; a per-agent dict stands in for the backend here
        self._queues[message.recipient].append(message)

    def subscribe(self, agent_id: str) -> List[AgentMessage]:
        # Return (and clear) pending messages for the agent
        return self._queues.pop(agent_id, [])
Decision Logic: Autonomous Response
Implementation:
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Alert:
    """Minimal stand-in for the alert objects the agents emit"""
    severity: str                  # "LOW" | "MEDIUM" | "HIGH" | "CRITICAL"
    confidence: float = 1.0
    types: List[str] = field(default_factory=list)

@dataclass
class Decision:
    action: str
    reason: str = ""
    auto_execute: bool = False
    escalate: bool = False

class PipelineOrchestrator:
    def make_decision(
        self,
        schema_alert: Optional[Alert],
        semantic_alert: Optional[Alert],
    ) -> Decision:
        # Rule 1: Critical schema changes always pause
        if schema_alert and schema_alert.severity == "CRITICAL":
            return Decision(
                action="pause_pipeline",
                reason="Breaking schema change detected",
                auto_execute=True,
            )
        # Rule 2: High-confidence semantic anomalies
        if semantic_alert and semantic_alert.confidence > 0.85:
            if "test_data" in semantic_alert.types:
                return Decision(
                    action="quarantine_and_rollback",
                    reason="Test data contamination detected",
                    auto_execute=True,
                )
        # Rule 3: Multiple simultaneous issues
        if schema_alert and semantic_alert:
            return Decision(
                action="emergency_pause",
                reason="Compound failure detected",
                auto_execute=True,
                escalate=True,
            )
        # Default: continue with logging
        return Decision(action="monitor", auto_execute=False)
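For instance, the test-data scenario from the case study later in this article would trigger an automatic quarantine (the values here are illustrative but match that scenario):

orchestrator = PipelineOrchestrator()
semantic = Alert(severity="HIGH", confidence=0.94, types=["test_data"])
decision = orchestrator.make_decision(schema_alert=None, semantic_alert=semantic)
print(decision.action)  # -> "quarantine_and_rollback", auto_execute=True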
Production Considerations
Observability: Monitoring the Monitors
Combining data and AI observability enables scalable quality management: automated monitor creation, anomaly detection, and root-cause analysis applied to the quality system itself.
Essential metrics:
- Agent decision traces (Jaeger, OpenTelemetry; see the sketch after this list)
- LLM performance (LangSmith, Helicone, Weights & Biases)
- System health (Prometheus, Grafana)
- Cost tracking (per-validation, per-token)
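A minimal sketch of the first item, emitting a decision trace with OpenTelemetry (exporter setup is omitted and the attribute names are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("data_quality.orchestrator")

def traced_decision(orchestrator, schema_alert, semantic_alert):
    # Each decision becomes a span that can be inspected in Jaeger (or any OTLP backend)
    with tracer.start_as_current_span("orchestrator.make_decision") as span:
        decision = orchestrator.make_decision(schema_alert, semantic_alert)
        span.set_attribute("decision.action", decision.action)
        span.set_attribute("decision.auto_execute", decision.auto_execute)
        if semantic_alert is not None:
            span.set_attribute("alert.semantic.confidence", semantic_alert.confidence)
        return decision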
Cost Optimization
LLM-based validation adds API costs. Strategies:
- Tiered validation: use cheap statistical checks first, escalate to the LLM only for suspicious data (sketched after this list)
- Batch processing: group validations to reduce API overhead
- Model selection: GPT-4o-mini or similar models offer a good balance of capability and cost
- Caching: a semantic cache built on embeddings can short-circuit duplicate LLM calls; roughly 33% of queries are repeated
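A minimal sketch of the tiered strategy, assuming the `run_structural_layer` and `semantic_validate` helpers sketched earlier; thresholds and field names are illustrative:

def tiered_validate(batch_stats: dict, prompt: str) -> dict:
    """Escalate to the LLM only when the cheap checks look suspicious."""
    # Tier 1: platform-native structural/statistical results (cheap, every batch)
    suspicious = (
        batch_stats.get("symbol_nulls", 0) > 0
        or batch_stats.get("sentiment_score_out_of_range", 0) > 0
        or batch_stats.get("total_rows", 0) == 0
    )
    if not suspicious:
        # A periodic random sample can still go to tier 2 to catch
        # issues that look structurally clean
        return {"has_anomalies": False, "tier": "statistical"}
    # Tier 2: LLM semantic validation on a small sample only (expensive)
    result = semantic_validate(prompt)
    result["tier"] = "semantic"
    return result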
Deployment Architecture
Azure AI Foundry Agent Service and similar platforms provide enterprise-grade deployment with built-in testing, release, and reliability at scale.
Stack recommendations:
orchestration:
  framework: LangGraph / CrewAI / AutoGen
  runtime: Azure AI Foundry / Vertex AI Agent Engine
validation:
  embeddings: text-embedding-3-large / Gemini Embedding
  llm: GPT-4o-mini / Claude Haiku / Gemini Flash
storage:
  vectors: Pinecone / Weaviate / Milvus
  state: Firestore / Redis / DynamoDB
monitoring:
  traces: OpenTelemetry
  metrics: Prometheus
  logs: Elasticsearch
Real-World Results
Financial Services: Test Data Detection
Scenario: News data pipeline syncing 50K articles/day
Problem: 847 test articles with TEST_STOCK, test_user_42, future dates (2099)
Traditional monitoring: All checks passed (syntactically correct)
Multi-agent system:
- Agent 1: Schema stable
- Agent 2: Semantic anomaly detected (94% confidence) ⚠️
- Agent 3: Auto-quarantine + pipeline pause
Outcome: 4-second detection, automatic remediation, $2M trading loss prevented
Healthcare: Reference Integrity
Scenario: Patient referral data with ICD-10 codes
Problem: 12% of codes were deprecated or non-existent
Traditional monitoring: Type checks passed (all valid strings)
LLM-based validation:
- Embedded ICD-10 reference knowledge
- Detected code validity issues
- Flagged temporal mismatches (codes used before approval date)
Outcome: 88% precision in identifying invalid medical codes
Implementation Recommendations
Start Small, Validate, Scale
Phase 1: Pilot
- Select one critical pipeline with known quality issues
- Implement semantic validator alongside existing monitoring
- Run in shadow mode (detection only, no actions)
- Measure: detection accuracy vs. production incidents
Phase 2: Automation
- Enable automatic actions for high-confidence anomalies
- Add schema monitoring agent
- Implement basic orchestration logic
- Monitor false positive rate
Phase 3: Scale
- Expand to additional pipelines
- Add specialized agents for domain-specific validation
- Implement full multi-agent coordination
- Optimize costs and performance
Technical Requirements
Minimum viable system:
# Core components
- BigQuery/Snowflake for data storage
- Vertex AI / Azure OpenAI for LLM access
- Cloud Run / Lambda for agent runtime
- Firestore / Redis for agent state
- GitHub Actions / Cloud Build for CI/CD
# Estimated costs (50K records/day):
- Embedding generation: $5-15/day
- LLM validation: $20-50/day (with smart sampling)
- Infrastructure: $10-30/day
# Total: ~$1,050-2,850/month ($35-95/day)
Evaluation Framework
Track these metrics to validate system performance:
| Metric | Target | How to Measure |
|---|---|---|
| True Positive Rate | >90% | Confirmed anomalies detected / Total actual anomalies |
| False Positive Rate | <5% | False alarms / Total alerts raised |
| Detection Latency | <5 sec | Time from ingestion to alert |
| Coverage | >95% | Fields validated / Total fields |
| Cost per Record | <$0.001 | Total cost / Records processed |
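A small helper for the counting-based rows of this table (the inputs come from your incident-review process; names are illustrative):

def evaluate(confirmed_detections: int, false_alarms: int,
             total_actual_anomalies: int, total_alerts: int,
             total_cost_usd: float, records_processed: int) -> dict:
    """Compute TPR, false-alarm rate, and cost per record from counted outcomes."""
    return {
        "true_positive_rate": confirmed_detections / max(total_actual_anomalies, 1),
        "false_positive_rate": false_alarms / max(total_alerts, 1),
        "cost_per_record": total_cost_usd / max(records_processed, 1),
    }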
The Future: 2025-2027
Emerging Patterns
1. Specialized Domain Embeddings
Domain-specific embeddings (e.g., MedEmbed, CodeXEmbed) excel in specialized fields. Expect vertical-specific validation models for:
- Financial instruments
- Healthcare terminology
- Supply chain references
- Regulatory compliance
2. Multi-Modal Validation
Multimodal embeddings (e.g., CLIP) align different data types. Next generation:
- Image content validation against metadata
- Document text vs. structured field consistency
- Time-series patterns vs. event descriptions
3. Self-Healing Pipelines
By 2029, agentic AI is predicted to autonomously resolve 80% of common issues. Future agents will:
- Detect anomalies
- Diagnose root causes
- Fix upstream issues
- Validate corrections
Protocol Standardization
New protocols like Model Context Protocol (MCP) and Agent-to-Agent (A2A) offer interoperability between AI client applications and agents.
What this means:
- Agents from different vendors can collaborate
- Standardized telemetry and observability
- Portable agent definitions across platforms
Conclusion: The Semantic Imperative
Traditional data quality monitoring asks "Did the data arrive correctly?"
The question should be "Is the data semantically valid?"
Solutions like Monte Carlo and WhyLabs are at the forefront of observability, offering real-time monitoring of data quality, lineage, and drift, but the architecture must evolve:
From: Reactive rule-based systems with structural focus
To: Proactive AI-powered systems with semantic understanding
The technical reality:
- 66% of banks struggle with data quality, and 83% lack real-time access to transaction data
- Traditional monitoring cannot handle unstructured data like text, images, or documents
- Traditional siloed monitoring tools can't keep up with modern data architecture complexity
The path forward:
- Multi-agent systems with specialized validators
- LLM-based semantic understanding
- Autonomous decision-making and remediation
- Platform-native compute for scalability
The technology exists. The question is whether you'll adapt before the next $2M incident.
Research sources: Monte Carlo Data, Stanford AI Index 2025, Gartner Research, LangChain State of AI Agents, Microsoft Azure AI, Google Cloud Vertex AI, academic papers on semantic validation and LLM evaluation