AI-generated garbage code — the so-called "AI Slop" — is contaminating the development ecosystem. We all know what happens when you ask ChatGPT to "build me a data validation library." Copy-paste, runtime error. Ask it to fix the error, it spawns another. An infinite loop of mediocrity.
I don't use that stuff. I built my own.
Truthound — "Sniff Out Bad Data"
A Zero-Configuration data quality framework powered by Polars. 7,613 test cases, 289 validators, 28 categories. I designed and implemented the core architecture myself — the tedious boilerplate excluded.
This hound sniffs through your data and hunts down bad records with unyielding tenacity. Mercilessly.
```python
import truthound as th

# That's it
report = th.check("data.csv")
```
No configuration files. No dozens of YAML lines. Throw your data at it, and it infers the schema and validates automatically.
Why Not Great Expectations?
"GX exists. Why reinvent the wheel?" I anticipated this question.
It's slow. GX is built on Pandas, whose single-threaded execution model becomes a bottleneck even on moderately sized datasets. Research from Vrije Universiteit Amsterdam found that Polars consumes approximately 8x less energy than Pandas while also running significantly faster on large-scale DataFrame operations (Malavolta et al., 2024). In TPC-H benchmarks, Polars and DuckDB outperformed Dask and PySpark by an order of magnitude.
It's convoluted. GX's learning curve is steep. Data Context, Datasource, Batch Request, Expectation Suite, Checkpoint... Grasping these concepts alone takes days. By then, I've already finished validating and gone home.
Dashboard? Paid. Proper monitoring without GX Cloud requires building it yourself. Truthound Dashboard is open-source. Free.
Technical Architecture
Truthound's architecture diverges fundamentally from Great Expectations in design philosophy. It's not merely "faster because Polars." The entire data validation framework has been reconceptualized.
Test Environment
- Python 3.13.7 / Polars 1.37.1 / Truthound 1.0.13
- Platform: Darwin arm64 (Apple Silicon M-series), 8 cores
1. Expression Batching — Single collect() for Entire Validation
Truthound consolidates all validations into a single query plan.
```python
class ExpressionBatchExecutor:
    """Execute all validators in a single collect()."""

    def execute(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Multiple validators merged into one query plan;
        # the Polars optimizer performs predicate pushdown and CSE
        return optimized_collect(lf.select(all_exprs), streaming=True)
```
ValidationExpressionSpec defines each validator's expression, and ExpressionBatchExecutor bundles them into a single select() execution.
| Data Size | Sequential (3 validators) | Batched (single collect) | Speedup |
|---|---|---|---|
| 100K rows | 0.0008s | 0.0004s | 2.04x |
| 500K rows | 0.0009s | 0.0006s | 1.33x |
With smaller datasets, I/O overhead constitutes a larger proportion, amplifying batching efficacy. At scale, Polars' inherent optimizations are already efficient, reducing relative speedup while maintaining superior absolute performance.
2. PEP 562 Lazy Loading — 302 Validators Deferred
```python
# truthound/validators/_lazy.py
VALIDATOR_IMPORT_MAP: dict[str, str] = {
    "NullValidator": "truthound.validators.completeness.null",
    "BetweenValidator": "truthound.validators.distribution.range",
    # ... 302 mappings
}

def __getattr__(name: str):
    """Import only upon access."""
    return validator_getattr(name)
```
| Metric | Value |
|---|---|
| Fresh module import | 0.01ms |
| First lazy access (NullValidator) | 4.25ms |
| Cached access (avg of 100) | 0.0005ms |
| Registered validators | 302 |
Eagerly importing all 302 validators would cost hundreds of milliseconds at startup. With lazy loading, the module itself imports in 0.01ms, and only the validators you actually touch pay their import cost: 4.25ms on first access, essentially free afterwards.
3. Vectorized Masking — Zero Python Callbacks
Data masking employs pure Polars expressions exclusively. No map_elements() invoking Python functions per row.
```python
def _apply_hash(df: pl.DataFrame, col: str) -> pl.DataFrame:
    """Apply xxhash3 without Python callbacks."""
    c = pl.col(col)
    hashed = c.hash().cast(pl.String).str.slice(0, 16)
    return df.with_columns(
        pl.when(c.is_null()).then(pl.lit(None)).otherwise(hashed).alias(col)
    )
```
| Data Size | Vectorized Hash | map_elements | Speedup | Throughput |
|---|---|---|---|---|
| 100K rows | 0.0033s | 0.1002s | 30.0x | 30M rows/sec |
| 500K rows | 0.0134s | 0.5032s | 37.5x | 37M rows/sec |
Vectorized operations utilizing Polars' native hash() function (xxhash3) outperform Python callbacks by 30-40x. Speedup scales with data volume.
4. Query Plan Optimization
Every collect() invocation leverages Polars query optimizations:
```python
QUERY_OPTIMIZATIONS = {
    "predicate_pushdown": True,   # Apply filters as early as possible
    "projection_pushdown": True,  # Select only required columns
    "comm_subexpr_elim": True,    # Eliminate redundant expressions
    "simplify_expression": True,
}
```
| Data Size | Optimized collect | Standard collect | Throughput |
|---|---|---|---|
| 500K rows | 0.0016s | 0.0045s | 320M rows/sec |
| 1M rows | 0.0025s | 0.0089s | 400M rows/sec |
Optimized collect achieves 2.8-3.5x speedup over standard execution, delivering 400 million rows/sec throughput at 1M rows.
5. DAG-Based Parallel Execution
Validator dependencies are analyzed to construct parallelizable execution groups.
```python
from enum import Enum, auto

class ValidatorPhase(Enum):
    SCHEMA = auto()        # Level 0: Schema validation
    COMPLETENESS = auto()  # Level 1: Null checks
    UNIQUENESS = auto()    # Level 2: Duplicate detection
    FORMAT = auto()        # Level 3: Pattern matching
    RANGE = auto()         # Level 4: Value range
    STATISTICAL = auto()   # Level 5: Aggregate statistics
    CROSS_TABLE = auto()   # Level 6: Multi-table
    CUSTOM = auto()        # Level 7: User-defined
```
ExecutionLevel orchestrates same-level validators via ThreadPoolExecutor for parallel execution.
6. Zero-Config Auto-Schema Learning
```python
# This is everything
report = th.check("data.csv")
```
Internally, the learn() function aggregates statistics across all columns in a single pass.
| Data Size | Learning Time | Throughput | Columns |
|---|---|---|---|
| 100K rows | 0.031s | 3.2M rows/sec | 15 |
| 500K rows | 0.111s | 4.5M rows/sec | 15 |
| 1M rows | 0.216s | 4.6M rows/sec | 15 |
Throughput improves with scale as Polars' vectorization amortizes fixed per-query overhead. Full schema learning on 1M rows completes in 0.22 seconds.
7. xxhash Cache Optimization
Cache fingerprinting utilizes xxhash:
```python
def _fast_hash(content: str) -> str:
    if _HAS_XXHASH:
        return xxhash.xxh64(content.encode()).hexdigest()[:16]
    return hashlib.sha256(content.encode()).hexdigest()[:16]
```
| Content Size | SHA256 | xxhash | Speedup |
|---|---|---|---|
| 1KB | 0.96μs | 0.41μs | 2.4x |
xxhash delivers 2.4x speedup over SHA256. Cumulative impact is substantial during mass cache key generation.
8. End-to-End Performance: th.check()
| Data Size | Execution Time | Throughput | Issues Found |
|---|---|---|---|
| 100K rows | 0.019s | 5.3M rows/sec | 12 |
| 500K rows | 0.100s | 5.0M rows/sec | 12 |
| 1M rows | 0.22s | 4.6M rows/sec | - |
th.check() encompasses schema learning plus automated validator execution end-to-end, sustaining approximately 5 million rows/sec throughput.
Performance Summary
| Optimization | Verified Metric | Notes |
|---|---|---|
| Expression Batching | 1.3-2.0x speedup | Based on 3 validators |
| PEP 562 Lazy Loading | 302 validators, 4.25ms first load | 0.0005ms post-cache |
| Vectorized Masking | 30-40x speedup vs map_elements | 37.5x at 500K rows |
| Query Optimization | 400M rows/sec | Aggregate queries |
| Schema Learning | 4.6M rows/sec | 1M rows, 15 cols |
| xxhash Cache | 2.4x speedup vs SHA256 | 1KB content |
| E2E th.check() | 5M rows/sec | Including auto validators |
Note: These metrics were measured under specific test conditions. Actual performance varies based on data characteristics, hardware, and validator configuration.
Origin Story
I was working on an agricultural data platform. Sensor telemetry, meteorological feeds, soil composition data... torrents of data pouring in, and the quality was atrocious. Nulls everywhere, type mismatches, values outside acceptable ranges. I was hemorrhaging time on data cleansing.
Attempted to adopt GX. It was slow. It was convoluted. And the dashboard? Paid.
So I built my own. Because I needed it. Isn't that what developers do — forge the tools they require? Weekends and overtime hours, ground to dust.
I attached truthound-dashboard as well. Free. Still under development, but the fundamentals are operational. Currently building truthound-orchestration for Airflow, Dagster, Prefect, and dbt integration.
Closing
There will be those who refuse to acknowledge Truthound. The "GX is the standard, why should I use this?" crowd. Cowards too entrenched in the familiar to venture toward new tools.
Inertia is formidable. Not something humans overcome effortlessly.
But a few minutes. pip install truthound and type th.check("data.csv") — a single line. If even that proves insurmountable, that person has elected to stagnate, to reject progress. A relic choosing obsolescence.
Brutal, unvarnished feedback is perpetually welcome. If you're going to criticize, do it properly.
Documentation: https://truthound.netlify.app
GitHub: https://github.com/seadonggyun4/Truthound
References
- Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione." Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 413-422.
- Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." ACM SIGMOD Record, 29(2), 93-104.
- Polars Documentation. https://pola.rs/
