DONG GYUN SEO

I don’t ship AI slop. I build.

AI-generated garbage code — the so-called "AI Slop" — is contaminating the development ecosystem. We all know what happens when you ask ChatGPT to "build me a data validation library." Copy-paste, runtime error. Ask it to fix the error, it spawns another. An infinite loop of mediocrity.

I don't use that stuff. I built my own.


Truthound — "Sniff Out Bad Data"

A Zero-Configuration data quality framework powered by Polars. 7,613 test cases, 289 validators, 28 categories. I designed and implemented the core architecture myself — the tedious boilerplate excluded.

This hound sniffs through your data and hunts down bad records. Mercilessly.

import truthound as th

# That's it
report = th.check("data.csv")

No configuration files. No dozens of YAML lines. Throw your data at it, and it infers the schema and validates automatically.


Why Not Great Expectations?

"GX exists. Why reinvent the wheel?" I anticipated this question.

It's slow. GX is built on Pandas, and Pandas' single-threaded execution model becomes a bottleneck even on moderately sized datasets. Research from Vrije Universiteit Amsterdam found that Polars consumes roughly 8x less energy while running significantly faster than Pandas on large-scale DataFrame operations (Malavolta et al., 2024). In TPC-H benchmarks, Polars and DuckDB were roughly an order of magnitude faster than Dask and PySpark.

It's convoluted. GX's learning curve is steep. Data Context, Datasource, Batch Request, Expectation Suite, Checkpoint... Grasping these concepts alone takes days. By then, I've already finished validating and gone home.

Dashboard? Paid. Proper monitoring without GX Cloud requires building it yourself. Truthound Dashboard is open-source. Free.


Technical Architecture

Truthound's architecture diverges fundamentally from Great Expectations in design philosophy. It's not merely "faster because Polars." The entire data validation framework has been reconceptualized.

Test Environment

  • Python 3.13.7 / Polars 1.37.1 / Truthound 1.0.13
  • Platform: Darwin arm64 (Apple Silicon M-series), 8 cores

1. Expression Batching — Single collect() for Entire Validation

Truthound consolidates all validations into a single query plan.

class ExpressionBatchExecutor:
    """Execute all validators in a single collect()"""
    def execute(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Multiple validators merged into one query plan
        # Polars optimizer performs predicate pushdown, CSE
        return optimized_collect(lf.select(all_exprs), streaming=True)

ValidationExpressionSpec defines each validator's expression, and ExpressionBatchExecutor bundles them into a single select() execution.

| Data Size | Sequential (3 validators) | Batched (single collect) | Speedup |
|---|---|---|---|
| 100K rows | 0.0008s | 0.0004s | 2.04x |
| 500K rows | 0.0009s | 0.0006s | 1.33x |

On smaller datasets, fixed per-collect overhead makes up a larger share of total time, so batching helps most there. At scale, Polars' own optimizations already dominate, so the relative speedup shrinks while absolute performance stays ahead.


2. PEP 562 Lazy Loading — 302 Validators Deferred

# truthound/validators/_lazy.py
VALIDATOR_IMPORT_MAP: dict[str, str] = {
    "NullValidator": "truthound.validators.completeness.null",
    "BetweenValidator": "truthound.validators.distribution.range",
    # ... 302 mappings
}

def __getattr__(name: str):
    """Import only upon access"""
    return validator_getattr(name)
| Operation | Time |
|---|---|
| Fresh module import | 0.01ms |
| First lazy access (NullValidator) | 4.25ms |
| Cached access (avg of 100) | 0.0005ms |
| Registered validators | 302 |

Eager-loading all 302 validators at import time would cost hundreds of milliseconds. With lazy loading, a fresh import takes 0.01ms, and only the validators you actually use pay their import cost: 4.25ms on first access, effectively free once cached.


3. Vectorized Masking — Zero Python Callbacks

Data masking employs pure Polars expressions exclusively. No map_elements() invoking Python functions per row.

def _apply_hash(df: pl.DataFrame, col: str) -> pl.DataFrame:
    """Apply xxhash3 without Python callbacks"""
    c = pl.col(col)
    hashed = c.hash().cast(pl.String).str.slice(0, 16)
    return df.with_columns(
        pl.when(c.is_null()).then(pl.lit(None)).otherwise(hashed).alias(col)
    )
| Data Size | Vectorized Hash | map_elements | Speedup | Throughput |
|---|---|---|---|---|
| 100K rows | 0.0033s | 0.1002s | 30.0x | 30M rows/sec |
| 500K rows | 0.0134s | 0.5032s | 37.5x | 37M rows/sec |

Vectorized operations utilizing Polars' native hash() function (xxhash3) outperform Python callbacks by 30-40x. Speedup scales with data volume.


4. Query Plan Optimization

Every collect() invocation leverages Polars query optimizations:

QUERY_OPTIMIZATIONS = {
    "predicate_pushdown": True,      # Apply filters as early as possible
    "projection_pushdown": True,     # Select only required columns
    "comm_subexpr_elim": True,       # Eliminate redundant expressions
    "simplify_expression": True,
}
| Data Size | Optimized collect | Standard collect | Throughput |
|---|---|---|---|
| 500K rows | 0.0016s | 0.0045s | 320M rows/sec |
| 1M rows | 0.0025s | 0.0089s | 400M rows/sec |

Optimized collect achieves 2.8-3.5x speedup over standard execution, delivering 400 million rows/sec throughput at 1M rows.


5. DAG-Based Parallel Execution

Validator dependencies are analyzed to construct parallelizable execution groups.

class ValidatorPhase(Enum):
    SCHEMA = auto()       # Level 0: Schema validation
    COMPLETENESS = auto() # Level 1: Null checks
    UNIQUENESS = auto()   # Level 2: Duplicate detection
    FORMAT = auto()       # Level 3: Pattern matching
    RANGE = auto()        # Level 4: Value range
    STATISTICAL = auto()  # Level 5: Aggregate statistics
    CROSS_TABLE = auto()  # Level 6: Multi-table
    CUSTOM = auto()       # Level 7: User-defined

ExecutionLevel orchestrates same-level validators via ThreadPoolExecutor for parallel execution.
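A stripped-down sketch of level-ordered execution (illustrative names, not Truthound internals): phases run sequentially in dependency order, while validators inside a phase run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum, auto

class Phase(Enum):
    SCHEMA = auto()        # must finish before later phases
    COMPLETENESS = auto()

def run_levels(groups, data):
    results = []
    with ThreadPoolExecutor() as pool:
        for phase in sorted(groups, key=lambda p: p.value):
            # validators in the same phase have no mutual dependencies,
            # so they can be mapped across the pool in parallel
            results.extend(pool.map(lambda check: check(data), groups[phase]))
    return results

groups = {
    Phase.COMPLETENESS: [lambda d: ("nulls", sum(1 for x in d if x is None))],
    Phase.SCHEMA: [lambda d: ("rows", len(d))],
}
print(run_levels(groups, [1, None, 3]))  # [('rows', 3), ('nulls', 1)]
```

Sorting by phase value enforces the DAG's level order even when the groups dict is built in arbitrary order.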


6. Zero-Config Auto-Schema Learning

# This is everything
report = th.check("data.csv")

Internally, the learn() function aggregates statistics across all columns in a single pass.

| Data Size | Learning Time | Throughput | Columns |
|---|---|---|---|
| 100K rows | 0.031s | 3.2M rows/sec | 15 |
| 500K rows | 0.111s | 4.5M rows/sec | 15 |
| 1M rows | 0.216s | 4.6M rows/sec | 15 |

Throughput improves with data scale due to Polars' vectorization efficiency gains. Complete schema learning for 1M rows concludes within 0.22 seconds.


7. xxhash Cache Optimization

Cache fingerprinting utilizes xxhash:

import hashlib
try:
    import xxhash
    _HAS_XXHASH = True
except ImportError:
    _HAS_XXHASH = False

def _fast_hash(content: str) -> str:
    if _HAS_XXHASH:
        return xxhash.xxh64(content.encode()).hexdigest()[:16]
    return hashlib.sha256(content.encode()).hexdigest()[:16]
| Content Size | SHA256 | xxhash | Speedup |
|---|---|---|---|
| 1KB | 0.96μs | 0.41μs | 2.4x |

xxhash delivers 2.4x speedup over SHA256. Cumulative impact is substantial during mass cache key generation.


8. End-to-End Performance: th.check()

| Data Size | Execution Time | Throughput | Issues Found |
|---|---|---|---|
| 100K rows | 0.019s | 5.3M rows/sec | 12 |
| 500K rows | 0.100s | 5.0M rows/sec | 12 |
| 1M rows | 0.22s | 4.6M rows/sec | - |

th.check() encompasses schema learning plus automated validator execution end-to-end, sustaining approximately 5 million rows/sec throughput.


Performance Summary

| Optimization | Verified Metric | Notes |
|---|---|---|
| Expression Batching | 1.3-2.0x speedup | Based on 3 validators |
| PEP 562 Lazy Loading | 302 validators, 4.25ms first load | 0.0005ms post-cache |
| Vectorized Masking | 30-40x speedup vs map_elements | 37.5x at 500K rows |
| Query Optimization | 400M rows/sec | Aggregate queries |
| Schema Learning | 4.6M rows/sec | 1M rows, 15 cols |
| xxhash Cache | 2.4x speedup vs SHA256 | 1KB content |
| E2E th.check() | 5M rows/sec | Including auto validators |

Note: These metrics were measured under specific test conditions. Actual performance varies based on data characteristics, hardware, and validator configuration.


Origin Story

I was working on an agricultural data platform. Sensor telemetry, meteorological feeds, soil composition data... torrents of data pouring in, and the quality was atrocious. Nulls everywhere, type mismatches, values outside acceptable ranges. I was hemorrhaging time on data cleansing.

Attempted to adopt GX. It was slow. It was convoluted. And the dashboard? Paid.

So I built my own. Because I needed it. Isn't that what developers do — forge the tools they require? Weekends and overtime hours, ground to dust.

truthound-dashboard ships alongside it. Free. Still under development, but the fundamentals work. I'm currently building truthound-orchestration for Airflow, Dagster, Prefect, and dbt integration.


Closing

There will be those who refuse to acknowledge Truthound. The "GX is the standard, why should I use this?" crowd. Cowards too entrenched in the familiar to venture toward new tools.

Inertia is formidable. Not something humans overcome effortlessly.

But it takes only a few minutes. pip install truthound, then th.check("data.csv"): a single line. If even that proves insurmountable, that person has elected to stagnate, to reject progress. A relic choosing obsolescence.

Brutal, unvarnished feedback is perpetually welcome. If you're going to criticize, do it properly.


Documentation: https://truthound.netlify.app

GitHub: https://github.com/seadonggyun4/Truthound


References

  1. Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione." Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.
  2. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 413-422.
  3. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." ACM SIGMOD Record, 29(2), 93-104.
  4. Polars Documentation. https://pola.rs/
