DONG GYUN SEO

I don’t ship AI slop. I build.

AI-generated garbage code — the so-called "AI Slop" — is contaminating the development ecosystem. We all know what happens when you ask ChatGPT to "build me a data validation library." Copy-paste, runtime error. Ask it to fix the error, it spawns another. An infinite loop of mediocrity.

I don't use that stuff. I built my own.


Truthound — "Sniff Out Bad Data"

A Zero-Configuration data quality framework powered by Polars. 7,613 test cases, 289 validators, 28 categories. I designed and implemented the core architecture myself — the tedious boilerplate excluded.

This hound sniffs through your data and hunts down bad records. Mercilessly.

import truthound as th

# That's it
report = th.check("data.csv")

No configuration files. No dozens of YAML lines. Throw your data at it, and it infers the schema and validates automatically.


Why Not Great Expectations?

"GX exists. Why reinvent the wheel?" I anticipated this question.

It's slow. GX is built on Pandas, and Pandas' single-threaded execution model becomes a bottleneck even on moderately sized datasets. Research from Vrije Universiteit Amsterdam found that Polars consumes roughly 8x less energy while running significantly faster than Pandas on large-scale DataFrame operations (Malavolta et al., 2024). In TPC-H benchmarks, Polars and DuckDB were roughly an order of magnitude faster than Dask and PySpark.

It's convoluted. GX's learning curve is steep. Data Context, Datasource, Batch Request, Expectation Suite, Checkpoint... Grasping these concepts alone takes days. By then, I've already finished validating and gone home.

Dashboard? Paid. Proper monitoring without GX Cloud requires building it yourself. Truthound Dashboard is open-source. Free.


Technical Architecture

Truthound's architecture diverges fundamentally from Great Expectations in design philosophy. It's not merely "faster because Polars." The entire data validation framework has been reconceptualized.

Test Environment

  • Python 3.13.7 / Polars 1.37.1 / Truthound 1.0.13
  • Platform: Darwin arm64 (Apple Silicon M-series), 8 cores

1. Expression Batching — Single collect() for Entire Validation

Truthound consolidates all validations into a single query plan.

class ExpressionBatchExecutor:
    """Execute all validators in a single collect()"""
    def execute(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Multiple validators merged into one query plan
        # Polars optimizer performs predicate pushdown, CSE
        return optimized_collect(lf.select(all_exprs), streaming=True)

ValidationExpressionSpec defines each validator's expression, and ExpressionBatchExecutor bundles them into a single select() execution.

| Data Size | Sequential (3 validators) | Batched (single collect) | Speedup |
|---|---|---|---|
| 100K rows | 0.0008s | 0.0004s | 2.04x |
| 500K rows | 0.0009s | 0.0006s | 1.33x |

On smaller datasets, fixed per-collect overhead makes up a larger share of total time, so batching helps most there. At scale, Polars' own optimizations already dominate, so the relative speedup shrinks while absolute performance stays ahead.


2. PEP 562 Lazy Loading — 302 Validators Deferred

# truthound/validators/_lazy.py
VALIDATOR_IMPORT_MAP: dict[str, str] = {
    "NullValidator": "truthound.validators.completeness.null",
    "BetweenValidator": "truthound.validators.distribution.range",
    # ... 302 mappings
}

def __getattr__(name: str):
    """Import only upon access"""
    return validator_getattr(name)
| Operation | Time |
|---|---|
| Fresh module import | 0.01ms |
| First lazy access (NullValidator) | 4.25ms |
| Cached access (avg of 100) | 0.0005ms |
| Registered validators | 302 |

Eager-loading all 302 validators at import time would cost hundreds of milliseconds. With lazy loading, a fresh import takes 0.01ms, and only the validators you actually use pay their import cost: 4.25ms on first access, effectively free once cached.


3. Vectorized Masking — Zero Python Callbacks

Data masking employs pure Polars expressions exclusively. No map_elements() invoking Python functions per row.

def _apply_hash(df: pl.DataFrame, col: str) -> pl.DataFrame:
    """Apply xxhash3 without Python callbacks"""
    c = pl.col(col)
    hashed = c.hash().cast(pl.String).str.slice(0, 16)
    return df.with_columns(
        pl.when(c.is_null()).then(pl.lit(None)).otherwise(hashed).alias(col)
    )
| Data Size | Vectorized Hash | map_elements | Speedup | Throughput |
|---|---|---|---|---|
| 100K rows | 0.0033s | 0.1002s | 30.0x | 30M rows/sec |
| 500K rows | 0.0134s | 0.5032s | 37.5x | 37M rows/sec |

Vectorized operations utilizing Polars' native hash() function (xxhash3) outperform Python callbacks by 30-40x. Speedup scales with data volume.


4. Query Plan Optimization

Every collect() invocation leverages Polars query optimizations:

QUERY_OPTIMIZATIONS = {
    "predicate_pushdown": True,      # Apply filters as early as possible
    "projection_pushdown": True,     # Select only required columns
    "comm_subexpr_elim": True,       # Eliminate redundant expressions
    "simplify_expression": True,
}
| Data Size | Optimized collect | Standard collect | Throughput |
|---|---|---|---|
| 500K rows | 0.0016s | 0.0045s | 320M rows/sec |
| 1M rows | 0.0025s | 0.0089s | 400M rows/sec |

Optimized collect achieves 2.8-3.5x speedup over standard execution, delivering 400 million rows/sec throughput at 1M rows.


5. DAG-Based Parallel Execution

Validator dependencies are analyzed to construct parallelizable execution groups.

class ValidatorPhase(Enum):
    SCHEMA = auto()       # Level 0: Schema validation
    COMPLETENESS = auto() # Level 1: Null checks
    UNIQUENESS = auto()   # Level 2: Duplicate detection
    FORMAT = auto()       # Level 3: Pattern matching
    RANGE = auto()        # Level 4: Value range
    STATISTICAL = auto()  # Level 5: Aggregate statistics
    CROSS_TABLE = auto()  # Level 6: Multi-table
    CUSTOM = auto()       # Level 7: User-defined

ExecutionLevel orchestrates same-level validators via ThreadPoolExecutor for parallel execution.
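A stripped-down sketch of level-ordered execution (illustrative names, not Truthound internals): phases run sequentially in dependency order, while validators inside a phase run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum, auto

class Phase(Enum):
    SCHEMA = auto()        # must finish before later phases
    COMPLETENESS = auto()

def run_levels(groups, data):
    results = []
    with ThreadPoolExecutor() as pool:
        for phase in sorted(groups, key=lambda p: p.value):
            # validators in the same phase have no mutual dependencies,
            # so they can be mapped across the pool in parallel
            results.extend(pool.map(lambda check: check(data), groups[phase]))
    return results

groups = {
    Phase.COMPLETENESS: [lambda d: ("nulls", sum(1 for x in d if x is None))],
    Phase.SCHEMA: [lambda d: ("rows", len(d))],
}
print(run_levels(groups, [1, None, 3]))  # [('rows', 3), ('nulls', 1)]
```

Sorting by phase value enforces the DAG's level order even when the groups dict is built in arbitrary order.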


6. Zero-Config Auto-Schema Learning

# This is everything
report = th.check("data.csv")

Internally, the learn() function aggregates statistics across all columns in a single pass.

| Data Size | Learning Time | Throughput | Columns |
|---|---|---|---|
| 100K rows | 0.031s | 3.2M rows/sec | 15 |
| 500K rows | 0.111s | 4.5M rows/sec | 15 |
| 1M rows | 0.216s | 4.6M rows/sec | 15 |

Throughput improves with data scale due to Polars' vectorization efficiency gains. Complete schema learning for 1M rows concludes within 0.22 seconds.


7. xxhash Cache Optimization

Cache fingerprinting utilizes xxhash:

import hashlib
try:
    import xxhash
    _HAS_XXHASH = True
except ImportError:
    _HAS_XXHASH = False

def _fast_hash(content: str) -> str:
    if _HAS_XXHASH:
        return xxhash.xxh64(content.encode()).hexdigest()[:16]
    return hashlib.sha256(content.encode()).hexdigest()[:16]
| Content Size | SHA256 | xxhash | Speedup |
|---|---|---|---|
| 1KB | 0.96μs | 0.41μs | 2.4x |

xxhash delivers 2.4x speedup over SHA256. Cumulative impact is substantial during mass cache key generation.


8. End-to-End Performance: th.check()

| Data Size | Execution Time | Throughput | Issues Found |
|---|---|---|---|
| 100K rows | 0.019s | 5.3M rows/sec | 12 |
| 500K rows | 0.100s | 5.0M rows/sec | 12 |
| 1M rows | 0.22s | 4.6M rows/sec | - |

th.check() encompasses schema learning plus automated validator execution end-to-end, sustaining approximately 5 million rows/sec throughput.


Performance Summary

| Optimization | Verified Metric | Notes |
|---|---|---|
| Expression Batching | 1.3-2.0x speedup | Based on 3 validators |
| PEP 562 Lazy Loading | 302 validators, 4.25ms first load | 0.0005ms post-cache |
| Vectorized Masking | 30-40x speedup vs map_elements | 37.5x at 500K rows |
| Query Optimization | 400M rows/sec | Aggregate queries |
| Schema Learning | 4.6M rows/sec | 1M rows, 15 cols |
| xxhash Cache | 2.4x speedup vs SHA256 | 1KB content |
| E2E th.check() | 5M rows/sec | Including auto validators |

Note: These metrics were measured under specific test conditions. Actual performance varies based on data characteristics, hardware, and validator configuration.


Origin Story

I was working on an agricultural data platform. Sensor telemetry, meteorological feeds, soil composition data... torrents of data pouring in, and the quality was atrocious. Nulls everywhere, type mismatches, values outside acceptable ranges. I was hemorrhaging time on data cleansing.

Attempted to adopt GX. It was slow. It was convoluted. And the dashboard? Paid.

So I built my own. Because I needed it. Isn't that what developers do — forge the tools they require? Weekends and overtime hours, ground to dust.

truthound-dashboard ships alongside it. Free. Still under development, but the fundamentals work. I'm currently building truthound-orchestration for Airflow, Dagster, Prefect, and dbt integration.


Closing

There will be those who refuse to acknowledge Truthound. The "GX is the standard, why should I use this?" crowd. Cowards too entrenched in the familiar to venture toward new tools.

Inertia is formidable. Not something humans overcome effortlessly.

But it takes only a few minutes. pip install truthound, then th.check("data.csv"): a single line. If even that proves insurmountable, that person has elected to stagnate, to reject progress. A relic choosing obsolescence.

Brutal, unvarnished feedback is perpetually welcome. If you're going to criticize, do it properly.


Documentation: https://truthound.netlify.app

GitHub: https://github.com/seadonggyun4/Truthound


References

  1. Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione." Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.
  2. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 413-422.
  3. Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." ACM SIGMOD Record, 29(2), 93-104.
  4. Polars Documentation. https://pola.rs/
