Nitesh Jain for Razorpay

Delta Lake Health Analyzer

If you're running Delta Lake at any meaningful scale, you've probably experienced this pattern. Queries that used to complete in seconds now take minutes. Your cloud storage bill shows mysterious costs that keep climbing. And when you finally dig into the file structure, you discover you're sitting on tens of thousands of tiny files causing chaos. At Razorpay, where we process massive volumes of financial transaction data daily, this pattern became impossible to ignore.

The problem with Delta Lake table health is that it's invisible until it becomes a crisis. Small files accumulate gradually as streaming pipelines write data in micro-batches. Partitions develop skew as transaction volumes shift across merchants or time periods. Storage bloats with old data that should have been pruned but nobody remembered to check. By the time query performance degrades noticeably, you're already deep in the problem, and fixing it requires expensive OPTIMIZE operations that you can't justify without understanding the scope of the issue.

We needed visibility into our Delta Lake health, but existing solutions didn't fit our requirements. Commercial tools like Databricks table monitoring are tightly coupled to their platform and don't help if you're running Delta Lake on vanilla Spark or other compute engines. Open-source alternatives exist but require setting up Python environments, configuring access credentials, and writing custom scripts for each analysis scenario. What we wanted was something instant: point it at a table, get actionable insights in seconds, no installation required.

That's why we built the Delta Lake Health Analyzer, a completely browser-based diagnostic tool that analyzes table health, identifies optimization opportunities, and estimates cost savings without requiring any backend infrastructure. Here's the interesting part: everything runs in your browser using DuckDB WASM. The data never leaves your machine, which solved several problems we didn't even realize we had yet.

The Fundamental Problem: Delta Lake Degradation Patterns

Before diving into the solution, let's talk about what makes Delta Lake table health so critical and why it degrades over time. Understanding the failure modes helps appreciate why this kind of diagnostic tooling matters.

Delta Lake provides ACID transactions on top of object storage, which is powerful for data reliability but introduces operational complexity. Every transaction creates metadata describing file changes, and these metadata operations accumulate. The core issues we see repeatedly fall into three categories, and they're remarkably consistent across different use cases.

Small file proliferation is the most common problem. When you're ingesting data continuously through streaming pipelines, each micro-batch creates new files. If you're writing every few seconds, you'll generate thousands of files daily. These small files create two distinct problems. First, query engines need to open and read each file separately, which means thousands of S3 API calls instead of hundreds. The latency adds up, turning what should be sub-second queries into multi-minute waits. Second, cloud storage providers charge per API request, not just storage volume. Reading 10,000 files of 1MB each costs significantly more than reading 10 files of 1GB each, even though the total data volume is identical.

Partition skew develops naturally as data patterns change over time. Imagine you partition transactions by date and merchant segment. Initially, transaction volume might be balanced across segments. Six months later, a few large merchants dominate transaction volume, creating massive partitions for specific segments while others remain small. Query engines can't parallelize effectively when partitions have wildly different sizes; the largest partition becomes a bottleneck. Moreover, partition pruning becomes less effective because the query planner can't reliably predict which partitions need scanning based on file counts alone.

Storage inefficiency stems from uncompacted data and a lack of pruning. Delta Lake supports time travel by maintaining historical versions of data, which is useful for auditing and debugging but accumulates storage costs. If you're not regularly running VACUUM to clean up old versions, you're paying to store data that will never be queried again. Similarly, tables with frequent updates or deletes develop tombstone files that mark rows as deleted without actually removing them from storage. Until you run OPTIMIZE to compact and rewrite these files, you're storing and scanning data that's logically deleted but physically present.
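
To make the remediation side of this concrete, here is roughly what those maintenance commands look like from Spark SQL. This is a hedged sketch: the table name, the ZORDER column, and the retention window are illustrative, and your Spark session needs Delta Lake configured.

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

# Compact small files and co-locate rows on a commonly filtered column.
spark.sql("OPTIMIZE transactions ZORDER BY (merchant_id)")

# Remove files no longer referenced by the table and older than 7 days (168 hours).
# Shorter retention limits time travel, so confirm audit requirements first.
spark.sql("VACUUM transactions RETAIN 168 HOURS")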

These problems don't appear overnight. They accumulate gradually, which makes them particularly insidious. By the time someone notices query performance degradation, the underlying file structure is already badly fragmented. Fixing it requires running expensive compaction operations, but without understanding the scope of the problem, it's hard to justify the compute cost or prioritize which tables need attention first.

Why We Built It Browser-First: The Architecture Decision That Changed Everything

Here's a choice that raised eyebrows when we proposed it. We built this entirely in the browser using DuckDB WASM. Most teams would instinctively reach for a backend service with Spark or Pandas doing the heavy lifting, some REST API exposing the analysis results, and a database storing historical health metrics. We went client-side, and it turned out to be exactly the right architectural decision for multiple reasons.

[Diagram: traditional backend approach vs. browser-first approach]

The primary driver was data governance. Our Delta Lake tables contain sensitive financial transaction data. Shipping that data to a backend service for analysis would trigger extensive security reviews. We'd need to document data flows, implement data masking for sensitive fields, set up audit logging for every access, get compliance sign-offs from multiple teams, and provision infrastructure in specific security zones with appropriate network policies. This process takes weeks to months at most organizations with mature security practices.

By processing everything in the browser, sensitive data never leaves the user's machine. The browser reads Delta Lake checkpoint files directly from S3 using pre-signed URLs with appropriate IAM permissions. All analysis happens locally in browser memory. Results are displayed in the UI but never transmitted anywhere. From IT security's perspective, this is no different than a user running AWS CLI commands on their laptop. We got security approval in days instead of months because we weren't introducing new data egress paths or storage systems.

The deployment simplicity was an unexpected bonus. Our entire application is static files: HTML, JavaScript, WASM binaries, and CSS. We deploy to a CDN and call it done. No backend services to provision, no databases to maintain, no monitoring infrastructure to set up. When we need to update the tool, we push new static files to the CDN and users automatically get the latest version on their next page load. No complex CI/CD pipelines, no blue-green deployments, no database migrations.

The performance surprised us positively. We were initially skeptical that a browser-based SQL engine could handle real-world checkpoint files. Delta Lake checkpoints for large tables can contain metadata for 50,000+ files in a single Parquet file. Loading and analyzing that in the browser sounded risky. However, DuckDB WASM proved remarkably capable. It handles our largest checkpoint files without breaking a sweat, and query performance is genuinely impressive. The SQL-based analysis we wrote executes in seconds, even for complex aggregations across tens of thousands of rows.

Moreover, there's no network latency. Traditional architectures send requests to a backend, wait for processing, and receive results over the network. Our tool loads the checkpoint once from S3, then all subsequent analysis happens locally. Users can slice and dice the data interactively without waiting for round-trip API calls. This instant feedback loop changes how people use the tool; they explore data more thoroughly because there's no penalty for trying different analysis angles.

The limitations are real but manageable. Browser memory constraints mean we can't analyze tables with hundreds of thousands of partitions or millions of files. For truly massive tables that exceed browser capabilities, the metadata alone would consume gigabytes of RAM. However, for 95% of our Delta Lake tables, these limitations don't matter. The typical table has hundreds to low thousands of partitions and tens of thousands of files. For edge cases that exceed browser memory, teams can export the checkpoint file and analyze it with backend tools. The browser-first approach works brilliantly for the common case, which is exactly what we needed.

How It Works: From Checkpoint to Actionable Insights

Let's walk through what actually happens when you analyze a table. Understanding the technical flow reveals why this approach is both practical and powerful.

[Diagram: sequence of steps in a table analysis]

The process begins when you provide a Delta table path. The browser reads the _last_checkpoint file from the _delta_log directory to determine the most recent checkpoint. This small JSON file tells us which checkpoint Parquet file contains the latest table state. We then fetch that checkpoint file from S3 using a pre-signed URL with the user's AWS credentials.
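
Approximated in Python for consistency with the rest of this post (the tool itself does this in the browser via pre-signed URLs), the discovery step looks something like the sketch below; the bucket and prefix are made-up examples, and a single-part checkpoint is assumed.

import json
import boto3

s3 = boto3.client("s3")
bucket, table_prefix = "my-data-lake", "warehouse/transactions"  # illustrative names

# _last_checkpoint is a tiny JSON file pointing at the latest checkpoint version.
obj = s3.get_object(Bucket=bucket, Key=f"{table_prefix}/_delta_log/_last_checkpoint")
last_checkpoint = json.loads(obj["Body"].read())

# Single-part checkpoints are named with a zero-padded 20-digit version number.
version = last_checkpoint["version"]
checkpoint_key = f"{table_prefix}/_delta_log/{version:020d}.checkpoint.parquet"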

This checkpoint file is the key to everything. It's a Parquet file containing metadata for every active file in the Delta table: file paths, sizes, partition values, modification times, row counts, and statistics. For a table with 50,000 files, this checkpoint might be 20-30MB, which loads quickly even on modest internet connections. Once loaded into browser memory, DuckDB WASM makes this data queryable via SQL.
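
DuckDB WASM exposes the same SQL surface as DuckDB's other bindings, so a rough Python equivalent of the load step looks like this; it assumes the checkpoint Parquet has already been fetched locally as checkpoint.parquet, and it keeps only the "add" actions that describe the current table state.

import duckdb

con = duckdb.connect()  # in-memory database, analogous to the in-browser engine

con.execute("""
    CREATE TABLE checkpoint AS
    SELECT "add".path               AS path,
           "add".size               AS size_bytes,
           "add"."partitionValues"  AS partition_values,
           "add"."modificationTime" AS modification_time,
           "add".stats              AS stats_json
    FROM read_parquet('checkpoint.parquet')
    WHERE "add" IS NOT NULL
""")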

The file-level analysis examines the distribution of file sizes. We run queries like "how many files are under 128MB?" and "what's the total size of files under 10MB?" Small files are the primary indicator of optimization opportunities because they directly impact query performance and cloud costs. We also calculate the coefficient of variation (CV, the standard deviation divided by the mean) for file sizes to understand how uniform the file distribution is. A high CV means file sizes vary wildly, suggesting inconsistent ingestion patterns or a lack of compaction.
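
Against the checkpoint table sketched above, those file-level questions reduce to a single aggregation; the thresholds mirror the ones discussed in the text and the column names come from the earlier sketch, so treat this as illustrative rather than the tool's exact query.

total_files, files_under_128mb, bytes_under_10mb, file_size_cv = con.execute("""
    SELECT count(*)                                                     AS total_files,
           count(*) FILTER (WHERE size_bytes < 128 * 1024 * 1024)       AS files_under_128mb,
           sum(size_bytes) FILTER (WHERE size_bytes < 10 * 1024 * 1024) AS bytes_under_10mb,
           stddev_pop(size_bytes) / avg(size_bytes)                     AS file_size_cv
    FROM checkpoint
""").fetchone()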

The partition-level analysis looks at how data is distributed across partitions. We count total partitions, calculate files per partition, and compute the coefficient of variation of partition sizes. High partition skew (high CV) means some partitions are massive while others are tiny, which hurts query parallelism. We identify the largest and smallest partitions by row count and size, helping users understand where imbalances exist.
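
A hedged sketch of the partition-level aggregation, reusing the same checkpoint table; grouping on the stringified partition values keeps it independent of how many partition columns a table has.

partition_count, avg_files, max_files, partition_size_cv = con.execute("""
    WITH per_partition AS (
        SELECT CAST(partition_values AS VARCHAR) AS partition_key,
               count(*)        AS file_count,
               sum(size_bytes) AS total_bytes
        FROM checkpoint
        GROUP BY 1
    )
    SELECT count(*)                                   AS partition_count,
           avg(file_count)                            AS avg_files_per_partition,
           max(file_count)                            AS max_files_per_partition,
           stddev_pop(total_bytes) / avg(total_bytes) AS partition_size_cv
    FROM per_partition
""").fetchone()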

The health scoring algorithm combines these metrics into a single 0-100 score. Here's the actual scoring logic we use:

def calculate_health_score(metrics):
    score = 100

    # Small files penalty (up to -40 points)
    small_file_ratio = metrics['small_files_count'] / metrics['total_files']
    if small_file_ratio > 0.5:
        score -= 40
    elif small_file_ratio > 0.3:
        score -= 25
    elif small_file_ratio > 0.1:
        score -= 10

    # Partition skew penalty (up to -30 points)
    if metrics['partition_cv'] > 2.0:
        score -= 30
    elif metrics['partition_cv'] > 1.5:
        score -= 20
    elif metrics['partition_cv'] > 1.0:
        score -= 10

    # Average file size penalty (up to -20 points)
    avg_file_size_mb = metrics['avg_file_size'] / (1024 * 1024)
    if avg_file_size_mb < 64:
        score -= 20
    elif avg_file_size_mb < 128:
        score -= 10

    # Partition count penalty (up to -10 points)
    if metrics['partition_count'] > 10000:
        score -= 10
    elif metrics['partition_count'] > 5000:
        score -= 5

    return max(0, score)

This scoring approach is opinionated but based on observed patterns across hundreds of tables. The small file ratio is weighted most heavily because it has the biggest impact on query performance. Partition skew matters for parallelism. Average file size provides a sanity check on overall table structure. Partition count flags tables that might have excessive partitioning granularity.

The beauty of this browser-based architecture is that once the checkpoint is loaded, all these analyses execute instantly. Users can explore different aspects of table health without waiting for backend processing. Want to see which specific partitions have the most files? Run a query. Curious about file size distribution over time? We can infer that from modification timestamps. Wondering if certain columns have high null rates that suggest pruning opportunities? Column statistics from the checkpoint reveal that immediately.

What the Tool Actually Does: Key Features

Let's talk about the capabilities that make this tool useful in day-to-day operations. These aren't just interesting statistics; they're actionable insights that drive real optimization decisions.

Health scoring and visualization provides the at-a-glance assessment. When you load a table, the first thing you see is the health score (0-100) with color coding: green for healthy (80+), yellow for attention needed (50-79), red for critical (below 50). Below the score, we break down the contributing factors: small file percentage, partition skew coefficient, average file size, and partition count. This breakdown helps you understand which specific issue is dragging down the score.

[Screenshot: Delta Lake Health Analyzer dashboard view]

Here's how the health score breakdown appears in the UI:

[Screenshot: health score breakdown]

File analysis digs into the details. We show file count distribution across size buckets (under 10MB, 10-64MB, 64-128MB, 128MB+) so you can see exactly where files cluster. A histogram visualizes this distribution, making patterns obvious. If you see a massive spike of files under 10MB, that's your smoking gun for why queries are slow. The tool also lists the largest and smallest files by path, which helps identify specific ingestion jobs or time periods that created problems.

Partition analysis reveals imbalances. We display partition count, files per partition (average, min, max), size per partition (average, min, max), and the coefficient of variation for partition sizes. High CV means significant skew. We also rank partitions by size and file count, showing the top 10 largest and most fragmented partitions. This targeting is valuable; you often don't need to optimize the entire table, just the handful of partitions causing the real problems.

Column-level insights come from Delta's built-in statistics. When Delta writes files, it collects min/max/null count statistics for each column. We surface these at the table level: which columns have the most nulls, which have the widest ranges, which might benefit from ZORDER optimization. ZORDER co-locates similar values in the same files, dramatically improving query performance when you're filtering on high-cardinality columns. The tool identifies candidate columns by looking at their cardinality and filter frequency patterns.
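
Because Delta stores these statistics as a JSON string per file, surfacing them is a matter of JSON extraction over the checkpoint. A minimal sketch, assuming the stats follow the standard Delta format (numRecords, nullCount, minValues, maxValues) and using a hypothetical card_type column:

null_count, row_count = con.execute("""
    SELECT sum(CAST(json_extract_string(stats_json, '$.nullCount.card_type') AS BIGINT)) AS null_count,
           sum(CAST(json_extract_string(stats_json, '$.numRecords') AS BIGINT))          AS row_count
    FROM checkpoint
    WHERE stats_json IS NOT NULL
""").fetchone()

null_ratio = null_count / row_count if row_count else 0.0  # high ratios flag data quality issues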

Cost estimation translates metrics into dollars. This was the feature that got the most enthusiastic feedback because it provides business justification for running optimization commands. We calculate estimated costs based on two factors: S3 API request pricing and query compute costs.

For S3 costs, the calculation is straightforward:

def estimate_s3_cost_savings(current_files, optimal_files):
    # S3 GET request pricing (rough average across regions)
    cost_per_1000_requests = 0.0004  # USD

    current_monthly_scans = current_files * 30  # assuming daily queries
    optimal_monthly_scans = optimal_files * 30

    current_cost = (current_monthly_scans / 1000) * cost_per_1000_requests
    optimal_cost = (optimal_monthly_scans / 1000) * cost_per_1000_requests

    savings = current_cost - optimal_cost
    return savings

For query compute costs, we estimate based on scan time reduction. Fewer files mean fewer seeks, less metadata processing, and faster query completion. The relationship isn't perfectly linear, but empirical testing shows that reducing file count by 10x typically improves query time by 3-5x for scan-heavy workloads. We use conservative estimates to avoid overpromising.
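
A simplified sketch of how that compute estimate might be expressed; the logarithmic speedup curve and its 3x-per-10x anchor are illustrative assumptions taken from the conservative end of that range, not our exact model.

import math

def estimate_compute_savings(current_files, optimal_files,
                             monthly_scan_hours, hourly_compute_cost):
    # Assume roughly 3x query speedup per 10x reduction in file count,
    # interpolated logarithmically between those anchor points.
    if optimal_files <= 0 or current_files <= optimal_files:
        return 0.0
    reduction = current_files / optimal_files
    speedup = 3 ** math.log10(reduction)   # 10x fewer files -> ~3x faster
    new_scan_hours = monthly_scan_hours / speedup
    return (monthly_scan_hours - new_scan_hours) * hourly_compute_cost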

When users see "estimated monthly savings: $X from S3 optimization, $Y from faster queries," it changes the conversation. Suddenly running OPTIMIZE isn't just an operational task; it's a cost reduction initiative with measurable ROI.

Pruning recommendations identify opportunities to clean up old data. Delta Lake's time travel is powerful, but maintaining 90 days of history for a table that's only queried for the last 7 days is wasteful. The tool analyzes file modification timestamps and data freshness patterns to recommend appropriate VACUUM retention periods. We also flag tables with excessive deletion tombstones that need compaction to reclaim space.
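
The retention recommendation boils down to comparing how far back queries actually reach against Delta's default retention; this is a heavily simplified heuristic with assumed thresholds, and real recommendations must also respect time-travel and audit requirements.

def recommend_vacuum_retention_hours(queried_window_days, default_hours=168):
    # Retain roughly twice the window queries actually touch, but never drop
    # below Delta's 7-day default without an explicit audit/time-travel review.
    return max(queried_window_days * 2 * 24, default_hours)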

What We Learned Building This

Building a browser-based data analysis tool taught us several lessons that weren't obvious from the outset.

DuckDB WASM is genuinely production-ready. We were skeptical about running a full SQL engine in the browser, but DuckDB WASM parsed our largest checkpoint files (30MB+, 50,000+ rows) without issues. Complex aggregations execute in milliseconds, and the SQL interface proved complete enough for all our analysis needs.

Browser memory limits matter less than expected. Modern browsers handle datasets in the hundreds of megabytes without problems. We implemented guardrails for extremely large checkpoints, but these edge cases are rare. Most Delta Lake tables have manageable checkpoint sizes.

Cost estimates drive action more than performance metrics. We thought query performance insights would motivate optimization. We were wrong. Showing "you're wasting $X per month on excessive S3 requests" provided concrete justification. Finance teams control prioritization, and they care about costs.

Column statistics are underutilized. Surfacing Delta Lake's min/max/null count statistics revealed patterns people didn't know existed. High null rates flagged data quality issues. Unexpected ranges revealed incorrect data types. The column analysis section became unexpectedly popular for data quality monitoring beyond just optimization.

What This Approach Can't Do (And Why That's Acceptable)

Browser-based analysis isn't a silver bullet. Massive tables with hundreds of thousands of partitions exceed browser capabilities. Real-time monitoring with automated alerts requires backend infrastructure. Historical trending is manual since we don't maintain server-side metrics. For very large tables, we sample files rather than analyzing all of them, introducing statistical uncertainty.

These limitations are real, but they don't invalidate the approach. For the vast majority of Delta Lake tables at typical organizations, browser-based analysis works excellently. The 5% of edge cases that exceed browser capabilities can use alternative tools. Optimizing the common case while providing escape hatches for edge cases is good engineering.

Future Directions: From Diagnostic to Predictive

The Delta Lake Health Analyzer has proven valuable as a diagnostic tool, but we're seeing patterns that suggest predictive possibilities.

Real-time streaming pipelines predictably create small file problems within 48-72 hours. Batch loads develop skew after 30-60 days when transaction volumes shift. These patterns are consistent enough to enable proactive maintenance.

Imagine automatic warnings: "Table X will hit critical small file threshold in 3 days" or "Partition skew will impact performance next week unless compaction runs today."

We're also exploring automated optimization recommendations beyond "run OPTIMIZE," integration with workflow orchestration platforms like Airflow, and data-driven ZORDER recommendations based on actual query patterns from warehouse logs.

Why This Matters Beyond Our Use Case

The browser-first, client-side processing pattern solves problems many teams face. Any scenario where you need to analyze data that users can already access, but don't want to stand up backend infrastructure for, benefits from this approach: log file analysis, configuration validation, data quality checking, cloud cost analysis.

The security model is particularly compelling for regulated industries. Client-side approaches bypass data governance overhead because data never enters new security zones or crosses compliance boundaries. The deployment simplicity also scales to many internal tools. Any read-only analysis tool can be deployed as static files, eliminating infrastructure costs and simplifying version management.

DuckDB WASM specifically opens up possibilities for browser-based data analysis that weren't practical before. The performance handles real-world datasets (tens to hundreds of megabytes), and we're likely to see more tools adopting this pattern where backend infrastructure is overkill.


The Bottom Line

The Delta Lake Health Analyzer demonstrates that sophisticated data analysis tools don't always require sophisticated infrastructure. By leveraging browser capabilities and DuckDB WASM, we built a tool that provides genuine value while remaining trivially simple to deploy and maintain.

The tool has become essential to our data engineering workflow. Teams check table health before expensive OPTIMIZE operations, use cost estimates to justify work to management, and leverage column analysis for data quality monitoring. The browser-based approach removed all the friction that would have existed with a traditional backend service.

This project validated that client-side data processing is a viable architectural pattern for internal tooling. When data is already accessible and analysis is read-only, processing in the browser solves deployment, security, and maintenance challenges elegantly. The limitations are real but acceptable for most use cases.

If you're building data analysis tools for internal use, consider the browser-first approach. For diagnostic and exploratory tools where users already have data access, client-side processing offers compelling advantages. The Delta Lake Health Analyzer proves you don't need complex infrastructure to solve complex problems. Sometimes the simplest architecture is the most powerful.

