What If You Could See Your Data Before It Floods Your Context Window?
You know that moment when you ask your agent to check the data and, before you can tell it not to eat the ocean, it enters Compaction?
Yeah. We've all been there.
The Problem
- 🤯 Huge database dumps with unpredictable line sizes
- 💸 Token budgets disappearing into massive single-line JSON blobs
- 🔍 No quick way to see WHERE the chonky lines are hiding
- ⚠️ Agents choking on files you thought were reasonable
The Solution: line_histogram.awk
A blazingly fast AWK script that gives you X-ray vision into your files' byte distribution.
# chmod +x ./line_histogram.awk
./line_histogram.awk huge_export.jsonl
File: huge_export.jsonl
Total bytes: 2847392
Total lines: 1000
Bucket Distribution:
Line Range        | Bytes         | Distribution
──────────────────┼───────────────┼──────────────────────────────────────────
1-100             | 4890          | ██
101-200           | 5234          | ██
201-300           | 5832          | ██
301-400           | 6128          | ██
401-500           | 385927        | ████████████████████████████████████████
501-600           | 5892          | ██
601-700           | 5234          | ██
701-800           | 6891          | ██
801-900           | 5328          | ██
901-1000          | 4982          | ██
──────────────────┼───────────────┼──────────────────────────────────────────
Boom. That 401-500 bucket is basically the whole file, and line 450 turns out to be the culprit.
🚀 Features That Actually Matter
1. Histogram Mode (Default)
See the byte distribution across your file in 10 neat buckets. Spot the bloat instantly.
./line_histogram.awk myfile.txt
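Under the hood, something has to know the total line count before it can size ten equal buckets, which in AWK usually means two passes. Here's a minimal sketch of the idea (assumed, not the script's actual internals; note the file is named twice on purpose):
awk '
NR == FNR { total++; next }                          # pass 1: count lines
FNR == 1  { width = (total <= 10) ? 1 : int(total / 10) }
{
    b = int((FNR - 1) / width) + 1                   # bucket for this line
    if (b > 10) b = 10                               # remainder folds into bucket 10
    bytes[b] += length($0) + 1                       # +1 for the newline byte
}
END { for (i = 1; i <= 10; i++) printf "bucket %d: %d bytes\n", i, bytes[i] }
' myfile.txt myfile.txt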
2. Surgical Line Extraction
Found a problem line? Extract it without loading the whole file into memory.
# Extract line 450 (the chonky one)
./line_histogram.awk -v mode=extract -v line=450 huge_export.jsonl
# Extract lines 100-200 for inspection
./line_histogram.awk -v mode=extract -v start=100 -v end=200 data.jsonl
Yes yes, those -v bits look odd, but yes yes, they are needed, as that's how AWK passes variables into a script. Who knew! (Hint: the AI)
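And if you ever forget the flags, the idiom underneath is stock AWK anyway (equivalent in spirit, not the script itself):
# Print line 450 only, and stop reading the moment we have it
awk 'NR == 450 { print; exit }' huge_export.jsonl
# Print lines 100 through 200, then bail out
awk 'NR >= 100 && NR <= 200 { print } NR == 200 { exit }' data.jsonl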
3. Zero Dependencies
If your system has AWK (it does), you're good to go. No npm install, no pip install, no Docker containers. Just pure, unadulterated shell goodness.
4. Stupid Fast
Processes multi-GB files in seconds. AWK was built for this.
💡 Use Cases That'll Make You Look Like a Genius
Big Data? Big ideas!
For AI Agent Wranglers
- Profile before you prompt: Know if that export file is safe to feed your agent (one-liner after this list)
- Smart sampling: Extract representative line ranges instead of the whole file
- Debug token explosions: "Why did my context window fill up?" β histogram shows a 500KB line
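That first check also works as a stock-AWK stand-in when the script isn't on the box (a sketch, not part of line_histogram.awk):
# Report the single longest line: its number and byte count
awk 'length($0) > max { max = length($0); n = NR } END { print "line " n ": " max " bytes" }' export.jsonl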
For Data Engineers
- Spot malformed CSVs: One line with 10,000 columns? Histogram shows it (sketch after this list)
- Log file analysis: Find the log entries that are suspiciously huge
- Database export QA: Verify export structure before importing elsewhere
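For the malformed-CSV case, a hedged stand-in (the 100-column threshold is made up; tune it to your schema):
# Flag rows whose field count looks wrong
awk -F',' 'NF > 100 { print "row " NR ": " NF " columns" }' data.csv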
For DevOps/SRE
- Config file sanity checks: Spot embedded certificates or secrets bloating configs
- Debug log truncation: See which lines are hitting your logger's size limits
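For the truncation case, something along these lines (8192 is an assumed logger limit; substitute your own):
# List lines at or over the limit
awk 'length($0) >= 8192 { print NR ": " length($0) " bytes" }' app.log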
📖 Pseudo Man Page (The Details)
NAME
line_histogram.awk - profile files by line size distribution or extract specific lines
SYNOPSIS
./line_histogram.awk [options] <file>
OPTIONS
-v mode=histogram (default) Show byte distribution across 10 buckets
-v mode=extract Extract specific line(s)
-v line=N Extract single line N (requires mode=extract)
-v start=X Extract lines X through Y (requires mode=extract)
-v end=Y
-v outfile=FILE Write output to FILE instead of stdout
MODES
Histogram Mode (Default)
Divides the file into 10 equal-sized buckets by line number and shows the byte distribution:
Bucket 1: Lines 1-10% → X bytes
Bucket 2: Lines 11-20% → Y bytes
...and so on
The visual histogram uses █ blocks scaled to the bucket with the most bytes (sketch below, after the special cases).
Special cases:
Files ≤ 10 lines: Each line gets its own bucket
Remainder lines: Absorbed into bucket 10
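The scaling itself is presumably something like this (an assumed sketch, not lifted from the script; the 2-block minimum is a guess from the sample output):
# Inside END, after bytes[1..10] are summed: widest bucket gets 40 blocks
max = 0
for (i = 1; i <= 10; i++) if (bytes[i] > max) max = bytes[i]
for (i = 1; i <= 10; i++) {
    n = (max > 0) ? int(bytes[i] * 40 / max) : 0
    if (bytes[i] > 0 && n < 2) n = 2       # tiny buckets still get a sliver
    bar = ""
    while (n-- > 0) bar = bar "█"
    printf "%14d | %s\n", bytes[i], bar
}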
Extract Mode
Pull specific lines without loading the entire file into your editor:
# Single line
./line_histogram.awk -v mode=extract -v line=42 file.txt
# Range
./line_histogram.awk -v mode=extract -v start=100 -v end=200 file.txt
EXIT STATUS
0: Success
1: Error (invalid line number, bad range, missing parameters)
EXAMPLES
Example 1: Quick file profile
./line_histogram.awk database_dump.jsonl
Example 2: Extract suspicious line for inspection
./line_histogram.awk -v mode=extract -v line=523 data.csv > suspicious_line.txt
Example 3: Sample middle section of large file
./line_histogram.awk -v mode=extract -v start=5000 -v end=5100 bigfile.log | less
Example 4: Save histogram to file
./line_histogram.awk -v outfile=analysis.txt huge_file.jsonl
🧪 Testing Suite Included
Not sure if it works? We've got you covered with a visual test suite:
# Generate test patterns
./generate_test_files.sh
# Run all tests
./run_tests.sh
The test suite generates files with known patterns:
- 🔺 Triangle up/down: Ascending/descending line sizes
- 🟦 Square: Uniform line lengths
- 🌗 Semicircle: sqrt curve distribution
- 🔔 Bell curve: Gaussian distribution
- 📈 Spike: One massive line in a sea of tiny ones (DIY sketch below)
- 🎯 Edge cases: Empty files, single lines, exactly 10 lines
Watch the histograms match the patterns. It's oddly satisfying.
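And if you want to sanity-check the spike case without the generator, faking it is one BEGIN block away (the sizes here are arbitrary):
# 1000 lines, one ~512 KB monster at line 500
awk 'BEGIN {
    big = "x"
    while (length(big) < 500000) big = big big   # double up to ~512 KB
    for (i = 1; i <= 1000; i++) {
        if (i == 500) print big
        else print "tiny line " i
    }
}' > spike.txt
./line_histogram.awk spike.txt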
⚡ Installation
Star, then download. Star. "⭐" You know, like a thumbs up, but for the yoof of today. STAR THE GIST ⭐⭐⭐
[https://gist.github.com/simbo1905/0454936144ee8dbc55bdc96ef532555e]
Make executable
chmod +x line_histogram.awk
Optional: Add to PATH
cp line_histogram.awk ~/bin/line_histogram.awk
Or just run it directly:
awk -f line_histogram.awk yourfile.txt
🎯 Why This Exists
Born from frustration with AI agents eating context windows on mystery files. Sometimes you just need to know: "Is this file safe to feed my agent, or will line 847 consume my entire token budget?" Because that is obviously how you think and act.
You are not a data engineer up at 2am SCREAMING IN ALL CAPS because your LLM got into a crash loop trying to evaluate a JSONL extract that didn't fit into context. That is definitely not you, no. Me neither.
📄 License
MIT or Public Domain. Use it, abuse it, put it in production, whatever. No warranty implied: if it deletes your files, that's on you (though it only reads, so you're probably fine).
🤝 Contributing
It's AWK. If you can make it better, you're a wizard. PRs welcome, but you will need to set up a repo, as I cannot be bothered. So just fork the gist and be done with it.
Made with ❤️ and frustration by someone who spent too many tokens on line 523 of a JSONL file.
Now go profile your files like a pro. 🚀✨