How to split 10GB JSON files in seconds without hitting RAM limits

ihar ivanuto — Wed, 01 Jul 2026 22:20:35 +0000

Hi Everyone!

We had this classic pain point on our project: constantly chewing through massive JSON arrays. Catalogs, analytics dumps, ML datasets — files ranging from a couple of hundred megabytes to tens of gigabytes.

The task was stupidly simple: split a giant JSON array into individual elements so we could chunk them or throw them into parallel processing. No data transformation, no querying by keys. We literally just needed to find where each chunk starts and ends.

Naturally, we started with the classic approach: json.Unmarshal -> slice -> json.Marshal. On a 10GB file, memory consumption went to the moon 🚀. We ended up spending more time fighting the Go garbage collector (GC) than doing actual work.
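For reference, the parse-everything baseline looks roughly like the sketch below. This is a hypothetical reconstruction, not our actual project code: `splitByParsing` is a made-up name, and it decodes into `[]any` rather than real structs. It shows why the approach hurts: every element is fully decoded into Go values and then re-encoded just to get its bytes back.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splitByParsing decodes the whole array into Go values, then re-encodes
// each element. Every key, string, and number is allocated along the way.
func splitByParsing(data []byte) ([][]byte, error) {
	var items []any
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	out := make([][]byte, 0, len(items))
	for _, item := range items {
		b, err := json.Marshal(item)
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}

func main() {
	chunks, err := splitByParsing([]byte(`[{"b":1,"a":2},[1,2]]`))
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%s\n", c)
	}
}
```

Note that the round trip is not byte-preserving either: `json.Marshal` writes map keys in sorted order, so the first element comes back with its keys reordered.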

And then it clicked: to just move the data around, we don't need to understand what's inside it. We just need to find the boundaries.

Stop parsing, start scanning 🛑

A typical parser (even the ultra-fast ones like sonic or simdjson) still decodes every value and builds some in-memory representation of it: a tree, a tape, or typed structs. Instead, you can just treat the JSON as a raw byte stream. Look for structural markers, find the edges, and cut.

The entire logic boils down to a tiny state machine:

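Nesting counter: { and [ go +1, } and ] go -1.
String tracking: keep track of when you enter "..." so you don't accidentally react to brackets or commas inside a text field.
Escapes: a \" inside a string is a trap, not the end of the string (and \\ is an escaped backslash, so a quote right after it does end the string).
The boundary: with the outer [ and ] stripped off, any comma , seen outside a string at nesting depth exactly 0 is where you split. (If you count the outer [ as well, that's depth 1.)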

That’s it. We don't care about keys or values. We don't copy a single byte of the payload, we just return memory views (slices) of the original buffer.

Here’s what the concept looks like in Go (oversimplified: no string logic, and it expects the array contents with the outer [ and ] already stripped):

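// Chunk is a half-open byte range [Start, End) into the scanned buffer.
type Chunk struct {
    Start, End int
}

// findElements expects the contents of a JSON array with the outer [ and ]
// already stripped, and returns the byte range of each top-level element.
// Deliberately naive: no string, escape, or whitespace handling.
func findElements(data []byte) []Chunk {
    var chunks []Chunk
    depth := 0 // current nesting level of {} and []
    start := 0 // offset where the current element begins

    for i, b := range data {
        switch b {
        case '{', '[':
            depth++
        case '}', ']':
            depth--
        case ',':
            // A comma at depth 0 separates two top-level elements.
            if depth == 0 {
                chunks = append(chunks, Chunk{Start: start, End: i})
                start = i + 1
            }
        }
    }

    // Whatever follows the last comma is the final element.
    if start < len(data) {
        chunks = append(chunks, Chunk{Start: start, End: len(data)})
    }
    return chunks
}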

Obviously, this naive code will break on the first tricky whitespace or string, but you get the point. We aren't parsing. We are scanning.
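For comparison, here is a sketch of what the same state machine might look like once string tracking, escapes, and whitespace are handled. It is a plain scalar version, not the production tool: `splitTopLevel` is a hypothetical name, it takes the whole array including the outer brackets (so elements sit at depth 1), and it still assumes the input is valid JSON with an array at the root.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitTopLevel returns one subslice of data per element of the top-level
// JSON array. Nothing is validated and no payload bytes are copied.
func splitTopLevel(data []byte) [][]byte {
	var chunks [][]byte
	depth := 0        // 1 means "directly inside the outer array"
	inString := false // true between an opening and a closing quote
	escaped := false  // true right after a backslash inside a string
	start := -1       // first byte of the current element, -1 if none yet

	for i, b := range data {
		if inString {
			switch {
			case escaped:
				escaped = false // this byte is escaped, whatever it is
			case b == '\\':
				escaped = true
			case b == '"':
				inString = false
			}
			continue
		}
		switch b {
		case ' ', '\t', '\n', '\r':
			continue // whitespace between tokens never starts an element
		case '"':
			inString = true
		case '{', '[':
			depth++
			if depth == 1 {
				continue // the outer '[' is not part of any element
			}
		case '}', ']':
			depth--
			if depth == 0 { // the outer ']' closes the last element
				if start >= 0 {
					chunks = append(chunks, bytes.TrimRight(data[start:i], " \t\n\r"))
				}
				return chunks
			}
		case ',':
			if depth == 1 { // separator between two top-level elements
				if start >= 0 {
					chunks = append(chunks, bytes.TrimRight(data[start:i], " \t\n\r"))
				}
				start = -1
				continue
			}
		}
		if start < 0 && depth >= 1 {
			start = i // first significant byte of a new element
		}
	}
	return chunks
}

func main() {
	input := []byte(`[ {"a":"x,]y"}, "q\"[", [1,2] ,3 ]`)
	for _, c := range splitTopLevel(input) {
		fmt.Printf("%s\n", c)
	}
}
```

The sample input is picked to hit the traps: a comma and a bracket inside a string, an escaped quote, and stray whitespace around the separators.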

Why is this so damn fast? ⚡

No per-element allocations in the hot loop. You're just handing back data[start:end]. No new objects, no copying strings, no building hash maps. The only thing that grows is the list of boundaries.
Hardware absolutely loves it. Your entire working state is basically two integers. It lives in CPU registers, never mind L1 cache, and memory reads are strictly sequential.
The branch predictor is happy. A simple state machine with highly predictable transitions is infinitely easier for the CPU to digest than a full parser juggling dozens of token types.
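If you want to check the allocation claim on your own machine, the standard library's `testing.AllocsPerRun` works outside of tests too. The sketch below compares handing back a view with copying the same bytes into a string, which is what a parser has to do for every string value. The helper names (`view`, `copyOut`, `measure`) and the sample data are made up for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks keep the compiler from optimizing the work away.
var (
	sinkView []byte
	sinkCopy string
)

// view hands back a window into data: a new slice header over the same
// backing array, so nothing is allocated.
func view(data []byte, start, end int) []byte { return data[start:end] }

// copyOut copies the bytes into a freshly allocated Go string.
func copyOut(data []byte, start, end int) string { return string(data[start:end]) }

// measure reports the average number of heap allocations per call.
func measure() (viewAllocs, copyAllocs float64) {
	data := []byte(`[{"id":1,"name":"first element"},{"id":2,"name":"second element"}]`)
	viewAllocs = testing.AllocsPerRun(1000, func() { sinkView = view(data, 1, 32) })
	copyAllocs = testing.AllocsPerRun(1000, func() { sinkCopy = copyOut(data, 1, 32) })
	return viewAllocs, copyAllocs
}

func main() {
	v, c := measure()
	fmt.Printf("allocations per call: view=%v copy=%v\n", v, c)
}
```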

Look at how much work we are skipping:

| Step | Standard Parser | Boundary Scanner |
| --- | --- | --- |
| Read bytes | ✅ | ✅ |
| Classify tokens | ✅ | Only `{}[]"\` and `,` |
| Build hash maps | ✅ | ❌ |
| Allocate strings | ✅ | ❌ |
| Allocate slices | ✅ | ❌ |
| Type conversion | ✅ | ❌ |
| What you get back | `[]MyStruct` | `[][]byte` (pointers to original buffer) |

By my rough estimate, we are throwing away about 80% of the overhead.

But how fast is it actually? 🏎️

I got a bit carried away and polished this into a production-ready tool. I added proper string handling and escape tracking, and rewrote the hot loop in AVX2 assembly (chewing through 32 bytes per iteration using SIMD bitmasks).
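To give a feel for the bitmask idea without assembly, here is a pure-Go analogue that looks at 8 bytes at a time using ordinary 64-bit arithmetic (often called SWAR). To be clear, this is not the AVX2 code from the tool: the real thing compares 32 bytes per iteration with vector instructions, and the names `eqMask`, `isStructural`, and `nextStructural` are invented for this sketch. The principle is the same, though: build a mask of interesting bytes, then jump straight to the first set bit.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo7  = 0x7F7F7F7F7F7F7F7F // low 7 bits of every byte lane
	ones = 0x0101010101010101 // 0x01 in every byte lane
)

// eqMask returns a word with 0x80 set in every byte lane of w equal to c.
func eqMask(w uint64, c byte) uint64 {
	x := w ^ (ones * uint64(c)) // lanes equal to c become 0x00
	t := (x & lo7) + lo7        // high bit set if any of the low 7 bits is set
	return ^(t | x | lo7)       // 0x80 where the lane was zero, 0x00 elsewhere
}

// isStructural reports whether b is one of the bytes the scanner reacts to.
func isStructural(b byte) bool {
	switch b {
	case '{', '}', '[', ']', '"', '\\', ',':
		return true
	}
	return false
}

// nextStructural returns the index of the first structural byte at or after
// from, or len(data) if there is none.
func nextStructural(data []byte, from int) int {
	i := from
	for ; i+8 <= len(data); i += 8 {
		// Little-endian load: lane 0 of w is data[i].
		w := binary.LittleEndian.Uint64(data[i:])
		m := eqMask(w, '{') | eqMask(w, '}') | eqMask(w, '[') | eqMask(w, ']') |
			eqMask(w, '"') | eqMask(w, '\\') | eqMask(w, ',')
		if m != 0 {
			return i + bits.TrailingZeros64(m)/8 // lowest set bit = first match
		}
	}
	for ; i < len(data); i++ { // scalar tail for the last few bytes
		if isStructural(data[i]) {
			return i
		}
	}
	return len(data)
}

func main() {
	data := []byte(`[{"name":"a long value without markers"},{"k":[1,2]}]`)
	for i := nextStructural(data, 0); i < len(data); i = nextStructural(data, i+1) {
		fmt.Printf("%d:%c ", i, data[i])
	}
	fmt.Println()
}
```

Long runs of plain text inside string values are where this pays off, since whole words get skipped with a handful of instructions instead of one branch per byte.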

Tbh, the results surprised even me:

| Approach | What it does | Throughput | Memory Overhead |
| --- | --- | --- | --- |
| `encoding/json` | Full parse → Go structs | ~107 MB/s | 3-4x input size |
| `sonic` / `simdjson-go` | Optimized parse → structs/AST | ~400–700 MB/s | ~1.1x |
| My AVX2 scanner | Just finds boundaries | ~4.1 GB/s | ~1.0x (zero extra) |

At 4.1 GB/s, the algorithm isn't even the obvious bottleneck anymore. As far as I can tell, it's limited by how fast bytes arrive from RAM: the CPU spends much of its time waiting for the next cache line.

The catch (Tradeoffs) ⚠️

Because I went the unsafe and raw assembly route for maximum speed, you have to pay the price:

Platform-specific: The AVX2 path only works on amd64 CPUs that support AVX2. For arm64 (hello, Apple Silicon MacBooks), you need a pure Go fallback.
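Memory lifecycle danger: You are getting slices that point directly into the original buffer. With ordinary Go slices the GC won't free the buffer while a chunk still references it (which also means one tiny chunk keeps the whole buffer alive). The real danger is that []byte getting overwritten or reused, or, if it came from mmap or unsafe pointers, getting unmapped or freed while you're still working with the chunks... then it's going to hurt.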
No validation: The scanner takes your word that the JSON is valid. Feed it garbage, and it will silently slice up garbage.
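The last two tradeoffs are easy to reproduce in a few lines. The sketch below is an illustration with made-up names (`detach`, `demo`): it takes a view into a buffer, overwrites the buffer the way a reused read buffer would be, and shows that the view changes while a copied chunk does not. It also uses `json.Valid` from the standard library as a cheap spot check on a single chunk, for cases where you don't fully trust the input.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detach copies a chunk so it no longer shares memory with the source buffer.
func detach(chunk []byte) []byte {
	return append([]byte(nil), chunk...)
}

// demo returns what an aliased chunk and a detached copy contain after the
// source buffer is overwritten, plus json.Valid results before and after.
func demo() (aliased, detached string, validBefore, validAfter bool) {
	buf := []byte(`[{"id":1},{"id":2}]`)
	chunk := buf[1:9]     // {"id":1} as a view into buf
	safe := detach(chunk) // independent copy of the same bytes

	validBefore = json.Valid(chunk)

	// Simulate the buffer being reused for the next read.
	copy(buf, "xxxxxxxxxxxxxxxxxxx")

	validAfter = json.Valid(chunk)
	return string(chunk), string(safe), validBefore, validAfter
}

func main() {
	aliased, detached, before, after := demo()
	fmt.Printf("aliased chunk after overwrite:  %s (valid: %v -> %v)\n", aliased, before, after)
	fmt.Printf("detached chunk after overwrite: %s\n", detached)
}
```

Copying every chunk would defeat the purpose, so in practice you'd detach only the chunks that need to outlive the buffer.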

TL;DR

The biggest insight was stupidly simple: stop thinking "I need to parse this JSON" and start thinking "I need to find boundaries in a byte stream". Once I changed my perspective, the code wrote itself and the performance gap was massive.

Has anyone else suffered through this? How do you guys route or chunk massive JSON payloads in production when you simply can't fit them into RAM?