
Christian Seki

Handling Large JSON Files in Go: Naive vs. Chunked Approaches

Handling large files can be a real headache, especially in environments with limited memory. In this post, we'll explore two different techniques for processing large JSON files in Go: the naive approach, which decodes the entire file into memory at once, and a more efficient chunked (streaming) approach. We'll use the pprof tool to profile the memory usage of each approach and see how they compare.

GitHub Repository

The Naive Approach

The naive approach reads the entire JSON file and unmarshals it into Go structures in a single call. While this method is straightforward, it can lead to high memory consumption with very large files, since every decoded record has to sit in memory at the same time.
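
Both parsers decode into an internal.User type from the repository. Its exact fields aren't important here, but as a minimal sketch (the field names below are assumptions, not the repo's actual definition) you can picture something like:

package internal

// User is a sketch of the record shape the parsers decode.
// The real struct in the repository may differ.
type User struct {
    ID    int    `json:"id"`
    Name  string `json:"name"`
    Email string `json:"email"`
}

With that shape in mind, here is the naive parser: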

package parser

import (
    "encoding/json"
    "log"
    "os"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal"
)

func NaiveParseFile(filename string) {
    file, err := os.Open(filename)
    if err != nil {
        log.Fatalf("error on open file: %s, error: %v", filename, err)
    }
    defer file.Close()

    users := []internal.User{}
    decoder := json.NewDecoder(file)
    if err := decoder.Decode(&users); err != nil {
        log.Fatalf("error on decode users from json file: %v", err)
    }

    // DO STUFF
}

Full code here

The Efficient Chunked Processing Approach

The chunked processing approach, on the other hand, streams the file through a buffered reader and decodes the JSON array one element at a time, so only a single record needs to be held in memory at any point. This keeps memory usage low and roughly constant regardless of file size, making it a much better fit for large files.

package parser

import (
    "bufio"
    "encoding/json"
    "log"
    "os"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal"
)

func OptimizedParseFile(filename string) {
    file, err := os.Open(filename)
    if err != nil {
        log.Fatalln("error opening file: ", err)
    }
    defer file.Close()

    reader := bufio.NewReader(file)
    decoder := json.NewDecoder(reader)

    token, err := decoder.Token()
    if err != nil {
        log.Fatalln("Error reading opening token: ", err)
    }

    if delim, ok := token.(json.Delim); !ok || delim != '[' {
        log.Fatalln("Expeceted start of JSON array")
    }

    for decoder.More() {
        user := &internal.User{}
        err := decoder.Decode(user)
        if err != nil {
            log.Fatalln("Error decoding JSON user: ", err)
        }

        // DO STUFF
    }

    // Read the closing bracket of the JSON array
    token, err = decoder.Token()
    if err != nil {
        log.Fatalln("Error reading closing token:", err)
    }

    // Check if the closing token is the end of the array
    if delim, ok := token.(json.Delim); !ok || delim != ']' {
        log.Fatalln("Expected end of JSON array")
    }

}


Full code here
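
Since the streaming loop only ever holds one decoded user at a time, the // DO STUFF spot is a natural place to plug in a callback. Below is a rough sketch of that idea; OptimizedParseFileWithHandler is a hypothetical name, not a function from the repository, and it assumes the same internal.User type as before:

package parser

import (
    "bufio"
    "encoding/json"
    "os"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal"
)

// Hypothetical variant of OptimizedParseFile: the same streaming loop,
// but each decoded user is handed to a caller-supplied function.
func OptimizedParseFileWithHandler(filename string, handle func(internal.User) error) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    decoder := json.NewDecoder(bufio.NewReader(file))

    // Consume the opening '[' of the array.
    if _, err := decoder.Token(); err != nil {
        return err
    }

    for decoder.More() {
        var user internal.User
        if err := decoder.Decode(&user); err != nil {
            return err
        }
        if err := handle(user); err != nil {
            return err
        }
    }

    // Consume the closing ']'.
    _, err = decoder.Token()
    return err
}

Returning errors instead of calling log.Fatalln also makes the function easier to reuse and test, since the caller decides how to handle failures.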

Setting up Memory Profiling with pprof

To get a better understanding of the memory usage of both approaches, we'll use the pprof profiling tool. Here's how to set it up in your Go application:

package main

import (
    "flag"
    "log"
    "os"
    "runtime"
    "runtime/pprof"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal/parser"
)

func main() {
    var filename string
    flag.StringVar(&filename, "filename", "defaultfile-1mb.json", "Filename to parse")
    flag.Parse()

    // profiling CPU
    cpu_prof, err := os.Create("cpu-naive.prof")
    if err != nil {
        log.Fatalf("error create cpu.prof: %v", err)
    }
    pprof.StartCPUProfile(cpu_prof)
    defer pprof.StopCPUProfile()

    parser.NaiveParseFile(filename)

    // profiling MEM
    mem_prof, err := os.Create("mem-naive.prof")
    if err != nil {
        log.Fatalf("error create mem.prof: %v", err)
    }
    defer mem_prof.Close()

    runtime.GC() // get up-to-date statistics
    if err := pprof.WriteHeapProfile(mem_prof); err != nil {
        log.Fatal("could not write memory profile: ", err)
    }
}

Full code here
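
If you prefer not to use the Nx workspace, you can produce the same .prof files by running the program directly with go run, for example go run . -filename=largefile.json from the app's directory.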

Collecting Profiling Data

We have two commands to run both approaches. Each command will also collect profiling data:

  • For the naive approach: yarn nx process-naive processing-large-json-golang --filename=largefile-20240718-083247-100mb.json
  • For the optimized approach: yarn nx process-optimized processing-large-json-golang --filename=largefile-20240718-083247-100mb.json

Note: Use the filename generated by the script (yarn nx generate-file processing-large-json-golang --size=100) as the --filename argument. You can also skip the argument entirely to use the default 1MB file.
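
If you want to generate a comparable test file outside the Nx workspace, a stand-alone generator is easy to sketch. The snippet below writes a JSON array of users element by element, so the generator itself stays memory-friendly; it assumes the User shape sketched earlier and is not the repository's actual generate-file script:

package main

import (
    "fmt"
    "log"
    "os"
)

func main() {
    f, err := os.Create("largefile-sketch.json")
    if err != nil {
        log.Fatalln("error creating file:", err)
    }
    defer f.Close()

    // Stream the array out one element at a time instead of building
    // a giant slice and marshalling it in one shot.
    if _, err := f.WriteString("["); err != nil {
        log.Fatalln(err)
    }
    for i := 0; i < 1_000_000; i++ {
        if i > 0 {
            if _, err := f.WriteString(","); err != nil {
                log.Fatalln(err)
            }
        }
        if _, err := fmt.Fprintf(f, `{"id":%d,"name":"user-%d","email":"user-%d@example.com"}`, i, i, i); err != nil {
            log.Fatalln(err)
        }
    }
    if _, err := f.WriteString("]"); err != nil {
        log.Fatalln(err)
    }
}

Adjust the loop count to control the file size; a million records of this shape lands in the tens of megabytes.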

Profiling Analysis

To analyze the profiling data, use the Go pprof tool with its built-in web server. More information about the pprof web interface can be found here.
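
Under the hood, these analyze targets presumably wrap an invocation along the lines of go tool pprof -http=:8080 mem-naive.prof, which serves the interactive web UI on a local port.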

  • To analyze the memory profile of the optimized version: yarn nx analyze-mem-optimized processing-large-json-golang

[Screenshot: pprof web UI, memory profile of the optimized version]

  • To analyze the memory profile of the naive version: yarn nx analyze-mem-naive processing-large-json-golang

[Screenshot: pprof web UI, memory profile of the naive version]

Conclusion

Our analysis shows that the chunked, buffered processing approach significantly reduces memory usage compared to the naive method. Future work could explore additional optimizations and alternative methods for efficient file processing. Hopefully this is useful for you! 😄


GitHub: iamseki / dev-to (Implementations of dev.to blog posts)

Top comments (2)

Erik Kalkoken

Thanks for sharing this interesting approach. I am currently building a viewer for large JSON files and this approach might help me further reduce memory consumption.

Christian Seki

Thank you for reading! I'm glad to hear that this will be useful for your use case!