Handling large files can be a real headache, especially in memory-constrained environments. In this post, we'll explore two techniques for processing large JSON files in Go: the naive approach, which reads the entire file into memory, and a more efficient buffered (streaming) approach. We'll use the pprof tool to profile and analyze the memory usage of each approach and see how they compare.
The Naive Approach
The naive approach involves reading the entire JSON file into memory and unmarshalling it into Go structures. While this method is straightforward, it can lead to high memory consumption with very large files, since the whole decoded slice has to fit in memory at once.
package parser

import (
    "encoding/json"
    "log"
    "os"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal"
)

func NaiveParseFile(filename string) {
    file, err := os.Open(filename)
    if err != nil {
        log.Fatalf("error on open file: %s, error: %v", filename, err)
    }
    defer file.Close()

    // Decode the whole array into a slice: every user is held in memory at once.
    users := []internal.User{}
    decoder := json.NewDecoder(file)
    if err := decoder.Decode(&users); err != nil {
        log.Fatalf("error on decode users from json file: %v", err)
    }
    // DO STUFF
}
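Both parsers decode into internal.User. The post doesn't show that type, so here is a minimal stand-in to make the examples self-contained; the fields are assumptions, not the repository's actual definition:

package internal

// User is a placeholder for the repository's internal.User type;
// the real struct may have different fields.
type User struct {
    ID    int    `json:"id"`
    Name  string `json:"name"`
    Email string `json:"email"`
}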
The Efficient Chunked Processing Approach
On the other hand, the chunked processing approach reads the JSON file through a buffered reader and uses json.Decoder's token API to decode one element at a time. Because only the current element is held in memory, usage stays roughly constant regardless of file size, making this a much better fit for large files.
package parser

import (
    "bufio"
    "encoding/json"
    "log"
    "os"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal"
)

func OptimizedParseFile(filename string) {
    file, err := os.Open(filename)
    if err != nil {
        log.Fatalln("error opening file: ", err)
    }
    defer file.Close()

    reader := bufio.NewReader(file)
    decoder := json.NewDecoder(reader)

    // Read the opening bracket of the JSON array
    token, err := decoder.Token()
    if err != nil {
        log.Fatalln("Error reading opening token: ", err)
    }
    if delim, ok := token.(json.Delim); !ok || delim != '[' {
        log.Fatalln("Expected start of JSON array")
    }

    // Decode one user at a time; only the current element lives in memory
    for decoder.More() {
        user := &internal.User{}
        err := decoder.Decode(user)
        if err != nil {
            log.Fatalln("Error decoding JSON user: ", err)
        }
        // DO STUFF
    }

    // Read the closing bracket of the JSON array
    token, err = decoder.Token()
    if err != nil {
        log.Fatalln("Error reading closing token:", err)
    }
    // Check if the closing token is the end of the array
    if delim, ok := token.(json.Delim); !ok || delim != ']' {
        log.Fatalln("Expected end of JSON array")
    }
}
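If you'd rather keep the per-item logic out of the parser, one option is to accept a callback and return errors instead of calling log.Fatalln. This is just a sketch of that idea (the StreamUsers name and signature are mine, not part of the original repository), built on the same token-based decoding shown above; it would live in the same parser package and additionally import fmt:

// StreamUsers is a hypothetical variant of OptimizedParseFile that hands each
// decoded user to a caller-supplied function instead of doing the work inline.
func StreamUsers(filename string, handle func(internal.User) error) error {
    file, err := os.Open(filename)
    if err != nil {
        return fmt.Errorf("opening %s: %w", filename, err)
    }
    defer file.Close()

    decoder := json.NewDecoder(bufio.NewReader(file))

    // Consume the opening '[' of the array.
    if _, err := decoder.Token(); err != nil {
        return fmt.Errorf("reading opening token: %w", err)
    }

    // Decode one user at a time; only the current element is held in memory.
    for decoder.More() {
        var user internal.User
        if err := decoder.Decode(&user); err != nil {
            return fmt.Errorf("decoding user: %w", err)
        }
        if err := handle(user); err != nil {
            return err
        }
    }

    // Consume the closing ']'.
    if _, err := decoder.Token(); err != nil {
        return fmt.Errorf("reading closing token: %w", err)
    }
    return nil
}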
Setting up Memory Profiling with pprof
To get a better understanding of the memory usage of both approaches, we'll use the pprof
profiling tool. Here's how to set it up in your Go application:
package main

import (
    "flag"
    "log"
    "os"
    "runtime"
    "runtime/pprof"

    "github.com/iamseki/dev-to/apps/processing-large-json-golang/internal/parser"
)

func main() {
    var filename string
    flag.StringVar(&filename, "filename", "defaultfile-1mb.json", "Filename to parse")
    flag.Parse()

    // profiling CPU
    cpuProf, err := os.Create("cpu-naive.prof")
    if err != nil {
        log.Fatalf("error creating cpu-naive.prof: %v", err)
    }
    defer cpuProf.Close()
    if err := pprof.StartCPUProfile(cpuProf); err != nil {
        log.Fatalf("error starting CPU profile: %v", err)
    }
    defer pprof.StopCPUProfile()

    parser.NaiveParseFile(filename)

    // profiling MEM
    memProf, err := os.Create("mem-naive.prof")
    if err != nil {
        log.Fatalf("error creating mem-naive.prof: %v", err)
    }
    defer memProf.Close()

    runtime.GC() // get up-to-date statistics
    if err := pprof.WriteHeapProfile(memProf); err != nil {
        log.Fatal("could not write memory profile: ", err)
    }
}
Collecting Profiling Data
There is one command for each approach; each run also collects profiling data:
- For the naive approach:
yarn nx process-naive processing-large-json-golang --filename=largefile-20240718-083247-100mb.json
- For the optimized approach:
yarn nx process-optimized processing-large-json-golang --filename=largefile-20240718-083247-100mb.json
Note: for the --filename argument, use the file generated by the script:
yarn nx generate-file processing-large-json-golang --size=100
You can also use the default 1MB file by not passing the argument at all.
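The generate-file script itself isn't shown in the post. For reference, here is a rough Go sketch of what such a generator could look like; the flag name, output filename, and User fields are all illustrative, not the repository's actual implementation:

package main

import (
    "encoding/json"
    "flag"
    "fmt"
    "log"
    "os"
)

type User struct {
    ID    int    `json:"id"`
    Name  string `json:"name"`
    Email string `json:"email"`
}

func main() {
    sizeMB := flag.Int("size", 1, "approximate target file size in MB")
    flag.Parse()

    out, err := os.Create(fmt.Sprintf("largefile-%dmb.json", *sizeMB))
    if err != nil {
        log.Fatalf("error creating file: %v", err)
    }
    defer out.Close()

    target := int64(*sizeMB) * 1024 * 1024
    var written int64

    // Write a JSON array of users until the file reaches roughly the target size.
    out.WriteString("[")
    for i := 0; written < target; i++ {
        if i > 0 {
            out.WriteString(",")
        }
        record, err := json.Marshal(User{ID: i, Name: fmt.Sprintf("user-%d", i), Email: fmt.Sprintf("user-%d@example.com", i)})
        if err != nil {
            log.Fatalf("error marshalling record: %v", err)
        }
        n, err := out.Write(record)
        if err != nil {
            log.Fatalf("error writing record: %v", err)
        }
        written += int64(n)
    }
    out.WriteString("]")
}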
Profiling Analysis
To analyze the profiling data, use the go tool pprof command with its built-in web server. More information about the pprof web interface is available in its documentation.
- To analyze the memory profile of the optimized version:
yarn nx analyze-mem-optimized processing-large-json-golang
- To analyze the memory profile of the naive version:
yarn nx analyze-mem-naive processing-large-json-golang
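These nx targets presumably just wrap the standard pprof web UI invocation; if you want to call it directly, it looks like this (the profile filename assumes the one written by the main function above):
go tool pprof -http=:8080 mem-naive.prof
This starts a local web server on localhost:8080 with views such as top, graph, and flame graph for the heap profile.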
Conclusion
Our analysis shows that the buffered processing approach significantly reduces memory usage compared to the naive method. Future work could explore additional optimizations and alternative methods for efficient file processing. Hopefully this is useful for you! 😄