amir

Posted on May 25

Memory Under the Hood: Why Go Often Feels Faster Than Python

#go #python #performance #backend

After years of building backend systems, working with data pipelines, debugging production issues, and watching servers behave differently under load, I learned one lesson the hard way:

Performance problems rarely start from syntax.

They usually start from how we think about memory.

Earlier in my career, I compared languages mostly by developer experience. Python felt clean, expressive, and fast to write. Go felt simple, strict, and very practical for backend services. But once I started dealing with large files, high-throughput APIs, queues, workers, containers, and memory pressure in production, I realized that the real difference is not just language design.

The real difference is what happens under the hood.

Why can a small Python script suddenly consume a huge amount of RAM?

Why can the same type of data processing in Go run with much lower memory usage?

Why does iterating over a slice of structs in Go usually feel much cheaper than iterating over a list of Python objects?

And why does reading a 10GB file the wrong way crash programs in both languages, even if one of them is “faster”?

This article is my practical breakdown of memory layout, allocation, garbage collection, cache locality, and large-file processing in Go and Python. I am not writing this as a language war. I use both languages. Python is still one of my favorite tools for automation, scripting, data work, and fast iteration.

But when you are building backend systems where memory, latency, and throughput matter, understanding these details can change how you write code.

The First Misunderstanding: A Dynamic Array Is Not a Linked List

One mistake I have seen many developers make is assuming that dynamic collections are somehow linked lists internally.

They are not.

A Python list is not a linked list.

A Go slice is not a linked list.

Both are built around the idea of a dynamic array, although their internal models are very different.

In Python, a list is basically a resizable array of references. The list itself stores pointers to objects. Those objects live somewhere else in memory.

In Go, a slice is a small header that points to an underlying array. That header contains three important pieces of information:

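type SliceHeader struct {
    Data uintptr // address of the first element in the underlying array
    Len  int     // number of elements currently in use
    Cap  int     // number of elements available before a reallocation
}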

Conceptually, a Go slice contains:

a pointer to the underlying array
the current length
the current capacity
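To make those three fields concrete, here is a minimal sketch. The describe helper and the sample values are my own illustration, not anything from the standard library. It shows that slicing an array copies no data and that the capacity runs to the end of the underlying array.

```go
package main

import "fmt"

// describe reports the length and capacity stored in a slice header.
func describe(s []int) (length, capacity int) {
	return len(s), cap(s)
}

func main() {
	backing := [5]int{10, 20, 30, 40, 50}

	// window views elements 1 and 2 of backing; no data is copied.
	window := backing[1:3]

	l, c := describe(window)
	fmt.Println(l, c) // 2 4

	// Writing through the slice changes the array it points to.
	window[0] = 99
	fmt.Println(backing[1]) // 99
}
```

The capacity is 4 rather than 2 because it counts from the slice's first element to the end of the underlying array.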

If you really need a linked list in Go, there is container/list, or you can implement your own node-based structure. But for most backend workloads, linked lists are not as useful as people think. They often hurt cache locality and add pointer chasing overhead.

That is an important point: Big-O complexity is not the whole story.

In real production systems, CPU cache behavior matters a lot.

Python Lists: Simple API, Expensive Objects

Python gives us a very clean programming model:

numbers = [1, 2, 3, 4, 5]

for n in numbers:
    print(n)

It looks like a list of integers.

But internally, it is not a compact array of raw integers like you might expect in C or Go.

A Python list stores references to Python objects. Each integer is a full Python object with metadata, reference count information, type information, and value storage.

So when you write:

numbers = [1, 2, 3]

You should not imagine this as:

[1][2][3]

It is closer to:

list
 ├── pointer ──> PyObject(1)
 ├── pointer ──> PyObject(2)
 └── pointer ──> PyObject(3)

This design gives Python a lot of flexibility. A single list can contain integers, strings, dictionaries, custom classes, and even functions:

items = [1, "hello", {"active": True}, lambda x: x * 2]

That flexibility is powerful, but it has a cost.

Every item access can involve pointer dereferencing. The CPU may need to jump to different memory locations. This causes more cache misses, and cache misses are expensive.

Modern CPUs are extremely fast when they can read predictable, contiguous memory. They are much slower when they constantly chase pointers across the heap.

This is one reason why pure Python loops over large collections can be slow compared to Go, C, Rust, or even NumPy-based Python code.

NumPy is fast not because Python magically became faster, but because NumPy stores data in compact native arrays and runs optimized native code.

Go Slices: Contiguous Memory and Better Cache Locality

Now compare that with Go:

numbers := []int{1, 2, 3, 4, 5}

for _, n := range numbers {
    fmt.Println(n)
}

A slice of integers in Go points to an underlying array of integers stored contiguously in memory.

Conceptually:

[1][2][3][4][5]

That is much friendlier for the CPU.

The CPU can load a cache line and read multiple nearby values efficiently. This is called cache locality, and it is one of those low-level concepts that directly affects high-level backend performance.

This becomes even more interesting with structs.

type User struct {
    ID     int64
    Active bool
    Score  float64
}

users := []User{
    {ID: 1, Active: true, Score: 91.2},
    {ID: 2, Active: false, Score: 72.5},
}

In this case, the actual User values are stored in the underlying array.

But if you write:

users := []*User{
    &User{ID: 1, Active: true, Score: 91.2},
    &User{ID: 2, Active: false, Score: 72.5},
}

Now you have a slice of pointers. This can be useful when you need shared mutable objects or want to avoid copying large structs, but it also means more pointer chasing.

This is why I try to be intentional with this decision.

A slice of values:

[]User

is not the same performance model as:

[]*User

Both are valid. But they are not the same.

In production systems, these small decisions start to matter when you process hundreds of thousands or millions of records.
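A rough sketch of the two layouts side by side, assuming the User struct from above. The function names sumValues and sumPointers are hypothetical. Both loops compute the same result; the difference is only in how memory is walked, so treat this as a starting point for your own benchmark rather than a measurement.

```go
package main

import "fmt"

type User struct {
	ID     int64
	Active bool
	Score  float64
}

// sumValues walks one contiguous block of User values.
func sumValues(users []User) float64 {
	var total float64
	for i := range users {
		total += users[i].Score
	}
	return total
}

// sumPointers follows one pointer per element; each User is a separate heap object.
func sumPointers(users []*User) float64 {
	var total float64
	for _, u := range users {
		total += u.Score
	}
	return total
}

func main() {
	const n = 1_000
	values := make([]User, 0, n)
	pointers := make([]*User, 0, n)
	for i := 0; i < n; i++ {
		values = append(values, User{ID: int64(i), Active: i%2 == 0, Score: 1.5})
		pointers = append(pointers, &User{ID: int64(i), Active: i%2 == 0, Score: 1.5})
	}
	fmt.Println(sumValues(values), sumPointers(pointers))
}
```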

Value Semantics vs Reference Semantics

Another major difference between Go and Python is how they treat values.

Python is reference-oriented. Variables are names bound to objects.

a = [1, 2, 3]
b = a
b.append(4)

print(a)  # [1, 2, 3, 4]

Both a and b refer to the same list object.

Go has stronger value semantics by default.

a := [3]int{1, 2, 3}
b := a

b[0] = 99

fmt.Println(a) // [1 2 3]
fmt.Println(b) // [99 2 3]

The array is copied.

But slices are different:

a := []int{1, 2, 3}
b := a

b[0] = 99

fmt.Println(a) // [99 2 3]
fmt.Println(b) // [99 2 3]

Why?

Because the slice header is copied, but both slice headers still point to the same underlying array.

This is one of the most important things every Go developer should deeply understand.

A slice is not the array itself. It is a descriptor over an array.

That is why bugs can happen when you pass slices around and mutate them without thinking about who else shares the same underlying array.
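The classic version of this bug involves append. The sketch below uses two hypothetical helpers to show the problem and one common fix: a full slice expression (parent[:2:2]) that caps the capacity so append has to allocate a new array.

```go
package main

import "fmt"

// appendShared appends to a sub-slice that still has spare capacity,
// so the write lands in the parent's underlying array.
func appendShared(parent []int) []int {
	child := parent[:2]
	return append(child, 99)
}

// appendIsolated caps the sub-slice's capacity with a full slice expression,
// which forces append to allocate a fresh array.
func appendIsolated(parent []int) []int {
	child := parent[:2:2]
	return append(child, 99)
}

func main() {
	a := []int{1, 2, 3, 4}
	_ = appendShared(a)
	fmt.Println(a) // [1 2 99 4]

	b := []int{1, 2, 3, 4}
	_ = appendIsolated(b)
	fmt.Println(b) // [1 2 3 4]
}
```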

Allocation: The Hidden Cost Behind Simple Code

Allocation is one of the biggest silent performance costs in backend systems.

When code allocates too much memory, the garbage collector has more work to do. More GC work means more CPU overhead and sometimes more latency.

In Go, when a slice grows beyond its capacity, Go allocates a new underlying array and copies the existing elements into it.

Example:

items := []int{}

for i := 0; i < 1_000_000; i++ {
    items = append(items, i)
}

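This works, but the runtime has to allocate a larger underlying array and copy every existing element each time the capacity runs out, which happens many times on the way to one million elements.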

A better version:

items := make([]int, 0, 1_000_000)

for i := 0; i < 1_000_000; i++ {
    items = append(items, i)
}

Now Go knows the expected capacity from the beginning.

This does not mean you should always preallocate everything. But when you know the approximate size, preallocation is one of the simplest performance wins.
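If you want to see the growth for yourself, here is a small sketch. The countGrowths helper is my own illustration: it counts capacity changes as a proxy for reallocations. The exact count depends on the Go version's growth policy, so I am not quoting a number.

```go
package main

import "fmt"

// countGrowths appends n ints to s and counts how often the capacity changes,
// which approximates how often the underlying array was reallocated.
func countGrowths(s []int, n int) int {
	growths := 0
	lastCap := cap(s)
	for i := 0; i < n; i++ {
		s = append(s, i)
		if cap(s) != lastCap {
			growths++
			lastCap = cap(s)
		}
	}
	return growths
}

func main() {
	const n = 1_000_000
	fmt.Println("no preallocation:", countGrowths(nil, n))
	fmt.Println("preallocated:    ", countGrowths(make([]int, 0, n), n))
}
```

The preallocated run reports zero growths because the capacity is already large enough for every append.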

The same idea exists in many systems:

preallocate buffers
reuse memory when safe
avoid unnecessary temporary objects
avoid converting []byte to string too early
avoid building huge in-memory arrays when streaming is enough

Performance engineering is often not about writing complicated code.

It is about not forcing the runtime to clean up avoidable garbage.

Python Allocation and Object Overhead

Python has its own memory manager. For small objects, CPython uses a dedicated small-object allocator (pymalloc), which makes object creation cheaper than going to the system allocator every time.

But Python still pays the cost of object-heavy design.

A Python integer is not just a raw machine integer. A Python string is an object. A Python dict is very flexible, but it is not cheap. A Python class instance has overhead too, unless you optimize with tools like __slots__ or use more compact structures.

This is why Python can be very fast when the heavy work is pushed into optimized native libraries, but slower when the workload is pure Python object processing.

For example:

total = 0

for item in huge_list:
    total += item

If huge_list is a normal Python list of Python integers, every iteration involves Python-level object handling.

But with NumPy:

import numpy as np

arr = np.array(huge_list)
total = arr.sum()

The expensive loop is moved into optimized native code.

This is not just a library trick. It is a memory-layout trick.

Compact memory layout changes everything.

Garbage Collection: Python vs Go

Garbage collection is another area where the difference matters.

Python, specifically CPython, mainly uses reference counting. Every object tracks how many references point to it. When the reference count reaches zero, the object can be freed immediately.

Example:

a = []
b = a

del a
del b

After both references are gone, the list can be cleaned up.

But reference counting alone cannot handle reference cycles.

a = {}
b = {}

a["b"] = b
b["a"] = a

Now a references b, and b references a.

Even if nothing else references them, each still holds a reference to the other, so their reference counts never drop to zero on their own. That is why Python also has a cyclic garbage collector.

Go uses a concurrent mark-and-sweep garbage collector.

At a high level, Go's GC finds reachable objects, marks them as live, and then frees unreachable memory. It is designed to keep pause times low and run concurrently with the application as much as possible.

This is one of the reasons Go works well for backend services. You can build long-running APIs, workers, and network services with predictable latency when you write allocation-conscious code.

But Go's GC is not magic.

If your code allocates too much, creates too many temporary objects, or keeps references alive longer than needed, the GC still has to work harder.

In Go, good performance often comes from writing boring code:

buf := make([]byte, 64*1024)

Reuse buffers when appropriate.

Avoid unnecessary conversions.

Keep data structures simple.

Do not create object graphs when arrays or slices are enough.
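One way to reuse buffers in Go is sync.Pool. This is a minimal sketch, assuming a hypothetical hot-path function formatRecord. Whether a pool actually helps depends on your workload, so measure before adopting it.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so a hot path does not allocate one per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatRecord is a hypothetical hot-path function that builds a small string.
func formatRecord(id int, name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "%d:%s", id, name)
	return buf.String() // String copies the bytes, so the result stays valid after Put
}

func main() {
	fmt.Println(formatRecord(1, "alice")) // 1:alice
	fmt.Println(formatRecord(2, "bob"))   // 2:bob
}
```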

The 10GB File Problem

One of the most common backend mistakes is reading a large file into memory.

In Python:

with open("large.log", "r") as f:
    data = f.read()

In Go:

data, err := os.ReadFile("large.log")
if err != nil {
    log.Fatal(err)
}

This may work for small files.

It may work for 100MB.

It may even work in development.

Then production gets a 10GB file, and the process gets killed by the operating system.

The problem is not Python or Go.

The problem is the strategy.

If the file is large, you should usually stream it.

Streaming Files in Python

Python gives you a very clean way to process files line by line:

with open("large.log", "r") as f:
    for line in f:
        process(line)

This does not load the whole file into memory.

For CSV files:

import csv

with open("large.csv", newline="") as f:
    reader = csv.reader(f)

    for row in reader:
        process(row)

For huge CSV files with Pandas:

import pandas as pd

for chunk in pd.read_csv("large.csv", chunksize=100_000):
    process(chunk)

For large JSON files, avoid loading everything with:

json.load(f)

if the file is too big.

For stream-style JSON parsing, libraries like ijson can process JSON incrementally.

Another powerful pattern in Python is generators:

def read_lines(path):
    with open(path, "r") as f:
        for line in f:
            yield line

for line in read_lines("large.log"):
    process(line)

The benefit is simple: only a small part of the data exists in memory at a time.

Streaming Files in Go

Go is extremely practical for this kind of workload.

For line-by-line processing:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("large.log")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)

    for scanner.Scan() {
        line := scanner.Text()
        process(line)
    }

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

func process(line string) {
    fmt.Println(line)
}

This is simple and memory efficient.

But there is one important detail: bufio.Scanner has a default maximum token size of 64 KiB (bufio.MaxScanTokenSize). A longer line makes Scan stop with bufio.ErrTooLong, so for very long lines you should increase the buffer:

scanner := bufio.NewScanner(file)

buf := make([]byte, 1024*1024)
scanner.Buffer(buf, 10*1024*1024)

For more control, I often prefer bufio.Reader:

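// Needs the io and log imports in addition to bufio.
reader := bufio.NewReaderSize(file, 64*1024)

for {
    line, err := reader.ReadString('\n')
    if len(line) > 0 {
        process(line) // the last line may arrive without a trailing newline
    }

    if err == io.EOF {
        break // normal end of file
    }
    if err != nil {
        log.Fatal(err) // a real read error, not just the end of input
    }
}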

For JSON streaming, avoid reading the full file and unmarshalling everything:

var items []Item
json.Unmarshal(data, &items)

For large files that contain a stream of JSON values, such as one object per line, use a decoder:

decoder := json.NewDecoder(file)

for decoder.More() {
    var item Item
    if err := decoder.Decode(&item); err != nil {
        log.Fatal(err)
    }

    process(item)
}

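Depending on the JSON structure, you may need to manually read tokens using Token(). If the file is one top-level array, for example, the loop above does not work as written: you first consume the opening [ with decoder.Token(), and only then loop with More().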
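For the common case of one big top-level array, a sketch could look like this. Item, streamArray, and collectIDs are hypothetical names I made up for the example; the decoder calls (Token, More, Decode) are from encoding/json.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Item is a hypothetical record type.
type Item struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// streamArray decodes a top-level JSON array one element at a time.
func streamArray(r io.Reader, handle func(Item)) error {
	dec := json.NewDecoder(r)

	// Consume the opening '['.
	tok, err := dec.Token()
	if err != nil {
		return err
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return fmt.Errorf("expected '[', got %v", tok)
	}

	// Only one Item is decoded and held at a time.
	for dec.More() {
		var item Item
		if err := dec.Decode(&item); err != nil {
			return err
		}
		handle(item)
	}

	// Consume the closing ']'.
	_, err = dec.Token()
	return err
}

// collectIDs is a demo helper: it streams the text and records each ID.
func collectIDs(jsonText string) ([]int, error) {
	var ids []int
	err := streamArray(strings.NewReader(jsonText), func(it Item) {
		ids = append(ids, it.ID)
	})
	return ids, err
}

func main() {
	ids, err := collectIDs(`[{"id":1,"name":"a"},{"id":2,"name":"b"}]`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ids) // [1 2]
}
```

In real code the reader would be the opened file instead of a string.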

The main point is this:

Do not make memory responsible for holding data that can be streamed.

Why Go Can Be Much Faster in File Processing

In one of my own experiments, I processed and filtered a large CSV log file using both Python and Go.

The Python version used a generator and csv.reader.

The Go version used buffered I/O and goroutines for parallel processing.

The difference was significant.

Python was clean and memory efficient, but slower because every row and cell still became Python-level objects. Go was faster because I could process bytes more directly, reduce allocations, and use multiple CPU cores more naturally.

The exact numbers always depend on the machine, disk, file format, parsing logic, and implementation. But the pattern is very common:

Python is excellent for developer speed.
Go is excellent for predictable resource usage and backend throughput.
Python becomes much faster when heavy work is moved into native libraries.
Go performs very well when memory layout and allocations are controlled.

This is not about saying one language is always better.

It is about knowing where each language shines.

A Practical Go Pattern: Worker Pool for Large Files

When processing huge files, I usually avoid creating one goroutine per line. That sounds concurrent, but it can destroy memory and scheduling performance.

Instead, I prefer a bounded worker pool.

package main

import (
    "bufio"
    "log"
    "os"
    "sync"
)

func main() {
    file, err := os.Open("large.log")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    lines := make(chan string, 10_000)

    var wg sync.WaitGroup
    workerCount := 8

    for i := 0; i < workerCount; i++ {
        wg.Add(1)

        go func() {
            defer wg.Done()

            for line := range lines {
                process(line)
            }
        }()
    }

    scanner := bufio.NewScanner(file)
    scanner.Buffer(make([]byte, 1024*1024), 10*1024*1024)

    for scanner.Scan() {
        lines <- scanner.Text()
    }

    close(lines)
    wg.Wait()

    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

func process(line string) {
    // parse, filter, transform, send to DB, etc.
}

The important part is this:

lines := make(chan string, 10_000)

The channel is bounded.

That means the reader cannot infinitely push data into memory. If workers are slower than the reader, backpressure naturally happens.

This is one of the most important backend engineering concepts:

Fast producers must not be allowed to destroy slow consumers.

Whether you are reading files, consuming Kafka messages, processing HTTP requests, or sending jobs to workers, you need backpressure.
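A blocking send on a bounded channel is one form of backpressure. Sometimes you would rather reject work than block, and a non-blocking send does that. The tryEnqueue helper below is my own illustration of the idea, not a standard API.

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send and reports whether the queue accepted the job.
func tryEnqueue(queue chan<- int, job int) bool {
	select {
	case queue <- job:
		return true
	default:
		return false // queue is full: reject, retry later, or shed load
	}
}

func main() {
	queue := make(chan int, 2) // room for two pending jobs

	fmt.Println(tryEnqueue(queue, 1)) // true
	fmt.Println(tryEnqueue(queue, 2)) // true
	fmt.Println(tryEnqueue(queue, 3)) // false: nothing has drained the queue yet

	<-queue // a consumer takes one job

	fmt.Println(tryEnqueue(queue, 3)) // true
}
```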

Avoiding Unnecessary String Allocation in Go

One hidden cost in Go file processing is converting bytes to strings too early.

For example:

line := scanner.Text()

This returns a newly allocated string for every line.

If you need better performance and can safely process bytes, use:

line := scanner.Bytes()

But be careful: the byte slice returned by scanner.Bytes() is only valid until the next scan. If you need to keep it, you must copy it.

This is a classic Go tradeoff.

You can reduce allocations, but you must understand ownership and lifetime.
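Here is how that looks in practice, as a sketch with a made-up filter (keep only lines starting with "ERROR"). Rejected lines cost no allocation, and the lines we keep are copied before the scanner reuses its buffer.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// collectErrors keeps only the lines that start with "ERROR".
func collectErrors(input string) ([][]byte, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	prefix := []byte("ERROR")

	var kept [][]byte
	for scanner.Scan() {
		line := scanner.Bytes() // no string allocation; valid only until the next Scan
		if !bytes.HasPrefix(line, prefix) {
			continue
		}
		// Copy before retaining: the scanner will overwrite its buffer.
		kept = append(kept, append([]byte(nil), line...))
	}
	return kept, scanner.Err()
}

func main() {
	lines, err := collectErrors("INFO ok\nERROR disk full\nERROR timeout\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, line := range lines {
		fmt.Println(string(line))
	}
}
```

Without the copy, every kept entry could end up pointing at bytes that a later Scan has already overwritten.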

That is why I always say Go is simple, but not shallow.

Escape Analysis: The Invisible Performance Tool

One of the most powerful Go concepts is escape analysis.

The Go compiler decides at build time whether a variable can stay on the stack or must move to the heap.

Example:

func createUser() *User {
    user := User{ID: 1}
    return &user
}

Here, user escapes because we return its address. It cannot live only on the stack.

You can inspect escape decisions:

go build -gcflags="-m" ./...

You may see output like:

moved to heap: user

This does not automatically mean the code is bad. Sometimes heap allocation is necessary.

But when you are optimizing hot paths, escape analysis helps you understand why your GC pressure is high.
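You can also observe the effect at runtime. The sketch below is an illustration under assumptions: the function names are hypothetical, and I use //go:noinline so the compiler cannot inline the calls and optimize the allocation away. It uses testing.AllocsPerRun, which works outside of tests too.

```go
package main

import (
	"fmt"
	"testing"
)

type User struct {
	ID int64
}

//go:noinline
func newUserPtr(id int64) *User {
	u := User{ID: id}
	return &u // the address outlives the call, so u is moved to the heap
}

//go:noinline
func newUserValue(id int64) User {
	return User{ID: id} // returned by value; no heap allocation needed
}

var sink int64 // keeps the results observable

// allocsPtr reports average heap allocations per call of newUserPtr.
func allocsPtr() float64 {
	return testing.AllocsPerRun(1000, func() { sink += newUserPtr(1).ID })
}

// allocsValue reports average heap allocations per call of newUserValue.
func allocsValue() float64 {
	return testing.AllocsPerRun(1000, func() { sink += newUserValue(1).ID })
}

func main() {
	fmt.Println("pointer return:", allocsPtr())
	fmt.Println("value return:  ", allocsValue())
}
```

In real code the compiler often inlines small constructors like these, so confirm with -gcflags="-m" on your actual hot path.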

In backend systems, this matters in places like:

request parsing
JSON encoding/decoding
logging
metrics
queue consumers
data transformation pipelines
tight loops
large batch processing

If a function runs millions of times, a small allocation becomes a production problem.

Memory Is Not Just a Language Problem

A senior engineer should not only ask:

Which language is faster?

A better question is:

What memory model does this workload need?

For example:

If I am building an internal script, Python is probably perfect.

If I am writing data analysis code, Python with Pandas, Polars, DuckDB, or NumPy can be excellent.

If I am building a high-throughput API service with predictable latency, Go is often a strong choice.

If I am processing huge files and need simple deployment, Go gives me a very practical balance.

If I am building machine learning workflows, Python still has the ecosystem advantage.

The language is only one layer.

The real engineering decision is about workload, memory behavior, concurrency model, ecosystem, deployment, and operational cost.

Practical Rules I Use in Production

Here are some rules I personally follow when working with memory-heavy backend systems.

1. Never load a huge file fully unless you really need to

Stream it.

Line by line.

Chunk by chunk.

Token by token.

2. Preallocate when you know the size

In Go:

items := make([]Item, 0, expectedSize)

This can reduce allocations and copying.

3. Be careful with slices sharing the same underlying array

This can create subtle bugs and unexpected memory retention.

For example:

small := big[:10]

If big is huge, small may keep the whole underlying array alive.

If you only need the small part, copy it:

smallCopy := append([]byte(nil), big[:10]...)
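A small sketch to make the retention visible. The header helper is a hypothetical name, and the 1 MiB buffer stands in for whatever large data you read.

```go
package main

import "fmt"

// header returns the first n bytes of data as an independent copy,
// so the caller does not keep data's whole underlying array reachable.
func header(data []byte, n int) []byte {
	return append([]byte(nil), data[:n]...)
}

func main() {
	big := make([]byte, 1<<20) // stand-in for a large buffer

	small := big[:10]
	fmt.Println(len(small), cap(small)) // 10 1048576: still tied to the 1 MiB array

	smallCopy := header(big, 10)
	fmt.Println(len(smallCopy), cap(smallCopy) < cap(big)) // 10 true
}
```

Once big and small go out of scope, only the tiny copy keeps memory alive.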

4. Avoid unnecessary pointer-heavy structures

A slice of pointers is not always better.

Sometimes a slice of values is faster and simpler.

5. Measure before and after optimization

Use tools:

go test -bench=. -benchmem
go tool pprof

For Python:

python -m cProfile app.py

Also reach for memory profilers when memory matters: tracemalloc in Python's standard library, or a heap profile inspected with go tool pprof in Go.
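If you want a quick before/after number without setting up a _test.go file, testing.Benchmark can be called from a normal program. This is only a sketch: buildGrow, buildPrealloc, and measure are hypothetical helpers, and the package-level sink is there so the slices really escape to the heap.

```go
package main

import (
	"fmt"
	"testing"
)

var sink []int // package-level sink so the slices escape to the heap

// buildGrow lets append grow the slice from nothing.
func buildGrow(n int) {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

// buildPrealloc reserves the full capacity up front.
func buildPrealloc(n int) {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	sink = s
}

// measure runs f under the benchmark harness and returns allocations per call.
func measure(f func(int), n int) int64 {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			f(n)
		}
	})
	return res.AllocsPerOp()
}

func main() {
	fmt.Println("grow allocs/op:    ", measure(buildGrow, 1000))
	fmt.Println("prealloc allocs/op:", measure(buildPrealloc, 1000))
}
```

For anything you plan to keep, a regular benchmark run with go test -bench=. -benchmem is still the better home.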

6. Design with backpressure

Bound your queues.

Limit your workers.

Do not let fast input destroy your service.

7. Optimize hot paths, not everything

Readable code still matters.

Do not turn the whole codebase into a low-level memory puzzle.

Find the hot path, measure it, then optimize it.

Final Thoughts

The deeper I go into backend engineering, the more I respect memory.

Not because every developer needs to become a systems programmer, but because every production system eventually becomes a resource management problem.

CPU is limited.

Memory is limited.

Disk I/O is limited.

Network bandwidth is limited.

The job of a backend engineer is not just writing business logic. It is building software that behaves well under pressure.

Python gives us speed of development and a massive ecosystem.

Go gives us simplicity, strong concurrency, predictable deployment, and great performance for many backend workloads.

Both are useful.

But if you want to build scalable backend systems, process large files, reduce memory pressure, and understand why your service behaves the way it does, you need to look under the hood.

My personal rule is simple:

Write clean code first.

Understand memory second.

Measure everything.

Then optimize only what matters.

That mindset has helped me more than any single framework, library, or language feature.

Discussion

Have you ever had a service crash because it loaded too much data into memory?

Or have you used Go escape analysis, pprof, or Python profiling tools to debug a real production issue?

I would love to hear your experience.
