Think of your computer's memory and storage like a busy office. The CPU is the manager, memory (RAM) is the desk where work happens, and the disk is a giant filing cabinet. Every time you need a file, a junior employee (the operating system) must walk to the cabinet, retrieve the document, place a copy on the desk for the manager to use, and later walk back to file any changes. This back-and-forth is slow.
What if the manager could work directly on the document while it's still in the filing cabinet? That's the core idea behind the techniques I want to show you. We can skip the copying step. In computing, we call this zero-copy. It lets programs handle data more like a streaming service and less like a library with a strict checkout process.
This matters because in data-heavy applications—like video transcoding, scientific computing, or high-frequency trading—the speed of moving data often limits everything else. The CPU waits idle for data to arrive. By reducing or eliminating these waits and the extra work of copying bytes around, we can make our programs dramatically faster and more efficient.
Let's start with a fundamental tool: memory-mapped files. Normally, when your Go program reads a file, it asks the operating system for data. The OS fetches data from disk into its own cache, then copies that data into your application's memory space. That's two steps and two copies of the data in memory.
Memory mapping removes the middleman. It asks the OS to map the file directly into your program's address space. Your program's memory addresses for a specific range point straight to the file's contents on disk. The OS handles loading pages of the file into physical RAM only when you touch them, and writing them back automatically.
Here is what this looks like in code. We'll create a structure to manage a memory-mapped file.
package main
import (
"fmt"
"log"
"golang.org/x/sys/unix"
)
// ZeroCopyFile holds our connection to a memory-mapped file.
type ZeroCopyFile struct {
fd int // The low-level file descriptor.
size int64 // How big the file is.
data []byte // This slice's backing array is the mapped memory.
mapped bool
}
// NewZeroCopyFile opens a file and prepares it for mapping.
func NewZeroCopyFile(path string) (*ZeroCopyFile, error) {
// Open the file. We use unix.Open for low-level control.
fd, err := unix.Open(path, unix.O_RDWR, 0644)
if err != nil {
return nil, err
}
	// Get the file's size. unix.Fstat fills in a Stat_t rather than returning one.
	var stat unix.Stat_t
	if err := unix.Fstat(fd, &stat); err != nil {
		unix.Close(fd)
		return nil, err
	}
return &ZeroCopyFile{
fd: fd,
size: stat.Size,
}, nil
}
// MapMemory performs the actual magic. It tells the OS to map the file.
func (zcf *ZeroCopyFile) MapMemory() error {
// The Mmap call is key. It returns a byte slice backed by the file.
data, err := unix.Mmap(zcf.fd, 0, int(zcf.size),
unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
if err != nil {
return err
}
zcf.data = data
zcf.mapped = true
return nil
}
// ReadAt is now trivial. No system call, just a slice operation.
func (zcf *ZeroCopyFile) ReadAt(offset, length int64) ([]byte, error) {
	if !zcf.mapped {
		return nil, fmt.Errorf("file is not mapped")
	}
	if offset < 0 || offset+length > zcf.size {
		return nil, fmt.Errorf("read beyond file bounds")
	}
	// This is zero-copy. We're returning a slice pointing into the mapped memory.
	return zcf.data[offset : offset+length], nil
}
// WriteAt is just as simple.
func (zcf *ZeroCopyFile) WriteAt(offset int64, data []byte) error {
	if !zcf.mapped {
		return fmt.Errorf("file is not mapped")
	}
	if offset < 0 || offset+int64(len(data)) > zcf.size {
		return fmt.Errorf("write beyond file bounds")
	}
	// copy() works directly on the mapped memory; the OS writes the pages back later.
	copy(zcf.data[offset:], data)
return nil
}
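One method is still missing: the main function at the end of this article defers a call to UnmapMemory, so let's define that cleanup now.
// UnmapMemory releases the mapping and closes the file.
// The data slice must not be used after this call.
func (zcf *ZeroCopyFile) UnmapMemory() error {
	if !zcf.mapped {
		return unix.Close(zcf.fd)
	}
	if err := unix.Munmap(zcf.data); err != nil {
		return err
	}
	zcf.data = nil
	zcf.mapped = false
	return unix.Close(zcf.fd)
}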
Using this is straightforward. You open a file, map it, and then interact with the data slice as if it were any other byte slice in Go. The operating system works in the background to sync changes to disk. You can call unix.Msync to force a sync.
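Here is a minimal usage sketch. The name data.bin is just a placeholder for any existing, writable file that is at least a few bytes long:
func exampleUsage() error {
	zcf, err := NewZeroCopyFile("data.bin") // placeholder path
	if err != nil {
		return err
	}
	if err := zcf.MapMemory(); err != nil {
		return err
	}
	defer zcf.UnmapMemory()
	// Write and read through the mapping like an ordinary slice.
	if err := zcf.WriteAt(0, []byte("hello")); err != nil {
		return err
	}
	data, err := zcf.ReadAt(0, 5)
	if err != nil {
		return err
	}
	fmt.Printf("read back: %s\n", data)
	// Force dirty pages to disk instead of waiting for the OS.
	return unix.Msync(zcf.data, unix.MS_SYNC)
}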
I once used this to process a multi-gigabyte log file. The traditional method of reading it line by line using bufio.Scanner took minutes and consumed gigabytes of RAM. With memory mapping, the same analysis ran in seconds and memory usage was nearly flat. The file felt like it was already in memory.
The next level is moving this data over a network. Imagine you have a file on disk and need to send it to a client over a TCP socket. The classic way involves a loop: read a chunk from disk into an application buffer, then write that chunk to the socket. The data is copied from the OS disk cache to your app's memory, then from your app's memory to the OS network buffers.
The sendfile system call cuts out your application entirely. It instructs the kernel to copy data directly from one file descriptor (the file) to another (the socket). Your program just orchestrates the transfer.
Here's how you can use it in Go. We'll create a network transfer helper.
import (
"sync/atomic"
"time"
)
type ZeroCopyNetwork struct {
stats TransferStats
}
func (zcn *ZeroCopyNetwork) SendFile(socketFd, fileFd int, offset, length int64) error {
start := time.Now()
var written int64
	for written < length {
		// This is the crucial call. The kernel handles the transfer and
		// advances offset for us.
		n, err := unix.Sendfile(socketFd, fileFd, &offset, int(length-written))
		if err != nil {
			return err
		}
		if n == 0 {
			break // EOF on the source file; nothing more to send.
		}
		written += int64(n)
	}
duration := time.Since(start)
atomic.AddUint64(&zcn.stats.bytesTransferred, uint64(written))
atomic.AddUint64(&zcn.stats.durationNs, uint64(duration.Nanoseconds()))
return nil
}
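The stats field above assumes a small counter struct, and the main function at the end reads it back through a GetStats method. Minimal versions of both might look like this:
// TransferStats accumulates transfer metrics. Fields are updated atomically.
type TransferStats struct {
	bytesTransferred uint64
	durationNs       uint64
}
// GetStats returns a snapshot of the counters.
func (zcn *ZeroCopyNetwork) GetStats() TransferStats {
	return TransferStats{
		bytesTransferred: atomic.LoadUint64(&zcn.stats.bytesTransferred),
		durationNs:       atomic.LoadUint64(&zcn.stats.durationNs),
	}
}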
To use this, you need the raw file descriptors for both the network connection and the file. For a network connection, you can access it through the syscall.Conn interface. The performance difference is not subtle. For large files, you can saturate a 10-gigabit network link with a fraction of a single CPU core, whereas the traditional method might max out multiple cores.
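Wrapped as a small helper, the descriptor extraction might look like the sketch below (the full main function at the end does the same thing inline; it assumes net, syscall, and fmt are imported, and strictly speaking the descriptor is only guaranteed valid inside the Control callback):
// connFd pulls the raw file descriptor out of a net.Conn.
func connFd(conn net.Conn) (int, error) {
	sc, ok := conn.(syscall.Conn)
	if !ok {
		return -1, fmt.Errorf("connection does not expose a raw fd")
	}
	rc, err := sc.SyscallConn()
	if err != nil {
		return -1, err
	}
	var fd int
	if err := rc.Control(func(f uintptr) { fd = int(f) }); err != nil {
		return -1, err
	}
	return fd, nil
}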
Sometimes you need to move data between two places that aren't a file and a socket, like between two sockets or a pipe and a file. Another system call, splice, is useful here. It moves data between two file descriptors without moving it to user space, using a pipe as a kernel-mediated buffer.
func (zcn *ZeroCopyNetwork) Splice(srcFd, dstFd int, length int64) error {
pipe := make([]int, 2)
if err := unix.Pipe(pipe); err != nil {
return err
}
defer unix.Close(pipe[0])
defer unix.Close(pipe[1])
	var transferred int64
	for transferred < length {
		// Move data from the source into the pipe.
		n, err := unix.Splice(srcFd, nil, pipe[1], nil, int(length-transferred), unix.SPLICE_F_MOVE)
		if err != nil {
			return err
		}
		if n == 0 {
			break // Source is exhausted.
		}
		// Drain the pipe into the destination; this may take several calls.
		for drained := int64(0); drained < n; {
			m, err := unix.Splice(pipe[0], nil, dstFd, nil, int(n-drained), unix.SPLICE_F_MOVE)
			if err != nil {
				return err
			}
			drained += m
		}
		transferred += n
	}
return nil
}
This is incredibly powerful for building proxies or data transformation pipelines where you want to minimize data touching your application logic.
Now, memory mapping and sendfile are fantastic, but they rely on the OS's page cache. Sometimes you know your access pattern is so large and sequential that the OS cache is just overhead. You might want to talk directly to the disk. This is called Direct I/O.
Opening a file with the O_DIRECT flag bypasses the OS cache. There's a major caveat: all reads and writes must be aligned in memory and on disk to specific boundaries (like 512 bytes). It's more complex but can provide predictable performance.
type DirectIO struct {
fd int
}
func NewDirectIO(path string) (*DirectIO, error) {
// Note the O_DIRECT flag.
fd, err := unix.Open(path, unix.O_RDWR|unix.O_DIRECT|unix.O_CREAT, 0644)
if err != nil {
return nil, err
}
return &DirectIO{fd: fd}, nil
}
func (dio *DirectIO) ReadAligned(offset, length int64) ([]byte, error) {
	// Direct I/O requires the offset, length, and buffer address to be
	// block-aligned. We round the length up; offset must already be aligned.
	alignedSize := ((length + 511) / 512) * 512
	// The buffer must also be aligned in memory. We'll handle that next.
	buf := make([]byte, alignedSize)
	n, err := unix.Pread(dio.fd, buf, offset)
	if err != nil {
		return nil, err
	}
	// Trim any alignment padding so the caller sees only the bytes asked for.
	if int64(n) > length {
		n = int(length)
	}
	return buf[:n], nil
}
The tricky part is memory alignment. The buffer you pass to read or write must start at a memory address that is a multiple of the disk's block size. Go's standard make doesn't guarantee this. We need a memory pool that creates aligned buffers.
import (
"sync"
"unsafe"
)
type MemoryPool struct {
pool sync.Pool
size int
}
func NewMemoryPool(bufferSize, alignment int) *MemoryPool {
return &MemoryPool{
pool: sync.Pool{
New: func() interface{} {
// Allocate extra bytes to allow for alignment shifting.
buf := make([]byte, bufferSize+alignment)
// Find the address of the first byte.
addr := uintptr(unsafe.Pointer(&buf[0]))
// Calculate the offset needed to align it.
offset := alignment - int(addr)%alignment
if offset == alignment {
offset = 0
}
// Return a slice that starts at the aligned address.
return buf[offset : offset+bufferSize]
},
},
size: bufferSize,
}
}
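Two thin accessors round out the pool. The length check on Put is a guard I'm adding so foreign buffers can't poison the pool:
// Get hands out an aligned buffer from the pool.
func (mp *MemoryPool) Get() []byte {
	return mp.pool.Get().([]byte)
}
// Put returns a buffer to the pool for reuse.
func (mp *MemoryPool) Put(buf []byte) {
	if len(buf) == mp.size {
		mp.pool.Put(buf)
	}
}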
This pool hands out byte slices whose underlying memory is correctly aligned for Direct I/O. You call Get to obtain a buffer, use it for your read or write, and Put it back. This avoids the huge cost of allocating and garbage-collecting aligned buffers repeatedly.
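Putting the pool and Direct I/O together, a write path might look like this sketch. WriteAligned is a name I'm introducing for illustration; it assumes a 512-byte block size, an already-aligned offset, and a pool whose buffer size is a multiple of 512:
// WriteAligned stages data in a pooled, aligned buffer and writes it with pwrite.
func (dio *DirectIO) WriteAligned(pool *MemoryPool, offset int64, data []byte) error {
	buf := pool.Get()
	defer pool.Put(buf)
	if len(data) > len(buf) {
		return fmt.Errorf("data larger than pool buffer")
	}
	n := copy(buf, data)
	// Pad the write length up to the next 512-byte boundary.
	aligned := ((n + 511) / 512) * 512
	if aligned > len(buf) {
		return fmt.Errorf("pool buffer size must be a multiple of 512")
	}
	_, err := unix.Pwrite(dio.fd, buf[:aligned], offset)
	return err
}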
When you have a very large file, you can process it faster by working on multiple parts at once. Let's combine memory mapping with parallel processing. We'll create a batch processor that divides the file into chunks and processes them concurrently.
type BatchProcessor struct {
workers int
workQueue chan *BatchJob
results chan *BatchResult
}
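ProcessFile below refers to a job type and a result type, and the channels have to be created somewhere. Minimal definitions might look like this:
// BatchJob describes one chunk of the mapped file.
type BatchJob struct {
	offset int64
	length int64
}
// BatchResult reports the outcome of a single job.
type BatchResult struct {
	err error
}
// NewBatchProcessor sizes both channels so senders rarely block.
func NewBatchProcessor(workers int) *BatchProcessor {
	return &BatchProcessor{
		workers:   workers,
		workQueue: make(chan *BatchJob, workers*2),
		results:   make(chan *BatchResult, workers*2),
	}
}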
func (bp *BatchProcessor) ProcessFile(file *ZeroCopyFile, chunkSize int64,
processor func([]byte) []byte) error {
var wg sync.WaitGroup
// Start worker goroutines.
for i := 0; i < bp.workers; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for job := range bp.workQueue {
// Workers read directly from mapped memory.
				data, err := file.ReadAt(job.offset, job.length)
				if err != nil {
					// Report the error but keep draining jobs so the
					// sender goroutine never blocks forever.
					bp.results <- &BatchResult{err: err}
					continue
				}
				processed := processor(data)
				if err := file.WriteAt(job.offset, processed); err != nil {
					bp.results <- &BatchResult{err: err}
					continue
				}
bp.results <- &BatchResult{err: nil}
}
}()
}
// Send jobs.
go func() {
for offset := int64(0); offset < file.size; offset += chunkSize {
length := chunkSize
if offset+length > file.size {
length = file.size - offset
}
bp.workQueue <- &BatchJob{offset: offset, length: length}
}
close(bp.workQueue)
}()
// Wait and collect results.
go func() {
wg.Wait()
close(bp.results)
}()
	// Drain every result so worker sends never block, and return the first error.
	var firstErr error
	for result := range bp.results {
		if result.err != nil && firstErr == nil {
			firstErr = result.err
		}
	}
	return firstErr
}
This pattern is excellent for tasks like checksum calculation, compression, or search across a large file. Each worker operates on its own segment of the mapped memory, and there's no contention for locks because they work on non-overlapping regions.
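As a quick illustration, an in-place transform over a mapped file could be kicked off like this (a sketch; it assumes the bytes package is imported and zcf is the mapped ZeroCopyFile from earlier):
bp := NewBatchProcessor(8)
err := bp.ProcessFile(zcf, 1<<20, func(chunk []byte) []byte {
	// Uppercase each 1 MiB chunk; any length-preserving transform works here.
	return bytes.ToUpper(chunk)
})
if err != nil {
	log.Fatal(err)
}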
Let's tie it all together in a main function that shows a realistic use case: reading a large file and sending it over the network using zero-copy techniques.
import (
	"net"
	"syscall"
)
func main() {
// 1. Memory-map the source file.
srcFile, err := NewZeroCopyFile("large_video.mkv")
if err != nil {
log.Fatal(err)
}
	if err := srcFile.MapMemory(); err != nil {
		log.Fatal(err)
	}
	defer srcFile.UnmapMemory()
// 2. Establish a network connection.
conn, err := net.Dial("tcp", "client.example.com:9000")
if err != nil {
log.Fatal(err)
}
defer conn.Close()
// 3. Get the raw socket file descriptor.
sc, ok := conn.(syscall.Conn)
if !ok {
log.Fatal("connection type doesn't allow raw fd access")
}
rc, err := sc.SyscallConn()
if err != nil {
log.Fatal(err)
}
	var connFd int
	if err := rc.Control(func(fd uintptr) { connFd = int(fd) }); err != nil {
		log.Fatal(err)
	}
// 4. Perform the zero-copy transfer.
transfer := &ZeroCopyNetwork{}
start := time.Now()
if err := transfer.SendFile(connFd, srcFile.fd, 0, srcFile.size); err != nil {
log.Fatal(err)
}
duration := time.Since(start)
stats := transfer.GetStats()
fmt.Printf("Sent %d bytes in %v\n", stats.bytesTransferred, duration)
fmt.Printf("Rate: %.2f MB/s\n",
float64(stats.bytesTransferred)/duration.Seconds()/1024/1024)
}
When you run this, you'll see CPU usage for your program is minimal. The system's sendfile call is doing the heavy lifting, moving data from the disk cache straight to the network card.
These techniques come with responsibilities. You must handle errors carefully, especially when unmapping memory. You need to be aware of system limits on memory map size. Not every filesystem supports O_DIRECT, and sendfile and splice are not available on every operating system; splice in particular is Linux-specific. Always have a fallback to standard I/O.
For concurrent access to a memory-mapped file from multiple goroutines, you need synchronization if they write to overlapping regions. Use sync.RWMutex or channel coordination. Remember, you are working directly with memory that is backed by the file; a stray write can corrupt it on disk.
I find these methods most rewarding when applied to problems where data is king. They transform a sluggish, resource-hungry process into a lean, fast operation. The code may be a bit more involved, requiring you to think about file descriptors and memory alignment, but the performance gains are real and measurable. Your application spends less time moving data and more time doing useful work, which is, after all, the point.