Drop Traits: The Day We Stopped Restarting Pods Every 8 Hours
Or: how we learned that “eventually” isn’t good enough when you’re bleeding file descriptors
Deterministic cleanup means knowing exactly when resources are freed — the difference between memory chaos and predictable system behavior in production environments.
So our video transcoding service was… how do I put this delicately… a complete disaster.
Not in the “everything’s on fire” way. More like the “slow leak that nobody wants to admit is a real problem” way. We were processing 2.4 million videos daily, which sounds impressive until you realize we had to restart every single pod every 8 hours or it would just… die.
Memory would start at a reasonable 2GB per pod. Then climb. And climb. And by hour 7, we’d be sitting at 14GB and sweating, watching the graphs, waiting for the OOM killer to show up like an unwelcome dinner guest.
The numbers were absolutely brutal:
- Monthly infrastructure costs: $83,000 (ouch)
- Memory-related incidents: 47 per month (that’s more than one per day)
- Engineer hours spent firefighting: 120 hours (three full-time weeks!)
- Sleep quality: terrible (not officially tracked but definitely real)
We tried everything. Profile-guided optimization? Check. Custom memory pools? Built those. Aggressive GC tuning? Oh god, so much tuning. We had engineers who could recite Go GC parameters in their sleep.
Nothing worked consistently.
And then — okay, this is where it gets interesting — we realized the garbage collector wasn’t solving our problem. It was hiding it.
The Thing About “Eventually”
Here’s what I didn’t understand about garbage collection until it bit us. In GC languages, cleanup happens “eventually.” Which sounds fine in theory:
- File handles close… when the GC runs
- Network connections terminate… during collection cycles
- Memory returns to the pool… when there’s pressure
This abstraction is actually really powerful! Until it’s catastrophic. Which, in our case, it very much was.
Our video pipeline was handling temporary files, FFmpeg processes, TCP connections to S3. Pretty standard stuff. In Go, we were doing what everyone does — defer and finalizers:
```go
func processVideo(path string) error {
	file, err := os.Open(path) // open the file
	if err != nil {
		return err // bail if it fails
	}
	defer file.Close() // runs when this function returns, not before
	// Process video for ~30 seconds
	return nil
}
```
Looks totally fine, right? This is idiomatic Go. This is what you’re supposed to do.
The problem — and oh boy was this a problem — is that defer doesn't mean "clean this up right now." It means "clean this up when this function returns," which can be long after you're actually done with the resource. And anything you forget to Close explicitly falls back to a runtime finalizer, which only runs when the GC gets around to it.
Under heavy load? File descriptors just… accumulated. Like plaque. We’d hit the system limit of 65,536 file descriptors and crash with “too many open files” errors while still having 6GB of free memory.
The GC would be sitting there like “memory looks fine to me!” while we’re drowning in open file handles.
Here’s what killed us:
Go metrics that made us cry:
- File descriptor leaks: 2,300 per hour during peak (that’s 38 per minute!)
- Average cleanup delay: 14.7 seconds (an eternity in computer time)
- Memory high-water mark: 14.2GB per pod (why?!)
- OOM incidents: 47 per month
- Process restarts: 91 per day (three or four every hour)
- Monthly cost: $83,000 (we could hire another engineer for this)
When We Discovered Deterministic Cleanup (Finally)
So we rewrote the critical path in Rust. And I know what you’re thinking — “oh great, another ‘Rust is faster’ story.” But that’s not what happened. Not really.
Rust wasn’t faster in the benchmark sense. It was predictable. And predictability, it turns out, is way more valuable than raw speed when you’re trying to sleep at night.
The Drop trait in Rust guarantees — guarantees — that cleanup happens at a precise moment. Not “eventually.” Not “when the GC feels like it.” Right when the value goes out of scope. Period.
```rust
use std::fs::File;
use std::io::Error;
use std::path::{Path, PathBuf};

struct VideoFile {
    handle: File,  // the actual file handle
    path: PathBuf, // where it lives on disk
}

impl Drop for VideoFile {
    fn drop(&mut self) {
        // THIS RUNS IMMEDIATELY when VideoFile goes out of scope.
        // Not later. Not when GC runs. RIGHT NOW.
        println!("Closing: {:?}", self.path); // log it
        // self.handle is dropped right after this body, closing the file
    }
}

fn process_video(path: &Path) -> Result<(), Error> {
    let _video = VideoFile {
        handle: File::open(path)?, // open the file
        path: path.to_path_buf(),  // store the path
    };
    // Process the video for ~30 seconds
    // Drop runs EXACTLY HERE when _video goes out of scope:
    // no waiting, no GC, just immediate cleanup
    Ok(())
}
```
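You can watch this guarantee happen. Here's a minimal, self-contained sketch (the `Guard` type and `drop_order` helper are hypothetical, for illustration only): locals drop at the end of their scope, in reverse declaration order, every single time.

```rust
use std::cell::RefCell;

struct Guard<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Drop for Guard<'a> {
    fn drop(&mut self) {
        // Runs at the exact moment the guard leaves scope, not "eventually"
        self.log.borrow_mut().push(self.name);
    }
}

fn drop_order() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _a = Guard { name: "a", log: &log };
        let _b = Guard { name: "b", log: &log };
        // Inner scope ends here: locals drop in reverse declaration order
    }
    // Both Drop impls have already run by this point
    log.into_inner()
}

fn main() {
    assert_eq!(drop_order(), vec!["b", "a"]);
    println!("dropped: b, then a");
}
```

The drop order isn't a convention you hope the runtime follows; it's part of the language semantics, which is exactly why you can build resource accounting on top of it.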
The impact was… I mean, look at these numbers:
Rust metrics that made us believers:
- File descriptor leaks: 12 per hour (down from 2,300, a 99.5% reduction, holy shit)
- Average cleanup time: under 100 microseconds (not 14 seconds!)
- Memory high-water mark: 3.1GB per pod (78% reduction!)
- OOM incidents: 0 per month (ZERO)
- Process restarts: 0 per day (ZERO)
- Monthly cost: $22,000 (73% reduction, $61K savings)
Why “When” Matters More Than “How Fast”
The revelation — and this took me embarrassingly long to understand — wasn’t that Rust was faster at cleanup. It’s that knowing when cleanup happens is more valuable than how quickly it happens.
Think about our file handle lifecycle. In Go, it looked like this:
- Open file (happens immediately, good)
- Use file (totally predictable, fine)
- Defer close (scheduled for later, okay I guess)
- Wait for GC cycle (wait, how long?)
- Finalizer runs (seriously, when though?)
- Resource freed (eventually? maybe?)
Compare that to Rust:
- Open file (immediate)
- Use file (predictable)
- Drop runs (scope ends, cleanup happens NOW)
- Resource freed (immediate)
This difference compounded across 2.4 million videos per day. With Go, file descriptor usage was probabilistic. We had to overprovision by 4x to handle worst-case scenarios. Like, we needed capacity for “what if the GC doesn’t run for a while?” scenarios.
With Rust? Resource usage became a simple function: concurrent videos being processed × resources per video. That’s it. No probability distributions. No worst-case scenarios. Just math.
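That capacity math can be written down directly. A sketch with hypothetical numbers (the per-job descriptor count is whatever your pipeline actually opens):

```rust
// Hypothetical figures for illustration: say each transcode job holds
// 4 descriptors (input file, output file, two FFmpeg pipe ends)
fn fds_in_use(concurrent_jobs: u64, fds_per_job: u64) -> u64 {
    // With deterministic cleanup this is exact, not a worst case
    concurrent_jobs * fds_per_job
}

fn main() {
    // 200 concurrent transcodes x 4 descriptors = exactly 800 in use,
    // comfortably under a 65,536 limit, with zero GC-timing slack
    assert_eq!(fds_in_use(200, 4), 800);
}
```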
How We Actually Built This Thing
Okay so here’s how we restructured everything around Drop semantics. And honestly, once you get the pattern, it’s kind of beautiful.
Pattern 1: Scoped Resource Lifetimes
We wrapped every external resource in a struct with Drop:
```rust
use std::fs;
use std::process::Child;
use std::time::Duration;
use tempfile::TempDir; // from the `tempfile` crate

struct TempWorkspace {
    dir: TempDir,  // temporary directory handle
    max_size: u64, // size limit for safety
}

impl Drop for TempWorkspace {
    fn drop(&mut self) {
        // This ALWAYS runs when TempWorkspace goes out of scope.
        // Even if there's a panic. Even if there's an error. ALWAYS.
        let _ = fs::remove_dir_all(self.dir.path()); // nuke the temp dir
        // errors are ignored: we're in Drop, there's nothing left to do
    }
}

struct FFmpegProcess {
    child: Child,      // the spawned process
    timeout: Duration, // how long to wait before killing
}

impl Drop for FFmpegProcess {
    fn drop(&mut self) {
        // Force-kill any hung process: no zombie FFmpeg left behind
        let _ = self.child.kill(); // ignore error if already dead
    }
}
```
This architecture eliminated an entire class of bugs. Like, just made them impossible.
Before, temp files would accumulate during high-load periods. We had cron jobs running every hour to manually clean them up. Cron jobs! For cleanup! In 2023!
With Drop? Automatic. Immediate. Our temp disk usage went from 847GB to 23GB. Just from this one change.
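The "even if there's a panic" part is worth proving to yourself. A small sketch (the `Workspace` type and `cleanup_ran_after_panic` helper are made up for illustration): Drop runs during stack unwinding, so cleanup survives a panic.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Set to true by Drop so we can observe that cleanup actually ran
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

struct Workspace;

impl Drop for Workspace {
    fn drop(&mut self) {
        // Runs during unwinding too, so cleanup survives a panic
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn cleanup_ran_after_panic() -> bool {
    CLEANED_UP.store(false, Ordering::SeqCst);
    let result = panic::catch_unwind(|| {
        let _ws = Workspace;
        panic!("simulated processing failure");
        // _ws is still dropped while the stack unwinds
    });
    assert!(result.is_err()); // the panic really happened
    CLEANED_UP.load(Ordering::SeqCst)
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // silence the panic message
    assert!(cleanup_ran_after_panic());
    println!("Drop ran during unwinding");
}
```

This is why the hourly cleanup cron job became unnecessary: there is no code path, including crashes within a request, where the workspace outlives its scope. (The one caveat: Drop does not run if the process is killed outright, e.g. SIGKILL.)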
Pattern 2: RAII for Network Connections
RAII (Resource Acquisition Is Initialization) became our pattern for everything I/O related:
```rust
struct S3Connection {
    client: S3Client,   // the actual S3 client (project type)
    bucket: String,     // which bucket we're using
    session_id: String, // for tracking/metrics
}

impl Drop for S3Connection {
    fn drop(&mut self) {
        // Log when we're done with this connection:
        // perfect for metrics and monitoring
        metrics::record_session_end(&self.session_id);
        // client closes automatically when it's dropped after this
    }
}
```
This gave us perfect connection accounting. At any moment, we knew exactly how many active S3 sessions existed. Not “approximately.” Exactly.
Before, with Go’s deferred cleanup, our monitoring showed “ghost” connections. Connections that were closed but… not really? Still consuming resources somewhere in limbo, waiting for the GC to notice them.
With Drop? No ghosts. Just real connections that existed, and then didn’t.
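A minimal sketch of that exact accounting (the `SessionGuard` name is hypothetical): pair an atomic increment in the constructor with a decrement in Drop, and the counter can never drift, because every acquire has a guaranteed matching release.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One process-wide counter of live sessions
static ACTIVE_SESSIONS: AtomicUsize = AtomicUsize::new(0);

struct SessionGuard;

impl SessionGuard {
    fn open() -> Self {
        // Acquire: count goes up the moment the session exists
        ACTIVE_SESSIONS.fetch_add(1, Ordering::SeqCst);
        SessionGuard
    }
}

impl Drop for SessionGuard {
    fn drop(&mut self) {
        // Release: count goes down the moment the session ends
        ACTIVE_SESSIONS.fetch_sub(1, Ordering::SeqCst);
    }
}

fn active_sessions() -> usize {
    ACTIVE_SESSIONS.load(Ordering::SeqCst)
}

fn main() {
    let a = SessionGuard::open();
    {
        let _b = SessionGuard::open();
        assert_eq!(active_sessions(), 2); // both alive
    } // _b dropped here, counter decremented immediately
    assert_eq!(active_sessions(), 1);
    drop(a); // explicit early drop works too
    assert_eq!(active_sessions(), 0);
    println!("no ghost sessions");
}
```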
Pattern 3: Memory-Mapped Files (The Big One)
For large video files — anything over 2GB — we used memory-mapped I/O. And this is where Drop really shined:
```rust
use memmap2::MmapMut; // from the `memmap2` crate

struct MappedVideo {
    mmap: MmapMut, // the memory-mapped region
    size: usize,   // total size in bytes
}

impl Drop for MappedVideo {
    fn drop(&mut self) {
        // Guaranteed unmap: virtual pages go back to the OS immediately,
        // no waiting for a GC to decide memory pressure is high enough
        println!("Unmapping {}MB", self.size / 1_048_576); // bytes -> MB
        // mmap unmaps itself when it's dropped after this body runs
    }
}
```
Memory-mapped regions were our biggest leak source in Go. Here’s why: the Go GC looks at heap pressure. But mmapped memory isn’t on the heap — it’s virtual address space. So the GC would think “we have 10GB free on the heap, no need to collect!” while our virtual memory usage climbed to 18GB and the OOM killer was warming up.
With Drop, every mmap had a guaranteed munmap. Our virtual memory usage stabilized at 4.2GB instead of playing memory chicken with the kernel.
Simplified resource management through deterministic cleanup — fewer steps, predictable behavior, and guaranteed resource reclamation.
The Performance Cascade (Unexpected Bonuses)
Here’s what’s wild — deterministic cleanup created performance improvements we didn’t even anticipate:
1. Predictable Latency (The Big Surprise)
Our P99 latency dropped from 847ms to 34ms. But that’s not even the crazy part. The entire distribution tightened. Standard deviation went from 203ms to 8ms.
No more GC pauses. No more “well, sometimes it’s fast…” conversations with product managers.
2. Better Resource Utilization (Finally Using What We Paid For)
We reduced pod count from 42 to 11. Forty-two to eleven. Because with predictable memory usage, we didn’t need headroom for “what if the GC doesn’t run” scenarios.
CPU utilization increased from 38% to 67%. We were actually using the resources we paid for instead of keeping them idle for hypothetical GC spikes.
3. Simplified Monitoring (My Favorite Part)
Our alerting rules became trivial.
Before: “Alert if memory trend suggests OOM in 20 minutes based on polynomial regression of the last 6 data points weighted by time of day and…”
After: “Alert if memory exceeds 4GB.”
That’s it. One line. The predictability eliminated an entire category of complex anomaly detection that we’d spent months tuning.
The Real Cost (Let’s Be Honest)
Deterministic cleanup isn’t free. Nothing’s free. Here’s what we gave up:
What we lost:
- Developer ergonomics (lifetime annotations everywhere in complex scenarios)
- Rapid prototyping (steeper learning curve, especially for junior engineers)
- Dynamic flexibility (can’t just hold references past scope without Arc/Rc)
- Legacy integration (we rewrote 47,000 lines of Go over 4 months)
What we gained:
- Zero memory leaks (from 47 incidents/month to 0 — ZERO)
- Predictable performance (eliminated those 200+ms GC pauses)
- Lower costs ($61,000/month savings, that’s real money)
- Engineer confidence (no more 3am pages about memory)
The team adjustment was real. Three engineers needed 6 weeks to become productive with Rust’s ownership system. And I’m not gonna lie — those 6 weeks were rough. Lots of fighting with the borrow checker. Lots of “why can’t I just do this simple thing” moments.
But after that initial investment? Velocity increased. Features that previously required careful memory profiling and testing just… worked. First try. No leaks. No issues.
One engineer told me: “I used to spend 30% of my time tracking down memory issues. Now I spend 0%.”
When Should You Actually Do This?
After running this system for 14 months in production, here’s my decision framework:
Choose Rust’s Drop when:
- Resource leaks cause production incidents (we had 47 per month)
- You’re managing system resources (files, sockets, memory-mapped regions)
- Latency variance matters more than raw throughput (for us it did)
- GC pauses disrupt critical paths (those 200ms pauses hurt)
- Memory footprint directly impacts costs ($61K/month impact for us)
- You need to know exactly when cleanup happens (not eventually)
Stay with GC languages when:
- Development velocity is paramount (prototyping phase, MVP)
- Resource leaks are acceptable edge cases (they’re not for everyone)
- Team lacks systems programming experience (real consideration)
- Cleanup timing doesn’t affect behavior (rare but it happens)
- Memory is abundant and cheap (not our situation)
- You can overprovision by 3–4x without caring (we couldn’t)
Eighteen Months Later
Our video service now processes 3.2 million files daily. That’s 33% growth. On the same infrastructure we were struggling with before.
Memory incidents in the past year: zero. Engineer time spent on memory issues: 4 hours total (mostly debugging one weird edge case). Infrastructure cost: still $22K/month instead of $83K.
The Drop trait didn’t just fix our memory leaks. It changed how we think about resources. Every struct becomes a contract: acquire on creation, cleanup on destruction. No timing uncertainty. No GC pressure. No praying to the runtime gods.
Just deterministic, predictable behavior.
I sleep through the night now. No memory-related pages. No panicked Slack messages at 2am. No watching graphs climb and hoping the GC runs before the OOM killer shows up.
The Real Lesson
Here’s what I learned: automatic cleanup isn’t always better than deterministic cleanup.
GC makes resource management invisible. Which is great! Until it’s not. Until you need to know when that file closes. Until you need to predict memory usage. Until you’re bleeding file descriptors and the GC is like “everything looks fine from here.”
Rust makes it explicit. Gives you control. You see exactly when cleanup happens because it’s tied to scope. No magic. No runtime surprises.
Our infrastructure costs dropped by $61,000 per month. But you know what? The real win was sleeping through the night. The real win was junior engineers shipping features without creating memory leaks. The real win was predictability.
Sometimes the best optimization is knowing exactly when your resources are freed.
Not “eventually.”
Now.