aykhlf yassir
Python Internals: Reference Cycles and Garbage Collection

Understanding reference counting, garbage collection, and why your objects won't die


The Box vs. The Label

Quick: what does this code do?

a = [1, 2, 3]
b = a
b.append(4)
print(a)  # What prints?

If you're coming from C++ or Java, you might think "I assigned a to b, so they're separate copies. a should still be [1, 2, 3]."

But run this code and you'll see [1, 2, 3, 4].

Why? Because your mental model is wrong.

Variables Are Not Boxes

In C++, a variable is like a box with a name written on it. When you write int x = 5;, you create a box labeled "x" and put the value 5 inside it. When you assign int y = x;, you create a new box labeled "y" and copy the value into it.

Python doesn't work this way.

In Python, objects live in the heap. Variables are not boxes that contain objects; they're labels attached to objects.

a = [1, 2, 3]

What actually happens:

  1. Python creates a list object [1, 2, 3] somewhere in memory
  2. Python creates a label (reference) named a
  3. Python sticks that label onto the object

When you write:

b = a

You're not copying the object. You're creating a second label b and sticking it onto the same object that a is pointing to.

    [1, 2, 3, 4] ← Object in memory
      ↑     ↑
      a     b   ← Two labels on one object

This is why modifying through b affects what you see through a: both labels are stuck to the same object.

Proving It With id()

Every object in Python has a unique identifier, which in CPython is its memory address:

a = [1, 2, 3]
b = a

print(id(a))  # 140234567890
print(id(b))  # 140234567890 - Same address!
print(a is b) # True - Same object

c = [1, 2, 3]
print(id(c))  # 140234567999 - Different address!
print(a is c) # False - Different objects
print(a == c) # True - Same contents

The is operator checks if two labels point to the same object, not whether the objects have the same value.
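If you want two independent lists, make the copy explicit, for example with list() or a slice. Note that both produce shallow copies, so nested objects are still shared:

```python
a = [1, 2, 3]
b = list(a)      # shallow copy: a new list object holding the same elements
b.append(4)

print(a)         # [1, 2, 3] - unchanged this time
print(b)         # [1, 2, 3, 4]
print(a is b)    # False - two distinct objects
print(a == b)    # False - and now different contents too
```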

The Immutability Exception

This model explains why immutable types seem to behave differently:

x = 5
y = x
y = 10
print(x)  # Still 5!

Did the label model break? No! When you write y = 10, you're not modifying the object; you're unsticking the label y from the object 5 and re-sticking it onto a different object, 10. The original object 5 is unchanged, and x still points to it.

Before y = 10:
    5 ← Object
    ↑
   x,y

After y = 10:
    5 ← Object    10 ← New object
    ↑              ↑
    x              y

Immutable objects can't be modified, so every operation that looks like modification is actually creating a new object and moving the label.
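You can watch the label move with id():

```python
x = 5
before = id(x)

x += 1           # looks like in-place modification, but binds x to a different int object
after = id(x)

print(x)               # 6
print(before != after) # True - the label x now points at a different object
```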


Mutable Default Arguments

Now that you understand the label model, let's look at Python's most infamous gotcha:

def add_passenger(passenger, manifest=[]):
    manifest.append(passenger)
    return manifest

flight1 = add_passenger("Alice")
print(flight1)  # ['Alice']

flight2 = add_passenger("Bob")
print(flight2)  # ['Alice', 'Bob'] - WHAT?!

Why is Alice on Bob's flight?

The Hidden Function Object

When Python executes a def statement, it doesn't just "define a function and forget it." It creates a function object that lives in memory. Default argument values are created once when the function is defined and stored inside that function object.

Let's prove it:

def add_passenger(passenger, manifest=[]):
    manifest.append(passenger)
    return manifest

# The function object has a __defaults__ attribute
print(add_passenger.__defaults__)  # ([],)

# Call it once
add_passenger("Alice")
print(add_passenger.__defaults__)  # (['Alice'],)

# Call it again
add_passenger("Bob")
print(add_passenger.__defaults__)  # (['Alice', 'Bob'],)

The default list [] is created once, when def executes. Every call to the function that doesn't provide a manifest argument reuses that same list object.

The Execution Timeline

# DEFINITION TIME (happens once)
def add_passenger(passenger, manifest=[]):  # Create empty list, store in __defaults__
    manifest.append(passenger)
    return manifest

# CALL TIME 1
add_passenger("Alice")  # No manifest provided, use default (the list in __defaults__)
# The list is now ['Alice']

# CALL TIME 2  
add_passenger("Bob")    # No manifest provided, use default (SAME list!)
# The list is now ['Alice', 'Bob']

The Idiomatic Fix

Never use mutable objects as default arguments. Use None as a sentinel:

def add_passenger(passenger, manifest=None):
    if manifest is None:
        manifest = []  # Create a NEW list for each call
    manifest.append(passenger)
    return manifest

flight1 = add_passenger("Alice")
print(flight1)  # ['Alice']

flight2 = add_passenger("Bob")
print(flight2)  # ['Bob'] - Separate list!

Now each call that doesn't provide manifest creates a fresh, independent list.

When to Use Mutable Defaults (Intentionally)

There are rare cases where you want to preserve state across calls:

def expensive_computation(key):
    return key * key  # stand-in for real work

def cache_result(key, cache={}):
    if key not in cache:
        print(f"Computing {key}...")
        cache[key] = expensive_computation(key)
    return cache[key]

cache_result(5)  # Computing 5...
cache_result(5)  # (no output - cached!)

But this is almost always better expressed with a class or explicit cache variable.
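For instance, the same memoization reads more clearly as a small class holding its own state. This is a sketch: the Cache class and the squaring lambda are illustrative stand-ins, and in practice the standard library's functools.lru_cache covers most of these cases:

```python
class Cache:
    """Explicit cache state - no hidden list living in __defaults__."""

    def __init__(self, compute):
        self._compute = compute   # the function whose results we memoize
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._compute(key)
        return self._store[key]

cache = Cache(lambda k: k * k)  # stand-in for an expensive computation
print(cache.get(5))  # 25 - computed
print(cache.get(5))  # 25 - served from the stored result
```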


The Life and Death of an Object: Reference Counting

So we know objects are created in memory and variables are labels pointing to them. But when does an object die?

The Reference Counter

Every Python object has a hidden field called ob_refcnt, a counter tracking how many labels are currently stuck to it.

typedef struct {
    Py_ssize_t ob_refcnt;   // Reference count
    PyTypeObject *ob_type;  // Type
    // ... actual data
} PyObject;

The rules are simple:

  • Create a label → Count +1
  • Remove a label → Count -1
  • Count reaches 0 → Immediate death

Let's trace the life cycle:

import sys

x = object()  # Create object, refcount = 1
print(sys.getrefcount(x))  # 2 (why? read on...)

y = x  # New label, refcount = 2 (+ 1 from getrefcount call = 3)
print(sys.getrefcount(x))  # 3

del y  # Remove label, refcount = 2
print(sys.getrefcount(x))  # 2

del x  # Last label removed, refcount hits 0
# Object is immediately destroyed

The sys.getrefcount() Trap

Notice the reference count is always higher than expected? This is because calling sys.getrefcount(x) creates a temporary reference when passing x as an argument!

x = []
# Actual refs: x
print(sys.getrefcount(x))  # 2
# Why 2? Because during the call, refs are: x, argument to getrefcount()

The real count is always getrefcount(x) - 1.

What Creates References?

Understanding what increases the reference count is critical:

import sys

obj = object()
print(sys.getrefcount(obj))  # 2 (variable + function arg)

# Assignment creates a reference
another = obj
print(sys.getrefcount(obj))  # 3

# Lists/tuples/dicts create references
lst = [obj]
print(sys.getrefcount(obj))  # 4

# Function arguments create temporary references
def foo(x):
    print(sys.getrefcount(x))  # +1 from function call

foo(obj)  # Will print 5

# Deleting removes references
del another
del lst
print(sys.getrefcount(obj))  # Back to 2

Immediate Destruction

The beauty of reference counting is deterministic cleanup. The moment the last reference disappears, the object dies:

class Mortal:
    def __init__(self, name):
        self.name = name
        print(f"{name} is born")

    def __del__(self):
        # Called when refcount hits 0
        print(f"{self.name} dies")

print("Creating object...")
x = Mortal("Socrates")  # Socrates is born

print("Deleting reference...")
del x  # Socrates dies - immediately!

print("After deletion")

This is why Python doesn't need explicit free() or delete calls like C/C++: objects clean up automatically when they're no longer needed. (Strictly speaking, this immediacy is a CPython implementation detail; other interpreters such as PyPy use a tracing collector and may run __del__ later.)


The Reference Cycles Problem

Reference counting sounds perfect and intuitive. But it has a fatal weakness:

class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        print(f"Deleting {self.name}")

# Create two nodes
alice = Node("Alice")
bob = Node("Bob")

# Create a circular reference
alice.partner = bob
bob.partner = alice

# Delete our references
del alice
del bob

# ... crickets ...
# __del__ is never called!

What happened? Let's trace the references:

Before del:
    Node("Alice") ← alice variable
          ↕ partner
    Node("Bob") ← bob variable

After del alice, del bob:
    Node("Alice") 
          ↕ partner
    Node("Bob")

Even after we delete the variables alice and bob, the Node objects still hold references to each other. Their reference counts are still 1 (from the partner attribute), so they never get destroyed.

These objects are now unreachable from our code (we have no variables pointing to them), but they're still in memory, causing a memory leak.
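We can make the leak measurable with a class-level counter; the cycle collector (discussed next) is disabled here so that only reference counting is in play. The Node.alive tally is an illustrative addition, not part of the original example:

```python
import gc

class Node:
    alive = 0                      # class-level tally of live instances

    def __init__(self):
        Node.alive += 1
        self.partner = None

    def __del__(self):
        Node.alive -= 1

gc.disable()                       # rule out the cycle collector for now
a, b = Node(), Node()
a.partner, b.partner = b, a        # the cycle
del a, b

print(Node.alive)  # 2 - both objects leaked: no variable reaches them
gc.enable()
```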

The Real-World Impact

This isn't just theoretical. Reference cycles are common:

# Parent-child relationships
class Parent:
    def __init__(self):
        self.children = []

class Child:
    def __init__(self, parent):
        self.parent = parent
        parent.children.append(self)

# Event listeners
class Button:
    def __init__(self):
        self.click_handler = None

    def on_click(self, handler):
        self.click_handler = handler

# Handler holds button, button holds handler
button = Button()
button.on_click(lambda: print(f"Clicked {button}"))

# Data structures
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.parent = None
        self.left = None
        self.right = None

root = TreeNode(1)
root.left = TreeNode(2)
root.left.parent = root  # Cycle!
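One way to break the parent-link cycle, previewing the weak references covered later in this article: hold the parent weakly, so a child never keeps its parent alive. This TreeNode variant is a sketch, not the only possible design:

```python
import weakref

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self._parent = None              # weakref.ref, or None for the root

    @property
    def parent(self):
        # Dereference the weak ref; returns None if the parent is gone
        return self._parent() if self._parent is not None else None

    @parent.setter
    def parent(self, node):
        self._parent = weakref.ref(node)

root = TreeNode(1)
root.left = TreeNode(2)
root.left.parent = root      # stored weakly: no reference cycle
print(root.left.parent.value)  # 1
```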

The Garbage Collector

Python's solution is the Generational Garbage Collector, which exists specifically to find and destroy reference cycles.

How It Works

Conceptually, the GC periodically:

  1. Scans the container objects it tracks (lists, dicts, class instances - anything that can hold references)
  2. Temporarily subtracts every reference that comes from another tracked object
  3. Objects whose count drops to zero are kept alive only by internal references
  4. Anything still reachable from outside the group is rescued, along with everything it references
  5. Whatever remains is an unreachable cycle, and the entire cycle is destroyed

You can trigger it manually:

import gc

class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        print(f"Deleting {self.name}")

alice = Node("Alice")
bob = Node("Bob")
alice.partner = bob
bob.partner = alice

del alice, bob
print("After del...")

# Force garbage collection
gc.collect()  # Deleting Alice
              # Deleting Bob
print("After gc.collect()")

Generational Hypothesis

The GC is "generational" because it's based on an empirical observation: most objects die young.

Python divides objects into three generations:

  • Generation 0: New objects (checked frequently)
  • Generation 1: Objects that survived one GC pass (checked less often)
  • Generation 2: Long-lived objects (checked rarely)

import gc

# See the current thresholds
print(gc.get_threshold())  # (700, 10, 10)
# Meaning: Run gen0 after 700 allocations,
#          Run gen1 after 10 gen0 collections,
#          Run gen2 after 10 gen1 collections

# See collection stats
print(gc.get_count())  # (453, 5, 2)
# Current counts in each generation
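The thresholds are also writable via gc.set_threshold, for example to trade memory for fewer collection pauses. The numbers below are purely illustrative, not a tuning recommendation:

```python
import gc

old = gc.get_threshold()            # remember the defaults

# Raise the gen0 threshold so the collector runs less often
gc.set_threshold(50_000, 20, 20)    # illustrative values only
print(gc.get_threshold())           # (50000, 20, 20)

gc.set_threshold(*old)              # restore the original settings
```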

The Cost of Cycles

The garbage collector isn't free. It has to:

  • Pause your program to scan objects
  • Build reference graphs
  • Traverse cycles

This is why avoiding cycles improves performance:

# Slower: Creates cycles
class Child:
    def __init__(self, parent):
        self.parent = parent
        parent.children.append(self)

# Faster: No cycles
class Child:
    def __init__(self, parent_name):
        self.parent_name = parent_name  # Store name, not reference

Weak References

Imagine building a cache:

cache = {}

def get_user(user_id):
    if user_id not in cache:
        cache[user_id] = load_from_database(user_id)
    return cache[user_id]

Problem: objects in cache never die. Even if nobody else needs them, the cache keeps them alive. This is a memory leak.

Strong vs. Weak References

Think of references like this:

Strong Reference (normal):

  • Like holding a dog's leash
  • The dog cannot leave while you hold the leash
  • The dog exists because you're holding it

Weak Reference:

  • Like pointing at a dog
  • You don't prevent the dog from leaving
  • If the owner (strong reference) leaves, the dog disappears
  • You're now pointing at nothing (None)

The weakref Module

import weakref

class HeavyObject:
    def __init__(self, data):
        self.data = data

    def __del__(self):
        print(f"Deleting {self.data}")

# Strong reference
obj = HeavyObject("important")
print(obj.data)  # "important"

# Weak reference
weak = weakref.ref(obj)
print(weak())  # <__main__.HeavyObject object> - obj still alive

# Delete strong reference
del obj  # Deleting important

# Weak reference now points to nothing
print(weak())  # None

The weak reference doesn't count toward ob_refcnt. When the last strong reference dies, the object is destroyed, even if weak references still exist.
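weakref.ref also accepts an optional callback, invoked when the referent is finalized, which is handy for cleanup bookkeeping. A minimal sketch; Resource and on_death are illustrative names:

```python
import weakref

class Resource:
    pass

deaths = []

def on_death(ref):
    # Called when the referent is being finalized;
    # receives the (now dead) weak reference itself
    deaths.append("gone")

r = Resource()
ref = weakref.ref(r, on_death)

del r              # last strong reference removed -> callback fires
print(deaths)      # ['gone']
print(ref())       # None
```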

The Optimized Cache: WeakValueDictionary

import weakref

cache = weakref.WeakValueDictionary()

class User:
    def __init__(self, user_id, name):
        self.user_id = user_id
        self.name = name

    def __del__(self):
        print(f"User {self.name} deleted from memory")

# Add to cache
user = User(1, "Alice")
cache[1] = user

print(cache[1].name)  # "Alice" - cache works

# Delete the strong reference
del user  # User Alice deleted from memory

# Cache automatically removed the entry!
print(1 in cache)  # False

When user was deleted, its reference count hit 0 (the cache only held a weak reference). The object was destroyed, and WeakValueDictionary automatically removed the entry.

This is how you build caches that don't leak memory.

Use Cases for Weak References

  1. Caches: Store objects without preventing their cleanup
  2. Observer patterns: Listeners shouldn't keep subjects alive
  3. Circular references: Parent doesn't prevent child GC

# Observer pattern with weak refs
class Subject:
    def __init__(self):
        self._observers = []

    def attach(self, observer):
        # Store weak reference, not strong
        self._observers.append(weakref.ref(observer))

    def notify(self):
        # Filter out dead references
        self._observers = [obs for obs in self._observers if obs() is not None]
        for obs_ref in self._observers:
            obs = obs_ref()
            if obs is not None:
                obs.update()

When NOT to Use Weak References

Weak references have limitations:

# Can't create weak refs to basic types
try:
    weak = weakref.ref(42)
except TypeError:
    print("Can't weakref int")

try:
    weak = weakref.ref("hello")
except TypeError:
    print("Can't weakref str")

# Weak refs add overhead
# Don't use them everywhere - only where you need them

Recommended Resource

I highly recommend Nina Zakharenko's "Memory Management in Python" talk from PyCon 2016.



Summary

We've learned how Python manages object lifecycles:

Mental Models

  • Variables are labels, not boxes → b = a creates two labels on one object
  • Mutable defaults are shared → Use None sentinel pattern
  • Reference counting is immediate → Objects die when count hits 0

The Reference Counting System

  • Every object has ob_refcnt tracking references
  • Assignment increments count, deletion decrements
  • Count = 0 triggers immediate destruction via __del__

The Garbage Collector

  • Purpose: Find and destroy reference cycles
  • Strategy: Generational collection (young objects die fast)
  • Trigger: Automatically or via gc.collect()
  • Cost: Pause program to scan objects

Weak References

  • Strong reference: Keeps object alive (normal)
  • Weak reference: Points to object without preventing cleanup
  • WeakValueDictionary: Auto-cleaning cache
  • Use for: Caches, observers, avoiding cycles

The Professional Checklist

When designing classes:

# Avoid cycles in data structures
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.children = []
        # DON'T store self.parent - creates cycle
        # Store parent_id or use weakref instead

# Use weak refs for caches
import weakref
class ResourceManager:
    def __init__(self):
        self._cache = weakref.WeakValueDictionary()

# Clean up resources explicitly
class FileHandler:
    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.cleanup()  # Deterministic - don't rely on __del__

    def cleanup(self):
        ...  # release files, sockets, locks here

When writing functions:

# NEVER use mutable defaults
def process_items(items, buffer=[]):  # BAD
    pass

def process_items(items, buffer=None):  # GOOD
    if buffer is None:
        buffer = []
