Ja'far Khakpour

Python and Memory Management, Part 3: Call Stack

In our previous sections on Python's memory management, we discovered how Python allocates and manages objects in memory (PyObjects, arenas, and allocation pools). Now let's explore what happens when those objects come to life through code execution.
In this part, we will talk about another important section of application memory: the stack, and how it is handled in a Python program. We are going to look at the bridge between static memory and dynamic execution: the PyFrame object, the fundamental unit of Python's call stack.

From Memory to Execution: Connecting PyObjects to PyFrames

Before we dive into Python's frame system, we need to clear up a critical distinction that often confuses people coming from languages like C or C++. Python operates with two fundamentally different stack concepts that serve entirely different purposes, and confusing them leads to misunderstandings.

Memory Layout of a Program

From the operating system's perspective, a process's memory is organized into several distinct sections:

1. Text (Code Segment)
This segment stores your application's executable instructions (the compiled machine code). It is read-only, fixed in size, and can be shared between processes when possible.
It is loaded into memory when the executable file is loaded, and the OS does not allow its contents to be modified at runtime.

2. Data Segment
This section stores global and static variables of the program. Variables in this section live throughout program execution.

3. Heap Segment
This section is used for dynamic memory allocation. It is the memory you allocate with malloc() and free with free() in C code. The heap is shared by all functions and modules in a process.
Python stores normal variables in a slightly different place: instead of calling malloc() directly, it uses its own allocator, pymalloc, which manages object allocation in anonymous segments of application memory (the arenas and pools covered in the previous parts of this series).

4. Stack
This segment stores function call information: local variables, function parameters, and return addresses. Allocation and deallocation are automatic, done through simple pointer arithmetic in machine code, which makes it very fast. It is a LIFO structure with a fixed size per thread (typically 1-8 MB per thread on Linux).
But Python does not use this memory for managing its call stack. It uses a different memory structure and manages the call stack itself, as the sketch below shows.
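
To see this in action, here is a minimal sketch: CPython enforces its own recursion limit and raises a RecursionError long before the OS thread stack could overflow.

import sys

# CPython tracks call depth itself, independently of the OS thread stack.
print(f"Interpreter recursion limit: {sys.getrecursionlimit()}")  # 1000 by default

def recurse(depth=0):
    try:
        return recurse(depth + 1)
    except RecursionError:
        # Raised by the interpreter itself; the OS stack never overflowed.
        return depth

print(f"Reached depth {recurse()} before hitting the interpreter's limit")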

In fact, it is not only the heap and stack segments: even the code segment is managed differently in Python.

Code as Object

In the previous sections, we noticed that modules and functions are stored in the same memory sections where Python stores normal variables (like strings). In Python, every function is just an object. Every piece of executable code can be parsed into an AST (Abstract Syntax Tree), and every AST can be compiled and executed at runtime, inside the same process, with no need to write the code out and execute it in another process. This is possible because Python is an interpreted language: when you run a .py file, the actual code segment belongs to the Python executable, and your code is data that the interpreter works with.
For a function, you can see its AST structure in a readable format:

import ast

source_code = """
def func(a: int, b: int) -> int:
    "This function adds two numbers"
    return a + b
"""

print("Function source code:")
print(source_code)
print("="*80)
print("Equivalent AST Syntax:")
print(ast.dump(ast.parse(source_code), indent=4))

You can use the formatted dump output of the code above to recreate the same function at runtime:


import ast
from ast import *

module_ast = Module(
    body=[
        FunctionDef(
            name='func',
            args=arguments(
                posonlyargs=[],
                args=[
                    arg(
                        arg='a',
                        annotation=Name(id='int', ctx=Load())
                    ),
                    arg(
                        arg='b',
                        annotation=Name(id='int', ctx=Load())
                    )
                ],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]
            ),
            body=[
                Expr(
                    value=Constant(value='This function adds two numbers')
                ),
                Return(
                    value=BinOp(
                        left=Name(id='a', ctx=Load()),
                        op=Add(),
                        right=Name(id='b', ctx=Load())
                    )
                )
            ],
            decorator_list=[],
            returns=Name(id='int', ctx=Load()))],
    type_ignores=[]
)


ast.fix_missing_locations(module_ast)
code_obj = compile(module_ast, '<runtime>', 'exec')
print("code objet:", code_obj)
namespace = {}
exec(code_obj, namespace)

dynamic_func = namespace['func']

print(f"{dynamic_func(1, 3)=}")
print(f"{id(dynamic_func)=:x}")

Now we have a fully functioning function, dynamic_func(), which the interpreter was not aware of until it executed the code. After execution, the function works like any normal function.
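
Because a compiled function is an ordinary object, you can inspect and even disassemble its code object like any other attribute. Here is a small sketch with the same function (the dynamically created dynamic_func behaves identically):

import dis

def func(a: int, b: int) -> int:
    "This function adds two numbers"
    return a + b

# The function object carries its compiled bytecode as a code object.
print(func.__code__.co_varnames)  # ('a', 'b')
print(func.__code__.co_consts)    # constants, including the docstring

# dis shows the bytecode a frame will execute for this function.
dis.dis(func)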

Python Stack as Object

But functions are not the only thing you can access as objects; you can even access the call stack in Python:

import inspect

def recursive_function(n, depth=0):
    """Demonstrate stack frames with recursion"""
    current_frame = inspect.currentframe()
    print(f"Depth: {depth} Frame ID: {id(current_frame)}, Previous Frame ID: {id(current_frame.f_back)}")

    if n > 0:
        recursive_function(n-1, depth=depth+1)
    else:
        demonstrate_frames(depth)

    print(f"Depth {depth}: Reurning from Frame ID: {id(current_frame.f_back)} , Local n: {n}")



def demonstrate_frames(depth):
    print("="*80)
    frame = inspect.currentframe()
    print(f"Frame details:")
    print(f"Current Line number: {frame.f_lineno}")
    print(f"Filename: {frame.f_code.co_filename}")
    print(f"Function name: {frame.f_code.co_name}")
    print(f"Code object:\t{frame.f_code}")
    print(f"Local variables: {list(frame.f_locals)}")
    print(f"Depth variable: {frame.f_locals['depth']}")
    print(f"Global variables: {list(frame.f_globals)}")


    print("="*80)
# Call the recursive function
recursive_function(2)


At the top of this output, you can see the frame objects and, for each frame, the previous frame it was called from:

Depth: 0 Frame ID: 140421115686208, Previous Frame ID: 140421113260096
Depth: 1 Frame ID: 140421115686448, Previous Frame ID: 140421115686208
Depth: 2 Frame ID: 140421115687168, Previous Frame ID: 140421115686448

The frame stack in Python is actually a linked list of frames, starting from the current frame and going back through a link to the previous frame (the f_back attribute). Each frame object keeps track of the code being executed, its context, and its variables.

In the output of this code, you can see the different objects attached to the frame, like the function name, the current line number, and the builtins, globals, and locals namespaces. You can see all attributes of the frame object here.
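
You can walk this linked list yourself by following f_back until it becomes None at the module-level frame. A minimal sketch:

import sys

def walk_stack():
    frame = sys._getframe()
    # Follow the f_back links from the current frame back to the root frame.
    while frame is not None:
        print(f"0x{id(frame):x}  {frame.f_code.co_name}")
        frame = frame.f_back

def caller():
    walk_stack()

caller()  # prints walk_stack, caller, then <module>: the end of the list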

As seen above, the call stack is a LIFO structure implemented as a linked list, not as an array (unlike the OS-managed stack):

import gc
import sys

all_frame_ids = set()

def track_frame_allocation():
    """Track how frames are allocated and garbage collected"""
    def create_frames():
        """Create several frames to observe allocation patterns"""
        def level_3():
            frame = sys._getframe()
            print(f"  level 3: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 3

        def level_2():
            frame = sys._getframe()
            print(f"  level 2: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 2 + level_3()

        def level_1():
            frame = sys._getframe()
            print(f"  level 1: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 1 + level_2()

        level_1()

    # Run multiple times to see if frames are reused
    print("=== FRAME ALLOCATION PATTERNS ===")

    for i in range(3):
        print(f"\nRun {i + 1}:")
        create_frames()
        # Force garbage collection and free memory used for frame objects
        gc.collect()

    print(f"\nUnique frame addresses across all runs: {len(all_frame_ids)}")

track_frame_allocation()

When you run this code, you get output like this:

=== FRAME ALLOCATION PATTERNS ===

Run 1:
  level 1: frame at 0x7fb2b1ff8110, function: level_1
  level 2: frame at 0x7fb2b1ff9be0, function: level_2
  level 3: frame at 0x7fb2b1f34b80, function: level_3

Run 2:
  level 1: frame at 0x7fb2b1ff9a40, function: level_1
  level 2: frame at 0x7fb2b1ff8110, function: level_2
  level 3: frame at 0x7fb2b1f34a00, function: level_3

Run 3:
  level 1: frame at 0x7fb2b1ff9a40, function: level_1
  level 2: frame at 0x7fb2b1ff8110, function: level_2
  level 3: frame at 0x7fb2b1f34a00, function: level_3

Unique frame addresses across all runs: 5

As you can see, a memory address used in one iteration can easily be reused in the next iteration (though the order of the addresses may change).
During normal code execution, a frame is released as soon as the function call ends and returns its value, but this is not always true for generators and coroutines:

import gc
import inspect


def gen_numbers(n):
    i = 1
    while True:
        while i <= n:
            ## yield i, and receive value sent by generator.send() as x
            x = yield i
            if x == -1:
                ## the generator was resumed by the top-level function (`level_3()`)
                print(f"Inner loop done. Showing stack at level {i}:")
                stack = inspect.stack()
                for frame in stack:
                    print(f"  frame at 0x{id(frame[0]):x}, function: {frame[3]}")
                i = 1
            else:
                i += 1


def create_frames(generator):
    """Create several frames to observe allocation patterns"""
    def level_3(generator):
        ## You can send any value to the generator when iterating over it
        ## and next(generator) is just equivalent to generator.send(None)
        return generator.send(-1)

    def level_2(generator):
        return next(generator) + level_3(generator)

    def level_1(generator):
        return next(generator) + level_2(generator)

    level_1(generator)


# Run multiple times to see if frames are reused
print("=== FRAME ALLOCATION PATTERNS ===")
generator = gen_numbers(3)
for iteration in range(3):
    print(f"\nIteration {iteration + 1}:")
    create_frames(generator=generator)
    # Force garbage collection and check frame objects
    gc.collect()

If you run this code, you will see something like this:

=== FRAME ALLOCATION PATTERNS ===

Iteration 1:
Inner loop done. Showing stack at level 3:
  frame at 0x7f0944582c50, function: gen_numbers
  frame at 0x7f0944344a90, function: level_3
  frame at 0x7f0944535180, function: level_2
  frame at 0x7f0944535000, function: level_1
  frame at 0x7f0944340880, function: create_frames
  frame at 0x7f09443407c0, function: <module>

Iteration 2:
Inner loop done. Showing stack at level 3:
  frame at 0x7f0944582c50, function: gen_numbers
  frame at 0x7f0944344720, function: level_3
  frame at 0x7f0944537dc0, function: level_2
  frame at 0x7f0944534a00, function: level_1
  frame at 0x7f0944340040, function: create_frames
  frame at 0x7f09443407c0, function: <module>

Iteration 3:
Inner loop done. Showing stack at level 3:
  frame at 0x7f0944582c50, function: gen_numbers
  frame at 0x7f0944344a90, function: level_3
  frame at 0x7f0944535180, function: level_2
  frame at 0x7f0944535000, function: level_1
  frame at 0x7f0944340880, function: create_frames
  frame at 0x7f09443407c0, function: <module>

On each iteration, the addresses of the function frames change, but two addresses never change: the root frame (<module>), which is expected, and the generator's frame at the top of the stack. That's because the same generator (and its frame) is reused until the end of the code. This behavior does not comply with a standard C-like stack's LIFO structure, and that is exactly the point! Python behaves as a stackful language for its normal function calls, but when you work with generators and coroutines, it behaves like a stackless language. This is a huge benefit for Python: it makes lazy evaluation, pipelines, and chained actions much easier to build with generators.
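
You can observe this directly: a generator exposes its suspended frame as gi_frame, and the same frame object survives across next() calls until the generator finishes. A minimal sketch:

def gen():
    yield 1
    yield 2

g = gen()
first = g.gi_frame            # the frame already exists before the first next()
next(g)
print(g.gi_frame is first)    # True: the suspended frame is the same object
print(g.gi_frame.f_lasti)     # bytecode offset where execution is paused
list(g)                       # exhaust the generator
print(g.gi_frame)             # None: the frame is released when the generator ends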

The same thing happens with coroutines, in a different way. Coroutines have a smaller memory footprint than threads or processes. Processes are the heaviest of the three, and threads also pay context-switch costs: control returns to the OS, the OS scheduler decides which thread runs next, the current registers are saved and the other thread's registers are loaded so its stack can resume, and the cycle repeats. On top of that comes the cost of the Python interpreter's mutex (the GIL). All of this costs much more than switching control between PyFrames, which happens entirely inside the Python interpreter. This lets the program schedule async tasks much more cheaply.
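
Coroutines carry the same suspended-frame machinery, exposed as cr_frame. A quick sketch:

import asyncio

async def coro():
    await asyncio.sleep(0)
    return 42

c = coro()
print(c.cr_frame.f_code.co_name)  # 'coro': the frame exists while suspended
print(asyncio.run(c))             # 42
print(c.cr_frame)                 # None: released once the coroutine finishes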

Take this sample code:

import time
import asyncio
import os
from argparse import ArgumentParser
from concurrent.futures import ThreadPoolExecutor, as_completed
import resource

N_TASKS = 10_000
IO_DELAY = 0.1
MAX_CONCURRENCY = 500

class PerformanceMonitor:
    """Simple performance monitoring context manager"""

    def __enter__(self):
        # Time measurements
        self.start_perf = time.perf_counter()

        # Resource measurements
        self.ru_start = resource.getrusage(resource.RUSAGE_SELF)

        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Get final measurements
        self.end_perf = time.perf_counter()
        self.ru_end = resource.getrusage(resource.RUSAGE_SELF)

    def perf_counter_time(self):
        return self.end_perf - self.start_perf
    def user_time(self):
        return self.ru_end.ru_utime - self.ru_start.ru_utime
    def sys_time(self):
        return self.ru_end.ru_stime - self.ru_start.ru_stime
    def cpu_time(self):
        return self.user_time() + self.sys_time()
    def cpu_percent(self):
        return (self.cpu_time() / self.perf_counter_time() * 100) if self.perf_counter_time() > 0 else 0
    def max_rss_mb(self):
        return self.ru_end.ru_maxrss / 1024.  # ru_maxrss is reported in KB on Linux
    def voluntary_switches(self):
        return self.ru_end.ru_nvcsw - self.ru_start.ru_nvcsw
    def involuntary_switches(self):
        return self.ru_end.ru_nivcsw - self.ru_start.ru_nivcsw
    def page_faults(self):
        return self.ru_end.ru_majflt - self.ru_start.ru_majflt
    def page_reclaims(self):
        return self.ru_end.ru_minflt - self.ru_start.ru_minflt

    def print_results(self):
        print(
            "",
            f"{'='*50}",
        # Time metrics
            f"TIME METRICS:",
            f"  CPU time:          {self.cpu_time():.6f}s",
            f"       User:         {self.user_time():.6f}s",
            f"       System:       {self.sys_time():.6f}s",
            f"  CPU Usage Percent: {self.cpu_percent():.1f}%",
        # Memory
            "",
            f"  Max RSS:                          {self.max_rss_kb():,} MB",
        # Context switches
            f"  Voluntary Context switch:         {self.voluntary_switches():,}",
            f"  Involuntary Context switch:       {self.involuntary_switches():,}",

        # Page faults
            "",
            f"  Major (I/O):       {self.page_faults():,}",
            f"  Minor:             {self.page_reclaims():,}",
            sep="\n"
        )


def sync_stack():
    time.sleep(IO_DELAY)
    return 1


async def make_async_stack():
    await asyncio.sleep(IO_DELAY)
    return 1

def run_threads():
    print(f"Running in THREAD mode")

    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
        futures = [executor.submit(sync_stack) for _ in range(N_TASKS)]
        for f in as_completed(futures):
            f.result()

async def run_async():
    print(f"Running in ASYNC mode ")

    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited_task():
        async with sem:
            await make_async_stack()

    tasks = [asyncio.create_task(limited_task()) for _ in range(N_TASKS)]
    await asyncio.gather(*tasks)

def run_noop():
    print("Doing Nothing! Return...")

def run_benchmark(mode: str):
    print(
        "________________________________________",
        f"PID: {os.getpid()}",
        "Config:",
        f"  Tasks: {N_TASKS}, ",
        f"  Concurrency: {MAX_CONCURRENCY}, ",
        f"  Delay: {IO_DELAY}s, ",
        "﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌",
        sep="\n"
    )

    if mode == "threads":
        run_threads()

    if mode == "async":
        asyncio.run(run_async())

    if mode == "none":
        run_noop()

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument(
        "mode",
        choices=["threads", "async", "none"],
    )
    arguments = parser.parse_args()
    mode = arguments.mode

    with PerformanceMonitor() as mon:
        run_benchmark(mode)
    mon.print_results()


This code runs the same batch of tasks concurrently, using either threads or an async event loop:

python concurrent_benchmark.py none # nothing to do. just start the app and immediately stop it
python concurrent_benchmark.py threads # run a very simple task which simulates IO wait in threads
python concurrent_benchmark.py async # simulate IO wait in async coros

There is no big difference between the execution times of the two versions; most of the time the threaded run even finishes a little faster on my computer. That is expected: with 10,000 tasks, 0.1s of simulated IO each, and a concurrency limit of 500, the theoretical floor is 10,000 / 500 × 0.1s ≈ 2s of wall time in both modes, so wall time mostly measures waiting. But the numbers we gather with PerformanceMonitor show some interesting differences:

1. CPU Time

CPU execution time shows a meaningful difference between async and threads:

## Threads
  CPU time:          0.744787s
       User:         0.561942s
       System:       0.182845s
  CPU Usage Percent: 35.5%

## Async
  CPU time:          0.235827s
       User:         0.223514s
       System:       0.012313s
  CPU Usage Percent: 11.2%

The threaded benchmark uses roughly 3x as much CPU time as the async benchmark. But there is an even more interesting metric in these stats: system time for threads is ~15x higher than for async, because in threaded mode the application has to release control to the OS so the OS can switch execution to another thread. In async mode, the application itself switches between coroutines and never needs to ask the OS to schedule them.
You can see these results here:

## Threads
  Voluntary Context switch:         30,351
  Involuntary Context switch:       50

## Async
  Voluntary Context switch:         54
  Involuntary Context switch:       16

A voluntary switch happens when the application voluntarily releases control: to switch threads, or because it has nothing to do at the moment (all of our tasks are simple sleeps, so there is little CPU work while they execute).
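
You can see this effect in isolation: a blocking sleep hands control back to the OS, which getrusage() reports as voluntary context switches. A minimal sketch (Unix only, like the benchmark above):

import resource
import time

before = resource.getrusage(resource.RUSAGE_SELF)
time.sleep(0.5)  # a blocking sleep: the thread voluntarily yields to the OS
after = resource.getrusage(resource.RUSAGE_SELF)

print("Voluntary switches:", after.ru_nvcsw - before.ru_nvcsw)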

2. Memory Usage

There is one more metric in this benchmark, memory usage:

## No-OP
  Max RSS:                          19.820 MB

## Threads
  Max RSS:                          74.512 MB

## Async
  Max RSS:                          35.043 MB

The RAM this script needs just to load and do nothing is ~20 MB. Running it with async coroutines takes ~15 MB more, while threads take ~55 MB more. And I have to mention that this test scenario was a very simple case; adding more CPU-heavy work and a very deep stack would make threads look even worse.
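
Part of that per-thread cost is the OS stack reserved for every thread. If you want to experiment, threading.stack_size() lets you inspect or shrink that reservation before threads are created; a quick sketch (defaults and minimum sizes vary by platform):

import threading

# 0 means "use the platform default" (often up to 8 MB of virtual space on Linux).
print(threading.stack_size())

# Request a smaller stack for threads created after this call.
threading.stack_size(512 * 1024)  # 512 KB; most platforms require at least 32 KiB

t = threading.Thread(target=lambda: print("running with a smaller stack"))
t.start()
t.join()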

Over this series, we’ve traced Python’s memory architecture from the ground up (exploring PyObjects, the distinction between heap and stack, and the inner workings of PyFrames). These foundational concepts reveal why Python behaves the way it does, and more importantly, they illuminate the profound efficiency of generators and coroutines.

In mastering these tools and concepts, we don't just write better Python code, we also learn to think in harmony with underlying software and hardware layers, transforming our understanding of performance from an abstract concern into an intuitive mindset.
