In our previous sections about Python's memory management, we discovered how Python allocates and manages objects in memory (PyObjects, Arenas, and Allocation Pools). Now let's explore what happens when those objects come to life through code execution.
In this part, we will talk about another important section of application memory: the stack, and how it is handled in a Python program. We are going to cover the bridge between static memory and dynamic execution: the PyFrame object, the fundamental unit of Python's call stack.
From Memory to Execution: Connecting PyObjects to PyFrames
Before we dive into Python's frame system, we need to clear up a critical distinction that often confuses people coming from languages like C or C++. Python operates with two fundamentally different stack concepts that serve entirely different purposes, and confusing them can lead to misunderstanding.
Memory Layout of a Program
From the operating system's perspective, a process's memory is organized into several distinct sections:
1. Text (Code Segment)
It stores the executable instructions of your application (the compiled machine instructions - machine code). This memory block is read-only and fixed in size, and it can be shared between processes when possible.
This segment is loaded into memory when the executable file is loaded, and the OS will not allow the contents of this section to be changed at runtime.
2. Data Segment
This section stores global and static variables of the program. Variables in this section live throughout program execution.
3. Heap Segment
This is the section used for dynamic memory allocation. It is the memory you allocate using malloc() and free using free() in C code. The heap is shared by all functions and modules in a process.
Python uses a different approach for storing normal variables in memory: a specialized allocator called pymalloc which, instead of calling malloc() for every object, manages allocation of objects inside anonymous segments of application memory. This allocator was the subject of the previous parts of this post.
4. Stack
This segment stores function call information: local variables, function parameters, and return addresses. Memory here is allocated and deallocated automatically via simple pointer arithmetic in machine code, which makes it very fast. It has a LIFO structure with a fixed size per thread (typically 1-8 MB per thread on Linux).
But Python does not use this memory to manage its call stack. It uses a different memory structure and manages the call stack by itself.
In fact, not only the heap and stack segments, but also the code segment is managed differently in Python.
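Before moving on, you can actually see this OS-level layout from inside a running process. The sketch below is Linux-only (it reads /proc/self/maps) and filters the process's memory map down to the interpreter binary, the C heap, and the main thread's stack:

# Linux-only: /proc/self/maps lists every memory mapping of the current process
with open("/proc/self/maps") as maps:
    for line in maps:
        # keep the interpreter's segments, the C heap, and the main stack
        if "[heap]" in line or "[stack]" in line or "python" in line:
            print(line.rstrip())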
Code as Object
In the previous sections, we noticed that modules and functions are stored in the same memory sections where Python stores normal variables (like strings). In Python, every function is just an object. Every piece of executable code can be represented as an AST (Abstract Syntax Tree), and every AST can be executed at runtime; there is no need to write the code to a file and execute it in another process, we can just run it inside the same process. This is because Python is an interpreted language: when you run a .py file, the actual code segment belongs to the Python executable, and your code is part of the data this interpreter works with.
For a function, you can see its AST structure in a readable format:
import ast

source_code = """
def func(a: int, b: int) -> int:
    "This function adds two numbers"
    return a + b
"""

print("Function source code:")
print(source_code)
print("="*80)
print("Equivalent AST Syntax:")
print(ast.dump(ast.parse(source_code), indent=4))
You can use the formatted dump output of the code above to recreate the same function at runtime:
import ast
from ast import *

module_ast = Module(
    body=[
        FunctionDef(
            name='func',
            args=arguments(
                posonlyargs=[],
                args=[
                    arg(
                        arg='a',
                        annotation=Name(id='int', ctx=Load())
                    ),
                    arg(
                        arg='b',
                        annotation=Name(id='int', ctx=Load())
                    )
                ],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]
            ),
            body=[
                Expr(
                    value=Constant(value='This function adds two numbers')
                ),
                Return(
                    value=BinOp(
                        left=Name(id='a', ctx=Load()),
                        op=Add(),
                        right=Name(id='b', ctx=Load())
                    )
                )
            ],
            decorator_list=[],
            returns=Name(id='int', ctx=Load())
        )
    ],
    type_ignores=[]
)

ast.fix_missing_locations(module_ast)
code_obj = compile(module_ast, '<runtime>', 'exec')
print("code object:", code_obj)

namespace = {}
exec(code_obj, namespace)
dynamic_func = namespace['func']
print(f"{dynamic_func(1, 3)=}")
print(f"{id(dynamic_func)=:x}")
Now we have a fully functioning function, dynamic_func(), that the interpreter was not aware of until it executed the code. But after execution, you can see the function works like any normal function.
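Since the function is just an object, you can keep digging: its __code__ attribute is the compiled code object the interpreter actually executes. A short follow-up sketch (it assumes dynamic_func from the snippet above is still in scope):

import dis

# the code object carries everything the interpreter needs to run the function
print(dynamic_func.__code__.co_name)      # 'func'
print(dynamic_func.__code__.co_varnames)  # ('a', 'b')
print(dynamic_func.__code__.co_consts)    # constants, including the docstring
dis.dis(dynamic_func)                     # disassemble into bytecode instructions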
Python Stack as Object
But this is not the only thing you can access as an object; you can even access the call stack in Python:
import inspect

def recursive_function(n, depth=0):
    """Demonstrate stack frames with recursion"""
    current_frame = inspect.currentframe()
    print(f"Depth: {depth} Frame ID: {id(current_frame)}, Previous Frame ID: {id(current_frame.f_back)}")
    if n > 0:
        recursive_function(n-1, depth=depth+1)
    else:
        demonstrate_frames(depth)
    print(f"Depth {depth}: Returning to Frame ID: {id(current_frame.f_back)} , Local n: {n}")

def demonstrate_frames(depth):
    print("="*80)
    frame = inspect.currentframe()
    print(f"Frame details:")
    print(f"Current Line number: {frame.f_lineno}")
    print(f"Filename: {frame.f_code.co_filename}")
    print(f"Function name: {frame.f_code.co_name}")
    print(f"Code object:\t{frame.f_code}")
    print(f"Local variables: {list(frame.f_locals)}")
    print(f"Depth variable: {frame.f_locals['depth']}")
    print(f"Global variables: {list(frame.f_globals)}")
    print("="*80)

# Call the recursive function
recursive_function(2)
At the top of this output, you can see the frame objects and, for each frame, the previous frame it was called from:
Depth: 0 Frame ID: 140421115686208, Previous Frame ID: 140421113260096
Depth: 1 Frame ID: 140421115686448, Previous Frame ID: 140421115686208
Depth: 2 Frame ID: 140421115687168, Previous Frame ID: 140421115686448
The frame stack in Python is actually a linked list of frames, starting from the current frame and going back through a link to the previous frame (the f_back attribute). The frame object keeps track of code execution, context, and variables.
In the output of this code, you can see the different objects attached to this frame, like the function name, the current line number, and the builtins, globals, and locals namespaces. The full list of frame attributes is documented in the inspect module.
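Because the stack is just a chain of f_back links, you can walk it yourself without inspect.stack(). A minimal sketch:

import sys

def walk_stack():
    frame = sys._getframe()  # the frame of walk_stack itself
    while frame is not None:
        print(f"{frame.f_code.co_name} (line {frame.f_lineno})")
        frame = frame.f_back  # follow the link back to the caller's frame

def outer():
    walk_stack()

outer()  # prints walk_stack, then outer, then <module>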
As seen above, the call stack is a LIFO structure, but it is implemented as a linked list, not as an array (like the OS-managed stack):
import gc
import sys

all_frame_ids = set()

def track_frame_allocation():
    """Track how frames are allocated and garbage collected"""

    def create_frames():
        """Create several frames to observe allocation patterns"""

        def level_3():
            frame = sys._getframe()
            print(f" level 3: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 3

        def level_2():
            frame = sys._getframe()
            print(f" level 2: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 2 + level_3()

        def level_1():
            frame = sys._getframe()
            print(f" level 1: frame at 0x{id(frame):x}, function: {frame.f_code.co_name}")
            all_frame_ids.add(id(frame))
            return 1 + level_2()

        level_1()

    # Run multiple times to see if frames are reused
    print("=== FRAME ALLOCATION PATTERNS ===")
    for i in range(3):
        print(f"\nRun {i + 1}:")
        create_frames()
        # Force garbage collection and free memory used for frame objects
        gc.collect()

    print(f"\nUnique frame addresses across all runs: {len(all_frame_ids)}")

track_frame_allocation()
When you run this code, you get output like this:
=== FRAME ALLOCATION PATTERNS ===
Run 1:
level 1: frame at 0x7fb2b1ff8110, function: level_1
level 2: frame at 0x7fb2b1ff9be0, function: level_2
level 3: frame at 0x7fb2b1f34b80, function: level_3
Run 2:
level 1: frame at 0x7fb2b1ff9a40, function: level_1
level 2: frame at 0x7fb2b1ff8110, function: level_2
level 3: frame at 0x7fb2b1f34a00, function: level_3
Run 3:
level 1: frame at 0x7fb2b1ff9a40, function: level_1
level 2: frame at 0x7fb2b1ff8110, function: level_2
level 3: frame at 0x7fb2b1f34a00, function: level_3
Unique frame addresses across all runs: 5
As you can see, a memory address used in one iteration can easily be reused in the next iteration (though the order of the addresses may change).
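You can see this reuse in an even smaller form; since the first frame is freed before the second call allocates its frame, CPython frequently hands out the same memory again:

import sys

def frame_addr():
    return id(sys._getframe())

# the second call often lands on the memory the first frame just released
print(frame_addr() == frame_addr())  # frequently True in CPython, but not guaranteed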
During normal code execution, a frame is released as soon as the function call ends and returns its output value, but this is not always true for generators and coroutines:
import gc
import inspect

def gen_numbers(n):
    i = 1
    while True:
        while i <= n:
            ## yield i, and receive the value sent by generator.send() as x
            x = yield i
            if x == -1:
                ## the generator received -1 from the top-level function (`level_3()`)
                print(f"inner loop done. show stack at level {i}:")
                stack = inspect.stack()
                for frame in stack:
                    print(f" frame at 0x{id(frame[0]):x}, function: {frame[3]}")
                i = 1
            else:
                i += 1

def create_frames(generator):
    """Create several frames to observe allocation patterns"""

    def level_3(generator):
        ## You can send any value to the generator when iterating over it;
        ## next(generator) is just equivalent to generator.send(None)
        return generator.send(-1)

    def level_2(generator):
        return next(generator) + level_3(generator)

    def level_1(generator):
        return next(generator) + level_2(generator)

    level_1(generator)

# Run multiple times to see if frames are reused
print("=== FRAME ALLOCATION PATTERNS ===")
generator = gen_numbers(3)
for iteration in range(3):
    print(f"\nRun {iteration + 1}:")
    create_frames(generator=generator)
    # Force garbage collection and check frame objects
    gc.collect()
If you run this code, you will see something like this:
=== FRAME ALLOCATION PATTERNS ===
Run 1:
inner loop done. show stack at level 3:
frame at 0x7f0944582c50, function: gen_numbers
frame at 0x7f0944344a90, function: level_3
frame at 0x7f0944535180, function: level_2
frame at 0x7f0944535000, function: level_1
frame at 0x7f0944340880, function: create_frames
frame at 0x7f09443407c0, function: <module>
Run 2:
inner loop done. show stack at level 3:
frame at 0x7f0944582c50, function: gen_numbers
frame at 0x7f0944344720, function: level_3
frame at 0x7f0944537dc0, function: level_2
frame at 0x7f0944534a00, function: level_1
frame at 0x7f0944340040, function: create_frames
frame at 0x7f09443407c0, function: <module>
Run 3:
inner loop done. show stack at level 3:
frame at 0x7f0944582c50, function: gen_numbers
frame at 0x7f0944344a90, function: level_3
frame at 0x7f0944535180, function: level_2
frame at 0x7f0944535000, function: level_1
frame at 0x7f0944340880, function: create_frames
frame at 0x7f09443407c0, function: <module>
On each run, the addresses of the frames for the regular functions change, but there are two addresses that never change: the root frame (<module>), which is obvious, and the generator's frame at the top of the stack. That's because the same generator (and its frame) is reused until the end of the code. This behavior does not comply with a standard C-like stack's LIFO structure, and this is exactly the point! Python behaves as a stack-based language for its normal function calls, but when you work with generators and coroutines, it behaves like a stackless language. This is a huge benefit for Python: it makes lazy evaluation, pipelines, and chained actions much easier to build with generators.
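You can observe this suspended frame directly; a generator exposes it through its gi_frame attribute. A minimal sketch:

def countdown(n):
    while n:
        yield n
        n -= 1

gen = countdown(3)
print(gen.gi_frame)           # the frame object already exists
next(gen)
print(gen.gi_frame.f_lineno)  # suspended at the yield line
print(gen.gi_frame.f_locals)  # locals survive between next() calls
for _ in gen:
    pass
print(gen.gi_frame)           # None: the frame is released on exhaustion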
The same thing happens for coroutines, in a different way. Coroutines have a smaller memory footprint than threads and processes. Processes are the heaviest of the three, and threads also pay context-switch costs (return control to the OS, let the OS decide which thread to run next (OS scheduling), store the current registers and load the other thread's registers to resume its stack, and repeat), to which you can add the cost of the Python interpreter's mutex (the GIL). All of this costs much more than switching control between PyFrames, which happens inside the Python interpreter itself. This allows the program to be smarter when scheduling async tasks.
Take this sample code:
import time
import asyncio
import os
from argparse import ArgumentParser
from concurrent.futures import ThreadPoolExecutor, as_completed
import resource

N_TASKS = 10_000
IO_DELAY = 0.1
MAX_CONCURRENCY = 500

class PerformanceMonitor:
    """Simple performance monitoring context manager"""

    def __enter__(self):
        # Time measurements
        self.start_perf = time.perf_counter()
        # Resource measurements
        self.ru_start = resource.getrusage(resource.RUSAGE_SELF)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Get final measurements
        self.end_perf = time.perf_counter()
        self.ru_end = resource.getrusage(resource.RUSAGE_SELF)

    def perf_counter_time(self):
        return self.end_perf - self.start_perf

    def user_time(self):
        return self.ru_end.ru_utime - self.ru_start.ru_utime

    def sys_time(self):
        return self.ru_end.ru_stime - self.ru_start.ru_stime

    def cpu_time(self):
        return self.user_time() + self.sys_time()

    def cpu_percent(self):
        return (self.cpu_time() / self.perf_counter_time() * 100) if self.perf_counter_time() > 0 else 0

    def max_rss_mb(self):
        # ru_maxrss is reported in KB on Linux
        return self.ru_end.ru_maxrss / 1024.

    def voluntary_switches(self):
        return self.ru_end.ru_nvcsw - self.ru_start.ru_nvcsw

    def involuntary_switches(self):
        return self.ru_end.ru_nivcsw - self.ru_start.ru_nivcsw

    def page_faults(self):
        return self.ru_end.ru_majflt - self.ru_start.ru_majflt

    def page_reclaims(self):
        return self.ru_end.ru_minflt - self.ru_start.ru_minflt

    def print_results(self):
        print(
            "",
            f"{'='*50}",
            # Time metrics
            f"TIME METRICS:",
            f" CPU time: {self.cpu_time():.6f}s",
            f" User: {self.user_time():.6f}s",
            f" System: {self.sys_time():.6f}s",
            f" CPU Usage Percent: {self.cpu_percent():.1f}%",
            # Memory
            "",
            f" Max RSS: {self.max_rss_mb():,} MB",
            # Context switches
            f" Voluntary Context switch: {self.voluntary_switches():,}",
            f" Involuntary Context switch: {self.involuntary_switches():,}",
            # Page faults
            "",
            f" Major (I/O): {self.page_faults():,}",
            f" Minor: {self.page_reclaims():,}",
            sep="\n"
        )

def sync_stack():
    time.sleep(IO_DELAY)
    return 1

async def make_async_stack():
    await asyncio.sleep(IO_DELAY)
    return 1

def run_threads():
    print(f"Running in THREAD mode")
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
        futures = [executor.submit(sync_stack) for _ in range(N_TASKS)]
        for f in as_completed(futures):
            f.result()

async def run_async():
    print(f"Running in ASYNC mode ")
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited_task():
        async with sem:
            await make_async_stack()

    tasks = [asyncio.create_task(limited_task()) for _ in range(N_TASKS)]
    await asyncio.gather(*tasks)

def run_noop():
    print("Doing Nothing! Return...")

def run_benchmark(mode: str):
    print(
        "________________________________________",
        f"PID: {os.getpid()}",
        "Config:",
        f" Tasks: {N_TASKS}, ",
        f" Concurrency: {MAX_CONCURRENCY}, ",
        f" Delay: {IO_DELAY}s, ",
        "﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌﹌",
        sep="\n"
    )
    if mode == "threads":
        run_threads()
    if mode == "async":
        asyncio.run(run_async())
    if mode == "none":
        run_noop()

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument(
        "mode",
        choices=["threads", "async", "none"],
    )
    arguments = parser.parse_args()
    mode = arguments.mode

    with PerformanceMonitor() as mon:
        run_benchmark(mode)
    mon.print_results()
This code runs the same IO-bound task concurrently, using either threads or async coroutines:
python concurrent_benchmark.py none # nothing to do. just start the app and immediately stop it
python concurrent_benchmark.py threads # run a very simple task which simulates IO wait in threads
python concurrent_benchmark.py async # simulate IO wait in async coros
There is no big difference between the execution times of the two versions; most of the time, the threaded run even finishes a little faster on my computer. But the numbers we gather using PerformanceMonitor show some interesting differences:
1. CPU Time
CPU execution time shows a meaningful difference between async and threads:
## Threads
CPU time: 0.744787s
User: 0.561942s
System: 0.182845s
CPU Usage Percent: 35.5%
## Async
CPU time: 0.235827s
User: 0.223514s
System: 0.012313s
CPU Usage Percent: 11.2%
The threaded benchmark uses around ~3x as much CPU time as the async benchmark. But there is one more interesting metric in these stats: system time for threads is ~15x higher than for async. In threaded mode, the application has to release control to the OS so the OS can switch execution to another thread; in async mode, the application itself switches between coroutines and never needs to ask the OS to schedule them.
You can see this reflected in the context-switch counts:
## Threads
Voluntary Context switch: 30,351
Involuntary Context switch: 50
## Async
Voluntary Context switch: 54
Involuntary Context switch: 16
A voluntary switch happens when the application voluntarily releases control. This can be to switch threads, or because the application has nothing to do at the moment (all tasks here are simple sleeps, so there is not much CPU work while they execute).
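To make this metric concrete, here is a tiny sketch (using the same resource module as the benchmark) showing that sleeping itself produces voluntary context switches:

import resource
import time

start = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
for _ in range(100):
    time.sleep(0.001)  # each sleep voluntarily hands the CPU back to the OS
end = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
print(f"Voluntary context switches during 100 sleeps: {end - start}")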
There is one more metric in this benchmark: memory usage:
## No-OP
Max RSS: 19.820 MB
## Threads
Max RSS: 74.512 MB
## Async
Max RSS: 35.043 MB
The RAM needed for this script to just load and do nothing is ~20 MB. When running the script with async coroutines, it takes ~15 MB more, but threads take ~60 MB more RAM. And I have to mention that this test scenario was a very simple case; adding more CPU-heavy work and a very deep stack can make threads look even worse.
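A rough way to see why: a coroutine is one small Python object wrapping a single frame, while each thread reserves a full OS stack up front. A minimal sketch (sys.getsizeof counts only the coroutine object itself, not the objects its frame references, so treat the numbers as an illustration):

import resource
import sys

async def tiny():
    return 1

coro = tiny()
print(f"coroutine object: ~{sys.getsizeof(coro)} bytes")
coro.close()  # avoid the "coroutine was never awaited" warning

# on Linux, glibc uses the stack rlimit as the default stack reservation per thread
soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
if soft != resource.RLIM_INFINITY:
    print(f"default thread stack size: {soft // (1024 * 1024)} MB")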
Over this series, we’ve traced Python’s memory architecture from the ground up (exploring PyObjects, the distinction between heap and stack, and the inner workings of PyFrames). These foundational concepts reveal why Python behaves the way it does, and more importantly, they illuminate the profound efficiency of generators and coroutines.
In mastering these tools and concepts, we don't just write better Python code; we also learn to think in harmony with the underlying software and hardware layers, transforming our understanding of performance from an abstract concern into an intuitive mindset.