Hi, everyone. This is yumyum116.
This article is part of a series of how standard library functions work.
I am glad that this will help beginners understand the underlying mechanisms behind these functions.
This topic arose from a personal experience in which I encountered a TLE, despite using an efficient algorithm to solve the problem. After investigating, I discovered that the issue was caused by calling the print function too many times within the program.
Based on this experience, this article explains how the print function works internally and why excessive use of it can lead to a TLE.
1. Example of a TLE Despite Using an Efficient Algorithm
In this chapter, I introduce an example of a program that results in a TLE despite using an efficient algorithm.
The program determines whether a given number is prime using the Sieve of Eratosthenes. At first glance, the algorithm itself is efficiet - but can you identify which part of the code causes of the TLE?
For reference, the input number satisfies the following conditions:
conditions:
MAX_A = 6000000
def eratosthenes(n):
is_prime = [True] * (n + 1)
is_prime[0] = is_prime[1] = False
for i in range(2, int(n ** 0.5) + 1):
if is_prime[i]:
for j in range(i * i, n + 1, i):
is_prime[j] = False
return is_prime
n = int(input())
arr = [int(input()) for _ in range(n)]
for i in range(n):
print("prime" if is_prime(arr[i]) else "not prime")
Next, I introduce a program that that fixes the TLE issue.
MAX_A = 6000000
def eratosthenes(n):
is_prime = [True] * (n + 1)
is_prime[0] = is_prime[1] = False
for i in range(2, int(n ** 0.5) + 1):
if is_prime[i]:
for j in range(i * i, n + 1, i):
is_prime[j] = False
return is_prime
n = int(input())
arr = [int(input()) for _ in range(n)]
is_prime_table = eratosthenes(MAX_A)
out = []
for x in arr:
out.append("prime" if is_prime_table[x] else "not prime")
print("\n".join(out))
In the next chapter, let's take a closer look at why the TLE happened.
2. What Happens Internally When Executing a Python Program
Before diving into the main discussion, let's take a look at what actually happens when a Python program is executed.
This section is a bit long, but understanding of this flow will help you build a deeper intuition about how Python programs work under the hood.
At a high level, the execution flow looks like this:
- Execute a Python program. -- the python interpreter starts runnung --
- Perform lexical analysis by breaking the source code into tokens.
- Generate a sequence of tokens.
- Parse the token sequence.
- Build an Abstract Syntax Tree (AST).
- Generate code objects from the AST.
- Compile the code objects into bytecode.
- Execute the bytecode on the Python virtual machine. -- the python interpreter completes execution --
- Execute machine instructions on the CPU.
Now, let's walk through a simple example.
Consider the following Python program (1).
# test.py
print("Hi, how are you?")
Lexical Analysis
When the Python interpreter runs test.py, it first performs lexical analysis.
During this process, the source code is broken down into tokens such as Hi, ,, how, are, you, and ?.
Parsing
Based on the tokens generated during lexical analysis, the interpreter builds a data structure called an Abstract Syntax Tree (AST) through parsing.
Let's take an actual look at the AST objects generated when executing test.py.
$ python
> import ast
> tree = ast.parse('print("Hi, how are you?")')
>
> print(ast.dump(tree, indent=4))
Module(
body=[
Expr(
value=Call(
func=Name(id='print', ctx=Load()),
args=[
Constant(value='Hi, how are you?')],
keywords=[]))],
type_ignores=[])
>
If you want to better grasp the structure of an AST, using a sentence that includes mathematical expressions can be very helpful. However, that is beyond the scope of this article.
Now, let's break down the generated AST.
① func=Name(id = 'print', ctx=Load())
This means the identifier print is loaded as a value.
ctx, which is one of the arguments of the Name node, specifies how the identifier is used. It can be set to Store() when assigning a value, Load() when reading a value, or Del() when deleting an element.
Structurally, this can be summarized as follows:
A
Namenode is a parsing node that contains the following information:
- The presence of the identifier
- How the identifier is used in context (in this example, it is used as
Load()).
② args=[Constant(value='Hi, how are you?')]
This represents a structural node that holds the value of an argument passed to a function. In computer science terms, this is an AST node that represents a string literal.
For reference, the Constant node has been used since Python 3.8, whereas Str was used in earlier versions of Python (prior to 3.8).
③ Call(...)
This node represents a function call statement and stores the following information:
i.
func- information about the called object
ii.args- expressions to be evaluated as positional arguments
iii.keywords- expressions to be evaluated as keyword arguments
④ Expr(...)
This node represents an expression whose purpose is only to produce output.
There are many other nodes at the same hierarchical level as Expr, each serving a different role. However, due to the scope of this article, I will introduce those nodes in a separate article.
⑤ Module(...)
This node represents the root AST node of a .py file.
As a supplement, body=[...] is a list of statements included in the source code, and type_ignores=[] stores additional information for type checkers. For example, it records the line numbers of comments that instruct the type checker to ignore type errors.
Generate Code Objects from the AST
In this step, the following processes are performed.
① Analyze the AST and perform the following tasks:
(i) Determine whether each variable is local, global, or free.
(ii) Register constants in the constant table.
(iii) Build a code object for each function and class.② Build the structural body of a
PyCodeObject
Ideally, the following elements are constructed as the internal structure of the code object.
CodeObject {
co_consts
co_names
co_varnames
co_freevars
co_cellvars
co_flags
co_code
}
Generate bytecode
After completing syntax analysis, the Python interpreter generates bytecode from the AST.
During this process, a source file named compile.c is executed. This file implements the compiler that translates the AST into bytecode.
The resulting bytecode is expressed as follows:
>> import dis
>>> dis.dis('print("Hi, how are you?")')
0 0 RESUME 0
1 2 PUSH_NULL
4 LOAD_NAME 0 (print)
6 LOAD_CONST 0 ('Hi, how are you?')
8 CALL 1
16 RETURN_VALUE
This representation is close to programs written in assembly or machine language, and LOAD_NAME 0 corresponds to a single bytecode instruction.
The full list of bytecode instructions can be found in opcode.h.
From a computer science perspective, this process converts the AST into instructions for a stack machine by traversing the AST nodes.
Conceptually, the following sequence of instructions is generated:
co_code = [
LOAD_NAME print
LOAD_CONST "Hi, how are you?"
CALL 1
RETURN_VALUE
]
Through this process, the bytecode is transformed into a form that the virtual machine can interpret directly.
Execute the bytecode on the Python virtual machine
As I mentioned in the section Generate bytecode, the Python virtual machine is a type of stack machine, which primarily uses a stack during calculation.
A stack is a data structure used to store values in such a way that new data is added on top of existing data. When data is removed, the most recently added value is taken first. This behavior is known as Last In, First Out(LIFO).
The bytecode generated by the Python interpreter is designed to be executed efficiently on a stack-based virtual machine.
Below, you can see a simplified explanation of how the previously shown bytecode is executed. For clarify, some details are omitted, so this description is not perfectly precise, but it should help build intuition.
| Instruction | Meaning |
|---|---|
| RESUME 0 | Represent the start of a function call |
| PUSH_NULL | Push NULL onto the stack to indicate that this is not a method call |
| LOAD_NAME 0 (print) | Push the value of the variable print onto the stack |
| LOAD_CONST 0 ('Hi, how are you?') | Push the value of the variable Hi, how are you? onto the stack |
| CALL 1 | Pop the number of values specified by the variable argc from the top of the stack, and call the corresponding callable object |
| RETURN_VALUE | Return to the original caller |
On the Python virtual machine, this bytecode is executed sequentially from the top, with each instruction performing operations that push the resulting Python objects onto the stack.
Execute machine instructions on the CPU
The CPU executes programs that have been loaded into memory. In the case of Python, the CPU executes the machine instructions that implement the Python virtual machine.
Let me briefly explain what machine instructions are.
Machine instructions represent operations using binary values composed of zeros and ones. For readability, hexadecimal notation is often used so that humans can more easily interpret them. If you are interested, you can open a .pyc file using a binary editor to see this representation yourself.
In the case of test.py, the machine instructions would look like the following. Note that these are shown in hexadecimal for human readability and differ from the actual machine instructions executed directly by the CPU.
Now, let's return to the main discussion.
For example, the CALL 1 bytecode instruction corresponds to invoking a specific case in a switch statement in C, conceptually described as follows:
switch (opcode){
case CALL:
/* Call a function with the given arguments */
}
To describe the entire flow precisely -- from execution on the Python virtual machine to execution on the CPU --it can be summarized as follows:
The
CALL 1instruction is read by CPython's C implementation, which branches to the correspondingcase CALL:in a switch statement. The CPU then executes the machine instructions that implement thatcase CALL:within CPython itself.
At the Python level, the callable function is written as print. However, the actual callable object is implemented in CPython's C code, specifically as builtin_print_impl.
The above describes the complete flow of how a Python program is executed.
3. What Happens When the print Function Is Called?
Now, let's take a closer look at the behavior of the print function.
Briefly speaking, print is not part of the standard library--it is a built-in function. Built-in functions are implemented directly in CPython's C source code.
You can find the function object for print in the CPython repository here.
As mentioned in the previous section, the impolementation corresponding to print is builtin_print_impl.
To keep the discussion focused, I will paste the relevant part of the original source code below.
static PyObject *
builtin_print_impl(PyObject *module, PyObject *args, PyObject *sep,
PyObject *end, PyObject *file, int flush)
/*[clinic end generated code: output=3cfc0940f5bc237b input=c143c575d24fe665]*/
{
int i, err;
if (file == Py_None) {
PyThreadState *tstate = _PyThreadState_GET();
file = _PySys_GetAttr(tstate, &_Py_ID(stdout));
if (file == NULL) {
PyErr_SetString(PyExc_RuntimeError, "lost sys.stdout");
return NULL;
}
/* sys.stdout may be None when FILE* stdout isn't connected */
if (file == Py_None) {
Py_RETURN_NONE;
}
}
if (sep == Py_None) {
sep = NULL;
}
else if (sep && !PyUnicode_Check(sep)) {
PyErr_Format(PyExc_TypeError,
"sep must be None or a string, not %.200s",
Py_TYPE(sep)->tp_name);
return NULL;
}
if (end == Py_None) {
end = NULL;
}
else if (end && !PyUnicode_Check(end)) {
PyErr_Format(PyExc_TypeError,
"end must be None or a string, not %.200s",
Py_TYPE(end)->tp_name);
return NULL;
}
for (i = 0; i < PyTuple_GET_SIZE(args); i++) {
if (i > 0) {
if (sep == NULL) {
err = PyFile_WriteString(" ", file);
}
else {
err = PyFile_WriteObject(sep, file, Py_PRINT_RAW);
}
if (err) {
return NULL;
}
}
err = PyFile_WriteObject(PyTuple_GET_ITEM(args, i), file, Py_PRINT_RAW);
if (err) {
return NULL;
}
}
if (end == NULL) {
err = PyFile_WriteString("\n", file);
}
else {
err = PyFile_WriteObject(end, file, Py_PRINT_RAW);
}
if (err) {
return NULL;
}
if (flush) {
PyObject *tmp = PyObject_CallMethodNoArgs(file, &_Py_ID(flush));
if (tmp == NULL) {
return NULL;
}
Py_DECREF(tmp);
}
Py_RETURN_NONE;
}
When the print function is called, characters are displayed on the standard output through the following sequenc of steps:
- Execute Python source code.
- Compile the source code into Python bytecode.
- Execute the bytecode on the Cpython virtual machine.
- Invoke the built-in
printfunction. - Call
file.write(). - Call the C standard library function
write(). --The steps above are executed within the Python runtime layer.-- - Invoke a system call handled by the operating system kernel.
- Output the characters to the standard output.
The connection between these steps and the previous sections may not be immediately clear, so before explaining each operation in detail, I will first provide some additional context.
The steps above describe the observable behavior at a high level, while the CPU is continuously executing instructions behind the scenes.
From the CPU's perspective, the steps above can be described as follows:
While executing the machine instructions that implement the CPython virtual machine, the CPU reaches a
CALLinstruction and invokes the machine instructions corresponding to the built-inPyFile_WriteObjecttoFileIO.write, and finally to thewritesystem call.
Visually, the process can be illustrated as follows:
CPU
└─ CPython VM(machine instructions)
└─ builtin print(machine instructions)
└─ PyFile_WriteObject(machine instructions)
└─ FileIO.write(machine instructions)
└─ libc write(machine instructions)
└─ kernel write(machine instructions)
With this overview in mind, let's move on to a detailed explanation of the entire execution flow of the print function.
Invoke the Built-in print Function
Here, the actual callable object is defined as follows:
static PyObject *
builtin_print_impl(PyObject *module, PyObject *args, PyObject *sep,
PyObject *end, PyObject *file, int flush)
When this object is invoked, the following steps are executed:
① Receive the given arguments as
PyObject*values.
② Interpret thesep,end,file, andflushparameters.
③ Determine the output destination (file), which defaults tosys.stdout.
Invoke the file.write() method on the output file object
In the following C implementation, the write method of the Python file object is invoked.
PyFile_WriteObject(obj, file, Py_PRINT_RAW);
Within builtinmodule.c, which was introduced in the previous section, the following function corresponds to this behavior.
PyFile_WriteObject(sep, file, Py_PRINT_RAW)
In effect, this is equivalent to calling sys.stdout.write(...) at the Python level.
Invoke the standard library function write()
Python's sys.stdout is composed of multiple layers of wrapper objects, as illustrated below.
TextIOWrapper
└─ BufferedWriter
└─ FileIO
The C implementation of FileIO.write() is located in Module/_io/fileio.c, and the function _io_FileIO_write_impl provides the low-level implementation of FileIO.write().
/*[clinic input]
_io.FileIO.write
cls: defining_class
b: Py_buffer
/
Write buffer b to file, return number of bytes written.
Only makes one system call, so not all of the data may be written.
The number of bytes actually written is returned. In non-blocking mode,
returns None if the write would block.
[clinic start generated code]*/
static PyObject *
_io_FileIO_write_impl(fileio *self, PyTypeObject *cls, Py_buffer *b)
/*[clinic end generated code: output=927e25be80f3b77b input=2776314f043088f5]*/
{
Py_ssize_t n;
int err;
if (self->fd < 0)
return err_closed();
if (!self->writable) {
_PyIO_State *state = get_io_state_by_cls(cls);
return err_mode(state, "writing");
}
n = _Py_write(self->fd, b->buf, b->len);
/* copy errno because PyBuffer_Release() can indirectly modify it */
err = errno;
if (n < 0) {
if (err == EAGAIN) {
PyErr_Clear();
Py_RETURN_NONE;
}
return NULL;
}
return PyLong_FromSsize_t(n);
}
(source)
The function prototype is shown below:
_Py_write(fd, buf, size)
At this level, execution transitions from the Python layer to the C I/O layer.
Invoke a System Call Handled by the Operating System Kernel.
In the entire execution of the print function, this step is the most expensive.
During a system call, the following operations occur:
① Transition from user space to kernel space.
② Perform a context switch.
③ Write to standard output within the operating system.
Transition from User Space to Kernel Space
This step means that the CPU switches its execution mode.
User space is where ordinary applications, such as Python programs, CPython itself and standard libraries, run. Code in user space cannot directly access hardware devices or protected memory.
Kernel space is where the operating system runs. Device operations, file I/O and process management are handled in this space.
The print function must transition from user space to kernel space in order to perform device-related operations. You can think of this transition as occurring when a system call, such as write(), is invoked.
Perform a Context Switch
This step means that the CPU switches its execution context.
There are two types of context switches. One is (A) a transition from user mode to kernel mode, as described above. The other is (B) a process switch, where the CPU switches from one process to another. In the case of the print function, the important context switch is (A).
This mode transition caused by a system call is the primary reason why I/O operations are expensive.
The Operating System Handles Standard Output
Briefly speaking, this step sends an instruction to the operating system that says, "Write these characters to the file descriptor whose value is 1."
(File descriptor 1 corresponds to standard output.)
Conceptually, standard output is processed as follows.
Execute sys_write(fd=1, buf)
↓
Resolve the file descriptor to an internal file structure
↓
Route the output to the corresponding device, file, or pipe
↓
Apply buffering if necessary
To summarize this flow more simply:
When
print()is called, CPython invokeswrite(), which switches the CPU execution mode from user mode to kernel mode.
The operating system then resolves the file descriptor with value1(standard output) and writes the data to the appropriate destination, such as a terminal, file, or pipe.
These kernel-level operations and device I/O are significantly more expensive than the mode switch itself, which is why frequent calls to print() can easily become a performance bottleneck.
Output Characters To the Standard Output
The kernel sends the characters to the appropriate output destination , which in this case is the terminal.
4. The Cause of the TLE: Calling print Inside a for Loop
First of all, thank you for staying with me up to this point.
As stated in the heading, the cause of the TLE I encountered was the repeated use of the print function inside a for loop.
More precisely, the implementation introduced in the first chapter, which is described below, triggers a TLE because print is executed on every iteration of the loop.
for i in range(n):
print("prime" if is_prime(arr[i]) else "not prime")
Each time the loop variable
is incremented by 1, the print function is called.
As explained throughout this article, calling print involves kernel-level operations and device I/O. Executing these expensive operations on every iteration significantly degrades performance.
For example, when the maximum input value of 380,000 is provided, the print() function is invoked 380,000 times.
This workload is simply too heavy for the CPU and the operating system to handle efficiently.
This example clearly demonstrates that--even when using an efficient algorithm--an inappropriate implementation choice can lead to disastrous performance under the given input constraints.
Now, let's take another look at the revised program.
out = []
for x in arr:
out.append("prime" if is_prime_table[x] else "not prime")
print("\n".join(out))
No matter how large the input value is, collecting the output values in an array results in calling the print function only once. When you compare a single call to print with 380,000 calls, the difference in CPU workload becomes immediately clear.
This experience taught me an important lesson: when you encounter a TLE despite using an efficient algorithm, suspect I/O operations.
I hope this article has helped you gain a deeper understanding of what happens behind the scenes and why such performance issues occur.
If you find any mistakes, please let me know through the feedback form. I will revise them as quickly as possible.
See you again in the next article!
Top comments (0)