View this post in Jupyter notebook: https://nbviewer.jupyter.org/github/nowke/notebooks/blob/master/Reverse%20engineering%20Python%27s%20memory%20-%20Integers.ipynb
If you are using Python for more than a year, you would have heard about different variants of it.
- CPython - original implementation which can be downloaded from python.org. Python code is compiled to bytecode using C.
- Jython - uses Java to convert the program to bytecode. Enables the program to run on JVM.
- PyPy - written in Python itself! It uses a Just-In-Time (JIT) compiler, which translates directly to machine code (instead of bytecode). Since native machine code is faster than running bytecode, PyPy provides speed improvements compared to CPython.
Here we talk only about CPython variant.
Basic data types
Determining the size of primitive data types
C
, similar to other programming languages has built-in primitive data types - int
, char
, long
, bool
etc.
Let's look at how we can find the size of each data types in your machine (I use 64-bit MacBook Air). We can use ctypes
module.
import ctypes
types = [('int', ctypes.c_int), ('char', ctypes.c_char),
('long', ctypes.c_long), ('bool', ctypes.c_bool),
('voidp', ctypes.c_voidp), ('ulong', ctypes.c_ulong)]
for ctype in types:
size = ctypes.sizeof(ctype[1])
print(f'c_{ctype[0]:<5} - {size} byte(s)')
c_int - 4 byte(s)
c_char - 1 byte(s)
c_long - 8 byte(s)
c_bool - 1 byte(s)
c_voidp - 8 byte(s)
c_ulong - 8 byte(s)
Variable representation
In Python, when a variable is declared var1 = 5
, var1
doesn't hold the value 5
(like in C). Rather, var1
points to the object containing value 5
.
var1 = 5
var2 = 5
Accessing address of variables
We can access the logical address of a variable using built-in the id()
function.
id(var1), id(var2), id(5)
(4418119136, 4418119136, 4418119136)
var1
, var2
and 5
refer to the same object in memory.
Let us inspect the address of integers from 1
to 10
and the difference between the addresses.
def print_addr(nums):
for num in nums:
addr = id(num)
print(f'Num = {num:<2}, id = {addr}, diff = {addr - id(num-1)}')
print_addr(range(1, 11))
Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32
Num = 5 , id = 4418119136, diff = 32
Num = 6 , id = 4418119168, diff = 32
Num = 7 , id = 4418119200, diff = 32
Num = 8 , id = 4418119232, diff = 32
Num = 9 , id = 4418119264, diff = 32
Num = 10, id = 4418119296, diff = 32
The numbers are placed in contiguously in memory. But is it true for large numbers? Let's try from 250
to 265
print_addr(range(250, 266))
Num = 250, id = 4418126976, diff = 32
Num = 251, id = 4418127008, diff = 32
Num = 252, id = 4418127040, diff = 32
Num = 253, id = 4418127072, diff = 32
Num = 254, id = 4418127104, diff = 32
Num = 255, id = 4418127136, diff = 32
Num = 256, id = 4418127168, diff = 32
Num = 257, id = 4463633392, diff = 45506224
Num = 258, id = 4463633328, diff = 32
Num = 259, id = 4463633296, diff = -96
Num = 260, id = 4463633392, diff = 64
Num = 261, id = 4463633328, diff = 32
Num = 262, id = 4463633296, diff = -96
Num = 263, id = 4463633392, diff = 64
Num = 264, id = 4463633328, diff = 32
Num = 265, id = 4463633296, diff = -96
Let's try negative numbers
print_addr(range(-10, 5))
Num = -10, id = 4462902928, diff = -730432
Num = -9, id = 4463633360, diff = 96
Num = -8, id = 4463633264, diff = 730336
Num = -7, id = 4462902928, diff = -730432
Num = -6, id = 4463633360, diff = 96
Num = -5, id = 4418118816, diff = -44784112
Num = -4, id = 4418118848, diff = 32
Num = -3, id = 4418118880, diff = 32
Num = -2, id = 4418118912, diff = 32
Num = -1, id = 4418118944, diff = 32
Num = 0 , id = 4418118976, diff = 32
Num = 1 , id = 4418119008, diff = 32
Num = 2 , id = 4418119040, diff = 32
Num = 3 , id = 4418119072, diff = 32
Num = 4 , id = 4418119104, diff = 32
We can observe that the numbers are contiguous from -5
to 256
. This is because, Python pre-allocates memory for numbers from -5
to 256
and creates the objects before your program starts.
Integers
Accessing integers from address
Using id()
function, we saw how to convert from variable -> address. How about the other way. Given an address, how to dig what's stored inside?
x = 45
addr_x = id(x)
print(addr_x)
4418120416
The function ctypes.<type>.from_address
takes address
and returns the <type>
value. Let's give addr_x
to c_int
type. Remember: We need to get 45
back from addr_x
print(ctypes.c_int.from_address(addr_x))
print(ctypes.c_int.from_address(addr_x + 8))
print(ctypes.c_int.from_address(addr_x + 16))
print(ctypes.c_int.from_address(addr_x + 24))
c_int(81)
c_int(122850480)
c_int(1)
c_int(45)
x_value = ctypes.c_int.from_address(addr_x + 24)
x_value.value
45
Why add 24
to the original address? An integer is stored similar to the below struct
in C (actual struct maybe different). The allocated 32
bytes are divided as follows.
struct {
long ob_refcnt; // long = 8 bytes (c_long)
void* ob_type; // void* = 8 bytes (c_voidp)
long ob_size; // long = 8 bytes (c_long)
uint ob_digit[]; // uint[] = (8 x ob_size) bytes (c_uint)
}
For the digit 45
, array size of ob_digit
will be 1
. Now let's unpack the above fields using the same from_address
function and actually make sense of the returned values.
1. ob_refcnt
-> represents the total number of references made to the object
ob_refcnt_x = ctypes.c_long.from_address(addr_x)
print(f'Total references made to {x} - {ob_refcnt_x.value}')
Total references made to 45 - 88
2. ob_type
-> Points to the type object, in our case int
Since int
is an object, it has an address in memory!
id(int)
4417817776
ob_type_x = ctypes.c_void_p.from_address(addr_x + 8)
ob_type_x.value
4417817776
Observe that, id(int)
is same as ob_type_x.value
3. ob_size
-> size of ob_digit
array.
We will see about its use further below
ob_size_x = ctypes.c_long.from_address(addr_x + 16)
ob_size_x.value
1
4. ob_digit[]
ob_digit_x = ctypes.c_uint.from_address(addr_x + 24)
ob_digit_x.value
45
print(f'''
Integer {x} is represented as,
struct_ {{
long ob_refcnt = {ctypes.c_long.from_address(addr_x).value}
void* ob_type = {ctypes.c_long.from_address(addr_x + 8).value}
long ob_size = {ctypes.c_long.from_address(addr_x + 16).value}
uint ob_digit[] = {ctypes.c_long.from_address(addr_x + 24).value}
}}
''')
Integer 45 is represented as,
struct_ {
long ob_refcnt = 91
void* ob_type = 4417817776
long ob_size = 1
uint ob_digit[] = 45
}
Big integers
In Python, we can declare and operate on integers of size as big as you want!
big_x = 2**1000
big_x
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
big_x + 1
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069377
How does Python store this value? It uses the fields ob_size
and ob_digit[]
. The formula for calculating the integer representation stored inside ob_digit
is as follows
where n = ob_size
Example 1
ob_size = 1
ob_digit = [45]
value = ob_digit[0] x (1024^3)^0
= 45 x 1
= 45
Example 2
ob_size = 2
ob_digit = [2, 3]
value = (ob_digit[0] x (1024^3)^0) +
(ob_digit[1] x (1024^3)^1)
= (2 x 1) + (3 x 1024^3)
= 3,221,225,474
Let's verify Example 2
y = 3221225474
addr_y = id(y)
y, addr_y
(3221225474, 4463632688)
ob_size_y = ctypes.c_long.from_address(addr_y + 16)
ob_size_y.value
2
ob_digit_y_0 = ctypes.c_uint.from_address(addr_y + 24)
ob_digit_y_0
c_uint(2)
ob_digit_y_1 = ctypes.c_uint.from_address(addr_y + 28)
ob_digit_y_1
c_uint(3)
Ta! We got back the values ob_size = 2
and ob_digit = [2, 3]
Size of an integer
Both x
and y
take 32 bytes
of memory. How about even bigger integers? We can make use of sys.getsizeof
function
import sys
sys.getsizeof(x), sys.getsizeof(y)
(28, 32)
for power in range(10, 101, 10):
num = 10 ** power
size = sys.getsizeof(num)
print(f'Number = 10^{power:<3}, Size = {size}')
Number = 10^10 , Size = 32
Number = 10^20 , Size = 36
Number = 10^30 , Size = 40
Number = 10^40 , Size = 44
Number = 10^50 , Size = 48
Number = 10^60 , Size = 52
Number = 10^70 , Size = 56
Number = 10^80 , Size = 60
Number = 10^90 , Size = 64
Number = 10^100, Size = 72
Recreating integers
We made use of from_address
to retrieve various parts of an integer. Let's actually write a custom structure that takes an integer using ctypes.Structure
abstract base class.
class IntStructure(ctypes.Structure):
_fields_ = [("ob_refcnt", ctypes.c_long),
("ob_type" , ctypes.c_void_p),
("ob_size" , ctypes.c_long),
("ob_digit" , ctypes.c_uint * 0)]
def __repr__(self):
digits = self.get_digits()
int_size = self.calc_size()
int_num = self.calc_integer(digits)
return (f'Int(ob_refcnt={self.ob_refcnt},\n'
f' ob_size={self.ob_size},\n'
f' ob_digit={digits},\n'
f' int_size={int_size} bytes,\n'
f' int_num={int_num})')
def get_digits(self):
"""Internal representation of `ob_digit` as a list"""
digits = ctypes.cast(
ctypes.byref(self.ob_digit),
ctypes.POINTER(ctypes.c_uint * self.ob_size)
).contents
return list(digits)
def calc_size(self):
"""Number of bytes consumed by the integer"""
size = (ctypes.sizeof(ctypes.c_long) +
ctypes.sizeof(ctypes.c_void_p) +
ctypes.sizeof(ctypes.c_long) +
ctypes.sizeof(ctypes.c_long))
if self.ob_size > 2:
size += (self.ob_size - 2) * ctypes.sizeof(ctypes.c_uint)
return size
def calc_integer(self, digits):
"""Retrieve the actual integer"""
return sum(
[digit * (1024**3)**i
for i, digit in enumerate(digits)]
)
a = 50
addr_a = id(a)
int_a = IntStructure.from_address(addr_a)
int_a
Int(ob_refcnt=77,
ob_size=1,
ob_digit=[50],
int_size=32 bytes,
int_num=50)
b = 10**40
addr_b = id(b)
int_b = IntStructure.from_address(addr_b)
int_b
Int(ob_refcnt=1,
ob_size=5,
ob_digit=[0, 668304384, 172751547, 175927511, 7523],
int_size=44 bytes,
int_num=10000000000000000000000000000000000000000)
Changing the value of an integer
Referred from r/Python
Do not do this anywhere in your actual programs! Very bad things can happen
Let's modify the integer 120
# Define a simpler structure (similar to IntStructure)
class MyInt(ctypes.Structure):
_fields_ = [("ob_refcnt", ctypes.c_long),
("ob_type" , ctypes.c_void_p),
("ob_size" , ctypes.c_ulong),
("ob_digit" , ctypes.c_long)]
addr_120 = id(120)
struct_120 = MyInt.from_address(addr_120)
struct_120.ob_digit = 25
120
25
120 + 120
50
60 + 60
25
What next?
Using similar techniques, we can decode how the basic data structures are implemented efficiently in Python.
- Python Lists
- Dictionaries
Top comments (0)