Marvin

Posted on Mar 13, 2022

Quick Peek Under the Hood: Indexing and Slicing in Python

#python

Overview

In Python, indexing and slicing are ways to get elements from iterables (like lists, tuples, and even string).

ingredients = ['beef', 'onions', 'tomato', 'potato', 'carrot']
print(ingredients[1])    # onions
print(ingredients[1:4])  # ['onions', 'tomato', 'potato']

This is something I have to look at recently. I am personally not a fan of deep dives in the C-level source code since things can always change (for example, ROT_TWO is set to be replaced in Python 3.11). Nevertheless, I think reading C code as a Python developer is a fun experience I want to share.

Below codes are performed on Python 3.10.

AST

In CPython, actual Python syntax go through parsing and compiling.

Python code gets parsed and converted into an abstract syntax tree. The ast module can help us illustrate this step.

import ast

print(ast.dump(ast.parse('[][0]', mode='eval'), indent=4))
# Result:
# Expression(
#     body=Subscript(
#         value=List(elts=[], ctx=Load()),
#         slice=Constant(value=0),
#         ctx=Load()))

print(ast.dump(ast.parse('[][::-1]', mode='eval'), indent=4))
# Result: 
# Expression(
#     body=Subscript(
#         value=List(elts=[], ctx=Load()),
#         slice=Slice(
#             step=UnaryOp(
#                 op=USub(),
#                 operand=Constant(value=1))),
#         ctx=Load()))

What we see in the two code snippets above is that both indexing and slicing are essentially the same operation, Subscript.

Subscript takes a value, a slice, and a ctx. We can learn from here that the difference between indexing and slicing is that:

indexing is when a Constant tree is passed into the Subscript's slice; whereas
slicing is when a Slice tree is passed otherwise.

Bytecode

The abstract syntax tree as shown above will eventually get compiled into bytecode. The dis module help us see it.

from dis import dis

dis('[][0]')
# Result: 
#   1           0 BUILD_LIST               0
#               2 LOAD_CONST               0 (0)
#               4 BINARY_SUBSCR
#               6 RETURN_VALUE

dis('[][::-1]')
# Result: 
#   1           0 BUILD_LIST               0
#               2 LOAD_CONST               0 (None)
#               4 LOAD_CONST               0 (None)
#               6 LOAD_CONST               1 (-1)
#               8 BUILD_SLICE              3
#              10 BINARY_SUBSCR
#              12 RETURN_VALUE

What we're seeing in these snippets is that Subscript is referred to as BINARY_SUBSCR at bytecode level. We can also see that if any of start, stop, or step is not supplied, None is put in place.

From the docs: the Slice Object

Part of the Python's builtins is the slice function which generates the

From the docs:

Slice objects are also generated when extended indexing syntax is used. For example: a[start:stop:step] or a[start:stop, i].

We can infer that when slicing ingredients[1:4], a slice(1,4) is generated along the way.

Under the C

The bytecodes such as BUILD_SLICE and BINARY_SUBSCR can be found at Python's ceval.c

Below are snippets:

TARGET(BINARY_SUBSCR) {
    PREDICTED(BINARY_SUBSCR);
    PyObject *sub = POP();
    PyObject *container = TOP();
    PyObject *res = PyObject_GetItem(container, sub);
    Py_DECREF(container);
    Py_DECREF(sub);
    SET_TOP(res);
    if (res == NULL)
        goto error;
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_SUBSCR);
    DISPATCH();
}

TARGET(BUILD_SLICE) {
    PyObject *start, *stop, *step, *slice;
    if (oparg == 3)
        step = POP();
    else
        step = NULL;
    stop = POP();
    start = TOP();
    slice = PySlice_New(start, stop, step);
    Py_DECREF(start);
    Py_DECREF(stop);
    Py_XDECREF(step);
    SET_TOP(slice);
    if (slice == NULL)
        goto error;
    DISPATCH();
}

Some Stuff

As stated above, it's indexing if a Contstant (i.e., int) is passed; whereas it's slicing if a Slice is passed.

Can we pass a slice() in __getitem__?

# Turns out, yes
# 
print(ingredients[slice(1,4)])
# Result:
# ['onions', 'tomato', 'potato']
# 
print(ingredients.__getitem__(slice(1,4)))
# Same result

From this line from the C code:

PyObject *res = PyObject_GetItem(container, sub);

we can infer that PyObject_GetItem is the C equivalent of __getitem__ from the data model. We now know that indexing, slicing and getting a value from a dictionary are all just __getitem__. This should not be a surprise considering the syntax, but I am.

(As a side note: if I had just read the Data Model docs, this readings done to make this post wouldn't have been necessary.)

Some implementations for the slice object is in Objects/sliceobject.c. There's a lot going in here.

In one of the snippets above, it is PySlice_New which creates a new slice object (it's been around since it was first introduced).

Looking at this commit, it seems that PySlice_New works like a def __init__ which simply returns a PyObject struct with attributes start, stop, and step. Calculations happen elsewhere. For example, the pointer *start is calculated on five occasions.

if (r->start == Py_None) {
    *start = *step < 0 ? length-1 : 0;
} else {
    if (!PyLong_Check(r->start)) return -1;
    *start = PyLong_AsSsize_t(r->start);
    if (*start < 0) *start += length;
}

if (r->start == Py_None) {
    *start = *step < 0 ? PY_SSIZE_T_MAX : 0;
}

if (*start < 0) {
    *start += length;
    if (*start < 0) {
        *start = (step < 0) ? -1 : 0;
    }
}
else if (*start >= length) {
    *start = (step < 0) ? length - 1 : length;
}

I don't think I need to understand this yet. Moreso, there's a lot of pointers at play in order to really read it, you need to stop thinking in Python and start thinking in C.

Another cool thing to note is a factory function at the bottom called slice_new, which, looking at the pull request, seem to be the slice builtin.

DEV Community