The Bytecode Interpreter (3.11)#

Preface#

The CPython 3.11 bytecode interpreter (a.k.a. virtual machine) has a number of improvements over 3.10. We describe the inner workings of the 3.11 interpreter here, with an emphasis on understanding not just the code but its design. While the interpreter is forever evolving, and the 3.12 design will undoubtedly be different again, knowing the 3.11 design will help you understand future improvements to the interpreter.

Introduction#

The job of the bytecode interpreter, in Python/ceval.c, is to execute Python code. Its main input is a code object, although this is not a direct argument to the interpreter. The interpreter is structured as a (recursive) function taking a thread state (tstate) and a stack frame (frame). The function also takes an integer throwflag, which is used by the implementation of generator.throw. It returns a new reference to a Python object (PyObject *) or an error indicator, NULL. Per PEP 523, this function is configurable by setting interp->eval_frame; we describe only the default function, _PyEval_EvalFrameDefault(). (This function’s signature has evolved and no longer matches what PEP 523 specifies; the thread state argument is added and the stack frame argument is no longer an object.)

The interpreter finds the code object by looking in the stack frame (frame->f_code). Various other items needed by the interpreter (e.g. globals and builtins) are also accessed via the stack frame. The thread state stores exception information and a variety of other information, such as the recursion depth. The thread state is also used to access per-interpreter state (tstate->interp) and per-runtime (i.e., truly global) state (tstate->interp->runtime).

Note the slightly confusing terminology here. “Interpreter” refers to the bytecode interpreter, a recursive function. “Interpreter state” refers to state shared by threads, each of which may be running its own bytecode interpreter. A single process may even host multiple interpreters, each with their own interpreter state, but sharing runtime state. The topic of multiple interpreters is covered by several PEPs, notably PEP 684, PEP 630, and PEP 554 (with more coming). The current document focuses on the bytecode interpreter.

Code objects#

The interpreter uses a code object (frame->f_code) as its starting point. Code objects contain many fields used by the interpreter, as well as some for use by debuggers and other tools. In 3.11, the final field of a code object is an array of indeterminate length containing the bytecode, code->co_code_adaptive. (In previous versions the code object was a bytes object, code->co_code; it was changed to save an allocation and to allow it to be mutated.)

Code objects are typically produced by the bytecode compiler, although they are often written to disk by one process and read back in by another. The disk version of a code object is serialized using the marshal protocol. Some code objects are pre-loaded into the interpreter using Tools/scripts/deepfreeze.py, which writes Python/deepfreeze/deepfreeze.c.

Code objects are nominally immutable. Some fields (including co_code_adaptive) are mutable, but mutable fields are not included when code objects are hashed or compared.

Instruction decoding#

The first task of the interpreter is to decode the bytecode instructions. Bytecode is stored as an array of 16-bit code units (_Py_CODEUNIT). Each code unit contains an 8-bit opcode and an 8-bit argument (oparg), both unsigned. In order to make the bytecode format independent of the machine byte order when stored on disk, opcode is always the first byte and oparg is always the second byte. Macros are used to extract the opcode and oparg from a code unit (_Py_OPCODE(word) and _Py_OPARG(word)). Some instructions (e.g. NOP or POP_TOP) have no argument – in this case we ignore oparg.

A simple instruction decoding loop would look like this:

_Py_CODEUNIT *first_instr = code->co_code_adaptive;
_Py_CODEUNIT *next_instr = first_instr;
while (1) {
    _Py_CODEUNIT word = *next_instr++;
    unsigned char opcode = _Py_OPCODE(word);
    unsigned int oparg = _Py_OPARG(word);
    switch (opcode) {
    // ... A case for each opcode ...
    }
}

This format supports 256 different opcodes, which is sufficient. However, it also limits oparg to 8-bit values, which is not. To overcome this, the EXTENDED_ARG opcode allows us to prefix any instruction with one or more additional data bytes. For example, this sequence of code units:

EXTENDED_ARG  1
EXTENDED_ARG  0
LOAD_CONST    2

would set opcode to LOAD_CONST and oparg to 65538 (i.e., 0x1_00_02). The compiler should limit itself to at most three EXTENDED_ARG prefixes, to allow the resulting oparg to fit in 32 bits, but the interpreter does not check this. A series of code units starting with zero to three EXTENDED_ARG opcodes followed by a primary opcode is called a complete instruction, to distinguish it from a single code unit, which is always two bytes. The following loop, to be inserted just above the switch statement, will make the above snippet decode a complete instruction:

while (opcode == EXTENDED_ARG) {
    word = *next_instr++;
    opcode = _Py_OPCODE(word);
    oparg = (oparg << 8) | _Py_OPARG(word);
}

For various reasons we’ll get to later (mostly efficiency, given that EXTENDED_ARG is rare) the actual code is different.

Jumps#

Note that when the switch statement is reached, next_instr (the “instruction offset”) already points to the next instruction. Thus, jump instructions can be implemented by manipulating next_instr:

  • An absolute jump (JUMP_ABSOLUTE) sets next_instr = first_instr + oparg.

  • A relative jump forward (JUMP_FORWARD) sets next_instr += oparg.

  • A relative jump backward sets next_instr -= oparg.

A relative jump whose oparg is zero is a no-op.

Inline cache entries#

Some (specialized or specializable) instructions have an associated “inline cache”. The inline cache consists of one or more two-byte entries included in the bytecode array as additional words following the opcode /oparg pair. The size of the inline cache for a particular instruction is fixed by its opcode alone. Moreover, the inline cache size for a family of specialized/specializable instructions (e.g., LOAD_ATTR, LOAD_ATTR_SLOT, LOAD_ATTR_MODULE) must all be the same. Cache entries are reserved by the compiler and initialized with zeros. If an instruction has an inline cache, the layout of its cache can be described by a struct definition and the address of the cache is given by casting next_instr to a pointer to the cache struct. The size of such a struct must be independent of the machine architecture, word size and alignment requirements. For 32-bit fields, the struct should use _Py_CODEUNIT field[2]. Even though inline cache entries are represented by code units, they do not have to conform to the opcode / oparg format.

The instruction implementation is responsible for advancing next_instr past the inline cache. For example, if an instruction’s inline cache is four bytes (i.e., two code units) in size, the code for the instruction must contain next_instr += 2;. This is equivalent to a relative forward jump by that many code units. (The proper way to code this is JUMPBY(n), where n is the number of code units to jump, typically given as a named constant.)

Serializing non-zero cache entries would present a problem because the serialization (marshal) format must be independent of the machine byte order.

More information about the use of inline caches can be found in PEP 659.

The evaluation stack#

Apart from unconditional jumps, almost all instructions read or write some data in the form of object references (PyObject *). The CPython 3.11 bytecode interpreter is a stack machine, meaning that it operates by pushing data onto and popping it off the stack. The stack is a pre-allocated array of object references. For example, the “add” instruction (which used to be called BINARY_ADD in 3.10 but is now BINARY_OP 0) pops two objects off the stack and pushes the result back onto the stack. An interesting property of the CPython bytecode interpreter is that the stack size required to evaluate a given function is known in advance. The stack size is computed by the bytecode compiler and is stored in code->co_stacksize. The interpreter uses this information to allocate stack.

The stack grows up in memory; the operation PUSH(x) is equivalent to *stack_pointer++ = x, whereas x = POP() means x = *--stack_pointer. There is no overflow or underflow check (except when compiled in debug mode) – it would be too expensive, so we really trust the compiler.

At any point during execution, the stack level is knowable based on the instruction pointer alone, and some properties of each item on the stack are also known. In particular, only a few instructions may push a NULL onto the stack, and the positions that may be NULL are known. A few other instructions (GET_ITER, FOR_ITER) push or pop an object that is known to be an iterator.

Instruction sequences that do not allow statically knowing the stack depth are deemed illegal. The bytecode compiler never generates such sequences. For example, the following sequence is illegal, because it keeps pushing items on the stack:

LOAD_FAST 0
JUMP_BACKWARD 2

Do not confuse the evaluation stack with the call stack, which is used to implement calling and returning from functions.

Error handling#

When an instruction like BINARY_OP encounters an error, an exception is raised. At this point, a traceback entry is added to the exception (by PyTraceBack_Here()) and cleanup is performed. In the simplest case (absent any try blocks), this results in the remaining objects being popped off the evaluation stack and their reference count decremented (if not NULL) . Then the interpreter function (_PyEval_EvalFrameDefault()) returns NULL.

However, if an exception is raised in a try block, the interpreter must jump to the corresponding except or finally block. In 3.10 and before, there was a separate “block stack” which was used to keep track of nesting try blocks. In 3.11, this mechanism has been replaced by a statically generated table, code->co_exceptiontable. The advantage of this approach is that entering and leaving a try block normally does not execute any code, making execution faster. But of course, this table needs to be generated by the compiler, and decoded (by get_exception_handler) when an exception happens.

Exception table format#

The table is conceptually a list of records, each containing four variable-length integer fields (in a unique format, see below):

  • start: start of try block, in code units from the start of the bytecode

  • length: size of the try block, in code units

  • target: start of the first instruction of the except or finally block, in code units from the start of the bytecode

  • depth_and_lasti: the low bit gives the “lasti” flag, the remaining bits give the stack depth

The stack depth is used to clean up evaluation stack entries above this depth. The “lasti” flag indicates whether, after stack cleanup, the instruction offset of the raising instruction should be pushed (as a PyLongObject *). For more information on the design, see Objects/exception_handling_notes.txt.

Each varint is encoded as one or more bytes. The high bit (bit 7) is reserved for random access – it is set for the first varint of a record. The second bit (bit 6) indicates whether this is the last byte or not – it is set for all but the last bytes of a varint. The low 6 bits (bits 0-5) are used for the integer value, in big-endian order.

To find the table entry (if any) for a given instruction offset, we can use bisection without decoding the whole table. We bisect the raw bytes, at each probe finding the start of the record by scanning back for a byte with the high bit set, and then decode the first varint. See get_exception_handler() in Python/ceval.c for the exact code (like all bisection algorithms, the code is a bit subtle).

The locations table#

Whenever an exception is raised, we add a traceback entry to the exception. The tb_lineno field of a traceback entry must be set to the line number of the instruction that raised it. This field is computed from the locations table, co_linetable (this name is an understatement), using PyCode_Addr2Line(). This table has an entry for every instruction rather than for every try block, so a compact format is very important.

The full design of the 3.11 locations table is written up in Objects/locations.md. While there are rumors that this file is slightly out of date, it is still the best reference we have. Don’t be confused by Objects/lnotab_notes.txt, which describes the 3.10 format. For backwards compatibility this format is still supported by the co_lnotab property.

The 3.11 location table format is different because it stores not just the starting line number for each instruction, but also the end line number, and the start and end column numbers. Note that traceback objects don’t store all this information – they store the start line number, for backward compatibility, and the “last instruction” value. The rest can be computed from the last instruction (tb_lasti) with the help of the locations table. For Python code, a convenient method exists, co_positions(), which returns an iterator of (line, endline, column, endcolumn) tuples, one per instruction. There is also co_lines() which returns an iterator of (start, end, line) tuples, where start and end are bytecode offsets. The latter is described by PEP 626; it is more compact, but doesn’t return end line numbers or column offsets. From C code, you have to call PyCode_Addr2Location().

Fortunately, the locations table is only consulted by exception handling (to set tb_lineno) and by tracing (to pass the line number to the tracing function). In order to reduce the overhead during tracing, the mapping from instruction offset to line number is cached in the _co_linearray field.

Exception chaining#

When an exception is raised during exception handling, the new exception is chained to the old one. This is done by making the __context__ field of the new exception point to the old one. This is the responsibility of _PyErr_SetObject() in Python/errors.c (which is ultimately called by all PyErr_Set*() functions). Separately, if a statement of the form raise X from Y is executed, the __cause__ field of the raised exception (X) is set to Y. This is done by PyException_SetCause(), called in response to all RAISE_VARARGS instructions. A special case is raise X from None, which sets the __cause__ field to None (at the C level, it sets cause to NULL).

(TODO: Other exception details.)

Python-to-Python calls#

The _PyEval_EvalFrameDefault() function is recursive, because sometimes the interpreter calls some C function that calls back into the interpreter. In 3.10 and before, this was the case even when a Python function called another Python function: The CALL instruction would call the tp_call dispatch function of the callee, which would extract the code object, create a new frame for the call stack, and then call back into the interpreter. This approach is very general but consumes several C stack frames for each nested Python call, thereby increasing the risk of an (unrecoverable) C stack overflow.

In 3.11, the CALL instruction special-cases function objects to “inline” the call. When a call gets inlined, a new frame gets pushed onto the call stack and the interpreter “jumps” to the start of the callee’s bytecode. When an inlined callee executes a RETURN_VALUE instruction, the frame is popped off the call stack and the interpreter returns to its caller, by popping a frame off the call stack and “jumping” to the return address. There is a flag in the frame (frame->is_entry) that indicates whether the frame was inlined (set if it wasn’t). If RETURN_VALUE finds this flag set, it performs the usual cleanup and returns from _PyEval_EvalFrameDefault() altogether, to a C caller.

A similar check is performed when an unhandled exception occurs.

The call stack#

Up through 3.10, the call stack used to be implemented as a singly-linked list of PyFrameObject objects. This was expensive because each call would require a heap allocation for the stack frame. (There was some optimization using a free list, but this was not always effective, because frames are variable length.)

In 3.11, frames are no longer fully-fledged objects. Instead, a leaner internal _PyInterpreterFrame structure is used, which is allocated using a custom allocator, _PyThreadState_BumpFramePointer(). Usually a frame allocation is just a pointer bump, which improves memory locality. The function _PyEvalFramePushAndInit() allocates and initializes a frame structure.

Sometimes an actual PyFrameObject is needed, usually because some Python code calls sys._getframe() or an extension module calls PyEval_GetFrame(). In this case we allocate a proper PyFrameObject and initialize it from the _PyInterpreterFrame. This is a pessimization, but fortunately happens rarely (as introspecting frames is not a common operation).

Things get more complicated when generators are involved, since those don’t follow the push/pop model. (The same applies to async functions, which are implemented using the same infrastructure.) A generator object has space for a _PyInterpreterFrame structure, including the variable-size part (used for locals and eval stack). When a generator (or async) function is first called, a special opcode RETURN_GENERATOR is executed, which is responsible for creating the generator object. The generator object’s _PyInterpreterFrame is initialized with a copy of the current stack frame. The current stack frame is then popped off the stack and the generator object is returned. (Details differ depending on the is_entry flag.) When the generator is resumed, the interpreter pushes the _PyInterpreterFrame onto the stack and resumes execution. (There is more hairiness for generators and their ilk; we’ll discuss these in a later section in more detail.)

(TODO: Also frame layout and use, and “locals plus”.)

All sorts of variables#

The bytecode compiler determines the scope in which each variable name is defined, and generates instructions accordingly. For example, loading a local variable onto the stack is done using LOAD_FAST, while loading a global is done using LOAD_GLOBAL. The key types of variables are:

  • fast locals: used in functions

  • (slow or regular) locals: used in classes and at the top level

  • globals and builtins: the compiler does not distinguish between globals and builtins (though the specializing interpreter does)

  • cells: used for nonlocal references

(TODO: Write the rest of this section. Alas, the author got distracted and won’t have time to continue this for a while.)

Other topics#

(TODO: Each of the following probably deserves its own section.)

  • co_consts, co_names, co_varnames, and their ilk

  • How calls work (how args are transferred, return, exceptions)

  • Generators, async functions, async generators, and yield from (next, send, throw, close; and await; and how this code breaks the interpreter abstraction)

  • Eval breaker (interrupts, GIL)

  • Tracing

  • Setting the current lineno (debugger-induced jumps)

  • Specialization, inline caches etc.