Analyzing "Sorting a million 32-bit integers in 2MB of RAM using Python"
2MB ought to be enough for anybody
Summary
We are going to revisit Guido's famous "Sorting a million 32-bit integers in 2MB of RAM using Python" article, explaining in detail how it works.
In the process, we will get to exercise rarely used modules like struct, array, heapq, etc., as well as generators.
You can grab the code, or read further.
A blast from the past
Today we are going to return to an old article from Guido Van Rossum published in 2008.
It was popular because the author of the language himself wrote it, answering what was clearly a half-joking question, and in Python 3, which was quite the hot topic at the time. But I'll take a wild guess and assume the vast majority of my readers have never heard of it.
It's an interesting article because it touches several Python limitations:
- Python is slow, particularly at manipulating numbers.
- Numbers in Python take a lot of RAM, since they are C structs under the hood. Double whammy in Python 3, which doesn't have shorts and longs anymore.
- Python's sorted() is excellent, but loads everything in memory.
- File handling in Python is very high level by default, so it's easy to use, but not performant.
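To put rough numbers on the first two limitations, here's a quick check (the exact sizes are CPython-specific, and the values below assume a 64-bit build):

```python
import sys
from array import array

# A small Python int is a full object: typically 28 bytes on 64-bit
# CPython, versus 4 bytes for a C signed int.
print(sys.getsizeof(1))

# array('i') stores raw 4-byte values back to back, while a list
# stores a pointer per element on top of the int objects themselves.
ints = array("i", range(1000))
print(ints.itemsize)  # 4 bytes per integer
```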
The approach chosen by Guido is a two-phase multiway merge sort that goes as follows: load data from stdin - so it assumes you pipe the numbers to the Python program - then dispatch sorted chunks of the data into several files. Eventually, it uses the heapq module to read everything back from those files, in order, and write that to stdout.
In the article, this looks like this:
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq

assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
Note the #!/usr/bin/env python3.0, isn't that cute :)
Also, amusingly, the first comment in the blog page is:
The formatting for the code examples is really crappy: the indentation is wrong for some lines, and there's not enough of it for most others. Ironically, for Python, it is hard to read
Always fun when people come over to tackle a legend.
But there is truth in that: like Substack, Blogger doesn't have proper formatting for code. Also, it's not indented with 4 spaces, most variables have one-letter names, and some flow choices could be made more obvious.
Let's work on it a little and explain what it does in detail.
An educational take
First, we need to generate a million unsorted 32-bit integers and put them in a file. Since one bit is reserved for the sign, that leaves 31 bits for the value, so we can broadly do:
import random
import struct

numbers = [random.randint(-2**31, 2**31 - 1) for _ in range(1000000)]

with open("random_numbers.bin", "wb") as file:
    # 'i' is the letter used by the formatting mini-language of the
    # struct module to turn a Python number into a C-like signed integer
    file.write(struct.pack(len(numbers) * "i", *numbers))
You'll notice we use the struct module to save the numbers in a packed binary format instead of text. It takes less memory and disk space, and is faster to load; Guido's script relies on that fact.
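A quick round trip shows what struct is doing under the hood:

```python
import struct

# Pack three signed 32-bit integers into 12 bytes of binary data
packed = struct.pack("3i", -1, 0, 2**31 - 1)
print(len(packed))  # 12: 4 bytes per integer

# Unpacking restores the original Python numbers
print(struct.unpack("3i", packed))  # (-1, 0, 2147483647)
```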
Now we have a "random_numbers.bin" file full of data. If I want to load it in memory and sort it all, I can do this:
import struct

data = open("random_numbers.bin", "rb").read()

# Signed ints are 4 bytes, so we extract the binary data in chunks of 4
# and turn them into Python numbers
numbers = struct.unpack((len(data) // 4) * 'i', data)
numbers = sorted(numbers)

with open("sorted_numbers.bin", "wb") as file:
    file.write(struct.pack(len(numbers) * "i", *numbers))
On a modern laptop it runs in less than a second, and takes about 10MB of RAM. So the brute force solution is fine in most contexts. Guido’s solution is nonetheless interesting, so let's continue our exploration.
We begin by rewriting the imports so that it's obvious what modules we load, but also to avoid the confusing double array, since I know some devs might wonder what's happening:
import sys
import tempfile
import heapq
from array import array
Nested modules and functions bearing the same name (like datetime.datetime) are one of my pet peeves.
Then we will make a small utility function that will make the code more accessible:
def int_array_from_file(int_file, buffer_size=4000):
    int_array = array("i")
    int_array.frombytes(int_file.read(buffer_size))
    return int_array
This function takes a file-like object (that's what we call file handles that match the standard file interface in the Python stdlib) as an input: int_file. The buffer_size param mostly exists to name the magic number 4000. It means we will read chunks of 4000 bytes at a time, not the entire file at once.
We create an array, not a list. This data structure can only contain a specific type of number, but it takes much less space in memory. We use "i", which, like with struct, is the value in the format mini-language of the array module that says "signed integer".
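We can see frombytes() at work on a few bytes, without any file involved:

```python
from array import array

original = array("i", [1, -2, 300])
data = original.tobytes()  # 12 bytes of raw binary data

restored = array("i")
restored.frombytes(data)   # the same call our helper uses
print(restored.tolist())   # [1, -2, 300]
```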
From there, we read 4000 bytes from the file, put them as numbers in our array, and return that. In Guido's script, this method is called fromstring, but it doesn't exist anymore under that name. The old name was an artifact of the bad string/bytes mixup from Python 2. In fact, you can still see a remnant of it in the doc:
>>> help(array)
Help on method_descriptor:

frombytes(self, buffer, /)
    Appends items from the string, interpreting it as an array of machine
    values, as if it had been read from a file using the fromfile() method.
"Appends items from the string". Yep.
Then we'll turn Guido's intsfromfile() function into this:
def read_buffered_ints(int_file):
    while True:
        int_array = int_array_from_file(int_file)
        if not int_array:
            return None
        yield from int_array
We write this function to make what's happening clearer: we use a return and not a break so that it's obvious the function exits there as well, and yield from, which didn't exist in 2008, to make the for loop obsolete.
If you are not comfortable with yield, we have an article on that. But it's important to understand that this is not actually a regular function: calling it produces a generator. This has consequences for the rest of the program.
To summarize, read_buffered_ints reads integers as bytes from a file, converts them into arrays, and yields each integer one by one. Why bother doing that? Because it loads and converts the integers while buffering them, only 4000 bytes at a time, avoiding consuming too much memory.
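We can check this lazy behavior with io.BytesIO standing in for a real file (the helpers are repeated so the snippet runs on its own):

```python
import io
import types
from array import array

def int_array_from_file(int_file, buffer_size=4000):
    int_array = array("i")
    int_array.frombytes(int_file.read(buffer_size))
    return int_array

def read_buffered_ints(int_file):
    while True:
        int_array = int_array_from_file(int_file)
        if not int_array:
            return None
        yield from int_array

# io.BytesIO acts as an in-memory file-like object
fake_file = io.BytesIO(array("i", [3, 1, 2]).tobytes())
ints = read_buffered_ints(fake_file)
print(isinstance(ints, types.GeneratorType))  # True: nothing read yet
print(list(ints))  # [3, 1, 2] - the file is only read now
```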
Let's open files for our inputs and outputs:
with open("random_numbers.bin", "rb") as int_file:
    ...
And:
with open("sorted_numbers.bin", "wb") as final_result:
    ...
Guido gets input from stdin and writes output to stdout, but I'm sure people will try this out and destroy their terminal with a bad pipe. So I will read the numbers from a file and save the result into another.
Once this is ready, we can go to the meat of the solution:
int_arrays_generators = []

# We load the numbers in chunks with this loop
while True:
    int_array = int_array_from_file(int_file)
    if not int_array:
        break

    # For each chunk, we create a temp file...
    temp_file = tempfile.TemporaryFile()
    # we sort the arrays, convert them back into bytes
    # then put them in the file...
    array("i", sorted(int_array)).tofile(temp_file)
    # and save a generator ready to give us back those ints, in our list
    temp_file.seek(0)
    int_arrays_generators.append(read_buffered_ints(temp_file))
That's a bit abstract, but the gist of it is:
- We go back and forth between the raw bytes and integers in memory, converting them as needed.
- We read only part of the data each time, sort it, and put the result in a temporary file.
- We don't store the whole result in memory! We store generators that will give us access to the result later.
Because generators are lazy, they don't take much memory. But because they hold state, they each keep a reference to a file containing the newly sorted integers.
Now for the magic:
int_buffer = array("i")

# heapq.merge gives us all integers IN ORDER
for integer in heapq.merge(*int_arrays_generators):
    # We store them all in an array, and once the array is big enough,
    # we append its content into the final result file
    int_buffer.append(integer)
    if len(int_buffer) >= 1000:
        int_buffer.tofile(final_result)
        del int_buffer[:]  # empty the array, it reached the size limit

# For the last array if it didn't reach 1000
if int_buffer:
    int_buffer.tofile(final_result)
This works thanks to heapq.merge, a function that can merge multiple sorted inputs into a single sorted output. It reads a little bit of each input, outputs the values in order, then reads a little more, and so on.
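A tiny demo of heapq.merge on its own, with plain lists as the sorted inputs:

```python
import heapq

# Three already-sorted inputs merged lazily into one sorted stream
merged = heapq.merge([1, 4, 7], [2, 5, 8], [3, 6, 9])
print(list(merged))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Note that merged is an iterator: the inputs are only consumed as you pull values from it, which is exactly why it plays so well with our generators.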
It works because int_arrays_generators is full of generators, and each generator gets its values from one of the sorted files. Because they are generators, they don't take much memory, and they output the content through a 4000-byte buffer. And because heapq.merge only reads a bit of them at a time, we get a thin stream of sorted data.
Here is the entire script (see colored version):
import sys
import tempfile
import heapq
from array import array


def int_array_from_file(int_file, buffer_size=4000):
    int_array = array("i")
    int_array.frombytes(int_file.read(buffer_size))
    return int_array


def read_buffered_ints(int_file):
    while True:
        int_array = int_array_from_file(int_file)
        if not int_array:
            return None
        yield from int_array


with open("random_numbers.bin", "rb") as int_file, open(
    "sorted_numbers.bin", "wb"
) as final_result:
    int_arrays_generators = []

    # We load the numbers in chunks with this loop
    while True:
        int_array = int_array_from_file(int_file, 40000)
        if not int_array:
            break

        # For each chunk, we create a temp file...
        temp_file = tempfile.TemporaryFile()
        # we sort the arrays, convert them back into bytes,
        # and put them in the file...
        array("i", sorted(int_array)).tofile(temp_file)
        # then save a generator ready to give us back those ints,
        # in our list
        temp_file.seek(0)
        generator = read_buffered_ints(temp_file)
        int_arrays_generators.append(generator)

    int_buffer = array("i")
    # heapq.merge gives us all the integers IN ORDER
    for integer in heapq.merge(*int_arrays_generators):
        # We store them all in an array, and once
        # the array is big enough,
        # we append its content into the final result file
        int_buffer.append(integer)
        if len(int_buffer) >= 1000:
            int_buffer.tofile(final_result)
            del int_buffer[:]  # empty the array, it reached the size limit

    # For the last array if it didn't reach 1000
    if int_buffer:
        int_buffer.tofile(final_result)
Lots of rarely used modules in this solution: struct, array, heapq... It uses context managers, generators, a little splat, and even del, which is not a keyword I type remotely often.
It does demonstrate that Python is quite flexible in the ways you can approach constraints. And while it is slower than the brute force version, and certainly not as efficient as anything you could write with a low-level language, it's nice to know those tools are at your disposal if you are limited in RAM.