How not to sort in Python

You can't make me

Aug 26, 2023

Summary

Sometimes you want to sort, and sometimes you just want things to stay sorted. For the latter, we can use heapq to create a queue that guarantees to pop each element in order, bisect to manually insert at the right place, or sortedcontainers, to be lazy yet productive.

Actually, the laziest and most productive solution is likely to not do any of that and brute force it as your problem space likely allows it. But that would make for a short article.

Heap queue

Sorting in Python is convenient and pretty fast, thanks to TimSort. Yet, by design, sorting will traverse your iterable, sometimes many times. If you have to insert an element in your iterable in a loop, and then make sure it's in order every time, sorting it for that purpose will get something in the ball park of O(n ^ 2) algo complexity. If this means nothing to you, it's not great perf.

The typical example is the priority queue. You have a list where you insert a task with a priority attached, and the next task to perform is the one with the most urgent priority:

Let's say I have those very important tasks:

tasks = [
    (5, 'Defeat Malenia to unlock spreadsheet functionality'),
    (1, 'Collect triforces to enable Wi-Fi'),
    (8, 'Use dance emotes to encrypt hard drive'),
    (3, 'Throw masterball to clear browser cache'),
    (10, 'Score a goal to install software updates'),
    (7, 'Perform a fatality to make coffee'),
    (2, 'Open a portal to order pizza'),
    (6, 'Unlock an achievement to send an email'),
    (8, "Kiss Bowser for shits and giggles")
]

Now, If I want to add them all in a priority queue:

stuff_to_do = []

def add_action(priority, task):
    stuff_to_do.append((priority, task))
    stuff_to_do.sort(reverse=True)

for priority, task in tasks:
    add_action(priority, task)

if stuff_to_do:
    print(stuff_to_do.pop())
# (1, 'Collect triforces to enable Wi-Fi')

That's going to sort every single time, which is not ideal. Now, you might interject, why not insert them all and sort later? Yes, in this simple example used to explain the problem, this is, of course, the best course of action. But priority queues usually exist in the context of data comming from outside the system, continuously. You don't have access locally to all the tasks to do, you get a stream of constantly new messages that pop in. So use your imagination to make this easy code fit the IRL situation.

One possible solution to this is to use a heap queue, which, in the Python stdlib, is provided out of the box with the heapq module:

import heapq

stuff_to_do_but_well = []

for priority, task in tasks:
    heapq.heappush(stuff_to_do_but_well, (priority, task))

if stuff_to_do_but_well:
    print(heapq.heappop(stuff_to_do_but_well))
# (1, 'Collect triforces to enable Wi-Fi')

It seems to do exactly the same, but the stuff to do is stored differently. Without heapq, our tasks are just in reverse order:

>>> from pprint import pprint
>>> pprint(stuff_to_do)
[(10, 'Score a goal to install software updates'),
 (8, 'Use dance emotes to encrypt hard drive'),
 (8, 'Kiss Bowser for shits and giggles'),
 (7, 'Perform a fatality to make coffee'),
 (6, 'Unlock an achievement to send an email'),
 (5, 'Defeat Malenia to unlock spreadsheet functionality'),
 (3, 'Throw masterball to clear browser cache'),
 (2, 'Open a portal to order pizza')]

But heapq stores the element in a binary tree, where every parent node has a value less than or equal to any of its children:

>>> pprint(stuff_to_do_but_well)
[(2, 'Solve a puzzle in Portal to order pizza'),
 (3, 'Capture a Pokémon to clear browser cache'),
 (4, 'Win a match in Apex Legends to feed the cat'),
 (5, 'Defeat the final boss in Mario to unlock spreadsheet functionality'),
 (6, 'Unlock an achievement in Halo to send an email'),
 (8, 'Use Fortnite dance emotes to encrypt hard drive'),
 (7, 'Perform a Mortal Kombat fatality to make coffee'),
 (10, 'Score a goal in FIFA to install software updates'),
 (9, 'Complete a lap in Mario Kart to water plants')]

Ok, it's no super obvious looking at it like that, but it means inserting things in that way doesn't require a full sort, and popping the queue always gets the most important element first.

While this approach works well for priority queues because you have to pop the item every time, it's not great for other types of problems where you don't want to consume the queue. Indeed, heapq doesn't have a way to iterate on the heap, only ways to add and remove elements.

But we can solve this problem using Python bisect module.

Where are you?

Bisection is a fast way to search an element in a sorted list. This is typically something you learn at university, but given how many people learn to code by themselves, I'm not assuming everybody knows what it is.

Maybe you know this game where a person thinks about a number between 1 and 100, and you have to figure out which one it is? You give a guess, the persons answers "it's bigger" or "it's smaller", and you guess a new time. So you pick 50, then if it's lower, you pick 25, then if it's higher, you pick 32, and so on. It would be silly to ask, "is it one?" and then "is it two?" until 100.1

But if you search for something in a list, this is by default what the computer does. It goes through the entire list until it finds the element.

Bisect works the same way you would in the game: assuming the elements are sorted, it will first pick up the middle element, then check if it's a match, bigger or smaller, then move on from that. The only difference is that Python lib will not return to the place of the matching element, but where to insert it to maintain order.

On a big list, the difference between brute force and bisect is huge. Let's make one with random numbers:

import random
sorted_random_numbers = sorted(random.randint(0, 10000000) for _ in range(10000000))

Now we get a random thing to find in the list and try to figure out its index:

The brute force approach:

%timeit number_to_find = random.choice(sorted_random_numbers); sorted_random_numbers.index(number_to_find)
187 ms ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not only does it take 187ms, which is not fast, but the variability is huge, because the algo performs drastically faster or slower if the element is at the beginning or the end of the list.

But the bisect version tells another story:

import bisect
%timeit number_to_find = random.choice(sorted_random_numbers); bisect.bisect_left(sorted_random_numbers, number_to_find)
4.39 µs ± 319 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

It's more than 45 times faster, and it has way less variability.

We can use this to insert our element on our priority list while keeping it in the same order. bisect even provide the insort() function for this:

import bisect

stuff_to_do = []

def add_action(priority, task):
    bisect.insort(stuff_to_do, (-priority, task)) # negate priority for reversal

for priority, task in tasks:
    add_action(priority, task)

if stuff_to_do:
    print(stuff_to_do.pop())
# (-1, 'Collect triforces to enable Wi-Fi')

This is quite nifty and all, but there is one problem: Python lists are slow at inserting elements in the middle of it, which this does. Even bisect docs say:

Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.

A solution would be to keep "n" low, by chunking the list into small buckets, then build an abstraction for the binary search to read and insert as if it were a list.

But who wants to do that? Well, apparently a guy named Grant Jenks does.

Sorted containers

There is software that you remember for a certain element of elegance. The absolute simplicity of sqlite. The wonderful robustness of redis. The astonishing versatility of VLC.

Sorded containers is one of those. It's a simple lib, of a few files, with damn clean and well-documented code, that solves a single problem and does it well.

It provides lists, sets and dictionaries that stay sorted, without having to actually sort them.

It's pure Python, and it's fast. It's fast because it bisects things, but maintain the whole collection in small buckets, so insertion is not so costly.

Here is our problem solved with sorted container (after a pip install sortedcontainers, of course):

from sortedcontainers import SortedList

stuff_to_do = SortedList()

def add_action(priority, task):
    stuff_to_do.add((-priority, task))

for priority, task in tasks:
    add_action(priority, task)

if stuff_to_do:
    priority, task = stuff_to_do.pop()
    print(priority, task)
# (-1, 'Collect triforces to enable Wi-Fi')

Simple, you use it like a normal list, and it does the magic behind the scenes. The SortedDict is particularly nice, because it lets you easily fetch a range of data between two keys. I used it with a client this year. We built a huge sorted index for a slow calculation, put it in cache, and just never calculate it again, as we can fetch any range of results from anywhere in our problem boundary.

Or just don't do it

I mean, most data is so small, brute forcing the search or sorting every time is probably faster than those solutions in their real-life context. Especially since modern computers are absolute beast.

All those solutions might actually even be slower on a small list.

But you know what they say:

O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there

We can call it a learning experience.

You probably used bisecting to debug a program: you remove half of it and see if the bug is always there, then you keep the buggy half, and you divide it again to figure our which part is crashing, and so on, until you find the faulty line. Which contained a typo.

Bite code!

Discussion about this post