Summary
Python is slow, but the community compensates by using fast compiled extensions like the numpy-based ecosystem, including pandas and scipy.
Because of the nature of those solutions, iterating over them manually will eat the performance gains you expect.
In general with numpy, pandas and co.:
- Perform math operations directly on the arrays, don't use a for loop on them.
- If numpy provides a function, don't use it on elements, use it on whole arrays.
- Be mindful of hidden loops, such as .apply() in pandas.
- Don't concatenate things too much, it creates a new array every time.
- String manipulations follow the same rules. It's not just for numbers.
The people who are not professional coders but need Python for their job are the ones most likely to make this mistake, because they don't have the time, resources or will to know about it.
Numpy is fast
Python is considered a slow language. It's particularly slow for mathematical calculations because:
- Function calls are slow.
- Iteration makes several calls at every turn.
- Numbers are not just numbers, they are big objects.
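To get a feel for that last point, you can compare the memory footprint of a single number (the exact figures depend on your platform, but on a 64-bit CPython they look like this):
>>> import sys
>>> import numpy as np
>>> sys.getsizeof(1)  # a Python int is a full object, with headers and a refcount
28
>>> np.dtype(np.int64).itemsize  # an int64 inside a numpy array is just raw bytes
8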
This is also what makes Python convenient, so instead of giving up on this, the community came up with compiled extensions. This gave us the popular numpy, and consequently pandas, which is built on numpy, so everything that applies to the former applies to the latter.
Let's measure a simple sum using ipython's wonderful %timeit.
Vanilla Python:
>>> %timeit sum(range(10000000))
96.9 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
>>> %timeit np.sum(np.arange(10000000))
16.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We get a 6X speed up, and the variation in performance between runs is much smaller.
Why?
First, numpy doesn't use Python's big objects to represent numbers, but uses types much closer to the machine. In fact, if you choose your types carefully, you can speed it up even more:
>>> np.sum(np.arange(10000000)).dtype # default type is still quite big
dtype('int64')
>>> %timeit np.sum(np.arange(10000000, dtype=np.int32))
10.3 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But also:
- Numpy's code base contains C, C++ and Cython, all compiled for better performance.
- Numpy uses low-level, fixed-size arrays whose layout matches the type of the data they hold.
- Numpy's API encourages performing calculations in numpy's internals and limiting conversions to slow Python types.
- Numpy uses vectorized algorithms, which are faster and limit Python loops.
However, numpy and pandas are well liked by Python users who are not expert coders. Indeed, they are powerful tools for data analysis, which means they are well suited for professionals dealing with a lot of data mangling, like in biology or finance.
Unfortunately, those very same people are the ones most likely to make the mistake I will talk about here, because they don't have the time, resources or will to know about it:
Numpy doesn't like iteration
The entire numpy API is organized to avoid iteration.
Instead of doing this:
>>> np.array([i + 1 for i in np.arange(3)])
array([1, 2, 3])
numpy encourages this:
>>> np.arange(3) + 1
array([1, 2, 3])
And for good reason.
Every time you extract data from a numpy array (or a pandas data frame), a conversion from numpy's optimized numbers to Python's fat and sluggish ones is applied. And when you perform an operation on the resulting element, instead of using numpy's vectorized algorithm, you are using a slow manual one.
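You can actually see the wrapping happen when you pull a single element out (the exact scalar type depends on your platform and numpy version):
>>> arr = np.arange(3)
>>> type(arr[0])  # every access builds a new Python-level wrapper object
<class 'numpy.int64'>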
Now, you've just spent the last two articles learning about iteration, so that kinda sucks, doesn't it?
It's only true for numpy-like libs though, and there are only a few of them compared to the hundreds of thousands of modules on PyPI, so you are safe. Iteration is still very useful.
You just have to use it sparingly around the parts of the code base that benefit from numpy.
Basically, if you see a for loop with a numpy array, you should stop and check whether you are supposed to do that.
Example of what not to do
>>> np.array([i + 1 / 2 * 3 for i in np.arange(3)])
array([1.5, 2.5, 3.5])
That should be written like:
>>> np.arange(3) + 1 / 2 * 3
array([1.5, 2.5, 3.5])
But also for function calls:
>>> np.array([np.cos(i) for i in np.arange(3)])
array([ 1. , 0.54030231, -0.41614684])
That should be written like:
>>> np.cos(np.arange(3))
array([ 1. , 0.54030231, -0.41614684])
And for mixing both:
>>> np.array([np.sin(np.cos(i) + 1 / 2 * 3) for i in np.arange(3)])
array([0.59847214, 0.89179191, 0.88376736])
That should be written like:
>>> np.sin(np.cos(np.arange(3)) + 1 / 2 * 3)
array([0.59847214, 0.89179191, 0.88376736])
The more you do, the bigger the difference in performance will be:
>>> %timeit np.array([np.sin(np.cos(i) + 1 / 2 * 3) for i in np.arange(10000)])
14 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.sin(np.cos(np.arange(10000)) + 1 / 2 * 3)
22.1 µs ± 389 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
That's about a 600x difference.
Find implicit numpy calls
Sometimes, this is not as obvious.
E.g., scipy uses numpy under the hood, and it may hand you callables:
from scipy.interpolate import interp1d
# Fake data acting like reference points
x = [0, 1, 2, 3, 4, 5]
y = [0, 2, 3, 4, 3, 1]
# Scipy uses the reference points to create an interpolation function
interp_func = interp1d(x, y)
# Some random x points we don't have a y value for
x_interp = [1.5, 3.4, 1.3]
# We interpolate them
y_interp = [interp_func(x) for x in x_interp]
>>> print(y_interp)
[array(2.5), array(3.6), array(2.3)]
We are not importing numpy, and we are using regular Python lists everywhere, so surely a for loop is OK?
Alas, interp_func accepts an entire list (or array) as a parameter, so it should be:
>>> y_interp = interp_func(x_interp)
>>> y_interp
array([2.5, 3.6, 2.3])
It's way, way faster. And also returns only one array, which is usually what you want.
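If you want to measure the gap yourself, a small sketch like this one (reusing interp_func from above, with the standard timeit module and a made-up batch of 10,000 points) will do:
import timeit
import numpy as np

x_interp = np.linspace(0, 5, 10_000)  # many points inside the [0, 5] reference range

looped = timeit.timeit(lambda: [interp_func(x) for x in x_interp], number=10)
vectorized = timeit.timeit(lambda: interp_func(x_interp), number=10)
print(f"looped: {looped:.3f}s, vectorized: {vectorized:.3f}s")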
for loops in disguise
Another trap is things that don't look like they are using a Python for loop, but they are under the hood.
pandas' dataframes have many methods that are a for loop, a function call and a raccoon in a trench coat.
E.g., the .apply() method:
>>> import pandas as pd
...
... df = pd.DataFrame({
...     'Name': ['Alice', 'Bob', 'Charlie', 'David'] * 10000,
...     'Age': [25, 30, 22, 27] * 10000,
... })
...
... def categorize_age(age):
...     if age < 25:
...         return 'Young'
...     elif age < 30:
...         return 'Adult'
...     return 'Senior'
...
... df['Age Group'] = df['Age'].apply(categorize_age)
...
... print(df)
Name Age Age Group
0 Alice 25 Adult
1 Bob 30 Senior
2 Charlie 22 Young
3 David 27 Adult
4 Alice 25 Adult
... ... ... ...
39995 David 27 Adult
39996 Alice 25 Adult
39997 Bob 30 Senior
39998 Charlie 22 Young
39999 David 27 Adult
[40000 rows x 3 columns]
Apply is a Python for loop in disguise. There is no simple substitution, though; you have to find the solution that matches your problem. Here we could do:
>>> bins = [0, 25, 30, float('inf')]
... labels = ['Young', 'Adult', 'Senior']
... df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
... print(df)
Name Age Age Group
0 Alice 25 Adult
1 Bob 30 Senior
2 Charlie 22 Young
3 David 27 Adult
4 Alice 25 Adult
... ... ... ...
39995 David 27 Adult
39996 Alice 25 Adult
39997 Bob 30 Senior
39998 Charlie 22 Young
39999 David 27 Adult
[40000 rows x 3 columns]
That's about a 4x difference in performance:
>>> %timeit df['Age'].apply(categorize_age)
2.72 ms ± 58.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pd.cut(df['Age'], bins=bins, labels=labels, right=False)
600 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
It's an even bigger trap because, with small datasets, apply() could be faster, as startup costs for any operation in pandas are quite big, and you will only see the problem once you get serious.
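If you want to find out where that crossover sits on your own machine, a rough sketch like this one (reusing categorize_age, bins and labels from above, with the standard timeit module instead of the %timeit magic) will give you an idea:
import timeit
import pandas as pd

# Compare both approaches at several sizes; the numbers will vary by machine.
for n in (100, 10_000, 1_000_000):
    sample = pd.DataFrame({'Age': [25, 30, 22, 27] * (n // 4)})
    t_apply = timeit.timeit(lambda: sample['Age'].apply(categorize_age), number=20)
    t_cut = timeit.timeit(
        lambda: pd.cut(sample['Age'], bins=bins, labels=labels, right=False),
        number=20,
    )
    print(f"{n:>9} rows: apply {t_apply:.4f}s, pd.cut {t_cut:.4f}s")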
String manipulations
Just because those libraries are specialized in numbers doesn't mean they can't manipulate text. If you have a text manipulation to perform, it might be tempting to use .apply():
>>> import re
>>> import pandas as pd
...
... df = pd.DataFrame({
...     'phone_numbers': [
...         '(123) 456 7890',
...         '555_678_1234',
...         '1 (800) 555 5555',
...         '123-456-7890',
...         '+91 9876543210'
...     ] * 100000
... })
...
... def normalize_phone_number(phone):
...     return re.sub(r'[^+\d]', "", phone)
...
... df['normalized_phone'] = df['phone_numbers'].apply(normalize_phone_number)
...
... print(df)
phone_numbers normalized_phone
0 (123) 456 7890 1234567890
1 555_678_1234 5556781234
2 1 (800) 555 5555 18005555555
3 123-456-7890 1234567890
4 +91 9876543210 +919876543210
... ... ...
49995 (123) 456 7890 1234567890
49996 555_678_1234 5556781234
49997 1 (800) 555 5555 18005555555
49998 123-456-7890 1234567890
49999 +91 9876543210 +919876543210
[50000 rows x 2 columns]
But you can often use the built-in string manipulation functions, once again, to avoid a for loop:
df['normalized_phone'] = df['phone_numbers'].str.replace(r'[^+\d]', '', regex=True)
print(df)
The difference in performance is not as dramatic as with numbers, but it's still significant:
>>> %timeit df['phone_numbers'].str.replace(r'[^+\d]', '', regex=True)
317 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['phone_numbers'].apply(normalize_phone_number)
501 ms ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Iteration on creation
If you do need iteration to create some dataset because numpy doesn't provide a good way to do it, do not repeatedly append to the numpy array (or pandas data frame).
Let's say you have an imaginary generator that outputs something from a data source you can't control and doesn't have a numpy/pandas adapter, but is iterable. That’s a lot to imagine, I’m delegating the creativity here.
Don't do:
all_the_things = np.array([])
for important_stuff in get_very_important_items():
    all_the_things = np.append(all_the_things, important_stuff.things())
This is extremely inefficient, because it recreates a new array every time you loop.
Make a whole iterable, and pass it to numpy in one go. np.array would just wrap the generator object itself, so np.fromiter is the better tool here (it needs an explicit dtype):
import itertools
things = (important_stuff.things() for important_stuff in get_very_important_items())
all_the_things = np.fromiter(itertools.chain.from_iterable(things), dtype=float)  # pick the dtype matching your data
Yes, I was running out of imagination at that point. Do you know how much time I spend just trying to come up with examples for those articles?
I'm not even sure the whole snippet runs. But it's 10 p.m. so someone in the comments will report the typos and I'll fix them tomorrow.
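The same logic applies to pandas: concatenating or appending to a dataframe inside a loop reallocates the whole thing every time. A minimal sketch, still assuming the imaginary get_very_important_items() source and that each .things() call returns something pandas can treat as a row (a dict or a sequence):
import pandas as pd

# Accumulate plain Python rows first, then build the dataframe in one go.
rows = [important_stuff.things() for important_stuff in get_very_important_items()]
df = pd.DataFrame(rows)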
When iterating with numpy is OK
There are cases where iterating is perfectly OK.
Mostly, the rule is "if the lib doesn't cater to this use case".
E.g.:
- Printing the content in a beautiful way (see the sketch below).
- Sending the data to a system that doesn't understand numpy arrays and can't take all the data at once.
- In the shell, because you want to see something quickly and you forgot how to do it the vectorized way.
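For the first case, a plain loop over a few rows to pretty-print them is perfectly fine (a small sketch using the Name/Age dataframe from the apply() example):
>>> for name, age in zip(df['Name'].head(3), df['Age'].head(3)):
...     print(f"{name} is {age} years old")
Alice is 25 years old
Bob is 30 years old
Charlie is 22 years old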
So don't be afraid to loop, just think before you do.
And by think I mean ask ChatGPT first.