Summary
Python is slow, but the community compensates by using fast compiled extensions like the numpy-based ecosystem, including pandas and scipy.
Because of the nature of those solutions, iterating over them manually will eat the performance gains you expect.
In general with numpy, pandas and co.:
- Perform math operations directly on the arrays, don't use a for loop on them.
- If numpy provides a function, don't use it on elements, use it on whole arrays.
- Be mindful of hidden loops, such as .apply() in pandas.
- Don't concatenate things too much, it creates a new array every time.
- String manipulations follow the same rules. It's not just for numbers.
The people who are not professional coders but need Python for their job are the ones most likely to make this mistake, because they don't have the time, resources or will to know about it.
Numpy is fast
Python is considered a slow language. It's particularly slow for mathematical calculations because:
- Function calls are slow.
- Iteration makes several calls at every turn.
- Numbers are not just numbers, they are big objects.
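To get a feel for that last point, you can compare the memory footprint of a single number (the exact figures depend on your platform, but on a 64-bit CPython they look like this):
>>> import sys
>>> import numpy as np
>>> sys.getsizeof(1)  # a Python int is a full object, with headers and a refcount
28
>>> np.dtype(np.int64).itemsize  # an int64 inside a numpy array is just raw bytes
8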
This is also what makes Python convenient, so instead of giving up on this, the community came up with compiled extensions. This gave us the popular numpy, and consequently pandas, which is built on numpy, so everything that applies to the former applies to the latter.
Let's measure a simple sum using ipython's wonderful %timeit.
Vanilla Python:
>>> %timeit sum(range(10000000))
96.9 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy:
>>> %timeit np.sum(np.arange(10000000))
16.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We get a 6X speed up, and the variation in performance between runs is much smaller.
Why?
First, numpy doesn't use Python's big objects to represent numbers, but uses types much closer to the machine. In fact, if you choose your types carefully, you can speed it up even more:
>>> np.sum(np.arange(10000000)).dtype # default type is still quite big
dtype('int64')
>>> %timeit np.sum(np.arange(10000000, dtype=np.int32))
10.3 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But also:
- Numpy's code base contains C, C++ and Cython, all compiled for better performance.
- Numpy uses low-level, fixed-size arrays whose layout matches the type of the data they hold.
- Numpy's API encourages performing calculations in numpy's internals and limiting conversions to slow Python types.
- Numpy uses vectorized algorithms, which are faster and limit Python loops.
However, numpy and pandas are well liked by Python users who are not expert coders. Indeed, they are powerful tools for data analysis, which means they are well suited for professionals dealing with a lot of data mangling, like in biology or finance.
Unfortunately, those very same people are the ones most likely to make the mistake I will talk about here, because they don't have the time, resources or will to know about it:
Numpy doesn't like iteration
The entire numpy API is organized to avoid iteration.
Instead of doing this:
>>> np.array([i + 1 for i in np.arange(3)])
array([1, 2, 3])
numpy encourages this:
>>> np.arange(3) + 1
array([1, 2, 3])
And for good reason.
Every time you extract data from a numpy array (or a pandas data frame), a conversion from numpy's optimized numbers to Python's fat and sluggish ones is applied. And when you perform an operation on the resulting element, instead of using numpy's vectorized algorithm, you are using a slow manual one.
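You can actually see the wrapping happen when you pull a single element out (the exact scalar type depends on your platform and numpy version):
>>> arr = np.arange(3)
>>> type(arr[0])  # every access builds a new Python-level wrapper object
<class 'numpy.int64'>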
Now, you've just spent the last two articles learning about iteration, so that kinda sucks, doesn't it?
It's only true for numpy-like libs though, and there are only a few of them compared to the hundreds of thousands of modules on PyPI, so you are safe. Iteration is still very useful.
You just have to use it sparingly around the parts of the code base that benefit from numpy.
Basically, if you see a for loop with a numpy array, you should stop and check whether you are supposed to do that.
Example of what not to do
>>> np.array([i + 1 / 2 * 3 for i in np.arange(3)])
array([1.5, 2.5, 3.5])
That should be written like:
>>> np.arange(3) + 1 / 2 * 3
array([1.5, 2.5, 3.5])
But also for function calls:
>>> np.array([np.cos(i) for i in np.arange(3)])
array([ 1. , 0.54030231, -0.41614684])
That should be written like:
>>> np.cos(np.arange(3))
array([ 1. , 0.54030231, -0.41614684])
And for mixing both:
>>> np.array([np.sin(np.cos(i) + 1 / 2 * 3) for i in np.arange(3)])
array([0.59847214, 0.89179191, 0.88376736])
That should be written like:
>>> np.sin(np.cos(np.arange(3)) + 1 / 2 * 3)
array([0.59847214, 0.89179191, 0.88376736])
The more you do, the bigger the difference in performance will be:
>>> %timeit np.array([np.sin(np.cos(i) + 1 / 2 * 3) for i in np.arange(10000)])
14 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.sin(np.cos(np.arange(10000)) + 1 / 2 * 3)
22.1 µs ± 389 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
That's about a 600x difference.
Find implicit numpy calls
Sometimes, this is not as obvious.
E.g., scipy uses numpy under the hood, and it may hand you callables:
from scipy.interpolate import interp1d
# Fake data acting like reference points
x = [0, 1, 2, 3, 4, 5]
y = [0, 2, 3, 4, 3, 1]
# Scipy uses the reference points to create an interpolation function
interp_func = interp1d(x, y)
# Some random x points we don't have a y value for
x_interp = [1.5, 3.4, 1.3]
# We interpolate them
y_interp = [interp_func(x) for x in x_interp]
>>> print(y_interp)
[array(2.5), array(3.6), array(2.3)]
We are not importing numpy, and we are using regular Python lists everywhere, so surely a for loop is OK?
Alas, interp_func accepts an entire list (or array) as a parameter, so it should be:
>>> y_interp = interp_func(x_interp)
>>> y_interp
array([2.5, 3.6, 2.3])
It's way, way faster. And also returns only one array, which is usually what you want.
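If you want to measure the gap yourself, a small sketch like this one (reusing interp_func from above, with the standard timeit module and a made-up batch of 10,000 points) will do:
import timeit
import numpy as np

x_interp = np.linspace(0, 5, 10_000)  # many points inside the [0, 5] reference range

looped = timeit.timeit(lambda: [interp_func(x) for x in x_interp], number=10)
vectorized = timeit.timeit(lambda: interp_func(x_interp), number=10)
print(f"looped: {looped:.3f}s, vectorized: {vectorized:.3f}s")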
for loops in disguise
Another trap is things that don't look like they are using a Python for loop, but they are under the hood.
pandas' dataframes have many methods that are a for loop, a function call and a raccoon in a trench coat.
E.g., the .apply() method:
>>> import pandas as pd
...
... df = pd.DataFrame({
...     'Name': ['Alice', 'Bob', 'Charlie', 'David'] * 10000,
...     'Age': [25, 30, 22, 27] * 10000,
... })
...
... def categorize_age(age):
...     if age < 25:
...         return 'Young'
...     elif age < 30:
...         return 'Adult'
...     return 'Senior'
...
... df['Age Group'] = df['Age'].apply(categorize_age)
...
... print(df)
Name Age Age Group
0 Alice 25 Adult
1 Bob 30 Senior
2 Charlie 22 Young
3 David 27 Adult
4 Alice 25 Adult
... ... ... ...
39995 David 27 Adult
39996 Alice 25 Adult
39997 Bob 30 Senior
39998 Charlie 22 Young
39999 David 27 Adult
[40000 rows x 3 columns]
Apply is a Python for loop in disguise. There is no simple substitution, though; you have to find the solution that matches your problem. Here we could do:
>>> bins = [0, 25, 30, float('inf')]
... labels = ['Young', 'Adult', 'Senior']
... df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
... print(df)
Name Age Age Group
0 Alice 25 Adult
1 Bob 30 Senior
2 Charlie 22 Young
3 David 27 Adult
4 Alice 25 Adult
... ... ... ...
39995 David 27 Adult
39996 Alice 25 Adult
39997 Bob 30 Senior
39998 Charlie 22 Young
39999 David 27 Adult
[40000 rows x 3 columns]
That's about a 4x difference in performance:
>>> %timeit df['Age'].apply(categorize_age)
2.72 ms ± 58.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit pd.cut(df['Age'], bins=bins, labels=labels, right=False)
600 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
It's an even bigger trap because, with small datasets, apply() could be faster, as startup costs for any operation in pandas are quite big, and you will only see the problem once you get serious.
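If you want to find out where that crossover sits on your own machine, a rough sketch like this one (reusing categorize_age, bins and labels from above, with the standard timeit module instead of the %timeit magic) will give you an idea:
import timeit
import pandas as pd

# Compare both approaches at several sizes; the numbers will vary by machine.
for n in (100, 10_000, 1_000_000):
    sample = pd.DataFrame({'Age': [25, 30, 22, 27] * (n // 4)})
    t_apply = timeit.timeit(lambda: sample['Age'].apply(categorize_age), number=20)
    t_cut = timeit.timeit(
        lambda: pd.cut(sample['Age'], bins=bins, labels=labels, right=False),
        number=20,
    )
    print(f"{n:>9} rows: apply {t_apply:.4f}s, pd.cut {t_cut:.4f}s")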
String manipulations
Just because those libraries are specialized in numbers doesn't mean they can't manipulate text. If you have a text manipulation to perform, it might be tempting to use .apply():
>>> import re
>>> import pandas as pd
...
... df = pd.DataFrame({
...     'phone_numbers': [
...         '(123) 456 7890',
...         '555_678_1234',
...         '1 (800) 555 5555',
...         '123-456-7890',
...         '+91 9876543210'
...     ] * 100000
... })
...
... def normalize_phone_number(phone):
...     return re.sub(r'[^+\d]', "", phone)
...
... df['normalized_phone'] = df['phone_numbers'].apply(normalize_phone_number)
...
... print(df)
phone_numbers normalized_phone
0 (123) 456 7890 1234567890
1 555_678_1234 5556781234
2 1 (800) 555 5555 18005555555
3 123-456-7890 1234567890
4 +91 9876543210 +919876543210
... ... ...
49995 (123) 456 7890 1234567890
49996 555_678_1234 5556781234
49997 1 (800) 555 5555 18005555555
49998 123-456-7890 1234567890
49999 +91 9876543210 +919876543210
[50000 rows x 2 columns]
But you can often use the built-in string manipulation functions, once again, to avoid a for loop:
df['normalized_phone'] = df['phone_numbers'].str.replace(r'[^+\d]', '', regex=True)
print(df)
The difference in performance is not as dramatic as with numbers, but it's still significant:
>>> %timeit df['phone_numbers'].str.replace(r'[^+\d]', '', regex=True)
317 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df['phone_numbers'].apply(normalize_phone_number)
501 ms ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Iteration on creation
If you do need iteration to create some dataset because numpy doesn't provide a good way to do it, do not repeatedly append to the numpy array (or pandas data frame).
Let's say you have an imaginary generator that outputs something from a data source you can't control and doesn't have a numpy/pandas adapter, but is iterable. That’s a lot to imagine, I’m delegating the creativity here.
Don't do:
all_the_things = np.array([])
for important_stuff in get_very_important_items():
    all_the_things = np.append(all_the_things, important_stuff.things())
This is extremely inefficient, because it recreates a new array every time you loop.
Make a whole iterable, and pass it to numpy in one go. np.array would just wrap the generator object itself, so np.fromiter is the better tool here (it needs an explicit dtype):
import itertools
things = (important_stuff.things() for important_stuff in get_very_important_items())
all_the_things = np.fromiter(itertools.chain.from_iterable(things), dtype=float)  # pick the dtype matching your data
Yes, I was running out of imagination at that point. Do you know how much time I spend just trying to come up with examples for those articles?
I'm not even sure the whole snippet runs. But it's 10 p.m. so someone in the comments will report the typos and I'll fix them tomorrow.
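The same logic applies to pandas: concatenating or appending to a dataframe inside a loop reallocates the whole thing every time. A minimal sketch, still assuming the imaginary get_very_important_items() source and that each .things() call returns something pandas can treat as a row (a dict or a sequence):
import pandas as pd

# Accumulate plain Python rows first, then build the dataframe in one go.
rows = [important_stuff.things() for important_stuff in get_very_important_items()]
df = pd.DataFrame(rows)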
When iterating with numpy is OK
There are cases where iterating is perfectly OK.
Mostly, the rule is "if the lib doesn't cater to this use case".
E.g.:
- Printing the content in a beautiful way (see the sketch below).
- Sending the data to a system that doesn't understand numpy arrays and can't take all the data at once.
- In the shell, because you want to see something quickly and you forgot how to do it the vectorized way.
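For the first case, a plain loop over a few rows to pretty-print them is perfectly fine (a small sketch using the Name/Age dataframe from the apply() example):
>>> for name, age in zip(df['Name'].head(3), df['Age'].head(3)):
...     print(f"{name} is {age} years old")
Alice is 25 years old
Bob is 30 years old
Charlie is 22 years old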
So don't be afraid to loop, just think before you do.
And by think I mean ask ChatGPT first.