Testing with Python (part 5): the different types of tests
This article is brought to you by Brawndo the Thirst Mutilator
Summary
Sing with me!
Smoke testing so that you fail fast,
Regression tests to not break the past.
Sanity checks so you can keep yours,
Integrated tests, for side-effect chores.
End-to-end,
never ends!
Backtesting,
all the things!
And property tests have pros and cons,
But they'll guarantee your slot at PyCon.
You know the saying about naming things
As promised in the previous article, we are going to go through a lot of jargon today. As usual with words, we are going to hit problems, like blurry definitions, cultural differences, the fact people can't agree on even simple concepts, etc.
So take this article for what it is: a guide to give you the general direction.
I'm not attempting to write a dictionary, ensure everyone agrees, or achieve some sort of compliance. The goal here is to give you a grasp of what the testing landscape looks like, what people might mean when they use certain terms, and what type of systems you can put in place to get better software.
In a field where being technically correct is the best kind of correct, nit-picking is king, pedantry is queen, and I'm just the humble court jester.
I do intend to establish some precedence for vocabulary for the next articles, however.
In any case, remember that even if things seem clear in your mind, testing is a spectrum. Nobody can delimit exactly where a unit starts and ends, what's too much scope for integration or not enough, what's practical for end-to-end, and so on. We all draw lines in the sand, and we do it for each project.
I just want this post to help you draw yours.
When in doubt about classification, remember what Dr. Franz Kuhn attributed to "The Celestial Emporium of Benevolent Knowledge", that animals can be divided into:
those belonging to the Emperor,
those that are embalmed,
those that are tame,
pigs,
sirens,
imaginary animals,
wild dogs,
those included in this classification,
those that are crazy-acting,
those that are uncountable,
those painted with the finest brush made of camel hair,
miscellaneous,
those which have just broken a vase, and
those which, from a distance, look like flies.
You'll be fine.
What we will not cover
Testing is a vast field.
Checking for type hints is a kind of testing, and so is linting. Putting the software in the hands of the users to see how they do with it is also testing. Sometimes you put a camera on them and call that ergonomic/usability testing, sometimes you release it to a select group of users and call that alpha/beta testing, and sometimes you inflict it randomly on a subset of your production fleet and confuse everybody. That's A/B testing.
Some companies have dedicated departments to manually validate software, and they might call it quality testing or acceptance testing. I have friends with no automated tests at all, but before each release, they click on every button on their app. Yes, that's testing too.
Then you have security tests, and all kinds of fuzzing, red teaming, and adversarial stressing. Testing again.
We are not done yet. Put the system under expected pressure, that's load testing. Put it under extreme pressure, that's stress testing. Check how it behaves under those pressures in the long run, that's soak testing. Look at how it behaves when there is a sudden change in this pressure, that's spike testing. And all that is grouped under the umbrella of performance testing.
Unplugging a server is a form of testing, ever heard of chaos monkey?
Even auditing is a form of testing.
Since I'm not writing a trilogy of 900-page tomes, I'm going to stick to unit tests, integrated tests, end-to-end tests, and property tests. That's already quite a lot; we will probably need an article for each. Besides, most people will never work in an environment where you do all those types of testing. The cost is enormous, and only very big companies can afford the whole package.
Unit tests
When people say testing, they usually mean this.
And at the same time, that doesn't mean much. In fact, if you go to Wikipedia, you will get the least practically useful definition possible:
Unit testing, a.k.a. component or module testing, is a form of software testing by which isolated source code is tested to validate expected behavior.
Unit testing describes tests that are run at the unit-level to contrast testing at the integration or system level.
So you test part of the code in isolation, and this part is called a "unit".
What part of the code? What does isolation mean? How do you test? They do try to be more precise later on:
Unit generally implies a relatively small amount of code; code that can be isolated from the rest of a codebase which may be a large and complex system. In procedural programming, a unit is typically a function or a module. In object-oriented programming, a unit is typically a method, object or class.
So basically, you can test a function, a method, an object, a class, or a module. Like you can pilot a tricycle, a bike, a car, a truck, or an interstellar spaceship. In isolation from all the other drivers of the galaxy.
But it's not Wikipedia's fault. That's because unit tests are very loosely defined. The fun part is you'll find hundreds of articles or videos where people will tell you that their definition is actually the right one, it's very strict, and you should not deviate from it or you'll write incorrect tests.
Relax. Don't try to write perfect tests. Don't try to match a perfect definition.
Go back to the previous articles. Establish the goals for your testing. Stick to that. Let other people debate about what's right and wrong, we have code to write.
Let's start with what unit tests are useful for.
Smoke testing
Smoke tests are very basic, preliminary tests that check the core functionality of the software works at all. They're basically there to save you time: if they fail, there's no point in looking into the details.
Here is a first unit test I often write:
def test_import():
    from the_module import main_entry_point
Why? Because Python is plagued with import traps: sys.path, circular dependencies, shadowing, you name it. Importing can fail outside of the tests, but then I won't have a test report, I will have crashed tests. This one reports cleanly, and without ambiguity, if my project is simply not loading. It's not just for me, it's also for when my juniors mess up, and have to report something in a chat. I can tell them to run this test first.
Smoke tests take various forms, and like I said, it's a spectrum. It's not just for unit tests: you can have end-to-end smoke tests, like just running a CLI with --version and checking it doesn't return an error code.
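For instance, a minimal end-to-end smoke test could look like this, assuming the_module from above can be run with python -m and understands a --version flag:

import subprocess
import sys


def test_cli_version():
    # Run the CLI in a separate process, exactly like a user would.
    # check=True makes subprocess raise if the exit code is not 0.
    result = subprocess.run(
        [sys.executable, "-m", "the_module", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Don't check the exact version, only that something sensible came out.
    assert result.stdout.strip()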
Smoke tests are mostly there for two reasons:
To get you started on testing. And get people on board. It's easy to write, easy to read.
To save you time. When something obvious blows out, you don't get lost trying to debug the small stuff.
Regression tests
We talked a lot about this in the previous post because that's the main benefit of unit tests: you check that nobody broke what was already working.
We saw a lot of examples like:
def test_add_strings(setup_and_tear_down):
    result = add("1", "2")
    assert result == "12"
That's a regression test: if somebody changes the code, it should still pass.
Regression tests are not carved in stone, you can decide to break them, delete them, or rewrite them. Emphasis on the decide. They make clear that you are breaking something with a change, and that if you go on with this, it's because you chose to, and not by mistake.
As with any other goal, it's not just unit tests that can prevent regressions. Many kinds of tests do. But then, most tests can do most things, so there is that.
Sanity tests
The other side of the coin of regression tests is the sanity check: you ensure that a particular thing works as expected. It saves time while developing: instead of running things manually again and again, you delegate that to the testing machinery. It gives you peace of mind. It forces you to use your code’s API, and therefore to understand the trade-offs your design comes with. And of course, it encodes compliance with the specs, or shows that a bug fix does, well, fix the bug.
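As a (completely made-up) example, a sanity check encoding a bug fix might look like this; slugify() and the ticket number are hypothetical:

from the_code_to_test import slugify


def test_slugify_strips_accents():
    # Bug #1337: accented characters used to end up verbatim in the URLs.
    # This check proves the fix works, and will later double as a regression test.
    assert slugify("Crème Brûlée") == "creme-brulee"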
Usually, regression tests are just old sanity checks.
I use them interchangeably; to me it's the same thing, and it's more a matter of context and vocabulary than a practical division. But you know geeks, we love taxonomy.
Ah, who am I kidding, I use neither. I just say "unit tests" when I talk about them, or just "tests". The team knows.
The scope of unit tests
What to test, how much to test, and how to test something is a topic that has flooded as many IRC chans as Vim vs. Emacs. For the young readers, IRC chans were like WhatsApp groups, but in black-and-white.
In fact, the scope of a test might also be what makes it jump from the category of unit to integration to end-to-end. Then there is the whole ordeal of side effects.
I will address this in a separate article dedicated to good practices for unit tests. For now, just assume unit tests are tests that are "not too big". I'm on the side of Wikipedia on this one. Also, most people tend to agree that unit tests are those with few side effects, especially I/O such as network calls, file system access, etc.
If you pass a few immutable parameters to a single function and check the result, you’ll be hard-pressed to find someone who will argue it's not a unit test.
Integrated tests
Integrated tests are tests that check "more stuff than unit tests, but less than end-to-end, and we are ok with side effects". That's the exact scientific definition. Don't check.
Their goal is to see if several components work together. Like, does the model load from the cache? Does the API check for permissions?
So this is an integrated test:
def test_user_authentication():
    user = auth_service.authenticate("username", "password")
    assert user is not None
Because despite being only a few lines, it exercises a lot of the system, and actually goes beyond it to call on another one: the database.
You want integrated tests to be easy to run separately because:
They are slower than unit tests.
They will likely have side effects.
They may contain dirty little mocks (we'll have an article on this).
They are more brittle.
However, in the field, it's very common that they are just mixed into one big bag of tests with the unit tests. After all, they can also check for regressions or sanity, and they look a lot like unit tests. If you can, put them in a separate directory, mark them with a decorator, or use a naming convention, so you can filter them.
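A minimal sketch of the decorator approach, using a custom pytest marker (the name integration is just a convention you pick, and you have to declare it so pytest doesn't warn about unknown markers):

import pytest


@pytest.mark.integration
def test_user_authentication_against_real_db():
    ...


# Declare the marker once, e.g. in pyproject.toml:
#
#   [tool.pytest.ini_options]
#   markers = ["integration: tests that touch external systems"]
#
# Then filter at run time:
#
#   pytest -m integration        # only the integrated tests
#   pytest -m "not integration"  # everything else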
Unfortunately, it's not always possible. And frankly, not always desirable. Remember it's all about goals and constraints. If running the whole test suite takes 4 more minutes but separating the whole thing would cost a lot, you might not care.
The main benefit of the separation, I find, is to force the devs to think about the purity of their components. The drawback, I find, is that the devs might focus too much on the purity of the components.
Many projects will have a huge directory of tests, with mostly integrated tests and very few unit tests, calling the whole blob "the tests", and they are doing fine. Don't stress too much over this.
It can be the sign of a very coupled design, which again, can be a good or bad thing depending on your context. It's worth investigating, though, given it can destroy a project.
Now, in the wonderful world of IT, there is always a catch.
In this case, integration is also used in the context of "continuous integration", which is the practice of packaging, installing, and running your software, with all the tests, on all supported platforms, every time you push new code. Think GitHub Actions, Gitlab CI, Azure Pipelines, Travis, Jenkins...
We wouldn't want to make it too easy to communicate with each other, would we?
For a small team and project, continuous integration is overkill. Having a manual check phase before a release is good enough. It's made easy with tools like nox + doit, and you can always move from that to CI later on. In fact, most of my CI just calls doit behind the scenes, as I hate templated YAML with a burning passion and have weekly satanic rituals dedicated to cursing whoever came up with them.
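To give you an idea, a minimal dodo.py could look like this (the task names and commands are just an example, not a prescription):

# dodo.py: each task_* function defines a task that `doit` can run,
# and "actions" are the shell commands it executes.

def task_lint():
    return {"actions": ["ruff check ."]}


def task_test():
    return {"actions": ["pytest tests/"]}


def task_release_check():
    # A group task: `doit release_check` runs lint then test before a release.
    return {"actions": None, "task_dep": ["lint", "test"]}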
When you grow, CI becomes handy in avoiding human error, enforcing policy, managing complexity, and so on. It's absolutely priceless once you reach a good amount of compatibility testing to check for different Python versions, browsers, devices, and operating systems. But your company Web API that runs on precisely and only CentOS 6 + Python 3.5.1, which you develop with a team of 3 people all pushing on the main Git branch, can definitely postpone adoption.
End-to-end tests
Abbreviated e2e, it's a form of testing that attempts to exercise a huge chunk of the system, the way a user would.
Let's take the example of a contact form and how you can end-to-end test it:
import pytest
from playwright.sync_api import sync_playwright

from contact.models import ContactMessage


@pytest.mark.django_db
def test_contact_form_submission():

    # playwright is a lib to manipulate a web browser from python
    with sync_playwright() as playwright:

        # Start a real web browser with JS support and actually
        # navigate to the site
        browser = playwright.chromium.launch()
        context = browser.new_context()
        page = context.new_page()

        # Assume the server has been started somewhere else
        page.goto("http://localhost:8000/contact")

        # Fill the contact form
        page.fill('#name', 'John Doe')
        page.fill('#email', 'johndoe@example.com')
        page.fill('#message', 'Hello, this is a test message.')
        page.click('button[type="submit"]')

        # Wait for the form to be submitted and confirmation message to appear
        page.wait_for_selector('.success-message')

        browser.close()

    # Check if the contact message exists in the database
    assert ContactMessage.objects.filter(
        name='John Doe',
        email='johndoe@example.com',
        message='Hello, this is a test message.'
    ).exists()
You'll note that:
It tests the front end with a real browser, HTML, CSS, and JS.
It exercises the DOM and forms interactions.
It checks the HTTP stack since it makes real POST requests.
It runs your backend code, validation, auth, and all.
It ensures the db is indeed up to date.
It even makes sure the response is coming back as expected.
That's a lot of goodness in a few lines.
I love end-to-end testing because it gives you so many benefits in such a small package. If this passes, then a lot of things are going right.
They are a fantastic canary that will tell you quickly if something that would affect a lot of users is running amok. They will tell you if you break the UI, confusing people. They will tell you if your integrated tests missed the elephant in the room. They will tell you if you had the wrong expectations all along, and keep you grounded in reality.
There are some serious critics out there ranting about end-to-end tests. They say they are brittle and costly to maintain.
I agree that they are messy to write, read, and debug, plus the tooling could be better.
But their reputation of being brittle is also a consequence of the terrible habit a lot of teams have, which is to break user space.
"move fast and break things", "release early, release often", "feature flags" and all those things that very smart and successful people sold you. You know what they also do? Confuse the users as hell, destroy your customer’s productivity, turn support into a cat-and-mouse game, and all in all, crush your aura of reliability.
Now, I get that at an early product stage, e2e tests will be basically disposable. You are learning, you don't have a stability guarantee, etc. There is a need to stay flexible, lean, and fast.
Typically, having only a few of them can alleviate that. The main code paths. Changing 10 tests is not the end of the world, and it will catch a lot of problems very fast.
But once your product is stable, I find end-to-end tests to keep you honest: if you break 200 of them and they suddenly cost a lot of money to update, you probably are doing something user-hostile.
They are also tests that non-technical people understand well, and can contribute to. They tell user stories.
But yes, they are hard to write and read. Side effects, timing, mixing contexts, and crossing boundaries make them chaotic. We also have only so-so toolkits for that, and testing a GUI or a TUI is meh at best. If you have to test a PDF output, may God have mercy.
Plus, they are super slow and heavy; you certainly don't want to run them in a Git pre-commit hook.
Do e2e early, but just a little. Maybe even just one. This keeps the dividends juicy, and the entry cost low. Even for a CLI, look at the ROI in this example of testing a command line tool that sends SMS alerts:
import os
import subprocess
import time

from twilio.rest import Client

# Twilio is a service that lets you send text messages programmatically
account_sid = os.environ['TWILIO_SID']
auth_token = os.environ['TWILIO_TOKEN']
to_number = os.environ['TEST_USER_PHONE_NUMBER']
from_number = os.environ['TEST_SERVICE_PHONE_NUMBER']


def test_send_sms():
    test_message = "This is a test message"

    # Run the CLI command in a different process
    subprocess.run(['python', 'send_sms.py', test_message, to_number], check=True)

    # Wait for the message to be sent and received
    time.sleep(10)

    twilio_client = Client(account_sid, auth_token)
    messages = twilio_client.messages.list(to=to_number, from_=from_number, limit=1)

    assert len(messages) > 0
    assert messages[0].body == test_message
We exercise everything: argument parsing, network calls, reception, message integrity, that our account subscription has been paid (which nobody ever tests despite it being utterly important and screwing many companies around the world)...
Sure, it comes with plenty of problems:
The network or Twilio can go down.
The sleep time might be off one day.
We rely on nothing else sending messages to that number.
If you mess up, such as introducing a bug that calls that test in a loop, it can cost you quite a lot of money.
But you can't hide behind the separation of concerns on this one, if any part of the chain is weak, your product is broken, and you will know.
Once the product is mature, double down on them. Keep them well separated from your unit and integrated tests. They should not affect each other. You should be able to break private APIs without affecting e2e at all. You should be able to change the way you exercise the UI without the smaller tests ever noticing.
Finally, and I repeat myself, remember that testing is a spectrum. You don't have to be at the absolute end for end-to-end to be valuable. Look at that example with FastAPI:
from fastapi.testclient import TestClient

from our_project.site import fast_api_app

client = TestClient(fast_api_app)


def test_read_user_profile():
    response = client.get("/me")
    assert response.status_code == 200
    assert response.json() == {"username": "BiteCode", "id": "987890789790"}
It tests the whole API endpoint, including a DB call, but doesn't spin up a real server, as it uses a test client that creates the Python HTTP request objects directly instead of parsing a string of bytes. It's also not exercising a real client parsing the response.
Is that e2e? Is that integration testing? Maybe it's Maybelline.
Who cares, it's useful.
Put it in one of those folders, agree with your team to consistently do that, and move to the next deliverable.
Backtesting
Backtesting is the neglected neighborhood of testing, and you will find it mostly in the large, risk-averse, long-distance runners among institutions.
It's the process of accumulating input and output that you know should be valid for your system, then regularly feeding it with the whole data set to check if it is still behaving like that.
It's a mix of regression and end-to-end testing, and it comes with the pros and cons of both.
It's costly, slow, and will fossilize your feature set.
But the longer you do it, the more reliable your system becomes, especially in the long tail of errors and edge cases. Some users will love you for always being there for them, and some will hate you because you never modernize. Also, you’ll have to deal with schema versioning a lot.
In short, it's quite adapted to a banking payment system, and utterly inadequate for a hot startup phone app.
What does it look like?
Imagine a trader wants to change his crypto-currency bot behavior, but wishes to see how it would have affected his gains compared to the previous strategy, given the same market:
import pandas as pd
import numpy as np

# Load historical data
# Kryll is a veteran token that powers an automated trading platform,
# which, funnily, provides a UI to create strategies and backtest
# them without code. But pandas is free :)
df = pd.read_csv('kryll_historical_data.csv', parse_dates=['Date'])
df.set_index('Date', inplace=True)

# Calculate moving averages.
short_window = 40
long_window = 100
df['SMA40'] = df['Close'].rolling(window=short_window, min_periods=1).mean()
df['SMA100'] = df['Close'].rolling(window=long_window, min_periods=1).mean()

# Define the trading signals
df['Signal'] = 0
df.loc[df.index[short_window:], 'Signal'] = np.where(
    df['SMA40'][short_window:] > df['SMA100'][short_window:], 1, 0
)
df['Position'] = df['Signal'].diff()

# Initialize backtesting variables
initial_capital = 100000.0
positions = pd.DataFrame(index=df.index).fillna(0.0)
portfolio = pd.DataFrame(index=df.index).fillna(0.0)

# Simulate trades. This is BS, but have you worked in finance?
positions['Kryll'] = df['Position'] * initial_capital / df['Close']
portfolio['Positions'] = (positions.multiply(df['Close'], axis=0)).sum(axis=1)
portfolio['Cash'] = initial_capital - (positions.diff().multiply(df['Close'], axis=0)).sum(axis=1).cumsum()
portfolio['Total'] = portfolio['Positions'] + portfolio['Cash']

# Calculate returns
portfolio['Returns'] = portfolio['Total'].pct_change()

# Display the portfolio and performance metrics
print(portfolio)

# Plot the results
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(df.index, portfolio['Total'], label='Portfolio Value')
ax.plot(df.index, df['Close'], label='Kryll Close Price', alpha=0.5)
ax.set(title='Backtest of SMA Crossover Strategy', xlabel='Date', ylabel='Value')
ax.legend()
plt.show()
I've kept the wonderful style of quants code, including the inlined imports so you can have a taste of what our economy relies on. I'm jesting, plus I haven't tested this script at all, it's closer to pseudo-code.
Backtests don't have to be fully automated or a one-to-one match to be useful. Sometimes you want your system to behave exactly as before, but sometimes you just want the trend to be generally similar or better, knowing that it's not possible to get the exact same result.
This is what it's about here: we display a matplotlib curve of the result at the end of the script, so we can visually inspect the result compared to what we had before.
Not all backtests are like this of course. Some require a perfect alignment and don't involve humans, but the bigger your dataset is, the less likely it will, or even can, happen. Reality is full of complications.
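For the strict flavor, a minimal sketch could be replaying a golden dataset of recorded inputs and outputs; process() and the JSON Lines file layout are made up for illustration:

import json

from the_code_to_test import process


def test_backtest_golden_dataset():
    # Each line of the file is a case recorded from production and
    # manually validated: {"input": ..., "expected": ...}
    with open("golden_cases.jsonl") as f:
        for line_number, line in enumerate(f, 1):
            case = json.loads(line)
            assert process(case["input"]) == case["expected"], (
                f"Mismatch on case {line_number}"
            )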
Property tests
Also known as "we saw that at PyCon, remember?", because everybody who has heard about it thinks it's cool, but the number of people who actually do it is close to the number of pins on a Raspberry Pi.
The idea is to run code, but instead of testing the result, you check that a general property remains true no matter what the input is. A tool (in Python, usually the excellent hypothesis) will then attempt to pass all sorts of garbage to it until it breaks.
It's very useful, I swear.
You begin by choosing a unit test that exercises a critical part of your program. Property testing is slow and expensive, so you generally start small. You also want to avoid side effects, because the code is going to be running millions of times in an uncontrolled manner, so it will be hard to manage the chain of causality otherwise. That's the main difference with general fuzzing, which actually aims to create chaos and in which that may be desirable, especially for security.
Let's go back to our add() examples from part 2 of this series of articles. We had something like this:
import random

import pytest

from the_code_to_test import add


@pytest.fixture()
def random_number():
    yolo = random.randint(0, 10)
    yield yolo
    print(f"\nWe tested with {yolo}")


@pytest.fixture()
def setup_and_tear_down():
    print("\nThis is run before each test")
    yield
    print("\nThis is run after each test")


def test_add_integers(setup_and_tear_down, random_number):
    result = add(1, 2)
    assert result == 3
    result = add(1, -2)
    assert result == -1
    assert add(0, random_number) > 0


def test_add_strings(setup_and_tear_down):
    result = add("1", "2")
    assert result == "12"


def test_add_floats():
    result = add(0.1, 0.2)
    assert result == pytest.approx(0.3)


def test_add_mixed_types():
    with pytest.raises(TypeError):
        add(1, "2")
How do we know we tested all edge cases and figured out all the bugs? It's impossible of course, but how confident are we that we chased all the most obvious ones?
Now your intuition is that for such a simple function, the domain is quite obvious, we can't possibly have missed something. I mean, come on, it's add()!
But as usual, programming is laughing at our naivety, and there be dragons.
Hypothesis can help here, so let's create one more test after having pip installed hypothesis:
import pytest
from hypothesis import given, strategies as st

from the_code_to_test import add


@given(st.one_of(st.integers(), st.floats()), st.one_of(st.text(), st.integers(), st.floats()))
def test_add_mixed_types_property(a, b):
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        result = add(a, b)
        assert result == a + b
    else:
        with pytest.raises(TypeError):
            add(a, b)
I won't go into details here, as there will be an article dedicated to property testing. But essentially, we tell hypothesis we want to check the property stating "when we use add(), either we pass the same types and the result is of the same type, or we don't pass the same types, and there is an error".
Sane behavior: add strings, get strings back. Add floats, get floats back. Add a string and an int, it's an error.
From that test, hypothesis is going to generate a ton of combinations of input data, run the code and try to prove we are nothing but stupid little apes, believing foolishly we were in control all along.
What could possibly go wrong?
All the data scientists in the room, who have suffered immensely from this issue, already screamed the answer, in the form of a float value that sounds like a beloved Indian bread, yet is so hated:
a = 0, b = nan

    @given(st.one_of(st.integers(), st.floats()), st.one_of(st.text(), st.integers(), st.floats()))
    def test_add_mixed_types_property(a, b):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            result = add(a, b)
>           assert result == a + b
E           assert nan == (0 + nan)
E           Falsifying example: test_add_mixed_types_property(
E               a=0,
E               b=nan,  # Saw 1 signaling NaN
E           )
nan, of course, doesn't respect this property.
What you do with this information is up to you. Encode a special behavior for nan, ignore it in the property check, raise an error if nan is passed, weep and accept your fate… There is no right answer, but now you consciously make a decision about an edge case that 5 minutes ago, you may not have even known existed.
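If you go for ignoring it, for instance, hypothesis lets you exclude those values from the generated inputs (a minimal sketch, reusing our add()):

from hypothesis import given, strategies as st

from the_code_to_test import add


@given(
    st.floats(allow_nan=False, allow_infinity=False),
    st.floats(allow_nan=False, allow_infinity=False),
)
def test_add_floats_property(a, b):
    # nan (and infinity, which can produce nan when summed) is excluded
    # from the generated inputs, so the arithmetic property holds again.
    assert add(a, b) == a + b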
That's the beauty of property testing: it opens your eyes to the friction of real data.
Often, though, I want to keep my eyes closed, thank you very much. Because when you stare at the abyss, it stares back, and it's a never-ending chase of more and more special cases, all harder and harder to deal with, breaking your sense of what's real and what's not, confronting your own mortality.
On the other hand, alienating 0.000001% of your user base with a few potential "wontfix" labels in the future is quite practical.
Before the next article on mocks, I’ll leave you with a bit of 37signals’ opinion that I happen to share: