String manipulations Python beginners should know

Even beginners may know this already, but you never know

Sep 27, 2023

Summary

There is some formatting you will do over and over again. Being able to produce it without thinking, so you can keep coding without interruption, makes for a far better experience.

Among them:

Limiting the number of decimals on a float E.G.: f"{pi:.2f}"
Padding a number with 0. E.G.: f"{number:03}"
Formatting a date like it's not 1999. E.G.: f"{date.today():%d/%m/%Y}"
Cleaning up user strings. E.G.: " ".join(bio.strip().split())
Aliasing answers. E.G.: flag_to_bool.get(option.strip().casefold(), False)
Pluralizing. E.G.: f"{len(errors)} error{(len(errors) > 1) * 's'} found"

It's also good to know when you should not overdo it. You can get away with DIY for a lot of tasks.

Do I need to learn this stuff?

In this day and age of ChatGPT, learning how to format text is getting less and less appealing, since you can just ask the bot and it's very good at it.

Still, there is a flow to programming, writing and thinking, and breaking it always affects your productivity, and I might even say, your well-being.

So we are going to list here a few formatting patterns that you will likely have to use often, so that you can burn them into your mind once and for all. Even if you do end up getting them from an LLM, your brain can then pattern-match them, instead of reading them character by character, to check they are right for your situation before plugging them in.

Limiting the number of decimals on a float

It's a very, very common operation; from displaying percents to listing prices, at some point you will have more precision than your users care for. You can reach for round(), but since you will likely put the number in the middle of some text, using the format mini-language inside an f-string is the best policy:

>>> from math import pi
>>> pi
3.141592653589793
>>> f"{pi:.2f}"
'3.14'

f means "format it as a fixed-point number", and the . followed by a number sets how many digits to keep after the decimal point.

This can be used with a variable:

>>> limit = 2
>>> f"{pi:.{limit}f}"
'3.14'

Or with a call to the str.format() method:

>>> "{:.2f}".format(pi)
'3.14'

Although most of the time, the f-string is all you need.
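
If you wonder how this differs from round(), here is a minimal sketch (the variable names are only for illustration). round() gives you back a number, so trailing zeros disappear as soon as you print it, while the format spec gives you a string with exactly the number of decimals you asked for:

```python
price = 2.5
precision = 2  # illustrative name, not from the examples above

rounded = round(price, precision)       # still a float
formatted = f"{price:.{precision}f}"    # a string with exactly 2 decimals

print(rounded)    # 2.5
print(formatted)  # 2.50
```

For anything displayed to a human, the second form is usually the one you want.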

Padding a number with 0

User IDs, product codes, invoice numbers and so many other things are easier to read when they are always the same size. Not to mention your designer will love you much more.

The typical solution for this is to prepend 0 if you don't have the required number of digits. There is a function zfill() for it, but again, an f-string does the job fine:

>>> import random
>>> for _ in range(5):
...     number = random.randint(0, 200)
...     print(f"{number:03}")
...
051
026
004
093
137

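This also works with variables and format(). And it can be combined with other specifiers, like the one we just saw above. Just remember that the width counts every character of the result, decimal point included, so 03 adds no zeros here: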

>>> for _ in range(5):
...     number = random.randint(0, 200)
...     print(f"{number:03.2f}")
...
17.00
19.00
117.00
179.00
84.00
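
To make the padding rules easier to see, here is a small sketch comparing zfill() with the f-string form, using fixed numbers instead of random ones. The width variable is just an illustrative name:

```python
number = 7
width = 5  # illustrative variable

print(str(number).zfill(3))  # 007
print(f"{number:03}")        # 007
print(f"{number:0{width}}")  # 00007

# The width is the total length, decimal point and decimals included.
print(f"{17:06.2f}")  # 017.00
print(f"{17:03.2f}")  # 17.00 (already wider than 3, so nothing is added)
```

So if you want three digits before the decimal point and two after, you need a width of 6, not 3.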

Formatting a date like it's not 1999

The typical tutorial for formatting a date will call for using datetime.strftime(). It will work, but it's unnecessary.

First, if you are displaying a date for technical reasons, a simple call to str() will suffice:

>>> from datetime import date, datetime
>>> str(date.today())
'2023-09-27'
>>> str(datetime.now())
'2023-09-27 07:41:01.449948'

While this format is not what you want for documents and user interfaces, because each country has its own conventions, for everything that must be looked at by programmers or machines, this is the format you need. It's unambiguous (no 02/03 vs 03/02 debate), it's easy to parse (one call to datetime.fromisoformat()) and if you sort them, the text order is the same as the time order, which is super convenient.
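
Here is a quick sketch of the parsing and sorting properties, with made-up sample dates:

```python
from datetime import date

stamps = ["2023-09-27", "2022-12-31", "2023-01-05"]  # made-up sample data

# Sorting the raw strings...
sorted_text = sorted(stamps)
# ...gives the same order as sorting the parsed dates.
sorted_dates = sorted(date.fromisoformat(s) for s in stamps)

print(sorted_text)  # ['2022-12-31', '2023-01-05', '2023-09-27']
print([str(d) for d in sorted_dates] == sorted_text)  # True
```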

Of course, at some point you will format dates for users, in which case you can deep dive into the world of time zones, DST, and the amazing diversity of date conventions (what date is 01/02/03? why might Germans write 01.02.03?), and I would recommend using the excellent pendulum for it.

Nevertheless, the most correct solution is actually rarely what you need. The Pareto solution is way simpler: format it manually for just one type of user, and figure out the rest when other users start complaining.

And for this, again, f-strings are your friend:

>>> f"{date.today():%d/%m/%Y}" # French dates
'27/09/2023'

And if you feel lazy, but still want your users to feel a bit at home, %x will format a date (or datetime) according to the locale configured for the user:

>>> f"{date.today():%x}" # My locale is en_US
'09/27/23'

And %a (abbreviated) or %A (full) will give you the day of the week according to the same locale:

>>> f"{date.today():%a}"
'Wed'
>>> f"{date.today():%A}"
'Wednesday'

With only this, you can go pretty far.
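
One caveat: as far as I know, a Python process starts in the neutral "C" locale and only picks up the user's settings after a call to locale.setlocale(). Here is a minimal sketch with a fixed date, so the first result is predictable. What the second one looks like depends entirely on your system:

```python
import locale
from datetime import date

day = date(2023, 9, 27)

before = f"{day:%x}"  # default "C" locale: 09/27/23

try:
    # An empty string means "use whatever the environment is configured for".
    locale.setlocale(locale.LC_TIME, "")
except locale.Error:
    pass  # the configured locale is not installed, keep the default

after = f"{day:%x}"  # depends on your system settings
print(before, after)
```

On a machine configured for the US, both values will look the same, which makes this easy to miss.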

Note: if you do web programming, the locale of your server is not your user's locale, and you cannot get your user's locale from the request. You'll need JS or a form for that. But I digress.

Cleaning up user strings

While sanitizing inputs can be a huge piece of work (and is why libs like pydantic exist), there are some things that are easy and that you can do, and will likely do, no matter what.

One of those things is removing whitespace.

Pretty much every programmer in the world will preemptively, and violently, assassinate those with a combination of strip() and split():

>>> bio = "   I'm a     user and I have sausage fingers    "
>>> " ".join(bio.strip().split())
"I'm a user and I have sausage fingers"

I do this even on small scripts for myself, because it's so easy for a human to mess it up.

This will not handle invisible unicode traps, but those cases are so rare it's usually not worth it.
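
The whitespace cleanup above fits in a tiny helper. This is only a sketch, and squeeze_spaces is a made-up name. It also shows one of those invisible unicode traps, the zero-width space, slipping through:

```python
def squeeze_spaces(text):  # hypothetical helper name
    """Collapse any run of whitespace into one space and trim both ends."""
    # split() with no argument already ignores leading and trailing
    # whitespace, and treats tabs and newlines like spaces.
    return " ".join(text.split())


print(squeeze_spaces("  I'm a \t user\n with   sausage fingers  "))
# I'm a user with sausage fingers

# A zero-width space (U+200B) does not count as whitespace, so it survives.
print(repr(squeeze_spaces("sausage\u200b   fingers")))
# 'sausage\u200b fingers'
```

Since split() without arguments already drops leading and trailing whitespace, the extra strip() is harmless but not strictly needed.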

However, it still happens once in a while that you have to write to a system that doesn't deal with unicode. The professional solution (assuming nobody can fix said system) is to use the unidecode lib. But the poor man's solution, which will work well for French, Spanish or German (but not for Hungarian, Russian, Arabic, or Chinese), is to normalize to ASCII using the "Normalization Form Compatibility Decomposition":

>>> import unicodedata # this is part of the stdlib
>>> greetings = "Hello, my name is àéèùöüçñØ, nice to meet you"
>>> unicodedata.normalize('NFKD', greetings).encode('ascii','ignore').decode('ascii')
'Hello, my name is aeeuoucn, nice to meet you'
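
Notice that the Ø vanished from the output above: it has no "letter + accent" decomposition, so the ignore error handler silently drops it. Here is the same technique as a sketch of a reusable function (to_ascii is a made-up name), which makes that behavior easier to test on your own data:

```python
import unicodedata


def to_ascii(text):  # hypothetical helper name
    """Best-effort ASCII version of text: accents are stripped,
    characters with no decomposition are silently dropped."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Crème brûlée"))  # Creme brulee
print(to_ascii("Øre"))           # re
```

If losing characters like this is not acceptable, that's your cue to reach for a dedicated library instead.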

The other way around (you are decoding something and can't deal with encoding BS right now) is to use surrogateescape decoding. You get a str out of any garbage:

>>> import secrets
>>> secrets.token_bytes().decode('utf8') # those bytes mean NOTHING as text
Traceback (most recent call last):
  Cell In[71], line 1
    secrets.token_bytes().decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 0: invalid start byte
>>> secrets.token_bytes().decode('utf8', errors="surrogateescape") # but I can store them as text anyway!
'\udcc9\udcf1.~H<-\x1a\udcf2\x1ecVBZ\udc86\udcb7\udcc3uri\udcf0#\udcf5\udcdeے\udcdc\udcdf]#G\udc82'

This is no substitute for dealing with encoding correctly, but in a pinch, it will save you a lot of time and trouble.
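
What makes this trick safe to use is that it round-trips. A short sketch with a few hand-picked invalid bytes, chosen so the result is reproducible:

```python
raw = bytes([0xB8, 0x41, 0xFF])  # made-up bytes, not valid UTF-8

text = raw.decode("utf8", errors="surrogateescape")
print(repr(text))  # '\udcb8A\udcff'

# Encoding with the same error handler restores the original bytes.
restored = text.encode("utf8", errors="surrogateescape")
print(restored == raw)  # True
```

Be aware that a plain text.encode("utf8"), without the error handler, raises a UnicodeEncodeError on such a string, so you have to remember how it was decoded.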

Aliasing answers

You know those commands that tell you that you can answer "y/n"? They usually accept "y", but also "Y", "yes" and "YES".

Of course, a proper program would translate that to the language of the user, but that's again a rare occasion. Most of the time, what you really need is just:

answer = input("Are you sure? y/n ")
if answer.strip().casefold() in ("y", "yes"):
    ...  # go ahead with the action

This doesn't look like aliasing because we are not mapping the answer to any equivalent explicitly, but consider this:

>>> flag_to_bool = { # believe it or not, all those are valid YAML values
...     "on": True,
...     "true": True,
...     "1": True,
...     "off": False,
...     "false": False,
...     "0": False
... }
>>> option = " On "
>>> flag_to_bool.get(option.strip().casefold(), False)
True

It's basically the same idea, but precomputed. When you don't pydantic your way out of validation, it's very handy.
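
Here is one way to package the lookup, as a sketch. FLAGS and parse_flag are made-up names, and returning None for unknown values is my own choice, not something the dict approach forces on you:

```python
FLAGS = {
    "on": True, "true": True, "1": True,
    "off": False, "false": False, "0": False,
}


def parse_flag(raw, default=None):  # hypothetical helper
    """Map a user-typed flag to a bool; return default when unknown."""
    return FLAGS.get(raw.strip().casefold(), default)


print(parse_flag(" On "))   # True
print(parse_flag("FALSE"))  # False
print(parse_flag("maybe"))  # None
```

Defaulting to None rather than False lets the caller tell a typo apart from an explicit "off", which can matter for settings that default to enabled.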

You may wonder why we use .casefold() and not .lower() to normalize the case.

The Unicode standard section on lowercasing explains it better than I do:

the main purpose of case folding is to contribute to caseless matching of strings, whereas the main purpose of case conversion is to put strings into a particular cased form

Here we want to match strings regardless of upper and lower case, not put them in a particular cased form, so we use casefold(). The most famous example of the difference in behavior is in German:

>>> "ß".lower()
'ß'
>>> "ß".casefold()
'ss'

lower() is usually fine, but it costs nothing to use casefold().
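
To see why it matters for matching, here is a tiny sketch comparing something a user typed in capitals with a stored value (both strings are made up):

```python
typed = "STRASSE"
stored = "straße"

print(typed.lower() == stored.lower())        # False
print(typed.casefold() == stored.casefold())  # True
```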

Pluralizing

Making things plural is not trivial. You have entire libraries dedicated to it, and if you are creating serious software, you should use them.

But at this stage you know what I'm going to say: for most of your life, you don't need perfect pluralization, good enough will do.

The most common case for you is likely to be an English word to which you need to append an "s".

In that case, we can use the fact that:

booleans are equal to 0 or 1 in Python:

>>> True + 1
2

multiplying a string by an integer is allowed:

>>> 0 * "a"
''
>>> 1 * "a"
'a'
>>> 3 * "a"
'aaa'

This makes just adding an "s" quite easy:

>>> errors = []
>>> print(f"{len(errors)} error{(len(errors) > 1) * 's'} found")
0 error found
>>> errors = ["Woops"]
>>> print(f"{len(errors)} error{(len(errors) > 1) * 's'} found")
1 error found
>>> errors = ["Woops", "Errr..."]
>>> print(f"{len(errors)} error{(len(errors) > 1) * 's'} found")
2 errors found

Sure, this will not work with the "'" rules, or a mouse, but you deal more often with errors than mice.

Plus, we can use a similar trick with mice:

>>> forms = ["mouse", "mice"]
>>> mice = []
>>> print(f"{len(mice)} {forms[len(mice) > 1]} found")
0 mouse found
>>> mice = ["Woops"]
>>> print(f"{len(mice)} {forms[len(mice) > 1]} found")
1 mouse found
>>> mice = ["Woops", "Errr..."]
>>> print(f"{len(mice)} {forms[len(mice) > 1]} found")
2 mice found

You now also understand why the passive voice is used so much in software messages: once you need to distinguish "was/were", that trick becomes ugly very fast.
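
Both tricks can be folded into one small function. This is a sketch, pluralize is a made-up name, and it differs from the examples above on one point: it tests != 1 instead of > 1, because English usually treats zero as plural ("0 errors found"):

```python
def pluralize(count, singular, plural=None):  # hypothetical helper
    """Return e.g. '2 errors'; plural defaults to singular + 's'."""
    forms = (singular, plural or singular + "s")
    # != 1 rather than > 1, so that zero gets the plural form
    return f"{count} {forms[count != 1]}"


print(pluralize(0, "error"))          # 0 errors
print(pluralize(1, "error"))          # 1 error
print(pluralize(2, "mouse", "mice"))  # 2 mice
```

If your messages are in a language where zero is singular, switch the test back to > 1.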
