Summary
Understanding Python imports will make you a happier Python coder.
The magic variable sys.path contains the list of paths where Python is looking for things to import. Your virtual environment and the directory containing your entry point are automatically added to sys.path.
Python executes any imported thing, puts the result in a cached module object, and gives you a variable that contains it.
If you use "__init__.py", absolute imports and "-m", your importing life will get easier. Also, namespaces will save you from yourself.
Where is this damn package?
There are some feelings that can instantly bring humans together, tied by the bitter-sweetness of the shared pain.
Like the loneliness of an ImportError.
You know, you have installed this package, you have checked that it's here, you have imported this package...
And it fails.
So far we have been talking a lot about "Relieving your packaging pain" by ensuring you are setting up yourself for success.
But luck is not always a lady. Or rather, sometimes it’s a dog lady.
Yet you still have to make things work.
Well, there is a second large reason things can go wrong, and it's the way Python deals with imports.
Don't get me wrong, it's a powerful import system, plus namespacing is one honking great idea.
It's also full of traps.
Say hello to sys.path
You probably know how to import things, I mean, it's just "import stuff".
But do you know how Python finds "stuff" in the first place?
It's not magic, it's the sys.path variable.
If I open a Python interpreter, I can do this:
>>> import sys
>>> type(sys.path)
<class 'list'>
>>> type(sys.path[0])
<class 'str'>
And you'll notice that there is a path variable in the sys module. It's a list of strings.
This list contains, well, as you can expect, file system paths:
>>> from pprint import pprint
>>> pprint(sys.path)
['',
'/usr/lib/python310.zip',
'/usr/lib/python3.10',
'/usr/lib/python3.10/lib-dynload',
'/home/user/.local/lib/python3.10/site-packages',
'/usr/local/lib/python3.10/dist-packages',
'/usr/lib/python3/dist-packages',
'/usr/lib/python3.10/dist-packages']
When you import "stuff" in Python, Python will go through this list, and try to find anything it can import with this name. As soon as it finds one, Python stops. It loads the stuff, puts it in a stuff variable at your disposal, and carries on with your script. If it can't find anything importable with this name on any of those paths, it raises an ImportError.
You can deduce three things from this:
To be importable, Python code only has to be on one of those paths.
If some Python code is not on any of those paths, you can't import it.
Two things with the same name cannot be imported together. The first one found in the list wins.
So if you want to import something, you want it to be on sys.path.
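A quick demonstration, assuming a hypothetical file "/tmp/example_dir/my_module.py" that is not on sys.path yet:
>>> import my_module
Traceback (most recent call last):
  ...
ModuleNotFoundError: No module named 'my_module'
>>> import sys
>>> sys.path.append("/tmp/example_dir")  # hypothetical directory containing my_module.py
>>> import my_module  # now Python can find it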
How things get on sys.path
Python comes with things that are on sys.path by default. The entire stdlib for a start.
Some OS-specific folders, like the user site-packages directory.
But also your virtual environment if you have activated one. Look at the last entry in this list:
>>> pprint(sys.path)
['',
'/usr/lib/python39.zip',
'/usr/lib/python3.9',
'/usr/lib/python3.9/lib-dynload',
'/home/user/.local/share/virtualenvs/test/lib/python3.9/site-packages']
It's my "test" virtual environment. This is why virtual environments are so nice: anything you install with pip goes there, and is automatically on sys.path, ready to be imported.
And this is why I have been telling you to always use one. For everything. See, I even have one for tests. And one for writing articles.
Finally, and this is important, the directory of your entry point is automatically added to sys.path. This is the first entry, the empty string. I know, that's not the most explicit thing in the world.
Your entry point?
When you start a Python program, you always start from one script. This script may import other scripts, run functions, and so on. But it always starts with one single script.
This is your entry point.
And the directory that contains this script, this entry point, is automatically added to sys.path in the default Python configuration.
Note that this directory is not necessarily the current working directory. The current working directory is the one from which Python is started. It's used to resolve files you read with "open()" and the like. But it's completely independent from how the import system finds things.
Let me give you an example.
Imagine I am in "top_dir", here:
top_dir
├── foo
│ ├── bar.py
│ ├── __init__.py
│ └── blabla.py
└── blabla.py
Notice the two blabla.py files?
Now I run python foo/bar.py, which means:
I ran "python" in "top_dir", so "top_dir" is my current working directory, but it is NOT added to sys.path.
I ran the script "bar.py" (my entry point), which is in the directory "foo". So "foo" IS added to sys.path.
If bar.py contains import blabla, it will import "top_dir/foo/blabla.py", not "top_dir/blabla.py".
Oh, and if you just start a Python shell, it's the current directory that is added to sys.path.
So in that case, import blabla will import "top_dir/blabla.py", not "foo/blabla.py".
Tricky little bugger.
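If you want to see this for yourself, here is a minimal sketch you could drop in "foo/bar.py" (the output paths are just what I'd expect on a typical setup):
import os
import sys

print("working directory:", os.getcwd())  # where you started Python from
print("entry point directory:", sys.path[0])  # where this script lives
Running python foo/bar.py from "top_dir" would print something like "/home/user/top_dir" for the first, and "/home/user/top_dir/foo" for the second.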
Packages vs. modules
If you want to divide your code into several files and even directories, you have to create modules and packages.
A module is anything containing Python code; usually, it ends with ".py".
In:
top_dir
├── foo
│ ├── bar.py
│ ├── __init__.py
│ └── blabla.py
└── blabla.py
"top_dir/blabla.py", "foo/blabla.py", "bar.py" and "__init__.py" are modules.
Packages are any directory containing an "__init__.py" file. This file can be completely empty. It often is. But it has to be here.
This makes the directory a package, and Python can import packages like it can import modules.
You may read somewhere you don't need "__init__.py" anymore. You may even run some tests and see that, without it, the directory can still be imported.
It's a trap. The explanation is long, so I'll skip it, but it will break on you at some point. Don't do it, always put an "__init__.py" file to make a package.
In our example, "foo" is a package.
If you start Python in "top_dir", you can do any of the following:
>>> import foo
>>> import foo.bar
>>> from foo import bar
>>> import blabla
>>> import foo.blabla
>>> from foo import blabla
Note: in casual conversations, people will mix up packages and modules. I certainly do. It boils down to "stuff you can import".
What happens when you import stuff?
Let's say I put the following code inside "top_dir/blabla.py":
print("hello, is this me you're looking for?")
eyes = "it"
And now I import it:
>>> import blabla
hello, is this me you're looking for?
>>> print(blabla.eyes)
it
You note that several things happen:
The message is printed.
I have a new variable blabla.
I can access the content of eyes through blabla.
This is because when you import something, Python does the following:
It executes the code of the whole module. I repeat. All the code runs when you import a module.
It creates a module object.
It puts the result of the execution in that module object.
It puts that module object into a variable in your script.
In our case, you can see the variable blabla contains our new module object:
>>> print(type(blabla))
<class 'module'>
>>> blabla
<module 'blabla' from '/home/user/Bureau/fdsq/blabla.py'>
All of this is cached. If I import blabla a second time in the same shell, nothing will print:
>>> import blabla
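That cache lives in sys.modules, a plain dict mapping module names to module objects. A small demonstration (deleting the cache entry is for illustration only, don't do this in real code):
>>> import sys
>>> "blabla" in sys.modules
True
>>> del sys.modules["blabla"]  # forget the cached module
>>> import blabla  # the code runs again
hello, is this me you're looking for?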
A lot of people are surprised when they learn the whole script is executed on import.
What if I put this in blabla.py:
def see_in(where):
    return where == "smile"
Is it executed as well?
Yes. Absolutely.
But the function body doesn't run.
What is executed is the function definition itself: Python creates a new function object.
Since all the code of all modules is executed on import, you should avoid putting heavy code outside of functions. Or code that has side effects.
Hence...
The 'if __name__ == "__main__"' trick
A lot of people start coding a small script, and put everything directly in the file. It's fine, I do it too.
But then it evolves, and you make another module, then another module. And they want to import things from the first script.
Now there is a problem: if you import things from the first script, it runs.
This is where if __name__ == "__main__" becomes useful: you can separate the use cases of importing the module and running the script.
My blabla.py script can go from:
print("hello, is this me you're looking for?")
eyes = "it"
def see_in(where):
    return where == "smile"
to:
eyes = "it"

def main():
    print("hello, is this me you're looking for?")

def see_in(where):
    return where == "smile"

if __name__ == "__main__":
    main()
This new version lets me import it without running the print():
>>> from blabla import see_in
>>> see_in("smile")
True
But I can run the module as a script, and get the print():
python blabla.py
hello, is this me you're looking for?
How does it work?
Well, __name__ is a magic variable. It's created automatically by Python and is available in any Python code.
It contains a string:
This string is the module name if you read __name__ in any imported module.
This string is a special value if you read __name__ from your entry point.
The special value is a bit weird. It's the string "__main__".
As a consequence, if __name__ == "__main__": simply means "if we are in the entry point".
So the code under if __name__ == "__main__": only runs if the module is the entry point (running as a script), but not if it's imported.
One more detail.
If you import a package (so a directory with an __init__.py file in it) instead of a module, the __init__.py file of this package runs.
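You can check it with a throwaway print. Assuming we add this line to our hypothetical "foo/__init__.py":
print("initializing the foo package")
Then:
>>> import foo
initializing the foo package
>>> import foo.bar  # __init__.py already ran and is cached, no second print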
__pycache__ and .pyc files
You may have noticed that "__pycache__" directories containing ".pyc" files pop up after you import things.
When Python executes a module, it first needs to read the Python code, then it transforms it into something called "bytecode" (check the URL of this blog btw), and finally runs the bytecode.
This transformation is costly, so Python saves the bytecode in ".pyc" files and puts them in "__pycache__" directories. The next run checks if the code has changed, and if it hasn't, Python just reads the existing ".pyc" files.
There are several implications to this:
pyc files are optional. If you delete them, Python will simply create them again.
Keeping them around will make your program start faster. Not run faster. Start faster.
pyc files are platform specific. Don't share them, it will likely not work. Don't put them into your version control system (e.g: don't commit them in a git repository).
If a .pyc file exists but not the matching .py file, Python will run the .pyc anyway.
That's a huge source of confusion, and this is why you can start Python with the "-B" option to tell it not to write ".pyc" files. Some people like to never have them.
I find it slows down dev too much for my taste, so I keep them. But sometimes when nothing seems to work, I delete all my .pyc files just in case.
Keep that trick in the back of your mind.
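If you're curious where Python would put the bytecode for a given source file, the stdlib can tell you. The exact file name depends on your interpreter version, so this output is just an example from a 3.10 install:
>>> import importlib.util
>>> importlib.util.cache_from_source("blabla.py")  # where the .pyc would go
'__pycache__/blabla.cpython-310.pyc'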
Which import syntax does what
There are several different syntaxes to import things:
>>> import os
>>> import os.path
>>> from os import path
>>> import os as what_is_reality_really
What do they all do, and which one should you use?
import os
Import the "os" package, put it in the "os" variable.
I will use this syntax when the name of the module is short. Like "sys", "os", "json", etc.
import os.path
Import the "os" package, put it in the "os" variable. Import the "path" module. Make it an attribute of "os".
This assumes "path" is a module, and thus is importable.
Otherwise, you will get an error:
>>> from os import chdir
>>> import os.chdir
ModuleNotFoundError: No module named 'os.chdir'; 'os' is not a package
Here, chdir is a function, so it doesn't work. And the error message is terrible. But you can see the difference:
>>> os.path
<module 'posixpath' from '/usr/lib/python3.9/posixpath.py'>
>>> chdir
<built-in function chdir>
I never use this syntax anyway, and rarely see people doing so.
from os import path
Import the "os" package, put it in the "os" variable. Import the "path" module, put it in the "path" variable.
Useful if you don't want to prefix all your access to "path" with "os".
This is the most common syntax for imports.
import os as what_is_reality_really
This imports the "os" package, but puts it in the "what_is_reality_really" variable.
This is useful if:
The name of the thing you import is too long, and you will use it a lot.
The name of the thing you import conflicts with another name.
Occasionally useful. I will use it for things like datetime:
import datetime as dt
print(dt.date.today())
It's also very popular with some packages like "numpy" (as np) or "pandas" (as pd).
When to choose what?
It's mostly a matter of taste and style.
I will prefer:
import json
json.loads(data)
Over:
from json import loads
Because "json" is short to type, and "loads" too generic of a name to be imported on it's own. "loads()" what?
But usually I will use "from x import y":
from collections import deque, defaultdict
from itertools import combinations
Because I don't want to prefix everything with "collections" and "itertools" and the names are unlikely to be confused.
The last name in the shadow
If you import stuff, it creates a variable containing this stuff. If several variables with the same name are created, the last one wins:
>>> from json import loads
>>> loads
<function loads at 0x7fa21497e4c0>
>>> from pickle import loads
>>> loads
<built-in function loads>
>>> loads = 1
>>> loads
1
There is a term for this phenomenon. We say "we shadow the 'loads' variable".
As you can imagine, this is a great source of bugs, and is why I would rather do:
>>> import json
>>> import pickle
>>> loads = 1
>>> json.loads
<function loads at 0x7fa21497e4c0>
>>> pickle.loads
<built-in function loads>
>>> loads
1
This way, we have no conflict.
This concept is called "namespacing".
The idea of a "namespace" is a big thing in Python: every object is also a namespace.
Every time you see "foo.bar", the thing before the dot (here "foo") is a namespace, because the names after the dot cannot conflict with similar names in other namespaces.
A namespace is basically the tool we use in Python to separate variable names from each other:
In:
>>> import json
>>> import pickle
>>> loads = 1
>>> json.loads
<function loads at 0x7fa21497e4c0>
>>> pickle.loads
<built-in function loads>
None of those "loads()" conflict.
When you import something from another module, you suddenly bring those things from the namespace of one module, to the namespace of your script.
E.g., if you do this:
date = "now"
You create the variable "date" in the namespace of your script.
But if you do:
from datetime import date
You bring the content of the variable "date" from the "datetime" namespace into your namespace.
When you start looking for them, you will see namespaces everywhere in Python: modules are namespaces, objects are namespaces, classes are namespaces...
The idea is that if you attach a name to one of those, the name is unique to this context, and doesn't conflict with the same name in other contexts.
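Here is a small sketch of the same name living happily in three different namespaces (the Config class is made up for the example):
import json

class Config:
    loads = "a string that has nothing to do with json"  # lives in the Config namespace

loads = 1  # lives in the namespace of this script

print(json.loads('{"a": 1}'))  # the json module's loads
print(Config.loads)  # the class' loads
print(loads)  # our own loads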
import *
The cardinal sin of imports is to do:
from os import *
from json import *
from datetime import *
Everywhere.
This will import anything and everything from "os", "json" and "datetime" and will cause several problems:
You don't know what is in those packages, so you have no idea what you just imported, or even if what you want is in there.
You just filled your local namespace with an unknown quantity of mysterious names, and you don't know what they will shadow.
Your editor will have a hard time helping you since it doesn't know what you imported.
Your colleague will hate you because they have no idea what variables come from where.
This is something you will often see in tkinter tutorials:
from tkinter import *
root = Tk()
label = Label(root, text="I am root")
Don’t do it. You will suffer.
Use an alias:
import tkinter as tk
root = tk.Tk()
label = tk.Label(root, text="I am root")
There are a few good reasons to use "*". In the shell, it's handy. Sometimes, you want to import all things in __init__.py and you have "__all__" defined (if you don't know what "__all__" is, don't worry, you are not missing much).
But those are rare.
A good rule of thumb is to not use "import *".
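For completeness, here is what "__all__" looks like, in a hypothetical "foo/__init__.py":
# Only these names will be imported by "from foo import *"
__all__ = ["bar", "blabla"]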
Relative or absolute imports?
You may have seen some code do this:
from .submodule import something
Instead of this:
from package.submodule import something
Those are called relative imports. They don't work the way you think.
They are a very good way to shoot yourself in the foot.
Don't use them.
Only use absolute imports.
Again the explanation is pretty long and this article is starting to feel like a small book, so I'll skip it.
You'll have to trust me on that one.
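To make "absolute" concrete, here is a hypothetical layout and the style of import I mean:
# my_project/
# ├── __init__.py
# ├── utils.py  (defines a helper() function)
# └── main.py

# In my_project/main.py, prefer this:
from my_project.utils import helper

# Over this:
# from .utils import helper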
More love for -m
If you have read the blog for some time, you should have started to get the feeling we kinda like it.
This little flag of the python command lets you run any importable module.
That's why we recommend "python -m black" instead of "black": this reduces the number of PATH problems.
That's why you can run "python -m http.server" and have a web server magically run out of the box.
It works by running the module with the name you pass after the "-m". If you pass a package instead of a module, the package must contain a "__main__.py" file for it to work. This __main__.py module will run.
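For instance, assuming our "foo" package gets a hypothetical "__main__.py" containing:
print("foo is running as a program")
Then, from "top_dir":
python -m foo
foo is running as a program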
Remember when we said the current working directory and the directory of your entry point may not be the same?
It can be solved with "-m".
Going back to:
top_dir
├── foo
│ ├── bar.py
│ ├── __init__.py
│ └── blabla.py
└── blabla.py
If I run from "top_dir":
python foo/bar.py
"top_dir" is the current working directory, but "foo" is added to sys.path.
But if I run:
python -m foo.bar
I run the code of foo/bar.py as well, yet suddenly "top_dir" is both the current working directory and added to sys.path.
Suddenly everything makes sense: my imports can all start from the root of the project. My opened file paths as well.
Everything is normalized.
Bottom line, if you have scripts in your projects, don't run them directly. Run them using "-m", and you can assume everything starts from the root and be happy.
You’ll lose shell completion and “-m pdb”, but it’s worth it.
Tips and tricks
pytest, the most famous test runner in Python, does a nasty thing: it doesn't add the entry point directory to sys.path. However, you can force it to do so with configuration.
Any path you add to the environment variable PYTHONPATH is added to sys.path. I often have "PYTHONPATH=." in my env file. If none of this makes sense to you, I'll write an article on env vars some day.
sys.path is a list, which means you can .append() to it. Any directory you add there will have its content importable. It's a useful hack, but use it as a last resort.
Because imports are cached, changing the code of a file doesn't reflect immediately in the shell. Don't believe the trick you read on the internet about calling reload() functions. It will bite you. Restart the shell.
Anything you put in __init__.py is importable at the package level. If you import something from it, it's suddenly available from it. Aliasing things you use often in __init__.py can be a neat way of making your project imports look nice and clean, as sketched below.
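Here is what that last trick looks like, with a hypothetical "something" defined in "foo/bar.py":
# foo/__init__.py
from foo.bar import something

# Users of the package can now do:
# from foo import something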
For the reader: if you think importing can be confusing, try to import from a sister package in your own project.
After hours of googling, this was the best answer I found:
https://stackoverflow.com/a/50193944/14198656
TL;DR: install your own package in editable mode:
pip install -e . (note the dot; you should be in a venv, in the root directory of your project). Trust me, although I was hesitant for a long time, this turned out to be the easiest solution by far.