Python environments: from virtual to contained environments
The joy of package management!
I have always liked to be in control and I have always enjoyed reading to understand how things work. So about a decade ago I decided to leave the Windows ecosystem and I hopped on the Linux bandwagon.
I experimented a fair bit with Damn Small Linux, Linux From Scratch, Gentoo, etc. And although they offered near total control, they were way too technical for my younger self. I then enjoyed Debian for a few months and settled on Ubuntu for a couple of years. Having tried a few flavours, I started to appreciate how important package management is and how painful it can sometimes be.
About six years ago, I discovered Arch Linux and the concept of rolling releases. I've never looked back since moving to that distribution: the documentation is absolutely brilliant, the community is very active and the package manager is near perfect. No more package management issues, the latest version of any package at any time (there is a bit of a lag for some packages), exactly what I was looking for!
You can imagine my shock when I started to program in Python. The Python ecosystem goes against everything I was used to. Even the first Python programming books I bought advised learning Python 2, as most libraries/frameworks were still on Python 2 and Python 3 was deemed too immature… Shocking! (for me at least)
Anyway, embracing Python as a programming language meant I had to get my hands dirty and go back to actively managing packages.
I quickly realised that pip would install packages in different places, depending on how and what you install. (Have you ever done pip install --user [package]?) By default, pip installs system packages, available to all users. The --user flag installs site packages into your home directory, available only to you and not requiring any superuser privileges.
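As a quick illustration (requests is just a placeholder package here):
pip install requests          # system-wide install, may require superuser privileges
pip install --user requests   # per-user install in your home directory, no superuser needed
python -m site --user-site    # prints where the --user packages end up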
Virtual Environments
A virtual environment is a directory tree that contains a specific version of Python together with its packages. As said before, Python packages can end up in different locations; a virtual environment ensures Python runs in a fully controlled manner, from a chosen directory.
Venv
Although it ships with Python, I won't be talking about the venv package, as virtualenv is a much more powerful alternative (comparison here). Just know that you can create an environment with python -m venv [environmentName].
Virtualenv
One of the lowest-level ways of managing Python virtual environments. It is not the most elegant, but it offers a lot of control. Each project gets its own environment, so package management is rather easy: you can simply freeze the packages at the versions that work for your project.
To get started, install virtualenv:
pip install virtualenv
or even better, via your favourite package manager.
- mkdir VirtualEnvironments to create a folder that will hold all your virtual environments.
- cd VirtualEnvironments to get into the directory.
- virtualenv [environmentName] to create your environment.
- To activate the environment, run source [environmentName]/bin/activate from the VirtualEnvironments directory.
- To deactivate the environment: deactivate.
As you can see in the image above, the $PATH and the Python interpreters are different: we are switching from the virtual environment env to the system environment. This means that even if your system is up to date and running python3.8, you can still develop a project using python2.7, for example. Just use the -p switch: virtualenv -p $(which python2.7) [environmentName] to create a Python 2.7 environment.
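As a quick check that the pinned interpreter is really the one being used (legacy-env is a placeholder name, and this assumes python2.7 is installed on your system):
virtualenv -p $(which python2.7) legacy-env
source legacy-env/bin/activate
python --version    # should report Python 2.7.x
which python        # points inside legacy-env/bin, not /usr/bin
deactivate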
Once the environment is set up and active (the (environmentName) tag at the beginning of the command prompt tells you which environment is running), go ahead and install the packages you need for your project; they'll be installed in the environment (i.e. in its directory). You can then run pip freeze > requirements.txt to save the packages and their versions into the requirements.txt text file.
Freezing packages is particularly useful if you need to transfer your project. Simply create a new virtual environment on the machine you want to import the project/environment to (making sure to initiate the correct version of Python), then run pip install -r requirements.txt to install the packages at the pinned versions.
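A minimal sketch of that transfer workflow (the Python version and package pins are illustrative):
# on the original machine, with the environment active
pip freeze > requirements.txt              # produces pinned lines such as requests==2.22.0
# on the target machine
virtualenv -p $(which python3.8) [environmentName]
source [environmentName]/bin/activate
pip install -r requirements.txt            # installs everything at the pinned versions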
Virtualenvwrapper
Another, higher-level way to organise your virtual environments is to use virtualenvwrapper: pip install virtualenvwrapper
Once installed, you can run which virtualenvwrapper.sh to check its path, then add the following line to your .bashrc file:
source /usr/local/bin/virtualenvwrapper.sh
Now source .bashrc to load the new configuration file.
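Depending on your setup, you may also want to tell virtualenvwrapper where to keep the environments; a typical .bashrc snippet might look like this (the path to virtualenvwrapper.sh is an assumption, use whatever which virtualenvwrapper.sh returned):
export WORKON_HOME=$HOME/.virtualenvs         # where the environments will be stored
source /usr/local/bin/virtualenvwrapper.sh    # adjust to the path found above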
- Create a new environment with mkvirtualenv [environmentName]. If you have a requirements.txt file, mkvirtualenv [environmentName] -r requirements.txt can be used to install the packages at the required versions.
- List the packages installed with lssitepackages, but if you want to create the requirements.txt file, use pip freeze > requirements.txt.
- List your environments with workon.
- Activate a specific environment with workon [environmentName].
- Change to the active environment's directory with cdvirtualenv (you can check the path with pwd once the environment is activated).
- Deactivate the environment with deactivate.
- Remove an environment with rmvirtualenv [environmentName].
Slightly different commands, but the result is the same: virtualenvwrapper is a set of extensions to virtualenv. A typical session is sketched below.
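A typical virtualenvwrapper session might look like this (myproject is a placeholder name):
mkvirtualenv myproject -r requirements.txt   # create the environment and install the pinned packages
workon                                       # list all environments
workon myproject                             # activate this one
cdvirtualenv                                 # jump into the active environment's directory
lssitepackages                               # see what is installed
deactivate
rmvirtualenv myproject                       # delete it once it is no longer needed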
Other useful packages for managing virtual environments are:
- pipenv, which combines pip, virtualenv and Pipfile (another way to address requirements.txt); a minimal workflow is sketched after this list.
- pyenv, which aims to isolate Python versions; its virtual environments are managed with virtualenv or pyenv-virtualenv.
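For instance, a minimal pipenv workflow might look like this (requests is just a placeholder package):
pip install pipenv
cd myproject              # pipenv ties the environment to the project directory
pipenv install requests   # creates the virtualenv and records the dependency in Pipfile/Pipfile.lock
pipenv shell              # spawn a shell with the environment activated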
Conda
Conda is very popular amongst data scientists for a few reasons:
- It bundles Intel MKL, which makes some libraries (like numpy) faster.
- It manages packages locally, so there is no need for superuser privileges.
- It comes with a lot of industry-standard packages.
- Anaconda Inc. is a company that offers support contracts.
- It makes using Python on Windows a lot easier.
Not only does it manage packages, it also allows for environment management.
- conda create --name [environmentName] to create an environment, conda create --name [environmentName] python=2.7 if you need a specific version of Python, or conda env create -f environment.yml to create an environment from an environment.yml file (similar to the requirements.txt file; a minimal example follows this list).
- conda env export -n [environmentName] > environment.yml will create the environment.yml file.
- conda install -n [environmentName] package=version to add packages; the version number is optional.
- conda activate [environmentName] to activate an environment.
- conda deactivate to deactivate an environment.
- anaconda-navigator will start the Anaconda Navigator with all the applications installed in the environment.
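For reference, an environment.yml file is just a small YAML document; a minimal, illustrative sketch (name, channel and versions are placeholders):
cat > environment.yml << 'EOF'
name: myproject
channels:
  - defaults
dependencies:
  - python=3.8
  - numpy
  - pandas
EOF
conda env create -f environment.yml   # recreate the environment from the file
conda activate myproject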
As you can see, no matter what tool you are using, virtual environments work in the same way: creation, activation, installation of packages, creation of a text file (so the environment can be replicated), and deactivation.
Contained environments
Another powerful way to address the environment issue is to use container solutions such as Docker. For most situations it is a little bit over the top, but I will be addressing it because:
- it is rather easy to implement
- once you understand how Docker works, you get access to a lot of really useful images such as Postgres, MongoDB, Redis, PySpark, etc.
First, install Docker.
Add your user to the docker group, and make sure you enable and start the docker service.
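On a systemd-based distribution such as Arch, those two steps boil down to something like this (a sketch, assuming Docker was installed from your distribution's repositories):
sudo usermod -aG docker $USER                 # add your user to the docker group (log out and back in afterwards)
sudo systemctl enable --now docker.service    # start the Docker daemon and enable it at boot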
You can then execute docker run hello-world. There you go, you've just run your first container!
A few useful commands:
- docker pull [image] to download an image.
- docker run [image] to run the image in a container.
- docker image ls to list the images locally available.
- docker rm [container] to remove a container from your machine.
Make sure to visit the Docker Hub, which lists all the (shared) images available. I would also strongly advise reading the image descriptions, as they will always explain how to run the image. For example, you can see above that the PostgreSQL container starts with docker run --name dbName -e POSTGRES_PASSWORD=password -d postgres.
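Once that container is up, a quick way to check it and open a psql prompt inside it (dbName matches the --name given above):
docker ps                                 # the container should be listed as running
docker exec -it dbName psql -U postgres   # open an interactive psql session in the container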
If you are interested in working with PySpark (note that you'll have to create an account on Docker Hub, then use docker login), run docker pull jupyter/pyspark-notebook to download the image, then docker run -d -p ***:*** -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook to start a container with a Jupyter notebook.
If you are interested in building your own image, have a look here, this documentation will give you the basics to achieve your first build.
Conclusion
This article was a rather quick overview of the various solutions available to sort out the environment issue. Conda is interesting for all the apps it can manage (Spyder is quite a nice IDE, for example), but virtualenv should be sufficient for most scenarios. We went a little bit off the beaten track by talking about Docker, but mastering this tool is really interesting as it opens up a whole new world!