DEV Community

Rick Galbo
Simple Python Environments For Data Science

Python environments have long been one of the most confusing and annoying parts of a python workflow for data scientists (or anyone really). The lack of a clear and concise explanation for environments was recently brought to my attention by this tweet:

While I'm no Jake VanderPlas, I am someone who has always struggled with python environments and definitely sees the need for a clear and concise explanation of them. Here you will find a simple guide explaining what virtual environments are and how to make them work.

Goals

  1. Explain wtf a virtual environment is
  2. Explain how to create and use virtual environments
  3. Get a jupyter notebook running inside a virtual environment

WTF is a virtual environment

Python virtual environments allow you to wade through the shitshow that is installing python packages. There are two very common python package managers that most people in data science use: pip and conda.

pip: a python tool for installing packages from the Python Package Index (PyPI)

conda: the package manager for the anaconda python distribution, which has quickly become the de facto standard for data science due to its user-friendliness (disclaimer: I use anaconda as my python distribution)

It can be simple to install packages with these tools (sometimes), but we have all gotten awful and confusing pip errors! There is also the issue of packages that require different versions of the same dependency, or packages that don't even work with the version of python that you are using.

This is where virtual environments come in handy.

Virtual environments allow you to create isolated python installations, each with its own python version and its own set of packages, specific to each project that you are working on. When you create a new virtual environment you specify the version of python and the versions of the packages you need, which prevents those awful import errors we all hate.
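To make the idea a bit more concrete, here is a minimal sketch that builds a throwaway environment and looks at what is inside it. It uses the standard library venv module purely for illustration (this post itself uses conda and Pipenv), and the function name create_demo_environment is made up for the example.

```python
import tempfile
import venv
from pathlib import Path


def create_demo_environment(target: Path) -> Path:
    """Create a bare virtual environment at `target` and return its config file."""
    # with_pip=False keeps the demo fast and offline; a real environment
    # would normally want pip available.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "demo_env"
        config = create_demo_environment(env_dir)
        # The environment is just a directory tree with its own interpreter
        # entry point, its own package folder, and a small config file.
        for entry in sorted(p.name for p in env_dir.iterdir()):
            print(entry)
        print(config.read_text())
```

The takeaway is that an environment is nothing magical: it is a folder holding its own interpreter entry point and its own place to install packages, so two projects never have to fight over the same dependency.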

Creating and Using Virtual Environments

So now that we know about virtual environments and feel they are marginally important, let's look at two tools to create and manage them.

Anaconda

Anaconda uses the conda package manager. conda has a very simple command-line interface for creating environments.

$ conda create --name <environment_name> python=<version_of_python>

This command will create a new environment located inside of the anaconda directory, ~/anaconda/envs/[environment name]/. When creating an anaconda environment you can include package names as arguments:

$ conda create --name test_env python=3 numpy pandas scikit-learn

This will create a new environment called test_env with the packages numpy, pandas, and scikit-learn already installed. To use the shiny new environment that you just created you simply run:

mac/linux: $ source activate <environment_name>

windows: $ activate <environment_name>

conda 4.4 and later: $ conda activate <environment_name>

When you are in the conda environment you can conda install any required packages, or if those packages aren't available through conda channels you can pip install them like you normally would.

To view the installed packages in your current environment you can run conda list which will print out the packages.
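If you would rather check from inside python, the sketch below lists the python distributions that the running interpreter can see. It relies only on the standard library importlib.metadata module; the helper name installed_packages is just for this example. Keep in mind that conda list also reports non-python packages, which this approach will not show.

```python
from importlib import metadata


def installed_packages() -> list[tuple[str, str]]:
    """Return (name, version) pairs for distributions visible to this interpreter."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            found.add((name, dist.version))
    # Sort case-insensitively by package name for readable output.
    return sorted(found, key=lambda item: item[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name}=={version}")
```

Run it once with the environment active and once without, and the two lists should differ, which is a quick sanity check that the environment really is isolated.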

How do I know what environment I'm in

To view all of your environments run the command conda info -e, which will return a list of your environments and place a star next to the current one. When you activate an environment it will also change your prompt so that the name of the environment appears at the beginning.

Here I am activating my conda environment dl and running conda info
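The same question can be answered from inside python itself, which is handy when a script or notebook seems to be picking up the wrong packages. This is a rough sketch using only the standard library; describe_environment is a made-up helper name, and the conda check assumes that conda sets the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts about the interpreter that is currently running."""
    return {
        # Path of the python binary executing this code.
        "executable": sys.executable,
        # Root directory of the active environment.
        "prefix": sys.prefix,
        # True for venv/virtualenv-style environments.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # Set by conda on activation; None when no conda env is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If executable points somewhere other than the environment you thought you activated, that usually explains the mystery import error.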

Pipenv

If you have animosity towards authoritarian open source, or just want to work with straight PyPI packages, there is a new option in pure python. Pipenv is the new cool kid on the block for managing python virtual environments, courtesy of Kenneth Reitz. Unlike Anaconda, which is geared toward scientific computing, Pipenv is built with python development in mind, specifically networking. This means that it doesn't come with any packages built in, but it works incredibly well, is fairly simple to use, and has a really pretty CLI. Pipenv is built to replace the separate use of pip and virtualenv with a single tool, creating a simpler workflow for managing environments in python.

Creating an environment is very simple:

$ pipenv install 

That's it! This creates a virtual environment and installs any packages that were specified, either on the command line or in an existing Pipfile. A few things to note here: unlike creating an environment in anaconda, there is no name to be set. The environment will just take the name of the directory that it is created in (Pipenv appends a short hash to keep it unique).

To activate the environment you simply run pipenv shell inside the project directory and voilà, you are in your shiny new Pipenv. To install new packages into your environment simply run pipenv install [package name] while the environment is active.

Pipenv will store package info in a file called a Pipfile that will look something like this:

$ pipenv install numpy pandas scikit-learn
$ cat Pipfile
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
scikit-learn = "*"
[dev-packages]

You can also view package information by running pip list from inside an active environment.

Installing and Running Jupyter

Now that we have a few options to create and manage virtual environments we will demonstrate creating and using the environments to get a jupyter notebook installed and ready for data science action.

Jupyter with Anaconda

By running three commands we can get a jupyter notebook running in an environment that is ready for action.

# create the environment
$ conda create -n jupyter_env python=3 jupyter
# activate the environment (mac/linux version)
$ source activate jupyter_env
# launch a notebook server in our env
$ jupyter notebook

That is all! Super simple, super concise and it 'just works', which will save you time for all the super fun data munging you need to do before you can run models or make pretty data pictures. Now any time you want to work in a jupyter notebook you simply activate the environment and launch it in your project of choice.
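Once the notebook is open, a quick first cell can confirm that the kernel actually sees the packages you expect. This is a minimal sketch using the standard library; missing_packages is a made-up helper name, and the package list in the demo is only an example.

```python
from importlib.util import find_spec


def missing_packages(import_names: list[str]) -> list[str]:
    """Return the top-level import names the current interpreter cannot find."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    # Use import names here: scikit-learn is imported as sklearn.
    print(missing_packages(["numpy", "pandas", "sklearn"]))
```

An empty list means everything is importable. Anything that shows up needs a conda install (or pip install) inside the active environment.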

Jupyter with Pipenv

Similar to anaconda, we can create a Pipenv fairly simply.

# create a project directory
$ mkdir jupyter_project
# change into the project directory
$ cd jupyter_project
# create your pipenv
$ pipenv install jupyter
# activate the environment
$ pipenv shell
# launch a notebook server in our env
$ jupyter notebook

Just two more steps, but that's all we need to get up and running with pipenv. The nice thing about pipenv is that we can install it using pip.

Note - The biggest drawback of Pipenv is that the CLI doesn't allow global environment access, which means you can't just run $ pipenv shell from anywhere; it will only work from inside the project folder. However, all the environments are located in the directory ~/.local/share/virtualenvs/, so to activate one from another directory all you need to do is run:

# list the virtual environments
$ ls ~/.local/share/virtualenvs/
# activate the environment from afar
$ source ~/.local/share/virtualenvs/[environment_name]/bin/activate
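If you have a lot of projects, the same lookup can be scripted. The sketch below lists every folder under that directory that contains a bin/activate script. It assumes the mac/linux layout described above, and list_environments and DEFAULT_ROOT are names invented for the example.

```python
from pathlib import Path

# Default location named in the text above; other platforms or settings may differ.
DEFAULT_ROOT = Path.home() / ".local" / "share" / "virtualenvs"


def list_environments(root: Path = DEFAULT_ROOT) -> list[str]:
    """Return names of sub-directories of `root` that contain bin/activate."""
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if (entry / "bin" / "activate").is_file()
    )


if __name__ == "__main__":
    for name in list_environments():
        print(name)
```

From inside a project folder, running pipenv --venv will also print the path of that project's environment, which saves guessing which folder belongs to which project.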

Future posts

Now that we've covered the basics of setting up simple environments for data science, there are some other important topics to cover. One of them is sharing your environments with others so that they can reproduce your code and interact with your research. Another important tool to cover is the actual jupyter notebook itself. There are many underutilized features of Jupyter notebooks that make machine learning projects much simpler.

