Setting up Python in your Data Science and ML Development Environment

Michael Louis
Co-Founder & CEO

Python is a necessary part of most data scientists’ and ML engineers’ lives. It forms the backbone of many of the packages we know and love, and is known for its ease of use and interoperability with C. However, nearly everyone will at some point run into a broken Python environment. In the worst case, developers end up breaking the Python runtime that their operating system relies on. There is also the matter of using the correct versions of Python and of each package, and of encapsulating those packages in a separate environment for adequate testing.

This guide is an opinion piece on how you should set up your Python environments. It entails setting up three sets of environments, each of which serves a different function in the development lifecycle of a data or ML-based application. These are:

  • The System Environment: the Python runtime that your entire system relies on to function, and which supplies a place to install Python-based CLI applications.
  • The Analysis Environment: the Python runtime(s) you use to conduct data analysis and train ML models.
  • The Development Environment: the Python runtime(s) you use to develop and test your applications.

Having this distinction strikes a good balance between simplicity, ease of use and system safety. Your system never breaks, you can set up and delete virtual environments quickly with minimal maintenance overhead, and you can thoroughly test any apps you develop in the same environment they will be deployed in.
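To make the three-way split concrete, here is a minimal sketch that guesses which kind of environment an interpreter belongs to. The function name classify_interpreter and the labels it returns are assumptions made for illustration. It relies on two facts: virtual environments (the kind poetry creates) make sys.prefix differ from sys.base_prefix, and conda sets the CONDA_PREFIX variable when an environment is activated.

```python
import os
import sys


def classify_interpreter(prefix, base_prefix, environ):
    """Guess which kind of environment an interpreter belongs to.

    The labels are illustrative: "virtualenv" roughly corresponds to a
    development environment, "conda" to an analysis environment, and
    "system" to everything else.
    """
    # venv/virtualenv make sys.prefix point at the environment directory,
    # while sys.base_prefix still points at the interpreter it was built from.
    if prefix != base_prefix:
        return "virtualenv"
    # conda exports CONDA_PREFIX for the currently activated environment.
    conda_prefix = environ.get("CONDA_PREFIX")
    if conda_prefix and os.path.realpath(conda_prefix) == os.path.realpath(prefix):
        return "conda"
    return "system"


if __name__ == "__main__":
    print(classify_interpreter(sys.prefix, sys.base_prefix, os.environ))
```

Running this with different interpreters on your machine is a quick way to confirm which of the three environments a given terminal session is actually using.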

System Environment

Your system very likely relies on Python to function, and you should do your best not to break this environment with rogue packages. On macOS, you can further isolate the system environment by installing Python with Homebrew.

brew install python

You should leave this Python to be maintained by your package manager (e.g. brew on macOS, apt on Ubuntu or dnf on Fedora). The packages you install here should be project-agnostic, mainly providing system functions that you run from the CLI rather than import in code. Such packages are usually provided directly by the package manager and are not installed with pip.
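One way to check whether an interpreter is meant to be left to the package manager is the marker file defined by PEP 668: distributions that adopt it place a file named EXTERNALLY-MANAGED in the interpreter's standard library directory. The sketch below looks for that file. The helper name is_externally_managed is an assumption, and not every package manager ships the marker, so a negative result does not prove pip is safe to use there.

```python
import os
import sysconfig


def is_externally_managed(stdlib_dir=None):
    """Return True if the PEP 668 EXTERNALLY-MANAGED marker is present."""
    if stdlib_dir is None:
        # PEP 668 places the marker in the interpreter's stdlib directory.
        stdlib_dir = sysconfig.get_path("stdlib")
    return os.path.isfile(os.path.join(stdlib_dir, "EXTERNALLY-MANAGED"))


if __name__ == "__main__":
    if is_externally_managed():
        print("This Python is managed by the system package manager.")
    else:
        print("No PEP 668 marker found for this Python.")
```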

Analysis Environment

Online, the overwhelming recommendation is to use a virtual environment for each Python project, and for good reason. As mentioned before, Python environments can break, and if something goes wrong with the system Python environment you are in for a world of pain. However, establishing a virtual environment for every analysis project leaves you with many environments to manage, each with its own Python version. Given that most of your analysis environments will largely be the same, this ends up being extremely cumbersome, so most people resort to just using their system Python.

Here, we recommend a middle ground with Anaconda: specifically, running a set of analysis environments with Miniconda. Miniconda lets you maintain a small handful of environments with different Python versions rather easily, while providing most of the packages you need for data science work. It is also less bloated than standard Anaconda, so you install only what you need. This gives you a significant advantage: if a conda env breaks, you just create a new one with a clean slate. In the absolute worst case, you would just have to reinstall Miniconda.

First, install Miniconda. We recommend using the bash installer script for your platform; on Apple Silicon macOS, for example:

bash Miniconda3-latest-MacOSX-arm64.sh

You will need to ensure conda is on your path. Activate the base conda environment, then run conda init to set up the path variable for the shell you use.

source $HOME/miniconda3/bin/activate
conda init <zsh/fish/bash>

The base conda environment will now be the default for both python and pip.
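To confirm that python and pip now resolve to the same environment, you can compare where each one is found on your path. This is a minimal sketch: the helper same_environment is an assumed name, and it only checks that both executables sit in the same directory, which is how conda and virtual environments normally lay them out.

```python
import os
import shutil


def same_environment(python_path, pip_path):
    """Return True when both executables sit in the same directory."""
    if python_path is None or pip_path is None:
        return False
    # abspath (not realpath) is used on purpose: inside a virtual environment
    # python is often a symlink to an interpreter that lives elsewhere.
    python_dir = os.path.dirname(os.path.abspath(python_path))
    pip_dir = os.path.dirname(os.path.abspath(pip_path))
    return python_dir == pip_dir


if __name__ == "__main__":
    python_path = shutil.which("python")
    pip_path = shutil.which("pip")
    print("python:", python_path)
    print("pip:   ", pip_path)
    print("same environment:", same_environment(python_path, pip_path))
```

If the two paths point at different directories, pip would install packages into an environment other than the one python runs from.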

Conda comes with a default channel for packages. At times, you may require packages that are only available on conda-forge, a community-based repository for conda packages. In my experience, conda-forge packages are more likely to break or be outdated, so you should add conda-forge as a backup channel.

conda config --append channels conda-forge

On conda-forge, there is a great package known as mamba that will speed up your conda installations. It runs your installations concurrently, as opposed to conda, which runs them sequentially. You should install this in the base environment.

conda install mamba

mamba is a drop-in replacement for the conda command, and you can use it in lieu of that command in most instances. The only exception is activating environments, where you should still use conda. You are free to skip this package if it doesn’t interest you.

Here comes the magic. With Miniconda, you do not need to manage the Python version separately when you create new environments. You simply specify the version you want on creation, for example conda create -n py310 python=3.10. Thereafter, create as many environments as you want or need. I typically like to have an environment for each Python version I work with (namely 3.10 and 3.9).

While using only the base environment is low risk, since you can just reinstall Miniconda, I would recommend setting up these extra environments instead. If something goes wrong, you can simply delete the environment and create a replacement.

You can activate the different environments easily with conda.

conda activate py310
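After activating an environment, it can be worth confirming that the interpreter really is the version the name promises. The sketch below assumes a naming convention of "py" followed by the major digit and the minor version (py310 for 3.10, py39 for 3.9). Both the convention and the helper names are illustrative, not something conda enforces.

```python
import re
import sys


def version_from_env_name(env_name):
    """Map a name like 'py310' to (3, 10); return None for other names."""
    match = re.fullmatch(r"py(\d)(\d+)", env_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def interpreter_matches(env_name, version_info=sys.version_info):
    """Check that the interpreter's major.minor matches the environment name."""
    expected = version_from_env_name(env_name)
    return expected is not None and tuple(version_info[:2]) == expected


if __name__ == "__main__":
    print("running Python", ".".join(str(part) for part in sys.version_info[:3]))
    print("matches py310:", interpreter_matches("py310"))
```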

Now install whatever you need or like to work with for analysis!

mamba install numpy pandas scikit-learn jupyterlab pytorch matplotlib

If something isn’t available in the default or forge repos, feel free to install it with pip.
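Once the packages are installed, a quick import check tells you whether the environment is usable before you open a notebook. This is a minimal sketch with an assumed helper name, missing_modules. Note that import names can differ from package names: scikit-learn is imported as sklearn and pytorch as torch.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the packages installed above.
    wanted = ["numpy", "pandas", "sklearn", "jupyterlab", "torch", "matplotlib"]
    missing = missing_modules(wanted)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all analysis packages are importable")
```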

Development Environment

Now, if you’re an engineer, it’s time to think about dependency management for any packages you develop. This is where specific environments for projects start to become important, not only for ensuring the correct environment for running your package in production, but also for smooth collaboration across your team.

Here, we recommend poetry, a tool that makes Python package management rather straightforward. It generates a lock file to keep track of the package versions you use, and supplies a virtual environment with these packages installed along with the correct Python version. Using this lock file, it can set up the same virtual environment on any system.

Installing poetry can be done using a script they supply.

curl -sSL https://install.python-poetry.org | python3 -

Alternatively, on macOS you can use brew. This has the added benefit of being maintained by your package manager.

brew install poetry

Once poetry is installed, using it is exceptionally easy. When you start a new package, you simply run the init command. You could also run poetry new to have poetry set up the project from scratch.

poetry init

Follow the prompts and you’re all set! You will end up with two files: pyproject.toml, a configuration file you can edit to modify package versions and project metadata, which poetry init creates; and poetry.lock, the lock file used to set up your virtual env, which is generated the first time dependencies are locked or installed. You should commit both of these files to your git repo. You can add more dependencies to your project with the poetry add command, and remove them with poetry remove. Alternatively, edit your pyproject.toml, then run:

poetry sync

When you have set up poetry in your project, you can run the install command to set up your virtual env. This will both create the virtual env you can use for development and install all the necessary packages.
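To check that the virtual env really contains the versions you expect, you can compare installed distributions against a set of pins using the standard library's importlib.metadata. This is a minimal sketch: the helper names and the example pins are hypothetical, and it does not read poetry.lock itself.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected_versions, lookup=installed_version):
    """Return {name: (expected, actual)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in expected_versions.items():
        actual = lookup(name)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins; replace them with versions from your own lock file.
    pins = {"requests": "2.31.0", "numpy": "1.26.4"}
    for name, (expected, actual) in find_mismatches(pins).items():
        print(f"{name}: expected {expected}, found {actual}")
```

Run it with the environment's own interpreter so that the lookup inspects the virtual env rather than your system Python.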

poetry install

The last important command you should be aware of is poetry shell, which spawns a shell inside your new virtual environment, so that python points to that environment’s interpreter.

poetry shell

poetry is a powerful tool, and you can also use it to build your package and publish it, though we won’t cover that functionality here.
