Kevin Cole

Posted on • Originally published at kevin-cole.com

Setting Up a (Free*) Collaborative Python Development Environment for a Small Team

Perhaps you've found yourself in this pickle: you're preparing to dig into a new coding project but this time, it won't just be you doing the work.

It's one thing to initiate a repository on your local machine and invite others to collaborate asynchronously through remote source management tools like GitHub, GitLab or Codeberg; introducing the prospect of real-time collaboration and managing the dreaded "works on my machine" stumbles might warrant a bit more architectural thinking.

I found myself in this situation last week while initiating a new research project that will require a small team of collaborators to work on a Python-heavy project exploring the potentials and pitfalls of leveraging LLMs to provide enhanced general orientation for asylum-seekers on their rights and the procedures which apply to them. It's always a struggle to tame that initial urge to jump into your IDE of choice and start coding, but it was clear that this project could benefit from some measures to ensure new team members could be brought onboard and enabled to contribute with minimal friction.

In this post, I'll walk through our decision-making process and provide a guide on how to use Gitpod, GitHub and Jupyter Notebook to set up a collaborative Python development environment.

If you're just interested in the step-by-step guide, feel free to jump right there. And if you're really in a rush, you can fork this template repository as a basis for your own dev environment hosted by Gitpod.

To Containerize or Not to Containerize?

If you're planning to work collaboratively on a Python-heavy project, one of the first determinations you'll need to make is whether or not you plan to use "containerization". To provide a simple definition relevant to our project planning, "containerization" is an approach to software development and deployment where code and/or services are run within a minimal virtualized runtime environment that is either hosted on a local machine or run remotely (aka, "in the cloud").[1]
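As a concrete illustration of the idea (a minimal sketch, assuming Docker is installed and a main.py sits in your working directory), the following one-liner runs that script inside an isolated Python 3.11 container instead of against your local interpreter:

# Hypothetical example: execute main.py inside an isolated Python 3.11 container
docker run --rm -v "$PWD":/app -w /app python:3.11 python main.py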

What's the point of "containerization"? In our use case, containerization's main benefit is to ensure that all contributing developers can access and run code in a single, standardized environment to reduce errors and incompatibilities caused by differing local runtime environments and configurations. It allows us to largely sidestep issues faced when contributors are working on different operating systems, have locally-installed libraries/packages which introduce incompatibilities, and the myriad other configuration permutations that naturally result from the ways in which we all use our own computers.

And what about the drawbacks? In some cases, running your development environment within a container might introduce too much additional overhead. For example, running containers locally could require you to provide guidance on installing Docker on a plethora of individual computers—ack! Containerization can also lead to performance bottlenecks, especially for compute-heavy tasks like machine learning or 3D graphics rendering. Finally, taking full advantage of containerization frequently entails hosting these containerized environments remotely, and this can be quite costly!

So, how to decide? Ultimately, you'll need to take a hard look at your organization's use case to make a final determination. In our case, the following considerations were a top priority:

  1. Contributors should be able to access the development environment with just a fundamental understanding of common software development tools: VS Code or a similar IDE and git-based source control.[2] It's a priority to enable contributors to learn through their research in this project, so barriers to entry must be as low as possible.
  2. Contributors should not have to consider and manage project dependencies. Our priority is to enable direct contribution, rather than having to fiddle with configurations.
  3. The solution must enable secure handling of authentication data, allowing for the group's work to be shared openly while also permitting the use of access-controlled resources (in our case, AWS Bedrock's LLM APIs).
  4. Considering the size of our team and the nature of our (non-profit) research, costs should be minimal (ideally, $0.00!).

Given this set of criteria, we opted to containerize our development environment but specifically chose to use a hosted "cloud-based development environment", rather than running our development container locally (for example, with Docker) or setting up our own (read, self-managed) remotely hosted container runtime.

What's a Cloud-based Development Environment?

Cloud-based Development Environments (CDEs) are remotely hosted runtimes that enable one or many developers to work on software from different devices, and increasingly they're coupled with other tools in the developer's toolchain to provide one-click access to a "ready-to-code" state. You've probably come across some of the more popular service providers like Gitpod, GitHub Codespaces, or Google's Cloud Workstations.

So, what's the deal with CDEs? To bring things to a point, many popular CDE solutions exist in a grey zone between the three main cloud service business models:

  • "Software as a Service" (SaaS)
  • "Platform as a Service" (PaaS)
  • "Infrastructure as a Service" (IaaS)

As others have rightfully pointed out,[3] this means that using a (non self-hosted) CDE product makes you a current or potential future customer. At the same time, these products also deliver a valuable service, namely simplifying the overhead required to provision and manage your own remotely-run container instance.

In the current environment of VC-funded "blitzscaling," small teams/projects (and in our case, particularly non-profit organizations) are generally able to skate by on the "generous free tiers" made possible by this phenomenon, though the usual cautionary warnings still apply:

  • Generous free tiers are frequently a "loss leader" offering and, as many a hobbyist has learned from experience (ahem, Heroku), they may one day simply cease to exist.
  • The related dread-phrase: "vendor lock-in".
  • Overdependence on abstracted/productized solutions can lead to knowledge/practical experience gaps in teams.

Like most things in life, there are certainly a set of benefits and tradeoffs to be considered, so it's crucial to take a hard look at your project, team and organization's requirements, goals, resources and options while deciding on a path forward. Just don't get so bogged down in the weeds that you forget that you can alter the direction of that path, even if doing so down the line might incur costs.

Our Solution: Gitpod, GitHub & Jupyter Notebook

After a bit of reflection on our primary goals, anti-goals, and operational constraints, we landed on the following set of tools for our Python-focused, research-oriented project:

  1. Gitpod: Gitpod is an (open-source!) CDE solution which integrates tightly with VS Code, offers straightforward configuration of the workspace's underlying container image, and provides built-in support for managing access to the dev workspace and environment variables.
  2. GitHub: Our organization already uses GitHub to host and manage remote repositories, so this was a bit of a given.[4] Using a git-based source control system is a practical necessity in this type of collaborative project, allowing for asynchronous contributions, code review, and a shared history of changes.
  3. Jupyter Notebook: Because our project is research-focused, we decided to use Jupyter Notebook to maximize the accessibility and reproducibility of our work by leveraging the ability to directly document our approaches with Markdown in Jupyter Notebook's .ipynb files.

You might not want to take this approach if your project:

  • Requires or greatly benefits from hardware acceleration;
  • Requires you to run multiple concurrent and/or persistent services (databases, authentication servers, etc.);
  • Needs to support full-time contributors: Gitpod's free tier is capped at 50 hours of container up-time per month!

Setting Up Our Workspace

  1. Initialize Your Repository

    The first step to getting your Gitpod workspace running is to initialize a GitHub repository.

  2. Create a GitPod Workspace (and optionally, a Gitpod Project)

    You can open your repository (or any repo your GitHub account has access to) in a Gitpod 'workspace' (i.e., ephemeral containerized runtime environment) by prepending gitpod.io/# to your GitHub repo's URL. For example, you can open the forem project in a Gitpod Workspace by navigating to the following URL: https://gitpod.io/#https://github.com/forem/forem.

    You'll be asked to select your preferred editor experience, be that VS Code for the Browser, VS Code Desktop or another supported desktop IDE, or via SSH. Regardless of your editing method of choice, your next step will be setting up Gitpod's configuration files.

  3. Adding Gitpod Dotfiles

    Most configuration for Gitpod Workspaces is handled by two Dotfiles which you'll want to place in the root directory of your project's repo: .gitpod.yml and .gitpod.Dockerfile.

    • .gitpod.Dockerfile: This (optional) file allows you more flexibility to use your own custom Dockerfile, rather than one of Gitpod's official Docker images. For this set up, we'll create a .gitpod.Dockerfile to ensure our container uses a consistent Python version.
    • .gitpod.yml: This file specifies the underlying Gitpod workspace image to use for your runtime environment and allows you to define commands to be run on workspace startup as well as what ports (if any) you'd like to expose.

Here's our .gitpod.Dockerfile:

FROM gitpod/workspace-full

USER gitpod

# Install and set global Python version to 3.11
RUN pyenv install 3.11 \
    && pyenv global 3.11

This Dockerfile instructs Gitpod to spin up our workspace containers from the default gitpod/workspace-full image, which comes pre-bundled with typical development tools. We chose the default image out of convenience, but the gitpod/workspace-python image would have provided a lighter out-of-the-box footprint.

It also layers two pyenv commands on top of that image: one to install Python 3.11, and one to set the global Python version to 3.11. This is an important step for our project because some LangChain dependencies are currently incompatible with Python versions >= 3.12.
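Once a workspace built from this Dockerfile spins up, it's worth sanity-checking the pinned interpreter from the workspace terminal (the expected output noted in the comments is an assumption, not guaranteed verbatim):

# Confirm the pyenv-managed interpreter is active in the workspace
python --version   # should report Python 3.11.x
pyenv versions     # the 3.11 entry should be marked as the global version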

We used the following configuration in our .gitpod.yml file:

image:
    file: .gitpod.Dockerfile

tasks:
    - init: pip install -r requirements.txt

This configuration does two things:

  1. It instructs Gitpod to use our custom workspace image.
  2. It instructs Gitpod to run pip install -r requirements.txt when the workspace is initialized, ensuring that all necessary Python libraries are installed and available in the runtime environment (see the expanded sketch below).
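For reference, here's a slightly expanded, hypothetical .gitpod.yml sketch illustrating the distinction Gitpod draws between init tasks (run when the workspace is first built) and command tasks (run on every start), along with port exposure. The jupyter invocation and port number are illustrative assumptions, not part of our actual setup:

image:
    file: .gitpod.Dockerfile

tasks:
    # init runs once, when the workspace is first initialized
    - init: pip install -r requirements.txt
      # command runs on every workspace start (illustrative)
      command: jupyter notebook --no-browser --port 8888

ports:
    # expose the notebook port without auto-opening a preview
    - port: 8888
      onOpen: ignore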

  4. Initialize requirements.txt and Commit Changes

    Before we finish our work initializing Gitpod, we'll need to install some of our known requirements and importantly, persist our changes by committing to our GitHub repository. You can use the terminal session connected to your Gitpod workspace to install Python libraries with pip. In our case, we'll install Jupyter Notebook:

pip install jupyter

Since we're working in an ephemeral workspace (container) and in a collaborative project, we'll need to be sure to save these changes by committing and pushing them to our GitHub repo. We'll use pip's freeze command to save our currently-installed libraries to a requirements.txt file in the root directory of our repo, so that all necessary libraries are installed each time our workspace image is recreated.

pip freeze > requirements.txt

This command redirects the output of pip freeze (a list of currently-installed libraries and their versions) to the text file requirements.txt, which is referenced in our .gitpod.yml file.
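The resulting file is just a plain-text list of pinned packages. An excerpt might look something like this (the version numbers are illustrative, since pip freeze records whatever happens to be installed):

# Illustrative excerpt of a generated requirements.txt
ipykernel==6.29.0
jupyter==1.0.0
notebook==7.1.0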

Now we're ready to persist all of these changes by committing and pushing to our GitHub repository. You can use the built-in VS Code (or your editor of choice) 'Source Control' panel, or alternatively use the terminal in your workspace:

# Stage changed files
git add .
# Commit changes with a message
git commit -m "Your commit message here"
# Push these changes to the main branch of your remote repo
git push --set-upstream origin main 

At this point, you've got a functional cloud-based development environment! To add contributors, you'll just need to ensure that they have appropriate access to your project's repository and that they've created a Gitpod account. As a next step, you might consider setting up branch protections in your GitHub repo to prevent unintentional commits to your main branch by contributors, or further defining your development environment by specifying VS Code extensions to be pre-installed in your .gitpod.yml Dotfile.
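Pre-installing extensions, for example, is a one-stanza addition to .gitpod.yml. The extension IDs below (the official Python and Jupyter extensions for VS Code) are a reasonable starting point for a project like ours, but treat the exact list as an assumption to adapt:

# Addition to .gitpod.yml: pre-install VS Code extensions in new workspaces
vscode:
    extensions:
        - ms-python.python
        - ms-toolsai.jupyter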

If, like our team, you're planning to use Jupyter Notebook files, note that you'll get the best support using VS Code Desktop rather than the browser.[5]

Concerns and Reflections

Rolling out a CDE with Gitpod turned out to be pretty simple, but the "cautionary tales" aren't without virtue. With just 50 hours of container uptime a month, it's clear that Gitpod can't provide a completely cost-free solution for professional teams working full-time, and there are strong arguments for simplifying this collaborative project's architecture by using just a GitHub repo and a more robust environment-management tool, like Conda.

Working in a nonprofit and humanitarian organization brings with it a particular set of needs and goals which aren't always reflected in tech-first corporations, and it's important not to brush aside these priorities in favor of the architecture du jour.

Some other potential drawbacks to the CDE approach include:

  • Network latency, which can be a major obstacle especially when working with contributors in areas with intermittent or weak internet connections;
  • More limited options for managing larger file storage without introducing additional complexity;
  • Limitations on data sovereignty given reliance on third-party hosting of project repositories.

That being said, using a CDE in the specific context of this initiative allows our team to lower barriers to meaningful contribution by focusing on commonly-taught tooling (VS Code, GitHub) and to minimize time spent troubleshooting platform- and machine-specific installation woes. It gives new contributors day-one exposure to the team's work, while also allowing the abstractions upholding the 'ready-to-code' environment to be elegantly surfaced and retired: contributors can transition to development on their local machines (with or without containerization) once they are confident setting up their own environment by cloning the repository locally and installing dependencies.
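That local transition is deliberately mundane. Assuming a pyenv-based setup that mirrors our workspace image (the repository URL below is a placeholder), it amounts to something like:

# Hypothetical local setup mirroring the Gitpod workspace
git clone https://github.com/your-org/your-repo.git
cd your-repo
pyenv install 3.11 && pyenv local 3.11    # pin the interpreter, as in our Dockerfile
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt           # same dependency set as the workspace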


  1. In practice, the term containerization encapsulates a broad approach in software engineering that may be used towards various ends, such as isolating the execution environment of potentially hazardous code from critical systems. IBM has a helpful overview of the topic at: https://www.ibm.com/topics/containerization 

  2. VS Code's official documentation on using the Source Control panel is quite helpful as a teaching/learning resource! 

  3. I found Mike Nikle's blog post on the subject to offer a strong, if perhaps overly skeptical, view on the drawbacks of adopting CDE products: "Dev environments in the cloud are a half-baked solution"

  4. It's worth noting here that Gitpod also supports GitLab and Bitbucket. 

  5. Per Gitpod's documentation: https://www.gitpod.io/docs/introduction/languages/python#jupyter-notebooks-in-vs-code 
