Mukul Wadhwa

Setting Up a Data Science Environment on the Cloud (Without the Usual Setup Pain)

If you’ve worked with data science or machine learning, you already know this part is not fun:

- Installing Python packages
- Fixing dependency conflicts
- Matching library versions
- Repeating the same setup on every new machine

Before you even write your first line of actual ML code, you’ve already burned an hour.

In this post, I’ll walk through:

- what a practical data science environment actually needs
- common mistakes people make during setup
- one clean way to avoid the whole mess on cloud VMs

This is written from a hands-on infrastructure perspective.

What a Real Data Science Environment Needs

A usable data science setup is more than “Python installed”.

At minimum, you usually need:

Core data & numerical stack

- NumPy
- Pandas
- SciPy

Visualization

- Matplotlib
- Seaborn
- Plotly

Machine learning

- Scikit-learn
- XGBoost / LightGBM / CatBoost

Deep learning (CPU or GPU)

- PyTorch
- TensorFlow / Keras

Notebooks & dev tools

- JupyterLab
- IPython
- Requests, tqdm, etc.
Database connectivity

Most real projects also pull data from:

- PostgreSQL / MySQL
- MongoDB

which means you need client libraries, not just Python itself.
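
To make this concrete, here’s a minimal smoke test you can run on any fresh environment. The module list below is just an illustration mirroring the stack above; trim or extend it to match your own:

```python
# smoke_test.py - check that the core stack is importable and report versions.
# The module list is illustrative; adjust it to your own stack.
import importlib

STACK = [
    ("numpy", "NumPy"),
    ("pandas", "Pandas"),
    ("scipy", "SciPy"),
    ("matplotlib", "Matplotlib"),
    ("sklearn", "Scikit-learn"),
    ("torch", "PyTorch"),
    ("psycopg2", "PostgreSQL client"),
    ("pymongo", "MongoDB client"),
]

for module_name, label in STACK:
    try:
        module = importlib.import_module(module_name)
        version = getattr(module, "__version__", "unknown")
        print(f"OK   {label:<18} {version}")
    except ImportError as exc:
        print(f"FAIL {label:<18} {exc}")
```

Anything that prints FAIL is a missing piece you’ve caught before it becomes a mid-notebook surprise.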

Missing any of these usually leads to:

- “ModuleNotFoundError”
- “Version conflict”
- “Works on my machine”

Why Local Setup Becomes Painful

Local environments break down fast when:

- you switch machines
- you collaborate with others
- you need more RAM or CPU
- you reinstall your OS

Conda helps, Docker helps, but both still require:

- learning curves
- maintenance
- debugging broken environments

For many people, the problem isn’t coding; it’s environment reliability.

A Cleaner Approach: Pre-Configured Cloud VMs

One approach that’s worked well for me is using a pre-configured cloud VM where:

- the OS is already set up
- common data science and ML libraries are pre-installed
- database connectors are ready (see the sketch below)
- SSH access works out of the box

You spin it up, SSH in, and start coding.

No fighting pip.
No rebuilding environments.
No “let me install this first”.
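
To show what “ready” means in practice, here’s a hedged connectivity sketch, assuming psycopg2 and pymongo are the pre-installed clients. Every hostname and credential below is a placeholder, not a real endpoint:

```python
# connectivity_check.py - verify the database clients can reach your databases.
import psycopg2                    # PostgreSQL client
from pymongo import MongoClient    # MongoDB client

# PostgreSQL: open a connection and run a trivial query.
# host/dbname/user/password are placeholders; use your own.
pg_conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="analytics",
    user="ds_user",
    password="change-me",
)
with pg_conn.cursor() as cur:
    cur.execute("SELECT version();")
    print("PostgreSQL:", cur.fetchone()[0])
pg_conn.close()

# MongoDB: ping the server to confirm the client can connect.
mongo = MongoClient("mongodb://your-mongo-host:27017/",
                    serverSelectionTimeoutMS=5000)
print("MongoDB:", mongo.admin.command("ping"))
mongo.close()
```

If both print cleanly, the “install a client library first” step is already behind you.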

This is especially useful when:

- experimenting quickly
- onboarding new teammates
- running heavier workloads than a laptop can handle

What to Look for in a Data Science VM

If you go this route, make sure the VM actually provides:

- 20+ commonly used Python data science and ML libraries
- SQL and MongoDB client connectors
- SSH access with full control
- scalable CPU and RAM
- no forced managed services you didn’t ask for

GPU support is a bonus, but it only matters if you truly need it.
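
If you want to sanity-check what a VM actually exposes, here’s a short sketch (assuming PyTorch is the installed deep learning framework):

```python
# resource_check.py - report CPU count and whether a usable GPU is visible.
# Assumes PyTorch; swap in your framework of choice.
import os

import torch

print(f"CPU cores: {os.cpu_count()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; fine for CPU-only workloads.")
```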

A Practical Example (Disclosure)

I recently set this up internally as a Data Science VM with:

- a pre-installed Python data science and machine learning stack
- SQL and MongoDB connectors
- SSH access
- scalable resources

If you’re curious what that looks like in practice, here’s a reference implementation:

https://manage.digirdp.com/store/data-science-vm

Disclosure: this is a product from my own infrastructure setup. I’m linking it for reference, not as a requirement.

Final Thoughts

Tooling should get out of your way, not slow you down.

Whether you:

- build your own base image
- use a managed platform
- or run a pre-configured VM

the goal is the same:

Spend time on data and models, not environment firefighting.

If you’ve found cleaner ways to manage data science environments, I’d love to hear them in the comments.
