DEV Community: swayam

How Dataflow and IONOS Are Building Sovereign Data Pipelines for Europe

swayam — Fri, 26 Jun 2026 07:40:18 +0000

70% of data and AI projects never reach production. The bottleneck isn't talent. It's tooling.

For European data teams, the bottleneck is tooling plus a legal minefield.

The platforms with the best developer experience (Jupyter, VS Code, Airflow, MLflow, Superset) sit on US infrastructure. The sovereign options leave teams stitching things together themselves, trading velocity for compliance.

That trade-off ends today.

Dataflow and IONOS, Europe's largest independent cloud provider, have partnered to deliver fully sovereign, enterprise-grade data pipelines across Europe. The full Dataflow environment now runs on IONOS Cloud. European-owned. European-operated. Outside U.S. CLOUD Act jurisdiction. By architecture, not policy.

Here's how our founder announced it on LinkedIn.

What you get:

Full stack on sovereign infra. Jupyter, VS Code, Airflow, MLflow, Superset on IONOS Cloud
Native connectors for IONOS Object Storage, Managed PostgreSQL, MariaDB, MongoDB. Zero egress fees
Kafka-native streaming via IONOS Event Streams. Sub-second data movement
EU-only data residency enforced at infrastructure level, not just policy
GDPR Article 30 and EU AI Act compliance. Audit logging, data lineage, consent-signal propagation built in
Single unified SLA. No finger-pointing between software vendor and cloud provider

Who it's for:

Government teams can build and ship AI without data leaving EU jurisdiction. No CLOUD Act grey areas.

Banks and FinTechs get auditable pipelines that meet DORA and EU AI Act requirements without a stack rip-and-replace.

Healthcare orgs can train models on patient data that stays exactly where regulation says it should.

Why now? The EU Commission awarded a €180M cloud contract in 2026 to European-only providers. France moved its Health Data Hub off Azure. The EU AI Act is in force with strict data provenance obligations. The market has moved. The question is whether your data stack has too.

"Sovereign cloud without sovereign data tooling is just compliant data storage. We're making it a platform."
CEO, Dataflow

Available now for enterprise customers across DACH, Benelux, and the UK.

🔗 Full details at dataflow.zone

Running Karpathy's Autoresearch Loop on a T4 GPU inside Dataflow

swayam — Wed, 27 May 2026 13:57:41 +0000

CONTEXT — WHAT IS DATAFLOW?

“Dataflow (dataflow.zone) is a Jupyter notebook cloud platform built for data teams and ML engineers who want a reproducible machine learning environment without managing infrastructure. It provides managed GPU instances for ML workloads, persistent shared disks, and containerized Python environments — a practical alternative to Colab, Paperspace, or Databricks for small teams. This post shows a real workflow running entirely inside Dataflow.”

EXECUTION STACK AT A GLANCE

Layer	Karpathy Original	Dataflow T4
Hardware	H100-class GPU	Tesla T4 (Dataflow)
Dataset	climbmix-400b-shuffle	TinyStories benchmark
Seq length	MAX_SEQ_LEN = 2048	MAX_SEQ_LEN = 256
Precision	bf16	fp16 (T4 compatible)
Attention	H100-oriented kernels	SDPA (patched)
Storage	Notebook home dir	/home/jovyan/shared/
Edit loop	Agent edits train.py freely	provider_loop.py (validated)
Experiment budget	5 minutes	5 minutes (unchanged)

I adapted Andrej Karpathy’s autoresearch execution model so it can run practically inside Dataflow on a T4 GPU instead of assuming an H100-class machine. Karpathy’s original execution is intentionally minimal. The human writes program.md, the agent reads it, edits only train.py, runs a fixed 5-minute training experiment, evaluates with prepare.py, checks val_bpb, commits the change if it improves, and rolls it back if it does not. In the original repo, prepare.py is the fixed benchmark layer: it uses karpathy/climbmix-400b-shuffle, MAX_SEQ_LEN = 2048, VOCAB_SIZE = 8192, TIME_BUDGET = 300, and a large validation budget. The training side is designed around a stronger GPU setup and expects the agent to freely modify train.py.
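
The keep-or-roll-back decision at the heart of that loop can be sketched in a few lines. This is a minimal illustration, not code from either repo: `LoopState`, `step`, and the `run_experiment` callback are hypothetical names, and the numbers in the demo are made up to exercise the logic rather than measured results. The only assumption taken from the description above is that a lower val_bpb counts as an improvement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LoopState:
    best_val_bpb: float  # lowest validation bits-per-byte seen so far
    history: List[Tuple[str, float, str]] = field(default_factory=list)


def step(state: LoopState, idea: str, run_experiment: Callable[[str], float]) -> str:
    """Run one experiment and decide whether to keep or roll back the edit."""
    val_bpb = run_experiment(idea)        # stand-in for the fixed 5-minute training run
    if val_bpb < state.best_val_bpb:      # lower bits-per-byte is better
        state.best_val_bpb = val_bpb
        decision = "keep"                 # here the real loop would commit train.py
    else:
        decision = "rollback"             # here the real loop would restore train.py
    state.history.append((idea, val_bpb, decision))
    return decision


if __name__ == "__main__":
    # Made-up scores, used only to show the control flow.
    fake_results = {"wider MLP": 1.92, "higher lr": 2.10, "longer warmup": 1.88}
    state = LoopState(best_val_bpb=2.00)
    for idea in fake_results:
        print(idea, "->", step(state, idea, fake_results.__getitem__))
```

Keeping the decision in a pure function, with the training run passed in as a callback, makes the rule easy to test without a GPU. A tie is treated as no improvement, so the edit is rolled back.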

In my Dataflow version, I kept the same core idea but patched the execution stack for a T4. I changed the data path from the large climbmix setup to a TinyStories-based benchmark, reduced the sequence length to MAX_SEQ_LEN = 256, kept the same 5-minute experiment budget, and moved the dataset, tokenizer, cache, and virtual environment into /home/jovyan/shared/autoresearch-t4-support so the workflow uses the larger persistent shared disk instead of filling the notebook home directory.
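
One way to keep those artifacts on the shared disk is to derive every path from a single base directory. The sketch below assumes the base path named above; the subdirectory names (`data`, `tokenizer`, `cache`, `.venv`) are hypothetical, and pointing `HF_HOME` and `PIP_CACHE_DIR` at the shared disk is an assumption about which caches matter, not something the notebook is known to do.

```python
import os
from pathlib import Path

# Base directory named in the post; the subdirectory names below are hypothetical.
SUPPORT_DIR = Path("/home/jovyan/shared/autoresearch-t4-support")


def support_layout(base: Path = SUPPORT_DIR) -> dict:
    """Return where each large artifact would live under the shared disk."""
    return {
        "dataset": base / "data",
        "tokenizer": base / "tokenizer",
        "cache": base / "cache",
        "venv": base / ".venv",
    }


def cache_env(base: Path = SUPPORT_DIR) -> dict:
    """Environment variables that steer common caches onto the shared disk."""
    layout = support_layout(base)
    return {
        "HF_HOME": str(layout["cache"] / "huggingface"),  # Hugging Face cache root
        "PIP_CACHE_DIR": str(layout["cache"] / "pip"),     # pip download cache
    }


if __name__ == "__main__":
    os.environ.update(cache_env())
    for name, path in support_layout().items():
        print(f"{name:10s} {path}")
```

Setting the cache variables before any download starts is what keeps large files out of the notebook home directory.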

I also patched the training path for T4 compatibility. The original setup works well on H100-class hardware, but the T4 path needed fp16 in place of bf16 (the T4 has no native bf16 support), SDPA attention in place of the H100-oriented attention kernels, and the removal of kernel dependencies the T4 does not support. With those changes, train.py actually runs on the Tesla T4 available in Dataflow.
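
The precision choice can be expressed as a small rule keyed on the GPU's CUDA compute capability. This is a dependency-free sketch, not the patch applied to train.py: `pick_precision` is a hypothetical helper, and the returned dictionary is just an illustrative shape. It relies on two facts: native bf16 starts at compute capability 8.0 (Ampere), and the Tesla T4 is a Turing card at 7.5.

```python
from typing import Dict, Tuple


def pick_precision(compute_capability: Tuple[int, int]) -> Dict[str, object]:
    """Choose mixed-precision settings from a CUDA compute capability.

    bf16 is natively supported from compute capability 8.0 (Ampere) upward;
    the Tesla T4 is Turing (7.5), so it falls back to fp16.
    """
    major, _minor = compute_capability
    if major >= 8:
        # bf16 shares fp32's exponent range, so loss scaling is normally unnecessary.
        return {"dtype": "bfloat16", "loss_scaling": False}
    # fp16 has a narrow exponent range; loss scaling guards against gradient underflow.
    return {"dtype": "float16", "loss_scaling": True}


if __name__ == "__main__":
    # With PyTorch installed, the tuple could come from
    # torch.cuda.get_device_capability(); it is hard-coded here to avoid the dependency.
    print("T4  :", pick_precision((7, 5)))
    print("H100:", pick_precision((9, 0)))
```

In PyTorch, the SDPA path mentioned above corresponds to `torch.nn.functional.scaled_dot_product_attention`, which selects a backend supported by the current device instead of requiring a hardware-specific kernel package.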

The other major change was how the AI edit loop is controlled. Karpathy’s original setup assumes a coding agent directly edits train.py with full freedom. In Dataflow, I made that safer through t4-colab-loop.ipynb and provider_loop.py: the notebook lets me choose Gemini or another provider, securely enter the API key, ask the model for experiment ideas, apply only validated edits to train.py, run the 5-minute training job, parse val_bpb, and keep or discard the run using local git.
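
Two pieces of that loop are easy to illustrate: gating an edit before it overwrites train.py, and pulling val_bpb out of the training log. The sketch below is not the contents of provider_loop.py. `is_valid_edit` and `parse_val_bpb` are hypothetical helpers, the checks shown (non-empty, size-bounded, parseable Python) are one plausible minimum, and the log format in the regular expression is an assumption.

```python
import ast
import re
from typing import Optional

# Assumed log format: a line such as "val_bpb: 1.2345" or "val_bpb=1.2345".
VAL_BPB_RE = re.compile(r"val_bpb\s*[:=]\s*([0-9]*\.?[0-9]+)")


def is_valid_edit(new_source: str, max_chars: int = 200_000) -> bool:
    """Cheap gate before overwriting train.py: non-empty, bounded, and parseable."""
    if not new_source.strip() or len(new_source) > max_chars:
        return False
    try:
        ast.parse(new_source)  # rejects edits that are not syntactically valid Python
    except (SyntaxError, ValueError):
        return False
    return True


def parse_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb reported in the training log, or None if absent."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    print(is_valid_edit("LR = 3e-4\n"))                      # well-formed edit
    print(is_valid_edit("def broken(:\n"))                   # syntax error is rejected
    print(parse_val_bpb("step 100\nval_bpb: 1.75\ndone\n"))  # illustrative log text
```

A syntax check does not prove an edit is sensible, but it stops a malformed model response from burning a 5-minute training slot. Returning `None` when no metric is found lets the caller treat a crashed run as a discard instead of a misleading score.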

So the difference is: Karpathy’s repo proves the clean H100 agent loop; my version keeps that loop but patches the hardware layer, dataset layer, storage layer, precision/attention layer, and edit-safety layer so the same autoresearch idea can run in a Dataflow T4 notebook environment.

KEY TAKEAWAYS

The autoresearch loop is hardware-agnostic when you patch the right layers — no H100 needed.
Dataflow’s persistent shared disk (/home/jovyan/shared/) keeps dataset, tokenizer, and venv off the limited notebook home directory.
fp16 + SDPA is a viable T4 substitute for bf16 + H100-tuned kernels, with no changes to the core experiment logic.
The validated edit loop via provider_loop.py makes AI-driven train.py mutation safe for multi-run research workflows.
This is a working example of a reproducible machine learning environment on a managed GPU notebook — the kind of setup Dataflow is built for.

Want to run this yourself?

Dataflow gives you managed GPU instances, persistent shared storage, and a cloud Jupyter environment — everything this workflow needs. Visit dataflow.zone to get started, and see the code HERE

From Zero to Pipeline in 10 Minutes: The End of Environment Chaos

swayam — Sat, 14 Mar 2026 20:30:12 +0000

Three engineers. Three environments. Zero consistency. Sound familiar?

Every data team hits this wall. A new project starts, and before a single pipeline runs, someone's debugging a dependency conflict, someone else is rewriting a .env file, and a new hire is still setting up their local environment on day three. This isn't a skills problem. It's an infrastructure problem.

The Real Cost of Environment Chaos
Think about how often your team deals with:

Dependency conflicts that break everything when one package updates
"Works locally, fails in production" moments right before a deadline
New engineers spending their first week on setup instead of shipping
Notebooks, pipelines, and dashboards that can't share the same connections
None of this produces value. It's just friction between your team and the actual work.

What a Shared Foundation Changes

Instead of every engineer setting up their own environment, you define it once (dependencies, connections, secrets) and it works everywhere, for everyone, automatically.

That's exactly what Dataflow is built around.

One workspace. Jupyter, Airflow, Streamlit, and VS Code, pre-configured and ready the moment you log in. No pip installs. No config files. No Dockerfiles.
One set of connections. Define your data sources once. Every tool in your stack picks them up automatically.
One-click deployment. Push to production with dev-prod parity guaranteed. What works locally ships exactly as expected.
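
How Dataflow shares connections between tools is not described here. As a generic illustration of the "define once, read everywhere" idea, one common pattern is to store each connection as a URI in an environment variable and have every tool parse it the same way. In the sketch below, the `CONN_<NAME>` naming scheme, the `load_connection` helper, and the example URI are all hypothetical.

```python
import os
from urllib.parse import unquote, urlsplit


def load_connection(name: str, env=os.environ) -> dict:
    """Read a connection URI from an environment variable and split it into parts."""
    uri = env[f"CONN_{name.upper()}"]  # hypothetical naming scheme
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "user": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "database": parts.path.lstrip("/") or None,
    }


if __name__ == "__main__":
    # Example values only; a real deployment would inject the variable from a secret store.
    demo_env = {"CONN_WAREHOUSE": "postgresql://analyst:s%40cret@db.internal:5432/warehouse"}
    conn = load_connection("warehouse", demo_env)
    print(conn["host"], conn["port"], conn["database"])
```

Because a notebook, a scheduler task, and a dashboard would all call the same helper, a change to the connection is made in one place.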

Built for Teams Who'd Rather Ship Than Configure

Dataflow is for data engineers, AI/ML teams, startups, and researchers who are done losing time to infrastructure. GPU-powered instances, cloud-agnostic deployment, and enterprise-grade security, all without a DevOps team.

"I went from zero to running my first pipeline in under 10 minutes, without any DevOps support." David Park, Senior Data Analyst, Quantify Labs

Try It Today

If you're still rebuilding your environment every time a new project starts, it's time to stop.
Sign in and start building, no credit card required.

Running a project and need compute? Apply for up to $1,000 in free Dataflow credits, open to founders, data engineers, AI builders, and researchers. No credit card. No catch. Claim your free credits.

Not ready to sign up yet? Book a 20-minute demo and see it live.