<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Atelje Vagabond</title>
    <description>The latest articles on DEV Community by Atelje Vagabond (@ateljevagabond).</description>
    <link>https://dev.to/ateljevagabond</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873744%2F07072336-f4ee-4140-8bd8-8191770019d5.png</url>
      <title>DEV Community: Atelje Vagabond</title>
      <link>https://dev.to/ateljevagabond</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ateljevagabond"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Atelje Vagabond</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:19:16 +0000</pubDate>
      <link>https://dev.to/ateljevagabond/-1h0f</link>
      <guid>https://dev.to/ateljevagabond/-1h0f</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb" class="crayons-story__hidden-navigation-link"&gt;Why You Need MLOps: When CI/CD for Machine Learning Becomes Mandatory&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/ateljevagabond" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873744%2F07072336-f4ee-4140-8bd8-8191770019d5.png" alt="ateljevagabond profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/ateljevagabond" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Atelje Vagabond
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Atelje Vagabond
                
              
              &lt;div id="story-author-preview-content-3545787" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/ateljevagabond" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873744%2F07072336-f4ee-4140-8bd8-8191770019d5.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Atelje Vagabond&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 24&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb" id="article-link-3545787"&gt;
          Why You Need MLOps: When CI/CD for Machine Learning Becomes Mandatory
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mlops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mlops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Why You Need MLOps: When CI/CD for Machine Learning Becomes Mandatory</title>
      <dc:creator>Atelje Vagabond</dc:creator>
      <pubDate>Fri, 24 Apr 2026 11:01:39 +0000</pubDate>
      <link>https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb</link>
      <guid>https://dev.to/ateljevagabond/why-you-need-mlops-when-cicd-for-machine-learning-becomes-mandatory-3icb</guid>
      <description>&lt;p&gt;For six months, the team did everything right. They had a brilliant lead data scientist, Dr. Alan. They had a unique dataset. After hundreds of experiments and countless hours of training, they finally hit the magic number.&lt;/p&gt;

&lt;p&gt;Accuracy: &lt;strong&gt;94%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model converged. The investor demo was flawless. The funding was secured. The excitement in the room was palpable.&lt;/p&gt;

&lt;p&gt;Then, they made the decision that breaks almost every new ML team:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It works on Alan's machine. Let's just wrap it in an API and ship it to production tomorrow."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Six months of hard science was about to collide with the hard reality of software engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.ateljevagabond.se%2Fassets%2Fimages%2Fposts%2Fmlops-production-failure-comic-v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.ateljevagabond.se%2Fassets%2Fimages%2Fposts%2Fmlops-production-failure-comic-v2.png" alt="The MLOps disaster story: From working notebook to production failure, high cloud costs, and eventual disciplined pipeline." width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The comic above isn't just a funny illustration; it is the autobiography of thousands of companies trying to deploy machine learning for the first time.&lt;/p&gt;

&lt;p&gt;What followed for Dr. Alan's team wasn't a &lt;em&gt;model&lt;/em&gt; failure. It was a &lt;em&gt;system&lt;/em&gt; failure. The live data didn't match the clean training data. Predictions started silently degrading. Cloud costs exploded because GPU instances were left running idle. When things broke, no one could reproduce the exact combination of code and data that built the original model.&lt;/p&gt;

&lt;p&gt;They learned an expensive lesson: A model in a Jupyter notebook is a hypothesis. A model in production is an obligation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engineering Necessity of MLOps
&lt;/h2&gt;

&lt;p&gt;The transition from a research prototype to a live production service introduces engineering challenges that break traditional software deployment methodologies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Operations (MLOps)&lt;/strong&gt; isn't just a buzzword or a set of "best practices." It is the engineering discipline required to apply DevOps principles—continuous integration, continuous delivery, infrastructure-as-code—specifically to the unique lifecycle of machine learning.&lt;/p&gt;

&lt;p&gt;Unlike traditional software CI/CD, which primarily manages code versions, MLOps must manage &lt;strong&gt;three distinct, intertwined artifact types&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Code:&lt;/strong&gt; The training scripts, feature engineering logic, and serving wrappers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data:&lt;/strong&gt; The training datasets, validation splits, and live inference data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Models:&lt;/strong&gt; The serialized artifacts (pickles, ONNX files), container images, and hyperparameters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The complexity multiplies because a single production "rollout" must atomically coordinate all three. Furthermore, you need the ability to roll back any one artifact independently without causing cascading failures in the others.&lt;/p&gt;
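&lt;p&gt;As a minimal illustration (the record shape and names are hypothetical, not tied to any particular platform), a production rollout can be pinned as one manifest that references all three artifact types, so any one of them can be rolled back independently:&lt;/p&gt;

```typescript
// Hypothetical rollout manifest: one production release pins all three
// artifact types together, so each can be rolled back by ID.
interface RolloutManifest {
  codeCommit: string;     // git SHA of training + serving code
  datasetVersion: string; // immutable snapshot ID of the training data
  modelArtifact: string;  // registry URI of the serialized model
}

// Roll back only the model while keeping code and data pinned.
function rollbackModel(current: RolloutManifest, previousModel: string): RolloutManifest {
  return { ...current, modelArtifact: previousModel };
}

const live: RolloutManifest = {
  codeCommit: "9f2c1ab",
  datasetVersion: "sales-2026-03-snapshot",
  modelArtifact: "registry://churn-model/v14",
};

const rolledBack = rollbackModel(live, "registry://churn-model/v13");
console.log(rolledBack.modelArtifact); // registry://churn-model/v13
```

&lt;p&gt;Because the manifest is an immutable record, the three rollback paths stay independent while the deployed state remains fully described.&lt;/p&gt;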

&lt;h3&gt;
  
  
  Defining the Architectural Threshold: When Is MLOps Mandatory?
&lt;/h3&gt;

&lt;p&gt;How do you know when you've moved past the "prototype" phase and need a formal MLOps framework? It's not based on your model's accuracy score. It is determined by the operational tempo and complexity of your system.&lt;/p&gt;

&lt;p&gt;If your system meets these criteria, MLOps is no longer optional; it's a mandatory architectural requirement.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System Characteristic&lt;/th&gt;
&lt;th&gt;Local/Prototype Stage&lt;/th&gt;
&lt;th&gt;Production Stage (MLOps Required)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment Frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ad-hoc (Manual, periodic updates)&lt;/td&gt;
&lt;td&gt;High velocity (Weekly, daily, or automated triggers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Variability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed, frozen CSVs or tables&lt;/td&gt;
&lt;td&gt;Streaming data, semi-structured inputs, evident concept drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;System Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single user, local laptop GPU&lt;/td&gt;
&lt;td&gt;Distributed throughput, high concurrent users, auto-scaling clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auditability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unknown (it's somewhere in a notebook)&lt;/td&gt;
&lt;td&gt;Fully traceable lineage for compliance (banking, healthcare)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure Mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easily reproducible and debugged locally&lt;/td&gt;
&lt;td&gt;Non-deterministic, difficult to trace (e.g., Training-Serving Skew)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The engineering threshold is crossed the moment the cost of manual monitoring, debugging, and firefighting exceeds the cost of building robust automation tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Architectural Pain Points MLOps Solves
&lt;/h2&gt;

&lt;p&gt;Without a formal MLOps architecture, your system accumulates specific types of technical debt that degrade reliability and performance over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx6r587t7porq3oqj5ge.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx6r587t7porq3oqj5ge.jpg" alt="Diagram showing the difference between manual, error-prone ML deployments and structured MLOps using feature stores and CI/CD pipelines." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Training-Serving Skew (The Silent Killer)
&lt;/h3&gt;

&lt;p&gt;This is the most critical engineering failure. It happens when the logic used to calculate features during training differs from the logic used during real-time inference.&lt;/p&gt;

&lt;p&gt;For example, if your data scientist calculates a "7-day rolling average" in Pandas for training, but the production engineer reimplements that logic in Java for the serving API, tiny discrepancies will creep in. The model receives data in production that is mathematically different from what it saw during training, leading to junk predictions despite a high accuracy score. MLOps solves this through standardized &lt;strong&gt;feature stores&lt;/strong&gt; that ensure consistent logic.&lt;/p&gt;
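&lt;p&gt;A common safeguard, sketched below in TypeScript with illustrative names, is to define the feature computation exactly once and call the same function from both the training pipeline and the serving path, so the two cannot drift apart:&lt;/p&gt;

```typescript
// Illustrative sketch: one shared feature function, imported by both the
// training job and the serving API, so training and inference always see
// mathematically identical inputs.
function rollingAverage(values: number[], windowSize: number): number {
  // Average over the last windowSize entries (or all of them if fewer).
  const start = Math.max(0, values.length - windowSize);
  const window = values.slice(start);
  let sum = 0;
  for (const v of window) sum += v;
  return sum / window.length;
}

// Training pipeline and serving endpoint both call the same function:
const trainingFeature = rollingAverage([10, 12, 11, 13, 12, 14, 12], 7);
const servingFeature  = rollingAverage([10, 12, 11, 13, 12, 14, 12], 7);
console.log(trainingFeature === servingFeature); // true
```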

&lt;h3&gt;
  
  
  2. Model Drift and Data Decay
&lt;/h3&gt;

&lt;p&gt;Model performance degrades over time not because the model "breaks," but because the world changes. The statistical properties of the input data shift. Without automated monitoring and &lt;strong&gt;automated retraining triggers&lt;/strong&gt;, your model will confidently serve obsolete predictions.&lt;/p&gt;
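&lt;p&gt;A toy sketch of such a trigger (illustrative names and thresholds; real monitoring uses proper statistical tests) compares live input statistics against the baselines recorded at training time:&lt;/p&gt;

```typescript
// Toy drift check: flag retraining when the mean of recent live inputs
// shifts further from the training-time mean than an allowed number of
// training standard deviations.
function meanOf(xs: number[]): number {
  let s = 0;
  for (const x of xs) s += x;
  return s / xs.length;
}

function driftExceeds(liveValues: number[], trainMean: number, trainStd: number, maxSigmas: number): boolean {
  const shift = Math.abs(meanOf(liveValues) - trainMean);
  return shift > maxSigmas * trainStd;
}

// Training saw mean 100 with std 10; live traffic now averages about 130.
console.log(driftExceeds([128, 131, 133, 129], 100, 10, 2)); // true - retrain
console.log(driftExceeds([101, 99, 102, 98], 100, 10, 2));   // false
```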

&lt;h3&gt;
  
  
  3. Reproducibility Failure
&lt;/h3&gt;

&lt;p&gt;When an incident occurs in production at 3 AM, can your team immediately reproduce the exact state—the specific code commit, the exact slice of data, the hyperparameters, and library dependencies—that led to that deployed model? If not, you don't have a production system; you have a black box. MLOps ensures every deployed artifact is immutable and traceable back to its origin.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cloud Native MLOps Stack: A Deep Dive
&lt;/h2&gt;

&lt;p&gt;Modern MLOps architectures rely on cloud-native managed services to handle the heavy lifting of compute scheduling and container management, allowing teams to focus on the workflow logic.&lt;/p&gt;

&lt;p&gt;While many clouds offer solutions, the choice often comes down to existing infrastructure and compliance needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffch5sx7fyi9x72txone.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffch5sx7fyi9x72txone.jpg" alt="Architectural comparison between Google Vertex AI and Microsoft Azure ML. Detailed diagrams comparing the MLOps architectures of Google Cloud Vertex AI and Microsoft Azure Machine Learning." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚙️ Google Cloud Platform: Vertex AI
&lt;/h3&gt;

&lt;p&gt;Vertex AI emphasizes unified workflows where pipeline steps run as isolated containers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Functionality Focus:&lt;/strong&gt; &lt;strong&gt;Vertex AI Pipelines&lt;/strong&gt; is the core orchestration engine, supporting Kubeflow Pipelines (KFP) and TFX. It allows you to define your workflow as a Directed Acyclic Graph (DAG): Data Validation → Feature Engineering → Training → Evaluation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Key Component Notes:&lt;/strong&gt; The &lt;strong&gt;Vertex AI Feature Store (V2)&lt;/strong&gt; is now built on BigQuery for offline storage. It uses timestamp-based resolution to ensure point-in-time correctness, reducing training-serving skew.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Important Deprecation Warning:&lt;/strong&gt; Be aware that Google's Legacy Feature Store API will be shut down in early 2027. Furthermore, the "Optimized online serving" for V2 is also deprecated; Google is directing users toward &lt;strong&gt;Bigtable online serving&lt;/strong&gt; for low-latency scenarios. Plan any new architecture accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ☁️ Microsoft Azure: Azure Machine Learning (Azure ML)
&lt;/h3&gt;

&lt;p&gt;Azure ML shines in regulated enterprise environments due to its deep integration with Azure's governance and security fabric.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Architectural Strength:&lt;/strong&gt; Security is paramount. &lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; is handled via &lt;strong&gt;Microsoft Entra ID&lt;/strong&gt; (formerly Azure AD), meaning identity policy is centrally managed rather than replicated in the ML platform.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automation:&lt;/strong&gt; It relies heavily on event-driven automation via &lt;strong&gt;Azure Event Grid&lt;/strong&gt;. Events like &lt;code&gt;ModelRegistered&lt;/code&gt; or &lt;code&gt;RunCompleted&lt;/code&gt; can trigger downstream pipelines, automated validation checks, or deployments.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Lifecycle Management:&lt;/strong&gt; Azure ML has strong model registry capabilities with full lineage tracking. While it supports MLflow, native deployment workflows often rely on Azure SDK v2 &lt;strong&gt;tags&lt;/strong&gt; to manage lifecycle states (e.g., tagging a model as &lt;code&gt;candidate&lt;/code&gt; vs &lt;code&gt;production&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Operational Caveats: The Hidden Costs of ML Infrastructure
&lt;/h2&gt;

&lt;p&gt;Before rolling out any architecture, you must address the elephant in the room: Operational Expenditure (OpEx).&lt;/p&gt;

&lt;p&gt;Scaling ML infrastructure—especially GPU-accelerated distributed training—introduces massive cost variability that shocks organizations unprepared for it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The GPU Price Tag:&lt;/strong&gt; On Google Cloud, a single node with 8x H100 GPUs (the current gold standard for LLM work) can run upwards of &lt;strong&gt;$90 per hour&lt;/strong&gt; on-demand. A 24-hour training run is a $2,000+ event. If you need multi-node distributed training, those costs scale linearly with the node count.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The "Zombie Cluster" Problem:&lt;/strong&gt; In Azure ML, if compute clusters are configured with a minimum node count greater than zero, those nodes run &lt;em&gt;continuously&lt;/em&gt;, regardless of whether a job is active. Without automated teardown triggers on job completion (or failure!), idle GPU hours will accumulate silently. You won't know until the five-figure bill arrives at the end of the month.&lt;/li&gt;
&lt;/ul&gt;
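&lt;p&gt;A back-of-envelope calculation using the on-demand figure above shows how quickly this compounds:&lt;/p&gt;

```typescript
// Back-of-envelope GPU budget check. 90 USD/hour for one 8x-GPU node,
// scaling linearly with node count (illustrative rate from the text above).
function trainingRunCostUsd(hourlyRate: number, hours: number, nodes: number): number {
  return hourlyRate * hours * nodes;
}

console.log(trainingRunCostUsd(90, 24, 1)); // 2160 - a single-node day
console.log(trainingRunCostUsd(90, 24, 4)); // 8640 - four nodes, same day

// The "zombie cluster" failure mode: a minimum node count above zero
// keeps billing around the clock even when no job is running.
const idleMonthUsd = trainingRunCostUsd(90, 24 * 30, 1);
console.log(idleMonthUsd); // 64800 - an idle month nobody noticed
```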

&lt;p&gt;&lt;strong&gt;Architectural Requirement:&lt;/strong&gt; Operational budget planning and FinOps practices must be integrated from Day 1. You need automated cluster teardown triggers, strict GPU utilization alerts, and scheduled pipeline runs to avoid on-demand burst pricing.&lt;/p&gt;

&lt;p&gt;Implementing strict &lt;a href="https://ateljevagabond.se/services/cloud-cost-optimization-finops/" rel="noopener noreferrer"&gt;Cloud Cost Optimization and FinOps Practices&lt;/a&gt; is just as critical as the ML code itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes When Adopting MLOps
&lt;/h2&gt;

&lt;p&gt;The failures we see are rarely related to the math; they are related to the process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Assigning MLOps to the wrong team.&lt;/strong&gt;&lt;br&gt;
MLOps sits at the intersection of Data Science, Data Engineering, and DevOps. Handing the entire responsibility to a data science team with no infrastructure experience, or a DevOps team with no ML exposure, is a recipe for disaster. Pipelines will technically "run," but they won't be robust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Skipping the Data Audit.&lt;/strong&gt;&lt;br&gt;
MLOps is architecture built on data assumptions. If you build pipelines before auditing your data reality—schema consistency, null distributions, ingestion latency—you will build a very expensive system that automates the ingestion of garbage data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Treating MLOps as a "One-Time Setup."&lt;/strong&gt;&lt;br&gt;
MLOps infrastructure is not static. As noted in the Vertex AI section above, cloud APIs deprecate, SDK versions end-of-life, and pricing models change. If you don't budget for ongoing platform engineering maintenance, your pipelines &lt;em&gt;will&lt;/em&gt; break within 18 months.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary: MLOps is System Resilience
&lt;/h2&gt;

&lt;p&gt;MLOps isn't a product you buy; it's the discipline that shifts machine learning from a research science experiment into a reliable, scalable production service.&lt;/p&gt;

&lt;p&gt;Ultimately, the infrastructure choices made during the prototype phase—feature contracts, data formats, registry design—become severe technical debt at scale. Establishing a rigorous foundation for your &lt;a href="https://ateljevagabond.se/services/mlops-and-data-management/" rel="noopener noreferrer"&gt;MLOps and Data Management&lt;/a&gt; ensures your system survives production loads without requiring continuous, expensive rework.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Connecting Physical and Digital: An ESP32 Spotify Display Built on Cloudflare Workers</title>
      <dc:creator>Atelje Vagabond</dc:creator>
      <pubDate>Sat, 11 Apr 2026 15:57:40 +0000</pubDate>
      <link>https://dev.to/ateljevagabond/connecting-physical-and-digital-an-esp32-spotify-display-built-on-cloudflare-workers-3j3l</link>
      <guid>https://dev.to/ateljevagabond/connecting-physical-and-digital-an-esp32-spotify-display-built-on-cloudflare-workers-3j3l</guid>
      <description>&lt;p&gt;At &lt;a href="https://ateljevagabond.se" rel="noopener noreferrer"&gt;Ateljé Vagabond&lt;/a&gt; we build small, well-crafted systems. This article is about one of them: a 1.83-inch ESP32 display on the studio desk that shows what is currently playing on Spotify — and the Cloudflare Worker behind it that simultaneously powers a live widget on our website.&lt;/p&gt;

&lt;p&gt;The interesting part is not the display itself. It is the architecture that makes it work without storing credentials in firmware, without duplicating API calls, and without two separate Spotify integrations fighting each other. One Worker at the edge handles everything: OAuth, token refresh, PNG-to-RGB565 image conversion, caching, and authentication for two completely different types of client.&lt;/p&gt;

&lt;p&gt;This post covers the full system — the hardware quirks, the Worker design, the zero-heap-allocation JSON parser on the microcontroller, the OTA update pipeline, and why it is built the way it is.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/5jMgarDr5s8"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Call Spotify Directly From the Device?
&lt;/h2&gt;

&lt;p&gt;We wanted a small LCD on the studio desk that shows what is currently playing on Spotify. We had an ESP32 and a 1.83-inch display. The obvious first idea was to call the Spotify API directly from the firmware.&lt;/p&gt;

&lt;p&gt;In practice, this does not work well. The Spotify Web API uses OAuth tokens that expire every hour. Storing a client secret in firmware is not safe — anyone who reads the flash has your credentials. The device would also need to handle token refresh, manage TLS certificates, and stay within rate limits, all on a microcontroller with limited memory. We also wanted the same now-playing data shown as a widget on our website. That would mean two separate clients calling Spotify independently, which doubles the rate-limit risk and doubles the credential exposure for no good reason.&lt;/p&gt;

&lt;p&gt;The right solution is one backend that owns the Spotify communication, and serves both clients from there. That is what we built.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;A Cloudflare Worker sits between Spotify and everything else. It holds the OAuth credentials, refreshes the token, calls &lt;code&gt;/currently-playing&lt;/code&gt;, and returns different responses depending on who is asking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  ┌─────────────────────────────┐
                  │      Cloudflare Worker       │
                  │   (iot.ateljevagabond.se)    │
                  │                             │
  Spotify API ───►│ - Token refresh             │
                  │ - /currently-playing fetch  │
                  │ - PNG → RGB565 + RLE encode │
                  │ - KV cover cache (7 days)   │
                  └──────────┬──────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
   IoT Device Request              Browser Request
  (mTLS / gateway secret)        (Origin header check)
              │                             │
              ▼                             ▼
    ┌──────────────────┐       ┌────────────────────────┐
    │  ESP32 + ST7789  │       │  ateljevagabond.se      │
    │  Waveshare 1.83" │       │  Now Playing Web Widget │
    │  240×280 pixels  │       │  (live in your browser) │
    └──────────────────┘       └────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Spotify API is called once per polling cycle — one token refresh, one &lt;code&gt;currently-playing&lt;/code&gt; fetch — regardless of how many clients are connected. The browser gets a JSON response with the cover URL. The ESP32 gets the same JSON plus the album art already converted to the binary pixel format the display expects: RGB565, run-length encoded, base64-encoded so it travels safely inside a JSON string. Cloudflare's edge cache means the Worker does not execute on every single poll. Spotify rate limits stop being a concern.&lt;/p&gt;
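&lt;p&gt;The split can be sketched roughly like this (illustrative code, not the actual Worker; the field names are assumptions):&lt;/p&gt;

```typescript
// Illustrative sketch of serving two client types from one endpoint:
// the same upstream now-playing data, shaped differently per client.
type NowPlaying = { track: string; artist: string; coverUrl: string };

function shapeResponse(data: NowPlaying, client: string, coverRgb565: string): { [key: string]: string } {
  if (client === "browser") {
    // The browser widget fetches the cover image itself from the URL.
    return { track: data.track, artist: data.artist, cover_url: data.coverUrl };
  }
  // The device gets the pre-encoded pixels inline, base64 inside JSON.
  return { track: data.track, artist: data.artist, cover_rgb565: coverRgb565 };
}

const np = { track: "Song", artist: "Band", coverUrl: "https://example.com/c.png" };
console.log(shapeResponse(np, "device", "AAAA")["cover_rgb565"]); // AAAA
console.log(shapeResponse(np, "browser", "AAAA")["cover_url"]);   // https://example.com/c.png
```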




&lt;h2&gt;
  
  
  Part 1: The Hardware and Its Quirks
&lt;/h2&gt;

&lt;p&gt;The hardware is a &lt;a href="https://www.dfrobot.com/product-1590.html" rel="noopener noreferrer"&gt;DFRobot FireBeetle ESP32&lt;/a&gt; connected to a &lt;a href="https://www.waveshare.com/1.83inch-lcd-module.htm" rel="noopener noreferrer"&gt;Waveshare 1.83-inch IPS LCD&lt;/a&gt; with an ST7789 controller — 240×280 pixels over SPI.&lt;/p&gt;

&lt;p&gt;Getting the display to render correctly took longer than expected. Three PlatformIO flags are essential and none are documented clearly for this specific panel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TFT_OFFSET_Y=20&lt;/code&gt; — The ST7789 controller's internal frame buffer is 240×320, but the 1.83" panel only uses 240×280, starting at row 20. Without this, every image renders shifted upward, clipped at the top.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TFT_RGB_ORDER=TFT_BGR&lt;/code&gt; — This panel swaps the red and blue channels. Without this flag, reds look blue and blues look red.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TFT_INVERSION_ON&lt;/code&gt; — The panel requires color inversion enabled. Without this, everything renders as a washed-out gray.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the kind of things that datasheets should say clearly. We found all three through trial and error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming, Not Buffering
&lt;/h3&gt;

&lt;p&gt;The ESP32 has limited heap. Loading an entire 115 KB RGB565 image into RAM before displaying it is not reliable. Instead, the firmware streams the cover image response in 512-byte steps. Two FreeRTOS tasks run in parallel: one fetches track metadata every 3 seconds, the other handles the image stream and writes pixels to the display as they arrive. A &lt;code&gt;tftMutex&lt;/code&gt; serializes all display writes between the two tasks, and a watchdog triggers a soft display reset if the lock fails repeatedly — a necessary guard when two tasks compete for the same SPI bus.&lt;/p&gt;

&lt;p&gt;The result is that album art appears from top to bottom while it is still downloading — not something we planned explicitly, but a natural consequence of the streaming approach, and we kept it.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Parser With No Heap Allocation
&lt;/h3&gt;

&lt;p&gt;The main JSON response includes the encoded pixel data, which makes it too large to deserialize with a standard library without risking heap fragmentation. We wrote a character-by-character state machine instead. It parses the JSON as bytes arrive, and when it reaches the &lt;code&gt;cover_rgb565&lt;/code&gt; field it switches into decode mode: every four base64 characters produce three raw bytes, which go directly into the RLE decoder, which expands them into pixel values and writes them to the display. The full image passes through the device without ever being held in memory as a whole.&lt;/p&gt;
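&lt;p&gt;The four-characters-to-three-bytes step looks roughly like this, shown in TypeScript for readability even though the firmware implements the same arithmetic in C++:&lt;/p&gt;

```typescript
// Every group of four base64 characters packs into one 24-bit value,
// which splits into three raw bytes ready for the RLE decoder - no need
// to buffer the whole payload.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function decodeQuad(quad: string): number[] {
  let group = 0;
  for (const ch of quad) {
    group = group * 64 + ALPHABET.indexOf(ch); // append 6 bits per character
  }
  return [(group >> 16) % 256, (group >> 8) % 256, group % 256];
}

console.log(decodeQuad("TWFu")); // [77, 97, 110] - the bytes of "Man"
```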

&lt;h3&gt;
  
  
  Paused and Idle
&lt;/h3&gt;

&lt;p&gt;After 60 seconds without playback the display switches to an alternate screen: the studio boot image, the current time, and the date. Time is synced on boot via NTP from &lt;code&gt;time.cloudflare.com&lt;/code&gt;. The track title scrolls as a marquee when it is too long for the display width.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2: The Cloudflare Worker
&lt;/h2&gt;

&lt;p&gt;The Worker is a single TypeScript file. It handles two very different clients — a browser and an embedded device — with one &lt;code&gt;fetch&lt;/code&gt; handler.&lt;/p&gt;

&lt;h3&gt;
  
  
  Authentication Without Shared State
&lt;/h3&gt;

&lt;p&gt;Browsers and ESP32s authenticate differently. Browsers send an &lt;code&gt;Origin&lt;/code&gt; header; the Worker checks it against an allowlist and handles CORS normally. The ESP32 sends no &lt;code&gt;Origin&lt;/code&gt; header; instead it presents a client certificate over mTLS. The certificate and private key are stored in LittleFS on the device. Cloudflare's infrastructure verifies the certificate before the request reaches the Worker.&lt;/p&gt;

&lt;p&gt;As a fallback for environments where mTLS is not available, the Worker also accepts a shared secret in a custom header. The comparison uses a constant-time function to prevent timing attacks — a small detail that matters when the secret is all that stands between an attacker and your Spotify data.&lt;/p&gt;
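
&lt;p&gt;A minimal sketch of that constant-time pattern (illustrative, not the Worker's exact code): mismatches are accumulated with XOR so every call touches every character, whether or not an earlier character already differed.&lt;/p&gt;

```typescript
// Constant-time string comparison sketch. The OR-of-XORs accumulates
// any mismatch without branching on it, so timing does not reveal how
// many leading characters of the secret were guessed correctly.
function timingSafeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false; // length is not secret here
  let diff = 0;
  for (let i = 0; i !== a.length; i += 1) {
    diff = diff | (a.charCodeAt(i) ^ b.charCodeAt(i));
  }
  return diff === 0;
}
```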

&lt;p&gt;Neither authentication path exposes the Spotify credentials. Tokens live in the Worker's environment variables and never leave the edge.&lt;/p&gt;

&lt;h3&gt;
  The Image Pipeline
&lt;/h3&gt;

&lt;p&gt;This was the most technically interesting part. When the ESP32 requests the cover image, the Worker:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetches the album art PNG from Spotify's CDN&lt;/li&gt;
&lt;li&gt;Decodes the PNG using a pure TypeScript implementation with no external dependencies&lt;/li&gt;
&lt;li&gt;Converts every pixel to RGB565 — the native 16-bit format of the ST7789 display: 5 bits red, 6 bits green, 5 bits blue&lt;/li&gt;
&lt;li&gt;Run-length encodes the result. Album art compresses well — large areas of similar color are common. A 115 KB raw image typically becomes 20–40 KB&lt;/li&gt;
&lt;li&gt;Base64-encodes the compressed data so it fits inside a JSON string&lt;/li&gt;
&lt;li&gt;Caches the result in Cloudflare KV for 7 days, keyed by a hash of the source URL and encoding parameters&lt;/li&gt;
&lt;/ol&gt;
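
&lt;p&gt;Steps 3 and 4 condense into a short sketch. The helper names and the (run length, value) wire format are assumptions for illustration; the Worker's actual encoder may pack runs differently:&lt;/p&gt;

```typescript
// RGB888 to RGB565: keep the top 5/6/5 bits of each channel. Written
// with arithmetic instead of the usual shift-and-mask; same result.
function toRgb565(r: number, g: number, b: number): number {
  const r5 = Math.floor(r / 8); // 8 bits down to 5
  const g6 = Math.floor(g / 4); // 8 bits down to 6
  const b5 = Math.floor(b / 8); // 8 bits down to 5
  return r5 * 2048 + g6 * 32 + b5; // pack as rrrrrggggggbbbbb
}

// Run-length encode 16-bit pixel values as (count, value) pairs,
// capping runs at 255 so a count fits in one byte.
function rleEncode(pixels: number[]): number[] {
  const out: number[] = [];
  let i = 0;
  while (i !== pixels.length) {
    const v = pixels[i];
    let run = 1;
    while (run !== 255) {
      if (i + run === pixels.length) break;
      if (pixels[i + run] !== v) break;
      run += 1;
    }
    out.push(run, v);
    i += run;
  }
  return out;
}
```

&lt;p&gt;Large same-color regions collapse into single pairs, which is where the 115 KB to 20–40 KB reduction on typical album art comes from.&lt;/p&gt;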

&lt;p&gt;The expensive work — PNG decoding, pixel conversion, RLE encoding — happens once per unique album cover. Every subsequent request for the same cover hits KV directly.&lt;/p&gt;




&lt;h2&gt;
  Part 3: The Website Widget
&lt;/h2&gt;

&lt;p&gt;The website at &lt;a href="https://ateljevagabond.se" rel="noopener noreferrer"&gt;ateljevagabond.se&lt;/a&gt; calls the same Worker endpoint, without the cover encoding parameters. It gets a plain JSON response and renders the track name, artist, cover image, and a progress bar. The widget polls every 5 seconds. No separate backend. No separate Spotify integration. The same Worker call, a different response shape.&lt;/p&gt;




&lt;h2&gt;
  Four Days From First Commit to OTA Updates
&lt;/h2&gt;

&lt;p&gt;The Worker came first. We built it in November 2025 while working on the main website. The now-playing feature grew from there: cover image proxying, then the RGB565 conversion, then mTLS authentication.&lt;/p&gt;

&lt;p&gt;The ESP32 firmware started on December 26, 2025. The commit log shows the actual progression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dec 26 13:15 — Initial commit with CI
Dec 26 15:16 — Improve display recovery and add Husky hooks
Dec 26 22:40 — Use CA bundle and speed up cover fetch
Dec 28 22:42 — Add OTA update flow and release workflow
Dec 28 23:30 — Add paused/idle UI behavior
Dec 28 23:38 — Use remote R2 for OTA uploads
Dec 29 00:02 — Gate OTA release on CI and document Husky
Dec 29 10:36 — chore(ci): avoid duplicate runs and stage firmware
Dec 29 11:19 — chore(ci): upload to R2 via S3 endpoint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four days from first commit to a working CI pipeline, OTA firmware updates via Cloudflare R2, a display recovery watchdog, paused/idle UI states, and Husky hooks enforcing code quality.&lt;/p&gt;

&lt;p&gt;The OTA setup works like this: on boot, the device fetches a JSON manifest from the Worker containing the latest version string, binary URL, SHA256 hash, and size. If the version is newer, it downloads the binary over HTTPS and performs an A/B partition swap — no USB cable needed after the first flash. GitHub Actions builds the firmware with &lt;code&gt;APP_VERSION&lt;/code&gt; injected at compile time, uploads the binary to R2, and publishes the manifest.&lt;/p&gt;
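
&lt;p&gt;The boot-time check reduces to a manifest plus a version comparison. A hedged TypeScript sketch; the field names below are illustrative, standing in for the manifest's version string, binary URL, SHA256 hash, and size:&lt;/p&gt;

```typescript
// Hypothetical shape of the OTA manifest described above.
interface OtaManifest {
  version: string; // e.g. "1.4.2", compared against the running APP_VERSION
  url: string;     // HTTPS location of the firmware binary in R2
  sha256: string;  // integrity check for the downloaded binary
  size: number;    // expected byte count
}

// Dotted numeric version comparison: true when `remote` is newer than
// the version currently running on the device.
function isNewer(remote: string, local: string): boolean {
  const a = remote.split(".").map(Number);
  const b = local.split(".").map(Number);
  const n = Math.max(a.length, b.length);
  for (let i = 0; i !== n; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return false; // identical versions: no update
}
```

&lt;p&gt;Only when the comparison says the manifest is newer does the device download the binary and swap partitions.&lt;/p&gt;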

&lt;p&gt;One commit worth noting: &lt;code&gt;"Dump wrangler logs on OTA failure"&lt;/code&gt; on December 29. The OTA uploads to R2 were failing silently in CI — Wrangler was swallowing its own error output. The fix was two lines: pipe the output to a file and print it. Once we could see the error, the underlying issue was clear and fixed in minutes. Silent failures waste time. Making failures visible is part of good engineering.&lt;/p&gt;




&lt;h2&gt;
  What We Have Now
&lt;/h2&gt;

&lt;p&gt;On the studio desk, a 1.83-inch display shows the current track, artist, album cover, a progress bar that advances locally between fetches, and the remaining time. After one minute without playback it switches to a clock and date screen.&lt;/p&gt;

&lt;p&gt;On the website, the same playback state appears as a small widget, updating every 5 seconds.&lt;/p&gt;

&lt;p&gt;The Spotify API sees one token refresh and one &lt;code&gt;currently-playing&lt;/code&gt; call per polling cycle, regardless of how many clients are connected. The KV cache means the expensive work — PNG decoding and RGB565 conversion — happens at most once per unique album cover over a 7-day window.&lt;/p&gt;

&lt;p&gt;The security is straightforward. The ESP32 identifies itself with a client certificate. Browsers are restricted by CORS and the Origin allowlist. Spotify credentials never leave the Worker's environment variables.&lt;/p&gt;




&lt;h2&gt;
  The Source Code
&lt;/h2&gt;

&lt;p&gt;We have open-sourced the complete system. The repository contains both the ESP32 firmware and the Cloudflare Worker code, with full setup instructions, mTLS configuration, and the CI/CD pipeline for OTA releases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Atelje-Vagabond/esp32-spotify-cloudflare-worker" rel="noopener noreferrer"&gt;View the project on GitHub: esp32-spotify-cloudflare-worker&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  See It Live
&lt;/h2&gt;

&lt;p&gt;Go to &lt;strong&gt;&lt;a href="https://ateljevagabond.se" rel="noopener noreferrer"&gt;ateljevagabond.se&lt;/a&gt;&lt;/strong&gt; — if something is playing at the studio right now, you will see it in the Now Playing widget. Same Worker, same data, two different consumers.&lt;/p&gt;




&lt;h2&gt;
  About Ateljé Vagabond
&lt;/h2&gt;

&lt;p&gt;This project is an example of what we do: build focused, well-engineered systems where hardware and software work together cleanly. If you are working on something similar — edge computing, IoT + cloud integration, or custom Cloudflare Worker architectures — we would like to hear from you at &lt;a href="https://ateljevagabond.se" rel="noopener noreferrer"&gt;ateljevagabond.se&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with PlatformIO, TFT_eSPI, ArduinoJson, and Cloudflare Workers. Most of the engineering happened between Christmas and New Year.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>esp32</category>
      <category>cloudflare</category>
      <category>iot</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
