<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohamed Arbi </title>
    <description>The latest articles on DEV Community by Mohamed Arbi  (@goodnight).</description>
    <link>https://dev.to/goodnight</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1206164%2F62df7320-cb94-42f2-84da-96d65314f25e.png</url>
      <title>DEV Community: Mohamed Arbi </title>
      <link>https://dev.to/goodnight</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/goodnight"/>
    <language>en</language>
    <item>
      <title>1minMLOps #2 :Versioning your data with DVC</title>
      <dc:creator>Mohamed Arbi </dc:creator>
      <pubDate>Fri, 08 May 2026 12:38:00 +0000</pubDate>
      <link>https://dev.to/goodnight/1minmlops-2-versioning-your-data-with-dvc-2o0d</link>
      <guid>https://dev.to/goodnight/1minmlops-2-versioning-your-data-with-dvc-2o0d</guid>
      <description>&lt;p&gt;In the last article we talked about why ML is harder than regular software: code, data and environment all move at the same time. Today we're tackling the second one, &lt;strong&gt;data&lt;/strong&gt;, with a tool called &lt;strong&gt;DVC&lt;/strong&gt; (Data Version Control).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use Git?
&lt;/h2&gt;

&lt;p&gt;Git is amazing for code, but it was designed for small text files. The moment you commit a 2 GB CSV or a folder of 50,000 images, things get unpleasant fast: the repo balloons, &lt;code&gt;git clone&lt;/code&gt; becomes a coffee break, and GitHub starts politely asking you to leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DVC&lt;/strong&gt; solves this by being "Git for data": it stores tiny pointer files in your repo and pushes the actual heavy data to a separate storage backend (S3, GCS, an SSH server, even a local folder). You get versioning, branching and reproducibility, without bloating Git.&lt;/p&gt;
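
&lt;p&gt;To make the pointer idea concrete, here is a toy model of it in plain shell. It is a sketch, not DVC's implementation: the &lt;code&gt;cache/&lt;/code&gt; folder, the &lt;code&gt;.pointer&lt;/code&gt; suffix and the sample row are all made up for illustration, and it assumes GNU &lt;code&gt;md5sum&lt;/code&gt; is available.&lt;/p&gt;

```shell
# Toy model only: hypothetical folder and file names, no DVC involved.
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/cache"
printf '5.1,3.5,1.4,0.2,setosa\n' > "$workdir/data/sample.csv"

# 1. Store the contents under a name derived from their hash.
digest=$(md5sum "$workdir/data/sample.csv" | cut -d ' ' -f 1)
cp "$workdir/data/sample.csv" "$workdir/cache/$digest"

# 2. Keep only a tiny pointer next to the code.
printf 'md5: %s\n' "$digest" > "$workdir/data/sample.csv.pointer"

# 3. Simulate a fresh clone: the data is gone, the pointer is not.
rm "$workdir/data/sample.csv"

# 4. Restore the data by looking up the hash recorded in the pointer.
wanted=$(sed -n 's/^md5: //p' "$workdir/data/sample.csv.pointer")
cp "$workdir/cache/$wanted" "$workdir/data/sample.csv"
restored=$(md5sum "$workdir/data/sample.csv" | cut -d ' ' -f 1)
echo "pointer:  $wanted"
echo "restored: $restored"
```

&lt;p&gt;Swap the local &lt;code&gt;cache/&lt;/code&gt; folder for a bucket and you have the rough shape of what the rest of this article sets up with real DVC commands.&lt;/p&gt;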

&lt;h2&gt;
  
  
  Step 1: Install DVC
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want S3 support, install the extra:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"dvc[s3]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other backends like &lt;code&gt;gs&lt;/code&gt;, &lt;code&gt;azure&lt;/code&gt;, &lt;code&gt;ssh&lt;/code&gt; work the same way — just swap the extra.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Initialize DVC in your project
&lt;/h2&gt;

&lt;p&gt;Let's start a tiny project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;mlops-demo &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;mlops-demo
git init
dvc init
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Initialize DVC"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dvc init&lt;/code&gt; creates a &lt;code&gt;.dvc/&lt;/code&gt; folder (a bit like &lt;code&gt;.git/&lt;/code&gt;) and a &lt;code&gt;.dvcignore&lt;/code&gt; file. From now on, DVC and Git work side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Track your first dataset
&lt;/h2&gt;

&lt;p&gt;Let's grab a small dataset to play with. We'll use the classic Iris CSV:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;data
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv &lt;span class="nt"&gt;-o&lt;/span&gt; data/iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note (PowerShell users):&lt;/strong&gt; in Windows PowerShell (5.1), &lt;code&gt;curl&lt;/code&gt; is an alias for &lt;code&gt;Invoke-WebRequest&lt;/code&gt;, which doesn't accept the &lt;code&gt;-L&lt;/code&gt; flag and will error with &lt;code&gt;A parameter cannot be found that matches parameter name 'L'&lt;/code&gt;. PowerShell 7 no longer defines this alias. If you hit the error, use one of these instead:&lt;/p&gt;


&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option 1: call the real curl binary (ships with Windows 10+)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;curl.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-L&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-o&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/iris.csv&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Option 2: native PowerShell&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Invoke-WebRequest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Uri&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-OutFile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/iris.csv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;code&gt;Invoke-WebRequest&lt;/code&gt; follows redirects by default, so it needs no equivalent of &lt;code&gt;-L&lt;/code&gt;. &lt;code&gt;curl.exe&lt;/code&gt; does not, which is why Option 1 keeps the flag.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now tell DVC to track it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc add data/iris.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Move &lt;code&gt;data/iris.csv&lt;/code&gt; into its cache (&lt;code&gt;.dvc/cache/&lt;/code&gt;) and link or copy it back, so the file still shows up in your workspace&lt;/li&gt;
&lt;li&gt;Create a small pointer file &lt;code&gt;data/iris.csv.dvc&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;data/iris.csv&lt;/code&gt; to &lt;code&gt;.gitignore&lt;/code&gt; automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Commit the &lt;strong&gt;pointer&lt;/strong&gt;, not the data, to Git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add data/iris.csv.dvc data/.gitignore
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Track iris dataset with DVC"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you peek inside &lt;code&gt;data/iris.csv.dvc&lt;/code&gt;, you'll see something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1f8e3c...&lt;/span&gt;
  &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3858&lt;/span&gt;
  &lt;span class="na"&gt;hash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;md5&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iris.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That hash is the version of your data. Change one byte in the CSV, and the hash changes.&lt;/p&gt;
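
&lt;p&gt;You can watch the same effect without DVC. The snippet below is only an illustration with GNU &lt;code&gt;md5sum&lt;/code&gt; on a throwaway file (the file name and rows are made up), and it makes no claim about how DVC computes its hash internally.&lt;/p&gt;

```shell
# Illustration only: plain md5sum on a temporary file, not DVC itself.
workdir=$(mktemp -d)
printf 'sepal_length,species\n5.1,setosa\n' > "$workdir/sample.csv"
hash_before=$(md5sum "$workdir/sample.csv" | cut -d ' ' -f 1)

# Append a single row and hash the file again.
printf '6.0,versicolor\n' >> "$workdir/sample.csv"
hash_after=$(md5sum "$workdir/sample.csv" | cut -d ' ' -f 1)

echo "before: $hash_before"
echo "after:  $hash_after"
```

&lt;p&gt;The two digests differ even though only one line was added, which is what lets a short hash stand in for a whole dataset version. On macOS, &lt;code&gt;md5 -q&lt;/code&gt; plays the role of &lt;code&gt;md5sum&lt;/code&gt;.&lt;/p&gt;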

&lt;h2&gt;
  
  
  Step 4: Set up remote storage
&lt;/h2&gt;

&lt;p&gt;Right now, the data only lives on your machine. Let's push it somewhere others (or future-you on another laptop) can pull it from.&lt;/p&gt;

&lt;p&gt;For a quick local test, you can use a folder as a "remote":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /tmp/dvc-storage
dvc remote add &lt;span class="nt"&gt;-d&lt;/span&gt; localremote /tmp/dvc-storage
git add .dvc/config
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Configure DVC remote"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For real projects, swap that with S3 or similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc remote add &lt;span class="nt"&gt;-d&lt;/span&gt; s3remote s3://my-bucket/dvc-storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then push the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: The reproducibility test
&lt;/h2&gt;

&lt;p&gt;This is the moment that makes DVC click. Let's pretend you're a teammate cloning the repo for the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /tmp
git clone /path/to/mlops-demo fresh-clone
&lt;span class="nb"&gt;cd &lt;/span&gt;fresh-clone
&lt;span class="nb"&gt;ls &lt;/span&gt;data/
&lt;span class="c"&gt;# Only iris.csv.dvc — the actual CSV is missing!&lt;/span&gt;

dvc pull
&lt;span class="nb"&gt;ls &lt;/span&gt;data/
&lt;span class="c"&gt;# iris.csv is back, byte-for-byte identical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just versioned a dataset alongside your code, &lt;strong&gt;without&lt;/strong&gt; committing it to Git. 🎉&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Updating the dataset
&lt;/h2&gt;

&lt;p&gt;Real data changes. Let's simulate that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"6.0,3.0,4.5,1.5,versicolor"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; data/iris.csv
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add new sample to iris dataset"&lt;/span&gt;
dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pointer file's hash updated. If you ever need the &lt;em&gt;old&lt;/em&gt; version of the data, just &lt;code&gt;git checkout&lt;/code&gt; an older commit and run &lt;code&gt;dvc pull&lt;/code&gt;: DVC fetches the dataset that matched that commit. Time travel for data.&lt;/p&gt;
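
&lt;p&gt;The Git half of that round trip can be tried in isolation. This is a minimal sketch under stated assumptions: it uses a stand-in pointer file with made-up hash values (&lt;code&gt;aaaa&lt;/code&gt;, &lt;code&gt;bbbb&lt;/code&gt;), a hypothetical helper named &lt;code&gt;commit&lt;/code&gt;, and a throwaway repository, and it restores only the pointer file rather than checking out the whole commit.&lt;/p&gt;

```shell
# Stand-in for data/iris.csv.dvc: a tiny text file whose hash line changes per version.
repo=$(mktemp -d)
git init -q "$repo"

# Hypothetical helper: commit with a throwaway identity so no global Git config is needed.
commit() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

printf 'md5: aaaa\n' > "$repo/iris.csv.dvc"
git -C "$repo" add iris.csv.dvc
commit "Dataset v1"

printf 'md5: bbbb\n' > "$repo/iris.csv.dvc"
git -C "$repo" add iris.csv.dvc
commit "Dataset v2"

# Bring back only the pointer from the previous commit.
git -C "$repo" checkout -q HEAD~1 -- iris.csv.dvc
restored=$(cat "$repo/iris.csv.dvc")
echo "$restored"
# In a real DVC project, running dvc pull at this point would fetch the matching data.
```

&lt;p&gt;Git only ever moves the small text file around. The heavy lifting of swapping the actual dataset is left to DVC.&lt;/p&gt;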

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;With this in place, you can finally answer the question &lt;em&gt;"which data produced that model?"&lt;/em&gt; with a Git commit hash. That's a huge upgrade.&lt;/p&gt;

&lt;p&gt;In the next article, we'll add the second piece of the puzzle: &lt;strong&gt;experiment tracking&lt;/strong&gt; with MLflow, so we never again lose track of which hyperparameters and which data produced which metric.&lt;/p&gt;

&lt;p&gt;Stay tuned and have fun! 🥰&lt;/p&gt;

&lt;p&gt;
  If you enjoyed this article, you can support my work here:
&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://buymeacoffee.com/mohamedarbi" rel="noopener noreferrer"&gt;
    &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.buymeacoffee.com%2Fbuttons%2Fdefault-orange.png" alt="Buy Me A Coffee" height="100" width="434"&gt;
  &lt;/a&gt;
&lt;/p&gt;

</description>
      <category>ai</category>
      <category>versioning</category>
      <category>mlops</category>
      <category>dvc</category>
    </item>
    <item>
      <title>1minMLOps #1 : What is MLOps and why should you care?</title>
      <dc:creator>Mohamed Arbi </dc:creator>
      <pubDate>Thu, 07 May 2026 15:08:56 +0000</pubDate>
      <link>https://dev.to/goodnight/1minmlops-1-what-is-mlops-and-why-should-you-care-17an</link>
      <guid>https://dev.to/goodnight/1minmlops-1-what-is-mlops-and-why-should-you-care-17an</guid>
      <description>&lt;p&gt;If you've ever trained a beautiful model in a Jupyter notebook, watched the metrics shine, and then realized you have no idea how to actually put it in front of users, congratulations: you've just discovered why MLOps exists.&lt;/p&gt;

&lt;p&gt;In this series, we are going to walk together from a notebook to a fully deployed, monitored and self-retraining ML system, one tiny step at a time. But before we write any code, let's get the foundations straight.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what is MLOps?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MLOps&lt;/strong&gt; (short for Machine Learning Operations) is the set of practices, tools and culture that lets you ship machine learning models to production &lt;em&gt;reliably and repeatedly&lt;/em&gt;. Think of it as DevOps' younger sibling: same spirit (automation, reproducibility, monitoring), but adapted to the weirdness of ML. Your code is not the only thing that changes: your &lt;strong&gt;data&lt;/strong&gt; changes, your &lt;strong&gt;model&lt;/strong&gt; changes, and the &lt;strong&gt;world&lt;/strong&gt; your model lives in changes too.&lt;/p&gt;

&lt;p&gt;A useful way to picture it is the ML lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data collection &amp;amp; versioning&lt;/strong&gt; — where does the data come from, and which version did we train on?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation&lt;/strong&gt; — which features, which model, which hyperparameters?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training &amp;amp; evaluation&lt;/strong&gt; — does it actually work, and is it better than what we had?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Packaging&lt;/strong&gt; — wrap the model in something deployable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment&lt;/strong&gt; — serve predictions to real users (batch or real-time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; — is it still working? Did the data drift?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retraining&lt;/strong&gt; — close the loop and start again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional software has steps 4–6. ML has all seven, and steps 1–3 keep coming back to haunt you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "it works on my machine" is &lt;em&gt;worse&lt;/em&gt; in ML
&lt;/h2&gt;

&lt;p&gt;In classical software, if your code runs locally, it has a decent chance of running in production. In ML, that's a trap, because the model's behavior depends on &lt;strong&gt;three&lt;/strong&gt; moving things, not one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt;: the training script, the preprocessing, the inference logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: the exact dataset (and its version) you trained on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment&lt;/strong&gt;: Python version, library versions, CUDA versions, OS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Change any of these three and your "great model from Tuesday" becomes "mysterious garbage on Friday." This is why ML teams need stricter versioning, tracking and packaging discipline than most web teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  What problems does MLOps actually solve?
&lt;/h2&gt;

&lt;p&gt;Concrete pains you'll feel without MLOps, and that we'll fix in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Which dataset gave us that 0.94 F1 score? Nobody remembers."&lt;/li&gt;
&lt;li&gt;"The model works locally but crashes in the Docker container."&lt;/li&gt;
&lt;li&gt;"We retrained the model and accuracy dropped, but we can't roll back."&lt;/li&gt;
&lt;li&gt;"Production is silently degrading and we noticed two weeks later."&lt;/li&gt;
&lt;li&gt;"Every deploy is a hand-crafted artisanal disaster."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these has a tool and a workflow that solves it, and we are going to meet them (almost) one by one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The MLOps stack we'll build
&lt;/h2&gt;

&lt;p&gt;Here's a sneak peek of the tools we'll touch in the next articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DVC&lt;/strong&gt; for data versioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow&lt;/strong&gt; for experiment tracking and the model registry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for serving&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; for packaging (we'll lean a bit on Clelia's 1minDocker series here)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; for CI/CD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evidently&lt;/strong&gt; for monitoring data and model drift (we can use Prometheus and Grafana too)&lt;/li&gt;
&lt;li&gt;A cloud provider (we'll pick one later) for actually deploying it all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't worry if some of these names sound intimidating, we'll introduce them gently, one per article, and always with a working example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you need to follow along
&lt;/h2&gt;

&lt;p&gt;Nothing fancy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git&lt;/code&gt; installed&lt;/li&gt;
&lt;li&gt;A GitHub account&lt;/li&gt;
&lt;li&gt;Docker installed (I highly recommend following this series: &lt;a href="https://dev.to/astrabert/1mindocker-1-what-is-docker-3baa"&gt;https://dev.to/astrabert/1mindocker-1-what-is-docker-3baa&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A laptop and ~1 minute per article 😉&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next article, we'll get our hands dirty: we'll take a small dataset, version it with &lt;strong&gt;DVC&lt;/strong&gt;, and finally answer the question &lt;em&gt;"which data did we train on?"&lt;/em&gt; without crying.&lt;/p&gt;

&lt;p&gt;Stay tuned and have fun! &lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>mlops</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
