Thu Kha Kyawe


Day 11: Track a Dataset with DVC

Lab Information

A teammate has added the transactions dataset to the xFusionCorp Industries fraud-detection repository, but it was committed directly to Git instead of being tracked with DVC. Bring the repository in line with the team standard—every dataset under data/ must be tracked by DVC, not by Git.

A project exists at /root/code/fraud-detection/ with DVC already initialised. The dataset data/raw/transactions.csv is currently tracked by Git, and the team standard requires DVC to own it instead.

Stop Git from tracking the dataset without deleting it from disk.

Track the same dataset with DVC so a .dvc pointer file is produced and data/raw/.gitignore excludes the dataset itself.

Stage the new .dvc pointer and the new .gitignore, then record a Git commit with the message Track transactions dataset with DVC.

Once tracking is moved to DVC, the DVC TRACKED section in the EXPLORER panel will list the dataset, confirming the extension recognises it as a DVC-managed file.

Lab Solutions

✅ Part 1: Lab Step-by-Step Guidelines

Step 1: Move into the repository

cd /root/code/fraud-detection

Verify the dataset is currently tracked by Git:

git ls-files | grep transactions.csv

Expected:

root@controlplane fraud-detection on main ➜  git ls-files | grep transactions.csv
data/raw/transactions.csv

Step 2: Stop Git from tracking the dataset (keep the file)

The lab specifically says:

Stop Git from tracking the dataset without deleting it from disk.

Use:

git rm --cached data/raw/transactions.csv

Important: use --cached. This flag:

• removes the file from Git's index, so Git stops tracking it
• keeps the actual file on disk

Verify:

ls -l data/raw/transactions.csv

The file should still exist.

Step 3: Track the dataset with DVC

Run:

dvc add data/raw/transactions.csv

DVC will create:

data/raw/transactions.csv.dvc

and create (or append to, if it already exists):

data/raw/.gitignore

Expected output:

root@controlplane fraud-detection on main [✘?] ➜  dvc add data/raw/transactions.csv
100% Adding...|███████████████████████████████████████|1/1 [00:00, 60.46file/s]

To track the changes with git, run:

        git add data/raw/.gitignore data/raw/transactions.csv.dvc

To enable auto staging, run:

        dvc config core.autostage true

Step 4: Verify DVC artifacts

Check:

ls -la data/raw

Expected:

root@controlplane fraud-detection on main [✘?] ➜  ls -la data/raw
total 20
drwxr-xr-x 2 root root 4096 Jun 14 11:59 .
drwxr-xr-x 3 root root 4096 Jun 14 11:56 ..
-rw-r--r-- 1 root root   18 Jun 14 11:59 .gitignore
-rw-r--r-- 1 root root  379 Jun 14 11:59 transactions.csv
-rw-r--r-- 1 root root   95 Jun 14 11:59 transactions.csv.dvc

Inspect the new files:

cat data/raw/.gitignore
cat data/raw/transactions.csv.dvc
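
As an optional sanity check, the hash stored in the pointer can be compared with the dataset on disk. The sketch below assumes the pointer uses a single-output layout containing an `md5:` line and that the hash is computed over the raw file bytes. `pointer_matches` is a hypothetical helper name, and the demo pointer is built by hand rather than by DVC. `dvc status` remains the authoritative check.

```shell
#!/usr/bin/env bash
set -euo pipefail

# pointer_matches FILE: succeed when the md5 recorded in FILE.dvc equals the
# md5 of FILE's current contents.
# Assumption: FILE.dvc contains a line of the form "- md5: <hash>" and the
# hash covers the raw file bytes.
pointer_matches() {
  local data_file="$1"
  local recorded actual
  recorded="$(awk '$1 == "md5:" || $2 == "md5:" { print $NF; exit }' "${data_file}.dvc")"
  actual="$(md5sum "$data_file" | awk '{ print $1 }')"
  [ -n "$recorded" ] && [ "$recorded" = "$actual" ]
}

# Demo on a throwaway file with a hand-built pointer (not real DVC output).
demo_dir="$(mktemp -d)"
printf 'id,amount\n1,10\n' > "$demo_dir/sample.csv"
printf 'outs:\n- md5: %s\n  hash: md5\n  path: sample.csv\n' \
  "$(md5sum "$demo_dir/sample.csv" | awk '{ print $1 }')" > "$demo_dir/sample.csv.dvc"

if pointer_matches "$demo_dir/sample.csv"; then echo "pointer matches"; fi

# Edit the data without refreshing the pointer: the check should now fail.
printf 'id,amount\n1,99\n' > "$demo_dir/sample.csv"
if ! pointer_matches "$demo_dir/sample.csv"; then echo "pointer is stale"; fi
```

In the lab repository the equivalent call would be `pointer_matches data/raw/transactions.csv`, run from /root/code/fraud-detection.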

Step 5: Check Git status

git status

You should see something similar to:

root@controlplane fraud-detection on main [✘?] ➜  git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    data/raw/transactions.csv

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        data/

Step 6: Stage the required files

Stage everything needed for the migration:

git add data/raw/transactions.csv.dvc
git add data/raw/.gitignore
git add -u

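The removal of data/raw/transactions.csv was already staged by git rm --cached in Step 2 (it appears under "Changes to be committed" in Step 5), so git add -u is optional here. It is harmless, but note that it also stages modifications to any other tracked files.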

Verify:

git status

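All three changes should now appear under "Changes to be committed": the removal of data/raw/transactions.csv, the new .dvc pointer, and the new .gitignore.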
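
To list exactly what is staged, `git diff --cached --name-status` prints one line per change. The following sketch replays the migration in a throwaway repository. The `.gitignore` and `.dvc` files in it are hand-written stand-ins for what `dvc add` generates, so DVC itself is not required to run it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so the lab repository is never touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
mkdir -p data/raw
printf 'id,amount\n1,10\n' > data/raw/transactions.csv
git add data/raw/transactions.csv
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add dataset to Git"

# Step 2 of the guide: untrack the file but keep it on disk.
git rm -q --cached data/raw/transactions.csv

# Hand-written stand-ins for the two files that `dvc add` would generate.
printf '/transactions.csv\n' > data/raw/.gitignore
printf 'outs:\n- path: transactions.csv\n' > data/raw/transactions.csv.dvc
git add data/raw/.gitignore data/raw/transactions.csv.dvc

# One line per staged change: A = added, D = deleted.
# --no-renames keeps the delete and the additions as separate entries.
staged="$(git diff --cached --name-status --no-renames)"
echo "$staged"
```

The `D` line for data/raw/transactions.csv appears even though the script never runs `git add -u`, because `git rm --cached` stages the removal itself.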

Step 7: Commit the changes

Use the exact commit message required by the lab:

git commit -m "Track transactions dataset with DVC"

Step 8: Verify the commit

git log --oneline -n 1

Expected:

root@controlplane fraud-detection on main ➜  git log --oneline -n 1
1b8a2c7 (HEAD -> main) Track transactions dataset with DVC

Step 9: Final verification

Confirm Git no longer tracks the dataset:

git ls-files | grep transactions.csv

Expected (only the pointer file matches; the dataset itself is no longer listed):

data/raw/transactions.csv.dvc

Confirm DVC tracks it:

dvc status

Expected:

Data and pipelines are up to date.
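
Because `grep transactions.csv` matches both the dataset and its pointer, an exact-path check can be less ambiguous. Here is a minimal sketch of such a check on the Git side only. `verify_migration` is a hypothetical helper name, and it does not replace `dvc status`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# verify_migration DATASET MESSAGE: run from the repository root.
# Succeeds only when Git no longer tracks DATASET, Git does track DATASET.dvc,
# and the subject of the latest commit equals MESSAGE.
verify_migration() {
  local dataset="$1" message="$2"

  # --error-unmatch makes ls-files exit non-zero when the exact path is not tracked.
  if git ls-files --error-unmatch -- "$dataset" >/dev/null 2>&1; then
    echo "FAIL: Git still tracks $dataset"
    return 1
  fi
  if ! git ls-files --error-unmatch -- "${dataset}.dvc" >/dev/null 2>&1; then
    echo "FAIL: Git does not track ${dataset}.dvc"
    return 1
  fi
  if [ "$(git log -1 --format=%s)" != "$message" ]; then
    echo "FAIL: latest commit message is not \"$message\""
    return 1
  fi
  echo "OK: $dataset is referenced only through its .dvc pointer"
}
```

After sourcing the function, it could be called from /root/code/fraud-detection as `verify_migration data/raw/transactions.csv "Track transactions dataset with DVC"`.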


🧠 Part 2: Simple Step-by-Step Explanation (Beginner Friendly)

  • What is the problem?

Right now:

Git
└── data/raw/transactions.csv

Git is tracking a dataset file.

The team standard says:

Git → track code and metadata
DVC → track datasets and models

So we need to move ownership of the dataset from Git to DVC.

  • Why use git rm --cached?

If you run:

git rm data/raw/transactions.csv

Git removes the file from its index and also deletes it from disk.

We don't want that.

Instead:

git rm --cached data/raw/transactions.csv

removes it only from Git tracking.

The file remains on disk:

data/raw/transactions.csv
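
The difference between the two commands can be seen side by side in a throwaway repository. This is a small sketch with made-up file names (`a.csv`, `b.csv`), kept separate from the lab repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so the lab repository is never touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
printf 'id,amount\n1,10\n' > a.csv
printf 'id,amount\n2,20\n' > b.csv
git add a.csv b.csv
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add two files"

git rm -q a.csv            # removes a.csv from the index AND from disk
git rm -q --cached b.csv   # removes b.csv from the index only

a_on_disk=no; [ -e a.csv ] && a_on_disk=yes
b_on_disk=no; [ -e b.csv ] && b_on_disk=yes
tracked="$(git ls-files)"  # empty: neither file is in the index any more

echo "a.csv on disk: $a_on_disk"
echo "b.csv on disk: $b_on_disk"
echo "tracked files: ${tracked:-<none>}"
```

Only `b.csv` survives on disk, which is the behaviour the lab asks for with the dataset.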

  • What does dvc add do?

When you run:

dvc add data/raw/transactions.csv

DVC creates a pointer file:

data/raw/transactions.csv.dvc

Think of it as:

transactions.csv.dvc

points to

transactions.csv

Git stores the small .dvc file, which records the dataset's hash, size, and path, instead of the dataset itself.

  • Why is .gitignore created?

DVC automatically adds:

data/raw/.gitignore

so Git ignores:

transactions.csv

This prevents someone from accidentally committing the dataset again.
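
The effect of that ignore entry can be checked with `git check-ignore`. The sketch below runs in a throwaway repository and writes the entry by hand, on the assumption that it mirrors what `dvc add` puts into data/raw/.gitignore.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository so the lab repository is never touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
mkdir -p data/raw
printf 'id,amount\n1,10\n' > data/raw/transactions.csv

# Assumption: this mimics the entry that `dvc add` writes; it is typed by hand here.
printf '/transactions.csv\n' > data/raw/.gitignore

# check-ignore -v prints the matching rule and exits 0 when the path is ignored.
matched_rule="$(git check-ignore -v data/raw/transactions.csv)"
echo "$matched_rule"

# A plain `git add` refuses an ignored path unless -f is given.
if git add data/raw/transactions.csv 2>/dev/null; then
  add_result=accepted
else
  add_result=refused
fi
echo "git add: $add_result"
```

The final line should read `git add: refused`, which is the accidental re-commit being blocked. In the lab repository, `git check-ignore -v data/raw/transactions.csv` performs the same check against the real file.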

  • What gets committed to Git?

After migration, Git stores:

data/raw/transactions.csv.dvc
data/raw/.gitignore

Git no longer stores:

data/raw/transactions.csv


Resources & Next Steps
📦 Full Code Repository: KodeKloud Learning Labs
💬 Join Discussion: DEV Community - Share your thoughts and questions
💼 Let's Connect: LinkedIn - I'd love to connect with you

Credits
• All labs are from: KodeKloud
• I sincerely thank KodeKloud for providing these valuable resources.
