
Bob Oner


Build a CSV Data Quality API with FastAPI, Pandas, Pytest, and Docker

Swagger UI

CSV files are still everywhere.

They appear in internal operations, analytics workflows, data exports, business reports, and small automation pipelines. Even when a team already uses databases or modern data platforms, CSV is often the format used to move data between people, tools, and systems.

The problem is that CSV files are easy to create but not always safe to trust.

Before a CSV file enters a pipeline, it is useful to answer a few basic questions:

  • How many rows and columns does it have?
  • Are there missing values?
  • Are there duplicate rows?
  • Are any columns completely empty?
  • Do the columns match what the next system expects?
  • Can another script or service consume the result in a predictable format?

In this article, I will walk through a small project that turns those checks into a reusable API:

FastAPI CSV Quality API

GitHub repo:

https://github.com/OnerGit/fastapi-csv-quality-api

The goal is not to build a full data quality platform. The goal is to show a practical engineering path from a local Python workflow to a small backend service that is documented, testable, and containerized.

What we will build

The API accepts a CSV file upload and returns a structured JSON quality report.

The report includes:

  • row count
  • column count
  • column names
  • missing values by column
  • missing value ratio by column
  • duplicate row count
  • duplicate row ratio
  • empty columns
  • column name issues
  • optional expected-column validation
  • warnings

The project also includes:

  • structured error responses
  • pytest tests
  • sample CSV files
  • Swagger UI
  • Dockerfile
  • Docker Compose support

Here is an example of the quality report shown in Swagger UI:

CSV quality report

Tech stack

This project uses:

  • FastAPI for the web API
  • Pandas for CSV analysis
  • Pydantic for response models
  • pytest for automated tests
  • Uvicorn as the ASGI server
  • Docker and Docker Compose for containerized execution

The project structure is intentionally small:

fastapi-csv-quality-api/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── analyzer.py
│   └── errors.py
├── tests/
│   ├── __init__.py
│   ├── test_health.py
│   ├── test_analyze.py
│   └── fixtures/
├── sample_data/
├── screenshots/
├── docs/
└── article_assets/

The separation is simple:

  • main.py exposes the API routes.
  • analyzer.py contains the CSV analysis logic.
  • models.py defines typed response structures.
  • errors.py keeps error response helpers separate.
  • tests/ verifies the expected behavior.
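
To make "typed response structures" concrete, here is a minimal sketch of what a Pydantic model for the report could look like. The class name QualityReport is an assumption for illustration, and the repository's models.py may be organized differently; the fields mirror the report items listed earlier.

```python
from typing import Dict, List

from pydantic import BaseModel, Field


class QualityReport(BaseModel):
    """Hypothetical response model for the CSV quality report."""

    filename: str
    row_count: int
    column_count: int
    column_names: List[str]
    missing_values_by_column: Dict[str, int]
    missing_value_ratio_by_column: Dict[str, float]
    duplicate_row_count: int
    duplicate_row_ratio: float
    # Lists default to empty so a clean file produces a complete report.
    empty_columns: List[str] = Field(default_factory=list)
    warnings: List[str] = Field(default_factory=list)
```

A model like this lets FastAPI validate the outgoing payload and describe it in the generated Swagger UI schema.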

Step 1: Run the API locally

Clone the repository:

git clone https://github.com/OnerGit/fastapi-csv-quality-api.git
cd fastapi-csv-quality-api

Create a virtual environment:

python -m venv .venv

Activate it.

On macOS or Linux:

source .venv/bin/activate

On Windows PowerShell:

.\.venv\Scripts\Activate.ps1

Install dependencies:

python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Run the API:

uvicorn app.main:app --reload

Then open:

http://127.0.0.1:8000/docs

You should see the FastAPI Swagger UI.

Step 2: Add a health check endpoint

A small service should have a simple health check endpoint. It gives us a quick way to verify that the API is running.

Example request:

curl http://127.0.0.1:8000/health

Example response:

{
  "status": "ok",
  "service": "fastapi-csv-quality-api",
  "version": "0.1.0"
}

This endpoint is also useful in tests, Docker checks, and future deployment environments.

Step 3: Build the CSV upload endpoint

The main endpoint is:

POST /analyze

It accepts a CSV file as multipart form data.

Example request:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv"

On Windows PowerShell, use curl.exe instead of curl:

curl.exe -X POST "http://127.0.0.1:8000/analyze" `
  -F "file=@sample_data/good_sample.csv"

This matters because in Windows PowerShell, curl is an alias for Invoke-WebRequest, which does not accept the same arguments as the standard curl executable.

Step 4: Implement basic CSV quality checks

Once the uploaded file is accepted, the analyzer reads it and computes a set of practical metrics.

For a small MVP, the most useful checks are often simple:

row_count
column_count
column_names
missing_values_by_column
missing_value_ratio_by_column
duplicate_row_count
duplicate_row_ratio
empty_columns

These checks are enough to catch many common CSV problems:

  • unexpected empty fields
  • duplicated records
  • fully empty columns
  • files with the wrong shape
  • files that look valid but are not useful downstream
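
A minimal pandas sketch of how these metrics could be computed is shown here. The function name analyze_csv and the rounding to four decimals are assumptions (the rounding is chosen to match the ratios in the example response), and the repository's analyzer.py may differ.

```python
import io

import pandas as pd


def analyze_csv(data: bytes) -> dict:
    """Compute basic quality metrics for CSV content given as raw bytes."""
    df = pd.read_csv(io.BytesIO(data))
    row_count = len(df)

    def ratio(count: int) -> float:
        # Avoid dividing by zero for a header-only file.
        return round(count / row_count, 4) if row_count else 0.0

    missing = {str(col): int(n) for col, n in df.isna().sum().items()}
    # duplicated() marks every repeat of an earlier row, not the first copy.
    duplicate_count = int(df.duplicated().sum())

    return {
        "row_count": row_count,
        "column_count": len(df.columns),
        "column_names": [str(col) for col in df.columns],
        "missing_values_by_column": missing,
        "missing_value_ratio_by_column": {c: ratio(n) for c, n in missing.items()},
        "duplicate_row_count": duplicate_count,
        "duplicate_row_ratio": ratio(duplicate_count),
        "empty_columns": [c for c, n in missing.items() if row_count and n == row_count],
    }
```

Note that pandas treats empty fields as missing by default, and rows that share the same missing cells still count as duplicates of each other.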

A simplified example response looks like this:

{
  "filename": "bad_sample.csv",
  "row_count": 6,
  "column_count": 6,
  "column_names": [
    "id",
    "name",
    "email",
    "age",
    "signup_date",
    "notes"
  ],
  "missing_values_by_column": {
    "id": 0,
    "name": 1,
    "email": 2,
    "age": 1,
    "signup_date": 2,
    "notes": 6
  },
  "missing_value_ratio_by_column": {
    "id": 0.0,
    "name": 0.1667,
    "email": 0.3333,
    "age": 0.1667,
    "signup_date": 0.3333,
    "notes": 1.0
  },
  "duplicate_row_count": 1,
  "duplicate_row_ratio": 0.1667,
  "empty_columns": [
    "notes"
  ],
  "warnings": [
    "The CSV file contains 12 missing value(s).",
    "The CSV file contains 1 duplicate row(s).",
    "The CSV file contains empty column(s): notes."
  ]
}

The key design choice is that the API returns structured JSON instead of plain text.

That makes the result easier to consume from:

  • another script
  • a data pipeline
  • a dashboard
  • a workflow automation tool
  • a monitoring job
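
As an illustration of a consumer, a pipeline step could read the JSON report and decide whether to continue. The function below is a hypothetical sketch: its name and the 20% missing-value threshold are assumptions, and only the field names come from the example response.

```python
import json


def blocking_reasons(report: dict, max_missing_ratio: float = 0.2) -> list:
    """Return reasons to reject the file; an empty list means it passes."""
    reasons = []
    if report.get("duplicate_row_count", 0) > 0:
        reasons.append(f"{report['duplicate_row_count']} duplicate row(s)")
    for column in report.get("empty_columns", []):
        reasons.append(f"column '{column}' is empty")
    for column, ratio in report.get("missing_value_ratio_by_column", {}).items():
        # Fully empty columns (ratio 1.0) were already reported above.
        if max_missing_ratio < ratio < 1.0:
            reasons.append(f"column '{column}' is {ratio:.0%} missing")
    return reasons


# A clean report parsed from a response body yields an empty list.
response_body = (
    '{"duplicate_row_count": 0, "empty_columns": [], '
    '"missing_value_ratio_by_column": {"id": 0.0}}'
)
print(blocking_reasons(json.loads(response_body)))
```

Because the report is plain JSON, the same check works whether the caller is a cron script, an orchestration task, or a monitoring job.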

Step 5: Add expected-column validation

In many workflows, the next system expects a fixed set of columns.

For example:

id,name,email,age,signup_date

The API supports optional expected-column validation through a form field:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv" \
  -F "expected_columns=id,name,email,age,signup_date"

This allows the service to compare the uploaded CSV headers against the expected headers.

The response can then tell you whether the file matches the expected shape, and which columns are missing or unexpected.
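
The comparison itself needs no libraries. Here is a small sketch, assuming the form field arrives as one comma-separated string; the function name and the keys in the returned dictionary are hypothetical and may not match the project's response fields.

```python
def compare_columns(actual: list, expected_field: str) -> dict:
    """Compare CSV headers against a comma-separated list of expected names."""
    expected = [name.strip() for name in expected_field.split(",") if name.strip()]
    # Preserve the original order so the output is stable and easy to read.
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return {
        "expected_columns": expected,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        "columns_match": not missing and not unexpected,
    }
```

This version ignores column order and is case-sensitive; a stricter variant could also compare positions.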

Expected columns validation

This is a small feature, but it turns the API from a generic CSV inspector into a check against the exact columns a downstream system expects.

Step 6: Return structured errors

CSV upload workflows can fail in many ways.

Some examples:

  • the user uploads a non-CSV file
  • the file is empty
  • the file cannot be parsed
  • the encoding is unsupported
  • the file is too large

Instead of returning inconsistent error messages, this project returns structured errors.

Example:

{
  "error": {
    "code": "invalid_file_type",
    "message": "Only .csv files are supported.",
    "details": {
      "filename": "not_csv.txt"
    }
  }
}

This format is useful because clients can check the code field programmatically.

For example, a frontend can show a friendly message for invalid_file_type, while a pipeline can log the error and stop processing.
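
One way to keep every error in the same envelope is to build it in a single helper. The sketch below is an assumption about how that could look: only invalid_file_type and its message come from the example above, while the helper names, the empty_file code, and its message are hypothetical.

```python
from typing import Optional


def error_response(code: str, message: str, **details) -> dict:
    """Wrap an error in the {"error": {...}} envelope used by the API."""
    return {"error": {"code": code, "message": message, "details": details}}


def validate_upload(filename: str, data: bytes) -> Optional[dict]:
    """Return an error payload for an unusable upload, or None if it looks fine."""
    if not filename.lower().endswith(".csv"):
        return error_response(
            "invalid_file_type",
            "Only .csv files are supported.",
            filename=filename,
        )
    if not data.strip():
        # Hypothetical code for a file with no content at all.
        return error_response(
            "empty_file",
            "The uploaded file is empty.",
            filename=filename,
        )
    return None
```

A client can then branch on payload["error"]["code"] without parsing the human-readable message.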

Step 7: Add tests with pytest

A small API becomes much more useful when its behavior is protected by tests.

This project includes tests for:

  • /health returning 200
  • normal CSV analysis
  • missing value detection
  • duplicate row detection
  • non-CSV error handling
  • expected-column validation

Run the tests:

pytest

Example test result:

Pytest passed

Even in a small demo project, tests are more than a formality: they make the code easier to refactor and safer to extend.
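
To show the shape of such tests without depending on the project's code, here is a self-contained sketch in pytest style. The function count_missing is a hypothetical stand-in written with the standard library's csv module; the project's own tests exercise the API endpoints instead.

```python
import csv
import io


def count_missing(csv_text: str) -> dict:
    """Count empty cells per column (stand-in for the real analyzer)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            # Short rows yield None for absent fields; treat them as missing.
            if not (row.get(name) or "").strip():
                counts[name] += 1
    return counts


def test_counts_missing_values_per_column():
    csv_text = "id,name\n1,Ann\n2,\n"
    assert count_missing(csv_text) == {"id": 0, "name": 1}


def test_clean_file_has_no_missing_values():
    csv_text = "id,name\n1,Ann\n2,Bo\n"
    assert sum(count_missing(csv_text).values()) == 0
```

pytest collects any function whose name starts with test_, so no registration or base class is needed.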

Step 8: Containerize the API with Docker

After the API works locally, the next step is to package it.

Build the Docker image:

docker build -t fastapi-csv-quality-api .

Run the container:

docker run --rm -p 8000:8000 fastapi-csv-quality-api

Then open:

http://127.0.0.1:8000/docs

Inside the container, the server binds to 0.0.0.0:8000 so that it accepts connections from outside the container. The -p 8000:8000 flag maps that port to your machine, which is why 127.0.0.1:8000 works from the host.
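
Once the container is running, a short script can confirm that the mapped port answers. This is a sketch using only the standard library; the function names are illustrative, and the URL assumes the port mapping shown above.

```python
import json
import urllib.request


def parse_health(body: bytes) -> bool:
    """Return True if the body is a JSON object with status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return False


def is_healthy(url: str = "http://127.0.0.1:8000/health", timeout: float = 2.0) -> bool:
    """Call the health endpoint and report whether the service looks up."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and parse_health(response.read())
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unreachable")
```

The same check can be reused from a CI job or a container health probe.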

Docker run

Step 9: Use Docker Compose

Docker Compose is included for a simpler local workflow.

Start the service:

docker compose up --build

Stop the service:

docker compose down

This is useful when you want a repeatable local runtime without manually typing the full docker run command.

Why this project is intentionally small

This project is an MVP.

It intentionally does not include:

  • authentication
  • database storage
  • frontend UI
  • background jobs
  • large-file streaming
  • Kubernetes deployment
  • production cloud infrastructure

That is deliberate.

The purpose is to demonstrate a complete but lightweight backend workflow:

CSV upload → validation → analysis → structured response → tests → Docker packaging

This makes the project easier to read, test, and extend.

Possible next improvements

There are many ways to extend this project:

  • configurable file size limits
  • date format checks
  • numeric column checks
  • JSON schema export
  • GitHub Actions CI workflow
  • cloud deployment tutorial
  • larger file handling
  • dashboard integration

A natural next step would be to deploy the containerized service to a small cloud VM or Kubernetes platform. But that should be treated as a separate tutorial rather than added to the first MVP.

Conclusion

This project shows how to turn a simple local CSV inspection workflow into a reusable API.

The main engineering ideas are:

  • keep the first version small
  • return structured JSON instead of text
  • separate API routing from analysis logic
  • define response models clearly
  • test the behavior with pytest
  • package the service with Docker

Even though the example is small, the pattern is useful:

local script → API service → tested component → containerized tool

That pattern can be reused for many internal developer tools, data workflow utilities, and automation services.

You can find the full project here:

https://github.com/OnerGit/fastapi-csv-quality-api
