Bob Oner

Posted on May 29

Build a CSV Data Quality API with FastAPI, Pandas, Pytest, and Docker

#python #fastapi #docker #testing

CSV files are still everywhere.

They appear in internal operations, analytics workflows, data exports, business reports, and small automation pipelines. Even when a team already uses databases or modern data platforms, CSV is often the format used to move data between people, tools, and systems.

The problem is that CSV files are easy to create but not always safe to trust.

Before a CSV file enters a pipeline, it is useful to answer a few basic questions:

How many rows and columns does it have?
Are there missing values?
Are there duplicate rows?
Are any columns completely empty?
Do the columns match what the next system expects?
Can another script or service consume the result in a predictable format?

In this article, I will walk through a small project that turns those checks into a reusable API:

FastAPI CSV Quality API

GitHub repo:

https://github.com/OnerGit/fastapi-csv-quality-api

The goal is not to build a full data quality platform. The goal is to show a practical engineering path from a local Python workflow to a small backend service that is documented, testable, and containerized.

What we will build

The API accepts a CSV file upload and returns a structured JSON quality report.

The report includes:

row count
column count
column names
missing values by column
missing value ratio by column
duplicate row count
duplicate row ratio
empty columns
column name issues
optional expected-column validation
warnings

The project also includes:

structured error responses
pytest tests
sample CSV files
Swagger UI
Dockerfile
Docker Compose support

[Screenshot: example quality report shown in Swagger UI]

Tech stack

This project uses:

FastAPI for the web API
Pandas for CSV analysis
Pydantic for response models
pytest for automated tests
Uvicorn as the ASGI server
Docker and Docker Compose for containerized execution

The project structure is intentionally small:

fastapi-csv-quality-api/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── analyzer.py
│   └── errors.py
├── tests/
│   ├── __init__.py
│   ├── test_health.py
│   ├── test_analyze.py
│   └── fixtures/
├── sample_data/
├── screenshots/
├── docs/
└── article_assets/

The separation is simple:

main.py exposes the API routes.
analyzer.py contains the CSV analysis logic.
models.py defines typed response structures.
errors.py keeps error response helpers separate.
tests/ verifies the expected behavior.
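
To make the role of models.py more concrete, here is a minimal sketch of what a typed response model could look like with Pydantic. The class name QualityReport and the exact field set are my assumptions for illustration, based on the report fields listed earlier; the models in the repository may differ.

```python
from pydantic import BaseModel, Field


class QualityReport(BaseModel):
    """Hypothetical response model; field names follow the report fields listed above."""

    filename: str
    row_count: int
    column_count: int
    column_names: list[str]
    missing_values_by_column: dict[str, int]
    missing_value_ratio_by_column: dict[str, float]
    duplicate_row_count: int
    duplicate_row_ratio: float
    # Optional parts of the report default to empty lists.
    empty_columns: list[str] = Field(default_factory=list)
    warnings: list[str] = Field(default_factory=list)


# Pydantic validates the types when the model is constructed.
report = QualityReport(
    filename="good_sample.csv",
    row_count=2,
    column_count=2,
    column_names=["id", "name"],
    missing_values_by_column={"id": 0, "name": 0},
    missing_value_ratio_by_column={"id": 0.0, "name": 0.0},
    duplicate_row_count=0,
    duplicate_row_ratio=0.0,
)
print(report.row_count, report.warnings)
```

Declaring the shape once like this lets FastAPI document the response in Swagger UI and lets clients rely on stable field names.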

Step 1: Run the API locally

Clone the repository:

git clone https://github.com/OnerGit/fastapi-csv-quality-api.git
cd fastapi-csv-quality-api

Create a virtual environment:

python -m venv .venv

Activate it.

On macOS or Linux:

source .venv/bin/activate

On Windows PowerShell:

.\.venv\Scripts\Activate.ps1

Install dependencies:

python -m pip install --upgrade pip
pip install -r requirements.txt

Run the API:

uvicorn app.main:app --reload

Then open:

http://127.0.0.1:8000/docs

You should see the FastAPI Swagger UI.

Step 2: Add a health check endpoint

A small service should have a simple health check endpoint. It gives us a quick way to verify that the API is running.

Example request:

curl http://127.0.0.1:8000/health

Example response:

{
  "status": "ok",
  "service": "fastapi-csv-quality-api",
  "version": "0.1.0"
}

This endpoint is also useful in tests, Docker checks, and future deployment environments.

Step 3: Build the CSV upload endpoint

The main endpoint is:

POST /analyze

It accepts a CSV file as multipart form data.

Example request:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv"

On Windows PowerShell, use curl.exe instead of curl:

curl.exe -X POST "http://127.0.0.1:8000/analyze" `
  -F "file=@sample_data/good_sample.csv"

This matters because in Windows PowerShell 5.1, curl is an alias for Invoke-WebRequest, which does not accept curl's arguments. Calling curl.exe runs the real curl executable.

Step 4: Implement basic CSV quality checks

Once the uploaded file is accepted, the analyzer reads it and computes a set of practical metrics.

For a small MVP, the most useful checks are often simple:

row_count
column_count
column_names
missing_values_by_column
missing_value_ratio_by_column
duplicate_row_count
duplicate_row_ratio
empty_columns

These checks are enough to catch many common CSV problems:

unexpected empty fields
duplicated records
fully empty columns
files with the wrong shape
files that look valid but are not useful downstream
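
Most of these metrics map directly onto a few Pandas calls. Here is a minimal sketch of how the analysis could be written; the function name analyze_csv is hypothetical, rounding to four decimals is my assumption, and the real analyzer.py may handle more edge cases.

```python
import io

import pandas as pd


def analyze_csv(raw: bytes) -> dict:
    """Compute basic quality metrics for CSV content given as bytes."""
    df = pd.read_csv(io.BytesIO(raw))
    rows = len(df)

    missing = df.isna().sum()  # missing cells per column
    duplicates = int(df.duplicated().sum())  # rows identical to an earlier row

    def ratio(count: int) -> float:
        # Avoid dividing by zero for a header-only file.
        return round(count / rows, 4) if rows else 0.0

    return {
        "row_count": rows,
        "column_count": len(df.columns),
        "column_names": list(df.columns),
        "missing_values_by_column": {col: int(n) for col, n in missing.items()},
        "missing_value_ratio_by_column": {col: ratio(int(n)) for col, n in missing.items()},
        "duplicate_row_count": duplicates,
        "duplicate_row_ratio": ratio(duplicates),
        # A column counts as empty when every row is missing a value for it.
        "empty_columns": [col for col in df.columns if rows and df[col].isna().all()],
    }


# Three rows: the last two are identical, and "notes" is never filled in.
sample = b"id,name,notes\n1,Ann,\n2,,\n2,,\n"
report = analyze_csv(sample)
print(report["duplicate_row_count"], report["empty_columns"])
```

Note that pandas treats empty fields as missing by default, so a blank cell and an absent value are counted the same way in this sketch.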

A simplified example response looks like this:

{
  "filename": "bad_sample.csv",
  "row_count": 6,
  "column_count": 6,
  "column_names": [
    "id",
    "name",
    "email",
    "age",
    "signup_date",
    "notes"
  ],
  "missing_values_by_column": {
    "id": 0,
    "name": 1,
    "email": 2,
    "age": 1,
    "signup_date": 2,
    "notes": 6
  },
  "missing_value_ratio_by_column": {
    "id": 0.0,
    "name": 0.1667,
    "email": 0.3333,
    "age": 0.1667,
    "signup_date": 0.3333,
    "notes": 1.0
  },
  "duplicate_row_count": 1,
  "duplicate_row_ratio": 0.1667,
  "empty_columns": [
    "notes"
  ],
  "warnings": [
    "The CSV file contains 12 missing value(s).",
    "The CSV file contains 1 duplicate row(s).",
    "The CSV file contains empty column(s): notes."
  ]
}

The key design choice is that the API returns structured JSON instead of plain text.

That makes the result easier to consume from:

another script
a data pipeline
a dashboard
a workflow automation tool
a monitoring job
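
As an illustration of why structured output helps, a pipeline step could decide whether to continue based on the report alone. The function name should_block and the 0.2 threshold below are hypothetical choices for this sketch, not part of the project.

```python
import json


def should_block(report: dict, max_missing_ratio: float = 0.2) -> bool:
    """Return True when the report suggests the file should not proceed."""
    ratios = report.get("missing_value_ratio_by_column", {})
    too_sparse = any(value > max_missing_ratio for value in ratios.values())
    has_duplicates = report.get("duplicate_row_count", 0) > 0
    has_empty_columns = bool(report.get("empty_columns"))
    return too_sparse or has_duplicates or has_empty_columns


# A trimmed version of the example response above, as a client would receive it.
response_body = """
{
  "missing_value_ratio_by_column": {"id": 0.0, "email": 0.3333},
  "duplicate_row_count": 1,
  "empty_columns": ["notes"]
}
"""
report = json.loads(response_body)
print(should_block(report))  # True
```

With plain-text output, the same decision would require fragile string parsing.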

Step 5: Add expected-column validation

In many workflows, the next system expects a fixed set of columns.

For example:

id,name,email,age,signup_date

The API supports optional expected-column validation through a form field:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv" \
  -F "expected_columns=id,name,email,age,signup_date"

This allows the service to compare the uploaded CSV headers against the expected headers.

The response can then tell you whether the file matches the expected shape, and which columns are missing or unexpected.
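
The comparison itself is simple set-style logic. Here is a minimal sketch, assuming the form field arrives as one comma-separated string; the function name and the result keys (missing_columns, unexpected_columns, matches_expected) are hypothetical and may not match the project's response fields.

```python
def compare_columns(actual: list[str], expected_columns: str) -> dict:
    """Compare CSV headers against a comma-separated list of expected names."""
    # Split the form value and ignore stray whitespace or empty entries.
    expected = [name.strip() for name in expected_columns.split(",") if name.strip()]
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return {
        "expected_columns": expected,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        "matches_expected": not missing and not unexpected,
    }


result = compare_columns(
    ["id", "name", "email", "notes"],
    "id,name,email,age,signup_date",
)
print(result["missing_columns"], result["unexpected_columns"])
```

This sketch ignores column order. If the downstream system depends on order, the check would need to compare the two lists directly.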

This is a small feature, but it changes the API from a generic CSV inspector into something more useful for real workflows.

Step 6: Return structured errors

CSV upload workflows can fail in many ways.

Some examples:

the user uploads a non-CSV file
the file is empty
the file cannot be parsed
the encoding is unsupported
the file is too large

Instead of returning inconsistent error messages, this project returns structured errors.

Example:

{
  "error": {
    "code": "invalid_file_type",
    "message": "Only .csv files are supported.",
    "details": {
      "filename": "not_csv.txt"
    }
  }
}

This format is useful because clients can check the code field programmatically.

For example, a frontend can show a friendly message for invalid_file_type, while a pipeline can log the error and stop processing.

Step 7: Add tests with pytest

A small API becomes much more useful when its behavior is protected by tests.

This project includes tests for:

/health returning 200
normal CSV analysis
missing value detection
duplicate row detection
non-CSV error handling
expected-column validation

Run the tests:

pytest

[Screenshot: example test result]

For a small demo project, tests are not just a formality. They make the project easier to refactor and safer to extend.

Step 8: Containerize the API with Docker

After the API works locally, the next step is to package it.

Build the Docker image:

docker build -t fastapi-csv-quality-api .

Run the container:

docker run --rm -p 8000:8000 fastapi-csv-quality-api

Then open:

http://127.0.0.1:8000/docs

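Inside the container, the server listens on 0.0.0.0:8000 so that it accepts connections from outside the container. The -p 8000:8000 flag maps that port to your machine, which is why you can reach the API at 127.0.0.1:8000.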

Step 9: Use Docker Compose

Docker Compose is included for a simpler local workflow.

Start the service:

docker compose up --build

Stop the service:

docker compose down

This is useful when you want a repeatable local runtime without manually typing the full docker run command.

Why this project is intentionally small

This project is an MVP.

It intentionally does not include:

authentication
database storage
frontend UI
background jobs
large-file streaming
Kubernetes deployment
production cloud infrastructure

That is deliberate.

The purpose is to demonstrate a complete but lightweight backend workflow:

CSV upload → validation → analysis → structured response → tests → Docker packaging

This makes the project easier to read, test, and extend.

Possible next improvements

There are many ways to extend this project:

configurable file size limits
date format checks
numeric column checks
JSON schema export
GitHub Actions CI workflow
cloud deployment tutorial
larger file handling
dashboard integration

A natural next step would be to deploy the containerized service to a small cloud VM or Kubernetes platform. But that should be treated as a separate tutorial rather than added to the first MVP.

Conclusion

This project shows how to turn a simple local CSV inspection workflow into a reusable API.

The main engineering ideas are:

keep the first version small
return structured JSON instead of text
separate API routing from analysis logic
define response models clearly
test the behavior with pytest
package the service with Docker

Even though the example is small, the pattern is useful:

local script → API service → tested component → containerized tool

That pattern can be reused for many internal developer tools, data workflow utilities, and automation services.

You can find the full project here:

https://github.com/OnerGit/fastapi-csv-quality-api