CSV files are still everywhere.
They appear in internal operations, analytics workflows, data exports, business reports, and small automation pipelines. Even when a team already uses databases or modern data platforms, CSV is often the format used to move data between people, tools, and systems.
The problem is that CSV files are easy to create but not always safe to trust.
Before a CSV file enters a pipeline, it is useful to answer a few basic questions:
- How many rows and columns does it have?
- Are there missing values?
- Are there duplicate rows?
- Are any columns completely empty?
- Do the columns match what the next system expects?
- Can another script or service consume the result in a predictable format?
In this article, I will walk through a small project that turns those checks into a reusable API:
FastAPI CSV Quality API
GitHub repo:
https://github.com/OnerGit/fastapi-csv-quality-api
The goal is not to build a full data quality platform. The goal is to show a practical engineering path from a local Python workflow to a small backend service that is documented, testable, and containerized.
What we will build
The API accepts a CSV file upload and returns a structured JSON quality report.
The report includes:
- row count
- column count
- column names
- missing values by column
- missing value ratio by column
- duplicate row count
- duplicate row ratio
- empty columns
- column name issues
- optional expected-column validation
- warnings
The project also includes:
- structured error responses
- pytest tests
- sample CSV files
- Swagger UI
- Dockerfile
- Docker Compose support
[Screenshot: an example quality report shown in Swagger UI]
Tech stack
This project uses:
- FastAPI for the web API
- Pandas for CSV analysis
- Pydantic for response models
- pytest for automated tests
- Uvicorn as the ASGI server
- Docker and Docker Compose for containerized execution
The project structure is intentionally small:
fastapi-csv-quality-api/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── analyzer.py
│   └── errors.py
├── tests/
│   ├── __init__.py
│   ├── test_health.py
│   ├── test_analyze.py
│   └── fixtures/
├── sample_data/
├── screenshots/
├── docs/
└── article_assets/
The separation is simple:
- main.py exposes the API routes.
- analyzer.py contains the CSV analysis logic.
- models.py defines typed response structures.
- errors.py keeps error response helpers separate.
- tests/ verifies the expected behavior.
Step 1: Run the API locally
Clone the repository:
git clone https://github.com/OnerGit/fastapi-csv-quality-api.git
cd fastapi-csv-quality-api
Create a virtual environment:
python -m venv .venv
Activate it.
On macOS or Linux:
source .venv/bin/activate
On Windows PowerShell:
.\.venv\Scripts\Activate.ps1
Install dependencies:
python -m pip install --upgrade pip
pip install -r requirements.txt
Run the API:
uvicorn app.main:app --reload
Then open:
http://127.0.0.1:8000/docs
You should see the FastAPI Swagger UI.
Step 2: Add a health check endpoint
A small service should have a simple health check endpoint. It gives us a quick way to verify that the API is running.
Example request:
curl http://127.0.0.1:8000/health
Example response:
{
  "status": "ok",
  "service": "fastapi-csv-quality-api",
  "version": "0.1.0"
}
This endpoint is also useful in tests, Docker checks, and future deployment environments.
Step 3: Build the CSV upload endpoint
The main endpoint is:
POST /analyze
It accepts a CSV file as multipart form data.
Example request:
curl -X POST "http://127.0.0.1:8000/analyze" \
-F "file=@sample_data/good_sample.csv"
On Windows PowerShell, use curl.exe instead of curl:
curl.exe -X POST "http://127.0.0.1:8000/analyze" `
-F "file=@sample_data/good_sample.csv"
This matters because Windows PowerShell defines curl as an alias for Invoke-WebRequest, which does not understand curl's flags; calling curl.exe runs the real curl executable.
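If you would rather call the endpoint from Python than from curl, the same multipart request can be built with the standard library. This is a sketch under a few assumptions: the helper name build_upload_request is hypothetical, the CSV bytes are inlined instead of being read from sample_data/, and the actual send is left commented out so the snippet runs without a live server.

```python
import urllib.request
import uuid


def build_upload_request(url: str, filename: str, content: bytes) -> urllib.request.Request:
    """Build a multipart/form-data POST with a single file part named "file"."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join(
        [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
            b"Content-Type: text/csv",
            b"",  # blank line separates part headers from part content
            content,
            f"--{boundary}--".encode(),  # closing boundary
            b"",
        ]
    )
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Inline sample data (an assumption; the curl examples upload a file from disk).
request = build_upload_request(
    "http://127.0.0.1:8000/analyze",
    "good_sample.csv",
    b"id,name\n1,Ada\n",
)

# Uncomment to send the request to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

In practice a client library such as requests or httpx would handle the multipart encoding for you; the hand-built version is shown only to keep the example dependency-free.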
Step 4: Implement basic CSV quality checks
Once the uploaded file is accepted, the analyzer reads it and computes a set of practical metrics.
For a small MVP, the most useful checks are often simple:
row_count
column_count
column_names
missing_values_by_column
missing_value_ratio_by_column
duplicate_row_count
duplicate_row_ratio
empty_columns
These checks are enough to catch many common CSV problems:
- unexpected empty fields
- duplicated records
- fully empty columns
- files with the wrong shape
- files that look valid but are not useful downstream
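To make the metrics concrete, here is a rough sketch of how they could be computed with Pandas. It is not the repository's analyzer: the function name analyze_csv, the inline sample data, and the rounding to four decimals (chosen to match the ratios in the example response) are all assumptions.

```python
import io

import pandas as pd


def analyze_csv(csv_text: str) -> dict:
    """Compute the basic quality metrics listed above for one CSV document."""
    df = pd.read_csv(io.StringIO(csv_text))
    row_count = len(df)

    missing = df.isna().sum()  # per-column count of missing cells
    # duplicated() flags every repeat of a row after its first occurrence.
    duplicate_rows = int(df.duplicated().sum())

    def ratio(count: int) -> float:
        # Guard against division by zero for header-only files.
        return round(count / row_count, 4) if row_count else 0.0

    return {
        "row_count": row_count,
        "column_count": len(df.columns),
        "column_names": [str(name) for name in df.columns],
        "missing_values_by_column": {str(c): int(n) for c, n in missing.items()},
        "missing_value_ratio_by_column": {str(c): ratio(int(n)) for c, n in missing.items()},
        "duplicate_row_count": duplicate_rows,
        "duplicate_row_ratio": ratio(duplicate_rows),
        "empty_columns": [str(c) for c in df.columns if row_count and df[c].isna().all()],
    }


# Four rows: one exact duplicate, one missing name, and a fully empty "notes" column.
sample = "id,name,notes\n1,Ada,\n2,Bob,\n2,Bob,\n3,,\n"
report = analyze_csv(sample)
print(report)
```

Note that this sketch counts only the repeated occurrences as duplicates, so two identical rows yield a duplicate count of one.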
A simplified example response looks like this:
{
  "filename": "bad_sample.csv",
  "row_count": 6,
  "column_count": 6,
  "column_names": [
    "id",
    "name",
    "email",
    "age",
    "signup_date",
    "notes"
  ],
  "missing_values_by_column": {
    "id": 0,
    "name": 1,
    "email": 2,
    "age": 1,
    "signup_date": 2,
    "notes": 6
  },
  "missing_value_ratio_by_column": {
    "id": 0.0,
    "name": 0.1667,
    "email": 0.3333,
    "age": 0.1667,
    "signup_date": 0.3333,
    "notes": 1.0
  },
  "duplicate_row_count": 1,
  "duplicate_row_ratio": 0.1667,
  "empty_columns": [
    "notes"
  ],
  "warnings": [
    "The CSV file contains 12 missing value(s).",
    "The CSV file contains 1 duplicate row(s).",
    "The CSV file contains empty column(s): notes."
  ]
}
The key design choice is that the API returns structured JSON instead of plain text.
That makes the result easier to consume from:
- another script
- a data pipeline
- a dashboard
- a workflow automation tool
- a monitoring job
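As an illustration of why structured output helps, a pipeline step can turn the report into a pass/fail decision in a few lines. The sketch below parses a trimmed copy of the example report; the quality_gate function and its thresholds are hypothetical and not part of the project.

```python
import json

# A trimmed copy of the example report, as a client would receive it.
raw_response = """
{
  "filename": "bad_sample.csv",
  "row_count": 6,
  "duplicate_row_ratio": 0.1667,
  "missing_value_ratio_by_column": {"id": 0.0, "email": 0.3333, "notes": 1.0},
  "empty_columns": ["notes"]
}
"""


def quality_gate(
    report: dict,
    max_missing_ratio: float = 0.2,      # hypothetical threshold
    max_duplicate_ratio: float = 0.05,   # hypothetical threshold
) -> list[str]:
    """Return the reasons a file should be rejected; an empty list means it passes."""
    reasons = []
    for column, ratio in report["missing_value_ratio_by_column"].items():
        if ratio > max_missing_ratio:
            reasons.append(f"{column}: missing ratio {ratio} exceeds {max_missing_ratio}")
    duplicate_ratio = report["duplicate_row_ratio"]
    if duplicate_ratio > max_duplicate_ratio:
        reasons.append(f"duplicate ratio {duplicate_ratio} exceeds {max_duplicate_ratio}")
    return reasons


reasons = quality_gate(json.loads(raw_response))
for reason in reasons:
    print("REJECT:", reason)
```

The same check against a plain-text report would require fragile string parsing, which is the main argument for returning JSON.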
Step 5: Add expected-column validation
In many workflows, the next system expects a fixed set of columns.
For example:
id,name,email,age,signup_date
The API supports optional expected-column validation through a form field:
curl -X POST "http://127.0.0.1:8000/analyze" \
-F "file=@sample_data/good_sample.csv" \
-F "expected_columns=id,name,email,age,signup_date"
This allows the service to compare the uploaded CSV headers against the expected headers.
The response can then tell you whether the file matches the expected shape, and which columns are missing or unexpected.
This is a small feature, but it changes the API from a generic CSV inspector into something more useful for real workflows.
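The comparison itself is simple set-style logic. Here is one way it could look, as a sketch: the function name compare_columns and the result keys are assumptions, and I treat the comparison as order-insensitive, which may differ from what the project does.

```python
def compare_columns(actual: list[str], expected_columns: str) -> dict:
    """Compare CSV headers with a comma-separated list of expected names."""
    # Split the form field value and ignore stray whitespace or empty entries.
    expected = [name.strip() for name in expected_columns.split(",") if name.strip()]
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return {
        "matches_expected": not missing and not unexpected,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
    }


# Headers from the bad_sample.csv example, checked against the expected list above.
result = compare_columns(
    ["id", "name", "email", "age", "signup_date", "notes"],
    "id,name,email,age,signup_date",
)
print(result)
```

Here the extra notes column is reported as unexpected, so the file does not match even though no expected column is missing.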
Step 6: Return structured errors
CSV upload workflows can fail in many ways.
Some examples:
- the user uploads a non-CSV file
- the file is empty
- the file cannot be parsed
- the encoding is unsupported
- the file is too large
Instead of returning inconsistent error messages, this project returns structured errors.
Example:
{
  "error": {
    "code": "invalid_file_type",
    "message": "Only .csv files are supported.",
    "details": {
      "filename": "not_csv.txt"
    }
  }
}
This format is useful because clients can check the code field programmatically.
For example, a frontend can show a friendly message for invalid_file_type, while a pipeline can log the error and stop processing.
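A small helper keeps every error in the same envelope. The sketch below shows the idea with plain Python; the names error_payload and check_file_type are hypothetical, and the case-insensitive extension check is my assumption rather than documented project behavior.

```python
from pathlib import Path
from typing import Optional


def error_payload(code: str, message: str, **details: object) -> dict:
    """Build the error envelope shown above."""
    return {"error": {"code": code, "message": message, "details": details}}


def check_file_type(filename: str) -> Optional[dict]:
    """Return an error payload for non-.csv names, or None when the name is acceptable."""
    # Assumption: the extension check ignores case, so "DATA.CSV" is accepted.
    if Path(filename).suffix.lower() != ".csv":
        return error_payload(
            "invalid_file_type",
            "Only .csv files are supported.",
            filename=filename,
        )
    return None


print(check_file_type("not_csv.txt"))
```

Inside a FastAPI route, a payload like this would typically be returned with a 4xx status code, for example through fastapi.responses.JSONResponse; which status code the project uses for each error is not covered here.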
Step 7: Add tests with pytest
A small API becomes much more useful when its behavior is protected by tests.
This project includes tests for:
- /health returning 200
- normal CSV analysis
- missing value detection
- duplicate row detection
- non-CSV error handling
- expected-column validation
Run the tests:
pytest
[Screenshot: example pytest run]
For a small demo project, tests are not just a formality. They make the project easier to refactor and safer to extend.
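For readers new to pytest, a test is just a function whose name starts with test_ and that uses plain assert statements. The sketch below is self-contained rather than copied from the repository: count_missing is a hypothetical stand-in for the analyzer, written with the standard csv module so the example runs without the rest of the project.

```python
import csv
import io


def count_missing(csv_text: str) -> dict[str, int]:
    """Hypothetical stand-in for the analyzer: count empty cells per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames or []}
    for row in reader:
        for name in counts:
            # Treat absent, empty, and whitespace-only cells as missing.
            if not (row.get(name) or "").strip():
                counts[name] += 1
    return counts


def test_missing_values_are_counted_per_column():
    assert count_missing("id,name\n1,Ada\n2,\n") == {"id": 0, "name": 1}


def test_clean_file_reports_no_missing_values():
    assert count_missing("id,name\n1,Ada\n2,Bob\n") == {"id": 0, "name": 0}
```

Saved in a file such as tests/test_example.py (a hypothetical name), these functions are collected and run by the pytest command shown above.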
Step 8: Containerize the API with Docker
After the API works locally, the next step is to package it.
Build the Docker image:
docker build -t fastapi-csv-quality-api .
Run the container:
docker run --rm -p 8000:8000 fastapi-csv-quality-api
Then open:
http://127.0.0.1:8000/docs
Inside the container, the server listens on 0.0.0.0:8000. The -p 8000:8000 flag maps that port to the host, so your local machine reaches it at 127.0.0.1:8000.
Step 9: Use Docker Compose
Docker Compose is included for a simpler local workflow.
Start the service:
docker compose up --build
Stop the service:
docker compose down
This is useful when you want a repeatable local runtime without manually typing the full docker run command.
Why this project is intentionally small
This project is an MVP.
It intentionally does not include:
- authentication
- database storage
- frontend UI
- background jobs
- large-file streaming
- Kubernetes deployment
- production cloud infrastructure
That is deliberate.
The purpose is to demonstrate a complete but lightweight backend workflow:
CSV upload → validation → analysis → structured response → tests → Docker packaging
This makes the project easier to read, test, and extend.
Possible next improvements
There are many ways to extend this project:
- configurable file size limits
- date format checks
- numeric column checks
- JSON schema export
- GitHub Actions CI workflow
- cloud deployment tutorial
- larger file handling
- dashboard integration
A natural next step would be to deploy the containerized service to a small cloud VM or Kubernetes platform. But that should be treated as a separate tutorial rather than added to the first MVP.
Conclusion
This project shows how to turn a simple local CSV inspection workflow into a reusable API.
The main engineering ideas are:
- keep the first version small
- return structured JSON instead of text
- separate API routing from analysis logic
- define response models clearly
- test the behavior with pytest
- package the service with Docker
Even though the example is small, the pattern is useful:
local script → API service → tested component → containerized tool
That pattern can be reused for many internal developer tools, data workflow utilities, and automation services.
You can find the full project here:
https://github.com/OnerGit/fastapi-csv-quality-api




