Your CSV Import Is Fine... Until Real Data Arrives

Most CSV import tutorials end the same way:

  • read the file
  • loop the rows
  • db.add(...)
  • done

That looks fine in a demo.

In production, it is how you wake up to corrupted customer data, duplicate records, broken status values, and a support thread that starts with “we only uploaded a CSV.”

I did not want a CSV uploader.

I wanted a safe ingestion layer.

So I built a FastAPI-based CSV import engine that treats external CSV files as hostile input, forces a preview-first workflow, records an import run, prevents duplicate commit accidents, and connects imported records to follow-up operational tasks.

This article is about building that system properly.

This is the operator workspace used to authenticate and start a safe CSV import.


The real problem is not file upload

The hard part of CSV import is not parsing a file.

The hard part is this:

  • somebody exports from Excel in a weird encoding
  • headers do not match your schema
  • one row has extra commas
  • another row is missing a value
  • status values are ACTIVE, 有効, new, archived, or something worse
  • someone uploads the same file twice
  • a “simple import” silently creates junk in production
  • nobody can explain what happened afterward

That is not an edge case.

That is the normal shape of business CSV.

So I stopped thinking in terms of “CSV upload” and started thinking in terms of this instead:

How do I create a safe entry point for external business data?

That framing changed everything.


What I built

This project is a Client Ops Workspace built with FastAPI and Next.js.

At the center is a CSV import engine for client records with this flow:

  1. Sign in
  2. Download the correct CSV template
  3. Upload a CSV
  4. Run Preview
  5. Inspect:
    • total rows
    • valid rows
    • error rows
    • create candidates
    • update candidates
    • row-level errors
    • normalization suggestions
    • row decisions
  6. Run Commit
  7. Store an import run record and audit trail
  8. Continue into project / task follow-up if needed

This is not “import and pray.”

It is preview first, commit later.

The workspace below shows the operator flow once real data is loaded: preview summary, row-level intent, filters, search, pagination, errors, and normalization feedback.

The preview step separates valid rows, errors, and import decisions before anything touches the database.


The core design choice: preview and commit must be separate

This is the part most tutorials skip.

I split CSV ingestion into two phases.

Preview

The system reads the uploaded CSV, validates it, normalizes values, detects malformed rows, checks for duplicates, determines create/update intent, and returns a structured preview.

Nothing is written to the database yet.

Commit

The system accepts an import_run_id from a preview that already happened and applies only the rows that survived validation.

That gives me several things at once:

  • safer UX
  • clearer operator intent
  • auditable import history
  • duplicate commit protection
  • better error reporting
  • a future path for approvals, manual mapping, and role-based operations

Once I committed to that architecture, the rest of the design got much cleaner.
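The two-phase flow above can be sketched as plain functions (the real system wires these into FastAPI endpoints; the function names, the in-memory run store, and the email-based validity check here are all illustrative assumptions, not the actual implementation):

```python
# Sketch of the preview/commit split. RUNS stands in for the ImportRun
# table; validation is reduced to "has an email" for brevity.
import uuid

RUNS: dict[str, dict] = {}

def preview(rows: list[dict]) -> dict:
    """Validate rows and store a preview. Nothing touches the main DB."""
    valid = [r for r in rows if r.get("email")]
    errors = [r for r in rows if not r.get("email")]
    run_id = str(uuid.uuid4())
    RUNS[run_id] = {"status": "previewed", "valid": valid}
    return {
        "import_run_id": run_id,
        "summary": {"total": len(rows), "valid": len(valid), "errors": len(errors)},
    }

def commit(run_id: str) -> dict:
    """Apply only the rows that survived a specific preview."""
    run = RUNS.get(run_id)
    if run is None:
        raise ValueError("unknown import run")
    if run["status"] == "committed":
        raise ValueError("already committed")  # duplicate-commit guard
    # ... write run["valid"] to the database here ...
    run["status"] = "committed"
    return {"committed": len(run["valid"])}
```

The key property: commit takes an `import_run_id`, never a file, so it can only apply something an operator already reviewed.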


The API shape

For the first version, I focused on clients.

The expected columns are:

  • contact_name
  • company_name
  • email
  • phone
  • status

Allowed statuses:

  • lead
  • active
  • inactive

The preview endpoint returns:

  • import run id
  • filename
  • summary
  • errors
  • suggestions
  • row decisions
  • header mapping info
  • timestamps

That means the frontend can render a real operator-facing review screen instead of a binary “success / failed” message.


Why the import run matters

The ImportRun model turned this from a toy into a system.

Each preview stores:

  • resource type
  • filename
  • file hash
  • preview status
  • total / valid / error row counts
  • create / update candidate counts
  • header mapping JSON
  • mapping confidence JSON
  • mapping notes JSON
  • preview payload JSON
  • actor user id
  • created / expires / committed / failed timestamps

That gives me:

1. Workflow-level idempotency

Commit is tied to a specific preview result through import_run_id.

If someone tries to commit the same run twice, the system refuses it.

2. Auditable behavior

I can inspect exactly what the operator previewed, what the system inferred, and when the commit happened.

3. Expiration

Previews are not forever. They have an expiration window.

That matters because preview payloads can be large, and stale previews are dangerous.
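The guards an ImportRun enables can be sketched like this (field names follow the list above; the dict-based run record and the exact checks are assumptions about shape, not the actual model):

```python
# Sketch of the duplicate-commit and expiry guards backed by ImportRun.
import hashlib
from datetime import datetime, timedelta, timezone

def file_hash(raw: bytes) -> str:
    """Content hash stored on the run, usable to flag re-uploads of the same file."""
    return hashlib.sha256(raw).hexdigest()

def can_commit(run: dict, now: datetime) -> tuple[bool, str]:
    """Refuse commits against already-committed or stale previews."""
    if run["status"] == "committed":
        return False, "already committed"
    if now > run["expires_at"]:
        return False, "preview expired"
    return True, "ok"
```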


CSV hell is real, so the parser has to be suspicious

The parser is designed around distrust.

Encoding support

One detail that matters a lot in practice, especially in Japanese business environments, is CSV compatibility beyond clean UTF-8 exports.

That is why the parser also accounts for UTF-8 with BOM, CP932, and Shift-JIS patterns, along with header alias mapping for Japanese business labels.

I added decoding fallback for:

  • UTF-8 with BOM
  • UTF-8
  • CP932
  • Shift-JIS

That alone removes a huge class of “works on my machine” CSV problems for Japanese business data.
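A decoding fallback in that spirit is small (the ordering below mirrors the list above; whether the real engine tries exactly these codecs in exactly this order is an assumption):

```python
# Try encodings from most to least specific; utf-8-sig strips a BOM if present.
def decode_csv_bytes(raw: bytes) -> str:
    for enc in ("utf-8-sig", "utf-8", "cp932", "shift_jis"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("unsupported CSV encoding")
```

Note that CP932 is Python's superset of Shift-JIS, so placing it first catches most Excel-exported Japanese CSVs.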

Header alias mapping

Real users do not always send your perfect schema.

So I added alias mapping for cases like:

  • 会社名 → company_name
  • 担当者名 → contact_name
  • 連絡先メール → email
  • 電話番号 → phone
  • 状態 → status

This gives the engine the ability to meet users where they are, not where my model wishes they were.
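A minimal version of that mapping (the alias table here contains only the pairs listed above; the real table would be larger and could carry per-column confidence):

```python
# Map incoming headers to canonical column names where an alias is known;
# unknown headers pass through unchanged so they can be flagged later.
HEADER_ALIASES = {
    "会社名": "company_name",
    "担当者名": "contact_name",
    "連絡先メール": "email",
    "電話番号": "phone",
    "状態": "status",
}

def map_headers(headers: list[str]) -> dict[str, str]:
    return {h: HEADER_ALIASES.get(h.strip(), h.strip()) for h in headers}
```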

Malformed row detection

I use csv.DictReader with restkey and row-level parse checks to catch:

  • rows with more columns than the header
  • rows with fewer columns than the header
  • likely comma mismatch
  • likely broken quoting

That means malformed rows become explicit review items instead of silent garbage.

Broken rows are surfaced before commit instead of silently poisoning production data.

Invalid rows remain visible with human-readable validation output instead of silently failing.
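Here is roughly how `csv.DictReader` with `restkey` surfaces both cases (the error wording and row-numbering convention are my assumptions; the column list is the article's schema):

```python
# Extra fields land under restkey; missing fields get restval (None).
import csv
import io

EXPECTED = ["contact_name", "company_name", "email", "phone", "status"]

def detect_malformed(text: str) -> list[dict]:
    reader = csv.DictReader(io.StringIO(text), restkey="_extra", restval=None)
    findings = []
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        if "_extra" in row:
            findings.append({"row": lineno, "error": "extra columns (likely comma mismatch)"})
        elif None in row.values():
            findings.append({"row": lineno, "error": "missing columns"})
    return findings
```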


Validation is not enough. Normalization matters too.

The engine does not only reject bad data. It also tries to make valid intent usable.

Status normalization

Examples:

  • ACTIVE → active
  • new → lead
  • 有効 → active
  • 見込み → lead
  • archived → inactive

Phone normalization

Examples:

  • +1-415-555-0182 → +14155550182
  • 090-9999-8888 → 09099998888

Contact name fallback

If contact_name is empty but company_name exists, the preview can suggest filling it from the company.

But this is important:

Suggestions are not the same as blind auto-correction.

The operator gets a readable preview of what the system inferred.

That is a very different philosophy from silently mutating everything.
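The three normalizers above can be sketched like this (the alias table contains only the examples listed; the digit-stripping rule and the suggestion helper are illustrative assumptions):

```python
# Normalization helpers: map aliases to canonical statuses, strip phone
# separators, and suggest (not apply) a contact-name fallback.
import re

STATUS_ALIASES = {
    "lead": "lead", "active": "active", "inactive": "inactive",
    "new": "lead", "archived": "inactive",
    "有効": "active", "見込み": "lead",
}

def normalize_status(value: str):
    """Return the canonical status, or None if no alias matches."""
    return STATUS_ALIASES.get(value.strip().lower())

def normalize_phone(value: str) -> str:
    """Keep digits and a leading '+', drop hyphens and other separators."""
    return re.sub(r"[^\d+]", "", value.strip())

def suggest_contact_name(row: dict):
    """Suggestion only: surfaced in the preview, never silently applied."""
    if not row.get("contact_name") and row.get("company_name"):
        return row["company_name"]
    return None
```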


Real-world CSV means alias mapping, not idealized schemas

If a user sends company_name, that is easy.

Real files are rarely that polite.

Business CSV files tend to arrive with mixed conventions, Japanese labels, and whatever the exporting system decided to call a field that week.

That is why alias mapping is part of the engine, not an afterthought.

The goal is not to force every operator to manually reshape a spreadsheet before import.

The goal is to let the system absorb common real-world variation safely.

Matched clients can immediately flow into follow-up work such as projects and tasks.


Row decisions are the real product

The most useful part of the preview is not the summary card.

It is the row decision list.

For each valid row, the system tells the operator:

  • which row it is
  • whether it will create or update
  • the normalized values
  • the reason

For example:

  • create because the email does not exist in the database
  • update because the email already exists in the database
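That decision logic is essentially a keyed lookup (here `existing_emails` stands in for a database query; the exact reason strings and case-folding are assumptions):

```python
# Classify create/update intent for one row, keyed on email.
def decide_row(row: dict, existing_emails: set[str]) -> dict:
    email = row["email"].strip().lower()
    if email in existing_emails:
        return {"action": "update", "reason": "email already exists in the database"}
    return {"action": "create", "reason": "email does not exist in the database"}
```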

In the UI, operators do not see a generic “import successful” message.

They see row-level intent.

That sounds simple, but it changes the operator’s experience from this:

“I uploaded a file and hoped for the best.”

to this:

“I can see exactly what this system is about to do.”

That is the difference between a utility and a trustworthy business tool.


The frontend is not decoration. It is part of the safety model.

I built a dedicated Next.js page for the workflow.

The page includes:

  • sign in
  • auth status
  • token show / hide / copy
  • template download
  • CSV upload
  • preview
  • commit
  • run status
  • preview snapshot
  • row decisions
  • errors
  • suggestions
  • task follow-up

The frontend evolved a lot during implementation because I stopped treating it as a debug console and started treating it like an operator workspace.

Key choices that mattered:

Preview snapshot stays compact

The top section is intentionally simple.

Analysis panels get wider

Once the CSV is analyzed, the page expands because row-level review needs horizontal space.

Color meaning stays disciplined

I avoided noisy UI coloring.

  • analyzed business data uses one calm accent tone
  • errors are red
  • summaries stay readable
  • the interface does not scream unless something is actually wrong

Search, filtering, and pagination

For row decisions, I added:

  • All / Create / Update filters
  • search
  • rows-per-page controls
  • pagination

This became necessary once I tested with 500+ and 1000+ row files.

If your “safe CSV import” falls apart when the dataset gets bigger, it is not safe.

It is theatrical.


Import does not end at commit

This is the part I care about most from a business perspective.

Most CSV demos stop at “data inserted successfully.”

Real work begins after that.

So I connected the import workflow to downstream operational actions:

  • match imported rows to existing clients
  • surface match quality
  • load related projects
  • load related tasks
  • create a project
  • create a task
  • change task status directly in the same workspace

This is where the engine stops being a parser and starts becoming operational infrastructure.

It turns the CSV engine from a dead-end admin tool into a real operational bridge.

The follow-up area turns imported rows into operational work, not just parsed data.

Why this matters

Businesses do not buy “CSV parsing.”

They buy shorter time from:

spreadsheet arrives

to

the right work starts

That is a much more valuable product story.


Match quality matters more than people admit

I added two match types in the follow-up UI:

  • Matched by email
  • Matched by company

And I do not treat them equally.

A company-name match is shown as a provisional link with an explicit warning.

That is not a cosmetic detail.

It prevents operators from over-trusting a weak match and creating follow-up work against the wrong record.

This is the kind of boring detail that makes systems usable in the real world.


Activity logging is part of the contract

CSV import without auditability is reckless.

I record activity log events such as:

  • csv_previewed
  • csv_import_committed
  • client_created_from_csv
  • client_updated_from_csv

This gives the system an operational memory.

When a team asks, “Who imported this?” or “Why did this client change?” the answer should not be “I think someone uploaded a spreadsheet last week.”

It should be queryable.
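A minimal append-only log in that spirit (event names come from the list above; the in-memory list stands in for a database table, and the record shape is an assumption):

```python
# Append one structured event per meaningful action, with actor and timestamp.
from datetime import datetime, timezone

ACTIVITY_LOG: list[dict] = []

def log_event(event: str, actor_id: int, detail: dict) -> None:
    ACTIVITY_LOG.append({
        "event": event,
        "actor_id": actor_id,
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),
    })
```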


What I tested

I did not stop at one happy-path CSV.

I tested with multiple categories of files.

Core cases

  • UTF-8 normal CSV
  • CP932 Japanese CSV
  • normalization-focused CSV

Broken cases

  • extra columns
  • missing columns
  • malformed rows

Scale cases

  • 500-row mixed CSV
  • 1000-row CSV
  • update-like datasets

The result was important:

  • large previews stayed usable
  • broken rows surfaced as errors instead of poisoning the preview
  • malformed extra-column rows stopped producing misleading suggestions
  • create/update behavior stayed understandable under load

That last point matters.

A system like this does not earn trust because it works once.

It earns trust because it behaves clearly when the input is ugly.


What makes this commercially interesting

I do not think the interesting part is “I built a FastAPI app.”

The interesting part is this:

I built a reusable ingestion layer for messy business CSV.

That can be used in:

  • client imports
  • CRM migration workflows
  • lead intake pipelines
  • operations dashboards
  • internal admin tools
  • SaaS backends that need safe CSV entry points

And it is much more sellable than a generic CRUD sample because it solves a pain that companies already have.

Companies do not wake up asking for “a cool FastAPI demo.”

They wake up asking why their imported data is broken.


What I would add next

This version already does a lot, but it also leaves room for stronger commercial versions.

1. Manual header remapping UI

The schema already leaves space for a future manual override flow.

2. AI-assisted column inference

If a user uploads headers like Contact Info or Primary Reach, the system could propose likely mappings and record the reason.

3. Stronger import policies

Per-client rules, per-column exceptions, protected fields, partial-safe commit policies, and custom merge logic.

4. Background cleanup / retention policies

Especially important once preview payloads start accumulating in production.

That is where this stops being a project and becomes a product line.


Why I built it this way

Because I am tired of tutorials pretending CSV import is trivial.

It is not.

The “read file and insert rows” version is easy to write and expensive to own.

I wanted something closer to the real standard teams need:

  • suspicious parser
  • explicit validation
  • normalization with human-readable suggestions
  • preview-first review
  • idempotent commit flow
  • audit trail
  • downstream operational handoff

That is a better way to treat imported business data.


Final thought

Most CSV import examples are upload features.

I wanted to build something else.

I wanted to build a safe intake layer for operational data.

Something that assumes the CSV is messy.

Something that shows its reasoning.

Something that can say “not yet” before it says “done.”

Something that keeps humans in control without making them do everything by hand.

Because in production, the real feature is not import.

The real feature is not corrupting the business while you import.


Final stress test: 10,000 rows with 800 broken cases

As a final stress test, I ran this workflow against a 10,000-row CSV that included 800 intentionally broken cases.

Those broken rows included invalid emails, missing required fields, duplicate values, invalid status values, extra-column rows, and missing-column rows.

The result:

  • 10,000 total rows
  • 9,200 valid rows
  • 800 error rows

On the lowest-spec machine I had available for testing, the preview completed in about 20 seconds and still returned a readable decision table, validation output, and suggestions.

A dedicated operator workspace for safe CSV ingestion.

I would not treat this as a raw CSV parsing benchmark.

A 10,000-row import can look “fast” or “slow” depending on what the system is actually doing. In this case, the workflow was not just reading rows. It was validating fields, classifying row decisions, isolating broken records, generating suggestions, and returning a reviewable preview for operations.

So the result that mattered to me was not just the elapsed time. It was the fact that the workflow remained usable at 10,000 rows with 800 broken cases.


Interested in a private implementation?

This project is not published as a full public repository.

If your team needs a similar workflow for internal operations, CSV validation, safe commit flows, or post-import task handling, I’m available for private implementation discussions based on your use case and scope.

Email: fastapienne@gmail.com
