So I built a thing. It's called Idswyft — an open-source identity verification platform. The kind of thing banks and fintechs pay $2-5 per check for, except you can self-host it, audit every line, and it's free.
Let me walk you through what it is, what problem it solves, and the technical choices behind it. No fluff, just real talk from someone who learned a LOT building this.
Wait, What Even Is Identity Verification?
You know when you sign up for a new bank account or crypto exchange and they ask you to take a photo of your driver's license and then a selfie? That's identity verification (also called KYC — Know Your Customer).
Behind that simple "take a photo" flow, there's a whole pipeline running:
- Reading the text off your ID (OCR)
- Checking if the document is real or tampered with
- Making sure the person holding the phone is the same person on the ID
- Checking you're not on any sanctions lists (coming soon)
Companies like Jumio, Onfido, and Persona charge $2-5 every time someone goes through this. That adds up fast. And here's the kicker — you're sending your users' most sensitive documents (passport photos, driver's licenses) to a third-party server you don't control.
What Idswyft Actually Does
Idswyft gives you that entire pipeline — document scanning, tamper detection, deepfake detection, face matching, liveness checks — as something you can run on your own servers. Your data never leaves your infrastructure unless you want it to.
For the user, it's 3 simple steps:
1. Take a photo of the front of your ID
2. Take a photo of the back
3. Do a quick live camera check (turn your head left, right — proves you're real)
Behind the scenes, 6 verification "gates" run one after another:
1. OCR Extraction → Read all text from the front of the ID
2. Tamper Detection → Is this a real document? (ELA, entropy, spectral analysis)
3. Cross-Validation → Does the barcode/MRZ on the back match the front?
(this catches photoshopped IDs really well)
4. Liveness Check → Is this a real person? (head-turn challenge, deepfake scan)
5. Face Match → Does the live face match the photo on the ID?
6. Sanctions Screen → Are they on any watchlists? (coming soon)
The key idea: if any gate fails, we stop immediately. No point running face matching if we couldn't even read the document. This saves processing time and gives clear, specific error messages — not just "verification failed" but "the barcode on the back of your ID couldn't be decoded, please retake the photo."
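The fail-fast orchestration can be sketched in a few lines of TypeScript. This is illustrative only: the gate names match the article, but the types and function shapes are my own, not Idswyft's actual internals.

```typescript
type GateResult = { pass: boolean; reason?: string };
type Gate = { name: string; run: (ctx: Record<string, unknown>) => Promise<GateResult> };

// Run gates in order; stop at the first failure so later (more expensive)
// gates never execute, and the caller gets a specific, gate-level reason.
async function runPipeline(gates: Gate[], ctx: Record<string, unknown>) {
  for (const gate of gates) {
    const result = await gate.run(ctx);
    if (!result.pass) {
      return { status: "failed" as const, failedGate: gate.name, reason: result.reason };
    }
  }
  return { status: "verified" as const };
}
```

The early return is the whole trick: "tamper detection failed: ELA hot spot" is far more actionable than a generic "verification failed" after all six gates have run.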
One rule I'm really strict about: every single decision in this pipeline is deterministic. No AI model decides whether you pass or fail. Gates use checksums, exact string matching, Levenshtein distance, cosine similarity with fixed thresholds — all math you can reproduce on paper. If you feed the same inputs twice, you get the same result twice. Always.
Why does this matter? Because when a regulator asks "why was this person rejected?", the answer needs to be "the MRZ checksum on check digit 3 failed" — not "the model returned 0.47." You can't audit a vibe. You can audit a checksum.
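To make "math you can reproduce on paper" concrete, here's a deterministic name comparison built on Levenshtein distance with a fixed edit budget. The normalization and the threshold value are hypothetical choices for illustration, not Idswyft's exact rules.

```typescript
// Classic dynamic-programming Levenshtein distance: pure, reproducible math.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Fixed-threshold comparison: same inputs, same verdict, every single time.
function namesMatch(ocrName: string, barcodeName: string, maxEdits = 2): boolean {
  const norm = (s: string) => s.toUpperCase().replace(/[^A-Z ]/g, "").trim();
  return levenshtein(norm(ocrName), norm(barcodeName)) <= maxEdits;
}
```

There's no model in the loop: a regulator (or you, with a pencil) can re-run the edit-distance table by hand and get the identical answer.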
The Problem I Wanted to Solve
As a developer, your options for adding ID verification to your app are kind of rough:
Option A: Use a vendor (Persona, Onfido, etc.)
- Costs $2-5 per verification
- Your users' ID photos go to their servers
- You can't customize the verification rules
- You're locked into their pricing
- Want to even test it? Fill out a sales form and wait for a demo call
Option B: Build it yourself
- Takes months of engineering
- You need ML expertise for face matching and liveness
- OCR is harder than you think (trust me)
- Regulatory compliance is a whole thing
Option C: Idswyft
- Self-host it — your servers, your data
- Open source — audit every line of code
- Full pipeline already built
- Integrate in under 30 minutes with 4 API calls
- No sales calls, no contracts, no vendor lock-in
I wanted Option C to exist, so I built it.
To be real about where Idswyft sits today: it's production-capable, but the sweet spot right now is developers who need to spin up a working identity verification flow fast — for an MVP, a prototype, a hackathon, a side project, or just to prove to stakeholders that the feature works before committing to a $50K/year vendor contract.
The idea is simple: you shouldn't have to talk to a sales team, sign an enterprise agreement, and wait two weeks just to test if ID verification works in your app. With Idswyft, you can have a full pipeline running in minutes, validate your product idea, and then later make an informed decision about whether you need a commercial provider or whether Idswyft handles your needs just fine on its own.
The Tech Stack (and Why I Picked Each Thing)
Here's every major technology in the project and the honest reason it's there.
Backend: Node.js + TypeScript + Express
TypeScript was non-negotiable. When you're passing around verification results with fields like confidence_score, liveness_score, and face_match_results, you NEED type safety. A typo in a field name could mean silently passing someone who should've failed.
Express handles file uploads natively (multipart form data), and the Node.js ecosystem has great image processing libraries. Plus, the frontend is React, so it's TypeScript everywhere — one language across the whole stack.
Database: PostgreSQL
Verification data is naturally relational. A verification session has documents, each document has extraction results, each session has a final verdict. That's tables with foreign keys — a perfect fit for Postgres.
The two editions handle this differently. The self-hosted Community Edition runs a plain PostgreSQL 16 container — no external dependencies, no third-party database service, your data stays entirely on your machine. The Cloud Edition uses Supabase as the Postgres host, which gives us row-level security (RLS) for free. This means we can enforce "developer A can only see developer A's verifications" at the database level, not just in application code. If there's ever a bug in the API, the database itself won't let data leak across tenants.
OCR: PaddleOCR (primary) + Tesseract (fallback)
This was a big decision. The whole point of self-hosting is data sovereignty — if I sent ID photos to Google Cloud Vision or AWS Textract for text extraction, that would defeat the purpose. And baking a cloud vision model (like GPT-4o or Claude) directly into the verification engine would mean every single verification requires an API call to a third party. That's a non-starter for a self-hosted platform — it would add latency, cost money per scan, and send your users' ID photos to external servers by default. The core engine should work completely offline, no internet required.
PaddleOCR runs entirely on your machine. No API calls, no data leaving your server. It's the primary engine and handles most ID formats well.
Tesseract.js is the fallback — it's slower but works everywhere without any native dependencies. If PaddleOCR fails on an image, Tesseract takes a second pass.
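The primary-then-fallback arrangement can be expressed as a tiny combinator. The `OcrEngine` interface and the minimum-confidence cutoff here are hypothetical, a sketch of the pattern rather than the project's real provider API.

```typescript
interface OcrResult { text: string; confidence: number }
type OcrEngine = (image: Uint8Array) => Promise<OcrResult>;

// Try the primary engine first; if it throws, or returns an unusably
// low confidence, take a second pass with the fallback engine.
function withFallback(primary: OcrEngine, fallback: OcrEngine, minConfidence = 0.3): OcrEngine {
  return async (image) => {
    try {
      const result = await primary(image);
      if (result.confidence >= minConfidence) return result;
    } catch {
      // primary crashed; fall through to the fallback engine
    }
    return fallback(image);
  };
}
```

Because the composed thing is itself an `OcrEngine`, the rest of the pipeline never needs to know which engine actually produced the text.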
How well does it actually work? I built a benchmark suite and tested against 50 US driver's license specimens, one per state, gathered from publicly available sources (there's no official dataset for this). Local OCR only, no cloud APIs. Here are the numbers:
Field           Accuracy   Notes
─────────────   ────────   ─────────────────────────────────────────
Address         84%        Multi-line assembly, 42 states
Document #      84%        42 states, dozens of formats
Date of Birth   78%        Handles MM/DD/YYYY, YYYYMMDD, etc.
Full Name       78%        Multi-line assembly, AAMVA prefix parsing
Sex             72%        Tricky when embedded in multi-field lines
Expiry Date     67%        Sometimes confused with issue date
─────────────   ────────   ─────────────────────────────────────────
Overall         77.2%      Up from 63.5% baseline after 3 rounds
Those numbers are without any cloud LLM assistance — that's pure local PaddleOCR + post-processing heuristics. 100% of documents successfully had text extracted, and average processing time is under 1 second (780ms). For a completely offline, self-hosted OCR pipeline, I'm honestly pretty happy with that.
The hard part wasn't getting PaddleOCR to read text — it's surprisingly good at that. The hard part was the structured extraction pipeline: figuring out which text is the name vs. the address vs. the document number. Every state has a different layout. Florida's DL number looks like D123-456-83-789-0. Michigan's looks like S 000 123 456 789. Alabama's is just 12345678 — which looks exactly like a date. I had to build a two-strategy extraction system with 15+ regex patterns, label detection, AAMVA field code parsing, and state-specific format handling.
Getting from 63.5% to 77.2% took three focused improvement rounds over about a week. That's the kind of iterative, data-driven work that doesn't happen when you throw a cloud API at the problem and call it done.
But here's the thing — local OCR isn't perfect. Glare, low resolution, unusual fonts, IDs from countries with complex layouts — sometimes PaddleOCR just can't read a field with high confidence. Instead of shrugging and failing the verification, we give developers a way to handle those hard cases.
In the developer portal, there's an OCR Enhancement setting where you can plug in your own vision model — OpenAI (GPT-4o Vision), Anthropic (Claude Vision), or any custom OpenAI-compatible endpoint. You bring your own API key, we encrypt it and store it securely.
Here's how it works: PaddleOCR runs first on every document (fast, free, local). If any extracted field comes back with less than 60% confidence — say, the name is garbled or the expiry date is unreadable — only THEN does the system send that specific image to your configured LLM for a second opinion. The LLM re-extracts just the weak fields, and the results get merged back in.
So the pipeline is: local OCR handles the vast majority of cases for free, cloud LLM catches the hard remainder — only when you opt in, only with your own key, only for the fields that actually need help. Best of both worlds.
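The confidence-gated merge described above can be sketched like this. The field shapes, the `llmReextract` callback, and the exact merge behavior are my own illustration; only the 60% cutoff and the "weak fields only" rule come from the article.

```typescript
interface FieldExtraction { value: string; confidence: number }
type Fields = Record<string, FieldExtraction>;

// Send ONLY the low-confidence fields to the configured LLM, then merge
// its answers back over the local OCR results.
async function enhanceWeakFields(
  local: Fields,
  llmReextract: (fieldNames: string[]) => Promise<Record<string, string>>,
  cutoff = 0.6,
): Promise<Fields> {
  const weak = Object.keys(local).filter((k) => local[k].confidence < cutoff);
  if (weak.length === 0) return local;     // all fields confident: no LLM call at all
  const better = await llmReextract(weak); // only the weak fields leave the box
  const merged: Fields = { ...local };
  for (const name of weak) {
    if (better[name] !== undefined) {
      merged[name] = { value: better[name], confidence: cutoff }; // illustrative score
    }
  }
  return merged;
}
```

The early return is what keeps the common case free: a document where every field clears 60% never touches the network.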
And to be really clear about the boundary: the LLM only ever reads text from an image. It never decides whether someone passes or fails verification. That decision is always deterministic — checksums, string matching, fixed thresholds. The LLM is behind a provider interface, isolated from all comparison and decision logic. It's a pair of reading glasses, not a judge.
Face Recognition: face-api.js (TensorFlow.js + WASM)
This one's cool. face-api.js extracts a 128-number "fingerprint" (called an embedding) from any face photo. We compute one embedding from the ID photo and one from the live capture, then compare them using cosine similarity (basically: how similar are these two lists of 128 numbers?).
The magic: it runs entirely in WASM (WebAssembly), so there are no native binary dependencies. It works on any platform — Linux, Mac, Windows, ARM — without installing anything special.
Score above 0.60? Match. Below? Fail.
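The comparison itself is a one-liner's worth of math. Here's a minimal version of cosine similarity plus the fixed 0.60 threshold from the article (the function names are mine, not face-api.js's API):

```typescript
// Cosine similarity: dot product divided by the product of the vector norms.
// 1.0 means identical direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Deterministic verdict: same two embeddings always give the same answer.
const isFaceMatch = (idEmbedding: number[], liveEmbedding: number[]) =>
  cosineSimilarity(idEmbedding, liveEmbedding) >= 0.6;
```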
Liveness Detection: Custom Active Challenge
This is probably the most interesting part. We don't just take a static photo — the user has to follow real-time instructions: "turn your head left... now right... back to center." The direction is randomized each time.
We score 7 different signals:
- Was a face detected in every frame?
- Did the head actually turn enough? (at least 12 degrees)
- Did it turn in the correct direction?
- Did they return to center?
- Did the whole thing take a realistic amount of time?
- Is the face size consistent across frames? (catches video splicing)
- Is a virtual camera detected? (OBS, ManyCam, etc.)
Fun failure story: We initially tried flashing colored lights on the screen and measuring the reflection on the user's face. Turns out, on mobile phones, the camera captures from the raw sensor pipeline, not the rendered screen. The color differences were literally in the noise range (like 2 RGB units). We scrapped it and went with geometric/temporal checks that work based on physics, not screen rendering.
Deepfake Detection: EfficientNet-B0 via ONNX
This one's increasingly important. With tools like Stable Diffusion and face-swap apps getting scarily good, someone can generate a realistic-looking face photo and hold it up to a camera — or worse, use a virtual camera to stream a deepfake in real-time.
We run an EfficientNet-B0 model (via ONNX Runtime) on the live capture frames to classify them as real or synthetic. It analyzes subtle artifacts that deepfakes still struggle with — inconsistent skin textures, unnatural lighting gradients, compression patterns that differ between real camera sensors and AI-generated images.
This runs alongside the geometric liveness checks, so an attacker would need to beat both: produce a deepfake realistic enough to fool the neural network AND physically move a 3D head in the correct randomized direction within the time window. That's a much harder attack surface than either check alone.
Like everything else in the pipeline, the deepfake score feeds into a deterministic threshold — above 0.5 confidence of being fake triggers a flag. No vibes, no ambiguity.
Barcode & MRZ Parsing: ZXing + Custom Parser
US driver's licenses have a PDF417 barcode on the back with all the cardholder's info encoded in it. European IDs use MRZ (Machine Readable Zone) — those two or three lines of angle-bracket-filled text at the bottom of a passport.
We parse both. Then we cross-validate: does the name on the front (from OCR) match the name in the barcode on the back? If someone photoshops the front of their ID but doesn't modify the barcode (which is much harder to tamper with), cross-validation catches it.
MRZ has built-in checksums — change even one character on the physical card, and the math doesn't add up. Instant rejection.
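That checksum is the ICAO 9303 check digit: each character's value (digits as-is, A to Z as 10 through 35, the `<` filler as 0) multiplied by a repeating 7-3-1 weight, summed modulo 10. It fits in a dozen lines:

```typescript
// ICAO 9303 check digit over an MRZ field.
function mrzCheckDigit(field: string): number {
  const weights = [7, 3, 1];
  let sum = 0;
  for (let i = 0; i < field.length; i++) {
    const c = field[i];
    let value: number;
    if (c >= "0" && c <= "9") value = c.charCodeAt(0) - 48;      // '0'..'9' -> 0..9
    else if (c >= "A" && c <= "Z") value = c.charCodeAt(0) - 55; // 'A'..'Z' -> 10..35
    else value = 0;                                              // '<' filler -> 0
    sum += value * weights[i % 3];
  }
  return sum % 10;
}
```

Run it on the document number from ICAO's published sample passport, "L898902C3", and you get 6, exactly the digit printed in the specimen's MRZ. Change any character and the digit breaks.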
Image Processing: Sharp
Every image that comes in gets preprocessed before OCR: resize, normalize, deskew. Sharp is the fastest image processing library in Node.js (it uses libvips under the hood). It handles this in milliseconds.
Tamper Detection: ELA + Entropy + FFT
This is gate 2 — before we even try to match the face or validate the barcode, we check if the document itself has been digitally altered. We run three independent analyses using Sharp:
- Error Level Analysis (ELA) — Re-compresses the image and compares it to the original. Photoshopped regions compress differently than untouched ones, creating visible "hot spots" where edits were made.
- Entropy Analysis — Measures information density across the image. Pasted-in text or photos have different entropy patterns than surrounding areas.
- FFT Spectral Analysis — Runs a frequency-domain check. Digital edits introduce artifacts in the frequency spectrum that aren't present in photos taken directly from a camera sensor.
Each analysis produces a score, and they're combined with fixed thresholds. A document that passes all three has strong evidence of being an unaltered photo of a real document.
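To make the entropy analysis concrete: Shannon entropy over a byte histogram gives a number between 0 (perfectly uniform region, like flat pasted-in color) and 8 bits (maximally varied data). This is a generic sketch of the measure itself, not Idswyft's exact tamper scorer, which compares entropy across regions of the image.

```typescript
// Shannon entropy in bits per byte over a buffer's byte histogram.
function shannonEntropy(bytes: Uint8Array): number {
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;
  let entropy = 0;
  for (const count of counts) {
    if (count === 0) continue;
    const p = count / bytes.length;
    entropy -= p * Math.log2(p); // each symbol contributes -p·log2(p)
  }
  return entropy;
}
```

A patch of pasted-in text sitting at entropy 2 inside a photo region averaging 7 is exactly the kind of discontinuity this analysis flags.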
Webhooks: HMAC-SHA256 Signed
When a verification completes, we notify the developer's server via webhook. Each webhook is signed with HMAC-SHA256 so the developer can verify it actually came from us and wasn't tampered with.
The important pattern: webhooks fire AFTER we send the API response, never before. If a developer's webhook endpoint is down, that's what retries are for (3 attempts with exponential backoff). A slow webhook should never make our API slow.
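Verifying a signed webhook on the receiving side is a few lines with Node's built-in crypto module. The payload shape here is made up; the signing scheme (HMAC-SHA256 over the raw body, compared timing-safely) is the standard pattern the article describes.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign the raw webhook body with the shared secret.
function signWebhook(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Recompute the signature and compare in constant time.
function verifyWebhook(payload: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(signWebhook(payload, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Note the comparison uses `timingSafeEqual`, not `===`: a plain string compare leaks how many leading bytes matched, which an attacker can exploit to forge signatures byte by byte.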
Deployment: Two Editions
Idswyft ships in two editions — same codebase, different packaging:
Community Edition (self-hosted) — You run it on your own servers. One command (./install.sh) pulls 4 Docker containers (Postgres, ML engine, API, frontend) and you're running in about 2 minutes. Free forever, unlimited verifications, no usage caps. Your data stays on your infrastructure — nothing phones home. This is the version on GitHub.
Cloud Edition (idswyft.app) — A managed service for developers who don't want to deal with infrastructure. We host everything — same pipeline, same accuracy, same deterministic decisions — but you just call our API. Comes with a developer portal, usage dashboard, and webhook management. Starter tier is free (50 verifications/month), paid tiers for higher volume.
The key difference: Community gives you total control and zero cost. Cloud gives you zero ops burden. Both use the exact same verification engine — if it passes on cloud, it would pass on self-hosted, and vice versa.
The Architecture: Why 4 Containers?
          +--------------+
          |   Frontend   |  (React, Vite, Nginx)
          |   port 80    |
          +------+-------+
                 |
          +------+-------+
          |     API      |  (Express, TypeScript)
          |  port 3001   |  ~250 MB image
          +--+-------+---+
             |       |
   +---------+       +---------+
   |                           |
+--+-----------+      +--------+---+
|    Engine    |      |  Postgres  |
|  port 3002   |      |  port 5432 |
|   ~1.5 GB    |      +------------+
+--------------+
 (TensorFlow, PaddleOCR,
  ONNX, face-api, sharp)
The Engine is a separate container because all the ML libraries (TensorFlow, PaddleOCR, ONNX Runtime, face-api) are HEAVY — they need native binaries, GPU support, and they balloon the Docker image to ~1.5 GB.
By extracting them into their own container, the API stays tiny (~250 MB) and fast to deploy. The API calls the Engine over HTTP when it needs to run OCR or face detection. In local development (without Docker), the API falls back to running the ML code directly.
This also means you can scale them independently — if OCR is your bottleneck, spin up more Engine containers without touching the API.
International Support
Idswyft handles IDs from 19 countries. Not just "we accept the image" — each country has its own format registry:
- US/CA: PDF417 barcodes on the back
- EU countries: MRZ zones (TD1 for ID cards, TD3 for passports)
- AU/NZ: Different barcode formats
When a user selects their country, the backend loads the right format rules. OCR extraction, barcode parsing, and cross-validation all adapt automatically.
MRZ supports all three ICAO standard formats:
- TD1 (3 lines, 30 chars) — national ID cards
- TD2 (2 lines, 36 chars) — older format IDs
- TD3 (2 lines, 44 chars) — passports
We benchmark international documents too — tested against specimens from the MIDV-500 dataset (Albanian national ID, German driver's license, US passport card). 68.2% field accuracy with 100% on dates and nationality fields. The German DL parser handles EU field numbering conventions (1., 2., 3., etc.), and the US passport card extracts all 6 fields correctly.
Security Stuff (The Important Part)
Identity verification handles the most sensitive data imaginable — passport photos, face scans, personal details. Security isn't optional.
Here's what we do:
- API keys are hashed with HMAC-SHA256 before storage. We never store the raw key. Lookup is by prefix, comparison is timing-safe (prevents timing attacks).
- Documents are encrypted at rest with AES-256.
- Face embeddings (the 128-number fingerprints) are computed on your server, never sent anywhere.
- Rate limiting prevents abuse: 1000 requests/hour per developer, 5 verifications/day per end user.
- Data retention is automated: documents are deleted after 90 days (configurable).
- GDPR/CCPA compliant: deletion endpoints, data access reports, consent tracking.
One lesson I learned the hard way: never silently auto-pass a security check. Early on, when face-api couldn't extract an embedding from a tiny ID photo, the system just gave it a perfect score of 1.0. That meant anyone could pass verification for someone else's ID if the photo was too small. We fixed it to route those cases to manual review instead. The principle: if a check can't run, flag it — don't skip it.
Developer Experience
I wanted integration to take less than 30 minutes. Here's what it looks like:
import { Idswyft } from '@idswyft/sdk';
const client = new Idswyft({ apiKey: 'sk_live_...' });
// 1. Start verification session
const session = await client.startVerification({ user_id: 'user-123' });
// 2. Upload front → runs OCR + tamper detection
await client.uploadFrontDocument(session.verification_id, frontImage);
// 3. Upload back → decodes barcode/MRZ + auto cross-validates against front
await client.uploadBackDocument(session.verification_id, backImage);
// 4. Live capture → runs liveness check + deepfake detection + face match
const result = await client.uploadLiveCapture(session.verification_id, liveFrame);
// result.status === 'verified' | 'failed' | 'manual_review'
4 function calls. Each one triggers the relevant gates automatically — you don't need to orchestrate the pipeline yourself.
Don't want to use the SDK? Just redirect users to the hosted verification page — it has a mobile-first UI with QR code handoff (start on desktop, finish on phone), guided camera capture, and active liveness challenges.
SDKs available for JavaScript (npm: @idswyft/sdk) and Python (PyPI: idswyft). Both are under 500 lines.
There's also a sandbox mode with relaxed thresholds so you can test the full flow without real documents.
Lessons I Learned Building This
OCR is way harder than face matching. Face embedding computation takes ~200ms. OCR on a driver's license? 780ms on average. And reading the text is the easy part — the hard part is structured extraction: figuring out which text is the name, which is the address, which is the DL number. I benchmarked against all 50 US states and had to build 15+ regex patterns, a two-strategy extraction system (labeled lines + fallback regex scan), AAMVA field code parsing, and state-specific format handlers. Address extraction alone went from 42% to 84% accuracy through six distinct fixes. Building a pluggable provider system early (so I could swap OCR engines without rewriting the verification logic) was one of the best decisions I made.
Benchmarking changes everything. Before I built the 50-state benchmark suite, I was guessing at what was broken. After, I could see exactly which states failed and why — a blanket looksLikeDate() filter was rejecting valid 8-digit DL numbers, AAMVA field labels were bleeding into extracted values, hyphenated DL numbers were being skipped by date-line filters. Three focused improvement rounds took overall accuracy from 63.5% to 77.2%. Data-driven iteration beats intuition every time.
Cross-validation is the real fraud catcher. A tampered ID can score 0.95 OCR confidence — the text looks perfect. But if the name on the front doesn't match the name encoded in the barcode on the back, that's a hard signal. The barcode is much harder to modify than the printed text.
Webhooks MUST be fire-and-forget. I learned this one the painful way. If webhook delivery happens before your API response, and the developer's server is slow, your users are sitting there waiting for a spinner. Send the response first. Fire webhooks after. If they fail, that's what retries are for.
Mobile cameras are weird. The color reflection anti-spoofing technique that works great in research papers completely fell apart on real mobile phones. The camera sensor doesn't capture what's on screen — it captures raw light. We had to pivot to an entirely different approach based on geometry and timing.
Every state is a special snowflake. I knew US driver's licenses varied by state, but I didn't appreciate HOW much until I saw the data. Florida uses hyphenated letter-digit combos (D123-456-83-789-0). Michigan spaces out digits after a letter prefix (S 000 123 456 789). Alabama is just 8 raw digits — which looks exactly like a date in MMDDYYYY format. Maryland prefixes with MD-. North Dakota uses SAM-67-1234. You can't build one regex to rule them all. You need a layered extraction system that tries specific patterns first and falls back gracefully.
Try It
- Live demo: idswyft.app/demo
- Docs: idswyft.app/doc
- GitHub: github.com/team-idswyft/idswyft
Self-host:
curl -fsSL https://raw.githubusercontent.com/team-idswyft/idswyft/main/install.sh | bash
It's open-source, self-hostable, and I built it because I think identity verification shouldn't require a $50K/year vendor contract. If you're building something that needs ID verification, I'd love to hear what you think.