NPCI & UPI — A Practical, End‑to‑End System Design
A user‑friendly, production‑grade blueprint that maps your notes to how UPI actually works on NPCI rails—complete with flows (P2P, P2M, refunds, mandates), APIs, data models, security, SRE, and checklists you can take to an architecture review.
Table of Contents
0) What is NPCI and where UPI fits
1) Problem scope & actors
2) High-level architecture (NPCI-style)
3) Clean service boundaries
4) Data storage design
5) Messaging (Kafka topics & patterns)
6) Authentication, authorization, and gateway posture
7) Core UPI flows (step by step)
- A) Registration & device binding
- B) P2P/P2M Push (intent/pay)
- C) Collect (Payer approves)
- D) QR flows
- E) Refunds
- F) Mandates (Autopay/e-mandate)
- G) Reconciliation & reversals
8) Transaction state machine
9) Security & compliance (NPCI-aligned)
10) Observability & SLOs
11) Scalability & resilience patterns
12) Developer experience & CI/CD
13) Public API sketches (mobile & merchant)
14) Risk controls & rate limits (practical defaults)
15) Testing strategy
16) Operational playbooks (quick recipes)
17) Minimal DDL & TTL cheatsheet
18) ASCII sequence diagrams
19) NPCI-specific glossary (quick reference)
20) Ready-to-build backlog (in order)
Final note
0) What is NPCI and where UPI fits
NPCI (National Payments Corporation of India) is the not‑for‑profit body that operates national retail payment systems: UPI, IMPS, AEPS, NETC, RuPay, etc. For UPI specifically, NPCI operates the central UPI switch, together with the address mapper and the clearing & settlement rails. Between them, these resolve payment addresses, route interbank UPI transactions and their authorization results, and settle the resulting positions between banks.
Key responsibilities of NPCI for UPI
- Directory & addressing: Manages resolution of VPAs (like alice@bank) to underlying bank accounts.
- Interbank routing: Orchestrates debit/credit between Issuer (payer’s bank) and Acquirer/PSP (payee’s side).
- Authorization & responses: Ensures messages adhere to UPI specs; returns result codes and RRN (reference number).
- Clearing/Settlement: Net settlement between banks via RBI; handles T+0/T+1 reports; supports reversals.
- Rules & compliance: Message standards, security baselines, dispute SLAs, fraud monitoring.
Your role (PSP/TPAP): You build the apps and platform that front users/merchants and talk to NPCI/banks through certified connectors.
1) Problem scope & actors
- Users: Payers/Payees using mobile/web apps.
- Merchants: Accept UPI on web/POS/app (P2M and collect).
- PSP/TPAP: Your platform presenting UPI to customers.
- Banks:
  - Issuer bank: Holds the payer’s account (debit side).
  - Acquirer/beneficiary bank: Holds payee’s account (credit side).
- NPCI UPI Switch: Interbank routing + clearing/settlement.
2) High‑level architecture (NPCI‑style)
                 [ Mobile App ] ─┐   ┌─ [ Merchant Web/App ]
                 [ Web App ]    ─┤   └─ [ POS/QR ]
                                 │
                        [ CDN (static JS) ]
                                 │
                          [ API Gateway ]  ← WAF, rate limits, mTLS (merchant)
                                 │
                            [ AuthN/Z ]  ← OAuth 2.0/JWT, RBAC, device binding
                                 │
        ┌────────────────┬───────────────┬──────────────────┐
        │                │               │                  │
[Transaction Svc]  [Account Svc]  [Fraud/Risk Svc]  [Notification Svc]
        │                │               │                  │
        └───────┬────────┴───────┬───────┴─────────┬────────┘
                │                │                 │
            [ Kafka ]     [ Redis Cache ]  [ Config Service ]
                │                │                 │
         ┌──────┴─────┐     ┌────┴─────┐      ┌────┴──────┐
         │ UPI/NPCI   │     │ MySQL    │      │ NoSQL     │
         │ Adapter    │     │ (users,  │      │ (tx, logs,│
         │ (PSP API)  │     │ accounts)│      │ events)   │
         └──────┬─────┘     └────┬─────┘      └────┬──────┘
                │                │                 │
       [ NPCI UPI Switch ]       │     [ ELK / Grafana / Prom ]
                │                │
   [ Issuer / Acquirer Banks ]
Infra notes: LB (NLB/ALB) → API Gateway; service discovery (K8s DNS); blue/green or canary; IaC with Terraform; secrets in Vault/KMS; multi‑AZ plus DR.
3) Clean service boundaries
- Transaction Service
  - Orchestrates UPI flows: push-pay (P2P/P2M), collect, refunds, mandates.
  - State machine + idempotency + retries + reconciliation hooks.
  - Emits domain events to Kafka; durable outbox pattern.
- Account Service
  - VPA lifecycle, device binding, account linking.
  - Balance inquiry, mandate metadata, UPI Lite wallets if applicable.
- UPI Adapter (Anti‑corruption layer)
  - Encapsulates NPCI/bank APIs, schemas, crypto, client certs, signing (HSM/KMS).
  - Bank‑specific connectors behind a single interface; resilience/circuit breakers.
- Fraud/Risk Service
  - Velocity limits, device fingerprinting, rules + ML.
  - Step‑up auth decisions (PIN/biometric/OTP) via policy.
- Notification Service
  - Push/SMS/email/WhatsApp, merchant webhooks.
  - Outbox + dedup + exponential backoff; signed webhooks (JWS).
- Reconciliation & Settlement Service
  - Ingests NPCI/bank reports; auto‑reversal workflows; finance ops dashboards.
- Merchant Service
  - Onboarding/KYC, keys/webhooks, settlement configs, invoices, dispute portal.
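The "exponential backoff" listed for the Notification Service could look roughly like the sketch below. It uses the "full jitter" variant (a random delay between zero and an exponentially growing ceiling), which is a choice made here for illustration rather than something this design prescribes. The function name `backoff_delays` and the `base`/`cap` values are hypothetical.

```python
import random
from typing import Callable, List


def backoff_delays(
    max_attempts: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> List[float]:
    """Return one delay in seconds per retry attempt, using full jitter.

    Assumption: base and cap are illustrative defaults, not NPCI-mandated.
    """
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 1, 2, 4, ... then capped
        delays.append(rng() * ceiling)             # random point below the ceiling
    return delays
```

Passing a fixed `rng` makes the schedule deterministic in tests; in production the default random source spreads retries so that many failed webhooks do not retry at the same instant.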
4) Data storage design
Relational (MySQL)
- users(id, phone, email, kyc_status, created_at, …)
- devices(id, user_id, device_hash, bound_at, risk_score)
- vpa(id, user_id, handle, is_default, status, bank_ref)
- accounts(id, user_id, bank_ifsc, account_ref_token, masked_number, status)
- mandates(id, user_id, umn, type, start_date, end_date, max_amount, status)
- merchants(id, name, kyc_tier, webhook_url, settlement_bank, status)
- refunds(id, tx_id, rrn, amount, reason, status)
- api_clients(id, client_id, scopes, rate_limit, jwk_set)
Index on user_id, handle, rrn, merchant_id; unique(handle), unique(rrn).
NoSQL (event sourced)
transactions (snapshot) and events (append‑only), audit_logs (immutable, PII‑safe).
{
  "txId": "UUID",
  "type": "P2P|P2M|COLLECT|REFUND|MANDATE",
  "payer": {"vpa": "a@psp", "deviceId": "..."},
  "payee": {"vpa": "b@bank", "merchantId": "..."},
  "amount": 5499,
  "currency": "INR",
  "state": "CREATED|PENDING|AUTHORIZED|SUCCESS|FAILED|REVERSED|TIMEOUT",
  "rrn": "NPCI_RRN",
  "npci": {"msgId": "...", "respCode": "...", "ts": "..."},
  "risk": {"score": 0.12, "rules": ["..."]},
  "ids": {"idempotencyKey": "...", "merchantOrderId": "..."},
  "timestamps": {"created": "...", "updated": "..."}
}
Cache (Redis)
- sess:<jti> session; tx:<txId> hot state.
- risk:vel:<userId> counters; idemp:<merchantId>:<key> results for POST idempotency.
- vpa:resolve:<handle> short‑TTL VPA resolution.
5) Messaging (Kafka topics & patterns)
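- Topics: tx.created, tx.authorized, tx.success, tx.failed, tx.timeout, tx.refund.*, notify.*, recon.*, risk.*.
- Pattern: Outbox → Kafka. The outbox guarantees at‑least‑once publishing, so consumers must be idempotent (dedupe on txId/event id) to get an effectively‑once result; consumer groups for notifications, analytics, recon, risk.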
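A minimal sketch of the outbox pattern named above, using SQLite in place of the real transactional store and a plain callback in place of a Kafka producer. Table names, column names, and the functions `open_db`, `record_success`, and `relay` are all hypothetical. The point it illustrates: the state change and the outbox row commit in one transaction, and a relay that crashes between publishing and marking a row will publish it again, which is why consumers need to deduplicate.

```python
import json
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a tx table and an outbox table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tx (tx_id TEXT PRIMARY KEY, state TEXT)")
    db.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
    )
    return db


def record_success(db: sqlite3.Connection, tx_id: str) -> None:
    # Both inserts commit together, or neither does.
    with db:
        db.execute("INSERT INTO tx VALUES (?, 'SUCCESS')", (tx_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("tx.success", json.dumps({"txId": tx_id})),
        )


def relay(db: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish unsent rows in id order; mark each sent only after publish returns."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # a crash right after this line means a re-publish later
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```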
6) Authentication, authorization, and gateway posture
- OAuth 2.0 / JWT for mobile & merchant APIs, short TTL access tokens; rotating refresh tokens.
- mTLS for merchant server‑to‑server calls; WAF in front of gateway; IP allowlists for NPCI/banks.
- RBAC roles: user, merchant_admin, ops, recon_analyst.
- Rate limiting: per client_id + per VPA + per device (leaky bucket or token bucket).
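As a sketch of the token bucket option mentioned in the rate limiting bullet, the class below refills tokens continuously and spends one per request. It is in-memory and single-process; a gateway would keep one bucket per client_id, VPA, or device in shared storage. The class name, parameters, and injectable `clock` are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_sec` tokens/second."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._capacity = capacity
        self._refill = refill_per_sec
        self._clock = clock
        self._tokens = capacity          # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Add the tokens earned since the last call, never exceeding capacity.
        earned = (now - self._last) * self._refill
        self._tokens = min(self._capacity, self._tokens + earned)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```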
7) Core UPI flows (step by step)
A) Registration & device binding
- App login (OTP/biometric/PIN).
- Create/link VPA, fetch linked accounts (per bank SDK if required).
- Bind device fingerprint to account/VPA; seed risk baseline.
B) P2P/P2M Push (intent/pay) — real‑time
- Client → TransactionSvc /payments (amount, payee VPA/QR, note, idempotencyKey).
- Risk check; if needed, step‑up (PIN/biometric).
- TxSvc → UPI Adapter → NPCI → Issuer (debit) → NPCI → Acquirer (credit).
- Success → RRN returned; update state to SUCCESS; notify both parties.
- Persist idempotent response keyed by idempotencyKey.
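The last step, persisting the response under the idempotency key, can be sketched as follows. This is an in-memory stand-in for the Redis `idemp:<merchantId>:<key>` entries described in the cache section; `IdempotencyStore` and `run_once` are hypothetical names, and the sketch is not safe under concurrent requests.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for Redis keys shaped like idemp:<merchantId>:<key>."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}  # key -> (expires_at, response)

    def run_once(self, merchant_id: str, key: str, handler: Callable[[], dict]) -> dict:
        redis_key = f"idemp:{merchant_id}:{key}"
        now = self._clock()
        hit = self._entries.get(redis_key)
        if hit is not None and hit[0] > now:
            return hit[1]                 # replay: return the stored response
        # Assumption: a real store would reserve the key atomically before
        # running the handler, so two concurrent retries cannot both execute.
        response = handler()
        self._entries[redis_key] = (now + self._ttl, response)
        return response
```

A retried POST with the same key then returns the first response without calling the payment handler again, until the TTL expires.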
C) Collect (Payer approves)
- Merchant → MerchantSvc /collect creates a request.
- TxSvc pushes approval notification to payer.
- Payer approves (PIN/biometric) → process as push → webhook to merchant.
D) QR flows
- Static QR: VPA embedded; user enters amount → push flow.
- Dynamic QR: includes amount & order → push flow with idempotency.
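To make the static versus dynamic distinction concrete, here is a sketch that builds and parses the payload a QR code would carry. It assumes the commonly documented `upi://pay` deep-link format with parameters `pa` (payee VPA), `pn` (payee name), `am` (amount), `cu` (currency), and `tr` (transaction/order reference); verify the parameter set against the current NPCI linking specification before relying on it. The helper names are hypothetical.

```python
from decimal import Decimal
from typing import Dict, Optional
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_upi_uri(
    payee_vpa: str,
    payee_name: str,
    amount: Optional[Decimal] = None,
    order_ref: Optional[str] = None,
) -> str:
    """Static QR when amount is None; dynamic QR when amount/order are set."""
    params = {"pa": payee_vpa, "pn": payee_name, "cu": "INR"}
    if amount is not None:
        params["am"] = f"{amount:.2f}"   # rupees with two decimals
    if order_ref is not None:
        params["tr"] = order_ref
    # quote (not quote_plus) so spaces become %20; keep "@" readable in the VPA.
    return "upi://pay?" + urlencode(params, quote_via=quote, safe="@")


def parse_upi_uri(uri: str) -> Dict[str, str]:
    parts = urlsplit(uri)
    if parts.scheme != "upi" or parts.netloc != "pay":
        raise ValueError("not a upi://pay URI")
    return {name: values[0] for name, values in parse_qs(parts.query).items()}
```

With a static QR the parsed result has no `am`, so the app asks the user for the amount; with a dynamic QR the `tr` value is a natural candidate for the idempotency key.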
E) Refunds
- Merchant → /refunds with original RRN/txId and amount ≤ original.
- Adapter calls NPCI refund; state REFUND_PROCESSED on success; notify.
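The "amount ≤ original" rule can be sketched as a small guard. One assumption goes beyond the text: because partial refunds are allowed, the check here is cumulative, so several partial refunds together cannot exceed the original amount. Amounts are integers in the same unit as the transaction record; `validate_refund` is a hypothetical name.

```python
def validate_refund(original_amount: int, already_refunded: int, requested: int) -> int:
    """Return the refundable balance left after this refund, or raise ValueError."""
    if requested <= 0:
        raise ValueError("refund amount must be positive")
    remaining = original_amount - already_refunded
    if requested > remaining:
        raise ValueError(f"refund {requested} exceeds refundable balance {remaining}")
    return remaining - requested
```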
F) Mandates (Autopay/e‑mandate)
- Merchant creates UMN with schedule and cap; user approves once.
- Scheduler triggers debits on due dates; risk rules still apply; notify success/failure.
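Before the scheduler triggers a debit, it would typically check the mandate's cap and validity window. The sketch below mirrors the columns of the `mandates` table (umn, start_date, end_date, max_amount, status); the `"ACTIVE"` status value and the function name `can_debit` are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mandate:
    umn: str
    start_date: date
    end_date: date
    max_amount: int
    status: str  # assumed values such as "ACTIVE", "PAUSED", "REVOKED"


def can_debit(mandate: Mandate, amount: int, on: date) -> bool:
    """True only for an active mandate, inside its window, within its cap."""
    return (
        mandate.status == "ACTIVE"
        and mandate.start_date <= on <= mandate.end_date
        and 0 < amount <= mandate.max_amount
    )
```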
G) Reconciliation & reversals
- ReconSvc ingests NPCI/Bank reports (T+0/T+1).
- Mismatches → auto‑reversal or manual queue; dashboards for ops.
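The matching step can be sketched as a comparison of the internal ledger against a parsed report, both keyed by RRN. The real report layout is not described here, so the `Entry` shape (state plus amount) and the result bucket names are hypothetical; the output is what would feed the auto-reversal or manual queue.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    state: str
    amount: int


def reconcile(ledger: Dict[str, Entry], report: Dict[str, Entry]) -> Dict[str, List[str]]:
    """Classify every RRN as matched, mismatched, or missing on one side."""
    result: Dict[str, List[str]] = {
        "matched": [],
        "mismatched": [],
        "missing_in_report": [],
        "missing_in_ledger": [],
    }
    for rrn, ours in ledger.items():
        theirs = report.get(rrn)
        if theirs is None:
            result["missing_in_report"].append(rrn)
        elif theirs == ours:
            result["matched"].append(rrn)
        else:
            result["mismatched"].append(rrn)  # state or amount differs
    result["missing_in_ledger"] = sorted(set(report) - set(ledger))
    return result
```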
8) Transaction state machine
CREATED → PENDING → AUTHORIZED → SUCCESS
\→ FAILED
\→ TIMEOUT (→ async reversal if mandated)
Rules: strict idempotency (RRN + bank tokens), bounded retries (exp backoff), timeouts with resume on callback or status query.
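One way to enforce the state machine is an explicit transition table. The diagram does not pin down exactly which states the FAILED and TIMEOUT branches leave from, so the table below is an assumption: it allows them from both PENDING and AUTHORIZED, lets a TIMEOUT resolve later via callback or status query, and includes REVERSED from the transaction record's state list.

```python
from enum import Enum
from typing import Dict, Set


class TxState(str, Enum):
    CREATED = "CREATED"
    PENDING = "PENDING"
    AUTHORIZED = "AUTHORIZED"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    TIMEOUT = "TIMEOUT"
    REVERSED = "REVERSED"


# Assumed transition table; adjust the branch points to the real spec.
ALLOWED: Dict[TxState, Set[TxState]] = {
    TxState.CREATED: {TxState.PENDING, TxState.FAILED},
    TxState.PENDING: {TxState.AUTHORIZED, TxState.FAILED, TxState.TIMEOUT},
    TxState.AUTHORIZED: {TxState.SUCCESS, TxState.FAILED, TxState.TIMEOUT},
    TxState.TIMEOUT: {TxState.SUCCESS, TxState.FAILED, TxState.REVERSED},
    TxState.SUCCESS: {TxState.REVERSED},
    TxState.FAILED: set(),
    TxState.REVERSED: set(),
}


def transition(current: TxState, target: TxState) -> TxState:
    """Return the new state, treating a repeated callback as a no-op."""
    if target == current:
        return current
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions at this layer means a late or duplicated NPCI callback cannot move a FAILED transaction back to SUCCESS.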
9) Security & compliance (NPCI‑aligned)
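- Device biometrics never leave the secure element; only the verification result passes. The UPI PIN is captured and encrypted on the device (via the NPCI‑provided library) for validation by the issuer bank, so the PSP app and backend never see it in the clear.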
- HSM/KMS for signing keys, certs; encrypt data at rest (DB/NoSQL/Logs) and in transit (TLS 1.2+).
- PII minimization: store tokens/masks; avoid raw account numbers.
- Secret management: Vault/SSM; zero secrets in images.
- Fraud controls: device reputation, geo‑velocity, MCC tiers, high‑value step‑ups.
- Audit: immutable logs with correlation IDs; privacy redaction.
10) Observability & SLOs
- Prometheus: p95/p99 latencies by connector; error rates; queue lags.
- Grafana: business KPIs (success rate, step‑up rate, approval rate).
- ELK: structured logs; trace by x-corr-id.
Targets
- Auth p95 < 200 ms
- UPI pay p95 < 1.2 s (excluding user auth step)
- Availability ≥ 99.95% (multi‑AZ)
- Kafka durability: in‑sync replicas (ISR) ≥ 3, spread across AZs
11) Scalability & resilience patterns
- Stateless scale‑out for services; DB read replicas; NoSQL sharded by txId.
- Redis HA (Cluster + AOF); TTLs: sessions 30m; idempotency 24–48h; risk windows.
- Circuit breakers per bank connector; adaptive rate control by queue depth.
- Multi‑AZ + DR (RPO ≤ 15m, RTO ≤ 1h); chaos drills for NPCI/bank/Redis outages.
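The per-connector circuit breaker in the list above can be sketched as a small wrapper: after a number of consecutive failures it rejects calls outright, then lets a single probe through once a cool-down has passed. The thresholds, the `CircuitOpenError` type, and the injectable `clock` are assumptions for illustration; one instance per bank connector is implied.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a connector whose circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("connector circuit is open")
            # Cool-down elapsed: half-open, this call acts as the probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        self._failures = 0
        self._opened_at = None                   # success closes the circuit
        return result
```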
12) Developer experience & CI/CD
- Terraform for infra; per‑env workspaces.
- Central config (Spring Cloud Config/Consul) + Vault‑managed secrets.
- Pipelines: static analysis, unit/contract tests, ephemeral envs, canaries.
- Feature flags to roll new bank connectors safely.
13) Public API sketches (mobile & merchant)
POST /v1/payments
{
  "idempotencyKey": "c7e1-...",
  "payer": {"vpa": "alice@psp", "deviceId": "..."},
  "payee": {"vpa": "bob@bank"},
  "amount": 5499,
  "note": "Lunch",
  "auth": {"type": "PIN", "proof": "<opaque>"}
}
201
{ "txId":"...","state":"SUCCESS","rrn":"2345...","ts":"..." }
POST /v1/collect
{ "merchantOrderId":"ORD-123","payeeVpa":"shop@psp","payerVpa":"alice@psp","amount":9999,"note":"Checkout" }
GET /v1/transactions/{txId}
Returns current state (poll‑safe).
POST /v1/refunds
{ "originalTxId":"...","amount":5000,"reason":"PARTIAL_REFUND" }
Merchant webhooks
- /v1/webhooks/tx with JWS signature headers; retries with exponential backoff.
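A sketch of signing and verifying a webhook body as a compact JWS, using only the standard library. It uses the HS256 algorithm (HMAC-SHA256 with a shared secret) purely to stay dependency-free; that is an assumption, and a production design would more likely use an asymmetric algorithm such as RS256 or ES256 through a vetted JOSE library so merchants hold only a public key. Function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url without padding, as compact JWS requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_webhook(payload: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256"}, separators=(",", ":")).encode())
    body = _b64url(json.dumps(payload, separators=(",", ":")).encode())
    mac = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256)
    return f"{header}.{body}.{_b64url(mac.digest())}"


def verify_webhook(token: str, secret: bytes) -> dict:
    """Return the payload if the signature checks out, else raise ValueError."""
    header, body, signature = token.split(".")
    if json.loads(_b64url_decode(header)).get("alg") != "HS256":
        raise ValueError("unexpected alg")
    mac = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(_b64url(mac.digest()), signature):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(body))
```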
14) Risk controls & rate limits (practical defaults)
- Per user/device/VPA velocity buckets in Redis (1m/1h/1d).
- Hard caps per mandate & per merchant category (MCC).
- ML features: time‑of‑day, device age, shared‑counterparty graph, dispute feedback.
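The 1m/1h/1d velocity buckets can be sketched with fixed-window counters. This in-memory version stands in for Redis counters with a per-window expiry; the key layout (`risk:vel:<subject>:<window>:<bucket>`) extends the `risk:vel:<userId>` prefix used earlier and is an assumption, as are the limits in any usage. Expired buckets are never cleaned up here, which Redis TTLs would handle.

```python
import time
from typing import Callable, Dict

WINDOW_SECONDS = {"1m": 60, "1h": 3600, "1d": 86400}


class VelocityLimiter:
    def __init__(self, limits: Dict[str, int], clock: Callable[[], float] = time.time) -> None:
        self._limits = limits            # e.g. {"1m": 5, "1h": 20, "1d": 50}
        self._clock = clock
        self._counts: Dict[str, int] = {}

    def _keys(self, subject: str) -> Dict[str, str]:
        now = int(self._clock())
        return {
            name: f"risk:vel:{subject}:{name}:{now // WINDOW_SECONDS[name]}"
            for name in self._limits
        }

    def allow(self, subject: str) -> bool:
        """Count the attempt only if every window is still under its limit."""
        keys = self._keys(subject)
        if any(self._counts.get(key, 0) >= self._limits[name] for name, key in keys.items()):
            return False
        for key in keys.values():
            self._counts[key] = self._counts.get(key, 0) + 1
        return True
```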
15) Testing strategy
- Contract tests for NPCI/bank adapters (pacts, golden payloads).
- Simulation harness: NPCI/bank sandboxes with randomized latency/faults.
- Replay: run historical tx events through new risk models.
- Game days: circuit‑breaking banks; rebalancing Kafka; Redis failover.
16) Operational playbooks (quick recipes)
- Stuck PENDING (>5m): trigger status query →
  - no debit → mark FAILED
  - debit/no credit → initiate REVERSAL per NPCI SOP
- Bank connector down: open circuit; queue requests; mark status=DEGRADED; notify merchants; autoswitch if alternate rails exist.
- Fraud spike: tighten velocity; force step‑up for risky cohorts; short‑TTL blacklists; inform merchants.
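The stuck-PENDING recipe can be encoded as a small decision function that a sweeper job might call with the result of the status query. The action names and the handling of an inconclusive query (`None`) are assumptions; the debit and credit branches follow the playbook, and the debit-plus-credit case is the implied success path.

```python
from typing import Optional

STUCK_AFTER_SECONDS = 5 * 60  # the playbook's ">5m" threshold


def resolve_stuck_pending(
    age_seconds: float,
    debited: Optional[bool],
    credited: Optional[bool],
) -> str:
    """Map a status-query result to the next action for a PENDING transaction."""
    if age_seconds <= STUCK_AFTER_SECONDS:
        return "WAIT"
    if debited is None:
        return "RETRY_STATUS_QUERY"   # query was inconclusive
    if not debited:
        return "MARK_FAILED"          # no debit
    if credited is None:
        return "RETRY_STATUS_QUERY"
    if not credited:
        return "INITIATE_REVERSAL"    # debit without credit
    return "MARK_SUCCESS"
```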
17) Minimal DDL & TTL cheatsheet
MySQL
CREATE TABLE users (
  id BIGINT PRIMARY KEY,
  phone VARCHAR(15) UNIQUE,
  email VARCHAR(255),
  kyc_status ENUM('PENDING','VERIFIED','REJECTED'),
  created_at DATETIME,
  updated_at DATETIME
);
CREATE TABLE vpa (
  id BIGINT PRIMARY KEY,
  user_id BIGINT,
  handle VARCHAR(255) UNIQUE,
  status ENUM('ACTIVE','BLOCKED'),
  bank_ref VARCHAR(255),
  INDEX (user_id),
  FOREIGN KEY (user_id) REFERENCES users(id)
);
CREATE TABLE refunds (
  id BIGINT PRIMARY KEY,
  tx_id CHAR(36),
  rrn VARCHAR(30),
  amount BIGINT,
  status ENUM('REQUESTED','PROCESSED','FAILED'),
  created_at DATETIME,
  UNIQUE(rrn)
);
Redis TTLs
- sess:* 30 min; idemp:* 24–48 h; tx:* terminal+1 h; risk:vel:* per window.
18) ASCII sequence diagrams
P2P Push (Intent/Pay)
App → TxSvc: POST /payments (payer, payee, amount, idemp)
TxSvc → Risk: evaluate(device, velocity)
Risk → TxSvc: allow / step-up
TxSvc → UPI Adapter: create pay request
UPI Adapter → NPCI: debit→credit route
NPCI → Issuer: debit
NPCI → Acquirer: credit
NPCI → UPI Adapter: RRN + result
UPI Adapter → TxSvc: success
TxSvc → Kafka: tx.success
TxSvc → Notify: push/SMS both parties
Collect (Payer Approves)
Merchant → MerchantSvc: POST /collect
MerchantSvc → TxSvc: create collect
TxSvc → Notify: push approval to payer
App → TxSvc: approve (PIN)
... then same as push flow ...
TxSvc → Merchant webhook: signed receipt (RRN)
Refund
Merchant → MerchantSvc: POST /refunds (orig RRN)
TxSvc → UPI Adapter → NPCI: refund
NPCI → UPI Adapter: result + RRN
TxSvc: state=REFUND_PROCESSED → notify → webhook
Mandate (Autopay)
Merchant → MerchantSvc: create UMN
App → TxSvc: one-time approval (PIN)
Scheduler → TxSvc (due date): trigger debit
... risk checks ...
TxSvc → UPI Adapter → NPCI: process
→ notify + webhook; failures to retry/notify
19) NPCI‑specific glossary (quick reference)
- VPA (Virtual Payment Address): user@handle, a virtual address that stands in for a bank account.
- RRN (Retrieval Reference Number): unique reference number returned post‑authorisation.
- UMN: Unique Mandate Number for autopay mandates.
- Issuer vs Acquirer: debit bank vs credit bank.
- Collect: pull request that payer authorizes.
- Reversal: system‑initiated credit back on failures/mismatches.
20) Ready‑to‑build backlog (in order)
- Contract stubs for UPI Adapter (NPCI + 2 pilot banks).
- Tx state machine + idempotent APIs + outbox→Kafka.
- Redis‑backed risk MVP (velocity + rules YAML).
- Merchant webhooks (JWS) + portal for keys.
- Recon file parser + daily dashboards.
- Mobile flows: pay, collect approve, refunds, mandates.
Final note
This design maps cleanly to an NPCI‑compliant UPI platform: clear service boundaries, resilient flows, strong risk posture, and ops playbooks.
More Details:
Get all articles related to system design
Hashtag: SystemDesignWithZeeshanAli
Git: https://github.com/ZeeshanAli-0704/SystemDesignWithZeeshanAli