How We Scaled Our Payment Processing Platform to 50 Million Daily Transactions with Conductor
TL;DR: We ripped out a gnarly homegrown orchestration layer and replaced it with Conductor OSS, validated it locally, then moved to Orkes to handle 50M+ daily credit card transactions across sync and async workflows. MTTR went from 6.5 hours to under 3 minutes. We stopped dreading deployments. And we finally — finally — have one place to debug transaction flows that touch 27+ major services, a bunch of smaller ones, and multiple message queues. This is the story of how we got here and what we learned along the way.
The System We Inherited (and Its Scars)
If you've ever built payment infrastructure, you know the drill. A credit card transaction looks simple from the outside — swipe, approve, done. Under the hood, it's a distributed systems nightmare that will absolutely ruin your weekend.
Our platform processes credit card authorizations, settlements, chargebacks, refunds, and batch reconciliation across multiple acquiring banks, card networks, and downstream financial systems. At peak we're pushing around 2,450 transactions per second. Every single one is someone's money, and every single one needs to be correct.
What We Were Actually Running
Here's the full picture of what our platform looked like before Conductor. This isn't the simplified version you'd put on a whiteboard for a VP — this is the real thing, the 27 major services and the mess of queues and databases connecting them:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PAYMENT PLATFORM (PRE-CONDUCTOR) │
│ │
│ ┌─── INGESTION LAYER ───────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ api-gateway (Go) ← POS terminals, e-commerce, mobile SDKs │ │
│ │ webhook-receiver (Go) ← card network notifications, bank callbacks │ │
│ │ file-ingestion (Java) ← SFTP batch files from banks, clearing files │ │
│ │ merchant-api (Node.js) ← merchant dashboard, refund requests │ │
│ │ │ │
│ └────────────────────┬───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─── REAL-TIME AUTH SERVICES ───────────────────────────────────────────────┐ │
│ │ │ │
│ │ txn-router (Java) — BIN lookup, acquirer routing, load balancing │ │
│ │ fraud-scoring (Python) — TensorFlow model, real-time feature store │ │
│ │ risk-engine (Java) — rule-based checks, velocity limits, blocklists│ │
│ │ threeds-service (Java) — 3D Secure authentication (EU/PSD2) │ │
│ │ tokenization-svc (Go) — PCI-scoped, vault read/write │ │
│ │ iso8583-codec (Java) — message formatting for bank protocols │ │
│ │ bank-gateway (Java) — connections to 6 acquiring banks │ │
│ │ auth-response-svc (Go) — response normalization, auth code gen │ │
│ │ │ │
│ └────────────────────┬───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─── MESSAGE QUEUES ─────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Kafka (42 topics) — txn events, state changes, CDC streams │ │
│ │ RabbitMQ (18 queues) — settlement tasks, notification triggers │ │
│ │ SQS (11 queues) — async processing, DLQs for failed messages │ │
│ │ │ │
│ └──────┬──────────────────┬──────────────────┬───────────────────────────────┘ │
│ ▼ ▼ ▼ │
│ ┌─── SETTLEMENT & CLEARING ─────────────────────────────────────────────────┐ │
│ │ │ │
│ │ capture-service (Java) — batch capture, end-of-day grouping │ │
│ │ settlement-engine (Java) — net settlement calc, bank file generation │ │
│ │ clearing-matcher (Java) — match clearing records to original auths │ │
│ │ recon-service (Java) — 3-way reconciliation (us vs bank vs network) │ │
│ │ funding-calc (Java) — merchant payout calculation, fee deduction │ │
│ │ chargeback-svc (Java) — dispute intake, representment workflow │ │
│ │ │ │
│ └────────────────────┬───────────────────────────────────────────────────────┘ │
│ ▼ │
│ ┌─── DATA & REPORTING ──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ event-publisher (Go) — fan-out to downstream consumers │ │
│ │ etl-pipeline (Python) — data transforms, enrichment, dedup │ │
│ │ report-generator (Python) — merchant reports, PDF/CSV generation │ │
│ │ data-warehouse-loader — Snowflake/Redshift sync │ │
│ │ analytics-api (Node.js) — merchant dashboard data, real-time metrics │ │
│ │ compliance-reporter (Python) — SAR filing, AML reports, reg submissions │ │
│ │ notification-svc (Node.js) — email, SMS, webhook to merchants │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── SUPPORTING SERVICES ───────────────────────────────────────────────────┐ │
│ │ │ │
│ │ config-service — merchant configs, routing rules, fee schedules │ │
│ │ audit-logger — PCI audit trail, access logs │ │
│ │ key-management — encryption key rotation, HSM integration │ │
│ │ health-monitor — service health checks, circuit breaker state │ │
│ │ idempotency-store — Redis-backed dedup for retries │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─── DATA STORES ───────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ PostgreSQL (primary txn DB) Redis (state, cache, idempotency) │ │
│ │ MySQL (legacy merchant DB) Elasticsearch (search, log aggregation) │ │
│ │ S3 (bank files, reports) Snowflake (analytics warehouse) │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
Count them up. 27 major services across 4 languages, 71 message queues and topics across 3 different brokers, 6 data stores. And every service was connected to every other service through a combination of direct HTTP calls, queue-based messaging, database polling, and cron jobs — each with its own retry logic, its own error handling, and its own operational blind spots.
The Homegrown Orchestration Tax
We started, like most teams do, by writing our own orchestration. The txn-router service was supposed to be the brain — it coordinated the auth flow by making synchronous calls to fraud-scoring, risk-engine, and bank-gateway. Settlement was handled by cron jobs in capture-service that polled the database and kicked off downstream processing.
The glue between all of this was:
- RabbitMQ for async tasks like settlement triggers and notification dispatch
- Redis for tracking transaction state across services
- Cron jobs in capture-service and settlement-engine for batch processing
- A custom state machine library one engineer wrote in 2019 inside txn-router (who had since left — no documentation, obviously)
- MySQL tables in recon-service acting as pseudo-workflow state stores
- Kafka for CDC streams and event fan-out to the data layer, bolted on later when the data team needed real-time feeds
It worked. For a while. At around 800K transactions per day, things started cracking. Not catastrophically — just enough to erode trust. A settlement batch in capture-service would fail silently. A refund flowing through chargeback-svc would get stuck in a state that didn't match any of our enum values. A timeout from bank-gateway would cause txn-router to retry and create duplicate authorizations. You know how it goes.
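That duplicate-authorization failure mode is exactly what an idempotency key prevents: derive a stable key from the fields that define "the same request," and collapse retries onto the first execution. Here's a minimal sketch — in-memory, with hypothetical names; our real idempotency-store was Redis-backed — of the idea:

```python
import hashlib

class IdempotencyStore:
    """Minimal in-memory stand-in for a Redis-backed dedup store."""

    def __init__(self):
        self._results = {}

    def key_for(self, merchant_id: str, txn_ref: str, amount_cents: int) -> str:
        # Derive a stable key from the fields that define "the same request".
        raw = f"{merchant_id}:{txn_ref}:{amount_cents}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def run_once(self, key: str, operation):
        # On a retry, return the cached result instead of re-executing.
        if key in self._results:
            return self._results[key], False  # (result, executed_this_time)
        result = operation()
        self._results[key] = result
        return result, True

# Usage: a retried authorization maps to the same key, so the
# underlying bank call runs only once.
store = IdempotencyStore()
calls = []

def authorize():
    calls.append(1)
    return {"auth_code": "A1B2C3"}

key = store.key_for("m-42", "txn-789", 1999)
first, ran_first = store.run_once(key, authorize)
second, ran_second = store.run_once(key, authorize)  # the retry
```

The same discipline applies to the settlement cron job: if the dedup check lives in shared infrastructure rather than inside each job, triggering a batch twice is harmless.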
The worst part wasn't the failures themselves. It was debugging them.
The Debugging Hell of Distributed Payments
When a transaction fails in a monolith, you grep through one log file. When a transaction fails across 27+ microservices, a handful of smaller auxiliary services, 3 message brokers (including Kafka with 42 topics), and 6 databases, you:
- Get paged at 3 AM
- Open Splunk/Datadog/CloudWatch (we had all three, because different teams picked different tools at different times and nobody wanted to migrate)
- Try to find the transaction ID — if you're lucky, it was propagated correctly through all 27 services
- Realize the correlation ID got dropped somewhere between txn-router and settlement-engine because someone forgot to pass headers through a Kafka consumer (we also had RabbitMQ and SQS in the mix, depending on which team built what)
- SSH into 8+ different services to check local logs, hoping the timestamps line up across time zones
- Check Kafka consumer lag in Confluent across all 42 topics to see if something is backed up
- Check SQS dead letter queues — another 11 of them — to see if messages are piling up
- Reconstruct the transaction timeline manually in a Google Doc like some kind of caveman
- Find the actual bug: a race condition in the retry handler inside txn-router
- Fix it, deploy, and pray
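The dropped correlation ID is usually a one-line bug: a consumer that republishes a message without copying its headers. A minimal sketch (plain dicts standing in for Kafka records, hypothetical field names) of the propagation discipline we were missing:

```python
import uuid

def new_message(payload, headers=None):
    headers = dict(headers or {})
    # Only mint a new correlation ID at the edge of the system.
    headers.setdefault("correlation_id", str(uuid.uuid4()))
    return {"headers": headers, "payload": payload}

def republish(incoming, new_payload):
    # The fix: always thread the incoming headers through.
    return new_message(new_payload, headers=incoming["headers"])

def republish_broken(incoming, new_payload):
    # The bug: headers dropped, downstream mints a fresh correlation ID
    # and the transaction timeline breaks in two.
    return new_message(new_payload)

auth = new_message({"txn": "t-1", "amount": 1999})
settled_ok = republish(auth, {"txn": "t-1", "status": "settled"})
settled_bad = republish_broken(auth, {"txn": "t-1", "status": "settled"})
```

Every consumer in every language has to get this right, every time — which is precisely why an orchestrator that tracks execution state centrally removes a whole class of debugging pain.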
Our MTTR for production payment incidents was averaging around 6.5 hours. For a payments company, every minute of downtime has a direct dollar cost. We were burning 120+ engineering hours per month on incident response instead of building anything useful.
And honestly, this was before things got really complicated.
When "Just Process the Transaction" Became a Distributed Saga
Modern payment processing isn't just "authorize and settle." Our transaction lifecycle grew into this beast:
Synchronous Path (< 80ms SLA)
The cardholder is waiting. The merchant terminal is spinning. You have milliseconds, not minutes.
- Receive authorization request — api-gateway accepts the request, tokenization-svc vaults the PAN
- Run fraud detection — fraud-scoring (Python/TensorFlow) scores the transaction in real-time, risk-engine runs rule-based checks in parallel
- Route to processor — txn-router does BIN lookup, figures out which of our 6 acquiring banks handles this card
- Send to bank — iso8583-codec formats the message, bank-gateway sends it over the wire
- Return auth response — auth-response-svc normalizes the bank response, sends it back through api-gateway
This entire chain has to complete in under 80ms. No retries on the critical path. No eventual consistency. The answer has to be right, right now.
Asynchronous Path (minutes to days)
Once the auth response goes back, the real work begins:
- Capture/Settlement — capture-service batches the transaction, settlement-engine generates bank files
- Clearing — clearing-matcher ingests clearing files from Visa/Mastercard via file-ingestion, matches against original auth
- Reconciliation — recon-service runs 3-way reconciliation (our records vs. bank vs. network), flags discrepancies
- Funding — funding-calc computes merchant payouts, deducts fees and reserves
- Reporting — etl-pipeline transforms data, data-warehouse-loader syncs to Snowflake, report-generator builds merchant reports
- Chargeback handling — chargeback-svc manages disputes, triggers representment workflows
- Data hooks — event-publisher fans out to analytics-api, compliance-reporter, notification-svc, CRM, and AML systems
Every one of these steps has its own failure modes, its own retry semantics, its own SLA. A settlement batch runs at midnight and has to complete before the bank's cutoff at 5:30 AM. A chargeback has regulatory deadlines measured in calendar days. A data warehouse sync can be eventually consistent, but the merchant dashboard powered by analytics-api needs accuracy within minutes.
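A step like the midnight settlement run has both retry semantics and a hard deadline, and the two interact: you can only afford as many backoff rounds as fit before the cutoff. A small sketch (hypothetical numbers, not our production config) of computing how many exponential-backoff attempts fit inside a window:

```python
def attempts_before_deadline(window_s: float, attempt_s: float,
                             base_delay_s: float, multiplier: float = 2.0) -> int:
    """How many tries fit in window_s, if each attempt takes attempt_s
    and retry n waits base_delay_s * multiplier**(n-1) before running."""
    elapsed = 0.0
    attempts = 0
    delay = base_delay_s
    while True:
        elapsed += attempt_s
        if elapsed > window_s:
            return attempts
        attempts += 1
        elapsed += delay
        delay *= multiplier

# Example: a 5.5-hour window (midnight to 5:30 AM), each settlement
# attempt taking ~20 minutes, first retry after 10 minutes, doubling.
fits = attempts_before_deadline(window_s=5.5 * 3600,
                                attempt_s=20 * 60,
                                base_delay_s=10 * 60)
```

Run the numbers and you see why "retry forever" is not a policy: with those (made-up) figures only five attempts fit before the cutoff, so the retry budget has to be chosen deliberately per step.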
The Multi-Language Problem
Look at that architecture diagram again. It's a polyglot mess:
- Java (14 services) — txn-router, risk-engine, bank-gateway, iso8583-codec, capture-service, settlement-engine, clearing-matcher, recon-service, funding-calc, chargeback-svc, threeds-service, file-ingestion, config-service, audit-logger
- Python (4 services) — fraud-scoring, etl-pipeline, report-generator, compliance-reporter
- Go (5 services) — api-gateway, webhook-receiver, tokenization-svc, auth-response-svc, event-publisher
- Node.js (3 services) — merchant-api, analytics-api, notification-svc
Our homegrown orchestration lived entirely in Java inside txn-router. Every non-Java service had to interact with it through awkward REST wrappers or by shoving messages onto queues and hoping the Java orchestrator would pick them up. The Python fraud-scoring service couldn't natively participate in a workflow — it was invoked via an HTTP call with custom timeout and retry logic copy-pasted from the orchestration layer. Every time someone asked "can we add a step to the pipeline?" I died a little inside.
Why We Looked at Conductor
Three incidents pushed us over the edge. All within six weeks of each other.
Incident #1: A settlement batch in capture-service failed because bank-gateway returned a new error code from one of our acquiring banks that we'd never seen. The state machine in txn-router didn't have a handler for it, so 47,000 transactions sat in a PROCESSING state forever. Merchants couldn't reconcile their books. Our support queue exploded.
Incident #2: The ML team deployed a new version of the fraud model in fraud-scoring that changed the response schema slightly. The response field moved from score to risk_score. txn-router still expected the old schema. Transactions started failing auth silently. We didn't catch it until a merchant called asking why their approval rate dropped by 40%. That was a bad Tuesday.
Incident #3: An engineer accidentally triggered the production settlement cron job in capture-service twice. Because our idempotency checks lived inside the cron job itself (not in any orchestration layer), we double-settled a batch worth $2.3M. That was a really fun conversation with the bank. Really fun.
After Incident #3, we started evaluating orchestration platforms seriously. We looked at Temporal, Apache Airflow, Step Functions, and Conductor.
Why Conductor Won
We went with Conductor for a few specific reasons:
Separation of workflow definition from execution. Workflow definitions live outside the worker code, so our 27 existing services could become Conductor workers without rewriting their core logic — we just needed to wrap them.
Sync and async in the same engine. Conductor handles both synchronous request-response workflows (sub-80ms SLA in our case) and long-running durable workflows. We didn't need to maintain two separate systems for auth vs. settlement anymore.
Built-in visibility. Every workflow execution has a complete, searchable execution history. No more reconstructing timelines from 6 different logging systems and a dozen Kafka topics while half-asleep at 3 AM.
Language-agnostic SDKs. Java, Python, Go, C#, JavaScript. All 4 of our language stacks could participate natively. The Python team was particularly happy about this.
OSS-first model. We could start with Conductor OSS, validate locally, prove value without a procurement cycle, and move to Orkes Enterprise when we needed production-grade scale, security, and support.
Integrating Our Existing Services with Conductor
This is the part I think most people want to know about: how do you take 27 existing microservices and wire them into Conductor without rewriting everything?
The answer is: you don't rewrite. You annotate.
Java Services — Spring Boot Annotation Approach
Most of our Java services were already Spring Boot apps. The Conductor Java SDK gives you a @WorkerTask annotation that turns any method into a Conductor worker. Here's what we did with fraud-scoring's rule-based counterpart, risk-engine:
Before (called directly by txn-router via HTTP):
@RestController
public class RiskEngineController {

    @PostMapping("/api/v1/risk/evaluate")
    public RiskResult evaluateTransaction(@RequestBody TransactionData txn) {
        VelocityCheck velocity = velocityService.check(txn.getCardFingerprint());
        BlocklistCheck blocklist = blocklistService.check(txn.getMerchantId(), txn.getCardBin());
        RuleResult rules = ruleEngine.evaluate(txn, velocity, blocklist);
        return RiskResult.builder()
                .riskScore(rules.getScore())
                .flags(rules.getFlags())
                .decision(rules.getScore() > 85 ? "DECLINE" : "APPROVE")
                .build();
    }
}
After (same logic, now a Conductor worker):
@Component
public class RiskEngineWorker {

    @WorkerTask(value = "risk_engine_evaluate", threadCount = 8, pollingInterval = 100)
    public RiskResult evaluateTransaction(TransactionData txn) {
        VelocityCheck velocity = velocityService.check(txn.getCardFingerprint());
        BlocklistCheck blocklist = blocklistService.check(txn.getMerchantId(), txn.getCardBin());
        RuleResult rules = ruleEngine.evaluate(txn, velocity, blocklist);
        return RiskResult.builder()
                .riskScore(rules.getScore())
                .flags(rules.getFlags())
                .decision(rules.getScore() > 85 ? "DECLINE" : "APPROVE")
                .build();
    }
}
That's it. Same business logic. Same Spring beans injected. We added one annotation, configured the thread count and polling interval, and risk-engine was now a Conductor worker. The @WorkerTask annotation handles polling the Conductor server for tasks, deserializing inputs, serializing outputs, and reporting task status. We didn't touch the core risk evaluation logic at all.
We did the same for bank-gateway, capture-service, settlement-engine, clearing-matcher, recon-service, funding-calc, and chargeback-svc. Each one took maybe a day to integrate — most of that was writing tests, not changing code.
Python Services — SDK Worker Pattern
Our Python services (fraud-scoring, etl-pipeline, report-generator, compliance-reporter) used the Conductor Python SDK. Here's how we wired up fraud-scoring:
Before (Flask endpoint called by txn-router):
@app.route('/api/v1/fraud/score', methods=['POST'])
def score_transaction():
    txn = request.json
    features = feature_store.get_realtime_features(txn['card_fingerprint'])
    tensor_input = preprocess(txn, features)
    prediction = model.predict(tensor_input)
    return jsonify({
        'fraud_score': float(prediction[0]),
        'fraud_signals': extract_signals(prediction),
        'model_version': MODEL_VERSION
    })
After (Conductor worker — same model, same preprocessing):
from conductor.client.worker.worker_task import worker_task

@worker_task(task_definition_name='fraud_score_transaction')
def score_transaction(txn: dict) -> dict:
    features = feature_store.get_realtime_features(txn['card_fingerprint'])
    tensor_input = preprocess(txn, features)
    prediction = model.predict(tensor_input)
    return {
        'fraud_score': float(prediction[0]),
        'fraud_signals': extract_signals(prediction),
        'model_version': MODEL_VERSION
    }
Same TensorFlow model. Same feature store. Same preprocessing. We swapped a Flask route decorator for a Conductor worker decorator. The fraud-scoring service went from being a fragile HTTP dependency of txn-router to a first-class participant in the workflow with built-in retries, timeout handling, and full execution visibility.
Go Services — Task Runner Pattern
For our Go services (api-gateway, tokenization-svc, auth-response-svc, event-publisher), we used the Conductor Go SDK:
Before (tokenization-svc called directly over gRPC):
func (s *TokenService) TokenizeCard(ctx context.Context, req *pb.TokenizeRequest) (*pb.TokenizeResponse, error) {
    encrypted := s.vault.Encrypt(req.Pan)
    token := s.store.Save(encrypted, req.MerchantId)
    return &pb.TokenizeResponse{Token: token, Last4: req.Pan[len(req.Pan)-4:]}, nil
}
After (same logic, registered as a Conductor worker):
func TokenizeCardWorker(t *model.Task) (interface{}, error) {
    pan := t.InputData["pan"].(string)
    merchantId := t.InputData["merchant_id"].(string)
    encrypted := vault.Encrypt(pan)
    token := store.Save(encrypted, merchantId)
    return map[string]interface{}{
        "token": token,
        "last4": pan[len(pan)-4:],
    }, nil
}

// Registration at startup
func main() {
    c := conductor.NewConductorWorker("https://conductor-server:8080", 8)
    c.RegisterTask("tokenize_card", TokenizeCardWorker)
    c.StartPolling()
}
The pattern was the same everywhere: extract the core business logic from whatever transport layer it was behind (REST, gRPC, queue consumer), wrap it in a Conductor worker, register it. The transport changes, the logic doesn't.
Node.js Services — Same Story
// notification-svc — before (SQS consumer)
sqsConsumer.on('message', async (msg) => {
    const { merchantEmail, reportUrl, type } = JSON.parse(msg.Body);
    if (type === 'report_ready') {
        await sendgrid.send({ to: merchantEmail, template: 'report_ready', data: { reportUrl } });
    }
});
// After (Conductor worker)
const { TaskManager } = require('@io-orkes/conductor-javascript');

const sendNotification = async (task) => {
    const { merchantEmail, reportUrl, notificationType } = task.inputData;
    await sendgrid.send({
        to: merchantEmail,
        template: notificationType,
        data: { reportUrl }
    });
    return { sent: true, timestamp: Date.now() };
};

const taskManager = new TaskManager(conductorClient);
taskManager.registerWorker('send_merchant_notification', sendNotification, { pollInterval: 100 });
taskManager.startPolling();
The key insight here: we didn't throw away any of our existing services. We kept every single one. We just changed how they were invoked — instead of being called directly by other services (creating tight coupling and invisible dependencies), they became Conductor workers that the orchestration engine invokes as part of a defined workflow. The business logic inside each service didn't change at all.
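That extraction pattern — core logic as a plain function, transports as thin adapters — is worth making explicit. A hedged Python sketch (hypothetical names, a fake token generator standing in for the vault) of what "the transport changes, the logic doesn't" looks like:

```python
# Core business logic: no HTTP, no queue, no Conductor — just a function.
def tokenize_card(pan: str, merchant_id: str) -> dict:
    # Stand-in for the real vault call.
    token = f"tok_{hash((pan, merchant_id)) & 0xFFFFFF:06x}"
    return {"token": token, "last4": pan[-4:]}

# Adapter 1: an HTTP-style handler (framework request object elided).
def http_handler(body: dict) -> dict:
    return tokenize_card(body["pan"], body["merchant_id"])

# Adapter 2: a Conductor-style worker (task inputData shape, per the SDKs).
def conductor_worker(task: dict) -> dict:
    inp = task["inputData"]
    return tokenize_card(inp["pan"], inp["merchant_id"])

req = {"pan": "4111111111111111", "merchant_id": "m-42"}
via_http = http_handler(req)
via_worker = conductor_worker({"inputData": req})
```

Both adapters produce identical results because neither owns any logic. That's what made the migration cheap: each service already had (or could be refactored to have) a core function like this, and the Conductor worker was just one more adapter.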
Starting with Conductor OSS: The Proof of Value
We didn't go all-in on day one. That's not how you do things when you're processing people's money. We pulled down Conductor OSS from GitHub (https://github.com/conductor-oss/conductor), stood up a local instance, and migrated one non-critical workflow: our end-of-day merchant reporting pipeline.
It was a good candidate because:
- It was already half-broken (reports were frequently delayed or had missing data)
- No real-time SLA (merchants expect reports by 8 AM, not in milliseconds)
- It touched 5 services (etl-pipeline → report-generator → data-warehouse-loader → notification-svc → compliance-reporter)
- Failure was annoying but not catastrophic
Here's the workflow we built. I'm showing this as a flow diagram because that's basically what you see in the Conductor UI — not walls of JSON:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: merchant_daily_report (v1) │
│ │
│ ┌─────────────┐ │
│ │ START │ │
│ │ merchantId, │ │
│ │ reportDate │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ extract_txns │ Worker: etl-pipeline (Python) │
│ │ │ Pull day's transactions for merchant │
│ │ │ from PostgreSQL + Snowflake │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ run_aggregations │ Worker: etl-pipeline (Python) │
│ │ │ Compute totals, averages, breakdowns │
│ │ │ by card type, currency, status │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ fraud_summary │ Worker: fraud-scoring (Python) │
│ │ │ Flag suspicious patterns, compute │
│ │ │ merchant-level risk score │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────── FORK ────────────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ generate_pdf │ │ push_to_warehouse │ │ │
│ │ │ │ │ │ │ │
│ │ │ Worker: │ │ Worker: │ │ │
│ │ │ report-generator │ │ data-warehouse- │ │ │
│ │ │ (Python) │ │ loader (Python) │ │ │
│ │ │ │ │ │ │ │
│ │ │ Build PDF/CSV, │ │ Sync to Snowflake │ │ │
│ │ │ upload to S3 │ │ for BI dashboards │ │ │
│ │ └──────┬───────────┘ └──────┬───────────┘ │ │
│ │ │ │ │ │
│ └──────────┴───────────────────────┴── JOIN ────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ send_notification │ Worker: notification-svc (Node.js) │
│ │ │ Email merchant with report link │
│ │ │ + push to merchant dashboard │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ END │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Two weeks later, the reporting pipeline was running on Conductor OSS. The immediate wins:
- Failed reports were automatically retried with configurable backoff. No more silent failures that we'd discover when a merchant complained the next morning.
- We could see exactly where a report was stuck — the aggregation step in etl-pipeline? The PDF generator? The notification-svc email step? One click in the Conductor UI and you knew.
- The Python workers (fraud-scoring, etl-pipeline, report-generator) participated natively. No more REST wrapper hacks.
That was enough to convince leadership. We started planning the migration of core payment flows.
Moving to Orkes Enterprise: The Core Payment Flows
Conductor OSS proved the model. But running it in production for 50M+ daily transactions across our full payment stack needed things OSS alone didn't give us.
The Authorization Flow — Sync, Sub-80ms
This is the big one. The flow that runs 2,450 times per second and can't be slow:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: credit_card_authorization (v3) — SYNC │
│ SLA: < 80ms end-to-end │
│ │
│ ┌─────────────┐ │
│ │ START │ ← api-gateway receives auth request │
│ │ pan, amount, │ │
│ │ merchant, │ │
│ │ device_data │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ tokenize_card │ Worker: tokenization-svc (Go) │
│ │ │ Vault PAN, return token + last4 │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────── FORK (parallel) ─────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ fraud_score │ │ risk_evaluate │ │ │
│ │ │ │ │ │ │ │
│ │ │ Worker: │ │ Worker: │ │ │
│ │ │ fraud-scoring │ │ risk-engine │ │ │
│ │ │ (Python/TF) │ │ (Java) │ │ │
│ │ │ │ │ │ │ │
│ │ │ ML model score │ │ Velocity checks, │ │ │
│ │ │ + fraud signals │ │ blocklists, rules │ │ │
│ │ └──────┬───────────┘ └──────┬───────────┘ │ │
│ │ │ │ │ │
│ └──────────┴───────────────────────┴──── JOIN ──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ SWITCH: │ │
│ │ check_region │ │
│ └───┬──────────┬───┘ │
│ EU │ │ non-EU │
│ ▼ │ │
│ ┌──────────────────┐ │ │
│ │ threeds_auth │ │ │
│ │ │ │ │
│ │ Worker: │ │ │
│ │ threeds-service │ │ │
│ │ (Java) │ │ │
│ │ │ │ │
│ │ 3DS challenge │ │ │
│ │ if required │ │ │
│ └──────┬───────────┘ │ │
│ │ │ │
│ └───────┬───────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ route_to_acquirer │ Worker: txn-router (Java) │
│ │ │ BIN lookup → select acquiring bank │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ format_iso8583 │ Worker: iso8583-codec (Java) │
│ │ │ Build ISO 8583 message for bank protocol │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ send_to_bank │ Worker: bank-gateway (Java) │
│ │ │ TLS connection to acquiring bank │
│ │ │ Timeout: 2s, no retry on critical path │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ normalize_response│ Worker: auth-response-svc (Go) │
│ │ │ Normalize bank response → standard format │
│ │ │ Generate auth code if approved │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ RESPONSE │ → back to api-gateway → merchant terminal │
│ │ approved/ │ │
│ │ declined │ ← triggers async settlement workflow │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Notice how fraud-scoring (Python) and risk-engine (Java) run in parallel via the FORK step. Before Conductor, txn-router called them sequentially — first fraud, then risk — burning precious milliseconds of our 80ms budget. Running them in parallel shaved 18ms off our p50 latency, reclaiming over 20% of the entire budget. At 2,450 TPS, that adds up fast.
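The FORK/JOIN win is just max() vs. sum(): sequential fraud-then-risk costs the sum of the two latencies, the parallel branch costs only the slower one. A toy sketch (asyncio stand-ins with made-up latencies, not the Conductor engine itself) of why the fork reclaims time:

```python
import asyncio
import time

async def fraud_score(txn):
    await asyncio.sleep(0.03)   # pretend ML scoring takes ~30ms
    return {"fraud_score": 0.12}

async def risk_evaluate(txn):
    await asyncio.sleep(0.02)   # pretend rule checks take ~20ms
    return {"decision": "APPROVE"}

async def sequential(txn):
    # The old txn-router behavior: one call, then the other.
    return [await fraud_score(txn), await risk_evaluate(txn)]

async def forked(txn):
    # The FORK/JOIN equivalent: both branches in flight at once.
    return await asyncio.gather(fraud_score(txn), risk_evaluate(txn))

txn = {"amount": 1999}

t0 = time.perf_counter()
seq = asyncio.run(sequential(txn))
t_seq = time.perf_counter() - t0   # ~50ms: 30 + 20

t0 = time.perf_counter()
par = asyncio.run(forked(txn))
t_par = time.perf_counter() - t0   # ~30ms: max(30, 20)
```

Same results either way; the only thing that changes is the critical-path length — which is exactly the quantity a sub-80ms SLA cares about.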
The Settlement Flow — Async, Batch, Durable
This is the flow that runs at midnight and has to finish before the bank's 5:30 AM cutoff:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: daily_settlement (v2) — ASYNC │
│ Triggered: midnight via Conductor scheduler │
│ SLA: complete before 5:30 AM bank cutoff │
│ │
│ ┌─────────────┐ │
│ │ START │ │
│ │ settlDate, │ │
│ │ acquirerId │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ batch_captures │ Worker: capture-service (Java) │
│ │ │ Group day's authorized txns by acquirer │
│ │ │ Validate capture amounts match auths │
│ │ │ Retry: 3x with exponential backoff │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ calculate_net │ Worker: settlement-engine (Java) │
│ │ _settlement │ Net out debits/credits per merchant │
│ │ │ Apply interchange fees, markups │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ generate_bank │ Worker: settlement-engine (Java) │
│ │ _files │ Generate settlement files in bank's │
│ │ │ required format (varies per acquirer) │
│ │ │ Upload to S3 staging bucket │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ submit_to_bank │ Worker: bank-gateway (Java) │
│ │ │ SFTP files to acquirer │
│ │ │ Retry: 5x (banks have flaky SFTP servers) │
│ │ │ Timeout: 300s │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────── FORK (parallel) ─────────────────┐ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ update_ledger │ │ publish_events │ │ │
│ │ │ │ │ │ │ │
│ │ │ Worker: │ │ Worker: │ │ │
│ │ │ funding-calc │ │ event-publisher │ │ │
│ │ │ (Java) │ │ (Go) │ │ │
│ │ │ │ │ │ │ │
│ │ │ Update merchant │ │ Fan out to Kafka │ │ │
│ │ │ balances, calc │ │ for downstream: │ │ │
│ │ │ payout amounts │ │ analytics, AML, │ │ │
│ │ │ │ │ compliance │ │ │
│ │ └──────┬───────────┘ └──────┬───────────┘ │ │
│ │ │ │ │ │
│ └──────────┴───────────────────────┴──── JOIN ──────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ notify_merchants │ Worker: notification-svc (Node.js) │
│ │ │ Settlement confirmation emails │
│ │ │ Update merchant dashboard via analytics-api │
│ └──────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ END │ Settlement complete for this acquirer │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
The big difference from before: if submit_to_bank fails at 2:47 AM because the bank's SFTP server hiccupped, Conductor retries it automatically with exponential backoff. Before, the cron job would fail, an alert would fire (eventually), someone would wake up, SSH in, figure out what happened, and manually re-run the job. Now? It retries, succeeds on attempt 3, and nobody gets paged. We see it in the morning in the execution history and move on with our lives.
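Conductor expresses that retry policy declaratively on the task definition rather than in worker code. A sketch of what a task definition for submit_to_bank might look like — the field names come from Conductor's task definition schema, but the values here are illustrative, not our exact production settings:

```json
{
  "name": "submit_to_bank",
  "retryCount": 5,
  "retryLogic": "EXPONENTIAL_BACKOFF",
  "retryDelaySeconds": 60,
  "timeoutSeconds": 600,
  "responseTimeoutSeconds": 300,
  "timeoutPolicy": "RETRY"
}
```

Because the policy lives in the task definition, tuning the retry behavior for one flaky bank doesn't require touching, rebuilding, or redeploying bank-gateway at all.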
Scale and Reliability on Orkes
At 2,450 TPS (with spikes to 3,800+ during holiday shopping), we needed:
- Guaranteed low-latency workflow execution — no GC pauses killing our p99, no queue backup
- Multi-region deployment for failover
- Throughput that scales horizontally without us babysitting Elasticsearch and Postgres clusters
Orkes (https://orkes.io) gave us managed, production-grade Conductor with the infrastructure handled. We stopped spending cycles on Conductor's backing store and went back to working on actual payment logic. Honestly felt like a weight lifted off the ops team.
SSO and RBAC: Governance That Actually Works
This turned out to be bigger than we expected going in. In our old system, access control was basically "do you have SSH access to the box?" There was no real concept of who could modify a workflow definition vs. who could execute one vs. who could view execution history.
In payment processing, this matters a lot:
- PCI-DSS requires strict access controls on anything touching cardholder data
- SOC 2 audits need evidence of role-based access and audit trails
- Separation of duties — the engineer who writes the settlement logic shouldn't be the same person who can manually trigger a settlement batch in production
What Orkes gave us:
- SSO integration with our existing Okta setup — no separate credentials, no "shared team password in 1Password"
- RBAC at the workflow and task level — we set up roles like payment-ops (can view and retry failed workflows), payment-eng (can modify workflow definitions in dev and staging), payment-admin (can deploy to production), and data-eng (can only see data pipeline workflows)
- Environment-level permissions — devs iterate freely in dev, but promoting a workflow change to production requires review from someone with the payment-admin role
Before we had RBAC, we had two separate production incidents caused by engineers accidentally running test workflows against prod. After RBAC: zero. In 14 months and counting. That alone justified the migration cost.
Workflow Versioning: Making Business Changes Safe
Payment processing logic changes constantly. New card network rules every quarter, updated fraud thresholds, regulatory requirements that shift by region, merchant-specific custom configurations. We're pushing workflow logic changes basically every sprint.
With the old system, deploying a workflow change meant a full code deployment. Roll it out across the fleet, cross your fingers, roll back if something catches fire.
With Conductor's workflow versioning on Orkes, we run multiple versions simultaneously:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKFLOW VERSIONING: credit_card_authorization │
│ │
│ v1 ──── still running for 23 in-flight transactions │
│ (winding down naturally, no new traffic) │
│ │
│ v2 ──── handling 95% of production traffic │
│ (stable, battle-tested) │
│ │
│ v3 ──── canary: 5% of EU transactions only │
│ (added 3DS auth step for PSD2 compliance) │
│ if issues → route back to v2 with config change │
│ no code deploy, no service restart, no maintenance window │
│ │
└─────────────────────────────────────────────────────────────────────┘
For a team that used to schedule 2 AM deployment windows to minimize blast radius, this was genuinely life-changing.
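The routing rule in the diagram boils down to a deterministic bucketing decision. Here's a hedged sketch of that logic, assuming a stable hash of the transaction ID; in practice Conductor picks the version at workflow-start time, and the 5% EU split is configured, not hand-coded like this.

```python
import zlib

# Sketch of the routing rule behind the diagram: v3 takes 5% of EU
# transactions, v2 takes everything else. Illustrative only -- the real
# split lives in workflow-start configuration, not application code.
CANARY_PERCENT = 5

def pick_workflow_version(txn_id: str, region: str) -> int:
    """Deterministically bucket a transaction into v2 or the v3 canary."""
    if region == "EU":
        bucket = zlib.crc32(txn_id.encode()) % 100  # stable 0-99 bucket per txn
        if bucket < CANARY_PERCENT:
            return 3
    return 2

# Non-EU traffic never sees the canary.
assert pick_workflow_version("abc-123", "US") == 2
```

Routing back to v2 is then a one-line config change of CANARY_PERCENT to 0, which is exactly the "no code deploy, no service restart" property the diagram calls out.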
AI-Powered Fraud Detection as a Native Workflow Step
Probably the most impactful change was getting our fraud detection ML pipeline properly integrated as a first-class Conductor workflow step.
Before, fraud-scoring was called via a synchronous HTTP request from txn-router during authorization. If the model service was slow or down, the entire auth flow degraded. We had hardcoded 30ms timeouts on the fraud call alone (that's nearly 40% of our 80ms budget on one step), no fallback strategy, and no way to A/B test model versions without deploying an entirely new service.
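For contrast, here's the pattern we were missing: bound the fraud call to its slice of the latency budget and degrade to a conservative default score instead of failing the whole authorization. This is a self-contained sketch; score_txn and the fallback value are stand-ins for our real fraud-scoring client, not its actual interface.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# 30 ms of the 80 ms auth budget, as described above.
FRAUD_TIMEOUT_S = 0.030

def score_with_fallback(score_txn, txn, fallback_score=0.5):
    """Run the fraud scorer with a hard timeout; degrade, don't fail."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(score_txn, txn)
        try:
            return future.result(timeout=FRAUD_TIMEOUT_S)
        except FutureTimeout:
            # Conservative default; the txn gets flagged for async review
            # instead of the entire auth flow degrading.
            return fallback_score

fast_scorer = lambda txn: 0.1
assert score_with_fallback(fast_scorer, {"id": "abc-123"}) == 0.1
```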
Now fraud-scoring is a Conductor worker (remember the Python annotation from earlier), and it participates in workflows like any other service. The fraud and risk checks run in parallel in the auth flow. But we also use Conductor to orchestrate the broader fraud operations:
┌─────────────────────────────────────────────────────────────────────┐
│ WORKFLOW: fraud_model_retrain (v1) — ASYNC │
│ Triggered: weekly or on-demand when new patterns detected │
│ │
│ START │
│ │ │
│ ▼ │
│ extract_training_data ─── etl-pipeline (Python) │
│ │ Pull 30 days of labeled txns │
│ ▼ │
│ validate_data_quality ─── etl-pipeline (Python) │
│ │ Check for drift, class imbalance │
│ ▼ │
│ train_model ───────────── fraud-scoring (Python) │
│ │ Train new TensorFlow model │
│ ▼ │
│ evaluate_holdout ──────── fraud-scoring (Python) │
│ │ Test against holdout set │
│ │ If accuracy < threshold → FAIL workflow │
│ ▼ │
│ SWITCH: accuracy_check │
│ │ │ │
│ │ pass │ fail │
│ ▼ ▼ │
│ deploy_ alert_ml_team ─── notification-svc (Node.js) │
│ canary Slack alert + PagerDuty │
│ │ │
│ │ Worker: fraud-scoring (Python) │
│ │ Deploy to 5% canary traffic │
│ ▼ │
│ monitor_canary ────────── analytics-api (Node.js) │
│ │ Watch false positive rate for 24hrs │
│ ▼ │
│ SWITCH: canary_healthy? │
│ │ │ │
│ │ yes │ no │
│ ▼ ▼ │
│ promote_ rollback ─────── fraud-scoring (Python) │
│ to_prod Revert to previous model version │
│ │ │
│ ▼ │
│ END │
└─────────────────────────────────────────────────────────────────────┘
Our false positive rate on fraud detection dropped from 3.2% to 1.8% after we started using Conductor to manage model versioning and canary rollouts properly. Turns out when you can A/B test models without fear, you iterate a lot faster.
Unified Debugging: All Logs in One Place
This might sound boring compared to the ML stuff, but honestly it changed our on-call experience more than anything else.
Before Conductor, debugging a failed transaction:
Transaction abc-123 failed.
Where? Somewhere across 27+ services. Could be anywhere.
Let me check:
→ api-gateway logs (Datadog)
→ txn-router logs (CloudWatch)
→ fraud-scoring logs (custom ELK stack the ML team set up)
→ risk-engine logs (CloudWatch, different account)
→ bank-gateway logs (Splunk)
→ settlement-engine logs (flat files on an EC2 instance — I wish I was kidding)
→ Kafka consumer lag dashboards (Confluent) — across 42 topics
→ SQS dead letter queues (another 11 to check)
→ RabbitMQ management console (18 queues)
→ Redis state (redis-cli on a bastion host)
→ PostgreSQL txn table (which state is this txn in?)
Time elapsed: 45 minutes. Still don't know what happened.
After Conductor on Orkes:
Transaction abc-123 → Workflow execution ID xyz-789
→ Open Orkes dashboard, search xyz-789
→ Full execution timeline: every service, every step, right there
→ See that tokenize_card (Go) ✓, fraud_score (Python) ✓,
risk_evaluate (Java) ✓, send_to_bank (Java) ✗ FAILED
→ Click send_to_bank → see exact error: bank returned code 96
(system malfunction) with full ISO 8583 response
→ See input payload, output payload, retry count (2 of 3)
→ Fix bank-gateway handler for code 96, deploy worker
→ Click "Retry from failed task" → transaction completes
Time elapsed: 3 minutes. Know exactly what happened, why, and fixed it.
Here's what the numbers actually look like:
| Metric | Before Conductor | After Conductor + Orkes |
|---|---|---|
| MTTD (time to detect an issue) | 22 minutes | 3 minutes |
| MTTR (time to recover) | 6.5 hours | 3 minutes |
| Production incidents per month | 18 | 3 |
| Failed transaction rate | 0.42% | 0.03% |
| Eng-hours spent on incident response/month | 127 | 19 |
The ability to replay a failed workflow from the exact point of failure is incredibly powerful for payment systems. When bank-gateway returns some weird error code we've never seen, we don't need to reprocess the entire transaction from scratch. We fix the handler, deploy the worker, and replay just the send_to_bank step with the original inputs. The transaction picks up right where it left off. All the upstream work — tokenization, fraud scoring, risk evaluation, routing — is preserved.
I cannot overstate how much better on-call is now. Our on-call engineers used to dread their rotation. Now it's mostly quiet, and when something does fire, they can diagnose and fix it during a single cup of coffee instead of pulling an all-nighter.
The Architecture Today
Here's what the platform looks like now — same 27+ services, but with Conductor as the orchestration backbone instead of point-to-point spaghetti:
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ORKES / CONDUCTOR │
│ (Workflow Orchestration Engine) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ SYNC WORKFLOWS │ │
│ │ │ │
│ │ credit_card_authorization (v3) │ │
│ │ api-gateway → tokenization-svc → fraud-scoring ║ │ │
│ │ risk-engine ║ parallel │ │
│ │ → threeds-service (EU) → txn-router → iso8583-codec │ │
│ │ → bank-gateway → auth-response-svc │ │
│ │ │ │
│ │ refund_authorization (v2) │ │
│ │ merchant-api → txn-router → bank-gateway → funding-calc │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ ASYNC WORKFLOWS │ │
│ │ │ │
│ │ daily_settlement (v2) — runs midnight, SLA 5:30 AM │ │
│ │ capture-service → settlement-engine → bank-gateway │ │
│ │ → funding-calc ║ event-publisher ║ → notification-svc │ │
│ │ │ │
│ │ clearing_reconciliation (v1) — runs on bank file receipt │ │
│ │ file-ingestion → clearing-matcher → recon-service │ │
│ │ → compliance-reporter │ │
│ │ │ │
│ │ chargeback_handling (v2) — triggered on dispute intake │ │
│ │ chargeback-svc → recon-service → notification-svc │ │
│ │ → compliance-reporter │ │
│ │ │ │
│ │ merchant_daily_report (v3) — runs 5 AM daily │ │
│ │ etl-pipeline → fraud-scoring → report-generator ║ │ │
│ │ data-warehouse ║ parallel │ │
│ │ → notification-svc │ │
│ │ │ │
│ │ fraud_model_retrain (v1) — weekly + on-demand │ │
│ │ etl-pipeline → fraud-scoring → analytics-api │ │
│ │ → notification-svc │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ WORKERS: 14 Java │ 4 Python │ 5 Go │ 3 Node.js │
│ ACCESS: SSO (Okta) + RBAC (per-env, per-workflow) │
│ VERSIONS: multiple workflow versions running in parallel │
│ OBSERVABILITY: full execution history + Datadog/PagerDuty export │
│ │
└─────────────────────────────────────────────────────────────────────┘
Same services. Same languages. Same business logic. The difference is that every interaction between services is now a visible, retryable, debuggable step in a Conductor workflow instead of a hidden HTTP call or a message lost in a queue somewhere.
The best part is how boring it is now. Boring infrastructure is good infrastructure.
What Actually Improved (With Numbers)
Putting hard numbers on this because I know that's what I'd want to see if I were reading this:
Reliability
- Authorization success rate: 99.4% → 99.97%
- Settlement SLA compliance: 91% → 99.8% (batches now consistently complete before the 5:30 AM bank cutoff)
- Double-settlement incidents: zero in 14 months (we had 3 the year before)
- Workflow version rollouts with zero downtime — used to require 2 AM maintenance windows
Operational Efficiency
- MTTR: 6.5 hours → 3 minutes
- On-call escalations per month: 18 → 3
- Engineering hours on incident response: 127/month → 19/month
- Time to deploy new workflow logic: 3-5 days → 4 hours
Developer Experience
- New engineer onboarding for understanding payment flows: 3 weeks → 4 days. Visual workflow definitions beat reading spaghetti code across 27+ repos every single time.
- Service integration time: ~1 day per service to add the Conductor worker annotation/SDK wrapper. We migrated all 27 services in about 6 weeks with a 3-person team.
- Cross-team collaboration got way better — the data team can see exactly what events event-publisher produces and when, without digging through Java source code or bugging payment engineers on Slack
Governance and Compliance
- PCI-DSS audit findings related to access control: 4 → 0
- Full audit trail of who changed which workflow, when, and what the diff looks like
- Environment isolation enforced by the platform, not by "please don't run this in prod" Slack messages
What We'd Tell Other Teams Building This Stuff
If you're running a payment platform or any high-volume financial transaction system and you're thinking about workflow orchestration, here's what we'd say:
Start with Conductor OSS. Seriously. We ran it locally, migrated one non-critical workflow, and validated the whole programming model in under two weeks. Low-risk way to prove it fits your architecture. The GitHub repo (https://github.com/conductor-oss/conductor) has solid docs and the community is active and helpful.
You don't have to rewrite your services. This was the biggest surprise. We thought integration would take months. It took weeks. The annotation/SDK approach means your services keep their existing logic — you're just changing how they get invoked. A @WorkerTask annotation in Java, a @worker_task decorator in Python, a RegisterTask call in Go. That's the migration for each service.
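To show the shape of that migration in Python, here's a sketch with a stand-in decorator that mimics what conductor-python's @worker_task does conceptually (register the function as a task worker), so the example runs without the SDK. The registry and decorator below are ours, not the SDK's internals.

```python
# Shape of the Python migration: existing service logic stays untouched,
# a decorator registers it as a Conductor task worker. The decorator
# below is a stand-in so this sketch runs without the SDK installed.
TASK_REGISTRY = {}

def worker_task(task_definition_name):
    """Stand-in for the SDK decorator: register the function, don't rewrite it."""
    def wrap(fn):
        TASK_REGISTRY[task_definition_name] = fn
        return fn
    return wrap

@worker_task(task_definition_name="fraud_score")
def fraud_score(txn: dict) -> dict:
    # Pre-existing service logic, unchanged by the migration.
    return {"score": 0.12, "txn_id": txn["id"]}

# Still directly callable, and now also registered as a task worker.
assert "fraud_score" in TASK_REGISTRY
assert fraud_score({"id": "abc-123"})["score"] == 0.12
```

That "wrap, don't rewrite" property is why 27 services took about a day each rather than a quarter each.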
Move to Orkes when you need production scale, security, and governance. The managed infrastructure, SSO/RBAC, and enterprise support from Orkes (https://orkes.io) are what made it possible for us to run mission-critical payment flows on Conductor. The team there actually understood PCI-DSS scoping, which saved us a ton of back-and-forth.
Model your sync and async flows in the same engine. Biggest architectural win was putting our real-time auth path and batch settlement path under one orchestration system. Same visibility, same debugging tools, same versioning, same access controls. Maintaining two systems for two execution patterns was killing us operationally.
Invest in RBAC from day one. We wish we'd had workflow-level access control years ago. It's not just security theater for auditors — it's about operational confidence. When you know that only reviewed, approved changes can reach production, you deploy without the pit in your stomach.
Let your teams use the languages they're good at. Don't make your data scientists write Java. Don't make your systems engineers write Python. Conductor's language-agnostic worker model meant each team used the best tool for their job while still participating in the same end-to-end workflows. The amount of glue code we deleted was genuinely therapeutic.
We've been running this setup in production for 14 months now. Happy to answer questions if you're evaluating something similar — drop a comment or find us in the Conductor community Slack.
If you're kicking the tires on workflow orchestration for payments or financial services, start with Conductor OSS (https://github.com/conductor-oss/conductor) and check out Orkes (https://orkes.io) when you're ready to scale.
Tags: #payments #fintech #microservices #orchestration #conductor #orkes #distributed-systems #fraud-detection #machine-learning #devops #platform-engineering #workflow-orchestration