KitKeen
Building a KYC questionnaire that knows what the regulator will ask before they ask it

If you've never worked in fintech: KYC stands for Know Your Customer. It's the legal requirement for financial institutions to verify who their customers are before providing services. Anti-money laundering, fraud prevention, sanctions screening - all of it starts with KYC. When a business opens an account, the bank needs to understand what that business does, where its money comes from, and whether it poses any compliance risk.

Before this feature shipped, a new business customer could complete onboarding, submit their application - and then wait for the banking partner's compliance team to start asking questions.

Our banking partner performs compliance reviews on all new business accounts. As part of that review, their compliance team would send questions about business activity, industry type, international transactions, licensing, certifications. The customer would answer. Sometimes the partner would come back with another round. Multiple rounds of back-and-forth, each round adding days to the timeline.

The critical detail: these questions weren't random. The compliance team asked them based on the customer's industry. A restaurant got asked about cash handling. A financial intermediary got asked about cross-border transaction volumes. A construction company got asked about subcontractors. An IT consultancy got asked about data processing agreements. The questions were predictable. We just weren't collecting the answers before the partner asked for them.

The solution was to intercept the question-answer cycle before it started. Collect the compliance questions during onboarding - before the review even begins - so when the application lands at the banking partner, the answers are already there.


The problem with "right questions"

Not every business needs to answer the same questions. A software consultancy and a money services company have very different compliance profiles. Asking the same 30 questions to everyone wastes the customer's time and produces noise in the compliance review.

The questionnaire had to be adaptive. The question set for each customer is determined by their industry classification - in Europe, that's the NACE code. Think of it as a standardised label for what a company does: restaurants, software development, financial services, construction, logistics, etc. Every registered business has one.

Each industry classification maps to a specific set of question groups. A money transfer business gets questions about transaction monitoring and correspondent banking. A restaurant gets questions about cash revenue ratios. An e-commerce company gets questions about chargebacks and payment providers. One industry might require a dozen groups, another might need half that. The question set is assembled at generation time based on this mapping, not hardcoded per customer.
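As a sketch, that mapping can be a table from NACE code prefixes to question-group identifiers. The codes and group names below are invented for illustration, not the real mapping:

```typescript
// Hypothetical NACE-prefix → question-group mapping (illustrative values only)
const QUESTION_GROUPS_BY_NACE: Record<string, string[]> = {
  "56": ["cash_handling", "revenue_sources"],              // restaurants
  "62": ["data_processing", "client_contracts"],           // IT consulting
  "64": ["transaction_monitoring", "correspondent_banking",
         "cross_border_volumes"],                          // financial services
  "47.91": ["chargebacks", "payment_providers"],           // e-commerce
};

// Resolve the most specific prefix match, falling back to a default set.
function questionGroupsFor(naceCode: string): string[] {
  const prefixes = Object.keys(QUESTION_GROUPS_BY_NACE)
    .filter((p) => naceCode.startsWith(p))
    .sort((a, b) => b.length - a.length); // longest (most specific) first
  return prefixes.length > 0
    ? QUESTION_GROUPS_BY_NACE[prefixes[0]]
    : ["general_business_activity"];
}
```

Longest-prefix matching means a specific code like 47.91 can get its own group set while everything else under 47 falls through to a broader one.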


The question hierarchy

Questions aren't flat. Many have sub-questions that only make sense given a specific answer.

"Does your company conduct business internationally?" has a follow-up question about destination countries - but only if the answer is yes. Asking about destination countries when the answer is no creates noise and confuses customers.

We model this with decimal question numbers. Question 1 is a parent. Questions 1.1 and 1.2 are its children. The hierarchy is encoded in the decimal, not in a separate parent-child field.

Question 1: Does your company conduct business internationally? (Bool)
  Question 1.1: If yes, which countries? (Text) - Condition: true
  Question 1.2: If no, why not? (Text) - Condition: false

At submission time, the validator prunes the tree. If the customer answers "no" to question 1, question 1.1 is removed from the contract before storage. The customer never sees questions that don't apply to them, and the stored answers form a clean, consistent set.

Some questions are container nodes - they exist to define the hierarchy but are never shown to the user. Only leaf questions generate actual form inputs. A boolean flag on each question distinguishes containers from visible inputs.
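Put together, the pruning and rendering rules might look like this. The field names are assumptions for illustration, not the real schema:

```typescript
interface Question {
  number: string;        // "1", "1.1", "1.2" — hierarchy encoded in the decimal
  text: string;
  isContainer: boolean;  // container nodes define structure, never render
  condition?: boolean;   // for children: the parent answer that makes them apply
}

// Answers keyed by question number; booleans for Bool questions, strings for Text.
type Answers = Record<string, boolean | string>;

// Drop child questions whose condition doesn't match the parent's answer.
function pruneByConditions(questions: Question[], answers: Answers): Question[] {
  return questions.filter((q) => {
    const dot = q.number.lastIndexOf(".");
    if (dot === -1) return true;                 // top-level question: always keep
    const parentNumber = q.number.slice(0, dot); // "1.1" → parent "1"
    return q.condition === undefined || answers[parentNumber] === q.condition;
  });
}

// Only leaf questions generate actual form inputs.
function visibleQuestions(questions: Question[]): Question[] {
  return questions.filter((q) => !q.isContainer);
}
```

With the example above, answering "no" to question 1 keeps question 1.2 and removes 1.1 before storage.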


One group per step

The questionnaire can span multiple question groups depending on industry. Rather than dumping everything on one screen, each question group is a separate onboarding step.

A customer sees their progress at the top - "Step 3 of 7" or "Step 5 of 12" depending on the industry. Each step is independently validated and submitted. Answers are stored per-step. The customer can stop mid-questionnaire and resume later without losing progress. Completed steps are marked hidden - the answers are preserved for the compliance audit trail, but the customer can't edit them after submission.

The routing logic tracks which compliance steps are still pending. When the customer completes a step, it finds the next incomplete one and routes there. When all compliance steps are done, onboarding picks up from where it left off.
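The routing described above reduces to a scan for the next incomplete step. The step shape and route paths here are hypothetical:

```typescript
interface ComplianceStep {
  groupId: string;
  completed: boolean;
}

// Route to the first incomplete compliance step, or resume onboarding if all done.
function nextRoute(steps: ComplianceStep[]): string {
  const pending = steps.find((s) => !s.completed);
  return pending !== undefined
    ? `/onboarding/compliance/${pending.groupId}`
    : "/onboarding/resume";
}
```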


Preventing duplicates

Question generation runs once per customer. But onboarding flows have concurrent paths - multiple events can trigger at the same time, and we had to ensure we didn't generate duplicate question sets.

Two layers of protection:

┌─────────────────────────────────────────────────────────┐
│              Generate questionnaire request              │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
              ┌─────────────────┐     YES
              │ Cache lock set? ├────────────► Drop request
              └────────┬────────┘
                  NO   │
                       ▼
             Set cache lock (short TTL)
                       │
                       ▼
          ┌────────────────────────┐     YES
          │  Completion flag set?  ├────────────► Drop request
          └────────────┬───────────┘
                  NO   │
                       ▼
             Generate questions
                       │
                       ▼
           Set completion flag (persistent)
                       │
                       ▼
                    Done ✓

First, a distributed cache lock with a short TTL using set-if-not-exists semantics. The first request to generate the questionnaire for a given onboarding ID wins. Concurrent requests within the TTL window are dropped.

Second, a persistent completion flag written to the onboarding metadata after generation completes. Even if the cache expires, the flag prevents re-generation.
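A sketch of the two-layer guard, using an in-memory stand-in for the distributed cache. In production the lock would be something like a Redis `SET` with `NX` and `EX`, and the completion flag a field on the onboarding record:

```typescript
const cacheLocks = new Map<string, number>(); // onboardingId → lock expiry (ms epoch)
const completionFlags = new Set<string>();    // stand-in for persisted onboarding metadata
const LOCK_TTL_MS = 30_000;

// Returns true if this request performed generation; false if it was dropped.
function tryGenerateQuestionnaire(onboardingId: string, generate: () => void): boolean {
  const now = Date.now();

  // Layer 1: set-if-not-exists lock with a short TTL — the first request wins,
  // concurrent requests within the TTL window are dropped.
  const expiry = cacheLocks.get(onboardingId);
  if (expiry !== undefined && expiry > now) return false;
  cacheLocks.set(onboardingId, now + LOCK_TTL_MS);

  // Layer 2: persistent completion flag survives cache expiry.
  if (completionFlags.has(onboardingId)) return false;

  generate();
  completionFlags.add(onboardingId);
  return true;
}
```

The lock handles the race between concurrent events; the flag handles the long tail, where a retry arrives after the lock has expired.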


The rollout

We didn't enable this for all customers at once. The feature was introduced into a live product with existing customers, and asking those customers to answer compliance questions mid-flight would have been disruptive.

Two feature toggles controlled the rollout. The first controls eligibility by customer creation date - customers created before the launch date are excluded. Hard boundary, no percentage, no A/B. The second controls traffic within eligible customers - started at 0%, increased gradually as we validated the system in production. The two toggles are independent: scope (who is eligible) and volume (what percentage see it) are controlled separately.
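The two toggles might compose like this. The launch date and hashing scheme are assumptions for illustration:

```typescript
const LAUNCH_DATE = new Date("2024-01-01"); // hypothetical cutoff
let rolloutPercentage = 0;                  // volume toggle, raised gradually

// Deterministic bucket per customer, so the same customer always gets the same decision.
function bucketOf(customerId: string): number {
  let hash = 0;
  for (const ch of customerId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100;
}

function questionnaireEnabled(customerId: string, createdAt: Date): boolean {
  if (createdAt < LAUNCH_DATE) return false;       // scope: eligibility by creation date
  return bucketOf(customerId) < rolloutPercentage; // volume: percentage of eligible
}
```

Because the date check runs first, raising the percentage never leaks the feature to pre-launch customers.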


The numbers

Thousands of customers completed the questionnaire during the rollout period.

The banking partner's compliance review started with all required information already in the application. No rounds of questions. No waiting for customer responses. The back-and-forth that previously stretched across days of email exchanges was replaced by a single session during onboarding - at the point where the customer was already engaged and focused on getting their account open.

The technical work - industry-based question mapping, conditional logic, multi-step navigation, distributed deduplication - existed to serve one outcome: anticipating what the regulator would ask, and collecting those answers before the review even began.


What changed in the code

Before this feature, question generation was a static config:

// Before: same questions for every customer
var questions = LoadDefaultComplianceQuestions();
SendToPartner(application, questions);
// Partner replies: "We also need to know about X, Y, Z"
// Customer gets emailed. Waits. Answers. Waits again.

After:

// After: industry-adaptive generation
var industry = GetIndustryClassification(customer);
var questionGroups = MapIndustryToQuestionGroups(industry);
var tree = BuildConditionalTree(questionGroups);

// Customer answers during onboarding
var answers = CollectAnswersDuringOnboarding(tree);
var pruned = PruneByConditions(tree, answers);

SendToPartner(application, pruned);
// Partner receives complete data. No follow-up needed.

The difference in flow:

BEFORE:
Customer → Submit app → Partner reviews → Questions round 1
→ Customer replies (days) → More questions → Customer replies (days)
→ Approved
Total: days to weeks of back-and-forth

AFTER:
Customer → Answer compliance questions during onboarding → Submit app
→ Partner reviews (answers already attached) → Approved
Total: single onboarding session

No new infrastructure. No external dependencies. Just a question mapping layer, a conditional tree, and a deduplication guard - inserted into the existing onboarding pipeline.
