tanvi Mittal for AI and QA Leaders
Building a Privacy-First Log Analyzer for Banking QA: The Technical Architecture

Part 2: Turning research into a working system that Security might actually approve

If you read Part 1, you know the problem: QA teams in banking waste 32% of their time creating test data that already exists in production logs, but PII compliance makes those logs untouchable.
After proving the tech exists (94% PII detection accuracy, <50ms scrubbing latency), the obvious next question was: Can I actually build this?
Turns out, the hard part isn't the technology. It's designing something that Security will trust.
The Non-Negotiables
Before writing a single line of code, I talked to three Security officers at different banks. They all said variations of the same thing:
"Show me how it handles the worst case. Then show me the audit trail. Then maybe we'll talk."
So I started with constraints, not features:

  1. Never store unmasked PII. Not in memory. Not in logs. Not temporarily. Not "just for a second." If the system crashes mid-processing, there should be zero plaintext PII anywhere.
  2. Immutable audit trail. Every log access, every masking decision, every query—logged with cryptographic proof it hasn't been tampered with. Because when your auditor asks "who accessed customer data on March 15th?", the answer can't be "probably nobody?"
  3. Defense in depth. One PII detection layer isn't enough. Neither is two. You need multiple layers with different approaches, so if one misses something, another catches it.
  4. Compliance by default. PCI DSS 4.0, GDPR Article 32, SOC 2 Type II requirements—baked into the architecture, not added as an afterthought.

With those constraints, here's what I'm building.
The Architecture (High Level)
The system has six core components, each with a single responsibility:

Production Logs → Ingestion Layer → Detection Pipeline →
Storage Layer → Query Interface → Test Generator

Simple chain. But each link has to be bulletproof.
Component 1: The Ingestion Layer
Job: Get logs from wherever they live (Splunk, ELK, S3, Datadog) without ever storing them unmasked.
How it works:
Logs come in as streams, not batches. Process one entry at a time. Mask immediately. Only then write to internal storage.

```python
# Pseudocode - actual implementation is more complex
def ingest_log_entry(raw_entry):
    # Step 1: Pre-detection scan (fast regex patterns)
    if contains_obvious_pii(raw_entry):
        alert_security_team()

    # Step 2: Stream to detection pipeline
    masked_entry = detection_pipeline.process(raw_entry)

    # Step 3: Verify no PII leaked through
    if verification_scan(masked_entry) == FAIL:
        quarantine_entry()
        alert_security_team()
        return

    # Step 4: Write to storage with audit record
    storage.write(masked_entry, audit_trail={
        'timestamp': now(),
        'source': raw_entry.source,
        'masking_confidence': masked_entry.confidence,
        'hash_of_original': sha256(raw_entry)
    })
```

Key decisions:

Streaming, not batching: You can't accidentally log 1,000 unmasked entries if you only process one at a time.
Multiple verification layers: Pre-scan catches obvious stuff fast. Post-scan catches what the main pipeline missed.
Quarantine on doubt: If the confidence score is below threshold (I'm using 85%), don't mask it—quarantine it for manual review.

The part that took longest to figure out:
How do you verify masking worked without comparing to the original unmasked entry?
Answer: Pattern-based verification. After masking, run detection again. If it still finds PII-like patterns, something failed.
```python
import re

FAIL, PASS = False, True

def verification_scan(masked_entry):
    # These should never appear in masked output
    forbidden_patterns = [
        r'\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}',  # Card numbers
        r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',             # Names (heuristic)
        r'\b\d{3}-\d{2}-\d{4}\b',                   # SSN
        # ... 40+ more patterns
    ]

    for pattern in forbidden_patterns:
        if re.search(pattern, masked_entry):
            return FAIL

    return PASS
```

Not perfect. But it catches 98% of masking failures in my testing.
Component 2: The Detection Pipeline
This is where the magic happens. Or fails spectacularly, depending on your PII detection accuracy.
I'm using a hybrid approach with four layers:
Layer 1: Rule-Based Detection (Fast)
Classical regex patterns for obvious stuff:

Credit cards: Luhn algorithm + format validation
SSN: Format + known invalid ranges
Email: RFC 5322 compliant regex
Phone numbers: International format handling

Performance: ~2ms per log entry
Accuracy: 78% (misses contextual PII)
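The Luhn check on card-number candidates is simple enough to sketch in a few lines. This is a minimal illustrative version, not the production detector:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Combined with a format check (16 digits, known BIN prefixes), this keeps false positives on random digit runs low.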
Layer 2: Named Entity Recognition (Context-Aware)
Using spaCy with a custom-trained model on financial documents.
Catches things like:
"Transaction initiated by John Smith" → Masks name
"Smith College transaction fee" → Keeps name (it's an institution)
Performance: ~15ms per log entry
Accuracy: 87% (better with context)
Layer 3: Microsoft Presidio (Industry Standard)
Open-source PII detection with pre-built analyzers for 50+ PII types.
I modified it to be more aggressive in banking contexts:
```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

presidio_analyzer = AnalyzerEngine()

# Custom recognizers for banking-specific patterns
account_recognizer = PatternRecognizer(
    supported_entity="ACCOUNT_NUMBER",
    patterns=[
        Pattern("ACC_NUM_1", r"\b[0-9]{8,17}\b", 0.8),
        Pattern("IBAN", r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", 0.9)
    ]
)

presidio_analyzer.registry.add_recognizer(account_recognizer)
```
Performance: ~30ms per log entry
Accuracy: 91% (excellent general-purpose)
Layer 4: LLM-Based Detection (Expensive, High-Recall)
For entries that previous layers flagged as uncertain, run through a small fine-tuned LLM.
Using a quantized Llama 2 7B model trained on synthetic banking logs with labeled PII.
Prompt template:

Identify ALL personally identifiable information in this log entry:

Log: "{log_entry}"

Return JSON with:
- pii_found: true/false
- entities: [{type, value, start_pos, end_pos}]
- confidence: 0-1

Performance: ~200ms per log entry (only for uncertain cases)
Accuracy: 96% (catches nearly everything)
The Voting Mechanism
All four layers run in parallel. Results are combined with weighted voting:

```python
def combine_detections(rule_results, ner_results, presidio_results, llm_results):
    weights = {
        'rule': 0.2,      # Fast but basic
        'ner': 0.25,      # Good context
        'presidio': 0.3,  # Industry standard
        'llm': 0.25       # High accuracy but slower
    }

    # Aggregate overlapping detections
    all_detections = merge_overlapping_spans(
        rule_results + ner_results + presidio_results + llm_results
    )

    # Calculate a confidence score for each detection
    for detection in all_detections:
        detection.confidence = sum(
            weights[layer] for layer in detection.detected_by
        )

    # Only mask if confidence >= threshold
    return [d for d in all_detections if d.confidence >= 0.85]
```

Why an 85% threshold?
Testing on 10,000 synthetic banking logs:

85% threshold: 3 false negatives (leaked PII), 47 false positives (over-masked)
90% threshold: 12 false negatives, 23 false positives
80% threshold: 0 false negatives, 89 false positives

I'd rather over-mask than leak PII, but 89 false positives made logs unreadable. 85% is the sweet spot.
Component 3: The Masking Strategy
Okay, you've detected PII. Now what?
You can't just replace everything with asterisks. That destroys the usefulness for testing.
Here's what I'm using:
Strategy 1: Format-Preserving Masking
Keep the structure, hide the content:

Original: Card 4532-1234-5678-9010
Masked: Card 4532-****-****-9010

QA can still see:

It's a Visa card (4xxx)
Last 4 digits (for correlation with other logs)
Transaction patterns intact
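A minimal sketch of that masking step, assuming the detector has already classified the span as a card number (`mask_card_number` is illustrative, not the production code):

```python
import re

def mask_card_number(card: str) -> str:
    # Keep the first 4 digits (card network) and last 4 (correlation),
    # mask everything in between, and restore 4-digit grouping.
    digits = re.sub(r"\D", "", card)
    masked = digits[:4] + "*" * (len(digits) - 8) + digits[-4:]
    return "-".join(masked[i:i + 4] for i in range(0, len(masked), 4))
```

So `mask_card_number("4532-1234-5678-9010")` gives back `4532-****-****-9010`, regardless of whether the input used dashes or spaces.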

Strategy 2: Tokenization with Consistency
Same PII value always maps to same token within a session:
Log 1: "John Smith initiated transfer"
Log 2: "Transfer approved by John Smith"

Becomes:

Log 1: "USER_A7F3 initiated transfer"
Log 2: "Transfer approved by USER_A7F3"
Critical for tracing transactions across logs. But tokens reset between analysis sessions (no long-term correlation).
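Consistent tokenization can be built on a keyed hash, where the key lives only as long as the session. This is a sketch of the idea; the class name and token format are my own, not the system's:

```python
import hashlib
import hmac
import secrets

class SessionTokenizer:
    """Maps each PII value to a stable token within one session.
    The key is discarded when the session ends, so tokens can't be
    correlated across sessions."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # per-session secret

    def tokenize(self, entity_type: str, value: str) -> str:
        # Same key + same value → same digest → same token
        digest = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        return f"{entity_type}_{digest[:4].upper()}"
```

Because the HMAC is deterministic per key, "John Smith" maps to the same `USER_xxxx` token everywhere in a session, and a fresh key next session breaks any long-term linkage.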
Strategy 3: Synthetic Replacement
For test generation, replace with realistic fake data:
```python
from faker import Faker

fake = Faker()

def generate_synthetic_replacement(pii_type, original_value):
    replacements = {
        'PERSON': fake.name,
        'EMAIL': fake.email,
        'PHONE': fake.phone_number,
        'CREDIT_CARD': fake.credit_card_number,
        'ADDRESS': fake.address,
    }

    # Maintain some format characteristics
    if pii_type == 'CREDIT_CARD':
        # Keep the same card network (Visa, MC, etc.)
        network = detect_card_network(original_value)
        return fake.credit_card_number(card_type=network)

    return replacements.get(pii_type, lambda: '[REDACTED]')()
```

Tests run with fake data, but patterns match real scenarios.
Component 4: The Storage Layer
This one's deceptively simple: store masked logs with full audit trails.
But there's a catch—compliance requirements say you need to prove data lineage.
"This test case came from production logs collected on [date], masked with [confidence], accessed by [user], approved by [security officer]."
So every entry gets metadata:
```json
{
  "masked_log_entry": "USER_A7F3 initiated transfer of AMOUNT_B2D9",
  "metadata": {
    "ingestion_timestamp": "2025-10-30T14:32:11Z",
    "source_system": "splunk-prod-cluster-2",
    "masking_confidence": 0.92,
    "detection_layers_used": ["rule", "ner", "presidio"],
    "original_hash": "sha256:7f3e9a2b...",
    "accessed_by": [],
    "compliance_tags": ["PCI-DSS-10.2", "GDPR-Art32"]
  }
}
```
Storage uses append-only logs (like event sourcing). You can't modify history. You can only add to it.
Every query, every access, every download—appended to the audit trail with cryptographic signatures.
```python
def log_access(user_id, query, results_count):
    access_record = {
        'timestamp': now(),
        'user_id': user_id,
        'query': query,
        'results_returned': results_count,
        'ip_address': get_user_ip(),
        'previous_hash': audit_log.get_latest_hash()
    }

    # Cryptographically sign the record
    access_record['signature'] = sign_with_private_key(
        json.dumps(access_record, sort_keys=True)
    )

    # Append to the immutable log
    audit_log.append(access_record)
```

If someone tries to tamper with the audit trail, the hash chain breaks. Auditors love this.
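Verifying that chain is cheap: recompute each record's hash and compare it to the `previous_hash` stored by its successor. A sketch of the verification side (signature checking omitted):

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    # Canonical JSON (sorted keys) so the hash is stable across key orderings
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def verify_chain(records: list) -> bool:
    # Each record must carry the hash of its predecessor;
    # any edit to an earlier record changes its hash and breaks the link.
    for prev, curr in zip(records, records[1:]):
        if curr["previous_hash"] != record_hash(prev):
            return False
    return True
```

An auditor can run this over the whole trail and pinpoint exactly where (if anywhere) history was altered.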
Component 5: The Query Interface
This is where QA engineers actually interact with the system.
Design principle: Make it feel like their existing tools, not a new thing to learn.
For teams using Splunk:

```
index="masked_production_logs" transaction_type="payment"
| where amount > 1000
| stats count by error_code
```

Same syntax. Just hits masked data instead of raw logs.
For teams using SQL:

```sql
SELECT error_code, COUNT(*)
FROM masked_logs
WHERE transaction_type = 'payment'
  AND amount > 1000
GROUP BY error_code;
```
Under the hood, every query:

Checks user permissions (RBAC)
Logs the query to audit trail
Scans results for any leaked PII (should never happen, but defense in depth)
Returns masked data with confidence scores

Component 6: The Test Generator
The final piece: turning log sequences into runnable tests.
Input: A sequence of logs that shows a bug:
[LOG 1] User A7F3 initiated payment of $1,234.56
[LOG 2] Payment validation service returned error: INVALID_ROUTING
[LOG 3] Retry attempted with same routing number
[LOG 4] Payment failed, user received generic error message
Output: A Selenium/Playwright test:

```python
def test_payment_invalid_routing_error_message():
    """
    Test Case: Invalid routing number should show specific error
    Source: Production logs 2025-10-15, Ticket #1234
    """
    user = create_test_user()

    # Step 1: Initiate payment (from LOG 1)
    payment_page.enter_amount("1234.56")
    payment_page.enter_routing_number("INVALID_ROUTING_NUM")
    payment_page.click_submit()

    # Step 2: Verify error message is specific, not generic (from LOG 4)
    error_message = payment_page.get_error_message()
    assert "routing number" in error_message.lower(), \
        f"Expected specific error about routing number, got: {error_message}"

    # Step 3: Verify no retry with same invalid data (from LOG 3)
    assert payment_page.retry_count() == 0, \
        "System should not auto-retry with known invalid routing number"
```

The generator uses templates for common patterns:

Login flows
Payment transactions
Form submissions
API calls
Error handling

And lets you customize for your tech stack:
```python
test_generator = TestGenerator(
    framework='pytest',
    web_driver='playwright',
    api_client='requests',
    assertion_style='pytest',
    output_format='python'
)

test_code = test_generator.from_log_sequence(
    logs=failing_payment_logs,
    test_name="test_payment_routing_validation",
    tags=['payment', 'regression', 'P1']
)
```
The Security Review Checklist
Before this goes anywhere near production, it has to pass Security. Here's the checklist I'm using:
Data Protection:

No plaintext PII ever stored
Multi-layer detection (4 independent methods)
Format-preserving masking where possible
Encryption at rest (AES-256)
Encryption in transit (TLS 1.3)

Access Control:

Role-based access control (RBAC)
Multi-factor authentication required
Session timeout (15 minutes idle)
IP allowlisting for prod access
Principle of least privilege

Audit & Compliance:

Immutable audit trail
Cryptographically signed access logs
Retention policy (90 days)
Automated compliance reporting
Regular security assessments

Failure Handling:

Quarantine uncertain detections
Alert on detection confidence drop
Graceful degradation (fail closed, not open)
Automated rollback on anomaly detection

Testing:

10,000+ synthetic log test suite
Penetration testing planned
Red team exercise scheduled
False negative rate < 0.1%
False positive rate < 5%

The Parts I'm Still Figuring Out

  1. Performance at Scale
Current benchmarks: 1,200 logs/second on a single machine. But banking systems generate millions of logs per day. Do I:

Horizontal scaling (more machines)?
Sampling (only process high-value logs)?
Tiered processing (fast scan for most, deep scan for flagged)?

Leaning toward tiered processing, but not sure yet.
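If I go that way, the routing logic itself is straightforward. A sketch of the tiered idea (the names, result type, and escalation threshold are placeholders, not a final design):

```python
from dataclasses import dataclass

@dataclass
class ScanResult:
    masked: str
    confidence: float

def process_tiered(entry, fast_scan, deep_scan, escalation_threshold=0.5):
    # Run the cheap rule-based scan on everything; only entries the
    # fast layer is unsure about pay for the NER/Presidio/LLM layers.
    result = fast_scan(entry)
    if result.confidence < escalation_threshold:
        return deep_scan(entry)
    return result
```

The trade-off is that the fast layer's confidence estimate becomes a security boundary, so it would need its own false-negative testing before shipping.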

  2. The Cold Start Problem
When you first deploy this, you have zero masked logs. QA teams need to wait for ingestion to catch up. Options:

Batch process last 30 days of logs (risky, but fast)
Real-time only (safe, but slow to provide value)
Hybrid (real-time + controlled historical backfill)

Probably going with hybrid, with extra security review for historical processing.

  3. Custom PII Types
Every bank has internal identifiers that are technically PII but don't match standard patterns:

Internal customer IDs
Proprietary account formats
Legacy system codes

Need a way for Security to define custom patterns without writing code:
```yaml
custom_pii_types:
  - name: "INTERNAL_CUSTOMER_ID"
    pattern: "CUST[0-9]{8}"
    confidence: 0.95
    masking_strategy: "tokenize"

  - name: "LEGACY_ACCOUNT"
    pattern: "[A-Z]{2}[0-9]{6}[A-Z]"
    confidence: 0.90
    masking_strategy: "format_preserving"
```

Building a config-driven recognizer system for this.
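The loader for that config can be small. Here's a sketch that turns the parsed YAML (e.g. via PyYAML) into compiled recognizers before they're registered with the detection pipeline; the tuple shape is my assumption, not the system's interface:

```python
import re

def build_recognizers(config: dict):
    # Compile each custom PII type from the parsed config into a
    # (name, compiled_pattern, confidence, masking_strategy) tuple.
    recognizers = []
    for entry in config["custom_pii_types"]:
        recognizers.append((
            entry["name"],
            re.compile(entry["pattern"]),
            entry["confidence"],
            entry["masking_strategy"],
        ))
    return recognizers
```

Security defines patterns in YAML, the pipeline compiles them at startup, and no code changes are needed to add a new internal identifier.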
What's Next
I have a working prototype that handles the happy path. Next steps:
    Week 1-2: Hardening

Add the failure cases (network errors, malformed logs, detection timeouts)
Load testing (can it actually handle 1M+ logs/day?)
Fuzzing the PII detection (adversarial inputs)

Week 3-4: Security Review

Internal security team review
Third-party penetration testing
Compliance gap analysis

Week 5-6: Beta Testing

3-5 QA teams in controlled environments
Real logs (with extra monitoring)
Gather feedback on usability

Week 7-8: Iteration

Fix what breaks
Improve what's clunky
Add the features that beta testers actually need

The Real Test
The tech works. I'm confident about that.
The real question is: Will Security actually approve this for production use?
Because the best PII detection in the world doesn't matter if it never leaves the prototype stage.
That's why I'm designing for Security's approval from day one. Not bolting it on later.
We'll see if it works.
I Need Your Input (Again)
If you're building something similar or have thoughts on the architecture:
For Security folks:

What would make you trust this system?
What's missing from my security checklist?
Would you need penetration testing results before approval?

For QA engineers:

Is the query interface actually useful or am I overthinking it?
What test generation formats do you actually need?
Would you use this if it existed?

For architects:

Glaring architectural mistakes I'm making?
Performance bottlenecks I'm not seeing?
Better ways to handle the scale problem?

Drop a comment or hit me up. I'm documenting this whole build in public, so if you want updates or want to beta test, let me know.

Part 1: why-production-logs-are-a-qa-goldmine-and-why-nobody-uses-them
Part 3: Coming soon - Beta testing results and what actually broke
Building in public. Follow along or DM me to get involved.
