I noticed it in the worst possible place: mid-call, screen-shared, with a financial advisor driving.
They typed a query into the search UI, got a crisp ranked list back, and then—without thinking—hovered right over a client's name and AUM. The call didn’t explode, but my stomach did that little drop you only get when you realize a system is technically correct and still socially dangerous.
This is Part 10 of “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”. In Part 9 I talked about progressive disclosure via dual-channel streaming. (If Part 5 on kill switches and rollout flags sounds related — it is, but different. Part 5 was about deployment controls: circuit breakers and gradual rollout. This is about runtime modes: per-conversation behavior that changes the output shape without changing the pipeline.) This time the core decision is quieter, but it’s the kind of decision that separates an AI demo from an AI system you can actually operate: anonymization as a runtime toggle, not a data transformation.
## The naive solution I almost shipped
The obvious idea—especially if you’ve lived in search infrastructure long enough—is: “Make a second index.”
An anonymized search index sounds clean on paper:
- No private names stored.
- No private firms stored.
- No chance of leaking identity, because identity isn’t there.
But it’s a trap.
A second index means a second truth:
- Two ingestion paths.
- Two schemas.
- Two ranking behaviors.
- Two sets of bugs.
- Two sets of backfills.
And the most subtle problem: you don’t just want “anonymized data.” You want the same ranking. If privacy mode changes the ranking, you’ve changed the product.
So I refused the data-pattern answer and went with a deployment-pattern answer:
Same search. Same index. Same scoring. Same results.
Only the output is transformed—per conversation—when privacy mode is enabled.
## Key insight (the part most people miss)
Anonymization isn’t primarily a privacy feature. It’s a mode switch in a human workflow.
Financial advisors share their screen when pitching prospects. They need to search their book of business live. They need to talk about the short list. They need to compare options.
What they don’t need is the system “helpfully” revealing:
- client names
- firm names
- precise AUM and production numbers
So privacy mode does something very specific:
- Client names → "Client A/B/C"
- Firm names → generic industry labels
- AUM and production → rounded ranges ($1M–$5M, $5M–$25M, etc.)
And it does something equally specific by not touching it:
- location is preserved
- designations (CFP, CFA, CIMA) are preserved
- availability is preserved
That last bullet is the tell. If you anonymize too aggressively, you destroy utility. If you anonymize too weakly, you destroy trust. The trick is to redact identity while keeping decision-critical structure.
## How it works under the hood
There are four pieces that make this real:
1) A database-level flag: `privacy_mode BOOLEAN DEFAULT false`
2) A partial index: `WHERE privacy_mode = true`
3) A runtime transformer: `anonymize_search_results()` applied only to the client search index
4) Training data alignment: the fine-tuning dataset uses the same anonymization rules (including firm-type labels)
If you only do (3), you’ll eventually leak through some other path.
If you only do (4), the model will behave, but your raw API responses won’t.
If you do all four, privacy becomes a system property—not a promise.
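To make the control flow concrete, here's a minimal sketch of the response boundary where those pieces meet. The conversation store, the search stub, and the handler wiring are invented for illustration; only the shape matters — the pipeline never branches, the boundary does:

```python
# Sketch: the flag lives on conversation state, the gate sits at the output boundary.
CONVERSATIONS = {"conv-1": {"privacy_mode": True}}  # stand-in for conversation_state

def run_search(query):
    # Stand-in for the shared search path: one index, one ranking, both modes.
    return {"client-search": [{"full_name": "Jane Doe", "score": 0.92}]}

def anonymize_search_results(results_by_index, privacy_mode=False):
    # Simplified redactor: only the client-search index is touched.
    if not privacy_mode:
        return results_by_index
    for i, client in enumerate(results_by_index.get("client-search", [])):
        client["full_name"] = f"Client {chr(65 + i)}"
    return results_by_index

def handle_search(conversation_id, query):
    privacy_mode = CONVERSATIONS.get(conversation_id, {}).get("privacy_mode", False)
    results = run_search(query)  # pipeline never branches on the flag
    return anonymize_search_results(results, privacy_mode=privacy_mode)  # boundary does
```

Note that `handle_search` consults the flag exactly once, at the edge — search and ranking code never see it.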
## Architecture flow
Here’s the dataflow I ended up with: one search path, one ranking path, and a single redaction gate at the boundary.
```mermaid
flowchart TD
    ui[AdvisorUI] --> api[SearchAPI]
    api --> search[ClientSearchIndex]
    search --> rank[RankedResults]
    rank --> gate[PrivacyGate]
    gate --> normal[NormalResponse]
    gate --> private[PrivacyModeResponse]
    subgraph persistence
        convo[ConversationState]
        flag[privacy_mode]
    end
    ui --> convo
    convo --> flag
    flag --> gate
```
The important part is that `PrivacyGate` sits *after* ranking. That’s how I keep “same search, same ranking” true.
## The runtime toggle: redaction on the way out
The implementation detail that matters is where anonymization happens: **in runtime code, not in ingestion.**
In this codebase, the anonymizer lives in `app/utils/anonymizer.py`, and the call site is intentionally narrow: `anonymize_search_results()` is applied only to the client search index.
That constraint matters because “privacy mode” is a business workflow requirement—not a blanket system behavior. If you anonymize everything everywhere, you’ll break legitimate internal operations.
```python
# anonymizer.py — runtime redaction applied per-conversation
def anonymize_search_results(results_by_index, privacy_mode=False):
    """Redact identity fields when privacy mode is on. Other indexes pass through."""
    if not privacy_mode:
        return results_by_index

    clients = results_by_index.get("client-search", [])
    for i, client in enumerate(clients):
        # Sequential labels: A, B, C... (result pages are well under 26 entries)
        client["full_name"] = f"Client {chr(65 + i)}"
        if client.get("employer"):
            client["employer"] = anonymize_firm_name(client["employer"])
        if client.get("book_size_aum"):
            client["book_size_aum"] = _anonymize_aum(client["book_size_aum"])
        # location, designations, availability — deliberately untouched

    results_by_index["client-search"] = clients
    return results_by_index


def _anonymize_aum(value):
    """Bin AUM into disclosure-safe ranges."""
    v = float(value) if value else 0
    if v < 25_000_000:
        return "<$25M"
    if v < 100_000_000:
        return "$25M–$100M"
    if v < 250_000_000:
        return "$100M–$250M"
    if v < 500_000_000:
        return "$250M–$500M"
    return ">$500M"
```

Sequential labels (not hashes) so people can say "Client B looks strong" mid-call. Firm names are mapped to an industry type via `anonymize_firm_name()` (shown below) so advisors can still reason about fit. AUM is binned into ranges wide enough that you can't reverse-engineer an identity from the number.
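Run against a toy result set, the rules behave like this (a self-contained, simplified copy of the redactor above, with invented data and collapsed AUM bins):

```python
def _anonymize_aum(value):
    # Simplified bins for the demo; the real table has more tiers.
    v = float(value) if value else 0
    if v < 25_000_000:
        return "<$25M"
    if v < 100_000_000:
        return "$25M–$100M"
    return ">$100M"

def anonymize_search_results(results_by_index, privacy_mode=False):
    if not privacy_mode:
        return results_by_index
    for i, c in enumerate(results_by_index.get("client-search", [])):
        c["full_name"] = f"Client {chr(65 + i)}"
        if c.get("book_size_aum"):
            c["book_size_aum"] = _anonymize_aum(c["book_size_aum"])
    return results_by_index

results = {"client-search": [
    {"full_name": "Jane Doe", "book_size_aum": 40_000_000,
     "location": "Austin, TX", "designations": ["CFP"]},
]}
out = anonymize_search_results(results, privacy_mode=True)
print(out["client-search"][0])
# Identity redacted (name -> "Client A", AUM -> "$25M–$100M");
# location and designations pass through untouched.
```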
## What the UI sees: normal vs privacy mode
I forced myself to design this like a product surface, not a backend trick. The same query should yield the same ordering, but different identity exposure.
Normal mode (identity visible):
- Client name present
- Firm name present
- AUM and production figures present
Privacy mode (identity redacted):
- Client name becomes "Client A/B/C"
- Firm name becomes a generic firm-type label
- AUM becomes a rounded range ($1M–$5M, $5M–$25M, etc.)
And critically:
- location unchanged
- designations (CFP, CFA, CIMA) unchanged
- availability unchanged
That’s what makes it usable mid-call.
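One property worth pinning down in a test: the ordering must be byte-for-byte identical in both modes, since only the gate after ranking differs. A sketch with invented data and a simplified redactor:

```python
def redact(ranked_results, privacy_mode):
    # Applied after ranking, so it can rename but never reorder.
    if not privacy_mode:
        return ranked_results
    return [{**r, "full_name": f"Client {chr(65 + i)}"}
            for i, r in enumerate(ranked_results)]

ranked = [
    {"full_name": "Jane Doe", "score": 0.92},
    {"full_name": "John Roe", "score": 0.87},
]

normal = redact(ranked, privacy_mode=False)
private = redact(ranked, privacy_mode=True)

# Same search, same ranking: scores and order match, identity differs.
assert [r["score"] for r in normal] == [r["score"] for r in private]
assert [r["full_name"] for r in private] == ["Client A", "Client B"]
```

If this assertion ever fails, privacy mode has changed the product, not just the output shape.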
## Firm names don’t get blanked—they get mapped
Blanking firm names is safe, but it’s also dumb.
An advisor doesn't just care where a client's assets are custodied. They care what kind of institution it is. In this platform, the anonymized representation is a firm-type label—things like wirehouse, RIA, bank, insurance.
That mapping is defined in the curator job as `FIRM_TYPE_MAP`.
```python
# anonymizer.py — firm-to-classification mapping (subset shown)
import re

WIREHOUSES = {
    "merrill lynch": "a leading national wirehouse",
    "morgan stanley": "a leading national wirehouse",
    "ubs": "a leading national wirehouse",
    "raymond james": "a leading national wirehouse",
}
RIAS = {
    "fisher investments": "a multi-billion dollar RIA",
    "creative planning": "a multi-billion dollar RIA",
    "captrust": "a multi-billion dollar RIA",
}
BANKS = {
    "regions bank": "a regional banking institution",
    "pnc": "a regional banking institution",
}
INDEPENDENT_BDS = {
    "lpl financial": "a major independent broker-dealer",
    "northwestern mutual": "a major insurance company",
    "new york life": "a major insurance company",
}

def anonymize_firm_name(text):
    """Replace known firm names with generic industry classifications."""
    all_firms = {**WIREHOUSES, **RIAS, **BANKS, **INDEPENDENT_BDS}
    # Sort by length (longest first) to avoid partial matches
    for name in sorted(all_firms, key=len, reverse=True):
        pattern = re.compile(re.escape(name), re.IGNORECASE)
        text = pattern.sub(all_firms[name], text)
    return text
```
This looks almost too simple, but it’s exactly the point: the mapping is explicit, stable, and shared. When you’re in privacy mode, you’re not hiding structure—you’re hiding identity.
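The behavior is easy to check in isolation. Here's a self-contained miniature of the mapping with just two entries (the longest-first sort matters when one firm name could be a substring of another):

```python
import re

# Two-entry subset of the mapping for demonstration.
FIRM_LABELS = {
    "merrill lynch": "a leading national wirehouse",
    "fisher investments": "a multi-billion dollar RIA",
}

def anonymize_firm_name(text):
    """Replace known firm names with generic labels, case-insensitively."""
    for name in sorted(FIRM_LABELS, key=len, reverse=True):
        text = re.compile(re.escape(name), re.IGNORECASE).sub(FIRM_LABELS[name], text)
    return text

print(anonymize_firm_name("Moved book from Merrill Lynch to Fisher Investments"))
# -> Moved book from a leading national wirehouse to a multi-billion dollar RIA
```

Unknown firms pass through unchanged, which is a deliberate trade-off: the map only ever shrinks the identity surface it knows about.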
What surprised me when I first wired this in is how much better the conversation becomes. People stop anchoring on custodian brand names and start talking about fit — AUM range, designations, location.
## The database flag + partial index: privacy mode without paying for it
The runtime toggle is only half the story. The other half is making sure privacy mode doesn’t impose a tax on the 95% of queries that run normally.
So privacy mode is stored as a boolean:
`privacy_mode BOOLEAN DEFAULT false`
And it’s supported by a partial index:
`WHERE privacy_mode = true`
That last clause is the whole trick. It means the system can optimize lookups for the privacy-mode subset without dragging normal-mode performance into it.
```sql
-- privacy_mode migration
-- Adds privacy_mode with a default of false, and a partial index for privacy-mode queries.
ALTER TABLE conversation_state
    ADD COLUMN privacy_mode BOOLEAN DEFAULT false;

CREATE INDEX conversation_state_privacy_mode_true_idx
    ON conversation_state (privacy_mode)
    WHERE privacy_mode = true;
```

(Production note: on a large table, `ADD COLUMN ... DEFAULT false` rewrites the table in older Postgres versions. The safe pattern is add nullable first, backfill in batches, then set the default.)
I like this pattern because it’s honest about reality:
- Most conversations don’t need privacy mode.
- The ones that do need it need it right now.
The partial index is how you get both.
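SQLite also supports partial indexes, so the shape of the idea can be sketched locally (the Postgres migration above is the real target; table contents here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE conversation_state (
        id INTEGER PRIMARY KEY,
        privacy_mode BOOLEAN DEFAULT 0
    )
""")
# Partial index: only rows with privacy_mode = 1 are indexed at all.
conn.execute("""
    CREATE INDEX conversation_state_privacy_mode_true_idx
    ON conversation_state (privacy_mode)
    WHERE privacy_mode = 1
""")
conn.executemany(
    "INSERT INTO conversation_state (privacy_mode) VALUES (?)",
    [(0,)] * 95 + [(1,)] * 5,   # most conversations run in normal mode
)

# The privacy-mode subset stays tiny, so lookups against it stay cheap,
# while the 95 normal-mode rows never pay index-maintenance cost.
rows = conn.execute(
    "SELECT id FROM conversation_state WHERE privacy_mode = 1"
).fetchall()
print(len(rows))
# -> 5
```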
## Training data alignment: the model can’t leak what it never learns to say
Runtime redaction prevents API/UI leakage. But models have their own failure mode: they can “helpfully” repeat sensitive strings if you train them that way.
So I aligned the fine-tuning dataset with the runtime anonymization rules.
There’s a `training_data.jsonl` file with 141 fine-tuning records using matching anonymization rules—specifically including the same firm-type mapping.

```
# training_data.jsonl
# 141 fine-tuning records.
# The dataset applies the same anonymization rules used at runtime,
# including firm-type labels that correspond to FIRM_TYPE_MAP.
```
This is one of those engineering moves that looks boring until you’ve been burned.
If your runtime system says “never show firm names,” but your fine-tuned bullet generator was trained on raw firm names, you’ve built a leak path that doesn’t show up in API logs. It shows up in the CEO’s inbox.
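A cheap guardrail here is a CI-style scan that fails the build if any raw firm name survives in the training set. The file layout and field names below are assumptions about the JSONL shape, not the article's actual tooling:

```python
import io
import json

# Hypothetical denylist: the raw names that must never appear post-anonymization.
RAW_FIRM_NAMES = {"merrill lynch", "fisher investments", "pnc"}

def find_leaks(jsonl_lines):
    """Return (line_number, firm) pairs where a raw firm name survives."""
    leaks = []
    for lineno, line in enumerate(jsonl_lines, start=1):
        record = json.loads(line)
        # Flatten the whole record so leaks in any field are caught.
        blob = json.dumps(record).lower()
        for firm in RAW_FIRM_NAMES:
            if firm in blob:
                leaks.append((lineno, firm))
    return leaks

sample = io.StringIO(
    '{"prompt": "summarize client", "completion": "works at a leading national wirehouse"}\n'
    '{"prompt": "summarize client", "completion": "custodied at Merrill Lynch"}\n'
)
print(find_leaks(sample))
# -> [(2, 'merrill lynch')]
```

In production you'd point this at the real `training_data.jsonl` and share the denylist with the runtime mapping so the two can't drift.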
## Why this is a deployment pattern, not a data pattern
The reason I’m stubborn about this distinction is that it changes how you evolve the system.
A data pattern (separate anonymized index) forces you to keep two corpuses in sync forever.
A deployment pattern (runtime toggle + shared mapping + aligned training data) lets you:
- keep one search surface
- keep one ranking behavior
- keep one ingestion pipeline
- add privacy as a boundary policy
It’s the difference between “we made a privacy copy” and “we made privacy a mode.”
And mode is what humans actually need.
## Nuances and tradeoffs
Privacy mode is not magic; it’s a compromise you can defend.
1) Anonymization has to be selective
If you redact location, designations, and availability, you’ve thrown away the decision.
If you keep names and firms, you’ve thrown away the trust.
So the system explicitly:
- redacts identity fields
- preserves decision fields
2) Consistency matters more than cleverness
The firm-type mapping appears in two places on purpose:
- `curator.py` via `FIRM_TYPE_MAP`
- the 141-record `training_data.jsonl`
That redundancy isn’t accidental. It’s how I keep “privacy mode” from becoming a UI-only promise.
3) Scope control is a security feature
`anonymize_search_results()` is applied only to the client search index.
That’s not just performance hygiene—it’s blast-radius control. Privacy mode is for live client search during calls, not for every operational workflow.
## Closing
When an AI suggests “make a second anonymized index,” it’s solving the wrong problem with the right vocabulary. The problem isn’t that the data needs to be anonymized; it’s that the conversation needs to be safe, immediately, without changing what the system believes is the best match—and once you see privacy as a runtime toggle, the rest of the architecture snaps into place.
🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant