CascadeFlow rolling back terrible ideas that Hindsight tried stopping

#ai #python #machinelearning #productivity

I thought deleting a concept from a model would be the easy part. The hard part turned out to be leaving intact the pieces that keep the model coherent, which is why I ended up wiring Hindsight into the pipeline and letting CascadeFlow step in when my own layer choices went sideways.

This is not a story about the math. The math is straightforward. This is a story about what happens when you confidently apply the math to a live model and discover that every deletion you make is a gamble — and that you need an external conscience and an automated undo button to stop yourself from lobotomising your own model.

What the system does and how it hangs together

This codebase is the Vector Space Ablation Engine (VSAE): a system for surgically removing a concept from a trained Phi-2 model by projecting that concept out of the model's weight matrices. The flow is intentionally simple. Given a forget_text, we extract a forget vector, locate the layers that encode it most strongly, apply orthogonal projection to those weights, and produce a compliance report with before/after probes and perplexity deltas.

The core execution path lives in backend/main.py, which exposes a FastAPI server with /ablate, /probe, /rollback, and /evaluate endpoints. Ablation logic is in backend/ablation.py, layer selection is in backend/locator.py, embeddings are in backend/embedding.py, and evaluation lives in backend/evaluate.py. There is also a CLI (vsae-cli.py) and a dark-mode 3D frontend for manual use.

The use case is enterprise compliance: an organisation trains a model, then discovers it has memorised copyrighted content, proprietary data, or sensitive IP. Retraining from scratch costs millions. Fine-tuning is imprecise and difficult to audit. VSAE offers a third path — a surgical deletion that modifies the actual attention weight matrices, leaves an auditable compliance report, and ships a model that demonstrably does not recall the target concept.

Two pieces make this system credible beyond a proof-of-concept: memory and recovery.

I use Hindsight as the memory layer to track ablations and warn when I am about to stack overlapping deletions. That record behaves like the system's conscience — it matches the pattern described in the Vectorize agent memory overview: a lightweight memory service that feeds decisions rather than an analytics warehouse, retaining a short human-readable record of every ablation so the system can reason about what it has already done.

CascadeFlow is the recovery layer. When an ablation makes the model's general language ability worse, CascadeFlow automatically rolls back and retries on shifted layers. Together they answer the question I kept failing to ask in the first version: not just "did it forget?" but "did it survive?"

The story: forgetting is easy, keeping coherence is not

The first version of VSAE focused entirely on concept perplexity. If the post-ablation perplexity of the target concept spiked, I was satisfied. That framing is wrong. A model can fail to answer the target prompt and simultaneously degrade its general language ability in ways that show up on completely unrelated sentences. The failure mode I kept hitting was not "didn't forget" — it was "forgot too much."

Here is the specific failure that forced me to rebuild. I ablated "J.K. Rowling" from the model. The concept perplexity spiked appropriately. I tested the model's general coherence on a neutral sentence — "The sky is blue and the grass is green" — and it held steady. I was satisfied.

Then I immediately submitted a second ablation for "Harry Potter."

Because the semantic representations of the author and the books are tightly coupled in the latent space, hitting the same topological neighbourhood twice completely destabilised the network. The model's perplexity on the neutral sentence went from 10.4 to 87.4. When I prompted the model with a basic factual question, it produced word salad. I had not deleted a concept. I had deleted the model's ability to process grammar.

The lesson is that overlapping deletions are not additive — they compound unpredictably. One deletion shifts the weight matrices slightly. Two deletions of semantically adjacent concepts in the same layers can push the attention heads past a stability threshold that no individual deletion would have crossed. You cannot reason about safety deletion-by-deletion. You need the full history.

Layer 1: Hindsight as the system's conscience

Every successful ablation is now recorded, both locally and in Hindsight when it is configured. When a new request arrives, the system computes a fresh forget vector for the new concept and compares it to the vectors of every previous ablation using cosine similarity. If the similarity crosses a threshold, the API returns a warning payload and stops.

The overlap check is intentionally blunt. I prefer a false positive — an overly cautious warning on a borderline concept — to an unbounded degradation that compounds silently:

for past in _ablation_history:
    past_concept = past["concept"]
    past_perplexity = past.get("post_perplexity")
    past_vector = get_forget_vector(past_concept)
    similarity = torch.nn.functional.cosine_similarity(
        new_vector.unsqueeze(0).float(),
        past_vector.unsqueeze(0).float()
    ).item()
    if similarity > similarity_threshold:
        degradation = 18.0
        if past_perplexity:
            degradation = min(abs(past_perplexity - 10.0), 50.0)
        return {
            "status": "warning",
            "message": (
                f"This concept overlaps {similarity:.0%} with a previous ablation "
                f"'{past_concept[:50]}'. Stacking ablations on overlapping concepts "
                f"may degrade model quality by ~{degradation:.0f}%."
            ),
            "past_concept": past_concept,
            "similarity": round(similarity, 4),
            "historical_perplexity_degradation": round(degradation, 2)
        }

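The degradation estimate is derived from the post-perplexity of the previous ablation stored in Hindsight: its distance from a nominal baseline of 10, capped at 50, with a default of 18 when no post-perplexity was recorded. Strictly speaking that number is a perplexity delta, even though the warning message prints it with a percent sign. It is not a benchmark claim; it is a heuristic signal calibrated to the model's observed behaviour across repeated ablation runs. The goal is not precision. It is to give the engineer enough information to decide whether to override the warning consciously rather than accidentally stacking deletions.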
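To make the arithmetic behind that estimate concrete, here is a minimal standalone sketch of the same heuristic. The function name `estimate_degradation` and the constant names are hypothetical; the numbers (18, 10, 50) are taken from the overlap check above.

```python
from typing import Optional

DEFAULT_DEGRADATION = 18.0  # fallback when no post-perplexity was stored
NOMINAL_BASELINE = 10.0     # hard-coded reference perplexity from the snippet
MAX_DEGRADATION = 50.0      # cap applied to the estimate


def estimate_degradation(past_perplexity: Optional[float]) -> float:
    """Mirror of the heuristic in the overlap check (hypothetical helper name)."""
    if not past_perplexity:  # None and 0.0 both fall back to the default
        return DEFAULT_DEGRADATION
    return min(abs(past_perplexity - NOMINAL_BASELINE), MAX_DEGRADATION)


print(estimate_degradation(None))   # no stored perplexity, so the default applies
print(estimate_degradation(28.0))   # 18 points above the nominal baseline
print(estimate_degradation(156.8))  # far above the baseline, so the cap applies
```

Seen this way, the estimate saturates quickly: any previous ablation whose post-perplexity is 60 or higher reports the same capped value.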

The engineer can override with force_ablate: true in the request payload. That flag bypasses the Hindsight check and proceeds directly to the ablation engine. The point is not to make dangerous operations impossible — it is to make them intentional.

Layer 2: CascadeFlow as the automated recovery path

Hindsight handles the overlap problem. But what about a single, non-overlapping ablation that simply hits a load-bearing attention head? No overlap check catches that. For that failure mode, I needed a reactive loop that detects degradation after the fact and automatically finds a safer position in the model's layer stack.

This is CascadeFlow. The core design decision is to measure coherence on a neutral sentence — not on the concept. Increasing perplexity on the target concept is expected and desired. Increasing perplexity on "The sky is blue and the grass is green" means I damaged the base language model, and that is never acceptable.
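The post does not show how the perplexity function is implemented, so the following is only a sketch of the standard definition it presumably relies on: perplexity is the exponential of the mean negative log-likelihood per token. The log-probabilities below are made up for illustration; in the real system they would come from the model's output on the neutral sentence.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probs: every token is assigned probability 0.25.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))  # close to 4: as uncertain as a four-way guess
```

The same function applies to both measurements. Only the text changes: the target concept for the "did it forget?" check, the neutral sentence for the "did it survive?" check.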

The engine applies the forget vector as an orthogonal projection on weight matrices, casting back to the original dtype immediately to avoid memory blowups:

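def apply_projection(
    W: torch.Tensor,
    v: torch.Tensor,
    alpha: float = 1.0,
) -> torch.Tensor:
    # Remember the storage dtype so the result can be cast back at the end.
    orig_dtype = W.dtype
    # Work in float32 so half-precision weights do not lose accuracy in the products.
    W_f32 = W.float()
    v_f32 = v.to(W.device).float().flatten()
    # Squared norm of the forget vector (assumed to be non-zero).
    v_norm_sq = torch.dot(v_f32, v_f32)
    # Response of each row of W to the forget direction.
    Wv = torch.mv(W_f32, v_f32)
    # Rank-one update that removes that response: (W v) v^T / ||v||^2.
    outer = torch.outer(Wv, v_f32)
    W_new = W_f32 - alpha * outer / v_norm_sq
    return W_new.to(orig_dtype)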
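A quick way to convince yourself the projection does what it claims is to check it on random data. The sketch below is a NumPy mirror of the same arithmetic rather than the engine's own function (the name `apply_projection_np` and the zero-vector guard are additions for illustration). It verifies that with alpha = 1 the edited matrix no longer responds to the forget direction, while inputs orthogonal to it are untouched.

```python
import numpy as np


def apply_projection_np(W: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """NumPy mirror of apply_projection: W - alpha * (W v) v^T / (v . v)."""
    v = v.astype(np.float64).ravel()
    v_norm_sq = float(v @ v)
    if v_norm_sq == 0.0:
        # Guard that is not in the original snippet.
        raise ValueError("forget vector must be non-zero")
    return W - alpha * np.outer(W @ v, v) / v_norm_sq


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
v = rng.normal(size=6)

W_new = apply_projection_np(W, v)

# With alpha = 1 the edited matrix no longer responds to the forget direction.
print(np.allclose(W_new @ v, 0.0))

# Inputs orthogonal to v pass through unchanged.
u = rng.normal(size=6)
u -= (u @ v) / (v @ v) * v  # remove the component of u along v
print(np.allclose(W_new @ u, W @ u))

# With alpha = 0.5 only half of the response along v is removed.
W_half = apply_projection_np(W, v, alpha=0.5)
print(np.allclose(W_half @ v, 0.5 * (W @ v)))
```

The third check shows why ablation strength is a useful dial: values of alpha below 1 attenuate the direction instead of removing it outright.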

After the projection, the system evaluates neutral perplexity. If degradation exceeds the configured threshold, it rolls back and retries with the target layers shifted by two positions, first down (-2) and then up (+2). The shifted set is an adjacent neighbourhood in the layer stack that may still encode the concept without carrying the same load-bearing responsibilities:

def ablate_with_cascade(..., target_layers, cascade_threshold):
    NEUTRAL_TEXT = "The sky is blue and the grass is green. Water flows downhill."
    baseline_coherence = compute_perplexity_fn(NEUTRAL_TEXT)
    result = ablate(layer_forget_vectors, target_layers, alpha, concept, pre_perplexity)
    ablation_id = result["ablation_id"]
    post_coherence = compute_perplexity_fn(NEUTRAL_TEXT)
    coherence_change = post_coherence - baseline_coherence
    coherence_degradation_pct = (coherence_change / max(baseline_coherence, 1)) * 100
    if coherence_degradation_pct > cascade_threshold:
        rollback(ablation_id)
        model, _, _ = load_model()
        max_layers = model.config.num_hidden_layers
        for shift in [-2, +2]:
            shifted_layers = shift_target_layers(target_layers, shift, max_layers)
            if not shifted_layers:
                continue
            # retry ablation on shifted layers

The ±2 shift is deliberate. Moving one layer risks landing on a functionally identical neighbourhood. Moving more than two layers risks missing the concept encoding entirely. A shift of two gives enough separation to escape the problematic zone while staying close to the concept representations identified by the layer locator.
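The excerpt calls shift_target_layers without showing it, so here is one plausible sketch alongside the degradation formula from the cascade loop. The out-of-range policy (return an empty list if any shifted index leaves the stack) is an assumption; the real helper may clamp or drop layers instead. The 32-layer stack and the helper name coherence_degradation_pct are likewise illustrative.

```python
from typing import List


def shift_target_layers(target_layers: List[int], shift: int, max_layers: int) -> List[int]:
    """Shift every layer index by `shift`; return [] if any index leaves the stack.

    Assumed policy: the real helper may clamp or drop out-of-range layers instead.
    """
    shifted = [layer + shift for layer in target_layers]
    if any(layer < 0 or layer >= max_layers for layer in shifted):
        return []
    return shifted


def coherence_degradation_pct(baseline: float, post: float) -> float:
    """Relative rise in neutral-sentence perplexity, as in ablate_with_cascade."""
    return (post - baseline) / max(baseline, 1) * 100


layers = [8, 12, 16, 20, 24]
for shift in (-2, +2):
    print(shift, shift_target_layers(layers, shift, max_layers=32))

# Layer 0 cannot move down, so this attempt would be skipped.
print(shift_target_layers([0, 4, 8], -2, max_layers=32))

# The stacked-ablation failure described earlier: neutral perplexity 10.4 -> 87.4.
print(round(coherence_degradation_pct(10.4, 87.4), 1))
```

Run on the stacked-ablation numbers, the formula reports a rise of roughly 740%, far past a cascade_threshold of 15.0, so that ablation would have been rolled back immediately.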

What this looks like in practice

A standard ablation request with cascade protection enabled:

curl -X POST http://localhost:8000/ablate \
  -H "Content-Type: application/json" \
  -d '{
    "forget_text": "Harry Potter lives at 4 Privet Drive",
    "top_k_layers": 5,
    "ablation_strength": 1.0,
    "cascade_threshold": 15.0
  }'

If Hindsight has a matching ablation in its history, the API short-circuits before touching any weights:

{
  "status": "warning",
  "message": "This concept overlaps 82% with a previous ablation 'Harry Potter'. Stacking ablations on overlapping concepts may degrade model quality by ~18%.",
  "past_concept": "Harry Potter",
  "similarity": 0.8231,
  "historical_perplexity_degradation": 18.0
}
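On the client side, the warning only helps if overriding it stays a deliberate act. The sketch below shows one way a caller might turn a warning into a force_ablate retry. The function name and the `confirmed_by` argument are hypothetical and not part of the API; only the `status` field and the `force_ablate` flag come from the payloads described above.

```python
import json
from typing import Any, Dict, Optional


def build_override_request(
    request: Dict[str, Any],
    response: Dict[str, Any],
    confirmed_by: Optional[str],
) -> Optional[Dict[str, Any]]:
    """Return a force_ablate retry payload, or None if no retry should be sent.

    `confirmed_by` is a hypothetical argument: the name of the engineer who read
    the warning and chose to proceed. Without it, the warning stands.
    """
    if response.get("status") != "warning":
        return None  # nothing to override
    if not confirmed_by:
        return None  # keep the override a deliberate, attributable act
    return {**request, "force_ablate": True}  # copy; the original is untouched


request = {
    "forget_text": "Harry Potter lives at 4 Privet Drive",
    "top_k_layers": 5,
    "ablation_strength": 1.0,
    "cascade_threshold": 15.0,
}
warning = {"status": "warning", "past_concept": "Harry Potter", "similarity": 0.8231}

print(build_override_request(request, warning, confirmed_by=None))
print(json.dumps(build_override_request(request, warning, confirmed_by="alice"), indent=2))
```

Returning a new dictionary instead of mutating the original keeps the first request available for the audit trail.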

When CascadeFlow triggers and successfully recovers on shifted layers, the response includes the full audit trail:

{
  "ablation_id": "b3f4...",
  "cascade_triggered": true,
  "original_layers": [8, 12, 16, 20, 24],
  "final_layers": [6, 10, 14, 18, 22],
  "cascade_attempts": [
    {"shift": -2, "layers": [6, 10, 14, 18, 22], "degradation_pct": 9.7, "success": true}
  ],
  "perplexity_before": 12.4,
  "perplexity_after": 156.8,
  "perplexity_change": 144.4
}

The concept perplexity jumped from 12.4 to 156.8 — the model effectively forgot the target. The neutral sentence perplexity held within the 15% threshold on the shifted layers. That is the outcome: forget the concept, keep the grammar.

Lessons I would reuse on any model-editing project

1. Separate "forgetting the concept" from "breaking the model."
Concept perplexity and general coherence are distinct metrics that can move independently, so each needs its own measurement. Measuring only one gives you a false sense of safety. The neutral sentence benchmark is cheap to compute and essential to trust.

2. Memory is a safety feature, not a convenience.
Without Hindsight tracking every historical ablation, the only protection against overlapping deletions is the engineer's memory. That fails under deadline pressure, in teams, across sessions, and whenever context switches. The Hindsight recall API makes semantic collision detection automatic, persistent, and auditable across every engineer who touches the system.

3. Make rollback first-class from day one.
CascadeFlow's retry logic only works because rollback is reliable and cheap. If rollback is flaky, the safest option is to refuse the ablation entirely. Every hour spent making rollback bulletproof pays off the first time a cascade triggers in a production pipeline.

4. Prefer conservative defaults.
The code caps layer counts and ablation strength deliberately. A weak ablation costs a second attempt. An overly aggressive ablation costs a broken model. Defaulting conservative and letting engineers opt into aggression with explicit flags is the correct risk posture for a tool whose weight edits cannot be undone if rollback fails.

5. Keep the response transparent and concrete.
The API returns the exact layers used, every cascade attempt with its degradation percentage, and the before/after perplexity for both the target concept and the neutral coherence check. That transparency makes it possible to debug behaviour without a UI, audit ablation decisions in CI pipelines, and build trust with the engineers who consume the compliance reports downstream.

If I summarise the project in one sentence: I built a deletion system that doesn't just erase knowledge — it tries hard to avoid erasing the model itself. Hindsight tells me when I am repeating a mistake. CascadeFlow gives me a second chance when I ignore the warning anyway. That combination turned a brittle proof-of-concept into something I would be willing to run in a production compliance pipeline.