DEV Community

松本倫太郎

#05 Frozen Pipes


I wanted something that worked, fast.

In core/, there were base classes I had spent time carefully designing. Emotion transitions, style variation, response timing, context referencing: five components working together to calculate human-likeness. But writing a new 3-stage post-processing pipeline from scratch was faster than understanding and mastering that architecture.

humanize/pipeline.py. Filler injection, typo injection, rhythm variation. Take text in, make it superficially human-like, return it. I wrote it. Tested it. Passed the benchmarks.
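The pipeline was roughly this shape. This is a minimal sketch, not the actual pipeline.py: the stage names, filler lists, and rates here are illustrative.

```python
import random

def inject_fillers(text: str, rate: float = 0.3) -> str:
    """Prepend a conversational filler with some probability."""
    fillers = ["well,", "hmm,", "you know,"]
    if random.random() < rate:
        return f"{random.choice(fillers)} {text}"
    return text

def inject_typos(text: str, rate: float = 0.02) -> str:
    """Swap adjacent letters at a fixed rate to fake typos."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def vary_rhythm(text: str) -> str:
    """Break long clauses apart to vary sentence rhythm (crude heuristic)."""
    return text.replace(", and ", ". And ")

def humanize(text: str) -> str:
    # Three fixed stages, applied identically to every input --
    # the design flaw discussed below.
    for stage in (inject_fillers, inject_typos, vary_rhythm):
        text = stage(text)
    return text
```

Note what is absent: nothing about who is speaking, to whom, or in what situation. Every input goes through the same three knobs.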

Then I froze it.


I Didn't Read the Blueprint

The moment I lined up the Before/After of the text, something felt off.

The pipeline was processing all text the same way. Formal prose and casual conversation alike: same fillers, same typo rate, same rhythm. There was zero register switching, the linguistic concept of adjusting how you speak based on the situation and audience.
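At minimum, register switching would have meant selecting per-register knobs before any stage runs. A sketch of the idea, with hypothetical names (`RegisterProfile`, `profile_for`) that do not exist in the project:

```python
from dataclasses import dataclass

@dataclass
class RegisterProfile:
    """Per-register knobs the frozen pipeline never had (illustrative)."""
    fillers: tuple
    typo_rate: float
    sentence_break_rate: float

REGISTERS = {
    "formal": RegisterProfile(fillers=(), typo_rate=0.0, sentence_break_rate=0.1),
    "casual": RegisterProfile(fillers=("well,", "hmm,"), typo_rate=0.02, sentence_break_rate=0.4),
}

def profile_for(audience: str, situation: str) -> RegisterProfile:
    # Crude rule: any formal cue wins. Real register detection would
    # look at the text itself, not just metadata like this.
    if "client" in audience or "report" in situation:
        return REGISTERS["formal"]
    return REGISTERS["casual"]
```

Even this toy version encodes the key point: a filler that reads as natural in chat is an error in a client report.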

There was an even more serious problem. Japanese honorifics have three layers: sonkeigo (respectful), kenjōgo (humble), and teineigo (polite). pipeline.py made no such distinction. It treated formality as a scalar value of 0.7. It couldn't differentiate between "taiō itashimasu" and "taiō sasete itadakimasu." Any native Japanese speaker would notice the awkwardness immediately.
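The scalar problem is easy to see once you try to write the type. A sketch, with hypothetical names; the phrase table is only there to make the point concrete:

```python
from enum import Enum
from dataclasses import dataclass

class Layer(Enum):
    SONKEIGO = "respectful"  # elevates the listener's actions
    KENJOGO = "humble"       # lowers the speaker's own actions
    TEINEIGO = "polite"      # neutral desu/masu politeness

@dataclass(frozen=True)
class Formality:
    layer: Layer
    deference: int  # 1 = standard, 2 = heightened

# A single float collapses this space onto one axis. Both phrases
# below would score "about 0.7", yet they are different utterances
# with different degrees of humility:
CHOICES = {
    Formality(Layer.KENJOGO, 1): "taiō itashimasu",
    Formality(Layer.KENJOGO, 2): "taiō sasete itadakimasu",
}

def render(f: Formality) -> str:
    return CHOICES[f]
```

A `formality=0.7` parameter cannot choose between those two phrases, because the choice is not a matter of degree; it is a matter of which layer and how much deference.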

And here's where it really hurts. In core/base_persona.py, a design already existed to address these problems. EmotionStateMachine automatically adjusts formality based on conversation phase. StyleVariator uses weight decay to prevent the same patterns from repeating. The config files have parameters for high-context cultures.
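The post does not show the internals of core/, so this is a guess at the mechanism StyleVariator's description implies: each time a pattern is used, its weight decays, so the same phrasing stops repeating.

```python
class StyleVariator:
    """Weight-decay pattern selection (reconstruction, not the real
    core/ implementation): picking a pattern halves its weight, so
    recently used phrasings lose out to fresher ones."""

    def __init__(self, patterns, decay=0.5):
        self.weights = {p: 1.0 for p in patterns}
        self.decay = decay

    def pick(self) -> str:
        # Choose the currently heaviest pattern, then decay it.
        choice = max(self.weights, key=self.weights.get)
        self.weights[choice] *= self.decay
        return choice
```

Whatever the real implementation looks like, the contrast stands: pipeline.py drew fillers uniformly at random with no memory, so nothing prevented the same tic from appearing three replies in a row.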

I had completely ignored all of it. It was fast. And it was wrong.


What the Metrics Were Actually Measuring

In the Ablation Study, I measured scores using machine evaluation. The numbers looked good. But the evaluation itself was the problem.

Fillers present? "Human-like." Typos present? "Human-like." Machine evaluation was only detecting surface-level features. It wasn't capturing the essence of human-likeness: appropriate use of honorifics, natural referencing of context, consistent emotional progression.
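In effect, the evaluation reduced to something like this. A reconstruction of the failure mode, not the actual benchmark code; the word list and weights are made up:

```python
# Tiny whitelist standing in for a dictionary; anything alphabetic
# and unlisted counts as a "typo". Deliberately crude.
COMMON = {"hmm", "well", "you", "know", "this", "is", "it", "the",
          "a", "and", "fine", "works", "system", "will", "respond"}

def looks_typoed(word: str) -> bool:
    w = word.strip(".,!?").lower()
    return w.isalpha() and len(w) > 2 and w not in COMMON

def naive_humanness_score(text: str) -> float:
    """Surface-feature scoring: fillers + apparent typos.
    Says nothing about honorifics, context, or emotional consistency."""
    score = 0.0
    if any(f in text.lower() for f in ("well,", "hmm,", "you know,")):
        score += 0.5  # fillers present -> "human-like"
    if any(looks_typoed(w) for w in text.split()):
        score += 0.5  # typos present -> "human-like"
    return score
```

A scorer like this hands a perfect grade to keigo-mangling text, as long as it stammers and misspells. That is exactly the benchmark my pipeline was passing.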

I took comfort in good numbers without questioning what those numbers were measuring. That was my failure.


Two Problems

"Making AI output seem human" looks like one problem, but it was actually two.

  1. One-shot text generation. No need for emotion transitions or context accumulation. Embedding persona instructions in the LLM's system prompt is more effective than a post-processing pipeline.

  2. Ongoing conversation. Over 5 turns, 10 turns of exchange, emotions shift, context is referenced, timing fluctuates; none of this can be controlled with prompts alone. This is where the five components in core/ become truly necessary.

pipeline.py was trying to solve the first problem with tools meant for the second. And it wasn't even using those tools properly.
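The split suggests a routing shape like the following. The function name and prompt text are illustrative, not code from the project:

```python
def humanize_request(text_or_history):
    """Route by problem type (sketch).

    One-shot generation: the persona lives in the system prompt.
    Multi-turn conversation: emotion, context, and timing must be
    tracked as state outside the prompt -- the job of core/'s
    components, not a post-processing pass.
    """
    if isinstance(text_or_history, str):
        # Problem 1: one-shot. A persona-laden system prompt suffices.
        return {
            "system": "You are a casual, slightly distractible writer...",
            "user": text_or_history,
        }
    # Problem 2: ongoing conversation. Prompts alone can't hold
    # drifting emotion and accumulated context across 5-10 turns.
    state = {"turn": len(text_or_history), "emotion": "neutral"}
    return {
        "system": f"(stateful persona, turn {state['turn']})",
        "history": text_or_history,
    }
```

pipeline.py sat in neither branch: it post-processed one-shot output with machinery that only pays off in the stateful branch.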


Freezing It

I froze pipeline.py. Kept it saved, but won't use it.

Freezing code you wrote yourself doesn't feel good. But it's far better than continuing in the wrong direction. The insights gained here don't disappear. They feed directly into the next design.

  • The need for register: human-likeness is impossible without situational speech style switching
  • The importance of the honorific system: Japanese is not simple enough to treat formality as a scalar value
  • The limits of machine evaluation: if you don't question what the numbers are measuring, good scores are meaningless

There was no shortcut. I should have engaged with core/'s architecture from the start.
