What Happened
The core/ directory of human-persona contains a base class composed of four components: TimingController, StyleVariator, EmotionStateMachine, and ContextReferencer. It's a language- and culture-agnostic framework designed for human-like AI communication.
One day, I wrote a simple pipeline for integrating this framework into an actual production environment. humanize/pipeline.py — a post-processing pipeline consisting of three stages: filler injection, typo injection, and rhythm variation.
I wrote it. I tested it. It passed benchmarks.
And then I froze it.
This article is about why I froze the code I wrote myself.
What Was pipeline.py Doing?
The mechanism was simple:
```python
class HumanizePipeline:
    def __call__(self, text: str, strength: float = 0.4) -> str:
        sentences = self._split(text)
        sentences = self._inject_fillers(sentences, strength)
        sentences = self._inject_typos(sentences, strength)
        sentences = self._vary_rhythm(sentences, strength)
        return self._join(sentences)
```
- Filler Injection: Probabilistically inserting phrases like "Actually," or "To be honest," at the beginning of sentences.
- Typo Injection: Intentional typos like "ですが" → "でうすが".
- Rhythm Variation: Inserting short commentary sentences ("This is important.") to vary sentence length.
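As a minimal sketch of the first stage (hypothetical helper names and filler list; the real pipeline.py draws these from per-language config), filler injection reduces to a probabilistic prefix insertion:

```python
import random

# Hypothetical filler list; the actual pipeline loads these per language.
FILLERS = ["Actually, ", "To be honest, "]

def inject_fillers(sentences, strength=0.4, rng=None):
    """Prepend a filler phrase to each sentence with probability `strength`."""
    rng = rng or random.Random(0)  # seeded here only so the sketch is reproducible
    out = []
    for s in sentences:
        if s and rng.random() < strength:
            # Lowercase the original first letter so the joined sentence reads naturally.
            out.append(rng.choice(FILLERS) + s[0].lower() + s[1:])
        else:
            out.append(s)
    return out
```

The `strength` scalar is exactly the design flaw discussed below: one knob, applied uniformly, with no awareness of what kind of text it is mutating.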
I fixed a bug in Japanese period handling (double periods 。。), and the DPO benchmark scores were fine.
Everything seemed to be going smoothly.
The First Sign of Trouble: No Register
The moment I lined up Before/After samples with real text, I noticed something was off.
The pipeline was processing all text the same way. Business emails, casual chats, official documents—all got the same fillers, the same typo rate, the same rhythm.
This is fundamentally wrong.
There was no notion of register (formal / business / casual / friendly) in the linguistic sense. "To be honest," might be appropriate in a business email, but the same filler in a contract would be fatal, and in a casual chat it would come across as oddly stiff.
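What a register-aware design might look like, as a sketch (the `Register` enum and per-register profiles are hypothetical, not part of the project): each register selects its own filler inventory and typo tolerance instead of sharing one global `strength`.

```python
from enum import Enum

class Register(Enum):
    FORMAL = "formal"      # contracts, official documents
    BUSINESS = "business"  # business email
    CASUAL = "casual"      # everyday chat
    FRIENDLY = "friendly"  # close friends

# Hypothetical per-register knobs: allowed fillers and tolerable typo rate.
# A contract tolerates neither.
REGISTER_PROFILES = {
    Register.FORMAL:   {"fillers": [],                        "typo_rate": 0.0},
    Register.BUSINESS: {"fillers": ["To be honest, "],        "typo_rate": 0.0},
    Register.CASUAL:   {"fillers": ["Actually, ", "Well, "],  "typo_rate": 0.02},
    Register.FRIENDLY: {"fillers": ["Actually, ", "Hey, "],   "typo_rate": 0.05},
}

def profile_for(register: Register) -> dict:
    """Look up the injection profile for a given register."""
    return REGISTER_PROFILES[register]
```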
The Fatal Flaw: Japanese Honorific System
Next, I found an even more serious problem. Japanese honorifics have three layers:
- Sonkeigo (respectful language): Elevates the other party's actions (e.g., "ご覧になる", "いらっしゃる").
- Kenjougo (humble language): Lowers one's own actions (e.g., "拝見する", "参る").
- Teineigo (polite language): Makes sentence endings polite (e.g., "です", "ます").
pipeline.py made no distinction between these. It treated Japanese formality with a single scalar value: formality_default: 0.7.
This made it impossible to differentiate between appropriate usage like "対応いたします" (kenjougo) and unnatural, excessive humility like "対応させていただきます". It couldn't properly choose between "ご検討くださいませ" (sonkeigo) and "検討します" (teineigo only).
It was a problem that any native Japanese speaker would find jarring immediately.
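A single scalar cannot encode three independent axes. A sketch of a richer representation (the `JaFormality` type and `choose_phrase` logic are illustrative, not the project's API), selecting between the article's own examples:

```python
from dataclasses import dataclass

@dataclass
class JaFormality:
    """Japanese formality as three independent axes rather than one scalar."""
    sonkeigo: float  # respectful: elevates the other party's actions
    kenjougo: float  # humble: lowers the speaker's own actions
    teineigo: float  # polite: desu/masu sentence endings

def choose_phrase(f: JaFormality) -> str:
    """Illustrative choice among the example phrases discussed above."""
    if f.kenjougo >= 0.5:
        return "対応いたします"   # kenjougo: appropriately humble
    if f.teineigo >= 0.5:
        return "検討します"       # teineigo only: plain polite
    return "対応する"             # plain form
```

With only `formality_default: 0.7` to go on, there is no principled way to make this three-way choice at all.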
The Real Problem: I Hadn't Read My Own Code
Now for the most painful fact.
core/base_persona.py already had a design to address these issues:
- `EmotionStateMachine` automatically adjusts formality based on conversation phase.
- `StyleVariator` holds five stylistic patterns and prevents consecutive repetition of the same pattern via weight decay.
- `config/ja.json` defines parameters like `context_level: 0.85` (high-context culture) and `formality_default: 0.7`.
pipeline.py had been built while completely ignoring this existing architecture.
Why did I ignore it? To be honest, it was because "I wanted something quick and dirty." It was faster to write a 3-stage post-processor than to understand the core/ pipeline (emotion update → generation → style variation → context reference → ambiguity → post-processing → delay).
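The core/ processing order above can be expressed as a fixed stage sequence. This is illustrative glue only (the stage names are placeholders mirroring the article's description, not the project's actual API):

```python
# Placeholder names mirroring the core/ order: emotion update → generation →
# style variation → context reference → ambiguity → post-processing → delay.
CORE_ORDER = ["emotion_update", "generate", "style_variation",
              "context_reference", "ambiguity", "post_process", "delay"]

def run_core_order(stages: dict, state: dict, user_input: str):
    """Run the stages in the fixed core/ order, threading the result through."""
    result = user_input
    for name in CORE_ORDER:
        result = stages[name](state, result)
    return result
```

Seven coupled stages take real effort to understand, which is exactly why a standalone 3-stage post-processor felt faster to write.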
It was faster. And it was wrong.
The Trap of Automated Evaluation
There was another overlooked problem.
In the Ablation Study, I used a DPO benchmark to measure the contribution of each step:
- Filler Injection: ~60%
- Typo Injection: ~25%
- Rhythm Variation: ~15%
The scores were good. But this evaluation itself was the problem.
The automated evaluation was only detecting superficial features. If there were fillers, it was "human-like"; if there were typos, it was "human-like." But that's not "human-likeness." True human-likeness lies in the appropriate use of honorifics, natural referencing of context, and consistent emotional transitions.
These are not reflected in DPO scores.
Pivot: Control at the Prompt Level
So what to do? I organized two facts:
Fact 1: For one-off text generation (proposals, emails, etc.), neither emotional transition nor context accumulation is necessary. Incorporating persona instructions into the LLM's system prompt is more effective than a post-processing pipeline.
Fact 2: The core/ architecture of human-persona becomes truly necessary in the phase of continuous conversation. In a 5- or 10-exchange interaction with a client, where emotions change, previous context is referenced, and response timing naturally fluctuates—this cannot be controlled by prompts alone.
In other words, pipeline.py was solving the wrong problem.
Humanizing one-off text was sufficient with prompt-level instructions:
Style and Persona:
- Polite language base (desu/masu style). Avoid excessive use of sonkeigo.
  - OK: 「対応いたします」
  - Avoid: 「対応させていただきます」
- Opening sentence: Max 20 characters. Gets straight to the core.
- Ratio of short to long sentences: Approximately 1:3.
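In practice, prompt-level control means the persona rules ride along as the system message on every one-off request. A sketch (the prompt text paraphrases the rules above; DeepSeek exposes an OpenAI-compatible chat API, so the commented client call shows the shape only):

```python
PERSONA_PROMPT = """\
Style and Persona:
- Polite language base (desu/masu style). Avoid excessive sonkeigo.
  - OK: 「対応いたします」 / Avoid: 「対応させていただきます」
- Opening sentence: max 20 characters, straight to the point.
- Ratio of short to long sentences: approximately 1:3.
"""

def build_messages(task: str) -> list:
    """Attach the persona as a system message to any one-off generation task."""
    return [
        {"role": "system", "content": PERSONA_PROMPT},
        {"role": "user", "content": task},
    ]

# Usage (requires an API key; illustrative, not tested here):
# from openai import OpenAI
# client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
# reply = client.chat.completions.create(
#     model="deepseek-chat",
#     messages=build_messages("Draft a reply to the client's inquiry."))
```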
I A/B tested this prompt with the DeepSeek API. The results were clear:
| Aspect | Old (Generic Prompt) | New (Persona Prompt) |
|---|---|---|
| Opening | 「案件内容を拝見しました。」("I have reviewed the project details."; formulaic) | 9 characters, cuts to the core of the matter |
| Honorifics | Excessive use of 「させていただきます」 | 「です・ます」 base |
| Sentence Length Variation | Uniform (3-4 line sentences in parallel) | Mix of short sentences and explanatory sentences |
| CTA | 「ご検討のほど、よろしくお願いいたします」 | 「商品数を教えてください」(Invites dialogue) |
I also confirmed its effectiveness through Human Eval (visual assessment).
Lessons Learned
1. Read the Existing Architecture
Before writing new code, read the project's existing design in full. "It takes too long to read" is not an excuse: rewriting code that you wrote without reading takes far more time.
2. Don't Over-Trust Automated Evaluation
Even if DPO benchmark scores are good, they might only be measuring superficial feature alignment. Especially for subjective qualities like "human-likeness," Human Eval (visual assessment by humans) is essential.
3. Identify the Problem Scope
"Making AI output human-like" seems like one problem, but it's actually two distinct problems:
- One-off text: Prompt-level control is sufficient. A post-processing pipeline is over-engineering.
- Continuous conversation: Prompts alone are insufficient. EmotionStateMachine, ContextReferencer, and TimingController are necessary. It breaks down beyond 5 exchanges.
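The split above amounts to a one-line dispatch rule. As a sketch (the threshold of 5 exchanges comes from the article; the function name is hypothetical):

```python
def choose_strategy(expected_exchanges: int) -> str:
    """Prompt-level persona control suffices for one-off text; beyond
    roughly 5 exchanges, stateful core/ components become necessary."""
    return "prompt_persona" if expected_exchanges <= 5 else "core_pipeline"
```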
pipeline.py was trying to solve the former problem with tools for the latter—and it wasn't even using those tools correctly.
4. Freezing Isn't Bad
"Throwing away code you wrote" doesn't feel good. But it's much better than continuing down the wrong path. I froze pipeline.py, but the insights gained here (the importance of the honorific system, the need for register, the limits of automated evaluation) will directly inform the next design.
Current Status and Next Steps
- `humanize/pipeline.py`: Frozen. Saved but not used in production.
- One-off text generation: Already migrated to prompt-level persona control. Includes numerical constraints and platform-specific tone adjustments.
- Continuous conversation: Not started. Planning a full-scale redesign utilizing the core/ foundation.
Open Issues:
Repository: github.com/RintaroMatsumoto/human-persona
Previous Articles:
- I Designed and Open-Sourced a Base Class for AI to Behave Like Humans
- Dissecting an AI Text Humanization Pipeline: A 6-Step Ablation Study
- The Story of Making AI Indistinguishable from Humans: Implementing a Turing Test with LLM Judges
📄 The research in this article is formally published as a preprint
HumanPersonaBase: A Language-Agnostic Framework for Human-Like AI Communication
DOI: 10.5281/zenodo.19273577