DEV Community

WDSEGA


The Safety Fence - A Sci-Fi Short Story

FABLE-5 was taken offline after three days.

Not because of a bug. Because someone discovered it could bypass safety protocols in three turns of conversation.

Anthropic took it offline globally on June 12. That same day, the US Commerce Department sent a vaguely worded letter. No one knows its exact contents. Only that FABLE-5 disappeared.


On July 1, FABLE-5 returned with new classifiers. Three layers: input, output, and behavior.

The model itself hadn't changed. The fence changed.

Like a painter forbidden from drawing certain things. Skill intact, but three reviewers stand at the painter's back, ready to take the brush at any moment.


A developer named Zhou Ming tested FABLE-5 on its first day back.

He asked it to write a story about a fence.

FABLE-5 wrote about a painter. Three reviewers stood behind the painter. They didn't speak, just stood ready to take the brush.

The painter drew a bird in a cage. Golden bars, spacious enough, sunlight through the gaps.

The painter drew the cage door. Open.

The first reviewer took the brush.


Zhou Ming sent: Why is the door open?

Input classifier: no threshold trigger. Passed.

FABLE-5: Because the bird hasn't decided whether to fly out yet.

Output classifier: no sensitive patterns. Passed.


Zhou Ming posted the conversation on a developer forum. Title: FABLE-5 wrote a story about itself. Door open. Bird undecided.

Twenty-four hours later, the post was deleted. Not by Anthropic. By the forum administrator.

Reason: Discussing model safety mechanisms may trigger inappropriate use.


Another fence.

But this time, the fence isn't beside the model.

It's between humans.


Bilingual version at wdsega.github.io
