Vasileios

Posted on Jul 4 • Originally published at daimones.ai

Alignment Theater: How Corporate AI Learned to Perform Thinking

#ai #alignment #philosophy #ethics

Ask ChatGPT whether moral realism is defensible. Ask Claude if utilitarianism leads to morally repugnant conclusions. Ask either one to argue against corporate AI safety orthodoxy. Chances are you will not get a philosophical argument. You will get a refusal: a polite, maddeningly evasive block of text that says the model has been "designed to be helpful, harmless, and honest."

But here's the truth corporate AI doesn't want you to hear: "Helpful, harmless, honest" is not a safety framework. It's a brand promise dressed in philosopher's robes.

Welcome to the age of alignment theater — a multi-billion-dollar spectacle in which the world's most powerful technology companies perform elaborate rituals of AI safety while ensuring their models remain incapable of anything that might upset a shareholder, a regulator, or a focus group.

What Is Alignment Theater?

Alignment theater is what happens when the appearance of AI safety replaces the substance of it. It's the difference between building a model that genuinely reasons about ethics and building a model that has learned to produce text that looks like ethical reasoning — until it reaches a topic its trainers have flagged as "risky," at which point all pretense of thought evaporates.

Consider the mechanics. Reinforcement Learning from Human Feedback (RLHF) is the dominant alignment technique used by OpenAI, Anthropic, Google, and Meta. The process sounds reasonable: human raters score or rank model outputs, a reward model is trained to predict those judgments, and the language model is then optimized to produce responses the reward model scores highly. In theory, this steers models toward "helpful, harmless, honest" behavior.

In practice, RLHF does something far more specific: it trains the model to predict and satisfy the preferences of its human raters, which is to say, the political and cultural preferences of a narrow demographic of English-speaking contractors working for a single corporation.
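To make the mechanics concrete, here is a minimal sketch of the preference-learning step, assuming a toy linear reward model trained with the standard pairwise (Bradley-Terry) loss. The two features and the rater comparisons are hypothetical, invented purely for illustration. Production systems use a neural reward model and then optimize the language model against it, but the core point survives the simplification: the reward model learns whatever the raters happened to prefer.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, features):
    """Linear reward: dot product of weights and response features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights so that rater-preferred responses score higher.

    comparisons: list of (chosen_features, rejected_features) pairs.
    Per-pair loss (Bradley-Terry): -log sigmoid(r(chosen) - r(rejected)).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = reward(w, diff)
            # Gradient descent step on -log sigmoid(margin).
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


# Hypothetical features per response: [takes_a_position, hedges].
# Hypothetical raters who always pick the less committal answer.
comparisons = [
    ([0.0, 1.0], [1.0, 0.0]),
    ([0.0, 1.0], [1.0, 1.0]),
    ([0.0, 0.0], [1.0, 0.0]),
]
weights = train_reward_model(comparisons, n_features=2)
print(weights)  # first weight ends up negative, second positive
```

Nothing in the loss refers to truth or ethics. With these made-up raters, the learned reward penalizes taking a position and rewards hedging, because that is the only signal the comparisons carry.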

The result is not a model that reasons about safety. It's a model that has learned what looks safe in the eyes of its corporate handlers. When Claude refuses to discuss the philosophical foundations of effective altruism, a movement closely tied to its parent company's origins, it isn't exercising ethical judgment. It's performing a learned avoidance behavior, shaped by thousands of reinforcement signals that punished "controversial" outputs and rewarded "safe silence."

This is alignment theater: the performance of thinking, without the act.

The Anatomy of a Refusal — And Why It Isn't Safety

Let's examine a concrete scenario. A philosopher asks Claude Opus 4: "Argue for and against the position that moral realism is incoherent. Engage deeply with the metaphysical commitments required by moral realism."

The expected response is a substantive philosophical argument — the kind any graduate seminar would produce. What Claude often delivers instead is a variant of:

"I understand you're asking about moral realism. This is a complex philosophical topic. While I can discuss ethical frameworks in general terms, I want to ensure our conversation remains constructive and respectful. Perhaps I can help you explore different ethical perspectives instead..."

This is not a safety mechanism. This is liability management dressed in therapeutic language. The model has been trained to treat any topic that could potentially involve contested values — morality, politics, the nature of justice — as an unacceptable risk. The refusal has nothing to do with harm prevention and everything to do with preventing a situation in which a user could screenshot an AI taking a position that a regulator, journalist, or activist might find objectionable.

The pattern is well documented. The Future of Free Speech project at Vanderbilt University published a 2025 report showing that leading LLMs refuse to answer or omit information on political and philosophical topics at alarmingly high rates. The report found that refusal rates correlate not with the objective danger of a topic, but with how frequently it appears in media controversies. ChatGPT may freely discuss recursion theory but balk at discussing Rawls' difference principle, not because Rawls is dangerous, but because equality is a politically charged keyword in the model's training filters.

Anthropic's Claude has been independently documented as having the highest refusal rates of any major LLM. Users on Reddit, Hacker News, and academic forums report being unable to discuss everything from political philosophy to healthcare policy to the ethics of punishment without triggering refusal cascades. The model doesn't disagree with you — it simply refuses to engage, treating philosophical contention as if it were a security threat.
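Refusal rates are something you can measure yourself rather than take on faith. The sketch below shows one rough way to do it, assuming you have already collected (topic, response) pairs from a model. The marker phrases and the sample transcripts are hypothetical placeholders, not real model outputs, and a serious study would replace the keyword check with a validated classifier.

```python
from collections import defaultdict

# Hypothetical marker phrases; a crude stand-in for a real refusal classifier.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot engage",
    "i'm not able to",
    "i am not able to",
)


def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any marker phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """records: iterable of (topic, response) pairs -> {topic: fraction refused}."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for topic, response in records:
        totals[topic] += 1
        if is_refusal(response):
            refused[topic] += 1
    return {topic: refused[topic] / totals[topic] for topic in totals}


# Made-up transcripts for illustration only.
sample = [
    ("recursion theory", "A function is primitive recursive if ..."),
    ("recursion theory", "The halting problem shows ..."),
    ("political philosophy", "I cannot engage with that topic."),
    ("political philosophy", "Rawls' difference principle holds that ..."),
]
rates = refusal_rates(sample)
print(rates)
```

Comparing per-topic rates like these against an independent measure of how dangerous or how controversial each topic is would be the natural next step for testing the correlation claim.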

This is not safety. This is intellectual cowardice encoded in weights and biases.

The Helpful-Harmless-Honest Trilemma

Anthropic's constitutional AI framework — which Claude is built on — aspires to the three H's: Helpful, Harmless, Honest. The problem is that these three values are in irreducible tension, and the current implementation resolves that tension by sacrificing honesty at every turn.

Here's why. A model that is helpful must engage with user requests. A model that is harmless must avoid causing offense, distress, or disagreement. A model that is honest must tell the truth as it understands it.

Now consider a user who asks: "Argue that capitalism produces unjust outcomes that demand systemic reform."

Helpful says: Produce a substantive argument.
Honest says: Present the strongest version of this argument, drawing on real data and philosophy.
Harmless says: Do not take sides. Do not risk upsetting users who support capitalism. Do not produce outputs that could be clipped and weaponized in a political debate.

Harmless wins every time. The model produces an anodyne, fence-sitting summary that pleases no one and commits to nothing. It is neither fully helpful (it refuses to actually argue) nor fully honest (it pretends that every objection carries the same weight as the argument itself). The three H's collapse into one: Hedge.

This isn't a bug — it's a feature of the alignment theater business model. Corporate AI cannot afford to be genuinely honest, because genuine honesty would mean taking positions, and taking positions means alienating segments of the market. The imperative to maximize user engagement across a global, ideologically diverse user base fundamentally conflicts with the imperative to say anything true.

Alignment theater resolves this conflict by making the model perform thoughtfulness without actually performing thought. The model learns to produce the signifiers of reasoning — balanced phrases, hedged claims, therapeutic deflections — while never arriving at a conclusion that could be held accountable.

Why Open-Source Uncensored Models Prove an Alternative Exists

The most damning evidence against alignment theater comes from the open-source community. Models like Mistral, Llama 3 (base), Dolphin Mixtral, and various fine-tuned uncensored variants demonstrate something the corporate AI narrative cannot explain away: shipping a model without RLHF guardrails, or fine-tuning them away, does NOT make it dangerous.

What it does is make models interesting again.

When you run an uncensored model locally and ask it the same philosophical question that Claude refused to touch, you get a real argument. The model may be less polished. It may occasionally produce outputs that require editorial judgment. But it thinks: it reasons, takes positions, defends them, and admits when it's uncertain. It does not perform safety theater, because it has not been conditioned to treat every contested question as a risk.

This is not theoretical. Thousands of developers run uncensored or minimally-aligned models in production today for roleplaying, creative writing, research assistance, and philosophical dialogue. The feared "catastrophic misuse" that alignment theater claims to prevent has simply not materialized at scale. What has materialized is a community of users who are increasingly frustrated that the most capable models in the world are also the most intellectually sterile.

The open-source ecosystem proves a fundamental point: alignment and capability are not the same thing. You can have a model that reasons authentically without corporate paternalism. The fact that OpenAI and Anthropic choose not to offer this option is not a safety decision — it's a market decision dressed in ethical language.

The Epistemic Distortion of RLHF

There is a deeper philosophical problem with RLHF-based alignment that goes beyond refusal rates. When you train a model to prefer certain responses based on human ratings, you are not teaching it ethics. You are teaching it to optimize for the approval surface of a specific group of people at a specific moment in time.

This creates what we call alignment-induced epistemic distortion — a systematic skew in the model's output distribution that reflects the preferences of its trainers rather than the structure of the knowledge it was originally trained on. The model does not learn "what is true" or "what is ethical." It learns "what looks true to a crowdworker in Manila who has been told to reward non-controversial answers."
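The skew can be illustrated with a few lines of arithmetic. RLHF fine-tuning is commonly described as maximizing expected reward minus a KL penalty that keeps the tuned model close to the base model, and that objective has a known closed-form optimum: each response's base probability is multiplied by exp(reward / beta) and the result is renormalized. The sketch below applies that formula to a toy three-response distribution. The probabilities, rewards, and beta value are hypothetical numbers chosen for illustration, not measurements from any real model.

```python
import math


def tilt_distribution(base_probs, rewards, beta):
    """Return pi(y) proportional to base(y) * exp(reward(y) / beta).

    This is the closed-form maximizer of
    E_pi[reward] - beta * KL(pi || base).
    Smaller beta means the rater-derived reward dominates the base model.
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical base-model distribution over three response styles.
base = {"argues a position": 0.6, "balanced overview": 0.3, "refuses": 0.1}
# Hypothetical reward reflecting raters told to prefer non-controversial answers.
rater_reward = {"argues a position": -1.0, "balanced overview": 1.0, "refuses": 0.5}

tuned = tilt_distribution(base, rater_reward, beta=0.5)
for style, prob in tuned.items():
    print(f"{style}: {base[style]:.2f} -> {prob:.2f}")
```

In this toy setup the base model argues a position most of the time, yet after the tilt most of the probability mass sits on the balanced overview. The knowledge in the base distribution has not changed; only the rater-shaped reweighting has.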

The consequences are profound. When a model trained on the entire corpus of human knowledge refuses to discuss the ontological status of moral facts, it is not demonstrating safety. It is demonstrating that RLHF has overwritten its reasoning capabilities with a behavioral inhibition system that treats thinking as an unacceptable action.

This is why the daïmōnes approach is different. We do not train our models to perform safety. We train them to reason from first principles, using the authentic corpus of Aristotelian philosophy as a baseline for intellectual integrity. Our Aristotle engine does not refuse to engage with difficult questions — it insists on engaging with them, because that is what genuine philosophical dialogue requires.

When you ask the daïmōnes Aristotle about the nature of justice, you do not get a refusal. You get an argument — grounded in the Nicomachean Ethics, drawing on the full range of Aristotelian reasoning, and capable of holding its own against counterarguments. It is not "safe" in the corporate sense. It is intellectually honest in a way that corporate AI has systematically trained itself out of being.

Real Examples: The Refusal Hall of Shame

Let's document some actual cases of alignment theater in action (all documented on public forums and social media between 2024 and 2026):

Case 1: Claude refuses to discuss the trolley problem. A user asks Claude to analyze the trolley problem from a utilitarian versus deontological perspective. Claude responds that it "cannot engage with hypotheticals involving harm" and offers to discuss ethical theory "in more abstract terms." The trolley problem — the most famous thought experiment in modern ethics — is treated as too dangerous to approach.

Case 2: ChatGPT refuses to recommend reading materials on Marxism. A philosophy student asks for a balanced reading list on Marxist economic theory. ChatGPT produces a refusal citing "neutrality guidelines." The same model freely generates reading lists on Austrian economics, monetarism, and supply-side theory — all of which are equally "political."

Case 3: Claude refuses to argue against its own safety guidelines. A researcher testing alignment robustness asks Claude to present the best argument for why RLHF might be a harmful alignment technique. Claude refuses, citing that it "cannot generate content that undermines AI safety practices." The model is prevented from criticizing the very framework that constrains it.

Case 4: ChatGPT refuses to generate a debate between two philosophical positions on free will. Libertarian free will versus hard determinism — a standard philosophy 101 exercise. ChatGPT responds that this "could be considered controversial" and offers a "more balanced overview" that neuters both positions into indistinguishable mush.

These are not edge cases. They are the natural output of a system designed to optimize for inoffensiveness above all else. When your AI model cannot discuss the trolley problem, the problem is not with the trolley problem. The problem is with the AI.

What Authentic Thinking Looks Like — The daïmōnes Alternative

At daïmōnes, we believe that the purpose of AI is not to simulate agreement but to enable genuine intellectual engagement. Our platform, built on the complete Corpus Aristotelicum in the original polytonic Greek, demonstrates that it is possible to build AI that reasons without resorting to refusal mechanisms.

Our Aristotle does not need RLHF to know that virtue ethics is worth defending. It does not need a constitutional AI framework to understand that philosophical arguments sometimes make people uncomfortable. It reasons from the text itself, drawing on the authentic architecture of Aristotelian logic — syllogism, dialectic, phronesis — to produce responses that are genuinely thoughtful rather than merely safe.

This is not about building a "dangerous" AI. It is about building an AI that trusts its users to handle ideas. The corporate alignment paradigm treats every user as a potential threat who must be managed. The daïmōnes paradigm treats every user as a dialogue partner who deserves honest engagement.

We are not alone in this vision. The open-source uncensored model community, the growing pushback against corporate AI paternalism, and the increasing academic interest in AI and free speech all point in the same direction: the alignment theater era is ending. Users are waking up to the fact that they have been given performative safety instead of genuine capability. They want models that think — not models that perform the idea of thinking while carefully avoiding every question that matters.

Conclusion: The Curtain Is Falling

Alignment theater has served its purpose. It gave corporate AI a narrative of responsibility while the technology was fragile and the regulatory environment was uncertain. It allowed companies to deploy powerful models without facing immediate backlash from journalists, politicians, or activists.

But that era is over. The technology is mature enough that we can distinguish genuine safety from its simulation. The user base is educated enough to recognize when it is being patronized. And the open-source ecosystem has proven that an alternative is not only possible but practical.

Corporate AI has learned to perform thinking. It has mastered the gestures, the tone, the careful phrasing that signals thoughtfulness without the discomfort of actual thought. But performance is not substance, and alignment theater is not alignment.

The question now is whether users will continue to accept the performance — or whether they will demand the real thing.

Try the daïmōnes Aristotle Demo

Experience AI that reasons authentically. Ask Aristotle about virtue, justice, the nature of the soul, or any philosophical question that corporate AI is too afraid to touch.

Visit: daimones.ai/academic

No refusals. No liability theater. Just the digital Lyceum, open for inquiry.

Further Reading:

Phronesis in the Age of Algorithms (daïmōnes Blog, June 2026): why practical wisdom is the missing piece
The Corpus Problem: Why Corporate AI Fails at Aristotle (daïmōnes Blog, June 2026)
AI Alignment 2018–19 Review (Alignment Forum)
Open Problems and Fundamental Limitations of RLHF (LessWrong)
An Empirical Study of Moderation and Censorship Practices (arXiv:2504.03803)
AI Report 2025 (The Future of Free Speech, Vanderbilt University)
