
Auton AI News

Originally published at autonainews.com

“Cards Against LLMs” Reveals 5 Top Models Fail Human Humor Test

Key Takeaways

  • The “Cards Against LLMs” study, published on arXiv April 9, 2026, found that five frontier LLMs agreed with human humor judgments only modestly across 9,894 game rounds.
  • Leading LLMs showed stronger agreement with each other than with humans, and exhibited systematic biases like position preference, suggesting models often mimic superficial comedic structures rather than grasp genuine intent.
  • New frameworks like HumorRank (March 31, 2026) offer improved evaluation for LLM humor generation, but achieving human-level comedic emulation will likely require models to develop deeper social cognition and cultural understanding than current pattern recognition allows.

A study published on arXiv earlier this month found that five of the most capable large language models on the market agreed with each other’s humor picks far more than they agreed with actual human players, which tells you something important about what these models are really doing when they try to be funny. The research, which ran nearly 10,000 rounds of a “Cards Against Humanity”-style game, is the clearest evidence yet that AI humor is largely a pattern-matching exercise, not a window into comedic understanding. That gap matters more now that AI-generated comedy is going public: “Laugh GPT,” a stand-up show in San Francisco running through May 2026, is actively asking audiences to tell the difference between human and machine-written jokes.

What the Data Actually Shows

The “Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models” paper, posted on arXiv on April 9, 2026, is methodologically straightforward: five frontier LLMs played a fill-in-the-blank comedy card game against human participants across 9,894 rounds. The models beat random selection, but their alignment with human humor preferences was modest. More telling was the inter-model agreement: the LLMs consistently picked similar answers to each other, showing what the researchers describe as systematic content preferences. Position bias, a tendency to favour certain answer slots regardless of content, was another recurring pattern.
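
To make those three measurements concrete, here is a minimal sketch of how agreement with humans, agreement between models, and position bias could be computed from round-level picks. The records, model names and function names are all hypothetical, and this is not the paper's actual evaluation code; it only illustrates the kind of comparison described above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical round records (invented for illustration): each picker
# chooses one answer slot, identified by its 0-based position.
rounds = [
    {"human": 2, "model_a": 0, "model_b": 0, "model_c": 2},
    {"human": 1, "model_a": 0, "model_b": 0, "model_c": 0},
    {"human": 0, "model_a": 0, "model_b": 0, "model_c": 1},
    {"human": 3, "model_a": 1, "model_b": 1, "model_c": 1},
]
MODELS = ["model_a", "model_b", "model_c"]


def agreement(rounds, a, b):
    """Fraction of rounds in which pickers a and b chose the same slot."""
    return sum(r[a] == r[b] for r in rounds) / len(rounds)


def mean_human_agreement(rounds, models):
    """Average model-vs-human agreement across all models."""
    return sum(agreement(rounds, m, "human") for m in models) / len(models)


def mean_inter_model_agreement(rounds, models):
    """Average agreement across every pair of models."""
    pairs = list(combinations(models, 2))
    return sum(agreement(rounds, a, b) for a, b in pairs) / len(pairs)


def slot_distribution(rounds, picker):
    """Share of picks per answer slot; a skew hints at position bias."""
    counts = Counter(r[picker] for r in rounds)
    return {slot: n / len(rounds) for slot, n in sorted(counts.items())}


human_rate = mean_human_agreement(rounds, MODELS)
model_rate = mean_inter_model_agreement(rounds, MODELS)
print(f"model-human agreement: {human_rate:.2f}")
print(f"inter-model agreement: {model_rate:.2f}")
print("model_a slot shares:", slot_distribution(rounds, "model_a"))
```

In this toy data the models agree with each other more often than with the human pick, which mirrors the pattern the study reports. A skewed slot distribution only indicates position bias if answer order was shuffled between rounds, which is assumed here.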

A companion paper, “HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models,” published on arXiv March 31, 2026, adds useful texture. Using the SemEval-2026 MWAHAHA test dataset, the researchers found that humor quality in LLMs is driven primarily by a model’s mastery of specific comedic mechanisms, not by its overall scale. Bigger is not funnier. What matters is how well a model has internalised particular joke templates, which, again, points to structural mimicry rather than any deeper comedic instinct. Models can generate stylistically varied humor, but contextual nuance and emotional fit remain persistent weak points.
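
For readers unfamiliar with tournament-style leaderboards, the sketch below shows one common way to turn head-to-head joke comparisons into a ranking: Elo ratings. This is a stand-in chosen for familiarity. The article does not say which scoring procedure HumorRank uses, and the model names, match results and `k` value are invented.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32):
    """Shift both ratings after one pairwise comparison (zero-sum)."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical pairwise outcomes: (model whose joke was preferred, other model).
matches = [
    ("model_x", "model_y"),
    ("model_x", "model_z"),
    ("model_y", "model_z"),
]

ratings = {"model_x": 1000.0, "model_y": 1000.0, "model_z": 1000.0}
for winner, loser in matches:
    update_elo(ratings, winner, loser)

# Highest rating first.
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because each update is zero-sum, a model's rating depends only on which comparisons it wins, not on its parameter count, so a smaller model can top such a leaderboard.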

The underlying reason is architectural. LLMs are probabilistic text engines: they build text by repeatedly choosing a statistically likely next token given what came before. Comedy, almost by definition, depends on the improbable. A punchline works because it’s unexpected: it violates the prediction. Asking a system trained to predict the obvious to reliably produce the surprising is a genuine structural tension, not a problem easily solved by more training data or a larger parameter count.
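
The tension can be illustrated with surprisal, the information-theoretic measure of how unexpected an outcome is (the negative log of its probability). The toy distribution below is entirely made up and does not come from any real model; it only shows that the most likely continuation is, by construction, the least surprising one.

```python
import math

# Invented next-token probabilities for the setup
# "I told my boss I needed a ..." (illustrative numbers only).
next_token_probs = {
    "raise": 0.50,
    "break": 0.30,
    "vacation": 0.15,
    "nap under his desk": 0.05,  # stands in for a punchline-like ending
}


def surprisal_bits(p):
    """Surprisal in bits: rarer continuations carry more surprise."""
    return -math.log2(p)


def greedy_choice(probs):
    """Pick the single most probable continuation."""
    return max(probs, key=probs.get)


likely = greedy_choice(next_token_probs)
print("most likely:", likely)
for token, p in next_token_probs.items():
    print(f"{token!r}: p={p:.2f}, surprisal={surprisal_bits(p):.2f} bits")
```

Real systems usually sample with a temperature rather than always taking the top token, so unlikely continuations do appear. Still, sampling more randomly raises surprise indiscriminately: it does not pick out the one improbable ending that is also funny.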

The Deeper Problem: Culture, Context and Puns

Research presented at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) sharpens this picture considerably. The paper “Pun Unintended: LLMs and the Illusion of Humor Understanding” found that while models could identify existing puns, their comprehension was shallow. When researchers made subtle modifications to a pun that removed its double meaning, the LLMs frequently still flagged it as humorous; they were responding to surface structure, not semantic ambiguity. That’s the illusion of understanding rather than the real thing.

Irony, sarcasm and satire compound the problem further. These forms of humor depend on shared social knowledge: what’s normal, what’s taboo, who holds power, what the audience already knows. An LLM has no lived experience, no embodied social history. It can approximate the shape of ironic discourse from training data, but it cannot evaluate whether a given ironic statement will land in a specific room with a specific audience on a specific night. Human comedians do this constantly and mostly unconsciously.

There are also ethical dimensions to get right. A paper posted on arXiv on April 20, 2026, “Investigating Counterfactual Unfairness in LLMs towards Identities through Humor,” found that model responses to humor vary significantly based on the perceived identities of speakers and respondents. The authors argue this reflects internalized social assumptions baked into training data. In practice, that means LLM-generated comedy can drift into culturally insensitive territory without any obvious trigger, a serious concern for any application that puts AI-written content in front of a live audience.

Where Research Is Actually Making Progress

The most interesting recent work is trying to solve the structural tension directly rather than work around it. “HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation,” posted on arXiv March 19, 2026, introduced what the authors call a “Cognitive Synergy Framework.” Rather than training a single model to be funny, the approach deploys six cognitive personas, among them The Absurdist and The Cynic, to synthesise diverse comedic perspectives and generate high-quality training data. A 7-billion-parameter model trained on this data reportedly matched the humor output of significantly larger proprietary systems. The implication is that thoughtful data curation and cognitive framing can outperform brute-force scaling.

A separate line of research takes a social rather than cognitive approach. “Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation,” posted on OpenReview March 20, 2026, found that giving LLMs access to simulated audience feedback and community discussion significantly improved their stand-up comedy writing. According to the authors, preference rates over baseline systems improved substantially, though the reported figure has not been independently confirmed and should be treated as preliminary. The core insight is that humor is a social act: it improves through iteration and audience response, and AI systems can benefit from simulated versions of that feedback loop.

This connects to a broader shift in how researchers and practitioners are thinking about AI’s role in creative work. The question is less “can AI be funny?” and more “can AI make human comedians and writers more productive?” For brainstorming, drafting structural variations and rapid iteration on joke formats, current LLMs are already genuinely useful. The human layer (timing, cultural calibration, reading a room) remains the part that matters most and the part that models cannot yet replicate. For more on how AI agents are reshaping creative and professional workflows, see our coverage of enterprise AI agent deployments.

The goal, for now, is a productive division of labour: models handle the generative heavy lifting, humans supply the judgment. That framing is more honest than either “AI will replace comedians” or “AI can never be creative.” What the research shows is that the gap between pattern recognition and genuine comedic understanding is real, measurable and not obviously closing fast. For more coverage of AI research and breakthroughs, visit our AI Research section.


Originally published at https://autonainews.com/cards-against-llms-reveals-5-top-models-fail-human-humor-test/
