What happens when the world's most valuable technology company publishes research exposing fundamental limitations in AI? If you're Gary Marcus, you call it vindication. If you're building the future of AI, you should call it invaluable feedback.
The research in question comes from Apple's AI team, who published two papers that expose how even the most advanced language models struggle with genuine reasoning. Their findings are stark: models that cost billions to develop can fail at puzzles a first-year computer science student could solve, and adding irrelevant information to math problems can cause performance to plummet by up to 65%. Marcus, a cognitive scientist who has warned about these limitations for decades, sees this as confirmation of his long-standing concerns. But rather than viewing this as a defeat for AI, we should recognize it as exactly what the field needs: rigorous, honest assessment that helps us build better systems.
Understanding what Apple discovered about AI reasoning
Apple's research team, led by Mehrdad Farajtabar and Iman Mirzadeh, designed elegant experiments to test whether large language models truly reason or simply match patterns. Their methodology was refreshingly straightforward: create controllable puzzle environments where complexity could be precisely adjusted while keeping the logical structure consistent.
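To make "controllable complexity" concrete, here is a minimal sketch in Python of the kind of setup the paper describes, using Tower of Hanoi, one of the puzzles Apple tested: difficulty scales with the number of disks, and a simple checker verifies whether a model's proposed move sequence is legal and actually solves the puzzle. The `ask_model` and `parse_moves` helpers are hypothetical placeholders, not anything from the paper.

```python
# Sketch of a controllable puzzle environment: Tower of Hanoi, where
# complexity is set by the disk count and answers are machine-checkable.
# `ask_model` and `parse_moves` are hypothetical placeholders.

def apply_moves(n_disks, moves):
    """Replay (from_peg, to_peg) moves; True if all are legal and the puzzle ends solved."""
    pegs = [list(range(n_disks, 0, -1)), [], []]    # largest disk at the bottom of peg 0
    for src, dst in moves:
        if not pegs[src]:
            return False                            # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                            # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))   # solved when every disk sits on peg 2

def sweep_complexity(model, max_disks=10):
    """Raise the disk count step by step and record whether the model's plan checks out."""
    results = {}
    for n in range(3, max_disks + 1):
        prompt = f"Solve Tower of Hanoi with {n} disks. Reply with (from_peg, to_peg) moves."
        results[n] = apply_moves(n, parse_moves(ask_model(model, prompt)))
    return results
```

Because the checker is exact, any drop in the success curve as the disk count grows reflects the model's reasoning, not the grader.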
The results revealed three distinct performance regimes. At low complexity, standard language models surprisingly outperformed specialized reasoning models. Medium complexity showed reasoning models gaining an edge. But at high complexity, both types experienced what the researchers called "complete collapse" – unable to solve problems that follow clear logical rules.
Most revealing was their GSM-NoOp experiment. By adding seemingly relevant but actually irrelevant information to math problems – like mentioning that some kiwis were smaller than average – they caused state-of-the-art models to fail catastrophically. This wasn't a minor glitch; it was evidence that these systems rely on pattern matching rather than understanding.
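For a sense of what that kind of robustness probe can look like in code, here is a small, hypothetical sketch built around the paper's widely quoted kiwi example: the appended clause changes nothing about the arithmetic, so a model that genuinely understands the problem should return the same answer with or without it. Again, `ask_model` and `extract_number` are stand-ins for whatever client and parsing you use.

```python
# GSM-NoOp-style check: the extra clause is irrelevant, so the correct
# answer (190) must not change. `ask_model`/`extract_number` are hypothetical.

base_question = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)
noop_clause = " Five of Sunday's kiwis were a bit smaller than average."
correct_answer = 44 + 58 + 2 * 44   # 190, with or without the distractor

def is_robust(model) -> bool:
    """True only if the model answers 190 both with and without the irrelevant clause."""
    plain = extract_number(ask_model(model, base_question))
    perturbed = extract_number(ask_model(model, base_question + noop_clause))
    return plain == correct_answer and perturbed == correct_answer
```

In the paper's examples, strong models often subtract those five "smaller" kiwis anyway, which is precisely the pattern-matching behaviour the authors document.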
Gary Marcus's perspective brings historical context
Marcus frames these findings within a broader narrative he's been articulating since 1998: neural networks excel at generalizing within their training distribution but struggle when encountering truly novel problems. His critique isn't dismissive – he acknowledges AI's genuine achievements like AlphaFold's breakthrough in protein folding. Instead, he argues for recognizing both capabilities and limitations.
"There is no principled solution to hallucinations in systems that traffic only in the statistics of language without explicit representation of facts and explicit tools to reason over those facts," Marcus writes. This isn't AI pessimism; it's a call for architectural innovation. He suggests that hybrid approaches combining neural networks with symbolic reasoning might offer a path forward.
Marcus's reputation as a constructive critic is well-established. With a PhD from MIT at 23 and successful AI companies under his belt, he brings both academic rigor and practical experience. Science fiction author Kim Stanley Robinson calls him "one of our few indispensable public intellectuals" on AI – high praise that reflects his role in keeping the field honest.
Why critical research accelerates progress
The history of AI is filled with examples where identifying limitations led directly to breakthroughs. When researchers discovered adversarial vulnerabilities – where tiny changes to images could fool AI systems – it sparked development of more robust training techniques. When bias in training data was exposed, it led to better data collection practices and fairness frameworks. When hallucination problems were documented, it inspired retrieval-augmented generation systems that ground AI responses in verified information.
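As a rough, self-contained illustration of that retrieval-augmented pattern, with a deliberately naive keyword retriever and a hypothetical `ask_model` call standing in for a real LLM client:

```python
# Minimal retrieval-augmented generation: answer only from retrieved text.
# The keyword-overlap retriever is a toy; production systems use embeddings.

corpus = {
    "refunds.md": "Refunds are available within 30 days of purchase.",
    "shipping.md": "Orders ship within 2 business days.",
}

def retrieve(query, k=1):
    """Rank documents by crude word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(corpus.values(),
                    key=lambda text: len(query_words & set(text.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_answer(query):
    context = "\n".join(retrieve(query))
    prompt = (f"Answer using only this context:\n{context}\n\n"
              f"Question: {query}\nIf the context is insufficient, say you don't know.")
    return ask_model(prompt)   # hypothetical LLM call
```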
This pattern extends beyond technical improvements. Microsoft, Google, and other tech giants have established dedicated AI safety teams specifically because critical research highlighted potential risks. Anthropic built its entire company philosophy around empirically-driven AI safety research. These aren't defensive reactions; they're proactive investments in making AI more reliable and beneficial.
The business impact is measurable. Companies using AI systems that have been refined through this kind of critical feedback report productivity gains averaging 66%, and predictive maintenance systems improved through failure analysis reduce unplanned downtime by up to 50%. Each limitation identified and addressed makes AI more valuable in real-world applications.
Finding the balance between optimism and realism
Acknowledging limitations doesn't mean abandoning optimism about AI's potential. Even Marcus, often portrayed as an AI skeptic, readily admits these systems excel at brainstorming, code assistance, and content generation. The key is matching capabilities to appropriate use cases.
Consider how we approach other technologies. We don't expect calculators to write poetry or smartphones to perform surgery. Understanding boundaries helps us use tools effectively. The same principle applies to AI – knowing where it excels and where it struggles enables better decision-making about deployment.
This balanced perspective is gaining traction across the industry. The EU's AI Act, while comprehensive in its requirements, explicitly encourages innovation alongside safety measures. Leading AI companies increasingly publish their own limitation studies, recognizing that transparency builds trust and accelerates improvement.
The path forward requires both builders and critics
Apple's research and Marcus's commentary represent something precious in technology development: the willingness to look honestly at what we've built and ask hard questions. This isn't pessimism or opposition to progress. It's the scientific method at work, where hypotheses meet reality and adjustments follow.
For those building AI systems, critical research provides a roadmap for improvement. For those deploying AI in businesses and organizations, it offers guidance on appropriate use cases and necessary safeguards. For society at large, it ensures we approach transformative technology with eyes wide open.
The most exciting developments often emerge from addressing limitations. When recurrent networks struggled to carry information across long sequences, attention mechanisms emerged; when sequential processing became the training bottleneck, researchers built the transformer around attention alone. Today's limitations in reasoning and reliability will likely spark tomorrow's architectural innovations.
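For readers who have only heard the word, the attention mechanism at the heart of that story reduces to a few lines of linear algebra; here is a minimal NumPy sketch of scaled dot-product attention:

```python
# Scaled dot-product attention: each query position mixes the values,
# weighted by how strongly its query matches every key.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(attention(Q, K, V).shape)   # (4, 8): one output vector per query position
```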
Critical thinking as a catalyst for innovation
The Apple papers don't represent a "knockout blow" to AI, despite Marcus's provocative headline. They represent something more valuable: a clear-eyed assessment of current capabilities that points toward future improvements. By documenting exactly how and why models fail at certain reasoning tasks, researchers provide specific targets for enhancement.
This dynamic – where critics and builders engage in productive dialogue – has driven progress in every technological revolution. The Wright brothers succeeded partly because they studied why others failed. The internet became robust because security researchers exposed vulnerabilities. AI will achieve its potential through the same process of iterative improvement guided by honest assessment.
As we continue developing AI systems, we need both the optimists who push boundaries and the critics who test them. We need companies like Apple conducting rigorous evaluations and voices like Marcus's providing historical perspective. Most importantly, we need a culture that views limitations not as failures but as opportunities for growth.
The future of AI isn't threatened by research exposing its current limitations. It's enhanced by it. Every well-documented limitation becomes a target for improvement. Every thoughtful critique sharpens our understanding. Every honest assessment brings us closer to AI systems that are not just powerful but reliable, not just impressive but trustworthy.
That's why we should celebrate when major tech companies publish research revealing AI limitations. It's why we should value critics who hold the field to high standards. And it's why the path to beneficial AI runs directly through the sometimes uncomfortable territory of acknowledging what our current systems cannot do. In technology, as in science, the truth – even when it challenges our assumptions – is always our ally.
References
Primary Sources
- Apple Machine Learning Research - "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity"
- Gary Marcus - "A knockout blow for LLMs?"

Additional Research Papers and Sources

- arXiv - "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"
- Gary Marcus - "CONFIRMED: LLMs have indeed reached a point of diminishing returns"
- Big Think - "AI skeptic Gary Marcus on AI's moral and technical shortcomings"
- Gary Marcus Substack - "Marcus on AI"
- arXiv - "AI Safety for Everyone"
- Nature Machine Intelligence - "AI safety for everyone"
- Gary Marcus - "LLMs don't do formal reasoning - and that is a HUGE problem"
- Nielsen Norman Group - "AI Improves Employee Productivity by 66%"
- Capella Solutions - "Case Studies: Successful AI Implementations in Various Industries"
- Center for AI Safety - https://safe.ai/ and https://safe.ai/ai-risk
Top comments (14)
I'm not going to contest the research, I'm not smart enough.
But to me it feels like there was some financial pressure behind the release of the study. Apple was expected to come up with AI integration, because everyone else is doing it. And Apple as a brand is still seen as ahead of the curve.
I agree that we need to learn about the opportunities and limitations of technology. But there are many sales people out there looking for money. And they are dressing every new technology up like a Christmas tree.
Apple tried the whole AI thing (remember Apple Intelligence?) and it turned out to be a complete mess: yet another wrapper around ChatGPT, and half of the promises are still not implemented or really badly implemented.
Of course they now want to say "AI is lacking, that's why our AI is lacking"; it's in their interest to downplay AI because they are so far behind. Perhaps we should apply a bit of critical thinking to Apple's research.
I think you read my comment wrong. I'm not criticizing the research; I'm criticizing shareholder pressure and marketing for setting the timing of the report's release.
If it had been released further away from the WWDC event, it would have been less likely to be seen as a ploy to keep the share price high.
My bad. My comment was not supposed to criticize your comment; I agree with it. I should have added my comment to the main thread instead.
All I want to say is that we should apply critical thinking to Apple's research. Looking at the WWDC 25 highlights, they didn't mention AI once, the same month their paper came out claiming AI is lacking. Is Apple doing this for the greater good of AI development, or could it be that Apple doesn't want to lose face like Google did and still wants to sell billions worth of devices?
Today's LLMs are far from perfect, still a developing technology that hasn't found its true purpose, but I think Apple is having the "no one is going to buy devices" moment that Microsoft had back in the early 2000s.
This is the sentiment I wanted to avoid by prefacing my first comment with the point that I don't question the research.
Both things can be true. The research is valid and it is a marketing trick.
I agree with that; however, I don't think Apple has the authenticity to make these research claims. Tbh, no big tech company has that authenticity; it's all about the shareholders.
While I agree that Apple isn't leading AI research, they have enough funds to hire good researchers.
I don't blame you for being skeptical about big companies. But the truth is that they have the deep pockets to invest in the research. I think the problematic things come from the applications of that research.
Yes, but it's not a genuine effort from Apple to push the development of AI; instead, it's a way to tell their customers that "it's OK to buy our devices even though they have no big AI capabilities."
This reminds me of how MS doubled down on Vista and the Aero glass UI instead of focusing on mobile devices.
I guess the hype surrounding LLMs is the real issue.
Really appreciate the citations -- it's nice to know that something has actually been thought about, and not just written. Apple's research kind of aligns with my personal views on AI: it's great for things that thousands of people have done before, but the moment you throw something new or different at it, it implodes.
Lately I’ve seen references to LRMs, or “Large Reasoning Models”. Maybe someone is already seeing the flaws in using language models for reasoning and is starting to build models specifically for reasoning?
A Call to Action: Beyond Corporate Motivations to Collaborative Progress
The discussion around Apple's AI research reveals a crucial insight that demands our collective response: we cannot let corporate motivations overshadow the vital need for honest AI assessment.
Both David and Kristofer identified the core tension we face - major tech companies control much of AI research, yet their commercial interests inevitably influence what gets published and when. This reality doesn't invalidate good research, but it highlights a dangerous dependency that we must address.
Here's what we must do:
- Support Independent AI Research
- Demand Research Transparency
- Build Critical AI Literacy
- Foster Collaborative Evaluation
The future of AI shouldn't be determined by shareholder interests or quarterly earnings calls. When we see valuable research - regardless of who funds it - let's use it as a catalyst for broader, independent investigation.
The question isn't whether Apple (or any tech giant) has pure motives. The question is whether we'll let corporate control of AI research prevent us from building better, safer, more reliable systems.
Let's turn this skepticism into action. Support independent AI research, demand transparency, and help build the critical thinking infrastructure our AI future desperately needs.
What will you do to ensure AI development serves humanity's interests, not just corporate bottom lines?
Really insightful research and an outstanding article. Thanks for sharing this—it highlights how critical scrutiny can drive meaningful progress in AI.
growth like this is always nice to see. kinda makes me wonder - what keeps stuff going long-term? like, beyond just the early hype?
Fascinating research and outstanding article. Thank you for sharing.