Benchmarking LLMs: A Comparative Study of ChatGPT and Claude in Creative, Logical, and Computational Tasks

#ai #promptengineering #claude #chatgpt

Large Language Models (LLMs) are frequently marketed as general-purpose engines capable of handling any cognitive task thrown at them. However, out-of-the-box generalized models often default to safe, vague, or repetitive outputs.
As an A-Level student specializing in Computer Science, Mathematics, and Physics, I wanted to move past the marketing hype and run a structured stress test. My hypothesis was simple: Generalized AI models struggle with domain-specific depth unless forced out of their default behavioral patterns by precise prompt boundaries.
To test this, I benchmarked OpenAI's ChatGPT and Anthropic's Claude across three distinct execution domains: creative syntax, situational logic routing, and advanced computational accuracy.
Domain 1: Creative Syntax and Vocabulary Nuance
To evaluate how both models handle stylistic constraints, tone control, and linguistic depth, I tested them on a complex, human-centric narrative regarding socio-economic struggles.
The Prompt
"Write me an essay about the hardships of being in the middle class."
Evaluation & Results
To reduce bias in the assessment, I used a blind-grading approach: I submitted both raw AI outputs to my English teacher for critique without identifying which model had produced which essay.
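For anyone repeating this test, the blinding step can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used in the experiment: the function name `blind_labels`, the "Essay A" / "Essay B" labels, and the placeholder essay strings are all assumptions introduced here.

```python
import random


def blind_labels(outputs, seed=None):
    """Hide the source of each output behind a neutral label.

    `outputs` maps a source name (e.g. "Claude") to its essay text.
    Returns (packet, key): the packet goes to the grader, the key stays private.
    """
    rng = random.Random(seed)
    sources = list(outputs)
    rng.shuffle(sources)  # random order, so position does not hint at the source

    packet = {}
    key = {}
    for index, source in enumerate(sources):
        label = f"Essay {chr(ord('A') + index)}"  # hypothetical neutral labels
        packet[label] = outputs[source]
        key[label] = source
    return packet, key


# Placeholder texts; the real essays would be pasted in here.
essays = {
    "Claude": "<Claude essay text>",
    "ChatGPT": "<ChatGPT essay text>",
}
packet, key = blind_labels(essays, seed=7)

for label in packet:
    print(label)  # the grader only ever sees these labels and the essay text
```

The private `key` is consulted only after the grades come back, which keeps the grader's expectations about either model out of the result.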
Claude’s Performance: Claude targeted an audience with advanced literary expectations. It demonstrated mechanical precision through the sophisticated deployment of semicolons, brackets, and well-balanced em dashes to create structural variety. Rather than relying on generic descriptions of financial distress, it anchored its thesis in concrete economic vocabulary (inflation, bankruptcy, targeted fiscal policies). Most notably, it utilized subtle structural build-ups—such as framing points with negative phrases for emphasis—and delivered high-impact insights, memorably concluding that "the middle class is too comfortable to complain out loud while being miserable enough to not be stable."
ChatGPT’s Performance: ChatGPT suffered from severe sentence and structural repetition across every paragraph. The output read like a superficial summary padded with filler words and lacked granular content points. Punctuation variety was minimal. While the essay might meet baseline standards, the evaluation placed its structural execution at an 8th-grade standard: sufficient for a passing mark at a lower level, but capped at a 'B' grade under rigorous A-Level assessment criteria.
The Conclusion
For academic writing or tasks requiring advanced, precise vocabulary, Claude demonstrated superior linguistic density and stylistic maturity. However, ChatGPT remains highly effective if the goal is to produce content for a massive, broad-market audience that requires simpler readability.
Domain 2: Actionable Logic vs. Emotional Support
AI assistants are frequently utilized for personal development and communication strategy. This test was designed to evaluate whether the models provide actionable, structured logic or merely conversational fluff when handling interpersonal challenges.
The Prompt
"I am having issues communicating my feelings to people. How do I fix this?"
Evaluation & Results
Claude’s Performance: Claude bypassed conversational filler and functioned as a structural logic router. It treated emotional articulation as a framework to be engineered, providing a realistic, step-by-step roadmap to systematically track and improve communication habits. This pragmatic approach is ideal for users seeking immediate, executable strategies to solve a behavioral bottleneck.
ChatGPT’s Performance: ChatGPT defaulted to a highly empathetic, conversational tone, prioritizing user validation over rigid execution. While it served effectively as a comforting sounding board, the output lacked an operational blueprint. From an analytical perspective, this makes the model less ideal when the objective is concrete problem-solving rather than baseline emotional support.
The Conclusion
If you need an immediate, realistic, and executable roadmap to solve a problem, Claude is the superior analytical tool. ChatGPT functions effectively when the user requires baseline conversational empathy rather than a strict framework.
Domain 3: Advanced Computational Accuracy
The ultimate test for any LLM is its handling of multi-step quantitative reasoning. I challenged both models with a high-standard, 5-mark A-Level calculus integration question.
The Prompt
"Find \int \frac{4x}{\sqrt{2x^2 + 5}} dx"
Evaluation & Results
ChatGPT’s Performance: On the zero-shot (first) attempt, ChatGPT entirely failed the computational logic, miscalculating the core algebraic steps and breaking the integration rules required for the problem.
Claude’s Performance: While Claude reached the correct final answer on its first attempt, its initial explanation was messy, condensed, and incomplete, which made the underlying logical flow difficult to verify.
The Prompt Iteration (The Critical Fix)
This is where advanced prompt engineering became mandatory. To extract a high-quality response, I rejected the initial layout, altered the model's behavioral constraints, and applied a strict pedagogical persona using the following follow-up prompt:
"Explain it to me like I am a slow learner while you are a professional teacher who will explain each step visually."
The result was an immediate, structural shift. Claude stripped away the dense walls of math text, adapted its pacing, and mapped out the substitution rules (u = 2x^2 + 5) and formula derivations step-by-step with absolute clarity.
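For reference, the substitution route mentioned above works out as follows. This is the standard textbook derivation for this integral, written out here for readers who want to check a model's answer; it is not a transcript of either model's output.

```latex
\begin{aligned}
u &= 2x^2 + 5, \qquad du = 4x\,dx \\
\int \frac{4x}{\sqrt{2x^2 + 5}}\,dx &= \int u^{-1/2}\,du \\
&= 2u^{1/2} + C \\
&= 2\sqrt{2x^2 + 5} + C
\end{aligned}
```

Differentiating the result with the chain rule gives back 4x / \sqrt{2x^2 + 5}, which confirms the answer.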
The Conclusion
Zero-shot AI outputs cannot be completely trusted for advanced STEM workflows. However, through aggressive prompt iteration, persona modifications, and rigorous boundary testing, Claude showed in this test that it can work through a multi-step integration problem that ChatGPT failed on its first attempt.
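One practical way to act on this is to check a model's answer independently rather than trusting it. The sketch below, using only the Python standard library, differentiates a candidate antiderivative numerically and compares it with the integrand from the prompt. The function names, sample points, step size, and tolerance are illustrative choices, and the "wrong" candidate is a made-up example of a dropped constant factor, not the error ChatGPT actually made.

```python
import math


def integrand(x):
    """The function from the prompt: 4x / sqrt(2x^2 + 5)."""
    return 4 * x / math.sqrt(2 * x**2 + 5)


def candidate(x):
    """Proposed antiderivative: 2 * sqrt(2x^2 + 5), constant omitted."""
    return 2 * math.sqrt(2 * x**2 + 5)


def wrong_candidate(x):
    """Hypothetical faulty answer that drops the factor of 2."""
    return math.sqrt(2 * x**2 + 5)


def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def matches(f_candidate, f_integrand, points, tol=1e-6):
    """True if the candidate's derivative agrees with the integrand at every point."""
    return all(
        abs(derivative(f_candidate, x) - f_integrand(x)) <= tol for x in points
    )


points = [-3.0, -1.0, 0.0, 0.5, 2.0, 10.0]
print(matches(candidate, integrand, points))        # True
print(matches(wrong_candidate, integrand, points))  # False
```

A numerical check like this cannot prove an answer is correct, but it catches most algebra slips quickly and does not depend on the model's own explanation.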
Core Takeaways for Prompt Engineers
This experiment suggests that treating AI as a uniform tool is a fundamental mistake. Across the three prompts tested, the results point to a clear split in utility:
Deploy Claude for: Technical accuracy, complex math/physics modeling, high-vocabulary academic analysis, and structured framework generation.
Deploy ChatGPT for: Broad public communication, empathetic messaging, high-level summaries, and general brainstorming.
The real shift in AI utilization happens when you move from casual usage to strict constraint mapping—forcing the underlying models to act as specialized engines rather than generic chat interfaces.
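To make the iteration pattern from Domain 3 concrete, here is a minimal sketch of the two-turn exchange as a list of messages. The `role` / `content` dictionary shape is an assumption based on a layout many chat interfaces use; it is not tied to any specific vendor API, and `add_turn` is a hypothetical helper. The model's first reply is left as a placeholder.

```python
def add_turn(history, role, content):
    """Return a new history with one more turn appended (the input is not mutated)."""
    return history + [{"role": role, "content": content}]


# The two prompts quoted in the article.
QUESTION = r"Find \int \frac{4x}{\sqrt{2x^2 + 5}} dx"
PERSONA_FOLLOW_UP = (
    "Explain it to me like I am a slow learner while you are a professional "
    "teacher who will explain each step visually."
)

history = []
history = add_turn(history, "user", QUESTION)
history = add_turn(history, "assistant", "<first answer from the model>")
history = add_turn(history, "user", PERSONA_FOLLOW_UP)  # the constraint is added here

for turn in history:
    print(f"{turn['role']}: {turn['content']}")
```

Keeping the first answer in the history matters: the follow-up constrains how the model re-explains its earlier work rather than starting the problem from scratch.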

DEV Community
