arXiv Bombshell: LLMs Fail at Essay Grading
On March 24, 2026, researchers published a paper making waves in academic and developer circles: "LLMs Do Not Grade Essays Like Humans". The study evaluated GPT and Llama family models on automated essay scoring (AES) in out-of-the-box settings — no fine-tuning, no task-specific prompting.
The finding: agreement between LLM scores and human scores remains relatively weak. LLMs tend to assign higher scores to short or underdeveloped essays, while penalizing longer essays with minor grammatical errors. The models follow internally coherent patterns, but those patterns don't align with how human raters actually think.
What This Means for Developers
This doesn't mean LLMs are useless for education or writing tools. It means developers need to use them for the right tasks.
What LLMs ARE reliable for in writing contexts:
- Essay generation and variation — Creating draft content, generating multiple versions, producing training data at scale
- Writing assistance (not grading) — Suggesting improvements, identifying structural weaknesses, offering alternative phrasings
- Summarization — Condensing long essays into key points reliably
- Feedback drafting — Generating constructive comments that a human teacher can review and approve
- Content automation at scale — Producing e-learning content, quiz questions, and study guides
The paper itself notes: "LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring." Key word: supporting, not replacing.
3 Developer Use Cases That Actually Work
1. Generate Essay Drafts for Training Datasets
If you're building an AES system, you need training data. LLMs can generate thousands of essay variations at different quality levels for a fraction of human writer costs.
2. Build Writing Assistance Tools (Not Graders)
The research confirms LLMs are good at generating internally consistent feedback. Build tools that suggest improvements and flag weak arguments — frame it as "AI writing coach," not "AI grader."
3. Automated Content Generation for E-Learning
E-learning platforms need massive amounts of content: practice prompts, model answers, study guides. LLMs excel here. At NexaAPI's pricing, you can process thousands of content generation requests for dollars.
Build an AI Essay Assistant in 10 Lines of Code
Python — AI Writing Coach
from nexaapi import NexaAPI
client = NexaAPI(api_key='YOUR_API_KEY')
# Generate essay feedback — coaching, not grading
response = client.chat.completions.create(
model='gpt-4o', # Check nexa-api.com for latest models
messages=[
{
'role': 'system',
'content': 'You are a writing coach. Provide constructive feedback on essays, '
'focusing on structure, clarity, and argument strength. '
'Do not assign numeric grades.'
},
{
'role': 'user',
'content': 'Please review this essay and suggest improvements: [ESSAY TEXT HERE]'
}
],
max_tokens=500
)
print(response.choices[0].message.content)
# Cost: fraction of a cent per request via NexaAPI
JavaScript — AI Writing Feedback API
// Install: npm install nexaapi
import NexaAPI from 'nexaapi';
const client = new NexaAPI({ apiKey: 'YOUR_API_KEY' });
async function getEssayFeedback(essayText) {
const response = await client.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: 'You are a writing coach. Provide constructive feedback, do not assign grades.'
},
{
role: 'user',
content: `Please review this essay: ${essayText}`
}
],
maxTokens: 500
});
return response.choices[0].message.content;
}
getEssayFeedback('Your essay text here...').then(console.log);
// npm install nexaapi — cheapest LLM API on the market
Why NexaAPI for Ed-Tech
| Provider | Cost | Free Tier | Models |
|---|---|---|---|
| NexaAPI | Cheapest available | ✅ Yes | 56+ |
| OpenAI Direct | $2.50/1M tokens | ❌ No | ~15 |
| Anthropic Direct | $3.00/1M tokens | ❌ No | ~8 |
NexaAPI is OpenAI-compatible — just change the base URL and API key. No code rewrite needed.
The Smart Developer's Response
The arXiv paper isn't a reason to avoid LLMs in education. It's a roadmap for using them correctly. Don't build AI graders. Build AI writing coaches, content generators, and feedback assistants.
- 🌐 nexa-api.com — Free API key, no credit card required
- ⚡ RapidAPI — Try on RapidAPI
- 🐍
pip install nexaapi— PyPI - 📦
npm install nexaapi— npm
Reference: arXiv:2603.23714 — "LLMs Do Not Grade Essays Like Humans" (Barbosa et al., March 24, 2026)
Top comments (0)