The Research Is In: LLMs Struggle at Essay Grading
A new paper published on arXiv on March 24, 2026, drops a bombshell for anyone building AI-powered education tools: "LLMs Do Not Grade Essays Like Humans." Researchers evaluated GPT and Llama family models against human graders in out-of-the-box settings, with no fine-tuning and no task-specific training. The verdict? Agreement between LLM scores and human scores remains "relatively weak."
Specifically, LLMs tend to over-score short or underdeveloped essays and under-score longer essays with minor grammatical errors. They follow coherent internal patterns — essays they praise tend to score higher — but those patterns diverge significantly from how human raters think.
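To make "agreement" and "over-scoring" concrete, here is a minimal sketch of how you might measure both on your own data. It uses quadratic weighted kappa, a metric commonly used in automated essay scoring; whether the paper uses this exact metric is an assumption on our part. The function names and the toy scores are invented for illustration and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Chance-corrected agreement between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means chance-level agreement.
    Assumes every score lies in [min_score, max_score].
    """
    if not human or len(human) != len(model):
        raise ValueError("score lists must be non-empty and equally long")
    span = max_score - min_score
    if span <= 0:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    observed = Counter(zip(human, model))   # counts of (human, model) pairs
    human_hist = Counter(human)             # how often each human score occurs
    model_hist = Counter(model)             # how often each model score occurs

    disagreement = 0.0  # weighted disagreement actually observed
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span ** 2
            disagreement += weight * observed[(i, j)] / n
            expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if expected == 0.0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - disagreement / expected


def mean_signed_gap(human, model):
    """Average of (model - human); positive means the model over-scores."""
    return sum(m - h for h, m in zip(human, model)) / len(human)


# Made-up toy scores on a 1-6 scale, for illustration only.
short_human, short_model = [2, 2, 3], [3, 4, 4]
long_human, long_model = [5, 5, 4], [4, 3, 4]

print("short essays, mean gap:", mean_signed_gap(short_human, short_model))
print("long essays, mean gap:", mean_signed_gap(long_human, long_model))
print("kappa:", quadratic_weighted_kappa(
    short_human + long_human, short_model + long_model, 1, 6))
```

A positive gap on short essays next to a negative gap on long ones is the shape of the bias described above. Splitting the gap by essay length on a sample of your own graded essays is a quick first check before trusting any automated score.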
This is a wake-up call. But it's also a clarifying moment: it tells us exactly where AI should and shouldn't be deployed.
What LLMs Are Actually Bad At
- Subjective evaluation — Grading requires nuanced human judgment that LLMs can't reliably replicate
- Rubric-based scoring — LLMs apply their own internal signals rather than following explicit rubric criteria
- Consistency across essay types — Performance varies significantly based on essay length and style
- Replacing human judgment — The paper concludes LLMs can assist human graders but cannot replace them
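One way to read "assist but not replace" is a triage step: the human score is always the final grade, and the LLM score only decides whether an essay gets a second human look. The sketch below is a hypothetical design of ours, not something described in the paper; `GradeDecision`, `triage`, and the `max_gap` threshold are invented names and values.

```python
from dataclasses import dataclass


@dataclass
class GradeDecision:
    final_score: int           # always the human grader's score
    needs_second_review: bool  # True when the LLM disagrees strongly


def triage(human_score, llm_score, max_gap=1):
    """Keep the human score; flag the essay if the LLM differs by more than max_gap."""
    gap = abs(human_score - llm_score)
    return GradeDecision(final_score=human_score, needs_second_review=gap > max_gap)


# Hypothetical scores on a 1-6 scale.
print(triage(human_score=4, llm_score=2))  # large gap, so flagged for review
print(triage(human_score=4, llm_score=3))  # small gap, so not flagged
```

Under this design the LLM never changes a grade; it only surfaces essays where a second human opinion might be worth the time.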
What AI APIs Actually Excel At: Generative Tasks
Here's the pivot that matters for developers: while LLMs struggle with subjective evaluation, they are extraordinarily powerful at generative and creative tasks.
- AI Image Generation — Create photorealistic images from text prompts. Models like Flux Schnell produce stunning results at scale.
- Video Synthesis — Generate short video clips and animate stills programmatically.
- Text-to-Speech (TTS) — Convert text to natural-sounding audio in multiple voices.
- Text Generation at Scale — Draft content, generate variations, summarize documents.
NexaAPI gives you access to all of them through a single unified API — 56+ models, $0.003/image, free tier available.
Python Example — AI Image Generation
# Install: pip install nexaapi
from nexaapi import NexaAPI
client = NexaAPI(api_key='YOUR_API_KEY')
# AI excels at generation, not grading
response = client.images.generate(
    model='flux-schnell',
    prompt='A student studying with glowing AI assistant, digital art, vibrant colors',
    width=1024,
    height=1024
)
print(response.url) # Your AI-generated image URL
# Cost: $0.003 per image, so 100 images come to about $0.30
JavaScript Example
// Install: npm install nexaapi
import NexaAPI from 'nexaapi';
const client = new NexaAPI({ apiKey: 'YOUR_API_KEY' });
const response = await client.images.generate({
  model: 'flux-schnell',
  prompt: 'A student studying with glowing AI assistant, digital art, vibrant colors',
  width: 1024,
  height: 1024
});
console.log(response.url); // $0.003/image, roughly 7x to 13x below the other per-image prices in the pricing comparison
Pricing Comparison
| Provider | Image Price | Models | Free Tier |
|---|---|---|---|
| NexaAPI | $0.003/image | 56+ | ✅ Yes |
| OpenAI DALL-E 3 | $0.040/image | 3 | ❌ No |
| Stability AI | $0.020/image | 8 | Limited |
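To see what these per-image prices mean for a real batch, here is a small arithmetic sketch. The prices are copied from the table above and may be out of date; the helper names `batch_cost` and `price_ratio` are ours.

```python
# Per-image prices in USD, copied from the comparison table above.
PRICE_PER_IMAGE = {
    "NexaAPI": 0.003,
    "OpenAI DALL-E 3": 0.040,
    "Stability AI": 0.020,
}


def batch_cost(provider, images):
    """Total cost in USD for generating `images` images with one provider."""
    return PRICE_PER_IMAGE[provider] * images


def price_ratio(provider, baseline="NexaAPI"):
    """How many times more expensive `provider` is per image than `baseline`."""
    return PRICE_PER_IMAGE[provider] / PRICE_PER_IMAGE[baseline]


for name in PRICE_PER_IMAGE:
    print(f"{name}: ${batch_cost(name, 1000):.2f} per 1,000 images, "
          f"{price_ratio(name):.1f}x the NexaAPI price")
```

At the listed prices the gap is about 13x against DALL-E 3 and about 7x against Stability AI, so the exact multiple depends on which provider you compare against.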
Start Building Today
- 🌐 nexa-api.com — Get your free API key
- ⚡ RapidAPI — Try on RapidAPI
- 🐍 pip install nexaapi (PyPI)
- 📦 npm install nexaapi (npm)
Reference: arXiv:2603.23714 — "LLMs Do Not Grade Essays Like Humans" (Barbosa et al., March 2026)