DEV Community

q2408808

LLMs Can't Grade Essays Like Humans — But Here's What AI Does Better (With Free API)

The Research Is In: LLMs Struggle at Essay Grading

A new paper published on arXiv on March 24, 2026 drops a bombshell for anyone building AI-powered education tools: "LLMs Do Not Grade Essays Like Humans". Researchers evaluated GPT and Llama family models against human graders in out-of-the-box settings — no fine-tuning, no task-specific training. The verdict? Agreement between LLM scores and human scores remains "relatively weak."
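To make "agreement" concrete: one metric commonly used in automated essay scoring is quadratic weighted kappa (QWK), which penalizes large score gaps more than small ones. The summary above does not say which metric the paper reports, so treat this as an illustrative sketch only. The scores and the 1-6 scale below are made up for the example.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two lists of integer scores.

    1.0 means perfect agreement, 0.0 means chance-level agreement,
    and negative values mean worse than chance.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")
    if any(not min_score <= s <= max_score for s in list(human) + list(model)):
        raise ValueError("all scores must lie within [min_score, max_score]")

    n = len(human)
    span = max_score - min_score
    observed = Counter(zip(human, model))  # how often each (human, model) pair occurs
    human_hist = Counter(human)
    model_hist = Counter(model)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / (span ** 2)  # quadratic penalty
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * human_hist[i] * model_hist[j] / (n * n)

    if expected_disagreement == 0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical scores on a 1-6 scale, not data from the paper.
human_scores = [2, 3, 4, 5, 3, 4, 6, 1]
model_scores = [4, 3, 3, 4, 4, 3, 4, 3]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 1, 6)
print(f"QWK: {kappa:.3f}")
```

If you already use scikit-learn, `cohen_kappa_score(human, model, weights="quadratic")` computes the same kind of statistic without the hand-rolled loops.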

Specifically, LLMs tend to over-score short or underdeveloped essays and under-score longer essays with minor grammatical errors. They follow coherent internal patterns — essays they praise tend to score higher — but those patterns diverge significantly from how human raters think.
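If you want to check for this kind of length bias in your own grading data, one simple approach is to compare the signed gap between model and human scores for short versus long essays. The sketch below assumes a hypothetical record layout of (word_count, human_score, model_score) and an arbitrary 150-word cutoff. Neither comes from the paper, and the sample records are invented.

```python
from statistics import mean


def mean_score_gap_by_length(records, short_max_words=150):
    """Average (model - human) score gap for short and long essays.

    A positive gap means the model scored higher than the human rater.
    Buckets with no essays map to None.
    """
    gaps = {"short": [], "long": []}
    for word_count, human_score, model_score in records:
        bucket = "short" if word_count <= short_max_words else "long"
        gaps[bucket].append(model_score - human_score)
    return {
        bucket: (mean(values) if values else None)
        for bucket, values in gaps.items()
    }


# Invented records: (word_count, human_score, model_score)
sample_records = [
    (100, 2, 4),
    (120, 3, 4),
    (600, 5, 4),
    (700, 4, 2),
]
print(mean_score_gap_by_length(sample_records))
```

A positive "short" gap alongside a negative "long" gap would match the pattern described above, though a real analysis would need far more essays and a principled length cutoff.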

This is a wake-up call. But it's also a clarifying moment: it tells us exactly where AI should and shouldn't be deployed.

What LLMs Are Actually Bad At

  • Subjective evaluation — Grading requires nuanced human judgment that LLMs can't reliably replicate
  • Rubric-based scoring — LLMs apply their own internal signals rather than following explicit rubric criteria
  • Consistency across essay types — Performance varies significantly based on essay length and style
  • Replacing human judgment — The paper concludes LLMs can assist human graders but cannot replace them

What AI APIs Actually Excel At: Generative Tasks

Here's the pivot that matters for developers: while LLMs struggle with subjective evaluation, they are extraordinarily powerful at generative and creative tasks.

  • AI Image Generation — Create photorealistic images from text prompts. Models like Flux Schnell produce stunning results at scale.
  • Video Synthesis — Generate short video clips and animate stills programmatically.
  • Text-to-Speech (TTS) — Convert text to natural-sounding audio in multiple voices.
  • Text Generation at Scale — Draft content, generate variations, summarize documents.

NexaAPI gives you access to all of them through a single unified API — 56+ models, $0.003/image, free tier available.

Python Example — AI Image Generation

# Install: pip install nexaapi
from nexaapi import NexaAPI

client = NexaAPI(api_key='YOUR_API_KEY')

# AI excels at generation, not grading
response = client.images.generate(
    model='flux-schnell',
    prompt='A student studying with glowing AI assistant, digital art, vibrant colors',
    width=1024,
    height=1024
)

print(response.url)  # Your AI-generated image URL
# Cost: $0.003 per image, so 100 images come to $0.30

JavaScript Example

// Install: npm install nexaapi
import NexaAPI from 'nexaapi';

const client = new NexaAPI({ apiKey: 'YOUR_API_KEY' });

const response = await client.images.generate({
  model: 'flux-schnell',
  prompt: 'A student studying with glowing AI assistant, digital art, vibrant colors',
  width: 1024,
  height: 1024
});

console.log(response.url); // $0.003/image, roughly 7x to 13x below the other prices in the comparison table

Pricing Comparison

Provider         | Image Price   | Models | Free Tier
NexaAPI          | $0.003/image  | 56+    | ✅ Yes
OpenAI DALL-E 3  | $0.040/image  | 3      | ❌ No
Stability AI     | $0.020/image  | 8      | Limited
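To see what these per-image prices mean for a real batch, here is a small cost calculator. The prices are copied from the table above and have not been independently checked, so update them before relying on the output. The function name and the 1,000-image batch size are just for illustration.

```python
from decimal import Decimal

# Per-image prices in USD, as listed in the comparison table above.
PRICE_PER_IMAGE_USD = {
    "NexaAPI": Decimal("0.003"),
    "OpenAI DALL-E 3": Decimal("0.040"),
    "Stability AI": Decimal("0.020"),
}


def batch_cost(image_count):
    """Total cost in USD per provider for generating image_count images."""
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    return {
        provider: price * image_count
        for provider, price in PRICE_PER_IMAGE_USD.items()
    }


costs = batch_cost(1000)
baseline = PRICE_PER_IMAGE_USD["NexaAPI"]
for provider, cost in costs.items():
    ratio = PRICE_PER_IMAGE_USD[provider] / baseline
    print(f"{provider}: ${cost:.2f} for 1000 images ({ratio:.1f}x the NexaAPI price)")
```

Using `Decimal` instead of floats keeps the totals exact, which matters once you start summing many small per-image charges.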

Start Building Today

  • 🌐 nexa-api.com — Get your free API key
  • RapidAPI — Try on RapidAPI
  • 🐍 pip install nexaapi (PyPI)
  • 📦 npm install nexaapi (npm)

Reference: arXiv:2603.23714 — "LLMs Do Not Grade Essays Like Humans" (Barbosa et al., March 2026)
