Reimagining Essay Evaluation: Graphs, RoBERTa, and the Art of Fair Scoring

A modern way to evaluate essays

Author: Naveen Jayaraj

📅 October 21, 2025 · ⏱ 5 min read


Table of Contents

  1. The Challenge: Essays Are More Than Just Word Soup
  2. The Monarch-AES Approach
  3. Conquering the Challenges
  4. The Optimization Experiment: Monarch Butterfly Optimization
  5. Results and Evaluation
  6. Lessons and Reflections
  7. Conclusion

The Challenge: Essays Are More Than Just Word Soup

Automated Essay Scoring (AES) is an intriguing Natural Language Processing (NLP) challenge — the attempt to score essays in a way that's fair, consistent, and effective.

The issue is not trivial. Essays aren't just words; they contain argument, structure, and coherence.

This understanding led to the creation of Monarch-AES (also called GraphBertAES), a hybrid AES model that combines:

  • The semantic power of RoBERTa,
  • The structure-aware reasoning ability of Graph Attention Networks (GAT), and
  • The optimization capability of Monarch Butterfly Optimization (MBO).

Traditional transformer-based AES models like BERT and RoBERTa perform well in understanding what an essay says, but not how it is structured.

Good writing is as much about organization and coherence as it is about content.

This inspired the question:

Can a model understand both what an essay says and how it says it?


The Monarch-AES Approach

The Monarch-AES architecture was designed to evaluate both semantics (meaning) and structure (organization) of essays.

🧠 Semantic Modeling

RoBERTa processed entire essays to generate deep contextual embeddings using the [CLS] token — representing the essay’s overall meaning.
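A minimal sketch of that step is shown below using the Hugging Face transformers API. In RoBERTa the sequence-start token <s> plays the role of BERT's [CLS] and sits at position 0 of the hidden states; the model name and truncation length here are illustrative, not necessarily the project's exact settings.

```python
# Minimal sketch (not the project's exact code): extracting an essay-level
# embedding from RoBERTa. Position 0 of the last hidden state corresponds to
# the <s> token, which serves as the [CLS]-style summary vector.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

essay = "Exploring Venus is worth the risk because ..."
inputs = tokenizer(essay, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS]-style semantic embedding, shape (1, 768) for roberta-base
semantic_embedding = outputs.last_hidden_state[:, 0, :]
```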

🔗 Structural Modeling

Each essay was represented as a graph, with sentences as nodes and two kinds of edges:

  • Sequential edges → between consecutive sentences to model narrative flow
  • Semantic edges → between semantically similar sentences to model thematic cohesion

This setup allowed the Graph Attention Network (GAT) to capture how ideas relate and transition — essentially, how well-structured the essay is.
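A minimal sketch of that graph construction is below, assuming per-sentence embeddings have already been computed (for example, from RoBERTa). The build_essay_graph helper, the 0.7 similarity threshold, and the use of PyTorch Geometric are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch: sentences are nodes; sequential edges link consecutive
# sentences (narrative flow), semantic edges link non-adjacent sentence pairs
# whose cosine similarity exceeds a threshold (thematic cohesion).
import torch
import torch.nn.functional as F
from torch_geometric.data import Data

def build_essay_graph(sent_embeds: torch.Tensor, sim_threshold: float = 0.7) -> Data:
    n = sent_embeds.size(0)
    edges = []

    # Sequential edges: sentence i <-> sentence i+1
    for i in range(n - 1):
        edges += [(i, i + 1), (i + 1, i)]

    # Semantic edges: highly similar, non-adjacent sentences
    sims = F.cosine_similarity(sent_embeds.unsqueeze(1), sent_embeds.unsqueeze(0), dim=-1)
    for i in range(n):
        for j in range(i + 2, n):              # adjacent pairs are already linked
            if sims[i, j] > sim_threshold:
                edges += [(i, j), (j, i)]

    edge_index = (torch.tensor(edges, dtype=torch.long).t().contiguous()
                  if edges else torch.empty((2, 0), dtype=torch.long))
    return Data(x=sent_embeds, edge_index=edge_index)
```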

Flow chart of the project
Overall pipeline of Monarch-AES: semantic and structural layers working together.


🧩 Combined Representation

The final essay representation was created by concatenating:

  • RoBERTa’s semantic embedding
  • GAT’s structural embedding

This fusion of meaning and structure was then passed through a regression layer to produce the essay’s predicted score.
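A minimal sketch of that fusion head is below; the single GATConv layer, the hidden sizes, and the mean pooling over sentence nodes are illustrative choices, not the exact Monarch-AES architecture.

```python
# Minimal sketch of the fusion step: concatenate the RoBERTa [CLS] vector with
# a pooled GAT embedding of the sentence graph, then regress a single score.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, global_mean_pool

class GraphBertAESHead(nn.Module):
    def __init__(self, sem_dim=768, node_dim=768, gat_dim=128, heads=4):
        super().__init__()
        self.gat = GATConv(node_dim, gat_dim, heads=heads, concat=True)
        self.regressor = nn.Linear(sem_dim + gat_dim * heads, 1)

    def forward(self, semantic_embedding, x, edge_index, batch):
        # Structural embedding: GAT over the sentence graph, mean-pooled per essay
        h = self.gat(x, edge_index).relu()
        structural_embedding = global_mean_pool(h, batch)

        # Fuse meaning (RoBERTa) and structure (GAT), then predict the score
        fused = torch.cat([semantic_embedding, structural_embedding], dim=-1)
        return self.regressor(fused).squeeze(-1)
```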

GATConv model architecture
Graph Attention Network (GAT) used for structural coherence modeling.

RoBERTa architecture diagram
RoBERTa model architecture providing contextual semantic representation.


Conquering the Challenges

Building Monarch-AES came with several hurdles — both technical and conceptual.

⚙️ Data Handling

  • The model was trained and evaluated on the ASAP 2.0 dataset, a benchmark for AES tasks.
  • To focus on consistency, a single essay prompt (“The Venus Prompt”) was chosen.
  • Class imbalance was tackled using PyTorch’s WeightedRandomSampler for balanced training (a minimal sampler sketch follows this list).
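The sketch below shows the sampler setup under the assumption that integer essay scores act as class labels; the score tensor and batch size are illustrative, not the project's actual values.

```python
# Minimal sketch of class-balanced sampling with PyTorch's WeightedRandomSampler.
# Rarer score classes receive proportionally larger sampling weights.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

scores = torch.tensor([2, 3, 3, 4, 2, 5, 3, 4])        # illustrative labels
class_counts = torch.bincount(scores)
sample_weights = 1.0 / class_counts[scores].float()    # weight per training example

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(scores),
                                replacement=True)
# train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)
```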

🔄 Graph Construction

Creating dynamic sentence-level graphs during training was initially slow.

The fix? Precompute all graphs before training and store them as objects — which significantly sped up the process.
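A minimal sketch of that precomputation step is below. It reuses the illustrative build_essay_graph helper from the earlier graph-construction sketch, and the random tensors stand in for real per-essay sentence embeddings.

```python
# Minimal sketch: build every essay graph once, serialize the list, and load it
# at training time so graphs are never rebuilt on the fly.
import torch

# Stand-ins for per-essay RoBERTa sentence embeddings (n_sentences x 768 each)
essays_sentence_embeddings = [torch.randn(n, 768) for n in (5, 8, 3)]
graphs = [build_essay_graph(embeds) for embeds in essays_sentence_embeddings]

torch.save(graphs, "precomputed_graphs.pt")   # one-time cost before training

# Later, inside the training script:
graphs = torch.load("precomputed_graphs.pt")
```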

🧪 Debugging & Stability

  • Mixing BertModel with RoBERTa checkpoints caused compatibility warnings; this was solved by migrating fully to RobertaModel (see the loading sketch after this list).
  • Common errors (like unimported modules or train_test_split issues) reinforced the value of clean, reproducible code practices.
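One safe loading pattern is sketched below: the Auto* classes resolve the matching tokenizer and model classes directly from the checkpoint name, which avoids the BertModel/RoBERTa mismatch entirely. The checkpoint name is illustrative.

```python
# Minimal sketch: load tokenizer and model from the same checkpoint so their
# classes always match (here they resolve to RobertaTokenizerFast + RobertaModel).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
```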

The Optimization Experiment: Monarch Butterfly Optimization

While standard training used AdamW, Monarch-AES also underwent an experimental phase using Monarch Butterfly Optimization (MBO) — a metaheuristic inspired by butterfly migration.

Unlike gradient descent:

  • MBO evolves a population of solutions across generations.
  • It balances exploration and exploitation using Lévy flights.

🌿 Why MBO?

MBO can escape local minima where gradient-based optimizers such as AdamW may get stuck, allowing broader parameter exploration in complex, high-dimensional spaces.

The experimental MBO setup required:

  • Removing backpropagation
  • Ranking model candidates by fitness (loss)
  • Iteratively evolving them

Although computationally heavier, MBO showed that nature-inspired algorithms can successfully tune deep models in novel ways.
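For intuition, here is a highly simplified, gradient-free population loop in the spirit of that setup. It only mirrors the structure described above (evaluate fitness, rank candidates, evolve them); the real MBO migration and butterfly adjusting operators, including the Lévy flights, are replaced by a plain Gaussian perturbation, and the model/loader interfaces are assumptions.

```python
# Structural sketch of gradient-free, population-based tuning (NOT full MBO).
import copy
import torch

def fitness(model, loader, loss_fn):
    """Average loss over a loader; lower is better. No backpropagation."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def evolve(base_model, loader, loss_fn, pop_size=10, generations=20, sigma=0.01):
    population = [copy.deepcopy(base_model) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population by fitness (loss)
        population.sort(key=lambda m: fitness(m, loader, loss_fn))
        survivors = population[: pop_size // 2]

        # Refill with perturbed copies of the best candidates
        children = []
        for parent in survivors:
            child = copy.deepcopy(parent)
            with torch.no_grad():
                for p in child.parameters():
                    p.add_(sigma * torch.randn_like(p))   # stand-in for MBO operators
            children.append(child)
        population = survivors + children

    return min(population, key=lambda m: fitness(m, loader, loss_fn))
```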


Results and Evaluation

Model performance was measured using key AES metrics:

Metric | Description              | Score
-------|--------------------------|------
QWK    | Quadratic Weighted Kappa | 0.834
MSE    | Mean Squared Error       | 0.198
MAE    | Mean Absolute Error      | 0.256

These results indicate high agreement with human raters and a low average deviation from the true essay scores.
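For reference, all three metrics can be reproduced with scikit-learn; QWK is cohen_kappa_score with quadratic weights. The score arrays below are purely illustrative.

```python
# Minimal sketch of the evaluation metrics on toy data.
from sklearn.metrics import cohen_kappa_score, mean_squared_error, mean_absolute_error

y_true = [2, 3, 4, 3, 5, 2, 4]   # human-assigned scores (illustrative)
y_pred = [2, 3, 3, 3, 5, 2, 4]   # model predictions (illustrative)

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"QWK={qwk:.3f}  MSE={mse:.3f}  MAE={mae:.3f}")
```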

Confusion matrix
Confusion matrix showing strong alignment between predicted and actual scores.

Actual vs predicted graph
Predicted vs Actual score distribution — illustrating strong model reliability.

Visual analyses such as loss curves, scatter plots, and confusion matrices demonstrated that Monarch-AES consistently outperformed transformer-only baselines, achieving more human-like evaluation through the blend of semantics and structure.


Lessons and Reflections

The Monarch-AES project yielded several insights:

  • Hybrid architectures combining transformers and GNNs lead to richer, more interpretable representations.
  • Graphical essay modeling revealed how thematic links and sentence transitions influence perceived writing quality.
  • Metaheuristic optimizers like MBO can outperform gradient-based ones in navigating complex search landscapes.
  • Efficient preprocessing and clean pipelines greatly improved scalability and reproducibility.

Conclusion

Monarch-AES was built on a fundamental belief:

Meaning and structure must go hand in hand — in writing and in AI.

Essays are not mere token sequences but structured arguments.

By combining RoBERTa’s semantic power with GAT’s structural insight, the system evaluated essays like a human — understanding what is said and how it’s said.

This work underscored the importance of hybrid intelligence — blending architectures and ideas for deeper understanding.

The future of NLP lies not in choosing between semantics and structure but in integrating them into one unified system.


Tags

essay scoring system · monarch butterfly optimization · AI · Machine Learning · RoBERTa · GNN


© 2025 Naveen's Technical Revelations
