A Modern Way to Evaluate an Essay
Author: Naveen Jayaraj
📅 October 21, 2025 · ⏱ 5 min read
Table of Contents
- The Challenge: Essays Are More Than Just Word Soup
- The Monarch-AES Approach
- Conquering the Challenges
- The Optimization Experiment: Monarch Butterfly Optimization
- Results and Evaluation
- Lessons and Reflections
- Conclusion
The Challenge: Essays Are More Than Just Word Soup
Automated Essay Scoring (AES) is an intriguing Natural Language Processing (NLP) challenge: scoring essays automatically in a way that is fair, consistent, and effective.
The problem is far from trivial. Essays aren't just words; they carry argument, structure, and coherence.
This understanding led to the creation of Monarch-AES (also called GraphBertAES), a hybrid AES model that combines:
- The semantic power of RoBERTa,
- The structure-aware reasoning ability of Graph Attention Networks (GAT), and
- The optimization capability of Monarch Butterfly Optimization (MBO).
Traditional transformer-based AES models like BERT and RoBERTa perform well in understanding what an essay says, but not how it is structured.
Good writing is as much about organization and coherence as it is about content.
This inspired the question:
Can a model understand both what an essay says and how it says it?
The Monarch-AES Approach
The Monarch-AES architecture was designed to evaluate both semantics (meaning) and structure (organization) of essays.
🧠 Semantic Modeling
RoBERTa processed entire essays to generate deep contextual embeddings using the [CLS] token — representing the essay’s overall meaning.
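In code, this step amounts to taking position 0 of RoBERTa's last hidden state. A minimal sketch using Hugging Face `transformers`; the tiny randomly initialised config is only so the example runs without downloading weights, whereas the project itself would load pretrained `roberta-base`:

```python
import torch
from transformers import RobertaConfig, RobertaModel

def cls_embedding(model, input_ids, attention_mask=None):
    """Return the [CLS] (<s>) token embedding as the essay-level vector."""
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask)
    # last_hidden_state: (batch, seq_len, hidden); position 0 is <s> ([CLS])
    return out.last_hidden_state[:, 0, :]

# Tiny random config so the sketch runs offline; in practice use
# RobertaModel.from_pretrained("roberta-base")
config = RobertaConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                       num_attention_heads=2, intermediate_size=64)
model = RobertaModel(config).eval()

ids = torch.randint(0, 100, (1, 12))   # stand-in for a tokenised essay
vec = cls_embedding(model, ids)
print(vec.shape)                        # torch.Size([1, 32])
```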
🔗 Structural Modeling
Each essay was represented as a graph, with sentences as nodes and two kinds of edges:
- Sequential edges → between consecutive sentences to model narrative flow
- Semantic edges → between semantically similar sentences to model thematic cohesion
This setup allowed the Graph Attention Network (GAT) to capture how ideas relate and transition — essentially, how well-structured the essay is.
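A minimal sketch of this graph construction in plain PyTorch, assuming precomputed sentence embeddings. The cosine-similarity threshold of 0.7 is a hypothetical value, since the post does not state the one actually used:

```python
import torch

def build_essay_graph(sent_embs: torch.Tensor, sim_threshold: float = 0.7):
    """Build the edge list of an essay graph whose nodes are sentences.

    sent_embs: (num_sentences, dim) sentence embeddings.
    Returns a (2, num_edges) LongTensor in COO format (PyG-style).
    """
    n = sent_embs.size(0)
    edges = []
    # Sequential edges: consecutive sentences, both directions (narrative flow)
    for i in range(n - 1):
        edges += [(i, i + 1), (i + 1, i)]
    # Semantic edges: non-adjacent sentence pairs above a cosine threshold
    normed = torch.nn.functional.normalize(sent_embs, dim=1)
    sim = normed @ normed.T
    for i in range(n):
        for j in range(i + 2, n):   # skip self and already-linked neighbours
            if sim[i, j] >= sim_threshold:
                edges += [(i, j), (j, i)]
    return torch.tensor(edges, dtype=torch.long).T

# Toy example: 4 "sentences"; sentences 0 and 2 are nearly identical
embs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.3, 0.9]])
edge_index = build_essay_graph(embs)
print(edge_index.shape)   # torch.Size([2, 10])
```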

Overall pipeline of Monarch-AES: semantic and structural layers working together.
🧩 Combined Representation
The final essay representation was created by concatenating:
- RoBERTa’s semantic embedding
- GAT’s structural embedding
This fusion of meaning and structure was then passed through a regression layer to produce the essay’s predicted score.
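The fusion step can be sketched as follows. The 768-dimensional semantic vector matches `roberta-base`; the structural dimension and the head layout are illustrative assumptions, not the project's exact sizes:

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Concatenate the semantic ([CLS]) and structural (pooled GAT) vectors,
    then regress the fused representation to a single continuous score."""
    def __init__(self, sem_dim: int = 768, struct_dim: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sem_dim + struct_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),          # scalar essay score
        )

    def forward(self, sem_vec, struct_vec):
        fused = torch.cat([sem_vec, struct_vec], dim=-1)
        return self.head(fused).squeeze(-1)

model = FusionRegressor()
sem = torch.randn(4, 768)      # batch of 4 essay-level RoBERTa embeddings
struct = torch.randn(4, 128)   # matching pooled GAT embeddings
scores = model(sem, struct)
print(scores.shape)            # torch.Size([4])
```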

Graph Attention Network (GAT) used for structural coherence modeling.

RoBERTa model architecture providing contextual semantic representation.
Conquering the Challenges
Building Monarch-AES came with several hurdles — both technical and conceptual.
⚙️ Data Handling
- The model was trained and evaluated on the ASAP 2.0 dataset, a benchmark for AES tasks.
- To focus on consistency, a single essay prompt (“The Venus Prompt”) was chosen.
- Class imbalance was tackled using PyTorch’s WeightedRandomSampler for balanced training.
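Inverse-class-frequency weighting with `WeightedRandomSampler` looks roughly like this, with toy labels standing in for essay score classes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced label set standing in for essay score classes
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])

# Weight each sample by the inverse frequency of its class
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)

dataset = TensorDataset(torch.randn(len(labels), 8), labels)
loader = DataLoader(dataset, batch_size=3, sampler=sampler)

drawn = torch.cat([y for _, y in loader])
print(drawn.numel())   # 9 samples per epoch, rarer classes oversampled
```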
🔄 Graph Construction
Creating dynamic sentence-level graphs during training was initially slow.
The fix? Precompute all graphs before training and store them as objects — which significantly sped up the process.
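As a sketch, the precomputation can be as simple as building every graph once and serialising the results, so training epochs only do a dictionary lookup. The `precompute_graphs` helper and the chain-graph builder below are hypothetical stand-ins:

```python
import pickle
import tempfile
import torch

def precompute_graphs(essay_sentence_embs, build_fn, path):
    """Build every essay graph once, up front, and pickle the results."""
    graphs = {eid: build_fn(embs) for eid, embs in essay_sentence_embs.items()}
    with open(path, "wb") as f:
        pickle.dump(graphs, f)
    return graphs

def load_graphs(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy "essays": random sentence embeddings; build_fn links consecutive sentences
essays = {i: torch.randn(5, 16) for i in range(3)}
chain = lambda e: torch.tensor([[j, j + 1] for j in range(e.size(0) - 1)]).T

cache = tempfile.NamedTemporaryFile(suffix=".pkl", delete=False).name
precompute_graphs(essays, chain, cache)
graphs = load_graphs(cache)
print(len(graphs), graphs[0].shape)   # 3 torch.Size([2, 4])
```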
🧪 Debugging & Stability
- Mixing `BertModel` and RoBERTa caused compatibility warnings — solved by migrating fully to `RobertaModel`.
- Common errors (like unimported modules or `train_test_split` issues) reinforced the value of clean, reproducible code practices.
The Optimization Experiment: Monarch Butterfly Optimization
While standard training used AdamW, Monarch-AES also underwent an experimental phase using Monarch Butterfly Optimization (MBO) — a metaheuristic inspired by butterfly migration.
Unlike gradient descent:
- MBO evolves a population of solutions across generations.
- It balances exploration and exploitation using Lévy flights.
🌿 Why MBO?
MBO can escape local minima that AdamW tends to converge to, providing better parameter exploration in complex, high-dimensional spaces.
The experimental MBO setup required:
- Removing backpropagation
- Ranking model candidates by fitness (loss)
- Iteratively evolving them
Although computationally heavier, MBO showed that nature-inspired algorithms can successfully tune deep models in novel ways.
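The steps above can be made concrete with a heavily simplified MBO sketch on a toy objective: the population is ranked by fitness, a migration operator mixes genes between two subpopulations, and a Lévy-style step perturbs the second. Hyperparameters follow commonly cited MBO defaults; this is an illustration, not the project's exact implementation:

```python
import numpy as np

def mbo_minimize(fitness, dim, pop_size=20, generations=50,
                 p=5/12, peri=1.2, bar=5/12, seed=0):
    """Simplified Monarch Butterfly Optimization sketch (minimisation)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(pop_size, dim))
    n1 = int(np.ceil(p * pop_size))          # size of subpopulation 1

    def levy(size):
        # crude heavy-tailed step: sum of tangents of random angles
        return np.sum(np.tan(np.pi * rng.uniform(0, 1, (10, size))), axis=0)

    for _ in range(generations):
        fit = np.apply_along_axis(fitness, 1, pop)
        pop = pop[np.argsort(fit)]           # rank candidates by fitness
        best = pop[0].copy()
        land1, land2 = pop[:n1].copy(), pop[n1:].copy()
        # Migration: land-1 butterflies copy genes from either subpopulation
        for i in range(len(land1)):
            for j in range(dim):
                src = land1 if rng.uniform() * peri <= p else land2
                land1[i, j] = src[rng.integers(len(src)), j]
        # Adjusting: land-2 follows the best candidate or takes a Lévy step
        for i in range(len(land2)):
            for j in range(dim):
                if rng.uniform() <= p:
                    land2[i, j] = best[j]
                elif rng.uniform() > bar:
                    land2[i, j] += 0.1 * levy(1)[0]
        pop = np.vstack([land1, land2])
        pop[-1] = best                       # elitism: keep the best candidate

    fit = np.apply_along_axis(fitness, 1, pop)
    return pop[np.argmin(fit)]

# Minimise the sphere function as a stand-in for model loss
best = mbo_minimize(lambda x: float(np.sum(x ** 2)), dim=3)
print(float(np.sum(best ** 2)))   # loss of the best candidate found
```

In the actual experiment each "butterfly" would be a full parameter vector of the model, which is why the approach is computationally heavier than gradient descent.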
Results and Evaluation
Model performance was measured using key AES metrics:
| Metric | Description | Score |
|---|---|---|
| QWK | Quadratic Weighted Kappa | 0.834 |
| MSE | Mean Squared Error | 0.198 |
| MAE | Mean Absolute Error | 0.256 |
These results indicate high agreement with human raters and low average deviation from true essay scores.
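For reference, QWK can be computed directly from the confusion matrix between human and predicted integer scores, penalising disagreements by the squared distance between ratings. This small NumPy implementation is a generic sketch of the metric, not the project's evaluation code:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic Weighted Kappa between two integer raters."""
    O = np.zeros((num_classes, num_classes))   # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Expected matrix from the outer product of the marginal histograms
    hist_t = O.sum(axis=1)
    hist_p = O.sum(axis=0)
    E = np.outer(hist_t, hist_p) / O.sum()
    # Quadratic disagreement weights
    idx = np.arange(num_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Perfect agreement yields 1.0; the post's reported QWK was 0.834
human = [0, 1, 2, 3, 2, 1]
model_scores = [0, 1, 2, 3, 2, 1]
print(quadratic_weighted_kappa(human, model_scores, num_classes=4))   # 1.0
```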

Confusion matrix showing strong alignment between predicted and actual scores.

Predicted vs Actual score distribution — illustrating strong model reliability.
Visual analyses such as loss curves, scatter plots, and confusion matrices demonstrated that Monarch-AES consistently outperformed transformer-only baselines, achieving more human-like evaluation through the blend of semantics and structure.
Lessons and Reflections
The Monarch-AES project yielded several insights:
- Hybrid architectures combining transformers and GNNs lead to richer, more interpretable representations.
- Graphical essay modeling revealed how thematic links and sentence transitions influence perceived writing quality.
- Metaheuristic optimizers like MBO can outperform gradient-based ones in navigating complex search landscapes.
- Efficient preprocessing and clean pipelines greatly improved scalability and reproducibility.
Conclusion
Monarch-AES was built on a fundamental belief:
Meaning and structure must go hand in hand — in writing and in AI.
Essays are not mere token sequences but structured arguments.
By combining RoBERTa’s semantic power with GAT’s structural insight, the system evaluated essays like a human — understanding what is said and how it’s said.
This work underscored the importance of hybrid intelligence — blending architectures and ideas for deeper understanding.
The future of NLP lies not in choosing between semantics and structure but in integrating them into one unified system.
Tags
essay scoring system · monarch butterfly optimization · AI · Machine Learning · RoBERTa · GNN
© 2025 Naveen's Technical Revelations