DEV Community

Jesse Smith Byers


AI Scoring Agent: Lessons Learned and New Directions

In my previous posts in this series, I described an AI Scoring Agent prototype that was designed to score open-ended student responses using a scoring guide. While the agent performed well in some cases, my last post outlined a number of ways in which it fell short. This final post is written more from my perspective as an educator, and shares the lessons I learned about AI in student assessment. Although the project showed that AI agents should NOT score student assessments, I also share some new directions in which this type of tool could be appropriate and helpful.

Lessons Learned about AI in Assessment

1. AI Agents SHOULD NOT Score Student Responses

This was highlighted in my last post, so I will not repeat the evidence here. I am admittedly an AI skeptic when it comes to education, so this lesson is not at all surprising to me. Hallucinations generated by the LLM are the biggest concern, so any further work on using LLM-based agents for assessment would have to solve this problem first.

2. But...AI Agents CAN Replicate The Human Scoring Process

A positive finding was that agents can be designed to replicate the structured scoring processes that are required to score open-ended responses. I was able to have the agent internalize the scoring guide by summarizing the differences between each score level and coming up with a solid plan to apply the scoring criteria to the response.

Additionally, an agent can verify scores and use conditional logic to determine whether fact checking or re-planning is necessary. This agent was designed to keep its scoring plan as long as the plan produced scores that could be verified, and to replace a plan that was not working well with a new one. This mirrors the human process of trying a new approach when an old one stops being effective.
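The keep-or-replace behavior described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the prototype's actual code: `score_with_replanning`, `make_plan`, `score`, `verify`, and `max_attempts` are hypothetical names, and the three callables stand in for whatever LLM calls the agent really makes.

```python
from typing import Callable, Optional, Tuple


def score_with_replanning(
    response: str,
    make_plan: Callable[[], str],
    score: Callable[[str, str], int],
    verify: Callable[[str, int], bool],
    plan: Optional[str] = None,
    max_attempts: int = 3,
) -> Tuple[Optional[int], str]:
    """Score one response, keeping the plan only while its scores verify.

    All callables are hypothetical stand-ins for the agent's LLM steps.
    """
    plan = plan or make_plan()
    for _ in range(max_attempts):
        candidate = score(plan, response)
        if verify(response, candidate):
            # The plan produced a verifiable score, so keep it for reuse.
            return candidate, plan
        # The score could not be verified, so replace the plan and retry.
        plan = make_plan()
    # No verified score within the attempt budget: defer to a human scorer.
    return None, plan
```

Returning `None` when nothing verifies is a design choice made for this sketch: it keeps the final decision with a human instead of forcing a number.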

3. Examples and Strict Formats Are Not Always Helpful

In human scoring, teachers are trained using a scoring guide and a collection of exemplars (multiple examples of student responses at each score level, with a score and an explanation). These exemplars are designed to help each teacher calibrate on what each score level looks like, so that groups of teachers will score similarly to one another. Examples are contained within the STAAR materials that I used in this project.

In the first version of the agent, I provided 3 few-shot examples of student responses and example explanations at each score level in the prompt, with the hope that the agent would calibrate on them much as humans do. I quickly learned that these examples limited the agent's ability to accurately score new responses: if the new response was not similar to an example response, then it would receive a low score, even if the response was correct.

In the same vein, my attempts to provide more specific guidance on formatting the agent's scoring plan or structuring the explanations for each score also led to poorer results. In both cases, it seemed that the agent overfit to the examples, to the detriment of its performance on new student responses.
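To make the comparison concrete, here is a minimal sketch of how a scoring prompt might be assembled with or without few-shot examples. The function name, the parameters, and the prompt wording are all assumptions for illustration; they are not the prompts used in the prototype.

```python
from typing import Optional, Sequence, Tuple

# Each example is (response text, score, explanation); this shape is assumed.
Example = Tuple[str, int, str]


def build_scoring_prompt(
    scoring_guide: str,
    student_response: str,
    examples: Optional[Sequence[Example]] = None,
) -> str:
    """Assemble a prompt; omit `examples` for the guide-only variant."""
    parts = ["Scoring guide:", scoring_guide.strip()]
    for i, (text, score, explanation) in enumerate(examples or [], start=1):
        parts.append(f"Example {i} (score {score}): {text}\nWhy: {explanation}")
    parts += [
        "Student response:",
        student_response.strip(),
        "Reply with a score and a short explanation.",
    ]
    return "\n\n".join(parts)
```

Keeping the examples optional makes it easy to run the same responses through both variants and compare the results side by side.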

4. The Agent Knows Enough Without Fact Checking

While tools can be helpful, adding tools can lead to unneeded complexity without additional benefits. In the case of my agent, none of the content required up-to-the-minute accuracy, so the LLM's training data already contained the factual information needed to score any of the science questions. The fact checking just increased runtime without real benefit.

5. The Agent Generally Shares Reasonable Text Explanations

In most cases, the explanations matched evidence in the student response and reflected criteria in the scoring guide, and this was more reliable than the numeric scores the agent might produce. This opens the door to using agents for formative, feedback-oriented tasks rather than grading tasks. In fact, unprompted, the agent would often suggest specific improvements to a student response to elevate it to the next score level!

New Directions

As I stated at the outset, I am an AI-in-education skeptic. Although this exploration confirmed my belief that AI agents should not be used for scoring, it also highlighted some new directions in which similar AI agent-based tools could be useful for teachers or students.

1. Coaching Students to Write Explanations

Many students struggle with writing in science class, specifically when asked to develop explanations in Claim-Evidence-Reasoning (CER) format. There are many rubrics focused on scoring CER explanations. The tool could be used by a student to evaluate their CER response against a CER rubric, and the score and explanation could give them valuable feedback on how to better structure their answer. This process could be fairly self-directed by a student, before handing in an answer to be graded by a (human) teacher.

For example, the agent used the CER Rubric below to grade a question about photosynthesis.

CER Rubric:

Question:

Student Response:

“Blue light resulted in more bubbles than red or green so it had the highest rate. I know green light is often reflected by plants so less is used, so the plant did less under green, and red was lowest.”

The tool (with unaltered prompts) returned the following final score and explanation:

SCORE: 2
EXPLANATION: The student's response includes a claim that blue light had the highest rate based on observed bubbles, provides some evidence related to plant pigmentation, and gives reasoning connecting green light reflection with lower photosynthesis. However, the claim is not directly supported by evidence from the table provided in the plan, and the explanation does not fully explain the impact of specific wavelengths on photosynthesis or connect it accurately to the data mentioned. The response lacks precision due to vague references ("more bubbles") and does not completely align with a coherent scientific argument structure (Claim-Evidence-Reasoning).

A student reading this could infer that the response would be stronger if they:

  • used evidence from the data table
  • expanded the answer to include details about wavelength and impact on photosynthesis
  • used more precise language
  • explicitly identified their claim, evidence, and reasoning within the answer

The student could polish up their answer, feed it through the AI agent for additional feedback, and submit it to the teacher when they have achieved a higher score. With only a small adjustment to the text prompt within the score node, the agent could provide the feedback in a more explicit manner.
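If a student-facing tool were built around this loop, the agent's raw text would first need to be split into a score and an explanation. The sketch below assumes the output always follows the `SCORE:` / `EXPLANATION:` layout shown in the example above; the names `ScoredResponse` and `parse_agent_output` are hypothetical.

```python
import re
from typing import NamedTuple


class ScoredResponse(NamedTuple):
    score: int
    explanation: str


# Assumes "SCORE: <integer>" followed by "EXPLANATION: <text>".
_PATTERN = re.compile(r"SCORE:\s*(\d+)\s*EXPLANATION:\s*(.+)", re.DOTALL)


def parse_agent_output(raw: str) -> ScoredResponse:
    """Split the agent's text output into a numeric score and explanation."""
    match = _PATTERN.search(raw)
    if match is None:
        # Fail loudly rather than guess a score from malformed output.
        raise ValueError("Output does not match the SCORE/EXPLANATION format")
    return ScoredResponse(int(match.group(1)), match.group(2).strip())
```

With the two fields separated, a tool could show the student only the explanation, which fits the feedback-oriented use described here.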

2. Providing a Second Opinion For A Human Scorer

Teachers are always exhausted, and grading open-ended responses takes a long time. Much of this time is spent debating internally whether a response is closer to one score level description or another. While it is ideal to get a second set of eyes on a student response, that is not always feasible. Instead, a teacher could use the scoring agent for individual responses that they consider to be borderline, and treat the agent's score and explanation as a second opinion to help them justify a higher or lower score. This would not replace human grading in any way; it would simply add another perspective.

3. Making Exit Tickets More Productive

Exit tickets are considered a best practice in teaching. A teacher gives the class a carefully constructed question at the end of class that is handed in before the bell. Ideally, the teacher reads all of the responses overnight (but does not grade them), does some analysis of trends within the answers, and uses this data to plan the next day's instruction. However, the exit ticket strategy often falls apart because it is time-consuming to carefully examine all of the responses, and many teachers don't have time to analyze the data on a daily basis.

This scoring tool could be adapted to accept a batch of student exit ticket answers, and return an analysis at the class level. It could highlight trends in the class by summarizing the explanations returned when each exit ticket response is scored. This way, a teacher can collect general information about where the class stands, and use that information to plan a lesson around any student misconceptions or partial understandings that are uncovered. Individual students would never be held accountable to a score assigned by a machine, but everyone would benefit from the insights delivered by the scoring agent.
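A rough sketch of the class-level step might look like the following. It assumes each exit ticket has already been scored and reduced to an anonymous `(score, explanation)` pair; the function name and the shape of the returned dictionary are hypothetical, and the actual summarizing of explanations would still be a separate LLM step that is not shown.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_exit_tickets(results: Iterable[Tuple[int, str]]) -> Dict[str, dict]:
    """Aggregate anonymous (score, explanation) pairs to the class level."""
    counts: Counter = Counter()
    by_score: Dict[int, List[str]] = defaultdict(list)
    for score, explanation in results:
        counts[score] += 1
        by_score[score].append(explanation)
    return {
        # How many responses landed at each score level.
        "score_counts": dict(sorted(counts.items())),
        # Explanations grouped by level, ready to be summarized for trends.
        "explanations_by_score": dict(sorted(by_score.items())),
    }
```

Because no student names enter the function, the output can only describe the class as a whole, which matches the goal of never holding an individual student to a machine-assigned score.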

Wrapping Up

My experience of designing and building a prototype AI Scoring Agent has been illuminating. While it confirmed my suspicions that grading needs to be done by a human, it did spark some ideas for how AI agents could be designed to reduce the burden on teachers, allowing them to focus their limited time on tasks that are most important for their students.

A lot of AI-focused work is being done right now in education, and my hope is that those working in the EdTech sector continue to have a healthy curiosity to explore what's possible, throw out ideas that don't work, and use the experience to spark new ideas that can benefit both students and teachers.

Top comments (1)

Benjamin Nguyen

nice, Jesse!