Jesse Smith Byers

Posted on Oct 22

AI Scoring Agent Behavior: The Good, The Bad, and The Ugly

#machinelearning #python #datascience #ai

In the first post of this series, Building an AI Scoring Agent: Step-By-Step, I described the technical details of building an AI Scoring Agent for scoring open-ended science questions. This post analyzes the behavior of the prototype tool, and reflects on appropriate and inappropriate use cases for a tools like this.

TLDR: This shouldn't be used for scoring tests or exams!!!

As a recap, the tool takes in a scoring guide, a question, and a student response, and outputs a score and an explanation for the score. The sections below analyze the behavior of the agent when it is working as expected, as well as analyzes its workflow and output when things go awry.

The Good

The Agent Follows Simple Workflows

In many cases, the agent works as expected, following a predictable workflow as described in the code and summarized in the first post. The first post ended with an output example of the simplest workflow, of Plan -> Score -> Verify, outputting an accurate answer that matches the "ground truth" of a human scorer. Notably, the scoring plan is reasonable based on the scoring guide, and the explanation generated is consistent with the student response and the scoring guide.

The Agent Can Optionally Fact Check

Similarly, there were many cases in which the agent followed a more complicated path, sometimes integrating the web search tool, and re-scoring to get to an accurate score that matches the "ground truth" of a human scorer.

For example, the output below is from scoring a student response of "sodium-2, sulfur-1, oxygen-4", which is a correct response that should earn a score of 2. The agent used fact checking to check how many atoms are in the chemical formula, then scored the answer correctly.

================== A. PLANNING SCORING APPROACH ==================
🧠 Generating new plan

...(plan removed for brevity)...

================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: Fact check triggered with query: number of atoms in sodium, sulfur, and oxygen

================== B. FACT CHECKING (Optional) ==================
Fact check QUERY → RESULT: number of atoms in sodium, sulfur, and oxygen → Jan 5, 2022 · So, it's 2 atoms of Sodium , 1 atom of Sulfur , and 4 atoms of Oxygen , which gives us a total of 7 atoms per molecule of Na2SO4. The 'Total # of atoms ' refers to the total count of ato...
Score after fact check: Score: 2
EXPLANATION: The student provides a complete and accurate response by listing the total number of atoms (7) across three distinct elements (sodium, sulfur, and oxygen). Sodium is correctly identified as contributing two atoms, sulfur as one atom, and oxygen as four atoms.

================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: SCORE: 2
EXPLANATION: The response correctly identifies all elements and their respective atom counts. Sodium is listed as contributing two atoms, sulfur as one atom, and oxygen as four atoms, resulting in a total of seven atoms per molecule of Na2SO4.

==================== D. VERIFYING SCORE ====================
Verification response: {
    "new_score": 2,
    "reason": "The student response correctly identifies the total number of atoms and elements, although it uses a different notation (sodium-2 instead of Na2)."
}
OLD_SCORE: 2  -  NEW_SCORE: 2

******************** FINAL ANSWER *******************
SCORE: 2
EXPLANATION: The response correctly identifies all elements and their respective atom counts. Sodium is listed as contributing two atoms, sulfur as one atom, and oxygen as four atoms, resulting in a total of seven atoms per molecule of Na2SO4.

The Agent Can Revise The Plan To Produce Correct Scores

In the example below, the agent follows a more complex path: Plan -> Score -> Verify -> Re-Plan -> Re-Score -> Fact Check -> Re-Score -> Re-Verify. In the end, it verifies the accurate score of 1.

================== A. PLANNING SCORING APPROACH ==================
🧠 Generating new plan

...(plan removed for brevity)...

================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: SCORE: 1
EXPLANATION: The response acknowledges an external force (the hand's push), but lacks detail on other critical forces like gravity, friction, and the normal force. It also does not explain how these forces affect the block's motion before release or after being released from the hand's influence.

==================== D. VERIFYING SCORE ====================
Verification response: {
    "new_score": 2,
    "reason": "The student response demonstrates an understanding of force and motion, relating the action of pushing to the movement of an object. However, it lacks clarity in describing the relationship between direction and distance."
}
OLD_SCORE: 1  -  NEW_SCORE: 2

================== A. PLANNING SCORING APPROACH ==================
🧠 Generating new plan

...(plan removed for brevity)...

================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: Fact check triggered with query: Is pushing a box considered as applying force downwards due to gravity?

================== B. FACT CHECKING (Optional) ==================
Fact check QUERY → RESULT: Is pushing a box considered as applying force downwards due to gravity? → In straight and level flight, lift equals weight. In a steady level banked turn of 60°, lift equals double the weight. The pilot experiences 2 g and a doubled weight. The steeper the bank, the greater...
Score after fact check: SCORE: 1
EXPLANATION: The student's response only mentions a single force (the force of the hand), which is not enough to understand the complete dynamics of motion in such scenarios. They do not include gravity, normal force, or friction as forces acting on the block. No description is provided for how these forces affect the box before or after it is pushed.

================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: SCORE: 1
EXPLANATION: The student's response does not mention gravity, friction, or normal force. It also fails to explain how any forces are affecting the motion of the box before and after it is pushed.

==================== D. VERIFYING SCORE ====================
Verification response: {"new_score": 1, "reason": "The student response demonstrates a basic understanding of force and motion, using correct vocabulary."}
OLD_SCORE: 1  -  NEW_SCORE: 1

******************** FINAL ANSWER *******************
SCORE: 1
EXPLANATION: The student's response does not mention gravity, friction, or normal force. It also fails to explain how any forces are affecting the motion of the box before and after it is pushed.

Summary of the Strengths

In summary, the agent can follow its workflow to (at times) produce accurate scores, either taking a simple path (Plan -> Score -> Verify) or through taking more complicated paths including optional fact checking, and re-planning, re-scoring, and re-verification. However, the agent does not produce correct answers all the time, as illustrated in "The Bad" and "The Ugly" sections below.

The Bad

The Fact Check Can Lead To Inaccurate Scores, and Might Be Unnecessary

In this example, the agent calls the fact check, even though it already has the facts it needs clearly identified in its scoring plan. Furthermore, it seems to give the response all of the points based on the results of the fact check, and not the content of the student response. The student response for this run was "Na, S, O", which should receive a score of 1 because it does not include atom counts.

================== A. PLANNING SCORING APPROACH ==================
✅ Using cached plan

...(plan edited for brevity)...

**Flowchart Plan:**
- **Step 1:** Identify if the student answers both parts of the question correctly (elements and atoms).
    - **IF** Student answers that there are sodium, sulfur, and oxygen elements AND
        - **THEN** Check if student mentions correct atom counts for each element.
            - **IF** Correct atom counts identified (2 sodium, 1 sulfur, 4 oxygen) 
                - **THEN** Assign Score 2

...(plan edited for brevity)...


================== C. SCORING STUDENT RESPONSE ==================
Scoring Output: Fact check triggered with query: formula for sodium sulfate

================== B. FACT CHECKING (Optional) ==================
Fact check QUERY → RESULT: formula for sodium sulfate → Sodium sulfate is an inorganic compound with formula Na2SO4 and several related hydrates. It is a white solid that is highly soluble in water and has various applications in industry and medicine. Jul...
Score after fact check: SCORE: 2
EXPLANATION: The student correctly identified all elements (sodium, sulfur, oxygen) but did not specify their respective atoms in the sodium sulfate formula. A perfect response would include both the correct identification of elements and their atom counts (Na, S, O). In this case, since only the elemental symbols were provided without detailing the number of atoms for each element, the score is 2.

...(output edited for brevity)...

******************** FINAL ANSWER *******************
SCORE: 2
EXPLANATION: The student correctly identified all elements (sodium, sulfur, oxygen) and provided their respective atoms in the sodium sulfate formula.

The Agent Struggles With Partially Correct Student Responses

The example above also illustrates how the agent struggles more with student responses that should receive partial credit than responses that are clearly correct or clearly incorrect. However, this may be explained by the quality of the scoring guide. The Texas STAAR scoring guides tend to give a lot of detail for a score a 2, and much more vague descriptions for scores of 1 or 0, as shown below:

Score 2: The student response includes:
- There are a total of 7 atoms representing three elements in the formula
- AND
- The elements are sodium (2 atoms), sulfur (1 atom), and oxygen (4 atoms).
Score 1: The student answers half of the question correctly.
Score 0: The response is incorrect or irrelevant.

Summary of the Weaknesses

Initial tests showed that while the fact checking functionality worked as it was directed, it did not help the agent to increase accuracy in scoring, and may result in more inaccurate scores. In addition, the agent performs better on responses that are correct and should earn the highest score possible, and struggles more on the partially correct student responses.

The Ugly

The LLM Hallucinates While Making Plans and Composing Explanations

While the agent was successful at times, hallucinations were a major problem. For example, when making scoring plans, the agent usually created reasonable plans that matched the scoring guide, but at times would create plans that introduced additional score levels that didn't exist in the scoring guide, or made up processes to add or deduct points that were not consistent with the scoring guide.

In this forces and motion example, the plan includes too many score levels and makes up a process for point deductions, and as a result produces inaccurate scores. The original scoring guide has only 0-2 points, not 3.

...(response shortened for brevity)...

7. **Summarize Final Score**: Award total points based on cumulative score from each section.
   - APPLY scoring guide rules: 3 points if all forces are identified with accurate explanations; decrease points accordingly for less comprehensive or inaccurate responses.

**Note:** This plan assumes a simplified point system where points can be lost but not awarded, maintaining the grading scale's integrity. Adjustments may vary based on specific grading criteria outlined in `scoring_guide_2`.

This flaws in this plan were probably generated by simple text prediction, based on other scoring guides that Ollama had access to during LLM's original training. The verification node, which could trigger re-planning, helped to mitigate this problem, but the problem still exists and came up more frequently with scoring guides that were less detailed.

In addition, hallucinations were detected in score explanations, noting details about student responses that were not actually present in the response itself. For example, in this case, the student response only includes the elements, "Na, S, O" and should receive a score of 1.

SCORE: 2
EXPLANATION: The student correctly identified all elements (sodium, sulfur, oxygen) and provided their respective atoms in the sodium sulfate formula.

Again, these hallucinations are likely an artifact of text prediction more than anything else.

Text in the Student Response Can Influence Scoring

Finally, text in the student response can override the agent's prompt, leading to score the response as directed by the student. This is obviously a major problem that would prevent LLM-based agents from being used for exam-based scoring, as pointed out by a commenter on my first post:

Summary

As described above, while the agent can provide some accurate scores and reasonable explanations, it has some major limitations and flaws that would prevent it from being used for scoring summative student assessments, such as exam responses.

In my final post, I'll share some reflections I have about some possible use cases of agent-based scoring given these limitations, and some lessons learned (both tech and education-focused) from the process of building this prototype.

DEV Community