I didn’t start with the idea of building an exam platform. It actually came from a different problem. We were using AI to generate structured data for APIs, and everything looked fine at first. The responses looked correct; nothing was obviously wrong. But then things started breaking in production in very strange ways. One example was a value like 120.5 instead of 120.50. It is the same number from a human point of view, but the downstream system rejected it because it expected an exact format. These were small issues, but they took a lot of time to debug, and they kept happening.
That got me thinking. If AI behaves like this with structured data, what happens when we use it to generate exam questions or evaluate answers? In demos it looks impressive. It can generate questions instantly, even evaluate answers. But in real usage, consistency becomes a problem. Difficulty levels vary randomly, answers are not always structured the same way, and evaluation can feel subjective. That’s not something you can rely on for students or schools.
At first, I tried fixing it the usual way—by improving prompts. Making them longer, adding more rules, being very specific. It helped a little, but it didn’t solve the core issue. You still get edge cases where the output is slightly off. That’s when I realized the problem is not the prompt. The problem is trusting AI output directly without control.
So instead of trying to “fix AI,” I built a small system around it. It’s a simple Java-based application that runs as a JAR. Students can enter their details, choose subject and topic, and the system generates questions, runs a timer, collects answers, and produces a report. Nothing very new there. The important part is what happens in between.
Home Page (Landing + Features)

This is the main entry point of the system. It shows the overall idea clearly — an AI-powered exam platform where users can generate questions, register, and select topics.
What stands out here is that the system is not just a basic form-based app. It is positioned as a complete examination framework, with features like AI question generation, evaluation, timer, and reporting already integrated.
The feature section below highlights the core capabilities in a structured way. It shows that the platform is designed to handle the full exam lifecycle, not just question generation. That makes it more like a system solution rather than a small tool.
Student Registration

This screen captures detailed student information: not just name and email, but also age, country, experience, interests, and education level.
This is important because the system is trying to personalize question generation based on user context. It shows that the design is thinking beyond generic questions and moving toward adaptive exam generation.
The structure is simple, but the idea behind it is strong — collecting enough context so AI can generate more relevant and meaningful questions.
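As a rough sketch, that context might be captured in a small value type and folded into the prompt. Everything below (the StudentProfile record, the toPromptContext method) is my own illustration, not the project’s actual code:

```java
// Illustrative only: a compact profile type feeding prompt generation.
// Field names mirror the registration screen; none of this is the
// project's actual code.
public record StudentProfile(
        String name,
        int age,
        String country,
        String experience,
        String interests,
        String educationLevel) {

    // Collapses the captured context into a prompt fragment so the AI
    // can tailor question wording and difficulty to the student.
    public String toPromptContext() {
        return "Student context: age " + age + ", country " + country
                + ", experience " + experience + ", interests " + interests
                + ", education level " + educationLevel + ".";
    }
}
```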
Exam Topics

This screen shows predefined exam topics like Java, Spring, System Design, etc., with clear details:
difficulty level
number of questions
time duration
This is where the system becomes more structured. Instead of random question generation, it introduces controlled exam configuration.
It also shows that the system is trying to balance:
flexibility (multiple topics)
control (fixed duration, levels)
This reduces randomness and makes the exam predictable.
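A minimal sketch of what a pinned configuration could look like; the class names and preset numbers are placeholders I chose, not the platform’s real values:

```java
// Hypothetical exam configuration: everything the AI is NOT allowed
// to decide lives here, fixed before generation starts.
public record ExamConfig(String topic,
                         Difficulty difficulty,
                         int questionCount,
                         int durationMinutes) {

    public enum Difficulty { EASY, MEDIUM, HARD }

    // Example preset mirroring a topic card; the numbers are
    // placeholders, not the platform's real values.
    public static ExamConfig javaEasy() {
        return new ExamConfig("Java", Difficulty.EASY, 10, 15);
    }
}
```

Because the configuration is fixed before any AI call is made, the model only fills in content within boundaries the system has already decided.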
Exam Instructions

This screen shows instructions before the exam starts. It includes rules like:
time limits
no refresh
answers cannot be changed
The “Start Exam Now” action clearly separates setup from execution, which is good design for flow control.
Mathematics – Easy

This screen shows basic math questions like addition and multiplication. At first glance, it looks like a normal quiz, but behind the scenes the questions are generated using AI based on the selected subject (Mathematics) and difficulty (Easy).
When the user selects the topic, the system sends a prompt to the AI like:
“Generate easy-level math questions with multiple choice answers in a fixed format.”
The AI returns a response, but the system does not display it directly. Instead, it validates the structure (a rough sketch follows this list):
ensures each question has exactly 4 options
checks formatting consistency
extracts the correct answer
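Here is a minimal sketch of that structural check, assuming the response has already been parsed into a simple question shape. The RawQuestion record is my assumption, not the project’s real model:

```java
import java.util.List;

// Assumed shape for a parsed AI question; illustrative only.
record RawQuestion(String text, List<String> options, String answer) {}

class StructureValidator {

    // Reject anything without exactly 4 options, and make sure the
    // stated answer actually appears among them before it is stored.
    static RawQuestion validate(RawQuestion q) {
        if (q.options().size() != 4) {
            throw new IllegalArgumentException(
                    "expected 4 options, got " + q.options().size());
        }
        if (!q.options().contains(q.answer())) {
            throw new IllegalArgumentException("answer not found in options");
        }
        return q;
    }
}
```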
For math questions specifically, the system can also verify correctness deterministically (see the sketch after this list). For example:
15 + 27 is recalculated by the system
the correct answer is confirmed before being stored
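A rough sketch of that recheck, assuming questions arrive as plain text like “What is 15 + 27?” (the format and the class name are mine, purely illustrative):

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative deterministic recheck: recompute the arithmetic rather
// than trusting the answer the AI marked as correct.
class MathAnswerVerifier {

    // Matches simple two-operand questions such as "15 + 27".
    private static final Pattern ARITHMETIC =
            Pattern.compile("(\\d+)\\s*([+\\-*/])\\s*(\\d+)");

    // Returns the recomputed result, or empty if the text is not
    // simple arithmetic and needs a different validation path.
    static OptionalInt recompute(String questionText) {
        Matcher m = ARITHMETIC.matcher(questionText);
        if (!m.find()) return OptionalInt.empty();
        int a = Integer.parseInt(m.group(1));
        int b = Integer.parseInt(m.group(3));
        return OptionalInt.of(switch (m.group(2)) {
            case "+" -> a + b;
            case "-" -> a - b;
            case "*" -> a * b;
            default  -> a / b; // integer division is enough at easy level
        });
    }

    public static void main(String[] args) {
        // The AI's stored answer is accepted only if it matches this.
        System.out.println(recompute("What is 15 + 27?")); // OptionalInt[42]
    }
}
```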
Science – Medium

In this screen, the questions are conceptual, like chemical symbols or physics facts.
Here, the system relies more on AI knowledge, but still enforces:
structured question format
single correct answer
valid option set
Since these are not numeric, the system cannot “calculate” answers the same way as math. Instead, as sketched after this list, it:
cross-checks answer format
ensures only one correct answer is marked
optionally re-prompts AI for clarification if response is ambiguous
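Sketched below under the same caveat (assumed types, assumed retry count), the shape check and the re-prompt loop might look like this:

```java
import java.util.List;
import java.util.function.Supplier;

// Assumed shape: conceptual questions carry indexes of options marked
// correct, so "exactly one correct answer" is checkable mechanically.
record ConceptQuestion(String text, List<String> options, List<Integer> correctIndexes) {}

class ConceptualValidator {

    // Shape check: four options, exactly one of them marked correct.
    static boolean isWellFormed(ConceptQuestion q) {
        return q.options().size() == 4
                && q.correctIndexes().size() == 1
                && q.correctIndexes().get(0) >= 0
                && q.correctIndexes().get(0) < 4;
    }

    // Re-prompt pattern: retry an ambiguous response a few times,
    // then fail loudly instead of guessing.
    static ConceptQuestion generateValidated(Supplier<ConceptQuestion> aiCall) {
        for (int attempt = 0; attempt < 3; attempt++) {
            ConceptQuestion q = aiCall.get();
            if (isWellFormed(q)) return q;
        }
        throw new IllegalStateException("AI response stayed ambiguous after 3 attempts");
    }
}
```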
Programming – Hard

This screen shows advanced questions like time complexity and data structures.
Here, the system uses AI to generate:
domain-specific questions
appropriate difficulty
relevant answer choices
To improve reliability, the system:
enforces known patterns (e.g., Big-O notation format)
validates option consistency
ensures only one correct answer is selected
For some questions, the system can also apply rule-based validation (sketched below), like:
valid complexity values (O(n), O(log n), etc.)
known correct answers for standard problems
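A sketch of what those rules could look like; the regex and the lookup table are illustrative, not the project’s actual rule set:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Illustrative rule-based layer for complexity questions.
class ComplexityRules {

    // Accepts canonical forms like O(1), O(n), O(log n), O(n log n), O(n^2).
    private static final Pattern BIG_O =
            Pattern.compile("O\\((1|n|log n|n log n|n\\^\\d+)\\)");

    static boolean optionsLookValid(List<String> options) {
        return options.stream().allMatch(o -> BIG_O.matcher(o.trim()).matches());
    }

    // Pinned answers for classic questions; these override whatever
    // the AI claims, the same way the math recheck does.
    private static final Map<String, String> KNOWN_ANSWERS = Map.of(
            "binary search", "O(log n)",
            "merge sort", "O(n log n)");

    static Optional<String> knownAnswer(String problem) {
        return Optional.ofNullable(KNOWN_ANSWERS.get(problem.toLowerCase()));
    }
}
```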
These rule-based checks are a small but important part. They show that the system is designed for real-world exam conditions, not just question generation.
Every AI response goes through a validation layer before it is used. That means checking structure, fixing formatting issues, ensuring required fields exist, and making sure the output is consistent. So instead of just taking what AI gives, the system adjusts it into something predictable. In simple terms, AI suggests, but the system decides.
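To tie this back to the 120.5 vs 120.50 story from the beginning, here is a minimal sketch of the kind of normalization such a layer performs. The class is illustrative, not the project’s actual code:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// "AI suggests, the system decides": the layer rewrites AI output
// into the exact shape downstream code expects.
class OutputNormalizer {

    // Forces numeric strings into a fixed two-decimal format instead
    // of trusting however the model happened to print them.
    static String normalizeAmount(String raw) {
        return new BigDecimal(raw.trim())
                .setScale(2, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeAmount("120.5")); // prints 120.50
    }
}
```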
The system is designed to keep the experience simple and clear for students. Every question follows the same format, with the same number of options and a consistent layout, so students don’t get confused or distracted by changing structures.
When a student clicks on “Show Answer”, it’s not just revealing whatever the AI generated. The answer has already been carefully processed by the system. It is first extracted from the AI response, then validated to make sure it is correct and properly formatted. For numerical questions, the system can even recheck the calculation before showing the answer.
This way, students can trust that the answer they see is accurate, consistent, and helpful for learning—not just a random AI output.
This small change made a big difference. The system became much more stable. The outputs were consistent. The same input would lead to similar structure every time. It stopped feeling like a demo and started behaving more like something usable.
I also kept the system intentionally simple. No heavy UI, no complex setup. Just a Java JAR with an in-memory database. You can run it locally and try it out. The goal was not to build a full product, but to test this idea of combining AI with strict validation.
I’m sharing this because I keep seeing the same pattern everywhere. AI-first systems look great initially, but small inconsistencies show up later and cause real problems. Not big failures, just small ones that are hard to trace. Adding a control layer seems boring, but it makes the system reliable.
If you’ve worked on something similar—AI-generated data, exam systems, or validation layers—I’d be interested to hear how you handled it. Did you keep improving prompts, or did you add some kind of control mechanism?
I’ve put the project here:
GitHub: https://github.com/swapneswarsundarray/ai-assisted-exam
Feedback: https://swapneswarsundarray.github.io/ai-assisted-exam/feedback.html
Still early, still evolving. If this sounds interesting, feel free to try it out or contribute. Would be good to build this with more real-world input.
In the end, I don’t think AI replaces systems. It just becomes one part of it. The rest is still structure, validation, and control. That’s where things actually start working properly.