Most AI projects skip evaluation entirely: a polished demo, a few cherry-picked examples, done. I wanted to be honest, so I tested my model on 200 questions it had never seen.
How I evaluated
The test set has 1,273 questions that were never used in training. I tested on 200 of them:
correct = 0
for question in test_questions[:200]:
    # Show the model the question plus its 4 options
    # and ask it to pick A, B, C, or D
    prediction = model.predict(question)
    # Check against the correct answer
    if prediction == correct_answer:
        correct += 1
accuracy = correct / 200
The results
Total questions: 200
Correct: 62
Accuracy: 31.0%
Random baseline: 25.0%
My model beats random guessing by 6 percentage points.
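Is 6 points over baseline just luck? With 200 questions you can check: under pure guessing, the number of correct answers follows a Binomial(200, 0.25) distribution, and the chance of scoring 62 or better by luck is the one-sided tail probability. This check wasn't part of my original pipeline; it's a stdlib-only sketch:

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): probability of k or
    more successes in n trials under success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that pure guessing (p = 0.25) gets 62+ of 200 right
p_value = binom_tail(62, 200, 0.25)
print(f"one-sided p-value: {p_value:.4f}")
```

The tail probability comes out below 0.05, so the lift over random guessing is unlikely to be pure chance.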
Is 31% good?
Depends on what you're comparing to.
Compared to random guessing on 4-option MCQs — yes. The model genuinely learned something. It's not just flipping coins.
Compared to GPT-4 on the same benchmark — no. GPT-4 scores around 90%. But GPT-4 has roughly 1000x more parameters and was trained on vastly more data for months on expensive hardware.
For a 1.3 billion parameter model trained for 1.5 hours on a free GPU — 31% is a real result.
What I would do differently
Three things would improve this significantly:
Larger base model — Mistral 7B or LLaMA 3 8B would likely score 50-60%+ on the same benchmark with the same training data. Size matters for reasoning.
Better RAG knowledge base — I stored MCQ training text as knowledge chunks. Clean medical facts from PubMed or clinical guidelines would give the retriever much better material to work with.
Re-ranking — After retrieving top 3 results, a cross-encoder model could re-rank them by true relevance before injecting into the prompt.
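The re-ranking step can be sketched in a few lines. I haven't built the cross-encoder part; the `score` function below is a toy word-overlap stand-in for a real cross-encoder (e.g. a sentence-transformers `CrossEncoder`), and the chunk texts are made up for illustration:

```python
def score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question words that appear
    in the chunk. A real system would use a cross-encoder here."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def rerank(question: str, retrieved_chunks: list[str], top_k: int = 3) -> list[str]:
    """Re-order the retriever's candidates by pairwise relevance
    to the question, keeping the top_k best."""
    return sorted(retrieved_chunks,
                  key=lambda c: score(question, c),
                  reverse=True)[:top_k]

# Hypothetical retrieved candidates, for illustration only
chunks = [
    "Aspirin inhibits platelet aggregation.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "Beta blockers reduce heart rate.",
]
best = rerank("What is the first-line drug for type 2 diabetes?", chunks)
print(best[0])  # the metformin chunk ranks first
```

The key design point: the retriever scores question and chunk independently (fast, approximate), while the re-ranker scores each (question, chunk) pair jointly (slower, more accurate), so running it only on the top few candidates gets most of the accuracy at little cost.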
Why I'm sharing the real numbers
Because honest numbers are more valuable than fake demos. Any recruiter or engineer who looks at this project can see exactly what was built, how it was tested, and where the limitations are.
The architecture is production-ready. Swap in a 7B model and the retrieval pipeline, API, and UI don't change at all. That's what good engineering looks like.
Full project: github.com/YadavAkhileshh/medmind
Model: huggingface.co/Yakhilesh/medmind-opt-medical