Most AI projects skip evaluation entirely: a polished demo, a few cherry-picked examples, done. I wanted to be honest, so I tested my model on 200 questions it had never seen.
How I evaluated
The test set has 1,273 questions that were never used in training. I tested on 200 of them:
correct = 0
for question in test_questions[:200]:
    # Show the model the question plus its 4 options
    # and ask it to pick A, B, C, or D
    prediction = model.predict(question)
    # Check against the correct answer
    if prediction == correct_answer:
        correct += 1
accuracy = correct / 200
The results
Total questions: 200
Correct: 62
Accuracy: 31.0%
Random baseline: 25.0%
My model beats random guessing by 6 percentage points.
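Is 6 points over baseline just luck? With 200 questions you can check: under pure guessing, the number of correct answers follows a Binomial(200, 0.25) distribution, and the chance of scoring 62 or better by luck is the one-sided tail probability. This check wasn't part of my original pipeline; it's a stdlib-only sketch:

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): probability of k or
    more successes in n trials under success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that pure guessing (p = 0.25) gets 62+ of 200 right
p_value = binom_tail(62, 200, 0.25)
print(f"one-sided p-value: {p_value:.4f}")
```

The tail probability comes out below 0.05, so the lift over random guessing is unlikely to be pure chance.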
Is 31% good?
Depends on what you're comparing to.
Compared to random guessing on 4-option MCQs — yes. The model genuinely learned something. It's not just flipping coins.
Compared to GPT-4 on the same benchmark — no. GPT-4 scores around 90%. But GPT-4 has roughly 1000x more parameters and was trained on vastly more data for months on expensive hardware.
For a 1.3 billion parameter model trained for 1.5 hours on a free GPU — 31% is a real result.
What I would do differently
Three things would improve this significantly:
Larger base model — Mistral 7B or LLaMA 3 8B would likely score 50-60%+ on the same benchmark with the same training data. Size matters for reasoning.
Better RAG knowledge base — I stored MCQ training text as knowledge chunks. Clean medical facts from PubMed or clinical guidelines would give the retriever much better material to work with.
Re-ranking — After retrieving top 3 results, a cross-encoder model could re-rank them by true relevance before injecting into the prompt.
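The re-ranking step can be sketched in a few lines. I haven't built the cross-encoder part; the `score` function below is a toy word-overlap stand-in for a real cross-encoder (e.g. a sentence-transformers `CrossEncoder`), and the chunk texts are made up for illustration:

```python
def score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question words that appear
    in the chunk. A real system would use a cross-encoder here."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def rerank(question: str, retrieved_chunks: list[str], top_k: int = 3) -> list[str]:
    """Re-order the retriever's candidates by pairwise relevance
    to the question, keeping the top_k best."""
    return sorted(retrieved_chunks,
                  key=lambda c: score(question, c),
                  reverse=True)[:top_k]

# Hypothetical retrieved candidates, for illustration only
chunks = [
    "Aspirin inhibits platelet aggregation.",
    "Metformin is first-line therapy for type 2 diabetes.",
    "Beta blockers reduce heart rate.",
]
best = rerank("What is the first-line drug for type 2 diabetes?", chunks)
print(best[0])  # the metformin chunk ranks first
```

The key design point: the retriever scores question and chunk independently (fast, approximate), while the re-ranker scores each (question, chunk) pair jointly (slower, more accurate), so running it only on the top few candidates gets most of the accuracy at little cost.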
Why I'm sharing the real numbers
Because honest numbers are more valuable than fake demos. Any recruiter or engineer who looks at this project can see exactly what was built, how it was tested, and where the limitations are.
The architecture is production-ready. Swap in a 7B model and the retrieval pipeline, API, and UI don't change at all. That's what good engineering looks like.
Full project: github.com/YadavAkhileshh/medmind
Model: huggingface.co/Yakhilesh/medmind-opt-medical