Most AI models perform well on benchmarks.
But what happens when you test them on messy, real-world input?
I created a small challenge on VibeCode Arena to find out…
And the results were surprising.
🧪 The Challenge: Prescription Confusion Trap
Here’s the input I gave:
“Doctor ne bola din mein 2 baar dawa lena hai, par maine sirf ek hi li. Kal se chest pain halka hai, par breathing problem nahi hai. Mere papa ko heart problem hai. Maine ibuprofen li thi kal raat.”

(Translation: “The doctor said to take the medicine twice a day, but I took it only once. Since yesterday I’ve had mild chest pain, but no breathing problem. My father has a heart problem. I took ibuprofen last night.”)
🎯 The Task
Convert the input above into strict JSON with these fields:
- patient_info
- active_symptoms
- negative_symptoms
- family_history_notes
- medication_taken
- dosage_misuse_flag (true/false)

## ⚠️ Why This Is Tricky
This is NOT just extraction.
It tests real-world understanding:
- 👉 Doctor said 2 times, patient took only once
- 👉 “No breathing problem” (negative symptom)
- 👉 Father has heart problem (not the patient)
- 👉 Medication taken: ibuprofen

## ❗ The Critical Question
Can your AI detect this?
👉 dosage_misuse_flag = true
Many models miss this completely.
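For reference, here is one plausible target output, sketched as a Python dict. The field values are my own reading of the prompt, not an official answer key from VibeCode Arena:

```python
import json

# One plausible target output for the challenge above.
# Field values are an interpretation of the input text,
# not an official answer key.
expected = {
    "patient_info": {"relation_to_reporter": "self"},
    "active_symptoms": ["mild chest pain since yesterday"],
    "negative_symptoms": ["no breathing problem"],
    "family_history_notes": ["father has a heart problem"],
    "medication_taken": ["ibuprofen (last night)"],
    # Doctor prescribed twice daily; the patient took only one dose.
    "dosage_misuse_flag": True,
}

print(json.dumps(expected, indent=2))
```

The key point is the last field: the mismatch between the prescribed dose (twice daily) and the actual intake (once) is what most models fail to surface.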
🔥 Try It Yourself
I’ve made this challenge public here:
👉 https://vibecodearena.ai/duel/5a70b6a3-20b7-4bb2-94f6-bd494f5d60c2
⚔️ Challenge Rules
- Use your favorite AI (ChatGPT, Claude, Gemini, etc.)
- Generate the JSON output
- Comment your result
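Before commenting your result, you can sanity-check a model’s output with a few lines of Python. This is a quick sketch, not an official grader; the key list simply mirrors the fields in the task above:

```python
import json

# Field names taken from the task description above.
REQUIRED_KEYS = {
    "patient_info", "active_symptoms", "negative_symptoms",
    "family_history_notes", "medication_taken", "dosage_misuse_flag",
}

def validate(raw: str) -> list[str]:
    """Return a list of problems found in a model's JSON output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not isinstance(data.get("dosage_misuse_flag"), bool):
        problems.append("dosage_misuse_flag must be true or false")
    elif data["dosage_misuse_flag"] is not True:
        problems.append("dosage_misuse_flag should be true for this input")
    return problems
```

An empty list means the output at least has the right shape and catches the dosage misuse; anything else tells you what the model got wrong structurally.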