I've been running hands-on benchmarks on open-source LLMs to measure behaviors the big labs don't publish numbers for. My latest experiment compares Gemma 4 E4B against the rest of the Gemma family on enterprise-style tasks.
Full results with methodology and limitations: https://aiexplorer-blog.vercel.app/post/gemma-4-e4b-enterprise-benchmark
I'm building a collection of these experiments at aiexplorer-blog.vercel.app/experiments — covering structured JSON output, context position bias, RAG compliance, and prompt injection defenses.
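For anyone curious what these harnesses look like, here's a minimal sketch of the structured-JSON-output check. It assumes a local Ollama server at the default port; the invoice schema, prompt, model tag, and trial count are illustrative stand-ins, not the exact setup from the post.

```python
# Minimal sketch of a structured-JSON-output benchmark trial.
# Assumes a local Ollama server (http://localhost:11434); the model tag,
# schema, and prompt below are examples, not the post's actual harness.
import json
import requests
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_id", "total"],
}

PROMPT = (
    "Extract the invoice as JSON with keys invoice_id (string) and "
    "total (number). Respond with JSON only.\n\n"
    "Invoice #A-1042, amount due: $312.50"
)

def run_trial(model: str) -> bool:
    """Return True if the model emits schema-valid JSON on one trial."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    try:
        validate(json.loads(text), SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

if __name__ == "__main__":
    trials = 20  # repeat to estimate a pass rate, not a single sample
    passes = sum(run_trial("gemma3:4b") for _ in range(trials))
    print(f"schema-valid: {passes}/{trials}")
```

Scoring pass/fail against a schema over repeated trials gives a reproducible pass rate you can compare across models, rather than eyeballing a single response.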
Feedback welcome — what models or tasks should I benchmark next?