<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marko Milojkovic</title>
    <description>The latest articles on DEV Community by Marko Milojkovic (@marko_milojkovic_9fabcac1).</description>
    <link>https://dev.to/marko_milojkovic_9fabcac1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3680611%2Fe64d472c-6114-4369-949b-a98b395cebf0.jpg</url>
      <title>DEV Community: Marko Milojkovic</title>
      <link>https://dev.to/marko_milojkovic_9fabcac1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marko_milojkovic_9fabcac1"/>
    <language>en</language>
    <item>
      <title>I built an AI model comparison tool after wasting 12 hours on LLM integration in a project. Launching on Product Hunt today.</title>
      <dc:creator>Marko Milojkovic</dc:creator>
      <pubDate>Tue, 24 Feb 2026 14:34:31 +0000</pubDate>
      <link>https://dev.to/marko_milojkovic_9fabcac1/i-built-an-ai-model-comparison-tool-after-12-hours-wasted-on-llm-integration-in-project-launching-4cg5</link>
      <guid>https://dev.to/marko_milojkovic_9fabcac1/i-built-an-ai-model-comparison-tool-after-12-hours-wasted-on-llm-integration-in-project-launching-4cg5</guid>
      <description>&lt;p&gt;I run a product studio building apps. Nowadays every app needs at least one LLM integration.&lt;/p&gt;

&lt;p&gt;I subscribe to Claude ($20/mo) and usually use its API because it's good for complex tasks, but sometimes it's overkill, so I switch to ChatGPT. Then Gemini came out. Then Grok. Then DeepSeek.&lt;/p&gt;

&lt;p&gt;You never really know which one is actually best for your specific use case. They all change daily: new models, new sub-models, different pricing.&lt;/p&gt;

&lt;p&gt;I was spending hours reading "ChatGPT vs Claude" Reddit threads and still guessing.&lt;/p&gt;

&lt;p&gt;And even worse: I'd integrate a model, then discover a cheaper alternative that works just as well. Too late - already spent 2-3 days on integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwc3jp2y4kwv263keu2l.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwc3jp2y4kwv263keu2l.webp" alt=" " width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built Test AI Models: paste your actual prompt and see quality, speed, and cost across 9 LLMs side by side in 30 seconds.&lt;/p&gt;

&lt;p&gt;No API keys needed. No reading benchmarks that test "write a poem" when you need to debug code.&lt;/p&gt;

&lt;p&gt;Test YOUR production prompts. See which model actually wins for YOUR use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it started&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Built the first version in less than a week for a Bubble/Contra hackathon: 4 LLMs, basic comparison. It won "Best Use of AI" and a $5K prize.&lt;/p&gt;

&lt;p&gt;Then I kept reading Reddit threads where developers argue over ChatGPT vs Claude. Nobody wins those arguments, because nobody can prove their case.&lt;/p&gt;

&lt;p&gt;That was the signal - other developers have this exact problem too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current status&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Launched BETA: Feb 24, 2026 (Product Hunt)&lt;br&gt;
Models: 9 total (ChatGPT, Claude, Gemini, Grok, Perplexity, DeepSeek, Qwen, Kimi, Mistral)&lt;br&gt;
Users: 90+ early testers&lt;br&gt;
Model tests run: 420+&lt;/p&gt;

&lt;p&gt;Pricing: 50 free test model selections, then $9/mo + API credits (1:1, no markup)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example (I used it on itself)&lt;/strong&gt;&lt;br&gt;
Our app auto-generates titles for tests. I was using Claude Sonnet ($423/1M runs).&lt;/p&gt;

&lt;p&gt;Tested alternatives:&lt;br&gt;
Claude: $423/1M - perfect quality ✓&lt;br&gt;
DeepSeek: $31/1M - cheapest BUT failed format requirements ✗&lt;br&gt;
Qwen: $43/1M - also failed ✗&lt;br&gt;
Grok: $49/1M - perfect quality ✓ + 8.6x cheaper&lt;/p&gt;

&lt;p&gt;Switched to Grok. Saved $45/year on ONE tiny feature.&lt;/p&gt;

&lt;p&gt;The lesson: "Use the cheapest model" doesn't work if it breaks your requirements. You have to test with YOUR actual needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platform: Bubble.io (no-code)&lt;br&gt;
APIs: OpenAI, Anthropic, Google, xAI, DeepSeek, Alibaba, Moonshot...&lt;br&gt;
Email: Brevo&lt;br&gt;
Payments: Paddle&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deciding the roadmap based on user feedback. What should I build?&lt;/p&gt;

&lt;p&gt;A) Submodels (GPT-4o vs GPT-4o-mini, Claude Opus vs Sonnet vs Haiku)&lt;br&gt;
B) API access (trigger tests from n8n, Zapier, agentic workflows)&lt;br&gt;
C) Quality scoring (hallucination detection, consistency testing)&lt;br&gt;
D) Image/voice generation comparison (DALL-E vs Midjourney, ElevenLabs vs Play.ht)&lt;/p&gt;

&lt;p&gt;What would you actually use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ask&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try it: testaimodels.com - run one test with your actual prompt, tell me what breaks or what's confusing&lt;/li&gt;
&lt;li&gt;Feedback: What's missing? What would make this 10x more useful?&lt;/li&gt;
&lt;li&gt;Roadmap input: A, B, C, or D above (or tell me what I'm missing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm building this in public. Every decision is shaped by what early users say matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions I have&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is $9/mo too expensive for indie devs? (API credits are 1:1, no markup)&lt;/li&gt;
&lt;li&gt;Is "test selections" pricing confusing? (50 selections = 5-25 full tests depending on how many models you compare)&lt;/li&gt;
&lt;li&gt;What modality matters most after text? (Image, audio, video?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment. I read and reply to everything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
