DEV Community: Harshil Siyani

Same Prompt. Same Model. Different Output. Every Time.

Harshil Siyani — Tue, 15 Apr 2025 12:14:16 +0000

AI is changing how we build software.

But as builders, we’re quietly ignoring a major flaw:

We don’t test our prompts.
We just deploy them… and hope.

The Moment It Broke

I was working on an AI feature that relied on a simple prompt to generate short summaries.

Same model. Same prompt. Temperature 0.1.

Ran it ten times and got five different outputs. Some subtle. Some wildly off.

It hit me: If your API response is unpredictable, your product is unreliable.
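A minimal sketch of how a "ran it ten times, got five different outputs" check could be measured rather than eyeballed. The helper name `consistency_report` and the sample strings are assumptions made for illustration; in practice the list would hold the text returned by ten real API calls.

```python
from collections import Counter

def consistency_report(outputs):
    """Summarize how much a list of model outputs agree with each other."""
    if not outputs:
        raise ValueError("need at least one output")
    # Normalize whitespace and case so trivial differences do not count as distinct.
    normalized = [" ".join(o.split()).lower() for o in outputs]
    counts = Counter(normalized)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "runs": len(outputs),
        "distinct": len(counts),
        # Share of runs that produced the single most frequent output.
        "agreement": most_common_count / len(outputs),
    }

# Hypothetical outputs standing in for ten real API responses.
sample_runs = (
    ["Summary A"] * 3
    + ["summary a "]          # same text, different case and spacing
    + ["Summary B"] * 2
    + ["Summary C"] * 2
    + ["Summary D", "Summary E"]
)
report = consistency_report(sample_runs)
print(report)
```

Exact-match counting is deliberately strict: it flags any wording change, so it works best for short, constrained outputs and is only a first signal for free-form summaries.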

Why This Is Getting Worse

LLMs sample their output, so they are non-deterministic and slight differences between runs are expected. But that doesn't mean they're acceptable in production.

To make matters worse:

Gemini 1.0 is already deprecated
GPT-4 could be next
Each update subtly changes model behavior
Prompts that worked yesterday might break tomorrow

If you're building AI-first apps, you're in a loop of:

Test → Fix prompt → Re-test → Cross fingers → Repeat

That’s hours of work.

Every. Time. A. Model. Changes.

So I Built PromptPerf.dev

I needed a tool to help me trust my AI outputs before shipping.

PromptPerf.dev is a playground for prompt testing:

✅ Test your prompt across multiple AI models

✅ Run it at different temperatures

✅ Compare outputs across multiple runs

✅ Track consistency + score against expected answers
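One way "score against expected answers" could work, shown as a rough sketch. This is not PromptPerf's actual scoring method, which the post does not describe. It assumes a plain character-level similarity from Python's standard `difflib`, and the function names, the 0.8 threshold, and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean

def similarity(expected, actual):
    """Return a 0..1 character-level similarity ratio between two strings."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

def score_runs(expected, outputs, threshold=0.8):
    """Score each output against the expected answer and summarize the runs."""
    scores = [similarity(expected, o) for o in outputs]
    return {
        "mean_score": mean(scores),
        "min_score": min(scores),
        # Fraction of runs that are "close enough" to the expected answer.
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }

# Hypothetical expected answer and three hypothetical model outputs.
expected = "The meeting was moved to Friday."
outputs = [
    "The meeting was moved to Friday.",
    "The meeting was moved to friday.",
    "Lunch is at noon.",
]
result = score_runs(expected, outputs)
print(result)
```

Character similarity is cheap and dependency-free, but it rewards matching wording rather than matching meaning. A semantic comparison would be a reasonable swap if paraphrases should count as passes.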

Where We're Headed

Right now, I’m building this in public. It’s early — but focused.

Why This Matters

If you're building with LLMs, you know the feeling:

The "it worked locally" moment — but with GPT
A broken chain in LangChain or a RAG pipeline that fails silently
Users noticing weird output before you do

PromptPerf doesn’t replace model tuning.

It makes prompt reliability visible.

💬 I’d Love to Hear From You

Have you run into inconsistency issues?
What’s your current prompt testing workflow (if any)?
Should prompt testing be part of CI/CD?

If this resonated, join the waitlist or just drop your thoughts below — I'd genuinely love feedback as we build.

🧪 PromptPerf.dev — build AI products you can trust.

[Boost]

Harshil Siyani — Mon, 14 Apr 2025 10:35:41 +0000

Taradepan R

Apr 11 '25

5 AI Tools to Build Your First MVP in Days, Not Months🚀🚀🚀

#ai #startup #programming #productivity

Day 3: 10 signups.

Harshil Siyani — Mon, 14 Apr 2025 10:32:07 +0000

I have been sharing the problem I'm working to solve, AI prompt optimisation, on Indie Hackers, Product Hunt, X, and Discord/Slack groups.

First, what am I building?

I'm building PromptPerf.dev, a platform that will automate prompt testing. Why is this important? As more and more AI models become available, it's hard to find out which model will provide the best performance and the most consistent output. My tool will let users run automated tests against multiple models and configs and, most importantly, across multiple runs.

How is this useful? Take this example: a user wants to test 4 AI models at temperatures 0.1, 0.4, 0.7 and 1.0 and has 3 prompts to test. To ensure consistency, each config has to be run 10 times so that one hallucinated output isn't mistaken for typical behavior. That is 4 models x 4 temperatures x 3 prompts x 10 runs = 480 API calls, plus manually entering the results into a database or Excel to compare them.
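The size of that test matrix can be checked with a short sketch. The model and prompt names below are placeholders, since the post does not say which ones are being tested, and `build_test_plan` is a hypothetical helper rather than part of any existing tool.

```python
from itertools import product

# Placeholder names; only the counts and temperatures come from the example.
models = ["model-a", "model-b", "model-c", "model-d"]
temperatures = [0.1, 0.4, 0.7, 1.0]
prompts = ["prompt-1", "prompt-2", "prompt-3"]
runs_per_config = 10

def build_test_plan(models, temperatures, prompts, runs):
    """Return one dict per API call that the test matrix requires."""
    return [
        {"model": m, "temperature": t, "prompt": p, "run": r}
        for m, t, p, r in product(models, temperatures, prompts, range(1, runs + 1))
    ]

plan = build_test_plan(models, temperatures, prompts, runs_per_config)
print(len(plan))  # 4 * 4 * 3 * 10 = 480
```

Each entry in `plan` maps to one API call, which makes it clear why doing this by hand and copying results into a spreadsheet does not scale.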

Why now? AI model providers are starting to deprecate models. For example, Gemini 1.0 is being deprecated, which means every app running on Gemini 1.0 now needs to test a new model, and simply swapping in the newer model doesn't work. So prompt testing will be required to find the next AI model that can easily be swapped in its place.

Let me share which platform has helped me get the most traction:

Day 1: I first shared this on Indie Hackers and X and got 2 signups and about 30 visitors.

Day 2: I looked for online communities on Discord and Slack. Most of these communities had a showcase channel where I could promote the product, but others only had introductions channels, where I got warnings from moderators when I shared the product I'm building. I also replied to quite a lot of tweets on X. I created a list of potential AI influencers and started replying to their tweets, and where I could I plugged PromptPerf.dev. I also updated my bio so that it directs users to the product from my profile. This brought another 2 signups (I'm unsure from where), and website visitors were up to 70 at this point.

Day 3: I focused entirely on X and replied to multiple accounts, not pushing the product but just replying to anything AI-related. I assume this helped me gain trust with the AI audience, and I got quite a lot of follows, which could have directed some visitors to the product. By this point I had 10 subscribers who had opted in.

I had added tags to the signup form so users could select whether they are interested in helping with feedback or getting early access. Here's the breakdown:
5 for early adopter only, 4 for early adopter plus feedback, 1 for feedback only.

Do you think the problem I'm working to solve is real?