
东道主


I spent ~500M tokens building a prompt optimization tool

Hey everyone,

I've been working on an automated prompt optimization project for a while now, and I've gone through roughly 500M tokens iterating on the core loop.

Along the way, I leaned on pretty much every major model out there (GLM, DeepSeek, GPT, Claude, you name it) to help me refine the architecture. Honestly, their output was underwhelming for this specific task, and most of their built-in agent/skill features did little to help me design a better optimization pipeline.

This is the core design pattern I'm currently running with:

        ┌──────────────────────────────────────────────────────┐
        ▼                                                        │
Current Prompt ──► Evaluate (target + judge) ──► Score + deductions
        ▲                                                        │
        │                                                        ▼
Optimizer Model ◄────────── rewrite from feedback ◄─── keep best-scoring version
        (repeats until round budget is hit; highest-scoring prompt wins)
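For anyone who prefers code to diagrams, here is a minimal Python sketch of that loop. It is an illustration under assumptions, not the actual implementation: `Evaluation`, `optimize_prompt`, `toy_evaluate`, and `toy_rewrite` are hypothetical names, the toy functions stand in for real target/judge/optimizer model calls, and the sketch assumes each rewrite starts from the best-scoring prompt so far (the diagram doesn't pin that down).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    score: float           # higher is better
    deductions: List[str]  # judge's reasons for lost points


def optimize_prompt(
    initial_prompt: str,
    evaluate: Callable[[str], Evaluation],      # would wrap the target + judge calls
    rewrite: Callable[[str, Evaluation], str],  # would wrap the optimizer model call
    max_rounds: int = 5,
) -> Tuple[str, Evaluation]:
    """Evaluate, keep the best-scoring prompt, rewrite from its feedback, repeat."""
    best_prompt = initial_prompt
    best_eval = evaluate(initial_prompt)
    for _ in range(max_rounds):  # round budget
        candidate = rewrite(best_prompt, best_eval)
        candidate_eval = evaluate(candidate)
        # Only a strictly better score replaces the current best.
        if candidate_eval.score > best_eval.score:
            best_prompt, best_eval = candidate, candidate_eval
    return best_prompt, best_eval


# --- Toy stand-ins so the sketch runs without any model API (assumptions) ---
REQUIRED = ["think step by step", "cite your sources", "answer in JSON"]


def toy_evaluate(prompt: str) -> Evaluation:
    """Pretend judge: one point per required instruction found in the prompt."""
    missing = [p for p in REQUIRED if p not in prompt]
    return Evaluation(
        score=float(len(REQUIRED) - len(missing)),
        deductions=[f"missing: {p}" for p in missing],
    )


def toy_rewrite(prompt: str, ev: Evaluation) -> str:
    """Pretend optimizer: fix the first deduction, if there is one."""
    if not ev.deductions:
        return prompt
    fix = ev.deductions[0].split(": ", 1)[1]
    return f"{prompt} Please {fix}."


if __name__ == "__main__":
    best, ev = optimize_prompt("Summarize the article.", toy_evaluate, toy_rewrite)
    print(best)
    print(ev.score)  # → 3.0
```

Passing `evaluate` and `rewrite` in as plain callables keeps the loop itself independent of any particular model or scoring rubric, which makes it easier to swap in a different judge or a different acceptance rule when experimenting with the core loop.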

I've added a few extras on top: a prompt library, a test question bank, and some other quality-of-life features. But I can't shake the feeling that these are surface-level additions that don't really move the needle on how well the core optimization actually works.

That's why I'm posting here. I'd love to get this community's take:

  • What would you change about this core loop to make it fundamentally better?
  • What features do you actually find valuable in a prompt optimization tool, beyond the basics?

I'm relatively new to sharing my work here, so any advice, critiques, or wild ideas are greatly appreciated. Thanks in advance!
