I spent ~500M tokens building a prompt optimization tool

东道主 — Sat, 04 Jul 2026 03:05:54 +0000

Hey everyone,

I've been working on an automated prompt optimization project for a while now, and I've gone through roughly 500M tokens iterating on the core loop.

Along the way, I asked pretty much every major model out there (GLM, DeepSeek, GPT, Claude, you name it) to help me refine the architecture. Honestly, their output was underwhelming for this specific task, and most of their built-in agent/skill features were basically useless when it came to designing a better optimization pipeline.

This is the core design pattern I'm currently running with:

        ┌──────────────────────────────────────────────────────┐
        ▼                                                        │
Current Prompt ──► Evaluate (target + judge) ──► Score + deductions
        ▲                                                        │
        │                                                        ▼
Optimizer Model ◄────────── rewrite from feedback ◄─── keep best-scoring version
        (repeats until round budget is hit; highest-scoring prompt wins)
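
For concreteness, here is a rough Python sketch of that loop. It is a minimal illustration under stated assumptions, not the tool's actual code: `Evaluation`, `evaluate`, and `rewrite` are hypothetical stand-ins for the target + judge step and the optimizer model, and the sketch assumes each rewrite starts from the best-scoring prompt so far. The two `toy_*` functions are throwaway stubs so the example runs without any model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    score: float           # overall score from the judge (hypothetical shape)
    deductions: List[str]  # reasons the judge took points off


def optimize(
    initial_prompt: str,
    evaluate: Callable[[str], Evaluation],      # assumed: runs target + judge
    rewrite: Callable[[str, Evaluation], str],  # assumed: optimizer model call
    rounds: int,
) -> Tuple[str, float]:
    """Run evaluate -> rewrite for a fixed round budget; return the best prompt."""
    best_prompt = initial_prompt
    best_eval = evaluate(initial_prompt)
    for _ in range(rounds):
        # Assumption: rewrite from the best version so far, using its feedback.
        candidate = rewrite(best_prompt, best_eval)
        candidate_eval = evaluate(candidate)
        # Keep the candidate only if it strictly beats the current best.
        if candidate_eval.score > best_eval.score:
            best_prompt, best_eval = candidate, candidate_eval
    return best_prompt, best_eval.score


# --- Toy stand-ins so the sketch is runnable without any API calls ---
REQUIRED = ["concise", "cite sources"]


def toy_evaluate(prompt: str) -> Evaluation:
    """Pretend judge: deduct 5 points for each required phrase that is missing."""
    deductions = [f"missing: {k}" for k in REQUIRED if k not in prompt]
    return Evaluation(score=10.0 - 5.0 * len(deductions), deductions=deductions)


def toy_rewrite(prompt: str, evaluation: Evaluation) -> str:
    """Pretend optimizer: patch the first deduction, if there is one."""
    if not evaluation.deductions:
        return prompt
    missing = evaluation.deductions[0].split(": ", 1)[1]
    return f"{prompt} ({missing})"


if __name__ == "__main__":
    print(optimize("Answer the question.", toy_evaluate, toy_rewrite, rounds=3))
```

Passing `evaluate` and `rewrite` in as plain callables keeps the loop independent of any particular model or judge, which makes it easier to experiment with changes to the loop itself.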

I've tacked on a few extras on top: a prompt library, a test question bank, and some other quality-of-life features. But I can't shake the feeling that these are just surface-level additions that don't really move the needle on how well the core optimization actually works.

That's why I'm posting here. I'd love to get this community's take:

What would you change about this core loop to make it fundamentally better?
What features do you actually find valuable in a prompt optimization tool, beyond the basics?

I'm relatively new to sharing my work here, so any advice, critiques, or wild ideas are greatly appreciated. Thanks in advance!

DEV Community: 东道主
