Hey everyone,
I've been working on an automated prompt optimization project for a while now, and I've gone through roughly 500M tokens iterating on the core loop.
Along the way, I tried leaning on pretty much every major model out there (GLM, DeepSeek, GPT, Claude, you name it) to help me refine the architecture. Honestly, their output was underwhelming for this specific task, and most of their built-in agent/skill features were basically useless for actually designing a better optimization pipeline.
This is the core design pattern I'm currently running with:
      ┌──────────────────────────────────────────────────────┐
      ▼                                                      │
Current Prompt ──► Evaluate (target + judge) ──► Score + deductions
      ▲                                                      │
      │                                                      ▼
Optimizer Model ◄────────── rewrite from feedback ◄─── keep best-scoring version
(repeats until round budget is hit; highest-scoring prompt wins)
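For anyone who prefers code to diagrams, the loop above can be sketched roughly like this in Python. This is a simplified illustration, not the actual implementation: the names `Evaluation`, `optimize`, `toy_evaluate`, and `toy_rewrite` are hypothetical, the two toy functions stand in for the real target/judge and optimizer model calls, and I'm assuming the optimizer always rewrites the best-scoring prompt so far (the diagram could also be read as rewriting the latest one).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Result of one evaluation pass: a score plus the judge's deductions."""
    score: float
    deductions: List[str]


def optimize(
    initial_prompt: str,
    evaluate: Callable[[str], Evaluation],
    rewrite: Callable[[str, List[str]], str],
    rounds: int,
) -> Tuple[str, Evaluation]:
    """Run the evaluate -> rewrite loop for a fixed round budget.

    Assumption: each rewrite starts from the best-scoring prompt so far.
    """
    best_prompt = initial_prompt
    best_eval = evaluate(initial_prompt)
    for _ in range(rounds):
        candidate = rewrite(best_prompt, best_eval.deductions)
        candidate_eval = evaluate(candidate)
        # Keep the candidate only if it strictly beats the current best.
        if candidate_eval.score > best_eval.score:
            best_prompt, best_eval = candidate, candidate_eval
    return best_prompt, best_eval


# --- Toy stand-ins so the sketch runs without any model calls ---
REQUIRED = ["be concise", "cite sources", "answer in English"]


def toy_evaluate(prompt: str) -> Evaluation:
    """Hypothetical judge: deduct 30 points per missing instruction."""
    missing = [r for r in REQUIRED if r not in prompt]
    deductions = [f"missing instruction: {r}" for r in missing]
    return Evaluation(score=100 - 30 * len(missing), deductions=deductions)


def toy_rewrite(prompt: str, deductions: List[str]) -> str:
    """Hypothetical optimizer: address the first deduction, if any."""
    if not deductions:
        return prompt
    fix = deductions[0].split(": ", 1)[1]
    return f"{prompt} Please {fix}."


if __name__ == "__main__":
    prompt, result = optimize(
        "You are a helpful assistant.", toy_evaluate, toy_rewrite, rounds=5
    )
    print(result.score, prompt)
```

In a real run, `evaluate` would call the target model and the judge, and `rewrite` would call the optimizer model with the deductions as feedback; the control flow stays the same.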
I've added a few extras on top: a prompt library, a test question bank, and some other quality-of-life features. But I can't shake the feeling that these are just surface-level additions that don't really move the needle on how well the core optimization actually works.
That's why I'm posting here. I'd love to get this community's take:
- What would you change about this core loop to make it fundamentally better?
- What features do you actually find valuable in a prompt optimization tool, beyond the basics?
I'm relatively new to sharing my work here, so any advice, critiques, or wild ideas are greatly appreciated. Thanks in advance!