Alex C

Posted on Jun 16

I Don't Get Prompt Engineering. Are We All Just Gambling with Tokens Now?

#ai #llm #rust #architecture

Okay, I need to get this off my chest.

I've been in this industry long enough to see a few hype cycles come and go. I survived the "blockchain will solve everything" era. I watched microservices eat monoliths and then watched everyone quietly crawl back to monoliths. I've seen enough to know that when everyone suddenly agrees on something, I should probably sit down and shut up until the dust settles.

But this time is different. This time, the thing everyone agrees on is... using LLMs for everything? And I genuinely, deeply, don't get it.

Let Me Show You What I Mean

A few weeks ago, I was looking at a codebase from a well-funded startup. I won't name them — it's not important. What's important is that they had this in production:

def sort_hotels_by_price(hotels):
    prompt = f"""
    Sort these hotels by price from lowest to highest:
    {hotels}

    Return as JSON array.
    """
    response = llm.complete(prompt)
    return json.loads(response)

I stared at this for a full minute. Then I laughed. Then I got kind of sad.

We're taking a deterministic O(n log n) operation — something that takes microseconds and costs nothing — and turning it into:

An API call to OpenAI or Anthropic ($0.001–0.01 per request)
A non-deterministic operation that might hallucinate
A 500ms–2s latency hit
A fragile JSON parse that might break

For sorting a list.
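
For comparison, here is a minimal sketch of the deterministic version in Python. It assumes each hotel is a dict with a numeric `price` field; the field names and sample data are made up for illustration.

```python
def sort_hotels_by_price(hotels):
    """Return hotels ordered from cheapest to most expensive."""
    # sorted() is deterministic and never drops or invents items
    return sorted(hotels, key=lambda hotel: hotel["price"])


# Hypothetical sample data
hotels = [
    {"name": "Harbour Inn", "price": 189},
    {"name": "Station Hostel", "price": 62},
    {"name": "Park Suites", "price": 240},
]
cheapest_first = sort_hotels_by_price(hotels)
print([h["name"] for h in cheapest_first])
# prints ['Station Hostel', 'Harbour Inn', 'Park Suites']
```

No network call, no JSON parsing, and the same input always gives the same output.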

And the scariest part? This isn't some junior dev's side project. This is production code, serving real users, with real money flowing through it.

The Math That Keeps Me Up at Night

Let's do the math, because the numbers are genuinely wild.

Say you're building a travel app with 10,000 daily active users. Each user runs roughly one session a day and triggers maybe 50 LLM calls in it: search, filter, sort, recommend, parse results, generate descriptions, you name it.

That's 500,000 LLM calls per day.

At $0.002 per call (cheap model, short prompt), that's $1,000/day. Or $30,000/month.

For what? For things we've been able to do with sort(), filter(), and map() since the 1970s.
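
The arithmetic, spelled out as a quick Python sketch. The inputs are the figures quoted above; the one-session-per-user-per-day assumption and the 30-day month are mine.

```python
daily_active_users = 10_000
calls_per_user_per_day = 50   # assumes one session per user per day
cost_per_call_usd = 0.002     # cheap model, short prompt

calls_per_day = daily_active_users * calls_per_user_per_day
cost_per_day = calls_per_day * cost_per_call_usd
cost_per_month = cost_per_day * 30  # assumes a 30-day month

print(calls_per_day)          # 500000
print(round(cost_per_day))    # 1000
print(round(cost_per_month))  # 30000
```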

And here's the kicker: that $30k/month doesn't even guarantee your list comes back sorted correctly. I've seen LLMs "sort" a list and forget 3 items. Return duplicates. Return a completely different data structure. Refuse to process because it "couldn't help with that."

When was the last time sort([3, 1, 2]) returned [1, 2]?
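
If an LLM does sit in the sorting path, those failure modes (missing items, duplicates, a different data structure) have to be caught somewhere. Below is a rough sketch of such a check, assuming the items are JSON-serializable dicts with a numeric `price` key. The function names are hypothetical.

```python
import json
from collections import Counter


def fingerprint(item):
    # Canonical JSON string, so dicts can be counted and compared
    return json.dumps(item, sort_keys=True)


def is_valid_sort(original, returned, key="price"):
    """Return True only if `returned` is `original` reordered by ascending `key`."""
    # Wrong structure: must be a list of dicts that all carry the sort key
    if not isinstance(returned, list):
        return False
    if not all(isinstance(item, dict) and key in item for item in returned):
        return False
    # Dropped, duplicated, or altered items: the multisets must match exactly
    if Counter(map(fingerprint, returned)) != Counter(map(fingerprint, original)):
        return False
    # Ordering: each value must be <= the one after it
    values = [item[key] for item in returned]
    return all(a <= b for a, b in zip(values, values[1:]))
```

Note that the check walks the whole list and compares neighbours, which is already a good share of the work of just sorting it in code.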

What We Actually Do

I'm not just complaining — we've built a real product with real users, and we had to make these choices ourselves.

GoChatTravel is a chat-first travel booking platform. Users talk to our bot in WhatsApp, Telegram, Messenger — you name it — to book hotels, flights, tours, trains. It's the kind of product where the temptation to "just use an LLM for everything" is overwhelming. Every single step feels like it could be a prompt.

But we made a different choice, and honestly, it feels almost too boring to talk about.

The LLM handles the conversation

That's it. That's its job. It understands what the user wants from their natural language, it generates friendly human responses, it handles the small talk and the edge cases. It's really good at that. We let it do that one thing.

The code handles literally everything else

And by "everything else," I mean a lot. Here's what actually happens when a user says "find me flights from Toronto to Tokyo next week":

// 1. LLM extracts intent (this is where the LLM shines)
let intent = llm.parse_intent("Find me flights from Toronto to Tokyo next week");
// Returns: { action: "search", from: "YYZ", to: "NRT", dates: "..." }

// 2. Code does the heavy lifting, using the fields extracted above
let supplier_results = poll_supplier_apis(&intent.from, &intent.to, &intent.dates).await;
// Parallel API calls to Amadeus, Sabre, direct carriers, OTAs, etc.

let normalized = normalize_fare_classes(supplier_results);
// Because every supplier has their own idea of what "economy" means.
// Basic economy, main cabin, economy light, economy standard — 
// we map all of this to a consistent scale.

let with_rules = apply_markup_and_commissions(normalized);
// Business rules, margin levels, partner-specific markups,
// availability rules, fare family restrictions.

// collect() needs a target type; Flight stands in for our normalized fare struct
let all_flights: Vec<Flight> = with_rules.into_iter().filter(|f| f.available).collect();

// Now we have, say, about 10,000 flight options.
// We're NOT sending this to the LLM. Not even close.

// 3. Deterministically pick 3 smart choices
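let cheapest = all_flights.iter().min_by_key(|f| f.price).cloned();
let fastest = all_flights.iter().min_by_key(|f| f.duration).cloned();
let optimal = pick_optimal(&all_flights);
// Our own scoring function: balances price, duration, stops, departure time

// min_by_key returns an Option (the list may be empty); assuming pick_optimal
// does too, keep only the picks that exist
let smart_choices: Vec<Flight> = [cheapest, fastest, optimal].into_iter().flatten().collect();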

// 4. Save all options to a one-time viewable page (signed URL, expires in 24h)
let view_all_url = generate_one_time_page(&all_flights, Duration::hours(24));

// 5. LLM generates a natural response with ONLY the 3 smart choices
let response = llm.generate_response(&smart_choices, &view_all_url);
// "I found flights from Toronto to Tokyo next week. Here are the best options:
//  💰 Cheapest: Air Canada via Vancouver — $847, 18h total
//  ⚡ Fastest: ANA direct — $1,420, 10h 30m
//  ⭐ Best value: Korean Air via Seoul — $980, 15h, great departure time
//
//  Want to see all 10,000 options? Here's a link (expires in 24h): [view all]"

And then, if the user picks a flight, the code takes over again: payment processing through Stripe, PCI compliance, booking confirmation, ticketing, after-sales flows for changes and cancellations, integration with the support team. None of this touches the LLM.

Why this works

Notice what's happening with the flights. We have about 10,000 options, but we're only sending 3 to the LLM. Why?

Token efficiency. Sending 10,000 flights (some round trips produce even more) to an LLM would cost a fortune and probably exceed context limits anyway. Three smart choices cost almost nothing.
Better UX. Users don't want to scroll through 10,000 flights in a chat. They want 3 curated options.
Deterministic curation. The "cheapest," "fastest," and "optimal" picks are made by code, not by the LLM guessing. No hallucinated "best value" that's actually the most expensive option.
Escape hatch. If the user does want to see everything, we give them a one-time link to a web page. It expires in 24 hours, so we're not hosting infinite comparison pages.
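
For the escape hatch, one common way to build an expiring link is an HMAC-signed URL. The sketch below is a generic Python illustration of that technique, not our actual implementation. The secret, the base URL, and every function name are placeholders.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder secret
BASE_URL = "https://example.com/results"    # placeholder host


def sign(result_id, expires_at):
    # The signature covers both the id and the expiry, so neither can be edited
    message = f"{result_id}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_signed_url(result_id, ttl_seconds, now=None):
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    query = urlencode(
        {"id": result_id, "exp": expires_at, "sig": sign(result_id, expires_at)}
    )
    return f"{BASE_URL}?{query}"


def is_link_valid(result_id, expires_at, signature, now=None):
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False  # link has expired
    # Constant-time comparison avoids leaking signature bytes through timing
    return hmac.compare_digest(signature, sign(result_id, expires_at))
```

Calling `make_signed_url("abc", ttl_seconds=24 * 3600)` mirrors the `Duration::hours(24)` in the snippet above. Signing and expiry alone don't make a link one-time. That part would also need server-side state recording that the link was already opened.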

The LLM never sees the raw data. It just sees: "Here are 3 options, here's a link for the rest. Talk to the user."

That's the whole point. The LLM is a conversation layer, not a data processing layer.
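
To make the split concrete, here is a small Python sketch of deterministic curation followed by a compact payload for the LLM. The `Flight` fields, the scoring weights, and the payload shape are all illustrative assumptions. Our real `pick_optimal` also weighs departure time, which this version leaves out.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Flight:
    carrier: str
    price_cents: int
    duration_min: int
    stops: int


def score(flight, max_price, max_duration):
    """Lower is better. Weights are illustrative, not tuned."""
    return (
        0.5 * flight.price_cents / max_price
        + 0.4 * flight.duration_min / max_duration
        + 0.1 * flight.stops
    )


def pick_smart_choices(flights):
    """Pick cheapest, fastest, and best-scoring flights with plain code."""
    if not flights:
        return {}
    max_price = max(f.price_cents for f in flights)
    max_duration = max(f.duration_min for f in flights)
    return {
        "cheapest": min(flights, key=lambda f: f.price_cents),
        "fastest": min(flights, key=lambda f: f.duration_min),
        "optimal": min(flights, key=lambda f: score(f, max_price, max_duration)),
    }


def build_llm_payload(flights, view_all_url):
    """Everything the LLM gets to see: three picks, a count, and a link."""
    picks = pick_smart_choices(flights)
    return {
        "choices": {label: asdict(f) for label, f in picks.items()},
        "total_options": len(flights),
        "view_all_url": view_all_url,
    }


# Hypothetical sample data
flights = [
    Flight("Carrier A", 80_000, 1100, 2),
    Flight("Carrier B", 150_000, 600, 0),
    Flight("Carrier C", 90_000, 700, 1),
]
payload = build_llm_payload(flights, "https://example.com/results?id=abc")
print({label: f["carrier"] for label, f in payload["choices"].items()})
# prints {'cheapest': 'Carrier A', 'fastest': 'Carrier B', 'optimal': 'Carrier C'}
```

The payload stays the same size whether the search returned three flights or ten thousand.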

The boring result

We load tested this engine on modest hardware — 2 vCPU, 2GB RAM. It handles about 8,000 requests per second with p99 latency under 50ms.

Try doing that with an LLM in the critical path. I dare you.

But Here's Where I Get Honest

Look, I've been pretty hard on the "LLM for everything" approach. And I stand by most of it. But I also want to ask myself a genuine question: am I missing something?

Maybe I am. Let me play devil's advocate against myself.

What I might be losing

Flexibility. When requirements change, you just update a prompt. With code, you need to deploy. There's something seductive about being able to tweak behavior without touching the codebase.

Handling edge cases. LLMs are genuinely good at the weird, fuzzy stuff. A user says "I want a hotel that's not too fancy but not a dump either" — good luck encoding that as a filter rule. An LLM handles it naturally.

Faster prototyping. For an MVP, prompting your way through logic is fast. You can ship in days what would take weeks to code properly.

Personalization at scale. LLMs can adapt responses to individual users in ways that are painful to implement with rules.

Richer comparisons. An LLM could theoretically explain why one flight is better than another in natural language, considering dozens of factors. We do this manually with our "optimal" scoring function, but it's necessarily simpler than what an LLM could do. We trade sophistication for predictability.

But here's what I keep coming back to

Every time I've tried to use an LLM for something that could be deterministic code, I've regretted it. Not immediately — usually 3 to 6 months later, when:

The LLM starts returning subtly wrong results and I can't figure out why
The bill comes in and it's 3x what I expected
A user complains about inconsistent behavior and I can't reproduce it
I need to debug something at 2 AM and I'm staring at a prompt trying to figure out why it's returning garbage

The LLM is a probabilistic tool. It's amazing at things that are inherently fuzzy — language, intent, creativity. But when you need something to be exactly right, every single time, you want code.

My Rule of Thumb (For Now)

I'm not saying LLMs are bad. They're incredible tools. We use them every day. But I think we've swung too far in the "LLM for everything" direction, and we're going to look back in a few years and cringe at how much money we burned on tokens for things that sort() could have handled.

My rule of thumb is simple:

If the operation has a single correct answer, use code. If it has many valid answers, use an LLM.

Sorting has one correct answer. Use code.
Understanding what "not too fancy but not a dump" means? Many valid answers. Use an LLM.
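
The two halves of the rule can sit side by side. In this Python sketch, `interpret_preference` is a hypothetical stand-in for an LLM call, hard-coded so the example runs offline. The star-rating mapping and the hotel fields are assumptions for illustration.

```python
def interpret_preference(text):
    """Stand-in for an LLM call that maps fuzzy language to a structured filter.

    Hypothetical: a real version would prompt a model and validate its output.
    """
    return {"min_stars": 3, "max_stars": 4}


def find_hotels(hotels, preference_text):
    # Many valid answers: let the LLM decide what the phrase means
    wanted = interpret_preference(preference_text)
    # One correct answer: filter and sort in code
    matches = [
        h for h in hotels
        if wanted["min_stars"] <= h["stars"] <= wanted["max_stars"]
    ]
    return sorted(matches, key=lambda h: h["price"])


# Hypothetical sample data
hotels = [
    {"name": "Grand Palace", "stars": 5, "price": 420},
    {"name": "Harbour Inn", "stars": 4, "price": 189},
    {"name": "Budget Bunk", "stars": 1, "price": 35},
    {"name": "Maple Lodge", "stars": 3, "price": 110},
]
results = find_hotels(hotels, "not too fancy but not a dump either")
print([h["name"] for h in results])
# prints ['Maple Lodge', 'Harbour Inn']
```

The fuzzy step can be wrong in a forgivable way. The filter and the sort cannot.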

Am I wrong? Maybe. I genuinely want to hear from people who are successfully running LLM-heavy architectures in production. Show me the numbers. Convince me I'm missing something.

Or if you're also tired of seeing sort() implemented as a prompt, drop a comment. Let's commiserate.

P.S. If you're building a chat-based application and want to see how we implemented this architecture in Rust, check out GoChatTravel. We're a small Canadian startup — if you want to chat about architecture, argue about LLMs, or just tell us we're wrong, we're always just a chat away. 😉

Top comments (2)

Vladislav Simutin • Jun 16

This was such a refreshing and necessary read. You hit the nail right on the head regarding the "LLM for everything" hype cycle. The point about using code for deterministic tasks versus LLMs for fuzzy, nuanced understanding really resonated with me. Great perspective!

mote • Jun 17

Prompt engineering isn't dead, it's just... expensive to do manually. The real skill now is knowing when to stop tweaking prompts and start building infrastructure.

I spent 3 weeks prompt-tweaking a robot's navigation decision-making. What actually fixed it? A proper state memory layer. The model kept forgetting which room it was in; no amount of "you are a careful robot" in the system prompt was going to fix that.

Gambling with tokens vs. building reliable systems: it feels like the difference between writing one-off scripts and production code. Both have their place.