At my company, PromptOwl, everyone coworks with AI for 90 to 100% of their work. It's everywhere, in every process, and increasingly connecting everything. Engineering, marketing, sales, leadership—AI is in every workflow, every day. Getting there took us just over a year, and taught us a lot about what that actually costs.
Running AI at that scale boils down to two optimization problems: money and time.
Money shows up on the monthly invoice, and if you are not paying attention it will floor you. AI is expensive to run blindly. We expect to pay $200-300 per developer, per day on frontier models, but we had to make sure we weren't wasting money on things that didn't matter.
Time is the second problem. It's the hours lost waiting for generated responses, rerunning prompts because bad context produced wrong outputs, and babysitting workflows that need constant manual intervention to keep from falling over.
Optimizing for time and money is an evergreen effort. Too much is evolving, and we will always need to adapt. But these eleven tactics are the foundation of how we run AI at scale.
The Setup
Get this right once. It pays back on every session.
1. Organize your Prime Documents
On our journey to 90%, the first obvious problem was re-explaining core business tenets to the models for every conversation. So, I started writing context files for the models—in markdown, to cut down on the size of the messages sent to the LLMs.
A Prime Document is a structured context file written specifically for AI use—not a deck you'd send to a colleague, but a document the model can actually use. Your brand brief, product spec, customer profile, team norms. Every function that runs AI regularly should have at least one.
Most teams don't have these. They improvise context in the chat window every time, and wonder why the outputs aren't consistent.
2. Write them for AI—shrink your files
I like to think that every character I send to AI lights up a GPU in some data center. And while I love a light show, padding the data is a huge waste of energy. It can also distort the results: the context window for each LLM is fixed, and your whole conversation has to fit in it. Every unnecessary character eats away at what the model can do for you.
Uploading a full 60-page brand guide every time may provide detail, but it will cost you processing time, tokens—and potentially the right answer. The model re-reads everything it doesn't need, every single time. The more it has, the more it can confuse.
Your Prime Documents should contain only what's relevant, written in a format the model can parse efficiently—not formatted for a human presenting to a board. Ask the model to distill these documents for its own use, then start a new chat with the distilled version. Smaller, cleaner input means identical or better output quality, faster responses, and a lower cost per call.
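To make the savings concrete, here is a rough sketch of the arithmetic. It uses the common ~4 characters/token heuristic rather than a real tokenizer, and the document sizes and $3-per-million-tokens input price are illustrative assumptions, not actual vendor pricing.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 chars/token rule of thumb."""
    return max(1, len(text) // 4)

def cost_per_call(text: str, price_per_million_tokens: float) -> float:
    """Input cost in dollars for sending `text` once."""
    return estimate_tokens(text) / 1_000_000 * price_per_million_tokens

# A full 60-page brand guide vs. a distilled Prime Document (sizes assumed).
full_guide = "x" * 180_000   # ~60 pages of prose, ~45k tokens
prime_doc = "x" * 12_000     # distilled essentials, ~3k tokens

PRICE = 3.0  # hypothetical $3 per million input tokens
print(f"full guide: ~{estimate_tokens(full_guide):,} tokens, "
      f"${cost_per_call(full_guide, PRICE):.4f} per call")
print(f"prime doc:  ~{estimate_tokens(prime_doc):,} tokens, "
      f"${cost_per_call(prime_doc, PRICE):.4f} per call")
```

The point is not the exact numbers; it's that the full-guide cost recurs on every single call, so trimming once pays back forever.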
3. Use a context management tool
Eventually, you will collect a nest of these machine-optimized documents, and maintaining them becomes the new burden.
We started to centralize the management of our Prime Documents, publishing them on our Google Drive. I became the arbiter of them. But since everything in AI moves so fast, our context needed near constant updating. This new task of resyncing them ended up taking so much of my time, I took on the title Chief Context Officer.
Plus the whole thing was such a drag. Literally: I was constantly dragging documents into every chat, reviewing and updating them as needed, only to do it all over again a half hour later when my conversation ran out of space.
The solution was to build a context management layer that operates over a wiki-style system of markdown files. Accessing it via a CLI meant I could still use Claude and Antigravity the exact same way, but the system actively improved the context by creating tags and links to live data and other docs. Now my chats have the ability to look at everything and pick up where the other left off, with no retraining of the conversation.
This pattern makes tasks in my workflow 2-3x faster, and mitigates the risk of having to start over if something goes sideways. I barely maintain it anymore, either, as it usually just learns as I work.
This is the same theory that Andrej Karpathy wrote about a couple of weeks ago. We published a ContextNest whitepaper on the details of how this works two months ago, too. And bonus: it's also an open-source project, ContextNest. A desktop client should land any day now as well.
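A toy version of the wiki-style layer described above can be sketched in a few lines. This is NOT the actual ContextNest implementation; the `[[wiki-link]]` and `#tag` syntax is an assumption for illustration. It scans a folder of markdown files and builds the kind of index a CLI could query to pull related context into a chat.

```python
import re
import tempfile
from pathlib import Path

LINK = re.compile(r"\[\[([^\]]+)\]\]")  # [[other-doc]] style cross-links
TAG = re.compile(r"(?<!\w)#([\w-]+)")   # inline #tags (skips "# Heading" lines)

def index_docs(root: str) -> dict:
    """Map each markdown file to the docs it links and the tags it carries."""
    index = {}
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        index[path.name] = {
            "links": sorted(set(LINK.findall(text))),
            "tags": sorted(set(TAG.findall(text))),
        }
    return index

# Tiny demo: one doc that links another and carries a tag.
root = tempfile.mkdtemp()
Path(root, "brand.md").write_text(
    "See [[product-spec]] for #pricing details.", encoding="utf-8")
index = index_docs(root)
```

A new chat session can then follow the links and tags to load only the documents relevant to the task, instead of dragging everything in by hand.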
4. Set up skills
When customers ask me what they should apply AI to first, my answer is always the same—build for what you repeatedly spend the most time on. Sometimes folks do a time journal to determine where this happens. I have a bias for rapid results, so I think this can be conducted as a mental exercise, at least to find the low-hanging fruit.
For me, I would spend 4-5 hours every week doing strategy work. I'd pull all the numbers, have AI distill all the standup notes, make sure action items were tracked and ticked off. Then I'd think about what to prioritize this week. Finally, I'd communicate it to the team so everyone is aligned and moving in unison.
At first, I attempted to automate the SOP for doing this with another markdown file. Then I built tools to handle the repeatable research processes, like summarizing the chatter in the engineering channel on Slack. These skills are reusable, pre-configured AI behaviors you define once and improve through actual usage.
They can also be shared. In our case, that means everyone who meets with a prospect can create the same deep-dive research brief and proposal based on our status today: what new features are out, what promotion is running, what the customer wanted to meet about in the first place. With very little coordination, we all produce the same output.
Without shared skills, every person on your team reinvents the wheel and gets slightly different results. Skills create consistency, reduce token cost on repeated tasks, and make AI output auditable across the org.
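At its simplest, a skill can be thought of as a prompt template stored in a file and filled in with today's inputs. The file format, field names, and the `render_skill` helper below are hypothetical, a minimal sketch of the define-once, reuse-everywhere idea rather than any specific product's skill format.

```python
# A hypothetical skill definition, stored once and shared across the team.
SKILL = """\
ROLE: Prospect research brief
STEPS:
1. Summarize what {company} wanted to meet about.
2. List our current features and the running promotion.
3. Draft a one-page proposal.
CONTEXT: {context}
"""

def render_skill(template: str, **inputs: str) -> str:
    """Fill the skill template so every teammate sends the same prompt."""
    return template.format(**inputs)

prompt = render_skill(SKILL, company="Acme Co",
                      context="Q3 feature list, spring promotion")
```

Because everyone renders the same template, the output is consistent no matter who runs it, and improvements to the template benefit the whole team at once.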
The 7 Habits of Highly Successful Prompting
Now that the system is in place, these are the habits that govern every session and get the most out of it.
1. Plan first. Build last.
The expensive moment isn't generating the final artifact. It's generating it three times because the spec wasn't clear—and losing 20 minutes each time to recreate it.
Before you ask for the web page, the strategy doc, or the campaign copy, use a few cheap messages to get alignment—figure out your structure, identify the edge cases, and establish naming conventions and expectations.
Thinking is cheap. Building is expensive. Iteration in planning costs almost nothing in tokens and almost always saves you time. Iteration in generation costs both.
2. Run a murderboard on the plan
One of the best things AI does is help you think through things the way other people would: other professionals, your customers, or future prospects. This is your opportunity to learn from what they would predictably say. I have a markdown file of various personas that I pull from to review my plans, ruthlessly tear them apart, and find not just the holes, but the recommendations that would satisfy them.
I call this group-think exercise a "murderboard," but you can just tell it to run a focus group antagonistically or to act like a customer and complain. As long as you try to use multiple perspectives and explicitly get the model to break its tendency for sycophancy, it will help you find problems before you codify them into production.
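The murderboard pass can be sketched as a loop over personas, each given the same plan and an explicit instruction not to be agreeable. The persona list and prompt wording below are examples of my own, not a canonical set.

```python
# Example antagonistic personas; swap in whoever would actually review your plan.
PERSONAS = [
    "a skeptical enterprise buyer worried about vendor lock-in",
    "a staff engineer who has seen this architecture fail before",
    "a customer who just wants the old workflow back",
]

def murderboard_prompts(plan: str) -> list[str]:
    """One critique prompt per persona, explicitly forbidding sycophancy."""
    return [
        f"You are {p}. Tear this plan apart. Do not be agreeable: "
        f"list the holes, then what would actually satisfy you.\n\nPLAN:\n{plan}"
        for p in PERSONAS
    ]

prompts = murderboard_prompts("Ship the v2 onboarding flow next week")
```

Run each prompt as its own conversation so the personas don't bleed into one another, then merge the objections into a revision list.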
3. Tell the AI to ask you questions
A 600-word fully-specified prompt is often the most expensive way to get a mediocre result. With that much detail, the models think they should know everything and usually make some terrible assumptions.
Describe what you need, focusing on the results and how it will be used. Tell the model to ask clarifying questions before it starts. Each exchange costs a fraction of a re-generation. You get better output from a conversation than from a wall of text that leaves the model guessing where your spec was ambiguous.
4. Edit the message. Don't stack on top of it.
You sent a prompt, spotted a typo, realized you left out a constraint. Most people send a correction as a new message.
That's a mistake for two reasons.
Every new message adds to the context window: the model now reads your original error and your correction simultaneously and tries to reconcile them, on every subsequent turn.
Plus, new messages can take time to ingest—or worse, distract the model from your original question.
Find the edit button. Replace the message. Then the next response doesn't carry your mistake forward, and you don't spend ten minutes untangling an output that went sideways because of a fixable prompt.
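The cost difference is easy to see on the back of an envelope. Each turn re-sends the entire history, so a stacked correction gets re-read on every later turn, while an edit does not. The token counts below are illustrative only, and this ignores assistant replies for simplicity.

```python
def total_input_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens across a chat where each turn re-reads all prior turns."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size
        total += history  # the model reads the full history every turn
    return total

# Stacking: original prompt, a 100-token correction, then two follow-ups.
stacked = total_input_tokens([500, 100, 800, 800])
# Editing: the correction is folded into the original prompt instead.
edited = total_input_tokens([500, 800, 800])
```

The gap widens with every additional turn, because the stacked correction (and the flawed original it was fixing) rides along in the history forever.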
5. Turn off what you're not using
Web search has a cost. Extended thinking has a cost. Document connectors have a cost. Frontier models have huge costs. Most of the time, none of them are needed for the task at hand.
If you can streamline what you don't need (especially if switching costs are low because you have a ContextNest), you can save a lot of time and tokens.
Enable them in the moment you actually need them—not as default-on settings running in the background of every request.
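One way to enforce default-off is to build requests from a config where every expensive feature starts disabled. The flag names, model names, and task mapping below are hypothetical, not any specific vendor's API; the pattern is what matters: opt in per task, never globally.

```python
from dataclasses import dataclass, field

@dataclass
class RequestConfig:
    model: str = "lightweight-model"  # hypothetical cheap default
    web_search: bool = False          # off unless the task needs fresh data
    extended_thinking: bool = False
    connectors: list = field(default_factory=list)

def config_for(task: str) -> RequestConfig:
    """Opt into costly features only when the task actually calls for them."""
    cfg = RequestConfig()
    if task == "market-research":
        cfg.web_search = True                 # needs current information
    if task == "architecture-review":
        cfg.model = "frontier-model"          # hypothetical heavy model
        cfg.extended_thinking = True          # worth the extra latency here
    return cfg
```

A drafting task gets the cheap defaults; only the tasks that genuinely need search, deep reasoning, or connectors pay for them.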
6. Work in smaller sections
Engineers called this out a long time ago: models cannot handle large codebases well. Scoping context and focusing on smaller sections at a time means models return results more quickly and are less likely to choke on the task.
The same is true for business efforts. Don't ask for the 5,000-word strategy document in one prompt. Ask for the outline. Then expand each section. Don't ask for the full function—ask for the structure first, then fill in each piece.
Smaller sections mean faster iteration, easier course correction, and lower cost when something needs to be redone. It also gives you natural checkpoints so the output doesn't drift through a long generation you can't easily fix at the end.
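The outline-then-expand flow above can be sketched as a simple loop: one cheap call for structure, then one focused call per section, with a checkpoint between each. The `ask_model` function is a stand-in for whatever client you actually use, stubbed here so the flow runs on its own.

```python
def ask_model(prompt: str) -> str:
    """Stub for a real model call; returns a canned reply for illustration."""
    return f"[response to: {prompt[:40]}...]"

def build_document(topic: str, sections: list[str]) -> str:
    """Expand a document one section at a time, checkpointing between sections."""
    parts = []
    for name in sections:
        draft = ask_model(f"Write the '{name}' section of a doc on {topic}.")
        # Natural checkpoint: review or redo this section before moving on.
        parts.append(f"## {name}\n{draft}")
    return "\n\n".join(parts)

outline = ["Goals", "Current State", "Priorities"]
doc = build_document("Q3 strategy", outline)
```

If a section goes sideways, you regenerate only that section, instead of throwing away a 5,000-word artifact and starting over.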
7. Match the model to the task
Every call to a frontier model that didn't need to be one is money you didn't have to spend.
As of April 15, the cost to generate 500 words on Opus 4.6 is about 1.67 cents. For that same 1.67 cents:
- Sonnet 4.6 gives you ~835 words
- Haiku 4.5 gives you ~2,500 words
This means Haiku is 5x cheaper than Opus for output. For content generation, listicles, and drafts—Haiku earns its place. Opus earns its place when nuance, analysis, or voice precision actually matters. Sonnet feels like the safe middle ground, but often the flash models are enough.
Route simple triage, summarization, and single-turn questions to a lightweight model. Save the heavy models for work that actually requires it—the analysis feeding a real decision, the writing carrying your company's voice, the code review that can't afford a miss. The right tool for the job is a standard engineering principle. Apply it here.
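A minimal routing layer can be a lookup table that maps task types to the cheapest adequate tier, defaulting unknown work to the heavy tier so nothing important gets shortchanged. The model names echo the tiers above, but the task-to-tier mapping is my own illustrative assumption.

```python
# Which tier is "good enough" for each task type (illustrative mapping).
TIER_FOR_TASK = {
    "triage": "light",
    "summarize": "light",
    "draft": "mid",
    "decision-analysis": "frontier",
    "code-review": "frontier",
}

# Output per 1.67 cents, per the figures quoted above.
MODEL_FOR_TIER = {
    "light": "haiku",    # ~2,500 words
    "mid": "sonnet",     # ~835 words
    "frontier": "opus",  # ~500 words
}

def route(task: str) -> str:
    """Pick the cheapest adequate model; unknown tasks default to the safe tier."""
    tier = TIER_FOR_TASK.get(task, "frontier")
    return MODEL_FOR_TIER[tier]
```

Even a crude table like this catches the bulk of the waste: the high-volume tasks (triage, summaries) are exactly the ones that tolerate the cheap tier.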
The foundation, not the ceiling
These eleven tactics cut both bills—the invoice and the productivity drain. But there is more to come, especially when you think about sharing a living context across a team or an organization.
This is the area I am studying now. I'll be writing (and releasing commercial software) about how workflow and tools are adapting across organizations in future posts. Stay tuned!