RileyKim

Posted on Jun 21

How I Cut Our Security Stack Costs 60% — A CTO Field Guide for 2026

#programming #ai #python #api

Six months ago I walked into a budget review and almost choked on my coffee. Our security tooling bill had quietly doubled over the previous quarter, and nobody could give me a straight answer on what we were actually getting for the money. That's the moment I stopped treating AI in our security pipeline as a nice-to-have and started treating it as a core architectural decision. This is the story of how I rebuilt that stack, what it actually cost, and where I'd do things differently if I started over.

Why Every Startup CTO Ends Up Here

Every team I've worked with hits the same wall at scale. You start with a handful of alerts, a few scripts that grep logs, maybe a managed SIEM that costs a reasonable amount when you're tiny. Then you grow. Suddenly you're drowning in phishing reports, anomaly tickets, vulnerability scans, and a compliance team that wants to know why your detection coverage has gaps. The temptation is to throw more human analysts at the problem, but that's a linear cost curve against an exponential alert curve. It doesn't work.

The other temptation is to buy a giant enterprise security platform. That's worse. You get locked into multi-year contracts, the APIs are awkward, the pricing is opaque, and migrating off later becomes a quarter-long project that nobody wants to sponsor. Vendor lock-in in security tooling is particularly nasty because your data is sensitive and your switching costs are enormous. I learned this the hard way at a previous company where we paid for a "premium" tier of a well-known platform and discovered that the features we actually needed lived behind another paywall.

So I started looking at AI-first approaches. Not as a silver bullet, but as a way to automate the boring 80% of security work — triage, summarization, log parsing, alert enrichment, code review for secrets — at a cost that doesn't scale linearly with volume.

The Vendor Lock-In Question

Before I talk numbers, I have to talk about architecture. Any time you're embedding an AI provider into a production system, you need to assume the provider will either change pricing, deprecate a model, or get acquired by someone you don't want to be a customer of. The only sane answer is a thin abstraction layer over the API surface, so you can swap providers in an afternoon instead of a quarter.

This is why I ended up routing everything through Global API. They expose 184 models through a single OpenAI-compatible interface, which means my code doesn't care whether the underlying call lands on a DeepSeek, a Qwen, or a GPT-4o. I keep the abstraction thin — basically just the base URL and the model string — and I can route any specific workload to whichever model makes sense economically.
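As a minimal sketch of how thin that layer can be, the snippet below keeps each provider's base URL, key variable, and default model in one small config object. The `ProviderConfig` class, the `AI_PROVIDER` variable, and the `alternate` entry are hypothetical names introduced for illustration; only the Global API values come from the integration shown later in this post.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    """Everything the rest of the codebase needs to know about a provider."""
    base_url: str
    api_key_env: str  # name of the environment variable holding the key
    default_model: str

    def api_key(self) -> str:
        # Read the key lazily so configs can be built without secrets present.
        return os.environ.get(self.api_key_env, "")


# Hypothetical registry. "global_api" mirrors the values used in this post;
# "alternate" is a made-up placeholder for any other OpenAI-compatible host.
PROVIDERS = {
    "global_api": ProviderConfig(
        base_url="https://global-apis.com/v1",
        api_key_env="GLOBAL_API_KEY",
        default_model="deepseek-ai/DeepSeek-V4-Flash",
    ),
    "alternate": ProviderConfig(
        base_url="https://example.invalid/v1",
        api_key_env="ALTERNATE_API_KEY",
        default_model="some-org/some-model",
    ),
}


def active_provider() -> ProviderConfig:
    """Pick the provider from an env var so a swap is a config change."""
    name = os.environ.get("AI_PROVIDER", "global_api")
    return PROVIDERS[name]
```

With this shape, client construction reads `base_url` and `api_key()` from `active_provider()`, and nothing else in the service needs to know which provider is live.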

Here's the thing: the underlying models matter enormously for cost. The cheapest entry in their catalog starts at $0.01 per million tokens and the most expensive tops out at $3.50 per million tokens. That's a 350x spread. If you're building serious production volume, you cannot afford to just call the default expensive model on every request. The cost difference between a smart routing strategy and a naive one is the difference between a unit-economics-positive product and a fundraising distraction.

Pricing Reality Check

Let me put the numbers in front of you. These are the models I've actually been using in production, with the actual prices I'm paying through Global API:

Model	Input ($/M)	Output ($/M)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Look at GPT-4o. Ten dollars per million output tokens. If you naively route every alert triage through GPT-4o, you'll burn through runway in a way that makes your finance lead weep. Now look at GLM-4 Plus at $0.80 per million output. That's 12.5x cheaper for many tasks where the frontier model doesn't actually add value. DeepSeek V4 Pro is great when you genuinely need reasoning — incident response playbooks, complex log analysis, code review — and it's still less than a quarter of the GPT-4o output cost.
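To check the ratios yourself, here is a small sketch that turns the table into a per-request cost calculator. The prices are copied from the table above; the function names and the example request size (2,000 tokens in, 300 out) are assumptions made for illustration.

```python
# Prices in USD per million tokens, copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": {"input": 0.27, "output": 1.10},
    "DeepSeek V4 Pro": {"input": 0.55, "output": 2.20},
    "Qwen3-32B": {"input": 0.30, "output": 1.20},
    "GLM-4 Plus": {"input": 0.20, "output": 0.80},
    "GPT-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def output_price_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model charges per output token."""
    return PRICES[expensive]["output"] / PRICES[cheap]["output"]


if __name__ == "__main__":
    # Illustrative request shape (an assumption): 2,000 tokens in, 300 out.
    for name in PRICES:
        print(f"{name:18s} ${request_cost(name, 2_000, 300):.6f}")
    print("GPT-4o vs GLM-4 Plus output ratio:",
          output_price_ratio("GPT-4o", "GLM-4 Plus"))
```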

The aggregate effect across my stack: 40-65% cost reduction compared to the all-GPT-4o baseline I inherited. That's not a marketing claim, that's what's on my invoice at the end of the month. And the quality is comparable or better on the specific workloads I care about, which I'll get to.

The First Integration

The actual integration took me about ten minutes, which is the only acceptable answer at a startup. Here's the basic shape of what I dropped into our security service:

import json
import os

import openai


class SecurityAnalyzer:
    def __init__(self):
        # Global API is OpenAI-compatible, so the stock OpenAI client works
        # once base_url points at it.
        self.client = openai.OpenAI(
            base_url="https://global-apis.com/v1",
            api_key=os.environ["GLOBAL_API_KEY"],
        )
        # Cheap, fast model for high-volume triage
        self.default_model = "deepseek-ai/DeepSeek-V4-Flash"
        # Reasoning model for complex incident analysis
        self.reasoning_model = "deepseek-ai/DeepSeek-V4-Pro"

    def triage_alert(self, alert_payload: dict) -> dict:
        """Fast, cheap triage for incoming security alerts."""
        response = self.client.chat.completions.create(
            model=self.default_model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a security alert triager. Classify severity, summarize, suggest action.",
                },
                {
                    "role": "user",
                    # Serialize as JSON instead of relying on the dict's repr;
                    # default=str covers values such as datetimes.
                    "content": f"Alert payload: {json.dumps(alert_payload, default=str)}",
                },
            ],
            # Low temperature keeps classifications consistent across runs
            temperature=0.1,
        )
        return {"analysis": response.choices[0].message.content}

That's it. The abstraction is exactly what you'd write for OpenAI itself, because Global API is OpenAI-compatible. The model string is the only thing that changes. This matters enormously for vendor lock-in — when I want to A/B test a different model, I change one string. When I want to shift a workload from DeepSeek V4 Flash to GLM-4 Plus for a week to measure quality, I change one string. No SDK swaps, no auth dance, no schema migration.

Routing Workloads to the Right Model

The real engineering work is not the integration. It's deciding which model gets which workload. I ended up with a routing function that looks roughly like this:

def route_to_model(task_type: str, payload_size: int) -> str:
    """Route a security task to the most cost-effective model."""

    # Simple triage: cheap and fast
    if task_type in ("alert_triage", "log_summary", "phishing_classify"):
        return "deepseek-ai/DeepSeek-V4-Flash"

    # Short context classification tasks
    if task_type in ("ioc_enrichment", "severity_score") and payload_size < 32000:
        return "Qwen/Qwen3-32B"

    # Long-context log dumps
    if payload_size > 32000 and task_type in ("log_analysis", "incident_summary"):
        return "THUDM/GLM-4-Plus"

    # Deep reasoning: incident response, threat hunting
    if task_type in ("incident_response", "root_cause", "code_secret_review"):
        return "deepseek-ai/DeepSeek-V4-Pro"

    # Premium model, reserved for board-facing executive summaries
    if task_type == "executive_summary":
        return "openai/gpt-4o"

    return "deepseek-ai/DeepSeek-V4-Flash"

Notice the GPT-4o branch near the bottom. I only use it for one thing: generating executive summaries that go to our board. The output quality on polished prose still has an edge there that I haven't fully replicated, and the volume is low enough that the cost is negligible. For everything else, I'm using the cheaper models and getting equivalent results on the workloads I measure.
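One way to make that routing editable without touching control flow is to express the same branches as data. The sketch below is an assumption about how that might look, not the production code: the `Rule` class and `route` function are hypothetical names, while the task types, size thresholds, and model strings are taken from the function above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    task_types: frozenset
    model: str
    min_size: Optional[int] = None  # matches only if payload_size > min_size
    max_size: Optional[int] = None  # matches only if payload_size < max_size

    def matches(self, task_type: str, payload_size: int) -> bool:
        if task_type not in self.task_types:
            return False
        if self.min_size is not None and payload_size <= self.min_size:
            return False
        if self.max_size is not None and payload_size >= self.max_size:
            return False
        return True


# Same order and thresholds as the if-chain above; first match wins.
RULES = [
    Rule(frozenset({"alert_triage", "log_summary", "phishing_classify"}),
         "deepseek-ai/DeepSeek-V4-Flash"),
    Rule(frozenset({"ioc_enrichment", "severity_score"}),
         "Qwen/Qwen3-32B", max_size=32000),
    Rule(frozenset({"log_analysis", "incident_summary"}),
         "THUDM/GLM-4-Plus", min_size=32000),
    Rule(frozenset({"incident_response", "root_cause", "code_secret_review"}),
         "deepseek-ai/DeepSeek-V4-Pro"),
    Rule(frozenset({"executive_summary"}), "openai/gpt-4o"),
]
DEFAULT_MODEL = "deepseek-ai/DeepSeek-V4-Flash"


def route(task_type: str, payload_size: int,
          rules: Sequence[Rule] = RULES) -> str:
    """Return the model of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(task_type, payload_size):
            return rule.model
    return DEFAULT_MODEL
```

Because the rules are plain data, they could be loaded from a config file, so shifting a workload to a different model becomes a config change instead of a code change.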

Caching, Streaming, and the Boring Stuff

The big wins at scale aren't the model choice — they're the operational patterns around the model. Three things changed my cost structure dramatically:

Aggressive caching. I cache the embedding and the response for any security event payload that's been seen before. Our actual hit rate sits around 40% because a lot of alerts are repetitive (same scanner, same signature, same false positive pattern). A 40% hit rate means roughly 40% of model calls never get made, and that share of the bill goes with them. The cache layer is a Redis instance we were already running.

Streaming responses. When I'm generating a long incident report or a code review, I stream the tokens back to the analyst's UI. Streaming doesn't make the model any faster, but perceived latency drops dramatically when the user sees text appearing in real time. Average end-to-end latency is 1.2 seconds, and throughput on the streaming pipeline clocks around 320 tokens per second. The analysts stopped complaining about the AI being "slow" the moment I turned streaming on.

Graceful fallback. Every call has a fallback model. If DeepSeek V4 Flash is rate-limiting me — and during a major incident every team on the platform is probably hitting the same model — I fall back to GLM-4 Plus. If that's also throttled, I fall back to the GPT-4o endpoint. The user never sees an error, and the cost just shifts up the stack temporarily. This is production-ready behavior, not "move fast and break things" behavior.
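The caching and fallback patterns above can be sketched together in a few lines. This is an illustration under stated assumptions, not the production code: a plain dict stands in for Redis, `call_model` is an injected callable representing the real API call, and `RateLimited` is a hypothetical exception for a throttled request. The fallback order matches the one described above.

```python
import hashlib
import json
from typing import Callable, Dict, Sequence


class RateLimited(Exception):
    """Hypothetical error raised when the provider throttles a call."""


# Fallback order described above: Flash, then GLM-4 Plus, then GPT-4o.
FALLBACK_CHAIN = (
    "deepseek-ai/DeepSeek-V4-Flash",
    "THUDM/GLM-4-Plus",
    "openai/gpt-4o",
)


def cache_key(task: str, payload: dict) -> str:
    """Stable key: identical payloads hash the same regardless of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{task}:{canonical}".encode("utf-8")).hexdigest()


def analyze(
    task: str,
    payload: dict,
    call_model: Callable[[str, dict], str],
    models: Sequence[str],
    cache: Dict[str, str],
) -> str:
    """Serve from cache if possible, else try each model in order."""
    key = cache_key(task, payload)
    if key in cache:
        return cache[key]  # cache hit: no model call, no cost

    last_error = None
    for model in models:
        try:
            result = call_model(model, payload)
        except RateLimited as exc:
            last_error = exc  # throttled: move on to the next model
            continue
        cache[key] = result
        return result
    raise RuntimeError("every model in the fallback chain was throttled") from last_error
```

Injecting `call_model` keeps the retry logic testable without network access; in a real service the dict would be replaced by the Redis client, with an expiry on each entry.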

ROI, In Actual Numbers

Let me give you a concrete example. We process roughly 12 million tokens of alert content per day across triage, enrichment, and summarization. At GPT-4o pricing across all of it, that's about $120 per day in output tokens alone, plus a similar amount in input. Roughly $7,000 a month all in.

After the rerouting and caching, the same workload runs about $2,800 a month. That's a 60% reduction, right in the 40-65% band I keep quoting. The quality score I'm tracking (we measure user satisfaction on analyst reviews) went from 81% positive to 84.6% positive. Better outcomes for less money. That's the dream scenario, and it's far from a marginal improvement. It's the difference between this AI initiative being a line item and being a strategic advantage.
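For anyone who wants to rerun this arithmetic with their own volumes, a minimal sketch follows. It assumes a 30-day month and reads the roughly $7,000 baseline as input and output combined, with input costing about as much as output; the function names are made up for illustration.

```python
def daily_cost(tokens_per_day: int, price_per_million: float) -> float:
    """USD per day for a token volume at a per-million-token price."""
    return tokens_per_day / 1_000_000 * price_per_million


def monthly_cost(per_day: float, days: int = 30) -> float:
    """Scale a daily figure to a month (30 days is an assumption)."""
    return per_day * days


def reduction(before: float, after: float) -> float:
    """Fractional saving when moving from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    output_per_day = daily_cost(12_000_000, 10.00)  # GPT-4o output price
    # "A similar amount in input" is modeled as doubling the output cost.
    baseline_month = monthly_cost(output_per_day * 2)
    print(f"output per day:   ${output_per_day:,.2f}")
    print(f"baseline / month: ${baseline_month:,.2f}")
    print(f"saving vs $2,800: {reduction(7_000, 2_800):.0%}")
```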

The benchmark average of 84.6% across our evaluation set is honestly what surprised me most. I expected the cheaper models to drag the average down. They didn't. The reason is that the cheaper models are tuned for exactly the kind of structured, instruction-following work that security tasks tend to be. We're not asking them to write poetry. We're asking them to classify, summarize, and extract.

The Vendor Lock-In Test

Here's the test I now apply to every AI integration. If my current provider disappeared tomorrow, how long would it take me to switch? With the Global API setup, the answer is: minutes. I change the base URL, run my eval suite against an alternative provider, and ship. That's the test. Anything that takes longer than a day to swap is, in my book, a vendor lock-in risk that needs to be designed out.

The model diversity is also a hedge against any single model's regression. When GPT-4o had its well-documented quality issues a few months back, my executive-summary workload was the only one affected, and I had a GLM-4 Plus comparison running in shadow mode already. I rolled back in an hour. Compare that to teams who built their entire stack on a single provider — those teams were stuck for weeks.

What I'd Tell My Past Self

If I could go back to the start of this project, here's what I'd say:

Don't start with the most expensive model. You'll be tempted, because the demo quality is impressive, and you'll write a budget based on that. By the time you realize the cheaper model is good enough, you've already committed to a cost structure that's hard to unwind. Start cheap, measure quality, and only escalate workloads that genuinely need the premium tier.

Instrument from day one. Track token usage per workload, latency per model, quality scores per task type, and cache hit rates. Without instrumentation, you're flying blind and you can't defend the architecture to your finance team.

Treat the model selection as a routing problem, not a deployment problem. You want to be able to route any task to any model without code changes beyond a config update. The function I showed above is a starting point, not the final word — you'll evolve it.

Build the fallback path before you need it. The day you hit a rate limit is not the day you want to be writing fallback code. Build it during the boring week, not during the incident.
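As a starting point for the instrumentation advice above, here is a small in-memory sketch of the counters involved: tokens per workload, latency per model, and cache hit rate. The `UsageTracker` class and its method names are hypothetical, and a real deployment would presumably export these numbers to a metrics system instead of holding them in process memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Minimal counters; quality scores per task type are left out here."""
    tokens_by_task: dict = field(default_factory=lambda: defaultdict(int))
    latencies_by_model: dict = field(default_factory=lambda: defaultdict(list))
    cache_hits: int = 0
    cache_misses: int = 0

    def record_call(self, task: str, model: str, input_tokens: int,
                    output_tokens: int, latency_s: float) -> None:
        """Record one real model call (which is also a cache miss)."""
        self.tokens_by_task[task] += input_tokens + output_tokens
        self.latencies_by_model[model].append(latency_s)
        self.cache_misses += 1

    def record_cache_hit(self) -> None:
        self.cache_hits += 1

    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0

    def mean_latency(self, model: str) -> float:
        samples = self.latencies_by_model.get(model, [])
        return sum(samples) / len(samples) if samples else 0.0
```

Calling `record_call` after every model response and `record_cache_hit` on every cache hit is enough to answer the questions a finance review tends to ask: which workload spends the tokens, and how much the cache is saving.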

Where I Landed

I'm now running a production security AI stack that processes millions of tokens a day, costs roughly 40-65% less than the naive all-frontier-model approach, scores 84.6% on our internal quality benchmarks, and can be re-platformed in an afternoon if any single model or provider becomes a problem. The architecture is boring in the best possible way — a thin abstraction, a routing function, a cache layer, a streaming layer, and a fallback. Nothing clever. Everything observable. Everything swappable.

The single biggest unlock was treating model selection as a first-class architectural decision instead of a vendor relationship. Once I did that, the cost optimization fell out naturally.

If you're staring at a security tooling bill that's getting out of hand, or you're building a new product and trying to figure out how to keep AI costs sane at scale, the unified API approach is honestly the cleanest path I've found. Global API has

DEV Community
