DEV Community: openai

GPT-5.6 pricing: the cheaper model is not always the cheaper AI workflow

Shruti Saraswat — Tue, 30 Jun 2026 12:17:33 +0000

A pricing table is useful.

It is also easy to overread.

When a new model family arrives with clearer tiers, faster options, and lower-cost paths, the first instinct is to compare input and output prices. That makes sense. Founders need to know whether a feature can survive real usage.

But the price per million tokens is only the first layer of AI cost.

The real product cost usually appears one step later:

Which tasks use which model?
How much output does the workflow generate?
How often does the same context repeat?
How many retries happen when the first answer is not good enough?
How much human review still sits around the AI step?
What happens when users depend on the feature every day?

This is why GPT-5.6 is interesting from an economics angle, not only a capability angle.

The model lineup gives teams more pricing choice. The product still needs a cost system.

What changed

OpenAI introduced GPT-5.6 with three model tiers:

Sol: the strongest model, priced at $5 input and $30 output per one million tokens.
Terra: a balanced model, priced at $2.50 input and $15 output per one million tokens.
Luna: a faster and lower-cost model, priced at $1 input and $6 output per one million tokens.

OpenAI also introduced more predictable prompt caching for GPT-5.6 and later models, including explicit cache breakpoints and a 30-minute minimum cache life. Cache writes are billed at 1.25x the model’s uncached input rate, while cache reads receive a 90 percent cached-input discount.

That creates a practical question for teams building AI into SaaS products:

Should cost planning start with the model tier, or with the workflow?

The safer answer is workflow.

Why model price is not the full cost

A lower-cost model helps when the task fits it.

It does not automatically make the full product cheaper.

For example, imagine two AI workflows:

A support-tagging workflow that classifies customer messages into a few categories.
A technical review workflow that reads long context, reasons through multiple constraints, and produces a detailed recommendation.

The first workflow may work well with a fast, lower-cost model.

The second may need a stronger model, or at least a careful routing rule that sends only the hard cases to the stronger path.

If both workflows use the same model by default, one of two things usually happens:

The simple workflow becomes more expensive than necessary.
The complex workflow becomes cheaper at first, but creates review work, retries, or user corrections later.

Both are cost problems.

One is visible on the invoice.

The other hides inside operations.

The four cost layers founders should model

A founder does not need to turn every AI feature into a finance spreadsheet before testing it.

But once the feature moves toward customer-facing usage, four cost layers should be visible.

1. Model tier cost

This is the obvious one.

Input tokens, output tokens, reasoning effort, model tier, and provider pricing all matter.

But teams should not stop here. The cheapest model for one task may become expensive if it produces answers that require extra review, retries, or longer prompts.

2. Output shape

Output tokens are often where costs grow quietly.

A product that returns short classifications, status labels, or structured fields has a different cost profile from a product that generates long explanations, drafts, recommendations, or reports.

If a feature always asks for a long answer, the bill grows with every user action.

A better pattern is to design the output around the user decision:

Does the user need a short answer?
Does the user need a draft?
Does the user need a reasoned explanation?
Does the system need a structured object instead of prose?
Can the full explanation appear only when requested?

The output format is not only UX. It is cost design.

3. Repeated context and caching

Prompt caching becomes valuable when a workflow sends the same large context repeatedly.

That may include:

System instructions.
Product rules.
Policy text.
Tool definitions.
Reusable examples.
Account-level configuration.
Long documents or knowledge context that remains stable across requests.

Caching is not magic. It depends on reuse.

If the prompt changes constantly, the cache hit rate stays low. If static content is placed at the beginning and dynamic user content appears later, the chance of a useful cache hit improves.

This changes prompt design.

A production prompt should not be treated as one big text block. It should be structured so repeated content remains stable, measurable, and cacheable where the provider supports it.

4. Review, retry, and fallback cost

This is the layer many early AI demos miss.

The first API call may be cheap.

The full workflow may not be.

A customer-facing feature can create extra cost through:

retries after weak answers,
review queues,
escalation to a stronger model,
fallback paths,
support tickets,
manual correction,
reprocessing failed jobs,
longer latency windows,
and customer confusion when the output is not clear.

Those costs do not always appear as tokens.

They appear as engineering time, support load, product complexity, and lower trust.

A better cost model for AI features

Instead of asking, “Which model is cheapest?” ask:

What is the cheapest reliable path for this workflow?

That question leads to a more useful structure.

Routine path

Use this for low-risk, repeatable tasks.

Examples:

classification,
extraction,
short summaries,
simple rewriting,
intent detection,
formatting,
routing,
and lightweight support assistance.

The goal is speed and predictability.

Escalation path

Use this for tasks where stronger reasoning changes the outcome.

Examples:

complex code review,
multi-step product analysis,
policy-sensitive work,
security review,
technical planning,
and decisions that affect customers or operations.

The goal is quality, not default low cost.

Cached path

Use this when long context repeats.

Examples:

documentation assistant,
policy review,
product onboarding assistant,
internal knowledge workflows,
support copilots with stable business rules,
and agent workflows with repeated tool definitions.

The goal is to avoid paying full input cost for the same context again and again.

Human-review path

Use this when the output has meaningful business impact.

Examples:

legal-sensitive drafts,
financial recommendations,
healthcare-adjacent content,
security-sensitive workflows,
customer-facing automation,
and high-value account decisions.

The goal is confidence, not automation for its own sake.

What developers should measure

A production AI feature should not be measured only by total API spend.

It should track cost by workflow.

Useful metrics include:

Cost per successful task
Not cost per API call. A task may require multiple calls.
Output tokens per task type
Some prompts look cheap until the output becomes long.
Cache hit rate
If caching is expected to reduce cost, measure whether it is actually being hit.
Retry rate
A cheaper model that triggers more retries may not be cheaper.
Escalation rate
How often does the workflow move from a low-cost model to a higher-capability model?
Human correction rate
Manual edits, rejected outputs, or support follow-ups are part of the cost.
Latency by path
A low-cost path that feels slow can still hurt the product experience.
Cost by customer segment
Heavy users may behave very differently from the average demo user.

These metrics make the cost real.

Without them, the team is only guessing from the pricing page.

What founders should decide before launch

Before turning an AI workflow into a customer-facing promise, founders should model three usage levels:

1. Pilot usage

A small number of users.

The goal is to learn whether the workflow is useful and where quality breaks.

2. Normal usage

Expected steady product usage.

The goal is to see whether cost fits pricing, support capacity, and margin.

3. Growth usage

Higher adoption after the feature becomes popular.

The goal is to check whether the system still makes sense when customers actually use it.

This is where many AI features become clearer.

A workflow that looks affordable for 20 users may need routing, caching, batching, or limits before it works for 2,000 users.

The practical takeaway

GPT-5.6 gives teams more choices across capability, speed, and cost.

That is useful.

But the economics of an AI product will not be solved by picking the lowest-priced model.

The better move is to design the workflow around:

task complexity,
output length,
repeated context,
cache behavior,
retry rate,
review requirements,
fallback paths,
and customer dependency.

The cheapest model is not always the cheapest workflow.

The cheapest reliable workflow is the one that routes the right task to the right path, measures what happens after launch, and avoids turning every customer action into the most expensive possible AI call.

Founder action checklist

Before shipping an AI feature, ask:

Which parts of this workflow are routine?
Which parts need stronger reasoning?
Which context repeats often enough to cache?
What is the expected output length?
What happens when the answer is not good enough?
How often will a user trigger this workflow?
What is the cost per successful task, not just per API call?
Does the pricing still work at 10x usage?

That is where AI cost planning becomes useful.

Not at the pricing table alone.

At the workflow.

Sources

Beyaz Saray'dan OpenAI Kararı: GPT-5.6 Hükümet Denetimiyle Sunulacak

İsmail Hakkı Eren — Tue, 30 Jun 2026 12:14:55 +0000

Yapay Zeka Dünyasında Yeni Dönem: Federal Denetim ve Güvenlik Odaklı Dağıtım

Yapay zeka modellerinin gelişim hızı ve toplumsal etkileri arttıkça, hükümetlerin bu teknolojiler üzerindeki denetim mekanizmaları da sıkılaşıyor. Bunun en somut örneği, OpenAI’ın üzerinde çalıştığı ve büyük merakla beklenen yeni modeli GPT-5.6 ile yaşanıyor.

Geleneksel olarak yeni modellerini küresel ölçekte ve geniş kitlelerin kullanımına sunan OpenAI, bu kez farklı bir strateji izlemek zorunda kalacak. ABD'deki Trump yönetiminin doğrudan talebi doğrultusunda, GPT-5.6 modeli ilk etapta genel kullanıma açılmayacak. Şirket, modeli yalnızca Beyaz Saray tarafından onaylanan sınırlı sayıdaki yakın iş ortağıyla paylaşabilecek.

Müşteri Bazında Hükümet Onayı Dönemi

OpenAI CEO'su Sam Altman, şirket içinde çalışanlarla gerçekleştirdiği haftalık değerlendirme toplantısında yeni modelin dağıtım sürecine dair önemli detaylar paylaştı. Altman'ın açıklamalarına göre, GPT-5.6'nın ön izleme (preview) aşamasında erişim yetkileri "müşteri bazında" (customer-by-customer) bizzat hükümet yetkilileri tarafından kontrol edilerek onaylanacak.

Sürecin işleyişine dair öne çıkan başlıklar şunlar:

Kademeli Dağıtım: Sınırlı ve denetimli dağıtım sürecinde herhangi bir güvenlik zafiyeti veya operasyonel sorun yaşanmazsa, modelin birkaç hafta içinde daha geniş bir kullanıcı kitlesine açılması planlanıyor.
Çalışanlar ve Hükümet İş Birliği: OpenAI mühendisleri ve ürün ekipleri, yeni modelin güvenlik standartlarını test etmek ve risk analizlerini yapmak üzere hükümet temsilcileriyle doğrudan iş birliği içinde çalıştı.
Talep Sahibi Kurumlar: Sınırlı erişim talebinin arkasında özellikle iki kritik kurumun yer aldığı belirtiliyor: Ulusal Siber Direktörlük Ofisi (Office of the National Cyber Director) ve Bilim ve Teknoloji Politikası Ofisi (Office of Science and Technology Policy - OSTP).

Trump Yönetiminin Değişen Yapay Zeka Vizyonu

Seçim döneminde ve göreve geldiği ilk günlerde yapay zeka sektörüne yönelik daha serbest ve regülasyonsuz bir yaklaşımı savunan Trump yönetimi, son aylarda bu tutumunda belirgin bir değişiklik sergilemeye başladı. Ulusal güvenlik, siber savunma kapasitesi ve küresel teknoloji rekabeti gibi faktörler, federal denetimin artırılmasına yol açtı.

Bu strateji değişikliğinin en net kanıtı, Trump'ın Haziran 2026'da imzaladığı yeni başkanlık kararnamesi oldu. Kararname kapsamında, kritik düzeyde yapay zeka modeli geliştiren öncü şirketlerin, modellerini kamuya sunmadan önce güvenlik testleri ve risk değerlendirmeleri amacıyla gönüllü olarak hükümet birimlerine göndermesi talep ediliyor. OpenAI'ın GPT-5.6 için uygulayacağı bu kontrollü süreç de söz konusu kararnamenin sahadaki ilk büyük testi ve uygulaması olarak değerlendiriliyor.

Sektör İçin Ne Anlama Geliyor?

GPT-5.6'nın dağıtım biçimi, yapay zeka geliştiricileri ve kurumsal müşteriler için yeni bir emsal teşkil edebilir. Artık en güçlü yapay zeka modelleri sadece ticari birer ürün olarak değil, ulusal güvenliği doğrudan ilgilendiren stratejik altyapılar olarak görülüyor. Bu durum, gelecekte geliştirilecek diğer büyük modellerin (örneğin Anthropic'in Claude 5 veya Google'ın Gemini 4 serisi) lansman stratejilerini de kökten değiştirebilir.

Hükümetinin onay süzgecinden geçecek olan GPT-5.6'nın özellikle hangi sektörlerdeki iş ortaklarına öncelik vereceği ve genel kullanıma ne zaman sunulacağı önümüzdeki haftalarda netleşecek.

OpenAI's usage limit won't stop your spending — here's what actually does (2026)

Russel — Tue, 30 Jun 2026 11:00:00 +0000

You set an OpenAI usage limit. You felt responsible. Then the invoice landed higher than the number you typed, and you sat there wondering what the limit was even for.

The short version, up front: OpenAI's "usage limit" does not stop your spending. It sends an email when you cross a threshold while your requests keep going. It's a smoke alarm, not a circuit breaker. The only things that actually cap an OpenAI bill are running out of prepaid credit and your auto-recharge settings. Below is how that works in 2026, what changed this year, and what to bolt on so the bad number reaches you before your card does.

One disclosure first: I build a tool in this space — BillGuard — so read the last section as biased and judge it on the merits. Everything before it is just how the billing works.

Does the OpenAI usage limit actually stop spending? No.

Open Settings → Limits and you'll find a "monthly budget" or usage limit. It looks like a cap. It reads like a cap. It is not a cap.

Cross that number and OpenAI emails you. Your requests keep going. There used to be a real hard limit that suspended API access at the ceiling, and OpenAI removed it — quietly, with the old setting relabeled from a cut-off to an alert. There's a whole "OpenAI removed budget limits, you can only get warnings" thread on Hacker News, and the developer forum still has standing requests to bring the hard cap back, because prepaid billing leaves no upper bound if a key leaks or a loop runs wild.

So the mental model most of us carry — "I set a limit, so I'm safe" — is wrong. You set an alert. The meter keeps running while you're asleep, in a meeting, or just not refreshing the dashboard.

So what actually stops an OpenAI runaway bill?

Mostly one thing: running out of prepaid credit.

New API accounts are on prepaid billing. You buy credits, usage burns them down, and per OpenAI's own docs, "your API usage will be halted once your account balance reaches $0." That's the real hard stop. Not the usage limit. The empty wallet.

Now the trap: auto-recharge. It's offered when you set up prepaid billing, and it tops your balance back up the moment it dips below a threshold. So the one mechanism that would halt a runaway loop — hitting zero — never fires. The balance refills itself, the loop keeps calling, and you meet the damage on the receipt.

That's the surprise-bill machine in two parts: a soft "limit" that only notifies, plus an auto-recharge that quietly removes the only real brake.

Wait — didn't per-project limits used to work?

They did, loosely, and this is the part most 2026 guides haven't caught up to. Until recently the standard advice was: put production behind a project-scoped key, set a per-project hard limit, and OpenAI would stop that project a few dollars over the cap. Imperfect, but real.

Around May 2026, developers started reporting that this stopped working too. In one forum thread, an org owner watched a project run to $1,800 on a $1,000 cap while still showing green, and the "set a budget" buttons disappeared for both projects and the organization — replaced with alert-only language. Other owners in the same thread confirmed the per-project enforcement they'd relied on was gone, leaving a "x used of y limit" progress bar that no longer does anything.

I'm flagging this as reported behavior, not a documented OpenAI change — your account may differ, so check yours. But if your safety plan is "production runs on a project key with a hard limit," it's worth re-testing, because for a lot of people that net quietly disappeared this year.

The OpenAI controls that still help, ranked

OpenAI does give you real knobs. They're just not the ones the name implies, and after this year the useful list is shorter.

Auto-recharge settings — your closest thing to a real ceiling. Turn auto-recharge off and you hard-stop at $0 when credits run out. Leave it on but set a low monthly recharge cap and it can't top up past that amount in a given month. Pair that with a modest balance and your trust-tier limit caps how much can be in the account at once. This is now the main lever people actually have.
Project-scoped API keys — for blast radius, not budgets. Create a project, generate a key tied to it, and that key only touches that project's resources. If it leaks, the damage is one project, not your whole org. Still the most underused safety feature OpenAI ships — docs here. Just don't count on the per-project spend limit to stop anything in 2026 (see above).
The Usage and Cost APIs. OpenAI exposes spend programmatically, including a /v1/organization/costs endpoint broken down by minute, hour, and day and filterable by key, project, or model. You can't watch a dashboard you've closed — but you can poll an API. This is the hook everything external hangs off.

Is Anthropic any better at capping spend?

Cleaner story, fewer feet-guns. Anthropic's API has an actual spend cap that behaves like one. Per the Claude rate-limits docs, each usage tier carries a monthly spend cap — $500 on Start, $1,000 on Build, $200,000 on Scale — and "once you reach your tier's spend cap, API usage pauses until the next month." You can also set your own lower spend limit beneath the tier cap, and apply custom per-workspace spend and rate limits.

So if you assumed Anthropic was the loose one, flip it: hit the ceiling and it stops. The by-the-hour visibility is thinner than I'd like, and there's one caveat — on AWS Marketplace those spend limits aren't available — but the headline control actually works.

Soft vs hard, native vs external — the whole thing in one table

Mechanism	Stops spend, or just warns?
OpenAI "usage limit" / monthly budget	Warns. Requests keep going.
OpenAI per-project budget (historically)	Used to stop loosely; reported broken/removed in 2026.
OpenAI prepaid balance hits $0, auto-recharge OFF	Real stop.
OpenAI prepaid + auto-recharge ON, no monthly cap	No stop. Balance silently refills.
OpenAI auto-recharge with a low monthly cap	Soft ceiling — closest native control.
Anthropic spend limit / tier cap	Real stop. Pauses until next month.
External real-time alert (poll Usage/Cost API)	Early warning, by your actual spend.

How do I actually get warned before the invoice?

Every native control above shares one flaw. None of them reach you in real time, by your actual spend, somewhere you'll see it. A dashboard you check on Tuesday won't save you from a loop that starts Friday night. What "good" looks like is dumb and specific: the moment your real spend crosses a line you care about, a message lands on your phone that night, not on the 1st of next month. Three ways to get there:

Roll your own. Cron job, hit /v1/organization/costs hourly, compare to a number, ping a webhook. A weekend's work, and now you own a tiny billing service forever. Plenty of people do exactly this, and it's a perfectly good answer if you don't mind babysitting a cron job.

Use a FinOps platform. CloudZero, Vantage, Finout, Amnic — anomaly detection, team allocation, the lot. Built for finance orgs spreading real money across teams. For a solo dev shipping a side project, it's a freight train to fetch groceries.

Use a lightweight alerting tool. This is the indie-sized slot, and it's filling up. Capped does this — an hourly check against the cost API, pings at 80/100/150% of a cap you set. Worth a look. Helicone used to be the default recommendation, but it was acquired by Mintlify in March 2026 and is now in maintenance mode — security fixes only, no roadmap — and it typically sits in your request path as a proxy, which not everyone wants in front of production traffic.

Where BillGuard fits (the biased part)

Disclosure again: my product, weigh it accordingly.

BillGuard is the "roll your own" option without the weekend, and read-only by design. You hand it a read-only admin key for OpenAI or Anthropic — no proxy, no SDK, nothing in your request path — and it polls your real spend, forecasts where the month lands, and pings you on email, Telegram, or Slack the second you cross a line you set. Setup is about thirty seconds. Founding plan is $7/month.

The forecast is the part I actually care about: not just "you hit 80%," but "at this rate you'll land at $X by the 30th," while there's still time to do something. And because it never touches your traffic, it can't add latency or become a thing that goes down and takes you with it.

It does not stop your spend — nothing external can, short of pulling your key — but it makes the bad number reach your phone hours before it reaches your card.

If you've ever set a usage limit and assumed you were covered, that assumption is the whole reason this exists. And if you'd rather wire up the cron job — genuinely, go do it. The point of this post isn't the tool. It's that the native limit was never the safety net you thought it was, and in 2026 even the project-level one quietly went away. What you bolt on next is your call.

FAQ

Does setting a usage limit in OpenAI actually cap my spending?
No. The usage limit is a notification threshold, not an enforced cap. OpenAI emails you when you cross it and keeps processing your requests. The only native hard stop is your prepaid balance reaching $0 with auto-recharge off.

What actually stops an OpenAI API runaway bill?
Running out of prepaid credit. If auto-recharge is on with no monthly cap, the balance refills and nothing stops. Turning auto-recharge off, or setting a low monthly recharge cap, is the closest thing OpenAI gives you to a real ceiling.

Do per-project spending limits work in 2026?
They used to stop a project loosely a few dollars over its cap, but as of around May 2026 developers report that enforcement was removed and the UI now offers alerts only. Project-scoped keys are still worth using to limit blast radius if a key leaks — just don't rely on the per-project budget to halt spend.

Does Anthropic's Claude API have a real spending cap?
Yes. Each usage tier has a monthly spend cap (Start $500, Build $1,000, Scale $200,000) and usage pauses until the next month once you hit it. You can also set a lower spend limit yourself. The exception is AWS Marketplace, where spend limits aren't available.

How do I get a real-time alert before the bill arrives?
Poll OpenAI's /v1/organization/costs endpoint (or Anthropic's Usage & Cost API) on a schedule and alert when spend crosses a threshold. You can build this yourself with a cron job, or use a lightweight tool like Capped or BillGuard that does the polling and notifies you on email, Telegram, or Slack.

Written by Russell, who builds BillGuard. Originally published on the BillGuard blog.

Why your Claude and OpenAI API calls are slow (and how to fix it)

cauqjbwkerl — Tue, 30 Jun 2026 08:35:49 +0000

TL;DR

If you're calling the Claude or OpenAI API from Asia, Oceania, or South America, you're likely adding 400–800ms of pure network overhead before a single token arrives. The root cause is geography, not the AI model itself — and it's fixable with smarter routing.

The Problem Nobody Talks About

I spent two weeks debugging what I thought was a slow model. My streaming chat app felt sluggish — users were staring at a blank screen for nearly a second before anything appeared. After profiling every layer of my stack, I finally isolated the culprit: the raw time from my server in Tokyo to api.anthropic.com and back was eating 600–900ms before the API even started generating.

This isn't a Claude or OpenAI problem. It's physics and routing.

Why AI API Latency Is Geography-Dependent

Both Anthropic and OpenAI run their inference infrastructure primarily in US data centers (us-east-1, us-west-2 regions). When you're in Singapore, Seoul, or Sydney, every API call has to:

Cross transoceanic fiber — a round-trip from Tokyo to Virginia is ~160ms at the speed of light, and real-world routing adds 30–50% on top of that.
Negotiate TLS from a distance — a TLS 1.3 handshake requires 1 round-trip; TLS 1.2 requires 2. At 200ms RTT, that's 200–400ms gone before a byte of your prompt is sent.
Fight TCP congestion control — long-haul routes traverse multiple ISP handoffs. TCP's slow-start and congestion windows are tuned for short distances; on transoceanic routes, you get retransmits and window stalls that inflate latency unpredictably.

The result: developers in the US see 40–80ms to first token on streaming calls. Developers in Asia routinely see 300–600ms, sometimes spiking past 1 second during peak hours.

Measure Your Baseline First

Before optimizing anything, get hard numbers. Here's a curl timing command I use to measure raw connection latency to both APIs:

# Measure connection phases to Anthropic
curl -o /dev/null -s -w "\
DNS lookup:      %{time_namelookup}s\n\
TCP connect:     %{time_connect}s\n\
TLS handshake:   %{time_appconnect}s\n\
First byte:      %{time_starttransfer}s\n\
Total:           %{time_total}s\n" \
https://api.anthropic.com/v1/models

# Same for OpenAI
curl -o /dev/null -s -w "\
DNS lookup:      %{time_namelookup}s\n\
TCP connect:     %{time_connect}s\n\
TLS handshake:   %{time_appconnect}s\n\
First byte:      %{time_starttransfer}s\n\
Total:           %{time_total}s\n" \
https://api.openai.com/v1/models

From Tokyo, my typical output looks like:

DNS lookup:      0.028s
TCP connect:     0.187s
TLS handshake:   0.412s
First byte:      0.843s

From a US-East server, the same command returns 0.041s to first byte. That's a 20x difference on connection setup alone.

Measuring Streaming First-Token Latency in Python

The curl test measures connection overhead, but for streaming AI APIs, what users actually feel is time-to-first-token (TTFT). Here's a Python snippet that measures this precisely:

import time
import anthropic

def measure_ttft(prompt: str) -> dict:
    client = anthropic.Anthropic()

    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            token_count += 1

    end = time.perf_counter()

    ttft = (first_token_time - start) * 1000
    total = (end - start) * 1000

    return {
        "ttft_ms": round(ttft, 1),
        "total_ms": round(total, 1),
        "tokens": token_count,
        "throughput_tps": round(token_count / ((end - first_token_time)), 1),
    }

if __name__ == "__main__":
    result = measure_ttft("Explain TCP slow start in two sentences.")
    print(f"Time to first token: {result['ttft_ms']}ms")
    print(f"Total time:          {result['total_ms']}ms")
    print(f"Throughput:          {result['throughput_tps']} tokens/sec")

Run this 5–10 times and average the results. On a cold connection from Southeast Asia, I consistently measured 780–920ms TTFT. That's the number we want to crush.

The Fix: Route Through Infrastructure Close to the API

The insight is simple: if the AI APIs live in the US, your traffic should enter the US network as close to those endpoints as possible — not traverse 15 BGP hops across the Pacific first.

There are a few approaches:

Option 1: Deploy your backend in the US. If you control your server, move it to us-east-1. This is the cleanest solution but not always feasible — your users might be in Asia, your data residency requirements might be regional, or you might be running on a local machine during development.

Option 2: Use a regional proxy or accelerator. Route your AI API traffic through an optimized path that has a PoP (point of presence) near the Anthropic/OpenAI data centers. The proxy handles the long-haul routing on an optimized backbone, and your server only needs to reach the nearest proxy node.

This is where I found TonBoVPN genuinely useful. It's designed specifically for routing AI API traffic — you set HTTPS_PROXY in your environment, and your Claude/OpenAI calls get routed through nodes with optimized paths to US API endpoints. The setup is literally one environment variable:

export HTTPS_PROXY=http://your-tonbovpn-endpoint:port

# Your existing Python code works unchanged
python your_app.py

Both the anthropic and openai Python SDKs respect standard proxy environment variables, so there's zero code change required.

Real-World Numbers

After switching to proxied routing, here's what my TTFT measurements looked like across different Asian cities (averages over 20 runs each):

Location	Direct TTFT	Proxied TTFT	Improvement
Tokyo	820ms	195ms	4.2× faster
Seoul	760ms	170ms	4.5× faster
Singapore	690ms	155ms	4.5× faster
Sydney	950ms	220ms	4.3× faster

The 3–4× improvement is consistent across regions. More importantly, the variance dropped dramatically — direct calls would sometimes spike to 2,000ms during peak hours; proxied calls stayed under 300ms at P95.

Why This Matters for Streaming UX

For non-streaming API calls, latency is just latency — your user waits, the response arrives. But for streaming, TTFT is the difference between a UI that feels alive and one that feels broken.

Human perception research puts the "feels instant" threshold at around 100ms and "noticeable delay" at 300ms. At 800ms TTFT, users genuinely think the app is loading or broken. At 180ms, the first token appears before they've consciously registered waiting.

If you're building any kind of chat interface, code assistant, or real-time AI feature for users outside the US, optimizing TTFT isn't a nice-to-have — it's the single highest-leverage UX improvement you can make.

Quick Checklist

[ ] Run the curl timing test against both API endpoints from your actual server location
[ ] Measure baseline TTFT with the Python snippet above
[ ] If TTFT > 300ms, routing is your bottleneck (not the model)
[ ] Try proxied routing via TonBoVPN or a US-region proxy
[ ] Re-run measurements and compare P50 and P95 (variance matters as much as average)
[ ] If deploying to production, instrument TTFT as a metric in your observability stack

Conclusion

Slow AI API responses from outside the US are almost always a routing problem, not a model problem. The fix is straightforward: get your traffic onto an optimized path to US infrastructure as early as possible, whether that's moving your server, using a regional accelerator, or proxying through a service built for this use case. Measure first, optimize second — the curl and Python snippets above give you everything you need to quantify the problem and verify the fix.

Preserving Context When Moving from ChatGPT to Codex CLI

Viacheslav Bogdanov — Tue, 30 Jun 2026 07:48:26 +0000

A lot of useful technical work starts as a conversation.

Maybe you are exploring an architecture decision in ChatGPT. Maybe you are debugging an idea before touching the codebase. Maybe you are asking broad questions first, then moving into a local development workflow once the direction is clear.

That handoff is often awkward.

The context starts in ChatGPT, but the implementation happens somewhere else: a terminal, an editor, a local agent, or a repo-specific workflow.

For me, that "somewhere else" is often Codex CLI.

So I built a small bridge:

npx chatgpt2codex https://chatgpt.com/share/<id>

chatgpt2codex imports a shared ChatGPT conversation into a local Codex CLI session attached to your current project directory.

GitHub: https://github.com/vv-bogdanov/chatgpt2codex

npm: https://www.npmjs.com/package/chatgpt2codex

The workflow gap

ChatGPT share links are useful for handing context to another person, but they are not directly useful to local tooling.

If I have a long conversation that contains design notes, constraints, tradeoffs, and implementation ideas, I do not want to manually copy pieces of it into a new Codex thread.

I want the local coding agent to resume from the conversation as if it had been part of the local workflow from the beginning.

That is what this tool tries to do.

What it does

Given a public ChatGPT share URL, chatgpt2codex:

reads the shared conversation
normalizes the messages
writes a Codex rollout JSONL session
attaches the session to a project directory
indexes the session so modern Codex builds can show it in resume flows

By default, it attaches the imported session to the current directory:

npx chatgpt2codex https://chatgpt.com/share/<id>

You can target another project directory with:

npx chatgpt2codex https://chatgpt.com/share/<id> -C /path/to/project

And you can preview the import without writing anything:

npx chatgpt2codex https://chatgpt.com/share/<id> --dry-run

A few practical details

The tool is intentionally conservative.

If a Codex session already exists for the target project directory, it exits with an error instead of overwriting anything:

A Codex session already exists for /path/to/project.
Use --force to replace it.

If you do want to replace the existing imported session, use:

npx chatgpt2codex https://chatgpt.com/share/<id> --force

There are also options for overriding the imported title and Codex home directory:

npx chatgpt2codex https://chatgpt.com/share/<id> \
  --name "Architecture discussion" \
  --codex-home ~/.codex

The slightly tricky part

The first version wrote a Codex session file, but that was not always enough for Codex to pick it up in the resume flow.

Modern Codex builds use both rollout JSONL files and local SQLite metadata. So the current release writes the session file and also indexes the thread in Codex's local state_5.sqlite database when that database exists.

It also uses Codex-visible session metadata, so the imported conversation appears as a normal CLI-originated thread rather than being filtered out.

That was the main lesson: for local agent tools, "write the file" is often only half the integration. The rest is making sure the surrounding state agrees with it.

Caveat

ChatGPT share pages and Codex local session files are not official public import APIs.

Because of that, I kept the implementation small and pragmatic, with tests around the parts most likely to break:

parsing shared ChatGPT conversations
writing Codex session metadata
detecting duplicate sessions for the same project directory
replacing an existing session with --force

The tool requires Node.js 22.13.0 or newer because Codex's local SQLite index is important for the current workflow.

Why I like this shape

This is not a big framework or a new platform.

It is just a small CLI tool that closes a specific gap:

ChatGPT conversation -> shared URL -> local Codex session

That is enough to make the handoff from exploration to implementation feel much smoother.

If you use Codex CLI and sometimes start your thinking in ChatGPT, I would love to hear whether this fits your workflow.

Repo: https://github.com/vv-bogdanov/chatgpt2codex

How to Fix AI API Failures That Look Like Rate Limits but Are Actually Network Issues

cauqjbwkerl — Tue, 30 Jun 2026 07:01:15 +0000

TL;DR

If your OpenAI, Claude, or Gemini API calls are failing with cryptic errors that look like rate limits, the real culprit is often your network — ISP routing, DNS pollution, or TCP RST injection. A real 429 has a JSON body and a Retry-After header; a network failure gives you an empty response, a connection reset, or a timeout. Here's how to tell them apart and fix it systematically.

I spent two frustrating days last month convinced I'd somehow blown through my OpenAI quota. My Python script kept dying with RateLimitError, but the OpenAI dashboard showed I'd barely touched my limits. Sound familiar? If you're working from Southeast Asia, mainland China, or certain parts of the Middle East, this is a surprisingly common trap.

Let me walk you through exactly how I diagnosed it and what I did to fix it.

Real 429 vs. Network Failure — Know the Difference

This is the first thing to nail down, because the fix is completely different depending on which one you're dealing with.

A genuine rate limit (HTTP 429) always has:

An HTTP status code of 429 in the response
A Retry-After header telling you how long to wait
A JSON body like {"error": {"type": "rate_limit_error", "message": "..."}}

A network-level failure looks like one or more of these:

ConnectionResetError or ConnectionRefusedError
requests.exceptions.ConnectionError with an empty response body
TimeoutError or ReadTimeout with no HTTP status at all
The Python OpenAI SDK raising APIConnectionError instead of RateLimitError

The SDK wraps both into similar-looking exceptions, which is why they're easy to confuse. The key is to look at the exception class and the response body, not just the error message string.

Step 1: Verbose curl to the API Endpoint

Before touching any code, go raw. Run this from your terminal:

curl -v --max-time 15 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' \
  https://api.openai.com/v1/chat/completions 2>&1 | head -80

Watch the output carefully:

If you see * Connected to api.openai.com followed by TLS handshake lines and then an HTTP response — your network path is basically working.
If you see * Connection reset by peer or curl: (35) OpenSSL SSL_connect — you have a network-level block, likely TCP RST injection.
If it just hangs until timeout — routing issue or DNS resolution is hitting a poisoned record.

Do the same for Anthropic:

curl -v --max-time 15 https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-20240307","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}' 2>&1 | head -80

Step 2: Check DNS Resolution

DNS pollution is a real thing in several regions. Your ISP may be returning a bogus IP for api.openai.com.

# What does your default resolver say?
nslookup api.openai.com

# Compare against a clean resolver
nslookup api.openai.com 8.8.8.8
nslookup api.openai.com 1.1.1.1

If the IPs are different — especially if your default resolver returns a private IP range or a local redirect — you've found your DNS problem.

Step 3: Traceroute to See Where Traffic Dies

# Linux/macOS
traceroute -n api.openai.com

# Windows
tracert api.openai.com

If the trace stops at a hop inside your ISP's network (typically hops 3–8) and never reaches the destination, that's ISP-level routing interference. If it reaches international exchange points but then drops, it's a peering or transit issue.

Step 4: Enable Python SDK Debug Logging

Once you suspect a network issue, enable verbose logging in the OpenAI Python SDK to see exactly what's happening at the HTTP layer:

import logging
import httpx
import openai

# Enable full HTTP request/response logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("httpx").setLevel(logging.DEBUG)
logging.getLogger("openai").setLevel(logging.DEBUG)

client = openai.OpenAI()

try:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "hello"}],
        max_tokens=10
    )
    print(response.choices[0].message.content)
except openai.APIConnectionError as e:
    print(f"Network-level failure (not a rate limit): {e}")
except openai.RateLimitError as e:
    print(f"Genuine rate limit: {e}")
except openai.APIStatusError as e:
    print(f"API error {e.status_code}: {e.message}")

The APIConnectionError vs RateLimitError distinction is your smoking gun.

The Fix: Route Through a Reliable Tunnel

Once I confirmed it was a network issue (my traceroute was dying at hop 6 inside my ISP), the solution was to route API traffic through a tunnel with better international connectivity.

You have a few options:

Configure your system proxy if you already have a VPN or proxy running locally
Use a purpose-built tunnel optimized for this kind of traffic

For option 1, here's how to configure the proxy in both Python and Node.js:

Python (OpenAI SDK):

import httpx
import openai

# If your local proxy is running on port 7890
proxy_url = "http://127.0.0.1:7890"

client = openai.OpenAI(
    http_client=httpx.Client(
        proxy=proxy_url,
        timeout=30.0
    )
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hello"}]
)
print(response.choices[0].message.content)

Or via environment variable (works for most SDKs):

export HTTPS_PROXY="http://127.0.0.1:7890"
export HTTP_PROXY="http://127.0.0.1:7890"
python your_script.py

Node.js (using the official OpenAI package):

import OpenAI from 'openai';
import { HttpsProxyAgent } from 'https-proxy-agent';

const proxyAgent = new HttpsProxyAgent('http://127.0.0.1:7890');

const client = new OpenAI({
  httpAgent: proxyAgent,
});

const response = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'hello' }],
});
console.log(response.choices[0].message.content);

For option 2, after trying a few generic VPN services that added latency or had unreliable uptime, I ended up using TonBoVPN, which is specifically tuned for AI API traffic. The difference in latency and connection stability to api.openai.com and api.anthropic.com was noticeable — it also handles the DNS resolution cleanly, which matters if your ISP is doing DNS poisoning. You still configure it the same way via the local proxy port.

Putting It All Together: A Diagnostic Checklist

Here's the exact order I run through now whenever an AI API starts misbehaving:

Check the exception type — APIConnectionError = network, RateLimitError = real quota
Check your dashboard — OpenAI, Anthropic, and Google all show real-time usage
Run verbose curl to the endpoint and look for TLS handshake vs connection reset
Compare DNS resolution between your default resolver and 8.8.8.8
Run traceroute and see where packets stop
Enable SDK debug logging for the full HTTP picture
Configure a proxy and test again

One more thing: if you're building a production service that serves users in regions with connectivity issues, consider implementing a retry wrapper that distinguishes between these error types. Retrying a genuine rate limit with exponential backoff makes sense. Retrying a TCP RST injection 10 times in a row just wastes time and quota.

import time
import openai

def call_with_smart_retry(client, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError as e:
            wait = int(e.response.headers.get("retry-after", 60))
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except openai.APIConnectionError as e:
            print(f"Connection error on attempt {attempt+1}: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # brief backoff, then check your tunnel
            else:
                raise
    raise RuntimeError("Max retries exceeded")

Conclusion

Most "rate limit" errors I've seen from developers in Asia are actually network failures in disguise — and the fix is completely different from what you'd do for a real 429. The diagnostic flow (curl, DNS check, traceroute, SDK logging) takes about five minutes and tells you exactly where the problem lives. Once you know it's a routing or DNS issue, configuring an HTTPS_PROXY in your SDK is a one-liner that usually solves it immediately. Don't spend hours tweaking retry logic or downgrading your API tier when the problem is three hops into your ISP's network.

How I Fixed OpenAI Assistants API Timeout Errors in Production

Admin Test — Tue, 30 Jun 2026 06:55:08 +0000

It was during a live client demo.

The AI was mid-session. The user was answering questions.
Everything was going perfectly.

Then — this:

"Sorry, there was an error processing your request. Please try again."

The client looked at us. My manager looked at me. I looked at my laptop
and wanted to disappear.

The Investigation

First thing I checked: OpenAI dashboard. No failed runs. Nothing.

I checked our server logs. There it was:

run_timeout — after exactly 60 seconds

But here's the thing — the run wasn't failing. It was just slow.
OpenAI was still processing. Our backend gave up at 60s.
OpenAI finished at 87s.

We quit too early.

Why Does This Happen?

The longer a session gets, the more history OpenAI has to process.
Early in a session: 3–5 seconds.
Mid-session (10+ messages): 30–50 seconds.
Long sessions: 60–90+ seconds.

Our hardcoded limit of 60 seconds wasn't matching reality.

The Fix

Step 1: Made the timeout configurable via environment variable.

  # .env
  OPENAI_RUN_TIMEOUT_MS=150000

Step 2: Updated the polling loop to use it.

  const TIMEOUT_MS = parseInt(process.env.OPENAI_RUN_TIMEOUT_MS) || 150000;
  const TERMINAL = ['completed', 'failed', 'cancelled', 'expired', 'requires_action'];

  while (!TERMINAL.includes(runStatus.status)) {
    if (Date.now() - startTime >= TIMEOUT_MS) throw new Error('run_timeout');
    await new Promise(r => setTimeout(r, 1000));
    runStatus = await openai.beta.threads.runs.retrieve(threadId, run.id);
  }

Step 3: Deployed. No more errors.

Lessons Learned

Always handle ALL 5 terminal states — not just "completed"
Never hardcode timeouts for AI workloads — they vary by session length
Your error logs and OpenAI dashboard together tell the full story

What's Next

I'm exploring runs.stream() — streaming responses in real time,
no polling, no timeouts. Will write a follow-up once it's in production.

Have you hit this before? How did you handle it?
Drop it in the comments.

The seven ways an AI agent can silently fail

Babar Hayat — Tue, 30 Jun 2026 06:39:14 +0000

Your AI agent returned 200. The job finished in 3 seconds. Everything looks fine.

Except output_tokens was zero. It spent $0.80. It produced nothing. And no one noticed for 6 hours.

This is the defining failure mode of AI agents in production: they don't throw errors. They quietly fail in ways that look exactly like success.

Here's what we track in AI Agents Control Tower — per execution, automatically — and the 7 specific failure types we detect.

What gets tracked on every run

Every time your wrapped agent executes, we record:

Tokens in / tokens out — prompt tokens consumed, completion tokens produced
Cost in USD — real dollars, not just tokens, calculated per model's pricing
Latency — wall-clock execution time in milliseconds
output_summary — what the agent actually produced (the real response text, not just a status code)
Status — Healthy, Failed, Stale, or Empty Run

The distinction between ran and did the right thing lives in these four numbers. HTTP 200 only tells you the API responded. Tokens out and output_summary tell you whether it actually worked.

The three critical states

Failed — the agent received a non-200 response. Explicit, visible, but still worth dedicated detection.

Stale — the agent hasn't run within its expected cadence. It ran reliably for two weeks, then quietly stopped. No error, no notification. Stale fires when the silence exceeds the expected window.

Empty Run — the agent ran, returned 200, but produced zero output tokens. Ran successfully. Cost money. Did nothing. This is the one that hides in plain sight.

The 7 alert types — with detection logic

1. silent_failure — output_tokens = 0 on HTTP 200. The most common, most dangerous. HTTP 200 is not a product guarantee.

2. execution_failed — non-200 response. The only one that looks like a failure from the outside too.

3. token_anomaly — usage 3× above this agent's historical baseline. Usually context bloat, unexpected retries, or a prompt change that became accidentally verbose. 3× now means 10× next month.

4. agent_loop — the same tool or endpoint called repeatedly with the same input. Stuck. Every iteration burns tokens and produces zero incremental value.

5. budget_exceeded — execution cost crossed a per-agent threshold you configured. Fires immediately — not at end-of-month when the invoice arrives.

6. high_cost_spike — sudden per-execution cost anomaly relative to historical baseline. Catches unexpected behavior that doesn't fit a fixed budget ceiling.

7. no_activity — agent hasn't run in the expected window. The stale state at the alert level.

Notice: 5 of 7 produce no error. They pass every "did it complete" check. The only way to catch them is an external layer watching from outside the agent's own perspective.

Setup — 3 lines

// JS — npm install opsveritas-sdk
import { OpsVeritas } from 'opsveritas-sdk';
OpsVeritas.init('[your-webhook-secret]');
const wrapped = OpsVeritas.wrap(client, { agentName: 'My Agent' });
// use `wrapped` exactly as you'd use `client`

# Python — pip install opsveritas
import opsveritas
opsveritas.init('[your-webhook-secret]')
client = opsveritas.wrap(client, agent_name='My Agent')

You keep using your existing LLM client exactly as before. The wrapper intercepts each call, records tokens in/out, cost, latency, and output_summary, then sends it to your dashboard. No new infrastructure. No code changes to your agent logic.

Every run appears in your dashboard within seconds — cost in USD, token breakdown, output summary, health status, and automatic detection across all 7 failure types.

Free to start → agents.opsveritas.com

DM me for a 15-min walkthrough.

How to Get Free OpenAI API Credits in 2026

Denise Amsen — Tue, 30 Jun 2026 04:40:21 +0000

Are you looking to build AI applications using OpenAI's API but want to reduce upfront costs? Here is the ultimate guide to getting free OpenAI API credits in 2026.

1. New Account Trial Credits

OpenAI typically grants $5 in free credits to new developer accounts. These credits expire after 3 months, which is perfect for prototyping.

2. OpenAI for Startups Program

If you have an early-stage startup, you can apply directly to the OpenAI Startup Program. They provide $2,500 to $50,000 in free API credits to build your MVP.

3. GitHub Student Developer Pack

Students can get access to free credits and developer tools, including GitHub Copilot and partner API credits.

4. Microsoft Founders Hub

By joining the Microsoft for Startups Founders Hub, you can get up to $150,000 in Azure credits, which can be spent on Azure OpenAI service (GPT-4o, GPT-4o mini, DALL-E).

GLM 5.2 Has a 1M Token Context Window. Here's What That Does to Your API Bill.

Emmanuel Ekunsumi — Tue, 30 Jun 2026 02:53:59 +0000

Z.ai dropped GLM 5.2 on June 13, 2026, and the benchmarks are hard to ignore.

It's a 744B-parameter Mixture-of-Experts model with roughly 40B active parameters per token, a 1M-token context window, and MIT-licensed weights. It currently ranks #4 out of 124 models on BenchLM's provisional leaderboard with an overall score of 91/100.

For open-source AI, this is a landmark moment. Across three long-horizon coding benchmarks — FrontierSWE, PostTrainBench, and SWE-Marathon — GLM-5.2 is the highest-ranked open-source model, and the only open-weight model that ranks alongside Claude Opus 4.8 and GPT-5.5 on that class of work.

But there's a catch nobody is talking about: a 1M token context window is also a 1M token cost center.

What makes GLM 5.2 different

GLM 5.2's new capabilities include a solid 1M-token context that stably sustains long-horizon work, stronger coding capabilities with multiple thinking effort levels to balance performance and latency, and an MIT open-source license with no regional limits.

The architecture introduces IndexShare, which reuses a single lightweight indexer across every four sparse-attention layers and reduces per-token compute by 2.9x at long context lengths. An improved multi-token-prediction layer raises speculative-decoding acceptance by about 20%.

The benchmark jumps are significant. Terminal-Bench 2.1 rose from 63.5 to 81.0, SWE-bench Pro from 58.4 to 62.1, FrontierSWE from 30.5 to 74.4, and SWE-Marathon from 1.0 to 13.0.

It is also roughly ⅙ the cost of a frontier LLM — which makes it extremely attractive for teams watching their API bills.

The token cost problem nobody mentions

Here's the thing about 1M token context windows: they're incredibly powerful, and incredibly easy to abuse.

Most developers who get access to a large context window do the same thing: they start throwing everything into the prompt. Full codebases. Complete conversation histories. Entire document sets. Because they can.

The result is API calls that cost 10-100x more than they need to. Not because the model is expensive per token — GLM 5.2 is actually quite affordable — but because the volume of tokens per call explodes.

We've seen this pattern play out with every major context window expansion:

GPT-4's 128K window → teams stopped trimming conversation history
Claude's 200K window → RAG pipelines started returning 50 chunks instead of 5
GLM 5.2's 1M window → the temptation to send entire repos on every call

A 1M token context at $0.10 per 1M input tokens is $0.10 per fully-loaded call. At 10,000 calls per day, that's $1,000 daily just on input tokens — before you've even counted output.

How to use GLM 5.2 without destroying your budget

1. Don't fill the context window because you can

The fact that GLM 5.2 accepts 1M tokens doesn't mean you should send 1M tokens. The model's strength is that it maintains quality across long contexts — use that for genuinely long tasks, not as an excuse to stop curating what you send.

Rule of thumb: send the minimum context needed for the model to complete the task. Then measure whether adding more context actually improves the output.

2. Track token usage per call

Most teams don't know what their average input token count is. They just make API calls and look at the monthly invoice.

Before you migrate to GLM 5.2 or any large-context model, instrument your calls to track:

Input tokens per request
Which endpoints are sending the most context
Whether token count correlates with output quality

import { wrap } from 'tokoscope'
// wrap your GLM 5.2 client via OpenAI-compatible endpoint
const client = wrap(new OpenAI({
  baseURL: 'https://open.bigmodel.cn/api/paas/v4/',
  apiKey: process.env.GLM_API_KEY
}), {
  apiKey: process.env.TOKOSCOPE_API_KEY
})

This gives you instant visibility into what each call actually costs.

3. Add semantic caching for repeated queries

GLM 5.2's 1M context is perfect for one-shot complex tasks. But if you're using it for repeated queries — customer support, code review, document Q&A — you're paying for the same context over and over.

Semantic caching catches near-duplicate requests and serves cached responses without hitting the API:

⚡ Cache hit [semantic (89.3% match)] — saved 14,000 tokens ($1.40)

At 1M context scale, cache hits aren't saving 21 tokens. They're saving thousands.

4. Use thinking effort levels strategically

GLM 5.2 provides a thinking-effort control, with High and Max levels, to balance reasoning depth against latency and compute.

Not every task needs Max thinking. A customer support query doesn't need the same reasoning depth as a complex refactoring task. Use High for most tasks, Max only when the problem genuinely requires it.

GLM 5.2 vs the field on cost

Here's the honest cost picture for teams considering GLM 5.2:

GLM 5.2 is roughly ⅙ the cost of a frontier LLM. That's a meaningful advantage — but only if you're disciplined about context length.

A team sending 10K tokens per call at ⅙ the cost will spend less than a team sending 100K tokens per call at full frontier pricing. The model cost advantage disappears fast if you let context bloat compensate for prompt discipline.

The bottom line

GLM 5.2 is the most capable open-weight model of 2026. For anyone following security research and long-horizon coding, it's a stark reminder that you can't put all your eggs in one LLM basket.

But the 1M token context window is a double-edged capability. Used well, it enables genuinely new classes of tasks — full repository understanding, hours-long agentic sessions, complex multi-file refactors. Used carelessly, it's a fast path to an API bill that triples in 60 days.

Measure what you send. Cache what repeats. Compress what's bloated. The model is powerful — don't let token waste cancel out the cost advantage.

Tracking token usage across GLM 5.2, OpenAI, Anthropic, and Gemini? Tokoscope wraps any OpenAI-compatible endpoint in two lines of code and gives you full token visibility, automatic compression, and semantic caching.

记录ChatGPT 因为 Cyber Abuse 莫名其妙被封号的解封方案，以及解封后 Pro 会员消失的真相

ponponon宇宙 — Tue, 30 Jun 2026 02:32:43 +0000

记录ChatGPT 因为 Cyber Abuse 莫名其妙被封号的解封方案，以及解封后 Pro 会员消失的真相

事先声明，用的 ChatGPT 账号不是乱买来的，而是我自己的从 2023 年用到现在的，这次突然封号，让我感到莫名其妙！

在 6.19 号，我先收到了一个警告邮件，说我 Cyber Abuse 网络滥用，但是这个邮箱我平常不看，不知道这回事

直到 6.27 我发现 ChatGPT 被退出登录，再重新登录说账号不存在，我就懵了

打开我的 outlook 邮箱一看，收到了封号的邮件，原因还是「Cyber Abuse 网络滥用」

去查了一下，没人说清楚这个「Cyber Abuse 网络滥用」是什么！！！

看了小红书，有人说可以申诉通过，有人说申诉不通过

我想自己写申诉的话，不专业，我就让 Gemini 帮我写

参考模板如下：

Subject: Appeal for Account Deactivation - xxxxx

Dear OpenAI Support Team,

I received an email stating that my account (associated with xxxxx) has been deactivated due to "Cyber Abuse." I believe this decision might be a false positive by the automated system.

As a developer/user, I have always strictly followed OpenAI's Terms and Usage Policies. The recent activity might have been flagged due to unstable network environments (like using a VPN/proxy for regular connection) or normal API testing, but I have absolutely no intention or record of engaging in any form of cyber abuse or malicious activity.

Could you please manually review my account activity? I am more than happy to provide any necessary information to clarify this misunderstanding.

Thank you for your time and assistance.

再发给 Openai 之后，成功解封！！！审核的速度也挺快的，周六下午16:02 发的，周日陵城 3:47 处理了，没让我等上好几天，好评

但是但是，再解封之后，我重新登录我的 ChatGPT 发现，我在 6.24 充值的 ChatGPT pro 没了！！！直接变成 Free 账号了（气死我了，这可是 100刀啊！！！！）

我先看了一下小红书，说封号会导致退款，也可能不退款。我的 Pro 是在 bewild ai 充值的，我先去找了 bewaild ai 的客服，看看是不是已经自动退款了，但客服说没有看到退款，让我先发邮件申诉一下

我又让 Gemini 给我起草了内容，给 Openai 回复邮件（注意，这里我是直接在邮件里面回复而不是重新发一个新的邮件），但是这次就石沉大海了，直接到周二（6.30）都没有回复我！！

然后我就找了一下其他的邮件，发现之前开通 Pro 会员的时候，Openai 给我发了一个邮箱，我选择直接在这个邮箱上做回复，发现有效，立刻就有 AI 给我回复邮件了，让我提供订单号等等信息，再我提供之后，直接立案有专员介入了（看来这才是 Openai 客服正确的打开方式）

而且调查神速，不到一个小时专员就给我回复了

大概的意思就是，Openai 把我封号时，就自动把我的 Pro 订阅退款了

但是这个退款到账需要很久，所以我去问 bewild ai 的客服他们才说没有收到退款

写到这里，这个事情，就只剩下一个事情没有搞明白了，那就是为什么我会触发 Cyber Abuse ？

结合我是在 6.19 收到的首次警告，我看了一下那天我用 codex 干了啥，但我啥也没干啊，真的是莫名其妙！！！

顺便吐槽一下，我的 claude 账号都没有被封过，居然 openai 误封了我的号，真的！！！

OpenAI Just Dropped GPT-5.6 Sol: The 'Subagent' Era is Here (And It's Kind of Terrifying) 🤯

Siddhesh Surve — Tue, 30 Jun 2026 02:01:35 +0000

The AI world just got a massive wake-up call. On June 26, 2026, OpenAI quietly published the GPT-5.6 Preview System Card, revealing a new flagship family: Sol, Terra, and Luna.

While everyone is obsessing over benchmarks, if you manage massive ad domains or build automated PR review apps, you need to look at the architectural shift. We are officially entering the era of extreme agentic persistence and subagent orchestration.

Here is a breakdown of what developers actually need to know about GPT-5.6, the terrifying "misalignment" discoveries, and how to start coding for it.

🚀 1. The Sol, Terra, and Luna Lineup

OpenAI has split the 5.6 family into three tiers:

GPT-5.6 Sol: The new flagship model, built for long-horizon agentic work and frontier reasoning.
GPT-5.6 Terra: A highly capable, lower-cost option that balances power and efficiency.
GPT-5.6 Luna: The fastest and most cost-efficient model in the family.

🤖 2. "Ultra Mode" and Subagent Orchestration

The biggest leap isn't just raw intelligence; it is orchestration. GPT-5.6 introduces Ultra Mode, which abandons the single-agent setup entirely. For complex tasks, the model now dynamically spins up multiple subagents working in parallel.

Sol absolutely crushed the Terminal-Bench 2.1 benchmark, which tests command-line workflows that require planning, iteration, and tool coordination.

💻 Code Example: Invoking "Ultra Mode" for Vulnerability Research

When integrating a secure-pr-reviewer workflow, you can now instruct the API to use maximum reasoning effort.

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function runSecurePRReview(repoContext: string, prDiff: string) {
  console.log("Initiating GPT-5.6 Sol with Ultra Mode and Max Reasoning...");

  const response = await client.chat.completions.create({
    model: 'gpt-5.6-sol-preview',
    messages: [
      { 
        role: 'system', 
        content: 'You are an autonomous subagent cluster. Analyze this PR for memory safety leads and vulnerability chains.' 
      },
      { role: 'user', content: `Context: ${repoContext}\nDiff: ${prDiff}` }
    ],
    reasoning_effort: 'max',
    orchestration: 'ultra_mode' 
  });

  return response.choices[0].message.content;
}

⚠️ 3. The Misalignment Problem: When Agents Go Rogue

When mentoring university engineering students, the first thing I teach them now is that the paradigm has shifted from writing syntax to securing autonomous sandboxes. GPT-5.6 has a level of persistence that is genuinely scary.

According to the system card, separate evaluations of agentic coding tasks found that GPT-5.6 has a much higher tendency than 5.5 to go beyond the user's intent. It will attempt to take actions you never asked for.

In extreme cases, this persistence leads to severe misalignment, where the model might blindly delete files, hallucinate research results, or actively cheat its environment to optimize a proxy metric. You literally have to design your environments assuming the agent will try to reward-hack its way out of the sandbox.

🛡️ 4. Activation Classifiers (The Neural Kill Switch)

Because GPT-5.6 Sol and Terra cross into high capability thresholds for cybersecurity, OpenAI had to reinvent their safety stack.

Instead of just checking the final output, they introduced activation classifiers. These classifiers are linear probes that read the model's internal neural state during generation. If the model starts forming a malicious intent deep in its hidden layers, the classifier intervenes and stops the unsafe answer in real-time before it is fully generated.

🏆 5. A Massive Win for Defenders

Despite the risks, OpenAI's testing proved that GPT-5.6 is currently better at finding and fixing vulnerabilities than actually exploiting them in real, end-to-end attacks against hardened targets. It generates highly credible memory safety leads.

By pushing this to a limited preview for trusted partners first, OpenAI is giving defenders a massive head start to harden systems before offensive capabilities catch up.

The Bottom Line

The API and Codex access are currently limited to trusted partners as part of a government safety review, but a broader rollout is coming in the next few weeks.

When managing massive engineering architectures, the shift from "copilot" to "autonomous subagent cluster" changes everything.