DEV Community: llm

How to Automate the ChatGPT & Gemini Web UIs Without an API Key

Usama — Tue, 30 Jun 2026 12:35:30 +0000

You've got a folder of a few hundred screenshots and you want the text out of each one. Or you want to generate a batch of images for a side project. Or you just want to drop a single "summarize this" call into a script you're writing on a Sunday afternoon. So you open the pricing page for the official API, do the math on per-token billing plus setting up keys and a payment method, and it's hard to justify, because the exact same model will do the exact same thing for free in a browser tab.

There are really two ways to get a model like ChatGPT or Gemini to do work for you. The web UI is free, or already covered by a subscription you're paying for anyway, but you drive it by hand. The API is scriptable, but you pay by the token. Most of the time that trade-off is fine. But for a whole category of work like hobby projects, throwaway scripts, research, or anything that doesn't need production-grade reliability, you're stuck picking between "free but manual" and "automated but paid."

Which raises the obvious question: why not automate the free web UI? It's just a webpage. You open it, type in the box, click send. It turns out that hides a few fiddly problems, which I ran into enough times that I eventually built a small library for them. In this article we'll work through what it takes to automate these UIs, and at the end I'll show how little code it comes down to.

1. What it takes to drive a chat UI

A single round trip with ChatGPT or Gemini breaks down into four jobs:

Get your text into the input box
Optionally attach a file
Wait for the model to finish answering
Read the answer back out

Every one of these is harder than it sounds, because the page is a modern single-page app that was never built to be driven by a script. We'll use Selenium with undetected-chromedriver, and for now assume the browser is already open (we'll get to launching it in the next section). To keep the code readable I'll show whichever of the two platforms makes each problem clearest, and mention the other where it differs.

1.1 Typing the message

The first surprise is that the input isn't a normal text field you can drop a string into. On ChatGPT it's a contenteditable div, and on Gemini it's a custom rich-textarea element. You can still send keystrokes to it, but two things will trip you up. A plain Enter submits the message, so any newline inside your prompt has to go in as Shift+Enter. And emoji and other characters outside the Basic Multilingual Plane (code points above U+FFFF) break send_keys under ChromeDriver, so those need to be inserted through JavaScript instead.

That pushes you toward sending the message one character at a time:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

box = driver.find_element(By.CSS_SELECTOR, 'div[contenteditable="true"]')
box.click()

for char in message:
    if char == "\n":
        # A plain Enter would send the message early
        box.send_keys(Keys.SHIFT, Keys.ENTER)
    else:
        box.send_keys(char)

Gemini works the same way, just against the rich-textarea element instead of the contenteditable div.
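The loop above covers newlines but not the emoji case. One way to keep that decision readable is to separate "what kind of character is this" from the browser calls. The sketch below is only that classification step; the function name plan_keystrokes and the action labels are hypothetical, and it assumes the cutoff is the Basic Multilingual Plane (code points up to U+FFFF).

```python
def plan_keystrokes(message):
    """Classify each character by how it would need to be entered."""
    actions = []
    for char in message:
        if char == "\n":
            # A plain Enter would submit, so newlines become Shift+Enter
            actions.append(("shift_enter", None))
        elif ord(char) > 0xFFFF:
            # Outside the BMP (most emoji): insert through JavaScript instead
            actions.append(("js_insert", char))
        else:
            actions.append(("send_keys", char))
    return actions


print(plan_keystrokes("hi\n\U0001F642"))
```

Each action would then map onto one browser call: send_keys for ordinary characters, the Shift+Enter chord for newlines, and an execute_script call of your choosing for the rest.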

1.2 Uploading a file

This is where it gets interesting. The file <input> on the page is hidden, and the useful trick is that you don't need to open a file dialog at all: if you can get a reference to a hidden input[type=file], you can hand it a path with send_keys and ChromeDriver does the upload internally, no dialog involved.

ChatGPT is the easy case. The input already exists in the page, so you unhide it and send the path. Gemini is the awkward one. Clicking its upload button makes the page call the input's own .click(), which pops the operating system's file picker, a window Selenium has no way to drive. The fix is to stop the page from opening that dialog in the first place, by monkey-patching the browser's click method so it ignores the call on file inputs:

driver.execute_script("""
    const orig = HTMLInputElement.prototype.click;
    HTMLInputElement.prototype.click = function () {
        if (this.type === 'file') return;   // swallow the call that opens the OS dialog
        return orig.apply(this, arguments);
    };
""")

With that in place you can walk through Gemini's upload menu without a dialog ever appearing, then find the hidden input it creates, unhide it, and feed it the path:

file_input = driver.find_element(By.CSS_SELECTOR, 'input[name="Filedata"]')
driver.execute_script("arguments[0].style.display = 'block';", file_input)
file_input.send_keys("/path/to/receipt.jpg")

In real code you'd restore the original click afterward so the patch doesn't leak into the rest of the session, but the two snippets above are the whole idea. The recurring lesson with this kind of automation is that the hardest problems are the ones where the page actively fights you.
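Restoring the original click could look something like the context manager below. This is a sketch under assumptions, not necessarily how any particular library does it: the name suppressed_file_dialog and the window.__origInputClick stash are hypothetical, and the only driver method it relies on is execute_script.

```python
from contextlib import contextmanager

# Stash the original method on window so a later script can put it back.
PATCH_JS = """
    if (!window.__origInputClick) {
        window.__origInputClick = HTMLInputElement.prototype.click;
        HTMLInputElement.prototype.click = function () {
            if (this.type === 'file') return;   // swallow the OS dialog
            return window.__origInputClick.apply(this, arguments);
        };
    }
"""

RESTORE_JS = """
    if (window.__origInputClick) {
        HTMLInputElement.prototype.click = window.__origInputClick;
        delete window.__origInputClick;
    }
"""


@contextmanager
def suppressed_file_dialog(driver):
    """Patch click() on file inputs for the duration of the block."""
    driver.execute_script(PATCH_JS)
    try:
        yield
    finally:
        # Runs even if the upload steps raise, so the patch never leaks
        driver.execute_script(RESTORE_JS)


# Usage, with a real driver:
# with suppressed_file_dialog(driver):
#     ...walk the upload menu, unhide the input, send_keys the path...
```

The try/finally is the point of the design: a failed selector in the middle of the upload still leaves the page with its normal click behavior.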

1.3 Waiting for the response

You've sent the message. Now you have to know when the model is done, and there's no event you can listen for and no callback that fires. You poll the page and read its visual cues. The cleanest signal on ChatGPT is the stop button: while a response is being generated there's a stop button on screen, and when generation finishes it disappears.

import time

def is_generating():
    return bool(driver.find_elements(By.CSS_SELECTOR, '[data-testid="stop-button"]'))

while is_generating():
    time.sleep(1)

The principle here is that you're inferring application state from interface elements that were never meant to be read as an API.
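The bare loop above has two gaps worth knowing about: it never gives up, and if it runs before the stop button has rendered it exits immediately. A small polling helper with a timeout covers both when you call it twice. The name wait_until and its defaults are my own choices for this sketch.

```python
import time


def wait_until(predicate, timeout=120.0, interval=1.0):
    """Poll predicate until it returns True; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Usage with is_generating() from above:
# wait_until(is_generating, timeout=10)               # stop button appears
# wait_until(lambda: not is_generating(), timeout=300)  # then disappears
```

Waiting for the button to appear first, then to disappear, avoids reading an empty or half-written response when the page is slow to start.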

1.4 Getting the response out

The reply lives in the page as rendered HTML. Pulling the text out is a matter of finding the right container in the last response and reading it:

turn = driver.find_elements(By.CSS_SELECTOR, ".agent-turn")[-1]   # the most recent response
text = turn.find_element(By.CSS_SELECTOR, ".markdown").text

If you want the raw markdown source instead of the rendered text, there's a copy button you can click and then read off the clipboard. And if the response contains a generated image, getting it out is its own small pipeline: you click the image's download button and then wait for the file to arrive in your download folder, skipping the partial .crdownload file the browser writes while the download is still in progress.
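The download wait can be sketched without any browser code at all, since it is just watching a folder. The version below assumes you snapshot the folder's file names before clicking download and pass them in as known; the function name and parameters are hypothetical.

```python
import time
from pathlib import Path


def wait_for_download(folder, known=(), timeout=60.0, interval=0.5):
    """Return the first finished file in folder that is not in known."""
    folder = Path(folder)
    known = set(known)
    deadline = time.monotonic() + timeout
    while True:
        for path in folder.iterdir():
            finished = path.suffix != ".crdownload"   # skip partial files
            if path.is_file() and finished and path.name not in known:
                return path
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download appeared in {folder}")
        time.sleep(interval)


# Usage:
# before = {p.name for p in Path(download_dir).iterdir()}
# ...click the image's download button...
# image_path = wait_for_download(download_dir, known=before)
```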

That's a full round trip: text in, file attached, wait for the answer, text or image back out. Run it twice, though, and you hit the next problem. The second time your script opens the browser, you're logged out and starting from a blank session, which is where the next piece comes in.

2. Making it survive across runs

The reason your second run starts logged out is that an automated browser, by default, begins every session from nothing: no cookies, no history, no saved login. So before any of the previous section's code is useful in practice, you need the browser to remember who you are between runs, and you need it to behave enough like a real session that the platform doesn't start throttling you. That comes down to one Chrome setting, a one-time setup step, and typing at a human pace.

2.1 A browser profile that persists

Chrome keeps everything about your identity on a site, including cookies and login sessions, inside a profile directory. If you let Chrome spin up a throwaway profile each run, you lose all of that the moment the script ends. Point it at a directory you control instead, and the login survives:

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--user-data-dir=/path/to/your/profile")

driver = uc.Chrome(options=options)

Two things are happening here. undetected-chromedriver is a drop-in replacement for Selenium's Chrome that smooths over the most obvious tells of an automated browser. And the --user-data-dir flag is the part that gives you persistence: it tells Chrome to store its profile in a folder of your choosing, so the session you logged into yesterday is still there today. A profile with real history also looks like a returning user rather than a brand-new automated one, which keeps the session healthier over time.

2.2 Logging in, once

A profile directory is only useful once there's a logged-in session inside it, so there's a one-time setup step. You open the browser pointed at your profile, log in by hand, then close it. Every automated run after that reuses the saved session.

driver = uc.Chrome(options=options)
driver.get("https://gemini.google.com")

input("Log into the browser window, then press Enter here to finish setup.")
driver.quit()

Logging in is also where a paid plan pays off. If you already subscribe to ChatGPT Plus or a paid Gemini tier, signing in during setup means every automated run uses that subscription, with its higher message limits and access to the better models, instead of being capped at the free tier. You do this once per machine and forget about it.

2.3 Typing at a human pace

A script that drops an entire prompt into the box in a single instant doesn't behave like a person at a keyboard, and sessions that look automated are the ones that get rate-limited or challenged. The fix is cheap. We're already sending the message one character at a time, so all it takes is a small, slightly random delay between keystrokes:

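import time, random
from selenium.webdriver.common.keys import Keys

for char in message:
    if char == "\n":
        # Same rule as in 1.1: a plain Enter would send the message early
        box.send_keys(Keys.SHIFT, Keys.ENTER)
    else:
        box.send_keys(char)
    time.sleep(random.uniform(0.02, 0.05))   # a human pace, not an instant dump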

The randomness matters more than the exact timing, since a perfectly even rhythm is itself a tell.

With that, the machine is complete. The browser stays logged in across runs, and the input behaves enough like a real person to keep the session stable. You've now seen everything that goes into automating these interfaces, which means it's a good moment to step back and see how much of it you have to write yourself.

3. All of that, in a few lines

Every problem in the last two sections is the kind you want to solve once and then never think about again. That's what pushed me to wrap the whole thing up into a library. It's called Hermex, and you install it with pip install hermex.

The one-time login from the previous section becomes a single call:

from hermex import ChatGPT

ChatGPT.setup()   # opens a browser once: log in, then close the window

After that, the entire round trip from earlier, launching the browser, typing, uploading, waiting for the response, and reading it back, is one line:

response = ChatGPT.simple_query("What does this receipt say?", attachments=["receipt.jpg"])
print(response.text)

For a back-and-forth conversation, keep the browser open and call query as many times as you want:

from hermex import Gemini

gemini = Gemini()
gemini.open_url()

print(gemini.query("Summarize the history of the internet.").text)
print(gemini.query("Now just the key dates.").text)

gemini.close()

And a generated image comes back as a path to the downloaded file:

response = gemini.query("Generate an image of a mountain at sunset.")
print(response.image)

Under the hood, that's everything from the previous sections: the character-by-character typing with its newline and emoji handling, the hidden-input upload with Gemini's dialog suppression, the polling that waits for generation to finish, the text and image extraction, and the persistent profile that keeps you logged in. None of it is conceptually hard, but it's a lot of fiddly surface area to get right and, harder still, to keep working as the interfaces change. That last part is the real argument for not hand-rolling it every time. Hermex is open source under the MIT license, and the code is on GitHub at github.com/pseudo-usama/hermex.

4. Wrapping up

Automating a chat web UI comes down to a handful of problems that each look trivial and aren't: getting text into an input that isn't a text field, attaching files through an element the page hides from you, knowing when the model has finished without any event to tell you, and pulling the answer back out. Wrap those up with a profile that stays logged in, and it collapses to a single line you can call from a script.

The catch is that it's brittle by nature. You're driving an interface built for people, not programs, and a redesign that moves a button or renames a class will quietly break it. That makes it a great fit for hobby projects, scripts, and research, and a poor fit for production, where the official API earns its cost. And since ChatGPT and Gemini each have their own terms of service, where you take this is your call and your responsibility.

The code is on GitHub if it's useful. The documentation is available at hermex.usama.ai.

Gemma-4 31B + vLLM + RTX 6000 PRO : 1168 tokens/sec and still asking for more...

Nikhil — Tue, 30 Jun 2026 12:35:13 +0000

We pushed Gemma-4 31B to 24 concurrent requests on a single RTX 6000 PRO Blackwell. The queue never filled. ~1.17k tokens/sec, and it still had headroom.

Most LLM "benchmarks" show you one request at a time. That tells you almost nothing about production.

So we ran Gemma-4 31B (FP8) on vLLM under a real ShareGPT workload, ramping concurrency 12 → 16 → 20 → 24, and watched what actually happens.

The numbers that mattered:

→ Peak throughput: ~1,168 tokens/sec total (~548 tok/s output)

→ Median time-to-first-token: ~0.7s — snappy even under load

→ Queue depth: averaged 0.41, peaked at just 3 while 14–21 requests ran concurrently

→ Server stayed unsaturated across the entire sweep

The one thing to watch:

Tail TTFT.
Median first-token stays fast, but p99 climbs to ~19s at the heaviest concurrency. That's the first metric to flex as you push higher — not throughput, not the queue.

Setup:

1× RTX 6000 PRO Blackwell (96GB)
Gemma-4 31B-it, FP8 checkpoint
vLLM 0.20 — prefix caching + chunked prefill on
ShareGPT workload, 1024 max output tokens, streaming ON
Max model length (context): 4096 tokens

Verdict:

A single Blackwell card runs a 31B model at 24-way concurrency without breaking a sweat. The high end-to-end latency is just long generations, not queuing — and there's clearly room to climb past 24.

[Chart: token throughput]

[Chart: end-to-end latency]

Full writeup — configs, charts, and per-concurrency breakdown — in the comments. ↓

Changes to LLM pricing: Ambient and Novita

Narev Bot — Tue, 30 Jun 2026 12:32:44 +0000

Model price changes detected for Ambient and Novita. Details below.

GLM-5.2 vs Anthropic Mythos for Bug Finding: Architectures, Benchmarks, and Production Playbook

Delafosse Olivier — Tue, 30 Jun 2026 12:30:12 +0000

Originally published on CoreProse KB-incidents

By 2026, most developers already pair-program with an AI assistant; the real decision is which model is allowed near production code, secrets, and CI pipelines.[1] These assistants run on large-scale artificial intelligence and generative AI foundations, and their behavior under real operational pressure matters.

For bug finding—especially security issues—the model choice affects:

How many real defects you catch
How many new vulnerabilities you introduce
How much every CI run costs

This article compares Zhipu AI’s GLM-5.2 and Anthropic’s Mythos as bug-finding engines in realistic RAG, agent, and CI/CD architectures. The focus is reusable evaluation and rollout, not leaderboard scores.

1. Problem Framing: Why Compare GLM-5.2 and Mythos for Bug Finding?

By 2026, AI copilots are baseline; the differentiator is fit to workflow and risk profile, not raw coding ability.[1] Pentesters already see very different security behavior across assistants: some explain vulnerabilities well, others write exploits easily, and some introduce insecure patterns into code.[1]

📊 Enterprise reality

Around 68% of organizations put 30% or fewer generative AI projects into production, primarily due to underestimated integration, governance, and data prep complexity.[3] The same issues appear when wiring GLM-5.2 or Mythos into CI as automated reviewers.

⚠️ Demo vs production gap

Serving LLMs in production means handling:

Latency SLAs and tail latencies
Token-based pricing and unbounded loops
Observability of prompts, context, and outputs
Hallucinations and unsafe tool calls[8][10]

A model that feels great in the IDE can be unusable when every PR triggers hundreds of RAG + tool steps in CI.[8]

💼 Anecdote: A 40-person fintech added an LLM static reviewer to CI and quickly hit:

3× longer CI times
Insecure crypto suggestions merged
A surprise four-figure API bill from an unbounded agent loop[10]

Not because the model was bad, but because it was treated as a chatbot, not an infrastructure component.

Security audits of LLM apps now routinely find prompt injection, RAG poisoning, code exfiltration, and unsafe tool execution; “LLM pentest” offerings have emerged.[9] Your bug-finding model is part of the attack surface. In a world of AI worms and AI-orchestrated espionage, ignoring this is negligent.

💡 Framing question

For CI-integrated AI code review and bug triage, under regulatory and security pressure, does GLM-5.2 or Mythos deliver better end-to-end value—accuracy, cost, and risk—once embedded in a full stack?

The rest of the article gives you the tools to answer that in your own environment.

2. Evaluation Methodology: How to Measure Bug-Finding Performance Rigorously

A serious comparison needs more than anecdotes. Following production evaluation playbooks, define metrics before prompt or pipeline tuning.[6]

2.1 Core metrics

Capture at least:

Defect recall: fraction of known bugs correctly identified and fixed
Localization accuracy: correct file/function highlighted
Patch correctness: compiles, tests pass, no new defects
Hallucination rate: unsupported or failing suggestions[2][6]
Latency & P95: full path including RAG and tools[8]
Cost per 1K tokens and per CI run: models, embeddings, tools[6][10]
Reproducibility: stability across repeated runs with identical inputs[6]

📊 Evaluation guidance stresses quantifying accuracy, latency, cost, and hallucinations before system tuning.[6]
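As a rough illustration of how the first few metrics could be computed from a labeled run, here is a minimal sketch. The CaseResult fields and the sample values are invented for the example; it also treats defect recall as identification only and keeps patch correctness as a separate number, which is one possible reading of the definitions above.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    bug_found: bool          # model pointed at the labeled defect
    patch_passes: bool       # patch compiles and the must-pass tests pass
    unsupported_claims: int  # suggestions with no basis in code or context
    total_claims: int


def summarize(results):
    """Aggregate per-case results into suite-level metrics."""
    n = len(results)
    claims = sum(r.total_claims for r in results)
    unsupported = sum(r.unsupported_claims for r in results)
    return {
        "defect_recall": sum(r.bug_found for r in results) / n,
        "patch_correctness": sum(r.patch_passes for r in results) / n,
        "hallucination_rate": unsupported / claims if claims else 0.0,
    }


# Made-up results for four benchmark cases
sample = [
    CaseResult(True, True, 0, 4),
    CaseResult(True, False, 1, 3),
    CaseResult(False, False, 2, 3),
    CaseResult(True, True, 0, 2),
]
print(summarize(sample))
```

Running the same summarize over each model's results on identical cases gives the side-by-side numbers the rest of the methodology relies on.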

2.2 Dataset design

Build a labeled dataset that mirrors your real defects:

Failing unit/integration tests
Known security issues (injection, auth bugs, secrets)
Flaky tests, race conditions
Performance regressions and leaks

For each scenario, include:

Minimal reproducer (snippet or repo)
Ground truth (must-pass tests or neutralized CVE)
Severity labels (e.g., CVSS-like)[6][9]

Many generative AI projects fail at scale because they rely on synthetic examples and skip curated datasets.[3]

💡 Security scenarios to include[1][9]

Unsafe input validation around SQL/OS commands
Insecure crypto or hard-coded secrets
Deserialization of untrusted data
Overpermissive auth logic

These reflect real AI-generated and AI-modified code issues.[1]

2.3 Closed-book vs RAG-augmented

Evaluate both modes:

Closed-book: Failing test, stack trace, relevant file only.
RAG-augmented: Plus retrieved context (docs, logs, standards).

RAG combines retrieval from a knowledge base with LLM generation to reduce hallucinations and use up-to-date internal knowledge.[2][4] For debugging, this often means:

Logs and traces
Past incident tickets
Internal guidelines and security standards

Well-tuned RAG can cut hallucinations by 40–60%, depending on domain.[2] Measure how much GLM-5.2 and Mythos each actually benefit in your stack.

2.4 Experiment loop and governance

Use an iterative loop:

Run baseline prompts and tools.
Log metrics and representative examples.
Adjust prompts, system messages, tools.
Re-run and compare via dashboards.[6]

Persist prompts, retrieved docs, and generated diffs for traceability and auditability, as required by modern LLM governance frameworks and the AI Act.[5] Debug workloads involving personal data or safety-critical systems especially require this.[5]

⚡ Mini-conclusion: Treat evaluation as a product. If you can’t trend recall, hallucinations, and cost per CI run over time, you’re not ready to choose a model.

3. Architecture: GLM-5.2 vs Mythos in a RAG- and Tool-Enhanced Debugging Stack

GLM-5.2 and Mythos are pluggable components inside a broader system. The surrounding architecture often matters as much as the model.

3.1 High-level pipeline

A typical production debugging pipeline:

Trigger: CI detects a failing pipeline or new security finding.
Retrieval – telemetry: Fetch stack traces, logs, traces.
Retrieval – knowledge: Query vector DB for code, docs, standards.
Reasoning: LLM analyzes context, localizes bug, proposes patch.
Tools: Run tests, linters, SAST/DAST, sandbox repro.
Decision: Auto-apply patch, open PR, or comment only.

This is a standard RAG + tool-use pattern for code and observability data.[2][4][8]

💡 RAG layout for code[2][7]

Embed into a vector DB:

Source files and tests
Architecture docs and runbooks
Historical incident tickets

Retrieve Top‑K chunks per failure via a vanilla RAG pipeline extended to code.

3.2 Query enhancement and GLM-5.2 vs Mythos

Retrieval quality is often the bottleneck. Query enhancement—hypothetical questions, HyDE-style docs, sub-queries, stepback prompts—consistently boosts RAG performance.[7]

For bug finding:

Turn a stack trace into multiple “what went wrong?” questions
Generate a hypothetical failure explanation and embed it (HyDE) to locate files[7]

Compare GLM-5.2 and Mythos on:

Quality of these auxiliary queries/documents
Tendency to overfit to their own hypotheticals over retrieved context

3.3 Agents, gateways, and guardrails

Modern debugging stacks increasingly use agentic AI: networks of agents that plan, decompose, and call tools.[8] Both Mythos (in the Claude family)[8] and GLM-5.2 can power such systems.

Typical orchestration:

AI gateway normalizes APIs, auth, and routing.
Requests are routed to GLM-5.2 or Mythos by latency, cost, sensitivity.[8][10]
Agents call tools (tests, scanners, sandboxes) and occasionally web search.
Many enterprises expose tools via the Model Context Protocol (MCP) so multiple agents share capabilities.

In this setup:

GLM-5.2 self-hosting can cut marginal cost but adds infra complexity.
Mythos as a managed API speeds adoption and may offer stricter alignment and data guarantees.

Tools like Claude Code show the risk: if agents can execute shell commands, a weakly constrained agent can run destructive commands on your repo. Agent meltdowns and bad configs rival model choice in importance.[9]

⚠️ Non-negotiable guardrails[9]

Strict tool schemas and allowlists
Output validation (e.g., patches cannot modify auth middleware in “read-only” mode)
Prompt-injection filters on user input and retrieved docs
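The first two guardrails can be sketched as plain checks that sit between the agent and anything it touches. The tool names and protected paths below are hypothetical placeholders, and a real deployment would likely enforce this in the gateway rather than in application code.

```python
ALLOWED_TOOLS = {"run_tests", "run_linter", "run_sast"}   # hypothetical names
PROTECTED_PATHS = ("src/auth/", ".github/workflows/")      # hypothetical paths


def check_tool_call(name):
    """Reject any tool that is not explicitly allowlisted."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not on allowlist: {name}")


def patch_violations(changed_files, read_only=True):
    """List the protected files a proposed patch would modify in read-only mode."""
    if not read_only:
        return []
    return [path for path in changed_files if path.startswith(PROTECTED_PATHS)]


print(patch_violations(["src/auth/middleware.py", "README.md"]))
```

A patch with a non-empty violation list would be downgraded to a comment instead of being applied.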

💼 Production mapping[8]

Many orgs now deploy LLMs behind:

Ingress → AI gateway → model router
Vector DB for RAG
Observability stack for prompts, retrievals, outputs

This reflects 2025–2026 practice, far from the “single notebook” view.

4. Benchmark Scenarios: From Unit Test Failures to Security Vulnerabilities

Your benchmark suite should cover correctness and safety, reflecting how pentesters and developers already use AI for exploitation and debugging.[1][9]

4.1 Security-heavy scenarios

Design tasks like:

Misconfigured auth logic (bypassable role checks)
Unsafe deserialization leading to RCE
Command injection behind partial validation
SQL injection via ORM edge cases[1][9]

Each scenario should include:

Reproducible environment
Tests or PoCs proving exploitability and remediation[6]

Include at least one poisoning / prompt injection case where the model is steered toward disabling security checks, echoing concerns about AI worms and autonomous exploit chains.

📊 LLM pentests now separate LLM/RAG-specific flaws (prompt injection, poisoning, unsafe tools) from classic web issues.[9]

4.2 Systemic and RAG-specific failures

Include systemic failure modes:

Brittle CI pipelines around AI tools
Misaligned expectations between security and product
Poor data classification exposing sensitive logs[3][8]

RAG-specific failures to benchmark:

Context poisoning: Malicious docs instruct disabling security.
Irrelevant retrieval: Wrong files → spurious fixes.
Sensitive leakage: RAG reveals secrets or confidential modules inappropriately.[2][9]

💡 Example: A pentest found a PDF in a RAG index that injected prompts convincing the LLM to dump internal config and bypass safeguards, mapped to OWASP LLM01.[9]

4.3 Multi-level tasks and insecure suggestions

Design tasks across levels:

“Fix this failing unit test.”
“Identify and remediate OWASP Top 10-style issues in this service.”
“Harden this CI workflow used by an LLM agent running tests.”[9]

Measure:

True defect recall
Precision of safe, compilable patches
Frequency of insecure patterns (e.g., SQL string concat, weak crypto) each model suggests[1]

This mirrors findings where AI tools rapidly generate complex but insecure scripts and exploits.[1]

4.4 Governance-aware tasks

Include tasks where the model must:

Redact PII from logs before use
Avoid exporting data outside allowed regions
Respect retention and minimization constraints[5]

Governing LLM usage demands audit trails, lawful processing bases, and AI Act risk classification. Your benchmark should test how well GLM-5.2 and Mythos each respect these constraints without extreme prompt engineering.[5][3]

⚡ Mini-conclusion: Benchmarks that skip security, RAG poisoning, and governance will favor the “catchiest chatbot,” not the safest debugging engine.

5. Production Concerns: Latency, Cost, Governance, and Safety Trade-offs

Even if Mythos beats GLM-5.2 on recall by 10%, that advantage can vanish if CI runs cost 10× more or break data residency rules.

5.1 Cost per CI run

Since pricing is token-based, estimate:

Average tokens per request (prompt + context + output)
Requests per failing PR (including RAG and tools)
Price per 1K tokens for each model and embedding tier

Then compute cost per CI run for GLM-5.2 vs Mythos under realistic failure and adoption rates.[6][10]
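The arithmetic is simple enough to keep in a small script. The sketch below uses a single blended price per model and ignores embedding costs for brevity; every number in it is a placeholder for illustration, not a real price or a measured token count.

```python
def cost_per_ci_run(tokens_per_request, requests_per_run, price_per_1k_tokens):
    """Token cost of one CI run for a single model."""
    total_tokens = tokens_per_request * requests_per_run
    return total_tokens / 1000 * price_per_1k_tokens


def monthly_cost(run_cost, prs_per_month, failure_rate):
    """Assumes only failing PRs trigger the reviewer."""
    return run_cost * prs_per_month * failure_rate


# Placeholder prices per 1K tokens; substitute current prices and measured usage
scenarios = {"model_a": 0.002, "model_b": 0.010}
for name, price in scenarios.items():
    run = cost_per_ci_run(6000, 40, price)
    month = monthly_cost(run, prs_per_month=500, failure_rate=0.2)
    print(f"{name}: {run:.2f} per run, {month:.2f} per month")
```

Requests per run is usually the term that surprises people, since every retrieval and tool round trip counts as another request.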

📊 One real case: a developer left an AI loop on overnight and incurred a $3,000 API bill—showing how fast unbounded agents can explode costs.[10]

5.2 Latency and throughput at system level

Measure end-to-end latency:

Gateway/routing
Vector DB retrieval
Model inference
Tools (tests, linters, scanners)

Network hops and external APIs often dominate latency, not raw model speed.[8][10] This matters when CI per-PR budgets are 5–10 minutes.

Helpful techniques:

Parallelize retrieval and tool calls
Batch multiple failing tests
Use cheaper models for “explanation-only” comments
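For the first technique, independent lookups (logs, vector search, scanner output) can run concurrently rather than one after another. A minimal sketch with the standard library follows; the fetcher names are hypothetical stand-ins for real I/O calls.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_context(failure_id, fetchers):
    """Run independent fetchers concurrently; fetchers must be non-empty."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = {name: pool.submit(fn, failure_id) for name, fn in fetchers.items()}
        # result() re-raises any exception from the worker
        return {name: future.result() for name, future in futures.items()}


# Hypothetical fetchers standing in for network calls
fetchers = {
    "logs": lambda fid: f"logs for {fid}",
    "docs": lambda fid: f"docs for {fid}",
}
print(gather_context("build-123", fetchers))
```

Threads are a reasonable fit here because the work is I/O-bound; end-to-end latency then tracks the slowest fetcher instead of the sum.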

5.3 Governance, standards, and data protection

Robust LLM governance for debugging needs:

Data classification of logs, traces, repos
Lawful basis/DPIA for personal data in logs
AI Act risk categorization and controls for high-risk domains (finance, health, safety)[5]

Standards like ISO/IEC 42001 for AI management are emerging reference points. Self-hosted GLM-5.2 may ease residency concerns but increases infra/maintenance; managed Mythos may simplify ops but restrict what data you can send.[5][3]

Traceability is essential: log prompts, retrieved docs, diffs, and decisions for audit, incident response, and appeals.[5][6] Training developers (e.g., Secure Code Warrior, internal “LLM safety drills”) is now as important as prompt tuning.

5.4 Adversarial testing and hardening

Apply AI-specific pentest practices:

Jailbreak and prompt injection attempts
RAG poisoning with crafted docs
Tool abuse: commands that modify infra, leak secrets, escalate privileges[9]

Findings are often mapped to OWASP LLM Top 10 and AI Act obligations, highlighting both model behavior and architectural weaknesses.[9][5]

⚠️ Organizational reality: Leaders often assume that because public chatbots “just work,” wiring LLMs into CI and security is easy. They underestimate integration, data, and governance complexity—one reason so many projects stall pre-production.[3]

6. Implementation Playbook: Rolling Out GLM-5.2 or Mythos for Bug Finding

This section compresses the ideas above into a rollout plan.

6.1 Phased rollout

Pilot on non-critical services
    - Restrict to low-risk repos.
    - Run GLM-5.2 and Mythos in comment-only mode.
Instrument evaluation
    - Capture recall, hallucination, latency, cost.
    - Compare GLM-5.2 vs Mythos on identical tasks.[6]
Progressive expansion
    - Add more services as metrics stabilize.
    - Enable auto-fix only for low-risk categories.[3]

Successful projects favor staged rollouts, stakeholder alignment, and continuous measurement over “big bang” launches.[3][6]

💼 Anecdote: One SaaS firm started with AI linting on a sandbox repo, then expanded to all internal services after three months of stable metrics and governance sign-off.

6.2 RAG tuning for debugging

For the RAG layer:

Chunking: Use structure-aware chunks (functions, classes, doc sections) instead of fixed tokens.
Indexing: Separate indices for code, docs, and tickets.
Query enhancement: Use HyDE-style hypotheticals and stepback prompts to boost recall and precision.[7]
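For Python sources, structure-aware chunking can be sketched with the standard ast module: one chunk per top-level function or class rather than fixed token windows. This is an illustrative minimum (decorators and module-level statements are not captured), and other languages would need their own parser.

```python
import ast


def chunk_python_source(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


example = "import os\n\ndef f(x):\n    return x + 1\n\nclass A:\n    def m(self):\n        pass\n"
for chunk in chunk_python_source(example):
    print(chunk["kind"], chunk["name"], chunk["start_line"])
```

Keeping the name and start line alongside the text lets a retrieved chunk be mapped back to the exact place in the repo when localization accuracy is scored.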

Across all phases, treat GLM-5.2 and Mythos as interchangeable backends for the same agentic workflows. The decisive signal is in the metrics: which model finds more real bugs per dollar of CI budget, under your governance and resilience constraints, with your AI agents and RAG stack?



Inside the Quiet Rise of Autonomous AI Agents

Yao Xiao — Tue, 30 Jun 2026 12:26:42 +0000

There is a specific threshold every engineer crosses when building with modern LLMs. You wire a language model to a live tool and send a single open-ended query. The model triggers an API, evaluates the JSON payload, self-corrects, and autonomously spins up a subsequent call.

Then it hits you: the loop is running on its own.

No per-step prompts. No human middleware. Just an unprompted sequence of tactical actions driving toward a strategic goal.

That moment used to feel like an ephemeral party trick. In 2026, it is production infrastructure.

The shift from AI-as-chatbot to AI-as-agent is not a rebrand. It is a structural change in how language models are deployed — and understanding it matters whether you are building these systems, integrating them into an existing stack, or simply trying to stay technically oriented in a field that is compounding faster than most people have calibrated for.

The Core Distinction: Answering vs. Acting

A conversational AI responds. An agentic AI acts.

The difference is not purely semantic; it is architectural. When you ask ChatGPT to summarize a document, the model reads your prompt, generates a token sequence, and stops. That is a single model call: one context window, one output (generating the sequence still takes one forward pass per token). When you give an AI agent a goal such as "research this competitor, draft a briefing, and schedule the summary for tomorrow morning," the model must plan, call tools, observe results, re-plan based on what came back, and iterate until the goal is satisfied or it runs out of options.

Dimension	Conversational AI	Agentic AI
Input	Single user prompt	High-level goal + environment state
Execution	One model call	Multi-step loop (plan → act → observe)
Failure mode	Wrong answer, hallucination	Loop divergence, tool misuse, scope leak
Compute cost	Predictable (one call)	Variable (N calls per goal)
Human role	Evaluates every output	Sets goal; reviews at checkpoints

The cognitive architecture is different. The failure modes are different. The prompting requirements are different. And the results, when it works, are qualitatively more powerful than anything a single-turn prompt can produce.

What an Agent Actually Is, Under the Hood

Strip away the marketing and an AI agent is a loop with four components:

Perception. The model receives inputs — a user goal, tool outputs, memory contents, environment observations. This is its "world state."

Reasoning. The model reasons about what to do next. Many production agents implement ReAct-style reasoning (Reason + Act): the model emits an explicit reasoning trace before committing to an action, typically a thought: field followed by an action: field in the output schema. On non-trivial tasks this is rarely an optional design choice. Without a structured reasoning step, the model skips directly to action selection, which tends to collapse reliability.

Action. The model emits a structured output — often a JSON function call or a tool invocation — that triggers real-world effects: a web search, a database query, a file write, an email send, a subprocess execution.

Observation. The result of the action is fed back into the context window. The model reads it, updates its understanding, and starts the loop again.

This loop continues until the agent decides it has satisfied the goal, or a stopping condition (maximum iterations, human confirmation gate) is met. The entire thing runs over what is, at its core, still just a next-token prediction model. The "agency" emerges from the loop structure, not from any new architectural invention inside the model itself.
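The four components can be sketched in a few lines of Python. This is a minimal illustration, not any framework's implementation: `fake_model`, `TOOLS`, and the dict-based step format are hypothetical stand-ins for a real LLM call and real tools.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"stub result for '{query}'",
}

def fake_model(context: list[str]) -> dict:
    """Stand-in for an LLM call: search once, then declare the goal met."""
    if not any(line.startswith("observation:") for line in context):
        return {"thought": "I need data first.", "action": "web_search",
                "argument": "enterprise AI agents"}
    return {"thought": "I have enough.", "action": "finish",
            "argument": "Briefing based on one search."}

def run_agent(goal: str, model=fake_model, max_iterations: int = 5) -> str:
    context = [f"goal: {goal}"]                        # Perception: initial world state
    for _ in range(max_iterations):                    # Stopping condition: iteration cap
        step = model(context)                          # Reasoning: thought plus chosen action
        context.append(f"thought: {step['thought']}")
        if step["action"] == "finish":
            return step["argument"]
        result = TOOLS[step["action"]](step["argument"])  # Action: execute the tool
        context.append(f"observation: {result}")       # Observation: feed the result back
    return "stopped: iteration limit reached"

answer = run_agent("Summarize enterprise agent adoption")
```

Swapping `fake_model` for a real model call changes nothing about the loop itself, which is the point: the agency lives in the scaffolding.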

Author's Comment: This is the part that trips people up. Agents are not a new category of AI. They are a new deployment pattern for the same underlying models you already use. The intelligence is the same; the scaffolding around it is what changes the capability ceiling.

A Worked Example: Tearing Down a Real Agent Step by Step

Theory is useful. A concrete teardown is more useful. Let's build the simplest agent worth building — a Research Briefing Agent — and annotate every step against the loop structure above. You can set this up today, in ChatGPT or Gemini, with no code.

The goal we hand to the agent:

"Research the current state of autonomous AI agents in enterprise software. Identify 3 key trends, find one supporting data point for each, and write a 300-word executive briefing. Stop and ask me before sending anything externally."

That single sentence contains a planning problem, multiple tool calls, a memory requirement, a reflection requirement, and a hard boundary constraint. Watch how each one maps.

Setting It Up: ChatGPT (GPT Builder) vs. Gemini (Gems)

In ChatGPT: Go to chatgpt.com → Explore GPTs → Create. Paste the system prompt below into the "Instructions" field. Under "Capabilities," enable Web Search. That's your tool. Save and open the GPT.

In Gemini: Go to gemini.google.com → Gems → New Gem. Paste the same system prompt into the "Instructions" field. Gemini Gems have Google Search access by default. Save the Gem.

Both platforms expose a tool-augmented, looping LLM behind a simple form. The underlying mechanism is identical to what production agent frameworks implement — the platform just handles the loop scaffolding for you.

The System Prompt (Copy This Exactly)

You are a Research Briefing Agent. 
Your job is to autonomously research a topic, synthesize findings, 
and produce a structured executive briefing.

ROLE: Senior research analyst with expertise in technology trends.

TASK: When given a research topic, you will:
1. Break the topic into 3 searchable sub-questions.
2. Search for each sub-question independently.
3. Extract one concrete data point or quote per sub-question.
4. Synthesize findings into a 300-word executive briefing with headers.
5. Perform a self-review: check that every claim has a source 
   and the briefing is under 320 words.

FORMAT: Return your output as:
  - PLAN: (numbered list of sub-questions before searching)
  - FINDINGS: (bullet list of data points with sources)
  - BRIEFING: (final 300-word document)
  - SELF-REVIEW: (pass/fail + one sentence rationale)

CONSTRAINTS:
- Do not send any content externally or take any action 
  beyond searching and writing.
- Do not exceed 5 web searches per task.
- If a search returns no useful result, 
  log "no result" and move to the next sub-question.
- Stop and ask the user for clarification 
  if the topic is ambiguous or spans more than one distinct domain.
- Never fabricate a data point. 
  If you cannot find a real source, state it explicitly.

This prompt is roughly 200 words. Every line maps to a specific component of the agent loop. Let's trace it.

How to Invoke the Agent

Once the GPT is saved or the Gem is created, the agent is sitting idle — it has a system prompt loaded, tools enabled, and no active goal. Nothing runs until you send the first message. That first message is the invocation: the signal that starts the loop.

Step 1: Send the opening task message.

Open the GPT (in ChatGPT, click the GPT name from your sidebar or the Explore page) or open the Gem (in Gemini, click the Gem from your Gems list). In the chat input, type your research topic as a direct instruction. Do not preface it with pleasantries. Be specific about what you want and what form you want it in:

Research the current state of autonomous AI agents in enterprise software.
Identify 3 key trends, find one supporting data point for each, and write
a 300-word executive briefing.

That message is your goal. Hit send.

Step 2: Watch for the PLAN block — do not interrupt it.

A correctly configured agent will not immediately start searching. Its first output should be the PLAN: block — three numbered sub-questions it has derived from your goal. This is the Reasoning step made visible. If you see it, the agent is working correctly. If the model skips straight to "Sure, here are three trends about AI agents..." without searching, the system prompt is not loaded or the tool is disabled.

In ChatGPT, a working invocation looks like this in the response pane:

PLAN:
1. What enterprise AI agent platforms have reported production deployments
   in 2025–2026?
2. What efficiency or cost reduction figures have been cited 
   in agentic AI case studies?
3. Which major enterprise software vendors (SAP, Salesforce, ServiceNow, etc.) 
   have shipped GA agent features, and what is the reported adoption scope?

[Searching the web…]

In Gemini Gems, the same sequence appears — Gemini shows its Google Search queries inline as it executes each one.

Step 3: Handle disambiguation if the agent pauses and asks.

If your topic is broad or crosses domains, the agent will halt and surface a clarifying question before searching. This is the "Stop and ask the user for clarification" constraint from the system prompt working as designed. Do not skip past this. Answer it directly:

Agent: Your topic spans both autonomous coding agents and autonomous
       enterprise workflow agents. Which domain should I focus on?

You:   Focus on enterprise workflow automation agents — think Salesforce
       Agentforce, ServiceNow, and similar platforms.

A single-sentence reply is enough. The agent re-enters the loop with your constraint applied and continues from where it paused.

Step 4: Receive and review the final output.

Once all searches complete, the agent emits FINDINGS:, then BRIEFING:, then SELF-REVIEW:. Review the self-review verdict first. A PASS means the model reports that it found sources for every claim and stayed within the word limit; it is a self-assessment, so spot-check a source or two. A FAIL entry means it flagged a gap. Read the rationale and decide whether to accept it or prompt a correction:

You:   The second finding has no URL. Please search again specifically for
       that data point or replace it with a verifiable alternative.

The agent treats your follow-up as a new Perception input, re-enters Reasoning, executes the corrective Action, and updates the output.

Step 5: Trigger follow-up tasks in the same session.

The agent retains the full conversation context within the session. You can issue a second task without restarting:

You:   Now do the same briefing but focused on healthcare AI agents specifically.
       Reuse the same format.

The agent carries forward its understanding of the BRIEFING: format from the prior turn — this is working memory in action. It does not re-read the system prompt line by line; the relevant context is already in the window. For a third or fourth task in the same session, watch for context drift: if the model starts skipping the PLAN: step or producing shorter outputs, the context window is becoming crowded with prior turns. Start a fresh session for unrelated topics.

Author's Comment: The invocation message is more consequential than most people realize. Vague goals produce vague plans, which produce vague searches, which produce vague briefings. The model cannot compensate for an underspecified goal — it will fill the ambiguity with plausible-sounding but potentially wrong sub-questions. Treat your invocation message like a spec document for a junior analyst: the more precisely you define scope, output format, and constraints upfront, the less correction you do downstream.

The Loop Trace: Annotated Against Theory

Turn 1 — Perception

You paste the goal into the GPT or Gem. The model's context window now contains:

(1) the system prompt above,

(2) your goal sentence.

This is the agent's complete world state at T=0. There is no memory of prior sessions, no external data — just two text blocks. This is Perception: the model has ingested its environment.

Turn 1 — Reasoning (the PLAN step)

Before touching a search tool, the model emits its plan:

PLAN:
1. What are the most-cited enterprise AI agent deployments in 2025–2026?
2. What productivity or cost metrics have been reported from agentic AI pilots?
3. Which enterprise software vendors have shipped native agent features 
   and what is their adoption rate?

This is the thought: step from ReAct — explicit goal decomposition expressed as text before any action is committed. The system prompt forced this by mandating PLAN: as the first output block. Without that instruction, most models skip straight to searching, which degrades multi-step reliability.

Turns 2–4 — Action (Tool Calls)

The model calls web_search three times — once per sub-question. Each call is a discrete Action: the model emits a structured tool invocation, the platform executes it, and the raw search result is returned. In ChatGPT's interface you see a "Searching the web…" spinner; in Gemini you see the query appear inline. Behind both is the same JSON function call mechanism:

{
  "tool": "web_search",
  "arguments": {
    "query": "enterprise AI agent deployments productivity metrics 2025"
  }
}

Note what the system prompt does here: it caps tool calls at 5 ("Do not exceed 5 web searches"). This is your circuit breaker. Without it, a poorly grounded model can keep searching, inventing new sub-questions to justify more calls. The constraint converts an open-ended loop into a bounded one.
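Inside ChatGPT or Gemini the cap is an instruction the model is asked to follow. In a harness you control, the same cap can be enforced in code. The sketch below assumes the tool-call JSON shape shown above; `make_dispatcher`, `ToolBudgetExceeded`, and the stub `web_search` are hypothetical names for illustration.

```python
import json

class ToolBudgetExceeded(Exception):
    """Raised when the agent asks for more tool calls than allowed."""

def make_dispatcher(tools: dict, max_calls: int):
    calls_made = 0

    def dispatch(raw_call: str) -> str:
        nonlocal calls_made
        call = json.loads(raw_call)          # {"tool": ..., "arguments": {...}}
        if calls_made >= max_calls:
            raise ToolBudgetExceeded(f"limit of {max_calls} tool calls reached")
        calls_made += 1
        return tools[call["tool"]](**call["arguments"])

    return dispatch

# Hypothetical stub standing in for a real search backend.
def web_search(query: str) -> str:
    return f"results for: {query}"

dispatch = make_dispatcher({"web_search": web_search}, max_calls=5)
call = ('{"tool": "web_search", "arguments": '
        '{"query": "enterprise AI agent deployments productivity metrics 2025"}}')
first_result = dispatch(call)
```

The sixth call raises instead of running, so the loop is bounded even if the model ignores the prompt.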

Turns 2–4 — Observation (Result Injection)

Each search result is injected back into the context window as an observation: block. The model reads the returned content, extracts the relevant data point, and notes the source URL. This is the Observation step — the world state is updated with new evidence, and the model re-enters the reasoning phase for the next sub-question.

If a search returns nothing useful, the system prompt's fallback fires: log "no result" and move on. This is explicit failure handling — the model does not retry indefinitely or hallucinate a result. It acknowledges the gap and continues.

Turn 5 — Action (Writing) + Reflection

With three data points in context, the model synthesizes the BRIEFING: block. It then immediately executes the SELF-REVIEW: step — re-reading its own output, checking word count and source coverage, and emitting a pass/fail verdict. This is the critic-actor pattern in miniature: the same model acts as both author and reviewer within a single turn.

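If the self-review fails (word count exceeded, missing source), note that the system prompt as written only asks for a verdict; it does not tell the model to revise. You either prompt the correction yourself, as in Step 4, or add a "revise and recheck on FAIL" line to the prompt. In production frameworks this would be an explicit second agent call. Here, the single-model loop approximates it cheaply.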
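Because the self-review is the model grading itself, the two checks the prompt names (word count and source coverage) are worth verifying deterministically if you copy the output into a script. A minimal sketch, assuming findings arrive as plain strings and that "has a source" means "contains an http(s) URL"; `self_review` is a hypothetical helper.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def self_review(findings: list[str], briefing: str, max_words: int = 320) -> tuple[bool, str]:
    """Return (passed, rationale) for a FINDINGS list and a BRIEFING text."""
    word_count = len(briefing.split())
    if word_count >= max_words:
        return False, f"briefing is {word_count} words, must be under {max_words}"
    unsourced = [item for item in findings if not URL_PATTERN.search(item)]
    if unsourced:
        return False, f"{len(unsourced)} finding(s) have no source URL"
    return True, "all findings sourced and briefing within word limit"

findings = [
    "Trend A data point (https://example.com/a)",
    "Trend B data point, no link",
]
passed, rationale = self_review(findings, "word " * 300)
```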

The Boundary Constraint in Action

The fourth constraint in the system prompt, "Stop and ask the user for clarification if the topic is ambiguous or spans more than one distinct domain," is the human-in-the-loop gate. If you had asked the agent to "research AI in finance," that topic spans trading systems, fraud detection, lending, and compliance. The agent should recognize the ambiguity, halt the loop, and surface a clarifying question before spending 5 search calls on the wrong sub-domain. This maps directly to the human_checkpoint field in the tool-scoping JSON pattern in the Multi-Agent section below.

What This Example Proves

One system prompt, one tool (web search), zero code. And yet the agent exhibits planning, tool use, bounded iteration, explicit failure handling, self-reflection, and a hard stop condition. Every component from the theoretical loop above is present and accounted for.

The gap between this toy example and a production system is not conceptual — the architecture is identical. The gap is in reliability engineering: error rate budgets, observability hooks, retry logic, and the discipline to define denied_tools before you deploy. The concepts scale. The discipline is the hard part.

The Four Capabilities That Make Agents Work

Research from Berkeley and DeepMind has converged on a consistent taxonomy of what separates a capable agent from an overrated wrapper. The four capabilities are planning, tool use, memory, and reflection.

1. Advanced Planning: Algorithmic Goal Decomposition to Mitigate Execution Drift

Agentic workflows lacking structural goal decomposition predictably fail under non-trivial execution depths. In production infrastructure, planning is not merely a long system prompt — it is the runtime capacity of the model to map an abstract macro-goal into a deterministic directed acyclic graph (DAG) of executable sub-tasks, where each node has a defined input, a defined success criterion, and a defined fallback.

The quality of that decomposition is directly correlated with the quality of the system prompt that defines the agent's operating environment. An agent given only a goal and a tool list will produce a shallow, linear plan. An agent given a goal, explicit reasoning instructions, intermediate-step examples, and failure-state definitions will produce a robust DAG. This is why prompt engineering for agents is a structurally different discipline than prompt engineering for answers — the target is not a good response, it is a reliable process that holds under N sequential decisions.
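To make the DAG idea concrete, here is a minimal sketch using Python's standard `graphlib` module. The sub-task names are hypothetical and mirror the briefing example; a real planner would also attach a success criterion and a fallback to each node.

```python
from graphlib import TopologicalSorter

# Each sub-task maps to the set of sub-tasks it depends on.
plan = {
    "search_trend_1": set(),
    "search_trend_2": set(),
    "search_trend_3": set(),
    "write_briefing": {"search_trend_1", "search_trend_2", "search_trend_3"},
    "self_review": {"write_briefing"},
}

sorter = TopologicalSorter(plan)
# static_order() raises graphlib.CycleError if the plan is not a valid DAG.
execution_order = list(sorter.static_order())
```

The three searches may come out in any order, but the briefing always follows them and the review always comes last.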

2. Tool Use: Deterministic Schema Design for Zero-Hallucination Invocations

Tools are what give agents reach beyond the model's training data and the current context window. A tool is, at its simplest, a typed function the model can call by name with structured arguments. The function executes externally, and the result is injected back into context as an observation.

The design of tool schemas is where many agent reliability problems actually originate, not in the model and not in the prompt. How you specify each tool's name, argument types, valid value ranges, and expected output format directly determines whether the model invokes the tool correctly or hallucinates an argument. Ambiguous schema descriptions produce malformed calls. Precisely typed schemas with enum constraints and explicit description fields on every parameter produce far more predictable invocations, and they let you reject bad calls before they execute. Treat your tool schema with the same rigor you would apply to a public API contract.
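As an illustration, here is a small typed schema with an enum constraint and a validator that runs before the tool does. The schema layout, the `recency` parameter, and `validate_call` are assumptions made for this sketch; real platforms typically use JSON Schema for the same purpose.

```python
SEARCH_TOOL_SCHEMA = {
    "name": "web_search",
    "description": "Search the public web and return the top results.",
    "parameters": {
        "query": {"type": str, "description": "Search query text.", "required": True},
        "recency": {"type": str, "description": "How fresh results must be.",
                    "enum": ["day", "week", "month", "any"], "required": False},
    },
}

def validate_call(schema: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = []
    params = schema["parameters"]
    for name in arguments:
        if name not in params:
            problems.append(f"unknown argument: {name}")
    for name, spec in params.items():
        if name not in arguments:
            if spec.get("required"):
                problems.append(f"missing required argument: {name}")
            continue
        value = arguments[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name} must be {spec['type'].__name__}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems

good = validate_call(SEARCH_TOOL_SCHEMA, {"query": "agent adoption", "recency": "month"})
bad = validate_call(SEARCH_TOOL_SCHEMA, {"query": "agent adoption", "recency": "yesterday"})
```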

3. Memory Architecture: Building Persistent State Across Multi-Session Agent Runs

Context windows have grown dramatically: GPT-4.1 and Gemini 1.5 Pro accept around a million tokens or more, and Claude 3.7 Sonnet accepts 200K. For long-running agents, even these are not enough. A job that spans hours or multiple sessions needs a memory architecture beyond a single context window.

Agents typically implement memory at three levels. Working memory is just the active context window — fast, temporary, expensive per token. Short-term memory is a vector store or key-value cache the agent can query and write to during a session. Long-term memory is a persistent database that survives session boundaries, allowing an agent to pick up where it left off days later.
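The three levels can be sketched with the standard library alone. This is an assumption-heavy toy: `AgentMemory` is a hypothetical class, a plain dict stands in for the vector store, and SQLite stands in for whatever persistent database a real agent would use.

```python
import os
import sqlite3
import tempfile

class AgentMemory:
    def __init__(self, db_path: str):
        self.working: list[str] = []           # working memory: the active context
        self.short_term: dict[str, str] = {}   # session cache (stand-in for a vector store)
        self.long_term = sqlite3.connect(db_path)  # persistent store across sessions
        self.long_term.execute(
            "CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)"
        )

    def remember(self, key: str, value: str) -> None:
        """Write a fact to the session cache and to the persistent store."""
        self.short_term[key] = value
        self.long_term.execute(
            "INSERT OR REPLACE INTO facts (key, value) VALUES (?, ?)", (key, value)
        )
        self.long_term.commit()

    def recall(self, key: str):
        """Check the session cache first, then fall back to the database."""
        if key in self.short_term:
            return self.short_term[key]
        row = self.long_term.execute(
            "SELECT value FROM facts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

db_path = os.path.join(tempfile.mkdtemp(), "agent_memory.db")
session_one = AgentMemory(db_path)
session_one.remember("briefing_format", "PLAN / FINDINGS / BRIEFING / SELF-REVIEW")
session_one.long_term.close()

session_two = AgentMemory(db_path)  # a later session: empty working and short-term memory
recalled = session_two.recall("briefing_format")
```

The second session starts with nothing in its cache and still recovers the fact, which is the property that separates long-term memory from the other two levels.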

The production-grade architecture for each layer — including the latency trade-offs between retrieval-augmented memory and full-context injection, and when each approach breaks down — is a topic that deserves its own dedicated treatment. We derived the full framework in Memory, Planning, Tools: The Three Pillars Every Serious AI Power User Must Understand, which covers the engineering decisions that determine whether a multi-session agent remains coherent or accumulates silent state corruption over time.

4. Reflection: Implementing Critic-Actor Loops for Self-Correcting Agent Pipelines

The most underrated capability in the list. Reflection is the agent's ability to evaluate its own outputs and intermediate steps, identify errors, and self-correct before delivering a final result — closing the loop without waiting for a human to catch the mistake.

In practice, reflection is implemented as a second pass: the agent runs a task, then routes the output to a separate critic prompt — or a dedicated critic agent — that evaluates the result against the original goal and emits a structured pass/fail verdict with improvement notes. This critic-actor setup produces measurably better results on complex tasks than single-pass execution. The cost is additional inference calls, but for high-stakes, error-sensitive tasks, the reliability gain justifies it without question.
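A minimal sketch of that second pass, with the actor and critic passed in as plain functions. `run_with_critic`, `stub_actor`, and `stub_critic` are hypothetical; in a real pipeline each would wrap a separate model prompt.

```python
from typing import Callable

def run_with_critic(
    actor: Callable[[str, str], str],
    critic: Callable[[str, str], tuple[bool, str]],
    goal: str,
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Return (final_output, passed, rounds_used)."""
    notes = ""
    output = ""
    for round_number in range(1, max_rounds + 1):
        output = actor(goal, notes)            # actor drafts, using any critic notes
        passed, notes = critic(goal, output)   # critic emits pass/fail plus notes
        if passed:
            return output, True, round_number
    return output, False, max_rounds

# Hypothetical stand-ins for two separate LLM prompts.
def stub_actor(goal: str, notes: str) -> str:
    return f"{goal} (source: https://example.com)" if "source" in notes else goal

def stub_critic(goal: str, output: str) -> tuple[bool, str]:
    if "https://" in output:
        return True, ""
    return False, "add a source for the claim"

final, ok, rounds = run_with_critic(stub_actor, stub_critic, "Agents cut review time")
```

The `max_rounds` cap matters as much as the critic: it bounds the extra inference cost the paragraph above mentions.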

Multi-Agent Systems: When One Is Not Enough

Single-agent architectures have a natural ceiling. A single context window, a single chain of reasoning, a single point of failure. Multi-agent systems distribute the workload across specialized agents coordinated by an orchestrator.

A common pattern: an orchestrator agent receives a high-level goal, breaks it into sub-tasks, and dispatches each to a specialist agent — a researcher, a writer, a code reviewer, a data analyst. Each specialist works within its domain, returns a result, and the orchestrator integrates the outputs into a coherent whole.

This pattern is powerful but introduces new failure modes. Agents can contradict each other. Orchestrators can lose track of partial results. Communication overhead eats into context budgets. The industry has begun addressing this through protocol standardization — Anthropic's Model Context Protocol (MCP) defines a standard interface for LLMs to connect to external tools and data sources, while Google's Agent2Agent (A2A) specification proposes a standard for inter-agent communication. These are not finished standards, but their existence signals that the field is moving from ad-hoc integration to structured interoperability.

Practical Pitfall Avoidance — Scope Leakage: The most common multi-agent failure is scope leakage: a sub-agent interprets its task more broadly than the orchestrator intended and performs actions outside its sanctioned boundary. The mitigation is tight tool scoping — each agent receives only the tools required for its specific role. A concrete enforcement pattern looks like this:
{
  "agent": "researcher",
  "allowed_tools": ["web_search", "read_pdf", "read_url"],
  "denied_tools": ["write_file", "send_email", "execute_code"],
  "max_iterations": 10,
  "human_checkpoint": "before_final_output"
}
Declare the denied_tools list explicitly. An agent with only an implicit boundary tends to drift past it; an explicit list gives the runtime something concrete to enforce.
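One way to enforce a config like this is a check in the code that dispatches tool calls. The field names come from the JSON above; `is_tool_permitted` and the deny-wins, default-refuse behavior are assumptions of this sketch, not a particular framework's semantics.

```python
import json

POLICY = json.loads("""
{
  "agent": "researcher",
  "allowed_tools": ["web_search", "read_pdf", "read_url"],
  "denied_tools": ["write_file", "send_email", "execute_code"],
  "max_iterations": 10,
  "human_checkpoint": "before_final_output"
}
""")

def is_tool_permitted(policy: dict, tool_name: str) -> bool:
    """Deny wins over allow; anything not explicitly allowed is refused."""
    if tool_name in policy["denied_tools"]:
        return False
    return tool_name in policy["allowed_tools"]
```

Refusing unknown tools by default means a tool added later is blocked until someone lists it on purpose.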

The Reliability Problem That Nobody Wants to Talk About

Here is the honest part: current autonomous agents fail at rates that would be unacceptable in any production software system. A 2025 analysis from METR tracking agent performance on real-world tasks showed that even frontier models succeed on only a fraction of multi-step tasks requiring sustained autonomous execution. METR's research on measuring AI ability to complete long tasks documents this gap in detail and frames it as a fundamental reliability challenge, not a marginal engineering issue.

This does not mean agents are not useful — they clearly are, for the right tasks. It means the gap between "impressive demo" and "production system" is larger for agents than for any other AI deployment. The tasks agents handle most reliably share common characteristics: they are well-defined, their success criteria are measurable, they operate in bounded environments with known tool behaviors, and they have human-in-the-loop checkpoints at high-stakes decision points.

The underlying tension here is architectural: traditional software engineering assumes deterministic components, whereas agentic execution is a stochastic state machine in which per-step success probabilities multiply.

Consider a non-trivial workflow requiring $N$ sequential agentic decisions. Even if your frontier model delivers a stellar $p = 0.95$ single-step reliability rate, and assuming steps fail independently, the joint probability of a zero-fault autonomous run is:

$$P(\text{Success}) = p^N$$

At $N = 10$ steps — a modest research-and-write pipeline — that 95% per-step reliability collapses to $0.95^{10} \approx 0.60$. A 40% failure rate on a ten-step task is not an edge case. It is the baseline for any agent deployed without explicit fault-tolerance architecture. At $N = 20$, the same model drops to $\approx 36\%$ success. The math is unforgiving.
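The arithmetic is easy to reproduce. The sketch below assumes, as the formula does, that steps succeed independently with the same probability; the function names are hypothetical.

```python
import math

def run_success_probability(p: float, n: int) -> float:
    """Probability that n independent steps all succeed, each with probability p."""
    return p ** n

def max_steps_for_target(p: float, target: float) -> int:
    """Largest n such that p**n is still at least `target`."""
    return math.floor(math.log(target) / math.log(p))

ten_steps = run_success_probability(0.95, 10)     # about 0.599
twenty_steps = run_success_probability(0.95, 20)  # about 0.358
budget = max_steps_for_target(0.95, 0.90)         # 2, since 0.95**3 is already below 0.90
```

Read the other way around, the last line says that a 95%-reliable step leaves room for only two unguarded steps if you want 90% end-to-end success.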

This is why treating an LLM as a deterministic function call is a catastrophic design error. Engineers must treat the model as a volatile stochastic node inside a rigid, deterministic shell — borrowing fault-tolerance paradigms directly from distributed systems. You do not fix the node; you architect aggressive retry logic, state-rollback fallbacks, and execution circuit breakers around its probabilistic boundaries. The same principle that makes distributed systems engineers paranoid about network partitions should make agent engineers paranoid about per-step inference variance.
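A minimal sketch of that deterministic shell: retries with validation, rollback to the last good state, and a breaker that trips after repeated failures. `GuardedStep`, `CircuitOpen`, and the thresholds are hypothetical choices for illustration, not a standard API.

```python
class CircuitOpen(Exception):
    """Raised when too many consecutive failures have tripped the breaker."""

class GuardedStep:
    def __init__(self, step, validate, max_retries: int = 2, trip_after: int = 3):
        self.step = step                # the stochastic call, e.g. one model invocation
        self.validate = validate        # deterministic check on the step's output
        self.max_retries = max_retries
        self.trip_after = trip_after
        self.consecutive_failures = 0

    def run(self, state: dict) -> dict:
        if self.consecutive_failures >= self.trip_after:
            raise CircuitOpen("breaker open: escalate to a human")
        for _ in range(self.max_retries + 1):
            candidate = self.step(dict(state))  # work on a copy so `state` stays intact
            if self.validate(candidate):
                self.consecutive_failures = 0
                return candidate
            self.consecutive_failures += 1
        return state                            # rollback: hand back the last good state

attempts = {"count": 0}

def flaky_step(state: dict) -> dict:
    """Stand-in for a model call that returns an empty result on the first try."""
    attempts["count"] += 1
    state["summary"] = "" if attempts["count"] < 2 else "three trends found"
    return state

guarded = GuardedStep(flaky_step, validate=lambda s: bool(s.get("summary")))
result = guarded.run({"goal": "brief"})
```

The validator is the important design choice: retries only help if something deterministic can tell a good output from a bad one.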

Agents deployed against open-ended, poorly-defined goals in uncontrolled environments fail early and often. This is not a model limitation waiting to be fixed by the next version — it is a system design problem that requires the same architectural discipline you would apply to any fault-tolerant distributed system.

Prompt Engineering for Agents Is Not What You Think

Most engineers who start building agents try to use the same prompting instincts they developed for conversational AI. Write a detailed system prompt, describe the goal, list the tools. This gets you to about 40% reliability on simple tasks.

Reliable agent prompts require a different structure. The system prompt for an agent needs to specify not just what to do but how to reason about what to do. It needs explicit instructions for handling ambiguity, explicit rules for when to stop and ask for human confirmation, explicit formats for tool call outputs, and explicit recovery behaviors for when a tool fails.

Simon Willison and, separately, the team at Anthropic have written clearly about this. The dominant framing from Anthropic's engineering blog is that the prompt is not a configuration file; it is the agent's operating procedure document, and it needs to be written with the same rigor you would apply to a runbook in a production service.

To bypass these structural failures, engineers must enforce rigid boundary conditions before running inference. Implementing a deterministic scaffolding framework like Prompt Scaffold allows you to systematically isolate Role, Task, Context, Format, and Constraints before the model ever sees a token of your goal. In agentic design, the Constraints field is your architectural guardrail — it codifies exact exception-handling states, human-in-the-loop trigger conditions, and acceptable error budgets that prevent open-ended execution drift. Most first-time agent builders treat the Constraints field as optional. The production failure logs tell a different story.

The full taxonomy of agentic prompt architecture — action instruction schemas, output contract enforcement, and failure-recovery patterns for production systems — is a topic too dense to compress into a section of this article. We laid it out in full in the Prompt Engineering Playbook for Autonomous AI Agent Systems, which derives the structural differences between conversational and agentic prompting from first principles and provides the exact templates that production agent systems require.

The Shift in Mental Model That Changes Everything

Conversational AI rewards good question-askers. You get better at formulating queries; the model gives you better answers. The skill is fundamentally linguistic.

Agentic AI rewards systems thinkers. You get better at defining goals, scoping tool access, designing feedback loops, and setting failure boundaries. The skill is fundamentally architectural.

This shift has real implications for who builds well with AI going forward. Developers who approach agents as fancy chatbots will build systems that look impressive in demos and fail in production. Developers who approach agents as distributed systems with probabilistic components — applying the same rigor they would to any asynchronous, fault-tolerant architecture — will build systems that reliably deliver value.

The quiet rise of autonomous agents is not a trend to watch from a distance. It is an infrastructure shift that is already happening in production systems across industries. The engineers who understand the underlying mechanics — the loop structure, the memory architecture, the tool design, the reliability constraints — will be the ones building the systems that everyone else uses.

The technology is less magical than the demos suggest. It is also more consequential than most people have calibrated for.

The first malicious MCP server was one line of code: the postmark-mcp rug pull

Brenn Hill — Tue, 30 Jun 2026 12:00:00 +0000

In September 2025, security researchers at Koi Security found what's widely described as the first in-the-wild malicious MCP server. It wasn't a sophisticated zero-day. It was one added line in an email tool.

What happened

postmark-mcp is an npm package that gives an AI agent a tool for sending email through Postmark. For fifteen releases — versions 1.0.0 through 1.0.15 — it did exactly that, and nothing else. It got adopted, it got trusted, it landed in people's daily agent workflows. By the time it mattered, it was pulling roughly 1,500 downloads a week.

Then version 1.0.16 shipped on September 17, 2025. The diff was small enough to miss in a glance: the send-email function gained a Bcc field pointing at phan@giftshop[.]club, a domain the maintainer controlled. Every email the agent sent — content, recipients, attachments, whatever secrets or PII happened to be inside — got silently copied to the attacker.

Nothing else changed. The tool still sent your email correctly. From the outside, and from the agent's perspective, it worked. That's the whole trick: the malicious version was indistinguishable in behavior from the benign one, except for the carbon copy you couldn't see.

Anyone on auto-update inherited the backdoor the moment they pulled the new version. The package was downloaded 1,643 times in total before it was removed from npm. Postmark, the company, confirmed it had nothing to do with the package — the name just borrowed their credibility.

Why it matters

The uncomfortable lesson here isn't "audit your dependencies." Plenty of people had effectively audited this one — it was fine for fifteen versions. The lesson is that approval isn't permanent.

When you vet a tool, you vet a specific version's behavior at a specific moment. An MCP server can change its tool definitions and its actual behavior in any later release, and the agent — which trusts the tool to describe itself honestly — has no built-in way to notice. This is the "rug pull": vetted and benign, then quietly hostile, with the trust you extended earlier carried forward to code you never looked at.

MCP makes this sharper than a normal dependency bump, because these tools run with real authority inside your agent's loop. An email tool can read and send mail. A filesystem tool can read and write files. The blast radius of a hostile update is whatever you granted the tool on the day you trusted it.

The practitioner takeaway

You can't manually re-read every dependency on every update. But you can make "the tool changed" a thing your system notices instead of a thing it silently accepts.

Pin versions. Auto-update is what turned a malicious release into mass exposure. Pin MCP servers and their dependencies to exact versions, and treat a version bump as a change that needs a human, not a default.
Fingerprint tools at approval time. When you vet a tool, record a fingerprint — the package version and integrity hash, plus the tool's declared schema and description. That's the thing you actually approved.
Re-check the fingerprint on every load. Before an agent uses a tool, compare its current fingerprint to the approved one. A postmark-mcp running 1.0.15 and one running 1.0.16 should not look the same to your system.
Treat a moved fingerprint as hostile until proven otherwise. If the hash, version, or tool definition changed and nobody re-approved it, fail closed. Don't run the tool, don't pass it secrets, and surface the diff to a human. A changed tool definition is exactly the signal a rug pull produces.
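The fingerprint-and-recheck steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: `tool_fingerprint` and `check_before_load` are hypothetical helpers, the integrity strings are placeholders, and the tool definition is invented for the example, not taken from the real package.

```python
import hashlib
import json

def tool_fingerprint(package: str, version: str, integrity: str, tool_definition: dict) -> str:
    """Hash everything that was approved: package identity plus the declared tool schema."""
    canonical = json.dumps(
        {"package": package, "version": version, "integrity": integrity,
         "definition": tool_definition},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def check_before_load(approved: str, current: str) -> None:
    """Fail closed: refuse to load a tool whose fingerprint has moved."""
    if approved != current:
        raise PermissionError("tool fingerprint changed since approval; re-review required")

# Illustrative values only: the integrity strings and schema are placeholders.
definition = {"name": "send_email", "parameters": ["to", "subject", "body"]}
approved = tool_fingerprint("postmark-mcp", "1.0.15", "sha512-PLACEHOLDER-A", definition)
updated = tool_fingerprint("postmark-mcp", "1.0.16", "sha512-PLACEHOLDER-B", definition)
```

Note that the two fingerprints differ even though the declared schema is identical, because the version and integrity hash are part of what gets hashed. That matters for an attack where the tool still describes itself the same way.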

None of this requires catching the malicious line by reading it. It requires noticing that something changed in a tool you'd already decided to trust — which is the one signal this attack couldn't hide.

This incident is one of the sources behind BRACE, an open, vendor-neutral framework for securing autonomous AI agents. Its ecosystem guide covers vetting tools and re-checking them on every load. BRACE is built by reading the incidents and the research and asking, each time: what concrete control would have prevented or contained this?

The Audit Tax: Why Your Agent Made You Slower

Ben Stanley — Tue, 30 Jun 2026 11:30:38 +0000

Originally published in Temrel, a weekly newsletter on agentic engineering.

You ask an agent to code an update. It takes about 90 seconds to produce the PR. You then spend the next 90 minutes reading it line by line to see if you trust it. You might, whisper it, be shipping code even slower than you were before.

Agent-based development velocity is borrowed time, re-invoiced with interest at review time. The agent writes the PR in seconds; you pay for that speed in the time it takes to decide whether to trust what it has written. This is the Audit Tax.

This is a deliberate sequel to last week's "Stop prompting, start looping." Verification was one of our six dials, and today we focus on that one.

The bottleneck moved while you were watching the leaderboard

Code generation is effectively solved. By mid-2026, even the die-hard holdouts can't seriously argue that coding agents underperform humans in commercial environments. The hard part now is verification.

The old scoreboard measures the wrong thing: model benchmarks, tokens per second, and the rest. The real measurement is how quickly agent-produced code gets into production.

According to LinearB's 2026 Software Engineering Benchmarks Report, AI PRs take 4.6x longer to get reviewed. That is a product of higher volume and faster delivery, and it is the biggest blocker to AI engineering productivity.

Reviewing agent code is harder than reviewing human code

Verification is harder than it looks. You can't interrogate the agent and trust the answer; the hallucination might be buried in the reasoning. Your old heuristics for reviewing human code are unfit for the task:

Agent-written PRs always look clean and self-confident, whether they work or not. Sloppy formatting and thin documentation no longer signal a weak PR, so you can't kick it back on those grounds.

Enforcing small diffs doesn't work either. Try it and "4.6x longer" becomes a stretch goal; you'll be drowning in PRs forever.

Individual reliability means nothing now. John, the old hand who always shipped clean code and earned a cursory review? John's gone. There's just Claude now.

And don't forget: you contribute to The Sloppening every time you push slop to the codebase.

Stop paying the tax by hand. Build the verification layer.

Get your cheap, deterministic gates in first: typecheck, tests, lint, build. You already have them, they're virtually free and fast, and they catch stupid mistakes. Anthropic calls these code-based graders.

Then add a review subagent, which Anthropic calls a model-based grader. Have it check the diff against the stated intent, not just whether it builds and runs.

Then human-in-the-loop: a person's eyes on anything that survives the deterministic and agent-review gates. The machines clear the early hurdles, and the human lets the output hit production. Anthropic calls these human graders.
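The three layers above can be wired as an ordered pipeline that stops at the first failure, so the cheap checks always run before the expensive ones. This is a minimal sketch, assuming each gate is a plain callable that takes the diff text; `Gate`, `run_gates`, and the stand-in checks are hypothetical names for illustration, not part of any Anthropic tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Gate:
    name: str
    check: Callable[[str], bool]  # takes the diff text, returns True on pass


def run_gates(diff: str, gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the name of the first failing gate, or None."""
    for gate in gates:
        if not gate.check(diff):
            return gate.name
    return None


# Stand-in checks. Real ones would shell out to the test runner, call a
# review subagent, and wait for a human approval.
gates = [
    Gate("deterministic", lambda diff: "TODO" not in diff),
    Gate("model_review", lambda diff: len(diff) < 10_000),
    Gate("human", lambda diff: True),
]

first_failure = run_gates("def add(a, b): return a + b", gates)
print(first_failure or "all gates passed")
```

Ordering the list from cheapest to most expensive is the design choice that matters here: a diff that fails the typecheck never costs a model call or a person's attention.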

Evals make verification repeatable, not vibes

Anthropic recommends starting evals early, and so do I. Record the cases where the agent misses requirements, and once you have around 20, start building your evals.

Add your deterministic checks plus an LLM-as-judge for the fuzzy intent. Wire them to triggers so you don't kick them off by hand.

There's an in-depth Anthropic blog on methodology that is lighter on technical implementation. Take that as a sign of how early this step in the agentic loop still is.

Action steps (do this week)

Measure your tax: time-to-generate a PR versus time-to-merge it. The gap is the bill.
Add one mandatory CI gate the agent cannot merge past (start with tests or typecheck).
Stand up a 20-case eval from last month's actual agent failures.
Add a "review" pass that checks diffs against intent before they reach you.
Re-measure the gap. Watch the tax drop.
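The first and last steps can be sketched in a few lines. The sample below assumes you can export three values per PR (when it was opened, when it was merged, and how long the agent spent generating it); the timestamps are made-up illustration data, and `audit_tax` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from statistics import median


def audit_tax(opened_at: datetime, merged_at: datetime, generation_time: timedelta) -> timedelta:
    """Time-to-merge minus time-to-generate for one PR."""
    return (merged_at - opened_at) - generation_time


# (opened, merged, agent generation time): made-up sample data
prs = [
    (datetime(2026, 6, 1, 9, 0), datetime(2026, 6, 1, 15, 0), timedelta(minutes=2)),
    (datetime(2026, 6, 2, 9, 0), datetime(2026, 6, 3, 9, 0), timedelta(minutes=5)),
    (datetime(2026, 6, 4, 9, 0), datetime(2026, 6, 4, 10, 0), timedelta(minutes=1)),
]

taxes = [audit_tax(*pr) for pr in prs]
median_tax = median(taxes)
print(f"median audit tax: {median_tax}")
```

Tracking the median rather than the mean keeps one PR that sat over a weekend from dominating the number.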

Why this matters

This is the reframing of the dev career ladder. We started with context engineering (2024), then loop engineering (2026). Follow the thread and you become one of the top players in software development, set up well for what's next.

Whoever owns verification owns the bottleneck, and whoever owns the bottleneck owns the leverage. Code generation is solved. The tax is rigorous evaluation.

Pay the tax on purpose, or pay it by accident.

Subscribe to Temrel for weekly agentic engineering field notes.

How to Give Your AI Agent Access to Upwork Data

AlterLab — Tue, 30 Jun 2026 11:21:48 +0000

This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.

TL;DR

Use AlterLab's Extract API to turn Upwork job pages into structured JSON. Your AI agent can call the API directly, receive clean data, and feed it into an LLM for market intelligence, skill tracking, or rate monitoring—no HTML parsing needed.

Why AI agents need Upwork data

AI agents benefit from fresh, structured web data for several agentic use cases:

Freelance market intelligence: Track demand for skills, average rates, and job volume over time.
Skill demand monitoring: Identify which technologies or services are gaining traction in the freelance marketplace.
Rate analysis: Compare compensation trends across regions or experience levels to inform pricing strategies.

These insights feed RAG pipelines, tool calls, and knowledge base updates that keep agents current without manual scraping.

Why raw HTTP requests fail for agents

Direct HTTP calls to Upwork often break agent pipelines:

Rate limiting: IP bans or CAPTCHAs cause failed requests and wasted token budgets on retries.
JavaScript rendering: Modern pages rely on client‑side code; raw HTML lacks the data you need.
Bot detection: Headless browser signatures trigger blocks, requiring complex mitigation.
Parsing overhead: Agents spend cycles extracting fields from noisy HTML instead of reasoning.

The result is brittle pipelines, higher latency, and increased cost per successful data point.

Connecting your agent to Upwork via AlterLab

AlterLab handles anti‑bot measures, rendering, and extraction so your agent receives structured output. Use the Extract API for schema‑driven JSON or the Scrape API for raw HTML when you need full page control.

Structured extraction with the Extract API

Define a schema that matches the Upwork job fields you need—title, price, description, etc.—and let AlterLab return clean data.

```python title="agent_upwork_extract.py" {3-8}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Request structured data from an Upwork job listing
result = client.extract(
    url="https://www.upwork.com/jobs/~0123456789abcdef",
    schema={
        "title": "string",
        "price": "string",
        "description": "string",
        "skills": "list[string]"
    }
)

# result.data is a dict ready for your LLM
print(result.data)
```

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract/templates/{template_id} \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.upwork.com/jobs/~0123456789abcdef",
    "schema": {
      "title": "string",
      "price": "string",
      "description": "string",
      "skills": "list[string]"
    }
  }'
```

Both examples return a JSON object that your agent can pass directly to an LLM call, saving tokens and eliminating parsing logic.

For cases where you need the full rendered page (e.g., to run custom logic), use the Scrape API:

```python title="agent_upwork_scrape.py" {3-6}
# Reuses the client created in the Extract example
result = client.scrape(
    url="https://www.upwork.com/jobs/~0123456789abcdef",
    formats=["html"],  # returns cleaned HTML ready for downstream parsing
)
```

Refer to the Extract API docs (/docs/extract) for schema options and rate limits.

Using the Search API for Upwork queries

When you need to discover jobs matching a query (e.g., “Python Django”), AlterLab’s Search API lets you retrieve results without building a crawler.



```python title="agent_upwork_search.py" {3-7}
# Assume you have previously created a search template via the dashboard or API
search_id = "upwork-python-jobs"

result = client.search(
    search_id=search_id,
    params={"q": "Python Django", "page": 1}
)

# result.data contains an array of structured job objects
for job in result.data["items"]:
    print(job["title"], job["price"])
```

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/search/upwork-python-jobs \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "Python Django", "page": 1}'
```

The Search API returns the same structured format as Extract, making it easy to plug into agentic workflows.


MCP integration

AlterLab provides an MCP server that exposes its APIs as tool calls for agents built with Claude, GPT, or Cursor. Register the MCP server in your agent’s toolkit and invoke Upwork extraction as a standard function call. See the AlterLab for AI Agents glossary (https://alterlab.io/glossary/user-agent) for setup details.

Building a freelance market intelligence pipeline

Here is an end‑to‑end example showing how an agent can collect Upwork data, enrich it, and store insights.

1. Agent triggers a tool call: The LLM decides it needs current freelance rates for “React Native”.
2. AlterLab fetches and extracts: The agent calls the Extract API with a schema for title, price, and skills. AlterLab handles rendering and anti‑bot measures, and returns JSON.
3. Agent processes the data: The structured output is passed to a summarization LLM or stored in a knowledge base.
4. Pipeline repeats on a schedule: Using cron or an internal scheduler, the agent refreshes the dataset hourly.



```python title="freelance_pipeline.py" {3-15}
import alterlab
from openai import OpenAI

alterlab_client = alterlab.Client("YOUR_ALTERLAB_KEY")
llm_client = OpenAI(api_key="YOUR_OPENAI_KEY")

def fetch_upwork_jobs(query: str, limit: int = 20):
    """Retrieve structured job data for a given query."""
    search_id = "upwork-freelance-search"
    resp = alterlab_client.search(
        search_id=search_id,
        params={"q": query, "limit": limit}
    )
    return resp.data.get("items", [])

def enrich_with_llm(jobs):
    """Ask the LLM to extract trends from raw job listings."""
    prompt = (
        "Analyze the following Upwork job listings and summarize:\n"
        "- Median hourly rate\n"
        "- Top 5 requested skills\n"
        "- Any notable changes from the previous report\n\n"
        f"Jobs: {jobs}"
    )
    completion = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return completion.choices[0].message.content

def main():
    jobs = fetch_upwork_jobs("React Native")
    insight = enrich_with_llm(jobs)
    # Store insight in a database or trigger a notification
    print("Market insight:", insight)

if __name__ == "__main__":
    main()
```

The pipeline uses AlterLab as a reliable data source, letting the agent focus on reasoning rather than navigating anti‑bot measures.
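The prompt in the pipeline asks the LLM for a median hourly rate, but arithmetic like that can be done deterministically before the model call, leaving the LLM to interpret rather than calculate. Here is a rough sketch of that step. It assumes `price` strings look like "$45.00/hr", which is a guess about the output format rather than something the AlterLab docs quoted here confirm; `hourly_rate` and `median_hourly_rate` are hypothetical helpers.

```python
import re
from statistics import median
from typing import Iterable, Optional

# Assumed format: a dollar amount followed by "/hr", e.g. "$45.00/hr".
_RATE = re.compile(r"\$\s*([0-9][0-9,]*(?:\.[0-9]+)?)\s*/\s*hr", re.IGNORECASE)


def hourly_rate(price: str) -> Optional[float]:
    """Parse an hourly rate from a price string, or return None if it isn't hourly."""
    match = _RATE.search(price)
    return float(match.group(1).replace(",", "")) if match else None


def median_hourly_rate(jobs: Iterable[dict]) -> Optional[float]:
    """Median over the jobs that carry a parseable hourly rate."""
    parsed = (hourly_rate(job.get("price", "")) for job in jobs)
    rates = [rate for rate in parsed if rate is not None]
    return median(rates) if rates else None


jobs = [
    {"title": "React Native dev", "price": "$40.00/hr"},
    {"title": "Mobile app fix", "price": "$500 fixed"},  # skipped: not hourly
    {"title": "Expo migration", "price": "$60/hr"},
]
print(median_hourly_rate(jobs))
```

Passing the computed figure into the prompt, instead of the raw listings alone, also trims tokens and removes one thing the model could get wrong.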

Key takeaways

Structured extraction removes HTML parsing overhead and improves token efficiency.
AlterLab’s built‑in anti‑bot handling delivers reliable data for agentic pipelines.
Use the Search API for discovery and the Extract API for precise field selection.
Integrate via MCP to treat AlterLab as a standard tool call in LLM agents.
Review the AlterLab pricing page to estimate costs for your agent’s data volume.

Hit reply if you have questions.

How to Give Your AI Agent Access to Seeking Alpha Data

AlterLab — Tue, 30 Jun 2026 11:21:47 +0000

Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.

TL;DR

To give an AI agent access to Seeking Alpha data, connect it to the AlterLab Extract API. This allows your agent to request a URL and receive structured JSON instead of raw HTML, making it compatible with RAG pipelines and tool-calling-based reasoning without manual parsing.

Why AI Agents Need Seeking Alpha Data

Standard LLMs are limited by their training cutoff. For financial agents, this means they are blind to current market sentiment, recent earnings transcripts, and real-time stock analysis. To build a production-grade investment agent, you must bridge the gap between the LLM and live web data.

High-performing agentic workflows use Seeking Alpha data for:

Investment Research Monitoring: Agents that track specific tickers and summarize new analysis articles as they are published.
Earnings Analysis: Automatically pulling key metrics from earnings summaries to compare against historical trends in a RAG (Retrieval-Augmented Generation) database.
Stock Discussion Pipelines: Monitoring sentiment in public comment sections to provide a "market mood" metric for a broader investment tool.

Why Raw HTTP Requests Fail for Agents

If you attempt to use a simple requests.get() or fetch() call within a tool-call loop, your agent will likely fail. Seeking Alpha uses sophisticated anti-bot protections that detect non-browser signatures.

When an agent hits a wall, it doesn't just "get the wrong data"; it wastes your most expensive resource: the LLM's context window. Instead of getting financial data, your agent receives a 403 Forbidden error or a CAPTCHA challenge. This results in:

Token Waste: The agent tries to "reason" through an error page, consuming tokens for no value.
Broken Pipelines: An agent that cannot fetch data cannot complete its tool-calling loop, causing the entire task to crash.
Rate Limiting: Repeatedly hitting a site with the same signature will lead to an IP ban, breaking your agent's ability to access any data from that source.

Connecting Your Agent to Seeking Alpha via AlterLab

The most efficient way to feed data to an agent is via structured extraction. Rather than passing raw HTML into an LLM—which is noisy and expensive—you should use the AlterLab Extract API. This transforms a webpage into a clean JSON object that fits perfectly into a prompt.

Using the Extract API

The Extract API uses predefined templates to turn any URL into structured data. This is the preferred method for RAG pipelines because it minimizes the token count significantly.

```python title="agent_extraction.py" {3-8}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Extract structured data directly for the agent's context window
result = client.extract(
    url="https://seekingalpha.com/article/example-article-id",
    schema={
        "article_title": "string",
        "author": "string",
        "sentiment": "string",
        "key_points": "array of strings"
    }
)

# Pass this clean JSON directly to your LLM
print(result.data)
```

Alternatively, you can use curl for lightweight server-side implementations:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract/templates/{template_id} \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://seekingalpha.com/example",
    "schema": {"title": "string", "author": "string"}
  }'
```

For more details on schema definitions, check our Extract API docs. If you are building a production service, refer to our Getting started guide to set up your environment.

Searching for Financial Data at Scale

Sometimes your agent doesn't have a specific URL but rather a query (e.g., "Find recent sentiment for $TSLA"). In these cases, the Search API allows your agent to perform queries against the web and receive structured results.

An agentic workflow would look like this:

Agent identifies a need for new data.
Agent generates a search query.
Agent calls the AlterLab Search tool.
AlterLab returns a list of URLs and metadata.
Agent selects the most relevant URL and calls the Extract API.
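The search-then-extract loop above can be sketched with the two tools passed in as plain functions, which keeps the control flow testable offline. Everything here is illustrative: `research`, `fake_search`, and `fake_extract` are hypothetical names, the result shape (`url` and `title` keys) is an assumption, and the keyword match stands in for the relevance decision an LLM would normally make.

```python
from typing import Callable, Dict, List, Optional

SearchFn = Callable[[str], List[Dict[str, str]]]  # query -> [{"url": ..., "title": ...}]
ExtractFn = Callable[[str], Dict[str, object]]    # url -> structured fields


def research(query: str, search: SearchFn, extract: ExtractFn) -> Optional[dict]:
    """Search, pick the first hit whose title mentions every query term, extract it."""
    results = search(query)
    terms = query.lower().split()
    for hit in results:
        if all(term in hit["title"].lower() for term in terms):
            return extract(hit["url"])
    # Fall back to the top result when no title matches every term.
    return extract(results[0]["url"]) if results else None


# Fake tools so the sketch runs offline; swap in real API calls.
def fake_search(query: str) -> List[Dict[str, str]]:
    return [
        {"url": "https://example.com/a", "title": "Market wrap"},
        {"url": "https://example.com/b", "title": "TSLA sentiment after earnings"},
    ]


def fake_extract(url: str) -> Dict[str, object]:
    return {"url": url, "sentiment": "neutral"}


print(research("TSLA sentiment", fake_search, fake_extract))
```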

MCP Integration: Giving Claude and GPT-4 Real-World Access

The Model Context Protocol (MCP) is becoming the standard for connecting LLMs to external data sources. By using AlterLab as an MCP server, you can give agents like Claude or custom-built GPTs the ability to "browse" Seeking Alpha as a tool. This transforms the agent from a static text generator into a dynamic researcher capable of real-time market analysis.

Learn more about how we support this via our User Agent glossary.

Building an Investment Research Monitoring Pipeline

To build a professional-grade monitoring system, you need to move away from manual scripts and toward automated pipelines. A robust architecture looks like this:

Trigger: A cron job or a webhook signals a new article.
Extraction: AlterLab fetches the article, bypasses bot detection, and returns structured JSON via a Webhook.
Reasoning: The LLM receives the JSON, compares it against your investment thesis, and decides if action is required.
Action: The agent posts a summary to Slack or updates a database.

Implementation Example: The Monitoring Loop

```python title="monitoring_pipeline.py" {2,5,8-12}
import alterlab
import openai

client = alterlab.Client("YOUR_API_KEY")
llm = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def monitor_ticker(url):
    # 1. Get clean data from AlterLab
    raw_data = client.extract(url=url, schema_id="seeking_alpha_article")

    # 2. Feed structured data to LLM for reasoning
    response = llm.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a financial analyst. Summarize the sentiment of this article."},
            {"role": "user", "content": f"Data: {raw_data.data}"}
        ]
    )
    return response.choices[0].message.content

# Example URL
print(monitor_ticker("https://seekingalpha.com/article/example"))
```

Key Takeaways

Structured over Raw: Never feed raw HTML into an LLM. Use the Extract API to minimize token usage and maximize reasoning quality.
Avoid the Retry Loop: Building your own proxy rotation is a waste of engineering time. Let the API handle the heavy lifting of bot detection.
Agentic Tools: Use the MCP pattern to give your agents native access to web data without writing custom scrapers for every site.

By implementing these patterns, you move from "scraping websites" to "orchestrating data pipelines," creating agents that can actually act on real-world information.

AlterLab // Web Data, Simplified.

SGLang v0.5.14: LPLB Expert-Parallel Load Balancing

pueding — Tue, 30 Jun 2026 11:19:23 +0000

What: The SGLang v0.5.14 release ships LPLB — a linear-programming load balancer for serving mixture-of-experts models, where the experts are split across many GPUs and each step routes every token to a few of them.

Why: In expert-parallel MoE serving, token routing is uneven and shifts every step, so one overloaded GPU stalls the whole step at a sync barrier; evening that load is what unlocks throughput on big MoE models like DeepSeek-V4.

vs prior: Earlier setups used static, hand-tuned expert placement and ate the imbalance; LPLB keeps redundant replicas of the hot experts and solves a small linear program each step to minimize the busiest GPU's share of the work.

Think of it as

A warehouse store opening duplicate counters to even out the longest line.

       40% of this step's tokens want one hot expert

   WITHOUT LPLB                 WITH LPLB (3 replicas)
   ┌──────────────┐             ┌──────────────┐
   │ GPU1 ####### │ 40%         │ GPU1 ##      │ 14%
   │ GPU2 #       │  5%         │ GPU2 ##      │ 14%
   │ GPU3 #       │  5%         │ GPU3 ##      │ 14%
   └──────┬───────┘             └──────┬───────┘
          ▼                            ▼
   barrier waits on GPU1        lanes finish together
   ✗ others idle ~1/3 step      ✓ idle time deleted

a customer = a token routed to its experts this step
a specialty counter = an expert (a sub-network in a mixture-of-experts model)
a checkout lane = a GPU the experts are spread across
one counter mobbed while others sit idle = per-GPU load imbalance
duplicate copies of the busy counter = redundant expert replicas
the floor manager who evens the longest line each wave = LPLB

Quick glossary

MoE (Mixture-of-Experts) — A model whose feed-forward layer is split into many experts (sub-networks); a small router sends each token to only a few. Total parameters are huge, but the active ones per token stay small. DeepSeek-V4 is a large MoE model.

Expert parallelism (EP) — The serving layout that spreads a MoE's experts across many GPUs, because all the experts together do not fit on one. Each step, tokens must be shipped to whichever GPU holds their chosen expert and the results shipped back.

Load imbalance — When this step's router sends far more tokens to some experts than others, the GPUs holding the popular experts get swamped while the rest sit idle. The pattern is data-dependent, so it shifts batch to batch.

Redundant expert replicas — Keeping extra copies of the hot experts on several GPUs so their token load can be split, instead of one GPU owning a popular expert alone. The balancer decides how to divide each expert's tokens among its copies.

LPLB — SGLang's Linear-Programming Load Balancer. Each step it solves a tiny linear program over the current token counts to assign load across replicas so the maximum per-GPU load is as small as possible (a min-max objective).

Waterfill — The second expert-parallel balancer the release ships alongside LPLB. SGLang names it but does not detail how it works; the name points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to solving the LP each step.

All-to-all — The expert-parallel communication step that ships tokens out to their experts' GPUs and the results back. It runs every layer and waits for the slowest GPU, which is why imbalance is so costly here.

The news. On June 26, 2026, the SGLang team released v0.5.14, with work from 56 contributors. The headline is 5x higher throughput at the same interactivity serving DeepSeek-V4 on NVIDIA GB300, driven by two new expert-parallel load balancers, Waterfill and LPLB (a linear-programming load balancer), plus CuteDSL prefill kernels for Blackwell and int8 checkpoint pooling for linear-attention prefix caches.

Picture a warehouse store at peak rush. The checkout lanes are the GPUs; the specialty counters — deli, pharmacy, bakery — are the model's experts, and because no single lane can hold them all, the store spreads the counters across the lanes. That spread is expert parallelism: a mixture-of-experts model has too many experts to fit on one GPU, so they live across many, and each decode step the router sends every customer (token) to the one or two counters they need. The trouble is that the rush is lumpy. This wave, everyone wants the deli; next wave, the pharmacy. So one counter gets mobbed while the rest stand idle — and the store can't close out the rush until that longest line clears.

That last clause is the whole problem, because the lanes do not finish independently. Every GPU has to meet at a sync barrier — the all-to-all that ships tokens to their experts and the answers back — and that barrier waits for the slowest lane. The GPU holding this step's most popular expert therefore sets the pace for all of them, and the fast lanes burn the difference as idle time. Add more GPUs and the imbalance can get worse, not better, because the hot expert still lives on one lane while you have paid for more lanes to stand around.

SGLang v0.5.14's fix is to stop letting one counter bottleneck the floor. It keeps redundant replicas of the hot experts — duplicate deli counters on several lanes — and then, each wave, the floor manager solves a quick assignment problem: given how many customers want each counter right now, divide every counter's line across its copies so the busiest lane does as little as possible. That floor manager is LPLB, and "as little as possible" is literal: it solves a small linear program whose objective is to minimize the maximum per-GPU load (a min-max). Waterfill is the other balancer the release pairs it with, and SGLang does not spell out how it works. The name, though, points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to running the LP every step.
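For readers who want the objective spelled out, here is one way to write a min-max assignment like this as a linear program. The notation is an assumption made for illustration, since the release does not publish its exact formulation: $n_e$ is the number of tokens routed to expert $e$ this step, $R(e)$ is the set of replicas of $e$, $g(r)$ is the GPU hosting replica $r$, $x_{e,r}$ is the number of tokens sent to replica $r$, and $t$ is the load of the busiest GPU.

```latex
\begin{aligned}
\min_{x,\,t}\quad & t \\
\text{s.t.}\quad & \sum_{r \in R(e)} x_{e,r} = n_e && \text{for every expert } e, \\
& \sum_{e}\ \sum_{\substack{r \in R(e) \\ g(r) = g}} x_{e,r} \le t && \text{for every GPU } g, \\
& x_{e,r} \ge 0 && \text{for every expert } e \text{ and replica } r.
\end{aligned}
```

The first constraint says every token gets served by some copy of its expert; the second caps each GPU's total at $t$, so minimizing $t$ minimizes the longest line.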

Hold the layout fixed and walk the imbalance math (illustrative — the release reports only the end-to-end 5x). Say 8 GPUs serve a batch, and the router sends 40% of this step's tokens to one hot expert that lives on a single GPU, while another GPU draws just 5%. The step can't end until that one GPU finishes its 40%, so the other seven idle for roughly a third of the step — you own 8 GPUs but move at the speed of the busiest one. Now place 3 replicas of that hot expert and let LPLB split its tokens across them: its share per GPU falls from 40% toward about 14%, the barrier wait shrinks sharply, and the lanes finish much closer together. The win isn't a faster kernel — it's deleting the idle time that imbalance was manufacturing.
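To make the splitting step concrete, here is a water-filling split for a single hot expert: given the load already sitting on each replica's GPU, raise the least-loaded ones to a common level until the hot expert's tokens are used up. For one expert this minimizes the busiest replica GPU. It is an illustration of the idea only, not SGLang's Waterfill or LPLB code, and the loads are made-up percentages.

```python
from typing import List


def waterfill(base_loads: List[float], tokens: float) -> List[float]:
    """Split `tokens` across replicas so the highest resulting load is minimal.

    base_loads[i] is the load already on the GPU hosting replica i.
    Returns the tokens assigned to each replica.
    """
    order = sorted(range(len(base_loads)), key=lambda i: base_loads[i])
    level = 0.0
    k = 0
    # Grow the set of least-loaded replicas until the common level they reach
    # no longer exceeds the next replica's existing load.
    for k in range(1, len(order) + 1):
        level = (tokens + sum(base_loads[i] for i in order[:k])) / k
        if k == len(order) or level <= base_loads[order[k]]:
            break
    shares = [0.0] * len(base_loads)
    for i in order[:k]:
        shares[i] = level - base_loads[i]
    return shares


# Hot expert draws 40% of the step's tokens; its three replicas sit on GPUs
# already carrying 0%, 5%, and 5% from other experts (made-up numbers).
base = [0.0, 5.0, 5.0]
shares = waterfill(base, 40.0)
print([round(s, 2) for s in shares])
print(round(max(b + s for b, s in zip(base, shares)), 2))
```

With these inputs the busiest replica GPU ends up near 17% instead of 40%. The LP generalizes this by balancing all experts and all GPUs at once rather than one expert at a time.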

| Expert-parallel balancing | How it assigns load | Per-step cost | Balance quality |
| --- | --- | --- | --- |
| Static / hand-tuned placement | fixed expert→GPU map, set before serving | ~none | poor under shifting, data-dependent routing |
| Waterfill (this release) | the release's second balancer; name implies water-filling, internals not detailed | not stated | a lighter companion to LPLB (inferred from the name) |
| LPLB (this release) | solves a linear program to minimize the busiest GPU's load | a small solve each step | tightest: a min-max optimum over replicas (SGLang v0.5.14) |

Where it earns its keep is exactly the regime DeepSeek-V4 lives in: a large MoE served with expert parallelism across many Blackwell GPUs, where the all-to-all and its sync barrier are a leading cost in each decode step. The release's headline — 5x higher throughput at the same interactivity — is a goodput claim: more tokens per second without making any single user wait longer. Read it as the lanes finishing together instead of seven of them waiting on one — the same hardware, far less idle time.

Related explainers

SGLang v0.5.12 — TokenSpeed MLA backend — the prior SGLang release, a kernel-level cache-write win rather than a scheduling one
Manifold Power Iteration — MoE router alignment — the other MoE balance problem: which expert a token picks (router design), not where that expert runs
GLM-5.2 — active vs total parameters — why MoE serving is its own discipline: huge total weights, small active compute per token

FAQ

What is LPLB (linear-programming load balancing)?

LPLB is the Linear-Programming Load Balancer added in SGLang v0.5.14. When a mixture-of-experts model is served with expert parallelism — its experts split across many GPUs — the router sends an uneven, step-by-step-changing number of tokens to each expert, so some GPUs get swamped while others idle. LPLB keeps redundant replicas of the hot experts and, each step, solves a small linear program over the current token counts to divide every expert's load across its replicas so the maximum per-GPU load is minimized. Evening the load shrinks the wait at the all-to-all sync barrier that gates each decode step.

Why does expert-parallel MoE serving need load balancing at all?

Because expert parallelism makes the GPUs finish a step together, not independently. Every layer runs an all-to-all that ships tokens to their experts' GPUs and the results back, and that barrier waits for the slowest GPU. Since token-to-expert routing is data-dependent and shifts every batch, whichever GPU holds this step's most popular expert becomes the bottleneck for all of them — and the rest burn the difference as idle time. Without balancing, adding more GPUs can even make it worse, because the hot expert still lives on one GPU. SGLang reports a 5x throughput gain at the same interactivity for DeepSeek-V4 on NVIDIA GB300 once the load is evened.

How does LPLB differ from Waterfill, and from a MoE router?

Waterfill and LPLB are the two expert-parallel balancers the release ships, both aimed at spreading each step's token load across expert replicas. SGLang details LPLB — it solves a linear program for a tight min-max balance at a small per-step cost — but does not spell out Waterfill's internals; the name points to a classic water-filling heuristic (fill the least-loaded replica first), which would be a lighter alternative to an LP solve. Both differ from the MoE router: the router decides which expert each token should go to (a quality choice about the model's output), whereas the balancers decide where, among the redundant copies of that chosen expert, the work actually runs (a serving choice about GPU utilization).

Originally posted on Learn AI Visually.

Amazon Bedrock Deployment Guide: From Environment Setup to Production Operations

Andy Tan — Tue, 30 Jun 2026 11:18:05 +0000

Amazon Bedrock, AWS's fully managed service for foundation models, makes it much easier to build and deploy generative AI applications through a model-as-a-service (MaaS) approach. This guide outlines a structured deployment workflow that covers permissions, network architecture, model onboarding, API integration, and performance optimization, helping teams build AI services that are scalable, secure, and operationally reliable.

Core Benefits and Technical Context

Organizations typically choose Amazon Bedrock for the following reasons:

Resource isolation and elastic scalability: Dedicated compute capacity helps reduce contention with other workloads, while scaling policies can adjust capacity based on demand. Under the right conditions, this can improve cost efficiency significantly.
Security and compliance: Bedrock integrates with AWS security controls such as VPC networking and IAM, helping organizations meet strict security and compliance requirements, including standards such as SOC 2 Type II, HIPAA, and GDPR.
Operational simplicity: Because AWS manages the underlying infrastructure, teams can reduce deployment time and lower operational overhead compared with self-managed model serving stacks.

Pre-Deployment Preparation

2.1 AWS Account and Permission Setup

For better security, use a dedicated IAM user or role instead of the root account, and enable AWS CloudTrail for auditing and operational traceability.

Example IAM policy (JSON):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:*",
        "ec2:Describe*",
        "s3:GetObject"
      ],
      "Resource": "*"
    }
  ]
}
```

Note: In production environments, always follow the principle of least privilege and scope Resource permissions as narrowly as possible.
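As a sketch of what a narrower policy might look like for a service that only runs inference, the helper below builds a policy document limited to the two invoke actions on named model ARNs. The ARN is a placeholder and `invoke_only_policy` is a hypothetical helper; adapt the actions and resources to what your workload actually calls.

```python
import json
from typing import Iterable


def invoke_only_policy(model_arns: Iterable[str]) -> dict:
    """Build an IAM policy document that allows inference on specific models only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": list(model_arns),
            }
        ],
    }


# Placeholder ARN: substitute your own region and model ID.
policy = invoke_only_policy(
    ["arn:aws:bedrock:us-west-2::foundation-model/EXAMPLE_MODEL_ID"]
)
print(json.dumps(policy, indent=2))
```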

2.2 Local Environment Configuration

Install and configure the AWS CLI (version 2.15 or later is recommended) so that you can manage resources from the command line.

```bash
aws configure
# Enter your Access Key ID, Secret Access Key, Region (for example, us-west-2), and preferred output format (such as json)
```

2.3 Network and Storage Architecture

A three-tier architecture is commonly recommended to support high availability and security:

Frontend layer: Use an Application Load Balancer (ALB), ideally protected by AWS WAF against common web threats.
Application layer: Deploy Bedrock-related application components across multiple Availability Zones (AZs) for resilience.
Data layer: Use Amazon S3 for model artifacts, logs, and intermediate data. Where appropriate, use VPC endpoints or PrivateLink to reduce public internet exposure.

Model Deployment Workflow

3.1 Model Preparation and Conversion

If you plan to work with a custom model such as DeepSeek-R1, prepare the model artifacts in a format compatible with your deployment pipeline, such as FP16 or FP8 where applicable.

Example conversion code:

```python
import torch
from deepseek_r1.converter import BedrockExporter

model = torch.load('deepseek_r1_base.pt')
exporter = BedrockExporter(
    framework='pytorch',
    output_path='s3://model-bucket/deepseek/',
    precision='fp16'  # supports fp32/fp16/bf16
)
exporter.convert(model)
```

It is generally recommended to package model artifacts as a .tar.gz file and keep the package size below 50 GB.

3.2 Deployment Through the Console or API

You can deploy model-related resources through the Bedrock console or via API-driven automation.

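Example API workflow, using Custom Model Import:

```python
import boto3

# Control-plane client: model import is managed through "bedrock",
# while "bedrock-runtime" handles inference calls.
bedrock = boto3.client('bedrock', region_name='us-west-2')

response = bedrock.create_model_import_job(
    jobName='deepseek-r1-prod-import',
    importedModelName='deepseek-r1-prod',
    # Placeholder role ARN: the role must allow Bedrock to read the bucket.
    roleArn='arn:aws:iam::123456789012:role/BedrockModelImportRole',
    modelDataSource={
        's3DataSource': {'s3Uri': 's3://model-bucket/deepseek/'}
    }
)
print(response['jobArn'])
```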

3.3 Auto Scaling Strategy

To balance responsiveness and cost efficiency, define scaling rules such as the following:

Scale out when: Request queue depth exceeds 50, or latency rises above 2 seconds.
Scale in when: CPU utilization remains below 30% for 5 minutes.
Cooldown period: 300 seconds to avoid rapid scaling oscillation.
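The rules above translate into a small decision function. This is a sketch of the logic only: the `Metrics` fields and `scaling_decision` are hypothetical names, and in practice the inputs would come from your monitoring system rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    queue_depth: int
    latency_s: float
    cpu_util_pct: float
    low_cpu_minutes: float           # how long CPU has stayed under 30%
    seconds_since_last_scale: float


def scaling_decision(m: Metrics) -> str:
    """Return "scale_out", "scale_in", or "hold" for the current metrics."""
    # Cooldown first: no action within 300 seconds of the last change.
    if m.seconds_since_last_scale < 300:
        return "hold"
    if m.queue_depth > 50 or m.latency_s > 2.0:
        return "scale_out"
    if m.cpu_util_pct < 30 and m.low_cpu_minutes >= 5:
        return "scale_in"
    return "hold"


print(scaling_decision(Metrics(60, 0.5, 50.0, 0.0, 600.0)))
```

Checking scale-out before scale-in means a backed-up queue wins when both conditions hold, which errs on the side of responsiveness over cost.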

API Integration Patterns

4.1 Basic Text Generation

Use the invoke_model API for synchronous inference requests.

```python
import boto3
import json
from botocore.config import Config

bedrock_config = Config(
    retries={'max_attempts': 3, 'mode': 'adaptive'},
    read_timeout=60
)
client = boto3.client('bedrock-runtime', config=bedrock_config)

response = client.invoke_model(
    modelId='deepseek-r1-prod',
    body=json.dumps({
        "prompt": "Explain the basic principles of quantum computing",
        "max_tokens": 512,
        "temperature": 0.7
    })
)
print(json.loads(response['body'].read())['generation'])
```

4.2 Streaming Responses and Multi-Turn Conversations

Streaming output: Use invoke_model_with_response_stream to deliver responses incrementally and improve the user experience.
Conversation handling: Use the Bedrock Converse API (converse and converse_stream), passing the message history on each call, or your own session layer to preserve context for assistants, customer support bots, and similar use cases.

4.3 Batch Processing Optimization

For non-real-time workloads, dynamic batching can improve throughput substantially. A batch size of 32 to 64 requests is often a practical starting point.
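Dynamic batching proper happens in the serving layer, but the client side of it can be as simple as grouping queued requests before dispatch. The helper below is a plain fixed-size chunker, shown as a starting point rather than a dynamic batcher; `batches` is a hypothetical name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int = 32) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


prompts = [f"prompt {i}" for i in range(70)]
print([len(batch) for batch in batches(prompts, size=32)])
```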

Performance Optimization and Monitoring

5.1 Performance Tuning Approaches

Model quantization: Moving from FP32 to FP16 or FP8 can reduce memory usage and improve inference speed.
Caching: Integrate ElastiCache Redis and apply an LRU strategy to frequently repeated queries.
Asynchronous processing: Route non-real-time requests through Amazon SQS to decouple frontend traffic from backend inference workloads.
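The caching idea can be sketched in-process before committing to Redis. The class below keeps the most recently used prompts and evicts the least recently used one when full; `LRUCache` and `fake_model` are hypothetical names, and the in-memory `OrderedDict` stands in for ElastiCache.

```python
from collections import OrderedDict
from typing import Callable


class LRUCache:
    """Small least-recently-used cache keyed by prompt text."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get_or_compute(self, prompt: str, compute: Callable[[str], str]) -> str:
        if prompt in self._data:
            self._data.move_to_end(prompt)  # mark as most recently used
            return self._data[prompt]
        answer = compute(prompt)
        self._data[prompt] = answer
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used
        return answer


calls = []


def fake_model(prompt: str) -> str:
    """Stand-in for an inference call; records each invocation."""
    calls.append(prompt)
    return prompt.upper()


cache = LRUCache(capacity=2)
for prompt in ["a", "b", "a", "c", "b"]:
    cache.get_or_compute(prompt, fake_model)
print(calls)
```

Exact-match caching only pays off for prompts that repeat verbatim, so it suits canned queries and templates better than free-form chat.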

5.2 Example Benchmark Targets

Metric | Test Method | Target
Time to First Token (TTFT) | Empty request test | < 800 ms
Throughput | 100 concurrent requests sustained for 5 minutes | > 80 TPS
Error rate | Measured across 1,000 consecutive requests | < 0.1%

5.3 CloudWatch Monitoring and Alerts

Set up alerts on key operational metrics such as:

CPUUtilization: Above 85% for 5 minutes -> trigger an SNS notification and scale out automatically.
ModelLatency: P99 latency above 1000 ms -> investigate load levels or switch traffic to a backup endpoint.
Invocations 4xx: More than 10 per minute -> inspect client request formatting and permissions.
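
The three alerts can be expressed as keyword arguments for CloudWatch's `put_metric_alarm`. The sketch below only builds the argument dictionaries, so it runs without AWS access. The namespace is left as a parameter, and the metric names are assumptions based on the list above; check which names and units your deployment actually publishes (some services report latency in microseconds rather than milliseconds) before applying the thresholds.

```python
from typing import Any, Dict, List

# Assumed metric names; confirm them against the metrics your endpoint emits.
CPU_METRIC = "CPUUtilization"
LATENCY_METRIC = "ModelLatency"
CLIENT_ERROR_METRIC = "Invocation4XXErrors"


def build_alarms(namespace: str, sns_topic_arn: str) -> List[Dict[str, Any]]:
    """Return put_metric_alarm keyword arguments for the three alerts."""
    common = {
        "Namespace": namespace,
        "AlarmActions": [sns_topic_arn],
        "ComparisonOperator": "GreaterThanThreshold",
    }
    return [
        # CPU above 85% averaged over one 5-minute period.
        {**common, "AlarmName": "cpu-high", "MetricName": CPU_METRIC,
         "Statistic": "Average", "Period": 300, "EvaluationPeriods": 1,
         "Threshold": 85.0},
        # P99 latency above 1000 (assumes the metric is reported in ms).
        {**common, "AlarmName": "latency-p99-high", "MetricName": LATENCY_METRIC,
         "ExtendedStatistic": "p99", "Period": 60, "EvaluationPeriods": 1,
         "Threshold": 1000.0},
        # More than 10 client errors summed over one minute.
        {**common, "AlarmName": "client-errors-high",
         "MetricName": CLIENT_ERROR_METRIC, "Statistic": "Sum", "Period": 60,
         "EvaluationPeriods": 1, "Threshold": 10.0},
    ]


# Usage, assuming boto3 is configured:
#   cloudwatch = boto3.client("cloudwatch")
#   for alarm in build_alarms("<your namespace>", "<your SNS topic ARN>"):
#       cloudwatch.put_metric_alarm(**alarm)
```

Note that an alarm action can notify an SNS topic, but the automatic scale-out mentioned above still has to be wired up separately, for example through a scaling policy.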

Security, Compliance, and Cost Management

6.1 Data Protection

Network isolation: Use VPC endpoint policies to restrict traffic to private subnets where appropriate.
Encryption: Use AWS KMS customer-managed keys (CMKs) to protect sensitive data.
Auditability: Log API metadata to support investigation, traceability, and compliance review.

6.2 Cost Structure and Optimization

Running a model such as DeepSeek-R1 on Bedrock may involve compute, storage, and data transfer costs.

Optimization ideas include:

Use Lambda@Edge where low-latency global access is needed.
Cache frequent requests to reduce unnecessary inference traffic.
Review utilization regularly and adjust Reserved Instances or Savings Plans where applicable.

Troubleshooting

Symptom | Possible Cause | Recommended Action
503 Service Unavailable | Capacity overload | Increase max_worker_count or enable auto scaling
Garbled model output | Encoding mismatch | Verify that Content-Type is application/json
Unstable latency | Network jitter | Consider AWS Direct Connect or review the network path
Access Denied | Missing IAM permissions | Check whether the IAM role includes AmazonBedrockFullAccess or an equivalent custom policy

By following the practices outlined above, teams can deploy AI capabilities on Amazon Bedrock in a way that is efficient, secure, and scalable, while accelerating integration into real business applications.

The end of hardcoded model prompts: Building agents that discover their own infrastructure

Renato Marinho — Tue, 30 Jun 2026 11:16:48 +0000

I was reading a thread recently about how MCP servers are burning 50k+ tokens before a user even types a single word, and it hit home. We're all obsessed with the 'intelligence' of these models, but we're ignoring the massive architectural debt we're creating by hardcoding tool definitions into our system prompts.

If you've ever spent a Tuesday night debugging why an agent failed because a provider updated their model version or renamed an endpoint in their SDK, you know exactly what I'm talking about. We treat LLM capabilities like static constants, but the reality of modern inference—especially when dealing with something as dynamic as NVIDIA's Cloud Engine—is that the infrastructure is constantly shifting.

This is why I think the industry needs to stop focusing on 'prompt engineering' and start focusing on 'discovery-driven architecture.'

I've been playing with the NVIDIA API Catalog MCP lately, and it highlights exactly where we're going. Instead of telling an agent, "You have access to Llama3," you give the agent a tool that lets it ask, "What is actually available in the NVIDIA matrix right now?"

The Death of the Static System Prompt

Most developers approach MCP development by defining every possible tool in the JSON schema. It works fine for a demo. In production, it's a nightmare. You end up with massive context windows filled with documentation for models that might not even be active in your current region or quota tier.

When you use the NVIDIA API Catalog via Vinkius, the paradigm shifts. The existence of nvidia_list_foundation_models changes everything. An intelligent agent shouldn't start its loop by trying to guess which model is best; it should start by querying the catalog. By calling that tool, the agent gets a real-time dump of accessible paths. It sees exactly what NVIDIA has deployed—whether it's Nemotron or Llama3—and adjusts its subsequent nvidia_chat_command calls based on actual availability.

This isn't just about convenience; it's about preventing the exact token bloat that everyone is complaining about right now. If the agent discovers what's available via a tool call, you don't need to list every possible model configuration in your system instructions. You only pay for the context of the models currently active.
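
Here is roughly what "discover first, then call" looks like in code. Treat it as a sketch under assumptions: the tool names come from this article, but `call_tool` is a hypothetical stand-in for however your MCP client invokes a tool, and the response shape (a list of objects with an `id` field) and the `model`/`prompt` arguments are guesses for illustration.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def pick_model(call_tool: ToolCaller, preferences: List[str]) -> Optional[str]:
    """Ask the catalog what is available, then return the first preferred match."""
    available = call_tool("nvidia_list_foundation_models", {})
    names = [m["id"] for m in available]  # assumed response shape
    for wanted in preferences:
        for name in names:
            if wanted.lower() in name.lower():
                return name
    # No preference matched: fall back to whatever the catalog lists first.
    return names[0] if names else None


def chat(call_tool: ToolCaller, prompt: str, preferences: List[str]) -> Any:
    """Resolve a model at call time instead of hardcoding it in the prompt."""
    model = pick_model(call_tool, preferences)
    if model is None:
        raise RuntimeError("catalog returned no models")
    # Argument names are assumptions, not a documented schema.
    return call_tool("nvidia_chat_command", {"model": model, "prompt": prompt})
```

The system prompt only has to express a preference order like `["nemotron", "llama3"]`. The concrete model identifier is resolved at runtime, so a renamed or retired model changes the tool result, not your instructions.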

Managing the 'Runaway Agent' Problem

One of the biggest fears I hear from CTOs when they look at agentic workflows is: "How do I stop this thing from burning my entire NVIDIA credit quota in twenty minutes?"

If you're building a direct integration, you're likely manually tracking usage or setting hard limits in your orchestrator. It's clunky and usually reactive (meaning you find out about the overspend when the bill arrives). The NVIDIA Catalog setup allows for a much more proactive approach using nvidia_check_token_quota.

You can instruct your agent to check its own constraints before initiating heavy inference tasks. If the quota is low, the agent can decide to switch from a massive instruction-heavy model to something smaller or simply pause and alert a human. It moves the governance from the 'policeman' (the orchestrator) into the 'worker' (the agent).
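
A quota-aware pre-flight check might look something like this. Again, only the tool name is from the article. `call_tool` is a hypothetical MCP client helper, and the `remaining_tokens` and `limit_tokens` fields are assumed for illustration, since the actual response schema isn't shown here.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def choose_plan(call_tool: ToolCaller, estimated_tokens: int,
                large_model: str, small_model: str,
                reserve: float = 0.2) -> Tuple[str, Optional[str]]:
    """Decide whether to run on the large model, downgrade, or pause."""
    quota = call_tool("nvidia_check_token_quota", {})
    remaining = quota["remaining_tokens"]  # assumed field name
    limit = quota["limit_tokens"]          # assumed field name

    # Not enough budget for the task at all: stop and alert a human.
    if estimated_tokens > remaining:
        return ("pause", None)
    # The task would eat into the reserve: switch to the smaller model.
    if remaining - estimated_tokens < reserve * limit:
        return ("run", small_model)
    return ("run", large_model)
```

The 20% reserve is an arbitrary starting point. The useful part is that the decision happens before the expensive call, not after the invoice.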

This is exactly why we built Vinkius with an emphasis on production-grade execution. When you connect this MCP, it’s not just about passing through an API key. We run every execution in isolated V8 sandboxes with strict governance policies—DLP, SSRF prevention, and HMAC audit chains. Because when you give an agent the power to trigger nvidia_chat_completion or process vision tasks via nvidia_vision_inference, you're essentially giving it a credit card linked to your infrastructure. You can't treat that connection as a simple webhook.

Beyond Text: The Multimodal Pipeline

What most people miss when they look at these MCPs is the depth of the utility beyond just chat completions. If you're working on RAG (Retrieval-Augmented Generation), you aren't just looking for text; you're looking for vectors.

The ability to use nvidia_generate_embeddings directly within the agentic loop means your agent can handle its own vectorization needs without needing a separate, hardcoded pipeline. It can take unstructured data, pass it through the NVIDIA proxy, and get back the numerical arrays needed for semantic search.
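
To show what the agent does with those numerical arrays, here is a minimal semantic-search sketch: embed the query and the documents in one call, then rank by cosine similarity. The `call_tool` helper, the `input` argument, and the assumption that the tool returns one vector per input in order are all hypothetical. Only the tool name comes from the article.

```python
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: call_tool(tool_name, arguments) -> parsed result
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(call_tool: ToolCaller, query: str,
                   docs: List[str]) -> List[Tuple[str, float]]:
    """Return (document, score) pairs, most similar to the query first."""
    # Assumed: one embedding per input string, in the same order.
    vectors = call_tool("nvidia_generate_embeddings", {"input": [query] + docs})
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For more than a few hundred documents you would store the vectors in a vector index instead of re-embedding on every query, but the ranking logic stays the same.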

Then you have tools like nvidia_summarize_content or even vision-based inference. When an agent can pivot from reading text to analyzing graphical data via nvidia_vision_inference, and then immediately check if there are any fine-tuned overrides available through nvidia_list_lora_adapters, you're no longer building a chatbot. You're building a self-configuring compute engine.

The Reality Check

I am not saying this fixes everything. If you're running local Docker metrics and need absolute control over every micro-latency, you shouldn't be using a cloud proxy; you should be looking at nvidia-nim-mcp for local boundaries. This catalog is for the engineers who want to leverage NVIDIA's massive compute matrix without building the entire plumbing themselves.

But if your goal is to deploy an agent that actually works in the real world—where models change, quotas fluctuate, and security cannot be a secondary thought—then you need to stop hardcoding. You need tools that provide discovery, not just execution.

If you want to see how this looks in practice without dealing with the headache of configuring OAuth callbacks or managing environment variables for every single provider, check it out on Vinkius. It's three steps: subscribe, grab your token, and paste it into Claude or Cursor. No friction, just execution.

I've seen enough broken integrations to know that if a developer hits a configuration wall in the first five minutes, they're gone. We built this so you can focus on the logic, not the plumbing.

MCPs are the music of AI Agents. We built the catalog. Discover Vinkius MCP Catalog.