<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Evan-dong</title>
    <description>The latest articles on DEV Community by Evan-dong (@evan-dong).</description>
    <link>https://dev.to/evan-dong</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3805708%2F6a9f71a4-d7de-4c0a-8ff7-ba23c9b2486a.png</url>
      <title>DEV Community: Evan-dong</title>
      <link>https://dev.to/evan-dong</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/evan-dong"/>
    <language>en</language>
    <item>
      <title>Stop Asking Claude Code for Markdown Specs. Ask for HTML Artifacts.</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Sat, 09 May 2026 09:30:17 +0000</pubDate>
      <link>https://dev.to/evan-dong/stop-asking-claude-code-for-markdown-specs-ask-for-html-artifacts-16ke</link>
      <guid>https://dev.to/evan-dong/stop-asking-claude-code-for-markdown-specs-ask-for-html-artifacts-16ke</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHHz_ftzaIAAwkQs%3Fformat%3Djpg%26name%3Dmedium" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FHHz_ftzaIAAwkQs%3Fformat%3Djpg%26name%3Dmedium" alt="Using Claude Code: The Unreasonable Effectiveness of HTML cover" width="1200" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code is very good at writing Markdown. That does not mean Markdown should be the default output for every task.&lt;/p&gt;

&lt;p&gt;Thariq from the Claude Code team recently described a workflow where he increasingly asks Claude Code for HTML instead of Markdown. The reason is practical: long Markdown specs are easy to generate but hard to read. HTML can turn the same information into a navigable, visual, and sometimes interactive artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  When HTML Beats Markdown
&lt;/h2&gt;

&lt;p&gt;Use HTML when the output is meant to be consumed by people, not maintained line by line in Git.&lt;/p&gt;

&lt;p&gt;Good fits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR walkthroughs&lt;/li&gt;
&lt;li&gt;design option comparisons&lt;/li&gt;
&lt;li&gt;architecture explainers&lt;/li&gt;
&lt;li&gt;onboarding docs&lt;/li&gt;
&lt;li&gt;debugging reports&lt;/li&gt;
&lt;li&gt;one-off planning tools&lt;/li&gt;
&lt;li&gt;draggable prioritization boards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep Markdown for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;READMEs&lt;/li&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;durable docs&lt;/li&gt;
&lt;li&gt;API references&lt;/li&gt;
&lt;li&gt;anything that needs clean Git diff review&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Example: PR Review Artifact
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summarize this PR in Markdown.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create a single self-contained HTML PR walkthrough.
Render the important diff areas with inline annotations.
Color-code findings by severity.
Add a manual verification checklist at the bottom.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives reviewers something closer to a focused review interface than a wall of bullets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Implementation Options
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generate 5 implementation approaches as one HTML file.
Use a comparison grid.
For each approach show:
- complexity
- migration risk
- test impact
- recommended use case
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is much easier to scan than five Markdown sections stacked vertically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;p&gt;HTML is not always better.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Markdown&lt;/th&gt;
&lt;th&gt;HTML&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git diffs&lt;/td&gt;
&lt;td&gt;Great&lt;/td&gt;
&lt;td&gt;Noisy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-term docs&lt;/td&gt;
&lt;td&gt;Great&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual hierarchy&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactivity&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sharing in browser&lt;/td&gt;
&lt;td&gt;Requires renderer&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rule I use: Markdown is the source. HTML is the reading surface.&lt;/p&gt;
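
&lt;p&gt;That split is also easy to automate outside of Claude Code. Below is a minimal sketch, assuming the third-party &lt;code&gt;markdown&lt;/code&gt; package and a hypothetical &lt;code&gt;SPEC.md&lt;/code&gt; in the repo: the Markdown stays in Git as the source, and a throwaway HTML reading copy is generated on demand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: SPEC.md stays the Git source of truth; render a disposable HTML reading copy.
# Assumes `pip install markdown`; file names are illustrative.
import pathlib
import markdown

source = pathlib.Path("SPEC.md").read_text(encoding="utf-8")
body = markdown.markdown(source, extensions=["tables", "fenced_code"])
page = f"&lt;!doctype html&gt;&lt;html&gt;&lt;head&gt;&lt;meta charset='utf-8'&gt;&lt;title&gt;Spec&lt;/title&gt;&lt;/head&gt;&lt;body&gt;{body}&lt;/body&gt;&lt;/html&gt;"
pathlib.Path("spec.html").write_text(page, encoding="utf-8")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;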

&lt;h2&gt;
  
  
  Practical Prompt
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create a self-contained HTML explainer for this feature.
Audience: an engineer who has not seen this code before.
Include a visual summary, annotated code snippets, risks, and a next-step checklist.
Do not add external dependencies.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real insight is not "HTML is better than Markdown." It is that AI-generated output does not have to be plain text. If the model can generate a useful interface, ask for the interface.&lt;/p&gt;




&lt;p&gt;For teams building Claude Code workflows across multiple models, &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=claude_html_output&amp;amp;utm_content=claude-code-html-over-markdown" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt; provides unified API access to Claude and other frontier models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>OpenAI's New Realtime Voice Models Can Think, Translate, and Transcribe — Here's What Developers Need to Know</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Fri, 08 May 2026 13:36:06 +0000</pubDate>
      <link>https://dev.to/evan-dong/openais-new-realtime-voice-models-can-think-translate-and-transcribe-heres-what-developers-5hab</link>
      <guid>https://dev.to/evan-dong/openais-new-realtime-voice-models-can-think-translate-and-transcribe-heres-what-developers-5hab</guid>
      <description>&lt;p&gt;OpenAI just shipped three realtime voice models through their API. One reasons at GPT-5 level during live calls. One translates 70+ languages in real time. One does streaming transcription. All available today.&lt;/p&gt;

&lt;p&gt;Let me break down what matters for developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT-Realtime-2&lt;/strong&gt; handles voice conversations with GPT-5-level reasoning. The key difference from previous voice models: it can call tools mid-conversation without going silent. It narrates what it's doing while executing — OpenAI calls this "preamble."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-Realtime-Translate&lt;/strong&gt; does real-time voice translation. 70+ input languages, 13 output languages. End-to-end audio processing (no intermediate text step), which preserves tone and emotion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-Realtime-Whisper&lt;/strong&gt; is streaming speech-to-text. Words appear as the speaker talks. Built for live captions and meeting transcription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration Options
&lt;/h2&gt;

&lt;p&gt;All three use the Realtime API with three connection methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebRTC&lt;/strong&gt; — browser-based, lowest latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket&lt;/strong&gt; — server-side, more control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP&lt;/strong&gt; — telephony integration&lt;/li&gt;
&lt;/ul&gt;
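
&lt;p&gt;To make the WebSocket option concrete, here is a minimal connection sketch. It follows the current Realtime API conventions (the &lt;code&gt;wss://api.openai.com/v1/realtime&lt;/code&gt; endpoint and the &lt;code&gt;session.update&lt;/code&gt; / &lt;code&gt;response.create&lt;/code&gt; events); the &lt;code&gt;gpt-realtime-2&lt;/code&gt; model string is taken from the announcement and may differ from the final API identifier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal WebSocket sketch against the Realtime API (pip install websocket-client).
# Model string from the announcement; verify the exact identifier in the model reference.
import json
import os
import websocket

url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
ws = websocket.create_connection(
    url,
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",
    ],
)

# Configure the session, then request a response; audio frames stream over the same socket.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"modalities": ["audio", "text"], "instructions": "You are a concise voice agent."},
}))
ws.send(json.dumps({"type": "response.create"}))
print(json.loads(ws.recv()))   # first server event, e.g. session.created
ws.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;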

&lt;h2&gt;
  
  
  GPT-Realtime-2: Voice Agents That Actually Work
&lt;/h2&gt;

&lt;p&gt;If you've built voice agents before, you know the pain: tool calls create dead air. The user asks something that requires a database lookup, and the agent goes silent for 2-3 seconds. Feels broken.&lt;/p&gt;

&lt;p&gt;GPT-Realtime-2 solves this with preamble — it talks through its actions while executing them. "Let me check your calendar... I see you have a meeting with Alex Kim in 12 minutes." The tool call happens in parallel with the speech.&lt;/p&gt;

&lt;p&gt;Other developer-relevant specs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;128K context window (up from 32K)&lt;/li&gt;
&lt;li&gt;Handles interruptions without losing context&lt;/li&gt;
&lt;li&gt;Better instruction following for system prompts&lt;/li&gt;
&lt;li&gt;Text tokens: $4/$16 per million (input/output)&lt;/li&gt;
&lt;li&gt;Audio tokens: $32/$64 per million&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GPT-Realtime-Translate: The $0.034/min Disruption
&lt;/h2&gt;

&lt;p&gt;The translation model is priced at $0.034 per minute. For context, a human simultaneous interpreter costs $25-44 per minute.&lt;/p&gt;

&lt;p&gt;Technical details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes raw audio end-to-end (not cascaded speech-to-text-to-speech)&lt;/li&gt;
&lt;li&gt;Preserves speaker emotion and tone&lt;/li&gt;
&lt;li&gt;Works best with brief pauses between thoughts (labeled "turn-based" in docs)&lt;/li&gt;
&lt;li&gt;Occasional hallucinations still occur&lt;/li&gt;
&lt;li&gt;Supports language switching mid-stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The end-to-end approach is what makes the quality difference. Traditional pipelines lose vocal characteristics at every stage. This model skips text entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-Realtime-Whisper: Streaming Transcription
&lt;/h2&gt;

&lt;p&gt;If you need real-time captions or meeting transcription, this is the model. Low-latency streaming output as the speaker talks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Build
&lt;/h2&gt;

&lt;p&gt;The three models together cover the full voice infrastructure stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Customer support agents&lt;/strong&gt; that can reason, look up accounts, and process requests — all by voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time translation layers&lt;/strong&gt; for international meetings at 1/1000th the cost of human interpreters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live captioning systems&lt;/strong&gt; for streaming, conferences, or accessibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual voice assistants&lt;/strong&gt; that handle code-switching naturally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telephony bots&lt;/strong&gt; via SIP integration that feel like talking to a person&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/" rel="noopener noreferrer"&gt;OpenAI Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/realtime" rel="noopener noreferrer"&gt;Realtime API Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/models/gpt-realtime" rel="noopener noreferrer"&gt;Model Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/cookbook/examples/voice_solutions/one_way_translation_using_realtime_api" rel="noopener noreferrer"&gt;Translation Cookbook&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Anthropic's Agents Now Self-Improve Between Sessions. Here's How Dreaming Works.</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Thu, 07 May 2026 12:12:59 +0000</pubDate>
      <link>https://dev.to/evan-dong/anthropics-agents-now-self-improve-between-sessions-heres-how-dreaming-works-48l8</link>
      <guid>https://dev.to/evan-dong/anthropics-agents-now-self-improve-between-sessions-heres-how-dreaming-works-48l8</guid>
      <description>&lt;p&gt;On May 6th, Anthropic shipped three new capabilities for Managed Agents. Two of them — Outcomes and multi-agent orchestration — are solid infrastructure upgrades. The third one, Dreaming, is the one worth stopping to think about.&lt;/p&gt;

&lt;p&gt;Dreaming is a scheduled background process that runs between sessions. The agent reviews its own past conversation transcripts, identifies recurring patterns, and writes learnings into its memory stores. No human prompt required. No explicit instruction to "remember this."&lt;/p&gt;

&lt;p&gt;If you've been building with Claude agents, you already know how memory works: you tell the agent something, it stores it, it uses it next time. Passive. Explicit. You're the one deciding what gets remembered.&lt;/p&gt;

&lt;p&gt;Dreaming flips that. The agent decides.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;The process runs on a schedule between sessions. The agent scans past transcripts looking for signal: mistakes it repeated, approaches that worked, edge cases it missed. It then curates its own memory stores based on what it finds. The original session data stays untouched — Dreaming writes to memory, not back to history.&lt;/p&gt;

&lt;p&gt;There are two autonomy modes you can configure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic&lt;/strong&gt;: the agent identifies patterns and writes them to memory directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human review&lt;/strong&gt;: the agent proposes memory updates, you approve before they take effect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The human review mode is the safer starting point for production systems. You get the cross-session pattern recognition without giving the agent unilateral write access to its own memory.&lt;/p&gt;

&lt;p&gt;Currently in research preview — not GA yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: The Cross-Session Blind Spot
&lt;/h2&gt;

&lt;p&gt;Here's the problem Dreaming solves. Individual sessions can't see cross-session patterns. A support agent that misclassifies a certain type of ticket won't notice it's made the same error 12 times this month. Each session starts fresh. The pattern is invisible.&lt;/p&gt;

&lt;p&gt;Dreaming surfaces exactly that kind of signal. It's the difference between an agent that resets every session and one that accumulates operational experience over time.&lt;/p&gt;

&lt;p&gt;The practical implication: an agent that's been running for three months has three months of self-curated experience. A freshly deployed agent starts from zero. Over time, these become fundamentally different systems — not because of different prompts, but because of different histories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcomes: The Signal Dreaming Needs
&lt;/h2&gt;

&lt;p&gt;Dreaming needs to know what "doing well" means. That's what Outcomes provides.&lt;/p&gt;

&lt;p&gt;You define a success rubric. A separate Claude instance — isolated from the agent's reasoning, running in its own context window — evaluates output against your criteria. If it fails, the grader identifies what needs to change, and the agent iterates until it meets the bar.&lt;/p&gt;

&lt;p&gt;Numbers from Anthropic's internal testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task success rates improved up to &lt;strong&gt;10 percentage points&lt;/strong&gt; over standard prompting&lt;/li&gt;
&lt;li&gt;Structured file generation: &lt;strong&gt;+8.4%&lt;/strong&gt; on .docx, &lt;strong&gt;+10.1%&lt;/strong&gt; on .pptx&lt;/li&gt;
&lt;li&gt;Works for subjective quality — editorial voice, writing style, brand consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The isolation model matters here. The grader runs in a separate context window, which means it can't be influenced by the agent's own reasoning. It's evaluating output, not process.&lt;/p&gt;

&lt;p&gt;Connect the two: Outcomes identifies failures. Dreaming remembers them. One is the exam. The other is the error notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Agent Orchestration: Now in Public Beta
&lt;/h2&gt;

&lt;p&gt;The third piece moved from preview to public beta. A coordinator agent decomposes tasks and delegates to up to 20 specialist subagents running in parallel. Each subagent gets its own context window. They share a common filesystem.&lt;/p&gt;

&lt;p&gt;Key details for builders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full trace visibility in Claude Console&lt;/li&gt;
&lt;li&gt;Coordinator can send follow-up messages mid-workflow&lt;/li&gt;
&lt;li&gt;Subagents retain context between exchanges&lt;/li&gt;
&lt;li&gt;Orchestration depth limited to one level — no sub-sub-agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The depth limit is worth noting. If your architecture needs nested orchestration, this isn't the right fit yet.&lt;/p&gt;

&lt;p&gt;Real-world results from early adopters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Harvey (legal AI): task completion rates up approximately &lt;strong&gt;6x&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Wisedocs (document verification): review speed improved &lt;strong&gt;50%&lt;/strong&gt; while maintaining quality&lt;/li&gt;
&lt;li&gt;Netflix: parallel batch analysis across hundreds of build logs&lt;/li&gt;
&lt;li&gt;Spiral by Every: Haiku coordinator + Opus writing subagents + Outcomes grader scoring against editorial principles&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Webhooks and Pricing
&lt;/h2&gt;

&lt;p&gt;Webhooks are in public beta. Agents push notifications to your system when tasks complete. For long-running jobs — some sessions run for hours — this is essential. You don't want to poll.&lt;/p&gt;
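
&lt;p&gt;What the receiving endpoint looks like is up to you. Here is a minimal sketch using Flask; the payload fields (&lt;code&gt;session_id&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;) are illustrative assumptions, not Anthropic's documented webhook schema, so check the Managed Agents docs for the real shape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal webhook receiver sketch (Flask). Payload fields are assumptions for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/agent-webhooks")
def agent_task_completed():
    event = request.get_json(force=True)
    # Kick off downstream work here instead of polling the agent for completion.
    print("agent task finished:", event.get("session_id"), event.get("status"))
    return jsonify({"received": True}), 200

if __name__ == "__main__":
    app.run(port=8080)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;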

&lt;p&gt;Pricing: standard Claude API token rates plus &lt;strong&gt;$0.08 per active session hour&lt;/strong&gt;. Idle time is free. A 30-minute task costs 4 cents in infrastructure fees on top of tokens. Dreaming, Outcomes, and Webhooks don't add separate charges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dreaming&lt;/td&gt;
&lt;td&gt;Research preview&lt;/td&gt;
&lt;td&gt;Agents review past sessions, extract patterns, curate memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outcomes&lt;/td&gt;
&lt;td&gt;Public beta&lt;/td&gt;
&lt;td&gt;Automated output grading against developer-defined rubrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent orchestration&lt;/td&gt;
&lt;td&gt;Public beta&lt;/td&gt;
&lt;td&gt;Coordinator + up to 20 parallel subagents, shared filesystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Webhooks&lt;/td&gt;
&lt;td&gt;Public beta&lt;/td&gt;
&lt;td&gt;Push notifications when agent tasks complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Live&lt;/td&gt;
&lt;td&gt;$0.08/active session hour + standard token costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  One Limitation Worth Knowing
&lt;/h2&gt;

&lt;p&gt;Managed Agents runs Claude models exclusively. The orchestration, Dreaming, Outcomes grading — all Claude. If your architecture needs to route between models (cost optimization, specialized capabilities, latency requirements), that's a layer Managed Agents doesn't address.&lt;/p&gt;

&lt;p&gt;If you're building multi-model agent systems that need persistent context across providers, &lt;a href="https://docs.evolink.ai/en/integration-guide/claude-desktop" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt; provides a unified gateway routing across Claude, DeepSeek, GPT, and others from a single API endpoint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Author: Jessie, COO at EvoLink&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;Anthropic: New in Claude Managed Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/managed-agents" rel="noopener noreferrer"&gt;Anthropic Engineering: Decoupling the Brain from the Hands&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Stopped Burning Through My Claude Code Quota by Noon</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Wed, 06 May 2026 09:58:47 +0000</pubDate>
      <link>https://dev.to/evan-dong/how-i-stopped-burning-through-my-claude-code-quota-by-noon-1fp6</link>
      <guid>https://dev.to/evan-dong/how-i-stopped-burning-through-my-claude-code-quota-by-noon-1fp6</guid>
      <description>&lt;p&gt;&lt;em&gt;By Jessie, COO at EvoLink&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You open Claude Code at 9am. By noon, you're rate-limited. Your colleague does twice the work and still has quota left at 5pm. Same Max subscription. What's going on?&lt;/p&gt;

&lt;p&gt;I ran into this exact situation and went digging. Turns out Anthropic published an internal engineering blog — "Lessons from building Claude Code: Prompt Caching is Everything" — that explains the whole thing. The short version: your daily habits are probably destroying your cache hit rate, and that makes each message cost 10-20x more than it needs to.&lt;/p&gt;

&lt;p&gt;Here's what I learned and what I changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mechanic: Prefix Caching
&lt;/h2&gt;

&lt;p&gt;Every request Claude Code sends to the model follows this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt + Tool definitions → Project docs (CLAUDE.md) → Session context → Messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API caches this sequence from the front. On the next request, if the prefix matches what was cached before, it reuses the prior computation. A cache hit costs &lt;strong&gt;one-tenth&lt;/strong&gt; of normal price for those tokens.&lt;/p&gt;

&lt;p&gt;But if any single byte in the prefix changes, everything from that point onward is invalidated. Full price recalculation.&lt;/p&gt;

&lt;p&gt;The ordering is intentional. Anthropic's design principle: the less something changes, the earlier it goes. System prompt and tool definitions rarely change — they sit at the front. CLAUDE.md changes occasionally — middle. Messages change every turn — last. Each new turn just appends to the end. Everything before it stays cached.&lt;/p&gt;
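
&lt;p&gt;The economics fall out of that structure. A rough sketch of the math, using a placeholder per-token price and the one-tenth cache-hit figure from the post (real rates vary by model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative cost model for prefix caching. The dollar rate is a placeholder;
# the 1/10th multiplier for cache hits comes from Anthropic's post.
INPUT_PRICE = 3.00 / 1_000_000      # placeholder $/token for uncached input
CACHED_PRICE = INPUT_PRICE / 10     # cache hits bill at roughly one-tenth

def turn_cost(prefix_tokens: int, new_tokens: int, cache_valid: bool) -&gt; float:
    """One request = the stable prefix plus the newly appended messages."""
    prefix_rate = CACHED_PRICE if cache_valid else INPUT_PRICE
    return prefix_tokens * prefix_rate + new_tokens * INPUT_PRICE

# 60k-token prefix (system prompt + tools + CLAUDE.md + history), 1k tokens of new messages
print(f"cache hit:  ${turn_cost(60_000, 1_000, cache_valid=True):.4f}")
print(f"cache miss: ${turn_cost(60_000, 1_000, cache_valid=False):.4f}")   # e.g. right after a model switch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;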




&lt;h2&gt;
  
  
  Four Things That Kill Your Cache
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Switching Models Mid-Conversation
&lt;/h3&gt;

&lt;p&gt;This one hurts the most. You're mid-session with Opus, a simple task comes up, you run &lt;code&gt;/model&lt;/code&gt; to switch to Haiku, handle it, switch back.&lt;/p&gt;

&lt;p&gt;Cache is bound to the model. One switch = all accumulated cache invalidated, rebuilt from scratch. The rebuild cost often exceeds what letting Opus answer the simple question would have cost.&lt;/p&gt;

&lt;p&gt;Anthropic's internal approach: keep one model for the main conversation. When a smaller model is needed, use a sub-agent — independent context and cache, does its work, passes the result back without touching the main session's cache chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Changing Tool Configuration Mid-Session
&lt;/h3&gt;

&lt;p&gt;Adding an MCP tool, removing one, or updating parameters — tool definitions are part of the cached prefix. Any change breaks the chain.&lt;/p&gt;

&lt;p&gt;This is why Claude Code keeps tool definitions in place even when unused. The cost of extra definition tokens is negligible compared to a full cache invalidation.&lt;/p&gt;

&lt;p&gt;Plan Mode follows the same logic: instead of removing execution tools when entering planning mode, it adds &lt;code&gt;EnterPlanMode&lt;/code&gt;/&lt;code&gt;ExitPlanMode&lt;/code&gt; as special tools. The tool set never changes. The cache stays valid.&lt;/p&gt;

&lt;p&gt;For users with many MCP tools, Claude Code uses lazy loading: start with lightweight stubs (tool name + one-line description), pull full schemas only when the model actually needs to call a tool.&lt;/p&gt;
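
&lt;p&gt;Conceptually, the lazy-loading pattern looks something like the sketch below (the names and the loader are illustrative, not Claude Code's actual internals): advertise cheap stubs up front, resolve a full schema only on first use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Conceptual sketch of lazy tool loading; identifiers are illustrative, not Claude Code internals.
TOOL_STUBS = {
    "query_database": "Run a read-only SQL query",
    "search_docs": "Full-text search over project documentation",
}

_FULL_SCHEMAS: dict[str, dict] = {}   # resolved on demand, cached afterwards

def _fetch_schema(name: str) -&gt; dict:
    # Stand-in for a round-trip to the MCP server that owns the tool.
    return {"name": name, "description": TOOL_STUBS[name], "input_schema": {"type": "object"}}

def get_tool_schema(name: str) -&gt; dict:
    if name not in _FULL_SCHEMAS:
        _FULL_SCHEMAS[name] = _fetch_schema(name)
    return _FULL_SCHEMAS[name]

print(get_tool_schema("search_docs"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;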

&lt;h3&gt;
  
  
  3. Opening New Sessions Constantly
&lt;/h3&gt;

&lt;p&gt;Every fresh &lt;code&gt;claude&lt;/code&gt; invocation starts cache from zero. If your habit is "ask two questions, quit, reopen" — you never accumulate cache benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Switching Between Accounts
&lt;/h3&gt;

&lt;p&gt;Cache is isolated per account. Rotating through account pools resets the cache each time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do Instead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Keep conversations long.&lt;/strong&gt; Longer conversation = thicker cache = cheaper messages toward the end. Stop opening new sessions unnecessarily.&lt;/p&gt;

&lt;p&gt;You might worry about context window overflow. Don't. Claude Code has built-in compaction — automatic history compression when context gets too long. Anthropic designed Cache-Safe Forking: the compaction request reuses the exact same system prompt and tool definitions, sharing the same cache chain. The only new cost is the compression instruction itself.&lt;/p&gt;

&lt;p&gt;Long conversations don't get more expensive. They get cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't switch models mid-conversation.&lt;/strong&gt; If you need a different model, open a separate conversation for that task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure MCP tools before the session starts.&lt;/strong&gt; Don't add or remove mid-session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use &lt;code&gt;--resume&lt;/code&gt; to continue previous sessions.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--resume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This restores your last session. The cache chain picks up where it left off. No rebuild. This single flag is probably the most underrated cost-saving habit in Claude Code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Cache Impact&lt;/th&gt;
&lt;th&gt;Cost Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Switch model mid-conversation&lt;/td&gt;
&lt;td&gt;Full invalidation&lt;/td&gt;
&lt;td&gt;Up to 20x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add/remove MCP tools&lt;/td&gt;
&lt;td&gt;Full invalidation&lt;/td&gt;
&lt;td&gt;10-20x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open new session&lt;/td&gt;
&lt;td&gt;Start from zero&lt;/td&gt;
&lt;td&gt;First turns at full price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Switch accounts&lt;/td&gt;
&lt;td&gt;Full invalidation&lt;/td&gt;
&lt;td&gt;10-20x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long continuous conversation&lt;/td&gt;
&lt;td&gt;Accumulates&lt;/td&gt;
&lt;td&gt;Gets cheaper over time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use &lt;code&gt;--resume&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Continues chain&lt;/td&gt;
&lt;td&gt;Near-free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  One More Detail Worth Knowing
&lt;/h2&gt;

&lt;p&gt;Claude Code never modifies the system prompt to update state information (current time, file changes). Instead, it injects updates using &lt;code&gt;&amp;lt;system-reminder&amp;gt;&lt;/code&gt; tags inside messages. Because modifying the prompt would break the cache. The prompt is treated as immutable infrastructure. Messages are the fluid information layer.&lt;/p&gt;

&lt;p&gt;That's the level of obsession Anthropic has about this. They monitor cache hit rate with the same severity as server uptime. A drop is treated as an incident.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Model-Switching Problem
&lt;/h2&gt;

&lt;p&gt;"Never switch models" is painful advice in practice. Sonnet for everyday coding, Opus for architecture decisions, Haiku for quick questions — that's a normal workflow.&lt;/p&gt;

&lt;p&gt;Anthropic's answer is "use sub-agents," but most users can't orchestrate sub-agents themselves. If you're running Claude Code through a gateway like &lt;a href="https://docs.evolink.ai/en/integration-guide/claude-desktop" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt;, model routing can happen at the infrastructure level without breaking your session's cache chain. Worth knowing that option exists.&lt;/p&gt;




&lt;p&gt;Caching is not an optimization technique. It is the foundation of the entire system. Now you know what Anthropic knows.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sources:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/claude-code-prompt-caching" rel="noopener noreferrer"&gt;Anthropic Engineering: Lessons from building Claude Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Jessie is COO at EvoLink, a Claude API gateway for teams and developers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Codex v0.128.0: /goal Keeps Working Until It's Done -- Even Across Sessions</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Sun, 03 May 2026 08:58:24 +0000</pubDate>
      <link>https://dev.to/evan-dong/codex-v01280-goal-keeps-working-until-its-done-even-across-sessions-5d85</link>
      <guid>https://dev.to/evan-dong/codex-v01280-goal-keeps-working-until-its-done-even-across-sessions-5d85</guid>
      <description>&lt;p&gt;Every AI coding assistant forgets what it was doing the moment you close the terminal. Codex just fixed that.&lt;/p&gt;

&lt;p&gt;OpenAI shipped v0.128.0 on April 30th with two features that matter more than they sound: &lt;code&gt;/goal&lt;/code&gt; for persistent cross-session objectives, and &lt;code&gt;/pet&lt;/code&gt; for ambient agent status feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Session Amnesia Problem
&lt;/h2&gt;

&lt;p&gt;You ask your AI assistant to refactor a module. It gets halfway through. You close the terminal, grab coffee, come back -- and it has zero memory of what it was doing.&lt;/p&gt;

&lt;p&gt;You re-explain the task. It starts over. You lose 15 minutes of context every single time.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;intent persistence&lt;/strong&gt; problem. Not context window size -- the model simply forgets your &lt;em&gt;objective&lt;/em&gt; when the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  /goal: Define It Once, Codex Keeps Going
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/goal&lt;/code&gt; lets you set a persistent objective that survives across sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal create "Increase test coverage in src/auth/ from 62% to 90%"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Close the terminal. Reboot. Come back tomorrow. The goal is still there.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal create&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Define a persistent objective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal pause&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Suspend the goal, preserve progress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal resume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pick up where you left off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/goal clear&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mark done or abandon&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Under the hood, goal state is managed through app-server APIs with runtime continuation. When you &lt;code&gt;/goal resume&lt;/code&gt;, Codex restores the execution context -- not just the goal text.&lt;/p&gt;

&lt;p&gt;This shifts AI coding from &lt;strong&gt;request-response&lt;/strong&gt; to &lt;strong&gt;goal-driven agent&lt;/strong&gt;: you define the destination, the tool figures out how to get there across as many sessions as it takes.&lt;/p&gt;

&lt;h2&gt;
  
  
  /pet: Agent Observability, But Cute
&lt;/h2&gt;

&lt;p&gt;Type &lt;code&gt;/pet&lt;/code&gt; and a small animated creature appears in your Codex interface. It reflects what Codex is doing in the background:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a task? The pet is active.&lt;/li&gt;
&lt;li&gt;Tests passed? It celebrates.&lt;/li&gt;
&lt;li&gt;Something stuck? It reacts.&lt;/li&gt;
&lt;li&gt;Idle? It sleeps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;9to5Mac called them "little Dynamic Island-ish messengers." Sam Altman said: "This isn't the most important thing we've done, but it's more useful than it looks."&lt;/p&gt;

&lt;p&gt;You can also &lt;code&gt;/hatch&lt;/code&gt; a custom pet -- Codex generates one based on your project context.&lt;/p&gt;

&lt;p&gt;Silly? Sure. But &lt;strong&gt;agent observability&lt;/strong&gt; during long-running tasks is a real problem, and this solves it without requiring you to tail logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Signals
&lt;/h2&gt;

&lt;p&gt;When Cursor, Claude Code, and Codex generate roughly similar code, what differentiates them?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Old&lt;/th&gt;
&lt;th&gt;New&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task scope&lt;/td&gt;
&lt;td&gt;Single-turn&lt;/td&gt;
&lt;td&gt;Multi-session goal tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent visibility&lt;/td&gt;
&lt;td&gt;Terminal output&lt;/td&gt;
&lt;td&gt;Ambient status indicators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session model&lt;/td&gt;
&lt;td&gt;Stateless&lt;/td&gt;
&lt;td&gt;Stateful across restarts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Once core functionality reaches parity, &lt;strong&gt;experience becomes the differentiator&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.128.0 Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Virtual pet&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/pet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Animated agent status companion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom pet&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/hatch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AI-generated project-specific pet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goal system&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/goal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Persistent cross-session objectives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-update&lt;/td&gt;
&lt;td&gt;&lt;code&gt;codex update&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Update from terminal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Side chat&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/side&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parallel conversation panel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plugin marketplace&lt;/td&gt;
&lt;td&gt;marketplace&lt;/td&gt;
&lt;td&gt;One-click plugin install&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Practical Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;/goal&lt;/code&gt; for multi-day refactors, coverage targets, migration checklists. Not for one-off fixes.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;/pet&lt;/code&gt; as ambient monitoring during long agent runs.&lt;/li&gt;
&lt;li&gt;If you are juggling multiple AI tools (Codex, Claude Code, Gemini), the fragmentation tax is real. &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=codex-pet&amp;amp;utm_content=codex_pet" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt; unifies 30+ models behind one API gateway with smart routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/codex/releases" rel="noopener noreferrer"&gt;Codex v0.128.0 Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://9to5mac.com/2026/05/01/i-think-i-just-vibe-coded-lil-finder-guy-onto-my-mac/" rel="noopener noreferrer"&gt;9to5Mac: Vibe Coding Lil Finder Guy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://testingcatalog.com/openai-updates-codex-and-prepares-remote-control-feature/" rel="noopener noreferrer"&gt;TestingCatalog: Codex Update&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Claude Opus 4.7: What Actually Changed and Whether You Should Migrate</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:01:52 +0000</pubDate>
      <link>https://dev.to/evan-dong/claude-opus-47-what-actually-changed-and-whether-you-should-migrate-27e6</link>
      <guid>https://dev.to/evan-dong/claude-opus-47-what-actually-changed-and-whether-you-should-migrate-27e6</guid>
      <description>&lt;p&gt;If you follow AI model releases, you have already seen the headlines about Claude Opus 4.7. Most of them focus on benchmark numbers.&lt;/p&gt;

&lt;p&gt;This article focuses on something more useful: what changed in practice, what breaks during migration, and which workflows benefit most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Short Version
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.7 is Anthropic's strongest generally available model for agentic coding and structured enterprise work as of April 2026. It is not a universal upgrade. It introduces breaking API changes that require testing before migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Opus 4.7 Is Strongest
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Agentic Coding
&lt;/h3&gt;

&lt;p&gt;This is the headline improvement. Anthropic describes Opus 4.7 as a notable step up over Opus 4.6 for multi-step software engineering tasks. The difference shows most on work that requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reading a codebase across multiple files&lt;/li&gt;
&lt;li&gt;forming a plan and using tools&lt;/li&gt;
&lt;li&gt;verifying outputs before finalizing&lt;/li&gt;
&lt;li&gt;revising when initial attempts fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your LLM usage is mostly one-shot snippets or ad hoc brainstorming, the upgrade matters less.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-Resolution Vision
&lt;/h3&gt;

&lt;p&gt;Opus 4.7 raises the image ceiling from 1568px / 1.15MP to 2576px / 3.75MP with simpler 1:1 coordinate mapping. This matters for screenshot QA, UI bug investigation, dense chart interpretation, and document understanding workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task Budgets
&lt;/h3&gt;

&lt;p&gt;A new &lt;code&gt;task_budget&lt;/code&gt; parameter (beta) lets you give Claude an approximate token budget for the full agentic loop, including thinking, tool calls, and output. The model can prioritize work and wind down gracefully instead of hitting a wall mid-task.&lt;/p&gt;
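
&lt;p&gt;A sketch of what that might look like through the Python SDK is below. The &lt;code&gt;task_budget&lt;/code&gt; name comes from the launch notes; the model id and the use of &lt;code&gt;extra_body&lt;/code&gt; to pass a beta parameter are assumptions, so confirm both against the current API reference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch: passing the beta task_budget parameter via the Anthropic Python SDK.
# Model id and parameter placement are assumptions; check the API docs before shipping.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",               # placeholder id
    max_tokens=8_000,
    messages=[{"role": "user", "content": "Refactor the auth module and verify the tests pass."}],
    extra_body={"task_budget": 60_000},    # approximate budget for the whole agentic loop
    # Note: do not set temperature/top_p/top_k -- non-default values now return a 400.
)
print(response.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;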

&lt;h3&gt;
  
  
  Extended Thinking Control
&lt;/h3&gt;

&lt;p&gt;A new &lt;code&gt;xhigh&lt;/code&gt; effort level sits between &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt;, giving finer control over reasoning depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Breaks During Migration
&lt;/h2&gt;

&lt;p&gt;This is the part most review posts underplay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling parameters removed.&lt;/strong&gt; Setting &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;top_p&lt;/code&gt;, or &lt;code&gt;top_k&lt;/code&gt; to any non-default value returns a 400 error. If your production code depends on those controls, this is a migration task, not a footnote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extended thinking budgets removed.&lt;/strong&gt; Adaptive thinking is now the supported path, disabled by default unless you opt in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking output hidden by default.&lt;/strong&gt; Thinking content is omitted unless you explicitly choose a display mode like &lt;code&gt;summarized&lt;/code&gt;. Apps that surface reasoning traces will see UX changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokenizer changed.&lt;/strong&gt; The new tokenizer can consume between 1x and 1.35x as many tokens, depending on content. Old &lt;code&gt;max_tokens&lt;/code&gt; assumptions and compaction logic may behave differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;$15 / 1M tokens&lt;/td&gt;
&lt;td&gt;$75 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching write&lt;/td&gt;
&lt;td&gt;$18.75 / 1M tokens&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching read&lt;/td&gt;
&lt;td&gt;$1.50 / 1M tokens&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch API&lt;/td&gt;
&lt;td&gt;$7.50 / 1M tokens&lt;/td&gt;
&lt;td&gt;$37.50 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline price is simple. The real cost story is not. Because the tokenizer changed, two teams can quote the same pricing and end up with different effective costs. Replay real prompts and measure before committing.&lt;/p&gt;
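
&lt;p&gt;A quick way to see why is to run the 1x-1.35x range against your own volume. A back-of-the-envelope sketch with made-up monthly token counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Effective monthly cost under the new tokenizer, using the list prices above
# and the 1x-1.35x inflation range from the migration notes. Volumes are made up.
INPUT_PER_M, OUTPUT_PER_M = 15.00, 75.00              # $ per 1M tokens, Opus 4.7

monthly_input, monthly_output = 2_000_000, 400_000    # example volume measured on the old tokenizer

for factor in (1.00, 1.15, 1.35):
    cost = (monthly_input * factor / 1e6) * INPUT_PER_M + (monthly_output * factor / 1e6) * OUTPUT_PER_M
    print(f"tokenizer factor {factor:.2f}: ${cost:,.2f}/month")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;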

&lt;h2&gt;
  
  
  Who Should Upgrade
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 is a strong fit if you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;building coding agents that inspect, plan, and verify across files&lt;/li&gt;
&lt;li&gt;running enterprise workflows with documents, charts, or screenshots&lt;/li&gt;
&lt;li&gt;building long-horizon agents where follow-through matters&lt;/li&gt;
&lt;li&gt;willing to tune effort, caching, and token budgets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Who Should Test First
&lt;/h2&gt;

&lt;p&gt;Slow down if you are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sensitive to token cost variance&lt;/li&gt;
&lt;li&gt;dependent on sampling parameter controls&lt;/li&gt;
&lt;li&gt;building experiences where conversational style matters more than execution&lt;/li&gt;
&lt;li&gt;expecting a drop-in swap from Opus 4.6&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Access
&lt;/h2&gt;

&lt;p&gt;Available through Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and Claude consumer plans (Pro, Max, Team, Enterprise). Also rolling out in GitHub Copilot.&lt;/p&gt;

&lt;p&gt;For teams evaluating multiple models in production, a unified API gateway like &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=opus47&amp;amp;utm_content=opus47-review" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt; simplifies routing and billing across providers without vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.7 is one of the best generally available choices for agentic coding in April 2026. Adopt it as a measured workflow decision, not as a blanket default. Test your migration path before switching production traffic.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Based on Anthropic's official launch materials and API documentation published April 16, 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why Your AI Coding Assistant Keeps Overengineering: 4 Rules That Actually Fix It</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Wed, 29 Apr 2026 11:00:29 +0000</pubDate>
      <link>https://dev.to/evan-dong/why-your-ai-coding-assistant-keeps-overengineering-4-rules-that-actually-fix-it-d8a</link>
      <guid>https://dev.to/evan-dong/why-your-ai-coding-assistant-keeps-overengineering-4-rules-that-actually-fix-it-d8a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopengraph.githubassets.com%2F1%2Fforrestchang%2Fandrej-karpathy-skills" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fopengraph.githubassets.com%2F1%2Fforrestchang%2Fandrej-karpathy-skills" alt="Karpathy Skills GitHub" width="1200" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You ask Claude Code to add input validation. It writes an abstract class, three subclasses, and a factory method. 200 lines. You needed 5.&lt;/p&gt;

&lt;p&gt;You ask it to fix a bug. It refactors three adjacent functions, adds type hints everywhere, and switches all single quotes to double quotes. Half the diff has nothing to do with the bug.&lt;/p&gt;

&lt;p&gt;This isn't a you problem. It's a fundamental LLM tendency.&lt;/p&gt;

&lt;p&gt;Andrej Karpathy — OpenAI co-founder, former Tesla AI lead — catalogued the same frustrations. In January 2026, he posted on X about four recurring failure modes in LLM-assisted coding. Someone turned those observations into a single &lt;code&gt;CLAUDE.md&lt;/code&gt; file. Drop it in your project root, and Claude Code follows these rules automatically.&lt;/p&gt;

&lt;p&gt;The repo: &lt;a href="https://github.com/forrestchang/andrej-karpathy-skills" rel="noopener noreferrer"&gt;forrestchang/andrej-karpathy-skills&lt;/a&gt;. Currently at &lt;strong&gt;97.8k stars&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 1: Think Before Coding
&lt;/h2&gt;

&lt;p&gt;The most common failure mode: you say "write an export function," and the model silently makes every call for you: CSV format, all fields, overwrite existing files. You didn't specify any of that.&lt;/p&gt;

&lt;p&gt;This rule requires the model to state assumptions explicitly, list multiple interpretations when they exist, and stop when confused rather than guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; Silently assumes CSV, all fields, overwrites existing file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; "Before I implement this, I want to clarify: export format? Which fields? File handling — overwrite or append?"&lt;/p&gt;

&lt;p&gt;One surfaces assumptions in 30 seconds. The other costs you 30 minutes of debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 2: Simplicity First
&lt;/h2&gt;

&lt;p&gt;LLMs have a deep-seated tendency to overengineer. You ask for a discount calculation, you get a Strategy pattern with a factory method and a config file.&lt;/p&gt;

&lt;p&gt;The rule: no features beyond what was asked, no abstractions for single-use code, no speculative future-proofing. If 200 lines could be 50, rewrite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: 40 lines of Strategy pattern
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscountStrategy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PercentageDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DiscountStrategy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiscountFactory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# After: what was actually needed
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rule 3: Surgical Changes
&lt;/h2&gt;

&lt;p&gt;You ask it to fix a bug. It fixes the bug, but also adds type hints to adjacent functions, reformats quotes, renames a variable for "clarity." Your diff is 40 lines when it should be 4.&lt;/p&gt;

&lt;p&gt;The rule: touch only what you must. Every changed line should trace directly to the user's request. If you spot unrelated dead code, mention it — don't delete it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: asked to fix timezone bug, got 4 extra changes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# added type hints
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format a datetime object for display.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# added docstring
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# actual fix
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# changed quotes
&lt;/span&gt;
&lt;span class="c1"&gt;# After: one line changed, one line in the diff
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rule 4: Goal-Driven Execution
&lt;/h2&gt;

&lt;p&gt;LLMs write code and call it done. No tests, no verification.&lt;/p&gt;

&lt;p&gt;This rule reframes vague tasks into verifiable goals: "Add validation" becomes "Write tests for invalid inputs, then make them pass." The model loops until criteria are met instead of stopping at "looks about right."&lt;/p&gt;
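
&lt;p&gt;In practice, the restated goal is just a failing test suite. A sketch of what that looks like, assuming a hypothetical &lt;code&gt;parse_age&lt;/code&gt; helper as the thing being validated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# "Add validation" restated as verifiable goals: write the failing tests first,
# then implement parse_age until they pass. The module path is hypothetical.
import pytest

from myapp.validation import parse_age

@pytest.mark.parametrize("bad", ["", "abc", "-3", "1000", None])
def test_rejects_invalid_age(bad):
    with pytest.raises(ValueError):
        parse_age(bad)

def test_accepts_valid_age():
    assert parse_age("42") == 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;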

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude Code plugin&lt;/span&gt;
/plugin marketplace add forrestchang/andrej-karpathy-skills
/plugin &lt;span class="nb"&gt;install &lt;/span&gt;andrej-karpathy-skills@karpathy-skills

&lt;span class="c"&gt;# Or direct download&lt;/span&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; CLAUDE.md https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;Of the four, &lt;strong&gt;Surgical Changes&lt;/strong&gt; delivers the biggest day-to-day improvement. Overengineering is obvious — you see 200 lines and know something's wrong. Bad assumptions surface during debugging. But drive-by changes to unrelated code slip through review, especially in large diffs where formatting changes mix with logic changes.&lt;/p&gt;

&lt;p&gt;Limitation worth noting: these rules address behavioral tendencies, not capability gaps. If the model doesn't understand your architecture, four rules won't save it.&lt;/p&gt;

&lt;p&gt;One file, zero setup, immediate effect. Worth the 30 seconds.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're building AI workflows that call multiple models, &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=karpathy_skills&amp;amp;utm_content=karpathy-skills-blog" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt; provides a single API gateway to 30+ models — one endpoint, pay-per-use, no vendor lock-in.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Use HappyHorse 1.0 API: 4 Endpoints, 6 Prompt Templates, and Real Pricing</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Mon, 27 Apr 2026 12:26:49 +0000</pubDate>
      <link>https://dev.to/evan-dong/how-to-use-happyhorse-10-api-4-endpoints-6-prompt-templates-and-real-pricing-ode</link>
      <guid>https://dev.to/evan-dong/how-to-use-happyhorse-10-api-4-endpoints-6-prompt-templates-and-real-pricing-ode</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2F5e411dc30bae91cc007e5fa74bced4f1df5f813d48e46e1d54eb5f8a6147c6ac" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2F5e411dc30bae91cc007e5fa74bced4f1df5f813d48e46e1d54eb5f8a6147c6ac" alt="HappyHorse API" width="1080" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've been waiting for an AI video generation API that's actually production-ready and not just a demo page, HappyHorse 1.0 just shipped.&lt;/p&gt;

&lt;p&gt;Alibaba released public API access to HappyHorse 1.0 on April 27, 2026. It's the model that topped Video Arena's blind testing rankings, and it comes with four distinct endpoints covering text-to-video, image-to-video, reference-based generation, and natural language video editing.&lt;/p&gt;

&lt;p&gt;Here's what you need to know to start building with it today.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Endpoints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  happyhorse-1.0-t2v (Text-to-Video)
&lt;/h3&gt;

&lt;p&gt;Pure text prompt to video. No reference images needed. This is your starting point for most creative generation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; 720P at 0.9 RMB/second, 1080P at 1.6 RMB/second.&lt;/p&gt;

&lt;h3&gt;
  
  
  happyhorse-1.0-i2v (Image-to-Video)
&lt;/h3&gt;

&lt;p&gt;Feed it a static image plus a text prompt, and it animates the image with natural motion and camera movement. Strong visual consistency with the source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Same as t2v.&lt;/p&gt;

&lt;h3&gt;
  
  
  happyhorse-1.0-r2v (Reference-to-Video)
&lt;/h3&gt;

&lt;p&gt;The consistency powerhouse. Supports up to 9 reference images for subject and scene stability across shots. Use this when you need character consistency or precise creative control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Same tier.&lt;/p&gt;

&lt;h3&gt;
  
  
  happyhorse-1.0-video-edit (Video Editing)
&lt;/h3&gt;

&lt;p&gt;Natural language video editing. Modify existing videos using text instructions and up to 5 reference images. Handles both local and global edits while preserving original motion dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Same tier.&lt;/p&gt;
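
&lt;p&gt;The request schema depends on the provider you route through, so treat the following as a sketch only: the endpoint URL, field names, and response handling are assumptions, not the official HappyHorse spec.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sketch of a text-to-video request. The URL, field names,
# and response shape are assumptions -- check your provider's docs.
import requests

API_KEY = "YOUR_API_KEY"

payload = {
    "model": "happyhorse-1.0-t2v",   # or -i2v / -r2v / -video-edit
    "prompt": "An elderly fisherman mends nets on a stone pier at dusk, "
              "camera slowly pushes in, cinematic.",
    "resolution": "1080p",            # assumed parameter name
    "duration_seconds": 5,            # assumed parameter name
}

resp = requests.post(
    "https://api.example.com/v1/video/generations",  # placeholder endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # likely a task id to poll; exact shape depends on the provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;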

&lt;h2&gt;
  
  
  6 Prompt Templates That Actually Work
&lt;/h2&gt;

&lt;p&gt;The difference between mediocre and cinematic output comes down to prompting technique. Here are six patterns I'd recommend starting with.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Establishing Shot with Camera Push
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;An elderly fisherman in a deep blue wool sweater stands at the edge of a stone pier, mending fishing nets. Dusk settles over the bay, a lighthouse visible in the distance. Camera begins wide, then slowly pushes in to medium shot, focusing on his weathered hands. Seagulls circle overhead. Hyperrealistic, cinematic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Start wide, narrow in. Explicit camera direction creates tension.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Multi-Shot Action Sequence
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Shot 1: Side tracking shot, dirt bike rider launches off earthen ramp, slow motion. Shot 2: Low angle, motorcycle clears rusted school bus, sun behind rider. Shot 3: Landing, suspension compresses, mud splashes toward camera. Gritty texture, sun-bleached color grade.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Number each shot explicitly. HappyHorse transitions between them automatically within a single generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Portrait with Micro-Movements
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Close-up, woman with copper hair and freckles, direct eye contact. Soft window light from the left. She blinks, corner of her mouth lifts slightly, an autumn leaf drifts past her cheek. Shallow depth of field, subtle film grain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For portraits, specify micro-movements. The model excels at controlled, subtle motion.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Anime Style
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Anime style. Female student in navy uniform on rooftop at sunset, wind lifting her hair. Camera orbits from rear view to profile. Cherry blossom petals drift across frame. Soft cel-shaded colors, crisp linework.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lead with style declaration. The model maintains 2D aesthetic consistency without collapsing into 3D.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Product Commercial
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;15-second product ad. Shot 1: Extreme close-up, water droplets on matte black running shoe, slow motion. Shot 2: Runner on wet urban street at dawn, side tracking. Shot 3: Product hero shot, shoe rotates on white pedestal, soft rim lighting. Clean, premium aesthetic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Commercial structure: detail, action, hero. Abstract direction like "premium aesthetic" translates well.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Environmental Atmosphere
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Vast salt flat, blue hour, cracked white surface to flat horizon. Single figure walks toward camera from 100 meters, silhouetted against sunset. Wind kicks up dust haze. Camera low angle, static. Cool cyan shadows, warm coral at horizon. Hyperrealistic, anamorphic widescreen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When environment is the protagonist, establish the world first, then introduce the figure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image-to-Video: 6 Techniques
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Describe action, not the image.&lt;/strong&gt; The model sees your reference. Write only what happens next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use clean source images.&lt;/strong&gt; Sharp focus, good lighting. The model preserves detail but inherits defects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-crop to target aspect ratio.&lt;/strong&gt; 16:9, 9:16, or 1:1 before upload (see the crop sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specify camera language.&lt;/strong&gt; "Slow push in" beats "make it move."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use I2V for character consistency.&lt;/strong&gt; Generate the still in an image model first, then animate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shorter duration = greater stability.&lt;/strong&gt; 5 seconds is the sweet spot for living photo results.&lt;/li&gt;
&lt;/ol&gt;
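
&lt;p&gt;Technique 3 is the easiest one to automate. A minimal center-crop sketch using Pillow (the filenames and the 16:9 target are just placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Center-crop an image to a target aspect ratio before uploading it
# as an i2v reference. Requires Pillow: pip install Pillow
from PIL import Image

def center_crop_to_ratio(path, out_path, ratio=16 / 9):
    img = Image.open(path)
    w, h = img.size
    if w / h &gt; ratio:
        # Too wide: trim the sides.
        new_w = int(h * ratio)
        left = (w - new_w) // 2
        box = (left, 0, left + new_w, h)
    else:
        # Too tall: trim the top and bottom.
        new_h = int(w / ratio)
        top = (h - new_h) // 2
        box = (0, top, w, top + new_h)
    img.crop(box).save(out_path)

center_crop_to_ratio("reference.png", "reference_16x9.png", ratio=16 / 9)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;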

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;HappyHorse 1.0 is available through &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=happyhorse_api&amp;amp;utm_content=happyhorse-api-launch" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt;, providing API access for developers and creators worldwide.&lt;/p&gt;

&lt;p&gt;Video Arena's top-ranked model is now production-ready. If you've been building with video generation APIs and hitting quality ceilings, this is worth evaluating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What video generation use cases are you working on? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>DeepSeek V4 Flash vs Pro: How to Choose the Right Route for Your Coding Stack</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Sat, 25 Apr 2026 12:44:28 +0000</pubDate>
      <link>https://dev.to/evan-dong/deepseek-v4-flash-vs-pro-how-to-choose-the-right-route-for-your-coding-stack-2hdj</link>
      <guid>https://dev.to/evan-dong/deepseek-v4-flash-vs-pro-how-to-choose-the-right-route-for-your-coding-stack-2hdj</guid>
      <description>&lt;p&gt;If your team is evaluating DeepSeek V4 right now, the most useful question is not "should we use it?" — it's "which tier, and for which workloads?"&lt;/p&gt;

&lt;p&gt;As of April 24, 2026, DeepSeek's API now officially lists &lt;code&gt;deepseek-v4-flash&lt;/code&gt; and &lt;code&gt;deepseek-v4-pro&lt;/code&gt; with published pricing, 1M context, and 384K max output. Reuters separately confirmed the preview launch on the same date. The model is usable now, but preview status means you should still treat behavior as subject to change.&lt;/p&gt;

&lt;p&gt;This guide is for engineering leads and platform teams who need to make a concrete routing decision — not a launch recap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who this is for
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Platform teams migrating away from &lt;code&gt;deepseek-chat&lt;/code&gt; and &lt;code&gt;deepseek-reasoner&lt;/code&gt; before the July 24, 2026 deprecation&lt;/li&gt;
&lt;li&gt;Engineering leads deciding where Flash fits vs. where Pro earns its cost&lt;/li&gt;
&lt;li&gt;Teams trying to lower coding-model spend without replacing their premium fallback routes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Flash vs Pro: the one-paragraph decision
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Flash&lt;/strong&gt; (&lt;code&gt;deepseek-v4-flash&lt;/code&gt;): $0.14 input / $0.28 output per 1M tokens. Use this as your default route for code generation, repo reading, summarization, and agent loops where throughput matters. The compatibility aliases (&lt;code&gt;deepseek-chat&lt;/code&gt;, &lt;code&gt;deepseek-reasoner&lt;/code&gt;) map to Flash behavior on deprecation, so it's also the lowest-risk migration target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro&lt;/strong&gt; (&lt;code&gt;deepseek-v4-pro&lt;/code&gt;): $1.74 input / $3.48 output per 1M tokens. Use this as your escalation route for harder reasoning, multi-step analysis, and coding tasks where Flash doesn't clear your quality bar.&lt;/p&gt;

&lt;p&gt;The mental model that works best in production: Flash = default, Pro = escalation. Don't flip everything to Pro by default.&lt;/p&gt;
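
&lt;p&gt;A minimal sketch of that routing policy follows. The model names come from this article; everything else about the call (the OpenAI-compatible client, the base URL, the hard-coded escalation flag) is an assumption to verify against DeepSeek's docs, and a real setup would key escalation off your own eval signal.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: Flash as the default route, Pro as the escalation route.
# DeepSeek's API has historically been OpenAI-compatible, so this uses the
# openai client with DeepSeek's base URL; treat the details as assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def run_task(messages, needs_deep_reasoning=False):
    model = "deepseek-v4-pro" if needs_deep_reasoning else "deepseek-v4-flash"
    resp = client.chat.completions.create(model=model, messages=messages)
    return model, resp.choices[0].message.content

# Default route for routine code generation / repo reading:
model, answer = run_task([{"role": "user", "content": "Summarize this diff: ..."}])

# Escalate only when Flash misses your quality bar:
model, answer = run_task(
    [{"role": "user", "content": "Refactor this concurrency bug safely: ..."}],
    needs_deep_reasoning=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;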

&lt;h2&gt;
  
  
  Real cost shape by workload
&lt;/h2&gt;

&lt;p&gt;These are rough estimates using official public pricing to show the cost difference at scale — not guaranteed production numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Repository analysis (250K input / 20K output)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Estimated cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;~$0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;~$0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;~$0.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;~$1.75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Flash is the obvious first test for codebase reading, dependency audits, and repo summarization.&lt;/p&gt;
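
&lt;p&gt;The table values are straightforward to reproduce from the published per-1M-token rates. A minimal helper if you want to plug in your own token counts (small rounding differences against the tables above are expected):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough per-request cost from the published $/1M-token rates in this article.
PRICES = {  # (input, output) in USD per 1M tokens
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
}

def estimate_cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Scenario 1: repository analysis, 250K input / 20K output
print(f"{estimate_cost('deepseek-v4-flash', 250_000, 20_000):.2f}")  # ~0.04
print(f"{estimate_cost('deepseek-v4-pro', 250_000, 20_000):.2f}")    # ~0.50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;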

&lt;h3&gt;
  
  
  Scenario 2: Multi-turn coding agent (120K input / 80K output)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Estimated cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;~$0.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek V4 Pro&lt;/td&gt;
&lt;td&gt;~$0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;td&gt;~$1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.7&lt;/td&gt;
&lt;td&gt;~$2.60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Expensive output pricing punishes output-heavy workloads hard. This is where Flash's $0.28/M output rate matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: Long document review (400K input / 25K output)
&lt;/h3&gt;

&lt;p&gt;DeepSeek still holds a major cost advantage here. GPT-5.4 also documents a long-context premium rule (2x input / 1.5x output) for prompts above 272K tokens, which can change the economics significantly for large-context sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration checklist: from deepseek-chat / deepseek-reasoner
&lt;/h2&gt;

&lt;p&gt;DeepSeek's official docs confirm both legacy names are deprecated on &lt;strong&gt;July 24, 2026&lt;/strong&gt; and map to Flash compatibility behavior. Here's a practical migration path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inventory&lt;/strong&gt; every current reference to &lt;code&gt;deepseek-chat&lt;/code&gt; and &lt;code&gt;deepseek-reasoner&lt;/code&gt; in your codebase (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Flash first&lt;/strong&gt; — because the compatibility aliases map to Flash, it's the lowest-risk first step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promote only specific workloads to Pro&lt;/strong&gt; — give Pro a narrow job (difficult coding, deeper analysis) before expanding its scope&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep rollback routes active&lt;/strong&gt; — preview means you should be able to revert quickly if quality, latency, or schema behavior changes&lt;/li&gt;
&lt;/ol&gt;
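
&lt;p&gt;Step 1 is easy to script. A small sketch that walks a repo and flags every file still referencing the legacy model names (the extension list is just a default to adjust):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find lingering references to the deprecated DeepSeek model names.
import os

LEGACY = ("deepseek-chat", "deepseek-reasoner")
EXTS = (".py", ".ts", ".js", ".go", ".yaml", ".yml", ".json", ".toml", ".env")

def find_legacy_refs(root="."):
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(EXTS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue
            hits.extend((path, needle) for needle in LEGACY if needle in text)
    return hits

for path, needle in find_legacy_refs():
    print(f"{path}: still references {needle}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;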

&lt;h2&gt;
  
  
  Where DeepSeek V4 has real limits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Preview status still matters.&lt;/strong&gt; Reuters explicitly describes the release as a preview. Behavior can still change before finalization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You still need your own eval set.&lt;/strong&gt; No benchmark page tells you whether a model handles your specific codebase, your prompts, your failure patterns, and your latency budget — especially for agent loops, diff quality, and schema reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Premium closed models still win on some tasks.&lt;/strong&gt; Claude Opus 4.7 and GPT-5.4 are not going away for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highest-risk code changes&lt;/li&gt;
&lt;li&gt;Hardest agentic tasks&lt;/li&gt;
&lt;li&gt;Enterprise workflows where failure costs are high&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to keep Claude Opus 4.7 or GPT-5.4
&lt;/h2&gt;

&lt;p&gt;Keep Claude Opus 4.7 if your team handles the hardest coding and review tasks and agent reliability matters more than token cost. Anthropic confirmed Opus 4.7 is generally available at $5/M input, $25/M output — same as Opus 4.6.&lt;/p&gt;

&lt;p&gt;Keep GPT-5.4 if your team is already deeply invested in the OpenAI platform and your workflow depends on surrounding tooling as much as the model itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack that works for most teams
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DeepSeek V4 Flash  →  default routing (code gen, repo reading, agent loops)
DeepSeek V4 Pro    →  escalation (harder reasoning, complex coding tasks)
Claude Opus 4.7    →  premium fallback (highest-stakes work)
GPT-5.4            →  premium fallback (OpenAI platform-dependent work)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is usually better than trying to crown one universal winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production rollout checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Define 20–50 real tasks from your own workload&lt;/li&gt;
&lt;li&gt;Separate simple default-route tasks from premium-route tasks&lt;/li&gt;
&lt;li&gt;Benchmark Flash and Pro independently&lt;/li&gt;
&lt;li&gt;Compare output quality, not just benchmark headlines&lt;/li&gt;
&lt;li&gt;Measure cost per successful task, not just cost per token (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Keep rollback routes for GPT-5.4 or Claude Opus 4.7&lt;/li&gt;
&lt;li&gt;Version prompts and evaluation harnesses&lt;/li&gt;
&lt;li&gt;Log tool-call failures and schema failures separately&lt;/li&gt;
&lt;li&gt;Watch latency and retry patterns during preview&lt;/li&gt;
&lt;li&gt;Decide in advance what counts as "good enough to promote"&lt;/li&gt;
&lt;/ul&gt;
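
&lt;p&gt;One item worth making concrete: cost per successful task is an actual number, not a slogan. A minimal way to compute it from eval logs (the record fields are assumptions about your own logging format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cost per successful task: total spend divided by tasks that passed your eval.
# The record fields below are assumptions about how you log eval runs.
def cost_per_successful_task(records):
    total_cost = sum(r["cost_usd"] for r in records)
    successes = sum(1 for r in records if r["passed"])
    return float("inf") if successes == 0 else total_cost / successes

runs = [
    {"model": "deepseek-v4-flash", "cost_usd": 0.04, "passed": True},
    {"model": "deepseek-v4-flash", "cost_usd": 0.05, "passed": False},
    {"model": "deepseek-v4-pro", "cost_usd": 0.49, "passed": True},
]
print(round(cost_per_successful_task(runs), 2))  # 0.29
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;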




&lt;p&gt;&lt;em&gt;Sources: &lt;a href="https://platform.deepseek.com/api-docs/" rel="noopener noreferrer"&gt;DeepSeek API Docs&lt;/a&gt;, &lt;a href="https://platform.deepseek.com/models" rel="noopener noreferrer"&gt;DeepSeek Pricing&lt;/a&gt;, &lt;a href="https://www.anthropic.com/claude/opus" rel="noopener noreferrer"&gt;Anthropic Claude Opus 4.7&lt;/a&gt;, &lt;a href="https://platform.openai.com/docs/models" rel="noopener noreferrer"&gt;OpenAI GPT-5.4&lt;/a&gt;, &lt;a href="https://www.reuters.com" rel="noopener noreferrer"&gt;Reuters&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: #deepseek #api #llm #aiengineering #codingtools&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>api</category>
      <category>deepseek</category>
    </item>
    <item>
      <title>DeepSeek-V4 Runs on Huawei Ascend Chips at 85% Utilization — Here's What That Means for AI Infrastructure and Pricing</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Fri, 24 Apr 2026 08:38:42 +0000</pubDate>
      <link>https://dev.to/evan-dong/deepseek-v4-runs-on-huawei-ascend-chips-at-85-utilization-heres-what-that-means-for-ai-obf</link>
      <guid>https://dev.to/evan-dong/deepseek-v4-runs-on-huawei-ascend-chips-at-85-utilization-heres-what-that-means-for-ai-obf</guid>
      <description>&lt;p&gt;DeepSeek released V4 on April 24, 2026. The headline numbers are striking on their own: &lt;strong&gt;1 million token context window&lt;/strong&gt;, &lt;strong&gt;Agent capabilities rivaling Claude Opus 4.6&lt;/strong&gt; on non-reasoning tasks, and &lt;strong&gt;API pricing 90% cheaper than GPT-4 Turbo&lt;/strong&gt;. But the real story is what's underneath — &lt;strong&gt;DeepSeek-V4 runs on Huawei Ascend chips with 85%+ utilization&lt;/strong&gt;, proving that China's domestic AI hardware stack can now compete with, and potentially undercut, Western alternatives built on Nvidia GPUs.&lt;/p&gt;

&lt;p&gt;This isn't just a model release. It's a strategic signal about the future of AI infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Huawei Ascend Partnership: From "Usable" to "Competitive"
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 is the first Tier-1 large language model to achieve &lt;strong&gt;full inference compatibility with Huawei Ascend chips&lt;/strong&gt;, with reported utilization rates exceeding &lt;strong&gt;85%&lt;/strong&gt;. For context, most domestic Chinese AI chips have struggled to hit 60% utilization on production inference workloads due to software stack immaturity and operator coverage gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changed to make 85% utilization possible:&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Deep Hardware-Software Co-Optimization
&lt;/h3&gt;

&lt;p&gt;DeepSeek worked directly with Huawei to optimize kernel implementations for &lt;strong&gt;Ascend 910B and Ascend 950 chips&lt;/strong&gt;, focusing specifically on the operations that define V4's architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MoE (Mixture of Experts) routing&lt;/strong&gt;: The sparse activation pattern that lets V4 use only a fraction of its 1.6 trillion parameters per inference call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse attention computation&lt;/strong&gt;: The DSA mechanism that compresses attention at the token dimension&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-intensive operations&lt;/strong&gt;: The Engram architecture's retrieval module that bridges CPU and GPU memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Custom Operator Fusion for CANN Framework
&lt;/h3&gt;

&lt;p&gt;Traditional Transformer operations were re-engineered to align with Huawei's &lt;strong&gt;CANN (Compute Architecture for Neural Networks)&lt;/strong&gt; framework. Standard deep learning operators designed for CUDA had to be decomposed and reassembled to match Ascend's compute graph execution model. This eliminated memory bandwidth bottlenecks that previously capped utilization at ~60%.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Production-Scale Validation
&lt;/h3&gt;

&lt;p&gt;DeepSeek's internal engineering teams ran V4 on Ascend infrastructure for weeks before the public release. Their reported findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference quality matches Nvidia A100 deployments&lt;/strong&gt; across standard benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware costs reduced by approximately 40%&lt;/strong&gt; compared to equivalent A100 clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput scales linearly&lt;/strong&gt; up to the cluster sizes tested&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for the broader AI industry:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since the U.S. imposed high-end GPU export restrictions on China in October 2022, Chinese AI labs have been forced to choose between three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stockpile pre-ban Nvidia chips&lt;/strong&gt; — finite supply, increasingly expensive on secondary markets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use older or smuggled GPUs&lt;/strong&gt; — legal risk, limited performance ceiling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wait for domestic chip alternatives to mature&lt;/strong&gt; — capability gap, uncertain timeline&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DeepSeek-V4 proves that &lt;strong&gt;option 3 is now viable at production scale&lt;/strong&gt;. If a model can match Claude Opus 4.6 on non-reasoning tasks while running entirely on domestic Chinese hardware, the "you need Nvidia to compete in AI" narrative starts to crack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Bomb: V4-Flash at $0.014 Per Million Input Tokens
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 introduces &lt;strong&gt;tiered pricing&lt;/strong&gt; across two model sizes, both with the full 1 million token context window:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;td&gt;$2.19&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.014&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For comparison, here's what you'd pay with competing Western models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Output (per 1M tokens)&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo (OpenAI)&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;td&gt;$30.00&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6 (Anthropic)&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;$75.00&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1 Pro (Google)&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;2M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.014&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.28&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M tokens&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;V4-Flash is 700x cheaper than GPT-4 Turbo on input tokens, and 100x cheaper on output tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even V4-Pro — the flagship model with Agent capabilities approaching Claude Opus 4.6 — costs &lt;strong&gt;$2.19 per million output tokens&lt;/strong&gt; compared to Opus's &lt;strong&gt;$75&lt;/strong&gt;. That's a &lt;strong&gt;34x price difference&lt;/strong&gt; for comparable non-reasoning performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You Can Actually Build at These Prices
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Long-context document analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Process a 500-page legal contract (~200K tokens input, ~10K tokens output):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 Turbo&lt;/strong&gt;: $2.00 (input) + $0.30 (output) = &lt;strong&gt;$2.30 per document&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt;: $0.11 (input) + $0.02 (output) = &lt;strong&gt;$0.13 per document&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;: $0.003 (input) + $0.003 (output) = &lt;strong&gt;$0.006 per document&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At V4-Flash prices, you could analyze &lt;strong&gt;383 legal contracts&lt;/strong&gt; for the cost of analyzing &lt;strong&gt;one&lt;/strong&gt; on GPT-4 Turbo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Agent-based coding assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generate 50K tokens of code per day for a development team (1.5M output tokens/month):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt;: &lt;strong&gt;$112.50/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4-Pro&lt;/strong&gt;: &lt;strong&gt;$3.29/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;: &lt;strong&gt;$0.42/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: High-volume customer support chatbot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serve 1 million user queries per month (average 1K input tokens + 500 output tokens per query):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 Turbo&lt;/strong&gt;: $10,000 (input) + $15,000 (output) = &lt;strong&gt;$25,000/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt;: $15,000 (input) + $37,500 (output) = &lt;strong&gt;$52,500/month&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek V4-Flash&lt;/strong&gt;: $14 (input) + $140 (output) = &lt;strong&gt;$154/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At these price points, entire categories of AI applications — enterprise document processing, automated customer support, code generation pipelines, research summarization — become economically viable for small teams and individual developers who previously couldn't afford production-scale LLM deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Foundations: The Three Architectural Innovations Behind V4's Cost Structure
&lt;/h2&gt;

&lt;p&gt;DeepSeek didn't just slash prices by running on cheaper hardware. V4 introduces &lt;strong&gt;three architectural innovations&lt;/strong&gt; that fundamentally reduce the cost of inference at every level of the stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Innovation 1: Engram Architecture — Separating Memory from Computation
&lt;/h3&gt;

&lt;p&gt;Traditional Transformer models store all learned knowledge in GPU memory through their parameter weights. This creates a direct coupling: longer context windows and larger knowledge bases require proportionally more expensive GPU memory.&lt;/p&gt;

&lt;p&gt;V4's &lt;strong&gt;Engram architecture&lt;/strong&gt; breaks this coupling by splitting the model into two distinct modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static knowledge retrieval module&lt;/strong&gt;: Stores factual knowledge, world knowledge, and learned patterns in &lt;strong&gt;cheap CPU RAM&lt;/strong&gt; using a hash-based lookup mechanism. This module handles the "what does the model know" question.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic reasoning module&lt;/strong&gt;: Runs on GPU and handles the "how should the model think about this specific query" question. It decides which memories to retrieve from the static module and integrates them into the inference chain.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The practical result&lt;/strong&gt;: V4 can handle 1 million token context windows without proportional GPU memory growth. This is why DeepSeek can offer &lt;strong&gt;1M context as the default for all API tiers&lt;/strong&gt; — the marginal cost of extending context from 128K to 1M is minimal because the expensive GPU memory isn't what scales.&lt;/p&gt;

&lt;p&gt;This is a fundamentally different approach from OpenAI's and Anthropic's architectures, which still couple knowledge storage and reasoning computation in the same GPU memory space.&lt;/p&gt;
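
&lt;p&gt;To make the separation concrete, here is a toy sketch, emphatically not DeepSeek's code: knowledge sits in a hash-keyed store in ordinary host RAM, and the per-query path retrieves only what it needs before the expensive compute step runs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy illustration of memory/compute separation (NOT DeepSeek's implementation):
# a hash-keyed store in ordinary host RAM, queried per request, so the
# "reasoning" path only touches the facts it retrieves.
import hashlib

class StaticMemory:
    def __init__(self):
        self._store = {}

    def _key(self, text):
        return hashlib.sha256(text.lower().encode()).hexdigest()

    def add(self, topic, fact):
        self._store.setdefault(self._key(topic), []).append(fact)

    def retrieve(self, topic):
        return self._store.get(self._key(topic), [])

memory = StaticMemory()
memory.add("ascend 910b", "Huawei accelerator used for V4 inference, per this article.")

def answer(query, memory):
    facts = memory.retrieve(query)        # cheap host-RAM lookup
    return f"facts considered: {facts}"   # stand-in for the GPU reasoning step

print(answer("Ascend 910B", memory))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;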

&lt;h3&gt;
  
  
  Innovation 2: mHC (Manifold-Constrained Hyper-Connections) — Stable Deep Network Training
&lt;/h3&gt;

&lt;p&gt;Training a &lt;strong&gt;1.6 trillion parameter Mixture of Experts model&lt;/strong&gt; is notoriously unstable. Gradients explode, training runs collapse, and teams waste weeks of compute on failed experiments. This instability is one of the hidden costs that inflates the price of frontier models.&lt;/p&gt;

&lt;p&gt;V4 uses &lt;strong&gt;mHC (Manifold-Constrained Hyper-Connections)&lt;/strong&gt; technology to solve this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer connections are projected onto a &lt;strong&gt;bi-stochastic matrix manifold&lt;/strong&gt; using the &lt;strong&gt;Sinkhorn-Knopp algorithm&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;This enforces a mathematical invariant: &lt;strong&gt;signal conservation&lt;/strong&gt; — the sum of inputs equals the sum of outputs at every node in the network&lt;/li&gt;
&lt;li&gt;The constraint prevents the "signal explosion" phenomenon that normally kills deep network training runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The practical result&lt;/strong&gt;: DeepSeek can train deeper, more parameter-efficient models without the trial-and-error waste that inflates training costs at other labs. Fewer failed training runs = lower amortized cost per inference = lower API prices.&lt;/p&gt;
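
&lt;p&gt;Sinkhorn-Knopp itself is a standard algorithm: alternately normalize the rows and columns of a nonnegative matrix until both sum to one, which projects it onto the doubly stochastic set. A small numpy sketch of that projection, to make the "signal conservation" invariant concrete (this illustrates the constraint, not mHC's training integration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sinkhorn-Knopp: project a nonnegative matrix onto the set of
# doubly stochastic matrices (every row and column sums to 1).
import numpy as np

def sinkhorn_knopp(m, iters=200):
    m = np.asarray(m, dtype=float)
    for _ in range(iters):
        m = m / m.sum(axis=1, keepdims=True)  # normalize rows
        m = m / m.sum(axis=0, keepdims=True)  # normalize columns
    return m

raw = np.random.rand(4, 4) + 1e-6
ds = sinkhorn_knopp(raw)
print(ds.sum(axis=1))  # all ~1.0: each node passes on exactly what it receives
print(ds.sum(axis=0))  # all ~1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;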

&lt;h3&gt;
  
  
  Innovation 3: DSA (DeepSeek Sparse Attention) — Token-Level Compression
&lt;/h3&gt;

&lt;p&gt;Standard attention mechanisms compute pairwise relationships between all tokens in the context window, creating &lt;strong&gt;O(n²) computational complexity&lt;/strong&gt;. This is why long-context inference is expensive — doubling the context length quadruples the attention computation.&lt;/p&gt;

&lt;p&gt;V4's &lt;strong&gt;DSA (DeepSeek Sparse Attention)&lt;/strong&gt; compresses attention computation &lt;strong&gt;at the token dimension&lt;/strong&gt;, not just the head dimension (which is what most prior sparse attention methods target). Combined with learned sparse attention patterns, this achieves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compute reduction from O(n²) to near-linear scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;60-70% reduction in memory bandwidth requirements&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1M token context inference on consumer-grade hardware&lt;/strong&gt; (for the Flash tier)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The practical result&lt;/strong&gt;: Lower inference compute per token → lower electricity and hardware costs per API call → lower API prices passed to developers.&lt;/p&gt;
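
&lt;p&gt;The exact DSA mechanism isn't spelled out above, but the complexity argument is easy to see with a generic windowed-attention toy: if each token attends to at most a fixed-size window rather than every other token, the number of attention pairs grows linearly with sequence length instead of quadratically.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generic windowed (local) attention cost vs. full attention cost.
# Illustrates why token-level sparsity turns O(n^2) into near-linear scaling;
# this is a generic example, not DeepSeek's actual DSA mechanism.
def full_attention_pairs(n):
    return n * n

def windowed_attention_pairs(n, window=512):
    return n * min(window, n)  # each token sees at most `window` tokens

for n in (4_096, 65_536, 1_000_000):
    print(n, full_attention_pairs(n), windowed_attention_pairs(n))
# At 1M tokens: 1e12 pairs for full attention vs. 5.12e8 for a 512-token window.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;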

&lt;h2&gt;
  
  
  The Geopolitical Subtext: A Deliberate Mirror Image
&lt;/h2&gt;

&lt;p&gt;On April 23, 2026 — &lt;strong&gt;one day before V4's public release&lt;/strong&gt; — Reuters reported that DeepSeek &lt;strong&gt;refused to grant early API access to U.S. chip manufacturers&lt;/strong&gt;, including Nvidia. This mirrors the U.S. government's October 2022 ban on exporting high-end AI GPUs (A100, H100) to China.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The strategic sequence:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;U.S. restricts chip exports to China&lt;/strong&gt; → Chinese AI labs lose access to H100/A100 GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek builds V4 on Huawei Ascend&lt;/strong&gt; → proves domestic Chinese chips can run Tier-1 models at production scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek restricts U.S. access to V4 API&lt;/strong&gt; → signals technological parity and strategic independence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't just about one model or one company. It's about &lt;strong&gt;ecosystem decoupling&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Chinese labs can train and deploy competitive models on domestic hardware...&lt;/li&gt;
&lt;li&gt;And Chinese cloud providers (Alibaba Cloud, Tencent Cloud, Huawei Cloud) offer these models at 1/100th the price of Western alternatives...&lt;/li&gt;
&lt;li&gt;Then &lt;strong&gt;the global AI supply chain splits into two parallel technology stacks&lt;/strong&gt;: one built on Nvidia/CUDA/AWS/OpenAI, one built on Ascend/CANN/Huawei Cloud/DeepSeek.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers and enterprises, this creates a new dimension of technology strategy that didn't exist 12 months ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DeepSeek-V4 Means for Developers Outside China
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Short-Term Impact (2026-2027)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Price pressure on Western AI providers&lt;/strong&gt;: If DeepSeek can offer GPT-4-class models at $0.28/M output tokens, OpenAI and Anthropic will face margin compression. Expect aggressive price cuts or new "economy" model tiers from Western providers within 6 months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-model routing becomes standard architecture&lt;/strong&gt;: Developers will route simple classification, extraction, and summarization tasks to V4-Flash ($0.28/M) while reserving complex reasoning, safety-critical, and creative tasks for Claude Opus 4.6 ($75/M) or GPT-4 Turbo ($30/M). The cost difference makes single-model architectures economically irrational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geopolitical compliance becomes a development concern&lt;/strong&gt;: U.S. developers may face restrictions on using Chinese AI APIs, similar to TikTok-related concerns. Enterprise compliance teams will need to audit model provenance and data routing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Long-Term Impact (2028+)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Two parallel AI ecosystems&lt;/strong&gt;: Western stack (Nvidia + OpenAI/Anthropic/Google) vs. Chinese stack (Ascend + DeepSeek/Alibaba/Baidu). Developers building for global markets may need to maintain dual implementations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commoditization of intelligence&lt;/strong&gt;: If 1M-context models cost $0.28/M tokens, AI becomes infrastructure — like cloud storage, CDN bandwidth, or database queries. The competitive moat shifts from "access to intelligence" to "what you build with intelligence."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open-source ecosystem fragmentation&lt;/strong&gt;: DeepSeek releases model weights, but they're optimized for Ascend chips. Western researchers may struggle to replicate results on Nvidia hardware without significant re-optimization, fragmenting the open-source AI community along hardware lines.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to Access DeepSeek-V4: API Reference and Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  REST API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://api.deepseek.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "deepseek-v4-pro",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum entanglement in simple terms"}
    ],
    "max_tokens": 1000,
    "temperature": 0.7
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deepseek-v4-pro&lt;/code&gt; — Flagship model, optimized for Agent workflows and complex multi-step tasks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepseek-v4-flash&lt;/code&gt; — Faster inference, lower cost, retains 98% of Pro's reasoning ability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reasoning Mode for Complex Agent Tasks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deepseek-v4-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_effort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Design a microservices architecture for a real-time bidding system"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reasoning mode activates chain-of-thought inference similar to Claude Opus 4.6's extended thinking mode. Use &lt;code&gt;reasoning_effort: "max"&lt;/code&gt; for complex architectural decisions, code generation, and multi-step problem solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open-Source Model Weights
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hugging Face&lt;/strong&gt;: &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;huggingface.co/deepseek-ai/DeepSeek-V4-Pro&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ModelScope (China)&lt;/strong&gt;: &lt;a href="https://modelscope.cn/models/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;modelscope.cn/models/deepseek-ai/DeepSeek-V4-Pro&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Try DeepSeek-V4 directly: &lt;a href="https://evolink.ai/deepseek-chat?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=deepseek_v4&amp;amp;utm_content=deepseek-v4-analysis" rel="noopener noreferrer"&gt;DeepSeek Chat on EvoLink&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture: Post-Scaling Law AI
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 represents a &lt;strong&gt;paradigm shift&lt;/strong&gt; from brute-force scaling to &lt;strong&gt;architectural efficiency&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Old paradigm&lt;/strong&gt;: More parameters + more training data + more compute = better models. This is the approach that drove GPT-3 → GPT-4 improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New paradigm&lt;/strong&gt;: Smarter architectures (Engram) + memory-compute separation + sparse attention (DSA) + training stability (mHC) = cheaper, more capable models on diverse hardware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling returns are diminishing&lt;/strong&gt;: The improvement from GPT-4 to GPT-5 is marginal compared to GPT-3 to GPT-4. The low-hanging fruit of pure scale is gone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency becomes the competitive moat&lt;/strong&gt;: If you can deliver GPT-4-class intelligence at 1/100th the cost, you don't need to be 10x smarter — you just need to be 10x cheaper. DeepSeek is betting on this strategy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware diversity wins&lt;/strong&gt;: When models are optimized for architectural efficiency rather than raw compute, they can run on diverse hardware platforms — Huawei Ascend, AMD Instinct, Intel Gaudi, even mobile chips. Nvidia's GPU monopoly weakens as the industry moves from "more FLOPS" to "smarter FLOPS."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DeepSeek-V4 is the first major model to prove this thesis at production scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The question DeepSeek-V4 poses isn't "is it better than Claude or GPT-4 on benchmark X?" The question is: &lt;strong&gt;what happens to the AI industry when intelligence costs $0.28 per million tokens?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're about to find out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fe4242d90fc3679c371bbf1a303f24c208595bd7c5f4c828db9a14530430bda1e" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fe4242d90fc3679c371bbf1a303f24c208595bd7c5f4c828db9a14530430bda1e" alt="DeepSeek V4 pricing and architecture overview" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://api.deepseek.com" rel="noopener noreferrer"&gt;DeepSeek API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf" rel="noopener noreferrer"&gt;DeepSeek-V4 Technical Report (PDF)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://deepseek.com/pricing" rel="noopener noreferrer"&gt;DeepSeek Pricing Calculator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://e.huawei.com/en/products/servers/ascend" rel="noopener noreferrer"&gt;Huawei Ascend AI Processors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Model Weights on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Disclosure: This analysis is based on publicly available information and technical documentation. The author has no financial relationship with DeepSeek, Huawei, or competing AI providers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>GPT Image 2 + Seedance 2.0: A Practical Workflow from Static Visuals to Publishable Shorts</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Thu, 23 Apr 2026 08:17:13 +0000</pubDate>
      <link>https://dev.to/evan-dong/gpt-image-2-seedance-20-a-practical-workflow-from-static-visuals-to-publishable-shorts-4p02</link>
      <guid>https://dev.to/evan-dong/gpt-image-2-seedance-20-a-practical-workflow-from-static-visuals-to-publishable-shorts-4p02</guid>
      <description>&lt;p&gt;If you've been working with AI visuals lately, you've probably felt a clear shift: image generation and video generation are no longer two disconnected steps. They're becoming a reusable production pipeline.&lt;/p&gt;

&lt;p&gt;The core idea is simple: &lt;strong&gt;use GPT Image 2 to design the visuals correctly first, then use Seedance 2.0 to turn those visuals into motion, rhythm, atmosphere, and sound.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this division of labor works
&lt;/h2&gt;

&lt;p&gt;A lot of people start by throwing a single text-to-video prompt at a model and hoping the result will feel cinematic. Sometimes the video moves, but the storytelling collapses. Sometimes the cuts are interesting, but the character design drifts.&lt;/p&gt;

&lt;p&gt;The more reliable approach is to divide the work properly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPT Image 2&lt;/strong&gt; handles pre-production visual design: character sheets, storyboard grids, comic pages, posters, title cards, key art&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seedance 2.0&lt;/strong&gt; handles motion and audiovisual execution: camera movement, shot progression, sound atmosphere, final video feel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you first lock the character, framing, and visual order with GPT Image 2, then pass the result into Seedance 2.0, you're breaking one difficult task into two more manageable ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow 1: Storyboard grid → 15-second trailer
&lt;/h2&gt;

&lt;p&gt;Generate a 3×3 storyboard grid with GPT Image 2 where each panel represents a shot, then use that image as the starting frame for Seedance 2.0 and guide the sequence with a shot-by-shot motion prompt.&lt;/p&gt;

&lt;p&gt;This works because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pacing is naturally controlled — each panel already corresponds to a defined beat&lt;/li&gt;
&lt;li&gt;Character and style consistency are stronger — all nine shots are generated inside one unified image&lt;/li&gt;
&lt;li&gt;Seedance 2.0 is far more likely to interpret the input as a multi-shot sequence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fe4242d90fc3679c371bbf1a303f24c208595bd7c5f4c828db9a14530430bda1e" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fe4242d90fc3679c371bbf1a303f24c208595bd7c5f4c828db9a14530430bda1e" alt="Storyboard grid example" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow 2: Comic page or character sheet → animated short
&lt;/h2&gt;

&lt;p&gt;Treat GPT Image 2 outputs — comic pages, character sheets, narrative design boards — as visual scripts, then use Seedance 2.0 to animate them.&lt;/p&gt;

&lt;p&gt;The condition is simple: &lt;strong&gt;the input image must not only be beautiful; it must be usable as shot design.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fd4615a02305aaace07b267206aac36086405b1bc3a65c0f0fd13ff3d2dc03dbf" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.gooo.ai%2Fweb-images%2Fd4615a02305aaace07b267206aac36086405b1bc3a65c0f0fd13ff3d2dc03dbf" alt="Character sheet example" width="900" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical sequence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Write shot intent before you write prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before generating anything, write a short shot list. Even for a 15-second piece, define the opening beat, middle beat, escalation, and ending hold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Generate the storyboard or character sheet with GPT Image 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use a structured prompt that specifies panel count, shot types, and visual style. The goal is not a pretty image — it's a usable production asset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Pass the image into Seedance 2.0 with a motion prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reference specific panels in your motion prompt. Describe camera movement, pacing, and transitions explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Iterate on the motion prompt, not the image&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the video doesn't feel right, adjust the motion prompt first. Only regenerate the source image if the visual design itself is the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt resources
&lt;/h2&gt;

&lt;p&gt;For ready-to-use GPT Image 2 prompts covering storyboard grids, character sheets, comic pages, and more:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/EvoLinkAI/awesome-gpt-image-2-prompts" rel="noopener noreferrer"&gt;EvoLinkAI/awesome-gpt-image-2-prompts&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The repo includes prompts organized by use case, with notes on what works well for downstream video generation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The most reliable path for AI trailers, animated teasers, and story-driven shorts: design the image first, then generate the video.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Google Deep Research Is No Longer a Chatbot Feature — It's a Research Platform</title>
      <dc:creator>Evan-dong</dc:creator>
      <pubDate>Wed, 22 Apr 2026 11:58:59 +0000</pubDate>
      <link>https://dev.to/evan-dong/google-deep-research-is-no-longer-a-chatbot-feature-its-a-research-platform-1c9m</link>
      <guid>https://dev.to/evan-dong/google-deep-research-is-no-longer-a-chatbot-feature-its-a-research-platform-1c9m</guid>
      <description>&lt;p&gt;Google's latest Deep Research upgrade is worth paying attention to, and not just because it's faster or smarter.&lt;/p&gt;

&lt;p&gt;What changed is the product's positioning. Google is no longer presenting Deep Research as a chatbot feature that helps you look things up. With the Gemini 3.1 Pro upgrade, Deep Research Max, MCP support, multimodal grounding, and enterprise data integration, it's being positioned as a &lt;strong&gt;research workflow platform&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's a meaningful distinction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot4pflavtxs19emni3l4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot4pflavtxs19emni3l4.jpg" alt="Google Deep Research upgrade" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Collaborative planning&lt;/strong&gt;: Before execution, users can now review and edit the system's research plan. This is significant — it shifts the model from "AI produces output" to "human directs workflow, AI executes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tool support in one run&lt;/strong&gt;: Google Search, remote MCP servers, URL Context, Code Execution, and File Search can all operate within the same research workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Private data grounding&lt;/strong&gt;: Web access can be turned off entirely, enabling research runs grounded only in internal documents. This is the enterprise unlock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal inputs&lt;/strong&gt;: PDFs, CSVs, images, audio, and video alongside text. Real-world research doesn't live in clean prose — product teams have slide decks, investors have filings and transcripts, operations teams have dashboards and exports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native visualizations&lt;/strong&gt;: Charts and infographics generated inline. A report with structured visualizations is a business artifact that circulates internally and can be presented to stakeholders. That changes the product's role.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Programmatic Layer
&lt;/h2&gt;

&lt;p&gt;For developers, the interesting detail: Deep Research and Deep Research Max are available in public preview through paid tiers in the Gemini API. That opens the door for teams to build custom research products — not use Deep Research as a fixed UI, but embed its agentic capabilities into domain-specific workflows.&lt;/p&gt;

&lt;p&gt;Specialized research applications for healthcare, legal analysis, competitive intelligence, and technical discovery become buildable primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Strategic Signal
&lt;/h2&gt;

&lt;p&gt;Google's subscription positioning is telling: Deep Research sits alongside large file uploads and workflows for turning source material into blog posts, web pages, and content. The message is "productivity stack for turning information into output," not "better search."&lt;/p&gt;

&lt;p&gt;For organizations, AI stops being an assistant and starts becoming a force multiplier for analysts, researchers, and strategy teams — when it can scan hundreds of sources, compare competing claims, synthesize against internal documents, and package the result into a usable report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Caveats
&lt;/h2&gt;

&lt;p&gt;More capable research tooling doesn't eliminate the need for judgment. A system that produces polished, stakeholder-ready reports makes human review &lt;em&gt;more&lt;/em&gt; important, not less. The competitive advantage won't come from using the tool. It'll come from building the review processes, source standards, and editorial discipline around it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;For unified API access to Google, OpenAI, Anthropic and 30+ models: &lt;a href="https://evolink.ai?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=google_deep_research&amp;amp;utm_content=deep_research_analysis" rel="noopener noreferrer"&gt;EvoLink&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>image</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
