DEV Community: 侯垒

中转站余额为什么掉得快？我拆了一次 AI 编程任务的真实消耗

侯垒 — Thu, 25 Jun 2026 06:33:13 +0000

中转站余额为什么掉得快？我拆了一次 AI 编程任务的真实消耗

最近很多人开始高频使用 Claude Code、Codex、Cursor、OpenCode 这类 AI 编程 Agent。

用起来确实很爽：一句话丢过去，它能读项目、找文件、分析报错、改代码、跑命令，最后再给你一个总结。

但用得多了，另一个问题也开始明显起来：

中转站余额掉得真快。

有时候明明只是让它修一个小 bug，或者开一次 Code Plan，让它先分析一下项目，结果额度消耗比想象中高很多。

这时很多人的第一反应是：

是不是中转站扣量不准？
是不是模型太贵？
是不是 Agent 偷偷请求了很多次？
是不是 plan 模式也在大量消耗 token？

这些问题不能靠感觉判断。

如果想知道钱到底花在哪里，就得把一次 AI 编程任务拆开看。

所以我最近开始用一个更笨但更有效的方法：把一次 Agent 任务的每一轮请求拆开看。先看一张图。

这里我用到的工具是 ccglass。它不是另一个 AI 编程助手，而是可以帮你观察 Claude Code、Codex 等工具每一轮真实请求的本地工具。对成本敏感的人来说，它最直接的价值是：

看清一次 Agent 任务到底烧了多少上下文、多少 token、多少 cache、多少 latency。

这张图里，模型最终只输出了 34 tokens，但顶部已经显示 24,660 in，估算成本约 $0.0930。请求概览里还能看到 31 tools、cache write 24,654 tokens、latency 2.40s。

也就是说，真正值得关注的不是最后输出了多少字，而是这一轮请求背后带了多少上下文。

不是你的错觉，Agent 任务确实比普通聊天重

普通聊天的成本比较容易理解。

你问一句，模型答一句：

用户输入 -> 模型输出

输入多少 token，输出多少 token，大概还能估。

但 AI 编程 Agent 完全不是这个结构。

你输入一句：

帮我分析这个 bug，并修复相关代码。

背后可能发生的是：

第 1 轮：模型理解任务
第 2 轮：搜索相关文件
第 3 轮：读取代码
第 4 轮：分析调用关系
第 5 轮：运行测试
第 6 轮：带着测试输出继续分析
第 7 轮：修改文件
第 8 轮：再次验证
第 9 轮：总结结果

这不是一次请求，而是一串请求。

每一轮都可能有 input token 和 output token。更重要的是，后面的请求往往会带上前面产生的上下文。

所以你看到的是“修一个小 bug”，实际消耗的是“多轮上下文 + 工具调用 + 文件内容 + 命令输出 + 模型回复”。

这就是 Agent 类工具比普通聊天更烧额度的根本原因。

真正贵的，往往不是最后生成的那几行代码

很多人以为 AI 编程的成本主要花在“生成代码”上。

但实际拆开看，经常不是。

下面这张图更直观。这里用户输入的只是一个简单的 hello，但请求流里已经出现了 system reminder、Agent 类型说明，以及 31 tools offered to the model。

这就是 AI 编程 Agent 和普通聊天最大的区别。

普通聊天里，你可能只是在发一句话。

但 Agent 请求里，模型同时会看到一整套运行上下文：系统提醒、工具说明、可用能力、当前环境、历史消息，以及后续可能调用的工具。

真正重的内容通常包括这些：

system prompt；
developer 指令；
tools schema；
历史消息；
当前工作目录信息；
搜索结果；
文件内容；
测试日志；
命令输出；
前几轮 assistant 的分析；
tool call 和 tool result；
Code Plan 阶段生成的规划上下文。

其中最容易被忽略的是 tools schema 和工具结果。

Agent 能读文件、写文件、跑命令，是因为它把可用工具的定义发给了模型。工具越多，schema 越复杂，这部分上下文就越重。

Agent 读文件以后，文件内容也可能进入下一轮请求。

Agent 跑测试以后，测试输出也可能进入下一轮请求。

Agent 搜索代码以后，搜索结果也可能进入下一轮请求。

如果某一轮读了一个很大的文件，或者测试输出很长，后续几轮的 input token 就可能明显上升。

这时你以为自己只是在让它“改几行代码”，但实际上模型每轮都在背着一大包上下文继续工作。

Code Plan 为什么看起来没写代码也会贵

很多人对 Code Plan 或 plan mode 的消耗尤其困惑。

因为它看起来还没开始写代码，只是在“想一下”。

但对 Agent 来说，plan 不等于空想。

一个比较完整的 plan 阶段，可能会做这些事：

读取项目结构；
分析 README 或配置文件；
搜索相关模块；
理解已有代码风格；
判断任务影响范围；
推断需要修改哪些文件；
生成分步骤执行计划；
保留这些分析给后续实现阶段使用。

这些动作都会消耗 token。

尤其是当项目比较大、任务描述比较宽泛时，plan 阶段为了建立项目地图，可能会读很多上下文。

所以 Code Plan 贵，不一定是异常。

它的本质是：在正式动手前，先花 token 买一次理解和规划。

这个成本值不值，取决于任务复杂度。

如果是大重构、跨模块修改、复杂 bug，plan 阶段很有价值。

如果只是改一个很小的样式或文案，反复开 plan 就可能显得浪费。

中转站倍率会放大这种体感

如果你用的是中转站、聚合 API 或第三方模型网关，成本体感会更明显。

因为它通常不是简单看“请求次数”，而是跟这些因素有关：

模型倍率；
input token；
output token；
cache 计费方式；
上下文长度；
是否使用高价模型；
是否多轮请求；
是否有长输出和长工具结果。

高倍率模型本身就贵。

如果再叠加长上下文和多轮请求，余额下降会非常明显。

这也是为什么同样是“修一个 bug”，有时很便宜，有时很贵。

任务表面看起来相似，但背后的请求链路可能完全不同：

一个任务只读了 2 个文件；
另一个任务读了 15 个文件；
一个任务只跑了单个测试；
另一个任务跑了全量测试并返回大量日志；
一个任务 cache 命中不错；
另一个任务每轮都在创建新上下文；
一个任务很快收敛；
另一个任务来回尝试了很多轮。

如果没有工具观察，你只能看到余额变少。

但你不知道是哪一轮、哪个文件、哪个工具结果把成本拉上去的。

用 ccglass 拆一笔账

ccglass 的实用性就在这里。

它可以在本地作为代理，记录 Claude Code、Codex、OpenCode 等 AI 编程工具实际发给模型 API 的请求和响应，并通过 Dashboard 展示出来。

对成本分析来说，重点不是看热闹，而是看这些信息：

一次任务到底有几轮请求；
每一轮 input token 多少；
每一轮 output token 多少；
cache creation 有多少；
cache read 有多少；
latency 是否异常；
messages 数量是否增长；
tools 数量是否很大；
哪一轮请求突然变重；
哪个工具结果进入了后续上下文；
是 plan 阶段贵，还是实现阶段贵；
是读文件贵，还是测试输出贵。

再看 response 里的 usage 明细。

这次请求里，input_tokens 是 6，output_tokens 是 34，但 cache_creation_input_tokens 达到了 24,654，cache_read_input_tokens 是 0。

这个例子很适合说明一个成本误区：用户输入和模型输出本身都不长，但 Agent 请求仍然可能携带大量可缓存上下文。对于中转站或聚合 API 用户来说，这类上下文最终都会反映到额度消耗、倍率成本或延迟体感上。

有了这些信息以后，很多问题就能从“感觉”变成“证据”。

比如你可以看到：

第 1 轮 input 不高，只是任务和系统上下文。
第 2 轮 tools schema 很大。
第 3 轮读了一个大文件，input 开始上涨。
第 4 轮测试输出很长，后面几轮都变重。
第 5 轮 cache 没怎么命中，所以延迟和消耗都上去了。

这时你就知道，余额掉得快不是玄学。

它可能是因为 Agent 读了太多无关文件，也可能是测试日志太长，也可能是 plan 阶段上下文太宽，也可能是模型倍率太高。

看清成本以后，怎么省？

ccglass 本身不会替你省钱。

它的作用是告诉你钱花在哪里。真正省钱，还要调整使用习惯。

下面这些方法通常很有效。

1. 小任务不要开太大范围

不要动不动就说：

帮我优化这个项目。

这类 prompt 会鼓励 Agent 大范围搜索和读取上下文。

更好的写法是：

只检查 src/auth 目录下 session 相关逻辑，先分析问题，不要修改代码。

范围越清楚，Agent 乱读文件的概率越低。

2. 直接给关键文件

如果你已经知道问题大概在哪，就直接告诉它。

问题大概率在 src/auth/session.ts 和 src/auth/session.test.ts。
请优先检查这两个文件。

这比让它全项目搜索更省。

3. 控制测试日志

测试日志很容易变成长上下文。

如果你把完整 CI 日志丢给 Agent，或者让它反复跑全量测试，token 会涨得很快。

更好的方式是：

先给失败测试名；
给核心报错；
给关键调用栈；
让它按需运行单个测试；
避免把无关日志全部塞进去。

4. 不要反复开 plan

Plan 很有用，但不是所有任务都需要。

如果只是小改动，反复进入 Code Plan 可能不划算。

可以把 plan 用在这些场景：

跨模块修改；
大重构；
复杂 bug；
不熟悉的项目；
需要评估风险的任务。

简单任务可以直接限定范围，让 Agent 小步执行。

5. 低价值任务用低倍率模型

不是所有任务都值得用最强模型。

比如：

改文案；
补简单类型；
生成重复样板代码；
改格式；
写简单测试；
查找明显错误。

这类任务可以考虑低倍率模型。

高价模型留给真正需要复杂推理的任务。

6. 定期看异常任务

如果你是重度用户，建议偶尔用 ccglass 看几次典型任务。

不用每次都盯着看，但可以定期做成本体检：

哪类任务最耗 token？
哪个 Agent 请求轮数最多？
哪个模型 latency 最高？
哪些任务 cache 命中差？
哪些工具结果特别大？
哪些 prompt 容易导致全项目搜索？

看几次以后，你会更清楚自己怎么用 Agent 最划算。

ccglass 的定位：不是抓包玩具，而是成本观察入口

如果只是偶尔用 AI 补一段代码，可能不需要 ccglass。

但如果你已经开始高频使用 Claude Code、Codex、Code Plan，或者你用的是中转站、聚合 API、第三方模型网关，那成本就不是小问题。

这时你需要的不只是“能不能完成任务”，还要知道：

它请求了几轮；
它每轮带了多少上下文；
它为什么突然变贵；
它有没有重复发送无效内容；
plan 阶段到底花了多少；
cache 有没有帮上忙；
高倍率模型是否用在了值得的地方。

ccglass 解决的就是这个可见性问题。

它不会替你决定用哪个模型，也不会替你优化 prompt。

但它可以把一次 AI 编程任务的消耗摊开，让你知道钱到底花在哪里。

总结

AI 编程不是不能花钱。

真正的问题是：钱花得明不明白。

Claude Code、Codex、Code Plan 这类 Agent 工具，本来就比普通聊天更重。它们会多轮请求、携带工具定义、读取文件、保留历史、处理测试输出，还可能在 plan 阶段建立大量上下文。

如果你用中转站，这些都会变成非常真实的额度消耗。

所以，与其只盯着余额下降，不如拆开看一次任务：

哪一轮最贵；
哪些上下文最重；
哪些工具结果被反复带入；
cache 有没有命中；
plan 阶段是否值得；
模型倍率是否匹配任务价值。

这就是 ccglass 对成本敏感用户的实用价值。

它让 AI 编程成本从一笔糊涂账，变成可以观察、分析和优化的工程指标。

一句话总结：

不是少用 AI，而是少烧无效上下文。

项目地址

ccglass 是一个开源项目，项目地址：

https://github.com/jianshuo/ccglass

如果你也在使用 Claude Code、Codex、OpenCode 或其他 AI 编程 Agent，并且对请求链路、token 成本、cache 命中、延迟分析感兴趣，欢迎试用、提 issue 或 star 支持。

AI Agent 出问题时，不要只看最终回答：一次请求级调试的思路

侯垒 — Mon, 22 Jun 2026 02:04:28 +0000

这两年 AI 编程工具变化很快。

从 Copilot 式的代码补全，到 Claude Code、Codex、OpenCode、Cursor、Cline 这类 Agent 工具，AI 已经不只是“帮你补几行代码”了。它可以读项目、改文件、跑命令、调用工具、分析报错，然后继续下一轮。

但 Agent 越像一个“会干活的人”，问题也越明显：

它出错的时候，你很难判断它到底错在哪里。

很多人遇到 AI 编程工具跑偏时，第一反应是改 prompt。

比如：

你要更认真一点。
不要偷懒。
必须先读代码再修改。
不要乱改无关文件。

这些提示有时有用，但更多时候只是把问题往后推。

因为真正的问题可能根本不在你写给它的那句话里，而在它实际发给模型的完整请求里。

为什么只看最终回答不够？

一个 AI Agent 完成任务，通常不是一次请求结束。

它的流程更像这样：

用户提出任务
-> 模型收到 system prompt、上下文、工具列表
-> 模型决定调用某个工具
-> 本地执行工具，比如读文件、搜索代码、运行命令
-> 工具结果再进入下一轮模型请求
-> 模型继续判断
-> 直到给出最终回答或修改代码

我们在终端或编辑器里看到的，通常只是这个过程的一小部分。

最终回答只能告诉你“它做了什么”，但很难告诉你“它为什么这么做”。

比如下面这些问题，只看最终回答基本看不出来：

它到底有没有看到关键文件？
它看到的是完整上下文，还是被截断后的上下文？
system prompt 里有没有某条规则影响了它？
工具 schema 是不是描述不清，导致它选错工具？
tool result 有没有进入下一轮？
哪一轮开始 token 暴涨？
provider 返回的 400 到底是模型问题，还是请求格式问题？
它是不是每一轮都重复带入了大量无效上下文？

这些都属于“请求级问题”。

AI Agent 的常见问题，其实可以分层看

我现在更倾向于把 AI Agent 的问题分成几层，而不是笼统地说“模型不行”。

第一层：模型没有看到该看的东西

这是最常见的问题之一。

你以为 Agent 已经读了某个文件，但实际发给模型的请求里没有那段内容。

或者它读过，但后续请求里没有保留。

于是模型只能基于不完整上下文做判断，结果自然不稳定。

这种时候继续强调“请仔细阅读代码”没有太大意义。你真正要确认的是：

关键上下文有没有进入下一轮请求？

第二层：工具列表或工具描述有问题

Agent 能调用工具，并不代表它会正确调用工具。

模型看到的是一组工具 schema，包括工具名、描述、参数结构。

如果工具描述太模糊，模型可能不知道什么时候该用。

如果参数结构太复杂，模型可能生成错误参数。

如果工具太多，模型也可能选错。

这类问题在 MCP、插件、子任务工具越来越多以后会更明显。

第三层：tool call 和 tool result 没有正确闭环

一个正常的工具调用闭环应该是：

assistant: tool_use
tool: tool_result
assistant: 根据 tool_result 继续

如果中间某一环丢了，Agent 就会开始出现奇怪行为。

比如：

工具没有真正执行，但模型以为执行了
tool result 太长，进入下一轮后污染上下文
多个 tool_result 顺序错了
tool_use 和 tool_result 没有正确配对
provider 要求的消息格式和客户端保存的格式不一致

这类 bug 最难靠肉眼看最终回答判断。

你必须看到真实的请求和响应。

第四层：token 和成本问题

很多人用 AI 编程工具时，一开始只关心效果。

但一旦使用频率上来，token 成本就会变成实际问题。

尤其是 Agent 场景里，token 并不只花在最终回答上。

大量 token 可能花在：

system prompt
工具 schema
历史消息
文件内容
搜索结果
命令输出
tool result
子 Agent 的上下文

有时一次任务很贵，不是因为最终回答很长，而是因为某一轮请求带了巨大的上下文。

所以看 session 总成本还不够，最好能看到每一次请求的 token 使用。

请求级调试应该看什么？

如果一个 AI Agent 跑偏，我通常会先看这几类信息。

1. system prompt

system prompt 是 Agent 行为的底层约束。

很多看似“模型自己决定”的行为，其实是 system prompt 影响的。

比如：

它为什么总是先写计划？
它为什么不直接执行命令？
它为什么遇到某类文件不修改？
它为什么要反复确认？

这些可能都和 system prompt 有关。

2. messages

messages 决定模型当前能看到什么。

重点不是“这个 Agent 曾经读过什么”，而是：

当前这一轮请求里到底包含了什么？

这两个问题不一样。

Agent 本地读过一个文件，不代表后续每一轮模型请求都带着这个文件。

3. tools schema

如果 Agent 没有调用你期望的工具，先不要急着骂模型。

可以先看：

工具是否真的出现在 tools 里？
工具描述是否清楚？
参数 schema 是否合理？
是否有多个类似工具让模型混淆？
provider 是否支持这种 schema？

4. tool call / tool result

这一层最适合排查 Agent 为什么做错。

你可以看：

模型选择了哪个工具
传入了什么参数
工具返回了什么
返回结果有没有进入下一轮
下一轮模型有没有基于这个结果继续

如果这里断了，后面再怎么调 prompt 都不稳定。

5. token / cache / cost / latency

这些指标能帮你判断问题发生在哪一轮。

比如：

哪一轮 input token 突然升高？
哪一轮 tool result 特别大？
cache 有没有命中？
哪个模型最贵？
哪个请求延迟最高？
失败请求有没有 usage 信息？

这比只看“今天花了多少钱”有用得多。

ccglass 解决的就是这一层

最近我发现了一个开源工具：ccglass。

项目地址：

https://github.com/jianshuo/ccglass

它不是另一个 AI 编程助手，而是一个本地观测工具。

简单说，它会在本地启动一个代理和 Dashboard，让你看到 Claude Code、Codex、OpenCode、CodeBuddy、Qoder 等工具实际发给模型的请求。

你可以看到：

system prompt
messages
tools schema
tool calls
tool results
request / response body
token / cache / cost
latency
turn-to-turn diff

也就是说，它关心的不是“再帮你写一段代码”，而是“让你看清 Agent 到底是怎么工作的”。

一个典型使用场景

假设你让 Claude Code 修一个 bug。

它最后确实改了代码，但你觉得改得很奇怪。

这时你可以不急着重跑，而是打开 ccglass 看这几件事：

第一轮请求里有没有带上相关文件？
system prompt 有没有影响它的修改策略？
它调用了哪些工具？
每个工具返回了什么？
哪个 tool result 进入了下一轮？
修改代码前，它是否真的看到了测试失败信息？
token 是不是在某一轮突然暴涨？

如果你能回答这些问题，调试 Agent 就从“感觉不对”变成了“证据链不对”。

它和普通抓包工具有什么区别？

很多人会问：这是不是 Charles、mitmproxy、Proxyman 也能做？

某些情况下可以，但 AI 编程 Agent 有几个麻烦点：

不一定遵守系统代理
有些客户端走自定义 base URL
有些使用 OpenAI / Anthropic 兼容接口
有些 provider 的 streaming 格式不一样
有些工具调用信息需要按 Agent 语义展示，而不是只看 HTTP 包

ccglass 的目标不是替代通用抓包工具，而是专门围绕 AI Agent 请求结构做展示。

它会把 system、messages、tools、tool call、tool result、usage、cost、diff 这些东西整理成适合调试 Agent 的视图。

为什么我觉得这件事会越来越重要？

AI 编程工具的发展方向很明显：Agent 会越来越自动。

它们会读更多文件，调用更多工具，跑更长任务，甚至调度子 Agent。

这当然会提高效率。

但如果没有观测能力，复杂度也会一起上升。

以后我们遇到的问题可能不再是：

这段代码补全得对不对？

而是：

为什么这个 Agent 在第 7 轮调用了这个工具？
为什么它把这个 tool result 带到了后面 12 轮？
为什么这次任务花了 30 万 token？
为什么同样的任务换 provider 就失败？
为什么上下文压缩后它丢了关键约束？

这些问题都需要请求级视角。

结语

我觉得 AI 编程工具接下来的一个重要方向，不只是让 Agent 更强，而是让 Agent 更可观察。

只看最终回答，就像只看程序崩溃后的最后一行日志。

真正要调试复杂系统，还是要看输入、输出、中间状态和调用链。

AI Agent 也是一样。

如果你也在用 Claude Code、Codex、OpenCode、CodeBuddy、Qoder 这类工具，可以试试 ccglass：

https://github.com/jianshuo/ccglass

它解决的不是“让 AI 更聪明”，而是让使用者更清楚地知道：

AI 到底看到了什么，又基于什么做出了下一步。

Debugging AI Coding Agents: How to See Prompts, Tool Calls, Token Usage, and Cost

侯垒 — Sun, 21 Jun 2026 02:33:55 +0000

When a coding agent fails, the visible error is rarely the whole story.

You might see:

a tool call that never ran
a command repeated again and again
a sudden token spike
a provider rejecting a request with 400 Bad Request
an agent that says it edited a file but did not
a long session that starts producing shallow or confused answers

The usual reaction is to tweak the prompt and try again.

Sometimes that works. But for agentic coding tools, guessing is not enough. You need to inspect what the agent actually sent to the model.

That is the problem ccglass is built for.

GitHub: https://github.com/jianshuo/ccglass

The debugging problem with coding agents

Modern coding agents are not simple chatbots.

Tools like Claude Code, Codex, OpenCode, CodeBuddy, Qoder, and similar systems usually run a loop like this:

user request
  -> model request
  -> tool call
  -> local command / file read / edit / search
  -> tool result
  -> next model request
  -> final answer

When something goes wrong, the bug can be in any part of that loop.

For example:

The model never saw the tool schema you thought it saw.
The tool schema was too large or malformed.
The model returned a malformed tool call.
The local client dropped part of the tool result.
A huge tool result entered the next request and inflated token usage.
The provider rejected a request shape that another provider accepts.
A proxy or gateway translated Anthropic and OpenAI formats incorrectly.

You cannot debug that reliably from the final answer alone.

What to inspect first

When an agent behaves strangely, I usually want to see five things.

1. The system prompt

The system prompt often explains behavior that looks mysterious from the outside.

It may contain rules about:

when to ask permission
when to use tools
how much work to do before stopping
whether to run tests
whether to preserve existing files
how to summarize results

If the agent ignores your instruction, first check whether the system prompt is pushing it in a different direction.

2. The tool schema

Tool calling depends heavily on the schema sent to the model.

If a tool is described vaguely, has confusing parameter names, or contains a schema shape the provider does not like, the model may choose the wrong tool or produce invalid arguments.

This matters even more with MCP servers and custom tools.

The question is not "what did my code define?" The real question is:

What tool schema was actually sent in the model request?

3. The tool call

A tool call bug can come from the model, the client, or the provider adapter.

You want to inspect:

tool name
call id
arguments
malformed fields
missing required fields
whether the tool call was emitted as structured data or plain text

For example, if the model emits something that looks like a tool call but the client renders it as text, the agent may continue as if the tool ran even though no tool result exists.

4. The tool result

Tool results are often the hidden source of context bloat.

A single file read, search result, stack trace, or command output can add thousands of tokens to the next turn.

If the agent suddenly becomes expensive or confused, check what tool results were fed back into the model.

5. Token usage and latency

Token totals are useful, but per-request token usage is better.

You want to know:

which request got expensive
whether input, output, or cache tokens dominated
whether a request was slow before the first token
whether repeated turns reused the same large context
whether a provider returned usage data at all

That is the difference between "this session was expensive" and "this specific tool result caused the spike."

Using ccglass for request-level debugging

ccglass is a local proxy and dashboard for coding-agent traffic.

It lets you inspect what supported agents actually send to the model:

system prompts
messages
tool schemas
tool calls
tool results
raw request and response bodies
token/cache/cost
latency
turn-to-turn diffs

It works locally. It is open source.

Install:

npm install -g ccglass

Start it:

ccglass

Or choose a client directly:

ccglass claude
ccglass codex
ccglass opencode
ccglass qoder
ccglass codebuddy

For generic OpenAI-compatible or Anthropic-compatible clients, you can also run proxy-only mode:

ccglass proxy --provider openai
ccglass proxy --provider claude

Then point your client or IDE at the printed local base URL.

Example debugging workflow

Suppose an agent repeatedly fails to call a tool correctly.

Instead of changing the prompt first, inspect the actual request flow:

Open the ccglass dashboard.
Find the request where the model was expected to call the tool.
Expand the system prompt and tool schema.
Check whether the tool was visible to the model.
Check the model response for the tool call.
Check whether the tool result was paired correctly.
Compare the next request to see what context was carried forward.

That gives you a factual answer to questions like:

Did the model see the tool?
Did it call the wrong tool?
Were the arguments malformed?
Did the client drop the tool result?
Did the next turn include the right result?

Example: debugging token spikes

Another common problem:

Why did this one coding-agent session use so many tokens?

In ccglass, inspect the request list and session summary.

Look for:

a request with unusually high input tokens
a large tool result entering the next request
many repeated requests with similar context
cache usage that is lower than expected
a slow request with high input size

Then use turn-to-turn diff to see what changed between two requests.

This is often more useful than looking only at the final cost.

Example: debugging provider 400 errors

Provider errors are another good use case.

If an Anthropic-compatible or OpenAI-compatible endpoint rejects a request, you need the exact payload.

Check:

request body
tool schema
message order
tool_use / tool_result pairing
response or error body
provider/model name

This is useful when working with:

internal gateways
OpenRouter
Ollama-compatible endpoints
Bedrock or Vertex routes
Anthropic-compatible translation layers
OpenAI-compatible coding-agent backends

The failure is often not "the model is bad." It is often a request-shape problem.

Exporting evidence

ccglass can export captured requests:

ccglass export <session>/<seq> --format raw
ccglass export <session>/<seq> --format md
ccglass export <session>/<seq> --format json
ccglass export <session>/<seq> --format har

That is useful when reporting bugs to an agent project, provider, or proxy maintainer.

Instead of saying:

The agent failed.

You can show:

This exact request contained this tool schema, this model response emitted this malformed tool call, and this provider returned this error.

That is much easier to debug.

A few practical notes

ccglass is not a universal network sniffer.

It works best when the client can be pointed at a local base URL or local proxy. For example, API-key based OpenAI-compatible and Anthropic-compatible traffic is a good fit.

Some clients have special transports. For example, Codex authenticated through ChatGPT login may use a WebSocket path that does not honor OPENAI_BASE_URL, so local base URL inspection will not see that traffic.

For CodeBuddy, ccglass uses a forward-proxy mode because CodeBuddy hardcodes its upstream endpoint.

Why this matters

As coding agents become more autonomous, debugging needs to move one layer deeper.

It is no longer enough to ask:

Did the agent produce the right diff?

You also need to ask:

What did the agent see, what tool did it choose, what result came back, and what context entered the next turn?

That is what ccglass tries to make visible.

GitHub:

https://github.com/jianshuo/ccglass

Install:

npm install -g ccglass

If you build with coding agents, request-level debugging is worth having in your toolbox.

Stop Flying Blind with Coding Agents: Inspect Claude Code and Codex Requests with ccglass

侯垒 — Sat, 20 Jun 2026 00:31:18 +0000

AI coding agents are getting good enough that they no longer feel like autocomplete.

Tools like Claude Code, Codex, OpenCode, Cursor, Cline, and other agentic coding systems can read files, modify code, run commands, call tools, inspect errors, and continue working across multiple turns.

That is useful. It is also increasingly opaque.

When you ask an agent to fix a bug, you usually see the final answer and maybe a few tool calls in the terminal. But you often do not know:

What system prompt did the model receive?
Which messages were included in the request?
Which tool schemas were shown to the model?
Why did it choose one tool instead of another?
What tool result was fed into the next turn?
How many tokens did the task use?
Was cache used?
Which request was slow?
How much did the session cost?

For small experiments, guessing is fine. For real work, guessing is a bad debugging strategy.

That is why I built ccglass.

GitHub: https://github.com/jianshuo/ccglass

What is ccglass?

ccglass is a local observability tool for AI coding agents.

It runs a local proxy, captures model requests and responses, and shows them in a web dashboard.

The goal is simple: make it easy to see what tools like Claude Code, Codex, DeepSeek-TUI, Kimi, OpenCode, Ollama, OpenRouter, and other agent clients are actually sending to the model.

ccglass can show:

system prompts
user and assistant message history
tool schemas
tool calls and tool results
raw request and response bodies
token usage
cache usage
estimated cost
latency
turn-to-turn diffs

It is not another coding agent. It is a visibility layer for the agents you already use.

Why not just use a normal HTTP proxy?

General-purpose tools like Charles, mitmproxy, or Proxyman are great, but AI coding agents can be awkward to inspect with traditional proxy setups.

Some clients do not reliably honor HTTP_PROXY / HTTPS_PROXY. Some have their own networking behavior. Some use provider-specific base URL settings. Patching the client is fragile.

ccglass takes a more targeted approach.

It starts a local proxy, then launches or configures the target agent with the right base URL environment variable, such as:

ANTHROPIC_BASE_URL
OPENAI_BASE_URL

The agent sends model traffic to the local proxy. ccglass logs it, forwards it to the real upstream API, and renders the result in a dashboard.

That means you can inspect LLM traffic without installing a CA certificate, decrypting TLS, or modifying the client source code.

Quick start

Install ccglass with npm:

npm install -g ccglass

Then run:

ccglass

You can also start a specific client directly:

ccglass claude
ccglass codex
ccglass opencode
ccglass deepseek
ccglass kimi

For example:

ccglass codex

When it starts, ccglass prints a local dashboard URL:

dashboard: http://127.0.0.1:57633

Open that URL and you can watch requests appear as the agent works.

What can you debug with it?

1. Prompt and context problems

Sometimes an agent makes a bad decision because it did not see the context you expected.

With ccglass, you can inspect the actual messages sent to the model instead of guessing from the terminal output.

You can answer questions like:

Did the file content make it into the request?
Was the previous tool result included?
Did a long conversation bury the important instruction?
Did the system prompt constrain the behavior?

2. Tool call behavior

Coding agents are only as good as their tool loop.

ccglass lets you inspect the tools shown to the model, the tool call selected by the model, the arguments passed to the tool, and the result that was fed back into the next request.

That is useful when an agent:

chooses the wrong tool
repeats the same command
fails to use an available tool
gets confused by a tool schema
behaves differently across providers

3. Token and cost spikes

Long-running agent sessions can burn tokens quickly.

ccglass shows token usage, cache usage, estimated cost, and latency per request and per session.

That makes it easier to spot:

a huge tool result entering the context
repeated large prompts
low cache hit behavior
slow requests
expensive turns that did not add much value

4. Provider and proxy issues

If you use local gateways, OpenAI-compatible endpoints, Anthropic-compatible endpoints, OpenRouter, Ollama, Bedrock, Vertex, or internal proxies, request shape matters.

ccglass helps you compare what the client sent with what the upstream expected.

This is especially useful when debugging:

custom base_url configuration
Anthropic vs OpenAI-compatible payload differences
missing tool call fields
malformed tool arguments
provider-specific routing

Exporting requests

You can export captured requests for deeper inspection or bug reports:

ccglass export <session>/<seq> --format raw
ccglass export <session>/<seq> --format md
ccglass export <session>/<seq> --format json
ccglass export <session>/<seq> --format har

That makes it easier to attach useful evidence when reporting issues to an agent, provider, or gateway project.

Who is this for?

ccglass is useful if you:

use Claude Code, Codex, OpenCode, or similar coding agents heavily
build tools around coding agents
maintain an LLM gateway or proxy
debug prompt, context, or tool-call behavior
want to understand token usage and cost
compare different providers or agent clients
care about local-first observability instead of sending traces to a SaaS service

A note on Codex

Codex has multiple auth and transport paths.

In API-key mode, routing through a configurable base URL works well for local proxy inspection.

When Codex is authenticated through ChatGPT login, some traffic may use a WebSocket path that does not honor OPENAI_BASE_URL. In that case, local proxy tools like ccglass cannot see that traffic.

That distinction matters when debugging Codex routing.

Why I think this matters

As coding agents become more capable, developers need better tools for understanding agent behavior.

The interesting question is no longer only:

Did the agent produce the right code?

It is also:

Why did the agent behave that way?

To answer that, you need visibility into prompts, context, tools, requests, latency, and cost.

ccglass is a small open-source step in that direction.

GitHub: https://github.com/jianshuo/ccglass

Install:

npm install -g ccglass

If you are building with coding agents and have ever wondered what they actually send to the model, give it a try.

Building ccglass: the architecture of a local LLM reverse proxy

侯垒 — Wed, 17 Jun 2026 02:14:58 +0000

The 30-second pitch

ccglass is a local reverse proxy that captures LLM API traffic from coding agent CLIs (Claude Code, Codex, DeepSeek, Kimi, etc.) and shows you a real-time dashboard of prompts, costs, and cache hit rates.

It's open source. It's 5,000 lines of Node. It's MIT licensed.

GitHub: https://github.com/jianshuo/ccglass

The constraint that shaped everything

The hardest part wasn't building a proxy. It was making it work with coding agent CLIs that deliberately bypass HTTP_PROXY.

Every native CLI (Claude Code is Node, Codex is Node, DeepSeek's CLI is Go, etc.) opens HTTPS sockets directly. They don't honor HTTP_PROXY env vars. So the standard "man-in-the-middle" pattern (mitmproxy, Charles) doesn't apply — these tools need a CA cert to intercept HTTPS, but the CLI isn't going to trust your CA.

The trick: intercept the local loopback hop, not the wire.

The CLI's API base URL is https://api.anthropic.com. We override it to http://127.0.0.1:8123. Now the local hop is plain HTTP — no cert, no interception, no TLS. The CLI's Node https module makes a request to http://127.0.0.1:8123, which our proxy receives, logs, and forwards to the real https://api.anthropic.com.

Architecture

┌─────────────┐   plain HTTP    ┌─────────────┐    HTTPS    ┌─────────────┐
│  Claude     │ ──────────────▶ │  ccglass    │ ──────────▶ │ Anthropic   │
│  Code CLI   │  127.0.0.1:8123 │  proxy      │             │ API         │
└─────────────┘                 └─────────────┘             └─────────────┘
                                       │
                                       │ log + dashboard
                                       ▼
                                ┌─────────────┐
                                │  Browser    │
                                │  UI :8123   │
                                └─────────────┘

3 components:

Spawn wrapper — overrides *_BASE_URL env vars, spawns the CLI as a child process
Proxy server — logs requests, forwards upstream, captures responses (SSE streaming included)
Web UI — real-time dashboard, web-socket fed

What I learned about streaming

The trickiest part: LLM APIs use Server-Sent Events (SSE) for streaming. The CLI expects an openai-sse or anthropic-sse stream. We need to:

Proxy the response as a stream (no buffering)
Tee the stream to the log file (we need every chunk)
Compute the cost incrementally as chunks arrive (token counts come in the final chunk)

In Node, this is pipeline() with a Transform stream that hashes each chunk and writes it to a side channel. The CLI gets the original stream unchanged.

Cost calculation

Each provider has a different pricing model. Cache hits, prompt caching, batch API, all change the math.

I extracted pricing into a JSON file (data/pricing.json) keyed by provider:model and updated monthly. The cost is computed during the response stream so you see cost accumulating in real time on the dashboard.

MCP integration

The wild feature: ccglass has its own MCP (Model Context Protocol) server. When Claude Code starts, it can call our MCP tools. One of them is get_recent_requests — Claude can query its own request history from inside the chat.

User: what did I prompt you with 3 turns ago?
Claude: [calls ccglass MCP get_recent_requests]
Claude: You prompted me with "refactor the user service to use the new repository pattern".

It's recursive and weird. I love it.

What's next

More providers — every new coding agent CLI that ships will need a config
Cost forecasting — given your usage pattern, predict next month's bill
Team sharing — local mode stays local, but opt-in to share specific sessions with teammates (encrypted, E2E)

Try it

npm i -g ccglass
ccglass claude

Open the dashboard. Run a few prompts. The first time you see your own cache hit rate, you'll get it.

GitHub: https://github.com/jianshuo/ccglass

Why I quit SaaS AI observability tools and built a local proxy instead

侯垒 — Sun, 14 Jun 2026 02:02:55 +0000

A confession

I've been using Langfuse and Helicone for the last 6 months. They're great products. Their teams are sharp.

But they don't work for coding agents.

The mismatch

Tool	Architecture	Works for coding agents?
Langfuse	SDK + async upload to SaaS	❌ Need to instrument the agent
Helicone	HTTPS proxy via HTTP_PROXY	❌ CLIs ignore HTTP_PROXY
Datadog LLM Obs	APM agent	❌ Same problem
ccglass	Local loopback reverse proxy	✅ Yes

The reason: Claude Code, Codex, OpenCode, Kimi, etc. are native CLIs (Node, Rust, Go). They make HTTPS calls directly to the API endpoint. They do not respect HTTP_PROXY environment variables.

So the standard observability play — "just point your SDK at our proxy" — doesn't work. The agent isn't using a library that knows to call your endpoint.

What I actually needed

I needed something that would:

Be a man-in-the-middle on the loopback (so it sees plain HTTP)
Forward to the real API (so the agent works)
Be zero-config (the agent already trusts http://127.0.0.1)
Not require a CA cert (loopback is plain HTTP)
Be local-only (no SaaS, no account)

I built it. It's called ccglass. It does those 5 things. Nothing else.

What it looks like in practice

$ npm i -g ccglass
$ ccglass claude
# → starts proxy on http://127.0.0.1:8123
# → overrides ANTHROPIC_BASE_URL to point at it
# → spawns claude
# → opens dashboard at http://127.0.0.1:8123

The dashboard shows:

Live request log with the full system prompt, tool calls, responses
Per-request cost (with cache-aware pricing)
Per-turn diff (what changed in the context this turn)
Cache hit rate (how often your system prompt is being cached)
Token breakdown (input / output / cache_read / cache_write)

What's different from Langfuse / Helicone

Local-only. No data leaves your machine. No account. No API key on their side.
Works for coding agents specifically. Built for the HTTP_PROXY-bypass problem.
Single binary, 1-command install. No SDK to integrate.
Open source under MIT. You can read every line.

What's the same

Token accounting
Per-request cost
Latency tracking
Provider routing (multiple model providers)

Why I'm sharing this

If you use a coding agent heavily, and you don't know which of your prompts are 4,000 tokens of accidental repetition, you're leaving money on the table.

The first time I saw my own cache hit rate (38% — meaning I was re-sending the same system prompt 38% of the time and not knowing it), I had a "wait, that's literally me paying for nothing" moment.

Try it once. The data is eye-opening.

🔗 GitHub: https://github.com/jianshuo/ccglass

I built a local reverse proxy to see what Claude Code actually sends to Anthropic

侯垒 — Wed, 10 Jun 2026 07:08:09 +0000

The problem I couldn't solve

I was spending ~$1,800/month on Claude Code.

I had no idea where the money was going. I had no idea which prompts were 4,000-token monstrosities, which ones were 200-token gems, or which ones I'd accidentally repeated 3 times this week.

I tried the obvious tools first:

mitmproxy — didn't work. Claude Code (and Codex, DeepSeek, Kimi, GLM, etc.) all ignore HTTP_PROXY because they're native CLIs that open HTTPS sockets directly.
Charles — same problem.
Langfuse / Helicone — these are SaaS. You have to send your data to them. Not what I wanted.
Custom hooks — limited to events the CLI exposes. I wanted the raw HTTP.

I wanted a local, open-source, zero-account way to see what my coding agent was doing.

The solution: a local reverse proxy on the loopback

The insight: every coding agent CLI talks to api.anthropic.com (or similar). If I make it talk to http://127.0.0.1:port instead, and have a tiny proxy on that port forward to the real API, the local hop is plain HTTP — easy to log, no CA cert, no TLS pinning pain.

That's it. That's the whole trick.

npm i -g ccglass
ccglass claude
# → opens http://localhost:8123 in your browser
# → real-time dashboard of every request

What I learned in 30 days

After running every Claude Code session through it, I found:

1. I had a 38% cache hit rate I didn't know about

I was repeating myself in 38% of prompts and paying full price. The dashboard made it visible. I rewrote my CLAUDE.md to front-load context — cache hit rate jumped to 70%, monthly bill dropped 35%.

2. Per-provider cost varies 10x

Same task:

Claude Sonnet 4.6: $0.42
GPT-4o: $0.31
DeepSeek: $0.04

I started picking per-task. Anthropic for quality, DeepSeek for bulk.

3. Turn counts were higher than I thought

Average 4.2 turns per task. After seeing the data, I rewrote my CLAUDE.md. Turn count dropped to 2.8. Less back-and-forth = less cost = faster delivery.

4. MCP self-inspection is wild

ccglass has an MCP server. When you run ccglass claude, the agent can query its own request history inside the chat. I asked Claude "what did I prompt you with 3 turns ago?" and it answered correctly.

What's supported

16+ providers out of the box:

Coding agents: Claude Code, Codex, OpenCode, CodeBuddy, Reasonix
Pure LLM APIs: Anthropic, OpenAI, DeepSeek, Kimi, GLM, OpenRouter
Cloud: AWS Bedrock, GCP Vertex AI
Local: Ollama, LM Studio

The limits (I want to be honest)

Cursor subscription models can't be intercepted (they use a server-side proxy).
VS Code Continue with built-in models: same.
It's local-only by design. No SaaS, no telemetry, no account. (If you want cloud, use Langfuse.)

Open source

GitHub: https://github.com/jianshuo/ccglass

460+ stars at time of writing. MIT licensed. PRs welcome.

If you ship with Claude Code / Codex / Kimi and have ever asked "where is my money going", try it once. The data is eye-opening.

Make Your AI Coding Agent Transparent - See What It Actually Sends to the Model

侯垒 — Wed, 10 Jun 2026 02:46:06 +0000

If you've been using AI coding agents like Claude Code or Codex, you know how powerful they can be. But they also feel like a black box. What's actually in that system prompt? How much context is being sent every turn? Where are all my tokens going?

I recently found a tool called ccglass that answers these questions beautifully, and I wanted to share my experience.

What is ccglass?
ccglass is a lightweight local tool that acts as a reverse proxy between your AI coding agent and the model API. It intercepts all requests and responses, logs them, and displays them in a really nice web dashboard.

ccglass Dashboard

Getting Started
Installation is simple (Node.js 18+ required):

npm install -g ccglass

Then just run it and pick your agent:

ccglass

Or specify directly:

ccglass claude # for Claude Code
ccglass codex # for Codex
ccglass deepseek # for DeepSeek-TUI

It will:

Start a local proxy server
Set the right environment variables automatically
Launch the agent for you
Open the web dashboard

That's it! No CA certificates to install, no complicated setup.

What You Can See

The dashboard shows you everything:

The Full System Prompt

This is probably the most interesting part. You get to see how different agents frame their instructions to the model. Claude Code's system
prompt is fascinating to read!

Complete Message History

See the full context being sent each turn, how it evolves, and what gets kept vs. dropped.

Tool Schemas and Calls

See what tool definitions the agent provides to the model, and what tool calls the model makes in response.

Token and Cost Breakdown

Know exactly how many tokens you're using, what's cached, and get cost estimates per request and per session.

Token Summary

Turn-by-Turn Diff View

Compare requests to see exactly what changed between turns. Super useful for debugging why an agent started behaving differently.

Supported Agents

The list is pretty extensive:

Claude Code (including Bedrock and Vertex modes)
Codex (OpenAI)
DeepSeek-TUI and Reasonix
Kimi (Moonshot)
OpenCode
Ollama
OpenRouter
GLM/Zhipu
CodeBuddy (VS Code/JetBrains plugins)

IDE Support

If you use Cursor, Cline, Continue.dev, or similar IDEs that let you set a custom API base URL, you can use the proxy mode:

ccglass proxy --provider openai

Then just point your IDE's API base URL to the local proxy address it gives you.

Why I Like This

Learn from the best - See how production-grade agent systems are built
Debug effectively - Understand why your agent is making certain choices
Optimize costs - See where your tokens are actually going
Local only - All logs stay on your machine (default redacts auth tokens)
MIT licensed - Completely open source

Pro Tip: Export for Documentation

You can export any request to raw HTTP, Markdown, JSON, or HAR format:

ccglass export / --format md

Great for documentation, bug reports, or just saving interesting prompts.

Try It Out

If you're using any AI coding agent regularly, I highly recommend giving ccglass a try. It will change how you think about these tools.

Install it now:

npm install -g ccglass

Check out the project: github.com/jianshuo/ccglass