DeepClaude: I Combined Claude Code with DeepSeek V4 Pro in My Agent Loop and the Numbers Threw Me Off
DeepSeek V4 Pro correctly solves 94% of deep reasoning tasks in my loop… but the latency cost makes it unusable for 60% of my agent cases. Yeah, you read that right. And that completely blows up the narrative of "combining models is always better."
Tuesday night I watched the DeepClaude post climb to 467 points on Hacker News. What caught me wasn't the repo itself — it was a comment buried on page 2: "The dual architecture makes theoretical sense, but nobody measured whether the orchestration overhead destroys the benefit in real loops." Three hours later I had the experiment running.
I've written before about how I use YAML specs for my agents and about how Kimi K2.6's benchmarks surprised me against my real cases. This post is the next step: what happens when you combine the two best models I use in production inside a concrete hybrid architecture.
My thesis, before I show you the numbers: DeepClaude is not a universal upgrade — it's a tool that shines in a specific task regime and sinks in another. The problem is that regime isn't obvious until you measure.
What DeepClaude Is and How I Dropped It Into My Real Loop
The DeepClaude repo implements an architecture where DeepSeek R1 (or V4 Pro, depending on the fork) does the chained reasoning — the internal thinking — and Claude handles synthesis and final output. The idea is to leverage DeepSeek's cheap chain-of-thought to give Claude richer context than it would generate on its own.
But I don't run a chat loop. I run an agent system that operates on my production codebase: generates code, reviews PRs, writes specs, detects regressions. The question wasn't "is it better in chat?" but "what does it do when one agent's output is the next agent's input?"
First thing I did was clone the repo and wire the integration into my TypeScript stack:
// deepclaude-client.ts
// Hybrid client: DeepSeek reasons, Claude synthesizes
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai"; // DeepSeek uses OpenAI-compatible API
const deepseek = new OpenAI({
apiKey: process.env.DEEPSEEK_API_KEY,
baseURL: "https://api.deepseek.com/v1",
});
const claude = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
interface DeepClaudeResult {
deepseekThinking: string; // raw reasoning
claudeOutput: string; // final output
latencyMs: number;
tokensDeepseek: number;
tokensClaude: number;
}
async function deepClaudeComplete(
prompt: string,
systemContext: string
): Promise<DeepClaudeResult> {
const start = Date.now();
// Step 1: DeepSeek generates deep reasoning
const dsResponse = await deepseek.chat.completions.create({
model: "deepseek-reasoner", // V4 Pro with thinking enabled
messages: [
{
role: "system",
content: "Reason through the problem in depth. Do not generate final output.",
},
{ role: "user", content: prompt },
],
max_tokens: 8000,
});
const thinking =
dsResponse.choices[0]?.message?.content ?? "";
const tokensDS = dsResponse.usage?.total_tokens ?? 0;
// Step 2: Claude synthesizes using DeepSeek's reasoning as context
const claudeResponse = await claude.messages.create({
model: "claude-opus-4-5",
max_tokens: 4096,
system: systemContext,
messages: [
{
role: "user",
content: `Prior reasoning available:\n<thinking>\n${thinking}\n</thinking>\n\nTask: ${prompt}`,
},
],
});
const claudeOutput =
claudeResponse.content[0].type === "text"
? claudeResponse.content[0].text
: "";
return {
deepseekThinking: thinking,
claudeOutput,
latencyMs: Date.now() - start,
tokensDeepseek: tokensDS,
tokensClaude: claudeResponse.usage.input_tokens + claudeResponse.usage.output_tokens,
};
}
I ran this against three types of tasks from my real loop:
- Code generation with complex specs (30 cases)
- Code review of PRs with architectural changes (20 cases)
- Production regression debugging (15 cases)
The Real Numbers — and Where They Threw Me Off
Latency
The first number that hit me:
| Task | Claude Only | DeepSeek Only | DeepClaude |
|---|---|---|---|
| Simple code generation | 3.2s | 8.1s | 11.4s |
| Architectural code review | 7.8s | 19.3s | 24.1s |
| Regression debugging | 6.1s | 15.7s | 20.2s |
DeepClaude's latency is the sum of both plus orchestration overhead. There's no possible parallelism because DeepSeek's thinking is Claude's input. In a loop where one agent calls the next, this multiplies. With 4 agents chained, I went from a ~30-second pipeline to a ~90-second one.
Cost Per Task
Here's the pleasant surprise:
| Task | Claude Opus Only | DeepClaude |
|---|---|---|
| Simple code generation | $0.038 | $0.019 |
| Architectural code review | $0.094 | $0.051 |
| Regression debugging | $0.071 | $0.041 |
DeepClaude runs ~46% cheaper than Claude Opus alone. The reason: DeepSeek generates the reasoning context at a fraction of the cost, and Claude receives a richer prompt that needs fewer output tokens to reach the correct answer.
Output Quality — Here's the Actual Thesis
I measured quality with a simple but honest method: ran each output against my codebase's tests, plus manual review for cases where tests aren't sufficient.
Simple code generation (functions under 100 lines, clear specs):
- Claude only: 87% passes tests without modification
- DeepClaude: 89% passes tests without modification
- Difference: statistically irrelevant. The latency overhead buys you nothing here.
Architectural code review (changes touching multiple modules):
- Claude only: identified 71% of real issues
- DeepClaude: identified 91% of real issues
- This difference matters. DeepSeek finds the edge cases Claude walks right past.
Regression debugging (production errors with real stack traces):
- Claude only: reached root cause on first attempt in 67% of cases
- DeepClaude: reached root cause on first attempt in 88% of cases
- Here DeepSeek's deep thinking completely changed the outcome.
The pattern that emerged is clear: the regime where DeepClaude wins is long-range reasoning over existing code, not generation from scratch. And it makes sense — DeepSeek's thinking shines when there's rich context to explore, not when there's a clean spec to execute.
The Gotchas the Repo Doesn't Document
1. DeepSeek's Thinking Is Verbose to the Point of Annoying
In 30% of my cases, DeepSeek generated over 6,000 tokens of thinking for a task Claude resolves in 1,200 tokens of output. All that thinking lands in Claude's context, which then has to ignore half of it. I implemented a compression step:
// compress-thinking.ts
// Trim DeepSeek's thinking before sending it to Claude
async function compressThinking(thinking: string): Promise<string> {
// Extract only conclusion blocks and critical steps
const lines = thinking.split("\n");
const relevant = lines.filter(
(l) =>
l.includes("Therefore") ||
l.includes("The problem is") ||
l.includes("The solution") ||
l.includes("Conclusion") ||
l.startsWith("→") ||
l.startsWith("**")
);
// If compression is too aggressive, keep the last 2000 chars
const compressed = relevant.join("\n");
return compressed.length > 500
? compressed
: thinking.slice(-2000);
}
With this, latency dropped 18% with no measurable quality loss.
2. Claude Ignores the Thinking When the Instruction Isn't Explicit
I caught this reading logs. If you don't explicitly tell Claude "use the prior reasoning to guide your response," it treats it as context noise. The system prompt matters:
// The system prompt that worked in my tests
const systemContext = `
You receive a coding task along with prior reasoning marked in <thinking>.
That reasoning already explored the solution space.
Your job is to synthesize that analysis into a precise, actionable response.
Do not repeat the reasoning — use it. Output must be code or direct analysis.
`.trim();
3. The Overhead Kills the Benefit in Async Pipelines
In my architecture, I have agent tasks that run in the background with no latency urgency. That's where DeepClaude makes sense. But in the agent that responds to uptime events on Railway, 24 seconds of latency is unacceptable — the user has already refreshed the page three times.
The rule I adopted: DeepClaude for batch and async tasks; Claude alone for synchronous tasks with a user waiting.
4. DeepSeek's Errors Get Amplified
I found two cases where DeepSeek's thinking reached an incorrect conclusion and Claude took it as gospel. There's no cross-validation mechanism — if DeepSeek reasons wrong, Claude synthesizes wrong. I implemented a fallback:
// Basic validation: if Claude expresses uncertainty, fall back to Claude alone
async function deepClaudeWithFallback(prompt: string, system: string) {
const result = await deepClaudeComplete(prompt, system);
// Detect uncertainty signals in Claude's output
const errorSignals = [
"i'm not sure",
"could be incorrect",
"the previous reasoning suggests",
"based on the prior analysis, although",
];
const outputLower = result.claudeOutput.toLowerCase();
const hasUncertainty = errorSignals.some((s) =>
outputLower.includes(s)
);
if (hasUncertainty) {
// Fallback: Claude alone, without the contaminated thinking
console.log("[deepclaude] Fallback triggered — thinking possibly corrupted");
return await claudeOnlyComplete(prompt, system);
}
return result;
}
FAQ: DeepClaude in Production Agent Loops
Does DeepClaude fully replace Claude Code?
No, and thinking so would be a mistake. Claude Code has native integration with the filesystem, shell, and project context. DeepClaude is a completions architecture, not an integrated agent. The use cases are different: Claude Code for iterative interaction with the codebase; DeepClaude for heavy reasoning tasks inside your own pipeline.
Is DeepSeek V4 Pro the same as DeepSeek R1?
Not exactly. V4 Pro is the more recent version with improvements in multimodal reasoning and long context. The original DeepClaude repo was designed with R1, but the architecture is compatible. In my tests I used the deepseek-reasoner model, which is what the public API currently exposes.
How much does running DeepClaude in production cost at real volume?
At my current volume (~200 agent tasks per day), DeepClaude costs approximately $8/day versus $15/day for Claude Opus alone — but only for the tasks where I activated it (async batch, ~40% of volume). Net monthly savings: ~$210. Not transformative, but not nothing either.
Is it worth it for a small project with a few agents?
Probably not. The setup overhead, orchestration complexity, and managing two separate APIs carry a real maintenance cost. If you're running fewer than 50 agent tasks per day, Claude alone with a solid system prompt will get you 90% of the value without the complexity.
Is DeepSeek's thinking visible or a black box?
It's visible in the API response — plain text in the content field. That's a huge advantage for debugging: you can log the reasoning and understand why the pipeline reached a wrong conclusion. In my Railway logs, the thinking turned out to be the best diagnostic tool I had.
How does this affect the specs strategy I described before?
Pretty directly. In my YAML specs system for agents, the spec tells the agent what to do and how to structure its output. With DeepClaude, the spec is still Claude's input, but DeepSeek's thinking acts as a "context elaboration" step before Claude consumes it. Net effect: Claude needs less detailed specs because the thinking already resolved the ambiguities.
What I Accept, What I Don't Buy, and What's Still Rattling Around in My Head
I accept: DeepClaude is a legitimate architecture for a subset of tasks. The cost savings are real and the quality jump on deep reasoning is measurable. It's not marketing.
I don't buy: The narrative of "always better than either alone." The numbers clearly show that for simple code generation, the difference is statistical noise and the latency cost is a poisoned gift. The HN hype is overfit to complex reasoning cases.
What's still rattling around in my head: The real value of this architecture might not be the final output — it might be the thinking logs. Having DeepSeek's intermediate reasoning in my production logs gives me a level of observability into the agent's decision process that I never had before. That alone — regardless of whether it improves the output — might be worth the overhead.
The question I keep coming back to, after watching how Spotify is marking human content and how models differentiate in specific niches: is the future of coding agents an orchestrator that dynamically routes each task to the most appropriate model? DeepClaude is a crude first step toward that. And the numbers say there's something real here, even if the repo doesn't fully exploit it yet.
If you implement this in production, start with async batch. Measure latency before and after. And log the thinking — it's the most valuable data in the whole system.
Original source: Hacker News
This article was originally published on juanchi.dev
Top comments (0)