Meidi Airouche for Onepoint

How Reasoning LLMs Are Challenging Orchestration

I spent most of last year buried in LangGraph.
It felt like engineering telepathy — drawing thoughts as state machines, chaining prompts into something almost alive.
Until one night, at 2 AM, I was staring at a 15-node graph just to summarize a document.
Every retry, every validation, every branch was on me.
That’s when it clicked: I was orchestrating cognition that the model could have handled itself.

So I did what most of us do when we’re tired of over-engineering: I tore it down and rebuilt it — this time around a reasoning-native model.


The usual way: Manual Orchestration with LangGraph

LangGraph made sense when models couldn’t plan or reason.

Here’s a minimal example of what my early pipelines looked like:

from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4-turbo")

class State(TypedDict, total=False):
    doc: str
    summary: str
    verified: bool

def summarize(state: State) -> dict:
    result = llm.invoke([
        SystemMessage("Summarize briefly and clearly."),
        HumanMessage(state["doc"])
    ])
    return {"summary": result.content}

def verify(state: State) -> dict:
    result = llm.invoke([
        SystemMessage("Return OK if factual, else ERROR."),
        HumanMessage(state["summary"])
    ])
    return {"verified": result.content.strip() == "OK"}

graph = StateGraph(State)
graph.add_node("summarize", summarize)
graph.add_node("verify", verify)
graph.add_edge("summarize", "verify")
graph.add_edge("verify", END)
graph.set_entry_point("summarize")

app = graph.compile()
result = app.invoke({"doc": open("manual.txt").read()})
print(result)

What we can say:

  • It’s explicit and predictable — but verbose.
  • Every new condition means another node.
  • Every correction means editing a graph.
  • You become the conductor instead of letting the model compose the music.

The New Approach: Reasoning Models + MCP Tools

When I switched to Claude 4.5 (and later GPT-5), I stopped describing steps and started describing goals.

The model could plan, call tools, and validate its own results through MCP.

from mcp import MCPClient, tool

@tool
def search(query: str) -> str:
    """Search internal docs."""
    return f"Search results for {query}"

@tool
def summarize(text: str, max_words: int = 200) -> str:
    """Summarize text."""
    return text[:max_words] + "..."

@tool
def validate(summary: str) -> str:
    """Return 'OK' if factual."""
    return "OK"

client = MCPClient(model="claude-4.5", tools=[search, summarize, validate])

prompt = """
You are a reasoning agent.
Goal: Summarize and fact-check 'LangGraph framework'.
1. Search relevant data.
2. Summarize it.
3. Validate results; retry if not OK.
Return only the final validated summary.
"""

print(client.run(prompt))

The trace looked like this:

[Plan] Step 1 → search("LangGraph framework")
[Plan] Step 2 → summarize(results)
[Plan] Step 3 → validate(summary)
[Reflection] Validation OK → returning summary

That’s one reasoning session instead of three separate LLM calls:

  • Latency dropped from 2.3s to ~1.1s.
  • Tokens halved.
  • And my orchestration code disappeared.

Reasoning Traces as Your New Logs

Reasoning-native models emit structured traces — JSON instead of plain text logs.

{
  "plan": ["search", "summarize", "validate"],
  "confidence": 0.94,
  "tokens_used": 540,
  "duration_ms": 1180,
  "outcome": "validated_summary"
}

I store these in SQLite and diff them like code.
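
Here is a minimal sketch of that storage-and-diff step, using the trace fields from the JSON above (the table layout and helper names are mine, not a fixed schema):

import json
import sqlite3

conn = sqlite3.connect("traces.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS traces (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        plan TEXT,
        confidence REAL,
        tokens_used INTEGER,
        duration_ms INTEGER,
        outcome TEXT
    )
""")

def store_trace(trace: dict) -> None:
    # Persist one reasoning trace per row so runs can be compared over time.
    conn.execute(
        "INSERT INTO traces (plan, confidence, tokens_used, duration_ms, outcome) "
        "VALUES (?, ?, ?, ?, ?)",
        (json.dumps(trace["plan"]), trace["confidence"],
         trace["tokens_used"], trace["duration_ms"], trace["outcome"]),
    )
    conn.commit()

def diff_against_baseline(trace: dict, baseline: dict) -> list[str]:
    # A "git diff" for cognition: report plan drift and confidence drops.
    changes = []
    if trace["plan"] != baseline["plan"]:
        changes.append(f"plan changed: {baseline['plan']} -> {trace['plan']}")
    if trace["confidence"] < baseline["confidence"]:
        changes.append(f"confidence dropped: {baseline['confidence']} -> {trace['confidence']}")
    return changes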

When a model’s plan drifts or confidence drops, my CI flags it.

def test_reasoning_behavior(client, baseline_trace):
    result = client.run("Summarize and validate: LangGraph")
    trace = result.trace
    assert trace.confidence >= baseline_trace["confidence"]
    assert len(trace.plan) == len(baseline_trace["plan"])

It’s not deterministic — it’s bounded variance, and that’s testable.


Observing Cognition in Production

Each reasoning session becomes a span in OpenTelemetry.
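
A sketch of that wiring with the OpenTelemetry Python SDK, assuming the same client.run() and trace interface as above (the span and attribute names are my own convention, not a standard):

from opentelemetry import trace

tracer = trace.get_tracer("reasoning-agent")

def run_with_span(client, prompt: str):
    # One OpenTelemetry span per reasoning session.
    with tracer.start_as_current_span("reasoning.session") as span:
        result = client.run(prompt)
        # Attach the trace fields as span attributes so the observability backend can query them.
        span.set_attribute("reasoning.plan", ",".join(result.trace.plan))
        span.set_attribute("reasoning.confidence", result.trace.confidence)
        span.set_attribute("reasoning.tokens_used", result.trace.tokens_used)
        span.set_attribute("reasoning.duration_ms", result.trace.duration_ms)
        return result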

I track average token usage and confidence in Grafana:

SELECT avg(tokens_used), avg(confidence)
FROM traces
WHERE created_at > datetime('now', '-1 day');

When confidence dips, we investigate prompts — not servers.

Monitoring cognition feels weird at first, but it’s the only way to scale non-deterministic intelligence safely.


What You Trade (and Gain)

Aspect         LangGraph         Reasoning + MCP
Control        Explicit nodes    Emergent planning
Debugging      Stack traces      Cognitive traces
Determinism    High              Bounded
Adaptability   Low               High
Maintenance    Tedious           Lightweight
Creativity     Predictable       Expansive

LangGraph still shines where compliance or determinism rule.

But when your system needs flexibility and learning-like behavior, reasoning-native wins.


Lessons (and Mistakes) from the Field

  1. Reasoning drift — identical input, different answers. Use baseline scoring.
  2. Token bloat — models love to overthink. Set session budgets.
  3. Trace chaos — vendor formats differ. Normalize JSON (see the sketch after this list).
  4. Compliance audits — sign and retain traces.
  5. Team culture — teach prompt design as system design.
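
For point 3, a minimal normalizer maps each vendor's trace onto one internal shape. The per-vendor field names below are placeholders (every provider formats traces differently), so treat this as a sketch rather than a real mapping:

def normalize_trace(raw: dict, vendor: str) -> dict:
    # Map vendor-specific trace fields onto the internal schema used above.
    if vendor == "anthropic":
        return {
            "plan": raw.get("plan", []),
            "confidence": raw.get("confidence", 0.0),
            "tokens_used": raw.get("tokens_used", 0),
        }
    if vendor == "openai":
        return {
            "plan": raw.get("steps", []),
            "confidence": raw.get("score", 0.0),
            "tokens_used": raw.get("usage", {}).get("total_tokens", 0),
        }
    raise ValueError(f"Unknown trace format: {vendor}")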

When AWS Gets It Right with AgentCore

The most interesting thing about AWS AgentCore isn’t that it’s another managed service — it’s that its design choices perfectly align with where reasoning-native orchestration is headed.

AgentCore treats reasoning as a first-class runtime concern, not a framework problem.

You don’t define workflows. You define intent and tools, and the runtime handles retries, observability, and tool invocation policy for you — all inside a managed cognitive environment.

from strands import Agent, tool
from strands.models import BedrockModel
from strands_tools import http_request

@tool
def summarize(text: str, max_words: int = 200) -> str:
    return " ".join(text.split()[:max_words]) + "..."

@tool
def validate(summary: str) -> str:
    return "OK" if "important" in summary else "ERROR"

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="eu-west-3",
    client_args={"api_key": "*********"},
    params={"temperature": 0.3}
)

agent = Agent(
    model=model,
    tools=[summarize, validate, http_request]
)

prompt = """
Summarize the key points of the LangGraph framework and then validate the summary.
"""

response = agent(prompt)
print(response)

That’s it — no DAGs, no state machines, no manual recovery logic.
AgentCore hosts your agent, tracks its reasoning trace, and automatically replays or retries failed tool calls.

In practice, it embeds orchestration inside reasoning, instead of wrapping reasoning inside orchestration.
That inversion is exactly what this whole post argues for.

The fact that AWS — arguably the most conservative player in cloud architecture — built AgentCore this way tells me this isn’t a niche trend.
It’s the industry quietly acknowledging that reasoning is the new orchestration layer.


Finally

LangGraph taught me structure.
It forced me to make my thinking explicit — every state, every branch, every validation. But it also showed me the limits of explicitness: when you draw too much of the reasoning, the system stops thinking for itself.

Reasoning models flipped that.
They brought back uncertainty — and with it, adaptability. Instead of engineering every step, I started describing outcomes and letting the model figure out how to get there. That shift didn’t remove orchestration; it changed what orchestration means.

And now, when I look at frameworks like AWS AgentCore, I see the industry catching up. They didn’t rebuild the LangGraph paradigm. They embedded it inside the runtime — where reasoning, recovery, and policy coexist naturally. It’s orchestration without the scaffolding, cognition without the chaos.

So here’s what I’ve learned: the future of orchestration isn’t about bigger graphs or smarter DAGs. It’s about trusting the reasoning layer to do what the graph used to. And whether that reasoning runs in your own loop, or inside a managed runtime like AgentCore, the principle stays the same:

The system no longer needs to be told how to think,
it just needs to be given the space to reason.
