Meidi Airouche for Onepoint

How Reasoning LLMs Are Challenging Orchestration

I spent most of last year buried in LangGraph.
It felt like engineering telepathy — drawing thoughts as state machines, chaining prompts into something almost alive.
Until one night, at 2 AM, I was staring at a 15-node graph just to summarize a document.
Every retry, every validation, every branch was on me.
That’s when it clicked: I was orchestrating cognition that the model could have handled itself.

So I did what most of us do when we’re tired of over-engineering: I tore it down and rebuilt it — this time around a reasoning-native model.


The usual way: Manual Orchestration with LangGraph

LangGraph made sense when models couldn’t plan or reason.

Here’s a minimal example of what my early pipelines looked like:

from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4-turbo")

class State(TypedDict, total=False):
    doc: str
    summary: str
    verified: bool

def summarize(state: State) -> dict:
    result = llm.invoke([
        SystemMessage("Summarize briefly and clearly."),
        HumanMessage(state["doc"])
    ])
    return {"summary": result.content}

def verify(state: State) -> dict:
    result = llm.invoke([
        SystemMessage("Return OK if factual, else ERROR."),
        HumanMessage(state["summary"])
    ])
    return {"verified": result.content.strip() == "OK"}

graph = StateGraph(State)
graph.add_node("summarize", summarize)
graph.add_node("verify", verify)
graph.add_edge("summarize", "verify")
graph.add_edge("verify", END)
graph.set_entry_point("summarize")

app = graph.compile()
result = app.invoke({"doc": open("manual.txt").read()})
print(result)

What we can say:

  • It’s explicit and predictable — but verbose.
  • Every new condition means another node.
  • Every correction means editing a graph.
  • You become the conductor instead of letting the model compose the music.

The New Approach: Reasoning Models + MCP Tools

When I switched to Claude 4.5 (and later GPT-5), I stopped describing steps and started describing goals.

The model could plan, call tools, and validate its own results through MCP.

from mcp import MCPClient, tool

@tool
def search(query: str) -> str:
    """Search internal docs."""
    return f"Search results for {query}"

@tool
def summarize(text: str, max_words: int = 200) -> str:
    """Summarize text."""
    return text[:max_words] + "..."

@tool
def validate(summary: str) -> str:
    """Return 'OK' if factual."""
    return "OK"

client = MCPClient(model="claude-4.5", tools=[search, summarize, validate])

prompt = """
You are a reasoning agent.
Goal: Summarize and fact-check 'LangGraph framework'.
1. Search relevant data.
2. Summarize it.
3. Validate results; retry if not OK.
Return only the final validated summary.
"""

print(client.run(prompt))

The trace looked like this:

[Plan] Step 1 → search("LangGraph framework")
[Plan] Step 2 → summarize(results)
[Plan] Step 3 → validate(summary)
[Reflection] Validation OK → returning summary

That’s one reasoning session instead of three separate LLM calls:

  • Latency dropped from 2.3s to ~1.1s.
  • Tokens halved.
  • And my orchestration code disappeared.

Reasoning Traces as Your New Logs

Reasoning-native models emit structured traces — JSON instead of plain text logs.

{
  "plan": ["search", "summarize", "validate"],
  "confidence": 0.94,
  "tokens_used": 540,
  "duration_ms": 1180,
  "outcome": "validated_summary"
}

I store these in SQLite and diff them like code.
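
Here is a minimal sketch of that storage-and-diff step, using the trace fields from the JSON above (the table layout and helper names are mine, not a fixed schema):

import json
import sqlite3

conn = sqlite3.connect("traces.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS traces (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        plan TEXT,
        confidence REAL,
        tokens_used INTEGER,
        duration_ms INTEGER,
        outcome TEXT
    )
""")

def store_trace(trace: dict) -> None:
    # Persist one reasoning trace per row so runs can be compared over time.
    conn.execute(
        "INSERT INTO traces (plan, confidence, tokens_used, duration_ms, outcome) "
        "VALUES (?, ?, ?, ?, ?)",
        (json.dumps(trace["plan"]), trace["confidence"],
         trace["tokens_used"], trace["duration_ms"], trace["outcome"]),
    )
    conn.commit()

def diff_against_baseline(trace: dict, baseline: dict) -> list[str]:
    # A "git diff" for cognition: report plan drift and confidence drops.
    changes = []
    if trace["plan"] != baseline["plan"]:
        changes.append(f"plan changed: {baseline['plan']} -> {trace['plan']}")
    if trace["confidence"] < baseline["confidence"]:
        changes.append(f"confidence dropped: {baseline['confidence']} -> {trace['confidence']}")
    return changes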

When a model’s plan drifts or confidence drops, my CI flags it.

def test_reasoning_behavior(client, baseline_trace):
    result = client.run("Summarize and validate: LangGraph")
    trace = result.trace
    assert trace.confidence >= baseline_trace["confidence"]
    assert len(trace.plan) == len(baseline_trace["plan"])

It’s not deterministic — it’s bounded variance, and that’s testable.


Observing Cognition in Production

Each reasoning session becomes a span in OpenTelemetry.
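
A sketch of that wiring with the OpenTelemetry Python SDK, assuming the same client.run() and trace interface as above (the span and attribute names are my own convention, not a standard):

from opentelemetry import trace

tracer = trace.get_tracer("reasoning-agent")

def run_with_span(client, prompt: str):
    # One OpenTelemetry span per reasoning session.
    with tracer.start_as_current_span("reasoning.session") as span:
        result = client.run(prompt)
        # Attach the trace fields as span attributes so the observability backend can query them.
        span.set_attribute("reasoning.plan", ",".join(result.trace.plan))
        span.set_attribute("reasoning.confidence", result.trace.confidence)
        span.set_attribute("reasoning.tokens_used", result.trace.tokens_used)
        span.set_attribute("reasoning.duration_ms", result.trace.duration_ms)
        return result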

I track average token usage and confidence in Grafana:

SELECT avg(tokens_used), avg(confidence)
FROM traces
WHERE created_at > datetime('now', '-1 day');

When confidence dips, we investigate prompts — not servers.

Monitoring cognition feels weird at first, but it’s the only way to scale non-deterministic intelligence safely.


What You Trade (and Gain)

Aspect         LangGraph         Reasoning + MCP
Control        Explicit nodes    Emergent planning
Debugging      Stack traces      Cognitive traces
Determinism    High              Bounded
Adaptability   Low               High
Maintenance    Tedious           Lightweight
Creativity     Predictable       Expansive

LangGraph still shines where compliance or determinism rule.

But when your system needs flexibility and learning-like behavior, reasoning-native wins.


Lessons (and Mistakes) from the Field

  1. Reasoning drift — identical input, different answers. Use baseline scoring.
  2. Token bloat — models love to overthink. Set session budgets.
  3. Trace chaos — vendor formats differ. Normalize JSON (see the sketch after this list).
  4. Compliance audits — sign and retain traces.
  5. Team culture — teach prompt design as system design.
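
For point 3, a minimal normalizer maps each vendor's trace onto one internal shape. The per-vendor field names below are placeholders (every provider formats traces differently), so treat this as a sketch rather than a real mapping:

def normalize_trace(raw: dict, vendor: str) -> dict:
    # Map vendor-specific trace fields onto the internal schema used above.
    if vendor == "anthropic":
        return {
            "plan": raw.get("plan", []),
            "confidence": raw.get("confidence", 0.0),
            "tokens_used": raw.get("tokens_used", 0),
        }
    if vendor == "openai":
        return {
            "plan": raw.get("steps", []),
            "confidence": raw.get("score", 0.0),
            "tokens_used": raw.get("usage", {}).get("total_tokens", 0),
        }
    raise ValueError(f"Unknown trace format: {vendor}")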

When AWS Gets It Right with AgentCore

The most interesting thing about AWS AgentCore isn’t that it’s another managed service — it’s that its design choices perfectly align with where reasoning-native orchestration is headed.

AgentCore treats reasoning as a first-class runtime concern, not a framework problem.

You don’t define workflows. You define intent and tools, and the runtime handles retries, observability, and tool invocation policy for you — all inside a managed cognitive environment.

from strands import Agent, tool
from strands.models import BedrockModel
from strands_tools import http_request

@tool
def summarize(text: str, max_words: int = 200) -> str:
    return " ".join(text.split()[:max_words]) + "..."

@tool
def validate(summary: str) -> str:
    return "OK" if "important" in summary else "ERROR"

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="eu-west-3",
    client_args={"api_key": "*********"},
    params={"temperature": 0.3}
)

agent = Agent(
    model=model,
    tools=[summarize, validate, http_request]
)

prompt = """
Summarize the key points of the LangGraph framework and then validate the summary.
"""

response = agent(prompt)
print(response)

That’s it — no DAGs, no state machines, no manual recovery logic.
AgentCore hosts your agent, tracks its reasoning trace, and automatically replays or retries failed tool calls.

In practice, it embeds orchestration inside reasoning, instead of wrapping reasoning inside orchestration.
That inversion is exactly what this whole post argues for.

The fact that AWS — arguably the most conservative player in cloud architecture — built AgentCore this way tells me this isn’t a niche trend.
It’s the industry quietly acknowledging that reasoning is the new orchestration layer.


Finally

LangGraph taught me structure.
It forced me to make my thinking explicit — every state, every branch, every validation. But it also showed me the limits of explicitness: when you draw too much of the reasoning, the system stops thinking for itself.

Reasoning models flipped that.
They brought back uncertainty — and with it, adaptability. Instead of engineering every step, I started describing outcomes and letting the model figure out how to get there. That shift didn’t remove orchestration; it changed what orchestration means.

And now, when I look at frameworks like AWS AgentCore, I see the industry catching up. They didn’t rebuild the LangGraph paradigm. They embedded it inside the runtime — where reasoning, recovery, and policy coexist naturally. It’s orchestration without the scaffolding, cognition without the chaos.

So here’s what I’ve learned: the future of orchestration isn’t about bigger graphs or smarter DAGs. It’s about trusting the reasoning layer to do what the graph used to. And whether that reasoning runs in your own loop, or inside a managed runtime like AgentCore, the principle stays the same:

The system no longer needs to be told how to think,
it just needs to be given the space to reason.
