
Wojtek Pluta for Oracle Developers

Posted with Nacho Martinez • Originally published at blogs.oracle.com

Agent Reasoning: The Thinking Layer

Key Takeaways

  • Agent Reasoning is an open-source reasoning layer that adds planning, deduction, and self-correction to any Ollama-served LLM (e.g., gemma3, llama3), via plug-and-play Python or a proxy server.
  • Multiple proven reasoning strategies built-in (CoT, Self-Consistency, ToT, ReAct, Self-Reflection, Decomposition, Refinement) with a guided “start simple” path.
  • Practical tooling for teams: interactive CLI/TUI, Python API, and an Ollama-compatible gateway so existing apps gain reasoning without code changes.
  • Clear benchmark guidance: CoT delivers the best average accuracy; ToT shines for multi-step logic; ReAct leads when tools (search, calculator) matter.

Implementing Cognitive Problem-Solving in Open Source Models

From Nacho Martinez, Data Scientist Advocate at Oracle (and author of the A2A-based Multi-Agent RAG system), comes an open-source reasoning layer that enables any open-source Large Language Model (LLM), such as gemma3 or llama3, to perform complex planning, logical deduction, and self-correction. The layer wraps these models in a cognitive architecture based on key research papers (CoT, ToT, and ReAct).

We call this Agent Reasoning, and it is available open-source in this GitHub repository, alongside a Jupyter notebook.

Features of Agent Reasoning

  • Plug & Play: Use via Python Class or as a Network Proxy.
  • Model Agnostic: Works with any model served by Ollama.
  • Advanced Architectures:
    • Chain-of-Thought (CoT) & Self-Consistency: Implements Majority Voting (k samples) with temperature sampling.
    • Tree of Thoughts (ToT): BFS strategy with robust heuristic scoring and pruning.
    • ReAct (Reason + Act): Real-time tool usage (Web Search via scraping, Wikipedia API, Calculator) with fallback/mock capabilities, providing external grounding.
    • Self-Reflection: Dynamic multi-turn Refinement Loop (Draft → Critique → Improve).
    • Decomposition & Least-to-Most: Planning and sub-task execution.
    • Refinement Loop: Score-based iterative improvement (Generator → Critic → Refiner) until quality threshold met.
    • Complex Refinement Pipeline: 5-stage optimization (Technical Accuracy → Structure → Depth → Examples → Polish).
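As an illustration of the Self-Consistency bullet above, here is a minimal majority-voting sketch. The stubbed sampler stands in for temperature-sampled LLM calls; none of this is the repository's actual code.

```python
from collections import Counter

def mock_sample(query: str, temperature: float) -> str:
    """Stand-in for a temperature-sampled LLM call; returns a final answer.
    In the real layer this would be an Ollama completion with CoT prompting."""
    canned = ["42", "42", "41"]
    mock_sample.calls = getattr(mock_sample, "calls", 0) + 1
    return canned[(mock_sample.calls - 1) % len(canned)]

def self_consistency(query: str, k: int = 3, temperature: float = 0.7) -> str:
    """Sample k reasoning paths and return the majority-vote answer."""
    answers = [mock_sample(query, temperature) for _ in range(k)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

print(self_consistency("What is 6 * 7?", k=3))  # majority answer: "42"
```

The key design point is that diversity comes from temperature sampling, while correctness comes from the vote: a single wrong reasoning path is outvoted by the agreeing majority.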

    Interactive Jupyter Notebook

    We prepared an interactive Jupyter notebook to demonstrate the capabilities of Agent Reasoning.

    This is a comprehensive demo covering all reasoning strategies (CoT, ToT, ReAct, Self-Reflection) with benchmarks and comparisons.

    Architectures in Detail

    For most users, start with Chain-of-Thought (CoT) — it has the best average accuracy and lowest latency cost. Use Self-Consistency when correctness is critical and you can afford 3–5× more inference time. Avoid ToT for knowledge-retrieval tasks (it underperforms baseline on MMLU) and reserve it for multi-step planning or logic puzzles.

| Architecture | Description | Best For | Papers |
| --- | --- | --- | --- |
| Chain-of-Thought | Step-by-step reasoning prompt injection. | Math, Logic, Explanations | Wei et al. (2022) |
| Self-Reflection | Draft → Critique → Refine loop. | Creative Writing, High Accuracy | Shinn et al. (2023) |
| ReAct | Interleaves Reasoning and Tool Usage. | Fact-checking, Calculations | Yao et al. (2022) |
| Tree of Thoughts | Explores multiple reasoning branches (BFS/DFS). | Complex Riddles, Strategy | Yao et al. (2023) |
| Decomposed | Breaks complex queries into sub-tasks. | Planning, Long-form answers | Khot et al. (2022) |
| Recursive (RLM) | Uses a Python REPL to recursively process prompt variables. | Long-context processing | Author et al. (2025) |
| Refinement Loop | Generator → Critic (0.0-1.0 score) → Refiner iterative loop. | Technical Writing, Quality Content | Inspired by Madaan et al. (2023) |
| Complex Refinement | 5-stage pipeline: Accuracy → Clarity → Depth → Examples → Polish. | Long-form Articles, Documentation | Multi-stage refinement architecture |
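To make the Chain-of-Thought row concrete, here is a minimal sketch of prompt injection plus answer extraction, with a mock model standing in for the Ollama call. This is illustrative only, not the repository's implementation.

```python
def mock_llm(prompt: str) -> str:
    """Stand-in for an Ollama call; returns a tiny worked solution."""
    return ("Step 1: 17 * 20 = 340.\n"
            "Step 2: 17 * 4 = 68.\n"
            "Step 3: 340 + 68 = 408.\n"
            "Final answer: 408")

def chain_of_thought(question: str) -> str:
    """Inject a step-by-step instruction, then extract the final answer line."""
    prompt = f"{question}\n\nLet's think step by step, then state 'Final answer:'."
    reply = mock_llm(prompt)
    for line in reply.splitlines():
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return reply  # fall back to the raw reply if no marker is found

print(chain_of_thought("What is 17 * 24?"))  # 408
```

The "injection" is just the rewritten prompt; the parsing step matters because benchmark scoring needs a single extractable answer rather than the full reasoning trace.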

    Accuracy Benchmarks

    You can evaluate reasoning strategies against standard NLP datasets to measure accuracy improvements. The benchmark system includes embedded question sets from 4 standard datasets.

To run an accuracy benchmark, use the interactive CLI or the Python API; the exact commands and code snippets are in the GitHub repository.

    Charts are auto-generated after each run and saved to benchmarks/charts/.

| Dataset | Category | Questions | Format | Reference |
| --- | --- | --- | --- | --- |
| GSM8K | Math Reasoning | 30 | Open-ended number | Cobbe et al. (2021) |
| MMLU | Knowledge (57 subjects) | 30 | Multiple choice (A-D) | Hendrycks et al. (2021) |
| ARC-Challenge | Science Reasoning | 25 | Multiple choice (A-D) | Clark et al. (2018) |
| HellaSwag | Commonsense | 20 | Multiple choice (A-D) | Zellers et al. (2019) |

    The following are results from a full evaluation across all 11 strategies (seven are shown below):

| Strategy | GSM8K | MMLU | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- |
| Standard (baseline) | 66.7% | 90.0% | 92.0% | 90.0% | 84.7% |
| Chain of Thought | 73.3% | 96.7% | 88.0% | 90.0% | 87.0% |
| Tree of Thoughts | 76.7% | 63.3% | 76.0% | 90.0% | 76.5% |
| ReAct | 63.3% | 86.7% | 96.0% | 90.0% | 84.0% |
| Self-Reflection | 66.7% | 90.0% | 88.0% | 90.0% | 83.7% |
| Self-Consistency | 76.7% | 96.7% | 92.0% | 66.3% | n/a |
| Decomposed | 10.0% | 60.0% | 84.0% | 38.5% | n/a |

    Key findings:

    • CoT achieves the highest average accuracy (87.0%), outperforming Standard on GSM8K (+6.6%) and MMLU (+6.7%)
    • Self-Consistency ties CoT on MMLU (96.7%) and GSM8K (76.7%) through majority voting
    • ToT excels on GSM8K math (76.7%, +10% over Standard) through branch exploration
    • ReAct achieves the highest ARC-Challenge score (96.0%) via tool-augmented reasoning
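ReAct's reason/act/observe loop, credited above for the ARC-Challenge result, can be sketched with a mock model and a toy calculator tool. The transcript format below is illustrative, not the repository's.

```python
import re

def calculator(expression: str) -> str:
    """Toy calculator tool for arithmetic only (demo; not safe for untrusted input)."""
    return str(eval(expression, {"__builtins__": {}}))

def mock_llm(transcript: str) -> str:
    """Stand-in for the LLM: first requests the tool, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need arithmetic.\nAction: calculator[128 * 37]"
    return "Thought: I have the result.\nFinal Answer: 4736"

def react(question: str, max_steps: int = 3) -> str:
    """Interleave reasoning with tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = mock_llm(transcript)
        match = re.search(r"Action: calculator\[(.+?)\]", reply)
        if match:
            observation = calculator(match.group(1))
            transcript += f"\n{reply}\nObservation: {observation}"
        elif "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
    return "(no answer)"

print(react("What is 128 * 37?"))  # 4736
```

The grounding benefit is visible in the loop: the model never has to compute 128 * 37 itself; it only has to decide that a tool call is needed and read back the observation.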

    Accuracy statistics

    Two charts summarize the runs: a per-strategy accuracy heat map, and average accuracy by strategy across the four datasets for gemma3:latest.

    Benchmarks

    Benchmark charts are auto-generated after every benchmark run.

    For a complete listing of sample output benchmarks (response latency, throughput etc.) please refer to the Agent Reasoning GitHub repository.

    Quick start (3 commands)

    uv sync && ollama pull gemma3:270m && uv run agent-reasoning

    Installation

    One-command install

    curl -fsSL https://raw.githubusercontent.com/jasperan/agent-reasoning/main/install.sh | bash

You can also install agent-reasoning from PyPI or directly from source using uv; see the repository README for the exact commands.

    Development

    Configuring the large language model (LLM)

    We use Ollama as an example for this procedure.

    Ollama must be running locally, or you can connect to a remote Ollama instance.

    ollama pull gemma3:270m    # Tiny model for quick testing
    ollama pull gemma3:latest  # Full model for quality results

    Configuring the remote Ollama endpoint

    If you don't have Ollama installed locally, you can connect to a remote Ollama instance. Configuration is stored in config.yaml in the root directory of the repository.

    Option 1: Interactive CLI configuration

    agent-reasoning
    # Select "Configure Endpoint" from the menu

    Option 2: Server CLI Argument

    agent-reasoning-server --ollama-host http://192.168.1.100:11434

    Option 3: Direct Config File

    Copy the example config and edit it:

    cp config.yaml.example config.yaml

    Or create config.yaml in the project root:

    ollama:
      host: http://192.168.1.100:11434

    Option 4: Python API
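The snippet for this option is in the repository. As a neutral illustration of what the setting amounts to, here is a tiny stdlib-only reader for the ollama.host value; the parsing is purely illustrative, since the library loads config.yaml itself.

```python
# Illustration only: extract ollama.host from the config fragment shown above.
EXAMPLE = """\
ollama:
  host: http://192.168.1.100:11434
"""

def read_ollama_host(text: str, default: str = "http://localhost:11434") -> str:
    """Return the host under the 'ollama:' section, or a default."""
    in_ollama = False
    for line in text.splitlines():
        if line.strip() == "ollama:":
            in_ollama = True
        elif in_ollama and line.strip().startswith("host:"):
            return line.split("host:", 1)[1].strip()
    return default

print(read_ollama_host(EXAMPLE))  # http://192.168.1.100:11434
```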

    Usage

    1. Interactive CLI

    Use the rich CLI to access all agents, comparisons and benchmarks.

    • Timing Metrics: Every response shows TTFT, total time, tokens/sec
    • Session History: All chats auto-saved to data/sessions/ with export to markdown
    • Head-to-Head: Compare any two strategies side-by-side in parallel
    • Agent Info: Built-in strategy guide with descriptions and use cases
    • Benchmark Charts: Auto-generate PNG visualizations of benchmark results

    The CLI also provides useful shortcuts; setup steps, the full shortcut list, and a tour of the interactive experience are documented in the repository README.

    2. Terminal UI

    You can also use a Go-based terminal interface with a split-panel layout and arena grid view.

    • Split layout: agent sidebar + chat panel
    • Arena mode: a grid view showing all agents running in parallel
    • Real-time streaming with cancellation support

    The TUI automatically starts the reasoning server on launch. Requires Go 1.18+.

    Keybindings for TUI

    Chat View

    The default chat view is a split-pane layout with a 16-agent sidebar, chat panel with live streaming, and a metrics bar showing TTFT, tokens/sec, and token count in real-time.

    Press v to toggle structured visualization mode. Instead of raw text, you see the agent's reasoning process rendered live: tree diagrams for ToT, swimlanes for ReAct, vote tallies for Consistency, score gauges for Refinement, and more.

    Press p to open the hyperparameter tuner. Adjust ToT width/depth, Consistency samples, Refinement score thresholds, and other agent parameters before running a query.

    Press ? to invoke the strategy advisor. The MetaReasoningAgent analyzes your query and recommends the best strategy.

    Modes of interaction

    Arena Mode races all 16 agents simultaneously on the same query, displayed in a 4x4 grid; a leaderboard bar updates as each agent finishes.

    Head-to-Head Duel pits two agents against each other on the same query.

    There are plenty of other features to try, such as:

    • the Step-Through Debugger which enables pausing the agent between LLM calls and inspecting intermediate state
    • the Benchmark Dashboard which reads existing JSON benchmark files
    • the Session Browser which enables search and re-running of past conversations, with filtering options
    • the Agent Guide, which contains reference cards for all 16 agents, covering best-for, parameters, trade-offs, and research reference. Pressing Enter on any card initiates a chat with the agent.

    3. Python API (for developers)

    Use the ReasoningInterceptor as a drop-in replacement for your LLM client.

You can use agents directly, or use the refinement agents for quality control; see the repository README for the corresponding snippets.
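Whatever the exact library API, the Generator → Critic → Refiner pattern described in the feature list can be sketched with mock components. Everything below is a toy stand-in, not the repo's code.

```python
def generate(query: str) -> str:
    """Mock generator: produces a rough first draft."""
    return f"Draft answer to: {query}"

def critic(draft: str) -> float:
    """Mock critic: scores a draft in [0.0, 1.0]; here longer drafts score higher."""
    return min(1.0, len(draft) / 60)

def refine(draft: str) -> str:
    """Mock refiner: expands the draft based on the (implicit) critique."""
    return draft + " Expanded with more detail."

def refinement_loop(query: str, threshold: float = 0.9, max_iters: int = 5):
    """Iterate draft -> score -> refine until the quality threshold is met."""
    draft = generate(query)
    score = critic(draft)
    for _ in range(max_iters):
        if score >= threshold:
            break
        draft = refine(draft)
        score = critic(draft)
    return draft, score

text, score = refinement_loop("Explain BFS")
print(round(score, 2))
```

The max_iters bound matters in practice: a critic that never reaches the threshold would otherwise loop forever, so score-based loops always pair the threshold with an iteration cap.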

    4. Reasoning Gateway Server

    Run a proxy server that impersonates Ollama. This allows any Ollama-compatible app, such as LangChain or Web UIs, to gain reasoning capabilities without any code changes whatsoever.

    Then configure your app:

    • Base URL: http://localhost:8080
    • Model: gemma3:270m+cot (or +tot, +react, etc.)
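To see what "no code changes" means in practice, here is the request body an Ollama-compatible app would send. It is a standard Ollama /api/generate payload; only the base URL and the strategy-suffixed model name change.

```python
import json

# A normal Ollama /api/generate payload; the gateway impersonates Ollama,
# so only the model name carries the reasoning strategy.
payload = {
    "model": "gemma3:270m+cot",   # strategy suffix is interpreted by the gateway
    "prompt": "Why is the sky blue?",
    "stream": False,
}
body = json.dumps(payload)
print(body)
# An app would POST this to http://localhost:8080/api/generate
# instead of the usual Ollama endpoint on port 11434.
```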

    API Endpoints

    Troubleshooting

    • Model Not Found: Ensure you have pulled the base model (ollama pull gemma3:270m).
    • Timeout / Slow: ToT and Self-Reflection make multiple LLM calls; with larger models (e.g., llama3:70b) this can take time.
    • Hallucinations: The default demo uses gemma3:270m which is extremely small and prone to logic errors. Switch to gemma2:9b or llama3 for robust results.

    Extending the system further

    You can add additional reasoning strategies.

    1. Create a class in src/agent_reasoning/agents/ inheriting from BaseAgent.
    2. Implement the stream(self, query) method.
    3. Register it in AGENT_MAP in src/agent_reasoning/interceptor.py.
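The three steps can be sketched end-to-end. BaseAgent and AGENT_MAP below are local stand-ins for the ones in src/agent_reasoning/, so the snippet is illustrative rather than drop-in.

```python
# Local stand-ins for the repo's BaseAgent and AGENT_MAP, for illustration only.
class BaseAgent:
    def stream(self, query: str):
        raise NotImplementedError

AGENT_MAP = {}

# Steps 1 and 2: a new strategy inheriting BaseAgent and implementing stream().
class EchoTwiceAgent(BaseAgent):
    """Toy strategy: yields the query, then a restated 'refined' pass."""
    def stream(self, query: str):
        yield f"First pass: {query}"
        yield f"Second pass (refined): {query}"

# Step 3: register the new strategy under a name.
AGENT_MAP["echo-twice"] = EchoTwiceAgent

agent = AGENT_MAP["echo-twice"]()
print(list(agent.stream("hello")))
```

Because stream() is a generator, the interceptor and TUI can render partial output as it arrives, which is why the base class exposes a streaming interface rather than a single return value.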

    Conclusion

    Thank you for reading, and we look forward to seeing what you build using Agent Reasoning!

    Frequently Asked Questions (FAQs)

    When should I use each strategy?

    Start with Chain-of-Thought for best accuracy/latency trade-off; use Self-Consistency when correctness is critical; reserve Tree of Thoughts for complex multi-step reasoning; pick ReAct for fact-checks or calculations.

    Do I need a specific model?

No. It’s model-agnostic and works with any model served by Ollama. Quality improves with larger models (e.g., gemma2:9b or llama3 versus the tiny gemma3:270m).

    How hard is setup?

    Three-command quick start, one-line install script, and ready-to-run demos in a Jupyter notebook. A proxy lets existing Ollama apps adopt reasoning by just changing the base URL/model name.

    How do I evaluate results?

    Built-in benchmarks (GSM8K, MMLU, ARC-Challenge, HellaSwag) auto-generate charts, with side-by-side strategy comparisons and session histories for review.
