When AI Stopped Answering and Started Acting: Lessons from Two Research Papers
Muhammad Abbas · 24P-0545 · BS-CS · FAST NUCES Peshawar
Artificial Intelligence — Dr. Bilal Jan
[🎥 Watch my 2-minute video breakdown on YouTube]
📓 Explore my NotebookLM notebook (https://notebooklm.google.com/notebook/594319fb-bb23-47ea-afbc-468253ade11b)
Why I'm Writing This
For most of my CS education so far, AI meant one thing: a model you prompt, and it responds. Clean, bounded, predictable. Two research papers I read this semester completely changed that mental model.
The first — "The Rise of Agentic AI" (Bandi et al., 2025, Future Internet, MDPI) — is a systematic review of 143 studies on agentic AI systems. The second — "A Survey of LLM-based Deep Search Agents" (Xi et al., 2025, arXiv) — is the first comprehensive survey of AI agents built specifically for complex, multi-step information retrieval.
Together, they paint a clear picture of where AI is actually heading — and it connects directly to every core topic in our course: search algorithms, agent types, environments, and rational behavior.
Paper 1: The Rise of Agentic AI
What the Paper Is
Bandi et al. reviewed 143 primary studies on agentic AI, covering work published from 2005 to 2025. More than 90% of those papers were from 2024–2025 alone, which tells you everything about how fast this field is moving. The authors are from Northwest Missouri State University and published this in Future Internet (MDPI) in September 2025.
The paper's core argument is simple but consequential: traditional AI answers prompts. Agentic AI pursues goals.
An agentic system doesn't just respond to a single query — it decomposes complex objectives into sub-tasks, selects tools, takes multi-step actions, evaluates its own progress, and continues operating with minimal human supervision. The difference is like asking someone a question versus hiring someone to complete a project.
It's Not Theoretical Anymore
A spring 2025 survey by MIT Sloan Management Review and Boston Consulting Group found that 35% of organizations had already adopted AI agents, with another 44% planning to deploy them soon. Microsoft, Google, Salesforce, and IBM are all embedding agentic capabilities directly into their enterprise software platforms.
Interest in the term "agentic AI" was minimal for years, then spiked sharply beginning in April 2024, reaching peak global search popularity by July 2025 — according to Google Trends data cited in the paper. This is not hype; it is a genuine architectural shift already underway.
The Four Capabilities That Matter
The paper organizes agentic systems around four core mechanisms. Reading these through the lens of our AI course made them immediately recognizable:
Planning — Agentic systems decompose a high-level goal into ordered sub-goals and pursue them sequentially. This is the same logic as search: finding a path from a start state to a goal state through a sequence of actions. Agentic AI applies this in open-ended, real-world environments.
Memory — Unlike a stateless LLM call that forgets everything, agentic systems maintain short-term working memory (current context) and long-term episodic memory (past interactions and outcomes). This is what allows an agent to reason across multiple sessions.
Reflection — Some systems can evaluate their own outputs, identify errors or gaps, and retry. This maps directly to heuristic improvement and feedback loops in our course material.
Tool Use — Agents can invoke external APIs, browse the web, execute code, and query databases. They are not bounded by training data; they can act on the world in real time.
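The four mechanisms fit together into a single control loop. Here is a minimal sketch of that loop, with toy stand-ins for tools; the function names and the string-splitting "decomposition" are my own illustration, not the paper's framework or any real agent library:

```python
# Minimal sketch of the four agentic mechanisms as one control loop.
# Names and toy "tools" are my own illustration, not a real framework's API.

def run_agent(goal, tools, max_retries=3):
    """Pursue `goal` by decomposing it, calling tools, and reflecting."""
    memory = []                                   # Memory: persists across steps
    subgoals = goal.split(" and ")                # Planning: toy decomposition
    for sub in subgoals:
        for attempt in range(max_retries):
            tool = tools.get(sub)                 # Tool use: pick an external call
            result = tool(sub) if tool else None
            memory.append((sub, result))          # Memory: record the outcome
            if result is not None:                # Reflection: did this step work?
                break
    return memory

# Toy tools standing in for real APIs, code execution, or web search
tools = {
    "fetch data": lambda s: {"rows": 3},
    "summarize": lambda s: "3 rows found",
}

trace = run_agent("fetch data and summarize", tools)
```

The point of the sketch is the shape, not the details: a stateless LLM call would be a single function invocation, while this loop keeps state, retries, and acts through tools.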
The Connection to Our AI Course
In class, we study the PEAS framework: Performance measure, Environment, Actuators, Sensors. Agentic AI systems are the most complete real-world implementation of this framework I have encountered. They have sensors (user inputs, web data, tool responses), actuators (API calls, code execution, file writes), a performance measure they optimize toward, and they operate in environments that are dynamic, partially observable, and non-deterministic — exactly the environment types we classify in class.
We also study the agent type hierarchy: reactive agents, goal-based agents, utility-based agents, and learning agents. The paper's architectural taxonomy maps almost perfectly onto this ladder. Agentic AI represents the goal-based and utility-based tiers operating at scale — systems that reason about future consequences before committing to actions.
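One way to make the PEAS mapping concrete is to write it down as a data structure. The field values below are my own reading of the paper's description, not notation the authors use:

```python
from dataclasses import dataclass
from typing import List

# A PEAS description of an agentic AI system, as plain data.
# The concrete entries reflect my reading of the paper, not its notation.

@dataclass
class PEAS:
    performance: str
    environment: List[str]
    actuators: List[str]
    sensors: List[str]

agentic_ai = PEAS(
    performance="progress toward the user's stated goal",
    environment=["dynamic", "partially observable", "non-deterministic"],
    actuators=["API calls", "code execution", "file writes"],
    sensors=["user inputs", "web data", "tool responses"],
)
```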
The Risk That Doesn't Get Enough Attention
The paper highlights a failure mode specific to agentic systems that I hadn't thought about before. If a standard LLM gives a wrong answer, that is a single error. If an agentic system takes ten sequential actions and makes a wrong decision at step seven, it may continue building on that mistake for all remaining steps — compounding the error across the entire workflow.
The MIT Sloan article on agentic AI (2025) puts it directly: a rogue agent making a consequential decision based on faulty reasoning can cause far more damage than a single hallucination. This is one of the core open challenges the paper identifies alongside reliability, interpretability, and governance.
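The compounding effect is easy to quantify. Assuming each step succeeds independently with some probability p (my simplifying assumption; the paper's point is qualitative, and these numbers are illustrative), a ten-step workflow succeeds with probability p¹⁰:

```python
# If each of n sequential steps is independently correct with probability p,
# the whole workflow is correct only with probability p**n.
# Independence and the 0.95 figure are my assumptions, not the paper's data.

def workflow_success(p: float, n: int) -> float:
    return p ** n

per_step = 0.95
ten_steps = workflow_success(per_step, 10)   # ~0.599: a "95% reliable" agent
                                             # fails 4 workflows out of 10
```

A model that is right 95% of the time per step sounds excellent, but chained over ten dependent actions it completes the workflow correctly only about 60% of the time. That is why per-step reliability understates the risk of agentic systems.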
Paper 2: A Survey of LLM-based Deep Search Agents
The Three Generations of Search
Xi et al. from Shanghai Jiao Tong University wrote what they describe as the first systematic survey of LLM-based Search Agents — a specialized class of agentic systems focused on complex information retrieval. The paper was submitted to arXiv in August 2025 (arXiv:2508.05668).
The paper traces three distinct phases of how AI handles search:
Traditional Web Search — The user writes a query, receives a ranked list of links, and synthesizes the results manually. Passive retrieval.
LLM-Enhanced Search — The model rewrites the user's query to improve accuracy, or summarizes retrieved results. Still fundamentally one-turn and passive. This is standard RAG (retrieval-augmented generation).
LLM-based Search Agents — The agent understands the user's intent, not just their literal query. It plans a multi-step retrieval strategy, executes multiple search rounds dynamically, synthesizes evidence across sources, and produces a structured, grounded answer. OpenAI's Deep Research is the most visible commercial example of this paradigm.
The Architecture — Directly Connected to Our Search Algorithms
The paper organizes search agents around four questions, but the one that connected most directly to our course is the first: how to search.
There are two structural approaches. Sequential search processes one path at a time — either reflection-driven (the agent evaluates results after each step and decides what to search next based on what it found) or proactivity-driven (the agent plans the full search strategy upfront before executing). Parallel search explores multiple directions simultaneously — either decomposition-based (breaking the query into independent sub-queries) or diversification-based (generating varied perspectives on the same query).
The mapping to what we study is direct:
Decomposition-based parallel search is structurally equivalent to BFS — expanding many directions at the same level simultaneously
Reflection-driven sequential search is equivalent to DFS with backtracking — going deep on one path, recognizing a dead end, and returning to explore alternatives
The agent's decision of which direction to search next mirrors A* heuristics — using a signal to estimate which path has the highest expected value
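The BFS/DFS mapping can be shown directly on a toy sub-query tree. The tree below is a hand-made stand-in for an agent's decomposed query, and the traversal code is the textbook version of each algorithm, not anything from the survey:

```python
from collections import deque

# Toy sub-query tree: a decomposed research question (my own example).
subqueries = {
    "root": ["angle A", "angle B"],
    "angle A": ["A.1", "A.2"],
    "angle B": ["B.1"],
}

def parallel_search(tree, start):
    """Decomposition-based parallel search ~ BFS: expand whole levels at once."""
    order, frontier = [], deque([start])
    while frontier:
        q = frontier.popleft()
        order.append(q)
        frontier.extend(tree.get(q, []))
    return order

def sequential_search(tree, start):
    """Reflection-driven sequential search ~ DFS: go deep, backtrack at dead ends."""
    order, stack = [], [start]
    while stack:
        q = stack.pop()
        order.append(q)
        stack.extend(reversed(tree.get(q, [])))
    return order
```

Running both on the same tree makes the structural difference visible: the parallel agent visits both angles before any leaf, while the sequential agent exhausts angle A before ever touching angle B.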
I did not expect a 2025 survey paper to be, in effect, an applied version of chapter three of our textbook. The paper makes the connection explicit.
The Benchmark Result That Stopped Me
The paper references a benchmark called DeepWideSearch (arXiv:2510.20168), designed to test whether agents can perform both deep reasoning and wide-scale information gathering simultaneously. Most benchmarks test one or the other. DeepWideSearch tests the combination — approximating what a human researcher actually does.
The result: the most advanced agent systems achieve only 2.39% average success rate.
That number is worth sitting with. These are frontier models — GPT-5, Gemini 2.5 Pro, Claude Sonnet 4 — and they succeed at this combined task less than three times out of a hundred. The benchmark identifies four root causes of failure: agents lack effective reflection when they encounter wrong search trajectories; they over-rely on internal training knowledge rather than retrieving current information; they fail to extract relevant content even when they reach the right pages; and they run into context overflow on complex tasks.
We are at the very beginning of solving this problem.
What NotebookLM Revealed
I want to be transparent about something. On my first manual read of this paper, I skimmed past the distinction between proactivity-driven and reflection-driven sequential search. They seemed like implementation details.
When I uploaded the paper to Google NotebookLM and asked it to explain the difference in depth, the distinction became clear in a way my first reading missed. Proactivity-driven search plans ahead — the agent commits to a strategy before it starts. Reflection-driven search adapts continuously — the agent evaluates each result and decides the next step based on what it found. The first is forward planning; the second is iterative deepening with mid-search backtracking.
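The two control styles can be contrasted in a short sketch. Everything here — `plan_all`, `evaluate`, `next_query`, and the toy lambdas — is my own placeholder naming, chosen to show the shape of each loop rather than any real system:

```python
# Proactive vs. reflective sequential search, side by side.
# All helper names are my own placeholders, not the paper's terminology.

def proactive_search(query, search, plan_all):
    """Commit to the full strategy upfront, then execute it in order."""
    steps = plan_all(query)                # forward planning, fixed in advance
    return [search(s) for s in steps]

def reflective_search(query, search, evaluate, next_query, max_rounds=5):
    """Decide each next step from what the last result revealed."""
    results, current = [], query
    for _ in range(max_rounds):
        r = search(current)
        results.append(r)
        if evaluate(r):                    # reflection: good enough? stop
            break
        current = next_query(current, r)   # adapt based on what was found
    return results

# Toy stand-ins so the two loops can actually run
search = lambda q: f"results for {q}"
plan_all = lambda q: [q + " background", q + " details"]
evaluate = lambda r: "details" in r
next_query = lambda q, r: q + " details"

pro = proactive_search("topic", search, plan_all)
ref = reflective_search("topic", search, evaluate, next_query)
```

The proactive loop never looks at its own results; the reflective loop reads each result before choosing the next query, which is exactly the adaptivity the paper's taxonomy turns on.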
Once I had that framing, the whole taxonomy of the paper clicked. My honest conclusion: reading the paper yourself first is essential — you need to engage with the material directly. But using NotebookLM as a second pass on the dense technical parts is genuinely useful, not a shortcut.
Connecting Both Papers
Reading them together, the progression is clear:
| Stage | What It Does |
| --- | --- |
| Traditional AI | Answers a single question |
| Agentic AI | Pursues a goal across multiple steps |
| LLM Search Agents | Agentic AI specialized for knowledge-intensive tasks |
Both papers converge on the same open research problems: long-horizon planning, graceful failure recovery, multi-agent coordination, evaluation under open-ended conditions, and responsible deployment. These are not abstract academic challenges. They are the same questions we encounter in our AI course when we ask: how does an agent navigate an unknown environment? How does it decide which path to explore? How does it know when a solution is good enough?
My Takeaway
Before these papers, I thought of AI as a function: input → output. After reading them, I think of AI as an actor: state → decision → action → new state.
The agent perception-action loop we cover in class is not a theoretical abstraction. It is the architecture underlying the most capable AI systems being built and deployed right now. Reading these papers made that real for me in a way that no lecture or textbook chapter fully had.
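The function-versus-actor contrast can be written in a few lines. The tiny number-line "environment" below is my own toy, chosen only to make the state → decision → action → new state loop concrete:

```python
# "Function" view vs. "actor" view of AI, side by side.
# The number-line environment is a toy of my own invention.

def function_ai(x):
    return x * 2                             # input -> output, no state, done

def actor_ai(state, goal):
    trajectory = [state]
    while state != goal:
        action = 1 if state < goal else -1   # decision from the current state
        state = state + action               # action produces a new state
        trajectory.append(state)             # and the loop continues to the goal
    return trajectory
```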
References
Bandi, A., Kongari, B., Naguru, R., Pasnoor, S., & Vilipala, S.V. (2025). The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges. Future Internet, 17(9), 404. https://www.mdpi.com/1999-5903/17/9/404
Xi, Y., Lin, J., Xiao, Y., Zhou, Z., Shan, R., Gao, T., Zhu, J., Liu, W., Yu, Y., & Zhang, W. (2025). A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges. arXiv:2508.05668. https://arxiv.org/abs/2508.05668
DeepWideSearch (2025). Benchmarking Depth and Width in Agentic Information Seeking. arXiv:2510.20168. https://arxiv.org/abs/2510.20168
MIT Sloan Management Review (2025). Agentic AI, explained. https://mitsloan.mit.edu/ideas-made-to-matter/agentic-ai-explained
Tags: #AI #AgenticAI #LLM #MachineLearning #SearchAgents #ComputerScience #Beginners
Mention: @raqeebr on Hashnode
Muhammad Abbas · 24P-0545 · FAST NUCES Peshawar