By Abdul Azeem Javaid — BS Computer Science, FAST-NUCES Peshawar Published: March 2026 | AI Course, Dr. Bilal Jan | Roll No: 24P0523
Introduction: Why I Wrote This
I am a second-year Computer Science student at FAST-NUCES Peshawar, and this semester I am taking an AI course taught by Dr. Bilal Jan. We recently finished covering the core building blocks of AI: rational agents, environment classifications, search algorithms (BFS, DFS, A*, UCS, iterative deepening), and constraint satisfaction problems. The theory is elegant. The textbook examples are clean. And then you look at what researchers are actually building in 2025, and you realise those fundamentals are not just academic preparation — they are the active DNA of some of the most powerful systems being deployed right now.
For this assignment, I was asked to select two recent AI research papers, read them carefully, and connect them to course concepts. I chose:
"The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges" — Bandi, Kongari, Naguru, Pasnoor & Vilipala (2025), published in Future Internet (MDPI). DOI: 10.3390/fi17090404
"A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges" — Xi, Lin, Xiao, Zhou, Shan, Gao, Zhu, Liu, Yu & Zhang (2025), from Shanghai Jiao Tong University and Central South University. arXiv: 2508.05668
Both papers are free to access. I read them both in full before using Google NotebookLM to deepen my understanding. What follows is my honest account of what each paper argues, how it connects to the AI concepts I have been learning in class, and what genuinely surprised me along the way.
Paper 1: The Rise of Agentic AI (Bandi et al., 2025)
What Problem Is the Paper Solving?
By 2025, the term "agentic AI" had become so widely used — and so inconsistently defined — that researchers, engineers, and journalists were essentially using it to mean different things. A marketing team would call a chatbot with memory "agentic." A research lab would reserve the word for systems that decompose goals and use tools autonomously. The confusion was real and had practical consequences: it made it difficult to evaluate systems, compare architectures, or build on prior work.
Bandi and colleagues set out to fix this. They conducted a systematic literature review of 143 primary studies on agentic AI. More than 90% of these studies were published in 2024 or 2025 alone — which tells you something important about how fast this field is moving. From this body of work, the authors construct a unified taxonomy: a structured, hierarchical definition of what agentic AI is, what capabilities it requires, how those capabilities combine into different architectures, and where the field is headed.
The Core Argument: Agency Is a Spectrum
The paper's most important conceptual contribution is the idea that agency is not binary. A system does not either have agency or not have it. Agency is a spectrum, and a system's position on that spectrum is determined by how many of four core capabilities it possesses, and how deeply it has developed each one.
The four pillars the paper identifies are:
Strategic Planning: The ability to decompose a high-level, long-horizon goal into a sequence of concrete sub-tasks, execute those sub-tasks in the right order, monitor progress, and adapt when circumstances change. This is not the same as following a fixed script — it requires reasoning about what the next step should be given the current state of the world.
Persistent Memory: The ability to retain information across interactions. Traditional language models are amnesiac: each conversation starts fresh. An agentic system can remember what it learned in a previous session, recall which tools it has used before, and update its knowledge based on new information it encounters. The paper distinguishes between short-term (in-context) memory and long-term (external database or vector store) memory — a distinction that turns out to matter a lot for system design.
Tool Use: The ability to interact with external systems — search engines, APIs, databases, calculators, code interpreters, other AI models — to extend the agent's capabilities beyond what it can do through pure language reasoning. An agent that can write and execute code, query a live database, or browse the web is operating at a fundamentally different level from one that can only generate text.
Multi-Agent Collaboration: The ability to coordinate with other agents: to delegate sub-tasks, to specialise, to verify each other's work, and to combine outputs. The paper describes architectures where multiple agents operate in parallel, with some acting as orchestrators and others as specialist workers — a pattern that should sound familiar to anyone who has studied distributed systems or multi-agent environments in an AI course.
No single capability makes a system agentic. It is their combination in one goal-directed system that qualifies.
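To make the "combination, not any single pillar" point concrete, here is a minimal sketch of how the four capabilities might coexist in one agent loop. All class and method names here are hypothetical illustrations of the paper's framework, not code from either paper.

```python
from dataclasses import dataclass, field

@dataclass
class PillarAgent:
    """Toy agent combining the paper's four pillars (illustrative only)."""
    tools: dict = field(default_factory=dict)   # Tool Use
    memory: list = field(default_factory=list)  # Persistent Memory
    peers: list = field(default_factory=list)   # Multi-Agent Collaboration

    def plan(self, goal):
        # Strategic Planning: decompose a goal into ordered sub-tasks.
        return [f"{goal}:step{i}" for i in range(1, 4)]

    def run(self, goal):
        results = []
        for step in self.plan(goal):
            # Delegate to a peer agent if one exists, else use a local tool.
            worker = self.peers[0] if self.peers else None
            output = worker.do(step) if worker else self.tools["default"](step)
            self.memory.append((step, output))  # retained across steps
            results.append(output)
        return results
```

Removing any one field or method leaves a system that still "works" but sits lower on the agency spectrum — which is exactly the paper's point.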
The Agent Type Taxonomy
The paper's taxonomy of agent types maps directly onto the framework that most introductory AI courses teach. The authors identify four levels:
| Agent Level | Core Capability | Course Equivalent |
| --- | --- | --- |
| Reactive agents | Respond to immediate percepts with no memory or planning | Simple reflex agents |
| Limited-memory agents | Retain recent context to inform current decisions | Model-based reflex agents |
| Goal-based agents | Reason about future states to select actions that achieve goals | Goal-based agents |
| Utility-based agents | Optimise across multiple competing objectives using a utility function | Utility-based agents |
What the paper adds, beyond this textbook taxonomy, is an empirical survey of which architectures are actually being deployed in industry and research. The finding is clear: reactive and limited-memory architectures are insufficient for real-world tasks of any complexity. The systems that perform well are consistently goal-based or utility-based — and the ones that are genuinely transforming industries are utility-based agents with all four pillars active.
The Challenge Section: Honest Assessments
One of the things I appreciated most about this paper is how honest it is about the unsolved problems. The challenges it identifies include:
Reliability: Agentic systems fail in ways that are difficult to predict and even more difficult to reproduce. An agent that worked perfectly on a benchmark may behave erratically in deployment.
Interpretability: When a complex multi-agent system produces an output, it is often impossible to trace which agent made which decision and why. This is not just a technical problem — it is a governance problem.
Resource management: Agents that can issue hundreds of API calls, write and execute code, and spawn sub-agents can consume enormous computational resources in pursuit of a goal. Without careful constraint design, they can also be exploited.
Trust and verification: How do you know whether one agent's output is correct enough to pass to the next agent? The paper discusses mechanisms for agent-to-agent verification, but notes that this is an active research area with no settled answers.
These are not hypothetical concerns. They are the problems that teams building real agentic systems are grappling with right now.
Paper 2: A Survey of LLM-based Deep Search Agents (Xi et al., 2025)
The Evolution of Search: Three Stages
Xi and colleagues at Shanghai Jiao Tong University open their survey with an observation that is simultaneously obvious and profound: the way humans interact with information has gone through three distinct stages, and each stage required different underlying technology.
Stage 1 — Traditional Web Search (1990s–2010s). A user types keywords into a search engine. The engine returns a ranked list of links. The user manually reads through results, follows promising links, synthesises information across multiple pages, and eventually arrives at an answer. The intelligence is almost entirely human.
Stage 2 — LLM-Enhanced Search (2020–2023). Large language models are integrated into search. The system rewrites queries to be more effective, summarises the top results, and presents synthesised information in natural language. This is a significant improvement in user experience, but it is still a single-turn interaction. The model receives a query, retrieves some results, and generates a response. There is no planning across multiple retrieval steps.
Stage 3 — Deep Search Agents (2024–present). The system autonomously plans a multi-step retrieval strategy, issues multiple queries based on what it has learned from earlier results, synthesises information across dozens of sources, reasons about conflicting evidence, and produces a comprehensive, structured answer — all without human intervention in the loop. The intelligence has shifted substantially to the machine.
OpenAI's Deep Research product is the most well-known example of Stage 3. The paper notes that Deep Research can autonomously browse dozens of web pages in sequence, synthesise their content, identify gaps in its knowledge, issue follow-up queries, and produce a detailed research report. This is not keyword search. It is goal-directed autonomous information acquisition.
Architectures: How Search Agents Actually Work
The paper organises search agent architectures around three primary patterns:
Sequential (Reflection-Driven) Search: The agent issues a query, evaluates the results, decides whether the available information is sufficient to answer the original question, and — if not — formulates a new, more targeted query based on what it has learned. This process continues iteratively until the agent judges that it has enough information or until a resource budget is exhausted.
This architecture has a direct parallel to Iterative Deepening Search from our AI course. The agent controls its own search depth dynamically, deepening the search when earlier results prove insufficient. The key difference is that the state space is the open web rather than a well-defined problem graph, and the branching factor is essentially unlimited.
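The reflect-then-refine loop described above can be sketched in a few lines. The `search`, `sufficient`, and `refine` callables are hypothetical placeholders for a retrieval backend, a sufficiency judge, and a query rewriter — none of these names come from the paper.

```python
def reflective_search(question, search, sufficient, refine, budget=5):
    """Iteratively query, judge sufficiency, and refine the query,
    stopping on sufficiency or budget exhaustion (cf. the depth cap
    in iterative deepening)."""
    evidence, query = [], question
    for _ in range(budget):
        evidence.extend(search(query))       # retrieve with current query
        if sufficient(evidence):             # agent judges its own progress
            break
        query = refine(question, evidence)   # new, more targeted query
    return evidence
```

The explicit `budget` parameter is the key design choice: without it, an agent facing the open web's unbounded branching factor has no guarantee of termination.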
Parallel (Proactive) Search: Rather than processing one query at a time, the agent decomposes the original question into multiple independent sub-questions and issues all of them simultaneously. Results from parallel streams are then synthesised into a coherent answer.
This is analogous to multi-path BFS — expanding multiple branches simultaneously rather than completing one branch at a time. The obvious advantage is speed; the challenge is synthesis. Combining information from twenty simultaneous retrieval streams requires sophisticated reasoning about consistency, relevance, and authority.
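A minimal sketch of the decompose-fan-out-synthesise shape, using Python's standard thread pool. As before, `decompose`, `search`, and `synthesize` are hypothetical stand-ins for components the paper describes at the architecture level.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_search(question, decompose, search, synthesize):
    """Decompose into independent sub-questions, retrieve them in
    parallel (multi-path-BFS style), then synthesise one answer."""
    sub_questions = decompose(question)
    with ThreadPoolExecutor() as pool:
        # map() preserves sub-question order, which simplifies synthesis.
        streams = list(pool.map(search, sub_questions))
    return synthesize(streams)
```

Note that all of the hard reasoning the paper flags — consistency, relevance, authority across streams — is hidden inside `synthesize`; the fan-out itself is the easy part.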
Proactivity-Driven Search: The agent does not wait for a query to be fully specified. Based on partial information about what the user is looking for, it infers likely follow-up information needs and retrieves relevant material proactively — before being asked. This is the most ambitious pattern and the most powerful when it works.
This maps directly onto the proactive goal-based agent design: an agent that models likely future states and takes actions in anticipation of future needs, rather than merely reacting to current percepts.
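If estimated relevance plays the role of the heuristic, proactive retrieval looks like textbook best-first search over information needs. This is my own illustrative mapping, with hypothetical `relevance` and `expand` functions, not an implementation from the paper.

```python
import heapq

def proactive_retrieve(seed_needs, relevance, expand, k=5):
    """Best-first retrieval: always pursue the candidate information
    need with the highest estimated relevance (the heuristic), and
    expand each retrieved item into likely follow-up needs."""
    # heapq is a min-heap, so negate relevance for max-first ordering.
    frontier = [(-relevance(n), n) for n in seed_needs]
    heapq.heapify(frontier)
    retrieved = []
    while frontier and len(retrieved) < k:
        _, need = heapq.heappop(frontier)
        retrieved.append(need)
        for follow_up in expand(need):
            heapq.heappush(frontier, (-relevance(follow_up), follow_up))
    return retrieved
```

The `expand` step is what makes the pattern proactive: follow-up needs are generated and queued before the user asks for them.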
The Benchmark That Stopped Me Cold
The paper introduces and evaluates systems on a benchmark called DeepWideSearch, which is specifically designed to test two capabilities simultaneously: deep reasoning (requiring multi-step inference and synthesis) and wide information retrieval (requiring broad coverage of a large information space).
The best available LLM-based search agents, tested on this benchmark, achieved a success rate of approximately 2.39%.
Let me sit with that number for a moment. Two point three nine percent. Not twenty-three percent. Not even twelve percent. The systems that are routinely described in press releases as "revolutionary" and "transformative" are failing on roughly 97.6% of tasks that require simultaneously deep reasoning and broad retrieval.
This is not a criticism of the researchers — they designed the benchmark precisely because current systems struggle with it, and identifying failure modes is how science makes progress. But it is a significant corrective to the hype cycle. The gap between what these systems can do in a narrow, well-defined task and what they can do in a complex, open-ended one remains very large.
Training and Optimisation
The paper also covers how search agents are trained and improved. The key challenge is that many search tasks are open-ended: there is no single correct answer, and the quality of a research report is inherently subjective. This makes defining a reward signal for reinforcement learning difficult.
The paper discusses several approaches:
Process-supervised reward models, which give credit not just for arriving at the right answer but for following a good search strategy — issuing targeted queries, synthesising evidence carefully, checking for contradictions.
Outcome-based reward models, which evaluate only the final output but use multiple evaluation dimensions (accuracy, coverage, coherence) rather than a single correctness score.
Self-improvement through reflection, where the agent evaluates its own outputs and uses that evaluation to refine its search strategy.
This connects directly to the utility function design problem we studied in our AI course. How do you formally specify what "good" means for an open-ended information task? It turns out this is a genuinely hard problem, and the paper is frank about the fact that it is not yet solved.
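The outcome-based approach — multiple evaluation dimensions instead of a single correctness bit — reduces to a weighted utility function. A minimal sketch, with hypothetical judge functions standing in for whatever learned or rule-based scorers a real system would use:

```python
def outcome_reward(report, judges, weights):
    """Outcome-based reward over several dimensions (e.g. accuracy,
    coverage, coherence) rather than a single correctness score.
    `judges` maps each dimension to a scoring function returning a
    value in [0, 1]; `weights` should sum to 1."""
    return sum(weights[d] * judges[d](report) for d in judges)
```

The unsolved part, as the paper notes, is not this arithmetic — it is choosing judges and weights that actually capture what "good" means for an open-ended task.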
Connecting Both Papers to Course Concepts
This is the section where I want to be explicit about the bridges between what I am reading in these research papers and what I have been learning in class.
On Agent Types
Both papers operate within the agent type framework that Russell and Norvig established and that we studied in the course. The progression from reactive to utility-based agents is not just a theoretical taxonomy — it is the actual design progression that has played out in the AI industry over the past five years. Simple chatbots are reactive agents. Search engines with AI summaries are limited-memory agents. Deep Research systems are goal-based agents. The fully autonomous multi-agent systems that Bandi et al. describe — capable of planning, memory, tool use, and collaboration — are utility-based agents with sophisticated world models.
What the papers add is empirical evidence about where each level falls short and what is required to move to the next level. This is the kind of grounding that makes the abstract taxonomy feel concrete.
On Search Algorithms
The search patterns in Xi et al. map cleanly onto the search algorithms we studied:
| Search Agent Pattern | Course Algorithm | Key Characteristic |
| --- | --- | --- |
| Sequential reflection-driven | Iterative Deepening Search (IDS) | Depth controlled dynamically based on sufficiency |
| Parallel multi-stream | Multi-path BFS | Simultaneous expansion of multiple branches |
| Proactivity-driven | Best-First / A* | Uses estimated relevance (heuristic) to guide retrieval |
| Goal decomposition | AND-OR tree search | Complex goals decomposed into sub-problems |
The critical difference between these algorithms in their textbook form and their application in search agents is the nature of the state space. In a textbook search problem, the state space is well-defined, finite, and known in advance. The open web is none of these things. This is why the algorithms require substantial adaptation — and why the benchmark numbers are still so low.
On Environment Classification
In our course, we classify agent environments along seven dimensions. Both papers essentially describe systems operating at the complex end of every single dimension:
Partially observable: No search agent can see the entire web.
Stochastic: Queries return different results at different times; web pages change; APIs fail.
Sequential: Each retrieval step influences the next.
Dynamic: The environment changes while the agent is operating.
Continuous: The quality of a response is a continuous variable, not a discrete one.
Multi-agent: Both papers discuss multi-agent architectures explicitly.
Unknown: The agent does not know in advance which sources are authoritative.
Understanding where a system sits on these seven dimensions is not just a classification exercise — it determines which agent architecture is appropriate and which algorithms will work. This is why the textbook framework is still relevant even for systems that were inconceivable when the textbook was written.
On CSPs and Constraints
Multi-agent coordination, discussed in Bandi et al., is fundamentally a constraint satisfaction problem. When multiple agents share resources, divide tasks, and combine outputs, they must satisfy constraints: agent A cannot use tool X while agent B is using it; sub-task C must complete before sub-task D begins; the combined output must be coherent and non-contradictory. The paper discusses protocols for managing these constraints, including message-passing systems, shared memory architectures, and explicit orchestration hierarchies.
Similarly, the search termination decision in Xi et al. — when has the agent gathered enough information to produce a good answer? — is a constraint satisfaction problem over a quality space. The agent must find a stopping point that satisfies constraints on coverage, accuracy, coherence, and resource budget simultaneously.
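That stopping decision can be written down as a straight constraint check. This is my own toy formalisation of the idea, with made-up dimension names and thresholds:

```python
def should_stop(state, thresholds, budget_left):
    """Treat the stopping decision as constraint satisfaction: stop
    only when every quality constraint is met, or when the resource
    budget is exhausted. `state` maps quality dimensions (coverage,
    accuracy, coherence, ...) to current estimates in [0, 1]."""
    if budget_left <= 0:
        return True  # hard resource constraint dominates
    return all(state[c] >= thresholds[c] for c in thresholds)
```

The interesting difficulty, which the code hides, is that the agent's estimates of its own coverage and accuracy are themselves uncertain — which is part of why the benchmark numbers are so low.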
Personal Insight: Manual Reading vs. NotebookLM
I want to be honest about how I used both approaches, because the assignment specifically asks for this reflection, and because I think the comparison is genuinely interesting.
The Manual Reading Experience
My first read of Bandi et al. was slow. The paper is a systematic review, which means it spends a lot of space on methodology — explaining how studies were selected, categorised, and evaluated. I found myself reading paragraphs that I understood individually but could not assemble into a coherent picture. The terminology was unfamiliar, the number of referenced systems was overwhelming, and I kept losing track of where the paper's own argument stood against the background of the studies it was summarising.
Xi et al. was harder still. The survey references dozens of specific systems — PASA, WebPilot, ManuSearch, Search-R1, Perplexica, and many others — with no prior context. Reading the survey without having read the underlying papers felt like arriving at the middle of a conversation.
What manual reading gave me, however, was ownership of the struggle. By the time I finished both papers, I had a map of my own confusions — specific questions, specific sentences that didn't quite parse, specific places where I needed to look something up. That map turned out to be exactly what made NotebookLM useful.
The NotebookLM Experience
I uploaded both PDFs to NotebookLM and spent about two hours using it after my initial manual read.
The most useful thing I did was ask it to anchor the papers' terminology to our course material. For example, I asked: "The paper mentions limited-memory agents — is this the same as the model-based reflex agent from the AIMA textbook?" NotebookLM's response helped me see that yes, these map onto the same idea, updated for modern architectures — a model-based reflex agent that uses an LLM's context window as its world model is functionally a limited-memory agent in the paper's taxonomy.
I also asked NotebookLM to cluster the search agent systems from Xi et al. by architecture pattern, which produced a much cleaner mental model than the paper's linear presentation had given me.
What NotebookLM did not do, and this is important, is surface the details that most surprised me. The 2.39% benchmark number, for instance — I noticed it because I was reading carefully. NotebookLM's summaries emphasised the high-level architecture contributions and did not flag the evaluation results as particularly noteworthy. That asymmetry tells me something: AI-assisted reading is very good at giving you a map of the territory, but it tends to smooth over the valleys — the challenging results, the honest admissions of failure, the numbers that should give you pause.
The most valuable insight of this entire assignment came from a close, careful, unassisted reading of a single paragraph in Xi et al. No tool would have given me that.
What I Found Most Interesting
Across both papers, three ideas stood out to me as genuinely new — things I had not encountered in our course and that will change how I think about AI going forward.
1. The Degree of Agenticness
Bandi et al. argue that agency is a spectrum, not a binary. This dissolves a lot of the philosophical confusion around questions like "Is ChatGPT intelligent?" or "Is this system truly autonomous?" Those questions assume a sharp categorical boundary. The degree-of-agenticness framing replaces them with a more useful question: Which capabilities does this system have, how well developed are they, and what does that tell us about what it can do?
From a course perspective, this is a direct generalisation of the agent classification framework. Simple reflex, model-based, goal-based, and utility-based are not discrete boxes — they are waypoints along a continuum. Real systems sit between these waypoints. Understanding where a system sits, and why, is more useful than assigning it a category label.
2. Proactivity as an Agent Property
The proactivity-driven search pattern in Xi et al. — where an agent retrieves information before being explicitly asked, based on inferred user needs — is a property that I had not seen formalised in our course material. It is a direct extension of the goal-based agent concept, but in a direction that feels qualitatively different.
A purely reactive agent waits for input. A goal-based agent plans to achieve a specified goal. A proactive agent anticipates what goals the user is likely to have next and begins working toward them in advance. This requires a model not just of the world, but of the user — their current task, their likely next steps, their knowledge gaps. It is a more ambitious form of intelligence, and it raises interesting questions about autonomy and consent that the paper acknowledges but does not fully resolve.
3. The Honest Benchmark Number
I keep coming back to that 2.39% success rate on DeepWideSearch. Not because it is discouraging, but because of what it implies about the structure of intelligence.
The tasks where AI systems excel tend to be tasks where depth and breadth can be addressed sequentially. Answer this narrow question deeply, or summarise this broad domain shallowly. DeepWideSearch tests whether systems can do both simultaneously — reason carefully about complex evidence while also covering a wide information space. That turns out to be extremely difficult. The number suggests that depth and breadth are not easily composable capabilities, and that the architecture innovations needed to achieve both at once are not yet in hand.
This is the most intellectually honest finding I encountered in either paper, and it came from the kind of careful reading that no summary tool would have surfaced.
Conclusion
I started this assignment expecting to read two papers and connect them to course topics. What I found instead was a through-line. The concepts in our AI textbook — agent types, search algorithms, environment dimensions, utility functions — are not historical relics. They are the conceptual vocabulary that researchers at Shanghai Jiao Tong University and in journals published by MDPI are using right now to describe, evaluate, and improve the most advanced AI systems in the world.
Agentic AI is, at its core, a utility-based agent with a sophisticated world model and a rich toolset. Deep search agents are iterative deepening and best-first search applied to the open web, trained through reinforcement learning on reward signals that are themselves an unsolved research problem. The names change, the scale changes, the capabilities change — but the underlying framework that makes those systems intelligible is the same one Dr. Bilal Jan is teaching us.
That continuity is both motivating and clarifying. It means understanding BFS and DFS and A* is not an academic exercise. It is preparation for understanding why deep search agents are designed the way they are, where they will fail, and what would need to change for them to succeed. And it means that when the next paradigm arrives, we will be equipped to understand it quickly — because we know the foundations it will build on.
References
Bandi, A., Kongari, B., Naguru, R., Pasnoor, S., & Vilipala, S. V. (2025). The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges. Future Internet, 17(9), 404. https://doi.org/10.3390/fi17090404
Xi, Y., Lin, J., Xiao, Y., Zhou, Z., Shan, R., Gao, T., Zhu, J., Liu, W., Yu, Y., & Zhang, W. (2025). A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges. arXiv:2508.05668. https://arxiv.org/abs/2508.05668
Russell, S., & Norvig, P. (2022). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
Abdul Azeem Javaid is a BS Computer Science student at FAST-NUCES Peshawar. This post was written as part of an AI course assignment under the supervision of Dr. Bilal Jan.
Tags: #artificialintelligence #agents #searchalgorithms #deeplearning #llm #csassignment #machinelearning #agenticai #research
Cross-posted on Dev.to