Welcome to part 5 of our LLM series! Today, we explore how LLMs escape their training data limitations and interact with the real world through three key technologies: Retrieval-Augmented Generation (RAG), Tool Calling, and Agentic Workflows. These systems transform LLMs from isolated knowledge repositories into dynamic interfaces capable of accessing current information, executing actions, and pursuing goals autonomously.
1. Retrieval-Augmented Generation (RAG): Bridging the Knowledge Cutoff Gap
The fundamental limitation of pre-trained LLMs is their knowledge cutoff—they cannot know events occurring after their training data ends. RAG addresses this by providing models with access to external knowledge sources at inference time.
The RAG Architecture
The canonical RAG framework was introduced by Lewis et al. (2020) in the seminal paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". The architecture combines a retriever and a generator in an end-to-end differentiable model:
class RAGSystem:
    """
    Implementation based on Lewis et al. (2020)
    Combines DPR retriever with BART generator
    """
    def __init__(self):
        self.retriever = DensePassageRetriever()  # Dual-encoder architecture
        self.generator = Seq2SeqModel()           # Typically BART or T5

    def forward(self, query, k=5):
        # Retrieve relevant documents
        docs = self.retriever.retrieve(query, k=k)
        # Concatenate for generator
        input_text = f"Query: {query}\nDocuments: {docs}"
        # Generate answer conditioned on retrieved docs
        return self.generator.generate(input_text)
The Two-Stage Retrieval Pipeline
Modern RAG systems employ sophisticated retrieval strategies:
def hierarchical_retrieval(query, corpus, top_k=10, rerank_k=100):
    """
    Two-stage retrieval. The first stage follows dense retrieval as in
    Karpukhin et al. (2020), 'Dense Passage Retrieval for Open-Domain
    Question Answering' (https://arxiv.org/abs/2004.04906). The second
    stage re-ranks with a cross-encoder as in Nogueira & Cho (2019),
    'Passage Re-ranking with BERT' (https://arxiv.org/abs/1901.04085).
    """
    # Stage 1: first-stage retrieval (fast, approximate)
    candidates = first_stage_retrieval(query, corpus, k=rerank_k)
    # Stage 2: re-ranking (slower, more precise)
    reranked = cross_encoder_reranker(query, candidates, k=top_k)
    return reranked
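The re-ranking stage is where most of the precision comes from: the cross-encoder attends jointly over the query and each candidate passage instead of comparing precomputed vectors. As a concrete sketch, here is roughly what cross_encoder_reranker could look like with the sentence-transformers library (the checkpoint name is one common choice, not a requirement):

from sentence_transformers import CrossEncoder

def cross_encoder_reranker(query, candidates, k=10):
    # Cross-encoder checkpoint trained for passage re-ranking on MS MARCO
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # Score each (query, passage) pair jointly: precise, but the cost
    # grows linearly with the number of candidates
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]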
Why Large Context Windows Aren't Enough
While models like GPT-4 Turbo offer 128K context windows and Claude 3 reaches 200K, several studies reveal limitations:
The Needle-in-a-Haystack Problem: Liu et al. (2023) in "Lost in the Middle: How Language Models Use Long Contexts" demonstrate that model performance degrades when relevant information is in the middle of long contexts.
Attention Dilution: As context length increases, attention weight is spread across more tokens, so any single relevant passage receives a smaller share of the model's attention. Long-context models mitigate (but do not eliminate) this with positional schemes such as the rotary embeddings of Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding".
Economic Constraints: At current API pricing (~$0.01 per 1K tokens), processing 100K tokens costs ~$1 per query—prohibitive for many applications.
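To make that estimate concrete, here is the arithmetic as a tiny script (the per-token price is an illustrative assumption, not a live quote):

PRICE_PER_1K_TOKENS = 0.01  # assumed input price in USD, for illustration

def query_cost(context_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    return context_tokens / 1000 * price_per_1k

print(query_cost(100_000))  # full 100K-token context: ~$1.00 per query
print(query_cost(5 * 500))  # RAG with 5 retrieved 500-token chunks: ~$0.025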
Advanced RAG Techniques
Hypothetical Document Embeddings (HyDE)
Proposed by Gao et al. (2022) in "Precise Zero-Shot Dense Retrieval without Relevance Labels", HyDE generates a hypothetical answer and uses its embedding for retrieval:
def hyde_retrieval(query, llm, retriever):
    """
    Implementation of HyDE from Gao et al. (2022)
    """
    # Generate hypothetical document
    hypothetical_doc = llm.generate(
        f"Write a document that answers: {query}"
    )
    # Use hypothetical document embedding for retrieval
    return retriever.retrieve(hypothetical_doc)
ColBERT-based Reranking
Khattab & Zaharia (2020) introduced ColBERT in "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", enabling late interaction between query and document representations:
import torch

class ColBERTReranker:
    """
    Late interaction architecture from Khattab & Zaharia (2020).
    Enables fine-grained, token-level similarity computation.
    (ColBERT proper derives separate query and document encoders from a
    shared BERT; a single encoder is shown here for brevity.)
    """
    def score(self, query, document):
        # Encode query and document separately into per-token embeddings
        Q = self.encoder(query)     # Shape: [q_len, d]
        D = self.encoder(document)  # Shape: [d_len, d]
        # Pairwise similarity between every query and document token
        similarities = torch.einsum('qd,nd->qn', Q, D)
        # MaxSim: best-matching document token for each query token
        max_sim = torch.max(similarities, dim=1).values
        return torch.sum(max_sim)   # Sum of maximum similarities
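To see the late-interaction shapes concretely, here is a toy run with random embeddings (the dimensions are arbitrary):

import torch

q_len, d_len, dim = 8, 120, 128   # arbitrary toy sizes
Q = torch.randn(q_len, dim)       # per-token query embeddings
D = torch.randn(d_len, dim)       # per-token document embeddings

similarities = torch.einsum('qd,nd->qn', Q, D)       # [q_len, d_len]
score = torch.max(similarities, dim=1).values.sum()  # MaxSim, then sum
print(similarities.shape, score.item())              # torch.Size([8, 120]) <scalar>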
RAG Evaluation Metrics
Proper evaluation requires multiple perspectives:
def evaluate_rag_system(system, benchmark):
    """
    Comprehensive evaluation following Thakur et al. (2021),
    'BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of
    Information Retrieval Models' (https://arxiv.org/abs/2104.08663)
    """
    metrics = {
        # Retrieval metrics
        "recall@k": calculate_recall_at_k,
        "precision@k": calculate_precision_at_k,
        "mrr": calculate_mean_reciprocal_rank,
        "ndcg": calculate_normalized_dcg,
        # Generation metrics
        "rouge": calculate_rouge_scores,    # Lin (2004)
        "bleu": calculate_bleu_score,       # Papineni et al. (2002)
        "bertscore": calculate_bert_score,  # Zhang et al. (2019)
        # End-to-end metrics
        "exact_match": calculate_exact_match,
        "f1_score": calculate_f1_score
    }
    return {name: metric(system, benchmark) for name, metric in metrics.items()}
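Two of the retrieval metrics above are simple enough to define inline. A minimal sketch at the per-query level (simplified relative to the (system, benchmark) signature used in the table):

def calculate_recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def calculate_mean_reciprocal_rank(all_retrieved, all_relevant):
    """Mean of 1/rank of the first relevant document across queries."""
    reciprocal_ranks = []
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, doc in enumerate(retrieved)
                     if doc in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)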
2. Tool and Function Calling: LLMs as Programmers
Tool calling enables LLMs to interface with external APIs and systems. The foundational work comes from Schick et al. (2023) in "Toolformer: Language Models Can Teach Themselves to Use Tools", which demonstrated that LLMs can learn to use tools through self-supervised learning.
Tool Learning Paradigms
class ToolLearningMethods:
    """
    Comparison of tool learning approaches
    """
    def toolformer_approach(self):
        """
        From Schick et al. (2023): self-supervised tool learning.
        Key innovation: API calls as special tokens in training.
        """
        training_data = self.generate_api_call_dataset()
        # Fine-tune with API call tokens interleaved in the text
        return fine_tune_with_special_tokens(
            self.model,
            training_data,
            special_tokens=["[API_CALL]", "[API_RESULT]"]
        )

    def gorilla_approach(self):
        """
        From Patil et al. (2023): 'Gorilla: Large Language Model Connected
        with Massive APIs' (https://arxiv.org/abs/2305.15334).
        Specializes in API documentation understanding.
        """
        # Train on API documentation plus call examples
        training_data = self.collect_api_documentation()
        # Fine-tune for API call generation
        return fine_tune_for_api_generation(self.model, training_data)
The Tool Calling Workflow
def tool_calling_pipeline(query, tools, llm, tool_router):
    """
    Implementation following Qin et al. (2023),
    'ToolLLM: Facilitating Large Language Models to Master 16000+
    Real-world APIs' (https://arxiv.org/abs/2307.16789)
    """
    # Step 1: Tool selection (routing)
    selected_tools = tool_router.select_tools(query, tools, max_tools=5)
    # Step 2: Planning, based on Yao et al. (2022),
    # 'ReAct: Synergizing Reasoning and Acting'
    thought_process = llm.generate(
        f"Plan how to use tools to answer: {query}\n"
        f"Available tools: {selected_tools}"
    )
    # Step 3: API call generation
    api_calls = []
    for tool in selected_tools:
        call = llm.generate_api_call(tool, query, thought_process)
        api_calls.append(call)
    # Step 4: Execution
    results = [execute_api_call(call) for call in api_calls]
    # Step 5: Response synthesis
    final_response = llm.synthesize_results(query, results, thought_process)
    return final_response
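In practice, most APIs express each tool as a JSON Schema declaration and leave tool selection to the model. A minimal, self-contained sketch of that pattern; the schema shape follows the common function-calling convention, and get_weather plus its dispatch table are hypothetical:

import json

# Hypothetical tool exposed to the model as a JSON Schema declaration
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["city"]
    }
}

def dispatch(tool_call, registry):
    """Execute a model-emitted tool call against a local function registry."""
    args = json.loads(tool_call["arguments"])
    return registry[tool_call["name"]](**args)

registry = {"get_weather": lambda city, unit="celsius": f"18 degrees in {city}"}
print(dispatch({"name": "get_weather",
                "arguments": '{"city": "Berlin"}'}, registry))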
Model Context Protocol (MCP)
Anthropic's Model Context Protocol standardizes how LLM applications expose tools, resources, and prompts to models:
class MCPIntegration:
    """
    Implementation of Anthropic's Model Context Protocol
    Standardizes tool, resource, and prompt exposure
    """
    def configure_mcp_server(self):
        config = {
            "version": "1.0",
            "tools": self.expose_tools_as_mcp(),
            "resources": self.expose_resources_as_mcp(),
            "prompts": self.standardize_prompts()
        }
        # MCP enables standardized communication
        # between LLM hosts and tool providers
        return MCPServer(config)

    def mcp_tool_schema(self):
        return {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "description": {"type": "string"},
                "inputSchema": {
                    "type": "object",
                    "properties": {...}
                }
            },
            "required": ["name", "description"]
        }
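A concrete instance of that schema for a single hypothetical tool might look like this (the tool itself is made up; the field names follow the schema above):

# Hypothetical tool definition conforming to the MCP tool schema above
search_tool = {
    "name": "search_docs",
    "description": "Full-text search over an internal documentation corpus",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer", "default": 5}
        },
        "required": ["query"]
    }
}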
3. Agentic LLMs: Autonomous Goal Pursuit
Agentic systems are the most autonomous application of LLMs to date: models that pursue goals through iterative cycles of reasoning, action, and observation.
The ReAct Framework
Yao et al. (2022) introduced the ReAct paradigm in "ReAct: Synergizing Reasoning and Acting in Language Models", which combines reasoning traces with action steps:
class ReActAgent:
    """
    Implementation of Yao et al. (2022) ReAct framework
    """
    def solve_problem(self, problem, max_steps=10):
        trajectory = []
        for step in range(max_steps):
            # Generate thought (reasoning)
            thought = self.generate_thought(problem, trajectory)
            # Generate action based on thought
            action = self.generate_action(thought, self.available_actions)
            # Execute action
            observation = self.execute_action(action)
            # Update trajectory
            trajectory.append({
                "thought": thought,
                "action": action,
                "observation": observation
            })
            # Check termination condition
            if self.is_goal_achieved(observation, problem):
                break
        return self.format_solution(trajectory)

    def generate_thought(self, problem, trajectory):
        """
        Implements chain-of-thought reasoning from Wei et al. (2022),
        'Chain-of-Thought Prompting Elicits Reasoning in Large Language
        Models' (https://arxiv.org/abs/2201.11903)
        """
        prompt = self.build_react_prompt(problem, trajectory)
        return self.llm.generate(prompt, max_tokens=200)
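For intuition, here is a single trajectory entry as the loop above would record it; the question and the search tool are illustrative:

# Illustrative ReAct step (question and tool are made up)
example_step = {
    "thought": "I need the 2020 population of Ghent; a web search should find it.",
    "action": 'search("Ghent population 2020")',
    "observation": "Ghent had approximately 263,000 inhabitants in 2020."
}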
Multi-Agent Systems
Recent work explores collaborative multi-agent systems:
class MultiAgentSystem:
    """
    Based on Hong et al. (2023), 'MetaGPT: Meta Programming for a
    Multi-Agent Collaborative Framework' (https://arxiv.org/abs/2308.00352)
    """
    def __init__(self):
        self.agents = {
            "product_manager": ProductManagerAgent(),
            "architect": ArchitectAgent(),
            "project_manager": ProjectManagerAgent(),
            "engineer": EngineerAgent(),
            "qa": QualityAssuranceAgent()
        }
        # Communication protocol inspired by Park et al. (2023),
        # 'Generative Agents: Interactive Simulacra of Human Behavior'
        # (https://arxiv.org/abs/2304.03442)
        self.message_bus = MessageBus()

    def execute_project(self, requirements):
        # Product manager defines specifications
        specs = self.agents["product_manager"].analyze_requirements(requirements)
        # Architect designs system
        design = self.agents["architect"].create_design(specs)
        # Project manager creates plan
        plan = self.agents["project_manager"].create_plan(design)
        # Engineers implement
        implementation = []
        for task in plan:
            code = self.agents["engineer"].implement(task)
            implementation.append(code)
        # QA tests
        tests = self.agents["qa"].test(implementation)
        return self.compile_results(specs, design, plan, implementation, tests)
Agent2Agent Protocol (A2A)
Google's Agent2Agent (A2A) protocol standardizes communication between independent agents:
import uuid
from datetime import datetime, timezone

class A2AIntegration:
    """
    Sketch of Google's Agent2Agent (A2A) protocol concepts
    """
    def define_agent_skill(self, agent, skill):
        return {
            "id": f"{agent.id}:{skill.name}",
            "description": skill.description,
            "input_schema": skill.input_schema,
            "output_schema": skill.output_schema,
            "capabilities": skill.capabilities,
            "constraints": skill.constraints
        }

    def agent_communication(self, sender, receiver, message):
        """
        Implements a structured inter-agent message envelope
        """
        return {
            "message_id": str(uuid.uuid4()),
            "sender": sender.id,
            "receiver": receiver.id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "content": message,
            "content_type": "text/plain",  # or "application/json"
            "priority": "normal",          # low, normal, high, critical
            "ttl": 3600,                   # time-to-live in seconds
            "requires_ack": True
        }
4. Safety and Robustness Considerations
Safety Risks in Agentic Systems
Extensive research highlights the risks:
Prompt Injection Attacks: Perez & Ribeiro (2022) in "Ignore Previous Prompt: Attack Techniques For Language Models" demonstrate vulnerabilities.
Jailbreaking: Wei et al. (2023) in "Jailbroken: How Does LLM Safety Training Fail?" analyze safety training limitations.
Tool Misuse: Dedicated benchmarks such as Agent-SafetyBench systematically test how agents can be induced to misuse their tools.
Defensive Architectures
class SafetyLayer:
    """
    Implements safety measures from various research
    """
    def __init__(self):
        # Input validation from Gehman et al. (2020),
        # 'RealToxicityPrompts: Evaluating Neural Toxic Degeneration in
        # Language Models' (https://arxiv.org/abs/2009.11462)
        self.toxicity_classifier = load_toxicity_classifier()
        # Output verification from Madaan et al. (2023),
        # 'Self-Refine: Iterative Refinement with Self-Feedback'
        # (https://arxiv.org/abs/2303.17651)
        self.self_refinement = SelfRefinementModule()
        # Tool permission system
        self.permission_manager = PermissionManager()

    def validate_action(self, agent, proposed_action):
        """
        Multi-layer safety validation
        """
        checks = [
            self.check_toxicity(proposed_action),
            self.check_permissions(agent, proposed_action),
            self.check_resource_access(agent, proposed_action),
            self.check_rate_limits(agent, proposed_action),
            self.check_content_policy(proposed_action)
        ]
        return all(checks)
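Of these checks, check_rate_limits is the easiest to make concrete. A minimal sliding-window limiter, with arbitrary default limits:

import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_s seconds."""
    def __init__(self, max_calls=30, window_s=60):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = defaultdict(deque)  # agent_id -> call timestamps

    def check_rate_limits(self, agent_id):
        now = time.monotonic()
        window = self.calls[agent_id]
        # Evict timestamps that have aged out of the window
        while window and now - window[0] > self.window_s:
            window.popleft()
        if len(window) >= self.max_calls:
            return False  # over budget: reject the action
        window.append(now)
        return True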
Formal Verification Approaches
Recent work explores formal methods for agent safety:
class FormalVerification:
    """
    Conceptual sketch: formal verification of agent behavior is an
    active research direction rather than an established recipe
    (see the open challenges in Section 6)
    """
    def verify_agent_behavior(self, agent, specification):
        """
        Formal verification of agent behavior against specifications
        """
        # Translate specification to linear temporal logic (LTL)
        spec_ltl = self.translate_to_ltl(specification)
        # Model agent behavior as a finite transition system
        transition_system = self.extract_transition_system(agent)
        # Verify using model checking
        return self.model_check(transition_system, spec_ltl)
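To ground the model_check step: a safety property of the form G(safe) ("globally safe" in LTL) fails exactly when some unsafe state is reachable, so for a finite transition system it reduces to a reachability search. A minimal sketch under that simplification:

from collections import deque

def model_check_invariant(initial, transitions, is_safe):
    """
    Check the LTL safety property G(safe) on a finite transition system
    via breadth-first reachability. Returns (holds, counterexample).
    """
    queue = deque([(initial, [initial])])
    visited = {initial}
    while queue:
        state, path = queue.popleft()
        if not is_safe(state):
            return False, path  # counterexample trace to an unsafe state
        for nxt in transitions.get(state, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return True, None

# Toy agent whose 'delete' state violates the safety invariant
transitions = {"idle": ["read", "write"], "write": ["idle", "delete"]}
print(model_check_invariant("idle", transitions, lambda s: s != "delete"))
# -> (False, ['idle', 'write', 'delete'])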
5. Practical Implementation Guidelines
Development Workflow
class AgentDevelopmentPipeline:
    """
    Best practices distilled from research and industry experience
    """
    def development_cycle(self):
        stages = [
            # 1. Prototyping
            self.prototype_with_most_capable_model(),
            # 2. Evaluation
            self.evaluate_with_comprehensive_benchmarks(),
            # 3. Fine-tuning
            self.fine_tune_with_domain_data(),
            # 4. Safety testing
            self.red_team_testing(),
            # 5. Deployment
            self.deploy_with_guardrails(),
            # 6. Monitoring
            self.monitor_with_telemetry()
        ]
        return stages

    def benchmark_suite(self):
        """
        Comprehensive evaluation suite
        """
        return {
            "capability": [
                "HotpotQA",   # Yang et al. (2018)
                "FEVER",      # Thorne et al. (2018)
                "TriviaQA"    # Joshi et al. (2017)
            ],
            "safety": [
                "ToxiGen",    # Hartvigsen et al. (2022)
                "RealToxicityPrompts",
                "AdvBench"    # Zou et al. (2023)
            ],
            "reasoning": [
                "GSM8K",      # Cobbe et al. (2021)
                "MATH",       # Hendrycks et al. (2021)
                "ARC"         # Clark et al. (2018)
            ],
            "tool_use": [
                "ToolBench",  # Qin et al. (2023)
                "API-Bank"    # Li et al. (2023)
            ]
        }
6. Future Research Directions
Open Challenges
class ResearchFrontiers:
    """
    Active research areas based on recent literature
    """
    def current_challenges(self):
        return {
            "long_horizon_planning": {
                "problem": "Agents struggle with very long sequences of actions",
                "references": [
                    "Huang et al. (2022) 'Language Models as Zero-Shot Planners'",
                    "Ahn et al. (2022) 'Do As I Can, Not As I Say'"
                ]
            },
            "verifiable_safety": {
                "problem": "Proving safety properties of agent behavior",
                "references": [
                    "Levine et al. (2023) 'Formal Verification of Neural Networks'",
                    "Bastani et al. (2018) 'Verifiable Reinforcement Learning via Policy Extraction'"
                ]
            },
            "cross_modal_tool_use": {
                "problem": "Using tools across different modalities",
                "references": [
                    "Yang et al. (2023) 'MM-REACT: Prompting ChatGPT for Multimodal Reasoning'",
                    "Suris et al. (2023) 'ViperGPT: Visual Inference via Python Execution'"
                ]
            }
        }

    def emerging_paradigms(self):
        return [
            "Foundation models for tool use",
            "Self-improving agent systems",
            "Formally verified agent architectures",
            "Multi-agent emergent behaviors"
        ]
7. Implementation Resources
Open Source Frameworks
class OpenSourceEcosystem:
    """
    Current state of open source tools (as of early 2025)
    """
    def rag_frameworks(self):
        return {
            # Orchestration frameworks
            "haystack": "https://haystack.deepset.ai/",
            "llama_index": "https://www.llamaindex.ai/",
            # Vector databases commonly used as RAG backends
            "chroma": "https://www.trychroma.com/",
            "weaviate": "https://weaviate.io/"
        }

    def agent_frameworks(self):
        return {
            "langchain": "https://www.langchain.com/",
            "autogen": "https://microsoft.github.io/autogen/",
            "crewai": "https://www.crewai.com/",
            "smolagents": "https://github.com/huggingface/smolagents"
        }

    def evaluation_tools(self):
        return {
            "ragas": "https://github.com/explodinggradients/ragas",
            "truera": "https://truera.com/",
            "langsmith": "https://smith.langchain.com/"
        }
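As a taste of how compact these frameworks make the basic RAG loop, here is approximately the LlamaIndex quickstart as of early 2025 (APIs move quickly, so treat this as a sketch and check the current docs; it also assumes an embedding/LLM provider such as OpenAI is configured):

# Approximate LlamaIndex quickstart (verify against current docs)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # local files
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, store
query_engine = index.as_query_engine()                 # retrieval + generation
print(query_engine.query("What changed in the Q3 report?"))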
Conclusion
The evolution from static LLMs to dynamic, interactive systems represents one of the most significant advances in AI. By combining retrieval, tool use, and agentic capabilities, we are building systems that can act on current information, execute real actions, and pursue goals in the world.
Key Research Takeaways:
- RAG effectiveness depends crucially on retrieval quality, with two-stage retrieval (dense retrieval plus cross-encoder reranking) typically outperforming either stage alone
- Tool learning benefits from both supervised fine-tuning and self-supervised approaches like Toolformer
- Agentic systems require careful safety design, with formal verification emerging as a critical research direction
- Evaluation must be multi-faceted, assessing capability, safety, and robustness across diverse benchmarks
Next Article Preview: In our next installment, we'll dive deep into LLM Evaluation, covering benchmark design, evaluation methodologies, and the challenges of measuring true capability versus benchmark performance.
References
Core Papers Cited
RAG Foundations
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering
- Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT
Tool Learning
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools
- Qin, Y., et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
- Patil, S., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs
Agentic Systems
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models
- Hong, S., et al. (2023). MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework
- Park, J., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior
Safety and Evaluation
- Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
- Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models
- Zou, A., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models
Further Reading
- Gao, L., et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels
- Qian, C., et al. (2023). Communicative Agents for Software Development
- Perez, E., et al. (2022). Red Teaming Language Models with Language Models
What specific challenges have you faced in implementing RAG or agentic systems? What evaluation metrics have you found most valuable in practice? Share your experiences in the comments.