<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shittu Olumide</title>
    <description>The latest articles on DEV Community by Shittu Olumide (@shittu_olumide_).</description>
    <link>https://dev.to/shittu_olumide_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F375026%2Fc3e3ff46-3a88-4a85-9c0a-5a25f59a27b5.jpg</url>
      <title>DEV Community: Shittu Olumide</title>
      <link>https://dev.to/shittu_olumide_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shittu_olumide_"/>
    <language>en</language>
    <item>
      <title>Here’s how I would learn AI Agents as a total beginner</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Mon, 30 Mar 2026 10:23:33 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/heres-how-i-would-learn-ai-agents-as-a-total-beginner-16k1</link>
      <guid>https://dev.to/shittu_olumide_/heres-how-i-would-learn-ai-agents-as-a-total-beginner-16k1</guid>
      <description>&lt;p&gt;Most people still use AI as a high-tech typewriter. They ask for an email draft or a summary of a meeting and call it a day. That approach is already becoming obsolete. We have moved past the point where AI just talks. Now, we are in the phase where AI acts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.salesmate.io/blog/ai-agents-adoption-statistics/" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; predicts that &lt;strong&gt;40%&lt;/strong&gt; of enterprise software will have task-specific agents built into them by the end of 2026. To put that in perspective, that number was below 5% in 2024. This isn’t just a small update to how software works. It is a fundamental change in how we get work done.&lt;/p&gt;

&lt;p&gt;An agent is different because it doesn’t just give you a response. It thinks through a goal, finds the tools it needs, and stays on the job until the task is complete. If I were starting today, I would not waste time on complex prompt engineering tricks. Instead, I would focus on the architecture of autonomy. This is the path I would take to go from zero to building agents that actually deliver value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nth3q1w2w40h90hejb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0nth3q1w2w40h90hejb2.png" alt="Intro Image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bayelsawatch.com/agentic-ai-market-to-exceed-usd-196-6-billion/" rel="noopener noreferrer"&gt;The market for this technology is expected to cross $196.6 billion in the coming years&lt;/a&gt;. The demand for people who can build these systems is far outstripping the number of people who actually understand how they work. Learning this now puts you ahead of the curve before the field becomes crowded.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Defining the “Agent” (It isn’t just a GPT with a fancy name)
&lt;/h2&gt;

&lt;p&gt;Before you write a single line of code, you have to understand the difference between a &lt;strong&gt;Large Language Model&lt;/strong&gt; (LLM) and an Agent. It is easy to confuse the two because agents rely on LLMs to function. However, the distinction is critical. An LLM is essentially a brain in a jar. It is brilliant, well-read, and capable of complex reasoning, but it cannot move or interact with the world on its own. It sits there waiting for you to ask it a question so it can provide a text-based answer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“An Agent is that same brain, but with hands and a memory.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In technical terms, an agent is a system that uses an LLM as its reasoning engine to achieve a specific goal. Instead of just answering a prompt, the agent looks at the goal, breaks it down into smaller steps, and chooses the right tools to execute those steps. If a chatbot is a librarian who tells you where the books are, an agent is the researcher who goes to the shelves, reads the books, and writes the report for you.&lt;/p&gt;

&lt;p&gt;The shift toward this “agentic” workflow is driving massive efficiency gains. Organizations that have moved from simple chatbots to autonomous agents are seeing a &lt;a href="https://www.salesmate.io/blog/ai-agents-adoption-statistics/" rel="noopener noreferrer"&gt;&lt;strong&gt;20%&lt;/strong&gt; to &lt;strong&gt;30%&lt;/strong&gt; reduction in operational friction and support costs&lt;/a&gt;. This happens because the agent handles the “doing” part of the job, which previously required a human to copy and paste data between different tabs and applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7qdfho2gtx9n056mfvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7qdfho2gtx9n056mfvc.png" alt="Structural components of an AI Agent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you build an agent, you are essentially creating a loop. The system “&lt;strong&gt;thinks&lt;/strong&gt;” about the next step, “&lt;strong&gt;acts&lt;/strong&gt;” by using a tool, “&lt;strong&gt;observes&lt;/strong&gt;” the result of that action, and then starts the cycle again until the task is finished. This autonomy is what makes it an agent. If the system stops and waits for you to tell it what to do at every single step, it is just a complicated script, not an agent.&lt;/p&gt;
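&lt;p&gt;That loop fits in a few lines of Python. The sketch below is illustrative only: &lt;code&gt;think&lt;/code&gt; is a stub standing in for a real LLM call and &lt;code&gt;act&lt;/code&gt; for a real tool, but the cycle it runs is the same one every agent runs.&lt;/p&gt;

```python
# Illustrative think-act-observe loop. think() and act() are stubs
# standing in for a real LLM call and a real tool.
def think(goal, observations):
    # Decide the next action. Once we have an observation, we are done.
    if observations:
        return ("finish", observations[-1])
    return ("search", goal)

def act(tool, argument):
    # Stub tool: a real agent would call a search API here
    return f"search result for '{argument}'"

def run_agent(goal):
    observations = []
    while True:
        action, argument = think(goal, observations)   # Think
        if action == "finish":
            return argument                            # Task complete
        observations.append(act(action, argument))     # Act, then observe

print(run_agent("current Bitcoin price"))
```

&lt;p&gt;Notice that the loop, not the model, is what makes this an agent: the same skeleton works no matter which LLM or tool you plug in.&lt;/p&gt;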

&lt;h2&gt;
  
  
  2. The Starter Pack: Three Foundations You Actually Need
&lt;/h2&gt;

&lt;p&gt;You do not need a degree in advanced mathematics or a background in computer science to build an AI agent. Many people get intimidated by the “&lt;strong&gt;AI&lt;/strong&gt;” label and assume they need to understand neural network weights or backpropagation. In reality, building agents is much more about logic and orchestration than it is about calculus.&lt;/p&gt;

&lt;p&gt;If you are starting from zero, you only need to focus on three specific areas to get your first agent running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python: The Language of Automation
&lt;/h3&gt;

&lt;p&gt;Python is the undisputed king of the AI world. As of 2026, it remains the most popular language for developers, &lt;a href="https://www.tiobe.com/tiobe-index/" rel="noopener noreferrer"&gt;holding a 29% share of the global programming market&lt;/a&gt;. You do not need to be an expert, but you must be comfortable with the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functions and Loops&lt;/strong&gt;: This is how you tell the agent to repeat a task until it succeeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JSON Handling&lt;/strong&gt;: This is the most important part. AI models communicate in a format called JSON. You need to know how to “&lt;strong&gt;parse&lt;/strong&gt;” this data so your agent can read information from a weather API or a database and then use it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
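&lt;p&gt;Here is the JSON handling in practice, using a mock weather response (the field names are made up for illustration):&lt;/p&gt;

```python
import json

# A mock API response, shaped like what a weather service might return
raw_response = '{"city": "Lagos", "temperature_c": 31, "condition": "sunny"}'

# Parsing turns the JSON string into a Python dictionary the agent can use
data = json.loads(raw_response)
print(f"It is {data['temperature_c']}C and {data['condition']} in {data['city']}.")
```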

&lt;h3&gt;
  
  
  The API Mental Model
&lt;/h3&gt;

&lt;p&gt;An agent is only as good as the tools it can access. You need to understand how to use an API (Application Programming Interface), which is essentially a digital handshake between two programs. When you want your agent to send an email or check a stock price, it sends a request to an API and gets a response back.&lt;/p&gt;

&lt;p&gt;Most beginners start with the OpenAI or Anthropic APIs. You will need to learn how to manage “&lt;strong&gt;API Keys&lt;/strong&gt;” safely. Think of an API key like a credit card: if you leak it, anyone can use your account to run expensive AI models.&lt;/p&gt;
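&lt;p&gt;In practice, “safely” means keeping the key out of your source code entirely. A minimal pattern (&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; is the conventional environment variable name; adjust for your provider):&lt;/p&gt;

```python
import os

# Read the key from an environment variable instead of hardcoding it
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("Warning: OPENAI_API_KEY is not set.")

# The key travels in a request header, never in your committed code
headers = {"Authorization": f"Bearer {api_key}"}
```

&lt;p&gt;If the key never appears in your code, it can never be leaked by your code.&lt;/p&gt;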

&lt;h3&gt;
  
  
  The Reasoning Loop: Think, Act, Observe
&lt;/h3&gt;

&lt;p&gt;This is the “&lt;strong&gt;logic&lt;/strong&gt;” foundation. Every agent follows a cycle often called the “&lt;strong&gt;ReAct&lt;/strong&gt;” pattern. It is the same process a human uses to solve a problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Think&lt;/strong&gt;: The agent looks at the user’s request and decides what it needs to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Act&lt;/strong&gt;: The agent uses a tool, like a calculator or a web search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observe&lt;/strong&gt;: The agent looks at the result of that action and asks, “Did this solve the problem?”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is no, the loop starts over. Understanding this flow is more important than memorizing code because it helps you debug why an agent is getting “stuck” in a loop or failing to complete a task.&lt;/p&gt;
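&lt;p&gt;One practical habit that falls out of this flow: always cap the number of cycles, so a confused agent fails loudly instead of spinning forever. A sketch of that guard (&lt;code&gt;try_step&lt;/code&gt; is a stub that “succeeds” on the third attempt):&lt;/p&gt;

```python
# Cap the think-act-observe cycles so a stuck agent cannot loop forever
MAX_STEPS = 5

def try_step(goal, step):
    # Stub for one think-act-observe cycle
    return f"solved: {goal}" if step == 2 else None

def guarded_loop(goal):
    for step in range(MAX_STEPS):
        result = try_step(goal, step)
        if result is not None:   # "Did this solve the problem?"
            return result
    return "Gave up: hit the step limit"

print(guarded_loop("book the flight"))
```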

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk0p1bezgbob0yyru2cl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk0p1bezgbob0yyru2cl.png" alt="A flowchart showing a user request entering a Python Script" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By focusing only on these three foundations, you avoid the “tutorial hell” of learning things you will never use. Once you can write a basic Python script that talks to an API, you are already ahead of most people just playing with chat interfaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Phase One: The Single-Tool Agent
&lt;/h2&gt;

&lt;p&gt;The most common mistake beginners make is trying to build a complex, multi-agent system on day one. Instead, you should start with a “&lt;strong&gt;Single-Tool Agent&lt;/strong&gt;.” This is an AI that has one job and one specific tool to help it do that job. &lt;a href="https://www.merge.dev/blog/ai-agent-statistics" rel="noopener noreferrer"&gt;As of 2026, 81% of companies are taking this exact approach, using agents primarily for targeted lookups in third-party software before moving on to more complex workflows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The magic happens when you move from a simple prompt to a “&lt;strong&gt;ReAct&lt;/strong&gt;” (Reason + Act) loop. In a standard chatbot, the AI just guesses the answer. In a ReAct loop, the AI evaluates if it has enough information. If it doesn’t, it pauses its text generation, calls a tool, and then uses the new data to finish the task. This pattern is highly effective. &lt;a href="https://www.merge.dev/blog/ai-agent-statistics" rel="noopener noreferrer"&gt;Data from early 2026 shows that single-tool agents handling tasks like travel planning or vendor comparisons achieve a completion success rate of 87%&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To build this, you don’t need a massive framework. You can see the logic in a few lines of Python. Below is a simplified example of how an agent “&lt;strong&gt;decides&lt;/strong&gt;” to use a search tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A simple representation of an agent reasoning loop
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simple_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# The 'Reasoning' step: The AI thinks about what it needs
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking: I need to find the current price of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'Action' step: The AI chooses to call a specific tool
&lt;/span&gt;    &lt;span class="c1"&gt;# In a real setup, this would be an API call to Google or a Database
&lt;/span&gt;    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_search_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'Observation' step: The AI looks at the tool output
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Observing: The tool returned &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Final step: The AI provides the answer based on the new data
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The current price of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# This function simulates an external tool (like a web search)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_search_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Simple mock data to represent a real API response
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$65,000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# Running our beginner agent
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;simple_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bitcoin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;In this script, the &lt;code&gt;simple_agent&lt;/code&gt; function mimics the brain of the agent. Instead of just returning a hardcoded string, it follows a sequence. First, it identifies a gap in its knowledge. Second, it calls the &lt;code&gt;call_search_tool&lt;/code&gt; function, which represents an external API. Finally, it takes that "&lt;strong&gt;observation&lt;/strong&gt;" and incorporates it into the final response. This shift from "&lt;strong&gt;generating text&lt;/strong&gt;" to "&lt;strong&gt;managing a process&lt;/strong&gt;" is the core skill you are trying to learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwmwxa5g39wnqdvko1ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwmwxa5g39wnqdvko1ie.png" alt="A vertical flowchart showing three main boxes" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mastering this single-tool loop is your first milestone. Once you can consistently get an AI to use one tool correctly without getting confused, you have unlocked the foundation of autonomous AI. The goal here isn’t to build something fancy but to ensure your “&lt;strong&gt;handshake&lt;/strong&gt;” between the AI and the external world is solid.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Phase Two: Give Your Agent a Memory
&lt;/h2&gt;

&lt;p&gt;A large context window is often marketed as the solution to AI memory, but in a production environment, it is a trap. Just because a model can “&lt;strong&gt;read&lt;/strong&gt;” a million tokens at once does not mean it remembers who you are when you come back two weeks later. A context window is like a whiteboard. It is great for the task happening right now, but once the session ends, the board is wiped clean. To build a serious agent, you need a filing cabinet.&lt;/p&gt;

&lt;p&gt;This filing cabinet is what we call persistent memory. As of 2026, &lt;a href="https://www.landbase.com/blog/agentic-ai-statistics" rel="noopener noreferrer"&gt;research shows that companies deploying agents with robust memory layers see a 30% reduction in operational costs&lt;/a&gt; because the AI doesn’t have to re-learn user preferences or project details every single session.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Two Layers of Memory
&lt;/h3&gt;

&lt;p&gt;If you want your agent to feel intelligent, you have to manage two distinct types of storage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Short-term Memory&lt;/strong&gt;: This is the immediate conversation history. It allows the agent to understand that when you say “&lt;em&gt;Do it again&lt;/em&gt;,” you are referring to the last task it performed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-term Memory&lt;/strong&gt;: This is where you store facts that should last forever. If a user says they prefer Rust over Python, that should be moved from the whiteboard to the filing cabinet.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
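&lt;p&gt;Short-term memory is the easier of the two: keep a running list of turns and trim it so the prompt never outgrows the context window. A minimal sketch (the three-turn limit is arbitrary):&lt;/p&gt;

```python
# Minimal short-term memory: keep only the most recent turns
MAX_TURNS = 3

def remember(history, role, text):
    history.append({"role": role, "content": text})
    # Wipe the oldest entries off the "whiteboard" once it is full
    return history[-MAX_TURNS:]

history = []
for i in range(5):
    history = remember(history, "user", f"message {i}")

print([turn["content"] for turn in history])
```

&lt;p&gt;After five messages, only the last three survive; real frameworks do the same thing with token counts instead of turn counts.&lt;/p&gt;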

&lt;p&gt;The most common way to handle long-term memory is through Retrieval-Augmented Generation (RAG). This technique currently accounts for &lt;strong&gt;&lt;a href="https://www.marketsandmarkets.com/Market-Reports/vector-database-market-112683895.html" rel="noopener noreferrer"&gt;51% of all enterprise AI implementations&lt;/a&gt;&lt;/strong&gt;. RAG allows the agent to search through a massive database of past interactions or uploaded documents and pull only the relevant “&lt;strong&gt;memories&lt;/strong&gt;” into the current conversation.&lt;/p&gt;
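&lt;p&gt;You can get a feel for the retrieval half of RAG without any infrastructure. The toy version below ranks stored memories by keyword overlap; real systems swap this scoring for embedding similarity in a vector database, but the shape of the operation is the same.&lt;/p&gt;

```python
# Toy stand-in for RAG retrieval: rank memories by keyword overlap
memories = [
    "User prefers Rust over Python",
    "Project deadline is next Friday",
    "User dislikes verbose explanations",
]

def retrieve(query, store, top_k=1):
    query_words = set(query.lower().split())
    def score(memory):
        # Count how many query words appear in this memory
        return len(query_words.intersection(memory.lower().split()))
    # Pull only the most relevant memories into the conversation
    return sorted(store, key=score, reverse=True)[:top_k]

print(retrieve("does the user prefer rust or python", memories))
```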

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5xlkbetx8xzbajof9if.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5xlkbetx8xzbajof9if.png" alt="A central AI Agent icon is positioned between two storage types" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can start practicing this by creating a simple “&lt;strong&gt;profile&lt;/strong&gt;” system. Instead of just sending a prompt, you send the prompt along with a small snippet of data about the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A simple way to simulate long-term memory
&lt;/span&gt;&lt;span class="n"&gt;user_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferred_language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rust&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience_level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beginner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;personalized_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve the 'memory' for this specific user
&lt;/span&gt;    &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preferred_language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The agent uses the memory to change its behavior
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory Check: User prefers &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Logic to execute the task based on memory
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am writing your &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; code in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# The agent now 'remembers' the user preference across different calls
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;personalized_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web scraper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;:&lt;br&gt;
In this example, the &lt;code&gt;user_memory&lt;/code&gt; dictionary acts as a mock database. When the &lt;code&gt;personalized_agent&lt;/code&gt; function is called, the first thing it does is a "&lt;strong&gt;Memory Check&lt;/strong&gt;." It looks up the user ID to see if there are any saved preferences. Because it finds that the user prefers Rust, it automatically adjusts its output without the user needing to specify the language again. In a real application, you would replace this dictionary with a vector database like &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; or &lt;a href="https://weaviate.io/" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;, but the logic remains exactly the same.&lt;/p&gt;

&lt;p&gt;By implementing this, you move away from building a generic tool and start building a personalized assistant. &lt;a href="https://www.landbase.com/blog/agentic-ai-statistics" rel="noopener noreferrer"&gt;This is the difference between a toy and a system that provides a 171% average ROI for businesses&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. Phase Three: The Multi-Agent Leap
&lt;/h2&gt;

&lt;p&gt;In the real world, a single person rarely handles an entire project from strategy to execution. You have managers, researchers, and writers. AI development is moving in the same direction. We have realized that asking one large model to do everything often leads to “&lt;strong&gt;cognitive overload&lt;/strong&gt;,” where the AI starts losing track of instructions or hallucinating details. &lt;em&gt;Specialization is the fix&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://bayelsawatch.com/agentic-ai-market-to-exceed-usd-196-6-billion/" rel="noopener noreferrer"&gt;By the start of 2026, multi-agent orchestration captured a 66% share of the global agentic AI market&lt;/a&gt;. This shift happened because specialized teams of agents are more reliable than one “&lt;strong&gt;do-it-all&lt;/strong&gt;” bot. When you break a task into parts, you can use smaller, faster models for simple steps and save the heavy, expensive models for high-level reasoning.&lt;/p&gt;
&lt;h3&gt;
  
  
  Choosing Your Framework: CrewAI vs. LangGraph
&lt;/h3&gt;

&lt;p&gt;As you move into this phase, you will likely choose between two dominant tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;: This is the best choice if you want to get a project running quickly. It uses a “Role-Based” approach. You define an agent, give it a backstory, and assign it a task. It is intuitive because it mimics a human office. &lt;a href="https://www.nxcode.io/resources/news/crewai-vs-langchain-ai-agent-framework-comparison-2026" rel="noopener noreferrer"&gt;Community benchmarks show that developers can move from an idea to a working multi-agent prototype 40% faster using CrewAI compared to other frameworks&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;: If you need absolute control and “durable execution,” this is the industry standard. It models your agents as nodes in a graph. If an agent crashes halfway through a three-hour task, LangGraph can resume exactly where it left off. &lt;a href="https://dev.to/pooyagolchian/ai-agents-in-2026-building-autonomous-workflows-with-local-llms-and-open-source-frameworks-36e4"&gt;In early 2026, LangGraph surpassed CrewAI in total GitHub stars, largely driven by enterprise teams who need this level of “checkpointing” for production apps&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hnhzn24f809hk4j66ef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hnhzn24f809hk4j66ef.png" alt="Multi-Agent Collaboration Diagram" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Building a Collaborative Team
&lt;/h3&gt;

&lt;p&gt;The core idea here is delegation. You write a script where one agent is responsible for the “&lt;strong&gt;What&lt;/strong&gt;” and another is responsible for the “&lt;strong&gt;How&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A conceptual look at multi-agent delegation
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulating the agent performing its specific role
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_multi_agent_flow&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: The Researcher finds the data
&lt;/span&gt;    &lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find latest AI trends&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;found_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: The Manager reviews and delegates to the Writer
&lt;/span&gt;    &lt;span class="n"&gt;manager_decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This looks good. Summarize this for a blog.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: The Writer takes the research and the manager's note
&lt;/span&gt;    &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;final_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;found_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;manager_decision&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_output&lt;/span&gt;
&lt;span class="c1"&gt;# Running the collaborative workflow
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_multi_agent_flow&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;This code demonstrates a “&lt;em&gt;Sequential&lt;/em&gt;” workflow. The &lt;code&gt;Agent&lt;/code&gt; class is a template that can be customized with different roles. In the &lt;code&gt;run_multi_agent_flow&lt;/code&gt; function, we create a researcher and a writer. The output of the researcher is passed directly to the writer. This "&lt;strong&gt;handoff&lt;/strong&gt;" is the foundation of multi-agent systems. In a production setting, you would use a framework like CrewAI to handle these handoffs automatically, allowing agents to "&lt;strong&gt;talk&lt;/strong&gt;" to each other until the manager agent is satisfied with the result.&lt;/p&gt;
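&lt;p&gt;To see what that back-and-forth looks like in plain Python, here is a minimal sketch of such a review loop. Everything in it (&lt;code&gt;manager_review&lt;/code&gt;, &lt;code&gt;writer_revise&lt;/code&gt;, the approval check) is hypothetical mock logic for illustration, not part of CrewAI or any other framework:&lt;/p&gt;

```python
# Hypothetical sketch of the review loop a framework automates for you:
# the writer keeps revising until the manager approves or a retry limit is hit.

def manager_review(draft: str) -> bool:
    # Mock approval check: accept once the draft mentions a summary
    return "summary" in draft.lower()

def writer_revise(draft: str, feedback: str) -> str:
    # Mock revision: append the requested change
    return f"{draft} (revised: {feedback})"

def review_loop(draft: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        if manager_review(draft):
            return draft
        draft = writer_revise(draft, "add a summary section")
    return draft

print(review_loop("Initial research notes"))
```

&lt;p&gt;A real framework replaces the mock checks with LLM calls, but the control flow is the same: loop, evaluate, revise, stop.&lt;/p&gt;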

&lt;h2&gt;
  
  
  6. Building the “Proof of Concept” Project
&lt;/h2&gt;

&lt;p&gt;Theory is a comfortable place to stay, but it won’t teach you how to handle an agent that starts hallucinating or getting stuck in a logic loop. The fastest way to move from a beginner to someone who actually understands this tech is to build a project that solves a recurring problem. &lt;a href="https://masterofcode.com/blog/ai-agent-statistics" rel="noopener noreferrer"&gt;Research shows that developers using AI agents to assist in their workflows can complete tasks 126% faster than those working manually&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For your first build, I recommend an &lt;strong&gt;Automated Newsletter Researcher&lt;/strong&gt;. This is a project that actually saves you time. Instead of you spending thirty minutes every morning scouring the web for news, your agent does it for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tools of the Trade
&lt;/h3&gt;

&lt;p&gt;You don’t need a heavy enterprise stack for this. Stick to the “&lt;strong&gt;minimalist&lt;/strong&gt;” path to keep your code clean and easy to debug.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: The backbone of your script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt;: A lightweight toolkit focused on simple agent-to-agent handoffs and tool use, without the bloat of larger frameworks. It has seen wide production adoption, largely because of its stability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tavily&lt;/strong&gt;: A search engine built specifically for AI agents that returns clean, LLM-ready content instead of messy HTML.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh7e8f8smsc2ci9rh196.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh7e8f8smsc2ci9rh196.png" alt="Automated newsletter researcher workflow" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Success Metric
&lt;/h3&gt;

&lt;p&gt;Your project is “&lt;strong&gt;finished&lt;/strong&gt;” when your agent can autonomously find three relevant articles on a specific topic, summarize them into three bullet points each, and save that summary to a local file without you touching your keyboard. This isn’t just a coding exercise. It is a functional piece of automation that delivers a measurable result. Early data from 2026 suggests that even simple personal automation projects like this can reclaim up to &lt;a href="https://mitsloan.mit.edu/ideas-made-to-matter/agentic-ai-explained" rel="noopener noreferrer"&gt;20% of a professional’s daily schedule&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Implementation
&lt;/h3&gt;

&lt;p&gt;Here is a structured look at how you would set up the core logic for this researcher.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="c1"&gt;# Assuming the use of a simplified Agents SDK pattern
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Define the search tool
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_the_web&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# This would typically call the Tavily API
&lt;/span&gt;    &lt;span class="c1"&gt;# For this example, we return a mock list of data
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI Agents in 2026&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agents are now 40% of apps.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python vs Rust&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rust is gaining ground in AI.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI SDK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimalism is the new standard.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: Create the Researcher Agent
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;News Scout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the top 3 news stories about AI Agents and summarize them.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_the_web&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Step 3: Run the workflow
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# The runner manages the 'Think-Act-Observe' loop automatically
&lt;/span&gt;    &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting the research process...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the biggest trends in AI Agents this week?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Save the output to a file
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;newsletter_draft.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Newsletter draft saved successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Explanation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;In this setup, we use a &lt;code&gt;Runner&lt;/code&gt; object to handle the heavy lifting of the reasoning loop. The &lt;code&gt;researcher&lt;/code&gt; agent is given a specific identity and a single tool: the &lt;code&gt;search_the_web&lt;/code&gt; function. When the &lt;code&gt;runner.run()&lt;/code&gt; command is executed, the AI realizes it doesn't know the latest news. It triggers the search tool, observes the results we provided in the mock list, and then uses its internal logic to summarize those results. Finally, the script takes that summary and writes it to a physical file on your hard drive.&lt;/p&gt;

&lt;p&gt;This project proves that you can bridge the gap between “&lt;strong&gt;chatting&lt;/strong&gt;” and “&lt;strong&gt;doing&lt;/strong&gt;.” Once you see that text file appear on your desktop, you have officially moved past the beginner phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI agent market is moving faster than the current talent pool can keep up. According to research from &lt;a href="https://www.researchandmarkets.com/reports/6103459/ai-agents-market-report" rel="noopener noreferrer"&gt;Research and Markets&lt;/a&gt;, the global AI agents market is expected to reach &lt;strong&gt;$12.06 billion&lt;/strong&gt; by the end of 2026, representing a growth rate of 45.5% from the previous year. This growth is not just a theoretical spike. It is a direct result of companies moving away from simple chatbots toward autonomous systems that can actually finish complex work.&lt;/p&gt;

&lt;p&gt;While roughly &lt;strong&gt;62%&lt;/strong&gt; of organizations are currently experimenting with these systems, &lt;a href="https://azumo.com/artificial-intelligence/ai-insights/ai-agent-statistics" rel="noopener noreferrer"&gt;only about 6% have managed to become “high performers”&lt;/a&gt; who are effectively scaling agents across their business. This creates a massive opening for anyone who can move past the beginner stage. Most people are still stuck in the cycle of just asking an AI for summaries. By building the single-tool and multi-agent projects we covered in this guide, you are positioning yourself in that top &lt;strong&gt;6%&lt;/strong&gt; of the workforce.&lt;/p&gt;

&lt;p&gt;The reality of 2026 is that proficiency in building autonomous workflows is becoming a baseline requirement. &lt;a href="https://thoughtminds.ai/blog/10-gartner-prediction-for-enterprise-ai-adoption-trends" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; suggests that by 2027, &lt;strong&gt;75% of all hiring processes will include some form of certification or testing for workplace AI proficiency&lt;/strong&gt;. The goal for you right now should not be to just use an agent. Your goal should be to understand how to build, maintain, and orchestrate them.&lt;/p&gt;

&lt;p&gt;Do not wait for a perfect certification or a formal university course to tell you that you are ready. The hands-on experience of building a proof of concept is worth more than any theory you could read. Tools like the &lt;a href="https://developers.openai.com/api/docs/guides/agents-sdk" rel="noopener noreferrer"&gt;OpenAI SDK&lt;/a&gt; and &lt;a href="https://www.tavily.com/" rel="noopener noreferrer"&gt;Tavily&lt;/a&gt; have made the entry barrier lower than ever. Start with one tool, one logic loop, and one goal. The market is waiting for the people who actually know how to build the “&lt;strong&gt;hands&lt;/strong&gt;” for the brain.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>python</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Functional Programming in Python: Leveraging Lambda Functions and Higher-Order Functions</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Fri, 18 Apr 2025 13:14:11 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/functional-programming-in-python-leveraging-lambda-functions-and-higher-order-functions-797</link>
      <guid>https://dev.to/shittu_olumide_/functional-programming-in-python-leveraging-lambda-functions-and-higher-order-functions-797</guid>
<description>&lt;p&gt;The lines of code in a computer program are instructions that tell the computer how to solve a problem or accomplish a particular task. Without those instructions, the computer has no idea how to proceed.&lt;/p&gt;

&lt;p&gt;It’s a relief to know we can solve problems with a computer by writing lines of code, sometimes hundreds of them. Now, imagine that you have been saddled with the same repetitive programming task several times a day.&lt;/p&gt;

&lt;p&gt;Surely you don’t intend to rewrite all those lines from scratch every single time. Be a smart programmer: modularize your code and use functions!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Functional programming is a programming paradigm that involves organizing your code into functions. When programs built with this paradigm run, they are treated as a chain or connection of several other functions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This programming paradigm treats functions as first-class citizens — this implies that functions can be passed as arguments, returned from other functions, and, lastly, assigned to a variable.&lt;/p&gt;
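&lt;p&gt;All three of these properties can be demonstrated in a few lines (the names here are illustrative):&lt;/p&gt;

```python
def greet(name):
    return f"Hi, {name}"

# 1. Assigned to a variable
alias = greet

# 2. Passed as an argument
def call_twice(fn, arg):
    return fn(arg), fn(arg)

# 3. Returned from another function
def make_greeter():
    return greet

print(alias("Ada"))               # Hi, Ada
print(call_twice(greet, "Alan"))  # ('Hi, Alan', 'Hi, Alan')
print(make_greeter()("Grace"))    # Hi, Grace
```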

&lt;p&gt;In this article, we’ll explore the principles of functional programming in Python, focusing on lambda functions and higher-order functions to simplify your code and boost productivity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisite
&lt;/h3&gt;

&lt;p&gt;This article is suitable for beginners in Python programming who have basic knowledge and wish to learn in-depth about functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Functions?
&lt;/h2&gt;

&lt;p&gt;Functions in programming are blocks of code that encapsulate, or hide, the inner implementation of the solution to a specific task. There is no one-size-fits-all routine: each function is designed to handle one specific task.&lt;/p&gt;

&lt;p&gt;A function takes in an argument, processes it, and may return a result or perform an action without returning any value. With that being said, let’s move on to the different types of functions available in Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Functions in Python
&lt;/h2&gt;

&lt;p&gt;There are several types of functions in Python, but this article will focus on &lt;strong&gt;Lambda&lt;/strong&gt;, &lt;strong&gt;user-defined&lt;/strong&gt;, and &lt;strong&gt;higher-order&lt;/strong&gt; functions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User-defined functions&lt;/strong&gt;: These are functions created by the programmer to perform a specific operation, following the standard &lt;code&gt;def&lt;/code&gt;-based structure.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!, welcome.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in functions&lt;/strong&gt;: These are functions that ship with Python. You don’t have to define them because the implementation is already provided; you simply call them in your code. Examples: &lt;code&gt;print()&lt;/code&gt;, &lt;code&gt;len()&lt;/code&gt;, &lt;code&gt;sum()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Special (Magic/Dunder) Functions&lt;/strong&gt;: These are special functions that are built into Python; they have double underscores (&lt;strong&gt;__&lt;/strong&gt;) in their names, and they are used in Python to perform special operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda functions&lt;/strong&gt;: They are also known as anonymous functions. They are single-expression functions created using the lambda keyword. They are used for simple operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Higher-order functions&lt;/strong&gt;: These are functions that take other functions as arguments or return other functions. Examples are: &lt;code&gt;filter()&lt;/code&gt;, &lt;code&gt;reduce()&lt;/code&gt;, and &lt;code&gt;map()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Anatomy of a typical Python function
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Module
Docstring
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;function_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Function Docstring
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you passed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;parameter&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, this function is used for demonstration purposes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above shows the various parts of a typical Python function; let’s go into every detail:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Module Docstring&lt;/strong&gt;: According to the &lt;a href="https://peps.python.org/pep-0257/" rel="noopener noreferrer"&gt;PEP (Python Enhancement Proposals) 257&lt;/a&gt;, a module docstring should be placed at the top of your Python file where your function is written. The module docstring should ideally contain information about the file’s contents. Always ensure that it is contained inside the multi-line quotes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;def Keyword&lt;/strong&gt;: This keyword marks the beginning of the function being defined. It always precedes the function name.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;function_name&lt;/strong&gt;: Python functions are named based on the task they perform. Appropriately naming your functions helps keep your code readable so that other programmers can easily understand it. The &lt;code&gt;function_name&lt;/code&gt; comes after the &lt;code&gt;def&lt;/code&gt; keyword and precedes the parentheses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;parameter&lt;/strong&gt;: The parameter is a placeholder for the actual value, or argument, that will be passed to the function when it is called. Not all functions have parameters; sometimes there are just empty parentheses after the &lt;code&gt;function_name&lt;/code&gt;. Such functions don’t expect any values from the caller; they simply perform the task they were defined for whenever they are called.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Function Docstring&lt;/strong&gt;: Slightly different from the Module Docstring, the Function Docstring contains precise information about the function and what it does, the type of arguments it receives, and the value it returns. Refer to &lt;a href="https://peps.python.org/pep-0257/" rel="noopener noreferrer"&gt;PEP 257&lt;/a&gt; for more details on how to use docstrings in Python.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;return statement&lt;/strong&gt;: The return statement sends the resulting value or values from the function back to the caller, and it marks the end of the function’s execution. In the snippet above, the return statement sends back the quoted string with the parameter’s value substituted in. A function without a return statement still works, but it implicitly returns &lt;code&gt;None&lt;/code&gt;, so there is no meaningful value to collect and store in a variable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
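&lt;p&gt;Putting the anatomy together, here is how the caller invokes the function and collects the returned value:&lt;/p&gt;

```python
def function_name(parameter):
    """Demonstration function matching the anatomy above."""
    return f"you passed {parameter}, this function is used for demonstration purposes."

# The caller passes an argument and stores the returned string
message = function_name("hello")
print(message)  # you passed hello, this function is used for demonstration purposes.
```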

&lt;h2&gt;
  
  
  Let’s look deeper into Lambda and Higher-Order functions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lambda functions:
&lt;/h3&gt;

&lt;p&gt;Lambda functions, as previously discussed, are anonymous because they are not defined by a name like typical Python functions. See the code snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variable &lt;code&gt;answer&lt;/code&gt; is merely a reference to the function, not its name; a lambda function can be written without any variable at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# outputs 5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the snippet above shows, the lambda works just fine even without a reference variable.&lt;/p&gt;
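&lt;p&gt;Beyond tiny arithmetic expressions, lambdas are most useful as throwaway arguments to other functions, for example as a sort key:&lt;/p&gt;

```python
people = [("Ada", 36), ("Grace", 45), ("Alan", 41)]

# Sort by the second element (age) of each tuple
by_age = sorted(people, key=lambda person: person[1])
print(by_age)  # [('Ada', 36), ('Alan', 41), ('Grace', 45)]
```

&lt;p&gt;Because the key function is used only once, there is no need to give it a name.&lt;/p&gt;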

&lt;h3&gt;
  
  
  Higher-order functions:
&lt;/h3&gt;

&lt;p&gt;Earlier, we discussed that Higher-Order Functions can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take other function(s) as arguments&lt;/li&gt;
&lt;li&gt;Return other functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will see this in action in a bit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Function that takes another function as an argument
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apply_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Applies the passed function to the given value.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example function to pass
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;square&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="c1"&gt;# Using the apply_function with the square function
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Output: 25
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the code snippet above, &lt;code&gt;apply_function&lt;/code&gt; is the Higher-Order Function. It takes the square function as one of its arguments, uses it to perform the squaring operation, and finally returns the result.&lt;/p&gt;

&lt;p&gt;One of the benefits of using Higher-Order Functions is that they promote code modularity; different functions handle the separation of logic. This is how to pass a function as an argument to a Higher-Order function.&lt;/p&gt;

&lt;p&gt;Let’s now look at how to return functions from a Higher-Order Function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Function that returns another function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_multiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;multiplier_factor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; Returns a function that multiplies its input by a given multiplier factor.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
 &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply_with_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;multiplier_factor&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;multiply_with_input&lt;/span&gt;

&lt;span class="c1"&gt;# Using the function
&lt;/span&gt;&lt;span class="n"&gt;double_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_multiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Returns a function that multiplies by 2
&lt;/span&gt;&lt;span class="n"&gt;triple_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_multiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Returns a function that multiplies by 3
&lt;/span&gt;
&lt;span class="c1"&gt;# Calling the returned functions
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;double_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Output: 10
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;triple_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Output: 15
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, &lt;code&gt;create_multiplier&lt;/code&gt; is the Higher-Order Function, and &lt;code&gt;multiply_with_input&lt;/code&gt; is the function defined inside it. Notice that both functions take arguments. So how do we pass arguments to the outer and inner functions in practice? Let's see.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using the function
&lt;/span&gt;&lt;span class="n"&gt;double_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_multiplier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Returns a function that multiplies by 2
&lt;/span&gt;
&lt;span class="c1"&gt;# Calling the returned functions
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;double_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Output: 10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the example above, you can see that &lt;code&gt;create_multiplier&lt;/code&gt; is called first with an argument. It returns the &lt;code&gt;multiply_with_input&lt;/code&gt; function, which is stored in the &lt;code&gt;double_value&lt;/code&gt; variable; &lt;code&gt;double_value&lt;/code&gt; can then be called with its own argument.&lt;/p&gt;

&lt;p&gt;This is how a function can be returned from a Higher-Order Function.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you have learned about functional programming and how to leverage it for cleaner, more modular code. You also learned about lambda and Higher-Order Functions: the lambda keyword creates a function from a single expression without giving it a name, making it suitable for simple operations, even inside another function. Higher-Order Functions are a slightly more advanced kind of Python function; they can take one or more functions as arguments and can also return a function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: When passing a function as an argument to a Higher-Order Function, do not add parentheses to it. Adding parentheses calls the function immediately, so its return value is passed instead of the function itself, which usually results in an error. Passing the function without parentheses, which is the right approach, passes a reference to the function.&lt;/p&gt;
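The note above can be demonstrated with a short snippet (`apply_twice` and `shout` are illustrative names, not from the article's earlier examples):

```python
def shout(text):
    """Return an upper-cased, emphasized version of text."""
    return text.upper() + "!"

def apply_twice(func, value):
    # func is a reference to a function; it is only called here, inside the HOF
    return func(func(value))

# Passing the reference (no parentheses) is correct:
print(apply_twice(shout, "hi"))  # Output: HI!!

# apply_twice(shout("hi"), "hi") would raise a TypeError,
# because shout("hi") is a string, not a callable.
```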

</description>
      <category>programming</category>
      <category>python</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Overview of Statsmodels</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Thu, 13 Mar 2025 16:12:58 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/overview-of-statsmodels-2jba</link>
      <guid>https://dev.to/shittu_olumide_/overview-of-statsmodels-2jba</guid>
      <description>&lt;p&gt;Having data, no matter how quality it may be, without being able or knowing how to properly analyze and draw crucial statistical insights from it I am talking about key pointers and indicators that can help inform your next big business decision could be very frustrating, this is like having a gold mine but not having the capacity to tap into it, even though it can change your life tremendously, for the better.&lt;/p&gt;

&lt;p&gt;Statistical modeling is a crucial framework used by statisticians, researchers, and data professionals to draw valuable insights from data. It largely involves the use of mathematical and statistical techniques and methodologies to analyze, represent, predict, and interpret data. It gives sufficient information that aids proper understanding of given data, leading to informed decision-making. It is to this end that different tools were developed for conducting statistical modeling. To name only a few tools used for this purpose, we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Python libraries (statsmodels, scikit-learn, pandas).&lt;/li&gt;
&lt;li&gt;  The R programming language (designed specifically for statistical computing).&lt;/li&gt;
&lt;li&gt;  SAS/SPSS (commercial software for advanced statistical analysis).&lt;/li&gt;
&lt;li&gt;  Excel (used for simple statistical modeling and visualization).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will be focusing on &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;; we will learn about its various tools and how it is used for statistical modeling, with simple code illustrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Prior knowledge of Python programming and statistical operations on datasets is useful, as it would ease the understanding of this article’s content.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is statsmodels?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; contains methods and classes used for statistical modeling. According to the official &lt;a href="https://www.statsmodels.org/stable/index.html" rel="noopener noreferrer"&gt;website&lt;/a&gt; of &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;, “statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.”&lt;/p&gt;

&lt;p&gt;The ability of &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; to integrate easily with the &lt;strong&gt;&lt;em&gt;pandas&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;numpy&lt;/em&gt;&lt;/strong&gt; libraries, popular tools for data operations such as loading, exploring, and manipulating datasets, makes it a favorite among data professionals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; contains various statistical tools often used by professionals to conduct statistical operations on data. Listed below are some of the most commonly used &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time series analysis&lt;/strong&gt;: When performing time series tasks (analyzing and forecasting time series data) with &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;, the ARIMA (AutoRegressive Integrated Moving Average) and SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous variables) models are commonly used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; contains plotting functions that can be used to create visual representations of statistical data and to visualize model diagnostics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linear Regression Models&lt;/strong&gt;: Ordinary Least Squares (OLS), Generalized Least Squares (GLS), and Weighted Least Squares (WLS) are all available in &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; for fitting linear regression models, each with its own use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Statistical Tests&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; provides numerous frameworks for conducting statistical tests on data, for both hypothesis testing and diagnostics. These include t-tests (comparing the means of two distinct groups), chi-squared tests (testing relationships between categorical variables), and ANOVA (Analysis of Variance; unlike the t-test, which compares two groups, ANOVA compares the means of multiple groups).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Datasets&lt;/strong&gt;: Built-in datasets are available in &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; and can be used to practice and test operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Nonparametric methods&lt;/strong&gt;: These are tools for analyzing data without assuming a specific distribution; the Kernel Density Estimation (KDE) approach is predominantly used here.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
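As a small taste of the testing tools listed above, here is a minimal sketch of an independent two-sample t-test with statsmodels (the sample values are made up for illustration):

```python
from statsmodels.stats.weightstats import ttest_ind

# Two made-up groups of observations
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.6, 12.9, 12.5, 12.8, 13.0, 12.7]

# ttest_ind returns the t statistic, the p-value, and the degrees of freedom
t_stat, p_value, dof = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, df = {dof}")
```

A small p-value here would suggest the two group means differ; a deeper treatment of interpretation is beyond this overview.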

&lt;h2&gt;
  
  
  Applications of statsmodel
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; has numerous real-world applications and is used in various industries to carry out niche tasks.&lt;/p&gt;

&lt;p&gt;Here are some key areas where &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; is utilized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finance&lt;/strong&gt;: &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; is used in finance for forecasting and risk analysis; time series models can be used to monitor stock prices and returns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Econometrics&lt;/strong&gt;: Economic data can be analyzed, and economic theories tested, using &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Marketing&lt;/strong&gt;: E-commerce companies use &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; to analyze customer behavior, predict demand, and strategize for better profitability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Energy&lt;/strong&gt;: In the energy sector, &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; is used to forecast energy consumption and pricing in electricity and gas markets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manufacturing&lt;/strong&gt;: In manufacturing, &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; is used for predictive maintenance, predicting potential failures by analyzing historical data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to install and use statsmodels
&lt;/h2&gt;

&lt;p&gt;So far in this article, we have covered a lot of theory about &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;, laying a good foundation for more advanced knowledge. Let’s now take it a little further, to the coding side.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing statsmodels
&lt;/h3&gt;

&lt;p&gt;To install &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; on your machine, follow the steps below:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;

&lt;p&gt;Create and activate a virtual environment.&lt;/p&gt;

&lt;p&gt;On Linux/Mac OS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;
&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows OS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt; &lt;span class="n"&gt;venv&lt;/span&gt;
&lt;span class="n"&gt;venv&lt;/span&gt;\&lt;span class="n"&gt;scripts&lt;/span&gt;\&lt;span class="n"&gt;activate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;Install the &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; module by entering this command in your terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;statsmodels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;Confirm the installation was successful. Create a new Python file, and write the following code inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statsmodels.api&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this code, the version of &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; you installed should be printed in your terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jtbwfcvxr94q5ssavpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jtbwfcvxr94q5ssavpn.png" alt="Statsmodel" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We just finished installing &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;; now, let us perform a Linear Regression analysis using it. Before that, let us install a dependency that we will use in our code. Run this command in your terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs the &lt;strong&gt;&lt;em&gt;pandas&lt;/em&gt;&lt;/strong&gt; library into the virtual environment we created; it is useful for loading and manipulating data.&lt;/p&gt;

&lt;p&gt;In the Python file you created earlier, write the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statsmodels.api&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Example dataset
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_rdataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mtcars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;

&lt;span class="c1"&gt;# Define variables
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="c1"&gt;# Independent variables
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_constant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Add intercept
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Dependent variable
&lt;/span&gt;
&lt;span class="c1"&gt;# Fit the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OLS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print the summary
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  In this code snippet, the &lt;strong&gt;&lt;em&gt;statsmodels.api&lt;/em&gt;&lt;/strong&gt; module was imported; it is the main module that provides tools for statistical modeling. The &lt;strong&gt;&lt;em&gt;pandas&lt;/em&gt;&lt;/strong&gt; library was also imported to handle the DataFrame returned when loading the &lt;strong&gt;&lt;em&gt;mtcars&lt;/em&gt;&lt;/strong&gt; dataset.&lt;/li&gt;
&lt;li&gt;  The dataset is stored as a &lt;strong&gt;&lt;em&gt;pandas&lt;/em&gt;&lt;/strong&gt; data frame in the &lt;strong&gt;&lt;em&gt;data&lt;/em&gt;&lt;/strong&gt; variable.&lt;/li&gt;
&lt;li&gt;  The dependent and independent variables were extracted and stored in the y and X variables, respectively.&lt;/li&gt;
&lt;li&gt;  The OLS (Ordinary Least Squares) method was used to create an OLS regression model.&lt;/li&gt;
&lt;li&gt;  The model was fitted to the data.&lt;/li&gt;
&lt;li&gt;  Lastly, the summary of the model was printed, which shows a detailed overview of the regression results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run this code, you should get an output similar to what is shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqn6h94pkzaerzjsqoqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqn6h94pkzaerzjsqoqf.png" alt="Terminal" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we’ve explored the power of &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt;, examined its role in statistical modeling, and conducted a linear regression analysis. &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; provides a comprehensive collection of tools for data professionals, researchers, and analysts.&lt;/p&gt;

&lt;p&gt;By integrating seamlessly with &lt;strong&gt;&lt;em&gt;pandas&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;numpy&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;statsmodels&lt;/em&gt;&lt;/strong&gt; makes statistical modeling in Python intuitive and accessible. Whether you’re working in &lt;strong&gt;finance, healthcare, manufacturing, or marketing&lt;/strong&gt;, this library offers powerful capabilities for extracting meaningful insights from data.&lt;/p&gt;

&lt;p&gt;As you’ve seen from our example, installing and using &lt;strong&gt;statsmodels&lt;/strong&gt; is straightforward, and with a solid grasp of its tools, you can unlock new possibilities in data analysis. I recommend diving even deeper: experiment with other datasets and explore more advanced statistical models to cement what you are learning and grow your expertise.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>python</category>
      <category>learning</category>
    </item>
    <item>
      <title>Build a RAG-Powered Research Paper Assistant</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Tue, 11 Mar 2025 12:28:24 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/build-a-rag-powered-research-paper-assistant-3da1</link>
      <guid>https://dev.to/shittu_olumide_/build-a-rag-powered-research-paper-assistant-3da1</guid>
      <description>&lt;p&gt;Have you ever spent hours sifting through academic papers only to feel overwhelmed by the sheer amount of information? Finding, analyzing, and synthesizing relevant research can be daunting. But what if there was a tool that could do the heavy lifting for you?&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt; (RAG), a state-of-the-art &lt;strong&gt;AI framework&lt;/strong&gt; that combines the accuracy of retrieval systems with the creative problem-solving capabilities of Large Language Models. In this article, we will explain RAG in detail, show how it works, and take you through a step-by-step process to build a research assistant powered by OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;Let’s break it down before we get too technical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Retrieval:&lt;/strong&gt; This part retrieves the most relevant documents from a large dataset, such as academic papers or books, based on a user’s query.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Generation:&lt;/strong&gt; It synthesizes that information with the model’s own knowledge to generate a response through a language model, such as &lt;strong&gt;OpenAI’s GPT-4&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining these two capabilities, &lt;strong&gt;RAG&lt;/strong&gt; generates highly accurate and contextually rich outputs. It’s like having an AI librarian who will not only recommend the best books but also summarize and explain them to you.&lt;/p&gt;
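To make the retrieve-then-generate idea concrete, here is a toy, pure-Python sketch (no real LLM involved; the "generation" step is just a template, and all names and documents are made up for illustration) that scores documents by word overlap with the query:

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they contain."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def generate(query, context):
    """Stand-in for an LLM: stitch the retrieved context into an answer."""
    return f"Based on the retrieved context: {context[0]}"

docs = [
    "Transformers use self-attention to process sequences.",
    "Penguins migrate across the Southern Ocean.",
    "Gradient descent minimizes a loss function.",
]
best = retrieve("how does self-attention work in transformers", docs)
print(generate("how does self-attention work in transformers", best))
```

A real system replaces the word-overlap scoring with vector search and the template with an LLM call, but the two-stage shape stays the same.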

&lt;h2&gt;
  
  
  What Will You Build?
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we’ll create a RAG-powered Research Paper Assistant capable of searching a database of academic papers, summarizing key points from relevant studies, and answering user queries with accurate, well-cited information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We will be using:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;OpenAI GPT-4&lt;/strong&gt; to generate the text.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pinecone&lt;/strong&gt; or &lt;strong&gt;Weaviate&lt;/strong&gt; for vector search.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LangChain&lt;/strong&gt; to orchestrate the RAG pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Setup Your Environment
&lt;/h3&gt;

&lt;p&gt;Okay, first things first: prepare the tools and libraries that will be used for this project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Basic programming skills in any language. In this case, &lt;strong&gt;Python&lt;/strong&gt; will be used.&lt;/li&gt;
&lt;li&gt; A research paper dataset, such as &lt;a href="https://www.kaggle.com/Cornell-University/arxiv" rel="noopener noreferrer"&gt;&lt;strong&gt;ArXiv Open Access&lt;/strong&gt;&lt;/a&gt; or Semantic Scholar.&lt;/li&gt;
&lt;li&gt; OpenAI API key&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Environment Setup&lt;/strong&gt;: You can run this application locally on your system or in a cloud-based Jupyter Notebook environment like &lt;a href="https://colab.google/" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;, which is beginner-friendly and free for basic usage.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Accessing OpenAI for Free
&lt;/h3&gt;

&lt;p&gt;If you don’t already have an OpenAI account, follow these steps to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Sign Up&lt;/strong&gt;: Visit &lt;a href="https://platform.openai.com/" rel="noopener noreferrer"&gt;OpenAI’s website&lt;/a&gt; and sign up for an account.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Free Credits&lt;/strong&gt;: New users receive free credits, which you can use to practice along with this tutorial.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Educational Discounts&lt;/strong&gt;: If you’re a student, check for OpenAI’s educational programs or grants to access additional credits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Installation of Required Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following are the commands to install our required libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="n"&gt;langchain&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;transformers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If using &lt;strong&gt;Google Colab&lt;/strong&gt;, add the &lt;code&gt;!&lt;/code&gt; prefix before each command to execute it in the notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Prepare the Data
&lt;/h2&gt;

&lt;p&gt;A good assistant requires a well-prepared dataset. The preparation of your dataset can be done through the following steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Gather Your Data
&lt;/h3&gt;

&lt;p&gt;Download a dataset of research papers or use APIs like Semantic Scholar to fetch abstracts and metadata. &lt;strong&gt;Pro tip:&lt;/strong&gt; Stick to domains you are interested in, unless you want to end up analyzing the migratory patterns of penguins when your real interest is neural networks.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will make use of the &lt;a href="https://www.kaggle.com/Cornell-University/arxiv" rel="noopener noreferrer"&gt;ArXiv Open Access Dataset&lt;/a&gt; available in Kaggle. The following is how you can access and load it into your environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Log in to Kaggle, or sign up if you don’t have an account.&lt;/li&gt;
&lt;li&gt; Go to the &lt;a href="https://www.kaggle.com/Cornell-University/arxiv" rel="noopener noreferrer"&gt;dataset page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt; Download the dataset and upload the &lt;strong&gt;papers.csv&lt;/strong&gt; file to your Google Colab or your local directory.&lt;/li&gt;
&lt;li&gt; Use the following code snippet to load the file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="c1"&gt;# Load dataset
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;
&lt;span class="c1"&gt;# Upload CSV
&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Select 'papers.csv'
# Load dataset into a DataFrame e.g (papers.csv)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;papers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If running locally, ensure the papers.csv file is in the same directory as your script and load it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;papers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Preprocess the Data
&lt;/h3&gt;

&lt;p&gt;We need to clean up the data by removing duplicates, irrelevant information, and formatting issues. Here’s how we’ll do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="c1"&gt;# Extract abstracts and titles
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Generate embeddings
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# Save preprocessed data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preprocessed_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Set Up Vector Search
&lt;/h2&gt;

&lt;p&gt;A vector database enables fast and efficient retrieval. We’ll use &lt;strong&gt;Pinecone&lt;/strong&gt; for this step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a Pinecone Index
&lt;/h3&gt;

&lt;p&gt;Sign up at &lt;a href="https://www.pinecone.io/" rel="noopener noreferrer"&gt;Pinecone.io&lt;/a&gt; and create an index. Choose parameters like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Metric&lt;/strong&gt;: Cosine similarity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dimension&lt;/strong&gt;: Match the embedding size of your model (e.g., 384 for MiniLM).&lt;/li&gt;
&lt;/ul&gt;
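&lt;p&gt;To see what the cosine metric actually computes, here is a minimal sketch in plain NumPy (a hypothetical helper for illustration only; Pinecone performs this comparison internally):&lt;/p&gt;

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: the cosine of the angle between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_sim([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

&lt;p&gt;Because the score depends only on direction, not magnitude, cosine similarity works well for comparing sentence embeddings of different lengths.&lt;/p&gt;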

&lt;h3&gt;
  
  
  Upload Data to Pinecone
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;
&lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west1-gcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pinecone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Upload data
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Build the RAG Pipeline
&lt;/h2&gt;

&lt;p&gt;Now comes the fun part: combining retrieval with generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Define the Retrieval Function
&lt;/h3&gt;

&lt;p&gt;This function queries the Pinecone index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;matches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generate Responses
&lt;/h3&gt;

&lt;p&gt;Use OpenAI to synthesize responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the following context to answer the query:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
     &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-davinci-003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Create the Interactive Assistant
&lt;/h2&gt;

&lt;p&gt;Let’s tie everything together with a simple interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Command-Line Interface
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to the RAG-Powered Research Assistant!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your research question (or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to quit): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant documents found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 6: Enhance the Experience
&lt;/h2&gt;

&lt;p&gt;To make your assistant even more powerful, add the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Add Citations&lt;/strong&gt;: Include paper titles and authors to lend credibility to the output.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Web Interface&lt;/strong&gt;: Provide a nice UI created in &lt;strong&gt;React&lt;/strong&gt; or &lt;strong&gt;Next.js&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Summarization:&lt;/strong&gt; Allow summarizing of whole papers.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Advanced Features for Your RAG Assistant
&lt;/h2&gt;

&lt;p&gt;To make your research assistant more capable, consider implementing the following features:&lt;/p&gt;

&lt;h3&gt;
  
  
  Citation Generation
&lt;/h3&gt;

&lt;p&gt;Including citations for the retrieved research papers makes your assistant more reliable and useful for academic work. You can extract metadata such as authors, titles, and publication years to build properly formatted citations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_citation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; by &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
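&lt;p&gt;For example, with a hypothetical metadata record (the field names &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;authors&lt;/code&gt;, and &lt;code&gt;year&lt;/code&gt; are assumptions about your dataset), the helper produces a readable reference line:&lt;/p&gt;

```python
def format_citation(doc):
    # Same helper as above, repeated here so the snippet runs standalone.
    return f"{doc['title']} by {doc['authors']} ({doc['year']})"

doc = {"title": "Attention Is All You Need", "authors": "Vaswani et al.", "year": 2017}
print(format_citation(doc))  # Attention Is All You Need by Vaswani et al. (2017)
```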



&lt;h3&gt;
  
  
Summarization of Findings
&lt;/h3&gt;

&lt;p&gt;Summarizing long documents or multiple papers into concise, digestible insights helps streamline the research process. Use OpenAI or specialized summarization models like &lt;strong&gt;bart-large-cnn&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abstract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Limit the text length
&lt;/span&gt;    &lt;span class="n"&gt;combined_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-davinci-003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following research abstracts:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;combined_summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
 &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real-Time Data Updates
&lt;/h3&gt;

&lt;p&gt;If your dataset is updated regularly, consider periodic updates with a scheduler to keep the assistant updated. This can be automated using &lt;strong&gt;cron&lt;/strong&gt; jobs or &lt;strong&gt;Celery&lt;/strong&gt;.&lt;/p&gt;
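&lt;p&gt;A scheduled job only needs to re-embed and upsert what changed since the last run. Here is a minimal sketch of that filtering step (the &lt;code&gt;updated&lt;/code&gt; field and ISO-format timestamps are assumptions about your dataset; the cron or Celery task would embed and upsert only the records this returns):&lt;/p&gt;

```python
from datetime import datetime

def records_to_upsert(records, last_sync):
    # Keep only records modified after the previous sync, so the scheduled
    # job re-embeds just the changes instead of reprocessing everything.
    return [r for r in records if datetime.fromisoformat(r["updated"]) > last_sync]

records = [
    {"id": "a", "updated": "2024-01-01T00:00:00"},
    {"id": "b", "updated": "2024-06-01T00:00:00"},
]
fresh = records_to_upsert(records, datetime(2024, 3, 1))
print([r["id"] for r in fresh])  # ['b']
```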

&lt;h3&gt;
  
  
Multilingual Support
&lt;/h3&gt;

&lt;p&gt;If your audience includes researchers all over the world, include translation capabilities. Libraries like &lt;strong&gt;transformers&lt;/strong&gt; with models like &lt;strong&gt;Helsinki-NLP&lt;/strong&gt; can help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying Your Assistant
&lt;/h2&gt;

&lt;p&gt;Once your RAG-powered assistant is built, deploy it to make it accessible. Here’s how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Deployment with Flask or FastAPI&lt;/strong&gt;&lt;br&gt;
    Create an API endpoint to handle user queries and return results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_research_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_docs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant documents found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the API on the local system or deploy it to &lt;strong&gt;Heroku&lt;/strong&gt;, &lt;strong&gt;AWS&lt;/strong&gt;, &lt;strong&gt;Google Cloud&lt;/strong&gt;, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web Interface
&lt;/h3&gt;

&lt;p&gt;Create a user-friendly interface using frameworks such as &lt;strong&gt;React&lt;/strong&gt;, &lt;strong&gt;Vue.js&lt;/strong&gt;, or even &lt;strong&gt;Streamlit&lt;/strong&gt; for fast, interactive web application development.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dockerize for Portability
&lt;/h3&gt;

&lt;p&gt;Package your application in a Docker container for easy deployment across environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;3.9&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;slim&lt;/span&gt;
&lt;span class="n"&gt;WORKDIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;RUN&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;span class="n"&gt;COPY&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;CMD&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uvicorn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app:app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; - host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; - port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integration with Slack or Discord
&lt;/h3&gt;

&lt;p&gt;Create bots that respond to user queries in Slack or Discord channels to make the assistant accessible in collaboration tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing and Iteration
&lt;/h3&gt;

&lt;p&gt;Before releasing your assistant, make quality control a priority:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Accuracy Testing:&lt;/strong&gt; The accuracy of both retrieval and generation components should be tested. Test real-world queries and check response relevance.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Performance Testing:&lt;/strong&gt; Measure response times, especially when your assistant processes large amounts of data. If needed, improve the retrieval and generation steps.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;User Feedback:&lt;/strong&gt; Ask researchers or target users for feedback and implement their suggestions to improve the user-friendliness and functionality of your assistant.&lt;/li&gt;
&lt;/ol&gt;
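&lt;p&gt;Retrieval accuracy can be scored with a simple precision-at-k metric over a handful of hand-labeled queries. A minimal sketch (the document IDs below are purely illustrative):&lt;/p&gt;

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the top-k retrieved documents that are actually relevant.
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

retrieved = ["p1", "p7", "p3", "p9", "p2"]  # what the index returned, in rank order
relevant = {"p1", "p3"}                     # hand-labeled ground truth for this query
print(precision_at_k(retrieved, relevant))  # 0.4
```

&lt;p&gt;Averaging this score across a small evaluation set gives you a baseline to compare against after each change to the embedding model or index settings.&lt;/p&gt;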

&lt;h2&gt;
  
  
  Future Possibilities with RAG
&lt;/h2&gt;

&lt;p&gt;Retrieval-augmented generation is not limited to research assistance alone. Here are a few other applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Legal Research&lt;/strong&gt;: Quickly find and summarize legal documents or case law.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Healthcare&lt;/strong&gt;: Assist doctors by retrieving relevant medical studies and summarizing patient cases.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;E-commerce:&lt;/strong&gt; Improve customer support by combining product information retrieval with personalized recommendations.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By building a RAG-powered research assistant, you will be entering the future of intelligent tools that mix precision with creativity. This will not only assist academic researchers in doing their work more effectively but also open doors to endless possibilities in other domains. With frameworks like &lt;strong&gt;LangChain&lt;/strong&gt;, powerful language models like &lt;strong&gt;GPT-4&lt;/strong&gt;, and scalable vector databases like Pinecone, creating your assistant has never been more accessible.&lt;/p&gt;

&lt;p&gt;So, why wait? Dive in and change the way research is done in your field. The possibilities are endless, and your journey has just begun.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call-to-Action
&lt;/h3&gt;

&lt;p&gt;Want to build your &lt;strong&gt;RAG-powered assistant&lt;/strong&gt;? Begin coding now and join the AI revolution in academic research. Share your experience and let us know how it transforms your workflow!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;OpenAI API Documentation&lt;/strong&gt;
Learn more about how to use the OpenAI API for text generation:
&lt;a href="https://platform.openai.com/docs/" rel="noopener noreferrer"&gt;https://platform.openai.com/docs/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;LangChain Documentation&lt;/strong&gt; 
Official guide to orchestrating RAG pipelines with LangChain:
&lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;https://www.langchain.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pinecone Documentation&lt;/strong&gt;
A comprehensive resource for setting up and using Pinecone for vector search:
&lt;a href="https://docs.pinecone.io/" rel="noopener noreferrer"&gt;https://docs.pinecone.io/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;ArXiv Dataset on Kaggle&lt;/strong&gt; 
Access the ArXiv Open Access Dataset for research papers:
&lt;a href="https://www.kaggle.com/datasets/Cornell-University/arxiv" rel="noopener noreferrer"&gt;https://www.kaggle.com/datasets/Cornell-University/arxiv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Sentence Transformers&lt;/strong&gt;
Learn about the sentence-transformers library and its usage for embeddings:
&lt;a href="https://www.sbert.net/" rel="noopener noreferrer"&gt;https://www.sbert.net/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Semantic Scholar API&lt;/strong&gt;
Explore the Semantic Scholar API to fetch academic papers:
&lt;a href="https://www.semanticscholar.org/product/api" rel="noopener noreferrer"&gt;https://www.semanticscholar.org/product/api&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Pinecone Blog: Building with RAG&lt;/strong&gt;
A detailed tutorial on Retrieval-Augmented Generation with Pinecone:
&lt;a href="https://www.pinecone.io/learn/rag/" rel="noopener noreferrer"&gt;https://www.pinecone.io/learn/rag/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Google Colab Documentation&lt;/strong&gt; 
Beginner-friendly documentation to get started with Google Colab:
&lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;https://colab.research.google.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Prompt Engineering Patterns for Successful RAG Implementations</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Tue, 25 Feb 2025 15:06:56 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/prompt-engineering-patterns-for-successful-rag-implementations-2m2e</link>
      <guid>https://dev.to/shittu_olumide_/prompt-engineering-patterns-for-successful-rag-implementations-2m2e</guid>
      <description>&lt;p&gt;RAG has been the magic sauce behind the scenes, empowering many AI-driven applications to transcend the divide from static knowledge toward dynamic and real-time information. But getting exactly the right responses- precise, relevant, and of high value- is both a science and an art. Herein comes your guide on implementing prompt engineering patterns to make any implementation of RAG more effective and efficient.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6gsuxui6aagm90wfnmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6gsuxui6aagm90wfnmr.png" alt="Image source:Source: https://imgflip.com/i/8mv2pm" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Prompt Engineering Matters in RAG
&lt;/h2&gt;

&lt;p&gt;Imagine sending a request to an AI assistant for today’s stock market trends, and it returns information from a finance book published ten years ago. This is what happens when your prompts are not clear, specific, or structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; retrieves information from external sources and builds informed responses, but its effectiveness depends heavily on how the prompt is constructed. Well-structured, clearly defined prompts ensure the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High retrieval accuracy&lt;/li&gt;
&lt;li&gt;Less hallucination and misinformation&lt;/li&gt;
&lt;li&gt;More context-aware responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving into the deep end, you should have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A high-level understanding of Large Language Models (LLMs)&lt;/li&gt;
&lt;li&gt;Understanding of RAG architecture&lt;/li&gt;
&lt;li&gt;Some Python experience (we are going to write a bit of code)&lt;/li&gt;
&lt;li&gt;A sense of humor (trust me, it helps)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Direct Retrieval Pattern
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Retrieve only, no guessing&lt;/em&gt;&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;For questions requiring factual accuracy, forcing the model to rely only on the retrieved documents minimizes hallucinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using only the provided retrieved documents, answer the following question. Do not add any external knowledge.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Keeps answers grounded in retrieved data&lt;/li&gt;
&lt;li&gt;Reduces speculation and incorrect responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If too restrictive, the AI becomes overly cautious with many “I don’t know” responses.&lt;/li&gt;
&lt;/ul&gt;
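&lt;p&gt;As a minimal sketch, here is how a direct-retrieval prompt could be assembled from retrieved chunks. The helper &lt;code&gt;build_grounded_prompt&lt;/code&gt; and the sample documents are hypothetical, not part of any particular library:&lt;/p&gt;

```python
def build_grounded_prompt(docs, question):
    """Build a prompt that restricts the model to the retrieved documents."""
    # Label each chunk so the model (and you) can trace which document was used.
    context = "\n\n".join(f"[Doc {i + 1}] {d}" for i, d in enumerate(docs))
    return (
        "Using only the provided retrieved documents, answer the following "
        "question. Do not add any external knowledge. If the documents do not "
        "contain the answer, reply \"I don't know.\"\n\n"
        f"Retrieved documents:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "The index closed up 1.2% on Tuesday.",
    "Tech stocks led the rally, with semiconductors gaining 3%.",
]
print(build_grounded_prompt(docs, "What drove the market on Tuesday?"))
```

&lt;p&gt;The resulting string is what you would pass to your LLM client of choice; the explicit "I don't know" escape hatch is what keeps the pattern from forcing a guess.&lt;/p&gt;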

&lt;h2&gt;
  
  
  2. Chain of Thought (CoT) Prompting
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Think like a detective&lt;/em&gt;&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;For complex reasoning tasks, guiding the AI through logical steps improves response quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Break down the following problem into logical steps and solve it step by step using the retrieved data.&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Improves reasoning and transparency&lt;/li&gt;
&lt;li&gt;Improves explainability in responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Increases response time and token usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Context Enrichment Pattern
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;More context, fewer errors&lt;/em&gt;&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;Extra context in the prompt produces more accurate responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a cybersecurity expert analyzing a recent data breach.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Based on the retrieved documents, explain the breach&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s impact and potential solutions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tailors responses to domain-specific needs&lt;/li&gt;
&lt;li&gt;Reduces ambiguity in AI output&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Too much context can overwhelm the model&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Instruction-Tuning Pattern
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Be clear, be direct&lt;/em&gt;&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;LLMs perform better when instructions are precise and structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following document in three bullet points, each under 20 words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Guides the model towards structured output&lt;/li&gt;
&lt;li&gt;Avoids excessive verbosity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rigid formats may limit nuanced responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Persona-Based Prompting
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Personalize responses for target groups.&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;If your RAG model serves heterogeneous end users, say novices vs. experts, personalizing responses will improve engagement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beginner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain blockchain technology as if I were a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, using simple language and real-world examples.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Increased accessibility&lt;/li&gt;
&lt;li&gt;Enhances personalization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common mistake:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Oversimplification may omit information relevant to an expert&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Error Handling Pattern
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;What if AI gets it wrong?&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;Prompts should instruct the model to reflect on its own output so it can flag any uncertainties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If your response contains conflicting information, state your confidence level and suggest areas for further research.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;More transparent responses&lt;/li&gt;
&lt;li&gt;Less risk of misinformation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The model may habitually give low-confidence answers, even when the answer is correct.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  7. Multi-Pass Query Refinement
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Iterate until the answer is perfect.&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;Instead of producing a single-shot response, this approach iterates on the query to refine accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate an initial answer, then refine it based on retrieved documents to improve accuracy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Helps AI self-correct mistakes&lt;/li&gt;
&lt;li&gt;Improves factual consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires more processing time&lt;/li&gt;
&lt;/ul&gt;
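&lt;p&gt;The multi-pass idea can be sketched as a small loop. Here &lt;code&gt;call_llm&lt;/code&gt; is a deterministic stand-in for a real chat-completion call, so the control flow is runnable without any API:&lt;/p&gt;

```python
def call_llm(prompt):
    # Placeholder for a real LLM call; returns a tagged echo so the flow runs.
    return "LLM_RESPONSE(" + str(len(prompt)) + " chars)"

def refine_answer(question, retrieved_docs, passes=2):
    """Generate an initial answer, then refine it against the retrieved docs."""
    answer = call_llm("Answer briefly: " + question)
    context = "\n".join(retrieved_docs)
    # Each extra pass feeds the previous answer back with the evidence.
    for _ in range(passes - 1):
        answer = call_llm(
            "Initial answer:\n" + answer
            + "\n\nRetrieved documents:\n" + context
            + "\n\nRefine the initial answer so every claim is supported by the documents."
        )
    return answer

print(refine_answer("What caused the outage?", ["Doc A", "Doc B"]))
```

&lt;p&gt;In a real pipeline you would cap &lt;code&gt;passes&lt;/code&gt; at two or three: each pass costs another model call, which is exactly the processing-time pitfall noted above.&lt;/p&gt;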

&lt;h2&gt;
  
  
  8. Hybrid Prompting with Few-Shot Examples
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;Show, don’t tell.&lt;/em&gt;&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;Few-shot examples reinforce consistency by showing the model what good output looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here are two examples of well-structured financial reports. Follow this pattern when summarizing the retrieved data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why it works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gives the model a reference structure&lt;/li&gt;
&lt;li&gt;Improves coherence and quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pitfall:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires carefully curated examples&lt;/li&gt;
&lt;/ul&gt;
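&lt;p&gt;Assembling a few-shot prompt is mostly string plumbing. This sketch uses a hypothetical &lt;code&gt;build_few_shot_prompt&lt;/code&gt; helper with made-up financial examples:&lt;/p&gt;

```python
def build_few_shot_prompt(task, examples, query):
    """Prepend curated input/output pairs before the real query."""
    shots = "\n\n".join(
        "Input: " + inp + "\nOutput: " + out for inp, out in examples
    )
    # Ending on a bare "Output:" cues the model to continue the pattern.
    return task + "\n\n" + shots + "\n\nInput: " + query + "\nOutput:"

examples = [
    ("Q3 revenue rose 8 percent.", "- Revenue: up 8 percent"),
    ("Q4 costs fell 3 percent.", "- Costs: down 3 percent"),
]
print(build_few_shot_prompt(
    "Summarize each financial report as bullet points.",
    examples,
    "Q1 margin grew 2 percent.",
))
```

&lt;p&gt;In a RAG setting, the query string would be your retrieved document text; the examples are where the curation cost mentioned above comes in.&lt;/p&gt;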

&lt;h2&gt;
  
  
  Implementing RAG for Song Recommendations
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RagTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RagRetriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RagSequenceForGeneration&lt;/span&gt;

&lt;span class="c1"&gt;# Load the RAG model, tokenizer, and retriever
&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook/rag-sequence-nq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RagTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RagRetriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RagSequenceForGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define user input: Mood for song recommendation
&lt;/span&gt;&lt;span class="n"&gt;user_mood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m feeling happy and energetic. Recommend some songs to match my vibe.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Tokenize the query
&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_mood&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a response using RAG
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
 &lt;span class="n"&gt;output_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Decode and print the response
&lt;/span&gt;&lt;span class="n"&gt;recommendation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🎵 Song Recommendations:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Additional Considerations
&lt;/h2&gt;

&lt;p&gt;There are a few other things you need to consider, namely handling long queries, optimizing retrieval quality, and evaluating and refining prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling Long Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Break complicated queries into subqueries.&lt;/li&gt;
&lt;li&gt;Summarize inputs before giving them to the model.&lt;/li&gt;
&lt;li&gt;Order retrievals based on keyword relevance.&lt;/li&gt;
&lt;/ul&gt;
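&lt;p&gt;As a toy illustration of the first point, a compound question can be split into subqueries before retrieval. This naive regex split is an assumption for demonstration; production systems often use an LLM to do the decomposition:&lt;/p&gt;

```python
import re

def split_into_subqueries(query):
    """Naively split a compound question on connectives and question marks."""
    parts = re.split(r"\band\b|;|\?", query)
    # Re-attach a question mark to each surviving fragment.
    return [part.strip() + "?" for part in parts if part.strip()]

print(split_into_subqueries("What caused the breach and how was it contained?"))
```

&lt;p&gt;Each subquery can then be retrieved against separately and the results merged, which keeps individual retrievals short and focused.&lt;/p&gt;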

&lt;h3&gt;
  
  
Optimizing Retrieval Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use embeddings for better similarity search&lt;/li&gt;
&lt;li&gt;Fine-tune retriever models on domain-specific tasks&lt;/li&gt;
&lt;li&gt;Experiment with hybrid search (BM25 + embeddings)&lt;/li&gt;
&lt;/ul&gt;
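&lt;p&gt;One common way to combine keyword and embedding results is reciprocal rank fusion. The sketch below assumes you already have two ranked lists of document ids (one from a BM25-style index, one from a vector store) and merges them without any external dependencies:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents near the top of any list get the biggest boost.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["d3", "d1", "d2"]   # e.g. from a BM25 index
vector_ranking = ["d1", "d4", "d3"]    # e.g. from an embedding store
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

&lt;p&gt;Documents that appear high in both lists float to the top, which is why hybrid search tends to beat either retriever alone.&lt;/p&gt;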

&lt;h3&gt;
  
  
  Evaluate and Refine Prompts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitor response quality via human feedback&lt;/li&gt;
&lt;li&gt;A/B test prompts to compare their efficacy&lt;/li&gt;
&lt;li&gt;Iterate on prompts based on evaluation metrics&lt;/li&gt;
&lt;/ul&gt;
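&lt;p&gt;A prompt A/B test can be as simple as scoring two templates over the same queries. In this sketch, &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;judge&lt;/code&gt; are deliberately trivial stand-ins for an LLM call and a quality metric (or human rating):&lt;/p&gt;

```python
def ab_test_prompts(templates, queries, run, judge):
    """Score each prompt template over the same queries; return average scores."""
    averages = {}
    for name, template in templates.items():
        total = 0.0
        for query in queries:
            total += judge(run(template.format(query=query)))
        averages[name] = total / len(queries)
    return averages

templates = {
    "terse": "Answer in one sentence: {query}",
    "grounded": "Using only the retrieved documents, answer: {query}",
}
run = lambda prompt: prompt.upper()            # stand-in for an LLM call
judge = lambda response: float(len(response))  # stand-in for a quality score
print(ab_test_prompts(templates, ["q1", "q2"], run, judge))
```

&lt;p&gt;Swapping in a real judge (human ratings, groundedness checks, or an LLM grader) turns this into the iteration loop the list above describes.&lt;/p&gt;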

&lt;h2&gt;
  
  
  Conclusion: How to Master Prompt Engineering in RAG
&lt;/h2&gt;

&lt;p&gt;Mastery of &lt;strong&gt;RAG&lt;/strong&gt; requires not only a powerful &lt;strong&gt;LLM&lt;/strong&gt; but also precision in crafting the prompt. The right patterns considerably increase response accuracy, contextual relevance, and speed. Be it finance, healthcare, cybersecurity, or any other domain, structured prompt engineering will ensure your AI delivers value-driven insight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Tip&lt;/strong&gt;: Iterate. The best prompts evolve, much like the finest AI applications. A well-engineered prompt today may need to be adjusted tomorrow as your use cases expand and AI capabilities improve. Stay adaptive, experiment, and refine for optimal performance.&lt;/p&gt;


</description>
      <category>programming</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Machine Learning Basics: Building Your First Predictive Model in R</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Wed, 11 Dec 2024 13:38:18 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/machine-learning-basics-building-your-first-predictive-model-in-r-3l88</link>
      <guid>https://dev.to/shittu_olumide_/machine-learning-basics-building-your-first-predictive-model-in-r-3l88</guid>
      <description>&lt;p&gt;Machine learning models used for prediction purposes have become one of the most adopted technologies by different organizations. These models are capable of predicting future occurrences/outcomes giving valuable insights for making key decisions, hence leading to growth and increased productivity.&lt;/p&gt;

&lt;p&gt;In light of this, the demand for professionals capable of building high-performance predictive models continues to soar. Many people with transferable skills are moving toward machine learning engineering: data analysts conversant with R, as well as statisticians and other researchers who use it, since R is a powerful language for building machine learning models.&lt;/p&gt;

&lt;p&gt;In the course of this article, you will learn how to build a predictive model using the R programming language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This tutorial is suitable for people familiar with the R programming language, Visual Studio Code (as a code editor), and who already know a thing or two about machine learning (just a little is enough).&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;By going through this article, you should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn how to build a simple machine-learning model in R using the Linear Regression algorithm.&lt;/li&gt;
&lt;li&gt;Evaluate the performance of your model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning (ML) can be likened to teaching a computer to learn from experience instead of being explicitly programmed for every task. Imagine showing a kid hundreds of pictures of cats and dogs and asking them to identify which is which - they'll eventually figure it out. That's essentially how ML works but for computers.&lt;/p&gt;

&lt;p&gt;It all started with the idea of creating systems that could mimic human intelligence. Over the years, it evolved from simple rule-based systems to sophisticated algorithms that can handle vast amounts of data. Today, ML powers everything from voice assistants like Siri to personalized recommendations on Netflix.&lt;/p&gt;

&lt;p&gt;At its core, ML involves feeding data into algorithms, training them to recognize patterns, and making predictions or decisions without constant human guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Predictive Model in R
&lt;/h2&gt;

&lt;p&gt;1.&lt;strong&gt;Install and load the necessary packages&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I assume you already have R set up on your computer. If you don't, feel free to check out this &lt;a href="https://cran.r-project.org/doc/manuals/r-release/R-admin.html"&gt;resource&lt;/a&gt;. Install and load the necessary packages you will be using for training the model, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tidyverse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# For data manipulation and visualization
&lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# For model training and evaluation
&lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="c1"&gt;# loading the tidyverse library
&lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c1"&gt;# loading the caret library
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.&lt;strong&gt;Load your preferred dataset&lt;/strong&gt;&lt;br&gt;
In this step, you choose the dataset you will use to train your model. It can be an externally prepared dataset in CSV format or a built-in dataset. For this article, we will be using the built-in &lt;code&gt;mtcars&lt;/code&gt; dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;mtcars&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.&lt;strong&gt;Exploratory data analysis&lt;/strong&gt;&lt;br&gt;
This step involves understanding the dataset you are working with and checking its suitability for use. It encompasses a series of processes, such as checking for null values in the dataset and filling them appropriately if there are.&lt;/p&gt;

&lt;p&gt;View the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;glimpse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check for missing values in the &lt;code&gt;mtcars&lt;/code&gt; built-in dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check for missing values
&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;is&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nf"&gt;na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View the statistical summary of the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Summary statistics
&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.&lt;strong&gt;Split the dataset for testing and training the model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This involves splitting your dataset into a particular proportion. In this example, we are using the 80:20 proportion, that is, 80% for training and the remaining 20% for testing the model.&lt;/p&gt;

&lt;p&gt;Ensure reproducibility and split the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# For reproducibility
&lt;/span&gt;
&lt;span class="c1"&gt;# Split the data into 80% training and 20% testing
&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;createDataPartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.&lt;strong&gt;Build the Linear Regression model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the &lt;code&gt;train()&lt;/code&gt; function from the caret package to build your model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;mpg&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;wt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Formula: mpg as a function of wt and hp
&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Linear regression method
&lt;/span&gt;  &lt;span class="n"&gt;trControl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trainControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 5-fold cross-validation
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# View model summary
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this step, we have successfully built our linear regression model. The next step is to evaluate its performance using a suitable metric. For this article, we are going to use the RMSE metric.&lt;/p&gt;

&lt;p&gt;6.&lt;strong&gt;Model evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the model you have built in the previous step to make predictions on the test dataset and evaluate the predicted values using the RMSE metric.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Make predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newdata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate performance
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;Actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Predicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate RMSE and R-squared
&lt;/span&gt;&lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Actual&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Predicted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RMSE:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
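&lt;p&gt;RMSE is simply the square root of the average squared difference between actual and predicted values. As a quick sanity check, here is the same calculation in plain Python with made-up numbers (illustrative values, not real &lt;code&gt;mtcars&lt;/code&gt; output):&lt;/p&gt;

```python
import math

# Illustrative actual vs. predicted values (not real mtcars output)
actual = [21.0, 22.8, 18.7, 24.4]
predicted = [20.5, 23.1, 19.2, 23.8]

# RMSE = sqrt(mean of squared errors)
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse = math.sqrt(sum(squared_errors) / len(actual))
print(round(rmse, 3))  # → 0.487
```

&lt;p&gt;A lower RMSE means the predictions sit closer to the actual values, in the same units as the target variable (here, mpg).&lt;/p&gt;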



&lt;p&gt;7. &lt;strong&gt;Model visualization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This step is optional: it plots the predicted values against the actual values, giving a visual sense of the model’s accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot actual vs predicted
&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Actual vs Predicted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;xlab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Actual mpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted mpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Add diagonal line for reference
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congrats on making it this far. We have successfully built a linear regression machine learning model using the R programming language.&lt;/p&gt;

&lt;p&gt;Let's combine all the code snippets for a better understanding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tidyverse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# For data manipulation and visualization
&lt;/span&gt;&lt;span class="n"&gt;install&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;caret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# For model training and evaluation
&lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tidyverse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="c1"&gt;# loading the tidyverse library
&lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;mtcars&lt;/span&gt;

&lt;span class="nf"&gt;glimpse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check for missing values
&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;is&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nf"&gt;na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Summary statistics
&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mtcars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# For reproducibility
&lt;/span&gt;
&lt;span class="c1"&gt;# Split the data into 80% training and 20% testing
&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;createDataPartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;train_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;

 &lt;span class="c1"&gt;# Train the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;mpg&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="n"&gt;wt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;hp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Formula: mpg as a function of wt and hp
&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Linear regression method
&lt;/span&gt;  &lt;span class="n"&gt;trControl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trainControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 5-fold cross-validation
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# View model summary
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Make predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newdata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate performance
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;Actual&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Predicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate RMSE and R-squared
&lt;/span&gt;&lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Actual&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Predicted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;paste&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RMSE:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Plot actual vs predicted
&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;mpg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Actual vs Predicted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;xlab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Actual mpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ylab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted mpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Add diagonal line for reference
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
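&lt;p&gt;For readers more comfortable in Python, the same workflow — split the data 80/20, fit a linear model with 5-fold cross-validation, and evaluate RMSE on the held-out set — can be sketched with scikit-learn. This is only a sketch under assumptions: it uses synthetic data in place of &lt;code&gt;mtcars&lt;/code&gt; and requires NumPy and scikit-learn to be installed.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for mtcars: mpg driven by weight (wt) and horsepower (hp)
rng = np.random.default_rng(123)
wt = rng.uniform(1.5, 5.5, size=200)
hp = rng.uniform(50, 335, size=200)
mpg = 37 - 3.8 * wt - 0.03 * hp + rng.normal(0, 1.5, size=200)
X = np.column_stack([wt, hp])

# 80/20 train-test split, mirroring createDataPartition in the R code
X_train, X_test, y_train, y_test = train_test_split(
    X, mpg, test_size=0.2, random_state=123
)

# 5-fold cross-validation on the training set, then fit the final model
model = LinearRegression()
cv_rmse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("CV RMSE:", cv_rmse.mean(), "Test RMSE:", rmse)
```

&lt;p&gt;The structure maps one-to-one onto the R version: &lt;code&gt;train_test_split&lt;/code&gt; plays the role of &lt;code&gt;createDataPartition&lt;/code&gt;, and the cross-validation that &lt;code&gt;trainControl&lt;/code&gt; configures is done here with &lt;code&gt;cross_val_score&lt;/code&gt;.&lt;/p&gt;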



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we built a machine learning model using the built-in &lt;code&gt;mtcars&lt;/code&gt; dataset and the linear regression algorithm. For your own projects, you can substitute any external dataset and any algorithm that fits the task at hand.&lt;/p&gt;

&lt;p&gt;It is recommended that you conduct proper exploratory data analysis on your dataset and experiment with different algorithms before selecting the one that best suits the task.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Install PySpark on Your Local Machine</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Mon, 09 Dec 2024 13:13:47 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/how-to-install-pyspark-on-your-local-machine-nn4</link>
      <guid>https://dev.to/shittu_olumide_/how-to-install-pyspark-on-your-local-machine-nn4</guid>
      <description>&lt;p&gt;If you’re stepping into the world of Big Data, you have likely heard of &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, a powerful distributed computing system. &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" rel="noopener noreferrer"&gt;PySpark&lt;/a&gt;, the Python library for Apache Spark, is a favorite among data enthusiasts for its combination of speed, scalability, and ease of use. But setting it up on your local machine can feel a bit intimidating at first.&lt;/p&gt;

&lt;p&gt;Fear not — this article walks you through the entire process, addressing common questions and making the journey as straightforward as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is PySpark, and Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;Before going into installation, let’s understand what PySpark is. PySpark allows you to leverage the massive computational power of Apache Spark using Python. Whether you’re analyzing terabytes of data, building machine learning models, or running ETL (&lt;strong&gt;Extract&lt;/strong&gt;, &lt;strong&gt;Transform&lt;/strong&gt;, &lt;strong&gt;Load&lt;/strong&gt;) pipelines, PySpark allows you to work with data more efficiently than ever.&lt;/p&gt;

&lt;p&gt;Now that you understand PySpark, let’s go through the installation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Ensure Your System Meets the Requirements
&lt;/h3&gt;

&lt;p&gt;PySpark runs on various machines, including &lt;strong&gt;Windows&lt;/strong&gt;, &lt;strong&gt;macOS&lt;/strong&gt;, and &lt;strong&gt;Linux&lt;/strong&gt;. Here’s what you need to install it successfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Java Development Kit (JDK)&lt;/strong&gt;: PySpark requires Java (version 8 or 11 is recommended).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: Ensure you have Python 3.6 or later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark Binary&lt;/strong&gt;: You’ll download this during the installation process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To check your system readiness:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open your &lt;strong&gt;terminal&lt;/strong&gt; or &lt;strong&gt;command prompt&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Type &lt;strong&gt;&lt;em&gt;java -version&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;python --version&lt;/em&gt;&lt;/strong&gt; to confirm Java and Python installations.&lt;/li&gt;
&lt;/ol&gt;
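&lt;p&gt;If you would rather check from Python itself, here is a minimal sketch using only the standard library. It reports the Python version and whether a &lt;code&gt;java&lt;/code&gt; binary is on your PATH:&lt;/p&gt;

```python
import shutil
import subprocess
import sys

def check_prerequisites():
    """Report the local Python version and whether `java` is on the PATH."""
    report = {
        "python": sys.version.split()[0],
        "java_on_path": shutil.which("java") is not None,
    }
    if report["java_on_path"]:
        # `java -version` prints its banner to stderr, not stdout
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        report["java_banner"] = result.stderr.strip()
    return report

print(check_prerequisites())
```

&lt;p&gt;If &lt;code&gt;java_on_path&lt;/code&gt; comes back &lt;code&gt;False&lt;/code&gt;, install Java first before continuing.&lt;/p&gt;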

&lt;p&gt;If you don’t have Java or Python installed, follow these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;Java&lt;/strong&gt;: Download it from &lt;a href="https://www.oracle.com/java/technologies/javase-downloads.html" rel="noopener noreferrer"&gt;Oracle’s official website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;Python&lt;/strong&gt;: Visit &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;Python’s download page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Install Java
&lt;/h3&gt;

&lt;p&gt;Java is the backbone of &lt;strong&gt;Apache Spark&lt;/strong&gt;. To install it:&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Download Java&lt;/strong&gt;: Visit the Java SE Development Kit download page. Choose the appropriate version for your operating system.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Install Java&lt;/strong&gt;: Run the installer and follow the prompts. On Windows, you’ll need to set the &lt;code&gt;JAVA_HOME&lt;/code&gt; environment variable. To do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the &lt;strong&gt;&lt;em&gt;local disk&lt;/em&gt;&lt;/strong&gt; on your machine, open &lt;em&gt;&lt;strong&gt;Program Files&lt;/strong&gt;&lt;/em&gt;, and look for the Java folder. Inside it you will see &lt;strong&gt;&lt;em&gt;jdk-17&lt;/em&gt;&lt;/strong&gt; (your version may differ). Open that folder and copy its path, as shown below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxzawbs7ht5hmmwfel88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxzawbs7ht5hmmwfel88.png" alt="Path Variable" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Search for &lt;strong&gt;Environment Variables&lt;/strong&gt; in the Windows search bar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Under &lt;strong&gt;System Variables&lt;/strong&gt;, click &lt;strong&gt;New&lt;/strong&gt; and set the variable name as &lt;code&gt;JAVA_HOME&lt;/code&gt; and the value as your &lt;strong&gt;Java&lt;/strong&gt; installation path you copied above (&lt;strong&gt;&lt;em&gt;e.g., C:\Program Files\Java\jdk-17&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. &lt;strong&gt;Verify Installation&lt;/strong&gt;: Open a &lt;strong&gt;terminal&lt;/strong&gt; or &lt;strong&gt;command prompt&lt;/strong&gt; and type &lt;code&gt;java -version&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Install Apache Spark
&lt;/h3&gt;

&lt;p&gt;1. &lt;strong&gt;Download Spark&lt;/strong&gt;: Visit &lt;a href="https://spark.apache.org/downloads.html" rel="noopener noreferrer"&gt;Apache Spark’s website&lt;/a&gt; and select the version compatible with your needs. Use the pre-built package for Hadoop (a common pairing with Spark).&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Extract the Files&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On &lt;strong&gt;Windows&lt;/strong&gt;, use a tool like WinRAR or 7-Zip to extract the file.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;macOS/Linux&lt;/strong&gt;, use the command &lt;strong&gt;&lt;em&gt;tar -xvf spark-*.tgz&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. &lt;strong&gt;Set Environment Variables&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;Windows&lt;/strong&gt;: Add Spark’s bin directory to your system’s PATH variable.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;macOS/Linux&lt;/strong&gt;: Add the following lines to your &lt;strong&gt;&lt;em&gt;.bashrc&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;.zshrc&lt;/em&gt;&lt;/strong&gt; file:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;
&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4. &lt;strong&gt;Verify Installation&lt;/strong&gt;: Open a terminal and type &lt;code&gt;spark-shell&lt;/code&gt;. You should see Spark’s interactive shell start.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Install Hadoop (Optional but Recommended)
&lt;/h3&gt;

&lt;p&gt;While Spark doesn’t strictly require &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt;, many users install it for its HDFS (Hadoop Distributed File System) support. To install Hadoop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download Hadoop binaries from &lt;a href="https://hadoop.apache.org/releases.html" rel="noopener noreferrer"&gt;Apache Hadoop’s website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Extract the files and set up the &lt;code&gt;HADOOP_HOME&lt;/code&gt; environment variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Install PySpark via &lt;code&gt;pip&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Installing PySpark is a breeze with Python’s &lt;code&gt;pip&lt;/code&gt; tool. Simply run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify, open a Python shell and type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pysparkark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see a version number, congratulations! PySpark is installed 🎉&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Test Your PySpark Installation
&lt;/h3&gt;

&lt;p&gt;Here’s where the fun begins. Let’s ensure everything is working smoothly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Simple Script&lt;/strong&gt;:&lt;br&gt;
Open a text editor and paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PySparkTest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cathy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save it as &lt;code&gt;test_pyspark.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the Script&lt;/strong&gt;:&lt;br&gt;
In your terminal, navigate to the script’s directory and type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="n"&gt;test_pyspark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a neatly formatted table displaying the &lt;strong&gt;names&lt;/strong&gt; and &lt;strong&gt;ages&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Troubleshooting Common Issues
&lt;/h3&gt;

&lt;p&gt;Even with the best instructions, hiccups happen. Here are some common problems and solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: &lt;code&gt;java.lang.NoClassDefFoundError&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Double-check your &lt;code&gt;JAVA_HOME&lt;/code&gt; and &lt;code&gt;PATH&lt;/code&gt; variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: PySpark installation succeeded, but the test script failed.&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Ensure you’re using the correct Python version. Sometimes, virtual environments can cause conflicts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Issue&lt;/strong&gt;: The &lt;code&gt;spark-shell&lt;/code&gt; command doesn’t work.&lt;br&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Verify that the Spark directory is correctly added to your PATH.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
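&lt;p&gt;Many of these problems trace back to environment variables, so it helps to print the relevant ones before digging deeper. A minimal sketch (&lt;code&gt;None&lt;/code&gt; means the variable is not set):&lt;/p&gt;

```python
import os

def spark_env_report():
    """Collect the environment variables PySpark commonly depends on."""
    return {var: os.environ.get(var) for var in ("JAVA_HOME", "SPARK_HOME", "PATH")}

for name, value in spark_env_report().items():
    print(f"{name} = {value}")
```

&lt;p&gt;If &lt;code&gt;JAVA_HOME&lt;/code&gt; or &lt;code&gt;SPARK_HOME&lt;/code&gt; prints as &lt;code&gt;None&lt;/code&gt;, revisit the environment-variable steps above.&lt;/p&gt;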

&lt;h3&gt;
  
  
  Why Use PySpark Locally?
&lt;/h3&gt;

&lt;p&gt;Many users wonder why they should bother installing PySpark on their local machine when it’s primarily used in distributed systems. Here’s why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning&lt;/strong&gt;: Experiment and learn Spark concepts without requiring a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototyping&lt;/strong&gt;: Test small data jobs locally before deploying them to a larger environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convenience&lt;/strong&gt;: Debug issues and develop applications with ease.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Boost Your PySpark Productivity
&lt;/h3&gt;

&lt;p&gt;To get the most out of PySpark, consider these tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Up a Virtual Environment&lt;/strong&gt;: Use tools like &lt;code&gt;venv&lt;/code&gt; or &lt;code&gt;conda&lt;/code&gt; to isolate your PySpark installation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrate with IDEs&lt;/strong&gt;: Tools like PyCharm and Jupyter Notebook make PySpark development more interactive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leverage PySpark Documentation&lt;/strong&gt;: Visit &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" rel="noopener noreferrer"&gt;Apache Spark’s documentation&lt;/a&gt; for in-depth guidance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Engage with the PySpark Community
&lt;/h3&gt;

&lt;p&gt;Getting stuck is normal, especially with a powerful tool like PySpark. Engage with the vibrant PySpark community for help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Join Forums&lt;/strong&gt;: Websites like Stack Overflow have dedicated Spark tags.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attend Meetups&lt;/strong&gt;: Spark and Python communities often host events where you can learn and network.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Follow Blogs&lt;/strong&gt;: Many data professionals share their experiences and tutorials online.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Installing PySpark on your local machine may seem daunting at first, but following these steps makes it manageable and rewarding. Whether you’re just starting your data journey or sharpening your skills, PySpark equips you with the tools to tackle real-world data problems.&lt;/p&gt;

&lt;p&gt;PySpark, the Python API for Apache Spark, is a game-changer for data analysis and processing. While its potential is immense, setting it up on your local machine can feel challenging. This article breaks down the process step-by-step, covering everything from installing Java and downloading Spark to testing your setup with a simple script.&lt;/p&gt;

&lt;p&gt;With PySpark installed locally, you can prototype data workflows, learn Spark’s features, and test small-scale projects without needing a full cluster.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Use PySpark for Machine Learning</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Wed, 04 Dec 2024 16:48:00 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/how-to-use-pyspark-for-machine-learning-62l</link>
      <guid>https://dev.to/shittu_olumide_/how-to-use-pyspark-for-machine-learning-62l</guid>
<description>&lt;p&gt;Since its release, Apache Spark (an open-source framework for processing Big Data) has become one of the most widely used technologies for processing large amounts of data in parallel across multiple containers, and it prides itself on efficiency and speed compared to similar software that came before it.&lt;/p&gt;

&lt;p&gt;Working with this technology in Python is possible through PySpark, a Python API that lets you interact with and tap into Apache Spark’s potential using the Python programming language.&lt;/p&gt;

&lt;p&gt;In this article, you will get started with using PySpark to build a machine-learning model using the Logistic Regression algorithm.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Prior knowledge of Python, an IDE like VSCode, the command prompt/terminal, and basic Machine Learning concepts is essential for a proper understanding of this article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By going through this article, you should be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand what Apache Spark is.&lt;/li&gt;
&lt;li&gt;Learn about PySpark and how to use it for Machine Learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What’s PySpark all about?
&lt;/h2&gt;

&lt;p&gt;According to the Apache Spark official &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;website&lt;/a&gt;, PySpark lets you utilize the combined strengths of Apache Spark (simplicity, speed, scalability, versatility) and Python (rich ecosystem, mature libraries, simplicity) for “&lt;strong&gt;&lt;em&gt;data engineering, data science, and machine learning on single-node machines or clusters&lt;/em&gt;&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0nce3dvyxnssgxbilcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0nce3dvyxnssgxbilcn.png" alt="PySpark logo" width="800" height="339"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.databricks.com/glossary/pyspark" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PySpark is the Python API for Apache Spark, which means it serves as an interface that lets your code written in Python communicate with the Apache Spark engine written in Scala. This way, professionals already familiar with the Python ecosystem can quickly adopt Apache Spark, and existing Python libraries remain relevant.&lt;/p&gt;
&lt;h2&gt;
  
  
  Detailed Guide on how to use PySpark for Machine Learning
&lt;/h2&gt;

&lt;p&gt;In the ensuing steps, we will build a machine-learning model using the Logistic Regression algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install project dependencies&lt;/strong&gt;: I’m assuming that you already have Python installed on your machine. If not, install it before moving to the next step. Open your terminal or command prompt and enter the command below to install the &lt;code&gt;pyspark&lt;/code&gt; library.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can install these additional Python libraries if you do not have them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a file and import the necessary libraries&lt;/strong&gt;: Open VSCode, and in your chosen project directory, create a file for your project, e.g., &lt;code&gt;pyspark_model.py&lt;/code&gt;. Open the file and import the necessary libraries for the project.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.feature&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAssembler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BinaryClassificationEvaluator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a Spark session&lt;/strong&gt;: Start a Spark session for the project by adding this code below the imports.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LogisticRegressionExample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read the CSV file (the dataset you will be working with)&lt;/strong&gt;: If you already have your dataset named &lt;code&gt;data.csv&lt;/code&gt; in your project directory, load it using the code below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
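&lt;p&gt;If you don’t have a dataset handy, you can generate a small synthetic &lt;code&gt;data.csv&lt;/code&gt; to follow along. The column names below (&lt;code&gt;feature1&lt;/code&gt;, &lt;code&gt;feature2&lt;/code&gt;, &lt;code&gt;label&lt;/code&gt;) are placeholders, not from any real dataset:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthesize a tiny binary-classification dataset; column names are placeholders
rng = np.random.default_rng(seed=42)
frame = pd.DataFrame({
    "feature1": rng.normal(size=100),
    "feature2": rng.normal(size=100),
})
# The label depends on the features plus a little noise, so a model can learn it
frame["label"] = (frame["feature1"] + frame["feature2"]
                  + rng.normal(scale=0.5, size=100) > 0).astype(int)
frame.to_csv("data.csv", index=False)
```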



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory data analysis&lt;/strong&gt;: This step helps you understand the dataset you are working with. Check for null values and decide on the cleansing approach to use.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Display the schema my
&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="c1"&gt;# Show the first ten rows 
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Count null values in each column
&lt;/span&gt;&lt;span class="n"&gt;missing_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the result
&lt;/span&gt;&lt;span class="n"&gt;missing_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optionally, if you are working with a small dataset, you can convert it to a Pandas DataFrame and use Pandas to check for missing values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pandas_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Use Pandas to check missing values
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pandas_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data preprocessing&lt;/strong&gt;: This step converts the columns/features in the dataset into a format that PySpark’s machine-learning library can work with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use &lt;em&gt;&lt;strong&gt;VectorAssembler&lt;/strong&gt;&lt;/em&gt; to combine all features into a single vector column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Combine feature columns into a single vector column
&lt;/span&gt;&lt;span class="n"&gt;feature_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;assembler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorAssembler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputCols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;feature_columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform the data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assembler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Select only the 'features' and 'label' columns for training
&lt;/span&gt;&lt;span class="n"&gt;final_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Show the transformed data
&lt;/span&gt;&lt;span class="n"&gt;final_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Split the dataset&lt;/strong&gt;: Split the dataset in whatever proportion suits you. Here, we use 70% for training and 30% for testing the model.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;final_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomSplit&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Train your model&lt;/strong&gt;: We are using the Logistic Regression algorithm for training our model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create an instance of the LogisticRegression class and fit the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;featuresCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labelCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Train the model
&lt;/span&gt;&lt;span class="n"&gt;lr_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Make predictions with your trained model&lt;/strong&gt;: Use the model trained in the previous step to make predictions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Show predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;probability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: Here, we evaluate the model’s predictive performance using a suitable evaluation metric.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Evaluate the model using the AUC (Area Under the ROC Curve) metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BinaryClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rawPredictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rawPrediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labelCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;areaUnderROC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compute the AUC
&lt;/span&gt;&lt;span class="n"&gt;auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Area Under ROC: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;auc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The end-to-end code used for this article is shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.feature&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorAssembler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.classification&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.ml.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BinaryClassificationEvaluator&lt;/span&gt;

&lt;span class="c1"&gt;# Start Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LogisticRegressionExample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Load and preprocess data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;assembler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorAssembler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputCols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assembler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Split the data
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomSplit&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Train the model
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;featuresCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labelCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lr_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Make predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;probability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate the model
&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BinaryClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rawPredictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rawPrediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labelCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;areaUnderROC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;auc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Area Under ROC: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;auc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Next steps 🤔
&lt;/h2&gt;

&lt;p&gt;We have reached the end of this article. By following the steps above, you have built your machine-learning model using PySpark.&lt;/p&gt;

&lt;p&gt;Always ensure that your dataset is clean and free of null values before proceeding to the next steps. Lastly, make sure your features all contain numerical values before going ahead to train your model.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Unveiling CodiumAI PR-Agent: A Comparative Analysis Against GitHub Copilot for Pull Requests</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Mon, 18 Dec 2023 14:25:21 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/unveiling-codiumai-pr-agent-a-comparative-analysis-against-github-copilot-for-pull-requests-36jd</link>
      <guid>https://dev.to/shittu_olumide_/unveiling-codiumai-pr-agent-a-comparative-analysis-against-github-copilot-for-pull-requests-36jd</guid>
<description>&lt;p&gt;Comparing CodiumAI PR-Agent to Copilot for Pull Requests&lt;/p&gt;

&lt;p&gt;Are you drowning in pull requests? Are endless revisions slowing you down? Trust me, you’re not alone: traditional review workflows lack speed. Thankfully, AI-powered pull request assistants have arrived, a game-changer in efficiency and accuracy that is reshaping how you work.&lt;/p&gt;

&lt;p&gt;As you must have guessed, there are many solutions out there, but two prominent ones are &lt;a href="https://www.codium.ai/products/git-plugin/" rel="noopener noreferrer"&gt;CodiumAI's PR-Agent&lt;/a&gt; and &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot for Pull Requests&lt;/a&gt;. These titans promise enhanced workflows, superior code quality, and seamless collaboration. &lt;/p&gt;

&lt;p&gt;Before we dive in, it's important to outline what you will learn. This article is more than just a comparison between &lt;strong&gt;CodiumAI PR-Agent&lt;/strong&gt; and &lt;strong&gt;GitHub Copilot for Pull Requests&lt;/strong&gt;: we will build an understanding from the ground up, covering a series of key concepts before diving into the comparison and why it matters. Sounds good?&lt;/p&gt;

&lt;p&gt;This article is aimed at developers and project managers who want to elevate their efficiency, collaboration, and code brilliance in team-based software development.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a PR?
&lt;/h2&gt;

&lt;p&gt;A Pull Request (PR) is a fundamental concept in version control and collaborative software development, primarily used in platforms like &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://about.gitlab.com/" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, or &lt;a href="https://bitbucket.org/product" rel="noopener noreferrer"&gt;Bitbucket&lt;/a&gt;. It's a mechanism that enables developers to propose changes to a codebase hosted in a repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does this work?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A developer makes changes in their local copy of a repository, such as fixing a bug or adding a new feature. To incorporate these changes into the main codebase (often the "&lt;strong&gt;master&lt;/strong&gt;" or "&lt;strong&gt;main&lt;/strong&gt;" branch), they create a Pull Request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating a Pull Request&lt;/strong&gt;: The developer creates a branch in the repository to work on their changes, separating them from the main codebase. Once the changes are complete, they initiate a Pull Request, signaling that they want their changes reviewed and potentially merged into the main branch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review Process&lt;/strong&gt;: Other team members or collaborators are invited to review the proposed changes within the Pull Request. Discussions, comments, and feedback occur within the PR, allowing for collaboration and refinement of the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Merge or Close&lt;/strong&gt;: After the changes in the PR have been reviewed, tested, and approved, the changes are merged into the main branch. If there are issues or the changes are not yet ready, the PR can be closed without merging.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
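&lt;p&gt;The workflow above can be sketched with plain git commands. The snippet below is an illustrative demo in a throwaway repository (the branch name, file, and commit messages are made up); on a real git host, step 2 would be a &lt;code&gt;git push&lt;/code&gt; followed by opening the Pull Request in the web UI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;set -e
# demo in a throwaway repository so the workflow is self-contained
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"
git commit -q --allow-empty -m "initial commit"
git branch -M main                      # the main codebase lives on "main"

# 1. create a branch for the change, separate from the main codebase
git checkout -q -b fix-login-bug
touch login.txt                         # stand-in for real edits
git add login.txt
git commit -q -m "Fix login redirect bug"

# 2. on a real git host you would now publish the branch and open a PR:
#      git push -u origin fix-login-bug
# 3. after review and approval, the PR is merged into main (or closed)
git checkout -q main
git merge -q --no-ff -m "Merge PR: fix login redirect bug" fix-login-bug
git log --oneline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--no-ff&lt;/code&gt; merge mirrors what most git hosts do when a PR is merged: it records an explicit merge commit, so the review boundary stays visible in the history.&lt;/p&gt;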

&lt;h2&gt;
  
  
  What are PR-Agents?
&lt;/h2&gt;

&lt;p&gt;Pull Request Agents are AI-powered assistants designed to enhance the process of reviewing and managing code changes within software development teams. These agents operate within version control platforms, analyzing code modifications proposed by developers before merging them into the main codebase.&lt;/p&gt;

&lt;p&gt;In simpler terms, they're like smart collaborators that help developers improve the quality of their code changes and ensure smooth integration into the larger project. PR-Agents leverage machine learning and natural language processing to suggest improvements, detect potential issues, and assist with code reviews.&lt;/p&gt;

&lt;p&gt;PR-Agent offers extensive pull request functionalities across various git providers: &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://about.gitlab.com/" rel="noopener noreferrer"&gt;GitLab&lt;/a&gt;, &lt;a href="https://bitbucket.org/product" rel="noopener noreferrer"&gt;Bitbucket&lt;/a&gt;, &lt;a href="https://aws.amazon.com/codecommit/" rel="noopener noreferrer"&gt;CodeCommit&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us/products/devops" rel="noopener noreferrer"&gt;Azure DevOps&lt;/a&gt;, and &lt;a href="https://www.gerritcodereview.com/" rel="noopener noreferrer"&gt;Gerrit&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The diagram below illustrates PR-Agent tools and their flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyy62ikhifwc0i9ubade.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyy62ikhifwc0i9ubade.png" alt="PR-Agent illustration" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do streamlined review procedures matter?
&lt;/h2&gt;

&lt;p&gt;Effective code reviews are the backbone of code quality, consistency, and maintainability. Streamlined review processes ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timely Releases&lt;/strong&gt;: Prevent delays in releasing new features or updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Development Workflow&lt;/strong&gt;: Avoid bottlenecks that slow down progress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Code Quality&lt;/strong&gt;: Minimize bugs and errors for a more robust product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positive Developer Experience&lt;/strong&gt;: Avoid frustration and dissatisfaction among team members.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you understand the basics, let's move on to talk about these two technologies in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  CodiumAI PR-Agent
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8sue0eu95tf1qfzozkm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8sue0eu95tf1qfzozkm.png" alt="CodiumAI PR-Agent" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.codium.ai/products/git-plugin/" rel="noopener noreferrer"&gt;CodiumAI PR-Agent&lt;/a&gt; is an open-source tool aiming to help developers review pull requests faster and more efficiently. This tool is built by AI experts &lt;a href="https://www.linkedin.com/in/itamarf/" rel="noopener noreferrer"&gt;Itamar Friedman&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/dedy-kredo/" rel="noopener noreferrer"&gt;Dedy Kredo&lt;/a&gt;, and other team members at &lt;a href="https://www.codium.ai/" rel="noopener noreferrer"&gt;CodiumAI&lt;/a&gt;. The PR-Agent gives developers and repo maintainers information to expedite the PR approval process. It also provides code suggestions that help improve the PR’s integrity, from finding bugs to (soon) providing meaningful tests. This seamless integration allows developers to see the effects of their work, without having to leave the git provider (GitHub, Gitlab, etc.) environment.&lt;/p&gt;

&lt;p&gt;Using the PR-Agent, you can cut review time in half and write cleaner, more efficient code. It automatically analyzes the commits and the PR and can provide several types of feedback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR Review&lt;/strong&gt;: Feedback about the PR main theme, type, relevant tests, security issues, focused PR, and various suggestions for the PR content. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Suggestion&lt;/strong&gt;: Committable code suggestions for improving the PR. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Description&lt;/strong&gt;: Automatically generating PR description - name, type, summary, and code walkthrough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Question Answering&lt;/strong&gt;: Answering free-text questions about the PR.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out CodiumAI PR-Agent on &lt;a href="https://github.com/Codium-ai/pr-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Installation&lt;br&gt;
Follow these steps to install and run the &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/INSTALL.md" rel="noopener noreferrer"&gt;CodiumAI PR-Agent&lt;/a&gt; tool on different git platforms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CodiumAI PR-Agent automatically analyzes the pull request and provides several types of commands:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just mention &lt;code&gt;@CodiumAI-Agent&lt;/code&gt; and add the desired command in any PR comment. The agent will generate a response based on your command, e.g. &lt;code&gt;@CodiumAI-Agent /review&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/describe&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/DESCRIBE.md" rel="noopener noreferrer"&gt;describe tool&lt;/a&gt; scans the PR code changes and automatically generates PR description - title, type, summary, walkthrough, and labels. It can be invoked manually by commenting on any PR. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/review&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/REVIEW.md" rel="noopener noreferrer"&gt;review tool&lt;/a&gt; scans the PR code changes, and automatically generates a PR review. It can be invoked manually by commenting on any PR. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/ask "..."&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/ASK.md" rel="noopener noreferrer"&gt;ask tool&lt;/a&gt; answers questions about the PR, based on the PR code changes. It can be invoked manually by commenting on any PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/improve&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/IMPROVE.md" rel="noopener noreferrer"&gt;improve tool&lt;/a&gt; scans the PR code changes, and automatically generates committable suggestions for improving the PR code. It can be invoked manually by commenting on any PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/update_changelog&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/UPDATE_CHANGELOG.md" rel="noopener noreferrer"&gt;update_changelog tool&lt;/a&gt; automatically updates the CHANGELOG.md file with the PR changes. It can be invoked manually by commenting on any PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/similar_issue&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/SIMILAR_ISSUE.md" rel="noopener noreferrer"&gt;similar issue tool&lt;/a&gt; retrieves the most similar issues to the current issue. It can be invoked manually by commenting on any PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/add_docs&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/ADD_DOCUMENTATION.md" rel="noopener noreferrer"&gt;add_docs tool&lt;/a&gt; scans the PR code changes and automatically suggests documentation for the undocumented code components (functions, classes, etc.). It can be invoked manually by commenting on any PR.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/generate_labels&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/docs/GENERATE_CUSTOM_LABELS.md" rel="noopener noreferrer"&gt;generate_labels tool&lt;/a&gt; scans the PR code changes and given a list of labels and their descriptions, it automatically suggests labels that match the PR code changes. It can be invoked manually by commenting on any PR. &lt;/p&gt;

&lt;p&gt;Check out the usage guide for running the PR-Agent commands via different interfaces &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/Usage.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use CodiumAI PR-Agent?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It emphasizes real-life practical usage: each tool (review, improve, ask, etc.) makes a single GPT-4 call, no more. &lt;/li&gt;
&lt;li&gt;The time taken to obtain an answer is less than 30 seconds. &lt;/li&gt;
&lt;li&gt;CodiumAI PR-Agent supports multiple git providers (GitHub, GitLab, Bitbucket, CodeCommit), multiple ways to use the tool (CLI, GitHub Action, GitHub App, Docker, etc.), and multiple models (GPT-4, GPT-3.5, Anthropic, Cohere, Llama2).&lt;/li&gt;
&lt;li&gt;It is open-source, and contributions are welcome from the community.&lt;/li&gt;
&lt;li&gt;The JSON prompting strategy enables modular, customizable tools. For example, the &lt;code&gt;/review&lt;/code&gt; tool categories can be controlled via the &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/configuration.toml" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; file. Adding additional categories is easy and accessible.&lt;/li&gt;
&lt;li&gt;CodiumAI PR-Agent's &lt;a href="https://github.com/Codium-ai/pr-agent/blob/main/PR_COMPRESSION.md" rel="noopener noreferrer"&gt;compression strategy&lt;/a&gt; enables it to effectively tackle both short and long PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GitHub Copilot for Pull Requests
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot for Pull Requests&lt;/a&gt; is an AI-powered code completion tool developed by &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; in collaboration with &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;. It integrates directly into the code editor, providing real-time suggestions and completions as developers write code. Leveraging OpenAI's GPT (G*&lt;em&gt;enerative Pre-trained Transformer&lt;/em&gt;*) technology, Copilot analyzes the context of the code being written and generates suggestions for entire lines or chunks of code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Copilot helps developers to quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on sites like Stack Overflow and across the internet," says &lt;a href="https://www.linkedin.com/in/natfriedman/" rel="noopener noreferrer"&gt;Nat Friedman&lt;/a&gt;, then GitHub CEO.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It aims to enhance developer productivity by offering intelligent code suggestions, reducing repetitive coding tasks, and providing assistance in various aspects of software development, ultimately aiming to expedite the coding process and improve code quality.&lt;/p&gt;

&lt;p&gt;How GitHub Copilot for Pull Requests works is shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiktbthpqv8jt0b4vq9ke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiktbthpqv8jt0b4vq9ke.png" alt="GitHub Copilot internal workings" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key features of GitHub Copilot for Pull Requests include:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI-Powered Code Suggestions&lt;/strong&gt;: Copilot assists developers by offering suggestions for code completion, including function calls, variable assignments, loops, and more. It provides intelligent hints based on the current context and the desired functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Natural Language Understanding&lt;/strong&gt;: Developers can write code in plain English-like descriptions or comments, and Copilot interprets these and generates corresponding code snippets. This allows for a more natural and expressive way of coding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support for Multiple Languages&lt;/strong&gt;: GitHub Copilot for Pull Requests supports various programming languages, enabling developers to receive contextually relevant code suggestions across a wide range of programming paradigms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autocompletion and Refactoring&lt;/strong&gt;: Copilot not only helps in writing new code but also assists in refactoring existing code by providing suggestions for optimizations, cleaner implementations, and more efficient solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learning from User Interactions&lt;/strong&gt;: As developers use Copilot, the tool learns from the interactions and feedback, improving its suggestions and adapting to developers' coding styles over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaborative Development&lt;/strong&gt;: Copilot facilitates collaborative coding by suggesting shared code snippets, enabling faster and more efficient teamwork.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration with GitHub&lt;/strong&gt;: Seamlessly integrated into the GitHub ecosystem, Copilot augments the coding experience for developers using GitHub's repositories and version control systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GitHub Copilot for Pull Requests commands
&lt;/h3&gt;

&lt;p&gt;GitHub Copilot for Pull Requests provides various commands, referred to as &lt;strong&gt;copilot commands&lt;/strong&gt;, that allow developers to interact with the tool and get specific code suggestions or functionalities. Here's a list and explanation of some common Copilot commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@copilot&lt;/code&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This command serves as a marker tag inserted in the pull request description or code comments to invoke Copilot's functionality. When used in a comment or description, Copilot reads these tags to understand the developer's intent and provide relevant code suggestions or actions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@copilot:all&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A command that showcases all different kinds of content in one go. It's used to request diverse content suggestions from Copilot within a single command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@copilot:summary&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This command expands to provide a concise, one-paragraph summary of the changes or functionality outlined in the pull request. It's useful for quickly grasping the essence of the modifications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@copilot:walkthrough&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When used, this command expands to offer a detailed list of changes, including links to relevant pieces of code. It provides a more comprehensive overview of the modifications made in the pull request.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;@copilot:poem&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An interesting command that requests Copilot to generate a poem related to the changes in the pull request. It's more of a creative or fun way to summarize modifications.&lt;/p&gt;
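<p>Putting these together, a pull request description might embed the tags directly. The example below is hypothetical (the PR text is made up), and follows the tag syntax listed above; each tag expands when Copilot processes the description:</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This PR reworks the login flow to fix the redirect bug.

@copilot:summary

@copilot:walkthrough
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;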

&lt;h2&gt;
  
  
  CodiumAI PR-Agent vs. GitHub Copilot for Pull Requests
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;CodiumAI PR-Agent&lt;/th&gt;
&lt;th&gt;GitHub Copilot for Pull Requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;Readily available&lt;/td&gt;
&lt;td&gt;Waitlist required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Free for individual developers&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supported commands&lt;/td&gt;
&lt;td&gt;Supports many commands&lt;/td&gt;
&lt;td&gt;Supports only one command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git platforms support&lt;/td&gt;
&lt;td&gt;All (GitHub, GitLab, Bitbucket, CodeCommit)&lt;/td&gt;
&lt;td&gt;GitHub only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDE integration&lt;/td&gt;
&lt;td&gt;Available on Visual Studio Code, Vim, JetBrains IDEs&lt;/td&gt;
&lt;td&gt;Available on Visual Studio Code, Vim, Neovim, JetBrains IDEs, Azure Data Studio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple language support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Powered by GPT&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test Generation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Completion&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Analysis&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-powered Features&lt;/td&gt;
&lt;td&gt;Auto-description, code review, Q&amp;amp;A, code suggestions, documentation, etc.&lt;/td&gt;
&lt;td&gt;Pull request summaries, review enhancements, auto-completion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Some limitations of CodiumAI PR-Agent
&lt;/h2&gt;

&lt;p&gt;The PR-Agent is a very resourceful tool, but it also comes with limitations. Let's have a look at some of them.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Quality and Security of Generated Code&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PR-Agent generates code based on patterns and examples from its training data. The quality and security of the suggested code may vary, potentially leading to suboptimal or insecure implementations. It's crucial to review and validate the generated code thoroughly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Biases and Inaccuracies&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like any AI model, PR-Agent might exhibit biases or inaccuracies present in its training data. It might generate code that reflects biases or includes inaccuracies, requiring developers to critically assess and validate the suggestions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Source Collaboration&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transparent and community-backed nature of open-source tools offers great visibility and support from a diverse pool of contributors. However, this collaborative approach might result in a slower pace of development and features that might lack the refined polish found in proprietary tools such as GitHub Copilot for Pull Requests.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limited Understanding of Context&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While PR-Agent excels at providing suggestions, it might lack a deep understanding of specific project contexts, business requirements, or domain-specific intricacies. Developers might need to fine-tune or adapt the suggestions to fit their particular use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Exploring AI-driven pull request tools has been enlightening. CodiumAI's PR-Agent and GitHub Copilot for Pull Requests offer valuable support, each with its distinct strengths. Those valuing customization, adaptability, and immediate accessibility might favor CodiumAI's PR-Agent.&lt;/p&gt;

&lt;p&gt;PR-Agent's extensive command toolkit empowers users to tailor the tool to their preferences, in contrast to Copilot's single-command approach. This granularity facilitates personalized code analysis, pull request descriptions, and interactive functions, fostering a closer bond with the tool and amplifying productivity.&lt;/p&gt;

&lt;p&gt;PR-Agent's platform flexibility enables diverse teams to standardize workflows across git providers, ensuring uniform code quality. Its pricing also makes AI-assisted reviews broadly accessible: the tool is free for individual developers, with paid tiers for teams that need advanced features, and it is ready for instant use with no waitlist. &lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/YZQSwUbqQrc?start=2"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>github</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Building a PDF Summarizer with LangChain</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Mon, 11 Dec 2023 23:59:43 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/building-a-pdf-summarizer-with-langchain-18fe</link>
      <guid>https://dev.to/shittu_olumide_/building-a-pdf-summarizer-with-langchain-18fe</guid>
      <description>&lt;p&gt;We are in an era inundated with information, and the ability to distill complex documents into concise, digestible summaries is a skill in high demand. Imagine a world where lengthy PDFs, often dense with data, could be swiftly summarized with precision. &lt;/p&gt;

&lt;p&gt;There are many paid and free tools that can help summarize documents such as PDFs out there, but you can build your custom PDF summarizer tailored to your taste using tools powered by LLMs.&lt;/p&gt;

&lt;p&gt;In this article, you will learn how to build a PDF summarizer using LangChain and Gradio, and you will be able to see your project live. So if you are looking to get started with LangChain or want to build an LLM-powered application for your portfolio, this tutorial is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt;: This is one of the most useful libraries to help developers build apps powered by LLMs.
LangChain enables LLM models to generate responses based on the most up-to-date information available online and also simplifies the process of organizing large volumes of data so that it can be easily accessed by LLMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install using &lt;code&gt;pip&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;langchain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gradio&lt;/strong&gt;: Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Installation using &lt;code&gt;pip&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;gradio&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To build this LLM-powered application, we will break it down into three simple and easy steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1:
&lt;/h2&gt;

&lt;p&gt;Import the necessary libraries. For this tutorial, we will import LangChain and Gradio.&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gradio&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;

&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whitepaper.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;From the code above: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;from langchain.document_loaders import PyPDFLoader&lt;/code&gt;: Imports the PyPDFLoader class from LangChain, which we use to load "whitepaper.pdf", a PDF located in the same directory as our Python script.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;import gradio as gr&lt;/code&gt;: Imports Gradio, a Python library for creating customizable UI components for machine learning models and functions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 2: 
&lt;/h2&gt;

&lt;p&gt;Create a function to perform the summarization. This function takes the PDF file path and an optional custom prompt as input. Note that it relies on &lt;code&gt;load_summarize_chain&lt;/code&gt; and an LLM, so those are imported and set up first (the OpenAI model shown here is one option; any LangChain-supported LLM works).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_pdf&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;custom_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_and_split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_summarize_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;map_reduce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We define a function named &lt;code&gt;summarize_pdf&lt;/code&gt; that takes a PDF file path and an optional custom prompt.&lt;/li&gt;
&lt;li&gt;Then we use the PyPDFLoader to load and split the PDF document into separate sections.&lt;/li&gt;
&lt;li&gt;We utilize LangChain's summarization capabilities through the &lt;code&gt;load_summarize_chain&lt;/code&gt; function to generate a summary of the loaded document.&lt;/li&gt;
&lt;li&gt;Return the generated summary.&lt;/li&gt;
&lt;/ul&gt;
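&lt;p&gt;The &lt;code&gt;chain_type="map_reduce"&lt;/code&gt; setting is worth a closer look. Conceptually, each chunk of the document is summarized independently (the map step), and the partial summaries are then combined into one final summary (the reduce step). Here is a minimal, LLM-free sketch of that idea; &lt;code&gt;fake_summarize&lt;/code&gt; is a made-up placeholder standing in for a real model call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def fake_summarize(text):
    # placeholder for a real LLM call: keep only the first sentence
    return text.split(".")[0].strip() + "."

def map_reduce_summary(chunks):
    partials = [fake_summarize(c) for c in chunks]   # map step
    return fake_summarize(" ".join(partials))        # reduce step

chunks = [
    "LangChain loads documents. It also splits them.",
    "Gradio builds the UI. It shares a public link.",
]
print(map_reduce_summary(chunks))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why &lt;code&gt;map_reduce&lt;/code&gt; scales to long PDFs: no single model call ever has to fit the whole document in its context window, only one chunk (or the combined partial summaries) at a time.&lt;/p&gt;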

&lt;h2&gt;
  
  
  Step 3: 
&lt;/h2&gt;

&lt;p&gt;This is the step where we set up the Gradio interface.&lt;br&gt;
&lt;br&gt;
 &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;input_pdf_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter the PDF file path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;interface&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Interface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summarize_pdf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_pdf_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output_summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PDF Summarizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This app allows you to summarize your PDF files.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;share&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;main()&lt;/code&gt; function sets up the Gradio UI components for user interaction.&lt;/li&gt;
&lt;li&gt;We define an input field (&lt;code&gt;input_pdf_path&lt;/code&gt;) for users to enter the PDF file path.&lt;/li&gt;
&lt;li&gt;Then we specify an output field (&lt;code&gt;output_summary&lt;/code&gt;) to display the summarized text.&lt;/li&gt;
&lt;li&gt;We create a Gradio interface (&lt;code&gt;interface&lt;/code&gt;) that uses the &lt;code&gt;summarize_pdf&lt;/code&gt; function as its core functionality.&lt;/li&gt;
&lt;li&gt;Finally, we configure the interface with a title, a description, and the input/output components, then launch the UI for user interaction.&lt;/li&gt;
&lt;/ul&gt;
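&lt;p&gt;Note that the &lt;code&gt;gr.inputs&lt;/code&gt; and &lt;code&gt;gr.outputs&lt;/code&gt; namespaces used above were deprecated in Gradio 3.x and removed in later releases; on recent versions the components live at the top level. Here is a minimal sketch of the same interface against the modern API, assuming the &lt;code&gt;summarize_pdf&lt;/code&gt; function defined earlier:&lt;/p&gt;

```python
# Sketch of the same UI for modern Gradio (3.x and later), where components
# are exposed at the top level instead of under gr.inputs / gr.outputs.
import gradio as gr

def summarize_pdf(pdf_path: str) -> str:
    """Placeholder for the summarize_pdf function defined earlier."""
    raise NotImplementedError

def main():
    interface = gr.Interface(
        fn=summarize_pdf,  # takes a file path string, returns a summary string
        inputs=gr.Textbox(label="Enter the PDF file path"),
        outputs=gr.Textbox(label="Summary"),
        title="PDF Summarizer",
        description="This app allows you to summarize your PDF files.",
    )
    interface.launch(share=True)  # share=True also creates a public URL
```

&lt;p&gt;The component names and labels are unchanged; only the namespaces differ.&lt;/p&gt;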

&lt;h2&gt;
  
  
  Launch/ Testing
&lt;/h2&gt;

&lt;p&gt;We can now call the &lt;code&gt;main()&lt;/code&gt; function to run the application.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw745tb6dg48amfnhtkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw745tb6dg48amfnhtkg.png" alt="Demo - One" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the screenshot, you will see that Gradio launched the application on both a local URL and a public, shareable URL (because we passed &lt;code&gt;share=True&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;Here is what the finished project looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficfrqfie1o6u82p334yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficfrqfie1o6u82p334yn.png" alt="PDF Summarizer image" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By harnessing LangChain’s capabilities alongside Gradio’s intuitive interface, we’ve demystified the process of converting lengthy PDF documents into concise, informative summaries.&lt;/p&gt;

&lt;p&gt;This isn’t just about building a tool; it’s about embracing the potential of technology to enhance information accessibility. Through this article, we’ve bridged the gap between complex data and user-friendly insights, empowering individuals to navigate through vast volumes of information effortlessly.&lt;/p&gt;

&lt;p&gt;Happy summarizing! 😎 &lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Full GitHub code: &lt;a href="https://github.com/zenUnicorn/PDF-Summarizer-Using-LangChain" rel="noopener noreferrer"&gt;https://github.com/zenUnicorn/PDF-Summarizer-Using-LangChain&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LangChain documentation: &lt;a href="https://python.langchain.com/docs/get_started/introduction" rel="noopener noreferrer"&gt;https://python.langchain.com/docs/get_started/introduction&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your LLM hallucinates, Why?</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Tue, 28 Nov 2023 23:35:32 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/your-llm-hallucinates-why-4lgk</link>
      <guid>https://dev.to/shittu_olumide_/your-llm-hallucinates-why-4lgk</guid>
<description>&lt;p&gt;It is a fact that large language models (LLMs) have made life easier for people around the world, and AI has seen huge adoption across all areas of life. But as interesting and helpful as this new trend is, it also has its limitations.&lt;/p&gt;

&lt;p&gt;One of the limitations of LLMs is "&lt;strong&gt;&lt;em&gt;hallucination&lt;/em&gt;&lt;/strong&gt;"; others include: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Contextual understanding&lt;/li&gt;
&lt;li&gt;Domain-Specific Knowledge&lt;/li&gt;
&lt;li&gt;Lack of Common Sense Reasoning&lt;/li&gt;
&lt;li&gt;Continual learning constraints&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What does it really mean for an LLM to hallucinate?
&lt;/h2&gt;

&lt;p&gt;Hallucination refers to instances where the model generates information or outputs that are factually incorrect, nonsensical, or not grounded in the provided context.&lt;/p&gt;

&lt;p&gt;We can also say:&lt;/p&gt;

&lt;p&gt;A hallucination occurs when the model generates text or responses that seem plausible on the surface but lack factual accuracy or logical consistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqjplevqazum8w3dy6vr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqjplevqazum8w3dy6vr.jpg" alt="Model hallucination" width="800" height="909"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Things to consider if your model is hallucinating:&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Limit
&lt;/h3&gt;

&lt;p&gt;The token limit is the maximum number of tokens, not exactly words, that the model can process in a single sequence. A token can be a word, a subword, or a single character, each represented as a unit in the model's input.&lt;/p&gt;

&lt;p&gt;This limit is imposed due to computational constraints and memory limitations within the model architecture. It affects both the input and output of the model. When the input text exceeds this limit, the model cannot process the entire sequence at once, potentially leading to truncation, where only a portion of the input is considered. Similarly, for text generation tasks, the output length is also capped by this token limit.&lt;/p&gt;

&lt;p&gt;The token window limit is how much text an LLM can absorb while generating an answer. Your LLM (e.g. &lt;strong&gt;ChatGPT&lt;/strong&gt;) is a sponge: too much water (words) and it cannot absorb any more.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GPT-3&lt;/code&gt; has a limit of &lt;strong&gt;2048&lt;/strong&gt; tokens (&lt;a href="https://platform.openai.com/docs/models/gpt-3-5" rel="noopener noreferrer"&gt;gpt-3.5-turbo has up 16,385 tokens&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GPT-4&lt;/code&gt; originally had a token limit of &lt;strong&gt;8,192&lt;/strong&gt; tokens, while GPT-4 Turbo extends this to &lt;strong&gt;128,000&lt;/strong&gt; tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiqvhlvbvf49w3eodh8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpiqvhlvbvf49w3eodh8a.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, if an LLM has a token limit of &lt;strong&gt;2048&lt;/strong&gt; tokens, any input longer than that would need to be split into smaller segments for processing, potentially affecting the context and coherence of the information being processed.&lt;/p&gt;

&lt;p&gt;This token limit poses challenges when dealing with lengthy texts, complex documents, or tasks that require processing a large amount of information in a single sequence, and this can lead to model hallucinations.&lt;/p&gt;
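&lt;p&gt;In practice, the usual workaround is to split long inputs into chunks that each fit the window and process them one at a time. Here is a minimal sketch, using whitespace-separated words as a stand-in for real tokenization (production code would use the model's actual tokenizer, e.g. tiktoken for OpenAI models):&lt;/p&gt;

```python
# Split text into chunks that each fit within a model's token window.
# Whitespace "tokens" stand in for a real tokenizer here.
def chunk_text(text, max_tokens):
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

document = " ".join("word{}".format(i) for i in range(5000))
chunks = chunk_text(document, 2048)  # e.g. GPT-3's 2048-token window
print(len(chunks))                   # 3 chunks: 2048 + 2048 + 904 words
```

&lt;p&gt;Note that chunking trades the truncation problem for a coherence problem: each chunk is summarized without the context of its neighbors, which is one reason long-document pipelines (like the map-reduce summarization chain used earlier) merge per-chunk results in a second pass.&lt;/p&gt;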

&lt;h3&gt;
  
  
  Token isn't everything
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Claude 2.1&lt;/code&gt; now has a &lt;strong&gt;200,000&lt;/strong&gt; token limit, so you might think it must be the best LLM yet. Not really. An academic study tested &lt;code&gt;Claude 2.1&lt;/code&gt; and &lt;code&gt;GPT-4&lt;/code&gt; on a "&lt;em&gt;&lt;strong&gt;needle in a haystack&lt;/strong&gt;&lt;/em&gt;" scenario, where each LLM had to find a single piece of information buried in a long context. The longer the text, the harder the task. Makes sense?&lt;/p&gt;

&lt;p&gt;The study also shows other factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How attentive to details the LLM is.&lt;/li&gt;
&lt;li&gt;Its ability to discern relevance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So bigger does not necessarily mean better; the study challenges the notion that a larger token limit in an LLM equals better performance.&lt;/p&gt;
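&lt;p&gt;You can reproduce the shape of this evaluation yourself. Here is a minimal sketch of a "needle in a haystack" prompt builder, which buries a known fact at a chosen depth inside filler text so you can then ask a model to retrieve it (the filler sentences, needle, and depth parameter are illustrative, not taken from the paper):&lt;/p&gt;

```python
# Build a "needle in a haystack" prompt: bury one known fact (the needle)
# at a chosen relative depth inside repeated filler text, then ask the
# model to retrieve it. Useful for probing long-context recall.
FILLER = "The grass is green. The sky is blue. The sun is bright. "
NEEDLE = "The secret passphrase is HAYSTACK-42."

def build_haystack(total_sentences, depth):
    """depth is a float from 0.0 (needle first) to 1.0 (needle last)."""
    base = [s.strip() + "." for s in FILLER.split(". ") if s.strip()]
    sentences = (base * (total_sentences // len(base) + 1))[:total_sentences]
    position = int(depth * total_sentences)
    sentences.insert(position, NEEDLE)
    return " ".join(sentences)

haystack = build_haystack(300, depth=0.5)
print(NEEDLE in haystack)  # True: the fact is buried mid-context
```

&lt;p&gt;Scoring a model is then a matter of asking "What is the secret passphrase?" with this haystack as context, and sweeping &lt;code&gt;total_sentences&lt;/code&gt; and &lt;code&gt;depth&lt;/code&gt; to see where recall degrades.&lt;/p&gt;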

&lt;h3&gt;
  
  
  Training data bias
&lt;/h3&gt;

&lt;p&gt;What you feed an LLM determines its quality, which means the quality of the training data matters. When the training data is bad or skewed, the model can reflect those flaws, potentially perpetuating or amplifying societal biases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Recognizing these limitations helps in using LLMs effectively while considering their strengths and weaknesses. Addressing and minimizing hallucination in LLMs involves ongoing research and development to enhance the model's contextual understanding, improve fact-checking capabilities, and refine its ability to generate accurate and contextually appropriate responses.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Synergy in AI: Bridging the Gap Between Data Science and Business</title>
      <dc:creator>Shittu Olumide</dc:creator>
      <pubDate>Tue, 24 Oct 2023 09:23:57 +0000</pubDate>
      <link>https://dev.to/shittu_olumide_/synergy-in-ai-bridging-the-gap-between-data-science-and-business-510j</link>
      <guid>https://dev.to/shittu_olumide_/synergy-in-ai-bridging-the-gap-between-data-science-and-business-510j</guid>
<description>&lt;p&gt;A 2023 report by &lt;a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/new-years-resolutions-for-tech-in-2023" rel="noopener noreferrer"&gt;McKinsey &amp;amp; Company&lt;/a&gt; found that companies that are data-driven are more likely to outperform their peers by up to 5%. In 2021, a study by &lt;a href="https://hbr.org/2019/01/data-science-and-the-art-of-persuasion" rel="noopener noreferrer"&gt;Harvard Business Review&lt;/a&gt; found that companies that have a strong data culture are more likely to be innovative and successful.&lt;/p&gt;

&lt;p&gt;These statistics show that data science is becoming increasingly important for businesses of all sizes. However, there is still a gap between data scientists and business users. This gap can lead to a number of problems, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data scientists not understanding the needs of the business.&lt;/li&gt;
&lt;li&gt;Business users not understanding the technical capabilities of data science.&lt;/li&gt;
&lt;li&gt;Data science projects not being aligned with business goals.&lt;/li&gt;
&lt;li&gt;Data science insights not being used to make business decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At its core, synergy in AI represents the harmonious integration of data science and business operations. It’s the science of combining data-driven insights with strategic decision-making, creating a dynamic partnership that enhances the overall performance and competitiveness of an organization. In essence, it’s about leveraging the incredible potential of AI and data analytics to fuel business growth, improve efficiency, and stay ahead in an increasingly data-centric world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world examples of successful synergy
&lt;/h2&gt;

&lt;p&gt;The concept of synergy between data science and business is not theoretical; it has already yielded remarkable results in real-world scenarios. Several organizations have harnessed the power of this synergy to achieve outstanding outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Netflix&lt;/strong&gt;: The streaming giant employs data science to understand user preferences, recommend content, and personalize the user experience. By seamlessly integrating data science into their business model, Netflix has become a content powerhouse, creating hit shows and retaining millions of subscribers worldwide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon&lt;/strong&gt;: Amazon’s recommendation engine, powered by advanced data analytics, has significantly boosted its sales. This collaborative approach between data scientists and business teams enables them to offer personalized product suggestions, improving customer satisfaction and driving revenue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uber&lt;/strong&gt;: Uber utilizes data science not only for optimizing ride routes but also for surge pricing, matching drivers with riders, and predicting demand patterns. This collaborative approach has not only made ride-sharing convenient for customers but also enhanced drivers’ earnings and the company’s overall profitability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why bridge the gap?
&lt;/h2&gt;

&lt;p&gt;Why is bridging the gap between data science and business so crucial in today’s landscape? The answer lies in the transformative power of data. Data has evolved from being merely a byproduct of business operations to a strategic asset that, when harnessed effectively, can unlock a world of possibilities.&lt;/p&gt;

&lt;p&gt;On one side of the spectrum, data scientists possess the expertise to mine, analyze, and derive insights from vast datasets. On the other side, business leaders have the vision and acumen to steer their organizations toward success. The synergy between these two worlds not only bridges the gap but also unlocks the full potential of data, transforming it into actionable intelligence that drives innovation and growth.&lt;/p&gt;

&lt;p&gt;Here are some other reasons why there is a need to bridge the gap:&lt;/p&gt;

&lt;h3&gt;
  
  
  Breaking down silos
&lt;/h3&gt;

&lt;p&gt;One of the most pressing needs is to break down the silos that traditionally separate these two domains. Silos often develop when data scientists and business professionals work in isolation, each with their own set of objectives, tools, and language. These silos can lead to inefficiencies, miscommunication, and missed opportunities.&lt;/p&gt;

&lt;p&gt;Breaking down silos means fostering a culture of collaboration and communication between data science and business teams. It involves creating an environment where data scientists and business professionals can interact seamlessly, share insights, and align their efforts towards common goals. When silos are dismantled, data scientists gain a deeper understanding of business needs, and business professionals become more data-savvy, resulting in a powerful synergy that drives innovation and growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Collaborative approaches
&lt;/h3&gt;

&lt;p&gt;Synergy in AI is not merely about coexistence; it’s about active collaboration. Collaborative approaches entail bringing data scientists and business professionals together as equal partners in the decision-making process. By involving data scientists early on in business discussions and vice versa, organizations can ensure that data-driven insights are integrated into strategic planning and day-to-day operations.&lt;/p&gt;

&lt;p&gt;Collaboration also means fostering cross-functional teams where individuals with diverse skills and expertise work together to solve complex problems. Data scientists and business experts can combine their strengths to formulate hypotheses, analyze data, and develop actionable strategies. This collaborative spirit encourages creativity, adaptability, and a holistic perspective, all of which are vital in navigating the data-driven landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bridging the gap
&lt;/h2&gt;

&lt;p&gt;To bridge the gap between these two domains effectively, organizations must employ a multi-faceted approach that encompasses cross-training and education, integrated teams, and the utilization of cutting-edge tools and technologies for collaboration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-training and education
&lt;/h3&gt;

&lt;p&gt;To foster collaboration and understanding between data scientists and business professionals, organizations should invest in cross-training and education initiatives. This approach involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Educational workshops&lt;/strong&gt;: Conduct workshops and training programs that help data scientists grasp the intricacies of business operations and enable business professionals to understand the fundamentals of data science. These sessions should be interactive and encourage knowledge sharing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interdisciplinary courses&lt;/strong&gt;: Encourage employees to pursue interdisciplinary courses or certifications that combine data science and business topics. This not only enhances their skill sets but also breaks down barriers between departments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mentorship programs&lt;/strong&gt;: Implement mentorship programs where experienced data scientists mentor business professionals (and vice versa). This one-on-one guidance fosters a deeper understanding of each other’s roles and responsibilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Integrated teams
&lt;/h2&gt;

&lt;p&gt;Breaking down the silos between data science and business departments is paramount for achieving synergy. Building integrated teams involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-functional teams&lt;/strong&gt;: Form cross-functional teams that comprise data scientists, business analysts, marketing experts, and other relevant roles. These teams work together on projects, bringing diverse perspectives to the table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared goals&lt;/strong&gt;: Define shared objectives that encourage collaboration. When teams have common goals and KPIs, they are more likely to work together seamlessly to achieve them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Regular meetings&lt;/strong&gt;: Hold regular meetings where both data science and business professionals come together to discuss progress, challenges, and insights. These meetings promote open communication and knowledge exchange.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tools and technologies for collaboration
&lt;/h2&gt;

&lt;p&gt;To enable effective collaboration, organizations should leverage tools and technologies tailored to the needs of both data science and business teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data visualization tools&lt;/strong&gt;: Invest in data visualization platforms that allow business professionals to easily interpret complex data and gain actionable insights. These tools bridge the gap between raw data and decision-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collaborative analytics platforms&lt;/strong&gt;: Implement collaborative analytics platforms that facilitate real-time collaboration between data scientists and business analysts. These platforms enable joint data exploration and modeling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Communication and documentation tools&lt;/strong&gt;: Use communication and documentation tools that ensure all team members have access to project updates, data sources, and analysis documentation. Transparency is key to successful collaboration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benefits of Synergy in AI
&lt;/h2&gt;

&lt;p&gt;In this section, we explore the benefits of this synergy in depth:&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Decision-Making
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data-driven insights&lt;/strong&gt;: Synergy in AI ensures that business decisions are grounded in data-backed insights. Data scientists can analyze vast datasets, identify patterns, and provide actionable recommendations to inform strategic choices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Predictive analytics&lt;/strong&gt;: By leveraging AI and machine learning models, businesses can predict future trends, customer behavior, and market dynamics. This foresight empowers organizations to make proactive decisions and stay ahead of the competition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Improved Efficiency and Productivity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automation&lt;/strong&gt;: AI-powered tools and algorithms can automate repetitive tasks, reducing manual workloads and allowing employees to focus on higher-value tasks. This leads to increased productivity and cost savings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimized operations&lt;/strong&gt;: Data-driven insights can help streamline processes and optimize resource allocation, ensuring that businesses operate more efficiently and effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Innovation and Competitive Advantage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Market responsiveness&lt;/strong&gt;: Synergy in AI enables organizations to adapt swiftly to changing market conditions. They can respond to customer preferences and market shifts in real-time, gaining a competitive edge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New revenue streams&lt;/strong&gt;: Collaborative AI initiatives often lead to the development of innovative products or services, opening up new revenue streams and business opportunities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Customer-Centric Strategies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personalization&lt;/strong&gt;: AI-driven customer segmentation and personalization strategies allow businesses to tailor their offerings to individual customer preferences. This leads to higher customer satisfaction and loyalty.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced customer experience&lt;/strong&gt;: With AI-powered chatbots, recommendation engines, and sentiment analysis, organizations can deliver exceptional customer experiences, leading to increased customer retention.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Risk Mitigation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anomaly detection&lt;/strong&gt;: AI models can identify unusual patterns or outliers in data, helping organizations detect fraud, security breaches, or operational anomalies early, and minimizing potential risks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compliance and regulation&lt;/strong&gt;: By using AI to ensure data compliance and monitor regulatory changes, businesses can avoid legal and financial penalties.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Monetization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data-driven products:&lt;/strong&gt; Synergy between data science and business can lead to the creation of data-driven products or services that can be monetized, diversifying revenue streams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partnerships and collaborations&lt;/strong&gt;: Organizations can leverage their data assets to establish partnerships or collaborations with other businesses, expanding their market reach.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Measurable ROI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantifiable outcomes&lt;/strong&gt;: Synergy in AI often yields measurable results, making it easier to calculate the return on investment (ROI) of data science initiatives and justifying ongoing investments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Employee Satisfaction and Development
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-training&lt;/strong&gt;: Encouraging collaboration between data scientists and business teams fosters a culture of continuous learning and skill development. Employees become more versatile and adaptable in their roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Throughout this article, we’ve emphasized the critical importance of synergy between data science and business. As we look ahead, businesses are encouraged to proactively foster collaboration between their data science and business teams. Breaking down the silos that have traditionally separated these two domains is not just a strategic advantage; it’s a necessity in today’s data-driven era. Encouraging cross-training, investing in integrated technologies, and cultivating a culture of cooperation will be essential steps toward realizing the full potential of synergy in AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;McKinsey &amp;amp; Company: The Value of Data: &lt;a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/new-years-resolutions-for-tech-in-2023" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/new-years-resolutions-for-tech-in-2023&lt;/a&gt;&lt;br&gt;
Harvard Business Review: Why Data Science Is More Important Than Ever: &lt;a href="https://hbr.org/2019/01/data-science-and-the-art-of-persuasion" rel="noopener noreferrer"&gt;https://hbr.org/2019/01/data-science-and-the-art-of-persuasion&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>discuss</category>
      <category>career</category>
    </item>
  </channel>
</rss>
