<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bill Chambers</title>
    <description>The latest articles on DEV Community by Bill Chambers (@bllchmbrs).</description>
    <link>https://dev.to/bllchmbrs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1197598%2Fc88667ac-de21-4b41-b40b-91686e81523a.jpeg</url>
      <title>DEV Community: Bill Chambers</title>
      <link>https://dev.to/bllchmbrs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bllchmbrs"/>
    <language>en</language>
    <item>
      <title>Structured Data Extraction using LLMs and Guardrails AI</title>
      <dc:creator>Bill Chambers</dc:creator>
      <pubDate>Tue, 14 Nov 2023 21:57:37 +0000</pubDate>
      <link>https://dev.to/bllchmbrs/structured-data-extraction-using-llms-and-guardrails-ai-3eb3</link>
      <guid>https://dev.to/bllchmbrs/structured-data-extraction-using-llms-and-guardrails-ai-3eb3</guid>
      <description>&lt;p&gt;In this post, we're going to be extracting structured data from a podcast transcript. We've all seen the ability for generative models to be effective &lt;em&gt;generating&lt;/em&gt; text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about extracting data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data extraction is a far more common use case (today) than generation, in particular for businesses. Businesses have to process all kinds of documents, exchange them and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emerging Use Case: LLMs for data extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you think about it, summarization is effectively data extraction. We've seen LLMs perform well at summarization, but did you know that they can extract structured data quite well?&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll learn from this post
&lt;/h2&gt;

&lt;p&gt;In this post, you'll learn how to extract structured data from LLMs into Pydantic objects using the &lt;a href="https://www.guardrailsai.com/"&gt;Guardrails&lt;/a&gt; library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's us setting up on environment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.9.18
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;freeze&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="n"&gt;guardrails&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ai&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;guardrails-ai==0.2.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pydantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.10.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="n"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Background on Guardrails AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.guardrailsai.com/"&gt;Guardrails&lt;/a&gt; is a project that appears to be backed by a company. The project, aptly named Guardrails, is centered around the following concept:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is Guardrails?&lt;/p&gt;

&lt;p&gt;Guardrails AI is an open-source library designed to ensure reliable interactions with Large Language Models (LLMs). It provides:&lt;/p&gt;

&lt;p&gt;✅ A framework for creating custom validators&lt;br&gt;
✅ Orchestration of prompting, verification, and re-prompting&lt;br&gt;
✅ A library of commonly used validators for various use cases&lt;br&gt;
✅ A specification language for communicating requirements to LLM&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Guardrails shares a similar philosophy with Instructor and Marvin. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Guardrails AI enables you to define and enforce assurance for AI applications, from structuring output to quality controls. It achieves this by creating a 'Guard', a firewall-like bounding box around the LLM application, which contains a set of validators. A Guard can include validators from our library or a custom validator that enforces your application's intended function.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from &lt;a href="https://www.guardrailsai.com/blog/0.2-release"&gt;the blog&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--at_k3fOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/guardrails-architecture.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--at_k3fOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/guardrails-architecture.webp" alt="guardrails approach" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, Guardrails is in beta and has not had a formal release, so it may not be suitable for production use cases.&lt;/p&gt;

&lt;p&gt;However, Guardrails has a broader mission. They aim to create a "bounding box" around LLM apps to validate and ensure quality. They plan to achieve this by introducing a &lt;code&gt;.RAIL&lt;/code&gt; file type, a dialect of XML. This ambitious approach extends beyond the simple concept of "using AI to extract data".&lt;/p&gt;

&lt;p&gt;Here's their explanation of &lt;code&gt;RAIL&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤖 What is &lt;code&gt;RAIL&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.RAIL&lt;/code&gt; is a dialect of XML, standing for "&lt;strong&gt;R&lt;/strong&gt;eliable &lt;strong&gt;AI&lt;/strong&gt; markup &lt;strong&gt;L&lt;/strong&gt;anguage". It can be used to define:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The structure of the expected outcome of the LLM. (E.g. JSON)&lt;/li&gt;
&lt;li&gt;The type of each field in the expected outcome. (E.g. string, integer, list, object)&lt;/li&gt;
&lt;li&gt;The quality criteria for the expected outcome to be considered valid. (E.g. generated text should be bias-free, generated code should be bug-free)&lt;/li&gt;
&lt;li&gt;The corrective action to take in case the quality criteria is not met. (E.g. reask the question, filter the LLM, programmatically fix, etc.)&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Warning...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another challenge I had was the dependencies, Guardrails requires &lt;code&gt;pydantic version 1.10.9&lt;/code&gt;.  &lt;a href="https://docs.pydantic.dev/latest/"&gt;Pydantic&lt;/a&gt;'s latest version is &lt;a href="https://github.com/pydantic/pydantic/releases/tag/v2.4.1"&gt;v.2.4.1&lt;/a&gt;. This caused a number of issues for me during installation and I had to setup a dedicated environment for it. A challenge is that&lt;a href="https://github.com/pydantic/pydantic#pydantic-v110-vs-v2"&gt; pydantic went through a ground up rewrite from version 1.10&lt;/a&gt;, so it may simply be challenging for them to rewrite everything.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;feedparser&lt;/span&gt;

&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.substack.com/feed/podcast/1084089.rss"&lt;/span&gt; &lt;span class="c1"&gt;# latent space podcastbbbbb
&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;episode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;episode_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;episode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'summary'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;Thanks to the &amp;lt;/em&amp;gt;&amp;lt;em&amp;gt;over 11,000 people&amp;lt;/em&amp;gt;&amp;lt;em&amp;gt; who joined us for the first AI Engineer Su
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unstructured.partition.html&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;

&lt;span class="n"&gt;parsed_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="n"&gt;start_of_transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transcript"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"First line of the transcript: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3508&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# shortening the transcript for speed &amp;amp; cost
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First line of the transcript: 58
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Using Guardrails
&lt;/h2&gt;

&lt;p&gt;Guardrails is similar to &lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;other libraries for structured data extraction&lt;/a&gt; that I've used. We define our pydantic models and then we specify an output class and a prompt.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;guardrails&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gd&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;school&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The school this person attended"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The company this person works for"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;People&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pydantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;People&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Get the following objects from the text:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt; ${text}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important thing to note, like we mentioned above, is that Guardrails is translating this into a "RAIL" an underlying abstraction for representing these extractions.&lt;/p&gt;

&lt;p&gt;Now a key difference is that this abstraction might allow for structured data extraction from ANY model, not just OpenAI's function calling interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;raw_llm_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validated_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'people': [{'name': 'Alessio', 'school': 'Residence at Decibel Partners', 'company': 'CTO'}, {'name': 'Swyx', 'school': 'Smol.ai', 'company': 'founder'}, {'name': 'Kanjun', 'school': 'Imbue', 'company': 'founder'}]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It did a fine job getting the people, although you can see that it got the company and schools wrong - especially compared to the other libraries we've looked at, this is a lower quality result.&lt;/p&gt;

&lt;p&gt;let's see how it does when we ask for more objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;paper_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"an academic paper reference discussed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;research_papers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pydantic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Get the following objects from the text:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt; ${text}"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_llm_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validated_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/.../python3.9/site-packages/guardrails/prompt/instructions.py:32: UserWarning: Instructions do not have any variables, if you are migrating follow the new variable convention documented here: https://docs.getguardrails.ai/0-2-migration/
  warn(
/.../python3.9/site-packages/guardrails/prompt/prompt.py:23: UserWarning: Prompt does not have any variables, if you are migrating follow the new variable convention documented here: https://docs.getguardrails.ai/0-2-migration/
  warnings.warn(


incorrect_value={'people': [{'name': 'Alessio', 'school': 'Decibel Partners', 'company': 'CTO'}, {'name': 'Swyx', 'school': 'Smol.ai', 'company': 'founder'}, {'name': 'Kanjun', 'school': 'Imbue', 'company': 'founder'}], 'companies': [{'name': 'Decibel Partners'}, {'name': 'Residence'}, {'name': 'Smol.ai'}, {'name': 'Imbue'}]} fail_results=[FailResult(outcome='fail', metadata=None, error_message='JSON does not match schema', fix_value=None)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unfortunately, the result failed to parse, while the other two &lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;structured data extraction libraries&lt;/a&gt; succeeded just fine. It might be in the implementation or might be a bug in this version, but it didn't work out of the box.&lt;/p&gt;

&lt;p&gt;While the example didn't work, you can see how well this &lt;em&gt;should work&lt;/em&gt;. These models and libraries grab a whole bunch of structured data for us with basically no work. Right now we're just working on a excerpt of this data, but with basically zero NLP specific work, we're able to get some pretty powerful results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping it all up
&lt;/h2&gt;

&lt;p&gt;The sky is really the limit here. There's so much structured data that's been locked up in unstructured text. I'm bullish on this space and can't wait to see what you build with this tool!&lt;/p&gt;

&lt;p&gt;If you found this interesting, you might want to see similar posts on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;Comparing 3 Data Extraction Libraries: Marvin, Instructor, and Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-marvin-ai-and-llms"&gt;Structured Data Extraction using LLMs and Marvin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-instructor-and-llms"&gt;Structured Data Extraction using LLMs and Instructor&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post was originally published on &lt;a href="https://learnbybuilding.ai"&gt;LearnByBuilding.AI&lt;/a&gt;. &lt;em&gt;See the &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-guardrails-and-llms.ipynb"&gt;notebook&lt;/a&gt; and &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-guardrails-and-llms.py"&gt;code&lt;/a&gt; on &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/"&gt;Github&lt;/a&gt;.&lt;/em&gt; Please reach out to me on &lt;a href="https://twitter.com/bllchmbrs"&gt;twitter&lt;/a&gt; with any questions or feedback!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Structured Data Extraction using LLMs and Instructor</title>
      <dc:creator>Bill Chambers</dc:creator>
      <pubDate>Mon, 06 Nov 2023 17:04:07 +0000</pubDate>
      <link>https://dev.to/bllchmbrs/structured-data-extraction-using-llms-and-instructor-811</link>
      <guid>https://dev.to/bllchmbrs/structured-data-extraction-using-llms-and-instructor-811</guid>
      <description>&lt;h1&gt;
  
  
  Structured Data Extraction using LLMs and Instructor
&lt;/h1&gt;

&lt;p&gt;In this post, we're going to be extracting structured data from a podcast transcript. We've all seen the ability for generative models to be effective &lt;em&gt;generating&lt;/em&gt; text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about extracting data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data extraction is a far more common use case (today) than generation, in particular for businesses. Businesses have to process all kinds of documents, exchange them and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emerging Use Case: LLMs for data extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you think about it, summarization is effectively data extraction. We've seen LLMs perform well at summarization, but did you know that they can extract structured data quite well?&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll learn from this post
&lt;/h2&gt;

&lt;p&gt;In this post, you'll learn how to extract structured data from LLMs into Pydantic objects using the &lt;a href="https://jxnl.github.io/instructor/"&gt;Instructor&lt;/a&gt; library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's us setting up on environment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.11.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;freeze&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="n"&gt;instructor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;instructor==0.2.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pydantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.4.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="n"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Background on Instructor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jxnl.github.io/instructor/"&gt;Instructor&lt;/a&gt; is a library that helps get structured data out of LLMs. It has narrow dependencies and a simple API, requiring basic nothing sophisticated from the end user.&lt;/p&gt;

&lt;p&gt;In the author's words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Instructor&lt;/code&gt; helps to ensure you get the exact response type you're looking for when using openai's function call api. Once you've defined the &lt;code&gt;Pydantic&lt;/code&gt; model for your desired response, &lt;code&gt;Instructor&lt;/code&gt; handles all the complicated logic in-between - from the parsing/validation of the response to the automatic retries for invalid responses. This means that we can build in validators 'for free' and have a clear separation of concerns between the prompt and the code that calls openai.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The library is still early, version 0.2.9 has a modest star count (1300) but hits above it's weight since it's basically a single person working on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jOLPIp2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/instructor-contributions-page.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jOLPIp2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/instructor-contributions-page.webp" alt="instructor contributors page" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now just because it's one person doesn't mean it's bad. In fact, I found it to be very simple to use. It just means that from a long term maintenance perspective, it might be challenging. Something to note.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pre-processing the data
&lt;/h2&gt;

&lt;p&gt;Here, we are fetching and parsing an RSS feed from a podcast. We specifically target an episode by its title and then retrieve its summary from the podcast description. This is similar to a &lt;a href="https://learnbybuilding.ai/tutorials/rag-chatbot-on-podcast-llamaindex-faiss-openai"&gt;tutorial on llamaindex&lt;/a&gt; that we recently published.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;feedparser&lt;/span&gt;

&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.substack.com/feed/podcast/1084089.rss"&lt;/span&gt; &lt;span class="c1"&gt;# latent space podcastbbbbb
&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;episode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;episode_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;episode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'summary'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;Thanks to the &amp;lt;/em&amp;gt;&amp;lt;em&amp;gt;over 11,000 people&amp;lt;/em&amp;gt;&amp;lt;em&amp;gt; who joined us for the first AI Engineer Su
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we're going to extract the shortened episode summary. We'll also shorten the summary just to speed up our extraction. In the future, we can extend the transcript to cover the whole thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unstructured.partition.html&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;

&lt;span class="n"&gt;parsed_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="n"&gt;start_of_transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transcript"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"First line of the transcript: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3508&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# shortening the transcript for speed &amp;amp; cost
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First line of the transcript: 58
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Using Instructor
&lt;/h2&gt;

&lt;p&gt;Instructor is the "thinnest" of &lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;the libraries that I've used for structured data extraction&lt;/a&gt;. The interface is super simple.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;school&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The school this person attended"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The company this person works for "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;People&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All we do is &lt;a href="https://en.wikipedia.org/wiki/Monkey_patch#:~:text=Monkey%20patching%20is%20a%20technique,Python%2C%20Groovy%2C%20etc."&gt;Monkey patch&lt;/a&gt;) OpenAI's SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This could be hard to maintain, so take note of versions and be sure to not let either one get upgraded without testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;instructor&lt;/span&gt;

&lt;span class="n"&gt;instructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we simply call OpenAI but ask for a &lt;code&gt;response_model&lt;/code&gt;. This is the instructor patch at work!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;People&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;people=[Person(name='Alessio', school=None, company='Decibel Partners'), Person(name='Swyx', school=None, company='Smol.ai'), Person(name='Kanjun', school='MIT', company='Imbue'), Person(name='Josh', school=None, company='Imbue')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Overall the result is high quality, we've gotten all the names and companies.&lt;/p&gt;

&lt;p&gt;Now we can take this to the next level by trying to extract multiple objects at the same time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;paper_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"an academic paper reference discussed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;research_papers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;people=[Person(name='Alessio', school=None, company='Decibel Partners'), Person(name='Swyx', school=None, company='Smol.ai'), Person(name='Kanjun', school='MIT', company='Imbue'), Person(name='Josh', school=None, company='Ember')] companies=[Company(name='Decibel Partners'), Company(name='Smol.ai'), Company(name='Imbue'), Company(name='Generally Intelligent'), Company(name='Ember'), Company(name='Sorceress'), Company(name='Dropbox'), Company(name='MIT Media Lab'), Company(name='OpenA')] research_papers=None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see how well this works, to grab a whole bunch of structured data for us with basically no work. Right now we're just working on a excerpt of this data, but with basically zero NLP specific work, we're able to get some pretty powerful results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping it all up
&lt;/h2&gt;

&lt;p&gt;The sky is really the limit here. There's so much structured data that's been locked up in unstructured text. I'm bullish on this space and can't wait to see what you build with this tool!&lt;/p&gt;

&lt;p&gt;If you found this interesting, you might want to see similar posts on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;Comparing 3 Data Extraction Libraries: Marvin, Instructor, and Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-marvin-ai-and-llms"&gt;Structured Data Extraction using LLMs and Marvin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-guardrails-and-llms"&gt;Structured Data Extraction using LLMs and Guardrails AI&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post was originally published on &lt;a href="https://learnbybuilding.ai"&gt;LearnByBuilding.AI&lt;/a&gt;. &lt;em&gt;See the &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-instructor-and-llms.ipynb"&gt;notebook&lt;/a&gt; and &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-instructor-and-llms.py"&gt;code&lt;/a&gt; on &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/"&gt;Github&lt;/a&gt;.&lt;/em&gt; Please reach out to me on &lt;a href="https://twitter.com/bllchmbrs"&gt;twitter&lt;/a&gt; with any questions or feedback!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>nlp</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Structured Data Extraction using GPT-4 and Instructor</title>
      <dc:creator>Bill Chambers</dc:creator>
      <pubDate>Mon, 06 Nov 2023 16:50:39 +0000</pubDate>
      <link>https://dev.to/bllchmbrs/title-structured-data-extraction-using-gpt-4-and-instructor-3dim</link>
      <guid>https://dev.to/bllchmbrs/title-structured-data-extraction-using-gpt-4-and-instructor-3dim</guid>
      <description>&lt;h2&gt;
  
  
  Structured Data Extraction using LLMs and Instructor
&lt;/h2&gt;

&lt;p&gt;In this post, we're going to be extracting structured data from a podcast transcript. We've all seen the ability for generative models to be effective &lt;em&gt;generating&lt;/em&gt; text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what about extracting data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data extraction is a far more common use case (today) than generation, in particular for businesses. Businesses have to process all kinds of documents, exchange them and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emerging Use Case: LLMs for data extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you think about it, summarization is effectively data extraction. We've seen LLMs perform well at summarization, but did you know that they can extract structured data quite well?&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll learn from this post
&lt;/h2&gt;

&lt;p&gt;In this post, you'll learn how to extract structured data from LLMs into Pydantic objects using the &lt;a href="https://jxnl.github.io/instructor/"&gt;Instructor&lt;/a&gt; library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Here's us setting up on environment
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.11.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;freeze&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;grep&lt;/span&gt; &lt;span class="n"&gt;instructor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;instructor==0.2.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pydantic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.4.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="n"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Background on Instructor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://jxnl.github.io/instructor/"&gt;Instructor&lt;/a&gt; is a library that helps get structured data out of LLMs. It has narrow dependencies and a simple API, requiring basic nothing sophisticated from the end user.&lt;/p&gt;

&lt;p&gt;In the author's words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;Instructor&lt;/code&gt; helps to ensure you get the exact response type you're looking for when using openai's function call api. Once you've defined the &lt;code&gt;Pydantic&lt;/code&gt; model for your desired response, &lt;code&gt;Instructor&lt;/code&gt; handles all the complicated logic in-between - from the parsing/validation of the response to the automatic retries for invalid responses. This means that we can build in validators 'for free' and have a clear separation of concerns between the prompt and the code that calls openai.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The library is still early, version 0.2.9 has a modest star count (1300) but hits above it's weight since it's basically a single person working on it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jOLPIp2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/instructor-contributions-page.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jOLPIp2D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/instructor-contributions-page.webp" alt="instructor contributors page" width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now just because it's one person doesn't mean it's bad. In fact, I found it to be very simple to use. It just means that from a long term maintenance perspective, it might be challenging. Something to note.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pre-processing the data
&lt;/h2&gt;

&lt;p&gt;Here, we are fetching and parsing an RSS feed from a podcast. We specifically target an episode by its title and then retrieve its summary from the podcast description. This is similar to a &lt;a href="https://learnbybuilding.ai/tutorials/rag-chatbot-on-podcast-llamaindex-faiss-openai"&gt;tutorial on llamaindex&lt;/a&gt; that we recently published.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;feedparser&lt;/span&gt;

&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.substack.com/feed/podcast/1084089.rss"&lt;/span&gt; &lt;span class="c1"&gt;# latent space podcastbbbbb
&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feedparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;podcast_atom_link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;episode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ep&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"Why AI Agents Don't Work (yet) - with Kanjun Qiu of Imbue"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;episode_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;episode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'summary'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;p&amp;gt;&amp;lt;em&amp;gt;Thanks to the &amp;lt;/em&amp;gt;&amp;lt;em&amp;gt;over 11,000 people&amp;lt;/em&amp;gt;&amp;lt;em&amp;gt; who joined us for the first AI Engineer Su
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we're going to extract the shortened episode summary. We'll also shorten the summary just to speed up our extraction. In the future, we can extend the transcript to cover the whole thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;unstructured.partition.html&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;

&lt;span class="n"&gt;parsed_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partition_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episode_summary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;span class="n"&gt;start_of_transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transcript"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"First line of the transcript: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed_summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_of_transcript&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3508&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# shortening the transcript for speed &amp;amp; cost
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First line of the transcript: 58
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Using Instructor
&lt;/h2&gt;

&lt;p&gt;Instructor is the "thinnest" of &lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;the libraries that I've used for structured data extraction&lt;/a&gt;. The interface is super simple.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;school&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The school this person attended"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"The company this person works for "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;People&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All we do is &lt;a href="https://en.wikipedia.org/wiki/Monkey_patch#:~:text=Monkey%20patching%20is%20a%20technique,Python%2C%20Groovy%2C%20etc."&gt;Monkey patch&lt;/a&gt;) OpenAI's SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This could be hard to maintain, so take note of versions and be sure to not let either one get upgraded without testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;instructor&lt;/span&gt;

&lt;span class="n"&gt;instructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we simply call OpenAI but ask for a &lt;code&gt;response_model&lt;/code&gt;. This is the instructor patch at work!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;People&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;people=[Person(name='Alessio', school=None, company='Decibel Partners'), Person(name='Swyx', school=None, company='Smol.ai'), Person(name='Kanjun', school='MIT', company='Imbue'), Person(name='Josh', school=None, company='Imbue')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Overall the result is high quality, we've gotten all the names and companies.&lt;/p&gt;

&lt;p&gt;Now we can take this to the next level by trying to extract multiple objects at the same time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;paper_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"an academic paper reference discussed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;people&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;research_papers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ResearchPaper&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExtractedInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;people=[Person(name='Alessio', school=None, company='Decibel Partners'), Person(name='Swyx', school=None, company='Smol.ai'), Person(name='Kanjun', school='MIT', company='Imbue'), Person(name='Josh', school=None, company='Ember')] companies=[Company(name='Decibel Partners'), Company(name='Smol.ai'), Company(name='Imbue'), Company(name='Generally Intelligent'), Company(name='Ember'), Company(name='Sorceress'), Company(name='Dropbox'), Company(name='MIT Media Lab'), Company(name='OpenA')] research_papers=None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can see how well this works, to grab a whole bunch of structured data for us with basically no work. Right now we're just working on a excerpt of this data, but with basically zero NLP specific work, we're able to get some pretty powerful results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping it all up
&lt;/h2&gt;

&lt;p&gt;The sky is really the limit here. There's so much structured data that's been locked up in unstructured text. I'm bullish on this space and can't wait to see what you build with this tool!&lt;/p&gt;

&lt;p&gt;If you found this interesting, you might want to see similar posts on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/vs/marvin-ai-vs-guardrails-vs-instructor"&gt;Comparing 3 Data Extraction Libraries: Marvin, Instructor, and Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-marvin-ai-and-llms"&gt;Structured Data Extraction using LLMs and Marvin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnbybuilding.ai/tutorials/structured-data-extraction-with-guardrails-and-llms"&gt;Structured Data Extraction using LLMs and Guardrails AI&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post was originally published on &lt;a href="https://learnbybuilding.ai"&gt;LearnByBuilding.AI&lt;/a&gt;. &lt;em&gt;See the &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-instructor-and-llms.ipynb"&gt;notebook&lt;/a&gt; and &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/structured-data-extraction-with-instructor-and-llms.py"&gt;code&lt;/a&gt; on &lt;a href="https://github.com/bllchmbrs/learnbybuilding.ai/blob/main/basic-data-extraction-llms/"&gt;Github&lt;/a&gt;.&lt;/em&gt; Please reach out to me on &lt;a href="https://twitter.com/bllchmbrs"&gt;twitter&lt;/a&gt; with any questions or feedback!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>nlp</category>
      <category>gpt4</category>
    </item>
    <item>
      <title>A beginner's guide to building a Retrieval Augmented Generation (RAG) application from scratch</title>
      <dc:creator>Bill Chambers</dc:creator>
      <pubDate>Tue, 31 Oct 2023 22:25:42 +0000</pubDate>
      <link>https://dev.to/bllchmbrs/a-beginners-guide-to-building-a-retrieval-augmented-generation-rag-application-from-scratch-30cg</link>
      <guid>https://dev.to/bllchmbrs/a-beginners-guide-to-building-a-retrieval-augmented-generation-rag-application-from-scratch-30cg</guid>
      <description>&lt;p&gt;Retrieval Augmented Generation, or RAG, is all the rage these days because it introduces some serious capabilities to large language models like OpenAI's GPT-4 - and that's the ability to use and leverage their own data.&lt;/p&gt;

&lt;p&gt;This post will teach you the fundamental intuition behind RAG while providing a simple tutorial to help you get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with learning in a fast moving space
&lt;/h2&gt;

&lt;p&gt;There's so much noise in the AI space and in particular about RAG. Vendors are trying to overcomplicate it. They're trying to inject their tools, their ecosystems, their vision.&lt;/p&gt;

&lt;p&gt;It's making RAG way more complicated than it needs to be.&lt;br&gt;
This tutorial is designed to help beginners learn how to build RAG applications from scratch.&lt;br&gt;
No fluff, no (ok, minimal) jargon, no libraries, just a simple step by step RAG application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/jerryjliu0/status/1716122650836439478"&gt;Jerry from LlamaIndex advocates for building things from scratch to really understand the pieces&lt;/a&gt;. Once you do, using a library like LlamaIndex makes more sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q4UzNOUl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/motivating-rag-from-scratch.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q4UzNOUl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/motivating-rag-from-scratch.webp" alt="Jerry from LlamaIndex advocating for RAG from scratch" width="800" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Build from scratch to learn, then build with libraries to scale.&lt;/p&gt;

&lt;p&gt;Let's get started!&lt;/p&gt;
&lt;h2&gt;
  
  
  Introducing our concept: Retrieval Augmented Generation
&lt;/h2&gt;

&lt;p&gt;You may or may not have heard of Retrieval Augmented Generation or RAG.&lt;/p&gt;

&lt;p&gt;Here's the definition from &lt;a href="https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/"&gt;the blog post introducing the concept from Facebook&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Building a model that researches and contextualizes is more challenging, but it's essential for future advancements. We recently made substantial progress in this realm with our Retrieval Augmented Generation (RAG) architecture, an end-to-end differentiable model that combines an information retrieval component (Facebook AI’s dense-passage retrieval system) with a seq2seq generator (our Bidirectional and Auto-Regressive Transformers [BART] model). RAG can be fine-tuned on knowledge-intensive downstream tasks to achieve state-of-the-art results compared with even the largest pretrained seq2seq language models. And unlike these pretrained models, RAG’s internal knowledge can be easily altered or even supplemented on the fly, enabling researchers and engineers to control what RAG knows and doesn’t know without wasting time or compute power retraining the entire model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow, that's a mouthful.&lt;/p&gt;

&lt;p&gt;In simplifying the technique for beginners, we can state that the essence of RAG involves adding your own data (via a retrieval tool) to the prompt that you pass into a large language model. As a result, you get an output.&lt;br&gt;
That gives you several benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can include facts in the prompt to help the LLM avoid hallucinations&lt;/li&gt;
&lt;li&gt;You can (manually) refer to sources of truth when responding to a user query, helping to double check any potential issues.&lt;/li&gt;
&lt;li&gt;You can leverage data that the LLM might not have been trained on.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The High Level Components of our RAG System
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;a collection of documents (formally called a corpus)&lt;/li&gt;
&lt;li&gt;An input from the user&lt;/li&gt;
&lt;li&gt;a similarity measure between the collection of documents and the user input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Yes, it's that simple. &lt;/p&gt;

&lt;p&gt;To start learning and understanding RAG based systems, you don't need a vector store, you don't even &lt;em&gt;need&lt;/em&gt; an LLM (at least to learn and understand conceptually). &lt;/p&gt;

&lt;p&gt;While it is often portrayed as complicated, it doesn't have to be.&lt;/p&gt;
&lt;h2&gt;
  
  
  The ordered steps of a querying RAG system
&lt;/h2&gt;

&lt;p&gt;We'll perform the following steps in sequence.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive a user input&lt;/li&gt;
&lt;li&gt;Perform our similarity measure&lt;/li&gt;
&lt;li&gt;Post-process the user input and the fetched document(s).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The post-processing is done with an LLM.&lt;/p&gt;
&lt;h2&gt;
  
  
  A note from the paper itself
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2005.11401"&gt;The actual RAG paper&lt;/a&gt; is obviously &lt;em&gt;the&lt;/em&gt; resource. The problem is that it assumes a LOT of context. It's more complicated than we need it to be.&lt;/p&gt;

&lt;p&gt;For instance, here's the overview of the RAG system as proposed in the paper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MyAwrGMw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/rag-paper-image.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MyAwrGMw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/rag-paper-image.webp" alt="retrieval augmented generation data and user flow" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's dense. &lt;/p&gt;

&lt;p&gt;It's great for researchers but for the rest of us, it's going to be a lot easier to learn step by step by building the system ourselves.&lt;/p&gt;
&lt;h2&gt;
  
  
  Working through an example - the simplest RAG system
&lt;/h2&gt;

&lt;p&gt;Let's get back to building RAG from scratch, step by step. Here's the simplified steps that we'll be working through.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qk-pTHnB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/the-simplest-retrieval-augmented-generation-system.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qk-pTHnB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/the-simplest-retrieval-augmented-generation-system.webp" alt="simple retrieval in RAG system" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Getting a collection of documents
&lt;/h3&gt;

&lt;p&gt;Below you can see that we've got a simple corpus of 'documents' (please be generous 😉).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;corpus_of_documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"Take a leisurely walk in the park and enjoy the fresh air."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Visit a local museum and discover something new."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Attend a live music concert and feel the rhythm."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Go for a hike and admire the natural scenery."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Have a picnic with friends and share some laughs."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Explore a new cuisine by dining at an ethnic restaurant."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Take a yoga class and stretch your body and mind."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Join a local sports league and enjoy some friendly competition."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Attend a workshop or lecture on a topic you're interested in."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"Visit an amusement park and ride the roller coasters."&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Defining and performing the similarity measure
&lt;/h3&gt;

&lt;p&gt;Now we need a way of measuring the similarity between the &lt;strong&gt;user input&lt;/strong&gt; we're going to receive and the &lt;strong&gt;collection&lt;/strong&gt; of documents that we organized. Arguably the simplest similarity measure is &lt;a href="https://en.wikipedia.org/wiki/Jaccard_index"&gt;jaccard similarity&lt;/a&gt;. I've written about that in the past (see &lt;a href="https://billchambers.me/posts/tf-idf-explained-in-python"&gt;this post&lt;/a&gt; but the short answer is that the &lt;strong&gt;jaccard similarity&lt;/strong&gt; is the intersection divided by the union of the "sets" of words.&lt;/p&gt;

&lt;p&gt;This allows us to compare our user input with the source documents.&lt;/p&gt;

&lt;h4&gt;
  
  
  Side note: preprocessing
&lt;/h4&gt;

&lt;p&gt;A challenge is that if we have a plain string like &lt;code&gt;"Take a leisurely walk in the park and enjoy the fresh air.",&lt;/code&gt;, we're going to have to pre-process that into a set, so that we can perform these comparisons. We're going to do this in the simplest way possible, lower case and split by &lt;code&gt;" "&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;jaccard_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;intersection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;intersection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;union&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intersection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;union&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to define a function that takes in the exact query and our corpus and selects the 'best' document to return to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;return_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jaccard_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;corpus_of_documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can run it, we'll start with a simple prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"What is a leisure activity that you like?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a simple user input...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I like to hike"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can return our response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;return_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_of_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'Go for a hike and admire the natural scenery.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Congratulations, you've built a basic RAG application.&lt;/p&gt;

&lt;h4&gt;
  
  
  I got 99 problems and bad similarity is one
&lt;/h4&gt;

&lt;p&gt;Now we've opted for a simple similarity measure for learning. But this is going to be problematic because it's so simple. It has no notion of &lt;strong&gt;semantics&lt;/strong&gt;. It's just looks at what words are in both documents. That means that if we provide a negative example, we're going to get the same "result" because that's the closest document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cncFzsOk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/a-key-challenge-of-retrieval-augmented-generation-systems-semantics.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cncFzsOk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/a-key-challenge-of-retrieval-augmented-generation-systems-semantics.webp" alt="Bad Similarity Problems in Retrieval Augmented Generation" width="800" height="556"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I don't like to hike"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;return_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_of_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'Go for a hike and admire the natural scenery.'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is a topic that's going to come up a lot with "RAG", but for now, rest assured that we'll address this problem later.&lt;/p&gt;

&lt;p&gt;At this point, we have not done any post-processing of the "document" to which we are responding. So far, we've implemented only the "retrieval" part of "Retrieval-Augmented Generation". The next step is to augment generation by incorporating a large language model (LLM).&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding in a LLM
&lt;/h2&gt;

&lt;p&gt;To do this, we're going to use &lt;a href="https://ollama.ai/"&gt;ollama&lt;/a&gt; to get up and running with an open source LLM on our local machine. We could just as easily use OpenAI's gpt-4 or Anthropic's Claude but for now, we'll start with the open source llama2 from &lt;a href="https://ai.meta.com/llama/"&gt;Meta AI&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ollama.ai/"&gt;ollama installation instructions are here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post is going to assume some basic knowledge of large language models, so let's get right to querying this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First we're going to define the inputs. To work with this model, we're going to take &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;user input,&lt;/li&gt;
&lt;li&gt;fetch the most similar document (as measured by our similarity measure),&lt;/li&gt;
&lt;li&gt;pass that into a prompt to the language model,&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;then&lt;/em&gt; return the result to the user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That introduces a new term, the &lt;strong&gt;prompt&lt;/strong&gt;. In short, it's the instructions that you provide to the LLM.&lt;/p&gt;

&lt;p&gt;When you run this code, you'll see the streaming result. Streaming is important for user experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I like to hike"&lt;/span&gt;
&lt;span class="n"&gt;relevant_document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;return_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_of_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# https://github.com/jmorganca/ollama/blob/main/docs/api.md
&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.

This is the recommended activity: {relevant_document}

The user input is: {user_input}

Compile a recommendation to the user based on the recommended activity and the user input.
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having defined that, let's now make the API call to ollama (and llama2).&lt;br&gt;
an important step is to make sure that ollama's running already on your local machine by running &lt;code&gt;ollama serve&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: this might be slow on your machine, it's certainly slow on mine. Be patient, young grasshopper.&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'http://localhost:11434/api/generate'&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"llama2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevant_document&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relevant_document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Content-Type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'application/json'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iter_lines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# filter out keep-alive new lines
&lt;/span&gt;        &lt;span class="c1"&gt;# count += 1
&lt;/span&gt;        &lt;span class="c1"&gt;# if count % 5== 0:
&lt;/span&gt;        &lt;span class="c1"&gt;#     print(decoded_line['response']) # print every fifth token
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;decoded_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded_line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'response'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Great! Based on your interest in hiking, I recommend trying out the nearby trails for a challenging and rewarding experience with breathtaking views Great! Based on your interest in hiking, I recommend checking out the nearby trails for a fun and challenging adventure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gives us a complete RAG Application, from scratch, no providers, no services. You know all of the components in a Retrieval-Augmented Generation application. Visually, here's what we've built.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KTCkeuSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/simplified-version-of-retrieval-augmented-generation.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KTCkeuSs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.learnbybuilding.ai/simplified-version-of-retrieval-augmented-generation.webp" alt="simplified version of retrieval augmented generation" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM (if you're lucky) will handle the user input that goes against the recommended document. We can see that below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I don't like to hike"&lt;/span&gt;
&lt;span class="n"&gt;relevant_document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;return_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corpus_of_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# https://github.com/jmorganca/ollama/blob/main/docs/api.md
&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
You are a bot that makes recommendations for activities. You answer in very short sentences and do not include extra information.

This is the recommended activity: {relevant_document}

The user input is: {user_input}

Compile a recommendation to the user based on the recommended activity and the user input.
"""&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'http://localhost:11434/api/generate'&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"llama2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relevant_document&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relevant_document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'Content-Type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'application/json'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iter_lines&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# filter out keep-alive new lines
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;decoded_line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'utf-8'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# print(decoded_line['response'])  # uncomment to results, token by token
&lt;/span&gt;            &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded_line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'response'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Sure, here is my response:

&lt;p&gt;Try kayaking instead! It's a great way to enjoy nature without having to hike.&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
&lt;br&gt;
  &lt;br&gt;
  &lt;br&gt;
  Areas for improvement&lt;br&gt;
&lt;/h2&gt;

&lt;p&gt;If we go back to our diagream of the RAG application and think about what we've just built, we'll see various opportunities for improvement. These opportunities are where tools like vector stores, embeddings, and prompt 'engineering' gets involved.&lt;/p&gt;

&lt;p&gt;Here are ten potential areas where we could improve the current setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The number of documents&lt;/strong&gt; 👉 more documents might mean more recommendations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The depth/size of documents&lt;/strong&gt; 👉 higher quality content and longer documents with more information might be better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The number of documents we give to the LLM&lt;/strong&gt; 👉 Right now, we're only giving the LLM one document. We could feed in several as 'context' and allow the model to provide a more personalized recommendation based on the user input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The parts of documents that we give to the LLM&lt;/strong&gt; 👉 If we have bigger or more thorough documents, we might just want to add in parts of those documents, parts of various documents, or some variation there of. In the lexicon, this is called chunking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Our document storage tool&lt;/strong&gt; 👉 We might store our documents in a different way or different database. In particular, if we have a lot of documents, we might explore storing them in a data lake or a vector store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The similarity measure&lt;/strong&gt; 👉 How we measure similarity is of consequence, we might need to trade off performance and thoroughness (e.g., looking at every individual document).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pre-processing of the documents &amp;amp; user input&lt;/strong&gt; 👉 We might perform some extra preprocessing or augmentation of the user input before we pass it into the similarity measure. For instance, we might use an embedding to convert that input to a vector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The similarity measure&lt;/strong&gt; 👉 We can change the similarity measure to fetch better or more relevant documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The model&lt;/strong&gt; 👉 We can change the final model that we use. We're using llama2 above, but we could just as easily use an Anthropic or Claude Model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The prompt&lt;/strong&gt; 👉 We could use a different prompt into the LLM/Model and tune it according to the output we want to get the output we want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you're worried about harmful or toxic output&lt;/strong&gt; 👉 We could implement a "circuit breaker" of sorts that runs the user input to see if there's toxic, harmful, or dangerous discussions. For instance, in a healthcare context you could see if the information contained unsafe languages and respond accordingly - outside of the typical flow.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The scope for improvements isn't limited to these points; the possibilities are vast, and we'll delve into them in future tutorials. Until then, don't hesitate to &lt;a href="https://twitter.com/bllchmbrs"&gt;reach out on Twitter&lt;/a&gt; if you have any questions. Happy RAGING :).&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2005.11401"&gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/jerryjliu0/status/1716122650836439478"&gt;Jerry Liu on Twitter advocating for users to build RAG from scratch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you like this post, you'll love what we're doing at &lt;a href="https://learnbybuilding.ai/"&gt;learn By Building AI&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>beginners</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
