<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthieu Lienart</title>
    <description>The latest articles on DEV Community by Matthieu Lienart (@mlnrt).</description>
    <link>https://dev.to/mlnrt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F931761%2Fd7d9e5f1-5edf-4206-8bc7-38af71d606cc.jpeg</url>
      <title>DEV Community: Matthieu Lienart</title>
      <link>https://dev.to/mlnrt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mlnrt"/>
    <language>en</language>
    <item>
      <title>The Missing Link: How to Retrieve Full Documents with AWS S3 Vectors</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Sun, 14 Sep 2025 12:43:29 +0000</pubDate>
      <link>https://dev.to/mlnrt/the-missing-link-how-to-retrieve-full-documents-with-aws-s3-vectors-dj7</link>
      <guid>https://dev.to/mlnrt/the-missing-link-how-to-retrieve-full-documents-with-aws-s3-vectors-dj7</guid>
      <description>&lt;p&gt;Why write yet another blog article on how to use AWS S3 Vectors, when there are already many such blog articles and tutorials out there?&lt;/p&gt;

&lt;p&gt;Because all existing tutorials I read miss a critical aspect: they don't explain how to actually retrieve your full documents after finding matching vectors. Instead, they store tiny example "documents" (often just a sentence) directly in the vector metadata. This approach, while easy for a tutorial demonstrating similarity search using vectors, completely falls apart when dealing with real-world content. Hopefully nobody is going to store entire documents in a vector metadata!&lt;/p&gt;

&lt;p&gt;This article fills this important gap. I won't rehash the basics that others have covered well. Instead, I'll focus specifically on implementing a complete document retrieval with S3 Vectors and S3 Bucket that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stores your actual documents in a standard S3 Bucket&lt;/li&gt;
&lt;li&gt;Creates and indexes embeddings in S3 Vectors
&lt;/li&gt;
&lt;li&gt;Connects vector search results back to your original documents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike a vector database that handles document storage and retrieval for you, S3 Vectors only manages the vector index. Understanding how to bridge this gap is essential for building production-ready applications with AWS S3 Vectors.&lt;/p&gt;

&lt;p&gt;At a high level, using S3 Vectors is a 3 steps process for both storing and querying as shown in this diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgw08gh2yzca0d74hx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhgw08gh2yzca0d74hx4.png" alt=" " width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Storing Documents and Vectors
&lt;/h1&gt;

&lt;p&gt;The steps to store the documents and their embeddings are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Put the documents in a regular S3 bucket. In such a context, I hash the file name or identifier to generate the S3 object key,&lt;/li&gt;
&lt;li&gt;Use an embedding model to generate an embedding based on the content of the document,&lt;/li&gt;
&lt;li&gt;Store the embedding in the vector index.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article example, the document are crawled web pages and the S3 object key is generated by hashing the page URL. The crawled pages have the format&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a high level, the code for the 3 steps is then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s3_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;vectors_data_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S3_DOCUMENTS_BUCKET_NAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# store the actual document in the S3 bucket as a text content
&lt;/span&gt;    &lt;span class="n"&gt;vectors_data_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^a-zA-Z0-9\s]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Generate embedding for the page text
&lt;/span&gt;    &lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
        &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Set the vector with the same key as the S3 Object
&lt;/span&gt;    &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Store the vectors in the S3 Vectors index
&lt;/span&gt;&lt;span class="n"&gt;s3_vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_VECTORS_BUCKET_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   
    &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_VECTORS_BUCKET_INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Querying S3 Vectors and Retrieving Documents
&lt;/h1&gt;

&lt;p&gt;Then when you query the S3 Vectors to retrieve the documents you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the embedding model to generate the embedding for the search question,&lt;/li&gt;
&lt;li&gt;Query S3 Vectors to get the embeddings close to the search embedding,
&lt;/li&gt;
&lt;li&gt;For all embeddings in the results, use the key to retrieve the objects from the S3 Bucket.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is AWS S3 Vectors?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Invoke the same model to generate the embedding for the question
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputText&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;question_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Use the query embedding to search for similar embeddings in the S3 Vectors index
&lt;/span&gt;&lt;span class="n"&gt;query_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; 
    &lt;span class="n"&gt;vectorBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_VECTORS_BUCKET_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   
    &lt;span class="n"&gt;indexName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S3_VECTORS_BUCKET_INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;queryVector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;question_embedding&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;topK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;returnDistance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;returnMetadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="c1"&gt;# Retrieve the actual documents from S3 using the keys from the query results
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;S3_DOCUMENTS_BUCKET_NAME&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Key Takeaways
&lt;/h1&gt;

&lt;p&gt;Unlike a vector database which stores and retrieves the documents for you, S3 Vectors only stores the vector index. It is up to you to make the relationship between the vector key and the actual document. S3 makes that easy if you use the same key for the document vector in S3 Vectors and the document object in the S3 Bucket. While storing and retrieving becomes a multi-step process that you have to orchestrate and that inevitably increases response latency, this approach offers substantial cost savings compared to dedicated vector databases.&lt;/p&gt;

&lt;p&gt;Note that the documents do not have to be stored in a S3 Bucket. In the above example, we could imagine not storing the pages' content as objects in a S3 Bucket, and just give back the page URLs from the vectors metadata which would be crawled downstream.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>rag</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Tracing LangChain with AWS X-Ray</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Fri, 11 Jul 2025 13:45:15 +0000</pubDate>
      <link>https://dev.to/mlnrt/tracing-langchain-with-aws-x-ray-3opm</link>
      <guid>https://dev.to/mlnrt/tracing-langchain-with-aws-x-ray-3opm</guid>
      <description>&lt;p&gt;LangChain is a popular framework for developing applications powered by large language models, providing components for working with LLMs through composable chains and agents. Like with microservices, when building production applications with LangChain, tracing and visualizing how the different components interact with each other becomes increasingly important. AWS X-Ray, being a native AWS service to monitor and analyze telemetry traces from the Lambda Functions, is the natural choice to use for tracing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the problem?
&lt;/h2&gt;

&lt;p&gt;In my previous articles &lt;a href="https://dev.to/mlnrt/a-serverless-chatbot-with-langchain-aws-bedrock-5el3"&gt;"A Serverless Chatbot with LangChain &amp;amp; AWS Bedrock"&lt;/a&gt;, and &lt;a href="https://dev.to/mlnrt/logging-langchain-to-aws-cloudwatch-j77"&gt;"Logging LangChain to AWS CloudWatch"&lt;/a&gt; I presented a solution for a serverless Chatbot with LangChain and AWS Bedrock. The solution implements all the features of conversation history, answering in the user language, custom context using RAG, model guardrails, structured outputs together with using LangChain callbacks for custom detailed logging to AWS CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;There are many tools and frameworks (e.g. LangSmith, Arize Phoenix, Langfuse, etc.) based on OpenTelemetry and OpenInference, to trace LLM applications and to do much more (evaluate LLM, evaluate RAG, run experiments, etc.). But I wanted to force myself to see what I could do for tracing with native AWS tools. So, I built a custom solution with the native AWS X-Ray tooling. But to do that, I needed to create X-Ray trace subsegments for every action performed by any type of LangChain &lt;a href="https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html" rel="noopener noreferrer"&gt;&lt;code&gt;Runnable&lt;/code&gt;&lt;/a&gt;. There are two challenges associated with that.&lt;/p&gt;

&lt;p&gt;The first one is that by default when enabling AWS X-Ray tracing on Lambda functions, it only traces call to AWS services. To capture traces to external services (e.g. an HTTP request to a public REST API) the &lt;a href="https://github.com/aws/aws-xray-sdk-python" rel="noopener noreferrer"&gt;AWS X-Ray SDK for Python&lt;/a&gt; for example, uses &lt;a href="https://wrapt.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;Python wrapt&lt;/a&gt; library to "patch" specific class methods to generate a trace subsegment before executing the method. An example of that is creating a trace subsegment when the &lt;code&gt;request()&lt;/code&gt; method (essentially making an HTTP call) of the &lt;code&gt;Session&lt;/code&gt; class of Python &lt;code&gt;requests&lt;/code&gt; library is called, as shown &lt;a href="https://github.com/aws/aws-xray-sdk-python/blob/master/aws_xray_sdk/ext/requests/patch.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. As X-Ray SDK covers only the most common Python libraries, I have to patch the LangChain library myself to intercept any execution of &lt;code&gt;invoke()&lt;/code&gt; or &lt;code&gt;ainvoke()&lt;/code&gt; methods of every possible type of LangChain &lt;code&gt;Runnable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The second challenge relates to the limitations of the X-Ray SDK in AWS Lambda Functions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The X-Ray SDK "is configured to automatically create a placeholder &lt;code&gt;facade&lt;/code&gt; segment when it detects it is running in Lambda".&lt;/li&gt;
&lt;li&gt;You can’t create your own trace segment in a Lambda Function. Only subsegments.&lt;/li&gt;
&lt;li&gt;The SDK creates an &lt;code&gt;AWS::Lambda::Function&lt;/code&gt; subsegment and further subsegments are attached to it.&lt;/li&gt;
&lt;li&gt;When using threads, every thread trace root is the Lambda &lt;code&gt;facade&lt;/code&gt; segment, not the subsegment that created the thread.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As LangChain tries to run its actions in parallel threads, or if you specify yourself some tasks to run in parallel, instead of seeing trace subsegments as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS::Lambda::Function
└── RunnableParallel
    ├── RunnableLambda
    └── RunnableLambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you get is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── AWS::Lambda::Function
│   └── RunnableParallel
├── RunnableLambda
└── RunnableLambda
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not what I would expect and want, as it does not accurately represent the inner working of LangChain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;In order to achieve my goals of correctly tracing LangChain Runnable interactions, I then need to fix those two issues by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patching myself LangChain &lt;code&gt;Runnable&lt;/code&gt; classes,&lt;/li&gt;
&lt;li&gt;Resetting the trace context to the parent subsegment for all LangChain &lt;code&gt;RunnableParallel&lt;/code&gt; classes creating threads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Patching LangChain
&lt;/h3&gt;

&lt;p&gt;This is "easily" done by following the pattern used by the AWS X-Ray SDK for the &lt;code&gt;requests&lt;/code&gt; library as shared above. The problems, are that there are many classes to cover, they will change, and I don’t want to repeat the same code for every Runnable class. There are also 2 types of classes that I must patch: &lt;code&gt;Runnable&lt;/code&gt; and &lt;code&gt;RunnableSerializable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To achieve this, I do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create some functions to recursively list all imported child classes of &lt;code&gt;Runnable&lt;/code&gt; and &lt;code&gt;RunnableSerializable&lt;/code&gt; and remove duplicates.&lt;/li&gt;
&lt;li&gt;Loop over the resulting list and wrap all &lt;code&gt;invoke()&lt;/code&gt; method of those classes
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wrapt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_xray_sdk.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runnable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnableSerializable&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dedup_classes_by_origin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return a set of (module, base_name) keys for unique classes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__module__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__name__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;base_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;all_subclasses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Recursively find all subclasses of a class.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__subclasses__&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;__subclasses__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;all_subclasses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;patch_langchain_runnables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Get all classes from langchain_core.runnables.RunnableSerializable
&lt;/span&gt;    &lt;span class="n"&gt;unique_serializable_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dedup_classes_by_origin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;all_subclasses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RunnableSerializable&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unique_runnable_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dedup_classes_by_origin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;all_subclasses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Runnable&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Combine both lists to ensure we cover all RunnableSerializable and Runnable classes
&lt;/span&gt;    &lt;span class="n"&gt;unique_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unique_serializable_classes&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;unique_runnable_classes&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unique_classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Patch the invoke method of each class
&lt;/span&gt;        &lt;span class="n"&gt;wrapt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap_function_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;class_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.invoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traced_invoke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;patch_langchain_runnables()&lt;/code&gt; has then to be called at the beginning of the Lambda Function code immediately after the imports. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_lambda_powertools.utilities.typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LambdaContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;RunnableParallel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;patch_langchain_runnables&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LambdaContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do something smart with LangChain here.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fixing Subsegment Parents
&lt;/h3&gt;

&lt;p&gt;Now that the first challenge is solved and the &lt;code&gt;invoke()&lt;/code&gt; method of all the child classes of &lt;code&gt;Runnable&lt;/code&gt; and &lt;code&gt;RunnableSerializable&lt;/code&gt; are wrapped by the &lt;code&gt;traced_invoke()&lt;/code&gt; method, I need to define it and generate a subsegment before calling the initial class method while ensuring proper subsegment lineage.&lt;/p&gt;

&lt;p&gt;For classes not using threads, a simple execution of the class method inside a subsegment would work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;traced_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;in_subsegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runnable_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But as discussed previously, for &lt;code&gt;Runnable&lt;/code&gt; classes which will be invoked in threads by the &lt;code&gt;RunnableParrallel&lt;/code&gt; class, I need to forcibly overwrite their parent subsegment from the &lt;code&gt;facade&lt;/code&gt; segment to the &lt;code&gt;RunnableParrallel&lt;/code&gt; segment. I do that with the following approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture the new segment (entity)&lt;/li&gt;
&lt;li&gt;Monkey patch (after taking a backup of it) the &lt;code&gt;threading.Thread.run()&lt;/code&gt; method by my own, which sets the trace entity to the one I just captured before executing the actual &lt;code&gt;Thread.run()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Execute the original class &lt;code&gt;invoke()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Restore the original &lt;code&gt;threading.Thread.run()&lt;/code&gt; method&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thus, when the class original &lt;code&gt;invoke()&lt;/code&gt; method will run a new thread, it will first overwrite the trace context to the parent’s subsegment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;traced_invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A wrapper function to trace the invocation of Runnable classes using AWS X-Ray.
    This function is used to create a subsegment in the X-Ray trace for the Runnable invocation.

    Args:
        wrapped: The original invoke method of the Runnable class.
        instance: The instance of the Runnable class being invoked.
        args: Positional arguments passed to the invoke method.
        kwargs: Keyword arguments passed to the invoke method.

    Returns:
        The result of the wrapped invoke method.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;runnable_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;in_subsegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runnable_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runnable_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RunnableParallel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Get the parent entity for the child threads
&lt;/span&gt;            &lt;span class="n"&gt;parent_entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_trace_entity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c1"&gt;# Back up the original threading.Thread.run method
&lt;/span&gt;            &lt;span class="n"&gt;orig_thread_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;

            &lt;span class="c1"&gt;# Monkey-patch the threading.Thread.run method to set the parent entity for new threads
&lt;/span&gt;            &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_with_entity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;xray_recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_trace_entity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_entity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;orig_thread_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Replace the threading.Thread.run method with our patched version
&lt;/span&gt;            &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_with_entity&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Once done, restore the original thread run method
&lt;/span&gt;                &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orig_thread_run&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wrapped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; following the same approach, the same can be achieved for LangChain asynchronous &lt;code&gt;ainvoke()&lt;/code&gt; method, but not shown here for simplification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Patching with wrapt vs Monkey Patching
&lt;/h3&gt;

&lt;p&gt;As you saw, I use two different approaches to patch different class methods: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;wrapt&lt;/code&gt; library for LangChain Runnable classes &lt;code&gt;invoke()&lt;/code&gt; method&lt;/li&gt;
&lt;li&gt;Monkey patching for threading &lt;code&gt;Thread.run()&lt;/code&gt; method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the case of LangChain, I want to patch the &lt;code&gt;invoke()&lt;/code&gt; method for all their executions and for the entire duration of the Lambda Function. The &lt;code&gt;wrapt&lt;/code&gt; library is designed to do exactly that.&lt;/p&gt;

&lt;p&gt;In the case of the threads, I just want to patch the &lt;code&gt;Thread.run()&lt;/code&gt; method in the temporary context of &lt;code&gt;RunnableParrallel.invoke()&lt;/code&gt; method. The &lt;code&gt;wrapt&lt;/code&gt; library does not provide a built-in way to temporarily patch and then restore a method at runtime in a specific block of code.&lt;/p&gt;

&lt;p&gt;Be careful with monkey patching though, as you are changing behaviors of methods in ways that other developers might not be aware of, and updates in the inner working of the method you are patching might interfere with your patching method. But it feels reasonable in this specific context with the actions performed by the patching method.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;The LangChain setup described in my first article &lt;a href="https://dev.to/mlnrt/a-serverless-chatbot-with-langchain-aws-bedrock-5el3"&gt;"A Serverless Chatbot with LangChain &amp;amp; AWS Bedrock"&lt;/a&gt; results in the following AWS X-Ray trace properly showing the different step in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The initial steps of the chain to retrieve references from the knowledge base, retrieve the conversation history and detect the language run in parallel&lt;/li&gt;
&lt;li&gt;The prompt being generated based on all those inputs&lt;/li&gt;
&lt;li&gt;The LLM model called with that prompt using Bedrock&lt;/li&gt;
&lt;li&gt;The tool to structure the model output based on my Pydantic definition called&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79t1y7kf8x94pzu1e2ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79t1y7kf8x94pzu1e2ah.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Building upon my previous articles on serverless LangChain applications and logging LangChain to AWS CloudWatch Logs, this tracing implementation with AWS X-Ray has revealed additional insights worth sharing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adapting existing tools for new use cases can be challenging: While AWS X-Ray SDK wasn't designed specifically for tracing AI frameworks like LangChain, this project demonstrated the potential to extend its capabilities creatively.&lt;/li&gt;
&lt;li&gt;Deep understanding of both AWS Lambda and AWS X-Ray internals proved crucial: Knowing how Lambda manages threads and how the X-Ray SDK patches common libraries is key to developing an effective custom tracing solution.&lt;/li&gt;
&lt;li&gt;Performance implications of custom tracing solutions should be considered: While not explicitly discussed, it's important to note that any custom tracing implementation may impact the performance of the system and should be carefully monitored and optimized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this implementation successfully integrates LangChain tracing with AWS X-Ray as subsegments within the Lambda trace, the natural progression is to create a separate trace map specifically for LangChain. A visual representation of the interactions between LangChain components, would allow for more granular analysis and optimization of AI-driven applications.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>langchain</category>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Logging LangChain to AWS CloudWatch</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Wed, 18 Jun 2025 19:50:08 +0000</pubDate>
      <link>https://dev.to/mlnrt/logging-langchain-to-aws-cloudwatch-j77</link>
      <guid>https://dev.to/mlnrt/logging-langchain-to-aws-cloudwatch-j77</guid>
      <description>&lt;p&gt;LangChain is a popular framework for developing applications powered by large language models, providing components for working with LLMs through composable chains and agents. When building production applications with LangChain, proper logging becomes essential for monitoring, debugging, and auditing your AI systems. AWS CloudWatch is the natural choice for logging in my serverless context, offering centralized log storage, metrics, and powerful analysis capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the problem?
&lt;/h2&gt;

&lt;p&gt;In my previous article &lt;a href="https://dev.to/mlnrt/a-serverless-chatbot-with-langchain-aws-bedrock-5el3"&gt;"A Serverless Chatbot with LangChain &amp;amp; AWS Bedrock"&lt;/a&gt;, I presented a solution for a serverless Chatbot with LangChain and AWS Bedrock, having all the features of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversation history&lt;/li&gt;
&lt;li&gt;Answering in the user language&lt;/li&gt;
&lt;li&gt;Custom context using RAG&lt;/li&gt;
&lt;li&gt;Model guardrails&lt;/li&gt;
&lt;li&gt;Structured output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the described solution aims to be running on AWS Lambda, I naturally want to export all those logs to AWS CloudWatch.&lt;/p&gt;

&lt;p&gt;The problem is that just using &lt;a href="https://python.langchain.com/docs/how_to/debugging/#set_debugtrue" rel="noopener noreferrer"&gt;&lt;code&gt;langchain.globals.set_debug&lt;/code&gt;&lt;/a&gt; function produces verbose, unstructured logs that become virtually unusable in CloudWatch. These logs are difficult to read, impossible to query effectively with CloudWatch Insights, and lack the context needed for proper debugging. For CloudWatch to deliver its full value, logs must be stored in a structured JSON format with consistent fields and meaningful metadata that can be filtered and analyzed programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;In essence, I created a structured logging system that transforms LangChain's verbose text output into CloudWatch-friendly structured JSON format. This solution enables effective monitoring, troubleshooting, and analysis of the LangChain applications in an AWS environment.&lt;/p&gt;

&lt;p&gt;I use a LangChain Callback to capture LangChain actions, format the logs and send them to CloudWatch using the &lt;a href="https://docs.powertools.aws.dev/lambda/python/3.13.0/core/logger/" rel="noopener noreferrer"&gt;AWS Lambda PowerTools Logger&lt;/a&gt;. An advantage of this custom approach is that logs can be enriched with custom metadata, like a session or user ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dkcdgjt76zzcfeynduy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dkcdgjt76zzcfeynduy.png" alt=" " width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: High-level architecture of the serverless chatbot&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the full code, you can refer to the Jupyter notebook in &lt;a href="https://github.com/amanoxsolutions/langchain-awsbedrock-demo" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt;. While the notebook demonstrates the components locally, and the logs are just printed after being formatted, instead of being sent to CloudWatch, the principles apply directly to a Lambda Function deployment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using a Callback for Logging
&lt;/h2&gt;

&lt;p&gt;The approach I use here is to use LangChain callbacks on actions like &lt;code&gt;on_chain_start&lt;/code&gt;, &lt;code&gt;on_chain_end&lt;/code&gt;, &lt;code&gt;on_chain_error&lt;/code&gt;, etc. to capture and log the actions in the chain.&lt;/p&gt;

&lt;p&gt;Since the messages of chain actions or LLM prompts and answers can be long and contain sensitive information, I provide parameters like &lt;code&gt;exclude_inputs&lt;/code&gt;, &lt;code&gt;exclude_outputs&lt;/code&gt; to the Callback to have the ability to redact such content.&lt;/p&gt;

&lt;p&gt;The full list of LangChain callbacks is available &lt;a href="https://python.langchain.com/docs/concepts/callbacks/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Logging callback is structured as follow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoggingHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;exclude_inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;exclude_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exclude_inputs&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exclude_outputs&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_chain_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_chain_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_chain_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="err"&gt;…&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Parsing the Logs
&lt;/h3&gt;

&lt;p&gt;Inputs, outputs, prompts, model answers, etc. making the content of the callbacks, are dictionaries including LangChain serializable objects. In order to prepare those data and ensure that all nested objects are converted into standard Python types that can be easily handled by CloudWatch Logs, it needs to go through the dictionary recursively and serialize the LangChain objects. This is done by the utility function &lt;code&gt;_parse_dict_values()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Serializable&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Logging LangChain Steps
&lt;/h3&gt;

&lt;p&gt;Logging LangChain steps like &lt;code&gt;on_chain_start&lt;/code&gt;, &lt;code&gt;on_chain_end&lt;/code&gt;, then involves&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redacting the content if instructed so&lt;/li&gt;
&lt;li&gt;Else serialize the content&lt;/li&gt;
&lt;li&gt;Then log to CloudWatch
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_chain_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;serialized&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent_run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sanitized_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;redacted&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sanitized_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain/start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_name_from_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serialized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parent_run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_run_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sanitized_inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_chain_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent_run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exclude_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sanitized_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;redacted&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sanitized_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_dict_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain/end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_name_from_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serialized&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parent_run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent_run_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sanitized_outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_get_name_from_callback()&lt;/code&gt; is another utility function which tries to extract the action name in different ways depending on the content of the data. Refer to the Jupyter notebook for the full &lt;code&gt;LoggingHandler&lt;/code&gt; code with all the callbacks and utility functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;The logs are formatted as desired ready for the AWS Lambda PowerTools logger and AWS CloudWatch as shown in one example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;level&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on_chain_start:122&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain/start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RunnableSequence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cda54b41-8c10-47f9-87f8-f0c04a96731a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0174cf48-b8f3-4418-8d7c-13b9b0881938&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parent_run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wie stimmen Sie die Entwicklung eng mit den Unternehmenszielen ab?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-06-05 13:26:14,029+0000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service_undefined&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cold_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_memory_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:lambda:us-east-1:************:function: my-function &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_request_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1ef2901b-a061-40a5-9a4e-eb20ea80fc1b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xray_trace_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1-68419af5-730d439a2c0074857ace2227&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how the log entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard CloudWatch fields like level, timestamp, and Lambda execution context&lt;/li&gt;
&lt;li&gt;Our custom message object with LangChain-specific information&lt;/li&gt;
&lt;li&gt;Custom metadata like the &lt;code&gt;session_id&lt;/code&gt; that allows tracking the logs for a user's entire conversation&lt;/li&gt;
&lt;li&gt;The actual content of inputs (which could be redacted if sensitive)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this structured format, you can use CloudWatch Insights to run powerful queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, @message
| filter message.session_id = "cda54b41-8c10-47f9-87f8-f0c04a96731a"
| sort @timestamp asc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8u9we5hxgju5fowz3q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8u9we5hxgju5fowz3q2.png" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 1: Retrieving all LangChain logs for a specific user session&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;Building upon my previous article on serverless LangChain applications, this logging implementation has revealed additional insights worth sharing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CloudWatch-friendly logging matters: Simply dumping LangChain's native logs to CloudWatch creates more problems than it solves. Designing logs specifically for CloudWatch's query capabilities enables effective monitoring and analysis.&lt;/li&gt;
&lt;li&gt;Balance detail with privacy: When logging LLM interactions, you must carefully balance capturing sufficient detail for debugging against protecting sensitive information which might be contained in prompts and answers. The parameterized redaction approach demonstrated here offers a flexible solution.&lt;/li&gt;
&lt;li&gt;Custom callbacks provide control: While LangChain offers built-in logging capabilities, custom callbacks give you precise control over what gets logged and how it's formatted, which is essential for production environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As LangChain and the broader LLM ecosystem continue to evolve, implementing robust logging practices will remain essential for building reliable, maintainable AI applications on AWS. The approach outlined in this article provides a foundation that you can adapt as both technologies mature.&lt;/p&gt;

&lt;p&gt;Now that we are successfully logging, in the next article I will introduce &lt;a href="https://dev.to/mlnrt/tracing-langchain-with-aws-x-ray-3opm"&gt;tracing LangChain with AWS X-Ray&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>monitoring</category>
      <category>aws</category>
    </item>
    <item>
      <title>A Serverless Chatbot with LangChain &amp; AWS Bedrock</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Sun, 15 Jun 2025 11:20:23 +0000</pubDate>
      <link>https://dev.to/mlnrt/a-serverless-chatbot-with-langchain-aws-bedrock-5el3</link>
      <guid>https://dev.to/mlnrt/a-serverless-chatbot-with-langchain-aws-bedrock-5el3</guid>
      <description>&lt;p&gt;LangChain is an open-source framework for building applications powered by large language models (LLMs), while AWS Bedrock is a fully managed service that provides access to foundation models from leading AI companies. I know these days it's all about agentic AI, but even if you're trying to develop a simple serverless non-agentic chatbot using LangChain and AWS Bedrock, you still need to combine many advanced capabilities. This article walks you through the challenges of integrating these powerful tools to create a sophisticated chatbot with features such as conversation history management, retrieval-augmented generation (RAG), multilingual support, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the problem?
&lt;/h2&gt;

&lt;p&gt;When developing a serverless non-agentic chatbot using LangChain and AWS Bedrock, you'll likely want to incorporate several key features to make it truly useful and robust.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The ability to maintain the current conversation (here I limit the scope to the current conversation, not storing past interactions).&lt;/li&gt;
&lt;li&gt;Provide the model with your own specific context using your knowledge base and use retrieval-augmented generation (RAG) to generate answers relating to your context (here I limit myself to crawled web pages).&lt;/li&gt;
&lt;li&gt;The ability to answer in the language of the user.&lt;/li&gt;
&lt;li&gt;Guardrails to make sure the answers are compliant with your chatbot objectives, prevent prompt attacks, etc.&lt;/li&gt;
&lt;li&gt;The ability to generate outputs directly in a JSON structured format for the frontend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At least I wanted all those things.&lt;/p&gt;

&lt;p&gt;While numerous code samples and tutorials exist that demonstrate one or two of these capabilities, I found none that comprehensively cover all five. Moreover, many of the more complete examples rely on outdated versions of LangChain with deprecated APIs.&lt;/p&gt;

&lt;p&gt;This article aims to fill the gap despite the fact that it might become itself quickly obsolete, considering the pace of innovation in the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;Here's an overview of the solution I developed to address these challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For conversation history management, I use DynamoDB and LangChain to store the conversation history, but I implement a custom solution instead of using the common &lt;a href="https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html" rel="noopener noreferrer"&gt;&lt;code&gt;RunnableWithMessageHistory&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For multilingual support, I use AWS Comprehend to detect the question's language and generate appropriate language instructions for the model's response. Language detection could also be done using an LLM, but I suspect (although I haven't tested it) that the response time and cost would be higher.&lt;/li&gt;
&lt;li&gt;I utilize Bedrock's built-in capabilities for the RAG knowledge base (employing the built-in web crawler to index the content, not shown here), for guardrails and for generating structured JSON outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0z73n3pmr2kbo22rerh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0z73n3pmr2kbo22rerh.png" alt=" " width="800" height="498"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: High-level architecture of the serverless chatbot&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the full code, you can refer to the Jupyter notebook in &lt;a href="https://github.com/amanoxsolutions/langchain-awsbedrock-demo" rel="noopener noreferrer"&gt;this GitHub repository&lt;/a&gt;. While the notebook demonstrates the components locally, the principles apply directly to a Lambda Function deployment.&lt;/p&gt;
&lt;h2&gt;
  
  
  Details
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Manage the Conversation History
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Why not Manage Conversation History with RunnableWithMessageHistory?
&lt;/h4&gt;

&lt;p&gt;LangChain provides a &lt;code&gt;RunnableWithMessageHistory&lt;/code&gt; class for managing conversation history which you will see in many code samples. Although very convenient, this approach has two significant drawbacks for my use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance Limitations: &lt;code&gt;RunnableWithMessageHistory&lt;/code&gt; retrieves the conversation history before running the chain. But in my implementation, I need to perform three independent tasks: a) RAG context retrieval, b) Language detection and instruction generation, c) Conversation history retrieval. By parallelizing these tasks, I can reduce the latency of the initial LangChain.&lt;/li&gt;
&lt;li&gt;Incompatibility with Structured Output: The default implementation doesn't work well with structured output. While this can be bypassed by always returning the raw output together with the structured output and storing the raw output in the message history, it introduces a new problem. It would include unnecessary information like RAG references in the conversation history which would consume LLM prompt tokens when using that history in later prompts. So, I need to customize what is stored in the database.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Parallelization of Conversation History Retrieval
&lt;/h4&gt;

&lt;p&gt;The solution implemented uses &lt;a href="https://python.langchain.com/docs/integrations/memory/aws_dynamodb/" rel="noopener noreferrer"&gt;&lt;code&gt;DynamoDBChatMessageHistory&lt;/code&gt;&lt;/a&gt; to define the storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DynamoDBChatMessageHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; 
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CONVERSATION_HISTORY_TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;this_session_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the initial step of the LangChain chain, I use &lt;a href="https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableParallel.html" rel="noopener noreferrer"&gt;&lt;code&gt;RunnableParralel&lt;/code&gt;&lt;/a&gt; to read the past messages from the current session using a &lt;a href="https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableLambda.html" rel="noopener noreferrer"&gt;&lt;code&gt;RunnableLambda&lt;/code&gt;&lt;/a&gt; in parallel with other steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;RunnableParallel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language_instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To illustrate the resulting performance improvement, let's compare the telemetry traces of both approaches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyka2gvft9k705s182ovo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyka2gvft9k705s182ovo.png" alt=" " width="800" height="144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 1: Telemetry trace using &lt;code&gt;RunnableWithMessageHistory&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun3tyjb2unviebs8j5cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fun3tyjb2unviebs8j5cg.png" alt=" " width="800" height="145"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image 2: Telemetry trace with my solution&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first trace shows the sequential nature of &lt;code&gt;RunnableWithMessageHistory&lt;/code&gt;, where conversation history retrieval in DynamoDB happens before other tasks. In contrast, the second trace demonstrates how my custom implementation allows for concurrent execution of RAG retrieval, language detection, and conversation history retrieval, leading to improved overall performance. The latency of the initial steps before calling the LLM model is improved from 1.2 sec to 1.0 sec by parallelizing the retrieval of the conversation history from DynamoDB.&lt;/p&gt;
&lt;h4&gt;
  
  
  Using a Callback Handler for Storing Messages
&lt;/h4&gt;

&lt;p&gt;To store the new user question and LLM answer, I use a &lt;a href="https://python.langchain.com/docs/concepts/callbacks/" rel="noopener noreferrer"&gt;LangChain callback&lt;/a&gt; &lt;code&gt;on_llm_end&lt;/code&gt;. This allows me to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract only the relevant parts of the model's response for storage&lt;/li&gt;
&lt;li&gt;Separate the answer from the references&lt;/li&gt;
&lt;li&gt;Minimize token usage in future prompts
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoreMessagesCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BaseChatMessageHistory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_llm_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LLMResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Storing question and LLM answer back into DynamoDB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;generations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;generations&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
            &lt;span class="n"&gt;ai_message_kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;ai_message_kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;ai_message_kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;ai_message_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No generations returned by LLM; no AI message to store.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's crucial to keep the number of tokens as low as possible for two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model prompts have a limitation in the number of input tokens,&lt;/li&gt;
&lt;li&gt;We are charged per token used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my use-case, the model response consists of two parts: the answer and a list of references (including URLs and excerpts). For future interactions, only the text answer is truly necessary for the model to follow the conversation. By storing only the answer and not the references, I can significantly reduce token usage in subsequent prompts.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. RAG Retrieval
&lt;/h3&gt;

&lt;p&gt;Document retrieval using RAG is done in a classic manner using &lt;a href="https://python.langchain.com/api_reference/aws/retrievers/langchain_aws.retrievers.bedrock.AmazonKnowledgeBasesRetriever.html" rel="noopener noreferrer"&gt;&lt;code&gt;AmazonKnowledgeBasesRetriever&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;kb_retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AmazonKnowledgeBasesRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_agent_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;knowledge_base_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BEDROCK_KNOWLEDGE_BASE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retrieval_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectorSearchConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numberOfResults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;kb_retriever&lt;/code&gt; is then used in the initial step of the chain to retrieve content based on the question and use a custom function to format the result to inject it into the model prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;kb_retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;format_references&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Generate Language Instructions
&lt;/h3&gt;

&lt;p&gt;I created a custom function which uses AWS Comprehend to detect the user language and generate instructions for the model which will be injected into the model prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_language_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;comprehend_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect_dominant_language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Comprehend language detection response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;languages&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Languages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Sort languages by score and return the one with the highest score
&lt;/span&gt;            &lt;span class="n"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;dominant_language&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LanguageCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detected language: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dominant_language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer the question in the provided RFC 5646 language code: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dominant_language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No language detected, defaulting to basic instructions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer in the same language as the question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error detecting language: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Defaulting to basic language instructions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer in the same language as the question.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initial step of the LangChain chain, I call the above function using a &lt;code&gt;RunnableLambda&lt;/code&gt; and passing it the user question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_language_instructions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A logical improvement is to store the language code in the conversation history so that we don’t redetect it at every new user message. But this is not implemented yet.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Using Guardrails
&lt;/h4&gt;

&lt;p&gt;Using AWS Guardrails when calling a model using AWS Bedrock is very simple. You just must pass the guardrails to &lt;a href="https://python.langchain.com/api_reference/aws/chat_models/langchain_aws.chat_models.bedrock_converse.ChatBedrockConverse.html" rel="noopener noreferrer"&gt;&lt;code&gt;ChatBedrockConverse&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatBedrockConverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BEDROCK_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stop_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;guardrail_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailIdentifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BEDROCK_GUARDRAIL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BEDROCK_GUARDRAIL_VERSION&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5. Structured Output
&lt;/h4&gt;

&lt;p&gt;To request the model to generate the answer following a specific structure, I just specify that structure using Pydantic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatBotResponseReference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A web reference used to answer the question&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The URL of the reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;excerpt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The extract from the reference&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatBotResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The response from the chatbot.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer to the question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;references&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatBotResponseReference&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A list of references relating to the question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then update the model definition as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;structured_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_structured_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ChatBotResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;include_raw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, I force the model to provide both the raw answer and the structured answer. I do this because there is no guarantee that the model will follow the format instructions, so I want the raw answer as a fallback source if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; with this approach, there is still the risk that the model will hallucinate and, instead of reusing the references retrieved from the knowledge base, will generate non existing ones in the answer. If you are facing such an issue and you need to give the exact outputs from the RAG knowledge-base retrieval step to the user, the solution is to not ask the model to generate a structured output with references. Instead, you create a first LangChain initial_step as below, but you invoke it first to generate the prompt inputs. Then you pass that as inputs when invoking the &lt;code&gt;prompt|llm&lt;/code&gt; chain. You then combine the content of the llm answer with the references gathered in the initial step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LangChain Chain
&lt;/h2&gt;

&lt;p&gt;Now all the pieces are there to create the prompt and the chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
            If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know the answer, just say that you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know. Use three sentences maximum and keep the answer concise.

            Here is a list of web pages references to be used as context to answer the question.
            Copy-paste them together with your answer in the output:
            {references}

            {language_instructions}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="nc"&gt;MessagesPlaceholder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variable_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;initial_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunnableParallel&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; references &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;kb_retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;format_references&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language_instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generate_language_instructions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;full_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;initial_step&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;structured_llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can then be called as follow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;est quoi LangChain?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;chain_callbacks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;StoreMessagesCallbackHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;CloudWatchLoggingHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;full_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callbacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chain_callbacks&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;Now that we've walked through the implementation details, let's examine the outputs of the LangChain and AWS Bedrock-powered chatbot. We'll look at three key aspects of the results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The multi-language conversation: We'll see how the chatbot handles a multi-turn conversation in French, demonstrating its language detection and response capabilities.&lt;/li&gt;
&lt;li&gt;The generated prompts: We'll examine the prompts created by my LangChain setup, showcasing how the conversation history and context are incorporated.&lt;/li&gt;
&lt;li&gt;The Message History DynamoDB Table: We'll verify how the conversation is stored in the DynamoDB table, ensuring persistence across interactions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Conversation
&lt;/h3&gt;

&lt;p&gt;Asking a question in French &lt;em&gt;"C'est quoi LangChain?" (“What is Langchain?”)&lt;/em&gt;, results in a prompt as follow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context references to answer the question. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy-paste them together with your answer in the output:
[
    {
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://python.langchain.com/docs/introduction/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;excerpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain is a framework for developing applications powered by large language models (LLMs).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
]

Answer the question in the provided RFC 5646 language code: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.

Human: C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;est quoi LangChain?&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This results in a structured answer in the user language including the RAG references.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://python.langchain.com/docs/introduction/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;excerpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain is a framework for developing applications powered by large language models (LLMs).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A follow-up question asking &lt;em&gt;"Cela fonctionne-t'il avec AWS Bedrock?" ("Can it work with AWS Bedrock?")&lt;/em&gt; produces an answer showing that it used the context (we were talking about LangChain) of the conversation to answer the new question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; https://python.langchain.com/docs/integrations/chat/bedrock/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;excerpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt generated looks something like the shortened sample below, showing the conversation history being used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;System: You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know the answer, just say that you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy/paste them together with your answer in the output:
[…]

Answer the question in the provided RFC 5646 language code: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.

Human: C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;est quoi LangChain?
AI: LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.
Human: Cela fonctionne-t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;il avec AWS Bedrock?&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Message History DynamoDB Table
&lt;/h3&gt;

&lt;p&gt;If we list the messages stored in DynamoDB for this conversation, we see the following, showing that the content of messages does not include references (although stored in the table).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;est quoi LangChain?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
 &lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
 &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cela fonctionne-t&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;il avec AWS Bedrock?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...),&lt;/span&gt;
 &lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;LangChain is a powerful framework that offers numerous abstractions for rapidly developing applications that interact with LLMs. However, given the rapid pace of innovation in this field, it's crucial to approach development thoughtfully:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before diving into coding based on web examples (including this article), invest time in learning LangChain fundamentals.Always verify that you're using the latest version of the framework to avoid deprecated features.&lt;/li&gt;
&lt;li&gt;While LangChain's modular nature makes it a flexible and powerful tool to work with LLMs, integrating these modules effectively for your specific use case can be complex.&lt;/li&gt;
&lt;li&gt;Be prepared to adapt and innovate, as off-the-shelf solutions may not fully address your unique requirements.&lt;/li&gt;
&lt;li&gt;Using structured output, does not guarantee the model will respect your desired answer structure so you need a fallback mechanism. Also, when asking for references in your structured answer, be aware that the model might hallucinate and generate fake references not in your knowledge base.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By keeping these lessons in mind, you'll be better equipped to leverage LangChain's capabilities while navigating its challenges.&lt;/p&gt;

&lt;p&gt;Now that we have a feature-rich chatbot, I will present in the &lt;a href="https://dev.to/mlnrt/logging-langchain-to-aws-cloudwatch-j77"&gt;next article&lt;/a&gt; how we can log LangChain's details in AWS CloudWatch Logs.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>ai</category>
      <category>langchain</category>
      <category>aws</category>
    </item>
    <item>
      <title>Data Masking of AWS Lambda Function Logs</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Sat, 15 Mar 2025 12:59:19 +0000</pubDate>
      <link>https://dev.to/mlnrt/data-masking-of-aws-lambda-function-logs-37mc</link>
      <guid>https://dev.to/mlnrt/data-masking-of-aws-lambda-function-logs-37mc</guid>
      <description>&lt;p&gt;&lt;strong&gt;What are the problems?&lt;/strong&gt;&lt;br&gt;
When logging events, service API call answers, etc. inside Lambda Functions into CloudWatch Logs, you might end up writing sensitive information like PII into CloudWatch Logs. This potentially exposes sensitive information to people who should not have access, e.g. developers and cloud platform administrators. But it is also a huge problem to be compliant with data protection regulations like the right to be forgotten. How do you erase a customer information across all the logs?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Existing Approaches&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;CloudWatch Logs Native Data Masking&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Natively, AWS CloudWatch Logs allow data masking by using &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/protect-sensitive-log-data-types.html" rel="noopener noreferrer"&gt;managed or custom data identifiers&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch-logs-data-protection-policies.html" rel="noopener noreferrer"&gt;data protection policies&lt;/a&gt;. The data identifiers are pattern matching or machine learning models which detect sensitive data. Data protection policies are JSON documents describing the operations to perform on the identified sensitive data. The operation can be set to just “audit” or to “de-identify” the data in which case, only principals authorized to perform the &lt;code&gt;log:Unmask&lt;/code&gt; action would be able to see the data.&lt;br&gt;
Note that only new data written to CloudWatch Logs will be masked according to the defined policies. Someone with access to the logs would still be able to see the sensitive data written before enabling data masking.&lt;br&gt;
Although this approach prevents access to sensitive data for unauthorized personnel, it does not help with complying with the right to be forgotten.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;AWS Lambda Powertools Data Masking&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/" rel="noopener noreferrer"&gt;AWS Lambda Powertools&lt;/a&gt; is “a developer toolkit to implement Serverless best practices and increase developer velocity” originally developed in Python but made available for &lt;a href="https://docs.powertools.aws.dev/lambda/java/" rel="noopener noreferrer"&gt;Java&lt;/a&gt;, &lt;a href="https://docs.powertools.aws.dev/lambda/typescript/latest/" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; and &lt;a href="https://docs.powertools.aws.dev/lambda/dotnet/" rel="noopener noreferrer"&gt;.NET&lt;/a&gt;. As of now only the Python version offers a utility for data masking.&lt;br&gt;
Two approaches are proposed. One approach which uses a KMS key to encrypt/decrypt the sensitive information inside the log. A second approach which simply erases the sensitive information before writing the logs. To implement the first approach but also comply with regulations like the right to be forgotten, you would need to have one encryption key per customer, find some way to encrypt each customer’s information with their own key. Should they exercise their right to be forgotten, simply delete the encryption key, making their data forever unrecoverable.&lt;br&gt;
Although those approaches can address both problems, you must know exactly what to encrypt/erase. For example, to erase the phone number in a list of customers you would need to do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data_masker.erase(data, fields=["customers[*].phone_number"])&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;But what if you are unsure at the start of a project about the data structure and the content? What if the data schema changes? What if you forgot a field in a nested JSON structure? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Erasing All PII by Default&lt;/strong&gt;&lt;br&gt;
Do you really need sensitive information like PII in application logs? &lt;br&gt;
Probably not. &lt;br&gt;
In that case, the AWS Lambda Powertools data erasing approach seems like the simplest one. But again, it works as long as you know the data structure and it doesn’t change. As a security/compliance officer how can I make sure the developers don’t forget to erase sensitive information?&lt;br&gt;
So, I wanted to improve on the AWS Lambda Powertools approach, to erase sensitive information, wherever they are in the logs…&lt;br&gt;
This is what I came up with based on the AWS Lambda Powertools data masking utility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1- Create a Function to Erase Sensitive Data&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
I created a Python decorator which calls the &lt;code&gt;data_masker.erase()&lt;/code&gt; function on the message to erase all the fields passed as a parameter before calling the function the decorator is applied to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from warnings import catch_warnings
from functools import wraps, partial
from decimal import Decimal
from typing import Any
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider import BaseProvider

def is_valid_json_string(json_string: str) -&amp;gt; bool:
    if isinstance(json_string, str):
        try:
            result = json.loads(json_string)
            return isinstance(result, dict)
        except json.JSONDecodeError:
            return False

def log_masking_decorator(masked_fields: list[str]):
    def decorator(func):
        @wraps(func)
        def wrapper(self, msg, *args, **kwargs):
            if is_valid_json_string(msg) or isinstance(msg, dict):
                with catch_warnings(action="ignore"):
                    msg = self.data_masker.erase(msg, fields=masked_fields)
            return func(self, msg, *args, **kwargs)
        return wrapper
    return decorator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code explanations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;data_masker.erase()&lt;/code&gt; function only works on dictionaries and string containing a JSON object. So we need to verify the type of the message before erasing the data.&lt;/li&gt;
&lt;li&gt;The AWS Lambda Powertools Data Masker raises a warning if you instruct it to mask a field which it can’t find. With this approach where I want to globally define a list a fields to mask everywhere, this will result in a lot of warnings in CloudWatch Logs, which I don’t want. So I ignore the warnings before calling the &lt;code&gt;erase()&lt;/code&gt; method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2- Apply the Function on all Logging Methods&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A class decorator is created to apply a decorator function passed as an argument to all the logging methods (e.g. info, error, exceptions) of the logger class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def decorate_log_methods(decorator):
    def decorate(cls):
        for attr in dir(cls):
            if callable(getattr(cls, attr)) and attr in [
                "info",
                "error",
                "warning",
                "exception",
                "debug",
                "critical",
            ]:
                setattr(cls, attr, decorator(getattr(cls, attr)))
        return cls
    return decorate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;3- Create a Custom Logger Class&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Finally, a custom logger class is created with the class decorator created in the previous step applied to it. The class decorator takes as an argument the first function decorating the &lt;code&gt;data_masker.erase()&lt;/code&gt; function. The data masker decorator takes as an argument all the JSON keys containing PII and which should be erased.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def decimal_serializer(obj: Any) -&amp;gt; Any:
    if isinstance(obj, Decimal):
        obj = str(obj)
    return obj 

@decorate_log_methods(
    log_masking_decorator(
        masked_fields=[
            "$.[*].phoneNumber",
            "$..[*].phoneNumber",
            "$.[*].name",
            "$..[*].name",
        ]
    )
)
class CustomLogger(Logger):
    def __init__(self):
        super().__init__()
        self.datamasking_provider = BaseProvider(
            json_serializer=partial(json.dumps, default=decimal_serializer),
            json_deserializer=json.loads,
        )
        self.data_masker = DataMasking(
            provider=self.datamasking_provider, raise_on_missing_field=False
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code explanations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I use here a custom JSON serializer to convert Python Decimal values into strings to avoid errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;4- Usage&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
By instantiating the Python logger into the Lambda Function as a &lt;code&gt;CustomLogger()&lt;/code&gt; instead of the default AWS Lambda Power Tools &lt;code&gt;Logger()&lt;/code&gt;, all values of the JSON keys listed in the class decorator argument will be erased by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from log_helpers import CustomLogger
logger = CustomLogger()
@logger.inject_lambda_context(log_event=True)
def lambda_handler(event: dict, context: LambdaContext):
    response = boto3_client.whatever_service_api()
    logger.info(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Code explanations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;inject_lambda_context&lt;/code&gt; decorator calls the &lt;code&gt;logger.info()&lt;/code&gt;. Since the logger is here our custom logger, all PII listed in our &lt;code&gt;CustomLogger&lt;/code&gt; class decorator will be erase from the Lambda event logs
This achieves the goal of enforcing the erasure of all the listed PII without the developer having to specifically list each field to erase on every logging action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full code of the custom logger is available &lt;a href="https://github.com/amanoxsolutions/api-security-demo/blob/main/resources/lambdas/helpers_layer/log_helpers/__init__.py" rel="noopener noreferrer"&gt;⁠here&lt;/a&gt;. The repository contains a full demo showing how to secure an AWS API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I Use That in Production?&lt;/strong&gt;&lt;br&gt;
No.&lt;br&gt;
Parsing the entire JSON structure of every log will increase the latency of the response of your Lambda function, which is not something you want. As the documentation of the AWS Lambda Power Tools says, logging the event of the Lambda handler function should only be done in non-production environments. And you should also know the data your Lambda Function is handling and thus erase the specific sensitive data where necessary for efficiency.&lt;br&gt;
I still find it an interesting approach which could be useful in some cases. Test environments should not have production data, but hey, we have all seen those cases out there…&lt;br&gt;
It was nevertheless an interesting exercise to try.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The banner image was generated using AWS Nova Canvas image generation AI model&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Fully Automated MLOps Pipeline – Part 1</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Wed, 07 Aug 2024 14:49:59 +0000</pubDate>
      <link>https://dev.to/mlnrt/fully-automated-mlops-pipeline-part-1-1mnp</link>
      <guid>https://dev.to/mlnrt/fully-automated-mlops-pipeline-part-1-1mnp</guid>
      <description>&lt;p&gt;&lt;strong&gt;THE OBJECTIVE&lt;/strong&gt;&lt;br&gt;
In &lt;a href="https://dev.to/mlnrt/near-real-time-data-ingestion-into-sagemaker-feature-store-44io"&gt;the previous blog post&lt;/a&gt; we introduced the architecture and demo of a near real time data ingestion pipeline into Amazon SageMaker Feature Store. In this post and the following one, we will present the fully automated MLOps pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dlzjrpsaew5p5nbyago.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8dlzjrpsaew5p5nbyago.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This first post will focus on the first two objectives.&lt;br&gt;
The entire source code of the demo is publicly available in the project repository on &lt;a href="https://github.com/amanoxsolutions/mlops-realtime-data-ingestion" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE MODEL&lt;/strong&gt;&lt;br&gt;
As the ingestion pipeline aggregates in near real time blockchain transaction metrics into Amazon SageMaker Feature Store, we chose to forecast the average transaction fee. &lt;br&gt;
In order to train a forecasting model, we decided to use &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html" rel="noopener noreferrer"&gt;Amazon DeepAR Forecasting Algorithm&lt;/a&gt;. That algorithm is better suited for one-dimensional multi time series (e.g. energy consumption of multiple households). However, in our case we have a one-dimensional (average transaction fee) single time series (one stream of blockchain transactions). But as per AWS documentation, DeepAR can still be used for single time series, and based on the quick test we performed, it is the model that was performing the best. &lt;br&gt;
More importantly, the main objective of this demo is – not – to train the most accurate model. We just need – a – model to experiment a fully automated MLOps lifecycle and using a prepackaged AWS model, greatly simplified our pipeline and demo development.&lt;br&gt;
The model is trained to forecast the next 30 average transaction fees. As we aggregate data per minute, it forecasts average transaction fee on the blockchain 30 minutes in the future.&lt;br&gt;
To evaluate the accuracy of the model, this demo uses the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html" rel="noopener noreferrer"&gt;mean quantile loss metric&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE ARCHITECTURE&lt;/strong&gt;&lt;br&gt;
To see the near real time data ingestion pipeline architecture please refer to the previous blog post &lt;a href="https://dev.to/mlnrt/near-real-time-data-ingestion-into-sagemaker-feature-store-44io"&gt;here&lt;/a&gt;. This architecture abstracts the data ingestion pipeline to focus on the MLOps architecture to train and operate the model.&lt;/p&gt;

&lt;p&gt;The architecture is based on AWS provided SageMaker project for MLOps (provisioned through AWS Service Catalog) which we adapted to our project. The SageMaker project provides the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS CodeCommit repository and AWS CodePipepline pipeline for

&lt;ul&gt;
&lt;li&gt;model building&lt;/li&gt;
&lt;li&gt;model deployment&lt;/li&gt;
&lt;li&gt;model monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An Amazon S3 Bucket to store all the artifacts generated during the MLOps lifecycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6siq2r6ogq8dkl6om91.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6siq2r6ogq8dkl6om91.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The “Model Build” repository and pipeline deploy a SageMaker pipeline to train the forecasting model. The build phase of that pipeline also creates SSM Parameters (if they do not exist) holding the hyperparameters for the model training and to evaluate the model accuracy.&lt;/li&gt;
&lt;li&gt;The manual approval of a trained model automatically triggers the “Model Deploy” pipeline.&lt;/li&gt;
&lt;li&gt;The “Model Deploy” pipeline deploys in the staging environment (and later on in the production environment if approved) of the model behind an Amazon SageMaker API Endpoint.&lt;/li&gt;
&lt;li&gt;Once the endpoint is in service, this automatically triggers the deployment of the “Model Monitoring” pipeline to monitor the new model.&lt;/li&gt;
&lt;li&gt;On an hourly schedule, another SageMaker pipeline is triggered to compare the model forecast results with the latest datapoints. &lt;/li&gt;
&lt;li&gt;If the model forecasting accuracy falls under the acceptable threshold, the “Model Build” pipeline is automatically retriggered, to train a new model based on the latest data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second half of this architecture will be described in the follow-up blog post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BUILDING THE MODEL WITH THE SAGEMAKER PIPELINE&lt;/strong&gt;&lt;br&gt;
This pipeline is different from the CodePipeline type of pipeline used to deploy infrastructure and applications. It is a pipeline dedicated to performing machine learning operations like training a model. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnoaorvd9njjxcj9zjlq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnoaorvd9njjxcj9zjlq.jpg" alt="Image description" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SageMaker project comes with a built-in SageMaker pipeline code which we had to refactor to match our use case. Our pipeline consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the data from SageMaker Feature store, extract the last 30 data point as a test dataset to evaluate the model and format the data for the DeepAR algorithm.&lt;/li&gt;
&lt;li&gt;Train the model.&lt;/li&gt;
&lt;li&gt;Create the trained model.&lt;/li&gt;
&lt;li&gt;Make a batch prediction of the next 30 data points based on training data.&lt;/li&gt;
&lt;li&gt;Evaluate the forecast accuracy by computing the model’s mean quantile loss between the forecast and test datapoints.&lt;/li&gt;
&lt;li&gt;Check the model accuracy compared to the threshold stored in the SSM parameter (deployed by the “Model Build” pipeline).&lt;/li&gt;
&lt;li&gt;Register the trained model if its accuracy passes the threshold.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;DEPLOYING THE MODEL&lt;/strong&gt;&lt;br&gt;
Once the model is registered in SageMaker, it must be manually approved in order to be deployed in the staging environment first. The approval of the model will automatically trigger the “Model Deploy” pipeline. This pipeline performs 3 main actions.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As the model has been approved, we take this new model accuracy as the new model threshold – if it is better (lower is better for our metric) than the existing one – and update the SSM parameter. You might not want to do that for your use case, as you might have fixed business/legal metric that you must match. But for this demo we decided to update the model accuracy as new models are retrained, hopefully building an increasingly accurate models as time passes.&lt;/li&gt;
&lt;li&gt;A first AWS CodeDeploy stage deploys the new model behind an Amazon SageMaker endpoint which can then be used to predict 30 data points in the future.&lt;/li&gt;
&lt;li&gt;Once the model has been deployed behind the staging endpoint, the pipeline has a manual approval stage before deploying the new model in production. If approved, then a second AWS CodeDeploy stage deploys the new model behind a second Amazon SageMaker endpoint for production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;THE CHALLENGES&lt;/strong&gt;&lt;br&gt;
The use of the SageMaker Project provided through AWS Service Catalog, was of significant help to quickly build the overall framework for our fully automated MLOps pipeline. However, it comes with a constraint: the model build, deploy and monitor pipelines are fixed by that AWS Service Catalog product and might not exactly fit your need. In this demo for example, in order to set and update the model accuracy threshold stored in SSM parameters we use the CodeBuild phase of the different pipelines to update that threshold (Build phase of the “Model Deploy” pipeline) or read it to create the alarm metrics. This is not necessarily the best way and place to do that, but it is the best solution we found given that fixed framework.&lt;/p&gt;

&lt;p&gt;As with every built-in framework, you can save time and move faster by benefiting from a pre-built solution, but you lose in flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fully Automated MLOps Pipeline – Part 2&lt;/strong&gt;&lt;br&gt;
The MLOps Part 2 article of this series has been written by my colleague Raphael Eymann and is available &lt;a href="https://dev.to/raphaeleymann/fully-automated-mlops-pipeline-part-2-3k64"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>mlops</category>
      <category>sagemaker</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Near Real Time Data Ingestion into SageMaker Feature Store</title>
      <dc:creator>Matthieu Lienart</dc:creator>
      <pubDate>Wed, 07 Aug 2024 14:34:52 +0000</pubDate>
      <link>https://dev.to/mlnrt/near-real-time-data-ingestion-into-sagemaker-feature-store-44io</link>
      <guid>https://dev.to/mlnrt/near-real-time-data-ingestion-into-sagemaker-feature-store-44io</guid>
      <description>&lt;p&gt;&lt;strong&gt;THE OBJECTIVE&lt;/strong&gt;&lt;br&gt;
This blog post is the first part of a 3 parts series about testing a fully automated MLOps pipeline for machine learning prediction on near real time timeseries data in AWS. This part focuses on the data ingestion pipeline into Amazon SageMaker Feature Store.&lt;br&gt;
The entire demo code is publicly available &lt;a href="https://github.com/amanoxsolutions/mlops-realtime-data-ingestion" rel="noopener noreferrer"&gt;in the project repository on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SAGEMAKER FEATURE STORE&lt;/strong&gt;&lt;br&gt;
For this demo, we have chosen to use the Amazon SageMaker Feature Store as the final repository of the data ingestion pipeline. As per &lt;a href="https://aws.amazon.com/sagemaker/feature-store/" rel="noopener noreferrer"&gt;the documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;THE DEMO&lt;/strong&gt;&lt;br&gt;
The demo ingests blockchain transactions from the blockchain.com API (see documentation &lt;a href="https://www.blockchain.com/explorer/api/blockchain_api" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Based on the data it ingests, the pipeline computes and stores 3 simple metrics in Amazon SageMaker Feature Store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The total number of transactions&lt;/li&gt;
&lt;li&gt;The total amount of transaction fees&lt;/li&gt;
&lt;li&gt;The average amount of transaction fees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are computed per minute. Although it might not be the best window period to analyze blockchain transactions, it allows us to quickly gather a lot of data points in a short period of time, avoiding running the demo for too long which has an impact on the AWS costs.&lt;br&gt;
This demo is developed using the AWS CDK and is available &lt;a href="https://github.com/amanoxsolutions/mlops-realtime-data-ingestion" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE ARCHITECTURE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb26ddi79srnroedftvo9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb26ddi79srnroedftvo9.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project is made of a self-mutating pipeline which deploys the different stacks of the project. Only the data ingestion pipeline components are shown here (the MLOps part of the architecture will be detailed in future posts). &lt;br&gt;
The pipeline works as follow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An AWS Fargate container polls the data source API every 15 seconds to ingest the last 100 transactions and publish all transactions on the data ingestion event bus of AWS EventBridge.&lt;/li&gt;
&lt;li&gt;An AWS EventBridge Rule routes the ingested data to an AWS Lambda Function.&lt;/li&gt;
&lt;li&gt;The AWS Lambda Function is used in combination with Amazon DynamoDB to keep track of recently ingested transactions and filter out transactions already ingested.&lt;/li&gt;
&lt;li&gt;The filtered data are written into an Amazon Kinesis Data Stream.&lt;/li&gt;
&lt;li&gt;The ingestion data stream is connected to an Amazon Kinesis Firehose stream which stores the raw data to an Amazon S3 Bucket for archival.&lt;/li&gt;
&lt;li&gt;An Amazon Managed Service for Apache Flink application reads the data from the ingestion stream and uses a tumbling window to compute the following 3 metrics per minute:

&lt;ul&gt;
&lt;li&gt;total number of transactions&lt;/li&gt;
&lt;li&gt;total amount of transaction fees&lt;/li&gt;
&lt;li&gt;average amount of transaction fees&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The Flink application writes the aggregated data to a delivery Amazon Kinesis Data Stream. An AWS Lambda Function reads from the delivery stream and writes the aggregated data to Amazon SageMaker Feature Store.&lt;/li&gt;
&lt;li&gt;An AWS Glue Job periodically aggregates the small files in the Amazon SageMaker Feature Store S3 Bucket to improve performance when reading data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to deploying the data ingestion pipeline, the infrastructure stack also deploys the data scientist environment using Amazon SageMaker Studio. It creates an Amazon SageMaker Studio Domain and creates a user in it with the appropriate permissions. With this, the data scientist has access to an IDE to run Jupyter Notebooks to perform analytics on the data, run experiments and test training a model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOW TO LOOK AT THE DATA BEING INGESTED?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Monitoring the Pipeline&lt;/strong&gt;&lt;br&gt;
The demo comes with a CloudWatch dashboard for you to see the data flowing through the different components. It displays in the first widget the amount of bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingested by the AWS Fargate Container&lt;/li&gt;
&lt;li&gt;Ingested by AWS EventBridge (There is unfortunately no metric per AWS EventBridge bus. This metric shows the total amount of data ingested by EventBridge in the account)&lt;/li&gt;
&lt;li&gt;Ingested by the Amazon Kinesis Data Stream ingestion stream&lt;/li&gt;
&lt;li&gt;Ingested by Amazon Kinesis Firehose from the ingestion stream&lt;/li&gt;
&lt;li&gt;Delivered by Amazon Kinesis Firehose to Amazon S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second widget displays the number of records output by the Apache Flink Application consumer and ingested from the consumer by the Apache Flink Application producer (should be equal when the Flink application works correctly). The third widget shows the amount of bytes ingested by the Amazon Kinesis Data Stream delivery stream (1 record per minute).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudqhgjdgth40cwrfcu8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudqhgjdgth40cwrfcu8r.png" alt="Image description" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Querying the Data using Amazon Athena&lt;/strong&gt;&lt;br&gt;
Using Amazon Athena, you can query the Offline Storage of Amazon SageMaker Feature Store. Here is a query example (if you deploy the demo, you will have to adapt the feature store table name)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM "sagemaker_featurestore"."mlops_sageefb3c2_agg_feature_group_1699792186"
ORDER BY tx_minute DESC
LIMIT 100;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Querying the Data using Amazon SageMaker Studio Notebook&lt;/strong&gt;&lt;br&gt;
In the repository /resources/sagemaker/tests/ folder we provide a Jupyter notebook &lt;a href="https://github.com/amanoxsolutions/mlops-realtime-data-ingestion/blob/main/resources/sagemaker/notebooks/read_feature_store.ipynb" rel="noopener noreferrer"&gt;read_feature_store.ipynb&lt;/a&gt; to read the latest entry in the online store. From the Amazon SageMaker Studio domain, you can use the provisioned user and launch a studio application. Once in the Jupyter or Code Editor environment, you can upload that notebook and run it. The notebook reads the latest data point from the Online Store of Amazon SageMaker Feature Store.&lt;br&gt;
You will observe a roughly 6 minutes difference between the timestamp of the latest data available in the Online Store versus the Offline Store of Amazon SageMaker Feature Store.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE CHALLENGES&lt;/strong&gt;&lt;br&gt;
The main challenge we faced when developing this architecture with the CDK was the cleanup of the SageMaker domain. When creating a SageMaker domain, AWS creates an Amazon EFS share with endpoints in the VPC and NSGs attached to them. When a user starts a SageMaker Studio App, compute resources are deployed to host the Code Editor/Jupyter IDE session and Jupyter kernel session. None of those resources are deleted automatically when deleting the domain. This means that a Custom Resource must be developed in the CDK Stack to clean up the domain before it gets deleted. The main issue is that deleting a SageMaker Studio App can take more than the 15 minutes maximum runtime of the Custom Resource Lambda Function. Implementing a Step Function to periodically check the SageMaker Studio App status and wait for the deletion does not help because Cloud Formation WaitCondition does not support deletes and thus does no wait to receive the signal back from the Custom Resource before continuing with deletion.&lt;br&gt;
Two issues have been opened in the CloudFormation repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1327" rel="noopener noreferrer"&gt;SageMaker Studio domain user fail to delete because of running apps #1327&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/1755" rel="noopener noreferrer"&gt;WaitCondition does not support deletes #1755&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;THE COSTS&lt;/strong&gt;&lt;br&gt;
As mentioned previously, we aggregate ingested data on a minute interval to be able to quickly train a model and see results. Yet, we recommend running the ingestion pipeline and demo for several days to have enough data. Should you want to run the full demo (with the MLOps automation part) for some time, be aware that the average monthly cost in the Ireland (eu-west-1) region is roughly $850/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMPROVEMENTS&lt;/strong&gt;&lt;br&gt;
The Amazon Managed Service for Apache Flink application computes 3 metrics per minute based on the ingested transactions. This means that the Amazon Kinesis Data Stream delivery stream ingests only one record of few kilobytes per minute. This is overkilled for a streaming application. But we kept this architecture design for this demo for learning purposes and to practice using data stream technology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fully Automated MLOps Pipeline – Part 2&lt;/strong&gt;&lt;br&gt;
The MLOps Part 2 article of this series has been written by my colleague Raphael Eymann and is available &lt;a href="https://dev.to/raphaeleymann/fully-automated-mlops-pipeline-part-2-3k64"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>mlops</category>
      <category>sagemaker</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
