DEV Community: Matthieu Lienart

From LEGO to Video: Building an AI Storytelling App for the AWS Community Builder Community

Matthieu Lienart — Fri, 05 Jun 2026 18:28:55 +0000

As an AWS Community Builder, I wanted to build something fun that would bring the community together — not just another demo, but an actual game people could play. The result is AWS Community Builders Fantasy Quest: take a photo of a LEGO creation, and the app turns it into a 36-second epic AWS architect adventure video, generated by Amazon Nova models and narrated by Amazon Polly. Stories are scored on a community leaderboard so everyone can compete for the title of best AWS cloud hero storyteller.

I first demoed it at the AWS User Group Bern meetup on May 6th, where attendees teamed up with LEGO blocks and laughed at the resulting videos of cloud heroes battling latency dragons and Lambda knights on serverless quests.
The code is now public at github.com/mlnrt/cb-fantasy-quest.

The app was kept online for two weeks for the community to play with and is now decommissioned.

Note: This project, especially the frontend, was heavily developed using AI.

⚠️ Before you read further: Nova Reel is going EOL

The video generation in this project relies entirely on Amazon Nova Reel v1.1. AWS has marked it as Legacy and it will reach end-of-life on September 30, 2026, after which it will stop accepting requests. As far as I can tell, there is currently no replacement video generation model announced for Amazon Bedrock. So if you want to deploy and run this demo, you will need to find another video generation model and provider. See the Nova Reel model card for the official notice.

What was built

The full stack connects an Angular web app to a serverless AWS backend. The user flow is:

Take a photo of a LEGO scene via the browser camera
The image is uploaded to S3, which triggers story generation with Bedrock Nova through LangChain
A Step Functions pipeline produces a 36-second video: 6 scenes, each with a Nova Reel video clip and an Amazon Polly narration track, stitched together with FFmpeg
The finished video lands on the community leaderboard for rating

The infrastructure is fully defined in AWS CDK (TypeScript) across four stacks: Core, API, Video, and Monitoring. Every Lambda function is Python 3.13, using AWS Lambda Powertools for structured logging.

The story generation pipeline

When the photo lands in S3, it triggers a Lambda function that calls Bedrock Nova via LangChain. The model analyses the LEGO scene — detecting themes, characters, and visual elements — and generates a structured 6-scene story using Pydantic for reliable output.

Each scene is designed to fit 6 seconds of video, giving a total of 36 seconds. Once the story is persisted to DynamoDB, an EventBridge event triggers the video pipeline.

The video pipeline

A Step Functions workflow orchestrates the 6-scene production. For each scene it runs two things in sequence:

generate the Polly audio track,
submit the Nova Reel video generation job, and poll for completion.

The Bedrock video generation job writes the final video clip, as well as all 6 individual clips, directly to S3.

The six finished clips are then passed to a Composer Lambda that uses FFmpeg to merge each clip's video and audio into the final 36-second file. If the audio is longer than the video, it takes the last frame of the video and holds it until the audio finishes, creating a freeze-frame effect, but avoiding audio overlap issues. The Composer Lambda also generates the VTT subtitle file from the original story text, and uploads both the final video and subtitles to S3 for the frontend to display.
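The freeze-frame behaviour described above can be sketched as a small command builder. This is not the project's actual Composer code: the function name, the file paths, the codec choices and the use of FFmpeg's tpad filter (with stop_mode=clone) are assumptions, and the clip and narration durations are assumed to be known to the caller.

```python
def build_scene_merge_command(
    video_path: str,
    audio_path: str,
    output_path: str,
    video_seconds: float,
    audio_seconds: float,
) -> list:
    """Build an ffmpeg argument list that muxes one clip with its narration."""
    command = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path]
    overrun = audio_seconds - video_seconds
    if overrun > 0:
        # Narration is longer than the clip: repeat the last frame for the gap.
        pad = f"[0:v]tpad=stop_mode=clone:stop_duration={overrun:.3f}[v]"
        command += ["-filter_complex", pad, "-map", "[v]"]
    else:
        # Clip is long enough: use the video stream unchanged.
        command += ["-map", "0:v:0"]
    command += ["-map", "1:a:0", "-c:v", "libx264", "-c:a", "aac", output_path]
    return command


# Hypothetical scene: a 6-second clip with a 7.5-second narration.
example = build_scene_merge_command(
    "scene1.mp4", "scene1.mp3", "scene1_merged.mp4", 6.0, 7.5
)
```

The resulting list can be passed to subprocess.run(). Padding only when the narration overruns keeps the common case (audio shorter than the clip) free of an extra filter step.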

Cost controls and rate limiting

Exposing this app publicly in my AWS account, with real AWS costs behind it, required controls to manage expenses. The app enforces several layers:

Global daily cap: a DynamoDB counter with a 24-hour TTL limits the total number of videos to 5 per day. When the cap is hit, the frontend shows a friendly message and stops accepting submissions.
One story per pseudo: each Community Builder name can only generate one story, preventing a single person from consuming the entire daily quota.
Distributed generation lock: a DynamoDB item with a 5-minute TTL acts as a mutex, blocking concurrent generation attempts. The frontend polls a lock status endpoint before allowing a submission, giving users honest feedback rather than a silent failure.
API Gateway protection: every request must carry an x-api-key header, validated by a Lambda Authorizer before it ever reaches a Lambda function. The authorizer reads the expected key from SSM Parameter Store (stored as a SecureString). API Gateway caches the authorization result for 5 minutes to reduce Lambda invocations. On top of that, API Gateway throttling (5 req/s, burst 10) adds a final backstop against evil hammering.
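A global daily cap like the one above is typically enforced with a single conditional update, so that checking and incrementing the counter happen atomically. The following is a minimal sketch of the request parameters, not the project's code: the table name, the key layout (pk) and the attribute names (video_count, expires_at) are hypothetical.

```python
import time
from typing import Optional


def build_daily_cap_update(
    table_name: str, day: str, cap: int, now: Optional[int] = None
) -> dict:
    """Build kwargs for a DynamoDB update_item call that enforces a daily cap."""
    now = int(time.time()) if now is None else now
    return {
        "TableName": table_name,
        "Key": {"pk": {"S": f"daily-cap#{day}"}},
        # ADD creates the counter if it is missing, then increments it atomically.
        # The expiry is only set on the first write of the day.
        "UpdateExpression": (
            "SET expires_at = if_not_exists(expires_at, :expiry) "
            "ADD video_count :one"
        ),
        # The write is rejected once the counter has reached the cap.
        "ConditionExpression": (
            "attribute_not_exists(video_count) OR video_count < :cap"
        ),
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":cap": {"N": str(cap)},
            ":expiry": {"N": str(now + 24 * 3600)},
        },
    }


# Hypothetical values for illustration.
example = build_daily_cap_update("quota-table", "2026-05-06", cap=5, now=1_000)
```

These parameters would be passed to the update_item method of a boto3 DynamoDB client; a rejected write surfaces as a ConditionalCheckFailedException, which the caller can translate into the "cap reached" message. Keying the item by day means the cap resets even if the TTL deletion is delayed.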

An example of what was generated

Here is an example of a story generated from a LEGO scene at the Bern meetup — cloud heroes, latency dragons, and Lambda knights included:

📖 The Crystal Vault of Cloudreach Kingdom

📝 Plot Summary: In Cloudreach Kingdom, Knight Sir Archon and his Ninja sidekick Kaito must decipher a secret flag message to stop the villainous Knight Malakar before he unleashes chaos — with AWS CloudFront, VPC, and S3 as their allies.

💬 Favorite Quote: "In the realm of the cloud, wisdom crystallizes like ice in the vault of S3."

☁️ AWS Services Mastered: CloudFront, VPC, S3

👥 Characters in Your Story: Knight (hero), Ninja (sidekick), Knight with flag (hero), Character on broomstick (neutral), Knight in red (villain)

Caveats

Building this end-to-end revealed a few things worth noting for anyone attempting something similar:

Nova Reel false-positive guardrail failures are real and unhandled. The model occasionally rejects perfectly innocent LEGO prompts due to Amazon's internal content filters triggering on ambiguous visual descriptions. There is no automated retry in the current code — it is listed as an open TODO. For a conference demo this was manageable; for production it would need a proper fallback strategy.
FFmpeg in Lambda works here but has limits. The FFmpeg distribution is made up of two binaries: ffmpeg (the encoder and processor) and ffprobe (the media analyser, used to inspect codecs, duration, stream metadata, and so on). Both together push the layer size beyond Lambda's 250 MB unzipped limit, so only ffmpeg is included in the layer. For the video composition task in this project that is sufficient — merging clips, overlaying audio, and freeze-framing the last frame are all ffmpeg operations. But if your use case requires inspecting video files before processing them, you would need to either strip down the ffmpeg binary further using a custom build with only the required codecs, or use another service than AWS Lambda for the composition step.
The app controls are good enough for a short-lived demo, not for production. Nothing prevents a determined user from clearing their browser cache and picking a new pseudo to generate another story. Also, a pseudo only exists in DynamoDB once a story is created under it, so two people choosing the same name at the same moment will both be allowed through, creating a race condition. For a demo app that runs for a few days with what I consider a trusted audience, these gaps were an acceptable trade-off. They would need proper server-side identity and atomic pseudo reservation before this could be considered robust.
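The atomic pseudo reservation mentioned in the last point could be sketched as a conditional write that only succeeds when no item exists for that name yet. This is an illustration under assumptions, not part of the project: the table name, the key layout and the normalisation rule are hypothetical.

```python
def build_pseudo_reservation(table_name: str, pseudo: str, now: int) -> dict:
    """Build kwargs for a DynamoDB put_item call that reserves a pseudo once."""
    # Normalise so that "CloudHero" and " cloudhero " compete for the same item.
    normalized = pseudo.strip().lower()
    return {
        "TableName": table_name,
        "Item": {
            "pk": {"S": f"pseudo#{normalized}"},
            "reserved_at": {"N": str(now)},
        },
        # Only the first writer succeeds; a concurrent second write is rejected.
        "ConditionExpression": "attribute_not_exists(pk)",
    }


# Hypothetical values for illustration.
first = build_pseudo_reservation("players-table", "CloudHero", now=1_000)
second = build_pseudo_reservation("players-table", "  cloudhero ", now=1_001)
```

Reserving the name before story generation starts, rather than writing it when the story is saved, is what closes the race window: the second of two simultaneous requests fails the condition instead of being let through.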

Conclusion

AWS Community Builders Fantasy Quest was a fun side project that combines image analysis, LLM story generation, text-to-video, text-to-speech, and video composition into one event-driven pipeline — all on AWS, all serverless, all infrastructure-as-code. It was a lot of fun to build and even more fun to play with at the Bern meetup.

The code is open source and available at github.com/mlnrt/cb-fantasy-quest.

The Missing Link: How to Retrieve Full Documents with AWS S3 Vectors

Matthieu Lienart — Sun, 14 Sep 2025 12:43:29 +0000

Why write yet another blog article on how to use AWS S3 Vectors, when there are already many such blog articles and tutorials out there?

Because all the existing tutorials I have read miss a critical aspect: they don't explain how to actually retrieve your full documents after finding matching vectors. Instead, they store tiny example "documents" (often just a sentence) directly in the vector metadata. This approach, while convenient for a tutorial demonstrating similarity search, completely falls apart when dealing with real-world content. Hopefully nobody is going to store entire documents in vector metadata!

This article fills this important gap. I won't rehash the basics that others have covered well. Instead, I'll focus specifically on implementing complete document retrieval with S3 Vectors and a standard S3 bucket, in a way that:

Stores your actual documents in a standard S3 Bucket
Creates and indexes embeddings in S3 Vectors
Connects vector search results back to your original documents

Unlike a vector database that handles document storage and retrieval for you, S3 Vectors only manages the vector index. Understanding how to bridge this gap is essential for building production-ready applications with AWS S3 Vectors.

At a high level, using S3 Vectors is a three-step process for both storing and querying, as shown in this diagram.

Storing Documents and Vectors

The steps to store the documents and their embeddings are:

Put the documents in a regular S3 bucket. In such a context, I hash the file name or identifier to generate the S3 object key,
Use an embedding model to generate an embedding based on the content of the document,
Store the embedding in the vector index.

In this article's example, the documents are crawled web pages and the S3 object key is generated by hashing the page URL. The crawled pages have the following format:

{
    "content": string,
    "metadata": {
        "url": string,
        "title": string
    }
}

At a high level, the code for the 3 steps is then:

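import hashlib
import json
import re

import boto3

s3_vectors = boto3.client("s3vectors")
bedrock_runtime = boto3.client("bedrock-runtime")
s3 = boto3.resource("s3")
MODEL_ID = "amazon.titan-embed-text-v2:0"
vectors_data_bucket = s3.Bucket(S3_DOCUMENTS_BUCKET_NAME)
vectors = []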

for page in pages:
    key = hashlib.md5(page["metadata"]["url"].encode()).hexdigest()
    # store the actual document in the S3 bucket as a text content
    vectors_data_bucket.put_object(
        Key=key,
        Body=page["content"].encode("utf-8"),
        Metadata={
            "title": re.sub(r"[^a-zA-Z0-9\s]", "", page["metadata"]["title"]),
            "url": page["metadata"]["url"]
        }
    )
    # Generate embedding for the page text
    model_response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({
            "inputText": page["content"],
            "dimensions": 1024
        }).encode("utf-8"),
    )
    response_body = json.loads(model_response["body"].read().decode("utf-8"))
    embedding = response_body["embedding"]
    # Set the vector with the same key as the S3 Object
    vector = {
        "key": key,
        "data": {
            "float32": embedding
        },
        "metadata": page["metadata"]
    }
    vectors.append(vector)

# Store the vectors in the S3 Vectors index
s3_vectors.put_vectors(
    vectorBucketName=S3_VECTORS_BUCKET_NAME,   
    indexName=S3_VECTORS_BUCKET_INDEX_NAME,
    vectors=vectors
)
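When many pages are indexed, sending every vector in a single put_vectors call may exceed the per-request limit of the API. A minimal sketch of a batching helper follows; the helper name and the batch size of 500 used in the comment are assumptions, so check the current S3 Vectors quotas before relying on that value.

```python
from typing import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Intended use with the client from the article (not executed here):
#
# for batch in batched(vectors, 500):
#     s3_vectors.put_vectors(
#         vectorBucketName=S3_VECTORS_BUCKET_NAME,
#         indexName=S3_VECTORS_BUCKET_INDEX_NAME,
#         vectors=batch,
#     )

# Small demonstration with placeholder data.
demo_batches = list(batched(list(range(5)), 2))
```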

Querying S3 Vectors and Retrieving Documents

Then when you query the S3 Vectors to retrieve the documents you need to:

Use the embedding model to generate the embedding for the search question,
Query S3 Vectors to get the embeddings close to the search embedding,
For all embeddings in the results, use the key to retrieve the objects from the S3 Bucket.

question = "What is AWS S3 Vectors?"
documents = []

# Invoke the same model to generate the embedding for the question
response = bedrock_runtime.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps({
        "inputText": question,
        "dimensions": 1024
    }).encode("utf-8"),
)
model_response = json.loads(response["body"].read())
question_embedding = model_response["embedding"]
# Use the query embedding to search for similar embeddings in the S3 Vectors index
query_results = s3_vectors.query_vectors( 
    vectorBucketName=S3_VECTORS_BUCKET_NAME,   
    indexName=S3_VECTORS_BUCKET_INDEX_NAME,
    queryVector={"float32":question_embedding},
    topK=3, 
    returnDistance=True,
    returnMetadata=True
)
vectors = query_results.get("vectors", [])
# Retrieve the actual documents from S3 using the keys from the query results
for vector in vectors:
    obj = s3.Bucket(S3_DOCUMENTS_BUCKET_NAME).Object(vector["key"]).get()
    content = obj["Body"].read().decode("utf-8")
    documents.append({
        "title": vector["metadata"]["title"],
        "url": vector["metadata"]["url"],
        "content": content
    })
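Because the index and the bucket are two separate stores, a vector can outlive its document (for example if the object was deleted). The join step above can be sketched as a pure function that tolerates this; the function name, the fetch_content callable and the sample data are hypothetical, and an in-memory dict stands in for the bucket.

```python
from typing import Callable, Optional


def resolve_documents(
    vectors: list, fetch_content: Callable[[str], Optional[str]]
) -> list:
    """Join query_vectors results with document bodies, closest match first."""
    documents = []
    ranked = sorted(vectors, key=lambda v: v.get("distance", float("inf")))
    for vector in ranked:
        content = fetch_content(vector["key"])
        if content is None:
            # The vector exists but the object is gone: skip instead of failing.
            continue
        metadata = vector.get("metadata", {})
        documents.append(
            {
                "title": metadata.get("title", ""),
                "url": metadata.get("url", ""),
                "distance": vector.get("distance"),
                "content": content,
            }
        )
    return documents


# In-memory stand-in for the documents bucket (hypothetical data).
fake_bucket = {"k1": "Page one text", "k2": "Page two text"}
results = resolve_documents(
    [
        {"key": "k2", "distance": 0.40, "metadata": {"title": "Two", "url": "https://example.com/2"}},
        {"key": "gone", "distance": 0.10, "metadata": {"title": "Gone", "url": "https://example.com/x"}},
        {"key": "k1", "distance": 0.25, "metadata": {"title": "One", "url": "https://example.com/1"}},
    ],
    fake_bucket.get,
)
```

With boto3, fetch_content would wrap the S3 get call and return None when the key does not exist.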

Key Takeaways

Unlike a vector database which stores and retrieves the documents for you, S3 Vectors only stores the vector index. It is up to you to make the relationship between the vector key and the actual document. S3 makes that easy if you use the same key for the document vector in S3 Vectors and the document object in the S3 Bucket. While storing and retrieving becomes a multi-step process that you have to orchestrate and that inevitably increases response latency, this approach offers substantial cost savings compared to dedicated vector databases.

Note that the documents do not have to be stored in an S3 bucket. In the above example, we could imagine not storing the pages' content as objects in an S3 bucket, and instead returning only the page URLs from the vector metadata, to be crawled downstream.

Tracing LangChain with AWS X-Ray

Matthieu Lienart — Fri, 11 Jul 2025 13:45:15 +0000

LangChain is a popular framework for developing applications powered by large language models, providing components for working with LLMs through composable chains and agents. As with microservices, when building production applications with LangChain, tracing and visualizing how the different components interact becomes increasingly important. AWS X-Ray, being a native AWS service to monitor and analyze telemetry traces from Lambda functions, is the natural choice for tracing.

What is the problem?

In my previous articles "A Serverless Chatbot with LangChain & AWS Bedrock", and "Logging LangChain to AWS CloudWatch" I presented a solution for a serverless Chatbot with LangChain and AWS Bedrock. The solution implements all the features of conversation history, answering in the user language, custom context using RAG, model guardrails, structured outputs together with using LangChain callbacks for custom detailed logging to AWS CloudWatch Logs.

There are many tools and frameworks (e.g. LangSmith, Arize Phoenix, Langfuse, etc.) based on OpenTelemetry and OpenInference, to trace LLM applications and to do much more (evaluate LLM, evaluate RAG, run experiments, etc.). But I wanted to force myself to see what I could do for tracing with native AWS tools. So, I built a custom solution with the native AWS X-Ray tooling. But to do that, I needed to create X-Ray trace subsegments for every action performed by any type of LangChain Runnable. There are two challenges associated with that.

The first one is that, by default, enabling AWS X-Ray tracing on Lambda functions only traces calls to AWS services. To capture traces to external services (e.g. an HTTP request to a public REST API), the AWS X-Ray SDK for Python uses the Python wrapt library to "patch" specific class methods so that a trace subsegment is generated before the method executes. An example of that is creating a trace subsegment when the request() method (essentially making an HTTP call) of the Session class of the Python requests library is called, as shown here. As the X-Ray SDK covers only the most common Python libraries, I have to patch the LangChain library myself to intercept any execution of the invoke() or ainvoke() methods of every possible type of LangChain Runnable.

The second challenge relates to the limitations of the X-Ray SDK in AWS Lambda Functions.

The X-Ray SDK "is configured to automatically create a placeholder facade segment when it detects it is running in Lambda".
You can’t create your own trace segment in a Lambda Function. Only subsegments.
The SDK creates an AWS::Lambda::Function subsegment and further subsegments are attached to it.
When using threads, every thread trace root is the Lambda facade segment, not the subsegment that created the thread.

As LangChain tries to run its actions in parallel threads, or if you specify yourself some tasks to run in parallel, instead of seeing trace subsegments as:

AWS::Lambda::Function
└── RunnableParallel
    ├── RunnableLambda
    └── RunnableLambda

What you get is:

├── AWS::Lambda::Function
│   └── RunnableParallel
├── RunnableLambda
└── RunnableLambda

This is not what I would expect and want, as it does not accurately represent the inner working of LangChain.
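The flattened trace can be reproduced without X-Ray. The sketch below assumes that the recorder keeps the current trace entity in thread-local storage, which is my understanding of how the SDK's default context behaves; TinyRecorder is a hypothetical stand-in, not the SDK class.

```python
import threading


class TinyRecorder:
    """Stand-in for a recorder that stores the current trace entity per thread."""

    def __init__(self):
        self._local = threading.local()

    def set_trace_entity(self, entity):
        self._local.entity = entity

    def get_trace_entity(self):
        return getattr(self._local, "entity", None)


recorder = TinyRecorder()
# The main thread is "inside" the RunnableParallel subsegment.
recorder.set_trace_entity("RunnableParallel")

seen = {}


def worker(name, parent=None):
    if parent is not None:
        # Explicitly hand the parent entity over to the new thread.
        recorder.set_trace_entity(parent)
    seen[name] = recorder.get_trace_entity()


plain = threading.Thread(target=worker, args=("plain",))
fixed = threading.Thread(
    target=worker, args=("fixed", recorder.get_trace_entity())
)
for thread in (plain, fixed):
    thread.start()
for thread in (plain, fixed):
    thread.join()
```

The "plain" thread sees no parent entity at all, which in Lambda means falling back to the facade segment. The "fixed" thread sees the RunnableParallel entity because it was passed in and set explicitly.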

The Solution

In order to achieve my goals of correctly tracing LangChain Runnable interactions, I then need to fix those two issues by:

Patching the LangChain Runnable classes myself,
Resetting the trace context to the parent subsegment for all LangChain RunnableParallel classes creating threads.

Patching LangChain

This is "easily" done by following the pattern used by the AWS X-Ray SDK for the requests library as shared above. The problems are that there are many classes to cover, they will change over time, and I don’t want to repeat the same code for every Runnable class. There are also two types of classes that I must patch: Runnable and RunnableSerializable.

To achieve this, I do the following:

Create some functions to recursively list all imported child classes of Runnable and RunnableSerializable and remove duplicates.
Loop over the resulting list and wrap the invoke() method of each of those classes

import wrapt
import threading
from aws_xray_sdk.core import xray_recorder
from langchain_core.runnables import Runnable, RunnableSerializable

def dedup_classes_by_origin(classes):
    """Return a set of (module, base_name) keys for unique classes."""
    seen = set()
    for cls in classes:
        module = getattr(cls, "__module__", "")
        name = getattr(cls, "__name__", str(cls))
        base_name = name.split("[")[0]
        key = (module, base_name)
        seen.add(key)
    return seen

def all_subclasses(cls):
    """Recursively find all subclasses of a class."""
    return set(cls.__subclasses__()).union(
        [s for c in cls.__subclasses__() for s in all_subclasses(c)]
    )

def patch_langchain_runnables():
    # Get all classes from langchain_core.runnables.RunnableSerializable
    unique_serializable_classes = dedup_classes_by_origin(
        all_subclasses(RunnableSerializable)
    )
    unique_runnable_classes = dedup_classes_by_origin(all_subclasses(Runnable))
    # Combine both lists to ensure we cover all RunnableSerializable and Runnable classes
    unique_classes = unique_serializable_classes | unique_runnable_classes
    for module, class_name in unique_classes:
        # Patch the invoke method of each class
        wrapt.wrap_function_wrapper(module, f"{class_name}.invoke", traced_invoke)

The patch_langchain_runnables() function then has to be called at the beginning of the Lambda function code, immediately after the imports. For example:

import boto3
from aws_lambda_powertools.utilities.typing import LambdaContext
from langchain_core.runnables import (
    RunnableParallel,
    RunnableLambda,
    RunnablePassthrough,
)
patch_langchain_runnables()

def lambda_handler(event: dict, context: LambdaContext):
    print("Do something smart with LangChain here.")

Fixing Subsegment Parents

Now that the first challenge is solved and the invoke() method of all the child classes of Runnable and RunnableSerializable are wrapped by the traced_invoke() method, I need to define it and generate a subsegment before calling the initial class method while ensuring proper subsegment lineage.

For classes not using threads, a simple execution of the class method inside a subsegment would work.

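def traced_invoke(wrapped, instance, args, kwargs):
    # Name the subsegment after the Runnable class being invoked
    runnable_name = type(instance).__name__ if instance else wrapped.__name__
    with xray_recorder.in_subsegment(runnable_name):
        result = wrapped(*args, **kwargs)
    return result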

But as discussed previously, for Runnable classes which will be invoked in threads by the RunnableParallel class, I need to forcibly overwrite their parent from the facade segment to the RunnableParallel subsegment. I do that with the following approach:

Capture the new segment (entity)
Monkey patch the threading.Thread.run() method (after taking a backup of it) with my own version, which sets the trace entity to the one I just captured before executing the actual Thread.run() method
Execute the original class invoke() method
Restore the original threading.Thread.run() method

Thus, when the original invoke() method of the class starts a new thread, that thread first sets its trace context to the parent’s subsegment.

def traced_invoke(wrapped, instance, args, kwargs):
    """
    A wrapper function to trace the invocation of Runnable classes using AWS X-Ray.
    This function is used to create a subsegment in the X-Ray trace for the Runnable invocation.

    Args:
        wrapped: The original invoke method of the Runnable class.
        instance: The instance of the Runnable class being invoked.
        args: Positional arguments passed to the invoke method.
        kwargs: Keyword arguments passed to the invoke method.

    Returns:
        The result of the wrapped invoke method.
    """
    runnable_name = type(instance).__name__ if instance else wrapped.__name__
    with xray_recorder.in_subsegment(runnable_name):
        if runnable_name.startswith("RunnableParallel"):
            # Get the parent entity for the child threads
            parent_entity = xray_recorder.get_trace_entity()
            # Back up the original threading.Thread.run method
            orig_thread_run = threading.Thread.run

            # Monkey-patch the threading.Thread.run method to set the parent entity for new threads
            def run_with_entity(self, *a, **k):
                xray_recorder.set_trace_entity(parent_entity)
                return orig_thread_run(self, *a, **k)

            # Replace the threading.Thread.run method with our patched version
            threading.Thread.run = run_with_entity
            try:
                result = wrapped(*args, **kwargs)
            finally:
                # Once done, restore the original thread run method
                threading.Thread.run = orig_thread_run
        else:
            result = wrapped(*args, **kwargs)
        return result

Note: the same approach works for the asynchronous LangChain ainvoke() method; it is not shown here for simplicity.

Patching with wrapt vs Monkey Patching

As you saw, I use two different approaches to patch different class methods:

The wrapt library for LangChain Runnable classes invoke() method
Monkey patching for threading Thread.run() method

Why?

In the case of LangChain, I want to patch the invoke() method for all their executions and for the entire duration of the Lambda Function. The wrapt library is designed to do exactly that.

In the case of the threads, I just want to patch the Thread.run() method in the temporary context of the RunnableParallel.invoke() method. The wrapt library does not provide a built-in way to temporarily patch and then restore a method at runtime in a specific block of code.

Be careful with monkey patching though, as you are changing behaviors of methods in ways that other developers might not be aware of, and updates in the inner working of the method you are patching might interfere with your patching method. But it feels reasonable in this specific context with the actions performed by the patching method.
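One way to keep such a temporary patch contained is to wrap the backup, replace and restore steps in a context manager. This is a sketch of that pattern using only the standard library, not the code used in the article; the names thread_run_patched and before_run are hypothetical.

```python
import threading
from contextlib import contextmanager


@contextmanager
def thread_run_patched(before_run):
    """Temporarily make every started thread call before_run() first."""
    original_run = threading.Thread.run

    def run_with_hook(self, *args, **kwargs):
        before_run()
        return original_run(self, *args, **kwargs)

    threading.Thread.run = run_with_hook
    try:
        yield
    finally:
        # Always restore the original method, even if the block raised.
        threading.Thread.run = original_run


calls = []
run_before_patch = threading.Thread.run

with thread_run_patched(lambda: calls.append(threading.current_thread().name)):
    child = threading.Thread(target=lambda: None, name="child")
    child.start()
    # Join inside the block so the thread runs while the patch is active.
    child.join()

restored = threading.Thread.run is run_before_patch
```

The patch is process-wide while the block is active, so any other thread started in that window also goes through the hook. Keeping the window as short as possible limits that side effect.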

The Results

The LangChain setup described in my first article "A Serverless Chatbot with LangChain & AWS Bedrock" results in the following AWS X-Ray trace properly showing the different steps in order:

The initial steps of the chain to retrieve references from the knowledge base, retrieve the conversation history and detect the language run in parallel
The prompt being generated based on all those inputs
The LLM model called with that prompt using Bedrock
The tool to structure the model output based on my Pydantic definition called

Lessons Learned

Building upon my previous articles on serverless LangChain applications and logging LangChain to AWS CloudWatch Logs, this tracing implementation with AWS X-Ray has revealed additional insights worth sharing:

Adapting existing tools for new use cases can be challenging: While AWS X-Ray SDK wasn't designed specifically for tracing AI frameworks like LangChain, this project demonstrated the potential to extend its capabilities creatively.
Deep understanding of both AWS Lambda and AWS X-Ray internals proved crucial: Knowing how Lambda manages threads and how the X-Ray SDK patches common libraries is key to developing an effective custom tracing solution.
Performance implications of custom tracing solutions should be considered: While not explicitly discussed, it's important to note that any custom tracing implementation may impact the performance of the system and should be carefully monitored and optimized.

While this implementation successfully integrates LangChain tracing with AWS X-Ray as subsegments within the Lambda trace, the natural progression is to create a separate trace map specifically for LangChain. A visual representation of the interactions between LangChain components would allow for more granular analysis and optimization of AI-driven applications.

Logging LangChain to AWS CloudWatch

Matthieu Lienart — Wed, 18 Jun 2025 19:50:08 +0000

LangChain is a popular framework for developing applications powered by large language models, providing components for working with LLMs through composable chains and agents. When building production applications with LangChain, proper logging becomes essential for monitoring, debugging, and auditing your AI systems. AWS CloudWatch is the natural choice for logging in my serverless context, offering centralized log storage, metrics, and powerful analysis capabilities.

What is the problem?

In my previous article "A Serverless Chatbot with LangChain & AWS Bedrock", I presented a solution for a serverless Chatbot with LangChain and AWS Bedrock, having all the features of:

Conversation history
Answering in the user language
Custom context using RAG
Model guardrails
Structured output

As the described solution aims to be running on AWS Lambda, I naturally want to export all those logs to AWS CloudWatch.

The problem is that just using langchain.globals.set_debug function produces verbose, unstructured logs that become virtually unusable in CloudWatch. These logs are difficult to read, impossible to query effectively with CloudWatch Insights, and lack the context needed for proper debugging. For CloudWatch to deliver its full value, logs must be stored in a structured JSON format with consistent fields and meaningful metadata that can be filtered and analyzed programmatically.

The Solution

In essence, I created a structured logging system that transforms LangChain's verbose text output into CloudWatch-friendly structured JSON format. This solution enables effective monitoring, troubleshooting, and analysis of the LangChain applications in an AWS environment.

I use a LangChain Callback to capture LangChain actions, format the logs and send them to CloudWatch using the AWS Lambda PowerTools Logger. An advantage of this custom approach is that logs can be enriched with custom metadata, like a session or user ID.

Figure 1: High-level architecture of the serverless chatbot

For the full code, you can refer to the Jupyter notebook in this GitHub repository. While the notebook demonstrates the components locally, and the logs are just printed after being formatted, instead of being sent to CloudWatch, the principles apply directly to a Lambda Function deployment.

Using a Callback for Logging

The approach I use here is to use LangChain callbacks on actions like on_chain_start, on_chain_end, on_chain_error, etc. to capture and log the actions in the chain.

Since the messages of chain actions or LLM prompts and answers can be long and contain sensitive information, I provide parameters like exclude_inputs, exclude_outputs to the Callback to have the ability to redact such content.

The full list of LangChain callbacks is available here.

The logging callback is structured as follows:

class LoggingHandler(BaseCallbackHandler):
    def __init__(
        self,
        session_id: str,
        exclude_inputs: bool = False,
        exclude_outputs: bool = False,
    ):
        self.session_id = session_id
        self.exclude_inputs = exclude_inputs
        self.exclude_outputs = exclude_outputs

    def _parse_dict_values(…):
        …

    def on_chain_start(…):
        …

    def on_chain_end(…):
       …

    def on_llm_start(…):
       …

    def on_llm_end(…):
       …

    def on_chain_error(…):
       …

    def on_llm_error(…):
       …

Parsing the Logs

The content of the callbacks (inputs, outputs, prompts, model answers, etc.) consists of dictionaries that include LangChain serializable objects. To make sure that all nested objects are converted into standard Python types that CloudWatch Logs can handle, the handler needs to walk through the dictionary recursively and serialize the LangChain objects. This is done by the utility function _parse_dict_values().

def _parse_dict_values(self, obj: Any) -> Any:
    if isinstance(obj, Serializable):
        return obj.model_dump()
    if isinstance(obj, dict):
        return {k: self._parse_dict_values(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [self._parse_dict_values(item) for item in obj]
    return obj
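To illustrate what the recursion does, here is a stand-alone run on sample data. FakeMessage is a hypothetical stand-in for a LangChain serializable object (it only mimics the model_dump() method), and parse_values is the same logic written as a free function.

```python
import json


class FakeMessage:
    """Hypothetical stand-in for a LangChain serializable object."""

    def __init__(self, content: str):
        self.content = content

    def model_dump(self) -> dict:
        return {"type": "human", "content": self.content}


def parse_values(obj):
    """Same recursion as _parse_dict_values, written as a free function."""
    if isinstance(obj, FakeMessage):
        return obj.model_dump()
    if isinstance(obj, dict):
        return {k: parse_values(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [parse_values(item) for item in obj]
    return obj


callback_inputs = {
    "question": "What is AWS S3 Vectors?",
    "history": [FakeMessage("Hello"), {"nested": FakeMessage("Hi again")}],
}
parsed = parse_values(callback_inputs)
# After parsing, the structure only holds plain types and can be dumped to JSON.
as_json = json.dumps(parsed)
```

Objects nested inside lists and dictionaries are converted at any depth, while plain values such as the question string pass through unchanged.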

Logging LangChain Steps

Logging LangChain steps like on_chain_start and on_chain_end then involves:

Redacting the content if instructed to do so
Otherwise serializing the content
Logging the result to CloudWatch

def on_chain_start(
    self,
    serialized: dict,
    inputs: dict,
    run_id: UUID,
    parent_run_id: UUID | None = None,
    tags: list[str] | None = None,
    metadata: dict | None = None,
    **kwargs,
) -> Any:
    if self.exclude_inputs:
        sanitized_inputs = "<redacted>"
    else:
        sanitized_inputs = self._parse_dict_values(inputs)
    logger.info(
        {
            "callback": "chain/start",
            "action_name": self._get_name_from_callback(serialized, **kwargs),
            "session_id": self.session_id,
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id),
            "inputs": sanitized_inputs,
            "tags": tags,
            "metadata": metadata,
        }
    )

def on_chain_end(
    self, outputs: dict, run_id: UUID, parent_run_id: UUID | None = None, **kwargs
) -> Any:
    if self.exclude_outputs:
        sanitized_outputs = "<redacted>"
    else:
        sanitized_outputs = self._parse_dict_values(outputs)
    logger.info(
        {
            "callback": "chain/end",
            # on_chain_end receives no `serialized` argument, so the name is
            # resolved from kwargs only (assumes the helper accepts None)
            "action_name": self._get_name_from_callback(None, **kwargs),
            "session_id": self.session_id,
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id),
            "outputs": sanitized_outputs,
            "tags": kwargs.get("tags", []),
        }
    )

The _get_name_from_callback() is another utility function which tries to extract the action name in different ways depending on the content of the data. Refer to the Jupyter notebook for the full LoggingHandler code with all the callbacks and utility functions.

The Results

The logs are formatted as desired ready for the AWS Lambda PowerTools logger and AWS CloudWatch as shown in one example below.

{
    "level": "INFO",
    "location": "on_chain_start:122",
    "message": {
        "callback": "chain/start",
        "action_name": "RunnableSequence",
        "session_id": "cda54b41-8c10-47f9-87f8-f0c04a96731a",
        "run_id": "0174cf48-b8f3-4418-8d7c-13b9b0881938",
        "parent_run_id": "None",
        "inputs": {
            "question": "Wie stimmen Sie die Entwicklung eng mit den Unternehmenszielen ab?"
        },
        "tags": [],
        "metadata": {}
    },
    "timestamp": "2025-06-05 13:26:14,029+0000",
    "service": "service_undefined",
    "cold_start": false,
    "function_name": "my-function",
    "function_memory_size": "128",
    "function_arn": "arn:aws:lambda:us-east-1:************:function:my-function",
    "function_request_id": "1ef2901b-a061-40a5-9a4e-eb20ea80fc1b",
    "xray_trace_id": "1-68419af5-730d439a2c0074857ace2227"
}

Notice how the log entry contains:

Standard CloudWatch fields like level, timestamp, and Lambda execution context
Our custom message object with LangChain-specific information
Custom metadata like the session_id that allows tracking the logs for a user's entire conversation
The actual content of inputs (which could be redacted if sensitive)

With this structured format, you can use CloudWatch Logs Insights to run powerful queries like:

fields @timestamp, @message
| filter message.session_id = "cda54b41-8c10-47f9-87f8-f0c04a96731a"
| sort @timestamp asc

Image 1: Retrieving all LangChain logs for a specific user session

Lessons Learned

Building upon my previous article on serverless LangChain applications, this logging implementation has revealed additional insights worth sharing:

CloudWatch-friendly logging matters: Simply dumping LangChain's native logs to CloudWatch creates more problems than it solves. Designing logs specifically for CloudWatch's query capabilities enables effective monitoring and analysis.
Balance detail with privacy: When logging LLM interactions, you must carefully balance capturing sufficient detail for debugging against protecting sensitive information which might be contained in prompts and answers. The parameterized redaction approach demonstrated here offers a flexible solution.
Custom callbacks provide control: While LangChain offers built-in logging capabilities, custom callbacks give you precise control over what gets logged and how it's formatted, which is essential for production environments.

As LangChain and the broader LLM ecosystem continue to evolve, implementing robust logging practices will remain essential for building reliable, maintainable AI applications on AWS. The approach outlined in this article provides a foundation that you can adapt as both technologies mature.

Now that we are successfully logging, in the next article I will introduce tracing LangChain with AWS X-Ray.

A Serverless Chatbot with LangChain & AWS Bedrock

Matthieu Lienart — Sun, 15 Jun 2025 11:20:23 +0000

LangChain is an open-source framework for building applications powered by large language models (LLMs), while AWS Bedrock is a fully managed service that provides access to foundation models from leading AI companies. I know these days it's all about agentic AI, but even if you're trying to develop a simple serverless non-agentic chatbot using LangChain and AWS Bedrock, you still need to combine many advanced capabilities. This article walks you through the challenges of integrating these powerful tools to create a sophisticated chatbot with features such as conversation history management, retrieval-augmented generation (RAG), multilingual support, and more.

What is the problem?

When developing a serverless non-agentic chatbot using LangChain and AWS Bedrock, you'll likely want to incorporate several key features to make it truly useful and robust.

The ability to maintain the current conversation (here I limit the scope to the current conversation, not storing past interactions).
Provide the model with your own specific context using your knowledge base and use retrieval-augmented generation (RAG) to generate answers relating to your context (here I limit myself to crawled web pages).
The ability to answer in the language of the user.
Guardrails to make sure the answers are compliant with your chatbot objectives, prevent prompt attacks, etc.
The ability to generate outputs directly in a JSON structured format for the frontend.

At least I wanted all those things.

While numerous code samples and tutorials exist that demonstrate one or two of these capabilities, I found none that comprehensively cover all five. Moreover, many of the more complete examples rely on outdated versions of LangChain with deprecated APIs.

This article aims to fill that gap, even though it may itself quickly become obsolete given the pace of innovation in the field.

The solution

Here's an overview of the solution I developed to address these challenges:

For conversation history management, I use DynamoDB and LangChain to store the conversation history, but I implement a custom solution instead of using the common RunnableWithMessageHistory.
For multilingual support, I use AWS Comprehend to detect the question's language and generate appropriate language instructions for the model's response. Language detection could also be done using an LLM, but I suspect (although I haven't tested it) that the response time and cost would be higher.
I utilize Bedrock's built-in capabilities for the RAG knowledge base (employing the built-in web crawler to index the content, not shown here), for guardrails and for generating structured JSON outputs.

Figure 1: High-level architecture of the serverless chatbot

For the full code, you can refer to the Jupyter notebook in this GitHub repository. While the notebook demonstrates the components locally, the principles apply directly to a Lambda Function deployment.

Details

1. Manage the Conversation History

Why not Manage Conversation History with RunnableWithMessageHistory?

LangChain provides a RunnableWithMessageHistory class for managing conversation history which you will see in many code samples. Although very convenient, this approach has two significant drawbacks for my use case:

Performance Limitations: RunnableWithMessageHistory retrieves the conversation history before running the chain. But in my implementation, I need to perform three independent tasks: a) RAG context retrieval, b) Language detection and instruction generation, c) Conversation history retrieval. By parallelizing these tasks, I can reduce the latency of the initial step of the LangChain chain.
Incompatibility with Structured Output: The default implementation doesn't work well with structured output. While this can be bypassed by always returning the raw output together with the structured output and storing the raw output in the message history, it introduces a new problem. It would include unnecessary information like RAG references in the conversation history which would consume LLM prompt tokens when using that history in later prompts. So, I need to customize what is stored in the database.

Parallelization of Conversation History Retrieval

The solution implemented uses DynamoDBChatMessageHistory to define the storage.

history = DynamoDBChatMessageHistory( 
    table_name=CONVERSATION_HISTORY_TABLE_NAME, 
    session_id=session_id, 
    key=this_session_key, 
)

In the initial step of the LangChain chain, I use RunnableParallel to read the past messages from the current session using a RunnableLambda in parallel with other steps.

RunnableParallel({
    "references": …,
    "language_instructions": …,
    "history": RunnableLambda(lambda x: history.messages),
    "question": …
})

To illustrate the resulting performance improvement, let's compare the telemetry traces of both approaches.

Image 1: Telemetry trace using RunnableWithMessageHistory

Image 2: Telemetry trace with my solution

The first trace shows the sequential nature of RunnableWithMessageHistory, where conversation history retrieval in DynamoDB happens before other tasks. In contrast, the second trace demonstrates how my custom implementation allows for concurrent execution of RAG retrieval, language detection, and conversation history retrieval, leading to improved overall performance. The latency of the initial steps before calling the LLM model is improved from 1.2 sec to 1.0 sec by parallelizing the retrieval of the conversation history from DynamoDB.
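The effect of running independent steps concurrently can be illustrated with the standard library alone. This is a conceptual sketch, not the chain itself: fake_step stands in for the three retrieval tasks, and the 0.2 second delays are arbitrary placeholders, not measurements from the traces above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Arbitrary stand-in durations (seconds) for the three independent tasks.
STEPS = {"references": 0.2, "language_instructions": 0.2, "history": 0.2}


def fake_step(name: str, seconds: float) -> str:
    """Simulate an I/O-bound call such as a DynamoDB or Comprehend request."""
    time.sleep(seconds)
    return f"{name} done"


def run_sequential() -> dict[str, str]:
    # Each step waits for the previous one: total time is the sum of the delays.
    return {name: fake_step(name, delay) for name, delay in STEPS.items()}


def run_parallel() -> dict[str, str]:
    # All steps start at once: total time is roughly the slowest step.
    with ThreadPoolExecutor(max_workers=len(STEPS)) as pool:
        futures = {
            name: pool.submit(fake_step, name, delay)
            for name, delay in STEPS.items()
        }
        return {name: future.result() for name, future in futures.items()}


def timed(fn):
    """Return the function result together with its wall-clock duration."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


sequential_result, sequential_seconds = timed(run_sequential)
parallel_result, parallel_seconds = timed(run_parallel)
print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

In the real chain the gain is smaller than in this toy example because the three tasks do not take the same time: the parallel branch can never be faster than its slowest task.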

Using a Callback Handler for Storing Messages

To store the new user question and LLM answer, I use a LangChain callback on_llm_end. This allows me to:

Extract only the relevant parts of the model's response for storage
Separate the answer from the references
Minimize token usage in future prompts

class StoreMessagesCallbackHandler(BaseCallbackHandler): 
    def __init__(self, history: BaseChatMessageHistory, session_id: str, question: str):
        self.history = history
        self.session_id = session_id
        self.question = question

    def on_llm_end(self, response: LLMResult, **kwargs) -> Any:
        logger.info("Storing question and LLM answer back into DynamoDB")
        generations = response.generations
        if generations and len(generations) > 0 and generations[0] and len(generations[0]) > 0:
            response_message = generations[0][0].message
            ai_message_kwargs = response_message.model_dump()
            if isinstance(response_message.content, list) and response_message.content:
                input = response_message.content[0].get("input")
                ai_message_kwargs["content"] = input.get("answer")
                ai_message_kwargs["references"] = input.get("references")
            self.history.add_messages([
                HumanMessage(content=self.question),
                AIMessage(**ai_message_kwargs)
            ])
        else:
            logger.warning("No generations returned by LLM; no AI message to store.")
            self.history.add_message(HumanMessage(content=self.question))

It's crucial to keep the number of tokens as low as possible for two reasons:

Model prompts have a limitation in the number of input tokens,
We are charged per token used.

In my use-case, the model response consists of two parts: the answer and a list of references (including URLs and excerpts). For future interactions, only the text answer is truly necessary for the model to follow the conversation. By storing only the answer and not the references, I can significantly reduce token usage in subsequent prompts.

2. RAG Retrieval

Document retrieval using RAG is done in a classic manner using AmazonKnowledgeBasesRetriever.

kb_retriever = AmazonKnowledgeBasesRetriever(
    client=bedrock_agent_client,
    knowledge_base_id=BEDROCK_KNOWLEDGE_BASE_ID,
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

The kb_retriever is then used in the initial step of the chain to retrieve content based on the question. A custom function then formats the result so that it can be injected into the model prompt.

itemgetter("question") | kb_retriever | format_references
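The format_references function is not shown in this article. The sketch below is one hypothetical way to write it, producing the url/excerpt JSON list that appears in the prompt samples later on. Doc is a stand-in for a LangChain Document, and the metadata layout (location.webLocation.url) is an assumption about what the retriever returns for web-crawled sources; check the actual metadata of your documents before relying on it.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def format_references(docs: list[Doc], max_excerpt_chars: int = 300) -> str:
    """Turn retrieved documents into a JSON list of url/excerpt pairs.

    Hypothetical implementation; max_excerpt_chars is an illustrative
    parameter used to keep the prompt short.
    """
    references = []
    for doc in docs:
        # Assumed metadata layout for web-crawled knowledge base sources.
        location = doc.metadata.get("location") or {}
        url = (location.get("webLocation") or {}).get("url", "")
        references.append(
            {"url": url, "excerpt": doc.page_content[:max_excerpt_chars]}
        )
    return json.dumps(references, indent=4, ensure_ascii=False)


sample_docs = [
    Doc(
        page_content="LangChain is a framework for developing applications powered by large language models (LLMs).",
        metadata={
            "location": {
                "type": "WEB",
                "webLocation": {"url": "https://python.langchain.com/docs/introduction/"},
            }
        },
    )
]
print(format_references(sample_docs))
```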

3. Generate Language Instructions

I created a custom function which uses AWS Comprehend to detect the user language and generate instructions for the model which will be injected into the model prompt.

def generate_language_instructions(question: str) -> str:
    try:
        response = comprehend_client.detect_dominant_language(Text=question)
        logger.info(f"Comprehend language detection response: {response}")
        if languages := response.get("Languages"):
            # Sort languages by score and return the one with the highest score
            languages.sort(key=lambda x: x["Score"], reverse=True)
            dominant_language = languages[0]["LanguageCode"]
            logger.info(f"Detected language: {dominant_language}")
            return f"Answer the question in the provided RFC 5646 language code: '{dominant_language}'."
        logger.warning("No language detected, defaulting to basic instructions.")
        return "Answer in the same language as the question."
    except Exception as e:
        logger.error(f"Error detecting language: {e}")
        logger.warning("Defaulting to basic language instructions.")
    return "Answer in the same language as the question."

In the initial step of the LangChain chain, I call the above function using a RunnableLambda, passing it the user question.

itemgetter("question") | RunnableLambda(generate_language_instructions)

A logical improvement is to store the language code in the conversation history so that we don’t redetect it at every new user message. But this is not implemented yet.

4. Using Guardrails

Using Bedrock Guardrails when calling a model through AWS Bedrock is very simple. You just need to pass the guardrail configuration to ChatBedrockConverse.

llm = ChatBedrockConverse(
    client=bedrock_client,
    model=BEDROCK_MODEL,
    verbose=True,
    max_tokens=2048,
    temperature=0.0,
    top_p=1,
    stop_sequences=["\n\nHuman"],
    guardrail_config={
        "guardrailIdentifier": BEDROCK_GUARDRAIL_ID,
        "guardrailVersion": BEDROCK_GUARDRAIL_VERSION
    }
)

5. Structured Output

To request the model to generate the answer following a specific structure, I just specify that structure using Pydantic.

class ChatBotResponseReference(BaseModel):
    """A web reference used to answer the question"""
    url: str = Field(description="The URL of the reference")
    excerpt: str = Field(description="The extract from the reference")

class ChatBotResponse(BaseModel):
    """The response from the chatbot."""
    answer: str = Field(description="The answer to the question")
    references: list[ChatBotResponseReference] = Field(description="A list of references relating to the question")

And then update the model definition as follows.

structured_llm = llm.with_structured_output(
    ChatBotResponse, 
    include_raw=True,
)

Here, I force the model to provide both the raw answer and the structured answer. I do this because there is no guarantee that the model will follow the format instructions, so I want the raw answer as a fallback source if needed.

Warning: with this approach, there is still a risk that the model will hallucinate and, instead of reusing the references retrieved from the knowledge base, generate nonexistent ones in the answer. If you face this issue and need to give the user the exact outputs of the RAG knowledge base retrieval step, the solution is to not ask the model to generate a structured output with references. Instead, create the LangChain initial_step as defined in the next section, but invoke it on its own first to generate the prompt inputs. Then pass those inputs when invoking the prompt | llm chain, and finally combine the content of the LLM answer with the references gathered in the initial step.
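The alternative flow described in this warning can be sketched as below. This is only an outline of the control flow under stated assumptions: run_initial_step and run_prompt_and_llm are plain callables standing in for invoking the initial step and the prompt | llm chain, and the references are kept as a list here rather than as the formatted string used in the prompt.

```python
from typing import Callable


def answer_with_exact_references(
    question: str,
    run_initial_step: Callable[[dict], dict],
    run_prompt_and_llm: Callable[[dict], str],
) -> dict:
    """Return the model answer together with the untouched retrieved references."""
    # 1. Run the initial step on its own to build the prompt inputs.
    prompt_inputs = run_initial_step({"question": question})
    # 2. Ask the model for the answer text only (no structured references).
    answer = run_prompt_and_llm(prompt_inputs)
    # 3. Attach the references exactly as retrieved, so none can be invented.
    return {"answer": answer, "references": prompt_inputs["references"]}


retrieved = [
    {
        "url": "https://python.langchain.com/docs/introduction/",
        "excerpt": "LangChain is a framework for developing applications powered by large language models (LLMs).",
    }
]


def fake_initial_step(inputs: dict) -> dict:
    """Stand-in for invoking the initial parallel step."""
    return {
        "question": inputs["question"],
        "references": retrieved,
        "language_instructions": "Answer in the same language as the question.",
        "history": [],
    }


def fake_prompt_and_llm(prompt_inputs: dict) -> str:
    """Stand-in for invoking the prompt and the model."""
    return "LangChain est un framework open-source."


result = answer_with_exact_references(
    "C'est quoi LangChain?", fake_initial_step, fake_prompt_and_llm
)
print(result)
```

The price of this design is that the model no longer selects which references were actually useful: the user receives everything the retriever returned.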

The LangChain Chain

Now all the pieces are there to create the prompt and the chain.

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", 
            """You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
            If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

            Here is a list of web pages references to be used as context to answer the question.
            Copy-paste them together with your answer in the output:
            {references}

            {language_instructions}\n"""
            ), 
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)
initial_step = RunnableParallel({
        "references": itemgetter("question") | kb_retriever | format_references,
        "language_instructions": itemgetter("question") | RunnableLambda(generate_language_instructions),
        "history": RunnableLambda(lambda x: history.messages),
        "question": itemgetter("question"),
    })

full_chain = (
    initial_step
    | prompt
    | structured_llm
)

The model can then be called as follows:

question = "C'est quoi LangChain?"
chain_callbacks = [
    StoreMessagesCallbackHandler(history, session_id, question),
    CloudWatchLoggingHandler(session_id)
]
response = full_chain.invoke({"question": question}, {"callbacks": chain_callbacks})

The Results

Now that we've walked through the implementation details, let's examine the outputs of the LangChain and AWS Bedrock-powered chatbot. We'll look at three key aspects of the results:

The multi-language conversation: We'll see how the chatbot handles a multi-turn conversation in French, demonstrating its language detection and response capabilities.
The generated prompts: We'll examine the prompts created by my LangChain setup, showcasing how the conversation history and context are incorporated.
The Message History DynamoDB Table: We'll verify how the conversation is stored in the DynamoDB table, ensuring persistence across interactions.

The Conversation

Asking a question in French, "C'est quoi LangChain?" (“What is LangChain?”), results in the following prompt:

"""System: You are an assistant for question-answering tasks. Use the following pieces of retrieved context references to answer the question. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy-paste them together with your answer in the output:
[
    {
        "url": "https://python.langchain.com/docs/introduction/",
        "excerpt": "LangChain is a framework for developing applications powered by large language models (LLMs)."
    }
]

Answer the question in the provided RFC 5646 language code: 'fr'.

Human: C'est quoi LangChain?"""

This results in a structured answer in the user language including the RAG references.

{
  "answer": "LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.",
  "references": [
    {
      "url": "https://python.langchain.com/docs/introduction/",
      "excerpt": "LangChain is a framework for developing applications powered by large language models (LLMs)."
    }
  ]
}

A follow-up question asking "Cela fonctionne-t'il avec AWS Bedrock?" ("Can it work with AWS Bedrock?") produces an answer showing that it used the context (we were talking about LangChain) of the conversation to answer the new question.

{
  "answer": "LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.",
  "references": [
    {
      "url": "https://python.langchain.com/docs/integrations/chat/bedrock/",
      "excerpt": "Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs)"
    }
  ]
}

The prompt generated looks something like the shortened sample below, showing the conversation history being used.

"""System: You are an assistant for question-answering tasks. Use the following pieces of retrieved references to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Here is a list of web pages references to be used as context to answer the question. Copy/paste them together with your answer in the output:
[…]

Answer the question in the provided RFC 5646 language code: 'fr'.

Human: C'est quoi LangChain?
AI: LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.
Human: Cela fonctionne-t'il avec AWS Bedrock?"""

The Message History DynamoDB Table

If we list the messages stored in DynamoDB for this conversation, we see the following, showing that the content of the messages does not include the references (although the references are still stored in the table).

[HumanMessage(content="C'est quoi LangChain?", ...),
 AIMessage(content="LangChain est un framework open-source pour développer des applications basées sur des modèles de langage.", ...),
 HumanMessage(content="Cela fonctionne-t'il avec AWS Bedrock?", ...),
 AIMessage(content="LangChain est compatible avec AWS Bedrock, permettant l’intégration et l’utilisation des modèles de langage fournis par AWS.", ...)]

Lessons Learned

LangChain is a powerful framework that offers numerous abstractions for rapidly developing applications that interact with LLMs. However, given the rapid pace of innovation in this field, it's crucial to approach development thoughtfully:

Before diving into coding based on web examples (including this article), invest time in learning LangChain fundamentals.
Always verify that you're using the latest version of the framework to avoid deprecated features.
While LangChain's modular nature makes it a flexible and powerful tool to work with LLMs, integrating these modules effectively for your specific use case can be complex.
Be prepared to adapt and innovate, as off-the-shelf solutions may not fully address your unique requirements.
Using structured output does not guarantee that the model will respect your desired answer structure, so you need a fallback mechanism. Also, when asking for references in your structured answer, be aware that the model might hallucinate and generate fake references that are not in your knowledge base.

By keeping these lessons in mind, you'll be better equipped to leverage LangChain's capabilities while navigating its challenges.

Now that we have a feature-rich chatbot, I will present in the next article how we can log LangChain's details in AWS CloudWatch Logs.

Data Masking of AWS Lambda Function Logs

Matthieu Lienart — Sat, 15 Mar 2025 12:59:19 +0000

What are the problems?
When logging events, service API call responses, etc. from Lambda Functions, you might end up writing sensitive information like PII into CloudWatch Logs. This potentially exposes sensitive information to people who should not have access, e.g. developers and cloud platform administrators. It also makes it very hard to comply with data protection regulations such as the right to be forgotten: how do you erase a customer’s information across all the logs?

Existing Approaches
CloudWatch Logs Native Data Masking
Natively, AWS CloudWatch Logs allows data masking by using managed or custom data identifiers and data protection policies. The data identifiers are pattern matching or machine learning models which detect sensitive data. Data protection policies are JSON documents describing the operations to perform on the identified sensitive data. The operation can be set to just “audit” or to “de-identify” the data, in which case only principals authorized to perform the logs:Unmask action are able to see the data.
Note that only new data written to CloudWatch Logs will be masked according to the defined policies. Someone with access to the logs would still be able to see the sensitive data written before enabling data masking.
Although this approach prevents access to sensitive data for unauthorized personnel, it does not help with complying with the right to be forgotten.

AWS Lambda Powertools Data Masking
AWS Lambda Powertools is “a developer toolkit to implement Serverless best practices and increase developer velocity” originally developed in Python but also made available for Java, TypeScript and .NET. As of now, only the Python version offers a utility for data masking.
Two approaches are proposed. The first uses a KMS key to encrypt and decrypt the sensitive information inside the log. The second simply erases the sensitive information before the logs are written. To implement the first approach and also comply with regulations like the right to be forgotten, you would need one encryption key per customer and some way to encrypt each customer’s information with their own key. Should a customer exercise their right to be forgotten, you would simply delete their encryption key, making their data permanently unrecoverable.
Although those approaches can address both problems, you must know exactly what to encrypt/erase. For example, to erase the phone number in a list of customers you would need to do something like this:

data_masker.erase(data, fields=["customers[*].phone_number"])

But what if you are unsure at the start of a project about the data structure and the content? What if the data schema changes? What if you forgot a field in a nested JSON structure?

Erasing All PII by Default
Do you really need sensitive information like PII in application logs?
Probably not.
In that case, the AWS Lambda Powertools data erasing approach seems like the simplest one. But again, it works as long as you know the data structure and it doesn’t change. As a security/compliance officer how can I make sure the developers don’t forget to erase sensitive information?
So, I wanted to improve on the AWS Lambda Powertools approach, to erase sensitive information wherever it is in the logs…
This is what I came up with based on the AWS Lambda Powertools data masking utility.
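Before going through the steps, the goal itself (erase a key wherever it appears) can be illustrated with a few lines of standard-library Python. This is only a conceptual sketch and not how AWS Lambda Powertools implements erase(): erase_keys is a hypothetical helper, and the "*****" mask is assumed here to resemble the default output of the Powertools utility.

```python
import json
from typing import Any

MASK = "*****"  # assumed mask value, for illustration only


def erase_keys(data: Any, keys: set[str]) -> Any:
    """Return a copy of data with every listed key masked, at any depth."""
    if isinstance(data, dict):
        return {
            key: MASK if key in keys else erase_keys(value, keys)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [erase_keys(item, keys) for item in data]
    # Scalars (str, int, None, ...) are returned unchanged.
    return data


event = {
    "requestId": "abc-123",
    "customers": [
        {"name": "Jane Doe", "phoneNumber": "+41 00 000 00 00", "plan": "basic"},
        {"name": "John Doe", "contact": {"phoneNumber": "+41 11 111 11 11"}},
    ],
}
masked = erase_keys(event, {"name", "phoneNumber"})
print(json.dumps(masked, indent=2))
```

Note that the phone number nested under contact is masked even though no path to it was ever declared, which is exactly the property the explicit fields=["customers[*].phone_number"] style cannot guarantee when the schema changes.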

1- Create a Function to Erase Sensitive Data
I created a Python decorator which calls the data_masker.erase() function on the message to erase all the fields passed as a parameter before calling the function the decorator is applied to.

import json
from warnings import catch_warnings
from functools import wraps, partial
from decimal import Decimal
from typing import Any
from aws_lambda_powertools import Logger
from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider import BaseProvider

def is_valid_json_string(json_string: str) -> bool:
    """Return True only if the value is a string holding a JSON object."""
    if not isinstance(json_string, str):
        return False
    try:
        return isinstance(json.loads(json_string), dict)
    except json.JSONDecodeError:
        return False

def log_masking_decorator(masked_fields: list[str]):
    def decorator(func):
        @wraps(func)
        def wrapper(self, msg, *args, **kwargs):
            if is_valid_json_string(msg) or isinstance(msg, dict):
                # The action argument of catch_warnings requires Python 3.11+
                with catch_warnings(action="ignore"):
                    msg = self.data_masker.erase(msg, fields=masked_fields)
            return func(self, msg, *args, **kwargs)
        return wrapper
    return decorator

Code explanations:

The data_masker.erase() function only works on dictionaries and strings containing a JSON object. So we need to verify the type of the message before erasing the data.
The AWS Lambda Powertools Data Masker raises a warning if you instruct it to mask a field which it can’t find. With this approach, where I want to globally define a list of fields to mask everywhere, this would result in a lot of warnings in CloudWatch Logs, which I don’t want. So I ignore the warnings when calling the erase() method.

2- Apply the Function on all Logging Methods
A class decorator is created to apply a decorator function passed as an argument to all the logging methods (e.g. info, error, exception) of the logger class:

def decorate_log_methods(decorator):
    def decorate(cls):
        for attr in dir(cls):
            if callable(getattr(cls, attr)) and attr in [
                "info",
                "error",
                "warning",
                "exception",
                "debug",
                "critical",
            ]:
                setattr(cls, attr, decorator(getattr(cls, attr)))
        return cls
    return decorate
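To see the class decorator at work without any AWS dependency, here is a small self-contained sketch. ToyLogger and upper_case_messages are hypothetical stand-ins for the Powertools Logger and the masking decorator, and this variant iterates over a fixed tuple of method names instead of dir(cls), which is a minor simplification of the code above.

```python
from functools import wraps

LOG_METHODS = ("info", "error", "warning", "exception", "debug", "critical")


def decorate_log_methods(decorator):
    """Class decorator factory: wrap every logging method found on the class."""

    def decorate(cls):
        for attr in LOG_METHODS:
            method = getattr(cls, attr, None)
            if callable(method):
                setattr(cls, attr, decorator(method))
        return cls

    return decorate


def upper_case_messages(func):
    """Toy method decorator: transforms the message before it is logged."""

    @wraps(func)
    def wrapper(self, msg, *args, **kwargs):
        return func(self, str(msg).upper(), *args, **kwargs)

    return wrapper


@decorate_log_methods(upper_case_messages)
class ToyLogger:
    def __init__(self):
        self.records = []

    def info(self, msg):
        self.records.append(("info", msg))

    def error(self, msg):
        self.records.append(("error", msg))

    def describe(self, msg):
        # Not a logging method, so it is left undecorated.
        return msg


toy = ToyLogger()
toy.info("hello")
toy.error("boom")
print(toy.records)  # [('info', 'HELLO'), ('error', 'BOOM')]
```

Every logging method now passes through the same transformation, while methods outside the list keep their original behavior.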

3- Create a Custom Logger Class
Finally, a custom logger class is created with the class decorator from the previous step applied to it. The class decorator takes as its argument the masking decorator from step 1, which wraps each logging method with a call to data_masker.erase(). The masking decorator in turn takes as its argument the list of JSON keys that contain PII and should be erased.

def decimal_serializer(obj: Any) -> Any:
    if isinstance(obj, Decimal):
        obj = str(obj)
    return obj 

@decorate_log_methods(
    log_masking_decorator(
        masked_fields=[
            "$.[*].phoneNumber",
            "$..[*].phoneNumber",
            "$.[*].name",
            "$..[*].name",
        ]
    )
)
class CustomLogger(Logger):
    def __init__(self):
        super().__init__()
        self.datamasking_provider = BaseProvider(
            json_serializer=partial(json.dumps, default=decimal_serializer),
            json_deserializer=json.loads,
        )
        self.data_masker = DataMasking(
            provider=self.datamasking_provider, raise_on_missing_field=False
        )

Code explanations:

I use here a custom JSON serializer to convert Python Decimal values into strings to avoid errors.

4- Usage
By instantiating the Python logger in the Lambda Function as a CustomLogger() instead of the default AWS Lambda Powertools Logger(), all values of the JSON keys listed in the class decorator argument will be erased by default.

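from aws_lambda_powertools.utilities.typing import LambdaContext
from log_helpers import CustomLogger

logger = CustomLogger()

@logger.inject_lambda_context(log_event=True)
def lambda_handler(event: dict, context: LambdaContext):
    # boto3_client.whatever_service_api() is a placeholder for any AWS service call
    response = boto3_client.whatever_service_api()
    logger.info(response)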

Code explanations:

The inject_lambda_context decorator calls logger.info(). Since the logger here is our custom logger, all PII listed in our CustomLogger class decorator will be erased from the Lambda event logs. This achieves the goal of enforcing the erasure of all the listed PII without the developer having to specifically list each field to erase on every logging action.

The full code of the custom logger is available here. The repository contains a full demo showing how to secure an AWS API Gateway.

Would I Use That in Production?
No.
Parsing the entire JSON structure of every log will increase the response latency of your Lambda function, which is not something you want. As the AWS Lambda Powertools documentation says, logging the event of the Lambda handler function should only be done in non-production environments. You should also know the data your Lambda Function is handling, and thus erase only the specific sensitive data where necessary, for efficiency.
I still find it an interesting approach which could be useful in some cases. Test environments should not have production data, but hey, we have all seen those cases out there…
It was nevertheless an interesting exercise to try.

Note: The banner image was generated using the Amazon Nova Canvas image generation model.

Fully Automated MLOps Pipeline – Part 1

Matthieu Lienart — Wed, 07 Aug 2024 14:49:59 +0000

THE OBJECTIVE
In the previous blog post we introduced the architecture and demo of a near real time data ingestion pipeline into Amazon SageMaker Feature Store. In this post and the following one, we will present the fully automated MLOps pipeline.

This first post will focus on the first two objectives.
The entire source code of the demo is publicly available in the project repository on GitHub.

THE MODEL
As the ingestion pipeline aggregates in near real time blockchain transaction metrics into Amazon SageMaker Feature Store, we chose to forecast the average transaction fee.
In order to train a forecasting model, we decided to use the Amazon SageMaker DeepAR forecasting algorithm. That algorithm is better suited to one-dimensional multiple time series (e.g. the energy consumption of multiple households), whereas in our case we have a single one-dimensional time series (the average transaction fee of one stream of blockchain transactions). But as per the AWS documentation, DeepAR can still be used for a single time series, and in the quick test we performed, it was the best-performing model.
More importantly, the main objective of this demo is not to train the most accurate model. We just need a model with which to experiment with a fully automated MLOps lifecycle, and using a prepackaged AWS model greatly simplified our pipeline and demo development.
The model is trained to forecast the next 30 average transaction fees. As we aggregate data per minute, it forecasts average transaction fee on the blockchain 30 minutes in the future.
To evaluate the accuracy of the model, this demo uses the mean quantile loss metric.
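As an illustration of that metric, here is a minimal sketch of a mean quantile loss, assuming the weighted quantile loss definition documented for DeepAR (twice the pinball loss divided by the sum of absolute actual values, averaged over the quantiles). The function names are hypothetical and the demo's own evaluation code may differ.

```python
from typing import Dict, Sequence


def quantile_loss(actual: Sequence[float], forecast: Sequence[float], q: float) -> float:
    """Weighted quantile (pinball) loss for a single quantile q in (0, 1)."""
    scale = sum(abs(y) for y in actual)
    if scale == 0:
        raise ValueError("actual values must not all be zero")
    pinball = 0.0
    for y, y_hat in zip(actual, forecast):
        diff = y - y_hat
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        pinball += max(q * diff, (q - 1.0) * diff)
    return 2.0 * pinball / scale


def mean_quantile_loss(
    actual: Sequence[float],
    forecasts_by_quantile: Dict[float, Sequence[float]],
) -> float:
    """Average the weighted quantile loss over all forecast quantiles."""
    losses = [quantile_loss(actual, f, q) for q, f in forecasts_by_quantile.items()]
    return sum(losses) / len(losses)
```

Lower values mean a better forecast, which is why a lower value counts as an improvement when comparing models.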

THE ARCHITECTURE
To see the near real time data ingestion pipeline architecture please refer to the previous blog post here. This architecture abstracts the data ingestion pipeline to focus on the MLOps architecture to train and operate the model.

The architecture is based on the AWS-provided SageMaker project for MLOps (provisioned through AWS Service Catalog), which we adapted to our project. The SageMaker project provides the following:

An AWS CodeCommit repository and AWS CodePipeline pipeline for:
- model building
- model deployment
- model monitoring
An Amazon S3 Bucket to store all the artifacts generated during the MLOps lifecycle.

The “Model Build” repository and pipeline deploy a SageMaker pipeline to train the forecasting model. The build phase of that pipeline also creates SSM Parameters (if they do not exist) holding the hyperparameters for the model training and the threshold used to evaluate the model accuracy.
The manual approval of a trained model automatically triggers the “Model Deploy” pipeline.
The “Model Deploy” pipeline deploys the model behind an Amazon SageMaker API Endpoint in the staging environment (and later in the production environment, if approved).
Once the endpoint is in service, this automatically triggers the deployment of the “Model Monitoring” pipeline to monitor the new model.
On an hourly schedule, another SageMaker pipeline is triggered to compare the model forecast results with the latest datapoints.
If the model forecasting accuracy degrades beyond the acceptable threshold, the “Model Build” pipeline is automatically retriggered to train a new model based on the latest data.

The second half of this architecture will be described in the follow-up blog post.

BUILDING THE MODEL WITH THE SAGEMAKER PIPELINE
This pipeline is different from the CodePipeline type of pipeline used to deploy infrastructure and applications. It is a pipeline dedicated to performing machine learning operations like training a model.

The SageMaker project comes with a built-in SageMaker pipeline code which we had to refactor to match our use case. Our pipeline consists of the following steps:

Read the data from SageMaker Feature Store, extract the last 30 data points as a test dataset to evaluate the model, and format the data for the DeepAR algorithm.
Train the model.
Create the trained model.
Make a batch prediction of the next 30 data points based on training data.
Evaluate the forecast accuracy by computing the model’s mean quantile loss between the forecast and test datapoints.
Check the model accuracy compared to the threshold stored in the SSM parameter (deployed by the “Model Build” pipeline).
Register the trained model if its accuracy passes the threshold.
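The first step above can be sketched as follows. This is a simplified illustration, not the project's code: it assumes the series is already loaded as a list of per-minute values and uses the JSON Lines layout that DeepAR accepts (a `start` timestamp and a `target` array). The function names are hypothetical.

```python
import json
from datetime import datetime
from typing import List, Tuple

PREDICTION_LENGTH = 30  # number of one-minute data points held out for evaluation


def split_series(
    values: List[float], horizon: int = PREDICTION_LENGTH
) -> Tuple[List[float], List[float]]:
    """Split a series into training data and the last `horizon` points for testing."""
    if len(values) <= horizon:
        raise ValueError("series must be longer than the forecast horizon")
    return values[:-horizon], values[-horizon:]


def to_deepar_line(start: datetime, target: List[float]) -> str:
    """Serialize one time series as a DeepAR JSON Lines record."""
    return json.dumps(
        {"start": start.strftime("%Y-%m-%d %H:%M:%S"), "target": target}
    )
```

The training file would then contain one such line for the training part of the series, while the held-out points are kept aside for the evaluation step.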

DEPLOYING THE MODEL
Once the model is registered in SageMaker, it must be manually approved in order to be deployed in the staging environment first. The approval of the model will automatically trigger the “Model Deploy” pipeline. This pipeline performs 3 main actions.

As the model has been approved, we take the accuracy of this new model as the new threshold – if it is better (lower is better for our metric) than the existing one – and update the SSM parameter. You might not want to do that for your use case, as you might have a fixed business/legal metric that you must match. But for this demo we decided to update the threshold as new models are retrained, hopefully building increasingly accurate models as time passes.
A first AWS CodeDeploy stage deploys the new model behind an Amazon SageMaker endpoint which can then be used to predict 30 data points in the future.
Once the model has been deployed behind the staging endpoint, the pipeline has a manual approval stage before deploying the new model in production. If approved, then a second AWS CodeDeploy stage deploys the new model behind a second Amazon SageMaker endpoint for production.
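The threshold update described in the first action could look like the sketch below. It assumes the threshold is stored as a plain string in a single SSM parameter; the parameter name and the function are hypothetical. The `ssm` argument is expected to behave like `boto3.client("ssm")`, whose `get_parameter` and `put_parameter` calls are used here.

```python
def update_threshold(ssm, parameter_name: str, new_loss: float) -> float:
    """Store new_loss as the threshold only if it beats the current one.

    Lower is better for the quantile loss, so the parameter is only
    overwritten when the new model's loss is strictly lower.
    Returns the threshold in effect after the call.
    """
    response = ssm.get_parameter(Name=parameter_name)
    current = float(response["Parameter"]["Value"])
    if new_loss < current:
        ssm.put_parameter(
            Name=parameter_name,
            Value=str(new_loss),
            Type="String",
            Overwrite=True,
        )
        return new_loss
    return current
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves credentials and region configuration to the caller.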

THE CHALLENGES
The use of the SageMaker Project provided through AWS Service Catalog was of significant help to quickly build the overall framework for our fully automated MLOps pipeline. However, it comes with a constraint: the model build, deploy and monitor pipelines are fixed by that AWS Service Catalog product and might not exactly fit your needs. In this demo for example, in order to set and update the model accuracy threshold stored in SSM parameters, we use the CodeBuild phase of the different pipelines to update that threshold (Build phase of the “Model Deploy” pipeline) or read it to create the alarm metrics. This is not necessarily the best way and place to do that, but it is the best solution we found given that fixed framework.

As with every built-in framework, you can save time and move faster by benefiting from a pre-built solution, but you lose flexibility.

Fully Automated MLOps Pipeline – Part 2
The MLOps Part 2 article of this series has been written by my colleague Raphael Eymann and is available here

Near Real Time Data Ingestion into SageMaker Feature Store

Matthieu Lienart — Wed, 07 Aug 2024 14:34:52 +0000

THE OBJECTIVE
This blog post is the first part of a three-part series about testing a fully automated MLOps pipeline for machine learning prediction on near real time time series data in AWS. This part focuses on the data ingestion pipeline into Amazon SageMaker Feature Store.
The entire demo code is publicly available in the project repository on GitHub.

SAGEMAKER FEATURE STORE
For this demo, we have chosen to use the Amazon SageMaker Feature Store as the final repository of the data ingestion pipeline. As per the documentation:

“Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference.”

THE DEMO
The demo ingests blockchain transactions from the blockchain.com API (see documentation here). Based on the data it ingests, the pipeline computes and stores 3 simple metrics in Amazon SageMaker Feature Store:

The total number of transactions
The total amount of transaction fees
The average amount of transaction fees

These metrics are computed per minute. Although it might not be the best window period to analyze blockchain transactions, it allows us to quickly gather a lot of data points in a short period of time, so the demo does not have to run for long, which keeps the AWS costs down.
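To make the three metrics concrete, here is a minimal plain-Python sketch of a one-minute tumbling window. It is not the Flink application used in the demo, and the input shape (pairs of epoch seconds and fee) and the output key names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # one-minute tumbling window


def aggregate_per_minute(
    transactions: Iterable[Tuple[int, float]]
) -> Dict[int, Dict[str, float]]:
    """Aggregate (epoch_seconds, fee) pairs into per-minute metrics.

    Returns a dict keyed by the window start (epoch seconds) holding the
    transaction count, the total fee and the average fee of that window.
    """
    buckets = defaultdict(lambda: {"tx_count": 0, "total_fee": 0.0})
    for timestamp, fee in transactions:
        # Each transaction belongs to exactly one non-overlapping window.
        window_start = timestamp - timestamp % WINDOW_SECONDS
        buckets[window_start]["tx_count"] += 1
        buckets[window_start]["total_fee"] += fee

    return {
        start: {
            "tx_count": b["tx_count"],
            "total_fee": b["total_fee"],
            "avg_fee": b["total_fee"] / b["tx_count"],
        }
        for start, b in buckets.items()
    }
```

A streaming engine computes the same values incrementally and emits one record when each window closes, instead of holding all transactions in memory.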
This demo is developed using the AWS CDK and is available here.

THE ARCHITECTURE

The project is made of a self-mutating pipeline which deploys the different stacks of the project. Only the data ingestion pipeline components are shown here (the MLOps part of the architecture will be detailed in future posts).
The pipeline works as follows:

An AWS Fargate container polls the data source API every 15 seconds to ingest the last 100 transactions and publishes all transactions to the data ingestion event bus of Amazon EventBridge.
An Amazon EventBridge Rule routes the ingested data to an AWS Lambda Function.
The AWS Lambda Function is used in combination with Amazon DynamoDB to keep track of recently ingested transactions and filter out transactions already ingested.
The filtered data are written into an Amazon Kinesis Data Stream.
The ingestion data stream is connected to an Amazon Kinesis Firehose stream which stores the raw data to an Amazon S3 Bucket for archival.
An Amazon Managed Service for Apache Flink application reads the data from the ingestion stream and uses a tumbling window to compute the following 3 metrics per minute:
- total number of transactions
- total amount of transaction fees
- average amount of transaction fees
The Flink application writes the aggregated data to a delivery Amazon Kinesis Data Stream. An AWS Lambda Function reads from the delivery stream and writes the aggregated data to Amazon SageMaker Feature Store.
An AWS Glue Job periodically aggregates the small files in the Amazon SageMaker Feature Store S3 Bucket to improve performance when reading data.
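Because the container fetches the last 100 transactions every 15 seconds, consecutive polls can return overlapping results, which is why the filtering step is needed. The sketch below illustrates that idea with an in-memory dictionary standing in for the DynamoDB table; the class name, the `hash` key and the expiry duration are assumptions, and the demo's Lambda function may work differently.

```python
import time
from typing import Dict, Iterable, List, Optional


class RecentTransactionFilter:
    """Remember recently seen transaction hashes and drop repeats."""

    def __init__(self, ttl_seconds: int = 600) -> None:
        self.ttl_seconds = ttl_seconds
        self._seen: Dict[str, float] = {}  # hash -> time first seen

    def filter_new(
        self, transactions: Iterable[dict], now: Optional[float] = None
    ) -> List[dict]:
        """Return only the transactions that were not seen recently."""
        now = time.time() if now is None else now
        # Forget hashes older than the TTL so the state stays small.
        self._seen = {
            h: seen_at
            for h, seen_at in self._seen.items()
            if now - seen_at < self.ttl_seconds
        }
        fresh = []
        for tx in transactions:
            tx_hash = tx["hash"]  # assumed unique identifier field
            if tx_hash not in self._seen:
                self._seen[tx_hash] = now
                fresh.append(tx)
        return fresh
```

With DynamoDB, the same effect can be obtained with a conditional write on the transaction identifier and a time-to-live attribute, so that the state survives across Lambda invocations.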

In addition to deploying the data ingestion pipeline, the infrastructure stack also deploys the data scientist environment using Amazon SageMaker Studio. It creates an Amazon SageMaker Studio Domain and creates a user in it with the appropriate permissions. With this, the data scientist has access to an IDE to run Jupyter Notebooks to perform analytics on the data, run experiments and test training a model.

HOW TO LOOK AT THE DATA BEING INGESTED?
Monitoring the Pipeline
The demo comes with a CloudWatch dashboard for you to see the data flowing through the different components. The first widget displays the number of bytes:

Ingested by the AWS Fargate Container
Ingested by Amazon EventBridge (there is unfortunately no metric per Amazon EventBridge bus; this metric shows the total amount of data ingested by EventBridge in the account)
Ingested by the Amazon Kinesis Data Stream ingestion stream
Ingested by Amazon Kinesis Firehose from the ingestion stream
Delivered by Amazon Kinesis Firehose to Amazon S3

The second widget displays the number of records output by the Apache Flink Application consumer and the number ingested from the consumer by the Apache Flink Application producer (the two should be equal when the Flink application works correctly). The third widget shows the number of bytes ingested by the Amazon Kinesis Data Stream delivery stream (1 record per minute).

Querying the Data using Amazon Athena
Using Amazon Athena, you can query the Offline Storage of Amazon SageMaker Feature Store. Here is an example query (if you deploy the demo, you will have to adapt the feature store table name):

SELECT *
FROM "sagemaker_featurestore"."mlops_sageefb3c2_agg_feature_group_1699792186"
ORDER BY tx_minute DESC
LIMIT 100;

Querying the Data using Amazon SageMaker Studio Notebook
In the repository /resources/sagemaker/tests/ folder we provide a Jupyter notebook read_feature_store.ipynb to read the latest entry in the Online Store. From the Amazon SageMaker Studio domain, you can use the provisioned user and launch a Studio application. Once in the Jupyter or Code Editor environment, you can upload that notebook and run it. The notebook reads the latest data point from the Online Store of Amazon SageMaker Feature Store.
You will observe a difference of roughly 6 minutes between the timestamp of the latest data available in the Online Store and in the Offline Store of Amazon SageMaker Feature Store.
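Reading a record from the Online Store, as the notebook does, can be sketched with the Feature Store runtime API. This is an illustration rather than the notebook's code: the function name is hypothetical, and the feature group name and record identifier depend on your deployment. The `featurestore_runtime` argument is expected to behave like `boto3.client("sagemaker-featurestore-runtime")`.

```python
from typing import Dict


def read_latest_record(
    featurestore_runtime, feature_group_name: str, record_id: str
) -> Dict[str, str]:
    """Fetch one record from the Online Store as a {feature: value} dict.

    The Online Store keeps only the latest record per identifier, so a
    single GetRecord call returns the most recent values.
    """
    response = featurestore_runtime.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=record_id,
    )
    # "Record" is absent from the response when no record matches.
    return {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in response.get("Record", [])
    }
```

All values come back as strings, so numeric features need to be converted before being used in calculations.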

THE CHALLENGES
The main challenge we faced when developing this architecture with the CDK was the cleanup of the SageMaker domain. When creating a SageMaker domain, AWS creates an Amazon EFS share with endpoints in the VPC and security groups attached to them. When a user starts a SageMaker Studio App, compute resources are deployed to host the Code Editor/Jupyter IDE session and Jupyter kernel session. None of those resources are deleted automatically when deleting the domain. This means that a Custom Resource must be developed in the CDK Stack to clean up the domain before it gets deleted. The main issue is that deleting a SageMaker Studio App can take longer than the 15-minute maximum runtime of the Custom Resource Lambda Function. Implementing a Step Function to periodically check the SageMaker Studio App status and wait for the deletion does not help, because the CloudFormation WaitCondition does not support deletes and thus does not wait to receive the signal back from the Custom Resource before continuing with the deletion.
Two issues have been opened in the CloudFormation repository:

THE COSTS
As mentioned previously, we aggregate ingested data on a minute interval to be able to quickly train a model and see results. Yet, we recommend running the ingestion pipeline and demo for several days to have enough data. Should you want to run the full demo (with the MLOps automation part) for some time, be aware that the cost in the Ireland (eu-west-1) region is roughly $850 per month.

IMPROVEMENTS
The Amazon Managed Service for Apache Flink application computes 3 metrics per minute based on the ingested transactions. This means that the Amazon Kinesis Data Stream delivery stream ingests only one record of a few kilobytes per minute. This is overkill for a streaming application, but we kept this architecture design for this demo for learning purposes and to practice using data stream technology.

Fully Automated MLOps Pipeline – Part 2
The MLOps Part 2 article of this series has been written by my colleague Raphael Eymann and is available here