Introduction
I am looking for a serverless solution for implementing semantic search. I want to handle infrequent traffic without being charged for idle time.
AWS Lambda and AWS Bedrock are fully serverless and scale to zero. The database part is more tricky, though.
I have been experimenting with LanceDB on EFS, which is an interesting approach but has some rough edges. This time I decided to use a full-blown database, so I don't need to worry about anything and can let the DB engine handle everything for me.
Neon provides serverless Postgres with amazing cold start times. It also supports a bunch of Postgres extensions, including pgvector, which I will use for searching embeddings.
The code is available in this repository.
Set up DB
I will benefit from the Neon free plan, which allows me to scale from 0 to 2 vCPUs with 8 GB of RAM. It is more than enough for a small project with infrequent traffic.
After creating a new project with default settings, I store the connection string in the .env file in my /db folder.
I use an example data set from Kaggle with movie descriptions in Polish. I want to make sure that embeddings created with Titan work with non-English languages.
For DB schema and migrations, I use sqlx-cli.
I update the DB name in the .env file to vectordb and run sqlx database create. sqlx automatically picks up the connection string from the .env file and creates the database.
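For reference, the .env file only needs the connection string copied from the Neon console. Everything below is a placeholder, not a real endpoint:

DATABASE_URL=postgres://user:password@ep-example-123456.eu-central-1.aws.neon.tech/vectordb?sslmode=require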
Next, I enable pgvector. I run sqlx migrate add pgvector and update the created SQL file:
-- Add migration script here
CREATE EXTENSION vector;
Then I run sqlx migrate run.
The movies table looks like this (1024 matches the default output dimension of the Titan v2 embeddings model):
-- Add migration script here
CREATE TABLE IF NOT EXISTS movies (
    id SERIAL PRIMARY KEY,
    title varchar(255) NOT NULL,
    short_description text NOT NULL,
    embeddings vector(1024)
);
I create it with the sqlx migrate add and sqlx migrate run commands. A check in the Neon console confirms everything looks OK.
Seed data
When working with AWS Bedrock, you might use batch inference for bigger datasets. However, I just crafted a simple Python script that goes through the CSV with movie data, calls the AWS Titan Embeddings model to get embeddings, and upserts the results to the DB. The script is in the repository.
It takes a few minutes to get all the embeddings, which is OK for me. I can confirm in the Neon console that the data was uploaded.
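The script itself is plain Python, but the core of the loop is easy to sketch in Rust as well. A minimal version (the function and row shape are hypothetical; column names come from the schema above, and it does a plain insert rather than an upsert):

use pgvector::Vector;
use sqlx::PgPool;

// Hypothetical sketch of the seeding loop: store each movie with its embedding.
async fn seed(pool: &PgPool, rows: Vec<(String, String, Vec<f32>)>) -> Result<(), sqlx::Error> {
    for (title, short_description, embedding) in rows {
        sqlx::query(
            "INSERT INTO movies (title, short_description, embeddings) VALUES ($1, $2, $3)",
        )
        .bind(&title)
        .bind(&short_description)
        .bind(Vector::from(embedding))
        .execute(pool)
        .await?;
    }
    Ok(())
}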
Lambda function
Setup local deployment
I like to test code locally before deploying to the cloud. It helps shorten the feedback loop.
I create a simple lambda function with cargo lambda.
cd backend
cargo lambda new movies
Infrastructure is done with AWS CDK.
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as rust from "@cdklabs/aws-lambda-rust";

export class SemSearchStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const moviesLambda = new rust.RustFunction(this, "MoviesLambda", {
      entry: "backend/movies",
      binaryName: "movies",
    });

    const httpApi = new cdk.aws_apigatewayv2.HttpApi(this, "MoviesApi");
    const moviesLambdaIntegration =
      new cdk.aws_apigatewayv2_integrations.HttpLambdaIntegration(
        "MoviesLambdaIntegration",
        moviesLambda
      );
    httpApi.addRoutes({
      path: "/movies",
      methods: [cdk.aws_apigatewayv2.HttpMethod.GET],
      integration: moviesLambdaIntegration,
    });
  }
}
First approach: SAM local
I can run the lambda locally with SAM, even though the infra is defined in CDK:
cdk synth --no-staging
sam local start-api -t ./cdk.out/SemSearchStack.template.json
Now I can curl localhost to see the dummy response.
The great thing about this approach is that we can use tools like Bruno or Postman to test our APIs with ease. The downside is that whenever anything changes in the code, you need to re-run cdk synth, which ends with rebuilding the lambda in release mode. It is slow...
Second approach: cargo lambda
When I want to change something in the code and check it right away, I lean on the cargo lambda tool. In one terminal I run cargo lambda watch --env-file .env, and then I can invoke the function from another one using cargo lambda invoke -F events/test.json. Whenever the code changes, the project is recompiled in dev mode. This approach lets you iterate much faster.
sqlx magic
One of the craziest features of sqlx is compile-time checking of queries. When using macros, sqlx verifies against the database that the columns you select actually exist. It is amazing, but it comes with things to consider: we might not want to require a running dev DB for every recompilation. sqlx comes with a solution that allows using the macros in "offline mode", with query metadata cached by cargo sqlx prepare. You can find more in the documentation.
Sadly, I wasn't able to make the query! and query_as! macros work with the Vector type. For now, I am sticking with the regular query_as in my code.
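For illustration, this is roughly what a compile-time-checked query over the plain columns would look like (a sketch, assuming a reachable dev DB or metadata cached with cargo sqlx prepare):

// Checked against the actual schema at compile time;
// column names and types are verified by the macro.
let movies = sqlx::query_as!(
    Movie,
    "SELECT id, title, short_description FROM movies LIMIT 5"
)
.fetch_all(&pool)
.await?;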
Application logic
The flow is straightforward: get embeddings for the received query and use them to query the database.
I don't split the Bedrock and DB logic into separate services, which I would do in a real project. For simplicity's sake, I created a single service.rs to handle the logic. The same goes for the models in models.rs.
Models
I don't have a lot of models:
use serde::{Deserialize, Serialize};
use sqlx::prelude::FromRow;

#[derive(Serialize, Deserialize, FromRow)]
#[serde(rename_all = "camelCase")]
pub(crate) struct Movie {
    pub(crate) id: i32,
    pub(crate) title: String,
    pub(crate) short_description: String,
}

#[derive(Debug, serde::Deserialize)]
#[serde(rename_all = "camelCase")]
pub(crate) struct TitanResponse {
    pub(crate) embedding: Vec<f32>,
    input_text_token_count: i128,
}
There is a struct to get a Movie from the DB and return it as JSON in the lambda response. To be honest, I would probably use separate structs for the DB integration and the API response, even if they are identical. I'll keep it this way for now, though.
The second model is used to deserialize the Bedrock response.
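For context, the raw Titan response body looks roughly like this (the values are made up), which is why the camelCase rename is needed:

{
  "embedding": [0.0123, -0.0456, ...],
  "inputTextTokenCount": 5
}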
Bedrock integration
To be able to look for similar results, I need to convert the query to an embedding first.
I create a MoviesService first and implement the new function:
use aws_sdk_bedrockruntime::{primitives::Blob, Client};
use pgvector::Vector;
use sqlx::PgPool;

use super::models::{Movie, TitanResponse};

#[derive(Clone)]
pub(crate) struct MoviesService {
    bedrock_client: Client,
    pool: PgPool,
}

impl MoviesService {
    pub(crate) fn new(bedrock_client: Client, pool: PgPool) -> Self {
        Self {
            bedrock_client,
            pool,
        }
    }
// ...
Calling AWS Bedrock APIs is quite straightforward, at least once you figure out how to format the input for the model. However, I always struggle with deserializing responses.
// ...
async fn get_embedding(&self, text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    println!("getting embedding from bedrock");
    let embeddings_prompt = format!(
        r#"{{
            "inputText": "{}"
        }}"#,
        text
    );
    let invoke_res = self
        .bedrock_client
        .invoke_model()
        .model_id("amazon.titan-embed-text-v2:0")
        .body(Blob::new(embeddings_prompt.as_bytes().to_vec()))
        .send()
        .await?;
    let resp =
        serde_json::from_slice::<TitanResponse>(&invoke_res.body().clone().into_inner())?;
    Ok(resp.embedding)
}
//...
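One caveat with the snippet above: building the JSON body with format! would produce invalid JSON if the query contained quotes or backslashes. A safer sketch (not what runs in the repository) delegates escaping to serde_json:

// Hypothetical alternative: let serde_json handle escaping of the input text.
let embeddings_prompt = serde_json::to_vec(&serde_json::json!({
    "inputText": text,
}))?;
// ...then pass it on as before:
// .body(Blob::new(embeddings_prompt))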
Interaction with Postgres is limited to a single query. Type safety is handled by sqlx. I use the L2 distance to find the closest vectors; pgvector supports other distance functions as well (see the sketch after the snippet).
//...
async fn get_movies(
    &self,
    embedding: Vec<f32>,
) -> Result<Vec<Movie>, Box<dyn std::error::Error>> {
    let formatted_embedding = Vector::from(embedding);
    println!("getting records from db");
    let movies = sqlx::query_as::<_, Movie>(
        r#"
        SELECT id, title, short_description
        FROM movies ORDER BY embeddings <-> $1 LIMIT 5;
        "#,
    )
    .bind(formatted_embedding)
    .fetch_all(&self.pool)
    .await?;
    Ok(movies)
}
//...
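For reference, switching the distance function only changes the operator in the same query. A sketch using pgvector's cosine distance operator (<-> is L2, <=> is cosine):

// Hypothetical variant: order by cosine distance instead of L2.
let movies = sqlx::query_as::<_, Movie>(
    "SELECT id, title, short_description FROM movies ORDER BY embeddings <=> $1 LIMIT 5",
)
.bind(formatted_embedding)
.fetch_all(&self.pool)
.await?;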
Finally, both operations are wrapped in the handle_get_movies function:
// ...
pub(crate) async fn handle_get_movies(
    &self,
    text: &str,
) -> Result<Vec<Movie>, Box<dyn std::error::Error>> {
    let embedding = self.get_embedding(text).await?;
    let movies = self.get_movies(embedding).await?;
    Ok(movies)
}
// ...
Deployment
For local development it was OK to pass DATABASE_URL directly to the Lambda as an environment variable, but this is not how we want to handle secrets in the long run. I have updated my stack definition with a new secret; since CDK generates a random placeholder value by default, I set the actual Neon connection string in Secrets Manager manually after deployment. I also add Bedrock permissions.
// ...
const dbSecret = new cdk.aws_secretsmanager.Secret(
  this,
  "movies-db-secret"
);

const moviesLambda = new rust.RustFunction(this, "MoviesLambda", {
  entry: "backend/movies",
  binaryName: "movies",
  environment: {
    DATABASE_SECRET_NAME: dbSecret.secretName,
  },
});

dbSecret.grantRead(moviesLambda);

moviesLambda.addToRolePolicy(
  new cdk.aws_iam.PolicyStatement({
    effect: cdk.aws_iam.Effect.ALLOW,
    actions: ["bedrock:InvokeModel"],
    resources: [
      `arn:aws:bedrock:${props?.env?.region || "eu-central-1"}::foundation-model/amazon.titan-embed-text-v2*`,
    ],
  })
);
// ...
Function handler:
use lambda_http::{run, service_fn, tracing, Body, Error, Request, RequestExt, Response};
use movies::service::MoviesService;
use sqlx::PgPool;

mod movies;

async fn function_handler(svc: MoviesService, event: Request) -> Result<Response<Body>, Error> {
    // Extract the search query from the request
    println!("Event: {:?}", event);
    let query = event
        .query_string_parameters_ref()
        .and_then(|params| params.first("query"))
        .unwrap_or("world");
    println!("Query: {}", query);

    let response = svc.handle_get_movies(query).await.unwrap();
    println!("got response");
    let serialized_response = serde_json::to_string(&response)?;

    let resp = Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(Body::Text(serialized_response))
        .map_err(Box::new)?;
    Ok(resp)
}
#[tokio::main]
async fn main() -> Result<(), Error> {
    tracing::init_default_subscriber();

    let aws_config = aws_config::from_env()
        .region(aws_config::Region::new("eu-central-1"))
        .load()
        .await;
    println!("got aws config");

    let database_secret_name =
        std::env::var("DATABASE_SECRET_NAME").expect("DATABASE_SECRET_NAME must be set");
    let secrets_client = aws_sdk_secretsmanager::Client::new(&aws_config);
    println!("getting db_url secret");
    let db_url = secrets_client
        .get_secret_value()
        .secret_id(database_secret_name)
        .send()
        .await
        .unwrap()
        .secret_string()
        .unwrap()
        .to_string();

    // The pool is created once during cold start and reused across warm invocations.
    let pool: PgPool = PgPool::connect(db_url.as_str()).await.unwrap();
    println!("connected to db");
    let bedrock_client = aws_sdk_bedrockruntime::Client::new(&aws_config);
    println!("got bedrock client");

    let movies_service = MoviesService::new(bedrock_client, pool);
    run(service_fn(|ev| {
        function_handler(movies_service.clone(), ev)
    }))
    .await
}
I deploy the stack to the eu-central-1 region. Make sure that access to the Bedrock model has been granted in that region.
Testing
When both the lambda and the DB are hot, the full roundtrip takes ~300 ms.
In the worst-case scenario, both the lambda and the Neon DB need time for a cold start. The whole request then takes just above 2 seconds, which is amazing considering it includes spinning up a Postgres database.
Load test
Additionally, I run a simple load test. I use k6 to simulate 100 virtual users, each sending a request every second for one minute. The important thing is to start the test when Neon is in the idle state, so I can see the impact of the cold start. The script (started with k6 run) looks like this:
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 100, // 100 virtual users
  duration: '60s', // test duration: 1 minute
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of requests should be below 2s
    http_req_failed: ['rate<0.01'], // less than 1% of requests should fail
  },
};

export default function () {
  const response = http.get('https://jrc9mj6h0i.execute-api.eu-central-1.amazonaws.com/movies?query=ucieczka%2520z%25C4%2599zienia');

  // Verify the response
  if (response.status !== 200) {
    console.log(`Failed request: ${response.status} ${response.body}`);
  }

  // Sleep for 1 second between requests
  sleep(1);
}
Over 5k requests were sent, all of them successful. The max request time was almost 3 seconds (still not bad), but it didn't affect overall performance, with p(95)=171ms.
Neon handled the traffic with 65 open connections, and AWS Lambda needed at most 100 concurrent executions.
Summary
I wanted to check how serverless Postgres from Neon comes together with AWS Lambda to provide semantic search based on pgvector. I used the AWS Titan Embeddings model to convert queries to vectors.
I used Rust for the AWS Lambda function to minimize the overhead of function initialization and execution.
2-3 seconds is the worst-case scenario for an end-to-end request sent to my backend via the HTTP API Gateway. From my perspective, this is a great result, keeping in mind that the Postgres database automatically scales to zero when it is not used.
Neon's serverless Postgres is a great option for applications with infrequent traffic, which might benefit from scaling to zero.