Shakir for AWS Community Builders


AWS Bedrock KB with Glue data catalog

Hi 👋, in this post we shall explore Bedrock's structured KB with this architecture: upload CSVs to S3 > S3 event notifications to an SQS queue > crawl data with Glue > query with Redshift > Bedrock KB > query with an LLM.

Setup

Let's do some of this with code. Let's get started.

Repo

Clone the repo and switch to the project directory.

git clone git@github.com:networkandcode/networkandcode.github.io.git
cd structured-kb-demo/

Do a uv sync.

uv sync

Set up the environment variables.

$ cat .env
AWS_ACCOUNT_ID=
AWS_ACCESS_KEY_ID=
AWS_REGION=ap-south-1
AWS_SECRET_ACCESS_KEY=

BEDROCK_KB=StructKb
BEDROCK_KB_IAM_POLICY=StructKbIamPolicy
BEDROCK_KB_IAM_ROLE=StructKbIamRole

GLUE_CRAWLER=struct-kb-glue-crawler
GLUE_CRAWLER_IAM_POLICY=StructKbGlueCrawlerIamPolicy
GLUE_CRAWLER_IAM_ROLE=StructKbGlueCrawlerIamRole
GLUE_DB=struct-kb-glue-db

REDSHIFT_IAM_ROLE=StructKbRedshiftIamRole
REDSHIFT_NAMESPACE=struct-kb-rs-ns
REDSHIFT_WORKGROUP=struct-kb-rs-wg

S3_BUCKET=struct-kb-bucket
S3_FOLDER=inventory

SQS_QUEUE=struct-kb-queue

Common files

The vars file loads all the env vars in one place, the arns file forms some of the ARNs we need, and the logger file sets up a common logger for the rest of the code.
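The repo's files aren't reproduced in this post, but to give a rough idea, a vars-style loader could be as simple as the sketch below (the parse_env helper is hypothetical, not the repo's actual code):

```python
import os

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def load_env(path: str = ".env") -> None:
    """Load variables from a .env file into os.environ without overriding existing ones."""
    with open(path) as f:
        for key, value in parse_env(f.read()).items():
            os.environ.setdefault(key, value)
```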

Bucket

Set up an S3 bucket.

uv run setup_s3_bucket.py 
INFO:logger:Bucket struct-kb-s3-bucket created successfully
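setup_s3_bucket.py itself isn't shown in the post; here's a minimal boto3 sketch of what it might do. One gotcha worth noting: create_bucket must omit the LocationConstraint in us-east-1 but requires it in other regions such as ap-south-1.

```python
import os

def build_create_bucket_kwargs(bucket: str, region: str) -> dict:
    """kwargs for s3.create_bucket; us-east-1 is the only region without a location constraint."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

def main():
    # Requires AWS credentials; call main() to actually create the bucket.
    import boto3
    region = os.environ["AWS_REGION"]
    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**build_create_bucket_kwargs(os.environ["S3_BUCKET"], region))
```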

Queue

Set up an SQS queue with an access policy that allows the S3 bucket to send messages to it.

uv run setup_sqs_queue.py
INFO:logger:Queue created successfully.
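Here's roughly what setup_sqs_queue.py could look like — a sketch, not the repo's actual code. The access policy scopes sqs:SendMessage down to the s3.amazonaws.com service principal, our bucket's ARN, and our account ID:

```python
import json

def build_queue_policy(queue_arn: str, bucket_arn: str, account_id: str) -> str:
    """Queue access policy letting S3 (from our bucket and account only) send messages."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {
                "ArnEquals": {"aws:SourceArn": bucket_arn},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        }],
    })

def main():
    # Requires AWS credentials; call main() to actually create the queue.
    import boto3, os
    region, account = os.environ["AWS_REGION"], os.environ["AWS_ACCOUNT_ID"]
    queue_arn = f"arn:aws:sqs:{region}:{account}:{os.environ['SQS_QUEUE']}"
    bucket_arn = f"arn:aws:s3:::{os.environ['S3_BUCKET']}"
    sqs = boto3.client("sqs", region_name=region)
    sqs.create_queue(
        QueueName=os.environ["SQS_QUEUE"],
        Attributes={"Policy": build_queue_policy(queue_arn, bucket_arn, account)},
    )
```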

Event notification

Update the S3 bucket to notify the SQS queue on object events.

uv run setup_s3_event_notification.py
INFO:logger:Successfully added event notifications
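A sketch of the notification configuration that setup_s3_event_notification.py might apply — object-created events under the inventory/ prefix go to the queue (the exact events and filter are my assumption, not taken from the repo):

```python
def build_notification_config(queue_arn: str, prefix: str) -> dict:
    """Notify the queue whenever an object is created under the given prefix."""
    return {
        "QueueConfigurations": [{
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": f"{prefix}/"}]}},
        }]
    }

def main():
    # Requires AWS credentials; call main() to actually attach the notification.
    import boto3, os
    region, account = os.environ["AWS_REGION"], os.environ["AWS_ACCOUNT_ID"]
    queue_arn = f"arn:aws:sqs:{region}:{account}:{os.environ['SQS_QUEUE']}"
    s3 = boto3.client("s3", region_name=region)
    s3.put_bucket_notification_configuration(
        Bucket=os.environ["S3_BUCKET"],
        NotificationConfiguration=build_notification_config(queue_arn, os.environ["S3_FOLDER"]),
    )
```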

Database

Set up a Glue database.

uv run setup_glue_db.py
INFO:logger:Glue database created successfully.
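This step is a single API call; a sketch of what setup_glue_db.py might boil down to (the description string is my placeholder):

```python
def build_database_input(name: str) -> dict:
    """DatabaseInput for glue.create_database."""
    return {"Name": name, "Description": "Glue database for the structured KB demo"}

def main():
    # Requires AWS credentials; call main() to actually create the database.
    import boto3, os
    glue = boto3.client("glue", region_name=os.environ["AWS_REGION"])
    glue.create_database(DatabaseInput=build_database_input(os.environ["GLUE_DB"]))
```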

Crawler

Set up an IAM policy that allows access to the S3 bucket and the SQS queue.

uv run setup_glue_crawler_iam_policy.py
INFO:logger:Policy created successfully!
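A sketch of the policy document setup_glue_crawler_iam_policy.py might build — read access on the bucket plus consume rights on the event queue (the exact action list is my assumption; trim or extend for your setup):

```python
import json

def build_crawler_policy(bucket: str, queue_arn: str) -> dict:
    """Read access to the bucket, plus consume rights on the S3 event queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage",
                           "sqs:GetQueueAttributes", "sqs:PurgeQueue"],
                "Resource": queue_arn,
            },
        ],
    }

def main():
    # Requires AWS credentials; call main() to actually create the policy.
    import boto3, os
    queue_arn = f"arn:aws:sqs:{os.environ['AWS_REGION']}:{os.environ['AWS_ACCOUNT_ID']}:{os.environ['SQS_QUEUE']}"
    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName=os.environ["GLUE_CRAWLER_IAM_POLICY"],
        PolicyDocument=json.dumps(build_crawler_policy(os.environ["S3_BUCKET"], queue_arn)),
    )
```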

Set up an IAM role that attaches the policy we just defined, as well as the AWS-managed Glue service role policy.

uv run setup_glue_crawler_iam_role.py
INFO:logger:Created role
INFO:logger:AWS Glue Service Role policy attached.
INFO:logger:Custom Glue Crawler policy attached.
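The role creation likely looks something like this sketch: a trust policy for glue.amazonaws.com, then two attach_role_policy calls (one for the AWS-managed AWSGlueServiceRole policy, one for the custom policy above):

```python
import json

# Trust policy letting the Glue service assume the role
GLUE_TRUST = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

def main():
    # Requires AWS credentials; call main() to actually create and wire the role.
    import boto3, os
    iam = boto3.client("iam")
    role = os.environ["GLUE_CRAWLER_IAM_ROLE"]
    iam.create_role(RoleName=role, AssumeRolePolicyDocument=json.dumps(GLUE_TRUST))
    # AWS-managed Glue service role policy
    iam.attach_role_policy(RoleName=role,
                           PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
    # The custom policy from the previous step
    iam.attach_role_policy(
        RoleName=role,
        PolicyArn=f"arn:aws:iam::{os.environ['AWS_ACCOUNT_ID']}:policy/{os.environ['GLUE_CRAWLER_IAM_POLICY']}",
    )
```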

We can now provision a Glue crawler and attach the role above to it.

uv run setup_glue_crawler.py
INFO:logger:Crawler created successfully.
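The interesting bit here is wiring the SQS queue into the crawler so later runs only crawl what changed. A sketch of what the create_crawler call in setup_glue_crawler.py might look like, assuming Glue's event-mode recrawl (EventQueueArn on the S3 target plus RecrawlBehavior CRAWL_EVENT_MODE):

```python
def build_crawler_config(name: str, role: str, db: str,
                         bucket: str, folder: str, queue_arn: str) -> dict:
    """Crawler over the S3 prefix; the SQS queue feeds it S3 events so that
    subsequent runs only crawl what changed (CRAWL_EVENT_MODE)."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": db,
        "Targets": {"S3Targets": [{
            "Path": f"s3://{bucket}/{folder}/",
            "EventQueueArn": queue_arn,
        }]},
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    }

def main():
    # Requires AWS credentials; call main() to actually create the crawler.
    import boto3, os
    queue_arn = f"arn:aws:sqs:{os.environ['AWS_REGION']}:{os.environ['AWS_ACCOUNT_ID']}:{os.environ['SQS_QUEUE']}"
    glue = boto3.client("glue", region_name=os.environ["AWS_REGION"])
    glue.create_crawler(**build_crawler_config(
        os.environ["GLUE_CRAWLER"], os.environ["GLUE_CRAWLER_IAM_ROLE"],
        os.environ["GLUE_DB"], os.environ["S3_BUCKET"], os.environ["S3_FOLDER"], queue_arn,
    ))
```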

Redshift

We shall set up a Redshift IAM role and attach the AWS-managed AmazonRedshiftAllCommandsFullAccess policy to it.

uv run setup_redshift_iam_role.py
INFO:logger:Created role: StructKbRedshiftIamRole
INFO:logger:Attached AmazonRedshiftAllCommandsFullAccess to StructKbRedshiftIamRole
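A sketch of the role setup — the trust principals below (both the provisioned and serverless Redshift service principals) are my assumption; verify against what your setup actually needs:

```python
import json

# Trust policy for Redshift; principals assumed, verify for your setup
REDSHIFT_TRUST = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": ["redshift.amazonaws.com", "redshift-serverless.amazonaws.com"]},
        "Action": "sts:AssumeRole",
    }],
}

def main():
    # Requires AWS credentials; call main() to actually create the role.
    import boto3, os
    iam = boto3.client("iam")
    role = os.environ["REDSHIFT_IAM_ROLE"]
    iam.create_role(RoleName=role, AssumeRolePolicyDocument=json.dumps(REDSHIFT_TRUST))
    iam.attach_role_policy(
        RoleName=role,
        PolicyArn="arn:aws:iam::aws:policy/AmazonRedshiftAllCommandsFullAccess",
    )
```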

Provision a namespace, attach the role above to it, and also provision a workgroup to run the namespace's workloads on.

uv run setup_redshift_workgroup.py 
INFO:logger:Namespace creation initiated.
INFO:logger:Workgroup creation initiated.
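Sketching what setup_redshift_workgroup.py might do with the redshift-serverless client — the base capacity of 8 RPUs (the serverless minimum) is my choice, not from the repo:

```python
def build_namespace_kwargs(namespace: str, role_arn: str) -> dict:
    """Namespace with our IAM role attached and set as the default role."""
    return {"namespaceName": namespace, "iamRoles": [role_arn], "defaultIamRoleArn": role_arn}

def build_workgroup_kwargs(workgroup: str, namespace: str) -> dict:
    """Workgroup bound to the namespace; base capacity in RPUs (8 is the minimum)."""
    return {"workgroupName": workgroup, "namespaceName": namespace, "baseCapacity": 8}

def main():
    # Requires AWS credentials; call main() to actually provision both.
    import boto3, os
    rs = boto3.client("redshift-serverless", region_name=os.environ["AWS_REGION"])
    role_arn = f"arn:aws:iam::{os.environ['AWS_ACCOUNT_ID']}:role/{os.environ['REDSHIFT_IAM_ROLE']}"
    rs.create_namespace(**build_namespace_kwargs(os.environ["REDSHIFT_NAMESPACE"], role_arn))
    rs.create_workgroup(**build_workgroup_kwargs(os.environ["REDSHIFT_WORKGROUP"],
                                                 os.environ["REDSHIFT_NAMESPACE"]))
```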

See the data

There are two small CSV files with sample inventory data: inventory_day_1.csv and inventory_day_2.csv.
Let's upload the first one.

uv run upload_csv_to_s3.py inventory_day_1.csv 
Upload Successful: inventory/inventory_day_1.csv
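The upload script likely does little more than this sketch — build the object key under the configured folder and call upload_file:

```python
import os, sys

def build_object_key(folder: str, filename: str) -> str:
    """Object key under the configured folder, from the local file's base name."""
    return f"{folder}/{os.path.basename(filename)}"

def main():
    # Requires AWS credentials; usage: uv run upload_csv_to_s3.py <file.csv>
    import boto3
    s3 = boto3.client("s3", region_name=os.environ["AWS_REGION"])
    key = build_object_key(os.environ["S3_FOLDER"], sys.argv[1])
    s3.upload_file(sys.argv[1], os.environ["S3_BUCKET"], key)
    print(f"Upload Successful: {key}")
```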

Run the crawler so that it fetches the data from S3 and adds a table to the Glue database.

uv run run_glue_crawler.py
INFO:logger:Crawler started.
INFO:logger:Crawler is still running...
INFO:logger:Crawler is still running...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler is stopping...
INFO:logger:Crawler finished. Final State: READY
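run_glue_crawler.py presumably starts the crawler and polls get_crawler until the state returns to READY — the RUNNING and STOPPING lines in the log come from that polling loop. A sketch:

```python
import time

def is_finished(state: str) -> bool:
    """The crawler goes RUNNING -> STOPPING -> READY; READY means the run is done."""
    return state == "READY"

def main():
    # Requires AWS credentials; call main() to actually run the crawler.
    import boto3, os
    glue = boto3.client("glue", region_name=os.environ["AWS_REGION"])
    name = os.environ["GLUE_CRAWLER"]
    glue.start_crawler(Name=name)
    time.sleep(5)  # give the crawler a moment to leave READY
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if is_finished(state):
            break
        time.sleep(10)
    print(f"Crawler finished. Final State: {state}")
```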

We did a lot with the CLI; let's do some verification from the GUI, on the web console. We can see the table on the Glue DB under the hierarchy AWS Glue > Data Catalog > Tables.
Table on glue db

Now, go to Amazon Redshift > Serverless > Query editor v2, click on the workgroup, and use the default settings to connect. Run this command in the editor:

SELECT * FROM "awsdatacatalog"."struct-kb-glue-db"."inventory"

In my case the table name is inventory, which is the same as the S3 folder name. I got results like the ones below.
Redshift query result for 1 day
Note that there are 10 records.

Incremental data

Now, let's add another csv file for day 2.

uv run upload_csv_to_s3.py inventory_day_2.csv 

The SQS queue should show one message available.
SQS queue status before crawler run

We can run the crawler to fetch the change.

uv run run_glue_crawler.py 

The number of SQS messages available should drop back to 0.
SQS status after crawler run

The same query in Redshift should now return 20 records.
Redshift query result for 2 days

Bedrock KB

We got the results in the Redshift editor through a SQL command. Now we can try to retrieve results via a Bedrock KB through natural language.

Set up an IAM policy for the Bedrock KB.

uv run setup_bedrock_kb_iam_policy.py 
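The exact permissions the KB role needs aren't shown in the post; this sketch is an assumption based on what a structured KB has to touch — Redshift Serverless credentials, the Redshift Data API, and the Glue Data Catalog. Treat every action here as a starting point to verify, not a definitive list:

```python
def build_kb_policy(workgroup_arn: str) -> dict:
    """Assumed permissions for the Bedrock structured KB role; verify and tighten."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["redshift-serverless:GetCredentials"],
             "Resource": workgroup_arn},
            {"Effect": "Allow",
             "Action": ["redshift-data:ExecuteStatement", "redshift-data:GetStatementResult",
                        "redshift-data:DescribeStatement"],
             "Resource": "*"},
            {"Effect": "Allow",
             "Action": ["glue:GetDatabase", "glue:GetDatabases",
                        "glue:GetTable", "glue:GetTables", "glue:GetPartitions"],
             "Resource": "*"},
        ],
    }
```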

Set up an IAM role and attach this policy to it.

uv run setup_bedrock_kb_iam_role.py
INFO:logger:Created role: StructKbBedrockKbIamRole
INFO:logger:Attached IAM policy to BedrockKB IAM role.

Create and sync the knowledge base.

uv run setup_bedrock_kb.py
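A sketch of how the structured KB could be created with the bedrock-agent client — the SQL/Redshift configuration shape and the "database.table" naming for the Data Catalog table are my assumptions about the create_knowledge_base API, so check the current boto3 docs before relying on it:

```python
def build_kb_config(workgroup_arn: str, table: str) -> dict:
    """Structured (SQL) KB over Redshift Serverless with the Glue Data Catalog as storage.
    Shape assumed from the bedrock-agent create_knowledge_base API; verify against docs."""
    return {
        "type": "SQL",
        "sqlKnowledgeBaseConfiguration": {
            "type": "REDSHIFT",
            "redshiftConfiguration": {
                "queryEngineConfiguration": {
                    "type": "SERVERLESS",
                    "serverlessConfiguration": {
                        "workgroupArn": workgroup_arn,
                        "authConfiguration": {"type": "IAM"},
                    },
                },
                "storageConfigurations": [{
                    "type": "AWS_DATA_CATALOG",
                    "awsDataCatalogConfiguration": {"tableNames": [table]},
                }],
            },
        },
    }

def main():
    # Requires AWS credentials; call main() to actually create the KB.
    import boto3, os
    region, account = os.environ["AWS_REGION"], os.environ["AWS_ACCOUNT_ID"]
    rs = boto3.client("redshift-serverless", region_name=region)
    workgroup_arn = rs.get_workgroup(
        workgroupName=os.environ["REDSHIFT_WORKGROUP"])["workgroup"]["workgroupArn"]
    bedrock = boto3.client("bedrock-agent", region_name=region)
    bedrock.create_knowledge_base(
        name=os.environ["BEDROCK_KB"],
        roleArn=f"arn:aws:iam::{account}:role/{os.environ['BEDROCK_KB_IAM_ROLE']}",
        knowledgeBaseConfiguration=build_kb_config(workgroup_arn,
                                                   f"{os.environ['GLUE_DB']}.inventory"),
    )
```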

We can go to Amazon Bedrock > Knowledge Bases on the web console, click on the knowledge base that was created, and test it. I've used the following settings with a test prompt.
Test knowledge base

Alright, so that's it for this post. It was a somewhat heavy exercise overall, but it should really help when we have large data sets, beyond the simple examples used here. So far we've tested with the test prompt option in the Bedrock KB; we could expand this logic and use the KB with agents built using frameworks like Strands, LangGraph, and so on. Thank you for reading!
