
Build and Deploy an Automatic Sync Solution for Amazon Bedrock Knowledge Bases

Introduction

The AWS Blog post "Build and deploy an automatic sync solution for Amazon Bedrock Knowledge Bases" introduces a solution for automatically synchronizing documents with Amazon Bedrock Knowledge Bases.

This AWS Blog post was published on April 27, 2026. During my validation in May 2026, I encountered two issues: S3 events routed via EventBridge were not being processed correctly, and the status was not being updated even after the Knowledge Base ingestion job completed.

This article documents the fixes I applied to resolve these issues, the validation results, and operational considerations you should be aware of.


Overview of the Sample Implementation

Architecture Overview from the AWS Blog

Architecture diagram

Note: The original architecture diagram in the AWS Blog is low-resolution and hard to read when enlarged, so I recreated it above. The original diagram also lacked lines connecting the Event Processor Lambda and Sync Processor Lambda to DynamoDB, which I have added.

The sample implementation described in the AWS Blog executes the following workflow. As the blog states, the solution is designed around the relevant service quotas and improves resilience by retrying when those quotas are exceeded.

Phase 1: Document Change Detection

  • User uploads/updates/deletes a document in Amazon S3
  • Amazon EventBridge captures the S3 event and routes it to the Event Processor Lambda
  • Event Processor Lambda determines the change type (create / update / delete) and records a change entry — including a change_id — in Amazon DynamoDB (TRACKING_TABLE)
  • Event Processor Lambda sends a change notification message to Amazon SQS
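For illustration, here is a minimal Python sketch of this phase using boto3. The table, queue, and environment variable names are assumptions for this sketch, not the sample's actual identifiers:

import json
import os
import time
import uuid
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

# Assumed names; the sample resolves these from environment
# variables set by the SAM template.
tracking_table = dynamodb.Table(os.environ.get("TRACKING_TABLE", "autosync-tracking"))
queue_url = os.environ["SYNC_QUEUE_URL"]


def handler(event, context):
    """Record one S3 change delivered via EventBridge and enqueue it."""
    detail = event["detail"]
    change = {
        "knowledge_base_id": os.environ["KNOWLEDGE_BASE_ID"],
        "change_id": str(uuid.uuid4()),
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        # EventBridge carries the change kind in the top-level
        # detail-type field ("Object Created" / "Object Deleted").
        "change_type": "delete" if event["detail-type"] == "Object Deleted" else "create",
        "event_time": event["time"],
        "timestamp": Decimal(str(time.time())),
        "processed": False,
    }
    tracking_table.put_item(Item=change)
    # Carrying change_id in the message is exactly what Fix 2 below addresses.
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(change, default=str))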

Phase 2: Queuing and Rate Limiting

  • Amazon SQS buffers the change notification messages and controls throughput to match the StartIngestionJob quota (1 request per 10 seconds)
  • Sync Processor Lambda receives SQS messages one at a time
  • Creates job-tracking metadata in Amazon DynamoDB (METADATA_TABLE) and starts an AWS Step Functions workflow
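A similarly hedged sketch of the Sync Processor side; the metadata key schema and environment variable names are assumptions:

import json
import os

import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

metadata_table = dynamodb.Table(os.environ.get("METADATA_TABLE", "autosync-metadata"))


def handler(event, context):
    # A batch size of 1 means each SQS delivery starts at most one workflow.
    for record in event["Records"]:
        message = json.loads(record["body"])
        metadata_table.put_item(Item={
            "job_key": message["change_id"],  # illustrative key schema
            "status": "PENDING",
            "change_ids": [message["change_id"]],
        })
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps(message),
        )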

Phase 3: Orchestration via Step Functions

  • Check Quota Lambda verifies service quotas (5 concurrent per account / 1 concurrent per data source / 1 concurrent per Knowledge Base)
  • If quotas are exceeded, the workflow waits 5 minutes and retries; otherwise it proceeds
  • Start Sync Lambda calls the StartIngestionJob API to kick off the sync job and records the job ID in metadata
  • Monitor Sync Lambda periodically checks the status via GetIngestionJob (waits 60 seconds and re-checks if the job is not yet complete)
  • On completion, the workflow branches based on success or failure
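The core Bedrock API calls behind the Start Sync and Monitor Sync steps look roughly like this; a sketch with error handling and the Step Functions wiring omitted:

import boto3

bedrock_agent = boto3.client("bedrock-agent")


def start_sync(kb_id: str, data_source_id: str) -> str:
    """Kick off an ingestion job and return its ID for the monitor step."""
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]


def check_sync(kb_id: str, data_source_id: str, job_id: str) -> str:
    """Return the job status; the workflow waits 60 seconds and
    re-invokes this while the status is STARTING or IN_PROGRESS."""
    response = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
        ingestionJobId=job_id,
    )
    return response["ingestionJob"]["status"]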

Phase 4: Knowledge Base Sync Processing

  • Amazon Bedrock Knowledge Base ingests the entire data source
  • Documents are converted to vector embeddings and stored in the vector store
  • Data becomes available for semantic search

Phase 5: Completion Processing, Notification, and Monitoring

  • Monitor Sync Lambda detects job completion and updates the metadata status
  • Sets the ingestion_job_id on the corresponding change record in TRACKING_TABLE and marks it as processed
  • Sends completion/failure notifications via Amazon SNS to email subscribers
  • Visualizes metrics on an Amazon CloudWatch dashboard and detects anomalies via alarms
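The completion notification itself is a plain SNS publish. A sketch, with the topic ARN assumed to come from an environment variable and the message shape modeled on the sample notification shown later in this article:

import json
import os

import boto3

sns = boto3.client("sns")


def notify_completion(job: dict) -> None:
    """Publish a message like the sample email in the Verification section."""
    sns.publish(
        TopicArn=os.environ["NOTIFICATION_TOPIC_ARN"],
        Subject=f"Bedrock KB Sync Job {job['status']}",
        Message=json.dumps({
            "knowledge_base_id": job["knowledgeBaseId"],
            "job_id": job["ingestionJobId"],
            "status": job["status"],
            "statistics": job.get("statistics", {}),
        }, default=str),
    )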

Concrete Example: Uploading 50 Files in Bulk

Following the README.md example, the workflow proceeds as follows:

  1. 50 S3 events are generated → Event Processor records each change in DynamoDB and sends messages to SQS
  2. SQS delivers messages to Sync Processor at 10-second intervals
  3. Sync Processor starts Step Functions → if quotas allow, kicks off an ingestion job; one job ingests all 50 files (actually the entire data source) at once
  4. After completion, Monitor Sync Lambda updates DynamoDB and sends an SNS notification

As the README.md's Important Note About Amazon Bedrock Knowledge Base Ingestion also states, the StartIngestionJob operation scans the entire data source, not just the changed files. Ingesting at the individual file level would require a different API. This design is nonetheless intentional: the Knowledge Base's managed sync process determines whether each file is new, modified, or deleted and handles it appropriately.

In this way, the solution is designed to track changes immediately while efficiently batching ingestion within service limits.

Prerequisites and Constraints of the Sample Implementation

Only One Data Source per Knowledge Base

This implementation assumes there is only one data source under the Knowledge Base. If multiple data sources exist, the implementation uses the first entry returned by the list_data_sources API; because the order is not guaranteed, an unintended data source may be selected as the sync target. In the code below, maxResults=10 is passed, but only the first entry, dataSourceSummaries[0], is ever used:

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/sync_processor_lambda.py#L70-L77
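If your Knowledge Base might ever have more than one data source, a safer variant would fail fast instead of silently picking the first entry; a sketch:

import boto3

bedrock_agent = boto3.client("bedrock-agent")


def get_single_data_source_id(kb_id: str) -> str:
    """Return the data source ID, refusing to guess when several exist."""
    response = bedrock_agent.list_data_sources(knowledgeBaseId=kb_id, maxResults=10)
    summaries = response["dataSourceSummaries"]
    if len(summaries) != 1:
        raise ValueError(
            f"Expected exactly one data source for {kb_id}, found {len(summaries)}; "
            "pass the data source ID explicitly instead."
        )
    return summaries[0]["dataSourceId"]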

Sync Scope of the Bedrock Knowledge Base Data Source

The scope that StartIngestionJob actually scans and syncs to the Bedrock Knowledge Base is determined by the S3 URI configured in the data source settings, not by the S3KeyPrefix parameter in samconfig.toml. In the code below, S3KeyPrefix is used only as a filter on the EventBridge rule. In other words, if the data source references the entire s3://example-bucket/, then even with documents/ specified as the S3KeyPrefix, any event under documents/ will cause StartIngestionJob to scan everything under s3://example-bucket/.

To avoid this, you need to align the S3 URI used when building the data source with the S3KeyPrefix.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/samconfig.example.toml#L12

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/template.yaml#L234-L251
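One way to catch a mismatch early is to compare the data source's S3 configuration against the S3KeyPrefix you intend to deploy with; a hedged sketch:

import boto3

bedrock_agent = boto3.client("bedrock-agent")


def check_prefix_alignment(kb_id: str, data_source_id: str, s3_key_prefix: str) -> None:
    """Warn when the data source scans more than the EventBridge filter watches."""
    response = bedrock_agent.get_data_source(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )
    s3_config = response["dataSource"]["dataSourceConfiguration"]["s3Configuration"]
    prefixes = s3_config.get("inclusionPrefixes", [])
    if not any(p.startswith(s3_key_prefix) for p in prefixes):
        print(
            f"Warning: data source prefixes {prefixes or ['<entire bucket>']} do not "
            f"match S3KeyPrefix '{s3_key_prefix}'; events under the prefix will "
            "trigger a scan of the data source's full scope."
        )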

Fixes Applied to the Sample Implementation

Fix 1: S3 Events via EventBridge Not Being Processed Correctly

This sample implementation uses an EventBridge rule as the trigger for the Event Processor Lambda. However, the get_change_type() function is written for direct S3 event notifications, whose payload format differs from the EventBridge one.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/event_processor_lambda.py#L74-L91

Additionally, the value passed as this function's event_name argument is read from a field called detail.name, but according to the AWS documentation no such field exists in the EventBridge payload; the correct field to reference is the top-level detail-type.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/event_processor_lambda.py#L119

To resolve these issues, I modified the payload reference and the get_change_type() function. The specific changes are here:

https://github.com/revsystem/sample-automatic-sync-for-bedrock-knowledge-bases/pull/3
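In essence, the fix maps the EventBridge detail-type values rather than S3-notification-style event names. A simplified sketch of that direction (the PR contains the full change):

def get_change_type(detail_type: str) -> str:
    """Map an EventBridge S3 detail-type to the tracked change type.

    Direct S3 notifications carry names like "ObjectCreated:Put", but
    EventBridge delivers "Object Created" / "Object Deleted" in the
    top-level detail-type field instead.
    """
    if detail_type == "Object Deleted":
        return "delete"
    if detail_type == "Object Created":
        # Create vs. update cannot be distinguished from the event alone;
        # the tracking table would be needed to tell them apart.
        return "create"
    return "unknown"


# At the call site, read the top-level field, not event["detail"]["name"]:
# change_type = get_change_type(event["detail-type"])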

Fix 2: Metadata in Sync Processor Lambda Not Being Updated Correctly

The Event Processor Lambda generates a change_id and stores it in the autosync-tracking DynamoDB table. However, change_id is not included in the SQS message.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/event_processor_lambda.py#L184-L192

The Monitor Sync Status Lambda references the change_ids field in the message only when the ingestion job reaches COMPLETE status, and uses it to update the autosync-tracking table.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/monitor_sync_lambda.py#L93-L93

That update happens in mark_changes_as_processed(), which reads change_ids to set the processed attribute on each tracking record. Because change_ids is never present in the SQS message, processed always remains False.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/monitor_sync_lambda.py#L151-L152

To resolve this, I updated the implementation to include change_id in the SQS message and modified the Monitor Sync Status Lambda to accept both the change_id and change_ids fields (the original design intent was unclear, so I kept it compatible with both). The specific changes are here:

https://github.com/revsystem/sample-automatic-sync-for-bedrock-knowledge-bases/pull/4
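Sketched, the shape of the fix on the monitor side looks like this; the table's key schema is an assumption here, and attribute names follow the scan output shown later in this article:

import boto3

dynamodb = boto3.resource("dynamodb")
tracking_table = dynamodb.Table("autosync-tracking")


def mark_changes_as_processed(message: dict, ingestion_job_id: str) -> None:
    """Flip processed to True for every change covered by the completed job.

    Accept both the singular change_id (what the Event Processor now sends)
    and a change_ids list (what the monitor originally expected).
    """
    change_ids = message.get("change_ids") or [message["change_id"]]
    for change_id in change_ids:
        tracking_table.update_item(
            Key={"change_id": change_id},  # key schema assumed for this sketch
            UpdateExpression="SET processed = :p, ingestion_job_id = :j",
            ExpressionAttributeValues={":p": True, ":j": ingestion_job_id},
        )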

Fix 3: Correcting the Quota Value for Concurrent Ingestion Jobs per Account

The README.md and the code state that the quota for Concurrent ingestion jobs per account is 55, but the AWS official documentation shows the correct value is 5.

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/src/check_quota_lambda.py#L26-L27

https://github.com/aws-samples/sample-automatic-sync-for-bedrock-knowledge-bases/blob/22a51f8/README.md?plain=1#L354-L356

To resolve this, I set the Concurrent ingestion jobs per account quota to 5. The specific changes are here:

https://github.com/revsystem/sample-automatic-sync-for-bedrock-knowledge-bases/pull/5
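For reference, a heavily hedged sketch of the kind of count such a gate performs; the sample's actual logic may differ, and a true account-wide check would have to iterate over every Knowledge Base and data source:

import boto3

# Per the AWS service quotas documentation: 5 concurrent ingestion jobs
# per account (the sample previously hard-coded 55 here).
MAX_CONCURRENT_INGESTION_JOBS_PER_ACCOUNT = 5

bedrock_agent = boto3.client("bedrock-agent")


def count_in_progress_jobs(kb_id: str, data_source_id: str) -> int:
    """Count IN_PROGRESS ingestion jobs for a single data source."""
    response = bedrock_agent.list_ingestion_jobs(
        knowledgeBaseId=kb_id,
        dataSourceId=data_source_id,
        filters=[{
            "attribute": "STATUS",
            "operator": "EQ",
            "values": ["IN_PROGRESS"],
        }],
    )
    return len(response["ingestionJobSummaries"])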

Environment Setup

The code used in this article is published on GitHub:

https://github.com/revsystem/sample-automatic-sync-for-bedrock-knowledge-bases

All AWS resources in this guide are created in the us-east-1 region. If you use a different region, substitute the appropriate region name throughout.

The S3 bucket and Bedrock Knowledge Base are created manually; all other resources are created using the AWS SAM CLI.

Creating the S3 Buckets

You will create three S3 buckets:

  • One for storing Bedrock Knowledge Base documents
  • One for S3 server access logs
  • One for storing text extracted from images in multimodal RAG (optional)

The S3 server access log bucket is not strictly required, but as noted in the Amazon S3 Bucket Security Requirements section of the README, enabling S3 server access logging is recommended for auditing purposes.

Below are the steps for creating the document storage bucket for Bedrock Knowledge Base.

S3 Bucket for Bedrock Knowledge Base Documents

Create an S3 bucket for storing Bedrock Knowledge Base documents. Following the README, enable Block Public Access settings, Default Encryption, and Versioning.

  • Bucket type: General purpose
  • Bucket namespace: Account Regional namespace
  • Bucket name prefix: automatic-sync-for-bedrock-kb
  • Block public access settings for this bucket: Block all public access
  • Bucket Versioning: Enable
  • Default Encryption: Server-side encryption with Amazon S3-managed keys (SSE-S3)

General configuration

Object Ownership

Bucket Versioning

Default Encryption

After creating the bucket, open the Server access logging block in the Properties tab and set it to Enable. Specify the log bucket you created as the destination.

Server access logging

Also in the Properties tab, open the Amazon EventBridge block and set "Send notifications to Amazon EventBridge for all events in this bucket" to On.

Amazon EventBridge

Next, open the Bucket Policy block in the Permissions tab and configure the bucket policy as shown below. Replace Resource with the ARN of your Bedrock Knowledge Base document storage bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictToTLSRequestsOnly",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::automatic-sync-for-bedrock-kb-<YOUR_ACCOUNT_ID>-us-east-1-an",
                "arn:aws:s3:::automatic-sync-for-bedrock-kb-<YOUR_ACCOUNT_ID>-us-east-1-an/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        }
    ]
}

Bucket Policy

Create a folder named documents inside this bucket for uploading documents.

Documents folder

Creating the Bedrock Knowledge Base

Note: The choices made here — such as Parsing strategy and Embedding model — are not essential to validating the automatic sync solution. As long as the S3 bucket is correctly configured in the data source, the rest can use default settings. For cost efficiency, I recommend selecting S3 Vectors as the Vector store type.

Follow the Knowledge Base creation wizard and select Knowledge Base with vector store.

Create Bedrock Knowledge Base

Set the Knowledge Base name to sample-automatic-sync-for-bedrock.

Knowledge Base Name

Select Amazon S3 as the data source type.

Choose data source type

Set the data source name to sample-automatic-sync-for-bedrock-data-source. For the S3 URI, specify the path to the document storage bucket you created earlier.

Data source name

Select Foundation models as a parser for the Parsing strategy to enable handling of PDFs and rich documents.

Parsing strategy

For the parser model, select Nova Pro 1.0. While the model selection screen shows various options, only On-demand profile models can be selected. Cross-region inference profiles and global inference profiles cannot be used; attempting to select them results in an error during Knowledge Base creation.

For the Chunking strategy, select Semantic chunking. Set Max tokens size for a chunk to 512. The maximum input token count for Cohere Embed Multilingual v3 is 512 tokens. If the chunk size exceeds this, text will be truncated during embedding, so Max tokens size for a chunk must be set to 512 or lower.

Chunking strategy

Select Cohere Embed Multilingual v3 as the Embeddings model. This model is known for its strong multilingual embedding quality.

Embeddings model

For Vector store, select Quick create a new vector store, and for Vector store type, select Amazon S3 Vectors.

Vector store

For Multimodal storage destination, specify the S3 bucket for storing text extracted from images in multimodal RAG. This is optional and can be left unconfigured.

Multimodal storage destination

Deployment

Clone the GitHub repository:

git clone https://github.com/revsystem/sample-automatic-sync-for-bedrock-knowledge-bases.git

Navigate to the repository root directory:

cd sample-automatic-sync-for-bedrock-knowledge-bases

Copy the sample configuration file and open it for editing:

cp samconfig.example.toml samconfig.toml
vi samconfig.toml

Set the following parameters:
  • KnowledgeBaseId: The ID of your Bedrock Knowledge Base
  • S3BucketName: The name of your document storage bucket (e.g. automatic-sync-for-bedrock-kb-<YOUR_ACCOUNT_ID>-us-east-1-an)
  • S3KeyPrefix: The folder name inside the document storage bucket (e.g. documents)
  • NotificationsEmail: Email address for notifications (optional)
  • LambdaMemorySize: Lambda memory size (optional)
  • LambdaTimeout: Lambda timeout duration (optional)

# Example samconfig.toml.
version = 0.1

[default.deploy.parameters]
stack_name = "autosync"
resolve_s3 = true
s3_prefix = "autosync"
region = "us-east-1"
confirm_changeset = true
capabilities = "CAPABILITY_IAM"
disable_rollback = true
parameter_overrides = "KnowledgeBaseId=\"<your-knowledge-base-id>\" S3BucketName=\"<your-s3-bucket-name>\" S3KeyPrefix=\"<optional-prefix>\" NotificationsEmail=\"<optional-email>\" LambdaMemorySize=\"256\" LambdaTimeout=\"60\""
image_repositories = []

Run the deployment:

sam build
sam deploy --guided --profile {YOUR_PROFILE} --region {YOUR_REGION}

With the --guided option, the CLI will prompt you as follows:

Setting default arguments for 'sam deploy'
=========================================
Stack Name [autosync]:
AWS Region [us-east-1]:
Parameter KnowledgeBaseId [<YOUR_KNOWLEDGE_BASE_ID>]:
Parameter S3BucketName [<YOUR_S3_BUCKET_NAME>]:
Parameter S3KeyPrefix [<YOUR_S3_KEY_PREFIX>]:
Parameter NotificationsEmail [<YOUR_EMAIL_ADDRESS>]:
Parameter LambdaMemorySize [256]:
Parameter LambdaTimeout [60]:
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [Y/n]: Y
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]: Y
#Preserves the state of previously provisioned resources when an operation fails
Disable rollback [Y/n]: Y
Save arguments to configuration file [Y/n]: Y
SAM configuration file [samconfig.toml]:
SAM configuration environment [default]:

A successful deployment will display the following message in your terminal:

Successfully created/updated stack - autosync in us-east-1

For more details on deploying and troubleshooting with the AWS SAM CLI, refer to README-SAM.md.

SNS Subscription Confirmation

During deployment, an SNS subscription is created. This subscription delivers notifications about the sync jobs triggered when documents are uploaded or deleted.

When the subscription is created, an email like the following will be sent to the NotificationsEmail address:

Subject: AWS Notification - Subscription Confirmation

You have chosen to subscribe to the topic:
arn:aws:sns:us-east-1:<YOUR_ACCOUNT_ID>:autosync-notifications

To confirm this subscription, click or visit the link below (If this was in error no action is necessary):
Confirm subscription

Please do not reply directly to this email. If you wish to remove yourself from receiving all future SNS subscription confirmation requests please send an email to sns-opt-out

Click Confirm subscription to activate the SNS subscription. Be careful not to accidentally click the unsubscribe link.

SNS Subscription Confirmation

The subscription confirmation email may contain an unsubscribe link. Clicking this link immediately cancels the subscription. Corporate email security systems or spam filters may automatically open links in incoming emails to inspect them, potentially triggering the unsubscribe link unintentionally.

To prevent this, you should confirm the SNS subscription via the AWS Management Console or AWS CLI, as described in this article:

https://repost.aws/knowledge-center/prevent-unsubscribe-all-sns-topic

Verification

Uploading a Document

Upload a document using the aws s3 cp command or from the S3 Management Console:

aws s3 cp {YOUR_DOCUMENT_PATH} s3://{YOUR_S3_BUCKET_NAME}/{YOUR_S3_KEY_PREFIX}/

Verifying Document Sync

You can verify document sync from the Bedrock Knowledge Base dashboard. The sync history for the data source is shown in Sync history.

Bedrock Knowledge Base Dashboard

If the SNS subscription is active, a message like the following will be sent when the ingestion job completes:

Subject: Bedrock KB Sync Job COMPLETE

{
  "knowledge_base_id": "<YOUR_KNOWLEDGE_BASE_ID>",
  "job_id": "<YOUR_JOB_ID>",
  "status": "COMPLETE",
  "statistics": {
    "numberOfDocumentsDeleted": 0,
    "numberOfDocumentsFailed": 0,
    "numberOfDocumentsScanned": 1,
    "numberOfMetadataDocumentsModified": 0,
    "numberOfMetadataDocumentsScanned": 0,
    "numberOfModifiedDocumentsIndexed": 0,
    "numberOfNewDocumentsIndexed": 1
  },
  "processed_changes": 1
}

Checking DynamoDB Data

The DynamoDB table names are autosync-tracking and autosync-metadata.

aws dynamodb scan --table-name autosync-tracking --profile {YOUR_PROFILE} --region {YOUR_REGION}
aws dynamodb scan --table-name autosync-metadata --profile {YOUR_PROFILE} --region {YOUR_REGION}

Example output from the autosync-tracking table. The change_type field records each change; here, a delete and a subsequent create for the same key:

{
    "Items": [
        {
            "knowledge_base_id": {
                "S": "<YOUR_KNOWLEDGE_BASE_ID>"
            },
            "event_time": {
                "S": "2026-05-10T17:14:43Z"
            },
            "ingestion_job_id": {
                "S": "KGF5YGBN2P"
            },
            "processed": {
                "BOOL": true
            },
            "timestamp": {
                "N": "1778433285.169089"
            },
            "bucket": {
                "S": "<YOUR_S3_BUCKET_NAME>"
            },
            "change_type": {
                "S": "delete"
            },
            "key": {
                "S": "documents/n1120000.pdf"
            },
            "change_id": {
                "S": "4bc82a1e-56da-4f13-8d0e-a2db399aca8b"
            }
        },
        {
            "knowledge_base_id": {
                "S": "<YOUR_KNOWLEDGE_BASE_ID>"
            },
            "event_time": {
                "S": "2026-05-10T17:15:32Z"
            },
            "ingestion_job_id": {
                "S": "YGYDOSMKQR"
            },
            "processed": {
                "BOOL": true
            },
            "timestamp": {
                "N": "1778433333.72125"
            },
            "bucket": {
                "S": "<YOUR_S3_BUCKET_NAME>"
            },
            "change_type": {
                "S": "create"
            },
            "key": {
                "S": "documents/n1120000.pdf"
            },
            "change_id": {
                "S": "07b7b739-d44f-4a35-aa5c-8bc1fa104b01"
            }
        },
...
    ],
    "Count": 10,
    "ScannedCount": 10,
    "ConsumedCapacity": null
}
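To spot-check that Fix 2 took effect, you can also filter for any records left unprocessed; a small boto3 sketch:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")

# After a completed ingestion job, this should return no items.
response = dynamodb.Table("autosync-tracking").scan(
    FilterExpression=Attr("processed").eq(False)
)
for item in response["Items"]:
    print(item["change_id"], item["key"], item["change_type"])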

Checking Logs

Check logs in CloudWatch Logs. Logs are output to log groups matching /aws/lambda/autosync-*.

CloudWatch Logs

You can also check the status and metrics for each resource in the EventBridge, Step Functions, and Lambda Management Consoles.

Conclusion

I validated the sample implementation introduced in the AWS Blog and confirmed its behavior end-to-end.

In this sample implementation, when a file is uploaded, updated, or deleted in S3, the Knowledge Base sync API (StartIngestionJob) is invoked automatically, ensuring reliable ingestion while respecting service quotas. The design, which uses EventBridge to detect S3 events, SQS to buffer and deliver messages, and Step Functions for orchestration, is highly efficient. Change tracking via DynamoDB also makes state management straightforward.

While a few fixes were necessary due to discrepancies between the implementation and the specification, the sample implementation provided a clear picture of how automatic sync works end-to-end.
