Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering

#agents #ai #aws #rag

This is the second in a series of posts documenting the architecture, implementation, and lessons learned from building the AWS Briefing Agent - a personalised AWS assistant deployed on Amazon Bedrock AgentCore Runtime.

Part 1: Building a Full-Stack AI Agent on Bedrock AgentCore
Part 2: Data Ingestion: RSS Feeds, Knowledge Base, S3 Vectors, and Metadata Filtering
Part 3: Strands Agents + AgentCore Runtime - a perfect match
Part 4: Adding Memory to the Agent
Part 5: Experimenting with API Gateway
Part 6: Observability and Evaluations
Part 7: Third Party Integrations - Identity, Gateway and Slack Notifications

When I started building the AWS Briefing Agent, the first version queried the AWS What's New RSS feed on every invocation. This was enough to show that the agent could return tailored information to the client, but it was wasteful: the same data was fetched repeatedly, which added cost and latency to every invocation. The RSS feed also only covers recent announcements, and we would likely want to search for releases launched six months ago or more. The next step, therefore, was to separate the agent's retrieval from the ingestion.

Amazon Bedrock Knowledge Base

One of the key design goals was to allow the agent to match a natural language query such as "what's new in Bedrock this week?" against a large corpus of documents and return the most semantically similar results. This is where Amazon Bedrock Knowledge Base comes into its own. It allows the agent to use RAG (Retrieval-Augmented Generation): by querying the Knowledge Base, we retrieve relevant documents at query time and inject them into the prompt as context. The LLM then generates its response from this retrieved information, which comes from official AWS sources, rather than from its training data alone.
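
To make the retrieve-then-inject flow concrete, here is a minimal sketch of how an agent tool might query the Knowledge Base through the boto3 bedrock-agent-runtime client's retrieve call. The function names retrieve_context and build_prompt, the top_k default, and the prompt wording are assumptions for illustration, not the project's actual code.

```python
def retrieve_context(client, knowledge_base_id, query, top_k=5):
    """Return the text of the top_k chunks most similar to the query.

    `client` is expected to behave like boto3.client("bedrock-agent-runtime").
    """
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    # Each result carries the chunk text under content.text
    return [result["content"]["text"] for result in response["retrievalResults"]]


def build_prompt(query, chunks):
    """Inject the retrieved chunks into the prompt as context (hypothetical wording)."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

In a deployed agent the client would be created with boto3.client("bedrock-agent-runtime"), and knowledge_base_id would be the ID of the Knowledge Base created by the CDK stack.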

The Python CDK code that creates the Knowledge Base is shown below:

knowledge_base = bedrock.CfnKnowledgeBase(
    self,
    "AnnouncementKnowledgeBase",
    name="aws-briefing-agent-announcements",
    ...
    knowledge_base_configuration=bedrock.CfnKnowledgeBase.KnowledgeBaseConfigurationProperty(
        type="VECTOR",
        vector_knowledge_base_configuration=bedrock.CfnKnowledgeBase.VectorKnowledgeBaseConfigurationProperty(
            embedding_model_arn=f"arn:aws:bedrock:{self.region}::foundation-model/amazon.titan-embed-text-v2:0",
        ),
    ),
    storage_configuration=bedrock.CfnKnowledgeBase.StorageConfigurationProperty(
        type="S3_VECTORS",
        s3_vectors_configuration=bedrock.CfnKnowledgeBase.S3VectorsConfigurationProperty(
            index_name="announcements",
            vector_bucket_arn=f"arn:aws:s3vectors:{self.region}:{self.account}:bucket/briefing-agent-vectors",
        ),
    ),
)

This declares the embeddings model to be used as amazon.titan-embed-text-v2:0 and the vector store as being of type S3_VECTORS. There is no code required to handle aspects such as embeddings. Instead, Bedrock manages all of this for us.

Amazon S3 Vectors

Amazon Bedrock Knowledge Bases support several vector stores. A vector store is the retrieval engine that makes RAG work. It stores documents as numerical embeddings (vectors) that are generated by an embeddings model. At query time, the user's question is embedded, and the vector store finds documents whose embeddings are closest in meaning.
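
As a toy illustration of "closest in meaning", the sketch below ranks documents by cosine similarity between their embeddings and a query embedding. The three-dimensional vectors and document names are made up for the example; real Titan Embed Text v2 embeddings have far more dimensions, and the vector store performs this search for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, docs, top_k=2):
    """Return the ids of the top_k documents most similar to the query.

    `docs` maps a document id to its embedding vector.
    """
    ranked = sorted(
        docs.items(),
        key=lambda kv: cosine_similarity(query_vec, kv[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical embeddings, for illustration only
docs = {
    "bedrock-post": [0.9, 0.1, 0.0],
    "lambda-post": [0.1, 0.9, 0.0],
    "s3-post": [0.0, 0.1, 0.9],
}
closest = nearest([0.8, 0.2, 0.0], docs)
```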

The prototype uses Amazon S3 Vectors as the underlying vector store. S3 Vectors provides cost-effective, elastic, and durable vector storage at up to 90% lower costs for uploading, storing, and querying vectors than alternatives such as OpenSearch Serverless. There is no infrastructure to manage, and it still provides sub-second query latency, which is acceptable for this use case.

Scheduling the Ingestion

The ingestion pipeline runs every 6 hours using Amazon EventBridge Scheduler, which provides capabilities such as built-in retry policies, time zone support, and dead-letter queues. The schedule triggers an AWS Lambda function that carries out the required processing. On each run, the function:

Lists existing document hashes in S3
Fetches the AWS What’s New RSS feed (~100 announcements)
Fetches 13 AWS blog RSS feeds (aws, machine-learning, compute, security, database, containers, devops, networking, storage, infrastructure-and-automation, developer, big-data, iot)
Fetches the AWS Security Bulletins RSS feed
For each new blog post, fetches the canonical URL and extracts the full article body using a stdlib HTML parser
Parses publication dates into YYYYMMDD integers
Writes .txt and .metadata.json files per new item to S3
Triggers a Bedrock KB ingestion job
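
Two of the steps above, parsing a feed and converting publication dates into YYYYMMDD integers, can be sketched with the standard library alone. This is an assumption about how such a step might look, not the Lambda's actual code; the function names parse_feed and to_yyyymmdd are hypothetical, and the sketch assumes a standard RSS 2.0 feed with RFC 822 pubDate values.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def to_yyyymmdd(pub_date):
    """Convert an RFC 822 date such as 'Wed, 15 Apr 2026 17:00:00 GMT' to 20260415."""
    dt = parsedate_to_datetime(pub_date)
    return dt.year * 10000 + dt.month * 100 + dt.day


def parse_feed(xml_text):
    """Extract title, link and numeric published_date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if not pub_date:
            continue  # Skip entries without a publication date
        items.append(
            {
                "title": item.findtext("title", default="").strip(),
                "link": item.findtext("link", default="").strip(),
                "published_date": to_yyyymmdd(pub_date),
            }
        )
    return items
```

The date is taken in the time zone the feed reports, so an item published late in the day in UTC keeps its UTC calendar date.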

Deduplication and Incremental Writes

When the ingestion pipeline runs, most of the content in the various RSS feeds is not new. It was important to find a way to prevent re-fetching and re-writing hundreds of announcements every 6 hours.

To support this, we compute an MD5 hash of each blog post's URL, truncated to 12 hex characters. This hash is used as the S3 filename. A sample code snippet is shown below:

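import hashlib


def write_to_s3(items, existing_keys=None):
    # existing_keys: set of URL hashes already stored in S3
    existing = existing_keys or set()
    for item in items:
        # Stable 12-character filename derived from the item's URL
        url_hash = hashlib.md5(item["link"].encode()).hexdigest()[:12]
        if url_hash in existing:
            continue  # Already in S3, skip
        # ... write doc + metadata files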

At startup, get_existing_keys() lists all the .txt files in S3 and extracts the hash from each filename into a set. When processing the blog posts, the Lambda function computes the URL hash and checks whether it is already in the set. If it is, the post was ingested in a previous run, and there is no need to re-fetch the page. If it is not, the function fetches the page, extracts the content, and writes it to S3. The hash gives a stable, deterministic filename derived from the URL: the same URL always produces the same hash.
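
The post does not show get_existing_keys() itself, so the following is only a sketch of how it might be written with the boto3 S3 list_objects_v2 paginator. The parameters (s3_client, bucket, prefix) are assumptions; the original function may take no arguments and read these from the environment.

```python
def get_existing_keys(s3_client, bucket, prefix=""):
    """Return the set of URL hashes already stored as <hash>.txt objects.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    hashes = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".txt"):
                filename = key.rsplit("/", 1)[-1]
                hashes.add(filename[: -len(".txt")])
    return hashes
```

Using the paginator matters here because a single list call returns at most 1,000 keys, and the bucket grows with every run.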

Chunking Strategy

The chunking strategy is set on the Data Source resource in the CDK stack as shown below:

data_source = bedrock.CfnDataSource(
    self,
    "AnnouncementDataSource",
    name="aws-announcements-s3",
    knowledge_base_id=knowledge_base.attr_knowledge_base_id,
    data_source_configuration=bedrock.CfnDataSource.DataSourceConfigurationProperty(
        type="S3",
        s3_configuration=bedrock.CfnDataSource.S3DataSourceConfigurationProperty(
            bucket_arn=data_bucket.bucket_arn,
        ),
    ),
    vector_ingestion_configuration=bedrock.CfnDataSource.VectorIngestionConfigurationProperty(
        chunking_configuration=bedrock.CfnDataSource.ChunkingConfigurationProperty(
            chunking_strategy="SEMANTIC",
            semantic_chunking_configuration=bedrock.CfnDataSource.SemanticChunkingConfigurationProperty(
                breakpoint_percentile_threshold=92,
                buffer_size=1,
                max_tokens=600,
            ),
        ),
    ),
)

We utilise a SEMANTIC chunking strategy. This uses the embedding model itself to decide where to split. The following three parameters control this behaviour:

breakpoint_percentile_threshold=92 - controls the percentile threshold that will result in a split. A higher threshold requires sentences to be more distinguishable to split the document into different chunks.
max_tokens=600 - the maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
buffer_size=1 - for a given sentence, the buffer size defines the number of surrounding sentences to be added for embeddings creation. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.
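
To give a feel for what the percentile threshold does, here is a toy sketch of percentile-based breakpoints: given the embedding distance between each pair of adjacent sentences, it splits wherever the distance exceeds the chosen percentile. This illustrates the general idea only and is not Bedrock's internal algorithm; it ignores buffer_size and max_tokens, and the distances are assumed to be precomputed.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_on_breakpoints(sentences, distances, threshold_pct=92):
    """Group sentences into chunks, splitting at unusually large distances.

    distances[i] is the embedding distance between sentences[i] and
    sentences[i + 1], so len(distances) == len(sentences) - 1.
    Both lists are assumed to be non-empty.
    """
    cutoff = percentile(distances, threshold_pct)
    chunks = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            # Topic shift: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

With a higher threshold_pct, fewer distances exceed the cutoff, so the document is split into fewer, larger chunks.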

Filtering by Date

One of the goals for the agent was to let a user constrain results by how recent they are, e.g. "what is new in the past 7 days?".

To achieve this, at ingestion time we create a .metadata.json sidecar file for each document. The sidecar attaches structured, filterable attributes to the document so the agent can narrow search results without relying only on semantic similarity. An example sidecar file is shown below:

{
  "metadataAttributes": {
    "published_date": 20260415,
    "service": "amazon-bedrock",
    "category": "artificial-intelligence",
    "source_type": "announcement"
  }
}
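
A sidecar like this can be produced with a small helper. The sketch below is an assumption about how the Lambda might build it; the name build_sidecar and its parameters are hypothetical. It follows the Bedrock Knowledge Base naming rule that the sidecar key is the document's full filename plus .metadata.json.

```python
import json


def build_sidecar(doc_key, published_date, service, category, source_type):
    """Return the (S3 key, JSON body) of the metadata sidecar for a document."""
    body = {
        "metadataAttributes": {
            "published_date": published_date,  # number, e.g. 20260415
            "service": service,
            "category": category,
            "source_type": source_type,
        }
    }
    # e.g. abc123def456.txt -> abc123def456.txt.metadata.json
    return f"{doc_key}.metadata.json", json.dumps(body, indent=2)
```

The returned key and body can then be written to the data bucket next to the .txt document.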

During the Knowledge Base sync, Bedrock reads this sidecar and attaches those attributes to every vector chunk generated from that document. At query time, the agent can combine semantic search with metadata filters:

"What's new in Bedrock this week?" → vector similarity for "Bedrock" + greaterThanOrEquals filter on published_date
"Show me security bulletins" → vector similarity + equals filter on source_type: "security-bulletin"
"Lambda announcements from the last month" → vector similarity + filters on both service and published_date

Without the metadata file, the agent would get the most semantically similar results regardless of date or service — so a question about "this week" might return announcements from 3 months ago that happen to be textually similar. The metadata filters let the agent constrain results to the correct time window or service before ranking by relevance.
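
The filters described above map onto the filter structure accepted by the Knowledge Base retrieve API. As a minimal sketch, assuming hypothetical helper names build_filter and date_cutoff, a tool could assemble the filter like this:

```python
from datetime import date, timedelta


def date_cutoff(days_back, today=None):
    """Return the date `days_back` days before today as a YYYYMMDD integer."""
    today = today or date.today()
    cutoff = today - timedelta(days=days_back)
    return cutoff.year * 10000 + cutoff.month * 100 + cutoff.day


def build_filter(service=None, source_type=None, since=None):
    """Build a retrieval filter from the optional constraints, or None."""
    conditions = []
    if service:
        conditions.append({"equals": {"key": "service", "value": service}})
    if source_type:
        conditions.append({"equals": {"key": "source_type", "value": source_type}})
    if since is not None:
        conditions.append(
            {"greaterThanOrEquals": {"key": "published_date", "value": since}}
        )
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]  # andAll needs at least two conditions
    return {"andAll": conditions}
```

The resulting dictionary is passed as the filter entry of vectorSearchConfiguration in the retrieve request, so "Bedrock announcements from the last 7 days" becomes build_filter(service="amazon-bedrock", since=date_cutoff(7)).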

The naming convention (.metadata.json) is a Bedrock KB convention — it automatically associates the sidecar with its parent document during ingestion. No code links them; the filename pattern is enough.

Bedrock Knowledge Base metadata supports four types: STRING, NUMBER, BOOLEAN and STRING_LIST. There is no native date type. The comparison operators (greaterThan, greaterThanOrEquals, lessThan, lessThanOrEquals) only work with NUMBER. Our original implementation stored published_date as a string ("2026-05-14"). When the agent tried to filter, we got back the following exception:

ValidationException: The filter value type provided isn't supported
for the given operation: GREATER_THAN_OR_EQUALS

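The fix was to store dates as YYYYMMDD numbers (so the number 20260514 instead of the string "2026-05-14"). Numbers in this form sort in the same order as the dates they represent, so the comparison operators behave as expected. We also inject today's date into the system prompt at runtime so the LLM can easily calculate relative dates.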
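
Injecting the date can be as simple as appending a line to the system prompt when the agent is created. The sketch below is illustrative only: the function name build_system_prompt and the prompt wording are assumptions, not the agent's actual prompt.

```python
from datetime import date


def build_system_prompt(base_prompt, today=None):
    """Append today's date, in ISO and YYYYMMDD form, to the system prompt."""
    today = today or date.today()
    today_int = today.year * 10000 + today.month * 100 + today.day
    return (
        f"{base_prompt}\n\n"
        f"Today's date is {today.isoformat()} ({today_int} in YYYYMMDD form). "
        "The published_date metadata field is a YYYYMMDD number, "
        "so use numbers in that form when filtering by date."
    )
```

Giving the model the date in the same numeric form as the metadata means it only has to work out the relative date, not convert formats as well.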

Note that Amazon S3 Vectors has a strict 2 KB limit on filterable metadata per vector. We found the Bedrock Knowledge Base internal metadata keys (AMAZON_BEDROCK_TEXT and AMAZON_BEDROCK_METADATA) were set as filterable by default, which caused frequent ValidationException errors. The fix was to mark both of these keys as non-filterable when creating the vector index:

vector_index = s3vectors.CfnIndex(
    self, "AnnouncementVectorIndex",
    index_name="announcements",
    vector_bucket_name=vector_bucket.vector_bucket_name,
    dimension=1024,  # Titan Embed Text v2
    distance_metric="cosine",
    data_type="float32",
    metadata_configuration=s3vectors.CfnIndex.MetadataConfigurationProperty(
        non_filterable_metadata_keys=[
            "AMAZON_BEDROCK_TEXT",
            "AMAZON_BEDROCK_METADATA",
        ],
    ),
)