Mahak Faheem

Posted on Dec 12, 2025

The Problem: My AWS Q Business Bot Didn’t Understand My Data

#aws #ai #aiops #data

When I started experimenting with AWS Q Business, I connected multiple data sources:

Confluence
S3 documents
PDFs & documentation
Website pages through the Web Crawler

Setup was smooth. Indexing completed. Everything looked perfect.
But the answers didn’t reflect my data. At first, I assumed the embeddings hadn’t been refreshed or that there were access permission issues.
But the real culprit was something far simpler:
I had connected the data sources but I hadn’t configured the metadata or document schemas properly.
Q was indexing my data but not understanding the structure, relationships, recency or context boundaries.

Why Metadata Matters in Q Business

Unlike a typical RAG system, where you manually control embeddings, chunking and retrieval, AWS Q Business handles all of this automatically.
But "automatic" doesn’t mean "perfect".
Without metadata, Q struggles with:

Prioritizing fresh vs old content
Understanding document categories
Scoping answers to specific teams or contexts
Navigating Confluence pages with nested hierarchy
Handling versioned documents
Distinguishing source-of-truth vs duplicates

And most importantly:
Q can retrieve irrelevant content that "looks similar" but isn’t actually correct.
Metadata fixes that.

1. Clean Inputs: Well-Structured Data Sources

Each data source needed:

A clear folder/project hierarchy
Document titles that convey meaning
Removal of outdated versions
Explicit version numbers when needed
Logical grouping (S3 prefixes / Confluence spaces)

Example restructuring in S3:

s3://company-knowledge-base/
  engineering/
    architecture/
      system-overview-v1.pdf
      service-boundaries-v2.md
    apis/
      public-api-spec-v3.yaml
      rate-limiting-rules-v1.pdf
    deployment/
      deployment-checklist-v3.md
      rollback-runbook-v2.md
    troubleshooting/
      common-errors/
        error-catalog-v2.json
        service-x-known-issues.md

  product/
    specs/
      feature-a-spec-v1.pdf
      feature-b-updates-v2.pdf
    roadmaps/
      q4-2025-roadmap.pdf

  operations/
    monitoring/
      alert-guide-v2.md
      oncall-playbook-v1.md
    logs/
      access-logs-structure.json
      application-log-fields.md

  knowledge/
    faq/
      internal-faq-v1.md
    glossary/
      terms-v2.md

This alone improved retrieval accuracy by ~30%.
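A naming convention like the one above can also be checked mechanically. Here is a minimal Python sketch, assuming keys follow a `domain/module/.../name-vN.ext` pattern; the helper names `parse_key` and `find_outdated` are hypothetical and not part of any AWS SDK. It splits a key into its parts and lists files that a higher version in the same folder has superseded.

```python
import re
from collections import defaultdict

# Matches a trailing "-v<N>" before the file extension, e.g. "public-api-spec-v3.yaml".
VERSION_RE = re.compile(r"^(?P<stem>.+)-v(?P<version>\d+)\.(?P<ext>[^.]+)$")


def parse_key(key: str) -> dict:
    """Split an S3 key into domain, module, document stem, and version."""
    parts = key.split("/")
    filename = parts[-1]
    match = VERSION_RE.match(filename)
    return {
        "key": key,
        "domain": parts[0] if len(parts) > 1 else None,
        "module": parts[1] if len(parts) > 2 else None,
        "stem": match.group("stem") if match else filename.rsplit(".", 1)[0],
        "version": int(match.group("version")) if match else None,
    }


def find_outdated(keys: list[str]) -> list[str]:
    """Return keys superseded by a higher version of the same stem in the same folder."""
    groups = defaultdict(list)
    for key in keys:
        info = parse_key(key)
        if info["version"] is not None:
            folder = key.rsplit("/", 1)[0] if "/" in key else ""
            groups[(folder, info["stem"])].append(info)
    outdated = []
    for docs in groups.values():
        latest = max(d["version"] for d in docs)
        outdated.extend(d["key"] for d in docs if d["version"] < latest)
    return sorted(outdated)
```

Running `find_outdated` over a listing of the bucket before each sync gives a candidate list of old versions to archive or delete; unversioned files such as the roadmap are left alone.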

2. Metadata: The Secret to Making Q Business “Smart”

Here’s what Q Business weighs heavily during retrieval:
Recommended Metadata Keys

 Key               | Purpose
 ----------------- | ---------------------------------------------
 title             | Overrides filename during ranking
 category          | Helps classification (“engineering”, “operations”, etc.)
 tags              | Multiple labels improve semantic grouping
 version           | Helps avoid outdated responses
 updated_at        | Influences recency scoring
 department        | Great for permission-based personalization
 summary           | Q uses this in ranking and reranking
 source-of-truth   | Boolean; strongly influences which copy Q prefers

Example metadata attached to an S3 object:
{
  "title": "ABC Execution Workflow",
  "category": "operations",
  "tags": ["abc", "execution", "workflow", "ops"],
  "version": "3.0",
  "updated_at": "2025-10-10",
  "source-of-truth": true,
  "department": "engineering",
  "summary": "Detailed ABC Process execution workflow."
}

This made Q consistently pick the correct ABC document.
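As far as I understand the S3 connector, metadata is read from a sidecar file named `<object-key>.metadata.json`, with a top-level `Title` and an `Attributes` map that uses reserved names such as `_category`, `_version` and `_last_updated_at`. The Python sketch below converts the flat metadata above into that shape. Treat it as an illustration under assumptions: `to_sidecar` and `sidecar_key` are hypothetical helpers, the boolean-to-string conversion assumes attribute values cannot be booleans, and the exact field names should be verified against the current documentation. Custom attributes will likely also need to be mapped to index fields in the data source settings.

```python
import json
from datetime import datetime, timezone

# Reserved attribute names as I understand the sidecar format (assumption:
# verify against the current Amazon Q Business documentation).
RESERVED = {
    "category": "_category",
    "version": "_version",
    "updated_at": "_last_updated_at",
}


def to_sidecar(flat: dict) -> dict:
    """Convert flat metadata into a sidecar-style document (hypothetical helper)."""
    attributes = {}
    for key, value in flat.items():
        if key == "title":
            continue  # the title goes in the top-level "Title" field
        if key == "updated_at":
            # Normalize a YYYY-MM-DD date to an ISO 8601 UTC timestamp.
            parsed = datetime.strptime(value, "%Y-%m-%d")
            value = parsed.replace(tzinfo=timezone.utc).isoformat()
        if isinstance(value, bool):
            # Assumption: booleans are not a supported value type, so store a string.
            value = str(value).lower()
        attributes[RESERVED.get(key, key.replace("-", "_"))] = value
    return {"Title": flat["title"], "Attributes": attributes}


def sidecar_key(object_key: str) -> str:
    """Name of the metadata file that sits next to the document."""
    return f"{object_key}.metadata.json"


if __name__ == "__main__":
    flat = {
        "title": "ABC Execution Workflow",
        "category": "operations",
        "tags": ["abc", "execution", "workflow", "ops"],
        "version": "3.0",
        "updated_at": "2025-10-10",
        "source-of-truth": True,
        "department": "engineering",
        "summary": "Detailed ABC Process execution workflow.",
    }
    print(json.dumps(to_sidecar(flat), indent=2))
```

Generating the sidecar from one flat source keeps the metadata consistent across documents and makes it easy to regenerate if the field mapping changes.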

3. Indexing Controls: Chunking, Schema & Access

AWS Q Business implicitly chunks content based on structure, but you can influence how it does so.
Ensure documents have:

headings (h1, h2, h3)
bullet points
numbered sections
clear paragraphs

Avoid:

huge dense text
poorly formatted PDFs
scanned pages without OCR
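The structure guidelines above can be turned into a rough pre-upload check for Markdown files. This is only a sketch: `structure_report` is a hypothetical helper, the 150-word paragraph limit is an arbitrary threshold I picked for illustration, and it says nothing about how Q actually chunks. It simply flags documents with no h1 to h3 headings or with very dense paragraphs.

```python
import re

# Headings h1-h3 and bullet or numbered list items, matched at line starts.
HEADING_RE = re.compile(r"^#{1,3}[ \t]+\S", re.MULTILINE)
LIST_ITEM_RE = re.compile(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+\S", re.MULTILINE)


def structure_report(text: str, max_paragraph_words: int = 150) -> dict:
    """Summarize how well-structured a Markdown document looks."""
    # Blocks are separated by blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    # Prose blocks are those that start neither as a heading nor as a list item.
    prose = [
        b for b in blocks
        if not HEADING_RE.match(b) and not LIST_ITEM_RE.match(b)
    ]
    headings = len(HEADING_RE.findall(text))
    longest = max((len(b.split()) for b in prose), default=0)
    return {
        "headings": headings,
        "list_items": len(LIST_ITEM_RE.findall(text)),
        "longest_paragraph_words": longest,
        "needs_cleanup": headings == 0 or longest > max_paragraph_words,
    }
```

A document that comes back with `needs_cleanup` set is a candidate for adding headings or splitting paragraphs before the next sync.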

Give Q a Schema (for JSON, logs, configs)
Example schema:

{
  "type": "object",
  "properties": {
    "step_name": { "type": "string" },
    "description": { "type": "string" },
    "owner": { "type": "string" },
    "timestamp": { "type": "string" }
  }
}

This is especially useful if you push logs or structured data.
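A schema like this is also handy for checking records before they are uploaded. The Python sketch below is a deliberately small validator, not a full JSON Schema implementation: `validate` is a hypothetical helper, it only understands a few types, and it treats every listed property as required, which is stricter than JSON Schema's default.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "step_name": {"type": "string"},
        "description": {"type": "string"},
        "owner": {"type": "string"},
        "timestamp": {"type": "string"},
    },
}

# Only the types this sketch understands; anything else is skipped.
TYPE_MAP = {"string": str, "object": dict, "array": list}


def validate(record, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    if schema.get("type") == "object" and not isinstance(record, dict):
        return ["record is not an object"]
    errors = []
    for name, rule in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(rule.get("type"))
        if name not in record:
            # Assumption: every listed property is required.
            errors.append(f"missing field: {name}")
        elif expected is not None and not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {rule['type']}, got {actual}")
    return errors
```

For example, `validate({"step_name": "deploy", "owner": 42}, SCHEMA)` reports the two missing fields and the wrongly typed `owner`, so malformed log entries can be caught before they reach the index.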

My Final Setup That Worked Amazingly Well

Here’s what gave me the best accuracy:

S3 with Clean Structure: Organized by domains → modules → versions.
Confluence with Proper Page Hierarchy: Q understands “parent → child → sub-page” beautifully if the hierarchy is clean.
Role-Based Access: Users get personalized answers based on IAM roles.
Scheduled Re-indexing: After every source update.
Content Freshness / Sync: The sync strategy was configured to follow the content update process of each source.
Metadata on Every Document:
- title
- tags
- category
- version
- updated_at
- summary
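To keep "metadata on every document" from drifting over time, a small audit can flag documents that are missing any of these keys. This is a sketch in Python under the assumption that you can load each document's metadata into a dictionary; `missing_metadata` and the catalog shape are hypothetical.

```python
# The keys listed above, treated as mandatory for every document.
REQUIRED_KEYS = ("title", "tags", "category", "version", "updated_at", "summary")


def missing_metadata(catalog: dict[str, dict]) -> dict[str, list[str]]:
    """Map each document to the required keys that are absent or empty."""
    gaps = {}
    for doc, meta in catalog.items():
        missing = [key for key in REQUIRED_KEYS if not meta.get(key)]
        if missing:
            gaps[doc] = missing
    return gaps
```

Running this as part of the sync process gives a short list of documents to fix, and an empty result means every document carries the full set.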

What I Learned

Q isn’t truly “no configuration needed”: smart metadata is everything.
Hierarchy and structure matter more than quantity.
Recency metadata keeps Q from answering with outdated content.
“source-of-truth: true” is extremely powerful.
Q Business is excellent, but only if your inputs are clean.

Conclusion

I initially thought AWS Q Business wasn’t retrieving the right data.
Turns out: I wasn’t feeding it the right structure.

Once I fixed the data sources & metadata:

retrieval accuracy improved drastically
domain-specific answers became sharp
version conflicts vanished
hallucinations dropped significantly

If you’re using AWS Q Business for enterprise search or internal assistants, your metadata & indexing strategies determine the quality of your AI.
