Mahak Faheem

The Problem: My AWS Q Business Bot Didn’t Understand My Data

When I started experimenting with AWS Q Business, I connected multiple data sources:

  • Confluence
  • S3 documents
  • PDFs & documentation
  • Website pages through the Web Crawler

Setup was smooth. Indexing completed. Everything looked perfect.
But the answers weren't: Q wasn't retrieving the right data.
At first, I assumed the embeddings hadn't been refreshed or that there were access permission issues.
But the real culprit was something far simpler:
I had connected the data sources, but I hadn’t configured the metadata or document schemas properly.
Q was indexing my data but not understanding the structure, relationships, recency or context boundaries.

Why Metadata Matters in Q Business

Unlike a typical RAG system, where you manually control embeddings, chunking and retrieval, AWS Q Business handles all of this automatically.
But "automatic" doesn’t mean "perfect".
Without metadata, Q struggles with:

  • Prioritizing fresh vs old content
  • Understanding document categories
  • Scoping answers to specific teams or contexts
  • Navigating Confluence pages with nested hierarchy
  • Handling versioned documents
  • Distinguishing source-of-truth vs duplicates

And most importantly:
Q can retrieve irrelevant content that "looks similar" but isn’t actually correct.
Metadata fixes that.

1. Clean Inputs: Well-Structured Data Sources

Each data source needed:

  • A clear folder/project hierarchy
  • Document titles that convey meaning
  • Removal of outdated versions
  • Explicit version numbers when needed
  • Logical grouping (S3 prefixes / Confluence spaces)

Example restructuring in S3:

s3://company-knowledge-base/
  engineering/
    architecture/
      system-overview-v1.pdf
      service-boundaries-v2.md
    apis/
      public-api-spec-v3.yaml
      rate-limiting-rules-v1.pdf
    deployment/
      deployment-checklist-v3.md
      rollback-runbook-v2.md
    troubleshooting/
      common-errors/
        error-catalog-v2.json
        service-x-known-issues.md

  product/
    specs/
      feature-a-spec-v1.pdf
      feature-b-updates-v2.pdf
    roadmaps/
      q4-2025-roadmap.pdf

  operations/
    monitoring/
      alert-guide-v2.md
      oncall-playbook-v1.md
    logs/
      access-logs-structure.json
      application-log-fields.md

  knowledge/
    faq/
      internal-faq-v1.md
    glossary/
      terms-v2.md


This alone improved retrieval accuracy by ~30%.
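
A naming scheme like this is easy to check mechanically. The sketch below is a hypothetical helper, not part of Q Business: it assumes keys follow the `domain/module/name-vN.ext` pattern from the tree above and flags files that have been superseded by a higher version, so they can be removed before indexing.

```python
import re
from collections import defaultdict

# Assumed naming convention: <name>-v<number>.<ext>, as in the tree above.
VERSIONED = re.compile(r"^(?P<name>.+)-v(?P<version>\d+)\.(?P<ext>[^.]+)$")


def parse_key(key):
    """Split an S3 key into domain, module, base name and version.

    Returns None for keys that do not match the assumed convention.
    """
    parts = key.split("/")
    if len(parts) < 3:
        return None
    match = VERSIONED.match(parts[-1])
    if match is None:
        return None
    return {
        "domain": parts[0],
        "module": "/".join(parts[1:-1]),
        "name": match.group("name"),
        "version": int(match.group("version")),
    }


def superseded(keys):
    """Return keys for which a higher version exists in the same folder.

    Files are grouped by folder and base name; the extension is ignored.
    """
    groups = defaultdict(list)
    for key in keys:
        info = parse_key(key)
        if info is not None:
            group = (info["domain"], info["module"], info["name"])
            groups[group].append((info["version"], key))
    stale = []
    for versions in groups.values():
        versions.sort()
        # Everything except the highest version is considered stale.
        stale.extend(key for _, key in versions[:-1])
    return sorted(stale)
```

Given a listing that contains both `deployment-checklist-v3.md` and a hypothetical leftover `deployment-checklist-v2.md` in the same folder, `superseded` returns only the v2 key. Unversioned files such as `q4-2025-roadmap.pdf` are skipped rather than flagged.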

2. Metadata: The Secret to Making Q Business “Smart”

These are the keys that, in my experience, carried the most weight during retrieval:
Recommended Metadata Keys

 Key               | Purpose                                       
 ----------------- | --------------------------------------------- 
 title             | Overrides filename during ranking             
 category          | Helps classification (“engg.”, “ops”, etc.) 
 tags              | Multiple labels improve semantic grouping     
 version           | Helps avoid outdated responses                
 updated_at        | Influences recency scoring                    
 department        | Great for permission-based personalization    
 summary           | Q uses this in ranking + reranking            
 source-of-truth   | Boolean; strong influence                     


Example metadata attached to an S3 object:
{
  "title": "ABC Execution Workflow",
  "category": "operations",
  "tags": ["abc", "execution", "workflow", "ops"],
  "version": "3.0",
  "updated_at": "2025-10-10",
  "source-of-truth": true,
  "department": "engineering",
  "summary": "Detailed ABC Process execution workflow."
}

With this metadata in place, Q consistently picked the correct ABC document.
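
How the metadata actually reaches Q depends on the connector. As a minimal sketch, assuming the S3 connector's sidecar convention (a `<file>.metadata.json` object with `Title` and `Attributes` fields, where `_category`, `_version` and `_last_updated_at` are reserved attribute names), the flat metadata above could be converted like this. The function name and the key mapping are my own illustration, so check the layout against the connector documentation before relying on it.

```python
import json
from datetime import date, datetime, time, timezone

# Assumed mapping from the keys used in this post to reserved attribute names.
RESERVED = {
    "category": "_category",
    "version": "_version",
    "updated_at": "_last_updated_at",
}


def to_sidecar(meta):
    """Convert flat metadata into the assumed sidecar layout."""
    attributes = {}
    for key, value in meta.items():
        if key == "title":
            continue  # stored at the top level as "Title"
        if key == "updated_at":
            # Assumption: date attributes must be ISO 8601 timestamps.
            day = date.fromisoformat(value)
            value = datetime.combine(day, time.min, tzinfo=timezone.utc).isoformat()
        elif isinstance(value, bool):
            # Assumption: attribute values are strings, string lists,
            # numbers or dates, so booleans are stored as strings.
            value = "true" if value else "false"
        attributes[RESERVED.get(key, key)] = value
    return {"Title": meta["title"], "Attributes": attributes}


meta = {
    "title": "ABC Execution Workflow",
    "category": "operations",
    "tags": ["abc", "execution", "workflow", "ops"],
    "version": "3.0",
    "updated_at": "2025-10-10",
    "source-of-truth": True,
    "department": "engineering",
    "summary": "Detailed ABC Process execution workflow.",
}
print(json.dumps(to_sidecar(meta), indent=2))
```

Custom attributes such as `source-of-truth` and `department` would presumably also need to be mapped to index fields in the data source configuration before they can influence retrieval.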

3. Indexing Controls: Chunking, Schema & Access

AWS Q Business chunks content implicitly based on its structure, but you can influence how it does so.
Ensure documents have:

  • headings (h1, h2, h3)
  • bullet points
  • numbered sections
  • clear paragraphs

Avoid:

  • huge dense text
  • poorly formatted PDFs
  • scanned pages without OCR
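
These formatting rules can be checked before upload. The following is a rough sketch for Markdown sources only; the function name and the 200-word paragraph limit are arbitrary assumptions, not thresholds documented by AWS.

```python
import re

# A Markdown heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def lint_markdown(text, max_paragraph_words=200):
    """Return a list of warnings about structure that may hurt chunking."""
    warnings = []
    if not HEADING.search(text):
        warnings.append("no headings found")
    # Paragraphs are blocks separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    for number, block in enumerate(blocks, start=1):
        words = len(block.split())
        if words > max_paragraph_words:
            warnings.append(f"paragraph {number} has {words} words")
    return warnings
```

A document with headings and short paragraphs yields an empty list, while a single 300-word wall of text triggers both warnings. Scanned PDFs without OCR need a different check, since there is no text to inspect at all.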

Give Q a Schema (for JSON, logs, configs)
Example schema:

{
  "type": "object",
  "properties": {
    "step_name": { "type": "string" },
    "description": { "type": "string" },
    "owner": { "type": "string" },
    "timestamp": { "type": "string" }
  }
}

This is especially useful if you push logs or structured data.
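
To show what the schema buys you, here is a minimal sketch that checks a record against it using only the standard library. It handles just the `type` and `properties` keywords used above; a full validator such as the third-party `jsonschema` package would be the usual choice. The `check` function and the sample record are hypothetical.

```python
SCHEMA = {
    "type": "object",
    "properties": {
        "step_name": {"type": "string"},
        "description": {"type": "string"},
        "owner": {"type": "string"},
        "timestamp": {"type": "string"},
    },
}

# Python types for the JSON Schema type names this sketch supports.
TYPES = {"object": dict, "string": str}


def check(record, schema):
    """Return a list of type errors for the record (empty if it conforms)."""
    if not isinstance(record, TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    for field, rule in schema.get("properties", {}).items():
        # As in JSON Schema, fields not marked as required may be absent.
        if field in record and not isinstance(record[field], TYPES[rule["type"]]):
            errors.append(f"{field}: expected {rule['type']}")
    return errors


# Hypothetical log record with a wrongly typed owner field.
print(check({"step_name": "deploy", "owner": 42}, SCHEMA))
```

Running a check like this before pushing logs catches malformed records early, instead of letting them end up in the index.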

My Final Setup That Worked Amazingly Well

Here’s what gave me the best accuracy:

  1. S3 with Clean Structure: Organized by domains → modules → versions.

  2. Confluence with Proper Page Hierarchy: Q understands “parent → child → sub-page” beautifully if the hierarchy is clean.

  3. Role-Based Access: Users get personalized answers based on IAM roles.

  4. Scheduled Re-indexing: After every source update.

  5. Content Freshness / Sync: The sync strategy was configured to match the content update process.

  6. Metadata on Every Document

    • title
    • tags
    • category
    • version
    • updated_at
    • summary
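
A check like the following can enforce the metadata rule before each sync. It is a sketch with hypothetical names: `REQUIRED` mirrors the six keys listed above, and `documents` stands in for however you load each file's metadata.

```python
REQUIRED = ("title", "tags", "category", "version", "updated_at", "summary")


def missing_metadata(documents):
    """Map each document key to the required fields it lacks or leaves empty."""
    report = {}
    for key, meta in documents.items():
        missing = [field for field in REQUIRED if not meta.get(field)]
        if missing:
            report[key] = missing
    return report
```

Documents with complete metadata are left out of the report, so an empty result means every file is ready to index.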

What I Learned

  • Q isn’t truly “no configuration needed”: smart metadata is everything.
  • Hierarchy and structure matter more than quantity.
  • Recency metadata keeps Q from answering with outdated content.
  • “source-of-truth: true” is extremely powerful.
  • Q Business is excellent, but only if your inputs are clean.

Conclusion

I initially thought AWS Q Business wasn’t retrieving the right data.
Turns out: I wasn’t feeding it the right structure.

Once I fixed the data sources & metadata:

  • retrieval accuracy improved drastically
  • domain-specific answers became sharp
  • version conflicts vanished
  • hallucinations dropped significantly

If you’re using AWS Q Business for enterprise search or internal assistants, your metadata & indexing strategies determine the quality of your AI.

:)
