<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brittany </title>
    <description>The latest articles on DEV Community by Brittany  (@brittany_37606c0775530a57).</description>
    <link>https://dev.to/brittany_37606c0775530a57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png</url>
      <title>DEV Community: Brittany </title>
      <link>https://dev.to/brittany_37606c0775530a57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brittany_37606c0775530a57"/>
    <language>en</language>
    <item>
      <title>Most ML accuracy issues aren’t model problems. They’re upstream SQL problems. JOIN granularity. Silent NULLs. Distorted aggregations. Sometimes the biggest ML improvement isn’t a new model — it’s a better query.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:06:01 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</link>
      <guid>https://dev.to/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" class="crayons-story__hidden-navigation-link"&gt;Your ML Model Isn’t Wrong. Your SQL Probably Is.&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/brittany_37606c0775530a57" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" alt="brittany_37606c0775530a57 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/brittany_37606c0775530a57" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Brittany 
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Brittany 
                
              
              &lt;div id="story-author-preview-content-3274277" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/brittany_37606c0775530a57" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Brittany &lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Feb 22&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba" id="article-link-3274277"&gt;
          Your ML Model Isn’t Wrong. Your SQL Probably Is.
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/sql"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;sql&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mlops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mlops&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Your ML Model Isn’t Wrong. Your SQL Probably Is.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 02:54:36 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</link>
      <guid>https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</guid>
      <description>&lt;p&gt;Your churn model isn’t degrading because the algorithm is weak.&lt;/p&gt;

&lt;p&gt;It might be degrading because of a JOIN.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend weeks tuning hyperparameters, switching architectures, and debating feature importance — only to discover the real issue was upstream data logic.&lt;/p&gt;

&lt;p&gt;Before you tune the model, check your SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem Most Teams Misdiagnose&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When performance drops, we usually suspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model drift&lt;/li&gt;
&lt;li&gt;Hyperparameter tuning&lt;/li&gt;
&lt;li&gt;Feature scaling&lt;/li&gt;
&lt;li&gt;Algorithm choice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are valid concerns.&lt;/p&gt;

&lt;p&gt;But machine learning models don’t invent patterns.&lt;/p&gt;

&lt;p&gt;They learn from the data we feed them.&lt;/p&gt;

&lt;p&gt;If the dataset is flawed, the model will faithfully learn those flaws.&lt;/p&gt;

&lt;p&gt;Upstream data logic determines downstream model behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario: The “Failing” Churn Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A churn prediction model starts underperforming.&lt;/p&gt;

&lt;p&gt;Same architecture.&lt;br&gt;
Same training pipeline.&lt;br&gt;
Same evaluation framework.&lt;/p&gt;

&lt;p&gt;Nothing obvious changed.&lt;/p&gt;

&lt;p&gt;After investigation, the issue wasn’t model complexity.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT *
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It looks harmless.&lt;/p&gt;

&lt;p&gt;It runs fast.&lt;br&gt;
It returns data.&lt;br&gt;
It passes basic tests.&lt;/p&gt;

&lt;p&gt;But customers with multiple orders are duplicated across rows.&lt;/p&gt;

&lt;p&gt;High-activity users become unintentionally overweighted in the training dataset.&lt;/p&gt;

&lt;p&gt;The model didn’t fail.&lt;/p&gt;

&lt;p&gt;It did exactly what we told it to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake #1: Duplicate Rows from JOINs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your model expects one row per customer but your query returns one row per transaction, you’ve changed the learning problem.&lt;/p&gt;

&lt;p&gt;The issue isn’t SQL skill — it’s granularity awareness.&lt;/p&gt;

&lt;p&gt;A better approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    c.customer_id,
    COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aggregate intentionally before training.&lt;/p&gt;

&lt;p&gt;Define the learning unit.&lt;/p&gt;
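
&lt;p&gt;The fan-out is cheap to verify before training. A minimal sketch using Python’s built-in sqlite3 module (the table contents here are hypothetical):&lt;/p&gt;

```python
import sqlite3

# Hypothetical data: customer 1 has three orders, customer 2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# The raw JOIN returns one row per order, not one row per customer.
joined = conn.execute(
    "SELECT c.customer_id FROM customers c "
    "JOIN orders o ON c.customer_id = o.customer_id"
).fetchall()

# Aggregating first restores one row per customer.
aggregated = conn.execute(
    "SELECT c.customer_id, COUNT(o.order_id) AS total_orders "
    "FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id "
    "GROUP BY c.customer_id ORDER BY c.customer_id"
).fetchall()

print(len(joined))  # 4 rows for 2 customers: customer 1 is triple-counted
print(aggregated)   # [(1, 3), (2, 1)]
```

&lt;p&gt;Asserting the expected row count per entity is a one-line check that catches most granularity bugs before they reach training.&lt;/p&gt;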

&lt;p&gt;&lt;strong&gt;Mistake #2: Silent NULL Handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;NULLs rarely crash pipelines.&lt;/p&gt;

&lt;p&gt;They quietly distort them.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If income contains NULLs and you don’t handle them deliberately, the model sees noise.&lt;/p&gt;

&lt;p&gt;Even something simple like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    COALESCE(income, 0) AS income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;forces you to define intent.&lt;/p&gt;

&lt;p&gt;The important part isn’t the function.&lt;/p&gt;

&lt;p&gt;It’s the decision.&lt;/p&gt;
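
&lt;p&gt;You can see the difference a NULL policy makes with a tiny sqlite3 sketch (the incomes are hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, income REAL);
    -- Hypothetical: two known incomes and one missing value.
    INSERT INTO customers VALUES (1, 50000), (2, 70000), (3, NULL);
""")

# AVG() silently skips NULLs, so this averages 2 rows, not 3.
avg_skipping_nulls = conn.execute(
    "SELECT AVG(income) FROM customers"
).fetchone()[0]

# COALESCE makes the policy explicit: treat missing income as 0.
avg_nulls_as_zero = conn.execute(
    "SELECT AVG(COALESCE(income, 0)) FROM customers"
).fetchone()[0]

print(avg_skipping_nulls)  # 60000.0 (NULL row excluded)
print(avg_nulls_as_zero)   # 40000.0 (NULL counted as 0)
```

&lt;p&gt;Neither answer is “right” in general; the point is that the query should state which one you chose.&lt;/p&gt;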

&lt;p&gt;&lt;strong&gt;Mistake #3: Distorted Aggregations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Global averages can hide meaningful segmentation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT AVG(transaction_amount)
FROM transactions;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It works.&lt;br&gt;
It returns a number.&lt;br&gt;
It feels reasonable.&lt;/p&gt;

&lt;p&gt;But a model trained on broad aggregates may underperform in production because it lacks entity-level context.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    customer_id,
    AVG(transaction_amount) AS avg_txn
FROM transactions
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aggregation logic should reflect the model objective — not convenience.&lt;/p&gt;

&lt;p&gt;Aggregation is feature construction.&lt;/p&gt;

&lt;p&gt;Feature construction is model behavior.&lt;/p&gt;
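
&lt;p&gt;A small sketch of why the granularity of the average matters (the data is hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, transaction_amount REAL);
    -- Hypothetical: one small-ticket customer, one large-ticket customer.
    INSERT INTO transactions VALUES (1, 10), (1, 10), (1, 10), (2, 500);
""")

global_avg = conn.execute(
    "SELECT AVG(transaction_amount) FROM transactions"
).fetchone()[0]

per_customer = conn.execute(
    "SELECT customer_id, AVG(transaction_amount) AS avg_txn "
    "FROM transactions GROUP BY customer_id ORDER BY customer_id"
).fetchall()

print(global_avg)    # 132.5, which describes neither customer
print(per_customer)  # [(1, 10.0), (2, 500.0)]: entity-level features
```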

&lt;p&gt;&lt;strong&gt;The Bigger Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many ML failures blamed on “model accuracy” are actually upstream data logic issues.&lt;/p&gt;

&lt;p&gt;Strong ML systems require strong SQL foundations.&lt;/p&gt;

&lt;p&gt;Data pipelines are part of the model architecture — not just preprocessing.&lt;/p&gt;

&lt;p&gt;Strong models are built on strong data contracts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before You Tune the Model, Ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are joins intentional?&lt;/li&gt;
&lt;li&gt;Is entity granularity clearly defined?&lt;/li&gt;
&lt;li&gt;Are aggregations aligned with the objective?&lt;/li&gt;
&lt;li&gt;Are NULLs handled deliberately?&lt;/li&gt;
&lt;li&gt;Is the training dataset versioned?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the biggest ML improvement isn’t a new model.&lt;/p&gt;

&lt;p&gt;It’s a better query.&lt;/p&gt;

&lt;p&gt;If you’d like to see the structured breakdown with examples and commentary, I documented it here:&lt;/p&gt;

&lt;p&gt;👉 GitHub repository:&lt;br&gt;
&lt;a href="https://github.com/brie1807/sql-to-ml-pipeline-mistakes" rel="noopener noreferrer"&gt;https://github.com/brie1807/sql-to-ml-pipeline-mistakes&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>We debate neural networks, but SQL quietly shapes what a model is allowed to believe. A systems-level perspective on ML architecture.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:30:52 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</link>
      <guid>https://dev.to/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/brittany_37606c0775530a57" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png" alt="brittany_37606c0775530a57"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Machine Learning Starts With a WHERE Clause&lt;/h2&gt;
      &lt;h3&gt;Brittany  ・ Feb 16&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Machine Learning Starts With a WHERE Clause</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:26:15 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</link>
      <guid>https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</guid>
      <description>&lt;p&gt;Most people think machine learning starts with a model.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It starts with a query.&lt;/p&gt;

&lt;p&gt;Before SageMaker trains.&lt;br&gt;
Before scikit-learn fits.&lt;br&gt;
Before hyperparameters are tuned.&lt;/p&gt;

&lt;p&gt;Someone writes a WHERE clause.&lt;/p&gt;

&lt;p&gt;And that clause quietly decides what the model is allowed to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🏗️ SQL Is Architectural — Not Just Operational&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In real ML systems, SQL isn’t just for “getting data.”&lt;/p&gt;

&lt;p&gt;It defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which records are included&lt;/li&gt;
&lt;li&gt;Which time windows matter&lt;/li&gt;
&lt;li&gt;Which behaviors become features&lt;/li&gt;
&lt;li&gt;Which outcomes are excluded&lt;/li&gt;
&lt;li&gt;Which bias is unintentionally preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(amount) AS lifetime_value,
    MAX(order_date) AS last_purchase
FROM transactions
WHERE order_date &amp;gt;= '2024-01-01'
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That single WHERE clause just decided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time boundary of learning&lt;/li&gt;
&lt;li&gt;What counts as “recent behavior”&lt;/li&gt;
&lt;li&gt;Whether seasonality exists&lt;/li&gt;
&lt;li&gt;Whether older patterns are erased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model hasn’t trained yet.&lt;/p&gt;

&lt;p&gt;But its worldview has already been shaped.&lt;/p&gt;
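
&lt;p&gt;A quick sketch of how the cutoff date rewrites a feature, using Python’s built-in sqlite3 module (the purchase history is hypothetical):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL, order_date TEXT);
    -- Hypothetical purchases straddling the 2024-01-01 cutoff.
    INSERT INTO transactions VALUES
        (1, 100, '2023-06-01'),
        (1, 100, '2023-12-15'),
        (1,  40, '2024-03-01');
""")

# The same "lifetime_value" feature, with and without the WHERE clause.
full_history = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE customer_id = 1"
).fetchone()[0]

recent_only = conn.execute(
    "SELECT SUM(amount) FROM transactions "
    "WHERE customer_id = 1 AND order_date >= '2024-01-01'"
).fetchone()[0]

print(full_history)  # 240.0: looks like a long-term, high-value customer
print(recent_only)   # 40.0: looks like a low-value customer
```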

&lt;p&gt;&lt;strong&gt;📊 Feature Engineering Happens Before Python&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most ML discussions focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural networks&lt;/li&gt;
&lt;li&gt;Gradient descent&lt;/li&gt;
&lt;li&gt;Model selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But feature engineering often happens inside the database.&lt;/p&gt;

&lt;p&gt;Aggregations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SUM()&lt;/li&gt;
&lt;li&gt;AVG()&lt;/li&gt;
&lt;li&gt;COUNT()&lt;/li&gt;
&lt;li&gt;Window functions&lt;/li&gt;
&lt;li&gt;Time-based grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not “data prep steps.”&lt;/p&gt;

&lt;p&gt;They are architectural decisions.&lt;/p&gt;

&lt;p&gt;If you compute:&lt;/p&gt;

&lt;p&gt;AVG(amount)&lt;/p&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;SUM(amount)&lt;/p&gt;

&lt;p&gt;You change the scale of influence.&lt;/p&gt;

&lt;p&gt;If you group by week instead of month, you change volatility.&lt;/p&gt;

&lt;p&gt;If you filter out NULLs, you may remove entire demographic signals.&lt;/p&gt;

&lt;p&gt;SQL quietly determines signal strength.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Data Leakage Is Often a Query Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many ML failures aren’t algorithmic.&lt;/p&gt;

&lt;p&gt;They’re temporal mistakes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- leaky: keeps only rows whose outcome is already known
-- at prediction time
SELECT *
FROM training_data
WHERE prediction_date &amp;gt; outcome_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If your query accidentally includes future outcomes,&lt;br&gt;
you’ve created a perfect model.&lt;/p&gt;

&lt;p&gt;And a useless one.&lt;/p&gt;

&lt;p&gt;Leakage is rarely a Python issue.&lt;/p&gt;

&lt;p&gt;It’s usually a SQL design issue.&lt;/p&gt;
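
&lt;p&gt;One defensive habit is a temporal sanity check on the training set before fitting anything. A minimal sketch; the field names here are hypothetical:&lt;/p&gt;

```python
from datetime import date

# Each training row records when its features were computed (feature_asof)
# and the date the prediction is nominally made (prediction_date).
rows = [
    {"prediction_date": date(2024, 1, 1), "feature_asof": date(2023, 12, 31)},
    {"prediction_date": date(2024, 1, 1), "feature_asof": date(2024, 1, 5)},  # leaky
]

def leaky_rows(rows):
    """Return rows whose features were computed after the prediction date."""
    return [r for r in rows if r["feature_asof"] > r["prediction_date"]]

bad = leaky_rows(rows)
print(len(bad))  # 1: one row lets the model peek at the future
```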

&lt;p&gt;&lt;strong&gt;🧠 The System View&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning is often presented as:&lt;/p&gt;

&lt;p&gt;Data → Model → Prediction&lt;/p&gt;

&lt;p&gt;In reality, it’s:&lt;/p&gt;

&lt;p&gt;Raw Data → SQL Constraints → Engineered Features → Training Dataset → Model&lt;/p&gt;

&lt;p&gt;SQL is the gatekeeper.&lt;/p&gt;

&lt;p&gt;The model only sees what the query allows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Why This Matters (Cost + Architecture)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In AWS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad queries increase Athena/Redshift cost&lt;/li&gt;
&lt;li&gt;Poor feature aggregation increases training time&lt;/li&gt;
&lt;li&gt;Overly wide datasets increase memory usage&lt;/li&gt;
&lt;li&gt;Incorrect joins inflate SageMaker compute bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL decisions scale financially.&lt;/p&gt;

&lt;p&gt;Models amplify whatever SQL defines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🛠 GitHub Companion Plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A companion repo, sql-ml-architecture-foundations, is planned to include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries.sql (example feature engineering queries)&lt;/li&gt;
&lt;li&gt;A small sample dataset (CSV)&lt;/li&gt;
&lt;li&gt;A README explaining how each query changes model behavior, how SQL affects cost, bias, and drift, and how this ties into ML pipelines (SageMaker, Glue, Feature Store)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The claim here is not simply that&lt;br&gt;
“SQL is important.”&lt;/p&gt;

&lt;p&gt;It’s that:&lt;/p&gt;

&lt;p&gt;SQL is the architectural layer that defines what a model is allowed to believe.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>sql</category>
    </item>
    <item>
      <title>Missing Data Isn’t a Cleanup Problem — It’s a Signal</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:56:29 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</link>
      <guid>https://dev.to/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</guid>
      <description>&lt;p&gt;Most machine learning courses teach you how to handle missing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fill it.&lt;br&gt;
Drop it.&lt;br&gt;
Impute it.&lt;br&gt;
Move on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And for exams, that’s usually enough.&lt;/p&gt;

&lt;p&gt;But production systems tell a different story.&lt;/p&gt;

&lt;p&gt;In the real world, missing data isn’t just something to fix —&lt;br&gt;
it’s often the first signal that something upstream is breaking.&lt;/p&gt;

&lt;p&gt;This is where the gap between passing exams and building durable ML systems begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Exams Teach About Missing Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In exam scenarios, missing values are treated as a technical inconvenience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace with the mean or median&lt;/li&gt;
&lt;li&gt;Forward-fill or backward-fill&lt;/li&gt;
&lt;li&gt;Drop rows with too many nulls&lt;/li&gt;
&lt;li&gt;Use models that tolerate missing values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques are valid.&lt;/p&gt;

&lt;p&gt;They’re also context-free.&lt;/p&gt;

&lt;p&gt;The exam assumes the data problem already happened —&lt;br&gt;
your job is just to make the model run.&lt;/p&gt;

&lt;p&gt;Production doesn’t care that your model runs.&lt;br&gt;
It cares that it keeps running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Production Systems Teach Instead&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In production, missing data usually shows up for a reason.&lt;/p&gt;

&lt;p&gt;And that reason matters more than the fix.&lt;/p&gt;

&lt;p&gt;Missing values often mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pipeline failed silently&lt;/li&gt;
&lt;li&gt;An upstream service timed out&lt;/li&gt;
&lt;li&gt;A schema changed without notice&lt;/li&gt;
&lt;li&gt;A feature stopped being generated&lt;/li&gt;
&lt;li&gt;A data source degraded slowly over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are modeling problems.&lt;br&gt;
They’re system problems.&lt;/p&gt;

&lt;p&gt;If you immediately impute and move on, the model may keep producing outputs —&lt;br&gt;
but now it’s learning from broken assumptions.&lt;/p&gt;

&lt;p&gt;That’s how models degrade quietly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Data as a Diagnostic Signal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing values are often symptoms, not errors.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;“How do I fill this?”&lt;/p&gt;

&lt;p&gt;Production systems force you to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why did this feature go missing?&lt;/li&gt;
&lt;li&gt;Is the missingness random or systematic?&lt;/li&gt;
&lt;li&gt;Did this appear suddenly or gradually?&lt;/li&gt;
&lt;li&gt;Does missing data correlate with certain users, times, or regions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions don’t show up on exams.&lt;/p&gt;

&lt;p&gt;They do decide whether a system survives in the real world.&lt;/p&gt;
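
&lt;p&gt;A first diagnostic is simply to break the missingness rate down by segment. A sketch in plain Python (the records and field names are hypothetical):&lt;/p&gt;

```python
# Hypothetical records: income is missing only for mobile users,
# which suggests systematic (not random) missingness.
records = [
    {"channel": "web",    "income": 52000},
    {"channel": "web",    "income": 61000},
    {"channel": "mobile", "income": None},
    {"channel": "mobile", "income": None},
]

def missing_rate_by(records, segment_key, field):
    """Fraction of records with `field` missing, per segment."""
    totals, missing = {}, {}
    for r in records:
        seg = r[segment_key]
        totals[seg] = totals.get(seg, 0) + 1
        if r[field] is None:
            missing[seg] = missing.get(seg, 0) + 1
    return {seg: missing.get(seg, 0) / n for seg, n in totals.items()}

rates = missing_rate_by(records, "channel", "income")
print(rates)  # {'web': 0.0, 'mobile': 1.0}: clearly not random
```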

&lt;p&gt;&lt;strong&gt;Why Simple Methods Sometimes Win&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why simpler techniques often outperform complex ones in production.&lt;/p&gt;

&lt;p&gt;Not because they’re smarter —&lt;br&gt;
but because they’re more stable when assumptions break.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean imputation is predictable&lt;/li&gt;
&lt;li&gt;Dropping features is transparent&lt;/li&gt;
&lt;li&gt;Rule-based fallbacks are debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complex models can hide data issues by adapting too well —&lt;br&gt;
until performance suddenly collapses weeks later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Real Skill Gap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Passing exams proves you know what to do when data is missing.&lt;/p&gt;

&lt;p&gt;Building durable ML systems requires knowing when missing data is trying to tell you something.&lt;/p&gt;

&lt;p&gt;That’s the gap.&lt;/p&gt;

&lt;p&gt;Exams ask: “What’s the correct technique?”&lt;br&gt;
Production asks: “Why is this happening now?”&lt;/p&gt;

&lt;p&gt;Exams optimize for correctness.&lt;br&gt;
Production optimizes for awareness.&lt;/p&gt;

&lt;p&gt;And awareness is what keeps models alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Thought&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing data isn’t just a preprocessing step.&lt;/p&gt;

&lt;p&gt;It’s feedback.&lt;/p&gt;

&lt;p&gt;If you listen to it early, you fix pipelines.&lt;br&gt;
If you ignore it, you retrain models that are already drifting.&lt;/p&gt;

&lt;p&gt;And that’s where the real difference between learning ML&lt;br&gt;
and operating ML begins.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>featureengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Missing Data in Machine Learning: A Practical Step-by-Step Approach</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sat, 07 Feb 2026 19:15:07 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</link>
      <guid>https://dev.to/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</guid>
      <description>&lt;p&gt;Missing data breaks more machine learning models than bad algorithms — not because it’s hard to detect, but because it’s easy to overthink.&lt;/p&gt;

&lt;p&gt;When datasets contain NaNs, sparse features, or incomplete records, the default reaction is often to add complexity.&lt;br&gt;
In practice, stability usually matters more than sophistication.&lt;/p&gt;

&lt;p&gt;Here’s a practical, step-by-step way to think about missing data in real ML systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Assume Missing Data Is Normal&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In real systems, missing data isn’t an edge case.&lt;/p&gt;

&lt;p&gt;It comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially filled forms&lt;/li&gt;
&lt;li&gt;dropped logs or sensors&lt;/li&gt;
&lt;li&gt;schema changes over time&lt;/li&gt;
&lt;li&gt;merged datasets from different sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat missing values as rare exceptions, you’ll design fragile pipelines.&lt;br&gt;
Instead, assume they’re part of the data distribution.&lt;/p&gt;

&lt;p&gt;Goal: design preprocessing that keeps working as systems evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Identify Why the Data Is Missing (Not Just Where)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all missing data is random.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did users skip a field?&lt;/li&gt;
&lt;li&gt;Did a service fail?&lt;/li&gt;
&lt;li&gt;Did a logging or schema change occur?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When missingness is tied to behavior or infrastructure, it carries information — but it also introduces risk.&lt;br&gt;
Models trained on one missingness pattern may fail when that pattern changes.&lt;/p&gt;

&lt;p&gt;Goal: avoid baking temporary assumptions into your model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Start With the Simplest Stable Baseline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before reaching for advanced techniques, establish a stable baseline.&lt;/p&gt;

&lt;p&gt;Simple imputation methods (mean or median):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce variance&lt;/li&gt;
&lt;li&gt;preserve feature scale&lt;/li&gt;
&lt;li&gt;behave consistently over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don’t adapt. They don’t infer.&lt;br&gt;
That predictability is exactly what makes them reliable in production.&lt;/p&gt;

&lt;p&gt;Goal: maximize stability before optimizing accuracy.&lt;/p&gt;
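
&lt;p&gt;As a concrete baseline, median imputation fits in a few lines of plain Python (a sketch, not a production implementation):&lt;/p&gt;

```python
def median_impute(values):
    """Replace None with the median of the observed values.

    Deliberately simple and predictable: compute one statistic
    from the non-missing values, then fill every gap with it.
    """
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to impute from")
    mid = len(observed) // 2
    if len(observed) % 2:
        median = observed[mid]
    else:
        median = (observed[mid - 1] + observed[mid]) / 2
    return [median if v is None else v for v in values]

print(median_impute([10, None, 30, 20, None]))  # [10, 20, 30, 20, 20]
```

&lt;p&gt;The median of [10, 20, 30] is 20, so both gaps get the same, explainable value.&lt;/p&gt;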

&lt;p&gt;&lt;strong&gt;Step 4: Be Careful With “Smart” Solutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced imputers, PCA, and neural networks can accidentally learn the pattern of missingness, not the underlying signal.&lt;/p&gt;

&lt;p&gt;Common failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;great validation metrics&lt;/li&gt;
&lt;li&gt;poor generalization&lt;/li&gt;
&lt;li&gt;silent performance decay after deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complexity increases sensitivity to distribution shifts — especially when missing data is involved.&lt;/p&gt;

&lt;p&gt;Goal: avoid solutions that look good during training but fail quietly later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Use PCA and Deep Learning Only When the Pipeline Is Stable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Advanced techniques work best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missingness is minimal or well-understood&lt;/li&gt;
&lt;li&gt;feature definitions are consistent&lt;/li&gt;
&lt;li&gt;training data matches production patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PCA is useful for noise reduction — not for “fixing” missing values.&lt;br&gt;
Deep learning handles missing data well only when designed explicitly for it.&lt;/p&gt;

&lt;p&gt;Goal: earn complexity after stability is proven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Treat Missing Data as System Feedback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing values often signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;broken pipelines&lt;/li&gt;
&lt;li&gt;misaligned teams&lt;/li&gt;
&lt;li&gt;shifting assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feature stores help by enforcing consistent definitions and freshness.&lt;br&gt;
Monitoring helps detect when missingness patterns change.&lt;/p&gt;

&lt;p&gt;Fixing the system upstream is often more effective than adding intelligence downstream.&lt;/p&gt;

&lt;p&gt;Goal: solve the root cause, not just the symptom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Optimize for Long-Term Behavior, Not Short-Term Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A slightly less accurate model that behaves predictably will outperform a fragile one over time.&lt;/p&gt;

&lt;p&gt;This is why simple preprocessing approaches persist in production systems:&lt;br&gt;
they survive real-world variability.&lt;/p&gt;

&lt;p&gt;Goal: choose approaches that fail gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When handling missing data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assume it’s normal&lt;/li&gt;
&lt;li&gt;understand why it exists&lt;/li&gt;
&lt;li&gt;start simple&lt;/li&gt;
&lt;li&gt;earn complexity&lt;/li&gt;
&lt;li&gt;prioritize stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning systems don’t fail because they’re not smart enough.&lt;br&gt;
They fail because they’re not stable enough.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why Categorical Data Can Quietly Break Your ML Model</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Fri, 06 Feb 2026 02:07:48 +0000</pubDate>
      <link>https://dev.to/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</link>
      <guid>https://dev.to/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</guid>
      <description>&lt;p&gt;Your model didn’t fail because of the algorithm.&lt;br&gt;
It failed because of how your data was represented.&lt;/p&gt;

&lt;p&gt;One of the easiest ways to break a machine learning model isn’t choosing the wrong algorithm.&lt;/p&gt;

&lt;p&gt;It’s feeding the model categorical data without thinking about how the model actually interprets numbers.&lt;/p&gt;

&lt;p&gt;This problem shows up constantly in real ML pipelines—especially when models perform well during training but behave unpredictably in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Problem with Categorical Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning models don’t understand categories.&lt;/p&gt;

&lt;p&gt;They understand numbers.&lt;/p&gt;

&lt;p&gt;When you pass categorical values like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;country&lt;/li&gt;
&lt;li&gt;product type&lt;/li&gt;
&lt;li&gt;customer segment&lt;/li&gt;
&lt;li&gt;status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you’re forced to decide how those categories are represented numerically.&lt;/p&gt;

&lt;p&gt;That decision matters more than many people realize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why “Just Assigning Numbers” Is Dangerous&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A common mistake is encoding categories like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → 1&lt;/li&gt;
&lt;li&gt;Blue → 2&lt;/li&gt;
&lt;li&gt;Green → 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To a human, these are just labels.&lt;/p&gt;

&lt;p&gt;To a model, they look like ordered values.&lt;/p&gt;

&lt;p&gt;The model now assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green &amp;gt; Blue &amp;gt; Red&lt;/li&gt;
&lt;li&gt;The “distance” between categories has meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in most real-world problems, that relationship doesn’t exist.&lt;/p&gt;

&lt;p&gt;This can quietly distort model behavior without throwing errors or warnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What One-Hot Encoding Actually Fixes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding removes false relationships.&lt;/p&gt;

&lt;p&gt;Instead of a single numeric column, each category becomes its own binary feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → [1, 0, 0]&lt;/li&gt;
&lt;li&gt;Blue → [0, 1, 0]&lt;/li&gt;
&lt;li&gt;Green → [0, 0, 1]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No ordering&lt;/li&gt;
&lt;li&gt;No implied distance&lt;/li&gt;
&lt;li&gt;Each category as an independent signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why one-hot encoding is often the default choice in many ML pipelines.&lt;/p&gt;
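
&lt;p&gt;The idea is small enough to write out in plain Python (a sketch; scikit-learn’s OneHotEncoder or pandas’ get_dummies are the usual production choices):&lt;/p&gt;

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Categories are sorted so column order is deterministic across runs.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, rows = one_hot(["Red", "Blue", "Green", "Blue"])
print(cats)  # ['Blue', 'Green', 'Red']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

&lt;p&gt;Note there is no ordering left to misread: each category is just its own column.&lt;/p&gt;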

&lt;p&gt;&lt;strong&gt;When One-Hot Encoding Helps Most&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categories have no natural order&lt;/li&gt;
&lt;li&gt;Models assume numeric relationships (e.g., linear models)&lt;/li&gt;
&lt;li&gt;You want to avoid injecting unintended bias&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll often see it used with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear regression&lt;/li&gt;
&lt;li&gt;Logistic regression&lt;/li&gt;
&lt;li&gt;Feature engineering pipelines before training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When One-Hot Encoding Creates New Problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One-hot encoding isn’t free.&lt;/p&gt;

&lt;p&gt;It introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High dimensionality&lt;/li&gt;
&lt;li&gt;Sparse data&lt;/li&gt;
&lt;li&gt;Increased memory and compute cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes an issue when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categories have high cardinality (thousands of values)&lt;/li&gt;
&lt;li&gt;You’re working with large datasets&lt;/li&gt;
&lt;li&gt;You’re deploying models with tight performance constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, encoding strategy becomes a system design decision, not just preprocessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters in Real ML Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Encoding choices affect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model performance&lt;/li&gt;
&lt;li&gt;Training time&lt;/li&gt;
&lt;li&gt;Inference cost&lt;/li&gt;
&lt;li&gt;Data consistency between training and production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model may look accurate in experiments and still fail quietly after deployment if encoding isn’t handled consistently.&lt;/p&gt;

&lt;p&gt;This is why many ML failures aren’t algorithm failures.&lt;/p&gt;

&lt;p&gt;They’re data representation failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bigger Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature engineering decisions shape how a model understands the world.&lt;/p&gt;

&lt;p&gt;One-hot encoding isn’t just a technical detail—it’s a way of protecting your model from learning relationships that don’t exist.&lt;/p&gt;

&lt;p&gt;If a model behaves strangely, don’t start by changing the algorithm.&lt;/p&gt;

&lt;p&gt;Start by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is this data represented?&lt;/li&gt;
&lt;li&gt;What assumptions does this encoding introduce?&lt;/li&gt;
&lt;li&gt;Is the model learning real patterns—or artificial ones?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most ML issues begin there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
