DEV Community: Shruti Gupta

Amazon RDS vs DynamoDB: I Stopped Asking "Which Database Is Better?" and Started Asking "What Problem Am I Solving?"

Shruti Gupta — Sun, 19 Jul 2026 11:02:06 +0000

When I first started learning AWS, I had a question that probably every beginner has at some point.

Why does AWS have multiple database services?

If Amazon RDS stores data, why do we also need DynamoDB?

Can't one database do everything?

For a while, I assumed the difference was just SQL vs NoSQL.

After learning more, I realized that wasn't the real difference.

The real difference is that they're designed to solve different problems.

That changed the way I look at databases.

Instead of asking,

"Which database should I learn?"

I now ask,

"What kind of application am I building?"

And surprisingly, the answer becomes much clearer.

Before We Compare...

Imagine you're building two applications.

The first one is a University Examination Portal.

The second is a social media platform where millions of users continuously like posts, comment, and upload content.

Would you design both systems in exactly the same way?

Probably not.

That's exactly why Amazon RDS and DynamoDB both exist.

Amazon RDS - When Relationships Matter

Suppose you're building a university management system.

You have:

Students
Courses
Professors
Departments
Attendance
Examination Results

Everything is connected.

A student enrolls in multiple courses.

A course has multiple students.

Every result belongs to a particular student and subject.

This is relational data.

Trying to store all of this without relationships would quickly become difficult to manage.

That's where Amazon RDS shines.

What is Amazon RDS?

Amazon RDS (Relational Database Service) is a managed relational database service.

The important word here is managed.

You don't have to spend your time worrying about:

installing database software
backups
software updates
monitoring
hardware failures

AWS handles much of that for you.

You focus on building your application.

Supported Database Engines

One thing I found interesting is that RDS isn't a database itself.

It's a managed service that supports different database engines, including:

MySQL
PostgreSQL
MariaDB
Oracle
SQL Server

So if you're already familiar with SQL, moving to RDS feels natural.

Why SQL Works Well Here

Suppose you want to answer questions like:

Which students scored above 90?
Which department has the highest attendance?
Which professor teaches the maximum number of courses?

These questions involve relationships between multiple tables.

SQL was built for exactly this.

Using JOINs makes these queries straightforward.
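To make the JOIN idea concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (students, courses, results) and the sample rows are assumptions made up for illustration. The query is plain SQL of the kind an RDS engine such as MySQL or PostgreSQL accepts.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory university database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE results (
            student_id INTEGER NOT NULL REFERENCES students(id),
            course_id INTEGER NOT NULL REFERENCES courses(id),
            score INTEGER NOT NULL
        );
        INSERT INTO students VALUES (1, 'Asha'), (2, 'Ravi');
        INSERT INTO courses VALUES (10, 'Databases'), (11, 'Networks');
        INSERT INTO results VALUES (1, 10, 95), (1, 11, 88), (2, 10, 91);
        """
    )
    return conn


def top_scorers(conn, threshold):
    """Answer 'which students scored above N?' by joining three tables."""
    query = """
        SELECT s.name, c.title, r.score
        FROM results AS r
        JOIN students AS s ON s.id = r.student_id
        JOIN courses AS c ON c.id = r.course_id
        WHERE r.score > ?
        ORDER BY r.score DESC
    """
    return conn.execute(query, (threshold,)).fetchall()


if __name__ == "__main__":
    for row in top_scorers(build_demo_db(), 90):
        print(row)
```

The results table holds only IDs. The JOINs are what turn those IDs back into student names and course titles.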

Features That Stood Out to Me

While learning about RDS, a few features really caught my attention.

Automatic Backups

Instead of remembering to create backups manually, RDS can do it automatically.

If something goes wrong, restoring data becomes much easier.

Multi-AZ Deployment

Imagine your database server suddenly fails.

Normally, your application would stop working.

With Multi-AZ deployment, AWS keeps another synchronized copy in a different Availability Zone.

If one fails, another can take over.

That level of reliability makes a lot of sense for applications like banking or healthcare.

Read Replicas

Sometimes an application receives far more reads than writes.

Instead of putting all the load on one database, RDS allows Read Replicas.

Users can continue reading data without affecting write performance.

When Would I Choose RDS?

Personally, if I were building:

College ERP
Banking System
Hospital Management
Library Management
Employee Portal
E-commerce Order Management

I'd naturally think about Amazon RDS first.

Because all these applications depend heavily on relationships between data.

Then I Learned About DynamoDB...

Initially, I wondered:

"If RDS already works so well...

...why does DynamoDB even exist?"

The answer surprised me.

Because not every application needs relationships.

Some applications need speed.

Some need massive scale.

Some need both.

Amazon DynamoDB - Built for Speed

Think about Instagram.

Or LinkedIn.

Or X.

Millions of people are:

liking posts
commenting
reacting
refreshing feeds

every single second.

Running complex relational queries for every one of these interactions would quickly become slow and expensive.

Instead, applications often need extremely fast access to individual pieces of data.

That's where DynamoDB comes in.

What is DynamoDB?

Amazon DynamoDB is a fully managed NoSQL database.

Unlike relational databases,

it stores data as items inside tables instead of rows connected through relationships.

Every item can even have slightly different attributes.

That flexibility is one of its biggest strengths.
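A toy model can show what "items with different attributes" means. The sketch below is plain Python, not the DynamoDB API: the ToyTable class and the attribute names are hypothetical. The only rule it enforces is one DynamoDB also has, that every item must carry the table's key attribute.

```python
class ToyTable:
    """In-memory stand-in for a key-value table (not the real DynamoDB API)."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self._items = {}

    def put_item(self, item):
        # The key attribute is mandatory; every other attribute is optional.
        if self.partition_key not in item:
            raise ValueError(f"item is missing key attribute {self.partition_key!r}")
        self._items[item[self.partition_key]] = dict(item)

    def get_item(self, key_value):
        # Direct lookup by key: no joins, no scan over other items.
        return self._items.get(key_value)


posts = ToyTable(partition_key="post_id")
# Two items in the same table with different sets of attributes.
posts.put_item({"post_id": "p1", "author": "shruti", "likes": 12})
posts.put_item({"post_id": "p2", "author": "ravi", "video_url": "https://example.com/v.mp4"})
```

Both items live in the same table even though only one has likes and only the other has video_url.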

The Feature That Impressed Me Most
Automatic Scaling

This was probably my favourite part.

Imagine your application suddenly goes viral.

Yesterday you had:

500 users.

Today you have:

500,000.

With DynamoDB, AWS can scale capacity automatically (in on-demand mode, or with auto scaling enabled on provisioned capacity) to handle increased traffic, without you manually provisioning additional servers.

As a beginner, I found that incredibly fascinating.

Global Tables

Suppose users are accessing your application from:

India
USA
Germany
Australia

Global Tables allow data to stay synchronized across multiple AWS Regions.

That means users can access nearby copies of data with lower latency.

DynamoDB Streams

Another interesting feature.

Whenever data changes,

Streams can capture those changes.

Other services like AWS Lambda can react automatically.

This makes event-driven applications much easier to build.
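As a rough sketch of that flow, here is a Lambda-style handler that reads a batch of stream records and summarizes what changed. The event shape (Records, eventName, dynamodb.Keys with typed values such as {"S": "p1"}) follows the format Lambda receives from DynamoDB Streams. The post_id key and the sample event are assumptions for illustration.

```python
from collections import Counter


def lambda_handler(event, context):
    """Count stream records by change type and collect the keys that changed."""
    counts = Counter()
    changed_keys = []
    for record in event.get("Records", []):
        counts[record["eventName"]] += 1  # INSERT, MODIFY or REMOVE
        keys = record["dynamodb"]["Keys"]
        # Stream records use DynamoDB's typed JSON, e.g. {"S": "p1"} for a string.
        changed_keys.append(
            {name: next(iter(typed.values())) for name, typed in keys.items()}
        )
    return {"counts": dict(counts), "changed_keys": changed_keys}


# Hypothetical test event with one inserted and one removed item.
sample_event = {
    "Records": [
        {"eventName": "INSERT", "dynamodb": {"Keys": {"post_id": {"S": "p1"}}}},
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"post_id": {"S": "p2"}}}},
    ]
}
summary = lambda_handler(sample_event, None)
```

A real handler would do something useful per record, such as sending a notification or updating a counter. The structure stays the same.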

RDS vs DynamoDB
Amazon RDS                      | Amazon DynamoDB
Relational database             | NoSQL database
SQL queries                     | Key-value & document model
Fixed schema                    | Flexible schema
Great for relationships         | Great for high-speed lookups
Supports JOINs                  | No JOINs
Best for transactional systems  | Best for massive scale applications
Manual query optimization       | Automatic scaling
Strong ACID transactions        | Supports transactions but optimized for scalability

Which One Is Faster?

This question comes up a lot.

But I don't think it's the right question.

DynamoDB is incredibly fast because it's designed differently.

RDS is powerful because it solves different kinds of problems.

Comparing them directly is like asking:

Which is better?

A bicycle or a truck?

The answer depends entirely on what you're trying to transport.

My Biggest Takeaway

When I first started learning AWS,

I tried to remember services.

Now I try to remember problems.

If my application depends on complex relationships,

I immediately think about Amazon RDS.

If my application needs to handle enormous traffic with predictable access patterns,

I think about DynamoDB.

Neither service replaces the other.

They simply solve different challenges.

And I think that's one of the biggest lessons AWS has taught me so far.

Technology becomes much easier to understand when you stop asking,

"What does this service do?"

and start asking,

"Why was this service built?"

Once I started looking at databases this way, choosing between Amazon RDS and DynamoDB stopped feeling confusing and started feeling like a design decision based on the problem I wanted to solve.

I'd love to know how you approach this decision. If you're learning AWS too, what was the concept that finally made the difference between RDS and DynamoDB click for you?

I Didn't Expect Open Source to Feel Like This.

Shruti Gupta — Sat, 11 Jul 2026 12:42:07 +0000

If someone asked me earlier why I started contributing to open source, my answer would've been pretty straightforward.

Because it was one of the activities in the AWS Summer Builder Cohort 2026.

It sounded like a good opportunity to learn something new while earning points along the way.

I had no idea that GitHub would slowly become one of the tabs I check the most every day.

My first thought wasn't "Let's contribute."

It was...

"Where do I even begin?"

Thousands of repositories.

Hundreds of open issues.

Labels I had never seen before.

good first issue

help wanted

documentation

enhancement

Everything looked interesting.

Everything also looked intimidating.

Then came Git...

I still remember opening my terminal and wondering if I had typed the right command.

Even today, after multiple contributions, I still pause before running some Git commands.

"Wait... should I pull first?"

"Am I on the correct branch?"

"Did I commit everything?"

"Why is Git telling me something completely different from what I expected?"

The funny part?

The more I contribute, the less scared I become.

But Git still manages to surprise me every now and then...

Open source is unpredictable.

Sometimes you spend hours understanding an issue.

You finally figure out a solution.

You start working on it.

And then...

The issue gets closed.

Or someone else submits a PR first.

Or the maintainer decides to solve it another way.

At first, that felt disappointing.

Now I see it as part of the process.

Not every contribution ends with a merge, and that's okay.

Then there's the "Should I even work on this?" thought.

Sometimes an issue isn't assigned to anyone.

Sometimes maintainers don't assign issues at all.

Sometimes they do... but much later.

So you keep wondering:

"Can I start working on this?"

"Should I wait?"

"What if someone else is already working on it?"

It's a small thing, but as a beginner, these questions stay in your head more than you'd expect.

The kind of issues I naturally enjoy

I noticed something interesting about myself.

I usually get drawn towards issues that improve the experience for users.

Sometimes it's documentation.

Sometimes it's improving a UI component.

Sometimes it's adding a small feature that makes an application easier to use.

They're not always the biggest issues in the repository.

But I like working on things where I can clearly see the difference my contribution makes.

And then comes the best notification.

"Merged."

I don't think it'll ever become a normal notification for me.

Every single merged PR still makes me smile.

Not just because the code was accepted.

But because someone, somewhere, reviewed my work, suggested improvements, trusted it, and decided it was worth becoming part of their project.

Even better are those little comments from maintainers.

"Looks good!"

"Thanks for the contribution!"

"Great work."

They're just a few words.

But they honestly make my entire day.

Looking back...

It's funny.

I started this journey because of an AWS cohort activity.

But somewhere between cloning repositories, accidentally making Git mistakes, reading unfamiliar code, waiting for reviews, refreshing GitHub, and celebrating merged PRs...

It stopped feeling like a task.

It became something I genuinely enjoy.

I'm still a beginner.

I still Google Git commands.

I still get confused while reading large codebases.

I still spend more time understanding an issue than actually writing code.

I still make mistakes.

But every repository teaches me something different.

Every review makes me a little better.

Every PR makes the next one feel slightly less intimidating.

And I think that's my favorite part about open source.

You don't have to know everything before you start.

You just have to be willing to learn, one contribution at a time.

I Thought "Serverless" Meant There Were No Servers. Turns Out, That Wasn't the Point.

Shruti Gupta — Sat, 11 Jul 2026 12:30:51 +0000

Whenever I came across the word serverless, I always had the same question.

If there are no servers... where does the code actually run?

I knew the term, but I had never really understood what it meant.

Yesterday, I attended a session, Serverless Applications with AWS Lambda, by Aditya Dubey as part of the AWS Summer Builder Cohort 2026. I think this was one of those sessions where things started making sense instead of just adding another definition to my notes.

The first thing that surprised me...

Before this session, somewhere in my mind I thought that building an application always meant worrying about servers.

Where will I deploy it?

How will it scale?

What if traffic suddenly increases?

But Lambda changes that conversation.

Instead of thinking about the infrastructure first, it lets you focus on what your code is supposed to do.

That shift in perspective was probably my biggest takeaway.

Watching it happen made a difference

I've watched tutorials before, but seeing a Lambda function being created, deployed, and tested live inside the AWS Console made everything much easier to connect.

Creating a function...

Deploying it...

Testing it...

It looked surprisingly simple.

What fascinated me wasn't the number of clicks.

It was realizing how much work AWS was handling behind the scenes without us even noticing.

One concept that really clicked

The explanation of event-driven execution was something I'll probably remember.

The function doesn't keep running all the time.

It simply waits until something triggers it.

A request comes in.

The function executes.

It finishes its job.

And that's it.

For some reason, I found that idea really interesting.

Maybe because it's such a different way of thinking compared to applications that are always running.
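In code, the event-driven idea is small: a Python Lambda function is just a handler that receives an event and returns a result. The sketch below is a minimal example, with a made-up event field (name) and a local call standing in for a real trigger.

```python
import json


def lambda_handler(event, context):
    """Runs once per triggering event, returns a response, and is done."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Locally, a trigger can be simulated by calling the handler with a test event.
response = lambda_handler({"name": "Shruti"}, None)
```

Nothing in this file keeps running between calls. The handler only does work when something invokes it.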

Lambda + DynamoDB finally made sense together

One thing I appreciated about the session was that it wasn't just about Lambda.

When Lambda was connected with DynamoDB, I stopped seeing them as two separate AWS services.

Instead, I started seeing a workflow.

A request comes in.

Lambda processes it.

DynamoDB stores or retrieves the data.

The response goes back.

It sounds simple, but that's exactly what helped me understand how different AWS services work together instead of independently.
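The four steps above can be sketched as one handler. This is an illustration under assumptions, not the code from the session: the note_id and text fields and the save/load actions are invented, and FakeTable is a local stand-in that mimics only the put_item and get_item calls of a boto3 DynamoDB Table, so the example runs without an AWS account.

```python
import json


class FakeTable:
    """Local stand-in exposing the two boto3 Table methods the handler uses."""

    def __init__(self):
        self._items = {}

    def put_item(self, Item):
        self._items[Item["note_id"]] = Item
        return {}

    def get_item(self, Key):
        item = self._items.get(Key["note_id"])
        return {"Item": item} if item is not None else {}


def handle_request(event, table):
    """A request comes in, the handler processes it, the table stores or retrieves data."""
    if event["action"] == "save":
        table.put_item(Item={"note_id": event["note_id"], "text": event["text"]})
        return {"statusCode": 201, "body": json.dumps({"saved": event["note_id"]})}
    if event["action"] == "load":
        found = table.get_item(Key={"note_id": event["note_id"]}).get("Item")
        if found is None:
            return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
        return {"statusCode": 200, "body": json.dumps(found)}
    return {"statusCode": 400, "body": json.dumps({"error": "unknown action"})}


table = FakeTable()
handle_request({"action": "save", "note_id": "n1", "text": "hello"}, table)
loaded = handle_request({"action": "load", "note_id": "n1"}, table)
```

In a deployed function, the table would come from boto3.resource("dynamodb").Table(...) with a real table name instead of FakeTable.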

Something I realized

I've noticed a pattern in the AWS sessions I've attended so far.

Every session starts with a technology.

But I somehow end up learning a different way of thinking.

Earlier, if someone mentioned AWS Lambda, I'd probably remember it as "AWS's serverless compute service."

Now I'll probably remember it as "the service that made me stop worrying about servers and start thinking about the actual problem my code is solving."

I think that's a much better way to remember it.

Looking ahead

I know this is just the beginning, and I still have so much to explore.

Questions like:

How does Lambda handle thousands of requests at the same time?
When should we use Lambda instead of EC2?
What are its limitations?

...are things I'm curious to learn next.

A big thank you to Aditya Dubey for such an engaging and beginner-friendly session, and to the AWS Summer Builder Cohort 2026 for creating learning experiences that make complex concepts feel much less intimidating.

I'm excited to keep learning, one session at a time.

The Click Behind Every Click - My Biggest Takeaway from a Full Stack Session

Shruti Gupta — Sat, 04 Jul 2026 06:48:19 +0000

Have you ever used an app and wondered what actually happens after you click a button?

Honestly, I hadn't.

I knew terms like frontend, backend, database, and API, but they always felt like separate pieces of a puzzle.

After attending the Full Stack Development session as part of the AWS Summer Builder Cohort 2026, conducted by Sumit Grover and Vridhi Duggal, I finally started seeing the complete picture.

One thing I really liked was that the session wasn't about introducing fancy technologies. Instead, it answered a much more interesting question:

"What actually happens behind the scenes?"

A single user request isn't just a click on a webpage.

It travels through the frontend, reaches the backend, interacts with databases, sometimes checks the cache first, processes the required logic, and finally returns the response we see on our screen.

I've used hundreds of applications, but this was probably the first time I paused and thought about everything happening in those few milliseconds.

Another concept that genuinely clicked for me was caching.

Earlier, I simply knew that caching makes applications faster.

The explanation during the session completely changed that understanding.

If thousands of users are requesting the same data repeatedly, why keep asking the database every single time?

Store frequently accessed data in memory, reduce unnecessary database calls, lower latency, and let the database handle requests that actually require it.

Such a simple idea.

Such a huge impact.
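One common way to implement this idea is the cache-aside pattern: check memory first, and only go to the database on a miss. The sketch below is a simplified in-process version with a time-to-live. The class name, the 60-second TTL and the fake database lookup are assumptions for illustration. A production system would more likely use a shared cache such as Redis.

```python
import time


class CacheAside:
    """Serve repeated reads from memory; fall back to the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database call
        value = self._load(key)  # miss or expired: ask the database once
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []


def slow_db_lookup(key):
    """Stand-in for a real database query; records every call it receives."""
    db_calls.append(key)
    return f"profile-for-{key}"


cache = CacheAside(slow_db_lookup, ttl_seconds=60.0)
for _ in range(1000):
    cache.get("user-42")
```

A thousand reads of the same key reach the database only once. The other 999 are answered from memory.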

Another interesting discussion was around scalability.

Building an application that works for 50 users is one thing.

Building one that continues to work smoothly for thousands or even millions of users is a completely different challenge. It made me realize that writing code is only one part of software development - designing systems that can grow is equally important.

Apart from the technical content, I really appreciated the way Sumit Grover and Vridhi Duggal conducted the session.

The coordination between them was seamless. It never felt like two speakers taking turns. Every topic naturally connected to the next, making the entire session feel more like an engaging conversation than a presentation.

I joined the session expecting to learn about full stack development.

I left with something much more valuable - a better understanding of how all the pieces come together to build the applications we use every day.

A big thank you to Sumit Grover, Vridhi Duggal, and the entire AWS Summer Builder Cohort 2026 team for such an insightful session.

Already looking forward to the next one! 🚀

#AWSSummerBuilderCohort2026 #AWS #FullStackDevelopment #LearningInPublic #Backend #WebDevelopment #CloudComputing #Students #Tech

I Thought I Was Learning AWS Services. I Was Actually Learning to Solve Problems.

Shruti Gupta — Fri, 03 Jul 2026 20:55:55 +0000

When I joined this AWS challenge, I expected one thing - a lot of new service names.

Amazon S3. Amazon RDS. Amazon DynamoDB. Amazon SNS. Amazon Bedrock.

Like many beginners, I thought the goal would be simple: learn what each service does, remember its definition, complete the challenge, and move on.

What I didn't realize was that each week's challenge was quietly building on the previous one.

Looking back, I don't think these three weeks were just about learning AWS. They were about changing the way I approach technical problems.

Week 1 - Stop Memorizing. Start Asking "Why?"

The first week's task was to understand five AWS services, explain them in our own words, think of real-life use cases, and share one feature we found interesting.

At first, I approached it the way I usually prepare for a new topic - read, understand, and remember.

But I noticed something.

The more I tried to remember definitions, the more everything started sounding similar. Then I changed my approach and asked myself one simple question:

"Why would someone even build this service?"

That single question made everything much easier to understand.

Instead of seeing Amazon S3 as just another storage service, I imagined the millions of photos and videos uploaded every second on platforms like Instagram or YouTube.

Instead of treating Amazon SNS as a notification service, I pictured a hackathon platform sending registration confirmations, mentor updates, deadline reminders, and final results to thousands of students.

Instead of memorizing what Amazon Bedrock does, I imagined building an AI assistant that could help students discover hackathons based on their interests.

Those examples were simply my way of connecting technical concepts with situations I could actually relate to.

That was my first mindset shift.

Technology becomes much easier to understand when you stop memorizing features and start thinking about the problems they were created to solve.

Week 2 - Knowing the Tool Isn't Enough

The second week's challenge felt completely different.

This time, we weren't asked to explain services.

Instead, we were given different scenarios and had to choose the most suitable AWS service, along with the reasoning behind our choice.

Initially, I expected every question to have one obvious answer.

It didn't.

I found myself comparing services, thinking about trade-offs, and asking questions like:

Why is this service a better fit than another?
What exactly is the requirement here?
Am I solving the right problem?

For the first time, I realized that learning technology isn't just about knowing what different tools can do.

It's about understanding why one solution makes more sense than another.

Week 3 - Understand the Problem Before the Solution

The third week's challenge was probably my favorite.

Instead of directly choosing an AWS service, we were given bug scenarios.

Our task was to identify what had actually gone wrong, decide which AWS service could help, and explain what we would tell the development team to fix the issue.

This completely changed the order in which I started thinking.

Earlier, my first thought used to be:

"Which AWS service fits here?"

After this challenge, my first thought became:

"What is the actual problem?"

It sounds like a small difference.

But I think it's one of the most important lessons I've learned.

Because if you misunderstand the problem, even the best technology won't give you the right solution.

Looking Back

Three weeks ago, I would've looked at a problem and wondered, "Which AWS service should I use?"

Today, I'd probably ask, "What exactly is the problem I'm trying to solve?"

Instead of memorizing services, I try to understand why they exist.

Instead of searching for the "correct" solution immediately, I spend more time understanding the requirements.

Instead of jumping to conclusions, I try to identify the root cause first.

Looking back, I don't think these three weeks were really about AWS.

They were about learning a different way to think.

I know I've only scratched the surface of cloud computing, and there's still a lot left to explore.

But these challenges gave me something I'll carry beyond AWS.

Whether I'm working on an open-source issue, building a college project, participating in a hackathon, or learning a completely new technology, I think I'll always start with the same question:

"What problem am I actually trying to solve?"

Earlier, I used to ask,

"What does this technology do?"

Now I find myself asking,

"Why was this technology built in the first place?"

Surprisingly, that one question has made learning feel much less overwhelming and a lot more meaningful.

And for me, that's been the biggest takeaway from these three weeks.

Beyond AWS Service Names: Understanding the Problems They Actually Solve

Shruti Gupta — Fri, 26 Jun 2026 17:15:50 +0000

My Learning Journey with AWS

When I first saw AWS, I honestly felt overwhelmed.

There were so many services (S3, RDS, DynamoDB, SNS, Bedrock), and they all sounded important, but I couldn't understand why AWS needed so many different services. At one point, I even thought, "Can't one database or one storage service do everything?"

While exploring these services for an AWS challenge, I stopped trying to memorize definitions and instead asked myself one simple question:

"What real-world problem is this service trying to solve?"

That completely changed how I understood AWS.

Here are the five services that helped me look at cloud computing differently.

Amazon Bedrock: AI Without Building an AI Model Yourself
When I first heard about Amazon Bedrock, I thought it was used to create and train AI models from scratch.

After learning more, I realized I had misunderstood it.

Bedrock is more about using existing foundation models and customizing them with your own data to build AI applications. AWS takes care of the complex infrastructure, while developers focus on solving actual problems.

The first idea that came to my mind was a Hackathon Discovery Platform.

Students usually spend hours searching different websites to find suitable hackathons. Instead, imagine asking an AI assistant:

"Find beginner-friendly AI hackathons."
"Which hackathons allow solo participation?"
"Show hackathons with prize money above ₹50,000."

The assistant could understand information like eligibility, deadlines, FAQs, technologies used, and previous editions because it is connected to a knowledge base.

That was the moment I understood what Bedrock is actually meant for: not creating AI models, but creating useful AI applications.

Amazon S3: Not Just Storage, but Storage That Never Becomes a Problem
Before learning about Amazon S3, cloud storage sounded similar to Google Drive.

Then I realized companies deal with millions and sometimes billions of files.

Photos.

Videos.

Documents.

Backups.

Datasets.

Managing all of that isn't as simple as saving files on a computer.

The first example I thought of was social media platforms like Instagram, YouTube, and Snapchat. Every second, users upload content, and all those media files need to be stored somewhere reliable.

Another idea I had was a college event archive where photographs, certificates, recordings, and event documents could be safely stored and accessed even years later.

S3 made me realize that storage isn't only about saving files. It's about making sure those files remain secure, durable, and available whenever they're needed.

Amazon RDS: Let AWS Handle the Database Work
I had worked with databases before, but I never really thought about what happens behind the scenes.

I assumed databases just "store data."

Later I learned someone has to maintain them too.

Someone has to manage backups.

Someone has to install updates.

Someone has to recover data if something goes wrong.

That's where Amazon RDS made sense to me.

The example that immediately clicked was a University Examination Management System.

Student records, attendance, marks, course details, and examination results all have relationships with each other, making a relational database the right choice.

I also liked the fact that RDS can automatically create backups and even maintain a standby database in another Availability Zone. It made me realize why organizations trust managed databases instead of maintaining everything themselves.

Since educational data is sensitive, features like encryption and access controls also help reduce the risk of unauthorized access.

DynamoDB: When Millions of People Use Your App Together
Understanding DynamoDB also helped me understand that not every database is built for the same purpose.

Initially, I wondered why AWS had both RDS and DynamoDB.

Then I thought about LinkedIn.

People continuously like posts.

Comment.

React.

Send connection requests.

View profiles.

All these activities happen at an enormous scale.

A traditional relational database isn't always the best fit for this kind of workload.

That's where DynamoDB comes in.

It automatically scales while still responding incredibly fast, even if millions of users are active at the same time.

One feature I found particularly interesting was Global Tables, where data stays synchronized across multiple AWS Regions, making applications more reliable for users around the world.

Amazon SNS: One Update, Many Notifications
SNS became one of the easiest services for me to visualize because we all receive notifications every day.

What I liked most was the Publish-Subscribe idea.

Instead of sending the same update separately to different places, an application publishes one message and SNS distributes it wherever it's needed.

The example I came up with was a Hackathon Management Platform.

Whenever students register, form teams, receive mentor session details, meeting schedules, deadlines, or final results, SNS could send notifications through email, SMS, mobile notifications, and the platform itself at the same time.
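The publish-subscribe idea can be shown with a toy in-memory broker. This is not the SNS API: ToyTopicBroker, the topic name and the two "channels" are hypothetical, and real SNS delivers to endpoints such as email addresses, phone numbers, queues or HTTP endpoints rather than Python functions.

```python
from collections import defaultdict


class ToyTopicBroker:
    """Minimal publish-subscribe: one publish fans out to every subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, deliver):
        self._subscribers[topic].append(deliver)

    def publish(self, topic, message):
        # The publisher does not know or care who is listening.
        for deliver in self._subscribers[topic]:
            deliver(message)
        return len(self._subscribers[topic])


email_outbox, sms_outbox = [], []
broker = ToyTopicBroker()
broker.subscribe("hackathon-updates", lambda m: email_outbox.append(f"EMAIL: {m}"))
broker.subscribe("hackathon-updates", lambda m: sms_outbox.append(f"SMS: {m}"))

# One publish call reaches both channels.
delivered = broker.publish("hackathon-updates", "Submission deadline is Friday")
```

Adding a third channel later means adding one more subscribe call. The publishing code does not change.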

The same idea could also be used inside organizations to send important meeting announcements across different communication channels.

Another interesting thing I learned was that SNS supports both:

Application-to-Person (A2P) communication, like emails and SMS.
Application-to-Application (A2A) communication, where different software systems notify each other automatically.

That made me realize SNS is much more than a notification service.

What Changed for Me?
Before this challenge, I used to think AWS was just a collection of complicated cloud services.

Now I see each service as a solution to a specific problem.

If I need to store huge amounts of files, I think about Amazon S3.
If I need structured data with relationships, Amazon RDS makes sense.
If I need a database for millions of fast user interactions, DynamoDB is a better choice.
If I need to notify people or systems whenever something happens, I think of Amazon SNS.
If I want to build an AI-powered application without worrying about training large models, Amazon Bedrock is the service I would explore.

The biggest lesson I learned wasn't remembering AWS service names.

It was understanding why each service exists.

Once I started thinking in terms of "What problem am I trying to solve?", AWS became much less intimidating and much more practical.

This is only the beginning of my cloud journey, but now when I hear the name of an AWS service, I don't just remember its definition; I remember the real-world problem it can solve.

I know I've only scratched the surface of AWS, but this challenge changed the way I learn technology. Instead of memorizing concepts, I now try to connect every new service with a real-world problem. That approach made learning much more meaningful for me.

Your On-Call Agent Forgot Everything. Ours Doesn't.

Shruti Gupta — Sun, 14 Jun 2026 10:38:35 +0000

The first time I used something that actually remembered a past production failure, I didn't fully trust it. I submitted the same incident twice just to make sure the result wasn't a coincidence.

It wasn't.

I was building On-Call Copilot — an incident response agent that doesn't just generate advice, it recalls what actually happened the last time something similar broke. The live app is at on-call-copilot.vercel.app. The memory layer is Hindsight. And the thing that surprised me most wasn't how hard it was to integrate — it was how immediately obvious the difference was once it was working.

What the system actually does

On-Call Copilot is an AI Incident Commander with organizational memory. The tagline on the app is "Learn from every outage. Resolve the next one faster." That's not marketing — it's literally the architecture.

When a production alert comes in — a Sentry traceback, a Datadog trigger, raw CLI logs — you paste it into the Incident Ingestion Console. The system runs it through a five-stage pipeline:

Production Alert → FastAPI Router → Hindsight Memory → Groq Reasoning → SRE Playbook

Stage three is the one that matters. Before Groq generates anything, Hindsight's semantic graph runs a recall against the full organizational incident history. It doesn't do keyword search. It finds semantically related past incidents — things that failed for the same underlying reason, even if the error messages look different on the surface.

What comes back isn't just "here's a similar incident." It's structured: historical root cause, successful fix, and critically — failed attempts to avoid. Things someone already tried that made it worse. That last part is what makes this different from any generic LLM response.

The FastAPI layer: how the triage request flows

Every incident starts at a single POST endpoint. The frontend sends the raw alert text; FastAPI handles the orchestration — first pulling memory context from Hindsight, then passing that context alongside the alert into the Groq reasoning chain.

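# backend/api.py
# Excerpt: IncidentRequest, analyze_incident and store_incident are defined elsewhere in the backend.
from fastapi import FastAPI

app = FastAPI()

@app.post("/analyze")
def analyze(data: IncidentRequest):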
    return {
        "analysis": analyze_incident(data.incident)
    }

@app.post("/teach")
def teach(data: IncidentRequest):
    store_incident(data.incident)

    return {
        "status": "saved"
    }

@app.get("/")
def home():
    return {
        "message": "On-Call Copilot API Running 🚀"
    }

The ordering, recall first and reasoning second, is the key design decision. The endpoint above only delegates to analyze_incident; inside it, the recall happens before the LLM sees anything. By the time Groq is reasoning about root cause, it already has the organizational context baked in, not as a separate lookup but as part of the prompt.
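As a rough sketch of that ordering: the article does not show analyze_incident itself, so the function body, the prompt wording, and the injected `recall` and `llm` callables below are all assumptions made for illustration, not the project's actual code.

```python
from typing import Callable


def analyze_incident(
    incident: str,
    recall: Callable[[str], list[str]],  # hypothetical: returns recalled memory snippets
    llm: Callable[[str], str],           # hypothetical: sends a prompt, returns the reply
) -> str:
    # Step 1: recall runs first, before the model sees anything.
    memories = recall(incident)

    # Step 2: the recalled context becomes part of the prompt itself.
    if memories:
        context = "\n".join(f"- {m}" for m in memories)
    else:
        context = "(no similar past incidents found)"
    prompt = (
        "Past incidents from this organization:\n"
        f"{context}\n\n"
        f"New alert:\n{incident}\n\n"
        "Identify the likely root cause and a fix. "
        "Do not suggest anything listed as a failed attempt."
    )

    # Step 3: only now does the model reason about the alert.
    return llm(prompt)
```

Passing the two callables in as arguments keeps the sketch testable without a Hindsight or Groq account; the real backend presumably wires in its own clients.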

What Hindsight's recall actually returns

I had never used Hindsight before this project. My mental model going in was that it would behave like search — give it keywords, get back matching documents.

What it actually does is closer to semantic reasoning over a knowledge graph. When I submitted "FATAL: database pool choked during active transaction," it recalled two past incidents:

INC-103 — Database connection pool exhaustion under high transactional traffic. 91% match. Successful fix: increment proxy pool limits to 50, implement transaction timeout safeguards. Failed attempt: scaling pool replicas dynamically (triggered DB lock storms).
INC-104 — Redis cluster memory allocation overrun. 87% match. Successful fix: configure maxmemory-policy to volatile-lru. Failed attempt: cold restarts of Redis service (nuked all active sessions).

The match percentages are real confidence scores from Hindsight's agent memory system. The "failed attempt" field is the part that earns its keep at 3 AM — it tells you what not to reach for before you waste 40 minutes on it.
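To make the "what not to reach for" idea concrete, here is a minimal sketch that renders recalled incidents with the failed attempts listed first. The `RecalledIncident` shape, the `triage_notes` helper, and the 0.8 cutoff are hypothetical; Hindsight's actual recall payload may look different. The sample values are the two incidents quoted above.

```python
from dataclasses import dataclass


@dataclass
class RecalledIncident:
    # Hypothetical shape; the real recall payload may differ.
    incident_id: str
    summary: str
    match: float  # 0.0 to 1.0
    successful_fix: str
    failed_attempt: str


def triage_notes(incidents: list[RecalledIncident], min_match: float = 0.8) -> str:
    """Render recalled incidents as text, failed attempts first."""
    relevant = sorted(
        (i for i in incidents if i.match >= min_match),
        key=lambda i: i.match,
        reverse=True,
    )
    lines = ["DO NOT TRY:"]
    lines += [f"  {i.incident_id}: {i.failed_attempt}" for i in relevant]
    lines.append("WORKED BEFORE:")
    lines += [f"  {i.incident_id} ({i.match:.0%}): {i.successful_fix}" for i in relevant]
    return "\n".join(lines)


recalled = [
    RecalledIncident(
        "INC-104", "Redis cluster memory allocation overrun", 0.87,
        "configure maxmemory-policy to volatile-lru",
        "cold restarts of Redis service",
    ),
    RecalledIncident(
        "INC-103", "Database connection pool exhaustion", 0.91,
        "raise proxy pool limits to 50, add transaction timeouts",
        "scaling pool replicas dynamically",
    ),
]
print(triage_notes(recalled))
```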

The two memory operations: retain and recall

The entire Hindsight integration in backend/memory.py is built on two calls. Here are both of them side by side:

# backend/memory.py
import os
from contextlib import contextmanager
from hindsight import HindsightClient

BANK_ID = os.getenv("BANK_ID")

@contextmanager
def _get_hindsight_client():
    client = HindsightClient(api_key=os.getenv("HINDSIGHT_API_KEY"))
    try:
        yield client
    finally:
        client.close()

def recall_similar_incidents(incident_description: str) -> list[dict]:
    with _get_hindsight_client() as client:
        results = client.recall(
            bank_id=BANK_ID,
            query=incident_description,
            top_k=5,
        )
    return results

def save_resolution(incident_description: str, resolution_summary: str):
    content = (
        f"INCIDENT: {incident_description}\n"
        f"RESOLUTION: {resolution_summary}"
    )
    with _get_hindsight_client() as client:
        client.retain(
            bank_id=BANK_ID,
            content=content,
            context="incident_postmortem",
        )

Two functions. One call each. The entire organizational memory layer is those ~30 lines. What the Hindsight retain/recall API does behind the scenes — semantic indexing, graph traversal, confidence scoring — you get all of that for free.

The pipeline in practice

The Incident Resolution Timeline in the UI makes the pipeline visible in real time:

Alert Received — raw metrics or trace ingested into buffer
Memory Retrieved — FastAPI semantic correlation against regional index maps
Root Cause Identified — LLM isolates anomalies, computes match confidence
Resolution Suggested — detailed playbook with avoidance warnings generated
Knowledge Stored — post-mortem answers indexed back into organizational memory

That fifth step is the learning loop. Every resolved incident feeds back into Hindsight via retain(). The next similar incident pulls it as recalled context. The system gets more specific over time — not because the model changed, but because the memory bank grew.
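The loop can be illustrated with a toy in-memory bank. This is a stand-in, not Hindsight: `ToyMemoryBank` is a hypothetical class that scores matches by plain word overlap (Jaccard similarity) instead of semantic recall. It only shows how the same alert returns nothing before a postmortem is retained and returns that postmortem afterwards.

```python
import re


class ToyMemoryBank:
    """In-memory stand-in for a memory bank. Word overlap, NOT semantic recall."""

    def __init__(self) -> None:
        self._items: list[str] = []

    def retain(self, content: str) -> None:
        self._items.append(content)

    def recall(self, query: str, top_k: int = 5) -> list[tuple[float, str]]:
        q = self._tokens(query)
        scored = []
        for item in self._items:
            t = self._tokens(item)
            union = q | t
            # Jaccard similarity: shared words divided by all distinct words.
            score = len(q & t) / len(union) if union else 0.0
            if score > 0:
                scored.append((score, item))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))


bank = ToyMemoryBank()
alert = "database connection pool exhausted"

before = bank.recall(alert)  # empty bank: nothing to recall
bank.retain(
    "INCIDENT: database connection pool exhausted under load\n"
    "RESOLUTION: raised pool limit, added transaction timeouts"
)
after = bank.recall(alert)   # the same alert now surfaces the stored postmortem
```

The model never changes between the two recall calls; only the contents of the bank do.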

Before vs after — what memory actually changes

Without organizational memory:

Generic advice pulled from training data — "check network adapters," "reinstall OS"
No awareness of what's already been tried in your specific environment
Suggestions that have failed twice before in your cluster show up again
Every incident starts from zero

With Hindsight memory (150 incidents in the knowledge base):

Precise matches pulled from actual past outages, not textbook examples
Failed fixes flagged explicitly so engineers don't repeat them
One-step indexing after resolution so the next incident benefits immediately
42% estimated reduction in mean time to resolution shown live on the dashboard

That 42% figure isn't a benchmark I'm claiming — it's what the dashboard shows based on the system's historical recall performance across the loaded incident knowledge base.

What the Teach the System panel does

At the bottom of the app is a section called "Teach the System — Training Mode." Paste a resolution summary or an incident link, submit it, and Hindsight indexes it immediately. The Telemetry Console logs the whole thing in real time — you can watch the retain call go out, see the success response, and know that the next engineer who hits a similar issue will get this resolution in their recalled context.

The log from a real session:

[1:20:26 pm] SUCCESS [OUTPOST] Triage finished successfully in 44.97s. Status 200 OK.
[1:20:27 pm] SUCCESS [OUTPOST] Taught system successfully in 39.57s. Status 200 OK.

Triage and teach. Those two operations are the entire product loop.

What using Hindsight taught me

I went into this thinking about memory as a storage problem. I came out thinking about it as a retrieval design problem. What you store matters less than whether the right things surface at the right time.

Hindsight's retain/recall API is small — two core operations cover almost everything. But the quality of what you get back at recall time depends entirely on how well-structured the retained content is. A postmortem that clearly separates root cause, successful fix, and failed attempts produces recall that's immediately actionable. A vague free-text summary produces noise.
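One way to enforce that separation is to build the retained text from named fields instead of free text. The sketch below is an assumption for illustration: the `Postmortem` class and its field labels are hypothetical, and the output is simply a string that could be passed as the content argument of a retain call.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    # Hypothetical structure; field names are illustrative only.
    incident: str
    root_cause: str
    successful_fix: str
    failed_attempts: list[str] = field(default_factory=list)

    def to_memory_text(self) -> str:
        """Render one labeled line per field so recall returns clearly separated parts."""
        failed = "; ".join(self.failed_attempts) if self.failed_attempts else "none recorded"
        return (
            f"INCIDENT: {self.incident}\n"
            f"ROOT CAUSE: {self.root_cause}\n"
            f"SUCCESSFUL FIX: {self.successful_fix}\n"
            f"FAILED ATTEMPTS: {failed}"
        )


pm = Postmortem(
    incident="FATAL: database pool choked during active transaction",
    root_cause="connection pool exhaustion under high transactional traffic",
    successful_fix="raise proxy pool limits to 50, add transaction timeouts",
    failed_attempts=["scaling pool replicas dynamically (triggered DB lock storms)"],
)
print(pm.to_memory_text())
```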

One thing I'd do differently is seed the knowledge base earlier. The system only becomes convincingly better than a generic LLM once there's enough incident history to surface precise matches. With 150 incidents loaded, the difference is stark. With 5, it's marginal. Data quality and quantity are part of the product.

What's next

The current "Teach the System" input is free text. The obvious next step is a structured form — separate fields for root cause, fix steps, failed attempts, and the customer message that generated the least confusion. Structured inputs produce more consistent memories, which produce more reliable recall.

The architecture also has room to expand beyond a single knowledge bank. Right now there's one organizational memory shared across all incidents. A multi-tenant version with per-team or per-service memory banks would let different engineering teams maintain separate incident histories while still being able to query across them when needed.
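A minimal sketch of what cross-bank querying could look like, under stated assumptions: the `incidents-<team>` naming scheme, the `RecallFn` signature, and the idea that scores from different banks can be compared directly are all hypothetical. Nothing here reflects how Hindsight actually handles multiple banks.

```python
from typing import Callable

# Hypothetical signature: (bank_id, query) -> list of (score, text) pairs.
RecallFn = Callable[[str, str], list[tuple[float, str]]]


def bank_for(team: str) -> str:
    # Hypothetical naming scheme for per-team banks.
    return f"incidents-{team}"


def recall_across_teams(
    recall: RecallFn,
    teams: list[str],
    query: str,
    top_k: int = 5,
) -> list[tuple[float, str, str]]:
    """Query each team's bank and merge results, best score first."""
    merged = []
    for team in teams:
        for score, text in recall(bank_for(team), query):
            merged.append((score, team, text))
    # Assumes scores are comparable across banks, which may not hold in practice.
    merged.sort(key=lambda row: row[0], reverse=True)
    return merged[:top_k]
```

Keeping the team name in each result row lets the UI show which team's history a match came from.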

The memory layer works. What I keep thinking about is how much better it gets with every incident that runs through it — and how most engineering teams are sitting on years of incident history that a system like this could immediately put to use.