Gunnar Grosch for AWS

DEV Track Spotlight: AI Agents for Databases: Discover, Recommend, Optimize (DEV315)

What if your databases could help identify and diagnose problems before they impact production? This is the question that Namrata H Shah (AWS Community Hero and Managing Director at Nuveen) and Rob Koch (AWS Data Hero and Senior Principal at Slalom) tackled in their DEV315 session at AWS re:Invent 2025.

The session demonstrated a fundamental shift in how we approach database operations. As Rob explained at the start: "When databases can discover, recommend, and optimize on their own, they stop being reactive systems and become intelligent partners." This vision moves teams away from the familiar pattern of rushing to fix production issues after they occur, toward a proactive approach where AI agents continuously monitor, analyze, and optimize database performance.

Watch the Full Session:

The Challenge: Breaking the Reactive Loop

Most database teams live in a reactive loop. Metrics spike, alerts fire, and engineers scramble to diagnose the root cause. Rob described this common scenario: "Things come up in production. Often they're caught off guard or blindsided in the moment. Maybe they have something happening during peak business hours where there's a large data load or a big release."

The problem is not the telemetry itself. Organizations invest heavily in monitoring tools and dashboards that capture every metric imaginable. The real issue is that these systems capture data about the data, but they do not explain why problems occur. As Rob noted: "The metrics of CPU, I/O, query latency, et cetera, they don't really explain why the problem is happening."

This leads to several cascading challenges:

Alert Fatigue - When every alert seems urgent but is not always actionable, even experienced engineers can lose sight of what is actually impacting the business. Teams start ignoring alerts, and real problems slip through.

Scaling Costs and Unpredictability - Auto-scaling promises help, but in practice, usage patterns and transactional workloads rarely follow predicted patterns. Teams over-provision resources out of fear, and costs blow up before improvements are seen.

Lack of Context - The feedback loop between performance decisions and financial impact leaves teams unable to align operations with business priorities. They end up being reactive, and their efforts do not align with business goals.

The Vision: Proactive Intelligent Partners

Nam articulated the vision clearly: "Our session today focuses on moving away from being reactive to proactive and basically converting these reactive systems into proactive intelligent partners that can discover, optimize, and recommend by themselves."

This is not about getting more alerts at three in the morning. It is about the system learning normal behavior patterns, understanding when Monday morning ramps up or month-end billing creates predictable spikes, and separating noise from signal. The AI agent points to exactly where the issue lives and provides actionable recommendations, whether that means changing an index, reallocating compute, or restructuring a query.

Rob emphasized the learning aspect: "The system is learning from what works and then keeps tweaking itself. And it gets better the longer it runs." The database becomes an active participant in its own performance optimization.

Why AI Agents for Database Operations

Rob explained why AI agents have become essential: modern data moves too fast for humans to keep track of on their own, which is why AI agents are becoming essential for database operations.

AI-powered monitoring shifts from simply collecting telemetry to learning from it. AI agents model normal behavior and adapt their baselines in real time across CPU, query performance, and latency. They catch subtle degradations before they become outages, the kind of issues that might previously have gone unnoticed.

Context is Key - When a metric spikes, AI systems can explain why it changed, what the impact is, and what to do next. The agent might send out a ticket, create a user story, fix indexing, tune configuration, or restructure a query. As Rob put it: "It's right there and it will show you what the problem is. Here's the action, here's what you need to do."

Amazon RDS: Query Performance and Resource Optimization

The session covered four AWS database services, starting with Amazon RDS. Nam identified three common challenges and how AI agents address them:

Slow Query Performance

The root causes are often inefficient queries, suboptimal query plans, missing indexes, or poor schema design. AI agents can review queries, analyze execution plans, examine indexes, recommend missing indexes, and even rewrite entire queries for better performance.

Nam explained: "They can, worst case scenario, you know, rewrite the whole query for you saying that, okay, you know what? Forget this query, I'm gonna rewrite the whole thing. This is far more performant and better compared to what you have written."
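
To make this concrete, here is a minimal sketch of the kind of check such an agent might run against a PostgreSQL-backed RDS instance: pull the execution plan for a query and flag a sequential scan as a candidate for a missing index. The connection details, table, and column names are placeholder assumptions, not code from the session.

```python
# Hypothetical sketch: inspect an execution plan and flag a possible missing index.
# Host, database, table, and column names are placeholders.
import psycopg2

conn = psycopg2.connect(host="my-rds-endpoint", dbname="appdb",
                        user="readonly", password="example")
query = "SELECT * FROM orders WHERE customer_id = %s"

with conn.cursor() as cur:
    cur.execute("EXPLAIN (FORMAT JSON) " + query, (42,))
    plan = cur.fetchone()[0][0]["Plan"]  # psycopg2 parses the JSON column for us

# A sequential scan on a filtered column is a common hint that an index is missing.
if plan["Node Type"] == "Seq Scan":
    print(f"Consider an index on {plan['Relation Name']}(customer_id)")
```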

Manual Instance Scaling

While RDS scales horizontally very well, changing instance types vertically requires manual intervention. AI agents analyze CPU utilization, memory utilization, I/O patterns, and workload patterns to help rightsize instances. Nam noted: "Imagine if you have one or two, it's fine, but if you have hundreds of instances which are highly over-provisioned or possibly even under-provisioned (in most cases, over-provisioned), then you have trouble."
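
As a rough illustration of the signals involved, the sketch below pulls a week of CPU utilization for one instance from CloudWatch, the kind of data an agent would aggregate before suggesting a smaller instance class. The instance identifier and the 40% threshold are assumptions for the example.

```python
# Hypothetical sketch: gather a week of CPU data that a rightsizing decision could rest on.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-db"}],  # placeholder
    StartTime=end - timedelta(days=7),
    EndTime=end,
    Period=3600,                      # hourly data points
    Statistics=["Average", "Maximum"],
)

peaks = [p["Maximum"] for p in stats["Datapoints"]]
if peaks and max(peaks) < 40:
    print("CPU never exceeded 40% this week; a smaller instance class may be enough.")
```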

Storage Optimization

Teams tend to over-provision storage out of fear. AI agents review metrics, I/O patterns, and storage utilization to rightsize storage allocation. The benefit is price performance optimization and a better performing RDS instance overall.

In the demo, Rob showed how the Kiro CLI with MCP (Model Context Protocol) servers bridges the gap between AI and technical control of CloudWatch, RDS, and other services. The agent automatically aggregates performance data, cost information, and utilization metrics in real time, then generates recommendations with clear reasoning behind each suggestion.

The demo showed an agent analyzing a week of data and recommending downsizing from a T3 medium to a T3 small instance based on actual usage patterns, explaining business hours versus off-hours traffic, and providing the exact CLI commands to implement the changes.
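
Applying such a recommendation is a single API call. The snippet below is a hedged boto3 equivalent of the kind of CLI commands shown in the demo; the instance identifier is a placeholder, and in practice you would keep a human approval step in front of it.

```python
# Hypothetical sketch: apply a downsizing recommendation (e.g. t3.medium -> t3.small).
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",   # placeholder
    DBInstanceClass="db.t3.small",
    ApplyImmediately=False,             # wait for the next maintenance window
)
```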

Amazon Redshift: Distribution and Query Optimization

For Amazon Redshift, the cloud-based data warehouse, Nam highlighted three interconnected challenges:

Skewed Queries

Uneven data or workload distribution across Redshift cluster nodes causes some nodes to be overworked while others sit idle. AI agents analyze query logs, examine load distribution across nodes, and optimize data redistribution or workload redistribution strategies. The result is an evenly distributed cluster with faster, more performant queries.

Inefficient Joins

When Redshift runs massive parallel queries with high numbers of joins, significant data shuffle occurs across nodes. AI agents analyze query plans, examine data distribution keys, and can even suggest schema denormalization to eliminate joins entirely. This reduces query time and network I/O.

Suboptimal Distribution Keys

This is the root cause challenge at the schema or architectural level. Poor distribution keys lead to both skewed queries and inefficient joins. AI agents review schemas, analyze distribution keys, and recommend better keys to prevent these downstream issues, making the overall Redshift cluster far more performant.

In the Redshift demo, Rob showed how the agent connects to the cluster, analyzes query patterns and node performance, and detects issues like row skew where 80% of data sits on a single node. The agent explains the problem, estimates performance improvements from redistribution, and provides concrete recommendations without trial and error testing.
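
For readers who want to see what that skew check looks like, the sketch below runs the kind of query an agent might issue through the Redshift Data API, ranking tables by row skew from the svv_table_info system view. The cluster, database, and secret names are assumptions, not values from the session.

```python
# Hypothetical sketch: rank tables by row skew, the signal behind a redistribution recommendation.
import boto3

client = boto3.client("redshift-data")
resp = client.execute_statement(
    ClusterIdentifier="analytics-cluster",                      # placeholder
    Database="dev",
    SecretArn="arn:aws:secretsmanager:region:acct:secret:name", # placeholder
    Sql='SELECT "table", diststyle, skew_rows '
        "FROM svv_table_info ORDER BY skew_rows DESC LIMIT 10;",
)
print("Statement id:", resp["Id"])  # poll describe_statement, then get_statement_result
```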

The agent also uses historical data to forecast future spikes, such as end-of-month billing or Black Friday traffic increases, and makes proactive recommendations for right-sizing or setting up auto-scaling before the spike occurs.

Amazon Aurora: Replication and Connection Management

For Amazon Aurora, the cloud-native database, Nam identified three critical challenges:

Replication Lag

Replication lag typically occurs when the server has too many writes compared to reads, combined with network latency or under-provisioned read replicas. AI agents monitor replication metrics, predict replication lag before it occurs, and recommend read replica tuning and failover strategies. Nam explained: "It can actually say that, okay, if your writes go beyond a certain amount, there is a high chance or a possibility that you can potentially face a replication lag."
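
A simple version of that early warning can be built on the AuroraReplicaLag CloudWatch metric; the sketch below flags a replica whose lag has stayed high for several minutes. The replica name and the 200 ms threshold are assumptions, not values from the session.

```python
# Hypothetical sketch: flag sustained replica lag before it becomes a stale-read incident.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

lag = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-replica-1"}],  # placeholder
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)

recent = sorted(lag["Datapoints"], key=lambda p: p["Timestamp"])[-5:]
if recent and all(p["Average"] > 200 for p in recent):   # milliseconds
    print("Replica lag has stayed above 200 ms; consider adding or upsizing read replicas.")
```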

Connection Storms

When the database runs out of connections due to too many unused connections or connections that were opened but never closed, genuine connection requests get rejected. AI agents identify abnormal connections, detect connections that have been open for extended periods without use, and suggest connection pooling or throttling strategies. The outcome is stabilized performance and reduced risk of server downtime.

Scaling Bottlenecks

When reads and writes are not in sync and write volume increases significantly while read replicas are underutilized, scaling bottlenecks occur. AI agents forecast workload trends and patterns, recommend read replica fine-tuning, and suggest auto-balancing strategies to keep reads and writes in sync, resulting in a well-balanced cluster.

Amazon DynamoDB: Partition and Capacity Management

For Amazon DynamoDB, the fully managed serverless NoSQL database, Nam covered three common challenges:

Hot Partitions

Uneven key distribution across partitions causes one partition to be overworked and overutilized. AI agents detect hot partitions, analyze workload patterns and data load distribution, and recommend better partitioning or sharding strategies. This results in evenly distributed data across partitions.
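
One common remediation for a hot partition is write sharding: spreading a popular partition key across a fixed number of suffixes so writes land on different partitions. The sketch below shows the idea with placeholder table and key names; it is not code from the session.

```python
# Hypothetical sketch of write sharding for a hot partition key.
# Assumes a table whose partition key attribute is "pk"; names are placeholders.
import random
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")
SHARDS = 10

def put_event(customer_id: str, event: dict) -> None:
    # A random shard suffix spreads one customer's writes across several partitions.
    shard = random.randint(0, SHARDS - 1)
    table.put_item(Item={"pk": f"{customer_id}#{shard}", **event})

def get_events(customer_id: str) -> list:
    # Reads fan out across all shard suffixes and merge the results.
    items = []
    for shard in range(SHARDS):
        resp = table.query(KeyConditionExpression=Key("pk").eq(f"{customer_id}#{shard}"))
        items.extend(resp["Items"])
    return items
```

The trade-off is that reads now have to fan out across every shard, so this pattern fits write-heavy keys better than read-heavy ones.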

Throttling

When capacity limits are exceeded due to disproportionate traffic, the database starts throttling requests. AI agents forecast when capacity limits are approaching thresholds and can dynamically adjust capacity limits. Nam noted: "It's not just making recommendations, it can also optimize it for you if you want it of course. You can always have a human in the loop over here as well." The result is fewer throttling incidents and more stable performance.
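
For provisioned-capacity tables, that kind of dynamic adjustment is typically wired up through Application Auto Scaling. The sketch below registers write capacity for target-tracking scaling; the table name and limits are assumptions for the example.

```python
# Hypothetical sketch: target-tracking auto scaling for a provisioned DynamoDB table.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/events",                            # placeholder
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)

autoscaling.put_scaling_policy(
    PolicyName="events-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/events",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,   # keep consumed writes around 70% of provisioned capacity
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```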

Cost Spikes

Unpredictable traffic causes auto-scaling to kick in, and suddenly budgets are blown. AI agents analyze workload patterns, suggest auto-scaling strategies to avoid cost spikes, and recommend caching strategies to redirect some traffic. This leads to optimized costs and consistent performance.

The Benefits: From Reactive to Proactive

Rob summarized the major benefits of AI-driven database monitoring:

Proactive Monitoring - Catch issues earlier and reduce outages by improving performance and offloading analytics to an automated approach.

DBA Productivity - A huge leap in productivity by reducing manual overhead.

Intelligent Cost Management - Continuous optimization of resources without excessive manual effort.

Rob emphasized the importance of predictability: moving from being reactive to being proactive, and making operations predictable.

Getting Started: Call to Action

Nam and Rob recommended starting small. Deploy one AI agent for a single service where you can see immediate value, measure the results, and then scale across your AWS database ecosystem. Integrating AI agents with your organization's runbooks and best practices creates a gradual journey toward intelligent, proactive recommendations that unlock higher efficiency and resiliency.

As Nam concluded: "While today's databases act as reactive systems and we react typically to resolve these situations, we foresee that in the future, these reactive systems will basically become our proactive intelligent partners."

The session demonstrated that this future is not theoretical. With tools like Kiro CLI, MCP servers, and AI agents, teams can start transforming their reactive database operations into proactive intelligent systems today.


About This Series

This post is part of DEV Track Spotlight, a series highlighting the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.

The DEV track featured 60 unique sessions delivered by 93 speakers from the AWS Community - including AWS Heroes, AWS Community Builders, and AWS User Group Leaders - alongside speakers from AWS and Amazon. These sessions covered cutting-edge topics including:

  • πŸ€– GenAI & Agentic AI - Multi-agent systems, Strands Agents SDK, Amazon Bedrock
  • πŸ› οΈ Developer Tools - Kiro, Kiro CLI, Amazon Q Developer, AI-driven development
  • πŸ”’ Security - AI agent security, container security, automated remediation
  • πŸ—οΈ Infrastructure - Serverless, containers, edge computing, observability
  • ⚑ Modernization - Legacy app transformation, CI/CD, feature flags
  • πŸ“Š Data - Amazon Aurora DSQL, real-time processing, vector databases

Each post in this series dives deep into one session, sharing key insights, practical takeaways, and links to the full recordings. Whether you attended re:Invent or are catching up remotely, these sessions represent the best of our developer community sharing real code, real demos, and real learnings.

Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!
