Kazuya

AWS re:Invent 2025 - What's new in Amazon Redshift and Amazon Athena (ANT206)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - What's new in Amazon Redshift and Amazon Athena (ANT206)

In this video, AWS product managers present updates to Amazon Redshift and Amazon Athena at re:Invent. Imran Moin covers Redshift's 2025 innovations across three pillars: cloud data warehouse fundamentals (including Multidimensional Data Layouts delivering 10x better price performance, enhanced materialized views, and FedRAMP authorization for Serverless), distributed warehouse architecture enabling workload isolation with data sharing, and Apache Iceberg support with read/write capabilities and 2x better price performance on data lake queries. Sean McKibben from Twilio shares how they reimagined their billing engine using Redshift Serverless and Zero ETL with history mode, achieving 75% cost reduction while processing 0.5 trillion events with serializable transaction isolation. Scott Rigney discusses Athena's performance improvements, including 1.5x faster Iceberg queries, new S3 Tables features reaching general availability, materialized views for Iceberg, capacity reservations with auto-scaling, and integration with SageMaker Unified Studio for AI-powered data analysis.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Session Introduction: Analytics Use Cases and AWS Portfolio Overview

Hello everyone, we're going to get started. Quick reminder to please put your headphones on. Hope everyone is having a great re:Invent. Welcome to the "What's New in Amazon Redshift and Amazon Athena" session. My name is Imran Moin. I'm a Senior Manager of Product Management within the Amazon Redshift team, and I'm delighted to have my co-speakers with me: Scott Rigney, who is a Principal Product Manager with Amazon Athena, and Sean McKibben, who is a Principal Software Engineer with Twilio.

Thumbnail 40

In this session today, I'm going to kick things off and talk about what's new in Amazon Redshift. Then I'm going to hand it over to Sean, who will talk about how Twilio is using Redshift and Athena to transform billing and analytics at scale. Finally, Scott will come up on stage and talk about what's new in Amazon Athena.

Thumbnail 60

When we look at customer needs for analytics, we see that broadly things fall into three main use cases. On one hand, you have customers that want to store and analyze vast amounts of structured data from their business systems, run SQL queries on them, and get insights on that data by visualizing it in BI dashboards like Tableau, QuickSight, and Looker. These use cases are well suited for a cloud data warehouse.

On the other hand, we have customers that have a vast amount of raw, unstructured, or semi-structured data, and this data is often used for advanced analytics or machine learning types of use cases. These use cases are well suited for a cloud data lake. Then you have customers that want to bring the data warehouse and data lake together. They want to combine the structured business data with the unstructured or semi-structured data so that their teams can work off of a unified data foundation and get richer insights into all of their available data. The industry typically refers to this as a lakehouse.

At AWS, we have built a portfolio of services that are geared towards meeting all these different use cases. Amazon Redshift is our purpose-built cloud data warehouse that provides high-performance SQL analytics on your structured data. Amazon Athena is a serverless way for customers to query their unstructured or semi-structured data stored in S3 using a familiar SQL interface. Amazon SageMaker brings it all together, where you can analyze your data warehouse data and data lake data together in one place so that your teams can get richer insights, mostly for advanced analytics and machine learning types of use cases, and all your teams can work off of a unified data foundation.

Thumbnail 190

Amazon Redshift's Evolution and 2025 Innovation Pillars

Let's begin with Amazon Redshift. Amazon Redshift has had a history of innovation over the last twelve to thirteen years. When we first launched Redshift in 2013, it was the first ever cloud data warehouse, providing massively parallel processing, columnar storage, and a familiar SQL interface at a fraction of the cost of your traditional on-premises system.

When we launched Redshift Spectrum in 2017, customers had a way to query their data lake data stored in S3, bridging the worlds of data warehouse and data lake together. More recently, in 2022, we launched support for Zero ETL, where you could bring your data from your operational databases and business systems directly into Redshift without having to worry about managing complex ingestion pipelines and data processing workloads.

Last year, we launched AI-driven scaling and optimization for Redshift Serverless, where Redshift would automatically scale up your compute clusters if there is a sudden spike in demand or query volumes, so that you continue to get the best performance out of your underlying Redshift infrastructure. Along the way, Redshift has also invested heavily in optimizing the performance of your queries, which is why Redshift has 2.2 times better price performance than its nearest competitors, enabling you to get faster insights on your data at a fraction of the cost.

Thumbnail 290

Each of these first-to-market capabilities demonstrates that Redshift continues to evolve with changing customer needs, which is why today it is used by tens of thousands of customers globally for their business-critical applications. It processes billions of queries and exabytes of data every single day.

Thumbnail 310

As we look at 2025, Amazon Redshift has delivered a number of key features and innovations that can be broadly categorized into three investment pillars. The first one is cloud data warehouse fundamentals. The second one is distributed warehouse. And the third one is Apache Iceberg support. I'm going to talk about each of these three investment areas.

Thumbnail 330

Cloud Data Warehouse Fundamentals: Performance and Security Enhancements

Cloud data warehouse fundamentals remain at the core of Amazon Redshift innovation because they underpin every customer's trust and success in the platform. This includes areas such as security, performance, global region availability, and other areas that are crucial for any cloud data warehouse to work well. Let's start with performance.

Thumbnail 360

Historically, Amazon Redshift has invested heavily in query performance, delivering up to 2.28 times better price performance than its nearest competitor. In fact, for specific workloads such as BI dashboarding, Redshift actually offers seven times better price performance than its nearest competitors, which is one of the hallmarks of its leadership in cloud data warehousing. This performance is an area where we continue to invest, and we're not done yet. You can expect further improvements from Redshift in 2026 and beyond.

Thumbnail 400

Let me dive into some specific areas of performance. The first one is materialized views. Materialized views are an area of ongoing innovation for Redshift, where we store pre-computed query results in the database so that your commonly used queries get the best possible performance. This year we have delivered a number of key capabilities for materialized views, starting in June when we delivered incremental auto-refresh of materialized views triggered by DML commits. Any time there is a change in the underlying base tables, Redshift automatically triggers a refresh of the materialized view. This provides you with near real-time updates on your data without having to go through manual tuning.

In July, we also released cascading refreshes across nested materialized views so that you don't have to go through complete refreshes every single time. In September we delivered materialized views that are built on top of materialized views on shared data, providing you the ability to optimize and pre-compute queries on shared datasets between clusters. Each of these advanced materialized view capabilities is designed to provide you with near real-time access and insights into your data with minimal manual effort.
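As a rough sketch of what this looks like in practice, the statement below creates an auto-refreshing materialized view through the Redshift Data API; the workgroup, database, table, and column names are illustrative, not from the session.

```python
import boto3

# Illustrative only: workgroup, database, and table/column names are hypothetical.
client = boto3.client("redshift-data")

sql = """
CREATE MATERIALIZED VIEW mv_daily_usage
AUTO REFRESH YES          -- Redshift refreshes incrementally when DML commits change the base table
AS
SELECT customer_id, product_id, event_date, SUM(quantity) AS total_quantity
FROM usage_events
GROUP BY customer_id, product_id, event_date;
"""

client.execute_statement(
    WorkgroupName="analytics-wg",   # Redshift Serverless workgroup (assumed name)
    Database="dev",
    Sql=sql,
)
```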

Thumbnail 480

I'm also happy to share that in September this year we announced general availability of a feature called Multidimensional Data Layouts, or MDDL, which is a query-aware sorting mechanism where Redshift automatically sorts your tables based on your query patterns and query volumes. Unlike fixed column sorting, MDDL provides up to ten times better price performance for selective repetitive queries, and it far outperforms tables that are sorted based on single columns. What's even better is that once Redshift determines the right layout for your table, it automatically applies that layout using automatic table optimization, thereby minimizing the amount of manual tuning you need to do on your tables on an ongoing basis.
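MDDL itself is chosen and applied by Redshift automatically, but the hedged sketch below shows the related knobs you can reach yourself: switching a table to automatic sort-key optimization and checking the optimizer's recommendations. Table and workgroup names are placeholders.

```python
import boto3

client = boto3.client("redshift-data")

statements = [
    # Let automatic table optimization choose (and evolve) the sort layout for this table.
    "ALTER TABLE sales ALTER SORTKEY AUTO;",
    # Inspect what the optimizer has recommended or already applied.
    "SELECT * FROM svv_alter_table_recommendations;",
]

for sql in statements:
    client.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)
```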

Thumbnail 540

Security has always been job zero for Amazon. It is our highest priority, and it is the foundation of our customer trust. This year, in January, we enhanced the security defaults of Amazon Redshift. Now your Redshift clusters are private by default, they are fully encrypted, and they require SSL for every client connection coming into Redshift. These new settings make it easy for you to adhere to best practices in data protection and also minimize the chances of misconfiguration on your part.

Thumbnail 590

Distributed Warehouse Architecture: Hub-and-Spoke and Data Mesh

Let's take a look at the second big investment pillar for Redshift in 2025, which is distributed warehouse. As big enterprise customers continue to scale the Redshift environment, we had to reimagine what the future of cloud data warehouses would look like.

Last year we outlined a vision where customers would have different analytics workloads such as BI dashboarding, ETL and ELT ingestion, real-time analytics, and more. Customers would run all of these different analytics workloads on their own dedicated clusters optimized to provide the best performance for each of those workloads. All of these clusters were also designed in such a way that they can share one common dataset, allowing them to either read or write to a single copy of data.

Thumbnail 650

This year, we have enhanced this architecture even more and added a bunch of new features and capabilities. This architecture is available in two different flavors. The first one is hub and spoke, where different workloads run on their dedicated clusters but they read and write from a single copy of data that is tied to a main cluster or producer cluster. The second architecture is a data mesh architecture where each cluster has its own copy of data but securely shares that data with all other remote clusters that might need access to that data.
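For illustration, here is roughly what wiring up a hub-and-spoke share looks like with Redshift data sharing; the share, schema, and workgroup names and the namespace GUIDs are all placeholders.

```python
import boto3

client = boto3.client("redshift-data")

# Run on the producer (hub) warehouse; names and namespace GUIDs are placeholders.
producer_sql = [
    "CREATE DATASHARE billing_share;",
    "ALTER DATASHARE billing_share ADD SCHEMA billing;",
    "ALTER DATASHARE billing_share ADD ALL TABLES IN SCHEMA billing;",
    "GRANT USAGE ON DATASHARE billing_share TO NAMESPACE '<consumer-namespace-guid>';",
]
client.batch_execute_statement(
    WorkgroupName="producer-wg", Database="dev", Sqls=producer_sql
)

# Run on a consumer (spoke) warehouse: mount the share as a database and query it live.
consumer_sql = [
    "CREATE DATABASE billing_db FROM DATASHARE billing_share OF NAMESPACE '<producer-namespace-guid>';",
]
client.batch_execute_statement(
    WorkgroupName="reporting-wg", Database="dev", Sqls=consumer_sql
)
```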

Thumbnail 710

Redshift Serverless Improvements: Cost Efficiency, Flexibility, and Concurrency Scaling

This distributed warehouse architecture is designed to provide the best performance for each of your analytics workloads. It provides workload isolation and also provides more granular cost visibility and chargeback capabilities so that enterprise customers with different business units or different teams can deploy this in a secure and scalable manner. Next, I'll talk about Amazon Redshift Serverless. Ever since we launched Serverless in 2021, we have seen many enterprise customers adopt Serverless for their business critical needs.

This year we have made Redshift Serverless more cost effective, more flexible, and easier to use. In March we launched support for trailing tracks, which give customers that require more stability extra time to test new releases before they are deployed to production. All you have to do is set your Serverless clusters to the trailing track, take your time testing the new features, and when you're ready, deploy those versions into production.

We knew that a lot of government and federal agencies require a high degree of security and compliance, so I'm very happy to share that in May this year, Redshift Serverless achieved FedRAMP authorization, so federal contractors, government agencies, and regulated industry customers can now deploy Redshift Serverless in their production environments with confidence. In June this year we added support for four RPUs, a much smaller form factor than the previous minimum of eight RPUs, so customers with smaller workloads can easily get started with Redshift Serverless at lower cost and still get the benefits that Serverless offers.

Four RPUs cost as little as $1.50 per hour, so it's easy for you to get started with smaller workloads. In July this year we also made Serverless easier to use in environments that have constrained networking. Serverless now requires only three IP addresses to get started, and you can deploy it with only two Availability Zones. It is fully compatible with IPv6, so if you have regulated or constrained networking environments, it's easy for you to get started with Serverless.
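Getting started at the new minimum size is a couple of API calls. This hedged sketch uses boto3 with made-up namespace and workgroup names.

```python
import boto3

rs = boto3.client("redshift-serverless")

# The namespace holds the database objects; the workgroup provides the compute.
rs.create_namespace(namespaceName="dev-analytics")

rs.create_workgroup(
    workgroupName="dev-analytics-wg",
    namespaceName="dev-analytics",
    baseCapacity=4,               # new minimum of 4 RPUs for small workloads
    publiclyAccessible=False,     # keep the endpoint private, matching the secure defaults
)
```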

Thumbnail 840

While Redshift Serverless has seen tremendous growth since we launched it in 2021, we heard from a lot of our large enterprise customers that had predictable usage. They were asking us for a way to get discounts and lower their overall cost of Redshift. So I'm very happy to share that in April this year, we launched support for Redshift Serverless reservations where you can commit to a certain number of RPUs every year and get up to 24 percent discount off of your list price for Redshift Serverless.

This option is available in two different flavors. You can pay all upfront or you can pay partial upfront. What's even better is that these reservations are applied at the payer account level and can be shared across your multiple different AWS accounts, thereby giving you a lot of flexibility if you're running a large enterprise architecture where you have different business units or different teams that might want to benefit from the reservations.

Thumbnail 910

Centralized identity and unified governance are very important as customers continue to scale their Redshift environment. I'm very happy to share that only a few days back, Redshift launched support for centralized identity and unified governance by integrating with AWS Lake Formation and AWS Glue data catalog. You can now log into any Redshift cluster using your IAM Identity Center credentials or simply your IAM credentials. You can define fine-grained access control permissions like row-level security, column-level permissions, or dynamic data masking, and have those permissions apply to any cluster or any warehouse across your entire Redshift environment.

Thumbnail 980

With these features, your users can log into any Redshift cluster, define permissions, and have the confidence that those permissions and identities will apply to any cluster in any warehouse across your entire Redshift environment. Another area of investment for us as we build out the vision for a distributed warehouse is concurrency scaling. Ever since we launched concurrency scaling in 2019, customers told us how much they value the fact that their Redshift clusters can automatically and dynamically scale up any time there is a spike in query volume or demand. This year we have enhanced the concurrency scaling feature so that all of your ingestion, transformation, and consumption workloads can easily burst across different clusters using concurrency scaling.

I'm happy to share that very soon you will have support for additional workload types on concurrency scaling such as autocopy from S3, materialized view creation, Spectrum queries, and refreshes of streaming materialized views. All of these different workloads can scale across clusters using concurrency scaling. In fact, these features are already available to you in all regions except ID, and we are rolling out support for ID very soon, after which we will make public announcements around this. Related to concurrency scaling, I'm also happy to share that in October this year we launched support for DDL and DML commands. You can now use create table and alter table commands on concurrency scaling clusters, which makes it very easy for customers to do complex table operations and manage ingestion pipelines across clusters.

Thumbnail 1090

Apache Iceberg Support: Read and Write Capabilities with Performance Gains

The third big pillar for Redshift this year has been around Apache Iceberg support. We are seeing an increasing number of analytics customers adopt Apache Iceberg as their open table format, as it provides a high-performance, open-source foundation on which to build their analytics architecture. Across AWS we are fully behind Apache Iceberg, and we are adding support for it across a variety of data and analytics services. In Redshift, I'm very happy to share that we now support Iceberg table writes, so you can create and insert into Iceberg tables directly from within Redshift. Building upon last year, when we added Iceberg read capability, Redshift now supports both Iceberg reads and writes, giving customers maximum flexibility to build their analytics architectures on top of Iceberg.

We have also made a number of enhancements to improve the performance of data lake queries from Redshift. I'm happy to share that, based on a series of query optimization techniques, Redshift now delivers two times better price performance on data lake queries run against Iceberg tables. The last thing I'll mention is that we have added support for auto-refresh of materialized views built on top of Iceberg data. The way this feature works is that Redshift periodically polls your S3 buckets for new Iceberg files, and any time it detects one, it automatically triggers a refresh of the materialized views. The benefit is that customers get near-real-time updates and insights into their data without having to do this work manually.
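The sketch below is a rough illustration of the read/write flow, assuming an Iceberg database already registered in the AWS Glue Data Catalog; the external schema, IAM role, and table names are placeholders, and the exact DDL surface for Iceberg writes may differ from this sketch.

```python
import boto3

client = boto3.client("redshift-data")

sqls = [
    # Map a Glue database that contains Iceberg tables into Redshift.
    """CREATE EXTERNAL SCHEMA IF NOT EXISTS lakehouse
       FROM DATA CATALOG DATABASE 'iceberg_db'
       IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-lake-role';""",
    # With Iceberg write support, standard DML can target those tables (illustrative).
    "INSERT INTO lakehouse.usage_events_iceberg SELECT * FROM staging_usage_events;",
]

client.batch_execute_statement(WorkgroupName="analytics-wg", Database="dev", Sqls=sqls)
```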

Thumbnail 1210

Customer Use Cases: From Batch Analytics to Near Real-Time Interactive Analytics

So far we have discussed why it is important for customers to have a platform with very strong cloud data warehouse fundamentals. We've talked about performance, security, distributed warehouses, and global region availability. But what about the end customer use cases? When I look at the landscape of what customers are doing with Amazon Redshift, I see that these use cases fall into a variety of buckets.

If you look at the left-hand side of this slide, Redshift has historically been very strong on batch analytics. When customers want to run end-of-day sales reports or end-of-quarter regulatory reporting, many customers are using Redshift for that, and it is very tailored for that particular use case. Another strong use case we see customers using Redshift for is BI dashboarding, where they want to visualize their structured data inside BI dashboards, whether it is Amazon QuickSight, Tableau, Looker, or any BI tool of your choice.

We are also seeing an increasing number of customers use Redshift for complex decision support or ad hoc analytics. This includes things like market forecasting and customer segmentation. If you remember, all of the performance enhancements I talked about that we focused on this year are basically geared towards reducing your latency for first time queries as well as repeat queries. As we continue to make improvements in that area, this use case is going to get even stronger.

Moving on, we also see Redshift play a very critical role in data processing workloads and pipelines, where customers want to ingest data from their operational databases directly into Redshift and then transform it using AWS Glue, custom SQL commands within Redshift, or a third-party tool like dbt. Redshift plays a very critical role in the entire data processing pipeline. All the enhancements we have made around zero ETL and the distributed warehouse architecture that I talked about earlier make this use case even stronger.

And then the last use case I will talk about is near real-time interactive analytics. We are seeing an increasing number of customers use Redshift for this particular use case, and for this use case, both your ingestion latency as well as query latency becomes very important. On the ingestion side, we have made a lot of improvements on zero ETL and as a result, we can bring the ingestion latency down to single digit seconds. On the query performance front, we continue to improve query performance for both first time and repeat queries.

Thumbnail 1390

All of these enhancements will make this use case even stronger, and customers can use it for things like clickstream analytics and IoT monitoring. Now we have thousands of enterprise customers that use Redshift daily for their business critical applications. One of those customers is Twilio, which provides a cloud communication platform for you to embed email, voice, and video directly into your applications. To learn more about how Twilio is using Redshift to transform billing and analytics at scale, please welcome Principal Software Engineer of Twilio, Sean McKibben.

Thumbnail 1450

Twilio's Billing Transformation: Building a Distributed Data Warehouse with Redshift

Thanks, Imran. About a year ago, my leadership on Twilio's commerce platform charged us with reimagining our billing engine. The business needed new ways to deal with challenges of scale and flexibility, and our finance teams needed to have analytical views of our data that were not possible with our classic architecture. What came out of this project helped us modernize and streamline how we process analytical customer financial data to empower the business to imagine new ways of delivering our products to customers.

Thumbnail 1460

Twilio's billing engine processes billions of usage events every day. Every text message, every voice call, every video session from customers like Toyota, Salesforce, and Shopify needs to be accurately priced and billed. Our legacy billing engine was built when Twilio was much smaller. It was laser focused on operational speed: ingest an event, price it, and move on. But that architecture created two critical gaps.

The first was flexibility. The business wanted innovative pricing models like package deals and committed use discounts with overage pricing, but our system could not adapt without major rearchitecture. Unwinding pricing misconfigurations was a multi-day engineering effort, and backdating a discount was a bespoke, time-consuming process. The second gap was with analytics.

Finance couldn't answer basic questions like what would revenue look like if we changed this pricing tier. Our BI teams couldn't build dashboards showing customer usage trends without impacting the production billing system. The operational system and the analytical needs were stepping on each other's toes, and neither served our business use cases very well. We needed a distributed data warehouse architecture that could handle both transactional billing integrity and analytical flexibility.

Thumbnail 1540

We evaluated several platforms: traditional data warehouses, data lakes with query engines, and even a split architecture with separate operational and analytical systems. But our workload is unusual. This system generates invoice line items, so we needed financial-grade consistency and guarantees against duplicate charges or missing events. We also needed analytical speed. Most data warehouses are optimized for analytics at the expense of consistency. Most operational databases can't handle analytical queries at scale.

Redshift bridges that gap for us as an operational data warehouse. We get serializable transactional isolation, the strongest consistency guarantees possible, while maintaining analytical performance. That's the unlock. We can ingest events, aggregate with transactional correctness, and serve BI queries all in one system. Plus, Redshift serverless gives us this without infrastructure overhead—no provisioning, no capacity planning, and no failover management. Price-performance scaling means our compute adjusts automatically as our workload changes. We pay for usage and not for capacity.

Thumbnail 1630

The result is that Redshift serverless costs us 75% less than what our previous system did, all while processing more data with stronger consistency guarantees and enabling analytics that were impossible before. Billions of events daily flow from MSK into Redshift serverless. Every usage event across Twilio's platform lands here. Two architectural decisions make this work at scale. First, serializable transaction isolation ensures idempotency. Events can arrive multiple times, whether from network retries or system resubmissions, but Redshift's consistency guarantees ensure that we only count each event once. No duplicate charges, no custom deduplication logic—the transaction model handles it.
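As a hedged illustration (not Twilio's actual code), an idempotent load in this style can lean on the transaction model rather than custom deduplication logic; all table and column names below are hypothetical.

```python
import boto3

client = boto3.client("redshift-data")

# Idempotent load: only events whose IDs are not already billed get inserted.
# Under serializable isolation, two concurrent loaders cannot both insert the same event_id.
dedup_insert = """
INSERT INTO billed_events (event_id, customer_id, product_id, quantity, event_ts)
SELECT s.event_id, s.customer_id, s.product_id, s.quantity, s.event_ts
FROM staging_events s
LEFT JOIN billed_events b ON b.event_id = s.event_id
WHERE b.event_id IS NULL;
"""

client.execute_statement(WorkgroupName="ingestion-wg", Database="billing", Sql=dedup_insert)
```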

Thumbnail 1690

Second, auto-refreshing materialized views enable real-time aggregation. We're not batch processing overnight. As events land, they're incrementally aggregated by customer, product, and day, and our rating system downstream gets near real-time data. We've stored over 0.5 trillion events in Redshift so far, all deduplicated and immediately queryable. That's the foundation. Now let me show you how we built on top of it. We didn't build just one big cluster. We built what ended up being a distributed data warehouse: multiple Redshift workgroups sharing data through data shares.

Our AWS folks call this the golden architecture, and honestly, once we understood it, we couldn't imagine building the system any other way. We have multiple workgroups all working off the same dataset. Ingestion is writing billions of events. The pricing engine is running transformations and aggregations on those events. Analytics is querying the results, but they're all looking at the same tables through data shares. We're not making copies or ETLing data between them. This gave us workload isolation and cost visibility. Heavy analytical queries don't slow down invoice processing. When we need to reprocess data, it doesn't impact dashboards, and we can actually see what each workload costs.

Thumbnail 1750

For a financial system, live data shares are critical. Everyone sees the same version of the data in real time. Let's talk about what's actually happening in that pricing engine workgroup, because this is where it gets complicated. How complex are we talking? Our dbt pipeline manages over 200 million price points across Twilio's product catalog. We're dealing with multi-tiered volume discounts, promotional discounts that stack in specific orders, multi-product bundles where one thing affects the pricing of another, and bifurcated taxes. It's genuinely intricate.

We use dbt Core to break this down into manageable pieces. Each pricing rule is a discrete SQL model, and dbt orchestrates how they run. Redshift parallelizes what it can and serializes what it needs to, and dbt manages the dependency graph between them. The bonus here is that dbt auto-generates documentation as the pipeline evolves. We can actually trace the lineage to see exactly how an invoice line item was calculated. That kind of transparency was impossible with our old system, but it's critical for customer trust and compliance.

Here's what surprised us: Redshift handled these complex joins and window functions with very little optimization. We budgeted weeks for performance tuning and barely needed it. We can recalculate an entire month of billing in 30 minutes, where our old system took 6 hours or more.
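To make the "discrete SQL model" idea concrete, here is a hedged sketch of what one tiered-pricing step might look like; in dbt Core it would live in its own model file with {{ ref() }} calls, and every table and column name here is invented for illustration.

```python
import boto3

# One pricing rule expressed as a single SQL transformation: join usage to the
# matching volume tier and compute the line amount. In dbt this would be a model file.
tiered_pricing_sql = """
SELECT
    u.customer_id,
    u.product_id,
    u.usage_date,
    u.units,
    t.unit_price,
    u.units * t.unit_price AS line_amount
FROM priced_usage_base u
JOIN price_tiers t
  ON t.product_id = u.product_id
 AND u.units >= t.min_units
 AND u.units <  t.max_units;
"""

boto3.client("redshift-data").execute_statement(
    WorkgroupName="pricing-wg", Database="billing", Sql=tiered_pricing_sql
)
```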

Thumbnail 1830

Midway through this process, we hit a bit of a snag. Our product catalog and pricing configuration data was spread across about 7 or 8 Aurora databases, and we only had the current state of our configuration stored in them. The old system just priced everything with whatever was current at the time of processing. That's fine if you only want to go forward, but we needed the flexibility to re-price historical usage for corrections, promotional adjustments, and contract changes.

To do that accurately, we needed to know when prices changed and what they were at any point in time. Zero ETL with history mode, which launched in April of this year, fixed this for us. We integrated that with Aurora databases and got full change history flowing automatically into Redshift. Every price update, every configuration change, time stamped and captured, with no custom CDC to manage. The result is that we can reconstruct any bill from any point in time.
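A hedged sketch of the kind of point-in-time lookup this enables is below; the history-mode column names used here are hypothetical placeholders, so check the Zero-ETL history mode documentation for the real ones.

```python
import boto3

# Hypothetical history columns (_record_start_time, _record_end_time) for illustration only.
as_of_price_sql = """
SELECT product_id, unit_price
FROM pricing_config_history
WHERE product_id = :product_id
  AND _record_start_time <= :as_of_ts
  AND (_record_end_time IS NULL OR _record_end_time > :as_of_ts);
"""

boto3.client("redshift-data").execute_statement(
    WorkgroupName="pricing-wg",
    Database="billing",
    Sql=as_of_price_sql,
    Parameters=[
        {"name": "product_id", "value": "SMS-US"},
        {"name": "as_of_ts", "value": "2025-06-01 00:00:00"},
    ],
)
```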

Thumbnail 1900

We can reprice historical usage with the exact configuration that should have applied, or we can model what invoices would look like under different scenarios. That's only possible with complete change history. Let's talk about outcomes, starting with cost. We're running at a 75% lower cost than the previous system while processing more data with stronger guarantees. Four people on my team are managing almost 100 billion events a month.

Thumbnail 1920

Thumbnail 1930

Speed has improved dramatically. We've gone from 6 hours down to 30 minutes for full month recalculations. Finance dashboards went from daily batch updates to sub-minute response times. But the real win is flexibility. We can now do retroactive pricing changes and accurately recalculate last month's usage with today's promotional rules.

Product managers can model new pricing strategies against real-time usage, answering questions like what if we change this pricing tier structure. Billing ops can audit any invoice and see the complete calculation lineage in minutes. We're also generating new data products like an effective pricing model that represents what any customer will pay for any product at any time. Direct writes to S3 and soon to Iceberg tables eliminate Spark ETL jobs for data lake delivery.

Thumbnail 1980

The previous system served us well, but it wasn't built for this kind of scale or flexibility. This architecture fundamentally changed what's possible for the business. The power of this architecture is that it doesn't end with the billing engine. We're about to start writing our curated financial data sets to Iceberg tables indexed with AWS Glue data catalog. This opens up the data to broader organizational use without having to give everyone direct access to Redshift.

Which brings me to the next part of our data story. Teams at Twilio have built what we call the Odin system, a query infrastructure layer on top of these Iceberg tables and the Glue data catalog using Athena. Scott is going to walk you through how you can leverage Athena's latest features to democratize data access across your organization. Scott, I appreciate it. Thanks. Good luck.

Amazon Athena Overview: Iceberg Performance and Materialized Views

Thank you, Sean. Thanks again, Sean. I really loved hearing about the outcomes that Twilio has had, especially the 75% cost savings, which is really impressive at Twilio's scale. As Imran mentioned, I'm Scott. I lead the product management function for the Athena service. I'm delighted to be with you today. Like many of you, I'm a data nerd and I love working with data, so I wanted to gather some data from you all.

Thumbnail 2060

Raise your hand if this is your first re:Invent. And raise your hand if this is your first "What's New in Athena" session. Great to see some friendly new faces and previous ones alike. So as a refresher, Athena is our serverless interactive query service that's designed to make it dead simple to run SQL on your data lakes. Customers love Athena's ease of use. We talk about it often. It's really what we focus on on the product team in every feature that we build.

There's no infrastructure to set up or manage. It just works with the tools that you have today. Customers love how Athena is optimized for interactivity, meaning you get queries that start in under 1 second and a fast engine on top of great out-of-the-box performance. Last but not least, our engine just works with the data that you have, so you can bring your Parquet, JSON, Iceberg, CSV, and other data to Athena and start querying right away. And best yet, we're able to pair Athena's query engine capabilities with really deep fine-grain access capabilities through Lake Formation, our sister service.

Thumbnail 2120

Thumbnail 2130

We launched in 2016, and today we have customers from every industry, from startups to large enterprises, who are running lots of queries per week. In fact, we're seeing billions of queries a week in our most recent metrics. Customers are getting a lot of insights and value out of their S3 data lakes.

Let's talk about the customer journey with Athena and how that often starts with Amazon S3. Customers choose S3 for its unmatched durability, scalability, and availability. Because of those foundational capabilities and unique differentiators, customers have been able to build millions of data lakes on top of S3. But as Imran mentioned earlier, there's a big shift happening in data lakes, and that's open table formats.

Thumbnail 2160

Folks love how these open table formats bring familiar SQL functionality to where we're all working today, which is in our S3 data lake stores. Athena actually launched its open table format support back in 2021, and since then we've seen Iceberg emerge as a leading choice amongst the open table formats that we support. Iceberg is unique amongst open table formats for its ability to handle massive scale data sets while maintaining simplicity, supporting multiple query engines like Athena and Spark, and providing transactional features that allow you to write into your data lakes with ease, all without locking you into a single vendor ecosystem.

I'm excited to share that Athena this year is 1.5 times faster on Iceberg with Parquet, based on the TPC-DS 1 TB benchmark, compared to this time last year. We've been able to achieve that through a number of changes, starting with our engine and cascading out to our catalog experiences as well. The first is Iceberg statistics, which is an existing feature of Athena, but two weeks ago we launched an update that enhances how the Athena service interacts with statistics files to ultimately make queries run faster.

Thumbnail 2200

With Iceberg statistics, you'll fire up AWS Glue and ask Glue to collect statistics on your Iceberg tables. Then when you come to Athena and query those tables, Athena will automatically apply the statistics to make more intelligent query planning decisions, automatically accelerating your queries without a ton of work to do. We also launched a net new feature called Parquet column indexing, which allows queries to skip irrelevant blocks of data, reducing unnecessary I/O operations when reading from S3. This is especially beneficial on queries that have selective filter predicates on sorted data, which is fairly common in a lot of enterprise reporting scenarios.
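A minimal sketch of that statistics workflow, using placeholder database, table, role, and output-location names: ask Glue to compute the statistics, then query the table through Athena as usual.

```python
import boto3

glue = boto3.client("glue")

# Ask Glue to compute column statistics for an Iceberg table (names are placeholders).
glue.start_column_statistics_task_run(
    DatabaseName="lake_db",
    TableName="usage_events_iceberg",
    Role="arn:aws:iam::123456789012:role/glue-stats-role",
)

# Subsequent Athena queries on the table pick up the statistics automatically
# during query planning; no query changes are needed.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) FROM lake_db.usage_events_iceberg GROUP BY 1",
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```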

For our customers who are using Lake Formation today to secure and govern their Iceberg tables, just a couple of weeks ago we announced new partition pruning behaviors for Lake Formation tables with row filters and column masks, and additional predicate pushdown behaviors. Altogether, these will help improve query performance and reduce costs without sacrificing data governance.

On Sunday we launched a new feature for Iceberg called Materialized Views. This is a really cool feature. It's essentially a managed Iceberg table that stores pre-computed query results for you in an always-ready-to-query S3 table, or Iceberg table, that you can query through Athena and other engines. They're super easy to use and automatically updated when source data changes, so your data is flowing through materialized views and always available for you to query, so you're always seeing the latest and greatest. It does all of that without any infrastructure to manage.

Thumbnail 2310

Today Athena supports reads over these materialized views, but you can also go out to Redshift, Spark on EMR, and Glue and perform additional operations there. A lot of customers come to Athena and build decomposed SQL pipelines on top of the service. Imagine you have a complex data transformation pipeline today that joins a bunch of tables together to produce one output table. You can now reflect that as a materialized view in the Glue Data Catalog, express it as SQL, and get the always-updated benefits that materialized views now provide.

You'll fire up one of our Spark experiences in EMR or Glue or use those Glue APIs to launch or create your materialized view. When you're creating one, you'll get to choose whether you store your materialized view in a self-managed Iceberg table or an Amazon S3 table.

Thumbnail 2410

S3 Tables, Performance Optimizations, and Administrator Features in Athena

This is a great segue to all the good work we've been doing with S3 Tables. If you remember back to this time last year at re:Invent, we talked about and announced a preview for Amazon S3 Tables as the first fully managed Iceberg offering in the cloud. S3 Tables is great because it simplifies the tedious aspects of running an Iceberg data lake, such as table compaction and expiring old snapshots. S3 Tables handles all of that for you.

Thumbnail 2440

This year we've been busy taking S3 Tables to general availability, which we did on Pi Day in March. In that launch, we added expanded DDL operations, including CREATE DATABASE and CREATE TABLE statements, plus a bunch of others that made it possible to work with S3 Tables end to end through Athena SQL. We're excited about what that allows customers to do. In that launch, we also added a new console wizard to help you get started quickly.

In August, we launched CREATE TABLE AS SELECT, a very popular feature in Athena that we call CTAS, for S3 Tables. What CTAS lets you do is convert data from one format to another, and it's a no-brainer feature to have with S3 Tables. Let me give you an example use case. Assume you have an application writing JSON data out to S3, and you want to query that JSON data, but you need columnar-level performance to meet the latency requirements of your use case. What you can now do is run a CTAS on that JSON data to convert it into an S3 Table, and then tomorrow, when new records show up from your JSON data stream, run an insert operation to bring the new records over to your S3 Table and drive your queries off of the S3 Table instead of the raw JSON.
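Here is a hedged sketch of that pattern using the Athena API, shown against a Glue-cataloged Iceberg table with placeholder names and locations; with S3 Tables the target table would live in your registered S3 Tables catalog instead.

```python
import boto3

athena = boto3.client("athena")

def run(sql: str) -> str:
    """Submit a query to Athena; names, locations, and workgroup are placeholders."""
    resp = athena.start_query_execution(
        QueryString=sql,
        WorkGroup="primary",
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

# One-time conversion: CTAS from a JSON-backed table into an Iceberg table.
run("""
CREATE TABLE lake_db.events_iceberg
WITH (table_type = 'ICEBERG', format = 'PARQUET',
      location = 's3://my-lake/events_iceberg/', is_external = false)
AS SELECT * FROM lake_db.events_json;
""")

# Daily top-up: append the new JSON records and keep querying the Iceberg copy.
run("""
INSERT INTO lake_db.events_iceberg
SELECT * FROM lake_db.events_json
WHERE ingest_date = CAST(current_date AS varchar);
""")
```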

In the back half of the year, what we've been doing is optimizing the rest of the S3 Tables experience. What's great about how we're working with the S3 team is that all of the features I mentioned for Iceberg on the previous slides you also get with S3 Tables. Parquet column indexing, updated statistics, all of that good stuff applies to S3 Tables. But you're also getting the performance features that our partner teams in S3 and Glue Data Catalog have been delivering, like Z-Order and sort compaction, plus a bunch of operational tweaks that reduce latency across the experience.

Thumbnail 2570

We recognize that Iceberg is not the only data format out there. When we look at the data that customers are querying today, we see lots of JSON, text, CSV, and other data formats. We're proud to share that the majority of those queries are quite fast, completing in under two seconds. We're able to achieve this through how Athena's engine is designed and tuned, and also because of how Athena provides the serverless compute infrastructure that automatically scales based on query complexity and is able to run multiple queries in parallel.

Thumbnail 2600

This year, as we looked at this data, we asked how else we can accelerate or bring the performance bar higher for some of these additional data formats that customers are querying today. That took us down an interesting path of actually redesigning our file system and our readers and writers for a bunch of different file formats to bring a faster experience when interacting with those files through the Athena engine. We're seeing some really awesome results, as this slide highlights. Just to highlight a few, Parquet, as I mentioned, is a super popular format and is now 1.2 times faster when used with Hive than this time last year. JSON is 2 times faster. CSV, folks are still passing CSVs around, and it's now 1.8 times faster compared to last year. Iceberg we talked about already, but Delta Lake is seeing a good boost this year as well.

When we talk about these features rolling out, normally what you hear is that you have to upgrade to this new thing and that's a lot of work. But where we try to do things differently on Athena is work ahead of you and figure out better ways to migrate our customers and bring them along for these journeys. What we've been doing with this rollout is over the course of this year automatically moving customers to these new formats and our new file system based on compatibility checks that we run in the background. Chances are you're already getting these benefits, or you'll be getting them very soon in your production applications.

Thumbnail 2680

We've also been shipping performance features elsewhere in the service, so I wanted to highlight a few of those today as well. Query result reuse, or QRR as we call it, is Athena's caching feature. What's neat about it is that it can bypass query execution entirely under certain conditions, so instead of paying the engine's latency, you pick up the previous query's results in under 100 milliseconds.

We actually changed how our hashing function works to ignore small variations in how different people write queries. A good example is code comments or use of whitespace characters in your queries. We've done all of that to increase the cache hit ratio when you're using query result reuse. That's a really sweet feature to turn on. You just toggle a switch within the Athena console or our APIs to activate it, and if the result is found, Athena will bypass execution and give you the previous matching result.
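Programmatically, opting in looks roughly like this; the workgroup, database, and S3 locations are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Opt in to result reuse per query; Athena returns a cached result (if one matches
# within the age window) instead of re-running the engine.
athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM sales_db.orders GROUP BY region",
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
    },
)
```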

We also rewrote some of our memory logic deep in the engine with the goal of reducing failures and making queries that have heavy shuffle and join operators more stable at scale. We've tweaked a lot of the very deep behaviors of the engine internals to get better memory stability and efficiency. You should get fewer failures on queries that process a lot of data in memory.

Thumbnail 2780

Let's switch gears and talk about some administrator features. One of those is capacity reservations. This is not too different from Redshift serverless reservations, and this is Athena's feature for guaranteed serverless compute on top of our managed compute infrastructure. It's really ideal for mission-critical applications, dashboards, any SLA-sensitive reporting jobs you may have, as well as user-facing applications that have a requirement for significant concurrency.

To use it, it's pretty straightforward. You create a reservation and then you choose a number of compute units, which are called DPUs. Then you assign this reservation to your workgroups, which allows your reservation to share capacity across your workgroups. What's really great about how this feature works is there are no SQL changes needed. You basically just configure this setup and Athena automatically knows what to do and routes queries from those workgroups to your reserved capacity units.
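A minimal sketch of that setup with boto3, using made-up reservation and workgroup names:

```python
import boto3

athena = boto3.client("athena")

# Reserve dedicated DPUs, then route existing workgroups onto the reservation.
athena.create_capacity_reservation(Name="sla-dashboards", TargetDpus=24)

athena.put_capacity_assignment_configuration(
    CapacityReservationName="sla-dashboards",
    CapacityAssignments=[
        {"WorkGroupNames": ["bi-dashboards", "executive-reports"]}
    ],
)
# Queries submitted through those workgroups now run on the reserved capacity,
# with no SQL changes required.
```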

Thumbnail 2840

Last week we announced two new features for capacity reservations. The first is capacity and cost controls. This is a new setting that influences how Athena assigns data processing units, or DPUs, to each of your queries. You can set it at the workgroup level, which gives you a centralized way to apply defaults and a min and max range for each of your queries. But you can also control DPU allocation at the query level through new features in our start query execution API that now allow this property to be set, giving you precise task-level performance control.

Customers who've used our preview of this feature have been able to get very low latency out of their queries through experimentation and achieve very high levels of utilization on the reserved capacity. That's a really great feature. Last but not least on that one is we now have new observability data in our get query results API. After you run a query on capacity reservation, we now tell you exactly how many DPUs it consumes, so you can use that new data to help you plan for future queries that have a similar shape and may have a different or slightly different capacity need.

Thumbnail 2910

We also launched an auto-scaling solution. This is really neat. This works as a Step Functions state machine, so it's completely serverless and its entire logic is presented to you as the state machine. You can go into the Step Functions console and explore and look at how it works, but that also means it's highly open and flexible to customization for different requirements that you may have. The way it works is it basically monitors the utilization of a reservation and makes capacity adjustments based on the parameters that you set for lookback and how frequently and by how much you should scale based on your workload.

Thumbnail 2960

To launch it or to get started, you have a console click-through button that we've provided as well as documentation that provides you a link into the CloudFormation experience. Managed query results is another admin feature that we launched earlier this year. This is a really great feature if you're an administrator because it simplifies your experience by allowing the Athena service to automatically store query results in our own storage rather than your S3 bucket. We're taking over the headache of having to manage query result files for you.

What's neat about this is we store results for a day at no cost to you and automatically expire and delete them after that one-day period with no action needed on your part. You no longer need to take action to manage query result files or their lifecycle.

Thumbnail 3020

Additional Launches and Session Wrap-Up: SageMaker Integration and Learning Resources

With this feature, we also elevated the permissions for query results to be tied to workgroups instead of an S3 bucket. This should simplify permissions for folks who are accessing Athena through different workgroups. It has been a really busy year of launches and we could not get to everything today, but we want to make sure there are some key launches that you have visibility on to check out while you are here at re:Invent.

Thumbnail 3030

The first is Amazon SageMaker notebooks. This is a really cool feature and I am glad we had a chance to talk about it today. This is a brand new native notebook UI in SageMaker Unified Studio. It is integrated with Athena SQL and Athena Spark engines and delivered to you as a single canvas, providing you a data analysis and development experience in one spot. It is also multi-dialect, so in one cell you can run a SQL query on Athena, then you can run some Python code, then you can run some Spark code all in one canvas with data frame access in between each cell, so you work with data in memory and in a really snappy experience.

Thumbnail 3080

Thumbnail 3100

Best yet is you can work alongside our data agent, which is a built-in AI agent that helps you do data analysis. This is a really great feature if you are getting started to learn a new data set and want to extract some insights from it. You can task the data agent with helping you with some of that work. Behind the scenes, we have upgraded the Athena Spark engine to 3.5.6, and we have also given Spark Connect support and Spark Live UI support to the Athena Spark experience.

With SageMaker Unified Studio, we have continued on the notebook experience. We have also added one-click onboarding with IAM permission experience, brought new AI-powered column and metadata rules, and many other features to our unified studio. If you get a chance while you are here at re:Invent, drop by some of the SageMaker sessions to see what that is all about.

At re:Invent this year you will probably also hear a lot about trusted identity propagation, also known as TIP. This gives you a way to authenticate end users based on their corporate identities and have that identity propagate to all of the AWS services that they are using when they are querying and analyzing data. That includes Athena, S3, Glue, Lambda, KMS, and all the services that Athena will talk to when querying your data, and that gives you end-to-end auditability on user actions tied to identity.

It also plugs into Lake Formation so you can define fine-grained access controls based on corporate identity and have those enforced by engines like Athena and Spark. From a Redshift and Athena perspective, we have also carried TIP support out to our drivers. So if you are logging in through third-party clients like DataGrip, you now have browser-based authentication flows that use TIP to authenticate with your corporate identity.

On MCP, a growing number of customers are developing AI agents and doing new and interesting things with agents to perform different tasks. We launched earlier this year several MCP servers for Redshift, Athena, EMR, and Glue, just to name a few, and we are seeing customers build really cool things on top of that. For example, building agents to query and analyze and extract insights from your data lakes all autonomously. So really cool progress on MCP.

As we wrap up, if you want to learn more about what we shared today, definitely scan these QR codes and get your phones out. There are a lot of really awesome blogs here. One is on Redshift serverless reservations and Iceberg writes, which you should definitely check out. There is one on data transformation with Athena and S3 tables. As you think about that create table as select experience we talked about earlier, that is a great blog to check out with a lot of practical takeaways.

Thumbnail 3230

Thumbnail 3250

We have also included the Odin Deep Dive deck. Our folks at Twilio and our customers at Twilio who have built and support the Odin platform have a really neat blog that describes their journey in building that on top of Athena, which we recommend. We hope that you have taken the learnings from our session today further and that you can leverage other learning and knowledge capabilities and resources that we have here at AWS. For that, we have AWS Skill Builder, tons of great resources for you to learn, practice, and get AWS certified, so definitely check that out and share that with your teammates.

That about wraps up the 206 session on what is new in Redshift and Athena. On behalf of the AWS team, I want to send a thank you to Sean at Twilio for sharing his journey with us and Imran for sharing the insights on the exciting Redshift launches. Last but not least, thank you for attending and spending the hour with us today. We would love to hear your feedback, so pop into the events app and give us your rating and have a great rest of re:Invent. Thanks.


This article is entirely auto-generated using Amazon Bedrock.
