Kazuya

AWS re:Invent 2025 - All Power, No Pain: Scale Agentic AI Apps in an Open Lakehouse (AIM247)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - All Power, No Pain: Scale Agentic AI Apps in an Open Lakehouse (AIM247)

In this video, Qlik presents their Open Lakehouse solution for building AI-ready data foundations on AWS using Apache Iceberg. The session covers how organizations face challenges with AI ROI and data trust, emphasizing the need for open data architectures. Qlik demonstrates high-throughput ingestion into Iceberg tables from various sources including Amazon Kinesis, their Adaptive Optimizer for continuous table optimization, and warehouse mirroring for interoperability. A benchmark shows 5x faster data freshness and 75-90% cost savings compared to direct data warehouse ingestion. The presentation includes a live demo of real-time IoT streaming data from exercise bikes, ingesting sensor readings through Kinesis into Iceberg tables with built-in data quality scoring.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction to Qlik and the Challenge of Building AI Data Foundations

What's going on everyone? Thank you for taking 20 minutes out of a busy conference schedule to talk with me about AI initiatives and building solid data foundations on open lakehouse technologies. We'll introduce you to Qlik if you haven't heard of us. We'll spend the bulk of the time today talking about where the current AI landscape is and, specifically, how to build scalable and cost-effective data platforms to drive any data initiative you might have, particularly those focused on AI.

Qlik is a global software company with offices around the world. Many of you likely use us even if you don't know that you use us. Fundamentally, Qlik has a product portfolio that provides customers with an end-to-end data platform focused on both data integration and transformation as well as data analytics. Wherever you are in your data journey, we can likely help get data into digestible and consumable formats, or if your data formats are already in good shape, we can help you analyze and provide analytics on top of your data.

We are very proud of our AWS partnership. It has been long-lasting and we hope it will continue into the future. We've been strategic partners with AWS for many years and we rely on AWS to power much of the capabilities we're going to talk through today. We are both a partner of AWS and a consumer of AWS services. One of our primary objectives in partnering with AWS is to help customers deliver successful data projects on AWS as fast as possible. We want to get you from data to outcome very quickly and easily, and really enable all of your data professionals to be successful driving data projects on AWS.

AI has happened. I don't think I'm out of bounds by saying that. If you look at the maturity curve, we're inside of it. Most organizations are starting and continuing to deliver AI projects within their organizations. It's really up to us as data professionals and data teams to figure out how we're going to build strong data platforms and strong data foundations to deliver successful AI initiatives.

Many organizations are starting to make progress. It's good to see that almost 70 percent of organizations have started to build a formalized AI strategy. It's pretty rare these days to approach an organization and find they have no strategy or no developed plans for accommodating AI. What has become really challenging, and one of the factors starting to hold things back, is that ROI can be really hard to predict and quantify on AI projects. Many organizations are hesitant to dip their toes in the water because they're not sure whether they're going to get the expected benefit out of those projects.

That has started to slow down the rollout across the industry. However, you're seeing lots of successful AI initiatives, many of them around chatbot-type engagements, code writing, and development. We're also starting to see more AI initiatives fall into the analytics realm. If you talk to organizations about what's holding them back, beyond the difficulty of predicting ROI, a lot of the barriers are around that data foundation. You can't really do anything with your data if you can't trust your data. How do you build that strong data foundation and strong data platform that you can then deliver successful AI projects on top of? That is something we're talking to a lot of customers about, and they're telling us that it's holding them back from achieving a lot of their vision around some of these projects they want to implement.

How do we start to make progress? How can we start to build on these data foundations and data platforms to deliver on our successful projects? We at Qlik believe that it all starts with an open data architecture. We'll talk about what that means and how we can deliver on that through the remainder of today's presentation.

Open Data Architecture and the Qlik Open Lakehouse Powered by Apache Iceberg

What does an architecture look like in terms of these open foundations? If you look across the left of the screen, that's your data. Those are your data sources. This is where the data is going to come from that you want to build your projects on top of. We believe an open architecture should support as many sources as you might have, from the simple, maybe API endpoints or SaaS-type applications, to the more complicated, maybe database transactional systems that you may want to bring into your data architectures, the Oracles of the world, SQL Servers of the world, PostgreSQL, all the way down to the much more complicated streaming real-time data sources.

Those sources that deliver semi-structured data, those sources that deliver data in real time, those sources whose schema evolves over time—all of these sources need to be brought into a particular data architecture. Now that data architecture then needs to provide capabilities to help you move and transform that data.

A data platform that is open needs to handle data movement and also needs to be able to analyze and act on that data in the most efficient way possible. We believe that an open data lake architecture is the best place to store and manage your data so that it is available to as many different consumers as might want to consume it. These could be AI agents, or these could be traditional data consumers like analytical and query engines.

This is what a proposed architecture looks like. We think that an Iceberg-formatted lakehouse is the best approach for storing your data. There are two primary reasons for that. If you talk to customers about why they're considering adopting Iceberg or lakehouse solutions in general, most of those conversations drive towards one or two particular drivers. The first is cost. When building out these architectures that have to accommodate data sources from small scale to large scale, from batch-oriented to real-time, cost is going to be a factor in bringing that data in. Lakehouse technologies, specifically those based on Iceberg, have proven to deliver the most flexible data storage at the lowest possible cost.

The second value proposition is interoperability. We are trying to help organizations shift from what I would call platform-oriented architectures, where all data sits inside of the platform consuming that data, to a more open architecture where the data sits in an open format and as many platforms as you want can connect to and consume it. Iceberg serves as a very powerful framework for that. As proven by what you see over on the right, most vendors across the industry, with AWS leading the charge, have chosen Iceberg as a lakehouse format that they are putting significant investment into. AWS services like S3 Tables, SageMaker, EMR, Glue, and Athena all support Iceberg very well.

Within the AWS ecosystem and across the industry, whether data warehouse providers, other hyperscalers, or a variety of consumption engines, they all have chosen Iceberg as a format and a technology that they will support. This interoperability across the data ecosystem is a huge value add of any lakehouse technology, specifically those based on Iceberg. Qlik, of course, is now in that world. Earlier this year we released the Qlik Open Lakehouse, which introduced a managed Iceberg offering within the Qlik product portfolio: we can help customers ingest data into it, transform data within it, and then present it out to their consumption engines.

If I layer this on top of the architecture diagram that you saw earlier, you can see the Qlik Open Lakehouse powered by Apache Iceberg that runs on AWS provides three primary capabilities. First, you'll see high-throughput ingestion, which is ingestion directly into Iceberg tables from all of the various sources that I mentioned, from simple API-type endpoints to more complicated database and CDC-type endpoints. We're happy to announce at this show new support for streaming data ingestion from providers like Amazon Kinesis and various versions of Kafka, and even micro-batch ingestion of the many files that might exist on object stores. All of that data can be very simply and cost-effectively ingested into Iceberg tables.
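
To make that ingestion path concrete, here is a minimal sketch of the kind of pipeline a managed service automates: reading a batch of records from a Kinesis stream with boto3 and appending them to a Glue-cataloged Iceberg table with PyIceberg. The stream and table names are hypothetical, and a real pipeline would also handle multiple shards, checkpointing, and schema evolution.

```python
import json

import boto3
import pyarrow as pa
from pyiceberg.catalog import load_catalog

STREAM_NAME = "bike-sensor-readings"      # hypothetical stream name
TABLE_NAME = "lakehouse.sensor_readings"  # hypothetical Iceberg table

kinesis = boto3.client("kinesis")

# Read one batch from the first shard (a production consumer would
# iterate over every shard and checkpoint its position).
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=500)["Records"]

# Decode the JSON payloads and append them to the Iceberg table;
# the batch schema must match the table schema.
rows = [json.loads(r["Data"]) for r in records]
if rows:
    batch = pa.Table.from_pylist(rows)
    catalog = load_catalog("glue", **{"type": "glue"})
    catalog.load_table(TABLE_NAME).append(batch)
```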

Beyond that Iceberg ingestion, any customer who has tried to use Iceberg at scale will know that Iceberg, if it is not optimized, will not deliver on its core value, especially for real-time ingestion. Iceberg tables need to be optimized. You have to compact files for efficient file-level access, you have to expire snapshots so your metadata files remain reasonably sized, and you have to delete orphan files so that you are not consuming terabytes on disk to process a 100-gigabyte dataset. The Qlik Adaptive Optimizer will do all of that for you. As you ingest data into Iceberg, our optimizer kicks in and we continuously optimize Iceberg for low cost and great performance.
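
For a sense of what that optimization involves under the hood, open-source Iceberg exposes these maintenance tasks as Spark procedures. Here is a minimal PySpark sketch of the manual equivalent, assuming a Spark session already configured with an Iceberg catalog named glue_catalog and a hypothetical table db.sensor_readings; this is the housekeeping the Adaptive Optimizer does for you continuously.

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.glue_catalog is already configured for Iceberg.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact many small data files into fewer large ones for efficient scans.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.sensor_readings')")

# Expire old snapshots so metadata files remain reasonably sized.
spark.sql("CALL glue_catalog.system.expire_snapshots(table => 'db.sensor_readings', retain_last => 5)")

# Delete files no longer referenced by any snapshot.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.sensor_readings')")
```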

Finally, we have warehouse mirroring, which is an important component of the interoperability function that I mentioned. It is not good enough to store your data in Iceberg and just query it from a single engine. Iceberg tables need to be consumable by as many consumption engines as possible in order to deliver on that open perspective. What we offer is: we ingest data into Iceberg, we optimize Iceberg, and then we present that data to a variety of consumption engines, from your data warehouses to your query engines, so that they can reach into the Qlik Open Lakehouse and run optimized queries against Iceberg tables. All of this runs on AWS.

As an example of the cost efficiency that we're helping customers achieve, we actually ran a benchmark that we published a few weeks ago. We took the exact same data, which was real-time streaming ingestion into Iceberg at a relatively moderate volume—not low scale, not high scale, but right in the middle, probably a volume similar to what many of you are working with today. We ingested that data in real time into a Qlik Open Lakehouse Iceberg table. Conversely, we ingested the same data directly into a data warehouse to showcase the cost efficiency that you can see with Open Lakehouse ingestion into Iceberg.

The conclusion of the benchmark was that we were able to deliver data freshness roughly five times faster. Data latency in the Open Lakehouse was in the one to three minute range, while the data warehouse was in the five to fifteen minute range. So we achieved five times faster data delivery, and we could do that at a cost that was seventy-five to ninety percent lower. We ran the test on a few different warehouse sizes, from very small to slightly larger, and we were able to showcase not only fresher data in Iceberg but at a much lower cost.

You also get to take advantage of the interoperability of those Iceberg tables, which are now consumable across the widest range of query engines. If you look online, we published this benchmark a few weeks ago with a lot of detail and information on how you can replicate it yourself. Again, we're delivering on that cost-efficiency-at-scale objective of getting data into Iceberg in an efficient manner. So the Open Lakehouse benefits are fresh, high-speed, real-time data ingestion into Iceberg at a very low cost, with the Adaptive Optimizer layered on top so that you get great query performance out of your Iceberg tables while still benefiting from the cost savings of ingesting data into an Open Lakehouse.

Live Demo: Real-Time Streaming Data Ingestion with IoT Bike Sensors

I did prepare a quick demo if you want to see how this works. However, if you want to see an even better demo, come to the Qlik booth at booth 1727. It is literally as far as you can walk that way in the expo hall. If you go that way, you can listen for the cowbells or see people huffing and puffing on bikes, and then you'll know you're in the right space.

This demo is a real streaming data IoT-style use case where you can generate your own data. You get on the bikes and pedal as fast as you want. During your ride, you generate about three hundred to five hundred meter readings that we're pulling off the bike as real-time sensor data. We then load that data into Amazon Kinesis as a real-time streaming platform, ingest that data into Iceberg tables in real time, optimize those tables so that we get great query performance, and then provide analytics capabilities on top of the data that you've generated while riding the bike. It's both a fun user experience and a real-world example of real-time streaming data ingestion into Iceberg.
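
As an illustration of the producer side of this demo, a single sensor reading could be pushed into Kinesis with a few lines of boto3. The stream name and payload fields here are hypothetical stand-ins for whatever the bikes actually emit:

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def publish_reading(bike_id: str, cadence_rpm: float, power_watts: float) -> None:
    """Send one sensor reading to a (hypothetical) Kinesis stream."""
    payload = {
        "bike_id": bike_id,
        "cadence_rpm": cadence_rpm,
        "power_watts": power_watts,
        "ts": time.time(),
    }
    kinesis.put_record(
        StreamName="bike-sensor-readings",  # hypothetical stream name
        Data=json.dumps(payload),
        PartitionKey=bike_id,  # keeps each bike's readings ordered within a shard
    )

publish_reading("bike-17", cadence_rpm=92.5, power_watts=210.0)
```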

If you want to see how this works, I'm going to keep my fingers crossed that I can switch to a demo. Now, those of you who have been here know the Wi-Fi is not great, so I'm going to take advantage of a recording. I normally don't do this because I like to do demos live, but so that I'm not suffering from Wi-Fi issues, I'm just going to run the recording and talk you through what we're seeing here. This is our Qlik cloud platform, specifically our Open Lakehouse solution. We're going to build a streaming ingestion project, give our project a name, and define the Iceberg details of how we want to ingest this data.

The catalog we're going to use is AWS Glue, the storage is Amazon S3, and the compute is a cluster that runs in AWS to process that data into Iceberg and optimize it continuously as it's being ingested. We step through a data onboarding wizard. This is where we select our Amazon Kinesis connection, which is where those real-time events are going to get loaded. We have two Kinesis streams in this account. This is an e-commerce-type dataset, and we're going to pull those two different Kinesis streams into Open Lakehouse.

Our data is in JSON format. We support a variety of formats, so any type of data, whether it's highly structured or less structured, we can pull in and ingest as part of this process. We have options to determine how far back we want to read, how we want to handle nested data—if you're using JSON and you've got structs or arrays, we can either keep it nested or flatten it out. We can append or merge, and then we can also define the partitioning of the table that we're ingesting into.
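
To show conceptually what the flattening option does, here is a small sketch that collapses nested JSON structs into dotted column names. This illustrates the general technique, not Qlik's actual implementation:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level using dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

event = {"order_id": 42, "customer": {"id": 7, "address": {"city": "Austin"}}}
print(flatten(event))
# {'order_id': 42, 'customer.id': 7, 'customer.address.city': 'Austin'}
```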

We build out the project, and here's a picture of an end-to-end pipeline from Kinesis into landing for raw data and then processed into Iceberg. From here, we've got a running pipeline that in real time is taking data from source events in Kinesis into a queryable Iceberg table at very low latency. Over the course of the event, we've processed almost two million bike sensor readings, and that number is growing, so we'll probably be much higher than that after the event finishes.

This is just an example of a running pipeline processing data into Iceberg. Here's an example of AWS Athena querying that Iceberg table with near real-time freshness. Whether you're using Athena or any other query engine that you might have access to, you can run those queries against the Iceberg tables.
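
As a sketch of that consumption path, any Iceberg-aware engine can query the table directly. With Athena and boto3 it might look like this, where the database, table, and results bucket are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Kick off a query against the Glue-cataloged Iceberg table.
execution = athena.start_query_execution(
    QueryString="""
        SELECT bike_id, avg(power_watts) AS avg_power
        FROM sensor_readings  -- hypothetical Iceberg table
        WHERE from_unixtime(ts) > current_timestamp - interval '5' minute
        GROUP BY bike_id
    """,
    QueryExecutionContext={"Database": "lakehouse"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
print("Query started:", execution["QueryExecutionId"])
```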

Lastly, how do you trust that data? The Qlik cloud platform has a concept of data quality. Every dataset we ingest is assigned a trust score based on the accuracy, validity, semantic quality, and completeness of that data. This means you can know not only that we're ingesting this data into Iceberg, but that we're delivering data you can trust. It's of the highest quality, and you can measure and monitor that over time.
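
Qlik does not spell out the scoring formula in this talk, so purely as a rough illustration of the idea, a naive completeness-style score over a batch of records could be computed like this (entirely hypothetical logic, not Qlik's actual trust score):

```python
def trust_score(records: list[dict], required: list[str]) -> float:
    """Naive quality score: the fraction of required fields that are present
    and non-null, averaged across records. Illustrative only."""
    if not records:
        return 0.0
    per_record = [
        sum(1 for field in required if r.get(field) is not None) / len(required)
        for r in records
    ]
    return round(100 * sum(per_record) / len(per_record), 1)

batch = [
    {"bike_id": "bike-17", "power_watts": 210.0, "ts": 1733000000.0},
    {"bike_id": "bike-04", "power_watts": None, "ts": 1733000001.0},
]
print(trust_score(batch, required=["bike_id", "power_watts", "ts"]))  # 83.3
```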

That's the Qlik cloud platform, and that's the example of real-time streaming ingestion into Iceberg. It's very simple. You set it up once, set it and forget it, and everything runs continuously. Everything is optimized for you, cost is managed and kept low, and that data is accessible to the widest possible variety of query engines. So if you'd like to try out this use case for yourself, definitely come by the booth.

You can ride the bikes and work up a little sweat, though it's only about a 32-second ride, so you probably won't work up too much of one. Try out the bikes for yourself, generate some data, and then we can walk you through the end-to-end data pipeline that you yourself have participated in. You can see how that data is ingested, processed, and then finally stored in Iceberg and served up to a variety of analytics use cases.

I do want to thank everyone for your time today. I'll be over there if anyone has any questions before the next session starts. If you want to learn more about Qlik, please come find us at the booth. Walk that way as far as you can, listen for the cowbells, or watch for the spinning Qlik logo. Come over, ride the bike, see a demo. We'd love to talk further.

Thank you all for your time today. I know it's a jam-packed agenda and I'm very thankful that you were able to spend 20 minutes with me here today. You can come find me right over there or in the Qlik booth over the rest of the conference. Thanks everyone.


This article is entirely auto-generated using Amazon Bedrock.
