DEV Community

Apache Doris

Building a Real-Time Lakehouse with S3 Tables, AWS Glue, and Apache Doris

We built a real-time lakehouse with S3 Tables, AWS Glue, and Apache Doris. In this solution, S3 Tables stores data in the Apache Iceberg format on Amazon S3. AWS Glue manages and organizes metadata and schema, providing a single catalog that connects all resources. And Apache Doris runs sub-second queries directly on those Iceberg tables: no ETL, no data copies, no complex architecture.

Together, S3 Tables, AWS Glue, and Apache Doris form a real-time lakehouse that combines the openness of a data lake with the high performance of a data warehouse, providing a key data foundation for AI and agentic workloads.

You get:

  • Unified metadata for easy table discovery and governance

  • Open Apache Iceberg tables on S3 with ACID, time-travel, and schema evolution

  • A high-performance query engine in Apache Doris, offering low latency and high concurrency

  • Interoperability across engines with Spark, Flink, Trino, Doris, and more

This is a practical, production-ready real-time lakehouse you can use to power dashboards, streaming analytics, or AI features directly from the data lake. The same approach also applies to many other open-source combinations: table formats like Iceberg and Paimon; catalogs like Unity, Polaris, and Gravitino; and query engines like Spark, Flink, and Trino.

Simple steps to replicate:

Let's see how to set up this solution in a demo. We will explore how to harness the power of Apache Doris, and how to configure a third-party engine to work with the AWS Glue Iceberg REST Catalog. The demo covers how to read and write data in S3 Tables through AWS Glue.

  1. Create S3 Table Buckets
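
The post gives no details for this step, so here is a minimal sketch of creating a table bucket from Python. It assumes the boto3 `s3tables` client and its `create_table_bucket(name=...)` call; the helper name `create_table_bucket` and the bucket name in the usage comment are our own placeholders, and the console or AWS CLI work just as well.

```python
def create_table_bucket(s3tables_client, bucket_name):
    """Create an S3 table bucket and return its ARN.

    `s3tables_client` is assumed to be boto3.client("s3tables", region_name=...).
    """
    response = s3tables_client.create_table_bucket(name=bucket_name)
    # The response is expected to carry the ARN of the new table bucket.
    return response["arn"]


# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   client = boto3.client("s3tables", region_name="<region>")
#   print(create_table_bucket(client, "<bucket_name>"))
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.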

  2. Create policy for Glue and S3 Tables

Use the following JSON policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<account_id>:catalog",
                "arn:aws:glue:<region>:<account_id>:catalog/s3tablescatalog",
                "arn:aws:glue:<region>:<account_id>:catalog/s3tablescatalog/<bucket_name>",
                "arn:aws:glue:<region>:<account_id>:table/s3tablescatalog/<bucket_name>/<db_name>/*",
                "arn:aws:glue:<region>:<account_id>:database/s3tablescatalog/<bucket_name>/<db_name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}
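
Filling in the ARN placeholders by hand is easy to get wrong, so here is a small sketch that builds the same policy from four values. The function name `build_glue_s3tables_policy` and the sample values in the usage section are illustrative assumptions; the actions and resource patterns are copied from the policy above.

```python
import json


def build_glue_s3tables_policy(region, account_id, bucket_name, db_name):
    """Return the policy shown above as a dict, with placeholders filled in."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    catalog = f"s3tablescatalog/{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "glue:GetCatalog",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:catalog/s3tablescatalog",
                    f"{prefix}:catalog/{catalog}",
                    f"{prefix}:table/{catalog}/{db_name}/*",
                    f"{prefix}:database/{catalog}/{db_name}",
                ],
            },
            {
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Sample values only; substitute your own region, account, bucket, and database.
    policy = build_glue_s3tables_policy(
        "us-east-1", "123456789012", "my-table-bucket", "gluedb"
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM policy editor as-is.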

  3. Attach the policy to your user

Search for the policy you just created and attach it to your user.
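
If you prefer scripting this step over clicking through the console, the sketch below creates the policy and attaches it in one go. It assumes a boto3 IAM client, whose `create_policy` and `attach_user_policy` calls it uses; the helper name, user name, and policy name are hypothetical.

```python
import json


def create_and_attach_policy(iam_client, user_name, policy_name, policy_document):
    """Create a managed policy from a dict and attach it to an IAM user.

    `iam_client` is assumed to be boto3.client("iam"). Returns the policy ARN.
    """
    response = iam_client.create_policy(
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
    policy_arn = response["Policy"]["Arn"]
    iam_client.attach_user_policy(UserName=user_name, PolicyArn=policy_arn)
    return policy_arn


# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   policy = json.load(open("policy.json"))  # the JSON policy from step 2
#   create_and_attach_policy(boto3.client("iam"), "doris-user", "DorisGlueS3Tables", policy)
```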

  4. Connect to the Iceberg catalog using SQL

-- Create Catalog
CREATE CATALOG my_glue_catalog properties (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'rest',
    'warehouse' = '<account_id>:s3tablescatalog/<bucket_name>',
    'iceberg.rest.uri' = 'https://glue.<region>.amazonaws.com/iceberg',
    'iceberg.rest.sigv4-enabled' = 'true',
    'iceberg.rest.signing-name' = 'glue',
    'iceberg.rest.signing-region' = '<region>',
    'iceberg.rest.access-key-id' = '<ak>',
    'iceberg.rest.secret-access-key' = '<sk>',
    'test_connection' = 'true'
);
-- Switch to the catalog
SWITCH my_glue_catalog;
-- View current existing databases
SHOW DATABASES;
-- Create a new database
CREATE DATABASE gluedb;
-- Change to the newly created database
USE gluedb;
-- Create a new Iceberg table
CREATE TABLE iceberg_table(id INT, name STRING);
-- Insert values into table
INSERT INTO iceberg_table VALUES(1, "Jacky");
-- Query the Iceberg table
SELECT * FROM iceberg_table;

Replace the placeholders in angle brackets with your own account ID, table bucket name, region, and access keys.
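
To make the substitution less error-prone, here is a rough sketch that renders the CREATE CATALOG statement from real values and refuses to emit anything that still contains a placeholder. The function name `render_create_catalog` and the sample values are assumptions for illustration; the property keys are the ones used in the statement above.

```python
def render_create_catalog(name, account_id, bucket_name, region, access_key, secret_key):
    """Render the CREATE CATALOG statement from this post with real values."""
    props = {
        "type": "iceberg",
        "iceberg.catalog.type": "rest",
        "warehouse": f"{account_id}:s3tablescatalog/{bucket_name}",
        "iceberg.rest.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "iceberg.rest.sigv4-enabled": "true",
        "iceberg.rest.signing-name": "glue",
        "iceberg.rest.signing-region": region,
        "iceberg.rest.access-key-id": access_key,
        "iceberg.rest.secret-access-key": secret_key,
        "test_connection": "true",
    }
    for key, value in props.items():
        # Catch unfilled <placeholders> and quotes that would break the SQL string.
        if any(ch in value for ch in "<>'"):
            raise ValueError(f"property {key!r} has an unfilled or unsafe value")
    body = ",\n".join(f"    '{key}' = '{value}'" for key, value in props.items())
    return f"CREATE CATALOG {name} PROPERTIES (\n{body}\n);"


if __name__ == "__main__":
    # Sample values only; never commit real access keys.
    print(render_create_catalog(
        "my_glue_catalog", "123456789012", "my-table-bucket",
        "us-east-1", "EXAMPLEACCESSKEY", "exampleSecretKey",
    ))
```

The resulting statement can be run from any MySQL-compatible client connected to Doris, followed by the SWITCH, CREATE DATABASE, and table statements above.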

Conclusion and Next Steps

A unified data foundation is what makes real-time analytics possible, and it is key for companies adopting large-scale AI and agentic workloads.

S3 Tables and AWS Glue provide an open, governed data layer, and Apache Doris delivers sub-second analytics directly on that data. This real-time lakehouse offers a simpler architecture, smarter governance, and AI readiness, allowing teams to query fresh information without complex ETL or data silos.