I wrote my previous article about Duckberg, a combination of PyIceberg and DuckDB to access Iceberg tables on AWS. But on Pi Day (March 14) DuckDB released a preview with native Iceberg integration for S3 Tables and the Glue catalog, for read and write. I'll show you how this works.
AWS, Create S3 Table Bucket
- Open S3
- Select Table Bucket
- Enable the integration with AWS analytics services (this only has to be done once)
- Create Table Bucket
- Give it a name like testtablebucket
AWS, Create S3 Table
- Select the bucket you created before
- Click Create Table with Athena
- Create a namespace or select an existing one. This is similar to a Glue database, but not the same. Create testnamespace
- This opens Athena
- If you haven't done this before, create a bucket and edit the workgroup to use this bucket as the query output location
- Execute the CREATE TABLE statement:
CREATE TABLE `testnamespace`.daily_sales (
sale_date date,
product_category string,
sales_amount double)
PARTITIONED BY (month(sale_date))
TBLPROPERTIES ('table_type' = 'iceberg')
- Create a new query to insert some test values
INSERT INTO daily_sales
VALUES
(DATE '2024-01-15', 'Laptop', 900.00),
(DATE '2024-01-15', 'Monitor', 250.00),
(DATE '2024-01-16', 'Laptop', 1350.00),
(DATE '2024-02-01', 'Monitor', 300.00),
(DATE '2024-02-01', 'Keyboard', 60.00),
(DATE '2024-02-02', 'Mouse', 25.00),
(DATE '2024-02-02', 'Laptop', 1050.00),
(DATE '2024-02-03', 'Laptop', 1200.00),
(DATE '2024-02-03', 'Monitor', 375.00);
- Test that the data is correctly loaded
SELECT
product_category,
COUNT(*) as units_sold,
SUM(sales_amount) as total_revenue,
AVG(sales_amount) as average_price
FROM daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;
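Based on the nine rows inserted above, this should return four groups (Athena's output formatting will differ slightly):
product_category  units_sold  total_revenue  average_price
Laptop                     2         2250.0         1125.0
Monitor                    2          675.0          337.5
Keyboard                   1           60.0           60.0
Mouse                      1           25.0           25.0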
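As an extra sanity check, Athena also exposes Iceberg metadata tables through a $ suffix on the table name; a quick sketch, assuming the standard Iceberg metadata-table syntax (e.g. $history for the snapshot history):
-- Show the Iceberg snapshots created by the INSERT above
SELECT * FROM "testnamespace"."daily_sales$history";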
DuckDB, Setup
This follows the DuckDB blog.
Set the credentials in your console with export AWS_PROFILE=myuser, or export the AWS key credentials directly.
Make sure you have DuckDB 1.2.1 or newer installed.
Run the new DuckDB web interface with: duckdb -ui
FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;
CREATE SECRET (
TYPE s3,
PROVIDER credential_chain
);
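If you use access keys instead of a profile, a minimal sketch of an explicit secret works as well (the key values below are placeholders, not real credentials):
CREATE SECRET (
    TYPE s3,
    KEY_ID 'AKIA...',    -- placeholder: your access key id
    SECRET '...',        -- placeholder: your secret access key
    REGION 'eu-west-1'
);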
ATTACH 'arn:aws:s3tables:eu-west-1:<your_account>:bucket/testtablebucket'
AS s3_tables_db (
TYPE iceberg,
ENDPOINT_TYPE s3_tables
);
Replace <your_account> with your account ID and use the correct S3 table bucket name.
Run SHOW ALL TABLES; to view the internal tables and the available S3 tables.
DuckDB, Query Data
Run: FROM s3_tables_db.testnamespace.daily_sales;
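This FROM-first syntax is DuckDB shorthand for SELECT * FROM .... Regular SQL works against the attached catalog as well; for example, the February aggregation from the Athena test, now running locally:
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue
FROM s3_tables_db.testnamespace.daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;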
DuckDB, Alter Data
In the blog post there is an option to use ALTER TABLE, but unfortunately I still get: Not implemented Error: Alter Schema Entry
ALTER TABLE s3_tables_db.testnamespace.daily_sales
ADD COLUMN test STRING;
Maybe this only works with the Iceberg REST method.
DuckDB, Attach Catalog via Iceberg REST
The steps above access the tables directly via the S3 Tables endpoint, but the latest extension also supports the Iceberg REST API of the Glue Catalog / AWS SageMaker Lakehouse to access the data through a standardized method.
ATTACH '311141556126:s3tablescatalog/testnamespace'
AS glue_db (
TYPE iceberg,
ENDPOINT_TYPE glue
);
show all tables;
Gives me:
IO Error: Failed to query https://glue.eu-central-1.amazonaws.com/iceberg/v1/catalogs/311141556126:s3tablescatalog:testnamespace/namespaces, http error 404 thrown. Message: {"error":{"code":404,"message":"Catalog not found.","type":"EntityNotFoundException"}}
My PyIceberg script also only sees the Glue databases, not the S3 Tables namespace.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "testnamespace",
    **{
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
Time for some more investigation.
Update, Iceberg Catalog for S3 Tables
After some searching I found that there is a special Iceberg REST API endpoint for S3 Tables. The details can be found [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html).
Going back to the previous article with PyIceberg:
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "arn:aws:s3tables:eu-west-1:311141556126:bucket/testtablebucket",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())
This gives the following output:
[('testnamespace',)]
[('testnamespace', 'daily_sales')]
sale_date product_category sales_amount
0 2024-01-15 Laptop 900.0
1 2024-01-15 Monitor 250.0
2 2024-01-16 Laptop 1350.0
3 2024-02-01 Monitor 300.0
4 2024-02-01 Keyboard 60.0
5 2024-02-02 Mouse 25.0
6 2024-02-02 Laptop 1050.0
7 2024-02-03 Laptop 1200.0
8 2024-02-03 Monitor 375.0
Lake Formation and S3 Tables Integration
Using Lake Formation it's possible to access the tables through the Glue Iceberg REST endpoint. See the AWS documentation.
In Lake Formation -> Administration -> Application integration settings, enable 'Allow external engines to access data in Amazon S3 locations with full table access' and click Save.
Then grant permissions on the database and on the tables in the Lake Formation console.
Now we can query the S3 table through the Glue Iceberg endpoint instead of the S3 Tables endpoint.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "311141556126:s3tablescatalog/testtablebucket",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())
Combining this in DuckDB (note that the warehouse path now uses the bucket name, testtablebucket, instead of the namespace):
ATTACH '311141556126:s3tablescatalog/testtablebucket'
AS glue_db (
TYPE iceberg,
ENDPOINT_TYPE glue
);
show all tables;
Querying data now works.
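For example, the same table now resolves through the Glue endpoint:
SELECT *
FROM glue_db.testnamespace.daily_sales
WHERE product_category = 'Laptop';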
The only error I get is when selecting the daily_sales table in the UI and clicking Query in current notebook: Catalog Error: SET schema: No catalog + schema named "glue_db.main" found. Maybe a bug somewhere, since most DuckDB databases have a main schema. But that's a minor detail.
With very few improvements, the DuckDB UI can easily compete with AWS Athena. It's fast and easy to use.
Maybe in the next blog I'll do some performance tests.