Jacob
DuckDB, S3 Tables with Iceberg using the Iceberg REST API

In my previous article I wrote about Duckberg, a combination of PyIceberg and DuckDB to access Iceberg tables on AWS.

But on Pi Day (14 March) DuckDB released a preview with native Iceberg/Glue integration for read and write. I'll show you how this works.

AWS, create an S3 table bucket

  • Open S3
  • Select Table buckets
  • Enable the integration with AWS analytics services (this only has to be done once)
  • Create Table Bucket
  • Give it a name like testtablebucket (a scripted boto3 alternative is sketched below)

AWS Create S3 Table Bucket
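If you prefer to script the setup, the same steps can be done with boto3. A minimal sketch, assuming a recent boto3 release that ships the s3tables client and credentials allowed to create table buckets; the analytics-services integration toggle still has to be enabled once in the console. It also creates the namespace used in the next section:

import boto3

# Assumes a recent boto3 version that includes the "s3tables" client.
s3tables = boto3.client("s3tables", region_name="eu-west-1")

# Create the table bucket (console: Create Table Bucket).
bucket = s3tables.create_table_bucket(name="testtablebucket")
print(bucket["arn"])

# Create the namespace inside the bucket (console: Create namespace).
s3tables.create_namespace(
    tableBucketARN=bucket["arn"],
    namespace=["testnamespace"],
)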

AWS, create an S3 table

  • Select the bucket you created before
  • Click Create Table with Athena
  • Create a namespace or select an existing one. This is similar to a Glue database, but not the same. Create testnamespace
  • This opens Athena
  • If you haven't done this before, create an S3 bucket and edit the Athena workgroup to use it as the query output location
  • Execute the CREATE TABLE statement below
CREATE TABLE `testnamespace`.daily_sales (
  sale_date date,
  product_category string,
  sales_amount double)
PARTITIONED BY (month(sale_date))
TBLPROPERTIES ('table_type' = 'iceberg')
  • Create a new query to insert some test values
INSERT INTO daily_sales
VALUES
(DATE '2024-01-15', 'Laptop', 900.00),
(DATE '2024-01-15', 'Monitor', 250.00),
(DATE '2024-01-16', 'Laptop', 1350.00),
(DATE '2024-02-01', 'Monitor', 300.00),
(DATE '2024-02-01', 'Keyboard', 60.00),
(DATE '2024-02-02', 'Mouse', 25.00),
(DATE '2024-02-02', 'Laptop', 1050.00),
(DATE '2024-02-03', 'Laptop', 1200.00),
(DATE '2024-02-03', 'Monitor', 375.00);
  • Test that the data is correctly loaded (a scripted boto3 alternative for these Athena steps follows after this block)
SELECT 
product_category,
COUNT(*) as units_sold,
SUM(sales_amount) as total_revenue,
AVG(sales_amount) as average_price
FROM daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' and DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;
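The same statements can also be submitted to Athena programmatically instead of through the console. A minimal sketch with boto3; the workgroup name "primary" and the catalog name "s3tablescatalog" are assumptions (the workgroup needs its output location configured, and the catalog is the one created by the analytics integration):

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def run_query(sql: str):
    """Submit a query and poll Athena until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={
            "Catalog": "s3tablescatalog",  # assumed: catalog created by the integration
            "Database": "testnamespace",
        },
        WorkGroup="primary",  # assumes the output location is set on this workgroup
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, qid
        time.sleep(1)

state, qid = run_query("SELECT COUNT(*) FROM daily_sales")
print(state)
print(athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"])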

DuckDB, setup

This follows the DuckDB blog.

Set the credentials in your console, with export AWS_PROFILE=myuser or the AWS key credentials.

Make sure you have DuckDB 1.2.1 or newer installed.

Run the new DuckDB web interface with: duckdb -ui


FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;

CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);


ATTACH 'arn:aws:s3tables:eu-west-1:<your_account>:bucket/testtablebucket'
    AS s3_tables_db (
        TYPE iceberg,
        ENDPOINT_TYPE s3_tables
    );

Replace <your_account> with your account ID and use the correct S3 table bucket name.

Run SHOW ALL TABLES; to view the internal tables and the available S3 tables.


DuckDB, query data

Run: FROM s3_tables_db.testnamespace.daily_sales;
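The whole flow also works from Python through the duckdb package instead of the UI. A sketch under the same assumptions as above (credential chain configured, account ID filled in):

import duckdb

con = duckdb.connect()

# Same preview extensions as in the UI session above.
for ext in ("aws", "httpfs", "iceberg"):
    con.execute(f"FORCE INSTALL {ext} FROM core_nightly")

# Pick up AWS credentials from the profile/environment chain.
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

con.execute("""
    ATTACH 'arn:aws:s3tables:eu-west-1:<your_account>:bucket/testtablebucket'
        AS s3_tables_db (TYPE iceberg, ENDPOINT_TYPE s3_tables)
""")

# Fetch the table into a pandas DataFrame.
print(con.sql("FROM s3_tables_db.testnamespace.daily_sales").df())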


DuckDB, alter data

In the blog post there is an option to use ALTER TABLE; unfortunately I still get: Not implemented Error: Alter Schema Entry

ALTER TABLE s3_tables_db.testnamespace.daily_sales
  ADD COLUMN test STRING;

Maybe this only works via the Iceberg REST method.

DuckDB, attach catalog via Iceberg REST

The steps above access the tables directly through the S3 Tables endpoint, but the latest extension also supports the Iceberg REST API of the Glue Catalog/AWS SageMaker Lakehouse to access the data through a standardized method.

ATTACH '311141556126:s3tablescatalog/testnamespace'
    AS glue_db (
        TYPE iceberg,
        ENDPOINT_TYPE glue
    );
show all tables;

Gives me:

IO Error: Failed to query https://glue.eu-central-1.amazonaws.com/iceberg/v1/catalogs/311141556126:s3tablescatalog:testnamespace/namespaces, http error 404 thrown. Message: {"error":{"code":404,"message":"Catalog not found.","type":"EntityNotFoundException"}}

My PyIceberg script also only sees the Glue databases, not the S3 Tables namespace.

from pyiceberg.catalog import load_catalog

region = 'eu-west-1'

rest_catalog = load_catalog(
    "testnamespace",
    **{
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    },
)
print(rest_catalog.list_namespaces())

Time for some more investigation.

Update, Iceberg catalog for S3 Tables

After some searching I found that there is a dedicated Iceberg REST API endpoint for S3 Tables. The details can be found [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html).

Going back to the previous article with PyIceberg:

from pyiceberg.catalog import load_catalog

region = 'eu-west-1'

rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "arn:aws:s3tables:eu-west-1:311141556126:bucket/testtablebucket",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    },
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())

Gives this output:

[('testnamespace',)]
[('testnamespace', 'daily_sales')]
    sale_date product_category  sales_amount
0  2024-01-15           Laptop         900.0
1  2024-01-15          Monitor         250.0
2  2024-01-16           Laptop        1350.0
3  2024-02-01          Monitor         300.0
4  2024-02-01         Keyboard          60.0
5  2024-02-02            Mouse          25.0
6  2024-02-02           Laptop        1050.0
7  2024-02-03           Laptop        1200.0
8  2024-02-03          Monitor         375.0

Lake Formation and S3 Tables integration

Using Lake Formation it's possible to access the tables through the Glue Iceberg REST endpoint. See the AWS documentation.

In Lake Formation -> Administration -> Application integration settings, enable "Allow external engines to access data in Amazon S3 locations with full table access" and click Save.

Grant Permissions to the database:


Grant Permissions to the tables:

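The two console grants can also be scripted. A rough boto3 sketch; the catalog ID format (<account>:s3tablescatalog/<bucket>) follows the federated catalog naming from the AWS documentation, and the analytics-role grantee is a hypothetical placeholder for your own principal:

import boto3

lf = boto3.client("lakeformation", region_name="eu-west-1")

# Hypothetical principal; replace with the IAM role/user that should query the tables.
principal = {"DataLakePrincipalIdentifier": "arn:aws:iam::311141556126:role/analytics-role"}

# Assumed: the federated S3 Tables catalog uses the <account>:s3tablescatalog/<bucket> ID format.
catalog_id = "311141556126:s3tablescatalog/testtablebucket"

# Grant on the database (the namespace).
lf.grant_permissions(
    Principal=principal,
    Resource={"Database": {"CatalogId": catalog_id, "Name": "testnamespace"}},
    Permissions=["DESCRIBE"],
)

# Grant on all tables in the namespace.
lf.grant_permissions(
    Principal=principal,
    Resource={"Table": {"CatalogId": catalog_id, "DatabaseName": "testnamespace", "TableWildcard": {}}},
    Permissions=["SELECT", "DESCRIBE"],
)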

Now we can query the S3 table through the Glue Iceberg endpoint instead of the S3 Tables endpoint.

from pyiceberg.catalog import load_catalog

region = 'eu-west-1'

rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "311141556126:s3tablescatalog/testtablebucket",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    },
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())

Combining this in DuckDB

ATTACH '311141556126:s3tablescatalog/testtablebucket'
    AS glue_db (
        TYPE iceberg,
        ENDPOINT_TYPE glue
    );

show all tables;


Querying the data now works.


The only error I get is when selecting the daily_sales table in the UI and clicking Query in current notebook: Catalog Error: SET schema: No catalog + schema named "glue_db.main" found. Maybe a bug somewhere, since most DuckDB databases have a main schema. But that's a minor detail.


With very few improvements, the DuckDB UI can easily compete with AWS Athena. It's fast and easy to use.

Maybe in the next blog I'll do some performance tests.
