I wrote my previous article about Duckberg, a combination of PyIceberg and DuckDB to access Iceberg tables on AWS. But on Pi Day (March 14) DuckDB released a preview with native Iceberg integration for S3 Tables and the Glue catalog, for read and write. I'll show you how this works.
AWS, Create S3 Table Bucket
- Open S3
- Select Table Bucket
- Enable the integration with AWS analytics services (this only has to be done once)
- Create Table Bucket
- Give it a name like testtablebucket
AWS, Create S3 Table
- Select the bucket you created before
- Click Create Table with Athena
- Create a namespace or select an existing one. This is similar to a Glue database, but not the same. Create testnamespace
- This opens Athena
- If you haven't done this before, create a bucket and edit the workgroup to use this bucket as the query output location
- Execute the CREATE TABLE statement:
CREATE TABLE `testnamespace`.daily_sales (
sale_date date,
product_category string,
sales_amount double)
PARTITIONED BY (month(sale_date))
TBLPROPERTIES ('table_type' = 'iceberg')
- Create a new query to insert some test values
INSERT INTO daily_sales
VALUES
(DATE '2024-01-15', 'Laptop', 900.00),
(DATE '2024-01-15', 'Monitor', 250.00),
(DATE '2024-01-16', 'Laptop', 1350.00),
(DATE '2024-02-01', 'Monitor', 300.00),
(DATE '2024-02-01', 'Keyboard', 60.00),
(DATE '2024-02-02', 'Mouse', 25.00),
(DATE '2024-02-02', 'Laptop', 1050.00),
(DATE '2024-02-03', 'Laptop', 1200.00),
(DATE '2024-02-03', 'Monitor', 375.00);
- Test that the data is correctly loaded
SELECT
product_category,
COUNT(*) as units_sold,
SUM(sales_amount) as total_revenue,
AVG(sales_amount) as average_price
FROM daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;
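Based on the nine rows inserted above, this should return four groups (Athena's output formatting will differ slightly):
product_category  units_sold  total_revenue  average_price
Laptop                     2         2250.0         1125.0
Monitor                    2          675.0          337.5
Keyboard                   1           60.0           60.0
Mouse                      1           25.0           25.0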
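As an extra sanity check, Athena also exposes Iceberg metadata tables through a $ suffix on the table name; a quick sketch, assuming the standard Iceberg metadata-table syntax (e.g. $history for the snapshot history):
-- Show the Iceberg snapshots created by the INSERT above
SELECT * FROM "testnamespace"."daily_sales$history";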
DuckDB, Setup
This follows the DuckDB blog.
Set the credentials in your console with export AWS_PROFILE=myuser, or export the AWS key credentials directly.
Make sure you have DuckDB 1.2.1 or newer installed.
Run the new DuckDB web interface with: duckdb -ui
FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;
CREATE SECRET (
TYPE s3,
PROVIDER credential_chain
);
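If you use access keys instead of a profile, a minimal sketch of an explicit secret works as well (the key values below are placeholders, not real credentials):
CREATE SECRET (
    TYPE s3,
    KEY_ID 'AKIA...',    -- placeholder: your access key id
    SECRET '...',        -- placeholder: your secret access key
    REGION 'eu-west-1'
);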
ATTACH 'arn:aws:s3tables:eu-west-1:<your_account>:bucket/testtablebucket'
AS s3_tables_db (
TYPE iceberg,
ENDPOINT_TYPE s3_tables
);
Replace <your_account> with your account ID and use the correct S3 table bucket name.
Run SHOW ALL TABLES; to view the internal tables and the available S3 tables.
DuckDB, Query Data
Run: FROM s3_tables_db.testnamespace.daily_sales;
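This FROM-first syntax is DuckDB shorthand for SELECT * FROM .... Regular SQL works against the attached catalog as well; for example, the February aggregation from the Athena test, now running locally:
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue
FROM s3_tables_db.testnamespace.daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;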
DuckDB, Alter Data
In the blog post there is an option to use ALTER TABLE, but unfortunately I still get: Not implemented Error: Alter Schema Entry
ALTER TABLE s3_tables_db.testnamespace.daily_sales
ADD COLUMN test STRING;
Maybe this only works with the Iceberg REST method.
DuckDB, Attach Catalog via Iceberg REST
The steps above access the tables directly via the S3 Tables endpoint, but the latest extension also supports the Iceberg REST API of the Glue Catalog / AWS SageMaker Lakehouse to access the data through a standardized method.
ATTACH '311141556126:s3tablescatalog/testnamespace'
AS glue_db (
TYPE iceberg,
ENDPOINT_TYPE glue
);
show all tables;
Gives me:
IO Error: Failed to query https://glue.eu-central-1.amazonaws.com/iceberg/v1/catalogs/311141556126:s3tablescatalog:testnamespace/namespaces, http error 404 thrown. Message: {"error":{"code":404,"message":"Catalog not found.","type":"EntityNotFoundException"}}
My PyIceberg script also only sees the Glue databases, not the S3 Tables namespace.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "testnamespace",
    **{
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
Time for some more investigation.
Update, Iceberg Catalog for S3 Tables
After some searching I found that there is a special Iceberg REST API endpoint for S3 Tables. The details can be found [here](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html).
Going back to the previous article with PyIceberg:
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "arn:aws:s3tables:eu-west-1:311141556126:bucket/testtablebucket",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())
This gives the following output:
[('testnamespace',)]
[('testnamespace', 'daily_sales')]
sale_date product_category sales_amount
0 2024-01-15 Laptop 900.0
1 2024-01-15 Monitor 250.0
2 2024-01-16 Laptop 1350.0
3 2024-02-01 Monitor 300.0
4 2024-02-01 Keyboard 60.0
5 2024-02-02 Mouse 25.0
6 2024-02-02 Laptop 1050.0
7 2024-02-03 Laptop 1200.0
8 2024-02-03 Monitor 375.0
Lake Formation and S3 Tables Integration
Using Lake Formation it's possible to access the tables through the Glue Iceberg REST endpoint. See the AWS documentation.
In Lake Formation -> Administration -> Application integration settings, enable 'Allow external engines to access data in Amazon S3 locations with full table access' and click Save.
Then grant permissions on the database and on the tables in the Lake Formation console.
Now we can query the S3 table through the Glue Iceberg endpoint instead of the S3 Tables endpoint.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "warehouse": "311141556126:s3tablescatalog/testtablebucket",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }
)
print(rest_catalog.list_namespaces())
print(rest_catalog.list_tables("testnamespace"))
print(rest_catalog.load_table("testnamespace.daily_sales").scan().to_pandas())
Combining this in DuckDB (note that the warehouse path now uses the bucket name, testtablebucket, instead of the namespace):
ATTACH '311141556126:s3tablescatalog/testtablebucket'
AS glue_db (
TYPE iceberg,
ENDPOINT_TYPE glue
);
show all tables;
Querying data now works.
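For example, the same table now resolves through the Glue endpoint:
SELECT *
FROM glue_db.testnamespace.daily_sales
WHERE product_category = 'Laptop';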
The only error I get is when selecting the daily_sales table in the UI and clicking Query in current notebook: Catalog Error: SET schema: No catalog + schema named "glue_db.main" found. Maybe a bug somewhere, since most DuckDB databases have a main schema. But that's a minor detail.
With very few improvements, the DuckDB UI can easily compete with AWS Athena. It's fast and easy to use.
Maybe in the next blog I'll do some performance tests.