RisingWave Labs

Posted on • Originally published at risingwave.com

Introducing RisingWave's Hosted Iceberg Catalog: No External Setup Needed

At RisingWave, our goal is to simplify the process of building real-time data applications. A key part of this is enabling users to build modern, open data architectures. That’s why we developed the Iceberg Table Engine (see the Iceberg table engine docs), which allows you to stream data directly into tables using the open Apache Iceberg format. This is a powerful way to build a streaming lakehouse where your data is immediately available for both real-time and batch analytics.

However, using any Iceberg engine traditionally requires a first, crucial step: setting up and configuring an Iceberg catalog. This catalog is responsible for managing the table metadata. While flexible, this often means provisioning and managing a separate service like AWS Glue, a dedicated PostgreSQL database for the JDBC catalog, or a REST service. This adds an extra layer of configuration and operational overhead before you can even write your first line of data.

To simplify this process, we're excited to introduce the Hosted Iceberg Catalog in RisingWave, a fully managed, internal catalog option that removes external dependencies.

Our Solution: The New Hosted Iceberg Catalog

The Hosted Iceberg Catalog is a built-in option that lets you use RisingWave's own metadata store as a fully functional Iceberg catalog. You don't need to set up anything externally.

The difference is best shown with code. Previously, to connect to an external JDBC catalog (using Iceberg’s JDBC Catalog), your connection setup would look something like this:

The Old Way: Connecting to an External Catalog

CREATE CONNECTION external_jdbc_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://hummock001/iceberg-data',
    s3.access.key = '...',
    s3.secret.key = '...',
    catalog.type = 'jdbc',
    catalog.uri = 'jdbc:postgresql://external-postgres:5432/iceberg_meta', -- External DB
    catalog.jdbc.user = 'user',
    catalog.jdbc.password = 'password',
    catalog.name = 'dev'
);

Now, with the hosted catalog, the setup is radically simpler. You just need to tell RisingWave to manage the catalog for you by adding a single parameter:

CREATE CONNECTION my_hosted_catalog_conn WITH (
    type = 'iceberg',
    warehouse.path = 's3://your/warehouse/path',
    s3.access.key = 'xxxxx',
    s3.secret.key = 'yyyyy',
    s3.endpoint = 'your_s3_endpoint',
    hosted_catalog = true  -- This is all it takes!
);

That’s it. With hosted_catalog = true, RisingWave handles the catalog setup internally, allowing you to get started with the Iceberg engine in minutes (see our blog announcement).

How It Works Under the Hood

When you enable the hosted catalog, RisingWave uses its internal PostgreSQL-based metastore to manage Iceberg's metadata. It exposes two system views, iceberg_tables and iceberg_namespace_properties, which contain the necessary catalog information.

Most importantly, this implementation follows the standard Iceberg JDBC Catalog protocol, so it's fully interoperable with external tools.
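Because the catalog state is stored in ordinary system views, you can inspect it with plain SQL. A minimal sketch, assuming the views follow the standard Iceberg JDBC catalog schema (columns such as `catalog_name`, `table_namespace`, `table_name`, and `metadata_location`); exact columns may vary by RisingWave version:

```sql
-- List every Iceberg table the hosted catalog knows about,
-- including the metadata file location each table points to.
SELECT * FROM iceberg_tables;

-- List namespace-level properties tracked by the catalog.
SELECT * FROM iceberg_namespace_properties;
```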

Getting Started: A Practical Example

Let's walk through the three simple steps to create and populate your first table using the hosted catalog.

Step 1: Create the Connection
(Create with hosted_catalog = true, as shown above.)

Step 2: Set the Active Connection

SET iceberg_engine_connection = 'public.my_hosted_catalog_conn';

Step 3: Create and Populate Your Table

CREATE TABLE t_hosted_catalog (id INT PRIMARY KEY, name VARCHAR) 
ENGINE = iceberg;

INSERT INTO t_hosted_catalog VALUES (1, 'RisingWave');
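To confirm the write landed, query the table back from RisingWave. A quick sanity check (depending on your configuration, you may need to run `FLUSH;` first so the insert is visible to reads):

```sql
-- Force pending writes to become visible, then read the row back.
FLUSH;
SELECT * FROM t_hosted_catalog;
```

The single row inserted above (`1, 'RisingWave'`) should come back, confirming the table is live in the hosted catalog.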

Interoperability: Connecting with External Tools like Spark, Trino, and Flink

Because the hosted catalog is a standard JDBC catalog, tools like Spark, Trino, and Flink can still access your tables. For example:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,\
org.postgresql:postgresql:42.7.4 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
    --conf spark.sql.catalog.dev.warehouse=s3://your/warehouse/path \
    --conf spark.sql.catalog.dev.uri="jdbc:postgresql://risingwave-hostname:4566/dev" \
    --conf spark.sql.catalog.dev.jdbc.user="your_rw_user" \
    --conf spark.sql.catalog.dev.jdbc.password="your_rw_password"
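With that session open, the table created earlier should be visible under the configured catalog name (`dev` in this example). The `public` namespace here is an assumption matching the RisingWave schema used above:

```sql
-- Run inside the spark-sql session started above.
SHOW TABLES IN dev.public;
SELECT * FROM dev.public.t_hosted_catalog;
```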

This ensures your data remains open and accessible within the broader data ecosystem—not locked behind RisingWave.

Summary and Next Steps

The Hosted Iceberg Catalog feature:

  • Simplifies Setup – no need for external catalog services like AWS Glue or PostgreSQL.
  • Reduces Operational Overhead – no extra catalog service to manage.
  • Ensures Openness – compatible with Spark, Trino, Flink, and other tools.
  • Creates Integrated Workflows – stream data, manage tables, and query everything from within RisingWave.

Learn more in the official documentation. We’d love your feedback—join our Slack community or check out our GitHub repo.
