Introduction
Apache Iceberg is one of three popular open table formats (OTFs), alongside Apache Hudi (originally from Uber) and Delta Lake (from Databricks).
In this post:
- The Iceberg specification
- A hands-on with Docker Compose
- A look at the metadata
What is an OTF?
Turn files into tables
An open table format is a specification for organizing a collection of files that hold the same kind of data so that they can be presented as a single table.
The idea is that we want all these files to be viewable and updatable as if they were a single entity: the table. We can then interact with this collection of files the same way we interact with a table in a database.
Various parties must implement this specification to produce usable software.
The Apache Iceberg specification
To implement the Apache Iceberg specification, we need three things:
- Catalog: keeps track of all the metadata files
- Processing engine: e.g., a query engine
- Scalable storage: e.g., object storage
The processing engine is the compute node that ties everything together. We issue commands to it for:
- Creating tables
- Inserting data into tables
- Querying tables
Logical Diagram
- The REST catalog uses MinIO for storing metadata.
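To make the wiring concrete, here is a minimal PySpark configuration sketch. The catalog name demo, the endpoints, and the warehouse path are assumptions modeled on the typical quickstart docker-compose file; the spark-iceberg container already ships with equivalent settings, so you normally don't need to set them yourself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Enable Iceberg's SQL extensions
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog: a REST catalog service keeps track of each table's metadata files
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")
    # Scalable storage: MinIO (S3-compatible) holds the metadata and data files
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "s3://warehouse/")
    .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)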
Iceberg's Data Architecture
- Catalog: the processing engine connects to the catalog to get the list of tables. For each table, the catalog points to its current metadata file.
- Metadata file: holds the table's schema, partition spec, and snapshots (including previous schemas and partition specs).
- Manifest list: represents a single snapshot (s1, s2, ...); it lists the manifest files that make up that snapshot.
- Manifest files: each points to one or more data files (Parquet) and stores column-level information, e.g., min/max values for each data file, which is important for efficient query execution (files that cannot match a query are skipped).
- Data files: the actual data, typically in Parquet format.
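Once a table exists (such as climate.weather created in the hands-on below), you can inspect these layers through Iceberg's metadata tables in Spark. A minimal sketch from PySpark; exact columns can vary between Iceberg versions:
# Snapshots: each snapshot points to one manifest list
spark.sql("SELECT snapshot_id, operation, manifest_list FROM climate.weather.snapshots").show(truncate=False)

# Manifest files belonging to the table's snapshots
spark.sql("SELECT path, added_data_files_count FROM climate.weather.manifests").show(truncate=False)

# Data files with their column-level statistics (used to prune files at query time)
spark.sql("SELECT file_path, record_count, lower_bounds, upper_bounds FROM climate.weather.files").show(truncate=False)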
Hands-on
Start the stack and open a Spark SQL shell (this assumes a docker-compose.yml such as the one from the Apache Iceberg Spark quickstart, which defines the spark-iceberg, REST catalog, and MinIO services):
$ docker compose up
$ docker exec -it spark-iceberg spark-sql
Create a database
❯ docker exec -it spark-iceberg spark-sql
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/16 08:01:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/16 08:01:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark Web UI available at http://e5356708c7af:4041
Spark master: local[*], Application Id: local-1758009676374
spark-sql ()> CREATE DATABASE IF NOT EXISTS climate;
Time taken: 0.537 seconds
Create a table
spark-sql ()> CREATE TABLE IF NOT EXISTS climate.weather (
> datetime timestamp,
> temp double,
> lat double,
> long double,
> cloud_coverage string,
> precip double,
> wind_speed double
> )
> USING iceberg
> PARTITIONED BY (days(datetime))
> ;
Time taken: 0.911 seconds
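Behind the scenes, CREATE TABLE writes the table's first metadata JSON file into the metadata/ directory under the table's location in object storage (s3://warehouse/climate/weather in this setup). Its contents look like this: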
{
"format-version": 2,
"table-uuid": "195f0af4-6ff1-4ea1-8def-73c85fa2d483",
"location": "s3://warehouse/climate/weather",
"last-sequence-number": 0,
"last-updated-ms": 1758009757603,
"last-column-id": 7,
"current-schema-id": 0,
"schemas": [
{
"type": "struct",
"schema-id": 0,
"fields": [
{
"id": 1,
"name": "datetime",
"required": false,
"type": "timestamptz"
},
{
"id": 2,
"name": "temp",
"required": false,
"type": "double"
},
{
"id": 3,
"name": "lat",
"required": false,
"type": "double"
},
{
"id": 4,
"name": "long",
"required": false,
"type": "double"
},
{
"id": 5,
"name": "cloud_coverage",
"required": false,
"type": "string"
},
{
"id": 6,
"name": "precip",
"required": false,
"type": "double"
},
{
"id": 7,
"name": "wind_speed",
"required": false,
"type": "double"
}
]
}
],
"default-spec-id": 0,
"partition-specs": [
{
"spec-id": 0,
"fields": [
{
"name": "datetime_day",
"transform": "day",
"source-id": 1,
"field-id": 1000
}
]
}
],
"last-partition-id": 1000,
"default-sort-order-id": 0,
"sort-orders": [
{
"order-id": 0,
"fields": []
}
],
"properties": {
"owner": "root",
"write.parquet.compression-codec": "zstd"
},
"current-snapshot-id": -1,
"refs": {},
"snapshots": [],
"statistics": [],
"partition-statistics": [],
"snapshot-log": [],
"metadata-log": []
}
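A few things worth noticing: "current-snapshot-id" is -1 and "snapshots" is empty because no data has been written yet, "schemas" records each column with a stable field id, and "partition-specs" captures the days(datetime) transform from the DDL as the datetime_day partition field.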
Adding Data
Let's add data from a Jupyter notebook (using PySpark):
from datetime import datetime

# Reuse the table's schema so the DataFrame columns match the Iceberg table
schema = spark.table("climate.weather").schema

data = [
    (datetime(2023, 8, 16), 76.2, 40.951908, -74.075272, "Partially sunny", 0.0, 3.5),
    (datetime(2023, 8, 17), 82.5, 40.951908, -74.075272, "Sunny", 0.0, 1.2),
    (datetime(2023, 8, 18), 70.9, 40.951908, -74.075272, "Cloudy", 0.5, 5.2),
]

df = spark.createDataFrame(data, schema)

# append() writes new data files and commits a new snapshot to the table
df.writeTo("climate.weather").append()
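To close the loop, we can query the table like any database table and confirm that the append landed in per-day partitions (a sketch from the same notebook; the new snapshot also shows up in climate.weather.snapshots):
# Query through the engine, as if this were a regular database table
spark.sql("SELECT datetime, temp, cloud_coverage FROM climate.weather ORDER BY datetime").show()

# Because of PARTITIONED BY (days(datetime)), each day's rows end up in their own partition
spark.sql("SELECT partition, file_path FROM climate.weather.files").show(truncate=False)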