Configuring Gravitino Lance REST Service

Author: Qi Yu
Last Updated: 2026-01-23

Overview

In this tutorial, you will learn how to configure and use the Gravitino Lance REST service. By the end of this guide, you'll have a fully functional Lance REST service that enables Lance clients to interact with Gravitino through HTTP APIs.

The Gravitino Lance REST service provides a RESTful interface for managing Lance datasets, implementing the standard Lance REST API. It acts as a centralized catalog service that allows Lance clients (like Spark and Ray) to discover and access Lance datasets managed by Gravitino.

Key concepts:

  • Lance REST catalog: A standard HTTP API for Lance dataset operations
  • Gravitino Lance REST service: Implements the Lance REST API and integrates with Gravitino's metadata system
  • Unified Metadata: Stores Lance dataset metadata in Gravitino, enabling centralized governance

The REST endpoint base path is http://<host>:<port>/lance/.

Architecture overview:

[Figure: Gravitino Lance REST service architecture]

Prerequisites

Before starting this tutorial, you will need:

System Requirements:

  • Linux or macOS operating system with outbound internet access for downloads
  • Python environment (3.10+) for running PySpark or Ray clients

Required Components:

  • Apache Gravitino server distribution (version 1.1.0), downloaded or built as described in Step 1

Optional Components:

  • Apache Spark with Lance runtime JARs for client verification (recommended for testing)
  • Ray framework for distributed Lance data processing

Before proceeding, verify your Python installation and install required packages:

python --version
pip install pyspark==3.5.0 lance-ray==0.1.0 lance-namespace

Setup

Step 1: Start a Gravitino server with Lance REST service

Use this approach if you want the Lance REST service embedded in a full Gravitino server (with Web UI, unified REST APIs, etc.).

Configure Lance REST as an auxiliary service

1. Install Gravitino server distribution

Follow the previous tutorial 02-setup-guide/README.md to download or build the Gravitino server package.

2. Enable Lance REST as an auxiliary service

Modify conf/gravitino.conf to enable the lance-rest service and configure it:

# Enable the Lance REST auxiliary service
gravitino.auxService.names = lance-rest
# Port and bind address for the Lance REST endpoint
gravitino.lance-rest.httpPort = 9101
gravitino.lance-rest.host = 0.0.0.0
# Store Lance dataset metadata in Gravitino, via this server's API and metalake
gravitino.lance-rest.namespace-backend = gravitino
gravitino.lance-rest.gravitino-uri = http://localhost:8090
gravitino.lance-rest.gravitino-metalake = lance_metalake

Note: The lance_metalake metalake must exist in Gravitino before you access the Lance REST service. If it doesn't exist yet, you can create it via the Gravitino REST API or Web UI after starting the server (see step 4 below).

3. Start the Gravitino server

./bin/gravitino.sh start

4. Create the Metalake (if not exists)

curl -X POST -H "Content-Type: application/json" \
  -d '{"name":"lance_metalake","comment":"comment"}' \
  http://localhost:8090/api/metalakes
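
If you prefer Python, the same call can be made with the requests library (a minimal sketch, equivalent to the curl command above):

import requests

# Create the metalake referenced by gravitino.lance-rest.gravitino-metalake.
resp = requests.post(
    "http://localhost:8090/api/metalakes",
    json={"name": "lance_metalake", "comment": "comment"},
)
resp.raise_for_status()
print(resp.json())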

5. Check server logs (optional)

tail -f logs/gravitino-server.log

Step 2: Verify the Lance REST endpoint and create a catalog namespace

Test the service endpoint

You can verify that the service is running with the following command:

curl -X GET http://localhost:9101/lance/v1/namespace/$/list \
  -H 'Content-Type: application/json'

On success, you should see a JSON response with namespace information.
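
The same check from Python with the requests package (equivalent to the curl call above):

import requests

# List namespaces under the root namespace ("$").
resp = requests.get("http://localhost:9101/lance/v1/namespace/$/list")
resp.raise_for_status()
print(resp.json())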

Create a catalog namespace

Create a catalog namespace (e.g., lance_catalog) that will hold your Lance schemas and tables:

curl -X POST http://localhost:9101/lance/v1/namespace/lance_catalog/create \
  -H 'Content-Type: application/json' \
  -d '{
    "id": ["lance_catalog"],
    "mode": "exist_ok"
  }'

If successful, it returns the namespace information.
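
You can also manage namespaces programmatically with the lance-namespace Python client used later in the Ray section. A minimal sketch, assuming the package exposes a CreateNamespaceRequest model alongside connect (check the lance-namespace docs for the exact import path):

import lance_namespace as ln
from lance_namespace import CreateNamespaceRequest  # assumed import path

# Connect to the Lance REST endpoint, as in the Ray section below.
namespace = ln.connect("rest", {"uri": "http://localhost:9101/lance"})

# Equivalent to the curl create call above; "exist_ok" makes it idempotent.
namespace.create_namespace(CreateNamespaceRequest(id=["lance_catalog"], mode="exist_ok"))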

Step 3: Connect with Spark

Configure your PySpark session to use the Lance REST catalog.

Configure Spark with Lance REST catalog

Prerequisites:

  • Install pyspark: pip install pyspark==3.5.0
  • Download the lance-spark bundle jar matching your Spark version (e.g., lance-spark-bundle-3.5_2.12-0.0.15.jar)

Execute sample operations

Run the following Python script:

from pyspark.sql import SparkSession
import os

# Set path to your lance-spark bundle
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/lance-spark-bundle-3.5_2.12-0.0.15.jar "
    "--conf \"spark.driver.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" "
    "--conf \"spark.executor.extraJavaOptions=--add-opens=java.base/sun.nio.ch=ALL-UNNAMED\" "
    "--master local[1] pyspark-shell"
)

spark = SparkSession.builder \
    .appName("lance_rest_demo") \
    .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceNamespaceSparkCatalog") \
    .config("spark.sql.catalog.lance.impl", "rest") \
    .config("spark.sql.catalog.lance.uri", "http://localhost:9101/lance") \
    .config("spark.sql.catalog.lance.parent", "lance_catalog") \
    .config("spark.sql.defaultCatalog", "lance") \
    .getOrCreate()

# Create a schema and table
spark.sql("CREATE DATABASE IF NOT EXISTS demo_schema")
spark.sql("""
    CREATE TABLE demo_schema.test_table (id INT, value STRING)
    USING lance
    LOCATION '/tmp/lance_catalog/demo_schema/test_table'
""")

# Insert and query data
spark.sql("INSERT INTO demo_schema.test_table VALUES (1, 'test')")
spark.sql("SELECT * FROM demo_schema.test_table").show()
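
As a quick sanity check, you can list the schema and table back through the same catalog:

# Verify the objects are visible via the Lance REST catalog.
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN demo_schema").show()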

Step 4: Connect with Ray

You can also access the data created by Spark from Ray, using the lance-ray integration.

Configure Ray with Lance REST catalog

Prerequisites:

  • Install required packages: pip install lance-ray==0.1.0 lance-namespace

Execute sample operations

import ray
import lance_namespace as ln
from lance_ray import read_lance, write_lance

ray.init()

# Connect to Lance REST
namespace = ln.connect("rest", {"uri": "http://localhost:9101/lance"})

# Read the table created by Spark
# Note: Table ID is [catalog, schema, table]
ds = read_lance(namespace=namespace, table_id=["lance_catalog", "demo_schema", "test_table"])
print(f"Row count: {ds.count()}")
ds.show()

# Perform filtering operation
result = ds.filter(lambda row: row["id"] < 100).count()
print(f"Filtered row count: {result}")
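
The script imports write_lance but doesn't use it. Writing back from Ray looks roughly like the sketch below, assuming write_lance mirrors read_lance's namespace/table_id parameters and accepts a mode argument (verify against the lance-ray documentation):

# Hypothetical write-back: append two rows to the table created by Spark.
new_rows = ray.data.from_items([
    {"id": 2, "value": "ray"},
    {"id": 3, "value": "lance"},
])
write_lance(
    new_rows,
    namespace=namespace,  # the connection created above
    table_id=["lance_catalog", "demo_schema", "test_table"],
    mode="append",  # assumed; check supported modes in lance-ray
)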

Troubleshooting

Common issues and their solutions:

Service connectivity issues:

  • Service fails to start: Check logs/gravitino-server.log for startup errors and configuration issues
  • Connection refused: Verify gravitino.lance-rest.httpPort (default 9101) is open and accessible
  • curl returns 404: Confirm the Lance REST base path is /lance and the port matches configuration

Client connection issues:

  • Spark ClassNotFoundException: Ensure the lance-spark-bundle jar is correctly referenced in PYSPARK_SUBMIT_ARGS or --jars
  • Namespace not found: Remember to create the parent catalog namespace (e.g., lance_catalog) before creating schemas or tables
  • Ray connection errors: Verify lance-ray and lance-namespace packages are installed and the REST endpoint is accessible

Configuration issues:

  • Metalake not found: Ensure the metalake specified in gravitino.lance-rest.gravitino-metalake exists in Gravitino
  • Permission errors: Check that the Gravitino server has proper access to the configured storage locations

Congratulations

You have successfully completed the Gravitino Lance REST service configuration tutorial!

You now have a fully functional Lance REST service with:

  • A configured Lance REST endpoint running on port 9101
  • A catalog namespace configured for organizing Lance datasets
  • Verified client connectivity through Apache Spark and Ray
  • Understanding of Lance dataset operations across different compute engines

Your Gravitino Lance REST service is ready to serve Lance clients across your data ecosystem.

Further Reading

For more advanced configurations and detailed documentation:

  • Lance REST Service Documentation, for full API details

Apache Gravitino is evolving rapidly, and this article is based on the latest version, 1.1.0. If you encounter issues, please refer to the official documentation or submit an issue on GitHub.
