<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ksanaka</title>
    <description>The latest articles on DEV Community by ksanaka (@ksanaka).</description>
    <link>https://dev.to/ksanaka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2446994%2F6a379788-e9af-48b7-938c-aabaf2895487.jpeg</url>
      <title>DEV Community: ksanaka</title>
      <link>https://dev.to/ksanaka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ksanaka"/>
    <language>en</language>
    <item>
      <title>Databend Monthly Report: July 2025</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Wed, 06 Aug 2025 14:54:05 +0000</pubDate>
      <link>https://dev.to/ksanaka/databend-monthly-report-july-2025-3n3b</link>
      <guid>https://dev.to/ksanaka/databend-monthly-report-july-2025-3n3b</guid>
      <description>&lt;p&gt;July was a big month for us. We've been heads-down optimizing one thing: &lt;strong&gt;JSON query performance&lt;/strong&gt;. The main story? We've turbocharged our Virtual Columns feature and the results are pretty impressive - &lt;strong&gt;3x faster JSON queries&lt;/strong&gt; with &lt;strong&gt;26x less data scanning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;By the Numbers&lt;/h2&gt;

&lt;p&gt;This month: &lt;strong&gt;25+ new features&lt;/strong&gt;, &lt;strong&gt;25+ bug fixes&lt;/strong&gt;, &lt;strong&gt;20+ performance optimizations&lt;/strong&gt;, and &lt;strong&gt;35+ other improvements&lt;/strong&gt;. But honestly, the JSON performance work is what I'm most excited about.&lt;/p&gt;

&lt;h2&gt;Monthly Highlights Summary&lt;/h2&gt;

&lt;h3&gt;🔥 &lt;strong&gt;Major Improvements&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Columns performance breakthrough&lt;/strong&gt; - 3x faster JSON queries, 26x less data scanning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced RBAC&lt;/strong&gt; - now supports connection and sequence objects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shuffle sort optimization&lt;/strong&gt; - much better performance for large ordered datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream processing improvements&lt;/strong&gt; - virtual columns now work with streaming data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload management&lt;/strong&gt; - memory percentage quotas for better resource control&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🛠 &lt;strong&gt;Developer Experience&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;40+ JSON functions&lt;/strong&gt; - comprehensive toolkit for JSON operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple access patterns&lt;/strong&gt; - both Snowflake and PostgreSQL syntax work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced UDFs&lt;/strong&gt; - IMMUTABLE support for better performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better debugging&lt;/strong&gt; - improved metactl tools&lt;/li&gt;
&lt;/ul&gt;
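
&lt;p&gt;To give a taste of that JSON toolkit, here are a couple of illustrative queries (the &lt;code&gt;customers&lt;/code&gt; table and its &lt;code&gt;data&lt;/code&gt; column are hypothetical; see the docs for the full function list):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Parse a JSON string and pull a nested value out of it
SELECT parse_json('{"user": {"name": "Alice"}, "tags": ["a", "b"]}')['user']['name'];

-- The same path syntax works on stored VARIANT columns
SELECT data['user']['name'] FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;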

&lt;h3&gt;⚡ &lt;strong&gt;Performance &amp;amp; Infrastructure&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Meta-service improvements&lt;/strong&gt; - millisecond precision, better architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query optimizations&lt;/strong&gt; - including grouping sets rewrites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory management&lt;/strong&gt; - percentage-based quotas&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🐛 &lt;strong&gt;Stability&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;25+ bug fixes&lt;/strong&gt; - JSON handling, decimal ops, query execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better error handling&lt;/strong&gt; - edge cases and concurrent operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction reliability&lt;/strong&gt; - temporary tables and system history&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's New and Game-Changing&lt;/h2&gt;

&lt;h3&gt;🚀 &lt;strong&gt;Virtual Columns: Automatic JSON Indexing&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;So here's what we did. We took our existing Virtual Columns feature and completely optimized it for JSON workloads. Think of it as &lt;strong&gt;automatic JSON indexing&lt;/strong&gt; - no manual intervention required.&lt;/p&gt;

&lt;p&gt;Virtual Columns work like smart, automatic indexes for your JSON data. When you load JSON, we automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analyze access patterns&lt;/strong&gt; in your JSON documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create optimized indexes&lt;/strong&gt; for commonly-accessed paths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route queries&lt;/strong&gt; to these fast indexes instead of executing JSONB functions every time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's like having a database admin who never sleeps, constantly creating the perfect indexes for your JSON queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Shoes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;59.99&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"alice@example.com"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Automatic indexes created for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customer_id&lt;/code&gt; → Fast integer index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order.items[0].price&lt;/code&gt; → Fast numeric index
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user.name&lt;/code&gt; → Fast string index&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zero configuration. Zero maintenance. Just faster queries.&lt;/strong&gt;&lt;/p&gt;
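
&lt;p&gt;For example, once the JSON above is loaded into a table (say, a hypothetical &lt;code&gt;events&lt;/code&gt; table with a VARIANT column named &lt;code&gt;data&lt;/code&gt;), queries on those paths get routed to the automatic indexes with no changes on your end:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- These hit the automatic indexes instead of re-parsing the JSON
SELECT data['customer_id'] FROM events;
SELECT data['user']['name'] FROM events WHERE data['customer_id'] = 123;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;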

&lt;h3&gt;📊 &lt;strong&gt;Real Numbers&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Here's a real example from our testing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (traditional JSON parsing):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'account_balance'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'address'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_logs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 3.76 seconds, 11.90 GB processed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (automatic Virtual Column indexes):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Exact same query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'account_balance'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'address'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_logs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 1.32 seconds, 461 MB processed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's &lt;strong&gt;3x faster&lt;/strong&gt; and &lt;strong&gt;26x less data&lt;/strong&gt;. And you don't have to change a single line of code.&lt;/p&gt;

&lt;h3&gt;🔧 &lt;strong&gt;All Your JSON Syntax Works&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;We support the syntax you're already using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake style:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'user'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;-- brackets&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="c1"&gt;-- colons&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'user'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- mixed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PostgreSQL style:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'user'&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;-- arrows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of it gets the same automatic indexing performance boost.&lt;/p&gt;

&lt;h2&gt;What's Next: One Architecture, Continuous Evolution&lt;/h2&gt;

&lt;p&gt;July's JSON optimization is actually part of something much bigger we're building. We're working toward mastering all types of data with one unified architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our multi-modal data approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Structured data&lt;/strong&gt; - World-class columnar performance (already there)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Semi-structured data&lt;/strong&gt; - Automatic Virtual Column indexing (July milestone)
&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Unstructured data&lt;/strong&gt; - Advanced vectorization with predicate pushdown (August preview)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🔮 &lt;strong&gt;August Preview: Unstructured Data Storage &amp;amp; Compute&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;While July was all about automatic indexing for semi-structured JSON, August is going to be about &lt;strong&gt;unstructured data&lt;/strong&gt;. We're launching comprehensive capabilities for vector storage, compute, and query acceleration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's coming in August:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vectorized storage&lt;/strong&gt; - Native vector data types with optimized columnar storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HNSW indexing&lt;/strong&gt; - Hardware-accelerated similarity search with sub-second query times&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-modal SQL joins&lt;/strong&gt; - Query across structured, semi-structured, and unstructured data in single queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Imagine queries like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find similar products using vector embeddings&lt;/span&gt;
&lt;span class="c1"&gt;-- AND analyze their sales performance from structured data&lt;/span&gt;
&lt;span class="c1"&gt;-- AND extract sentiment from JSON reviews&lt;/span&gt;
&lt;span class="c1"&gt;-- All in one SQL query&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'sentiment'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;COSINE_SIMILARITY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="c1"&gt;-- structured&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;product_reviews&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="c1"&gt;-- semi-structured JSON&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;COSINE_SIMILARITY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;-- unstructured vectors&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'sentiment'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'score'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;one platform for the AI era&lt;/strong&gt; - where structured analytics, JSON document queries, and vector similarity search all work together seamlessly. No data movement, no separate systems, just SQL that works across all your data types.&lt;/p&gt;

&lt;p&gt;Early feedback from production users has been really encouraging. People are seeing immediate performance improvements on their JSON workloads while getting positioned for AI use cases down the road.&lt;/p&gt;

&lt;p&gt;Check out what we're building at &lt;a href="https://github.com/databendlabs/databend" rel="noopener noreferrer"&gt;https://github.com/databendlabs/databend&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>rust</category>
    </item>
    <item>
      <title>Databend Monthly Report (June 2025)</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Tue, 08 Jul 2025 23:16:13 +0000</pubDate>
      <link>https://dev.to/ksanaka/databend-monthly-reportjune-2025-38od</link>
      <guid>https://dev.to/ksanaka/databend-monthly-reportjune-2025-38od</guid>
      <description>&lt;p&gt;June has been absolutely massive for us - and I mean that in the best possible way. We've been heads-down building some seriously powerful features that I think you're going to love. The big story this month? &lt;strong&gt;Enterprise-grade audit capabilities&lt;/strong&gt; that'll make your compliance teams very, very happy.&lt;/p&gt;

&lt;h2&gt;By the Numbers&lt;/h2&gt;

&lt;p&gt;This month we shipped &lt;strong&gt;45+ new features&lt;/strong&gt;, fixed &lt;strong&gt;30+ bugs&lt;/strong&gt;, and delivered &lt;strong&gt;15+ performance optimizations&lt;/strong&gt; along with &lt;strong&gt;30+ other improvements&lt;/strong&gt;. But honestly, the raw numbers don't tell the whole story - the quality and impact of these changes is what really gets me excited.&lt;/p&gt;

&lt;h2&gt;Monthly Highlights Summary&lt;/h2&gt;

&lt;h3&gt;🔥 &lt;strong&gt;Major New Features&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive audit trail system&lt;/strong&gt; with access_history, login_history, and query_history tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Decimal64 support&lt;/strong&gt; for better precision in financial calculations
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime filters in shuffle joins&lt;/strong&gt; - significantly faster complex analytical queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic CTE materialization&lt;/strong&gt; - optimizer now intelligently caches common table expressions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASOF joins&lt;/strong&gt; - perfect for time-series analysis and event correlation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python UDF improvements&lt;/strong&gt; with imports and packages support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming load enhancements&lt;/strong&gt; with placeholder support and better syntax&lt;/li&gt;
&lt;/ul&gt;
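
&lt;p&gt;A quick word on the ASOF joins item, since it's easy to miss how useful it is: an ASOF join matches each row with the closest earlier row from the other side, which is exactly what you want when pairing events with the latest known state. A sketch of the general shape (table and column names are made up, and the exact syntax may differ; check the docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- For each trade, find the most recent quote at or before the trade time
SELECT t.symbol, t.ts, t.price, q.bid, q.ask
FROM trades t
ASOF JOIN quotes q
  ON t.symbol = q.symbol AND t.ts &amp;gt;= q.ts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;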

&lt;h3&gt;🛠 &lt;strong&gt;Developer Experience&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New SQL functions&lt;/strong&gt;: &lt;code&gt;regexp_split_to_table&lt;/code&gt;, &lt;code&gt;bool_and&lt;/code&gt;, &lt;code&gt;bool_or&lt;/code&gt;, &lt;code&gt;age&lt;/code&gt;, &lt;code&gt;trunc&lt;/code&gt;, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"REPORT ISSUE" syntax&lt;/strong&gt; for quick error debugging and troubleshooting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better workload management&lt;/strong&gt; with max_concurrency quotas for workload groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZIP compression support&lt;/strong&gt; for improved storage efficiency&lt;/li&gt;
&lt;/ul&gt;
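
&lt;p&gt;The new functions follow their PostgreSQL counterparts, so for example (illustrative queries; &lt;code&gt;orders&lt;/code&gt; is a hypothetical table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Aggregate booleans: true only if every (bool_and) or any (bool_or) row matches
SELECT bool_and(amount &amp;gt; 0), bool_or(amount &amp;gt; 1000) FROM orders;

-- Split a delimited string into one row per piece
SELECT regexp_split_to_table('a,b,c', ',');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;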

&lt;h3&gt;⚡ &lt;strong&gt;Performance &amp;amp; Infrastructure&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Meta-service improvements&lt;/strong&gt; with better RPC error handling and observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query pipeline refactoring&lt;/strong&gt; using graphs for more efficient execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator caching&lt;/strong&gt; for faster repeated operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced memory management&lt;/strong&gt; and spill handling for large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🐛 &lt;strong&gt;Stability Improvements&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30+ bug fixes&lt;/strong&gt; across query execution, storage, and meta-service components&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better error handling&lt;/strong&gt; for edge cases in aggregation and join operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved transaction handling&lt;/strong&gt; for temporary tables and concurrent operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;🚀 &lt;strong&gt;AI &amp;amp; Vector Capabilities (Preview)&lt;/strong&gt;&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector data type support&lt;/strong&gt; - laying the foundation for AI-powered applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HNSW indexing&lt;/strong&gt; - early implementation for similarity search&lt;/li&gt;
&lt;li&gt;🚧 &lt;strong&gt;Note:&lt;/strong&gt; These features are in active development and not yet production-ready&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's New and Game-Changing&lt;/h2&gt;

&lt;h3&gt;🔐 &lt;strong&gt;Enterprise Audit Trail - The Star of the Show&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Alright, let's talk about the elephant in the room - or should I say, the feature that's going to make your security and compliance teams do a little happy dance. We've built a &lt;strong&gt;comprehensive audit trail system&lt;/strong&gt; that automatically tracks everything happening in your database.&lt;/p&gt;

&lt;p&gt;This isn't just "we log some stuff" - this is enterprise-grade, compliance-ready, full-visibility auditing that works out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Gets Tracked:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every single query&lt;/strong&gt; - who ran what, when, and how long it took&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All data access&lt;/strong&gt; - which tables, columns, and files were touched&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication events&lt;/strong&gt; - successful logins, failed attempts, the works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema changes&lt;/strong&gt; - DDL operations with full before/after details&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System events&lt;/strong&gt; - because sometimes you need to know what happened under the hood&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;📊 &lt;strong&gt;The Five Pillars of Audit Excellence&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;We've organized everything into five specialized history tables, each designed for specific use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Why You Need It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Complete SQL execution audit&lt;/td&gt;
&lt;td&gt;Performance analysis, compliance tracking, usage monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;access_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data access and modification logs&lt;/td&gt;
&lt;td&gt;Data lineage, compliance reporting, change management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;login_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User authentication tracking&lt;/td&gt;
&lt;td&gt;Security auditing, failed login monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;profile_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Detailed query execution profiles&lt;/td&gt;
&lt;td&gt;Performance optimization, resource planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;log_history&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raw system logs and events&lt;/td&gt;
&lt;td&gt;System troubleshooting, operational monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;🚨 &lt;strong&gt;Real-World Security Scenarios&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Let me show you how this actually works in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catching Suspicious Activity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Spot those failed login attempts that might indicate trouble&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_ip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;system_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;login_history&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LoginFailed'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Compliance Reporting Made Easy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Track who accessed sensitive customer data this week&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_objects_accessed&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;system_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access_history&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;base_objects_accessed&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%customer_data%'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;TODAY&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Change Management Tracking:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Monitor all schema changes with full details&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;object_modified_by_ddl&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;system_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access_history&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;object_modified_by_ddl&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'[]'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete Query Audit Trail:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Get comprehensive query execution details&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sql_user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;query_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;system_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_history&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;TODAY&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query_start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The best part? &lt;strong&gt;This all happens automatically&lt;/strong&gt;. No complex setup, no performance impact on your regular queries, no forgetting to enable logging for that one critical table.&lt;/p&gt;

&lt;h3&gt;
  
  
  📋 &lt;strong&gt;Audit Use Cases That Matter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Security Monitoring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track failed login attempts to identify potential security threats&lt;/li&gt;
&lt;li&gt;Monitor unusual access patterns or unauthorized data access attempts&lt;/li&gt;
&lt;li&gt;Investigate security incidents with complete authentication history&lt;/li&gt;
&lt;/ul&gt;
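&lt;p&gt;As a quick sketch, a failed-login check could look like the following (the &lt;code&gt;login_history&lt;/code&gt; column names here are illustrative assumptions; see the System History Tables documentation for the exact schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Surface recent failed login attempts (column names are illustrative)
SELECT user_name, client_ip, event_time, error_message
FROM system_history.login_history
WHERE error_message IS NOT NULL
  AND event_time &amp;gt;= NOW() - INTERVAL 1 DAY
ORDER BY event_time DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;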

&lt;p&gt;&lt;strong&gt;Compliance Reporting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain complete audit trails for regulatory requirements (SOX, GDPR, HIPAA)&lt;/li&gt;
&lt;li&gt;Track who accessed what data and when for data governance&lt;/li&gt;
&lt;li&gt;Monitor DDL operations for change management compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational Intelligence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze query performance and resource usage patterns&lt;/li&gt;
&lt;li&gt;Identify optimization opportunities and bottlenecks&lt;/li&gt;
&lt;li&gt;Monitor database activity for capacity planning&lt;/li&gt;
&lt;/ul&gt;
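&lt;p&gt;Using only the &lt;code&gt;query_history&lt;/code&gt; columns shown earlier, a simple slow-query report might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the slowest queries over the past week
SELECT sql_user, query_text, query_duration_ms
FROM system_history.query_history
WHERE event_date &amp;gt;= TODAY() - INTERVAL 7 DAY
ORDER BY query_duration_ms DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;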

&lt;h2&gt;
  
  
  Configuration That Actually Makes Sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Databend Cloud
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Automatically enabled&lt;/strong&gt; - All system history tables are ready to use without any configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Hosted Databend
&lt;/h3&gt;

&lt;p&gt;📝 &lt;strong&gt;Manual configuration required&lt;/strong&gt; - To enable system history tables, you must configure all 5 tables in your &lt;code&gt;databend-query.toml&lt;/code&gt;. You'll need to specify each table name with optional retention settings (default: 7 days).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optional Custom Storage:&lt;/strong&gt; By default, history tables use your main database storage. You can optionally configure separate S3 storage for audit data.&lt;/p&gt;

&lt;p&gt;For complete configuration details and examples, see the &lt;a href="https://docs.databend.com/sql/sql-reference/system-history-tables/#self-hosted-databend" rel="noopener noreferrer"&gt;System History Tables documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control &amp;amp; Security
&lt;/h2&gt;

&lt;p&gt;We've built robust security into the audit system. System history tables are &lt;strong&gt;protected against unauthorized modifications&lt;/strong&gt; - users can only SELECT or DROP them, never ALTER. To query audit data, users need appropriate SELECT permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example: Create an audit role for compliance team&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;audit_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;system_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;audit_team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;compliance_officer&lt;/span&gt; &lt;span class="n"&gt;IDENTIFIED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="s1"&gt;'secure_password'&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_ROLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'audit_team'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;audit_team&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;compliance_officer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Looking Forward
&lt;/h2&gt;

&lt;p&gt;July has been transformative for Databend in multiple ways. Our enterprise audit capabilities represent a major milestone for compliance-focused organizations, while our ongoing AI vector development signals our commitment to the future of data analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We're building the unified data platform of tomorrow.&lt;/strong&gt; Databend already excels at massive structured data workloads and offers best-in-class semi-structured data analytics - from JSON schema auto-detection to intelligent indexing that makes complex nested queries lightning-fast. But we're not stopping there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The next frontier is unstructured data.&lt;/strong&gt; We're actively developing capabilities to handle text, video, audio, and other unstructured formats natively within Databend. Imagine being able to query your video content for specific scenes, analyze audio transcripts, or perform sentiment analysis on text documents - all using familiar SQL syntax, all within the same platform that handles your traditional analytics.&lt;/p&gt;

&lt;p&gt;This isn't just about adding features - it's about &lt;strong&gt;creating a truly unified data experience&lt;/strong&gt;. No more ETL pipelines between different systems for different data types. No more choosing between analytics performance and AI capabilities. One platform, one query language, all your data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We're building for the AI era.&lt;/strong&gt; Traditional data warehouses handle historical analysis beautifully, but the next generation of applications needs to seamlessly blend structured analytics with AI workloads. Our investment in vector capabilities and unstructured data processing isn't just about following trends - it's about creating a platform where SQL becomes the universal language for all data types.&lt;/p&gt;

&lt;p&gt;The response from production users has been fantastic, and it's driving our roadmap for the rest of 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;We're continuing to expand both our audit capabilities and AI foundations. Check out our progress at &lt;a href="https://github.com/databendlabs/databend" rel="noopener noreferrer"&gt;https://github.com/databendlabs/databend&lt;/a&gt; - whether you're looking for enterprise compliance features today or want to prepare for a future where all your data - structured, semi-structured, and unstructured - lives in one intelligent platform, we're building exactly that.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>development</category>
      <category>database</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Deep Dive into Databend UDF, implementing your data solutions with python, WASM and beyond</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Wed, 25 Jun 2025 19:17:08 +0000</pubDate>
      <link>https://dev.to/ksanaka/deep-dive-into-databend-udf-implementing-your-data-solutions-with-python-wasm-and-beyond-231p</link>
      <guid>https://dev.to/ksanaka/deep-dive-into-databend-udf-implementing-your-data-solutions-with-python-wasm-and-beyond-231p</guid>
      <description>&lt;p&gt;"How do we extract/summarize/complete/validate our data using LLM functions?"&lt;/p&gt;

&lt;p&gt;"Can we detect anomalies in this time-series data using our existing proprietary functions?"&lt;/p&gt;

&lt;p&gt;"We need to analyze sentiment in these customer reviews, but there's nothing built-in for that."&lt;/p&gt;

&lt;p&gt;"How do we connect our gaming matchmaking system to our analytics pipeline?"&lt;/p&gt;

&lt;p&gt;If these questions sound familiar, you're not alone. As data teams scale their analytics capabilities, they inevitably hit the limitations of built-in database functions. That's where Databend's User-Defined Functions (UDFs) come in – and they're transforming how we approach custom data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why UDFs Matter for Modern Data Teams
&lt;/h2&gt;

&lt;p&gt;From financial services to e-commerce to gaming, organizations are leveraging UDFs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embed domain expertise and proprietary business logic&lt;/strong&gt; directly into data pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process data where it lives&lt;/strong&gt; by eliminating unnecessary data movement while maintaining security and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connect to specialized services&lt;/strong&gt; – enabling companies to use the same algorithms in both operations and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extend analytics capabilities&lt;/strong&gt; without waiting for feature requests&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Databend's UDF Ecosystem: Power Meets Flexibility
&lt;/h2&gt;

&lt;p&gt;What makes Databend's approach to UDFs particularly powerful is its multi-language support and deployment options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda UDFs&lt;/strong&gt;: Quick SQL expressions for simple transformations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python &amp;amp; JavaScript UDFs&lt;/strong&gt;: Familiar languages for complex logic and ML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WASM UDFs&lt;/strong&gt;: Near-native performance for compute-intensive operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External UDF Servers&lt;/strong&gt;: Scalable microservices for enterprise workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following sections, we'll explore each approach with practical examples and performance insights to help you choose the right tool for your specific needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda UDFs: SQL-Powered Simplicity
&lt;/h2&gt;

&lt;p&gt;Lambda UDFs are the simplest form of user-defined functions in Databend, allowing you to define custom expressions directly in SQL. Think of them as SQL "shorthand" that encapsulates frequently used logic into reusable components.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Lambda UDFs
&lt;/h3&gt;

&lt;p&gt;Lambda UDFs excel in scenarios where you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data cleansing and standardization across multiple queries&lt;/li&gt;
&lt;li&gt;Simple calculations that combine several built-in functions&lt;/li&gt;
&lt;li&gt;Text manipulation and pattern matching beyond basic SQL capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For data engineers working with messy data sources or implementing business rules, Lambda UDFs let you define the logic once as a lambda expression and reuse it everywhere, ensuring consistency and reducing query complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Syntax and Examples
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Simple data cleaning function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;clean_phone&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; 
  &lt;span class="n"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'[^0-9]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- More complex example with multiple parameters and logic&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;calculate_discount&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;purchase_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; 
  &lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;purchase_history&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;base_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;base_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;purchase_history&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;base_price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;base_price&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Using the functions in a query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;clean_phone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_phone_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;standardized_phone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;calculate_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ytd_purchases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;discounted_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  UDF Script: Enhancing Your Data with Python, JavaScript, and WASM
&lt;/h2&gt;

&lt;p&gt;While Lambda UDFs are perfect for simple SQL expressions, more complex data processing often requires the full power of programming languages. Databend's UDF Script capability allows you to embed Python or JavaScript code directly into your data pipeline, or compile Rust, Go, C++, or Zig to WebAssembly, bringing sophisticated algorithms right to where your data lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python UDFs: A Taste of Data Science
&lt;/h3&gt;

&lt;p&gt;Python UDFs unlock the vast ecosystem of Python libraries for data science, machine learning, and AI. This is particularly valuable when you need to apply complex algorithms or connect your data natively to external services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample use case: E-Commerce Order Validation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An e-commerce retailer wants to ensure that shipping addresses entered by customers are clean and valid before fulfillment. The data team automates the process of standardizing address strings and flagging obviously invalid entries (such as addresses missing a street number).&lt;/p&gt;

&lt;p&gt;To achieve this, the data engineer writes a Python UDF in Databend (using only Python’s standard library). The UDF receives a raw address string, trims whitespace, converts it to title case, and checks if it contains at least one digit (to indicate a street number). It returns a standardized address if valid, or NULL otherwise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;FUNCTION&lt;/span&gt; &lt;span class="nf"&gt;standardize_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;RETURNS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="n"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clean_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Check for at least one digit (street number)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cleaned&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Use the UDF to clean and validate addresses in orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;standardize_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shipping_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;standardized_address&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;standardize_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shipping_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JavaScript UDFs: Lightweight Data Wrangling with Security
&lt;/h3&gt;

&lt;p&gt;JavaScript UDFs in Databend provide a secure, sandboxed environment for data transformations. They are particularly well-suited for handling semi-structured data like JSON, offering a balance of flexibility, security, and performance for common data preparation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample use case: E-commerce Event Log Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An e-commerce platform captures raw user interaction events as JSON objects. Before this data can be used for analytics, the data engineering team needs to clean, enrich, and standardize it. Specifically, they need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Redact PII&lt;/strong&gt;: Remove sensitive information like IP addresses.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Enrich Data&lt;/strong&gt;: Add a processing timestamp to each event.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Standardize Fields&lt;/strong&gt;: Ensure certain fields, like country codes, adhere to a consistent format.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To handle this, they implement a JavaScript UDF in Databend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="nx"&gt;This&lt;/span&gt; &lt;span class="nx"&gt;UDF&lt;/span&gt; &lt;span class="nx"&gt;processes&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Deletes&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ip_address&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;field&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_details&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Adds&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;processed_at_udf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;metadata&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Converts&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;country_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;location_details&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;uppercase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Adds&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;source&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;missing&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;event_properties&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="nx"&gt;CREATE&lt;/span&gt; &lt;span class="nx"&gt;FUNCTION&lt;/span&gt; &lt;span class="nf"&gt;process_event_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="nx"&gt;VARIANT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;RETURNS&lt;/span&gt; &lt;span class="nx"&gt;VARIANT&lt;/span&gt;
&lt;span class="nx"&gt;LANGUAGE&lt;/span&gt; &lt;span class="nx"&gt;javascript&lt;/span&gt;
&lt;span class="nx"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;transform_event&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transform_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Ensure event is a valid object&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 1. Delete PII (e.g., user's IP address)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_details&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_details&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Add a new field (e.g., processing timestamp)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt; &lt;span class="c1"&gt;// Initialize if not an object or is null&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processed_at_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Update/Standardize a field (e.g., ensure country code is uppercase)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location_details&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;country_code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;country_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;country_code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toUpperCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 4. Add a default value if a field is missing&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_properties&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_properties&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt; &lt;span class="c1"&gt;// Initialize if not an object or is null&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;undefined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;event_properties&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown_source_js_udf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Using the JavaScript UDF to transform raw event data&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;raw_events&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;PARSE_JSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{
    "event_id": "evt_123",
    "user_details": { "user_id": "usr_abc", "ip_address": "192.168.1.100", "email": "test@example.com" },
    "location_details": { "city": "San Francisco", "country_code": "us" },
    "event_properties": { "page_url": "/products/awesome-widget" }
  }'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
  &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;PARSE_JSON&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{
    "event_id": "evt_456",
    "user_details": { "user_id": "usr_def", "ip_address": "10.0.0.5" },
    "location_details": { "city": "London", "country_code": "gb" },
    "event_properties": null 
  }'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;original_event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;process_event_payload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;processed_payload&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  WASM UDFs: Near-Native Performance Without Sacrificing Security
&lt;/h3&gt;

&lt;p&gt;For computationally intensive operations requiring high performance, WebAssembly (WASM) UDFs enable near-native execution speeds within a secure, sandboxed environment. Unlike Python and JavaScript UDFs, which interpret source code at runtime, WASM UDFs are compiled to a binary format that runs directly in the engine, both efficiently and securely.&lt;/p&gt;

&lt;p&gt;First, set up your Rust environment, add &lt;code&gt;arrow-udf&lt;/code&gt; as a dependency, and implement your UDF function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Rust code (compiled to WASM)&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;arrow_udf&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;#[function(&lt;/span&gt;&lt;span class="s"&gt;"fib(int) -&amp;gt; int"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compile your Rust code to the &lt;code&gt;wasm32-wasip1&lt;/code&gt; target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; wasm32-wasip1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upload the compiled WASM file (e.g., &lt;code&gt;arrow_udf_example.wasm&lt;/code&gt; found in &lt;code&gt;target/wasm32-wasip1/release/&lt;/code&gt;) to a Databend stage and create the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a stage if you don't have one&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;STAGE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;my_wasm_stage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Upload the .wasm file to the stage (replace with your actual path and stage name)&lt;/span&gt;
&lt;span class="c1"&gt;-- Example: PUT fs:///path/to/your/project/target/wasm32-wasip1/release/arrow_udf_example.wasm @my_wasm_stage;&lt;/span&gt;
&lt;span class="n"&gt;PUT&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;///&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arrow_udf_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wasm&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;my_wasm_stage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create the WASM UDF&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;fib_wasm&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;wasm&lt;/span&gt;
&lt;span class="k"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'fib'&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;my_wasm_stage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;arrow_udf_example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wasm&lt;/span&gt;&lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Using the WASM UDF&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;fib_wasm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;fibonacci_number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  External UDF Server: Integrate Your Data Seamlessly with Existing Services
&lt;/h3&gt;

&lt;p&gt;If you need your computed data to interact seamlessly with existing business services, consider an External UDF server. These servers communicate with Databend via the Arrow Flight protocol, letting you decouple UDF logic from the database and integrate with existing microservices or specialized computation engines.&lt;/p&gt;

&lt;p&gt;For instance, one of our biggest customers in the gaming industry leverages External UDF servers to process real-time, sub-second computed data. This data is crucial for their game matching algorithms and feature engineering services, handling over 1,000 QPS with a P99 latency of less than 200ms. This showcases the power and scalability of External UDFs for demanding, low-latency applications.&lt;/p&gt;

&lt;p&gt;Here's how you can set up and use an External UDF Server, using a Fibonacci example implemented in Python:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Implement and Run the External UDF Server (Python Example):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, you'll need the &lt;code&gt;databend_udf&lt;/code&gt; Python package. You can install it via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;databend-udf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, create a Python script (e.g., &lt;code&gt;my_udf_server.py&lt;/code&gt;) with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databend_udf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UDFServer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;input_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Corresponds to Databend's INT type
&lt;/span&gt;    &lt;span class="n"&gt;result_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Corresponds to Databend's INT type
&lt;/span&gt;    &lt;span class="n"&gt;skip_null&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Optional: if True, null inputs won't be passed to the function
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# The UDF server will listen on this address
&lt;/span&gt;    &lt;span class="n"&gt;udf_server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UDFServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0:8815&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Register the function with the server
&lt;/span&gt;    &lt;span class="n"&gt;udf_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Start the server
&lt;/span&gt;    &lt;span class="n"&gt;udf_server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this Python script. It will start a server listening on &lt;code&gt;0.0.0.0:8815&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python my_udf_server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create the External Function in Databend:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your UDF server is running, you can register it in Databend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create the external function, pointing to the running server&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;fib_external&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;HANDLER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'fib'&lt;/span&gt;  &lt;span class="c1"&gt;-- This must match the function name in your Python script&lt;/span&gt;
&lt;span class="n"&gt;ADDRESS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'http://0.0.0.0:8815'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- The address of your UDF server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Using the External UDF:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now you can call this function in your SQL queries just like any other UDF:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;fib_external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;fibonacci_number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Expected output: 55&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;Most Databend users don’t care about UDF theory—they care about what works, what breaks, and what saves time. After seeing dozens of teams (and making plenty of mistakes myself), here’s the honest truth:&lt;/p&gt;

&lt;p&gt;If you can do it in SQL, do it in SQL. Lambda UDFs are instant, transparent, and you’ll thank yourself later when you need to debug at 2am. But don’t force it—once your logic starts looking like a regex contest or you’re copying the same formula everywhere, switch gears.&lt;/p&gt;

&lt;p&gt;Python and JavaScript UDFs are the real workhorses. Python is the default for anything that smells like business logic, validation, LLM, RAG, or data wrangling. JavaScript is the secret weapon for JSON or semi-structured messes—especially if you’re coming from a web background. Both are easy to write, but don’t expect magic performance. If you’re processing millions of rows per second, you’ll hit a wall. For most ETL, though? They’re perfect.&lt;/p&gt;

&lt;p&gt;WASM is for the performance-obsessed. If you’re building a recommendation engine, crunching numbers for a game, or your boss keeps asking “can this be faster?”—bite the bullet, learn some Rust, and ship a WASM UDF. It’s not easy, but it’s the only way to get close to native speed inside Databend. Most people never need it. If you do, you’ll know.&lt;/p&gt;

&lt;p&gt;External UDF Servers are for teams with real infrastructure—when you already have a service, a model, or a matching engine you trust, and you want Databend to tap into it. There’s network overhead, but you get freedom: scale compute separately, deploy in your own stack, and keep business logic where it belongs. Just remember to batch requests, or you’ll be debugging latency charts all day.&lt;/p&gt;
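
&lt;p&gt;As a back-of-the-envelope illustration of why batching matters (a toy model in Python, not Databend code; the overhead and per-row figures are assumptions), note how a fixed per-request cost multiplies when rows are sent one at a time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy latency model: every round trip to an external UDF server pays a fixed
# network overhead, so per-row calls multiply it by the row count, while
# batching pays it once per batch. Numbers below are illustrative assumptions.

def round_trips(total_rows, batch_size):
    # Number of network round trips needed to cover total_rows.
    return -(-total_rows // batch_size)  # ceiling division

def total_latency_ms(total_rows, batch_size, overhead_ms=5.0, per_row_ms=0.01):
    # Fixed overhead paid once per trip, plus per-row compute time.
    return round_trips(total_rows, batch_size) * overhead_ms + total_rows * per_row_ms

one_by_one = total_latency_ms(100_000, batch_size=1)    # 100,000 round trips
batched = total_latency_ms(100_000, batch_size=10_000)  # 10 round trips
print(round(one_by_one), round(batched))                # 501000 1050
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumed numbers, per-row calls spend roughly 500 seconds on overhead alone, while ten large batches pay it only ten times. Arrow Flight transfers columnar record batches for exactly this reason.&lt;/p&gt;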

&lt;p&gt;We benchmarked the average execution time per row for the different Databend UDF types using the SQL below. Performance is not the only factor when choosing a UDF type, though; weigh it against simplicity, ease of debugging, and maintainability for your specific use case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;fib&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;IGNORE_RESULT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;UDF Type&lt;/th&gt;
&lt;th&gt;Avg. Time per Row (µs)&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda UDF&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Simple transforms, prototyping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python UDF&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;Complex logic, AI integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript UDF&lt;/td&gt;
&lt;td&gt;2.68&lt;/td&gt;
&lt;td&gt;Lightweight data processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WASM UDF&lt;/td&gt;
&lt;td&gt;0.11&lt;/td&gt;
&lt;td&gt;High-performance computation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External UDF&lt;/td&gt;
&lt;td&gt;23.2&lt;/td&gt;
&lt;td&gt;Large-scale, distributed workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: External UDF timing includes network overhead; actual compute time is often less.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The best teams start simple, refactor when it hurts, and always measure before optimizing. Don’t overthink it. Every UDF type has its place—what matters is using the right one for your problem, not the fanciest one in the docs.&lt;/p&gt;

&lt;p&gt;If you’re still unsure, ask around—most Databend users are happy to share what worked (and what didn’t) in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://databend.com" rel="noopener noreferrer"&gt;Databend&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databend.com/guides/query/udf" rel="noopener noreferrer"&gt;Databend UDF Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://link.databend.com/join-slack" rel="noopener noreferrer"&gt;Community Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>lambda</category>
      <category>udf</category>
    </item>
    <item>
      <title>Why Your Data Architecture Needs More Than Basic Storage-Compute Separation</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Wed, 04 Jun 2025 15:15:38 +0000</pubDate>
      <link>https://dev.to/ksanaka/why-your-data-architecture-needs-more-than-basic-storage-compute-separation-mc3</link>
      <guid>https://dev.to/ksanaka/why-your-data-architecture-needs-more-than-basic-storage-compute-separation-mc3</guid>
      <description>&lt;h2&gt;
  
  
  The Storage-Compute Separation Everyone Talks About
&lt;/h2&gt;

&lt;p&gt;If you've been in the big data industry for the past few years, you've likely heard "storage-compute separation" thrown around in architecture discussions. The concept seems straightforward: separate your compute resources from storage, scale them independently, and voilà—you have a modern data platform.&lt;/p&gt;

&lt;p&gt;But here's the reality check: most organizations implementing basic storage-compute separation quickly discover it's just the starting point, not the destination.&lt;/p&gt;

&lt;p&gt;Let me walk you through what we've learned from real-world deployments and why the next generation of data platforms looks fundamentally different from what you might expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with "Good Enough" Architecture
&lt;/h2&gt;

&lt;p&gt;Picture this: It's 3 AM, and your phone is buzzing with alerts. Your beautifully architected storage-compute separation—the one that passed every design review and impressed the board—is choking under real-world pressure.&lt;/p&gt;

&lt;p&gt;Here's what's happening: Your overnight batch jobs are wrestling with real-time analytics for resources. The executive dashboard that needs to load in under 2 seconds is taking 30. Your data science team is complaining that their ML training jobs are being starved by the marketing team's customer segmentation queries. Everyone's pointing fingers, but the real culprit isn't any single team—it's your architecture.&lt;/p&gt;

&lt;p&gt;You see, basic storage-compute separation makes a dangerous assumption: that all workloads are created equal. They're not. Your fraud detection system needs millisecond responses. Your monthly reporting can wait. Your recommendation engine requires massive parallel processing. Your compliance queries need guaranteed resources.&lt;/p&gt;

&lt;p&gt;But your current setup? It's like having one highway for sports cars, delivery trucks, and emergency vehicles. Sure, they're all "vehicles," but treating them the same way creates chaos.&lt;/p&gt;

&lt;p&gt;The frustrating part? You can't simply buy your way out of this problem. Adding more compute power is like widening that highway—it helps temporarily, but the fundamental traffic management problem remains unsolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Three Architectural Breakthroughs
&lt;/h2&gt;

&lt;p&gt;The problems we just discussed—resource conflicts, scaling bottlenecks, and operational complexity—aren't inevitable. They're symptoms of architectural limitations that modern platforms have solved through three key innovations.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Separate Metadata Management
&lt;/h3&gt;

&lt;p&gt;Traditional systems tie metadata (table structures, permissions, data locations) directly to compute clusters. This creates a bottleneck: when multiple teams need the same data, they're forced to share the same compute resources.&lt;/p&gt;

&lt;p&gt;Advanced platforms move metadata into its own independent service. Now your fraud detection team and marketing analytics team can access the same customer data simultaneously, each using their own optimized compute cluster. No more resource wars, no more performance interference.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Elastic Compute Scaling
&lt;/h3&gt;

&lt;p&gt;Instead of guessing how much compute power you'll need and pre-purchasing it, modern platforms scale automatically based on actual demand. When your monthly reports start running, new compute nodes appear. When they finish, the nodes disappear.&lt;/p&gt;

&lt;p&gt;This works because compute nodes are stateless and can run anywhere—on your EC2 instances, Docker Compose environments, or Kubernetes clusters—while data lives in object storage. The two scale independently, so you only pay for what you actually use.&lt;/p&gt;
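
&lt;p&gt;As a rough sketch of that elasticity (illustrative Python only; the thresholds, node counts, and function names are made-up assumptions, not Databend's autoscaling logic):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy demand-based scaling decision. Because compute nodes are stateless and
# data lives in object storage, nodes can be added or removed freely.
# Capacity, floor, and ceiling values here are invented for illustration.

def desired_nodes(queued_queries, per_node_capacity=10, min_nodes=1, max_nodes=32):
    # Scale out to cover the queue, scale back in when demand drops.
    needed = -(-queued_queries // per_node_capacity) if queued_queries else min_nodes
    return max(min_nodes, min(max_nodes, needed))

print(desired_nodes(0))     # 1  -- idle: shrink to the floor
print(desired_nodes(85))    # 9  -- month-end reports arrive: scale out
print(desired_nodes(1000))  # 32 -- capped at the configured ceiling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;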

&lt;h3&gt;
  
  
  3. Workload-Specific Clusters
&lt;/h3&gt;

&lt;p&gt;Different jobs need different resources. Real-time fraud detection needs fast response times. Monthly reporting needs massive parallel processing. Data science experiments need specialized hardware.&lt;/p&gt;

&lt;p&gt;Advanced platforms create separate compute clusters for each workload type. Your dashboard queries run on low-latency infrastructure. Your batch jobs run on cost-optimized nodes during off-peak hours. Each gets exactly what it needs.&lt;/p&gt;
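
&lt;p&gt;Conceptually, the routing decision looks like this (a hypothetical Python sketch; the cluster names, sizes, and tags are invented for illustration, not Databend's actual scheduler):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy workload-aware routing: each workload class maps to its own compute
# cluster, so a heavy batch job can never queue behind, or starve, a
# latency-sensitive dashboard query. All names and sizes are hypothetical.

CLUSTERS = {
    "dashboard": {"nodes": 4, "profile": "low-latency"},
    "batch": {"nodes": 16, "profile": "cost-optimized"},
    "ml": {"nodes": 8, "profile": "gpu"},
}

def route(query_tags):
    # Pick a cluster from the query's workload tag, defaulting to batch.
    workload = query_tags.get("workload", "batch")
    return CLUSTERS.get(workload, CLUSTERS["batch"])

print(route({"workload": "dashboard"})["profile"])  # low-latency
print(route({})["profile"])                         # cost-optimized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;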

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xi29jj8o315ge6sqwj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xi29jj8o315ge6sqwj8.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing Storage-Compute Separation: Performance and Scale
&lt;/h2&gt;

&lt;p&gt;While the three architectural breakthroughs form the foundation, modern data platforms need additional capabilities to deliver optimal performance and scale. Let's explore how these enhancements work together to create a truly effective storage-compute separation architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Caching and Multi-Cluster Warehouse
&lt;/h3&gt;

&lt;p&gt;Storage-compute separation creates a new challenge: every data request now travels across the network to remote storage. While object storage has improved dramatically, this round-trip still introduces latency that can impact user experience.&lt;/p&gt;

&lt;p&gt;The bigger issue emerges under heavy concurrent load. When hundreds of dashboard queries hit your system simultaneously, they all compete for bandwidth to the same storage layer. This creates bottlenecks that can bring even well-designed systems to their knees.&lt;/p&gt;

&lt;p&gt;Modern platforms address these challenges through a three-pronged approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local Disk Caching&lt;/strong&gt;: Frequently accessed data stays close to compute nodes, eliminating most remote storage calls.&lt;br&gt;
&lt;strong&gt;Memory Caching&lt;/strong&gt;: Hot data lives in memory for instant access.&lt;br&gt;
&lt;strong&gt;Multi-Cluster Warehouse&lt;/strong&gt;: Instead of scaling up one massive cluster, the system scales out across multiple smaller clusters, each handling a portion of the concurrent load.&lt;/p&gt;

&lt;p&gt;This layered approach ensures that storage-compute separation doesn't sacrifice the performance your applications require.&lt;/p&gt;
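
&lt;p&gt;The read path of this layered design can be sketched as follows (a simplified Python illustration of the idea, not Databend's actual cache implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of tiered reads: check the in-memory cache first, then the
# local-disk cache, and only fall through to remote object storage on a
# double miss. Class and parameter names are invented for illustration.

class TieredCache:
    def __init__(self, fetch_remote):
        self.memory = {}          # hot data, fastest tier
        self.disk = {}            # stand-in for a local-disk cache
        self.fetch_remote = fetch_remote
        self.remote_calls = 0     # how often we actually hit object storage

    def get(self, key):
        if key in self.memory:            # tier 1: memory hit
            return self.memory[key]
        if key in self.disk:              # tier 2: disk hit, promote to memory
            self.memory[key] = self.disk[key]
            return self.memory[key]
        value = self.fetch_remote(key)    # tier 3: remote round trip
        self.remote_calls += 1
        self.disk[key] = value
        self.memory[key] = value
        return value

cache = TieredCache(fetch_remote=lambda key: f"block-{key}")
for _ in range(100):          # 100 reads of the same hot block...
    cache.get("part-0001")
print(cache.remote_calls)     # ...only the first one touches object storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A hundred reads of the same hot block touch object storage only once; every subsequent read is served from memory.&lt;/p&gt;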
&lt;h3&gt;
  
  
  Native External Tables: Unlocking Data Lakehouse Flexibility
&lt;/h3&gt;

&lt;p&gt;In storage-compute separation architectures, the choice and capabilities of underlying object storage become critically important. Traditionally, many viewed object storage as suitable only for data archival and backup—inadequate for high-concurrency OLAP scenarios. However, technological advances and the evolution of cloud services have changed this perception dramatically.&lt;/p&gt;

&lt;p&gt;Today, enterprises are increasingly adopting object storage solutions tailored to their specific needs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Deployments&lt;/strong&gt;: Services like AWS S3 provide the foundation for high-performance analytics workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Premises Solutions&lt;/strong&gt;: Organizations can build cost-effective, high-performance private object storage using open-source solutions. For instance, deploying MinIO on SSD-backed infrastructure delivers the performance needed for real-time data analysis and AI computing scenarios, while HDD-based storage handles batch analysis workloads efficiently.&lt;/p&gt;

&lt;p&gt;Databend takes this foundation further with its "native external tables" approach, offering two powerful features for working with external data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External Tables&lt;/strong&gt;: These are tables that point to data stored in your own object storage (like AWS S3, Cloudflare R2, MinIO, etc.). The data remains in your storage while Databend provides the compute layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;** Creating an External Table**

-- Connect to your S3 data

CREATE EXTERNAL TABLE population_data

(

   city VARCHAR,

   population INT,

   year INT

)

LOCATION = 's3://your-bucket/population/'

CONNECTION = (

   REGION = 'us-east-2',

   AWS_KEY_ID = 'your_key',

   AWS_SECRET_KEY = 'your_secret'

);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
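&lt;p&gt;Once defined, an external table can be queried like any local table, with Databend reading the files directly from your bucket. A minimal sketch against the table above (the filter value is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Query the external data in place; no loading step required
SELECT city, population
FROM population_data
WHERE year = 2024          -- illustrative filter value
ORDER BY population DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;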



&lt;p&gt;This functionality allows organizations to utilize storage-compute separation to its fullest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store high-frequency concurrent data directly on high-speed object storage&lt;/li&gt;
&lt;li&gt;Leverage all-flash object storage for analytical workloads&lt;/li&gt;
&lt;li&gt;Combine these capabilities with multi-cluster architecture to achieve true lakehouse integration with elastic resources and rapid response times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2nmkfugwbpbwgrttfhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2nmkfugwbpbwgrttfhu.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting: Databend Cloud lets you keep your data in your own storage buckets while providing a fully managed compute layer. You get enterprise-grade architecture, 99.9% availability, and automatic scaling, with nearly zero migration cost and none of the operational complexity.&lt;/p&gt;

&lt;p&gt;The economics speak for themselves. Organizations are seeing 50% cost reductions compared to platforms like Snowflake. One retail client dropped from $40,000 to $18,000 monthly—same workloads, better performance, half the cost.&lt;/p&gt;

&lt;p&gt;Turns out advanced architecture doesn't have to break the budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero-Copy Data Migration with Attach Tables
&lt;/h2&gt;

&lt;p&gt;Here's a scenario every data team knows too well: your analytics team needs access to production data, but copying terabytes for every analysis creates a nightmare of version conflicts, storage costs, and stale insights. Meanwhile, your data engineers spend countless hours building and maintaining synchronization pipelines that break whenever schemas change.&lt;/p&gt;

&lt;p&gt;Traditional data architectures force this painful choice between data freshness and operational complexity. But what if you could eliminate the choice entirely?&lt;/p&gt;

&lt;p&gt;Databend introduces a game-changing capability: seamless data sharing without the overhead. Instead of moving data around, you simply point a table at the existing storage location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attach Tables&lt;/strong&gt;: This feature lets you create logical connections between different Databend deployments, treating remote data as if it were local. Think of it as creating a symbolic link to data that lives elsewhere—no copying, no synchronization delays, no storage duplication.&lt;/p&gt;

&lt;p&gt;The practical applications are transformative:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-copy migrations&lt;/strong&gt;: Move from on-premises to cloud by gradually shifting compute while keeping data in place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant analytics environments&lt;/strong&gt;: Spin up read-only analytical workspaces that always reflect current production data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-environment collaboration&lt;/strong&gt;: Development, staging, and production teams work with the same underlying datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92k909muk40ldbudnx45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92k909muk40ldbudnx45.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The SQL below is all you need to attach another tenant's data, with only the selected columns exposed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Attach a table from your on-premises deployment to cloud
ATTACH TABLE cloud_population (city, population)
's3://databend-doc/1/16/'
CONNECTION = (
   REGION = 'us-east-2',
   AWS_KEY_ID = 'your_key',
   AWS_SECRET_KEY = 'your_secret'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
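&lt;p&gt;After attaching, the remote data is immediately queryable, and results always reflect the current state of the source table. A minimal sketch, assuming the attachment above succeeded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Reads go against the source tenant's storage; nothing was copied
SELECT city, population
FROM cloud_population
ORDER BY population DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;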



&lt;h2&gt;
  
  
  The Path Forward
&lt;/h2&gt;

&lt;p&gt;The evolution of storage-compute separation architecture is like injecting infinite vitality and elasticity into enterprise data infrastructure. From the initial decoupling of compute and storage, to multi-cluster high-concurrency support and object storage upgrades, to seamless data sharing within and between enterprises—each innovation reshapes the future of data processing.&lt;/p&gt;

&lt;p&gt;Databend closely aligns with the evolution of cloud platform infrastructure, cleverly leveraging cloud resources to create simple, cost-effective architecture for users, serving as a solid bridge between users and cloud providers. Behind every technological breakthrough lies deep insight into users' core needs and innovative spirit.&lt;/p&gt;

&lt;p&gt;Today, Databend has been successfully deployed as a leading cloud-native storage-compute separation platform across multiple industries, including high-frequency trading, biopharmaceuticals, data trading, gaming, and e-commerce. We've helped enterprises significantly reduce costs, improve efficiency, and unlock unlimited innovation potential.&lt;/p&gt;

&lt;p&gt;The question isn't whether to evolve your data architecture—it's how quickly you can make the transition while minimizing disruption to existing operations. Organizations that recognize this early and adopt advanced architectures will have significant competitive advantages in terms of both cost and capability.&lt;/p&gt;

&lt;p&gt;Looking ahead, as artificial intelligence and data lakehouses deeply integrate, storage-compute separation architecture will lead more industry-level applications and new models of data collaboration, opening a new chapter in the intelligent data era. The future of data platforms isn't just about storing and processing information more efficiently—it's about creating infrastructure that adapts to your business needs rather than constraining them.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>aws</category>
      <category>hadoop</category>
      <category>database</category>
    </item>
    <item>
      <title>Beyond the Buzzword: What Lakehouse Actually Means for Your Business</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Wed, 28 May 2025 13:59:56 +0000</pubDate>
      <link>https://dev.to/ksanaka/beyond-the-buzzword-what-lakehouse-actually-means-for-your-business-4577</link>
      <guid>https://dev.to/ksanaka/beyond-the-buzzword-what-lakehouse-actually-means-for-your-business-4577</guid>
      <description>&lt;p&gt;The lakehouse revolution isn't just another tech trend - it's a game-changer that's redefining how industry leaders leverage their most valuable asset: data. Forward-thinking enterprises worldwide are taking notice, and for good reason.&lt;/p&gt;

&lt;p&gt;Are you wondering if this approach could be your competitive edge? You should be. The lakehouse architecture delivers unprecedented value by combining the best of data warehouses and data lakes into a unified, high-performance platform that's built for today's complex data challenges.&lt;/p&gt;

&lt;p&gt;To truly appreciate where we're headed, we need to understand where we've been. Throughout the evolution of data platforms, technologies have come and gone, but the core challenge has remained constant: how to unlock maximum business value from your data with minimum complexity and cost.&lt;/p&gt;

&lt;p&gt;This is exactly why industry leaders are rapidly adopting lakehouse architecture. It's not just an improvement - it's a fundamental shift that could redefine what's possible for your business, positioning you to make strategic decisions that transform your data capabilities and create sustainable competitive advantage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The beginning of big data infrastructure: Hadoop and its JVM family
&lt;/h2&gt;

&lt;p&gt;About 10–15 years ago, the first wave of big data platforms emerged around Hadoop, with Lambda architecture (combining batch and real-time processing) becoming the industry standard. These systems were extraordinarily complex and resource-intensive. Organizations invested heavily in specialized talent, yet the resulting systems were often fragmented collections of offline components with limited commercial viability.&lt;/p&gt;

&lt;p&gt;During this era, technical teams would impress executives with elaborate Hadoop + Hive + Spark architecture diagrams, promising transformative data capabilities while justifying the expansion of their teams. Companies would assemble massive 50+ person big data teams to build these ambitious platforms.&lt;/p&gt;

&lt;p&gt;However, the reality fell far short of expectations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data integrity issues were commonplace, with teams downplaying concerns by claiming that "a few missing records in massive datasets won't impact business operations"&lt;/li&gt;
&lt;li&gt;Data corrections were painfully inefficient - incorrect data required entire partitions to be deleted and reprocessed&lt;/li&gt;
&lt;li&gt;Storage costs ballooned as data frequently needed duplication across systems&lt;/li&gt;
&lt;li&gt;Despite their supposed power, these platforms would collapse when executives attempted basic queries without specifying partition keys&lt;/li&gt;
&lt;li&gt;Systems that initially performed well with 1 billion monthly records would deteriorate with 2 billion daily records, and by the time data volumes reached trillions of records, the original architects had typically moved on, leaving behind platforms that required complete reconstruction&lt;/li&gt;
&lt;li&gt;The excessive complexity (often 30+ components) meant that upgrading any single element risked breaking the entire system&lt;/li&gt;
&lt;li&gt;The fragmented architecture created security vulnerabilities with each additional component&lt;/li&gt;
&lt;li&gt;Query planning alone could take minutes, with execution plans rendered impractical by metadata overload from excessive partitioning&lt;/li&gt;
&lt;li&gt;Even seemingly straightforward tasks like deduplication would overwhelm teams&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The list of frustrations was virtually endless…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiic0e4u4qsmap2thhhy0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiic0e4u4qsmap2thhhy0.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This generation of big data platforms became a source of constant anxiety for executives. Real-world implementation revealed that these architectures not only failed to deliver on their promises but often led companies into strategic dead ends. Data engineers found themselves trapped in endless cycles of data loading and preparation using Spark and Hadoop, while downstream business users remained unable to extract meaningful value from their data investments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spark-based lakehouse: simpler but not simple enough
&lt;/h2&gt;

&lt;p&gt;In this evolution, the industry made significant progress toward simplification. Data lakes adopted more structured approaches with Apache Iceberg format providing a robust table format for massive datasets. By leveraging Apache Spark and Apache Flink, organizations reduced their architecture from 30+ components to roughly 10 core technologies, allowing teams to shrink from 50+ specialists to about 10.&lt;/p&gt;

&lt;p&gt;This second-generation lakehouse focused on specific technical challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing Iceberg's historical versions, compaction operations, and Z-order processing&lt;/li&gt;
&lt;li&gt;Implementing Shuffle services for distributed processing&lt;/li&gt;
&lt;li&gt;Maintaining security across integrated components&lt;/li&gt;
&lt;li&gt;Building intelligent SQL Gateways to route queries appropriately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These improvements delivered meaningful benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transaction support ensuring data reliability&lt;/li&gt;
&lt;li&gt;Consistent metadata management across the ecosystem&lt;/li&gt;
&lt;li&gt;Unified storage eliminating redundant data duplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, significant challenges remained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organizations still needed 10+ specialized engineers just for maintenance&lt;/li&gt;
&lt;li&gt;The architecture required senior Spark experts, particularly challenging when handling billions to trillions of records&lt;/li&gt;
&lt;li&gt;Practical limitations emerged: infrastructure teams generally imposed a series of restrictions, such as keeping single tables under 10 billion records and databases under 10,000 tables, to avoid system degradation&lt;/li&gt;
&lt;li&gt;Exceeding these limits often resulted in system instability, missed SLAs, and engineering burnout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These persistent challenges explain why managed solutions like Snowflake, Databend and Databricks gained such tremendous market traction. &lt;/p&gt;

&lt;p&gt;Organizations increasingly recognized that the technical complexity and operational overhead of maintaining these systems in-house simply wasn't worth the effort compared to fully-managed alternatives offering similar capabilities with significantly reduced operational burden.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud-Native, Self-Driven Lakehouse: The Evolution We've Been Waiting For
&lt;/h2&gt;

&lt;p&gt;After witnessing the painful journey through previous data architecture generations, the industry has crystallized what truly matters in a modern lakehouse solution:&lt;br&gt;
&lt;strong&gt;Simplicity is the ultimate sophistication&lt;/strong&gt;. A truly effective lakehouse must eliminate complexity at every level - from deployment and scaling to daily operations and user experience. It should free your technical teams from mundane maintenance tasks and empower them to deliver actual business value.&lt;br&gt;
This fundamental insight drove us to create Databend in 2021 - a solution that finally delivers on the lakehouse promise without the traditional overhead and complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Databend Difference: Simplicity with Enterprise Power
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;True Cloud-Native Architecture&lt;/strong&gt; Databend was built from first principles with a diskless, S3-native design. Unlike retrofitted solutions, our architecture requires minimal infrastructure - just a few compute nodes atop your existing S3 storage. No complex caching layers, no specialized hardware requirements. Scaling becomes trivially simple: add compute resources when needed, and your performance and concurrency scale linearly with your business demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Maintenance Operations&lt;/strong&gt; We've eliminated the operational burden that plagued previous generations. Databend intelligently handles all the tedious background tasks - compaction, reclustering, optimization - automatically maintaining peak performance without human intervention. Whether you choose our self-hosted enterprise version or cloud offering, your team is finally free from the endless cycle of maintenance firefighting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Migration Path&lt;/strong&gt; We understand that technology transitions must be practical. That's why Databend supports the tools and workflows your team already knows - from familiar SQL syntax to popular language drivers, UDFs, and integrations with tools like DBT, Airbyte, DBeaver, and Tableau. Your existing skills transfer directly to our platform, eliminating painful retraining cycles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmzyapkq0sgzanzjk92q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmzyapkq0sgzanzjk92q.png" alt="Image description" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Transformation: Gaming Industry Case Study
&lt;/h3&gt;

&lt;p&gt;A leading gaming company with 100M+ monthly active users was struggling with their data infrastructure. Their CTO put it bluntly: "Every minute of analytics delay costs us approximately $10,000 in revenue opportunity. With our current setup, we're leaving millions on the table."&lt;/p&gt;

&lt;p&gt;Their requirements were clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-second data ingestion for player behavior analysis&lt;/li&gt;
&lt;li&gt;Support for 5,000+ concurrent streaming computations during peak hours&lt;/li&gt;
&lt;li&gt;Processing of 100+ billion monthly events with consistent performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After implementing Databend, the results were immediate:&lt;br&gt;
&lt;strong&gt;From Data Wrangling to Revenue Generation&lt;/strong&gt; "We cut our data preparation cycle from 48 hours to 30 minutes," their Director of Analytics reported. "Game designers now optimize monetization strategies based on same-day insights rather than week-old data." By processing diverse data formats directly with standard SQL, they eliminated their Spark engineering team, saving $1.2M annually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Complex Streaming to Simple Solutions&lt;/strong&gt; Their Lead Architect explained: "We replaced our entire Flink cluster with Databend's UDF capabilities. A single engineer now implements features that previously required specialized teams." This enabled critical player-retention features like post-match analysis to deploy in days instead of months, with automatic scaling handling 10x traffic spikes during tournaments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Infrastructure Management to Business Innovation&lt;/strong&gt; "We've shifted 60% of our engineering resources from maintenance to innovation," noted their CTO. "While competitors struggle with their data platforms, we're shipping new features weekly." The business impact was clear: 22% increase in player retention and 15% growth in in-game purchases directly tied to faster analytics capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future Belongs to Those Who Simplify
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthsagyvis6fmgajrizdo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthsagyvis6fmgajrizdo.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's be honest - the data architecture journey has been unnecessarily painful. We've collectively spent billions on complex systems that required armies of specialists just to keep running. The promise of insights-driven business has too often been buried under infrastructure complexity.&lt;/p&gt;

&lt;p&gt;Databend was born from this frustration. As practitioners who lived through these challenges, we built what we wished had existed all along - a lakehouse that simply works.&lt;/p&gt;

&lt;p&gt;The results speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Companies reducing their data engineering teams by 60–70%&lt;/li&gt;
&lt;li&gt;Analytics cycles compressed from days to minutes&lt;/li&gt;
&lt;li&gt;Infrastructure costs cut by 40–60% while handling larger workloads&lt;/li&gt;
&lt;li&gt;Business teams empowered to answer their own questions without technical bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just incremental improvement - it's a fundamental shift in how organizations can approach their data strategy. The company case study isn't an outlier - it's what becomes possible when you remove the artificial complexity that's been holding your data initiatives back.&lt;/p&gt;

&lt;p&gt;The most powerful technologies are those that fade into the background, doing their job so reliably that you barely notice them. That's the future Databend delivers - where your team's energy goes into extracting business value from data, not babysitting infrastructure.&lt;/p&gt;

&lt;p&gt;The choice is clear: continue investing in increasingly complex architectures that demand specialized skills, or embrace a solution that finally delivers on the lakehouse promise without the traditional overhead. Your competitors are already making this decision - which side of the divide will you be on?&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>hadoop</category>
      <category>lakehouse</category>
      <category>warehouse</category>
    </item>
    <item>
      <title>Simplify Private Data Warehouse Ops: Visualized, Secure, and Fast with BendDeploy on Kubernetes</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Thu, 15 May 2025 15:05:32 +0000</pubDate>
      <link>https://dev.to/ksanaka/simplify-private-data-warehouse-ops-visualized-secure-and-fast-with-benddeploy-on-kubernetes-1b1e</link>
      <guid>https://dev.to/ksanaka/simplify-private-data-warehouse-ops-visualized-secure-and-fast-with-benddeploy-on-kubernetes-1b1e</guid>
      <description>&lt;h2&gt;
  
  
  What is BendDeploy?
&lt;/h2&gt;

&lt;p&gt;BendDeploy is a standardized, Kubernetes-based private deployment and management platform provided by Databend. It offers a one-stop solution for managing Databend clusters, enabling fast and secure deployment and administration of multiple clusters and tenants. With an intuitive user interface, BendDeploy delivers a simple and user-friendly visual operation experience that significantly improves operational efficiency, reduces errors and downtime, and enhances service quality and stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  BendDeploy Features
&lt;/h2&gt;

&lt;p&gt;BendDeploy offers a wide range of capabilities, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tenant Management&lt;/strong&gt;: Supports multi-tenancy with full tenant isolation and management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Deployment&lt;/strong&gt;: Easily launch a complete Databend cluster with one click using BendDeploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Management&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Overview&lt;/strong&gt;: View cluster status, configuration details, connection info, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Scaling&lt;/strong&gt;: Quickly scale the number of warehouse instances up or down with a single click.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rolling Upgrade&lt;/strong&gt;: Perform rolling upgrades of Databend cluster versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Rollback&lt;/strong&gt;: Roll back to a previous cluster version with minimal downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Restart&lt;/strong&gt;: Perform rolling restarts of &lt;code&gt;databend-query&lt;/code&gt; nodes within a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Details&lt;/strong&gt;: View real-time status and configuration for individual nodes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;External&lt;/strong&gt; &lt;strong&gt;Meta&lt;/strong&gt; &lt;strong&gt;Takeover&lt;/strong&gt;: If you’ve already deployed Databend, BendDeploy can take over the external Meta service—only stateless &lt;code&gt;databend-query&lt;/code&gt; nodes need to be deployed afterward.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Parameter Configuration&lt;/strong&gt;: Customize configuration files for warehouse clusters, including resource limits and Databend-specific parameters.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Monitoring Integration&lt;/strong&gt;: Connect to external Prometheus instances to monitor cluster metrics.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Log&lt;/strong&gt; &lt;strong&gt;Viewing&lt;/strong&gt;: After enabling tenant-level log collection, easily access logs such as &lt;code&gt;databend-query&lt;/code&gt; logs and profile logs.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;SQL&lt;/strong&gt; &lt;strong&gt;Worksheet&lt;/strong&gt;: Use the built-in web UI to run SQL queries directly on a selected warehouse within a tenant.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;User Management&lt;/strong&gt;: Manage users, assign them to tenants, remove them, and control roles and permissions.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;With BendDeploy, nearly all Databend operations and maintenance tasks can be performed through a visual interface—no more managing clusters from the shell. It's accurate, efficient, and user-friendly.&lt;/p&gt;

&lt;h2&gt;
  
  
  BendDeploy Architecture
&lt;/h2&gt;

&lt;p&gt;Let’s first take a look at the BendDeploy architecture from a component-level perspective:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n8zmmrnofusbpc03i86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3n8zmmrnofusbpc03i86.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage&lt;/strong&gt;: Databend follows the best practice of storing data in object storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta&lt;/strong&gt;: Only one Meta cluster needs to be installed per platform, or you can connect to an existing external Meta cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warehouse&lt;/strong&gt;: A compute cluster consisting of a group of &lt;code&gt;databend-query&lt;/code&gt; instances, belonging to a single tenant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tenant&lt;/strong&gt;: A tenant is a logical unit managed by the control platform. One platform can manage multiple tenants, and each tenant can contain multiple warehouses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingress&lt;/strong&gt; &lt;strong&gt;Layer&lt;/strong&gt;: Each tenant has an ingress component that exposes database connection services via domain names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Management&lt;/strong&gt;: This module manages users, roles, and permission-related settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web UI&lt;/strong&gt;: A visual SQL IDE for running SQL queries.&lt;/p&gt;

&lt;p&gt;From an administrator's perspective, the architecture illustrates the relationships between each component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8sfw1b9i8jaytrkt2gp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8sfw1b9i8jaytrkt2gp.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  BendDeploy Deployment
&lt;/h2&gt;

&lt;p&gt;As a cloud-native lakehouse, Databend is best deployed in a Kubernetes (K8s) environment, and BendDeploy is currently limited to K8s-only deployments. Therefore, before deploying BendDeploy, a Kubernetes cluster must be set up. This guide assumes that the user already has a K8s cluster ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the Logging Components
&lt;/h2&gt;

&lt;p&gt;BendDeploy collects logs from Databend. To enable this, we configure an OTLP endpoint for Databend to collect &lt;code&gt;query detail logs&lt;/code&gt; and &lt;code&gt;query profile logs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To deploy the related logging components, download the &lt;a href="https://github.com/databendcloud/benddeploy-charts/tree/main/charts/logging" rel="noopener noreferrer"&gt;Databend logging Helm chart&lt;/a&gt;, which comes pre-configured to collect logs from Databend components. This includes configurations for &lt;strong&gt;Vector&lt;/strong&gt; and the &lt;strong&gt;log collector&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To complete the setup, simply update the object storage settings in the &lt;code&gt;values.yaml&lt;/code&gt; file—specifically, replace the configuration under &lt;code&gt;warehouseLogCollector.s3&lt;/code&gt; with your own &lt;code&gt;endpoint&lt;/code&gt;, &lt;code&gt;auth&lt;/code&gt;, and other relevant details.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;warehouseLogCollector:
  enabled: true
  replicas: 1
  nameOverride: warehouse-log-collector
  fullnameOverride: warehouse-log-collector
  s3:
    endpoint: "S3 endpoint"
    bucket: "s3.bucket"
    region: "us-east-1"
    auth:
      accessKeyId: "access_key_id"
      secretAccessKey: "secret_key"
    batch:
      maxBytes: 10485760  # 10MB
      timeoutSecs: 60     # 1min
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, install the logging chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;upgrade&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;logging&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;./logging&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--namespace=logging&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--create-namespace&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the following command to check whether the logging components have been successfully installed:&lt;/p&gt;
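&lt;p&gt;One such check, assuming the chart was installed into the &lt;code&gt;logging&lt;/code&gt; namespace as above, is to list the pods and confirm they are all &lt;code&gt;Running&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods -n logging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;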

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff086v9r8son5x7vvpi4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff086v9r8son5x7vvpi4l.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The logging component is not mandatory. If you do not need to enable the platform's logging collection feature, you can choose not to install it for now.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Install BendDeploy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Download Helm Chart &amp;amp; Modify values.yaml
&lt;/h3&gt;

&lt;p&gt;First, download the &lt;a href="https://github.com/databendcloud/benddeploy-charts/tree/main/charts/benddeploy" rel="noopener noreferrer"&gt;BendDeploy Helm chart file&lt;/a&gt;. Then, modify the &lt;code&gt;values.yaml&lt;/code&gt; and &lt;code&gt;configmap-benddeploy.yaml&lt;/code&gt; files in the chart based on your actual requirements, such as the &lt;code&gt;ingress&lt;/code&gt; host, Prometheus, OIDC provider endpoints, and the PostgreSQL connection string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="nl"&gt;apiVersion:&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt;
&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ConfigMap&lt;/span&gt;
&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;benddeploy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
  &lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;benddeploy&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s"&gt;"env"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"namespace"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"benddeploy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"imageRegistry"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"secretKey"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"lokiHost"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"grafanaEndpoint"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http://grafana.monitoring.svc.cluster.local:80"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"promEndpoint"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"thanos-query.monitoring.svc.cluster.local:9090"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"logCollectorEndpoint"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http://warehouse-log-collector.logging.svc.cluster.local:4318"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"salt"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"salt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"jwt"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"kind"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"static"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"static"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s"&gt;"signingKey"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"-----BEGIN EC PRIVATE KEY-----&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;MHcCAQEEIN2zd9w6nMcFpP0FuUb3413RSgo92nhtGYlXSBfgQD9yoAoGCCqGSM49&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;AwEHoUQDQgAEJI4I7jG0eVUMTo8TkrOso1VTWBddEaUD2UI7efrYzmI2zFocd7Ga&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;fLbl54WFHeJ75ur0ED8EUf7QPEmgkr5QHQ==&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;-----END EC PRIVATE KEY-----"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s"&gt;"pubKey"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"-----BEGIN PUBLIC KEY-----&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEJI4I7jG0eVUMTo8TkrOso1VTWBdd&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;EaUD2UI7efrYzmI2zFocd7GafLbl54WFHeJ75ur0ED8EUf7QPEmgkr5QHQ==&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;-----END PUBLIC KEY-----"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s"&gt;"db"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"postgresDSN"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"postgres://postgres:postgres@pg-benddeploy.benddeploy.svc.cluster.local:5432/default?sslmode=disable"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="s"&gt;"adminUsers"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s"&gt;"databend"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"admin"&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="s"&gt;"jwtKeyFile"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"http://benddeploy-service.benddeploy.svc.cluster.local:8080/.well-known/jwks.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s"&gt;"oidcProvider"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"https://oidc.eks.us-east-2.amazonaws.com/id/120A6F5E89F8F14E7A1E5E15FE48196D/keys"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;oidcProvider&lt;/code&gt;: Allows Kubernetes to integrate with external identity providers, enabling users to access the Kubernetes cluster using their existing enterprise identity credentials.&lt;/p&gt;

&lt;p&gt;For production clusters, you need to override the following parameters in the &lt;code&gt;configmap-benddeploy.yaml&lt;/code&gt; using your own values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;imageRegistry&lt;/strong&gt;: The image repository for Databend Docker images. You need to deploy your own registry and upload the corresponding Databend image. The official image repo is available at &lt;a href="https://hub.docker.com/r/datafuselabs/databend/tags" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;. If the deployed machine can access the public internet, this parameter can be left empty, and the official image from Docker Hub will be used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;registryUsername&lt;/strong&gt;: The username for the image registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;registryPassword&lt;/strong&gt;: The password for the image registry. Leave empty if no authentication is required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;repoNamespace&lt;/strong&gt;: The namespace in the registry where &lt;code&gt;databend-query&lt;/code&gt; and &lt;code&gt;databend-meta&lt;/code&gt; Docker images are stored. For example, if the registry is &lt;code&gt;registry.databend.local&lt;/code&gt; and the namespace is &lt;code&gt;datafuselabs&lt;/code&gt;, the image would be uploaded as &lt;code&gt;registry.databend.local/datafuselabs/databend-query:tag&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;promEndpoint&lt;/strong&gt;: The endpoint for the Prometheus server provided by the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;logCollectorEndpoint&lt;/strong&gt;: The endpoint for the log collector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;grafanaEndpoint&lt;/strong&gt;: The endpoint for the Grafana server provided by the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;db.postgresDSN&lt;/strong&gt;: The DSN for the PostgreSQL database. If &lt;code&gt;values.postgres.enabled&lt;/code&gt; is true, this value will be ignored (the system will create a simple PostgreSQL service, but note that this is for testing purposes only; for production environments, use an external PostgreSQL service).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;oidcProvider&lt;/strong&gt;: The OIDC provider used for authentication in the Kubernetes cluster.&lt;/li&gt;
&lt;/ul&gt;
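&lt;p&gt;As a sketch, a production override of these keys in &lt;code&gt;configmap-benddeploy.yaml&lt;/code&gt; might look like the following (every value here is a placeholder to be replaced with your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "imageRegistry": "registry.databend.local",
  "registryUsername": "deploy",
  "registryPassword": "deploy-password",
  "repoNamespace": "datafuselabs",
  "promEndpoint": "prometheus.monitoring.svc.cluster.local:9090",
  "logCollectorEndpoint": "http://warehouse-log-collector.logging.svc.cluster.local:4318",
  "grafanaEndpoint": "http://grafana.monitoring.svc.cluster.local:80",
  "db": {
    "postgresDSN": "postgres://user:password@your-postgres-host:5432/benddeploy?sslmode=disable"
  },
  "oidcProvider": "https://your-oidc-provider/keys"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;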

&lt;p&gt;After completing the configuration modifications, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;helm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;benddeploy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;-n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;benddeploy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--create-namespace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;./chart&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;--set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;image=public.ecr.aws/databendlabs/benddeploy:v&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;You can find all the released versions of the BendDeploy Docker images at &lt;a href="https://gallery.ecr.aws/databendlabs/benddeploy" rel="noopener noreferrer"&gt;https://gallery.ecr.aws/databendlabs/benddeploy&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After a successful run, check the BendDeploy installation status by executing:&lt;/p&gt;
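&lt;p&gt;For example, assuming the release was installed into the &lt;code&gt;benddeploy&lt;/code&gt; namespace as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods -n benddeploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;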

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft52lnz2xyxtqqr887g7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft52lnz2xyxtqqr887g7j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Access the web interface in your browser
&lt;/h3&gt;

&lt;p&gt;After a successful deployment, you can open the BendDeploy page by visiting the domain configured in the ingress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0dmaehvl9p4ed6xg03o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0dmaehvl9p4ed6xg03o.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dpmptlzpp8z20swagrk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dpmptlzpp8z20swagrk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default initialization username and password for BendDeploy are: &lt;strong&gt;databend/Databend@1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After logging into the system, you can change the password in the user management module.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please note that this account password is only for the BendDeploy management platform and is not the username and password for the Databend database!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Platform Configuration
&lt;/h3&gt;

&lt;p&gt;Cloud-native technologies are now highly mature, and most companies' cloud management platforms already have monitoring and alerting components such as Grafana and Prometheus installed. Therefore, BendDeploy no longer includes these components by default, but instead provides an entry point for users to configure their own.&lt;/p&gt;

&lt;p&gt;During the configuration, connectivity with the server will also be verified.&lt;/p&gt;

&lt;h1&gt;
  
  
  BendDeploy User Manual
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Create a New Tenant
&lt;/h2&gt;

&lt;p&gt;After the deployment is complete, let's create the first tenant.&lt;/p&gt;

&lt;p&gt;First, let's watch the video tutorial on how to create a new tenant:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/3pO6kDAwCEM"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Name the new tenant
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Specify an external Meta or create a Meta cluster
&lt;/h4&gt;

&lt;p&gt;If you have created a Meta cluster in BendDeploy, you can select the built-in Meta here. Alternatively, if you have an external Meta cluster that you want to integrate with BendDeploy, you can provide the information for your custom Meta.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb03l8xy8xtkze33atre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb03l8xy8xtkze33atre.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you click "Next," the system will validate the availability of the Meta cluster you specified. Once the verification passes, you can continue with the next steps of the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Object Storage
&lt;/h3&gt;

&lt;p&gt;The best practice for Databend is to store data in object storage. Here, we configure the user's object storage-related information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbo5m4o117uifm36hcm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbo5m4o117uifm36hcm4.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After configuring the object storage, clicking "Next" will trigger a validation to ensure the accuracy of the configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Databend Query
&lt;/h3&gt;

&lt;p&gt;Here, we configure the initial username and password for Databend Query, which will be used to connect to Databend for the first time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc7lydwjppdz6ka3l91h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc7lydwjppdz6ka3l91h.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the "Enable Load Balance" option is selected, a load balancing strategy will be configured on the &lt;code&gt;databend-query&lt;/code&gt; ingress. This will ensure that traffic is evenly distributed across the backend &lt;code&gt;databend-query&lt;/code&gt; nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure a wildcard domain
&lt;/h3&gt;

&lt;p&gt;The wildcard domain provided by the user will offer an access point for each warehouse. For example, if the domain &lt;code&gt;*.databend.com&lt;/code&gt; is provided, after creating a warehouse named &lt;code&gt;test-1&lt;/code&gt;, you can access Databend through &lt;code&gt;test-1.databend.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd8vs70n54a6x0sje80q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd8vs70n54a6x0sje80q.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter Databend Enterprise License
&lt;/h3&gt;

&lt;p&gt;If you have purchased the Databend Enterprise Edition, this step will involve entering the Databend License, after which you will be able to use the enterprise-grade features in Databend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct6zvjycdfz3bb4uc2ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct6zvjycdfz3bb4uc2ed.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a warehouse
&lt;/h2&gt;

&lt;p&gt;Under the tenant we just created, we can now create a warehouse. Let's first watch the video tutorial:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/O7gZ7T6nFqA"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Click "Create Warehouse," then fill in the warehouse name, select the version, specify the number of instances and resource configuration for the cluster, and set tags for the cluster before creating it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that you need to upload the Databend Docker image to the registry within the Kubernetes cluster in advance. This way, you can select the desired Databend version from the image dropdown list during deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more information about Databend's runtime parameters and settings, please refer to the documentation: &lt;a href="https://docs.databend.com/guides/deploy/references/node-config/query-config" rel="noopener noreferrer"&gt;https://docs.databend.com/guides/deploy/references/node-config/query-config&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the creation is complete, you will be able to see the newly created warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ld65u6elz6wkrmcw5xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ld65u6elz6wkrmcw5xn.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Manage warehouses
&lt;/h2&gt;

&lt;p&gt;After clicking the dropdown menu on the right, you can perform operations such as scaling the warehouse, rolling restart, updating, deleting, and viewing details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uwn40rf42y6upc4qvby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uwn40rf42y6upc4qvby.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the ingress domain is configured correctly, you can use &lt;a href="https://github.com/databendlabs/bendsql" rel="noopener noreferrer"&gt;BendSQL&lt;/a&gt; to directly connect to the cluster.&lt;/p&gt;

&lt;p&gt;Click "Connect" to obtain the connection string:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduqgei9tp9uxqdx0lz6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduqgei9tp9uxqdx0lz6d.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, connect with BendSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bendsql &lt;span class="nt"&gt;-u&lt;/span&gt; databend &lt;span class="nt"&gt;-h&lt;/span&gt; test-warehouse.for-test.a &lt;span class="nt"&gt;-P&lt;/span&gt; 8000 &lt;span class="nt"&gt;-p&lt;/span&gt; databend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
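&lt;p&gt;Once connected, a quick query confirms the warehouse is serving requests. (Piping SQL on stdin is one way to run it non-interactively; the flags match the connection command above.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo "SELECT VERSION();" | bendsql -u databend -h test-warehouse.for-test.a -P 8000 -p databend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;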



&lt;h2&gt;
  
  
  Enable Logging
&lt;/h2&gt;

&lt;p&gt;The logging module in BendDeploy is quite special. Since Databend itself has strong capabilities for log storage, querying, and analysis, we store all warehouse-related logs directly in Databend. Additionally, you can enable or disable log collection on a per-tenant basis, which further helps reduce the user's log storage costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialize Logging Module
&lt;/h3&gt;

&lt;p&gt;When BendDeploy is first deployed, the logging module is disabled by default. Users need to create at least one tenant before enabling log collection. This is because a dedicated warehouse for user log collection needs to be created within the tenant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6tn5dkhqrw2s2apuiv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6tn5dkhqrw2s2apuiv5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After clicking the "Enable logs" button, the log component will begin its initialization.&lt;/p&gt;

&lt;p&gt;Select the tenant and warehouse where user logs will be collected. You can use an existing warehouse host within the tenant; if you do not provide one, the system will automatically create a warehouse named &lt;code&gt;internale-log-collector&lt;/code&gt;. You also need to provide the object storage configuration for storing the logs.&lt;/p&gt;

&lt;p&gt;Once the log module is initialized, log collection will be enabled. There may be a delay before logs are refreshed; if you want to view them immediately, click the "Flush" button on the tenant, which flushes the logs to Databend right away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06xlhd4fshuljvb4djuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06xlhd4fshuljvb4djuo.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, three types of logs are stored: &lt;code&gt;query_logs&lt;/code&gt;, &lt;code&gt;query_profiles&lt;/code&gt;, and &lt;code&gt;query_details&lt;/code&gt;. &lt;code&gt;query_profiles&lt;/code&gt; are essentially a graphical version of the &lt;code&gt;EXPLAIN&lt;/code&gt; command: they provide execution plans and performance details for queries, and from the logs you can jump to the profile chart of each SQL request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur591sua7zirr2b9zg7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur591sua7zirr2b9zg7j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo2s96nhyry3x2kljbhv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foo2s96nhyry3x2kljbhv.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also query logs based on the &lt;code&gt;query_id&lt;/code&gt; or keywords within the logs. BendDeploy consolidates log queries into a single platform, eliminating the need to configure Elasticsearch (ES) or Grafana, which greatly simplifies the operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manage Users
&lt;/h2&gt;

&lt;p&gt;The user module manages user creation, deletion, and password changes. When creating a user, you can add them to a specific tenant and select the role for running SQL queries. Only administrators have permission to manage users. For more details, refer to the video tutorial below.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/l8GMDaoVjtc"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Switch User
&lt;/h3&gt;

&lt;p&gt;Users can click the three dots in the lower-left corner to view user details and log out, allowing them to switch users and log in again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lby9fi69wagcmlqvigk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lby9fi69wagcmlqvigk.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL IDE
&lt;/h2&gt;

&lt;p&gt;BendDeploy provides a convenient SQL IDE access entry in the left navigation bar. In the IDE, users can view data in the database, select databases, and perform SQL queries, data viewing, permission management, and other operations.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/wnQUOwUKVCo"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Upgrade BendDeploy
&lt;/h2&gt;

&lt;p&gt;To upgrade BendDeploy, simply find the latest Docker image on &lt;a href="https://gallery.ecr.aws/databendlabs/benddeploy" rel="noopener noreferrer"&gt;AWS ECR&lt;/a&gt;, and then use &lt;code&gt;helm upgrade&lt;/code&gt; to perform the upgrade.&lt;/p&gt;
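&lt;p&gt;For example (the image tag below is a placeholder; substitute the latest released version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade benddeploy ./chart -n benddeploy --set image=public.ecr.aws/databendlabs/benddeploy:v1.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;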

&lt;h1&gt;
  
  
  Future Evolution Plan
&lt;/h1&gt;

&lt;p&gt;As data infrastructure evolves toward cloud-native technologies, there is a growing demand for private deployments from users with high data privacy requirements. To meet these needs, we will continue to optimize the user experience of BendDeploy. The upcoming evolution roadmap includes the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Integration of Databend Cloud's Task Features&lt;/strong&gt;: We will integrate Databend Cloud's &lt;a href="https://docs.databend.com/sql/sql-commands/ddl/task/ddl-create_task" rel="noopener noreferrer"&gt;Task feature&lt;/a&gt;, supporting scheduled tasks and DAG-based task flow orchestration, enabling automated data processing workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Panel Upgrade with Databend Operator&lt;/strong&gt;: We will upgrade the control panel by introducing the &lt;a href="https://github.com/databendcloud/databend-operator" rel="noopener noreferrer"&gt;Databend Operator&lt;/a&gt;, a fully Kubernetes-native, scalable operations tool. This will allow users to abstract tenant configurations and compute resources into declarative objects using Custom Resource Definitions (CRD), providing the same level of automation, elasticity, and observability in any Kubernetes cluster as in cloud services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Cluster Support&lt;/strong&gt;: Based on the &lt;code&gt;databend-operator&lt;/code&gt;, support Multi-Cluster deployments to enhance the stability and reliability of the query service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These improvements will make BendDeploy the preferred solution for deploying Databend in enterprise private environments, matching cloud services in both functionality and operational convenience.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Databend offers a standardized Kubernetes-based private deployment and management platform — BendDeploy, enabling users to quickly and securely deploy and manage multiple Databend clusters and tenants. BendDeploy provides a user-friendly visual interface for operations and maintenance, significantly improving efficiency while reducing errors and downtime. With BendDeploy, users can perform multi-tenant management, cluster deployment, scaling, upgrading, and node management.&lt;/p&gt;

&lt;p&gt;In addition, BendDeploy includes capabilities for log collection, querying, and analysis — centralizing logs in one platform and simplifying operations. Users can also manage users, run SQL queries, and view data directly through BendDeploy. Thanks to its fully cloud-native architecture, deploying BendDeploy is fast and convenient.&lt;/p&gt;

&lt;p&gt;In short, BendDeploy provides Databend users with a comprehensive and easy-to-use management platform. Give it a try!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>warehouse</category>
      <category>dataengineering</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Is Big Data Dying?</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Mon, 05 May 2025 11:59:45 +0000</pubDate>
      <link>https://dev.to/ksanaka/is-big-data-dying-535k</link>
      <guid>https://dev.to/ksanaka/is-big-data-dying-535k</guid>
      <description>&lt;p&gt;In recent years, the notion that “big data is dying” seems to be gaining traction. Some say the big data craze has faded, while others lament the shrinking job opportunities, the increasing complexity of platforms, and the growing intricacy of business demands. But does this really mean big data is dying?&lt;/p&gt;

&lt;p&gt;I don’t think so. In my view, what’s truly dying is not big data itself, but rather the outdated “dragon-slaying techniques” that no longer have any “dragons” to slay. The “dragon” has evolved, while our weapons remain stuck in the past. Technology needs to evolve, architectures need to be restructured, and capabilities need to be upgraded.&lt;/p&gt;

&lt;p&gt;To understand the current state and trends of big data, we must first examine how it has evolved step by step. The development of big data can be divided into three stages:&lt;/p&gt;

&lt;h1&gt;
  
  
  Stage Ⅰ: Traditional Hadoop — The First Generation of “Siloed” Big Data Platforms
&lt;/h1&gt;

&lt;p&gt;The earliest big data platforms were primarily built around Hadoop, with the Lambda architecture (offline + real-time) serving as the de facto industry standard at the time. However, in practice, I deeply realized that big data platforms at this stage were still extremely difficult to use, requiring substantial manpower and resource investment. The systems built were hard to truly commercialize, functioning merely as collections of offline components.&lt;/p&gt;

&lt;p&gt;Back then, technical teams would present various Hadoop + Hive + Spark architecture diagrams to their bosses, pitching grand visions of the future. Riding the wave of the “data middle platform” narrative, they expanded their teams, assembling ambitious big data teams of fifty people, determined to stack up the platforms with great enthusiasm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxjh44fk2ldc5ghn5uip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxjh44fk2ldc5ghn5uip.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, reality quickly delivered a harsh wake-up call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data inconsistencies were rampant, and the only consolation to the boss was, “A discrepancy of one or two records in massive datasets doesn’t affect usage; this system isn’t tied to core business operations.”&lt;/li&gt;
&lt;li&gt;When data was found to be written incorrectly, it couldn’t be updated — entire partitions had to be deleted and rewritten.&lt;/li&gt;
&lt;li&gt;Costs remained stubbornly high because data often had to be stored in duplicate.&lt;/li&gt;
&lt;li&gt;Despite boasts about how powerful the big data platform was, a simple query could crash the entire system, just because the query didn’t include a partition key.&lt;/li&gt;
&lt;li&gt;In the early days, the platform handled 1 billion rows per month with reasonably designed partitions. But when daily data volumes hit 2 billion, the system began to lag severely. By the time data reached the trillion scale, the original team had already jumped ship, leaving the platform to be scrapped and rebuilt.&lt;/li&gt;
&lt;li&gt;The platform consisted of numerous components (at least 30 or more), and upgrading any single component was enough to bring the entire system down.&lt;/li&gt;
&lt;li&gt;The “siloed” architecture resulted in a forest of isolated systems, with each new “silo” introducing fresh security vulnerabilities that could destroy the platform.&lt;/li&gt;
&lt;li&gt;SQL execution plans took several minutes to generate, rendering them practically unusable (due to excessive partitions overwhelming metadata).&lt;/li&gt;
&lt;li&gt;When the boss casually requested, “Just deduplicate the data,” the entire team wanted to quit. (Back then, only the ODS layer of a certain major platform was barely functional.)&lt;/li&gt;
&lt;li&gt; … The list of such frustrating examples is endless.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s fair to say that big data platforms at this stage kept bosses up at night. Practice proved that this architecture not only failed to support business needs but could also lead the company’s big data strategy into a dead end. Big data engineers spent most of their time using Spark to load data into Hadoop, performing cleaning and governance, and preparing data for upstream data development and utilization teams. Critically, at this stage, the data development and utilization teams barely dared to speak up, often finding themselves in a disadvantaged position.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stage Ⅱ: Data Lake + Lakehouse Integration — Component “Slimming,” but Architectural Complexity Remains High
&lt;/h1&gt;

&lt;p&gt;The painful lessons from the first generation of big data platforms forced a restructuring of architectures. Data lake architectures, represented by Apache Iceberg and Delta Lake, emerged to replace the closed Hadoop ecosystem, enabling a “lakehouse” solution with transaction support, unified metadata, and unified storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol2cnlkwk4ynj28eqi6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol2cnlkwk4ynj28eqi6x.png" alt="Image description" width="528" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, technical teams began attempting to simplify components and reduce deployment complexity. We started introducing data lakes, persisting data in Iceberg format (ODS), supporting statistical read/write operations with Spark and Flink, and reducing the number of components from over 30 to around 10. The size of big data teams also shrank from over 50 people to about 10. During this phase, the work of big data teams primarily included:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Learning Spark scheduling tasks to clean up historical versions of Iceberg.&lt;/li&gt;
&lt;li&gt;Learning Spark scheduling tasks to perform Iceberg’s compact operations.&lt;/li&gt;
&lt;li&gt;Learning Spark scheduling tasks to handle Iceberg’s Z-order processing.&lt;/li&gt;
&lt;li&gt;Learning to use Shuffle service invocations.&lt;/li&gt;
&lt;li&gt;Managing permissions for the integration of various open-source components.&lt;/li&gt;
&lt;li&gt;Developing an SQL Gateway to route SQL queries to three backend compute engines based on query characteristics.&lt;/li&gt;
&lt;li&gt;Presenting the boss with a “beautifully designed, powerful” platform.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this stage, it was possible to paint an optimistic picture for the boss:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Support for ACID transactions.&lt;/li&gt;
&lt;li&gt;Implementation of unified metadata management.&lt;/li&gt;
&lt;li&gt;Support for unified storage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, bosses often raised doubts: “Wait a minute, this platform still requires at least 10 people. Wouldn’t it be better to have those 10 people work on business operations instead?”&lt;/p&gt;

&lt;p&gt;In reality, most companies built their big data platforms on this foundation. Although the platform slimmed down, with team sizes reduced from 50 to 10 and component counts from 30 to 10, the underlying complexity remained excessively high, making it difficult for most enterprises to fully implement and utilize.&lt;/p&gt;

&lt;p&gt;If such a platform could withstand data volumes reaching the trillion level, it would be considered exceptionally fortunate. Otherwise, big data developers would have to “take the fall.” As a big data engineer working on such platforms, it’s critical to understand its positioning: single-table data should ideally not exceed 10 billion records, and the entire database should be kept under 10,000 tables. Beyond these limits, no amount of Spark tasks can save the day, and overtime becomes inevitable.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stage Ⅲ: Cloud-Native, Disrupting the “Big Data Engineer” (Current)
&lt;/h1&gt;

&lt;p&gt;Entering the third stage, particularly in the gaming industry, large gaming companies launch new business initiatives weekly, followed by a flood of data analysis demands. In such scenarios, using the solutions from Stage Ⅱ, you’d find yourself building a new platform for each initiative, with any issue leading to all-night overtime. In the gaming industry, every moment translates to money, and every step impacts the year-end bonuses of the team. Traditional big data solutions can no longer meet business demands:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data needs to be ingested at second-level latency.&lt;/li&gt;
&lt;li&gt;Systems must support tens of thousands of streaming computations.&lt;/li&gt;
&lt;li&gt;There are massive demands for data mining.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At this point, cloud-native platforms like Snowflake and Databend — offering pay-as-you-go pricing, integrated batch and streaming processing, and full SQL expressiveness — have emerged to meet these needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnkbzsfq74jagca01j6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnkbzsfq74jagca01j6b.png" alt="Image description" width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They have transformed big data platforms: files like CSV, NDJSON, Parquet, ORC, and others stored in S3 can now be queried directly using standard SQL. Without relying on a Spark team for preprocessing, ETL engineers can complete data ingestion tasks that used to take two days in just half an hour.&lt;/p&gt;
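&lt;p&gt;As a rough sketch of what this looks like in practice (the stage name, bucket path, and credentials below are placeholders), Databend-style SQL can point a stage at raw files and query them in place:&lt;/p&gt;

```sql
-- Point an external stage at raw files already sitting in S3
-- (bucket path and credentials are placeholders).
CREATE STAGE raw_logs
  URL = 's3://my-bucket/logs/'
  CONNECTION = (ACCESS_KEY_ID = 'my-key-id' SECRET_ACCESS_KEY = 'my-secret');

-- Query Parquet files directly with standard SQL, no Spark preprocessing.
SELECT event_type, COUNT(*) AS cnt
FROM @raw_logs (FILE_FORMAT => 'parquet', PATTERN => '.*[.]parquet')
GROUP BY event_type;
```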

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnrvd16rysf4qcjjzb1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnrvd16rysf4qcjjzb1k.png" alt="Image description" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you know some Python, you can also leverage products like Databend, which support external UDFs, to implement table-level change capture and replace traditional Flink jobs. In the gaming industry, for example, tasks that output detailed match data immediately after a game ends can now be handled this way. Workloads that used to require dozens of Flink clusters can now be completed in just half a day — with the added benefit of elastic scaling.&lt;/p&gt;

&lt;p&gt;At this point, we start to see the role of “data preparation engineers” disappearing, and traditional “data warehouse platform specialists” gradually becoming marginalized. Big data has entered a new stage: lightweight platforms, expressive data modeling, and strong business understanding.&lt;/p&gt;

&lt;p&gt;Today, products like Snowflake, Databricks, and Databend offer robust solutions in this area.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;There’s still plenty of work across today’s big data ecosystem, with roles that include but are not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data preparation engineers (formerly known as “data movers”) — increasingly platformized or SaaS-based;&lt;/li&gt;
&lt;li&gt;ETL engineers;&lt;/li&gt;
&lt;li&gt;BI and data analysts (who might still track tasks in Excel daily, or use Text2SQL tools for assistance);&lt;/li&gt;
&lt;li&gt;Data quality and lineage engineers (platformized &amp;amp; SaaS-based);&lt;/li&gt;
&lt;li&gt;Algorithm and machine learning engineers;&lt;/li&gt;
&lt;li&gt;Scheduling platform developers (SaaS);&lt;/li&gt;
&lt;li&gt;Data pipeline developers (SaaS — “data sync” roles rebranded);&lt;/li&gt;
&lt;li&gt;Data visualization engineers (platformized &amp;amp; SaaS-based);&lt;/li&gt;
&lt;li&gt;Data platforms (large-scale big data systems built on stacks of concepts, now shifting toward “Data + AI”);&lt;/li&gt;
&lt;li&gt;Business-focused data analysts (for example, analyzing the impact of a $100,000 marketing spend to optimize future campaigns);&lt;/li&gt;
&lt;li&gt;Operators and decision-makers who monetize and drive actions from data insights.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In fact, demand for big data talent remains strong — it’s just that the industry now calls for more specialized skills. Traditional “data mover” roles are indeed shrinking, as simpler tools and platforms have made much of that work obsolete.&lt;/p&gt;

&lt;p&gt;If you’re considering a career in big data, there’s still huge potential. Here’s my personal view on a learning and growth path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start as an ETL engineer or data ingestion engineer (focus on low-code tools + SQL);&lt;/li&gt;
&lt;li&gt;Grow into specialized engineering roles — such as working on data synchronization tools, scheduling platforms, or visualization systems;&lt;/li&gt;
&lt;li&gt;Dive deeper into the core internals — storage, algorithms, SQL engines, and compilation — all excellent long-term directions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Big data isn’t dying — the infrastructure era is simply ending, and we’re entering the smart operations era. In the future, it won’t be about who can stack more components — it’ll be about who can create data value faster and more cost-effectively.&lt;/p&gt;

&lt;p&gt;I transitioned from OLTP systems to big data infrastructure, and I’m currently the Co-founder and Head of Ecosystem at Databend Labs. This year marks my fourth year in the big data field. Although I’m still a newcomer in many ways, I’ve already deeply felt the tremendous changes and limitless opportunities in this industry. I warmly welcome more big data professionals to connect, share, and grow together.&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>bigdata</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Unlocking 10x Revenue Growth with Data Analytics in Gaming</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Fri, 18 Apr 2025 16:33:04 +0000</pubDate>
      <link>https://dev.to/ksanaka/unlocking-10x-revenue-growth-with-data-analytics-in-gaming-3g31</link>
      <guid>https://dev.to/ksanaka/unlocking-10x-revenue-growth-with-data-analytics-in-gaming-3g31</guid>
      <description>&lt;p&gt;The gaming industry is actually very close to money. Often, a game is expected to break even within six months after launch; if it doesn't, the project might gradually be sidelined. This pace in the gaming industry, from a data production perspective, means generating hundreds of thousands of events per second, with tens of terabytes of data produced daily being the norm.&lt;/p&gt;

&lt;p&gt;The gaming business has a wide range of needs across operations, gameplay, and the in-game economy. For example, a small user-acquisition team of four or five people might spend 1 million on traffic - can they recover that 1 million within the campaign window? How should they analyze the data over those few days? For this kind of ad-hoc analysis, traditional dashboards are often impractical, so teams typically resort to direct SQL queries and export the results to Excel. Every flexible demand is handled this way, leading to an extremely high volume of analysis requests.&lt;/p&gt;

&lt;p&gt;In addition, data engineers in the gaming industry often face a mountain of demands. They spend most of their time writing SQL, performing ad-hoc statistics, and exporting data for requests. Typically, they produce many Excel files daily, handling 30 or more tasks, sometimes even hundreds.&lt;/p&gt;

&lt;p&gt;Under such immense demand, data governance in the gaming industry is extremely complex. On the data source side, it's often divided into web games and client-based games, with client games further segmented into different mobile platforms like iPhone and Android, and Android split into various brands. How do users from these different brands and channels behave? What's their user experience like? It even drills down to how the game performs on different screen sizes. Take frame data, for example: games typically render at 60 frames per second, and many titles store every one of those frames, along with latency data for each interaction, to enable future optimizations.&lt;/p&gt;

&lt;p&gt;Additionally, games incorporate "thrill and excitement" design elements. For instance, when a player defeats a monster, what rewards does the system provide? Can rewards be pushed in real-time after a player completes a task? Or, in a 15-minute match, how do you quickly match opponents for the next round to enhance the player's experience and boost retention? How do you achieve real-time processing? After a match ends, how do you swiftly showcase the player's highlight moments and game stats? These all pose significant challenges to real-time stream computing capabilities.&lt;/p&gt;

&lt;p&gt;We once worked with a team whose game ranked in the TOP 10 on the iPhone charts. Their big data team had over 50 people, yet they still couldn't keep up with all the business demands, with each request typically taking two to three days to resolve. This is very common in traditional big data teams and has become a standard business processing model. As a result, when facing rapid business iterations with traditional big data solutions, many companies find themselves in an even more chaotic situation. In real-world scenarios, you see countless Flink tasks flying around, with severe issues of task delays and queuing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mo5d47cttwz2yaa4e4g.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mo5d47cttwz2yaa4e4g.PNG" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many game companies' early architectures faced similar situations. While it may seem like there are only a few components, the reality is far more complex than imagined. For example, behind a NameNode or directory service, there's at least one MySQL database, and service availability discovery relies on ZooKeeper at minimum. These stacked components end up being far more numerous than initially anticipated.&lt;/p&gt;

&lt;p&gt;Additionally, platform stability is a major challenge. Data transfer tasks often account for 80% of the workload. Data engineers frequently joke that they're "data movers," with their daily expertise lying in moving data from A to B, then from B to C, for aggregation, computation, and drill-down analysis. The platform's computing resource bottlenecks are evident, and task queuing is a severe issue. An OLTP database like MySQL can serve tens of thousands of concurrent requests at once, whereas on traditional big data platforms tasks queue up and grind through one by one, making the process extremely cumbersome.&lt;/p&gt;

&lt;p&gt;Furthermore, one of the biggest issues with gaming platforms is the challenge of security. The platform incorporates a large number of stacked components, which may be very useful at the time, but if a security issue arises, it's nearly impossible to pinpoint which component is at fault. If any part of the data pipeline fails, the entire data chain is dragged down. &lt;/p&gt;

&lt;p&gt;Therefore, in big data platforms, network isolation is a critical aspect that requires special attention. Another severe challenge is that if an IDC (Internet Data Center) fails, it often necessitates migrating to an external data center or setting up intra-city redundancy, which is an extremely troublesome process.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to achieve a 10x revenue boost for data analysis in the gaming industry?
&lt;/h1&gt;

&lt;p&gt;Databend's current efforts have shifted from merely "storing" data to integrating with business scenarios, directly helping to improve business efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Massive Data: Ingested and Visible in Seconds
&lt;/h2&gt;

&lt;p&gt;First, Databend achieves massive data ingestion at the second level, which is distinctly different from traditional big data practices. Traditional big data focuses more on batch processing, typically with intervals of at least 5 or 30 minutes per batch. Databend emphasizes second-level data ingestion and enables real-time processing at every stage, ensuring data is visible in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyzzyxn8mjtqfi7ccfb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyzzyxn8mjtqfi7ccfb8.png" alt="Image description" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left side of the diagram shows the data ingestion process in a traditional scenario, typically flowing from the data source to Vector, to Kafka, to Spark, and then to the data lakehouse. This pipeline appears clear at each stage, but the biggest issue is that every additional component in the cloud incurs a cost. Adding one more component means additional expenses and extra maintenance work.&lt;/p&gt;

&lt;p&gt;Can the pipeline be further simplified? Is it possible to write directly from the data source to object storage? Can object storage replace Kafka's role? For example, could dozens - or even hundreds - of applications write data directly to object storage? Once the data is written to object storage, Databend has a concept called External Stage, which allows linking to the specific location in object storage and directly invoking Databend's COPY INTO command to quickly load the data into Databend tables.&lt;/p&gt;
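&lt;p&gt;A minimal sketch of this path, assuming a target table &lt;code&gt;ods_events&lt;/code&gt; already exists and using a placeholder bucket and credentials:&lt;/p&gt;

```sql
-- Link an External Stage to the object storage location the apps write to.
CREATE STAGE game_events
  URL = 's3://analytics/events/'
  CONNECTION = (ACCESS_KEY_ID = 'my-key-id' SECRET_ACCESS_KEY = 'my-secret');

-- Bulk-load the newly arrived NDJSON files into a Databend table.
COPY INTO ods_events
FROM @game_events
PATTERN = '.*[.]ndjson'
FILE_FORMAT = (TYPE = NDJSON);
```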

&lt;p&gt;Based on current testing, a standard cloud server can achieve an average data loading speed of approximately 7 million rows per second. This speed is typically limited by the bandwidth of the object storage and can generally saturate either the available bandwidth or the performance of a single node. During the loading process, Databend also supports data transformation. However, Databend adheres to the principle of minimizing transformations whenever possible - data should ideally be loaded in its original ODS (Operational Data Store) form. Subsequent processing can then be performed on the raw data, which also makes it easier to stay aligned with the source data.&lt;/p&gt;

&lt;p&gt;In addition, Databend can take advantage of the event notification mechanisms provided by object storage itself. Users can subscribe to event notifications from the object storage, and when a file is generated via a PUT operation, the system can directly trigger a loading task - either through a webhook or a scheduling system such as Airflow - to quickly load the data into Databend.&lt;/p&gt;

&lt;p&gt;In real-world scenarios, some customers still retain components like Flink and Kafka. For example, in collaborations with automotive companies or the IoT industry, it's common to encounter data in binary formats. Since Databend currently does not support direct processing of binary data, Flink is often used to convert such binary data into JSON or other structured formats before it can be ingested by Databend. At present, Databend supports formats such as CSV, TSV, NDJSON, JSON, Parquet, ORC, and Arrow, as well as their corresponding compressed variants. Essentially, Databend is capable of replacing the majority of Spark's data processing tasks.&lt;/p&gt;

&lt;p&gt;Therefore, in an ideal setup, users can eliminate both the Kafka and Spark stages, and simply align data directly via object storage. Of course, if your data format is irregular or you have special processing requirements, it is still advisable to retain the original preprocessing steps before loading the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table-Level Change Data Capture for Incremental Computation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3izr6xkh2s2zk8nxou09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3izr6xkh2s2zk8nxou09.png" alt="Image description" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Earlier, we mentioned the real-time dashboard scenario in the gaming industry. Suppose a user finishes a game session and needs to immediately see an overview of their game data. Traditionally, this process involves ingesting data into a big data platform, processing it with Flink, and then writing the results into Redis or MySQL. This allows the game user to retrieve the relevant data directly from the database after the session ends.&lt;/p&gt;

&lt;p&gt;In Databend, there's a feature called Table-level Change Data Capture (CDC), which records operations such as INSERT, UPDATE, and DELETE on a given table. This makes it possible to accurately determine what new data has been added to the table at any point in time. Databend can use internally defined Tasks to process this incremental data. For example, a Task can be set up to pull data in real time from the incremental table (or stream-like table). Once the new data is retrieved, it can be passed to an external UDF (User-Defined Function) for processing, and then the processed results can be written directly to external systems.&lt;/p&gt;
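&lt;p&gt;The shape of this pattern, sketched in Databend-style SQL (the table, warehouse, and &lt;code&gt;enrich_stats&lt;/code&gt; UDF names are hypothetical):&lt;/p&gt;

```sql
-- Record row-level changes on the match results table.
CREATE STREAM match_stream ON TABLE match_results;

-- A task that periodically drains the stream, runs the increment through
-- an external UDF, and lands the result in a downstream table.
CREATE TASK push_match_stats
  WAREHOUSE = 'etl_wh'
  SCHEDULE = 1 MINUTE
AS
  INSERT INTO match_stats_out
  SELECT player_id, enrich_stats(player_id, raw_stats)  -- external UDF
  FROM match_stream;
```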

&lt;p&gt;Many users adopt this approach to write data into MySQL, other large lakehouse platforms, or Redis for serving external systems. This pattern has become a standard solution - gradually replacing traditional Kafka-based components with Databend's native CDC capabilities, and substituting Flink processing with external UDFs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz6fq9mtbdwoagav0i99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz6fq9mtbdwoagav0i99.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the implementation of this feature, Databend also introduced some special handling: when invoking external UDFs, it batches the data into groups - typically around 100 rows per batch - for each UDF call. The reason for this design is that if your backend needs to process a very large amount of data at once - for example, 10,000 rows - passing all of it to a single UDF in one go can quickly create a performance bottleneck.&lt;/p&gt;

&lt;p&gt;However, by processing data in batches of 100 rows per call, and assuming you have multiple UDF nodes deployed on the backend - say, 10 nodes - each node only needs to handle 100 rows at a time. This leads to faster and more efficient processing. If a performance bottleneck does occur, it can be mitigated by horizontally scaling out with additional UDF service nodes.&lt;/p&gt;

&lt;p&gt;The overall architecture resembles a model where external Lambda functions can be attached and registered with Databend via an API Gateway, forming a unified set of external functions. With this registration and invocation mechanism, the system can achieve a calling frequency of approximately once per second. As a result, incremental data can be continuously and seamlessly pushed to the target system for real-time processing. This architecture also enables the implementation of complex algorithms and business logic, including both data processing and large-scale data presentation.&lt;/p&gt;
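&lt;p&gt;Registering such a function is a single statement; a sketch assuming a UDF server is already listening behind an internal gateway (the endpoint and names are illustrative):&lt;/p&gt;

```sql
-- Register an external UDF served over the gateway. Databend batches rows
-- per call, and more UDF server instances can be added behind the address
-- to scale out horizontally.
CREATE FUNCTION enrich_stats (BIGINT, VARIANT)
  RETURNS VARIANT
  LANGUAGE python
  HANDLER = 'enrich_stats'
  ADDRESS = 'http://udf-gateway:8815';
```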

&lt;p&gt;Building upon this, Databend has further extended its functionality by integrating AI capabilities. For example, you can encapsulate certain AI functions into callable functions, register them within Databend, and directly support use cases like similarity search, embedding vector computation, or even Text-to-SQL (natural language to SQL conversion).&lt;/p&gt;

&lt;h1&gt;
  
  
  Boost ETL Performance by 10× through Stream-Based Processing
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgf7psmneos2xhk368xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgf7psmneos2xhk368xs.png" alt="Image description" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional big data systems typically refer to the process as ETL, but within the Databend ecosystem, we prefer to call it ELT. This means that data is ingested in its raw form and then processed via internal tasks (Tasks) within Databend. For example, in traditional setups, binlog data is first ingested into Kafka, after which it is typically consumed in a passive manner. Databend introduces a different approach - binlog data can directly enter Databend, and Databend actively pushes the data out, creating a more proactive push mechanism. This approach leverages Databend's clear internal logging method (table-level change data capture) to simplify the architecture.&lt;/p&gt;

&lt;p&gt;In this process, users can specify different compute clusters and clearly define which cluster should execute data processing tasks, including the scheduling frequency of these tasks, which can be defined by the user. The current smallest scheduling interval is 1 second (in contrast, Snowflake's smallest scheduling interval is 1 minute). Databend achieves this second-level scheduling because, last year, we worked with some gaming industry users who required sub-second timeliness. If a task takes more than 1 second to run, an alert is triggered to notify the user to scale or optimize the system to ensure sub-second response times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgmklftdtwno6ri4hjf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgmklftdtwno6ri4hjf1.png" alt="Image description" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows the specific process of how Databend migrates data from the ODS layer to the DWD layer in an ETL scenario:&lt;/p&gt;

&lt;p&gt;First, the data enters Databend in its raw form and must be deduplicated. Using the incremental stream table mentioned earlier together with the row_number function, only the most recent row per key is retained, producing an intermediate table for further processing. A MERGE INTO statement then branches on the type of data operation: an UPDATE applies the update; a DELETE performs the deletion, or even a "soft delete" if preferred; and rows that match no existing entry are handled with an INSERT. All of these branches can be explicitly defined in SQL.&lt;/p&gt;
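
&lt;p&gt;A hedged sketch of this dedup-then-merge step (all table and column names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Keep only the latest change per key, then branch on the operation type.
MERGE INTO dwd.orders AS t
USING (
    SELECT * FROM (
        SELECT s.*,
               row_number() OVER (PARTITION BY id ORDER BY change_ts DESC) AS rn
        FROM ods_orders_stream AS s
    ) WHERE rn = 1
) AS src
ON t.id = src.id
WHEN MATCHED AND src.op = 'DELETE' THEN DELETE
WHEN MATCHED AND src.op = 'UPDATE' THEN UPDATE *
WHEN NOT MATCHED THEN INSERT *;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;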

&lt;p&gt;During project implementation, an interesting pattern emerged: traditional big data teams usually prefer to run DW-layer merge operations overnight, processing a large amount of data in one batch. It turned out that when data volume is particularly large, the processing often cannot finish within one night. To address this, the merge work is now broken into smaller tasks at minute-level granularity, with great results. In large-scale projects a single tenant may have millions of tables, and without finer-grained tasks much of the work simply cannot complete in the available window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh46r56occy1cdmtow24a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh46r56occy1cdmtow24a.png" alt="Image description" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows the task visualization interface provided by Databend Cloud. For on-premises deployments, users can leverage tools like Airflow to achieve similar task visualization functionality. This type of functionality is relatively simple to implement and very intuitive to use. When an error occurs in a certain step of the task, the dependency status of the task is clearly displayed. For example, if there is a dependency between tasks, the subsequent tasks must wait until the previous ones are completed before they can run. This status indication, task scheduling, and error alerts displayed in red are all important details to pay attention to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1izalb5xg4lbu8rdgzou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1izalb5xg4lbu8rdgzou.png" alt="Image description" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databend has also added many SQL syntaxes that are rarely seen in other databases, such as MERGE INTO, INSERT OVERWRITE, and INSERT Multi (multi-path insert). Big data teams used to describe the need behind INSERT Multi as "single-path log, multi-path split": a single piece of data is inserted into different tables based on different tags. At first the scenario behind this need wasn't fully understood, but it later became clear that a single log may need to land in several tables according to its tags, while an ODS table still retains the original logs.&lt;/p&gt;

&lt;p&gt;Therefore, Databend introduced the INSERT Multi syntax, where data is first loaded intact into an ODS table. Then, through incremental capture, newly added data is retrieved in real time, processed based on different tags, and finally inserted into multiple corresponding tables using the INSERT Multi syntax. For example, data for different platforms or device types of mobile data can be inserted into separate tables, while still keeping the original logs intact in the ODS table.&lt;/p&gt;
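
&lt;p&gt;Assuming hypothetical table and column names, a multi-path insert might look roughly like this (see Databend's multi-table insert documentation for the exact form):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Route each new log row to a table based on its tag,
-- while the original stays intact in the ODS table.
INSERT FIRST
  WHEN platform = 'ios'     THEN INTO logs_ios
  WHEN platform = 'android' THEN INTO logs_android
  ELSE                           INTO logs_other
SELECT * FROM ods_logs_stream;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;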

&lt;p&gt;In addition, Databend can read CSV, TSV, NDJSON, Parquet, ORC, Arrow, and other file formats directly from object storage using SQL, which can largely replace the file processing jobs typically handled by Spark. Where similar products could once handle only a single file, Databend can process a whole batch of files at once through regular expression matching, greatly improving efficiency. Databend also specifically solves the idempotency issue during data loading: after a file has been written into a table, it will not be loaded again within a certain time range (2 days by default). If a repeat load is necessary, it can be forced with a parameter (such as the FORCE option).&lt;/p&gt;
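
&lt;p&gt;For illustration, with a hypothetical stage named mystage, querying and loading a batch of files might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Query a batch of Parquet files in object storage directly.
SELECT count(*) FROM @mystage (PATTERN =&gt; '.*[.]parquet');

-- COPY INTO skips files it has already loaded within the dedup window;
-- FORCE = TRUE re-loads them anyway.
COPY INTO my_table FROM @mystage
  PATTERN = '.*[.]parquet'
  FILE_FORMAT = (TYPE = PARQUET)
  FORCE = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;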

&lt;p&gt;Databend has also optimized complex JOIN operations. Simple scenarios usually involve JOINs across two or three tables, but in complex scenarios many JOINs are nested inside subqueries, and JOINs spanning dozens of tables are common. We ran into such cases when migrating Greenplum users: in early, less mature versions of Databend, merely generating an execution plan (via EXPLAIN) could take 4 to 5 seconds, which was quite poor. That bottleneck has since been broken through, and even extremely complex SQL JOINs are now handled smoothly.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does a Data Analyst Work?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxlcwnhnhlbz34aqblvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxlcwnhnhlbz34aqblvi.png" alt="Image description" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databend Cloud provides a visual SQL IDE that also supports sharing. You can share SQL scripts with team members and collaborate with them within your group. Databend Cloud supports a Dashboard feature, allowing users to generate visual charts and displays through SQL.&lt;/p&gt;

&lt;p&gt;Additionally, Databend offers a visual execution plan feature. With the execution plan, users can intuitively see the specific operators included in each SQL query and how much time each operator took. This helps users further optimize SQL query performance. This feature was developed in-house by the Databend team, so they have a deep understanding of the specific operators involved in the product and can display more detailed metrics. For example, when interacting with object storage, the response time for each step can be clearly displayed, allowing users to better perform performance analysis and optimization.&lt;/p&gt;

&lt;h1&gt;
  
  
  Excellent Features of Databend for the Gaming Industry
&lt;/h1&gt;

&lt;p&gt;Databend has also made significant extensions to its foundational data types, adding support for types such as Bitmap, Array, and Map. The Array type in particular has become a key feature: previously underemphasized, it is widely used in the big data field. In the advertising industry, for example, funnel functions are frequently used for retention analysis. They make it easy to compute user behavior paths, such as whether a user added a product to the cart after browsing the homepage, or whether they viewed a specific product. This kind of behavior analysis, including checking whether actions align with promotional expectations, is easily achieved with Array types and the corresponding functions in Databend.&lt;/p&gt;
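
&lt;p&gt;As an illustrative sketch only (the table, columns, and exact array function names here are assumptions; consult Databend's array function reference), a simple two-step funnel check could be expressed as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Collect each user's actions into an array, then check that the
-- "view homepage" step appears before the "add to cart" step.
-- Note: preserving event order inside the aggregated array may require
-- an explicit sort in practice.
SELECT user_id
FROM (
    SELECT user_id, array_agg(action) AS path
    FROM events
    GROUP BY user_id
)
WHERE array_position(path, 'view_home') &lt; array_position(path, 'add_to_cart');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;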

&lt;p&gt;For complex SQL expressions, Databend also supports a Pipeline-style computation model, which allows for effective handling of intricate SQL operations. Overall, Databend's goal is to ensure that as long as users know how to write SQL, they can perform any necessary data analysis operations within Databend.&lt;/p&gt;

&lt;h1&gt;
  
  
  Storage-Compute Separation Helps Users Achieve Greater Benefits
&lt;/h1&gt;

&lt;p&gt;The concept of storage-compute separation is likely familiar to most, but the implementation of this architecture in Databend has advanced to a more sophisticated stage.&lt;/p&gt;

&lt;p&gt;In Databend's approach, there are several aspects that differ from traditional big data products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No emphasis on partitions: Databend does not require users to design partitions (Partition).&lt;/li&gt;
&lt;li&gt;No emphasis on indexing: Unlike many other systems, Databend does not rely on users creating indexes (Index).&lt;/li&gt;
&lt;li&gt;No strict data type requirements: Databend does not force users to meticulously choose data types; even a schema that leans heavily on non-optimized types like VARCHAR is fully supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In industries such as government and mobile telecom, heavy use of VARCHAR fields is common, and traditional databases (e.g., Oracle) support this well. Databend therefore avoids interfering too much with users' data design decisions. Instead it follows a "No Partition, No Index" principle while internally providing a mechanism called the Cluster Key, similar to the Cluster Key concept in Snowflake.&lt;/p&gt;

&lt;p&gt;Currently, Databend offers two types of Cluster Keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear Cluster Key: orders data by sorting on a column (Order By style).&lt;/li&gt;
&lt;li&gt;Hilbert Cluster Key: orders data across multiple columns along a Hilbert space-filling curve, so queries filtering on any of the clustered columns benefit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using Cluster Keys, Databend can significantly enhance query performance, especially for frequent queries targeting specific columns, improving the overall user experience in data service scenarios.&lt;/p&gt;
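
&lt;p&gt;Defining a Cluster Key is a single statement; for example (the table and column names are hypothetical, and the exact syntax follows Databend's ALTER TABLE ... CLUSTER BY documentation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Linear clustering: sort data by the listed columns.
ALTER TABLE events CLUSTER BY LINEAR(event_date, user_id);

-- Hilbert clustering: order data across multiple columns
-- along a Hilbert space-filling curve.
ALTER TABLE events CLUSTER BY HILBERT(event_date, user_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;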

&lt;p&gt;In the hybrid cloud scenario, Databend also enables a complete separation of compute and storage. For example, compute can be hosted on Databend Cloud, while data storage can remain in the user's own Bucket storage. This architecture gives users greater flexibility. If users need to migrate their data out of the cloud, the process is straightforward since the data resides on the user's premises.&lt;/p&gt;

&lt;p&gt;In another case, if a company has internally deployed Databend but faces complex queries that require substantial compute resources, the company can register the data directly to Databend Cloud through the Attach feature. This allows them to leverage Databend Cloud's powerful compute capabilities to handle these complex analysis tasks while maintaining direct access to the data. Databend also provides High Availability (HA) solutions, and the overall deployment architecture is simple and efficient.&lt;/p&gt;

&lt;h1&gt;
  
  
  Databend Cloud redefines the separation of compute and storage
&lt;/h1&gt;

&lt;p&gt;In the past, the term separation of compute and storage typically referred to the idea of having one layer for compute and a separate layer for storage. However, Databend has taken this concept further by creating a new paradigm. In this approach, compute is managed by Databend, where users interact with the SQL IDE, resource management, and resource scaling - all of these are handled by Databend. On the other hand, the data can be stored in the user's own Bucket storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgh28thq4plgd9mjjcrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgh28thq4plgd9mjjcrr.png" alt="Image description" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation is also very simple: create a connection that specifies the location of the user's storage bucket, then create the corresponding table associated with the data location in that bucket, and the data can be used directly. Users have come up with creative ways to exploit this model. For example, some cleverly place their real-time workloads on Databend Cloud, since real-time analysis has the highest resource demands, and let Databend Cloud handle the most recent real-time analysis needs, while running their offline workloads in their own environment, where they have sufficient resources for offline computing.&lt;/p&gt;
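
&lt;p&gt;A hedged sketch of this mounting step (the bucket path and credentials are placeholders; see Databend's ATTACH TABLE documentation for the exact form):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Mount table data that lives in the user's own bucket into Databend Cloud.
ATTACH TABLE orders_analytics 's3://my-bucket/1/23351/'
  CONNECTION = (
    ACCESS_KEY_ID = '&lt;your-key&gt;',
    SECRET_ACCESS_KEY = '&lt;your-secret&gt;'
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;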

&lt;p&gt;In this model, data ingestion and real-time computing are handled on the Databend side, while data storage and offline computing are carried out in the user's own environment. Additionally, user-side data can be quickly mounted to Databend Cloud in a read-only manner and used directly, which brings great convenience to users. Another huge benefit of this model is that even if Databend Cloud itself experiences a serious failure or crash, the user's data remains stored on the user's side, unaffected. The data remains fully visible and can be quickly restored.&lt;/p&gt;

&lt;p&gt;In 2023, a client in the short video industry adopted this model. The client had internally deployed a Databend instance for advertising and data analysis. Their CEO closely monitored the advertising data every day, and if any issues arose, they would immediately make a call. At the same time, they registered their core data on Databend Cloud, allowing the internal data analysis team and data scientists to freely perform more complex data analysis and exploration on Databend Cloud without affecting the company's core business systems. This way, the client only needed to focus on ensuring the data security and performance of the core business, while data analysis and mining tasks could be completed on Databend Cloud at a very low cost.&lt;/p&gt;

&lt;p&gt;Additionally, Databend enables cloud-based internal data sharing and reuse through the Attach table method. This direct-read approach eliminates the need for the extensive data synchronization tasks commonly found in traditional big data systems.&lt;/p&gt;

&lt;h1&gt;
  
  
  Databend Cloud Multi-IDC High Availability
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1cf72a94p65eh0ri5z4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1cf72a94p65eh0ri5z4.png" alt="Image description" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The high availability solution of Databend Cloud leverages a significant advantage of object storage in the cloud: when users purchase object storage, they can choose to replicate data between different data centers. This enables automatic data synchronization and high availability across multiple IDCs. The metadata layer, even with millions of tables and tens of millions of fields, has relatively small data size, making replication very easy. As a result, synchronization and replication between three nodes or even across three data centers is very simple and has low latency.&lt;/p&gt;

&lt;p&gt;Currently, Databend is further exploring the possibility of storing metadata directly in object storage. Metadata could then be replicated directly between two object storage systems, so compute and metadata nodes would require minimal additional deployment, letting users achieve rapid deployment and data replication across multiple IDCs more easily. For example, after mounting a compute cluster in a second data center, it can quickly be brought up in read-only mode and serve queries immediately. As long as data is not written to both IDCs simultaneously, this is entirely feasible; and if bi-directional replication between the two object storage systems becomes possible as well, true high availability for data writing across two IDCs can be achieved.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>database</category>
      <category>brightdatachallenge</category>
    </item>
    <item>
      <title>Building an Efficient and Cost-Effective Business Data Analytics System with Databend Cloud</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Sat, 29 Mar 2025 03:31:46 +0000</pubDate>
      <link>https://dev.to/ksanaka/building-an-efficient-and-cost-effective-business-data-analytics-system-with-databend-cloud-3m5j</link>
      <guid>https://dev.to/ksanaka/building-an-efficient-and-cost-effective-business-data-analytics-system-with-databend-cloud-3m5j</guid>
      <description>&lt;p&gt;With the rise of large AI models such as OpenAI's ChatGPT, DeepL, and Gemini, the traditional machine translation field is being disrupted. Unlike earlier tools that often produced rigid translations lacking contextual understanding, these new models can accurately capture linguistic nuances and context, adjusting wording in real-time to deliver more natural and fluent translations. As a result, more users are turning to these intelligent tools, making cross-language communication more efficient and human-like.&lt;/p&gt;

&lt;p&gt;Recently, a highly popular bilingual translation extension has gained widespread attention. This tool allows users to instantly translate foreign language web pages, PDF documents, ePub eBooks, and subtitles. It not only provides real-time bilingual display of both the original text and translation but also supports custom settings for dozens of translation platforms, including Google, OpenAI, DeepL, Gemini, and Claude. It has received overwhelmingly positive reviews online.&lt;/p&gt;

&lt;p&gt;As the user base continues to grow, the operations and product teams aim to leverage business data to support growth strategy decisions while ensuring user privacy is respected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyszf4j97og2aqq6ylkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyszf4j97og2aqq6ylkz.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Business Challenges
&lt;/h2&gt;

&lt;p&gt;Business event tracking metrics are one of the essential data sources in a data warehouse and among a company's most valuable assets. Typically, business data analytics rely on two major data sources: business analytics logs and upstream relational databases (such as MySQL). By leveraging these data sources, companies can conduct user growth analysis, business performance research, and even precisely troubleshoot user issues through business data analytics.&lt;/p&gt;

&lt;p&gt;The nature of business data analytics makes it challenging to build a scalable, flexible, and cost-effective analytics architecture. The key challenges include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High Traffic and Large Volume: Business data is generated in massive quantities, requiring robust storage and analytical capabilities.&lt;/li&gt;
&lt;li&gt;Diverse Analytical Needs: The system must support both static BI reporting and flexible ad-hoc queries.&lt;/li&gt;
&lt;li&gt;Varied Data Formats: Business data often includes both structured and semi-structured formats (e.g., JSON).&lt;/li&gt;
&lt;li&gt;Real-Time Requirements: Fast response times are essential to ensure timely feedback on business data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Due to these complexities, the tool’s technical team initially chose a general event tracking system for business data analytics. This system allows data to be automatically collected and uploaded by simply inserting JSON code into a website or embedding an SDK in an app, generating key metrics such as page views, session duration, and conversion funnels.&lt;/p&gt;

&lt;p&gt;However, while general event tracking systems are simple and easy to use, they also come with several limitations in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lack of Detailed Data: These systems often do not provide detailed user visit logs and only allow querying predefined reports through the UI.&lt;/li&gt;
&lt;li&gt;Limited Custom Query Capabilities: Since general tracking systems do not offer a standard SQL query interface, data scientists struggle to perform complex ad-hoc queries due to the lack of SQL support.&lt;/li&gt;
&lt;li&gt;Rapidly Increasing Costs: These systems typically use a tiered pricing model, where costs double once a new usage tier is reached. As business traffic grows, querying a larger dataset can lead to significant cost increases.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally, the team follows a principle of minimal data collection: it avoids gathering potentially identifiable data or detailed user behavior, and collects only the necessary statistics, such as translation time, translation count, and errors or exceptions, rather than personalized data. Under these constraints, most third-party data collection services were ruled out. Because the tool serves a global user base, it must also respect data usage and storage rights across different regions and avoid cross-border data transfers. Given all of these factors, the team needed fine-grained control over how data is collected and stored, making an in-house business data system the only viable option.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Complexity of Building an In-House Business Data Analytics System
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbum3axumezxn89olk1b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbum3axumezxn89olk1b.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address the limitations of the generic tracking system, the translation tool decided to build its own business data analysis system after the business reached a certain stage of growth. After conducting research, the technical team found that traditional self-built architectures are mostly based on the Hadoop big data ecosystem. A typical implementation process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Embed SDK in the client (APP, website) to collect business data logs (activity logs);&lt;/li&gt;
&lt;li&gt;Use an Activity gateway for tracking metrics, collect the logs sent by the client, and transfer the logs to a Kafka message bus;&lt;/li&gt;
&lt;li&gt;Use Kafka to load the logs into computation engines like Hive or Spark;&lt;/li&gt;
&lt;li&gt;Use ETL tools to import the data into a data warehouse and generate business data analysis reports.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6z5i2j2za5rbhlxpogn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6z5i2j2za5rbhlxpogn.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although this architecture can meet the functional requirements, its complexity and maintenance costs are extremely high:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Kafka relies on Zookeeper and requires SSD drives to ensure performance.&lt;/li&gt;
&lt;li&gt;Kafka to Data Warehouse requires kafka-connect.&lt;/li&gt;
&lt;li&gt;Spark needs to run on YARN, and ETL processes need to be managed by Airflow.&lt;/li&gt;
&lt;li&gt;When Hive storage reaches its limit, it may be necessary to replace MySQL with distributed databases like TiDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture not only requires a large investment of technical team resources but also significantly increases the operational maintenance burden. In the current context where businesses are constantly striving for cost reduction and efficiency improvement, this architecture is no longer suitable for business scenarios that require simplicity and high efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Databend Cloud?
&lt;/h2&gt;

&lt;p&gt;The technical team chose Databend Cloud for building the business data analysis system due to its simple architecture and flexibility, offering an efficient and low-cost solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% object storage-based, with full separation of storage and computation, significantly reducing storage costs.&lt;/li&gt;
&lt;li&gt;The query engine, written in Rust, offers high performance at a low cost. It automatically hibernates when computational resources are idle, preventing unnecessary expenses.&lt;/li&gt;
&lt;li&gt;Fully supports 100% ANSI SQL and allows for semi-structured data analysis (JSON and custom UDFs). When users have complex JSON data, they can leverage the built-in JSON analysis capabilities or custom UDFs to analyze semi-structured data.&lt;/li&gt;
&lt;li&gt;Built-in task scheduling drives ETL, fully stateless, with automatic elastic scaling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1efdbctd7rejfc17dtv5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1efdbctd7rejfc17dtv5.jpeg" alt="Image description" width="800" height="635"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adopting Databend Cloud, they abandoned Kafka and instead used Databend Cloud to create stages, importing business logs into S3 and then using tasks to bring them into Databend Cloud for data processing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log collection and storage: Kafka is no longer required. The tracking logs are directly stored in S3 in NDJSON format via vector.&lt;/li&gt;
&lt;li&gt;Data ingestion and processing: A copy task is created within Databend Cloud to automatically pull the logs from S3. In many cases, S3 acts as a stage in Databend Cloud: data in the stage can be automatically ingested, processed inside Databend Cloud, and, if needed, exported back to S3.&lt;/li&gt;
&lt;li&gt;Query and report analysis: BI reports and ad-hoc queries are run via a warehouse that automatically enters sleep mode, ensuring no costs are incurred while idle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databend, as an international company with an engineering-driven culture, has earned the trust of the technical team through its contributions to the open-source community and its reputation for respecting and protecting customer data. Databend's services are available globally, and if the team has future needs for global data analysis, the architecture is easy to migrate and scale.&lt;br&gt;
Through the approach outlined above, Databend Cloud enables enterprises to meet their needs for efficient business data analysis in the simplest possible way.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;The preparation required to build such a business data analysis architecture is very simple. First, prepare two Warehouses: one for Task-based data ingestion and the other for BI report queries. The ingestion Warehouse can be a small specification, while the query Warehouse can be a higher specification; since queries don't run continuously and the warehouse suspends when idle, this arrangement keeps costs low.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2dfihw181cxifsio75v.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2dfihw181cxifsio75v.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, click Connect to obtain a connection string, which can be used in BI reports for querying. Databend provides drivers for various programming languages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2057281yagjw1h74jixm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2057281yagjw1h74jixm.jpeg" alt="Image description" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next preparation steps are simple and can be completed in three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a table with fields that match the NDJSON format of the logs.&lt;/li&gt;
&lt;li&gt;Create a stage, linking the S3 directory where the business data logs are stored.&lt;/li&gt;
&lt;li&gt;Create a task that runs every minute or every ten seconds. It will automatically import the files from the stage and then clean them up.&lt;/li&gt;
&lt;/ol&gt;
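
&lt;p&gt;The three steps above might be sketched as follows (the schema, bucket path, and warehouse name are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- 1. A table whose columns match the NDJSON log fields.
CREATE TABLE logs (ts TIMESTAMP, user_id STRING, event STRING, props VARIANT);

-- 2. A stage pointing at the S3 directory that holds the logs.
CREATE STAGE log_stage URL = 's3://my-bucket/logs/'
  CONNECTION = (ACCESS_KEY_ID = '...', SECRET_ACCESS_KEY = '...');

-- 3. A task that ingests new files every minute and purges them after loading.
CREATE TASK ingest_logs
  WAREHOUSE = 'ingest_wh'
  SCHEDULE = 1 MINUTE
AS
  COPY INTO logs FROM @log_stage
  FILE_FORMAT = (TYPE = NDJSON)
  PURGE = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;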

&lt;p&gt;Vector configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[sources.input_logs]
type = "file"
include = ["/path/to/your/logs/*.log"]
read_from = "beginning"

[transforms.parse_ndjson]
type = "remap"
inputs = ["input_logs"]
source = '''
. = parse_json!(string!(.message))
'''

[sinks.s3_output]
type = "aws_s3"
inputs = ["parse_ndjson"]
bucket = "${YOUR_BUCKET_NAME}"
region = "${YOUR_BUCKET_REGION}"
encoding.codec = "json"
key_prefix = "logs/%Y/%m/%d"
compression = "none"
batch.max_bytes = 10485760  # 10MB
batch.timeout_secs = 300    # 5 minutes
aws_access_key_id = "${AWS_ACCESS_KEY_ID}"
aws_secret_access_key = "${AWS_SECRET_ACCESS_KEY}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the preparation work is complete, you can continuously import business data logs into Databend Cloud for analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmvcwa5z8bljoivraei7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmvcwa5z8bljoivraei7.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6elw1fyq6xlm4xbwukck.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6elw1fyq6xlm4xbwukck.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Comparisons &amp;amp; Benefits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrzvpjwnqrrvrn16pzsp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrzvpjwnqrrvrn16pzsp.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing a generic tracking system, a traditional Hadoop architecture, and Databend Cloud shows that Databend Cloud has significant advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural Simplicity: It eliminates the need for complex big data ecosystems, without requiring components like Kafka, Airflow, etc.&lt;/li&gt;
&lt;li&gt;Cost Optimization: Utilizes object storage and elastic computing to achieve low-cost storage and analysis.&lt;/li&gt;
&lt;li&gt;Flexibility and Performance: Supports high-performance SQL queries to meet diverse business scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, Databend Cloud provides a snapshot mechanism that supports time travel, allowing for point-in-time data recovery, which helps ensure data security and recoverability for "immersive translation."&lt;/p&gt;
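
&lt;p&gt;As a sketch, a point-in-time read in Databend looks roughly like this (the table name, timestamp, and snapshot ID are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Query the table as it existed at a past point in time
SELECT * FROM logs AT (TIMESTAMP =&gt; '2025-07-01 00:00:00'::TIMESTAMP);

-- Or pin a specific historical snapshot
SELECT * FROM logs AT (SNAPSHOT =&gt; 'snapshot_id');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;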

&lt;p&gt;Ultimately, the technical team of the translation tool completed the entire POC test in just one afternoon, switching from the complex Hadoop architecture to Databend Cloud, greatly simplifying operational and maintenance costs.&lt;/p&gt;

&lt;p&gt;When building a business data tracking system, in addition to storage and computing costs, maintenance costs are also an important factor in architecture selection. Through its innovation of separating object storage and computing, Databend has completely transformed the complexity of traditional business data analysis systems. Enterprises can easily build a high-performance, low-cost business data analysis architecture, achieving full-process optimization from data collection to analysis. This not only reduces costs and improves efficiency but also unlocks the maximum value of data.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>cloudnative</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>From Snowflake to Databend: Leading Game Platform replaced Snowflake with Databend Cloud for real-time Data Cloud</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Mon, 17 Mar 2025 17:15:10 +0000</pubDate>
      <link>https://dev.to/ksanaka/from-snowflake-to-databend-leading-game-platform-replaced-snowflake-with-databend-cloud-for-55cb</link>
      <guid>https://dev.to/ksanaka/from-snowflake-to-databend-leading-game-platform-replaced-snowflake-with-databend-cloud-for-55cb</guid>
      <description>&lt;p&gt;A leading global gaming enterprise specializes in delivering engaging, skill-based gaming experiences to millions of players worldwide. The company also leads innovation in advertising, payments, and player engagement, developing cutting-edge solutions to facilitate seamless real-time interaction across their popular game titles.&lt;/p&gt;

&lt;p&gt;The global game platform previously utilized Amazon Aurora to store real-time customer activities, advertising events, and financial data. Data was exported as JSON files to an S3 bucket and then ingested into Snowflake as a raw table. This data, including player actions, in-game purchases, and matchmaking details, was transformed into multiple refined tables using Snowflake’s stream and task features. Accurate, up-to-date data in Snowflake was crucial for enhancing player engagement and retention, ensuring a seamless and responsive gaming experience.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Snowflake's Struggle with Real-Time Data: Cloud Costs Soar
&lt;/h1&gt;

&lt;p&gt;The gaming platform faced critical challenges using Snowflake for real-time data ingestion and analysis. The platform needed to process user data from S3-staged JSON files, generated from Aurora binlogs, in real-time. However, Snowflake struggled with sub-second data ingestion and transformation, crucial for updating raw tables and maintaining a real-time customer activity stream.  &lt;/p&gt;

&lt;p&gt;Snowflake's pricing model proved impractical for such tasks, as it charged excessive credits for cloud services like tasks. The lack of direct SQL support for sub-second tasks forced the platform to use minute-level tasks with SQL scripts, which drastically increased costs. This inability to handle real-time data led to gaps in their data analysis, hindering their ability to measure customer engagement and advertising performance effectively.  &lt;/p&gt;

&lt;p&gt;Additionally, despite spending millions on Snowflake, the platform struggled to meet its 99.9% SLA requirement. Even as a high-volume enterprise customer, they received inadequate support. In 2023, a service outage took four hours for a response, which merely stated the issue wasn't Snowflake's fault, without resolution. This inadequate support led to critical delays and affected their ability to maintain high availability.  &lt;/p&gt;

&lt;p&gt;The platform first discovered Databend Cloud on GitHub and was intrigued by its mission to create an open-source alternative to Snowflake. They were impressed by Databend Cloud's familiar SQL syntax and its powerful streaming and task capabilities. The platform was particularly drawn to Databend's promise of 10x faster performance in data ingestion, transformation, and analytics, along with sub-second task support. Databend Cloud’s clear and cost-effective pricing model resonated perfectly with their requirements and goals.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9pmii04zriupedc7pmm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9pmii04zriupedc7pmm.jpg" alt="Image description" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Familiar SQL Syntax from Snowflake
&lt;/h1&gt;

&lt;p&gt;Databend Cloud offers serverless data cloud on AWS with robust SQL support, including full TPC-H and TPC-DS benchmarks. It features transactions, stored procedures, and UDFs in Python/JS, along with support for diverse DML operations like insert, update, and merge into.  &lt;/p&gt;

&lt;p&gt;On the gaming platform, migrating thousands of SQL queries related to data ingestion, CDC, transformation, and dashboards to Databend Cloud took only three weeks. The seamless transition was possible due to Databend Cloud's compatibility with Snowflake’s SQL syntax.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Reliable and Predictable Pricing for Your Data Needs
&lt;/h1&gt;

&lt;p&gt;Databend Cloud separates compute from storage and offers minute-level automatic suspend and resume features, which helps significantly reduce costs. Unlike Snowflake, where task scheduling services could make up to 50% of the total costs and lacked API cost transparency, Databend Cloud provides a more straightforward and predictable pricing model. The gaming platform, which was on Snowflake's enterprise plan with customer-specific discounts, faced high and unpredictable costs, particularly for task scheduling services. Despite the discounts, the lack of transparency in API costs made it challenging to forecast monthly expenses accurately.  &lt;/p&gt;

&lt;p&gt;With Databend Cloud, they experienced a 70% reduction in costs, thanks to clear and consistent pricing. This transparency allowed the gaming platform to efficiently manage and forecast their monthly data management expenses. They successfully migrated their extensive historical data to Databend Cloud, achieving the ability to handle up to 100 billion rows of data per month with complete cost clarity. This shift has empowered them to provide their customers with comprehensive and real-time business insights without the fear of unexpected financial surprises.  &lt;/p&gt;

&lt;p&gt;According to the platform’s Head of Technology: "We’re incredibly grateful to Databend for their quick and reliable service. Your product’s performance in streaming data has been outstanding, clearly leading the industry. After a smooth three-week integration, our platform is excited to have Databend in our production system. Thanks to everyone involved for your hard work. We look forward to seeing Databend grow and achieve even greater things!"&lt;/p&gt;

&lt;h1&gt;
  
  
  Streamline Your Real-Time Data Pipelines with Ease
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofzv0ce7kxtlhvzhmjft.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofzv0ce7kxtlhvzhmjft.jpg" alt="Image description" width="800" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Databend Cloud provides powerful streaming and task execution capabilities that deliver superior performance and ease of use for customers. The gaming platform benefits from Databend Cloud by efficiently ingesting large volumes of data into a single raw table. They leverage multiple streams to seamlessly distribute and transform this data using external UDFs for various operational needs.  &lt;/p&gt;

&lt;p&gt;With Databend Cloud, the platform achieves a steady ingestion rate of 10,000 rows per second and schedules workflows with sub-second task execution, enabling real-time insights. This allows them to process and analyze 100 billion rows per month, significantly enhancing their data operations and driving better business outcomes. The platform now enjoys reliable, efficient real-time data processing, allowing them to focus on delivering exceptional gaming experiences and making data-driven decisions with confidence.  &lt;/p&gt;
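
&lt;p&gt;This stream-plus-task pattern can be sketched roughly as below; the table, stream, and warehouse names are placeholders, and the exact scheduling options should be checked against the current Databend documentation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- A stream captures changes on the raw ingestion table
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- A task that runs on a short schedule, but only does work
-- when the stream actually contains new change data
CREATE TASK transform_events
    WAREHOUSE = 'default'
    SCHEDULE = 1 SECOND
    WHEN STREAM_STATUS('default.raw_events_stream')
AS
    INSERT INTO refined_events
    SELECT user_id, event_type, payload
    FROM raw_events_stream;

ALTER TASK transform_events RESUME;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;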

&lt;h1&gt;
  
  
  Key Benefits for the Customer
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Effortless migration from Snowflake to Databend Cloud thanks to familiar SQL syntax and extensive API support for tasks, streams, and multi-table inserts, dramatically reducing learning curves.&lt;/li&gt;
&lt;li&gt;Achieved true real-time ELT processes with sub-second task execution capabilities, fueling growth in core gaming operations.&lt;/li&gt;
&lt;li&gt;Cost-effective Databend Cloud solutions notably decreased operational expenses, effectively improving cost-efficiency.&lt;/li&gt;
&lt;li&gt;Dedicated, concierge-style technical support services enable the customer to confidently pursue business development and innovation.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>snowflake</category>
      <category>datascience</category>
      <category>database</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>How did a cloud-native database company quickly achieve SOC 2 and GDPR certifications?</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Mon, 03 Mar 2025 15:21:34 +0000</pubDate>
      <link>https://dev.to/ksanaka/how-did-a-cloud-native-database-company-quickly-achieve-soc-2-and-gdpr-certifications-109e</link>
      <guid>https://dev.to/ksanaka/how-did-a-cloud-native-database-company-quickly-achieve-soc-2-and-gdpr-certifications-109e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq36h45mxmlh3ff2f1v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftq36h45mxmlh3ff2f1v9.png" alt="Image description" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the advent of the digital era, data has become one of the most valuable assets for enterprises. As data usage continues to expand, ensuring data security and privacy protection has become a common challenge for businesses worldwide. To address this challenge, we are thrilled to announce that the Databend Cloud platform has recently achieved two internationally recognized certifications - SOC 2 and GDPR compliance. This significant milestone not only marks a solid step forward in Databend Cloud's commitment to data security and privacy protection but also further strengthens global customers' trust in the platform.&lt;/p&gt;

&lt;p&gt;Databend Cloud is a next-generation big data analytics platform built on the open-source, cloud-native data lakehouse project Databend. It offers ultra-fast elastic scalability and a pay-as-you-go Data Cloud experience. As a data lakehouse solution developed in Rust and designed with an object storage architecture, Databend is dedicated to leveraging technological innovation to provide global users with a more efficient, cost-effective, and high-performance cloud data lakehouse solution.&lt;/p&gt;

&lt;p&gt;Since its launch, Databend Cloud has successfully replaced multiple traditional data warehouse solutions, including Snowflake, Redshift, BigQuery, Greenplum, ClickHouse, and CDH. It has helped customers across various industries - particularly in gaming, finance, advertising, and e-commerce - reduce costs and improve efficiency in big data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Security and Privacy: Databend's Core Commitment
&lt;/h2&gt;

&lt;p&gt;As a global company, Databend has prioritized data security and privacy protection as a core objective since its inception. SOC 2 and GDPR compliance are two essential regulatory standards that provide strong assurances for data security and privacy protection. When selecting a data warehouse or cloud service platform, these certifications have become key factors in enterprise decision-making.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOC 2 Certification: A Globally Recognized Data Security Standard
&lt;/h2&gt;

&lt;p&gt;SOC 2 certification, established by the American Institute of Certified Public Accountants (AICPA), is a globally recognized standard for assessing data security. It primarily evaluates a service provider's ability to manage data protection, privacy, availability, processing integrity, and confidentiality.&lt;/p&gt;

&lt;p&gt;SOC 2 certification provides strong assurance to customers, demonstrating that the service provider has undergone an independent audit and implemented appropriate technical and managerial measures to ensure data security. The certification is divided into two types: Type I and Type II, with the latter involving a longer audit period and stricter requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR Compliance: The World's Strictest Data Privacy Regulation
&lt;/h2&gt;

&lt;p&gt;The General Data Protection Regulation (GDPR), enacted by the European Union in 2018, is a data privacy and security law designed to protect individuals' personal data and regulate data processing and storage practices.&lt;/p&gt;

&lt;p&gt;GDPR sets the highest global standards for privacy protection, making compliance a legal requirement for businesses operating in the EU market. It ensures that data processing and cross-border transfers adhere to strict regulatory guidelines, helping organizations avoid significant fines for non-compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Measures for Certification: Streamlining Audits and Ensuring Compliance
&lt;/h2&gt;

&lt;p&gt;Databend's successful completion of SOC 2 Type II certification was no coincidence. It was the result of thorough preparation in compliance and security architecture, as well as a transparent third-party audit process. We implemented three key measures to effectively streamline the entire audit process:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Implementing Controls and Policies in Advance
&lt;/h3&gt;

&lt;p&gt;Before initiating the formal audit, Databend had already established a compliance framework by proactively deploying key control measures, laying a solid foundation for the audit process.&lt;/p&gt;

&lt;p&gt;In terms of data security, Databend ensures encryption for both data at rest and data in transit, along with a robust key management system and multi-region replication architecture to guarantee data security and durability. Additionally, for code governance, strict branch protection policies and code review mechanisms are enforced through tools like GitHub. These include mandatory pull request (PR) reviews, static code analysis, and automated deployment pipelines, effectively mitigating potential code risks.&lt;/p&gt;

&lt;p&gt;Moreover, Databend prepared a comprehensive set of organizational governance documents in advance, covering job responsibilities, risk management reports, and legal documents related to human resources. These documents provide clear business support and regulatory justification throughout the compliance process.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Leveraging Compliance Automation Platforms
&lt;/h3&gt;

&lt;p&gt;To streamline compliance processes, Databend adopted Vanta as its compliance automation platform, building a comprehensive intelligent compliance management system. This platform enables real-time monitoring and visualization of audit control point compliance rates while also automating compliance framework mapping, remediation planning, and system integration for automatic ticket creation - significantly reducing manual workload.&lt;/p&gt;

&lt;p&gt;By automating evidence collection and documentation, Vanta enhances efficiency and accuracy throughout the audit process. Additionally, it provides valuable resources such as security policy templates and customized security training modules, allowing Databend to quickly address compliance challenges and ensure all measures meet the stringent requirements of SOC 2 Type II.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Learning from Industry Best Practices
&lt;/h3&gt;

&lt;p&gt;Throughout the audit process, Databend leveraged insights from companies that had successfully passed SOC 2 audits. By confirming the audit date in advance, the company ensured a smooth process. Within a week of signing the contract with the audit firm, Databend engaged with auditors to discuss requirements, prepare documentation, and efficiently advance the audit process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Security Architecture: Multiple Technological Safeguards for User Privacy
&lt;/h2&gt;

&lt;p&gt;In addition to achieving SOC 2 and GDPR certifications, Databend Cloud has built a comprehensive security system encompassing access control, data encryption, network policies, password strategies, and compliance measures:&lt;/p&gt;

&lt;h3&gt;
  
  
  Access Control
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RBAC + DAC Model: Databend combines Role-Based Access Control (RBAC) and Discretionary Access Control (DAC) models to implement flexible and granular access control functionalities.&lt;/li&gt;
&lt;li&gt;Masking Policies: Masking policies are rules and settings that control the display or access to sensitive data, ensuring data confidentiality while allowing authorized users to interact with the data.&lt;/li&gt;
&lt;/ul&gt;
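
&lt;p&gt;As an illustrative sketch (the role, column, and policy names are placeholders), a masking policy might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Only a privileged role sees the raw value; everyone else sees a mask
CREATE MASKING POLICY email_mask
AS (val STRING) RETURNS STRING -&gt;
    CASE
        WHEN current_role() IN ('ANALYST') THEN val
        ELSE '*********'
    END
COMMENT = 'hide email addresses';

-- Attach the policy to a sensitive column
ALTER TABLE user_info MODIFY COLUMN email SET MASKING POLICY email_mask;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;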

&lt;h3&gt;
  
  
  Network Policies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Network Policy Configuration: Databend's network policies provide a configuration mechanism for managing and enforcing network access control for users within the system. They allow you to define rules for IP address ranges that are either allowed or blocked for specific users, effectively controlling their network-level access.&lt;/li&gt;
&lt;li&gt;AWS PrivateLink: PrivateLink enhances network security through VPC peering connections. Customers can initiate connections to Databend Cloud clusters using VPC endpoints and configure security groups to create trust boundaries and control access to endpoints. This feature is currently available only on AWS.&lt;/li&gt;
&lt;/ul&gt;
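
&lt;p&gt;A minimal sketch of a network policy (the IP ranges, policy name, and user name are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Allow an office subnet but block one address inside it
CREATE NETWORK POLICY office_only
    ALLOWED_IP_LIST = ('192.168.1.0/24')
    BLOCKED_IP_LIST = ('192.168.1.99')
    COMMENT = 'office network only';

-- Apply the policy to a specific user
ALTER USER analyst WITH SET NETWORK POLICY = 'office_only';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;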

&lt;h3&gt;
  
  
  Password Policies
&lt;/h3&gt;

&lt;p&gt;Password Policy Features: Databend includes password policies to enhance system security and simplify user account management. These policies set rules for creating or changing passwords, covering aspects such as length, character types, age restrictions, retry limits, lockout duration, and password history.&lt;/p&gt;
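
&lt;p&gt;For example, a password policy covering those aspects might be sketched as follows (the thresholds, policy name, and user name are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE PASSWORD POLICY strong_policy
    PASSWORD_MIN_LENGTH = 12        -- minimum length
    PASSWORD_MAX_AGE_DAYS = 90      -- age restriction
    PASSWORD_MAX_RETRIES = 5        -- retry limit before lockout
    PASSWORD_LOCKOUT_TIME_MINS = 30 -- lockout duration
    PASSWORD_HISTORY = 5;           -- disallow reuse of recent passwords

-- Apply the policy to a user
ALTER USER analyst WITH SET PASSWORD POLICY = 'strong_policy';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;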

&lt;h3&gt;
  
  
  Encryption Measures
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;TLS 1.2: Databend provides end-to-end encryption, ensuring that all customer data flows are transmitted exclusively via HTTPS. Connections from clients to Databend API gateways are encrypted using TLS 1.2.&lt;/li&gt;
&lt;li&gt;Storage Encryption: Databend Enterprise supports server-side encryption in Object Storage Service (OSS). This feature enhances data security and privacy by activating server-side encryption for stored data in OSS. You can choose the encryption method that best suits your needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Outlook: Continuously Enhancing Security Standards and Compliance
&lt;/h2&gt;

&lt;p&gt;By achieving SOC 2 and GDPR certifications, Databend Cloud is positioned to provide secure and compliant cloud data lakehouse solutions to customers worldwide. While completing these certifications marks a significant milestone, we recognize that data security is an ongoing process. Databend will continue to strengthen its technical measures for data security and privacy protection, while advancing data management and big data analytics to help businesses achieve data-driven transformations. &lt;/p&gt;

&lt;p&gt;For more information, feel free to visit our official website at &lt;a href="https://databend.com" rel="noopener noreferrer"&gt;https://databend.com&lt;/a&gt; or join our channels on Slack to connect directly with our team.&lt;/p&gt;

</description>
      <category>security</category>
      <category>database</category>
      <category>certification</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>From LLM to Data Warehousing: How to Achieve AI-Driven Data Processing and Analysis</title>
      <dc:creator>ksanaka</dc:creator>
      <pubDate>Mon, 24 Feb 2025 10:20:58 +0000</pubDate>
      <link>https://dev.to/ksanaka/from-llm-to-data-warehousing-how-to-achieve-ai-driven-data-processing-and-analysis-1ddc</link>
      <guid>https://dev.to/ksanaka/from-llm-to-data-warehousing-how-to-achieve-ai-driven-data-processing-and-analysis-1ddc</guid>
      <description>&lt;p&gt;Discussions around Large Language Models (LLMs) have surged again, especially with the introduction of the Chain of Thought (CoT) mechanism in many advanced models. By simulating human reasoning, CoT enables models to break down complex tasks step by step, producing more accurate and reliable results. This breakthrough has significantly improved LLM performance in areas like mathematics and logical reasoning.&lt;/p&gt;

&lt;p&gt;At the same time, DeepSeek has gained rapid traction in the market due to its low-cost and high-performance advantages. By adopting innovative architectures and training methods, DeepSeek has significantly reduced inference costs, lowering the barriers to AI adoption. This disruptive advancement has drawn widespread industry attention and discussion. Meanwhile, as more LLMs emerge, market competition continues to intensify.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Impact of LLMs on the Data Industry
&lt;/h1&gt;

&lt;p&gt;In the database industry, LLM applications are driving intelligent advancements in data processing while also fueling innovation in data warehouse and database technologies. In the AI era, data has become the key resource for enterprise success. Although companies can access the same foundational LLMs, those that effectively utilize their own data to build business-driven LLM applications will stand out in the competition.&lt;/p&gt;

&lt;p&gt;The widespread adoption of LLMs is transforming data management, storage, and processing. As LLM technology matures, traditional database systems face unprecedented challenges. More companies are now exploring the fusion of LLMs with database technologies to redefine data querying, analysis, and processing methods. To meet the needs of LLM-powered data workloads, many database vendors are actively innovating in areas such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhancing data processing performance&lt;/li&gt;
&lt;li&gt;Improving model inference capabilities&lt;/li&gt;
&lt;li&gt;Optimizing data storage and access efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Databend’s Exploration of LLM Integration
&lt;/h1&gt;

&lt;p&gt;As a cloud-native lakehouse provider, Databend recognizes the crucial role of data processing technologies in enterprise success. Since 2022, we have been exploring the integration of LLMs with the lakehouse architecture to enhance the intelligence of data warehouses.&lt;/p&gt;

&lt;p&gt;After ChatGPT launched in 2022, our database engineering team initiated the first phase of exploration. We integrated OpenAI's API into the data warehouse and leveraged vectorization and Retrieval-Augmented Generation (RAG) to help users improve query efficiency and enable intelligent data processing.&lt;/p&gt;

&lt;p&gt;During this phase, we introduced AskBend, which allows users to store their knowledge bases in Databend and conduct intelligent Q&amp;amp;A using embedding-based queries combined with RAG. This solution enables users to query Databend and receive AI-generated responses based on stored documents, showcasing LLMs' immense potential in data warehousing.&lt;/p&gt;

&lt;p&gt;However, this approach also revealed some key issues, particularly in data privacy and cost control. Enterprises had to upload their data to external platforms for processing, leading to potential privacy risks and high service costs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges of Data Privacy and Cost Control
&lt;/h1&gt;

&lt;p&gt;Although the introduction of LLM technology provides users with convenient intelligent services, in a traditional LLM application model, users must upload their data to an external platform for processing. This not only fails to effectively ensure data privacy protection but also makes the cost of using LLM services very high. For example, in the case of OpenAI, API call fees can be a significant expense for enterprises engaged in large-scale data processing. Especially when the data volume reaches hundreds of thousands or even more, the cumulative cost of API calls and token usage can increase rapidly.&lt;/p&gt;

&lt;p&gt;To address these challenges, Databend further advanced its exploration of LLM technology in 2024, starting to integrate open-source LLMs, such as the open-source large models provided by Hugging Face, to solve the issues of data privacy and cost control. With this approach, user data is entirely stored locally and does not need to be uploaded to a cloud platform, ensuring data privacy. At the same time, the introduction of open-source LLMs significantly reduces the cost of model inference, allowing enterprises to use LLM technology more flexibly to meet their growing data processing needs. This year, with the explosive popularity of DeepSeek, we have also begun exploring its integration, combining it with DeepSeek’s API for further experimentation.&lt;/p&gt;

&lt;p&gt;In data analysis scenarios, the powerful capabilities of LLMs are particularly evident, especially in cases with small datasets. For many users, especially those unfamiliar with SQL or data warehouse operations, LLM technology can automatically generate query scripts and perform data analysis through natural language, significantly lowering the barrier to use.&lt;/p&gt;

&lt;p&gt;For example, a user only needs to ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Please analyze last year's sales data and identify the fastest-growing product."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The LLM will automatically generate an SQL query, execute the analysis in the data warehouse, and finally return the analysis results along with a visualized presentation.&lt;/p&gt;
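
&lt;p&gt;For instance, a query the LLM might generate for that request could look like this (the &lt;code&gt;sales&lt;/code&gt; table and its columns are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Rank products by second-half vs. first-half revenue growth in 2024
SELECT product_name,
       SUM(CASE WHEN sale_date &gt;= '2024-07-01' THEN amount ELSE 0 END) AS second_half,
       SUM(CASE WHEN sale_date &lt;  '2024-07-01' THEN amount ELSE 0 END) AS first_half
FROM sales
WHERE sale_date &gt;= '2024-01-01' AND sale_date &lt; '2025-01-01'
GROUP BY product_name
ORDER BY second_half / NULLIF(first_half, 0) DESC
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;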

&lt;p&gt;However, as data volume grows, LLMs face certain limitations when processing large-scale datasets. Particularly in real-time data processing scenarios, traditional data warehouse systems still play an indispensable role. Therefore, Databend is further exploring ways to integrate LLM inference with real-time data processing technology to achieve intelligent analysis and processing of big data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exploration of LLM Applications
&lt;/h1&gt;

&lt;p&gt;In practical applications, an increasing number of enterprises are successfully automating data analysis and enhancing the intelligence of business decision-making by leveraging LLM + data services. For example, e-commerce companies can integrate LLM technology with data warehouses to conduct in-depth analysis of user behavior, enabling precise decision-making for advertising placement and product recommendations. Additionally, LLMs can be applied in customer service, where they work in conjunction with data warehouses to build personalized recommendation systems based on users' historical behavior and preferences, ultimately improving customer satisfaction and loyalty.&lt;/p&gt;

&lt;p&gt;Currently, Databend integrates its SQL-based data processing capabilities with the natural language processing and data understanding strengths of LLMs like DeepSeek, enabling users to efficiently process data and extract valuable insights. This approach has already been applied in internal quality assurance systems, AI function services, and unstructured data processing, significantly reducing manual analysis time and workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal Quality Assurance System
&lt;/h2&gt;

&lt;p&gt;To minimize the impact on enterprise users during upgrades, we have developed a set of smoke tests built on DeepSeek's SQL model. The core of the test data generation engine relies on DeepSeek's data processing capabilities: based on the SQL model, it generates data distributions that are closer to users' real business scenarios, with an emphasis on test data likely to trigger boundary issues. This improves test coverage, surfaces potential system risks more effectively, and gives enterprise users more reliable quality assurance.&lt;/p&gt;
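&lt;p&gt;The idea of boundary-prone test data generation can be sketched as follows. This is illustrative only; the actual engine derives these distributions from the SQL model via DeepSeek rather than from a hand-written list.&lt;/p&gt;

```python
import random

# Edge-case values that commonly trigger boundary issues in SQL engines.
BOUNDARY_INTS = [0, -1, 1, 2**31 - 1, -2**31, 2**63 - 1]
BOUNDARY_STRS = ["", " ", "NULL", "'", "\\", "ünïcode", "a" * 1024]

def sample_row(rng: random.Random):
    # Mix boundary values with ordinary random values in each row.
    n = rng.choice(BOUNDARY_INTS + [rng.randint(0, 100)])
    s = rng.choice(BOUNDARY_STRS + ["normal"])
    return (n, s)

rng = random.Random(42)  # seeded for reproducible test runs
rows = [sample_row(rng) for _ in range(5)]
print(len(rows))  # 5
```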

&lt;h2&gt;
  
  
  AI Functions
&lt;/h2&gt;

&lt;p&gt;Databend offers users a range of AI Functions for data ETL, allowing users to call these functions directly in SQL and leverage AI capabilities to extract greater value from their data. Initially, all of these AI capabilities were backed by OpenAI's API. However, due to high costs and privacy concerns, we have replaced OpenAI with open-source large models in certain scenarios.&lt;/p&gt;

&lt;p&gt;Currently, based on AI functions, we have primarily implemented the following capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ai_text_similarity: text similarity scoring;&lt;/li&gt;
&lt;li&gt;ai_mask: data masking to protect sensitive information such as addresses and phone numbers. Doing this manually requires significant manpower, especially at large data volumes, and AI greatly improves efficiency here;&lt;/li&gt;
&lt;li&gt;ai_extract: entity extraction, which identifies and pulls specific entities out of text. For instance, if your data contains entities such as addresses and gender, this function can extract them;&lt;/li&gt;
&lt;li&gt;ai_sentiment: sentiment analysis (positive/negative/mixed), used to determine the emotional tendency of text. In e-commerce, for example, it can assess the sentiment of reviews;&lt;/li&gt;
&lt;li&gt;ai_classify: classification of text into predefined categories;&lt;/li&gt;
&lt;li&gt;ai_translate: translation of text from one language to another.&lt;/li&gt;
&lt;/ul&gt;
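&lt;p&gt;As a rough illustration of the kind of transformation a function like ai_mask performs, here is a plain-Python regex approximation. This is not Databend's actual implementation (the real function uses a model rather than patterns); it only shows the input/output shape of the task:&lt;/p&gt;

```python
import re

def mask_sensitive(text: str) -> str:
    # Approximate data masking: replace email addresses and
    # phone-number-like digit groups with placeholder tokens.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{3}[- ]?\d{4}[- ]?\d{4}\b", "[PHONE]", text)
    return text

print(mask_sensitive("Contact hi@databend.com or call 138-1234-5678."))
# Contact [EMAIL] or call [PHONE].
```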

&lt;p&gt;These capabilities are essentially data cleansing tasks where AI can replace human labor, significantly reducing labor costs. In implementing this solution we made several optimizations. We modified the UDF Server: the traditional row-by-row processing model is inefficient, so we switched to batch processing, which greatly improves data processing throughput. We also applied vectorization to the original model so that it runs well on low-spec GPUs, thereby lowering operational costs.&lt;/p&gt;
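&lt;p&gt;The row-to-batch change can be sketched as follows. This is schematic, not the actual UDF Server code, and model_infer is a hypothetical batched inference call: instead of paying per-request overhead on every row, rows are grouped and sent to the model one batch at a time.&lt;/p&gt;

```python
def model_infer(batch):
    # Hypothetical batched inference call: one round trip serves many rows.
    return [text.upper() for text in batch]  # stand-in for real model output

def process_rows_batched(rows, batch_size=3):
    results = []
    # Group rows into fixed-size batches so per-request overhead is amortized.
    for i in range(0, len(rows), batch_size):
        results.extend(model_infer(rows[i:i + batch_size]))
    return results

print(process_rows_batched(["a", "b", "c", "d"]))  # ['A', 'B', 'C', 'D']
```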

&lt;p&gt;Additionally, we implemented a fine-grained billing and monitoring mechanism to ensure transparency in resource usage and cost control. Currently, these capabilities are not yet available for public testing but will be opened up shortly.&lt;/p&gt;

&lt;p&gt;Databend's AI functions are designed to be user-friendly, even for those who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily integrate powerful AI features into your SQL queries, elevating your data analysis to a new level.&lt;/p&gt;

&lt;p&gt;We have also changed how large models are integrated. Instead of running an AI cluster inside Databend Query, we now connect to open-source large models through Databend's external function server model: Databend only needs to define an API for the function, and once it is deployed, user queries automatically reach the large model in the cloud. This makes it easy to switch to better open-source models in the future, and all data processing happens on the user's own machines, which addresses data privacy and compliance concerns while keeping costs under control. If users require it, we can even provide a private deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unstructured Data Processing
&lt;/h2&gt;

&lt;p&gt;Many Databend users need to extract entity information from unstructured data and convert it into structured data in order to unlock its value in real business scenarios. Databend uses DeepSeek's data processing and analysis capabilities to extract this information efficiently and output it in JSON format.&lt;/p&gt;

&lt;p&gt;Here is an example using DeepSeek V3. We input the following text:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Please send an email to &lt;a href="mailto:hi@databend.com"&gt;hi@databend.com&lt;/a&gt; to contact us, or visit our office at 123 Zhongguancun Street, Beijing.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;DeepSeek V3 can output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "email": "hi@databend.com",
  "address": "123 Zhongguancun Street, Beijing"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
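&lt;p&gt;Downstream, the model's JSON output can be loaded directly into structured fields. A minimal sketch using the output shown above:&lt;/p&gt;

```python
import json

# The JSON string returned by the model for the example input above.
raw = '{"email": "hi@databend.com", "address": "123 Zhongguancun Street, Beijing"}'

record = json.loads(raw)
# The extracted entities are now structured fields, ready to insert or query.
print(record["email"])    # hi@databend.com
print(record["address"])  # 123 Zhongguancun Street, Beijing
```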



&lt;h1&gt;
  
  
  Conclusion: Exploration and Practice
&lt;/h1&gt;

&lt;p&gt;It is worth mentioning that through a permissive open-source strategy, DeepSeek attracted tens of millions of users and widespread attention in just one month. DeepSeek's success not only demonstrates the significant role of open-source strategies in popularizing new technologies but also provides an effective path for challenging industry giants. This open approach lowers technical barriers, accelerates technology adoption, and drives the development of the industry as a whole.&lt;/p&gt;

&lt;p&gt;In the database industry, open-source strategies have also proven to be an effective path. Taking Databend as an example, as an open-source cloud-native data lakehouse product, Databend has attracted a large number of developers and users through its open-source model, quickly establishing an active community and ecosystem. This open-source strategy has allowed Databend to rapidly become an open-source alternative to the database giant Snowflake, offering users a more cost-effective big data solution.&lt;/p&gt;

&lt;p&gt;Currently, the integration of LLM and data is still evolving. For database practitioners and technology developers, how to better leverage LLM technology to enhance the intelligence level of data warehouses and how to promote the deep integration of LLM and data while ensuring data privacy and cost control are directions that require continuous exploration in the future.&lt;/p&gt;

&lt;p&gt;Do you have better insights and practical experiences? Do you think there are more efficient ways to integrate LLMs with data? Or have you found better technical solutions in your own practice that can further advance this trend? We welcome you to share your thoughts and experiences with us on Slack or through other channels as we explore the future of data technology together.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>datascience</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
