
Apache Doris 4.0: One Engine for Analytics, Full-Text Search, and Vector Search

We're thrilled to announce the official release of Apache Doris 4.0, a milestone version focused on four core enhancements: 1) new AI capabilities (vector search and AI functions); 2) enhanced full-text search; 3) improved ETL/ELT processing; and 4) performance optimizations (TopN lazy materialization and SQL cache). This release truly delivers a "one-engine-fits-all" analytics experience.

This release is the result of collaborative efforts from over 200 community members, with more than 9,000 improvements and fixes submitted. A heartfelt thank you to everyone who contributed to testing, reviewing, and refining this milestone version!


Key Highlights at a Glance

  1. AI-Ready Capabilities: Built-in vector search, AI functions, and hybrid search—no external vector database required

  2. Enhanced Full-Text Search: New SEARCH() function with Elasticsearch-like DSL, paired with BM25 scoring for better accuracy

  3. Robust ETL/ELT: The spill-to-disk mechanism automatically offloads intermediate data from memory to disk, ensuring stability for large-scale tasks

  4. Performance Leap: TopN queries up to dozens of times faster with lazy materialization; SQL cache enabled by default with 100x parsing efficiency boost

I. AI Capabilities: Dual Breakthroughs in Vector Search & AI Functions

A. Vector Search: Natively Integrated for Simpler Architecture

Apache Doris 4.0 introduces vector indexing, enabling the combination of vector search with native SQL analytics. This eliminates the need for external vector databases, making it ideal for AI scenarios like semantic search, intelligent recommendation, and image retrieval.

Core Vector Search Functions

  • l2_distance_approximate(): HNSW index-based approximate Euclidean distance calculation (smaller value = higher similarity)

  • inner_product_approximate(): HNSW index-based approximate inner product calculation (larger value = higher similarity)

Practical Examples

  1. Create a table with vector index:

CREATE TABLE doc_store (
  id BIGINT,
  title STRING,
  tags ARRAY<STRING>,
  embedding ARRAY<FLOAT> NOT NULL,
  -- Vector index configuration
  INDEX idx_vec (embedding) USING ANN PROPERTIES (
      "index_type"  = "hnsw",        -- Only HNSW is supported
      "metric_type" = "l2_distance", -- Metric type: l2_distance/inner_product
      "dim"         = "768",         -- Vector dimension (must match imported data)
      "quantizer"   = "flat"         -- Quantization: flat/sq8/sq4
  ),
  -- Full-text index for hybrid search
  INDEX idx_title (title) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16
PROPERTIES("replication_num"="1");
  2. Basic TopN vector search:

SELECT id, l2_distance_approximate(embedding, [/* Your query vector */]) AS dist
FROM doc_store
ORDER BY dist ASC
LIMIT 10;
  3. Hybrid search (full-text filtering + vector ranking):

SELECT id, title,
       l2_distance_approximate(embedding, [/* Your query vector */]) AS dist
FROM doc_store
WHERE title MATCH_ANY 'music'                -- Full-text index filtering
  AND array_contains(tags, 'recommendation') -- Structured filtering
ORDER BY dist ASC
LIMIT 5;

Key Parameter Notes

  • quantizer: Quantization affects memory usage; an SQ8 index is approximately 1/3 the size of a FLAT index, trading a small amount of recall for cost savings

  • Pre-filtering Mechanism: First filters data via precise indexes (e.g., inverted indexes), then performs vector search on the result set to ensure recall stability

  • Table Model Limitation: Currently only supports Duplicate Key table model; vector column must be ARRAY<FLOAT> NOT NULL
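
For indexes created with "metric_type" = "inner_product", the matching function is inner_product_approximate(), and results are sorted in descending order because a larger inner product means higher similarity. A minimal sketch (the table doc_store_ip is hypothetical and assumed to carry such an index):

-- Hypothetical table whose ANN index uses "metric_type" = "inner_product"
SELECT id, inner_product_approximate(embedding, [/* your query vector */]) AS score
FROM doc_store_ip
ORDER BY score DESC   -- larger score = higher similarity
LIMIT 10;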

B. AI Functions: Call LLMs Directly via SQL

Data analysts can call large language models (LLMs) through simple SQL statements, with no glue code required. Supported use cases include information extraction, sentiment analysis, and text summarization. It currently integrates with locally deployed models and mainstream LLM services such as OpenAI, Anthropic, Gemini, DeepSeek, Zhipu AI, and Qwen.

Core AI Functions List

  • AI_CLASSIFY: Matches text to the most relevant label from a specified set

  • AI_EXTRACT: Extracts key information from text based on specified labels

  • AI_SENTIMENT: Analyzes text sentiment (positive/negative/neutral/mixed)

  • AI_SUMMARIZE: Generates a concise summary of the text

  • AI_FILTER: Checks whether a statement expressed in text holds and returns a boolean value

  • AI_TRANSLATE: Translates text into a specified language
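
These functions are called like ordinary scalar functions. A couple of minimal sketches (they assume a default LLM resource has already been configured, which the AI_FILTER example below also relies on; see the documentation for setup):

-- Minimal sketches; assumes a default LLM resource is already configured
SELECT AI_SENTIMENT('The new release is impressively fast and stable.');
SELECT AI_SUMMARIZE('Apache Doris 4.0 adds vector search, AI functions, and a new SEARCH() function for full-text retrieval.');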

Practical Example: Resume-Job Matching

Use AI_FILTER for semantic matching between resumes and job requirements:


-- 1. Create candidate and job tables
CREATE TABLE candidate_profiles (
    candidate_id INT,
    name         VARCHAR(50),
    self_intro   VARCHAR(500)
)
DUPLICATE KEY(candidate_id)
DISTRIBUTED BY HASH(candidate_id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

CREATE TABLE job_requirements (
    job_id   INT,
    title    VARCHAR(100),
    jd_text  VARCHAR(500)
)
DUPLICATE KEY(job_id)
DISTRIBUTED BY HASH(job_id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- 2. Insert test data
INSERT INTO candidate_profiles VALUES
(1, 'Alice', 'I am a senior backend engineer with 7 years of experience in Java, Spring Cloud and high-concurrency systems.'),
(2, 'Bob',   'Frontend developer focusing on React, TypeScript and performance optimization for e-commerce sites.'),
(3, 'Cathy', 'Data scientist specializing in NLP, large language models and recommendation systems.');

INSERT INTO job_requirements VALUES
(101, 'Backend Engineer', 'Looking for a senior backend engineer with deep Java expertise and experience designing distributed systems.'),
(102, 'ML Engineer',      'Seeking a data scientist or ML engineer familiar with NLP and large language models.');

-- 3. Semantic matching query
SELECT
    c.candidate_id, c.name,
    j.job_id, j.title
FROM candidate_profiles AS c
JOIN job_requirements AS j
WHERE AI_FILTER(CONCAT('Does the following candidate self-introduction match the job description? ',
                'Job: ', j.jd_text, ' Candidate: ', c.self_intro));

Query Result:


+--------------+-------+--------+------------------+
| candidate_id | name  | job_id | title            |
+--------------+-------+--------+------------------+
|            3 | Cathy |    102 | ML Engineer      |
|            1 | Alice |    101 | Backend Engineer |
+--------------+-------+--------+------------------+

II. Enhanced Full-Text Search: More Flexible & Accurate

Apache Doris 4.0 redefines its full-text search capabilities with an Elasticsearch-like lightweight DSL paired with the BM25 scoring algorithm. This balances performance and accuracy, making it well suited to hybrid search scenarios.

A. New SEARCH() Function: Unified Search Entry

The SEARCH() function consolidates complex text search operators behind a syntax similar to Elasticsearch Query String, reducing migration costs and supporting multi-condition index pushdown for better performance.

Core Syntax Examples


-- 1. Term query
SELECT * FROM docs WHERE search('title:apache');

-- 2. Multi-value matching (ANY matches any, ALL matches all)
SELECT * FROM docs WHERE search('tags:ANY(java python golang)');
SELECT * FROM docs WHERE search('tags:ALL(machine learning)');

-- 3. Multi-field boolean combination
SELECT * FROM docs
WHERE search('(title:Doris OR content:database) AND NOT category:archived');

-- 4. Combine with structured filtering
SELECT * FROM docs
WHERE search('title:apache') AND publish_date >= '2025-01-01';

B. BM25 Scoring Algorithm: Improved Relevance

BM25 replaces the traditional TF-IDF algorithm and dynamically adjusts term-frequency weights based on document length. This significantly improves relevance for long-text and multi-field search scenarios.

-- Sort by relevance score
SELECT *, score() as score 
FROM search_demo 
WHERE content MATCH_ANY 'search query' 
ORDER BY score DESC 
LIMIT 10;

C. Enhanced Tokenization Capabilities

New multilingual tokenizers and custom tokenization pipelines meet diverse text processing needs.

1. New Built-in Tokenizers

  • ICU Tokenizer: Adapts to multilingual mixed documents, supporting complex writing systems like Arabic and Thai

  • Basic Tokenizer: Lightweight and high-performance, suitable for simple scenarios like log processing

-- ICU Tokenizer example (multilingual mix)
SELECT TOKENIZE('مرحبا بالعالم Hello 世界', '"parser"="icu"');
-- Result: ["مرحبا", "بالعالم", "Hello", "世界"]

-- Basic Tokenizer example (Chinese character splitting)
SELECT TOKENIZE('你好世界', '"parser"="basic"');
-- Result: ["你", "好", "世", "界"]

2. Custom Tokenization Pipelines

Build custom tokenization logic by chaining char filters, tokenizers, and token filters. Reusable components reduce maintenance costs.

-- Example 1: Split identifiers by . and _
-- 1) Create token filter
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS complex_word_splitter
PROPERTIES
(
    "type" = "word_delimiter",
    "type_table" = "[. => SUBWORD_DELIM], [_ => SUBWORD_DELIM]"
);

-- 2) Create custom analyzer
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS complex_identifier_analyzer
PROPERTIES
(
    "tokenizer" = "standard",
    "token_filter" = "complex_word_splitter, lowercase"
);

-- Test tokenization effect
SELECT TOKENIZE('apy217.39_202501260000026_526', '"analyzer"="complex_identifier_analyzer"');
-- Result: ["apy", "217", "39", "202501260000026", "526"]

III. Robust ETL/ELT Processing: Spill-to-Disk Ensures Large-Scale Stability

The new Spill-to-Disk mechanism automatically offloads intermediate data to disk when a computation task exceeds memory limits. This prevents OOM errors and supports core operators like Hash Join, Aggregation, Sort, and CTE.

Core Configurations

1. BE Configuration (be.conf)


# Spill data storage path (defaults to data storage path)
spill_storage_root_path=/mnt/disk1/spill;/mnt/disk2/spill
# Spill disk space limit (supports percentage or specific size, default 20%)
spill_storage_limit=100%

2. FE Session Variables

-- Enable spill-to-disk
set enable_spill=true;
-- Single query memory limit
set exec_mem_limit = 10g;
-- Extend query timeout (spill scenarios may take longer)
set query_timeout = 3600;

Execution Monitoring

Monitor spill execution via audit logs, Query Profile, and system tables:

-- Query spill statistics from system table
SELECT BE_ID, QUERY_ID, SPILL_WRITE_BYTES_TO_LOCAL_STORAGE, SPILL_READ_BYTES_FROM_LOCAL_STORAGE
FROM information_schema.backend_active_tasks;

Performance Validation

Tested with TPC-DS 10TB dataset (3 BE nodes, 16-core CPU/64GB RAM per node, 1:52 memory-to-data ratio). All 99 queries completed successfully with a total execution time of 28,102.386 seconds—validating stability in large-scale scenarios.

IV. Data Quality Assurance: End-to-End Accuracy & Control

Apache Doris 4.0 comprehensively standardizes function behaviors, establishing an end-to-end validation mechanism from data ingestion to computation. Key optimizations focus on the CAST function, floating-point calculations, and date functions.

CAST Function: Three Modes for Diverse Scenarios

CAST behavior is standardized via a BNF grammar, supporting strict mode, non-strict mode, and a TRY_CAST function to meet different data reliability requirements.

  • Strict Mode: Strictly validates against BNF rules; invalid data triggers errors (ideal for high-precision scenarios like financial reconciliation)

  • Non-Strict Mode: Converts invalid data to NULL to ensure task continuity (suitable for log processing)

  • TRY_CAST Function: Controls conversion logic at the individual function level; returns NULL on failure

-- TRY_CAST example
SELECT TRY_CAST('123' AS INT) AS valid, TRY_CAST('abc' AS INT) AS invalid;
-- Result: valid=123, invalid=NULL
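
For comparison, here is a minimal sketch of how the two CAST modes differ (the session variable name enable_strict_cast is an assumption; check the 4.0 documentation for the exact switch):

-- Assumption: strict mode is toggled per session; the variable name may differ
SET enable_strict_cast = true;
SELECT CAST('abc' AS INT);  -- strict mode: the query fails with a conversion error

SET enable_strict_cast = false;
SELECT CAST('abc' AS INT);  -- non-strict mode: returns NULL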

V. Performance Optimizations: Dozens of Times Faster in Key Scenarios

1. TopN Queries: Lazy Materialization Breaks Bottlenecks

For high-frequency SELECT * FROM table ORDER BY col LIMIT N scenarios, execution is split into two phases:

  1. Read only the sort columns and row identifiers to quickly identify the target rows matching LIMIT N

  2. Use those row identifiers to read full-column data for the target rows only

Lazy materialization now also covers multi-table joins and external table queries, delivering performance improvements of dozens of times in wide-table, small-LIMIT scenarios.
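
For example, a query of the following shape on a wide table benefits directly, since full-column reads happen only for the rows that survive the LIMIT (the table and column names are hypothetical):

-- Wide table, small LIMIT: phase 1 reads only the sort column and row IDs
SELECT *
FROM user_events
ORDER BY event_time DESC
LIMIT 100;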

2. SQL Cache: Enabled by Default with 100x Parsing Boost

After resolving cache correctness issues related to permission changes, session variables, and non-deterministic functions, SQL cache is now enabled by default. For complex view queries with 163 joins and 17 unions, parsing time dropped from 400ms to 2ms—a 100x efficiency improvement.
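
The cache can still be toggled per session when needed; a minimal sketch:

-- Enabled by default in 4.0; disable or re-enable for the current session
SET enable_sql_cache = false;
SET enable_sql_cache = true;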

3. JSONB Optimization: Decimal Support + 30% Performance Boost

  • Added Decimal type support to avoid precision loss in high-precision numeric conversion

  • 30%+ performance improvement for functions like json_extract, accelerating semi-structured data processing
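
For instance, a high-precision numeric field stored in JSONB can now be extracted and cast without precision loss (a minimal sketch; the table name and JSON path are hypothetical):

-- Hypothetical table: extract a high-precision amount from a JSONB column
SELECT CAST(json_extract(payload, '$.amount') AS DECIMAL(27, 9)) AS amount
FROM payments_json
LIMIT 10;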

VI. More User-Friendly Resource Management

Workload Groups now offer unified CPU/memory soft and hard limit configuration, supporting both kinds of limits in the same group, and integrate with Spill-to-Disk for dynamic resource scheduling.

Typical Scenario: Departmental Resource Isolation

-- Sales Department (high-priority CPU-intensive)
MIN_CPU_PERCENT=40%  -- Guarantees 40% CPU during contention
MAX_MEMORY_PERCENT=60% -- 60% memory upper limit

-- Marketing Department (low-priority CPU-intensive)
MAX_CPU_PERCENT=30%  -- 30% CPU upper limit to avoid resource preemption
MIN_MEMORY_PERCENT=20% -- 20% basic memory guarantee
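
Expressed as workload group definitions, the scenario above might look like the following (a minimal sketch; the property names mirror the settings shown above but should be checked against the 4.0 workload group documentation):

-- Sketch only: property names assumed to match the settings above
CREATE WORKLOAD GROUP IF NOT EXISTS sales_group
PROPERTIES (
    "min_cpu_percent" = "40",
    "max_memory_percent" = "60"
);

CREATE WORKLOAD GROUP IF NOT EXISTS marketing_group
PROPERTIES (
    "max_cpu_percent" = "30",
    "min_memory_percent" = "20"
);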

Conclusion: The Transformation Brought by Apache Doris 4.0

Apache Doris 4.0 achieves a core breakthrough in "AI readiness" by deeply integrating vector search, AI functions, and full-text search—breaking the scenario boundaries of traditional analytics engines. Whether for traditional BI scenarios like real-time dashboards and user behavior analysis, or AI-driven scenarios like document search and intelligent recommendation, a single engine can provide efficient support.

Download the latest version today and visit the official documentation for detailed upgrade guides. We sincerely invite more developers to join the Doris community to drive technological innovation and build a more powerful analytics ecosystem together!
