DEV Community

Apache Doris
Apache Doris

Posted on

Apache Doris AI Capabilities Unveiled (Part II): Deep Analysis of AI_AGG and EMBED Functions

Introduction

After a preliminary exploration of the possibilities of AI functions, we now turn our attention to two more core functions: AI_AGG and EMBED. We will delve into the design philosophy, implementation principles, and business applications of these two functions, demonstrating how Doris seamlessly integrates text aggregation and semantic vector analysis into SQL through native function design, providing users with a more powerful and user-friendly intelligent data analysis experience.

AI_AGG: AI-Based Text Aggregation

Aggregation is one of the most common operations in data analysis. However, when dealing with massive volumes of user comments, support tickets, or log texts, traditional aggregate functions struggle to process such unstructured text data directly. To address this, Doris supports AI_AGG, a function that can call AI to perform text aggregation. It allows analysts to handle specific tasks on large amounts of text according to custom instructions.

Examples

For detailed usage of AI_AGG, please refer to: Apache Doris AI_AGG Documentation

Example 1: Summarize Customer Service Tickets

The following table simulates a simple customer service ticket:

SELECT * FROM support_tickets;
Enter fullscreen mode Exit fullscreen mode
ticket_id customer_name subject details
2 Bob Login Failure Same problem as Alice. Also seeing 502 errors on the SSO page.
3 Carol Payment Declined Credit card charged twice but order still shows pending.
5 Eve Login Failure Getting redirected back to login after entering 2FA code.
1 Alice Login Failure Cannot log in after password reset. Tried clearing cache and different browsers.
4 Dave Slow Dashboard Dashboard takes >30 seconds to load since the last release.

We can use AI_AGG to summarize customer issues for different problem types:

SELECT
    subject,
    AI_AGG(
        details,'Summarize every ticket detail into one short paragraph') AS ai_summary
FROM support_tickets
GROUP BY subject;
Enter fullscreen mode Exit fullscreen mode

The output is as follows:

subject ai_summary
Slow Dashboard The dashboard is experiencing slow loading times, taking over 30 seconds to load following the most recent release.
Payment Declined A customer reports being charged twice for their order, which remains in a pending status.
Login Failure Users are experiencing login issues, including 2FA redirection, post-password reset failures, and SSO 502 errors, despite clearing cache and trying different browsers.

AI_AGG Technical Analysis: Dynamic Pre-aggregation

Combining aggregate functions with AI requires solving the problem that the total text volume within a group can far exceed the model's context window. If all text is concatenated and sent to the AI at once, it can easily exceed the model's maximum context window. Doris solves this problem through dynamic pre-aggregation:

  1. Context Monitoring: During the text aggregation process, AI_AGG maintains an internal text buffer for each group (currently fixed at 128KB, compatible with most AI context windows).

  2. Dynamic Pre-aggregation: When a new text row would cause the buffer to exceed the threshold, AI_AGG triggers pre-aggregation—pausing to send the current buffer to the AI for intermediate processing.

  3. Context Replacement: The AI's concise intermediate result replaces the original long text in the buffer, freeing space for more data. If the buffer still exceeds the threshold after replacement, AI_AGG errors out to prevent model service overload.

This implementation integrates seamlessly with Doris's distributed query plan, leveraging multi-node parallel computing. Users can perform efficient intelligent analysis on massive text data using familiar SQL aggregation syntax.

EMBED: Text Vectorization Function

For detailed usage of EMBED, please refer to: Apache Doris EMBED Documentation

The core function of EMBED is to convert any text into a high-dimensional floating-point vector through AI. This vector is a mathematical representation of the text in a semantic space, capturing its semantic information. Texts with similar semantics will have vectors that are closer in this space.

Examples

Example 1: Build a Knowledge Base with Vectorization

The following table simulates a simple employee handbook:

CREATE TABLE knowledge_base (
    id BIGINT,
    title STRING,
    content STRING,
    embedding ARRAY<FLOAT> COMMENT 'Embedding vector generated by EMBED function'
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES (
    "replication_num" = "1"
);

INSERT INTO knowledge_base (id, title, content, embedding) VALUES
(1, "Travel Reimbursement Policy",
    "Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached.",
    EMBED("travel reimbursement policy")),
(2, "Leave Policy",
    "Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required.",
    EMBED("leave request policy")),
(3, "VPN User Guide",
    "To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate.",
    EMBED("VPN guide intranet access")),
(4, "Meeting Room Reservation",
    "Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified.",
    EMBED("meeting room booking reservation")),
(5, "Procurement Request Process",
    "Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required.",
    EMBED("procurement request process finance"));
Enter fullscreen mode Exit fullscreen mode

By vectorizing text with EMBED, combined with Doris's vector functions, you can perform the following operations:

1. Q&A Retrieval (with COSINE_DISTANCE)
SELECT 
    id, title, content,
    COSINE_DISTANCE(embedding, EMBED("How to apply for travel reimbursement?")) AS score
FROM knowledge_base
ORDER BY score ASCLIMIT 2;
Enter fullscreen mode Exit fullscreen mode
id title content score
1 Travel Reimbursement Policy Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached. 0.4463210454563673
5 Procurement Request Process Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. 0.5726841578491431
2. Problem Analysis Matching (with L2_DISTANCE)
SELECT 
    id, title, content,
    L2_DISTANCE(embedding, EMBED("How to access the company intranet")) AS distance
FROM knowledge_base
ORDER BY distance ASCLIMIT 2;

Enter fullscreen mode Exit fullscreen mode
id title content distance
3 VPN User Guide To access the internal network, employees must use VPN. For the first login, download and install the client and configure the certificate. 0.5838271122253775
1 Travel Reimbursement Policy Employees must submit a reimbursement request within 7 days after the business trip, with invoices and travel approval attached. 1.272394695975331
3. Text Relevance Matching (with INNER_PRODUCT)
SELECT 
    id, title, content,
    INNER_PRODUCT(embedding, EMBED("Leave system request leader approval")) AS score
FROM knowledge_base
WHERE id != 2ORDER BY score DESCLIMIT 2;

Enter fullscreen mode Exit fullscreen mode
id title content score
5 Procurement Request Process Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. 0.33268885332504
4 Meeting Room Reservation Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified. 0.29224032230852487
4. Find Similar Content (with L1_DISTANCE)
SELECT 
    id, title, content,
    L1_DISTANCE(embedding, EMBED("Procurement application process")) AS distance
FROM knowledge_base
ORDER BY distance ASCLIMIT 3;

Enter fullscreen mode Exit fullscreen mode
id title content distance
5 Procurement Request Process Departments must fill out a procurement request form for purchasing items. If the amount exceeds $5000, financial approval is required. 18.66882028897362
4 Meeting Room Reservation Meeting rooms can be reserved in advance through the OA system, with time and number of participants specified. 30.90449328294426
2 Leave Policy Employees must apply for leave in the system in advance. If the leave is longer than three days, approval from the direct manager is required. 31.060405636536416

Flexible Vector Dimension Control

Through Doris's built-in RESOURCE mechanism, users can set the ai.dimensions parameter when configuring an AI Resource to precisely specify the dimension of the generated vectors:

  • High-dimensional vectors: Retain richer semantic information (suitable for high-precision retrieval).

  • Low-dimensional vectors: Save storage space and accelerate computation (suitable for lightweight matching).

Note:

  • Ensure the AI model configured in the RESOURCE supports the specified dimension (otherwise, requests may fail).

  • For models that do not support dimension customization (e.g., OpenAI's text-embedding-ada-002), the ai.dimensions setting will be ignored, and the model's default dimension will be used.

Summary and Outlook

With the AI_AGG and EMBED functions, Apache Doris has successfully embedded AI capabilities into its database kernel, injecting powerful intelligent analysis capabilities into its native SQL engine and greatly expanding the boundaries of data analysis and intelligent applications:

  • AI_AGG: With dynamic pre-aggregation, it enables intelligent analysis of unstructured text (e.g., user comments, logs) directly in the database.

  • EMBED: Seamlessly integrates with vector functions to provide end-to-end semantic retrieval solutions (e.g., Q&A systems, content recommendation), simplifying application development.

These features empower SQL with the ability to command AI models, allowing data analysts to harness powerful AI at low cost and high efficiency to uncover deeper semantic value in data.

Looking ahead, Doris will continue to deepen the integration of AI and databases:

  • Optimize model scheduling and computational performance.

  • Explore cutting-edge features like multi-modal data analysis and AI Agent interactions.

  • Continuously lower the barrier to using AI technology, making data-driven intelligent decisions ubiquitous.

Top comments (0)