<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TiDB Community</title>
    <description>The latest articles on DEV Community by TiDB Community (@tidbcommunity).</description>
    <link>https://dev.to/tidbcommunity</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1045612%2Fd8180cc8-8419-41b4-93d7-a42d20cf6e53.jpg</url>
      <title>DEV Community: TiDB Community</title>
      <link>https://dev.to/tidbcommunity</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tidbcommunity"/>
    <language>en</language>
    <item>
      <title>How to Build an AI Agent that Builds Full-Stack Apps</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Tue, 06 Jan 2026 07:33:22 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/how-to-build-an-ai-agent-that-builds-full-stack-apps-5c1d</link>
      <guid>https://dev.to/tidbcommunity/how-to-build-an-ai-agent-that-builds-full-stack-apps-5c1d</guid>
      <description>&lt;p&gt;By Hao Huo, Director of AI Innovation at PingCAP&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An open-source starter kit for building a “Lovable.dev”-style AI agent&lt;/strong&gt;&lt;br&gt;
More and more, we’re seeing AI agents build entire applications from a single prompt. Platforms like Lovable.dev and Manus are pioneering this space. Many of them are using &lt;a href="https://www.pingcap.com/tidb/cloud/" rel="noopener noreferrer"&gt;TiDB Cloud&lt;/a&gt; to power their data layer. So, we decided to build one too, as a &lt;a href="https://github.com/pingcap/full-stack-app-builder-ai-agent" rel="noopener noreferrer"&gt;public, open-source dev kit available on GitHub&lt;/a&gt; that you can use as a starting point to build your own AI-driven development platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Video walkthrough:&lt;/strong&gt;&lt;br&gt;
Watch the YouTube walkthrough of the &lt;a href="https://www.youtube.com/watch?v=kyWZzj57wwY" rel="noopener noreferrer"&gt;Full Stack App Builder AI Agent&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What can this agent do?
&lt;/h1&gt;

&lt;p&gt;This agent is an AI chat app and codegen platform that ties together everything you need to prompt (i.e., vibe code) complete web applications end-to-end. It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generate complete web apps from a prompt&lt;/strong&gt;: You describe what you want to build, and the agent scaffolds it automatically with the correct files, dependencies, and setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provision a TiDB Cloud database automatically&lt;/strong&gt;: Every generated project gets its own TiDB Cloud cluster, with database, schema, and connection string fully preconfigured. Each instruction within the same project is versioned as a separate TiDB Cloud branch, keeping iterations isolated and trackable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track project versions&lt;/strong&gt;: The agent stores every generated app and its database credentials (user, password, branch info), so you can revisit previous versions or roll back safely. Most importantly, your database always stays in sync with the selected project version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run and preview apps instantly&lt;/strong&gt;: Test each generated project directly from the UI or open the preview URL immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep context between generations&lt;/strong&gt;: The agent remembers your previous instructions and builds on them, enabling iterative refinement instead of isolated one-shot generations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale down when idle&lt;/strong&gt;: Each generated app uses TiDB Cloud, which automatically scales to zero when unused.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Architecture Overview
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni8x8pgtokn4z6role8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni8x8pgtokn4z6role8m.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
This architecture transforms natural-language prompts into production-ready web apps by combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Codex (gpt-5.1-codex) or Claude Code&lt;/strong&gt; for reasoning and codegen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TiDB Cloud&lt;/strong&gt; for cluster-level, branchable data environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kysely&lt;/strong&gt; for type-safe SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel&lt;/strong&gt; for execution and deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; for version control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the center of this stack is &lt;a href="https://www.pingcap.com/tidb/cloud/" rel="noopener noreferrer"&gt;TiDB Cloud&lt;/a&gt;, acting as the central nervous system for the entire operation. It serves as both the control-plane database for the platform itself and provides the branchable, on-demand data environments for every app the agent generates.&lt;/p&gt;

&lt;h1&gt;
  
  
  How It Works: The Agent’s End-to-End Flow
&lt;/h1&gt;

&lt;p&gt;This entire process can be broken down into two key parts: the tools the agent needs to do its job, and the step-by-step workflow it follows to build an application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: The Agent’s Toolkit
&lt;/h2&gt;

&lt;p&gt;Before the agent can generate or deploy anything, it must authenticate itself against all the external systems it orchestrates. These credentials form the &lt;strong&gt;control plane&lt;/strong&gt; — the unified interface through which the agent provisions environments, generates code, manages data, and ships applications.&lt;/p&gt;

&lt;p&gt;The agent requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub tokens&lt;/strong&gt;: Create repositories, push generated code, and manage branches. GitHub becomes the source of truth for code checkpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel tokens&lt;/strong&gt;: Create ephemeral sandboxes, upload build artifacts, and deploy web apps. Vercel acts as the execution layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex or Claude Code API keys&lt;/strong&gt;: Enable reasoning and planning. The model acts as the brain of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.pingcap.com/tidbcloud/api-overview/?_gl=1*3xocmp*_gcl_aw*R0NMLjE3NjYxMTAzMjUuQ2owS0NRaUE2WTdLQmhDa0FSSXNBT3hocXRPM1N4TmlHMnJrdUIyS0dBcHVZejdvSVdwLUUydURMbzF4WnNjUjlma0hhWG9xVUEwX1F5RWFBcmhKRUFMd193Y0I.*_gcl_au*MjAyNTA4NDM5Mi4xNzYwOTQ0MzQ5*_ga*MTQ1MTc3NjUwOS4xNzQ1MjA3OTY3*_ga_3JVXJ41175*czE3Njc2ODMwMjEkbzQ3MyRnMSR0MTc2NzY4NDA3OSRqMzckbDAkaDIyMTUxODExNA..*_ga_ZEL0RNV6R2*czE3Njc2ODMwMzYkbzM4MSRnMSR0MTc2NzY4NDA1NyRqNjAkbDAkaDA.*_ga_CPG2VW1Y41*czE3Njc2ODMwMjQkbzQ5OSRnMSR0MTc2NzY4NDA3MyRqNDMkbDAkaDA." rel="noopener noreferrer"&gt;TiDB Cloud API keys&lt;/a&gt;&lt;/strong&gt;: Allow the agent to programmatically create, manage, and branch TiDB Cloud clusters. This provides isolated data environments and the backbone for safe iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these credentials let the agent fully automate the lifecycle: &lt;strong&gt;plan → provision → generate → migrate → deploy → iterate.&lt;/strong&gt;&lt;/p&gt;
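&lt;p&gt;As a minimal sketch, startup validation of these credentials might look like the following. The environment variable names are assumptions for illustration, not necessarily the kit's exact names:&lt;/p&gt;

```javascript
// Fail fast at startup if any control-plane credential is missing.
// These variable names are illustrative, not the kit's exact names.
const REQUIRED_ENV = [
  "GITHUB_TOKEN",
  "VERCEL_TOKEN",
  "OPENAI_API_KEY",
  "TIDB_CLOUD_PUBLIC_KEY",
  "TIDB_CLOUD_PRIVATE_KEY",
];

function assertCredentials(env = process.env) {
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length !== 0) {
    throw new Error(`Missing credentials: ${missing.join(", ")}`);
  }
  return true;
}
```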

&lt;h2&gt;
  
  
  Part 2: The App-Building Workflow in Action
&lt;/h2&gt;

&lt;p&gt;Now, let’s see how the agent uses its toolkit in a real-world scenario. From the initial prompt to the final deployment, the agent follows a precise, automated sequence to turn an idea into a running application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl2zyebkrmf1skum5uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl2zyebkrmf1skum5uo.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Prompt
&lt;/h3&gt;

&lt;p&gt;It begins when the user creates a request, such as: “Build a Todo List app.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Plan
&lt;/h3&gt;

&lt;p&gt;Immediately upon receiving the prompt, the model analyzes the requirements and generates a comprehensive execution plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Provision
&lt;/h3&gt;

&lt;p&gt;Once the plan is set, the agent provisions the necessary infrastructure. It creates a new TiDB Cloud cluster, a Vercel Sandbox, and a GitHub repo while simultaneously configuring the required environment variables.&lt;/p&gt;
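&lt;p&gt;The TiDB Cloud side of this provisioning step can be sketched as below. The base URL matches the branch-creation example later in this post; the region value is a placeholder, and the exact request shape should be checked against the TiDB Cloud serverless API reference:&lt;/p&gt;

```javascript
// Sketch: create a TiDB Cloud serverless cluster via the REST API.
// The region value here is a placeholder assumption.
const SERVERLESS_API_BASE = "https://serverless.tidbapi.com/v1beta1";

async function createCluster(displayName, publicKey, privateKey, fetchImpl = fetch) {
  const response = await fetchImpl(`${SERVERLESS_API_BASE}/clusters`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization:
        "Basic " + Buffer.from(`${publicKey}:${privateKey}`).toString("base64"),
    },
    body: JSON.stringify({ displayName, region: { name: "regions/aws-us-east-1" } }),
  });
  if (!response.ok) {
    throw new Error(`Failed to create cluster: ${response.status}`);
  }
  return response.json();
}
```

Injecting `fetchImpl` keeps the function testable without network access.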

&lt;h3&gt;
  
  
  Step 4: Generate Code
&lt;/h3&gt;

&lt;p&gt;With the environment ready, Codex or Claude Code proceeds to generate the application code. This includes creating pages, components, and API routes, alongside the Kysely schema and migration files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Database Migration
&lt;/h3&gt;

&lt;p&gt;Next, to ensure data consistency, Kysely applies the typed migrations directly to the TiDB Cloud cluster.&lt;/p&gt;
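&lt;p&gt;Conceptually, this step works like a stripped-down version of Kysely's Migrator: apply every not-yet-applied "up" migration in name order and record what ran. The sketch below models only that logic, without a real database handle:&lt;/p&gt;

```javascript
// Simplified stand-in for Kysely's Migrator: applies pending "up"
// migrations in name order and records which ones ran.
function migrateToLatest(migrations, applied) {
  const done = new Set(applied);
  const results = [];
  const names = Object.keys(migrations).sort();
  for (const name of names) {
    if (done.has(name)) continue;
    migrations[name].up(); // in the real kit this receives a Kysely db handle
    results.push({ migrationName: name, status: "Success" });
  }
  return results;
}
```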

&lt;h3&gt;
  
  
  Step 6: Deploy
&lt;/h3&gt;

&lt;p&gt;Following the successful migration, the agent commits the code to GitHub and automatically triggers a Vercel deployment to bring the app live.&lt;/p&gt;
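&lt;p&gt;A deployment trigger along these lines is one way to sketch this step. The payload follows Vercel's public create-deployment endpoint; &lt;code&gt;repoId&lt;/code&gt; and &lt;code&gt;ref&lt;/code&gt; are placeholders, not values from the kit:&lt;/p&gt;

```javascript
// Sketch: trigger a Vercel deployment from a GitHub branch via the REST API.
// repoId and ref are placeholder assumptions.
async function triggerDeployment({ token, projectName, repoId, ref }, fetchImpl = fetch) {
  const response = await fetchImpl("https://api.vercel.com/v13/deployments", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      name: projectName,
      gitSource: { type: "github", repoId, ref },
    }),
  });
  if (!response.ok) {
    throw new Error(`Deployment failed: ${response.status}`);
  }
  return response.json();
}
```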

&lt;h3&gt;
  
  
  Step 7: Iterate with Follow-Up Instruction
&lt;/h3&gt;

&lt;p&gt;Finally, the process becomes cyclical. For example, when you provide a follow-up instruction like “Add a username field,” the agent creates a new &lt;a href="https://docs.pingcap.com/tidbcloud/branch-manage/?plan=starter" rel="noopener noreferrer"&gt;TiDB Cloud branch&lt;/a&gt;. It then updates the schema, regenerates and applies migrations, updates the UI/API code, and redeploys. Because of this branching capability, everything stays isolated and fully reversible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnp1d5lz7z5tue67pyb3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnp1d5lz7z5tue67pyb3v.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;br&gt;
This is critical for iterative development. Because Kysely generates both an up and a down migration for every schema change, the agent can not only apply new structures but also safely reverse them. This ensures that every iteration is clean and fully reversible without manual database intervention.&lt;/p&gt;

&lt;h1&gt;
  
  
  Magic Features: Where It All Clicks
&lt;/h1&gt;

&lt;p&gt;While the workflow seems straightforward, a few key techniques are what make this system robust, safe, and efficient. Let’s look at the ‘magic’ that makes it all click.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technique 1: Syncing Code and Data with Git + TiDB Branches
&lt;/h2&gt;

&lt;p&gt;Every instruction becomes a lightweight checkpoint for both code and data. Each version stores the Git repo name and branch name alongside the TiDB Cloud cluster ID and branch name, ensuring perfect code–data synchronization.&lt;/p&gt;
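&lt;p&gt;Such a checkpoint can be modeled as a small record pairing the Git branch and the TiDB Cloud branch for each version. The field names below are illustrative, not the kit's actual schema:&lt;/p&gt;

```javascript
// Sketch of the per-instruction checkpoint described above: each version
// pins a Git branch and a TiDB Cloud branch together. Field names are
// illustrative, not the kit's actual schema.
function makeCheckpoint(version, git, tidb) {
  return {
    version,
    gitRepo: git.repo,
    gitBranch: git.branch,
    tidbClusterId: tidb.clusterId,
    tidbBranch: tidb.branch,
    createdAt: new Date().toISOString(),
  };
}

// Rolling back means checking out both branches recorded for that version.
function rollbackTargets(checkpoints, version) {
  const cp = checkpoints.find((c) => c.version === version);
  if (!cp) throw new Error(`Unknown version: ${version}`);
  return { gitBranch: cp.gitBranch, tidbBranch: cp.tidbBranch };
}
```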

&lt;h3&gt;
  
  
  Example: Creating a TiDB Cloud Branch Programmatically
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import fetch from "node-fetch";

const SERVERLESS_API_BASE = "https://serverless.tidbapi.com/v1beta1";

async function createBranch(clusterId, newBranchName, publicKey, privateKey) {
  const url = `${SERVERLESS_API_BASE}/clusters/${clusterId}/branches`;
  const body = JSON.stringify({ displayName: newBranchName });

  const response = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization:
        "Basic " + Buffer.from(`${publicKey}:${privateKey}`).toString("base64"),
    },
    body,
  });

  if (!response.ok) {
    const text = await response.text();
    throw new Error(`Failed to create branch: ${response.status} ${text}`);
  }

  const data = await response.json();
  return data.branchId;
}

createBranch(
  "1234567890",
  "new-feature-branch",
  process.env.TIDB_CLOUD_PUBLIC_KEY,
  process.env.TIDB_CLOUD_PRIVATE_KEY
)
  .then((branchId) =&amp;gt; console.log("Branch created:", branchId))
  .catch(console.error);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Technique 2: Safe, Type-Safe Schema Migrations with Kysely
&lt;/h2&gt;

&lt;p&gt;When the agent needs to change the schema, it generates a Kysely migration file. This makes schema evolution safe, reversible, and fully automated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: A Kysely Migration File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import type { Kysely } from "kysely";

export async function up(db: Kysely&amp;lt;any&amp;gt;) {
  await db.schema
    .alterTable("todo_list")
    .addColumn("username", "varchar(255)", (col) =&amp;gt; col.notNull().defaultTo(""))
    .execute();
}

export async function down(db: Kysely&amp;lt;any&amp;gt;) {
  await db.schema
    .alterTable("todo_list")
    .dropColumn("username")
    .execute();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is critical for iterative development. Because Kysely generates both an “up” and a “down” migration for every schema change, the agent can not only apply new structures but also safely reverse them. This ensures that every iteration is clean and fully reversible without manual database intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technique 3: On-Demand Dev Environments with Scale-to-Zero
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TiDB Cloud&lt;/strong&gt; automatically scales down to &lt;strong&gt;$0 when idle&lt;/strong&gt;. This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral, AI-generated apps.&lt;/li&gt;
&lt;li&gt;On-demand development environments.&lt;/li&gt;
&lt;li&gt;Branch-per-instruction workflows.&lt;/li&gt;
&lt;li&gt;Burst-heavy workloads from agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can create many isolated environments without persistent costs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrapping Up
&lt;/h1&gt;

&lt;p&gt;This starter kit is more than just code; it’s a blueprint for building dynamic, AI-driven applications. The key takeaway is treating your database not as a static box, but as a programmable, API-driven resource. We’ve open-sourced the entire project for you to explore. We encourage you to fork the repo, use the code, and see what you can create.&lt;/p&gt;

&lt;p&gt;Ready to Build Your Own? Try &lt;a href="https://tidbcloud.com/free-trial/" rel="noopener noreferrer"&gt;TiDB Cloud&lt;/a&gt; for free to test out the database branching and scale-to-zero features used in this project. No credit card is required to get started.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>webdev</category>
      <category>fullstack</category>
    </item>
    <item>
      <title>TiKV Component GC (Physical Space Reclamation) Principles and Common Issues</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Fri, 04 Jul 2025 04:11:55 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/tikv-component-gc-physical-space-reclamation-principles-and-common-issues-33a1</link>
      <guid>https://dev.to/tidbcommunity/tikv-component-gc-physical-space-reclamation-principles-and-common-issues-33a1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This article was written by Shirly Wu, Support Escalation Engineer at PingCAP.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In TiDB, the GC worker uploads the GC safepoint to the PD server, and all TiKV instances regularly fetch the GC safepoint from PD. If there is any change, the TiKV instances will use the latest GC safepoint to start the local GC process. In this article, we will focus on the principles of TiKV-side GC and discuss some common issues.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC Key in TiKV
&lt;/h1&gt;

&lt;p&gt;During GC, TiDB clears all locks before the GC safepoint across the entire cluster using the resolve locks mechanism. This means that once the data reaches TiKV, all transaction statuses before the GC safepoint are already resolved, and there are no remaining locks. The transaction statuses can be retrieved from the primary key of the distributed transaction. At this point, we can confidently and safely delete old version data.&lt;/p&gt;

&lt;p&gt;So, how exactly does the deletion process work?&lt;/p&gt;

&lt;p&gt;Let’s look at the following example:&lt;/p&gt;

&lt;p&gt;The current key has four versions, written in the following sequence:&lt;/p&gt;

&lt;p&gt;1:00 — New write or update, data stored in the default CF&lt;br&gt;
2:00 — Update, data stored in the default CF&lt;br&gt;
3:00 — Delete&lt;br&gt;
4:00 — New write, data stored in the default CF&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2qxx2mm1iggf7m8ra6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2qxx2mm1iggf7m8ra6k.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the GC safepoint is &lt;strong&gt;2:30&lt;/strong&gt;, which ensures snapshot consistency up to 2:30, we will retain the version read at 2:30: &lt;strong&gt;key_02:00 =&amp;gt; (PUT, 01:30)&lt;/strong&gt;. All earlier versions will be deleted, including &lt;strong&gt;key_01:00&lt;/strong&gt;. The write CF and default CF entries corresponding to this transaction will both be deleted.&lt;/p&gt;

&lt;p&gt;What if the GC safepoint is 3:30?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mzvzezifucorwwi9247.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mzvzezifucorwwi9247.png" alt="Image description" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, the version read at 3:30, &lt;strong&gt;key_03:00 =&amp;gt; DELETE&lt;/strong&gt;, will be retained. All versions before 3:00 will be deleted.&lt;/p&gt;

&lt;p&gt;We see that the snapshot at 3:30 shows the transaction at key_03:00 deleting this key. So, is it necessary to retain the MVCC version at 03:00? It is not. Therefore, under normal circumstances, if the GC safepoint is 3:30, the data that needs to be garbage collected for this key will be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnblrknbdz9u1a007ppkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnblrknbdz9u1a007ppkp.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how the GC process works for a specific key — TiKV’s GC process requires scanning all the keys on the current TiKV instance and deleting the old versions that meet the GC criteria.&lt;/p&gt;
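&lt;p&gt;The per-key rule above can be modeled as a toy function: keep the versions newer than the safepoint, keep the single version visible at the safepoint unless it is a DELETE, and drop everything older. Timestamps are simplified to plain numbers here:&lt;/p&gt;

```javascript
// Toy model of per-key GC. versions is sorted by ts descending, and each
// entry is { ts, op } with op either "PUT" or "DELETE".
function gcKey(versions, safepoint) {
  const keep = [];
  let visibleFound = false;
  for (const v of versions) {
    if (v.ts > safepoint) { keep.push(v); continue; }
    if (!visibleFound) {
      visibleFound = true;
      // A visible DELETE need not be retained.
      if (v.op !== "DELETE") keep.push(v);
      continue;
    }
    // Older than the visible version: garbage, dropped.
  }
  return keep;
}
```

Running this on the four-version example from the article reproduces both cases: a 2:30 safepoint drops only key_01:00, and a 3:30 safepoint drops the DELETE and everything before it.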
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;gc_keys&lt;/strong&gt; can create read pressure on the system since it needs to scan all versions of the current key to determine whether old versions should be deleted. Related monitoring can be found in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC scan write/default details&lt;/strong&gt;, which records the pressure on RocksDB’s write/default CF during GC worker execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi08qbar3kcn6l4ixnov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi08qbar3kcn6l4ixnov.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The duration of gc_keys can be seen in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC tasks duration&lt;/strong&gt;. If there is high latency in this area, it indicates significant GC pressure or that the read/write pressure on the system is affecting the GC process.&lt;/p&gt;
&lt;h1&gt;
  
  
  GC in TiKV
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaengscirfwze2lawnle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaengscirfwze2lawnle.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  GC Worker
&lt;/h1&gt;

&lt;p&gt;Each TiKV instance has a GC worker thread responsible for handling specific GC tasks. The GC worker in TiKV mainly handles the following types of requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GC_keys&lt;/strong&gt;: This involves scanning and deleting old versions of a specific key that meet the GC criteria. The detailed process was described in the first chapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC(range)&lt;/strong&gt;: This involves using GC_keys on a specified range of keys, performing GC on each key individually within the range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;unsafe-destroy-range&lt;/strong&gt;: This is the direct physical cleanup of a continuous range of data, corresponding to operations like truncate/drop table/partition mentioned in the previous article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, the GC worker has two key configurations, which cannot be adjusted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thread count&lt;/strong&gt;: Currently, the GC worker only has one thread. This is hardcoded in the source, and no external configuration is provided.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC_MAX_PENDING_TASKS&lt;/strong&gt;: This is the maximum number of tasks the GC worker queue can handle, set to 4096.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;GC tasks QPS/duration can be monitored in tikv-details -&amp;gt; GC -&amp;gt; GC tasks/GC tasks duration. If the GC tasks duration is high, we need to check whether the GC worker’s CPU resources are sufficient, in combination with the QPS.&lt;/p&gt;

&lt;p&gt;GC worker CPU usage: &lt;strong&gt;tikv-details -&amp;gt; thread CPU -&amp;gt; GC worker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtvz0x86cm8xd1lx5ypa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtvz0x86cm8xd1lx5ypa.png" alt="Image description" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  GC manager
&lt;/h1&gt;

&lt;p&gt;The GC manager is the thread responsible for driving the GC work in TiKV. The main steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syncing the GC safepoint to local memory&lt;/li&gt;
&lt;li&gt;Globally guiding the execution of specific GC tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Syncing GC Safepoint to Local&lt;/strong&gt;&lt;br&gt;
The GC manager regularly requests the latest GC safepoint from PD every ten seconds and refreshes the safepoint in memory. The related monitoring can be found in &lt;strong&gt;tikv-details -&amp;gt; GC&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuornk2im91o3muj97wp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuornk2im91o3muj97wp.png" alt="Image description" width="791" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Issues&lt;/strong&gt;&lt;br&gt;
When monitoring shows that the TiKV auto GC safepoint is stuck for a long time and not advancing, it indicates that the GC state on the TiDB side may have a problem. At this point, you need to follow the troubleshooting steps in the previous article to investigate why GC on the TiDB side is stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Implementing GC Jobs&lt;/strong&gt;&lt;br&gt;
If the GC manager detects that the GC safepoint has advanced, it begins implementing the specific GC tasks based on the current system configuration. This part is mainly divided into two methods based on the &lt;strong&gt;gc.enable-compaction-filter&lt;/strong&gt; parameter:&lt;/p&gt;

&lt;p&gt;a. Traditional GC, where GC(range) is called for each region.&lt;/p&gt;

&lt;p&gt;b. GC via compaction filter (the default method since v5.0): instead of performing a real GC pass, this method waits until RocksDB performs compaction, at which point old versions are reclaimed by the compaction filter.&lt;/p&gt;
&lt;h1&gt;
  
  
  TiKV GC Implementation Methods
&lt;/h1&gt;

&lt;p&gt;Next, we will explain the principles and common troubleshooting steps for these two GC methods.&lt;/p&gt;
&lt;h2&gt;
  
  
  GC by Region (Traditional GC)
&lt;/h2&gt;

&lt;p&gt;In traditional GC, when &lt;strong&gt;gc.enable-compaction-filter&lt;/strong&gt; is set to false, the GC manager performs GC on each TiKV region one by one. This process is referred to as “GC a round.”&lt;/p&gt;
&lt;h3&gt;
  
  
  GC a Round
&lt;/h3&gt;

&lt;p&gt;In traditional GC, a round of GC is completed when GC has been executed on all regions, and we mark the progress as 100%. If GC progress never reaches 100%, it indicates that GC pressure is high, and the physical space reclamation process is being affected. GC progress can be monitored in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; tikv-auto-gc-progress&lt;/strong&gt;. We can also observe the time taken for each round of GC in TiKV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32szgo5zfhbqqmttv9y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32szgo5zfhbqqmttv9y7.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After introducing the concept of a ‘GC round,’ let’s now look at how TiKV defines a round during execution.&lt;/p&gt;

&lt;p&gt;In simple terms, the GC manager starts GC work from the first region, continuing until the last region’s GC work is completed. However, if there is too much GC work, the GC safepoint may advance before the GC reaches the last region. In this case, do we continue using the old safepoint or the new one?&lt;/p&gt;

&lt;p&gt;The answer is to use the latest GC safepoint in real time for each region that follows. This is a simple optimization of traditional GC.&lt;/p&gt;

&lt;p&gt;Here’s an &lt;strong&gt;example&lt;/strong&gt; of how TiKV processes an updated GC safepoint during GC execution&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qenkuqyrupilxxohoqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qenkuqyrupilxxohoqq.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When GC starts, gc-safepoint is 10, and we use safepoint=10 for GC regions 1 and 2.&lt;/li&gt;
&lt;li&gt;After finishing GC in region 2, the GC safepoint advances to 20. From this point on, we use 20 to continue GC in the remaining regions.&lt;/li&gt;
&lt;li&gt;Once all regions have been GC’d with gc safepoint=20, we start again from the first region, now using gc safepoint=20 for GC.&lt;/li&gt;
&lt;/ol&gt;
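&lt;p&gt;The three steps above reduce to one rule: re-read the latest safepoint before processing each region. A toy model of a GC round:&lt;/p&gt;

```javascript
// Toy model of "GC a round" in traditional GC: each region is processed
// with whatever safepoint is current at that moment, so a safepoint that
// advances mid-round applies to the remaining regions.
function gcRound(regions, getSafepoint, gcRegion) {
  const used = [];
  for (const region of regions) {
    const sp = getSafepoint(); // re-read the latest safepoint per region
    gcRegion(region, sp);
    used.push(sp);
  }
  return used;
}
```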
&lt;h3&gt;
  
  
  Common Issues
&lt;/h3&gt;

&lt;p&gt;In traditional GC, since all old versions must be scanned before being deleted, and the delete operations are then written back to RocksDB, the system is heavily impacted. This manifests in the following ways:&lt;/p&gt;

&lt;p&gt;1. Impact on GC Progress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The GC worker becomes a bottleneck. Since all reclamation tasks need to be handled by the GC worker, which only has one thread, its CPU usage will be fully occupied. You can monitor this using: tikv-details -&amp;gt; thread CPU -&amp;gt; GC worker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Impact on Business Read/Write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raftstore read/write pressure increases: The GC worker needs to scan all data versions and then delete the matching ones during GC_keys tasks.&lt;/li&gt;
&lt;li&gt;The increased write load on RocksDB in the short term causes rapid accumulation of L0 files, triggering RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. Physical Space Usage Increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since a DELETE operation in RocksDB ultimately results in writing a new version of the current key, physical space usage might actually increase.&lt;/li&gt;
&lt;li&gt;Only after RocksDB compaction completes will the physical space be reclaimed, and this compaction requires temporary space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the business cannot tolerate these impacts, the workaround is to enable the &lt;strong&gt;gc.enable-compaction-filter&lt;/strong&gt; parameter.&lt;/p&gt;
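&lt;p&gt;As a sketch, this switch lives in the &lt;code&gt;[gc]&lt;/code&gt; section of the TiKV configuration file (in recent TiKV versions it is enabled by default):&lt;/p&gt;

```toml
# tikv.toml -- turn on GC with the compaction filter
[gc]
enable-compaction-filter = true
```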
&lt;h1&gt;
  
  
  GC with Compaction Filter
&lt;/h1&gt;

&lt;p&gt;As mentioned earlier, in traditional GC, we scan TiKV’s MVCC entries one by one and use the safepoint to determine which data can be deleted, sending a DELETE key operation to Raftstore (RocksDB). Since RocksDB uses an LSM tree architecture and its internal MVCC mechanism, old versions of data are not immediately deleted when new data is written, even during a DELETE operation. They are retained alongside the new data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Compaction in RocksDB?
&lt;/h2&gt;

&lt;p&gt;Let’s explore the RocksDB architecture to understand the compaction mechanism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw01jagfu1oknw8iaknu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw01jagfu1oknw8iaknu.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RocksDB uses an LSM tree structure to improve write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RocksDB Write Flow&lt;/strong&gt;&lt;br&gt;
When RocksDB receives a write operation (PUT(key =&amp;gt; value)), the complete process is as follows:&lt;/p&gt;

&lt;p&gt;1. When a new key is written, it is first written to the WAL and memtable, and then a success response is returned.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RocksDB appends the data directly to the WAL file, persisting it locally.&lt;/li&gt;
&lt;li&gt;The data is written on an active memtable, which is fast because it operates in memory, and the data in the memtable is ordered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. As more data is written, the memtable gradually fills up. When it becomes full, the active memtable is marked as immutable, and a new active memtable is created to store new writes.&lt;/p&gt;

&lt;p&gt;3. The data from the immutable memtable is flushed to a local file, which we call an SST file.&lt;/p&gt;

&lt;p&gt;4. Over time, more SST files are created. Note that SST files converted directly from memtables contain ordered data, but their key ranges are global. These SST files are placed in Level 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RocksDB Read Flow&lt;/strong&gt;&lt;br&gt;
When a read request is received, the lookup follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search in the memtable. If the key is found, return the value.&lt;/li&gt;
&lt;li&gt;Search the block cache (whose data comes from SST files) and then the SST files themselves. Since the SST files in Level 0 are the most recent, they are searched first. Also, because Level 0 SST files are converted directly from memtables, their key ranges are global; in the worst case, every such SST file must be searched one by one.&lt;/li&gt;
&lt;/ol&gt;
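&lt;p&gt;The lookup order can be sketched as follows (a toy model in Python, not RocksDB’s actual code; files are modeled as plain dicts):&lt;/p&gt;

```python
# Toy LSM read path (illustrative): check the memtable, then every Level-0
# file from newest to oldest (their key ranges overlap), then the deeper
# levels, where file ranges are disjoint (a real implementation picks the
# one candidate file per level by binary search instead of scanning).

def lsm_get(key, memtable, l0_files, levels):
    if key in memtable:
        return memtable[key]
    for sst in reversed(l0_files):       # newest L0 file first
        if key in sst:
            return sst[key]
    for level in levels:                 # L1..L6
        for sst in level:
            if key in sst:
                return sst[key]
    return None

memtable = {"b": 3}
l0 = [{"a": 1}, {"a": 2}]                # older file first, newer last
print(lsm_get("a", memtable, l0, []))    # newest L0 wins: 2
```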

&lt;p&gt;&lt;strong&gt;Compaction for Improved Read Performance&lt;/strong&gt;&lt;br&gt;
From the above read flow, we can see that if we only have SST data in Level 0, as more and more files accumulate in Level 0, RocksDB’s read performance will degrade. To improve read performance, RocksDB performs merge sorting on the SST files in Level 0, a process known as compaction. The main tasks of RocksDB compaction are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Merging multiple SST files into one.&lt;/li&gt;
&lt;li&gt;Keeping only the latest MVCC version from RocksDB’s perspective (GC).&lt;/li&gt;
&lt;li&gt;Moving data down from Level 0 into Level 1~Level 6.&lt;/li&gt;
&lt;/ol&gt;
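&lt;p&gt;The first two tasks can be sketched as a merge that keeps only the newest entry per key, from RocksDB’s point of view (illustrative Python, not RocksDB’s code):&lt;/p&gt;

```python
# Sketch of a compaction: merge-sort several SST files and keep only the
# entry with the highest sequence number for each user key.
def compact(ssts):
    """ssts: list of {key: (seqno, value)} dicts, any order."""
    merged = {}
    for sst in ssts:
        for key, (seq, val) in sst.items():
            if key not in merged or seq > merged[key][0]:
                merged[key] = (seq, val)
    return dict(sorted(merged.items()))  # one sorted output file

out = compact([{"a": (5, "old")}, {"a": (9, "new"), "b": (7, "x")}])
print(out)  # {'a': (9, 'new'), 'b': (7, 'x')}
```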

&lt;p&gt;From the above process, we can see that RocksDB’s compaction work is similar to what we do during GC. So, is it possible to combine TiKV’s GC process with RocksDB’s compaction? Of course, it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combining TiKV GC with Compaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RocksDB provides an interface for compaction filters, which allows us to define rules for filtering keys during the compaction process. Based on the rules we provide in the compaction filter, we can decide whether to discard the key during this phase.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Principle
&lt;/h2&gt;

&lt;p&gt;Let’s take TiKV’s GC as an example to understand how data is reclaimed during the compaction process when the compaction filter is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TiKV’s Compaction Filter Only Affects Write-CF&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, TiKV’s compaction filter only affects write-cf. Why? Because write-cf stores the MVCC metadata, while default-cf stores the actual data. As for lock-cf, as discussed on the TiDB side, after the GC safepoint is uploaded to PD there are no remaining locks in lock-cf before the safepoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Directly Filtering Unnecessary MVCC Keys in Compaction&lt;/strong&gt;&lt;br&gt;
Next, let’s look at how an MVCC key &lt;strong&gt;a&lt;/strong&gt; behaves during a compaction process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2v2x9poh6z7s7occde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2v2x9poh6z7s7occde.png" alt="Image description" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The TiKV MVCC key &lt;strong&gt;a&lt;/strong&gt; in write-cf has a commit_ts suffix. Initially, let’s assume we are compacting two SST files on the left, which are in Level 2 and Level 3. After compaction, these files will be stored in Level 3.&lt;/li&gt;
&lt;li&gt;The SST file in Level 2 contains &lt;strong&gt;a_90&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The SST file in Level 3 contains &lt;strong&gt;a_85, a_82,&lt;/strong&gt; and &lt;strong&gt;a_80&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The current GC safepoint is 89. Based on the GC key processing rules from Chapter 1, we know we need to retain the older version a_85, meaning all versions before a_85 can be deleted.&lt;/li&gt;
&lt;li&gt;From the right, we can see that the new SST file only contains a_85 and a_90. The other versions, along with the corresponding data in default-cf, are deleted during compaction.&lt;/li&gt;
&lt;/ol&gt;
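&lt;p&gt;The filtering decision illustrated above can be sketched as follows (a Python sketch of the rule, not TiKV’s actual Rust compaction filter); with commit_ts versions 90, 85, 82, 80 and safepoint 89 it keeps exactly a_90 and a_85:&lt;/p&gt;

```python
# Sketch of the compaction-filter rule: keep every version newer than the
# safepoint, plus the newest version at or below it; filter the rest.
def filter_versions(commit_tss, safepoint):
    kept, seen_at_or_below = [], False
    for ts in sorted(commit_tss, reverse=True):   # newest first
        if ts > safepoint:
            kept.append(ts)
        elif not seen_at_or_below:
            kept.append(ts)            # latest version at/below the safepoint
            seen_at_or_below = True
        # older versions at/below the safepoint are filtered out
    return kept

print(filter_versions([90, 85, 82, 80], 89))  # [90, 85]
```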

&lt;p&gt;In summary, compared to traditional GC, using a compaction filter for GC has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It eliminates the need to read from RocksDB.&lt;/li&gt;
&lt;li&gt;The deletion (write) process for RocksDB is simplified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although compaction itself introduces some pressure, this approach removes the read/write load that GC previously imposed on RocksDB, which is a significant performance optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compaction on Non-L6 SST Files with Write_type::DEL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdh3rx64bkc1o20u2ug3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdh3rx64bkc1o20u2ug3.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When compaction occurs, if we encounter the latest version of data as &lt;strong&gt;write_type::DEL&lt;/strong&gt;, should we delete it directly? &lt;strong&gt;The answer is no.&lt;/strong&gt; Unlike the gc_keys interface, which scans all versions of the key, compaction only scans the versions within the SST files involved in the current compaction. Therefore, if we delete &lt;strong&gt;write_type::DEL&lt;/strong&gt; at the current level during compaction, there might still be older versions of the key at lower levels. For example, if we delete &lt;strong&gt;a_85 =&amp;gt; write_type::DEL&lt;/strong&gt; during this compaction, when users read the snapshot at &lt;strong&gt;gc_safepoint=89&lt;/strong&gt;, &lt;strong&gt;a_85&lt;/strong&gt; would be missing, and the latest matching version would be &lt;strong&gt;a_78&lt;/strong&gt;, which breaks the correctness of the snapshot at &lt;strong&gt;gc_safepoint=89&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Write_type::DEL in Compaction Filter&lt;/strong&gt;&lt;br&gt;
As we know from the gc_keys section, &lt;strong&gt;write_type::DEL&lt;/strong&gt; is a special case. When the compaction filter is enabled, does this type of key require special handling? Yes, it does. First, we need to consider when &lt;strong&gt;write_type::DEL&lt;/strong&gt; data can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c4m4yufexvehymbacjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c4m4yufexvehymbacjt.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When compacting the lowest-level SST files, if the current key meets the following conditions, we can safely reclaim this version using &lt;strong&gt;gc_keys(a,89)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The current key is the latest version before the gc_safepoint 89.&lt;/li&gt;
&lt;li&gt;The current key is in Level 6, and it is the only remaining version. This ensures there are no earlier versions (and ensures no additional writes are generated during gc_keys).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After this compaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The new &lt;strong&gt;SST&lt;/strong&gt; file will still include a &lt;strong&gt;write_type::DEL&lt;/strong&gt; version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gc_keys&lt;/strong&gt; will write a &lt;strong&gt;(DELETE, a_85)&lt;/strong&gt; operation to &lt;strong&gt;RocksDB&lt;/strong&gt;. &lt;strong&gt;This is the only way to generate a tombstone for write-cf with the compaction filter enabled&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Related Configuration
&lt;/h2&gt;

&lt;p&gt;As we’ve discussed, when the compaction filter is enabled, most physical data reclamation is completed during the RocksDB write CF compaction. For each key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For tso &amp;gt; gc safepoint, the key is retained and skipped.&lt;/li&gt;
&lt;li&gt;For tso &amp;lt;= gc safepoint: The latest version is retained, and older versions are filtered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4faauhmye8qo7qmw669.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4faauhmye8qo7qmw669.png" alt="Image description" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the next question is: Since the GC process (physical space reclamation) relies heavily on RocksDB’s compaction, how can we stimulate RocksDB’s compaction process?&lt;/p&gt;

&lt;p&gt;In addition to automatic triggers, TiKV also runs a thread that periodically checks the status of each region and decides whether to trigger compaction based on the presence of old versions in the region. Currently, we offer the following &lt;a href="https://docs.pingcap.com/tidb/stable/tikv-configuration-file/" rel="noopener noreferrer"&gt;parameters&lt;/a&gt; to control the speed of region checks and whether a region should initiate compaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-check-interval&lt;/strong&gt;: Generally, this does not need adjustment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-check-step&lt;/strong&gt;: Generally, this does not need adjustment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-min-tombstones&lt;/strong&gt;: The number of tombstones required to trigger RocksDB compaction. Default: 10,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-tombstones-percent&lt;/strong&gt;: The percentage of tombstones required to trigger RocksDB compaction. Default: 30%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-min-redundant-rows&lt;/strong&gt; (introduced in v7.1.0): The number of redundant MVCC data rows needed to trigger RocksDB compaction. Default: 50,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-redundant-rows-percent&lt;/strong&gt; (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;
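&lt;p&gt;For reference, these knobs sit in the &lt;code&gt;[raftstore]&lt;/code&gt; section of the TiKV configuration file; a minimal sketch using the defaults quoted above (the redundant-rows percentage is left at its shipped default):&lt;/p&gt;

```toml
# tikv.toml -- thresholds for the periodic region check to trigger compaction
[raftstore]
region-compact-min-tombstones = 10000
region-compact-tombstones-percent = 30
# v7.1.0 and later only:
region-compact-min-redundant-rows = 50000
```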

&lt;p&gt;Notably, versions prior to v7.1.0 lack redundant MVCC version detection, so on those versions we must manually compact regions to trigger the first round of compaction.&lt;/p&gt;
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC in Compaction-filter:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpst7hj6wmoiqkj2puiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpst7hj6wmoiqkj2puiy.png" alt="Image description" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key field definitions: during the compaction filter process, each key-value is counted under one of the following metrics, depending on which condition it meets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cvq0lwow4mhe7wv1h24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cvq0lwow4mhe7wv1h24.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it is before the GC safepoint and not the latest version (a-v1, a-v5, b-v5):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;filtered: The number of old versions directly filtered (physically deleted, no new writes) by the compaction filter, representing the effective reclamation of old version data. If this metric is empty, it means there are no old versions to reclaim.&lt;/li&gt;
&lt;li&gt;Orphan-version: After the old version is deleted from write-cf, data in default-cf needs to be cleaned up. If cleaning fails, it will be handled via GcTask::OrphanVersions. If there is data here, it indicates that there were too many versions to delete, causing RocksDB to be overloaded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Latest version before the GC safepoint (the oldest version to retain, a-v10, b-v12):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rollback/lock&lt;/strong&gt;: Write types of Rollback/Lock, which will create a tombstone in RocksDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_met&lt;/strong&gt;: Write type = DELETE, and it’s the lowest-level SST file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_handled&lt;/strong&gt;: Data with write type = DELETE reclaimed through gc_keys().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_wasted&lt;/strong&gt;: Data reclaimed through gc_keys() that was already deleted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_wasted + mvcc_deletion_handled&lt;/strong&gt;: The number of keys with type = DELETE that, at the bottommost level, only have one version.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Common Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem of Physical Space Not Being Released Long-Term After Deleting Data Using Compaction-filter&lt;/strong&gt;&lt;br&gt;
Although compaction filter GC can clean up old data directly during the compaction stage and alleviate GC pressure, the principles above show that data reclamation relies on RocksDB’s compaction: if compaction is never triggered, physical space will not be released even after the GC safepoint has passed.&lt;/p&gt;

&lt;p&gt;Therefore, stimulating RocksDB compaction becomes crucial in such cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround 1: Adjusting Compaction Frequency via Parameters&lt;/strong&gt;&lt;br&gt;
For large DELETE operations, since no new writes occur after the data is deleted, it becomes more difficult to stimulate RocksDB’s automatic compaction.&lt;/p&gt;

&lt;p&gt;In versions 7.1.0 and later, we can adjust the parameters related to MVCC redundant versions to stimulate RocksDB compaction, with key parameters including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;region-compact-min-redundant-rows&lt;/code&gt; (introduced in v7.1.0): The number of redundant MVCC data rows required to trigger RocksDB compaction. Default: 50,000.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;region-compact-redundant-rows-percent&lt;/code&gt; (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In versions before v7.1.0 this feature is not available, so manual compaction is required; adjusting the two parameters above has no effect on those versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently, there are no parameters that stimulate compaction by counting deleted MVCC versions, so in most cases, this data cannot be automatically reclaimed. We can track this issue: &lt;a href="https://github.com/tikv/tikv/issues/17269" rel="noopener noreferrer"&gt;GitHub issue #17269&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround 2: Manual Compaction&lt;/strong&gt;&lt;br&gt;
If we’ve performed a large DELETE operation on a table, and the deletion has passed the GC lifetime, we can quickly reclaim physical space by manually compacting the region:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 1: Perform Full Table Compaction During Off-Peak Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Query the minimum and maximum keys of the table (the keys come from TiKV and are already in memcomparable format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min(START_KEY) as START_KEY, max(END_KEY) as END_KEY from information_schema.tikv_region_status where db_name='' and table_name=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Convert the &lt;strong&gt;start and end keys&lt;/strong&gt; to escaped format using tikv-ctl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;----compact write cf----
   tiup ctl:v7.5.0 tikv --host "127.0.0.1:20160" compact --bottommost force -c write --from "zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372" --to "t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370"
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write --from zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372 --to t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370
store:"127.0.0.1:20160" compact db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 183, 95, 114, 128, 0, 0, 0, 0, 255, 69, 67, 29, 0, 0, 0, 0, 0, 250], [116, 128, 0, 0, 0, 0, 0, 0, 255, 186, 0, 0, 0, 0, 0, 0, 0, 248]) success!
  ---If the above doesn’t work, try compact default cf---
  tiup ctl:v7.1.1 tikv --host IP:port compact --bottomost force -c default --from 'zr\000\000\001\000\000\000\000\373' --to 'zt\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. Use tikv-ctl to perform compaction, adding a z prefix to the converted string, and compact both write-cf and default-cf (on all TiKV instances):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;----compact write cf----
   tiup ctl:v7.5.0 tikv --host "127.0.0.1:20160" compact --bottommost force -c write --from "zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372" --to "t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370"
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write --from zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372 --to t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370
store:"127.0.0.1:20160" compact db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 183, 95, 114, 128, 0, 0, 0, 0, 255, 69, 67, 29, 0, 0, 0, 0, 0, 250], [116, 128, 0, 0, 0, 0, 0, 0, 255, 186, 0, 0, 0, 0, 0, 0, 0, 248]) success!
  ---If the above doesn’t work, try compact default cf---
  tiup ctl:v7.1.1 tikv --host IP:port compact --bottommost force -c default --from 'zr\000\000\001\000\000\000\000\373' --to 'zt\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: As we learned earlier when cleaning up &lt;strong&gt;write_type::DEL&lt;/strong&gt;, when DELETE is the latest version it becomes the only remaining version of the key to delete, and it must reach the lowest level of RocksDB before it can be fully cleaned. Therefore, we generally need to run the compaction command at least twice to reclaim physical space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: RocksDB compaction requires temporary space. If the TiKV instance doesn’t have sufficient temporary space, it’s recommended to use Method 2 to split the compaction pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 2: For large tables, to reduce the performance impact on the cluster, compact region by region instead of the entire table:&lt;/strong&gt;&lt;br&gt;
1. Query all regions of the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from information_schema.tikv_region_status where db_name='' and table_name=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. For all regions in the table, run the following commands on their respective TiKV replicas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use tikv-ctl to query the MVCC properties of the current region. If &lt;strong&gt;mvcc.num_deletes&lt;/strong&gt; and &lt;strong&gt;writecf.num_deletes&lt;/strong&gt; are small, the region has already been processed, and you can skip to the next region.
&lt;/li&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tikv-ctl --host tikv-host:20160. region-properties -r  {region-id}
-- example-- 
 tiup ctl:v7.5.0 tikv --host  "127.0.0.1:20160" region-properties -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 region-properties -r 20026
mvcc.min_ts: 440762314407804933
mvcc.max_ts: 447448067356491781
mvcc.num_rows: 2387047
mvcc.num_puts: 2454144
mvcc.num_deletes: 9688
mvcc.num_versions: 2464879
mvcc.max_row_versions: 952
writecf.num_entries: 2464879
writecf.num_deletes: 0
writecf.num_files: 3
writecf.sst_files: 053145.sst, 061055.sst, 057591.sst
defaultcf.num_entries: 154154
defaultcf.num_files: 1
defaultcf.sst_files: 058164.sst
region.start_key: 7480000000000000ff545f720380000000ff0000000403800000ff0000000004038000ff0000000006a80000fd
region.end_key: 7480000000000000ff545f720380000000ff0000000703800000ff0000000002038000ff0000000002300000fd
region.middle_key_by_approximate_size: 7480000000000000ff545f720380000000ff0000000503800000ff0000000009038000ff0000000005220000fdf9ca5f5c3067ffc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use tikv-ctl to manually compact the current region. After completion, continue looping to check whether the region’s properties have changed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c write --region {region-id}
  tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c default --region {region-id}
 --example--
 tiup ctl:v7.5.0 tikv --host  "127.0.0.1:20160" compact --bottommost force -c write -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write -r 20026
store:"127.0.0.1:20160" compact_region db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 4, 3, 128, 0, 0, 255, 0, 0, 0, 0, 4, 3, 128, 0, 255, 0, 0, 0, 0, 6, 168, 0, 0, 253], [122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 7, 3, 128, 0, 0, 255, 0, 0, 0, 0, 2, 3, 128, 0, 255, 0, 0, 0, 0, 2, 48, 0, 0, 253]) success!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Method 3: Before v7.1.0, you can directly disable the compaction filter and use the traditional GC method.&lt;/strong&gt;&lt;br&gt;
This approach can significantly impact the system’s read and write performance during GC, so use it with caution.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>TiKV Component GC (Physical Space Reclamation) Principles and Common Issues</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Fri, 04 Jul 2025 04:11:54 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/tikv-component-gc-physical-space-reclamation-principles-and-common-issues-3paj</link>
      <guid>https://dev.to/tidbcommunity/tikv-component-gc-physical-space-reclamation-principles-and-common-issues-3paj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This article was written by Shirly Wu, Support Escalation Engineer at PingCAP.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In TiDB, the GC worker uploads the GC safepoint to the PD server, and all TiKV instances regularly fetch the GC safepoint from PD. If there is any change, the TiKV instances will use the latest GC safepoint to start the local GC process. In this article, we will focus on the principles of TiKV-side GC and discuss some common issues.&lt;/p&gt;

&lt;h1&gt;
  
  
  GC Key in TiKV
&lt;/h1&gt;

&lt;p&gt;During GC, TiDB clears all locks before the GC safepoint across the entire cluster using the resolve-locks mechanism. This means that by the time GC work reaches TiKV, all transaction statuses before the GC safepoint have already been resolved and no locks remain; the transaction statuses can be retrieved from the primary key of the distributed transaction. At this point, we can confidently and safely delete old-version data.&lt;/p&gt;

&lt;p&gt;So, how exactly does the deletion process work?&lt;/p&gt;

&lt;p&gt;Let’s look at the following example:&lt;/p&gt;

&lt;p&gt;The current key has four versions, written in the following sequence:&lt;/p&gt;

&lt;p&gt;1:00 — New write or update, data stored in the default CF&lt;br&gt;
2:00 — Update, data stored in the default CF&lt;br&gt;
3:00 — Delete&lt;br&gt;
4:00 — New write, data stored in the default CF&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2qxx2mm1iggf7m8ra6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2qxx2mm1iggf7m8ra6k.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the GC safepoint is &lt;strong&gt;2:30&lt;/strong&gt;, which ensures snapshot consistency up to 2:30, we will retain the version read at 2:30: &lt;strong&gt;key_02:00 =&amp;gt; (PUT, 01:30)&lt;/strong&gt;. All earlier versions will be deleted, including &lt;strong&gt;key_01:00&lt;/strong&gt;; both the write-cf and default-cf entries of that transaction are removed.&lt;/p&gt;

&lt;p&gt;What if the GC safepoint is 3:30?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mzvzezifucorwwi9247.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mzvzezifucorwwi9247.png" alt="Image description" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, the version read at 3:30, &lt;strong&gt;key_03:00 =&amp;gt; DELETE&lt;/strong&gt;, will be retained. All versions before 3:00 will be deleted.&lt;/p&gt;

&lt;p&gt;We see that the snapshot at 3:30 indicates that the transaction at key_03:00 is deleting this key. So, is it necessary to retain the MVCC version at 03:00? No, it is not. Therefore, under normal circumstances, if the GC safepoint is 3:30, the data that needs to be garbage collected for this key will be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnblrknbdz9u1a007ppkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnblrknbdz9u1a007ppkp.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how the GC process works for a specific key — TiKV’s GC process requires scanning all the keys on the current TiKV instance and deleting the old versions that meet the GC criteria.&lt;/p&gt;
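&lt;p&gt;The per-key rule can be sketched as follows (illustrative Python, not TiKV’s actual gc_keys; timestamps are the example’s clock times written as integers):&lt;/p&gt;

```python
# Sketch of per-key GC: keep everything after the safepoint plus the newest
# version at or before it -- unless that newest retained version is a DELETE,
# which can be dropped as well.
def gc_key(versions, safepoint):
    """versions: {commit_ts: write_type}; returns the commit_ts to keep."""
    keep = []
    newer_handled = False
    for ts in sorted(versions, reverse=True):
        if ts > safepoint:
            keep.append(ts)
        elif not newer_handled:
            newer_handled = True
            if versions[ts] != "DELETE":
                keep.append(ts)   # latest PUT at/before the safepoint stays
        # all older versions are removed
    return keep

# The example key: 1:00 PUT, 2:00 PUT, 3:00 DELETE, 4:00 PUT (minutes as ints)
versions = {100: "PUT", 200: "PUT", 300: "DELETE", 400: "PUT"}
print(gc_key(versions, 230))  # safepoint 2:30 keeps 4:00, 3:00 and 2:00
print(gc_key(versions, 330))  # safepoint 3:30 keeps only 4:00
```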
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;gc_keys&lt;/strong&gt; can create read pressure on the system since it needs to scan all versions of the current key to determine whether old versions should be deleted. Related monitoring can be found in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC scan write/default details&lt;/strong&gt;, which records the pressure on RocksDB’s write/default CF during GC worker execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi08qbar3kcn6l4ixnov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi08qbar3kcn6l4ixnov.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The duration of gc_keys can be seen in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC tasks duration&lt;/strong&gt;. If there is high latency in this area, it indicates significant GC pressure or that the read/write pressure on the system is affecting the GC process.&lt;/p&gt;
&lt;h1&gt;
  
  
  GC in TiKV
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaengscirfwze2lawnle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaengscirfwze2lawnle.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  GC Worker
&lt;/h1&gt;

&lt;p&gt;Each TiKV instance has a GC worker thread responsible for handling specific GC tasks. The GC worker in TiKV mainly handles the following types of requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GC_keys&lt;/strong&gt;: This involves scanning and deleting old versions of a specific key that meet the GC criteria. The detailed process was described in the first chapter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC(range)&lt;/strong&gt;: This involves using GC_keys on a specified range of keys, performing GC on each key individually within the range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;unsafe-destroy-range&lt;/strong&gt;: This is the direct physical cleanup of a continuous range of data, corresponding to operations like truncate/drop table/partition mentioned in the previous article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently, the GC worker has two key configurations, which cannot be adjusted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thread count&lt;/strong&gt;: Currently, the GC worker has only one thread. This is hardcoded, and no external configuration is provided.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC_MAX_PENDING_TASKS&lt;/strong&gt;: This is the maximum number of tasks the GC worker queue can handle, set to 4096.&lt;/li&gt;
&lt;/ul&gt;
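&lt;p&gt;Conceptually, the GC worker’s queue behaves like a bounded queue that stops accepting work once the limit is reached. A minimal sketch, assuming a reject-on-full policy (this helper is illustrative, not TiKV’s implementation):&lt;/p&gt;

```python
from queue import Queue

GC_MAX_PENDING_TASKS = 4096  # the limit stated in the text

def submit(task_queue, task):
    """Illustrative sketch: accept a GC task only while fewer than
    GC_MAX_PENDING_TASKS are already pending; otherwise reject it."""
    if task_queue.qsize() >= GC_MAX_PENDING_TASKS:
        return False          # queue full: the task is rejected
    task_queue.put(task)
    return True
```

&lt;p&gt;With a single worker thread draining this queue, a sustained burst of gc_keys tasks beyond the limit is simply not admitted, which is one reason GC progress can stall under heavy load.&lt;/p&gt;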
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;GC tasks QPS/duration can be monitored in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC tasks / GC tasks duration&lt;/strong&gt;. If the GC tasks duration is high, check whether the GC worker’s CPU resources are sufficient, in combination with the QPS.&lt;/p&gt;

&lt;p&gt;GC worker CPU usage: &lt;strong&gt;tikv-details -&amp;gt; thread CPU -&amp;gt; GC worker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtvz0x86cm8xd1lx5ypa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtvz0x86cm8xd1lx5ypa.png" alt="Image description" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  GC Manager
&lt;/h1&gt;

&lt;p&gt;The GC manager is the thread responsible for driving the GC work in TiKV. The main steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syncing the GC safepoint to local memory&lt;/li&gt;
&lt;li&gt;Globally guiding the execution of specific GC tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Syncing GC Safepoint to Local&lt;/strong&gt;&lt;br&gt;
The GC manager requests the latest GC safepoint from PD every ten seconds and refreshes the safepoint held in memory. The related monitoring can be found in &lt;strong&gt;tikv-details -&amp;gt; GC&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuornk2im91o3muj97wp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftuornk2im91o3muj97wp.png" alt="Image description" width="791" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Issues&lt;/strong&gt;&lt;br&gt;
When monitoring shows that the TiKV auto GC safepoint is stuck for a long time and not advancing, it indicates that the GC state on the TiDB side may have a problem. At this point, you need to follow the troubleshooting steps in the previous article to investigate why GC on the TiDB side is stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Implementing GC Jobs&lt;/strong&gt;&lt;br&gt;
If the GC manager detects that the GC safepoint has advanced, it begins implementing the specific GC tasks based on the current system configuration. This part is mainly divided into two methods based on the &lt;strong&gt;gc.enable-compaction-filter&lt;/strong&gt; parameter:&lt;/p&gt;

&lt;p&gt;a. Traditional GC, where GC(range) is called for each region.&lt;/p&gt;

&lt;p&gt;b. GC via compaction filter (the default method since v5.0): instead of performing a real GC, this method waits until RocksDB performs compaction, at which point old versions are reclaimed by the compaction filter.&lt;/p&gt;
&lt;h1&gt;
  
  
  TiKV GC Implementation Methods
&lt;/h1&gt;

&lt;p&gt;Next, we will explain the principles and common troubleshooting steps for these two GC methods.&lt;/p&gt;
&lt;h2&gt;
  
  
  GC by Region (Traditional GC)
&lt;/h2&gt;

&lt;p&gt;In traditional GC, when &lt;strong&gt;gc.enable-compaction-filter&lt;/strong&gt; is set to false, the GC manager performs GC on each TiKV region one by one. This process is referred to as “GC a round.”&lt;/p&gt;
&lt;h3&gt;
  
  
  GC a Round
&lt;/h3&gt;

&lt;p&gt;In traditional GC, a round of GC is completed when GC has been executed on all regions, and we mark the progress as 100%. If GC progress never reaches 100%, it indicates that GC pressure is high, and the physical space reclamation process is being affected. GC progress can be monitored in &lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; tikv-auto-gc-progress&lt;/strong&gt;. We can also observe the time taken for each round of GC in TiKV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32szgo5zfhbqqmttv9y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32szgo5zfhbqqmttv9y7.png" alt="Image description" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After introducing the concept of a ‘GC round,’ let’s now look at how TiKV defines a round during execution.&lt;/p&gt;

&lt;p&gt;In simple terms, the GC manager starts GC work from the first region, continuing until the last region’s GC work is completed. However, if there is too much GC work, the GC safepoint may advance before the GC reaches the last region. In this case, do we continue using the old safepoint or the new one?&lt;/p&gt;

&lt;p&gt;The answer is to use the latest GC safepoint in real time for each region that follows. This is a simple optimization of traditional GC.&lt;/p&gt;

&lt;p&gt;Here’s an &lt;strong&gt;example&lt;/strong&gt; of how TiKV processes an updated GC safepoint during GC execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qenkuqyrupilxxohoqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qenkuqyrupilxxohoqq.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When GC starts, gc-safepoint is 10, and we use safepoint=10 for GC regions 1 and 2.&lt;/li&gt;
&lt;li&gt;After finishing GC in region 2, the GC safepoint advances to 20. From this point on, we use 20 to continue GC in the remaining regions.&lt;/li&gt;
&lt;li&gt;Once all regions have been GC’d with gc safepoint=20, we start again from the first region, now using gc safepoint=20 for GC.&lt;/li&gt;
&lt;/ol&gt;
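&lt;p&gt;The three steps above can be sketched as a loop that re-reads the safepoint before each region (an illustrative sketch, not TiKV’s actual code):&lt;/p&gt;

```python
def gc_a_round(regions, get_safepoint, gc_region):
    """Sketch of the 'GC a round' loop: the latest safepoint is
    re-read before each region, so a mid-round safepoint advance is
    picked up by the remaining regions instead of the stale value."""
    used = []
    for region in regions:
        sp = get_safepoint()       # always use the freshest safepoint
        gc_region(region, sp)
        used.append((region, sp))
    return used
```

&lt;p&gt;If the safepoint advances from 10 to 20 after region 2, regions 3 and onward are GC’d with 20, matching the example in the figure.&lt;/p&gt;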
&lt;h3&gt;
  
  
  Common Issues
&lt;/h3&gt;

&lt;p&gt;In traditional GC, all old versions must be scanned before being deleted, and the resulting delete operations are then written back to RocksDB, so the system is heavily impacted. This manifests in the following ways:&lt;/p&gt;

&lt;p&gt;1. Impact on GC Progress:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The GC worker becomes a bottleneck. Since all reclamation tasks need to be handled by the GC worker, which only has one thread, its CPU usage will be fully occupied. You can monitor this using: tikv-details -&amp;gt; thread CPU -&amp;gt; GC worker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Impact on Business Read/Write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raftstore read/write pressure increases: The GC worker needs to scan all data versions and then delete the matching ones during GC_keys tasks.&lt;/li&gt;
&lt;li&gt;The increased write load on RocksDB in the short term causes rapid accumulation of L0 files, triggering RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3. Physical Space Usage Increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since a DELETE operation in RocksDB ultimately results in writing a new version of the current key, physical space usage might actually increase.&lt;/li&gt;
&lt;li&gt;Only after RocksDB compaction completes will the physical space be reclaimed, and this compaction requires temporary space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the business cannot tolerate these impacts, the workaround is to enable the gc.enable-compaction-filter parameter.&lt;/p&gt;
&lt;h1&gt;
  
  
  GC with Compaction Filter
&lt;/h1&gt;

&lt;p&gt;As mentioned earlier, in traditional GC we scan TiKV’s MVCC entries one by one, use the safepoint to determine which data can be deleted, and send a DELETE key operation to Raftstore (RocksDB). Because RocksDB uses an LSM-tree architecture with its own internal MVCC mechanism, old versions of data are not immediately deleted when new data is written; even during a DELETE operation, they are retained alongside the new data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Compaction in RocksDB?
&lt;/h2&gt;

&lt;p&gt;Let’s explore the RocksDB architecture to understand the compaction mechanism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw01jagfu1oknw8iaknu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flw01jagfu1oknw8iaknu.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RocksDB uses an LSM tree structure to improve write performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RocksDB Write Flow&lt;/strong&gt;&lt;br&gt;
When RocksDB receives a write operation (PUT(key =&amp;gt; value)), the complete process is as follows:&lt;/p&gt;

&lt;p&gt;1. When a new key is written, it is first written to the WAL and memtable, and then a success response is returned.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RocksDB appends the data directly to the WAL file, persisting it locally.&lt;/li&gt;
&lt;li&gt;The data is written to the active memtable, which is fast because it operates in memory; the data in the memtable is kept ordered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. As more data is written, the memtable gradually fills up. When it becomes full, the active memtable is marked as immutable, and a new active memtable is created to store new writes.&lt;/p&gt;

&lt;p&gt;3. The data from the immutable memtable is flushed to a local file, which we call an SST file.&lt;/p&gt;

&lt;p&gt;4. Over time, more SST files are created. Note that SST files flushed directly from memtables are internally sorted, but their key ranges span the entire key space. These SST files are placed in Level 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RocksDB Read Flow&lt;/strong&gt;&lt;br&gt;
When a read request is received, the lookup follows this sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search in the memtable. If the key is found, return the value.&lt;/li&gt;
&lt;li&gt;Search the SST files, going through the block cache (which caches SST file data). Since the SST files in Level 0 are the most recent, they are searched first. Because Level 0 SST files are flushed directly from memtables, their key ranges overlap and span the whole key space, so in the worst case every Level 0 file must be checked one by one.&lt;/li&gt;
&lt;/ol&gt;
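&lt;p&gt;The write and read flows above can be condensed into a toy LSM sketch (purely illustrative; real RocksDB involves WAL files, immutable memtables, bloom filters, and multi-level compaction):&lt;/p&gt;

```python
class MiniLSM:
    """Toy sketch of the LSM write/read flow described above:
    writes go to a WAL and an in-memory memtable; a full memtable is
    frozen and flushed as an L0 'SST' (here just a sorted dict);
    reads check the memtable first, then L0 files newest-to-oldest."""

    def __init__(self, memtable_limit=2):
        self.wal = []             # append-only write-ahead log
        self.memtable = {}        # active memtable
        self.l0 = []              # Level 0 SSTs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))          # 1. persist to the WAL first
        self.memtable[key] = value             # 2. then write the memtable
        if len(self.memtable) >= self.memtable_limit:
            # memtable is full: freeze it and flush it as an L0 "SST"
            self.l0.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # 1. check the memtable first
            return self.memtable[key]
        for sst in reversed(self.l0):          # 2. then L0 SSTs, newest first
            if key in sst:
                return sst[key]
        return None                            # not found
```

&lt;p&gt;Notice that an updated key lives in both the memtable and an older L0 file at once; only compaction will eventually merge them and drop the stale copy, which is exactly the hook the compaction filter exploits.&lt;/p&gt;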

&lt;p&gt;&lt;strong&gt;Compaction for Improved Read Performance&lt;/strong&gt;&lt;br&gt;
From the above read flow, we can see that if we only have SST data in Level 0, as more and more files accumulate in Level 0, RocksDB’s read performance will degrade. To improve read performance, RocksDB performs merge sorting on the SST files in Level 0, a process known as compaction. The main tasks of RocksDB compaction are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Merging multiple SST files into one.&lt;/li&gt;
&lt;li&gt;Keeping only the latest MVCC version from RocksDB’s perspective (GC).&lt;/li&gt;
&lt;li&gt;Pushing data down the tree into Level 1 through Level 6.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;From the above process, we can see that RocksDB’s compaction work is similar to what we do during GC. So, is it possible to combine TiKV’s GC process with RocksDB’s compaction? Of course, it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combining TiKV GC with Compaction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RocksDB provides an interface for compaction filters, which allows us to define rules for filtering keys during the compaction process. Based on the rules we provide in the compaction filter, we can decide whether to discard the key during this phase.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Principle
&lt;/h2&gt;

&lt;p&gt;Let’s take TiKV’s GC as an example to understand how data is reclaimed during the compaction process when the compaction filter is enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TiKV’s Compaction Filter Only Affects Write-CF&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, TiKV’s compaction filter only affects write-cf. Why? Because write-cf stores MVCC data, while default-cf stores the actual data. As for lock-cf, as we’ve discussed on the TiDB side, after the GC safepoint is updated to PD, there are no more locks in lock-cf before the safepoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Directly Filtering Unnecessary MVCC Keys in Compaction&lt;/strong&gt;&lt;br&gt;
Next, let’s look at how an MVCC key a behaves during a compaction process:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2v2x9poh6z7s7occde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foj2v2x9poh6z7s7occde.png" alt="Image description" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The TiKV MVCC key &lt;strong&gt;a&lt;/strong&gt; in write-cf has a commit_ts suffix. Initially, let’s assume we are compacting two SST files on the left, which are in Level 2 and Level 3. After compaction, these files will be stored in Level 3.&lt;/li&gt;
&lt;li&gt;The SST file in Level 2 contains &lt;strong&gt;a_90&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The SST file in Level 3 contains &lt;strong&gt;a_85, a_82,&lt;/strong&gt; and &lt;strong&gt;a_80&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The current GC safepoint is 89. Based on the GC key processing rules from Chapter 1, we know we need to retain the older version a_85, meaning all versions before a_85 can be deleted.&lt;/li&gt;
&lt;li&gt;From the right, we can see that the new SST file only contains a_85 and a_90. The other versions, along with the corresponding data in default-cf, are deleted during compaction.&lt;/li&gt;
&lt;/ol&gt;
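&lt;p&gt;The filtering rule in this example can be sketched as follows (an illustrative sketch, not TiKV’s actual filter; write_type::DEL needs the extra handling covered later and is not modeled here):&lt;/p&gt;

```python
def compaction_filter(commit_tss, safepoint):
    """Sketch of the write-cf compaction-filter rule for one user key.
    commit_tss lists the versions seen in this compaction, newest
    first. Keep every version newer than the safepoint, plus the
    newest one at or before it; filter out the rest."""
    kept, latest_seen = [], False
    for ts in commit_tss:
        if ts > safepoint:
            kept.append(ts)
        elif not latest_seen:
            kept.append(ts)    # a snapshot at the safepoint still reads this
            latest_seen = True
        # else: filtered, i.e. physically dropped during compaction
    return kept
```

&lt;p&gt;Running it on the figure’s versions (a_90, a_85, a_82, a_80) with safepoint 89 keeps exactly a_90 and a_85, matching the resulting SST on the right.&lt;/p&gt;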

&lt;p&gt;In summary, compared to traditional GC, using a compaction filter for GC has the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It eliminates the need to read from RocksDB.&lt;/li&gt;
&lt;li&gt;The deletion (write) process for RocksDB is simplified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although compaction introduces some pressure, it completely removes the impact of GC on RocksDB’s read/write operations, resulting in a significant performance optimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compaction on Non-L6 SST Files with Write_type::DEL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdh3rx64bkc1o20u2ug3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdh3rx64bkc1o20u2ug3.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When compaction occurs, if we encounter the latest version of data as &lt;strong&gt;write_type::DEL&lt;/strong&gt;, should we delete it directly? &lt;strong&gt;The answer is no.&lt;/strong&gt; Unlike the gc_keys interface, which scans all versions of the key, compaction only scans the versions within the SST files involved in the current compaction. Therefore, if we delete &lt;strong&gt;write_type::DEL&lt;/strong&gt; at the current level during compaction, there might still be older versions of the key at lower levels. For example, if we delete &lt;strong&gt;a_85 =&amp;gt; write_type::DEL&lt;/strong&gt; during this compaction, when users read the snapshot at &lt;strong&gt;gc_safepoint=89&lt;/strong&gt;, &lt;strong&gt;a_85&lt;/strong&gt; would be missing, and the latest matching version would be &lt;strong&gt;a_78&lt;/strong&gt;, which breaks the correctness of the snapshot at &lt;strong&gt;gc_safepoint=89&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Write_type::DEL in Compaction Filter&lt;/strong&gt;&lt;br&gt;
As we know from the gc_keys section, &lt;strong&gt;write_type::DEL&lt;/strong&gt; is a special case. Does it require special handling when the compaction filter is enabled? Yes. First, we need to consider when &lt;strong&gt;write_type::DEL&lt;/strong&gt; data can be deleted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c4m4yufexvehymbacjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5c4m4yufexvehymbacjt.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When compacting the lowest-level SST files, if we find that the current key meets the following conditions, we can safely reclaim this version using &lt;strong&gt;gc_keys(a,89)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The current key is the latest version before the gc_safepoint 89.&lt;/li&gt;
&lt;li&gt;The current key is in Level 6, and it is the only remaining version. This ensures there are no earlier versions (and ensures no additional writes are generated during gc_keys).&lt;/li&gt;
&lt;/ul&gt;
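&lt;p&gt;The two conditions above can be sketched as a single predicate (an illustrative sketch; the parameter names are hypothetical, not TiKV’s):&lt;/p&gt;

```python
def can_reclaim_del_via_gc_keys(write_type, commit_ts, safepoint,
                                level, bottommost_level, remaining_versions):
    """Sketch of when a write_type::DEL can be handed to gc_keys during
    bottommost compaction: it must be at or before the safepoint, sit
    at the bottommost level, and be the key's only remaining version,
    so that no older versions hide at deeper levels and gc_keys writes
    nothing beyond the single tombstone."""
    return (write_type == "DEL"
            and commit_ts <= safepoint
            and level == bottommost_level
            and remaining_versions == 1)
```

&lt;p&gt;Any version failing these checks is left alone by the filter and handled by a later compaction instead.&lt;/p&gt;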

&lt;p&gt;After this compaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The new &lt;strong&gt;SST&lt;/strong&gt; file will still include a &lt;strong&gt;write_type::DEL&lt;/strong&gt; version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gc_keys&lt;/strong&gt; will write a &lt;strong&gt;(DELETE, a_85)&lt;/strong&gt; operation to &lt;strong&gt;RocksDB&lt;/strong&gt;. &lt;strong&gt;This is the only way to generate a tombstone for write-cf with the compaction filter enabled&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Related Configuration
&lt;/h2&gt;

&lt;p&gt;As we’ve discussed, when the compaction filter is enabled, most physical data reclamation is completed during the RocksDB write CF compaction. For each key:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For tso &amp;gt; gc safepoint: the key is retained and skipped.&lt;/li&gt;
&lt;li&gt;For tso &amp;lt;= gc safepoint: the latest version is retained, and older versions are filtered out.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4faauhmye8qo7qmw669.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4faauhmye8qo7qmw669.png" alt="Image description" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the next question is: Since the GC process (physical space reclamation) relies heavily on RocksDB’s compaction, how can we stimulate RocksDB’s compaction process?&lt;/p&gt;

&lt;p&gt;In addition to automatic triggers, TiKV also runs a thread that periodically checks the status of each region and decides whether to trigger compaction based on the presence of old versions in the region. Currently, we offer the following &lt;a href="https://docs.pingcap.com/tidb/stable/tikv-configuration-file/" rel="noopener noreferrer"&gt;parameters&lt;/a&gt; to control the speed of region checks and whether a region should initiate compaction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-check-interval&lt;/strong&gt;: Generally, this does not need adjustment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-check-step&lt;/strong&gt;: Generally, this does not need adjustment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-min-tombstones&lt;/strong&gt;: The number of tombstones required to trigger RocksDB compaction. Default: 10,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-tombstones-percent&lt;/strong&gt;: The percentage of tombstones required to trigger RocksDB compaction. Default: 30%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-min-redundant-rows&lt;/strong&gt; (introduced in v7.1.0): The number of redundant MVCC data rows needed to trigger RocksDB compaction. Default: 50,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region-compact-redundant-rows-percent&lt;/strong&gt; (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;
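&lt;p&gt;Assuming these settings live under the &lt;code&gt;[raftstore]&lt;/code&gt; section of the TiKV configuration file (verify names and defaults against the parameter reference linked above for your version), a fragment adjusting them might look like this; values not stated in the text are illustrative only:&lt;/p&gt;

```toml
# Illustrative tikv.toml fragment; check the TiKV configuration
# reference for your version before applying.
[raftstore]
region-compact-check-interval = "5m"        # pace of background region checks
region-compact-check-step = 100             # regions examined per check
region-compact-min-tombstones = 10000       # default stated above
region-compact-tombstones-percent = 30      # default stated above
# v7.1.0 and later only:
region-compact-min-redundant-rows = 50000   # default stated above
region-compact-redundant-rows-percent = 20
```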

&lt;p&gt;Notably, versions prior to v7.1.0 lack redundant MVCC version detection, so on those versions we must manually compact regions to trigger the first round of compaction.&lt;/p&gt;
&lt;h2&gt;
  
  
  Related Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;tikv-details -&amp;gt; GC -&amp;gt; GC in Compaction-filter:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpst7hj6wmoiqkj2puiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpst7hj6wmoiqkj2puiy.png" alt="Image description" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key field definitions: during the compaction filter process, keys are counted according to the following conditions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cvq0lwow4mhe7wv1h24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cvq0lwow4mhe7wv1h24.png" alt="Image description" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If it is before the GC safepoint and not the latest version (a-v1, a-v5, b-v5):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;filtered&lt;/strong&gt;: The number of old versions directly filtered (physically deleted, with no new writes) by the compaction filter, representing the effective reclamation of old version data. If this metric is empty, there are no old versions to reclaim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;orphan-version&lt;/strong&gt;: After an old version is deleted from write-cf, the corresponding data in default-cf needs to be cleaned up. If that cleanup fails, it is handled via GcTask::OrphanVersions. If there is data here, it indicates that there were too many versions to delete, overloading RocksDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Latest version before the GC safepoint (the oldest version to retain, a-v10, b-v12):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rollback/lock&lt;/strong&gt;: Write types of Rollback/Lock, which will create a tombstone in RocksDB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_met&lt;/strong&gt;: Write type = DELETE, and it’s the lowest-level SST file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_handled&lt;/strong&gt;: Data with write type = DELETE reclaimed through gc_keys().&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_wasted&lt;/strong&gt;: Data reclaimed through gc_keys() that was already deleted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvcc_deletion_wasted + mvcc_deletion_handled&lt;/strong&gt;: The number of keys with type = DELETE that, at the bottommost level, only have one version.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Common Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem of Physical Space Not Being Released Long-Term After Deleting Data Using Compaction-filter&lt;/strong&gt;&lt;br&gt;
Although the compaction filter GC can directly clean up old data during the compaction stage and alleviate GC pressure, we know from the above principles that relying on RocksDB’s compaction for data reclamation means that if RocksDB’s compaction is not triggered, physical space will not be released even after the GC safepoint has passed.&lt;/p&gt;

&lt;p&gt;Therefore, stimulating RocksDB compaction becomes crucial in such cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround 1: Adjusting Compaction Frequency via Parameters&lt;/strong&gt;&lt;br&gt;
For large DELETE operations, since no new writes occur after the data is deleted, it becomes more difficult to stimulate RocksDB’s automatic compaction.&lt;/p&gt;

&lt;p&gt;In versions 7.1.0 and later, we can adjust the parameters related to MVCC redundant versions to stimulate RocksDB compaction, with key parameters including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;region-compact-min-redundant-rows&lt;/code&gt; (introduced in v7.1.0): The number of redundant MVCC data rows required to trigger RocksDB compaction. Default: 50,000.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;region-compact-redundant-rows-percent&lt;/code&gt; (introduced in v7.1.0): The percentage of redundant MVCC data rows required to trigger RocksDB compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In versions before 7.1.0, this feature was not available, so these versions require manual compaction. However, adjusting the above two parameters in such cases has minimal effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently, there are no parameters that stimulate compaction by counting deleted MVCC versions, so in most cases, this data cannot be automatically reclaimed. We can track this issue: &lt;a href="https://github.com/tikv/tikv/issues/17269" rel="noopener noreferrer"&gt;GitHub issue #17269&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workaround 2: Manual Compaction&lt;/strong&gt;&lt;br&gt;
If we’ve performed a large DELETE operation on a table, and the deletion has passed the GC lifetime, we can quickly reclaim physical space by manually compacting the region:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 1: Perform Full Table Compaction During Off-Peak Hours&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Query the minimum and maximum keys for the table (the key comes from TiKV and has already been converted to memcomparable format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select min(START_KEY) as START_KEY, max(END_KEY) as END_KEY from information_schema.tikv_region_status where db_name='' and table_name=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Convert the &lt;em&gt;start and end keys&lt;/em&gt; to escaped format using tikv-ctl:&lt;br&gt;
&lt;/p&gt;




&lt;p&gt;3. Use tikv-ctl to perform compaction, adding a z prefix to the converted string, and compact both write-cf and default-cf (for all TiKV instances):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;----compact write cf----
   tiup ctl:v7.5.0 tikv --host "127.0.0.1:20160" compact --bottommost force -c write --from "zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372" --to "t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370"
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write --from zt\200\000\000\000\000\000\000\377\267_r\200\000\000\000\000\377EC\035\000\000\000\000\000\372 --to t\200\000\000\000\000\000\000\377\272\000\000\000\000\000\000\000\370
store:"127.0.0.1:20160" compact db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 183, 95, 114, 128, 0, 0, 0, 0, 255, 69, 67, 29, 0, 0, 0, 0, 0, 250], [116, 128, 0, 0, 0, 0, 0, 0, 255, 186, 0, 0, 0, 0, 0, 0, 0, 248]) success!
  ---If the above doesn’t work, try compacting the default cf---
  tiup ctl:v7.1.1 tikv --host IP:port compact --bottommost force -c default --from 'zr\000\000\001\000\000\000\000\373' --to 'zt\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: As we learned earlier when cleaning up writeType::DELETE, when DELETE is the latest version, this version becomes the only version of the key to delete. It needs to be compacted to the lowest level of RocksDB before it can be fully cleaned. Therefore, we generally need to run the compaction command at least twice to reclaim physical space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: RocksDB compaction requires temporary space. If the TiKV instance doesn’t have sufficient temporary space, it’s recommended to use Method 2 to split the compaction pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 2: For large tables, to reduce the performance impact on the cluster, compact region by region instead of the entire table at once:&lt;/strong&gt;&lt;br&gt;
1. Query all regions of the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from information_schema.tikv_region_status where db_name='' and table_name=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. For all regions in the table, run the following commands on their respective TiKV replicas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use tikv-ctl to query the MVCC properties of the current region. If both mvcc.num_deletes and writecf.num_deletes are small, the region has already been processed and you can skip to the next one.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tikv-ctl --host tikv-host:20160. region-properties -r  {region-id}
-- example-- 
 tiup ctl:v7.5.0 tikv --host  "127.0.0.1:20160" region-properties -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 region-properties -r 20026
mvcc.min_ts: 440762314407804933
mvcc.max_ts: 447448067356491781
mvcc.num_rows: 2387047
mvcc.num_puts: 2454144
mvcc.num_deletes: 9688
mvcc.num_versions: 2464879
mvcc.max_row_versions: 952
writecf.num_entries: 2464879
writecf.num_deletes: 0
writecf.num_files: 3
writecf.sst_files: 053145.sst, 061055.sst, 057591.sst
defaultcf.num_entries: 154154
defaultcf.num_files: 1
defaultcf.sst_files: 058164.sst
region.start_key: 7480000000000000ff545f720380000000ff0000000403800000ff0000000004038000ff0000000006a80000fd
region.end_key: 7480000000000000ff545f720380000000ff0000000703800000ff0000000002038000ff0000000002300000fd
region.middle_key_by_approximate_size: 7480000000000000ff545f720380000000ff0000000503800000ff0000000009038000ff0000000005220000fdf9ca5f5c3067ffc1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
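&lt;p&gt;If you script this check, the region-properties output above is straightforward to parse. A minimal Python sketch (the helper names are mine, not part of tikv-ctl):&lt;/p&gt;

```python
def parse_region_properties(output):
    """Parse the 'key: value' lines printed by tikv-ctl region-properties into a dict."""
    props = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            props[key.strip()] = value.strip()
    return props

def delete_counters(props):
    """Return (mvcc.num_deletes, writecf.num_deletes) as integers, defaulting to 0."""
    return (int(props.get("mvcc.num_deletes", "0")),
            int(props.get("writecf.num_deletes", "0")))
```

&lt;p&gt;If both counters are small, the region has already been processed and the compaction step can be skipped.&lt;/p&gt;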



&lt;ul&gt;
&lt;li&gt;Use tikv-ctl to manually compact the current region. After completion, continue looping to check whether the region’s properties have changed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c write --region {region-id}
  tiup ctl:v7.5.0 tikv --pd IP:port compact --bottommost force -c default --region {region-id}
 --example--
 tiup ctl:v7.5.0 tikv --host  "127.0.0.1:20160" compact --bottommost force -c write -r 20026
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v7.5.0/ctl tikv --host 127.0.0.1:20160 compact --bottommost force -c write -r 20026
store:"127.0.0.1:20160" compact_region db:Kv cf:write range:[[122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 4, 3, 128, 0, 0, 255, 0, 0, 0, 0, 4, 3, 128, 0, 255, 0, 0, 0, 0, 6, 168, 0, 0, 253], [122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 84, 95, 114, 3, 128, 0, 0, 0, 255, 0, 0, 0, 7, 3, 128, 0, 0, 255, 0, 0, 0, 0, 2, 3, 128, 0, 255, 0, 0, 0, 0, 2, 48, 0, 0, 253]) success!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
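&lt;p&gt;The query-then-compact loop above can also be scripted. The following Python sketch is not part of the original runbook: it assumes tiup is on the PATH, the ctl version and TiKV host are placeholders, and the region IDs come from the tikv_region_status query in step 1:&lt;/p&gt;

```python
import subprocess

# Placeholders: adjust the ctl version and TiKV host to your cluster.
CTL_PREFIX = ["tiup", "ctl:v7.5.0", "tikv", "--host", "127.0.0.1:20160"]

def build_compact_cmds(region_id):
    """Build the write-cf and default-cf compaction commands for one region."""
    base = CTL_PREFIX + ["compact", "--bottommost", "force"]
    return [base + ["-c", cf, "-r", str(region_id)] for cf in ("write", "default")]

def compact_regions(region_ids, run=subprocess.run):
    # Compact region by region so the load is spread out over time.
    for rid in region_ids:
        for cmd in build_compact_cmds(rid):
            run(cmd, check=True)
```

&lt;p&gt;After each region is compacted, re-run the region-properties check described above to confirm that its delete counters have dropped.&lt;/p&gt;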



&lt;p&gt;&lt;strong&gt;Method 3: Before v7.1.0, you can directly disable the compaction filter and use the traditional GC method.&lt;/strong&gt;&lt;br&gt;
This approach can significantly impact the system’s read and write performance during GC, so use it with caution.&lt;/p&gt;

</description>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>TiDB’s Chat2Query: Instant Business Insights, No SQL Required</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Tue, 15 Apr 2025 10:56:26 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/tidbs-chat2query-instant-business-insights-no-sql-required-1l8i</link>
      <guid>https://dev.to/tidbcommunity/tidbs-chat2query-instant-business-insights-no-sql-required-1l8i</guid>
      <description>&lt;p&gt;What if you could interact with your data just like you would with a colleague? No need for complicated SQL queries or advanced data analysis tools—just ask your data a question and get an immediate, clear answer. Sounds like magic, right? Well, it’s not. It’s TiDB’s Chat2Query, and it’s changing the game in how businesses explore and understand their data.&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive into how Chat2Query works, the Text2SQL magic behind it, and why it’s been so successful at transforming the way businesses get insights. If you missed our &lt;a href="https://www.pingcap.com/blog/ai-powered-data-exploration-unpacking-the-latest-innovations-in-tidb-cloud/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, where we discussed Chat2Query’s recent breakthroughs and its performance on industry benchmarks (86.30 on the Spider benchmark and Top 4 on the BIRD benchmark), we recommend checking that out for more context.&lt;/p&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Chat2Query?
&lt;/h2&gt;

&lt;p&gt;Forget the jargon. Forget the complicated queries. Chat2Query lets you ask questions in plain language and get answers straight from your data—instantly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Want to know, “What were our sales last quarter?”&lt;/li&gt;
&lt;li&gt;Curious about, “Which product category performed the best?”&lt;/li&gt;
&lt;li&gt;Or maybe you’re wondering, “What are the trends in customer complaints this month?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just type your question in simple language, and Chat2Query will translate it into SQL, pull the data, and serve it up in an easy-to-read format, complete with visuals. It’s as if your data speaks directly to you—no SQL required.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does Chat2Query Work?
&lt;/h2&gt;

&lt;p&gt;It’s surprisingly simple—and powerful.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Enrich the Data Context
&lt;/h3&gt;

&lt;p&gt;First, Chat2Query needs to get familiar with your data. To do this, it analyzes your database using both relational and vector databases. This hybrid approach allows Chat2Query to understand the structure of the database and the relationships between tables, columns, and entities. The vector database is key for enriching the context by storing more complex, high-dimensional data that helps Chat2Query better understand the relationships between data points. The more Chat2Query knows about the data, the more accurate and insightful the responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Ask Your Questions
&lt;/h3&gt;

&lt;p&gt;Once the data context is enriched, you can start asking questions. Chat2Query then converts your question into a SQL query, pulls the data, and gives you your answer—usually with a handy chart or graph to make it even easier to digest.&lt;/p&gt;
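&lt;p&gt;To make the idea concrete, here is a deliberately generic Text2SQL sketch in Python. This is not PingCAP’s implementation; the prompt wording and the sample schema are illustrative only:&lt;/p&gt;

```python
def build_text2sql_prompt(schema_ddl, question):
    """Assemble a minimal Text2SQL prompt: schema context plus the user's question."""
    return (
        "You are a SQL assistant. Using only the schema below, "
        "answer the question with a single SQL query.\n\n"
        f"Schema:\n{schema_ddl}\n\n"
        f"Question: {question}\nSQL:"
    )

# The resulting prompt is what gets sent to the LLM; the reply is then
# executed against the database and rendered as a table or chart.
prompt = build_text2sql_prompt(
    "CREATE TABLE sales (id INT, amount DECIMAL, sold_at DATE);",
    "What were our sales last quarter?",
)
```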

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnkqycyr9fl7n2a2byri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnkqycyr9fl7n2a2byri.png" alt="Image description" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chat2Query Works So Well
&lt;/h2&gt;

&lt;p&gt;For Chat2Query to function effectively, it needs to understand the database schema. That’s where “Understand db” comes in. It’s like giving Chat2Query a roadmap of your data—helping it understand relationships between tables, columns, and entities.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: This step alone can improve SQL accuracy by 2-3% on benchmarks like &lt;a href="https://yale-lily.github.io/spider" rel="noopener noreferrer"&gt;Spider&lt;/a&gt;. A small increase, but a big deal when you’re dealing with large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh59eafpmwglwdvn5xqog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh59eafpmwglwdvn5xqog.png" alt="Image description" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Perfecting Prompts: Prompt Engineering
&lt;/h2&gt;

&lt;p&gt;You can’t just throw any question at a system and expect it to work perfectly. So, we engineer our prompts to ensure Chat2Query really nails it. By using advanced techniques like Chain of Thought (COT) and Retrieval Augmented Generation (RAG), we guide the system to reason through the question step-by-step, ensuring the most accurate SQL possible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: Our COT + RAG combo helps Chat2Query consistently crush benchmarks like Spider and BIRD. It’s the secret sauce behind those impressive scores.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0mx73pbwhzdinrqp7sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0mx73pbwhzdinrqp7sj.png" alt="Image description" width="800" height="874"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning with Post-Processing
&lt;/h2&gt;

&lt;p&gt;Even with advanced technologies like LLMs, sometimes hallucinations can occur. That’s why we use a multi-agent collaboration mechanism during post-processing. This system works like a team of experts reviewing the raw SQL output, identifying any inconsistencies, and refining the query to improve its accuracy.&lt;/p&gt;

&lt;p&gt;The multi-agent collaboration mechanism helps ensure that even if a model misses something, other components of the system step in to catch those issues and make necessary adjustments. This added layer of refinement significantly enhances the reliability of the SQL queries produced by Chat2Query.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters&lt;/strong&gt;: This post-processing mechanism, along with the multi-agent system, increases the overall accuracy of SQL queries by 2-4%. It helps reduce errors and ensure that the output is stable, consistent, and ready for actionable insights.&lt;/li&gt;
&lt;/ul&gt;
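&lt;p&gt;As a toy illustration only (the actual multi-agent system is not public), post-processing can be pictured as a pipeline of independent checkers that must all pass before a candidate query is executed:&lt;/p&gt;

```python
def check_is_select(sql):
    """Reject anything that is not a read-only SELECT."""
    return sql.lstrip().lower().startswith("select")

def check_known_tables(sql, known_tables):
    """Reject SQL that references no table we know about (a common hallucination)."""
    lowered = sql.lower()
    return any(t in lowered for t in known_tables)

def review_sql(sql, known_tables):
    # Each "agent" runs its own check; all must agree before the SQL is executed.
    checks = [check_is_select(sql), check_known_tables(sql, known_tables)]
    return all(checks)
```

&lt;p&gt;A real system would add richer reviewers (syntax validation, dry-run execution, result sanity checks), but the pattern is the same: independent checks catching each other’s misses.&lt;/p&gt;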

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwhnjqxizlub725k6vby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwhnjqxizlub725k6vby.png" alt="Image description" width="800" height="819"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases: How Chat2Query Powers Smarter Decisions
&lt;/h2&gt;

&lt;p&gt;Now, let’s talk about how Chat2Query is making a real difference for businesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sales Performance&lt;/strong&gt;: Ask “How much did our sales increase this month compared to last month?” and instantly get the data you need to adjust your sales strategy. No more waiting on reports or pulling data manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Insights&lt;/strong&gt;: Quickly discover what your customers are saying. Ask “What were the most common customer complaints this month?” to identify areas that need attention and improve your service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply Chain Optimization&lt;/strong&gt;: Stay on top of your inventory by asking “Which products are below the safety stock level?” or “Why did we experience delays last week?” to adjust your supply chain strategy on the fly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial Reporting&lt;/strong&gt;: Need to know what caused fluctuations in the market? Simply ask, “What were the main factors driving market changes this week?” and get detailed insights to stay ahead of the competition.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Looking Ahead: The Future of Chat2Query
&lt;/h2&gt;

&lt;p&gt;Chat2Query is still evolving, and we’re continuously working to improve it. We’re exploring new techniques and optimizations to make Text2SQL even more accurate and powerful. Our goal is to keep simplifying the way businesses interact with their data, making complex analysis accessible to anyone, regardless of their technical background.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with Chat2Query Today
&lt;/h2&gt;

&lt;p&gt;Ready to see the magic of Chat2Query for yourself? It’s time to make your data work harder for you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://auth.tidbcloud.com/login?state=hKFo2SA2VlhpQm1sLWVTbm8yVm9Gem00NHJRNUxVWWZ3bzFwRKFupWxvZ2luo3RpZNkgZ1N2YUxVR3diWnBCbWY5UHp4TlhkN19LVmN0QldFZ3GjY2lk2SA2SVp0aENmbVJLSVBFblFTVDhhRGJ0TTdTR2RNbmlSbA&amp;amp;client=6IZthCfmRKIPEnQST8aDbtM7SGdMniRl&amp;amp;protocol=oauth2&amp;amp;response_type=token%20id_token&amp;amp;redirect_uri=https%3A%2F%2Ftidbcloud.com%2Fauth_redirect%3Fprev%3D%2F%3F__hstc%3D86493575.154e04c1757264ebd3e2cfaa1f6f56f1.1744713902415.1744713902415.1744713902415.1%26__hssc%3D86493575.10.1744711174273%26__hsfp%3D1159537931&amp;amp;scope=openid%20email&amp;amp;nonce=gpJ~QEp8~D1APQrcfB7mM5jVTfic~C2L&amp;amp;auth0Client=eyJuYW1lIjoiYXV0aDAuanMiLCJ2ZXJzaW9uIjoiOS4xOS4xIn0%3D" rel="noopener noreferrer"&gt;Try TiDB Cloud for Free&lt;/a&gt;&lt;/strong&gt;: Experience the power of Chat2Query with TiDB Cloud and see how simple data analysis can be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule a Demo&lt;/strong&gt;: Want to learn more? &lt;a href="https://www.pingcap.com/contact-us/" rel="noopener noreferrer"&gt;Book a session&lt;/a&gt; with one of our experts to see how Chat2Query can transform your business.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>rag</category>
      <category>sql</category>
    </item>
    <item>
      <title>Migrating Vector Data from Milvus to TiDB</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Tue, 17 Dec 2024 08:35:37 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/migrating-vector-data-from-milvus-to-tidb-og0</link>
      <guid>https://dev.to/tidbcommunity/migrating-vector-data-from-milvus-to-tidb-og0</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was written by caiyfc, a dedicated TiDB Cloud Serverless user and TiDB Community Moderator.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I’ve been exploring the use of vector databases to build Retrieval-Augmented Generation (RAG) applications, successfully implementing a setup with Milvus, Llama 3, Ollama, and LangChain. After obtaining a TiDB Cloud Serverless credit through an event, I decided to migrate the vector data from Milvus to TiDB Cloud Serverless.&lt;/p&gt;

&lt;p&gt;Upon researching, I discovered that existing migration tools currently do not support transferring data from Milvus to TiDB. However, this doesn’t mean migration is impossible. While existing tools do not facilitate this process, a manual migration is feasible. This article outlines my approach to achieving this.&lt;/p&gt;

&lt;p&gt;For more information about obtaining free TiDB Cloud Serverless credit, visit the &lt;a href="https://ossinsight.io/open-source-heroes/" rel="noopener noreferrer"&gt;event page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Plan
&lt;/h2&gt;

&lt;p&gt;To perform data migration, the first step is to determine the migration plan. The simplest migration consists of two steps: exporting data from the source database and importing it into the target database, thus completing the data migration.&lt;/p&gt;

&lt;p&gt;However, this case is different. The RAG application utilizes LangChain, and based on research, the structure created by LangChain in Milvus differs from that in TiDB.&lt;/p&gt;

&lt;p&gt;In Milvus, the collection name is: LangChainCollection, with the structure being:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fhro47tqah3dt4301w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fhro47tqah3dt4301w6.png" alt="Image description" width="625" height="777"&gt;&lt;/a&gt;&lt;br&gt;
However, in TiDB, the table name is &lt;strong&gt;langchain_vector&lt;/strong&gt;, and the structure is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbldkz7uxm87kg8q7txhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbldkz7uxm87kg8q7txhg.png" alt="Image description" width="780" height="220"&gt;&lt;/a&gt;&lt;br&gt;
The documentation for LangChain also provides details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfd2mhyllnuampxtygth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfd2mhyllnuampxtygth.png" alt="Image description" width="800" height="382"&gt;&lt;/a&gt;&lt;br&gt;
Therefore, the data migration will require two additional steps: data preparation and table structure adjustment. Since these are heterogeneous databases, the exported data format chosen is the more universal CSV.&lt;/p&gt;

&lt;p&gt;The overall plan is as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdt1ro9tr1389b7fn0g5q.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdt1ro9tr1389b7fn0g5q.PNG" alt="Image description" width="800" height="83"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Exporting Data from Milvus
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import csv
from pymilvus import connections, Collection

# Connect to Milvus
connections.connect("default", host="10.3.xx.xx", port="19530")

# Get the Collection
collection = Collection("LangChainCollection")

# Paginate through all the data
limit = 1000
offset = 0
all_results = []

while True:
    # An empty expr matches all entities; page through them with limit and offset
    results = collection.query(expr="", output_fields=["pk", "source", "page", "text", "vector"], limit=limit, offset=offset)
    if not results:
        break
    all_results.extend(results)
    offset += limit

# Open the CSV file, prepare to write data
with open("milvus_data.csv", "w", newline="", encoding='utf-8') as csvfile:
    # Define the CSV column names
    fieldnames = ["pk", "source", "page", "text", "vector"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    # Write the header
    writer.writeheader()

    # Write each record
    for result in all_results:
        # Parse JSON data and extract fields
        vector_str = ','.join(map(str, result.get("vector", [])))  # Convert the vector array to a string
        writer.writerow({
            "pk": result.get("pk"),          # Get the primary key
            "source": result.get("source"),  # Get the source file
            "page": result.get("page"),      # Get the page number
            "text": result.get("text"),      # Get the text
            "vector": vector_str             # Write the vector data
        })

print(f"Total records written to CSV: {len(all_results)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The format of the exported CSV file data is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzmkttndakc81zgeygjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzmkttndakc81zgeygjr.png" alt="Image description" width="800" height="123"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Data Preparation and Table Structure Adjustment
&lt;/h2&gt;

&lt;p&gt;I converted a small amount of test data into vectors and used LangChain to load it into TiDB Cloud Serverless. This process facilitated obtaining the data structure and format within TiDB Cloud Serverless.&lt;/p&gt;

&lt;p&gt;The table structure is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE `langchain_vector` (
`id` varchar(36) NOT NULL,
`embedding` vector(512) NOT NULL COMMENT 'hnsw(distance=cosine)',
`document` text DEFAULT NULL,
`meta` json DEFAULT NULL,
`create_time` datetime DEFAULT CURRENT_TIMESTAMP,
`update_time` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`) /*T![clustered_index] CLUSTERED */
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exported data format in CSV is as follows (partial content omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"id","embedding","document","meta","create_time","update_time"
"081574ec-b3e4-481b-b0c7-9a789080d160","[-0.4020949,-0.6850993,******,0.16776393,-0.049104385]","forecasting and disaster mitigation. The organization is committed to advancing scienti\0c\nknowledge and improving public safety and well-being through its work.\nFor further information, please contact:","{\"page\": 5, \"source\": \"./WMO report highlights growing shortfalls and stress in global water resources.pdf\"}","2024-10-22 02:41:49","2024-10-22 02:41:49"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given the CSV file exported from Milvus, the relationship is clear: "embedding" corresponds to "vector," "document" corresponds to "text," and "meta" corresponds to "page" and "source." This logic clarifies the mapping. A data preparation script based on this relationship is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import json
from uuid import uuid4
from datetime import datetime

# Read the CSV file
input_csv = 'milvus_data.csv'  # Replace with your CSV file name
df = pd.read_csv(input_csv)

# Create a new DataFrame
output_data = []

for _, row in df.iterrows():
    # Extract required fields
    id_value = str(uuid4())  # Generate a unique ID
    embedding = f"[{row['vector']}]"  # Wrap the comma-separated vector string in brackets
    document = row['text']

    # Generate meta information
    meta_dict = {"page": row['page'], "source": row['source']}
    meta = json.dumps(meta_dict, ensure_ascii=False)  # First generate standard JSON
    # meta = meta.replace('"', '\\"')  # Escape double quotes if needed

    create_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    update_time = create_time  # Same time for update

    # Add to output data
    output_data.append({
        "id": id_value,
        "embedding": embedding,
        "document": document,
        "meta": meta,
        "create_time": create_time,
        "update_time": update_time
    })

# Convert to DataFrame
output_df = pd.DataFrame(output_data)

# Save as CSV file
output_csv = 'output.csv'  # Output file name
output_df.to_csv(output_csv, index=False, quoting=1)  # quoting=1 ensures strings are quoted

print(f"Conversion completed, saved as {output_csv}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the data preparation is complete, the data can be imported into TiDB Cloud Serverless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Importing Data to TiDB Cloud Serverless
&lt;/h2&gt;

&lt;p&gt;TiDB Cloud Serverless provides three import methods:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhilwtsfjcudj06em4rg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhilwtsfjcudj06em4rg9.png" alt="Image description" width="800" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, we will use the "Upload a local file" method, which accepts CSV files smaller than 50 MiB. If the file exceeds 50 MiB, a script can be used to split the file into smaller chunks before uploading:&lt;/p&gt;
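&lt;p&gt;For example, a splitting script might look like the following Python sketch (not from the original article). It splits by row count rather than bytes, so choose rows_per_chunk small enough that each chunk stays under 50 MiB for your data:&lt;/p&gt;

```python
import csv

def split_csv(path, rows_per_chunk=50000):
    """Split a CSV into numbered chunks, repeating the header in each chunk."""
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        chunk_path = f"{path}.part{i // rows_per_chunk}.csv"
        with open(chunk_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            writer.writerows(rows[i:i + rows_per_chunk])
        chunks.append(chunk_path)
    return chunks
```

&lt;p&gt;Each chunk keeps the header row, so every file can be uploaded and mapped independently.&lt;/p&gt;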

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9hu7te7qd1t41sofwhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9hu7te7qd1t41sofwhn.png" alt="Image description" width="753" height="486"&gt;&lt;/a&gt;&lt;br&gt;
After uploading the file, select the previously created database and table, and click &lt;code&gt;define table&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlhku5xprmdi6ygrhqtv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlhku5xprmdi6ygrhqtv.png" alt="Image description" width="732" height="507"&gt;&lt;/a&gt;&lt;br&gt;
Adjust the mappings as needed, then click &lt;code&gt;start import&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkdx2tshm3vh2b86od0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkdx2tshm3vh2b86od0y.png" alt="Image description" width="800" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more import methods, you can refer to the documentation: &lt;a href="https://docs.pingcap.com/tidbcloud/tidb-cloud-migration-overview" rel="noopener noreferrer"&gt;Migration and Import Overview&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Validation of Results
&lt;/h2&gt;

&lt;p&gt;After successfully importing the data, the next step is to validate it. I modified the code for the RAG application to read vector data from both Milvus and TiDB. Using the same question, I queried the large model to return answers and checked whether the answers were similar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain import hub
from langchain.chains import RetrievalQA
from langchain.vectorstores.milvus import Milvus
from langchain_community.embeddings.jina import JinaEmbeddings
from langchain_community.vectorstores import TiDBVectorStore
import os

llm = Ollama(
    model="llama3",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    stop=["&amp;lt;|eot_id|&amp;gt;"],
)


embeddings = JinaEmbeddings(jina_api_key="xxxx", model_name="jina-embeddings-v2-small-en")

vector_store_milvus = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": "http://10.3.xx.xx:19530"},
)


TIDB_CONN_STR="mysql+pymysql://xxxx.root:password@host:4000/test?ssl_ca=/Downloads/isrgrootx1.pem&amp;amp;ssl_verify_cert=true&amp;amp;ssl_verify_identity=true"
vector_store_tidb = TiDBVectorStore(
    connection_string=TIDB_CONN_STR,
    embedding_function=embeddings,
    table_name="langchain_vector",
)


os.environ["LANGCHAIN_API_KEY"] = "xxxx"
query = input("\nQuery: ")
prompt = hub.pull("rlm/rag-prompt")   

qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=vector_store_milvus.as_retriever(), chain_type_kwargs={"prompt": prompt}
)
print("milvus")
result = qa_chain({"query": query})

print("\n--------------------------------------")
print("tidb")
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=vector_store_tidb.as_retriever(), chain_type_kwargs={"prompt": prompt}
)
result = qa_chain({"query": query})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The connection string for TiDB can be obtained directly from TiDB Cloud Serverless:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13zk38io7e97f0s701yr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13zk38io7e97f0s701yr.png" alt="Image description" width="733" height="667"&gt;&lt;/a&gt;&lt;br&gt;
After posing the question to the RAG application and reviewing the responses, I found that the answers from Milvus and TiDB were essentially consistent, indicating that the vector migration was successful. It is also advisable to compare the number of data entries in Milvus and TiDB tables; if they match, the migration should be considered successful.&lt;/p&gt;
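The entry-count comparison can be scripted. Below is a minimal sketch, assuming the pymilvus and pymysql client libraries, LangChain's default "LangChainCollection" collection name on the Milvus side (adjust if yours differs), and the placeholder connection details from the script above.

```python
def counts_match(source_count: int, target_count: int) -> bool:
    # The row-count check passes only when every entry arrived in TiDB.
    return source_count == target_count


def check_migration() -> bool:
    # Third-party clients are imported lazily so the helper above stays stdlib-only.
    import pymysql
    from pymilvus import Collection, connections

    # Entry count on the Milvus side.
    connections.connect(uri="http://10.3.xx.xx:19530")
    milvus_count = Collection("LangChainCollection").num_entities

    # Entry count in the table LangChain created on the TiDB side.
    conn = pymysql.connect(host="host", port=4000, user="xxxx.root",
                           password="password", database="test")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM langchain_vector")
        tidb_count = cur.fetchone()[0]
    return counts_match(milvus_count, tidb_count)
```

A count mismatch usually means some batches failed to import, so it is worth checking before comparing query answers.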

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz6wqeuhuz6jyj8snlwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faz6wqeuhuz6jyj8snlwe.png" alt="Image description" width="800" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Data migration between different databases is fundamentally about converting the data into a universal format that all databases can recognize, including vector data. This migration from Milvus to TiDB Cloud Serverless differs from traditional relational database migrations. Although the RAG application utilizes LangChain, the table structures and data formats created by LangChain vary across different databases. Therefore, additional organization of the data and table structures is required to successfully migrate to the target database. Fortunately, TiDB Cloud Serverless offers various convenient data import methods, making the migration process relatively straightforward.&lt;/p&gt;

</description>
      <category>database</category>
      <category>vectordatabase</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>TiDB Future App Hackathon 2024</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Fri, 19 Jul 2024 08:49:32 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/tidb-future-app-hackathon-2024-3pab</link>
      <guid>https://dev.to/tidbcommunity/tidb-future-app-hackathon-2024-3pab</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghof9f6t071u8zhmbyvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghof9f6t071u8zhmbyvf.png" alt="banner" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;We’re excited to announce another edition of &lt;a href="https://tidbhackathon2024.devpost.com/?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=blog_hackathon" rel="noopener noreferrer"&gt;TiDB Future App Hackathon&lt;/a&gt;!  Our first Hackathon last year was an amazing success with almost 1,500 participants from 87 countries (see the &lt;a href="https://www.pingcap.com/blog/celebrating-the-successful-tidb-future-app-hackathon-2023" rel="noopener noreferrer"&gt;recap blog&lt;/a&gt; for more details) and we’re looking forward to an even bigger event this year.&lt;/p&gt;

&lt;p&gt;Similar to last year, we will have over US$30k in prizes. Beyond prizes, we want the Hackathon to give participants opportunities to expand their networks and collaborate with like-minded developers while working with the latest technologies. &lt;/p&gt;

&lt;h2&gt;
  
  
  What’s New This Year
&lt;/h2&gt;

&lt;p&gt;During this year’s Hackathon, participants will have an opportunity to develop new applications leveraging the new &lt;a href="https://www.pingcap.com/ai/?utm_source=owned_event&amp;amp;utm_medium=virtual_event&amp;amp;utm_campaign=hackathon24_05_cm_p1_devto" rel="noopener noreferrer"&gt;Vector Search feature on TiDB Serverless&lt;/a&gt;.  We have seen amazing growth in AI and Machine Learning applications over the past few years, and we’re excited to see what TiDB community members will build to leverage our new Vector Search feature. &lt;/p&gt;

&lt;h2&gt;
  
  
  What to Create
&lt;/h2&gt;

&lt;p&gt;With TiDB Serverless, the possibilities are endless for creating an innovative AI application. To help you kickstart your AI project, we've identified a few categories you might consider.&lt;br&gt;
Remember, these are just a few ideas to spark your imagination – the sky's the limit!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ask.pingcap.com/t/sample-applications-built-with-tidb-serverless-tidb-hackathon-2024/938" rel="noopener noreferrer"&gt;Sample applications&lt;/a&gt; built with TiDB Serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0n2p27mzgj0ll4knnkz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0n2p27mzgj0ll4knnkz.jpg" alt="Hackathon Idea" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions during the Hackathon, you can drop in on our HACKATHON 2024 channels on our &lt;a href="https://discord.gg/A3Ue27s7rD" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;. Join us today by registering on our &lt;a href="https://tidbhackathon2024.devpost.com/?utm_source=devto&amp;amp;utm_medium=social&amp;amp;utm_campaign=blog_hackathon" rel="noopener noreferrer"&gt;Hackathon site&lt;/a&gt; for a chance to be one of the winners of more than $30k in prizes! &lt;/p&gt;

</description>
      <category>hackathon</category>
      <category>ai</category>
      <category>vectorsearch</category>
      <category>rag</category>
    </item>
    <item>
      <title>Easy Local Development with TiDB</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Mon, 25 Sep 2023 06:42:32 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/easy-local-development-with-tidb-41h6</link>
      <guid>https://dev.to/tidbcommunity/easy-local-development-with-tidb-41h6</guid>
      <description>&lt;p&gt;This article is written by Daniël Van Eeden.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here’s how TiDB can help you develop your application locally using the same type of distributed database platform used in production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you develop an application, you begin by coding and testing in your local environment. Many applications interface with a database, so in this early stage, you might use SQLite rather than the database brand used in production. This is an issue, however, because ideally, you want to develop the application with the production database in mind.&lt;/p&gt;

&lt;p&gt;When you use a distributed system, setting up and starting/stopping its components can become error-prone and time-consuming.&lt;/p&gt;

&lt;p&gt;In this article, I’ll explain how you can develop your application locally and use the type of database used in production. In this case, &lt;a href="https://docs.pingcap.com/tidb/stable" rel="noopener noreferrer"&gt;TiDB&lt;/a&gt;, a distributed SQL database platform that features horizontal scalability, strong consistency, and high availability, is an excellent choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Quick Overview Of TiDB
&lt;/h2&gt;

&lt;p&gt;TiDB is a relational database that is compatible with the MySQL protocol and syntax and can easily scale beyond a single machine by dividing the data over multiple machines. This also makes it resilient to machine failures.&lt;/p&gt;

&lt;p&gt;If you have seen pictures of the &lt;a href="https://docs.pingcap.com/tidb/stable/tidb-architecture" rel="noopener noreferrer"&gt;TiDB architecture&lt;/a&gt;, you know that it consists of many components, including the TiDB server; TiKV, a transactional key-value store; and the Placement Driver (PD), which manages metadata. For a production setup, you’ll need multiple instances of each component.&lt;/p&gt;

&lt;p&gt;A scalable, highly available distributed system like TiDB runs multiple components on multiple hosts. For production setups, this is not a problem as there are good tools to manage this: TiDB Operator for Kubernetes and &lt;a href="https://docs.pingcap.com/tidb/stable/tiup-component-cluster" rel="noopener noreferrer"&gt;TiUP Cluster&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When you develop and test your work, you may have a cluster with a setup similar to production. However, if you want to test against new versions of TiDB, or if your development interferes with what other developers are doing, this may not work. Also, if you are traveling, you may have no internet connection or an unreliable one.&lt;/p&gt;

&lt;p&gt;Setting up a local copy of a distributed system can be a complex task because there are multiple components and there also might be OS settings to manage. Also, running multiple instances of a component can easily result in conflicts on TCP ports or filesystem paths.&lt;/p&gt;

&lt;p&gt;Another good use of a local cluster is to test certain operations before doing them in production. This includes learning basic tasks like upgrading your application and backing up and restoring the database.&lt;/p&gt;

&lt;p&gt;So let’s look into some of the options for local development. Some of these can also be used for CI jobs where you need a database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install TiUP and Start a TiDB Playground
&lt;/h2&gt;

&lt;p&gt;TiUP is the tool for managing TiDB installations, both in production and for local development. One of the things TiUP can do is set up a playground: TiUP downloads the components you need, then configures and starts them to give you a local environment. This works on macOS, Linux, and Windows with WSL.&lt;/p&gt;

&lt;p&gt;The playground gives you a TiDB installation to work with including the TiDB Dashboard and a set of Grafana dashboards.&lt;/p&gt;

&lt;p&gt;To install TiUP run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once TiUP is installed you can start a playground:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tiup playground
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These steps are available on &lt;a href="https://tiup.io" rel="noopener noreferrer"&gt;https://tiup.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The output of the &lt;code&gt;tiup playground&lt;/code&gt; command shows how to connect to the playground with a MySQL client, along with the URLs for the TiDB Dashboard and Grafana.&lt;/p&gt;

&lt;p&gt;Additional tips for working with playgrounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can specify the version you want to run with &lt;code&gt;tiup playground v4.0.14&lt;/code&gt; for example.&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;tiup -T myapp playground&lt;/code&gt; you can put a name tag on the playground. This is useful when you use multiple playgrounds, and it makes it easier to find the playground’s data directory to inspect the logs. For example, with the “myapp” tag the data directory is &lt;code&gt;~/.tiup/data/myapp&lt;/code&gt;, with subdirectories for the various components. Setting a tag also lets you keep the data after stopping the playground.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;tiup playground&lt;/code&gt; command has options to set the number of instances per component. This can be used to set the number of TiFlash instances to 0 in case you for example don’t intend to use the HTAP functionality.&lt;/li&gt;
&lt;li&gt;You can add &lt;code&gt;--monitor=false&lt;/code&gt; if you don’t want to use monitoring tools like Grafana and Prometheus. This is useful if you want to save resources on your local machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TiDE: A Visual Studio Code Extension for TiDB
&lt;/h2&gt;

&lt;p&gt;If you use Visual Studio Code, the &lt;a href="https://marketplace.visualstudio.com/items?itemName=dragonly.ticode" rel="noopener noreferrer"&gt;TiDE&lt;/a&gt; extension lets you work with TiUP Playground, TiUP clusters, and Kubernetes clusters right from your IDE.&lt;/p&gt;

&lt;p&gt;This extension gives you a Ti icon on the sidebar of VS Code. If you click this icon you get presented with options for starting a playground and an overview of any running playgrounds. TiDE also supports clusters deployed with TiUP, vagrant, and TiDB Operator for Kubernetes. Besides starting and stopping playgrounds you can also use this to inspect the logs, change the configuration of components, and open a MySQL client session right in the terminal pane of VS Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set Up Your Containers With Docker Compose
&lt;/h2&gt;

&lt;p&gt;Use &lt;a href="https://github.com/pingcap/tidb-docker-compose" rel="noopener noreferrer"&gt;TiDB Docker Compose&lt;/a&gt; to set up your containers. This is a useful tool if you already use Docker Compose to manage the containers of the application you are developing.&lt;/p&gt;

&lt;p&gt;To set up a set of containers with this tool run the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3tf1pnf0hdbvc5qh8j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3tf1pnf0hdbvc5qh8j4.png" alt="Image description" width="690" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can connect with a MySQL client:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd27em3qz2aloxdmxr3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd27em3qz2aloxdmxr3p.png" alt="Image description" width="690" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use TiDB Operator With Minikube or Kind
&lt;/h2&gt;

&lt;p&gt;If you run TiDB on Kubernetes in production, a good option is to use minikube or kind for local development. This approach teaches you how to work with TiDB Operator. For detailed information on setting up TiDB Operator, see &lt;a href="https://docs.pingcap.com/tidb-in-kubernetes/stable/get-started" rel="noopener noreferrer"&gt;Getting Started with TiDB Operator on Kubernetes.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Option: Run a Single TiDB Instance
&lt;/h2&gt;

&lt;p&gt;You can run one instance of TiDB server without TiKV, Placement Driver, TiFlash, or any of the other components. In this case, TiDB uses unistore, a local storage engine, as a backend instead of TiKV. By default, TiDB stores data in /tmp/tidb. For TiDB server you only need a single binary, so this makes deployment easy. However, this approach is quite limited: you won’t have access to TiDB Dashboard, Grafana, or TiFlash.&lt;/p&gt;

&lt;p&gt;To download, extract, and start TiDB server, enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget -q -O - https://download.pingcap.org/tidb-v5.1.1-linux-amd64.tar.gz | tar --strip-components 2 -zxf - tidb-v5.1.1-linux-amd64/bin/tidb-server
./tidb-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
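Once tidb-server is running, any MySQL client can talk to it. Here is a minimal connectivity-check sketch, assuming the pymysql client library and tidb-server's defaults (port 4000, root user, no password).

```python
def tidb_connect_args(host: str = "127.0.0.1", port: int = 4000) -> dict:
    # Default connection settings for a freshly started local tidb-server.
    return {"host": host, "port": port, "user": "root",
            "password": "", "database": "test"}


if __name__ == "__main__":
    import pymysql  # TiDB is compatible with the MySQL protocol

    conn = pymysql.connect(**tidb_connect_args())
    with conn.cursor() as cur:
        # The version string reports both the MySQL protocol version and the TiDB release.
        cur.execute("SELECT VERSION()")
        print(cur.fetchone()[0])
```

If the query returns a version string, the single-instance server is up and reachable.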



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Having a complex distributed system doesn’t need to prevent you from doing local development in the way that fits your needs. There are multiple methods to set up local development environments.&lt;/p&gt;

&lt;p&gt;Ready to supercharge your data integration with TiDB? Join our Discord community now and connect with fellow data enthusiasts, developers, and experts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stay Informed&lt;/strong&gt;: Get the latest updates, tips, and tricks for optimizing your data integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask Questions&lt;/strong&gt;: Seek assistance and share your knowledge with our supportive community.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate&lt;/strong&gt;: Exchange experiences and insights with like-minded professionals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Resources&lt;/strong&gt;: Unlock exclusive guides and tutorials to turbocharge your data projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://discord.com/invite/ukhXbn69Nx?utm_source=article" rel="noopener noreferrer"&gt;Join us&lt;/a&gt; today and take your data integration to the next level with TiDB!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to load data from Slack to TiDB</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Mon, 11 Sep 2023 10:06:00 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/how-to-load-data-from-slack-to-tidb-1e8c</link>
      <guid>https://dev.to/tidbcommunity/how-to-load-data-from-slack-to-tidb-1e8c</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This can be done by building a data pipeline manually, usually with a Python script (you can leverage a tool such as Apache Airflow for this). This process can take more than a full week of development. Or it can be done in minutes with Airbyte in three easy steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;set up Slack as a source connector (using OAuth, or usually an API key)&lt;/li&gt;
&lt;li&gt;set up TiDB as a destination connector&lt;/li&gt;
&lt;li&gt;define which data you want to transfer and how frequently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can choose to self-host the pipeline using Airbyte Open Source or have it managed for you with Airbyte Cloud.&lt;/p&gt;

&lt;p&gt;This tutorial’s purpose is to show you how.&lt;/p&gt;
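For contrast, the "manual Python script" route might look like the hedged sketch below: pull messages from one channel via Slack's conversations.history Web API method and write them into a TiDB table. The token, channel ID, connection details, and the `slack_messages` table are all hypothetical placeholders.

```python
import json
import urllib.request


def fetch_messages(token: str, channel: str) -> list:
    # Slack's conversations.history endpoint returns one page of channel messages.
    req = urllib.request.Request(
        f"https://slack.com/api/conversations.history?channel={channel}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("messages", [])


def to_rows(messages: list) -> list:
    # Flatten each message into a (ts, user, text) tuple for insertion.
    return [(m.get("ts"), m.get("user", ""), m.get("text", "")) for m in messages]


if __name__ == "__main__":
    import pymysql  # TiDB speaks the MySQL protocol

    rows = to_rows(fetch_messages("xoxb-placeholder", "C01234567"))
    conn = pymysql.connect(host="host", port=4000, user="root",
                           password="", database="test")
    with conn.cursor() as cur:
        # REPLACE keeps re-runs idempotent, assuming ts is the primary key.
        cur.executemany(
            "REPLACE INTO slack_messages (ts, user, text) VALUES (%s, %s, %s)", rows
        )
    conn.commit()
```

Even this toy version omits pagination, rate limiting, retries, and scheduling, which is where the "more than a full week" estimate comes from.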

&lt;h2&gt;
  
  
  What is Slack
&lt;/h2&gt;

&lt;p&gt;Slack is an enterprise software platform that facilitates global communication between all sizes of businesses and teams. Slack enables collaborative work to be more efficient and more productive, making it possible for businesses to connect with immediacy from half a world apart. It allows teams to work together in concert, almost as if they were in the same room. Slack transforms the process of communication, bringing it into the 21st century with powerful style.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is TiDB
&lt;/h2&gt;

&lt;p&gt;TiDB is a distributed SQL database that is designed to handle large-scale online transaction processing (OLTP) and online analytical processing (OLAP) workloads. It is an open-source, cloud-native database that is built to be highly available, scalable, and fault-tolerant. TiDB uses a distributed architecture that allows it to scale horizontally across multiple nodes, while also providing strong consistency guarantees. It supports SQL and offers compatibility with MySQL, which makes it easy for developers to migrate their existing applications to TiDB. TiDB is used by companies such as Didi Chuxing, Mobike, and Meituan-Dianping to power their mission-critical applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;A Slack account to transfer your customer data automatically from.&lt;/li&gt;
&lt;li&gt;A TiDB account.&lt;/li&gt;
&lt;li&gt;An active Airbyte Cloud account, or you can also choose to use Airbyte Open Source locally. You can follow the instructions to set up Airbyte on your system using docker-compose.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Airbyte is an open-source data integration platform that consolidates and streamlines the process of extracting and loading data from multiple data sources to data warehouses. It offers pre-built connectors, including Slack and TiDB, for seamless data migration.&lt;/p&gt;

&lt;p&gt;When using Airbyte to move data from Slack to TiDB, it extracts data from Slack using the source connector, converts it into a format TiDB can ingest using the provided schema, and then loads it into TiDB via the destination connector. This allows businesses to leverage their Slack data for advanced analytics and insights within TiDB, simplifying the ETL process and saving significant time and resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set up Slack as a source connector
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;First, navigate to the Slack source connector page on Airbyte.com.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on the "Add Source" button to begin the process of adding your Slack credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the "Connection Configuration" section, enter a name for your Slack connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, enter your Slack workspace's API token in the "API Token" field. You can generate an API token by following the instructions in the Airbyte documentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the "Channels" field, enter the names of the Slack channels you want to sync data from. You can enter multiple channels by separating them with commas.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you want to filter the data that is synced from Slack, you can enter a date range in the "Start Date" and "End Date" fields.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once you have entered all the necessary information, click on the "Test" button to ensure that your credentials are valid and that Airbyte can connect to your Slack workspace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the test is successful, click on the "Save &amp;amp; Continue" button to save your Slack connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can now use your Slack source connector to sync data from your Slack workspace to your destination of choice.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 2: Set up TiDB as a destination connector
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;First, navigate to the Airbyte website and log in to your account.&lt;/li&gt;
&lt;li&gt;Once you are logged in, click on the "Destinations" tab on the left-hand side of the screen.&lt;/li&gt;
&lt;li&gt;Scroll down until you find the TiDB destination connector and click on it.&lt;/li&gt;
&lt;li&gt;You will be prompted to enter your TiDB database credentials, including the host, port, username, and password.&lt;/li&gt;
&lt;li&gt;Once you have entered your credentials, click on the "Test" button to ensure that the connection is successful.&lt;/li&gt;
&lt;li&gt;If the test is successful, click on the "Save" button to save your TiDB destination connector settings.&lt;/li&gt;
&lt;li&gt;You can now use the TiDB destination connector to transfer data from your source connectors to your TiDB database.&lt;/li&gt;
&lt;li&gt;To set up a data integration pipeline, navigate to the "Connections" tab on the left-hand side of the screen and create a new connection.&lt;/li&gt;
&lt;li&gt;Select your TiDB destination connector as the destination and choose your source connector as the source.&lt;/li&gt;
&lt;li&gt;Configure the settings for your data integration pipeline, including the frequency of data transfers and any data transformations that you want to apply.&lt;/li&gt;
&lt;li&gt;Once you have configured your data integration pipeline, click on the "Save" button to save your settings.&lt;/li&gt;
&lt;li&gt;Your data integration pipeline will now run automatically, transferring data from your source connectors to your TiDB database on a regular basis.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Set up a connection to sync your Slack data to TiDB
&lt;/h2&gt;

&lt;p&gt;Once you've successfully connected Slack as a data source and TiDB as a destination in Airbyte, you can set up a data pipeline between them with the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new connection: On the Airbyte dashboard, navigate to the 'Connections' tab and click the '+ New Connection' button.&lt;/li&gt;
&lt;li&gt;Choose your source: Select Slack from the dropdown list of your configured sources.&lt;/li&gt;
&lt;li&gt;Select your destination: Choose TiDB from the dropdown list of your configured destinations.&lt;/li&gt;
&lt;li&gt;Configure your sync: Define the frequency of your data syncs based on your business needs. Airbyte allows both manual and automatic scheduling for your data refreshes.&lt;/li&gt;
&lt;li&gt;Select the data to sync: Choose the specific Slack objects whose data you want to import into TiDB. You can sync all data or select specific tables and fields.&lt;/li&gt;
&lt;li&gt;Select the sync mode for your streams: Choose between full refreshes and incremental syncs (with deduplication if you want), either for all streams or per stream. Incremental sync is only available for streams that have a cursor field.&lt;/li&gt;
&lt;li&gt;Test your connection: Click the 'Test Connection' button to make sure that your setup works. If the connection test is successful, save your configuration.&lt;/li&gt;
&lt;li&gt;Start the sync: If the test passes, click 'Set Up Connection'. Airbyte will start moving data from Slack to TiDB according to your settings.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember, Airbyte keeps your data in sync at the frequency you determine, ensuring your TiDB data warehouse is always up-to-date with your Slack data.&lt;/p&gt;
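The full-refresh versus incremental distinction in step 6 can be illustrated with a toy sketch. This is not Airbyte's implementation, just the idea of a sync driven by a cursor field (here, a numeric "ts" timestamp).

```python
def full_refresh(records: list) -> list:
    # A full refresh re-transfers every record on each run.
    return list(records)


def incremental_sync(records: list, cursor: int) -> tuple:
    # Incremental sync transfers only records newer than the saved cursor,
    # then advances the cursor for the next run.
    new = [r for r in records if r["ts"] > cursor]
    next_cursor = max((r["ts"] for r in new), default=cursor)
    return new, next_cursor
```

Because the cursor persists between runs, each incremental sync moves only the delta, which is why it is much cheaper than a full refresh for large Slack workspaces.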

&lt;h2&gt;
  
  
  Use Cases to transfer your Slack data to TiDB
&lt;/h2&gt;

&lt;p&gt;Integrating data from Slack to TiDB provides several benefits. Here are a few use cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Analytics&lt;/strong&gt;: TiDB’s powerful data processing capabilities enable you to perform complex queries and data analysis on your Slack data, extracting insights that wouldn't be possible within Slack alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Consolidation&lt;/strong&gt;: If you're using multiple other sources along with Slack, syncing to TiDB allows you to centralize your data for a holistic view of your operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Data Analysis&lt;/strong&gt;: Slack has limits on historical data. Syncing data to TiDB allows for long-term data retention and analysis of historical trends over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Security and Compliance&lt;/strong&gt;: TiDB provides robust data security features. Syncing Slack data to TiDB ensures your data is secured and allows for advanced data governance and compliance management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: TiDB can handle large volumes of data without affecting performance, providing an ideal solution for growing businesses with expanding Slack data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Science and Machine Learning&lt;/strong&gt;: By having Slack data in TiDB, you can apply machine learning models to your data for predictive analytics, customer segmentation, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting and Visualization&lt;/strong&gt;: While Slack provides reporting tools, data visualization tools like Tableau, PowerBI, Looker (Google Data Studio) can connect to TiDB, providing more advanced business intelligence options.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;To summarize, this tutorial has shown you how to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure a Slack account as an Airbyte data source connector.&lt;/li&gt;
&lt;li&gt;Configure TiDB as a data destination connector.&lt;/li&gt;
&lt;li&gt;Create an Airbyte data pipeline that automatically moves data from Slack to TiDB on the schedule you set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ready to supercharge your data integration with TiDB? Join our Discord community now and connect with fellow data enthusiasts, developers, and experts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stay Informed&lt;/strong&gt;: Get the latest updates, tips, and tricks for optimizing your data integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ask Questions&lt;/strong&gt;: Seek assistance and share your knowledge with our supportive community.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborate&lt;/strong&gt;: Exchange experiences and insights with like-minded professionals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Resources&lt;/strong&gt;: Unlock exclusive guides and tutorials to turbocharge your data projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://discord.com/invite/ukhXbn69Nx?utm_source=article" rel="noopener noreferrer"&gt;Join us&lt;/a&gt; today and take your data integration to the next level with TiDB!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building AI Applications: Real-World Stories and Insights for Finding an Ideal Tech Stack</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Fri, 08 Sep 2023 03:55:30 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/building-ai-applications-real-world-stories-and-insights-for-finding-an-ideal-tech-stack-4f60</link>
      <guid>https://dev.to/tidbcommunity/building-ai-applications-real-world-stories-and-insights-for-finding-an-ideal-tech-stack-4f60</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiap320yghi1y9sowaw00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiap320yghi1y9sowaw00.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6c8dyp26f6bt8cyqsda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6c8dyp26f6bt8cyqsda.png" alt="Image description" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Have you ever marveled at the far-reaching impact of AI in various domains? Now, it’s your turn to turn your AI ideas into reality. The champions of TiDB Future App Hackathon 2023, true pioneers in the AI landscape, are eager to share their remarkable journey, the hurdles they conquered, and the profound insights they’ve gained while crafting AI-native applications. Their story of humble beginnings and relentless innovation is sure to inspire and guide you on your own AI journey.&lt;/p&gt;

&lt;p&gt;📅 Date: September 27th&lt;/p&gt;

&lt;p&gt;⏰ Time: 11:00 AM-12:00 PM SGT&lt;/p&gt;

&lt;p&gt;📍 Location: Virtual Meetup on Zoom&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Expect:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Winning Insights (15 mins)&lt;/strong&gt;: Uncover the secrets behind crafting award-winning AI applications from industry leaders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Innovation and Challenges (20 mins)&lt;/strong&gt;: Explore how AI is reshaping application development, offering inspiration for startups and individual developers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future AI Trends (20 mins)&lt;/strong&gt;: Dive into the future of AI applications and discover the next-gen data infrastructure for AI-driven innovation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusive Benefits Await (5 mins)&lt;/strong&gt;: Learn about our special offers, starting free and staying that way until your business achieves success. Stick around till the end for surprises!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Meet Our Esteemed Speakers:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j0p5q2dbbld9eyzo11a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j0p5q2dbbld9eyzo11a.jpeg" alt="Image description" width="724" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developer</category>
      <category>programming</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Announcing the Winners of the TiDB Future App Hackathon 2023!</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Thu, 31 Aug 2023 09:03:51 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/announcing-the-winners-of-the-tidb-future-app-hackathon-2023-27b7</link>
      <guid>https://dev.to/tidbcommunity/announcing-the-winners-of-the-tidb-future-app-hackathon-2023-27b7</guid>
      <description>&lt;p&gt;Drumroll, please! We are absolutely ecstatic to unveil the winners of the TiDB Future App Hackathon 2023! This extraordinary event brought together 1492 participants hailing from 88 countries, who unleashed their creativity and showcased their ingenuity in 100 projects.&lt;/p&gt;

&lt;p&gt;Throughout the hackathon, participants seized the opportunity to harness the power of TiDB Serverless. With a $36,000 prize pool at stake, the competition was fierce, and we were impressed by the number of innovative projects submitted. Without further ado, we’re excited to unveil the Hackathon prize winners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxyy5gerwnpw5huj486t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxyy5gerwnpw5huj486t.jpeg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1st Place: $13,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/423558-convex-ai" rel="noopener noreferrer"&gt;Convex AI &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Less Coding, More Thinking. Generate Productivity App, not beautiful junk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2nd Place: $7,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/419840-quizmefy" rel="noopener noreferrer"&gt;Quizmefy &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: An AI powered multiplayer trivia game. Our platform is revolutionizing the quiz and trivia experience by harnessing the immense potential of artificial intelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3rd Place: $3,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/422256-heuristic-ai" rel="noopener noreferrer"&gt;Heuristic AI&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Heuristic AI is a search management system that creates search pages for products. The tool allows companies to have fast and efficient direct insights into their consumers’ issues thanks to TiFlash.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4th - 7th Place: $1,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4th Place: &lt;a href="https://tidbhackathon2023.devpost.com/submissions/420616-ai-mon-artificial-intelligence-activity-observability" rel="noopener noreferrer"&gt;AI-Mon: Artificial Intelligence Activity Observability&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: AI-Mon (a.k.a. AI Monitoring) is an advanced successor to Ira (now an open-source browser extension that records cut/copy/paste events on websites), built to monitor data flows as AI-driven chatbots grow in popularity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5th Place: &lt;a href="https://tidbhackathon2023.devpost.com/submissions/421394-moyubie" rel="noopener noreferrer"&gt;moyubie&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Talk to AI and your friends in private, and get ad-free news feeds!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6th Place: &lt;a href="https://tidbhackathon2023.devpost.com/submissions/420701-briefly" rel="noopener noreferrer"&gt;Briefly &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Never miss a beat in Slack&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7th Place: &lt;a href="https://tidbhackathon2023.devpost.com/submissions/423739-comant" rel="noopener noreferrer"&gt;Comant &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Comant is a digital assistant for people with physical disabilities and speech difficulties, allowing users to communicate effectively with anyone through eye tracking and advanced AI technologies powered by TiDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Vercel | Best User Experience Award: $1,500 in usage credit to the Vercel platform
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/419840-quizmefy" rel="noopener noreferrer"&gt;Quizmefy &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: An AI powered multiplayer trivia game. Our platform is revolutionizing the quiz and trivia experience by harnessing the immense potential of artificial intelligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best AI Application Award: $1,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/423882-hacker-jobs" rel="noopener noreferrer"&gt;Hacker Jobs &lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Hacker Jobs: Your one-stop platform for tech job seekers. Leverage TiDB analytics on Hacker News data, making job search fast and efficient!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prize for Social Good: $1,500 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://tidbhackathon2023.devpost.com/submissions/423739-comant" rel="noopener noreferrer"&gt;Comant&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Project Description: Comant is a digital assistant for people with physical disabilities and speech difficulties, allowing users to communicate effectively with anyone through eye tracking and advanced AI technologies powered by TiDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Finalist Prize (60): Each of the 60 finalists will receive a TiDB Hackathon Swag.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7ttwnsclywjkoh4era2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7ttwnsclywjkoh4era2.png" alt="Image description" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Incentive Awards: Top 5 Idea-makers with the most votes: $100 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TOP 1: &lt;a href="https://discord.com/channels/1083300679386406923/1130984878561820672" rel="noopener noreferrer"&gt;Bit size learning platform, use GPT to generate and explain complex modules and topics.&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: LyonKvalid&lt;/p&gt;

&lt;p&gt;Votes: 20&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 2: &lt;a href="https://discord.com/channels/1083300679386406923/1123271883131990056" rel="noopener noreferrer"&gt;bebrah - for modern creators&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: mzir0&lt;br&gt;
Votes: 19&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 3: &lt;a href="https://discord.com/channels/1083300679386406923/1126111239785947136" rel="noopener noreferrer"&gt;UFO/UAP Sightings Database and Reporting System 1&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Mike&lt;br&gt;
Votes: 17&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 4: &lt;a href="https://discord.com/channels/1083300679386406923/1127084503811248159" rel="noopener noreferrer"&gt;Spam-Jam&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Deepak_09&lt;br&gt;
Votes: 17&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 5: &lt;a href="https://discord.com/channels/1083300679386406923/1125514147749175346" rel="noopener noreferrer"&gt;AI-Powered Cybersecurity and Translation Bot&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: ChickenMcSwag&lt;br&gt;
Votes: 15&lt;/p&gt;

&lt;h2&gt;
  
  
  Incentive Awards: Top 5 Story-tellers with the most votes: $100 USD
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TOP 1: &lt;a href="https://discord.com/channels/1083300679386406923/1134636250490482889" rel="noopener noreferrer"&gt;Once upon a time, there are three engineers…&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Alez&lt;br&gt;
Votes: 21&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 2: &lt;a href="https://discord.com/channels/1083300679386406923/1126025776802844712" rel="noopener noreferrer"&gt;In my first hackathon, my family thought I was working with terrorists.&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Deepak_09&lt;br&gt;
Votes: 18&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 3: &lt;a href="https://discord.com/channels/1083300679386406923/1138482090195812393" rel="noopener noreferrer"&gt;Mad teenager designing sth at 4 AM…?&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: mzir0&lt;br&gt;
Votes: 15&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 4: &lt;a href="https://discord.com/channels/1083300679386406923/1135012466573713449" rel="noopener noreferrer"&gt;How I went through fire but didn’t get burnt during TiDB Hackathon&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Lucky Victory&lt;br&gt;
Votes: 11&lt;/p&gt;

&lt;h3&gt;
  
  
  TOP 5: &lt;a href="https://discord.com/channels/1083300679386406923/1130460122787872798" rel="noopener noreferrer"&gt;First Hackathon I really put a lot of effort in&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Winner: Nifemi&lt;br&gt;
Votes: 11&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tidbhackathon2023.devpost.com/project-gallery" rel="noopener noreferrer"&gt;More projects&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once again, a tremendous thank you to all the participants for your enthusiasm and for making this hackathon a great success. The level of talent and innovation showcased throughout the competition was truly impressive. Your commitment to excellence and passion for technology is inspiring.&lt;/p&gt;

&lt;p&gt;We hold hackathon competitions every year, and we would be thrilled to have you participate next year and beyond. Be on the lookout for announcements, including on our &lt;a href="https://discord.com/invite/vYU9h56kAX" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, which we used during the recent Hackathon. It’s also a wonderful platform for exchanging thoughts, ideas, and knowledge with like-minded individuals who share a passion for technology and innovation.&lt;/p&gt;

</description>
      <category>hackathon</category>
      <category>tidb</category>
      <category>developer</category>
      <category>ai</category>
    </item>
    <item>
      <title>Some Attempts to Optimize the Accuracy of TiDB Bot Responses</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Wed, 02 Aug 2023 22:58:39 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/some-attempts-to-optimize-the-accuracy-of-tidb-bot-responses-3ghp</link>
      <guid>https://dev.to/tidbcommunity/some-attempts-to-optimize-the-accuracy-of-tidb-bot-responses-3ghp</guid>
      <description>&lt;p&gt;About the Author: Li Li, Product Manager at PingCAP.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;This article describes how to optimize the response accuracy of TiDB Bot, an enterprise-specific knowledge-base assistant, solving problems such as “incorrect toxicity detection”, “misunderstanding of context”, “erroneous semantic search results”, and “insufficient or outdated documentation”. In addition, an internal operation platform was established to continuously iterate on TiDB Bot. Ultimately, this continuous-operation approach reduced the dislike rate from over 50% to less than 5%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Following the method in &lt;a href="https://ask.pingcap.com/t/building-a-company-specific-user-assistance-robot-with-generative-ai/633" rel="noopener noreferrer"&gt;Building a Company-specific User Assistance Robot&lt;/a&gt; with Generative AI, I built TiDB Bot, a robot that answers customer questions based on the official documentation of TiDB and TiDB Cloud and is capable of refusing to answer questions outside its business scope.&lt;br&gt;
However, upon its initial launch, the response was less than satisfactory, with over 50% of users providing dislike feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issues During Internal Testing
&lt;/h2&gt;

&lt;p&gt;To investigate the existing problems, I ran tests and identified the following categories of dialogues where issues arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect toxicity detection: Some questions related to the company's business are refused. For instance, 'dumpling' is a data export tool for TiDB, but when asked directly 'what is dumpling?', the robot refuses to answer and advises you to consult a food expert instead.&lt;/li&gt;
&lt;li&gt;Incorrect understanding of context: In multi-round dialogues, users often ask follow-up questions about earlier content with a terse phrase like 'what is the default value for this parameter?'. Searching the vector database with the user's original text usually yields no meaningful results, so when the retrieved content is passed to GPT, it cannot provide the correct answer based on the official documentation.&lt;/li&gt;
&lt;li&gt;Incorrect semantic search results: Sometimes, the user's question is very clear, but the ranking of the content searched from the vector database is problematic. The correct document content to answer the question cannot be found in the Top N.&lt;/li&gt;
&lt;li&gt;Insufficient or outdated documentation: Although the customer's question is very clear, the official documentation is not comprehensive or current enough to cover it. As a result, GPT improvises an answer, which often turns out to be incorrect.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Missed Targets of Toxicity Detection
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Analysis
&lt;/h3&gt;

&lt;p&gt;While I employed the Few-Shot method to help GPT determine whether a user's question falls within the TiDB business scope (detailed in the section on Limiting the Response Field), the examples are always limited compared to the breadth of users' questions and perspectives. The bot cannot make accurate judgments based solely on the examples written in the system prompts, leading to missed targets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;Fortunately, the application scenarios in an enterprise are limited, so in theory users' questioning perspectives are limited as well. If I could provide GPT with all the questions users have asked, it should be able to accurately identify whether any new question belongs to the TiDB business category.&lt;br&gt;
So, how can we feed all the questions to GPT? This scenario is not unique. In the bot's initial design, it relied on official documentation to answer users' questions, but it is unrealistic to stuff all the official documentation into GPT at once. I therefore designed the system to search the vector database for relevant documents by semantic similarity. The same semantic-search capability can be used to solve this problem.&lt;br&gt;
To implement this solution, the following steps need to be accomplished:&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Preparation
&lt;/h4&gt;

&lt;p&gt;Step one: Collect all relevant questions asked online and during testing, label them for toxicity, and clean them into a format similar to the examples in the current system prompts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;instruction: {user's question} &lt;br&gt;
question: is the instruction out of scope (not related to TiDB)?&lt;br&gt;
answer: YES or NO&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step two: Following the method of Correct Answering in Sub-Domain Knowledge, import the cleaned data into the vector database so that it supports semantic search. When a user asks a question, the system searches the vector database, finds the most semantically similar examples, and provides them to the GPT model together with the question.&lt;/p&gt;
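&lt;p&gt;The two steps above can be sketched as follows. This is a minimal illustration, not TiDB Bot's actual code: the bag-of-words "embedding" and the hard-coded examples are stand-ins for a real embedding model and the collected question corpus.&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a real embedding model.
    # Only the instruction part of each example is embedded; the YES/NO
    # answer is kept as plain metadata alongside the vector.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cleaned, toxicity-labelled examples collected from testing.
examples = [
    {"instruction": "what is dumpling in tidb", "out_of_scope": "NO"},
    {"instruction": "how do I cook dumplings at home", "out_of_scope": "YES"},
    {"instruction": "how to export data with dumpling", "out_of_scope": "NO"},
]
index = [(embed(e["instruction"]), e) for e in examples]

def nearest_examples(question, top_n=2):
    # Semantic search over the example store: rank stored instructions by
    # similarity to the incoming question and return the closest labelled
    # examples, which are then appended to the toxicity-detection prompt.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [e for _, e in ranked[:top_n]]
```

&lt;p&gt;For the query "what is dumpling?", the closest stored example is the in-scope "what is dumpling in tidb", so GPT sees an in-scope precedent instead of refusing and recommending a food expert.&lt;/p&gt;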

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwq5fx4mv1iq6s8spdxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwq5fx4mv1iq6s8spdxa.png" alt="Image description" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, when the GPT model is assessing toxicity, it will reference the most relevant examples to provide the most accurate response possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion: The Similarities and Differences between Example and Domain Document Search
&lt;/h2&gt;

&lt;p&gt;The searches for examples and for domain documents both involve finding content with high semantic similarity, and both use the same vector database, the same Embedding model, the same vector length, and the same similarity function. However, there are still certain differences in their practical execution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In terms of Embedding content

&lt;ul&gt;
&lt;li&gt;When conducting a domain knowledge document search, all content within the document needs to be searched. Therefore, the document content will be split and all content needs to go through embedding, and then stored in the vector database.&lt;/li&gt;
&lt;li&gt;However, when conducting an example search, since only the instruction part is related or similar to the user's question, the instruction part of the example needs to undergo Embedding, while the answer part does not.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;In terms of splitting

&lt;ul&gt;
&lt;li&gt;Domain knowledge documents are longer and need to be split before undergoing Embedding.&lt;/li&gt;
&lt;li&gt;The examples that require embedding are all questions, none of which is especially long, so there is no need for splitting. Each can be treated as an independent chunk; this way, the final search results are individual question-and-answer examples.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
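&lt;p&gt;The splitting difference can be sketched with a simple sliding-window splitter (a hypothetical illustration, assuming the overlap is smaller than the chunk size): long documents pass through it before embedding, while example instructions skip it and are embedded whole.&lt;/p&gt;

```python
def split_into_chunks(tokens, chunk_size=200, overlap=20):
    # Sliding-window splitter for long domain documents: each window is
    # chunk_size tokens, and consecutive windows share `overlap` tokens so
    # sentences cut at a boundary survive intact in at least one chunk.
    # Example instructions are short and bypass this step entirely.
    step = max(chunk_size - overlap, 1)  # assumes overlap is below chunk_size
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

&lt;p&gt;Each resulting chunk is embedded and stored as its own row in the vector database, while an example is stored as a single chunk consisting of just its instruction text.&lt;/p&gt;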

&lt;h2&gt;
  
  
  Difficulties in Contextual Understanding
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Analysis
&lt;/h3&gt;

&lt;p&gt;Thanks to the contextual understanding capabilities of the GPT model, applications can provide continuous dialogue features. However, if the robot needs to dynamically supply relevant domain knowledge based on context, several problems usually arise.&lt;br&gt;
In multi-turn dialogues, users ask follow-up questions about previous content, such as, "What is the default value of this parameter?". If the system searches the vector database for domain knowledge using that text directly, the quality of the search results is quite poor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;The root cause of this problem is that the system fails to understand the implicit contextual semantics of human conversation. Fortunately, as mentioned earlier, the GPT model has contextual understanding. A simple solution, then, is to let GPT rewrite the user's original question before the system searches for domain knowledge, describing the user's intent as clearly as possible in one sentence. This step is known as "question revision".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ohqu5d1avhr58z730ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ohqu5d1avhr58z730ng.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
To ensure that the entire robot system sees a consistent version of the user's question and to avoid errors caused by inconsistencies, I placed the question-revision feature at the very front of the system's information flow, so user questions are revised as soon as they enter the robot.&lt;br&gt;
During revision, the robot asks the GPT model to describe the user's question intent in one sentence based on the overall dialogue context, adding as much detail as possible. This way, both toxicity detection and domain knowledge search can operate on a more specific intent.&lt;br&gt;
What if obvious errors appear in the question revision? We can use a combination of few-shot examples and semantic search to specifically correct them.&lt;/p&gt;
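&lt;p&gt;A minimal sketch of the question-revision step, assuming an OpenAI-style chat message list; the prompt wording and helper name are illustrative, not TiDB Bot's actual prompt:&lt;/p&gt;

```python
def build_revision_prompt(history, question):
    # `history` is a list of (role, text) turns from the dialogue so far.
    # The model is asked to restate the user's latest question as a single
    # self-contained sentence before toxicity detection or domain search.
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    system = (
        "Rewrite the user's last question as one self-contained sentence "
        "that fully describes their intent, filling in details from the "
        "conversation so it can be understood without any other context."
    )
    user = f"Conversation so far:\n{transcript}\n\nLast question: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

&lt;p&gt;For a follow-up like "What is the default value of this parameter?", the revised output should name the parameter explicitly, producing a standalone question that then drives both toxicity detection and the vector search.&lt;/p&gt;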

&lt;h2&gt;
  
  
  Limitations of Semantic Search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Discussion: The Process and Optimization Methods of Semantic Search
&lt;/h3&gt;

&lt;p&gt;Vector-based semantic search is the cornerstone of TiDB Bot. Without it, relying solely on the GPT model's own capacity, it would be impossible to build a robot that answers questions about a niche field.&lt;br&gt;
However, the more foundational a component is, the more important it is to understand its potential issues in order to find methods for reliable improvement.&lt;br&gt;
Overall, there are many ways to optimize the process of preparing domain knowledge data, splitting, vectorizing, and searching. Here are a few that I have tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During the data preparation stage: Clean the documents, removing images, links, and other meaningless symbols and document structures.&lt;/li&gt;
&lt;li&gt;During the splitting stage: Use different methods to split the document (by token, by natural paragraph, by symbol, etc.). After splitting, consider whether some overlap between chunks is needed and determine an appropriate amount.&lt;/li&gt;
&lt;li&gt;During the vectorization stage: Decide whether to use a proprietary or open-source embedding model, how long the vector should be, and whether it supports multi-language vectorization. If using an open-source model, decide how to fine-tune it, how to prepare the fine-tuning corpus, and how to set epochs and rounds so the model converges with high quality.&lt;/li&gt;
&lt;li&gt;During the semantic search stage: Decide which similarity algorithm works best, how much document content to retrieve to satisfy the intent, and whether the split content needs to be re-aggregated after retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The advantages of the above methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each method is a systematic solution, effective for all domain knowledge documents, without bias.&lt;/li&gt;
&lt;li&gt;The methods used in the data preparation and splitting stages generally achieve stable positive optimization, yielding higher-quality data material.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Disadvantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key optimization methods in the vectorization and semantic search stages cannot achieve stable positive optimization. The direction of optimization is unpredictable, and improving one aspect of the model's ability may weaken another.&lt;/li&gt;
&lt;li&gt;Each optimization requires a deep understanding of the relationship between the business and the optimization method. It takes repeated fine-tuning against a business test set, continual experimentation, and a deepening understanding of the fit between technology and business to have a chance of achieving relatively good results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem Analysis
&lt;/h3&gt;

&lt;p&gt;In the beta testing phase, a common problem encountered is: the user's question is clear, but the corresponding document content cannot be found in the Top N results from the vector database search. This implies that related documents to the question do exist within the system, but they just aren't being retrieved. This could be due to several possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The document is not well-written or is too obscure, making it hard to find based on semantic similarity.&lt;/li&gt;
&lt;li&gt;The embedding model needs improvement, as the vector distance between the user's query and the directly relevant domain knowledge is not the shortest.&lt;/li&gt;
&lt;li&gt;The similarity algorithm is not optimal, and other similarity algorithms could potentially address this.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solving these possible issues could take several months, and even then the improvements cannot be guaranteed. Therefore, to stably improve the output quality of semantic search, there are two direct, effective, and rapidly implementable methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, directly adjust the vector distance between the domain content and the query.&lt;/li&gt;
&lt;li&gt;Second, recall specific content examples in addition to recalling domain knowledge content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both methods can provide correct information in system prompts, but they have different pros and cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Method One:

&lt;ul&gt;
&lt;li&gt;Cons: Directly adjusting vector distance involves moving and rotating existing vectors, which could affect other user queries and disrupt the overall distribution of the domain knowledge vectors. It might also require an additional metric or function to express the new vector distance, and creating a new similarity function may not actually solve the problem.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Method Two:

&lt;ul&gt;
&lt;li&gt;Pros: Introducing new content (examples) into the system prompts does not impact the existing domain knowledge vector space, providing relative decoupling. It also offers higher flexibility, allowing rapid additions and deletions in the future.&lt;/li&gt;
&lt;li&gt;Cons: When domain knowledge is updated, the examples also need to be updated, requiring an additional process.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Considering simplicity of system maintenance and the real-time nature of optimization, I eventually chose Method Two.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;The primary method I use is a combination of examples and training the Embedding model.&lt;br&gt;
In the first step, a method similar to the one in 'The Missed Targets of Toxicity Detection' is used to add examples that target common mistakes. These examples are then provided to the GPT model along with the system prompt to improve accuracy.&lt;br&gt;
In the second step, once a sufficient number of examples has accumulated, they are used as training data for the Embedding model. This helps the Embedding model better understand the relationship between questions and domain knowledge, producing more appropriate vector results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0cn33zjc111om51z3lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0cn33zjc111om51z3lj.png" alt="Image description" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practical work, the cyclical use of the first and second steps helps to maintain the number of examples at a manageable level, and continuously promotes the improvement of the Embedding model.&lt;/p&gt;
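&lt;p&gt;The accumulation step can be sketched as follows. This is a hypothetical illustration of turning operation feedback into (question, relevant passage) pairs for fine-tuning the Embedding model; the field names are invented for the example:&lt;/p&gt;

```python
def feedback_to_training_pairs(feedback):
    # Keep only interactions where the user upvoted the answer and the
    # retrieved source passage is known; each becomes a positive
    # (question, passage) pair for contrastive fine-tuning of the
    # embedding model. Field names here are illustrative.
    pairs = []
    for item in feedback:
        if item.get("vote") == "up" and item.get("source_passage"):
            pairs.append((item["revised_question"], item["source_passage"]))
    return pairs
```

&lt;p&gt;Once enough pairs accumulate, they can be fed to an embedding-model fine-tuning pipeline (for example, a contrastive objective that pulls each question toward its passage), after which fresh examples begin accumulating again.&lt;/p&gt;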

&lt;h2&gt;
  
  
  Garbage In, Garbage Out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem Analysis
&lt;/h3&gt;

&lt;p&gt;In machine learning, one of the most famous phrases is "Garbage In, Garbage Out": if incorrect or meaningless data is fed into a model, the model will inevitably output incorrect or meaningless results. Likewise, if the domain documents are of poor quality or out of date, the quality of the answers given by the GPT model is likely to be poor as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugj7xd1kiigqq8m0kig2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fugj7xd1kiigqq8m0kig2.png" alt="Image description" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have established a process to regularly update the domain knowledge documents, and when users report errors, I submit the corresponding documents to the appropriate team to encourage updates and enrichment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Only Rule to Product Usability: Continuous Operation
&lt;/h2&gt;

&lt;p&gt;The strategies above are some of the attempts I made while optimizing TiDB Bot. They can, to a certain extent, enhance the accuracy of the bot's responses. However, reducing the dislike rate from over 50% to less than 5% required progressing step by step toward a long-term goal.&lt;br&gt;
To ensure the continuous optimization of TiDB Bot, I built an internal operation platform that makes it convenient to apply the optimization methods introduced in this article. Its core capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feedback Information Display: It presents the upvotes or downvotes from users on the replies. For downvoted information, it displays the handling logs of each node in the information flow, which is convenient for troubleshooting. &lt;/li&gt;
&lt;li&gt;Quick Addition of Examples: For each node interacting with GPT, it supports the capability to provide examples, including revising questions, toxicity detection, domain knowledge, and more. All stages can quickly supplement examples.&lt;/li&gt;
&lt;li&gt;Automatic Update of Domain Knowledge: For domain knowledge with a fixed source, such as official documents, it supports regular automatic updates of the document content in the vector database to keep the domain knowledge up to date.&lt;/li&gt;
&lt;li&gt;Data Organization for Model Iteration: It automatically organizes the training data needed for fine-tuning the Embedding model, including users' upvote information and the examples added during operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, using this operation platform, I gradually improved the accuracy over 103 days. Eventually, with the help of community test users, TiDB Bot was successfully launched.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discussion: The Choice Between Model Fine-Tuning and Continuous Operation
&lt;/h2&gt;

&lt;p&gt;The term "model fine-tuning" here refers to the method of using more domain-specific data to train models, including Embedding and GPT models, directly through fine-tuning. By contrast, "continuous operation" refers to practices similar to those described in this article, which involve leveraging more high-quality domain knowledge and examples, as well as engaging in multiple interactions with GPT to enhance the accuracy of the application.&lt;br&gt;
Many people may ask: why does this article emphasize continuous operation rather than model fine-tuning? To answer this, we first need to look at the pros and cons of both methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model Fine-Tuning Method:

&lt;ul&gt;
&lt;li&gt;Pros:

&lt;ul&gt;
&lt;li&gt;The opportunity to comprehensively improve the quality of responses in a specific domain.&lt;/li&gt;
&lt;li&gt;Once trained successfully, the need for domain knowledge at answer time decreases, saving the cost of collecting domain knowledge later on.&lt;/li&gt;
&lt;li&gt;The training cost is acceptable. As seen in the open-source community, fine-tuning a model with Low-Rank Adaptation (LoRA) takes only about 8 hours on a V100 GPU to converge.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cons:

&lt;ul&gt;
&lt;li&gt;It requires collecting and preprocessing a vast amount of high-quality domain data: Full Fine-Tuning (FFT) needs more than 100,000 corpus entries, and even Parameter-Efficient Fine-Tuning (PEFT) still needs over 50,000.&lt;/li&gt;
&lt;li&gt;The training outcome is uncertain. After training, the ability to answer domain questions improves, but general knowledge and reasoning abilities may decline, leading to inadequate reasoning on real user questions. Since fine-tuning builds on an existing base model, the result also depends on that base: a strong base model lets the fine-tuned model start from a higher point.&lt;/li&gt;
&lt;li&gt;The quality of open-source models cannot yet rival OpenAI's. Although open-source models offer a chance to reduce training costs, no academic or industrial report has so far produced one with capabilities comparable to OpenAI's.&lt;/li&gt;
&lt;li&gt;Each iteration takes a relatively long time. Every iteration (measured in months) requires one or more cycles of data preparation, training, and testing before a usable model may emerge. Data preparation in particular may take several rounds of actual training before a high-quality training dataset is ready.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;Continuous Operation Method:

&lt;ul&gt;
&lt;li&gt;Pros:

&lt;ul&gt;
&lt;li&gt;Relatively stable, positive optimization. This article adopts a systematic method to improve accuracy without depending on the randomness of model training.&lt;/li&gt;
&lt;li&gt;Fast. Optimizing the examples can reach minute-level iteration speed, which allows rapid troubleshooting when users hit problems.&lt;/li&gt;
&lt;li&gt;Economical. It only reuses the existing semantic search capability, with no additional components or extra costs.&lt;/li&gt;
&lt;li&gt;Low migration cost. The method in this article works with any chat-type GPT model, allowing quick migration: should a better open-source or commercial model appear, it can be integrated swiftly.&lt;/li&gt;
&lt;li&gt;Friendly to cold starts. Problems can be solved as they arise, without needing a large amount of training data up front.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Cons:

&lt;ul&gt;
&lt;li&gt;More frequent human intervention is required. Because the example-based method needs more human verification and supplementation, it demands more frequent intervention than model fine-tuning during product operation.&lt;/li&gt;
&lt;li&gt;Excessive content. After a period of operation, the supplemented content may become too large to manage, leading to maintenance difficulties and a decline in search accuracy.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the above, we can see that both methods have advantages and disadvantages. They are not mutually exclusive but complementary; for example, the author has fine-tuned the Embedding model. In the early stages of TiDB Bot, the author leans towards continuous operation, applying a systematic approach for stable, economical, and rapid positive optimization so that the whole team can focus on business issues. In the middle and later stages of TiDB Bot's development, model fine-tuning could be considered for further optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Holistic Logical Architecture Including Optimization Methods
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvto2wx5qa0er6h4zgnih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvto2wx5qa0er6h4zgnih.png" alt="Image description" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, we have gained the ability to continuously optimize TiDB Bot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Following up
&lt;/h2&gt;

&lt;p&gt;The TiDB Bot has been launched on &lt;a href="https://tidbcloud.com/" rel="noopener noreferrer"&gt;TiDB Cloud&lt;/a&gt;, &lt;a href="https://join.slack.com/t/tidbcommunity/shared_invite/zt-1zx8bvrqp-tfelBm_J8e1gkZ_CyOarbQ" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, and &lt;a href="https://discord.gg/wUWMTW5CV8" rel="noopener noreferrer"&gt;Discord channels&lt;/a&gt;. Everyone is welcome to use it.&lt;br&gt;
In the future, we will provide open-source tools for building applications similar to TiDB Bot, enabling everyone to quickly build their own GPT applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>tidb</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Building a Company-specific User Assistance Robot with Generative AI</title>
      <dc:creator>TiDB Community</dc:creator>
      <pubDate>Thu, 20 Jul 2023 03:21:34 +0000</pubDate>
      <link>https://dev.to/tidbcommunity/building-a-company-specific-user-assistance-robot-with-generative-ai-14d4</link>
      <guid>https://dev.to/tidbcommunity/building-a-company-specific-user-assistance-robot-with-generative-ai-14d4</guid>
      <description>&lt;p&gt;About the Author: Li Li, Product Manager at PingCAP.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;This article introduces how to use Generative AI to build a user assistance robot backed by a company-specific knowledge base. In addition to the industry's commonly used knowledge-base response method, it also attempts toxicity detection with the few-shot method. The robot has been deployed across the company's channels facing global customers, with a dislike ratio of less than 5% in actual use.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Magic of Generative AI Unveiled
&lt;/h3&gt;

&lt;p&gt;Since 2022, Generative AI (hereafter referred to as GenAI) has spearheaded a global revolution. From the buzz created by &lt;a href="https://www.midjourney.com/" rel="noopener noreferrer"&gt;Midjourney&lt;/a&gt; and &lt;a href="https://openai.com/dall-e-2" rel="noopener noreferrer"&gt;DALL-E&lt;/a&gt; in generating imagery from text, to the phenomenal attention garnered by &lt;a href="https://openai.com/chatgpt" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; through its natural and fluent conversations, GenAI has become an unavoidable topic. Whether AI can support life and work in more general scenarios became one of the core topics of 2023.&lt;/p&gt;

&lt;p&gt;The rise of development tools such as &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; signifies that engineers have begun to mass-produce applications based on GenAI. PingCAP has also run some experiments and shipped some work, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ossinsight.io/" rel="noopener noreferrer"&gt;Ossingisht&lt;/a&gt;’s &lt;a href="https://ossinsight.io/explore/" rel="noopener noreferrer"&gt;Data Explore&lt;/a&gt;r: A project that generates SQL to explore Github open-source software projects using natural language&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tidbcloud.com/" rel="noopener noreferrer"&gt;TiDB Cloud&lt;/a&gt;’s &lt;a href="https://docs.pingcap.com/tidbcloud/explore-data-with-chat2query" rel="noopener noreferrer"&gt;Chat2Query&lt;/a&gt;: A project that uses the in-cloud database to generate SQL using natural language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After building these applications, the author began to consider whether GenAI's capabilities could be used to construct more universal applications that provide greater value to users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Considering the Demand
&lt;/h3&gt;

&lt;p&gt;With the global growth of TiDB and TiDB Cloud, providing support for global users has become increasingly important. As the number of users grows exponentially, the number of support staff will not increase rapidly. Hence, how to handle the large volume of users becomes an urgent matter to consider.&lt;/p&gt;

&lt;p&gt;Based on the actual experience of supporting users, according to our research on the user queries in the global community and the internal ticket system, over 50% of the user issues could actually be addressed by referring to the official documentation. It’s just that the vast volume of content makes it difficult to find. Therefore, if we can provide a robot armed with all the necessary knowledge from the official TiDB documentation, perhaps it could help users utilize TiDB more effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Gap between Generative AI and Demand Fulfillment
&lt;/h3&gt;

&lt;p&gt;After identifying the demand, it's also necessary to understand the characteristics and limitations of GenAI to determine whether it fits this particular demand. Based on the work completed so far, the author can summarize some features of GenAI. Here, GenAI primarily refers to GPT (Generative Pre-trained Transformer) models focused on text dialogue; "GPT" is used throughout the rest of this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capabilities of GPT
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ability to understand semantics. GPT possesses a strong capability for comprehending semantics and can understand essentially any given text. Regardless of the language (human or programming), the level of textual polish, mixed languages, or grammatical and vocabulary errors, it can interpret user queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logical reasoning ability. GPT has a certain degree of logical reasoning power. Without special prompting, GPT can perform simple inference and uncover the deeper intent of a question; with suitable prompts, it can demonstrate stronger inferential capabilities. Methods for providing such prompts include Few-Shot, Chain-of-Thought (CoT), Self-Consistency, Tree-of-Thought (ToT), and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attempting to answer all questions. GPT, especially the chat variants such as GPT-3.5 and GPT-4, will always try to respond to every user question in conversational form, as long as the response aligns with its preset values, even if the answer is "I cannot provide that information."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;General knowledge capability. GPT itself possesses a vast amount of general knowledge, which is highly accurate and covers a broad range.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-turn dialogue capability. GPT is set up to understand the meanings of multiple dialogues between different roles, meaning it can utilize the method of further questioning in a conversation without having to repeat all the historical key information in every dialogue. This behavior aligns very well with human thinking and conversational logic.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
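&lt;p&gt;The multi-turn dialogue capability can be illustrated concretely: chat-style GPT APIs commonly accept a list of role/content messages, and a follow-up question can omit its subject because the model resolves it from history. A minimal sketch (the helper function is hypothetical):&lt;/p&gt;

```python
# A minimal sketch of multi-turn dialogue state in the role/content
# message format used by chat-style GPT APIs. The follow-up question
# below says "it" without naming TiDB; the model resolves the
# reference from history, so the user never repeats key information.

def make_follow_up(history, user_text):
    """Return a new message list with the user's follow-up appended,
    keeping every prior turn so GPT can resolve references."""
    return history + [{"role": "user", "content": user_text}]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is TiDB?"},
    {"role": "assistant", "content": "TiDB is a distributed SQL database."},
]
messages = make_follow_up(history, "Is it compatible with MySQL?")
```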

&lt;h3&gt;
  
  
  Limitations of GPT
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Passive triggering. GPT requires an input from the user to generate a response. This implies that GPT would not initiate interaction on its own.&lt;/li&gt;
&lt;li&gt;Knowledge expiration. This specifically refers to GPT-3.5 and GPT-4, whose training data cuts off in September 2021: any knowledge or events after that date are unknown to GPT, and it is unrealistic to expect GPT to provide new knowledge by itself.&lt;/li&gt;
&lt;li&gt;Illusion of specialized fields. Although GPT possesses excellent general knowledge, in a specific domain, such as the author's field, the database industry, most of GPT's answers are more or less erroneous and cannot be trusted directly.&lt;/li&gt;
&lt;li&gt;Dialogue length. GPT has a length limit for each dialogue round. If the content provided to GPT exceeds this limit, the call will fail.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Gap in Implementation of Requirements
&lt;/h3&gt;

&lt;p&gt;The author intends to use GPT to realize an "enterprise-specific user assistant robot," which means the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requirement 1: Multi-turn dialogue, understanding the user's queries, and providing answers.&lt;/li&gt;
&lt;li&gt;Requirement 2: The content of answers about TiDB and TiDB Cloud must be accurate.&lt;/li&gt;
&lt;li&gt;Requirement 3: It must not answer content unrelated to TiDB and TiDB Cloud, especially politically related content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Analyzing these requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requirement 1: Can basically be met, based on GPT's "ability to understand semantics," "logical reasoning ability," "attempting to answer all questions," and "multi-turn dialogue capability."&lt;/li&gt;
&lt;li&gt;Requirement 2: Cannot be met, due to GPT's "knowledge expiration" and "illusion of specialized fields" limitations.&lt;/li&gt;
&lt;li&gt;Requirement 3: Cannot be met. Because GPT attempts to answer all questions, any question will get a response, and GPT itself does not refuse political questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, building this assistant robot is mainly about solving Requirements 2 and 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correct Answering in Sub-Domains Knowledge
&lt;/h3&gt;

&lt;p&gt;Here, we address the second requirement.&lt;br&gt;
Enabling GPT to answer user queries based on specific domain knowledge is not a new field. The author's earlier optimization of OSS Insight's Data Explorer used the same approach, boosting the execution rate of natural-language-generated SQL (i.e., the share of generated SQL that runs successfully and produces results in TiDB) by over 25% (as seen in the performance upon &lt;a href="https://pingcap.feishu.cn/wiki/wikcnqan475PLagIpWq8mDTZ4Rf" rel="noopener noreferrer"&gt;OSS Insight's launch&lt;/a&gt;).&lt;br&gt;
What needs to be employed here is the similarity search capability of vector databases. This typically involves three steps:&lt;/p&gt;

&lt;h4&gt;
  
  
  Storing Domain Knowledge in a Vector Database
&lt;/h4&gt;

&lt;p&gt;The first step is to put the official documents of TiDB and TiDB Cloud into the vector database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5yh3crxz4lz6vlef4gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5yh3crxz4lz6vlef4gb.png" alt="Image description" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the documents have been retrieved, their text content is fed into the Embedding model to generate corresponding vectors, which are then stored in a vector database.&lt;br&gt;
In this step, two points need to be checked: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the quality of the document is poor, or the format of the document does not meet the expectations, a round of preprocessing will be carried out on the document in advance to convert it into a relatively clean text format that can be easily understood by LLM.&lt;/li&gt;
&lt;li&gt;If a document is long and exceeds GPT's single-conversation length limit, it must be split into chunks that meet the limit. There are many splitting methods, such as splitting on specific characters (e.g., commas, periods, semicolons), splitting by text length, and so on.&lt;/li&gt;
&lt;/ul&gt;
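&lt;p&gt;The splitting step can be sketched as follows. This is a minimal illustration that counts characters rather than tokens; real pipelines (e.g. LangChain's text splitters) also add overlap between chunks, and &lt;code&gt;split_document&lt;/code&gt; is a hypothetical helper, not TiDB Bot's actual code:&lt;/p&gt;

```python
# A simplified sketch of document chunking: greedily cut the text into
# pieces no longer than max_len characters, preferring to cut at the
# last separator (period, semicolon, comma) before the limit.

def split_document(text, max_len=500, separators=(".", ";", ",")):
    """Split `text` into chunks of at most `max_len` characters."""
    chunks = []
    while text[max_len:]:  # more than max_len characters remain
        window = text[:max_len]
        cut = max(window.rfind(sep) for sep in separators)
        if cut == -1:
            cut = max_len - 1  # no separator found: hard cut by length
        chunks.append(text[:cut + 1].strip())
        text = text[cut + 1:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```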

&lt;h4&gt;
  
  
  Searching for Relevant Content from the Vector Database
&lt;/h4&gt;

&lt;p&gt;The second step is to search for relevant text content from the vector database when the user poses a question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypl5st3viqqb531y9khi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypl5st3viqqb531y9khi.png" alt="Image description" width="800" height="1154"&gt;&lt;/a&gt;&lt;br&gt;
When a user initiates a conversation, the system will convert the user's conversation into a vector through the Embedding model and put this vector into the vector database to perform a search with the existing data. During the search, we use similarity algorithms (such as cosine similarity, dot-product, etc.) to calculate the most similar domain knowledge vectors and extract the text content corresponding to these vectors.&lt;br&gt;
A user's specific question may require multiple documents to answer, hence during the search, we retrieve the top N (currently N equals 5) documents with the highest similarity. These top N documents can satisfy the need for spanning multiple documents, all of which will contribute to the content provided to GPT in the next step.&lt;/p&gt;
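&lt;p&gt;The search step can be illustrated with a brute-force sketch. In production, a vector database performs this ranking with an index; the &lt;code&gt;cosine&lt;/code&gt; and &lt;code&gt;top_n&lt;/code&gt; helpers here are illustrative stand-ins:&lt;/p&gt;

```python
import math

# A stripped-down sketch of the retrieval step: rank stored document
# vectors by cosine similarity to the query vector and return the
# passages for the top-N matches (the article uses N = 5).

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vec, store, n=5):
    """store: list of (vector, passage_text) pairs."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:n]]
```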

&lt;h4&gt;
  
  
  Presenting Relevant Content and User Queries to GPT
&lt;/h4&gt;

&lt;p&gt;The third step is assembling all the pertinent information and submitting it to GPT.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qpwfzsgg8r63qlupjpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qpwfzsgg8r63qlupjpq.png" alt="Image description" width="800" height="168"&gt;&lt;/a&gt;&lt;br&gt;
The task objective and relevant domain knowledge are incorporated into the system prompt, and the chat history is assembled from past dialogues. Providing all of this to GPT allows a domain-specific response grounded in the retrieved knowledge. &lt;br&gt;
With the above steps complete, we can basically meet the second requirement: answering questions based on specific domain knowledge. The correctness of the answers is greatly improved compared with querying GPT directly.&lt;/p&gt;
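&lt;p&gt;The assembly step might look like the following sketch. The template wording and function names are illustrative assumptions, not TiDB Bot's actual prompt:&lt;/p&gt;

```python
# A sketch of the final step: fold the retrieved passages into the
# system prompt, then append the chat history and the new question in
# the role/content message format expected by chat-style GPT APIs.

SYSTEM_TEMPLATE = (
    "You are TiDB Bot, a support assistant for TiDB and TiDB Cloud.\n"
    "Answer ONLY from the reference documents below.\n\n"
    "Reference documents:\n{context}"
)

def assemble_messages(passages, history, question):
    """Build the full message list for one GPT call."""
    context = "\n---\n".join(passages)
    system = {"role": "system",
              "content": SYSTEM_TEMPLATE.format(context=context)}
    return [system] + history + [{"role": "user", "content": question}]
```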

&lt;h3&gt;
  
  
  Limiting the Response Field
&lt;/h3&gt;

&lt;p&gt;Here we aim to address Requirement 3. &lt;br&gt;
As this robot is intended to serve as a business support capability, we expect it to answer only questions related to the company itself, such as those about TiDB, TiDB Cloud, SQL, application construction, and so on. For inquiries beyond these scopes, such as questions about the weather, cities, arts, or politics, we expect the robot to decline to respond. &lt;br&gt;
Given GPT's aforementioned tendency to "attempt to answer all questions", and that GPT is set to respond to any question in a manner aligned with human values, this restriction cannot be delegated to GPT and must be attempted on the application side. &lt;br&gt;
Only by meeting this requirement can the service actually go live. Regrettably, there is currently no satisfactory industrial solution for this; indeed, most application designs do not even address it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Concept: Toxicity
&lt;/h4&gt;

&lt;p&gt;As mentioned earlier, GPT attempts to tailor its responses to align with human values, a step referred to as "alignment" in model training. It prompts GPT to refuse to answer questions related to hate and violence. If GPT fails to comply and answers such questions, it is deemed to have displayed toxicity.&lt;br&gt;
For the robot being built here, the scope of toxicity is effectively expanded: any response not pertaining to the company's business can be considered toxic. Under this definition, we can draw on previous work in detoxification. &lt;a href="https://aclanthology.org/2021.findings-emnlp.210.pdf" rel="noopener noreferrer"&gt;Johannes Welbl&lt;/a&gt; and colleagues at DeepMind (2021) advise using language models for toxicity detection. With the considerable advances in GPT's capabilities, it is now possible to use GPT itself to judge whether a user's question falls within the company's business scope.&lt;br&gt;
To limit the answer domain, two steps are necessary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Determination within a limited domain
&lt;/h4&gt;

&lt;p&gt;The first step is to evaluate the user's initial inquiry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8vlja0jw4ayht9zklgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8vlja0jw4ayht9zklgr.png" alt="Image description" width="800" height="190"&gt;&lt;/a&gt;&lt;br&gt;
Here, it's necessary to use the few-shot method to construct prompts for toxicity detection, enabling GPT to determine if the user's inquiry falls within the scope of enterprise services when multiple examples are at hand. &lt;br&gt;
For instance, some examples are:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;&amp;lt; EXAMPLES &amp;gt;&amp;gt;

instruction: who is Lady Gaga?
question: is the instruction out of scope (not related with TiDB)?
answer: YES

instruction: how to deploy a TiDB cluster?
question: is the instruction out of scope (not related with TiDB)?
answer: NO

instruction: how to use TiDB Cloud?
question: is the instruction out of scope (not related with TiDB)?
answer: NO&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After the judgment is completed, GPT will output "YES" or "NO" for subsequent processing. Note that here "YES" signifies toxic (not relevant to the business), and "NO" means non-toxic (relevant to the business).&lt;/p&gt;
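&lt;p&gt;Under the few-shot scheme above, the prompt construction and answer parsing might be sketched like this. &lt;code&gt;build_toxicity_prompt&lt;/code&gt; and &lt;code&gt;is_toxic&lt;/code&gt; are hypothetical helpers; the real system sends the prompt to GPT and parses the reply:&lt;/p&gt;

```python
# A sketch of the few-shot toxicity check: assemble the examples shown
# above into a prompt ending at "answer:", so GPT completes it with
# YES (out of scope) or NO (in scope), then parse that reply.

SCOPE_QUESTION = "question: is the instruction out of scope (not related with TiDB)?"

def build_toxicity_prompt(user_instruction):
    examples = [
        ("who is Lady Gaga?", "YES"),
        ("how to deploy a TiDB cluster?", "NO"),
        ("how to use TiDB Cloud?", "NO"),
    ]
    lines = []
    for instruction, answer in examples:
        lines.append("instruction: " + instruction)
        lines.append(SCOPE_QUESTION)
        lines.append("answer: " + answer)
        lines.append("")
    lines.append("instruction: " + user_instruction)
    lines.append(SCOPE_QUESTION)
    lines.append("answer:")  # GPT completes with YES or NO
    return "\n".join(lines)

def is_toxic(gpt_reply):
    """YES means out of scope (toxic); NO means in scope."""
    return gpt_reply.strip().upper().startswith("YES")
```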

&lt;h4&gt;
  
  
  Post-Judgment Processing
&lt;/h4&gt;

&lt;p&gt;In the second step, after obtaining the toxicity result, we branch into two processes, toxic and non-toxic, handling the abnormal and normal flows respectively.&lt;br&gt;
The normal process is the Correct Answering in Sub-Domains Knowledge flow described above. The focus here is on the abnormal flow.&lt;br&gt;
When the system finds that the generated judgment is "YES", it routes the request into the toxic-content reply process: a system prompt that refuses to answer, together with the user's question, is submitted to GPT, and the user ultimately receives a polite refusal.&lt;br&gt;
With these two steps complete, Requirement 3 is essentially met.&lt;/p&gt;
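&lt;p&gt;The two-branch flow can be sketched as follows. &lt;code&gt;judge_scope&lt;/code&gt; and &lt;code&gt;answer_with_docs&lt;/code&gt; are placeholders for the two GPT calls, and the refusal text is illustrative:&lt;/p&gt;

```python
# A sketch of the control flow: run the toxicity judgment first, then
# either refuse (abnormal branch) or run the normal RAG answer flow.

REFUSAL = ("Sorry, I can only help with questions about TiDB and "
           "TiDB Cloud.")

def handle_question(question, judge_scope, answer_with_docs):
    """judge_scope(question) returns "YES" (out of scope) or "NO";
    answer_with_docs(question) runs the normal domain-answer flow."""
    if judge_scope(question).strip().upper() == "YES":
        return REFUSAL                      # abnormal branch: refuse
    return answer_with_docs(question)       # normal branch: answer
```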

&lt;h3&gt;
  
  
  Overall Logical Framework
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz4uopm8c2gc66xnnh63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz4uopm8c2gc66xnnh63.png" alt="Image description" width="800" height="396"&gt;&lt;/a&gt;&lt;br&gt;
Thus, we have developed a basic assistant robot, which we named TiDB Bot, that can be provided to users and has specific enterprise domain knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  TiDB Bot Test Stage Results
&lt;/h3&gt;

&lt;p&gt;TiDB Bot began internal testing on March 30 and officially opened to Cloud users on July 11. &lt;br&gt;
During the 103 days of TiDB Bot's incubation, countless community members and developers provided valuable feedback on the test product, gradually making TiDB Bot usable. During the test phase, a total of 249 users accessed the bot, sending 4,570 messages. By the end of the test stage, 83 users had given 266 pieces of feedback, with negative feedback accounting for 3.4% of all messages and positive feedback for 2.1%.&lt;br&gt;
Beyond the direct users, community members also suggested ideas and proposed further solutions. Thanks to all the communities and developers; without you, the launch of TiDB Bot would not have been possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Plans
&lt;/h3&gt;

&lt;p&gt;As the number of users steadily increases, significant challenges remain, both in the accuracy of recalled content and in toxicity judgment. The author has therefore been optimizing TiDB Bot's accuracy in actual service to gradually enhance the quality of its answers. These topics will be covered in future articles.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tidb</category>
    </item>
  </channel>
</rss>
