Introduction
VTS (Vector Transport Service) is an open-source tool developed by Zilliz that focuses on migrating vector and unstructured data. VTS is built on Apache SeaTunnel, which gives it a significant advantage in data processing and migration. As a distributed data integration platform, Apache SeaTunnel is known for its rich connector ecosystem and multi-engine support; VTS extends those capabilities to vector database migration and unstructured data processing.
What is a Vector Database
A vector database is a database system specifically designed for storing and retrieving vector data:
- It can efficiently handle high-dimensional vector data and supports similarity searches.
- It supports KNN (K-Nearest Neighbors) search: it calculates distances between vectors (Euclidean distance, cosine similarity, etc.) and quickly retrieves the most similar vectors (see the sketch after this list).
- It is mainly used in AI and machine learning scenarios, such as:
  - Image retrieval systems
  - Recommendation systems
  - Natural language processing
  - Facial recognition
  - Similar product search
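To make the similarity-search idea concrete, here is a minimal brute-force sketch in Python using NumPy. It is an illustration only; real vector databases replace this linear scan with approximate indexes such as HNSW or IVF to scale to large datasets.

```python
# Brute-force top-k retrieval by cosine similarity (illustrative only).
import numpy as np

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Return indices and scores of the k vectors most similar to `query`."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                   # cosine similarity to every stored vector
    idx = np.argsort(-scores)[:k]    # indices of the k highest scores
    return idx, scores[idx]

# Example: 1,000 random 128-dimensional vectors and one query vector.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))
query = rng.normal(size=128)
indices, similarities = top_k_cosine(query, db, k=3)
print(indices, similarities)
```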
Development Motivation and Background
As a leading provider of vector database services, Zilliz is well aware that building outstanding AI applications is inseparable from the data itself. However, handling unstructured data effectively in AI applications often involves the following challenges:
- Data fragmentation: User data is scattered across multiple platforms, such as S3, HDFS, Kafka, data warehouses, and data lakes.
- Diverse data formats: Unstructured data exists in various formats, including JSON, CSV, Parquet, JPEG, etc.
- Lack of complete solutions: Currently, no product can fully meet the complex needs of efficiently transferring unstructured data and vector data across systems.
Among these challenges, the most prominent is how to transform unstructured data from diverse sources and formats and import it into vector databases. This process is far more complex than handling traditional relational (SQL) data, and most companies or organizations underestimate it.
Therefore, many companies or organizations face performance, scalability, and maintenance cost issues when building custom unstructured data pipelines. These issues can affect data quality and accuracy, which may in turn weaken the data analysis capabilities of applications.
What's worse, many companies overlook or underestimate factors such as vendor lock-in and data disaster recovery when choosing a vector database.
Impact of Vendor Lock-In
Vendor lock-in refers to an organization's over-reliance on proprietary technology from a single vendor. In this case, the organization would find it difficult to switch to another solution, or the cost of switching would be very high. This issue is particularly important in the field of vector databases, as the characteristics of vector data and the lack of standardized data formats may make cross-system data migration extremely challenging.
The impact of vendor lock-in goes beyond this. It also restricts the organization's flexibility in the face of changing business needs and may even increase operational costs over time. In addition, being locked into a single vendor's ecosystem can limit technological innovation. If the chosen solution does not scale well with the growth of organizational needs, it can also affect the performance of application systems.
When choosing a vector database, organizations should prioritize open standards and interoperability to reduce the risks mentioned above. In the process of formulating a clear data governance strategy, planning the portability of data is crucial. Regularly assessing the dependence on vendor-specific features can help organizations maintain system flexibility.
Challenges in Unstructured Data Migration
However, even with the above precautions, organizations must be prepared to face the unique challenges brought by vector databases. We have found that data migration between vector databases is much more complex than traditional relational database migration. This complexity highlights the importance of choosing the right vector database and explains why attention must be paid to avoiding vendor lock-in. The main challenges in vector database migration include:
- Lack of ETL tools for vector databases: mainstream tools such as Airbyte and SeaTunnel primarily target traditional databases and cannot effectively meet the data migration needs between vector databases.
- Differences in capabilities between vector databases:
  - Many vector databases do not support data export.
  - Some vector databases have limited real-time processing capabilities for incremental data.
  - Data schemas often do not match between vector databases.
To address these challenges, organizations need tooling that lets them build more resilient, flexible, and up-to-date AI applications, fully leverage the power of unstructured data, and stay adaptable to future technologies.
Data Migration Tool Born for Vector Data
To help users deal with the challenges above, Zilliz has launched and open-sourced a new migration service: a tool for vector data migration built on Apache SeaTunnel.
- GitHub Address: https://github.com/zilliztech/vts
After verification and testing, this service will be merged into the official SeaTunnel branch.
The reasons behind Zilliz's development of this tool include:
- Meeting the growing demand for data migration: user needs continue to expand to include migrating data from various vector databases, traditional search engines (such as Elasticsearch and Solr), relational databases, data warehouses, document databases, and even S3 and data lakes.
- Supporting real-time streaming data and offline import: as the capabilities of vector databases continue to expand, users need both real-time streaming and offline batch import.
- Simplifying the unstructured data conversion process: unlike traditional ETL, converting unstructured data requires the power of AI models. The migration service, combined with Zilliz Cloud Pipelines, can convert unstructured data into embedding vectors and complete data tagging tasks, significantly reducing data cleaning costs and operational difficulty (a sketch of this kind of embedding step follows this list).
- Ensuring end-to-end data quality: data loss and inconsistency are common during data integration and synchronization. The migration service addresses these potential data quality issues with strong monitoring and alerting mechanisms.
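As a rough illustration of that embedding step (not the actual VTS or Zilliz Cloud Pipelines implementation), the sketch below vectorizes text with the open-source sentence-transformers library and writes the result to Milvus with pymilvus. The model, collection name, and field names are assumptions made for the example.

```python
# Illustrative only: text -> embedding -> vector database, the kind of
# conversion step described above. Model, collection name, and field
# names are assumptions; this is not the VTS/Zilliz Cloud Pipelines code.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = MilvusClient(uri="http://localhost:19530")

docs = [
    {"id": 1, "text": "Red running shoes with breathable mesh."},
    {"id": 2, "text": "Waterproof hiking boots for winter trails."},
]

# Convert unstructured text into embedding vectors.
vectors = model.encode([d["text"] for d in docs])

# Insert into a pre-created Milvus collection with fields: id, text, vector.
client.insert(
    collection_name="documents",
    data=[
        {"id": d["id"], "text": d["text"], "vector": vec.tolist()}
        for d, vec in zip(docs, vectors)
    ],
)
```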
Core Capabilities of VTS
Based on Apache SeaTunnel
VTS inherits the high throughput and low latency characteristics of Apache SeaTunnel while adding support for vector data and unstructured data. This makes VTS a powerful tool for building AI application data pipelines, achieving real-time synchronization of vector data, and converting and loading unstructured data.
VTS's core capabilities include:
- Vector database migration
- Building AI application data pipelines
- Real-time synchronization of vector data
- Conversion and loading of unstructured data
- Cross-platform data integration
Vector Database Migration
One of the core capabilities of VTS is vector database migration. It can handle the migration of vector data, which is crucial for AI and machine learning applications that often need to deal with a large amount of high-dimensional vector data.
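Conceptually, such a migration boils down to reading batches of vectors from the source system and writing them to the target. The sketch below shows that loop in Python with pymilvus on the target side; `read_batches` is a hypothetical stand-in for a real source connector, and the collection and field names are assumptions.

```python
# Conceptual sketch of a vector migration loop; not the VTS implementation.
# `read_batches` is a hypothetical placeholder for a real source connector
# (Pinecone, Qdrant, Elasticsearch, ...); collection and field names are assumed.
from pymilvus import MilvusClient

def read_batches():
    """Hypothetical source reader; a real one would page through the source system."""
    yield [
        {"id": 1, "vector": [0.1] * 128, "metadata": {"source": "demo"}},
        {"id": 2, "vector": [0.2] * 128, "metadata": {"source": "demo"}},
    ]

client = MilvusClient(uri="http://localhost:19530")

for batch in read_batches():
    # Upsert keeps the target idempotent if the migration job is re-run,
    # assuming the target collection "migrated_vectors" already exists.
    client.upsert(collection_name="migrated_vectors", data=batch)
```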
Cross-Platform Data Integration
VTS supports cross-platform data integration, meaning it can seamlessly migrate data from one system to another, whether it's a traditional relational database or a modern vector database.
VTS Supported Connectors and Transforms
Supported Connectors
VTS supports a variety of connectors, including but not limited to Milvus, Pinecone, Qdrant, PostgreSQL, Elasticsearch, Tencent Cloud VectorDB, etc., making VTS compatible with a wide range of data sources and storage systems.
Supported Transforms
VTS also supports a variety of data transformation operations, such as TablePathMapper (change table names), FieldMapper (add or delete columns), Embedding (text vectorization), etc., making VTS more flexible in data processing.
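To illustrate what these transforms do to rows in flight, here is a conceptual Python sketch; it is not VTS's actual API or configuration syntax.

```python
# Conceptual illustration of row-level transforms like TablePathMapper and
# FieldMapper; this is not VTS's actual API or configuration syntax.
def table_path_mapper(table_path: str, mapping: dict) -> str:
    """Rename a source table/collection path to its target name."""
    return mapping.get(table_path, table_path)

def field_mapper(row: dict, keep: dict) -> dict:
    """Keep and rename selected fields; fields not listed are dropped."""
    return {target: row[source] for source, target in keep.items() if source in row}

row = {"pk": 42, "embedding": [0.1, 0.2, 0.3], "title": "demo", "tmp_debug": "x"}
print(table_path_mapper("source_db.products", {"source_db.products": "milvus.products"}))
print(field_mapper(row, {"pk": "id", "embedding": "vector", "title": "title"}))
```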
Supported Data Types
VTS supports various data types, including Float Vector, Sparse Float Vector, multi-vector columns, and dynamic columns, as well as multiple insertion modes, including Upsert and Bulk Insert (offline, large-batch), further enhancing its ability to handle complex data migration tasks.
Excellent Performance
VTS also performs well. For example, in a Pinecone-to-Milvus migration demo, 100 million vectors were synchronized at a rate of about 2,961 vectors per second, taking roughly nine and a half hours on 4 cores and 8 GB of memory.
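As a quick sanity check on those numbers:

```python
# Quick check of the quoted figure: 100 million vectors at ~2,961 vectors/second.
total_vectors = 100_000_000
rate_per_second = 2961
print(total_vectors / rate_per_second / 3600)  # ≈ 9.4 hours
```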
Unstructured Data Support
In addition, VTS supports the processing of unstructured data. It currently supports Shopify data and will gradually add support for unstructured data types such as PDF, Google Docs, Slack, and image/text, continuously strengthening its support in the critically important area of unstructured data.
Application Scenarios
VTS has a wide range of use cases. In a product recommendation scenario, for example, product and inventory data can be synchronized from Shopify, passed through an embedding service, stored in Milvus, and then queried with similarity search to return the most similar products, greatly improving the quality of product recommendations.
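The final similarity-search step of that scenario might look like the following sketch. It is illustrative only; the collection name, output fields, and embedding model are assumptions.

```python
# Illustrative final step of the recommendation scenario: search Milvus for
# the products most similar to a query embedding. Collection name, output
# fields, and embedding model are assumptions for this sketch.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = MilvusClient(uri="http://localhost:19530")

query_vector = model.encode("lightweight trail running shoes").tolist()

results = client.search(
    collection_name="products",        # assumed collection synced from Shopify
    data=[query_vector],
    limit=5,                           # return the 5 most similar products
    output_fields=["title", "price"],  # assumed scalar fields
)
for hit in results[0]:
    print(hit["distance"], hit["entity"]["title"])
```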
Future Plans
Looking ahead, the migration service will continue to evolve. By offering the VTS open-source migration service tool, it not only solves the current problems and challenges in vector data management but also paves the way for the development of innovative AI applications.
The plans for VTS include supporting more data sources, such as ChromaDB, DataStax (Astra DB), data lakes, MongoDB, Kafka (for real-time AI), object storage import, and more.
It is worth noting that VTS's direct insertion of raw data and the use of raw data for search functionality are expected to be implemented in Milvus version 2.5.
In addition, for GenAI ETL pipelines, VTS will also attempt to support task flow orchestration, embedding services, external APIs, and integration with the open-source big data workflow scheduling platform Apache DolphinScheduler.
Conclusion
As a vector data migration tool developed based on Apache SeaTunnel, VTS not only inherits SeaTunnel's powerful data processing capabilities but also extends its support for vector data and unstructured data, making it an indispensable data migration tool in the fields of AI and machine learning. More information and resources about VTS can be found on its GitHub page.