DEV Community: intalink

IntaLink Community: Exploring a New Open-Source Platform for Automatic Analysis of Data Table Relationships

intalink — Wed, 04 Dec 2024 02:36:36 +0000

01-What is the IntaLink platform?

The IntaLink platform addresses the varied data integration needs that users face across multiple scenarios, without requiring knowledge of the business background. It automatically analyzes the relationships between tables, generates data association paths, and, based on different application strategies, recommends the best path so that data can be looked up and used on demand. This eliminates much of the repetitive manual analysis found in traditional data integration work.
The IntaLink platform is now officially open source on GitHub. We welcome technology enthusiasts and anyone who needs the platform to join the project, contribute their skills, knowledge, and effort, and help build a more complete ecosystem and a more advanced and powerful IntaLink platform!

02-How to quickly find the IntaLink open source project?

1. GitHub open source code address:
You can view project details, download the code, and participate in development on the IntaLink project homepage. (https://github.com/YT-DATA/INTALINK)
2. Community guide:
The IntaLink community guide covers community tasks, contribution incentives, and more, to help new users get started quickly. (https://github.com/YT-DATA/community)

03-How to download IntaLink open source code?

We provide three ways to download the code; choose whichever suits your habits:

1. Use HTTPS:
git clone https://github.com/YT-DATA/community.git

2. Use SSH:
git clone git@github.com:YT-DATA/community.git

3. Use GitHub CLI:
gh repo clone YT-DATA/community

04-Quickly learn about the IntaLink open source community

The IntaLink community welcomes all developers. Whether you are a novice or an experienced expert, you will find a contribution path that suits you.
The community offers a variety of tasks for both novice and advanced developers, so you can start from scratch and gradually build your development skills. The task types are:
·Basic tasks: Fix minor bugs, update documentation, or improve code comments.
·Functional improvement: Enhance existing functions or modules, improve performance, or improve the user experience.
·New feature development: Design and implement new functional modules to broaden IntaLink's application scenarios and extensibility.
For the detailed contribution process, please refer to the contribution guide. (https://github.com/YT-DATA/community/blob/main/README.ch.md)

05-If you encounter any problems, we are always ready to provide support for you

Whether you run into a technical issue or have a feature suggestion while using IntaLink, you can get help or give feedback in the following ways:

GitHub Issues: Submit issues or suggestions in the Issues section and work through them with community members. (https://github.com/YT-DATA/INTALINK/issues)

Instant messaging platform: Join our Discord community to talk with developers around the world in real time and share your insights. (https://discord.com/invite/FvhqEZ6z95)

06-Act now, join us, join the open source family

Joining our open source community means stepping into a vibrant and innovative technology ecosystem. Here, every line of code and every discussion pushes the boundaries of technology and brings new ideas for solving practical problems.
We will periodically run calls for submissions with prizes in the community, covering areas such as solutions and technical directions, to encourage in-depth exchange and debate among technology enthusiasts. These are a good opportunity to learn and improve, and also a stage on which to showcase your talent and gain recognition.
Join us now and contribute to the technology community through practical action. Let's work together to promote technological progress and create a better future!

IntaLink: A New NL2SQL Technology Distinct from Large Models

intalink — Tue, 29 Oct 2024 06:05:42 +0000

Hidden Gem

Wide Application Scenarios of IntaLink

Background Review: In previous articles, it was mentioned that "the goal of IntaLink is to achieve automated data linking in the field of data integration." From the discussion, it is clear that IntaLink addresses the issue of automatic linking of "relational data and multiple tables."

Now, let's discuss whether this issue has broad application scenarios or is merely a false problem with no practical demand.

01 Relational Data Remains One of the Most Important Data Assets

Large models, big data platforms, and similar technologies can draw on many types of information, including documents, images, audio, and video; multimodal generative AI, for example, can produce videos and support voice interaction. Their results, however, are often open-ended and subjective, and they occasionally "hallucinate." Using them for reference or assistance is acceptable, but in rigorous working environments we cannot rely on such output, or on large models alone, to complete tasks. In sectors like banking, finance, transportation, trading, accounting, production, and energy, core business data must be managed as structured relational data.

02 Data Construction is Inevitable and Distributed

(1) Relational database design principles (normalization) require data to be divided sensibly to avoid significant redundancy. If the data generated during the construction phase contains a lot of redundancy, the data collection workload is duplicated and data consistency becomes difficult to ensure. From another perspective, if all related data were stored in a single table while the data items came from different business sources, with different data collectors and generation times, maintaining those records would be impossible. Data construction therefore naturally organizes data around objects and business activities, which distributes it across different tables.
(2) Data must originate from multiple systems. Information systems are not built in one go, so they are inevitably developed in sequence. Even within the same system, implementation timelines may differ. Moreover, different application scenarios call for different technology choices; for instance, business data, real-time data, and log information may each be handled by different technologies, which makes data inherently multi-sourced.

03 Integration is the Most Effective Means of Unlocking Data Value

Data needs to be integrated before it can be applied, and the demand for integration takes many forms. For example, integrating production data with planning data shows how far a plan has been completed; integrating production data with sales data reveals product backlogs or the fulfillment of order deliveries; and integrating production data with financial data makes it possible to evaluate production costs and profitability. Data integration is therefore the most effective way to maximize data value and empower business processes.

In summary, integrating relational data will remain one of the most important data application scenarios for a long time. As long as this scenario exists, IntaLink will be broadly applicable.

Comparison of IntaLink and Large Model Data Integration Methods

T2SQL (Text to SQL) and NL2SQL (Natural Language to SQL) both refer to automatically generating the required data queries from text or natural language input. The two terms describe essentially the same concept: using AI technology to turn semantic understanding into data operations. This is an active research direction in data applications, and it has advanced significantly in recent years with the emergence of large model technologies. I have studied technical reports from Alibaba and Tencent and tried out open-source projects like DB-GPT. These technologies are largely similar, at least in their underlying technical logic, while IntaLink's approach is entirely different.

Let’s set aside the underlying technical logic for now and conduct a comparative analysis based on implementation methods:

1. Utilizing Large Model Technology for Automatic Data Queries Requires Data Training

Suppose we have a set of tables named T1, T2, ..., Tn, each containing several data items labeled C1, C2, ..., Cn, with varying counts of items per table. Consider a simulated dataset for table T1 as follows:

C1	C2	C3	C4	C5	C6
Orange	5	3	3	2	1

From this content alone, we cannot derive any useful information. We are unclear about the meaning of the data above. Let’s simulate two meanings for the data:

Fruit Type	Warehouse No.	Shelf No.	Stock	Shelf Life	Warehouse Manager ID
Orange	5	3	3	2	1

Hotel Name	Popularity Ranking	Star Rating	Years in Business	Remaining Rooms	Discount Available
Orange	5	3	3	2	1

We won't dwell on the validity of these datasets or the existence of such tables. However, it is evident that without understanding the meaning of the tables and data items, the data cannot be applied. One cannot link data application needs to the data itself, let alone discuss more complex data operations.

Let's use a dataset designed for testing NL2SQL to illustrate how large model technology is applied in this field.

The Spider dataset is a T2SQL dataset for multi-database, multi-table, single-turn queries and is widely regarded as one of the most challenging large-scale cross-domain benchmarks. It was released by Yale University in 2018 and annotated by 11 Yale students. The dataset contains 10,181 natural language questions and 5,693 SQL statements, covering 200 databases across 138 different domains. Of the questions, 7,000 are used for training, 1,034 for development, and 2,147 for testing. In other words, the large model learns to use the data by being given questions together with their corresponding answers (SQL). For simplicity, we can condense the logic as follows:

Question 1: How many red lipsticks are in stock?
Answer 1: select amount from warehouse where good_name='lipstick' and color='red'

After training the model with such a dataset, we can pose the following test question:

Test Question: How many blue lipsticks are in stock?
Output Answer: select amount from warehouse where good_name='lipstick' and color='blue'

From this, we see that NL2SQL emphasizes deriving possible SQL queries based on semantic and contextual understanding, relying on a trained dataset.
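To make the pattern concrete, here is a minimal Python sketch that runs the two SQL statements above against an in-memory SQLite database. The `warehouse` table layout is inferred from the column names in the SQL, and the rows are invented purely for illustration; no model is trained here, and the "predicted" query is simply the test answer quoted above.

```python
import sqlite3

# Hypothetical warehouse table: column names follow the article's SQL,
# the rows are made up for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse (good_name TEXT, color TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO warehouse VALUES (?, ?, ?)",
    [("lipstick", "red", 12), ("lipstick", "blue", 7), ("mascara", "black", 4)],
)

# A training pair: a natural language question and its reference SQL.
training_pair = (
    "How many red lipsticks are in stock?",
    "select amount from warehouse where good_name='lipstick' and color='red'",
)

# The SQL a trained model is expected to produce for the unseen test question.
predicted_sql = "select amount from warehouse where good_name='lipstick' and color='blue'"


def run(sql):
    """Execute a query and return the first column of the first row."""
    return conn.execute(sql).fetchone()[0]


red_amount = run(training_pair[1])
blue_amount = run(predicted_sql)
print(red_amount, blue_amount)
```

The point of the sketch is that the model never inspects the stored values: it generalizes from the wording of the training question to the wording of the test question.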

IntaLink’s Data Integration Method

IntaLink's data integration does not require users to provide any training data. The relationships between data are generated through an inter-table relationship analysis model. This relationship generation does not require understanding the actual significance of the tables and data items but is derived through a set of methods that analyze the data's characteristic values to deduce associations between tables. Below, we illustrate the establishment of inter-table relationships using two sample tables.

Tab_1

Name	Student_ID	CLASS	Age	Sex
Zhang San	2021_0001	2021_01	19	Male
Li Si	2021_0002	2021_01	18	Female
Wang Wu	2021_0003	2021_01	19	Male

Tab_2

Student_ID	Course	Grade	Rank
2021_0001	Math	135	18
2021_0001	Chinese	110	23
2021_0002	Math	120	25
2021_0002	Chinese	125	10

The Student_ID values in Tab_1 match the Student_ID values in Tab_2; the two columns share the same characteristic values. The two tables can therefore be linked with the condition Tab_1.Student_ID = Tab_2.Student_ID. Real inter-table linkage analysis has to take many more factors into account. In IntaLink, we copy the data's characteristic values into an in-memory database that serves as the analysis tool, and apply a set of optimized analytical methods to produce the inter-table relationship results. Because the details are complex, we will not elaborate here; a separate article will discuss the implementation logic.
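As a rough illustration of the idea of comparing characteristic values, the Python sketch below scores every column pair of the two sample tables by value containment. This is not IntaLink's algorithm, which the article does not describe; the `containment` and `candidate_links` helpers and the threshold are assumptions made for this example only.

```python
from itertools import product

# Sample tables from the article, stored as column name -> list of values.
tab_1 = {
    "Name": ["Zhang San", "Li Si", "Wang Wu"],
    "Student_ID": ["2021_0001", "2021_0002", "2021_0003"],
    "CLASS": ["2021_01", "2021_01", "2021_01"],
    "Age": ["19", "18", "19"],
    "Sex": ["Male", "Female", "Male"],
}
tab_2 = {
    "Student_ID": ["2021_0001", "2021_0001", "2021_0002", "2021_0002"],
    "Course": ["Math", "Chinese", "Math", "Chinese"],
    "Grade": ["135", "110", "120", "125"],
    "Rank": ["18", "23", "25", "10"],
}


def containment(child_values, parent_values):
    """Fraction of distinct child values that also appear in the parent column."""
    child, parent = set(child_values), set(parent_values)
    return len(child & parent) / len(child) if child else 0.0


def candidate_links(left, right, threshold=1.0):
    """Return (left_col, right_col, score) for column pairs that overlap enough."""
    links = []
    for (lcol, lvals), (rcol, rvals) in product(left.items(), right.items()):
        score = containment(rvals, lvals)
        if score >= threshold:
            links.append((lcol, rcol, score))
    return links


links = candidate_links(tab_1, tab_2)
for lcol, rcol, score in links:
    print(f"Tab_1.{lcol} = Tab_2.{rcol}  (containment {score:.2f})")
```

With the default threshold, only Tab_1.Student_ID = Tab_2.Student_ID is reported. Lowering the threshold to 0.25 would also report Tab_1.Age = Tab_2.Rank, because the value 18 happens to appear in both columns. Coincidences like this are one reason a real analysis must weigh many more factors than raw value overlap.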

Differences Between IntaLink and Large Model Technologies in Implementing NL2SQL

1) No training question set needs to be prepared, as it would be for a large model; relationships are derived through data analysis instead. IntaLink can therefore be applied to a wide range of data, and the more data it integrates, the greater its advantage.
2) It focuses on data integration, specifically on generating the relational conditions used during integration, rather than on how the data is used. Note: Data integration concerns establishing relationships between multiple tables, whereas data usage can take many forms, such as summation, counting, averaging, and minimum and maximum values. NL2SQL selects the appropriate operation, such as SUM, COUNT, AVG, MIN, or MAX, based on semantics.
3) High accuracy: Setting aside data quality issues, the relational conditions generated by IntaLink can in theory reach 100% accuracy.

Potential Combination of IntaLink and Large Model Technologies

Large model technologies excel at semantic understanding and content generation, while IntaLink has the advantage in data association analysis, with a lower upfront workload and higher accuracy. Ideally, the two would be combined: a large model interprets the user's request and converts it into the required data tables and items, IntaLink generates the linked datasets for them, and the large model then produces the desired outputs (e.g., reports and charts) for presentation to the user.

Join the IntaLink Community!

We would love for you to be a part of the IntaLink journey! Connect with us and contribute to our project:

🔗 GitHub Repository: IntaLink (https://github.com/YT-DATA/INTALINK)

💬 Join our Discord Community (https://discord.com/invite/FvhqEZ6z95)

Be a part of the open-source revolution and help us shape the future of intelligent data integration!

Transforming Data Linkage: An In-Depth Look at IntaLink

intalink — Tue, 08 Oct 2024 02:20:32 +0000

In-depth Analysis of IntaLink Data Auto-Linking Platform's Product Strength!

Hidden Gem, Yuantuo Data Intelligence

1. The Goal of IntaLink

In one sentence: IntaLink's goal is to achieve automatic data linkage in the field of data integration.

Let's break down this definition:

IntaLink's application scenario is for data integration. The simplest case is linking multiple data tables within the same system; the more complex case is linking data across heterogeneous sources.
For data integration applications, relationships between tables need to be established.
The data to be integrated must be able to form linkable relationships.

With the above conditions met, IntaLink’s goal is: Given the data tables and data items specified by the user, IntaLink will provide the available data linkage routes.

2. The Role of IntaLink

Let's explain the problem IntaLink solves through a specific scenario. This example is complex and requires careful consideration to understand the data relationships, which highlights IntaLink's value.

Scenario:

A university has different departments. Each department is identified by an abbreviation, and the table is defined as T_A. Sample data:

DEPARTMENT_ID	DEPART_NAME
GEO	School of Earth Sciences
IT	School of Information Engineering

Each department has several classes, and each class has a unique ID based on the enrollment year and a class number. This table is T_B. Sample data:

CLASSES_ID	CLASSES_NAME	DEPARTMENT
2020_01	Earth Sciences Class 1 (2020)	GEO
2020_02	Earth Sciences Class 2 (2020)	GEO

Each class has students, and each student has a unique ID. This table is T_C. Sample data:

STUDENT_ID	STUDENT_NAME	CLASSES
202000001	Zhang San	2020_01
202000002	Li Si	2020_02

The university offers various courses. Each course has a course code, maximum score, and credits. This table is T_D. Sample data:

CLASS_CODE	CLASS_TITLE	FULL_SCORE	CREDIT
MATH_01	Advanced Math I	100	4

Different departments have different pass scores for the same course. This table is T_E. Sample data:

DEPARTMENT	CLASS	PASS_SCORE
GEO	MATH_02	60
IT	MATH_02	75

Different semesters offer different courses, and students have scores for each course. This table is T_F. Sample data:

STUDENT_ID	TERM	CLASS	SCORE
202000001	2023_1	MATH_02	85

Based on this scenario, the requirement is to list each student’s courses for the 2023_1 semester, showing their score and the passing score. The result might look like this:

Class	Name	Term	Course	Pass Score	Score
Earth Sciences Class 1 (2020)	Zhang San	2023_1	Advanced Math II	60	85

The critical challenge lies in determining which tables to link and ensuring the relationships between tables are correctly interpreted. For example, a student is not directly linked to a department but to a class, and the class belongs to a department.
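To show what the hand-written version of this query involves, the Python sketch below loads the sample rows into an in-memory SQLite database and joins them. The MATH_02 row in T_D is an assumption: the sample data lists only MATH_01, so the title is taken from the result table and the FULL_SCORE and CREDIT values are invented. The join conditions are written by hand and are not produced by IntaLink.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T_B (CLASSES_ID TEXT, CLASSES_NAME TEXT, DEPARTMENT TEXT);
CREATE TABLE T_C (STUDENT_ID TEXT, STUDENT_NAME TEXT, CLASSES TEXT);
CREATE TABLE T_D (CLASS_CODE TEXT, CLASS_TITLE TEXT, FULL_SCORE INTEGER, CREDIT INTEGER);
CREATE TABLE T_E (DEPARTMENT TEXT, CLASS TEXT, PASS_SCORE INTEGER);
CREATE TABLE T_F (STUDENT_ID TEXT, TERM TEXT, CLASS TEXT, SCORE INTEGER);

INSERT INTO T_B VALUES ('2020_01', 'Earth Sciences Class 1 (2020)', 'GEO');
INSERT INTO T_B VALUES ('2020_02', 'Earth Sciences Class 2 (2020)', 'GEO');
INSERT INTO T_C VALUES ('202000001', 'Zhang San', '2020_01');
INSERT INTO T_C VALUES ('202000002', 'Li Si', '2020_02');
INSERT INTO T_D VALUES ('MATH_01', 'Advanced Math I', 100, 4);
-- Assumed row: MATH_02 is not in the article's T_D sample data.
INSERT INTO T_D VALUES ('MATH_02', 'Advanced Math II', 100, 4);
INSERT INTO T_E VALUES ('GEO', 'MATH_02', 60);
INSERT INTO T_E VALUES ('IT', 'MATH_02', 75);
INSERT INTO T_F VALUES ('202000001', '2023_1', 'MATH_02', 85);
""")

# Student -> class -> department, then department + course -> pass score.
query = """
SELECT b.CLASSES_NAME, c.STUDENT_NAME, f.TERM, d.CLASS_TITLE, e.PASS_SCORE, f.SCORE
FROM T_F AS f
JOIN T_C AS c ON c.STUDENT_ID = f.STUDENT_ID
JOIN T_B AS b ON b.CLASSES_ID = c.CLASSES
JOIN T_D AS d ON d.CLASS_CODE = f.CLASS
JOIN T_E AS e ON e.DEPARTMENT = b.DEPARTMENT AND e.CLASS = f.CLASS
WHERE f.TERM = '2023_1'
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Five tables and five join conditions are needed for a six-column result. T_A is not needed at all, because the department name is not part of the output.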

3. Problems Solved by IntaLink

You might think this is just a standard multi-table data linkage application that can be easily achieved with SQL queries. However, the real challenge is identifying which tables to use, especially when the system comprises numerous tables and fields across different applications.

For instance, imagine a university with dozens of application systems, each containing numerous tables. Non-IT staff who request data might not know which table contains what they need. IntaLink automatically generates the necessary links between the data tables, reducing the complexity of data analysis and saving significant development time.
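Once inter-table relationships are known, finding a linkage route between two tables can be treated as a path search over a graph. The Python sketch below uses a plain breadth-first search over the scenario's tables. The `relations` list is written by hand from the scenario, and the helper names are hypothetical; this only illustrates the idea of a linkage route and is not how IntaLink computes or ranks its paths.

```python
from collections import deque

# Hand-written relationships between the scenario tables (an assumption for
# this sketch). Each entry: (table, column, table, column).
relations = [
    ("T_A", "DEPARTMENT_ID", "T_B", "DEPARTMENT"),
    ("T_B", "CLASSES_ID", "T_C", "CLASSES"),
    ("T_C", "STUDENT_ID", "T_F", "STUDENT_ID"),
    ("T_D", "CLASS_CODE", "T_F", "CLASS"),
    ("T_D", "CLASS_CODE", "T_E", "CLASS"),
    ("T_A", "DEPARTMENT_ID", "T_E", "DEPARTMENT"),
]


def build_graph(rels):
    """Build an undirected adjacency list: table -> [(neighbor, join condition)]."""
    graph = {}
    for lt, lc, rt, rc in rels:
        condition = f"{lt}.{lc} = {rt}.{rc}"
        graph.setdefault(lt, []).append((rt, condition))
        graph.setdefault(rt, []).append((lt, condition))
    return graph


def shortest_link_path(graph, start, goal):
    """Breadth-first search; returns the join conditions from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for neighbor, condition in graph.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [condition]))
    return None


graph = build_graph(relations)
path = shortest_link_path(graph, "T_C", "T_A")
print(path)
```

For the student table T_C and the department table T_A, the search goes through the class table T_B, which matches the observation that a student is linked to a department only via a class.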

Conclusion

IntaLink solves the following key challenges:

No need to understand the underlying business logic: just focus on the data integration goal.
No need to manually identify which tables to link: IntaLink determines the relationships.
Significantly less time spent on data analysis and development, improving efficiency by more than 10 times.
