<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: RogerWoods</title>
    <description>The latest articles on DEV Community by RogerWoods (@rogerwoods).</description>
    <link>https://dev.to/rogerwoods</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1147014%2F9ece40bd-b385-40c5-86c6-a68a9146b2ca.png</url>
      <title>DEV Community: RogerWoods</title>
      <link>https://dev.to/rogerwoods</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rogerwoods"/>
    <language>en</language>
    <item>
      <title>How proficient is generative AI at transforming natural language into SQL?</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Sat, 16 Dec 2023 15:02:00 +0000</pubDate>
      <link>https://dev.to/rogerwoods/how-proficient-is-generated-ai-in-transforming-text-or-natural-language-into-sql-3i5d</link>
      <guid>https://dev.to/rogerwoods/how-proficient-is-generated-ai-in-transforming-text-or-natural-language-into-sql-3i5d</guid>
      <description>&lt;p&gt;SQL, as the most important standard syntax in the field of data analysis, has been widely applied by people. Every developer, data engineer, and database administrator must learn SQL in order to interact with databases. Although SQL is not a very difficult syntax, it involves many details. Most developers may struggle with simple Join and feel hard to write high-quality and efficient SQL statements.&lt;/p&gt;

&lt;p&gt;Believe it or not, with the advent of large language models like ChatGPT, generating accurate SQL from natural language may make hand-writing SQL less crucial. I won't claim that AI will make SQL disappear, but it will at least relieve people from the agony of meticulously crafting every statement, saving a significant amount of time, especially for programmers who are not fluent in SQL. Let me show you some examples of what AI can do now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start From Create Table and Add Data
&lt;/h3&gt;

&lt;p&gt;First of all, you need a clear idea of the tables and fields you want to create. You can ask GPT for suggestions, but in daily production systems we usually already know the table structure. Let's take student school performance as an example. We need tables that look like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table Student&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;student_number&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;birthday&lt;/th&gt;
&lt;th&gt;gender&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0001&lt;/td&gt;
&lt;td&gt;John&lt;/td&gt;
&lt;td&gt;1995-06-04&lt;/td&gt;
&lt;td&gt;male&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Table Course&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;course_number&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;teacher_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0001&lt;/td&gt;
&lt;td&gt;Math&lt;/td&gt;
&lt;td&gt;0002&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Table Score&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;student_number&lt;/th&gt;
&lt;th&gt;course_number&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0001&lt;/td&gt;
&lt;td&gt;0001&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Table Teacher&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;teacher_number&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0001&lt;/td&gt;
&lt;td&gt;William&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It's tedious to type out so many field names and create the data by hand. Well, let GPT help us.&lt;br&gt;
Just type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create two tables for me: student with student_num, name, birthday and gender. course with course number, name and teacher number 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The description is not very precise, but the result is very good. It even chooses the field types and character lengths for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bhsvVDFM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1lorxls3bv8z3uk78560.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bhsvVDFM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1lorxls3bv8z3uk78560.png" alt="createtable" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;
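&lt;p&gt;The screenshot is the authoritative output; as a rough sketch (the field types and lengths below are my guesses, not necessarily what GPT chose), the generated DDL looks something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Plausible reconstruction; types and lengths are assumptions
CREATE TABLE student (
    student_number VARCHAR(10) PRIMARY KEY,
    name           VARCHAR(50),
    birthday       DATE,
    gender         VARCHAR(10)
);

CREATE TABLE course (
    course_number  VARCHAR(10) PRIMARY KEY,
    name           VARCHAR(50),
    teacher_number VARCHAR(10)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;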

&lt;p&gt;Then we generate the other two tables with the same kind of plain sentence. One small detail still amazes me: in the table &lt;strong&gt;Score&lt;/strong&gt;, it uses student_number and course_number as a composite primary key! How does it know to do that?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pxr0VHkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mubgyjctvkn7q7h75dop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pxr0VHkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mubgyjctvkn7q7h75dop.png" alt="createattable" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;
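&lt;p&gt;For reference, a Score table with the composite primary key the AI chose would be declared roughly like this (the screenshot is authoritative; the types are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch of the generated table with a composite primary key
CREATE TABLE score (
    student_number VARCHAR(10),
    course_number  VARCHAR(10),
    score          INT,
    PRIMARY KEY (student_number, course_number)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;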

&lt;p&gt;For testing, I need to create some data and insert it into the tables. It is tedious to invent so many names, ages, scores, and so on. Don't worry, we can use AI for this too, though people seldom do this in a production environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fake about 20 data randomly for all the for tables for me, and make the data is correlated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sWEOLk5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drzgs5s0hajsldd6nlsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sWEOLk5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drzgs5s0hajsldd6nlsq.png" alt="fakedata" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WufLWFHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lw9n6dweb2wk0jtljq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WufLWFHD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lw9n6dweb2wk0jtljq0.png" alt="fakedata2" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;
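&lt;p&gt;The generated INSERT statements have roughly this shape; the names and values below are illustrative, not the ones GPT actually produced:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch of the generated inserts; values are made up
INSERT INTO student (student_number, name, birthday, gender)
VALUES ('0001', 'John', '1995-06-04', 'male'),
       ('0002', 'Mary', '1996-02-11', 'female');

INSERT INTO score (student_number, course_number, score)
VALUES ('0001', '0001', 90),
       ('0002', '0001', 85);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;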

&lt;p&gt;Wow, we have table and data now. Let's do something with the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Data Query
&lt;/h3&gt;

&lt;p&gt;First, let's just view all the data in one table. Easy, right?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show student data for me
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZErtDwWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6l3dxjaveahpc736l3db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZErtDwWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6l3dxjaveahpc736l3db.png" alt="show" width="800" height="384"&gt;&lt;/a&gt;&lt;br&gt;
Not surprisingly, it shows the right result. OK, let's add some difficulty and ask it for a fuzzy query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show teacher name endwith n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kDwnSQf1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e86y5q6y25sse03slk5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kDwnSQf1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e86y5q6y25sse03slk5w.png" alt="like" width="800" height="567"&gt;&lt;/a&gt;&lt;br&gt;
It clearly uses the keyword LIKE.&lt;/p&gt;
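&lt;p&gt;In other words, the generated query is presumably of this shape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Sketch of the fuzzy query from the screenshot
SELECT name
FROM teacher
WHERE name LIKE '%n';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;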
&lt;h3&gt;
  
  
  Complex Data Query with Multiple tables
&lt;/h3&gt;

&lt;p&gt;Querying a single table may not be that hard, so let me try some more complex syntax. First, in the software, select the tables you want to query instead of the whole database, and write a sentence that requires multiple tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;calculate all the average score of students and tell me the students name whose scroe is large than 60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sK6Xf_mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arw1228k80uyr9exn63j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sK6Xf_mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arw1228k80uyr9exn63j.png" alt="avg" width="800" height="869"&gt;&lt;/a&gt;&lt;br&gt;
The AI uses JOIN, GROUP BY, HAVING, and the AVG function to return the correct results.&lt;br&gt;
Next I want to try a nested subquery.&lt;br&gt;
&lt;/p&gt;
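&lt;p&gt;Before moving on: the join-group-filter query described above looks roughly like this (a sketch; the aliases are mine, the screenshot is authoritative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Average score per student, keeping averages above 60
SELECT s.name, AVG(sc.score) AS avg_score
FROM student s
JOIN score sc ON s.student_number = sc.student_number
GROUP BY s.student_number, s.name
HAVING AVG(sc.score) &gt; 60;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;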

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query all the students name number whose courses score are all under 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hYzBiNLw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjook34vhw5fosd2ecun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hYzBiNLw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjook34vhw5fosd2ecun.png" alt="allscore" width="800" height="695"&gt;&lt;/a&gt;&lt;br&gt;
As expected, a subquery appears here. But why does it use NOT IN instead of IN? Let's check with the AI by asking it to optimize the SQL. Now it uses IN and adds a NULL check, so the AI's first answer may not always be the best one.&lt;/p&gt;
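&lt;p&gt;Roughly, the two variants follow these patterns (a sketch of the idea, not the AI's exact output):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- First answer: NOT IN excludes anyone with a score of 80 or above
SELECT name
FROM student
WHERE student_number NOT IN (
    SELECT student_number FROM score WHERE score &gt;= 80
);

-- Optimized answer: IN plus a NULL check
SELECT name
FROM student
WHERE student_number IN (
    SELECT student_number
    FROM score
    WHERE score IS NOT NULL
    GROUP BY student_number
    HAVING MAX(score) &lt; 80
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;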

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I-Gs5nQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ricwi3dvw10puudvjzmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I-Gs5nQ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ricwi3dvw10puudvjzmx.png" alt="aiin" width="800" height="975"&gt;&lt;/a&gt;&lt;br&gt;
Let's look at something more complex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query all the students who has not learned all the courses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SeYYnKFX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5stxcmm56gfxw46kg3eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SeYYnKFX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5stxcmm56gfxw46kg3eh.png" alt="allcourses" width="800" height="741"&gt;&lt;/a&gt;&lt;br&gt;
Wow. Most people would find it hard to write SQL like this by hand.&lt;/p&gt;
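&lt;p&gt;One common way to express "has not taken every course", which the generated query presumably resembles, is to compare each student's course count against the total number of courses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Students whose number of scored courses is below the course total
SELECT s.student_number, s.name
FROM student s
WHERE (SELECT COUNT(*)
       FROM score sc
       WHERE sc.student_number = s.student_number)
    &lt; (SELECT COUNT(*) FROM course);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;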
&lt;h3&gt;
  
  
  Use of Function
&lt;/h3&gt;

&lt;p&gt;As we saw above with AVG, the AI can call SQL functions to produce the results you want, and it knows many more of them. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query all students ages by year
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0jI3xhGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qdk2bxn0i27t8okscxu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0jI3xhGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qdk2bxn0i27t8okscxu5.png" alt="date" width="800" height="741"&gt;&lt;/a&gt;&lt;br&gt;
We can see it uses the TIMESTAMPDIFF function. Now let me ask for something more complex.&lt;br&gt;
&lt;/p&gt;
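&lt;p&gt;For the age query above, the TIMESTAMPDIFF call would look roughly like this (MySQL syntax; a sketch, not the exact generated statement):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Age in whole years from the birthday column
SELECT name,
       TIMESTAMPDIFF(YEAR, birthday, CURDATE()) AS age
FROM student;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;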

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ranking the students scores by the course number and show course numbers, student number, score and ranking number.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e3Drk5dl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygp776w22bqbluovs37w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e3Drk5dl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygp776w22bqbluovs37w.png" alt="ranking" width="800" height="938"&gt;&lt;/a&gt;&lt;br&gt;
Wow! Do you see the use of ROW_NUMBER() with the OVER keyword? Could you write SQL like this off the top of your head?&lt;/p&gt;
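&lt;p&gt;The ranking query in the screenshot presumably follows this window-function pattern (a sketch; the alias names are mine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Rank students within each course by descending score
SELECT course_number,
       student_number,
       score,
       ROW_NUMBER() OVER (
           PARTITION BY course_number
           ORDER BY score DESC
       ) AS ranking
FROM score;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;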

&lt;p&gt;Many people doubt the ability of generative AI before they use it, as I once did. But the more I use it, the more it surprises me.&lt;/p&gt;

&lt;p&gt;Tip: the examples above were written with a tool called &lt;a href="https://tablechatai.com/"&gt;TableChat&lt;/a&gt;, which generates SQL from your table structure and runs it so you can see the results.&lt;/p&gt;

</description>
      <category>developers</category>
      <category>dataengineering</category>
      <category>ai</category>
      <category>database</category>
    </item>
    <item>
      <title>Why do we still need to create an NL2SQL product?</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Fri, 08 Dec 2023 09:07:16 +0000</pubDate>
      <link>https://dev.to/rogerwoods/why-do-we-still-need-to-create-an-nl2sql-product-g8h</link>
      <guid>https://dev.to/rogerwoods/why-do-we-still-need-to-create-an-nl2sql-product-g8h</guid>
      <description>&lt;p&gt;The concept of NL2SQL exists long before the popularity of large language models（LLMs）. Instead of being called NL2SQL, it also has other names such as Text2SQL and AI2SQL. This concept is not rare, and there are many products with similar concept. So, why do we still decide to create &lt;a href="https://tablechatai.com/"&gt;TableChat&lt;/a&gt;, an AI generated based SQL IDE tool？&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs Are Coming
&lt;/h2&gt;

&lt;p&gt;Before LLMs, people used techniques like word segmentation and RNNs to process and understand queries, and built Q&amp;amp;A systems on knowledge graphs to answer specific questions. However, these approaches required special keywords in the query, such as fields tied to the database table structure, so the products performed poorly in search coverage, accuracy, and quality of results, which severely limited their adoption. With the advent of large language models like ChatGPT and other generative AI models, the ability to understand queries and generate content has improved dramatically, greatly enhancing the usability of such products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W8BTPTt_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ngik8ytfo61chpe7v4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W8BTPTt_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ngik8ytfo61chpe7v4u.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why don't we use ChatGPT?
&lt;/h2&gt;

&lt;p&gt;Given the effectiveness of ChatGPT, why don't we just use ChatGPT directly, or even build a custom GPT, to solve the SQL generation problem? Why do we need a dedicated product? Indeed, after the emergence of LLMs, many products immediately tapped GPT's capabilities to help users generate SQL. They generally look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rzmG1yoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqa333bdatiflokqf7uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rzmG1yoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fqa333bdatiflokqf7uu.png" alt="Image description" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These products keep the form of a conversation: they just send some conversational context to GPT, and users get back generic SQL statements, like a thin ChatGPT shell. Because the user's table schema is not incorporated, the generated SQL cannot be used directly. To use these statements, I have to spend considerable time switching between my IDE and GPT. What's more, we need not just SQL generation but the full set of features for working with a database. And honestly, aren't we all quite fed up with these dialogue boxes?&lt;/p&gt;

&lt;h2&gt;
  
  
  What we aim to do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;TableChat is still an IDE, but a super-intelligent one. AI capability is essential to us, but AI is not everything. &lt;/li&gt;
&lt;li&gt;We correlate LLMs with database schemas to obtain production-ready SQL statements. &lt;/li&gt;
&lt;li&gt;We don't create traditional SQL IDEs with numerous features that are rarely used. We only develop the most essential features. &lt;/li&gt;
&lt;li&gt;Our goal is not just for database development, we also aim to understand and visualize data better. People don't need to put everything into Excel for visualization anymore. &lt;/li&gt;
&lt;li&gt;We target not only database engineers but all developers, including database administrators, data engineers, and backend developers. Therefore, our product includes not only database management but also database-related code generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VcoZz5GV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff6mhj747r50ogsrjb8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VcoZz5GV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff6mhj747r50ogsrjb8y.png" alt="Image description" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we have already done
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Text2SQL generation&lt;/li&gt;
&lt;li&gt;SQL editing, executing, and AI debugging &lt;/li&gt;
&lt;li&gt;Data insights&lt;/li&gt;
&lt;li&gt;Database-related code generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what we look like now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--srm_SyKE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tngspaxo5z4bnufu7hny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--srm_SyKE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tngspaxo5z4bnufu7hny.png" alt="Image description" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tablechatai.com/"&gt;TableChat&lt;/a&gt; is just getting started, and there is much more to be done. Everyone is welcome to try our product; we appreciate any feedback!&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>developers</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Analysis with ChatGPT Plugin Noteable</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Sun, 12 Nov 2023 14:55:44 +0000</pubDate>
      <link>https://dev.to/rogerwoods/data-analysis-with-chatgpt-plugin-noteable-5b32</link>
      <guid>https://dev.to/rogerwoods/data-analysis-with-chatgpt-plugin-noteable-5b32</guid>
      <description>&lt;p&gt;The Noteable ChatGPT plugin is a third-party ChatGPT plugin developed by the collaborative data notebook platform &lt;a href="http://noteable.io/"&gt;Noteable.io&lt;/a&gt;. This plugin seamlessly integrates the natural language processing capabilities of ChatGPT with data notebooks (similar to Jupyter Notebooks) that allow users to create and share documents containing real-time code, equations, visualizations, and annotated text. It finds extensive applications in areas such as data cleaning and transformation, data analysis, numerical simulation, statistical modeling, data visualization, and machine learning.&lt;/p&gt;

&lt;p&gt;With the Noteable ChatGPT plugin, users can command ChatGPT through conversation to load datasets, perform exploratory data analysis, create visualizations, run machine learning models, and more—all within the collaborative Jupyter notebook environment that can be shared with others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing the Noteable Plugin&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensure that you have ChatGPT Plus, the paid version of ChatGPT, so you can use GPT-4 and install plugins.&lt;/li&gt;
&lt;li&gt;Visit the ChatGPT plugin store and search for the "Noteable" plugin to install it.&lt;/li&gt;
&lt;li&gt;After clicking the "Install" button, a login page for your Noteable account will appear. Connect your Noteable account to ChatGPT. If the login page doesn't appear, you can visit Noteable.io and register for a free account.&lt;/li&gt;
&lt;li&gt;Once your Noteable account is connected, the Noteable plugin will be activated automatically, and the Noteable logo will appear below the GPT version selector.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ln6KkiP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1i2lr28mr94jsckpi3um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ln6KkiP4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1i2lr28mr94jsckpi3um.png" alt="Installed" width="702" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Noteable Project&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to Noteable, click the "Create" button, and create a project.&lt;/li&gt;
&lt;li&gt;Name the project and copy the URL link, then pass it to your ChatGPT following the provided instructions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_joF2qfk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hipd00vp4zdr4m13bfdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_joF2qfk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hipd00vp4zdr4m13bfdy.png" alt="creating" width="697" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importing Data&lt;/strong&gt; &lt;br&gt;
Noteable allows users to easily import data from various sources into notebooks, including uploading CSV files, Excel spreadsheets, and connecting to databases like Postgres and MySQL. In the example, the Titanic dataset is imported.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o9COZnlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmh4kxi7per3kkfdyygr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o9COZnlw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmh4kxi7per3kkfdyygr.png" alt="Imported" width="606" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated EDA Analysis:&lt;/strong&gt; ChatGPT can perform exploratory data analysis (EDA) to provide an overview of the dataset.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mTIYevea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sjfmta6390d7pw6gzujj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mTIYevea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sjfmta6390d7pw6gzujj.png" alt="EDA" width="593" height="767"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Generation:&lt;/strong&gt; Generated code for the EDA analysis is automatically visible in the Noteable project.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhCDGxdP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2uaudm0nvym5d8vxsqof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhCDGxdP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2uaudm0nvym5d8vxsqof.png" alt="CodeGenerated" width="694" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performing Additional Data Analysis&lt;/strong&gt; &lt;br&gt;
If needed, users can continue the analysis in ChatGPT. For example, ChatGPT can be asked to provide a machine learning model for predicting survivors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6W9dQH-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qrn0cy2kupmd2b57u7wc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6W9dQH-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qrn0cy2kupmd2b57u7wc.png" alt="DataAnalysis" width="606" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This interactive language-based approach enables the automation of various data analysis tasks, including data analysis, machine learning, visualization, model generation, and even web scraping within the Noteable environment.&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Different Type of Data Integration Tools People Use</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Fri, 10 Nov 2023 15:53:54 +0000</pubDate>
      <link>https://dev.to/rogerwoods/different-type-of-data-integration-tools-people-use-oob</link>
      <guid>https://dev.to/rogerwoods/different-type-of-data-integration-tools-people-use-oob</guid>
      <description>&lt;p&gt;Data integration technology has rapidly iterated along with the development of the big data technology stack. It has evolved from early offline data integration to gradually include real-time data integration, giving rise to an increasing number of excellent products.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Offline Big Data Integration&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SQOOP
&lt;/h3&gt;

&lt;p&gt;Apache SQOOP is a specialized tool that facilitates seamless data transfer between HDFS and various structured data repositories. These repositories could include relational databases, enterprise data warehouses, and NoSQL systems. SQOOP operates through a connector architecture, which employs plugins to enhance data connections with external systems, ensuring efficient data migration. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vS2EQ8v6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mstrw3fov7gh0nzy4uls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vS2EQ8v6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mstrw3fov7gh0nzy4uls.png" alt="sqoop" width="585" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Datax
&lt;/h3&gt;

&lt;p&gt;DataX is an offline data synchronization tool/platform widely used within Alibaba that covers virtually all common data stores, supporting synchronization between heterogeneous data sources. As an offline synchronization framework, DataX adopts a Framework + plugin architecture: reading and writing for each data source are abstracted into Reader/Writer plugins that plug into the overall synchronization framework.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T17wAFBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07b16hute1jysqzur8jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T17wAFBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/07b16hute1jysqzur8jv.png" alt="datax" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Real-time incremental data integration&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Canal
&lt;/h3&gt;

&lt;p&gt;Canal's primary purpose is to parse incremental logs (binlogs) from MySQL databases and provide incremental data subscription and consumption. It currently supports MySQL 5.1.x, 5.5.x, 5.6.x, 5.7.x, and 8.0.x.&lt;/p&gt;
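
&lt;p&gt;Because Canal masquerades as a MySQL replica, the source instance must have row-based binary logging enabled; a minimal &lt;code&gt;my.cnf&lt;/code&gt; sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[mysqld]
log-bin=mysql-bin   # enable binary logging
binlog-format=ROW   # Canal requires row-based binlog events
server_id=1         # must be unique within the replication topology
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The account Canal connects with also needs the REPLICATION SLAVE and REPLICATION CLIENT privileges.&lt;/p&gt;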

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Asxm4gkX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cbyl85aykhyygf6km93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Asxm4gkX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cbyl85aykhyygf6km93.png" alt="canal" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FlinkCDC
&lt;/h3&gt;

&lt;p&gt;In traditional CDC-based ETL analysis, the process involves first collecting the data, then relying on an external message queue (MQ) for delivery, performing calculations after downstream consumption, and finally storing the results; the overall pipeline is relatively long. The core idea of FlinkCDC is to shorten this pipeline: it integrates Debezium at the lower level for binlog collection, eliminates the need for an MQ, and performs the calculations directly in Flink. The entire pipeline runs on the Flink ecosystem, giving it a clearer structure.&lt;/p&gt;
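
&lt;p&gt;With the Flink CDC connectors, this shortened pipeline can be expressed directly in Flink SQL. A sketch using the mysql-cdc connector (hostname, credentials, and schema are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- ingest the MySQL table as a changelog stream (no MQ in between)
CREATE TABLE orders (
  id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',   -- reads the binlog via Debezium under the hood
  'hostname' = 'db.example.com',
  'port' = '3306',
  'username' = 'flink_user',
  'password' = '***',
  'database-name' = 'sales',
  'table-name' = 'orders'
);

-- downstream computation runs directly on the change stream
SELECT COUNT(*) AS order_count FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;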

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T4S3wZq4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzj05pc6xux4dqy25jm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T4S3wZq4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lzj05pc6xux4dqy25jm1.png" alt="flinkcdc" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;New Real-time data integration&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Airbyte
&lt;/h3&gt;

&lt;p&gt;Airbyte is an open-source data integration engine that lets you build a reliable data pipeline (with Change Data Capture support) in a matter of minutes, synchronizing data from sources such as databases, data warehouses, and data lakes to a destination. Following the modern ELT approach, Airbyte focuses on the Extract and Load phases and delegates transformation work to dbt. Its robust open-source ecosystem supports 200+ connectors, with ongoing additions according to the product development roadmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qdhBliAu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hcn66lpg7e13n63z7ce6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qdhBliAu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hcn66lpg7e13n63z7ce6.png" alt="airbytes" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Fivetran
&lt;/h3&gt;

&lt;p&gt;Fivetran connects to all of your supported data sources and loads the data from them into your destination. Each data source has one or more connectors that run as independent processes that persist for the duration of one update. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tM0l5HXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jm21y0j2kmrenyevz34n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tM0l5HXF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jm21y0j2kmrenyevz34n.png" alt="fivetrans" width="800" height="912"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Big data integration products span many different eras and include &lt;strong&gt;Debezium, Maxwell, FlinkX, SeaTunnel, Stitch, Singer, and Meltano&lt;/strong&gt;. There is no absolute 'best' option; the right choice depends on specific use cases and requirements. Each product has its unique features, applicable scope, and strengths, so users should choose based on their specific data integration needs, technology stack, and preferences.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AutoGPT gives up vector databases, Do we still need them？</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Thu, 09 Nov 2023 03:17:59 +0000</pubDate>
      <link>https://dev.to/rogerwoods/autogpt-gives-up-vector-databases-do-we-still-need-them--58ga</link>
      <guid>https://dev.to/rogerwoods/autogpt-gives-up-vector-databases-do-we-still-need-them--58ga</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rmnTEyUl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rur6d1f48x8qmps95zc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rmnTEyUl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rur6d1f48x8qmps95zc4.png" alt="vector database" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generative AI has driven the popularity of vector databases, but the technological landscape seems to be changing rapidly. As one of the world's most renowned AI projects, AutoGPT has announced its decision to no longer utilize vector databases, a move that may come as a surprise to many. After all, from the outset, vector databases have consistently supported the long-term memory of AI agents.&lt;/p&gt;

&lt;p&gt;Why has this fundamental design approach changed? What new solution is set to replace it? Are vector databases essential for large-scale model applications?&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoGPT gives up vector databases
&lt;/h3&gt;

&lt;p&gt;AutoGPT, an 'AI agent' in the same vein as LlamaIndex and LangChain, was released on March 30th this year and made a significant impact immediately: within just 7 days of going live, it garnered 44,000 stars on GitHub. In contrast to the conventional workflow of repeatedly feeding prompts into a model, AutoGPT can work autonomously, planning tasks, breaking problems down into smaller components, and executing them individually. It is undoubtedly an ambitious initiative.&lt;/p&gt;

&lt;p&gt;The design concept of AutoGPT involved a method of managing an AI agent's memory in an embedded format, along with a set of vector databases for storing and retrieving memories when necessary. From that perspective, the vector databases were considered a crucial part of the entire solution. Moreover, other Artificial General Intelligence (AGI) projects have also adopted similar methods, such as BabyAGI.&lt;/p&gt;

&lt;p&gt;Initially, AutoGPT supported five storage modes by default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LocalCache (renamed to JSONFileMemory)&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;Milvus&lt;/li&gt;
&lt;li&gt;Pinecone&lt;/li&gt;
&lt;li&gt;Weaviate&lt;/li&gt;
&lt;/ul&gt;
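
&lt;p&gt;At the time, the backend was chosen through AutoGPT's &lt;code&gt;.env&lt;/code&gt; file, roughly like this (a historical sketch of the old configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# one of: local, redis, milvus, pinecone, weaviate
MEMORY_BACKEND=local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;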

&lt;p&gt;However, reviewing AutoGPT's documentation now reveals a surprising warning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rQQr4gfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59zsvd6cisb08jtavgxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rQQr4gfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59zsvd6cisb08jtavgxt.png" alt="Warning" width="640" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AutoGPT has recently undergone a 'Vector Memory Refactor,' removing all vector database implementations, including Milvus, Pinecone, and Weaviate, and retaining only a few classes responsible for memory management. JSON files are now the default method for storing memories/embeddings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why?
&lt;/h3&gt;

&lt;p&gt;In May this year, maintainer Reinier van der Leer raised a query on GitHub asking for opinions on the 'value of adding different storage modes.' They were contemplating a refactor and intended to discard everything except the 'local' memory provider (now known as json_file) while striving to implement a Redis VectorMemoryProvider.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sn1ZRNry--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d06kvq31gxns5u3eixha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sn1ZRNry--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d06kvq31gxns5u3eixha.png" alt="Opinion" width="640" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some developers expressed agreement, suggesting that if the backend is good enough, there's no reason to retain these vector databases. 'But I suggest integrating Pinecone (or Redis if it's advantageous) into a customized JSONFileMemory.'&lt;/p&gt;

&lt;p&gt;As of now, AutoGPT's choice to "abandon" vector databases likely stems from the realization that the operational and usage costs of employing these databases outweigh their benefits. Under these circumstances, building a solution from scratch aligns better with the long-term gains of the project. After all, in software development, premature optimization can lead to high development costs and risks, resulting in uncontrollable software complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do we still need vector databases?
&lt;/h3&gt;

&lt;p&gt;For scenarios requiring storage of vast amounts of vectors, such as extensive image or audiovisual retrieval, it's evident that using a vector database can offer more powerful and specialized functionalities. However, for scenarios with less substantial data volumes, employing libraries like Numpy in Python for computations might be more efficient and convenient. Within the realm of vector databases, there are various types, including lightweight and heavyweight options. Choosing between utilizing plugins like pgvector on PostgreSQL or opting for a dedicated distributed vector database necessitates specific application analysis before making a decision.&lt;/p&gt;
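
&lt;p&gt;For small corpora, the NumPy alternative mentioned above is only a few lines: brute-force cosine similarity over an in-memory matrix. The sketch below uses toy 3-dimensional embeddings; real embeddings would come from an embedding model.&lt;/p&gt;

```python
import numpy as np

def top_k_cosine(query, embeddings, k=3):
    """Return the indices of the k rows of `embeddings` most similar to `query`."""
    # Normalize rows so that dot products become cosine similarities.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = emb_norm @ q_norm
    # argsort is ascending: take the last k indices, then reverse for descending order.
    return np.argsort(sims)[-k:][::-1]

# Toy corpus of four 3-d "embeddings".
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
print(top_k_cosine(np.array([1.0, 0.05, 0.0]), corpus, k=2))  # prints [0 1]
```

&lt;p&gt;At this scale an exhaustive scan is effectively instant; a dedicated vector database only starts to pay off once the corpus and query load outgrow a single machine's memory.&lt;/p&gt;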

&lt;p&gt;As far as we currently know, not only AutoGPT but also other projects such as GPT Engineer, GPT Pilot, and even GitHub Copilot refrain from using vector databases. Instead, they derive contextual relevance from recent files, proximity within the file system, or references to specific classes/functions.&lt;/p&gt;

&lt;p&gt;The decision to use vector databases depends on the specific context, and AutoGPT's abandonment of vector databases marks an important step in the right direction. This move reflects a focus on delivering value rather than getting bogged down in the technical intricacies.&lt;/p&gt;

</description>
      <category>database</category>
      <category>chatgpt</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Operations With Time Types in a Database</title>
      <dc:creator>RogerWoods</dc:creator>
      <pubDate>Thu, 09 Nov 2023 02:47:20 +0000</pubDate>
      <link>https://dev.to/rogerwoods/operations-with-time-types-in-a-database-2i63</link>
      <guid>https://dev.to/rogerwoods/operations-with-time-types-in-a-database-2i63</guid>
      <description>&lt;h2&gt;
  
  
  Addition and Subtraction of Days, Months, and Years
&lt;/h2&gt;

&lt;p&gt;In Oracle, when dealing with date types, you can directly add or subtract days. However, when it comes to manipulating months, you would use the add_months function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT hiredate AS Hire Date,
          hiredate - 5 AS Minus 5 Days,
          hiredate + 5 AS Plus 5 Days,
          add_months(hiredate, -5) AS Minus 5 Months,
          add_months(hiredate, 5) AS Plus 5 Months,
          add_months(hiredate, -5 * 12) AS Minus 5 Years,
          add_months(hiredate, 5 * 12) AS Plus 5 Years
     FROM emp
     WHERE ROWNUM &amp;lt;= 1; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Addition and Subtraction of Hours, Minutes, and Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT hiredate AS Hire Date,
           hiredate - 5 / 24 / 60 / 60 AS Minus 5 Seconds,
           hiredate + 5 / 24 / 60 / 60 AS Plus 5 Seconds,
           hiredate - 5 / 24 / 60 AS Minus 5 Minutes,
           hiredate + 5 / 24 / 60 AS Plus 5 Minutes,
           hiredate - 5 / 24 AS Minus 5 Hours,
           hiredate + 5 / 24 AS Plus 5 Hours
      FROM emp
     WHERE ROWNUM &amp;lt;= 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Time Intervals in Hours, Minutes, and Seconds
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT Interval_Days,
           Interval_Days * 24 AS Interval_Hours,
           Interval_Days * 24 * 60 AS Interval_Minutes,
           Interval_Days * 24 * 60 * 60 AS Interval_Seconds
    FROM( SELECT MAX(hiredate) - MIN(hiredate) AS Interval_Days
    FROM emp
    WHERE ename IN ('WARD','ALLEN'))X;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Time Intervals in Days, Months, and Years
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL&amp;gt; SELECT max_hd - min_hd AS Days_Interval,
           months_between(max_hd, min_hd) AS Months_Interval,
           months_between(max_hd, min_hd) / 12 AS Years_Interval
      FROM (SELECT min(hiredate) as min_hd, MAX(hiredate) as max_hd FROM emp) x;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
  </channel>
</rss>
