Gleb Otochkin

Posted on May 21 • Originally published at Medium on Feb 7

Talk to Your Data: Analyze Data in AlloyDB Using Natural Language

#data #postgres #alloydb #naturallanguageproce

SQL is language for your data

If you are working with databases as an analyst or developer you are probably quite familiar with SQL, or Structured Query Language. This is the language you use to work with data, extract and aggregate information, analyze and build the entire database backend. The language itself is easy and difficult at the same time depending on what you want to do and how complicated your data schema is. Modern AI models can translate natural language requests to SQL queries relatively well but the devil is in the details. Did I mention that SQL in Oracle can be slightly different from SQL in Postgres? And to write a good SQL query working correctly with your data you most likely need to dig down into the data, understand dependencies between tables and columns and how to combine them together to achieve the results.

So, how to make it reliable and make sure the AI model knows enough to make it working? That’s the main topic of this blog. AlloyDB has a new set of functionswhich can help you to create complex SQL queries using a natural language request.

Components

I am starting with some basic components and functions required to enable NL2SQL (Natural Language To SQL) capabilities in AlloyDB. Just to be on the safe side — at the time when the blog is written the feature is still in preview and some things can be changed in the final version.

The functionality is provided by the alloydb_ai_nl extension. You can read in the documentation how to enable the extension and make it working in your database.

Once the feature is enabled you can create a basic configuration by one simple command using alloydb_ai_nl.g_create_configuration function. That provides you basic functionality and you can already try to generate queries using alloydb_ai_nl.get_sql function. In this mode AlloyDB translates your natural language query to a SQL primarily based on tables and columns names making logical connections between different relations. That’s not too bad and you might get some results out of the box. But … if your data are not in the “public” schema then it is not useful since by default it checks only tables and views in that “public” schema. And what would happen if you have similar table names or same name columns in different tables? In such a case you really need to know your data. So, we have to give more information about the data and tables layout to the AlloyDB natural language processing.

Let us dive into the process and go through the general steps to make the best out of the feature.

The first step as we’ve already mentioned is to create a configuration using the alloydb_ai_nl.g_create_configuration function. That will create some kind of container for all future information about our data.

Then we register our schema in the configuration using alloydb_ai_nl.g_manage_configuration function. And when I say “schema” I mean the Postgres schema where you create your database objects. By default it is “public” but if you are serious about data separation and access you might use a dedicated schema for your application tables, indexes and other objects. Here is how you register schemas ecomm and public for the natural language configuration. In the example we have named our natural language configuration as cymbal_ecomm_config.

SELECT
  alloydb_ai_nl.g_manage_configuration(
    operation => 'register_schema',
    configuration_id_in => 'cymbal_ecomm_config',
    schema_names_in => '{ecomm,public}'
  );

We can configure one or multiple schemas in the same configuration. It can be useful when you want to build cross schema analytical queries for example.

When you register a schema in the configuration it is getting first knowledge about your data and can build queries based on that information. It will try to build the query based primarily on the tables metadata but it might not be enough. To make it more reliable and accurate we need to check the actual contents of the tables and their dependencies.

To build that information layer about your data we create a schema context. In the automatic mode it will analyze all your tables and columns in the registered schema trying to understand dependencies and what exactly is stored there. That is done using the alloydb_ai_nl.generate_schema_context function. Here is an example of generating context for our cymbal_ecomm_config configuration.

SELECT
  alloydb_ai_nl.generate_schema_context(
    nl_config_id => 'cymbal_ecomm_config',
    overwrite_if_exist => TRUE
  );

After the execution, which can take some time, the generated information will be stored in the internal tables but not yet applied. You can review it before applying using the alloydb_ai_nl.generated_schema_context_view.

SELECT schema_object, object_context
FROM alloydb_ai_nl.generated_schema_context_view;

And you can be more specific and read the generated context for a particular table or a column. For example, if we want to get information about the ecomm.events table we can run the following.

SELECT
  object_context
FROM
  alloydb_ai_nl.generated_schema_context_view
WHERE
  schema_object = 'ecomm.events';

If you are not satisfied with the result you can update the context for the table using the alloydb_ai_nl.update_generated_relation_context function. In my experience in most of the cases the automatically generated context is mostly correct and doesn’t require additional correction.

Then you can choose what context you want to be used for the query generation. You might choose to apply all of it or, for example, only for a particular table. Here is an example of how to apply it only for the ecomm.events table.

SELECT alloydb_ai_nl.apply_generated_relation_context(
  relation_name => 'ecomm.events', 
  overwrite_if_exist => TRUE
);

You already noticed optional parameter overwrite_if_exist for the managing context functions. It commands to replace any existing context by the new one. It helps to redefine context from time to time making it better.

After applying the context it disappears from the alloydb_ai_nl.generated_schema_context_view and starts to be used for all the new queries generations.

By the way you also can add your custom application context based on your internal domestic knowledge about queries patterns and conditions. You can read about it more in the documentation.

That can be sufficient for some applications but what if we have some particular queries patterns where we use some domestic functions or maybe certain predicates to be used? In such a case you might look at the query templates. A query template can be added to the configuration based on the natural language intent and define the query structure to be used to get reliable and deterministic execution for the known query patterns specific for your business. Query templates support intent parametrization and query fragments to make it more flexible.

There are functions in the allydb_ai_nl extension to manage the query templates and fragments. Here is an example of how to add a query template:

SELECT alloydb_ai_nl.add_template(
    nl_config_id => 'cymbal_ecomm_config',
    intent => 'List the last names and the country of all customers who bought products of `Republic Outpost` in the last year.',
    sql => 'SELECT DISTINCT u."last_name", u."country" FROM "ecomm"."users" AS u INNER JOIN "ecomm"."order_items" AS oi ON u.id = oi."user_id" INNER JOIN "ecomm"."products" AS ep ON oi.product_id = ep.id WHERE ep.brand = ''Republic Outpost'' AND oi.created_at >= DATE_TRUNC(''year'', CURRENT_DATE - INTERVAL ''1 year'') AND oi.created_at < DATE_TRUNC(''year'', CURRENT_DATE)',
    sql_explanation => 'To answer this question, JOIN `ecomm.users` with `ecom.order_items` on having the same `users.id` and `order_items.user_id`, and JOIN the result with ecom.products on having the same `order_items.product_id` and `products.id`. Then filter rows with products.brand = ''Republic Outpost'' and by `order_items.created_at` for the last year. Return the `last_name` and the `country` of the users with matching records.',
    check_intent => TRUE
);

Or disable the query template

SELECT alloydb_ai_nl.disable_template(INPUT template_id);

And you can automatically generate query templates based on your query history using alloydb_ai_nl.generate_templates functions.

SELECT
  alloydb_ai_nl.generate_templates(
    'cymbal_ecomm_config',
);

In addition to all that configuration options you also can create value indexes based on samples of your data in the tables. The value index provides associations between column name plus values in the column and a concept type which can be for example a city, country or name. It can associate a value used in the natural language request with a potential concept type and what table and column can be used in the resulting SQL. So if somebody asks “How many Clades do we have?” — the value index can help to figure out that the “Clades” in the request is the brand name, not a name for a product. You can get more information about concepts types and value indexes in the guide.

If you combine all those components together it can help you to avoid uncertainty and make the natural language to SQL reliable and predictable. At the same time the fragments and deep knowledge of your data helps to generate queries for deep analysis and still avoid disambiguation.

Summary

The AlloyDB NL2SQL makes your natural language to SQL processing robust and enterprise ready, preventing disambiguation and saving from the non-deterministic nature of AI models. At the same time it is still flexible enough to make it useful and dynamic, helping data analysts to dig in through data generating reports and helping with analysis.

You can try it now and let us know what you think. Start from the Google Cloud codelab and test all the listed NL2SQL features in your own project. Happy testing.

DEV Community

Talk to Your Data: Analyze Data in AlloyDB Using Natural Language

SQL is language for your data

Components

Summary

Top comments (0)