<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joan</title>
    <description>The latest articles on DEV Community by Joan (@joaning).</description>
    <link>https://dev.to/joaning</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1109441%2F0fde7cec-f878-4fce-8e3d-f04e587b90a4.png</url>
      <title>DEV Community: Joan</title>
      <link>https://dev.to/joaning</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joaning"/>
    <language>en</language>
    <item>
      <title>How pgroll works under the hood</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Thu, 07 Dec 2023 15:00:00 +0000</pubDate>
      <link>https://dev.to/xata/how-pgroll-works-under-the-hood-5ac2</link>
      <guid>https://dev.to/xata/how-pgroll-works-under-the-hood-5ac2</guid>
      <description>&lt;p&gt;At the start of October we released &lt;a href="https://github.com/xataio/pgroll" rel="noopener noreferrer"&gt;pgroll&lt;/a&gt;, an open source tool for &lt;a href="https://xata.io/blog/pgroll-schema-migrations-postgres" rel="noopener noreferrer"&gt;zero-downtime, reversible schema migrations for Postgres&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We've been hard at work on &lt;code&gt;pgroll&lt;/code&gt; for the past couple of months, and now seems like the perfect time to delve deeper into &lt;code&gt;pgroll&lt;/code&gt; and explore how it really works.&lt;/p&gt;

&lt;h2&gt;A brief recap&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; is an open source schema migration tool that takes a different approach, using the &lt;a href="https://openpracticelibrary.com/practice/expand-and-contract-pattern/" rel="noopener noreferrer"&gt;expand/contract pattern&lt;/a&gt;, to solve some of the problems associated with schema migrations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple schema versions&lt;/strong&gt;: &lt;code&gt;pgroll&lt;/code&gt; keeps two versions of your schema active at the same time during a migration: the old schema and the new. Old and new versions of a client application can therefore co-exist during an application rollout, because each one sees the version of the database schema that it expects to see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock-free migrations&lt;/strong&gt;: &lt;code&gt;pgroll&lt;/code&gt; takes a careful approach to locking, ensuring that tables don't get locked for long periods during a migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy rollbacks&lt;/strong&gt;: Rollbacks are simplified with &lt;code&gt;pgroll&lt;/code&gt;, as it keeps two versions of your database schema during a migration. To roll back, you just need to remove the new schema version. Since the old version remains intact throughout, older client applications continue to work without any disruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data backfilling&lt;/strong&gt;: During a migration, &lt;code&gt;pgroll&lt;/code&gt; allows both old and new schema versions to coexist. It synchronizes any data changes between these versions. This means data added to the old schema is automatically updated in the new schema, and the reverse is also true. As a result, both old and new client applications can operate simultaneously without issues until the migration is complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;A step-by-step example&lt;/h2&gt;

&lt;p&gt;Let's see how &lt;code&gt;pgroll&lt;/code&gt; works by walking through an example migration and seeing what &lt;code&gt;pgroll&lt;/code&gt; does at each step of the migration process.&lt;/p&gt;

&lt;h3&gt;Initialization&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; needs to store its own internal state somewhere in the target Postgres database. Initializing &lt;code&gt;pgroll&lt;/code&gt; configures this store and makes &lt;code&gt;pgroll&lt;/code&gt; ready for first use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A message is displayed confirming the successful configuration of &lt;code&gt;pgroll&lt;/code&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What data does &lt;code&gt;pgroll&lt;/code&gt; store?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; stores its data in the &lt;code&gt;pgroll&lt;/code&gt; schema. In this schema it creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;migrations&lt;/code&gt; table containing the version history for each schema in the database.&lt;/li&gt;
&lt;li&gt;Functions to capture the current database schema for a given schema name.&lt;/li&gt;
&lt;li&gt;Triggers to capture DDL statements that run outside of &lt;code&gt;pgroll&lt;/code&gt; migrations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
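&lt;p&gt;You can inspect this state from &lt;code&gt;psql&lt;/code&gt; after initialization. The exact layout of the &lt;code&gt;migrations&lt;/code&gt; table is an internal detail of &lt;code&gt;pgroll&lt;/code&gt; and may change between versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\dt pgroll.*
SELECT * FROM pgroll.migrations;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;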




&lt;h3&gt;First migration&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;pgroll&lt;/code&gt; initialized, let's run our first migration. Here is a migration that creates a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"01_create_users_table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"create_table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"serial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"pk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"varchar(255)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"unique"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"nullable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, save this file as &lt;code&gt;sql/01_create_users_table.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The migration will create a &lt;code&gt;users&lt;/code&gt; table with three columns. It is equivalent to the following SQL DDL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To apply the migration to the database, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll start sql/01_create_users_table.json --complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does the &lt;code&gt;--complete&lt;/code&gt; flag do here?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; divides migration application into two steps: &lt;strong&gt;start&lt;/strong&gt; and &lt;strong&gt;complete&lt;/strong&gt;. During the &lt;strong&gt;start&lt;/strong&gt; phase, both old and new versions of the database schema are available to client applications. After the &lt;strong&gt;complete&lt;/strong&gt; phase, only the &lt;em&gt;most recent&lt;/em&gt; schema is available.&lt;/p&gt;

&lt;p&gt;As this is the first migration, there is no &lt;em&gt;old&lt;/em&gt; schema to maintain, so the migration can safely be started and completed in one step.&lt;/p&gt;
&lt;/blockquote&gt;
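&lt;p&gt;For later migrations, the two phases are run as separate commands, with the application rollout happening in between. Here is a sketch of that flow, assuming the &lt;code&gt;pgroll complete&lt;/code&gt; command completes the most recently started migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll start sql/&lt;migration&gt;.json   # expand: old and new schema versions are both live
# ...roll out new application versions here...
pgroll complete                     # contract: only the newest schema version remains
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;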




&lt;p&gt;The &lt;code&gt;users&lt;/code&gt; table can be filled with sample data using this SQL command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;SELECT&lt;/span&gt;
   &lt;span class="s1"&gt;'user_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="k"&gt;CASE&lt;/span&gt;
     &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'description for user_'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;
     &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
   &lt;span class="k"&gt;END&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will insert 100,000 users into the &lt;code&gt;users&lt;/code&gt; table. Roughly half of the users will have descriptions and the other half will have &lt;code&gt;NULL&lt;/code&gt; descriptions.&lt;/p&gt;
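&lt;p&gt;You can confirm the rough 50/50 split with a quick query (&lt;code&gt;FILTER&lt;/code&gt; is standard Postgres aggregate syntax):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT count(*) FILTER (WHERE description IS NULL)     AS null_descriptions,
       count(*) FILTER (WHERE description IS NOT NULL) AS with_descriptions
FROM users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;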

&lt;h3&gt;Second migration&lt;/h3&gt;

&lt;p&gt;Now that the &lt;code&gt;users&lt;/code&gt; table is set up, let's apply a non-backwards-compatible schema change and see how &lt;code&gt;pgroll&lt;/code&gt; assists in managing both the old and new schema versions simultaneously.&lt;/p&gt;

&lt;p&gt;We'd like to change the &lt;code&gt;users&lt;/code&gt; table to disallow &lt;code&gt;NULL&lt;/code&gt; values in the &lt;code&gt;description&lt;/code&gt; field. We also want a &lt;code&gt;description&lt;/code&gt; to be set explicitly for all new users, so we will not set a default value for this column.&lt;/p&gt;

&lt;p&gt;There are two things that make this migration difficult:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have existing &lt;code&gt;NULL&lt;/code&gt; values in our &lt;code&gt;description&lt;/code&gt; column that need to be updated to something not &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Existing applications using the table are still running and may be inserting more &lt;code&gt;NULL&lt;/code&gt; descriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; helps solve both problems by maintaining old and new versions of the schema side-by-side and transferring or modifying data between them as needed.&lt;/p&gt;

&lt;p&gt;Here is the &lt;code&gt;pgroll&lt;/code&gt; migration that will make the &lt;code&gt;description&lt;/code&gt; column &lt;code&gt;NOT NULL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"02_user_description_set_nullable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"alter_column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"nullable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"up"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"down"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this migration as &lt;code&gt;sql/02_user_description_set_nullable.json&lt;/code&gt; and start the migration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll start 02_user_description_set_nullable.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After some progress updates, you'll see a message confirming the successful start of the migration.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's happening behind the progress updates?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to populate the new &lt;code&gt;description&lt;/code&gt; column, &lt;code&gt;pgroll&lt;/code&gt; creates a temporary &lt;code&gt;_pgroll_new_description&lt;/code&gt; column and copies over the data from the existing &lt;code&gt;description&lt;/code&gt; column, using the &lt;code&gt;up&lt;/code&gt; SQL from the migration. As we have 100,000 rows in our table, this process takes some time. This process is called &lt;em&gt;backfilling&lt;/em&gt; and it is performed in batches to avoid locking all rows in the table simultaneously.&lt;/p&gt;


&lt;/blockquote&gt;
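&lt;p&gt;Conceptually, each backfill batch applies the migration's &lt;code&gt;up&lt;/code&gt; expression to a limited set of rows. The SQL below is an illustration of the idea, not the exact statements &lt;code&gt;pgroll&lt;/code&gt; executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- illustrative only: backfill one batch of rows at a time
UPDATE users
SET _pgroll_new_description =
      CASE WHEN description IS NULL
           THEN 'description for ' || name
           ELSE description
      END
WHERE id IN (SELECT id FROM users
             WHERE _pgroll_new_description IS NULL
             ORDER BY id LIMIT 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;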

&lt;p&gt;At this point it's useful to look at the table data and schema to see what &lt;code&gt;pgroll&lt;/code&gt; has done. Let's look at the data first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----+----------+-------------------------+--------------------------+
| id  | name     | description             | _pgroll_new_description  |
+-----+----------+-------------------------+--------------------------+
| 1   | user_1   | &amp;lt;null&amp;gt;                  | description for user_1   |
| 2   | user_2   | description for user_2  | description for user_2   |
| 3   | user_3   | &amp;lt;null&amp;gt;                  | description for user_3   |
| 4   | user_4   | description for user_4  | description for user_4   |
| 5   | user_5   | &amp;lt;null&amp;gt;                  | description for user_5   |
| 6   | user_6   | description for user_6  | description for user_6   |
| 7   | user_7   | &amp;lt;null&amp;gt;                  | description for user_7   |
| 8   | user_8   | &amp;lt;null&amp;gt;                  | description for user_8   |
| 9   | user_9   | description for user_9  | description for user_9   |
| 10  | user_10  | description for user_10 | description for user_10  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the "expand" phase of the &lt;a href="https://openpracticelibrary.com/practice/expand-and-contract-pattern/" rel="noopener noreferrer"&gt;expand/contract pattern&lt;/a&gt; in action; &lt;code&gt;pgroll&lt;/code&gt; has added a &lt;code&gt;_pgroll_new_description&lt;/code&gt; field to the table and populated the field for all rows using the &lt;code&gt;up&lt;/code&gt; SQL logic from the &lt;code&gt;02_user_description_set_nullable.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"up"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END)"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has copied over all &lt;code&gt;description&lt;/code&gt; values into the &lt;code&gt;_pgroll_new_description&lt;/code&gt; field, rewriting any &lt;code&gt;NULL&lt;/code&gt; values using the provided SQL.&lt;/p&gt;
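&lt;p&gt;A quick sanity check, which should return a count of zero once the backfill has finished:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT count(*) FROM users WHERE _pgroll_new_description IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;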

&lt;p&gt;Now let's look at the table schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DESCRIBE users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------+------------------------+-----------------------------------------------------------------+
| Column                  | Type                   | Modifiers                                                       |
|-------------------------+------------------------+-----------------------------------------------------------------|
| id                      | integer                |  not null default nextval('_pgroll_new_users_id_seq'::regclass) |
| name                    | character varying(255) |  not null                                                       |
| description             | text                   |                                                                 |
| _pgroll_new_description | text                   |                                                                 |
+-------------------------+------------------------+-----------------------------------------------------------------+
Indexes:
    "_pgroll_new_users_pkey" PRIMARY KEY, btree (id)
    "_pgroll_new_users_name_key" UNIQUE CONSTRAINT, btree (name)
Check constraints:
    "_pgroll_add_column_check_description" CHECK (_pgroll_new_description IS NOT NULL) NOT VALID
Triggers:
    _pgroll_trigger_users__pgroll_new_description BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE FUNCTION _pgroll_trigger_users__pgroll_new_description()
    _pgroll_trigger_users_description BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE FUNCTION _pgroll_trigger_users_description()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_pgroll_new_description&lt;/code&gt; column has a &lt;code&gt;NOT NULL&lt;/code&gt; &lt;code&gt;CHECK&lt;/code&gt; constraint, but the old &lt;code&gt;description&lt;/code&gt; column is still nullable.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why is the &lt;code&gt;IS NOT NULL&lt;/code&gt; constraint on the new &lt;code&gt;_pgroll_new_description&lt;/code&gt; column &lt;code&gt;NOT VALID&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Defining the constraint as &lt;code&gt;NOT VALID&lt;/code&gt; means that the &lt;code&gt;users&lt;/code&gt; table will not be scanned to enforce the &lt;code&gt;NOT NULL&lt;/code&gt; constraint for existing rows. This means the constraint can be added quickly without locking rows in the table. &lt;code&gt;pgroll&lt;/code&gt; assumes that the &lt;code&gt;up&lt;/code&gt; SQL provided by the user will ensure that no &lt;code&gt;NULL&lt;/code&gt; values are written to the &lt;code&gt;_pgroll_new_description&lt;/code&gt; column.&lt;/p&gt;
&lt;/blockquote&gt;
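&lt;p&gt;Spelled out as DDL, the approach looks roughly like this. Validating the constraint later is optional; it scans the table, but takes only a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt; lock, so normal reads and writes can continue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE users
  ADD CONSTRAINT _pgroll_add_column_check_description
  CHECK (_pgroll_new_description IS NOT NULL) NOT VALID;

-- later, once every row is known to satisfy the constraint:
ALTER TABLE users VALIDATE CONSTRAINT _pgroll_add_column_check_description;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;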




&lt;p&gt;We'll talk about what the two triggers on the table do later.&lt;/p&gt;

&lt;p&gt;For now, let's look at the schemas in the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\dn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+-------------------+
| Name                                    | Owner             |
+-----------------------------------------+-------------------+
| pgroll                                  | postgres          |
| public                                  | pg_database_owner |
| public_01_create_users_table            | postgres          |
| public_02_user_description_set_nullable | postgres          |
+-----------------------------------------+-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have two schemas: one corresponding to the old schema, &lt;code&gt;public_01_create_users_table&lt;/code&gt;, and one for the migration we just started, &lt;code&gt;public_02_user_description_set_nullable&lt;/code&gt;. Each schema contains one view on the &lt;code&gt;users&lt;/code&gt; table. Let's look at the view in the first schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\d+ public_01_create_users_table.users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should contain something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and for the second view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\d+ public_02_user_description_set_nullable.users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should contain something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_pgroll_new_description&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second view exposes the same three columns as the first, but its &lt;code&gt;description&lt;/code&gt; field is mapped to the &lt;code&gt;_pgroll_new_description&lt;/code&gt; field in the underlying table.&lt;/p&gt;

&lt;p&gt;By choosing to access the &lt;code&gt;users&lt;/code&gt; table through either the &lt;code&gt;public_01_create_users_table.users&lt;/code&gt; or &lt;code&gt;public_02_user_description_set_nullable.users&lt;/code&gt; view, applications have a choice of which version of the schema they want to see; either the old version without the &lt;code&gt;NOT NULL&lt;/code&gt; constraint on the &lt;code&gt;description&lt;/code&gt; field or the new version with the constraint.&lt;/p&gt;

&lt;p&gt;When we looked at the schema of the &lt;code&gt;users&lt;/code&gt; table, we saw that &lt;code&gt;pgroll&lt;/code&gt; has created two triggers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_pgroll_trigger_users__pgroll_new_description BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE FUNCTION _pgroll_trigger_users__pgroll_new_description()
_pgroll_trigger_users_description BEFORE INSERT OR UPDATE ON users FOR EACH ROW EXECUTE FUNCTION _pgroll_trigger_users_description()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These triggers keep the two columns in sync: any value written to the old &lt;code&gt;description&lt;/code&gt; column is copied into the &lt;code&gt;_pgroll_new_description&lt;/code&gt; column (rewritten using the &lt;code&gt;up&lt;/code&gt; SQL from the migration), and any value written to the &lt;code&gt;_pgroll_new_description&lt;/code&gt; column is copied back into the old &lt;code&gt;description&lt;/code&gt; column (rewritten using the &lt;code&gt;down&lt;/code&gt; SQL from the migration). Our migration's &lt;code&gt;down&lt;/code&gt; SQL is simply &lt;code&gt;description&lt;/code&gt;, so data is copied from the &lt;code&gt;_pgroll_new_description&lt;/code&gt; column into the &lt;code&gt;description&lt;/code&gt; column without modification.&lt;/p&gt;
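&lt;p&gt;As a sketch of the mechanism, the trigger that propagates writes from the old column to the new one behaves roughly like the function below. This is a simplified illustration with a made-up name (&lt;code&gt;copy_description_up&lt;/code&gt;), not the code &lt;code&gt;pgroll&lt;/code&gt; actually generates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- simplified sketch: re-apply the migration's "up" SQL on each write
CREATE FUNCTION copy_description_up() RETURNS trigger AS $$
BEGIN
  NEW._pgroll_new_description :=
    CASE WHEN NEW.description IS NULL
         THEN 'description for ' || NEW.name
         ELSE NEW.description
    END;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER copy_description_up BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION copy_description_up();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;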

&lt;p&gt;Let's see the first of those triggers in action.&lt;/p&gt;

&lt;p&gt;First set the &lt;a href="https://www.postgresql.org/docs/current/ddl-schemas.html#DDL-SCHEMAS-PATH" rel="noopener noreferrer"&gt;search path&lt;/a&gt; of the Postgres session to use the old schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public_01_create_users_table'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now insert some data into the &lt;code&gt;users&lt;/code&gt; table through the &lt;code&gt;users&lt;/code&gt; view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'this is Alice'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This inserts two new users into the &lt;code&gt;users&lt;/code&gt; table, one with a &lt;code&gt;description&lt;/code&gt; and one without.&lt;/p&gt;

&lt;p&gt;Let's check that the data was inserted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this query should show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-------+---------------------+
| id     | name  | description         |
+--------+-------+---------------------+
| 100001 | Alice | this is Alice       |
| 100002 | Bob   | NULL                |
+--------+-------+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trigger should have copied the data that was just written into the old &lt;code&gt;description&lt;/code&gt; column (without the &lt;code&gt;NOT NULL&lt;/code&gt; constraint) into the &lt;code&gt;_pgroll_new_description&lt;/code&gt; column (with the &lt;code&gt;NOT NULL&lt;/code&gt; constraint) using the &lt;code&gt;up&lt;/code&gt; SQL from the migration.&lt;/p&gt;

&lt;p&gt;Let's check. Set the search path to the new version of the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;search_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public_02_user_description_set_nullable'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, find the users we just inserted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+-------+---------------------+
| id     | name  | description         |
+--------+-------+---------------------+
| 100001 | Alice | this is Alice       |
| 100002 | Bob   | description for Bob |
+--------+-------+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the trigger installed by &lt;code&gt;pgroll&lt;/code&gt; has rewritten the &lt;code&gt;NULL&lt;/code&gt; value inserted into the old schema by using the &lt;code&gt;up&lt;/code&gt; SQL from the migration definition.&lt;/p&gt;
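&lt;p&gt;For reference, the &lt;code&gt;up&lt;/code&gt; step in the migration file is just a SQL expression. Based on the output above, it would have looked something like this (a reconstruction; the field layout follows the &lt;code&gt;pgroll&lt;/code&gt; operation format used elsewhere in this post):&lt;/p&gt;

```json
{
  "name": "02_user_description_set_nullable",
  "operations": [
    {
      "alter_column": {
        "table": "users",
        "column": "description",
        "nullable": false,
        "up": "SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END"
      }
    }
  ]
}
```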




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do applications configure which version of the schema to use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; allows old and new versions of an application to exist side-by-side during a migration. Each version of the application should be configured with the name of the correct version schema, so that the application sees the database schema that it expects.&lt;/p&gt;

&lt;p&gt;This is done by setting the Postgres &lt;strong&gt;search_path&lt;/strong&gt; for the client's session and is described in more detail in the &lt;strong&gt;Client applications&lt;/strong&gt; section below.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Completing the migration
&lt;/h3&gt;

&lt;p&gt;Once the old version of the database schema is no longer required (for instance, when the old applications that depend on it are no longer running in production), the current migration can be completed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the migration has completed, the old version of the schema is no longer present in the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\dn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shows something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+-------------------+
| Name                                    | Owner             |
+-----------------------------------------+-------------------+
| pgroll                                  | postgres          |
| public                                  | pg_database_owner |
| public_02_user_description_set_nullable | postgres          |
+-----------------------------------------+-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only the new version schema &lt;code&gt;public_02_user_description_set_nullable&lt;/code&gt; remains in the database.&lt;/p&gt;

&lt;p&gt;Let's look at the schema of the &lt;code&gt;users&lt;/code&gt; table to see what's changed there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shows something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
| Column      | Type                   | Modifiers                                                       | Storage  | Stats target | Description |
+-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
| id          | integer                |  not null default nextval('_pgroll_new_users_id_seq'::regclass) | plain    | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| name        | character varying(255) |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| description | text                   |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
+-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
Indexes:
    "_pgroll_new_users_pkey" PRIMARY KEY, btree (id)
    "_pgroll_new_users_name_key" UNIQUE CONSTRAINT, btree (name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things have happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The extra &lt;code&gt;_pgroll_new_description&lt;/code&gt; column has been renamed to &lt;code&gt;description&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The old &lt;code&gt;description&lt;/code&gt; column has been removed.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;description&lt;/code&gt; column is now marked as &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The triggers to copy data back and forth between the old and new column have been removed.&lt;/li&gt;
&lt;/ul&gt;
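&lt;p&gt;Conceptually, the steps above boil down to a handful of DDL statements along these lines (a simplified sketch, not &lt;code&gt;pgroll&lt;/code&gt;'s exact implementation, which also handles constraints, triggers, and backfill bookkeeping):&lt;/p&gt;

```sql
-- Simplified sketch of what completing this migration involves.
ALTER TABLE users DROP COLUMN description;  -- remove the old column
ALTER TABLE users RENAME COLUMN _pgroll_new_description TO description;
ALTER TABLE users ALTER COLUMN description SET NOT NULL;
DROP SCHEMA public_01_create_users_table CASCADE;  -- old version schema
```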




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How is the column made &lt;code&gt;NOT NULL&lt;/code&gt; without locking?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because there is an existing &lt;code&gt;NOT NULL&lt;/code&gt; constraint on the column, created when the migration was started, making the column &lt;code&gt;NOT NULL&lt;/code&gt; when the migration is completed does not require a full table scan. See the Postgres &lt;a href="https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-DESC-SET-DROP-NOT-NULL" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for &lt;code&gt;SET NOT NULL&lt;/code&gt;.&lt;/p&gt;


&lt;/blockquote&gt;
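&lt;p&gt;The standard Postgres pattern for this is to add the constraint as &lt;code&gt;NOT VALID&lt;/code&gt; (no table scan, brief lock only), validate it separately under a weaker lock, and then let &lt;code&gt;SET NOT NULL&lt;/code&gt; skip its scan because the validated check constraint already proves no &lt;code&gt;NULL&lt;/code&gt;s exist. A sketch of the general technique (constraint name is illustrative):&lt;/p&gt;

```sql
-- Add the constraint without scanning existing rows.
ALTER TABLE users
  ADD CONSTRAINT users_description_not_null
  CHECK (description IS NOT NULL) NOT VALID;

-- Validate existing rows; this scans the table but takes only a
-- SHARE UPDATE EXCLUSIVE lock, so reads and writes continue.
ALTER TABLE users VALIDATE CONSTRAINT users_description_not_null;

-- Postgres 12+ sees the validated constraint and skips the full scan.
ALTER TABLE users ALTER COLUMN description SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_description_not_null;
```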

&lt;p&gt;&lt;strong&gt;At this point, the migration is complete&lt;/strong&gt;. There is just one version schema in the database, &lt;code&gt;public_02_user_description_set_nullable&lt;/code&gt;, and the underlying &lt;code&gt;users&lt;/code&gt; table has the expected schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollbacks
&lt;/h3&gt;

&lt;p&gt;The expand/contract approach to migrations means that the old version of the database schema (&lt;code&gt;01_create_users_table&lt;/code&gt; in this example) remains operational throughout the migration. This has two key benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old versions of client applications that rely on the old schema continue to work.&lt;/li&gt;
&lt;li&gt;Rollbacks become trivial!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looking at the second of these items, rollbacks, let's see how to roll back a &lt;code&gt;pgroll&lt;/code&gt; migration. We can start another migration now that our last one is complete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"03_add_is_active_column"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"add_column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"is_atcive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"nullable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This migration adds a new column to the &lt;code&gt;users&lt;/code&gt; table. As before, we can start the migration with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll start 03_add_is_active_column.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, this creates a new version of the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\dn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;shows something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+-------------------+
| Name                                    | Owner             |
|-----------------------------------------+-------------------|
| pgroll                                  | postgres          |
| public                                  | pg_database_owner |
| public_02_user_description_set_nullable | postgres          |
| public_03_add_is_active_column          | postgres          |
+-----------------------------------------+-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And adds a new column with a temporary name to the &lt;code&gt;users&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
| Column                | Type                   | Modifiers                                                       | Storage  | Stats target | Description |
|-----------------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------|
| id                    | integer                |  not null default nextval('_pgroll_new_users_id_seq'::regclass) | plain    | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| name                  | character varying(255) |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| description           | text                   |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| _pgroll_new_is_atcive | boolean                |  default true                                                   | plain    | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
+-----------------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new column is not present in the view in the old version of the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\d+ public_02_user_description_set_nullable.users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; SELECT users.id,
    users.name,
    users.description
   FROM users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But it is exposed by the new version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\d+ public_03_add_is_active_column.user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; SELECT users.id,
    users.name,
    users.description,
    users._pgroll_new_is_atcive AS is_atcive
   FROM users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, there's a typo in the column name: &lt;code&gt;is_atcive&lt;/code&gt; instead of &lt;code&gt;is_active&lt;/code&gt;. The migration needs to be rolled back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pgroll rollback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rollback has removed the new version of the schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+-------------------+
| Name                                    | Owner             |
|-----------------------------------------+-------------------|
| pgroll                                  | postgres          |
| public                                  | pg_database_owner |
| public_02_user_description_set_nullable | postgres          |
+-----------------------------------------+-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the new column has been removed from the underlying table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
| Column      | Type                   | Modifiers                                                       | Storage  | Stats target | Description |
|-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------|
| id          | integer                |  not null default nextval('_pgroll_new_users_id_seq'::regclass) | plain    | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| name        | character varying(255) |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
| description | text                   |  not null                                                       | extended | &amp;lt;null&amp;gt;       | &amp;lt;null&amp;gt;      |
+-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the original schema version, &lt;code&gt;02_user_description_set_nullable&lt;/code&gt;, was never removed, existing client applications remain unaware of the migration and subsequent rollback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Client applications
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; uses the &lt;a href="https://openpracticelibrary.com/practice/expand-and-contract-pattern/" rel="noopener noreferrer"&gt;expand/contract pattern&lt;/a&gt; to roll out schema changes. Each migration creates a new version schema in the database.&lt;/p&gt;

&lt;p&gt;In order to work with the multiple versioned schemas that &lt;code&gt;pgroll&lt;/code&gt; creates, clients need to be configured to use one of them.&lt;/p&gt;

&lt;p&gt;This is done by having client applications configure the &lt;a href="https://www.postgresql.org/docs/current/ddl-schemas.html#DDL-SCHEMAS-PATH" rel="noopener noreferrer"&gt;search path&lt;/a&gt; when they connect to the Postgres database.&lt;/p&gt;

&lt;p&gt;For example, this fragment for a Go client application shows how to set the &lt;code&gt;search_path&lt;/code&gt; after a connection is established:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;searchPath&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"02_user_description_set_nullable"&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SET search_path = %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QuoteIdentifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;searchPath&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to set search path: %s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, the &lt;code&gt;searchPath&lt;/code&gt; variable would be provided to the application as an environment variable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get involved
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pgroll&lt;/code&gt; is an open source project, and we're really excited to see more people getting involved. If you're interested in submitting issues, giving feedback, or contributing pull requests, our &lt;a href="https://github.com/xataio/pgroll" rel="noopener noreferrer"&gt;repository&lt;/a&gt; is the place to be!&lt;/p&gt;

&lt;p&gt;If you want to discuss further, find out more about projects at Xata, or just say hi, join us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X | Twitter&lt;/a&gt;. We're always ready to chat, answer questions, and keep you in the loop with the latest from Xata. We look forward to your input and ideas!&lt;/p&gt;

</description>
      <category>schema</category>
      <category>postgres</category>
      <category>database</category>
      <category>postgressql</category>
    </item>
    <item>
      <title>Announcing the release of the Xata Go SDK</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Thu, 07 Dec 2023 07:00:00 +0000</pubDate>
      <link>https://dev.to/xata/announcing-the-release-of-the-xata-go-sdk-47c4</link>
      <guid>https://dev.to/xata/announcing-the-release-of-the-xata-go-sdk-47c4</guid>
      <description>&lt;p&gt;Earlier this year we shared a &lt;a href="https://xata.io/blog/community-spotlight-xata-go-sdk" rel="noopener noreferrer"&gt;community spotlight&lt;/a&gt; focused on a Go SDK developed by a dedicated contributor, &lt;a href="https://github.com/kerdokurs/xata-go" rel="noopener noreferrer"&gt;xata-go&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/kerdokurs/" rel="noopener noreferrer"&gt;Kerdo Kurs&lt;/a&gt;, the developer behind this project, was a huge inspiration for us committing to an official Go SDK. Today, we're thrilled to announce the official release of our Xata Go SDK, and open the doors to the growing Go community 🎉&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;v0.0.1&lt;/code&gt; release features our most popular endpoints, enabling functionalities like searching a branch, using the ask endpoint for follow-up questions, and creating a database, among others. You can see the evolving endpoint coverage in the ticket &lt;a href="https://github.com/xataio/xata-go/issues/1" rel="noopener noreferrer"&gt;xata-go#1&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to get started right away, head over to the Go SDK &lt;a href="https://xata.io/docs/sdk/go/overview" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and browse the code on &lt;a href="https://github.com/xataio/xata-go" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In our documentation, we've included examples showing how to use the Go SDK, similar to the existing guides we have for TypeScript, Python, cURL, and SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of the SDK
&lt;/h2&gt;

&lt;p&gt;In the Go SDK, components are organized into specialized clients within distinct namespaces, like the &lt;code&gt;RecordsClient&lt;/code&gt; for record-related functions.&lt;/p&gt;

&lt;p&gt;All endpoints that are listed in our &lt;a href="https://xata.io/docs/api-reference/db/db_branch_name/transaction" rel="noopener noreferrer"&gt;API reference&lt;/a&gt; are modeled within these clients.&lt;/p&gt;

&lt;p&gt;The available clients are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;BranchClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DatabasesClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FilesClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RecordsClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SearchAndFilterClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TableClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;UsersClient&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WorkspaceClient&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code and corresponding test suites can be found in the &lt;a href="https://github.com/xataio/xata-go/tree/main/xata" rel="noopener noreferrer"&gt;xata package&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're familiar with our SDKs, you'll notice familiar API naming patterns, adapted to align with Go's idiomatic style.&lt;br&gt;
To illustrate the usage of the Go SDK, we curated a couple of examples. Browse &lt;a href="https://xata.io/docs/sdk/get" rel="noopener noreferrer"&gt;our docs&lt;/a&gt; to see the full range of examples.&lt;/p&gt;

&lt;p&gt;The following example succinctly demonstrates how to use the Ask endpoint.&lt;/p&gt;

&lt;p&gt;For demo purposes, we assume that the table &lt;code&gt;IMDB&lt;/code&gt; contains all movies listed on &lt;a href="https://www.imdb.com/" rel="noopener noreferrer"&gt;IMDB.com&lt;/a&gt;, and we want to learn how many Ace Ventura movies exist.&lt;/p&gt;

&lt;p&gt;More examples of complex queries and follow-up questions can be found &lt;a href="https://xata.io/docs/sdk/ask" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;searchClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchAndFilterClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;searchClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AskRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;TableName&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"IMDB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"How many Ace Ventura movies are there?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we have normalized data across multiple tables in one branch and want to search across the entire branch.&lt;/p&gt;

&lt;p&gt;We are looking for the name &lt;em&gt;Philip&lt;/em&gt;, but as there are multiple ways of writing Philip (such as &lt;em&gt;Filip&lt;/em&gt; or &lt;em&gt;Phillip&lt;/em&gt; or &lt;em&gt;Philippe&lt;/em&gt;) we need to apply fuzziness to our search. &lt;code&gt;Fuzziness: xata.Int(2)&lt;/code&gt; sets the search fuzziness level to a degree of 2, allowing for slight variations in the spelling of search terms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;searchClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSearchAndFilterClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;searchClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchBranch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchBranchRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchBranchRequestPayload&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;"Philip"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Fuzziness&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last example illustrates the use of the &lt;a href="https://xata.io/docs/sdk/transaction" rel="noopener noreferrer"&gt;transaction&lt;/a&gt; endpoint.&lt;br&gt;
Transactions are a powerful way to do multiple operations in one go.&lt;/p&gt;

&lt;p&gt;For the sake of the example, we simply create multiple records for actors who appeared in the movie The Matrix.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;recordsClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRecordsClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;recordsClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TODO&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransactionRequest&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Operations&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransactionOperation&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewInsertTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TransactionInsertOp&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Actors"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Keanu Charles Reeves"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s"&gt;"movie"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"the_matrix"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Carrie-Anne Moss"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s"&gt;"movie"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"the_matrix"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Laurence Fishburne"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s"&gt;"movie"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"the_matrix"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;The SDK is an alpha release and not all &lt;a href="https://github.com/xataio/xata-go/issues/1" rel="noopener noreferrer"&gt;endpoints are covered&lt;/a&gt; yet; we continue to increase the API coverage of the SDK and to provide bug fixes.&lt;/p&gt;

&lt;p&gt;We can't do it without you! If you want to contribute &lt;a href="https://github.com/xataio/xata-go" rel="noopener noreferrer"&gt;code&lt;/a&gt;, add an &lt;a href="https://github.com/xataio/xata-go/tree/main/examples" rel="noopener noreferrer"&gt;example&lt;/a&gt;, update the &lt;a href="https://xata.io/docs/sdk/go/overview" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, or report a bug, please open a PR or issue and we will assist you. All contributions are welcome ✅&lt;/p&gt;

&lt;p&gt;To stay up to date with the latest at Xata, follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X | Twitter&lt;/a&gt; or pop in and say "hi" 👋 in &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Is there a language you wish we supported natively but we don't today? Feel free to open a &lt;a href="https://xata.canny.io/feature-requests" rel="noopener noreferrer"&gt;feature request&lt;/a&gt; or check in with our amazing community.&lt;/p&gt;

</description>
      <category>go</category>
      <category>database</category>
      <category>developer</category>
      <category>sdk</category>
    </item>
    <item>
      <title>What is Horizontal Sharding?</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Wed, 06 Dec 2023 23:35:40 +0000</pubDate>
      <link>https://dev.to/xata/what-is-horizontal-sharding-111i</link>
      <guid>https://dev.to/xata/what-is-horizontal-sharding-111i</guid>
      <description>&lt;p&gt;Recently, while wandering through the narrow aisles of a small comic store brimming with pop culture relics, I came across an interesting find. Hidden amongst a varied assortment of DC and Marvel comics, were shelves dedicated to the "Jodoverse" - a series of comics created by Alejandro Jodorowsky. With the spines showing slight signs of age, I remarked to the store owner about the find and asked for a specific issue. Without resorting to a system, ledger, or anything, he quickly scanned the shelves and in a matter of milliseconds found exactly what I asked for with no hassle. I was impressed to see that his collection was organized by genre, collaborator, publication year as well as story arc!&lt;/p&gt;

&lt;p&gt;The store owner's organizational strategy allowed him to quickly access comics. His store was divided into different sections; each section was like a mini comic store specializing in a particular category. For example, if you were looking for a 90s Jodorowsky comic created in collaboration with a particular artist, you'd first go to the section dedicated to Jodorowsky's works, find a specific area for collaborations, and then narrow your search down to the shelves tagged with the 90s. This setup, where each 'mini-store' or section held a distinct subset of the entire collection, mirrors the concept of horizontal sharding in databases, where data is split into manageable parts based on specific keys for more efficient access and retrieval.&lt;/p&gt;

&lt;p&gt;Just as the comics are grouped and allocated to specific shelves based on various attributes, horizontal sharding distributes rows of a database across multiple locations, or "shards", based on specific key attributes. This method not only ensures quick access but also optimizes storage and performance. The store owner knew exactly where each comic belonged; the same with horizontal sharding which ensures that every piece of data finds its right place in a database system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is sharding?
&lt;/h2&gt;

&lt;p&gt;Sharding is a database design technique where data is split across multiple servers, or "shards", each holding a portion of the data. Think of it like the shop mentioned above: instead of having everything jumbled together, the shop is organized into sections, with each section (or "shard") holding a specific category of comics. This structure makes it easier and faster to find a particular item, just as sharding can make databases more efficient by spreading the load across multiple servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejyyo8gdxrvzstczpfds.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejyyo8gdxrvzstczpfds.gif" alt="Separate" width="360" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As data volumes grow exponentially, efficient data management becomes extremely important. That's why sharding matters: it's one of the techniques employed to maintain the sanity and speed of large databases. So, what are the different types of sharding?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vertical sharding&lt;/strong&gt;: In this approach, different tables or columns are placed on different servers. For example, one server might store names and descriptions, while another might store order histories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Horizontal sharding&lt;/strong&gt;: This is the most common form of sharding. Here, rows of a database table are held separately, rather than splitting by columns. For instance, if you had a database of comics, one shard might contain issue IDs 1 to 1000, while another might hold 1001 to 2000, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Range-based sharding&lt;/strong&gt;: Range-based sharding is a specific method of horizontal sharding where rows are partitioned across shards based on ranges of values of a sharding key. For instance, user IDs 1 to 1000 might be stored on one shard, while user IDs 1001 to 2000 might be on another. In essence, all range-based sharding is horizontal sharding, but not all horizontal sharding is range-based. Other horizontal sharding strategies might involve hashing, directory-based sharding, or even geographically-based sharding. Each strategy has its own advantages and best-use cases.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
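&lt;p&gt;As a quick sketch of the range-based variant above (the 1000-issue ranges and the function name are invented for illustration), the application can derive the owning shard from the ID alone:&lt;/p&gt;

```go
package main

import "fmt"

// shardFor maps an issue ID onto a shard index, assuming fixed ranges of
// 1000 IDs per shard: IDs 1-1000 live on shard 0, 1001-2000 on shard 1,
// and so on, mirroring the comic-issue example above.
func shardFor(issueID int) int {
	return (issueID - 1) / 1000
}

func main() {
	fmt.Println(shardFor(1))    // shard 0
	fmt.Println(shardFor(1000)) // shard 0
	fmt.Println(shardFor(1001)) // shard 1
}
```

&lt;p&gt;Because the mapping is pure arithmetic, no lookup service is needed; the trade-off is that a hot range of IDs can overload a single shard.&lt;/p&gt;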

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvpkx77z35nz750wtmdj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvpkx77z35nz750wtmdj.gif" alt="And" width="480" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other types of sharding exist and include:&lt;/p&gt;

&lt;h4&gt;
  
  
  Hash-based sharding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A hash function is applied to a key attribute, and the result determines which shard the record goes to. This type of sharding may be good for evenly distributing data, especially when the range or list is not predictable.&lt;/li&gt;
&lt;li&gt;For example, user IDs can be hashed, and the hash value determines the shard.&lt;/li&gt;
&lt;/ul&gt;
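&lt;p&gt;A minimal sketch of hash-based shard selection in Go (the function name and shard count are made up; FNV-1a is just one reasonable choice of hash function):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardForUser hashes a user ID and maps the result onto one of n shards.
// The same ID always lands on the same shard, and the hash spreads IDs
// roughly evenly even when their values are not predictable.
func shardForUser(userID string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32() % n
}

func main() {
	fmt.Println(shardForUser("user-42", 4))
}
```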

&lt;h4&gt;
  
  
  Directory-based sharding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A separate lookup service or directory maps each data item to a specific shard. This offers flexibility in sharding and can handle more complex data distribution scenarios. &lt;/li&gt;
&lt;li&gt;For example, a mapping service can direct user queries to the appropriate shard based on the user's location or preferences.&lt;/li&gt;
&lt;/ul&gt;
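&lt;p&gt;A directory-based lookup can be sketched as an explicit mapping table (the tenant and shard names below are hypothetical):&lt;/p&gt;

```go
package main

import "fmt"

// A directory service in miniature: an explicit lookup table maps each
// tenant (or user, region, etc.) to the shard that stores its data.
var directory = map[string]string{
	"tenant-a": "shard-eu-1",
	"tenant-b": "shard-us-1",
	"tenant-c": "shard-us-2",
}

// shardForTenant reports the shard for a tenant and whether it is known.
func shardForTenant(tenant string) (string, bool) {
	shard, ok := directory[tenant]
	return shard, ok
}

func main() {
	shard, ok := shardForTenant("tenant-b")
	fmt.Println(shard, ok)
}
```

&lt;p&gt;The extra indirection is what buys the flexibility: reassigning a tenant to a new shard is a directory update, not a re-hash of all data.&lt;/p&gt;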

&lt;h4&gt;
  
  
   Geo-sharding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Data is sharded based on geographical locations. This is useful for services that need to reduce latency by locating data closer to the user. &lt;/li&gt;
&lt;li&gt;For example, social media posts might be stored in a shard located in the same region as the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key-based sharding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;This is also known as customer sharding. The sharding is done based on a specific key, such as customer ID or tenant ID. This can be used for multi-tenant applications where data isolation between tenants is crucial. &lt;/li&gt;
&lt;li&gt;For example, each comic's data can be stored in a separate shard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Functional sharding
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;This is sharding based on the function or the type of data. This is suitable for systems where different types of data have vastly different access patterns. &lt;/li&gt;
&lt;li&gt;For example, this can be used when one shard handles all the transactional data, while another handles logging or archival data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Sharding comparison example
&lt;/h3&gt;

&lt;p&gt;In horizontal sharding, each shard contains a subset of the rows from the original table, but the structure (columns) remains the same across all shards.&lt;/p&gt;

&lt;p&gt;Suppose there is a table of Marvel comic book issues:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Series Title&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Storyline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To shard this table horizontally based on the series title, it might look like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shard 1 (Avengers series)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Series Title&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Storyline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;001&lt;/td&gt;
&lt;td&gt;Avengers&lt;/td&gt;
&lt;td&gt;Iron Man&lt;/td&gt;
&lt;td&gt;2018-05-01&lt;/td&gt;
&lt;td&gt;Story A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;002&lt;/td&gt;
&lt;td&gt;Avengers&lt;/td&gt;
&lt;td&gt;Thor&lt;/td&gt;
&lt;td&gt;2018-06-01&lt;/td&gt;
&lt;td&gt;Story B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Shard 2 (Spider-Man series)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Series Title&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;th&gt;Storyline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;2019-05-01&lt;/td&gt;
&lt;td&gt;Story X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;2019-06-01&lt;/td&gt;
&lt;td&gt;Story Y&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is represented in the below diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzqmap6le6xpa5sgsaa2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzqmap6le6xpa5sgsaa2.png" alt="Horizontal sharding" width="791" height="775"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conversely, in &lt;strong&gt;&lt;em&gt;vertical sharding&lt;/em&gt;&lt;/strong&gt;, each shard contains a subset of the columns from the original table.&lt;/p&gt;

&lt;p&gt;Using the same original table structure, if divided into two shards with &lt;strong&gt;&lt;em&gt;vertical sharding&lt;/em&gt;&lt;/strong&gt;, it might look something like the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shard 1 (Basic information)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Series Title&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;001&lt;/td&gt;
&lt;td&gt;Avengers&lt;/td&gt;
&lt;td&gt;2018-05-01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;002&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;2019-05-01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Shard 2 (Content details)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue ID&lt;/th&gt;
&lt;th&gt;Character&lt;/th&gt;
&lt;th&gt;Storyline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;001&lt;/td&gt;
&lt;td&gt;Iron Man&lt;/td&gt;
&lt;td&gt;Story A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;002&lt;/td&gt;
&lt;td&gt;Spider-Man&lt;/td&gt;
&lt;td&gt;Story X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In horizontal sharding, the data is divided into different rows across shards, but the table structure (columns) is the same. In vertical sharding, the data is divided into different columns across shards, potentially splitting the table into different aspects or types of data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why horizontal sharding?
&lt;/h3&gt;

&lt;p&gt;Database sharding is important for several reasons, primarily relating to the scalability, performance, and reliability of database systems, especially when managing massive amounts of data. Here's why it's considered crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: As databases grow in size, they can experience slowdowns in query performance, potentially affecting the user experience for applications that depend on them. Sharding helps databases scale out by distributing the data across multiple servers or clusters, enabling them to handle larger workloads and more users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved performance&lt;/strong&gt;: Distributing the data across multiple servers means that fewer rows are queried in each shard. This leads to faster query performance since each server has less data to sift through.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;: Sharding can distribute the database load evenly across servers. This ensures that no single server becomes a bottleneck, improving the overall performance and responsiveness of the database system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hardware cost efficiency&lt;/strong&gt;: Instead of scaling up, sharding allows organizations to scale out, adding more servers to the mix. This spreads data across multiple servers, each handling a part of the total data, which lightens the load on individual servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failover and redundancy&lt;/strong&gt;: If one shard fails, it won't bring down the entire database. Only the users or transactions tied to that specific shard would be affected, while others can continue operations normally. This can be combined with replication strategies to provide even better fault tolerance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geographical distribution&lt;/strong&gt;: For global applications, sharding can be done based on geographical considerations. This means that users can be served from a nearby shard, reducing latency and improving application responsiveness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Isolation of problematic workloads&lt;/strong&gt;: If a particular shard starts to experience issues, it can be isolated and managed without affecting the performance of other shards. This isolation can also be good for maintenance and updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage management&lt;/strong&gt;: Different shards can be placed on different storage mediums depending on the access patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tnupqryad52ltk2ulq5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tnupqryad52ltk2ulq5.gif" alt="Run" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of horizontal sharding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handles more data and users&lt;/strong&gt;: Sharding lets a database grow and handle more data and users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speeds up searches&lt;/strong&gt;: It's usually quicker to find something in a smaller room than a huge library. Similarly, sharding makes data searches faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keeps things running&lt;/strong&gt;: If one shard has a problem, the others keep working.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges of horizontal sharding
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient organization&lt;/strong&gt;: Think of it as sorting comics by themes and years in the comic store. In database sharding, the challenge is to efficiently organize data across shards, ensuring it's easy to manage and retrieve without getting jumbled up.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency across sections&lt;/strong&gt;: Just like maintaining different sections in the store, managing shards can lead to inconsistencies in data retrieval. Users accessing different shards might experience inconsistent query performance, similar to how different sections in the store can offer varying experiences. In sharding, data integrity can become a concern, especially when managing shards independently. Ensuring consistency and integrity across shards is vital to prevent data issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Balancing workload&lt;/strong&gt;: Imagine one section of the comic store becoming too popular, causing overcrowding, while others remain empty. In sharding, this relates to load balancing. Uneven data distribution can overload some shards, resulting in performance issues, similar to a crowded section causing chaos.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling historical data&lt;/strong&gt;: Just as we talked about those older comic books, database sharding also involves the challenge of managing historical data. As data accumulates over time, adopting efficient strategies to handle this older information becomes essential to sustain both performance and storage effectiveness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance bottlenecks&lt;/strong&gt;: Just as overcrowded store sections lead to slower service, improper shard distribution can cause performance bottlenecks in database sharding. Ensuring optimal data distribution among shards is essential to avoid these performance issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complexity management&lt;/strong&gt;: Similar to the comic store's categorization system, managing multiple shards in a database can be challenging. Implementing and maintaining sharding strategies requires careful planning to avoid complexity, and the costs associated with adding more servers or clusters for sharding must also be managed. Also, PostgreSQL itself does not provide built-in database sharding in the same way some distributed databases do. However, PostgreSQL supports table partitioning, which can be seen as a form of horizontal sharding at the table level.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xuwb87caswqzffpcs0y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xuwb87caswqzffpcs0y.gif" alt="Not Easy" width="480" height="268"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementing horizontal sharding
&lt;/h2&gt;

&lt;p&gt;Horizontal sharding is not just a method of splitting a database; it's a strategic approach to reorganize how data is stored and accessed. Each phase, from choosing the sharding key to managing data across multiple shards, plays an important role in making the database more scalable and efficient. When done correctly, horizontal sharding can significantly improve a database's performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Selecting a sharding key&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process starts with choosing a sharding key. This critical step involves identifying a specific data attribute that will dictate how the database is divided into shards. The success of sharding largely depends on this choice, as it affects how easily data can be accessed and managed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Shard creation and management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After determining the sharding key, the next step is to create the actual shards. These are smaller portions of the original database, designed to improve efficiency. However, it's not just about creating these shards; managing them is equally important. This includes tasks like ensuring data is evenly distributed across shards and that each shard is performing well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Migrating existing data to the newly created shards is a delicate operation. It requires careful planning to make sure that the data fits well with the chosen sharding key. Proper execution of this step is vital to ensure that data is distributed correctly across the shards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Handling queries across shards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes, queries need to pull data from multiple shards. Handling these queries efficiently is crucial. The aim is to keep the retrieval process quick and smooth, maintaining the benefits of having a sharded database.&lt;/p&gt;
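&lt;p&gt;A common way to handle such queries is scatter-gather: fan the query out to every shard in parallel and merge the results. A rough Go sketch, with a stand-in function in place of a real per-shard database call:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// queryAll fans a query out to every shard concurrently and merges the
// results. queryShard stands in for a real per-shard database call.
func queryAll(shards []string, queryShard func(string) []string) []string {
	var mu sync.Mutex
	var wg sync.WaitGroup
	var merged []string
	for _, s := range shards {
		wg.Add(1)
		go func(shard string) {
			defer wg.Done()
			rows := queryShard(shard)
			mu.Lock() // guard the shared result slice
			merged = append(merged, rows...)
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	return merged
}

func main() {
	rows := queryAll([]string{"shard-1", "shard-2"}, func(shard string) []string {
		return []string{"row from " + shard}
	})
	fmt.Println(len(rows))
}
```

&lt;p&gt;In practice the merge step may also need to re-sort, deduplicate, or re-apply limits, which is exactly the overhead that makes cross-shard queries worth minimizing.&lt;/p&gt;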




&lt;h3&gt;
  
  
  Horizontal sharding in Postgres
&lt;/h3&gt;

&lt;p&gt;Sharding in PostgreSQL usually has to be set up manually. This means creating several Postgres instances, where each one acts like a separate piece of the larger database. Deciding how to split the data across these pieces involves choosing a key, like user IDs or locations. This process requires a lot of initial setup and ongoing work to keep it running smoothly.&lt;/p&gt;

&lt;p&gt;There are also &lt;a href="https://pgxn.org/dist/pg_shard/pg_shard.html" rel="noopener noreferrer"&gt;extensions&lt;/a&gt; available that can help with sharding in PostgreSQL. These tools make it easier for PostgreSQL to handle data spread out across different places, making the database more scalable. However, adding sharding, whether by doing it manually or using these tools, makes the database system more complex. It requires careful planning and regular maintenance, especially when dealing with big datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips for horizontal sharding
&lt;/h2&gt;

&lt;p&gt;Implementing horizontal sharding in database management involves several key strategies. Below is a summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sharding key&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Choose based on frequent data access to evenly split data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Balancing data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keep shards evenly loaded and adjust regularly to prevent overloading.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speedy queries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optimize queries across shards for speed using efficient algorithms.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex transactions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use protocols to maintain consistent data across all shards.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Regularly sync and check data for accuracy across shards.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adaptable sharding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Design sharding to easily grow and adapt to changing data needs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managing shards&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitor shard performance and automate tasks for efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup plans&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Implement strong backup and recovery strategies for each shard.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Smartly plan and use resources, utilizing cloud services for flexibility.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shard security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Protect data in each shard with encryption and strict access control.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And there we go – a basic rundown of horizontal sharding in databases. This overview touches just the tip of the iceberg, but it gives some insight into how horizontal sharding helps in managing large datasets more efficiently.&lt;/p&gt;

&lt;p&gt;Let us know what you think. We have a lot more to say on the topic, so reach out to us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X | Twitter&lt;/a&gt;. We'd love to hear your thoughts, answer your questions, and keep you updated on the latest at &lt;a href="https://xata.io/" rel="noopener noreferrer"&gt;Xata&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>beginners</category>
      <category>postgres</category>
      <category>learning</category>
    </item>
    <item>
      <title>Using Next.js to improve speed and efficiency</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 14 Nov 2023 11:14:56 +0000</pubDate>
      <link>https://dev.to/xata/using-nextjs-to-improve-speed-and-efficiency-3gbj</link>
      <guid>https://dev.to/xata/using-nextjs-to-improve-speed-and-efficiency-3gbj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The following post highlights some of the insights and experiences Xata's Senior Product Designer, Elizabet Oliveira, shared at a recent talk at Next.js Conf 2023.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2c8T4w1Uudg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In the ever-evolving landscape of tech and startups, the journey from inception to success can be compared to Buzz Lightyear's famous catchphrase: "To infinity and beyond!" The possibilities are limitless, but the challenges can be overwhelming. Fast-growing startups often find themselves grappling with the demands of rapid scaling, talent acquisition, market competition, and the relentless pressure to innovate. Yet, in this quest for excellence, there's an unexpected roadblock – web app fragmentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18ff2nziunztmmkenlmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18ff2nziunztmmkenlmo.png" alt="To Infinity" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Web app fragmentation
&lt;/h2&gt;

&lt;p&gt;Web app fragmentation occurs when startups, trying to overcome these challenges, end up using various tech tools for different tasks. For instance, Webflow might be their choice for the website, Docusaurus for documentation, React for their dashboard, and the list goes on. This approach may lead to a disjointed and inconsistent user experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjsa9h3fefvro97qs2dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjsa9h3fefvro97qs2dq.png" alt="Web app example" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Navigating startup success with Next.js
&lt;/h2&gt;

&lt;p&gt;Success with Next.js for startups involves selecting tools that support efficient growth and ensuring that product development aligns with startup goals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unifying the tech stack for consistency
&lt;/h3&gt;

&lt;p&gt;Web application fragmentation often poses a considerable obstacle for emerging companies. At &lt;a href="https://xata.io/" rel="noopener noreferrer"&gt;Xata&lt;/a&gt;, we've faced and addressed this dilemma by standardizing our technological tools through the implementation of &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;Next.js&lt;/a&gt;. Opting for this approach has brought a range of benefits, including simplified maintenance, efficient development, improved SEO, and overall a consistent user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beyond unifying our tech stack
&lt;/h3&gt;

&lt;p&gt;We've looked into several factors that have significantly contributed to our progress.&lt;/p&gt;

&lt;h4&gt;
  
  
  The role of a design system
&lt;/h4&gt;

&lt;p&gt;A design system, guided by clear standards and comprising reusable components, is essential. We've implemented ours using Chakra UI and custom tokens, which promotes accessibility and semantic consistency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcmhca3qs793rbycobe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jcmhca3qs793rbycobe.png" alt="Design systems" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using this approach ensures that different products have a similar design because it relies on reusing the same building blocks and style elements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6nag094m2qipy0puoqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6nag094m2qipy0puoqi.png" alt="COnsistency" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Enhancing SEO with Next.js
&lt;/h4&gt;

&lt;p&gt;Effective SEO practices draw in users. Next.js helped us improve SEO with its server-side rendering feature, which creates previews of content that are optimized for social media, and it also provides easy ways to manage meta tags. These features help search engines better understand and index content, potentially leading to higher rankings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgg0849hrb8nz11slwe6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgg0849hrb8nz11slwe6.png" alt="Tweet preview" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Leveraging analytics and monitoring
&lt;/h4&gt;

&lt;p&gt;For startups, analytics and monitoring are essential for tracking user behavior, troubleshooting issues, and making data-driven decisions. We use Next.js and Vercel to conduct A/B tests, which allows us to make informed choices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxj6yd6mcugoj4ukb5ai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxj6yd6mcugoj4ukb5ai.png" alt="Testing" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using Next.js, we can easily connect third-party analytics tools and use built-in performance monitoring. This means we can keep an eye on how fast our website loads and how quickly visitors can see our content.&lt;/p&gt;

&lt;h4&gt;
  
  
  A thriving community for startups
&lt;/h4&gt;

&lt;p&gt;The Next.js community and ecosystem provide invaluable resources for startups. One particularly noteworthy plugin is Contentlayer, a content SDK that simplifies the validation and transformation of your content into type-safe JSON data, which can be seamlessly imported into applications. This plugin has been a game-changer for us at Xata, making the process of building our documentation and blog significantly easier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qh0oow5zmsj84fw3ihd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qh0oow5zmsj84fw3ihd.png" alt="Content layer" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Powering growth with Next.js and Vercel
&lt;/h4&gt;

&lt;p&gt;Combining Next.js and Vercel supercharges performance and development speed. We use Vercel features such as preview deployments, templates, starters, and themes. As a team of two designers with coding skills at Xata, we design directly with Next.js and gather feedback through the Vercel commenting system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy14tqix6ab7imegdui8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgy14tqix6ab7imegdui8.png" alt="Vercel comments" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating a template for a product compatible with Next.js is an effective method for making it accessible to users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx5qmqdssi2r18t6dk5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flx5qmqdssi2r18t6dk5t.png" alt="Xata template" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Wrapping up, startups typically grapple with several challenges, one of the most common being the fragmentation of web applications, which can result in a confusing experience for users. Next.js can be a great solution here, offering a way to consolidate different technologies and create a cohesive tech stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;By using a consistent design system, better interaction for users across various platforms can be achieved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leveraging Next.js for search engine optimization boosts a site’s discoverability and increases organic reach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrated analytics and monitoring provide invaluable data, enabling informed decision-making.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The supportive Next.js community and its rich ecosystem contribute to a more efficient development cycle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, making informed technology choices is pivotal for a startup's journey towards success, and Next.js can be a key part of that toolkit. If you're at a startup and juggling various tech stacks, consider trying Next.js. You can learn more from our session at Next.js Conf 2023 in the video embedded above.&lt;/p&gt;

&lt;p&gt;So, are you facing tech hurdles like we did or do you have something to share? Contribute to our community; we'd love to hear from you. Tell us on &lt;a href="https://discord.com/invite/kvAcQKh7vm" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X | Twitter&lt;/a&gt; and happy coding! 🦋&lt;/p&gt;

</description>
      <category>nextjs</category>
      <category>devops</category>
      <category>developer</category>
      <category>community</category>
    </item>
    <item>
      <title>Database Horror Stories</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 31 Oct 2023 12:02:47 +0000</pubDate>
      <link>https://dev.to/xata/database-horror-stories-5ba6</link>
      <guid>https://dev.to/xata/database-horror-stories-5ba6</guid>
      <description>&lt;p&gt;Halloween brings to mind ghosts and goblins, but for developers, the true frights can often lie hidden in their tech stack. Beneath the surface of applications, databases hold their own unsettling tales.&lt;/p&gt;

&lt;p&gt;In this post, we are looking at some of the terrors that can adversely impact your database project, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The dread of failed backups&lt;/li&gt;
&lt;li&gt;Spooky schema migrations&lt;/li&gt;
&lt;li&gt;Vampiric vacuums that suck the life out of you and your database&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Boogeyman backups: Nightmares of lost data
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;As midnight neared, under the eerie glow of her computer monitor, the developer's small room was awash in a chilling light. Her screen flashed an ominous message: "Backup Failed." A cold dread settled in the pit of her stomach. To her horror, she realized that months of critical data and work had vanished into the digital abyss, never to be retrieved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Database backups are essential lifelines for any data-driven operation. At their core, they involve the creation and storage of database copies at specific moments, acting as snapshots of data at those points in time. These snapshots become invaluable when faced with adversities like system crashes, unexpected data corruption, or even simple human errors leading to data deletions. Backups offer a &lt;em&gt;reliable&lt;/em&gt; mechanism to restore the database to its prior state, ensuring data remains intact and operations continue seamlessly.&lt;/p&gt;

&lt;p&gt;However, as straightforward as this may sound, backups come with their own set of challenges. There's the dread of silent failures where backups simply don't run as intended, with no significant alarms being raised. This can lead to startling gaps in backup history. Then there's the potential for data corruption. Imagine the horror of finding out that the backup, while seemingly successful, is corrupted and unusable. Further, not all backups capture the entirety of the database, sometimes leaving out chunks of data and causing partial restorations. A particularly tricky challenge arises post system upgrades, where older backups might not mesh well with the newer system versions. And when the clock's ticking during emergencies, long restoration times can add to the chaos, causing unwanted downtimes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to prevent backup mishaps?
&lt;/h3&gt;

&lt;p&gt;Backups are essential, but as we've seen, they come with pitfalls of their own. Addressing these proactively is usually the best course of action. You can look into the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Review logs regularly&lt;/strong&gt;: Check the PostgreSQL logs and any error messages from your backup tool, such as &lt;code&gt;pg_dump&lt;/code&gt; or &lt;code&gt;pg_basebackup&lt;/code&gt;; they often contain vital clues to the root cause of a failure. The logs live in the directory specified by the &lt;code&gt;log_directory&lt;/code&gt; configuration parameter (usually set in the &lt;code&gt;postgresql.conf&lt;/code&gt; file), and backup-related errors are usually evident from messages emitted by these tools. Also ensure the version of &lt;code&gt;pg_dump&lt;/code&gt; or &lt;code&gt;pg_basebackup&lt;/code&gt; you use is compatible with your PostgreSQL server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try daily backups or point-in-time recovery (PITR):&lt;/strong&gt; Daily backups add a layer of protection against data loss: by taking consistent snapshots of the database, you can recover data when mishaps occur and preserve the history and versioning of records. Point-in-time recovery (PITR) goes further, allowing the database to be restored to its exact state at a precise moment. This is particularly useful when you need to revert unintended changes or recover from data corruption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay vigilant:&lt;/strong&gt; Inspect error messages and logs from your backup tools to pinpoint issues and remember to keep your backup tools updated to avoid compatibility problems. Also, ensure there's adequate storage space for backups and double-check the specified paths to make sure backups are being saved to the correct and accessible locations.&lt;/li&gt;
&lt;/ul&gt;
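
&lt;p&gt;Staying vigilant can be partially automated. As a minimal sketch (the directory, file pattern, and thresholds are all hypothetical), a small script can flag the "silent failure" scenario above, where the newest dump is stale or suspiciously small:&lt;/p&gt;

```python
# Minimal backup freshness check: assumes nightly dumps land in one
# directory as *.dump files (paths and thresholds are hypothetical).
import time
from pathlib import Path

def latest_backup_ok(backup_dir: str, max_age_hours: float = 26.0,
                     min_bytes: int = 1024) -> bool:
    """Check that the newest dump is recent enough and not suspiciously small."""
    files = sorted(Path(backup_dir).glob("*.dump"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    if not files:
        return False  # no backups at all: raise the alarm
    newest = files[0]
    age_hours = (time.time() - newest.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        return False  # stale: the last dump is too old
    if min_bytes > newest.stat().st_size:
        return False  # suspiciously small, possibly truncated
    return True
```

&lt;p&gt;Run something like this from cron or your monitoring system after each scheduled &lt;code&gt;pg_dump&lt;/code&gt;, and alert someone whenever it returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;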

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla09zmfin6wxkutci6e4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla09zmfin6wxkutci6e4.gif" alt="Spooky" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema spectres: How misconfigured schemas can haunt you with data inconsistencies
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;On a fog-shrouded evening, deep within the echoing chambers of a startup, a lone developer embarked on what was to be a routine schema migration. He unwittingly unleashed an arcane script, and in so doing summoned tables twisted and contorted in unnatural ways, data that floated into the abyss, and records once familiar - appeared as gibberish incantations. Panic set in with the chilling knowledge that in the realm of databases, some mistakes can haunt persistently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://xata.io/blog/postgres-schema-changes-pitas" rel="noopener noreferrer"&gt;Schema migrations&lt;/a&gt; can be intricate, especially when replication is involved.&lt;/p&gt;

&lt;p&gt;Start a migration, and there's always the risk it might stall. When that happens, the database can find itself in a twilight zone, neither in its previous state nor fully migrated. This in-between state can lead to unpredictable behaviors when applications interact with it. &lt;a href="https://xata.io/blog/pgroll-schema-migrations-postgres" rel="noopener noreferrer"&gt;Think you can just reverse a migration that went wrong? It's trickier than it sounds&lt;/a&gt;. Attempting to roll back a problematic change can sometimes introduce more inconsistencies than the original issue.&lt;/p&gt;

&lt;p&gt;The order of deploying these changes is of paramount importance. Some updates hinge on the successful implementation of others. Misordering them might not only disrupt data access but can also bring entire operations to a standstill. There's another layer of complexity when the development environment doesn't mirror the live production setup. Pushing a change meant for testing into the live environment can result in unexpected complications.&lt;/p&gt;

&lt;p&gt;Migrations not only change the structure of a database but can also modify the actual data. A seemingly simple task, like dividing a data column, can misrepresent or even lose data if done incorrectly.&lt;/p&gt;

&lt;p&gt;Now, factor in database replication. In systems with replicas, any schema change must be accurately mirrored across every single replica. Failure to ensure uniformity risks divergence between the primary database and its replicas, a situation that can lead to data integrity issues. Let’s break this down a bit further: the primary and the replica now have different data, with the primary containing records in the new table that the replica doesn't have. This discrepancy is a major issue in replicated database setups, which rely on consistent data across all instances for backup, load distribution, and failover scenarios.&lt;/p&gt;

&lt;p&gt;Misconfigured schema migrations and replication issues can lead to serious problems, such as data loss, data inconsistencies, and downtime. Addressing these challenges requires a combination of preventive, detective, and corrective measures.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to prevent schema migration mishaps?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test before deploying:&lt;/strong&gt; Execute schema migrations in a development environment mirroring your production setup. It's important that the development environment closely resembles the production setup. Before implementing any schema migration, it's essential to test the potential performance implications on critical queries. This can be done using tools like the &lt;code&gt;EXPLAIN&lt;/code&gt; command in PostgreSQL to understand how queries will be executed post-migration, which can help identify potential bottlenecks or inefficiencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use schema version control:&lt;/strong&gt; When you make changes to your database schema, use tools that keep track of the versions and ensure that the changes are applied correctly. This allows you to easily manage and track the evolution of your schema over time. Additionally, &lt;a href="https://github.com/xataio/pgroll" rel="noopener noreferrer"&gt;pgroll&lt;/a&gt; provides features which further enhance the control and management of your data as well as schema migrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;See no evil, just monitor:&lt;/strong&gt; Real-time monitoring of replication ensures you're alerted to any disruptions or lags. PostgreSQL’s &lt;code&gt;pg_stat_replication&lt;/code&gt; offers valuable insights and monitoring capabilities.&lt;/li&gt;
&lt;/ul&gt;
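
&lt;p&gt;The "schema version control" idea boils down to recording which migrations have run and applying pending ones in order. Here is a deliberately tiny sketch of that bookkeeping (using &lt;code&gt;sqlite3&lt;/code&gt; purely for illustration, with hypothetical migration names; dedicated tools like &lt;code&gt;pgroll&lt;/code&gt; add safety features such as reversibility on top of this):&lt;/p&gt;

```python
# Toy migration runner: a version table records which migrations ran,
# and pending ones are applied in order exactly once.
import sqlite3

MIGRATIONS = [  # ordered list; names and DDL are hypothetical
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def migrate(conn: sqlite3.Connection) -> list:
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_version")}
    ran = []
    for name, ddl in MIGRATIONS:
        if name not in applied:
            conn.execute(ddl)  # apply the change ...
            conn.execute("INSERT INTO schema_version (name) VALUES (?)", (name,))
            ran.append(name)   # ... and record that it ran
    conn.commit()
    return ran
```

&lt;p&gt;Because each migration is recorded once, re-running the routine is a no-op, which is exactly the idempotence you want before pointing a migration tool at production.&lt;/p&gt;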

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yrkwgbnv92q72c6qxyh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yrkwgbnv92q72c6qxyh.gif" alt="Vampire" width="472" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vampiric VACUUM: Reclaiming dead tuples and space
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Amidst the quiet hum of overworked servers, the database began to groan, choked by the weight of dead tuples. A forgotten VACUUM process, now awakened, thirsted for resources, plunging the system into darkness. As data vanished, whispers of lost records haunted the corridors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While "horror stories" about the VACUUM process in PostgreSQL can be dramatized for effect, there are some genuine concerns and potential mishaps that can arise if VACUUM isn't managed correctly.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, VACUUM is a maintenance operation that helps reclaim storage space and maintain the health of the database. Over time, as data gets updated or deleted, it leaves behind what are known as "dead tuples." These are essentially old versions of rows that are no longer needed. If left unchecked, these dead tuples can accumulate and lead to database bloat and degraded performance. VACUUM can reclaim storage and collect statistics on performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are different types of VACUUM operations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regular VACUUM:&lt;/strong&gt; This reclaims space but doesn't return it to the operating system. Instead, it makes the space available for reuse by the database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM FULL:&lt;/strong&gt; This not only reclaims space but also compacts tables and returns the freed space to the operating system. It's a more intensive operation and can take longer, locking the tables in the process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autovacuum:&lt;/strong&gt; To avoid manual vacuuming, PostgreSQL has an &lt;em&gt;autovacuum&lt;/em&gt; process that runs automatically in the background. It checks tables for dead tuples and vacuums them periodically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VACUUM is a critical maintenance task, but it must be configured and executed correctly for optimal database performance. Failing to run VACUUM regularly can result in excessive disk usage, because it is the process that reclaims space from deleted or outdated tuples. If this task is overlooked, the database grows in size, consuming more disk space than necessary, and in severe cases it might exhaust all available space.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, transaction IDs are finite. Without VACUUM cleaning up old IDs, the system can experience a "wraparound." This means older data entries could become inaccessible, posing a risk of data loss. By regularly running VACUUM, you can prevent the system from reaching the point where a wraparound would cause problems, and safeguard the accessibility and integrity of your data.&lt;/p&gt;
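
&lt;p&gt;To see why wraparound is dangerous, it helps to know that transaction IDs live in a 32-bit circular space: relative to any given XID, half the space counts as "past" and half as "future". The sketch below is a simplified Python rendering of the comparison PostgreSQL performs internally (not the actual implementation):&lt;/p&gt;

```python
# Transaction IDs occupy a 32-bit circular space; each XID sees half the
# space as past and half as future (simplified from PostgreSQL's
# TransactionIdPrecedes logic).
XID_SPACE = 2 ** 32

def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is logically older than XID b, modulo wraparound."""
    diff = (a - b) % XID_SPACE
    return diff >= XID_SPACE // 2  # wrapped difference lands in the "past" half

# An old XID still compares as older even after the counter wraps past zero ...
assert xid_precedes(XID_SPACE - 5, 10)
# ... but an XID more than half the space behind flips to looking "newer".
assert not xid_precedes(100, 100 + XID_SPACE // 2 + 1)
```

&lt;p&gt;Once a row's XID falls roughly two billion transactions behind, it would suddenly compare as being "in the future", which is why VACUUM freezes old tuples well before that point.&lt;/p&gt;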

&lt;p&gt;However, sometimes &lt;em&gt;vacuuming&lt;/em&gt; can be too much! When a VACUUM FULL is executed during high database activity, it can cause significant slowdowns. A VACUUM FULL operation requires exclusive access to the table it's processing. This means other operations on that table are paused, which can create a bottleneck, especially if the table is frequently accessed. It's important to time these operations outside of peak periods to avoid disrupting the database's performance.&lt;/p&gt;

&lt;p&gt;Last but not least, the autovacuum process, if not configured correctly, can either be too aggressive, affecting performance, or too lax, allowing dead tuples to accumulate. Turning off autovacuum on specific tables with the intention of manual oversight can be risky. If forgotten, it could lead to unchecked table growth.&lt;/p&gt;

&lt;p&gt;If a VACUUM operation is interrupted, especially a manual one, it might leave behind temporary files that occupy disk space.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to prevent VACUUM mishaps?
&lt;/h3&gt;

&lt;p&gt;The VACUUM process in PostgreSQL is important for maintaining database health, but managing it effectively requires understanding its nuances. Here are some solutions and best practices to address common VACUUM-related challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tune autovacuum:&lt;/strong&gt; You can adjust the &lt;code&gt;autovacuum_naptime&lt;/code&gt; parameter to control how often the autovacuum process checks tables for cleanup. Modify or tweak the &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; and &lt;code&gt;autovacuum_analyze_scale_factor&lt;/code&gt; to determine when a table should be vacuumed or analyzed based on the proportion of changed tuples. Additionally, you can use &lt;code&gt;autovacuum_vacuum_cost_limit&lt;/code&gt; and &lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt; to influence how aggressive the autovacuum process is, preventing it from consuming too many resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try manual VACUUM:&lt;/strong&gt; Run a manual &lt;code&gt;VACUUM&lt;/code&gt; if you know a table has accumulated a significant number of dead tuples. Use &lt;code&gt;VACUUM (VERBOSE)&lt;/code&gt; to get detailed information about the vacuuming process, helping diagnose potential issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use less invasive tools:&lt;/strong&gt; Instead of frequently using &lt;code&gt;VACUUM FULL&lt;/code&gt;, which is resource-intensive and locks tables, consider using the less invasive &lt;code&gt;pg_repack&lt;/code&gt; extension to reclaim space without the extensive locks. Monitor disk space usage, table bloat, and autovacuum activity, and set up alerts for scenarios like nearing transaction ID wraparound or excessive table bloat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increase maintenance work memory:&lt;/strong&gt; In PostgreSQL, the &lt;code&gt;maintenance_work_mem&lt;/code&gt; configuration parameter determines how much memory can be allocated for maintenance operations. Boosting &lt;code&gt;maintenance_work_mem&lt;/code&gt; can help speed up the VACUUM process by allowing it to sort and process more data in memory.&lt;/li&gt;
&lt;/ul&gt;
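
&lt;p&gt;For a sense of how the scale-factor settings interact, this is the autovacuum trigger condition from the PostgreSQL documentation, written out in Python with the default values:&lt;/p&gt;

```python
# A table is autovacuumed once its dead tuples exceed:
#   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
def needs_autovacuum(dead_tuples: int, reltuples: int,
                     vacuum_threshold: int = 50,     # default autovacuum_vacuum_threshold
                     scale_factor: float = 0.2) -> bool:  # default autovacuum_vacuum_scale_factor
    return dead_tuples > vacuum_threshold + scale_factor * reltuples

# With defaults, a million-row table accumulates roughly 200,000 dead
# tuples before autovacuum touches it, which is why lowering the scale
# factor on large tables is a common tuning step.
assert not needs_autovacuum(dead_tuples=100_000, reltuples=1_000_000)
assert needs_autovacuum(dead_tuples=250_000, reltuples=1_000_000)
```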

&lt;h2&gt;
  
  
  Trick or treat
&lt;/h2&gt;

&lt;p&gt;Want to share some of your own database horror stories? Reach out to us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X | Twitter&lt;/a&gt;. We'd love to hear your thoughts, answer your questions, and keep you updated on the latest at Xata.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgressql</category>
      <category>postgres</category>
      <category>developer</category>
    </item>
    <item>
      <title>Xata Community Spotlight: Extracting News and Research With Chat Interactions</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Fri, 20 Oct 2023 14:33:32 +0000</pubDate>
      <link>https://dev.to/xata/xata-community-spotlight-extract-news-and-research-with-chat-interactions-55fg</link>
      <guid>https://dev.to/xata/xata-community-spotlight-extract-news-and-research-with-chat-interactions-55fg</guid>
<description>&lt;p&gt;Today we’re putting the community spotlight on two developers based out of Paraguay, &lt;a href="https://github.com/gomezag" rel="noopener noreferrer"&gt;Agustín Gomez&lt;/a&gt; and &lt;a href="https://github.com/quanturtle" rel="noopener noreferrer"&gt;Alvaro Machuca&lt;/a&gt;. They discovered Xata while looking for a serverless database offering that supported vector embeddings. Since starting with Xata a few months ago, Agustín and Alvaro have built real-world applications for conversational search, providing chat experiences over very specific sets of data. Both apps share a similar tech stack, built on top of Python / Django, Xata, and Vercel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generative search solutions
&lt;/h2&gt;

&lt;p&gt;The first application is called &lt;a href="https://www.chatgov.com.py/" rel="noopener noreferrer"&gt;ChatGOV&lt;/a&gt; and provides a simple and interactive way to converse about Paraguayan law.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2y2vdjuflnnawucjl58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2y2vdjuflnnawucjl58.png" alt="ChatGOV beta" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All publicly accessible Paraguayan law was scraped and integrated into a Xata database. With these articles, &lt;a href="https://xata.io/docs/integrations/langchain" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; and OpenAI were used to create embeddings, which were then stored in the same record as the respective legal documents. The ease of having the embeddings stored directly alongside the relational data made the journey to production extremely fast.&lt;/p&gt;
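
&lt;p&gt;The pattern of keeping embeddings in the same record as the source text can be sketched in a few lines. This toy version (plain &lt;code&gt;sqlite3&lt;/code&gt; with JSON-encoded vectors, purely for illustration; Xata's vector column and the LangChain integration handle this natively) shows why one store for both is convenient:&lt;/p&gt;

```python
# Toy sketch: store each document's embedding in the same row as its text,
# then answer a query by cosine similarity over those rows.
import json
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE laws (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

def add_doc(text, vector):
    """Insert the document and its embedding together, as one record."""
    conn.execute("INSERT INTO laws (text, embedding) VALUES (?, ?)",
                 (text, json.dumps(vector)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def most_similar(query_vector):
    """Return the text of the stored document closest to the query vector."""
    rows = conn.execute("SELECT text, embedding FROM laws").fetchall()
    return max(rows, key=lambda r: cosine(query_vector, json.loads(r[1])))[0]
```

&lt;p&gt;A single query returns both the text and its vector, so there is no second vector store to keep in sync with the relational data.&lt;/p&gt;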




&lt;blockquote&gt;
&lt;p&gt;"Having a way to store all the vectors along with your records saves me time and I don’t have to go fight with another client like Pinecone. The fact that it’s in one stop shop for everything is great! I want to build products, not mount database services."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Alvaro Machuca - Co-Founder of ChatGOV&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;




&lt;p&gt;The second application is called &lt;a href="https://briefly-news.vercel.app/" rel="noopener noreferrer"&gt;Briefly News&lt;/a&gt; (&lt;a href="https://github.com/gomezag/briefly-news" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;). It provides a quick way to navigate the Paraguayan news and delve deeper into details about individuals mentioned in the articles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu15ztbiu2n83e12l9ez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu15ztbiu2n83e12l9ez1.png" alt="Briefly News" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project is still in its early days, but it has already scraped and ingested nearly 70,000 articles. It provides end-to-end workflows for filtering news articles by certain criteria, visualizing common terms with a word cloud, aggregating statistics across the news articles, and a chat-based workflow for learning more about the people featured in an article.&lt;/p&gt;

&lt;p&gt;Rather than having disparate services and data stores to handle each one of these use cases, all of the data is simply stored in a Xata database. The &lt;a href="https://xata.io/docs/sdk/python/overview" rel="noopener noreferrer"&gt;Xata Python SDK&lt;/a&gt; and ORM-like experience was used for &lt;a href="https://xata.io/docs/sdk/filtering" rel="noopener noreferrer"&gt;filtering&lt;/a&gt;, &lt;a href="https://xata.io/docs/sdk/search" rel="noopener noreferrer"&gt;full-text search&lt;/a&gt;, &lt;a href="https://xata.io/docs/sdk/aggregate" rel="noopener noreferrer"&gt;aggregations&lt;/a&gt; and &lt;a href="https://xata.io/docs/sdk/ask" rel="noopener noreferrer"&gt;chat features&lt;/a&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;"All in all, I have to say, I was really surprised at how easy it was to adopt Xata and just jump in and start using it. I haven’t had to look at my database in 3 months, it’s stable, no maintenance required and it just works."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Agustín Gomez - Software Engineer by Day, Superhero by Night&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Feedback and feature requests
&lt;/h2&gt;

&lt;p&gt;Agustín and Alvaro have been using Xata since early this year, so we asked them what their favorite aspects of Xata have been so far, and what they’d like to see on our roadmap. Here are some of the reasons they chose Xata for their chat solutions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transition from prototype to production.&lt;/strong&gt; For both projects it was extremely easy to prototype, iterate quickly, and move to production use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in vector DB.&lt;/strong&gt; Not having to worry about another database service specialized for vector embeddings was a huge benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python SDK.&lt;/strong&gt; The Python SDK has steadily seen improvements with each release; it’s been great to see the progression over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When asked what they’re looking forward to seeing, here’s what they shared.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Usage observability.&lt;/strong&gt; As their projects grow, they’d like to see more detail about their usage. Luckily, this is already on &lt;a href="https://xata.io/roadmap" rel="noopener noreferrer"&gt;our roadmap&lt;/a&gt; and in the works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python functions.&lt;/strong&gt; It would be beneficial to have additional helper functions in the Python SDK. A Django-like &lt;a href="https://docs.djangoproject.com/en/4.2/ref/models/querysets/#get-or-create" rel="noopener noreferrer"&gt;&lt;code&gt;get_or_create&lt;/code&gt;&lt;/a&gt; function would be helpful for the web scraping use case, where sometimes an article needs to be created and sometimes updated. Better pagination warnings from the client, to ensure all data is returned, would also have been nice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in embedding generation.&lt;/strong&gt; We discussed a few ✨ ideas here. Simplifying the embedding creation process with more dynamic columns and supporting lighter-weight embeddings would be great for their use case.&lt;/li&gt;
&lt;/ul&gt;
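The &lt;code&gt;get_or_create&lt;/code&gt; helper they wish for can be approximated in user code today. The sketch below shows the Django-style semantics against a hypothetical `query`/`insert` pair (stand-ins for the SDK's query and insert calls); it is an illustration of the pattern, not part of the Xata SDK.

```python
def get_or_create(query, insert, table, lookup, defaults=None):
    """Django-style get_or_create: return (record, created_flag).

    `query` and `insert` are hypothetical callables standing in for a
    database client's query and insert operations.
    """
    matches = query(table, lookup)
    if matches:
        return matches[0], False                  # found: nothing to create
    record = {**lookup, **(defaults or {})}       # lookup keys become fields
    return insert(table, record), True

# In-memory stand-ins to demonstrate the behavior:
store = {"articles": []}

def fake_query(table, lookup):
    return [r for r in store[table]
            if all(r.get(k) == v for k, v in lookup.items())]

def fake_insert(table, record):
    store[table].append(record)
    return record

first, created = get_or_create(fake_query, fake_insert, "articles",
                               {"url": "https://example.com/a"}, {"title": "A"})
again, created2 = get_or_create(fake_query, fake_insert, "articles",
                                {"url": "https://example.com/a"})
```

For a scraper this means re-running a crawl never duplicates an article: the second call returns the existing record instead of inserting a new one.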

&lt;p&gt;If you’re interested in learning more about how to build practical solutions for generative search and the types of technical challenges you may find along the way, you can find Agustín, Alvaro, and the Xata team on our &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Share your story
&lt;/h2&gt;

&lt;p&gt;Do you have a similar story or community contribution you’d like to share? &lt;a href="mailto:communityspotlight@xata.io"&gt;Send us an email&lt;/a&gt; if you’d like to be featured in our community spotlight.&lt;br&gt;
Until then, happy building 🦋&lt;/p&gt;

</description>
      <category>community</category>
      <category>research</category>
      <category>developers</category>
      <category>vectordatabase</category>
    </item>
    <item>
      <title>Xata's JSON Column Type</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 17 Oct 2023 17:33:51 +0000</pubDate>
      <link>https://dev.to/xata/xatas-json-column-type-29ab</link>
      <guid>https://dev.to/xata/xatas-json-column-type-29ab</guid>
      <description>&lt;p&gt;Data models that use schemas are great! At Xata, we believe they're a solid choice in most scenarios. But we also know that not every piece of data fits neatly into the relational model, and sometimes, especially in the early stages of a project, you want something more flexible and straightforward. That's where using JSON documents within a relational data store comes in handy. It offers the best of both worlds – structure when you need it and a bit of freedom when you don't.&lt;/p&gt;

&lt;p&gt;Many of our users have been &lt;a href="https://xata.canny.io/feature-requests/p/json-objects" rel="noopener noreferrer"&gt;asking for this feature&lt;/a&gt;, and as part of &lt;a href="https://dev.to/blog/launch-week-august-2023"&gt;launch week&lt;/a&gt; we are happy to announce that it's finally here.&lt;/p&gt;

&lt;p&gt;Basic support for the &lt;a href="https://xata.io/docs/sdk/filtering#json" rel="noopener noreferrer"&gt;JSON column type&lt;/a&gt; has been added, bringing many benefits, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced flexibility&lt;/strong&gt; by providing a way to store schemaless data in a relational database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data integrity&lt;/strong&gt; with JSON validation according to &lt;a href="https://www.rfc-editor.org/rfc/rfc7159.html" rel="noopener noreferrer"&gt;RFC 7159&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlined development&lt;/strong&gt; that stores any unstructured data directly with no need to handle a schema or data conversions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient queries&lt;/strong&gt; so you can apply many filters to nested nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search functionality&lt;/strong&gt;, as JSON documents are indexed as any other Xata data type and can be searched using the &lt;a href="https://xata.io/docs/sdk/search" rel="noopener noreferrer"&gt;full-text search&lt;/a&gt; capabilities of Xata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We plan to extend Xata's JSON support even further in the future, but the current capabilities are already powerful and cover most use cases.&lt;br&gt;
Below we provide examples of how you can start using JSON documents in Xata today.&lt;/p&gt;

&lt;p&gt;Let's think a bit about a simple data model for an online shop. Suppose we have a &lt;em&gt;Products&lt;/em&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix0wnx8wyb2vgoic6po6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix0wnx8wyb2vgoic6po6.png" alt="Products table" width="245" height="324"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Creating a JSON column
&lt;/h2&gt;

&lt;p&gt;Sometimes different categories of products have completely different specs, so we don't want to create a column for each of them.&lt;/p&gt;

&lt;p&gt;We can use a JSON column to store the product details. Let's add a &lt;code&gt;details&lt;/code&gt; field to our table, either via the UI or via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="c1"&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/columns&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"json"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's add a few products with different details, for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A t-shirt with size and color&lt;/li&gt;
&lt;li&gt;A book with author, ISBN and number of pages&lt;/li&gt;
&lt;li&gt;A climbing rope with a length and a thickness&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  TypeScript
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const record1 = await xata.db.Products.create({
  name: 'Xata xwag T-shirt',
  details: {
    color: 'purple',
    size: 'M',
  }
});
const record2 = await xata.db.Products.create({
  name: 'Meditations',
  details: {
    author: 'Marcus Aurelius',
    isbn: '978-0140449334',
    pages: 304
  }
});
const record3 = await xata.db.Products.create({
  name: 'Long climbing rope',
  details: {
    length: 80,
    thickness: 9.8,
    color: 'blue',
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;record1 = xata.records().insert("Products", {
  "name": "Xata xwag T-shirt",
  "details": {
    "color": "purple",
    "size": "M",
  }
})
record2 = xata.records().insert("Products", {
  "name": "Meditations",
  "details": {
    "author": "Marcus Aurelius",
    "isbn": "978-0140449334",
    "pages": 304
  }
})
record3 = xata.records().insert("Products", {
  "name": "Long climbing rope",
  "details": {
    "length": 80,
    "thickness": 9.8,
    "color": "blue"
  }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="c1"&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/data&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Xata xwag T-shirt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;color&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;purple&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;size&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;M&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Meditations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Marcus Aurelius&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;isbn&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;978-0140449334&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;pages&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 304}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Long climbing rope"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;length&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 80, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;thickness&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 9.8, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;color&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;blue&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
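Note that in the REST payloads above the &lt;code&gt;details&lt;/code&gt; value is itself a JSON-encoded string, which a standard JSON serializer produces for you. A small sketch (the payload shape is taken from the example above; the surrounding request code is omitted):

```python
import json

details = {"color": "purple", "size": "M"}

# The nested document is serialized twice: once as the "details" string,
# once as the request body itself.
payload = json.dumps({
    "name": "Xata xwag T-shirt",
    "details": json.dumps(details),
})

# Round-trip check: the inner string decodes back to the original dict.
decoded = json.loads(json.loads(payload)["details"])
```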



&lt;p&gt;It's important to note that the JSON documents are processed and stored in a binary format in order to improve querying and storage performance.&lt;br&gt;
This has the following implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;White spaces are not preserved&lt;/li&gt;
&lt;li&gt;Key order is not preserved&lt;/li&gt;
&lt;li&gt;If a key is duplicated, only the last value is stored&lt;/li&gt;
&lt;/ul&gt;
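The duplicate-key behavior matches what most JSON parsers (and Postgres's binary &lt;code&gt;jsonb&lt;/code&gt; type) do: the last value wins. A quick way to see it with Python's standard-library parser:

```python
import json

doc = '{"size": "S",  "size": "M"}'   # duplicate key, extra whitespace
parsed = json.loads(doc)

# Only the last "size" survives, and re-serializing drops the extra spaces,
# mirroring the binary (jsonb-style) normalization described above.
normalized = json.dumps(parsed)
```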




&lt;h2&gt;
  
  
  Querying JSON documents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The arrow notation &lt;code&gt;-&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is PostgreSQL's syntax for navigating JSON fields. It's used to access the value of any JSON node, no matter how deep in the tree.&lt;/p&gt;

&lt;p&gt;Xata uses a similar notation to query data and apply some of the existing filters to any JSON value. PostgreSQL uses different operators and casting&lt;br&gt;
depending on the data types, but Xata is able to infer the data type from the provided value and apply the correct operator.&lt;/p&gt;

&lt;p&gt;So far, comparisons on strings and numbers are supported, but this will be extended in the near future.&lt;/p&gt;
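The type inference can be pictured as follows. This is a hedged sketch of the idea, not Xata's actual implementation (which runs server-side): the Python type of the filter value selects the Postgres cast applied to the &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; text extraction.

```python
def json_filter_sql(column, key, value):
    """Build a Postgres predicate for a JSON field, casting by value type.

    Illustrative only: this just shows how a value's type can pick the
    cast, the way Xata infers the comparison to apply.
    """
    path = f"{column}->>'{key}'"          # ->> extracts the JSON value as text
    if isinstance(value, bool):           # check bool before int (bool is an int)
        return f"({path})::boolean = {str(value).lower()}"
    if isinstance(value, (int, float)):
        return f"({path})::numeric = {value}"
    return f"{path} = '{value}'"          # strings compare as text directly

sql_str = json_filter_sql("details", "size", "M")
sql_num = json_filter_sql("details", "length", 80)
```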




&lt;h3&gt;
  
  
  Filter products of size 'M'
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  TypeScript
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const records = await xata.db.Products.filter({
  "details-&amp;gt;size": 'M'
}).getMany();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;records = xata.data().query("Products", {
  "filter": {
    "details-&amp;gt;size": "M"
  }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  SQL
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql

{
  "statement": "SELECT * FROM \"Products\" WHERE details-&amp;gt;&amp;gt;'size' = 'M';"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/query

{
  "filter": {
    "details-&amp;gt;size": "M"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Filter products with a length greater than 50 meters
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  TypeScript
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const records = await xata.db.Products.filter({
  "details-&amp;gt;length": {
    "$gt": 50
  }
}).getMany();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;records = xata.data().query("Products", {
  "filter": {
    "details-&amp;gt;length": {
      "$gt": 50
    }
  }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  SQL
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql

{
  "statement": "SELECT * FROM \"Products\" WHERE (details-&amp;gt;&amp;gt;'length')::numeric &amp;gt; 50;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/query

{
  "filter": {
    "details-&amp;gt;length": {
      "$gt": 50
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check for a substring in a nested JSON node
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  TypeScript
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const records = await xata.db.Products.filter({
  "details-&amp;gt;author": {
    "$contains": "Marcus"
  }
}).getMany();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;records = xata.data().query("Products", {
  "filter": {
    "details-&amp;gt;author": {
      "$contains": "Marcus"
    }
  }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  SQL
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
  "statement": "SELECT * FROM \"Products\" WHERE details-&amp;gt;&amp;gt;'author' LIKE '%Marcus%';"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/query
{
  "filter": {
    "details-&amp;gt;author": {
      "$contains": "Marcus"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Other logical operators, such as negation, work as well
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  TypeScript
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;xata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$not&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;details-&amp;gt;length&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;$gt&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;getMany&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;records = xata.data().query("Products", {
    "filter": {
        "$not": {
            "details-&amp;gt;length": {
                "$gt": 50
            }
        }
    }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  SQL
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/sql
{
  "statement": "SELECT * FROM \"Products\" WHERE NOT (details-&amp;gt;&amp;gt;'length')::numeric &amp;gt; 50;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;h4&gt;
  
  
  JSON
&lt;/h4&gt;


&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// POST https://{workspace}.{region}.xata.sh/db/{db}:{branch}/tables/{table}/query
{
  "filter": {
    "$not": {
      "details-&amp;gt;length": {
        "$gt": 50
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Xata is committed to simplifying the way you work with data. We will keep improving our offering by both adding more rich data types and extending the capabilities of the current ones.&lt;br&gt;
Basic JSON support is one more step in that direction, along with the previously released &lt;a href="https://xata.io/blog/file-attachments" rel="noopener noreferrer"&gt;file attachments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have feedback or questions, you can reach out to us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X / Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>schema</category>
      <category>database</category>
      <category>json</category>
      <category>sql</category>
    </item>
    <item>
      <title>Announcing the Winners of the Xata Content Hackathon</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 17 Oct 2023 13:37:28 +0000</pubDate>
      <link>https://dev.to/xata/announcing-the-winners-of-the-xata-content-hackathon-12m3</link>
      <guid>https://dev.to/xata/announcing-the-winners-of-the-xata-content-hackathon-12m3</guid>
      <description>&lt;p&gt;To commemorate our most recent &lt;a href="https://xata.io/blog/launch-week-august-2023" rel="noopener noreferrer"&gt;launch week&lt;/a&gt;, we called upon our vibrant community to showcase their creativity and their enthusiasm for Xata. The challenge? Write a captivating blog post, create an engaging video, or simply spread the word about Xata. And you all delivered in style!&lt;/p&gt;

&lt;p&gt;The anticipation has been building, and now, we're thrilled to reveal the winners of our content hackathon.&lt;/p&gt;

&lt;p&gt;🥁 Drum roll, please! 🥁&lt;/p&gt;

&lt;p&gt;Without further ado, the winners of our hackathon (in no particular order) are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Aboo_Turaab" rel="noopener noreferrer"&gt;@Aboo_Turaab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/dotAadarsh" rel="noopener noreferrer"&gt;@dotAadarsh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/terieyenike" rel="noopener noreferrer"&gt;@terieyenike&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/moe_rayo" rel="noopener noreferrer"&gt;@moe_rayo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/ishnbedi" rel="noopener noreferrer"&gt;@ishnbedi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🦋 Congratulations to our winners for your exceptional contributions! We'll be in touch to award the $500 cash prizes to each of you.&lt;/p&gt;

&lt;p&gt;To every participant, a heartfelt thank you and kudos! Your dedication, creativity, and efforts have truly amazed us. Each contribution has been a testament to the incredible talent and passion within our community.&lt;/p&gt;

&lt;p&gt;Curious about some of the projects our community created? Take a look at the content below to see how users incorporated Xata into their winning development projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Winning Content
&lt;/h2&gt;

&lt;p&gt;From data retrieval and building libraries to crafting developer portfolios and waitlist apps, discover how our community harnessed Xata to simplify their data tasks.&lt;/p&gt;




&lt;h3&gt;
  
  
  Build an Online Library
&lt;/h3&gt;

&lt;p&gt;Join @Aboo_Turaab in this tutorial, &lt;strong&gt;Building an online library using Xata&lt;/strong&gt;, and learn how to use Xata's file attachment capabilities.&lt;/p&gt;

&lt;p&gt;In the related blog post &lt;strong&gt;File attachment in Xata Database: How to build an Online Library&lt;/strong&gt;, @Aboo_Turaab discusses Xata's file attachment feature and uses it to build an online library without relying on external storage services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8nmyh9pbczw4uer2puk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi8nmyh9pbczw4uer2puk.png" alt=" Build an online library post" width="800" height="932"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Optimize Data Retrieval
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;Optimizing data retrieval&lt;/strong&gt;, @dotAadarsh addresses Xata's innovative approach to optimizing database queries and looks into the pesky N+1 problem in one-to-many relationships.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jzwwbvplrpek115ozoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jzwwbvplrpek115ozoq.png" alt="Optimize data retrieval post" width="800" height="715"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Build an Efficient Waitlist App
&lt;/h3&gt;

&lt;p&gt;Through &lt;strong&gt;Building an efficient waitlist app with Next.js and Xata&lt;/strong&gt;, &lt;a class="mentioned-user" href="https://dev.to/terieyenike"&gt;@terieyenike&lt;/a&gt; introduces a waitlist application suited for impactful pre-launch marketing campaigns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ukj5yz9nqzvx4310gp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4ukj5yz9nqzvx4310gp.png" alt="Build an efficient waitlist app post" width="800" height="843"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Build a File Explorer
&lt;/h3&gt;

&lt;p&gt;Follow along with @moe_rayo in &lt;strong&gt;How to build a file explorer using Xata and Vue.js&lt;/strong&gt; as he demonstrates the process of creating a user-friendly file explorer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs48fhwv0vxw2b383mka1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs48fhwv0vxw2b383mka1.png" alt="Build a file explorer post" width="800" height="775"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Create a Developer Portfolio
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;Creating an amazing developer portfolio using the NeXuS stack&lt;/strong&gt;, @ishnbedi uses Xata to craft a standout developer portfolio that resonates with recruiters and clients alike.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71cm3wivhx1e6r5vo79r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71cm3wivhx1e6r5vo79r.png" alt="Create a developer portfolio post" width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  You're Soaring! Don't Stop Now
&lt;/h2&gt;

&lt;p&gt;Once again, a round of applause for all participants 🎉 Expect some exclusive Xata swag to arrive soon.&lt;/p&gt;

&lt;p&gt;The fun doesn't have to end here, though. Participate in &lt;a href="https://hacktoberfest.com/" rel="noopener noreferrer"&gt;Hacktoberfest 2023&lt;/a&gt; and contribute to Xata's open-source repos like &lt;a href="https://github.com/xataio/pgroll" rel="noopener noreferrer"&gt;pgroll&lt;/a&gt;, our &lt;a href="https://github.com/xataio/xata-py" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt;, or &lt;a href="https://github.com/xataio/mdx-docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;. And keep an eye out for our next exciting challenge.&lt;/p&gt;

&lt;p&gt;To learn more about Xata, &lt;a href="https://app.xata.io/signin" rel="noopener noreferrer"&gt;sign in&lt;/a&gt; and experiment, check out our &lt;a href="https://xata.io/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, or join the conversation on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;X / Twitter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Xata, we are aflutter with pride and gratitude for every single participant. Your inventive efforts underline what we've always believed — our community is our strongest asset. Your dedication, zest, and ingenuity reaffirm our mission. 🦋&lt;/p&gt;

&lt;p&gt;Here's to more creativity, coding, and community camaraderie! Keep making waves, and remember, we're here cheering you on.&lt;/p&gt;

&lt;p&gt;Until our next adventure together... 🚀&lt;/p&gt;

</description>
      <category>product</category>
      <category>database</category>
      <category>tutorial</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>Build a Chatbot With OpenAI, Vercel AI and Xata</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Fri, 13 Oct 2023 16:33:43 +0000</pubDate>
      <link>https://dev.to/xata/build-a-chatbot-with-openai-vercel-ai-and-xata-2cn6</link>
      <guid>https://dev.to/xata/build-a-chatbot-with-openai-vercel-ai-and-xata-2cn6</guid>
      <description>&lt;p&gt;In today's data-driven world, efficient interaction with databases is a crucial aspect of many applications. But what if we could go beyond conventional search methods and enable a natural language conversation with our databases?&lt;/p&gt;

&lt;p&gt;At Xata, we aim to provide developers with the tools to build powerful applications that can interact with data in a natural way. Built-in with our core APIs and SDKs, we offer a powerful &lt;a href="https://xata.io/docs/sdk/ask" rel="noopener noreferrer"&gt;ask endpoint&lt;/a&gt; that allows you to ask questions about your data and get the answers that matter most.&lt;/p&gt;

&lt;p&gt;However, how would you integrate Xata with an existing application built with OpenAI? In this tutorial, we'll show you how to integrate OpenAI's &lt;a href="https://platform.openai.com/docs/guides/gpt/function-calling" rel="noopener noreferrer"&gt;function calling&lt;/a&gt; feature with Xata's TypeScript SDK to create a chatbot that can use search to answer questions about your data.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://xata.io/blog/chatgpt-on-your-data" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;, we introduced the concept of using our built-in &lt;a href="https://xata.io/docs/sdk/ask" rel="noopener noreferrer"&gt;ask endpoint&lt;/a&gt; to simplify the process of querying your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have the following prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing project configured with the &lt;a href="https://xata.io/docs/getting-started/installation" rel="noopener noreferrer"&gt;Xata CLI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.xata.io/settings" rel="noopener noreferrer"&gt;Xata API key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.openai.com/account/api-keys" rel="noopener noreferrer"&gt;OpenAI API key&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In your preferred serverless environment, make sure you install the &lt;a href="https://github.com/openai/openai-node" rel="noopener noreferrer"&gt;OpenAI API Library&lt;/a&gt; and &lt;a href="https://github.com/vercel-labs/ai" rel="noopener noreferrer"&gt;Vercel AI library&lt;/a&gt; to get started.&lt;/p&gt;

&lt;p&gt;After ensuring your prerequisites are met, you can integrate Xata with your existing OpenAI application in three steps: define a search function for the AI, ask questions about your data, and run completions while streaming the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 - Defining a search function
&lt;/h2&gt;

&lt;p&gt;First, we'll define a function that allows us to search our database using OpenAI's &lt;a href="https://platform.openai.com/docs/guides/gpt/function-calling" rel="noopener noreferrer"&gt;function calling&lt;/a&gt; feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CompletionCreateParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full_text_search&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Full text search on a branch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The search query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we want to allow OpenAI to fine-tune the search results, we can add more options to the &lt;code&gt;parameters&lt;/code&gt; object. For example, we can add a &lt;code&gt;fuzziness&lt;/code&gt; parameter to allow for fuzzy search.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CompletionCreateParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full_text_search&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Full text search on a branch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The search query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;fuzziness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Maximum levenshtein distance for fuzzy search, minimum 0, maximum 2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;query&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 - Ask a question about your data
&lt;/h2&gt;

&lt;p&gt;Now that we have our search function defined, we can use the &lt;a href="https://www.npmjs.com/package/openai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; library to ask a question about our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="c1"&gt;// Make sure to properly load and set your OpenAI API key here&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;functions&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the functions defined, the AI will be able to call the &lt;code&gt;full_text_search&lt;/code&gt; function and pass the &lt;code&gt;query&lt;/code&gt; parameter with the parts of the question that are relevant to the search.&lt;/p&gt;
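
&lt;p&gt;Note that the model does not run the search itself; it replies with a function call whose &lt;code&gt;arguments&lt;/code&gt; field is a JSON-encoded string that our code must parse before invoking the SDK. A minimal sketch (the payload below is hypothetical):&lt;/p&gt;

```typescript
// Hypothetical function-call payload, mirroring the shape the Chat
// Completions API returns when the model chooses full_text_search.
const functionCall = {
  name: 'full_text_search',
  // Note: `arguments` arrives as a JSON-encoded string, not an object.
  arguments: '{"query":"zero-downtime migrations"}'
};

// Parse the arguments before passing them to the Xata search call.
const args: { query: string; fuzziness?: number } = JSON.parse(
  functionCall.arguments
);

console.log(args.query); // → "zero-downtime migrations"
```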

&lt;p&gt;To enhance the results, we can include additional information in the &lt;code&gt;messages&lt;/code&gt; array as system messages. For instance, we can provide instructions to the AI or offer hints related to our database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;schema&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBranchDetails&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;branch&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;question&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`
        Workspace: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
        Region: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
        Database: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
        Branch: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
        Schema: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;

        Reply to the user about the data in the database, do not reply about other topics.
        Only use the functions you have been provided with, and use them in the way they are documented.
      `&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;functions&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI provides a variety of models, which you can find listed &lt;a href="https://platform.openai.com/docs/models/overview" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you require a different model that aligns more closely with your specific use case, you can easily switch the &lt;code&gt;model&lt;/code&gt; parameter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 - Running the completion and streaming the results
&lt;/h2&gt;

&lt;p&gt;Finally, we can run the completion and stream the results to the client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;experimental_onFunctionCall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;createFunctionCallMessages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;full_text_search&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchAndFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;searchBranch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="nx"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;fuzziness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fuzziness&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newMessages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createFunctionCallMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;newMessages&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
          &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;functions&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nl"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unknown OpenAI function call name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StreamingTextResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have chosen to stream the results to the client, but you can also wait for the completion to finish and return the results as a single response.&lt;/p&gt;
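
&lt;p&gt;Draining the stream into a single string is straightforward. The sketch below uses a stand-in async generator in place of the real completion stream, so the helper name and demo content are illustrative:&lt;/p&gt;

```typescript
// Accumulate streamed chunks into one string instead of forwarding them
// to the client as they arrive.
async function collectStream(chunks: AsyncIterable<string>): Promise<string> {
  let full = '';
  for await (const chunk of chunks) {
    full += chunk;
  }
  return full;
}

// Stand-in for the real completion stream.
async function* demoChunks() {
  yield 'Xata is a serverless ';
  yield 'data platform.';
}

collectStream(demoChunks()).then((text) => console.log(text));
// → "Xata is a serverless data platform."
```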

&lt;h3&gt;
  
  
  Bonus - Building an interactive chatbot UI
&lt;/h3&gt;

&lt;p&gt;With React and the &lt;a href="https://sdk.vercel.ai/docs/api-reference/use-chat" rel="noopener noreferrer"&gt;&lt;code&gt;useChat&lt;/code&gt; hook&lt;/a&gt;, you can easily create a chatbot that can answer questions about your data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;append&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;isLoading&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useChat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="c1"&gt;// The route to the endpoint we have just created&lt;/span&gt;
  &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;initialMessages&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Congratulations! You've just built a powerful system that combines the capabilities of OpenAI, Vercel AI and Xata's database API. Users can now engage in natural language conversations with their databases, and OpenAI will utilize the provided functions to perform searches and retrieve relevant information.&lt;/p&gt;

&lt;p&gt;By following this tutorial, you've learned how to integrate multiple APIs, handle requests and create interactive responses. This foundation can be extended to create even more sophisticated systems that enable seamless human-machine interactions with data.&lt;/p&gt;

&lt;p&gt;If you want to learn more about Xata's database API, check out our &lt;a href="https://xata.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and come say hi on our &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; if you have any questions.&lt;/p&gt;

&lt;p&gt;Happy coding! 🦋&lt;/p&gt;

</description>
      <category>openai</category>
      <category>vercel</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Xata Playground Now Runs Python in the Browser</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Fri, 13 Oct 2023 16:08:05 +0000</pubDate>
      <link>https://dev.to/xata/the-xata-playground-now-runs-python-in-the-browser-49pb</link>
      <guid>https://dev.to/xata/the-xata-playground-now-runs-python-in-the-browser-49pb</guid>
      <description>&lt;p&gt;At Xata, we're committed to providing you with a versatile and powerful UI to interact with your data. That's why early on we built the Xata Playground, a web-based IDE that allows you to write code in TypeScript and SQL to query your data in Xata.&lt;/p&gt;

&lt;p&gt;Now, the Xata Playground includes Python! You can write code in Xata's Playground using Python, TypeScript, and SQL.&lt;/p&gt;

&lt;p&gt;In this post, we'll do a technical deep dive into how the Xata Playground works and how we added Python to the mix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Xata Playground
&lt;/h2&gt;

&lt;p&gt;As a new hire who had just marked my first month at Xata, I opened a pull request with the first version of the Xata Playground. It was a very simple proof of concept that I wanted to show during our weekly team meeting.&lt;/p&gt;

&lt;p&gt;The pull request introduced a &lt;a href="https://microsoft.github.io/monaco-editor/" rel="noopener noreferrer"&gt;monaco-editor&lt;/a&gt; where you could write TypeScript code and run it to see the results in a separate panel. The main goal was to allow users to experiment and try out our TypeScript SDK without having to install anything.&lt;/p&gt;

&lt;p&gt;Feedback, both internal and from early adopters, made it clear that this was something we had to invest in. We decided to iterate quickly over the proof of concept and release it as soon as possible. Just a month later, we launched the first version of the Xata Playground, and it has remained one of our most popular features ever since.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyxxmr9zwjz2tbxntzl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyxxmr9zwjz2tbxntzl1.png" alt="Xata Playground proof of concept" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The Playground was built taking inspiration from other online IDEs like &lt;a href="https://www.typescriptlang.org/play" rel="noopener noreferrer"&gt;TypeScript Playground&lt;/a&gt;, &lt;a href="https://codesandbox.io/" rel="noopener noreferrer"&gt;CodeSandbox&lt;/a&gt;, or &lt;a href="https://stackblitz.com/" rel="noopener noreferrer"&gt;StackBlitz&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of the core principles was that all code should be executed in the browser. We wanted to avoid the complexity of a backend service that would execute the code and return the results: it would have required work to secure the code execution, and it would have added extra latency.&lt;/p&gt;

&lt;p&gt;To achieve this, we use &lt;a href="https://rollupjs.org" rel="noopener noreferrer"&gt;Rollup&lt;/a&gt; to bundle the code and a &lt;a href="https://github.com/SferaDev/rollup-plugin-import-cdn" rel="noopener noreferrer"&gt;plugin&lt;/a&gt; to load any external library from a CDN. This way, we can transpile the code and all its dependencies into a single file that can be executed in a separate thread using &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers" rel="noopener noreferrer"&gt;Web Workers&lt;/a&gt;.&lt;/p&gt;
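
&lt;p&gt;Conceptually, the wrapping looks something like the sketch below: the bundled output is wrapped so its console output is forwarded to the main thread, then executed in a Worker created from a Blob URL. All names and the message shape here are illustrative, not our exact implementation:&lt;/p&gt;

```typescript
// Bundled output produced by the bundler (illustrative).
const bundledCode = `console.log("hello from the playground");`;

// Wrap it so console.log calls are forwarded to the main thread.
const workerSource = `
  self.console = {
    log: (...args) => self.postMessage({ type: 'log', args })
  };
  ${bundledCode}
`;

// In the browser, the wrapped source runs in its own thread:
//   const url = URL.createObjectURL(
//     new Blob([workerSource], { type: 'text/javascript' })
//   );
//   const worker = new Worker(url);
//   worker.onmessage = (event) => renderResult(event.data);
```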

&lt;p&gt;Running the code in a separate thread is important, as it allows us to avoid blocking the main thread. This way, the UI is always responsive, and you can continue writing code while the previous code is being executed. Also, it provides some built-in security, as the code is isolated from the main thread.&lt;/p&gt;

&lt;p&gt;To further improve the experience of writing TypeScript code, we switched to the fork of monaco-editor that &lt;a href="https://www.typescriptlang.org/play" rel="noopener noreferrer"&gt;TypeScript Playground&lt;/a&gt; uses. This fork, &lt;a href="https://www.typescriptlang.org/dev/sandbox/" rel="noopener noreferrer"&gt;TypeScript Sandbox&lt;/a&gt;, is always up to date with the latest version of TypeScript and includes several improvements for TypeScript developers, such as twoslash inlay hints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo7y29h016f6mi6jnfyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo7y29h016f6mi6jnfyw.png" alt="Xata Playground after a UI redesign" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding support for multiple files and languages
&lt;/h2&gt;

&lt;p&gt;The first version of the Playground only supported a single TypeScript file. This was a limitation we wanted to remove, so when we started working on it, we also decided to add support for other languages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://microsoft.github.io/monaco-editor/" rel="noopener noreferrer"&gt;Monaco&lt;/a&gt; is the open source editor that powers VS Code, and it has support for multiple files and languages. We just needed to use its virtual file system that would allow us to load multiple files and execute them separately.&lt;/p&gt;

&lt;p&gt;The refactor allowed us to add support for SQL, both for full file execution and for inline execution. This was a great addition, as it allowed us to query data with &lt;a href="https://xata.io/blog/sql-over-http" rel="noopener noreferrer"&gt;SQL over HTTP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding support for Python
&lt;/h2&gt;

&lt;p&gt;The next step was to add support for Python. We wanted to have the same experience as with TypeScript and SQL, where you could write code and execute it in the browser. Luckily, there are several projects that allow you to run Python in the browser, and we decided to use &lt;a href="https://pyodide.org" rel="noopener noreferrer"&gt;Pyodide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Pyodide is a project started by Mozilla that allows you to run Python in the browser using WebAssembly. It includes the standard library and several popular packages such as NumPy, Pandas, and Matplotlib. It also includes a package manager that allows you to install any other package with a wheel available on PyPI.&lt;/p&gt;

&lt;p&gt;As with TypeScript, we have a web worker that executes the Python code. The code is executed by Pyodide, and external packages referenced by imports are loaded with its package manager.&lt;/p&gt;

&lt;p&gt;To make it fully work with our Python SDK, we only needed to patch the runtime so that Pyodide could make HTTP requests, using &lt;code&gt;pyodide-http&lt;/code&gt;.&lt;/p&gt;
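For reference, the patch itself is tiny. This is a minimal sketch using the real `pyodide-http` package; it only runs inside a Pyodide runtime in the browser, not in a regular CPython interpreter:

```python
# Minimal sketch: runs only inside a Pyodide runtime in the browser,
# not in a regular CPython interpreter.
import pyodide_http

# Patches urllib and requests so HTTP calls are routed through the
# browser's fetch / XMLHttpRequest, which is what an SDK needs in
# order to reach an HTTP API from WebAssembly.
pyodide_http.patch_all()
```

After this call, code that uses `requests` or `urllib` works unchanged inside the worker.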

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ivrf3942hs3gg4jo2yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ivrf3942hs3gg4jo2yc.png" alt="Current Xata Playground with Python support" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Xata Playground is a great tool to try out Xata's SDKs and to learn how to use them. You can quickly try out some ideas and go back to your IDE to implement them. If you haven't already, give it a try!&lt;/p&gt;

&lt;p&gt;We hope you enjoy the new Python support, and we're looking forward to seeing what you build with it. If you have any feedback or ideas, please let us know on &lt;a href="https://discord.com/invite/kvAcQKh7vm" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>python</category>
      <category>browser</category>
    </item>
    <item>
      <title>Discover the Xata SDK for Python</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 10 Oct 2023 12:42:26 +0000</pubDate>
      <link>https://dev.to/xata/python-sdk-for-xata-4fjf</link>
      <guid>https://dev.to/xata/python-sdk-for-xata-4fjf</guid>
      <description>&lt;p&gt;At the start, Xata's initial focus was to make things better for developers using &lt;a href="https://xata.io/blog/jamstack-mern-lamp-stack-comparison#jamstack-and-databases" rel="noopener noreferrer"&gt;Jamstack&lt;/a&gt;. This led to us creating content that predominantly focused on providing a robust &lt;a href="https://www.npmjs.com/package/@xata.io/client" rel="noopener noreferrer"&gt;TypeScript offering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Motivated by our enthusiasm for connecting with developers, we soon launched a Python SDK, though it initially had just basic capabilities. Since that initial release, a few things have changed. The world has gone AI-crazy and Python has unofficially become the language of choice for AI/ML. This has coincided with our introduction of vector embeddings and the integration of OpenAI's ChatGPT for your data.&lt;/p&gt;

&lt;p&gt;As the Python SDK user base has grown steadily over the last few months, we have actively gathered feedback about what users like and dislike.&lt;/p&gt;

&lt;p&gt;Version &lt;a href="https://pypi.org/project/xata/1.0.0/" rel="noopener noreferrer"&gt;1.0.0&lt;/a&gt; of our &lt;a href="https://xata.io/docs/python-sdk/overview" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; is now available.&lt;/p&gt;

&lt;h2&gt;
  
  
  PEP-8 FTW!
&lt;/h2&gt;

&lt;p&gt;Previously, we received feedback that the SDK didn't feel pythonic! Initially, in the &lt;code&gt;0.x&lt;/code&gt; releases, the API was generated one-by-one from &lt;a href="https://xata.io/docs/rest-api/contexts" rel="noopener noreferrer"&gt;our OpenAPI specification&lt;/a&gt;, which resulted in a non-pythonic API. For this release, we have aligned as much as possible with the &lt;a href="https://peps.python.org/pep-0008/" rel="noopener noreferrer"&gt;PEP-8&lt;/a&gt; standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed improvements
&lt;/h2&gt;

&lt;p&gt;Under the hood, we've made adjustments to how connections are managed and reused. This refactoring has yielded significant performance improvements across the board, the most notable being:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Speedup (on average)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Get a single record&lt;/td&gt;
&lt;td&gt;5.95x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insert a single record&lt;/td&gt;
&lt;td&gt;4.95x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insert 100 records with transactions&lt;/td&gt;
&lt;td&gt;2.22x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
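Much of this gain comes from keeping HTTP connections alive across calls instead of opening a new one per request. The sketch below is not the SDK's actual code; it is a toy illustration of the connection-reuse pattern, with a mock connection standing in for an expensive TCP/TLS handshake:

```python
class MockConnection:
    """Stands in for a TCP/TLS connection; creating one is the expensive part."""
    created = 0  # counts how many connections have been opened

    def __init__(self):
        MockConnection.created += 1

    def request(self, path):
        return f"200 OK {path}"


class PooledClient:
    """Opens a connection lazily, then keeps it alive for later requests."""
    def __init__(self):
        self._conn = None

    def get(self, path):
        if self._conn is None:  # only pay the setup cost once
            self._conn = MockConnection()
        return self._conn.request(path)


client = PooledClient()
for i in range(100):
    client.get(f"/records/{i}")

print(MockConnection.created)  # → 1 connection for 100 requests
```

A client that opened a fresh `MockConnection` in every `get` call would pay the setup cost 100 times, which is where the single-record speedups above come from.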

&lt;h2&gt;
  
  
  How to migrate to 1.x?
&lt;/h2&gt;

&lt;p&gt;Migrating to a new major version can be tough due to the need to understand breaking changes and features that are no longer available. With that in mind, our goal was to minimize the impact while still implementing necessary changes for the greater good. You can check out the &lt;a href="https://xata.io/docs/python-sdk/migration-guide" rel="noopener noreferrer"&gt;full migration guide&lt;/a&gt; in our docs.&lt;/p&gt;

&lt;p&gt;The most impactful user-facing change is the renaming of the API surface [&lt;a href="https://github.com/xataio/xata-py/issues/93" rel="noopener noreferrer"&gt;xata-py#93&lt;/a&gt;]. Additionally, some API endpoint calls were simplified to remove unnecessary code bloat.&lt;/p&gt;

&lt;p&gt;Previously in &lt;code&gt;0.x&lt;/code&gt;, you needed to investigate the payload shape and be specific about the region and branch name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;createDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;branchName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In version &lt;code&gt;1.0.*&lt;/code&gt;, we've redesigned the API interface to incorporate payload options directly into the method signature. We've also made the SDK reuse sensible defaults for internal values, like the &lt;code&gt;region&lt;/code&gt;, so you no longer need to pass them explicitly. This results in a more streamlined API, and you can achieve the same functionality with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;xata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;databases&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What happens to 0.x?
&lt;/h2&gt;

&lt;p&gt;We will continue providing support for the &lt;code&gt;0.x&lt;/code&gt; version of the Xata Python SDK through maintenance releases for an additional year. During this time, new API enhancements and security fixes will be introduced; however, no backports of helpers or other improvements are planned. The &lt;code&gt;0.x&lt;/code&gt; SDK version will be sunsetted by September 1st, 2024.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;Check out our &lt;a href="https://xata.io/docs/python-sdk/migration-guide" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and let us know what you think. We’d love to hear from you! If you think something is missing or you found a bug, open a ticket in the &lt;a href="https://github.com/xataio/xata-py" rel="noopener noreferrer"&gt;xata-py&lt;/a&gt; repository. All contributions are welcome. You can also follow us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or join us in &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Is there a language you wish we supported natively but don't today? Feel free to open up a &lt;a href="https://xata.canny.io/feature-requests" rel="noopener noreferrer"&gt;feature request&lt;/a&gt; or check in with our amazing community. The &lt;a href="https://xata.io/blog/community-spotlight-xata-go-sdk" rel="noopener noreferrer"&gt;xata-go SDK&lt;/a&gt; was initiated by &lt;a href="https://github.com/kerdokurs" rel="noopener noreferrer"&gt;Kerdo&lt;/a&gt; and is hosted on &lt;a href="https://github.com/kerdokurs/xata-go" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;a href="https://github.com/mrkresnofatih" rel="noopener noreferrer"&gt;mrkresnofatih&lt;/a&gt; also added the &lt;a href="https://github.com/mrkresnofatih/terraform-provider-xata" rel="noopener noreferrer"&gt;terraform-provider-xata&lt;/a&gt; to provision Xata. If you're interested in contributing, feel free to reach out with any questions! Happy building 😄&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>sdk</category>
      <category>data</category>
    </item>
    <item>
      <title>Using the LangChain Integration with a Serverless Database</title>
      <dc:creator>Joan</dc:creator>
      <pubDate>Tue, 10 Oct 2023 12:25:56 +0000</pubDate>
      <link>https://dev.to/xata/try-the-xata-langchain-integration-1knk</link>
      <guid>https://dev.to/xata/try-the-xata-langchain-integration-1knk</guid>
      <description>&lt;p&gt;Xata has integrations with LangChain, and is available both as a vector store and a memory store.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LangChain?
&lt;/h2&gt;

&lt;p&gt;LangChain is a popular open-source framework for developing AI applications powered by Large Language Models (LLMs). You can think of it as a collection of composable components implemented in Python and TypeScript that you can combine to implement various AI use cases.&lt;/p&gt;

&lt;p&gt;The components typically offer a common API for different &lt;a href="https://js.langchain.com/docs/modules/model_io/models/llms/" rel="noopener noreferrer"&gt;models&lt;/a&gt; (OpenAI, Llama, Replicate, etc.), &lt;a href="https://js.langchain.com/docs/modules/data_connection/vectorstores/" rel="noopener noreferrer"&gt;vector stores&lt;/a&gt; (Pinecone, Weaviate, Chroma, etc.), databases as &lt;a href="https://js.langchain.com/docs/modules/memory/" rel="noopener noreferrer"&gt;memory stores&lt;/a&gt; (DynamoDB, Redis, Planetscale, etc.) and more. By offering a common API across different models, for example, LangChain makes it easy to switch between models and compare results or use different models for different parts of the app.&lt;/p&gt;
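To make the common-API point concrete: application code is written against an interface rather than a specific vendor, so providers can be swapped without touching the rest of the app. This toy sketch is not actual LangChain code, just the shape of the idea:

```python
from abc import ABC, abstractmethod


class LLM(ABC):
    """Common interface, analogous to LangChain's model abstraction."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class FakeOpenAI(LLM):
    def complete(self, prompt):
        return f"[openai-style] {prompt}"


class FakeLlama(LLM):
    def complete(self, prompt):
        return f"[llama-style] {prompt}"


def summarize(model: LLM, text: str) -> str:
    # Application code is written once against the interface...
    return model.complete(f"Summarize: {text}")


# ...so swapping providers is a one-line change at the call site.
print(summarize(FakeOpenAI(), "hello"))  # → [openai-style] Summarize: hello
print(summarize(FakeLlama(), "hello"))   # → [llama-style] Summarize: hello
```

LangChain's real abstractions work the same way, which is what makes it easy to compare models or mix providers within one application.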

&lt;p&gt;These components can then be “&lt;a href="https://python.langchain.com/docs/modules/chains/" rel="noopener noreferrer"&gt;chained together&lt;/a&gt;” in more complex applications, and LangChain comes with off-the-shelf implementations for several popular &lt;a href="https://js.langchain.com/docs/use_cases" rel="noopener noreferrer"&gt;AI use cases&lt;/a&gt; (Q&amp;amp;A chat bots, summarization, autonomous agents, etc).&lt;/p&gt;

&lt;p&gt;While Xata has a built-in API for the &lt;a href="https://xata.io/chatgpt" rel="noopener noreferrer"&gt;“ChatGPT on your data” use case&lt;/a&gt;, in this blog post we’ll see how to implement similar functionality using the LangChain and Xata integrations. This allows for more flexibility in the details of the implementation (for example, you can choose a non-OpenAI model) at the cost of having more code to write and maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The integrations
&lt;/h2&gt;

&lt;p&gt;Currently, the following &lt;a href="https://dev.to/docs/integrations/langchain"&gt;integrations&lt;/a&gt; are available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Xata as a &lt;a href="https://python.langchain.com/docs/integrations/vectorstores/xata" rel="noopener noreferrer"&gt;vector store in LangChain&lt;/a&gt;. This allows you to store documents with embeddings in a Xata table and perform vector search on them. The integration is built on the &lt;a href="https://xata.io/blog/announcing-the-python-sdk-ga" rel="noopener noreferrer"&gt;Xata Python SDK&lt;/a&gt; and supports filtering by metadata, which is stored in Xata columns for maximum performance.&lt;/li&gt;
&lt;li&gt;Xata as a &lt;a href="https://js.langchain.com/docs/modules/data_connection/vectorstores/integrations/xata" rel="noopener noreferrer"&gt;vector store in LangChain.js&lt;/a&gt;. Same as the Python integration, but for your TypeScript/JavaScript applications.&lt;/li&gt;
&lt;li&gt;Xata as a &lt;a href="https://python.langchain.com/docs/integrations/memory/xata_chat_message_history" rel="noopener noreferrer"&gt;memory store in LangChain&lt;/a&gt;. This allows storing the chat message history for AI chat sessions in Xata, making it work as “memory” for LLM applications. The messages are stored in a Xata table.&lt;/li&gt;
&lt;li&gt;Xata as a &lt;a href="https://js.langchain.com/docs/modules/memory/integrations/xata" rel="noopener noreferrer"&gt;memory store in LangChain.js&lt;/a&gt;. Same as the Python integration, but for TypeScript/JavaScript.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each integration comes with one or two code examples in the doc pages linked above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The four integrations already make Xata one of the most comprehensive data solutions for LangChain, and we’re just getting started!&lt;/strong&gt; For the near future, we’re planning to add custom retrievers for the Xata keyword and hybrid search and the Xata &lt;a href="https://xata.io/docs/typescript-client/ask" rel="noopener noreferrer"&gt;Ask AI&lt;/a&gt; endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why choose Xata?
&lt;/h2&gt;

&lt;p&gt;As we’ve pointed out above, a key benefit of LangChain is that it supports many solutions for each integration type. For example, at the moment, the Python version of LangChain integrates with 47 (!) vector stores, while LangChain.js integrates with 24.&lt;/p&gt;

&lt;p&gt;So what makes Xata different and why should you consider it for your AI apps?&lt;/p&gt;

&lt;p&gt;Xata is a serverless data platform that stores data in PostgreSQL, but also replicates it automatically to Elasticsearch. This means that it offers functionality from both Postgres (ACID transactions, constraints, etc.) and from Elasticsearch (full-text search, vector search, hybrid search) behind the same simple serverless API.&lt;/p&gt;

&lt;p&gt;Here is why you should consider Xata for your LangChain application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s &lt;strong&gt;comprehensive:&lt;/strong&gt; It offers LangChain integrations not only as a vector store, but also as a memory store, and it offers the same integrations for both the Python and TypeScript/JavaScript versions of LangChain. Because it uses Elasticsearch behind the scenes, it can offer BM25 and hybrid search in addition to vector search.&lt;/li&gt;
&lt;li&gt;It’s a &lt;strong&gt;pure serverless solution:&lt;/strong&gt; You simply get an API endpoint, no clusters or instances to configure, because we handle the scaling. The lightweight TypeScript SDK runs in any serverless environment, including Cloudflare Workers.&lt;/li&gt;
&lt;li&gt;It has a &lt;strong&gt;modern developer workflow&lt;/strong&gt;: Xata's workflow is based on branches and has built-in integrations with platforms like GitHub, Vercel, and Netlify.&lt;/li&gt;
&lt;li&gt;It’s &lt;strong&gt;easy:&lt;/strong&gt; The Xata UI makes it very easy to manage your schema, look-up data, create and test queries and searches, and generally understand what’s going on.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to get started?
&lt;/h2&gt;

&lt;p&gt;To get started with Xata and LangChain, you can use the minimal code samples from each of the integrations above. If you are looking for more complex examples, for Python, there is a more complete example in this &lt;a href="https://python.langchain.com/docs/integrations/memory/xata_chat_message_history#conversational-qa-chain-on-your-data-with-memory" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt;. For TypeScript, check out the announcement blog post on the &lt;a href="https://blog.langchain.dev/xata-x-langchain-new-vector-store-and-memory-store-integrations/" rel="noopener noreferrer"&gt;LangChain blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While the integrations added so far already make Xata one of the most comprehensive and easy-to-use data solutions for LangChain and AI applications, this is only the beginning! We’re planning to add more components that take advantage of Xata’s BM25 search and the Ask endpoint.&lt;/p&gt;

&lt;p&gt;If you have any questions or ideas or if you need help implementing Xata with LangChain, reach out to us on &lt;a href="https://xata.io/discord" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or join us on &lt;a href="https://twitter.com/xata" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>database</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
