<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Claudiu Dascalescu</title>
    <description>The latest articles on DEV Community by Claudiu Dascalescu (@claudiud).</description>
    <link>https://dev.to/claudiud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F697237%2F733e26db-344c-49ce-8174-b4144e1a258d.png</url>
      <title>DEV Community: Claudiu Dascalescu</title>
      <link>https://dev.to/claudiud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/claudiud"/>
    <language>en</language>
    <item>
      <title>Inside Xatastor: ZFS + NVMe-oF for millions of Postgres databases</title>
      <dc:creator>Claudiu Dascalescu</dc:creator>
      <pubDate>Sun, 03 May 2026 21:00:00 +0000</pubDate>
      <link>https://dev.to/xata/inside-xatastor-zfs-nvme-of-for-millions-of-postgres-databases-3h3p</link>
      <guid>https://dev.to/xata/inside-xatastor-zfs-nvme-of-for-millions-of-postgres-databases-3h3p</guid>
      <description>&lt;p&gt;At &lt;a href="https://xata.io" rel="noopener noreferrer"&gt;Xata&lt;/a&gt; we run a Postgres service offering fast database branching and scale-to-zero. It happens to be great for enabling coding agents to work with realistic data without needing direct access to the prod DB. This way LLM hallucinations can’t oops the prod database, like we’ve seen happen &lt;a href="https://x.com/lifeof_jer/status/2048103471019434248" rel="noopener noreferrer"&gt;more&lt;/a&gt; and &lt;a href="https://x.com/jasonlk/status/1946069562723897802" rel="noopener noreferrer"&gt;more&lt;/a&gt; &lt;a href="https://alexeyondata.substack.com/p/how-i-dropped-our-production-database" rel="noopener noreferrer"&gt;often&lt;/a&gt; lately.&lt;/p&gt;

&lt;p&gt;There is also a new generation of dev, cloud, and AI platforms that require a database-per-tenant architecture. They often offer free tiers or inexpensive plans for which the cost of “ephemeral” (scale-to-zero) databases is a critical consideration.&lt;/p&gt;

&lt;p&gt;This new level of demand put pressure on us to support very large numbers of databases and branches in the most cost-efficient way possible. That is why, after having worked with and operated several other distributed storage systems, we decided to create our own.&lt;/p&gt;

&lt;p&gt;The new storage system is called Xatastor, and is the secret sauce of the Xata Cloud platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our storage journey and learnings so far
&lt;/h2&gt;

&lt;p&gt;When we started working on what became the current iteration of &lt;a href="https://github.com/xataio/xata" rel="noopener noreferrer"&gt;Xata&lt;/a&gt;, we knew we wanted copy-on-write branching and to use vanilla PostgreSQL without modifications. These two requirements together &lt;a href="https://xata.io/blog/open-source-postgres-branching-copy-on-write" rel="noopener noreferrer"&gt;pretty much meant&lt;/a&gt; that we needed some sort of separation of storage from compute at the block device layer. In other words, some sort of distributed, software-defined, storage.&lt;/p&gt;

&lt;p&gt;We’ve evaluated a bunch of options in the space, both open source and commercial: Ceph, Longhorn, OpenEBS, SimplyBlock, Portworx, Lightbits, and more. We initially chose a commercial solution, but eventually we wanted more control over the storage system, so we switched to an open-source project: OpenEBS with Mayastor.&lt;/p&gt;

&lt;p&gt;Small side note: even with the release of Xatastor, we continue to use OpenEBS / Mayastor in production, and it is what we usually recommend for open source and on-prem installs. We found it reliable and easy to operate, and it is generally a good choice for Kubernetes-native storage.&lt;/p&gt;

&lt;p&gt;What several of these projects have in common is that they use &lt;strong&gt;SPDK&lt;/strong&gt; internally. SPDK is the Storage Performance Development Kit, originally from Intel, and it’s one of those rare pieces of software that you would describe with a single word: a &lt;em&gt;beast&lt;/em&gt;. Skim through their &lt;a href="https://spdk.io/doc/performance_reports.html" rel="noopener noreferrer"&gt;performance reports&lt;/a&gt; and you’ll see that it’s capable of transporting millions of IOPS over the network with latencies measured in microseconds. It is what enabled us to compete with local storage in the &lt;a href="https://xata.io/blog/reaction-to-the-planetscale-postgresql-benchmarks" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; we did last year.&lt;/p&gt;

&lt;p&gt;To transport bytes over the network at those speeds, SPDK uses a protocol called NVMe-over-fabrics (&lt;strong&gt;NVMe-oF&lt;/strong&gt;). It is a protocol specifically designed to keep the advantages of NVMe (massive parallelism, a command set matching SSDs) and export them over the network. It is quite a beast by itself. SPDK provides its own user-space implementation of it, which is what enables these fancy storage solutions.&lt;/p&gt;

&lt;p&gt;While SPDK-based solutions are impressive, we’ve learned that SPDK is optimized for very high performance on a relatively small number of busy volumes. A lot of its defaults and design decisions are oriented towards that use case. Unfortunately, these solutions tend to have issues when dealing with a very large number of volumes. A few hundred volumes per node is perfectly fine, but reaching thousands per node is already a challenge. This ultimately results in a costly system to operate.&lt;/p&gt;

&lt;p&gt;In short, SPDK-based solutions optimize for a small number of very busy volumes. Our main use case is the opposite: &lt;strong&gt;a very large number of mostly idle volumes.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Xatastor - goals
&lt;/h2&gt;

&lt;p&gt;Given the learnings above, it became clear that if we were to scale to millions of databases and branches, we needed to work on a custom storage solution. With the knowledge we gained by operating the other storage engines, we had a good handle on the main tradeoffs, the failure modes, the cost implications, and the operational aspects.&lt;/p&gt;

&lt;p&gt;We set for ourselves the following goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly scalable in terms of volumes per storage node. We were targeting over 100K volumes per storage node.&lt;/li&gt;
&lt;li&gt;An inactive volume (from a scaled-to-zero Postgres) should take no resources at all except for disk space.&lt;/li&gt;
&lt;li&gt;Copy-on-write snapshot and clone features. Thin provisioning, so we can charge only for actual usage.&lt;/li&gt;
&lt;li&gt;Built on top of proven technologies with mature tooling. Simple architecture with the minimum amount of moving parts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmyybsqiraje9cf3lohy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjmyybsqiraje9cf3lohy.png" alt="Xatastor - exposing ZFS zvols over NVMe-oF" width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Xatastor - exposing ZFS zvols over NVMe-oF&lt;/p&gt;

&lt;p&gt;In short, Xatastor is implemented like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ZFS pools of volumes (zvols) exist on the storage nodes.&lt;/li&gt;
&lt;li&gt;Our own user-space implementation of NVMe-oF exposes them over the network.&lt;/li&gt;
&lt;li&gt;Our Xatastor Kubernetes operator serves as the control plane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s go through each of the key components.&lt;/p&gt;

&lt;h2&gt;
  
  
  ZFS pools of volumes (zvols)
&lt;/h2&gt;

&lt;p&gt;Interestingly, we don’t actually use ZFS &lt;em&gt;as a filesystem&lt;/em&gt;, but rather only the bottom layer of zvols. A zvol exposes a ZFS dataset as a block device (e.g. &lt;code&gt;/dev/zvol/pool/name&lt;/code&gt;). Since we mount these over the network, the client is responsible for the filesystem and mounting. We happen to use XFS as the filesystem. This means that Postgres and the compute pods are completely unaware of the ZFS layer.&lt;/p&gt;
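
&lt;p&gt;As a rough sketch (the pool, dataset, and device names below are illustrative, not our actual layout), creating such a zvol and formatting it from the client looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# on the storage node: a sparse (thin-provisioned) 100G zvol
zfs create -s -V 100G tank/pg/branch-main

# on the client, once the device is attached over the network
# (the NVMe device name depends on the environment)
mkfs.xfs /dev/nvme1n1
mount /dev/nvme1n1 /var/lib/postgresql/data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;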

&lt;p&gt;Zvols offer a lot of the key functionality that we want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snapshots and clones&lt;/li&gt;
&lt;li&gt;Thin provisioning&lt;/li&gt;
&lt;li&gt;Data integrity via checksums&lt;/li&gt;
&lt;li&gt;Compression&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Equally important, ZFS is highly scalable when it comes to the number of volumes. A single node can store hundreds of thousands of volumes, snapshots, and clones.&lt;/p&gt;
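
&lt;p&gt;Branching maps directly onto plain ZFS operations: a branch is essentially a clone of a snapshot. A hypothetical sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# snapshot the parent volume, then branch from it;
# both operations are constant-time, independent of data size
zfs snapshot tank/pg/branch-main@base
zfs clone tank/pg/branch-main@base tank/pg/branch-feature-x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;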

&lt;p&gt;At high scale, a few operations need care. For example, &lt;code&gt;zfs list&lt;/code&gt; becomes slow with many datasets, and we need to run it (or its equivalent) on startup. Because we need the startup to be as fast as possible so we can do zero-downtime upgrades, we work around this by maintaining our own metadata store.&lt;/p&gt;

&lt;p&gt;ZFS also has very mature tooling for administration, monitoring, moving volumes between storage nodes, and so on. This simplifies our operations significantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation of the NVMe-oF protocol
&lt;/h2&gt;

&lt;p&gt;Since we decided not to use SPDK, we had to create our own user-space implementation of the NVMe-oF protocol. It’s perfectly compatible with the Linux client implementation, so nothing needs to change on the client side.&lt;/p&gt;
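
&lt;p&gt;Concretely, a client attaches a volume using the standard nvme-cli flow against the kernel initiator (the address and NQN below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# load the NVMe/TCP initiator and connect to a target
modprobe nvme-tcp
nvme connect -t tcp -a 10.0.0.5 -s 4420 -n nqn.2026-01.io.xata:vol-0001

# the volume now shows up as a regular local NVMe device
nvme list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;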

&lt;p&gt;From the implementation point of view, it is a Rust daemon built on monoio, a thread-per-core, &lt;code&gt;io_uring&lt;/code&gt;-based async runtime. Each worker thread owns its own TCP listener via &lt;code&gt;SO_REUSEPORT&lt;/code&gt;, its own connection state machines, and its own buffer pool. So a connection's PDU parsing, NVMe command decode, ZFS read/write, and response framing all happen on one core, with very little cross-thread synchronization on the hot path.&lt;/p&gt;

&lt;p&gt;The little coordination that we need between the admin and the IO queues is done completely mutex-free, to avoid adding latency spikes on the IO path.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Xatastor operator
&lt;/h2&gt;

&lt;p&gt;For the control plane we wanted to be as Kubernetes native as possible. There's no external metadata store: every volume, its placement, its NVMe-oF connect parameters, and its lifecycle state live in a cluster-scoped CRD (&lt;code&gt;xvol&lt;/code&gt;) that you can kubectl get, watch, or describe like any other resource. Create, clone, snapshot, delete, and health monitoring all flow through standard Kubernetes APIs (PVCs, VolumeSnapshots, and the Xvol CR underneath).&lt;/p&gt;
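
&lt;p&gt;Day-to-day inspection is therefore plain kubectl; for example (the volume name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# list all volumes and their lifecycle state
kubectl get xvol

# inspect one volume's placement and connect parameters
kubectl describe xvol vol-0001

# watch lifecycle transitions as they happen
kubectl get xvol vol-0001 -w
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;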

&lt;p&gt;Following the Kubernetes standards makes Xatastor fit well among the OpenEBS storage engines. From the user point of view, it is a drop-in replacement for Mayastor, which makes it easy to switch from one to another, or run them in parallel.&lt;/p&gt;

&lt;p&gt;At the same time, having our own CSI implementation allows us to build custom workflows and a set of optimizations that enable very fast wake-up (scale to one) and branch times. But more about that in a future blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;The goal of Xatastor is to be a storage system that scales to a huge number of volumes with minimal cost. Our design didn’t optimize for single-volume benchmarks. We expected some performance penalties compared to SPDK-based solutions, because of ZFS overhead and the simpler NVMe-oF implementation.&lt;/p&gt;

&lt;p&gt;However, in practice, the simpler architecture and clean implementation offset the expected overhead.&lt;/p&gt;

&lt;p&gt;We’ll have a follow-up blog post covering benchmark results, but in our tests so far Xatastor matches SPDK-based solutions on Postgres benchmarks, while requiring only a small fraction of the hardware resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redundancy
&lt;/h2&gt;

&lt;p&gt;So far in this blog post we haven’t written anything about redundancy. Mayastor, for example, allows storing multiple replicas of the same volume, with sync replication between them. The way this works is that the storage nodes run NVMe-oF proxies (called nexuses) to route writes between them. The nexus needs to wait for the acknowledgement from the replicas before acknowledging the write as successful. If a node is lost, a rebuild process needs to copy the data from a healthy replica to a new node.&lt;/p&gt;

&lt;p&gt;We decided to skip all of this for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our workload is Postgres, which has its own replication mechanism via read replicas. Our system has automatic failover in case of primary failures. This is the most trustworthy HA solution, and the right level at which to solve redundancy. We highly recommend adding at least one read replica to all production workloads, for more reasons than just redundancy: with replicas, updates and configuration changes are rolled out via switchover, which means significantly less downtime.&lt;/li&gt;
&lt;li&gt;In our cloud regions, we use EBS in AWS and Persistent Disks in GCP, which provide their own internal redundancy. For non-production instances, or for prod instances that scale to zero, this means you get redundancy even without read replicas.&lt;/li&gt;
&lt;li&gt;Write performance and cost. Redundancy at the storage layer via sync writes doesn’t come for free. It introduces write amplification, basically doubling the cost. Also, network bandwidth is often the main constraint we deal with, and replication means more traffic over the network.&lt;/li&gt;
&lt;li&gt;Simplicity. This is maybe the biggest reason right now. Having replicas would mean a significantly more complex system, which for now we don’t think is justified for our particular use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Compression
&lt;/h2&gt;

&lt;p&gt;We have many customers storing non-prod databases with multiple terabytes of data. An unexpected benefit of ZFS is that we can enable compression on a per-volume basis. We’re seeing zvol compression of 40-50% in some cases, which translates into significant cost savings.&lt;/p&gt;
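
&lt;p&gt;Because compression is a per-dataset ZFS property, enabling it and checking the achieved ratio is one command each (the names are illustrative, and the algorithm choice, e.g. lz4 vs. zstd, depends on the workload):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# enable compression on a single volume
zfs set compression=lz4 tank/pg/branch-main

# check how well it compresses in practice
zfs get compression,compressratio tank/pg/branch-main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;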

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Xatastor is a cloud native software-defined storage engine that enables the level of scalability that is required in the agentic era.&lt;/p&gt;

&lt;p&gt;If you are working on a web, cloud, or AI platform that would benefit from offering a Postgres instance (or more) per tenant, &lt;a href="https://xata.io/onboarding" rel="noopener noreferrer"&gt;don’t hesitate to reach out&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also: if you are an engineer who enjoys this type of system-level challenge, &lt;a href="https://xata.io/careers#open-positions" rel="noopener noreferrer"&gt;we are hiring&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>rust</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Introducing Xata OSS: Postgres platform with branching, now Apache 2.0</title>
      <dc:creator>Claudiu Dascalescu</dc:creator>
      <pubDate>Wed, 22 Apr 2026 13:13:00 +0000</pubDate>
      <link>https://dev.to/xata/introducing-xata-oss-postgres-platform-with-branching-now-apache-20-52do</link>
      <guid>https://dev.to/xata/introducing-xata-oss-postgres-platform-with-branching-now-apache-20-52do</guid>
      <description>&lt;p&gt;Xata is a cloud-native Postgres platform with the following highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast branching&lt;/strong&gt; using copy-on-write at the storage level. You can “copy” terabytes of data in a matter of seconds, because these are lightweight copies that take very little extra disk space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-to-zero&lt;/strong&gt; functionality, so inactive databases don’t cost you compute. Together with the previous point, this makes Postgres branches almost free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100% Vanilla Postgres&lt;/strong&gt;. We took a bit of a different path compared with Neon or Aurora: we run vanilla Postgres without any modifications. The branching functionality happens at the underlying storage layer (tech details below).&lt;/li&gt;
&lt;li&gt;All the &lt;strong&gt;production grade&lt;/strong&gt; requirements: high availability, read replicas, automatic failover/switchover, upgrades, backups with PITR, IP filtering, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Xata’s customers use it in two major ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For creating preview and testing environments quickly and with real data.&lt;/li&gt;
&lt;li&gt;As a fully-managed PostgreSQL service to run the production database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first use case is becoming particularly popular with the advent of coding agents. Producing the code is now fast and cheap, but validating it with realistic production workloads still takes a long time. Synthetic or seeded data just doesn’t cut it for finding performance issues and subtle corner-case data bugs. Copying production data with pg_dump is slow and risky. Keeping many dev instances up and running for a long time is very expensive.&lt;/p&gt;

&lt;p&gt;Copy-on-write branching combined with scale-to-zero is the ideal solution for these types of issues.&lt;/p&gt;

&lt;p&gt;With the new open-source distribution of Xata, any engineer in any organization can deploy Xata without the concerns of adding a new vendor or worrying about licensing.&lt;/p&gt;

&lt;p&gt;For the second use case, organizations are now able to use Xata as their internal Postgres service. This is particularly useful if your organization needs a variety of instance sizes (e.g. a few large databases, many small ones, database per PR, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  Copy-on-write: a short intro
&lt;/h2&gt;

&lt;p&gt;To understand the Xata architecture, let’s start at the lowest layer and then go up.&lt;/p&gt;

&lt;p&gt;The idea of copy-on-write (CoW) is not new. It has been widely used in databases, file systems, and operating systems. It typically looks something like this:&lt;/p&gt;

&lt;p&gt;Data is split into blocks and the location of each block is stored in a special metadata “index” block. Instead of accessing data directly, the program finds the location of the block in the index before accessing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17os74y15p33h8xpx4kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17os74y15p33h8xpx4kz.png" alt="Copy-on-Write phase 1" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a copy is created, only the index block is copied initially. This new index block points to the same data blocks as the original. This is what makes the copy operation so fast: the index block is typically tiny compared to the full data size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh1qjwdezwy7ysztyg9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh1qjwdezwy7ysztyg9c.png" alt="Copy-on-Write phase 2" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As long as both copies receive only reads, this arrangement works. As soon as either of the two copies receives a write, the copy happens (hence the name copy-on-write). But instead of copying all of the data, only the block that received the write is copied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm87nd3fyot7grvk8rrfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm87nd3fyot7grvk8rrfb.png" alt="Copy-on-Write phase 3" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the blocks are relatively small, the copy is almost invisible and adds no significant overhead. Postgres, for example, always writes full pages anyway, even if you only changed a single byte.&lt;/p&gt;

&lt;p&gt;If every block is eventually written to, every block will eventually be copied. However, in many practical cases, a lot of blocks are never written after the initial creation, which means this scheme has the potential to save a lot of storage space.&lt;/p&gt;

&lt;h2&gt;
  
  
  The key is in the storage system
&lt;/h2&gt;

&lt;p&gt;One thing to note is that the original and the copy need to be able to access the same data blocks. Otherwise, there’s no way to take advantage of CoW. The simplest way of achieving this is to run both on the same server. For Postgres branching, you can do this with a CoW filesystem such as ZFS or btrfs, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu27ms4erlp57hnrhvil1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu27ms4erlp57hnrhvil1.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;
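
&lt;p&gt;As a minimal sketch (the dataset names and port are made up), branching a Postgres data directory on a single node with ZFS looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# snapshot the parent's data directory, then clone it
zfs snapshot tank/pgdata@branch-point
zfs clone tank/pgdata@branch-point tank/pgdata-dev

# start a second Postgres on the clone, on a different port;
# it comes up via normal crash recovery, as after a power loss
rm -f /tank/pgdata-dev/postmaster.pid
pg_ctl -D /tank/pgdata-dev -o "-p 5433" start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;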

&lt;p&gt;This is a good start, but it means that the parent and the child branches need to share the same CPU and RAM resources. For example, if the server has, say, 6 GB of RAM, and you need 2 GB for each Postgres instance, you can only run the parent and two child branches.&lt;/p&gt;

&lt;p&gt;This leads us to the idea of mounting the storage over the network. This way we can break free of single-node limits:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi1q3w9nbi7x7bxud3z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi1q3w9nbi7x7bxud3z1.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot can be written about the pros and cons of separating compute and storage for running databases. However, our primary use cases involve branching and scale-to-zero, and separating the two allows for a much more flexible compute layer.&lt;/p&gt;

&lt;p&gt;Freed up from the relatively static nature of bytes on disk, the compute layer can scale up and down with demand. We're efficiently bin-packing Postgres instances on the most optimal set of nodes.&lt;/p&gt;

&lt;p&gt;In practice, it might look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g4kq8jngehrtr6rxcko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g4kq8jngehrtr6rxcko.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to mount storage over the network, from good old NFS to sshfs or WebDAV. Each has its own ideal use cases. For running databases with high performance, NVMe over fabrics (NVMe-oF) stands out.&lt;/p&gt;

&lt;p&gt;NVMe-oF has a stable implementation in the Linux kernel and is designed for high performance. It is capable of driving hundreds of thousands of IOPS over the network, which is exactly what Postgres needs.&lt;/p&gt;

&lt;p&gt;Because it was important to us to run 100% vanilla PostgreSQL, with no significant modifications, a storage engine exposed over NVMe-oF was the clear choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  On the shoulders of giants
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitnmydidw0wbx4xfbyw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitnmydidw0wbx4xfbyw5.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking deeper into the technology behind this effort, Xata is built on top of two mature cloud-native open source projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/cloudnative-pg" rel="noopener noreferrer"&gt;CloudNativePG&lt;/a&gt; is a Postgres operator for Kubernetes. It handles most of the typical production concerns: high-availability, failover/switchover, upgrades, connection pooling, backups, etc.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openebs/openebs" rel="noopener noreferrer"&gt;OpenEBS&lt;/a&gt; is a cloud-native storage project offering both local storage (i.e. local NVMe disks) and a replicated storage engine (called Mayastor).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Xata OSS project doesn’t just put these two technologies together; it builds a set of new components that work on top of and alongside them.&lt;/p&gt;

&lt;p&gt;More precisely, Xata adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL gateway&lt;/strong&gt;, responsible for routing, IP filtering, waking up scaled-to-zero clusters, serving the serverless driver over HTTP / websockets, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch operator&lt;/strong&gt; managing all resources related to a branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clusters&lt;/strong&gt; and &lt;strong&gt;projects&lt;/strong&gt; services for the control-plane and REST APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth service&lt;/strong&gt;, based on Keycloak, for API keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI&lt;/strong&gt; that makes use of the REST API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-to-zero CNPG plugin&lt;/strong&gt; for automatically hibernating branches on inactivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these components are now open source in the &lt;a href="https://github.com/xataio/xata" rel="noopener noreferrer"&gt;Xata repository&lt;/a&gt;. We also remain committed to our popular and long-standing OSS projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/xataio/pgstream" rel="noopener noreferrer"&gt;pgstream&lt;/a&gt; offers PostgreSQL replication with DDL statements and anonymization&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/xataio/pgroll" rel="noopener noreferrer"&gt;pgroll&lt;/a&gt; offers zero-downtime, undoable, schema migrations for PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When NOT to use Xata open source
&lt;/h2&gt;

&lt;p&gt;You might be wondering how we will make money if we just open sourced our platform. First, we should say that we didn’t open source quite everything. We have kept some components in our private repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code for deploying and securing multi-organization, multi-region, and multi-cell installations.&lt;/li&gt;
&lt;li&gt;our own “Xatastor” storage engine. It is similar to OpenEBS/Mayastor, but we developed it for the exact needs of agentic workloads. We’ll be talking more about it in a future blog post.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to technical features, it’s a matter of convenience: even with a platform like Xata, operating a large fleet of Postgres clusters can be a challenge. Your time might be better spent on your core business offerings.&lt;/p&gt;

&lt;p&gt;This is why there are some cases where we actually recommend against using Xata open source, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You only need a single Postgres instance. Xata is running on top of a Kubernetes cluster and it would be overkill for a single instance. You can either self-host Postgres on your own, or use a managed Postgres service, like &lt;a href="https://xata.io/" rel="noopener noreferrer"&gt;Xata Cloud&lt;/a&gt; (coming soon).&lt;/li&gt;
&lt;li&gt;Offering a public Postgres-as-a-Service to your own users and customers. While the Apache 2 license allows this use case (and we have no intention of changing that in the future), we don’t recommend it, because the OSS version lacks some security features related to adversarial multi-tenancy. If you are looking to offer PGaaS to your users and customers, our BYOC option is likely what you need. &lt;a href="https://xata.io/onboarding" rel="noopener noreferrer"&gt;Book some time with us&lt;/a&gt; to discuss it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Starting now, the core Xata platform is open-source under the Apache 2 license. If you are looking to deploy it inside your organization, we’d love to help, so don’t hesitate to reach out via GitHub issues or email.&lt;/p&gt;

&lt;p&gt;Also, this is only the beginning of what's new at Xata. We have some exciting announcements happening over the next few weeks to share all of the important Postgres offerings we've been working on this year. Keep up with the announcements on &lt;a href="https://x.com/xata" rel="noopener noreferrer"&gt;X/Twitter&lt;/a&gt;, &lt;a href="https://bsky.app/profile/xata.io" rel="noopener noreferrer"&gt;BlueSky&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/company/xataio/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>opensource</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>Karpathy's LLM wiki pattern is missing a data layer. Here's how to add one.</title>
      <dc:creator>Claudiu Dascalescu</dc:creator>
      <pubDate>Thu, 16 Apr 2026 09:57:12 +0000</pubDate>
      <link>https://dev.to/claudiud/karpathys-llm-wiki-pattern-is-missing-a-data-layer-heres-how-to-add-one-5bp6</link>
      <guid>https://dev.to/claudiud/karpathys-llm-wiki-pattern-is-missing-a-data-layer-heres-how-to-add-one-5bp6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published on the Xata blog. Cross-posting here for the dev.to Postgres/AI crowd.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're running product, go-to-market, or operations, your work lives in two places. Traditionally, you’re comfortable writing things like strategy docs, competitive intel, research notes, and campaign briefs. But the decisions you make (hopefully) depend on data: activation rates by cohort, pipeline conversion, revenue trends, campaign ROI. Your context exists in writing; your decisions rely on data. This is nothing new, but how well and how fast we can run these cycles has completely changed with AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f" rel="noopener noreferrer"&gt;Andrej Karpathy recently shared a pattern&lt;/a&gt; for using LLMs to build personal knowledge bases, and it resonated with a LOT of people - 18M views, and what feels like an equal number of derivative pieces, as of this writing. If your work is pure research and idea generation, it's all you need. Markdown is the right container for concepts, connections, and synthesis.&lt;/p&gt;

&lt;p&gt;But if you need to make a quarterly planning decision, a wiki page will only get you halfway there. You need activation rates, not a summary of what activation means. You need pipeline numbers, not a positioning doc. Karpathy's pattern only handles the first half. The missing piece is a database alongside your wiki, with the LLM maintaining both and the references between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Karpathy proposed
&lt;/h2&gt;

&lt;p&gt;You rarely touch the wiki directly. The LLM writes it, maintains it, and keeps it consistent. You handle the thinking; it handles the work.&lt;/p&gt;

&lt;p&gt;His first step is to index source documents (articles, papers, repos, datasets, images) into a &lt;code&gt;raw/&lt;/code&gt; directory. The LLM then incrementally "compiles" a wiki from those sources: summaries, concept pages, backlinks, entity articles, all cross-linked. The wiki lives as &lt;code&gt;.md&lt;/code&gt; files, ready to view in your editor of choice - his preference is Obsidian. But remember, in this case Obsidian is a viewer, not an editor. The LLM is the editor.&lt;/p&gt;

&lt;p&gt;Once the wiki reaches meaningful size (Karpathy's is around 100 articles and 400K words on some topics), you can ask the LLM complex research questions against it. The LLM reads the index files and summaries, pulls in the relevant articles, and synthesizes answers. At this scale, you don't need RAG infrastructure. The LLM's own index maintenance and the ability to read files on demand is enough.&lt;/p&gt;

&lt;p&gt;Most importantly, outputs get filed back into the wiki. Your explorations and queries accumulate in the knowledge base, making it richer for future queries. The wiki basically gets stronger with exercise (novel concept), not just from ingesting new sources. Periodic linting keeps the wiki improving over time. As the wiki grows, you build lightweight tools around it. Karpathy vibe-coded a small search engine over the wiki that he uses directly through a web UI and hands off to the LLM as a CLI tool for larger queries.&lt;/p&gt;

&lt;p&gt;Karpathy's closing observation: "I think there is room here for an incredible new product instead of a hacky collection of scripts." He's right - and several new products have already been shared. The challenge with them is that they’re all making decisions based on experience and professional principles, not data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two compilers, not one
&lt;/h2&gt;

&lt;p&gt;I spent an embarrassing amount of time trying to make markdown tables work for tracking weekly metrics before admitting what should have been obvious from the start: knowledge and data behave differently. A concept page in your wiki gets rewritten and refined over time. The LLM improves it as it learns more. But last week's signup numbers don't get rewritten. They're a fact. Next week adds another row. The wiki pattern is built for knowledge that evolves through revision. Your data evolves through accumulation. Markdown doesn't have a natural model for "append a row to an existing dataset." Databases do. That's literally what they are.&lt;/p&gt;
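
&lt;p&gt;That “append a row” model is literally one statement (the table and columns here are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# last week's numbers stay untouched; this week adds a row
psql "$DATABASE_URL" -c "INSERT INTO weekly_metrics (week, signups) VALUES ('2026-W14', 147);"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;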

&lt;p&gt;Karpathy built a knowledge compiler. Raw text goes in, structured wiki comes out. What's missing is a data compiler that sits alongside it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Sources
|
+-- Text sources (articles, papers, docs, transcripts)
|   |
|   +-&amp;gt; LLM compiles -&amp;gt; Wiki (.md files)
|       Concepts, summaries, backlinks, weekly reports
|
+-- Data sources (APIs, CSVs, event streams, logs)
    |
    +-&amp;gt; LLM compiles -&amp;gt; Database (Postgres)
        Schemas, tables, derived views, saved queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two layers serve different purposes and need different operations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Wiki (knowledge)&lt;/th&gt;
&lt;th&gt;Database (data)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shape&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Paragraphs, concepts, relationships&lt;/td&gt;
&lt;td&gt;Rows, columns, time-series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core operation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Summarize, link, synthesize&lt;/td&gt;
&lt;td&gt;Query, aggregate, compare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write and maintain text&lt;/td&gt;
&lt;td&gt;Define schemas, write queries, maintain derived tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update pattern&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rewrite sections, add pages&lt;/td&gt;
&lt;td&gt;Insert rows, run migrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scales by&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More pages&lt;/td&gt;
&lt;td&gt;More rows and tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interesting part: &lt;strong&gt;the two layers reference each other constantly&lt;/strong&gt;. A weekly analytics report (wiki) cites a SQL query that produced the numbers (database). A database table of experiment results gets interpreted in a wiki page that explains what the numbers mean and what to do about them. The LLM maintains both layers and the references between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loop: database to markdown and back
&lt;/h2&gt;

&lt;p&gt;The questions you most want answered span both layers. "Why did activation drop?" is not a data question or a knowledge question. It's both. You need a database to compute the drop, and a wiki to explain it.&lt;/p&gt;

&lt;p&gt;Here's what this looks like if you run a weekly product analytics review. Say you're tracking signups, activation, branch creation, and revenue for a SaaS product.&lt;/p&gt;

&lt;p&gt;It starts with the database. The LLM runs SQL against Postgres to pull the raw numbers. These are relational queries that markdown can't answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;first_project&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_project_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;organization_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;first_project&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;first_project_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-23'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;first_project_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-30'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Revenue joins across billing tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_micros&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;invoice_line_items&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'paid'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'issued'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-23'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-30'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the LLM takes the query results, computes trends (week-over-week changes, cohort comparisons), and writes a report that combines the numbers with qualitative context: what you shipped that week, what might explain a spike or drop, what to watch next. That report is a markdown file that goes into your knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Week 14&lt;/span&gt;

&lt;span class="gu"&gt;## Product Metrics&lt;/span&gt;
&lt;span class="gs"&gt;**replaced with dummy data**&lt;/span&gt;

&lt;span class="gs"&gt;**New orgs that created branches:**&lt;/span&gt; 57 orgs (38.8% of new orgs)
Activation rate dropped from 79.5% last week. Worth investigating
whether the signup flow or onboarding changed.

&lt;span class="gs"&gt;**Total branches created:**&lt;/span&gt; 342 (-2.3% WoW)
Branch volume held roughly flat despite fewer activating orgs,
meaning the orgs that did activate created more branches per org.

&lt;span class="gu"&gt;## Key Insights&lt;/span&gt;

The headline this week is the activation rate drop to 38.8%,
down from 79.5%. Org creation held up (147, +11%) and
visit-to-org conversion reached a year high at 3.28%,
so top of funnel is working. The gap is between signup
and first branch.

&lt;span class="gu"&gt;## Execution This Week&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; launch plan and drumbeat campaign
&lt;span class="p"&gt;-&lt;/span&gt; push updated campaigns to staging
&lt;span class="p"&gt;-&lt;/span&gt; blog post refresh and traffic analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where the two layers start compounding. The report says activation dropped from 79.5% to 38.8%. That raises a question: why? You ask the LLM. It goes back to the database and queries activation by signup source to check if a specific channel brought in low-intent users. It searches the wiki for last week's report and finds that the previous cohort included a batch of internal test accounts that inflated the baseline. It checks the execution notes and sees the onboarding flow wasn't changed.&lt;/p&gt;

&lt;p&gt;You get an answer that combines fresh data ("activation by channel shows organic signups activated at 52%, while Product Hunt signups activated at 12%") with knowledge from the wiki ("W13 baseline was inflated by internal testing, see weekly report W13. Onboarding flow unchanged.") Neither layer alone could have given you that answer.&lt;/p&gt;

&lt;p&gt;And now that answer gets filed back into your knowledge base too. Next week, when the LLM writes the W15 report, it has the full context chain: the drop in W14, the investigation, the root cause. Your knowledge base gets smarter every cycle, without you doing any of the bookkeeping.&lt;/p&gt;

&lt;p&gt;This is the loop: &lt;strong&gt;database produces the numbers, wiki captures the interpretation, follow-up questions drive deeper into both, and everything accumulates.&lt;/strong&gt; Each weekly report isn't just a snapshot. It's another layer of context that makes your next analysis better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it together: structure and downstream decisions
&lt;/h2&gt;

&lt;p&gt;The product analytics example above shows the loop. But this pattern works for any domain where you mix text with numbers. Here's what the directory structure looks like when you set up a knowledge base with both layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knowledge-base/
|
+-- raw/                          # immutable sources
|   +-- api-exports/              # raw data pulls (PostHog, billing)
|   +-- articles/                 # clipped web articles
|   +-- notes/                    # meeting notes, call transcripts
|
+-- wiki/                         # LLM-maintained markdown
|   +-- index.md                  # catalog of all pages
|   +-- weekly/                   # weekly reports with insights
|   |   +-- 2026-w14.md
|   |   +-- 2026-w13.md
|   +-- concepts/                 # evergreen concept pages
|   |   +-- activation-rate.md
|   |   +-- cohort-quality.md
|   +-- log.md                    # chronological ingest log
|
+-- data/                         # structured data layer
|   +-- schema.sql                # LLM-maintained schema
|   +-- migrations/               # schema evolution
|   +-- queries/                  # saved queries the LLM wrote
|   |   +-- new-orgs-by-week.sql
|   |   +-- activation-rate.sql
|   |   +-- revenue-by-week.sql
|
+-- outputs/                      # rendered artifacts
|   +-- charts/                   # matplotlib/observable plots
|   +-- reports/                  # formatted reports
|
+-- schema.md                     # instructions for both compilers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;schema.md&lt;/code&gt; file (what Karpathy calls the "schema layer") now has instructions for both compilers. It covers how to ingest text into the wiki and data into the database, what page structure the wiki follows and what table schemas the database uses.&lt;/p&gt;

&lt;p&gt;Your data layer is a Postgres database that mirrors whatever data sources you're working with - product analytics, campaign performance, pipeline stages, billing, customer data. We wrote about &lt;a href="https://xata.io/blog/postgres-data-warehouse" rel="noopener noreferrer"&gt;how to build a product analytics warehouse in Postgres&lt;/a&gt; using materialized views and pg_cron, and that's exactly the kind of setup that works well as the data layer here. The LLM queries it directly via SQL whenever it needs to produce a report, answer a question, or validate a claim in a strategy doc.&lt;/p&gt;
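
&lt;p&gt;A minimal version of that setup, assuming the pg_cron extension is installed and a raw signups table already exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# a derived view the LLM can query cheaply
psql "$DATABASE_URL" -c "CREATE MATERIALIZED VIEW weekly_signups AS SELECT date_trunc('week', created_at) AS week, count(*) AS signups FROM signups GROUP BY 1;"

# refresh it every Monday at 03:00 via pg_cron
psql "$DATABASE_URL" -c "SELECT cron.schedule('refresh-weekly-signups', '0 3 * * 1', 'REFRESH MATERIALIZED VIEW weekly_signups');"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;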

&lt;p&gt;The database is where the numbers live. The wiki is where the context lives. And because everything ends up in the same knowledge base, the LLM can move between the two without you having to point it at the right source every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making the LLM useful with your data
&lt;/h3&gt;

&lt;p&gt;Getting the LLM to work with your wiki is trivial since it reads and writes markdown. The database side takes a bit more setup, but it comes down to three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Give it a schema&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Any ORM schema or plain &lt;code&gt;schema.sql&lt;/code&gt; works, since Xata is vanilla Postgres. &lt;a href="https://www.prisma.io/docs/getting-started/setup-prisma/start-from-scratch/relational-databases-typescript-postgresql" rel="noopener noreferrer"&gt;Prisma's PostgreSQL quickstart&lt;/a&gt; is a good starting point if you want typed queries.&lt;/p&gt;

&lt;p&gt;The key is describing what tables exist, what columns they have, and how they relate. Without it, every query starts with the LLM exploring your database blind.&lt;/p&gt;
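
&lt;p&gt;One low-effort way to produce that description, assuming you have psql access, is to dump just the schema into the knowledge base:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# table definitions only, no data, for the LLM to read
pg_dump --schema-only --no-owner -f data/schema.sql "$DATABASE_URL"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;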

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Save your queries&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first time you ask a question, the LLM figures out the schema, writes the query, runs it, maybe adjusts it. That costs tokens and time. The second time, it will just run &lt;code&gt;activation-rate.sql&lt;/code&gt;. Saved queries are the data equivalent of wiki pages: compiled knowledge about how to get a specific answer.&lt;/p&gt;
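
&lt;p&gt;Running a saved query is then a single call, following the directory layout above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql "$DATABASE_URL" -f data/queries/activation-rate.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;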

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Write a skill file&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The schema says &lt;code&gt;status String&lt;/code&gt;. It doesn't say "a branch with status 'active' might still be hibernating." A skill file teaches the LLM what your columns actually mean, which metrics matter, and how to interpret what it finds. This is the highest-leverage thing you can do for output quality. You stop getting generic SQL output and start getting answers that reflect how &lt;em&gt;your&lt;/em&gt; &lt;em&gt;team&lt;/em&gt; thinks about &lt;em&gt;your&lt;/em&gt; &lt;em&gt;business&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Picking a database for this
&lt;/h3&gt;

&lt;p&gt;⚡ TLDR: we’re obviously going to suggest using a Xata Postgres database, but you do you.&lt;/p&gt;

&lt;p&gt;Sign up at &lt;a href="http://console.xata.io/" rel="noopener noreferrer"&gt;console.xata.io&lt;/a&gt;, then:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;curl -fsSL https://xata.io/install.sh | bash&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;xata auth login&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;xata init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A knowledge base database has a specific usage pattern: you query it intensely for an hour while writing a weekly report, then don't touch it for days. Look for two things in your Postgres setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;scale-to-zero&lt;/strong&gt;: don’t pay for compute while the database sits idle between sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;branching&lt;/strong&gt;: so you can test hypotheses without risking your master knowledge base&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember, your LLM is effectively &lt;a href="https://xata.io/blog/every-ai-agent-needs-a-database-it-can-break" rel="noopener noreferrer"&gt;an agent that needs a database it can break&lt;/a&gt;. Let it experiment freely on a copy.&lt;/p&gt;

&lt;p&gt;This pattern works at personal or small-team scale: a few data sources, a few hundred thousand rows, one or two people asking questions. &lt;a href="https://xata.io/blog/is-postgres-really-enough-in-2026" rel="noopener noreferrer"&gt;Postgres handles more than most people think&lt;/a&gt; before you need to reach for something else. If you need Spark here, I’d hate to see your Claude bill.&lt;/p&gt;

&lt;p&gt;If you want to add a data layer to your own knowledge base, start with one data source. Pick the data you're currently cramming into markdown (product events, billing exports, analytics CSVs), put it in Postgres, write a skill file that explains what the columns mean, and run a few cross-layer queries. Save the queries that work. The pattern compounds from there, just like Karpathy's wiki does.&lt;/p&gt;
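
&lt;p&gt;For a CSV export, the import side is one psql command (the table and file names are placeholders, and the table must exist first):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# load a billing export into an existing Postgres table
psql "$DATABASE_URL" -c "\copy invoices FROM 'raw/api-exports/invoices.csv' CSV HEADER"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;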

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
