On Andy Pavlo's DB review

Pranav Aurora — Thu, 02 Jan 2025 18:37:10 +0000

Andy Pavlo's yearly review has a massive chokehold on the DB community. It's like the Oscars of databases?

This year's review was pretty special: our project, pg_mooncake, was mentioned.

Here are some thoughts from reading the review, and what we've learnt at Mooncake Labs in our first 121 days of existence.

1. Yes, we're guilty of 'Shoving Ducks everywhere'...

Our first project, pg_mooncake, added a native columnstore table (Iceberg) to Postgres for 1000x faster analytics.

While there are quite a few extensions on the market bringing DuckDB into Postgres, we focussed on making the columnar storage feel like a regular Postgres table: transactional writes, triggers, etc. See our architecture.

To us, it feels like the final touch to complete the 'analytics in PG experience'. Almost a decade after early projects like Citus, we're optimistic that analytics in Postgres will be a reality.

2. 2024 felt like the year of the Data Lake.

Snowflake vs Databricks. Elastic's 'search lake' (lol). S3 Tables.

What I mean by the 'lake': serverless workloads on data in object storage.

In 2024, analytics (Databricks SQL, Snowflake Iceberg) and vector search (Turbopuffer, Lance) moved to the lake.

In 2025, I reckon there will be more workloads (lookups, full-text) running in this manner.

3. As for vector search...

Agents are everywhere, and yet vector search wasn't a topic at all... A couple of thoughts:

Just use Postgres
If you have big 'data', LanceDB / Turbopuffer
Vector search workloads are moving toward full-text workloads, something we've noticed a lot. Hybrid search results are often ~95%+ full-text results.
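
To make that last point concrete, here is a rough sketch of how you could measure the full-text share of a hybrid result set. It assumes the two result lists are merged with reciprocal rank fusion (a common choice, not necessarily what any particular system does), and it counts a fused result as "full-text" if it appeared in the full-text list at all. The function names and the toy document ids are made up for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def full_text_share(fused, full_text_hits, top_n):
    """Fraction of the fused top_n that also appeared in the full-text list."""
    top = fused[:top_n]
    if not top:
        return 0.0
    hits = set(full_text_hits)
    return sum(1 for d in top if d in hits) / len(top)


# Toy data (hypothetical): ranked doc ids from each retriever.
full_text_hits = ["a", "b", "c", "d"]
vector_hits = ["b", "e", "a", "f"]

fused = rrf_fuse([full_text_hits, vector_hits])
share = full_text_share(fused, full_text_hits, top_n=4)
print(fused[:4], share)
```

On real traffic you would log both ranked lists per query and average the share across queries.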

4. As for AI / Agents

A lot of the AI companies we spend time with are building a 'system of record' for each customer... And they're all storing structured/unstructured data in a 'Lake'. See Rox's architecture.

Another trend we've seen: LLMs being used for data processing and ML tasks (feature extraction, classifiers).

It kind of makes sense too… on small data. Product engineers can use LLMs out of the box, instead of picking/training/deploying ML models for each task.

I am super super curious how Snowflake, Databricks and Redshift AI functions will play out this year.

2025 will be exciting.

Pranav

DEV Community: Pranav Aurora
