<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tiger Data (Creators of TimescaleDB)</title>
    <description>The latest articles on DEV Community by Tiger Data (Creators of TimescaleDB) (@tigerdata).</description>
    <link>https://dev.to/tigerdata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2028%2F55d4ec28-b9c7-4adb-bd8f-08fad8f4c075.png</url>
      <title>DEV Community: Tiger Data (Creators of TimescaleDB)</title>
      <link>https://dev.to/tigerdata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tigerdata"/>
    <language>en</language>
    <item>
      <title>pg_textsearch 1.0: How We Built a BM25 Search Engine on Postgres Pages</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:09:03 +0000</pubDate>
      <link>https://dev.to/tigerdata/pgtextsearch-10-how-we-built-a-bm25-search-engine-on-postgres-pages-42cc</link>
      <guid>https://dev.to/tigerdata/pgtextsearch-10-how-we-built-a-bm25-search-engine-on-postgres-pages-42cc</guid>
      <description>&lt;p&gt;&lt;em&gt;Design, implementation, and benchmarks of a native BM25 index for Postgres. Now generally available to all&lt;/em&gt; &lt;a href="https://www.tigerdata.com/cloud" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;Tiger Cloud&lt;/u&gt;&lt;/em&gt;&lt;/a&gt; &lt;em&gt;customers and freely available via open source.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you have used Postgres's built-in &lt;code&gt;ts_rank&lt;/code&gt; for full-text search at any meaningful scale, you already know the limitations. Ranking quality degrades as your corpus grows. There is no inverse document frequency, so common words carry the same weight as rare ones. There is no term frequency saturation, so a document that mentions "database" 50 times outranks one that mentions it once. There is no efficient top-k path: scoring requires touching every matching row.&lt;/p&gt;

&lt;p&gt;Most teams work around this by bolting on Elasticsearch or Typesense as a sidecar. That works, but now you are syncing data between two systems, operating two clusters, and debugging consistency issues when they diverge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch" rel="noopener noreferrer"&gt;&lt;u&gt;pg_textsearch&lt;/u&gt;&lt;/a&gt; takes a different approach: real BM25 scoring, built from scratch in C on top of Postgres's own storage layer. You create an index, write a query, and get results ranked by relevance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database ranking'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database ranking'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;lt;@&amp;gt;&lt;/code&gt; operator returns a BM25 relevance score. Scores are negated so that Postgres's default ascending ORDER BY returns the most relevant results first. The index is stored entirely in standard Postgres pages managed by the buffer cache. It participates in WAL, works with pg_dump and streaming replication, and requires no external storage or special backup procedures.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What shipped in 1.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;From preview to production. In October 2025, we released a preview that held the entire inverted index in shared memory, rebuilt from the heap on restart (preview blog). In the five months and 180+ commits since, the extension has been substantially rewritten:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• Disk-based segments replaced the memory-only architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• Block-Max WAND + WAND optimization for fast top-k queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• Posting list compression with SIMD-accelerated decoding (41% smaller indexes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• Parallel index builds (138M documents in under 18 minutes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• 2.4x to 6.5x faster than ParadeDB/Tantivy for 2-4 term queries at 138M scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;• 8.7x higher concurrent throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;This post covers the architecture, query optimization strategy, and benchmark results. We include a candid discussion of where ParadeDB is faster and a full accounting of current limitations.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Background: Why BM25 in Postgres?
&lt;/h2&gt;

&lt;p&gt;Postgres ships &lt;code&gt;tsvector/tsquery&lt;/code&gt; with &lt;code&gt;ts_rank&lt;/code&gt; for full-text ranking. &lt;code&gt;ts_rank&lt;/code&gt; uses an ad-hoc scoring function that lacks the three properties that make BM25 effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inverse document frequency (IDF):&lt;/strong&gt; downweights common terms so that rarer, more informative terms drive the ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Term frequency saturation:&lt;/strong&gt; prevents a document from scoring arbitrarily high by repeating a term many times. A document mentioning "database" 50 times is not 50 times more relevant than one mentioning it once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document length normalization:&lt;/strong&gt; accounts for the fact that a term match in a short document is more informative than the same match in a long one [1].&lt;/li&gt;
&lt;/ul&gt;
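The three properties above combine in the standard BM25 scoring function. Here is a minimal Python sketch for illustration (not the extension's C implementation), using the common Robertson-style IDF and the default parameters mentioned later in this post (k1 = 1.2, b = 0.75):

```python
import math

def bm25_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """One term's contribution to one document's score (textbook BM25)."""
    # IDF: rare terms (low doc_freq) weigh more than common ones.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: a match in a short document is more informative.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # TF saturation: the term-frequency factor asymptotically approaches k1 + 1.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Saturation in action: 50 mentions scores nowhere near 50x one mention.
one = bm25_score(tf=1, doc_len=100, avg_doc_len=100,
                 doc_count=1_000_000, doc_freq=10_000)
fifty = bm25_score(tf=50, doc_len=100, avg_doc_len=100,
                   doc_count=1_000_000, doc_freq=10_000)
assert one < fifty < 3 * one  # saturates far below a 50x ratio
```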

&lt;p&gt;For applications where ranking quality matters (RAG pipelines, search-driven UIs, hybrid retrieval), this is a material limitation. At scale, &lt;code&gt;ts_rank&lt;/code&gt; also has no top-k optimization path: ranking by relevance requires scoring every matching row.&lt;/p&gt;

&lt;p&gt;The primary existing BM25 extension for Postgres is ParadeDB/pg_search, which wraps the Tantivy search library written in Rust. Early versions stored the index in auxiliary files outside the WAL; current versions use Postgres pages.&lt;/p&gt;

&lt;p&gt;pg_textsearch takes a different approach: rather than wrapping an external search library, the entire search engine (tokenization, compression, query optimization) is built from scratch in C on top of Postgres's storage layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex8hr08ubhffvj31eb79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex8hr08ubhffvj31eb79.png" alt="Fig. 1: pg_textsearch Architecture diagram" width="800" height="1249"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 1: pg_textsearch Architecture diagram&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The hybrid memtable + segment design
&lt;/h3&gt;

&lt;p&gt;pg_textsearch uses an LSM-tree-inspired architecture [4]. Incoming writes go to an in-memory inverted index (the memtable), which periodically spills to immutable on-disk segments. Segments compact in levels: when a level accumulates enough segments (default 8), they merge into the next level. Fewer segments means fewer posting lists to consult per query term, which directly reduces query latency. This is the same write-optimized-memtable / read-optimized-segment pattern used in RocksDB [5] and other LSM-based engines, adapted here for Postgres's page-based storage.&lt;/p&gt;
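A toy simulation shows why leveled compaction keeps the per-query segment count small. This is illustrative Python, not the extension's internals; only the default fan-out of 8 comes from the text above:

```python
FANOUT = 8  # default: a level compacts once it accumulates 8 segments

def spill_and_compact(levels, new_segment_docs):
    """Append a freshly spilled segment to level 0, then cascade merges."""
    levels[0].append(new_segment_docs)
    level = 0
    while len(levels[level]) >= FANOUT:
        merged = sum(levels[level])   # merge the level into one larger segment
        levels[level] = []
        if level + 1 == len(levels):
            levels.append([])
        levels[level + 1].append(merged)
        level += 1
    return levels

levels = [[]]
for _ in range(70):                   # 70 memtable spills of 1M docs each
    spill_and_compact(levels, 1_000_000)

# Query cost tracks total segment count, which stays small (7 here, not 70):
total_segments = sum(len(lv) for lv in levels)
```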
&lt;h3&gt;
  
  
  The write path: memtable
&lt;/h3&gt;

&lt;p&gt;The memtable lives in Postgres shared memory, one per index, accessible to all backends. It contains a string-interning hash table that stores each unique term exactly once; per-term posting lists recording document IDs and term frequencies; and corpus statistics (document count and average document length) maintained incrementally so that BM25 scores can be computed without a separate pass over the index.&lt;/p&gt;

&lt;p&gt;When the memtable exceeds a configurable threshold (default: 32M posting entries), it spills to a Level-0 disk segment at transaction commit. A secondary trigger (default: 100K unique terms per transaction) handles large single-transaction loads like bulk imports.&lt;/p&gt;

&lt;p&gt;The memtable is rebuilt from the heap on startup. Since the heap is WAL-logged, no data is lost if Postgres crashes before a spill completes. This is analogous to how a write-ahead log protects an LSM memtable, except here the WAL is Postgres's own. The rebuild cost is proportional to the amount of data not yet spilled to segments; for indexes where most data has been spilled, startup is fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opgjv8tk3srcg31n64y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opgjv8tk3srcg31n64y.png" alt="Fig. 2: pg_textsearch memtable write path" width="800" height="923"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 2: pg_textsearch memtable write path&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The read path: segments
&lt;/h3&gt;

&lt;p&gt;Segments are immutable and stored in standard Postgres pages. Each segment contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A term dictionary:&lt;/strong&gt; a sorted array of offsets into a string pool, binary-searchable for O(log n) term lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Posting blocks&lt;/strong&gt; of up to 128 documents each, containing delta-encoded doc IDs, packed term frequencies, and quantized document lengths (fieldnorms). A separate skip index stores one entry per posting block with upper-bound score metadata used by Block-Max WAND optimization (described below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fieldnorm table&lt;/strong&gt; mapping document lengths to 1-byte quantized values using Lucene/Tantivy's SmallFloat encoding [6]. This encoding is exact for lengths 0-39 (covering most short documents); for longer documents, quantization error increases from ~5% to ~11%. In practice, the impact on ranking is smaller than these numbers suggest: BM25 scores depend on the ratio of document length to average document length, which dampens quantization error, and the b parameter (default 0.75) further reduces length's influence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A doc ID to CTID mapping&lt;/strong&gt; that translates internal document IDs to Postgres tuple identifiers for heap fetches.&lt;/li&gt;
&lt;/ul&gt;
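The fieldnorm tradeoff can be illustrated with a simplified mantissa/exponent quantizer. This is not the exact SmallFloat bit layout, just the same idea: small lengths are stored exactly, larger ones keep only their top bits, bounding relative error:

```python
def quantize_len(length, mantissa_bits=4):
    """Quantize a document length to a small code: exact when the value fits
    in mantissa_bits + 1 bits, otherwise drop the low bits (bounded relative
    error, in the spirit of Lucene/Tantivy's SmallFloat encoding)."""
    if length < (1 << (mantissa_bits + 1)):
        return length                      # short documents: exact
    shift = length.bit_length() - (mantissa_bits + 1)
    return (length >> shift) << shift      # long documents: keep top bits only

# Short documents survive exactly; long ones lose at most ~1/2^4 ~ 6% relative.
assert quantize_len(31) == 31
err = abs(quantize_len(1_047) - 1_047) / 1_047
```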

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoua6q56wmqbqx7knt5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoua6q56wmqbqx7knt5f.png" alt="Fig. 3: pg_textsearch segment internal structure" width="800" height="1304"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 3: pg_textsearch segment internal structure&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Minimizing page access
&lt;/h3&gt;

&lt;p&gt;Storing data in Postgres pages means every access goes through the buffer manager. Even for pages already in cache, each access involves a buffer table lookup, pin acquisition, and lock handling. That overhead adds up in a scoring loop processing millions of postings. This constraint shaped several design decisions.&lt;/p&gt;

&lt;p&gt;Each segment assigns compact 4-byte, segment-local document IDs (0 to N-1), which map to Postgres's 6-byte CTIDs (heap tuple identifiers). After collecting all documents for a segment, doc IDs are reassigned so that doc_id order matches CTID order. Sequential iteration through posting lists then produces sequential access to the CTID mapping, maximizing cache locality. CTIDs themselves are stored as two separate arrays (4-byte page numbers and 2-byte offsets) rather than interleaved 6-byte records, doubling cache line utilization.&lt;/p&gt;

&lt;p&gt;The scoring loop works entirely with doc IDs, term frequencies, and fieldnorms. It never touches the CTID arrays. CTIDs are resolved only for the final top-k results in a single batched pass. A top-10 query that scores thousands of candidates resolves ten CTIDs, not thousands.&lt;/p&gt;
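The doc-ID reassignment and split CTID layout can be sketched as follows (illustrative Python; the extension does this in C over index pages, with 4-byte page-number and 2-byte offset arrays):

```python
def build_ctid_mapping(ctids):
    """Assign segment-local doc IDs in CTID order, and store the mapping as
    two parallel arrays (struct-of-arrays) instead of interleaved records."""
    # ctids: (block_number, offset) heap tuple identifiers.
    ordered = sorted(ctids)                    # doc_id order == CTID order
    pages = [block for block, _ in ordered]    # 4-byte array in C
    offsets = [off for _, off in ordered]      # 2-byte array in C
    doc_id_of = {ctid: i for i, ctid in enumerate(ordered)}
    return pages, offsets, doc_id_of

def resolve_topk(pages, offsets, topk_doc_ids):
    """Only the final top-k results ever touch the CTID arrays."""
    return [(pages[d], offsets[d]) for d in topk_doc_ids]

pages, offsets, doc_id_of = build_ctid_mapping([(7, 2), (3, 1), (7, 1), (3, 4)])
hits = resolve_topk(pages, offsets, [0, 2])    # two heap fetches, not four
```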
&lt;h3&gt;
  
  
  Postgres integration
&lt;/h3&gt;

&lt;p&gt;Because the index is stored in standard buffer-managed pages, pg_textsearch participates in Postgres infrastructure without special handling: MVCC visibility, proper rollback on abort, WAL and physical replication, &lt;code&gt;pg_dump / pg_upgrade&lt;/code&gt;, VACUUM with correct dead-entry removal, and planner hooks that detect the &lt;code&gt;&amp;lt;@&amp;gt;&lt;/code&gt; operator and select index scans automatically. Logical replication works in the usual way: row changes are replicated and the index is rebuilt on the subscriber.&lt;/p&gt;
&lt;h2&gt;
  
  
  Query Optimization: Block-Max WAND
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The top-k problem
&lt;/h3&gt;

&lt;p&gt;Naive BM25 evaluation scores every document matching any query term. For a 3-term query on MS-MARCO v2 (138M documents), this means decoding and scoring posting lists with tens of millions of entries. Most applications need only the top 10 or 100 results. The challenge is finding them without scoring everything.&lt;/p&gt;
&lt;h3&gt;
  
  
  Block-Max WAND
&lt;/h3&gt;

&lt;p&gt;pg_textsearch implements Block-Max WAND (BMW) [2], which uses block-level upper bounds to skip non-contributing posting blocks during top-k evaluation. Lucene adopted a similar approach in version 8.0 [7]. The core idea: maintain the score of the k-th best result seen so far as a threshold, and skip any posting block whose upper-bound score cannot exceed it.&lt;/p&gt;

&lt;p&gt;Each 128-document posting block has a corresponding skip entry storing the maximum term frequency in the block and the minimum fieldnorm (the shortest document, which would score highest for a given term frequency). From these two values, BMW can compute a tight upper bound on the block's BM25 contribution without decompressing it. If the upper bound falls below the current threshold, the entire block (all 128 documents) is skipped.&lt;/p&gt;

&lt;p&gt;To illustrate: consider a single-term top-10 query on a large corpus. After scanning a few thousand postings, the algorithm has accumulated 10 results with a minimum score of, say, 12.3. It now encounters a block where the upper-bound BM25 score (computed from the block's stored metadata) is 9.1. Since 9.1 &amp;lt; 12.3, no document in this block can enter the top 10, and the entire block is skipped without decompression. For short queries on large corpora, the vast majority of blocks are skipped this way.&lt;/p&gt;
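The single-term case above can be sketched in a few lines of Python. Block contents and bounds here are invented for illustration; in the real index the upper bound is derived from the skip entry's max term frequency and min fieldnorm:

```python
import heapq

def bmw_single_term(blocks, k):
    """Block-Max WAND sketch for one term: skip any block whose precomputed
    upper bound cannot beat the k-th best score seen so far.
    blocks: list of (upper_bound, [(doc_id, score), ...])."""
    top = []              # min-heap holding the k best scores so far
    skipped = 0
    for upper_bound, postings in blocks:
        threshold = top[0] if len(top) == k else float("-inf")
        if upper_bound <= threshold:
            skipped += 1  # whole block skipped, never decompressed
            continue
        for _doc, score in postings:      # "decompress" and score the block
            if len(top) < k:
                heapq.heappush(top, score)
            elif score > top[0]:
                heapq.heapreplace(top, score)
    return sorted(top, reverse=True), skipped

blocks = [
    (15.0, [(1, 14.0), (2, 12.0)]),   # scanned: heap not yet full
    (9.0,  [(5, 8.0), (6, 9.0)]),     # bound below threshold 12.0: skipped
    (16.0, [(9, 13.0)]),              # might contain a better hit: scanned
]
top_scores, skipped_blocks = bmw_single_term(blocks, k=2)
```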

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjzcaaou8sgoxsmo0q3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjzcaaou8sgoxsmo0q3b.png" alt="Fig. 4: pg_textsearch Block-Max WAND visualization" width="800" height="591"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 4: pg_textsearch Block-Max WAND visualization&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  WAND pivot selection
&lt;/h3&gt;

&lt;p&gt;For multi-term queries, pg_textsearch adds the WAND algorithm [3] for cross-term skipping. Terms are ordered by their current document ID, and the algorithm identifies a pivot term: the first term whose cumulative maximum score exceeds the current threshold. All terms before the pivot advance to at least the pivot's current doc ID, skipping entire ranges of documents across multiple posting lists simultaneously, before block-level BMW bounds are even checked. For multi-term queries, BMW compares the sum of per-term block upper bounds against the threshold, extending the single-term logic described above.&lt;/p&gt;

&lt;p&gt;The combination of WAND (cross-term skipping) and BMW (within-list block skipping) is most effective for short queries (1-4 terms), which account for the majority of real-world search traffic. In the full MS-MARCO v1 query set (1,010,916 queries from Bing), 72.6% have 2-4 lexemes after English stemming and stopword removal, with a mean of 3.7 and a mode of 3. The speedup narrows for longer queries, where more blocks contain at least one term with a potentially high-scoring document. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Compression and Storage
&lt;/h2&gt;

&lt;p&gt;Posting blocks use a compression scheme designed for fast random-access decoding. Doc IDs are delta-encoded (storing differences between consecutive IDs rather than absolute values), then packed with variable-width bitpacking: the maximum delta in the block determines the bit width, and all deltas use that width. Term frequencies are packed separately with their own bit width. Fieldnorms are the 1-byte SmallFloat values described above.&lt;/p&gt;

&lt;p&gt;The bitpack decode path uses branchless direct-indexed uint64 loads rather than a byte-at-a-time accumulator, eliminating branch misprediction in the inner decode loop. Where available, SIMD intrinsics (SSE2 on x86-64, NEON on ARM64) accelerate the mask-and-store step. A scalar fallback handles other platforms.&lt;/p&gt;

&lt;p&gt;Compression reduces index size by 41% compared to uncompressed storage. Decode overhead is approximately 6% of query time (measured by profiling), which is more than offset by reduced buffer cache pressure. The scheme prioritizes decode speed over compression ratio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on index size comparisons:&lt;/strong&gt; pg_textsearch does not store term positions, so it cannot support phrase queries natively (see Limitations). This makes its indexes inherently smaller than engines like Tantivy that store positions by default. The 19-26% size advantage reported in our benchmarks reflects both compression and this feature difference.&lt;/p&gt;
&lt;h2&gt;
  
  
  Parallel Index Build
&lt;/h2&gt;

&lt;p&gt;For large tables, serial index construction can take hours. pg_textsearch uses Postgres's built-in parallel worker infrastructure to distribute the work.&lt;/p&gt;

&lt;p&gt;The leader launches workers and assigns each a range of heap blocks. Workers scan their assigned blocks, tokenize documents via &lt;code&gt;to_tsvector&lt;/code&gt;, build local in-memory indexes, and write intermediate segments to temporary BufFiles. The leader then performs an N-way merge of all worker output, writing a single merged segment directly to index pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61y9a8j5equ8ngyu0z4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61y9a8j5equ8ngyu0z4d.png" alt="Fig. 5: pg_textsearch Parallel Index Build" width="800" height="994"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 5: pg_textsearch Parallel Index Build&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Workers run concurrently in the scan/tokenize/build phase; the leader merges sequentially. The expensive part (heap scanning, tokenization, posting list assembly) is CPU-bound and parallelizes well. The merge/write phase is comparatively cheap, so a serial merge captures most of the speedup with minimal complexity. It also produces a single fully-compacted segment that is optimal for query performance.&lt;/p&gt;
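The leader's merge step is a classic N-way merge of sorted streams. A sketch under simplifying assumptions (worker output modeled as in-memory term dictionaries rather than BufFiles, and doc IDs already globalized):

```python
import heapq

def merge_worker_segments(worker_dicts):
    """N-way merge of per-worker term dictionaries: each worker emits
    (term, postings) pairs in sorted term order; the leader merges them into
    one segment, concatenating postings for terms seen by multiple workers."""
    merged = {}
    streams = [iter(sorted(d.items())) for d in worker_dicts]
    for term, postings in heapq.merge(*streams):
        merged.setdefault(term, []).extend(postings)
    return merged

w1 = {"database": [1, 5], "ranking": [2]}
w2 = {"database": [9], "search": [8]}
segment = merge_worker_segments([w1, w2])
# {"database": [1, 5, 9], "ranking": [2], "search": [8]}
```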

&lt;p&gt;On MS-MARCO v2 (138M passages), 15 workers complete the build in 17 minutes 37 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_maintenance_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;maintenance_work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'256MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;passages&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methodology
&lt;/h3&gt;

&lt;p&gt;All benchmarks use the MS-MARCO passage ranking dataset [8], a standard information retrieval benchmark drawn from real Bing search queries. We compare pg_textsearch against ParadeDB v0.21.6 (which wraps Tantivy). Both extensions use their default configurations; Postgres tuning is specified per experiment. Both systems configure English stemming and stopword removal.&lt;/p&gt;

&lt;p&gt;Queries are drawn uniformly from 8 token-count buckets (100 queries per bucket on v1; up to 100 per bucket on v2). Weighted-average metrics use the MS-MARCO v1 lexeme distribution as weights, reflecting real search traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache state.&lt;/strong&gt; All query benchmarks are warm-cache: a warmup pass runs before timing begins, and the working set fits in the OS page cache and shared_buffers for all configurations tested. Results reflect CPU and algorithmic efficiency, not I/O. We have not benchmarked memory-constrained configurations where the index exceeds available cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranking.&lt;/strong&gt; Both systems produce BM25 rankings using the same tokenization (English stemming and stopwords). We have not performed a systematic ranking equivalence comparison; both implement standard BM25 with the same default parameters (k1 = 1.2, b = 0.75), but differences in IDF computation and tokenization edge cases may produce different orderings for some queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  MS-MARCO query length distribution
&lt;/h3&gt;

&lt;p&gt;The following histogram shows the distribution of query lengths in the full MS-MARCO v1 query set (1,010,916 queries), measured in lexemes after English stopword removal and stemming via Postgres &lt;code&gt;to_tsvector('english')&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhedx6bps3xuzxjkgny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhedx6bps3xuzxjkgny.png" alt="Fig. 6: MS-MARCO query length histogram" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 6: MS-MARCO query length histogram&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distribution is broadly consistent with web search query length studies [9, 10]. The MS-MARCO mean of 3.7 lexemes (after stemming/stopword removal) corresponds to roughly 5–6 raw words, consistent with the corpus statistics reported by Nguyen et al. [8]. We use the v1 distribution for weighting throughout as it provides the largest sample.&lt;/p&gt;
&lt;h3&gt;
  
  
  Results: MS-MARCO v2 (138M passages)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Environment.&lt;/strong&gt; Dedicated c6i.4xlarge EC2 instance: Intel Xeon Platinum 8375C, 8 cores / 16 threads, 123 GB RAM, NVMe SSD. Postgres 17.4 with shared_buffers = 31 GB. Both indexes fit in the buffer cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index build:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;pg_textsearch&lt;/th&gt;
&lt;th&gt;ParadeDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index size&lt;/td&gt;
&lt;td&gt;17 GB&lt;/td&gt;
&lt;td&gt;23 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;17 min 37 sec&lt;/td&gt;
&lt;td&gt;8 min 55 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;td&gt;138,364,158&lt;/td&gt;
&lt;td&gt;138,364,158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel workers&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pg_textsearch index is 26% smaller. ParadeDB builds approximately 2x faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-client query latency (median p50, top-10 queries):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lexemes&lt;/th&gt;
&lt;th&gt;pg_textsearch (ms)&lt;/th&gt;
&lt;th&gt;ParadeDB (ms)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5.11&lt;/td&gt;
&lt;td&gt;59.83&lt;/td&gt;
&lt;td&gt;11.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9.14&lt;/td&gt;
&lt;td&gt;59.65&lt;/td&gt;
&lt;td&gt;6.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;20.04&lt;/td&gt;
&lt;td&gt;77.62&lt;/td&gt;
&lt;td&gt;3.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;41.92&lt;/td&gt;
&lt;td&gt;98.89&lt;/td&gt;
&lt;td&gt;2.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;67.76&lt;/td&gt;
&lt;td&gt;125.38&lt;/td&gt;
&lt;td&gt;1.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;102.82&lt;/td&gt;
&lt;td&gt;148.78&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;159.37&lt;/td&gt;
&lt;td&gt;169.65&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+&lt;/td&gt;
&lt;td&gt;177.95&lt;/td&gt;
&lt;td&gt;190.47&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is consistent: pg_textsearch is fastest on short queries and the two systems converge at longer lengths. Weighted by the MS-MARCO v1 query length distribution, the overall p50 is 40.6 ms for pg_textsearch vs. 94.4 ms for ParadeDB, a 2.3x advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent throughput.&lt;/strong&gt; We ran pgbench with 16 parallel clients for 60 seconds (after a 5-second warmup). Each client repeatedly executes a query drawn at random from a weighted pool of 1,000 queries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;pg_textsearch&lt;/th&gt;
&lt;th&gt;ParadeDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactions/sec&lt;/td&gt;
&lt;td&gt;198.7&lt;/td&gt;
&lt;td&gt;22.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average latency&lt;/td&gt;
&lt;td&gt;81 ms&lt;/td&gt;
&lt;td&gt;701 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total transactions (60s)&lt;/td&gt;
&lt;td&gt;11,969&lt;/td&gt;
&lt;td&gt;1,387&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;pg_textsearch sustains 8.7x higher throughput under concurrent load.&lt;/p&gt;
&lt;h3&gt;
  
  
  Results: MS-MARCO v1 (8.8M passages)
&lt;/h3&gt;

&lt;p&gt;On the smaller dataset (GitHub Actions runner, 7 GB RAM, Postgres 17), the advantages are more pronounced: 26x speedup for single-token queries, 14x for 2-token, 7.3x for 4-token. Total sequential execution time for all 800 queries: 6.5 seconds for pg_textsearch vs. 25.2 seconds for ParadeDB. Full results and methodology are available at the &lt;a href="https://timescale.github.io/pg_textsearch/benchmarks/" rel="noopener noreferrer"&gt;&lt;u&gt;benchmarks&lt;/u&gt;&lt;/a&gt; page.&lt;/p&gt;
&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Latency vs. query length
&lt;/h3&gt;

&lt;p&gt;The speedup correlates strongly with query length: 11.7x for single-token queries on v2, narrowing to 1.1x at 8+ tokens. This is the expected behavior of dynamic pruning algorithms like BMW and WAND. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.&lt;/p&gt;

&lt;p&gt;The practical significance depends on the workload's query length distribution. 72.6% of MS-MARCO queries have 2-4 lexemes, the range where pg_textsearch shows its largest advantage (6.5x to 2.4x on v2). Weighted by this distribution, the overall speedup is 2.3x on v2 and 3.9x on v1.&lt;/p&gt;
&lt;h3&gt;
  
  
  Concurrent throughput
&lt;/h3&gt;

&lt;p&gt;The concurrent throughput advantage (8.7x) substantially exceeds the single-client advantage (2.3x weighted p50). pg_textsearch queries execute as C code operating on Postgres buffer pages, with all memory management handled by Postgres's buffer cache. ParadeDB routes queries through Rust/C FFI into Tantivy, which manages its own memory and I/O outside the buffer pool. We have not profiled ParadeDB's internals, so we cannot attribute the concurrency gap to specific causes, but the architectural difference (shared buffer cache vs. separate memory management) is a plausible contributor. ParadeDB's concurrent performance may also improve in future versions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where ParadeDB is faster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Index build time.&lt;/strong&gt; ParadeDB builds indexes 1.6-2x faster across both datasets. Tantivy's indexer is highly optimized Rust code with its own I/O management, not constrained by Postgres's page-based storage. Build time is a one-time cost per index (or per REINDEX); it does not affect query performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long queries.&lt;/strong&gt; At 7+ lexemes, the two systems converge. On v2, the 8+ lexeme p50 is 178 ms for pg_textsearch vs. 190 ms for ParadeDB. These long queries represent ~3.7% of the MS-MARCO distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index size caveat.&lt;/strong&gt; pg_textsearch indexes are 19-26% smaller, but this comparison is not apples-to-apples: pg_textsearch does not store term positions, while ParadeDB stores positions to support phrase queries.&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmark limitations
&lt;/h3&gt;

&lt;p&gt;All measurements are warm-cache on datasets that fit in memory. The 100-query sample per bucket provides directional results but limited statistical power for tail latencies. ParadeDB v0.21.6 was current at time of testing; future versions may improve. We compare against ParadeDB because it is the primary Postgres-native BM25 alternative; standalone engines like Elasticsearch operate in a different deployment model. We have not benchmarked write-heavy workloads with concurrent queries.&lt;/p&gt;
&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;We want to be clear about what pg_textsearch does not support in 1.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No phrase queries.&lt;/strong&gt; The index stores term frequencies but not term positions, so it cannot natively evaluate queries like "database system" as a phrase. Phrase matching can be done with a post-filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database system'&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="c1"&gt;-- over-fetch to compensate for post-filter&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%database system%'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OR-only query semantics.&lt;/strong&gt; All query terms are implicitly OR'd. A query for "database system" matches documents containing either term. We plan to add AND/OR/NOT operators via a dedicated boolean query syntax in a post-1.0 release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No highlighting or snippet generation.&lt;/strong&gt; Use Postgres's &lt;code&gt;ts_headline()&lt;/code&gt; on the result set for highlighting.&lt;/p&gt;
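
&lt;p&gt;For example, highlighting can be layered on top of the BM25 result set with a subquery (a sketch; the table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT id,
       ts_headline('english', content,
                   plainto_tsquery('english', 'database system')) AS snippet
FROM (
  SELECT id, content FROM documents
  ORDER BY content &amp;lt;@&amp;gt; 'database system'
  LIMIT 10
) top_hits;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;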

&lt;p&gt;&lt;strong&gt;No expression indexing.&lt;/strong&gt; Each BM25 index covers a single text column. Workaround: create a generated column concatenating multiple fields.&lt;/p&gt;
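
&lt;p&gt;For example, a stored generated column can concatenate the fields you want searchable (a sketch; the table and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE documents
  ADD COLUMN search_text text
  GENERATED ALWAYS AS (coalesce(title, '') || ' ' || coalesce(body, '')) STORED;
-- then build the BM25 index on search_text instead of the individual columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;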

&lt;p&gt;&lt;strong&gt;Partition-local statistics.&lt;/strong&gt; Each partition maintains its own IDF and average document length. Cross-partition queries return scores computed independently per partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No background compaction.&lt;/strong&gt; Segment compaction runs synchronously during memtable spill. Write-heavy workloads may observe compaction latency. Background compaction is planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PL/pgSQL requires explicit index names.&lt;/strong&gt; The implicit text &lt;code&gt;&amp;lt;@&amp;gt; 'query'&lt;/code&gt; syntax relies on planner hooks that do not fire inside PL/pgSQL, DO blocks, or stored procedures. Use &lt;code&gt;to_bm25query('query', 'index_name')&lt;/code&gt; explicitly. This is a practical limitation many developers will hit.&lt;/p&gt;
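
&lt;p&gt;Inside a function or DO block, the explicit form looks something like this (a sketch based on the function signature above; the index name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- implicit form, which does not fire inside PL/pgSQL:
--   ORDER BY content &amp;lt;@&amp;gt; 'database system'
SELECT * FROM documents
ORDER BY content &amp;lt;@&amp;gt; to_bm25query('database system', 'documents_content_bm25_idx')
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;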

&lt;p&gt;&lt;strong&gt;shared_preload_libraries required.&lt;/strong&gt; pg_textsearch must be listed in shared_preload_libraries, requiring a server restart to install. On Tiger Cloud, this is handled automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No fuzzy matching or typo tolerance.&lt;/strong&gt; pg_textsearch uses Postgres's standard text search configurations for tokenization and stemming but does not provide built-in fuzzy matching. Typo-tolerant search requires a separate approach (e.g., pg_trgm).&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Planned work for post-1.0 releases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boolean query operators: AND, OR, NOT via a dedicated query syntax&lt;/li&gt;
&lt;li&gt;Background compaction: decouple compaction from the write path&lt;/li&gt;
&lt;li&gt;Expression index support: index computed expressions, not just bare columns&lt;/li&gt;
&lt;li&gt;Dictionary compression: front-coding for terms, reducing dictionary size&lt;/li&gt;
&lt;li&gt;Improved write concurrency: better throughput for sustained insert-heavy workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;pg_textsearch requires Postgres 17 or 18. The fastest way to try it is on &lt;a href="https://www.tigerdata.com/search" rel="noopener noreferrer"&gt;&lt;u&gt;Tiger Cloud&lt;/u&gt;&lt;/a&gt;, where it is already installed and configured. No setup, no shared_preload_libraries. Create a service and run the example below.&lt;/p&gt;

&lt;p&gt;For self-hosted installations, pre-built binaries for Linux and macOS (amd64, arm64) are available on the &lt;a href="https://github.com/timescale/pg_textsearch/releases" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub Releases page&lt;/u&gt;&lt;/a&gt;. Add it to shared_preload_libraries and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;shared_preload_libraries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'pg_textsearch'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source code and full documentation: &lt;a href="https://github.com/timescale/pg_textsearch" rel="noopener noreferrer"&gt;&lt;u&gt;github.com/timescale/pg_textsearch&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 2 of this series covers getting started with pg_textsearch, hybrid search with pgvectorscale, and production patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Robertson et al. "Okapi at TREC-3." 1994. See also: Robertson, Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 3(4):333-389, 2009.&lt;/p&gt;

&lt;p&gt;[2] Ding, Suel. "Faster top-k document retrieval using block-max indexes." SIGIR 2011, pp. 993-1002.&lt;/p&gt;

&lt;p&gt;[3] Broder et al. "Efficient query evaluation using a two-level retrieval process." CIKM 2003, pp. 426-434.&lt;/p&gt;

&lt;p&gt;[4] O'Neil et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica, 33(4):351-385, 1996.&lt;/p&gt;

&lt;p&gt;[5] Facebook. "RocksDB: A Persistent Key-Value Store for Fast Storage Environments." &lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;https://rocksdb.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] SmallFloat encoding: Apache Lucene SmallFloat.java. Tantivy uses an equivalent implementation.&lt;/p&gt;

&lt;p&gt;[7] Grand et al. "From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance." ECIR 2020.&lt;/p&gt;

&lt;p&gt;[8] Nguyen et al. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." 2016.&lt;/p&gt;

&lt;p&gt;[9] Statista. "Distribution of online search queries in the US, February 2020, by number of search terms."&lt;/p&gt;

&lt;p&gt;[10] Dean. "We Analyzed 306M Keywords." Backlinko, 2024.&lt;/p&gt;

</description>
      <category>announcementsrelease</category>
      <category>pgtextsearch</category>
      <category>postgres</category>
      <category>searchengine</category>
    </item>
    <item>
      <title>How to Break Your PostgreSQL IIoT Database and Learn Something in the Process</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:42:43 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-to-break-your-postgresql-iiot-database-and-learn-something-in-the-process-n2d</link>
      <guid>https://dev.to/tigerdata/how-to-break-your-postgresql-iiot-database-and-learn-something-in-the-process-n2d</guid>
      <description>&lt;p&gt;As engineers, we're taught to design for reliability. We do design calculations, run simulations, build and test prototypes, and even then we recognize that these are imperfect, so we include safety factors. When it comes to the Industrial Internet of Things (IIoT) though, we rarely give the same level of scrutiny to the components that we rely on.&lt;/p&gt;

&lt;p&gt;What if we treated our IIoT database the same way we treat the physical things we produce? We design and build a prototype database, and then &lt;a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill/" rel="noopener noreferrer"&gt;put it through some serious testing&lt;/a&gt;, even to failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value (and Perils) of Stress Testing
&lt;/h2&gt;

&lt;p&gt;Think of database stress testing as a destructive materials test for your data storage. You wouldn't trust a bridge made of untested steel, so don’t trust your database until you know its limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Identify Bottlenecks:&lt;/strong&gt;  Stress testing reveals the weak links—what is likely to fail first? Will you run out of storage? Will your queries get bogged down? Or will you hit the dreaded ingest wall (when data comes in faster than it can be stored)?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Determine Real-World Behaviour:&lt;/strong&gt;  You'll find out exactly how your database performance changes as the amount of data increases. What issues are future-you going to struggle with?&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill/" rel="noopener noreferrer"&gt;&lt;strong&gt;Optimize Configuration&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt;  Just like you might build a few different prototypes and see how it affects failure modes, changing your database configuration, especially when it comes to indices, can dramatically affect how it behaves. Building a rigorous stress testing framework provides a safe way to optimize your design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope it goes without saying, but please, please don’t run this on your production environment. Even if it’s technically a different database but the same hardware, this test can wreak havoc on your resources and crash your system. You’ve been warned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Measure?
&lt;/h2&gt;

&lt;p&gt;There’s no point going through all the effort to break your system if you don’t learn anything. Assuming you’re using a PostgreSQL database (&lt;a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres" rel="noopener noreferrer"&gt;It’s 2026, Just Use PostgreSQL&lt;/a&gt;), here is a decent set of metrics to keep track of while you’re putting your database through its paces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Size
&lt;/h3&gt;

&lt;p&gt;The size of a PostgreSQL table is generally measured by its number of rows, but the actual space it occupies on disk is the sum of the heap (the main relational table), the indices, and the TOAST (storage for large objects).&lt;/p&gt;

&lt;p&gt;The following query will give the number of rows as well as the size of each component of the table in bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="n"&gt;reltuples&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;heap_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_indexes_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;indices_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_table_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
            &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;toast_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason for the odd row_count is that counting rows the standard way, with COUNT(*), requires scanning the whole table, which is going to be painfully slow when we’re building a table big enough to break things. Keep in mind that reltuples is an estimate maintained by VACUUM and ANALYZE, so it can lag behind the true count (and reads as -1 for a table that has never been vacuumed or analyzed).&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Performance
&lt;/h3&gt;

&lt;p&gt;The best way to measure table performance is to use the actual queries that your production system will use. At a minimum, this should include your batched INSERT (you always batch, right?) and at least one common SELECT. Keep in mind that for a table with N rows, query timing tends to be constant, log(N), N, or worse depending on how the indices are structured.&lt;/p&gt;

&lt;p&gt;You can get very accurate timing info from running your queries with the prefix EXPLAIN ANALYZE, and it’s worth doing this at least once to see what the database is doing under the hood. However, I recommend running the whole test with a scripting language and then just timing the execution of that particular step.&lt;/p&gt;
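
&lt;p&gt;For example, to see how a typical aggregate over recent history is actually executed (the tag value is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN ANALYZE
SELECT avg(value)
FROM iiot_history
WHERE tag_id = 42
  AND time &amp;gt; NOW() - INTERVAL '1 hour';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;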

&lt;h3&gt;
  
  
  Server Performance
&lt;/h3&gt;

&lt;p&gt;Don’t forget the engine that’s driving all this machinery. You’ll need to watch the CPU, Memory, Storage, and Network Bandwidth. People in the IT world tend to talk about headroom for a server, and that’s what you’re really looking at: how much spare capacity do you have? Your CPU and Memory usage might spike at times, but the important thing is that it’s not always running at max capacity.&lt;/p&gt;

&lt;p&gt;There are a lot of free and paid tools to monitor these variables. I almost always do this type of test in a VM (easier to clean up the mess when it all breaks), and I like to use &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, but honestly Perfmon on Windows or top on Linux gives you all you really need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Limits
&lt;/h3&gt;

&lt;p&gt;It’s helpful to set some limits on these parameters so you know when to stop the test. For database size, it might be some measurement like a year's worth of data, or when the drive is 80% full. For ingest timing, I suggest stopping when inserting takes longer than the desired ingest frequency—this is the ingest bottleneck and something you really want to avoid in production. Scan times can be limited by the time it takes for a specific query. Maybe calculating the average value from one tag over the past hour must be less than 10s.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Simulate Data?
&lt;/h2&gt;

&lt;p&gt;There are lots of ways to insert data, but it’s usually a tradeoff between how well the data represents real scenarios and how long it takes to run the test.&lt;/p&gt;

&lt;p&gt;The following is one of my favourite methods for injecting large amounts of data into an IIoT database:&lt;/p&gt;

&lt;p&gt;Say you have a classic IIoT history table like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tag_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you expect to ingest 10,000 tags at 1s intervals, you can use the following INSERT query to add a day’s worth of history to the back end of your table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1s'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LEAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_date&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate random data values for every second during a day and for every tag_id from 1 to 10,000. Not as interesting as real data, but enough to fill up your table.&lt;/p&gt;

&lt;p&gt;The nice thing about this query is that you should be able to run it in parallel with your real-time data pipeline and it won’t mess with your data (aside from potentially locking your table while it runs). It’s also easy to modify this query to inject more or fewer tags, or to change the time interval, if you’re playing around with different configurations.&lt;/p&gt;

&lt;p&gt;If you use this query, or whichever one you prefer, in a script (I usually use Python), then you can automate the whole test. Something along the lines of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Get database size&lt;/li&gt;
&lt;li&gt; Run select queries, measure execution time&lt;/li&gt;
&lt;li&gt; Run insert queries several times, measure and average execution time&lt;/li&gt;
&lt;li&gt; Artificially grow database size&lt;/li&gt;
&lt;li&gt; Repeat 1-3 until one of the failure conditions is reached.&lt;/li&gt;
&lt;/ol&gt;
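
&lt;p&gt;A minimal sketch of that loop in Python (the SQL strings, thresholds, and the run_query helper wrapping your Postgres driver are all illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

MAX_INSERT_SECONDS = 1.0    # stop when inserts exceed the desired ingest frequency
MAX_SELECT_SECONDS = 10.0   # stop when the test query gets too slow

def timed(run_query, sql, repeats=1):
    """Run sql `repeats` times and return the average wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query(sql)
    return (time.perf_counter() - start) / repeats

def stress_test(run_query, size_sql, select_sql, insert_sql, grow_sql):
    while True:
        run_query(size_sql)                                 # 1. get database size
        select_s = timed(run_query, select_sql)             # 2. time a common SELECT
        insert_s = timed(run_query, insert_sql, repeats=5)  # 3. average INSERT time
        run_query(grow_sql)                                 # 4. artificially grow the table
        if insert_s &amp;gt; MAX_INSERT_SECONDS or select_s &amp;gt; MAX_SELECT_SECONDS:
            return select_s, insert_s                       # 5. failure condition reached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;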

&lt;h2&gt;
  
  
  How to Interpret Results and What to Expect in the Real World?
&lt;/h2&gt;

&lt;p&gt;Your test results will give you some clear data points, but you still need to do some interpreting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Identify the Limiting Component:&lt;/strong&gt;  Where did the database fail? If it’s a query that took too long, you might be able to speed things up with a clever index. If it’s an insert that took too long, you might be able to speed things up by removing that clever index you added earlier.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Optimize:&lt;/strong&gt;  There’s a lot you can do to improve table performance before throwing the whole thing out in frustration:

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Proper Indexing:&lt;/strong&gt;  Choosing an index is almost always a tradeoff, for example: Indexing the tag_id column before the time column will speed up most queries, at the cost of slower inserts as the table grows. Indexing the time column first will avoid the ‘ingest wall’ at the cost of slower queries. Figure out which solution is best.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Plan for the future:&lt;/strong&gt;  Will you need more hardware in a few months or a few years? Being able to estimate the life of your existing architecture means you won’t be caught unawares when it no longer suffices.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Partitioning/Chunking:&lt;/strong&gt;  For very large tables, you may need to partition appropriately (see PostgreSQL extensions like  &lt;a href="https://www.tigerdata.com/timescaledb" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt;). How great would it be to learn that you’ll need this before you actually need it?&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Add a Safety Factor:&lt;/strong&gt;  If your test showed a maximum reliable throughput of 15,000 rows/sec, set your operational limit to 10,000 rows/sec. The real world has peaks, unexpected queries, and background maintenance tasks that will steal resources. Like we do with all engineering products, design with margin.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat your database like a prototype and really put it through its paces, you’ll get a preview of how it’ll behave in the future and make good, proactive design decisions instead of struggling in the future. Now, go break something (and learn).&lt;/p&gt;

</description>
      <category>iot</category>
      <category>postgres</category>
      <category>industrial</category>
      <category>database</category>
    </item>
    <item>
      <title>What Developers Get Wrong About Storing Sensor Data</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Thu, 19 Mar 2026 14:08:03 +0000</pubDate>
      <link>https://dev.to/tigerdata/what-developers-get-wrong-about-storing-sensor-data-4e4m</link>
      <guid>https://dev.to/tigerdata/what-developers-get-wrong-about-storing-sensor-data-4e4m</guid>
      <description>&lt;h2&gt;
  
  
  Sensor Data Looks Simple Until It Isn’t
&lt;/h2&gt;

&lt;p&gt;Sensor data appears straightforward. It just has timestamps, numeric readings, and maybe a device identifier. Compared to transactional application data, sensor data feels uniform and predictable. Teams often assume they can store it using familiar relational database schemas and grow from there.&lt;/p&gt;

&lt;p&gt;That assumption falls apart as scale grows. Devices multiply, sampling rates rise, and historical data accumulates indefinitely. Queries shift from single-row lookups to time windows and aggregations. Data arrives out of order. Storage costs climb relentlessly. Systems designed around transactional assumptions crack in ways that are difficult to correct once data volume locks the architecture in place.&lt;/p&gt;

&lt;p&gt;The root problem is conceptual. Sensor data looks like rows but behaves like a time-ordered stream whose value declines with age. Engineers must design the database as a time-series log with decay from the outset, rather than adapting it from a transactional model later. The following sections show how relational database approaches are inadequate for handling sensor data, and what a more suitable architecture looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Default Model: Treating Sensor Data Like Rows
&lt;/h2&gt;

&lt;p&gt;Most database developers approach sensor data with a transactional mindset. They design normalized schemas, enforce relational integrity, and add indexes for point queries. These patterns work well for mutable business entities such as users or orders, but not for sensor data.&lt;/p&gt;

&lt;p&gt;Sensor data, however, is append-only. New measurements arrive continuously and are rarely updated. Sustained ingestion and time-range retrieval are dominant, not row mutation or lookup. When schemas assume row-oriented access, data ingestion becomes join-heavy, indexing costs grow with volume, and write throughput falls behind input data flow.&lt;/p&gt;

&lt;p&gt;Treating sensor data as rows creates problems precisely where sensor systems spend most of their effort: writing and scanning time-ordered streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where That Model Breaks
&lt;/h2&gt;

&lt;p&gt;As the system grows, several problems appear simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First,&lt;/strong&gt; ingestion is continuous and bursty. Devices reconnect and flush buffers, producing spikes rather than steady flows. Row-oriented schemas struggle to absorb these bursts efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second,&lt;/strong&gt; growth compounds across multiple axes: more devices, higher sampling frequency, additional metrics, and longer retention. Storage volume grows quickly, turning early schema choices into long-term constraints because migrating historical time-series data is costly and risky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third,&lt;/strong&gt; queries shift toward time windows. Monitoring, analytics, and diagnostics rely on ranges, aggregates, and rates over time rather than individual rows. Row-optimized indexing performs poorly for these scans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth,&lt;/strong&gt; operational realities inevitably create problems. Timestamps arrive late or out of sequence. Data must be replayed or corrected. Systems designed for ordered inserts encounter fragmentation and duplication under these conditions.&lt;/p&gt;

&lt;p&gt;Each constraint highlights the same reality. Sensor workloads are shaped by time and continuity, not by relational identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insight: Sensor Data Is a Log With Decay
&lt;/h2&gt;

&lt;p&gt;Sensor data has two defining properties.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is a log: append-only, time-indexed, and rarely modified after arrival.&lt;/li&gt;
&lt;li&gt;It decays: its value decreases as it ages, even as its volume accumulates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Recent data supports high-resolution monitoring and debugging. Older data supports trends and aggregates. Very old data is rarely queried except in summarized form. Yet without lifecycle awareness, systems retain all data at equal resolution and cost.&lt;/p&gt;

&lt;p&gt;Once teams understand that sensor data is a &lt;strong&gt;log with decay&lt;/strong&gt;, the correct architecture becomes clear. Storage must optimize for append throughput and time-range access while permitting data to evolve in resolution and tier as it ages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time-Series Architecture
&lt;/h2&gt;

&lt;p&gt;Time-series data that loses value over time requires the database architecture to have a few key properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log-optimized ingestion
&lt;/h3&gt;

&lt;p&gt;Writes must be sequential and batched, minimizing per-row overhead. Storage engines and schemas should favor append operations over update operations so ingestion scales with device fleets and burst conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-partitioned organization
&lt;/h3&gt;

&lt;p&gt;Data should be grouped primarily by time, aligning its physical storage with dominant query patterns. Time partitioning keeps recent data localized and keeps historical segments compact and independent.&lt;/p&gt;
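
&lt;p&gt;In Postgres, for example, this maps to declarative range partitioning on the timestamp (a sketch; the table and partition names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE readings (
    time      timestamptz NOT NULL,
    device_id int NOT NULL,
    value     double precision
) PARTITION BY RANGE (time);

CREATE TABLE readings_2026_03 PARTITION OF readings
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;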

&lt;h3&gt;
  
  
  Lifecycle tiering
&lt;/h3&gt;

&lt;p&gt;Because sensor data’s value declines with age, its resolution and storage cost should decline as well. High-resolution recent data stays hot; older data is compressed, downsampled, or moved to cheaper storage tiers while preserving analytical performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Role separation
&lt;/h3&gt;

&lt;p&gt;Operational monitoring, historical analytics, and archival retention create different latency and throughput challenges. Separating these roles prevents continuous ingestion from degrading analytical performance and allows each layer to evolve independently.&lt;/p&gt;

&lt;p&gt;These properties are not optimizations layered onto transactional storage. Instead, they are intentional design choices needed to handle the key aspects of time-series data: continuous append, time-range access, and aging value.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables for Developers
&lt;/h2&gt;

&lt;p&gt;Architectures aligned with time-series data change how systems scale and operate.&lt;/p&gt;

&lt;p&gt;Ingestion stays stable as fleets expand because write operations match append patterns rather than row mutation. Query cost stays predictable because time-range scans match with storage layout. Storage growth stays bounded relative to insight because data resolution declines with age. Operational corrections and replays become routine rather than disruptive because logs tolerate disorder.&lt;/p&gt;

&lt;p&gt;Developers spend less effort compensating for schema problems and more effort deriving insight from data. Systems stay adaptable as deployments grow from prototypes to global fleets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Time-Series Architecture Becomes Inevitable
&lt;/h2&gt;

&lt;p&gt;Transactional database models are designed for mutable records whose value stays relatively stable over time. Sensor data is the opposite. It is filled with immutable events whose volume grows continuously while their value declines with age. As ingestion becomes constant, queries become time-range-driven, and history accumulates indefinitely, databases built on transactional assumptions develop write bottlenecks, inefficient scans, and rising storage costs.&lt;/p&gt;

&lt;p&gt;Once teams understand that sensor data is an append-only stream with aging value, the architectural solution becomes clear. Systems must ingest sequentially, organize primarily by time, reduce resolution as data ages, and separate operational and historical workloads. These structures stem directly from how sensor data behaves, not from a preference for any particular technology.&lt;/p&gt;

&lt;p&gt;Treating sensor data as rows delays problems but does not fix them. As scale grows, transactional models diverge further from workload reality, while time-series architectures stay matched to it. Database design, therefore, can’t be retrofitted late without cost and disruption. It must start from the correct model: sensor data as a time-series log with decay.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>database</category>
      <category>iot</category>
      <category>backend</category>
    </item>
    <item>
      <title>How Do PostgreSQL Indices Work, Anyways?</title>
      <dc:creator>Matty Stratton</dc:creator>
      <pubDate>Wed, 18 Mar 2026 14:35:21 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-do-postgresql-indices-work-anyways-3jnn</link>
      <guid>https://dev.to/tigerdata/how-do-postgresql-indices-work-anyways-3jnn</guid>
      <description>&lt;p&gt;You've probably created a hundred indexes in your career. Maybe a thousand. You ran &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, saw "Index Scan" instead of "Seq Scan," pumped your fist, and moved on.&lt;/p&gt;

&lt;p&gt;But do you actually know what's happening underneath? Because once you do, a lot of things about PostgreSQL performance start to make a &lt;em&gt;lot&lt;/em&gt; more sense. And some of the pain points you've been fighting start to feel less like mysteries and more like, well, physics.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's a tree. Obviously.
&lt;/h2&gt;

&lt;p&gt;The default index type in PostgreSQL is a B-tree. You knew that. But let's talk about what that actually means for your data.&lt;/p&gt;

&lt;p&gt;When you create an index on, say, a &lt;code&gt;timestamp&lt;/code&gt; column, PostgreSQL builds a balanced tree structure where each node contains keys and pointers. The leaf nodes point to actual heap tuples (your rows on disk). The internal nodes just help you navigate. Think of it like a phone book. (Do people still know what phone books are? I'm aging myself.)&lt;/p&gt;

&lt;p&gt;The key thing to understand: the index is a &lt;em&gt;separate data structure&lt;/em&gt; from your table. It lives in its own pages on disk. When you insert a row, PostgreSQL doesn't just write your row. It also has to update every index on that table. Every. Single. One.&lt;/p&gt;

&lt;p&gt;So if you have a table with five indexes and you're doing 50,000 inserts per second, that's not 50K write operations. That's 250K+ B-tree insertions per second, plus the heap write. Oof.&lt;/p&gt;
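That multiplier is simple arithmetic, worth making concrete. A sketch using the illustrative numbers above (the function name is mine, not a Postgres API):

```python
def index_insertions_per_sec(inserts_per_sec: int, index_count: int) -> int:
    """Every heap insert also adds one entry to each B-tree index on the
    table, so index write volume scales linearly with index count."""
    return inserts_per_sec * index_count

# 50K inserts/sec against a table with five indexes:
print(index_insertions_per_sec(50_000, 5))  # 250000 B-tree insertions/sec, plus heap writes
```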

&lt;p&gt;You can see exactly how much space each index is consuming with &lt;code&gt;\di+&lt;/code&gt; in psql:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;di&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;-- Or if you want programmatic access:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;times_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;idx_tup_read&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tuples_read&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run that on your biggest table. If you see indexes measured in gigabytes that have &lt;code&gt;idx_scan = 0&lt;/code&gt;, those indexes are costing you writes and giving you nothing back. They're dead weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pages, not rows
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. PostgreSQL doesn't read individual rows from disk. It reads 8KB pages. Always. Even if you only want one tiny row, you're pulling in a full 8KB page.&lt;/p&gt;

&lt;p&gt;Your B-tree is also organized into 8KB pages. Each page holds as many index entries as it can fit. For a simple index on a &lt;code&gt;bigint&lt;/code&gt; column, you can fit a few hundred entries per page. For a compound index on &lt;code&gt;(tenant_id, event_type, created_at)&lt;/code&gt;, you're fitting fewer because each entry is wider.&lt;/p&gt;

&lt;p&gt;When PostgreSQL traverses your B-tree, it starts at the root page, reads it, follows a pointer to the right internal page, reads that, and eventually gets to a leaf page that tells it where your actual row lives on the heap. For a table with a million rows, that's maybe three or four page reads. For a billion rows, it might be five or six. Logarithmic scaling is your friend here.&lt;/p&gt;
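The logarithmic scaling can be back-of-enveloped. This sketch assumes a hypothetical fanout of ~300 index entries per 8KB page (real fanout depends on key width and page fill); the function is illustrative, not a Postgres internal.

```python
import math

def btree_levels(row_count: int, fanout: int = 300) -> int:
    """Approximate B-tree depth (page reads per lookup), assuming each
    page holds roughly `fanout` entries."""
    return max(1, math.ceil(math.log(row_count, fanout)))

for n in (1_000_000, 1_000_000_000):
    print(f"{n:>13,} rows -> ~{btree_levels(n)} index page reads")
```

A thousandfold growth in row count adds only an extra level or two, which is why index lookups stay fast as tables grow.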

&lt;p&gt;You can see this in action with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Look for lines like:&lt;/span&gt;
&lt;span class="c1"&gt;--   Index Scan using events_created_at_idx on events&lt;/span&gt;
&lt;span class="c1"&gt;--     Buffers: shared hit=4 read=2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;shared hit&lt;/code&gt; count tells you how many pages came from the buffer cache. The &lt;code&gt;read&lt;/code&gt; count tells you how many had to come from disk. If you're seeing high &lt;code&gt;read&lt;/code&gt; values on a query you run frequently, your working set has outgrown your &lt;code&gt;shared_buffers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But. (There's always a but.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The part nobody thinks about
&lt;/h2&gt;

&lt;p&gt;Those leaf pages need to stay ordered. When you insert a new value that belongs in the middle of a page that's already full, PostgreSQL has to split that page. Page splits are expensive. They cause write amplification and can fragment your index over time.&lt;/p&gt;

&lt;p&gt;For time-series data (timestamps always increasing), you mostly dodge this problem because new values go to the rightmost leaf. That's nice. But it creates a different problem: hot-page contention. Every concurrent insert is fighting to write to the same leaf page at the end of the tree.&lt;/p&gt;

&lt;p&gt;And then there's the part that really gets you: MVCC overhead.&lt;/p&gt;

&lt;p&gt;PostgreSQL's multiversion concurrency control means that even your index has to deal with tuple visibility. Index entries don't get removed immediately when a row is deleted or updated. They stick around until &lt;code&gt;VACUUM&lt;/code&gt; cleans them up. So your index isn't just tracking live rows. It's tracking &lt;em&gt;all the versions&lt;/em&gt; of your rows until the cleanup crew gets around to it.&lt;/p&gt;

&lt;p&gt;For a high-churn table, your index can be significantly larger than you'd expect just from the row count. I've seen cases where the index is effectively 2-3x the "expected" size because of dead tuple bloat.&lt;/p&gt;

&lt;p&gt;Here's how to check if bloat is eating your indexes alive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;dead_pct&lt;/code&gt; is climbing above 10-20% and &lt;code&gt;last_autovacuum&lt;/code&gt; was hours ago (or null), autovacuum is falling behind. That bloat isn't just wasting space. It's making every index scan touch more pages than it should.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index-only scans (and why they're worth understanding)
&lt;/h2&gt;

&lt;p&gt;There's one more behavior worth knowing about, because it changes how you think about index design.&lt;/p&gt;

&lt;p&gt;Normally, PostgreSQL uses the index to find &lt;em&gt;where&lt;/em&gt; a row lives on the heap, then goes and reads the actual row. That's two separate lookups: the index, then the heap.&lt;/p&gt;

&lt;p&gt;But if every column your query needs is already &lt;em&gt;in&lt;/em&gt; the index, PostgreSQL can skip the heap entirely. That's an index-only scan, and it's significantly faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- This index covers both the WHERE clause and the SELECT list:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_covering&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now this query never touches the heap:&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Look for:&lt;/span&gt;
&lt;span class="c1"&gt;--   Index Only Scan using idx_events_covering on events&lt;/span&gt;
&lt;span class="c1"&gt;--     Heap Fetches: 0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Heap Fetches: 0&lt;/code&gt; is what you want. That means PostgreSQL answered the entire query from the index alone.&lt;/p&gt;

&lt;p&gt;The catch: index-only scans only work well when the visibility map is up to date, which brings us right back to VACUUM. If VACUUM hasn't visited a page recently, PostgreSQL can't trust the index alone and has to check the heap anyway. So even this optimization depends on keeping autovacuum healthy.&lt;/p&gt;
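That dependency can be modeled in a few lines. A toy model in plain Python, not Postgres internals; the names are illustrative. The heap is consulted only for pages the visibility map cannot vouch for, which is exactly what the &lt;code&gt;Heap Fetches&lt;/code&gt; counter reports.

```python
def index_only_scan(index_entries, all_visible, heap):
    """Toy index-only scan: return values straight from the index when the
    page is marked all-visible, otherwise fall back to a heap fetch."""
    results, heap_fetches = [], 0
    for page_no, value in index_entries:
        if all_visible.get(page_no):
            results.append(value)         # answered from the index alone
        else:
            heap_fetches += 1             # visibility unknown: re-check heap
            results.append(heap[page_no])
    return results, heap_fetches

entries = [(0, "a"), (1, "b"), (2, "c")]
vm = {0: True, 1: False, 2: True}         # page 1 not vacuumed recently
rows, fetches = index_only_scan(entries, vm, {1: "b"})
print(rows, "Heap Fetches:", fetches)
```

The more pages VACUUM has marked all-visible, the closer heap fetches get to zero.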

&lt;h2&gt;
  
  
  Partial indexes (less is more)
&lt;/h2&gt;

&lt;p&gt;One more tool that's underused: partial indexes. If you only query a subset of your data most of the time, you can index just that subset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Instead of indexing every row:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_status&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Index only the rows that matter:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_active&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The partial index is smaller, faster to scan, and cheaper to maintain on writes. For high-churn tables where most queries filter to a small slice of data, this is free performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  So why does this matter?
&lt;/h2&gt;

&lt;p&gt;Understanding this stuff isn't just academic. It explains real problems you hit in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why adding indexes slows down writes.&lt;/strong&gt; Every index is another B-tree that needs to be maintained on every insert. It's not free. It's never been free. The cost just hides until you're at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why your queries get slower over time even though nothing changed.&lt;/strong&gt; Index bloat from dead tuples. Pages that used to be tightly packed are now half-empty after splits and vacuuming. Your three-page-read query is now a six-page-read query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why VACUUM matters so much.&lt;/strong&gt; It's not just reclaiming table space. It's keeping your indexes healthy. If autovacuum can't keep up, your indexes degrade. And if you're inserting fast enough, autovacuum can fall behind. That's not a bug. That's just the architecture working as designed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why partitioning helps (and then stops helping).&lt;/strong&gt; Smaller partitions mean smaller indexes mean fewer tree levels. Great. But now your query planner has to evaluate all those partitions to figure out which ones to scan. And that planning cost scales linearly with partition count. You're trading one bottleneck for another.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;I wrote about this cycle more extensively in a piece about &lt;a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill?utm_source=devto&amp;amp;utm_medium=da-activity&amp;amp;utm_campaign=matty-digital" rel="noopener noreferrer"&gt;the PostgreSQL optimization treadmill&lt;/a&gt;. The short version: there's a pretty predictable progression that teams go through. Optimize indexes. Partition tables. Tune autovacuum. Scale vertically. Add read replicas. Each phase buys you a few months.&lt;/p&gt;

&lt;p&gt;That's not a criticism of PostgreSQL. Postgres is an incredible database. But it's a &lt;em&gt;general-purpose&lt;/em&gt; relational database, and its architecture reflects that. The heap storage model, MVCC, the query planner, B-trees. They're all designed to handle a wide range of workloads really well. The tradeoff is that for very specific access patterns (like time-series data at scale), those general-purpose design choices start working against you instead of for you.&lt;/p&gt;

&lt;p&gt;Understanding &lt;em&gt;how&lt;/em&gt; your indexes work is the first step to understanding &lt;em&gt;when&lt;/em&gt; they stop being enough. And knowing when you're fighting the architecture instead of optimizing within it can save you months of whack-a-mole performance tuning.&lt;/p&gt;

&lt;p&gt;But that's a topic for another day. For now, go run these queries on your biggest table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- How big are your indexes, really?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;scans&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_table_here'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Are any of them unused?&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;idx_scan&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;  &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might be surprised.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>🚀 Introducing Agentic Postgres: The First &amp; Free Database Built for Agents</title>
      <dc:creator>Ajay Kulkarni</dc:creator>
      <pubDate>Tue, 21 Oct 2025 15:15:26 +0000</pubDate>
      <link>https://dev.to/tigerdata/introducing-agentic-postgres-the-first-free-database-built-for-agents-50i7</link>
      <guid>https://dev.to/tigerdata/introducing-agentic-postgres-the-first-free-database-built-for-agents-50i7</guid>
      <description>&lt;h2&gt;
  
  
  Agents are the New Developer
&lt;/h2&gt;

&lt;p&gt;80% of Claude Code &lt;a href="https://www.reddit.com/r/singularity/comments/1khxwjh/claude_code_wrote_80_of_its_own_code_anthropic_dev/" rel="noopener noreferrer"&gt;was written by AI&lt;/a&gt;. More than a &lt;a href="https://arstechnica.com/ai/2024/10/google-ceo-says-over-25-of-new-google-code-is-generated-by-ai/" rel="noopener noreferrer"&gt;quarter of all new code at Google&lt;/a&gt; was generated by AI one year ago. It’s safe to say that in the next 12 months, the majority of all new code will be written by AI.&lt;/p&gt;

&lt;p&gt;Agents don’t behave like humans. They behave in new ways. Software development tools need to evolve. Agents need a new kind of database made for how they work.&lt;/p&gt;

&lt;p&gt;But what would a database for agents look like?&lt;/p&gt;

&lt;p&gt;At Tiger, we’ve obsessed over databases for the past 10 years. We’ve built high-performance systems for time-series data, scaled Postgres across millions of workloads, and served thousands of customers and hundreds of thousands of developers around the world. &lt;/p&gt;

&lt;p&gt;So when agents arrived, we felt it immediately. In our bones. This new era of computing would need its own kind of data infrastructure. One that still delivered power without complexity, but built for a new type of user. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Agents Actually Need
&lt;/h2&gt;

&lt;p&gt;Agents work differently than humans. They need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCPs, not UIs&lt;/strong&gt; – they call functions, not click buttons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native search&lt;/strong&gt; – find the right data instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast forks and teardown&lt;/strong&gt; – spin up experiments without the overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient pricing&lt;/strong&gt; – pay for what you use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in knowledge&lt;/strong&gt; – best practices that come with the database&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. An MCP Server That Actually Understands Postgres
&lt;/h3&gt;

&lt;p&gt;We built an MCP server that doesn't just connect to the database—it knows how to use it well. We took 10+ years of Postgres experience and turned it into built-in prompts. Agents get tools for schema design, query optimization, and migrations, plus they can search Postgres docs on the fly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; I want to create a personal assistant app. Please create a free 
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; service on Tiger. Then using Postgres best practices, describe 
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; the schema you would create.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Search Built Into the Database
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;pgvectorscale&lt;/strong&gt;: We improved our vector search extension. Better indexing throughput, better recall, lower latency than pgvector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pg_textsearch&lt;/strong&gt;: Our newest extension. It implements BM25 for proper ranked keyword search, built for hybrid AI apps. Right now it uses an in-memory structure for speed—disk-based segments with compression are coming.&lt;/p&gt;

&lt;p&gt;No need to bolt on external search. It's all in Postgres.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Instant Database Forks
&lt;/h3&gt;

&lt;p&gt;We built a copy-on-write storage layer that makes databases instantly forkable. Full production data, isolated environment, seconds to create. No data duplication, no cost duplication. You only pay for what changes.&lt;/p&gt;
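Copy-on-write forking can be sketched in miniature. This is a toy model, not Fluid Storage's actual implementation; the class and page layout are illustrative. A fork references the parent's pages and materializes a page only on first write, so unchanged data is shared and only the diff consumes space.

```python
class CowVolume:
    """Toy copy-on-write volume: forks share the parent's pages and
    store only the pages they have written."""
    def __init__(self, pages=None, parent=None):
        self.pages = pages if pages is not None else {}
        self.parent = parent

    def fork(self):
        return CowVolume(parent=self)     # instant: no data is copied

    def read(self, page_no):
        if page_no in self.pages:
            return self.pages[page_no]
        return self.parent.read(page_no) if self.parent else None

    def write(self, page_no, data):
        self.pages[page_no] = data        # only written pages are materialized

prod = CowVolume({0: "customers", 1: "orders"})
fork = prod.fork()
fork.write(1, "orders+test_index")        # fork diverges on one page
print(prod.read(1), fork.read(1), len(fork.pages))
```

The parent stays untouched, and the fork's footprint is exactly the pages it changed.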

&lt;p&gt;Great for testing, benchmarking, or running migrations in parallel without touching prod.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Create a fork of my database, &lt;span class="nb"&gt;test &lt;/span&gt;3 different indexes 
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;performance, delete the fork, and report findings.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. New CLI and a Free Tier
&lt;/h3&gt;

&lt;p&gt;Three commands to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the Tiger CLI and MCP&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://cli.tigerdata.com | sh
&lt;span class="nv"&gt;$ &lt;/span&gt;tiger auth login
&lt;span class="nv"&gt;$ &lt;/span&gt;tiger mcp &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then either tell your agent to create a free service, or run &lt;code&gt;tiger create service&lt;/code&gt; yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fluid Storage
&lt;/h2&gt;

&lt;p&gt;This all runs on Fluid Storage—our new distributed block store. It's built on local NVMe with a storage proxy that handles copy-on-write volumes.&lt;/p&gt;

&lt;p&gt;What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant forks and snapshots&lt;/li&gt;
&lt;li&gt;Automatic scaling, no downtime&lt;/li&gt;
&lt;li&gt;Over 100K IOPS and 1 GB/s per volume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It looks like a local disk to Postgres but scales like cloud storage. Every free service runs on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Today
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://cli.tigerdata.com | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Built for agents. Designed for developers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>database</category>
      <category>postgres</category>
      <category>agents</category>
    </item>
    <item>
      <title>🐯 🚀 Timescale is now TigerData: Building the Modern PostgreSQL for the Analytical and Agentic Era</title>
      <dc:creator>Ajay Kulkarni</dc:creator>
      <pubDate>Wed, 18 Jun 2025 16:09:24 +0000</pubDate>
      <link>https://dev.to/tigerdata/timescale-is-now-tigerdata-building-the-modern-postgresql-for-the-analytical-and-agentic-era-3a51</link>
      <guid>https://dev.to/tigerdata/timescale-is-now-tigerdata-building-the-modern-postgresql-for-the-analytical-and-agentic-era-3a51</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Eight years ago, we launched Timescale to bring time-series to PostgreSQL. Our mission was simple: help developers building time-series applications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since then, we have built a thriving business: 2,000 customers, mid 8-digit ARR (&amp;gt;100% growth year over year), $180 million raised from top investors. &lt;/p&gt;

&lt;p&gt;We serve companies that are building real-time analytical products and large-scale AI workloads, including Mistral, HuggingFace, Nvidia, Toyota, Tesla, NASA, JP Morgan Chase, Schneider Electric, Palo Alto Networks, and Caterpillar. These are companies building developer tools, industrial dashboards, crypto exchanges, AI-native games, financial RAG applications, and more. &lt;/p&gt;

&lt;p&gt;We’ve quietly evolved from a time-series database into the modern PostgreSQL for today’s and tomorrow’s computing, built for performance, scale, and the agentic future. So we’re changing our name: from Timescale to TigerData. Not to change who we are, but to reflect who we’ve become. TigerData is bold, fast, and built to power the next era of software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developers Thought We Were Crazy
&lt;/h2&gt;

&lt;p&gt;When we started 8 years ago, SQL databases were “old fashioned.” NoSQL was the future. Hadoop, MongoDB, Cassandra, InfluxDB – these were the new, exciting NoSQL databases. PostgreSQL was old and boring.&lt;/p&gt;

&lt;p&gt;That’s when we launched Timescale: a time-series database on PostgreSQL. Developers thought we were crazy. PostgreSQL didn’t scale. PostgreSQL wasn’t fast. Time-series needed a NoSQL database. Or so they said.&lt;/p&gt;

&lt;p&gt;“While I appreciate PostgreSQL every day, am I the only one who thinks this is a rather bad idea?” – top HackerNews comment on our launch (link)&lt;/p&gt;

&lt;p&gt;But we believed in PostgreSQL. We knew that boring could be awesome, especially with databases. And frankly, we were selfish: PostgreSQL was the only database that we wanted to use.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Today, PostgreSQL has won.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are no more “SQL vs. NoSQL” debates. MongoDB, Cassandra, InfluxDB, and other NoSQL databases are seen as technical dead ends. Snowflake and Databricks are acquiring PostgreSQL companies. No one talks about Hadoop. The Lakehouse has won. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Today, agentic workloads are here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agents need a fast database. We see this in our customer base: private equity firms and hedge funds using agents to help understand market movements (“How did the market respond to Apple WWDC 2025?”); industrial equipment manufacturers building chat interfaces on top of internal manuals to help field technicians; developer platforms storing agentic interactions into history tables for greater transparency and trust; and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Started as a Heretical Idea Is Now a Thriving Business
&lt;/h2&gt;

&lt;p&gt;We have also changed. We met in September 1997, during our first week at MIT. We soon became friends, roommates, even marathon training partners (Boston 1998).&lt;/p&gt;

&lt;p&gt;That friendship became the foundation for an entrepreneurial journey that has surpassed even our boldest imaginations. &lt;/p&gt;

&lt;p&gt;What started as a heretical idea is now a thriving business:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;2,000 customers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mid 8-digit ARR, growing &amp;gt;100% y/y&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;200 people in 25 countries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;$180 million raised from top investors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;60%+ gross margins&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud usage is up 5x in the last 18 months, based on paid customers alone.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s only the paid side of the story. Our open-source community is 10x-20x larger. (Based on telemetry, it’s 10x, but we estimate that at least half of all deployments have telemetry turned off.)&lt;/p&gt;

&lt;p&gt;TimescaleDB is everywhere. It’s included in PostgreSQL offerings around the world: from Azure, Alibaba, and Huawei to Supabase, DigitalOcean, and Fly.io. You’ll also find it on Databricks Neon, Snowflake Crunchy Bridge, OVHCloud, Render, Vultr, Linode, Aiven, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Are TigerData
&lt;/h2&gt;

&lt;p&gt;Today, we are more than a time-series database. We are powering developer tools, SaaS applications, AI-native games, financial RAG applications, and more. The majority of workloads on our Cloud product aren’t time-series. Companies are running entire applications on us. CTOs would say to us, “You keep talking about how you are the best time-series database, but I see you as the best PostgreSQL.” &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So we are now “TigerData.”&lt;/strong&gt; We offer the fastest PostgreSQL. Speed without sacrifice.&lt;/p&gt;

&lt;p&gt;Our cloud offering is “Tiger Cloud.” Our logo stays the same: the tiger, looking forward, focused and fast. Some things do not change. Our open source time-series PostgreSQL extension remains TimescaleDB. Our vector extension is still pgvectorscale. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why “Tiger”?&lt;/strong&gt; The tiger has been our mascot since 2017, symbolizing the speed, power, and precision we strive for in our database. Over time, it’s become a core part of our culture: from weekly “Tiger Time” All Hands and monthly “State of the Tiger” business reviews, to welcoming new teammates as “tiger cubs” to the “jungle.” As we reflected on our products, performance, and community, we realized: we aren’t just Timescale. We’re Tiger. Today, we’re making that official.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not a reinvention: it’s a reflection of how we already serve our customers today.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Polymarket&lt;/strong&gt; uses TigerData to track their price history. During the last election, Polymarket scaled up 4x when trade volumes spiked, powering over $3.7 billion worth of trades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linktree&lt;/strong&gt; uses TigerData for their premium analytics product, saving $17K per month on 12.6 TB of data thanks to compression. They also compressed their time to launch, going from 2 weeks to 2 days for shipping analytical features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Titan America&lt;/strong&gt; uses TigerData’s compression and continuous aggregates to reduce costs and increase visibility into their facilities for manufacturing cement, ready-mixed concrete, and related materials. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lucid Motors&lt;/strong&gt; uses TigerData for real-time telemetry and autonomous driving analytics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Financial Times&lt;/strong&gt; runs time-sensitive analytics and semantic search. &lt;/p&gt;

&lt;h2&gt;
  
  
  Come Join Us
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tiger is the Fastest PostgreSQL.&lt;/strong&gt; The operational database platform built for transactional, analytical, and agentic workloads. The only database platform that provides Speed without Sacrifice.&lt;/p&gt;

&lt;p&gt;This is not a rebrand, but a recommitment to our customers, to our developers, and to our core mission.&lt;/p&gt;

&lt;p&gt;If this mission resonates with you, come join us. Give us product feedback. Spread the word. Wear the swag. Join the team. &lt;/p&gt;

&lt;p&gt;It’s Go Time. 🐯🚀&lt;/p&gt;

</description>
      <category>news</category>
      <category>database</category>
      <category>ai</category>
      <category>postgres</category>
    </item>
    <item>
      <title>The Database Meets the Lakehouse: Toward a Unified Architecture for Modern Applications</title>
      <dc:creator>Mike Freedman</dc:creator>
      <pubDate>Tue, 10 Jun 2025 14:24:30 +0000</pubDate>
      <link>https://dev.to/tigerdata/the-database-meets-the-lakehouse-toward-a-unified-architecture-for-modern-applications-2le3</link>
      <guid>https://dev.to/tigerdata/the-database-meets-the-lakehouse-toward-a-unified-architecture-for-modern-applications-2le3</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; The OLTP/OLAP split no longer fits how developers build today. Postgres and the lakehouse are now used side-by-side – but stitched together with brittle pipelines. We think they belong in a single, modular system: open formats, bidirectional sync, and real-time performance by default.&lt;/p&gt;

&lt;p&gt;The architecture of modern data systems is undergoing a fundamental shift.&lt;/p&gt;

&lt;p&gt;Ask a developer how they build data systems today, and the answer increasingly looks like this: Postgres for the application, a lakehouse for the analytics and data science.&lt;/p&gt;

&lt;p&gt;Postgres, long favored for transactional workloads, has evolved into a general-purpose operational database. It’s trusted, flexible, and deeply extensible, powering everything from customer transactions and CRUD apps, to real-time dashboards and AI-backed product features. Its ecosystem has grown to support real-time analytics (&lt;a href="https://github.com/timescale" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt;), geospatial data (PostGIS), vector and full-text search (pgvector and pgvectorscale), and more.&lt;/p&gt;

&lt;p&gt;At the same time, the rise of open lakehouse technologies has redefined how organizations manage and analyze data at scale. Disaggregated storage, open table formats like Iceberg, structured data catalogs, and composable query engines have made it possible to analyze petabyte-scale data with precision and control. This architecture can offer governance, avoid vendor lock-in, and still provide data teams flexibility in their choice of tools.&lt;/p&gt;

&lt;p&gt;What’s striking isn’t just the success of these technologies individually, but how often they’re now being deployed together. Organizations increasingly need to support both operational workloads (powered by databases) and non-operational workloads (powered by lakehouses), often using data from the same sources – people, machines, digital systems, or agents. Yet these systems are still treated in isolation, often owned by different teams, with too much friction in making them work together seamlessly.&lt;/p&gt;

&lt;p&gt;We believe that friction should not exist. In fact, we think a new, more coherent architecture is emerging: one that treats Postgres and the lakehouse not as separate worlds, but as distinct layers of a single, modular system, designed to meet the full spectrum of operational and analytical needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Limits of the OLTP vs OLAP Dichotomy
&lt;/h2&gt;

&lt;p&gt;The old way of thinking about databases was simple: OLTP for transactions, OLAP for analysis. You used Postgres to power your app, and sent nightly ETL jobs to a data warehouse for internal reports and dashboards. This traditional distinction served us well when applications were simpler, and internal reporting could live on a much slower cadence. But that’s no longer the case. &lt;/p&gt;

&lt;p&gt;Modern applications are data-heavy, customer-facing, and real-time by design. They blur the lines between transactional and analytical. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A financial app might run a trading engine that needs millisecond access to customer portfolios, while simultaneously feeding real-time risk reports and internal dashboards. &lt;/li&gt;
&lt;li&gt;A SaaS app isn’t just storing clicks – it’s calculating usage metrics, triggering alerts, and serving personalized models. &lt;/li&gt;
&lt;li&gt;An industrial monitoring system might ingest tens of millions of sensor readings per hour, drive anomaly detection and alerting logic, and archive years of telemetry for long-term analytics and AI model training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These use cases are not outliers – they are quickly becoming the norm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We increasingly see a more useful split: operational databases that power products, and lakehouses that power organizations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yet even though ownership of these systems is split – product-engineering teams responsible for the operational systems powering their products, and data teams responsible for managing lakehouse systems as organizational services – the two still need to talk to each other. They need to work on the same data and often share underlying schemas. The better they integrate and remain in sync, the more resilient and capable the overall system becomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Operational Medallion Architecture
&lt;/h2&gt;

&lt;p&gt;One pattern we see gaining traction is what we call an &lt;em&gt;operational medallion architecture&lt;/em&gt;. Inspired by the medallion models popularized in the data engineering world, this pattern also incorporates bronze, silver, and gold layers – not just for internal analytics, but for powering real-time, user-facing systems as well.&lt;/p&gt;

&lt;p&gt;Here’s what that looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze Layer:&lt;/strong&gt; Raw data lives in Parquet or Iceberg files on AWS S3 or similar low-cost bottomless storage systems. This data is typically immutable, append-only, and queryable by anything: query engines like AWS Athena, DuckDB, Trino, ClickHouse, or Polars, or even directly from an operational database like Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Silver Layer:&lt;/strong&gt; Cleaned, filtered, validated, and deduplicated data is written into Postgres to power real-time analytics, dashboards, or application logic of user-facing products.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Gold Layer:&lt;/strong&gt; Pre-aggregated data over silver data (like Postgres’ materialized views or TimescaleDB’s continuous aggregates) serves low-latency, high-concurrency product experiences. These aggregates are typically maintained within the database to ensure consistency between the silver and gold layers.&lt;/li&gt;
&lt;/ul&gt;
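
&lt;p&gt;To make the three layers concrete, here is a plain-Ruby sketch of the bronze → silver → gold flow. This is purely illustrative – in practice bronze lives as Parquet/Iceberg files on S3 and silver/gold live in Postgres – and the record shape and validation rule are invented for the example.&lt;/p&gt;

```ruby
# Bronze: raw, append-only events exactly as they arrived. May contain
# duplicates and invalid rows. (In the architecture above: Parquet or
# Iceberg files on S3.) The record shape is invented for this example.
bronze = [
  { id: 1, device: 'a', temp: 21.5, ts: '2025-06-10T10:00:00Z' },
  { id: 1, device: 'a', temp: 21.5, ts: '2025-06-10T10:00:00Z' }, # duplicate
  { id: 2, device: 'a', temp: nil,  ts: '2025-06-10T11:00:00Z' }, # fails validation
  { id: 3, device: 'b', temp: 19.0, ts: '2025-06-10T12:00:00Z' }
]

# Silver: cleaned, validated, deduplicated. (In the architecture above:
# Postgres tables, with schemas and constraints doing the validation.)
silver = bronze.reject { |r| r[:temp].nil? }.uniq { |r| r[:id] }

# Gold: pre-aggregated rollups over silver, e.g. average temperature per
# device. (In the architecture above: materialized views or continuous
# aggregates maintained inside the database.)
gold = silver.group_by { |r| r[:device] }
             .transform_values { |rows| rows.sum { |r| r[:temp] } / rows.size }

puts silver.size   # prints 2
puts gold['a']     # prints 21.5
```

&lt;p&gt;Each layer here is a pure transformation of the layer below it – the property that lets the real architecture sync layers in either direction without duplicating pipeline logic.&lt;/p&gt;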

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6dz91ge8y5rn6kko8mj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6dz91ge8y5rn6kko8mj.png" alt=" " width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crucially, each layer is queryable, and this movement of data is bidirectional. You can pull raw or transformed data from S3 directly into Postgres (akin to tightly integrated reverse ETL). You can roll up aggregates from Iceberg into Postgres tables (by one-off or standing queries against Iceberg files from Postgres). You can continuously sync a full schema or a single table from the database to the lakehouse.&lt;/p&gt;

&lt;p&gt;Much as bronze (or transformed) data can be read from the lakehouse storage layer on S3 into the database, silver and gold data in the database can be written out to these lakehouse storage formats. This avoids re-implementing identical pipelines in both systems, which would add complexity and risk inconsistency.&lt;/p&gt;

&lt;p&gt;One common pattern we’ve observed in applications requiring fresh data is writing from an upstream streaming system like Kafka or Kinesis &lt;em&gt;in parallel&lt;/em&gt; to both S3 (for raw, unmodified bronze data) and Postgres (relying on database schemas and constraints for data validation). These silver tables and subsequent gold aggregates in the database are then exported back to S3, so data teams have access to the “ground truth” data that was served to customers.&lt;/p&gt;

&lt;p&gt;Now, each system maintains its separation of concerns. The operational database can run locked down – both to users and unfriendly queries – while data is still made available as part of the open lakehouse wherever it’s needed in the org.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Now? Technical Forces Driving the Shift
&lt;/h2&gt;

&lt;p&gt;Several developments are powering this shift of operational databases and lakehouses from siloed to integrated.&lt;/p&gt;

&lt;p&gt;First, Iceberg has matured into a stable and flexible table format that supports schema evolution, ACID transactions, and efficient compaction. It enables multiple compute engines to read from and write to the same datasets – with catalog layers that track metadata and enforce governance across the stack. Much like databases had catalogs at their core, so now do lakehouses.&lt;/p&gt;

&lt;p&gt;Second, Postgres has continued to evolve as a platform. With extensions for columnar storage, time-series data, and vector and hybrid search – what we’ve been building at Timescale for years – Postgres now serves many products that incorporate real-time analytics and agentic workflows directly. And with emerging support for querying S3 and Iceberg data from within Postgres, it is increasingly possible to incorporate S3-hosted data into live queries. So Postgres is no longer just for transactional data – with one-way ETL/CDC to the lakehouse – but &lt;strong&gt;now acts as the serving layer for products incorporating both transactional and analytical data&lt;/strong&gt;. This isn’t just a data caching layer for pre-computed data, but a full-fledged SQL database for further aggregations, enrichment, or JOINs at query time.&lt;/p&gt;

&lt;p&gt;Third, developers expect composability. Some organizations may be stuck with their legacy monolithic data platforms, but most developers and data scientists want flexibility to compose their own stacks, integrating familiar tools in ways that reflect their application’s needs. The shift toward open formats and disaggregated storage fits this mindset. So does the desire for control, particularly in regulated industries or where data sovereignty matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put differently: the market is moving toward modular, open, developer-friendly architectures.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Next
&lt;/h2&gt;

&lt;p&gt;We believe the future of data infrastructure will be shaped by systems that integrate operational and analytical layers more deeply – systems that treat Postgres and the lakehouse as two sides of the same coin.&lt;/p&gt;

&lt;p&gt;This won’t happen through another monolith. It will come from careful interfaces – incremental sync, shared catalogs, unified query surfaces – and from an architectural philosophy that embraces heterogeneity rather than fighting it.&lt;/p&gt;

&lt;p&gt;We’re working on something new in this space. Something that builds on the strengths of Postgres and Iceberg, tightly integrates with existing lakehouse systems, and makes it dramatically easier to build full-stack data systems with operational and analytical fidelity.&lt;/p&gt;

&lt;p&gt;This isn’t about using ETL to move data from legacy systems to new systems – it’s about building a coherent modern data architecture that serves operational and non-operational use cases, alike.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>news</category>
      <category>database</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Your Rails App Isn’t Slow—Your Database Is</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Tue, 06 May 2025 12:23:00 +0000</pubDate>
      <link>https://dev.to/tigerdata/your-rails-app-isnt-slow-your-database-is-o57</link>
      <guid>https://dev.to/tigerdata/your-rails-app-isnt-slow-your-database-is-o57</guid>
      <description>&lt;p&gt;In case you missed the quiet launch of our timescaledb-ruby gem, we’re here to remind you that you can now &lt;a href="https://www.timescale.com/blog/connecting-ruby-and-postgresql-timescale-integrations-expand" rel="noopener noreferrer"&gt;connect PostgreSQL and Ruby when using TimescaleDB&lt;/a&gt;. 🎉 This integration delivers a deeply integrated experience that will feel natural to Ruby and Rails developers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Scale Your Rails App Analytics with TimescaleDB
&lt;/h2&gt;

&lt;p&gt;If you’ve worked with Rails for any length of time, you’ve probably hit the wall when dealing with time-series data. I know I did. &lt;/p&gt;

&lt;p&gt;Your app starts off smooth—collecting metrics, logging events, tracking usage. But one day, your dashboards start lagging. Page load times creep past 10 seconds. Pagination stops helping. Background jobs queue up as yesterday’s data takes too long to process.&lt;/p&gt;

&lt;p&gt;This isn’t a Rails problem. Or even a PostgreSQL problem. It’s a “using the wrong tool for the job” problem.&lt;/p&gt;

&lt;p&gt;In this post, I’ll show you how we solve these challenges at Timescale—and how you can too. I’ll walk through the real implementation patterns we use in production Rails apps, using practical code examples instead of abstract concepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Growing Time-Series Data Challenge
&lt;/h2&gt;

&lt;p&gt;A few years ago, I was building analytics for a high-traffic Rails app. Despite adding indexes and optimizing queries, performance kept degrading as our data grew.&lt;/p&gt;

&lt;p&gt;Like most apps, we started with simple timestamp columns and standard ActiveRecord queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'created_at &amp;gt; ?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;week&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:by_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"DATE_TRUNC('day', created_at)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works fine at first. But as your table grows to millions (or billions) of rows, a query like &lt;code&gt;Event.where(user_id: 123).by_day&lt;/code&gt; slows to a crawl:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5ms when you have 10K rows&lt;/li&gt;
&lt;li&gt;2000ms when you have 10M rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the problems compound when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track high-volume events (like API calls or page views)&lt;/li&gt;
&lt;li&gt;Keep historical data accessible for trends&lt;/li&gt;
&lt;li&gt;Run complex aggregations across time&lt;/li&gt;
&lt;li&gt;Maintain dashboard performance as data scales&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over the years, I tried all the usual tricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional indexes: Helped at first, then hurt insert performance&lt;/li&gt;
&lt;li&gt;Manual partitioning: Fragile and hard to manage&lt;/li&gt;
&lt;li&gt;Pre-aggregation jobs: Complex and often stale&lt;/li&gt;
&lt;li&gt;Custom caching: Difficult to maintain, always a step behind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It felt like fighting my database instead of working with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why PostgreSQL Falls Short for Time-Series
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a fantastic general-purpose database. But time-series data introduces new demands that standard Postgres tables aren’t designed for. Let’s break that down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insertion pattern: Data constantly arrives in time order, but old data rarely changes&lt;/li&gt;
&lt;li&gt;Query pattern: Most queries use time bounds (WHERE created_at BETWEEN x AND y)&lt;/li&gt;
&lt;li&gt;Aggregation pattern: You’re grouping by time (hourly, daily, monthly)&lt;/li&gt;
&lt;li&gt;Storage pattern: The dataset grows linearly—forever&lt;/li&gt;
&lt;li&gt;Access pattern: Recent (hot) data is queried far more than older (cold) data&lt;/li&gt;
&lt;/ul&gt;
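
&lt;p&gt;To make the aggregation pattern concrete, here is a plain-Ruby sketch of the daily rollup those queries compute. The event data is invented for illustration; in TimescaleDB the equivalent grouping is done inside the database by &lt;code&gt;time_bucket('1 day', created_at)&lt;/code&gt;, over time-partitioned chunks.&lt;/p&gt;

```ruby
require 'time'

# A toy event stream: [timestamp, user_id] pairs, invented for illustration.
events = [
  ['2025-05-01T09:15:00Z', 1],
  ['2025-05-01T17:40:00Z', 2],
  ['2025-05-02T08:05:00Z', 1]
]

# The in-memory equivalent of GROUP BY DATE_TRUNC('day', created_at):
# bucket each event into its calendar day, then count per bucket.
daily_counts = events
  .group_by { |ts, _user_id| Time.parse(ts).strftime('%Y-%m-%d') }
  .transform_values { |rows| rows.size }

daily_counts.each { |day, n| puts "#{day}: #{n}" }
# prints:
# 2025-05-01: 2
# 2025-05-02: 1
```

&lt;p&gt;A standard Postgres table recomputes this over every matching row on each query; a hypertable prunes to the relevant time chunks first, which is why the same query stays fast as the table grows.&lt;/p&gt;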

&lt;p&gt;These characteristics expose several pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in partitioning for time&lt;/li&gt;
&lt;li&gt;Index bloat as tables grow&lt;/li&gt;
&lt;li&gt;Inefficient time-based queries&lt;/li&gt;
&lt;li&gt;Manual rollups and background jobs&lt;/li&gt;
&lt;li&gt;Difficulty managing large historical datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s exactly where TimescaleDB comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  TimescaleDB: PostgreSQL, But Built for Time-Series
&lt;/h2&gt;

&lt;p&gt;TimescaleDB is a PostgreSQL extension built to handle time-series and real-time workloads—without giving up the safety and simplicity of Postgres.&lt;/p&gt;

&lt;p&gt;Now with the timescaledb Ruby gem, it integrates cleanly into Rails. You don’t have to leave behind ActiveRecord, or rewrite your models, or learn a whole new stack.&lt;/p&gt;

&lt;p&gt;Here’s what TimescaleDB brings to your Rails app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hypertables: Automatic time-based partitioning, transparent to your queries&lt;/li&gt;
&lt;li&gt;Optimized time indexes: Stay fast even as your data grows&lt;/li&gt;
&lt;li&gt;Built-in compression: Reduce storage by 90–95%&lt;/li&gt;
&lt;li&gt;Continuous aggregates: Pre-computed rollups that stay fresh automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly? You keep your Rails patterns.&lt;/p&gt;

&lt;p&gt;These work just like before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;user_id: &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;created_at: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ago&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="no"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:created_at&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;  &lt;span class="c1"&gt;# using the groupdate gem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real Performance Gains Without Rewriting Everything
&lt;/h2&gt;

&lt;p&gt;With Timescale, our analytics workflows went from laggy to fast—without adding new caching layers or complex ETL.&lt;/p&gt;

&lt;p&gt;Across production workloads, teams have seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-second queries on tens of millions of rows&lt;/li&gt;
&lt;li&gt;95%+ compression on time-series datasets&lt;/li&gt;
&lt;li&gt;Fewer background jobs, thanks to continuous aggregates&lt;/li&gt;
&lt;li&gt;Simplified code—no more rollup scripts or cache warmers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels like your app leveled up, without any extra complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Aggregates in One Line of Ruby
&lt;/h2&gt;

&lt;p&gt;One of TimescaleDB’s most powerful features is continuous aggregates—think materialized views that update automatically in the background.&lt;br&gt;
And with the timescaledb gem, defining them looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Download&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="no"&gt;Timescaledb&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ActsAsHypertable&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Timescaledb&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ContinuousAggregatesHelper&lt;/span&gt;

  &lt;span class="n"&gt;acts_as_hypertable&lt;/span&gt; &lt;span class="ss"&gt;time_column: &lt;/span&gt;&lt;span class="s1"&gt;'ts'&lt;/span&gt;

  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:total_downloads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"count(*) as total"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:downloads_by_gem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"gem_name, count(*) as total"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:gem_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;continuous_aggregates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="ss"&gt;timeframes: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:month&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="ss"&gt;scopes: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:total_downloads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:downloads_by_gem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single model creates a cascade of continuously updated rollups—from minute to month—all while sticking to the ActiveRecord patterns you know and love.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;If you're building a Rails app that tracks metrics, logs, events, or any kind of time-based data, TimescaleDB gives you a clear path to scale without duct tape and complexity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce load on your app servers—let the DB do the aggregating&lt;/li&gt;
&lt;li&gt;Eliminate complex background jobs—fewer moving parts to break&lt;/li&gt;
&lt;li&gt;Get predictable performance—even with billions of rows&lt;/li&gt;
&lt;li&gt;Stick with Rails conventions—write less custom SQL&lt;/li&gt;
&lt;li&gt;Continuous aggregates alone can replace dozens of lines of rollup code and hours of maintenance work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Rails developers deserve a time-series database that just works. TimescaleDB gives you the performance and scale your app needs without giving up the elegance of ActiveRecord.&lt;/p&gt;

&lt;p&gt;If you’re curious, here’s how to get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install TimescaleDB (it’s just a Postgres extension)&lt;/li&gt;
&lt;li&gt;Add the timescaledb gem to your Gemfile&lt;/li&gt;
&lt;li&gt;Identify models with time-based data&lt;/li&gt;
&lt;li&gt;Start with hypertables, then add continuous aggregates as needed&lt;/li&gt;
&lt;/ul&gt;
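
&lt;p&gt;As a rough sketch of those steps in one place (the &lt;code&gt;hypertable:&lt;/code&gt; option names are recalled from the timescaledb gem’s README rather than taken from this post, so verify them against the gem docs before relying on them):&lt;/p&gt;

```ruby
# Gemfile
gem 'timescaledb'

# In a migration's change method: enable the extension, then create the
# table as a hypertable partitioned on created_at. `hypertable:` is the
# option the timescaledb gem adds to create_table (option names assumed
# from the gem's README); id: false because hypertables are keyed on time.
def change
  enable_extension 'timescaledb'

  create_table :events, id: false,
               hypertable: { time_column: 'created_at' } do |t|
    t.string :name
    t.timestamps
  end
end
```

&lt;p&gt;With the table in place, the model side is exactly the &lt;code&gt;acts_as_hypertable&lt;/code&gt; macro shown earlier in this post.&lt;/p&gt;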

&lt;p&gt;You can self-host, or try Timescale Cloud for a fully managed option.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ: TimescaleDB for Ruby on Rails Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Do I need to change how I use ActiveRecord?
&lt;/h3&gt;

&lt;p&gt;A: Nope! TimescaleDB works with your existing ActiveRecord models. Just add the timescaledb gem and use the acts_as_hypertable macro to enable time-series functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How is TimescaleDB different from just using PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;A: TimescaleDB is a PostgreSQL extension. It gives you automatic time-based partitioning (hypertables), faster time-based queries, built-in compression, and continuous aggregates—all while staying 100% SQL- and Rails-compatible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I keep using the gems I already use for date grouping, like groupdate?
&lt;/h3&gt;

&lt;p&gt;A: Yes. TimescaleDB works seamlessly with gems like groupdate. You can continue using .group_by_day, .group_by_hour, etc., and get better performance under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What kind of performance improvements can I expect?
&lt;/h3&gt;

&lt;p&gt;A: Teams have seen sub-second query times on tens of millions of rows and 95%+ storage savings using TimescaleDB’s compression. The biggest wins are in read-heavy, time-bounded queries (e.g., user activity, logs, metrics).&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What’s the learning curve for continuous aggregates?
&lt;/h3&gt;

&lt;p&gt;A: It’s minimal. The timescaledb gem lets you define continuous aggregates using a simple DSL that reuses your existing scopes. You don’t need to learn new SQL or create custom rollup jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use this in production? Is it stable?
&lt;/h3&gt;

&lt;p&gt;A: Yes. TimescaleDB powers production workloads at companies like NetApp, Linktree, and RubyGems.org. It’s backed by years of performance and reliability improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do I need to self-host? Or is there a managed option?
&lt;/h3&gt;

&lt;p&gt;A: Both! You can self-host TimescaleDB or use Timescale Cloud, a fully managed PostgreSQL service with built-in TimescaleDB, HA, backups, and usage-based pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Where can I learn more?
&lt;/h3&gt;

&lt;p&gt;A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/timescale/timescaledb-ruby" rel="noopener noreferrer"&gt;Ruby Quickstart&lt;/a&gt; in Timescale Docs&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/timescale/timescaledb-ruby" rel="noopener noreferrer"&gt;timescaledb-ruby&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;Fully Managed Timescale Cloud&lt;/a&gt; (free for 30 days)&lt;/li&gt;
&lt;li&gt;Install the &lt;a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noopener noreferrer"&gt;open-source TimescaleDB extension&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>ruby</category>
      <category>rails</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>We Listened: Pgai Vectorizer Now Works With Any Postgres Database</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Mon, 05 May 2025 15:01:35 +0000</pubDate>
      <link>https://dev.to/tigerdata/we-listened-pgai-vectorizer-now-works-with-any-postgres-database-1e57</link>
      <guid>https://dev.to/tigerdata/we-listened-pgai-vectorizer-now-works-with-any-postgres-database-1e57</guid>
      <description>&lt;p&gt;TL;DR: &lt;br&gt;
We're excited to announce that pgai Vectorizer—the &lt;a href="https://www.timescale.com/blog/pgai-vectorizer-now-works-with-any-postgres-database" rel="noopener noreferrer"&gt;tool for robust embedding creation and management&lt;/a&gt;—is now available as a Python CLI and library, making it compatible with any Postgres database, whether it be self-hosted Postgres or cloud-hosted on Timescale Cloud, Amazon RDS for PostgreSQL, or Supabase. &lt;/p&gt;



&lt;p&gt;This expansion comes directly from developer feedback requesting broader accessibility while maintaining the Postgres integration that makes pgai Vectorizer the ideal solution for production-grade embedding creation, management, and experimentation. &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;To get started, head over to the pgai GitHub&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why We Built Pgai Vectorizer for Postgres
&lt;/h2&gt;

&lt;p&gt;When we first &lt;a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction" rel="noopener noreferrer"&gt;&lt;u&gt;launched pgai Vectorizer&lt;/u&gt;&lt;/a&gt;, we aimed to simplify vector embedding management for developers building AI systems with Postgres. We heard the horror stories of developers struggling with complex ETL (extract-transform-load) pipelines, embedding synchronization issues, and the constant battle to keep embeddings up-to-date when source data changes. Teams were spending more time maintaining infrastructure than building useful AI features.&lt;/p&gt;

&lt;p&gt;Many developers found themselves cobbling together custom solutions involving message queues, Lambda functions, and background workers just to handle the embedding creation workflow. Others faced the frustration of stale embeddings that no longer matched their updated content, leading to degraded search quality and hallucinations in their RAG applications.&lt;/p&gt;

&lt;p&gt;Pgai Vectorizer solved these problems with a declarative approach that automated the entire embedding lifecycle with a single SQL command, similar to how you'd create an index in Postgres. The &lt;a href="https://news.ycombinator.com/item?id=41985176" rel="noopener noreferrer"&gt;&lt;u&gt;tool resonated with developers&lt;/u&gt;&lt;/a&gt; and quickly gained traction among AI builders. However, we soon started hearing a consistent piece of feedback that would shape our next steps.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Change: Moving From Extension-Only to Python CLI and Library
&lt;/h2&gt;

&lt;p&gt;After our initial launch, we received consistent feedback from developers who wanted to use pgai Vectorizer with their existing managed Postgres databases. While our extension-based approach worked great for self-hosted Postgres and Timescale Cloud, users on platforms like Amazon RDS for PostgreSQL, Supabase, and other managed database services couldn't use pgai Vectorizer unless their cloud provider chose to make it available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXee19HYvVr8vTsVmwdlCVLOqHM1_-0VzjQrSk-3ZWQETtAFb8q8CBb9SKPmikQFJCl9ZgdpcrftidajbruKCWvshO8AkVuJbK5tpqlj9PyDrwk6SKrWfbG-KaRXu4KKmQyWrkX6bA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXee19HYvVr8vTsVmwdlCVLOqHM1_-0VzjQrSk-3ZWQETtAFb8q8CBb9SKPmikQFJCl9ZgdpcrftidajbruKCWvshO8AkVuJbK5tpqlj9PyDrwk6SKrWfbG-KaRXu4KKmQyWrkX6bA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="1437" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcau7ZF40A9PuXPL2Zbp60ymU-EK3MLtzeId2XpSittjRCcBxga3dFoBApqChi4cJTwXrD9Hw2lYoAPLv-5A6ehmbIbqU2_Bji1O39jVqSL-iAm5fVyKGiRexcfArnAj9X4KEtOgA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcau7ZF40A9PuXPL2Zbp60ymU-EK3MLtzeId2XpSittjRCcBxga3dFoBApqChi4cJTwXrD9Hw2lYoAPLv-5A6ehmbIbqU2_Bji1O39jVqSL-iAm5fVyKGiRexcfArnAj9X4KEtOgA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="823" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXc7W5YksotaCkwdSfhpzWB1x6DmpkBnX5DQAvP1ahIknUEXFHjwM8ATzNFAoo76_mKKDT6MpvCc_aNjCi3HZ5T9qjkB7dLGvqNh7FifbYv---v9MJZf4fPp3mNEPKKTop4-h7zr4A%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXc7W5YksotaCkwdSfhpzWB1x6DmpkBnX5DQAvP1ahIknUEXFHjwM8ATzNFAoo76_mKKDT6MpvCc_aNjCi3HZ5T9qjkB7dLGvqNh7FifbYv---v9MJZf4fPp3mNEPKKTop4-h7zr4A%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="1435" height="997"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Requests for pgai Vectorizer support on Supabase, Azure PostgreSQL, and Amazon RDS.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We knew we needed to make pgai Vectorizer more accessible without compromising its seamless Postgres integration. The solution? Repackaging our core functionality as a Python CLI (command-line interface) and library that can work with any Postgres database while maintaining the same robustness and "set it and forget it" simplicity.&lt;/p&gt;

&lt;p&gt;This approach gives developers the best of both worlds: the powerful vectorization capabilities of pgai Vectorizer with the flexibility to use their existing database infrastructure, regardless of provider. The Python library creates the database objects that house the pgai Vectorizer internals and provides a SQL API for loading data, creating embeddings, and synchronizing changes, writing the results back to your Postgres database.&lt;/p&gt;

&lt;p&gt;The library maintains all the core functionality that made pgai Vectorizer valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding creation and management:&lt;/strong&gt; Automatically create and synchronize vector embeddings from &lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines" rel="noopener noreferrer"&gt;Postgres data and S3 documents&lt;/a&gt;. Embeddings update automatically as data changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready out of the box:&lt;/strong&gt; Supports batch processing for efficient embedding generation, with built-in handling for model failures, rate limits, and latency spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation and testing:&lt;/strong&gt; &lt;a href="https://www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noopener noreferrer"&gt;&lt;u&gt;Easily switch between embedding models&lt;/u&gt;&lt;/a&gt;, test different models, and compare performance without changing application code or manually reprocessing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plays well with pgvector and pgvectorscale:&lt;/strong&gt; Once your embeddings are created, use them to power vector and semantic search with &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;. Embeddings are stored in the pgvector data format. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this means for existing users:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timescale Cloud customers:&lt;/strong&gt; Existing vectorizers running on Timescale Cloud will continue to work as is, so no immediate action is necessary. We encourage you to use the new &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;pgai Python library&lt;/u&gt;&lt;/a&gt; to create and manage new vectorizers. To do so, you have to upgrade to the latest version of both the pgai extension in Timescale Cloud and the pgai Python library. Upgrading the extension decouples the vectorizer-related database objects from the extension, therefore allowing them to be managed by the Python library. Pgai Vectorizer remains in Early Access on Timescale Cloud. &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/migrating-from-extension.md" rel="noopener noreferrer"&gt;&lt;u&gt;See this guide&lt;/u&gt;&lt;/a&gt; for details and instructions on upgrading and migrating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted users:&lt;/strong&gt; Existing self-hosting vectorizers will also continue to work as is, so no immediate action is required. If you already have the pgai extension installed, you’ll need to upgrade to version 0.10.1. Upgrading the extension decouples the vectorizer-related database objects from the extension, therefore allowing them to be created and managed by the Python library. &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/migrating-from-extension.md" rel="noopener noreferrer"&gt;&lt;u&gt;See this guide&lt;/u&gt;&lt;/a&gt; for self-hosted upgrade and migration details and instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this means for new users:&lt;/strong&gt; Whether you use pgai Vectorizer on Timescale Cloud or self-hosted, this change means a simplified installation process and more flexibility—you now have tighter integrations between pgai Vectorizer and your search and RAG backends in your AI applications. Self-hosted users no longer need to install the pgai extension to use pgai Vectorizer. Timescale Cloud customers will continue to get the pgai extension auto-installed for them. To try pgai Vectorizer for yourself, &lt;a href="https://github.com/timescale/pgai#quick-start" rel="noopener noreferrer"&gt;&lt;u&gt;here’s how you can get started&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pgai Vectorizer Works With Any Postgres Database
&lt;/h2&gt;

&lt;p&gt;The new Python library implementation of pgai Vectorizer works with virtually any Postgres database, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;Timescale Cloud&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Self-hosted Postgres&lt;/li&gt;
&lt;li&gt;Amazon RDS for PostgreSQL&lt;/li&gt;
&lt;li&gt;Supabase&lt;/li&gt;
&lt;li&gt;Google Cloud SQL for PostgreSQL&lt;/li&gt;
&lt;li&gt;Azure Database for PostgreSQL&lt;/li&gt;
&lt;li&gt;Neon PostgreSQL&lt;/li&gt;
&lt;li&gt;Render PostgreSQL&lt;/li&gt;
&lt;li&gt;DigitalOcean Managed Databases&lt;/li&gt;
&lt;li&gt;Any other self-hosted or managed Postgres service running PostgreSQL 15 or later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new implementation addresses one of our most requested features from the community. Users were actively building AI applications with these managed services, but couldn't take advantage of pgai Vectorizer's powerful embedding management capabilities.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Use Pgai Vectorizer: A Quick Refresher
&lt;/h2&gt;

&lt;p&gt;A standout feature of the new Python library is its enhanced support for document processing directly from cloud storage. &lt;/p&gt;

&lt;p&gt;With the expanded Amazon S3 integration, you can now seamlessly load documents and generate embeddings based on file URLs stored in your Postgres table. Pgai Vectorizer automatically loads and parses each document into an LLM-friendly format like Markdown, then generates the required chunks for embedding creation, all according to your specification.&lt;/p&gt;

&lt;p&gt;For document vectorization, we've included support for parsing multiple formats, including PDF, DOCX, XLSX, HTML, images, and more using &lt;a href="https://research.ibm.com/publications/docling-an-efficient-open-source-toolkit-for-ai-driven-document-conversion" rel="noopener noreferrer"&gt;&lt;u&gt;IBM Docling&lt;/u&gt;&lt;/a&gt;, which provides advanced document understanding capabilities. This makes it easy to build powerful document search and retrieval systems without leaving the Postgres ecosystem.&lt;/p&gt;

&lt;p&gt;Getting started with the pgai Vectorizer Python library is straightforward. Install pgai on your database via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pgai
pgai install -d postgresql://postgres:postgres@localhost:5432/postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, your database is enhanced with pgai’s capabilities. Here's a simple example of how to create a vectorizer for processing text data from a database column named &lt;code&gt;text&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'wiki'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if_not_exists&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loading_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-ada-002'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'1536'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_name&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'wiki_embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For document processing, you can use this configuration, which shows a document metadata table in PostgreSQL with references to data in Amazon S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Document source table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;owner_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_level&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Example with rich metadata&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Product Manual'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/documents/product-manual.pdf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'application/pdf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'internal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'reference'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'API Reference'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/documents/api-reference.md'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'text/markdown'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'api'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'developer'&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'document'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loading_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'uri'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;## '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;### '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;#### '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;- '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;1. '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'!'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'document_embeddings'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the worker via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;pgai&lt;/span&gt; &lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; 
&lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And watch the magic happen as pgai creates vector embeddings for your source data.&lt;/p&gt;
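&lt;p&gt;Once embeddings exist, querying them is plain SQL. A sketch using pgvector's cosine-distance operator against the &lt;code&gt;wiki_embedding&lt;/code&gt; view from the example above (the query embedding is a placeholder your application supplies):&lt;/p&gt;

```sql
-- Five chunks nearest to a query embedding
SELECT chunk, embedding <=> :query_embedding AS distance
FROM wiki_embedding
ORDER BY distance
LIMIT 5;
```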

&lt;h2&gt;
  
  
  Get Started With Pgai Vectorizer Today
&lt;/h2&gt;

&lt;p&gt;We're excited to see what you'll build with the new pgai Vectorizer, whether you're creating semantic search, RAG, or next-gen agentic applications.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub repository&lt;/u&gt;&lt;/a&gt; to explore its capabilities and getting-started guides.&lt;/p&gt;

&lt;p&gt;As you can tell by this post, we really value community feedback. If you encounter any issues or have suggestions for improvements, please open an &lt;a href="https://github.com/timescale/pgai/issues" rel="noopener noreferrer"&gt;&lt;u&gt;issue on GitHub&lt;/u&gt;&lt;/a&gt; or join our &lt;a href="https://discord.gg/KRdHVXAmkp" rel="noopener noreferrer"&gt;&lt;u&gt;community Discord&lt;/u&gt;&lt;/a&gt;. Your input will help shape the future development of pgai Vectorizer as we continue to enhance its capabilities.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>python</category>
      <category>ai</category>
      <category>news</category>
    </item>
    <item>
      <title>PostgreSQL vs. Qdrant for Vector Search: 50M Embedding Benchmark</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Fri, 02 May 2025 14:35:37 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-qdrant-for-vector-search-50m-embedding-benchmark-3hhe</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-qdrant-for-vector-search-50m-embedding-benchmark-3hhe</guid>
      <description>&lt;p&gt;Vector search is becoming a core workload for AI-driven applications. But do you really need to introduce a new system just to handle it?&lt;/p&gt;

&lt;p&gt;We ran a performance benchmark to find out: &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;comparing PostgreSQL (using pgvector + pgvectorscale) with Qdrant on 50 million embeddings&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results at 99% recall:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sub-100ms query latencies&lt;/li&gt;
&lt;li&gt;471 queries per second (QPS) on Postgres—11x higher throughput than Qdrant (41 QPS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Head to the full write-up for a deep dive into our &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;vector database comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4cujejzv0axs1s27e72.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4cujejzv0axs1s27e72.jpg" alt="Postgres vs Qdrant vector database performance comparison" width="720" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  For vectors, Postgres is all you need.
&lt;/h2&gt;

&lt;p&gt;At 99% recall, Postgres delivers sub-100ms query latencies and handles 11x more query throughput than Qdrant (471 QPS vs. Qdrant’s 41 QPS).&lt;/p&gt;

&lt;p&gt;The results show that thanks to &lt;code&gt;pgvectorscale&lt;/code&gt;, &lt;a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/" rel="noopener noreferrer"&gt;Postgres can keep up with specialized vector databases&lt;/a&gt;, delivering performance that matches or beats them at scale. Learn more about &lt;a href="https://www.timescale.com/blog/why-postgres-wins-for-ai-and-vector-workloads" rel="noopener noreferrer"&gt;why Postgres wins for AI and vector workloads&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning PostgreSQL Into a High-Performance Vector Search Engine
&lt;/h2&gt;

&lt;p&gt;How? We built &lt;code&gt;pgvectorscale&lt;/code&gt; to push Postgres to its limits for vector workloads—without compromising recall, latency, or cost-efficiency. It turns your favorite relational database into a high-performance vector search engine.&lt;/p&gt;
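&lt;p&gt;Concretely, vector indexing stays a one-statement affair. A sketch of creating a pgvectorscale StreamingDiskANN index (the table and column names here are hypothetical):&lt;/p&gt;

```sql
-- pgvectorscale installs the diskann access method
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

CREATE INDEX document_embedding_idx
    ON document_embedding
    USING diskann (embedding vector_cosine_ops);
```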

&lt;ul&gt;
&lt;li&gt;✅ No extra systems.&lt;/li&gt;
&lt;li&gt;✅ No new query languages.&lt;/li&gt;
&lt;li&gt;✅ Just Postgres.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We used &lt;a href="https://rtabench.com/" rel="noopener noreferrer"&gt;RTABench&lt;/a&gt; to run a transparent, reproducible evaluation—designed for real-world, high-scale workloads.&lt;/p&gt;

&lt;p&gt;Curious about the architecture behind it all?&lt;/p&gt;

&lt;p&gt;👉 Read our whitepaper on &lt;a href="https://docs.timescale.com/about/latest/whitepaper/" rel="noopener noreferrer"&gt;building Timescale for real-time and AI workloads&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It dives into how we engineered Timescale to handle time-series, vector, and relational data—all in one Postgres-native platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: For many vector workloads, Postgres is all you need.
&lt;/h2&gt;

&lt;p&gt;Have you used Postgres or Qdrant for vector search?&lt;br&gt;
What does your stack look like today—and where do you feel the friction?&lt;/p&gt;

&lt;p&gt;👉 Postgres vs Qdrant: which side are you on? Comment down below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Connecting S3 and Postgres: Automatic Synchronization Without ETL Pipelines</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Thu, 01 May 2025 12:32:36 +0000</pubDate>
      <link>https://dev.to/tigerdata/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines-32kg</link>
      <guid>https://dev.to/tigerdata/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines-32kg</guid>
      <description>&lt;p&gt;Modern applications need data that's both accessible and fast. You have data in S3, but transforming it into usable insights requires complex ETL (extract-transform-load) pipelines. With our new &lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines" rel="noopener noreferrer"&gt;livesync for S3 and pgai Vectorizer features&lt;/a&gt;, Timescale transforms how you interact with S3 data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Powerful Postgres–S3 Integration Approaches
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06aeidm498g1kyknudmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06aeidm498g1kyknudmc.png" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our new features offer distinct approaches to working with S3 data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines#transform-s3-to-analytics-in-seconds-automatic-data-synchronization-with-livesync" rel="noopener noreferrer"&gt;&lt;strong&gt;Livesync for S3&lt;/strong&gt;&lt;/a&gt; brings your structured S3 data directly into Postgres tables, automatically synchronizing files as they change.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines#simplify-document-embeddings-with-pgai-vectorizer" rel="noopener noreferrer"&gt;&lt;strong&gt;pgai Vectorizer&lt;/strong&gt; leaves documents in S3 but generates searchable embeddings and metadata in Postgres&lt;/a&gt;, connecting unstructured content with structured data for RAG, search, and agentic applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both eliminate complex ETL pipelines, letting you work with S3 data using familiar SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transform S3 to Analytics in Seconds: Automatic Data Synchronization With Livesync
&lt;/h2&gt;

&lt;p&gt;S3 is where countless organizations store their data, but Timescale Cloud is where they unlock insights. Livesync for S3 bridges this gap, eliminating the traditional complexity of moving data between these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem: Complex ETL pipelines for S3 data
&lt;/h3&gt;

&lt;p&gt;Bridging S3 storage and analytics environments creates significant data management obstacles. Moving data between S3 buckets and analytical databases takes manual effort and custom integration code that demands ongoing maintenance, and the ETL processes themselves are brittle and resource-intensive to operate.&lt;/p&gt;

&lt;p&gt;Many organizations are caught in a constant battle to keep data fresh, relying on vigilant monitoring to confirm that their analytics platforms reflect the most current information in S3. These challenges frequently culminate in performance bottlenecks: inefficient data transfer delays up-to-date information from reaching customer-facing applications, degrading user experiences and leaving customers to make decisions on stale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Automatic data synchronization
&lt;/h3&gt;

&lt;p&gt;Livesync for S3 eliminates this complexity. We've engineered it to bring stream-like behavior to object storage, effectively turning your S3 bucket into a continuous data feed.&lt;/p&gt;

&lt;p&gt;Our solution delivers speed and simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-ETL experience&lt;/strong&gt;: Eliminate complex pipelines or custom integration code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data pipeline&lt;/strong&gt;: Turn your S3 bucket into a continuous data feed with automatic synchronization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar tools&lt;/strong&gt;: Use S3 for storage and Timescale Cloud for analytics without compromise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal configuration&lt;/strong&gt;: Connect to your S3 bucket, define mapping, and let livesync handle the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How livesync works
&lt;/h3&gt;

&lt;p&gt;Behind the scenes, we're doing the heavy lifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema mapping that infers column types from your CSV or Parquet files and maps them to hypertables&lt;/li&gt;
&lt;li&gt;Managing the initial data load&lt;/li&gt;
&lt;li&gt;Maintaining continuous synchronization&lt;/li&gt;
&lt;li&gt;Intelligent tracking of processed files to prevent duplicates or missed data&lt;/li&gt;
&lt;/ul&gt;
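&lt;p&gt;The schema-mapping step above can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not livesync's actual implementation (which also recognizes timestamps and other types); the function names are hypothetical:&lt;/p&gt;

```python
import csv
import io

def infer_pg_type(values):
    """Pick the narrowest Postgres-ish type that fits every sampled value."""
    for cast, pg_type in ((int, "BIGINT"), (float, "DOUBLE PRECISION")):
        try:
            for v in values:
                cast(v)
            return pg_type
        except ValueError:
            continue
    return "TEXT"

def infer_schema(csv_text, sample_rows=100):
    """Read the header plus a sample of rows and infer a column -> type map."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {name: infer_pg_type([row[i] for row in sample])
            for i, name in enumerate(header)}

print(infer_schema("time,device,temp\n2025-01-01,sensor-1,21.5\n2025-01-02,sensor-2,19.8\n"))
# {'time': 'TEXT', 'device': 'TEXT', 'temp': 'DOUBLE PRECISION'}
```

&lt;p&gt;A production mapper would additionally detect timestamp columns (so they can become the hypertable's time dimension), but the sampling-then-narrowing approach is the same idea.&lt;/p&gt;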

&lt;p&gt;This enables teams across multiple industries to build robust pipelines. For organizations with production applications on Postgres looking to scale their real-time analytics, livesync for S3 has a sister solution—&lt;a href="https://www.timescale.com/blog/connect-any-postgres-to-real-time-analytics" rel="noopener noreferrer"&gt;&lt;u&gt;livesync for Postgres&lt;/u&gt;&lt;/a&gt;—which lets you keep your Postgres as-is while streaming data in real time to a Timescale Cloud instance optimized for analytical workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  The inner workings of livesync for S3
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Secure cross-account authentication
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbunjw7pte7q1v6jobbwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbunjw7pte7q1v6jobbwx.png" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Livesync employs a robust security model using AWS role assumption. Our service assumes a specific role in your AWS account with precisely the permissions needed to access your S3 data. To prevent confused deputy attacks, we implement the industry-standard External ID verification using your unique Project ID/Service ID combination.&lt;/p&gt;

&lt;h4&gt;
  
  
  Smart polling and file discovery
&lt;/h4&gt;

&lt;p&gt;Behind the scenes, livesync intelligently scans your S3 bucket using optimized ListObjectsV2 calls. Starting with the prefix from your pattern (like "logs/" from "logs/**/*.csv"), it applies glob matching to find relevant files. The system tracks processed files in lexicographical order, ensuring no file is missed or duplicated.&lt;/p&gt;
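&lt;p&gt;As a rough illustration of that discovery logic (a simplified sketch, not livesync's actual code; the function names are ours, and Python's &lt;code&gt;fnmatch&lt;/code&gt; only approximates S3 glob semantics):&lt;/p&gt;

```python
import fnmatch

def static_prefix(pattern):
    """Literal prefix of a glob pattern, e.g. 'logs/' for 'logs/**/*.csv'.
    Listing can start at this prefix instead of scanning the whole bucket."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            return pattern[:i]
    return pattern

def discover_new_files(listed_keys, pattern, last_processed_key):
    """Keys are tracked in lexicographical order: anything at or before the
    last processed key has already been ingested, so it is skipped."""
    return sorted(k for k in listed_keys
                  if fnmatch.fnmatch(k, pattern) and k > last_processed_key)

keys = ["logs/2025/01/a.csv", "logs/2025/01/b.csv",
        "logs/2025/02/c.csv", "logs/readme.txt"]
print(static_prefix("logs/**/*.csv"))  # logs/
print(discover_new_files(keys, "logs/**/*.csv", "logs/2025/01/a.csv"))
```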

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij6v8264zidawnxzo2ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij6v8264zidawnxzo2ze.png" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To maintain performance, livesync for S3 manages an orderly queue limited to 100 files per connection. When files are plentiful, polling accelerates to every minute; when caught up, it follows your configured schedule. You can always trigger immediate processing with the "Pull now" button.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimized data processing pipeline
&lt;/h4&gt;

&lt;p&gt;Livesync handles different file formats with specialized techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CSV files&lt;/strong&gt; are checked for encoding and compression (UTF-8, ZIP, GZIP), then processed using high-performance parallel ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet files&lt;/strong&gt; undergo efficient conversion before being streamed into TimescaleDB (which lives at the core of your Timescale Cloud service).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire pipeline includes intelligent error handling, which is clearly visible in the dashboard. After three consecutive failures, livesync automatically pauses to prevent resource waste, awaiting your review.&lt;/p&gt;
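&lt;p&gt;The pause-after-three-consecutive-failures behavior might look roughly like this (an illustrative sketch under our own naming, not livesync's actual code):&lt;/p&gt;

```python
def process_queue(files, process, max_consecutive_failures=3):
    """Process queued files in order; after N consecutive failures the
    connection pauses, awaiting review, so no further files are attempted."""
    failures = 0
    status = []
    for f in files:
        try:
            process(f)
            status.append((f, "imported"))
            failures = 0  # any success resets the streak
        except Exception as err:
            failures += 1
            status.append((f, "error: " + str(err)))
            if failures >= max_consecutive_failures:
                status.append((None, "paused: awaiting review"))
                break
    return status

def always_fail(path):
    raise ValueError("bad header")

print(process_queue(["a.csv", "b.csv", "c.csv", "d.csv"], always_fail)[-1])
# (None, 'paused: awaiting review')
```

&lt;p&gt;Note that the fourth file is never attempted: once paused, the connection waits for the operator rather than burning resources on a likely misconfiguration.&lt;/p&gt;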

&lt;p&gt;This architecture delivers the perfect balance of reliability, performance, and operational simplicity, bringing your S3 data into Timescale Cloud with minimal configuration and maximum confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build powerful ingest pipelines with minimal configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IoT telemetry flows:&lt;/strong&gt; Connect devices that log to S3 (like AWS IoT Core) directly to time-series analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming data persistence:&lt;/strong&gt; Automatically process data from Kinesis, Kafka, or other streaming platforms that land files in S3, transforming it into TimescaleDB hypertables for high-performance querying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crypto/financial data analytics:&lt;/strong&gt; Sync trading data from S3 into TimescaleDB for real-time analytics on recent market movements and long-term historical analysis for backtesting and trend identification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently supporting CSV and Parquet file formats, livesync delivers a frictionless way to unlock the value of your data stored in S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcelzm9f7sfzbwblx9bjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcelzm9f7sfzbwblx9bjm.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple setup, powerful results
&lt;/h3&gt;

&lt;p&gt;Livesync for S3 continuously monitors your S3 bucket for incoming sensor data, automatically maps schemas, and syncs data into TimescaleDB hypertables in minutes, keeping the flow from S3 to hypertables smooth, dependable, and fast. This enables operators to query millions of readings with millisecond latency, driving real-time dashboards that catch anomalies before equipment fails.&lt;/p&gt;

&lt;p&gt;Setting up &lt;a href="https://docs.timescale.com/migrate/latest/livesync-for-s3/" rel="noopener noreferrer"&gt;&lt;u&gt;livesync for S3&lt;/u&gt;&lt;/a&gt; is surprisingly straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to your S3 bucket with your credentials.&lt;/li&gt;
&lt;li&gt;Define how your objects map to TimescaleDB tables.&lt;/li&gt;
&lt;li&gt;Let livesync for S3 handle the rest—monitoring and ingesting new data automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Behind the scenes, we're doing the heavy lifting of schema mapping, managing the initial data load, and maintaining continuous synchronization. The system intelligently tracks what it's processed, so you never have duplicate data or missed files.&lt;/p&gt;

&lt;p&gt;For example, in manufacturing environments where sensors continuously capture critical equipment data through AWS IoT Core and store it in S3, livesync ensures this data becomes immediately queryable in TimescaleDB. This enables operators to identify anomalies before equipment fails, turning static S3 storage into actionable intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero maintenance, maximum performance
&lt;/h3&gt;

&lt;p&gt;Once configured, livesync for S3 delivers ease and performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-maintenance operation&lt;/strong&gt; once configured&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema mapping&lt;/strong&gt; that infers column types from your CSV or Parquet files and maps them to hypertables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic retry mechanisms&lt;/strong&gt; for transient failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained control&lt;/strong&gt; over which objects sync and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete observability&lt;/strong&gt; with detailed history of file imports and error messages (if any)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Simplify Document Embeddings With Pgai Vectorizer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching unstructured document embeddings with pgvector
&lt;/h3&gt;

&lt;p&gt;While livesync brings S3 data into Postgres, pgai Vectorizer takes a different approach for unstructured documents. It creates searchable vector embeddings in Postgres from documents stored in S3 while keeping the original files in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem: Complex pipelines for document search
&lt;/h3&gt;

&lt;p&gt;AI applications using RAG (retrieval-augmented generation) can help businesses unlock insights from mountains of unstructured data. Today, that unstructured data’s natural home is Amazon S3. On the other hand, Postgres has become the default vector database for developers, thanks to extensions like &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;. These extensions enable them to build intelligent applications with vector search capabilities without needing to use a separate database just for vectors.&lt;/p&gt;

&lt;p&gt;We’ve previously written about how &lt;a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction" rel="noopener noreferrer"&gt;&lt;u&gt;vector databases are the wrong abstraction&lt;/u&gt;&lt;/a&gt; because they divorce embeddings from the unstructured source data they were generated from. This problem is especially apparent for documents housed in object storage like Amazon S3.&lt;/p&gt;

&lt;p&gt;Before pgai Vectorizer, developers typically needed to manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex ETL pipelines to chunk, format, and create embeddings from source data&lt;/li&gt;
&lt;li&gt;Multiple systems: a vector database for embeddings, an application database for metadata, and possibly a separate lexical search index&lt;/li&gt;
&lt;li&gt;Data synchronization services to maintain a single source of truth&lt;/li&gt;
&lt;li&gt;Queuing systems for updates and synchronization&lt;/li&gt;
&lt;li&gt;Monitoring tools to catch data drift and handle rate limits from embedding services&lt;/li&gt;
&lt;li&gt;Alert systems for stale search results&lt;/li&gt;
&lt;li&gt;Validation checks across all these systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Processing documents in AI pipelines introduces several challenges, such as managing diverse file formats (PDFs, DOCX, XLSX, HTML, and more), handling complex metadata, keeping embeddings up to date with document changes, and ensuring efficient storage and retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Automatic document vectorization
&lt;/h3&gt;

&lt;p&gt;To solve these challenges, Timescale has added support for document vectorization to pgai Vectorizer, giving developers an automated way to create embeddings from documents in Amazon S3 and keep those embeddings synchronized as the underlying data changes, eliminating the need for external ETL pipelines and queuing systems.&lt;/p&gt;

&lt;p&gt;Pgai Vectorizer provides a streamlined approach where developers can reference documents in S3 (or local storage) via URLs stored in a database table. The vectorizer then handles the complete workflow—downloading documents, parsing them to extract content, chunking text appropriately, and generating embeddings for use in semantic search, RAG, or agentic applications.&lt;/p&gt;

&lt;p&gt;This integration supports a wide variety of file formats, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents: PDF, DOCX, TXT, MD, AsciiDoc&lt;/li&gt;
&lt;li&gt;Spreadsheets: CSV, XLSX&lt;/li&gt;
&lt;li&gt;Presentations: PPTX&lt;/li&gt;
&lt;li&gt;Images: PNG, JPG, TIFF, BMP&lt;/li&gt;
&lt;li&gt;Web content: HTML&lt;/li&gt;
&lt;li&gt;Books: MOBI, EPUB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, pgai Vectorizer for document vectorization offers three key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Get started more easily&lt;/strong&gt; → Automatic embedding creation with a simple SQL command manages the entire workflow from document reference to searchable embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spend less time wrangling data infrastructure&lt;/strong&gt; → Automatic updating and synchronization of embeddings means your vector search stays current with your S3 documents without manual intervention. It’s as simple as adding a new row or updating a “modified_at” column in the documents table, and pgai Vectorizer will take care of any (re)processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuously improve your AI systems&lt;/strong&gt; → Testing and experimentation with different embedding models or chunking strategies can be done &lt;a href="https://www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noopener noreferrer"&gt;&lt;u&gt;with a single line of SQL&lt;/u&gt;&lt;/a&gt;, allowing you to optimize your application's performance.&lt;/li&gt;
&lt;/ol&gt;
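&lt;p&gt;The "modified_at" change detection in point 2 can be illustrated with a small sketch (our simplification; in practice pgai Vectorizer does this comparison in SQL against the source and embedding tables):&lt;/p&gt;

```python
from datetime import datetime

def rows_needing_embedding(documents, embeddings):
    """A document needs (re)processing when it has no embedding yet, or when
    its modified_at is newer than the embedding generated for it."""
    generated = {e["doc_id"]: e["generated_at"] for e in embeddings}
    return [d["id"] for d in documents
            if d["id"] not in generated or d["modified_at"] > generated[d["id"]]]

docs = [
    {"id": 1, "modified_at": datetime(2025, 5, 1)},  # unchanged since embedding
    {"id": 2, "modified_at": datetime(2025, 5, 3)},  # modified after embedding
    {"id": 3, "modified_at": datetime(2025, 5, 2)},  # never embedded
]
embs = [
    {"doc_id": 1, "generated_at": datetime(2025, 5, 2)},
    {"doc_id": 2, "generated_at": datetime(2025, 5, 2)},
]
print(rows_needing_embedding(docs, embs))  # [2, 3]
```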

&lt;p&gt;By keeping your embeddings automatically synchronized to the source documents in S3, pgai Vectorizer ensures that your Postgres database remains the single source of truth for both your structured and vector data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Under the hood: How pgai Vectorizer works with Amazon S3
&lt;/h2&gt;

&lt;p&gt;Pgai Vectorizer simplifies the entire document processing pipeline through a streamlined architecture that connects your Amazon S3 documents with Postgres. Here's how it works:&lt;/p&gt;

&lt;h4&gt;
  
  
  Architecture overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0nxj834n1hvuxevl0oo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0nxj834n1hvuxevl0oo.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture overview of pgai Vectorizer: The vectorizer system takes in source data from Postgres tables and S3 buckets, creates embeddings via worker processes running in AWS Lambda using user-specified parsing, chunking, and embedding configurations, and stores the final embeddings in Postgres tables using the pgvector data type.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pgai Vectorizer architecture for document vectorization consists of several key components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data sources: Postgres and Amazon S3&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text and metadata residing in Postgres tables&lt;/li&gt;
&lt;li&gt;Postgres tables containing URLs that reference documents in Amazon S3 (which serves as the data aggregation layer where your documents reside)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vectorization configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stored in Postgres, allowing you to manage everything through familiar SQL commands&lt;/li&gt;
&lt;li&gt;Defines chunking strategies, embedding models, and processing parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vectorizer worker (AWS Lambda)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A daemon process that handles the actual work of processing documents&lt;/li&gt;
&lt;li&gt;Responsible for downloading, parsing, chunking, and embedding creation&lt;/li&gt;
&lt;li&gt;Automatically manages synchronization between source documents and embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Destination&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All embeddings are stored in Postgres alongside metadata&lt;/li&gt;
&lt;li&gt;Enables unified queries across both structured data and vector embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Document processing pipeline
&lt;/h4&gt;

&lt;p&gt;The document vectorization process follows these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Documents are referenced via URLs stored in a database column.&lt;/li&gt;
&lt;li&gt;The vectorizer downloads documents using these URLs.&lt;/li&gt;
&lt;li&gt;Documents are parsed to extract text content in an embedding-friendly format. &lt;/li&gt;
&lt;li&gt;The content is chunked using configurable chunking strategies.&lt;/li&gt;
&lt;li&gt;Chunks are processed for embedding generation using your chosen embedding model.&lt;/li&gt;
&lt;li&gt;Embeddings are stored in Postgres with references to the source documents.&lt;/li&gt;
&lt;/ol&gt;
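&lt;p&gt;The six steps above can be sketched end to end as a toy pipeline (our illustration with stubbed download/parse/embed callables and a naive fixed-size chunker; the real vectorizer worker is configurable and far more robust):&lt;/p&gt;

```python
def chunk_text(text, size=200, overlap=40):
    """Fixed-size character chunks with overlap, standing in for the
    configurable chunking strategies in step 4."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def vectorize_document(url, download, parse, embed):
    """Steps 2-6: download the referenced document, parse it, chunk the
    content, embed each chunk, and keep a reference back to the source."""
    content = parse(download(url))
    return [
        {"source_url": url, "chunk_seq": i, "chunk": c, "embedding": embed(c)}
        for i, c in enumerate(chunk_text(content))
    ]

rows = vectorize_document(
    "s3://bucket/doc.pdf",
    download=lambda url: b"raw bytes",        # stub: would fetch from S3
    parse=lambda raw: "parsed text " * 50,    # stub: would extract real text
    embed=lambda chunk: [0.0, 0.1, 0.2],      # stub: would call an embedding model
)
print(len(rows), rows[0]["source_url"])
```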

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1phx1teqqkarcuu6zg0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1phx1teqqkarcuu6zg0u.png" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pgai Vectorizer document processing pipeline showing how files in Amazon S3 get parsed, chunked, formatted, and embedded in order to be used in vector search queries in a Postgres database.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key components
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loader&lt;/strong&gt;: Loads files from Amazon S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parser&lt;/strong&gt;: Extracts content from retrieved files, handling different document formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Splits content into appropriate sizes for embedding models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formatting&lt;/strong&gt;: Organizes chunks with metadata from the source files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding generator&lt;/strong&gt;: Processes chunks into vector embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use cases for pgai Vectorizer document vectorization
&lt;/h3&gt;

&lt;p&gt;Pgai Vectorizer's document vectorization capabilities enable several powerful use cases across industries by connecting S3-stored documents with Postgres vector search:&lt;/p&gt;

&lt;h4&gt;
  
  
  Financial analysis
&lt;/h4&gt;

&lt;p&gt;Automatically vectorize financial documents from S3 without custom pipelines. Connect document insights with quantitative metrics for unified queries.&lt;/p&gt;

&lt;h4&gt;
  
  
  Legal document management
&lt;/h4&gt;

&lt;p&gt;Maintain synchronized knowledge bases of legal documents with automatic embedding updates. Test different models for your specific domain.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enhanced customer support
&lt;/h4&gt;

&lt;p&gt;Make knowledge base content immediately searchable as it changes, connecting support documents with customer data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Research systems
&lt;/h4&gt;

&lt;p&gt;Build research AI with continuously updated paper collections, connecting published findings with experimental time-series data.&lt;/p&gt;

&lt;p&gt;In each case, pgai Vectorizer eliminates infrastructure complexity while enabling continuous improvement through its "set it and forget it" synchronization and simple experimentation capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Out the S3 Features in Timescale Cloud Today
&lt;/h2&gt;

&lt;p&gt;Livesync and pgai Vectorizer are just the first steps in our vision to unify Postgres and object storage into a single, powerful lakehouse-style architecture—built for real-time AI and analytics. &lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://docs.timescale.com/migrate/latest/livesync-for-s3/" rel="noopener noreferrer"&gt;&lt;u&gt;Try Livesync for S3.&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/document-embeddings.md" rel="noopener noreferrer"&gt;&lt;u&gt;Try pgai Vectorizer.&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;Sign up for Timescale Cloud&lt;/u&gt;&lt;/a&gt; and get started in seconds.&lt;/p&gt;

&lt;p&gt;We can’t wait to see what you build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 4 preview: &lt;em&gt;Developer Tools That Speed Up Your Workflow: Introducing SQL Assistant, Recommendation Engine, and Insights&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Tomorrow, we'll reveal how Timescale delivers high-speed performance without sacrificing simplicity through &lt;strong&gt;SQL assistant with agent mode&lt;/strong&gt;, &lt;strong&gt;recommendation engine&lt;/strong&gt;, and &lt;strong&gt;Insights&lt;/strong&gt;. See how plain-language queries eliminate SQL wrangling, how automated tuning keeps databases optimized with a single click, and why developers finally get both the millisecond response times users demand and the operational simplicity teams need.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>aws</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postgres vs. Qdrant: Why Postgres Wins for AI and Vector Workloads</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Wed, 30 Apr 2025 16:01:47 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgres-vs-qdrant-why-postgres-wins-for-ai-and-vector-workloads-3d71</link>
      <guid>https://dev.to/tigerdata/postgres-vs-qdrant-why-postgres-wins-for-ai-and-vector-workloads-3d71</guid>
      <description>&lt;p&gt;It's Timescale Launch Week and we’re bringing benchmarks: &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;Postgres vs. Qdrant on 50M Embeddings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a belief in the AI infrastructure world that you need to abandon general-purpose databases to get great performance on vector workloads. The logic goes: Postgres is great for transactions, but when you need high-performance vector search, it’s time to bring in a specialized vector database like Qdrant.&lt;/p&gt;

&lt;p&gt;That logic doesn’t hold—just like it didn’t when we benchmarked &lt;a href="https://www.timescale.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;pgvector vs. &lt;u&gt;Pinecone&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like everything in Launch Week, this is about speed without sacrifice. And in this case, Postgres delivers both.&lt;/p&gt;

&lt;p&gt;We’re releasing a new benchmark that challenges the assumption that you can only scale with a specialized vector database. We compared Postgres (with &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;) to Qdrant on a massive dataset of 50 million embeddings. The results show that Postgres not only holds its own but also delivers standout throughput and latency, even at production scale.&lt;/p&gt;

&lt;p&gt;This post summarizes the key takeaways, but it’s just the beginning. &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;&lt;u&gt;Check out the full benchmark blog post&lt;/u&gt;&lt;/a&gt; on query performance, developer experience, and operational experience.&lt;/p&gt;

&lt;p&gt;Let’s dig into what we found and what it means for teams building production AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark: Postgres vs. Qdrant on 50M Embeddings
&lt;/h2&gt;

&lt;p&gt;We tested Postgres and Qdrant on a level playing field:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50 million embeddings&lt;/strong&gt;, each with 768 dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANN-benchmarks&lt;/strong&gt;, the industry-standard benchmarking tool&lt;/li&gt;
&lt;li&gt;Focused on &lt;strong&gt;approximate nearest neighbor (ANN) search&lt;/strong&gt;, no filtering&lt;/li&gt;
&lt;li&gt;All benchmarks run on identical AWS hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway? Postgres with pgvector and pgvectorscale showed significantly higher throughput while maintaining sub-100 ms latencies. Qdrant performed strongly on tail latencies and index build speed, but Postgres pulled ahead where it matters most for teams scaling to production workloads.&lt;/p&gt;
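&lt;p&gt;For context, the recall that ANN-benchmarks holds fixed while measuring throughput is simply the overlap between the approximate results and the exact nearest neighbors. A minimal sketch:&lt;/p&gt;

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the ANN index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# One query where the index found 9 of the true top-10 neighbors:
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42], list(range(1, 11))))  # 0.9
```

&lt;p&gt;Comparing throughput only at matched recall matters: any index can be fast if it is allowed to return worse answers.&lt;/p&gt;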

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox73b5q8gbq353qnicwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox73b5q8gbq353qnicwc.png" alt="Vector search query throughput at 99 % recall (bar graph). Postgres with pgvector and pgvectorscale processes 471.57 queries per second vs. Qdrant's 41.47." width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the complete results, including detailed performance metrics, graphs, and testing configurations, &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;&lt;u&gt;read the full benchmark blog post&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters: AI Performance Without the Rewrite
&lt;/h2&gt;

&lt;p&gt;These results aren’t just a technical curiosity. They have &lt;strong&gt;real implications&lt;/strong&gt; for how you architect your AI stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade latency:&lt;/strong&gt; Postgres with pgvectorscale delivers the sub-100 ms p99 latencies needed to power real-time or responsive AI applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher concurrency&lt;/strong&gt;: Postgres delivered significantly higher throughput, meaning you can support more simultaneous users without scaling out as aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower complexity&lt;/strong&gt;: You don't need to manage and integrate a separate, specialized vector database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational familiarity&lt;/strong&gt;: You leverage the reliability, tooling, and operational practices you already have with Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL-first development&lt;/strong&gt;: You can filter, join, and integrate vector search naturally with relational data, without learning new APIs or query languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Postgres with pgvector and pgvectorscale gives you the performance of a specialized vector database &lt;em&gt;without&lt;/em&gt; giving up the ecosystem, tooling, and developer experience that make Postgres the world’s most popular database.&lt;/p&gt;

&lt;p&gt;You don’t need to split your stack to do vector search.&lt;/p&gt;

&lt;h2&gt;
  What Makes It Work: Pgvectorscale and StreamingDiskANN
&lt;/h2&gt;

&lt;p&gt;How can Postgres compete with (and outperform) purpose-built vector databases?&lt;/p&gt;

&lt;p&gt;The answer lies in &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt; (part of the &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;pgai&lt;/u&gt;&lt;/a&gt; family), which implements the StreamingDiskANN index (a disk-based ANN algorithm built for scale) for pgvector. Combined with Statistical Binary Quantization (SBQ), &lt;a href="https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data" rel="noopener noreferrer"&gt;&lt;u&gt;it balances memory usage and performance&lt;/u&gt;&lt;/a&gt; better than traditional in-memory HNSW (hierarchical navigable small world) implementations.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can run large-scale vector search on standard cloud hardware.&lt;/li&gt;
&lt;li&gt;You don’t need massive memory footprints or expensive GPU-accelerated nodes.&lt;/li&gt;
&lt;li&gt;Performance holds steady even as your dataset grows to tens or hundreds of millions of vectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All while staying inside Postgres.&lt;/p&gt;
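&lt;p&gt;Getting started is a short sequence of SQL statements, following pgvectorscale's documented setup (the table definition below is illustrative; adjust the embedding dimension to your model):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Enable pgvectorscale; CASCADE pulls in the pgvector extension it depends on.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- Illustrative table with a 768-dimensional embedding column.
CREATE TABLE document_embedding (
    id        BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    contents  TEXT,
    embedding VECTOR(768)
);

-- StreamingDiskANN index for cosine-similarity search.
CREATE INDEX document_embedding_idx
    ON document_embedding
    USING diskann (embedding vector_cosine_ops);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From that point on, nearest-neighbor queries against the indexed column use the index automatically; no application changes beyond ordinary SQL are required.&lt;/p&gt;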

&lt;h2&gt;
  When to Choose Postgres, and When Not To
&lt;/h2&gt;

&lt;p&gt;To be clear: Qdrant is a capable system. It has faster index builds and lower tail latencies. It’s a strong choice if you’re not already using Postgres, or for specific use cases that need native scale-out and purpose-built vector semantics.&lt;/p&gt;

&lt;p&gt;However, for many teams, especially those already invested in Postgres, &lt;strong&gt;it makes no sense to introduce a new database&lt;/strong&gt; just to support vector search.&lt;/p&gt;

&lt;p&gt;If you want high recall, high throughput, and tight integration with your existing stack, Postgres is more than enough.&lt;/p&gt;

&lt;h2&gt;
  Want to Try It?
&lt;/h2&gt;

&lt;p&gt;Pgvector and pgvectorscale are open source and available today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector GitHub&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale GitHub&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Or save time and access both by creating a &lt;a href="https://timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;free Timescale Cloud account&lt;/u&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector search in Postgres isn’t a hack or a workaround. It’s fast, it scales, and it works. If you’re building AI applications in 2025, you don’t have to sacrifice your favorite database to move fast.&lt;/p&gt;

&lt;h2&gt;
  Up Next at Timescale Launch Week
&lt;/h2&gt;

&lt;p&gt;Next up, we’re taking Postgres even further: Learn how to stream external S3 data into Postgres with livesync for S3 and work with S3 data in place using the pgai Vectorizer. Two powerful ways to seamlessly integrate external data from S3 directly into your Postgres workflows!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>postgres</category>
      <category>vectordatabase</category>
      <category>news</category>
    </item>
  </channel>
</rss>
