<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Team Tiger Data</title>
    <description>The latest articles on DEV Community by Team Tiger Data (@tigerdata_dev).</description>
    <link>https://dev.to/tigerdata_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2547418%2F1859b44f-d7f7-47c9-9ca2-082bae60b949.png</url>
      <title>DEV Community: Team Tiger Data</title>
      <link>https://dev.to/tigerdata_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tigerdata_dev"/>
    <language>en</language>
    <item>
      <title>How TimescaleDB Outperforms ClickHouse and MongoDB for LogTide's Observability Platform</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:24:18 +0000</pubDate>
      <link>https://dev.to/tigerdata_dev/how-timescaledb-outperforms-clickhouse-and-mongodb-for-logtides-observability-platform-29gl</link>
      <guid>https://dev.to/tigerdata_dev/how-timescaledb-outperforms-clickhouse-and-mongodb-for-logtides-observability-platform-29gl</guid>
      <description>&lt;p&gt;Giuseppe “Polliog” Pollio started writing code for LogTide in September 2025. By early 2026, the platform was handling five million logs per day for alpha users, compressing 220GB of production data down to 25GB.&lt;/p&gt;

&lt;h2&gt;
  
  
  LogTide
&lt;/h2&gt;

&lt;p&gt;Most log management tools are built for enterprises. Datadog and Splunk price far beyond a small operation's budget, and for developers running a self-hosted stack there is no clear, affordable alternative for log observability.&lt;/p&gt;

&lt;p&gt;LogTide addresses this gap as an open-source log management and SIEM platform built specifically for teams who need serious observability without serious hardware. Sigma rule-based detection, structured log search, alerting, and notifications (the same capabilities that make Datadog and Splunk useful) run in two gigabytes of RAM with LogTide.&lt;/p&gt;

&lt;p&gt;"That's because our target is small agencies and home labs," Giuseppe explains. "I wanted to create an ecosystem with low impact on RAM, something you can host on a really old machine."&lt;/p&gt;

&lt;p&gt;LogTide launched its cloud alpha in early 2026, with around 100 companies stress-testing the platform for free. One of them sends five million logs per day.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;When Giuseppe set out to build LogTide, he targeted home labs and small businesses that cannot afford enterprise infrastructure, let alone enterprise pricing.&lt;/p&gt;

&lt;p&gt;The ELK stack (Elasticsearch, Logstash, Kibana) typically requires multiple nodes and significant RAM. Grafana Loki is lighter but has indexing and query limitations that make full-text log search painful at scale. ClickHouse is fast and compresses well, but it is built for analytics clusters, not Raspberry Pis. Datadog and Splunk simply cost too much.&lt;/p&gt;

&lt;p&gt;LogTide needed a reliable database to underpin its open-source log observability: one that could scale to production without a split architecture or an outsized budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why TimescaleDB
&lt;/h2&gt;

&lt;p&gt;Giuseppe found TimescaleDB while searching for a Postgres-based option that could handle high-volume ingest of event data.&lt;/p&gt;

&lt;p&gt;"There are lots of alternatives, but most are too resource-intensive," Giuseppe explains. "TimescaleDB was a perfect choice."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;There are lots of alternatives, but most are too resource-intensive. TimescaleDB was a perfect choice. - Giuseppe Pollio, Founder, LogTide&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The appeal was both technical and practical. TimescaleDB is Postgres. It uses the same wire protocol, the same SQL syntax, the same tooling, and the same extension ecosystem. For a solo developer building a platform that has to run on minimal hardware, that meant no operational surprises, no vendor-specific APIs, and no migration work if users already had Postgres running. &lt;/p&gt;

&lt;p&gt;“If Postgres can run on your machine, TimescaleDB can run,” notes Giuseppe, “and you can deploy LogTide for inexpensive observability at scale.”&lt;/p&gt;

&lt;h2&gt;
  
  
  The LogTide Stack
&lt;/h2&gt;

&lt;p&gt;LogTide's architecture is simple by design. “Simple architecture means it's easier to manage, easier to maintain,” said Giuseppe.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Simple architecture means it’s easier to manage, easier to maintain. - Giuseppe Pollio&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Logs enter the system from one of three client sources: OpenTelemetry-instrumented services, Fluent Bit agents, or one of LogTide's native SDKs. All three routes converge on a single ingest endpoint, which normalizes format variations (OTEL payloads plus a handful of special-case adapters) so the ingestion path stays unified regardless of how a log was generated.&lt;/p&gt;
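&lt;p&gt;A minimal sketch of what that normalization step might look like. The source detection, field names, and internal record shape here are illustrative assumptions, not LogTide's actual adapter code:&lt;/p&gt;

```typescript
// Normalize an incoming payload from one of the three supported sources
// into a single internal shape (time, level, message). Hypothetical sketch.
function normalize(source: string, payload: any) {
  if (source === 'otel') {
    // OpenTelemetry log record: body + severityText + timeUnixNano (ns)
    return {
      time: new Date(Number(payload.timeUnixNano) / 1e6).toISOString(),
      level: (payload.severityText || 'info').toLowerCase(),
      message: String(payload.body),
    };
  }
  if (source === 'fluentbit') {
    // Fluent Bit ships [unix_seconds, record] pairs
    const [ts, record] = payload;
    return {
      time: new Date(ts * 1000).toISOString(),
      level: record.level || 'info',
      message: record.log || record.message || '',
    };
  }
  // Native SDKs already send the internal shape
  return payload;
}
```

Whatever the real adapters do, the payoff is the same: the queue, the worker, and the Sigma engine only ever see one record shape.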

&lt;p&gt;From the ingest endpoint, log payloads enter a job queue backed by Redis. Redis is optional: if it is not available, the ingestion path routes directly to the worker. The worker is where the platform earns its SIEM designation. It evaluates Sigma rules against incoming logs, generates alerts, dispatches notifications, and runs the full analysis pipeline. &lt;/p&gt;
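&lt;p&gt;The optional-queue design can be sketched as a single routing decision. The function and method names below are hypothetical, chosen only to illustrate the fallback described above:&lt;/p&gt;

```typescript
// If a Redis-backed queue is configured, enqueue the batch so the worker
// can consume it at its own pace; otherwise process inline. Hypothetical
// names -- not LogTide's actual ingestion code.
async function route(batch: any, queue: any, worker: any) {
  if (queue) {
    // Durable path: Redis absorbs ingest bursts
    await queue.enqueue('ingest', batch);
    return 'queued';
  }
  // Fallback path: trade burst absorption for a smaller footprint
  await worker.process(batch);
  return 'direct';
}
```

Making Redis optional is what keeps the two-gigabyte target honest: the smallest deployments skip the queue entirely.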

&lt;p&gt;After processing, logs pass through what Giuseppe calls the LogTide Reservoir: a storage abstraction layer that keeps the backend pluggable. In practice, only one backend is truly necessary.&lt;/p&gt;

&lt;p&gt;"TimescaleDB is our unique persistent database," Giuseppe explains. "All the aggregation that populates our dashboards is powered by TimescaleDB."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;TimescaleDB is our unique persistent database. All the aggregation that populates our dashboards is powered by TimescaleDB. - Giuseppe Pollio&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Inside TimescaleDB, LogTide maintains three hypertable families: raw logs, distributed traces (spans), and detection events. Retention policies run automatically with no manual intervention or cron jobs. Continuous aggregates sit on top of the raw log hypertable and are what make the platform fast at scale.&lt;/p&gt;
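&lt;p&gt;The primitives behind that setup are standard TimescaleDB calls. A minimal sketch of the raw-log hypertable family (table and column names here are illustrative, not LogTide's actual schema):&lt;/p&gt;

```sql
-- Turn the raw logs table into a hypertable partitioned by time
SELECT create_hypertable('logs', 'time');

-- Compress older chunks; segmenting by project keeps per-project scans cheap
ALTER TABLE logs SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'project_id'
);
SELECT add_compression_policy('logs', INTERVAL '7 days');

-- Background retention: old chunks are dropped automatically, no cron needed
SELECT add_retention_policy('logs', INTERVAL '30 days');

-- Continuous aggregate: pre-rolled hourly counts to power dashboards
CREATE MATERIALIZED VIEW logs_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       project_id,
       level,
       count(*) AS log_count
FROM logs
GROUP BY bucket, project_id, level;
```

LogTide layers its own per-organization retention logic on top of these primitives, as the service code below shows.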

&lt;p&gt;From &lt;code&gt;packages/backend/src/modules/retention/service.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Execute retention cleanup for all organizations.
 *
 * Strategy (scales with number of distinct retention values, not orgs):
 * 1. drop_chunks for max retention — instant, drops entire files
 * 2. Group orgs by retention_days, collect all project_ids per group
 * 3. For each group with retention &amp;lt; max: batch-delete their logs
 */&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;executeRetentionForAllOrganizations&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;RetentionExecutionSummary&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;logging&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;isInternalLoggingEnabled&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Get all organizations with their retention + projects&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;organizations&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;retention_days&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;orgProjects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;projects&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;organization_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Build org -&amp;gt; projectIds map&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;projectsByOrg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;orgProjects&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;projectsByOrg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;organization_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;projectsByOrg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;organization_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Find max retention (used for drop_chunks)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxRetention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;retention_days&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;maxCutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;maxRetention&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 1: drop_chunks older than max retention (TimescaleDB only — instant, no decompression)&lt;/span&gt;
  &lt;span class="c1"&gt;// For ClickHouse, TTL policies handle this natively or deleteByTimeRange in step 3&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;chunksDropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getEngineType&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timescale&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dropResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="s2"&gt;`
        SELECT drop_chunks('logs', older_than =&amp;gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;maxCutoff&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;::timestamptz)
      `&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;chunksDropped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dropResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

      &lt;span class="cm"&gt;/* v8 ignore next 6 -- telemetry, disabled in tests */&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunksDropped&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`Dropped &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;chunksDropped&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; chunks older than &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;maxRetention&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; days`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;maxRetentionDays&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;maxRetention&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;cutoffDate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;maxCutoff&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
          &lt;span class="nx"&gt;chunksDropped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// drop_chunks may fail if no chunks to drop — that's fine&lt;/span&gt;
      &lt;span class="cm"&gt;/* v8 ignore next 4 -- telemetry, disabled in tests */&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;captureLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;debug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`drop_chunks: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 2: Group orgs by retention_days (only those with retention &amp;lt; max need per-row deletes)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;retentionGroups&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;retention_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;maxRetention&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// already handled by drop_chunks&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;retentionGroups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;retention_days&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="na"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;orgProjectIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;projectsByOrg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;orgProjectIds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;retentionGroups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;retention_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 3: Batch-delete per retention group&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RetentionExecutionResult&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;totalDeleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;failedCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;retentionDays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;retentionGroups&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
          &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;organizationName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;retentionDays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;logsDeleted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;groupStart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cutoffDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;retentionDays&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;oldestResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cutoffDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;sortOrder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;asc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;oldestResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;organizationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;organizationName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nx"&gt;retentionDays&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;logsDeleted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;executionTimeMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;groupStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batchDeleteLogs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;cutoffDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;oldestResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;totalDeleted&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;deleted&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;failedCount&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orgs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"The aggregates are necessary," said Giuseppe. "If you have five million, ten million logs every day, and you need to see how many logs you received every hour, you can't run that query on 10 million logs. The aggregates give you query results in milliseconds instead of 30 or 40 seconds."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous aggregate definition&lt;/strong&gt;, from &lt;code&gt;packages/backend/migrations/004_performance_optimization.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;logs_hourly_stats&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;time_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;log_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;NO&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Refreshes automatically every hour&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;add_continuous_aggregate_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logs_hourly_stats'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;start_offset&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'3 hours'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;end_offset&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;schedule_interval&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;if_not_exists&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;idx_logs_hourly_stats_project_bucket&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;logs_hourly_stats&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hybrid query at runtime&lt;/strong&gt;, from &lt;code&gt;packages/backend/src/modules/dashboard/service.ts&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;todayAggregateStats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentTotal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentErrors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recentServices&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;yesterdayAggregateStats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;prevHourCount&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="c1"&gt;// Today's historical stats from aggregate (today start to 1 hour ago)&lt;/span&gt;
  &lt;span class="nx"&gt;db&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logs_hourly_stats&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COALESCE(SUM(log_count), 0)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;total&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COALESCE(SUM(log_count) FILTER (WHERE level IN ('error', 'critical')), 0)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;errors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COUNT(DISTINCT service)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;services&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;todayStart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lastHourStart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeTakeFirst&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;

  &lt;span class="c1"&gt;// Recent stats from reservoir (last hour)&lt;/span&gt;
  &lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastHourStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastHourStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;critical&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;service&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastHourStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;

  &lt;span class="c1"&gt;// Yesterday's stats from aggregate&lt;/span&gt;
  &lt;span class="nx"&gt;db&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectFrom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;logs_hourly_stats&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COALESCE(SUM(log_count), 0)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;total&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COALESCE(SUM(log_count) FILTER (WHERE level IN ('error', 'critical')), 0)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;errors&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`COUNT(DISTINCT service)`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;services&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;project_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;in&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;yesterdayStart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;&amp;lt;&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;todayStart&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;executeTakeFirst&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;

  &lt;span class="c1"&gt;// Previous hour from reservoir (for throughput trend)&lt;/span&gt;
  &lt;span class="nx"&gt;reservoir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectIds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prevHourStart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastHourStart&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgarohwfxrg7oerz5xvo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgarohwfxrg7oerz5xvo4.png" alt="LogTide's architecture. Logs flow from client SDKs and agents through a single ingest endpoint, into a processing worker, and into TimescaleDB hypertables via the LogTide Reservoir storage abstraction." width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;LogTide's architecture. Logs flow from client SDKs and agents through a single ingest endpoint, into a processing worker, and into TimescaleDB hypertables via the LogTide Reservoir storage abstraction.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What We've Seen
&lt;/h2&gt;
&lt;h3&gt;
  
  
  220GB Down to 25GB
&lt;/h3&gt;

&lt;p&gt;In production, LogTide's TimescaleDB deployment compressed 220GB of raw log data, 135GB of row data plus 85GB of indexes, down to 25GB. That is an 88.6% reduction, achieved using TimescaleDB's native columnar compression with a segmentby configuration on &lt;code&gt;project_id&lt;/code&gt;, ordered by timestamp descending. Chunks older than seven days compress automatically.&lt;/p&gt;

&lt;p&gt;From &lt;code&gt;packages/backend/migrations/001_initial_schema.sql&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Enable compression on logs hypertable&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_segmentby&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'project_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;timescaledb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compress_orderby&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'time DESC'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Add compression policy for logs (compress chunks older than 7 days)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;add_compression_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_not_exists&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Global retention safety net&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;add_retention_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'logs'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'90 days'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_not_exists&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query performance did not degrade. Time-range filtering got 33% faster after compression. Aggregations got 41% faster. Only full-text search slowed slightly, by about 12%, because columnar storage requires scanning additional columns to reconstruct text fields. For a log management platform where engineers are far more likely to query a time window than to search a raw string, the tradeoff strongly favors compression.&lt;/p&gt;
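&lt;p&gt;The winning pattern is the time-window aggregation. A hypothetical query of that shape (illustrative only, not taken from LogTide's codebase): because compressed chunks are segmented by &lt;code&gt;project_id&lt;/code&gt; and ordered by &lt;code&gt;time&lt;/code&gt;, TimescaleDB can skip whole segments that don't match the filter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Count logs per 5-minute bucket for one project over the last 30 days.
-- The project_id and time predicates line up with the compression settings,
-- so only matching segments are decompressed.
SELECT time_bucket('5 minutes', time) AS bucket,
       COUNT(*) AS log_count
FROM logs
WHERE project_id = $1
  AND time &amp;gt;= now() - INTERVAL '30 days'
GROUP BY bucket
ORDER BY bucket;
&lt;/code&gt;&lt;/pre&gt;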

&lt;p&gt;In practice, 30 million logs fit in 15GB on a single 4-vCPU, 8GB RAM node, with a P95 query latency of 50ms. Learn more in Giuseppe’s &lt;a href="https://dev.to/polliog/timescaledb-compression-from-150gb-to-15gb-90-reduction-real-production-data-bnj"&gt;&lt;u&gt;dev.to post on TimescaleDB compression&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  TimescaleDB Bested MongoDB and ClickHouse in Head-to-Head Performance Benchmarks
&lt;/h3&gt;

&lt;p&gt;Giuseppe built an open benchmark suite and ran it across 1K to 1M records, as outlined in his &lt;a href="https://builder.aws.com/content/3Aoryr85VEVzFKrFjDmzXpwRLkU/i-benchmarked-timescaledb-vs-clickhouse-vs-mongodb-for-observability-data" rel="noopener noreferrer"&gt;&lt;u&gt;AWS Builder Center article benchmarking ClickHouse and MongoDB vs TimescaleDB&lt;/u&gt;&lt;/a&gt;. The ingestion story is straightforward: at batch sizes typical of real-world observability (100 events per call), TimescaleDB handles 14,200 inserts per second. ClickHouse handles 250 inserts per second at the same batch size. The gap exists because ClickHouse buffers small writes and flushes on a 400ms timer, the right design for bulk analytics, the wrong design when a dozen microservices are logging in real time.&lt;/p&gt;

&lt;p&gt;The query results are the main story. At 100,000 log records, TimescaleDB answers a filtered service query in 0.47ms. MongoDB answers the same query in 304ms, a 650x difference. Under 50 concurrent queries, TimescaleDB holds at 6.2ms whether the dataset is 1,000 or 1,000,000 records. The mechanism is hypertable partitioning: queries filter by time range and service, TimescaleDB routes them to the active chunk instead of scanning the full table, and continuous aggregates make count and dashboard queries nearly free because the work is already done at write time.&lt;/p&gt;
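&lt;p&gt;The chunk routing described above depends on the &lt;code&gt;logs&lt;/code&gt; table being a TimescaleDB hypertable partitioned on its time column. A minimal sketch of that setup (the one-day chunk interval is an illustrative assumption, not taken from LogTide's schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Convert the plain logs table into a hypertable partitioned by time.
-- Queries with a time predicate are routed to the matching chunks only,
-- instead of scanning the full table.
SELECT create_hypertable('logs', 'time', chunk_time_interval =&amp;gt; INTERVAL '1 day');
&lt;/code&gt;&lt;/pre&gt;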

&lt;h3&gt;
  
  
  A 2GB RAM Requirement Keeps Operations Lean
&lt;/h3&gt;

&lt;p&gt;The most important number is not the compression ratio or the write throughput. It is the 2GB RAM figure that defines where LogTide can actually run.&lt;/p&gt;

&lt;p&gt;"If you have log management that can work with 2GB of RAM, it's really magic," Giuseppe says. "Because you can't do that with Datadog or Splunk or the other self-hosted programs and containers."&lt;/p&gt;


&lt;p&gt;That 2GB ceiling is what makes LogTide viable for home labs running on a NAS, small businesses on shared hosting, or a developer who wants to know when their Raspberry Pi's services throw errors. The entire LogTide platform, including API, worker, dashboard, and TimescaleDB storage, runs on the same hardware that already runs Postgres.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;The LogTide Cloud Platform alpha prototype is now open to trial users.  Meanwhile, LogTide’s open-source project is growing fast. Hundreds of GitHub stars and 1k+ clones per day signal a developer community that has found the project and is actively building with it. The next phase is expanding SDK coverage and continuing to stress-test the storage layer. TimescaleDB runs anywhere Postgres runs. The goal is to make sure LogTide does too.&lt;/p&gt;

</description>
      <category>devqa</category>
      <category>timescaledb</category>
      <category>clickhouse</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>pg_textsearch 1.0: How We Built a BM25 Search Engine on Postgres Pages</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:09:03 +0000</pubDate>
      <link>https://dev.to/tigerdata/pgtextsearch-10-how-we-built-a-bm25-search-engine-on-postgres-pages-42cc</link>
      <guid>https://dev.to/tigerdata/pgtextsearch-10-how-we-built-a-bm25-search-engine-on-postgres-pages-42cc</guid>
      <description>&lt;p&gt;&lt;em&gt;Design, implementation, and benchmarks of a native BM25 index for Postgres. Now generally available to all&lt;/em&gt; &lt;a href="https://www.tigerdata.com/cloud" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;Tiger Cloud&lt;/u&gt;&lt;/em&gt;&lt;/a&gt; &lt;em&gt;customers and freely available via open source.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you have used Postgres's built-in ts_rank for full-text search at any meaningful scale, you already know the limitations. Ranking quality degrades as your corpus grows. There is no inverse document frequency, so common words carry the same weight as rare ones. There is no term frequency saturation, so a document that mentions "database" 50 times outranks one that mentions it once. There is no efficient top-k path: scoring requires touching every matching row.&lt;/p&gt;

&lt;p&gt;Most teams work around this by bolting on Elasticsearch or Typesense as a sidecar. That works, but now you are syncing data between two systems, operating two clusters, and debugging consistency issues when they diverge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tigerdata.com/docs/use-timescale/latest/extensions/pg-textsearch" rel="noopener noreferrer"&gt;&lt;u&gt;pg_textsearch&lt;/u&gt;&lt;/a&gt; takes a different approach: real BM25 scoring, built from scratch in C on top of Postgres's own storage layer. You create an index, write a query, and get results ranked by relevance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database ranking'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database ranking'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;lt;@&amp;gt;&lt;/code&gt; operator returns a BM25 relevance score. Scores are negated so that Postgres's default ascending ORDER BY returns the most relevant results first. The index is stored entirely in standard Postgres pages managed by the buffer cache. It participates in WAL, works with pg_dump and streaming replication, and requires no external storage or special backup procedures.&lt;/p&gt;
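&lt;p&gt;Since the operator output is negated so that ascending &lt;code&gt;ORDER BY&lt;/code&gt; puts the best match first, a query that wants to display a conventional positive score can flip the sign in the select list. A small sketch reusing the table from the example above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Report a positive BM25 score while keeping the index-friendly ORDER BY.
SELECT title, -(content &amp;lt;@&amp;gt; 'database ranking') AS bm25_score
FROM articles
ORDER BY content &amp;lt;@&amp;gt; 'database ranking'
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;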

&lt;p&gt;&lt;strong&gt;From preview to production.&lt;/strong&gt; In October 2025, we released a preview that held the entire inverted index in shared memory, rebuilt from the heap on restart (preview blog). In the five months and 180+ commits since, the extension has been substantially rewritten. What shipped in 1.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk-based segments replaced the memory-only architecture&lt;/li&gt;
&lt;li&gt;Block-Max WAND + WAND optimization for fast top-k queries&lt;/li&gt;
&lt;li&gt;Posting list compression with SIMD-accelerated decoding (41% smaller indexes)&lt;/li&gt;
&lt;li&gt;Parallel index builds (138M documents in under 18 minutes)&lt;/li&gt;
&lt;li&gt;2.4x to 6.5x faster than ParadeDB/Tantivy for 2-4 term queries at 138M scale&lt;/li&gt;
&lt;li&gt;8.7x higher concurrent throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post covers the architecture, query optimization strategy, and benchmark results. We include a candid discussion of where ParadeDB is faster and a full accounting of current limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: Why BM25 in Postgres?
&lt;/h2&gt;

&lt;p&gt;Postgres ships &lt;code&gt;tsvector/tsquery&lt;/code&gt; with &lt;code&gt;ts_rank&lt;/code&gt; for full-text ranking. &lt;code&gt;ts_rank&lt;/code&gt; uses an ad-hoc scoring function that lacks the three properties that make BM25 effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inverse document frequency (IDF):&lt;/strong&gt; downweights common terms so that rarer, more informative terms drive the ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Term frequency saturation:&lt;/strong&gt; prevents a document from scoring arbitrarily high by repeating a term many times. A document mentioning "database" 50 times is not 50 times more relevant than one mentioning it once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document length normalization:&lt;/strong&gt; accounts for the fact that a term match in a short document is more informative than the same match in a long one [1].&lt;/li&gt;
&lt;/ul&gt;
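
&lt;p&gt;All three properties fall directly out of the BM25 formula. A minimal Python sketch with the standard parameters (k1 = 1.2, b = 0.75); the function names are ours for illustration, not the extension's internals:&lt;/p&gt;

```python
import math

K1, B = 1.2, 0.75  # standard BM25 defaults

def idf(n_docs, doc_freq):
    # Inverse document frequency: rare terms get large weights,
    # terms appearing in most documents get weights near zero.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq):
    # The tf factor saturates: tf * (k1 + 1) / (tf + k1 * norm) flattens as tf grows.
    # The norm factor penalizes documents longer than the corpus average.
    norm = 1 - B + B * (doc_len / avg_doc_len)
    return idf(n_docs, doc_freq) * tf * (K1 + 1) / (tf + K1 * norm)

# Saturation in action: 50 mentions score roughly 2x one mention, not 50x.
one  = bm25_term_score(tf=1,  doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=1000)
many = bm25_term_score(tf=50, doc_len=100, avg_doc_len=100, n_docs=1_000_000, doc_freq=1000)
```

&lt;p&gt;With the same inputs, a shorter document also outscores a longer one for an identical term frequency, which is the length-normalization property.&lt;/p&gt;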

&lt;p&gt;For applications where ranking quality matters (RAG pipelines, search-driven UIs, hybrid retrieval), this is a material limitation. At scale, &lt;code&gt;ts_rank&lt;/code&gt; also has no top-k optimization path: ranking by relevance requires scoring every matching row.&lt;/p&gt;

&lt;p&gt;The primary existing BM25 extension for Postgres is ParadeDB/pg_search, which wraps the Tantivy search library written in Rust. Early versions stored the index in auxiliary files outside the WAL; current versions use Postgres pages.&lt;/p&gt;

&lt;p&gt;pg_textsearch takes a different approach: rather than wrapping an external search library, the entire search engine (tokenization, compression, query optimization) is built from scratch in C on top of Postgres's storage layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex8hr08ubhffvj31eb79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex8hr08ubhffvj31eb79.png" alt="Fig. 1: pg_textsearch Architecture diagram" width="800" height="1249"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 1: pg_textsearch Architecture diagram&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The hybrid memtable + segment design
&lt;/h3&gt;

&lt;p&gt;pg_textsearch uses an LSM-tree-inspired architecture [4]. Incoming writes go to an in-memory inverted index (the memtable), which periodically spills to immutable on-disk segments. Segments compact in levels: when a level accumulates enough segments (default 8), they merge into the next level. Fewer segments means fewer posting lists to consult per query term, which directly reduces query latency. This is the same write-optimized-memtable / read-optimized-segment pattern used in RocksDB [5] and other LSM-based engines, adapted here for Postgres's page-based storage.&lt;/p&gt;
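
&lt;p&gt;The leveling policy can be sketched in a few lines. A toy Python model (not the extension's C code) where a segment is just a sorted list of doc ids and a full level merges into the next one down:&lt;/p&gt;

```python
import heapq

FANOUT = 8  # default: 8 segments at a level trigger a merge into the next level

def insert_segment(levels, segment, fanout=FANOUT):
    """levels: list of lists of segments; each segment is a sorted list of doc ids."""
    levels[0].append(segment)
    lvl = 0
    while len(levels[lvl]) >= fanout:
        # Merge every segment at this level into one larger segment one level down.
        merged = list(heapq.merge(*levels[lvl]))
        levels[lvl] = []
        if lvl + 1 == len(levels):
            levels.append([])
        levels[lvl + 1].append(merged)
        lvl += 1

levels = [[]]
for i in range(64):     # 64 Level-0 spills -> 8 Level-1 merges -> 1 Level-2 segment
    insert_segment(levels, [i])
```

&lt;p&gt;After 64 single-document spills, everything lives in one fully merged Level-2 segment, which is why query-side term lookups touch few posting lists.&lt;/p&gt;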
&lt;h3&gt;
  
  
  &lt;strong&gt;The write path: memtable&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The memtable lives in Postgres shared memory, one per index, accessible to all backends. It contains a string-interning hash table that stores each unique term exactly once; per-term posting lists recording document IDs and term frequencies; and corpus statistics (document count and average document length) maintained incrementally so that BM25 scores can be computed without a separate pass over the index.&lt;/p&gt;

&lt;p&gt;When the memtable exceeds a configurable threshold (default: 32M posting entries), it spills to a Level-0 disk segment at transaction commit. A secondary trigger (default: 100K unique terms per transaction) handles large single-transaction loads like bulk imports.&lt;/p&gt;
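
&lt;p&gt;As an illustration only (Python with an in-process dict and a tiny threshold, rather than shared-memory C), the memtable's bookkeeping looks roughly like:&lt;/p&gt;

```python
class Memtable:
    """Toy model: interned terms -> posting lists, plus incremental corpus stats."""
    def __init__(self, spill_threshold=4):
        self.postings = {}             # term -> [(doc_id, term_freq), ...]
        self.n_docs = 0
        self.total_len = 0             # for average document length
        self.n_entries = 0
        self.spill_threshold = spill_threshold

    @property
    def avg_doc_len(self):
        return self.total_len / self.n_docs if self.n_docs else 0.0

    def insert(self, doc_id, tokens):
        self.n_docs += 1
        self.total_len += len(tokens)
        freqs = {}
        for t in tokens:
            freqs[t] = freqs.get(t, 0) + 1
        for term, tf in freqs.items():
            self.postings.setdefault(term, []).append((doc_id, tf))
            self.n_entries += 1
        return self.n_entries >= self.spill_threshold   # caller spills at commit

    def spill(self):
        # Produce an immutable Level-0 segment: terms sorted for binary search.
        segment = {t: list(pl) for t, pl in sorted(self.postings.items())}
        self.postings.clear()
        self.n_entries = 0
        return segment

mt = Memtable(spill_threshold=4)
mt.insert(1, ["big", "data"])
full = mt.insert(2, ["big", "fast", "data"])   # 5 posting entries >= threshold
seg = mt.spill()
```

&lt;p&gt;Note that corpus statistics (document count, average length) survive the spill, so BM25 scoring never needs a separate pass over the index.&lt;/p&gt;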

&lt;p&gt;The memtable is rebuilt from the heap on startup. Since the heap is WAL-logged, no data is lost if Postgres crashes before a spill completes. This is analogous to how a write-ahead log protects an LSM memtable, except here the WAL is Postgres's own. The rebuild cost is proportional to the amount of data not yet spilled to segments; for indexes where most data has been spilled, startup is fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opgjv8tk3srcg31n64y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3opgjv8tk3srcg31n64y.png" alt="Fig. 2: pg_textsearch memtable write path" width="800" height="923"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 2: pg_textsearch memtable write path&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The read path: segments
&lt;/h3&gt;

&lt;p&gt;Segments are immutable and stored in standard Postgres pages. Each segment contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A term dictionary:&lt;/strong&gt; a sorted array of offsets into a string pool, binary-searchable for O(log n) term lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Posting blocks&lt;/strong&gt; of up to 128 documents each, containing delta-encoded doc IDs, packed term frequencies, and quantized document lengths (fieldnorms). A separate skip index stores one entry per posting block with upper-bound score metadata used by Block-Max WAND optimization (described below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A fieldnorm table&lt;/strong&gt; mapping document lengths to 1-byte quantized values using Lucene/Tantivy's SmallFloat encoding [6]. This encoding is exact for lengths 0-39 (covering most short documents); for longer documents, quantization error increases from ~5% to ~11%. In practice, the impact on ranking is smaller than these numbers suggest: BM25 scores depend on the ratio of document length to average document length, which dampens quantization error, and the b parameter (default 0.75) further reduces length's influence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A doc ID to CTID mapping&lt;/strong&gt; that translates internal document IDs to Postgres tuple identifiers for heap fetches.&lt;/li&gt;
&lt;/ul&gt;
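
&lt;p&gt;The fieldnorm encoding is short enough to transcribe. Here is our Python rendering of the 4-bit-mantissa SmallFloat scheme [6] (Lucene implements it in Java; the extension in C; function names are ours):&lt;/p&gt;

```python
def long_to_int4(i: int) -> int:
    """Lossy, monotonic encode with a 4-bit mantissa (after Lucene's SmallFloat)."""
    num_bits = i.bit_length()
    if num_bits < 4:
        return i                              # small values stored exactly
    shift = num_bits - 4
    encoded = (i >> shift) & 0x07             # keep 3 bits; the leading 1 is implicit
    return encoded | ((shift + 1) << 3)       # shift+1: 0 is reserved for exact values

def int4_to_long(e: int) -> int:
    bits, shift = e & 0x07, (e >> 3) - 1
    return bits if shift == -1 else (bits | 0x08) << shift

MAX_INT4 = long_to_int4(2**31 - 1)
NUM_FREE = 255 - MAX_INT4                     # leftover byte values encode 0..23 exactly

def int_to_byte4(i: int) -> int:
    """Document length -> 1 byte. Exact for 0..39; bounded error for long documents."""
    return i if i < NUM_FREE else NUM_FREE + long_to_int4(i - NUM_FREE)

def byte4_to_int(b: int) -> int:
    return b if b < NUM_FREE else NUM_FREE + int4_to_long(b - NUM_FREE)
```

&lt;p&gt;The encoding is monotonic, so a longer document never maps to a smaller byte, which is what the Block-Max WAND bounds below rely on.&lt;/p&gt;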

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoua6q56wmqbqx7knt5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmoua6q56wmqbqx7knt5f.png" alt="Fig. 3: pg_textsearch segment internal structure" width="800" height="1304"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 3: pg_textsearch segment internal structure&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Minimizing page access
&lt;/h3&gt;

&lt;p&gt;Storing data in Postgres pages means every access goes through the buffer manager. Even for pages already in cache, each access involves a buffer table lookup, pin acquisition, and lock handling. That overhead adds up in a scoring loop processing millions of postings. This constraint shaped several design decisions.&lt;/p&gt;

&lt;p&gt;Each segment assigns compact 4-byte, segment-local document IDs (0 to N-1), which map to Postgres's 6-byte CTIDs (heap tuple identifiers). After collecting all documents for a segment, doc IDs are reassigned so that doc_id order matches CTID order. Sequential iteration through posting lists then produces sequential access to the CTID mapping, maximizing cache locality. CTIDs themselves are stored as two separate arrays (4-byte page numbers and 2-byte offsets) rather than interleaved 6-byte records, doubling cache line utilization.&lt;/p&gt;
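
&lt;p&gt;The reassignment and the split-array layout are simple to state. A Python illustration (the real code operates on C arrays inside index pages; names are ours):&lt;/p&gt;

```python
def assign_doc_ids(ctids):
    """Reassign doc ids so that doc_id order matches heap (CTID) order, and store
    CTIDs as two parallel arrays rather than interleaved 6-byte records."""
    order = sorted(range(len(ctids)), key=lambda i: ctids[i])
    pages   = [ctids[i][0] for i in order]                 # 4-byte block numbers
    offsets = [ctids[i][1] for i in order]                 # 2-byte line pointers
    remap   = {old: new for new, old in enumerate(order)}  # old doc id -> new doc id
    return pages, offsets, remap

ctids = [(7, 2), (1, 5), (1, 1), (3, 9)]   # (page, offset) per original doc id
pages, offsets, remap = assign_doc_ids(ctids)
```

&lt;p&gt;After the remap, walking doc ids 0..N-1 walks the two CTID arrays front to back, so the final top-k resolution step is a handful of sequential reads.&lt;/p&gt;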

&lt;p&gt;The scoring loop works entirely with doc IDs, term frequencies, and fieldnorms. It never touches the CTID arrays. CTIDs are resolved only for the final top-k results in a single batched pass. A top-10 query that scores thousands of candidates resolves ten CTIDs, not thousands.&lt;/p&gt;
&lt;h3&gt;
  
  
  Postgres integration
&lt;/h3&gt;

&lt;p&gt;Because the index is stored in standard buffer-managed pages, pg_textsearch participates in Postgres infrastructure without special handling: MVCC visibility, proper rollback on abort, WAL and physical replication, &lt;code&gt;pg_dump / pg_upgrade&lt;/code&gt;, VACUUM with correct dead-entry removal, and planner hooks that detect the &lt;code&gt;&amp;lt;@&amp;gt;&lt;/code&gt; operator and select index scans automatically. Logical replication works in the usual way: row changes are replicated and the index is rebuilt on the subscriber.&lt;/p&gt;
&lt;h2&gt;
  
  
  Query Optimization: Block-Max WAND
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The top-k problem
&lt;/h3&gt;

&lt;p&gt;Naive BM25 evaluation scores every document matching any query term. For a 3-term query on MS-MARCO v2 (138M documents), this means decoding and scoring posting lists with tens of millions of entries. Most applications need only the top 10 or 100 results. The challenge is finding them without scoring everything.&lt;/p&gt;
&lt;h3&gt;
  
  
  Block-Max WAND
&lt;/h3&gt;

&lt;p&gt;pg_textsearch implements Block-Max WAND (BMW) [2], which uses block-level upper bounds to skip non-contributing posting blocks during top-k evaluation. Lucene adopted a similar approach in version 8.0 [7]. The core idea: maintain the score of the k-th best result seen so far as a threshold, and skip any posting block whose upper-bound score cannot exceed it.&lt;/p&gt;

&lt;p&gt;Each 128-document posting block has a corresponding skip entry storing the maximum term frequency in the block and the minimum fieldnorm (the shortest document, which would score highest for a given term frequency). From these two values, BMW can compute a tight upper bound on the block's BM25 contribution without decompressing it. If the upper bound falls below the current threshold, the entire block (all 128 documents) is skipped.&lt;/p&gt;

&lt;p&gt;To illustrate: consider a single-term top-10 query on a large corpus. After scanning a few thousand postings, the algorithm has accumulated 10 results with a minimum score of, say, 12.3. It now encounters a block where the upper-bound BM25 score (computed from the block's stored metadata) is 9.1. Since 9.1 &amp;lt; 12.3, no document in this block can enter the top 10, and the entire block is skipped without decompression. For short queries on large corpora, the vast majority of blocks are skipped this way.&lt;/p&gt;
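
&lt;p&gt;That single-term skip logic can be sketched in a few dozen lines of Python (a toy model of the description above, not the extension's C implementation; the block tuple layout is our invention):&lt;/p&gt;

```python
import heapq

K1, B = 1.2, 0.75

def bm25(tf, doc_len, avg_doc_len, idf):
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_doc_len))

def topk_single_term(blocks, k, avg_doc_len, idf):
    """blocks: [(max_tf, min_doc_len, postings)] with postings = [(doc_id, tf, doc_len)].
    max_tf and min_doc_len play the role of the skip-index entry."""
    heap = []       # min-heap of (score, doc_id); heap[0] is the current k-th best
    skipped = 0
    for max_tf, min_doc_len, postings in blocks:
        # Upper bound computed from block metadata alone -- no decompression.
        upper = bm25(max_tf, min_doc_len, avg_doc_len, idf)
        if len(heap) == k and upper <= heap[0][0]:
            skipped += 1          # no document in this block can enter the top k
            continue
        for doc_id, tf, doc_len in postings:
            score = bm25(tf, doc_len, avg_doc_len, idf)
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), skipped

blocks = [
    (5, 50, [(0, 5, 50), (1, 2, 120), (2, 1, 80)]),
    (1, 90, [(3, 1, 90), (4, 1, 200)]),    # low upper bound: skipped undecoded
    (8, 40, [(5, 8, 40), (6, 3, 60)]),
]
top, skipped = topk_single_term(blocks, k=2, avg_doc_len=100, idf=2.0)
```

&lt;p&gt;With these inputs the middle block's upper bound falls below the running threshold and is skipped without touching its postings.&lt;/p&gt;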

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjzcaaou8sgoxsmo0q3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjzcaaou8sgoxsmo0q3b.png" alt="Fig. 4: pg_textsearch Block-Max WAND visualization" width="800" height="591"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 4: pg_textsearch Block-Max WAND visualization&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  WAND pivot selection
&lt;/h3&gt;

&lt;p&gt;For multi-term queries, pg_textsearch adds the WAND algorithm [3] for cross-term skipping. Terms are ordered by their current document ID, and the algorithm identifies a pivot term: the first term whose cumulative maximum score exceeds the current threshold. All terms before the pivot advance to at least the pivot's current doc ID, skipping entire ranges of documents across multiple posting lists simultaneously, before block-level BMW bounds are even checked. For multi-term queries, BMW compares the sum of per-term block upper bounds against the threshold, extending the single-term logic described above.&lt;/p&gt;

&lt;p&gt;The combination of WAND (cross-term skipping) and BMW (within-list block skipping) is most effective for short queries (1-4 terms), which account for the majority of real-world search traffic. In the full MS-MARCO v1 query set (1,010,916 queries from Bing), 72.6% have 2-4 lexemes after English stemming and stopword removal, with a mean of 3.7 and a mode of 3. The speedup narrows for longer queries, where more blocks contain at least one term with a potentially high-scoring document. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.&lt;/p&gt;
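
&lt;p&gt;Pivot selection itself is compact. A hedged Python sketch (toy model with per-term cursors, not the extension's C code):&lt;/p&gt;

```python
def find_pivot(cursors, threshold):
    """cursors: per-term (current_doc_id, max_term_score) pairs.
    Returns the first doc id at which the accumulated best-case score could
    beat the threshold, or None when no remaining document can qualify."""
    cumulative = 0.0
    for doc_id, max_score in sorted(cursors):   # order terms by current doc id
        cumulative += max_score
        if cumulative > threshold:
            return doc_id   # all term cursors before this one advance to >= doc_id
    return None
```

&lt;p&gt;Every term cursor positioned before the pivot can jump straight to the pivot's doc id, skipping ranges of documents in several posting lists at once.&lt;/p&gt;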
&lt;h2&gt;
  
  
  Compression and Storage
&lt;/h2&gt;

&lt;p&gt;Posting blocks use a compression scheme designed for fast random-access decoding. Doc IDs are delta-encoded (storing differences between consecutive IDs rather than absolute values), then packed with variable-width bitpacking: the maximum delta in the block determines the bit width, and all deltas use that width. Term frequencies are packed separately with their own bit width. Fieldnorms are the 1-byte SmallFloat values described above.&lt;/p&gt;
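
&lt;p&gt;A minimal Python model of that block format (real blocks use fixed 128-entry layouts and word-aligned packing; this sketch keeps only the delta-plus-uniform-bit-width idea, and the names are ours):&lt;/p&gt;

```python
def pack_doc_ids(doc_ids):
    """Delta-encode a sorted block of doc ids against the first id, then pack
    every delta at the bit width of the largest delta in the block."""
    base = doc_ids[0]
    deltas = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    width = max((d.bit_length() for d in deltas), default=1)
    packed = 0
    for slot, d in enumerate(deltas):
        packed |= d << (slot * width)
    return base, width, packed, len(doc_ids)

def unpack_doc_ids(base, width, packed, count):
    mask = (1 << width) - 1
    out, acc = [base], base
    for slot in range(count - 1):
        acc += (packed >> (slot * width)) & mask   # undo the delta encoding
        out.append(acc)
    return out
```

&lt;p&gt;Dense runs of doc ids produce small deltas and therefore a small width, which is where most of the 41% size reduction comes from.&lt;/p&gt;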

&lt;p&gt;The bitpack decode path uses branchless direct-indexed uint64 loads rather than a byte-at-a-time accumulator, eliminating branch misprediction in the inner decode loop. Where available, SIMD intrinsics (SSE2 on x86-64, NEON on ARM64) accelerate the mask-and-store step. A scalar fallback handles other platforms.&lt;/p&gt;

&lt;p&gt;Compression reduces index size by 41% compared to uncompressed storage. Decode overhead is approximately 6% of query time (measured by profiling), which is more than offset by reduced buffer cache pressure. The scheme prioritizes decode speed over compression ratio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on index size comparisons:&lt;/strong&gt; pg_textsearch does not store term positions, so it cannot support phrase queries natively (see Limitations). This makes its indexes inherently smaller than engines like Tantivy that store positions by default. The 19-26% size advantage reported in our benchmarks reflects both compression and this feature difference.&lt;/p&gt;
&lt;h2&gt;
  
  
  Parallel Index Build
&lt;/h2&gt;

&lt;p&gt;For large tables, serial index construction can take hours. pg_textsearch uses Postgres's built-in parallel worker infrastructure to distribute the work.&lt;/p&gt;

&lt;p&gt;The leader launches workers and assigns each a range of heap blocks. Workers scan their assigned blocks, tokenize documents via &lt;code&gt;to_tsvector&lt;/code&gt;, build local in-memory indexes, and write intermediate segments to temporary BufFiles. The leader then performs an N-way merge of all worker output, writing a single merged segment directly to index pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61y9a8j5equ8ngyu0z4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61y9a8j5equ8ngyu0z4d.png" alt="Fig. 5: pg_textsearch Parallel Index Build" width="800" height="994"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 5: pg_textsearch Parallel Index Build&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Workers run concurrently in the scan/tokenize/build phase; the leader merges sequentially. The expensive part (heap scanning, tokenization, posting list assembly) is CPU-bound and parallelizes well. The merge/write phase is comparatively cheap, so a serial merge captures most of the speedup with minimal complexity. It also produces a single fully-compacted segment that is optimal for query performance.&lt;/p&gt;

&lt;p&gt;On MS-MARCO v2 (138M passages), 15 workers complete the build in 17 minutes 37 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_maintenance_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;maintenance_work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'256MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;passages&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methodology
&lt;/h3&gt;

&lt;p&gt;All benchmarks use the MS-MARCO passage ranking dataset [8], a standard information retrieval benchmark drawn from real Bing search queries. We compare pg_textsearch against ParadeDB v0.21.6 (which wraps Tantivy). Both extensions use their default configurations; Postgres tuning is specified per experiment. Both systems configure English stemming and stopword removal.&lt;/p&gt;

&lt;p&gt;Queries are drawn uniformly from 8 token-count buckets (100 queries per bucket on v1; up to 100 per bucket on v2). Weighted-average metrics use the MS-MARCO v1 lexeme distribution as weights, reflecting real search traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache state.&lt;/strong&gt; All query benchmarks are warm-cache: a warmup pass runs before timing begins, and the working set fits in the OS page cache and shared_buffers for all configurations tested. Results reflect CPU and algorithmic efficiency, not I/O. We have not benchmarked memory-constrained configurations where the index exceeds available cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ranking.&lt;/strong&gt; Both systems produce BM25 rankings using the same tokenization (English stemming and stopwords). We have not performed a systematic ranking equivalence comparison; both implement standard BM25 with the same default parameters (k1 = 1.2, b = 0.75), but differences in IDF computation and tokenization edge cases may produce different orderings for some queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  MS-MARCO query length distribution
&lt;/h3&gt;

&lt;p&gt;The following histogram shows the distribution of query lengths in the full MS-MARCO v1 query set (1,010,916 queries), measured in lexemes after English stopword removal and stemming via Postgres &lt;code&gt;to_tsvector('english')&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhedx6bps3xuzxjkgny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uhedx6bps3xuzxjkgny.png" alt="Fig. 6: MS-MARCO query length histogram" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 6: MS-MARCO query length histogram&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distribution is broadly consistent with web search query length studies [9, 10]. The MS-MARCO mean of 3.7 lexemes (after stemming/stopword removal) corresponds to roughly 5–6 raw words, consistent with the corpus statistics reported by Nguyen et al. [8]. We use the v1 distribution for weighting throughout as it provides the largest sample.&lt;/p&gt;
&lt;h3&gt;
  
  
  Results: MS-MARCO v2 (138M passages)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Environment.&lt;/strong&gt; Dedicated c6i.4xlarge EC2 instance: Intel Xeon Platinum 8375C, 8 cores / 16 threads, 123 GB RAM, NVMe SSD. Postgres 17.4 with shared_buffers = 31 GB. Both indexes fit in the buffer cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index build:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;pg_textsearch&lt;/th&gt;
&lt;th&gt;ParadeDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index size&lt;/td&gt;
&lt;td&gt;17 GB&lt;/td&gt;
&lt;td&gt;23 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build time&lt;/td&gt;
&lt;td&gt;17 min 37 sec&lt;/td&gt;
&lt;td&gt;8 min 55 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documents&lt;/td&gt;
&lt;td&gt;138,364,158&lt;/td&gt;
&lt;td&gt;138,364,158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel workers&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pg_textsearch index is 26% smaller. ParadeDB builds approximately 2x faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-client query latency (p50 median, top-10 queries):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lexemes&lt;/th&gt;
&lt;th&gt;pg_textsearch (ms)&lt;/th&gt;
&lt;th&gt;ParadeDB (ms)&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5.11&lt;/td&gt;
&lt;td&gt;59.83&lt;/td&gt;
&lt;td&gt;11.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9.14&lt;/td&gt;
&lt;td&gt;59.65&lt;/td&gt;
&lt;td&gt;6.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;20.04&lt;/td&gt;
&lt;td&gt;77.62&lt;/td&gt;
&lt;td&gt;3.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;41.92&lt;/td&gt;
&lt;td&gt;98.89&lt;/td&gt;
&lt;td&gt;2.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;67.76&lt;/td&gt;
&lt;td&gt;125.38&lt;/td&gt;
&lt;td&gt;1.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;102.82&lt;/td&gt;
&lt;td&gt;148.78&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;159.37&lt;/td&gt;
&lt;td&gt;169.65&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8+&lt;/td&gt;
&lt;td&gt;177.95&lt;/td&gt;
&lt;td&gt;190.47&lt;/td&gt;
&lt;td&gt;1.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is clear: pg_textsearch is fastest on short queries, and the two systems converge at longer lengths. Weighted by the MS-MARCO v1 query length distribution, the overall p50 is 40.6 ms for pg_textsearch vs. 94.4 ms for ParadeDB, a 2.3x advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrent throughput.&lt;/strong&gt; We ran pgbench with 16 parallel clients for 60 seconds (after a 5-second warmup). Each client repeatedly executes a query drawn at random from a weighted pool of 1,000 queries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;pg_textsearch&lt;/th&gt;
&lt;th&gt;ParadeDB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactions/sec&lt;/td&gt;
&lt;td&gt;198.7&lt;/td&gt;
&lt;td&gt;22.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average latency&lt;/td&gt;
&lt;td&gt;81 ms&lt;/td&gt;
&lt;td&gt;701 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total transactions (60s)&lt;/td&gt;
&lt;td&gt;11,969&lt;/td&gt;
&lt;td&gt;1,387&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;pg_textsearch sustains 8.7x higher throughput under concurrent load.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Results: MS-MARCO v1 (8.8M passages)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;On the smaller dataset (GitHub Actions runner, 7 GB RAM, Postgres 17), the advantages are more pronounced: 26x speedup for single-token queries, 14x for 2-token, 7.3x for 4-token. Total sequential execution time for all 800 queries: 6.5 seconds for pg_textsearch vs. 25.2 seconds for ParadeDB. Full results and methodology are available at the &lt;a href="https://timescale.github.io/pg_textsearch/benchmarks/" rel="noopener noreferrer"&gt;&lt;u&gt;benchmarks&lt;/u&gt;&lt;/a&gt; page.&lt;/p&gt;
&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Latency vs. query length
&lt;/h3&gt;

&lt;p&gt;The speedup correlates strongly with query length: 11.7x for single-token queries on v2, narrowing to 1.1x at 8+ tokens. This is the expected behavior of dynamic pruning algorithms like BMW and WAND. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.&lt;/p&gt;

&lt;p&gt;The practical significance depends on the workload's query length distribution. 72.6% of MS-MARCO queries have 2-4 lexemes, the range where pg_textsearch shows its largest advantage (6.5x to 2.4x on v2). Weighted by this distribution, the overall speedup is 2.3x on v2 and 3.9x on v1.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Concurrent throughput&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The concurrent throughput advantage (8.7x) substantially exceeds the single-client advantage (2.3x weighted p50). pg_textsearch queries execute as C code operating on Postgres buffer pages, with all memory management handled by Postgres's buffer cache. ParadeDB routes queries through Rust/C FFI into Tantivy, which manages its own memory and I/O outside the buffer pool. We have not profiled ParadeDB's internals, so we cannot attribute the concurrency gap to specific causes, but the architectural difference (shared buffer cache vs. separate memory management) is a plausible contributor. ParadeDB's concurrent performance may also improve in future versions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where ParadeDB is faster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Index build time.&lt;/strong&gt; ParadeDB builds indexes 1.6-2x faster across both datasets. Tantivy's indexer is highly optimized Rust code with its own I/O management, not constrained by Postgres's page-based storage. Build time is a one-time cost per index (or per REINDEX); it does not affect query performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long queries.&lt;/strong&gt; At 7+ lexemes, the two systems converge. On v2, the 8+ lexeme p50 is 178 ms for pg_textsearch vs. 190 ms for ParadeDB. These long queries represent ~3.7% of the MS-MARCO distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index size caveat.&lt;/strong&gt; pg_textsearch indexes are 19-26% smaller, but this comparison is not apples-to-apples: pg_textsearch does not store term positions, while ParadeDB stores positions to support phrase queries.&lt;/p&gt;
&lt;h3&gt;
  
  
  Benchmark limitations
&lt;/h3&gt;

&lt;p&gt;All measurements are warm-cache on datasets that fit in memory. The 100-query sample per bucket provides directional results but limited statistical power for tail latencies. ParadeDB v0.21.6 was current at time of testing; future versions may improve. We compare against ParadeDB because it is the primary Postgres-native BM25 alternative; standalone engines like Elasticsearch operate in a different deployment model. We have not benchmarked write-heavy workloads with concurrent queries.&lt;/p&gt;
&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;We want to be clear about what pg_textsearch does not support in 1.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No phrase queries.&lt;/strong&gt; The index stores term frequencies but not term positions, so it cannot natively evaluate queries like "database system" as a phrase. Phrase matching can be done with a post-filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'database system'&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="c1"&gt;-- over-fetch to compensate for post-filter&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%database system%'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OR-only query semantics.&lt;/strong&gt; All query terms are implicitly OR'd. A query for "database system" matches documents containing either term. We plan to add AND/OR/NOT operators via a dedicated boolean query syntax in a post-1.0 release.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No highlighting or snippet generation.&lt;/strong&gt; Use Postgres's &lt;code&gt;ts_headline()&lt;/code&gt; on the result set for highlighting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No expression indexing.&lt;/strong&gt; Each BM25 index covers a single text column. Workaround: create a generated column concatenating multiple fields.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition-local statistics.&lt;/strong&gt; Each partition maintains its own IDF and average document length. Cross-partition queries return scores computed independently per partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No background compaction.&lt;/strong&gt; Segment compaction runs synchronously during memtable spill. Write-heavy workloads may observe compaction latency. Background compaction is planned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PL/pgSQL requires explicit index names.&lt;/strong&gt; The implicit text &lt;code&gt;&amp;lt;@&amp;gt; 'query'&lt;/code&gt; syntax relies on planner hooks that do not fire inside PL/pgSQL, DO blocks, or stored procedures. Use &lt;code&gt;to_bm25query('query', 'index_name')&lt;/code&gt; explicitly. This is a practical limitation many developers will hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;shared_preload_libraries required.&lt;/strong&gt; pg_textsearch must be listed in shared_preload_libraries, requiring a server restart to install. On Tiger Cloud, this is handled automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No fuzzy matching or typo tolerance.&lt;/strong&gt; pg_textsearch uses Postgres's standard text search configurations for tokenization and stemming but does not provide built-in fuzzy matching. Typo-tolerant search requires a separate approach (e.g., pg_trgm).&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Planned work for post-1.0 releases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Boolean query operators: AND, OR, NOT via a dedicated query syntax&lt;/li&gt;
&lt;li&gt;Background compaction: decouple compaction from the write path&lt;/li&gt;
&lt;li&gt;Expression index support: index computed expressions, not just bare columns&lt;/li&gt;
&lt;li&gt;Dictionary compression: front-coding for terms, reducing dictionary size&lt;/li&gt;
&lt;li&gt;Improved write concurrency: better throughput for sustained insert-heavy workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;pg_textsearch requires Postgres 17 or 18. The fastest way to try it is on &lt;a href="https://www.tigerdata.com/search" rel="noopener noreferrer"&gt;&lt;u&gt;Tiger Cloud&lt;/u&gt;&lt;/a&gt;, where it is already installed and configured. No setup, no shared_preload_libraries. Create a service and run the example below.&lt;/p&gt;

&lt;p&gt;For self-hosted installations, pre-built binaries for Linux and macOS (amd64, arm64) are available on the &lt;a href="https://github.com/timescale/pg_textsearch/releases" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub Releases page&lt;/u&gt;&lt;/a&gt;. Add it to shared_preload_libraries and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;shared_preload_libraries&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'pg_textsearch'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source code and full documentation: &lt;a href="https://github.com/timescale/pg_textsearch" rel="noopener noreferrer"&gt;&lt;u&gt;github.com/timescale/pg_textsearch&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 2 of this series covers getting started with pg_textsearch, hybrid search with pgvectorscale, and production patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Robertson et al. "Okapi at TREC-3." 1994. See also: Robertson, Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 3(4):333-389, 2009.&lt;/p&gt;

&lt;p&gt;[2] Ding, Suel. "Faster top-k document retrieval using block-max indexes." SIGIR 2011, pp. 993-1002.&lt;/p&gt;

&lt;p&gt;[3] Broder et al. "Efficient query evaluation using a two-level retrieval process." CIKM 2003, pp. 426-434.&lt;/p&gt;

&lt;p&gt;[4] O'Neil et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica, 33(4):351-385, 1996.&lt;/p&gt;

&lt;p&gt;[5] Facebook. "RocksDB: A Persistent Key-Value Store for Fast Storage Environments." &lt;a href="https://rocksdb.org/" rel="noopener noreferrer"&gt;https://rocksdb.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[6] SmallFloat encoding: Apache Lucene SmallFloat.java. Tantivy uses an equivalent implementation.&lt;/p&gt;

&lt;p&gt;[7] Grand et al. "From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance." ECIR 2020.&lt;/p&gt;

&lt;p&gt;[8] Nguyen et al. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." 2016.&lt;/p&gt;

&lt;p&gt;[9] Statista. "Distribution of online search queries in the US, February 2020, by number of search terms."&lt;/p&gt;

&lt;p&gt;[10] Dean. "We Analyzed 306M Keywords." Backlinko, 2024.&lt;/p&gt;

</description>
      <category>announcementsrelease</category>
      <category>pgtextsearch</category>
      <category>postgres</category>
      <category>searchengine</category>
    </item>
    <item>
      <title>How to Break Your PostgreSQL IIoT Database and Learn Something in the Process</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:42:43 +0000</pubDate>
      <link>https://dev.to/tigerdata/how-to-break-your-postgresql-iiot-database-and-learn-something-in-the-process-n2d</link>
      <guid>https://dev.to/tigerdata/how-to-break-your-postgresql-iiot-database-and-learn-something-in-the-process-n2d</guid>
      <description>&lt;p&gt;As engineers, we're taught to design for reliability. We do design calculations, run simulations, build and test prototypes, and even then we recognize that these are imperfect, so we include safety factors. When it comes to the Industrial Internet of Things (IIoT) though, we rarely give the same level of scrutiny to the components that we rely on.&lt;/p&gt;

&lt;p&gt;What if we treated our IIoT database the same way we treat the physical things we produce? We design and build a prototype database, and then &lt;a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill/" rel="noopener noreferrer"&gt;put it through some serious testing&lt;/a&gt;, even to failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Value (and Perils) of Stress Testing
&lt;/h2&gt;

&lt;p&gt;Think of database stress testing as a destructive materials test for your data storage. You wouldn't trust a bridge made of untested steel, so don’t trust your database until you know its limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Identify Bottlenecks:&lt;/strong&gt;  Stress testing reveals the weak links—what is likely to fail first? Will you run out of storage? Will your queries get bogged down? Or will you hit the dreaded ingest wall (when data comes in faster than it can be stored)?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Determine Real-World Behaviour:&lt;/strong&gt;  You'll find out exactly how your database performance changes as the amount of data increases. What issues are future-you going to struggle with?&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.tigerdata.com/blog/postgres-optimization-treadmill/" rel="noopener noreferrer"&gt;&lt;strong&gt;Optimize Configuration&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;:&lt;/strong&gt;  Just like you might build a few different prototypes and see how it affects failure modes, changing your database configuration, especially when it comes to indices, can dramatically affect how it behaves. Building a rigorous stress testing framework provides a safe way to optimize your design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope it goes without saying, but please, please don’t run this on your production environment. Even if it’s technically a different database but the same hardware, this test can wreak havoc on your resources and crash your system. You’ve been warned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Measure?
&lt;/h2&gt;

&lt;p&gt;There’s no point going through all the effort to break your system if you don’t learn anything. Assuming you’re using a PostgreSQL database (&lt;a href="https://www.tigerdata.com/blog/its-2026-just-use-postgres" rel="noopener noreferrer"&gt;It’s 2026, Just Use PostgreSQL&lt;/a&gt;), here is a decent set of metrics to keep track of while you’re putting your database through its paces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Size
&lt;/h3&gt;

&lt;p&gt;The size of a PostgreSQL table is generally measured in rows, but the actual space it occupies on disk is the sum of the heap (the main relational table), the indices, and the TOAST (storage for large objects).&lt;/p&gt;

&lt;p&gt;The following query will give the number of rows as well as the size of each component of the table in bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
      &lt;span class="n"&gt;reltuples&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;heap_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_indexes_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;indices_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;pg_table_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
            &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;toast_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'iiot_history'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason for the odd row_count is that counting rows the standard way, with COUNT(*), requires scanning the whole table, which is going to be painfully slow when we’re building a table big enough to break things. The reltuples column is the planner’s row estimate, kept up to date by VACUUM and ANALYZE, so it’s close enough for tracking growth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Performance
&lt;/h3&gt;

&lt;p&gt;The best way to measure table performance is to use the actual queries that your production system will use. At a minimum, this should include your batched INSERT (you always batch, right?) and at least one common SELECT. Keep in mind that for a table with N rows, query timing tends to be constant, log(N), N, or worse, depending on how the indices are structured.&lt;/p&gt;

&lt;p&gt;You can get very accurate timing info from running your queries with the prefix EXPLAIN ANALYZE, and it’s worth doing this at least once to see what the database is doing under the hood. However, I recommend running the whole test with a scripting language and then just timing the execution of that particular step.&lt;/p&gt;
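&lt;p&gt;For instance, assuming the iiot_history table used later in this post, a timed range scan looks like this (the tag ID and interval are illustrative):&lt;/p&gt;

```sql
-- EXPLAIN ANALYZE runs the query and reports the chosen plan plus
-- actual timing; tag_id 42 and the one-hour window are illustrative.
EXPLAIN ANALYZE
SELECT time, value
FROM iiot_history
WHERE tag_id = 42
  AND time > NOW() - INTERVAL '1 hour';
```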

&lt;h3&gt;
  
  
  Server Performance
&lt;/h3&gt;

&lt;p&gt;Don’t forget the engine that’s driving all this machinery. You’ll need to watch the CPU, Memory, Storage, and Network Bandwidth. People in the IT world tend to talk about headroom for a server, and that’s what you’re really looking at: how much spare capacity do you have? Your CPU and Memory usage might spike at times, but the important thing is that it’s not always running at max capacity.&lt;/p&gt;

&lt;p&gt;There are a lot of free and paid tools to monitor these variables. I almost always do this type of test in a VM (easier to clean up the mess when it all breaks) and I like to use &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, but honestly Perfmon on Windows or top on Linux gives you all you really need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Limits
&lt;/h3&gt;

&lt;p&gt;It’s helpful to set some limits on these parameters so you know when to stop the test. For database size, it might be some measurement like a year's worth of data, or when the drive is 80% full. For ingest timing, I suggest stopping when inserting takes longer than the desired ingest frequency—this is the ingest bottleneck and something you really want to avoid in production. Scan times can be limited by the time it takes for a specific query; maybe computing the average value for one tag over the past hour must finish in under 10 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Simulate Data?
&lt;/h2&gt;

&lt;p&gt;There are lots of ways to insert data, but it’s usually a tradeoff between how well the data represents real scenarios and how long it takes to run the test.&lt;/p&gt;

&lt;p&gt;The following is one of my favourite methods for injecting large amounts of data into an IIoT database:&lt;/p&gt;

&lt;p&gt;Say you have a classic IIoT history table like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tag_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you expect to ingest 10,000 tags at 1s intervals, you can use the following INSERT query to add a day’s worth of history to the back end of your table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;min_date&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1s'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;LEAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_date&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;iiot_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tag_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate random data values for every second during a day and for every tag_id from 1 to 10,000. Not exactly as interesting as real data, but enough to fill up your table.&lt;/p&gt;

&lt;p&gt;The nice thing about this query is that you should be able to run it in parallel to your real-time data pipeline and it won’t mess with your data (aside from potentially locking your table while it runs). It’s also easy to modify this query to inject more or fewer tags, or to change the time interval, if you’re playing around with different configurations.&lt;/p&gt;

&lt;p&gt;If you use this query, or whichever one you prefer, in a script (I usually use Python), then you can automate the whole test. Something along the lines of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Get database size&lt;/li&gt;
&lt;li&gt; Run select queries, measure execution time&lt;/li&gt;
&lt;li&gt; Run insert queries several times, measure and average execution time&lt;/li&gt;
&lt;li&gt; Artificially grow database size&lt;/li&gt;
&lt;li&gt; Repeat 1-3 until one of the failure conditions is reached.&lt;/li&gt;
&lt;/ol&gt;
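&lt;p&gt;The loop above can be sketched in Python; the callbacks stand in for real queries issued through a driver such as psycopg, and the limit parameters are illustrative:&lt;/p&gt;

```python
import time

def timed(fn):
    """Run fn() once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def stress_test(get_size, run_select, run_insert, grow,
                max_rows, max_select_s, max_insert_s, insert_samples=3):
    """Drive the measure/grow cycle until a failure condition is hit.

    Each callback is a placeholder for a real query sent over a
    driver such as psycopg; returns one (row_count, select_s,
    avg_insert_s) tuple per cycle.
    """
    results = []
    while True:
        rows = get_size()                                  # step 1
        select_s = timed(run_select)                       # step 2
        insert_s = sum(timed(run_insert)                   # step 3
                       for _ in range(insert_samples)) / insert_samples
        results.append((rows, select_s, insert_s))
        if rows >= max_rows or select_s >= max_select_s or insert_s >= max_insert_s:
            return results                                 # stop: limit reached
        grow()                                             # step 4
```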

&lt;h2&gt;
  
  
  How to Interpret Results and What to Expect in the Real World?
&lt;/h2&gt;

&lt;p&gt;Your test results will give you some clear data points, but you still need to do some interpreting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Identify the Limiting Component:&lt;/strong&gt;  Where did the database fail? If it’s a query that took too long, you might be able to speed things up with a clever index. If it’s an insert that took too long, you might be able to speed things up by removing that clever index you added earlier.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Optimize:&lt;/strong&gt;  There’s a lot you can do to improve table performance before throwing the whole thing out in frustration:

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Proper Indexing:&lt;/strong&gt;  Choosing an index is almost always a tradeoff, for example: Indexing the tag_id column before the time column will speed up most queries, at the cost of slower inserts as the table grows. Indexing the time column first will avoid the ‘ingest wall’ at the cost of slower queries. Figure out which solution is best.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Plan for the future:&lt;/strong&gt;  Will you need more hardware in a few months or a few years? Being able to estimate the life of your existing architecture means you won’t be caught unawares when it no longer suffices.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Partitioning/Chunking:&lt;/strong&gt;  For very large tables, you may need to partition appropriately (see PostgreSQL extensions like  &lt;a href="https://www.tigerdata.com/timescaledb" rel="noopener noreferrer"&gt;TimescaleDB&lt;/a&gt;). How great would it be to learn you’ll need this before you actually need it?&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Add a Safety Factor:&lt;/strong&gt;  If your test showed a maximum reliable throughput of 15,000 rows/sec, set your operational limit to 10,000 rows/sec. The real world has peaks, unexpected queries, and background maintenance tasks that will steal resources. Like we do with all engineering products, design with margin.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat your database like a prototype and really put it through its paces, you’ll get a preview of how it’ll behave in the future and make good, proactive design decisions instead of struggling in the future. Now, go break something (and learn).&lt;/p&gt;

</description>
      <category>iot</category>
      <category>postgres</category>
      <category>industrial</category>
      <category>database</category>
    </item>
    <item>
      <title>What Developers Get Wrong About Storing Sensor Data</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Thu, 19 Mar 2026 14:08:03 +0000</pubDate>
      <link>https://dev.to/tigerdata/what-developers-get-wrong-about-storing-sensor-data-4e4m</link>
      <guid>https://dev.to/tigerdata/what-developers-get-wrong-about-storing-sensor-data-4e4m</guid>
      <description>&lt;h2&gt;
  
  
  Sensor Data Looks Simple Until It Isn’t
&lt;/h2&gt;

&lt;p&gt;Sensor data appears straightforward. It just has timestamps, numeric readings, and maybe a device identifier. Compared to transactional application data, sensor data feels uniform and predictable. Teams often assume they can store it using familiar relational database schemas and grow from there.&lt;/p&gt;

&lt;p&gt;That assumption falls apart as scale grows. Devices multiply, sampling rates rise, and historical data accumulates indefinitely. Queries shift from single-row lookups to time windows and aggregations. Data arrives out of order. Storage costs climb relentlessly. Systems designed around transactional assumptions crack in ways that are difficult to correct once data volume locks architecture in place.&lt;/p&gt;

&lt;p&gt;The root problem is conceptual. Sensor data looks like rows but behaves like a time-ordered stream whose value declines with age. Engineers must design the database as a time-series log with decay from the outset, rather than adapting it from a transactional model later. The following sections show how relational database approaches are inadequate for handling sensor data, and what a more suitable architecture looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Default Model: Treating Sensor Data Like Rows
&lt;/h2&gt;

&lt;p&gt;Most database developers approach sensor data with a transactional mindset. They design normalized schemas, enforce relational integrity, and add indexes for point queries. Those techniques work well for mutable business entities such as users or orders.&lt;/p&gt;

&lt;p&gt;Sensor data, however, is append-only. New measurements arrive continuously and are rarely updated. Sustained ingestion and time-range retrieval are dominant, not row mutation or lookup. When schemas assume row-oriented access, data ingestion becomes join-heavy, indexing costs grow with volume, and write throughput falls behind input data flow.&lt;/p&gt;

&lt;p&gt;Treating sensor data as rows creates problems precisely where sensor systems spend most of their effort: writing and scanning time-ordered streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where That Model Breaks
&lt;/h2&gt;

&lt;p&gt;As the system grows, several problems appear simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, ingestion is continuous and bursty. Devices reconnect and flush buffers, producing spikes rather than steady flows. Row-oriented schemas struggle to absorb these bursts efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, growth compounds across multiple axes: more devices, higher sampling frequency, additional metrics, and longer retention. Storage volume grows quickly, turning early schema choices into long-term constraints because migrating historical time-series data is costly and risky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third&lt;/strong&gt;, queries shift toward time windows. Monitoring, analytics, and diagnostics rely on ranges, aggregates, and rates over time rather than individual rows. Row-optimized indexing performs poorly for these scans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth&lt;/strong&gt;, operational realities inevitably create problems. Timestamps arrive late or out of sequence. Data must be replayed or corrected. Systems designed for ordered inserts encounter fragmentation and duplication under these conditions.&lt;/p&gt;

&lt;p&gt;Each constraint highlights the same reality. Sensor workloads are shaped by time and continuity, not by relational identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insight: Sensor Data Is a Log With Decay
&lt;/h2&gt;

&lt;p&gt;Sensor data has two defining properties.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is a log: append-only, time-indexed, and rarely modified after arrival.&lt;/li&gt;
&lt;li&gt;It decays: its value decreases as it ages, even as its volume accumulates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Recent data supports high-resolution monitoring and debugging. Older data supports trends and aggregates. Very old data is rarely queried except in a summarized form. Yet without lifecycle awareness, systems retain all data at equal resolution and cost.&lt;/p&gt;

&lt;p&gt;Once teams understand that sensor data is a &lt;strong&gt;log with decay&lt;/strong&gt;, the correct architecture becomes clear. Storage must optimize for append throughput and time-range access while permitting data to evolve in resolution and tier as it ages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Time-Series Architecture
&lt;/h2&gt;

&lt;p&gt;Time-series data that loses value over time requires the database architecture to have a few key properties.&lt;/p&gt;

&lt;h3&gt;
  
  
  Log-optimized ingestion
&lt;/h3&gt;

&lt;p&gt;Writes must be sequential and batched, minimizing per-row overhead. Storage engines and schemas should favor append operations over update operations so ingestion scales with device fleets and burst conditions.&lt;/p&gt;
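&lt;p&gt;In SQL terms, that means preferring one multi-row statement over many single-row ones; a minimal sketch with an illustrative readings table:&lt;/p&gt;

```sql
-- One multi-row INSERT amortizes parse and commit overhead that a
-- thousand single-row statements would pay repeatedly; the readings
-- table is illustrative.
INSERT INTO readings (time, device_id, value) VALUES
    ('2026-03-01 00:00:00+00', 1, 20.1),
    ('2026-03-01 00:00:01+00', 1, 20.3),
    ('2026-03-01 00:00:02+00', 1, 20.2);
```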

&lt;h3&gt;
  
  
  Time-partitioned organization
&lt;/h3&gt;

&lt;p&gt;Data should be grouped primarily by time, aligning its physical storage with dominant query patterns. Time partitioning keeps recent data localized and keeps historical segments compact and independent.&lt;/p&gt;
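&lt;p&gt;One way to realize this in plain Postgres is declarative range partitioning by time; the table names are illustrative, and extensions such as TimescaleDB automate creating these chunks:&lt;/p&gt;

```sql
-- Native declarative partitioning (Postgres 10+); names illustrative.
CREATE TABLE readings (
    time      timestamptz NOT NULL,
    device_id int NOT NULL,
    value     double precision
) PARTITION BY RANGE (time);

-- Each month lives in its own compact, independently droppable segment.
CREATE TABLE readings_2026_03 PARTITION OF readings
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
```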

&lt;h3&gt;
  
  
  Lifecycle tiering
&lt;/h3&gt;

&lt;p&gt;Because sensor data’s value declines with age, its resolution and storage cost should decline as well. High-resolution recent data stays hot, while older data is compressed, downsampled, or moved to cheaper storage tiers without sacrificing analytical performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Role separation
&lt;/h3&gt;

&lt;p&gt;Operational monitoring, historical analytics, and archival retention create different latency and throughput challenges. Separating these roles prevents continuous ingestion from degrading analytical performance and allows each layer to evolve independently.&lt;/p&gt;

&lt;p&gt;These properties are not optimizations layered onto transactional storage. Instead, they are intentional design choices needed to handle the key aspects of time-series data: continuous append, time-range access, and aging value.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Enables for Developers
&lt;/h2&gt;

&lt;p&gt;Architectures aligned with time-series data change how systems scale and operate.&lt;/p&gt;

&lt;p&gt;Ingestion stays stable as fleets expand because write operations match append patterns rather than row mutation. Query cost stays predictable because time-range scans align with the storage layout. Storage growth stays bounded relative to insight because data resolution declines with age. Operational corrections and replays become routine rather than disruptive because logs tolerate disorder.&lt;/p&gt;

&lt;p&gt;Developers spend less effort compensating for schema problems and more effort deriving insight from data. Systems stay adaptable as deployments grow from prototypes to global fleets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Time-Series Architecture Becomes Inevitable
&lt;/h2&gt;

&lt;p&gt;Transactional database models are designed for mutable records whose value stays relatively stable over time. Sensor data is the opposite: immutable events whose volume grows continuously while their value declines with age. As ingestion becomes constant, queries become time-range-driven, and history accumulates indefinitely, databases built on transactional assumptions develop write bottlenecks, inefficient scans, and rising storage costs.&lt;/p&gt;

&lt;p&gt;Once teams understand that sensor data is just an append-only data stream with aging value, the architectural solution becomes clear. Systems must ingest sequentially, organize primarily by time, reduce resolution as data ages, and separate operational and historical workloads. These structures stem directly from how sensor data behaves, not a preference for any particular technology.&lt;/p&gt;

&lt;p&gt;Treating sensor data as rows delays problems but does not fix them. As scale grows, transactional models diverge further from workload reality, while time-series architectures stay matched to it. Database design, therefore, can’t be retrofitted late without cost and disruption. It must start from the correct model: sensor data as a time-series log with decay.&lt;/p&gt;

</description>
      <category>timeseries</category>
      <category>database</category>
      <category>iot</category>
      <category>backend</category>
    </item>
    <item>
      <title>Your Rails App Isn’t Slow—Your Database Is</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Tue, 06 May 2025 12:23:00 +0000</pubDate>
      <link>https://dev.to/tigerdata/your-rails-app-isnt-slow-your-database-is-o57</link>
      <guid>https://dev.to/tigerdata/your-rails-app-isnt-slow-your-database-is-o57</guid>
      <description>&lt;p&gt;In case you missed the quiet launch of our timescaledb-ruby gem, we’re here to remind you that you can now &lt;a href="https://www.timescale.com/blog/connecting-ruby-and-postgresql-timescale-integrations-expand" rel="noopener noreferrer"&gt;connect PostgreSQL and Ruby when using TimescaleDB&lt;/a&gt;. 🎉 This integration delivers a deeply integrated experience that will feel natural to Ruby and Rails developers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Scale Your Rails App Analytics with TimescaleDB
&lt;/h2&gt;

&lt;p&gt;If you’ve worked with Rails for any length of time, you’ve probably hit the wall when dealing with time-series data. I know I did. &lt;/p&gt;

&lt;p&gt;Your app starts off smooth—collecting metrics, logging events, tracking usage. But one day, your dashboards start lagging. Page load times creep past 10 seconds. Pagination stops helping. Background jobs queue up as yesterday’s data takes too long to process.&lt;/p&gt;

&lt;p&gt;This isn’t a Rails problem. Or even a PostgreSQL problem. It’s a “using the wrong tool for the job” problem.&lt;/p&gt;

&lt;p&gt;In this post, I’ll show you how we solve these challenges at Timescale—and how you can too. I’ll walk through the real implementation patterns we use in production Rails apps, using practical code examples instead of abstract concepts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Growing Time-Series Data Challenge
&lt;/h2&gt;

&lt;p&gt;A few years ago, I was building analytics for a high-traffic Rails app. Despite adding indexes and optimizing queries, performance kept degrading as our data grew.&lt;/p&gt;

&lt;p&gt;Like most apps, we started with simple timestamp columns and standard ActiveRecord queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Event&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'created_at &amp;gt; ?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;week&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ago&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:by_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"DATE_TRUNC('day', created_at)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works fine at first. But as your table grows to millions (or billions) of rows, a query like &lt;code&gt;Event.where(user_id: 123).by_day&lt;/code&gt; slows to a crawl:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 ms with 10K rows&lt;/li&gt;
&lt;li&gt;2,000 ms with 10M rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the problems compound when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track high-volume events (like API calls or page views)&lt;/li&gt;
&lt;li&gt;Keep historical data accessible for trends&lt;/li&gt;
&lt;li&gt;Run complex aggregations across time&lt;/li&gt;
&lt;li&gt;Maintain dashboard performance as data scales&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over the years, I tried all the usual tricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional indexes: Helped at first, then hurt insert performance&lt;/li&gt;
&lt;li&gt;Manual partitioning: Fragile and hard to manage&lt;/li&gt;
&lt;li&gt;Pre-aggregation jobs: Complex and often stale&lt;/li&gt;
&lt;li&gt;Custom caching: Difficult to maintain, always a step behind&lt;/li&gt;
&lt;/ul&gt;
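&lt;p&gt;The "manual partitioning" route usually means application code like the following, plus a scheduled job that must create next month's child table before any rows arrive for it (a hypothetical sketch, not from a real codebase):&lt;/p&gt;

```ruby
require 'time'

# Manual monthly partitioning in a nutshell: every write has to be
# routed to the right child table by name, and forgetting to create
# next month's table breaks inserts at midnight on the first.
def partition_for(timestamp)
  "events_#{timestamp.strftime('%Y_%m')}"
end

partition_for(Time.parse('2026-04-15'))  # returns "events_2026_04"
```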

&lt;p&gt;It felt like fighting my database instead of working with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why PostgreSQL Falls Short for Time-Series
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is a fantastic general-purpose database. But time-series data introduces new demands that standard Postgres tables aren’t designed for. Let’s break that down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insertion pattern: Data constantly arrives in time order, but old data rarely changes&lt;/li&gt;
&lt;li&gt;Query pattern: Most queries use time bounds (WHERE created_at BETWEEN x AND y)&lt;/li&gt;
&lt;li&gt;Aggregation pattern: You’re grouping by time (hourly, daily, monthly)&lt;/li&gt;
&lt;li&gt;Storage pattern: The dataset grows linearly—forever&lt;/li&gt;
&lt;li&gt;Access pattern: Recent (hot) data is queried far more than older (cold) data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These characteristics expose several pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No automatic time-based partitioning (declarative partitioning exists, but you create and manage partitions yourself)&lt;/li&gt;
&lt;li&gt;Index bloat as tables grow&lt;/li&gt;
&lt;li&gt;Inefficient time-based queries&lt;/li&gt;
&lt;li&gt;Manual rollups and background jobs&lt;/li&gt;
&lt;li&gt;Difficulty managing large historical datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s exactly where TimescaleDB comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  TimescaleDB: PostgreSQL, But Built for Time-Series
&lt;/h2&gt;

&lt;p&gt;TimescaleDB is a PostgreSQL extension built to handle time-series and real-time workloads—without giving up the safety and simplicity of Postgres.&lt;/p&gt;

&lt;p&gt;With the timescaledb Ruby gem, it integrates cleanly into Rails. You don’t have to abandon ActiveRecord, rewrite your models, or learn a whole new stack.&lt;/p&gt;

&lt;p&gt;Here’s what TimescaleDB brings to your Rails app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hypertables: Automatic time-based partitioning, transparent to your queries&lt;/li&gt;
&lt;li&gt;Optimized time indexes: Stay fast even as your data grows&lt;/li&gt;
&lt;li&gt;Built-in compression: Reduce storage by 90–95%&lt;/li&gt;
&lt;li&gt;Continuous aggregates: Pre-computed rollups that stay fresh automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly? You keep your Rails patterns.&lt;/p&gt;

&lt;p&gt;These work just like before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;user_id: &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;created_at: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ago&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="no"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="no"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:created_at&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;  &lt;span class="c1"&gt;# using the groupdate gem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real Performance Gains Without Rewriting Everything
&lt;/h2&gt;

&lt;p&gt;With Timescale, our analytics workflows went from laggy to fast—without adding new caching layers or complex ETL.&lt;/p&gt;

&lt;p&gt;Across production workloads, teams have seen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-second queries on tens of millions of rows&lt;/li&gt;
&lt;li&gt;95%+ compression on time-series datasets&lt;/li&gt;
&lt;li&gt;Fewer background jobs, thanks to continuous aggregates&lt;/li&gt;
&lt;li&gt;Simplified code—no more rollup scripts or cache warmers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels like your app leveled up, without any extra complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous Aggregates in One Line of Ruby
&lt;/h2&gt;

&lt;p&gt;One of TimescaleDB’s most powerful features is continuous aggregates—think materialized views that update automatically in the background.&lt;br&gt;
And with the timescaledb gem, defining them looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Download&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="kp"&gt;extend&lt;/span&gt; &lt;span class="no"&gt;Timescaledb&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ActsAsHypertable&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;Timescaledb&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ContinuousAggregatesHelper&lt;/span&gt;

  &lt;span class="n"&gt;acts_as_hypertable&lt;/span&gt; &lt;span class="ss"&gt;time_column: &lt;/span&gt;&lt;span class="s1"&gt;'ts'&lt;/span&gt;

  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:total_downloads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"count(*) as total"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;scope&lt;/span&gt; &lt;span class="ss"&gt;:downloads_by_gem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"gem_name, count(*) as total"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:gem_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;continuous_aggregates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="ss"&gt;timeframes: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:minute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:hour&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:month&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="ss"&gt;scopes: &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:total_downloads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:downloads_by_gem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single model creates a cascade of continuously updated rollups—from minute to month—all while sticking to the ActiveRecord patterns you know and love.&lt;/p&gt;
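&lt;p&gt;The cascade works because each coarser timeframe can be computed from the previous rollup instead of from raw rows. A toy pure-Ruby model of that idea (not how TimescaleDB implements it internally):&lt;/p&gt;

```ruby
# Minute-level counts rolled up into hourly counts: the coarser
# rollup reads the finer rollup, never the raw events table.
minute_counts = {
  '2026-04-15 10:00' => 3,
  '2026-04-15 10:01' => 2,
  '2026-04-15 11:30' => 5
}

hourly_counts = minute_counts
  .group_by { |minute, _| minute[0, 13] }  # bucket key 'YYYY-MM-DD HH'
  .transform_values { |pairs| pairs.sum { |_, count| count } }
# hourly_counts == { '2026-04-15 10' => 5, '2026-04-15 11' => 5 }
```

Daily and monthly rollups chain the same way, which is why refreshing them stays cheap as raw data grows.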

&lt;h2&gt;
  
  
  Why It Matters
&lt;/h2&gt;

&lt;p&gt;If you're building a Rails app that tracks metrics, logs, events, or any kind of time-based data, TimescaleDB gives you a clear path to scale without duct tape and complexity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce load on your app servers—let the DB do the aggregating&lt;/li&gt;
&lt;li&gt;Eliminate complex background jobs—fewer moving parts to break&lt;/li&gt;
&lt;li&gt;Get predictable performance—even with billions of rows&lt;/li&gt;
&lt;li&gt;Stick with Rails conventions—write less custom SQL&lt;/li&gt;
&lt;li&gt;Continuous aggregates alone can replace dozens of lines of rollup code and hours of maintenance work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Rails developers deserve a time-series database that just works. TimescaleDB gives you the performance and scale your app needs without giving up the elegance of ActiveRecord.&lt;/p&gt;

&lt;p&gt;If you’re curious, here’s how to get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install TimescaleDB (it’s just a Postgres extension)&lt;/li&gt;
&lt;li&gt;Add the timescaledb gem to your Gemfile&lt;/li&gt;
&lt;li&gt;Identify models with time-based data&lt;/li&gt;
&lt;li&gt;Start with hypertables, then add continuous aggregates as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can self-host, or try Timescale Cloud for a fully managed option.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ: TimescaleDB for Ruby on Rails Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Do I need to change how I use ActiveRecord?
&lt;/h3&gt;

&lt;p&gt;A: Nope! TimescaleDB works with your existing ActiveRecord models. Just add the timescaledb gem and use the acts_as_hypertable macro to enable time-series functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How is TimescaleDB different from just using PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;A: TimescaleDB is a PostgreSQL extension. It gives you automatic time-based partitioning (hypertables), faster time-based queries, built-in compression, and continuous aggregates—all while staying 100% SQL- and Rails-compatible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I keep using the gems I already use for date grouping, like groupdate?
&lt;/h3&gt;

&lt;p&gt;A: Yes. TimescaleDB works seamlessly with gems like groupdate. You can continue using .group_by_day, .group_by_hour, etc., and get better performance under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What kind of performance improvements can I expect?
&lt;/h3&gt;

&lt;p&gt;A: Teams have seen sub-second query times on tens of millions of rows and 95%+ storage savings using TimescaleDB’s compression. The biggest wins are in read-heavy, time-bounded queries (e.g., user activity, logs, metrics).&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What’s the learning curve for continuous aggregates?
&lt;/h3&gt;

&lt;p&gt;A: It’s minimal. The timescaledb gem lets you define continuous aggregates using a simple DSL that reuses your existing scopes. You don’t need to learn new SQL or create custom rollup jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use this in production? Is it stable?
&lt;/h3&gt;

&lt;p&gt;A: Yes. TimescaleDB powers production workloads at companies like NetApp, Linktree, and RubyGems.org. It’s backed by years of performance and reliability improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do I need to self-host? Or is there a managed option?
&lt;/h3&gt;

&lt;p&gt;A: Both! You can self-host TimescaleDB or use Timescale Cloud, a fully managed PostgreSQL service with built-in TimescaleDB, high availability, backups, and usage-based pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Where can I learn more?
&lt;/h3&gt;

&lt;p&gt;A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/timescale/timescaledb-ruby" rel="noopener noreferrer"&gt;Ruby Quickstart&lt;/a&gt; in Timescale Docs&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/timescale/timescaledb-ruby" rel="noopener noreferrer"&gt;timescaledb-ruby&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;Fully Managed Timescale Cloud&lt;/a&gt; (free for 30 days)&lt;/li&gt;
&lt;li&gt;Install the &lt;a href="https://docs.timescale.com/self-hosted/latest/install/" rel="noopener noreferrer"&gt;open-source TimescaleDB extension&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>postgres</category>
      <category>ruby</category>
      <category>rails</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>We Listened: Pgai Vectorizer Now Works With Any Postgres Database</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Mon, 05 May 2025 15:01:35 +0000</pubDate>
      <link>https://dev.to/tigerdata/we-listened-pgai-vectorizer-now-works-with-any-postgres-database-1e57</link>
      <guid>https://dev.to/tigerdata/we-listened-pgai-vectorizer-now-works-with-any-postgres-database-1e57</guid>
      <description>&lt;p&gt;TL;DR: &lt;br&gt;
We're excited to announce that pgai Vectorizer—the &lt;a href="https://www.timescale.com/blog/pgai-vectorizer-now-works-with-any-postgres-database" rel="noopener noreferrer"&gt;tool for robust embedding creation and management&lt;/a&gt;—is now available as a Python CLI and library, making it compatible with any Postgres database, whether it be self-hosted Postgres or cloud-hosted on Timescale Cloud, Amazon RDS for PostgreSQL, or Supabase. &lt;/p&gt;



&lt;p&gt;This expansion comes directly from developer feedback requesting broader accessibility while maintaining the Postgres integration that makes pgai Vectorizer the ideal solution for production-grade embedding creation, management, and experimentation. &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;To get started, head over to the pgai GitHub&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why We Built Pgai Vectorizer for Postgres
&lt;/h2&gt;

&lt;p&gt;When we first &lt;a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction" rel="noopener noreferrer"&gt;&lt;u&gt;launched pgai Vectorizer&lt;/u&gt;&lt;/a&gt;, we aimed to simplify vector embedding management for developers building AI systems with Postgres. We heard the horror stories of developers struggling with complex ETL (extract-transform-load) pipelines, embedding synchronization issues, and the constant battle to keep embeddings up-to-date when source data changes. Teams were spending more time maintaining infrastructure than building useful AI features.&lt;/p&gt;

&lt;p&gt;Many developers found themselves cobbling together custom solutions involving message queues, Lambda functions, and background workers just to handle the embedding creation workflow. Others faced the frustration of stale embeddings that no longer matched their updated content, leading to degraded search quality and hallucinations in their RAG applications.&lt;/p&gt;

&lt;p&gt;Pgai Vectorizer solved these problems with a declarative approach that automated the entire embedding lifecycle with a single SQL command, similar to how you'd create an index in Postgres. The &lt;a href="https://news.ycombinator.com/item?id=41985176" rel="noopener noreferrer"&gt;&lt;u&gt;tool resonated with developers&lt;/u&gt;&lt;/a&gt; and quickly gained traction among AI builders. However, we soon started hearing a consistent piece of feedback that would shape our next steps.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Change: Moving From Extension-Only to Python CLI and Library
&lt;/h2&gt;

&lt;p&gt;After our initial launch, we received consistent feedback from developers who wanted to use pgai Vectorizer with their existing managed Postgres databases. While our extension-based approach worked great for self-hosted Postgres and Timescale Cloud, users on platforms like Amazon RDS for PostgreSQL, Supabase, and other managed database services couldn't use pgai Vectorizer unless their cloud provider chose to make it available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXee19HYvVr8vTsVmwdlCVLOqHM1_-0VzjQrSk-3ZWQETtAFb8q8CBb9SKPmikQFJCl9ZgdpcrftidajbruKCWvshO8AkVuJbK5tpqlj9PyDrwk6SKrWfbG-KaRXu4KKmQyWrkX6bA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXee19HYvVr8vTsVmwdlCVLOqHM1_-0VzjQrSk-3ZWQETtAFb8q8CBb9SKPmikQFJCl9ZgdpcrftidajbruKCWvshO8AkVuJbK5tpqlj9PyDrwk6SKrWfbG-KaRXu4KKmQyWrkX6bA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="1437" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcau7ZF40A9PuXPL2Zbp60ymU-EK3MLtzeId2XpSittjRCcBxga3dFoBApqChi4cJTwXrD9Hw2lYoAPLv-5A6ehmbIbqU2_Bji1O39jVqSL-iAm5fVyKGiRexcfArnAj9X4KEtOgA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXcau7ZF40A9PuXPL2Zbp60ymU-EK3MLtzeId2XpSittjRCcBxga3dFoBApqChi4cJTwXrD9Hw2lYoAPLv-5A6ehmbIbqU2_Bji1O39jVqSL-iAm5fVyKGiRexcfArnAj9X4KEtOgA%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="823" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXc7W5YksotaCkwdSfhpzWB1x6DmpkBnX5DQAvP1ahIknUEXFHjwM8ATzNFAoo76_mKKDT6MpvCc_aNjCi3HZ5T9qjkB7dLGvqNh7FifbYv---v9MJZf4fPp3mNEPKKTop4-h7zr4A%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXc7W5YksotaCkwdSfhpzWB1x6DmpkBnX5DQAvP1ahIknUEXFHjwM8ATzNFAoo76_mKKDT6MpvCc_aNjCi3HZ5T9qjkB7dLGvqNh7FifbYv---v9MJZf4fPp3mNEPKKTop4-h7zr4A%3Fkey%3DBTo9RW9k3V54BU75a7UCXCke" width="1435" height="997"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Requests for pgai Vectorizer support on Supabase, Azure PostgreSQL, and Amazon RDS.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We knew we needed to make pgai Vectorizer more accessible without compromising its seamless Postgres integration. The solution? Repackaging our core functionality as a Python CLI (command-line interface) and library that can work with any Postgres database while maintaining the same robustness and "set it and forget it" simplicity.&lt;/p&gt;

&lt;p&gt;This approach gives developers the best of both worlds: the powerful vectorization capabilities of pgai Vectorizer with the flexibility to use their existing database infrastructure, regardless of provider. The Python library handles the creation of database objects that house the pgai Vectorizer internals, and provides a SQL API that handles loading data, creating embeddings, and synchronizing changes, all while writing the results back to your Postgres database.&lt;/p&gt;

&lt;p&gt;The library maintains all the core functionality that made pgai Vectorizer valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding creation and management:&lt;/strong&gt; Automatically create and synchronize vector embeddings from &lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines" rel="noopener noreferrer"&gt;Postgres data and S3 documents&lt;/a&gt;. Embeddings update automatically as data changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready out of the box:&lt;/strong&gt; Supports batch processing for efficient embedding generation, with built-in handling for model failures, rate limits, and latency spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation and testing:&lt;/strong&gt; &lt;a href="https://www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noopener noreferrer"&gt;&lt;u&gt;Easily switch between embedding models&lt;/u&gt;&lt;/a&gt;, test different models, and compare performance without changing application code or manually reprocessing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plays well with pgvector and pgvectorscale:&lt;/strong&gt; Once your embeddings are created, use them to power vector and semantic search with &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;. Embeddings are stored in the pgvector data format. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this means for existing users:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timescale Cloud customers:&lt;/strong&gt; Existing vectorizers running on Timescale Cloud will continue to work as is, so no immediate action is necessary. We encourage you to use the new &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;pgai Python library&lt;/u&gt;&lt;/a&gt; to create and manage new vectorizers. To do so, you have to upgrade to the latest version of both the pgai extension in Timescale Cloud and the pgai Python library. Upgrading the extension decouples the vectorizer-related database objects from the extension, therefore allowing them to be managed by the Python library. Pgai Vectorizer remains in Early Access on Timescale Cloud. &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/migrating-from-extension.md" rel="noopener noreferrer"&gt;&lt;u&gt;See this guide&lt;/u&gt;&lt;/a&gt; for details and instructions on upgrading and migrating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted users:&lt;/strong&gt; Existing self-hosted vectorizers will also continue to work as is, so no immediate action is required. If you already have the pgai extension installed, you’ll need to upgrade to version 0.10.1. Upgrading the extension decouples the vectorizer-related database objects from the extension, therefore allowing them to be created and managed by the Python library. &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/migrating-from-extension.md" rel="noopener noreferrer"&gt;&lt;u&gt;See this guide&lt;/u&gt;&lt;/a&gt; for self-hosted upgrade and migration details and instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this means for new users:&lt;/strong&gt; Whether you use pgai Vectorizer on Timescale Cloud or self-hosted, this change means a simplified installation process and more flexibility—you now have tighter integrations between pgai Vectorizer and your search and RAG backends in your AI applications. Self-hosted users no longer need to install the pgai extension to use pgai Vectorizer. Timescale Cloud customers will continue to get the pgai extension auto-installed for them. To try pgai Vectorizer for yourself, &lt;a href="https://github.com/timescale/pgai#quick-start" rel="noopener noreferrer"&gt;&lt;u&gt;here’s how you can get started&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pgai Vectorizer Works With Any Postgres Database
&lt;/h2&gt;

&lt;p&gt;The new Python library implementation of pgai Vectorizer works with virtually any Postgres database, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;Timescale Cloud&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Self-hosted Postgres&lt;/li&gt;
&lt;li&gt;Amazon RDS for PostgreSQL&lt;/li&gt;
&lt;li&gt;Supabase&lt;/li&gt;
&lt;li&gt;Google Cloud SQL for PostgreSQL&lt;/li&gt;
&lt;li&gt;Azure Database for PostgreSQL&lt;/li&gt;
&lt;li&gt;Neon PostgreSQL&lt;/li&gt;
&lt;li&gt;Render PostgreSQL&lt;/li&gt;
&lt;li&gt;DigitalOcean Managed Databases&lt;/li&gt;
&lt;li&gt;Any other self-hosted or managed Postgres service running PostgreSQL 15 or later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new implementation addresses one of our most requested features from the community. Users were actively building AI applications with these managed services, but couldn't take advantage of pgai Vectorizer's powerful embedding management capabilities.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Use Pgai Vectorizer: A Quick Refresher
&lt;/h2&gt;

&lt;p&gt;A standout feature of the new Python library is its enhanced support for document processing directly from cloud storage. &lt;/p&gt;

&lt;p&gt;With the expanded Amazon S3 integration, you can now seamlessly load documents and generate embeddings based on file URLs stored in your Postgres table. Pgai Vectorizer automatically loads and parses each document into an LLM-friendly format like Markdown, then generates the required chunks for embedding creation, all according to your specification.&lt;/p&gt;

&lt;p&gt;For document vectorization, we've included support for parsing multiple formats, including PDF, DOCX, XLSX, HTML, images, and more using &lt;a href="https://research.ibm.com/publications/docling-an-efficient-open-source-toolkit-for-ai-driven-document-conversion" rel="noopener noreferrer"&gt;&lt;u&gt;IBM Docling&lt;/u&gt;&lt;/a&gt;, which provides advanced document understanding capabilities. This makes it easy to build powerful document search and retrieval systems without leaving the Postgres ecosystem.&lt;/p&gt;
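&lt;p&gt;Under the hood, "generating the required chunks" means recursively splitting text on progressively finer separators until every piece fits the configured chunk size. A minimal pure-Ruby sketch of that idea (illustrative only, not pgai's implementation):&lt;/p&gt;

```ruby
# Illustrative sketch of recursive character splitting, the idea behind
# chunking_recursive_character_text_splitter: try the coarsest separator
# first, then recurse with finer separators until each piece fits.
def recursive_split(text, separators, max_len)
  return [text] if max_len >= text.length || separators.empty?
  separator, *finer = separators
  parts = text.split(separator)
  return recursive_split(text, finer, max_len) if parts.length == 1
  parts.flat_map { |part| recursive_split(part, finer, max_len) }
end

recursive_split('one two three four', [' '], 5)
# returns ["one", "two", "three", "four"]
```

A production splitter also merges adjacent pieces back together up to the chunk size and applies overlap; this sketch only splits.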

&lt;p&gt;Getting started with the pgai Vectorizer Python library is straightforward. Install pgai on your database via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pgai
pgai install -d postgresql://postgres:postgres@localhost:5432/postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, your database is enhanced with pgai’s capabilities. Here's a simple example of how to create a vectorizer for processing text data from a database column named ‘text’:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'wiki'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if_not_exists&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loading_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-ada-002'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'1536'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;view_name&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'wiki_embedding'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For document processing, you can use this configuration, which shows a document metadata table in PostgreSQL with references to data in Amazon S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Document source table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;owner_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_level&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Example with rich metadata&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;access_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Product Manual'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/documents/product-manual.pdf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'application/pdf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'internal'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'reference'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'API Reference'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/documents/api-reference.md'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'text/markdown'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'api'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'developer'&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_vectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'document'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;regclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loading_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'uri'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;chunking&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chunking_recursive_character_text_splitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;separators&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;## '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;### '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;#### '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;- '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;1. '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'?'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'!'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'|'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_openai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'text-embedding-3-small'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'document_embeddings'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the worker via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;pgai&lt;/span&gt; &lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; 
&lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5432&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And watch the magic happen as pgai creates vector embeddings for your source data.&lt;/p&gt;
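
&lt;p&gt;Once the worker has run, retrieval is plain SQL. Here is a minimal sketch, assuming the vectorizer above exposes its output through a &lt;code&gt;document_embeddings&lt;/code&gt; view and that pgai's &lt;code&gt;ai.openai_embed&lt;/code&gt; helper is available (exact names and signatures may differ by pgai version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the five chunks closest to a natural-language question.
-- The &amp;lt;=&amp;gt; cosine-distance operator comes from pgvector.
SELECT title,
       chunk,
       embedding &amp;lt;=&amp;gt; ai.openai_embed('text-embedding-3-small',
                                      'How do I rotate API keys?',
                                      dimensions =&amp;gt; 768) AS distance
FROM document_embeddings
ORDER BY distance
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;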

&lt;h2&gt;
  
  
  Get Started With Pgai Vectorizer Today
&lt;/h2&gt;

&lt;p&gt;We're excited to see what you'll build with the new pgai Vectorizer, whether you're creating semantic search, RAG, or next-gen agentic applications.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;GitHub repository&lt;/u&gt;&lt;/a&gt; to explore its capabilities and getting-started guides.&lt;/p&gt;

&lt;p&gt;As you can tell by this post, we really value community feedback. If you encounter any issues or have suggestions for improvements, please open an &lt;a href="https://github.com/timescale/pgai/issues" rel="noopener noreferrer"&gt;&lt;u&gt;issue on GitHub&lt;/u&gt;&lt;/a&gt; or join our &lt;a href="https://discord.gg/KRdHVXAmkp" rel="noopener noreferrer"&gt;&lt;u&gt;community Discord&lt;/u&gt;&lt;/a&gt;. Your input will help shape the future development of pgai Vectorizer as we continue to enhance its capabilities.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>python</category>
      <category>ai</category>
      <category>news</category>
    </item>
    <item>
      <title>PostgreSQL vs. Qdrant for Vector Search: 50M Embedding Benchmark</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Fri, 02 May 2025 14:35:37 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgresql-vs-qdrant-for-vector-search-50m-embedding-benchmark-3hhe</link>
      <guid>https://dev.to/tigerdata/postgresql-vs-qdrant-for-vector-search-50m-embedding-benchmark-3hhe</guid>
      <description>&lt;p&gt;Vector search is becoming a core workload for AI-driven applications. But do you really need to introduce a new system just to handle it?&lt;/p&gt;

&lt;p&gt;We ran a performance benchmark to find out: &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;comparing PostgreSQL (using pgvector + pgvectorscale) with Qdrant on 50 million embeddings&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results at 99% recall:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Sub-100ms query latencies&lt;/li&gt;
&lt;li&gt;471 queries per second (QPS) on Postgres—11x higher throughput than Qdrant (41 QPS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Head to the full write-up for a deep dive into our &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;vector database comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4cujejzv0axs1s27e72.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4cujejzv0axs1s27e72.jpg" alt="Postgres vs Qdrant vector database performance comparison" width="720" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  For vectors, Postgres is all you need.
&lt;/h2&gt;

&lt;p&gt;At 99% recall, Postgres delivers sub-100ms query latencies and handles 11x more query throughput than Qdrant (471 QPS vs. Qdrant’s 41 QPS).&lt;/p&gt;

&lt;p&gt;The results show that thanks to &lt;code&gt;pgvectorscale&lt;/code&gt;, &lt;a href="https://docs.timescale.com/ai/latest/sql-interface-for-pgvector-and-timescale-vector/" rel="noopener noreferrer"&gt;Postgres can keep up with specialized vector databases&lt;/a&gt; and deliver as good, if not better performance at scale. Learn more about &lt;a href="https://www.timescale.com/blog/why-postgres-wins-for-ai-and-vector-workloads" rel="noopener noreferrer"&gt;why Postgres wins for AI and vector workloads&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning PostgreSQL Into a High-Performance Vector Search Engine
&lt;/h2&gt;

&lt;p&gt;How? We built &lt;code&gt;pgvectorscale&lt;/code&gt; to push Postgres to its limits for vector workloads—without compromising recall, latency, or cost-efficiency. It turns your favorite relational database into a high-performance vector search engine.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No extra systems.&lt;/li&gt;
&lt;li&gt;✅ No new query languages.&lt;/li&gt;
&lt;li&gt;✅ Just Postgres.&lt;/li&gt;
&lt;/ul&gt;
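
&lt;p&gt;Concretely, &lt;code&gt;pgvectorscale&lt;/code&gt; ships a StreamingDiskANN index type. A minimal sketch of enabling it on an embeddings table (the table and column names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Install pgvectorscale (and pgvector via CASCADE).
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- StreamingDiskANN index on a pgvector column.
CREATE INDEX document_embedding_idx
    ON document_embedding
    USING diskann (embedding vector_cosine_ops);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;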

&lt;p&gt;We used &lt;a href="https://rtabench.com/" rel="noopener noreferrer"&gt;RTABench&lt;/a&gt; to run a transparent, reproducible evaluation—designed for real-world, high-scale workloads.&lt;/p&gt;

&lt;p&gt;Curious about the architecture behind it all?&lt;/p&gt;

&lt;p&gt;👉 Read our whitepaper on &lt;a href="https://docs.timescale.com/about/latest/whitepaper/" rel="noopener noreferrer"&gt;building Timescale for real-time and AI workloads&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It dives into how we engineered Timescale to handle time-series, vector, and relational data—all in one Postgres-native platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: For many vector workloads, Postgres is all you need.
&lt;/h2&gt;

&lt;p&gt;Have you used Postgres or Qdrant for vector search?&lt;br&gt;
What does your stack look like today—and where do you feel the friction?&lt;/p&gt;

&lt;p&gt;👉 Postgres vs Qdrant: which side are you on? Comment down below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Connecting S3 and Postgres: Automatic Synchronization Without ETL Pipelines</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Thu, 01 May 2025 12:32:36 +0000</pubDate>
      <link>https://dev.to/tigerdata/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines-32kg</link>
      <guid>https://dev.to/tigerdata/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines-32kg</guid>
      <description>&lt;p&gt;Modern applications need data that's both accessible and fast. You have data in S3, but transforming it into usable insights requires complex ETL (extract-transform-load) pipelines. With our new &lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines" rel="noopener noreferrer"&gt;livesync for S3 and pgai Vectorizer features&lt;/a&gt;, Timescale transforms how you interact with S3 data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Powerful Postgres–S3 Integration Approaches
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06aeidm498g1kyknudmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06aeidm498g1kyknudmc.png" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our new features offer distinct approaches to working with S3 data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines#transform-s3-to-analytics-in-seconds-automatic-data-synchronization-with-livesync" rel="noopener noreferrer"&gt;&lt;strong&gt;Livesync for S3&lt;/strong&gt;&lt;/a&gt; brings your structured S3 data directly into Postgres tables, automatically synchronizing files as they change.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.timescale.com/blog/connecting-s3-and-postgres-automatic-synchronization-without-etl-pipelines#simplify-document-embeddings-with-pgai-vectorizer" rel="noopener noreferrer"&gt;&lt;strong&gt;pgai Vectorizer&lt;/strong&gt; leaves documents in S3 but generates searchable embeddings and metadata in Postgres&lt;/a&gt;, connecting unstructured content with structured data for RAG, search, and agentic applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both eliminate complex ETL pipelines, letting you work with S3 data using familiar SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transform S3 to Analytics in Seconds: Automatic Data Synchronization With Livesync
&lt;/h2&gt;

&lt;p&gt;S3 is where countless organizations store their data, but Timescale Cloud is where they unlock insights. Livesync for S3 bridges this gap, eliminating the traditional complexity of moving data between these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem: Complex ETL pipelines for S3 data
&lt;/h3&gt;

&lt;p&gt;Data management challenges create significant obstacles when bridging S3 storage and analytics environments. Organizations struggle with the manual effort required to transport data between S3 buckets and analytical databases, requiring custom integration code that demands ongoing maintenance. This challenge is compounded by the brittle and resource-intensive nature of maintaining ETL processes.&lt;/p&gt;

&lt;p&gt;Many organizations find themselves caught in a constant battle to ensure data freshness, requiring vigilant monitoring systems to confirm that analytics platforms accurately reflect the most current information in S3 repositories. The culmination of these challenges frequently manifests as performance bottlenecks, where inefficient data transfer mechanisms cause critical delays in delivering up-to-date information to customer-facing applications, leading to poor user experiences and customers making decisions based on stale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Automatic data synchronization
&lt;/h3&gt;

&lt;p&gt;We've engineered livesync for S3 to bring stream-like behavior to object storage, effectively turning your S3 bucket into a continuous data feed.&lt;/p&gt;

&lt;p&gt;Our solution delivers speed and simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-ETL experience&lt;/strong&gt; : Eliminate complex pipelines or custom integration code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data pipeline&lt;/strong&gt; : Turn your S3 bucket into a continuous data feed with automatic synchronization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar tools&lt;/strong&gt; : Use S3 for storage and Timescale Cloud for analytics without compromise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal configuration&lt;/strong&gt; : Connect to your S3 bucket, define mapping, and let livesync handle the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How livesync works
&lt;/h3&gt;

&lt;p&gt;Behind the scenes, we're doing the heavy lifting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema mapping that infers table structure from your CSV or Parquet files and maps it to hypertables&lt;/li&gt;
&lt;li&gt;Managing the initial data load&lt;/li&gt;
&lt;li&gt;Maintaining continuous synchronization&lt;/li&gt;
&lt;li&gt;Intelligent tracking of processed files to prevent duplicates or missed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables teams across multiple industries to build robust pipelines. For organizations with production applications on Postgres looking to scale their real-time analytics, livesync for S3 has a sister solution—&lt;a href="https://www.timescale.com/blog/connect-any-postgres-to-real-time-analytics" rel="noopener noreferrer"&gt;&lt;u&gt;livesync for Postgres&lt;/u&gt;&lt;/a&gt;—which lets you keep your Postgres as-is while streaming data in real time to a Timescale Cloud instance optimized for analytical workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  The inner workings of livesync for S3
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Secure cross-account authentication
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbunjw7pte7q1v6jobbwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbunjw7pte7q1v6jobbwx.png" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Livesync employs a robust security model using AWS role assumption. Our service assumes a specific role in your AWS account with precisely the permissions needed to access your S3 data. To prevent confused deputy attacks, we implement the industry-standard External ID verification using your unique Project ID/Service ID combination.&lt;/p&gt;

&lt;h4&gt;
  
  
  Smart polling and file discovery
&lt;/h4&gt;

&lt;p&gt;Behind the scenes, livesync intelligently scans your S3 bucket using optimized ListObjectsV2 calls. Starting with the prefix from your pattern (like "logs/" from "logs/**/*.csv"), it applies glob matching to find relevant files. The system tracks processed files in lexicographical order, ensuring no file is missed or duplicated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij6v8264zidawnxzo2ze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij6v8264zidawnxzo2ze.png" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To maintain performance, livesync for S3 manages an orderly queue limited to 100 files per connection. When files are plentiful, polling accelerates to every minute; when caught up, it follows your configured schedule. You can always trigger immediate processing with the "Pull now" button.&lt;/p&gt;

&lt;h4&gt;
  
  
  Optimized data processing pipeline
&lt;/h4&gt;

&lt;p&gt;Livesync handles different file formats with specialized techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CSV files&lt;/strong&gt; are analyzed for compression (UTF-8, ZIP, GZIP), then processed using high-performance parallel ingestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet files&lt;/strong&gt; undergo efficient conversion before being streamed into TimescaleDB (which lives at the core of your Timescale Cloud service).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire pipeline includes intelligent error handling, which is clearly visible in the dashboard. After three consecutive failures, livesync automatically pauses to prevent resource waste, awaiting your review.&lt;/p&gt;

&lt;p&gt;This architecture delivers the perfect balance of reliability, performance, and operational simplicity, bringing your S3 data into Timescale Cloud with minimal configuration and maximum confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build powerful ingest pipelines with minimal configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IoT telemetry flows:&lt;/strong&gt; Connect devices that log to S3 (like AWS IoT Core) directly to time-series analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming data persistence:&lt;/strong&gt; Automatically process data from Kinesis, Kafka, or other streaming platforms that land files in S3 and transform into TimescaleDB hypertables for high-performance querying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crypto/financial data analytics:&lt;/strong&gt; Sync trading data from S3 into TimescaleDB for real-time analytics on recent market movements and long-term historical analysis for backtesting and trend identification.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently supporting CSV and Parquet file formats, livesync delivers a frictionless way to unlock the value of your data stored in S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcelzm9f7sfzbwblx9bjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcelzm9f7sfzbwblx9bjm.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple setup, powerful results
&lt;/h3&gt;

&lt;p&gt;Livesync for S3 continuously monitors your S3 bucket for incoming sensor data, automatically maps schemas, and syncs data into TimescaleDB hypertables in minutes. This enables operators to query millions of readings with millisecond latency, driving real-time dashboards that catch anomalies before equipment fails. Livesync for S3 ensures that syncing from S3 to hypertables remains smooth, dependable, and lightning-fast.&lt;/p&gt;

&lt;p&gt;Setting up &lt;a href="https://docs.timescale.com/migrate/latest/livesync-for-s3/" rel="noopener noreferrer"&gt;&lt;u&gt;livesync for S3&lt;/u&gt;&lt;/a&gt; is surprisingly straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to your S3 bucket with your credentials.&lt;/li&gt;
&lt;li&gt;Define how your objects map to TimescaleDB tables.&lt;/li&gt;
&lt;li&gt;Let livesync for S3 handle the rest—monitoring and ingesting new data automatically.&lt;/li&gt;
&lt;/ol&gt;
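
&lt;p&gt;Step 2 amounts to describing the destination table, though schema mapping can infer it from your files. A hypothetical sketch of an IoT-style target (names and columns invented for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical destination for sensor CSVs synced from S3.
CREATE TABLE sensor_readings (
    ts          TIMESTAMPTZ NOT NULL,
    device_id   TEXT        NOT NULL,
    temperature DOUBLE PRECISION,
    humidity    DOUBLE PRECISION
);

-- Partition by time so synced rows land in a hypertable.
SELECT create_hypertable('sensor_readings', 'ts');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;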

&lt;p&gt;Behind the scenes, we're doing the heavy lifting of schema mapping, managing the initial data load, and maintaining continuous synchronization. The system intelligently tracks what it's processed, so you never have duplicate data or missed files.&lt;/p&gt;

&lt;p&gt;For example, in manufacturing environments where sensors continuously capture critical equipment data through AWS IoT Core and store it in S3, livesync ensures this data becomes immediately queryable in TimescaleDB. This enables operators to identify anomalies before equipment fails, turning static S3 storage into actionable intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero maintenance, maximum performance
&lt;/h3&gt;

&lt;p&gt;Once configured, livesync for S3 delivers ease and performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-maintenance operation&lt;/strong&gt; once configured&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema mapping&lt;/strong&gt; that infers table structure from your CSV or Parquet files and maps it to hypertables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic retry mechanisms&lt;/strong&gt; for transient failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained control&lt;/strong&gt; over which objects sync and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete observability&lt;/strong&gt; with detailed history of file imports and error messages (if any)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Simplify Document Embeddings With Pgai Vectorizer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Searching unstructured document embeddings with pgvector
&lt;/h3&gt;

&lt;p&gt;While livesync brings S3 data into Postgres, pgai Vectorizer takes a different approach for unstructured documents. It creates searchable vector embeddings in Postgres from documents stored in S3 while keeping the original files in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem: Complex pipelines for document search
&lt;/h3&gt;

&lt;p&gt;AI applications using RAG (retrieval-augmented generation) can help businesses unlock insights from mountains of unstructured data. Today, that unstructured data’s natural home is Amazon S3. On the other hand, Postgres has become the default vector database for developers, thanks to extensions like &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;. These extensions enable them to build intelligent applications with vector search capabilities without needing to use a separate database just for vectors.&lt;/p&gt;

&lt;p&gt;We’ve previously written about how &lt;a href="https://www.timescale.com/blog/vector-databases-are-the-wrong-abstraction" rel="noopener noreferrer"&gt;&lt;u&gt;vector databases are the wrong abstraction&lt;/u&gt;&lt;/a&gt; because they divorce vector embeddings from their source data, losing the connection between the unstructured data being embedded and the embeddings themselves. This problem is especially apparent for documents housed in object storage like Amazon S3.&lt;/p&gt;

&lt;p&gt;Before pgai Vectorizer, developers typically needed to manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex ETL pipelines to chunk, format, and create embeddings from source data&lt;/li&gt;
&lt;li&gt;Multiple systems: a vector database for embeddings, an application database for metadata, and possibly a separate lexical search index&lt;/li&gt;
&lt;li&gt;Data synchronization services to maintain a single source of truth&lt;/li&gt;
&lt;li&gt;Queuing systems for updates and synchronization&lt;/li&gt;
&lt;li&gt;Monitoring tools to catch data drift and handle rate limits from embedding services&lt;/li&gt;
&lt;li&gt;Alert systems for stale search results&lt;/li&gt;
&lt;li&gt;Validation checks across all these systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Processing documents in AI pipelines introduces several challenges, such as managing diverse file formats (PDFs, DOCX, XLSX, HTML, and more), handling complex metadata, keeping embeddings up to date with document changes, and ensuring efficient storage and retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Automatic document vectorization
&lt;/h3&gt;

&lt;p&gt;To solve these challenges, Timescale has added support for document vectorization to pgai Vectorizer, giving developers an automated way to create embeddings from documents in Amazon S3 and keep those embeddings synchronized as the underlying data changes, eliminating the need for external ETL pipelines and queuing systems.&lt;/p&gt;

&lt;p&gt;Pgai Vectorizer provides a streamlined approach where developers can reference documents in S3 (or local storage) via URLs stored in a database table. The vectorizer then handles the complete workflow—downloading documents, parsing them to extract content, chunking text appropriately, and generating embeddings for use in semantic search, RAG, or agentic applications.&lt;/p&gt;

&lt;p&gt;This integration supports a wide variety of file formats, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents: PDF, DOCX, TXT, MD, AsciiDoc&lt;/li&gt;
&lt;li&gt;Spreadsheets: CSV, XLSX&lt;/li&gt;
&lt;li&gt;Presentations: PPTX&lt;/li&gt;
&lt;li&gt;Images: PNG, JPG, TIFF, BMP&lt;/li&gt;
&lt;li&gt;Web content: HTML&lt;/li&gt;
&lt;li&gt;Books: MOBI, EPUB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, pgai Vectorizer for document vectorization offers three key benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Get started more easily&lt;/strong&gt; → Automatic embedding creation with a simple SQL command manages the entire workflow from document reference to searchable embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spend less time wrangling data infrastructure&lt;/strong&gt; → Automatic updating and synchronization of embeddings means your vector search stays current with your S3 documents without manual intervention. It’s as simple as adding a new row or updating a “modified_at” column in the documents table, and pgai Vectorizer will take care of any (re)processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuously improve your AI systems&lt;/strong&gt; → Testing and experimentation with different embedding models or chunking strategies can be done &lt;a href="https://www.timescale.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noopener noreferrer"&gt;&lt;u&gt;with a single line of SQL&lt;/u&gt;&lt;/a&gt;, allowing you to optimize your application's performance.&lt;/li&gt;
&lt;/ol&gt;
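
&lt;p&gt;In practice, keeping embeddings current is just a row change against the source table. A sketch against an illustrative &lt;code&gt;document&lt;/code&gt; table with a &lt;code&gt;uri&lt;/code&gt; column and a &lt;code&gt;modified_at&lt;/code&gt; column (all names hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Inserting a row queues the new document for embedding...
INSERT INTO document (title, uri, modified_at)
VALUES ('Release Notes',
        's3://my-bucket/documents/release-notes.md',
        now());

-- ...and touching modified_at re-queues an existing one.
UPDATE document
SET modified_at = now()
WHERE title = 'Product Manual';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;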

&lt;p&gt;By keeping your embeddings automatically synchronized to the source documents in S3, pgai Vectorizer ensures that your Postgres database remains the single source of truth for both your structured and vector data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Under the hood: How pgai Vectorizer works with Amazon S3
&lt;/h2&gt;

&lt;p&gt;Pgai Vectorizer simplifies the entire document processing pipeline through a streamlined architecture that connects your Amazon S3 documents with Postgres. Here's how it works:&lt;/p&gt;

&lt;h4&gt;
  
  
  Architecture overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0nxj834n1hvuxevl0oo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy0nxj834n1hvuxevl0oo.png" width="800" height="485"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture overview of pgai Vectorizer: The vectorizer system takes in source data from Postgres tables and S3 buckets, creates embeddings via worker processes running in AWS Lambda using user-specified parsing, chunking, and embedding configurations, and stores the final embeddings in Postgres tables using the pgvector data type.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The pgai Vectorizer architecture for document vectorization consists of several key components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data sources: Postgres and Amazon S3&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text and metadata residing in Postgres tables&lt;/li&gt;
&lt;li&gt;Postgres tables containing URLs that reference documents in Amazon S3 (which serves as the data aggregation layer where your documents reside)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Vectorization configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stored in Postgres, allowing you to manage everything through familiar SQL commands&lt;/li&gt;
&lt;li&gt;Defines chunking strategies, embedding models, and processing parameters&lt;/li&gt;
&lt;/ul&gt;
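
&lt;p&gt;Because the configuration lives in Postgres, you can inspect it with ordinary SQL. A sketch using pgai's catalog objects (object names follow the pgai docs and may vary by version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Stored vectorizer definitions (loading, chunking, embedding config).
SELECT * FROM ai.vectorizer;

-- Queue depth per vectorizer: items still awaiting embedding.
SELECT * FROM ai.vectorizer_status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;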

&lt;p&gt;&lt;strong&gt;Vectorizer worker (AWS Lambda)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A daemon process that handles the actual work of processing documents&lt;/li&gt;
&lt;li&gt;Responsible for downloading, parsing, chunking, and embedding creation&lt;/li&gt;
&lt;li&gt;Automatically manages synchronization between source documents and embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Destination&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All embeddings are stored in Postgres alongside metadata&lt;/li&gt;
&lt;li&gt;Enables unified queries across both structured data and vector embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Document processing pipeline
&lt;/h4&gt;

&lt;p&gt;The document vectorization process follows these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Documents are referenced via URLs stored in a database column.&lt;/li&gt;
&lt;li&gt;The vectorizer downloads documents using these URLs.&lt;/li&gt;
&lt;li&gt;Documents are parsed to extract text content in an embedding-friendly format. &lt;/li&gt;
&lt;li&gt;The content is chunked using configurable chunking strategies.&lt;/li&gt;
&lt;li&gt;Chunks are processed for embedding generation using your chosen embedding model.&lt;/li&gt;
&lt;li&gt;Embeddings are stored in Postgres with references to the source documents.&lt;/li&gt;
&lt;/ol&gt;
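&lt;p&gt;The steps above can be sketched in miniature. This is an illustrative Python sketch, not the vectorizer's actual implementation: the &lt;code&gt;download&lt;/code&gt;, &lt;code&gt;parse&lt;/code&gt;, and &lt;code&gt;embed&lt;/code&gt; callables are hypothetical stand-ins for the worker's internals, and the chunker shows just one simple strategy (fixed-size with overlap).&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    document_url: str  # reference back to the source document (step 6)
    seq: int
    text: str

def chunk_text(text, chunk_size=40, overlap=10):
    """Fixed-size character chunking with overlap (one simple strategy)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def vectorize(url, download, parse, embed):
    raw = download(url)        # step 2: fetch the document from its URL
    text = parse(raw)          # step 3: extract embedding-friendly text
    chunks = chunk_text(text)  # step 4: apply the chunking strategy
    # steps 5-6: embed each chunk and keep a reference to the source document
    return [(Chunk(url, i, c), embed(c)) for i, c in enumerate(chunks)]
```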

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1phx1teqqkarcuu6zg0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1phx1teqqkarcuu6zg0u.png" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pgai Vectorizer document processing pipeline showing how files in Amazon S3 get parsed, chunked, formatted, and embedded in order to be used in vector search queries in a Postgres database.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key components
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Loader:&lt;/strong&gt; Loads files from Amazon S3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parser:&lt;/strong&gt; Extracts content from retrieved files, handling different document formats&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chunking:&lt;/strong&gt; Splits content into appropriate sizes for embedding models&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Formatting:&lt;/strong&gt; Organizes chunks with metadata from the source files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embedding generator:&lt;/strong&gt; Processes chunks into vector embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use cases for pgai Vectorizer document vectorization
&lt;/h3&gt;

&lt;p&gt;Pgai Vectorizer's document vectorization capabilities enable several powerful use cases across industries by connecting S3-stored documents with Postgres vector search:&lt;/p&gt;

&lt;h4&gt;
  
  
  Financial analysis
&lt;/h4&gt;

&lt;p&gt;Automatically vectorize financial documents from S3 without custom pipelines. Connect document insights with quantitative metrics for unified queries.&lt;/p&gt;

&lt;h4&gt;
  
  
  Legal document management
&lt;/h4&gt;

&lt;p&gt;Maintain synchronized knowledge bases of legal documents with automatic embedding updates. Test different models for your specific domain.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enhanced customer support
&lt;/h4&gt;

&lt;p&gt;Make knowledge base content immediately searchable as it changes, connecting support documents with customer data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Research systems
&lt;/h4&gt;

&lt;p&gt;Build research AI with continuously updated paper collections, connecting published findings with experimental time-series data.&lt;/p&gt;

&lt;p&gt;In each case, pgai Vectorizer eliminates infrastructure complexity while enabling continuous improvement through its "set it and forget it" synchronization and simple experimentation capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try out the S3 features in Timescale Cloud today
&lt;/h2&gt;

&lt;p&gt;Livesync and pgai Vectorizer are just the first steps in our vision to unify Postgres and object storage into a single, powerful lakehouse-style architecture—built for real-time AI and analytics. &lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://docs.timescale.com/migrate/latest/livesync-for-s3/" rel="noopener noreferrer"&gt;&lt;u&gt;Try Livesync for S3.&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer/document-embeddings.md" rel="noopener noreferrer"&gt;&lt;u&gt;Try pgai Vectorizer.&lt;/u&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;Sign up for Timescale Cloud&lt;/u&gt;&lt;/a&gt; and get started in seconds.&lt;/p&gt;

&lt;p&gt;We can’t wait to see what you build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 4 preview: &lt;em&gt;Developer Tools That Speed Up Your Workflow: Introducing SQL Assistant, Recommendation Engine, and Insights&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Tomorrow, we'll reveal how Timescale delivers high-speed performance without sacrificing simplicity through &lt;strong&gt;SQL Assistant with agent mode&lt;/strong&gt;, &lt;strong&gt;recommendation engine&lt;/strong&gt;, and &lt;strong&gt;Insights&lt;/strong&gt;. See how plain-language queries eliminate SQL wrangling, how automated tuning keeps databases optimized with a single click, and why developers finally get both the millisecond response times users demand and the operational simplicity teams need.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>aws</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Postgres vs. Qdrant: Why Postgres Wins for AI and Vector Workloads</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Wed, 30 Apr 2025 16:01:47 +0000</pubDate>
      <link>https://dev.to/tigerdata/postgres-vs-qdrant-why-postgres-wins-for-ai-and-vector-workloads-3d71</link>
      <guid>https://dev.to/tigerdata/postgres-vs-qdrant-why-postgres-wins-for-ai-and-vector-workloads-3d71</guid>
      <description>&lt;p&gt;It's Timescale Launch Week and we’re bringing benchmarks: &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;Postgres vs. Qdrant on 50M Embeddings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There’s a belief in the AI infrastructure world that you need to abandon general-purpose databases to get great performance on vector workloads. The logic goes: Postgres is great for transactions, but when you need high-performance vector search, it’s time to bring in a specialized vector database like Qdrant.&lt;/p&gt;

&lt;p&gt;That logic doesn’t hold—just like it didn’t when we benchmarked &lt;a href="https://www.timescale.com/blog/pgvector-vs-pinecone" rel="noopener noreferrer"&gt;pgvector vs. &lt;u&gt;Pinecone&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like everything in Launch Week, this is about speed without sacrifice. And in this case, Postgres delivers both.&lt;/p&gt;

&lt;p&gt;We’re releasing a new benchmark that challenges the assumption that you can only scale with a specialized vector database. We compared Postgres (with &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt;) to Qdrant on a massive dataset of 50 million embeddings. The results show that Postgres not only holds its own but also delivers standout throughput and latency, even at production scale.&lt;/p&gt;

&lt;p&gt;This post summarizes the key takeaways, but it’s just the beginning. &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;&lt;u&gt;Check out the full benchmark blog post&lt;/u&gt;&lt;/a&gt; on query performance, developer experience, and operational experience.&lt;/p&gt;

&lt;p&gt;Let’s dig into what we found and what it means for teams building production AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark: Postgres vs. Qdrant on 50M Embeddings
&lt;/h2&gt;

&lt;p&gt;We tested Postgres and Qdrant on a level playing field:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;50 million embeddings&lt;/strong&gt;, each with 768 dimensions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ANN-Benchmarks&lt;/strong&gt;, the industry-standard benchmarking tool&lt;/li&gt;
&lt;li&gt;Focused on &lt;strong&gt;approximate nearest neighbor (ANN) search&lt;/strong&gt;, no filtering&lt;/li&gt;
&lt;li&gt;All benchmarks run on identical AWS hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway? Postgres with pgvector and pgvectorscale showed significantly higher throughput while maintaining sub-100 ms latencies. Qdrant performed strongly on tail latencies and index build speed, but Postgres pulled ahead where it matters most for teams scaling to production workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox73b5q8gbq353qnicwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox73b5q8gbq353qnicwc.png" alt="Vector search query throughput at 99 % recall (bar graph). Postgres with pgvector and pgvectorscale processes 471.57 queries per second vs. Qdrant's 41.47." width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the complete results, including detailed performance metrics, graphs, and testing configurations, &lt;a href="https://www.timescale.com/blog/pgvector-vs-qdrant" rel="noopener noreferrer"&gt;&lt;u&gt;read the full benchmark blog post&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters: AI Performance Without the Rewrite
&lt;/h2&gt;

&lt;p&gt;These results aren’t just a technical curiosity. They have &lt;strong&gt;real implications&lt;/strong&gt; for how you architect your AI stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Production-grade latency:&lt;/strong&gt; Postgres with pgvectorscale delivers the sub-100 ms p99 latencies needed to power real-time or responsive AI applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher concurrency:&lt;/strong&gt; Postgres delivered significantly higher throughput, meaning you can support more simultaneous users without scaling out as aggressively.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lower complexity:&lt;/strong&gt; You don't need to manage and integrate a separate, specialized vector database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operational familiarity:&lt;/strong&gt; You leverage the reliability, tooling, and operational practices you already have with Postgres.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL-first development:&lt;/strong&gt; You can filter, join, and integrate vector search naturally with relational data, without learning new APIs or query languages.&lt;/li&gt;
&lt;/ul&gt;
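&lt;p&gt;As a concrete sketch of that SQL-first experience, here is what a filtered vector search might look like, assuming a hypothetical &lt;code&gt;documents&lt;/code&gt; table with a pgvector &lt;code&gt;embedding&lt;/code&gt; column and a relational &lt;code&gt;teams&lt;/code&gt; table to join against:&lt;/p&gt;

```sql
-- Hypothetical schema: documents(id, team_id, title, embedding vector(768)),
-- teams(id, name). pgvector's &lt;=&gt; operator computes cosine distance.
SELECT d.id, d.title
FROM documents d
JOIN teams t ON t.id = d.team_id
WHERE t.name = 'platform'
ORDER BY d.embedding &lt;=&gt; $1  -- $1 is the query embedding
LIMIT 10;
```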

&lt;p&gt;Postgres with pgvector and pgvectorscale gives you the performance of a specialized vector database &lt;em&gt;without&lt;/em&gt; giving up the ecosystem, tooling, and developer experience that make Postgres the world’s most popular database.&lt;/p&gt;

&lt;p&gt;You don’t need to split your stack to do vector search.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes It Work: Pgvectorscale and StreamingDiskANN
&lt;/h2&gt;

&lt;p&gt;How can Postgres compete with (and outperform) purpose-built vector databases?&lt;/p&gt;

&lt;p&gt;The answer lies in &lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale&lt;/u&gt;&lt;/a&gt; (part of the &lt;a href="https://github.com/timescale/pgai" rel="noopener noreferrer"&gt;&lt;u&gt;pgai&lt;/u&gt;&lt;/a&gt; family), which implements the StreamingDiskANN index (a disk-based ANN algorithm built for scale) for pgvector. Combined with Statistical Binary Quantization (SBQ), &lt;a href="https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data" rel="noopener noreferrer"&gt;&lt;u&gt;it balances memory usage and performance&lt;/u&gt;&lt;/a&gt; better than traditional in-memory HNSW (hierarchical navigable small world) implementations.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can run large-scale vector search on standard cloud hardware.&lt;/li&gt;
&lt;li&gt;You don’t need massive memory footprints or expensive GPU-accelerated nodes.&lt;/li&gt;
&lt;li&gt;Performance holds steady even as your dataset grows to tens or hundreds of millions of vectors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All while staying inside Postgres.&lt;/p&gt;
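&lt;p&gt;In practice, enabling this is a couple of SQL statements. A sketch, assuming a hypothetical &lt;code&gt;documents&lt;/code&gt; table and pgvectorscale's &lt;code&gt;diskann&lt;/code&gt; index type (check the pgvectorscale README for current options):&lt;/p&gt;

```sql
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- Build a StreamingDiskANN index on a hypothetical embedding column.
CREATE INDEX document_embedding_idx
ON documents
USING diskann (embedding vector_cosine_ops);
```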

&lt;h2&gt;
  
  
  When to Choose Postgres, and When Not To
&lt;/h2&gt;

&lt;p&gt;To be clear: Qdrant is a capable system. It has faster index builds and lower tail latencies. It’s a strong choice if you’re not already using Postgres, or for specific use cases that need native scale-out and purpose-built vector semantics.&lt;/p&gt;

&lt;p&gt;However, for many teams—especially those already invested in Postgres— &lt;strong&gt;it makes no sense to introduce a new database&lt;/strong&gt; just to support vector search.&lt;/p&gt;

&lt;p&gt;If you want high recall, high throughput, and tight integration with your existing stack, Postgres is more than enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to Try It?
&lt;/h2&gt;

&lt;p&gt;Pgvector and pgvectorscale are open source and available today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector GitHub&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/timescale/pgvectorscale" rel="noopener noreferrer"&gt;&lt;u&gt;pgvectorscale GitHub&lt;/u&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Or save time and access both by creating a &lt;a href="https://timescale.com/signup" rel="noopener noreferrer"&gt;&lt;u&gt;free Timescale Cloud account&lt;/u&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector search in Postgres isn’t a hack or a workaround. It’s fast, it scales, and it works. If you’re building AI applications in 2025, you don’t have to sacrifice your favorite database to move fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Up Next at Timescale Launch Week
&lt;/h2&gt;

&lt;p&gt;Next up, we’re taking Postgres even further: Learn how to stream external S3 data into Postgres with livesync for S3 and work with S3 data in place using the pgai Vectorizer. Two powerful ways to seamlessly integrate external data from S3 directly into your Postgres workflows!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>postgres</category>
      <category>vectordatabase</category>
      <category>news</category>
    </item>
    <item>
      <title>Connect Any Postgres to Real-Time Analytics</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Tue, 29 Apr 2025 18:14:58 +0000</pubDate>
      <link>https://dev.to/tigerdata/connect-any-postgres-to-real-time-analytics-5anm</link>
      <guid>https://dev.to/tigerdata/connect-any-postgres-to-real-time-analytics-5anm</guid>
      <description>&lt;p&gt;TLDR: We built livesync for Postgres to solve the analytics-vs-stability dilemma. Stream data from any Postgres instance directly into Timescale Cloud with zero downtime and no application changes. It performs historical backfills at 150GB/hour while capturing live changes through CDC, automatically converting tables to hypertables. Your production database remains untouched while you gain columnar storage, compression, and time-partitioning capabilities that dramatically accelerate queries. No more complex ETL pipelines or risky migrations, just high-performance analytics without compromising system reliability. &lt;a href="https://www.timescale.com/blog/connect-any-postgres-to-real-time-analytics" rel="noopener noreferrer"&gt;Read the full article&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://www.timescale.com/blog/scaling-postgresql-to-petabyte-scale" rel="noopener noreferrer"&gt;Scaling real-time analytics on Postgres&lt;/a&gt; has always been a balancing act. Timescale was built to solve this problem: to make Postgres scalable, fast, and analytics-ready without sacrificing reliability.&lt;/p&gt;

&lt;p&gt;But teams with production applications on vanilla Postgres (or locked into other database-as-a-service platforms) often find themselves stuck. They face an impossible choice: risk downtime to migrate, build brittle ETL (extract-transform-load) pipelines, or live with the slow drag of overloaded systems.&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://docs.timescale.com/migrate/latest/livesync-for-s3/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Timescale livesync for Postgres&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; to break this cycle—and today, we’re excited to kick off a &lt;strong&gt;Timescale Launch Week&lt;/strong&gt; by introducing it.&lt;/p&gt;

&lt;p&gt;Livesync lets you stream data from any Postgres database, whether it's running on Amazon RDS for PostgreSQL, Amazon Aurora, Azure PostgreSQL, self-hosted, or elsewhere, into Timescale Cloud with zero downtime, no application rewrites, and no disruption to your existing systems.&lt;/p&gt;

&lt;p&gt;You get the analytical speed of Timescale’s &lt;a href="https://docs.timescale.com/use-timescale/latest/hypertables/" rel="noopener noreferrer"&gt;&lt;u&gt;hypertables&lt;/u&gt;&lt;/a&gt; and &lt;a href="https://docs.timescale.com/use-timescale/latest/hypercore/" rel="noopener noreferrer"&gt;&lt;u&gt;hypercore&lt;/u&gt;&lt;/a&gt;, providing seamless time-based partitioning, automatic columnar storage, and unbeatable compression without sacrificing the production stability you’ve worked hard to build. If you’re after a refresher on what Timescale can do, then check out our &lt;a href="https://docs.timescale.com/about/latest/whitepaper/" rel="noopener noreferrer"&gt;&lt;u&gt;whitepaper&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You get speed without sacrifice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Livesync for Real-Time Analytics?
&lt;/h2&gt;

&lt;p&gt;When your application needs to deliver real-time insights, your options have traditionally been limited:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Risky full migrations&lt;/strong&gt; , often requiring downtime windows that never feel safe enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy, custom-built ETL pipelines&lt;/strong&gt; that introduce complexity, lag, and new points of failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overloaded production systems&lt;/strong&gt; , where even simple dashboards slow down, and you risk impacting your critical transactional workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In an ideal world, every application that faces these problems would migrate directly to Timescale (we are just Postgres after all!), but we understand that’s not always realistic. For many teams, the stakes are too high to move fast and break things. Systems powering financial transactions, IoT networks, and SaaS platforms can’t tolerate disruption, yet they can't ignore the need for analytics.&lt;/p&gt;

&lt;p&gt;Livesync offers a new path forward: &lt;strong&gt;keep your existing Postgres exactly as it is&lt;/strong&gt; , but stream your data in real time to a dedicated Timescale Cloud instance optimized for analytical workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No downtime. No risk. No performance hits to production.&lt;/strong&gt; To get the most out of Timescale Cloud, we’d still advise that you start planning to fully migrate in the future, but for now you have blazing-fast real-time analytics.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Livesync for Postgres Works
&lt;/h2&gt;

&lt;p&gt;Livesync uses &lt;a href="https://www.postgresql.org/docs/current/logical-replication.html" rel="noopener noreferrer"&gt;&lt;u&gt;Postgres’ logical replication&lt;/u&gt;&lt;/a&gt; protocol to stream changes from your production database into Timescale Cloud. But rather than simply duplicating your data, livesync extends logical replication with high-throughput ingestion, automatic hypertable creation, and a cloud-native architecture that prepares your data for real-time analytics at scale.&lt;/p&gt;
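&lt;p&gt;Livesync handles setup through the Timescale Cloud console, but it helps to know what standard logical replication prerequisites look like on the source side. A rough sketch (the table names are placeholders; see the livesync docs for the exact requirements):&lt;/p&gt;

```sql
-- On the source database: logical decoding must be enabled
-- (changing wal_level requires a restart).
ALTER SYSTEM SET wal_level = 'logical';

-- A publication defines which tables are streamed to subscribers.
CREATE PUBLICATION analytics_pub FOR TABLE metrics, events;
```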

&lt;h3&gt;
  
  
  Zero-downtime setup and continuous replication
&lt;/h3&gt;

&lt;p&gt;When you first connect livesync for Postgres, it performs an initial historical backfill, copying your existing data into Timescale at speeds of up to 150 GB per hour. &lt;/p&gt;

&lt;p&gt;At the same time, it begins capturing and streaming live changes through change data capture (CDC), recording every insert, update, and delete from your source Postgres database as they happen. You can choose exactly which tables to replicate, moving only the data you need for analytics, while the rest of your production system continues operating normally, with no maintenance windows, no locks, and no downtime risk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futss8utz8cc0766rt86l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futss8utz8cc0766rt86l.png" alt="Data Flow - Existing data + Real time replication" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Behind the scenes, livesync uses a microservice that connects to your source database as a logical replication subscriber, consuming changes directly from your Postgres publication.&lt;/p&gt;

&lt;p&gt;Unlike pure logical replication, livesync for Postgres automatically prepares your data for high-performance analytics: It can create corresponding tables inside Timescale Cloud and configure them as hypertables, unlocking immediate benefits like native time-based partitioning and faster time-series queries.&lt;/p&gt;

&lt;p&gt;Once your data is live in Timescale Cloud, you can optionally enable columnar storage and compression to further accelerate analytics and optimize storage, without modifying your ingestion or sync setup. This tight integration ensures that livesync for Postgres doesn't just mirror your data—it sets it up to scale.&lt;/p&gt;
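&lt;p&gt;Those optimizations are standard TimescaleDB SQL once the data is in Timescale Cloud. A sketch against a hypothetical &lt;code&gt;metrics&lt;/code&gt; table:&lt;/p&gt;

```sql
-- Partition the table on its time column (livesync can do this for you).
SELECT create_hypertable('metrics', 'time', migrate_data => true);

-- Enable columnar compression, segmenting by device for per-device scans,
-- then compress chunks once they are a week old.
ALTER TABLE metrics SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');
```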

&lt;h2&gt;
  
  
  Non-Invasive, Read-Only Architecture for Real-Time Analytics
&lt;/h2&gt;

&lt;p&gt;Livesync is designed to be as non-invasive as possible. It creates a lightweight publication and replication slot on your Postgres database but does not alter your existing schemas, application code, or database connections.&lt;/p&gt;

&lt;p&gt;Your production environment remains fully operational at all times. Livesync simply observes and streams changes without interfering, making it especially well-suited for production systems with strict stability, compliance, or uptime requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-Performance, Scalable by Design
&lt;/h2&gt;

&lt;p&gt;Livesync for Postgres is engineered for high throughput across both historical and live data.&lt;/p&gt;

&lt;p&gt;During the initial backfill, livesync achieves historical copy speeds of up to 150 GB per hour. During live operations, it can sustain 30,000 to 40,000 DML operations per second.&lt;/p&gt;

&lt;p&gt;Future improvements, including intra-table parallelization and adoption of &lt;a href="https://www.postgresql.org/docs/current/protocol-logical-replication.html" rel="noopener noreferrer"&gt;&lt;u&gt;logical replication protocol v2&lt;/u&gt;&lt;/a&gt; (which allows streaming of in-flight transactions rather than send-on-commit semantics), are already on our roadmap to push these limits even further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seamless Integration with Existing Architectures
&lt;/h2&gt;

&lt;p&gt;Livesync fits cleanly into your existing infrastructure, whether you're running RDS, Aurora, Azure Database for PostgreSQL, or a self-managed instance.&lt;/p&gt;

&lt;p&gt;Your operational systems continue working exactly as they do today. Livesync simply adds a Timescale-powered analytics layer next to your source database, letting you redirect analytical queries to Timescale by switching to a new connection string when you're ready.&lt;/p&gt;

&lt;p&gt;No replatforming, no rewrites, no downtime. Just faster real-time analytics in SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Livesync
&lt;/h2&gt;

&lt;p&gt;Livesync is the right choice when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero downtime is non-negotiable&lt;/strong&gt;, such as in financial systems, healthcare apps, and production SaaS platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need real-time insights today&lt;/strong&gt; but can't risk application rewrites or complex migrations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You want to adopt Timescale incrementally&lt;/strong&gt;, starting small and expanding over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You’re evaluating Timescale performance&lt;/strong&gt; before committing to a full migration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether you're monitoring 10,000 IoT sensors, analyzing 50 million transactions, or building dashboards for end users, livesync gets you there—without the "lift and pray" gamble of traditional migration projects.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(That said, if you're open to consolidating on a single database, check out our&lt;/em&gt; &lt;a href="https://docs.timescale.com/migrate/latest/" rel="noopener noreferrer"&gt;&lt;em&gt;&lt;u&gt;migration tooling&lt;/u&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, which takes the risk out of full migrations to Timescale.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Livesync Applications
&lt;/h2&gt;

&lt;p&gt;Livesync is built for real-world systems where uptime, performance, and gradual adoption matter. Here’s how different verticals might use it to accelerate analytics without disruption:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Financial services:&lt;/strong&gt; Offload heavy historical queries to Timescale without risking downtime or introducing disruptive schema changes. Keep OLTP workloads fast and stable while running complex analytics separately.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IoT environments:&lt;/strong&gt; Send millions of time-series events daily directly into hypertables, enabling real-time rollups, faster trend analysis, and storage optimizations like compression—without custom ETL pipelines or manual partitioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS-native architectures:&lt;/strong&gt; Layer Timescale Cloud analytics on top of existing RDS or Aurora deployments, delivering sub-second analytical performance without replatforming or disrupting operational systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case, livesync removes the traditional pain of "faster analytics" projects, delivering immediate real-world results without sacrificing production stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Livesync
&lt;/h2&gt;

&lt;p&gt;Getting started is simple!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Create a Timescale Cloud service&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connect livesync&lt;/strong&gt; to your existing Postgres—no code changes required (see the Actions tab when you manage your service).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your source database&lt;/strong&gt;, either by providing your host, port, user, and database or a Postgres connection string.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Select your tables&lt;/strong&gt;, configure hypertables, and hit start.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Watch your data flow&lt;/strong&gt; into Timescale—with full visibility, full control, and no disruption.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once live, you can redirect your analytical queries to Timescale’s battle-tested storage engine, unlocking sub-second dashboards and deep insights without impacting your production system. &lt;/p&gt;

&lt;h2&gt;
  
  
  And For Our Next Trick...
&lt;/h2&gt;

&lt;p&gt;Connecting any Postgres to real-time analytics is just the start. This week is a Timescale Launch Week, a celebration of new features and new ways to move faster without compromise. Every day, we’re showing how you can deliver speed without sacrifice across your entire stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Day 1:&lt;/strong&gt; Connect Any Postgres to Real-Time Analytics — Start using livesync for Postgres and add real-time analytics to your existing stack.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 2:&lt;/strong&gt; Compare pgvector and Qdrant — See a side-by-side breakdown of two popular open-source vector search options.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 3:&lt;/strong&gt; Connect S3 Data to Postgres — Use livesync for S3 and the pgai Vectorizer to work with external data directly inside Postgres.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 4:&lt;/strong&gt; Supercharge Your Developer Experience — Use our SQL Assistant’s new AI agent mode, get recommendations to tune your instance, and explore new query visibility tools with Timescale Insights.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Day 5:&lt;/strong&gt; Strengthen Security and Compliance — Maintain control while scaling your analytics performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try Livesync Today
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;Start your free Timescale trial&lt;/a&gt; and set up livesync for Postgres today.&lt;/p&gt;

&lt;p&gt;And stay tuned: in the coming weeks, we'll dive deeper into how we built our livesync products.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>datascience</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Building IoT Pipelines for Faster Analytics With IoT Core</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Mon, 28 Apr 2025 14:58:20 +0000</pubDate>
      <link>https://dev.to/tigerdata/building-iot-pipelines-for-faster-analytics-with-iot-core-26n2</link>
      <guid>https://dev.to/tigerdata/building-iot-pipelines-for-faster-analytics-with-iot-core-26n2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb13dfy22d6sohs885vva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb13dfy22d6sohs885vva.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is AWS IoT Core?
&lt;/h2&gt;

&lt;p&gt;AWS IoT Core is Amazon’s managed IoT service that lets you easily connect and manage Internet of Things devices. This can be done using the MQTT protocol, which was specifically designed for IoT use cases to be lightweight and fault-tolerant. If MQTT isn’t your cup of tea, you also have the option to use HTTP, although this is less common for reasons beyond this post.&lt;/p&gt;

&lt;p&gt;While IoT Core can be used solely as a message broker, allowing IoT devices to send and receive messages by way of publishing and subscribing to MQTT topics, you can also use the message routing functionality to receive select messages using a SQL-like rule and forward them to other Amazon Web Services. These include Kinesis, SQS, and Kafka, but most importantly, AWS Lambda.&lt;/p&gt;
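&lt;p&gt;A rule's SQL statement selects and reshapes matching messages. A minimal sketch, assuming devices publish JSON payloads to hypothetical &lt;code&gt;sensors/{device}/data&lt;/code&gt; topics:&lt;/p&gt;

```sql
-- '+' matches exactly one topic segment, so this covers every device.
-- topic(2) pulls the second segment (the device ID) into the payload.
SELECT *, topic(2) AS device_id FROM 'sensors/+/data'
```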

&lt;p&gt;If you’ve followed the Do More With Timescale on AWS series, you’ve undoubtedly seen that with a &lt;a href="https://www.timescale.com/blog/do-more-with-aws-in-timescale-an-aws-lambda-tutorial-using-sam-cli" rel="noopener noreferrer"&gt;simple AWS Lambda function&lt;/a&gt;, we can very easily insert any kind of time-series data into a Timescale database.&lt;/p&gt;

&lt;p&gt;Today, we’ll go over how to set up a message routing rule and set up a Lambda function as an action so that for every MQTT message, a Lambda function gets triggered.&lt;/p&gt;

&lt;p&gt;If you’d rather watch a video on how to achieve this, click on the video below!&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/VPsabybrizw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the Action Lambda Function
&lt;/h2&gt;

&lt;p&gt;Before deploying any resources to AWS, let’s write our AWS Lambda function, which we will use as a trigger in AWS IoT Core to insert MQTT messages into Timescale.&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, I’ve written the Lambda function in Python and kept the code as minimal as possible. A side effect is that the function lacks adequate type-checking, error handling, and secret management, so it is not recommended for use in a production environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization
&lt;/h3&gt;

&lt;p&gt;Let’s start by writing the initialization portion of our function. This will be executed only once, as long as the function stays “hot.” By importing our libraries and creating a database connection in the initialization portion, we avoid creating a new database connection for each Lambda execution. This saves valuable time (and consequently, a lot of money).&lt;/p&gt;

&lt;p&gt;First, import the &lt;code&gt;psycopg2&lt;/code&gt; library, which we’ll use to connect to our Timescale database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we create a variable called &lt;code&gt;conn_str&lt;/code&gt; which holds the connection string to our database. As mentioned previously, for the sake of simplicity, we are omitting proper secret management.&lt;/p&gt;

&lt;p&gt;An easy way to do this would be to add your connection string as an environment variable to your Lambda function and use the &lt;code&gt;os.getenv&lt;/code&gt; function to retrieve it or to use an alternative secret management solution like AWS Secrets Manager.&lt;/p&gt;
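&lt;p&gt;As a minimal sketch of the environment-variable approach (the variable name &lt;code&gt;TIMESCALE_CONNECTION_STRING&lt;/code&gt; is just an example, not something AWS or Timescale defines):&lt;/p&gt;

```python
import os

def get_conn_str(var="TIMESCALE_CONNECTION_STRING"):
    # Read the connection string from the Lambda's environment; the variable
    # name is illustrative -- use whatever your deployment configures.
    conn_str = os.getenv(var)
    if conn_str is None:
        raise RuntimeError(f"{var} is not set")
    return conn_str
```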

&lt;p&gt;Since we will be using Lambdas, we strongly recommend adding a &lt;a href="https://docs.timescale.com/use-timescale/latest/services/connection-pooling/" rel="noopener noreferrer"&gt;connection pooler&lt;/a&gt; to your service and using transaction mode. Add a connection pooler to your service by going to the Connection info panel of your service in the Timescale dashboard, clicking the Connection pooler tab, and then clicking Add a connection pooler. From there, choose “Transaction pool” in the Pool drop-down. The Service URL will be your connection string.&lt;/p&gt;

&lt;p&gt;Do note that this connection string does not contain your service password, so you will need to manually add this as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;conn_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"postgres://tsdbadmin:passwd@service.tsdb.cloud.timescale.com:5432/tsdb"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, we create a &lt;code&gt;psycopg2&lt;/code&gt; connection object from which we create a cursor that we will use to insert rows into our database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lambda handler
&lt;/h3&gt;

&lt;p&gt;Then, we move on to the main handler of our Lambda function. This will be run exactly once for every execution.&lt;/p&gt;

&lt;p&gt;We execute a SQL insert statement into the &lt;code&gt;sensor&lt;/code&gt; hypertable. In this case, the &lt;code&gt;event&lt;/code&gt; consists of a single float, which could be a temperature reading from an IoT sensor or the battery percentage of an electric car.&lt;/p&gt;

&lt;p&gt;We also use the PostgreSQL NOW() function to indicate when this event happened. In a production environment, it might be advisable to add a timestamp on the IoT sensor itself, as MQTT messages routed through AWS IoT Core can incur a small time delay.&lt;/p&gt;

&lt;p&gt;After our execution, we commit the transaction and return the function. As mentioned at the beginning of this blog post, this simple function can benefit hugely from logging and more graceful error handling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;def&lt;/span&gt; &lt;span class="k"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;try&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;sensor&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt;
    &lt;span class="k"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"INSERT INTO sensor (time, value) VALUES (NOW(), %s);"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
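&lt;p&gt;If you do want to trust a device-supplied timestamp instead of &lt;code&gt;NOW()&lt;/code&gt;, a sketch might look like the following. It assumes the MQTT payload carries an ISO 8601 &lt;code&gt;timestamp&lt;/code&gt; field, and it takes the cursor and connection as arguments purely for testability; this is an illustration, not part of the original function.&lt;/p&gt;

```python
from datetime import datetime, timezone

def handler_with_device_time(event, context, cursor, conn):
    # Parse the device-supplied ISO 8601 timestamp from the payload
    # (an assumed field; the original handler uses NOW() instead).
    ts = datetime.fromisoformat(event["timestamp"])
    if ts.tzinfo is None:
        # Assume UTC when the device sends a naive timestamp.
        ts = ts.replace(tzinfo=timezone.utc)
    cursor.execute(
        "INSERT INTO sensor (time, value) VALUES (%s, %s);",
        (ts, event["value"]),
    )
    conn.commit()
```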



&lt;p&gt;You can find the full Lambda function code in this &lt;a href="https://github.com/mathisve/timescale-iot-core/tree/master/lambda-function" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating the Sensor Hypertable
&lt;/h2&gt;

&lt;p&gt;Before we continue, it’s important to create the sensor hypertable in our Timescale instance by executing the following SQL statements on our database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensor&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;create_hypertable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'sensor'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Lambda Function in AWS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Publish the function
&lt;/h3&gt;

&lt;p&gt;Once we’ve successfully written the code, we can package our function along with the &lt;code&gt;psycopg2&lt;/code&gt; library in a Docker container and publish it to AWS ECR. You can find the full code, Dockerfile, and build script in this &lt;a href="https://github.com/mathisve/timescale-iot-core/tree/master/lambda-function" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the function
&lt;/h3&gt;

&lt;p&gt;Once the Docker container has finished publishing to ECR, we can create a new Lambda function using its container image URI. We will name our function &lt;code&gt;timescale-insert&lt;/code&gt;. Make sure to pay attention to the architecture you used to build the Lambda function. If you built it on an Apple silicon Mac (M1 or later), this will be &lt;code&gt;arm64&lt;/code&gt;. In most other cases, you can use &lt;code&gt;x86_64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtxpmy65qqhc9uyfciwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtxpmy65qqhc9uyfciwv.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to the nature of containers, we don’t need to change any settings once the Lambda function has been created!&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the IoT Core Message Routing Rule
&lt;/h2&gt;

&lt;p&gt;IoT Core has a feature called ‘message routing,’ which allows you to (as the name suggests) route messages to different services. Throughout this section, you will learn more about how rules work.&lt;/p&gt;

&lt;p&gt;Create a new rule by pressing the orange 'Create rule' button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv96mcixspj2solaxz5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv96mcixspj2solaxz5n.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give the rule an appropriate name. Since IoT Core rule names cannot include dashes, we will name it: &lt;code&gt;timescale_insert&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zfvmxjl44uxdo4awzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7zfvmxjl44uxdo4awzq.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next up, we need to create a SQL statement that will be used to ‘query’ all incoming MQTT messages in IoT Core. This SQL statement consists of three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A SELECT clause that selects specific fields of the MQTT message payload.&lt;/li&gt;
&lt;li&gt;A FROM clause that is used to specify the MQTT topic we want to query on.&lt;/li&gt;
&lt;li&gt;A WHERE clause that is used to exclude certain MQTT messages where a condition is not met.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our query, we will be selecting every field (*) of messages on the &lt;code&gt;my-topic/thing&lt;/code&gt; topic. We won’t be using a WHERE clause because we want to insert every MQTT message on the topic into our Timescale database. Then, we click ‘Next.’&lt;/p&gt;
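&lt;p&gt;For reference, the same rule expressed as the payload you would hand to the IoT &lt;code&gt;CreateTopicRule&lt;/code&gt; API looks roughly like this. The function ARN is a placeholder, and the exact payload shape should be checked against the AWS documentation:&lt;/p&gt;

```python
# Rough equivalent of the console steps: select everything on my-topic/thing
# and hand it to the timescale-insert Lambda. The ARN below is a placeholder.
topic_rule_payload = {
    "sql": "SELECT * FROM 'my-topic/thing'",
    "awsIotSqlVersion": "2016-03-23",
    "actions": [
        {
            "lambda": {
                "functionArn": "arn:aws:lambda:us-east-1:123456789012:function:timescale-insert"
            }
        }
    ],
}
```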

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3uw6qoazwtvg5jtdau5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp3uw6qoazwtvg5jtdau5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After configuring our SQL statement, we need to add an action. Select the &lt;code&gt;Lambda&lt;/code&gt; action type, then find and select the &lt;code&gt;timescale-insert&lt;/code&gt; function we created earlier! As you can see, there is an option to add multiple rule actions to a single IoT Core rule. This would allow you to stream your data to multiple destinations, for example, two (or more) Timescale databases or hypertables. You name it; it can be done!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qemp8digujyie1qf5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9qemp8digujyie1qf5a.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, click on the orange ‘Create rule’ button.&lt;/p&gt;

&lt;p&gt;If all goes well, your newly created rule should be active!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m46ogdce2jawbua4u3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m46ogdce2jawbua4u3e.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing Our Message Routing Rule
&lt;/h2&gt;

&lt;p&gt;We’ve now written our Lambda function and set up our Timescale hypertable and IoT Core message rule. It’s time to put it all to the test!&lt;/p&gt;

&lt;p&gt;To do this, I’ve repurposed a Python script written by AWS. You can find the &lt;a href="https://docs.aws.amazon.com/iot/latest/developerguide/iot-quick-start.html?icmpid=docs_iot_hp_connect_quickstart" rel="noopener noreferrer"&gt;original AWS tutorial&lt;/a&gt; or &lt;a href="https://github.com/mathisve/timescale-iot-core/tree/master/mqtt-generator" rel="noopener noreferrer"&gt;clone the modified Python script here&lt;/a&gt;. Do note that you will have to add your own certificates and ‘things’ to AWS IoT Core for this to work (which could be a blog post on its own).&lt;/p&gt;

&lt;p&gt;In case you don’t feel like reading the Python code, these are the sequential steps taken by the script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create an MQTT connection to the AWS IoT Core endpoint using the appropriate device certificates.&lt;/li&gt;
&lt;li&gt;Generate an array of floats.&lt;/li&gt;
&lt;li&gt;Iterate over the array:
a) Synthesize JSON message with a float from the array created in step 2.
b) Publish the JSON message to the my-topic/thing MQTT topic.
c) Sleep for one second.&lt;/li&gt;
&lt;li&gt;Gracefully disconnect.&lt;/li&gt;
&lt;/ul&gt;
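&lt;p&gt;The loop portion of those steps can be sketched as follows; &lt;code&gt;publish&lt;/code&gt; stands in for the AWS IoT Device SDK's MQTT publish call, so the names and shapes here are assumptions rather than the actual script:&lt;/p&gt;

```python
import json
import random
import time

def publish_readings(publish, topic="my-topic/thing", n=5, sleep_s=1.0):
    # `publish` is any callable taking (topic, payload); in the real script it
    # wraps the MQTT connection's publish call.
    values = [random.uniform(0.0, 100.0) for _ in range(n)]   # step 2
    for value in values:                                      # step 3
        payload = json.dumps({"value": value})                # 3a: synthesize JSON
        publish(topic, payload)                               # 3b: publish
        time.sleep(sleep_s)                                   # 3c: sleep
```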

&lt;p&gt;We run this script for a handful of seconds, &lt;a href="https://www.timescale.com/blog/how-to-install-psql-on-mac-ubuntu-debian-windows" rel="noopener noreferrer"&gt;connect to our Timescale database using psql&lt;/a&gt; and execute a query to retrieve all the rows in the &lt;code&gt;sensor&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmv3m2wacms72z4m9z22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmv3m2wacms72z4m9z22.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And as you can see, we’ve achieved our goal of piping MQTT data from AWS IoT Core into a Timescale database with a single AWS Lambda function!&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS IoT Core: The End
&lt;/h2&gt;

&lt;p&gt;Congratulations! You have successfully set up an AWS IoT Core message rule that streams MQTT messages originating from a simple (albeit fake) sensor into a Timescale database.&lt;/p&gt;

&lt;p&gt;You are now primed and ready to build an endless stream of performant IoT pipelines that can accelerate your real-time IoT dashboards and analytics!&lt;/p&gt;

&lt;p&gt;If you want to learn more about how Timescale can improve your workloads on AWS, check out this &lt;a href="https://www.youtube.com/watch?v=aFAfwckBeVc&amp;amp;list=PLsceB9ac9MHRZIA2FwxAwpIsqvh0Q149H" rel="noopener noreferrer"&gt;YouTube playlist&lt;/a&gt; filled with other AWS and Timescale-related tutorials!&lt;/p&gt;




&lt;h2&gt;
  
  
  More resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/blog/do-more-on-aws-with-timescale-cloud-8-services-to-build-time-series-apps-faster/" rel="noopener noreferrer"&gt;Do More on AWS With Timescale: 8 Services to Build Time-Series Apps Faster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/blog/do-more-with-aws-and-timescale-cloud-vpc-peering" rel="noopener noreferrer"&gt;Do More on AWS With Timescale: VPC Peering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.timescale.com/blog/do-more-with-aws-in-timescale-an-aws-lambda-tutorial-using-sam-cli" rel="noopener noreferrer"&gt;Do More With AWS in Timescale: An AWS Lambda Tutorial Using SAM CLI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>postgres</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Text-to-SQL: A Developer’s Zero-to-Hero Guide</title>
      <dc:creator>Team Tiger Data</dc:creator>
      <pubDate>Fri, 25 Apr 2025 15:09:52 +0000</pubDate>
      <link>https://dev.to/tigerdata/text-to-sql-a-developers-zero-to-hero-guide-48gi</link>
      <guid>https://dev.to/tigerdata/text-to-sql-a-developers-zero-to-hero-guide-48gi</guid>
      <description>&lt;p&gt;TL;DR&lt;br&gt;
&lt;a href="https://www.timescale.com/learn/text-to-sql-a-developers-zero-to-hero-guide" rel="noopener noreferrer"&gt;Build your own text-to-SQL system&lt;/a&gt; that translates natural language into database queries. This guide covers implementation approaches from rule-based to ML models, practical code examples, and production-ready best practices for security and performance.&lt;/p&gt;


&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How to translate natural language queries into SQL with NLP&lt;/li&gt;
&lt;li&gt;Building both rule-based and ML-based text-to-SQL systems&lt;/li&gt;
&lt;li&gt;Implementing error handling, security, and performance optimizations&lt;/li&gt;
&lt;li&gt;Advanced features like multi-turn conversations and visualization&lt;/li&gt;
&lt;li&gt;Troubleshooting common challenges in real-world deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Developer's Text-to-SQL Challenge
&lt;/h2&gt;

&lt;p&gt;As developers, we've all been there:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PM:&lt;/strong&gt; "Can you pull last quarter's revenue by product category?"&lt;br&gt;
&lt;strong&gt;You:&lt;/strong&gt; "Give me an hour to write the SQL..."&lt;/p&gt;

&lt;p&gt;What if anyone in your organization could get answers directly from your database without knowing SQL? That's the promise of text-to-SQL systems.&lt;/p&gt;

&lt;p&gt;This guide will show you how to build a production-ready text-to-SQL pipeline that empowers non-technical users while maintaining security and performance. We'll focus on practical implementation rather than theory.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Building Blocks of a Text-to-SQL System: SQL and NLP
&lt;/h2&gt;

&lt;p&gt;Before diving into the details of building a text-to-SQL system, let’s understand its two core pillars: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL (Structured Query Language)&lt;/li&gt;
&lt;li&gt;Natural Language Processing (NLP)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These technologies work together to translate human-readable questions into database queries. Let’s break them down.&lt;/p&gt;
&lt;h2&gt;
  
  
  Understanding SQL
&lt;/h2&gt;

&lt;p&gt;SQL is the language of relational databases. It helps us to interact with structured data, retrieve information, and perform complex operations like filtering, sorting, and aggregating. Here’s a quick look at the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;SELECT&lt;/code&gt;: specifies the columns you want to retrieve&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;FROM&lt;/code&gt;: specifies the table containing the data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt;: filters rows based on conditions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt;: aggregates data based on one or more columns&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ORDER BY&lt;/code&gt;: sorts results in ascending or descending order&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;JOIN&lt;/code&gt;: combines data from multiple tables based on related columns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, we can create a query that calculates the total revenue by city for 2024, sorted in descending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Schema design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.timescale.com/learn/data-modeling-on-postgresql" rel="noopener noreferrer"&gt;A database schema defines the structure of your data&lt;/a&gt;, including tables, columns, and relationships. For example, a &lt;code&gt;sales&lt;/code&gt; table might have columns like &lt;code&gt;invoice_id&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;product&lt;/code&gt;, and &lt;code&gt;revenue&lt;/code&gt;. A well-designed schema allows text-to-SQL systems to generate accurate queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Natural language processing (NLP)
&lt;/h2&gt;

&lt;p&gt;NLP enables machines to understand and process human language. In the text-to-SQL context, NLP helps interpret natural language questions and map them to database structures. Here’s how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tokenization: breaking a sentence down into individual words or tokens. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Input: "Show me sales in New York."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tokens: ["Show", "me", "sales", "in", "New", "York"]&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intent recognition: identifying the user’s goal. For instance, the question "What’s the total revenue?" intends to perform an aggregation (SUM).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Entity extraction: detecting key pieces of information, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dates: "last quarter" → &lt;code&gt;WHERE date BETWEEN '2023-07-01' AND '2023-09-30'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Locations: "New York" → &lt;code&gt;WHERE city = 'New York'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schema linking: mapping natural language terms to database schema elements. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;"sales" → &lt;code&gt;sales&lt;/code&gt; table&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"revenue" → &lt;code&gt;revenue&lt;/code&gt; column&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
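&lt;p&gt;A toy sketch of the tokenization and schema-linking steps (real systems use proper NLP libraries; the tiny schema here is invented purely for illustration):&lt;/p&gt;

```python
# Invented miniature schema, just to make schema linking concrete.
SCHEMA = {"sales": ["city", "revenue", "date"]}

def tokenize(question):
    # Naive whitespace tokenization after stripping end punctuation.
    return question.strip(".?!").split()

def link_schema(tokens, schema=SCHEMA):
    # Map tokens to known tables and columns via exact (lowercased) matching.
    tables = [t for t in tokens if t.lower() in schema]
    columns = [
        t for t in tokens
        if any(t.lower() in cols for cols in schema.values())
    ]
    return tables, columns
```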

&lt;p&gt;For instance, if a user asks, “What are the top five products by sales in Q1 2023?”, an NLP model would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identify key entities like “products,” “sales,” and “Q1 2023.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Map these to corresponding database tables and columns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate an SQL query.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Q1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_name&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Text-to-SQL Implementation Approaches
&lt;/h2&gt;

&lt;p&gt;Different implementation approaches can be employed for building a text-to-SQL pipeline, depending on the complexity of the queries, the size of the database, and the level of accuracy required. Below, we’ll discuss the two primary approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Rule-based systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine learning-based systems&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rule-based systems
&lt;/h3&gt;

&lt;p&gt;Rule-based systems depend on manually crafted rules and heuristics to convert natural language queries into SQL commands. These systems are deterministic, which means they adhere to a fixed set of instructions to generate queries.&lt;/p&gt;

&lt;p&gt;Rule-based systems work by parsing natural language inputs into structured representations and then applying a set of predefined templates or grammatical rules to generate SQL queries. For example, the rule for the query, “Show me sales in New York last quarter," can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="nv"&gt;"sales"&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"in [location]"&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"last quarter"&lt;/span&gt;  
&lt;span class="k"&gt;THEN&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;  
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;location&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_of_quarter&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;end_of_quarter&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the generated SQL query will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;  
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'New York'&lt;/span&gt;  
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2023-07-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2023-09-30'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But as databases grew in size and complexity, rule-based systems became impractical, paving the way for machine learning-based approaches.&lt;/p&gt;
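&lt;p&gt;To make the determinism concrete, here is a toy one-rule translator in Python. The regex and the hard-coded quarter dates mirror the example above; a real system would need many such rules:&lt;/p&gt;

```python
import re

def rule_based_to_sql(question):
    # Single hand-crafted rule: "sales in [location] last quarter".
    m = re.search(r"sales in ([A-Z][\w ]+?) last quarter", question)
    if m is None:
        return None  # no rule matched
    location = m.group(1).strip()
    # Quarter dates hard-coded for illustration, as in the example rule.
    return (
        "SELECT * FROM sales "
        f"WHERE city = '{location}' "
        "AND date BETWEEN '2023-07-01' AND '2023-09-30';"
    )
```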

&lt;h3&gt;
  
  
  Machine learning-based systems
&lt;/h3&gt;

&lt;p&gt;Machine learning (ML) approaches to text-to-SQL use algorithms to learn how to map between natural language inputs and SQL queries. These systems can handle more complex and varied queries compared to rule-based methods.&lt;/p&gt;

&lt;p&gt;Machine learning models depend on feature engineering to extract relevant information from the input text and database schema. Features such as part-of-speech tags, named entities, and schema metadata (e.g., table names and column types) are extracted from the input, and a classifier or regression model then predicts the corresponding SQL query based on these features.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSTM-based models
&lt;/h3&gt;

&lt;p&gt;Long short-term memory (LSTM) networks were among the first deep-learning approaches applied to text-to-SQL tasks. They can effectively model the sequential nature of natural language and SQL queries. &lt;/p&gt;

&lt;p&gt;For instance, Sequence-to-Sequence (Seq2Seq) architectures commonly used with LSTMs treat the problem as a translation task, converting natural language sequences into SQL sequences. They consist of two elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An encoder processes the input natural language query and generates a context vector that captures the query’s meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A decoder uses the context vector to generate the SQL query step-by-step.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transformer-based models
&lt;/h3&gt;

&lt;p&gt;Transformer-based models, like BERT, GPT, and Llama, have become the dominant approach in text-to-SQL. These models use a self-attention mechanism, allowing them to understand contextual relationships in the input text and the database schema much more effectively. Self-attention enables the model to understand, for example, that "top five products" implies sorting and limiting results. &lt;/p&gt;

&lt;p&gt;Moreover, transformers can better handle schema information by incorporating it into the model's input or using specialized schema encoding techniques.&lt;/p&gt;
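&lt;p&gt;One common way to incorporate schema information is simply to serialize it into the model's input. A minimal sketch follows; the prompt format is an assumption, and production systems tune this heavily:&lt;/p&gt;

```python
def build_prompt(question, schema):
    # Serialize each table as "TABLE name(col1, col2, ...)" so the model can
    # attend to schema elements alongside the question.
    schema_lines = [
        f"TABLE {table}({', '.join(columns)})"
        for table, columns in schema.items()
    ]
    return (
        "Translate the question into SQL.\n"
        + "\n".join(schema_lines)
        + f"\nQuestion: {question}\nSQL:"
    )
```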

&lt;h2&gt;
  
  
  Best Text-to-SQL Practices and Considerations
&lt;/h2&gt;

&lt;p&gt;Building a text-to-SQL system is more than just wiring together &lt;a href="https://www.timescale.com/blog/how-nlp-cloud-monitors-their-language-ai-api" rel="noopener noreferrer"&gt;NLP models&lt;/a&gt; and databases. You need to adopt industry-tested practices and anticipate common pitfalls to ensure reliability, scalability, and security. There are actionable strategies to optimize your system—which we’ll discuss next—including schema design, error handling, and navigating real-world challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data preparation and schema design
&lt;/h3&gt;

&lt;p&gt;The quality of your database schema directly impacts the performance and accuracy of your text-to-SQL system. Ensure that your database is well-structured, with normalized tables to minimize redundancy. Use intuitive and descriptive column names that align with natural language terms. Provide metadata about tables, columns, and relationships (e.g., &lt;code&gt;unit_price&lt;/code&gt; → "USD, before tax") to help the system map natural language inputs to the correct schema elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Good Schema  &lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;-- Total amount in USD  &lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Poor Schema  &lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tbl1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;col1&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;col2&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;col3&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="n"&gt;col4&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling ambiguity and user intent
&lt;/h3&gt;

&lt;p&gt;Natural language is inherently ambiguous, and users may phrase queries in unexpected ways. Addressing this ambiguity is crucial for generating accurate SQL queries. One study found that nearly 20% of user questions are problematic; of those, roughly 55% are ambiguous and 45% are unanswerable.&lt;/p&gt;

&lt;p&gt;There are multiple ways to handle the ambiguities, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clarification prompts: If the input is unclear, prompt the user for clarification. This approach improves user experience and reduces errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Synonym mapping: Map synonyms and variations to standardized terms in the database schema. For example, recognize “earnings,” “revenue,” and “income” as referring to the &lt;code&gt;sales_amount&lt;/code&gt; column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context awareness: Maintain context across multi-turn conversations to handle follow-up questions effectively. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
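&lt;p&gt;The synonym-mapping strategy can be sketched as a simple normalization pass that runs before SQL generation. The mapping table and the &lt;code&gt;sales_amount&lt;/code&gt; column name are illustrative:&lt;/p&gt;

```python
# Normalize business terms in the question to canonical schema names
# before handing the text to the SQL generator.

SYNONYMS = {
    "earnings": "sales_amount",
    "revenue": "sales_amount",
    "income": "sales_amount",
}

def normalize(question: str) -> str:
    words = question.lower().split()
    return " ".join(SYNONYMS.get(w, w) for w in words)

print(normalize("total revenue by region"))  # total sales_amount by region
```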

&lt;h3&gt;
  
  
  Error handling
&lt;/h3&gt;

&lt;p&gt;Even the most advanced systems will occasionally generate incorrect queries, so plan for failures to maintain user trust. Implementing an error-handling strategy ensures a smooth user experience. Error-handling strategies can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Graceful error messages: These provide clear and actionable feedback when a query fails or produces no results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fallback strategies: If the primary model fails, refer to simpler methods (e.g., rule-based templates) or ask the user to rephrase their query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Logging and monitoring: Log failed queries and analyze them to identify patterns or recurring issues. Use this data to improve the system iteratively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example (Python):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def handle(query):
    try:
        return {"sql": generate_sql(query)}
    except AmbiguityError as e:
        return {"error": "Please clarify your question.", "options": e.options}
    except UnsafeQueryError:
        return {"error": "This query is not permitted."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Security and privacy concerns
&lt;/h3&gt;

&lt;p&gt;Because text-to-SQL systems interact directly with databases, security must be a priority to protect your data from malicious or accidental harm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Access control: Restrict access to sensitive tables or columns based on user roles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Input validation: Sanitize user inputs to prevent SQL injection attacks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data masking: Mask sensitive information in query results (e.g., partial credit card numbers or anonymized customer IDs).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Audit trails: Maintain logs of all queries executed through the system to track usage and detect unauthorized activity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
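&lt;p&gt;Two of these practices can be sketched with the standard-library &lt;code&gt;sqlite3&lt;/code&gt; module: reject anything that is not a single read-only statement, and bind user-supplied values as parameters instead of interpolating them into the SQL string. The allowlist check below is deliberately coarse; a production system should use a real SQL parser:&lt;/p&gt;

```python
import sqlite3

def is_read_only(sql: str) -> bool:
    # Coarse allowlist: a single statement that starts with SELECT.
    stripped = sql.strip().rstrip(";")
    return stripped.upper().startswith("SELECT") and ";" not in stripped

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INT, total REAL)")
conn.execute("INSERT INTO sales VALUES (1, 99.5)")

sql = "SELECT total FROM sales WHERE order_id = ?"
assert is_read_only(sql)
rows = conn.execute(sql, (1,)).fetchall()  # value bound, not interpolated
print(rows)  # [(99.5,)]
```

&lt;p&gt;Parameter binding keeps generated SQL and user values separate, which closes off the classic injection path even when the natural-language input is adversarial.&lt;/p&gt;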

&lt;h3&gt;
  
  
  Performance optimization
&lt;/h3&gt;

&lt;p&gt;Efficient query generation and execution are essential for delivering timely results, especially for large-scale databases. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Indexing: Ensure that frequently queried columns are indexed to speed up search operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching: Cache frequently requested queries and their results to reduce database load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Query simplification: Optimize generated SQL queries by removing unnecessary joins or filters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Parallel processing: Leverage parallelism for complex queries involving multiple tables or aggregations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
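&lt;p&gt;The caching strategy can be as simple as memoizing the (expensive) generation step so repeated questions skip the model call entirely. The &lt;code&gt;generate_sql&lt;/code&gt; body below is a stand-in for a real model:&lt;/p&gt;

```python
from functools import lru_cache

calls = 0  # track how often the "model" actually runs

@lru_cache(maxsize=256)
def generate_sql(question: str) -> str:
    global calls
    calls += 1
    return "SELECT COUNT(*) FROM users"  # stand-in for model output

generate_sql("how many users do we have")
generate_sql("how many users do we have")  # served from cache
print(calls)  # 1
```

&lt;p&gt;For caching query &lt;em&gt;results&lt;/em&gt; rather than generated SQL, add an expiry policy so cached answers do not go stale as the underlying data changes.&lt;/p&gt;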

&lt;h2&gt;
  
  
  Advanced Features in Text-to-SQL Systems
&lt;/h2&gt;

&lt;p&gt;Beyond the basics, advanced capabilities can significantly boost a text-to-SQL system's usability, scalability, and user satisfaction. Below are the key advanced features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contextual understanding and multi-turn conversations
&lt;/h3&gt;

&lt;p&gt;One significant improvement in modern text-to-SQL systems is their ability to maintain context across multiple interactions, enabling multi-turn conversations. This feature is particularly useful when users refine their queries based on previous results or ask follow-up questions.&lt;/p&gt;

&lt;p&gt;For instance, if a user asks about sales from the last quarter and then follows up with a request to break it down by product line, the system understands that the second query refers to the same time period. The system reduces repetition and frustration by maintaining session-based memory and tracking entities like dates or regions mentioned earlier, enabling users to build on previous queries without starting over.&lt;/p&gt;
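&lt;p&gt;Session-based memory of this kind can be sketched as a dictionary of slots carried between turns, where a follow-up overrides only the slots it mentions. The slot names and values here are illustrative:&lt;/p&gt;

```python
# Carry filters from earlier turns into follow-up queries.

def merge_context(session: dict, new_filters: dict) -> dict:
    # Follow-up turns override only the slots they mention;
    # everything else (e.g., the time period) is preserved.
    merged = dict(session)
    merged.update(new_filters)
    return merged

turn1 = {"metric": "sales", "period": "last_quarter"}
turn2 = {"group_by": "product_line"}  # "break it down by product line"
context = merge_context(turn1, turn2)
print(context)  # period from turn 1 is preserved
```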

&lt;h3&gt;
  
  
  Integration with other systems and platforms
&lt;/h3&gt;

&lt;p&gt;Text-to-SQL systems can be extended beyond standalone applications by integrating with other tools and platforms, creating end-to-end analytics workflows. Real-world use cases often require combining data from multiple sources or pushing results to external systems for further analysis or visualization.&lt;/p&gt;

&lt;p&gt;For example, connecting the system to business intelligence (BI) tools like Tableau or Power BI allows users to generate interactive dashboards and reports directly from their natural language queries. Similarly, integrating with CRM (customer relationship management) or ERP (enterprise resource planning) systems enables users to query operational data seamlessly, such as asking how many deals were closed last month. The system can also pull data from external APIs or cloud storage services, combining internal datasets with external market trends to provide a unified view of information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating visualizations from SQL output
&lt;/h3&gt;

&lt;p&gt;Transforming raw query results into visual formats is another powerful feature that enhances usability and makes data more accessible to non-technical users. Visualizations help users quickly identify trends, patterns, and outliers in the data, reducing the cognitive load associated with interpreting raw tables.&lt;/p&gt;

&lt;p&gt;Additionally, providing options to export visualizations as PDFs, PNGs, or interactive HTML files makes it easier for users to share insights with stakeholders. By presenting data in a digestible format, the system ensures that insights are not only actionable but also easily shareable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Challenges in Text-to-SQL Systems
&lt;/h2&gt;

&lt;p&gt;While text-to-SQL systems offer immense benefits for democratizing data access, they are not without their challenges. Here are common challenges developers and users face with these systems: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ambiguity in natural language queries: Natural language inputs can be vague or open to multiple interpretations, leading to incorrect SQL queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling complex queries: Text-to-SQL systems may fail to generate correct SQL for complex queries that involve joins, subqueries, or nested logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Poor schema: Poor schemas in text-to-SQL systems can lead to incorrect column or table mappings, resulting in irrelevant query results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance and scalability: Text-to-SQL systems that query large datasets or generate complex SQL can strain computational resources and slow performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Error recovery: Even the most advanced systems occasionally generate incorrect queries. Implementing robust error recovery strategies is essential to maintaining user trust and improving the system iteratively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Text-to-SQL connects human language with database queries, enabling users to access and analyze data without writing code. It uses NLP to understand user intent, translating natural language questions into SQL and mapping them to the database schema.&lt;/p&gt;

&lt;p&gt;The main advantages of using text-to-SQL include enhanced data accessibility for non-technical users and quicker data analysis. For time-series data, leveraging a powerful time-series database like Timescale Cloud can greatly &lt;a href="https://www.timescale.com/cloud" rel="noopener noreferrer"&gt;improve the performance and scalability of your text-to-SQL system&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To experience the power of time-series data with text-to-SQL, &lt;a href="https://console.cloud.timescale.com/signup" rel="noopener noreferrer"&gt;try Timescale today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>nlp</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
