DEV Community: Tiger Data (Creators of TimescaleDB)

AI's Physical Constraints: How AI Rewired the Data Center

Team Tiger Data — Thu, 02 Jul 2026 19:34:34 +0000

For most of the cloud era, a server rack was a five to twenty kilowatt object. You could fill a room with them, move air across the front, and the building stayed an ordinary building. A single current AI rack, NVIDIA's GB300 NVL72, draws about 132 to 140 kilowatts, with the GPUs alone accounting for more than a hundred. That is close to an order of magnitude more power in the same floor space as those old racks, and it lands as heat in the same small volume. Past roughly a hundred kilowatts per rack, air stops being able to carry the heat out, and the rack has to be plumbed for liquid. The compute got denser and the building changed with it.

This pattern repeats all the way out to the grid. For about fifteen years, getting more computing power felt like turning a dial. You needed more, you asked for more, and a few seconds later it was there. Spin up a hundred servers for a traffic spike, spin them back down when it passes. Capacity behaved like something continuous, instant, and reversible, a knob you turned rather than a thing you built. A generation of software was designed on that assumption, and it held, because for ordinary workloads the power and hardware involved were small against what the world could supply.

That has changed. Across AI infrastructure projects, the same moment now repeats. A team asks for more, and the answer comes back no. Not "no, that costs more," which everyone understands, but a harder no. "No, those GPUs are not available this quarter." "No, that region has no more power, and will not for years." You can order a GPU in a day; the date that a few hundred megawatts arrives at a site can be four or five years out, and no amount of money moves it sooner. The request that used to be a billing question has become a physical one.

Anyone who has designed, built, or run a data center knows the physical layer was always there. The people who design and build these facilities sized the transformers, ordered the switchgear, planned the cooling, and waited on the utility. What is new is the scale at which AI hits the physical layer. The data centers going up for AI are a different class of build: denser, hotter, hungrier, and more tightly coupled to the grid than the ones that came before.

Data centers always needed chips, memory, cooling, power, and water. Most cloud workloads before the AI surge kept those requirements in a range the existing build could absorb. AI pushes them past the thresholds where the old assumptions hold. It does not create a new kind of physics. It removes the buffer that made the physics easy to ignore.

Each of these limits has been written about on its own. What gets missed is how they connect, and why AI makes them arrive together. AI scaling moves through a physical dependency chain: more accelerators require scarce chip packaging and memory; more memory and compute concentrate heat; concentrated heat changes the rack and the building; the building then needs power the grid may take years to deliver; at that scale, the power itself may have to be buffered inside the facility; and the cooling choices made along the way determine where water becomes a problem. The limits arrive in a predictable order, and the order is the story:

GPUs. The first visible shortage is accelerators, the GPUs that do the AI computation, but the real bottleneck sits around the chip, not in it.
Memory. The accelerator depends on high-bandwidth memory, which pulls on the same finite wafer base as ordinary memory.
Cooling. More compute and memory in the same space means more heat in the same rack, past what air can carry.
Power and time. Liquid cooling moves heat, but every watt still has to come from the grid. First you wait for power to arrive; then, at AI scale, you may need to buffer the workload's own power swings.
Water. Not the national catastrophe the headlines suggest, but a local siting constraint shaped by cooling design.

Let's start with the one everyone already knows.

GPUs: The Bottleneck Is Not the Chip

The first wall everyone notices is GPUs. You cannot get them, you cannot get enough, or the price to rent them has climbed since last year. The figures are not subtle. H100 rental prices rose roughly 40 percent off their late-2025 lows in a matter of months, from about $1.70 to $2.35 per GPU-hour on one-year contracts between October 2025 and March 2026, and on-demand capacity is effectively sold out across GPU types. The pressure reaches the workstation end too. In June 2026 NVIDIA listed its RTX Pro 6000 Blackwell at $13,250, a 55 percent jump over the $8,565 launch price a year earlier, and the reason it gave was the 96 gigabytes of memory on the card in a market where memory is scarce.

The obvious reading is that NVIDIA cannot make enough chips, but that is not where the bottleneck lives. A modern AI accelerator is not one chip but a package: the processor die, stacks of high-bandwidth memory (HBM), and an interposer that wires them together at enormous bandwidth. A faster processor does not help if it cannot be packaged with memory, and advanced packaging capacity is finite, specifically the chip-on-wafer-on-substrate (CoWoS) lines at TSMC. The same analysis that tracked the rental spike named CoWoS packaging and HBM, not the processor, as the choke points. The lead times that stretch GPU orders toward a year are gated there. The chip is not the scarce thing; what surrounds it is.

So the GPU shortage is really a packaging and memory shortage wearing a GPU label. Packaging capacity can expand, but it expands on a manufacturing clock. Memory is the harder half, and the reasons it stays scarce are the next wall.

Memory: Why the Price Will Not Come Down

When a component spikes in price, the reflex is to wait. Shortages end, factories ramp, the price comes back down. That reflex is wrong here, and the reason is structural, not cyclical.

The memory AI systems need comes in two kinds. HBM is the fast, expensive memory stacked right next to the accelerator, where the model's working data lives during computation. Dynamic random-access memory (DRAM) is the ordinary system memory around it. The binding shortage is HBM, and the pressure spills into DRAM because both pull on the same finite wafer base. SK Hynix, the leading maker of HBM, locked up its 2026 HBM output well ahead of the year, and Micron has likewise reported its 2026 HBM sold out. Memory has gone from a rounding error in a machine's bill of materials to the single largest driver of the price on a high-end card.

So why not just make more? Essentially all the world's DRAM comes from three companies: Samsung and SK Hynix in South Korea, and Micron in the United States. When all three make the same allocation call at once, that is the global supply. There is no fourth maker at comparable scale waiting to undercut them.

Adding supply means building a fabrication plant, and a fab is not a factory you stand up in a quarter. It takes years of cleanroom construction, tool installation, and qualification before a single sellable chip comes out. HBM makes the squeeze worse, not better: each gigabyte of it uses about three times the wafer capacity of DDR5, today's standard volume DRAM, so every wafer redirected to the scarce thing makes the common thing scarcer still, and HBM already consumes roughly a quarter of all DRAM wafer output.

There is a second reason, and it is a choice rather than a constraint. The makers are steering wafers toward high-margin AI and enterprise memory and away from everything else, because that is where the money is. A new fab does not automatically reverse that, because the same margin logic governs what it chooses to build. IDC, the market-research firm, projects 2026 DRAM supply growth of only about 16 percent year over year, well below the 20 to 30 percent that was historically normal, even as demand for the AI variety grows far faster. The people running these companies are saying so directly. Intel's chief executive, Lip-Bu Tan, relayed in February 2026 what two of the key memory makers had told him: there is no relief until 2028.

The makers could produce more, but it takes years, only three companies do it at global scale, the AI memory eats three times the wafers, and they earn far more selling to AI than to you. Part of the shortage is physics, the supply is simply years out. Part of it is choice: the capacity that does exist is being pointed at AI, not at you. You have not been priced out for a quarter. You have been outbid.

Cooling: Why the Rack Changed Shape

GPUs and memory are still things you buy, even when you have to wait. The next wall is different. The same density that makes AI systems powerful, more transistors on the die and more memory stacked beside it, becomes heat the moment the hardware enters a building.

Every watt of power a machine draws comes back out as heat. Whatever goes in has to come out, or the machine cooks itself. It sounds too simple to matter, and it is the whole reason the rack changed shape.

For most of the history of computing, taking the heat out meant moving air. Servers in a rack, cool air through the room, and that was enough. At the rack densities of the CPU era, even at the top of the range, the airflow was manageable enough that the building stayed recognizably the same kind of building: rows of racks, cold aisles, hot aisles, chillers, and fans. Anyone who built those rooms knows the envelope.

AI did not make heat new. It made heat dense.

All the capability crammed into the die and the memory turns into heat in the same small volume. The rack densities from the opening, an order of magnitude higher than the CPU era, are really heat-removal figures: every one of those kilowatts has to be carried back out. The next generation on the roadmap, the Vera Rubin systems, is projected to push per-rack density several times higher again, and cooling vendors are already designing for the increase. At those densities, air cooling stops being practical. Bigger fans do not solve the volume problem.

Figure 1: Rack power density by generation. Past roughly 100 kilowatts, air cooling stops working and liquid becomes mandatory. The Rubin Ultra figure is a roadmap projection.

So the machine changed shape. A given volume of air can carry only so much heat away before it has to move faster than is practical, while water carries roughly three to four thousand times as much heat per unit volume as air. Past roughly a hundred kilowatts per rack, that gap stops being an efficiency question and becomes a hard limit, and the model flips: from moving air through a room to carrying heat away in liquid piped directly to the chip. The newest systems do not offer an air-cooled option at all. The GB300 NVL72 is fully liquid-cooled. The rack is no longer just electrical equipment. It now has plumbing.

Capital helps with procurement and retrofits. It does not change the thermal limits of air. This wall is geometry and thermodynamics. And it has a consequence even experienced operators feel: the hardware no longer fits in most existing buildings. A data center built for air, even one finished a couple of years ago, often cannot host these racks without being substantially rebuilt, retrofitted for liquid distribution, higher floor loading, and the plumbing that comes with it. You cannot simply drop the latest GPUs into the footprint you already have. For an operator who has spent a career optimizing airflow, that is the moment it becomes clear this is a different kind of building.

Liquid cooling changes how heat leaves the chip. It does not change where the energy comes from. Every watt still starts at the grid, and that is where the slowest constraint appears.

Power and Time: The Wall Underneath the Walls

What limits AI in the end is not chips, and it is not cooling. It is electricity, and specifically it is time. You can buy a GPU in a day. You cannot buy the specific date on which a few hundred megawatts will be delivered to a site. That is set by a physical system that moves on a timescale of years. Money can fund equipment and alternatives, but it does not make shared grid capacity appear on software time.

The building can be designed and built on one clock; the grid upgrades that let it draw full power often run on a longer one. Before a site can draw full power, the utility has to study and approve the load, the transmission system has to support it, and the substations and lines that feed it have to exist. The industry calls this time-to-power : the interval between choosing a site and being able to draw the load you planned around. For large AI sites, that interval can define the project. The upstream grid work resists money in a way the other constraints do not, because the transmission lines and substations are shared infrastructure that serves everyone on the grid, so a new load cannot simply pay to skip the queue without the wires actually being built. You can finish the building and then wait to turn it all the way on.

The time-to-power gap. You can buy a GPU in a day, but energizing a site takes years, and the build can finish while the power wait continues.

There is another clock running alongside the approval queue: the equipment itself. Even once a project is cleared to connect, the high-voltage transformers and switchgear that tie it to the grid can be in shortage. Lead times for large power transformers have stretched from roughly two years before 2020 to as long as five years now, and industry estimates suggest a meaningful share of planned 2026 data-center capacity could slip for want of power equipment and grid connections. Electrical gear is not the biggest line item in a data center. It can still decide when the building turns on, a reversal any operator who has waited on a transformer order will recognize.

The grid backlog around these projects is large. At the end of 2025, more than 2,000 gigawatts of generation and storage capacity were waiting in line to connect to the US grid, roughly twice the entire installed US power fleet. More waiting to connect than currently exists. That queue is not the same thing as a data-center load request, but it shows the condition of the shared infrastructure every large new load depends on: the wires, substations, studies, and upgrades are all moving on a multi-year clock. Lawrence Berkeley National Laboratory, which tracks those queues, finds the median time from request to commercial operation has roughly doubled, from under two years for projects built in the early 2000s to four to five years now.

This is not abstract, and it is not only an American problem. Ireland is the cleanest example. Dublin had become one of Europe's great data center hubs until the grid could not keep up, and in 2021 the grid operator EirGrid and the Commission for Regulation of Utilities imposed what amounted to a moratorium on new connections in the Dublin area. One Amazon project and two Microsoft projects were among those turned away and relocated to London, Frankfurt, and Madrid. By 2024, data centers were drawing around 21 percent of all the electricity in the country. The moratorium eased only in December 2025, and the new terms show where things are headed: a new facility now has to bring its own power generation or storage rather than simply draw from the grid.

The power industry itself is reorganizing around this demand. In May 2026, NextEra Energy announced a roughly $67 billion all-stock plan to acquire Dominion Energy, the utility behind northern Virginia's data center corridor.

That is the first half of the power story: getting electricity to the site. The second half starts once it arrives, and it is where AI looks least like the loads the grid grew up serving. A large training run is synchronized. Tens of thousands of accelerators compute, pause together to exchange results, and compute again. Power draw follows that loop. A single H100-class GPU draws far less at idle than under compute, so when tens of thousands switch states together, the facility's load can swing by tens of megawatts in seconds or less. Meta reported swings around 30 megawatts on a 24,000-GPU cluster training Llama 3.

A synchronized training cluster swings between compute and pause many times a second. The grid is built to follow the smooth aggregate of many independent users, not one correlated load moving in lockstep.

The grid was built around load diversity, where thousands of independent homes and businesses average into something smooth and predictable. A synchronized training cluster is neither diverse nor smooth, and that is the part that is new even to people who have planned power for a living. The stability problem is broader than training-loop swings: large data-center loads can also behave unexpectedly during grid disturbances. In July 2024, a transmission fault in Northern Virginia caused roughly 1,500 megawatts of data-center load to disconnect itself within 82 seconds, an event the North American Electric Reliability Corporation (NERC), which sets and enforces reliability standards for the North American bulk power system, said the system had never seen at that magnitude.

The fix moves on-site. xAI's Colossus cluster in Memphis installed about 150 megawatts of grid-scale battery storage alongside its power infrastructure. The point of that storage is not how much energy it holds but how fast it can absorb and deliver power. A small, fast store placed in front of a slower supply is a cache. Here the slow backing store is the grid, and the fast store is local batteries and power electronics. Batteries are no longer only backup equipment. In these designs, they can become part of workload control. At AI scale, power stops being merely an input to the computer. It becomes part of the computer's design.

And a design has to be operated. Once batteries, power electronics, cooling loops, and GPUs act as a single system, someone has to watch them as one: how power draw tracks compute, how the batteries answer a training swing, how heat follows the load. Those measurements arrive every second, from equipment that used to belong to three different teams. Read after the fact, they tell you what broke. Read live, they keep the loop stable.

Power is where the earlier constraints converge. The GPU you could not get, the region that was full, the building that needed rebuilding, the batteries now sitting between the workload and the grid: each one traces back to the same place, a power system that has to be built and buffered, on the grid's schedule, not the software team's.

Water: In Proportion

Power is the hardest wall because it sets the clock and, through on-site batteries and power electronics, becomes part of the machine's own design. Water is different. It sits downstream of cooling design and geography, which makes it more local, more variable, and more solvable than the public debate suggests. AI makes the siting choice more visible because the facilities are larger and denser, but the water problem still depends on design. That matters because water draws the most public attention of any of these constraints, and some of the least accurate reporting.

One distinction matters before any number makes sense: water withdrawn is not water consumed. Withdrawal is what a facility takes in; consumption is what it uses up, mostly through evaporation. A facility can withdraw a large volume and return most of it, or consume nearly all of what it takes, depending entirely on the cooling design.

Nationally, the figure is modest. As of 2021, all US data centers combined accounted for roughly 449 million gallons of water a day, about three to four tenths of one percent of total US water withdrawals, far below agriculture or power generation. The headline framing of data centers draining the country's water is not supported by the national figures.

The real issue is local. Roughly 40 percent of US data centers sit in areas of high or extreme water stress, so even a small national share can land hard on a particular community. Stated that way, it is a siting problem, real and solvable, rather than an indictment of the technology.

The cooling design is what sets the consumption. On-site consumption ranges from nearly nothing, for an air-cooled or closed-loop facility, to as much as 70 to 80 percent of what was withdrawn, for an open evaporative one. A single large evaporative facility can use something like five million gallons a day, comparable to a town of fifty thousand people. The same facility, built closed-loop, can use almost none. The high number and the low number describe the same building with two different cooling choices.

This is the constraint the industry is most actively engineering away. Closed-loop systems fill once and recirculate rather than evaporate. The same shift to liquid and direct-to-chip cooling described earlier can cut water needs dramatically, by up to 95 percent in some designs, and immersion cooling can eliminate evaporative water use altogether. Reclaimed wastewater is increasingly used in place of drinking water. Of the five walls in this piece, water is the one where the engineering response is furthest along, which is exactly why it deserves to be described accurately rather than dramatically.

The Reserve Ran Out

The five walls are not five problems. They are one fact seen from five angles. None of this is new physics: the power was always physical, the heat was always real, capacity always took years to build. What changed is that the cloud era ran on a deep reserve of capacity built ahead of demand, and as long as that reserve lasted, the limits underneath stayed out of view. You turned a dial and the reserve answered. AI has drawn that reserve down, and at a scale the old infrastructure was never built to carry, so the limits are back in view all at once.

That is why the data centers rising for AI are a different class of build, and why the people who built the last generation look at the numbers and recognize that the rules they worked under have moved. The accelerator depends on packaging and memory, the rack depends on liquid cooling, and the building depends on power. At this scale, the power itself needs a buffer. The site depends on grid capacity, water choices, and time. The next time a capacity question lands on your desk, ask where it will physically live and how long the power takes, before you ask what it costs. The cloud used to be an abstraction. It has an address now.

The five walls are physical. Operating inside them is not. Once the facility and the computer are one coupled system, running it means reading it as one: GPU power draw, cooling response, battery state, and grid posture, measured together and fast enough to act while the numbers are still true. That is not a facilities dashboard on a five-minute refresh. It is a live, correlated, high-frequency record of a machine that now runs from the silicon to the substation. Capturing that record, and querying it before it goes stale, is its own problem.

Get Started

Operational telemetry only helps if you can query it while it is still true, at the rate it arrives. That is the workload Tiger Data is built for: time-series and event data on Postgres, fresh and correct, without splitting into a second system. Start a free Tiger Cloud trial. Running on-premises, at the edge, or air-gapped? TimescaleDB Enterprise is built for those deployments and is taking design partners.

How Float Runs an AI Energy Company on a 3-Person Team with Tiger Data

Team Tiger Data — Tue, 30 Jun 2026 11:00:05 +0000

Danish startup achieves 99.3% compression on 1Hz smart meter data, powering real-time appliance-level energy analytics for hundreds of homes.

Float is a Danish AI and energy startup that collects 1Hz smart meter data from hundreds of homes and disaggregates it into per-appliance consumption using a proprietary ML model. The entire system depends on one architectural bet: that compression on their time-series database would be high enough to make the storage economics work at scale. Co-founders Jens Brandt Nellegaard (CEO) and Victor Grabow (CTO) share how they evaluated every major time-series database, why 90% compression was the hard floor for viability, and what happened when they hit 99.3% on Tiger Data.

About Float

Most people have no idea what their individual appliances cost to run. In Denmark, most energy providers send a monthly PDF bill, and that is the entire customer experience. The market has seen no meaningful innovation in 10 to 15 years. Fraud and a lack of transparency have been so widespread that the Danish government introduced new consumer protection regulation effective January 2026.

The underlying data to solve this problem actually exists. Most European smart meters have a standardized consumer interface that can output total household electrical load down to one-second resolution. But total load is not very interesting on its own. What consumers need is a breakdown by individual appliance - to understand which one is wasting electricity, which one is running at peak price, which one is approaching the capacity limit of a fuse. Academic research on energy disaggregation goes back to 1992. Nobody had solved it in a commercially scalable way.

Float built three things to close this gap: a proprietary hardware module that plugs into the consumer port on a European smart meter, a signal processing and neural net pipeline that classifies appliance-level consumption from the raw waveform, and a consumer-facing app with a proactive AI energy agent. The system collects roughly 15 measurements per second per household, each at 1Hz resolution. One second is the hard floor - Victor explains: "At one-minute resolution, the model would break, because what we track are the changes. If there are too many appliances turning on within the same minute, it would be very hard to differentiate them."

Jens Brandt Nellegaard and Victor Grabow co-founded Float in 2022. After nearly three years of R&D, they achieved a proof of concept in December 2024, secured their energy provider license in December 2025, and are now rolling out a private beta to 350 pre-vetted customers. The company has just three people.

The Challenge

Float started on Azure managed Postgres with the TimescaleDB extension. The team had Azure credits early, so it made sense at the time. But the Apache version available on Azure did not include compression, and that turned out to be a dealbreaker.

Every customer generates roughly 15 measurements per second. Float samples voltage, frequency, and total load across each phase entering the home. At 1,000 homes, that is 15,000 data points per second, continuously. Without compression, the storage cost alone would break their business model. “We are an energy company with a flat-rate subscription fee,” says Jens. “We pass through the spot market price one-to-one with no markup. If storage cost per user exceeds what the subscription supports, the economics collapse.”

We also tried InfluxDB before settling on TimescaleDB. We ran into ingestion issues, and we needed SQL. When you are a three-person team building an asset-centric microservice platform, you cannot afford a database that requires a proprietary query language and limits how you join and query data across domains. - Victor Grabow, CTO, Float

On top of the storage problem, Float needed continuous aggregates. The Danish DSO delivers settlement data at 15-minute resolution. Float collects data every second. To generate a live energy bill and compare it against the grid operator's numbers, the system needs to aggregate raw 1Hz data down to 15-minute windows constantly. On managed Postgres without TimescaleDB's full feature set, that meant writing and maintaining batch jobs - more infrastructure overhead for a team that was already stretched thin across hardware, ML, and a licensed energy company.

Why Tiger Data

We tested pretty much every time-series database on the market. We think Tiger Data is the best solution for our use case. - Victor Grabow, CTO, Float

The team researched time-series solutions extensively and discovered TimescaleDB through the compression and continuous aggregate features. The compression was compelling enough that it was just a matter of time before they needed the full capability. When they found Tiger Data - the company behind TimescaleDB - the managed cloud service made the path clear.

Two features drove the decision. First, Compression. The team had modeled the unit economics and needed at least 90% compression on the time-series data for the subscription model to work. Anything below that and storage costs per user would exceed what the flat-rate fee could support. Second, Continuous Aggregates - materialized views that update incrementally as new data arrives. Float runs aggregations constantly, converting 1Hz readings to 15-minute settlement windows, calculating threshold-based alerts on voltage and frequency, detecting outages, and triggering duration-based notifications like flagging an oven that has been running for four hours. Continuous aggregates handle all of this without batch jobs or scheduled pipelines.

We chose Tiger Cloud, the fully managed service on Azure, because it was a question of speed. We needed to get up and running fast and offload infrastructure management entirely. Encore, our DevOps platform, provides ephemeral environments on Google Cloud, and Tiger Cloud's database branching fits naturally into that workflow. - Victor Grabow, CTO, Float

The Float Energy Data Stack

Data starts at the smart meter. Float’s IoT module plugs into the standardized consumer interface port and captures voltage, frequency, and total load across each phase at 1Hz resolution. The module sends readings to Azure IoT Hub, which the team kept from the original Azure setup as a stable ingestion endpoint for all devices.

From there, a bridge connector forwards the stream into Google Cloud, where Encore deploys Float’s microservices. The team moved off Azure Event Hub eventually because it was expensive. Google Cloud streaming services handle the same throughput at lower cost. The streaming layer batches incoming measurements from all households every second and inserts them per batch into Tiger Data.

Tiger Data stores the raw 1Hz time-series readings and runs continuous aggregates for threshold-based monitoring: voltage spikes, frequency changes, mean and max calculations, outage detection, and duration-based appliance alerts. All raw data is retained for ML training purposes through the private beta phase, with tiered storage planned as the fleet scales.

The Float app reads processed data to show customers their real-time energy breakdown per appliance. New customers see total wattage immediately on connection. Appliance-level breakdown takes roughly three to four weeks as the model trains on their home's specific patterns. The agentic orchestration layer on top handles billing, onboarding, customer service, and proactive notifications - flagging forgotten ovens and irons, inefficient appliances, and dangerous load conditions approaching fuse limits.

Float's data architecture: 1Hz readings flow from the IoT module through Azure IoT Hub and Google Cloud streaming into Tiger Data, which serves the ML pipeline, consumer app, and agentic platform.

What Compression Enabled

On Tiger Data, Float is seeing 99.3% compression on its time-series data. Victor puts it directly:

Compression needed to be in the high nineties range to not break our business model. So that was a great outcome. - Victor Grabow, CTO, Float

That number unlocked three things that would not have been possible at lower compression ratios.

The Business Model Works

At 15,000 data points per second across 1,000 homes, uncompressed storage would generate terabytes of raw time-series data per year. Float passes through the spot market electricity price to customers at cost with no markup. Revenue comes from a flat-rate subscription fee. If storage cost per user climbs above what that fee can support, the entire model collapses. At 99.3% compression, it does not. The subscription covers infrastructure with margin to spare, and that margin holds as the fleet scales.

Full Data Retention for ML Training

Float's disaggregation model needs weeks of 1Hz training data per household to learn each home's specific appliance signatures. At lower compression, the team would face a choice: retain all raw data for model training or keep storage costs viable. At 99.3%, they retain everything. All raw 1Hz readings from the entire private beta fleet are available for the ML pipeline, with tiered storage planned only as the fleet scales past the beta phase.

Real-Time Billing Without Batch Infrastructure

The Danish grid operates on 15-minute settlement windows. Float collects data every second. Continuous aggregates bridge that gap, converting 1Hz readings into the 15-minute intervals the DSO requires for bill reconciliation. Danish energy prices swing up to 80% between peak and off-peak hours, which makes the freshness of the aggregation directly valuable to customers. Because continuous aggregates update incrementally as new data arrives, Float's live energy bill is always current, i.e., no scheduled batch jobs, no pipeline maintenance, no lag.

A Three-Person Team Running a Licensed Energy Company

Float holds an energy provider license in Denmark. That means billing, customer service, onboarding, regulatory compliance - operational overhead that traditional energy companies staff with dozens of people. Tiger Cloud's managed infrastructure is part of what makes this possible. The team does not manage database operations, storage provisioning, or aggregation pipelines. That overhead is handled. When asked about team size, Jens's answer was simple: "Three. We have three people... and an army of agents. This is the future."

Looking Ahead

Float is targeting 1,000 additional private beta customers within the next 12 months, with a seed round, two additional hardware variants for full Danish grid coverage, and expansion into two more countries.

The next major integration is EV charging - starting with Tesla's telemetry API, enabling smart charging during cheap price windows. The bigger thesis is that a fleet of homes measured at 1Hz resolution can trade power on the spot market more efficiently than any energy company operating at 15-minute resolution with a 24 to 48-hour delay. As Jens puts it: "Ultimately we are trying to make the home not a burden for the grid, but a partner of the grid."

The architecture decision that compounds as Float scales is not the compression ratio itself. It is that everything runs on a single Tiger Data instance: the raw 1Hz readings, the continuous aggregates for billing, the training data for the ML pipeline, the anomaly detection queries. No split architecture to maintain, no query paths to reconcile as the fleet grows from 350 homes to 1,000 and beyond. The data model does not change - it just gets bigger.

What You’re Really Owning When You Self-Host TimescaleDB

Team Tiger Data — Thu, 25 Jun 2026 13:51:54 +0000

Written by Matty Stratton, Brandon Purcell, Noah Hein, Hien Phan

Why operating TimescaleDB for mission-critical applications becomes a sustained platform engineering investment. Written by the team that builds TimescaleDB and operates Tiger Cloud across thousands of production deployments.

Abstract

Most engineers who evaluate TimescaleDB believe they are making a database decision. By the time customers depend on the application in production, they discover they made a platform ownership decision. The gap between those two decisions is the subject of this paper.

Availability, recoverability, scalability, security, and lifecycle management originate in the requirements of the application, and they land on the database team as operational systems that must be designed, built, staffed, and maintained for as long as the application runs. None of that work is beyond a capable engineering team. The question this paper raises is whether owning that platform is the highest-leverage use of the platform engineers who could otherwise be building the product the database supports.

Analytics on Live Operational Data Is Part of the Application

Analytics that runs overnight against a reporting database is important, and it is separate from the application. Analytics that plant operators watch continuously, that customers open before starting their workday, or that triggers an automated intervention before a defect is manufactured has moved onto the critical path of the business. Once analytics moves onto the critical path, the database supporting it moves there too.

One deployment carries most of this paper. An automotive supplier operates 120 robotic welding lines ingesting billions of sensor measurements per day. Plant engineers use the platform to detect abnormal operating patterns before they become equipment failures. A welding line showing early signs of a calibration drift or a sensor trending out of range may be a maintenance issue today; missed for thirty minutes, it can become an unplanned shutdown that halts production and costs the business more in rework than the platform team's quarterly budget. Every minute of latency is a minute a problem propagates across a line. We will follow this platform from its first twenty instrumented lines through its third year of operation, because the operational story of a self-hosted deployment is a story that unfolds over years, and it is easier to understand through one platform than across a survey of many. Where the same failure arrives from a different direction, we will bring in other deployments we have worked with: a connected-equipment OEM, a food processor, a specialty chemicals manufacturer. Different industries, different latency tolerances, the same dynamic underneath.

TimescaleDB is still TimescaleDB and PostgreSQL is still PostgreSQL. What changes is the operational commitment required to deliver the availability, recovery, scalability, and governance the application now demands.

The Database Inherits the Application's Requirements

Nobody is thinking about the database when the automotive supplier's operations organization commits to detecting equipment degradation before it stops a line, when plant management sets uptime expectations for the dashboards shift supervisors watch, or when compliance defines how long production records backing warranty claims must be retained. The database team inherits the requirements anyway and the inheritance changes the nature of the work.

A 99.9% availability target becomes a systems problem: replicas, failover orchestration, monitoring, and on-call coverage. A point-in-time recovery requirement becomes a tested runbook with validated restore times at production volume. Years of telemetry under compliance obligations becomes legal exposure that accumulates with every year of data the team holds.

Early in an application's life these requirements are loose. An outage of a few minutes is an inconvenience; a day of missing data is embarrassing but recoverable. As more users depend on the application and other teams build their own workflows on its data, the requirements tighten. Closing that gap is continuous work, and it falls on your platform engineers for as long as the application runs. The rest of this paper is what that work actually looks like, and each section raises the stakes a step: from lost engineer hours, to business risk, to commercial liability, to legal exposure.

Downtime Becomes a Product Problem Before the Architecture Catches Up

During the automotive supplier's early deployment, with twenty welding lines instrumented and a handful of engineers as the only users, a database restart is an inconvenience. Dashboards go blank for two minutes. Engineers wait. Nobody escalates.

Twelve months later, all 120 lines are instrumented and shift supervisors in three facilities depend on those dashboards. An equipment problem that goes undetected for thirty minutes because the platform was down costs more in rework than the platform team's quarterly budget. The database has the same configuration it had twelve months ago. The application does not.

The response is redundancy. PostgreSQL streaming replication keeps a replica synchronized with the primary, and if the primary fails, the replica is promoted with minimal data loss. This is well-understood architecture and it works. It is also where the engineering commitment begins.

Figure 1: Single-instance deployment versus production HA architecture: primary, HA replica, read replica, connection pooler, and monitoring as distinct operational layers, with failure paths and failover direction

Replication configuration is a decision with real trade-offs: synchronous replication eliminates lag but constrains ingest throughput; asynchronous replication preserves throughput but can lose data in a failover. Someone who understands the application has to make that call and own it as the workload changes. More importantly, someone has to watch replication continuously. A replica that has quietly fallen minutes behind provides a false sense of safety: stale as a read source, lossy as a failover target. Silent replication drift is one of the most common ways HA architectures fail to deliver the guarantees they appear to provide. We have seen it on deployments that had every structural component in place. The monitoring was the missing piece.

Beyond replication, HA is a stack of components that each need owners: failover automation that has been tested under simulated failure, not just configured; a connection pooler that shortens the error window and brings its own failure modes; rolling maintenance procedures rehearsed before they're needed under time pressure; and a monitoring layer that covers replication health, failover state, and background jobs, with alert thresholds tuned to the application and runbooks kept current as the system changes.

Availability is a standing allocation: a senior engineer's judgment on replication and failover, recurring hours for monitoring and rehearsal, and a permanent slot in the on-call rotation. Every one of those hours comes from the same platform engineers the roadmap is counting on. The question is whether it is the work you hired them to do.

Recovery Objectives Are Set by the Business and Tested by Almost Nobody

Figure 2: Recovery architecture showing full weekly backups, daily incrementals, continuous WAL archiving, and the RPO/RTO envelope they define, with restoration time as a function of data volume

Most backups are running. The question organizations rarely ask before they need to is whether the restore completes within the window the business requires, from the point in time the business requires, at the volume the database has actually reached. That gap is where recovery risk lives, and the food processor is standing in it the day a customer reports a potential contamination event.

The investigation needs sensor records for one production line during a two-hour window eighteen months ago. The data was backed up. None of that answers the only question that matters: whether restoring eighteen months of production telemetry at current volumes completes inside the investigation's clock. Nobody has ever run that restore. The procedure that exists is a hypothesis, and the incident is the wrong time to run the experiment.

That is the ownership gap. Not the backup. The untested restore.

As data volume grows, restore drills become harder, slower, and more important. A backup that exists is not the same as a recovery process that works.

The food processor's contamination event is the urgent version of this failure; the chemical manufacturer's corrupted migration is the irreversible one. A restore that completes is not necessarily a restore that worked. Validating that distinction at production scale is a recurring drill, measured in engineer-days per quarter, and someone has to own the calendar invite. That is before the platform has grown. The next problem is a system that succeeds.

Success Rewrites the Architecture

Figure 3: The same deployment scaling from hundreds to tens of thousands of assets: storage growth, backup window duration, query p95 latency, and concurrent connections as separate axes over time

The deployment that shows what success costs is an industrial equipment OEM whose customer-facing dashboards are a contracted product feature. A database outage there is a commercial event, logged against an SLA and escalated to an account manager. The OEM instructed a few hundred connected assets at launch and reached tens of thousands two years later. It did not build a system that broke. It built a system that succeeded, and the success invalidated the original architecture one assumption at a time. Nothing failed. The application simply outran the operational decisions made around it.

Storage growth is the most visible dimension and the least costly; disk is cheap to add. The downstream effects are the expensive part: longer backup windows, longer restores, more expensive maintenance, slower schema changes across thousands of chunks.

Retention policy management becomes load-bearing, and the hard part is rarely the configuration. The OEM's customers, now benchmarking equipment against peers on historical trends, reject the retention window set two years ago. The food processor meets the same moving target from the regulatory side: entering a new jurisdiction rewrites the retention obligations it launched with. The policy must stay aligned with requirements that keep moving, and someone has to confirm it is actually running.

Growth changes the operating model across every dimension: more assets, more users, more historical data, and more customer expectations. Query concurrency grows from a handful of internal analysts to thousands of customer-facing users, bringing read replicas, replication lag, and connection routing into scope. Configuration that was correct at two hundred assets is reconsidered at twenty thousand, but corrections don't take effect instantly. They migrate into effect gradually as new data arrives, which means the team is managing transitions, not flipping switches. These solutions work. They also need owners, and more of them as the application grows.

None of this stays solved. Each revisit lands on the same team being asked to ship features and support customers. Growth in the application is growth in the platform team's backlog.

The Platform Never Stops Evolving

Figure 4: Version lifecycle timeline: PostgreSQL major version cadence, TimescaleDB release cadence, support windows, and the upgrade planning cycles they impose

There is a common expectation, particularly among teams building their first production database platform, that operations stabilize once the deployment is running. It does not. Ownership never ends.

PostgreSQL ships a major version each year; TimescaleDB tracks those releases. Running past end-of-life means running without security patches, which is untenable for any system holding customer data or regulated production records. Minor version patches require only a brief service restart. A major PostgreSQL version upgrade is a coordinated migration process: it requires a staging environment that mirrors production in data volume, query workload, aggregate configuration, and columnstore state; a post-upgrade validation suite checked against pre-upgrade baselines; a tested rollback plan; and a coordinated maintenance window. For the chemicals manufacturer's five years of reactor telemetry, that is two to four engineer-weeks of work, recurring roughly annually, for as long as the platform operates.

Runbooks are the connective tissue holding the rest together, and they decay by default. Return to the automotive supplier, now three years in. The platform looks substantially different from what launched: additional lines instrumented, aggregates added, retention adjusted, the HA configuration changed after a failover exposed a gap in the original design, the chunk interval re-tuned after ingest grew. Each change was made for a good reason. None made it back into the runbook, and the engineer who made most of them has moved to a different team. This is how operational debt accumulates: not through negligence, but through the ordinary pressure of a team moving fast and treating documentation as something to get to later. If you are reading this and thinking it would not happen on your team, it is worth asking when your runbooks were last tested against the system they describe. Later arrives during incidents.

Every Capability Arrives With an Owner Attached

The preceding sections walk through these systems one at a time, as they arrive in the life of a deployment. Here is the full surface area in one place: each system with its own configuration, monitoring requirements, failure modes, and cadence of ongoing work. In aggregate, they make up the platform.

System	What it delivers	How it fails quietly	The ongoing work	Cadence
HA & failover (replicas, pooler, promotion automation)	The availability SLA	Silent replication drift; failover automation that was never rehearsed	Lag monitoring, failover drills, pooler tuning, rolling maintenance coordination	Continuous monitoring; drills quarterly
Backup & point-in-time recovery	The RPO/RTO commitment	Restores never run at production volume; broken WAL archive chains	Restore validation against real objectives; post-restore verification before returning to service	Restore drills quarterly; re-scoped at each growth step
Retention policies	Bounded storage; compliance windows	Background job fails silently; policy on the wrong relation deletes data that should have been kept; aggregates outlive the raw data they were built from	Policy validation and job monitoring; alignment with dependent systems and changing business requirements	Monthly job audit; review on every regulatory or contract change
Hypercore columnstore	90–98% storage reduction; faster analytical queries	Conversion boundary set too early adds overhead to hot data; too late, and the economics of long retention erode	Boundary tuning against access patterns; resource planning for large backfills (e.g., post-calibration corrections)	Semiannual review; per backfill event
Continuous aggregates	Dashboard latency at production scale	A failed refresh serves stale data with no user-visible error	Refresh policy tuning; job failure alerting so stale data is caught before users report it	Weekly alert review; retuned with each workload shift
Chunk configuration	Ingest throughput and memory health	A misconfigured interval degrades writes precisely during peak load	Interval review as ingest rates change; corrections migrate gradually into effect	Quarterly review or per major ingest change
Security & governance	Breach containment, auditability, contractual compliance	Long-lived over-permissioned credentials; audit logs retained where nobody looks; unreviewed production changes	Credential scoping and rotation rehearsed outside of incidents; audit log review and tamper protection; change control with staging validation and rollback plans	Rotation per policy; review gates on every production change
Version lifecycle	Security patches and support coverage	Running past end-of-life, unpatched, while holding customer or regulated data	Minor patch windows; major upgrades validated against a staging environment that mirrors production volume, workload, aggregate configuration, and columnstore state	Minor on a rolling cadence; major roughly annually
Runbooks & institutional knowledge	Incident response speed	Documentation describing the system as it was at launch	Updating with every configuration change; testing procedures against the live system	With every change; tested quarterly

Losing the Database Stops Being an Outage

We have seen audit configurations that were technically correct but practically invisible: logs retained in a system nobody had access to, alerts wired to a distribution list that no longer existed. The mechanism was in place. The ownership was not.

Every section so far has priced platform ownership in engineer hours and business risk. There is a point in a platform's life where the currency changes, where losing the database stops being an outage and starts being regulatory action, contractual liability, or litigation. The chemicals manufacturer lives past that point. Its five years of reactor telemetry is the empirical foundation for yield improvement decisions that took years to accumulate, and it is also an evidentiary record. The corrupted migration from earlier in this paper does more than destroy operational knowledge; it puts the team in the position of reconstructing that record under potential legal scrutiny, explaining to lawyers what a background job did and why nobody caught it. Some of these failures are irreversible by construction. A retention policy that accidentally drops a week of compliance data cannot be undone. A breach of customer telemetry arrives with notification obligations and suddenly load-bearing contract language.

This work rarely looks dramatic on a task list. It is credential scoping, rotation drills, tamper-protected audit logs, staging validation, change control, and rollback plans for every production change.

But the failure mode is different. Availability gaps cost minutes. Recovery gaps cost hours or days. Governance gaps can cost the company things that cannot simply be restored.

What This Costs, and What It Buys

Sustaining this platform at the standard a critical application demands lands between 1.5 and 3 full-time platform engineers, growing with the application, because every dimension of the work scales with data volume, concurrency, and the criticality of the guarantees. The on-call rotation requires three to four people to staff sustainably, independent of how much of their time the platform consumes. A major version upgrade is two to four engineer-weeks a year. Restore drills, failover rehearsals, policy audits, and runbook maintenance each claim recurring days per quarter. These numbers are rough by necessity and conservative by experience.

Capability was never the question. The teams that do this well go in with eyes open, staffed for the work, treating the platform as a product in its own right. The real question is the counterfactual: what would those same engineers build if they were pointed at the application instead, at the ingestion pipelines, the product features, the things customers actually pay for, while the guarantees are delivered by the team that builds the database?

Most engineers who evaluate TimescaleDB believe they are making a database decision. By the time the application is in production and customers depend on it, they discover they were deciding which parts of a platform they want to own. Every team owns the application. The decision is how much of the platform they want to own alongside it. This paper is designed to make that decision visible before it is made. For a framework to make it deliberately, see Self-Hosted TimescaleDB or Tiger Cloud: A Framework for the Decision.

Great Models Aren't Enough for Physical AI

Team Tiger Data — Thu, 18 Jun 2026 17:52:23 +0000

Notes from our Physical AI dinner

What should a drone do when a police helicopter approaches it?

We heard that question at a dinner we recently hosted for engineering leaders and founders working in Physical AI: the AI behind robots, drones, autonomous vehicles, and other machines that sense and act in the real world, along with the infrastructure that keeps them running. Nobody at the table had a clean answer. The people in that room deploy these machines for a living, and the question that stumped them was about safety, regulation, and operations, not model quality.

That was the theme of the whole evening: Physical AI is constrained by the physical world. Progress depends not only on better models, but on solving the problems around them: regulation, safety, operations, and data.

Scaling takes more than a better model

Despite rapid model progress, truly large-scale autonomous deployments still feel distant. The gap is the long tail of situations nobody puts in a pitch deck. How should a system respond when an animal starts interacting with the equipment? What kicks in when hardware behaves in ways nobody anticipated? Who is accountable when it does?

This is the unglamorous work that gates adoption. The physical world doesn't behave like a benchmark.

The physical world plays by rules a model can't change

Physical AI companies routinely operate inside regulatory frameworks written for older technologies. EV charging companies have had to navigate gas-station rules, including public price displays and printed receipts. Drone operators face aviation requirements designed for crewed aircraft. These constraints sit outside the model entirely, and a team has to clear them before a deployment is legal, let alone good.

Society sets its own rule on top: machines face a higher bar than humans. A battery fire or an autonomous accident draws disproportionate attention compared to an equivalent human-caused incident. That's the reality, and the teams that win will design for it early.

Surviving the physical world is a data problem

The edge cases, the regulations, the higher bar: you handle all of them through the data the machines produce. You catch an edge case because something in the telemetry looked wrong. You prove you met a regulation because you kept the records. The physical world is messy, and data is how you get a grip on it.

So data becomes its own hard problem. Machines in the field generate enormous volumes of telemetry, and every team deploying them wrestles with the same five decisions:

What does the system need in real time?
What stays at the edge?
What gets shipped to the cloud?
What's worth retaining for training?
And what must be kept for regulators, sometimes for decades?

Nuclear applications can carry 30-year retention requirements, a timescale that makes most storage strategies look quaint.

At fleet scale, monitoring, observability, and automation become critical infrastructure, increasingly run with agentic copilots that help operators watch and triage while humans stay accountable for the rare edge cases.

Most teams haven't felt this yet, because most aren't at fleet scale. The ones who treat the telemetry layer as core infrastructure before they get there are the ones who won't be rebuilding it under load later.

The work is surviving reality, not beating a benchmark

The model is what everyone watches. The deployment is decided by everything around it: the regulation, the safety bar, the operations, and the data that ties them together.

That's the work the people at our dinner do every day, and it's why we'll keep bringing them together. It's also the work we do: helping teams capture, store, and make sense of the data their machines produce, so the operational layer is ready when deployment scales. If you're building machines that operate beyond the lab, reach out. We'd love to have you at the table.

When PostgreSQL Isn't the Right Fit: Recognizing Workloads That Need Different Architecture

Team Tiger Data — Fri, 12 Jun 2026 12:00:47 +0000

When PostgreSQL isn't the right fit, the signs don't announce themselves clearly. Postgres is the right database for roughly 90% of workloads, such as SaaS backends, CRUD applications, and transactional systems with mixed read/write access on shared rows. But there's a narrow 10% where those same strengths become overhead: high-frequency append-only ingestion, time-ordered data accumulating at sustained rates, analytical scans over hundreds of millions of rows. If that sounds like your system, this post is for you.

What You Will Learn

If you've added indexes, implemented partitioning, tuned autovacuum, and upgraded hardware only to watch performance degrade again on the same trajectory, the problem likely isn't your configuration. By the end of this post, you'll know whether your workload is in Postgres's 10%, how to confirm it with a single diagnostic query, and what the first concrete step toward the right architecture looks like.

Why It Matters

An optimization problem and an architecture problem look identical in the early stages. Both show up as slow queries. Both respond to the same fixes: indexes, partitioning, autovacuum tuning, hardware upgrades. The divergence happens later, when the fixes stop holding and performance degrades on the same trajectory regardless of what you change.

This is what’s known as the optimization treadmill: a predictable sequence of phases that each buy three to six months of relief without changing the underlying trajectory. MVCC overhead, row-oriented storage, B-tree index maintenance, WAL volume. These aren't bugs. They're architectural tradeoffs that work well for 90% of workloads and work poorly for the 10%.

Knowing which problem you have determines whether you should keep tuning or make a different decision.

What Postgres Was Designed For

Postgres's architecture is built around concurrent access to shared rows. Multiple transactions read and write the same data at the same time, and MVCC handles the isolation. B-tree indexes find specific rows by key. Row-oriented storage assumes that when you retrieve a row, you want most of the columns in it.

For an e-commerce backend, a user authentication system, or a multi-tenant SaaS product, these are exactly the right tradeoffs. Transactions need isolation. Point lookups by user ID are the dominant query pattern. Write rates track user activity, which gives the database natural breathing room between peaks. The question isn't whether Postgres is good. It's whether the workload you're running matches the patterns its architecture was designed to serve.

The Workload That Breaks the Match

Three characteristics, when they appear together, put a workload outside what Postgres handles well.

Append-only or append-heavy writes. Rows are written once and never, or almost never, updated. Sensor readings, financial transactions, log entries, event streams. Every row still pays the full MVCC cost: a 23-byte tuple header tracking transaction visibility, hint-bit dirtying on reads, and autovacuum running continuously to freeze tuples and update the visibility map. None of that overhead produces value on data that will never be touched again.

Sustained high write rates. Not burst traffic that settles. Continuous ingestion at thousands to hundreds of thousands of rows per second, around the clock. The table grows without pause, B-tree index maintenance adds overhead with every insert, and that cost compounds with row volume, so there is no quiet window for autovacuum to catch up.

Analytical query patterns. The queries are aggregations over time ranges: averages, counts, percentiles, GROUP BY time bucket. Row-oriented storage forces Postgres to read all columns of every matching row even when the query needs two. On a 30-column table, that's fifteen times the I/O a columnar layout would require.

Any one of these is manageable. All three together is the combination that Postgres handles well at one million rows and struggles with at one hundred million.

The Optimization Treadmill in Practice

The pattern is predictable. Queries slow down as the table grows. You add indexes, and reads get faster. Write performance drops because index maintenance scales with row volume. You upgrade the instance. Performance stabilizes and costs go up. You implement partitioning. Recent-data queries get faster. Partition management becomes its own maintenance burden. You tune autovacuum settings. Things stabilize for a while. Data volume increases. The cycle repeats.

Each step is individually correct. The problem is that the sequence never ends. You're working around an architectural mismatch instead of running a workload the architecture was designed to serve.

The engineering cost accumulates in ways that are harder to see on a dashboard. The senior engineer spending a week on partition strategy is not shipping product features. The on-call rotation starts treating "database is slow again" as a recurring incident category. Quarterly planning includes a database scalability line item, every quarter.

How to Know Which 10% You're In

The answer is already in your table statistics. Not in EXPLAIN plans or monitoring dashboards, but in the counters tracking exactly how rows have been written, updated, and cleaned up over the table's lifetime. Run this against your highest-traffic tables:

SELECT
    relname AS table_name,
    N
_live_tup,
    n_dead_tup,
    n_tup_ins,
    n_tup_upd,
    ROUND(n_tup_upd::numeric / NULLIF(n_tup_ins, 0) * 100, 2) AS update_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins DESC
LIMIT 10;

Here's an example of what a flagged table looks like next to a healthy one:

table_name	n_tup_ins	n_tup_upd	update_pct	last_autovacuum
device_metrics	84,729,3041	24,892	0.00	2025-06-01 14:22:11
user_accounts	184,203	91,843	49.86	2025-05-29 08:14:03

device_metrics is in the 10%: 847 million inserts, near-zero updates, and autovacuum fired three minutes ago on a table that has never had a meaningful UPDATE run against it. user_accounts is not: nearly half its rows are updated, and autovacuum runs only when it actually needs to.

Look for update_pct under 5% and last_autovacuum timestamps within the last few minutes on tables with near-zero deletes. That's the overhead the companion piece documents in detail: a cleanup process running non-stop on data you never modify, because the storage engine generates that work regardless of your intent.

Pair those numbers against the broader pattern. Your sustained write rate exceeds 10,000 rows per second. Your most common queries aggregate over time ranges, not point lookups by row identifier. You added partitioning specifically to control table size. You upgraded your instance specifically for query performance, not connection headroom.

Three or more of those conditions, and you're in the 10%. The optimization treadmill will keep running, but the trajectory won't change.

What the 10% Actually Needs

If you've confirmed you're in the 10%, migrating your highest-traffic table starts with a single function call:

SELECT create_hypertable('device_metrics', by_range('ts'));

This converts the table to a TimescaleDB hypertable, which does automatic time-based chunking without cron jobs or partition management scripts. From there, you can enable columnar storage on your chunks. This format reads only the columns a query requests, not full rows, and compresses historical data by 10 to 20x, bringing time-range aggregation performance in line with what the workload demands. The migration post walks through the full process, including zero-downtime options for production tables.

You keep the same SQL, the same connection strings, the same ecosystem tooling. This isn't a replacement for Postgres. It's Postgres with the storage primitives your specific workload actually needs.

Conclusion

Postgres is not the problem. Running the wrong workload class through an architecture designed for a different problem is. The distinction matters because one has a tuning fix and the other has a structural fix, and those two paths look identical for the first several months.

The most expensive version of this recognition happens after 18 months of optimization effort. The cheapest version happens now.

Run the diagnostic query above. If the numbers land where you expect, read the full architectural breakdown. If you're ready to test on your own data, start a free Tiger Data trial today.

Row vs Columnar Storage for Analytics: Why PostgreSQL Scans Are Slower Than They Should Be

Team Tiger Data — Fri, 05 Jun 2026 12:48:04 +0000

Here's a query that runs on most time-series tables:

SELECT time_bucket('1 hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_readings
WHERE ts > now() - interval '7 days'
GROUP BY hour
ORDER BY hour;

The query needs two columns: ts and temperature. The table has 15 columns. Postgres reads all 15 columns for every row that matches the WHERE clause.

That's not a bug. It's how row-oriented storage works. Each row is stored as a contiguous block of bytes on disk, called a heap tuple, and Postgres reads the entire tuple to access any column within it. For point lookups on individual records, this is efficient. You want the whole row, and it's stored together. For analytical scans over millions of rows where you need two columns out of fifteen, it's the dominant source of wasted I/O.

In Understanding Postgres Performance Limits for Analytics on Live Data, row-oriented storage was identified as one of four architectural constraints that compound under high-frequency ingestion. That whitepaper maps the pattern at a system level. This post goes deeper on the physical mechanism: exactly how pages work, how read amplification accumulates, and why the usual fixes don't reach it.

What You Will Learn

By the end of this post, you'll have a concrete diagnostic formula: the read amplification ratio. It tells you whether your storage layout is the dominant I/O bottleneck for analytical queries on any table you own. You'll also understand why indexes can't fix this class of problem and how a hybrid row-columnar storage layout changes the math. This post assumes working familiarity with Postgres page layout and B-tree indexes.

How Row Storage Actually Works in Postgres

Postgres stores data in 8KB pages. Each page holds multiple heap tuples. Each tuple contains every column value for that row, stored sequentially, preceded by a 23-byte header that carries transaction visibility metadata.

A table with 15 columns averaging 200 bytes per row fits roughly 35 to 40 rows per page, after accounting for headers, alignment padding, and page overhead.

When Postgres runs a sequential scan, it reads pages from disk in order. Each page load brings all the rows on that page into shared_buffers, with all 15 columns per row intact. The executor then evaluates the WHERE clause and pulls the needed columns from what was already loaded into memory.

The I/O cost is proportional to total table size, not to the size of the queried columns. A query that needs 12 bytes of data per row still reads 200 bytes from disk. The remaining 188 bytes load into the buffer cache and get discarded.

The Read Amplification Math

The number that makes this concrete is the read amplification ratio: total row width divided by the width of the columns the query actually needs.

For sensor_readings, the calculation is direct. The ts column is a timestamptz at 8 bytes. The temperature column is a float4 at 4 bytes. Together they represent 12 bytes of useful data per row. The full row is 200 bytes.

Read amplification ratio: 200 ÷ 12 = 16.7x

For every byte the query uses, Postgres reads 16.7 bytes from disk.

At 100 million rows covering seven days, that ratio stops being abstract. The query needs 100M x 12 bytes = 1.14 GB. Postgres reads 100M x 200 bytes = 18.6 GB. At a 500 MB/sec sequential read rate, the scan takes approximately 38 seconds. Reading only the needed columns would take roughly 2.3 seconds. That 16x gap is pure storage model overhead.

No index changes this number. No configuration setting changes it. Partitioning reduces scope. Fewer pages get scanned by cutting the time range, but within each partition the same per-row read cost applies. The storage layout determines the I/O, and the storage layout is fixed.

Try This Now: Measure Your Read Amplification

You can calculate the ratio for any table you own. Run these two queries to get the byte widths you need:

-- Full row weight
SELECT pg_column_size(t.*) AS row_bytes
FROM sensor_readings t
LIMIT 1;

-- Queried column weight
SELECT pg_column_size(ts) + pg_column_size(temperature) AS queried_bytes
FROM sensor_readings
LIMIT 1;

Divide row_bytes by queried_bytes. If the ratio is above 5x, the storage model is your largest I/O bottleneck for analytical queries on that table. No index or configuration change will close that gap.

Why Indexes Don’t Solve This

When a query is slow, the instinctive response is to add an index. For OLTP workloads, that instinct is correct. B-tree indexes excel at row selection: they find specific rows in O(log n) time, and for a lookup like SELECT * FROM users WHERE id = 123, the index locates the target row in microseconds.

For analytical queries that touch millions of rows, row selection is not the bottleneck. Finding the rows is fast. Reading the data from those rows is slow. An index scan on a million-row result set still reads the full heap tuple for every matching row to extract the needed columns.

The one exception is a covering index, which stores column values inside the index itself so Postgres can satisfy the query without touching the heap. But covering indexes for analytical queries become impractical at scale. When queries involve aggregations across high-frequency writes, wide covering indexes impose substantial write overhead, compounding exactly the index maintenance costs described in the optimization treadmill post.

B-tree indexes optimize for row selection (which rows to read). Analytical query performance is dominated by row width (how much data per row). These are different problems, and solving one leaves the other intact. For a broader look at what this means for your schema design, see Best Practices for PostgreSQL Data Analysis.

How Columnar Storage Changes the Equation

In columnar storage, data is organized by column instead of by row. All values for ts live together in one stream on disk. All values for temperature live together in another. When the query needs those two columns, it reads two streams. The other 13 columns are never touched.

Same query, same 100 million rows: data read drops to 100M x 12 bytes = 1.14 GB. With typical 10 to 20x compression for time-series data, that compresses to approximately 60 to 120 MB. At 500 MB/sec, the same scan completes in roughly 0.12 to 0.24 seconds.

The compression benefit stacks on top of the I/O reduction. Because all values in a column share the same data type, compression algorithms work far more effectively. Sequential timestamps delta-encode to near-zero storage overhead. Floating-point sensor values compress with XOR-based techniques derived from Facebook's Gorilla algorithm. Row-oriented heap storage can't apply any of these because values from different columns are interleaved on every page. There's no contiguous column stream to compress.

Hypercore: Row and Columnar in One Table

The tradeoff with pure columnar storage is write performance. Every new row appends to each column file separately, which adds overhead for high-frequency ingestion. You get the read benefit but give up write throughput. Tiger Data's Hypercore solves this with a hybrid layout that keeps both.

Recent data stays in row-oriented storage for fast ingestion. Older data converts automatically to columnar format based on a compression policy you configure. The application writes standard SQL to one table. The storage format changes by age without any application-layer involvement.

-- Enable Hypercore on a hypertable with a 7-day row storage window
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'ts DESC'
);

SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

New rows land in row format and ingest quickly. Data older than seven days converts to columnar chunks. To verify the behavior immediately without waiting for the policy schedule, compress a chunk manually:

SELECT compress_chunk(c) FROM show_chunks('sensor_readings') c LIMIT 1;

Then run EXPLAIN (ANALYZE, BUFFERS) on the aggregation query to see the difference in buffer reads (representative output on a 100M-row dataset):

-- Before: row storage sequential scan
Seq Scan on sensor_readings
  Buffers: shared read=2375000 -- 18.6 GB read from disk
  Execution Time: 38142.2 ms

-- After: Hypercore columnar scan
Custom Scan (ColumnarScan) on sensor_readings
  Buffers: shared read=10240 -- 80 MB read from disk
  Execution Time: 196.4 ms

The same SELECT statement works against both storage formats. The query planner handles the difference transparently.

Conclusion

Row storage reads every column to access any column. For analytical queries that scan millions of rows and need only a few, this is the largest source of I/O overhead. It doesn't yield to index tuning, partitioning, or hardware upgrades.

Calculate the read amplification ratio for your most common analytical queries using the pg_column_size queries above. If the ratio is above 5x, Hypercore is the direct fix. Start a free Tiger Data trial today to enable the hybrid storage model on your tables.

Postgres Extensions Cheat Sheet: Replace 7 Databases With SQL

Matty Stratton — Sat, 02 May 2026 20:47:24 +0000

This post is a practical companion to It's 2026, Just Use Postgres. That post makes the architectural case for consolidating on Postgres. This one shows you how.

Below are working SQL examples for each use case. Every extension listed here is available on Tiger Cloud with no additional setup. If you're self-hosting, each section links to the extension's repo.

What you'll be able to do after reading this: Set up Postgres extensions for full-text search, vector search, time-series, caching, message queues, document storage, geospatial queries, and scheduled jobs. Each section is self-contained, so you can skip to what you need.

Enable Everything

Here's the full set. You probably don't need all of them. Pick the ones that match your workload.

CREATE EXTENSION pg_textsearch; -- BM25 full-text search
CREATE EXTENSION vector; -- Vector search (pgvector)
CREATE EXTENSION vectorscale; -- DiskANN index for vectors
CREATE EXTENSION ai; -- AI embeddings and RAG workflows
CREATE EXTENSION timescaledb; -- Time-series
CREATE EXTENSION pgmq; -- Message queues
CREATE EXTENSION pg_cron; -- Scheduled jobs
CREATE EXTENSION postgis; -- Geospatial

Full-Text Search (Replace Elasticsearch)

Extension: pg_textsearch (true BM25 ranking)

What you're replacing: Elasticsearch (separate JVM cluster, complex mappings, sync pipelines), Solr, or Algolia ($1 per 1,000 searches).

What you get: The same BM25 algorithm that powers Elasticsearch, running natively in Postgres. No separate cluster. No sync jobs. No data drift.

CREATE TABLE articles (
  id SERIAL PRIMARY KEY,
  title TEXT,
  content TEXT
);

-- Create a BM25 index
CREATE INDEX idx_articles_bm25 ON articles USING bm25(content)
  WITH (text_config = 'english');

-- Search with BM25 scoring
SELECT title, -(content <@> 'database optimization') AS score
FROM articles
ORDER BY content <@> 'database optimization'
LIMIT 10;

Deep dive: You Don't Need Elasticsearch: BM25 is Now in Postgres

Vector Search (Replace Pinecone)

Extensions: pgvector + pgvectorscale

What you're replacing: Pinecone ($70/month minimum, separate infrastructure, data sync), Qdrant, Milvus, or Weaviate.

What you get: pgvectorscale uses the DiskANN algorithm (from Microsoft Research). On a 50M vector benchmark, it achieved 28x lower p95 latency and 16x higher throughput than Pinecone at 99% recall.

CREATE EXTENSION vector;
CREATE EXTENSION vectorscale CASCADE;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536)
);

-- High-performance DiskANN index
CREATE INDEX idx_docs_embedding ON documents USING diskann(embedding);

-- Find similar documents
SELECT content, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

Auto-sync embeddings with pgai

No more manual embedding pipelines. pgai regenerates embeddings automatically on every INSERT and UPDATE.

SELECT ai.create_vectorizer(
  'documents'::regclass,
  loading => ai.loading_column(column_name => 'content'),
  embedding => ai.embedding_openai(
    model => 'text-embedding-3-small',
    dimensions => '1536'
  )
);

Every row stays in sync. No batch jobs. No drift.

Hybrid Search: BM25 + Vectors in One Query

This is where Postgres consolidation pays off immediately. Combining keyword search and semantic search in other stacks requires two API calls, result merging, failure handling, and double the latency. In Postgres, it's one query.

Simple weighted hybrid

SELECT
  title,
  -(content <@> 'database optimization') AS bm25_score,
  embedding <=> query_embedding AS vector_distance,
  0.7 * (-(content <@> 'database optimization')) +
  0.3 * (1 - (embedding <=> query_embedding)) AS hybrid_score
FROM articles
ORDER BY hybrid_score DESC
LIMIT 10;

Reciprocal Rank Fusion (for RAG applications)

WITH bm25 AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY content <@> $1) AS rank
  FROM documents LIMIT 20
),
vectors AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
  FROM documents LIMIT 20
)
SELECT d.*,
  1.0 / (60 + COALESCE(b.rank, 1000)) +
  1.0 / (60 + COALESCE(v.rank, 1000)) AS score
FROM documents d
LEFT JOIN bm25 b ON d.id = b.id
LEFT JOIN vectors v ON d.id = v.id
WHERE b.id IS NOT NULL OR v.id IS NOT NULL
ORDER BY score DESC LIMIT 10;

One query. One transaction. One result set.

Time-Series (Replace InfluxDB)

Extension: TimescaleDB (21K+ GitHub stars)

What you're replacing: InfluxDB (separate database, Flux or limited SQL), Prometheus (metrics only, not application data).

What you get: Automatic time-based partitioning, compression up to 95%, continuous aggregates for fast dashboards, and full SQL. Your time-series data lives alongside your relational data with JOINs and ACID guarantees.

CREATE EXTENSION timescaledb;

CREATE TABLE metrics (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT,
  temperature DOUBLE PRECISION
);

-- Convert to a hypertable (automatic time partitioning)
SELECT create_hypertable('metrics', 'time');

-- Query with time buckets
SELECT time_bucket('1 hour', time) AS hour,
       AVG(temperature)
FROM metrics
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY hour;

Lifecycle automation

TimescaleDB handles retention and compression policies so you don't have to build cron jobs for data management.

-- Automatically drop data older than 30 days
SELECT add_retention_policy('metrics', INTERVAL '30 days');

-- Compress data older than 7 days (up to 95% storage reduction)
ALTER TABLE metrics SET (timescaledb.compress);
SELECT add_compression_policy('metrics', INTERVAL '7 days');

Case study: Plexigrid went from 4 databases to 1 and got 350x faster queries.

Caching (Replace Redis)

Feature: UNLOGGED tables + JSONB (built into Postgres, no extension needed)

What you're replacing: Redis for simple key-value caching scenarios.

What you get: In-memory-speed storage without WAL overhead. Good for session data, temporary lookups, and simple caches. No separate service to operate.

When to keep Redis: If you need pub/sub, sorted sets, Lua scripting, or complex data structures, Redis is still the better tool for those specific jobs.

-- UNLOGGED = no WAL overhead, faster writes
CREATE UNLOGGED TABLE cache (
  key TEXT PRIMARY KEY,
  value JSONB,
  expires_at TIMESTAMPTZ
);

-- Set with expiration
INSERT INTO cache (key, value, expires_at)
VALUES ('user:123', '{"name": "Alice"}', NOW() + INTERVAL '1 hour')
ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;

-- Get
SELECT value FROM cache
WHERE key = 'user:123' AND expires_at > NOW();

-- Schedule cleanup with pg_cron
SELECT cron.schedule('cache_cleanup', '0 * * * *',
  $$DELETE FROM cache WHERE expires_at < NOW()$$);

Message Queues (Replace Kafka)

Extension: pgmq

What you're replacing: Kafka or RabbitMQ for task queues and simple event processing.

What you get: A lightweight message queue inside Postgres. Send, receive with visibility timeouts, and delete after processing. Transactional with the rest of your data.

When to keep Kafka: If you need high-throughput event streaming across dozens of services, consumer groups, exactly-once semantics, or multi-datacenter replication, Kafka is purpose-built for that.

CREATE EXTENSION pgmq;
SELECT pgmq.create('my_queue');

-- Send a message
SELECT pgmq.send('my_queue', '{"event": "signup", "user_id": 123}');

-- Receive (with 30-second visibility timeout)
SELECT * FROM pgmq.read('my_queue', 30, 5);

-- Delete after processing
SELECT pgmq.delete('my_queue', msg_id);

Alternative: SKIP LOCKED pattern (no extension needed)

For simple job queues, Postgres has a built-in pattern using FOR UPDATE SKIP LOCKED:

CREATE TABLE jobs (
  id SERIAL PRIMARY KEY,
  payload JSONB,
  status TEXT DEFAULT 'pending'
);

-- Worker claims a job atomically
UPDATE jobs SET status = 'processing'
WHERE id = (
  SELECT id FROM jobs WHERE status = 'pending'
  FOR UPDATE SKIP LOCKED LIMIT 1
) RETURNING *;

Documents (Replace MongoDB)

Feature: Native JSONB (built into Postgres since 2014)

What you're replacing: MongoDB for document storage.

What you get: Schemaless document storage with GIN indexing, plus everything Postgres gives you: ACID transactions, relational JOINs, and SQL. No separate database for your "document-shaped" data.

CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  data JSONB
);

-- Insert a nested document
INSERT INTO users (data) VALUES ('{
  "name": "Alice",
  "profile": {"bio": "Developer", "links": ["github.com/alice"]}
}');

-- Query nested fields
SELECT data->>'name', data->'profile'->>'bio'
FROM users
WHERE data->'profile'->>'bio' LIKE '%Developer%';

-- Index specific JSON fields for fast lookups
CREATE INDEX idx_users_email ON users ((data->>'email'));

Geospatial (Replace Specialized GIS)

Extension: PostGIS (the industry standard since 2001)

What you're replacing: Nothing, really. PostGIS is what most specialized GIS tools are built on. It powers OpenStreetMap and has been in production for 24 years.

CREATE EXTENSION postgis;

CREATE TABLE stores (
  id SERIAL PRIMARY KEY,
  name TEXT,
  location GEOGRAPHY(POINT, 4326)
);

-- Find stores within 5km
SELECT name,
  ST_Distance(location, ST_MakePoint(-122.4, 37.78)::geography) AS meters
FROM stores
WHERE ST_DWithin(location, ST_MakePoint(-122.4, 37.78)::geography, 5000);

Scheduled Jobs (Replace External Cron)

Extension: pg_cron

What you're replacing: External cron jobs, Kubernetes CronJobs, or Lambda scheduled triggers for database maintenance tasks.

What you get: Cron scheduling inside Postgres. Useful for cache cleanup, materialized view refreshes, data retention, and periodic aggregation.

CREATE EXTENSION pg_cron;

-- Run cache cleanup every hour
SELECT cron.schedule('cleanup', '0 * * * *',
  $$DELETE FROM cache WHERE expires_at < NOW()$$);

-- Refresh a materialized view every night at 2 AM
SELECT cron.schedule('rollup', '0 2 * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stats$$);

Fuzzy Search (Typo Tolerance)

Extension: pg_trgm (built into Postgres)

CREATE EXTENSION pg_trgm;

CREATE INDEX idx_name_trgm ON products USING GIN (name gin_trgm_ops);

-- Finds "PostgreSQL" even when typed as "posgresql"
SELECT name FROM products
WHERE name % 'posgresql'
ORDER BY similarity(name, 'posgresql') DESC;

What's Next

If you want the architectural argument for why consolidating on Postgres matters (especially in the AI era), read It's 2026, Just Use Postgres.

All of these extensions come pre-configured on Tiger Cloud. Create a free database and start building.

Further reading:

pg_textsearch 1.0: How We Built a BM25 Search Engine on Postgres Pages

Team Tiger Data — Tue, 31 Mar 2026 13:09:03 +0000

Design, implementation, and benchmarks of a native BM25 index for Postgres. Now generally available to all Tiger Cloud customers and freely available via open source.

If you have used Postgres's built-in ts_rank for full-text search at any meaningful scale, you already know the limitations. Ranking quality degrades as your corpus grows. There is no inverse document frequency, so common words carry the same weight as rare ones. There is no term frequency saturation, so a document that mentions "database" 50 times outranks one that mentions it once. There is no efficient top-k path: scoring requires touching every matching row.

Most teams work around this by bolting on Elasticsearch or Typesense as a sidecar. That works, but now you are syncing data between two systems, operating two clusters, and debugging consistency issues when they diverge.

pg_textsearch takes a different approach: real BM25 scoring, built from scratch in C on top of Postgres's own storage layer. You create an index, write a query, and get results ranked by relevance:

CREATE INDEX ON articles USING bm25(content) WITH (text_config = 'english');

SELECT title, content <@> 'database ranking' AS score
FROM articles
ORDER BY content <@> 'database ranking'
LIMIT 10;

The <@> operator returns a BM25 relevance score. Scores are negated so that Postgres's default ascending ORDER BY returns the most relevant results first. The index is stored entirely in standard Postgres pages managed by the buffer cache. It participates in WAL, works with pg_dump and streaming replication, and requires no external storage or special backup procedures.

What shipped in 1.0
From preview to production. In October 2025, we released a preview that held the entire inverted index in shared memory, rebuilt from the heap on restart (preview blog). In the five months and 180+ commits since, the extension has been substantially rewritten:
• Disk-based segments replaced the memory-only architecture
• Block-Max WAND + WAND optimization for fast top-k queries
• Posting list compression with SIMD-accelerated decoding (41% smaller indexes)
• Parallel index builds (138M documents in under 18 minutes)
• 2.4x to 6.5x faster than ParadeDB/Tantivy for 2-4 term queries at 138M scale
• 8.7x higher concurrent throughput
This post covers the architecture, query optimization strategy, and benchmark results. We include a candid discussion of where ParadeDB is faster and a full accounting of current limitations.

Background: Why BM25 in Postgres?

Postgres ships tsvector/tsquery with ts_rank for full-text ranking. ts_rank uses an ad-hoc scoring function that lacks the three properties that make BM25 effective:

Inverse document frequency (IDF): downweights common terms so that rarer, more informative terms drive the ranking.
Term frequency saturation: prevents a document from scoring arbitrarily high by repeating a term many times. A document mentioning "database" 50 times is not 50 times more relevant than one mentioning it once.
Document length normalization: accounts for the fact that a term match in a short document is more informative than the same match in a long one [1].

For applications where ranking quality matters (RAG pipelines, search-driven UIs, hybrid retrieval), this is a material limitation. At scale, ts_rank also has no top-k optimization path: ranking by relevance requires scoring every matching row.

The primary existing BM25 extension for Postgres is ParadeDB/pg_search, which wraps the Tantivy search library written in Rust. Early versions stored the index in auxiliary files outside the WAL; current versions use Postgres pages.

pg_textsearch takes a different approach: rather than wrapping an external search library, the entire search engine (tokenization, compression, query optimization) is built from scratch in C on top of Postgres's storage layer.

Architecture

Fig. 1: pg_textsearch Architecture diagram

The hybrid memtable + segment design

pg_textsearch uses an LSM-tree-inspired architecture [4]. Incoming writes go to an in-memory inverted index (the memtable), which periodically spills to immutable on-disk segments. Segments compact in levels: when a level accumulates enough segments (default 8), they merge into the next level. Fewer segments means fewer posting lists to consult per query term, which directly reduces query latency. This is the same write-optimized-memtable / read-optimized-segment pattern used in RocksDB [5] and other LSM-based engines, adapted here for Postgres's page-based storage.

The write path: memtable

The memtable lives in Postgres shared memory, one per index, accessible to all backends. It contains a string-interning hash table that stores each unique term exactly once; per-term posting lists recording document IDs and term frequencies; and corpus statistics (document count and average document length) maintained incrementally so that BM25 scores can be computed without a separate pass over the index.

When the memtable exceeds a configurable threshold (default: 32M posting entries), it spills to a Level-0 disk segment at transaction commit. A secondary trigger (default: 100K unique terms per transaction) handles large single-transaction loads like bulk imports.

The memtable is rebuilt from the heap on startup. Since the heap is WAL-logged, no data is lost if Postgres crashes before a spill completes. This is analogous to how a write-ahead log protects an LSM memtable, except here the WAL is Postgres's own. The rebuild cost is proportional to the amount of data not yet spilled to segments; for indexes where most data has been spilled, startup is fast.

Fig. 2: pg_textsearch memtable write path

The read path: segments

Segments are immutable and stored in standard Postgres pages. Each segment contains:

A term dictionary: a sorted array of offsets into a string pool, binary-searchable for O(log n) term lookup.
Posting blocks of up to 128 documents each, containing delta-encoded doc IDs, packed term frequencies, and quantized document lengths (fieldnorms). A separate skip index stores one entry per posting block with upper-bound score metadata used by Block-Max WAND optimization (described below).
A fieldnorm table mapping document lengths to 1-byte quantized values using Lucene/Tantivy's SmallFloat encoding [6]. This encoding is exact for lengths 0-39 (covering most short documents); for longer documents, quantization error increases from ~5% to ~11%. In practice, the impact on ranking is smaller than these numbers suggest: BM25 scores depend on the ratio of document length to average document length, which dampens quantization error, and the b parameter (default 0.75) further reduces length's influence.
A doc ID to CTID mapping that translates internal document IDs to Postgres tuple identifiers for heap fetches.

Fig. 3: pg_textsearch segment internal structure

Minimizing page access

Storing data in Postgres pages means every access goes through the buffer manager. Even for pages already in cache, each access involves a buffer table lookup, pin acquisition, and lock handling. That overhead adds up in a scoring loop processing millions of postings. This constraint shaped several design decisions.

Each segment assigns compact 4-byte, segment-local document IDs (0 to N-1), which map to Postgres's 6-byte CTIDs (heap tuple identifiers). After collecting all documents for a segment, doc IDs are reassigned so that doc_id order matches CTID order. Sequential iteration through posting lists then produces sequential access to the CTID mapping, maximizing cache locality. CTIDs themselves are stored as two separate arrays (4-byte page numbers and 2-byte offsets) rather than interleaved 6-byte records, doubling cache line utilization.

The scoring loop works entirely with doc IDs, term frequencies, and fieldnorms. It never touches the CTID arrays. CTIDs are resolved only for the final top-k results in a single batched pass. A top-10 query that scores thousands of candidates resolves ten CTIDs, not thousands.

Postgres integration

Because the index is stored in standard buffer-managed pages, pg_textsearch participates in Postgres infrastructure without special handling: MVCC visibility, proper rollback on abort, WAL and physical replication, pg_dump / pg_upgrade, VACUUM with correct dead-entry removal, and planner hooks that detect the <@> operator and select index scans automatically. Logical replication works in the usual way: row changes are replicated and the index is rebuilt on the subscriber.

Query Optimization: Block-Max WAND

The top-k problem

Naive BM25 evaluation scores every document matching any query term. For a 3-term query on MS-MARCO v2 (138M documents), this means decoding and scoring posting lists with tens of millions of entries. Most applications need only the top 10 or 100 results. The challenge is finding them without scoring everything.

Block-Max WAND

pg_textsearch implements Block-Max WAND (BMW) [2], which uses block-level upper bounds to skip non-contributing posting blocks during top-k evaluation. Lucene adopted a similar approach in version 8.0 [7]. The core idea: maintain the score of the k-th best result seen so far as a threshold, and skip any posting block whose upper-bound score cannot exceed it.

Each 128-document posting block has a corresponding skip entry storing the maximum term frequency in the block and the minimum fieldnorm (the shortest document, which would score highest for a given term frequency). From these two values, BMW can compute a tight upper bound on the block's BM25 contribution without decompressing it. If the upper bound falls below the current threshold, the entire block (all 128 documents) is skipped.

To illustrate: consider a single-term top-10 query on a large corpus. After scanning a few thousand postings, the algorithm has accumulated 10 results with a minimum score of, say, 12.3. It now encounters a block where the upper-bound BM25 score (computed from the block's stored metadata) is 9.1. Since 9.1 < 12.3, no document in this block can enter the top 10, and the entire block is skipped without decompression. For short queries on large corpora, the vast majority of blocks are skipped this way.

Fig. 4: pg_textsearch Block-Max WAND visualization

WAND pivot selection

For multi-term queries, pg_textsearch adds the WAND algorithm [3] for cross-term skipping. Terms are ordered by their current document ID, and the algorithm identifies a pivot term: the first term whose cumulative maximum score exceeds the current threshold. All terms before the pivot advance to at least the pivot's current doc ID, skipping entire ranges of documents across multiple posting lists simultaneously, before block-level BMW bounds are even checked. For multi-term queries, BMW compares the sum of per-term block upper bounds against the threshold, extending the single-term logic described above.

The combination of WAND (cross-term skipping) and BMW (within-list block skipping) is most effective for short queries (1-4 terms), which account for the majority of real-world search traffic. In the full MS-MARCO v1 query set (1,010,916 queries from Bing), 72.6% have 2-4 lexemes after English stemming and stopword removal, with a mean of 3.7 and a mode of 3. The speedup narrows for longer queries, where more blocks contain at least one term with a potentially high-scoring document. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.

Compression and Storage

Posting blocks use a compression scheme designed for fast random-access decoding. Doc IDs are delta-encoded (storing differences between consecutive IDs rather than absolute values), then packed with variable-width bitpacking: the maximum delta in the block determines the bit width, and all deltas use that width. Term frequencies are packed separately with their own bit width. Fieldnorms are the 1-byte SmallFloat values described above.

The bitpack decode path uses branchless direct-indexed uint64 loads rather than a byte-at-a-time accumulator, eliminating branch misprediction in the inner decode loop. Where available, SIMD intrinsics (SSE2 on x86-64, NEON on ARM64) accelerate the mask-and-store step. A scalar fallback handles other platforms.

Compression reduces index size by 41% compared to uncompressed storage. Decode overhead is approximately 6% of query time (measured by profiling), which is more than offset by reduced buffer cache pressure. The scheme prioritizes decode speed over compression ratio.

A note on index size comparisons: pg_textsearch does not store term positions, so it cannot support phrase queries natively (see Limitations). This makes its indexes inherently smaller than engines like Tantivy that store positions by default. The 19-26% size advantage reported in our benchmarks reflects both compression and this feature difference.

Parallel Index Build

For large tables, serial index construction can take hours. pg_textsearch uses Postgres's built-in parallel worker infrastructure to distribute the work.

The leader launches workers and assigns each a range of heap blocks. Workers scan their assigned blocks, tokenize documents via to_tsvector, build local in-memory indexes, and write intermediate segments to temporary BufFiles. The leader then performs an N-way merge of all worker output, writing a single merged segment directly to index pages.

Fig. 5: pg_textsearch Parallel Index Build

Workers run concurrently in the scan/tokenize/build phase; the leader merges sequentially. The expensive part (heap scanning, tokenization, posting list assembly) is CPU-bound and parallelizes well. The merge/write phase is comparatively cheap, so a serial merge captures most of the speedup with minimal complexity. It also produces a single fully-compacted segment that is optimal for query performance.

On MS-MARCO v2 (138M passages), 15 workers complete the build in 17 minutes 37 seconds:

SET max_parallel_maintenance_workers = 15;
SET maintenance_work_mem = '256MB';
CREATE INDEX ON passages USING bm25(content) WITH (text_config = 'english');

Benchmarks

Methodology

All benchmarks use the MS-MARCO passage ranking dataset [8], a standard information retrieval benchmark drawn from real Bing search queries. We compare pg_textsearch against ParadeDB v0.21.6 (which wraps Tantivy). Both extensions use their default configurations; Postgres tuning is specified per experiment. Both systems configure English stemming and stopword removal.

Queries are drawn uniformly from 8 token-count buckets (100 queries per bucket on v1; up to 100 per bucket on v2). Weighted-average metrics use the MS-MARCO v1 lexeme distribution as weights, reflecting real search traffic.

Cache state. All query benchmarks are warm-cache: a warmup pass runs before timing begins, and the working set fits in the OS page cache and shared_buffers for all configurations tested. Results reflect CPU and algorithmic efficiency, not I/O. We have not benchmarked memory-constrained configurations where the index exceeds available cache.

Ranking. Both systems produce BM25 rankings using the same tokenization (English stemming and stopwords). We have not performed a systematic ranking equivalence comparison; both implement standard BM25 with the same default parameters (k1 = 1.2, b = 0.75), but differences in IDF computation and tokenization edge cases may produce different orderings for some queries.

MS-MARCO query length distribution

The following histogram shows the distribution of query lengths in the full MS-MARCO v1 query set (1,010,916 queries), measured in lexemes after English stopword removal and stemming via Postgres to_tsvector('english'):

Fig. 6: MS-MARCO query length histogram

This distribution is broadly consistent with web search query length studies [9, 10]. The MS-MARCO mean of 3.7 lexemes (after stemming/stopword removal) corresponds to roughly 5–6 raw words, consistent with the corpus statistics reported by Nguyen et al. [8]. We use the v1 distribution for weighting throughout as it provides the largest sample.

Results: MS-MARCO v2 (138M passages)

Environment. Dedicated c6i.4xlarge EC2 instance: Intel Xeon Platinum 8375C, 8 cores / 16 threads, 123 GB RAM, NVMe SSD. Postgres 17.4 with shared_buffers = 31 GB. Both indexes fit in the buffer cache.

Index build:

Metric	pg_textsearch	ParadeDB
Index size	17 GB	23 GB
Build time	17 min 37 sec	8 min 55 sec
Documents	138,364,158	138,364,158
Parallel workers	15	14

pg_textsearch index is 26% smaller. ParadeDB builds approximately 2x faster.

Single-client query latency (p50 median, top-10 queries):

Lexemes	pg_textsearch (ms)	ParadeDB (ms)	Speedup
1	5.11	59.83	11.7x
2	9.14	59.65	6.5x
3	20.04	77.62	3.9x
4	41.92	98.89	2.4x
5	67.76	125.38	1.9x
6	102.82	148.78	1.4x
7	159.37	169.65	1.1x
8+	177.95	190.47	1.1x

The same pattern holds: pg_textsearch is fastest on short queries and the systems converge at longer lengths. Weighted by the MS-MARCO v1 query length distribution, the overall p50 is 40.6 ms for pg_textsearch vs. 94.4 ms for ParadeDB, a 2.3x advantage.

Concurrent throughput. We ran pgbench with 16 parallel clients for 60 seconds (after a 5-second warmup). Each client repeatedly executes a query drawn at random from a weighted pool of 1,000 queries:

Metric	pg_textsearch	ParadeDB
Transactions/sec	198.7	22.8
Average latency	81 ms	701 ms
Total transactions (60s)	11,969	1,387

pg_textsearch sustains 8.7x higher throughput under concurrent load.

Results: MS-MARCO v1 (8.8M passages)

On the smaller dataset (GitHub Actions runner, 7 GB RAM, Postgres 17), the advantages are more pronounced: 26x speedup for single-token queries, 14x for 2-token, 7.3x for 4-token. Total sequential execution time for all 800 queries: 6.5 seconds for pg_textsearch vs. 25.2 seconds for ParadeDB. Full results and methodology are available at the benchmarks page.

Discussion

Latency vs. query length

The speedup correlates strongly with query length: 11.7x for single-token queries on v2, narrowing to 1.1x at 8+ tokens. This is the expected behavior of dynamic pruning algorithms like BMW and WAND. Grand et al. [7] observe the same pattern in Lucene's BMW implementation.

The practical significance depends on the workload's query length distribution. 72.6% of MS-MARCO queries have 2-4 lexemes, the range where pg_textsearch shows its largest advantage (6.5x to 2.4x on v2). Weighted by this distribution, the overall speedup is 2.3x on v2 and 3.9x on v1.

Concurrent throughput

The concurrent throughput advantage (8.7x) substantially exceeds the single-client advantage (2.3x weighted p50). pg_textsearch queries execute as C code operating on Postgres buffer pages, with all memory management handled by Postgres's buffer cache. ParadeDB routes queries through Rust/C FFI into Tantivy, which manages its own memory and I/O outside the buffer pool. We have not profiled ParadeDB's internals, so we cannot attribute the concurrency gap to specific causes, but the architectural difference (shared buffer cache vs. separate memory management) is a plausible contributor. ParadeDB's concurrent performance may also improve in future versions.

Where ParadeDB is faster

Index build time. ParadeDB builds indexes 1.6-2x faster across both datasets. Tantivy's indexer is highly optimized Rust code with its own I/O management, not constrained by Postgres's page-based storage. Build time is a one-time cost per index (or per REINDEX); it does not affect query performance.

Long queries. At 7+ lexemes, the two systems converge. On v2, the 8+ lexeme p50 is 178 ms for pg_textsearch vs. 190 ms for ParadeDB. These long queries represent ~3.7% of the MS-MARCO distribution.

Index size caveat. pg_textsearch indexes are 19-26% smaller, but this comparison is not apples-to-apples: pg_textsearch does not store term positions, while ParadeDB stores positions to support phrase queries.

Benchmark limitations

All measurements are warm-cache on datasets that fit in memory. The 100-query sample per bucket provides directional results but limited statistical power for tail latencies. ParadeDB v0.21.6 was current at time of testing; future versions may improve. We compare against ParadeDB because it is the primary Postgres-native BM25 alternative; standalone engines like Elasticsearch operate in a different deployment model. We have not benchmarked write-heavy workloads with concurrent queries.

Limitations

We want to be clear about what pg_textsearch does not support in 1.0.

No phrase queries. The index stores term frequencies but not term positions, so it cannot natively evaluate queries like "database system" as a phrase. Phrase matching can be done with a post-filter:

SELECT * FROM (
  SELECT * FROM documents
  ORDER BY content <@> 'database system'
  LIMIT 100 -- over-fetch to compensate for post-filter
) sub
WHERE content ILIKE '%database system%'
LIMIT 10;

OR-only query semantics. All query terms are implicitly OR'd. A query for "database system" matches documents containing either term. We plan to add AND/OR/NOT operators via a dedicated boolean query syntax in a post-1.0 release.

No highlighting or snippet generation. Use Postgres's ts_headline() on the result set for highlighting.

No expression indexing. Each BM25 index covers a single text column. Workaround: create a generated column concatenating multiple fields.

Partition-local statistics. Each partition maintains its own IDF and average document length. Cross-partition queries return scores computed independently per partition.

No background compaction. Segment compaction runs synchronously during memtable spill. Write-heavy workloads may observe compaction latency. Background compaction is planned.

PL/pgSQL requires explicit index names. The implicit text <@> 'query' syntax relies on planner hooks that do not fire inside PL/pgSQL, DO blocks, or stored procedures. Use to_bm25query('query', 'index_name') explicitly. This is a practical limitation many developers will hit.

shared_preload_libraries required. pg_textsearch must be listed in shared_preload_libraries, requiring a server restart to install. On Tiger Cloud, this is handled automatically.

No fuzzy matching or typo tolerance. pg_textsearch uses Postgres's standard text search configurations for tokenization and stemming but does not provide built-in fuzzy matching. Typo-tolerant search requires a separate approach (e.g., pg_trgm).

What's Next

Planned work for post-1.0 releases:

Boolean query operators: AND, OR, NOT via a dedicated query syntax
Background compaction: decouple compaction from the write path
Expression index support: index computed expressions, not just bare columns
Dictionary compression: front-coding for terms, reducing dictionary size
Improved write concurrency: better throughput for sustained insert-heavy workloads

Try It

pg_textsearch requires Postgres 17 or 18. The fastest way to try it is on Tiger Cloud, where it is already installed and configured. No setup, no shared_preload_libraries. Create a service and run the example below.

For self-hosted installations, pre-built binaries for Linux and macOS (amd64, arm64) are available on the GitHub Releases page. Add it to shared_preload_libraries and restart:

shared_preload_libraries = 'pg_textsearch'

Source code and full documentation: github.com/timescale/pg_textsearch

Part 2 of this series covers getting started with pg_textsearch, hybrid search with pgvectorscale, and production patterns.

References

[1] Robertson et al. "Okapi at TREC-3." 1994. See also: Robertson, Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in IR, 3(4):333-389, 2009.

[2] Ding, Suel. "Faster top-k document retrieval using block-max indexes." SIGIR 2011, pp. 993-1002.

[3] Broder et al. "Efficient query evaluation using a two-level retrieval process." CIKM 2003, pp. 426-434.

[4] O'Neil et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica, 33(4):351-385, 1996.

[5] Facebook. "RocksDB: A Persistent Key-Value Store for Fast Storage Environments." https://rocksdb.org/

[6] SmallFloat encoding: Apache Lucene SmallFloat.java. Tantivy uses an equivalent implementation.

[7] Grand et al. "From MAXSCORE to Block-Max Wand: The Story of How Lucene Significantly Improved Query Evaluation Performance." ECIR 2020.

[8] Nguyen et al. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." 2016.

[9] Statista. "Distribution of online search queries in the US, February 2020, by number of search terms."

[10] Dean. "We Analyzed 306M Keywords." Backlinko, 2024.

How to Break Your PostgreSQL IIoT Database and Learn Something in the Process

Doug Pagnutti — Mon, 30 Mar 2026 17:42:43 +0000

As engineers, we're taught to design for reliability. We do design calculations, run simulations, build and test prototypes, and even then we recognize that these are imperfect, so we include safety factors. When it comes to the Industrial Internet of Things (IIoT) though, we rarely give the same level of scrutiny to the components that we rely on.

What if we treated our IIoT database the same way we treated the physical things we produce? We build and design a prototype database, and then put it through some serious testing, even to failure.

The Value (and Perils) of Stress Testing

Think of database stress testing as a destructive materials test for your data storage. You wouldn't trust a bridge made of untested steel, so don’t trust your database until you know its limits.

The Value:

Identify Bottlenecks: Stress testing reveals the weak links—what is likely to fail first? Will you run out of storage? Will your queries get bogged down? Or will you hit the dreaded ingest wall (when data comes in faster than it can be stored)?
Determine Real-World Behaviour: You'll find out exactly how your database performance changes as the amount of data increases. What issues are future-you going to struggle with?
Optimize Configuration: Just like you might build a few different prototypes and see how it affects failure modes, changing your database configuration, especially when it comes to indices, can dramatically affect how it behaves. Building a rigorous stress testing framework provides a safe way to optimize your design.

I hope it goes without saying, but please, please don’t run this on your production environment. Even if it’s technically a different database but the same hardware, this test can wreak havoc on your resources and crash your system. You’ve been warned.

What to Measure?

There’s no point going through all the effort to break your system if you don’t learn anything. Assuming you’re using a PostgreSQL database (It’s 2026, Just Use PostgreSQL), here is a decent set of metrics to keep track of while you’re putting your database through its paces.

Table Size

The size of a Postgresql table is generally measured by number of rows, but the actual space on disk that it occupies is a sum of the heap (the main relational table), the indices, and the TOAST (storage for large objects).

The following query will give the number or rows as well as the size of each component of the table in bytes.

SELECT
      reltuples::bigint AS row_count,
      pg_relation_size('iiot_history') AS heap_size,
      pg_indexes_size('iiot_history') AS indices_size,
      pg_table_size('iiot_history') -
            pg_relation_size('iiot_history') AS toast_size
FROM pg_class WHERE relname = 'iiot_history';

The reason for the odd row_count is that counting rows the standard way, with COUNT(*), requires scanning the whole table, which is going to be painfully slow when we’re building a table big enough to break things.

Table Performance

The best way to measure table performance is to use the actual queries that your production system will use. At a minimum, this should include your batched INSERT (you always batch, right?) and at least one common SELECT. Keep in mind that for a table with N rows, the timing for queries tend to be either constant, log(N), N or worse depending on how the indices are structured.

You can get very accurate timing info from running your queries with the prefix EXPLAIN ANALYZE, and it’s worth doing this at least once to see what the database is doing under the hood. However, I recommend running the whole test with a scripting language and then just timing the execution of that particular step.

Server Performance

Don’t forget the engine that’s driving all this machinery. You’ll need to watch the CPU, Memory, Storage, and Network Bandwidth. People in the IT world tend to talk about headroom for a server, and that’s what you’re really looking at: how much spare capacity do you have? Your CPU and Memory usage might spike at times, but the important thing is that it’s not always running at max capacity.

There are a lot of free and paid tools to monitor these variables. I almost always do this type of test in a VM (easier to clean up the mess when it all breaks) and I like to use Prometheus but honestly Perfmon in Windows or Top in Linux gives you all you really need.

Setting Limits

It’s helpful to set some limits on these parameters so you know when to stop the test. For database size, it might be some measurement like a year's worth of data, or when the drive is 80% full. For ingest timing, I suggest stopping when inserting takes longer than the desired ingest frequency—this is the ingest bottleneck and something you really want to avoid in production. Scan times can be limited by the time it takes for a specific query. Maybe calculating the average value from one tag over the past hour must be less than 10s.

How to Simulate Data?

There are lots of ways to insert data, but it’s usually a tradeoff between how well the data represents real scenarios and how long it takes to run the test.

The following is one of my favourite methods for injecting large amounts of data into an IIoT database:

Say you have a classic IIoT history table like the following:

CREATE TABLE iiot_history(
    time TIMESTAMPZ NOT NULL,
    tag_id INT NOT NULL,
    value DOUBLE PRECISION,
    PRIMARY KEY (tag_id, time)
);

If you expect to ingest 10,000 tags at 1s intervals, you can use the following INSERT query to add a day’s worth of history to the back end of your table.

INSERT INTO iiot_history(time, tag_id, value)
    SELECT *, random() as value 
FROM(
        SELECT generate_series(
            min_date-INTERVAL '1day',
            min_date-INTERVAL '1s',
            INTERVAL '1s') as time
        FROM (SELECT LEAST(NOW(),MIN(time)) AS min_date 
FROM iiot_history)
),
        generate_series(1,10000) as tag_id;

This will generate random data values for every second during a day and for every tag_id from 1 to 10,000. Not exactly as interesting as real data, but enough to fill up your table.

The nice thing about this query is that you should be able to run it in parallel to your real-time data pipeline and it won’t mess with your data (aside from potentially locking your table while it runs). It’s also easy to modify this query to inject more or less tags as well as change the time interval if you’re playing around with different configurations.

If you use this query, or whichever one you prefer, in a script (I usually use Python), then you can automate the whole test. Something along the lines of:

Get database size
Run select queries, measure execution time
Run insert queries several times, measure and average execution time
Artificially grow database size
Repeat 1-3 until one of the failure conditions is reached.

How to Interpret Results and What to Expect in the Real World?

Your test results will give you some clear data points, but you still need to do some interpreting.

Identify the Limiting Component: Where did the database fail? If it’s a query that took too long, you might be able to speed things up with a clever index. If it’s an insert that took too long, you might be able to speed things up by removing that clever index you added earlier.
Optimize: There’s a lot you can do to improve table performance before throwing the whole thing out in frustration:
1. Proper Indexing: Choosing an index is almost always a tradeoff, for example: Indexing the tag_id column before the time column will speed up most queries, at the cost of slower inserts as the table grows. Indexing the time column first will avoid the ‘ingest wall’ at the cost of slower queries. Figure out which solution is best.
2. Plan for the future: Will you need more hardware in a few months or a few years? Being able to estimate the life of your existing architecture means you won’t be caught unawares when it no longer suffices.
3. Partitioning/Chunking: For very large tables, you may need to partition appropriately (see PostgreSQL extensions like TimescaleDB). How great would it be to learn you’ll need this before you actually need this.
Add a Safety Factor: If your test showed a maximum reliable throughput of 15,000 rows/sec, set your operational limit to 10,000 rows/sec. The real world has peaks, unexpected queries, and background maintenance tasks that will steal resources. Like we do with all engineering products, design with margin.

If you treat your database like a prototype and really put it through its paces, you’ll get a preview of how it’ll behave in the future and make good, proactive design decisions instead of struggling in the future. Now, go break something (and learn).

What Developers Get Wrong About Storing Sensor Data

Team Tiger Data — Thu, 19 Mar 2026 14:08:03 +0000

Sensor Data Looks Simple Until It Isn’t

Sensor data appears straightforward. It just has timestamps, numeric readings, and maybe a device identifier. Compared to transactional application data, sensor data feels uniform and predictable. Teams often assume they can store it using familiar relational database schemas and grow from there.

That assumption falls apart instantly when scale explodes. Devices multiply, sampling rates rise, and historical data accumulates indefinitely. Queries shift from single-row lookups to time windows and aggregations. Data arrives out of order. Storage costs grow exponentially. Systems designed around transactional assumptions crack in ways that are difficult to correct once data volume locks architecture in place.

The root problem is conceptual. Sensor data looks like rows but behaves like a time-ordered stream whose value declines with age. Engineers must design the database as a time-series log with decay from the outset, rather than adapting it from a transactional model later. The following sections show how relational database approaches are inadequate for handling sensor data, and what a more suitable architecture looks like.

Default Model: Treating Sensor Data Like Rows

Most database developers approach sensor data with a transactional mindset. They design normalized schemas, enforce relational integrity, and add indexes for point queries. They only work for mutable business entities such as users or orders.

Sensor data, however, is append-only. New measurements arrive continuously and are rarely updated. Sustained ingestion and time-range retrieval are dominant, not row mutation or lookup. When schemas assume row-oriented access, data ingestion becomes join-heavy, indexing costs grow with volume, and write throughput falls behind input data flow.

Treating sensor data as rows creates problems precisely where sensor systems spend most of their effort: writing and scanning time-ordered streams.

Where That Model Breaks

As the system grows, several problems appear simultaneously.

First , ingestion is continuous and bursty. Devices reconnect and flush buffers, producing spikes rather than steady flows. Row-oriented schemas struggle to absorb these bursts efficiently.

Second , growth compounds across multiple axes: more devices, higher sampling frequency, additional metrics, and longer retention. Storage volume grows quickly, turning early schema choices into long-term constraints because migrating historical time-series data is costly and risky.

Third , queries shift toward time windows. Monitoring, analytics, and diagnostics rely on ranges, aggregates, and rates over time rather than individual rows. Row-optimized indexing performs poorly for these scans.

Fourth , operational realities inevitably create problems. Timestamps arrive late or out of sequence. Data must be replayed or corrected. Systems designed for ordered inserts encounter fragmentation and duplication under these conditions.

Each constraint highlights the same reality. Sensor workloads are shaped by time and continuity, not by relational identity.

Key Insight: Sensor Data Is a Log With Decay

Sensor data has two defining properties.

It is a log: append-only, time-indexed, and rarely modified after arrival.
It decays: its value decreases as it ages, even as its volume accumulates.

Recent data require high-resolution monitoring and debugging. Older data supports trends and aggregates. Very old data is rarely queried except in a summarized form. Yet without lifecycle awareness, systems retain all data at equal resolution and cost.

Once teams understand that sensor data is a log with decay , the correct architecture becomes clear. Storage must optimize for append throughput and time-range access while permitting data to evolve in resolution and tier as it ages.

Time-Series Architecture

Time-series data that loses value over time requires the database architecture to have a few key properties.

Log-optimized ingestion

Writes must be sequential and batched, minimizing per-row overhead. Storage engines and schemas should favor append operations over update operations so ingestion scales with device fleets and burst conditions.

Time-partitioned organization

Data should be grouped primarily by time, corresponding its physical storage with dominant query patterns. Time partitioning keeps recent data localized and keeps historical segments compact and independent.

Lifecycle tiering

Because sensor data’s value declines with age, resolution, and storage cost should decline as well. High-resolution recent data is hot, and older data is compressed, downsampled, or moved to cheaper storage tiers while preserving analytical performance.

Role separation

Operational monitoring, historical analytics, and archival retention create different latency and throughput challenges. Separating these roles prevents continuous ingestion from degrading analytical performance and allows each layer to evolve independently.

These properties are not optimizations layered onto transactional storage. Instead, they are intentional design choices needed to handle the key aspects of time-series data: continuous append, time-range access, and aging value.

What This Enables for Developers

Architectures aligned with time-series data change how systems scale and operate.

Ingestion stays stable as fleets expand because write operations match append patterns rather than row mutation. Query cost stays predictable because time-range scans match with storage layout. Storage growth stays bounded relative to insight because data resolution declines with age. Operational corrections and replays become routine rather than disruptive because logs tolerate disorder.

Developers spend less effort compensating for schema problems and more effort deriving insight from data. Systems stay adaptable as deployments grow from prototypes to global fleets.

Why Time-Series Architecture Becomes Inevitable

Engineers only design transactional database models for mutable records whose value stays relatively stable over time. Sensor data is the opposite. It is filled with immutable events whose volume grows continuously while their value declines with age. As ingestion becomes constant, queries become time-range-driven, and history accumulates indefinitely, databases built on transactional assumptions develop write bottlenecks, inefficient scans, and rising storage costs.

Once teams understand that sensor data is just an append-only data stream with aging value, the architectural solution becomes clear. Systems must ingest sequentially, organize primarily by time, reduce resolution as data ages, and separate operational and historical workloads. These structures stem directly from how sensor data behaves, not a preference for any particular technology.

Treating sensor data as rows delays problems but does not fix them. As scale grows, transactional models diverge further from workload reality, while time-series architectures stay matched to it. Database design, therefore, can’t be retrofitted late without cost and disruption. It must start from the correct model: sensor data as a time-series log with decay.

How Do PostgreSQL Indices Work, Anyways?

Matty Stratton — Wed, 18 Mar 2026 14:35:21 +0000

You've probably created a hundred indexes in your career. Maybe a thousand. You ran EXPLAIN ANALYZE, saw "Index Scan" instead of "Seq Scan," pumped your fist, and moved on.

But do you actually know what's happening underneath? Because once you do, a lot of things about PostgreSQL performance start to make a lot more sense. And some of the pain points you've been fighting start to feel less like mysteries and more like, well, physics.

It's a tree. Obviously.

The default index type in PostgreSQL is a B-tree. You knew that. But let's talk about what that actually means for your data.

When you create an index on, say, a timestamp column, PostgreSQL builds a balanced tree structure where each node contains keys and pointers. The leaf nodes point to actual heap tuples (your rows on disk). The internal nodes just help you navigate. Think of it like a phone book. (Do people still know what phone books are? I'm aging myself.)

The key thing to understand: the index is a separate data structure from your table. It lives in its own pages on disk. When you insert a row, PostgreSQL doesn't just write your row. It also has to update every index on that table. Every. Single. One.

So if you have a table with five indexes and you're doing 50,000 inserts per second, that's not 50K write operations. That's 250K+ B-tree insertions per second, plus the heap write. Oof.

You can see exactly how much space each index is consuming with \di+ in psql:

\di+ public.*

-- Or if you want programmatic access:
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used,
       idx_tup_read AS tuples_read
FROM   pg_stat_user_indexes
WHERE  schemaname = 'public'
ORDER  BY pg_relation_size(indexrelid) DESC;

Run that on your biggest table. If you see indexes measured in gigabytes that have idx_scan = 0, those indexes are costing you writes and giving you nothing back. They're dead weight.

Pages, not rows

Here's where it gets interesting. PostgreSQL doesn't read individual rows from disk. It reads 8KB pages. Always. Even if you only want one tiny row, you're pulling in a full 8KB page.

Your B-tree is also organized into 8KB pages. Each page holds as many index entries as it can fit. For a simple index on a bigint column, you can fit a few hundred entries per page. For a compound index on (tenant_id, event_type, created_at), you're fitting fewer because each entry is wider.

When PostgreSQL traverses your B-tree, it starts at the root page, reads it, follows a pointer to the right internal page, reads that, and eventually gets to a leaf page that tells it where your actual row lives on the heap. For a table with a million rows, that's maybe three or four page reads. For a billion rows, it might be five or six. Logarithmic scaling is your friend here.

You can see this in action with EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM events WHERE created_at > now() - interval '1 hour';

-- Look for lines like:
--   Index Scan using events_created_at_idx on events
--     Buffers: shared hit=4 read=2

The shared hit count tells you how many pages came from the buffer cache. The read count tells you how many had to come from disk. If you're seeing high read values on a query you run frequently, your working set has outgrown your shared_buffers.

But. (There's always a but.)

The part nobody thinks about

Those leaf pages need to stay ordered. When you insert a new value that belongs in the middle of a page that's already full, PostgreSQL has to split that page. Page splits are expensive. They cause write amplification and can fragment your index over time.

For time-series data (timestamps always increasing), you mostly dodge this problem because new values go to the rightmost leaf. That's nice. But it creates a different problem: hot-page contention. Every concurrent insert is fighting to write to the same leaf page at the end of the tree.

And then there's the part that really gets you: MVCC overhead.

PostgreSQL's multiversion concurrency control means that even your index has to deal with tuple visibility. Index entries don't get removed immediately when a row is deleted or updated. They stick around until VACUUM cleans them up. So your index isn't just tracking live rows. It's tracking all the versions of your rows until the cleanup crew gets around to it.

For a high-churn table, your index can be significantly larger than you'd expect just from the row count. I've seen cases where the index is effectively 2-3x the "expected" size because of dead tuple bloat.

Here's how to check if bloat is eating your indexes alive:

SELECT relname,
       n_dead_tup,
       n_live_tup,
       round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM   pg_stat_user_tables
WHERE  n_dead_tup > 10000
ORDER  BY n_dead_tup DESC;

If dead_pct is climbing above 10-20% and last_autovacuum was hours ago (or null), autovacuum is falling behind. That bloat isn't just wasting space. It's making every index scan touch more pages than it should.

Index-only scans (and why they're worth understanding)

There's one more behavior worth knowing about, because it changes how you think about index design.

Normally, PostgreSQL uses the index to find where a row lives on the heap, then goes and reads the actual row. That's two separate lookups: the index, then the heap.

But if every column your query needs is already in the index, PostgreSQL can skip the heap entirely. That's an index-only scan, and it's significantly faster.

-- This index covers both the WHERE clause and the SELECT list:
CREATE INDEX idx_events_covering ON events (created_at) INCLUDE (event_type, value);

-- Now this query never touches the heap:
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_type, value FROM events WHERE created_at > now() - interval '1 hour';

-- Look for:
--   Index Only Scan using idx_events_covering on events
--     Heap Fetches: 0

The Heap Fetches: 0 is what you want. That means PostgreSQL answered the entire query from the index alone.

The catch: index-only scans only work well when the visibility map is up to date, which brings us right back to VACUUM. If VACUUM hasn't visited a page recently, PostgreSQL can't trust the index alone and has to check the heap anyway. So even this optimization depends on keeping autovacuum healthy.

Partial indexes (less is more)

One more tool that's underused: partial indexes. If you only query a subset of your data most of the time, you can index just that subset.

-- Instead of indexing every row:
CREATE INDEX idx_events_status ON events (status);

-- Index only the rows that matter:
CREATE INDEX idx_events_active ON events (status) WHERE status = 'active';

The partial index is smaller, faster to scan, and cheaper to maintain on writes. For high-churn tables where most queries filter to a small slice of data, this is free performance.

So why does this matter?

Understanding this stuff isn't just academic. It explains real problems you hit in production:

Why adding indexes slows down writes. Every index is another B-tree that needs to be maintained on every insert. It's not free. It's never been free. The cost just hides until you're at scale.

Why your queries get slower over time even though nothing changed. Index bloat from dead tuples. Pages that used to be tightly packed are now half-empty after splits and vacuuming. Your three-page-read query is now a six-page-read query.

Why VACUUM matters so much. It's not just reclaiming table space. It's keeping your indexes healthy. If autovacuum can't keep up, your indexes degrade. And if you're inserting fast enough, autovacuum can fall behind. That's not a bug. That's just the architecture working as designed.

Why partitioning helps (and then stops helping). Smaller partitions mean smaller indexes mean fewer tree levels. Great. But now your query planner has to evaluate all those partitions to figure out which ones to scan. And that planning cost scales linearly with partition count. You're trading one bottleneck for another.

The bigger picture

I wrote about this cycle more extensively in a piece about the PostgreSQL optimization treadmill. The short version: there's a pretty predictable progression that teams go through. Optimize indexes. Partition tables. Tune autovacuum. Scale vertically. Add read replicas. Each phase buys you a few months.

That's not a criticism of PostgreSQL. Postgres is an incredible database. But it's a general-purpose relational database, and its architecture reflects that. The heap storage model, MVCC, the query planner, B-trees. They're all designed to handle a wide range of workloads really well. The tradeoff is that for very specific access patterns (like time-series data at scale), those general-purpose design choices start working against you instead of for you.

Understanding how your indexes work is the first step to understanding when they stop being enough. And knowing when you're fighting the architecture instead of optimizing within it can save you months of whack-a-mole performance tuning.

But that's a topic for another day. For now, go run these queries on your biggest table:

-- How big are your indexes, really?
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size,
       idx_scan AS scans
FROM   pg_stat_user_indexes
WHERE  relname = 'your_table_here'
ORDER  BY pg_relation_size(indexrelid) DESC;

-- Are any of them unused?
SELECT indexrelid::regclass AS index_name,
       idx_scan
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
  AND  schemaname = 'public'
ORDER  BY pg_relation_size(indexrelid) DESC;

You might be surprised.

🚀 Introducing Agentic Postgres: The First & Free Database Built for Agents

Ajay Kulkarni — Tue, 21 Oct 2025 15:15:26 +0000

Agents are the New Developer

80% of Claude Code was written by AI. More than a quarter of all new code at Google was generated by AI one year ago. It’s safe to say that in the next 12 months, the majority of all new code will be written by AI.

Agents don’t behave like humans. They behave in new ways. Software development tools need to evolve. Agents need a new kind of database made for how they work.

But what would a database for agents look like?

At Tiger, we’ve obsessed over databases for the past 10 years. We’ve built high-performance systems for time-series data, scaled Postgres across millions of workloads, and served thousands of customers and hundreds of thousands of developers around the world.

So when agents arrived, we felt it immediately. In our bones. This new era of computing would need its own kind of data infrastructure. One that still delivered power without complexity, but built for a new type of user.

What Agents Actually Need

Agents work differently than humans. They need:

MCPs, not UIs – they call functions, not click buttons
Native search – find the right data instantly
Fast forks and teardown – spin up experiments without the overhead
Efficient pricing – pay for what you use
Built-in knowledge – best practices that come with the database

What We Built

1. An MCP Server That Actually Understands Postgres

We built an MCP server that doesn't just connect to the database—it knows how to use it well. We took 10+ years of Postgres experience and turned it into built-in prompts. Agents get tools for schema design, query optimization, and migrations, plus they can search Postgres docs on the fly.

> I want to create a personal assistant app. Please create a free 
> service on Tiger. Then using Postgres best practices, describe 
> the schema you would create.

2. Search Built Into the Database

pgvectorscale: We improved our vector search extension. Better indexing throughput, better recall, lower latency than pgvector.

pg_textsearch: Our newest extension. It implements BM25 for proper ranked keyword search, built for hybrid AI apps. Right now it uses an in-memory structure for speed—disk-based segments with compression are coming.

No need to bolt on external search. It's all in Postgres.

3. Instant Database Forks

We built a copy-on-write storage layer that makes databases instantly forkable. Full production data, isolated environment, seconds to create. No data duplication, no cost duplication. You only pay for what changes.

Great for testing, benchmarking, or running migrations in parallel without touching prod.

> Create a fork of my database, test 3 different indexes 
> for performance, delete the fork, and report findings.

4. New CLI and a Free Tier

Three commands to get started:

# Install the Tiger CLI and MCP
$ curl -fsSL https://cli.tigerdata.com | sh
$ tiger auth login
$ tiger mcp install

Then either tell your agent to create a free service, or run tiger create service yourself.

Fluid Storage

This all runs on Fluid Storage—our new distributed block store. It's built on local NVMe with a storage proxy that handles copy-on-write volumes.

What you get:

Instant forks and snapshots
Automatic scaling, no downtime
Over 100K IOPS and 1 GB/s per volume

It looks like a local disk to Postgres but scales like cloud storage. Every free service runs on it.

Try It Today

$ curl -fsSL https://cli.tigerdata.com | sh

Built for agents. Designed for developers.