DEV Community: Team Tiger Data

AI's Physical Constraints: How AI Rewired the Data Center

Team Tiger Data — Thu, 02 Jul 2026 19:34:34 +0000

For most of the cloud era, a server rack was a five to twenty kilowatt object. You could fill a room with them, move air across the front, and the building stayed an ordinary building. A single current AI rack, NVIDIA's GB300 NVL72, draws about 132 to 140 kilowatts, with the GPUs alone accounting for more than a hundred. That is close to an order of magnitude more power in the same floor space as those old racks, and it lands as heat in the same small volume. Past roughly a hundred kilowatts per rack, air stops being able to carry the heat out, and the rack has to be plumbed for liquid. The compute got denser and the building changed with it.

This pattern repeats all the way out to the grid. For about fifteen years, getting more computing power felt like turning a dial. You needed more, you asked for more, and a few seconds later it was there. Spin up a hundred servers for a traffic spike, spin them back down when it passes. Capacity behaved like something continuous, instant, and reversible, a knob you turned rather than a thing you built. A generation of software was designed on that assumption, and it held, because for ordinary workloads the power and hardware involved were small against what the world could supply.

That has changed. Across AI infrastructure projects, the same moment now repeats. A team asks for more, and the answer comes back no. Not "no, that costs more," which everyone understands, but a harder no. "No, those GPUs are not available this quarter." "No, that region has no more power, and will not for years." You can order a GPU in a day; the date that a few hundred megawatts arrives at a site can be four or five years out, and no amount of money moves it sooner. The request that used to be a billing question has become a physical one.

Anyone who has designed, built, or run a data center knows the physical layer was always there. The people who design and build these facilities sized the transformers, ordered the switchgear, planned the cooling, and waited on the utility. What is new is the scale at which AI hits the physical layer. The data centers going up for AI are a different class of build: denser, hotter, hungrier, and more tightly coupled to the grid than the ones that came before.

Data centers always needed chips, memory, cooling, power, and water. Most cloud workloads before the AI surge kept those requirements in a range the existing build could absorb. AI pushes them past the thresholds where the old assumptions hold. It does not create a new kind of physics. It removes the buffer that made the physics easy to ignore.

Each of these limits has been written about on its own. What gets missed is how they connect, and why AI makes them arrive together. AI scaling moves through a physical dependency chain: more accelerators require scarce chip packaging and memory; more memory and compute concentrate heat; concentrated heat changes the rack and the building; the building then needs power the grid may take years to deliver; at that scale, the power itself may have to be buffered inside the facility; and the cooling choices made along the way determine where water becomes a problem. The limits arrive in a predictable order, and the order is the story:

GPUs. The first visible shortage is accelerators, the GPUs that do the AI computation, but the real bottleneck sits around the chip, not in it.
Memory. The accelerator depends on high-bandwidth memory, which pulls on the same finite wafer base as ordinary memory.
Cooling. More compute and memory in the same space means more heat in the same rack, past what air can carry.
Power and time. Liquid cooling moves heat, but every watt still has to come from the grid. First you wait for power to arrive; then, at AI scale, you may need to buffer the workload's own power swings.
Water. Not the national catastrophe the headlines suggest, but a local siting constraint shaped by cooling design.

Let's start with the one everyone already knows.

GPUs: The Bottleneck Is Not the Chip

The first wall everyone notices is GPUs. You cannot get them, you cannot get enough, or the price to rent them has climbed since last year. The figures are not subtle. H100 rental prices rose roughly 40 percent off their late-2025 lows in a matter of months, from about $1.70 to $2.35 per GPU-hour on one-year contracts between October 2025 and March 2026, and on-demand capacity is effectively sold out across GPU types. The pressure reaches the workstation end too. In June 2026 NVIDIA listed its RTX Pro 6000 Blackwell at $13,250, a 55 percent jump over the $8,565 launch price a year earlier, and the reason it gave was the 96 gigabytes of memory on the card in a market where memory is scarce.

The obvious reading is that NVIDIA cannot make enough chips, but that is not where the bottleneck lives. A modern AI accelerator is not one chip but a package: the processor die, stacks of high-bandwidth memory (HBM), and an interposer that wires them together at enormous bandwidth. A faster processor does not help if it cannot be packaged with memory, and advanced packaging capacity is finite, specifically the chip-on-wafer-on-substrate (CoWoS) lines at TSMC. The same analysis that tracked the rental spike named CoWoS packaging and HBM, not the processor, as the choke points. The lead times that stretch GPU orders toward a year are gated there. The chip is not the scarce thing; what surrounds it is.

So the GPU shortage is really a packaging and memory shortage wearing a GPU label. Packaging capacity can expand, but it expands on a manufacturing clock. Memory is the harder half, and the reasons it stays scarce are the next wall.

Memory: Why the Price Will Not Come Down

When a component spikes in price, the reflex is to wait. Shortages end, factories ramp, the price comes back down. That reflex is wrong here, and the reason is structural, not cyclical.

The memory AI systems need comes in two kinds. HBM is the fast, expensive memory stacked right next to the accelerator, where the model's working data lives during computation. Dynamic random-access memory (DRAM) is the ordinary system memory around it. The binding shortage is HBM, and the pressure spills into DRAM because both pull on the same finite wafer base. SK Hynix, the leading maker of HBM, locked up its 2026 HBM output well ahead of the year, and Micron has likewise reported its 2026 HBM sold out. Memory has gone from a rounding error in a machine's bill of materials to the single largest driver of the price on a high-end card.

So why not just make more? Essentially all the world's DRAM comes from three companies: Samsung and SK Hynix in South Korea, and Micron in the United States. When all three make the same allocation call at once, that is the global supply. There is no fourth maker at comparable scale waiting to undercut them.

Adding supply means building a fabrication plant, and a fab is not a factory you stand up in a quarter. It takes years of cleanroom construction, tool installation, and qualification before a single sellable chip comes out. HBM makes the squeeze worse, not better: each gigabyte of it uses about three times the wafer capacity of DDR5, today's standard volume DRAM, so every wafer redirected to the scarce thing makes the common thing scarcer still, and HBM already consumes roughly a quarter of all DRAM wafer output.

There is a second reason, and it is a choice rather than a constraint. The makers are steering wafers toward high-margin AI and enterprise memory and away from everything else, because that is where the money is. A new fab does not automatically reverse that, because the same margin logic governs what it chooses to build. IDC, the market-research firm, projects 2026 DRAM supply growth of only about 16 percent year over year, well below the 20 to 30 percent that was historically normal, even as demand for the AI variety grows far faster. The people running these companies are saying so directly. Intel's chief executive, Lip-Bu Tan, relayed in February 2026 what two of the key memory makers had told him: there is no relief until 2028.

The makers could produce more, but it takes years, only three companies do it at global scale, the AI memory eats three times the wafers, and they earn far more selling to AI than to you. Part of the shortage is physics, the supply is simply years out. Part of it is choice: the capacity that does exist is being pointed at AI, not at you. You have not been priced out for a quarter. You have been outbid.

Cooling: Why the Rack Changed Shape

GPUs and memory are still things you buy, even when you have to wait. The next wall is different. The same density that makes AI systems powerful, more transistors on the die and more memory stacked beside it, becomes heat the moment the hardware enters a building.

Every watt of power a machine draws comes back out as heat. Whatever goes in has to come out, or the machine cooks itself. It sounds too simple to matter, and it is the whole reason the rack changed shape.

For most of the history of computing, taking the heat out meant moving air. Servers in a rack, cool air through the room, and that was enough. At the rack densities of the CPU era, even at the top of the range, the airflow was manageable enough that the building stayed recognizably the same kind of building: rows of racks, cold aisles, hot aisles, chillers, and fans. Anyone who built those rooms knows the envelope.

AI did not make heat new. It made heat dense.

All the capability crammed into the die and the memory turns into heat in the same small volume. The rack densities from the opening, an order of magnitude higher than the CPU era, are really heat-removal figures: every one of those kilowatts has to be carried back out. The next generation on the roadmap, the Vera Rubin systems, is projected to push per-rack density several times higher again, and cooling vendors are already designing for the increase. At those densities, air cooling stops being practical. Bigger fans do not solve the volume problem.

Figure 1: Rack power density by generation. Past roughly 100 kilowatts, air cooling stops working and liquid becomes mandatory. The Rubin Ultra figure is a roadmap projection.

So the machine changed shape. A given volume of air can carry only so much heat away before it has to move faster than is practical, while water carries roughly three to four thousand times as much heat per unit volume as air. Past roughly a hundred kilowatts per rack, that gap stops being an efficiency question and becomes a hard limit, and the model flips: from moving air through a room to carrying heat away in liquid piped directly to the chip. The newest systems do not offer an air-cooled option at all. The GB300 NVL72 is fully liquid-cooled. The rack is no longer just electrical equipment. It now has plumbing.

Capital helps with procurement and retrofits. It does not change the thermal limits of air. This wall is geometry and thermodynamics. And it has a consequence even experienced operators feel: the hardware no longer fits in most existing buildings. A data center built for air, even one finished a couple of years ago, often cannot host these racks without being substantially rebuilt, retrofitted for liquid distribution, higher floor loading, and the plumbing that comes with it. You cannot simply drop the latest GPUs into the footprint you already have. For an operator who has spent a career optimizing airflow, that is the moment it becomes clear this is a different kind of building.

Liquid cooling changes how heat leaves the chip. It does not change where the energy comes from. Every watt still starts at the grid, and that is where the slowest constraint appears.

Power and Time: The Wall Underneath the Walls

What limits AI in the end is not chips, and it is not cooling. It is electricity, and specifically it is time. You can buy a GPU in a day. You cannot buy the specific date on which a few hundred megawatts will be delivered to a site. That is set by a physical system that moves on a timescale of years. Money can fund equipment and alternatives, but it does not make shared grid capacity appear on software time.

The building can be designed and built on one clock; the grid upgrades that let it draw full power often run on a longer one. Before a site can draw full power, the utility has to study and approve the load, the transmission system has to support it, and the substations and lines that feed it have to exist. The industry calls this time-to-power : the interval between choosing a site and being able to draw the load you planned around. For large AI sites, that interval can define the project. The upstream grid work resists money in a way the other constraints do not, because the transmission lines and substations are shared infrastructure that serves everyone on the grid, so a new load cannot simply pay to skip the queue without the wires actually being built. You can finish the building and then wait to turn it all the way on.

The time-to-power gap. You can buy a GPU in a day, but energizing a site takes years, and the build can finish while the power wait continues.

There is another clock running alongside the approval queue: the equipment itself. Even once a project is cleared to connect, the high-voltage transformers and switchgear that tie it to the grid can be in shortage. Lead times for large power transformers have stretched from roughly two years before 2020 to as long as five years now, and industry estimates suggest a meaningful share of planned 2026 data-center capacity could slip for want of power equipment and grid connections. Electrical gear is not the biggest line item in a data center. It can still decide when the building turns on, a reversal any operator who has waited on a transformer order will recognize.

The grid backlog around these projects is large. At the end of 2025, more than 2,000 gigawatts of generation and storage capacity were waiting in line to connect to the US grid, roughly twice the entire installed US power fleet. More waiting to connect than currently exists. That queue is not the same thing as a data-center load request, but it shows the condition of the shared infrastructure every large new load depends on: the wires, substations, studies, and upgrades are all moving on a multi-year clock. Lawrence Berkeley National Laboratory, which tracks those queues, finds the median time from request to commercial operation has roughly doubled, from under two years for projects built in the early 2000s to four to five years now.

This is not abstract, and it is not only an American problem. Ireland is the cleanest example. Dublin had become one of Europe's great data center hubs until the grid could not keep up, and in 2021 the grid operator EirGrid and the Commission for Regulation of Utilities imposed what amounted to a moratorium on new connections in the Dublin area. One Amazon project and two Microsoft projects were among those turned away and relocated to London, Frankfurt, and Madrid. By 2024, data centers were drawing around 21 percent of all the electricity in the country. The moratorium eased only in December 2025, and the new terms show where things are headed: a new facility now has to bring its own power generation or storage rather than simply draw from the grid.

The power industry itself is reorganizing around this demand. In May 2026, NextEra Energy announced a roughly $67 billion all-stock plan to acquire Dominion Energy, the utility behind northern Virginia's data center corridor.

That is the first half of the power story: getting electricity to the site. The second half starts once it arrives, and it is where AI looks least like the loads the grid grew up serving. A large training run is synchronized. Tens of thousands of accelerators compute, pause together to exchange results, and compute again. Power draw follows that loop. A single H100-class GPU draws far less at idle than under compute, so when tens of thousands switch states together, the facility's load can swing by tens of megawatts in seconds or less. Meta reported swings around 30 megawatts on a 24,000-GPU cluster training Llama 3.

A synchronized training cluster swings between compute and pause many times a second. The grid is built to follow the smooth aggregate of many independent users, not one correlated load moving in lockstep.

The grid was built around load diversity, where thousands of independent homes and businesses average into something smooth and predictable. A synchronized training cluster is neither diverse nor smooth, and that is the part that is new even to people who have planned power for a living. The stability problem is broader than training-loop swings: large data-center loads can also behave unexpectedly during grid disturbances. In July 2024, a transmission fault in Northern Virginia caused roughly 1,500 megawatts of data-center load to disconnect itself within 82 seconds, an event the North American Electric Reliability Corporation (NERC), which sets and enforces reliability standards for the North American bulk power system, said the system had never seen at that magnitude.

The fix moves on-site. xAI's Colossus cluster in Memphis installed about 150 megawatts of grid-scale battery storage alongside its power infrastructure. The point of that storage is not how much energy it holds but how fast it can absorb and deliver power. A small, fast store placed in front of a slower supply is a cache. Here the slow backing store is the grid, and the fast store is local batteries and power electronics. Batteries are no longer only backup equipment. In these designs, they can become part of workload control. At AI scale, power stops being merely an input to the computer. It becomes part of the computer's design.

And a design has to be operated. Once batteries, power electronics, cooling loops, and GPUs act as a single system, someone has to watch them as one: how power draw tracks compute, how the batteries answer a training swing, how heat follows the load. Those measurements arrive every second, from equipment that used to belong to three different teams. Read after the fact, they tell you what broke. Read live, they keep the loop stable.

Power is where the earlier constraints converge. The GPU you could not get, the region that was full, the building that needed rebuilding, the batteries now sitting between the workload and the grid: each one traces back to the same place, a power system that has to be built and buffered, on the grid's schedule, not the software team's.

Water: In Proportion

Power is the hardest wall because it sets the clock and, through on-site batteries and power electronics, becomes part of the machine's own design. Water is different. It sits downstream of cooling design and geography, which makes it more local, more variable, and more solvable than the public debate suggests. AI makes the siting choice more visible because the facilities are larger and denser, but the water problem still depends on design. That matters because water draws the most public attention of any of these constraints, and some of the least accurate reporting.

One distinction matters before any number makes sense: water withdrawn is not water consumed. Withdrawal is what a facility takes in; consumption is what it uses up, mostly through evaporation. A facility can withdraw a large volume and return most of it, or consume nearly all of what it takes, depending entirely on the cooling design.

Nationally, the figure is modest. As of 2021, all US data centers combined accounted for roughly 449 million gallons of water a day, about three to four tenths of one percent of total US water withdrawals, far below agriculture or power generation. The headline framing of data centers draining the country's water is not supported by the national figures.

The real issue is local. Roughly 40 percent of US data centers sit in areas of high or extreme water stress, so even a small national share can land hard on a particular community. Stated that way, it is a siting problem, real and solvable, rather than an indictment of the technology.

The cooling design is what sets the consumption. On-site consumption ranges from nearly nothing, for an air-cooled or closed-loop facility, to as much as 70 to 80 percent of what was withdrawn, for an open evaporative one. A single large evaporative facility can use something like five million gallons a day, comparable to a town of fifty thousand people. The same facility, built closed-loop, can use almost none. The high number and the low number describe the same building with two different cooling choices.

This is the constraint the industry is most actively engineering away. Closed-loop systems fill once and recirculate rather than evaporate. The same shift to liquid and direct-to-chip cooling described earlier can cut water needs dramatically, by up to 95 percent in some designs, and immersion cooling can eliminate evaporative water use altogether. Reclaimed wastewater is increasingly used in place of drinking water. Of the five walls in this piece, water is the one where the engineering response is furthest along, which is exactly why it deserves to be described accurately rather than dramatically.

The Reserve Ran Out

The five walls are not five problems. They are one fact seen from five angles. None of this is new physics: the power was always physical, the heat was always real, capacity always took years to build. What changed is that the cloud era ran on a deep reserve of capacity built ahead of demand, and as long as that reserve lasted, the limits underneath stayed out of view. You turned a dial and the reserve answered. AI has drawn that reserve down, and at a scale the old infrastructure was never built to carry, so the limits are back in view all at once.

That is why the data centers rising for AI are a different class of build, and why the people who built the last generation look at the numbers and recognize that the rules they worked under have moved. The accelerator depends on packaging and memory, the rack depends on liquid cooling, and the building depends on power. At this scale, the power itself needs a buffer. The site depends on grid capacity, water choices, and time. The next time a capacity question lands on your desk, ask where it will physically live and how long the power takes, before you ask what it costs. The cloud used to be an abstraction. It has an address now.

The five walls are physical. Operating inside them is not. Once the facility and the computer are one coupled system, running it means reading it as one: GPU power draw, cooling response, battery state, and grid posture, measured together and fast enough to act while the numbers are still true. That is not a facilities dashboard on a five-minute refresh. It is a live, correlated, high-frequency record of a machine that now runs from the silicon to the substation. Capturing that record, and querying it before it goes stale, is its own problem.

Get Started

Operational telemetry only helps if you can query it while it is still true, at the rate it arrives. That is the workload Tiger Data is built for: time-series and event data on Postgres, fresh and correct, without splitting into a second system. Start a free Tiger Cloud trial. Running on-premises, at the edge, or air-gapped? TimescaleDB Enterprise is built for those deployments and is taking design partners.

How Float Runs an AI Energy Company on a 3-Person Team with Tiger Data

Team Tiger Data — Tue, 30 Jun 2026 11:00:05 +0000

Danish startup achieves 99.3% compression on 1Hz smart meter data, powering real-time appliance-level energy analytics for hundreds of homes.

Float is a Danish AI and energy startup that collects 1Hz smart meter data from hundreds of homes and disaggregates it into per-appliance consumption using a proprietary ML model. The entire system depends on one architectural bet: that compression on their time-series database would be high enough to make the storage economics work at scale. Co-founders Jens Brandt Nellegaard (CEO) and Victor Grabow (CTO) share how they evaluated every major time-series database, why 90% compression was the hard floor for viability, and what happened when they hit 99.3% on Tiger Data.

About Float

Most people have no idea what their individual appliances cost to run. In Denmark, most energy providers send a monthly PDF bill, and that is the entire customer experience. The market has seen no meaningful innovation in 10 to 15 years. Fraud and a lack of transparency have been so widespread that the Danish government introduced new consumer protection regulation effective January 2026.

The underlying data to solve this problem actually exists. Most European smart meters have a standardized consumer interface that can output total household electrical load down to one-second resolution. But total load is not very interesting on its own. What consumers need is a breakdown by individual appliance - to understand which one is wasting electricity, which one is running at peak price, which one is approaching the capacity limit of a fuse. Academic research on energy disaggregation goes back to 1992. Nobody had solved it in a commercially scalable way.

Float built three things to close this gap: a proprietary hardware module that plugs into the consumer port on a European smart meter, a signal processing and neural net pipeline that classifies appliance-level consumption from the raw waveform, and a consumer-facing app with a proactive AI energy agent. The system collects roughly 15 measurements per second per household, each at 1Hz resolution. One second is the hard floor - Victor explains: "At one-minute resolution, the model would break, because what we track are the changes. If there are too many appliances turning on within the same minute, it would be very hard to differentiate them."

Jens Brandt Nellegaard and Victor Grabow co-founded Float in 2022. After nearly three years of R&D, they achieved a proof of concept in December 2024, secured their energy provider license in December 2025, and are now rolling out a private beta to 350 pre-vetted customers. The company has just three people.

The Challenge

Float started on Azure managed Postgres with the TimescaleDB extension. The team had Azure credits early, so it made sense at the time. But the Apache version available on Azure did not include compression, and that turned out to be a dealbreaker.

Every customer generates roughly 15 measurements per second. Float samples voltage, frequency, and total load across each phase entering the home. At 1,000 homes, that is 15,000 data points per second, continuously. Without compression, the storage cost alone would break their business model. “We are an energy company with a flat-rate subscription fee,” says Jens. “We pass through the spot market price one-to-one with no markup. If storage cost per user exceeds what the subscription supports, the economics collapse.”

We also tried InfluxDB before settling on TimescaleDB. We ran into ingestion issues, and we needed SQL. When you are a three-person team building an asset-centric microservice platform, you cannot afford a database that requires a proprietary query language and limits how you join and query data across domains. - Victor Grabow, CTO, Float

On top of the storage problem, Float needed continuous aggregates. The Danish DSO delivers settlement data at 15-minute resolution. Float collects data every second. To generate a live energy bill and compare it against the grid operator's numbers, the system needs to aggregate raw 1Hz data down to 15-minute windows constantly. On managed Postgres without TimescaleDB's full feature set, that meant writing and maintaining batch jobs - more infrastructure overhead for a team that was already stretched thin across hardware, ML, and a licensed energy company.

Why Tiger Data

We tested pretty much every time-series database on the market. We think Tiger Data is the best solution for our use case. - Victor Grabow, CTO, Float

The team researched time-series solutions extensively and discovered TimescaleDB through the compression and continuous aggregate features. The compression was compelling enough that it was just a matter of time before they needed the full capability. When they found Tiger Data - the company behind TimescaleDB - the managed cloud service made the path clear.

Two features drove the decision. First, Compression. The team had modeled the unit economics and needed at least 90% compression on the time-series data for the subscription model to work. Anything below that and storage costs per user would exceed what the flat-rate fee could support. Second, Continuous Aggregates - materialized views that update incrementally as new data arrives. Float runs aggregations constantly, converting 1Hz readings to 15-minute settlement windows, calculating threshold-based alerts on voltage and frequency, detecting outages, and triggering duration-based notifications like flagging an oven that has been running for four hours. Continuous aggregates handle all of this without batch jobs or scheduled pipelines.

We chose Tiger Cloud, the fully managed service on Azure, because it was a question of speed. We needed to get up and running fast and offload infrastructure management entirely. Encore, our DevOps platform, provides ephemeral environments on Google Cloud, and Tiger Cloud's database branching fits naturally into that workflow. - Victor Grabow, CTO, Float

The Float Energy Data Stack

Data starts at the smart meter. Float’s IoT module plugs into the standardized consumer interface port and captures voltage, frequency, and total load across each phase at 1Hz resolution. The module sends readings to Azure IoT Hub, which the team kept from the original Azure setup as a stable ingestion endpoint for all devices.

From there, a bridge connector forwards the stream into Google Cloud, where Encore deploys Float’s microservices. The team moved off Azure Event Hub eventually because it was expensive. Google Cloud streaming services handle the same throughput at lower cost. The streaming layer batches incoming measurements from all households every second and inserts them per batch into Tiger Data.

Tiger Data stores the raw 1Hz time-series readings and runs continuous aggregates for threshold-based monitoring: voltage spikes, frequency changes, mean and max calculations, outage detection, and duration-based appliance alerts. All raw data is retained for ML training purposes through the private beta phase, with tiered storage planned as the fleet scales.

The Float app reads processed data to show customers their real-time energy breakdown per appliance. New customers see total wattage immediately on connection. Appliance-level breakdown takes roughly three to four weeks as the model trains on their home's specific patterns. The agentic orchestration layer on top handles billing, onboarding, customer service, and proactive notifications - flagging forgotten ovens and irons, inefficient appliances, and dangerous load conditions approaching fuse limits.

Float's data architecture: 1Hz readings flow from the IoT module through Azure IoT Hub and Google Cloud streaming into Tiger Data, which serves the ML pipeline, consumer app, and agentic platform.

What Compression Enabled

On Tiger Data, Float is seeing 99.3% compression on its time-series data. Victor puts it directly:

Compression needed to be in the high nineties range to not break our business model. So that was a great outcome. - Victor Grabow, CTO, Float

That number unlocked three things that would not have been possible at lower compression ratios.

The Business Model Works

At 15,000 data points per second across 1,000 homes, uncompressed storage would generate terabytes of raw time-series data per year. Float passes through the spot market electricity price to customers at cost with no markup. Revenue comes from a flat-rate subscription fee. If storage cost per user climbs above what that fee can support, the entire model collapses. At 99.3% compression, it does not. The subscription covers infrastructure with margin to spare, and that margin holds as the fleet scales.

Full Data Retention for ML Training

Float's disaggregation model needs weeks of 1Hz training data per household to learn each home's specific appliance signatures. At lower compression, the team would face a choice: retain all raw data for model training or keep storage costs viable. At 99.3%, they retain everything. All raw 1Hz readings from the entire private beta fleet are available for the ML pipeline, with tiered storage planned only as the fleet scales past the beta phase.

Real-Time Billing Without Batch Infrastructure

The Danish grid operates on 15-minute settlement windows. Float collects data every second. Continuous aggregates bridge that gap, converting 1Hz readings into the 15-minute intervals the DSO requires for bill reconciliation. Danish energy prices swing up to 80% between peak and off-peak hours, which makes the freshness of the aggregation directly valuable to customers. Because continuous aggregates update incrementally as new data arrives, Float's live energy bill is always current, i.e., no scheduled batch jobs, no pipeline maintenance, no lag.

A Three-Person Team Running a Licensed Energy Company

Float holds an energy provider license in Denmark. That means billing, customer service, onboarding, regulatory compliance - operational overhead that traditional energy companies staff with dozens of people. Tiger Cloud's managed infrastructure is part of what makes this possible. The team does not manage database operations, storage provisioning, or aggregation pipelines. That overhead is handled. When asked about team size, Jens's answer was simple: "Three. We have three people... and an army of agents. This is the future."

Looking Ahead

Float is targeting 1,000 additional private beta customers within the next 12 months, with a seed round, two additional hardware variants for full Danish grid coverage, and expansion into two more countries.

The next major integration is EV charging - starting with Tesla's telemetry API, enabling smart charging during cheap price windows. The bigger thesis is that a fleet of homes measured at 1Hz resolution can trade power on the spot market more efficiently than any energy company operating at 15-minute resolution with a 24 to 48-hour delay. As Jens puts it: "Ultimately we are trying to make the home not a burden for the grid, but a partner of the grid."

The architecture decision that compounds as Float scales is not the compression ratio itself. It is that everything runs on a single Tiger Data instance: the raw 1Hz readings, the continuous aggregates for billing, the training data for the ML pipeline, the anomaly detection queries. No split architecture to maintain, no query paths to reconcile as the fleet grows from 350 homes to 1,000 and beyond. The data model does not change - it just gets bigger.

What You’re Really Owning When You Self-Host TimescaleDB

Team Tiger Data — Thu, 25 Jun 2026 13:51:54 +0000

Written by Matty Stratton, Brandon Purcell, Noah Hein, Hien Phan

Why operating TimescaleDB for mission-critical applications becomes a sustained platform engineering investment. Written by the team that builds TimescaleDB and operates Tiger Cloud across thousands of production deployments.

Abstract

Most engineers who evaluate TimescaleDB believe they are making a database decision. By the time customers depend on the application in production, they discover they made a platform ownership decision. The gap between those two decisions is the subject of this paper.

Availability, recoverability, scalability, security, and lifecycle management originate in the requirements of the application, and they land on the database team as operational systems that must be designed, built, staffed, and maintained for as long as the application runs. None of that work is beyond a capable engineering team. The question this paper raises is whether owning that platform is the highest-leverage use of the platform engineers who could otherwise be building the product the database supports.

Analytics on Live Operational Data Is Part of the Application

Analytics that runs overnight against a reporting database is important, and it is separate from the application. Analytics that plant operators watch continuously, that customers open before starting their workday, or that triggers an automated intervention before a defect is manufactured has moved onto the critical path of the business. Once analytics moves onto the critical path, the database supporting it moves there too.

One deployment carries most of this paper. An automotive supplier operates 120 robotic welding lines ingesting billions of sensor measurements per day. Plant engineers use the platform to detect abnormal operating patterns before they become equipment failures. A welding line showing early signs of a calibration drift or a sensor trending out of range may be a maintenance issue today; missed for thirty minutes, it can become an unplanned shutdown that halts production and costs the business more in rework than the platform team's quarterly budget. Every minute of latency is a minute a problem propagates across a line. We will follow this platform from its first twenty instrumented lines through its third year of operation, because the operational story of a self-hosted deployment is a story that unfolds over years, and it is easier to understand through one platform than across a survey of many. Where the same failure arrives from a different direction, we will bring in other deployments we have worked with: a connected-equipment OEM, a food processor, a specialty chemicals manufacturer. Different industries, different latency tolerances, the same dynamic underneath.

TimescaleDB is still TimescaleDB and PostgreSQL is still PostgreSQL. What changes is the operational commitment required to deliver the availability, recovery, scalability, and governance the application now demands.

The Database Inherits the Application's Requirements

Nobody is thinking about the database when the automotive supplier's operations organization commits to detecting equipment degradation before it stops a line, when plant management sets uptime expectations for the dashboards shift supervisors watch, or when compliance defines how long production records backing warranty claims must be retained. The database team inherits the requirements anyway and the inheritance changes the nature of the work.

A 99.9% availability target becomes a systems problem: replicas, failover orchestration, monitoring, and on-call coverage. A point-in-time recovery requirement becomes a tested runbook with validated restore times at production volume. Years of telemetry under compliance obligations becomes legal exposure that accumulates with every year of data the team holds.

Early in an application's life these requirements are loose. An outage of a few minutes is an inconvenience; a day of missing data is embarrassing but recoverable. As more users depend on the application and other teams build their own workflows on its data, the requirements tighten. Closing that gap is continuous work, and it falls on your platform engineers for as long as the application runs. The rest of this paper is what that work actually looks like, and each section raises the stakes a step: from lost engineer hours, to business risk, to commercial liability, to legal exposure.

Downtime Becomes a Product Problem Before the Architecture Catches Up

During the automotive supplier's early deployment, with twenty welding lines instrumented and a handful of engineers as the only users, a database restart is an inconvenience. Dashboards go blank for two minutes. Engineers wait. Nobody escalates.

Twelve months later, all 120 lines are instrumented and shift supervisors in three facilities depend on those dashboards. An equipment problem that goes undetected for thirty minutes because the platform was down costs more in rework than the platform team's quarterly budget. The database has the same configuration it had twelve months ago. The application does not.

The response is redundancy. PostgreSQL streaming replication keeps a replica synchronized with the primary, and if the primary fails, the replica is promoted with minimal data loss. This is well-understood architecture and it works. It is also where the engineering commitment begins.

Figure 1: Single-instance deployment versus production HA architecture: primary, HA replica, read replica, connection pooler, and monitoring as distinct operational layers, with failure paths and failover direction

Replication configuration is a decision with real trade-offs: synchronous replication eliminates lag but constrains ingest throughput; asynchronous replication preserves throughput but can lose data in a failover. Someone who understands the application has to make that call and own it as the workload changes. More importantly, someone has to watch replication continuously. A replica that has quietly fallen minutes behind provides a false sense of safety: stale as a read source, lossy as a failover target. Silent replication drift is one of the most common ways HA architectures fail to deliver the guarantees they appear to provide. We have seen it on deployments that had every structural component in place. The monitoring was the missing piece.

Beyond replication, HA is a stack of components that each need owners: failover automation that has been tested under simulated failure, not just configured; a connection pooler that shortens the error window and brings its own failure modes; rolling maintenance procedures rehearsed before they're needed under time pressure; and a monitoring layer that covers replication health, failover state, and background jobs, with alert thresholds tuned to the application and runbooks kept current as the system changes.

Availability is a standing allocation: a senior engineer's judgment on replication and failover, recurring hours for monitoring and rehearsal, and a permanent slot in the on-call rotation. Every one of those hours comes from the same platform engineers the roadmap is counting on. The question is whether it is the work you hired them to do.

Recovery Objectives Are Set by the Business and Tested by Almost Nobody

Figure 2: Recovery architecture showing full weekly backups, daily incrementals, continuous WAL archiving, and the RPO/RTO envelope they define, with restoration time as a function of data volume

Most backups are running. The question organizations rarely ask before they need to is whether the restore completes within the window the business requires, from the point in time the business requires, at the volume the database has actually reached. That gap is where recovery risk lives, and the food processor is standing in it the day a customer reports a potential contamination event.

The investigation needs sensor records for one production line during a two-hour window eighteen months ago. The data was backed up. None of that answers the only question that matters: whether restoring eighteen months of production telemetry at current volumes completes inside the investigation's clock. Nobody has ever run that restore. The procedure that exists is a hypothesis, and the incident is the wrong time to run the experiment.

That is the ownership gap. Not the backup. The untested restore.

As data volume grows, restore drills become harder, slower, and more important. A backup that exists is not the same as a recovery process that works.

The food processor's contamination event is the urgent version of this failure; the chemical manufacturer's corrupted migration is the irreversible one. A restore that completes is not necessarily a restore that worked. Validating that distinction at production scale is a recurring drill, measured in engineer-days per quarter, and someone has to own the calendar invite. That is before the platform has grown. The next problem is a system that succeeds.

Success Rewrites the Architecture

Figure 3: The same deployment scaling from hundreds to tens of thousands of assets: storage growth, backup window duration, query p95 latency, and concurrent connections as separate axes over time

The deployment that shows what success costs is an industrial equipment OEM whose customer-facing dashboards are a contracted product feature. A database outage there is a commercial event, logged against an SLA and escalated to an account manager. The OEM instructed a few hundred connected assets at launch and reached tens of thousands two years later. It did not build a system that broke. It built a system that succeeded, and the success invalidated the original architecture one assumption at a time. Nothing failed. The application simply outran the operational decisions made around it.

Storage growth is the most visible dimension and the least costly; disk is cheap to add. The downstream effects are the expensive part: longer backup windows, longer restores, more expensive maintenance, slower schema changes across thousands of chunks.

Retention policy management becomes load-bearing, and the hard part is rarely the configuration. The OEM's customers, now benchmarking equipment against peers on historical trends, reject the retention window set two years ago. The food processor meets the same moving target from the regulatory side: entering a new jurisdiction rewrites the retention obligations it launched with. The policy must stay aligned with requirements that keep moving, and someone has to confirm it is actually running.

Growth changes the operating model across every dimension: more assets, more users, more historical data, and more customer expectations. Query concurrency grows from a handful of internal analysts to thousands of customer-facing users, bringing read replicas, replication lag, and connection routing into scope. Configuration that was correct at two hundred assets is reconsidered at twenty thousand, but corrections don't take effect instantly. They migrate into effect gradually as new data arrives, which means the team is managing transitions, not flipping switches. These solutions work. They also need owners, and more of them as the application grows.

None of this stays solved. Each revisit lands on the same team being asked to ship features and support customers. Growth in the application is growth in the platform team's backlog.

The Platform Never Stops Evolving

Figure 4: Version lifecycle timeline: PostgreSQL major version cadence, TimescaleDB release cadence, support windows, and the upgrade planning cycles they impose

There is a common expectation, particularly among teams building their first production database platform, that operations stabilize once the deployment is running. It does not. Ownership never ends.

PostgreSQL ships a major version each year; TimescaleDB tracks those releases. Running past end-of-life means running without security patches, which is untenable for any system holding customer data or regulated production records. Minor version patches require only a brief service restart. A major PostgreSQL version upgrade is a coordinated migration process: it requires a staging environment that mirrors production in data volume, query workload, aggregate configuration, and columnstore state; a post-upgrade validation suite checked against pre-upgrade baselines; a tested rollback plan; and a coordinated maintenance window. For the chemicals manufacturer's five years of reactor telemetry, that is two to four engineer-weeks of work, recurring roughly annually, for as long as the platform operates.

Runbooks are the connective tissue holding the rest together, and they decay by default. Return to the automotive supplier, now three years in. The platform looks substantially different from what launched: additional lines instrumented, aggregates added, retention adjusted, the HA configuration changed after a failover exposed a gap in the original design, the chunk interval re-tuned after ingest grew. Each change was made for a good reason. None made it back into the runbook, and the engineer who made most of them has moved to a different team. This is how operational debt accumulates: not through negligence, but through the ordinary pressure of a team moving fast and treating documentation as something to get to later. If you are reading this and thinking it would not happen on your team, it is worth asking when your runbooks were last tested against the system they describe. Later arrives during incidents.

Every Capability Arrives With an Owner Attached

The preceding sections walk through these systems one at a time, as they arrive in the life of a deployment. Here is the full surface area in one place: each system with its own configuration, monitoring requirements, failure modes, and cadence of ongoing work. In aggregate, they make up the platform.

System	What it delivers	How it fails quietly	The ongoing work	Cadence
HA & failover (replicas, pooler, promotion automation)	The availability SLA	Silent replication drift; failover automation that was never rehearsed	Lag monitoring, failover drills, pooler tuning, rolling maintenance coordination	Continuous monitoring; drills quarterly
Backup & point-in-time recovery	The RPO/RTO commitment	Restores never run at production volume; broken WAL archive chains	Restore validation against real objectives; post-restore verification before returning to service	Restore drills quarterly; re-scoped at each growth step
Retention policies	Bounded storage; compliance windows	Background job fails silently; policy on the wrong relation deletes data that should have been kept; aggregates outlive the raw data they were built from	Policy validation and job monitoring; alignment with dependent systems and changing business requirements	Monthly job audit; review on every regulatory or contract change
Hypercore columnstore	90–98% storage reduction; faster analytical queries	Conversion boundary set too early adds overhead to hot data; too late, and the economics of long retention erode	Boundary tuning against access patterns; resource planning for large backfills (e.g., post-calibration corrections)	Semiannual review; per backfill event
Continuous aggregates	Dashboard latency at production scale	A failed refresh serves stale data with no user-visible error	Refresh policy tuning; job failure alerting so stale data is caught before users report it	Weekly alert review; retuned with each workload shift
Chunk configuration	Ingest throughput and memory health	A misconfigured interval degrades writes precisely during peak load	Interval review as ingest rates change; corrections migrate gradually into effect	Quarterly review or per major ingest change
Security & governance	Breach containment, auditability, contractual compliance	Long-lived over-permissioned credentials; audit logs retained where nobody looks; unreviewed production changes	Credential scoping and rotation rehearsed outside of incidents; audit log review and tamper protection; change control with staging validation and rollback plans	Rotation per policy; review gates on every production change
Version lifecycle	Security patches and support coverage	Running past end-of-life, unpatched, while holding customer or regulated data	Minor patch windows; major upgrades validated against a staging environment that mirrors production volume, workload, aggregate configuration, and columnstore state	Minor on a rolling cadence; major roughly annually
Runbooks & institutional knowledge	Incident response speed	Documentation describing the system as it was at launch	Updating with every configuration change; testing procedures against the live system	With every change; tested quarterly

Losing the Database Stops Being an Outage

We have seen audit configurations that were technically correct but practically invisible: logs retained in a system nobody had access to, alerts wired to a distribution list that no longer existed. The mechanism was in place. The ownership was not.

Every section so far has priced platform ownership in engineer hours and business risk. There is a point in a platform's life where the currency changes, where losing the database stops being an outage and starts being regulatory action, contractual liability, or litigation. The chemicals manufacturer lives past that point. Its five years of reactor telemetry is the empirical foundation for yield improvement decisions that took years to accumulate, and it is also an evidentiary record. The corrupted migration from earlier in this paper does more than destroy operational knowledge; it puts the team in the position of reconstructing that record under potential legal scrutiny, explaining to lawyers what a background job did and why nobody caught it. Some of these failures are irreversible by construction. A retention policy that accidentally drops a week of compliance data cannot be undone. A breach of customer telemetry arrives with notification obligations and suddenly load-bearing contract language.

This work rarely looks dramatic on a task list. It is credential scoping, rotation drills, tamper-protected audit logs, staging validation, change control, and rollback plans for every production change.

But the failure mode is different. Availability gaps cost minutes. Recovery gaps cost hours or days. Governance gaps can cost the company things that cannot simply be restored.

What This Costs, and What It Buys

Sustaining this platform at the standard a critical application demands lands between 1.5 and 3 full-time platform engineers, growing with the application, because every dimension of the work scales with data volume, concurrency, and the criticality of the guarantees. The on-call rotation requires three to four people to staff sustainably, independent of how much of their time the platform consumes. A major version upgrade is two to four engineer-weeks a year. Restore drills, failover rehearsals, policy audits, and runbook maintenance each claim recurring days per quarter. These numbers are rough by necessity and conservative by experience.

Capability was never the question. The teams that do this well go in with eyes open, staffed for the work, treating the platform as a product in its own right. The real question is the counterfactual: what would those same engineers build if they were pointed at the application instead, at the ingestion pipelines, the product features, the things customers actually pay for, while the guarantees are delivered by the team that builds the database?

Most engineers who evaluate TimescaleDB believe they are making a database decision. By the time the application is in production and customers depend on it, they discover they were deciding which parts of a platform they want to own. Every team owns the application. The decision is how much of the platform they want to own alongside it. This paper is designed to make that decision visible before it is made. For a framework to make it deliberately, see Self-Hosted TimescaleDB or Tiger Cloud: A Framework for the Decision.

Great Models Aren't Enough for Physical AI

Team Tiger Data — Thu, 18 Jun 2026 17:52:23 +0000

Notes from our Physical AI dinner

What should a drone do when a police helicopter approaches it?

We heard that question at a dinner we recently hosted for engineering leaders and founders working in Physical AI: the AI behind robots, drones, autonomous vehicles, and other machines that sense and act in the real world, along with the infrastructure that keeps them running. Nobody at the table had a clean answer. The people in that room deploy these machines for a living, and the question that stumped them was about safety, regulation, and operations, not model quality.

That was the theme of the whole evening: Physical AI is constrained by the physical world. Progress depends not only on better models, but on solving the problems around them: regulation, safety, operations, and data.

Scaling takes more than a better model

Despite rapid model progress, truly large-scale autonomous deployments still feel distant. The gap is the long tail of situations nobody puts in a pitch deck. How should a system respond when an animal starts interacting with the equipment? What kicks in when hardware behaves in ways nobody anticipated? Who is accountable when it does?

This is the unglamorous work that gates adoption. The physical world doesn't behave like a benchmark.

The physical world plays by rules a model can't change

Physical AI companies routinely operate inside regulatory frameworks written for older technologies. EV charging companies have had to navigate gas-station rules, including public price displays and printed receipts. Drone operators face aviation requirements designed for crewed aircraft. These constraints sit outside the model entirely, and a team has to clear them before a deployment is legal, let alone good.

Society sets its own rule on top: machines face a higher bar than humans. A battery fire or an autonomous accident draws disproportionate attention compared to an equivalent human-caused incident. That's the reality, and the teams that win will design for it early.

Surviving the physical world is a data problem

The edge cases, the regulations, the higher bar: you handle all of them through the data the machines produce. You catch an edge case because something in the telemetry looked wrong. You prove you met a regulation because you kept the records. The physical world is messy, and data is how you get a grip on it.

So data becomes its own hard problem. Machines in the field generate enormous volumes of telemetry, and every team deploying them wrestles with the same five decisions:

What does the system need in real time?
What stays at the edge?
What gets shipped to the cloud?
What's worth retaining for training?
And what must be kept for regulators, sometimes for decades?

Nuclear applications can carry 30-year retention requirements, a timescale that makes most storage strategies look quaint.

At fleet scale, monitoring, observability, and automation become critical infrastructure, increasingly run with agentic copilots that help operators watch and triage while humans stay accountable for the rare edge cases.

Most teams haven't felt this yet, because most aren't at fleet scale. The ones who treat the telemetry layer as core infrastructure before they get there are the ones who won't be rebuilding it under load later.

The work is surviving reality, not beating a benchmark

The model is what everyone watches. The deployment is decided by everything around it: the regulation, the safety bar, the operations, and the data that ties them together.

That's the work the people at our dinner do every day, and it's why we'll keep bringing them together. It's also the work we do: helping teams capture, store, and make sense of the data their machines produce, so the operational layer is ready when deployment scales. If you're building machines that operate beyond the lab, reach out. We'd love to have you at the table.

When PostgreSQL Isn't the Right Fit: Recognizing Workloads That Need Different Architecture

Team Tiger Data — Fri, 12 Jun 2026 12:00:47 +0000

When PostgreSQL isn't the right fit, the signs don't announce themselves clearly. Postgres is the right database for roughly 90% of workloads, such as SaaS backends, CRUD applications, and transactional systems with mixed read/write access on shared rows. But there's a narrow 10% where those same strengths become overhead: high-frequency append-only ingestion, time-ordered data accumulating at sustained rates, analytical scans over hundreds of millions of rows. If that sounds like your system, this post is for you.

What You Will Learn

If you've added indexes, implemented partitioning, tuned autovacuum, and upgraded hardware only to watch performance degrade again on the same trajectory, the problem likely isn't your configuration. By the end of this post, you'll know whether your workload is in Postgres's 10%, how to confirm it with a single diagnostic query, and what the first concrete step toward the right architecture looks like.

Why It Matters

An optimization problem and an architecture problem look identical in the early stages. Both show up as slow queries. Both respond to the same fixes: indexes, partitioning, autovacuum tuning, hardware upgrades. The divergence happens later, when the fixes stop holding and performance degrades on the same trajectory regardless of what you change.

This is what’s known as the optimization treadmill: a predictable sequence of phases that each buy three to six months of relief without changing the underlying trajectory. MVCC overhead, row-oriented storage, B-tree index maintenance, WAL volume. These aren't bugs. They're architectural tradeoffs that work well for 90% of workloads and work poorly for the 10%.

Knowing which problem you have determines whether you should keep tuning or make a different decision.

What Postgres Was Designed For

Postgres's architecture is built around concurrent access to shared rows. Multiple transactions read and write the same data at the same time, and MVCC handles the isolation. B-tree indexes find specific rows by key. Row-oriented storage assumes that when you retrieve a row, you want most of the columns in it.

For an e-commerce backend, a user authentication system, or a multi-tenant SaaS product, these are exactly the right tradeoffs. Transactions need isolation. Point lookups by user ID are the dominant query pattern. Write rates track user activity, which gives the database natural breathing room between peaks. The question isn't whether Postgres is good. It's whether the workload you're running matches the patterns its architecture was designed to serve.

The Workload That Breaks the Match

Three characteristics, when they appear together, put a workload outside what Postgres handles well.

Append-only or append-heavy writes. Rows are written once and never, or almost never, updated. Sensor readings, financial transactions, log entries, event streams. Every row still pays the full MVCC cost: a 23-byte tuple header tracking transaction visibility, hint-bit dirtying on reads, and autovacuum running continuously to freeze tuples and update the visibility map. None of that overhead produces value on data that will never be touched again.

Sustained high write rates. Not burst traffic that settles. Continuous ingestion at thousands to hundreds of thousands of rows per second, around the clock. The table grows without pause, B-tree index maintenance adds overhead with every insert, and that cost compounds with row volume, so there is no quiet window for autovacuum to catch up.

Analytical query patterns. The queries are aggregations over time ranges: averages, counts, percentiles, GROUP BY time bucket. Row-oriented storage forces Postgres to read all columns of every matching row even when the query needs two. On a 30-column table, that's fifteen times the I/O a columnar layout would require.

Any one of these is manageable. All three together is the combination that Postgres handles well at one million rows and struggles with at one hundred million.

The Optimization Treadmill in Practice

The pattern is predictable. Queries slow down as the table grows. You add indexes, and reads get faster. Write performance drops because index maintenance scales with row volume. You upgrade the instance. Performance stabilizes and costs go up. You implement partitioning. Recent-data queries get faster. Partition management becomes its own maintenance burden. You tune autovacuum settings. Things stabilize for a while. Data volume increases. The cycle repeats.

Each step is individually correct. The problem is that the sequence never ends. You're working around an architectural mismatch instead of running a workload the architecture was designed to serve.

The engineering cost accumulates in ways that are harder to see on a dashboard. The senior engineer spending a week on partition strategy is not shipping product features. The on-call rotation starts treating "database is slow again" as a recurring incident category. Quarterly planning includes a database scalability line item, every quarter.

How to Know Which 10% You're In

The answer is already in your table statistics. Not in EXPLAIN plans or monitoring dashboards, but in the counters tracking exactly how rows have been written, updated, and cleaned up over the table's lifetime. Run this against your highest-traffic tables:

SELECT
    relname AS table_name,
    N
_live_tup,
    n_dead_tup,
    n_tup_ins,
    n_tup_upd,
    ROUND(n_tup_upd::numeric / NULLIF(n_tup_ins, 0) * 100, 2) AS update_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_tup_ins DESC
LIMIT 10;

Here's an example of what a flagged table looks like next to a healthy one:

table_name	n_tup_ins	n_tup_upd	update_pct	last_autovacuum
device_metrics	84,729,3041	24,892	0.00	2025-06-01 14:22:11
user_accounts	184,203	91,843	49.86	2025-05-29 08:14:03

device_metrics is in the 10%: 847 million inserts, near-zero updates, and autovacuum fired three minutes ago on a table that has never had a meaningful UPDATE run against it. user_accounts is not: nearly half its rows are updated, and autovacuum runs only when it actually needs to.

Look for update_pct under 5% and last_autovacuum timestamps within the last few minutes on tables with near-zero deletes. That's the overhead the companion piece documents in detail: a cleanup process running non-stop on data you never modify, because the storage engine generates that work regardless of your intent.

Pair those numbers against the broader pattern. Your sustained write rate exceeds 10,000 rows per second. Your most common queries aggregate over time ranges, not point lookups by row identifier. You added partitioning specifically to control table size. You upgraded your instance specifically for query performance, not connection headroom.

Three or more of those conditions, and you're in the 10%. The optimization treadmill will keep running, but the trajectory won't change.

What the 10% Actually Needs

If you've confirmed you're in the 10%, migrating your highest-traffic table starts with a single function call:

SELECT create_hypertable('device_metrics', by_range('ts'));

This converts the table to a TimescaleDB hypertable, which does automatic time-based chunking without cron jobs or partition management scripts. From there, you can enable columnar storage on your chunks. This format reads only the columns a query requests, not full rows, and compresses historical data by 10 to 20x, bringing time-range aggregation performance in line with what the workload demands. The migration post walks through the full process, including zero-downtime options for production tables.

You keep the same SQL, the same connection strings, the same ecosystem tooling. This isn't a replacement for Postgres. It's Postgres with the storage primitives your specific workload actually needs.

Conclusion

Postgres is not the problem. Running the wrong workload class through an architecture designed for a different problem is. The distinction matters because one has a tuning fix and the other has a structural fix, and those two paths look identical for the first several months.

The most expensive version of this recognition happens after 18 months of optimization effort. The cheapest version happens now.

Run the diagnostic query above. If the numbers land where you expect, read the full architectural breakdown. If you're ready to test on your own data, start a free Tiger Data trial today.

Row vs Columnar Storage for Analytics: Why PostgreSQL Scans Are Slower Than They Should Be

Team Tiger Data — Fri, 05 Jun 2026 12:48:04 +0000

Here's a query that runs on most time-series tables:

SELECT time_bucket('1 hour', ts) AS hour,
       avg(temperature),
       max(temperature)
FROM sensor_readings
WHERE ts > now() - interval '7 days'
GROUP BY hour
ORDER BY hour;

The query needs two columns: ts and temperature. The table has 15 columns. Postgres reads all 15 columns for every row that matches the WHERE clause.

That's not a bug. It's how row-oriented storage works. Each row is stored as a contiguous block of bytes on disk, called a heap tuple, and Postgres reads the entire tuple to access any column within it. For point lookups on individual records, this is efficient. You want the whole row, and it's stored together. For analytical scans over millions of rows where you need two columns out of fifteen, it's the dominant source of wasted I/O.

In Understanding Postgres Performance Limits for Analytics on Live Data, row-oriented storage was identified as one of four architectural constraints that compound under high-frequency ingestion. That whitepaper maps the pattern at a system level. This post goes deeper on the physical mechanism: exactly how pages work, how read amplification accumulates, and why the usual fixes don't reach it.

What You Will Learn

By the end of this post, you'll have a concrete diagnostic formula: the read amplification ratio. It tells you whether your storage layout is the dominant I/O bottleneck for analytical queries on any table you own. You'll also understand why indexes can't fix this class of problem and how a hybrid row-columnar storage layout changes the math. This post assumes working familiarity with Postgres page layout and B-tree indexes.

How Row Storage Actually Works in Postgres

Postgres stores data in 8KB pages. Each page holds multiple heap tuples. Each tuple contains every column value for that row, stored sequentially, preceded by a 23-byte header that carries transaction visibility metadata.

A table with 15 columns averaging 200 bytes per row fits roughly 35 to 40 rows per page, after accounting for headers, alignment padding, and page overhead.

When Postgres runs a sequential scan, it reads pages from disk in order. Each page load brings all the rows on that page into shared_buffers, with all 15 columns per row intact. The executor then evaluates the WHERE clause and pulls the needed columns from what was already loaded into memory.

The I/O cost is proportional to total table size, not to the size of the queried columns. A query that needs 12 bytes of data per row still reads 200 bytes from disk. The remaining 188 bytes load into the buffer cache and get discarded.

The Read Amplification Math

The number that makes this concrete is the read amplification ratio: total row width divided by the width of the columns the query actually needs.

For sensor_readings, the calculation is direct. The ts column is a timestamptz at 8 bytes. The temperature column is a float4 at 4 bytes. Together they represent 12 bytes of useful data per row. The full row is 200 bytes.

Read amplification ratio: 200 ÷ 12 = 16.7x

For every byte the query uses, Postgres reads 16.7 bytes from disk.

At 100 million rows covering seven days, that ratio stops being abstract. The query needs 100M x 12 bytes = 1.14 GB. Postgres reads 100M x 200 bytes = 18.6 GB. At a 500 MB/sec sequential read rate, the scan takes approximately 38 seconds. Reading only the needed columns would take roughly 2.3 seconds. That 16x gap is pure storage model overhead.

No index changes this number. No configuration setting changes it. Partitioning reduces scope. Fewer pages get scanned by cutting the time range, but within each partition the same per-row read cost applies. The storage layout determines the I/O, and the storage layout is fixed.

Try This Now: Measure Your Read Amplification

You can calculate the ratio for any table you own. Run these two queries to get the byte widths you need:

-- Full row weight
SELECT pg_column_size(t.*) AS row_bytes
FROM sensor_readings t
LIMIT 1;

-- Queried column weight
SELECT pg_column_size(ts) + pg_column_size(temperature) AS queried_bytes
FROM sensor_readings
LIMIT 1;

Divide row_bytes by queried_bytes. If the ratio is above 5x, the storage model is your largest I/O bottleneck for analytical queries on that table. No index or configuration change will close that gap.

Why Indexes Don’t Solve This

When a query is slow, the instinctive response is to add an index. For OLTP workloads, that instinct is correct. B-tree indexes excel at row selection: they find specific rows in O(log n) time, and for a lookup like SELECT * FROM users WHERE id = 123, the index locates the target row in microseconds.

For analytical queries that touch millions of rows, row selection is not the bottleneck. Finding the rows is fast. Reading the data from those rows is slow. An index scan on a million-row result set still reads the full heap tuple for every matching row to extract the needed columns.

The one exception is a covering index, which stores column values inside the index itself so Postgres can satisfy the query without touching the heap. But covering indexes for analytical queries become impractical at scale. When queries involve aggregations across high-frequency writes, wide covering indexes impose substantial write overhead, compounding exactly the index maintenance costs described in the optimization treadmill post.

B-tree indexes optimize for row selection (which rows to read). Analytical query performance is dominated by row width (how much data per row). These are different problems, and solving one leaves the other intact. For a broader look at what this means for your schema design, see Best Practices for PostgreSQL Data Analysis.

How Columnar Storage Changes the Equation

In columnar storage, data is organized by column instead of by row. All values for ts live together in one stream on disk. All values for temperature live together in another. When the query needs those two columns, it reads two streams. The other 13 columns are never touched.

Same query, same 100 million rows: data read drops to 100M x 12 bytes = 1.14 GB. With typical 10 to 20x compression for time-series data, that compresses to approximately 60 to 120 MB. At 500 MB/sec, the same scan completes in roughly 0.12 to 0.24 seconds.

The compression benefit stacks on top of the I/O reduction. Because all values in a column share the same data type, compression algorithms work far more effectively. Sequential timestamps delta-encode to near-zero storage overhead. Floating-point sensor values compress with XOR-based techniques derived from Facebook's Gorilla algorithm. Row-oriented heap storage can't apply any of these because values from different columns are interleaved on every page. There's no contiguous column stream to compress.

Hypercore: Row and Columnar in One Table

The tradeoff with pure columnar storage is write performance. Every new row appends to each column file separately, which adds overhead for high-frequency ingestion. You get the read benefit but give up write throughput. Tiger Data's Hypercore solves this with a hybrid layout that keeps both.

Recent data stays in row-oriented storage for fast ingestion. Older data converts automatically to columnar format based on a compression policy you configure. The application writes standard SQL to one table. The storage format changes by age without any application-layer involvement.

-- Enable Hypercore on a hypertable with a 7-day row storage window
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'ts DESC'
);

SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

New rows land in row format and ingest quickly. Data older than seven days converts to columnar chunks. To verify the behavior immediately without waiting for the policy schedule, compress a chunk manually:

SELECT compress_chunk(c) FROM show_chunks('sensor_readings') c LIMIT 1;

Then run EXPLAIN (ANALYZE, BUFFERS) on the aggregation query to see the difference in buffer reads (representative output on a 100M-row dataset):

-- Before: row storage sequential scan
Seq Scan on sensor_readings
  Buffers: shared read=2375000 -- 18.6 GB read from disk
  Execution Time: 38142.2 ms

-- After: Hypercore columnar scan
Custom Scan (ColumnarScan) on sensor_readings
  Buffers: shared read=10240 -- 80 MB read from disk
  Execution Time: 196.4 ms

The same SELECT statement works against both storage formats. The query planner handles the difference transparently.

Conclusion

Row storage reads every column to access any column. For analytical queries that scan millions of rows and need only a few, this is the largest source of I/O overhead. It doesn't yield to index tuning, partitioning, or hardware upgrades.

Calculate the read amplification ratio for your most common analytical queries using the pg_column_size queries above. If the ratio is above 5x, Hypercore is the direct fix. Start a free Tiger Data trial today to enable the hybrid storage model on your tables.

The Postgres Developer's Guide to Vector Index Tradeoffs

Team Tiger Data — Tue, 02 Jun 2026 19:17:01 +0000

Vector search in Postgres usually starts simply. You add an embedding column, run a nearest-neighbor query, and order by distance.

SELECT content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'
LIMIT 10;

For a while, that is enough.

That simplicity breaks down as the workload becomes real. The table grows, filters become part of the query path, and recall starts affecting user experience. The index still has to stay fast while new data keeps arriving.

That is when vector search stops being a query pattern and becomes an index design problem.

Most vector search advice starts with algorithms: HNSW, IVFFlat, DiskANN, recall, latency. That is useful, but incomplete once vector search lives inside Postgres. Postgres developers do not choose algorithms in the abstract. They choose indexes under constraints: memory, recall, write volume, filter selectivity, and the operational cost of adding another system.

The right index is not the best ANN algorithm in isolation. It is the index that fits the constraint your workload hits first: memory, recall, writes, or filters.

This article maps those constraints to real Postgres index choices: what each one costs, when it becomes the binding variable, and which index type it points to.

When exact search stops being enough

Exact k-nearest neighbor search compares the query vector against every vector in the table. It gives perfect recall because it does not approximate the result set. It also scales linearly with the number of rows.

That tradeoff is fine early on. Exact search is the right starting point when the dataset is small, the query rate is low or you are still validating whether embeddings work for your application. It also gives you a useful baseline because the results are not affected by index tuning.

The problem shows up when the table grows into millions or tens of millions of vectors, or when users expect low latency. At that point, scanning every vector for every query becomes too expensive.

Approximate nearest neighbor search, or ANN search, exists for this moment. ANN indexes organize vectors ahead of time so the database can search only the most promising candidates instead of scanning the full table. The index gives up a small, controlled amount of accuracy in exchange for much lower query latency.

That is the first tradeoff: ANN is not magic. You are deciding how much recall you can afford to exchange for speed, memory efficiency, and lower infrastructure cost.

The four constraints behind every vector index

The right vector index is usually decided by four constraints: whether the working set fits in memory, how much recall the application needs, how often the data changes and how selective the surrounding filters are.

Memory

Memory is fast and low-latency, but expensive. SSDs are cheaper and can still work well for many workloads. Object storage is cheaper still, but its higher latency makes it a poor fit for index designs that require many small random reads.

Vector indexes do not all touch storage the same way. Graph-based indexes follow connections between vectors through the index. That access pattern works very well when the graph is in memory and becomes more expensive when each hop risks a disk read. Partitioning-based indexes group vectors into regions and scan the most promising ones, which can be more memory efficient but usually requires more tuning.

In Postgres, the practical question is whether the index working set fits comfortably in shared_buffers and the operating system page cache. If it does, an in-memory graph index can perform very well. If it does not, the storage access pattern starts to dominate the design.

Storage changes the index tradeoff. Graph-based indexes perform best when traversal stays hot in memory. Disk-aware and partition-based designs become increasingly important as the working set migrates to SSD or object storage.

Recall

Recall measures how close approximate search gets to exact search. Higher recall usually costs more because the index has to inspect more candidates, traverse more of a graph or scan more partitions.

For some applications, slightly lower recall is acceptable if latency improves dramatically. For others, especially RAG systems where missing the right document leads to a bad answer, recall is part of product quality.

The honest way to set this tradeoff is to measure against your own data. Embedding model, dimensionality, filters, and query distribution all affect the result.

Writes

Some vector workloads are mostly read-heavy. You build the index, query it many times, and update it occasionally. Other workloads change constantly. New documents arrive, old ones are deleted, embeddings are regenerated.

A structure optimized for high-recall reads may have higher write or maintenance costs. A lighter-weight index may be easier to update but require more tuning to reach the same recall.

Filters

Real Postgres queries rarely search vectors alone. A query might ask for the nearest vectors, but only within a specific customer, time range, tenant or category.

Those predicates change the shape of the search problem. If a filter is highly selective, it may be cheaper to narrow the rows first and then search. If the filter is broad, it may be better to use the vector index first and apply the filter after. The right plan depends on the data distribution, the selectivity of the filter, and the index available to the planner.

That is one reason vector benchmarks can vary so much. Vector search without filters is not the same workload as vector search inside a real application query.

That is why there is no universal best vector index. There is only the index that best matches the shape of your workload.

The ANN algorithms behind Postgres index choices

The point of understanding ANN algorithms is not to memorize every paper. It is to understand why each index behaves differently as your workload changes. Most of the indexes discussed below fall into two broad patterns.

Graph-based indexes, such as HNSW and DiskANN-style designs, search by moving through connections between nearby vectors. Spatial partitioning indexes, such as IVFFlat and SPANN-style designs, divide the vector space into regions and search the most promising ones.

That distinction matters because graph-based indexes tend to optimize for high recall when the working set is hot, while partitioning-based indexes often trade more tuning for lower memory and maintenance overhead.

Each algorithm below is best understood as a response to a specific pressure: memory, write cost, disk access, or update churn.

HNSW: When the index fits in memory

Your dataset fits in memory and you need high recall at high query throughput. HNSW is built for this.

Hierarchical Navigable Small Worlds organizes vectors as a layered graph where each node connects to nearby vectors across multiple levels of granularity. A query enters at the top layer, moves toward the target neighborhood, then descends to finer layers until it converges on the best candidates.

The layered structure is what gives HNSW its speed-recall profile. The upper layers help the search move quickly across the vector space. The lower layers refine the candidate set around the target neighborhood. When the graph is in memory, that traversal can be fast and accurate.

The tradeoffs show up on the write side and at scale. Each node stores multiple edge pointers, so the index carries a higher memory footprint than simpler partitioning-based alternatives. Inserts and deletes require maintaining graph structure, which makes writes more expensive. And when the index grows beyond available memory, latency can climb.

In pgvector, HNSW is often the first ANN index Postgres developers try when query latency and recall matter most. For a practical look at how it performs, see Vector Database Basics: HNSW.

IVFFlat: When memory and writes matter more

Your write throughput matters, or your index cannot comfortably fit in memory. IVFFlat is worth considering.

IVF stands for inverted file. The basic idea is to partition the vector space into lists, then search only the most promising lists at query time. In pgvector, this index type is exposed as ivfflat.

Compared with HNSW, IVFFlat is usually lighter to build and maintain. Inserts are simpler because adding a vector means assigning it to a list rather than updating a graph of neighboring nodes.

The tradeoff is that recall is more sensitive to tuning. If you create 1,000 lists and set probes = 10, the query searches a small fraction of the partitioned index. Increasing probes gives the query more chances to find the true nearest neighbors, but it also pushes the query closer to a broader scan. IVFFlat tuning is about finding the lowest probes value that still meets your recall target.

That is the core IVFFlat tradeoff: lower memory and maintenance overhead, but more responsibility for tuning lists and probes against your workload.

DiskANN: When the index needs to live partly on disk

HNSW assumes the graph fits comfortably in memory. At tens of millions of high-dimensional vectors, that often stops being practical.

DiskANN, developed at Microsoft Research, was built for this case. It is a graph-based algorithm designed for datasets too large to fit entirely in RAM. At a high level, it keeps enough compressed information in memory to guide the search while storing more of the full index and vector data on SSD.

The lesson for Postgres developers is the storage pattern. A vector index that works well in RAM may behave very differently when the query path depends on repeated disk reads. Disk-aware indexes are designed around that constraint instead of treating it as an afterthought.

DiskANN still carries higher update costs than many partitioning-based approaches. But for read-heavy workloads on large datasets, it explains the shape of the problem that disk-aware Postgres vector indexing is trying to solve. See Understanding DiskANN for a deeper look.

SPFresh: The update problem at scale

Large vector indexes create another problem: updates.

Many ANN systems handle inserts and deletes by buffering changes, maintaining secondary structures, or periodically rebuilding parts of the index. Those approaches can work, but at very large scale they require either accepting stale index state or paying an increasingly expensive maintenance cost to keep the index current.

SPFresh, from Microsoft Research, is one such direction. It builds on partitioning-oriented ideas to reduce the need for global rebuilds, incrementally rebalancing partitions as vectors are inserted, deleted, or updated. Partition assignments are not fixed. They can drift and be corrected over time.

SPFresh is not implemented in Postgres today. But it is not purely academic either. The ideas behind it have already shaped how production vector systems outside Postgres are being designed. Turbopuffer is one example: an object-storage-first vector search service whose architecture is built around centroid-based indexing and minimizing storage round trips. Turbopuffer is not a Postgres system. But the tradeoffs it navigates (high-update workloads, disk-based search, incremental index maintenance without global rebuilds) are real problems the Postgres ecosystem will need to address as vector workloads become more dynamic.

This is worth tracking because the maintenance cost of a vector index is not static. It grows with update frequency and dataset size. For read-heavy workloads on stable datasets, this is not a near-term concern. For teams with high insert and delete rates (documents being added, embeddings regenerated, records retired), it is worth understanding now, before the index becomes the bottleneck.

The Postgres vector search stack

The algorithms above map to real problems Postgres developers run into. HNSW is useful for in-memory performance, IVFFlat for lighter-weight indexing and write-sensitive workloads, and DiskANN-style designs for larger datasets where memory becomes the constraint.

Here is how the Postgres ecosystem addresses those problems today.

pgvector

pgvector is the starting point. It adds a native vector column type to Postgres and supports both HNSW and IVFFlat indexes directly.

An HNSW index looks like this:

CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);

For IVFFlat, you define the number of lists and tune the number of probes:

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
SET ivfflat.probes = 10;

The query planner can use these indexes for nearest-neighbor queries, and you can combine vector search with standard SQL filters, joins and CTEs in the same query. For many teams already running Postgres, this can remove the need to operate a separate vector database.

pgvector can start to show limits at larger scale, especially with high-dimensional embeddings at tens of millions of rows and indexes that no longer fit comfortably in memory. That is the problem pgvectorscale was built to address.

pgvectorscale

The DiskANN section above describes a specific problem: vector workloads that have grown too large to keep the working index in memory. For Postgres, pgvectorscale addresses that directly. It introduces a StreamingDiskANN index type that keeps a compressed representation in memory to guide search while storing the full index on disk.

On a Tiger Data benchmark of 50 million Cohere embeddings at 768 dimensions, Postgres with pgvector and pgvectorscale achieved 28x lower p95 latency and 16x higher query throughput compared to Pinecone's storage-optimized index at 99% recall. This was a vendor-run benchmark. Treat it as directionally useful, not universally predictive. Results will vary with embedding model, dimensionality, filters, recall target, and hardware.

The relevant point is that pgvectorscale stays inside the Postgres operational model. It remains composable with pgvector data types and standard SQL patterns. If your index has outgrown memory, you do not need a different system. You need a different index type.

pg_textsearch and ParadeDB

Vector similarity handles the semantic side of search well, but it is not the whole retrieval problem. Keyword-based retrieval still matters. It catches exact matches that embeddings miss, and for many queries, users know precisely what they are looking for.

This is where pg_textsearch and ParadeDB come in.

pg_textsearch, also from Tiger Data, brings BM25-based search into Postgres. BM25 accounts for term frequency saturation and document length normalization, which is why it is often a stronger ranking model for keyword search than simple term matching.

ParadeDB takes a related position as a Postgres distribution, bundling pg_search for BM25-based full-text search and pg_analytics for analytical query execution. If you want Elasticsearch-style search quality and are open to running a Postgres distribution rather than adding individual extensions, ParadeDB belongs on your evaluation list. When you are operating a small dataset, BM25 relevance ranking may not be a key requirement and pg_search will suffice. However, pg_textsearch is a better option when you need true BM25 relevance ranking with term saturation (how many times a term appears) or document length normalization to match the experience of Lucene (that powers Elasticsearch) or the algorithms that power Google.

The real payoff of having both vector search and BM25 inside Postgres is hybrid search: combining vector similarity and keyword scoring in a single query. For many RAG applications, this is often a stronger retrieval pattern than vector search alone because each approach covers the other's blind spots. Vector search captures semantic meaning. BM25 catches exact matches.

A simple hybrid search pattern in SQL

One common way to merge vector and keyword results is Reciprocal Rank Fusion, or RRF.

RRF avoids averaging scores across different scales. Instead, it combines rank positions. A result that appears near the top of either list gets a boost.

Hybrid search combines semantic and lexical retrieval. Vector search finds meaning. BM25 catches exact matches. RRF merges the ranked lists without comparing raw scores directly.

The exact syntax depends on which BM25 extension you use, but the query shape looks like this:

WITH keyword_results AS (
  SELECT
    id,
    content,
    paradedb.score(id) AS bm25_score,
    ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS keyword_rank
  FROM documents
  WHERE content @@@ 'vector search'
  LIMIT 60
),
vector_results AS (
  SELECT
    id,
    content,
    1 - (embedding <=> '[0.1, 0.2, ...]') AS similarity_score,
    ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, ...]') AS vector_rank
  FROM documents
  LIMIT 60
),
combined AS (
  SELECT
    COALESCE(k.id, v.id) AS id,
    COALESCE(k.content, v.content) AS content,
    COALESCE(1.0 / (60 + k.keyword_rank), 0) +
    COALESCE(1.0 / (60 + v.vector_rank), 0) AS rrf_score
  FROM keyword_results k
  FULL OUTER JOIN vector_results v ON k.id = v.id
)
SELECT id, content
FROM combined
ORDER BY rrf_score DESC
LIMIT 10;

This retrieves candidates from both systems, ranks them separately, and merges the ranked lists.

This is one of the strongest reasons to keep search in Postgres. Your embeddings, documents, metadata filters, joins, keyword search, and application data can live in one query model.

Learn more: how to build Hybrid Search in Postgres using pg_textsearch and pgvectorscale, and why hybrid search outperforms vector-only retrieval.

What this guide does not decide for you

No article can tell you the right vector index without your data.

Embedding model, dimensionality, filter selectivity, recall target, update rate, hardware, concurrency, and query distribution all change the answer. Even two datasets with the same number of rows can behave differently if their vectors cluster differently or their filters have different selectivity.

The point of this guide is not to replace benchmarking. It is to help you know what to benchmark first. Start with the simplest index that matches the shape of your workload. Measure it against exact search where possible. Tune recall and latency together. Then move to a more specialized index only when the workload gives you a reason.

Which Postgres vector index should you use?

Workload pattern	Start with	Why
Small dataset or still validating the application	Exact search	Simple, accurate and useful as a recall baseline
Starting a serious Postgres vector search workload	pgvector with HNSW	Strong speed-recall tradeoff for read-heavy workloads
Lighter index or higher write throughput matters	pgvector with IVFFlat	Lower memory and maintenance overhead, with more tuning required
Index no longer fits comfortably in memory	pgvectorscale with StreamingDiskANN	Disk-aware vector indexing while staying inside Postgres
Retrieval quality is the bottleneck	Hybrid search with vector plus BM25	Combines semantic similarity with exact keyword matching

The path usually looks like this: start with exact search while the dataset is small, move to HNSW when latency requires ANN, consider IVFFlat when memory or write cost matters more, evaluate disk-aware indexing when the working set outgrows memory, and add BM25 when retrieval quality needs more than semantic similarity alone.

Where things stand and where they are going

The practical rule is simple: benchmark the workload you actually run, not the cleanest version of vector search.

Start with exact search while the dataset is small. Move to HNSW when latency requires ANN. Consider IVFFlat when memory or write cost matters more. Evaluate StreamingDiskANN when the working set outgrows memory. Add BM25 when retrieval quality needs more than semantic similarity.

The one gap that remains is what SPFresh points toward: high-update workloads at scale without global index rebuilds. That capability is not yet in Postgres, but it is already showing up in production vector systems outside the Postgres ecosystem.

Whether it eventually appears as an extension, a fork or something nobody has named yet, the pattern is familiar: a hard problem gets real and someone in this community builds the thing.

Want to dig in further? Look at Tiger Data docs for pgvectorscale and pg_textsearch.

Understanding Why OS RAM and Postgres Buffer Cache Compete

Team Tiger Data — Fri, 22 May 2026 14:51:13 +0000

You just doubled the RAM on your database server to handle a climb in p95 latency. You expect the extra memory to absorb your growing dataset and bring those 45ms spikes back down to 8ms. Instead, the dashboard shows minimal improvement. Write latency remains high, and query response times stay variable.

The problem isn’t that you added too little RAM. It’s that you gave most of it to the wrong layer.

PostgreSQL and your operating system both cache data independently. When you over-allocate memory to Postgres, the OS loses the RAM it needs to do its own caching. Both layers end up storing identical data blocks simultaneously, a condition known as double buffering, while your system spends CPU cycles shuffling data between two pools instead of serving queries. At scale, this pattern becomes a vicious cycle: you add resources, the database absorbs them, performance recovers briefly, and then degrades again as the dataset grows.

This guide explains the double buffering mechanism, gives you the tuning rule that breaks the cycle, and shows you how to diagnose whether your current configuration is already caught in it. By the end, you will know how to calculate the correct shared_buffers value for your server, run a query to identify which tables are crowding out your buffer cache, and interpret the results to decide what to do next.

The Two Layers of Database Memory

To manage memory effectively, you need to understand the differences between the two independent caches that operate simultaneously on every Postgres server.

The internal buffer cache is defined by the shared_buffers configuration parameter. When a query needs a data block, Postgres checks here first. Ideally, it finds the data block so it can avoid a system call entirely. This cache is where your hot data lives.

The OS page cache lives in whatever RAM the operating system has not allocated elsewhere. When Postgres requests a block that is not in shared_buffers, it issues a file system call. If the OS has that block in its page cache, it serves the data immediately. If not, the OS falls through to a physical disk read.

It’s important to note that Postgres does not manage the OS page cache at all. Instead, the kernel manages the cache on its own, including allocating space and moving data into and out of the cache. Regardless, the OS page cache is a required part of Postgres, and not just a backup option for the internal buffer cache.

The Double Buffering Problem

Double buffering happens because neither cache knows what the other holds. Postgres does not inspect the OS page cache before storing a block in shared_buffers. The OS does not inspect shared_buffers before caching a file page. Both layers frequently hold copies of the same data at the same time.

This is wasteful at any size, but at scale it becomes actively harmful.

When shared_buffers is set too high (e.g. 80% of total RAM), the OS page cache is confined to the remaining 20%. Under a write-heavy workload, the OS needs that headroom to manage checkpoint writes, background writer activity, and WAL file flushes that grow proportionally with data volume. When the OS cache is too small, the kernel is forced to evict useful data pages to make room for incoming writes. Postgres then misses in both caches and falls through to disk, even if you have plenty of RAM.

This creates a vicious cycle. Adding more RAM to shared_buffers temporarily absorbs the working set, but as the dataset grows the same pressure returns. Each tuning cycle buys less time than the one before it.

Using The 25% Rule

The standard recommendation for Postgres is to set shared_buffers to 25% of total system RAM. By leaving 75% of memory to the OS, you give the kernel the headroom it needs to cache active data files, manage writes, and handle I/O bursts without evicting pages that Postgres will immediately need again.

To apply this, open postgresql.conf and update the parameter:

# For a server with 64GB RAM: 25% = 16GB
shared_buffers = '16GB'

This parameter requires a full server restart. A configuration reload is not sufficient.

Large Memory Servers

On systems with 512GB or more of RAM, 25% works out to 128GB. Beyond this point, the overhead of managing the internal buffer mapping can decrease performance rather than improve it. For very large memory systems, many teams cap shared_buffers at 128GB to 256GB and let the OS page cache handle the rest. Treat 128GB as your starting ceiling and benchmark from there.

Additional Settings

Changing shared_buffers in isolation can produce misleading results if these settings are not also configured correctly:

effective_cache_size: Tells the query planner how much total cache (shared_buffers plus OS page cache combined) it can expect to use. Set this to 50-75% of total RAM. It does not allocate memory, but rather informs planning decisions and affects whether the planner chooses index scans over sequential scans.
work_mem: Controls per-operation memory for sorts and hash joins. Too high, and concurrent queries can exhaust available RAM; too low, and sort operations spill to disk. A conservative starting point is total RAM divided by (max_connections x 2). On a 64GB server with 200 max_connections, that works out to roughly 163MB per operation, a reasonable baseline to start from and adjust under load.
checkpoint_completion_target: Set to 0.9 to spread checkpoint writes across a longer window, reducing the I/O spikes that compete with the OS page cache during heavy write periods.

Diagnosing Your Current Configuration

Once you apply the 25% rule, the pg_buffercache extension shows you exactly which tables and indexes are occupying your buffer cache right now.

SELECT
  c.relname AS table_name,
  count(*) AS buffered_pages,
  pg_size_pretty(count(*) * 8192) AS buffer_size,
  round(100.0 * count(*) /
    (SELECT setting FROM pg_settings WHERE name = 'shared_buffers')::integer, 2
  ) AS percent_of_cache
FROM pg_buffercache b
INNER JOIN pg_class c ON b.relid = c.oid
INNER JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
GROUP BY c.relname
ORDER BY buffered_pages DESC
LIMIT 10;

Interpreting Your Results

A healthy result shows no single object above 15-20% of the cache.

If any single table or index exceeds 30% of the cache, treat it as a signal that one object is crowding out everything else. Do not respond by increasing shared_buffers. If the object is already larger than your current allocation, giving Postgres more memory will only delay the problem until the table grows again. Instead, ask yourself the following questions:

Can the table be partitioned by time or key range so that queries touch only a recent, smaller slice of the data?
Can the queries driving the cache pressure be rewritten to use more selective indexes rather than scanning large portions of the table?

Addressing Index Bloat

A separate but related problem is index bloat. When index entries dominate the output over table entries, your indexes have likely grown faster than your access patterns have changed. Use this query to identify indexes that are consuming cache but receiving no scans:

SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan AS scans,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Any index returned here is a candidate for removal. Dropping unused indexes directly reduces buffer pressure and frees cache space for objects that are actually serving queries.

Re-run the pg_buffercache query after any significant data volume increase or schema change to catch concentration drift before it affects query performance.

When Tuning Reaches Its Limit

The 25% rule and the diagnostics above will recover significant performance for most Postgres deployments. But when your working dataset is larger than the memory you can reasonably allocate to either cache layer, buffer management stops being the constraint. Instead, the data volume itself is the problem.

You can see this in pg_buffercache directly. If your largest table is 60GB and shared_buffers is 16GB, the table will never be fully cached regardless of how the allocation is tuned. percent_of_cache for that object will always approach 100% as the query workload pulls it in, leaving nothing for everything else:

At this point, adding more RAM extends the runway but does not change the slope. The next doubling of your dataset will return you to this same result. Columnar storage changes the equation by compressing data aggressively before it ever reaches the cache, reducing the volume that needs to be buffered in the first place.

You can test whether your workload would benefit from this approach by running the same pg_buffercache checks on a Tiger Data instance. Start a free trial today to optimize your database and internal buffer cache without affecting production.

The True Cost of Database Optimization: Engineering Time

Team Tiger Data — Thu, 14 May 2026 20:36:27 +0000

"We can fix the performance issue with better indexes, smarter partitioning, and some vacuum tuning. It's cheaper than switching."

You've heard this sentence. You may have said this sentence.

The optimization wasn't cheap. It just felt like it was.

"Cheaper than what" is the question nobody asks. The optimization doesn't show up on an invoice. It costs engineering time. And engineering time has a rate: the fully-loaded cost of the senior engineers doing the work, plus whatever those engineers aren't building while they're doing it. Most teams have never actually added up their database optimization spend. When they do, the number is larger than expected. And it comes back every quarter.

This problem is specific to a particular class of workload: high-frequency, append-heavy data. Telemetry, metrics, events, anything where timestamps are how you think about your data and the table only ever gets bigger. If that describes your system, keep reading. If you're running a CRUD app with predictable write volume, this isn't your problem.

Why optimization doesn't fix this

Here's what most teams figure out a year or two in: optimization isn't the wrong thing to try. It's just solving the wrong problem.

Tuning vanilla Postgres for a high-frequency append workload is a bit like upgrading the engine on a pickup truck because you want to haul more freight. You can make the truck faster and it feels productive. But at some point, you're limited by what the vehicle fundamentally is. The problem isn't the mechanic. It's the vehicle.

When your workload is structurally mismatched to your database architecture, the optimization treadmill is inevitable. Every index you add, every partition scheme you design, every autovacuum you tune: it's solving for a data volume you'll outgrow in months. The gap between "current optimization" and "needed optimization" widens every quarter. Not because you're falling behind. Because the data compounds faster than the fixes do.

A realistic year

Here's what that looks like. A year of Postgres optimization for a high-volume append workload.

Q1. Queries are slowing down. A senior engineer spends two weeks analyzing query plans, adding targeted indexes, and rewriting three critical queries. Performance improves. Write throughput drops roughly 15% because of new index maintenance overhead. (These numbers are illustrative. Your Q1 will have its own version of this tradeoff.)

Q2. Table size is causing partition-related issues. The team implements time-based partitioning. Two engineers spend three weeks on it: designing the partition scheme, migrating existing data, updating application queries that assumed a single table, and fixing the CI/CD pipeline that didn't account for partition management.

Q3. Autovacuum is competing with production writes during peak hours. One engineer spends a week tuning autovacuum parameters, adjusting cost delays, and setting up monitoring for vacuum lag. A follow-up incident two weeks later, when a vacuum job blocks a schema migration, costs another three days.

Q4. Storage costs are climbing. The team evaluates compression options, considers archiving old data to cold storage, and ultimately decides to upgrade the instance size to buy headroom for Q1 of next year. The upgrade takes a day. The evaluation and planning took two weeks.

Total: 12 to 16 engineer-weeks across the year. At fully-loaded senior engineer cost (call it $150K to $200K/year), that's $35K to $60K in direct labor. You bought time, not a solution. And the bill comes back next year.

The opportunity cost (the real number)

The $35K to $60K understates it.

12 to 16 engineer-weeks is a feature. It's a product launch. For a team of 10, that's 3 to 4% of total engineering output spent keeping the database at "acceptable." Not advancing it. Just treading water against a growing dataset.

Ask your engineering manager: if you reclaimed those 12 to 16 weeks, what would you build? That's the true cost of optimization. Not the hours. The roadmap you didn't ship.

And it compounds. Year two has all the same optimization needs plus new ones as data grows, but now you're also maintaining the partitioning scheme from Q2 and the vacuum configuration from Q3. The baseline maintenance burden grows even as new problems arrive.

Flogistix, who runs high-frequency oil and gas telemetry, reported 66% monthly cost savings after moving to Tiger Cloud, and their engineering team said the freed time directly increased roadmap velocity. That's what the other side of this decision looks like.

The hidden costs nobody tracks

These don't show up in sprint planning.

Incident response. Database performance incidents pull engineers off planned work. A slow query that triggers alerts at 2am costs the on-call engineer a night of sleep and a mostly useless next day. These incidents increase in frequency as the gap between "current optimization" and "needed optimization" widens. And the gap always widens.

Knowledge concentration. Database optimization work accumulates in one or two senior engineers who understand the schema, the query patterns, and enough Postgres internals to make changes safely. This is your single point of failure. When that engineer is on vacation or leaves, optimization work stalls or gets done slowly by someone learning as they go. Trust me, I've seen this play out in ways that aren't fun for anyone involved.

Context switching. Engineers don't work on database optimization in clean, uninterrupted blocks. They get pulled in for an afternoon here, a day there, to diagnose a regression or review a partition change. Context switching is expensive because it disrupts both the database work and whatever they were doing before. You're not just paying for the time spent on the database. You're paying for the interrupt tax on everything else.

All three are part of the platform tax: the invisible engineering cost of maintaining infrastructure that doesn't quite fit the workload. It doesn't show up on an invoice either.

Calculate your own number

Track for one month. Count hours spent on: query optimization and explain plan analysis; partition management and creation; autovacuum tuning, monitoring, and incident response; database-related incident response (slow query alerts, replication lag, connection pool exhaustion); and meetings discussing performance, capacity planning, or migration timing.

Multiply the monthly total by 12. Multiply that by the fully-loaded hourly rate of the engineers involved. That's your annual optimization cost.

Compare it against the one-time cost of migrating to a system designed for the workload (typically 2 to 8 engineer-weeks depending on data volume), plus ongoing maintenance that scales with workload complexity rather than with data growth.

For most teams, the breakeven is within the first year. Often within the first quarter. Do the math before assuming migration is the expensive option.

What the alternative looks like

After migrating to TimescaleDB (the open-source Postgres extension that powers Tiger Cloud), the engineering time picture looks different.

Migration cost: one-time, typically 1 to 4 weeks for a single engineer depending on data volume and schema complexity. Most of that time is data backfill, not application changes. TimescaleDB is still Postgres. Your SQL, your tooling, your team's existing knowledge stays intact.

Ongoing costs: not zero, but different in kind. The categories of work that consumed engineering time on vanilla Postgres shift significantly. Automatic partitioning via Hypertables removes partition management as a recurring quarterly project. The database handles it. Compression policies run automatically in the background. Autovacuum pressure on historical data drops because Hypercore converts older chunks to columnar format: instead of accumulating MVCC dead tuples on row-level records, that data is stored as compressed column arrays that don't generate the same vacuum workload. You still tune a database. You just stop tuning the same problems every quarter.

What was being spent on keeping vanilla Postgres at "acceptable" is now available for product work. Not because the database is magic. Because the architecture fits the workload.

The decision you keep deferring

The true cost of database optimization is not the cloud bill. It's the engineering time: senior engineers spending weeks per quarter on maintenance that keeps the system at "acceptable" rather than moving it forward.

If the annual optimization cost exceeds the one-time migration cost (and it usually does, often within the first year), the economic case writes itself. The harder question is whether the team can keep deferring the decision, knowing that each quarter of optimization increases the total spend without changing the trajectory.

Run the numbers. Then decide.

If you've done the math and want to understand what migration looks like at your data scale, The Best Time to Migrate Was at 10M Rows. The Second Best Time Is Now. is a good next read. And when you're ready to move, the migration guide covers the mechanics.

How TimescaleDB Outperforms ClickHouse and MongoDB for LogTide's Observability Platform

Team Tiger Data — Wed, 15 Apr 2026 12:24:18 +0000

Giuseppe “Polliog” Pollio started writing code for LogTide in September 2025. By early 2026, the platform was handling five million logs per day for alpha users, compressing 220GB of production data down to 25GB.

LogTide

Most enterprise log management tools are built for enterprises. Datadog and Splunk far exceed small operation budgets. For developers running a self-hosted stack, there is no clear alternative for affordable log observability.

LogTide addresses this gap as an open-source log management and SIEM platform built specifically for teams who need serious observability without serious hardware. Sigma rule-based detection, structured log search, alerting, and notifications, the same capabilities that make Datadog and Splunk useful, run in two gigabytes of RAM with Logtide.

"That's because our target is small agencies and home labs," Giuseppe explains. "I wanted to create an ecosystem with low impact on RAM, something you can host on a really old machine."

LogTide launched its cloud alpha in early 2026, with around 100 companies stress-testing the platform for free. One of them sends five million logs per day.

The Challenge

When Giuseppe set out to build LogTide, he targeted home labs and small businesses who cannot afford enterprise infrastructure, let alone enterprise pricing.

ELK - Elasticsearch, Logstash, Kibana typically require multiple nodes and significant RAM. Grafana Loki is lighter but still has indexing and query limitations that make full-text log search painful at scale. ClickHouse is fast and compresses well, but is built for analytics clusters, not Raspberry Pis. Datadog and Splunk simply cost too much.

LogTide needed a reliable database to underpin its OSS log observability that could scale to production without split architecture or excessive budget spend.

Why TimescaleDB

Giuseppe found TimescaleDB while searching for Postgres with additional support for high ingest of event data.

"There are lots of alternatives, but most are too resource-intensive," Giuseppe explains. "TimescaleDB was a perfect choice."

There are lots of alternatives, but most are too resource-intensive. TimescaleDB was a perfect choice. - Giuseppe Pollio, Founder, LogTide

The appeal was both technical and practical. TimescaleDB is Postgres. It uses the same wire protocol, the same SQL syntax, the same tooling, and the same extension ecosystem. For a solo developer building a platform that has to run on minimal hardware, that meant no operational surprises, no vendor-specific APIs, and no migration work if users already had Postgres running.

“If Postgres can run on your machine, TimescaleDB can run,” notes Giuseppe,”and you can deploy LogTide for inexpensive observability at scale.”

The LogTide Stack

LogTide's architecture is simple by design. “Simple architecture means it's easier to manage, easier to maintain,” said Giuseppe.

Simple architecture means it’s easier to manage, easier to maintain. - Giuseppe Pollio

Logs enter the system from one of three client sources: OpenTelemetry-instrumented services, Fluent Bit agents, or one of LogTide's native SDKs. All three routes converge on a single ingest endpoint. The endpoint handles format variations including OTEL format and a handful of special-case adapters so the ingestion path stays unified regardless of how the log was generated.

From the ingest endpoint, log payloads enter a job queue backed by Redis. Redis is optional: if it is not available, the ingestion path routes directly to the worker. The worker is where the platform earns its SIEM designation. It evaluates Sigma rules against incoming logs, generates alerts, dispatches notifications, and runs the full analysis pipeline.

After processing, logs pass through what Giuseppe calls the LogTide Reservoir: a storage abstraction layer that keeps the backend pluggable. In practice, only one backend is truly necessary.

"TimescaleDB is our unique persistent database," Giuseppe explains. "All the aggregation that populates our dashboards is powered by TimescaleDB."

TimescaleDB is our unique persistent database. All the aggregation that populates our dashboards is powered by TimescaleDB. - Giuseppe Pollio

Inside TimescaleDB, LogTide maintains three hypertable families: raw logs, distributed traces (spans), and detection events. Retention policies run automatically with no manual intervention or cron jobs. Continuous aggregates sit on top of the raw log hypertable and are what make the platform fast at scale.

From packages/backend/src/modules/retention/service.ts:

/**
 * Execute retention cleanup for all organizations.
 *
 * Strategy (scales with number of distinct retention values, not orgs):
 * 1. drop_chunks for max retention — instant, drops entire files
 * 2. Group orgs by retention_days, collect all project_ids per group
 * 3. For each group with retention < max: batch-delete their logs
 */
async executeRetentionForAllOrganizations(): Promise<RetentionExecutionSummary> {
  const startTime = Date.now();
  const logging = isInternalLoggingEnabled();

  // Get all organizations with their retention + projects
  const organizations = await db
    .selectFrom('organizations')
    .select(['id', 'name', 'retention_days'])
    .execute();

  const orgProjects = await db
    .selectFrom('projects')
    .select(['id', 'organization_id'])
    .execute();

  // Build org -> projectIds map
  const projectsByOrg = new Map<string, string[]>();
  for (const p of orgProjects) {
    const list = projectsByOrg.get(p.organization_id) || [];
    list.push(p.id);
    projectsByOrg.set(p.organization_id, list);
  }

  // Find max retention (used for drop_chunks)
  const maxRetention = Math.max(...organizations.map(o => o.retention_days));
  const maxCutoff = new Date(Date.now() - maxRetention * 24 * 60 * 60 * 1000);

  // Step 1: drop_chunks older than max retention (TimescaleDB only — instant, no decompression)
  // For ClickHouse, TTL policies handle this natively or deleteByTimeRange in step 3
  let chunksDropped = 0;
  if (reservoir.getEngineType() === 'timescale') {
    try {
      const dropResult = await sql`
        SELECT drop_chunks('logs', older_than => ${maxCutoff}::timestamptz)
      `.execute(db);
      chunksDropped = dropResult.rows.length;

      /* v8 ignore next 6 -- telemetry, disabled in tests */
      if (chunksDropped > 0 && logging) {
        hub.captureLog('info', `Dropped ${chunksDropped} chunks older than ${maxRetention} days`, {
          maxRetentionDays: maxRetention,
          cutoffDate: maxCutoff.toISOString(),
          chunksDropped,
        });
      }
    } catch (err) {
      // drop_chunks may fail if no chunks to drop — that's fine
      /* v8 ignore next 4 -- telemetry, disabled in tests */
      if (logging) {
        const msg = err instanceof Error ? err.message : String(err);
        hub.captureLog('debug', `drop_chunks: ${msg}`);
      }
    }
  }

  // Step 2: Group orgs by retention_days (only those with retention < max need per-row deletes)
  const retentionGroups = new Map<number, { orgs: typeof organizations; projectIds: string[] }>();
  for (const org of organizations) {
    if (org.retention_days >= maxRetention) continue; // already handled by drop_chunks

    const group = retentionGroups.get(org.retention_days) || { orgs: [], projectIds: [] };
    group.orgs.push(org);
    const orgProjectIds = projectsByOrg.get(org.id) || [];
    group.projectIds.push(...orgProjectIds);
    retentionGroups.set(org.retention_days, group);
  }

  // Step 3: Batch-delete per retention group
  const results: RetentionExecutionResult[] = [];
  let totalDeleted = 0;
  let failedCount = 0;

  for (const [retentionDays, group] of retentionGroups) {
    if (group.projectIds.length === 0) {
      for (const org of group.orgs) {
        results.push({
          organizationId: org.id,
          organizationName: org.name,
          retentionDays,
          logsDeleted: 0,
          executionTimeMs: 0,
        });
      }
      continue;
    }

    const groupStart = Date.now();
    const cutoffDate = new Date(Date.now() - retentionDays * 24 * 60 * 60 * 1000);

    try {
      const oldestResult = await reservoir.query({
        projectId: group.projectIds,
        from: new Date(0),
        to: cutoffDate,
        limit: 1,
        sortOrder: 'asc',
      });

      if (oldestResult.logs.length === 0) {
        for (const org of group.orgs) {
          results.push({
            organizationId: org.id,
            organizationName: org.name,
            retentionDays,
            logsDeleted: 0,
            executionTimeMs: Date.now() - groupStart,
          });
        }
        continue;
      }

      const deleted = await this.batchDeleteLogs(
        group.projectIds,
        cutoffDate,
        new Date(oldestResult.logs[0].time)
      );
      totalDeleted += deleted;
    } catch (error) {
      failedCount += group.orgs.length;
    }
  }
}

"The aggregates are necessary," said Giuseppe. "If you have five million, ten million logs every day, and you need to see how many logs you received every hour, you can't run that query on 10 million logs. The aggregates give you query results in milliseconds instead of 30 or 40 seconds."

Continuous aggregate definition, from packages/backend/migrations/004_performance_optimization.sql:

CREATE MATERIALIZED VIEW logs_hourly_stats
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  project_id,
  level,
  service,
  COUNT(*) AS log_count
FROM logs
GROUP BY bucket, project_id, level, service
WITH NO DATA;

-- Refreshes automatically every hour
SELECT add_continuous_aggregate_policy('logs_hourly_stats',
  start_offset => INTERVAL '3 hours',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour',
  if_not_exists => TRUE
);

CREATE INDEX IF NOT EXISTS idx_logs_hourly_stats_project_bucket
  ON logs_hourly_stats (project_id, bucket DESC);

Hybrid query at runtime, from packages/backend/src/modules/dashboard/service.ts

const [todayAggregateStats, recentTotal, recentErrors, recentServices, yesterdayAggregateStats, prevHourCount] = await Promise.all([
  // Today's historical stats from aggregate (today start to 1 hour ago)
  db
    .selectFrom('logs_hourly_stats')
    .select([
      sql<string>`COALESCE(SUM(log_count), 0)`.as('total'),
      sql<string>`COALESCE(SUM(log_count) FILTER (WHERE level IN ('error', 'critical')), 0)`.as('errors'),
      sql<string>`COUNT(DISTINCT service)`.as('services'),
    ])
    .where('project_id', 'in', projectIds)
    .where('bucket', '>=', todayStart)
    .where('bucket', '<', lastHourStart)
    .executeTakeFirst(),

  // Recent stats from reservoir (last hour)
  reservoir.count({ projectId: projectIds, from: lastHourStart, to: new Date() }),
  reservoir.count({ projectId: projectIds, from: lastHourStart, to: new Date(), level: ['error', 'critical'] }),
  reservoir.distinct({ field: 'service', projectId: projectIds, from: lastHourStart, to: new Date() }),

  // Yesterday's stats from aggregate
  db
    .selectFrom('logs_hourly_stats')
    .select([
      sql<string>`COALESCE(SUM(log_count), 0)`.as('total'),
      sql<string>`COALESCE(SUM(log_count) FILTER (WHERE level IN ('error', 'critical')), 0)`.as('errors'),
      sql<string>`COUNT(DISTINCT service)`.as('services'),
    ])
    .where('project_id', 'in', projectIds)
    .where('bucket', '>=', yesterdayStart)
    .where('bucket', '<', todayStart)
    .executeTakeFirst(),

  // Previous hour from reservoir (for throughput trend)
  reservoir.count({ projectId: projectIds, from: prevHourStart, to: lastHourStart }),
]);

LogTide's architecture. Logs flow from client SDKs and agents through a single ingest endpoint, into a processing worker, and into TimescaleDB hypertables via the LogTide Reservoir storage abstraction.

What We've Seen

220GB Down to 25GB

In production, LogTide's TimescaleDB deployment compressed 220GB of raw log data, 135GB of row data plus 85GB of indexes, down to 25GB. That is an 88.6% reduction, achieved using TimescaleDB's native columnar compression with a segmentby configuration on project_id and log level, ordered by timestamp descending. Chunks older than seven days compress automatically.

From packages/backend/migrations/001_initial_schema.sql:

-- Enable compression on logs hypertable
ALTER TABLE logs SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'project_id',
  timescaledb.compress_orderby = 'time DESC'
);

-- Add compression policy for logs (compress chunks older than 7 days)
SELECT add_compression_policy('logs', INTERVAL '7 days', if_not_exists => TRUE);

-- Global retention safety net
SELECT add_retention_policy('logs', INTERVAL '90 days', if_not_exists => TRUE);

Query performance did not degrade. Time-range filtering got 33% faster after compression. Aggregations got 41% faster. Only30 full-text search slowed slightly, by about 12%, because columnar storage requires scanning additional columns to reconstruct text fields. For a log management platform where engineers are far more likely to query a time window than to search a raw string, the tradeoff strongly favors compression.

In practice, 30 million logs stored in 15GB on a single 4-vCPU, 8GB RAM node, with a P95 query latency of 50ms. Learn more in Giuseppe’s dev.to post on TimescaleDB compression.

TimescaleDB Bested MongoDB and ClickHouse in Head-to-Head Performance Benchmarks

Giuseppe built an open benchmark suite and ran it across 1K to 1M records, as outlined in his AWS Builder Center article benchmarking ClickHouse and MongoDB vs TimescaleDB. The ingestion story is straightforward: at batch sizes typical of real-world observability (100 events per call), TimescaleDB handles 14,200 inserts per second. ClickHouse handles 250 at the same batch size. The gap exists because ClickHouse buffers small writes and flushes on a 400ms timer, the right design for bulk analytics, the wrong design when a dozen microservices are logging in real time.

The query results are the main story. At 100,000 log records, TimescaleDB answers a filtered service query in 0.47ms. MongoDB answers the same query in 304ms, a 650x difference. Under 50 concurrent queries, TimescaleDB holds at 6.2ms whether the dataset is 1,000 or 1,000,000 records. The mechanism is hypertable partitioning: queries filter by time range and service, TimescaleDB routes them to the active chunk instead of scanning the full table, and continuous aggregates make count and dashboard queries nearly free because the work is already done at write time.

A 2GB RAM Requirement Keeps Operations Lean

The most important number is not the compression ratio or the write throughput. It is the 2GB RAM figure that defines where LogTide can actually run.

"If you have log management that can work with 2GB of RAM, it's really magic," Giuseppe says. "Because you can't do that with Datadog or Splunk or the other self-hosted programs and containers."

If you have log management that can work with 2GB of RAM, it's really magic. You can't do that with Datadog or Splunk or the other self-hosted programs and containers. - Giuseppe Pollio

That 2GB ceiling is what makes LogTide viable for home labs running NAS, small businesses on shared hosting, or a developer who wants to know when their Raspberry Pi's services throw errors. The entire LogTide platform, including API, worker, dashboard, and TimescaleDB storage, runs on the same hardware that already runs Postgres.

Looking Ahead

The LogTide Cloud Platform alpha prototype is now open to trial users. Meanwhile, LogTide’s open-source project is growing fast. Hundreds of GitHub stars and 1k+ clones per day signal a developer community that has found the project and is actively building with it. The next phase is expanding SDK coverage and continuing to stress-test the storage layer. TimescaleDB runs anywhere Postgres runs. The goal is to make sure LogTide does too.

ClickHouse Is Fast. Your Pipeline Isn't.

Team Tiger Data — Tue, 14 Apr 2026 18:15:43 +0000

ClickHouse is fast. The benchmarks aren't lying. If you've run a comparison against vanilla Postgres on the same dataset, the results aren't close. ClickHouse wins by 10x-100x on typical analytical patterns.

That benchmark is also only measuring one dimension of your decision. It tells you how fast queries run on static data. It doesn't tell you anything about data freshness, transactional correctness, pipeline reliability, or the operational cost of keeping two systems synchronized.

This post isn't about whether ClickHouse is fast. It's about the full cost of getting your data there, and keeping it correct once it arrives.

What ClickHouse is actually good at

ClickHouse is a columnar OLAP database built for analytical scan performance. It's great at aggregations over large datasets, column-oriented scans that skip irrelevant data, compression that keeps big datasets resident in memory, and query parallelism across cores.

For batch analytics on historical data where "fresh" means "reflects the last ETL run," ClickHouse is a solid choice. Data warehousing, offline reporting, retrospective analysis. These are real ClickHouse strengths and I'm not going to pretend otherwise.

The question you need to answer for yourself is whether your use case is actually batch analytics, or whether it's operational analytics that needs to be fresh and correct.

The pipeline tax

Here's the thing about ClickHouse: your data doesn't teleport there from Postgres. You need a pipeline.

You've got options. CDC via Debezium, scheduled ETL jobs, Kafka-based streaming, application-level dual-writes. Each one introduces costs that won't show up in any benchmark you've read.

Lag. There's always a gap between a row being committed in Postgres and being queryable in ClickHouse. CDC pipelines typically add 5-30 seconds. Batch ETL adds minutes to hours. Dual-writes add milliseconds, but now you've got a consistency problem: when one write succeeds and the other fails, your two systems are telling different stories about what's true.

Drift. Every schema change in Postgres needs to propagate to ClickHouse. Column additions, type changes, table restructuring: all of it requires pipeline updates. Every migration is now a coordinated change across two systems. Good luck.

Failure modes. Pipelines break. Kafka consumers fall behind. CDC slots get dropped. Backfills happen after outages. Each of these failure modes needs its own monitoring, alerting, and runbook. All of this overhead exists purely because your data lives in two places.

Correctness gaps. ClickHouse uses eventual consistency. Rows arrive out of order. Late-arriving data might not appear in already-computed aggregations. Deduplication requires explicit schema decisions (ReplacingMergeTree and friends). When a dashboard query runs during a pipeline hiccup, the results are wrong, with no transaction isolation to tell you that.

What you actually lose without ACID

Let’s be specific about the ACID trade-off, because it matters more in practice than it sounds in theory.

ClickHouse doesn't support multi-row transactions. A batch INSERT either succeeds or fails as a batch, but you can't roll back a logical transaction that spans multiple inserts across tables. If your analytics join orders, payments, and inventory, the lack of transactional consistency means your results can reflect different points in time. (Whether that matters depends on your use case, but you should know it before you commit to the architecture.)

Updates work differently than you expect. ClickHouse mutations are background operations. When source data gets corrected in Postgres (a sensor recalibration, a price adjustment, a retroactive fix), getting that correction into ClickHouse means re-ingesting the affected data or running an async mutation that finishes whenever the system gets around to it. In Postgres, a corrected value is immediately correct. In ClickHouse, it's eventually correct.

There are no foreign keys, constraints, or triggers. Data integrity is your pipeline's problem now. If bad data gets through, ClickHouse will store it faithfully. Garbage in, garbage queryable.

The real cost: operating two systems

Two databases means two sets of everything. Monitoring dashboards, alerting rules, and backup strategies. Capacity planning, version upgrade procedures, and security patching schedules.

You also need two mental models. When a dashboard shows unexpected numbers, the engineer debugging it has to figure out: is the data wrong in Postgres, or is it right in Postgres but stale in ClickHouse? Is the pipeline behind? Did a schema change not propagate? Is deduplication working correctly? So many questions.

And the pipeline itself is a third system with its own maintenance burden. Kafka clusters, CDC connectors, ETL orchestrators. None of these are zero-maintenance infrastructure.

Total cost of ownership isn't "Postgres cost + ClickHouse cost." It's Postgres cost plus ClickHouse cost plus pipeline cost plus coordination overhead plus the debugging time you spend every time the two systems disagree. That last one is harder to budget for.

When the split is actually worth it

Here's a useful test before we go further. Have your stakeholders ever asked "why is the dashboard showing old data?" If yes, you have a freshness requirement. If the answer to that question has ever been "because the pipeline was behind," then a faster query engine isn't going to solve your problem.

I want to be honest here, because this is where a lot of competitive posts fall apart. There are legitimate reasons to run ClickHouse alongside Postgres.

The split makes sense if your analytics are batch-oriented and hours of lag is acceptable. If your queries are read-only historical scans and you already have Kafka running for other reasons. If analytical query volume would overwhelm your operational database.

It doesn't make sense if your stakeholders want to see current data. If a correction in Postgres needs to show up immediately in your dashboards. If the only reason you'd build a pipeline is to feed ClickHouse. Or if your team is small enough that the operational burden of running three systems isn't worth the query speed gain.

The alternative

TimescaleDB extends Postgres so analytical queries perform well on the same data, in the same database, with the same transactional guarantees.

Hypertables withcolumnar compression give you analytical scan performance on time-series data without moving it anywhere.Continuous aggregates pre-compute common rollups incrementally, so dashboards stay fast without batch jobs. FlightAware dropped a 6.4-second query to 30 milliseconds using continuous aggregates alone, without changing their data model or moving to a separate system. Real-time aggregates layer the newest raw data on top of those precomputed rollups in a single query, so results stay current without waiting for a refresh cycle.

Your data is always fresh because nothing moved it. Corrections are immediate because there's no second system to propagate them to. And there's no pipeline paging you at 3am, because there's no pipeline.

"TimescaleDB strikes a phenomenal balance between the simplicity of storing your analytical data under the same roof as your configuration data, while also gaining much of the impressive performance of a specialized OLAP system." — Robert Cepa, Senior Software Engineer, Cloudflare (How TimescaleDB helped Cloudflare scale analytics — and why they chose it over ClickHouse)

Worth being straight with you: for pure OLAP workloads on petabyte-scale historical data, a dedicated columnar store like ClickHouse will outperform TimescaleDB on raw scan throughput. That gap is real. For batch analytics on historical data where freshness and correctness aren't the point, ClickHouse is a reasonable choice.

But for most teams building operational analytics on live data, the architectural cost of moving that data doesn't justify the query speed gain.

The thing the benchmark doesn't tell you

The fastest query engine in the world doesn't help when the data it's querying is stale. And "the pipeline was behind" is a terrible answer to give your stakeholders at 2am.

ClickHouse is fast. The benchmarks are real. The trade-off is also real: pipelines, lag, drift, eventual consistency, and a second system to operate forever.

If your analytics can tolerate staleness and your team has the infrastructure to keep two systems in sync, ClickHouse is worth serious consideration. If your analytics need to be fresh, correct, and transactional, the architecture that gets you there matters more than the query speed of any single component.

The benchmark tells you one thing. The architecture is what you'll live with.

If you want to see what analytics on your live Postgres data actually looks like, start a free Tiger Cloud database. Your existing schema works as-is. No pipeline required.

Document Databases: Be Honest

Team Tiger Data — Wed, 01 Apr 2026 17:22:30 +0000

MongoDB gets a bad reputation in certain engineering circles that it doesn't entirely deserve. It ships fast. Schema flexibility is real. The developer experience for document-shaped data is good. A lot of teams made a reasonable call when they chose it.

But there's a version of this story that ends badly, and it follows a recognizable pattern. The team picks MongoDB for a new system. The system works. Then the data starts looking less like documents and more like a stream of timestamped events. Queries start filtering by time range. Write volume climbs. Performance degrades in ways that feel familiar if you've read about this problem, and deeply confusing if you haven't.

This post isn't here to relitigate the MongoDB decision. It's here to help you figure out whether the pain you're feeling is a MongoDB problem, a document database problem, or a workload problem that would follow you to Postgres.

The answer matters because the fix is different in each case.

What MongoDB is actually good at

Flexible schema for variable data that's actually variable. Product catalogs where every SKU has different attributes. User profiles where fields vary by account type. Content management where article structure differs by category. These are real document shapes, and MongoDB handles them without the ceremony Postgres requires.

Rapid iteration without migration overhead. Early-stage products change their data model constantly. In Postgres, every schema change is an ALTER TABLE. In MongoDB, you just write different fields. For teams that are still figuring out the shape of their data, this is a real advantage.

Nested and hierarchical data. Some data is naturally a tree. A purchase order with line items with sub-components. A configuration object with nested sections. Postgres can model this with JSONB, but MongoDB's native document model fits it more naturally and queries it more cleanly.

Horizontal scaling for document reads. MongoDB's sharding model was designed for document workloads. For read-heavy document access at scale, it's a mature and well-understood architecture.

These aren't consolation prizes. They're real reasons MongoDB is the right choice for a lot of workloads.

The trouble starts when the data changes shape.

What time-series data actually looks like

Time-series data has a specific shape, and it's not a document shape. Every row is a measurement. It has a timestamp, a source identifier, and a value or set of values. The schema doesn't vary between rows. There's nothing hierarchical about it. The document model isn't adding anything.

What time-series data has instead: enormous volume, strict ordering requirements, queries that almost always filter by time range, and retention policies that drop entire time windows at once.

A wind turbine sensor reporting every five seconds doesn't produce documents. It produces a flat stream of readings: timestamp, sensor ID, RPM, temperature, vibration. A financial trade feed isn't a document store. It's a sequence of immutable events. An APM platform collecting metrics from a distributed system is generating hundreds of thousands of measurements per second, all with the same shape.

The test is simple. Look at your most-written collection. Does each document have a different structure? Or does every document look essentially the same, with a timestamp and some measurements?

If it's the latter, you're storing time-series data in a document database, and the document model is providing zero value while the storage engine works against you.

Where MongoDB struggles with this workload

WiredTiger (MongoDB's default storage engine) uses a B-tree structure optimized for a workload that includes updates to existing documents. For high-frequency append-only writes, it faces a fundamental mismatch. Consider a single sensor reading: one document insert triggers a write to the primary collection, a write to the oplog, and a separate B-tree update for every index on that collection. Three indexes means five writes for one data point. At 10,000 inserts per second, that's 50,000 storage operations per second before you've run a single query. The engine was designed for mixed read-write workloads with in-place updates, not an endless append stream where no document is ever modified after creation.

MongoDB has no native time-based partitioning. Postgres has declarative range partitioning. TimescaleDB automates it entirely with hypertables. MongoDB has no equivalent primitive. Teams end up implementing time-based collection bucketing manually: separate collections per day or week, application-level routing logic, custom cleanup scripts. It works, but it's the same operational burden as manual Postgres partitioning, without the tooling ecosystem that exists on the Postgres side.

MongoDB's aggregation pipeline is expressive. But for time-series workloads, the queries that matter are time-range aggregations: hourly averages, daily maximums, week-over-week comparisons. These queries scan large volumes of documents and aggregate across fields. Without columnar storage and purpose-built time-series compression, performance degrades with data volume in the same way it does in vanilla Postgres.

MongoDB did add a native time-series collection type in 5.0. It's a real improvement for simple append-only use cases. But it doesn't support secondary indexes the same way regular collections do, restricts certain aggregation stages and update operations, and is still relatively new compared to the Postgres ecosystem. Worth knowing about. Not a full answer.

Why moving to vanilla Postgres isn't automatically the fix

This is the section most competitive content skips entirely. If you're evaluating a migration, you deserve the full picture.

If the workload is continuous high-frequency time-series ingestion with long retention and operational query requirements, vanilla Postgres has its own version of this problem. The MVCC overhead, write amplification, autovacuum contention, and index maintenance costs that create the Optimization Treadmill exist in Postgres too. The storage model is different from MongoDB's, but the outcome at scale is the same: performance degrades with data volume, maintenance overhead accumulates, and each optimization cycle buys time without changing the trajectory.

Moving from MongoDB to vanilla Postgres solves the schema flexibility problem (you probably don't need it for this workload anyway). You get a mature partitioning ecosystem, a better query planner, and a richer extension ecosystem. These are real improvements.

It doesn't solve the core time-series storage problem, because that problem lives in the storage model, not the database brand.

The question isn't MongoDB vs. Postgres. It's document store vs. purpose-built time-series storage. That's the actual axis the decision should sit on.

The decision framework

Your data is actually documents. Variable schema, nested structures, hierarchical relationships, read-heavy access patterns. MongoDB is the right tool. The pain you're feeling is probably a schema design or indexing problem, not a fundamental architectural mismatch. Fix the schema.

Your data is time-series but volume is modest. Sub-10K inserts per second, retention under 90 days, no hard operational latency requirements on the full retention window. Vanilla Postgres with good partitioning and indexing handles this fine. The Optimization Treadmill exists, but the ceiling is far enough away that standard tuning keeps you ahead of it. Move to Postgres, implement partitioning early, andmonitor the warning signs.

Your data is time-series at sustained high volume. Continuous ingestion, long retention, operational query requirements, growing data volume. This is the workload that breaks both MongoDB and vanilla Postgres through the same class of mechanisms. Purpose-built time-series storage on Postgres (same SQL, same wire protocol, same tooling) is the right answer.Migration from MongoDB to TimescaleDB follows a well-documented path: you keep everything Postgres-compatible and gain the storage architecture that matches the workload.

What to do next

MongoDB didn't fail you if you're reading this. Your workload evolved past what document storage was designed for. That's a different thing.

Most database choices are right at the time they're made and wrong eighteen months later when the system looks nothing like it did at launch. Sensor data that started as a feature became the core product. The document store that handled early prototyping became the production system for a time-series pipeline.

The question now is whether the fix is tuning, migration, or architecture. The framework above gives you a clear read on which one applies. If it's architecture, the good news is that moving from MongoDB to a Postgres-compatible time-series database is less disruptive than it sounds. Your application SQL stays the same. Your tooling stays the same. The storage engine underneath is the thing that changes.

That's the right scope for the change. Not the whole stack. Just the part that was always wrong for this workload.

Read the full technical breakdown of why vanilla Postgres hits these limits, orstart a Tiger Cloud trial and see how TimescaleDB handles your workload directly.