<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tim Huang</title>
    <description>The latest articles on DEV Community by Tim Huang (@tim_huang).</description>
    <link>https://dev.to/tim_huang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2587519%2F5f72f586-feca-4473-8bd4-bea5b41286af.jpeg</url>
      <title>DEV Community: Tim Huang</title>
      <link>https://dev.to/tim_huang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tim_huang"/>
    <language>en</language>
    <item>
      <title>Query 1B Rows in PostgreSQL &gt;25x Faster with Squirrels!</title>
      <dc:creator>Tim Huang</dc:creator>
      <pubDate>Wed, 18 Dec 2024 18:07:57 +0000</pubDate>
      <link>https://dev.to/tim_huang/query-1b-rows-in-postgresql-25x-faster-with-squirrels-4e01</link>
      <guid>https://dev.to/tim_huang/query-1b-rows-in-postgresql-25x-faster-with-squirrels-4e01</guid>
      <description>&lt;p&gt;The &lt;a href="https://www.morling.dev/blog/one-billion-row-challenge/" rel="noopener noreferrer"&gt;One Billion Row Challenge&lt;/a&gt; has been making waves in the data engineering community lately. Originally created to test CSV parsing performance, the challenge involves processing a file containing 1 billion weather measurements to calculate basic temperature statistics for each city. In this post, I'll tackle a variation of this challenge using PostgreSQL and demonstrate how to achieve dramatic performance improvements using Squirrels.&lt;/p&gt;

&lt;h2&gt;The Challenge&lt;/h2&gt;

&lt;p&gt;The original &lt;a href="https://www.morling.dev/blog/one-billion-row-challenge/" rel="noopener noreferrer"&gt;One Billion Row Challenge&lt;/a&gt; focuses on raw CSV processing performance. For our variation, we'll:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load 1 billion rows into PostgreSQL with additional columns&lt;/li&gt;
&lt;li&gt;Query for city-level temperature statistics&lt;/li&gt;
&lt;li&gt;Create a Squirrels project to serve these analytics via REST API&lt;/li&gt;
&lt;li&gt;Demonstrate significant query performance improvements&lt;/li&gt;
&lt;li&gt;Show how to handle incremental data updates&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Setting Up the Environment&lt;/h2&gt;

&lt;p&gt;I provisioned the following AWS resources in the same region and availability zone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An RDS PostgreSQL database (db.r6g.large: 2 vCPUs, 16GB RAM)&lt;/li&gt;
&lt;li&gt;An EC2 instance (r8g.large: 2 vCPUs, 16GB RAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Data Generation and Loading&lt;/h2&gt;

&lt;p&gt;I generated a 24GB CSV file containing 1 billion weather measurements using a modified version of the &lt;code&gt;createMeasurements.py&lt;/code&gt; script from &lt;a href="https://github.com/ifnesi/1brc" rel="noopener noreferrer"&gt;this GitHub repo&lt;/a&gt;. The first few lines of the file look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bissau;2012-02-20;14.3
Almaty;2019-10-24;-5.3
Ankara;2012-10-27;-6.7
Houston;2010-06-08;10.6
Makassar;2012-04-29;36.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
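&lt;p&gt;As a sketch of how such a file can be generated (this is not the actual &lt;code&gt;createMeasurements.py&lt;/code&gt; — the city list and temperature range below are made up for illustration), a minimal generator for this semicolon-delimited layout could look like:&lt;/p&gt;

```python
import random
from datetime import date, timedelta

# Hypothetical stand-in for createMeasurements.py: emits rows in the
# "city;date;temperature" layout shown above. The city names and the
# temperature range here are illustrative, not the real script's values.
CITIES = ["Bissau", "Almaty", "Ankara", "Houston", "Makassar"]

def generate_measurements(path, n_rows, seed=42):
    rng = random.Random(seed)
    start = date(2010, 1, 1)
    with open(path, "w") as f:
        for _ in range(n_rows):
            city = rng.choice(CITIES)
            day = start + timedelta(days=rng.randrange(3653))  # ~10 years of dates
            temp = round(rng.uniform(-40.0, 45.0), 1)
            f.write(f"{city};{day.isoformat()};{temp}\n")

generate_measurements("measurements.txt", 1_000)  # scale n_rows up as needed
```

&lt;p&gt;At a billion rows, a vectorized or parallel generator (as in the linked repo) is far faster than a pure-Python loop like this one.&lt;/p&gt;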



&lt;p&gt;The data was loaded into a simple PostgreSQL table with the following DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;recorded_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="nb"&gt;FLOAT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
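&lt;p&gt;The load step itself isn't shown here; one common way to bulk-load a delimited file like this is psql's &lt;code&gt;\copy&lt;/code&gt; meta-command (the file path below is a placeholder):&lt;/p&gt;

```sql
-- Hypothetical load command, run from a psql session connected to the database.
-- The generated file has no header row and uses ';' as its delimiter.
\copy weather_data (city, recorded_date, temperature) FROM 'measurements.txt' WITH (FORMAT csv, DELIMITER ';')
```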



&lt;p&gt;This created a 73GB PostgreSQL table with 4 columns and 1 billion rows.&lt;/p&gt;

&lt;h2&gt;Initial Query Performance&lt;/h2&gt;

&lt;p&gt;Our baseline query to calculate city-level statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_temperature&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query took approximately 7 minutes to complete against the PostgreSQL table.&lt;/p&gt;

&lt;h2&gt;Enter Squirrels&lt;/h2&gt;

&lt;p&gt;I created a Squirrels project to serve these analytics via a REST API. The project structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sqrl-1brc-postgres/
├── models/
│   ├── dbviews/
│   │   ├── aggregate_weather.sql
│   │   └── aggregate_weather.yml
│   └── sources.yml
├── .gitignore
├── env.yml
├── requirements.txt
└── squirrels.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query (found in &lt;code&gt;models/dbviews/aggregate_weather.sql&lt;/code&gt;) was slightly modified to use Squirrels' source macro. A new column &lt;code&gt;translate_test&lt;/code&gt; was added as well, which will be explained later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;translate_test&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"src_weather_data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;models/sources.yml&lt;/code&gt; file describes the source named "src_weather_data" to Squirrels, including metadata about its table and columns.&lt;/p&gt;

&lt;p&gt;The Squirrels project (the version from when this post was written) can be found on GitHub here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/squirrels-analytics/squirrels-examples/tree/1brc-part1/sqrl-1brc-postgres" rel="noopener noreferrer"&gt;https://github.com/squirrels-analytics/squirrels-examples/tree/1brc-part1/sqrl-1brc-postgres&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This includes all files except for the &lt;code&gt;env.yml&lt;/code&gt; file, which looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env_vars&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres_uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgresql://postgres:********@postgres-db.************.us-east-1.rds.amazonaws.com:5432/postgres&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project dependencies were installed using &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;. This also installs the &lt;code&gt;sqrl&lt;/code&gt; CLI tool for commands such as running the API server and building the data artifact.&lt;/p&gt;

&lt;p&gt;After running &lt;code&gt;sqrl run --host 0.0.0.0 --no-cache&lt;/code&gt; to start the API server, the REST API can be accessed with a GET request at the endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/squirrels-v0/1brc-postgres/v1/dataset/aggregate-weather&lt;/li&gt;
&lt;/ul&gt;
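&lt;p&gt;The endpoint appears to follow the pattern &lt;code&gt;/squirrels-v0/{project}/{version}/dataset/{dataset}&lt;/code&gt;. A small sketch of assembling and calling it (the host and port are placeholders — use the address printed by &lt;code&gt;sqrl run&lt;/code&gt;):&lt;/p&gt;

```python
import urllib.request

# Assembles the dataset endpoint. The path pattern is inferred from the
# single endpoint shown above; the base URL is a placeholder.
def dataset_url(base: str, project: str, version: str, dataset: str) -> str:
    return f"{base}/squirrels-v0/{project}/{version}/dataset/{dataset}"

url = dataset_url("http://localhost:8000", "1brc-postgres", "v1", "aggregate-weather")

# With the server running, a GET request returns the aggregated results:
# with urllib.request.urlopen(url) as resp:
#     body = resp.read()
```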

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: By default, Squirrels includes a built-in result cache for the API results (where the cache's time-to-live can be configured). This was disabled by including the &lt;code&gt;--no-cache&lt;/code&gt; flag in the &lt;code&gt;sqrl run&lt;/code&gt; command.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Performance Results&lt;/h2&gt;

&lt;p&gt;First, I ran the API request without building the data artifact to measure baseline query performance, averaging 3 runs. Then, I ran &lt;code&gt;sqrl build&lt;/code&gt; once to build the data artifact that accelerates queries, and ran the API request 3 more times to observe the improvement. The data artifact is generated in the &lt;code&gt;target/&lt;/code&gt; folder and is about 5GB in size.&lt;/p&gt;

&lt;p&gt;The following are the performance results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initial API Request (average of 3 runs): 419.71 seconds (≈7 minutes)&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;sqrl build&lt;/code&gt;: 434.19 seconds&lt;/li&gt;
&lt;li&gt;API request after running the build (average of 3 runs): &lt;strong&gt;15.67 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;We successfully reduced the query runtime from 7 minutes to 15.67 seconds, a 27x performance improvement!&lt;/strong&gt;&lt;/p&gt;
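&lt;p&gt;As a quick sanity check on that figure:&lt;/p&gt;

```python
# The "27x" claim follows directly from the two averaged timings above.
baseline_s = 419.71   # average API runtime before the build
built_s = 15.67       # average API runtime after the build
speedup = baseline_s / built_s
print(f"{speedup:.1f}x")  # prints "26.8x", i.e. roughly 27x
```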

&lt;p&gt;As shown in the screenshot below, there was practically no difference in runtimes across the 3 runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3093pcmywmiytpvw6h8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3093pcmywmiytpvw6h8.png" alt="Bar chart comparing performance results" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: With Squirrels, we can create pre-aggregated data models to improve query performance for specific queries. For instance, we can pre-aggregate the weather data by city and date to accommodate all queries that filter or group by city or date. With further modifications to the Squirrels project, this reduces the runtime of the query above to &lt;strong&gt;a fraction of a second&lt;/strong&gt;! More details on this will be covered in Part 2.&lt;/p&gt;
&lt;/blockquote&gt;
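&lt;p&gt;To make the pre-aggregation idea concrete (using SQLite with toy data as a stand-in engine — this illustrates the concept, not Squirrels' actual mechanism): roll the raw rows up by city and date once, storing &lt;code&gt;SUM&lt;/code&gt; and &lt;code&gt;COUNT&lt;/code&gt; rather than &lt;code&gt;AVG&lt;/code&gt; so that averages stay exact when re-aggregated:&lt;/p&gt;

```python
import sqlite3

# Toy demonstration: answer city-level stats from a (city, date) rollup
# instead of the raw table. SUM and COUNT are stored so that AVG can be
# recomposed exactly at any coarser grouping.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather_data (city TEXT, recorded_date TEXT, temperature REAL);
    INSERT INTO weather_data VALUES
        ('Bissau', '2012-02-20', 14.3), ('Bissau', '2012-02-21', 16.1),
        ('Almaty', '2019-10-24', -5.3), ('Almaty', '2019-10-25', -2.0);
    CREATE TABLE weather_by_city_date AS
        SELECT city, recorded_date,
               MIN(temperature) AS min_t, MAX(temperature) AS max_t,
               SUM(temperature) AS sum_t, COUNT(*) AS n
        FROM weather_data
        GROUP BY city, recorded_date;
""")
rows = conn.execute("""
    SELECT city, MIN(min_t), MAX(max_t), SUM(sum_t) / SUM(n)
    FROM weather_by_city_date
    GROUP BY city ORDER BY city
""").fetchall()
print(rows)
```

&lt;p&gt;Any query that only filters or groups by city and/or date can then be served entirely from the much smaller rollup table.&lt;/p&gt;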

&lt;h2&gt;Handling Data Updates&lt;/h2&gt;

&lt;p&gt;Squirrels makes it easy to handle incremental updates through update hints in &lt;code&gt;sources.yml&lt;/code&gt; (notice the &lt;code&gt;update_hints&lt;/code&gt; section below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;src_weather_data&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A table containing weather data&lt;/span&gt;
    &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weather_data&lt;/span&gt;
    &lt;span class="na"&gt;update_hints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;increasing_column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test this, I:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Added 1 million new rows to the PostgreSQL table&lt;/li&gt;
&lt;li&gt;Ran &lt;code&gt;sqrl build --stage&lt;/code&gt; (took 41.84 seconds) and confirmed that the data artifact was updated&lt;/li&gt;
&lt;li&gt;Tested query performance (took &lt;strong&gt;15.66 seconds&lt;/strong&gt; for 1.001 billion rows, similar to the 15.67 seconds for 1 billion rows)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;--stage&lt;/code&gt; flag ensures zero downtime during updates: the new copy is built in a staging location first, then swapped in as the active data artifact once the previous one is no longer in use by any ongoing queries. In addition, the &lt;code&gt;sqrl build&lt;/code&gt; command can be run in the background on a schedule (e.g. every 10 minutes or every hour, depending on data freshness requirements) to keep the data artifact up to date.&lt;/p&gt;
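&lt;p&gt;My reading of what the &lt;code&gt;increasing_column&lt;/code&gt; hint enables, sketched with SQLite standing in for both the source database and the data artifact (an illustration of the idea, not Squirrels' actual implementation):&lt;/p&gt;

```python
import sqlite3

# Incremental build sketch: copy only rows whose increasing column (id)
# exceeds the maximum id already present in the artifact.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE weather_data (id INTEGER PRIMARY KEY, city TEXT, temperature REAL);
    INSERT INTO weather_data (city, temperature) VALUES ('Bissau', 14.3), ('Almaty', -5.3);
""")

artifact = sqlite3.connect(":memory:")
artifact.execute("CREATE TABLE weather_data (id INTEGER PRIMARY KEY, city TEXT, temperature REAL)")

def incremental_build():
    last_id = artifact.execute(
        "SELECT COALESCE(MAX(id), 0) FROM weather_data").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, city, temperature FROM weather_data WHERE id > ?",
        (last_id,)).fetchall()
    artifact.executemany("INSERT INTO weather_data VALUES (?, ?, ?)", new_rows)
    artifact.commit()
    return len(new_rows)

incremental_build()            # first build copies all existing rows
source.execute("INSERT INTO weather_data (city, temperature) VALUES ('Ankara', -6.7)")
copied = incremental_build()   # subsequent build copies only the new row
```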

&lt;h2&gt;Behind the Scenes of the "Data Artifact"&lt;/h2&gt;

&lt;p&gt;The "data artifact" is actually a &lt;a href="https://duckdb.org/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt; database file. The &lt;code&gt;sqrl build&lt;/code&gt; command is able to build the source table(s) as DuckDB table(s) if the source table are from PostgreSQL, MySQL, or SQLite. When running an API request, Squirrels will determine whether all the necessary source tables exist in the DuckDB database for PostgreSQL, MySQL, or SQLite queries. If they do, it will translate the SQL query into DuckDB's SQL dialect (using &lt;a href="https://sqlglot.com/sqlglot.html" rel="noopener noreferrer"&gt;SQLGlot&lt;/a&gt;) and run the query on the DuckDB file instead. To prove that the translation is in place, the column &lt;code&gt;translate_test&lt;/code&gt; returns 0 because the PostgreSQL dialect does integer division. If the query was run in DuckDB without translating, the result would have been 0.5 because DuckDB does floating point division when dividing two integers using the &lt;code&gt;/&lt;/code&gt; operator.&lt;/p&gt;

&lt;p&gt;Support for loading from additional sources (e.g. APIs, S3, etc.) with &lt;code&gt;sqrl build&lt;/code&gt; will be made available in the future.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Using Squirrels together with DuckDB, we achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;27x faster query performance&lt;/strong&gt; on 1 billion rows without changing the PostgreSQL query&lt;/li&gt;
&lt;li&gt;Simple incremental updates with zero-downtime refreshes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While it is possible to optimize the PostgreSQL table for better performance, it often requires significant time and effort with only modest gains. In contrast, using Squirrels can achieve substantial performance improvements with minimal effort.&lt;/p&gt;

&lt;p&gt;In Part 2, we'll explore dynamic parameterized queries that change based on user input for date ranges, and demonstrate how to join results from multiple databases and unstructured data from S3!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>dataengineering</category>
      <category>analytics</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
