<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tatsuya Nishimura</title>
    <description>The latest articles on DEV Community by Tatsuya Nishimura (@nishimoo).</description>
    <link>https://dev.to/nishimoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3706012%2F52d5bd51-7cf8-4f81-874e-ebe0e36cb764.png</url>
      <title>DEV Community: Tatsuya Nishimura</title>
      <link>https://dev.to/nishimoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nishimoo"/>
    <language>en</language>
    <item>
      <title>Comparison of Apache Parquet and Apache Arrow</title>
      <dc:creator>Tatsuya Nishimura</dc:creator>
      <pubDate>Tue, 13 Jan 2026 03:58:30 +0000</pubDate>
      <link>https://dev.to/nishimoo/comparison-of-apache-parquet-and-apache-arrow-284m</link>
      <guid>https://dev.to/nishimoo/comparison-of-apache-parquet-and-apache-arrow-284m</guid>
      <description>&lt;h2&gt;
  
  
  Apache Parquet
&lt;/h2&gt;

&lt;p&gt;A column-oriented file format designed for efficient storage and querying of large-scale datasets. It reduces storage costs and I/O overhead through compression and encoding. Widely used in data lakes and in processing engines such as Hadoop, Spark, and BigQuery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Arrow
&lt;/h2&gt;

&lt;p&gt;A column-oriented in-memory format specification. It enables zero-copy data sharing across different processes and languages by using the same binary structure for in-memory processing, file storage (&lt;code&gt;.arrow&lt;/code&gt;/&lt;code&gt;.feather&lt;/code&gt;), inter-process communication (IPC), and network transfer (Flight).&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parquet&lt;/strong&gt;: Long-term storage and archiving, reducing storage costs, leveraging column statistics, integrating with engines such as Spark and BigQuery&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Arrow&lt;/strong&gt;: Sharing data across processes and languages, low-latency requirements, in-memory caching, IPC and Flight communication&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design Purpose Differences
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Parquet&lt;/th&gt;
&lt;th&gt;Arrow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File storage and archiving&lt;/td&gt;
&lt;td&gt;In-memory processing and sharing (also for files and network)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snappy / gzip / Zstd / LZ4&lt;/td&gt;
&lt;td&gt;LZ4 / Zstd (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thrift (at footer, includes statistics)&lt;/td&gt;
&lt;td&gt;FlatBuffers (at header, no statistics)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds to seconds&lt;/td&gt;
&lt;td&gt;Microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense on disk (compressed), expanded when loaded&lt;/td&gt;
&lt;td&gt;Dense in memory, further reduced with buffer sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Updates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not supported (write new files only)&lt;/td&gt;
&lt;td&gt;Immutable, but in-memory recreation is fast and appending to files is easier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Parquet Compression Defaults:&lt;/strong&gt; PyArrow and DuckDB use &lt;strong&gt;Snappy&lt;/strong&gt; by default (&lt;a href="https://arrow.apache.org/docs/python/parquet.html#compression-encoding-and-file-compatibility" rel="noopener noreferrer"&gt;PyArrow&lt;/a&gt;, &lt;a href="https://duckdb.org/docs/stable/data/parquet/overview.html" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;). Polars uses &lt;strong&gt;Zstd&lt;/strong&gt; by default (&lt;a href="https://github.com/pola-rs/polars/blob/2a5c6a3de8d2d487a4032e6fbcbdb917e437ab22/crates/polars-io/src/parquet/write/writer.rs#L76" rel="noopener noreferrer"&gt;Polars implementation&lt;/a&gt;). Compression is optional in all cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Parquet ↔ Arrow Conversion
&lt;/h2&gt;

&lt;p&gt;Conversion takes a single line in each direction, but writing Parquet involves extra work such as encoding, compressing, and computing the statistics stored in the metadata, so it isn't necessarily fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow.parquet&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;

&lt;span class="c1"&gt;# Parquet → Arrow
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# table is an Arrow array table representation
&lt;/span&gt;
&lt;span class="c1"&gt;# Arrow → Parquet
&lt;/span&gt;&lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output.parquet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Type Compatibility
&lt;/h3&gt;

&lt;p&gt;Nearly one-to-one correspondence. Both formats support primitive types (int, float, string, etc.) as well as nested types (&lt;code&gt;struct&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;, &lt;code&gt;map&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regarding Nested Types:&lt;/strong&gt; Parquet uses &lt;strong&gt;Dremel encoding&lt;/strong&gt; (a combination of definition and repetition levels) for encoding nested types (&lt;a href="https://github.com/apache/parquet-format#nested-encoding" rel="noopener noreferrer"&gt;Parquet Format - Nested Encoding&lt;/a&gt;). Arrow, on the other hand, represents nested types as relationships between parent and child arrays; while the in-memory layout differs, the semantics are compatible. When reading a nested column from a Parquet file using Arrow, internal layout conversion is necessary, but the data meaning is preserved. (References: &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#struct-layout" rel="noopener noreferrer"&gt;Arrow Columnar Format - Struct Layout&lt;/a&gt;, &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps" rel="noopener noreferrer"&gt;Arrow Columnar Format - Nested type arrays&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Binary Structure Differences
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parquet Layout
&lt;/h3&gt;

&lt;p&gt;A Parquet file is a collection of pages, where each page contains compressed and encoded column data. On read, the metadata section at the footer is read first to determine "where each page is located," and then only the necessary pages are decompressed. Statistics allow skipping unnecessary data, making it advantageous for conditional queries on large-scale data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Column A, page 1: compressed]
[Column B, page 1: compressed]
[Column A, page 2: compressed]
[Column B, page 2: compressed]
...
[Footer metadata + Schema + Statistics]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Arrow Layout
&lt;/h3&gt;

&lt;p&gt;Arrow arrays consist of metadata and buffers that can be directly mapped to memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Metadata: schema, buffer offsets, sizes]
[Buffer 0: validity bitmap (0 if null, 1 otherwise)]
[Buffer 1: values / offsets / data]
[Buffer 2: ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Apache pretty much has everything you need.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>arrow</category>
    </item>
    <item>
      <title>Save on DuckDB + S3 Transfer Costs</title>
      <dc:creator>Tatsuya Nishimura</dc:creator>
      <pubDate>Mon, 12 Jan 2026 04:31:11 +0000</pubDate>
      <link>https://dev.to/nishimoo/save-on-duckdb-s3-transfer-costs-59i1</link>
      <guid>https://dev.to/nishimoo/save-on-duckdb-s3-transfer-costs-59i1</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Use Cloudflare R2, or run DuckDB on EC2 in the same region as your S3 bucket with Gateway Endpoint enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick note
&lt;/h2&gt;

&lt;p&gt;Stick with Parquet.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much data actually gets transferred?
&lt;/h2&gt;

&lt;p&gt;When you query a Parquet file on S3 through DuckDB, the whole file doesn't get downloaded. Instead, DuckDB uses &lt;strong&gt;HTTP Range Requests&lt;/strong&gt; to grab only the bytes it needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanics
&lt;/h3&gt;

&lt;p&gt;DuckDB fetches data in two passes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: Range-request just the metadata section of the Parquet file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: Range-request only the columns and row groups needed by your query&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;"DuckDB always uses range requests, firstly to query the metadata only, then to fetch the required columns."&lt;br&gt;
— &lt;a href="https://github.com/duckdb/duckdb/pull/5405" rel="noopener noreferrer"&gt;PR #5405: HTTP parquet optimizations&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A concrete example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column_a&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://bucket/file.parquet'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DuckDB downloads only the bytes containing &lt;code&gt;column_a&lt;/code&gt;. So even with a 10GB file, if &lt;code&gt;column_a&lt;/code&gt; is just 100MB, you only transfer ~100MB.&lt;/p&gt;

&lt;p&gt;Even better—sometimes you don't transfer anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://bucket/file.parquet'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet metadata includes row counts, so DuckDB can return your result &lt;strong&gt;without reading any data at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://duckdb.org/docs/stable/core_extensions/httpfs/https" rel="noopener noreferrer"&gt;DuckDB Official Documentation - HTTP(S) Support&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Filter and projection pushdown
&lt;/h3&gt;

&lt;p&gt;DuckDB's S3 reader can push filters and projections down to the storage layer, so even less data gets touched.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We're able to do partial reads via Range requests actually, so it should be fairly efficient."&lt;br&gt;
— &lt;a href="https://github.com/duckdb/duckdb/discussions/4559" rel="noopener noreferrer"&gt;Discussion #4559&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  So why do S3 bills get so nasty?
&lt;/h2&gt;

&lt;p&gt;Here's the catch: &lt;strong&gt;intra-region EC2-to-S3 transfers are free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Yet somehow people end up with shocking bills. What's going on?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. No S3 Gateway Endpoint
&lt;/h3&gt;

&lt;p&gt;Without a Gateway Endpoint, traffic from your VPC to S3 gets routed through NAT Gateway or the internet gateway.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Via NAT Gateway&lt;/strong&gt;: You pay $0.045/GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Via Gateway Endpoint&lt;/strong&gt;: Free&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"There is no additional charge for using gateway endpoints."&lt;br&gt;
— &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html" rel="noopener noreferrer"&gt;AWS Official Documentation - Gateway endpoints for Amazon S3&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Accessing across regions
&lt;/h3&gt;

&lt;p&gt;If your S3 bucket and EC2 are in different regions, AWS charges you $0.01–$0.02/GB for the privilege.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Going out to the internet
&lt;/h3&gt;

&lt;p&gt;Querying from your laptop or anything outside AWS? You pay $0.09/GB and up for internet egress.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EC2 + S3 Gateway Endpoint in the same region&lt;/strong&gt; = zero transfer charges.&lt;/p&gt;

&lt;p&gt;Querying Parquet from EC2 in your bucket's region beats downloading everything locally by a mile. The bigger your data, the bigger the savings.&lt;/p&gt;

&lt;p&gt;The downside? Standing up and configuring EC2 every time gets old fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other object storage options
&lt;/h2&gt;

&lt;p&gt;Let's compare alternatives. The key metric is egress—that's what kills your budget.&lt;/p&gt;

&lt;p&gt;Note: Ingress (uploading) is always free. With object storage, you pay to get your data back out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;AWS S3 Pricing&lt;/a&gt; - "Data Transfer IN To Amazon S3 From Internet: $0.00 per GB"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/storage/pricing#network-egress" rel="noopener noreferrer"&gt;GCS Pricing&lt;/a&gt; - "Network ingress: Free"&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/" rel="noopener noreferrer"&gt;Azure Bandwidth Pricing&lt;/a&gt; - "Data Transfer In: Free"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloudflare R2
&lt;/h3&gt;

&lt;p&gt;Cloudflare R2: free egress, full stop.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;$0.015/GB/month&lt;/td&gt;
&lt;td&gt;10GB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class A operations&lt;/td&gt;
&lt;td&gt;$4.50/million&lt;/td&gt;
&lt;td&gt;1M requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Class B operations&lt;/td&gt;
&lt;td&gt;$0.36/million&lt;/td&gt;
&lt;td&gt;10M requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Unlimited&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to ditch the transfer bill entirely, R2 is your answer.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.cloudflare.com/developer-platform/products/r2/" rel="noopener noreferrer"&gt;Cloudflare R2 Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Backblaze B2
&lt;/h3&gt;

&lt;p&gt;Backblaze B2 keeps egress essentially free too.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;$6/TB/month ($0.006/GB/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Egress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free&lt;/strong&gt; (up to 3× your storage/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overage Egress&lt;/td&gt;
&lt;td&gt;$0.01/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Store 100GB, download 300GB free per month. Plus it's S3-compatible.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.backblaze.com/cloud-storage/pricing" rel="noopener noreferrer"&gt;Backblaze B2 Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage
&lt;/h3&gt;

&lt;p&gt;GCS hands out free transfers between services in the same region.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transfer Type&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same zone (private IP)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same region (GCS ↔ GCE, etc.)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different zones (same region)&lt;/td&gt;
&lt;td&gt;$0.01/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-region (e.g., between US regions)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound to internet&lt;/td&gt;
&lt;td&gt;$0.12/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run DuckDB on a GCE instance in the same region as your data and you pay nothing.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://cloud.google.com/storage/pricing" rel="noopener noreferrer"&gt;Google Cloud Storage Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Blob Storage
&lt;/h3&gt;

&lt;p&gt;Azure does the same for intra-region transfers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transfer Type&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same Availability Zone&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-region (e.g., US to Canada)&lt;/td&gt;
&lt;td&gt;$0.02/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outbound to internet&lt;/td&gt;
&lt;td&gt;$0.087/GB (first 100GB free/month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spin up an Azure VM in the same region as your storage account and transfers are free.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://azure.microsoft.com/en-us/pricing/details/storage/blobs/" rel="noopener noreferrer"&gt;Azure Blob Storage Pricing&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Egress&lt;/th&gt;
&lt;th&gt;Intra-region&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;$0.023/GB&lt;/td&gt;
&lt;td&gt;$0.09/GB&lt;/td&gt;
&lt;td&gt;Free (with Gateway Endpoint)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare R2&lt;/td&gt;
&lt;td&gt;$0.015/GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backblaze B2&lt;/td&gt;
&lt;td&gt;$0.006/GB&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Free&lt;/strong&gt; (3× storage/month)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS&lt;/td&gt;
&lt;td&gt;$0.020/GB&lt;/td&gt;
&lt;td&gt;$0.12/GB&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure Blob&lt;/td&gt;
&lt;td&gt;$0.018/GB&lt;/td&gt;
&lt;td&gt;$0.087/GB&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For DuckDB queries&lt;/strong&gt;: Cloudflare R2 or Backblaze B2 eliminate egress entirely.&lt;br&gt;
&lt;strong&gt;From a cloud VM&lt;/strong&gt;: Use that cloud's storage in the same region and pay nothing.&lt;/p&gt;
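&lt;p&gt;To put rough numbers on it, here is a back-of-the-envelope helper using the headline per-GB rates from the table above (rates are hard-coded assumptions; check each provider's current pricing page):&lt;/p&gt;

```python
# Headline per-GB egress rates from the comparison table (USD; verify against
# each provider's current pricing page before relying on these numbers).
EGRESS_PER_GB = {
    "AWS S3": 0.09,
    "Cloudflare R2": 0.0,
    "Backblaze B2": 0.01,   # applies only beyond the free 3x-storage allowance
    "GCS": 0.12,
    "Azure Blob": 0.087,
}

def monthly_egress_cost(service, gigabytes):
    """Rough monthly egress bill for the given transfer volume."""
    return EGRESS_PER_GB[service] * gigabytes

# Example: querying 500 GB/month from outside the provider's network.
for service in EGRESS_PER_GB:
    print(f"{service}: ${monthly_egress_cost(service, 500):.2f}")
```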

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Want zero egress charges with DuckDB? Pick R2 or Backblaze B2—both eliminate them entirely.&lt;/p&gt;

&lt;p&gt;Running on a cloud VM? Pick that cloud's object storage, keep it in the same region, and you're fine. Setting up EC2 each time is annoying, but at least the transfer costs disappear.&lt;/p&gt;




&lt;p&gt;I build &lt;a href="https://firchy.com/products/duck/" rel="noopener noreferrer"&gt;observability tools&lt;/a&gt; with DuckDB + object storage.&lt;/p&gt;

</description>
      <category>duckdb</category>
      <category>s3</category>
      <category>parquet</category>
    </item>
  </channel>
</rss>
