Abstract
Cloud data lakes often store large volumes of logs, metrics, or event data in AWS S3. The formats vary: newline-delimited JSON (often gzipped), CSV, or more optimized columnar formats like Parquet. Traditionally, querying this data meant spinning up heavyweight tools (Athena, Spark, Presto, EMR). But what if you could query the data instantly, from your laptop, without standing up an entire cluster? Enter DuckDB.
DuckDB is an embedded analytical database (think “SQLite for analytics”) with first-class support for Parquet, JSON, CSV, and remote storage like S3. A dream come true.
In this post, we’ll explore:
- Querying gzipped JSON files directly from S3
- Comparing JSON vs Parquet vs CSV performance
- Joining local and remote datasets
- Calling HTTP APIs from DuckDB (yes, you can!)
Getting Started with DuckDB
DuckDB works in Python, R, Go, Node.js, Rust, Java, the CLI, or as a C library. If you are a fan of JetBrains DataGrip, it also includes a DuckDB connector. The DuckDB CLI (Command Line Interface) is a single, dependency-free executable. Installation is simple:
pip install duckdb
brew install duckdb
1. Querying Local FS
After starting the CLI, we can easily access CSV files stored on the local FS, querying them as if they were regular tables, using SQL operators and predicates. Very often we need to open some CSV file, analyze its structure, or extract particular rows based on a predicate. Here I'm querying data from a local CSV file with a limit:
SELECT *
FROM read_csv_auto('/tmp/oxford_5000.csv') LIMIT 100;
column0 | word | type | cefr | phon_br | phon_n_am | definition | example |
---|---|---|---|---|---|---|---|
0 | a | indefinite article | a1 | /ə/ | /ə/ | used before countable or singular nouns referring to people or things that have not already been mentioned | a man/horse/unit |
1 | abandon | verb | b2 | /əˈbændən/ | /əˈbændən/ | to leave somebody, especially somebody you are responsible for, with no intention of returning | abandon somebody, The baby had been abandoned by its mother. |
2 | ability | noun | a2 | /əˈbɪləti/ | /əˈbɪləti/ | the fact that somebody/something is able to do something | People with the disease may lose their ability to communicate. |
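To analyze the structure without pulling rows, DuckDB's DESCRIBE and SUMMARIZE statements work on the same reader function. A quick sketch, reusing the file path from above:
-- inferred column names and types
DESCRIBE SELECT * FROM read_csv_auto('/tmp/oxford_5000.csv');
-- per-column statistics: min, max, approximate distinct count, null percentage, ...
SUMMARIZE SELECT * FROM read_csv_auto('/tmp/oxford_5000.csv');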
2. Querying Gzipped JSON on AWS S3
Suppose you have logs or event data that is ingested into a data lake by a pipeline and stored in AWS S3, partitioned by date with a virtual path and gzipped as JSON. This is a very common architecture. DuckDB works perfectly with such partitions: based on the query it accesses the proper paths and also unzips the data. It auto-detects the schema, even across multiple files.
s3://my-bucket/logs/year=2025/month=09/day=16/data.json.gz
Directly querying data with DuckDB:
SELECT *
FROM read_json_auto(
    's3://my-bucket/logs/**/*.json.gz',
    format = 'newline_delimited',
    compression = 'gzip'
);
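Because the files sit under Hive-style year=/month=/day= folders, you can also enable hive_partitioning and filter on those columns, letting DuckDB skip files whose path does not match. A sketch, assuming the same bucket layout as above:
SELECT COUNT(*) AS events
FROM read_json_auto(
    's3://my-bucket/logs/year=*/month=*/day=*/*.json.gz',
    format = 'newline_delimited',
    compression = 'gzip',
    hive_partitioning = 1  -- exposes year/month/day from the path as columns
)
WHERE year = 2025 AND month = 9 AND day = 16;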
The beauty is that we do not need to bootstrap any engine, create extra AWS IAM permissions, add services, or pay for them - just use a CLI tool that runs locally on your machine.
DuckDB extensions and AWS credentials
The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read and write remote files (while regular HTTP servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, which is now common among industry storage providers.
To query S3, we first need to install and load the extension:
-- set s3 compatibility
INSTALL httpfs;
LOAD httpfs;
The httpfs filesystem is tested with AWS S3, MinIO, Google Cloud, and lakeFS. Other services that implement the S3 API (such as Cloudflare R2) should also work.
After setup, add S3 credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or via the IAM role of an AWS profile (preferred).
Adding secrets
-- set aws creds
INSTALL aws;
LOAD aws;

CREATE SECRET env_dev (
    TYPE s3,
    PROVIDER credential_chain,
    REFRESH auto
);

SET s3_region = 'us-east-1';
SET s3_access_key_id = 'ASIAXXXXXX';
SET s3_secret_access_key = '';
SET s3_session_token = '';
Another option is to use existing AWS profiles and switch between them:
CALL load_aws_credentials('dev');
CALL load_aws_credentials('prod');
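Secrets can also be scoped to a bucket prefix, so DuckDB picks the right one automatically based on the path being queried. A minimal sketch with hypothetical bucket names and regions:
CREATE SECRET dev_s3 (
    TYPE s3,
    PROVIDER credential_chain,
    REGION 'us-east-1',
    SCOPE 's3://my-dev-bucket'   -- used only for paths under this prefix
);
CREATE SECRET prod_s3 (
    TYPE s3,
    PROVIDER credential_chain,
    REGION 'eu-west-1',
    SCOPE 's3://my-prod-bucket'
);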
Check credentials:
SELECT *
FROM duckdb_secrets();
3. Parquet and CSV on S3
DuckDB shines with columnar formats like Parquet:
SELECT user_id, COUNT(*) AS events
FROM read_parquet('s3://my-bucket/logs/*.parquet')
GROUP BY user_id
ORDER BY events DESC;
For CSV:
SELECT *
FROM read_csv_auto('s3://my-bucket/data/*.csv.gz');
Parquet is much faster and smaller than JSON/CSV, but the magic is that DuckDB handles them all.
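This is easy to check from the CLI: turn on the built-in timer and run the same aggregation against both formats (paths follow the hypothetical examples above):
.timer on

SELECT user_id, COUNT(*) AS events
FROM read_parquet('s3://my-bucket/logs/*.parquet')
GROUP BY user_id;

SELECT user_id, COUNT(*) AS events
FROM read_json_auto('s3://my-bucket/logs/**/*.json.gz', format = 'newline_delimited')
GROUP BY user_id;
Parquet typically wins by a wide margin: only the user_id column is fetched and row-group statistics limit the bytes read, while gzipped JSON must be downloaded, decompressed, and parsed in full.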
4. Loading Data from S3 and Storing It Locally as Parquet (or Another Format)
-- dev
CREATE OR REPLACE VIEW events_2025_block AS
SELECT *
FROM read_json_auto(
    's3://events/block-events/block/year=2025/month=*/day=*/hour=*/*.json.gz',
    filename = true,              -- adds a column with the source file path
    hive_partitioning = 1,        -- pulls year/month/day/hour from the path into columns
    format = 'newline_delimited'  -- keep for NDJSON; omit if the files are arrays/objects
);
SELECT *
FROM events_2025_block
WHERE action = 'data-verified'
LIMIT 10;
Writing the data extracted from S3 to the local FS and automatically converting it into Parquet:
COPY events_2025_block TO '/tmp/dev.s3.events.parquet' (FORMAT parquet);
SELECT *
FROM '/tmp/dev.s3.events.parquet'
WHERE action = 'data-verified'
ORDER BY created_at DESC
LIMIT 100;
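COPY can also write a Hive-partitioned directory tree instead of a single file, so the local copy stays filterable by the same partition columns. A sketch that assumes the view above exposes year/month/day:
COPY events_2025_block
TO '/tmp/dev.s3.events'
(FORMAT parquet, PARTITION_BY (year, month, day));

-- read back only the partitions matching the filter
SELECT *
FROM read_parquet('/tmp/dev.s3.events/**/*.parquet', hive_partitioning = 1)
WHERE month = 9;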
5. Local + Remote Joins
You can join S3 data with local datasets:
WITH s3_data AS (
SELECT user_id, timestamp
FROM read_parquet('s3://my-bucket/events/*.parquet')
),
local_users AS (
SELECT * FROM read_csv_auto('users.csv')
)
SELECT u.name, COUNT(*) AS actions
FROM s3_data s
JOIN local_users u USING(user_id)
GROUP BY u.name;
This means you don’t need to ETL everything into a central DB just to query.
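And when a joined result is worth keeping, it can be materialized into a local DuckDB database file and reopened later without touching S3 again. A minimal sketch along the lines of the query above:
ATTACH 'analytics.duckdb' AS local_db;

CREATE OR REPLACE TABLE local_db.user_actions AS
SELECT u.name, COUNT(*) AS actions
FROM read_parquet('s3://my-bucket/events/*.parquet') s
JOIN read_csv_auto('users.csv') u USING (user_id)
GROUP BY u.name;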
6. Calling HTTP APIs from DuckDB
DuckDB supports reading remote HTTP URLs directly. I have JSON-based AWS news for different topics hosted on my GitHub Pages; let's call it to get the latest AWS architecture news:
SELECT *
FROM read_json_auto('https://tsypuk.github.io/aws-news/news/architecture.json')
WHERE title LIKE '%Lambda%' LIMIT 50;
title | link |
---|---|
How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale | https://aws.amazon.com/blogs/architecture/how-zapier-runs-isolated-tasks-on-aws-lambda-and-upgrades-functions-at-scale/ |
How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda | https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/ |
The same queries work with Parquet/CSV hosted over HTTPS. This is handy for blending APIs and S3 data.
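For example, a Parquet file published on any web server can be counted without downloading it first (the URL below is a placeholder):
SELECT COUNT(*) AS row_count
FROM read_parquet('https://example.com/datasets/events.parquet');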
7. Using standard DBs like MySQL and Postgres
DuckDB can also attach to regular OLTP databases through extensions and query their tables in place:
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE postgres, READ_ONLY);
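Once attached, the Postgres tables behave like any other DuckDB tables and can even be joined with S3 data in one statement. A sketch with hypothetical table and column names:
-- list what the attached database exposes
SHOW ALL TABLES;

-- join a Postgres dimension table with Parquet events on S3
SELECT c.name, COUNT(*) AS events
FROM read_parquet('s3://my-bucket/events/*.parquet') e
JOIN db.public.customers c ON e.user_id = c.id
GROUP BY c.name;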
Why DuckDB?
- Lightweight: No cluster, runs anywhere (laptop, container, Lambda).
- Format-flexible: JSON, CSV, Parquet, and more.
- Fast: Vectorized execution engine, optimized for analytics.
- Great for ad-hoc exploration of data lakes.
When to Use Athena vs DuckDB
- Use Athena when: queries must run at scale, multiple teams need SQL access, or complex analytics require more CPU/memory than a local machine can offer.
- Use DuckDB when: you need quick exploration, prototyping, data science, or embedding analytics in an app, or when working against QA/STG environments with modest data volumes.
Conclusions
DuckDB is like a Swiss Army knife for data lakes. With just a few lines of SQL, you can query gzipped JSON from S3, join
it with local CSVs, compare with Parquet, and even call APIs. It’s lightweight, fast, and great for both prototyping and
production embedding.
Try it next time you need to peek inside your S3 bucket without firing up a full big-data stack.