Gowtham Potureddi

Posted on May 26

AWS Data Engineering: Glue, EMR, Athena, Kinesis — End-to-End Guide

#python #sql #interview #dataengineering

aws data engineering is the umbrella for every pipeline pattern you ship on top of Amazon S3 — and four services do the heavy lifting in almost every modern AWS lakehouse: AWS Glue (the serverless Spark + Hive Metastore that catalogs and transforms your data), Amazon EMR (managed Spark / Hadoop / Trino clusters when you need full control of compute), Amazon Athena (serverless SQL over S3 via Presto / Trino), and Amazon Kinesis (the streaming front door that turns clickstream / IoT / CDC events into queryable S3 tables). Whether you're prepping for an AWS-track data-engineering loop or building a real production pipeline, the same four services keep showing up — and the senior signal is knowing exactly which one to reach for, when, and why.

This guide walks the end-to-end AWS data engineering stack the way you'd actually design it: an overview of the AWS lakehouse pattern (S3 raw → clean → curated, with Glue Catalog as the metastore), the AWS Glue architecture (Crawlers, Data Catalog, Jobs, bookmarks, DPUs, Iceberg), Amazon EMR (Master / Core / Task nodes, EMRFS, YARN, Spot economics, EMR Serverless and EMR on EKS), the Athena + Kinesis streaming pattern (Data Streams → Firehose → S3 → Glue Catalog → Athena, with partition + columnar + projection pushdown to cut scan cost), and the wider AWS data engineering ecosystem (Lake Formation, Redshift, MSK, DMS, Step Functions, MWAA). Every service section ships a runnable worked example, a step-by-step trace, and a concept-by-concept why this works breakdown.

When you want hands-on reps immediately after reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full real-time analytics library →.

On this page

Why AWS data engineering — the lakehouse pattern in one diagram
AWS Glue — Crawlers, Data Catalog, Jobs, bookmarks, Iceberg
Amazon EMR — Master, Core, Task, EMRFS, YARN, Serverless
Athena + Kinesis — the streaming lakehouse pattern
The wider AWS data engineering ecosystem
Choosing the right AWS service (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why AWS data engineering — the lakehouse pattern in one diagram

The AWS lakehouse pattern — S3 is the lake, Glue Catalog is the metastore, everything else is compute

The one-sentence invariant of aws data engineering: S3 is the durable lake; the Glue Data Catalog is the shared metastore; AWS Glue / EMR / Athena / Redshift / Lambda are the interchangeable compute engines that read and write the same tables. Once you internalise that picture, every AWS pipeline question collapses to "which engine for which workload?"

The four-zone S3 layout that every team converges on.

s3://lake/raw/ — exactly as ingested (JSON / CSV / Avro); never mutated; partitioned by ingest date.
s3://lake/clean/ — typed, deduplicated, validated; usually Parquet; partitioned by event date.
s3://lake/curated/ — business-modelled tables (facts + dims); Parquet or Iceberg; partitioned by the dimensions you query on.
s3://lake/serve/ — pre-aggregated marts for dashboards / Athena / Redshift Spectrum.

The shared metastore — Glue Data Catalog.

Hive-compatible metastore managed by AWS — same APIs that Hive / Spark / Trino expect.
Used by every AWS query engine — Athena, EMR (Spark / Hive / Trino / Presto), Redshift Spectrum, Glue Jobs.
Holds tables, partitions, columns, classifiers, connections; lives in the same AWS account as your S3 buckets.
The single source of truth for "what tables exist and where do they live on S3?"

The four compute engines you'll mix and match.

AWS Glue — serverless PySpark / Scala-Spark; pay per DPU-hour; first choice for "I just want to run a Spark job."
Amazon EMR — full Hadoop / Spark / Hive / Presto clusters (or EMR Serverless / EMR on EKS); first choice when you need custom packages, tighter cost control, or longer-running steady-state jobs.
Amazon Athena — serverless SQL over S3; pay per TB scanned; first choice for ad-hoc analytics and BI.
Amazon Redshift — managed columnar warehouse; first choice for sub-second BI dashboards and high-concurrency SQL.

Streaming front-door.

Amazon Kinesis Data Streams — managed real-time stream (shards, ordering per shard, 24h–365d retention).
Amazon Kinesis Data Firehose — managed S3 / Redshift / OpenSearch landing service; handles batching, compression, and Parquet conversion.
Amazon MSK — managed Apache Kafka; the alternative when you want full Kafka compatibility.

What interviewers listen for.

Do you separate storage (S3) from compute (Glue / EMR / Athena / Redshift)? — basic-but-tested fluency signal.
Do you mention the Glue Data Catalog as the metastore that every engine shares? — senior signal.
Do you reach for Athena for ad-hoc and EMR / Glue for ETL? — interview-canonical answer.
Do you partition by event date in S3 and explain how it cuts Athena scan cost? — strong cost-fluency signal.

aws data engineering tools — the canonical short list.

Storage — S3, RDS, DynamoDB, Aurora.
Streaming + Ingest — Kinesis Data Streams, Kinesis Firehose, MSK, DMS, AppFlow.
Compute + Transform — Glue, EMR, Athena, Redshift, Lambda, EMR Serverless.
Governance + Orchestration — Lake Formation, IAM, KMS, Step Functions, MWAA (Managed Airflow).

Worked example — a typical AWS lakehouse pipeline at a glance

Detailed explanation. A realistic AWS pipeline almost always combines all four services: Kinesis ingests events, Firehose lands them as Parquet on S3, a Glue Crawler updates the Data Catalog, an EMR or Glue Job transforms raw → clean → curated, and Athena queries the curated tables for analytics. Knowing how they hand off is the senior-fluency interview answer.

Question. Clickstream events arrive at ~50k events / sec; we need ad-hoc SQL within 5 minutes of an event landing, and a daily curated revenue mart in S3 for BI.

Pipeline shape (AWS-native).

Ingest     : producers → Kinesis Data Streams (4 shards)
Land       : Kinesis Firehose → S3 (raw, Parquet + Snappy, 5-min buffer, /raw/dt=YYYY-MM-DD/hr=HH/)
Catalog    : Glue Crawler → Glue Data Catalog (table: events_raw, partitioned by dt, hr)
Transform  : Glue Job (PySpark, Glue 4.0)
             reads events_raw → joins dim_customer → writes
             s3://lake/curated/revenue/dt=YYYY-MM-DD/ (Iceberg)
Query      : Athena (Presto) over Glue Catalog
Serve      : Redshift Spectrum / QuickSight on the curated table
Orchestrate: Step Functions (or MWAA / Airflow) triggers the daily Glue Job

Step-by-step explanation.

Kinesis Data Streams holds ~50k events/sec across 4 shards with 24h retention; producers (mobile SDKs, web beacons, CDC) write JSON records keyed by user_id so per-user ordering is preserved.
Kinesis Firehose subscribes to the stream, batches records into 128 MB / 5 minute windows, converts JSON to Parquet via the Glue Catalog schema, compresses with Snappy, and lands files under /raw/dt=YYYY-MM-DD/hr=HH/.
Glue Crawler runs hourly, discovers new partitions, and updates events_raw in the Glue Data Catalog so Athena and downstream Glue Jobs see them immediately.
Glue Job (PySpark) reads events_raw with a partition predicate (dt = current_date), joins the customer dim, writes the curated revenue table as Iceberg under /curated/revenue/dt=…/, and commits a Catalog table version.
Athena (or Redshift Spectrum) queries curated.revenue over the Glue Catalog with the same SQL the BI team already uses.
Step Functions orchestrates the daily Glue Job, retries on failure, and notifies on SLA breaches via SNS.

Rule of thumb: Modern AWS pipelines push streaming + landing to Kinesis + Firehose, catalog + ETL to Glue, ad-hoc + BI to Athena / Redshift, and heavy custom Spark to EMR — and they all share the same S3 + Glue Catalog substrate.

Python
Topic — etl
ETL Python drills

Practice →

Python
Topic — real-time-analytics
Real-time analytics patterns

Practice →

2. AWS Glue — Crawlers, Data Catalog, Jobs, bookmarks, Iceberg

`aws glue` — Crawler discovers, Data Catalog remembers, Glue Job transforms

aws glue is the most asked AWS data engineering service in 2026 interviews — and the one most teams use first. Senior answer in one sentence: AWS Glue is a serverless Apache Spark platform plus a Hive-compatible metastore (the Glue Data Catalog), with Crawlers that auto-discover schemas in S3, Jobs that transform data on a pay-per-DPU-hour model, and a Workflow / Trigger layer that orchestrates the whole thing.

The six Glue primitives every loop tests.

Crawler — discovers data in S3 / JDBC sources, infers schema + partitions, writes a Data Catalog table.
Data Catalog — Hive-compatible metastore (tables, partitions, columns, classifiers, connections) shared by Athena, EMR, Redshift Spectrum, and Glue Jobs.
Glue Job — serverless PySpark / Scala-Spark / Python shell job; you author with GlueContext + DynamicFrame (or plain Spark DataFrame).
Job bookmarks — built-in state that tracks which files / records a job has already processed, so daily jobs only read new data.
DPU (Data Processing Unit) — 4 vCPU + 16 GB RAM; you size a Glue Job by --number-of-workers and worker type (G.1X / G.2X); billed per second with a 1-minute minimum.
Workflow / Trigger — Glue's native orchestrator: Crawler → Job → Crawler chains with event / schedule triggers.

Glue 4.0 + Iceberg — the modern table format.

Glue 4.0 ships Spark 3.3, native Iceberg / Hudi / Delta Lake connectors, and Python 3.10.
Iceberg on Glue — atomic MERGE / UPSERT, schema evolution, hidden partitioning, time travel; the modern replacement for raw Parquet.
spark.sql.catalog.glue_catalog — point Spark at the Glue Catalog as your Iceberg catalog.

Glue Crawler quirks worth knowing.

Schedule vs on-demand — schedule hourly / daily; or trigger via S3 event → Lambda → Crawler.
Classifiers — built-in classifiers for JSON / CSV / Parquet / Avro / XML; custom classifiers (Grok / regex) for odd formats.
Partition discovery — directory layout like dt=2026-05-24/hr=14/ is auto-recognised as partitions.
Schema drift — by default a Crawler will add new columns to an existing table; it can also break schemas, so review changes in CI before promoting.

Glue Job sizing — what interviewers ask.

G.1X — 1 DPU per worker (4 vCPU, 16 GB); standard for most ETL.
G.2X — 2 DPU per worker (8 vCPU, 32 GB); use for heavy joins / shuffles.
G.4X / G.8X — for very heavy workloads or large in-memory caches.
Autoscaling (Glue 3.0+) — --enable-auto-scaling true lets Glue add / remove workers based on Spark stage demand.
Bookmarks — --job-bookmark-option job-bookmark-enable; tracks the last-processed file or partition.

Partition pushdown — the cost lever.

push_down_predicate — pass a partition filter (e.g. "dt = '2026-05-24'") to create_dynamic_frame.from_catalog; Glue only lists / reads matching partitions.
Why it matters — without pushdown, Glue would list every partition in the table, ballooning startup time and S3 LIST cost.

Worked example — Glue Job that compacts raw events into a curated Iceberg table

Detailed explanation. A realistic Glue Job reads the raw events table for the current day with a partition predicate, deduplicates by event_id, joins a customer dim from another Glue Catalog table, and MERGEs into an Iceberg curated table. This is the bread-and-butter Glue pattern.

Question. Daily JSON event files land in s3://lake/raw/events/dt=YYYY-MM-DD/. Produce a curated Iceberg table analytics.events_clean partitioned by dt, with one row per unique event_id, joined with analytics.dim_customer.

Input. Glue Catalog tables — analytics.events_raw (partitioned by dt) and analytics.dim_customer (small, ~1 GB).

Code.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

args = getResolvedOptions(sys.argv, ["JOB_NAME", "run_date"])
sc = SparkContext()
gc = GlueContext(sc)
spark = gc.spark_session
job = Job(gc)
job.init(args["JOB_NAME"], args)

run_date = args["run_date"]

events = gc.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="events_raw",
    push_down_predicate=f"dt = '{run_date}'",
).toDF()

customers = gc.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="dim_customer",
).toDF()

w = Window.partitionBy("event_id").orderBy(F.col("ingest_ts").desc())
deduped = (
    events
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

joined = deduped.join(customers, "customer_id", "left")

(
    joined.writeTo("glue_catalog.analytics.events_clean")
    .using("iceberg")
    .tableProperty("format-version", "2")
    .partitionedBy("dt")
    .createOrReplace()
)

job.commit()

Step-by-step explanation.

getResolvedOptions reads job arguments — Glue passes --run_date 2026-05-24 from the trigger.
GlueContext.create_dynamic_frame.from_catalog with push_down_predicate reads only today's partition from S3 (single-partition listing, not full-table scan).
The window + row_number keeps the latest event per event_id — handles late-arriving duplicates without a self-join.
The customer dim is broadcast automatically by Spark AQE because it's small; the join is a single map-side join.
writeTo(...).using("iceberg") does an atomic MERGE into the Iceberg curated table via the Glue Catalog; readers either see the new snapshot or the old one — never a half-written file.
job.commit() advances the Glue Job bookmark so the next run skips today's files.

Output (Glue Catalog).

db	table	format	partitions	latest_snapshot
analytics	events_clean	iceberg	dt=2026-05-24 (+30 prior)	snap-1234

Rule of thumb: Use push_down_predicate whenever you read from a partitioned Glue Catalog table; use Iceberg whenever you need atomic writes, MERGE, or schema evolution.

Glue interview question on running a daily MERGE upsert

You're asked: "We have a Glue Catalog table analytics.events_clean (Iceberg). Each day, late-arriving events for the last 3 days arrive in events_raw. Write a MERGE that updates existing rows by event_id and inserts new ones."

Solution Using Iceberg MERGE INTO via Spark SQL on Glue

Code.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

gc = GlueContext(SparkContext())
spark = gc.spark_session

spark.sql("""
  MERGE INTO glue_catalog.analytics.events_clean AS t
  USING (
    SELECT event_id, customer_id, event_ts, amount, dt
    FROM glue_catalog.analytics.events_raw
    WHERE dt >= date_sub(current_date(), 3)
  ) AS s
  ON  t.event_id = s.event_id
  WHEN MATCHED THEN UPDATE SET
      t.customer_id = s.customer_id,
      t.event_ts    = s.event_ts,
      t.amount      = s.amount,
      t.dt          = s.dt
  WHEN NOT MATCHED THEN INSERT (event_id, customer_id, event_ts, amount, dt)
                       VALUES (s.event_id, s.customer_id, s.event_ts, s.amount, s.dt)
""")

Step-by-step trace.

input row (events_raw)	matched in events_clean?	action
event_id=A, amount=10	yes (amount=8)	UPDATE amount to 10
event_id=B, amount=5	no	INSERT
event_id=C, amount=3	yes (amount=3)	UPDATE (no-op for amount)
event_id=D, amount=2	no	INSERT

Spark reads only dt >= today-3 from events_raw thanks to the WHERE → partition pruning.
The MERGE engine builds a key-aligned join between source and target on event_id.
MATCHED rows (A, C) take the UPDATE branch; NOT-MATCHED rows (B, D) take the INSERT branch.
Iceberg writes a new snapshot containing the updated + inserted rows; old data files are kept until the next expire_snapshots maintenance run.

Output:

event_id	amount	dt
A	10	2026-05-24
B	5	2026-05-24
C	3	2026-05-23
D	2	2026-05-24

Why this works — concept by concept:

Iceberg MERGE INTO — gives you ACID upserts on S3 with no rewrite-the-whole-partition penalty; matched rows update in place, unmatched rows insert.
partition pruning via WHERE on dt — Spark + Iceberg push the date filter down to S3 file listing, so we only touch the last 3 days of files, not the full table.
atomic snapshot — readers (Athena, EMR, Redshift Spectrum) either see the pre-MERGE snapshot or the post-MERGE snapshot, never a half-applied state.
Glue Catalog as Iceberg catalog — glue_catalog.analytics.events_clean resolves through the Glue Data Catalog, so the same table is queryable from every AWS engine.
Cost — O(changed partitions) S3 reads + O(updated + inserted rows) writes; no full-table rewrite.

Python
Topic — etl
AWS Glue / ETL drills

Practice →

Python
Topic — data-manipulation
DataFrame transformation patterns

Practice →

3. Amazon EMR — Master, Core, Task, EMRFS, YARN, Serverless

`amazon emr` — managed Spark / Hadoop with Master, Core, and Task nodes over S3

amazon emr is the second pillar of AWS data engineering: managed Hadoop / Spark / Hive / Presto / Trino clusters that give you the same on-prem cluster shape, but elastic and on AWS. Senior answer: EMR runs YARN on a cluster of Master + Core + Task nodes; EMRFS lets the cluster read and write S3 as if it were HDFS; you scale by mixing On-Demand Core nodes for stability with Spot Task nodes for cost.

The six EMR primitives every loop tests.

Master node — runs the YARN ResourceManager, HDFS NameNode (if HDFS is used), and the application drivers (Spark driver in client mode, etc.). One per cluster (or HA setup with three).
Core node — runs YARN NodeManager + HDFS DataNode + Spark executors. Holds HDFS data; don't put Core on Spot.
Task node — runs YARN NodeManager + Spark executors only (no HDFS). Perfect for Spot because losing one only costs you in-flight tasks.
EMRFS — an HDFS-compatible client over S3; lets every framework read S3 paths (s3://...) as if they were HDFS paths.
YARN — cluster resource manager (ResourceManager on Master, NodeManagers on Core / Task) that schedules every container.
Instance fleets — let EMR pick the best available instance types per role (Core / Task) across multiple ITs and AZs.

EMR Serverless vs EMR on EC2 vs EMR on EKS.

EMR on EC2 (classic) — you own the cluster; full control, longest-running steady-state jobs.
EMR Serverless — pay-per-second Spark / Hive without cluster management; the AWS-managed answer to "I just want to run a Spark job."
EMR on EKS — run EMR Spark inside an existing Kubernetes cluster; great when you already standardise on EKS.

EMRFS — what makes EMR an S3-native engine.

s3://bucket/path paths just work in every EMR framework (Spark, Hive, Trino).
Consistent View is no longer needed (S3 has been strong-read-after-write since 2020).
S3 Optimized Committer — replaces the slow FileOutputCommitter; uses S3 multipart uploads + manifests to avoid rename storms.
Encryption — KMS-managed encryption at rest; client-side encryption optional.

Spot economics — the EMR cost lever.

Core = On-Demand — losing a Core node means losing HDFS replicas and triggering replication; not worth the risk.
Task = Spot — losing a Task node only kills in-flight executor containers; YARN reschedules them on healthy Task nodes.
Common pattern — 3 Core On-Demand + N Task Spot, sized 4–10× the Core count.
Capacity-Optimised allocation strategy — Spot picks the deepest Spot pools to minimise interruption.

emr vs glue — the canonical comparison.

Glue — serverless, pay per DPU-hour, no cluster, great for ETL jobs under ~2 hours.
EMR on EC2 — full cluster, ideal for long-running steady-state or non-Spark engines (Hive, Trino, Presto, HBase).
EMR Serverless — Glue-style billing on the EMR runtime; choose when you want EMR's Spark version flexibility without cluster management.
The rule — Glue first; EMR Serverless when you need EMR features; EMR on EC2 when you need long-running clusters or non-Spark engines.

Step-up moments.

HA Master — 3 Master nodes with ZooKeeper-based failover; required for production multi-day clusters.
Auto-termination — --auto-termination-policy IdleTimeout=3600; kills idle dev clusters and saves money.
Bootstrap actions — shell scripts run on every node at launch to install custom packages.

Worked example — EMR Spark job that scans S3 with partition pushdown and writes Parquet

Detailed explanation. A realistic EMR Spark job reads a partitioned S3 dataset for one day, joins it with a smaller dim, and writes Parquet back to a curated S3 prefix. The key tricks are partition pushdown (only list the partitions you need), broadcast joins (small dim shipped to every executor), and the S3 Optimized Committer (no rename storms).

Question. Daily clickstream Parquet files live in s3://lake/clean/clicks/dt=YYYY-MM-DD/. Produce a curated revenue file under s3://lake/curated/revenue/dt=YYYY-MM-DD/ joined with a 200 MB product dim.

Input. clicks (Parquet, 50 GB / day, partitioned by dt); dim_product (Parquet, 200 MB).

Code.

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("daily_revenue")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "32")
    .getOrCreate()
)

run_date = "2026-05-24"

clicks = (
    spark.read.parquet("s3://lake/clean/clicks/")
    .filter(F.col("dt") == run_date)
)

dim = spark.read.parquet("s3://lake/dim/product/")

revenue = (
    clicks
    .join(F.broadcast(dim), "product_id", "left")
    .groupBy("dt", "country", "product_id")
    .agg(F.sum(F.col("price") * F.col("qty")).alias("revenue_usd"))
)

(
    revenue.repartition("country")
    .write.mode("overwrite")
    .partitionBy("dt", "country")
    .parquet("s3://lake/curated/revenue/")
)

Step-by-step explanation.

The Spark session enables Parquet filter pushdown so column- and predicate-level filters reach the S3 read path.
filter(dt == run_date) is converted to a partition predicate by Spark's Catalyst optimizer, so EMR only lists s3://lake/clean/clicks/dt=2026-05-24/.
F.broadcast(dim) ships the 200 MB product dim to every executor — no shuffle for the join.
The groupBy + sum runs as a single shuffle keyed by (dt, country, product_id).
repartition("country") aligns the write so each output country= directory contains exactly one file per executor.
The S3 Optimized Committer commits Parquet files atomically via multipart upload + manifest; no per-file rename round-trip to S3.

Output (S3 listing).

path	files	size
s3://lake/curated/revenue/dt=2026-05-24/country=US/	1	18 MB
s3://lake/curated/revenue/dt=2026-05-24/country=GB/	1	4 MB
s3://lake/curated/revenue/dt=2026-05-24/country=…	…	…

Rule of thumb: On EMR, always combine partition pushdown (only list the partitions you need) + broadcast joins for small dims + the S3 Optimized Committer for fast, atomic writes.

EMR interview question on right-sizing a cluster

You're asked: "Design an EMR cluster for a 4-hour nightly Spark job that processes 2 TB of Parquet from S3, joins a 50 GB dim, and writes 500 GB back to S3. Optimise for cost, then reliability."

Solution Using On-Demand Core + Spot Task instance fleets

Code (cluster spec, AWS CLI).

aws emr create-cluster \
  --name nightly-revenue \
  --release-label emr-6.15.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-fleets \
    '[
      {
        "Name":"master","InstanceFleetType":"MASTER","TargetOnDemandCapacity":1,
        "InstanceTypeConfigs":[{"InstanceType":"m6g.xlarge"}]
      },
      {
        "Name":"core","InstanceFleetType":"CORE","TargetOnDemandCapacity":3,
        "InstanceTypeConfigs":[{"InstanceType":"r6g.2xlarge"}]
      },
      {
        "Name":"task","InstanceFleetType":"TASK","TargetSpotCapacity":12,
        "LaunchSpecifications":{"SpotSpecification":{"AllocationStrategy":"capacity-optimized","TimeoutDurationMinutes":10,"TimeoutAction":"SWITCH_TO_ON_DEMAND"}},
        "InstanceTypeConfigs":[
          {"InstanceType":"r6g.2xlarge","WeightedCapacity":1},
          {"InstanceType":"r5.2xlarge","WeightedCapacity":1},
          {"InstanceType":"r5d.2xlarge","WeightedCapacity":1}
        ]
      }
    ]' \
  --auto-termination-policy IdleTimeout=900 \
  --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole

Step-by-step trace.

component	choice	reason
Master	1 × `m6g.xlarge` On-Demand	cheap, single-AZ, fine for 4-hour job
Core	3 × `r6g.2xlarge` On-Demand	holds HDFS + scratch, must be stable
Task	12 × Spot, capacity-optimised, multi-IT	cheap compute that scales the join + aggregation
Idle policy	15 min	cluster auto-kills after job finishes

The job is large enough to need a cluster (vs Glue) and short enough that On-Demand Master + Core is a small share of total cost.
The 2 TB input is comfortably partitioned across the 12 Task Spot executors (~170 GB / executor).
Spot picks the deepest pools at launch via capacity-optimised allocation, minimising interruption risk.
If Spot capacity disappears mid-job, EMR switches new Task launches to On-Demand after a 10-minute timeout.
The cluster auto-terminates 15 minutes after the Spark step ends, so we don't pay for idle.

Output (rough cost shape).

line item	rate (us-east-1, illustrative)	hours	cost
1 × Master `m6g.xlarge` On-Demand	$0.16/h	4	$0.64
3 × Core `r6g.2xlarge` On-Demand	$0.50/h	4	$6.00
12 × Task Spot (mixed `r6g/r5/r5d`.2xlarge ~70% off)	$0.15/h avg	4	$7.20
EMR uplift	per-instance	4	~$1.50
Total			~$15

Why this works — concept by concept:

Master / Core / Task split — gives you a stable HDFS + control plane (Core On-Demand) while still extracting Spot savings on the elastic compute layer (Task Spot).
instance fleets — EMR picks the best available instance type per role across multiple instance types and AZs, so a single Spot pool exhaustion doesn't stall the job.
capacity-optimized allocation — Spot picks the pool with the most spare capacity, statistically the least likely to be reclaimed during the job.
auto-termination — kills the cluster after 15 idle minutes; the cluster does not keep running between nightly runs.
Cost — O(executors × hours) with most executors at Spot prices; roughly 60–70% cheaper than the same cluster on full On-Demand.

Python
Topic — etl
EMR / Spark cluster drills

Practice →

Python
Topic — data-analysis
Distributed compute patterns

Practice →

4. Athena + Kinesis — the streaming lakehouse pattern

`kinesis to athena` — stream once, land in S3, catalog with Glue, query with Athena

The kinesis → S3 → Glue Catalog → Athena pattern is the most common streaming-to-SQL pipeline on AWS and one of the most-asked AWS data engineering interview prompts. Senior answer: Kinesis Data Streams shards the event log for ordering and replay, Firehose batches and converts to Parquet on S3, Glue catalogs the partitions, and Athena queries it all with pay-per-TB-scanned SQL — and the cost story is partitioning + columnar formats + projection pushdown.

The five primitives every loop tests.

Kinesis Data Stream — sharded ordered log; one shard handles ~1 MB/s in or ~2 MB/s out and 1000 records/s in; partitioning key controls shard routing; per-shard ordering only.
Kinesis Firehose — managed S3 / Redshift / OpenSearch sink; buffers (60s–900s or 1–128 MB), supports Lambda transform, native Parquet / ORC conversion via the Glue Catalog schema.
Athena Workgroups — query isolation + per-workgroup S3 result location + per-query and per-workgroup data-scanned limits.
Partition projection — Athena calculates partition values from the query predicate without listing S3; eliminates the need to run a Crawler before every read.
Columnar pushdown — Parquet / ORC let Athena read only the columns the query references; combined with partitioning this is the dominant cost lever.

The four cost levers (in order).

Partition the table — at minimum by dt; ideally by dt + a high-cardinality second column you filter on.
Use Parquet (or ORC) — columnar layout means Athena reads only the columns in the SELECT / WHERE.
Use partition projection — instead of Glue partitions, declare a projected schema (dt from 2020-01-01 to NOW); no Crawler, no MSCK REPAIR TABLE, instant cold-start.
Set per-query data limits in the Workgroup — fails runaway queries before they cost $$$.

Kinesis Data Streams quirks worth knowing.

Shard math — sized in shards; each shard = 1 MB/s in or 1000 records/s in, 2 MB/s out, 5 reads/s per consumer.
Enhanced fan-out — gives each consumer its own 2 MB/s/shard read pipe (bypasses the shared 2 MB/s limit).
Ordering — guaranteed per shard, not per stream; use a partition key (e.g. user_id) to keep one entity's events on one shard.
Retention — default 24 h; can be extended to 365 days for replay.
Re-sharding — split / merge shards online; downstream consumers handle child shard discovery via the KCL.

Kinesis Firehose quirks worth knowing.

Buffering — size (1–128 MB) or time (60–900 s), whichever comes first.
Lambda transform — invoked on every batch; returns transformed records back to Firehose.
Parquet conversion — Firehose reads the schema from a Glue Catalog table and converts JSON → Parquet on the fly.
Dynamic partitioning — Firehose can extract partition values from each record (e.g. dt from event_ts) and route into nested S3 prefixes.

Athena quirks worth knowing.

MSCK REPAIR TABLE — old way to add new partitions; replaced by Glue Crawlers or partition projection.
Result reuse — Athena Workgroups can cache identical SQL results for up to 7 days; instant + free re-queries.
Federated queries — Athena can read JDBC / DynamoDB / DocumentDB via Lambda connectors; lakehouse + federation in one engine.
CTAS + INSERT INTO — CREATE TABLE AS SELECT and INSERT INTO let Athena itself produce Parquet outputs.

Worked example — Kinesis → Firehose → S3 (Parquet) → Athena query

Detailed explanation. A realistic streaming-to-SQL pipeline configures Firehose with dynamic partitioning + Glue Catalog-driven Parquet conversion, declares the Athena table with partition projection, and runs a partitioned + columnar SELECT. The two-line cost story is "only touch one day, only read three columns."

Question. Page-view events arrive on clickstream Kinesis stream at ~10 MB/s; we want SQL answers within 5 minutes and we want each Athena query to scan less than 1 GB.

Input. Firehose to-s3-clicks consuming clickstream; S3 s3://lake/raw/clicks/; Glue Catalog web.clicks_raw.

Code (Firehose config — Terraform snippet).

resource "aws_kinesis_firehose_delivery_stream" "to_s3_clicks" {
  name        = "to-s3-clicks"
  destination = "extended_s3"

  kinesis_source_configuration {
    kinesis_stream_arn = aws_kinesis_stream.clickstream.arn
    role_arn           = aws_iam_role.firehose.arn
  }

  extended_s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.lake.arn
    prefix             = "raw/clicks/dt=!{partitionKeyFromQuery:dt}/hr=!{partitionKeyFromQuery:hr}/"
    error_output_prefix = "raw/clicks_errors/"
    buffering_size     = 64
    buffering_interval = 300
    compression_format = "UNCOMPRESSED"   # Parquet does its own compression

    data_format_conversion_configuration {
      enabled = true
      input_format_configuration  { deserializer { open_x_json_ser_de {} } }
      output_format_configuration { serializer  { parquet_ser_de {} } }
      schema_configuration {
        database_name = "web"
        table_name    = "clicks_raw"
        role_arn      = aws_iam_role.firehose.arn
      }
    }

    dynamic_partitioning_configuration { enabled = true }

    processing_configuration {
      enabled = true
      processors {
        type = "MetadataExtraction"
        parameters {
          parameter_name  = "MetadataExtractionQuery"
          parameter_value = "{dt: .event_ts | sub(\"T.*\"; \"\"), hr: .event_ts | sub(\".*T\"; \"\") | sub(\":.*\"; \"\")}"
        }
        parameters { parameter_name = "JsonParsingEngine" parameter_value = "JQ-1.6" }
      }
    }
  }
}

Athena DDL with partition projection.

CREATE EXTERNAL TABLE web.clicks_raw (
  user_id   string,
  url       string,
  device    string,
  event_ts  timestamp,
  country   string
)
PARTITIONED BY (dt string, hr string)
STORED AS PARQUET
LOCATION 's3://lake/raw/clicks/'
TBLPROPERTIES (
  'projection.enabled'         = 'true',
  'projection.dt.type'         = 'date',
  'projection.dt.range'        = '2024-01-01,NOW',
  'projection.dt.format'       = 'yyyy-MM-dd',
  'projection.hr.type'         = 'integer',
  'projection.hr.range'        = '0,23',
  'projection.hr.digits'       = '2',
  'storage.location.template'  = 's3://lake/raw/clicks/dt=${dt}/hr=${hr}/'
);

Athena query.

SELECT country, count(*) AS views
FROM   web.clicks_raw
WHERE  dt = '2026-05-24'
  AND  hr BETWEEN '09' AND '11'
GROUP  BY country
ORDER  BY views DESC;

Step-by-step explanation.

Producers write JSON events to the Kinesis clickstream stream, keyed by user_id so per-user ordering is preserved.
Firehose batches records into 64 MB / 5 min windows, calls the metadata-extraction step to compute dt and hr from event_ts, then converts JSON → Parquet using the Glue Catalog schema.
Files land at s3://lake/raw/clicks/dt=2026-05-24/hr=09/…parquet via dynamic partitioning.
Athena resolves the WHERE dt = '...' AND hr BETWEEN '09' AND '11' predicate against the projected partition values — no Glue partition listing, no Crawler run, no MSCK REPAIR TABLE.
Athena uses Parquet's column projection so only country is scanned (1 column out of 5).
Total scan ≈ (1 day × 3 hours × 64 MB × N batches) of just the country column — well under the 1 GB target.

Output.

country	views
US	412,318
GB	88,701
IN	76,544
…	…

Rule of thumb: Partition projection + Parquet + a tight WHERE dt is the three-line recipe for cheap Athena queries; Firehose's Glue-Catalog Parquet conversion makes the storage side automatic.

Athena + Kinesis interview question on cutting query cost 100×

You're asked: "An analyst runs SELECT user_id, url, country FROM web.clicks_raw WHERE country = 'US' AND event_ts >= date '2026-05-24' and the Athena bill is way too high. Diagnose and fix."

Solution Using partition + columnar + projection pushdown

Code (the fixed table + the fixed query).

-- Step 1: ensure the underlying data is Parquet + partitioned (already true from Firehose).

-- Step 2: re-issue the query against the partition column, not event_ts.
SELECT user_id, url, country
FROM   web.clicks_raw
WHERE  dt = '2026-05-24'       -- partition pushdown
  AND  country = 'US';         -- column filter, applied after partition pushdown

Step-by-step trace.

layer	before fix	after fix
partition pruning	none (filter was on `event_ts`, not on `dt`)	only `dt=2026-05-24/` scanned
columnar pushdown	weak (SELECT *)	reads only `user_id`, `url`, `country` (3 of 5 columns)
data scanned	30 days × full row × full table	1 day × 3 columns
Athena bill	$X	~$X / 100

The original WHERE event_ts >= date '2026-05-24' did not prune partitions because the partition columns are dt + hr, not event_ts.
Rewriting WHERE to use the partition column dt lets Athena's planner restrict S3 listing to s3://lake/raw/clicks/dt=2026-05-24/....
Parquet column projection then reads only the 3 columns the SELECT references.
Net effect: data scanned drops by ~30× (partitions) × ~5/3× (columns) ≈ ~100× cheaper.

Output.

query	bytes scanned	$/run (est.)
before	1.2 TB	~$6.00
after	12 GB	~$0.06

Why this works — concept by concept:

partition pushdown — Athena resolves dt = '...' against the table's partition keys before listing S3, so only matching partition prefixes are scanned.
columnar pushdown — Parquet stores data column-by-column, so SELECTing 3 of 5 columns reads roughly 3/5 of the data.
projection vs Crawler — partition projection computes partition values from the predicate, eliminating Glue Crawler runs and MSCK REPAIR TABLE chores.
Workgroup data-scan limits — set per-query and per-workgroup BytesScannedCutoffPerQuery so runaway SELECT * queries fail fast instead of costing money.
Cost — O(partitions touched × columns selected); not O(table size).

Python
Topic — streaming
Kinesis + streaming drills

Practice →

Python
Topic — real-time-analytics
Athena + real-time analytics

Practice →

5. The wider AWS data engineering ecosystem

The four-layer AWS lakehouse stack — storage, streaming, compute, governance

Outside the four headline services, the AWS data engineering ecosystem stacks neatly into four layers. Senior interviews probe whether you can name the right service per layer without reaching for a Google search.

Layer 1 — Storage.

Amazon S3 — the universal object store; the data lake; the source of truth for every analytical workload.
Amazon RDS / Aurora — managed relational (Postgres, MySQL, Aurora Serverless v2); great for OLTP and small dims.
Amazon DynamoDB — managed key-value / document store; great for state, sessions, real-time lookups.
Amazon S3 + Iceberg / Delta / Hudi — modern open-table formats give you ACID + schema evolution on top of S3.

Layer 2 — Streaming + Ingest.

Kinesis Data Streams — sharded ordered log; the AWS-native event bus.
Kinesis Data Firehose — managed S3 / Redshift / OpenSearch landing.
Amazon MSK — managed Apache Kafka; choose when you want full Kafka compatibility.
AWS DMS — Database Migration Service; CDC from RDS / on-prem databases into S3, Kinesis, MSK.
AWS AppFlow — managed connectors from SaaS apps (Salesforce, Slack, etc.) into S3 / Redshift.

Layer 3 — Compute + Transform.

AWS Glue — serverless PySpark / Scala-Spark + Glue Data Catalog.
Amazon EMR — managed Spark / Hadoop / Hive / Trino clusters (EC2, Serverless, on EKS).
AWS Lambda — event-driven tiny transforms (Firehose transform, S3 event → tiny job, etc.).
Amazon Athena — serverless SQL over S3 via Trino / Presto.
Amazon Redshift — managed columnar warehouse; Redshift Spectrum reads S3 + Glue Catalog directly.

Layer 4 — Governance + Orchestration.

AWS Lake Formation — fine-grained, table / column / row-level permissions on top of the Glue Data Catalog.
AWS IAM — coarse-grained AWS-wide access control.
AWS KMS — managed encryption keys for S3, RDS, Kinesis, etc.
AWS Step Functions — visual state-machine orchestration for short-running workflows.
Amazon MWAA (Managed Apache Airflow) — for the team that's standardised on Airflow DAGs.

aws data engineering stack — the typical "build a lakehouse" picks.

Land — Kinesis + Firehose for streams; DMS for CDC; AppFlow for SaaS.
Store — S3 with Iceberg.
Catalog — Glue Data Catalog (governed by Lake Formation).
Transform — Glue Jobs first; EMR or EMR Serverless when you outgrow Glue.
Query — Athena for ad-hoc; Redshift for BI dashboards; Redshift Spectrum for federated SQL across both.
Orchestrate — Step Functions for state-machine flows; MWAA when you want full Airflow.

Open-source on AWS — the modern overlay.

Apache Iceberg / Delta Lake / Apache Hudi — open table formats; first-class supported in Glue 4.0 and EMR 6.x+.
Apache Airflow → MWAA — the dominant orchestrator.
dbt — runs natively on Athena, Redshift, Spark, EMR Serverless via dbt-spark.
Apache Flink → Amazon Managed Service for Apache Flink — for stateful streaming.

Things to mention in an interview to sound senior.

"S3 + Iceberg + Glue Catalog is the modern AWS lakehouse substrate."
"Glue Jobs first, EMR / EMR Serverless when we need more control or non-Spark engines."
"Athena for ad-hoc with partition projection + Parquet; Redshift / Redshift Spectrum for high-concurrency BI."
"Step Functions for short orchestration; MWAA when the team wants full Airflow."
"Lake Formation for fine-grained access; IAM for the coarse plumbing."

Python
Topic — etl
AWS ecosystem drills

Practice →

Python
Topic — data-analysis
Lakehouse architecture patterns

Practice →

Choosing the right AWS service (cheat sheet)

A one-screen cheat sheet for the most-asked aws data engineering decisions.

You want to …	AWS service	Notes
Store a data lake	Amazon S3 (+ Iceberg)	raw / clean / curated zones; partition by `dt`
Catalog tables across engines	AWS Glue Data Catalog	Hive-compatible; shared by Athena, EMR, Redshift
Run a serverless Spark ETL job	AWS Glue Job (Glue 4.0)	DPU-billed; bookmarks + push_down_predicate
Discover schemas in S3	AWS Glue Crawler	classifiers; on-demand or scheduled
Run a long-running Spark cluster	Amazon EMR on EC2	Master + Core On-Demand, Task Spot
Run pay-per-second Spark	Amazon EMR Serverless	EMR runtime, Glue-style billing
Run Spark on existing Kubernetes	Amazon EMR on EKS	reuses your EKS plumbing
Ad-hoc SQL over S3	Amazon Athena	partition projection + Parquet
Sub-second BI SQL	Amazon Redshift	RA3 nodes + Materialized Views
Stream events at low latency	Kinesis Data Streams	shard math: 1 MB/s in per shard
Land streams in S3 / Redshift	Kinesis Firehose	60s–900s buffers; Parquet conversion via Glue
Want full Apache Kafka	Amazon MSK	Kafka APIs; MSK Connect for connectors
CDC from a relational source	AWS DMS	full load + change capture
Connect SaaS apps	AWS AppFlow	managed Salesforce / Slack / etc. connectors
Orchestrate a workflow	AWS Step Functions or Amazon MWAA	Step Functions for state machines; MWAA for Airflow DAGs
Fine-grained table / column access	AWS Lake Formation	table / column / row-level grants over Glue Catalog
Cheap, short event transform	AWS Lambda	Firehose transform, S3 event handler
Open table format on S3	Apache Iceberg (or Delta / Hudi)	atomic MERGE, schema evolution, time travel

Frequently asked questions

What is AWS data engineering and which services should I learn first?

aws data engineering is the practice of building ingestion, storage, transformation, and serving pipelines on top of AWS-managed services — and the four to learn first, in order, are S3 (the lake), the Glue Data Catalog (the metastore that every engine shares), AWS Glue Jobs (your default serverless Spark ETL engine), and Amazon Athena (serverless SQL over S3 for ad-hoc analytics). Add Amazon Kinesis Data Streams + Firehose for streaming pipelines, then Amazon EMR when you outgrow Glue (custom packages, long-running steady-state jobs, non-Spark engines like Trino or HBase), and Amazon Redshift when you need sub-second BI concurrency. Around that core you'll layer Lake Formation for fine-grained permissions, Step Functions / MWAA for orchestration, and Amazon Iceberg as your open table format for ACID upserts and schema evolution on S3.

What is the difference between AWS Glue and Amazon EMR?

glue vs emr is the most-asked AWS data engineering comparison. AWS Glue is a fully serverless Spark platform — you don't pick instance types, you just declare --number-of-workers + worker type and Glue runs the job; billing is per-DPU-hour with a 1-minute minimum. Amazon EMR gives you full Hadoop / Spark / Hive / Trino / HBase clusters where you control instance types, Spot mixes, bootstrap actions, and software versions; EMR Serverless is the in-between option that gives you the EMR runtime with Glue-style serverless billing. The rule of thumb: reach for Glue first for almost every Spark ETL job; move to EMR Serverless when you need EMR's runtime features (specific Spark version, non-Glue connectors); move to EMR on EC2 when you need long-running steady-state clusters, custom OS packages, or non-Spark engines. Both share the same S3 + Glue Catalog substrate, so migrating between them is mostly a re-wiring of the job runner.

How does Amazon Athena cost work and how do I cut my bill?

Athena bills per TB of data scanned (region-dependent, around $5 / TB at the time of writing). Four levers dominate cost: (1) partition the table so the WHERE clause restricts S3 listing — most tables should partition by dt, often plus a second high-cardinality column; (2) use Parquet (or ORC) so Athena reads only the columns the query references — columnar formats typically cut scan size 5–10× vs CSV / JSON; (3) use partition projection so Athena resolves partition values from the predicate without listing the Glue Catalog or running MSCK REPAIR TABLE; (4) set BytesScannedCutoffPerQuery on the Workgroup so runaway SELECT * queries fail fast. Combined, these typically take an analyst's bill from O(TB scanned) per query to O(GB scanned) — a 100× improvement is common. The most frequent interview pitfall: filtering on event_ts rather than the partition column dt, which silently disables partition pushdown.

How does Amazon Kinesis Data Streams differ from Kinesis Firehose and Amazon MSK?

The three streaming services solve different jobs. Kinesis Data Streams is the sharded ordered log — your producers write records, consumers (KCL, Lambda, Spark Structured Streaming) read at their own pace, retention is 24h–365d, ordering is per shard, and each shard handles ~1 MB/s in / 2 MB/s out. Kinesis Firehose is the landing service — it consumes from a Kinesis stream (or direct PUT), buffers records, optionally transforms via Lambda or converts JSON → Parquet via the Glue Catalog, then writes to S3 / Redshift / OpenSearch / Splunk; there is no consumer SDK because Firehose IS the consumer. Amazon MSK is managed Apache Kafka — full Kafka API compatibility for teams already invested in Kafka tooling, with MSK Connect for connectors. Pick Data Streams + Firehose when you want native AWS streaming with the simplest S3 landing path; pick MSK when you need Kafka compatibility, exactly-once semantics via transactions, or Kafka Streams.

Should I use AWS Glue Crawlers or partition projection in Athena?

Most modern AWS data engineering teams default to partition projection for high-cardinality time-partitioned tables — it eliminates Crawler runs, eliminates MSCK REPAIR TABLE, and gives instant cold-start because Athena computes partition values from the predicate at query time. Glue Crawlers are still the right choice when schemas evolve unpredictably (new sources, classifiers, schema drift) or when the table has low-cardinality discrete partitions that don't lend themselves to projection (e.g. arbitrary string IDs). The senior pattern is: use Crawlers during exploration (let them discover the initial schema + partitions), then freeze the schema + switch to partition projection for the long-running production table. Many teams further combine the two — projection for dt / hr (the high-cardinality time keys) and a periodic Crawler to pick up schema drift on the column set.

Where does Amazon Redshift fit alongside Athena and Glue?

Amazon Redshift complements rather than replaces Athena / Glue. Athena is pay-per-TB-scanned serverless SQL — perfect for ad-hoc analytics, exploratory queries, and any workload where concurrency is low and queries are intermittent. Redshift is a managed columnar warehouse — perfect for sub-second BI dashboards, high-concurrency SQL (hundreds of users hitting the same marts), and complex multi-join queries on materialised models; you pay for the compute that runs continuously rather than per query. The bridge between the two is Redshift Spectrum — it lets Redshift query S3 tables registered in the Glue Catalog, so your curated S3 / Iceberg data is queryable from Redshift without copy. The canonical interview answer: "We land in S3 + Glue Catalog, model with Glue / EMR, expose ad-hoc workloads through Athena, and ship modelled marts to Redshift for the BI tier — and Redshift Spectrum gives Redshift queries access to the long tail of S3 data on demand."

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including Python practice keyed to the same lakehouse + streaming concepts that underpin AWS data engineering (S3 partitioning, Spark / Glue Job thinking, Kinesis-style event flow, Athena-style projection-pushdown SQL). Whether you're drilling aws data engineer interview questions or building real Glue + EMR + Athena + Kinesis pipelines, the practice library mirrors the same four-layer mental model this guide teaches — plus the broader distributed-systems primitives every cloud lakehouse stands on.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse streaming Python drills →; reinforce real-time analytics drills →; widen coverage on the full data-analysis library →.

DEV Community

AWS Data Engineering: Glue, EMR, Athena, Kinesis — End-to-End Guide

1. Why AWS data engineering — the lakehouse pattern in one diagram

The AWS lakehouse pattern — S3 is the lake, Glue Catalog is the metastore, everything else is compute

Worked example — a typical AWS lakehouse pipeline at a glance

2. AWS Glue — Crawlers, Data Catalog, Jobs, bookmarks, Iceberg

`aws glue` — Crawler discovers, Data Catalog remembers, Glue Job transforms

Worked example — Glue Job that compacts raw events into a curated Iceberg table

Glue interview question on running a daily MERGE upsert

Solution Using Iceberg MERGE INTO via Spark SQL on Glue

3. Amazon EMR — Master, Core, Task, EMRFS, YARN, Serverless

`amazon emr` — managed Spark / Hadoop with Master, Core, and Task nodes over S3

Worked example — EMR Spark job that scans S3 with partition pushdown and writes Parquet

EMR interview question on right-sizing a cluster

Solution Using On-Demand Core + Spot Task instance fleets

4. Athena + Kinesis — the streaming lakehouse pattern

`kinesis to athena` — stream once, land in S3, catalog with Glue, query with Athena

Worked example — Kinesis → Firehose → S3 (Parquet) → Athena query

Athena + Kinesis interview question on cutting query cost 100×

Solution Using partition + columnar + projection pushdown

5. The wider AWS data engineering ecosystem

The four-layer AWS lakehouse stack — storage, streaming, compute, governance

Choosing the right AWS service (cheat sheet)

Frequently asked questions

What is AWS data engineering and which services should I learn first?

What is the difference between AWS Glue and Amazon EMR?

How does Amazon Athena cost work and how do I cut my bill?

How does Amazon Kinesis Data Streams differ from Kinesis Firehose and Amazon MSK?

Should I use AWS Glue Crawlers or partition projection in Athena?

Where does Amazon Redshift fit alongside Athena and Glue?

Practice on PipeCode

Top comments (0)

1. Why AWS data engineering — the lakehouse pattern in one diagram

The AWS lakehouse pattern — S3 is the lake, Glue Catalog is the metastore, everything else is compute

Worked example — a typical AWS lakehouse pipeline at a glance

2. AWS Glue — Crawlers, Data Catalog, Jobs, bookmarks, Iceberg

aws glue — Crawler discovers, Data Catalog remembers, Glue Job transforms

Worked example — Glue Job that compacts raw events into a curated Iceberg table

Glue interview question on running a daily MERGE upsert

Solution Using Iceberg MERGE INTO via Spark SQL on Glue

3. Amazon EMR — Master, Core, Task, EMRFS, YARN, Serverless

amazon emr — managed Spark / Hadoop with Master, Core, and Task nodes over S3

Worked example — EMR Spark job that scans S3 with partition pushdown and writes Parquet

EMR interview question on right-sizing a cluster

Solution Using On-Demand Core + Spot Task instance fleets

4. Athena + Kinesis — the streaming lakehouse pattern

kinesis to athena — stream once, land in S3, catalog with Glue, query with Athena

Worked example — Kinesis → Firehose → S3 (Parquet) → Athena query

Athena + Kinesis interview question on cutting query cost 100×

Solution Using partition + columnar + projection pushdown

5. The wider AWS data engineering ecosystem

The four-layer AWS lakehouse stack — storage, streaming, compute, governance

Choosing the right AWS service (cheat sheet)

Frequently asked questions

What is AWS data engineering and which services should I learn first?

What is the difference between AWS Glue and Amazon EMR?

How does Amazon Athena cost work and how do I cut my bill?

How does Amazon Kinesis Data Streams differ from Kinesis Firehose and Amazon MSK?

Should I use AWS Glue Crawlers or partition projection in Athena?

Where does Amazon Redshift fit alongside Athena and Glue?

Practice on PipeCode

`aws glue` — Crawler discovers, Data Catalog remembers, Glue Job transforms

`amazon emr` — managed Spark / Hadoop with Master, Core, and Task nodes over S3

`kinesis to athena` — stream once, land in S3, catalog with Glue, query with Athena