AWS DEA-C01, the AWS Certified Data Engineer — Associate exam, is the cloud certification that finally maps cleanly to the data engineering day job: ingest, store, transform, catalog, secure, and operate data on AWS. Released in 2024, the exam has ~85 questions (65 of them scored) in 130 minutes, with a passing scaled score of roughly 720 / 1000, and covers four domains: Data Ingestion + Transformation (34%), Data Store Management (26%), Data Operations + Support (22%), and Data Security + Governance (18%). If you've spent years gluing AWS data services together (S3, Glue, Athena, Kinesis, Redshift, Lake Formation, Step Functions), this exam is the credential that proves it.
This guide is a DEA-C01 field manual: a complete AWS data engineer associate certification roadmap that takes you from "I think I'm ready" to a booked exam slot in eight focused weeks. You'll see the exam blueprint broken down domain-by-domain, an 8-week study plan with reading + lab hours per week, the six minimum-viable hands-on labs that cover every exam domain end-to-end, a four-tier resource stack (official → hands-on → practice → exam day), and the exam-day playbook: proctor setup, time budget per question, flagging strategy, and what to do in the last 10 minutes. Every section walks through a realistic DEA-C01-style exam scenario so you can pattern-match on the day.
When you want hands-on reps between study sessions, browse the Python practice library, drill ETL Python exercises, sharpen SQL practice, rehearse streaming Python drills, or widen coverage with the full data-analysis library.
On this page
- Why DEA-C01 matters and what the exam actually tests
- The four DEA-C01 exam domains and how to weight your time
- The 8-week DEA-C01 study plan — week by week
- Six minimum-viable hands-on labs that cover every domain
- The four-tier resource stack and exam-day playbook
- Choosing the right DEA-C01 study lever (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why DEA-C01 matters and what the exam actually tests
AWS DEA-C01 — the first AWS certification built for the data engineering job, not the analytics one
The one-sentence invariant: AWS DEA-C01 is AWS's first associate-tier certification that maps directly to the data-engineering job description — pipelines, storage, transformation, security — instead of bolting analytics onto a generalist track. If you've previously side-eyed the now-retired DAS-C01 (Data Analytics — Specialty) for being half BI dashboards, DEA-C01 is the cleaner replacement: every domain is something you actually do in a DE seat.
The exam at a glance.
- Code — DEA-C01.
- Full name — AWS Certified Data Engineer — Associate.
- Release — March 2024 (general availability); current as of 2025-2026.
- Format — multiple choice and multiple response.
- Question count — ~85 questions total (65 scored + 20 unscored pretest).
- Time — 130 minutes (plus 30 minutes total for non-disclosure / surveys = ~2h 40m chair time).
- Pass mark — scaled score ≈ 720 / 1000 (AWS does not publish a fixed percentage).
- Cost — USD 150 (associate tier); plus optional Official Practice Question Set on Skill Builder.
- Delivery — Pearson VUE test centre or PSI / OnVue online proctored from home.
- Validity — 3 years; recertify by passing the latest version.
- Prerequisites — none required; 2-3 years AWS / data engineering experience recommended.
Who DEA-C01 is for.
- Working data engineers on AWS who want a credential that matches the actual day-job.
- Cloud or DevOps engineers moving sideways into data.
- Analytics engineers who use AWS but mostly through dbt + Snowflake / Redshift and want broader AWS fluency.
- Career switchers from BI / analytics / SWE backgrounds preparing for their first DE role at an AWS-shop company.
- AWS Solutions Architect Associate (SAA-C03) graduates who want the data-specific follow-on cert.
Who DEA-C01 is not strictly for.
- Pure ML practitioners — that's the MLS-C01 (Machine Learning Specialty) lane.
- Pure BI / dashboard engineers — the data-engineering scenarios on DEA-C01 will feel orthogonal.
- Teams on GCP or Azure — different vendor certifications (PDE for GCP, DP-203 for Azure) cover the equivalent ground.
What changed when DAS-C01 retired.
- DAS-C01 (Data Analytics — Specialty) was retired in April 2024.
- DEA-C01 is the spiritual successor for the data-engineering side; the BI / visualisation half effectively folded into other learning paths.
- DEA-C01 is associate-tier (DAS-C01 was specialty-tier) — slightly easier scope, slightly cheaper sticker.
- Modern services — DEA-C01 explicitly tests Glue Studio, Lake Formation, Iceberg on Athena, Redshift Serverless, MWAA, DataZone, and Step Functions Distributed Map, several of which were not yet available when DAS-C01 was written.
What the exam does test (the headline themes).
- Designing data pipelines end-to-end on AWS — pick the right ingest, store, transform, and serve services for a given scenario.
- Service trade-offs — Glue vs EMR vs Lambda for compute; Redshift vs Athena vs Aurora for serving; Kinesis Data Streams vs Kinesis Firehose vs MSK for streaming.
- Operating and monitoring pipelines — CloudWatch metrics, alarms, dashboards, X-Ray traces; Step Functions error handling and retries; DLQs.
- Securing data on AWS — IAM least-privilege, KMS encryption (SSE-S3 / SSE-KMS / CSE), Lake Formation tag-based / column-level access, VPC endpoints, PrivateLink, Macie for PII.
- Cost optimisation — S3 storage classes and lifecycle, partitioning + compression for Athena, RA3 + AQUA for Redshift, Spot pricing for EMR.
What the exam does not test.
- Hand-coding bespoke Spark UDFs from memory.
- Memorising every single CLI flag for every service.
- Deep ML / model-training internals.
- Pure visualisation tooling (QuickSight is mentioned, but not the focus).
- Hadoop-only on-prem topics.
Why most candidates fail.
- Studied SAA-C03 thinking it overlaps — it doesn't, the data services barely come up there.
- Watched videos but never opened the console — DEA-C01 is heavy on scenario questions where you must know which knob to turn.
- Memorised service names, not service trade-offs — the exam writers love "service A vs service B vs service C, which fits this constraint?" prompts.
- Skipped governance / Lake Formation — 18% of the exam, and the part candidates with pure pipeline backgrounds most often skip.
- No mock exams — without practice tests, your timing and your weakest domain stay hidden until exam day.
DEA-C01 vs DAS-C01 — the comparison that still comes up.
| Aspect | DEA-C01 (current) | DAS-C01 (retired) |
|---|---|---|
| Tier | Associate | Specialty |
| Focus | Data engineering | Data analytics + BI |
| Cost | USD 150 | USD 300 |
| Question count | ~85 | 65 |
| Time | 130 min | 180 min |
| Visualisation weight | Light | Heavy (QuickSight) |
| Streaming weight | Heavy (Kinesis + MSK) | Heavy (Kinesis + MSK) |
| Status | Active | Retired April 2024 |
Worked example — the most-common DEA-C01 scenario shape
Detailed explanation. Almost every DEA-C01 question is a short scenario followed by four options. The right answer is rarely the "fanciest" service — it's the one that meets the stated constraint (cost / latency / governance / scale) without over-engineering. Learn this shape and you'll save 20 seconds per question.
Question (DEA-C01-style sample).
A data engineering team ingests clickstream events at a steady rate of ~5 MB/s with bursty spikes up to 20 MB/s for short periods. They need to land the raw data in S3 as Parquet, partitioned by event date, with no custom code to run, and they want to minimise operational overhead. Which solution meets these requirements?
A. Amazon Kinesis Data Streams → AWS Lambda → Amazon S3
B. Amazon Kinesis Data Firehose with dynamic partitioning and Parquet conversion → Amazon S3
C. Amazon MSK → Apache Spark on Amazon EMR → Amazon S3
D. Amazon SQS → AWS Glue streaming job → Amazon S3
Input (the constraints to weigh).
| Constraint | Wording in question | What it points at |
|---|---|---|
| Throughput | "~5 MB/s, spikes to 20 MB/s" | Firehose or KDS comfortably handle this |
| Format | "land as Parquet" | Firehose has built-in format conversion |
| Partitioning | "partitioned by event date" | Firehose dynamic partitioning |
| Code | "no custom code" | Rules out Lambda + EMR + Glue streaming |
| Overhead | "minimise operational overhead" | Serverless, fully managed = Firehose |
Code. No code needed — the right answer is fully managed. The Firehose configuration that does it:
{
  "DeliveryStreamName": "clickstream-to-s3",
  "ExtendedS3DestinationConfiguration": {
    "BucketARN": "arn:aws:s3:::analytics-raw",
    "DynamicPartitioningConfiguration": { "Enabled": true },
    "Prefix": "clickstream/year=!{partitionKeyFromQuery:year}/month=!{partitionKeyFromQuery:month}/day=!{partitionKeyFromQuery:day}/",
    "ErrorOutputPrefix": "errors/",
    "DataFormatConversionConfiguration": {
      "Enabled": true,
      "OutputFormatConfiguration": {
        "Serializer": { "ParquetSerDe": {} }
      },
      "SchemaConfiguration": {
        "DatabaseName": "analytics",
        "TableName": "clickstream_raw"
      }
    }
  }
}
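To make the Prefix expression concrete, here is a minimal Python sketch of the S3 key prefix that the year / month / day partition keys would produce for one event. The `event_time` field name and the `partition_prefix` helper are hypothetical, chosen only for illustration. Firehose performs this substitution itself once the partition keys have been extracted from each record (that extraction step is not shown in the configuration above).

```python
from datetime import datetime, timezone


def partition_prefix(event: dict, base: str = "clickstream") -> str:
    """Return the Hive-style S3 prefix for one event (illustrative only)."""
    # Assumption: each event carries an ISO-8601 timestamp in "event_time".
    ts = datetime.fromisoformat(event["event_time"]).astimezone(timezone.utc)
    return f"{base}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"


example = {"event_time": "2024-03-15T23:59:30+00:00", "page": "/home"}
print(partition_prefix(example))
```

Normalising to UTC before formatting is a design choice in this sketch: it keeps an event from landing in two different day partitions depending on the producer's local offset.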
Step-by-step explanation.
- Eliminate option C (MSK + EMR) — explicit "no custom code" rules out a Spark job.
- Eliminate option D (SQS + Glue streaming) — Glue streaming jobs are Spark code, so the "no custom code" constraint rules them out too.
- Eliminate option A (KDS + Lambda) — Lambda is custom code; you'd hand-write the Parquet conversion + partitioning logic.
- Pick B (Firehose) — Firehose's built-in Parquet conversion + dynamic partitioning meets every constraint without a single line of code.
- Sanity-check — throughput (~5-20 MB/s) is well within Firehose limits; cost is per-GB ingested + delivered, predictable and low.
Output.
| Field | Answer |
|---|---|
| Correct option | B |
| Why | Only fully-managed, no-code path that converts to Parquet and partitions on the fly |
| Common wrong pick | A — candidates default to "KDS + Lambda" out of habit |
| Time it should take | < 60 seconds once you spot the "no custom code" anchor |
Rule of thumb: The DEA-C01 exam writers anchor each scenario on one or two constraints (no code, sub-second latency, < $X / month, ACID, exactly-once). Find the anchor first; the right answer usually falls out once you eliminate the options that violate it or over-engineer the solution.
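The elimination habit can be sketched in a few lines of Python. This is only a study aid under simplifying assumptions: the `custom_code` and `fully_managed` flags are this sketch's own reading of the four options in the sample question, not an official classification of the services.

```python
# Each option is described by the properties the question's constraints care about.
# The flag values restate the walkthrough's reasoning and are simplifications.
OPTIONS = {
    "A": {"name": "Kinesis Data Streams + Lambda", "custom_code": True, "fully_managed": True},
    "B": {"name": "Firehose + dynamic partitioning + Parquet", "custom_code": False, "fully_managed": True},
    "C": {"name": "MSK + Spark on EMR", "custom_code": True, "fully_managed": False},
    "D": {"name": "SQS + Glue streaming job", "custom_code": True, "fully_managed": True},
}

# Constraints taken from the question wording, expressed as predicates.
CONSTRAINTS = {
    "no custom code": lambda option: not option["custom_code"],
    "minimal operational overhead": lambda option: option["fully_managed"],
}


def surviving_options(options, constraints):
    """Return the option letters that satisfy every constraint."""
    return [
        letter
        for letter, props in options.items()
        if all(check(props) for check in constraints.values())
    ]


print(surviving_options(OPTIONS, CONSTRAINTS))  # ['B']
```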
2. The four DEA-C01 exam domains and how to weight your time
DEA-C01 exam domains — Ingestion 34%, Store 26%, Ops 22%, Security 18%
DEA-C01 exam domains are the single most important thing to memorise before you plan your study weeks — the percentages dictate where your time goes. Every scored question maps to exactly one of these four buckets.
Domain 1 — Data Ingestion and Transformation (34%).
The biggest chunk of the exam. Expect ~22 of the 65 scored questions here.
- Streaming ingest — Kinesis Data Streams, Kinesis Firehose, Amazon MSK (managed Kafka), Kinesis Data Analytics (Apache Flink).
- Batch ingest — AWS DataSync, AWS Snow family, AWS DMS (Database Migration Service), AWS Transfer Family (SFTP).
- CDC / database replication — DMS with CDC tasks, Aurora zero-ETL integration with Redshift.
- Transformation engines — AWS Glue (Spark + Python shell), Amazon EMR (Spark / Hive / Presto), AWS Lambda for lightweight transforms.
- Glue specifics — Glue Studio visual editor, Glue Crawlers, Glue Data Catalog, Glue Job bookmarks (incremental processing), Glue DataBrew.
- EMR specifics — EMR Serverless, EMR on EC2, EMR on EKS, instance fleets, Spot pricing, managed scaling.
- Orchestration — AWS Step Functions, Amazon MWAA (Managed Airflow), EventBridge, Step Functions Distributed Map.
Domain 2 — Data Store Management (26%).
Expect ~17 of the 65 scored questions here.
- Object storage — S3, S3 storage classes (Standard / Standard-IA / Intelligent-Tiering / Glacier / Glacier Deep Archive), lifecycle policies, S3 Object Lambda, S3 Select.
- Lake formats — Parquet vs ORC vs Avro, Apache Iceberg on Athena, Apache Hudi, Delta Lake.
- Data warehouse — Redshift (RA3 nodes, Serverless), Redshift Spectrum, Redshift materialised views, distribution + sort keys.
- NoSQL — DynamoDB (LSI / GSI, on-demand vs provisioned, DAX, streams), DocumentDB.
- Relational — Aurora (Postgres + MySQL), RDS.
- Specialty — OpenSearch Service, Timestream, Neptune (graph).
- Catalog — AWS Glue Data Catalog, Lake Formation governed tables, DataZone.
Domain 3 — Data Operations and Support (22%).
Expect ~14 of the 65 scored questions here.
- Monitoring — CloudWatch metrics, custom metrics, alarms, dashboards, Container Insights.
- Logging — CloudWatch Logs, CloudWatch Logs Insights queries, log groups, log retention.
- Tracing — AWS X-Ray for Lambda + Step Functions chains.
- Auditing — AWS CloudTrail (management + data events), Config rules.
- Error handling — Step Functions Retry / Catch, Lambda DLQs (SQS), Kinesis Firehose error records, Glue job retries.
- Performance — Athena partition projection, Glue dynamic frames vs DataFrames, EMR Spot fleets, Redshift workload management (WLM) queues, RA3 + AQUA.
- Cost — Cost Explorer, Cost Allocation tags, S3 storage class analysis, Glue auto-scaling, Redshift Serverless RPU caps.
Domain 4 — Data Security and Governance (18%).
Expect ~12 of the 65 scored questions here.
- Identity — IAM roles, policies, conditions, IAM Identity Center (formerly SSO), service control policies (SCPs) in AWS Organizations.
- Encryption — KMS keys (AWS-managed vs customer-managed), key rotation, SSE-S3 / SSE-KMS / SSE-C, client-side encryption, envelope encryption.
- Network isolation — VPC endpoints (Interface + Gateway), PrivateLink, VPC peering, Direct Connect.
- Lake Formation — fine-grained access (table / column / row / cell), LF-Tags, cross-account sharing.
- PII / sensitive data — Amazon Macie, Glue PII detection transforms, KMS-backed tokenisation.
- Compliance frameworks — GDPR, HIPAA, PCI DSS, SOC; AWS Artifact for audit reports.
- Data quality — Glue Data Quality (DQDL rules), AWS Deequ.
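The per-domain question counts quoted above follow directly from the weights. A minimal sketch, assuming the 65 scored questions this guide uses as its basis:

```python
SCORED_QUESTIONS = 65  # figure used throughout this guide

DOMAIN_WEIGHTS = {
    "Data Ingestion and Transformation": 0.34,
    "Data Store Management": 0.26,
    "Data Operations and Support": 0.22,
    "Data Security and Governance": 0.18,
}


def expected_questions(weights, total):
    """Round each domain's share of the scored questions to a whole number."""
    return {domain: round(weight * total) for domain, weight in weights.items()}


for domain, count in expected_questions(DOMAIN_WEIGHTS, SCORED_QUESTIONS).items():
    print(f"{domain}: ~{count}")
```

The same function can split a study-hour budget across domains by passing total hours instead of the question count.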
Worked example — a Domain 4 (Security and Governance) scenario
Detailed explanation. Domain 4 questions are notorious because pipeline engineers haven't usually set up Lake Formation themselves. The exam writers love LF-Tag and column-level scenarios because they sit at the intersection of IAM, Glue Data Catalog, and S3. Pattern: a multi-team scenario where one team must see one column subset and another team a different subset.
Question (DEA-C01-style sample).
A company stores a customers table in S3 (Parquet) cataloged in AWS Glue Data Catalog. Team A (Marketing) must see all columns except ssn and dob. Team B (Finance) must see all columns. Both teams query via Amazon Athena. The solution must be manageable centrally and must not require duplicating the data. Which approach is best?
A. Copy the table twice (once without ssn / dob for Marketing, once full for Finance) and grant each team access to its copy via IAM bucket policies.
B. Use AWS Lake Formation to grant column-level permissions on the customers table: exclude ssn and dob for Marketing's IAM role; grant all columns to Finance's role.
C. Use Athena workgroups with WHERE clauses and trust each user to omit ssn / dob.
D. Encrypt ssn and dob with different KMS keys and only grant Finance access to those keys.
Input.
| Constraint | What it rules in / out |
|---|---|
| "Manageable centrally" | Rules out A (two copies = double maintenance) |
| "Must not duplicate data" | Rules out A again |
| "Manageable centrally" | Rules out C (trust isn't access control) |
| Two teams, different column subsets | Rules in column-level grants |
| Athena queries | Lake Formation integrates natively |
Code (Lake Formation column-level grant via AWS CLI).
# Grant Finance role full SELECT on every column
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/FinanceRole \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "analytics",
      "Name": "customers"
    }
  }'

# Grant Marketing role SELECT on all columns EXCEPT ssn, dob
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/MarketingRole \
  --permissions SELECT \
  --resource '{
    "TableWithColumns": {
      "DatabaseName": "analytics",
      "Name": "customers",
      "ColumnWildcard": {
        "ExcludedColumnNames": ["ssn", "dob"]
      }
    }
  }'
Step-by-step explanation.
- Lake Formation governs access to the Glue Catalog tables — once the table's S3 location is registered with LF and the default IAMAllowedPrincipals grant is revoked, IAM policies alone no longer grant access and you must add LF grants.
- ColumnWildcard.ExcludedColumnNames lets you grant SELECT * minus a denylist — perfect for the "all columns except" pattern.
- A Marketing user assumes MarketingRole and queries Athena; Athena consults Lake Formation, and ssn / dob are excluded from the results and from the schema that role sees.
- Finance role queries the same table; Lake Formation grants every column.
- No data duplication — both teams query the same underlying S3 Parquet files; Lake Formation filters the visible columns per role.
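As a rough mental model of the "all columns except" grant, the sketch below builds the same TableWithColumns resource as a Python dict and simulates which columns each role would see. The column list for customers and both helper names are hypothetical; nothing here calls AWS.

```python
def column_grant_resource(database, table, excluded):
    """Build the TableWithColumns resource shape used in the CLI call above."""
    return {
        "TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded)},
        }
    }


def visible_columns(table_columns, excluded=()):
    """Columns a principal can SELECT under a wildcard grant minus exclusions."""
    excluded_set = set(excluded)
    return [col for col in table_columns if col not in excluded_set]


# Hypothetical schema for the customers table.
CUSTOMERS = ["customer_id", "name", "email", "ssn", "dob", "region"]

marketing = column_grant_resource("analytics", "customers", ["ssn", "dob"])
excluded = marketing["TableWithColumns"]["ColumnWildcard"]["ExcludedColumnNames"]

print("Marketing sees:", visible_columns(CUSTOMERS, excluded))
print("Finance sees:  ", visible_columns(CUSTOMERS))
```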
Output.
| Field | Answer |
|---|---|
| Correct option | B |
| Why | Only LF column-level grants are centrally managed, no-duplicate, and Athena-native |
| Common wrong pick | D — KMS key separation doesn't hide columns in query results |
| Time | < 75 seconds once you spot "centrally managed + no duplication" |
Rule of thumb: Lake Formation questions on DEA-C01 tend to share one shape: multiple teams, different column / row subsets, must avoid duplicating data. The answer is almost always LF grants (column-level, row-level filters, or LF-Tags) rather than IAM-only policies, bucket policies, or "copy the data".
3. The 8-week DEA-C01 study plan — week by week
DEA-C01 study plan — eight focused weeks, ~8 hours per week, half reading + half hands-on
The DEA-C01 study plan works best as eight weeks at ~8 hours per week (64 hours total). The proportions matter more than the order: re-arrange weeks if you already know storage or already use Spark, but don't compress the lab time.
Week 0 — set the foundation (do this before W1).
- Download the official Exam Guide PDF (free) from the AWS certification page.
- Skim it once end-to-end in 90 minutes — don't try to memorise.
- Highlight the task statements under each domain; these are the closest you'll get to the actual exam blueprint.
- Pin the Exam Guide on your second monitor — re-read the task list every week.
- Create a free AWS account (or use a sandbox / employer account if you have one); the labs need real console access.
- Bookmark — Skill Builder, AWS Workshops, the Tutorials Dojo cheat sheets, and the Whizlabs / Tutorials Dojo practice exam pages.
Weeks 1-2 — Storage and ingestion (Domain 1 + Domain 2 core).
| Day | Topic | Reading hours | Lab hours |
|---|---|---|---|
| W1 D1-2 | S3 basics — buckets, keys, storage classes, lifecycle | 2 | 1 |
| W1 D3-4 | S3 advanced — versioning, replication, encryption, Object Lambda | 2 | 1 |
| W1 D5-7 | Kinesis Data Streams + Firehose + Lambda | 2 | 1 |
| W2 D1-2 | Amazon MSK + Kinesis Data Analytics (Flink) | 2 | 1 |
| W2 D3-4 | AWS DMS + Aurora zero-ETL + AWS DataSync | 2 | 1 |
| W2 D5-7 | Lab 1 — S3 + Glue + Athena lakehouse + Lab 2 — Kinesis + Firehose streaming | 1 | 4 |
- Reading goal — be able to name every storage class + every ingest service and one trade-off for each.
- Lab goal — finish Lab 1 (S3 + Glue + Athena lakehouse) and Lab 2 (Kinesis + Firehose streaming) — see §4.
Weeks 3-4 — Compute and transform (rest of Domain 1).
| Day | Topic | Reading hours | Lab hours |
|---|---|---|---|
| W3 D1-2 | AWS Glue — Studio, Crawlers, Catalog, Bookmarks | 2 | 1 |
| W3 D3-4 | Glue jobs — Spark vs Python Shell; DataBrew | 2 | 1 |
| W3 D5-7 | Amazon EMR — Serverless, on EC2, on EKS; managed scaling; Spot fleets | 2 | 1 |
| W4 D1-2 | Athena — partitioning, partition projection, query plan, workgroups | 2 | 1 |
| W4 D3-4 | Redshift — RA3, Serverless, Spectrum, materialised views, dist + sort keys | 2 | 1 |
| W4 D5 | Iceberg on Athena, Hudi, Delta Lake on AWS | 2 | 1 |
| W4 D6-7 | Lab 3 — Glue job + bookmarks + Lab 4 — EMR + Spark + Iceberg | 1 | 4 |
- Reading goal — know Glue vs EMR vs Lambda trade-offs cold; know Redshift vs Athena vs Aurora trade-offs cold.
- Lab goal — finish Lab 3 (Glue bookmarks) and Lab 4 (EMR + Iceberg).
Week 5 — Orchestration and ops (Domain 3).
| Day | Topic | Reading hours | Lab hours |
|---|---|---|---|
| D1-2 | Step Functions — states, error handling, Distributed Map | 2 | 1 |
| D3 | Amazon MWAA (Managed Airflow) — DAGs, secrets, Glue / EMR operators | 1 | 1 |
| D4 | EventBridge + EventBridge Scheduler | 1 | 1 |
| D5 | CloudWatch — metrics, alarms, dashboards, Logs Insights | 1 | 1 |
| D6 | CloudTrail + X-Ray + Config | 1 | 1 |
| D7 | Lab 5 — Redshift + Spectrum + RA3 (covers DW ops) | 1 | 3 |
- Reading goal — be able to design a Step Functions DAG with retries + DLQ from memory.
- Lab goal — finish Lab 5 (Redshift Spectrum) and a small Step Functions side-project chaining Glue + Lambda + Athena.
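For the retries-plus-DLQ reading goal, it helps to see one Task state with Retry and Catch written out and to work out the wait schedule it implies. The sketch below expresses the state as a Python dict; the job name and the state names (`SendToDLQ`, `RunAthenaQuery`) are placeholders, and the defaults in `retry_delays` are the documented Step Functions defaults as understood here (1 second, 3 attempts, backoff rate 2.0).

```python
# Hypothetical Task state for a Glue job; names are placeholders.
glue_task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "nightly-transform"},
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # After retries are exhausted, route to a state that writes to a DLQ.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SendToDLQ"}],
    "Next": "RunAthenaQuery",
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt under exponential backoff."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


print(retry_delays(glue_task["Retry"][0]))  # [5.0, 10.0, 20.0]
```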
Week 6 — Security and governance (Domain 4).
| Day | Topic | Reading hours | Lab hours |
|---|---|---|---|
| D1-2 | IAM — roles, policies, conditions, IAM Identity Center | 2 | 1 |
| D3 | KMS — keys, rotation, SSE-S3 vs SSE-KMS, envelope encryption | 1 | 1 |
| D4 | Lake Formation — fine-grained access, LF-Tags, cross-account | 1 | 2 |
| D5 | VPC endpoints, PrivateLink, Direct Connect | 1 | 1 |
| D6 | Macie + Glue PII detection + Glue Data Quality (DQDL) | 1 | 1 |
| D7 | Lab 6 — Lake Formation + IAM + column-level ACL | 1 | 3 |
- Reading goal — be able to recite the IAM trust policy / permissions policy split + the Lake Formation column-level grant syntax.
- Lab goal — finish Lab 6 (LF column ACL).
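The trust policy / permissions policy split is easier to recite once you have seen both documents side by side. A minimal sketch as Python dicts, assuming a role that AWS Glue assumes in order to read the analytics-raw bucket; the bucket name and the single s3:GetObject action are illustrative, not a complete Glue role.

```python
import json

# Trust policy: WHO may assume the role (note the Principal element).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: WHAT the role may do once assumed (note the Resource element).
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::analytics-raw/*",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

A quick way to tell them apart under exam pressure: a trust policy names a Principal and no Resource, while an identity-based permissions policy names a Resource and no Principal.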
Week 7 — Mock exams and gap analysis.
| Day | Activity | Hours |
|---|---|---|
| D1 | Mock 1 (Official Practice Question Set on Skill Builder, 20 questions) + review | 2 |
| D2 | Gap-fill — re-read weakest two domains | 2 |
| D3 | Mock 2 (Tutorials Dojo or Whizlabs, 65 questions, timed) + review | 3 |
| D4 | Gap-fill — drill the 5 services you scored worst on | 2 |
| D5 | Mock 3 (Tutorials Dojo or Whizlabs, 65 questions, timed) + review | 3 |
| D6 | Gap-fill | 2 |
| D7 | Mock 4 (different provider, 65 questions, timed) — target ≥ 80% | 3 |
- Reading goal — for every wrong answer, write a one-line "why I missed it" note in a single document; re-read this document every morning of W8.
- Lab goal — none; this week is pure practice questions.
Week 8 — Final review and book the exam.
| Day | Activity |
|---|---|
| D1 | Re-read the "why I missed it" document; re-read the Exam Guide PDF |
| D2 | Re-read your weakest domain's task statements end-to-end |
| D3 | Re-watch your two weakest service videos (Skill Builder) |
| D4 | Skim cheat sheets (Tutorials Dojo summary PDFs) |
| D5 | Final mock — aim ≥ 85% |
| D6 | Rest day — no AWS content; sleep |
| D7 | Exam day — see §5 for the playbook |
- Booking the exam — book it during W7 once you're scoring ≥ 75% on mocks. AWS schedules through Pearson VUE or PSI; pick a morning slot if you're a morning person, otherwise mid-afternoon.
The 8-week budget in one line.
- Total — ~64 hours.
- Reading — ~30 hours.
- Hands-on labs — ~25 hours.
- Mocks + review — ~9 hours.
- Per week — ~8 hours; comfortable alongside a day job.
If you only have 4 weeks (cram plan).
- Compress W1-2 into W1, W3-4 into W2, W5+W6 into W3, W7+W8 into W4.
- Cut the labs down to three — keep Lab 1, Lab 4, Lab 6.
- Take 2 mocks instead of 4.
- Doable but stressful — only attempt if you already have 2+ years AWS experience.
4. Six minimum-viable hands-on labs that cover every domain
DEA-C01 hands-on labs — the six labs that touch every domain end-to-end
DEA-C01 hands-on labs are non-negotiable. Reading without building leaves gaps that scenario questions will exploit. Six small labs — each ~4-6 hours — cover every exam domain at least once.
Lab 1 — S3 + Glue + Athena lakehouse (Domain 1 + 2).
- What you build — CSV → S3 raw → Glue Crawler → Glue Data Catalog → Athena query → S3 results.
- Why it matters — the canonical "lakehouse on a budget" pattern; appears on ~10% of exam scenarios.
- Key services — S3, Glue (Crawler + Data Catalog), Athena, IAM, KMS.
- Time — ~4 hours.
- Stretch goal — re-run with the data in Parquet (partitioned by date) and compare Athena scan size + cost.
Lab 2 — Kinesis Data Streams + Firehose + S3 (Domain 1).
- What you build — producer (Python boto3) writes events to Kinesis Data Stream → Firehose consumes → converts to Parquet → lands in S3 partitioned by date.
- Why it matters — exercises streaming ingest, format conversion, dynamic partitioning, and the KDS vs Firehose trade-off (which the exam loves).
- Key services — Kinesis Data Streams, Kinesis Data Firehose, Lambda (optional transform), S3, Glue (for table registration).
- Time — ~5 hours.
- Stretch goal — swap Firehose for an MSK cluster + a Lambda consumer; observe the operational delta.
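For the Lab 2 producer, one detail worth practising is batching: PutRecords accepts at most 500 records per request. The sketch below only builds the request payloads, so it runs without AWS access. The `user_id` partition-key field and the `clickstream` stream name are assumptions for illustration, and the per-record and per-request size limits are not checked here.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def build_put_records_batches(events, stream_name, key_field="user_id"):
    """Yield keyword arguments for successive PutRecords calls."""
    batch = []
    for event in events:
        batch.append(
            {
                "Data": json.dumps(event).encode("utf-8"),
                "PartitionKey": str(event[key_field]),
            }
        )
        if len(batch) == MAX_RECORDS_PER_CALL:
            yield {"StreamName": stream_name, "Records": batch}
            batch = []
    if batch:  # flush the final partial batch
        yield {"StreamName": stream_name, "Records": batch}


events = [{"user_id": i % 7, "page": "/home"} for i in range(1200)]
batches = list(build_put_records_batches(events, "clickstream"))
print([len(b["Records"]) for b in batches])  # [500, 500, 200]

# With boto3 installed and credentials configured, each batch could be sent with
# boto3.client("kinesis").put_records(**batch); that call is not made here.
```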
Lab 3 — Glue job + bookmarks + partitions (Domain 1).
- What you build — A Glue Spark job that reads incremental data from S3 (using job bookmarks to skip files already processed), transforms, and writes partitioned Parquet to S3.
- Why it matters — Glue bookmarks are a heavily-tested "how do I avoid re-processing" pattern; partitioning + compression is the cost-control answer to half the Athena scenarios.
- Key services — Glue (Spark + bookmarks), S3, Glue Data Catalog, CloudWatch (job metrics).
- Time — ~5 hours.
- Stretch goal — add Glue Data Quality rules (DQDL) and fail the job on rule breach; emit results to CloudWatch.
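Before building Lab 3, it can help to see the idea behind job bookmarks in miniature. The toy model below is not how Glue stores bookmark state internally; it only illustrates the behaviour you should observe: a second run skips the files the first run already processed. The file keys are made up.

```python
def run_incremental_job(available_keys, bookmark):
    """Process only keys not seen in earlier runs.

    Returns (keys processed this run, updated bookmark).
    """
    to_process = sorted(key for key in available_keys if key not in bookmark)
    # A real job would transform the data and write partitioned Parquet here.
    return to_process, bookmark | set(to_process)


bookmark = set()

run1, bookmark = run_incremental_job(
    {"sales/2024-01.csv", "sales/2024-02.csv"}, bookmark
)
run2, bookmark = run_incremental_job(
    {"sales/2024-01.csv", "sales/2024-02.csv", "sales/2024-03.csv"}, bookmark
)

print(run1)  # January and February files
print(run2)  # only the March file
```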
Lab 4 — EMR + Spark + Iceberg (Domain 1 + 2).
- What you build — an EMR Serverless application that runs a PySpark job creating an Apache Iceberg table on S3, doing an ACID MERGE INTO (upsert), and querying it from Athena.
- Why it matters — Iceberg shows up across both ingest and store domains; EMR Serverless vs EMR on EC2 is a frequent trade-off question.
- Key services — EMR Serverless, Spark, Apache Iceberg, Glue Data Catalog, S3, Athena.
- Time — ~6 hours.
- Stretch goal — schema-evolve the table (add a column) and verify Athena query reads old + new partitions correctly.
Lab 5 — Redshift + Spectrum + RA3 (Domain 2 + 3).
- What you build — a Redshift Serverless workgroup with one materialised view + one Redshift Spectrum external table backed by S3 Parquet.
- Why it matters — Redshift questions test dist + sort keys, RA3 vs DS2, Spectrum vs COPY, and Serverless RPU limits.
- Key services — Redshift Serverless, Redshift Spectrum, S3, Glue Data Catalog, CloudWatch.
- Time — ~5 hours.
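- Stretch goal — let the workgroup sit idle and confirm compute charges stop (Serverless has no pause / resume; it bills RPU-hours only while queries run); set a base RPU capacity and a usage limit.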
Lab 6 — Lake Formation + IAM + column-level ACL (Domain 4).
- What you build — register your Lab 1 lakehouse with Lake Formation; create two IAM roles (Marketing, Finance); grant column-level + LF-Tag access; verify each role sees the correct columns from Athena.
- Why it matters — Domain 4 is the part most candidates skip and the part the exam writers love most. Building it once will save you 8-10 questions on exam day.
- Key services — Lake Formation, IAM, Glue Data Catalog, S3, Athena.
- Time — ~5 hours.
- Stretch goal — add a row-level filter (e.g. region = 'EMEA') for a third role; cross-account share via LF.
Lab order — read this carefully.
- Do Lab 1 first — sets up the catalog and S3 layout you'll reuse.
- Do Lab 6 last — depends on Lab 1's catalog; tying it off at the end cements security thinking.
- Labs 2-5 can be done in any order but follow the W1-W6 schedule for momentum.
Where to find ready-made lab scripts.
- AWS Workshops (workshops.aws) — Data Engineering on AWS — Foundations, Building Data Lakes, Lake Formation workshops.
- AWS Skill Builder Builder Labs — gated paid sandbox labs that give you a real account for 1 hour.
- GitHub — search "aws data engineering workshop" or "dea-c01 lab"; the AWS Samples org publishes most templates as CloudFormation.
Worked example — Lab 1 end-to-end (the canonical S3 + Glue + Athena lakehouse)
Detailed explanation. This is the most-built lab in DEA-C01 prep. Walk through it once and you'll recognise every Glue + Athena exam question for the next year. The flow: drop CSV in S3 → run a Crawler → query in Athena → rewrite to Parquet → re-query and compare scan size + cost.
Question (lab task).
A sales.csv file (1 GB, columns order_id, customer_id, order_date, region, amount) is in s3://my-raw/sales/sales.csv. Build a queryable Athena table over it, then create a partitioned Parquet copy in s3://my-curated/sales/ and confirm the Parquet query scans less data.
Input.
| Item | Value |
|---|---|
| Raw file | s3://my-raw/sales/sales.csv |
| Size | 1 GB CSV |
| Columns | order_id, customer_id, order_date, region, amount |
| Target | Athena queries; minimise scan cost |
| Partition column | order_date (truncate to month) |
Code.
# 1. Create the Glue database
aws glue create-database --database-input Name=sales_lake

# 2. Create the Glue Crawler
aws glue create-crawler \
  --name sales-csv-crawler \
  --role AWSGlueServiceRoleDefault \
  --database-name sales_lake \
  --targets '{"S3Targets": [{"Path": "s3://my-raw/sales/"}]}'

# 3. Run the Crawler — it discovers schema and registers the table
aws glue start-crawler --name sales-csv-crawler

# 4. Query in Athena (the CSV table is "sales")
#    Run in the Athena console:
#      SELECT region, SUM(amount) FROM sales_lake.sales GROUP BY region;
#    Note the data scanned (~1 GB).

# 5. Use CREATE TABLE AS SELECT (CTAS) to write a partitioned Parquet copy
#    (assumes the Crawler typed order_date as DATE or TIMESTAMP; cast it first if it came through as a string)
#    Run in the Athena console:
#      CREATE TABLE sales_lake.sales_parquet
#      WITH (
#        format = 'PARQUET',
#        external_location = 's3://my-curated/sales/',
#        partitioned_by = ARRAY['order_month']
#      ) AS
#      SELECT
#        order_id, customer_id, order_date, region, amount,
#        date_format(order_date, '%Y-%m') AS order_month
#      FROM sales_lake.sales;

# 6. Re-run the aggregation against sales_parquet — note the much smaller data scan.
Step-by-step explanation.
- Glue database — the namespace for tables; sales_lake is the catalog DB.
- Glue Crawler — points at the S3 prefix, scans files, infers schema, registers a table sales in sales_lake.
- Run the Crawler — takes 30-60 seconds; check the Glue console for the new table.
- Athena query against CSV — scans the full 1 GB on every query (Athena charges per TB scanned).
- CTAS to Parquet + partition — Athena writes columnar Parquet partitioned by order_month; queries that filter on order_month can now skip irrelevant partitions.
- Re-query Parquet — scan drops to ~50-100 MB for the region aggregation, because Parquet is compressed and Athena reads only the region and amount columns; cost drops ~10-20×.
Output.
| Query | Format | Partitioned | Data scanned | Athena cost (approx) |
|---|---|---|---|---|
| SELECT region, SUM(amount)… | CSV | No | ~1 GB | $0.005 |
| SELECT region, SUM(amount)… | Parquet | Yes (by month) | ~50 MB | $0.00025 |
Rule of thumb: Athena cost questions on the exam usually come down to the same answer: partition the data and store it in a compressed columnar format (Parquet). Build this lab once and the answer is muscle memory.
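The cost column in the table above is simple arithmetic on bytes scanned. A minimal sketch, assuming a rate of USD 5 per TB scanned (the rate implied by the table's figures; check current Athena pricing for your region) and decimal units:

```python
PRICE_PER_TB_USD = 5.0  # assumed rate implied by the table; verify against current pricing


def athena_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Approximate query cost from bytes scanned (1 TB taken as 1e12 bytes)."""
    return bytes_scanned / 1e12 * price_per_tb


csv_cost = athena_cost_usd(1e9)        # ~1 GB CSV scan
parquet_cost = athena_cost_usd(50e6)   # ~50 MB Parquet scan

print(f"CSV: ${csv_cost:.5f}")
print(f"Parquet: ${parquet_cost:.5f}")
print(f"Ratio: {csv_cost / parquet_cost:.0f}x")
```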
5. The four-tier resource stack and exam-day playbook
DEA-C01 study resources — official → hands-on → practice → exam day, in that order
DEA-C01 study resources are best stacked as four tiers that compound. Skip the bottom tier and the upper tiers cost more; over-invest in the middle tiers and you forget the official wording the exam grades against.
Tier 1 — Official (start here, free or near-free).
- AWS Certified Data Engineer — Associate Exam Guide PDF — the single most important document. Re-read weekly.
- AWS Skill Builder Learning Plan — "Standard Exam Prep Plan: AWS Certified Data Engineer — Associate" (free Skill Builder tier).
- AWS Skill Builder Official Practice Question Set — ~20 official questions; same authoring team as the live exam; identical wording cadence (low-cost subscription).
- AWS Whitepapers (read just these, not all of them):
  - AWS Well-Architected Framework: Data Analytics Lens (essential).
  - Lake Formation Best Practices.
  - Big Data Analytics Options on AWS.
  - Data Warehousing on AWS.
  - Securing Data on AWS.
- AWS re:Invent talks — search YouTube for "DEA-C01 exam prep" and the latest re:Invent "What's new in Glue / EMR / Redshift" sessions.
Tier 2 — Hands-on (do every workshop you can fit).
- AWS Workshops (workshops.aws) — Data Engineering on AWS, Building Data Lakes, Lake Formation, Iceberg on AWS, Redshift Serverless workshops.
- AWS Skill Builder Builder Labs — paid sandbox accounts; 1-hour scoped labs with real consoles.
- AWS Free Tier sandbox — your own account; you can finish every lab in §4 for < $10 of charges if you tear down after each session.
- GitHub: the `aws-samples` org publishes CloudFormation + CDK templates for almost every reference architecture.
Tier 3 — Practice exams (the highest-leverage tier in W7).
- AWS Official Practice Question Set — the gold standard; 20 questions, same authors as the live exam.
- Tutorials Dojo (Jon Bonso) — widely considered the closest third-party question style; ~390 questions across multiple test modes.
- Whizlabs — older third-party; cheaper; question quality is mixed but volume is high.
- Stéphane Maarek / Neal Davis Udemy practice tests — variable quality but cheap; useful as fill-in volume.
- Score-target rule — aim for ≥ 80% on three different providers before booking the exam.
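The score-target rule is easy to track in a few lines. This is a minimal sketch; the provider names and scores are hypothetical, and `ready_to_book` is an illustrative helper rather than part of any tool.

```python
PASS_BAR = 0.80          # threshold from the score-target rule
PROVIDERS_REQUIRED = 3   # number of different providers that must clear it


def ready_to_book(best_scores: dict[str, float]) -> bool:
    """True when at least three providers each have a score at or above 80%."""
    cleared = [name for name, score in best_scores.items() if score >= PASS_BAR]
    return len(cleared) >= PROVIDERS_REQUIRED


# Hypothetical scores, for illustration only.
scores = {"official": 0.85, "provider_b": 0.82, "provider_c": 0.78}
print(ready_to_book(scores))
```

Recording only your best score per provider keeps the check honest: re-taking the same question bank until you memorise it does not add a new provider.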
Tier 4 — Exam day.
- Pearson VUE test centre — quiet, in-person, no proctor camera; book early as slots fill quickly.
- PSI test centre — alternative test centre operator; similar experience.
- Online proctored (Pearson VUE OnVUE): at-home; webcam, microphone, room scan; bring your patience, because check-in can take 30 minutes.
Online-proctor checklist (do this 24 hours before).
- Quiet room with a door you can lock; no second monitor; clear desk.
- Government ID (passport or driver's licence); name must match the registration exactly.
- Webcam and microphone working; run the proctor app's system test 24 hours in advance.
- Wired internet if possible; mobile hotspot as backup.
- Close every other app; the proctor app will refuse to start otherwise.
- No water bottle on the desk during the exam (rules vary by provider; check yours).
- Bathroom break — allowed but the clock keeps running; pee first.
Exam-day time budget.
- 130 minutes / 85 questions = ~92 seconds per question.
- Aim for 75 seconds per question on the first pass, leaving ~15 minutes for the flagged-question second pass.
- Flag any question you're not 90% sure of; don't agonise. Come back.
- Never leave a question blank — there's no penalty for wrong answers; guess if you must.
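The arithmetic behind this budget is worth checking once for yourself. The sketch below uses the question count and pace stated in this guide; adjust the constants if your exam appointment shows different numbers.

```python
EXAM_MINUTES = 130
QUESTIONS = 85            # question count as stated in this guide
FIRST_PASS_SECONDS = 75   # target pace per question on the first pass

average_seconds = EXAM_MINUTES * 60 / QUESTIONS
first_pass_minutes = QUESTIONS * FIRST_PASS_SECONDS / 60
reserve_minutes = EXAM_MINUTES - first_pass_minutes

print(f"Average budget: {average_seconds:.0f} s per question")
print(f"First pass:     {first_pass_minutes:.0f} min")
print(f"Reserve left:   {reserve_minutes:.0f} min for flagged questions and a final check")
```

At a 75-second pace the first pass takes a little over 106 minutes, so the reserve is roughly 24 minutes: enough for a second pass of about 15 minutes plus a short final check.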
The two-pass strategy.
- Pass 1 (~105 minutes): answer everything quickly at about 75 seconds per question; flag the unsure ones.
- Pass 2 (~15 minutes): re-read every flagged question; eliminate one wrong option at a time.
- Last 10 minutes: sanity-check the unflagged answers; trust your gut on flagged ones.
Pattern-matching tricks for the day.
- "No custom code" / "no operational overhead" → fully managed (Firehose, Athena, Glue Studio, MWAA, Step Functions).
- "Sub-second latency" → DynamoDB or Redshift Serverless (not Athena, not Glue, not S3 alone).
- "Petabyte-scale ad-hoc SQL on S3" → Athena or Redshift Spectrum (not Aurora).
- "Multi-team, different column subsets" → Lake Formation column grants (not IAM-only).
- "Cost-optimise S3" → Lifecycle → Intelligent-Tiering or Glacier; compression + partition.
- "Stream + windowed aggregation" → Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) or Spark Structured Streaming on EMR.
- "Exactly-once + ACID on a data lake" → Iceberg / Hudi / Delta — not raw Parquet.
- "Audit who queried what" → CloudTrail data events + S3 access logs.
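One way to drill these cues is to keep them in a small lookup and test yourself against practice stems. The sketch below copies a few of the cue phrases above into a dictionary; the `match_cues` helper and the sample stem are hypothetical, written only for illustration.

```python
# Cue phrases and answers copied from a few of the bullets above; extend as you study.
CUES = {
    "no operational overhead": "fully managed (Firehose, Athena, Glue Studio, MWAA, Step Functions)",
    "ad-hoc sql on s3": "Athena or Redshift Spectrum",
    "different column subsets": "Lake Formation column grants",
    "acid on a data lake": "Iceberg / Hudi / Delta",
}


def match_cues(stem: str) -> list[str]:
    """Return the answer families whose cue phrase appears in the question stem."""
    text = stem.lower()
    return [answer for cue, answer in CUES.items() if cue in text]


# Hypothetical stem, written for illustration; not a real exam question.
stem = "Several teams need different column subsets of one table with no operational overhead."
for answer in match_cues(stem):
    print(answer)
```

A stem that triggers two cues, as this one does, usually means the correct option satisfies both constraints at once.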
What happens after you click Submit.
- Provisional pass / fail shown on the screen immediately (online proctor) or at the test centre.
- Official score report in your AWS Certification account within 5 business days.
- Detailed domain-level breakdown in the score report; useful even if you passed (to see what to reinforce).
- Digital badge issued via Credly within a week.
- Validity — 3 years; recertify by passing the latest exam version (no separate "recert" exam).
If you fail.
- Wait 14 days before retake; AWS-enforced cooldown.
- Re-read your score report; identify the bottom domain.
- Spend two weeks rebuilding that domain end-to-end (re-do the lab, re-read the whitepaper, re-take the practice test).
- Retake fee — full USD 150 again.
Common day-of mistakes.
- Over-thinking obvious questions: if all but one option are obvious eliminations, pick the remaining one and move on.
- Changing too many answers on Pass 2 — first-instinct accuracy is usually higher; only change if you spot a misread.
- Misreading "EXCEPT" or "NOT" — the exam loves negation in stems; underline the negation on your scratchpad.
- Running out of time on the last 10 questions — the two-pass strategy prevents this; stick to it.
Choosing the right DEA-C01 study lever (cheat sheet)
A one-screen cheat sheet for the most-asked AWS DEA-C01 prep questions.
| You want to … | Lever | Notes |
|---|---|---|
| Understand the exam scope | Exam Guide PDF | Single source of truth; re-read weekly |
| Build foundational fluency | AWS Skill Builder Learning Plan | Free tier covers most of it |
| Get hands dirty | Six labs from §4 | Build them on your own AWS account |
| Practice exam writing | Tutorials Dojo + Skill Builder Official Practice | Target ≥ 80% on three providers |
| Compare Glue vs EMR vs Lambda | Domain 1 of Exam Guide | Pick by code-amount + scale + cost |
| Compare Redshift vs Athena vs Aurora | Domain 2 of Exam Guide | Pick by query pattern + freshness + scale |
| Understand Lake Formation | Whitepaper + Lab 6 | LF-Tags + column-level grants are exam-favourite |
| Get fluent with Step Functions | AWS Workshop + Lab 5 stretch goal | Retry / Catch / Distributed Map are tested |
| Tune Athena cost | Lab 1 stretch goal | Partition + Parquet + compression; never `SELECT *` |
| Diagnose a slow Glue job | Glue job metrics in CloudWatch | Spark UI + executor count; consider auto-scaling |
| Book the exam | Pearson VUE or PSI portal via aws.training | Morning slot if early-bird; mid-afternoon if not |
| Pass the exam | Two-pass strategy + flag every uncertain question | 75s per question on Pass 1, ~15 min for Pass 2 |
| Recertify in 3 years | Pass the latest DEA-C01 version | No separate "recert" exam |
| Upgrade after DEA-C01 | SAP-C02 (Solutions Architect Professional) | Or MLS-C01 if you're pivoting to ML |
Frequently asked questions
Is the AWS DEA-C01 certification worth it in 2026?
Yes — for data engineers already on AWS or moving to an AWS-shop company, AWS DEA-C01 is the most relevant cloud certification on the market. It's the first AWS associate-tier certification built specifically for the data-engineering job description (ingest, store, transform, secure, operate) rather than bolting analytics onto a generalist track. Most AWS-shop employers (Amazon, Capital One, JPMorgan, Disney+, hundreds of mid-market companies) explicitly list DEA-C01 in DE job descriptions; some pay a per-cert bonus. The cert also signals that you've built (not just read about) lakehouse, streaming, orchestration, and governance pipelines — which is exactly what AWS-shop interviews probe. If you're on GCP or Azure, the equivalents (Google PDE, Azure DP-203) carry the same signal in their ecosystems; DEA-C01 is specifically the AWS one.
How long does it take to study for the DEA-C01?
DEA-C01 study time depends heavily on your starting point. For an engineer with 2-3 years of AWS data-engineering experience, 8 weeks at ~8 hours per week (64 hours total) is comfortable; see the W1-W8 plan in §3. For a candidate with general AWS experience (e.g. SAA-C03 holder) but light data exposure, plan 10-12 weeks, which adds two to four extra weeks of labs. For a complete AWS beginner, 3-4 months is realistic, because you'll need to learn IAM, VPC, and S3 basics before tackling the data services. The single biggest time-saver is building the labs (§4): reading without console reps leaves the kinds of gaps that scenario questions exploit. The single biggest time-waster is watching tutorial videos without taking notes or building anything alongside.
DEA-C01 vs DAS-C01 — what's the difference?
DEA-C01 vs DAS-C01 is a non-question in 2026 because DAS-C01 (Data Analytics — Specialty) was retired in April 2024. DEA-C01 (Data Engineer — Associate) is its spiritual successor for the engineering side. The differences: DEA-C01 is associate-tier (DAS-C01 was specialty-tier), cheaper (USD 150 vs USD 300), shorter (130 vs 180 minutes), more questions (~85 vs 65), and explicitly covers modern services that didn't exist when DAS-C01 was written — Glue Studio, Lake Formation column-level grants, Iceberg on Athena, Redshift Serverless, MWAA, DataZone, Step Functions Distributed Map. DAS-C01 also leaned heavier on QuickSight and BI; DEA-C01 is heavier on pipeline orchestration and governance. If you're still seeing DAS-C01 in a study guide, that guide is outdated — use DEA-C01 material.
What kind of salary uplift does DEA-C01 unlock?
The certification itself isn't a magic salary lever — the underlying skills and the projects you can now talk about are. That said, US-market signals (Levels.fyi, Glassdoor, Burning Glass / Lightcast data through 2025) show DEA-C01-holders self-reporting a 5-15% uplift when they switch employers, with the higher end concentrated in financial-services + healthcare AWS-shop roles where the cert is a soft prerequisite for compliance interviews. Internal promotion uplift is usually smaller (1-3%) but the cert often accelerates promotion by 6-12 months by un-blocking access to more senior data-platform projects. The biggest non-salary win is role mobility — the cert is one of the few credentials that's portable across every AWS-shop employer worldwide, which expands your hiring pool dramatically.
How many practice tests should I take before booking the exam?
DEA-C01 practice tests are the single highest-leverage activity in W7 — plan for at least four full-length timed mocks (65 questions, 130 minutes, lock yourself in a room). Take one official AWS Skill Builder practice set (20 questions, same authors as the live exam) plus three third-party 65-question mocks (Tutorials Dojo is the closest in style; Whizlabs and Stéphane Maarek's Udemy sets fill volume). Your score-target rule: ≥ 80% on three different providers before booking the exam, and ≥ 85% on the final mock the day before. Mock-exam review is more important than the mock itself — for every wrong answer, write a one-line "why I missed it" note in a single document, re-read that document the morning of the exam. Skipping mocks is the #1 reason candidates who "felt ready" fail on the day.
Can I pass DEA-C01 without hands-on AWS experience?
Technically yes, practically no. The exam is heavily scenario-based — almost every question describes a real architecture and asks you to choose the service combination that meets the constraints. Without hands-on console time you'll struggle to weigh trade-offs (Glue vs EMR vs Lambda; Redshift vs Athena vs Aurora; Kinesis Data Streams vs Firehose vs MSK) because those trade-offs are easier to feel than to memorise. The good news is AWS Free Tier + the six labs in §4 cost less than USD 10 in total charges if you tear down each lab after the session. If your employer doesn't give you a sandbox account, spin up a personal one for the prep period. Two months of weekend console time will outperform two months of pure video watching on every metric the exam cares about.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including Python practice and SQL practice keyed to the same patterns the AWS DEA-C01 exam tests: pipeline thinking, partition / cost trade-offs, streaming aggregation, set-based SQL on lake / warehouse tables, and the operational scenarios that show up in both certification mocks and real interview loops. Whether you're prepping for the cert, a DE interview at an AWS-shop company, or both, the practice library mirrors the same mental model this roadmap teaches.
Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse SQL practice →; reinforce data-manipulation drills →; widen coverage on the full streaming Python library →.




