In this blog post, I will explain how to prepare for the AWS Data Analytics Specialty exam. It is advisable to already have an Associate certification so that you are familiar with the main AWS services.
The exam covers the following domains:
Domain 1: Collection - 18%
Domain 2: Storage & Data Management - 22%
Domain 3: Processing - 24%
Domain 4: Analysis and Visualization - 18%
Domain 5: Security - 18%
I passed the exam in March 2023, and here are my personal steps for preparing:
- Read the official AWS exam guide. It describes the knowledge, topics, and services that will be covered. I discovered that Kinesis Video Streams is not included in the exam, although it briefly appeared in a course and in practice exams.
- I recommend watching a video course on a platform like A Cloud Guru, Cloud Academy, or WhizLabs, especially if you have limited experience working with AWS. For me, writing notes during the course and taking screenshots of the presentations are helpful, so I can go over them later.
- Take note of the AWS services to put your primary focus on, like Glue, the Kinesis family, Redshift, QuickSight, OpenSearch, Athena, EMR, and S3. Additionally, note the services that are still relevant but less important, such as Lake Formation, AWS MSK, DMS, or DataSync.
- The next step is to get a deep understanding of the "top-level" services and a more general understanding of the "second-level" services. Courses usually can't cover every available feature and all of the best practices, so go through the documentation and FAQs for each service and write down any additional information that you find significant. Pay attention to how the service integrates with other services, how it handles encryption, logging, user access control, and sharing between accounts/regions, and what its suitable use cases are. Ask yourself when you would choose that service over another, taking into account the following:
- cost
- how easy it is to set up and maintain
- whether it is (near) real-time or a delay is acceptable
For example, you can compare Kinesis vs. AWS MSK or vs. SQS FIFO. On this website, you can find multiple comparison tables, for example Kinesis vs. AWS MSK.
Consider common errors and CloudWatch metrics, and how to resolve the issue or improve performance. Here are a few examples (see the sketch after this list):
- OpenSearch - JVMMemoryPressure
- Kinesis - ProvisionedThroughputExceededException
- Kinesis Data Analytics - MillisBehindLatest
- EMR - YarnMemoryAvailablePercentage
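For instance, here is a minimal sketch (Python/boto3) of checking one of these metrics, JVMMemoryPressure for an OpenSearch domain, from CloudWatch; the domain name and account ID are hypothetical placeholders:

```python
# Minimal sketch: pull the JVMMemoryPressure metric for an OpenSearch domain.
# "my-domain" and the account ID are made-up placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",  # OpenSearch Service metrics are published under AWS/ES
    MetricName="JVMMemoryPressure",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},
        {"Name": "ClientId", "Value": "123456789012"},  # your AWS account ID
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute data points
    Statistics=["Maximum"],
)

# Sustained high values typically mean too many shards per node or undersized data nodes.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```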
- On YouTube, you can find great AWS Tech Talks with best practices for specific services, which I recommend watching. I also recommend checking out the Johnny Chivers channel, which contains many videos on all the Data Analytics services with hands-on examples.
- The last step is solving practice exams to test your knowledge. I recommend the Tutorials Dojo practice tests, as they were the closest to the actual exam and contained great explanations for each question. I also found 25 free questions from WhizLabs and a useful YouTube video.
Here are my notes on what to definitely research for each service:
Redshift:
- Distribution styles
- Redshift Spectrum
- GRANT / REVOKE statements
- KMS / HSM encryption
- Classic, elastic resize, concurrency scaling
- VACUUM
- COPY Command (sources, syntax, the number of files should be a multiple of the number of slices, the files should be roughly the same size, between 1 MB and 1 GB after compression, using a manifest file; see the sketch after this list)
- UNLOAD Command
- Audit logging
- Materialized views
- Snapshots
- Data API
- WLM
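Since the COPY command and the Data API come up together, here is a minimal sketch (Python/boto3) of running a COPY with a manifest file through the Redshift Data API; the cluster, database, user, IAM role, and S3 paths are hypothetical placeholders:

```python
# Minimal sketch: run a COPY with a manifest file via the Redshift Data API.
# Cluster, database, user, role ARN, and S3 paths are made-up placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
GZIP DELIMITER ','
MANIFEST;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",  # alternatively, pass SecretArn to authenticate via Secrets Manager
    Sql=copy_sql,
)

# The Data API is asynchronous; poll describe_statement for the outcome.
print(redshift_data.describe_statement(Id=response["Id"])["Status"])
```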
Glue:
- Glue Crawlers
- Glue Jobs (DPUs, types, transformations, bookmarks; see the sketch after this list)
- Glue Triggers
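A minimal sketch (Python/boto3) of starting a Glue job with job bookmarks enabled, so each run only processes data that has not been seen before; the job name and worker settings are hypothetical:

```python
# Minimal sketch: start a Glue job run with job bookmarks enabled.
# "my-etl-job" and the worker settings are made-up placeholders.
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="my-etl-job",
    Arguments={
        # Special job parameter: job-bookmark-enable | job-bookmark-disable | job-bookmark-pause
        "--job-bookmark-option": "job-bookmark-enable",
    },
    WorkerType="G.1X",
    NumberOfWorkers=10,
)

status = glue.get_job_run(JobName="my-etl-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```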
Athena:
- Integrations
- File format conversion (see the sketch after this list)
- Workgroups
- Partitioning
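A minimal sketch (Python/boto3) of an Athena CTAS query that converts raw CSV data into partitioned Parquet; the database, tables, bucket, and workgroup are hypothetical:

```python
# Minimal sketch: Athena CTAS that rewrites a raw table as partitioned Parquet.
# Database, table, bucket, and workgroup names are made-up placeholders.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE logs_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/curated/logs_parquet/',
    partitioned_by = ARRAY['year']
) AS
SELECT request_id, status, year  -- partition columns must come last in the SELECT
FROM logs_raw;
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```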
QuickSight:
- Pay attention to what is available only in the Enterprise edition
- When to use different types of charts
- Embedding in a website or app (see the sketch after this list)
- Data encryption (at rest and in transit)
- Connecting to resources in a private subnet
- Authentication options
- Permission to access only specific tables or rows
- Refreshing data
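A minimal sketch (Python/boto3) of generating an embed URL for a registered QuickSight user, which is an Enterprise edition feature; the account ID, user ARN, and dashboard ID are hypothetical:

```python
# Minimal sketch: generate a dashboard embed URL for a registered QuickSight user.
# Account ID, user ARN, and dashboard ID are made-up placeholders.
import boto3

quicksight = boto3.client("quicksight")

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst",
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "11111111-2222-3333-4444-555555555555"}
    },
    SessionLifetimeInMinutes=60,
)

# Load this URL in an iframe inside your website or app.
print(response["EmbedUrl"])
```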
Kinesis:
- For each service, what the producers (sources) and consumers (destinations) are
- Data streams - limits, shards, enhanced fan-out, KCL/KPL/Kinesis Agent/PutRecord(s) (see the sketch after this list)
- Firehose - record format conversion and Lambda transformation, buffer size and interval, compression, PutRecord/PutRecordBatch, source record backup
- Data Analytics - SQL/Apache Flink app, multiple in-application input streams, windowed queries, Random Cut Forest function
- AWS MSK - ZooKeeper nodes and broker nodes, writing to topics, scaling the cluster
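A minimal sketch (Python/boto3) of writing a record to a Kinesis data stream with PutRecord; the stream name and payload are hypothetical:

```python
# Minimal sketch: write a single record to a Kinesis data stream.
# "my-stream" and the event payload are made-up placeholders.
import json

import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-42", "action": "click"}

kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # records with the same key always land on the same shard
)
```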
EMR:
- Primary, core, and task nodes + when to use on-demand vs. spot instances (instance fleets vs. instance groups)
- Storage - HDFS vs. EMRFS
- Transient vs. Long-running clusters (see the sketch after this list)
- Bootstrap actions, jobs, steps
- Security configuration for Encryption, Kerberos authentication, and EMRFS authorization for S3
- Data replication across nodes
- Compression algorithms
- S3DistCp
- Managed Scaling vs. Custom Auto Scaling
- Storing logs (S3)
- Orchestration with Step Functions
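A minimal sketch (Python/boto3) of a transient EMR cluster that runs a single Spark step, writes its logs to S3, and terminates when the step finishes; all names, paths, and roles are hypothetical:

```python
# Minimal sketch: transient EMR cluster with one Spark step and S3 logging.
# Names, S3 paths, instance types, and roles are made-up placeholders.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-spark-job",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after the steps finish
    },
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```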
OpenSearch:
- Data sources
- Shards, replicas
- Storage types (Hot, UltraWarm, Cold) + ISM
- Slow logs
- Cross-cluster search and replication + Multi-AZ deployment
- SAML authentication
- Fine-grained access control
- Refresh interval (see the sketch after this list)
- OpenSearch Dashboards - authentication, permissions, and sharing
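A minimal sketch of increasing an index's refresh interval during heavy ingestion (fewer refreshes improve indexing throughput at the cost of slightly stale search results), assuming fine-grained access control with a master user over HTTP basic auth; the endpoint, index, and credentials are hypothetical:

```python
# Minimal sketch: raise the refresh_interval of an OpenSearch index.
# Endpoint, index name, and credentials are made-up placeholders.
import requests

endpoint = "https://my-domain.us-east-1.es.amazonaws.com"

response = requests.put(
    f"{endpoint}/my-index/_settings",
    auth=("master-user", "master-password"),  # fine-grained access control master user
    json={"index": {"refresh_interval": "30s"}},  # default is 1s
    timeout=10,
)
print(response.json())
```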
Lake Formation:
- Blueprints
- Handling permissions and sharing between accounts
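A minimal sketch (Python/boto3) of granting a principal SELECT on a single table through Lake Formation instead of broad S3/Glue IAM policies; the role ARN, database, and table are hypothetical:

```python
# Minimal sketch: grant SELECT on one table to an IAM role via Lake Formation.
# Role ARN, database, and table names are made-up placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```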
DMS:
- Tasks + change data capture (CDC)
DataSync:
- Compare with Transfer family and Snowball
If I have missed something or you have additional recommendations, please leave a comment. Hopefully, my steps and bullet points will help you to prepare well, and I wish you the best of luck if you've already booked the exam :)