DEV Community

Mwenda Harun Mbaabu


Comprehensive LuxDevHQ Data Engineering Course Guide

This comprehensive course spans 4 months (16 weeks) and equips learners with expertise in Python, SQL, Azure, AWS, Apache Airflow, Kafka, Spark, and more.

  • Learning Days: Monday to Thursday (theory and practice).
  • Friday: Job shadowing or peer projects.
  • Saturday: Hands-on lab sessions and project-based learning.

Month 1: Foundations of Data Engineering

Week 1: Onboarding and Environment Setup

  • Monday:
    • Onboarding, course overview, career pathways, tools introduction.
  • Tuesday:
    • Introduction to cloud computing (Azure and AWS).
  • Wednesday:
    • Data governance, security, compliance, and access control.
  • Thursday:
    • Introduction to SQL for data engineering and PostgreSQL setup.
  • Friday:
    • Peer Project: Environment setup challenges.
  • Saturday (Lab):
    • Mini Project: Build a basic pipeline with PostgreSQL and Azure Blob Storage.

Week 2: SQL Essentials for Data Engineering

  • Monday:
    • Core SQL concepts (SELECT, WHERE, JOIN, GROUP BY).
  • Tuesday:
    • Advanced SQL techniques: recursive queries, window functions, and CTEs.
  • Wednesday:
    • Query optimization and execution plans.
  • Thursday:
    • Data modeling: normalization, denormalization, and star schemas.
  • Friday:
    • Job Shadowing: Observe senior engineers writing and optimizing SQL queries.
  • Saturday (Lab):
    • Mini Project: Create a star schema and analyze data using SQL.

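As a small illustration of the CTEs and window functions covered this week, here is a sketch using Python's built-in sqlite3 module standing in for PostgreSQL (the `sales` table and its values are invented for the example):

```python
import sqlite3

# In-memory database standing in for PostgreSQL; table and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 200), ('west', 50), ('west', 300);
""")

# A CTE plus a window function: rank each sale within its region by amount.
rows = conn.execute("""
    WITH regional AS (
        SELECT region, amount FROM sales
    )
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM regional
    ORDER BY region, rnk;
""").fetchall()

for region, amount, rnk in rows:
    print(region, amount, rnk)
```

The same `RANK() OVER (PARTITION BY ...)` syntax carries over to PostgreSQL unchanged, which is why sqlite3 works well for practicing these queries locally.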
Week 3: Introduction to Data Pipelines

  • Monday:
    • Theory: Introduction to ETL/ELT workflows.
  • Tuesday:
    • Lab: Create a simple Python-based ETL pipeline for CSV data.
  • Wednesday:
    • Theory: Extract, transform, load (ETL) concepts and best practices.
  • Thursday:
    • Lab: Build a Python ETL pipeline for batch data processing.
  • Friday:
    • Peer Project: Collaborate to design a basic ETL workflow.
  • Saturday (Lab):
    • Mini Project: Develop a simple ETL pipeline to process sales data.

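The extract/transform/load steps from this week can be sketched in a few lines of standard-library Python. The CSV content here is invented; in the lab the source and destination would be real files or database tables:

```python
import csv
import io

# Hypothetical raw sales data; in practice this would be read from disk or object storage.
raw = io.StringIO(
    "date,product,qty,price\n"
    "2024-01-01,widget,2,9.99\n"
    "2024-01-01,gadget,1,24.50\n"
)

# Extract: read rows from the CSV source.
rows = list(csv.DictReader(raw))

# Transform: normalize types and derive a revenue column.
for row in rows:
    row["revenue"] = int(row["qty"]) * float(row["price"])

# Load: write the enriched rows to a destination (here another CSV buffer).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "product", "qty", "price", "revenue"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Every batch pipeline in later weeks keeps this same shape; only the extract sources, transform logic, and load targets grow more sophisticated.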
Week 4: Introduction to Apache Airflow

  • Monday:
    • Theory: Introduction to Apache Airflow, DAGs, and scheduling.
  • Tuesday:
    • Lab: Set up Apache Airflow and create a basic DAG.
  • Wednesday:
    • Theory: DAG best practices and scheduling in Airflow.
  • Thursday:
    • Lab: Integrate Airflow with PostgreSQL and Azure Blob Storage.
  • Friday:
    • Job Shadowing: Observe real-world Airflow pipelines.
  • Saturday (Lab):
    • Mini Project: Automate an ETL pipeline with Airflow for batch data processing.

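The core idea behind an Airflow DAG — tasks run only after their upstream dependencies complete — can be simulated with the standard library's `graphlib`. This is a conceptual stand-in only (the task names are invented); a real DAG uses `airflow.DAG` and operators:

```python
from graphlib import TopologicalSorter

# A toy task graph mirroring Airflow's extract >> transform >> load pattern.
results = []

def extract():
    results.append("extract")

def transform():
    results.append("transform")

def load():
    results.append("load")

# Mapping: task -> set of upstream dependencies (like Airflow's >> operator).
dag = {transform: {extract}, load: {transform}}

# TopologicalSorter yields each task only after its dependencies have run.
for task in TopologicalSorter(dag).static_order():
    task()

print(results)
```

Airflow adds scheduling, retries, and distributed execution on top of exactly this dependency-ordering idea.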
Month 2: Intermediate Tools and Concepts

Week 5: Data Warehousing and Data Lakes

  • Monday:
    • Theory: Introduction to data warehousing (OLAP vs. OLTP, partitioning, clustering).
  • Tuesday:
    • Lab: Work with Amazon Redshift and Snowflake for data warehousing.
  • Wednesday:
    • Theory: Data lakes and Lakehouse architecture.
  • Thursday:
    • Lab: Set up Delta Lake for raw and curated data.
  • Friday:
    • Peer Project: Implement a data warehouse model and data lake for sales data.
  • Saturday (Lab):
    • Mini Project: Design and implement a basic Lakehouse architecture.

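A key idea from this week — partitioning data by a column such as date so queries can skip irrelevant files — can be sketched with plain directories. The events and the Hive-style `date=...` layout below are illustrative; warehouses and lakes apply the same pattern at scale:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical events; partitioning by event date lets engines prune scans.
events = [
    {"date": "2024-01-01", "user": "a", "amount": 10},
    {"date": "2024-01-01", "user": "b", "amount": 20},
    {"date": "2024-01-02", "user": "a", "amount": 5},
]

lake = Path(tempfile.mkdtemp())
for event in events:
    # Hive-style partition directory: raw/date=YYYY-MM-DD/part.csv
    part_dir = lake / "raw" / f"date={event['date']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part.csv", "a", newline="") as f:
        csv.writer(f).writerow([event["user"], event["amount"]])

partitions = sorted(p.name for p in (lake / "raw").iterdir())
print(partitions)
```

Redshift, Snowflake, and Delta Lake each expose this concept with their own syntax, but the physical intuition is the same.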
Week 6: Data Governance and Security

  • Monday:
    • Theory: Data governance frameworks and data security principles.
  • Tuesday:
    • Lab: Use AWS Lake Formation for access control and security enforcement.
  • Wednesday:
    • Theory: Managing sensitive data and compliance (GDPR, HIPAA).
  • Thursday:
    • Lab: Implement security policies in S3 and Azure Blob Storage.
  • Friday:
    • Job Shadowing: Observe senior engineers applying governance policies.
  • Saturday (Lab):
    • Mini Project: Secure data in the cloud using AWS and Azure.

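One governance technique from this week, pseudonymizing sensitive fields before they reach a shared zone, can be shown with a salted hash. The records, salt, and helper name are invented for illustration; real deployments manage salts and keys in a secrets store, not in code:

```python
import hashlib

# Hypothetical customer records containing PII (emails).
records = [
    {"email": "alice@example.com", "plan": "pro"},
    {"email": "bob@example.com", "plan": "free"},
]

def pseudonymize(value: str, salt: str = "course-demo-salt") -> str:
    # Salted SHA-256 so the raw email never lands in the curated zone.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

masked = [{**r, "email": pseudonymize(r["email"])} for r in records]
print(masked)
```

Access-control tools like AWS Lake Formation complement this: hashing limits what leaks if data is read, while IAM-style policies limit who can read it at all.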
Week 7: Real-Time Data Processing with Kafka

  • Monday:
    • Theory: Introduction to Apache Kafka for real-time data streaming.
  • Tuesday:
    • Lab: Set up a Kafka producer and consumer.
  • Wednesday:
    • Theory: Kafka topics, partitions, and message brokers.
  • Thursday:
    • Lab: Integrate Kafka with PostgreSQL for real-time updates.
  • Friday:
    • Peer Project: Build a real-time Kafka pipeline for transactional data.
  • Saturday (Lab):
    • Mini Project: Create a pipeline to stream e-commerce data with Kafka.

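Kafka's topic/partition model can be simulated in-process: messages with the same key always land in the same partition, which is how Kafka preserves per-key ordering. This sketch only illustrates the idea (real code would use a client such as kafka-python or confluent-kafka, and Kafka's default partitioner hashes the key with murmur2 rather than CRC32):

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3
topic = defaultdict(list)  # partition id -> ordered list of messages

def produce(key: str, value: str):
    # Same key -> same partition, so per-key ordering is preserved.
    partition = zlib.crc32(key.encode()) % NUM_PARTITIONS
    topic[partition].append((key, value))

for order_id, event in [("o1", "created"), ("o2", "created"), ("o1", "paid")]:
    produce(order_id, event)

# A consumer reading o1's partition sees o1's events in production order.
partition_of_o1 = zlib.crc32(b"o1") % NUM_PARTITIONS
o1_events = [v for k, v in topic[partition_of_o1] if k == "o1"]
print(o1_events)
```

Choosing a good key (here the order ID) is the main design decision: it determines both ordering guarantees and how evenly load spreads across partitions.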
Week 8: Batch vs. Stream Processing

  • Monday:
    • Theory: Introduction to batch vs. stream processing.
  • Tuesday:
    • Lab: Batch processing with PySpark.
  • Wednesday:
    • Theory: Combining batch and stream processing workflows.
  • Thursday:
    • Lab: Real-time processing with Apache Flink and Spark Streaming.
  • Friday:
    • Job Shadowing: Observe a real-time processing pipeline.
  • Saturday (Lab):
    • Mini Project: Build a hybrid pipeline combining batch and real-time processing.

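The batch-versus-stream distinction can be made concrete by computing the same metric two ways over invented sensor readings: once after all data has arrived, and once incrementally with bounded state, as a streaming engine would:

```python
from collections import deque

readings = [21, 22, 25, 30, 24, 23]

# Batch: process the complete dataset in one pass after it has all arrived.
batch_avg = sum(readings) / len(readings)

# Stream: update state per event; here a sliding window of the last 3 readings.
window = deque(maxlen=3)
stream_avgs = []
for r in readings:
    window.append(r)
    stream_avgs.append(round(sum(window) / len(window), 2))

print(batch_avg, stream_avgs[-1])
```

PySpark batch jobs generalize the first pattern across a cluster; Flink and Spark Streaming generalize the second, managing the windowed state for you.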
Month 3: Advanced Data Engineering

Week 9: Machine Learning Integration in Data Pipelines

  • Monday:
    • Theory: Overview of ML workflows in data engineering.
  • Tuesday:
    • Lab: Preprocess data for machine learning using Pandas and PySpark.
  • Wednesday:
    • Theory: Feature engineering and automated feature extraction.
  • Thursday:
    • Lab: Automate feature extraction using Apache Airflow.
  • Friday:
    • Peer Project: Build a simple pipeline that integrates ML models.
  • Saturday (Lab):
    • Mini Project: Build an ML-powered recommendation system in a pipeline.

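A minimal example of the preprocessing step: min-max scaling a numeric feature into [0, 1]. The values are invented; in the lab this would be pandas/PySpark or scikit-learn's `MinMaxScaler` applied to real feature columns:

```python
# Min-max scaling without external libraries.
ages = [18, 25, 40, 60]

lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]
print(scaled)
```

In a pipeline, `lo` and `hi` must be computed on training data only and reused for serving — recomputing them on new data silently changes the feature's meaning.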
Week 10: Spark and PySpark for Big Data

  • Monday:
    • Theory: Introduction to Apache Spark for big data processing.
  • Tuesday:
    • Lab: Set up Spark and PySpark for data analysis.
  • Wednesday:
    • Theory: Spark RDDs, DataFrames, and SQL.
  • Thursday:
    • Lab: Analyze large datasets using Spark SQL.
  • Friday:
    • Peer Project: Build a PySpark pipeline for large-scale data processing.
  • Saturday (Lab):
    • Mini Project: Analyze big data sets with Spark and PySpark.

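Spark's RDD model can be previewed with a single-machine word count in the same map/reduce shape. The lines below are invented; Spark's value is running each stage distributed across a cluster:

```python
from collections import Counter
from functools import reduce

lines = ["spark makes big data simple", "big data needs big tools"]

# Map: each line becomes (word, 1) pairs -- flatMap + map in Spark terms.
pairs = [(w, 1) for line in lines for w in line.split()]

# Reduce by key: merge counts per word, like RDD.reduceByKey(add).
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}), pairs, Counter())
print(counts["big"], counts["data"])
```

The equivalent PySpark is nearly line-for-line (`sc.parallelize(lines).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)`), which makes the local version a useful mental model.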
Week 11: Advanced Apache Airflow Techniques

  • Monday:
    • Theory: Advanced Airflow features (XCom, task dependencies).
  • Tuesday:
    • Lab: Implement dynamic DAGs and task dependencies in Airflow.
  • Wednesday:
    • Theory: Airflow scheduling, monitoring, and error handling.
  • Thursday:
    • Lab: Create complex DAGs for multi-step ETL pipelines.
  • Friday:
    • Job Shadowing: Observe advanced Airflow pipeline implementations.
  • Saturday (Lab):
    • Mini Project: Design an advanced Airflow DAG for complex data workflows.

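XCom, covered on Monday, lets one task push a small value that a downstream task pulls. This dict-based sketch is a stand-in for the concept only; the task names are invented, and real DAGs use `ti.xcom_push` / `ti.xcom_pull` inside operators:

```python
# A shared store standing in for Airflow's XCom backend.
xcom = {}

def extract_task():
    xcom["row_count"] = 42  # push a small metadata value downstream

def report_task():
    count = xcom["row_count"]  # pull it in a later task
    return f"extracted {count} rows"

extract_task()
message = report_task()
print(message)
```

The same caveat applies in real Airflow: XCom is for small metadata (counts, file paths), never for the data itself, which should flow through storage like S3 or a database.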
Week 12: Data Lakes and Delta Lake

  • Monday:
    • Theory: Data lakes, Lakehouses, and Delta Lake architecture.
  • Tuesday:
    • Lab: Set up Delta Lake on AWS for data storage and management.
  • Wednesday:
    • Theory: Managing schema evolution in Delta Lake.
  • Thursday:
    • Lab: Implement batch and real-time data loading to Delta Lake.
  • Friday:
    • Peer Project: Design a Lakehouse architecture for an e-commerce platform.
  • Saturday (Lab):
    • Mini Project: Implement a scalable Delta Lake architecture.

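Schema evolution, Wednesday's topic, can be illustrated in miniature: when a new batch arrives with an extra column, existing rows gain that column with a null value. The batches are invented; Delta Lake implements this behavior via its `mergeSchema` write option:

```python
# Two invented batches; the second introduces a new "currency" column.
batch_v1 = [{"id": 1, "amount": 10}]
batch_v2 = [{"id": 2, "amount": 5, "currency": "USD"}]

# Evolve: union of all column names, missing values filled with None.
columns = sorted({k for row in batch_v1 + batch_v2 for k in row})
table = [{c: row.get(c) for c in columns} for row in batch_v1 + batch_v2]
print(table)
```

Delta Lake additionally rejects writes that would *change* an existing column's type, which is the other half of schema enforcement.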
Month 4: Capstone Projects

Week 13: Batch Data Pipeline Development

  • Monday to Thursday:
    • Design and Implementation:
    • Build an end-to-end batch data pipeline for e-commerce sales analytics.
    • Tools: PySpark, SQL, PostgreSQL, Airflow, S3.
  • Friday:
    • Peer Review: Present progress and receive feedback.
  • Saturday (Lab):
    • Project Milestone: Finalize and present batch pipeline results.

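The analytical core of this capstone — aggregating revenue by day and category — can be prototyped with sqlite3 before scaling it up. The table and figures are invented; the real project would run this in PySpark or PostgreSQL, orchestrated by Airflow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (day TEXT, category TEXT, revenue REAL);
    INSERT INTO orders VALUES
        ('2024-01-01', 'books', 30.0),
        ('2024-01-01', 'books', 20.0),
        ('2024-01-01', 'toys',  15.0);
""")

# Daily revenue per category -- the batch job's output table.
report = conn.execute("""
    SELECT day, category, SUM(revenue) AS total
    FROM orders
    GROUP BY day, category
    ORDER BY total DESC;
""").fetchall()
print(report)
```

Prototyping the aggregation logic locally first makes the Airflow/PySpark version mostly a matter of swapping the execution engine.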
Week 14: Real-Time Data Pipeline Development

  • Monday to Thursday:
    • Design and Implementation:
    • Build an end-to-end real-time data pipeline for IoT sensor monitoring.
    • Tools: Kafka, Spark Streaming, Flink, S3.
  • Friday:
    • Peer Review: Present progress and receive feedback.
  • Saturday (Lab):
    • Project Milestone: Finalize and present real-time pipeline results.

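A sketch of this capstone's streaming logic: flag a sensor reading as anomalous when it deviates sharply from the recent moving average. The readings, window size, and threshold are invented; the real pipeline would consume from Kafka and run the check in Spark Streaming or Flink:

```python
from collections import deque

window = deque(maxlen=5)  # bounded state over recent readings
alerts = []

for reading in [20.1, 20.3, 19.9, 20.2, 35.0, 20.0]:
    # Compare against the moving average once we have enough history.
    if len(window) >= 3 and abs(reading - sum(window) / len(window)) > 5:
        alerts.append(reading)
    window.append(reading)

print(alerts)
```

Keeping the state bounded (a fixed-size window) is what makes this logic viable on an unbounded stream — the same constraint Flink and Spark Streaming enforce through their windowing APIs.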
Week 15: Final Project Integration

  • Monday to Thursday:
    • Design and Implementation:
    • Integrate both batch and real-time pipelines for a comprehensive end-to-end solution.
    • Tools: Kafka, PySpark, Airflow, Delta Lake, PostgreSQL, and S3.
  • Friday:
    • Job Shadowing: Observe senior engineers integrating complex pipelines.
  • Saturday (Lab):
    • Project Milestone: Showcase integrated solution for review.

Week 16: Capstone Project Presentation

  • Monday to Thursday:
    • Final Presentation Preparation:
    • Polish, test, and document the final project.
  • Friday:
    • Peer Review: Present final projects to peers and receive feedback.
  • Saturday (Lab):
    • Capstone Presentation: Showcase completed capstone projects to industry professionals and instructors.
