If you want to go beyond basic Azure demos and build something closer to a real-world data platform, this guide walks you through exactly that.
In this tutorial, you'll build:
- A secure Azure landing zone using Terraform
- A data lake using ADLS Gen2
- Data pipelines using Azure Data Factory
- Event-driven ingestion (auto-triggered pipelines)
- Managed identity-based access (no secrets)
- Source-controlled pipelines in GitHub
Full code here: https://github.com/ranjanm1/secure-azure-genomics-demo
Why This Project Matters
In industries like healthcare and life sciences, organisations deal with highly sensitive data such as patient records and genomics information. These datasets need to be processed securely, reliably, and at scale.
This project demonstrates how a secure, cloud-based data platform can be built on Azure to ingest, process, and store such data using modern engineering practices. It focuses on key real-world requirements such as data security, automated ingestion, and maintainability through Infrastructure as Code and source-controlled pipelines.
What We're Building
A simplified (but realistic) data platform:
ADF (Managed Identity)
        ↓
ADLS Gen2
├── raw
├── curated
├── audit
└── reference
With:
- Private endpoints
- RBAC
- Validation logic
- Event triggers
Architecture Diagram
(The architecture diagram is included in the repo's docs/ folder.)
Step 1 β Create the Terraform Project
Create a modular structure:
modules/
environments/dev/
Key modules:
- resource group
- networking (VNet + subnets)
- storage (ADLS Gen2)
- key vault
- log analytics
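A sketch of how the root configuration in environments/dev/ might wire these modules together (module paths, variable names, and the region are illustrative assumptions, not the repo's exact code):

```hcl
# environments/dev/main.tf -- illustrative root module wiring (names are hypothetical)
module "resource_group" {
  source   = "../../modules/resource_group"
  name     = "rg-dataplatform-dev"
  location = "uksouth"
}

module "networking" {
  source              = "../../modules/networking"
  resource_group_name = module.resource_group.name
  location            = module.resource_group.location
}

module "storage" {
  source              = "../../modules/storage"
  resource_group_name = module.resource_group.name
  location            = module.resource_group.location
  subnet_id           = module.networking.data_subnet_id  # used for the private endpoint
}
```

Passing outputs between modules like this keeps each module independent while the environment folder decides how they compose.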
Step 2 β Deploy Secure Networking
Create:
- VNet
- Subnets:
  - app
  - data
- Private endpoints
Example:
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-dataplatform-dev"          # hypothetical name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name   # assumes a resource group declared elsewhere
  address_space       = ["10.20.0.0/16"]
}
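The app and data subnets can then be carved out of that address space. A minimal sketch (subnet names and CIDR ranges are assumptions):

```hcl
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.20.1.0/24"]
}

resource "azurerm_subnet" "data" {
  name                 = "snet-data"   # will host the private endpoints
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.20.2.0/24"]
}
```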
Step 3 β Create ADLS Gen2 (Data Lake)
Enable hierarchical namespace:
is_hns_enabled = true
Create containers:
- raw
- curated
- audit
- reference
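A sketch of the storage module, assuming a hypothetical account name and the resource group from earlier steps. The `for_each` loop creates all four containers from one resource block:

```hcl
resource "azurerm_storage_account" "lake" {
  name                     = "stdataplatformdev"   # hypothetical; must be globally unique
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true   # turns the account into ADLS Gen2
}

# One Gen2 filesystem (container) per zone of the lake
resource "azurerm_storage_data_lake_gen2_filesystem" "zones" {
  for_each           = toset(["raw", "curated", "audit", "reference"])
  name               = each.key
  storage_account_id = azurerm_storage_account.lake.id
}
```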
Step 4 β Add Key Vault + Private Endpoints
Create:
- Key Vault
- Private endpoints for:
  - Storage
  - Key Vault
This ensures secure access (no public exposure).
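A private endpoint for the data lake might look like the sketch below (endpoint names are assumptions; the `dfs` subresource targets the ADLS Gen2 endpoint, while a Key Vault endpoint would use `vault` instead):

```hcl
resource "azurerm_private_endpoint" "storage_dfs" {
  name                = "pe-storage-dfs"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  subnet_id           = azurerm_subnet.data.id   # the data subnet from Step 2

  private_service_connection {
    name                           = "psc-storage-dfs"
    private_connection_resource_id = azurerm_storage_account.lake.id
    subresource_names              = ["dfs"]     # ADLS Gen2 endpoint; use ["vault"] for Key Vault
    is_manual_connection           = false
  }
}
```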
Step 5 β Add Logging
Deploy:
- Log Analytics workspace
- Diagnostic settings for:
  - Storage
  - Key Vault
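A sketch of the workspace plus one diagnostic setting, using Key Vault's `AuditEvent` log category as the example (resource names and retention are assumptions):

```hcl
resource "azurerm_log_analytics_workspace" "logs" {
  name                = "log-dataplatform-dev"   # hypothetical name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Ship Key Vault audit logs to the workspace; a similar block covers storage
resource "azurerm_monitor_diagnostic_setting" "kv" {
  name                       = "diag-keyvault"
  target_resource_id         = azurerm_key_vault.kv.id   # assumes the Key Vault from Step 4
  log_analytics_workspace_id = azurerm_log_analytics_workspace.logs.id

  enabled_log {
    category = "AuditEvent"
  }
}
```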
Step 6 β Add Azure Data Factory
resource "azurerm_data_factory" "adf" {
  identity {
    type = "SystemAssigned"
  }
}
Step 7 β Configure RBAC
Grant the ADF managed identity:
- Storage Blob Data Contributor on the storage account
- Key Vault Secrets User on the Key Vault
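These two role assignments can be expressed in Terraform roughly as follows (assuming the resource names used in earlier snippets):

```hcl
# Data-plane access to the lake for ADF's system-assigned identity
resource "azurerm_role_assignment" "adf_storage" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}

# Read secrets from Key Vault (requires the vault to use Azure RBAC authorization)
resource "azurerm_role_assignment" "adf_kv" {
  scope                = azurerm_key_vault.kv.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}
```

Because access is granted to the factory's identity rather than via account keys, no secrets ever need to be stored in pipeline definitions.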
Step 8 β Upload Sample Data
Upload files to ADLS:
raw/clinical/incoming/patients.csv
raw/genomics/incoming/run-metadata.json
raw/genomics/incoming/variants_SMP001.vcf
Step 9 β Create First Pipeline (Clinical)
In ADF:
- Create a linked service (ADLS Gen2, Managed Identity)
- Create datasets:
  - raw CSV
  - curated CSV
- Create the pipeline: raw → copy → curated
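The linked service can also be managed in Terraform instead of clicked together in ADF Studio. A sketch (the linked-service name is an assumption):

```hcl
# ADLS Gen2 linked service authenticating with ADF's managed identity -- no keys or secrets
resource "azurerm_data_factory_linked_service_data_lake_storage_gen2" "lake" {
  name                 = "ls_adls_mi"
  data_factory_id      = azurerm_data_factory.adf.id
  url                  = azurerm_storage_account.lake.primary_dfs_endpoint
  use_managed_identity = true
}
```

The datasets and copy activity can then reference this linked service by name.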
Step 10 β Add Event Trigger
Trigger the pipeline when a file arrives:
- Event: Blob Created
- Path: raw/clinical/incoming/
Now ingestion becomes automatic.
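In Terraform, the blob event trigger might be sketched like this (the trigger and pipeline names are hypothetical; note the `/<container>/blobs/<path>` format of the path filter):

```hcl
resource "azurerm_data_factory_trigger_blob_event" "clinical" {
  name               = "trg_clinical_blob_created"
  data_factory_id    = azurerm_data_factory.adf.id
  storage_account_id = azurerm_storage_account.lake.id

  # Fire on new files landing in the clinical incoming folder
  events                = ["Microsoft.Storage.BlobCreated"]
  blob_path_begins_with = "/raw/blobs/clinical/incoming/"

  pipeline {
    name = "pl_clinical_ingest"   # hypothetical pipeline name
  }
}
```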
Step 11 β Create Genomics Pipeline
Create datasets for:
- JSON (metadata)
- VCF (variants)
Pipeline: raw/genomics → curated/genomics
Step 12 β Add Validation Logic
Add:
- Get Metadata (check the file exists)
- If Condition:
  - If exists → copy
  - Else → skip
This mimics real production pipelines.
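Expressed as code, the Get Metadata + If Condition pair might look like the sketch below, defined via Terraform's `activities_json` (dataset and activity names are hypothetical; the inner Copy activity is omitted for brevity):

```hcl
resource "azurerm_data_factory_pipeline" "genomics" {
  name            = "pl_genomics_validated"
  data_factory_id = azurerm_data_factory.adf.id

  activities_json = jsonencode([
    {
      name = "CheckFileExists"
      type = "GetMetadata"
      typeProperties = {
        dataset   = { referenceName = "ds_raw_genomics_metadata", type = "DatasetReference" }
        fieldList = ["exists"]   # ask only whether the file is there
      }
    },
    {
      name      = "IfFileExists"
      type      = "IfCondition"
      dependsOn = [{ activity = "CheckFileExists", dependencyConditions = ["Succeeded"] }]
      typeProperties = {
        expression = {
          value = "@activity('CheckFileExists').output.exists"
          type  = "Expression"
        }
        ifTrueActivities = []   # the Copy activity would go here; skipped when false
      }
    }
  ])
}
```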
Step 13 β Enable Git Integration
Connect ADF to GitHub:
- Repo: your project
- Root folder: /adf
Artifacts stored as code:
pipeline/
dataset/
linkedService/
trigger/
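The Git integration itself can also live in Terraform, as a `github_configuration` block extending the `azurerm_data_factory` resource from Step 6 (the account name here is a placeholder):

```hcl
resource "azurerm_data_factory" "adf" {
  # ...name, location, resource_group_name, and identity as in Step 6...

  github_configuration {
    account_name    = "your-github-account"   # placeholder
    repository_name = "secure-azure-genomics-demo"
    branch_name     = "main"
    git_url         = "https://github.com"
    root_folder     = "/adf"
  }
}
```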
Final Repo Structure
terraform/ → infrastructure
adf/ → pipelines
data/ → sample data
docs/ → diagrams
What You've Learnt
By completing this, you gain practical knowledge of:
- Terraform modular design
- Azure private networking
- Managed identity usage
- Data lake architecture
- ADF pipelines & triggers
- Git-based data platform workflows
Next Improvements
- CI/CD pipelines
- Data quality validation
- Monitoring dashboards
- Multi-environment deployment
Full Project
https://github.com/ranjanm1/secure-azure-genomics-demo
If you found this useful or are working on similar Azure/data engineering projects, feel free to connect with me on LinkedIn. I'd love to exchange ideas and learn from others in the space.
In the next iteration, I'll be extending this project with a full CI/CD setup using GitHub Actions to automate infrastructure and pipeline deployments.
