If you want to go beyond basic Azure demos and build something closer to a real-world data platform, this guide walks you through exactly that.
In this tutorial, you'll build:
- A secure Azure landing zone using Terraform
- A data lake using ADLS Gen2
- Data pipelines using Azure Data Factory
- Event-driven ingestion (auto-triggered pipelines)
- Managed identity-based access (no secrets)
- Source-controlled pipelines in GitHub
Full code here: https://github.com/ranjanm1/secure-azure-genomics-demo
Why This Project Matters
In industries like healthcare and life sciences, organisations deal with highly sensitive data such as patient records and genomics information. These datasets need to be processed securely, reliably, and at scale.
This project demonstrates how a secure, cloud-based data platform can be built on Azure to ingest, process, and store such data using modern engineering practices. It focuses on key real-world requirements such as data security, automated ingestion, and maintainability through Infrastructure as Code and source-controlled pipelines.
What We're Building
A simplified (but realistic) data platform:
ADF (Managed Identity)
        ↓
ADLS Gen2
├── raw
├── curated
├── audit
└── reference
With:
- Private endpoints
- RBAC
- Validation logic
- Event triggers
Architecture Diagram
(The architecture diagram is included in the repo's docs/ folder.)
Step 1 β Create the Terraform Project
Create a modular structure:
modules/
environments/dev/
Key modules:
- resource group
- networking (VNet + subnets)
- storage (ADLS Gen2)
- key vault
- log analytics
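A sketch of how the root configuration in environments/dev/ might wire these modules together (module paths, variable names, and the region are illustrative assumptions, not the repo's exact code):

```hcl
# environments/dev/main.tf -- illustrative root module wiring (names are hypothetical)
module "resource_group" {
  source   = "../../modules/resource_group"
  name     = "rg-dataplatform-dev"
  location = "uksouth"
}

module "networking" {
  source              = "../../modules/networking"
  resource_group_name = module.resource_group.name
  location            = module.resource_group.location
}

module "storage" {
  source              = "../../modules/storage"
  resource_group_name = module.resource_group.name
  location            = module.resource_group.location
  subnet_id           = module.networking.data_subnet_id  # used for the private endpoint
}
```

Passing outputs between modules like this keeps each module independent while the environment folder decides how they compose.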
Step 2 β Deploy Secure Networking
Create:
- VNet
- Subnets:
  - app
  - data
- Private endpoints
Example:
resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-dataplatform-dev"          # hypothetical name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name   # assumes a resource group declared elsewhere
  address_space       = ["10.20.0.0/16"]
}
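The app and data subnets can then be carved out of that address space. A minimal sketch (subnet names and CIDR ranges are assumptions):

```hcl
resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.20.1.0/24"]
}

resource "azurerm_subnet" "data" {
  name                 = "snet-data"   # will host the private endpoints
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.20.2.0/24"]
}
```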
Step 3 β Create ADLS Gen2 (Data Lake)
Enable hierarchical namespace:
is_hns_enabled = true
Create containers:
- raw
- curated
- audit
- reference
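A sketch of the storage module, assuming a hypothetical account name and the resource group from earlier steps. The `for_each` loop creates all four containers from one resource block:

```hcl
resource "azurerm_storage_account" "lake" {
  name                     = "stdataplatformdev"   # hypothetical; must be globally unique
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true   # turns the account into ADLS Gen2
}

# One Gen2 filesystem (container) per zone of the lake
resource "azurerm_storage_data_lake_gen2_filesystem" "zones" {
  for_each           = toset(["raw", "curated", "audit", "reference"])
  name               = each.key
  storage_account_id = azurerm_storage_account.lake.id
}
```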
Step 4 β Add Key Vault + Private Endpoints
Create:
- Key Vault
- Private endpoints for:
  - Storage
  - Key Vault
This ensures secure access (no public exposure).
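A private endpoint for the data lake might look like the sketch below (endpoint names are assumptions; the `dfs` subresource targets the ADLS Gen2 endpoint, while a Key Vault endpoint would use `vault` instead):

```hcl
resource "azurerm_private_endpoint" "storage_dfs" {
  name                = "pe-storage-dfs"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  subnet_id           = azurerm_subnet.data.id   # the data subnet from Step 2

  private_service_connection {
    name                           = "psc-storage-dfs"
    private_connection_resource_id = azurerm_storage_account.lake.id
    subresource_names              = ["dfs"]     # ADLS Gen2 endpoint; use ["vault"] for Key Vault
    is_manual_connection           = false
  }
}
```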
Step 5 β Add Logging
Deploy:
- Log Analytics workspace
- Diagnostic settings for:
  - Storage
  - Key Vault
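A sketch of the workspace plus one diagnostic setting, using Key Vault's `AuditEvent` log category as the example (resource names and retention are assumptions):

```hcl
resource "azurerm_log_analytics_workspace" "logs" {
  name                = "log-dataplatform-dev"   # hypothetical name
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Ship Key Vault audit logs to the workspace; a similar block covers storage
resource "azurerm_monitor_diagnostic_setting" "kv" {
  name                       = "diag-keyvault"
  target_resource_id         = azurerm_key_vault.kv.id   # assumes the Key Vault from Step 4
  log_analytics_workspace_id = azurerm_log_analytics_workspace.logs.id

  enabled_log {
    category = "AuditEvent"
  }
}
```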
Step 6 β Add Azure Data Factory
resource "azurerm_data_factory" "adf" {
  identity {
    type = "SystemAssigned"
  }
}
Step 7 β Configure RBAC
Grant the ADF managed identity:
- Storage Blob Data Contributor on the storage account
- Key Vault Secrets User on the Key Vault
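These two role assignments can be expressed in Terraform roughly as follows (assuming the resource names used in earlier snippets):

```hcl
# Data-plane access to the lake for ADF's system-assigned identity
resource "azurerm_role_assignment" "adf_storage" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}

# Read secrets from Key Vault (requires the vault to use Azure RBAC authorization)
resource "azurerm_role_assignment" "adf_kv" {
  scope                = azurerm_key_vault.kv.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}
```

Because access is granted to the factory's identity rather than via account keys, no secrets ever need to be stored in pipeline definitions.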
Step 8 β Upload Sample Data
Upload files to ADLS:
raw/clinical/incoming/patients.csv
raw/genomics/incoming/run-metadata.json
raw/genomics/incoming/variants_SMP001.vcf
Step 9 β Create First Pipeline (Clinical)
In ADF:
- Create a linked service (ADLS Gen2, Managed Identity)
- Create datasets:
  - raw CSV
  - curated CSV
- Create the pipeline: raw → copy → curated
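The linked service can also be managed in Terraform instead of clicked together in ADF Studio. A sketch (the linked-service name is an assumption):

```hcl
# ADLS Gen2 linked service authenticating with ADF's managed identity -- no keys or secrets
resource "azurerm_data_factory_linked_service_data_lake_storage_gen2" "lake" {
  name                 = "ls_adls_mi"
  data_factory_id      = azurerm_data_factory.adf.id
  url                  = azurerm_storage_account.lake.primary_dfs_endpoint
  use_managed_identity = true
}
```

The datasets and copy activity can then reference this linked service by name.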
Step 10 β Add Event Trigger
Trigger the pipeline when a file arrives:
- Event: Blob Created
- Path: raw/clinical/incoming/
Now ingestion becomes automatic.
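In Terraform, the blob event trigger might be sketched like this (the trigger and pipeline names are hypothetical; note the `/<container>/blobs/<path>` format of the path filter):

```hcl
resource "azurerm_data_factory_trigger_blob_event" "clinical" {
  name               = "trg_clinical_blob_created"
  data_factory_id    = azurerm_data_factory.adf.id
  storage_account_id = azurerm_storage_account.lake.id

  # Fire on new files landing in the clinical incoming folder
  events                = ["Microsoft.Storage.BlobCreated"]
  blob_path_begins_with = "/raw/blobs/clinical/incoming/"

  pipeline {
    name = "pl_clinical_ingest"   # hypothetical pipeline name
  }
}
```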
Step 11 β Create Genomics Pipeline
Create datasets for:
- JSON (metadata)
- VCF (variants)
Pipeline: raw/genomics → curated/genomics
Step 12 β Add Validation Logic
Add:
- Get Metadata (check the file exists)
- If Condition:
  - If exists → copy
  - Else → skip
This mimics real production pipelines.
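Expressed as code, the Get Metadata + If Condition pair might look like the sketch below, defined via Terraform's `activities_json` (dataset and activity names are hypothetical; the inner Copy activity is omitted for brevity):

```hcl
resource "azurerm_data_factory_pipeline" "genomics" {
  name            = "pl_genomics_validated"
  data_factory_id = azurerm_data_factory.adf.id

  activities_json = jsonencode([
    {
      name = "CheckFileExists"
      type = "GetMetadata"
      typeProperties = {
        dataset   = { referenceName = "ds_raw_genomics_metadata", type = "DatasetReference" }
        fieldList = ["exists"]   # ask only whether the file is there
      }
    },
    {
      name      = "IfFileExists"
      type      = "IfCondition"
      dependsOn = [{ activity = "CheckFileExists", dependencyConditions = ["Succeeded"] }]
      typeProperties = {
        expression = {
          value = "@activity('CheckFileExists').output.exists"
          type  = "Expression"
        }
        ifTrueActivities = []   # the Copy activity would go here; skipped when false
      }
    }
  ])
}
```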
Step 13 β Enable Git Integration
Connect ADF to GitHub:
- Repo: your project
- Root folder: /adf
Artifacts stored as code:
pipeline/
dataset/
linkedService/
trigger/
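The Git integration itself can also live in Terraform, as a `github_configuration` block extending the `azurerm_data_factory` resource from Step 6 (the account name here is a placeholder):

```hcl
resource "azurerm_data_factory" "adf" {
  # ...name, location, resource_group_name, and identity as in Step 6...

  github_configuration {
    account_name    = "your-github-account"   # placeholder
    repository_name = "secure-azure-genomics-demo"
    branch_name     = "main"
    git_url         = "https://github.com"
    root_folder     = "/adf"
  }
}
```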
Final Repo Structure
terraform/ → infrastructure
adf/ → pipelines
data/ → sample data
docs/ → diagrams
What You've Learnt
By completing this, you gain practical knowledge of:
- Terraform modular design
- Azure private networking
- Managed identity usage
- Data lake architecture
- ADF pipelines & triggers
- Git-based data platform workflows
Next Improvements
- CI/CD pipelines
- Data quality validation
- Monitoring dashboards
- Multi-environment deployment
Full Project
https://github.com/ranjanm1/secure-azure-genomics-demo
If you found this useful or are working on similar Azure/data engineering projects, feel free to connect with me on LinkedIn. I'd love to exchange ideas and learn from others in the space.
In the next iteration, I'll be extending this project with a full CI/CD setup using GitHub Actions to automate infrastructure and pipeline deployments.
