DEV Community

Ranjan Majumdar

How to Build a Secure Azure Data Platform with Terraform & Data Factory (Step-by-Step)

If you want to go beyond basic Azure demos and build something closer to a real-world data platform, this guide walks you through exactly that.

In this tutorial, you’ll build:

  • A secure Azure landing zone using Terraform
  • A data lake using ADLS Gen2
  • Data pipelines using Azure Data Factory
  • Event-driven ingestion (auto-triggered pipelines)
  • Managed identity-based access (no secrets)
  • Source-controlled pipelines in GitHub

πŸ‘‰ Full code here:Github

Why This Project Matters

In industries like healthcare and life sciences, organisations deal with highly sensitive data such as patient records and genomics information. These datasets need to be processed securely, reliably, and at scale.

This project demonstrates how a secure, cloud-based data platform can be built on Azure to ingest, process, and store such data using modern engineering practices. It focuses on key real-world requirements such as data security, automated ingestion, and maintainability through Infrastructure as Code and source-controlled pipelines.

What We’re Building

A simplified (but realistic) data platform:

ADF (Managed Identity)
        ↓
ADLS Gen2
   β”œβ”€β”€ raw
   β”œβ”€β”€ curated
   β”œβ”€β”€ audit
   └── reference

With:

  • Private endpoints
  • RBAC
  • Validation logic
  • Event triggers

Architecture Diagram

Step 1 β€” Create the Terraform Project

Create a modular structure:

modules/
environments/dev/

Key modules:

  • resource group
  • networking (VNet + subnets)
  • storage (ADLS Gen2)
  • key vault
  • log analytics
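Under stated assumptions about the module interfaces (the paths, inputs, and outputs below are illustrative, not the repo's actual code), `environments/dev/main.tf` might wire the modules together like this:

```hcl
# environments/dev/main.tf — illustrative wiring of the modules above
module "resource_group" {
  source   = "../../modules/resource_group"
  name     = "rg-dataplatform-dev"
  location = "westeurope"
}

module "networking" {
  source              = "../../modules/networking"
  resource_group_name = module.resource_group.name
  address_space       = ["10.20.0.0/16"]
}

module "storage" {
  source              = "../../modules/storage"
  resource_group_name = module.resource_group.name
  subnet_id           = module.networking.data_subnet_id
}
```

Keeping per-environment composition in `environments/dev/` lets you reuse the same modules for test and prod with different variable values.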

Step 2 β€” Deploy Secure Networking

Create:

  • VNet
  • Subnets: app, data, private endpoints

Example:


resource "azurerm_virtual_network" "vnet" {
  name                = "vnet-dataplatform"   # illustrative name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  address_space       = ["10.20.0.0/16"]
}
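Subnets follow the same pattern; a sketch with illustrative names and CIDRs:

```hcl
# One subnet per tier, carved from the VNet's 10.20.0.0/16 space
resource "azurerm_subnet" "data" {
  name                 = "snet-data"   # illustrative name
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.20.2.0/24"]
}
```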

Step 3 β€” Create ADLS Gen2 (Data Lake)

Enable hierarchical namespace:

is_hns_enabled = true

Create containers:

raw
curated
audit
reference
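A hedged Terraform sketch of the data lake (the storage account name and `for_each` pattern are illustrative; `azurerm_storage_data_lake_gen2_filesystem` is the azurerm provider's resource for ADLS Gen2 containers):

```hcl
resource "azurerm_storage_account" "lake" {
  name                     = "stdataplatformdev"   # illustrative; must be globally unique
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true   # hierarchical namespace turns blob storage into ADLS Gen2
}

# One filesystem (container) per lake zone
resource "azurerm_storage_data_lake_gen2_filesystem" "zones" {
  for_each           = toset(["raw", "curated", "audit", "reference"])
  name               = each.key
  storage_account_id = azurerm_storage_account.lake.id
}
```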

Step 4 β€” Add Key Vault + Private Endpoints

Create:

  • Key Vault
  • Private endpoints for Storage and Key Vault
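A sketch of one private endpoint, assuming the subnet and storage account resources from the earlier steps (names are illustrative):

```hcl
resource "azurerm_private_endpoint" "storage_blob" {
  name                = "pe-storage-blob"   # illustrative name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  subnet_id           = azurerm_subnet.data.id

  private_service_connection {
    name                           = "psc-storage-blob"
    private_connection_resource_id = azurerm_storage_account.lake.id
    subresource_names              = ["blob"]   # use "dfs" for the ADLS Gen2 endpoint
    is_manual_connection           = false
  }
}
```

A second `azurerm_private_endpoint` pointing at the key vault's resource ID with `subresource_names = ["vault"]` covers Key Vault the same way.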

This ensures secure access (no public exposure).

Step 5 β€” Add Logging

Deploy:

  • Log Analytics workspace
  • Diagnostic settings for Storage and Key Vault
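A minimal sketch, assuming a key vault resource named `azurerm_key_vault.kv` exists (names are illustrative, and log categories vary by resource type):

```hcl
resource "azurerm_log_analytics_workspace" "law" {
  name                = "law-dataplatform-dev"   # illustrative name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Ship Key Vault audit logs to the workspace
resource "azurerm_monitor_diagnostic_setting" "kv" {
  name                       = "diag-keyvault"
  target_resource_id         = azurerm_key_vault.kv.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.law.id

  enabled_log {
    category = "AuditEvent"
  }
}
```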

Step 6 β€” Add Azure Data Factory

resource "azurerm_data_factory" "adf" {
  name                = "adf-dataplatform-dev"   # illustrative name
  resource_group_name = azurerm_resource_group.rg.name
  location            = azurerm_resource_group.rg.location

  identity {
    type = "SystemAssigned"
  }
}

Step 7 β€” Configure RBAC

Grant ADF access:

  • Storage Blob Data Contributor
  • Key Vault Secrets User
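In Terraform, these grants can be expressed as role assignments against the ADF system-assigned identity (resource names are illustrative):

```hcl
# Data-plane access for the factory's managed identity
resource "azurerm_role_assignment" "adf_storage" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}

resource "azurerm_role_assignment" "adf_kv" {
  scope                = azurerm_key_vault.kv.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = azurerm_data_factory.adf.identity[0].principal_id
}
```

Scoping each role to the individual resource (rather than the subscription) keeps the blast radius small.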

Step 8 β€” Upload Sample Data

Upload files to ADLS:

raw/clinical/incoming/patients.csv
raw/genomics/incoming/run-metadata.json
raw/genomics/incoming/variants_SMP001.vcf

Step 9 β€” Create First Pipeline (Clinical)

In ADF:

  • Create a linked service (ADLS, Managed Identity)
  • Create datasets: raw CSV, curated CSV
  • Create a pipeline: raw → copy → curated
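The linked service can also be captured in Terraform; a sketch assuming the storage account from Step 3 (the linked service name is illustrative):

```hcl
# ADLS Gen2 linked service authenticating via the factory's managed identity
resource "azurerm_data_factory_linked_service_data_lake_storage_gen2" "lake" {
  name                 = "ls_adls_mi"   # illustrative name
  data_factory_id      = azurerm_data_factory.adf.id
  url                  = "https://${azurerm_storage_account.lake.name}.dfs.core.windows.net"
  use_managed_identity = true
}
```

No keys or connection strings appear anywhere: the RBAC grants from Step 7 are what authorize the copy.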

Step 10 β€” Add Event Trigger

Trigger the pipeline when a file arrives:

  • Event: Blob Created
  • Path: raw/clinical/incoming/

Now ingestion becomes automatic.
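If you prefer to keep the trigger in Terraform too, a sketch (pipeline and trigger names are illustrative; note the ADF blob-path convention of `/<container>/blobs/<path>`):

```hcl
resource "azurerm_data_factory_trigger_blob_event" "clinical" {
  name                  = "trg_clinical_blob_created"   # illustrative name
  data_factory_id       = azurerm_data_factory.adf.id
  storage_account_id    = azurerm_storage_account.lake.id
  events                = ["Microsoft.Storage.BlobCreated"]
  blob_path_begins_with = "/raw/blobs/clinical/incoming/"
  activated             = true

  pipeline {
    name = "pl_clinical_ingest"   # illustrative pipeline name
  }
}
```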

Step 11 β€” Create Genomics Pipeline

Create datasets for:

  • JSON (metadata)
  • VCF (variants)

Pipeline:

raw/genomics → curated/genomics

Step 12 β€” Add Validation Logic

Add:

  • Get Metadata (check the file exists)
  • If Condition: if the file exists → copy, else → skip
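One way to capture this pattern as code is a pipeline defined via `activities_json`; the activity and dataset names below are illustrative, not the repo's actual artifacts:

```hcl
resource "azurerm_data_factory_pipeline" "genomics" {
  name            = "pl_genomics_validated"   # illustrative name
  data_factory_id = azurerm_data_factory.adf.id

  # Get Metadata checks existence; If Condition gates the copy
  activities_json = <<JSON
[
  {
    "name": "CheckFileExists",
    "type": "GetMetadata",
    "typeProperties": {
      "dataset": { "referenceName": "ds_raw_genomics", "type": "DatasetReference" },
      "fieldList": ["exists"]
    }
  },
  {
    "name": "IfFileExists",
    "type": "IfCondition",
    "dependsOn": [{ "activity": "CheckFileExists", "dependencyConditions": ["Succeeded"] }],
    "typeProperties": {
      "expression": { "value": "@activity('CheckFileExists').output.exists", "type": "Expression" },
      "ifTrueActivities": [{ "name": "CopyToCurated", "type": "Copy" }]
    }
  }
]
JSON
}
```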

This mimics real production pipelines.

Step 13 β€” Enable Git Integration

Connect ADF to GitHub:

  • Repo: your project
  • Root folder: /adf

Artifacts stored as code:

pipeline/
dataset/
linkedService/
trigger/
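The Git link itself can also live in Terraform, via the `github_configuration` block on the data factory (account and repo values below are placeholders):

```hcl
resource "azurerm_data_factory" "adf" {
  # ... name, location, resource_group_name as in Step 6 ...

  github_configuration {
    account_name    = "your-github-user"   # placeholder values
    repository_name = "your-repo"
    branch_name     = "main"
    git_url         = "https://github.com"
    root_folder     = "/adf"
  }
}
```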

Final Repo Structure

terraform/ β†’ infrastructure
adf/       β†’ pipelines 
data/      β†’ sample data 
docs/      β†’ diagrams

What You Have Learnt

By completing this, you gain practical knowledge of:

  • Terraform modular design
  • Azure private networking
  • Managed identity usage
  • Data lake architecture
  • ADF pipelines & triggers
  • Git-based data platform workflows

Next Improvements

  • CI/CD pipelines
  • Data quality validation
  • Monitoring dashboards
  • Multi-environment deployment

πŸ”— Full Project

πŸ‘‰ https://github.com/ranjanm1/secure-azure-genomics-demo

If you found this useful or are working on similar Azure/data engineering projects, feel free to connect with me on LinkedIn β€” I’d love to exchange ideas and learn from others in the space.

In the next iteration, I’ll be extending this project with a full CI/CD setup using GitHub Actions to automate infrastructure and pipeline deployments.
