Tutorial: Build a pipeline to join streams of real-time data

#azure #database #tutorial

With traditional architectures, it's quite hard to handle the challenges posed by real-time streaming data. One such use case is joining streams of data from disparate sources. For example, think about a system that accepts processed orders from customers (a real-time, high-velocity data source), where the requirement is to enrich these "raw" orders with additional customer info such as name, email, location etc. A possible solution is to build a service that fetches customer data for each customer ID from an external system (for example, a database), performs an in-memory join, and stores the enriched data in another database (a materialized view). This approach has several problems, though, and one of them is not being able to keep up with a high volume of data, that is, to process it with low latency.
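To make the "do it yourself" approach concrete, here is a minimal sketch of such a service in Python. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for the external customer system, and the table, column and field names (`customers`, `cust_id`, `customer_id` and so on) are made up. The point to notice is that every single order triggers its own lookup.

```python
import sqlite3


def make_customer_db():
    """Hypothetical stand-in for the external customer database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(cust_id INTEGER PRIMARY KEY, name TEXT, email TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, "Ada", "ada@example.com", "London"),
            (2, "Linus", "linus@example.com", "Helsinki"),
        ],
    )
    return conn


def enrich_order(conn, order):
    """Fetch the customer for one order and merge the fields in memory."""
    row = conn.execute(
        "SELECT name, email, city FROM customers WHERE cust_id = ?",
        (order["customer_id"],),
    ).fetchone()
    if row is None:
        return None  # no matching customer, so nothing to enrich with
    name, email, city = row
    return {**order, "name": name, "email": email, "city": city}


conn = make_customer_db()
orders = [
    {"order_id": "o-1", "customer_id": 1, "amount": 25.0},
    {"order_id": "o-2", "customer_id": 2, "amount": 40.0},
    {"order_id": "o-3", "customer_id": 9, "amount": 12.5},  # unknown customer
]

# One database query per incoming order: this is the part that struggles
# to keep up once orders arrive at high volume.
enriched = [e for e in (enrich_order(conn, o) for o in orders) if e is not None]
```

With a real, remote database, each call to `enrich_order` would also pay a network round trip, which is why this design tends to fall behind a fast stream.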

Stream processing solutions are purpose-built for these kinds of problems. One of them is Azure Stream Analytics, a real-time analytics and complex event-processing engine designed to analyse and process high volumes of fast streaming data from multiple sources simultaneously.

It is built around the notion of a Job, which consists of an Input, a Query, and an Output. Azure Stream Analytics can ingest data from Azure Event Hubs (including Azure Event Hubs from Apache Kafka), Azure IoT Hub, or Azure Blob Storage. The query, which is based on the SQL query language, can be used to easily filter, sort, aggregate, and join streaming data over a period of time.
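"Over a period of time" is the part that tends to be new if you come from regular SQL. As a rough illustration of the idea (this is plain Python, not the Stream Analytics query language), the sketch below counts events per fixed, non-overlapping time window. The event shape, the `ts` field (seconds) and the 10-second window size are assumptions made up for the example, and the exact window boundary rules of a real engine may differ.

```python
from collections import Counter

WINDOW_SECONDS = 10  # assumed window size, for illustration only


def count_per_window(events, window_seconds=WINDOW_SECONDS):
    """Count events per fixed, non-overlapping (tumbling) time window."""
    counts = Counter()
    for event in events:
        # Round the timestamp down to the start of the window it falls into.
        window_start = (event["ts"] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical order events with timestamps in seconds.
events = [
    {"ts": 1, "order_id": "o-1"},
    {"ts": 4, "order_id": "o-2"},
    {"ts": 12, "order_id": "o-3"},
    {"ts": 27, "order_id": "o-4"},
]

window_counts = count_per_window(events)
print(window_counts)  # {0: 2, 10: 1, 20: 1}
```

In a Stream Analytics job you would express this kind of grouping declaratively in the query instead of writing the bucketing logic yourself.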

Hands-on tutorial

This GitHub repository contains a sample application that demonstrates the related concepts and provides a step-by-step guide to set up and run the end-to-end demo. It showcases how to build a data enrichment pipeline with streaming joins using a combination of Azure Event Hubs for data ingestion, Azure SQL Database for storing reference data, Azure Stream Analytics for data processing and Azure Cosmos DB for storing "enriched" data.

These are powerful, off-the-shelf services which you will be able to configure and use without setting up any infrastructure. You should be able to go through this tutorial using the Azure Portal (or Azure CLI), without writing any code!

https://github.com/abhirockzz/streaming-data-pipeline-azure

TL;DR

Here are the individual components:

Azure Event Hubs (Input Data source) - ingests raw orders data
Azure SQL Database (Reference Data source) - stores reference customer data
Azure Stream Analytics (Stream Processing) - joins the stream of orders data from Azure Event Hubs with the static reference customer data
Azure Cosmos DB (Output data source) - acts as a "sink" to store enriched orders info
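If it helps to see how these four pieces fit together, here is a small Python simulation of the data flow. It is only a sketch under stated assumptions and does not call any Azure service: a list of dicts stands in for the Event Hubs stream, a dict stands in for the reference data in Azure SQL Database, a generator plays the role of the Stream Analytics join, and a plain list acts as the Cosmos DB "sink". All field names are hypothetical.

```python
def load_reference_data():
    """Hypothetical customer reference data, keyed by customer ID."""
    return {
        1: {"name": "Ada", "email": "ada@example.com", "city": "London"},
        2: {"name": "Linus", "email": "linus@example.com", "city": "Helsinki"},
    }


def join_stream(orders, customers):
    """Inner-join each incoming order with its customer record."""
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is not None:  # orders without a match are dropped
            yield {**order, **customer}


def run_pipeline(orders, sink):
    """Input -> join with reference data -> output sink."""
    customers = load_reference_data()  # loaded once, not once per order
    for enriched_order in join_stream(orders, customers):
        sink.append(enriched_order)
    return sink


# Stand-in for the raw orders arriving on the input stream.
raw_orders = [
    {"order_id": "o-1", "customer_id": 1, "amount": 25.0},
    {"order_id": "o-2", "customer_id": 7, "amount": 12.5},  # no such customer
    {"order_id": "o-3", "customer_id": 2, "amount": 40.0},
]

enriched_orders = run_pipeline(raw_orders, sink=[])
```

The sketch uses an inner join, so the order for the unknown customer never reaches the sink. Whether unmatched orders should be dropped or kept with empty customer fields is a design choice you make in the query.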

I hope this helps you get started with Azure Stream Analytics and test the waters before moving on to more involved use cases. Beyond this tutorial, there is plenty of material for you to dig into!