Aman Gupta

End to End Netflix data analytics and recommendation system project using Microsoft Azure tools

Workflow -

Workflow architecture diagram

Empowering Data Excellence: Unveiling an Azure-Powered End-to-End Data Engineering Triumph

First, some details about the Azure services used -

  1. Data Factory - Data integration service that enables you to create, schedule, and manage data pipelines for efficient data movement and transformation between various sources and destinations in Azure and beyond. It simplifies ETL (Extract, Transform, Load) and data integration tasks.
  2. Data Lake Gen 2 - Data lake solution that combines the capabilities of a data lake with the power of Azure Blob Storage, allowing you to store and analyze large volumes of structured and unstructured data with enhanced performance, security, and analytics capabilities.
  3. Azure Databricks - Databricks is a unified analytics platform built on top of Apache Spark, designed to help data engineers and data scientists collaborate on big data processing and machine learning tasks. It provides tools for data exploration, data processing, and building machine learning models in a collaborative and scalable environment.
  4. Synapse Analytics - Formerly known as SQL Data Warehouse, this is a cloud-based analytics service provided by Microsoft Azure. It combines big data and data warehousing into a single integrated platform, allowing organizations to analyze and process large volumes of data for business intelligence and data analytics purposes.

Project -

It's all about using the smart tools in Azure to turn ordinary raw data into useful insights that we can actually use. This project shows just how powerful Azure Data Factory, Data Lake Gen 2, Synapse Analytics, Azure Databricks, and Power BI can be when it comes to working with data. I embarked on a profound analysis of Netflix's extensive array of shows and movies, employing Exploratory Data Analysis (EDA) techniques. The culmination of this endeavor was a finely-tuned recommendation system, meticulously designed to anticipate user preferences. Paired with dynamic data visualizations, this project epitomizes the seamless fusion of data engineering and analytics.

Data Source Selection: Curating the Foundation

The genesis of this undertaking lay in the meticulous selection of a Netflix dataset sourced from Kaggle. To faithfully recreate real-world data dynamics, the dataset found its home on the esteemed GitHub platform.

Azure Data Factory: Pioneering Data Choreography

The project's prelude commenced with Azure Data Factory's virtuosity. An orchestrated pipeline was masterfully conceived to draw data from the GitHub repository, with GitHub (via an HTTP URL) serving as the point of ingress. Azure Data Lake Storage Gen 2 stood as the repository of choice, emblematic of its prowess in data governance. An elegantly timed schedule trigger was enacted, invoking seamless end-to-end automation; a sketch of such a trigger follows the list below.

Three types of triggers in Azure Data Factory -

  • Schedule trigger (runs pipelines on a wall-clock schedule; triggers and pipelines can have a many-to-many relationship)
  • Tumbling window trigger (fires for fixed-size, non-overlapping time windows, producing a separate run for every window)
  • Event-based trigger (fires in response to blob-related events such as the creation or deletion of blobs)
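
As a rough illustration of how a schedule trigger can be wired up outside the portal, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. It assumes the copy pipeline already exists; the subscription, resource group, factory, trigger, and pipeline names are placeholders, not values from this project.

# Minimal sketch: attaching a daily schedule trigger to an existing pipeline
# with the azure-mgmt-datafactory SDK. All names below are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ScheduleTrigger, ScheduleTriggerRecurrence, TriggerPipelineReference,
    PipelineReference, TriggerResource)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

recurrence = ScheduleTriggerRecurrence(
    frequency="Day",                                   # run once per day
    interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=15),
    time_zone="UTC")

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="<pipeline-name>"),
        parameters={})])

adf_client.triggers.create_or_update(
    "<resource-group>", "<data-factory-name>", "daily-ingest-trigger",
    TriggerResource(properties=trigger))

# Triggers are created in the stopped state, so start it for the schedule to take effect
# (in older SDK versions this call is triggers.start instead of begin_start).
adf_client.triggers.begin_start(
    "<resource-group>", "<data-factory-name>", "daily-ingest-trigger").result()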

Storage Architecture: Data Sanctum of Integrity

The architectural blueprint featured a thoughtfully designed storage framework. It delineated distinct domains for the raw and refined data, entrenching data fidelity and facilitating meticulous tracking of transformations.

Separate folders for raw and transformed data in the container

Data Transformation with Azure Databricks: The Alchemical Nexus

As the project's crescendo approached, Azure Databricks emerged as the alchemical nexus for data transformation. The intricate process commenced with a secure interface established between Databricks and Azure Data Lake Storage, fortified by Azure Key Vault's cryptographic guardianship. A sophisticated notebook empowered by a potent Spark cluster undertook data refinement. Spark scripts, characterized by their lucidity, purged impurities and sculpted data into refined forms. The epilogue of this process witnessed the depositing of transformed data within meticulously organized containers, interlaced with metadata. We can mount the Data Lake Storage from Databricks using the following code:

# OAuth configuration for a service principal; the client secret is read from
# a Key Vault-backed Databricks secret scope rather than being hard-coded.
configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<application-id>",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
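
To make the transformation step concrete, here is a minimal sketch of the kind of Spark refinement described above. The mount name, folder names, and column names (show_id, country, rating, date_added) are assumptions based on the Kaggle netflix_titles.csv layout, not values taken from this project.

# Illustrative cleaning pass over the raw Netflix CSV inside a Databricks notebook.
from pyspark.sql import functions as F

raw_path = "/mnt/<mount-name>/raw-data/netflix_titles.csv"
transformed_path = "/mnt/<mount-name>/transformed-data/netflix_titles"

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv(raw_path))

clean_df = (df
    .dropDuplicates(["show_id"])                                  # drop duplicate titles
    .fillna({"country": "Unknown", "rating": "Unknown"})          # fill common gaps
    .withColumn("date_added", F.to_date(F.trim("date_added"), "MMMM d, yyyy"))
    .withColumn("year_added", F.year("date_added")))

# Persist the refined data to the dedicated "transformed" folder as Parquet.
clean_df.write.mode("overwrite").parquet(transformed_path)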

Azure Synapse Analytics: Envisioning Data Choreography

The epoch shifted to Azure Synapse Analytics, the epicenter of data orchestration. Transformed data elegantly transited into tables and databases, with a triad of choices for working with it:

  • SQL scripts
  • Notebooks
  • Machine learning training or testing data

The latter path was embraced, leading to seamless loading of the data from the Data Lake into Synapse Analytics.
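
Whichever path is chosen, one common way to make the transformed files queryable inside the Synapse workspace is a view over OPENROWSET in the serverless SQL pool. The sketch below runs that SQL from Python via pyodbc; the workspace, database, credentials, storage account, and container names are all placeholders, and the storage path matches the hypothetical transformed folder used earlier.

# Sketch: exposing the transformed Parquet files to Synapse serverless SQL as a view.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace-name>-ondemand.sql.azuresynapse.net;"
    "DATABASE=<database-name>;UID=<user>;PWD=<password>",
    autocommit=True)

create_view = """
CREATE OR ALTER VIEW dbo.netflix_titles AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account-name>.dfs.core.windows.net/<container-name>/transformed-data/netflix_titles/*.parquet',
    FORMAT = 'PARQUET') AS titles;
"""
conn.cursor().execute(create_view)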

Table in the database of Databricks

Probing Insights: SQL Elocution

With insights beckoning, SQL queries took center stage within Azure Synapse. Through these elocutions, latent patterns and narratives within the data began to unravel. Visualizations, be they dynamic charts or structured tables, afforded a nuanced depiction of data's symphony.

Results as charts for SQL queries
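
For instance, an aggregation like the following, reusing the hypothetical pyodbc connection and view from the previous sketch, can be pulled straight into a pandas DataFrame and charted:

# Example analysis query: number of movies and TV shows per release year.
import pandas as pd

query = """
SELECT release_year, type, COUNT(*) AS title_count
FROM dbo.netflix_titles
GROUP BY release_year, type
ORDER BY release_year;
"""
titles_per_year = pd.read_sql(query, conn)
print(titles_per_year.tail())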

Power BI: Artistry in Visualization

In a consummate culmination, a Power BI dashboard adorned this project. This virtuosic visualization marvelously harmonized with Azure Synapse Analytics, unshackling insights from their tabular confines. In this amalgamation of intellect and aesthetics, the data metamorphosed into a visual chronicle.

Concluding the Opus

As the final curtain descended on this opus, the echo of Azure's harmonious ensemble lingered. From the selection of the dataset's embryo to its elegant transformation, meticulous storage orchestration to eloquent data analysis, the project encapsulated Azure's promise of an end-to-end data symphony.

To all those poised at the threshold of their data journey, Azure beckons. With an array of tools ready to sculpt, refine, and illuminate data's intricacies, the prospects are limitless. Let us traverse the realm of data alchemy, for together, we epitomize data's true potential.

With profound gratitude for joining this expedition, until the next chapter of our data odyssey.

Data Visualisation and Analytics -

Screenshots of the visualisations and analytics charts

Recommendation system -

The Term Frequency-Inverse Document Frequency (TF-IDF) score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews, and therefore their weight in computing the final similarity score.
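
As a sketch of how such a content-based recommender can be built, the snippet below uses scikit-learn's TfidfVectorizer over the description column of the Kaggle netflix_titles.csv file and ranks titles by cosine similarity; the file name, column names, and the example title are assumptions, not artefacts of this project.

# Content-based recommendation sketch: TF-IDF over plot descriptions,
# cosine similarity between titles (assumes the Kaggle netflix_titles.csv columns).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

titles = pd.read_csv("netflix_titles.csv").fillna({"description": ""})

# TF-IDF down-weights words that occur in many overviews; common English stop words are dropped.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(titles["description"])

# Dot products of the L2-normalised TF-IDF vectors give cosine similarities.
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

index_of = pd.Series(titles.index, index=titles["title"]).drop_duplicates()

def recommend(title, n=10):
    """Return the n titles whose descriptions are most similar to the given title."""
    idx = index_of[title]
    scores = sorted(enumerate(similarity[idx]), key=lambda pair: pair[1], reverse=True)
    top_indices = [i for i, _ in scores[1:n + 1]]   # skip the queried title itself
    return titles["title"].iloc[top_indices]

print(recommend("Narcos"))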

My handwritten notes -

Images of handwritten notes

Code - GitHub
