Big Data on Azure

Data is a crucial part of any enterprise service. It is present everywhere and in abundance in today's world; an enormous amount of data is generated every single second. One might wonder: if such a huge amount of data is generated every day, how is it managed and how is it useful? We will explore this in this article, and I will also show you how Microsoft Azure is emerging as one of the leading solution providers in the field of Data Science.

A cloud platform mainly consists of:
Compute/Servers: Networked commodity machines located in data centers.
Storage/Databases: The storage services located on the servers in the data centers.
Intelligence/Analytics: Services that turn the stored data into insights and intelligent experiences for people.

Let us see some statistics on how data is changing:
1. More data has been generated in the past 2-3 years than in all of prior human history. With more people moving online due to the COVID-19 pandemic, data generation is expected to reach an all-time high.
2. Approximately 2.5 exabytes of data (on the order of 10^18 bytes) are generated every day.

What is the reason behind this rapid growth of data?
1. Internet: The major contributor to the data boom. Approximately 4 billion people use the internet, and around 5 billion search queries are made every day. From this we can estimate the crucial role the internet plays in this data boom.

2. Social media: This is a term we are all familiar with. Social media is the next big contributor to the data boom. Every like and share counts, and everything that goes viral on social media makes a huge difference!

3. Internet of Things: Though not yet as widespread, it makes a significant contribution to data generation, and this is sure to increase in the coming decades.

There are many other contributors, but I have just jotted down a few.

Having learnt about what data is, let us now explore the types of data that are available:

1. Structured Data: Data that is well formatted and defined with a proper structure, for example rows and columns in a table.
2. Unstructured Data: Data with no predefined structure at all, for example media files and free-form text files.
3. Semi-Structured Data: A hybrid of the above two formats, for example JSON and XML. A small illustration follows below.
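
To make the distinction concrete, here is a minimal Python sketch contrasting a structured, table-like record set with a semi-structured JSON document. The records and field names are purely illustrative.

```python
import json

# Structured data: every record has the same fixed fields (like a table row).
structured_rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Mysuru"},
]

# Semi-structured data: a JSON document whose fields can vary and nest per record.
semi_structured_doc = """
{
  "id": 3,
  "name": "Meera",
  "interests": ["cloud", "big data"],
  "address": {"city": "Bengaluru"}
}
"""

record = json.loads(semi_structured_doc)          # parse the JSON text
print(record["address"]["city"])                  # nested fields, no fixed schema
print([row["name"] for row in structured_rows])   # uniform columns, easy to tabulate
```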

We usually deal with unstructured and semi-structured data in today's world. At the enterprise level, organisations typically hold around 50% unstructured data, in the order of petabytes. To add to this, the industry has seen an explosion of semi-structured and unstructured data in the last few years.

Now let's come to the main topic of this blog: BIG DATA.

The name itself makes it somewhat clear. Data that is too large or complex to be analysed by traditional data processing software is what we call big data. Additionally, it has 3 major characteristics, which can be called the 3 Vs.
What are the 3 Vs?
1. Volume: One of the main characteristics of big data is volume. What volume are we talking about here? It is on the order of petabytes or more, and petabytes may soon not even be the appropriate unit given the size of the data being generated.
2. Velocity: Similar to velocity in physics, this is the rate at which the data grows. This growth is usually exponential.
3. Variety: This refers to the types of data that come in. As we discussed earlier, data arrives in three different formats: structured, unstructured and semi-structured.

All this seems to be in place. But what do we do with such volumes of data?

To understand this better, let us look at some real life examples,

Rolls-Royce: Besides manufacturing automobiles, Rolls-Royce also manufactures aircraft engines. It has over 13,000 engines in operation that send real-time data on various engine parameters. The main reason for this data collection is to increase the fuel efficiency of the engines. Fuel efficiency is the biggest concern for airlines: some landings save fuel, whereas others turn out exorbitantly expensive. Improving fuel efficiency by analysing real-time parameters from all the operational engines is a great step taken by Rolls-Royce to improve the quality of its engines.

Hewlett Packard: This is a very familiar company which produces many electronic gadgets, spare parts and accessories. Service is a major concern for such companies; many a time, customers are not satisfied with the quality of after-sales service. HP collected data on customer problems and used a combination of data and AI to improve its technical support. As a result, it observed a 75% increase in queries resolved automatically compared to its previous methods.

Having seen these two examples, we can now relate this to things around us. All the leading social media services rely on this technology for their smooth functioning. Healthcare, transport and e-commerce, to name a few, already rely on big data or will switch over to it very soon.
Now let us talk about the ways in which we can analyse this big data. This is a great challenge in itself, right? Handling such precious data is obviously no joke; it should be taken care of really well. This is exactly where cloud technologies come in handy. Any data we might think of first needs to be stored and can then be analysed.
For storage we rely on relational databases, and also on a distributed file store called a data lake.
Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists and analysts to store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming and interactive analytics. Azure Data Lake works with existing IT investments for identity, management and security for simplified data management and governance.
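
To give a feel for how simple the storage side can be, here is a minimal sketch of uploading a local file to Azure Data Lake Storage Gen2 with the Python SDK. The account name, container (file system) and file paths are placeholders, and it assumes the azure-storage-file-datalake and azure-identity packages are installed.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and container name - replace with your own
account_url = "https://mystorageaccount.dfs.core.windows.net"
service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())

file_system = service.get_file_system_client(file_system="raw-data")
file_client = file_system.get_file_client("sensors/2024/engine_readings.json")

with open("engine_readings.json", "rb") as data:
    # overwrite=True replaces the file if it already exists
    file_client.upload_data(data, overwrite=True)
```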

The next important aspect we need to discuss is compute.
We have options like NoSQL databases, Apache Spark, Hadoop, Azure Synapse Analytics and so on.

Microsoft Azure provides robust services for analysing big data. As we discussed earlier, Azure Data Lake is a wonderful and secure way to store the data, and we can later process it using Spark on Azure Databricks. Azure provides a hassle-free experience with best-in-class cloud security. Azure Stream Analytics is a service for real-time data analytics. A small PySpark sketch of this kind of processing follows below.
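
Here is a minimal PySpark sketch of reading data from a data lake and aggregating it. On Azure Databricks a SparkSession named `spark` already exists, so the builder call is only needed when running elsewhere; the abfss:// path, container and column names are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-on-azure").getOrCreate()

# Read semi-structured JSON readings straight from Azure Data Lake Storage Gen2
readings = spark.read.json(
    "abfss://raw-data@mystorageaccount.dfs.core.windows.net/sensors/2024/"
)

# Aggregate: average fuel burn per engine, ordered by the heaviest consumers
summary = (
    readings.groupBy("engine_id")
    .agg(F.avg("fuel_burn_kg_per_hour").alias("avg_fuel_burn"))
    .orderBy(F.desc("avg_fuel_burn"))
)

summary.show(10)
```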

What does Azure Databricks offer?
1. Optimised Spark engine
2. ML runtime
3. MLflow
4. Choice of language
5. Collaborative notebooks
6. Delta Lake (see the sketch below)
7. Native integrations with Azure services
8. Interactive workspace
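
As a small illustration of the Delta Lake point above, here is a minimal sketch of writing and reading a Delta table from Spark. Delta Lake support is built into Databricks; the path and DataFrame contents are placeholders for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Illustrative data - a couple of engine readings
df = spark.createDataFrame(
    [(1, "engine-a", 842.5), (2, "engine-b", 910.1)],
    ["reading_id", "engine_id", "fuel_burn_kg_per_hour"],
)

delta_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/engine_readings_delta"

# Writing in Delta format adds ACID transactions and time travel on top of the data lake
df.write.format("delta").mode("overwrite").save(delta_path)

# Read the table back like any other Spark data source
spark.read.format("delta").load(delta_path).show()
```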


Azure has hundreds of services to offer, which makes it a very hassle-free platform for deploying any kind of enterprise-ready application. Seems interesting, doesn't it?