DEV Community: D Siddhant Patro

Spark MLlib for Big data and Machine learning

D Siddhant Patro — Tue, 09 Feb 2021 18:50:51 +0000

In a world full of data, there’s a good chance you already know what big data and Apache Spark are. If you don’t, that’s OK! I’ll explain both, but before getting to big data and Spark, you need to understand what data is.

Data :- The quantities, characters, or symbols containing some kind of information on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Now that you have an idea of what data is, it will be easier to understand what big data is.

Big data :- It is a collection of data that is huge in volume and high in complexity, often obtained from new data sources, and that keeps growing exponentially with time. These data sets are so voluminous that traditional data processing software just can’t manage them.
It consists of three types of data: structured, semi-structured and unstructured.

Machine learning :- It is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

Apache Spark :- With an immense amount of data, we need a tool to digest it, and that tool is Apache Spark. It is a fast, unified, open source engine for parallel data processing on computer clusters. It is designed to deliver the computational speed and scalability required for big data, specifically for streaming data, graph data and machine learning applications.

Spark provides a unified data processing engine known as the Spark stack. This stack is built on top of a strong foundation called Spark Core, which provides the functionality needed to manage and run distributed applications, such as scheduling, coordination, and fault tolerance. The libraries built on top of it are Spark SQL, Spark Streaming, GraphX, Spark MLlib and SparkR.

Spark SQL is for batch as well as interactive data processing.
Spark Streaming is for real-time stream data processing.
Spark GraphX is for graph processing.
Spark MLlib is for machine learning.
SparkR is for working with Spark, including machine learning tasks, from the R language and the R shell.

Spark MLlib is Spark’s machine learning library. It simplifies many of the tasks involved in building machine learning models, such as featurization, constructing pipelines, and evaluating and tuning models. Machine learning algorithms are iterative in nature, meaning they run through many iterations until a desired objective is achieved. Spark makes it much easier to implement those algorithms and run them in a scalable manner across a cluster of machines.
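To make "iterative" concrete, here is a minimal sketch in plain Python (it is not Spark code). It uses gradient descent to fit a single weight w so that y ≈ w * x, and stops once the error no longer improves. The toy data, the learning rate and the tolerance are assumptions picked only for illustration.

```python
# Plain-Python sketch of an iterative learning loop (not the MLlib API).
# Toy data generated from y = 3 * x; all values are illustrative assumptions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def fit_weight(xs, ys, learning_rate=0.01, tolerance=1e-9, max_iterations=10_000):
    """Repeat small updates to w until the mean squared error stops improving."""
    w = 0.0
    previous_loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if abs(previous_loss - loss) < tolerance:  # desired objective reached
            break
        previous_loss = loss
    return w, iteration


weight, iterations_used = fit_weight(xs, ys)
print(f"learned weight {weight:.3f} after {iterations_used} iterations")
```

On a cluster, each iteration has to scan data that is spread over many machines, which is why an engine that schedules and coordinates that work helps so much.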

Spark MLlib tools are given below:-

ML Algorithms
Featurization
Pipelines
Model Tuning
Persistence

ML Algorithms:-
ML Algorithms form the core of MLlib. These include common learning algorithms such as classification, regression, clustering, and collaborative filtering. MLlib standardizes APIs to make it easier to combine multiple algorithms into a single pipeline.

Featurization:-
Featurization includes feature extraction, transformation, dimensionality reduction, and selection.

Feature Extraction is extracting features from raw data.
Feature Transformation includes scaling, converting, or modifying features.
Feature Selection involves selecting a subset of necessary features from a huge set of features.
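The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not MLlib's feature API. The raw records, the feature names and the variance threshold are all made up for the example, and dimensionality reduction is left out to keep it short.

```python
# Plain-Python sketch of extraction, transformation and selection.
# The raw records and the variance threshold are illustrative assumptions.
from statistics import pvariance

raw_records = ["spark is fast", "big data", "machine learning with spark mllib"]


def extract(text):
    """Feature extraction: turn raw text into numbers."""
    words = text.split()
    return {
        "num_words": float(len(words)),
        "num_chars": float(len(text)),
        "has_digit": float(any(ch.isdigit() for ch in text)),
    }


def min_max_scale(rows):
    """Feature transformation: rescale every feature to the range [0, 1]."""
    scaled = [dict(row) for row in rows]
    for name in rows[0]:
        values = [row[name] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in scaled:
            row[name] = (row[name] - low) / span if span else 0.0
    return scaled


def select(rows, min_variance=0.01):
    """Feature selection: drop features that barely vary across rows."""
    keep = [n for n in rows[0] if pvariance([r[n] for r in rows]) > min_variance]
    return [{n: r[n] for n in keep} for r in rows]


features = select(min_max_scale([extract(t) for t in raw_records]))
print(features)
```

Here the has_digit feature is constant across all records, so the selection step drops it and only the two scaled features remain.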

Pipelines:-
In machine learning, it is common to run a sequence of steps to clean and transform data, then train one or more ML algorithms to learn from the data. MLlib has a class called Pipeline, which consists of a sequence of Pipeline Stages (Transformers and Estimators) to be run in a specific order.
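The idea can be sketched in plain Python. The classes below are hypothetical stand-ins written for this post, not the real MLlib classes: a Transformer only has a transform step, an Estimator has a fit step that returns a fitted Transformer, and the pipeline runs its stages in order.

```python
# Plain-Python sketch of the Pipeline idea; these classes are hypothetical
# stand-ins, not the real pyspark.ml classes.


class Lowercase:
    """A Transformer: maps a dataset to a new dataset, nothing to learn."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyModel:
    """The fitted result of an Estimator, which is itself a Transformer."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Encode each row as a list of word indices, skipping unknown words.
        return [
            [self.vocabulary[w] for w in row.split() if w in self.vocabulary]
            for row in rows
        ]


class VocabularyEstimator:
    """An Estimator: learns something from the data and returns a model."""

    def fit(self, rows):
        words = sorted({w for row in rows for w in row.split()})
        return VocabularyModel({w: i for i, w in enumerate(words)})


class Pipeline:
    """Runs its stages in order; Estimators are fitted, then used to transform."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # pass transformed data onward
        return fitted                      # list of fitted Transformers


def run(fitted_stages, rows):
    """Apply every fitted stage, in order, to new data."""
    for stage in fitted_stages:
        rows = stage.transform(rows)
    return rows


training_rows = ["Spark MLlib", "Big Data Spark"]
fitted = Pipeline([Lowercase(), VocabularyEstimator()]).fit(training_rows)
print(run(fitted, ["big spark"]))
```

Keeping the steps in one object means exactly the same cleaning and encoding is applied at training time and at prediction time.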

Model Tuning:-
The goal of model tuning is to train a model with the right set of parameters to achieve the best performance and meet the objective defined in the first step of the ML development process.
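A common way to do this is a grid search: train one model per candidate value and keep the one that scores best on data held out from training. MLlib ships tuning helpers such as CrossValidator and ParamGridBuilder for this. The sketch below is plain Python rather than those classes, and the data points and the candidate grid are assumptions chosen for illustration.

```python
# Plain-Python sketch of tuning one hyperparameter with a validation split.
# Data and the candidate grid are illustrative assumptions.
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 6.0), (4.0, 9.5)]   # last point is noisy
validation = [(5.0, 10.0), (6.0, 12.1)]


def fit_ridge(points, reg_param):
    """Fit y = w * x with an L2 penalty; reg_param is the hyperparameter."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / (sum_xx + reg_param)


def mean_squared_error(w, points):
    return sum((w * x - y) ** 2 for x, y in points) / len(points)


def grid_search(grid):
    """Train one model per candidate and keep the best on validation data."""
    scores = {}
    for reg_param in grid:
        w = fit_ridge(train, reg_param)
        scores[reg_param] = mean_squared_error(w, validation)
    best = min(scores, key=scores.get)
    return best, scores


best_reg_param, validation_scores = grid_search([0.0, 1.0, 3.0, 10.0])
print("best reg_param:", best_reg_param)
```

With this toy data, a little regularization beats none at all, because the noisy training point pulls the unregularized weight too high.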

Persistence:-
Persistence helps in saving and loading ML algorithms, models, and pipelines. This saves time and effort: once a model is persisted, it can be loaded and reused whenever needed.
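As a rough illustration of the save-then-load round trip, here is a plain Python sketch that stores a tiny model as JSON. The LinearModel class, the file name and the JSON layout are hypothetical and have nothing to do with MLlib's actual on-disk format.

```python
# Plain-Python sketch of persistence using JSON; the model class and file
# name are illustrative assumptions, not MLlib's on-disk format.
import json
import os
import tempfile


class LinearModel:
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, x):
        return self.weight * x + self.bias

    def save(self, path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump({"weight": self.weight, "bias": self.bias}, handle)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as handle:
            params = json.load(handle)
        return cls(params["weight"], params["bias"])


trained = LinearModel(weight=2.0, bias=0.5)   # pretend this took hours to train

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "linear_model.json")
    trained.save(model_path)
    restored = LinearModel.load(model_path)   # no retraining needed

print(restored.predict(3.0))
```

The restored model gives the same predictions as the trained one without repeating the training work.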

The above are the tools through which one can use machine learning algorithms on the Apache Spark framework for better and faster processing of massive data.

Conclusion:-
In the Python world, scikit-learn is one of the most popular open source machine learning libraries. It provides a set of supervised and unsupervised learning algorithms. It is designed to be simple and efficient, which makes it a great tool for learning and practicing machine learning on a single machine. But the moment the size of the data exceeds the storage capacity of a single machine, it is time to switch to Spark MLlib.

Thank you.

Rundown on Deep Learning

D Siddhant Patro — Sat, 01 Aug 2020 18:27:35 +0000

What is deep learning ? 😀

Deep learning is an artificial intelligence (AI) function that imitates the workings of the human brain in processing data and creating patterns for decision making. The word "deep" in deep learning isn't a reference to any kind of deeper understanding achieved by the approach; rather, it stands for the idea of successive layers of representation. It is a subset of machine learning that uses layered networks capable of learning, without supervision, from data that is unstructured or unlabeled. Deep learning is also called deep neural learning or deep neural networks.

Other appropriate names for deep learning could have been hierarchical representation learning and layered representation learning. Modern deep learning involves tens or even thousands of successive layers of representations, all learned automatically from exposure to training data.
These layered or hierarchical representations are learned via models called neural networks, which are structured as layers stacked on top of each other. Most of us first met the term neural network in biology class. And yes, it is true that some of the core concepts of deep learning were developed by drawing inspiration from our understanding of how the brain learns. But since there is no evidence that the brain learns the same way modern deep learning models do, it is not right to say that deep learning models are models of the brain.

How deep learning works ?

Let’s examine how a network of several layers transforms an image of a digit in order to recognize what digit it is.

From the above image, you can get an idea of the basic deep learning architecture used by neural network models. There are three kinds of layers: the input layer (layer 1), the hidden layers (layers 2 and 3) and the output layer (layer 4). Each connection between neurons in adjacent layers is associated with a weight, which dictates the importance of the input value.
Steps followed by the neural network are:

An image of 28x28 pixels showing the digit 4, as depicted in the picture above, is provided as input to the input layer of the neural network.
This input gets transformed in the successive hidden layers.
The transformed representation is passed to the output layer, where the deep learning model is able to detect the digit.
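These steps can be sketched as a forward pass in plain Python. The network below has the four layers described above, with 784 inputs (28x28 pixels) and 10 outputs (one per digit). The hidden layer sizes, the activation functions and the random weights are assumptions, and since nothing here is trained, the predicted digit is meaningless. The point is only to show how data flows from layer to layer.

```python
# Plain-Python sketch of a forward pass through a 4-layer network.
# Weights are random (untrained), so the "prediction" is meaningless; the
# hidden layer sizes and activation functions are assumptions.
import math
import random

rng = random.Random(0)
layer_sizes = [28 * 28, 16, 16, 10]   # input, two hidden layers, output


def random_layer(n_in, n_out):
    """One weight per connection between neurons, plus one bias per neuron."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


layers = [random_layer(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pixels):
    """Transform the input layer by layer into 10 digit probabilities."""
    activations = pixels
    for index, (weights, biases) in enumerate(layers):
        sums = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        is_output = index == len(layers) - 1
        activations = softmax(sums) if is_output else relu(sums)
    return activations


image = [rng.random() for _ in range(28 * 28)]   # stand-in for a 28x28 digit image
probabilities = forward(image)
predicted_digit = probabilities.index(max(probabilities))
print(predicted_digit, round(sum(probabilities), 6))
```

Training is the process of adjusting those weights so that, for an image of a 4, the output layer puts most of its probability on the digit 4.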

You can think of a network as a distillation process in which the information passes through successive filters and comes out purified.

This is just a brief idea of how a deep learning model works.

Why deep learning ?

Research from Gartner revealed that a huge percentage of an organization’s data is unstructured, because the majority of it exists in formats such as pictures and text. Most machine learning algorithms find it difficult to analyze unstructured data, which means it remains unutilized, and this is exactly where deep learning becomes useful.
According to Andrew Ng (who has led the Google Brain project, served as chief scientist at Baidu, China’s major search engine, and co-founded Coursera), “The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.”

The quote and the graph above suggest that as the amount of data increases, deep learning models become more and more useful for producing the desired output. The ability to process large numbers of features makes deep learning very powerful when dealing with unstructured data, and this is why deep learning has emerged in recent years.

Advantages of Deep learning over traditional machine learning

The major advantages of deep learning over traditional machine learning algorithms are:

A deep learning model has the ability to do feature engineering on its own.
Massively parallel computations through the use of GPUs make it scalable to large volumes of data.
In deep learning, problems are solved on an end-to-end basis, while in traditional machine learning, tasks are divided into small pieces and the results are then combined into one conclusion.

Refer to the picture below.

Examples of deep learning in real-world scenarios

Electronics: Deep learning is being used in automated speech translation. Think of home assistant devices that respond to your voice and understand your preferences.
Automated driving: With the help of deep learning, automotive researchers are now able to automatically detect objects such as traffic lights and stop signs. They are also using it to detect pedestrians, which helps lower accidents.
Medical research: Deep learning is being used by researchers to detect cancer cells automatically.

When to use deep learning ?

Deep learning performs exceptionally well with a massive amount of data. For small data sizes, a traditional machine learning algorithm is preferable.
Deep learning really shines on complex problems such as image classification, natural language processing, and speech recognition.
Deep learning techniques need high-end infrastructure to train in a reasonable time.

Challenges faced

One needs to find and process massive datasets for training, and such datasets are rarely available. Once the datasets are in hand, using them to train deep learning networks can require days on big clusters of CPUs and GPUs. Emerging techniques such as transfer learning show some promise in overcoming this challenge.
There is also the danger of over-fitting. Over-fitting happens when an algorithm learns the detail and noise in the training data to an extent that negatively impacts the performance of the model in real-life scenarios.
Due to the sheer number of layers, nodes, and connections, it is difficult to understand how deep learning networks arrive at their insights.
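Over-fitting is easier to grasp with a small experiment. The plain Python sketch below is not a deep learning model. It compares a "memorizer" that simply returns the label of the nearest training point with a simple line that has only one weight to learn. The synthetic data, the noise level and both toy models are assumptions made up for the illustration.

```python
# Plain-Python sketch of over-fitting: a model that memorizes the training
# set versus one that learns a simple trend. Data is synthetic (assumed).
import random

rng = random.Random(42)


def make_points(n):
    """Noisy samples of the underlying rule y = 2 * x."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        points.append((x, 2.0 * x + rng.gauss(0.0, 2.0)))
    return points


train, test = make_points(200), make_points(200)


def memorizer(x):
    """Predict the label of the nearest training point (memorizes the noise)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)


def simple_line(x):
    """Predict with a single fitted slope (cannot memorize the noise)."""
    return slope * x


def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


for name, model in [("memorizer", memorizer), ("simple line", simple_line)]:
    print(f"{name}: train MSE {mse(model, train):.2f}, test MSE {mse(model, test):.2f}")
```

The memorizer gets a training error of zero by construction, yet its error on unseen data is not zero, and it is typically worse than the simple line's. That gap between training and real-life performance is what over-fitting looks like.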

Conclusion

The points presented above illustrate that deep learning has a lot of potential, but needs to overcome a few challenges before becoming a more versatile tool. The question now is not whether this technology is useful, but rather how companies can implement it in their projects to improve the way they process data. Interest in and enthusiasm for the field are growing, however, and already today we see incredible real-world applications of this technology.

Thank you !!!