Difference between Data Lakes & Data Warehouses

Chandra Prakash — Tue, 21 Jul 2020 11:19:58 +0000

This article discusses the differences between the two most popular choices for storing big data: data lakes and data warehouses.
The volume of data grows day by day, and storing it efficiently is a challenge for data engineers and DBAs.
Here we compare the two approaches on aspects such as the type of data they store, their purpose, size, users, and tasks.

1. Type of data:

Data lakes store both unstructured and structured data from various sources, such as IoT devices, real-time social media streams, user data, and web application transactions. Sometimes this data is structured, but often it’s quite messy because it is ingested straight from the source.
Data warehouses contain historical data that has been cleaned to fit a relational schema.
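
To make the contrast concrete, here is a minimal Python sketch of raw, inconsistent records (as they might sit in a lake) being cleaned to fit one fixed schema (as a warehouse table would require). The field names, the `SaleRow` type, and the sample events are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical raw events as they might land in a data lake: the same kind of
# information, but with inconsistent field names, types, and completeness.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "ts": "2020-07-01T10:00:00+00:00"}',
    '{"user_id": "u2", "amount": 5, "ts": "2020-07-01T11:30:00+00:00", "device": "iot-7"}',
    '{"user": "u3", "ts": "2020-07-02T09:15:00+00:00"}',
]


@dataclass(frozen=True)
class SaleRow:
    """Fixed, relational-style schema that a warehouse table might enforce."""
    user_id: str
    amount: float
    sold_at: datetime


def to_sale_row(raw_line: str) -> Optional[SaleRow]:
    """Clean one raw record; return None if it cannot fit the schema."""
    record = json.loads(raw_line)
    user_id = record.get("user_id") or record.get("user")
    if user_id is None or "amount" not in record or "ts" not in record:
        return None
    return SaleRow(
        user_id=str(user_id),
        amount=float(record["amount"]),
        sold_at=datetime.fromisoformat(record["ts"]).astimezone(timezone.utc),
    )


# The lake keeps every raw event; the warehouse only receives rows that fit.
clean_rows = [row for row in map(to_sale_row, RAW_EVENTS) if row is not None]
```

In this sketch the third event has no amount, so it stays in the raw data only and never reaches the cleaned rows.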

2. Purpose:

Data lakes are used for cost-effective storage of large amounts of data from many sources. Because the data doesn’t need to fit a specific schema, storage is more flexible and scalable, which lowers cost.
By restricting data to a schema, data warehouses are very efficient for analyzing historical data to support specific data-driven decisions.

3. Size:

Data lakes are much bigger because they store all data that might be important to a company.
Data warehouses are much more selective about what data is stored and are therefore smaller than data lakes.

4. Users:

Data lakes are set up and maintained by data engineers who integrate them into data pipelines. Data scientists work more closely with data lakes because they contain data of a wider and more current scope.
Data warehouses require less programming and data science knowledge to use. Hence data analysts and business analysts often work within data warehouses, which contain explicitly related data that has been processed for their work.

5. Tasks:

Data lakes aren’t limited to storage. Big data analytics can be run on data lakes using frameworks such as Apache Spark and Hadoop.
Data warehouses are typically set to read-only for analyst users, who are primarily reading and aggregating data for insights.
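
As a rough illustration of read-only analyst access, the Python sketch below uses SQLite as a stand-in for a warehouse. The `sales` table and its values are hypothetical, and a real warehouse would normally enforce this through user permissions rather than a connection flag.

```python
import os
import sqlite3
import tempfile

# Build a tiny stand-in "warehouse" file (hypothetical table and values).
db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
with sqlite3.connect(db_path) as writer:
    writer.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")
    writer.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 120.0), ("north", 80.0), ("south", 50.0)],
    )
writer.close()

# The analyst connection is opened read-only: queries work, writes are rejected.
reader = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
totals = dict(
    reader.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
)

try:
    reader.execute("INSERT INTO sales VALUES ('east', 1.0)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
finally:
    reader.close()
```

Here `totals` holds the per-region sums an analyst would read, while the attempted insert fails because the connection cannot write.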

Conclusion
In the end, which one you use depends on your business requirements.
Most of the time, though, building data pipelines calls for a combination of both storage techniques.

Thank you for reading.
If you find this post helpful, please react and share it.

Bivariate Regression on MLB 2002 Dataset

Chandra Prakash — Wed, 20 May 2020 17:30:49 +0000

This is a mini-project I undertook in my pre-final year of college.

In this project, I analyzed data from the 2002 season of American Major League Baseball (MLB), which contains batting statistics for 331 baseball players.

I aim to find out whether there is a relationship between a player's batting average and the number of home runs the player hits.

First, I checked for outliers and then transformed the data so that it does not violate any assumptions of regression.
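
The post does not say which outlier rule or transformation was used, so the following is only a sketch under assumptions: it flags outliers with the common 1.5 × IQR rule and applies a square-root transform, one typical choice for right-skewed counts. The project itself was written in R; this illustration is in Python, and the numbers are made up rather than taken from the MLB data.

```python
import math
import statistics

# Hypothetical home-run counts, NOT the real MLB 2002 data.
home_runs = [2, 5, 7, 8, 10, 12, 15, 18, 21, 57]


def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(home_runs)

# Assumed transformation: a square root compresses the long right tail.
transformed = [math.sqrt(v) for v in home_runs]
```

Whether a flagged point is dropped, kept, or handled by the transformation is a judgment call that depends on the data.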

I used various types of plots to visualize the data, such as scatter plots and normal Q-Q plots.
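
For readers unfamiliar with a normal Q-Q plot, this Python sketch computes the coordinates such a plot displays: each sorted, standardized sample value paired with the matching quantile of a standard normal distribution. The residual values are hypothetical, the `(rank - 0.5) / n` plotting position is one common convention among several, and the project's own plots were made in R.

```python
import statistics

# Hypothetical residuals from a fitted model, NOT taken from the project.
residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 1.1, 1.4]


def qq_points(sample):
    """Pair each sorted, standardized value with a standard normal quantile."""
    n = len(sample)
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)
    standard_normal = statistics.NormalDist()
    points = []
    for rank, value in enumerate(sorted(sample), start=1):
        # Assumed plotting position: (rank - 0.5) / n
        theoretical = standard_normal.inv_cdf((rank - 0.5) / n)
        points.append((theoretical, (value - mean) / stdev))
    return points


points = qq_points(residuals)
```

If the points lie close to a straight line when plotted, the sample is roughly normal; strong curvature suggests a transformation may be needed.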

Below is the GitHub link for the code:

{% https://github.com/Chandra0505/Project-1-mlb-dataset %}

How I built it

We divided the dataset into two sets: a training set (80%) and a test set (20%). We trained the model on the training set and used the test set to check its accuracy on data it had not seen.
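
The split-and-fit step can be sketched as follows. This is an illustrative Python version on synthetic data, not the project's R code: the generated `players` values, the seed, and the `fit_line` helper are all assumptions, and only the 331-row size and the 80/20 ratio come from the description above.

```python
import random

# Synthetic (batting_average, home_runs) pairs, NOT the real MLB 2002 data.
rng = random.Random(42)
players = []
for _ in range(331):
    avg = rng.uniform(0.200, 0.340)
    players.append((avg, max(0.0, 150 * (avg - 0.2) + rng.gauss(0, 8))))

# Shuffle, then keep 80% for training and 20% for testing.
rng.shuffle(players)
cut = int(0.8 * len(players))
train, test = players[:cut], players[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Fit on the training rows only, then predict the held-out test rows.
intercept, slope = fit_line(train)
predictions = [intercept + slope * x for x, _ in test]
```

Keeping the test rows out of the fit is what makes the later accuracy figure an estimate of performance on unseen players.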

Our final regression model achieves an accuracy of about 22%, which is quite good given that we were asked to perform bivariate regression using only a player's batting average and home runs.
Of course, many other factors also affect a person’s ability to hit home runs, such as size, strength, and number of at-bats.
However, batting average alone accounts for nearly one-fourth of the variability in the response.
Because the task was bivariate, we left out all other features, even though they could also play a crucial role in explaining the relationship.
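
The phrase "accounts for nearly one-fourth of the variability" suggests that the 22% figure is a coefficient of determination (R²), although the post does not say so explicitly. Under that assumption, here is a small Python sketch of how R² is computed; the `actual` and `predicted` values are hypothetical and unrelated to the project's results.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that `predicted` explains."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
score = r_squared(actual, predicted)
```

A score near 0.22 would mean the single predictor explains about 22% of the variation, with the rest left to factors the model does not include.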

What's the stack?

R as programming language (Version: 3.5)
Libraries used: ggplot2, caTools, Publish
RStudio as IDE (Version: 1.1.463)

My learnings / Feelings / Stories

Through this project, I learned how to apply various data science skills to a real-world dataset.

DEV Community: Chandra Prakash