<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Victor Alando</title>
    <description>The latest articles on DEV Community by Victor Alando (@victoralando).</description>
    <link>https://dev.to/victoralando</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1175714%2F6abc69ac-9f44-4e45-b739-d499703d29a4.jpg</url>
      <title>DEV Community: Victor Alando</title>
      <link>https://dev.to/victoralando</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/victoralando"/>
    <language>en</language>
    <item>
      <title>K-means Clustering Using the Elbow Method.</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Mon, 06 Jan 2025 07:51:17 +0000</pubDate>
      <link>https://dev.to/victoralando/k-means-clustering-using-the-elbow-method-235j</link>
      <guid>https://dev.to/victoralando/k-means-clustering-using-the-elbow-method-235j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset. It can be described as a "&lt;em&gt;way of grouping the data points into different clusters consisting of similar data points. Objects with possible similarities remain in one group, while those with few or no similarities fall into another group&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;Let's understand the clustering technique with the real-world example of a shopping mall. When customers visit a mall, we can observe that items with similar uses are grouped together: t-shirts are in one section and trousers in another, and in the produce section apples, bananas, mangoes, etc. are grouped separately so that customers can easily find what they need. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Implementation of the K-means Clustering Algorithm
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What is the K-means Clustering Algorithm?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How does the k-means algorithm work?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to find and choose the value of "k", the number of clusters in k-means clustering.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data preprocessing.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standardization and feature scaling.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fitting the training data and data transformation.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Training the K-means Algorithm on the Training Dataset.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make Predictions.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inspect the coordinates of the 5 centroids&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Finding the Optimal (k) number of clusters using the Elbow Method.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visualizing the Clusters&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summary Findings&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is the K-means Clustering Algorithm?
&lt;/h2&gt;

&lt;p&gt;K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here, k defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.&lt;/p&gt;

&lt;p&gt;It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of similar properties.&lt;/p&gt;

&lt;p&gt;It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any labeled training data.&lt;/p&gt;

&lt;p&gt;It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.&lt;/p&gt;

&lt;p&gt;The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until the clusters stop improving. The value of k must be predetermined in this algorithm.&lt;/p&gt;

&lt;p&gt;The k-means clustering algorithm mainly performs two tasks: it determines the best value for the k center points (centroids) through an iterative process, and it assigns each data point to its closest k-center. The data points nearest a particular k-center form a cluster.&lt;/p&gt;

&lt;p&gt;Hence each cluster has data points with some commonalities, and it is away from other clusters. &lt;/p&gt;

&lt;p&gt;Consider the below diagram that explains the working of the K-means Clustering Algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqorsy60pnxb1npr33qb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqorsy60pnxb1npr33qb.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the k-means algorithm work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The workings of the k-means algorithm can be explained in the steps below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Select the number k to decide the number of clusters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select k random points or centroids (they can be points from the input dataset).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign each data point to its closest centroid, which will form the predefined &lt;strong&gt;k&lt;/strong&gt; clusters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the variance and place a new centroid of each cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repeat the third step: reassign each data point to the new closest centroid of each cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If any reassignment occurred, go back to step 4; otherwise, finish.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The model is ready.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
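&lt;p&gt;The steps above can be sketched directly in NumPy. This is a minimal illustration on a made-up two-blob dataset, not a production implementation; in practice scikit-learn's KMeans is the usual choice.&lt;/p&gt;

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal k-means loop following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random points from the data as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs; k-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```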

&lt;p&gt;Let’s now understand the above steps with the help of visual plots. Suppose we have two variables, &lt;strong&gt;M1&lt;/strong&gt; and &lt;strong&gt;M2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The x-y axis scatter plot of these two variables is given below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguh62xhyx8pw06nnpkn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguh62xhyx8pw06nnpkn7.png" alt=" " width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Let’s take the number of clusters k, i.e. K=2, to identify the dataset and put the points into different clusters. This means we will try to group the data into two different clusters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need to choose some random k points or centroids to form the clusters. These points can be points from the dataset or any other points. So, here we are selecting the two points below as k points, which are not part of our dataset.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the below visual plots:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftac67gpl1m7fnjn9t8xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftac67gpl1m7fnjn9t8xi.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Now we will assign each data point in the scatter plot to its closest k-point or centroid. We compute this by applying the familiar mathematics for the distance between two points, and we draw a median line between the two centroids. See the visual plot below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrywa0u2emgj8zdm8io6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrywa0u2emgj8zdm8io6.png" alt=" " width="800" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above scatter visualization, it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points to the right of the line are closer to the orange centroid. Let's color them blue and orange for clearer visualization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qv2ginpir6xa10z3f8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qv2ginpir6xa10z3f8q.png" alt=" " width="698" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to find the closest clusters, so we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and place the new centroids there, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvurhsc0jwcam7r2x2jx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvurhsc0jwcam7r2x2jx.png" alt=" " width="672" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will look like the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0n37eeactunvs2mzoer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0n37eeactunvs2mzoer.png" alt=" " width="679" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above image, we can see one orange point on the left side of the line, and two blue points to the right of the line. So, these three points will be assigned to new centroids.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0qx4gofhu5ttx2tvvan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0qx4gofhu5ttx2tvvan.png" alt=" " width="611" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since reassignment has taken place, we will again go to step 4, which is finding new centroids or k-points.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the image below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3ivt2kv6nfyq8bsaqgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3ivt2kv6nfyq8bsaqgo.png" alt=" " width="665" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we have new centroids, we will again draw the median line and reassign the data points. The image will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjy8zonc2k2piuvhgsct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjy8zonc2k2piuvhgsct.png" alt=" " width="658" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see in the above image that there are no dissimilar data points on either side of the line, which means our model has converged. Consider the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33qju1kba1cqv1m98t7m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33qju1kba1cqv1m98t7m.png" alt=" " width="654" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As our model is ready, we can now remove the assumed centroid markers, and the two final clusters will be as shown in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fc9cettfxr89gn47j6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fc9cettfxr89gn47j6u.png" alt=" " width="613" height="479"&gt;&lt;/a&gt;&lt;/p&gt;
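&lt;p&gt;As listed in the prerequisites, the optimal k can be found with the elbow method: fit k-means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. A hedged sketch using scikit-learn (assumed installed) on made-up data with three obvious clusters:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with 3 clearly separated clusters (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

# Fit k-means for k = 1..6 and record the WCSS (sklearn's inertia_).
wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# Plotting wcss against k with matplotlib would show a sharp drop
# up to k=3 and a flat tail after it: the "elbow" suggests k=3.
```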

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>clustering</category>
      <category>python</category>
    </item>
    <item>
      <title>Association Rule Learning</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Fri, 03 Jan 2025 13:29:46 +0000</pubDate>
      <link>https://dev.to/victoralando/association-rule-learning-27k2</link>
      <guid>https://dev.to/victoralando/association-rule-learning-27k2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Association rule learning is one of the most important concepts of machine learning, and it is employed in market basket analysis, web usage mining, and continuous production. Market basket analysis is a technique used by large retailers to discover associations between items.&lt;/p&gt;

&lt;p&gt;We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together. For example, if a customer buys bread, they are likely to also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is Association Rule Learning?&lt;/li&gt;
&lt;li&gt;How does Association Rule Learning Work?&lt;/li&gt;
&lt;li&gt;Types of Association Rules.&lt;/li&gt;
&lt;li&gt;Metrics of Association Rules.&lt;/li&gt;
&lt;li&gt;Types of Association Rule Algorithms.&lt;/li&gt;
&lt;li&gt;Applications of Association Rule Learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;People usually say that people who buy diapers must also buy juice ~ Anonymous&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u8o31hru88jclp1tind.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7u8o31hru88jclp1tind.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Association Rule Learning?
&lt;/h2&gt;

&lt;p&gt;Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly so that the result can be more profitable. It tries to find interesting relations or associations among the variables of a dataset, based on different rules for discovering interesting relations between variables in a database.&lt;/p&gt;

&lt;p&gt;Association rule learning can be divided into three types of algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apriori.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eclat.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;F-P Growth Algorithm.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does Association Rule Learning Work?
&lt;/h2&gt;



&lt;p&gt;Association rule learning works on the concept of an if/then statement, such as: if A, then B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgd7p51629b7lxaycqwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgd7p51629b7lxaycqwa.png" alt=" " width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the "if" element is called the antecedent, and the "then" element is called the consequent. A relationship where we find an association between two single items is known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics:&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics of Association Rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confidence&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lift&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s understand each of them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Support is the frequency of an item, or how frequently it appears in the dataset. It is defined as the fraction of the transactions &lt;strong&gt;T&lt;/strong&gt; that contain the itemset X. For transactions &lt;strong&gt;T&lt;/strong&gt;, it can be written as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd6n1d9d8mjdz5x15hse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd6n1d9d8mjdz5x15hse.png" alt=" " width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Confidence indicates how often the rule has been found to be true, or how often items X and Y occur together in the dataset given that X already occurs. It is the ratio of the transactions that contain both X and Y to the number of transactions that contain X.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwga9nmpgi0nrvnx32aeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwga9nmpgi0nrvnx32aeg.png" alt=" " width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lift is the strength of a rule. It is defined as the ratio of the observed support to the support expected if X and Y were independent of each other. It has three possible ranges of values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaetxsf879wuqymd1pqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaetxsf879wuqymd1pqp.png" alt=" " width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;
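&lt;p&gt;The three metrics can be computed directly from a list of transactions. Below is a minimal sketch in plain Python; the toy transactions are made up for illustration:&lt;/p&gt;

```python
# Toy transaction data (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(x, y):
    """How often Y occurs in transactions that already contain X."""
    return support(x | y) / support(x)

def lift(x, y):
    """Observed support of X and Y relative to what independence predicts."""
    return support(x | y) / (support(x) * support(y))

# bread appears in 4 of 5 transactions, butter in 4 of 5, and the pair
# {bread, butter} in 3 of 5, so support = 0.6 and confidence = 0.75;
# lift comes out just below 1, meaning the pair co-occurs slightly
# less often than chance would predict.
```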

&lt;h2&gt;
  
  
  Types of Association Rule Algorithms
&lt;/h2&gt;

&lt;p&gt;Association rule learning can be divided into three algorithms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apriori Algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This algorithm uses frequent itemsets to generate association rules. It is designed to work on datasets that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.&lt;/p&gt;

&lt;p&gt;It is mainly used for market basket analysis and helps in understanding which products may be bought together. It can also be used in the healthcare industry to find drug reactions in patients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eclat Algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first search technique to find frequent itemsets in a transaction database. It generally executes faster than the Apriori algorithm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;F-P Growth Algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;F-P growth stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.&lt;/p&gt;
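&lt;p&gt;All three algorithms share the same core job: finding itemsets whose support clears a threshold. A minimal, illustrative frequent-itemset count in plain Python (not the full Apriori pruning logic; the transactions are made up):&lt;/p&gt;

```python
from itertools import combinations

# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]
min_support = 0.5  # keep itemsets appearing in at least half the transactions

items = sorted(set().union(*transactions))
frequent = {}
# Count singletons and pairs level by level, Apriori-style.
for size in (1, 2):
    for combo in combinations(items, size):
        count = sum(1 for t in transactions if set(combo).issubset(t))
        if count / len(transactions) >= min_support:
            frequent[combo] = count / len(transactions)

# Every single item and every pair clears the 0.5 threshold here,
# so `frequent` maps all of them to their support values.
```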

&lt;h2&gt;
  
  
  Applications of Association Rule Learning
&lt;/h2&gt;

&lt;p&gt;It has various applications in machine learning and data mining. Below are some of the popular applications of association rule learning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market Basket Analysis&lt;/strong&gt;: This is one of the most popular applications of association rule mining. The technique is commonly used by big retailers to determine associations between items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medical Diagnosis&lt;/strong&gt;: Association rules help in identifying the probability of illness for a particular disease, which supports earlier and easier treatment of patients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Protein Sequence&lt;/strong&gt;: The association rules help in determining the synthesis of artificial proteins.&lt;/p&gt;

&lt;p&gt;It is also used for &lt;strong&gt;Catalog Design and Loss-leader Analysis&lt;/strong&gt;, among many other applications.&lt;/p&gt;

&lt;p&gt;For Python Implementation of Association Rule &lt;a href="https://dev.tourl"&gt;Click Here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Thanks for reading! Give me a thumbs up.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>datamining</category>
      <category>associationrule</category>
    </item>
    <item>
      <title>The Ultimate Guide to Data Analytics</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Thu, 29 Aug 2024 10:18:36 +0000</pubDate>
      <link>https://dev.to/victoralando/the-ultimate-guide-to-data-analytics-14jj</link>
      <guid>https://dev.to/victoralando/the-ultimate-guide-to-data-analytics-14jj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data is shorthand for “information,” and whether you are collecting, reviewing, and/or analyzing data, this process has always been part of Head Start program operations. Enrolling students into the program requires many pieces of information. The provision of health and dental services includes information from screenings and any follow-up services that are provided. All areas of a Head Start program – content and management – involve the collection and use of substantial amounts of information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Data Analysis?&lt;/li&gt;
&lt;li&gt;Who is a Data Analyst?&lt;/li&gt;
&lt;li&gt;The Data Analysis Life Cycle.&lt;/li&gt;
&lt;li&gt;Improving model functionality.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Is Data Analysis?
&lt;/h2&gt;

&lt;p&gt;Data analysis is the processing of data to yield useful insights or knowledge.&lt;br&gt;
 • Data processing involves finding, loading, cleaning, manipulating, transforming, modeling, and visualizing the data.&lt;br&gt;
 • The knowledge may be used for scientific discovery, business decision-making, or a variety of other applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A data analyst&lt;/strong&gt; is a person who uses tools and applications to transform raw data into a form that will be useful.&lt;/p&gt;

&lt;p&gt;From this perspective, we present a data analysis process that includes the following key components:&lt;br&gt;
• Purpose&lt;br&gt;
• Questions&lt;br&gt;
• Data Collection&lt;br&gt;
• Data Analysis Procedures and Methods&lt;br&gt;
• Interpretation/Identification of Findings&lt;br&gt;
• Writing, Reporting, and Dissemination; and&lt;br&gt;
• Evaluation&lt;br&gt;
We have also found, from our review of the literature, that there are many different ways of conceptualizing the data analysis process. We can make a basic distinction between a linear approach and a cyclical approach; in this Handbook we provide examples of both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Analytics Life Cycle
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sv1dmpmy7k1xqo7vpxp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sv1dmpmy7k1xqo7vpxp.jpg" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data is precious in today’s digital environment. It goes through several life stages, including creation, testing, preprocessing, consumption, and reuse.&lt;/p&gt;

&lt;p&gt;These stages are mapped out in the Data Analytics Life Cycle for professionals working on data analytics initiatives. Each stage has its own significance and characteristics.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define the Problem&lt;/strong&gt;
In the data analysis process, the most challenging phase is to define the problem that needs to be solved. Deciphering the root cause of an issue requires a profound understanding of a business’ needs and aspirations, and involves a deep dive into metrics, KPIs, and other crucial indicators.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This stage involves conducting initial analyses in order to gain valuable insights. It is crucial that this stage is done properly, as it lays a strong foundation for the entire data analysis process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Collection&lt;/strong&gt;&lt;br&gt;
After defining the problem, a data analyst determines the most suitable data to address the question. The types of data usually collected here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quantitative data – like marketing figures – or qualitative data – like customer reviews.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data types can be further categorized into 3 main groups:&lt;br&gt;
&lt;strong&gt;first-party data&lt;/strong&gt; (data collected directly by an organization),&lt;br&gt;
&lt;strong&gt;second-party data&lt;/strong&gt; (first-party data collected by one organization and used by another), and &lt;strong&gt;third-party data&lt;/strong&gt; (data aggregated from multiple sources by a third party).&lt;/p&gt;

&lt;p&gt;If the necessary data is incomplete or missing, the data analyst is responsible in this step for devising a data collection strategy. This includes methods such as surveys, social media monitoring, website analytics tracking, and online tracking in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Cleaning&lt;/strong&gt;&lt;br&gt;
Freshly-collected data in its raw form is typically unorganized and messy. Before proceeding with the necessary analysis, that data must be cleaned up. In order to clean data, errors, duplicates, and outliers must be removed, along with any irrelevant data that does not contribute to the analysis being done.&lt;/p&gt;

&lt;p&gt;Additionally, the data must be restructured in a more meaningful manner depending on the type of analysis being done. Missing values must be filled in, too, in order to make the data more accurate. Data that is highly accurate can provide more valuable insights in the data analysis process.&lt;/p&gt;
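&lt;p&gt;These cleaning steps map directly onto Pandas operations. A small sketch (the messy frame is invented for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# A deliberately messy frame: one duplicate row and one missing value.
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c"],
    "spend": [10.0, 20.0, 20.0, np.nan],
})

df = df.drop_duplicates()  # remove duplicate records
# Fill the missing value with the column mean to keep the row usable.
df["spend"] = df["spend"].fillna(df["spend"].mean())
```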

&lt;p&gt;&lt;strong&gt;4. Data validation&lt;/strong&gt;&lt;br&gt;
After it is cleaned, the data must be validated. This process involves verifying whether the data meets the specific requirements of the analysis being performed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Perform Exploratory Data Analysis (EDA)&lt;/strong&gt;&lt;br&gt;
Exploratory Data Analysis (EDA) is a key step in any data analysis process. It helps visualize the patterns, characteristics, and relationships between variables. Python provides various libraries for EDA, such as &lt;strong&gt;NumPy&lt;/strong&gt;, &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Matplotlib&lt;/strong&gt;, &lt;strong&gt;Seaborn&lt;/strong&gt;, and &lt;strong&gt;Plotly&lt;/strong&gt;.&lt;/p&gt;
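&lt;p&gt;A first EDA pass often starts with summary statistics, missing-value counts, and correlations. A minimal sketch with Pandas (the small DataFrame is invented for illustration):&lt;/p&gt;

```python
import pandas as pd

# A tiny made-up dataset to explore.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [30000, 45000, 80000, 72000, 52000],
})

print(df.describe())    # summary statistics per column
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations between variables
```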

&lt;p&gt;&lt;strong&gt;6. Build the Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Model building is an essential part of data analytics and is used to extract insights and knowledge from the data to inform business decisions and strategies. In this phase of the project, the data science team needs to develop datasets for training, testing, and production purposes. To do this, the dataset needs to be divided into two parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training dataset &lt;/li&gt;
&lt;li&gt;Test dataset
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: Based on the quality and quantity of the data, one may instead choose to divide the dataset into three parts: training, testing, and validation data.&lt;/p&gt;

&lt;p&gt;To divide the dataset, the Python &lt;strong&gt;sklearn&lt;/strong&gt; library is used to perform the train/test split. The data analyst chooses the ratio by which to divide the dataset; by default it is 8:2, meaning 80% for training and 20% for testing.&lt;/p&gt;
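&lt;p&gt;As an illustration, here is a minimal pure-Python sketch of the 80/20 split (the toy data and the helper name are invented for illustration; in practice sklearn's &lt;code&gt;train_test_split&lt;/code&gt; would be used):&lt;/p&gt;

```python
import random

def train_test_split_simple(data, test_ratio=0.2, seed=42):
    """Shuffle and split a list into train/test subsets (default 80:20)."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx = set(indices[:n_test])
    # Keep original order within each subset.
    train = [x for i, x in enumerate(data) if i not in test_idx]
    test = [x for i, x in enumerate(data) if i in test_idx]
    return train, test

train, test = train_test_split_simple(list(range(100)))
print(len(train), len(test))  # 80 20
```

&lt;p&gt;Seeding the shuffle keeps the split reproducible, which matters when comparing models.&lt;/p&gt;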

&lt;p&gt;&lt;strong&gt;7. Share the outcome&lt;/strong&gt;&lt;br&gt;
After conducting the analysis and extracting important insights, the final step lies in effectively communicating these findings to those who initiated the project in the first place.&lt;/p&gt;

&lt;p&gt;While it is essential to interpret the data accurately, it is equally important to present those findings clearly and concisely. A data analyst often works with marketing executives or stakeholders who are under time constraints and may not possess much technical expertise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Model Deployment&lt;/strong&gt;&lt;br&gt;
After a successful analysis, the final stage is to deploy the model into a real-world system or application to automatically generate predictions or perform specific tasks.&lt;/p&gt;

&lt;p&gt;Thank you for reading.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Tue, 20 Aug 2024 17:18:18 +0000</pubDate>
      <link>https://dev.to/victoralando/understanding-your-data-the-essentials-of-exploratory-data-analysis-4o0b</link>
      <guid>https://dev.to/victoralando/understanding-your-data-the-essentials-of-exploratory-data-analysis-4o0b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often by applying data visualization methods. It is a vital process in data science since it helps you understand the data you are dealing with and draw conclusions from it. EDA serves as a bridge between data collection and building machine learning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is Exploratory Data Analysis&lt;/li&gt;
&lt;li&gt;Data Preprocessing and Feature Engineering in data science&lt;/li&gt;
&lt;li&gt;Types of Exploratory Data Analysis&lt;/li&gt;
&lt;li&gt;EDA Python Libraries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;1. What Is Exploratory Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exploratory data analysis (EDA) is a critical initial step in the data science workflow. It involves using Python libraries to inspect, summarize, and visualize data to uncover trends, patterns, and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Preprocessing and Feature Engineering in data science.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data preprocessing and feature engineering are crucial steps in preparing datasets for effective model training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data preprocessing involves cleaning and preparing raw data to facilitate effective analysis and improve the validity of a model. Data preprocessing includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling Missing Values.&lt;/li&gt;
&lt;li&gt;Detecting Outliers in a given dataset.&lt;/li&gt;
&lt;li&gt;Encoding categorical variables: variables such as gender or country names need to be converted into numerical format for machine learning algorithms. Encoding techniques like &lt;strong&gt;one-hot encoding&lt;/strong&gt; or &lt;strong&gt;label encoding&lt;/strong&gt; transform categorical variables into a format that algorithms can understand.&lt;/li&gt;
&lt;li&gt;Checking for duplicate entries in a given dataset.&lt;/li&gt;
&lt;li&gt;Performing a train/test split so the dataset is divided into two sets, one for training the model and one for testing it.&lt;/li&gt;
&lt;/ul&gt;
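&lt;p&gt;To make the encoding step concrete, here is a minimal sketch of label encoding and one-hot encoding on an invented colour column (in practice pandas' &lt;code&gt;get_dummies&lt;/code&gt; or sklearn's encoders would be used):&lt;/p&gt;

```python
# Toy categorical column, invented for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: map each category to an integer.
categories = sorted(set(colors))            # ['blue', 'green', 'red']
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[c] for c in colors]     # [2, 1, 0, 1, 2]

# One-hot encoding: one binary column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(labels)
print(one_hot)
```

&lt;p&gt;One-hot encoding avoids implying an order between categories, which label encoding can accidentally introduce.&lt;/p&gt;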

&lt;p&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature engineering is a critical task that significantly influences the outcome of a model. It involves crafting new features based on existing data. This task is called &lt;strong&gt;Creation of Derived Features&lt;/strong&gt;. For example, extracting the day of the week from a date or creating interaction terms between existing features can provide valuable information.&lt;/p&gt;
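&lt;p&gt;For example, the day-of-week feature mentioned above can be derived with Python's standard library (the dates are invented for illustration):&lt;/p&gt;

```python
from datetime import date

# Hypothetical date column; the derived feature is the weekday name.
orders = [date(2024, 8, 19), date(2024, 8, 20), date(2024, 8, 24)]
day_of_week = [d.strftime("%A") for d in orders]
print(day_of_week)  # ['Monday', 'Tuesday', 'Saturday']
```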

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dimensionality Reduction&lt;/strong&gt; is also another method of feature engineering. High-dimensional datasets may suffer from the curse of dimensionality, leading to increased computational complexity and potential overfitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Outliers&lt;/strong&gt; is equally important: outliers can distort model training, and addressing them is crucial. Techniques such as trimming, winsorizing, or transforming features can mitigate the impact of outliers on model performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Types of Exploratory Analysis.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main types of Exploratory Data Analysis (EDA) are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Univariate (Non - Graphical).&lt;/li&gt;
&lt;li&gt;Univariate (Graphical).&lt;/li&gt;
&lt;li&gt;Bivariate.&lt;/li&gt;
&lt;li&gt;Multivariate (Non-Graphical).&lt;/li&gt;
&lt;li&gt;Multivariate (Graphical).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Univariate non-graphical
&lt;/h4&gt;

&lt;p&gt;This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since there is a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find the patterns that exist within it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Univariate graphical
&lt;/h4&gt;

&lt;p&gt;Non-graphical methods don’t provide a full picture of the data, so graphical methods are also required. Common types of univariate graphics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stem-and-leaf plots, which show all data values and the shape of the distribution.&lt;/li&gt;
&lt;li&gt;Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.&lt;/li&gt;
&lt;li&gt;Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.&lt;/li&gt;
&lt;/ul&gt;
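&lt;p&gt;The five-number summary that a box plot depicts can be computed with Python's standard library (the data here is a toy sample):&lt;/p&gt;

```python
import statistics

# Toy sample, invented for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# quantiles(n=4) returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
summary = {
    "min": min(data),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(data),
}
print(summary)
```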

&lt;h4&gt;
  
  
  Multivariate Non-graphical
&lt;/h4&gt;

&lt;p&gt;Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multivariate graphical
&lt;/h4&gt;

&lt;p&gt;Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.&lt;/p&gt;

&lt;p&gt;Other common types of multivariate graphics include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scatter plots&lt;/strong&gt;: plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.&lt;br&gt;
&lt;strong&gt;Multivariate chart&lt;/strong&gt;: a graphical representation of the relationships between factors and a response.&lt;br&gt;
&lt;strong&gt;Run chart&lt;/strong&gt;: a line graph of data plotted over time.&lt;br&gt;
&lt;strong&gt;Bubble chart&lt;/strong&gt;: a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.&lt;br&gt;
&lt;strong&gt;Heat map&lt;/strong&gt;: a graphical representation of data where values are depicted by color.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. EDA Python Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python’s top libraries for EDA include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas for data manipulation.&lt;/li&gt;
&lt;li&gt;Matplotlib and Seaborn for visualisations.&lt;/li&gt;
&lt;li&gt;Plotly for interactive plots.&lt;/li&gt;
&lt;li&gt;Dask for scalable computing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These libraries enhance data analysis by offering powerful tools for summarizing, visualizing, and managing large datasets effectively.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Feature Engineering</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Mon, 19 Aug 2024 10:11:50 +0000</pubDate>
      <link>https://dev.to/victoralando/feature-engineering-304p</link>
      <guid>https://dev.to/victoralando/feature-engineering-304p</guid>
      <description></description>
    </item>
    <item>
      <title>Build an LSTM Stock Market model for Time Series Prediction:</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Mon, 29 Apr 2024 10:13:44 +0000</pubDate>
      <link>https://dev.to/victoralando/build-an-lstm-model-for-time-series-prediction-using-python-with-tensorflowkeras-2h7c</link>
      <guid>https://dev.to/victoralando/build-an-lstm-model-for-time-series-prediction-using-python-with-tensorflowkeras-2h7c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;These import statements lay the groundwork for conducting time series analysis and building an LSTM neural network model for stock price prediction. Each imported library serves a specific purpose in the data fetching, preprocessing, modeling, and visualization stages of the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data Retrieval&lt;/li&gt;
&lt;li&gt;Data Preprocessing&lt;/li&gt;
&lt;li&gt;Modeling&lt;/li&gt;
&lt;li&gt;Visualization&lt;/li&gt;
&lt;li&gt;Predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this tutorial, we are going to learn how to build LSTM models for time series predictions. LSTM stands for Long Short-Term Memory.&lt;/p&gt;

&lt;p&gt;LSTMs are models built on recurrent neural networks (RNNs) that are particularly effective for sequence prediction problems, such as time series forecasting.&lt;/p&gt;

&lt;p&gt;You need good machine learning models that can look at the history of a sequence of data and correctly predict what the future elements of the sequence are going to be.&lt;/p&gt;

&lt;p&gt;We are first going to load the data from Alpha Vantage. Since we will predict American Airlines stock market prices, we are going to set the ticker to "AAL". &lt;/p&gt;

&lt;p&gt;Additionally, you also define a url_string, which will return a JSON file with all the stock market data for American Airlines within the last 20 years, and a file_to_save, which will be the file to which you save the data. You'll use the ticker variable that you defined beforehand to help name this file.&lt;/p&gt;
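&lt;p&gt;A sketch of how these variables might be defined is shown below; the &lt;code&gt;TIME_SERIES_DAILY&lt;/code&gt; endpoint parameters and the placeholder API key are assumptions, not the article's verbatim code:&lt;/p&gt;

```python
# "demo" is a placeholder API key; replace it with your own Alpha Vantage key.
api_key = "demo"
ticker = "AAL"

# URL returning a JSON file with the full daily price history for the ticker.
url_string = (
    "https://www.alphavantage.co/query?"
    f"function=TIME_SERIES_DAILY&symbol={ticker}&outputsize=full&apikey={api_key}"
)

# File to which the downloaded data will be saved, named after the ticker.
file_to_save = f"stock_market_data-{ticker}.csv"
print(url_string)
print(file_to_save)
```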

</description>
      <category>deeplearning</category>
      <category>python</category>
      <category>tensorflow</category>
      <category>keras</category>
    </item>
    <item>
      <title>K-Nearest Neighbor(K-NN) Algorithms for Machine Learning</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Wed, 10 Apr 2024 10:00:31 +0000</pubDate>
      <link>https://dev.to/victoralando/k-nearest-neighborknn-algorithms-for-machine-learning-2n1b</link>
      <guid>https://dev.to/victoralando/k-nearest-neighborknn-algorithms-for-machine-learning-2n1b</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this tutorial, you are going to learn how K-Nearest Neighbors (K-NN) is applied in machine learning models, particularly in classification.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is K-Nearest Neighbor (K-NN)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;K-Nearest Neighbor is one of the simplest machine learning algorithms, based on the supervised learning technique.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K-NN assumes similarity between the new case and the available cases and puts the new case into the category most similar to the available categories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when a new data point appears, it can easily be classified into a well-suited category using the K-NN algorithm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The K-NN algorithm can be used for regression as well as for classification problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;K-NN is a non-parametric algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at classification time, performs an action on it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At the training phase the K-NN algorithm just stores the dataset; when it receives new data, it classifies that data into the category most similar to it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt; &lt;br&gt;
Suppose we have an image of a creature that looks similar to both a cheetah and a leopard, and we want to know whether it is a cheetah or a leopard. For this identification we can use the K-NN algorithm, as it works on a similarity measure.&lt;/p&gt;

&lt;p&gt;Our K-NN model will find the features of the new image that are similar to the cheetah and leopard images and, based on the most similar features, will put it in either the cheetah or the leopard category.&lt;/p&gt;
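&lt;p&gt;A minimal K-NN classifier can be sketched in pure Python; the 2-D points and labels below are toy assumptions standing in for extracted image features:&lt;/p&gt;

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training set: (feature vector, label) pairs.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (6, 5)))  # B
```

&lt;p&gt;Choosing an odd &lt;code&gt;k&lt;/code&gt; avoids ties in two-class problems.&lt;/p&gt;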

&lt;p&gt;&lt;strong&gt;Why do we need a K-NN Algorithm?&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Understanding Data Warehousing</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Tue, 09 Apr 2024 08:21:16 +0000</pubDate>
      <link>https://dev.to/victoralando/data-warehousing-3l44</link>
      <guid>https://dev.to/victoralando/data-warehousing-3l44</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;A data warehouse works like a relational database designed for analytical needs. It functions on the basis of OLAP (Online Analytical Processing). It is a central location where consolidated data from multiple sources (databases) is stored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we are going to learn the following concepts in Data Warehousing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Data Warehousing.&lt;/li&gt;
&lt;li&gt;Data Warehouse Process.&lt;/li&gt;
&lt;li&gt;Data Warehousing Architecture.&lt;/li&gt;
&lt;li&gt;Data Warehouse Characteristics.&lt;/li&gt;
&lt;li&gt;Modern Data Warehousing Examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. What is Data Warehousing
&lt;/h2&gt;

&lt;p&gt;Data warehousing is the act of organizing &amp;amp; storing data in a way that makes its retrieval efficient and insightful. It can also be described as the process of transforming data into information for future business intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Data Warehouse Process.
&lt;/h2&gt;

&lt;p&gt;The data warehousing process refers to the sequence of steps involved in collecting, preparing, storing, and delivering data for analytics and business intelligence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tq980pfd5rs4dfgluga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tq980pfd5rs4dfgluga.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process of data warehousing can involve the following tools:&lt;br&gt;
i) Airbyte&lt;br&gt;
ii) Fivetran&lt;br&gt;
iii) Talend&lt;br&gt;
iv) Informatica&lt;br&gt;
v) custom ETL scripts (Python/Spark)&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhupmu37hbvx2w2pne1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhupmu37hbvx2w2pne1q.png" alt=" " width="682" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Sources
&lt;/h4&gt;

&lt;p&gt;In data warehousing architecture, data sources are key to the preparation of data warehouse. The goal of data extraction is to collect data from multiple sources like:&lt;/p&gt;

&lt;p&gt;i) Transactional databases (e.g., MySQL, PostgreSQL)&lt;br&gt;
ii) ERP/CRM systems (e.g., SAP, Salesforce)&lt;br&gt;
iii) Flat files (CSV, Excel)&lt;br&gt;
iv) APIs or web services&lt;br&gt;
v) Logs and IoT devices&lt;/p&gt;

&lt;h4&gt;
  
  
  ETL / ELT Process
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ETL = Extract → Transform → Load.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract data from sources&lt;/li&gt;
&lt;li&gt;Transform (clean, merge, format, validate)&lt;/li&gt;
&lt;li&gt;Load into the warehouse.&lt;/li&gt;
&lt;/ul&gt;
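&lt;p&gt;The three steps can be sketched in Python; the in-memory rows below stand in for a real source system, and a real pipeline would load into a warehouse table rather than print:&lt;/p&gt;

```python
# Extract: simulated source rows (invented for illustration).
raw_rows = [
    {"name": " Alice ", "amount": "120.50", "country": "ke"},
    {"name": "Bob", "amount": "80.00", "country": "KE"},
    {"name": " Alice ", "amount": "120.50", "country": "ke"},  # duplicate
]

# Transform: clean whitespace, cast types, normalise codes, deduplicate.
seen, clean_rows = set(), []
for row in raw_rows:
    record = (row["name"].strip(), float(row["amount"]), row["country"].upper())
    if record not in seen:
        seen.add(record)
        clean_rows.append(record)

# Load: here we just print; in practice, INSERT into the warehouse.
for r in clean_rows:
    print(r)
```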

&lt;h4&gt;
  
  
  Storage Layers
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging area&lt;/strong&gt;: temporary holding zone before processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data warehouse&lt;/strong&gt;: structured, cleaned, integrated data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data marts&lt;/strong&gt;: subsets of warehouse data for specific teams (e.g. sales, finance)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  BI Tools and Analytics
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Power BI&lt;/li&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Looker&lt;/li&gt;
&lt;li&gt;Superset&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;Metabase&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Data Warehouse Characteristics.
&lt;/h2&gt;

&lt;p&gt;To support the decision-making process, a data warehouse must be a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s goals and business improvement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subject-Oriented
&lt;/h3&gt;

&lt;p&gt;A data warehouse can be used to analyze a particular subject area through targeted business analysis. For example, “&lt;strong&gt;Sales&lt;/strong&gt;” can be a particular subject of analysis.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>cloud</category>
      <category>azure</category>
    </item>
    <item>
      <title>Data Science for Beginners: 2023 - 2024 Complete Road Map</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Fri, 05 Jan 2024 16:36:15 +0000</pubDate>
      <link>https://dev.to/victoralando/data-science-for-beginners-2023-2024-complete-road-map-1ad8</link>
      <guid>https://dev.to/victoralando/data-science-for-beginners-2023-2024-complete-road-map-1ad8</guid>
      <description>&lt;p&gt;Data science is the study of data, much like marine biology is the study of sea-dwelling biological life forms. Data scientists construct questions around specific data sets and then use data analytics and advanced analytics to find patterns, create predictive models, and develop insights that guide decision-making within business logics.&lt;/p&gt;

&lt;p&gt;Roadmaps are strategic plans that determine a goal or the desired outcome and feature the significant steps or milestones required to reach it.&lt;/p&gt;

&lt;p&gt;A data science roadmap is a visual representation of a strategic plan designed to help those aspiring to learn and succeed in the field of data science.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Tools used in Data Science
&lt;/h2&gt;

&lt;p&gt;We’ll take a look at key data science tools that will help make your data science roadmap journey successful.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Programming Languages&lt;/strong&gt; - There are different programming languages that you need to master and know how to use. Examples are:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;R&lt;/strong&gt; - similar to Python, R is a popular programming language for working with data. It is powerful for data wrangling with &lt;strong&gt;dplyr&lt;/strong&gt; and can create almost any kind of chart you might need with &lt;strong&gt;ggplot2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; - Python is one of the greatest options available to you. In Python, often working inside a Jupyter Notebook, you can take advantage of libraries such as:&lt;br&gt;
a) Pandas&lt;br&gt;
b) Matplotlib&lt;br&gt;
c) Scikit-learn&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt; - stands for Structured Query Language. The most common way to interact with relational databases is through SQL, which allows the user to insert, update, delete, and select data from databases and to create new tables. &lt;/p&gt;
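&lt;p&gt;Those SQL verbs can be tried from Python using the built-in &lt;code&gt;sqlite3&lt;/code&gt; module; the table and rows below are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory database: CREATE, INSERT, UPDATE, SELECT in a few lines.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO employees (id, name) VALUES (?, ?)", (101, "Mary"))
cur.execute("UPDATE employees SET name = ? WHERE id = ?", ("Mary A.", 101))
rows = cur.execute("SELECT id, name FROM employees").fetchall()
print(rows)  # [(101, 'Mary A.')]
conn.close()
```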

&lt;p&gt;2. &lt;strong&gt;Machine Learning Libraries&lt;/strong&gt;&lt;br&gt;
  In ML there are libraries you need to get familiar with, like &lt;strong&gt;TensorFlow&lt;/strong&gt;, &lt;strong&gt;Scikit-learn&lt;/strong&gt;, &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Matplotlib&lt;/strong&gt;, &lt;strong&gt;NumPy&lt;/strong&gt;, and &lt;strong&gt;NLTK&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Data Visualization Tools&lt;/strong&gt;&lt;br&gt;
Data visualization tools are software applications that render information in a visual format such as a graph, chart, or heat map for data analysis purposes. Such tools make it easier to understand and work with massive amounts of data. Examples are &lt;strong&gt;Power BI&lt;/strong&gt;, &lt;strong&gt;Tableau&lt;/strong&gt;, and &lt;strong&gt;Matplotlib&lt;/strong&gt;.&lt;br&gt;
4. &lt;strong&gt;Data Storage Software&lt;/strong&gt;&lt;br&gt;
Learn about data storage software like &lt;strong&gt;SQL&lt;/strong&gt;, &lt;strong&gt;MySQL&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, and &lt;strong&gt;MongoDB&lt;/strong&gt;.&lt;br&gt;
5. &lt;strong&gt;Cloud Computing Platforms&lt;/strong&gt;&lt;br&gt;
These include &lt;em&gt;&lt;strong&gt;AWS (Amazon Web Services)&lt;/strong&gt;&lt;/em&gt;, &lt;em&gt;&lt;strong&gt;Microsoft Azure&lt;/strong&gt;&lt;/em&gt;, and &lt;em&gt;&lt;strong&gt;Google Cloud&lt;/strong&gt;&lt;/em&gt;. By learning these, you will be able to use cloud storage services together with your locally stored data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Learn about Programming and Software Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When you begin your data science roadmap, you must have a solid foundation. The data science field requires skills and experience in either software engineering or programming. You should learn a minimum of one programming language, such as &lt;strong&gt;&lt;em&gt;Python&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;SQL&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;Scala&lt;/em&gt;&lt;/strong&gt;, &lt;em&gt;&lt;strong&gt;Java&lt;/strong&gt;&lt;/em&gt;, or &lt;strong&gt;&lt;em&gt;R&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Programming Topics to learn&lt;/strong&gt;&lt;br&gt;
 Data scientists should learn about common data structures (e.g., dictionaries, data types, lists, sets, tuples), searching and sorting algorithms, logic, control flow, writing functions, object-oriented programming, and how to work with external libraries.&lt;/p&gt;
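&lt;p&gt;A quick tour of those common data structures and a sorting step, with invented values for illustration:&lt;/p&gt;

```python
# Dictionary, list, set, and tuple in one place.
scores = {"Alice": 88, "Bob": 72, "Carol": 95}   # dictionary
names = list(scores)                              # list of keys
unique_tags = {"python", "sql", "python"}         # set: duplicates collapse
point = (3, 4)                                    # tuple

# Sorting: rank students by score, highest first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('Carol', 95), ('Alice', 88), ('Bob', 72)]
```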

&lt;p&gt;Additionally, aspiring data scientists should be familiar with using Git and GitHub-related elements such as terminals and version control.&lt;/p&gt;

&lt;p&gt;Finally, data scientists should enjoy a familiarity with SQL scripting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn Git and GitHub&lt;/strong&gt;&lt;br&gt;
Git and GitHub allow you as a data scientist to publish your finished projects. This lets you share your work with the outside world while learning more about Git concepts, the conventions of writing Git files, and collaboration with others.&lt;/p&gt;


</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>visualization</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Database Normalization In DBMS</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Tue, 14 Nov 2023 10:35:16 +0000</pubDate>
      <link>https://dev.to/victoralando/database-normalization-24k0</link>
      <guid>https://dev.to/victoralando/database-normalization-24k0</guid>
      <description>&lt;h2&gt;
  
  
  What is Normalization?
&lt;/h2&gt;

&lt;p&gt;Normalization is the process of organizing the data and the attributes of a database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is performed to reduce data redundancy in a database and ensure that data is stored logically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalization is a systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like insertion, update, and deletion anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Normalization is a multi-step process that puts data in tabular form and removes duplicate data from relation tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Employees Table&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Id&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;Profession&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Mary&lt;/td&gt;
&lt;td&gt;1245&lt;/td&gt;
&lt;td&gt;Developer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;David&lt;/td&gt;
&lt;td&gt;5234&lt;/td&gt;
&lt;td&gt;Accountant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;Juliet&lt;/td&gt;
&lt;td&gt;1444&lt;/td&gt;
&lt;td&gt;Salesperson&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;Elizabeth&lt;/td&gt;
&lt;td&gt;8745&lt;/td&gt;
&lt;td&gt;Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;Haskell&lt;/td&gt;
&lt;td&gt;3251&lt;/td&gt;
&lt;td&gt;Operation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this table, we have data about office employees. Without normalization, such a table can suffer from the following anomalies.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;1. Insertion Anomaly&lt;/strong&gt;&lt;br&gt;
An insertion anomaly occurs in a relational database when some attributes or data items cannot be inserted into the database without the existence of other attributes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Update Anomaly&lt;/strong&gt;&lt;br&gt;
An update anomaly occurs when a data item repeated in several rows must be updated in all of them; if one copy is missed, the data becomes inconsistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Deletion Anomaly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A deletion anomaly occurs when deleting one part of the data unintentionally removes other necessary information from the database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of Normalization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1NF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2NF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3NF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BCNF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4NF&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;5NF&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Diagram:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm33j9bawmz0elwer4wa0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm33j9bawmz0elwer4wa0.png" alt=" " width="610" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. 1NF (First Normal Form)&lt;/strong&gt;&lt;br&gt;
In a 1NF relation, each table cell should contain a single value, and each record should be unique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CourseId&lt;/th&gt;
&lt;th&gt;Course Name&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JAV101&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;NetBeans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL102&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;MySQL, PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PY214&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Flask&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here, in the Framework column of the SQL course, we stored two values (MySQL, PostgreSQL), so it is a &lt;em&gt;multi-valued attribute&lt;/em&gt; and the relation is not in 1NF. We need to convert it into 1NF.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convert it into 1NF&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CourseId&lt;/th&gt;
&lt;th&gt;Course Name&lt;/th&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JAV101&lt;/td&gt;
&lt;td&gt;Java&lt;/td&gt;
&lt;td&gt;NetBeans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL102&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL102&lt;/td&gt;
&lt;td&gt;SQL&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PY214&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Flask&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A simple fix is to store each Framework value in its own row. The table is now in &lt;strong&gt;First Normal Form&lt;/strong&gt;: every cell holds a single value, and no record is repeated.&lt;/p&gt;
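&lt;p&gt;The conversion above can also be sketched programmatically: splitting the comma-separated Framework values into separate rows yields the 1NF table.&lt;/p&gt;

```python
# The original table, with one multi-valued Framework cell (violates 1NF).
courses = [
    ("JAV101", "Java", "NetBeans"),
    ("SQL102", "SQL", "MySQL, PostgreSQL"),
    ("PY214", "Python", "Flask"),
]

# 1NF: one row per (course, framework) pair, each cell atomic.
rows_1nf = [
    (course_id, name, framework.strip())
    for course_id, name, frameworks in courses
    for framework in frameworks.split(",")
]
for row in rows_1nf:
    print(row)
```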

&lt;p&gt;&lt;strong&gt;2. 2NF (Second Normal Form)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A relation in &lt;strong&gt;2NF&lt;/strong&gt; must first be in &lt;strong&gt;1NF&lt;/strong&gt;; in addition, every non-key attribute must be fully functionally dependent on the whole primary key. In the table below, the key is the pair (StudentID, Specialization), yet Student Age depends on StudentID alone; this partial dependency violates 2NF.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;StudentID&lt;/th&gt;
&lt;th&gt;Specialization&lt;/th&gt;
&lt;th&gt;Student Age&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;Data Analyst&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;501&lt;/td&gt;
&lt;td&gt;Data Engineer&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;502&lt;/td&gt;
&lt;td&gt;Full Stack Developer&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;503&lt;/td&gt;
&lt;td&gt;Web Developer&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
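&lt;p&gt;A minimal Python sketch of the decomposition that removes the partial dependency, using the rows from the table above; the names &lt;code&gt;student_age&lt;/code&gt; and &lt;code&gt;specializations&lt;/code&gt; are illustrative stand-ins for the two resulting tables:&lt;/p&gt;

```python
# Rows from the table above: (StudentID, Specialization, StudentAge).
rows = [
    (501, "Data Analyst", 22),
    (501, "Data Engineer", 22),
    (502, "Full Stack Developer", 24),
    (503, "Web Developer", 23),
]

# Decompose into two relations so every non-key attribute depends
# on the whole key of its own table.
student_age = {}          # keyed by StudentID alone
specializations = set()   # keyed by (StudentID, Specialization)
for student_id, specialization, age in rows:
    student_age[student_id] = age
    specializations.add((student_id, specialization))

print(student_age)
print(sorted(specializations))
```

&lt;p&gt;Student Age now lives in a table keyed by StudentID alone, so it is no longer repeated for every specialization a student takes.&lt;/p&gt;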

</description>
      <category>database</category>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Modern Data Stacks</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Thu, 09 Nov 2023 07:22:03 +0000</pubDate>
      <link>https://dev.to/victoralando/modern-data-stacks-1lge</link>
      <guid>https://dev.to/victoralando/modern-data-stacks-1lge</guid>
      <description>&lt;p&gt;Modern data stacks, often referred to as data technology stacks or data toolchains, are the combination of software and technologies used to collect, store, process, and analyze data in contemporary data-driven organizations. These stacks have evolved significantly in recent years, incorporating a variety of open-source and proprietary tools to meet the growing demands of data analytics and data-driven decision-making. Here's a high-level overview of components commonly found in modern data stacks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data Ingestion:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Kafka: For real-time data streaming.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache NiFi, Flume, or Logstash: For data collection and ETL (Extract, Transform, Load) processes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Data Storage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cloud-based data warehouses like Amazon Redshift, Google BigQuery, or Snowflake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Distributed file systems like Hadoop HDFS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NoSQL databases like MongoDB, Cassandra, or Elasticsearch for unstructured or semi-structured data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Traditional relational databases such as PostgreSQL or MySQL.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Data Processing and Transformation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Spark: For distributed data processing and ETL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Flink: For real-time stream processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Beam: For unified batch and stream data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Workflow orchestrators like Apache Airflow for pipeline management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
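&lt;p&gt;The engines above operate at cluster scale, but the underlying batch ETL pattern they implement can be sketched in plain Python with an in-memory SQLite "warehouse"; the event data, column names, and table name are illustrative:&lt;/p&gt;

```python
import sqlite3

# Extract: in a real pipeline this data would come from Kafka, files,
# or an upstream database; here it is inlined for illustration.
raw_events = [
    {"user": "alice", "amount": "19.99"},
    {"user": "bob", "amount": "5.00"},
    {"user": "alice", "amount": "3.50"},
]

# Transform: parse the raw strings and aggregate spend per user.
totals = {}
for event in raw_events:
    totals[event["user"]] = totals.get(event["user"], 0.0) + float(event["amount"])

# Load: write the aggregates into a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (user TEXT PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO spend VALUES (?, ?)", totals.items())
conn.commit()
print(dict(conn.execute("SELECT user, total FROM spend")))
```

&lt;p&gt;In a production stack the same extract-transform-load steps would run on an engine such as Spark or Flink, with a cloud warehouse in place of SQLite.&lt;/p&gt;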

&lt;p&gt;&lt;strong&gt;4. Data Query and Analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SQL-based query engines for data warehousing solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Business intelligence tools like Tableau, Power BI, or Looker.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jupyter notebooks with Python or R for data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom dashboards using frameworks like Superset or Redash.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Data Visualization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tools like Tableau, Power BI, or Qlik for interactive data visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Libraries like D3.js, Plotly, or Matplotlib for custom visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Data Governance and Security:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data catalog and metadata management tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access control and encryption solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data lineage and auditing tools for compliance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;7. Machine Learning and AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Machine learning frameworks like TensorFlow and PyTorch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ML platforms like MLflow for model tracking and management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AutoML tools for automated model building and deployment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;8. Cloud Services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leveraging cloud platforms like AWS, Azure, or Google Cloud for scalable and cost-effective data storage and processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;9. DevOps and Infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Containers and orchestration tools like Docker and Kubernetes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure as code (IaC) for managing and scaling data infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;10. Monitoring and Management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tools for logging, monitoring, and alerting, such as Prometheus, Grafana, or ELK stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data pipeline orchestration and job scheduling using tools like Apache Oozie or Luigi.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The specific components and technologies in a data stack can vary based on the organization's needs, data volume, and budget. Modern data stacks are often designed to be flexible, scalable, and capable of handling both batch and real-time data processing, making them a crucial part of any data-driven enterprise.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Data Engineering for Beginners Step-by-Step</title>
      <dc:creator>Victor Alando</dc:creator>
      <pubDate>Thu, 09 Nov 2023 07:08:48 +0000</pubDate>
      <link>https://dev.to/victoralando/data-engineering-for-beginners-step-by-step-120p</link>
      <guid>https://dev.to/victoralando/data-engineering-for-beginners-step-by-step-120p</guid>
      <description>&lt;p&gt;In this article, We are going to look at the skills and qualifications you need to become a data engineer and provide you with some tips to help you land your first position in the industry. &lt;/p&gt;

&lt;h2&gt;
  
  
  Who is a Data Engineer?
&lt;/h2&gt;

&lt;p&gt;A data engineer is responsible for laying the foundations for the storage, transformation, and management of data in an organization. They manage the design, creation, and maintenance of database architecture and data processing systems; this ensures that the subsequent work of data analysis, visualization, and machine learning model development can be carried out seamlessly, continuously, securely, and effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Responsibilities of a Data Engineer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure that the large volumes of data collected from different sources become accessible raw material for other data science specialists, such as data analysts and data scientists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serve as a data resource expert for the organization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build and execute ETL solution pipelines for multiple clients in different industries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Independently create data-driven solutions that are accurate and informative.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interact with the data science team and assist them by providing suitable datasets for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Leverage various big data engineering tools and cloud service platforms to create data extraction and storage pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Exploring the basics of Data Engineering
&lt;/h2&gt;

&lt;p&gt;To become a data engineer, follow the learning path below.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Learn programming languages such as Python, Scala, R, or Java.&lt;/li&gt;
&lt;li&gt;Learn the basics of automation and scripting.&lt;/li&gt;
&lt;li&gt;Build proficiency in advanced probability and statistics.&lt;/li&gt;
&lt;li&gt;Demonstrate expertise in database management systems.&lt;/li&gt;
&lt;li&gt;Gain experience with cloud service platforms such as AWS, GCP, or Azure.&lt;/li&gt;
&lt;li&gt;Good knowledge of machine learning and deep learning algorithms is a bonus.&lt;/li&gt;
&lt;li&gt;Know popular big data tools like Apache Spark, Apache Hadoop, etc.&lt;/li&gt;
&lt;li&gt;Develop good communication skills, since a data engineer works directly with different teams.&lt;/li&gt;
&lt;/ol&gt;
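&lt;p&gt;As a small taste of step 2 (automation and scripting), here is a sketch of a validation script that checks a CSV export before it moves further down a pipeline; the file contents and column names are purely illustrative:&lt;/p&gt;

```python
import csv
import io

# Simulate a CSV export; in practice this would be open("export.csv").
raw = io.StringIO("id,name\n1,Alice\n2,Bob\n,Carol\n")

# Split rows into valid and invalid based on a required "id" field.
valid, invalid = [], []
for row in csv.DictReader(raw):
    (valid if row["id"] else invalid).append(row)

print(len(valid), "valid rows;", len(invalid), "rejected")
```

&lt;p&gt;Small checks like this, wired into a scheduler, are often a data engineer's first line of defense against bad data entering a pipeline.&lt;/p&gt;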

</description>
      <category>python</category>
      <category>database</category>
    </item>
  </channel>
</rss>
