DEV Community: kamandenduati

EXPLORATORY DATA ANALYSIS ULTIMATE GUIDE

kamandenduati — Fri, 03 Mar 2023 05:49:39 +0000

In this article, I will give you the ultimate experience of EDA (exploratory data analysis) using the iris dataset. The iris dataset ships with scikit-learn (the sklearn module). First things first, what is EDA? EDA is all about learning about data using summarising and visualization techniques. It's about getting to know the data, interacting with it and wanting to know every nook and cranny of it.
Let's create a simple analogy for EDA so that we can understand what it means in layman's terms. Think of the talking stage before two people decide to date: the questions and the interrogations to find out more about the person you want to spend a part of your life with. The same can be said about data. You do all of this to get meaningful information about the data. Hopefully, that made sense.

What shall we cover?

1. How to load the dataset

2. How to convert the data into a data frame for analysis

3. A step-by-step guide to analysing the data

Let's go!!

Loading the dataset

First of all, we will be using Jupyter Notebook. You can install Anaconda (see the Installation guide), which comes with Jupyter pre-installed.
Let's go ahead and import all the tools we will need:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load the dataset and give it a variable name.
iris = load_iris()

Head()

Next is transforming it into a data frame for analysis, but first, let's get to know the target and feature variables.
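Here is a minimal sketch of how you might look at the feature and target variables. It relies on the feature_names, target_names and data attributes of the object that load_iris() returns.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature variables: the four measurements taken on each flower.
print(iris.feature_names)

# Target variable: integer codes 0, 1 and 2 that index into target_names.
print(iris.target_names)

# Number of rows and number of feature columns.
print(iris.data.shape)
```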

The 'species' column starts out as a NumPy array, and when converted to a pandas data frame it has a float type, as you can see in the output of the head() function. head() outputs the first five rows, tail() gives the last five rows, and info() gives more information on the data, such as the column types and the number of non-null values.
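One possible way to build the data frame is sketched here. It assumes the features and the target are stacked side by side with np.c_, which would explain why the species codes end up as floats. The names df and "species" are my own choices.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Stack the features and the target side by side. The feature matrix is
# float, so the integer species codes are stored as floats too.
df = pd.DataFrame(
    data=np.c_[iris.data, iris.target],
    columns=iris.feature_names + ["species"],
)

print(df.head())  # first five rows
print(df.tail())  # last five rows
df.info()         # column dtypes and non-null counts
```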

The species codes are then converted back to their original species names.
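A minimal sketch of that conversion, assuming the float codes 0.0, 1.0 and 2.0 are mapped back through iris.target_names. The helper name code_to_name is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(
    data=np.c_[iris.data, iris.target],
    columns=iris.feature_names + ["species"],
)

# Build a lookup table from integer code to species name.
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}

# Cast the float codes to int, then replace each code with its name.
df["species"] = df["species"].astype(int).map(code_to_name)

print(df["species"].unique())
```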

The describe() function gives summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column, which is a good starting point for univariate analysis.
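As a quick sketch, this is how describe() could be called on the four measurement columns. The variable name summary is my own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One column of summary statistics per feature.
summary = df.describe()
print(summary)
```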

The call .isnull().sum() is used to check if there are any missing values in each column of the data.
The call iris.value_counts('species') is used to check how many rows belong to each species.
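Both checks can be sketched as follows. I call the data frame df here to keep it apart from the object returned by load_iris(), and the species column is rebuilt so that the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Number of missing values in each column.
missing = df.isnull().sum()
print(missing)

# Number of rows for each species.
counts = df["species"].value_counts()
print(counts)
```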

Univariate analysis and multivariate analysis

The word 'uni' means one, therefore univariate analysis means the analysis of one variable on its own. We use plots to conduct such analysis, and the explanation accompanies each diagram.
In multivariate analysis, we try to establish sensible relationships between the variables.

Data visualization enables us to conduct such analysis. Common choices are boxplots, histograms, and histograms with distribution plots (distplots). I will take you through boxplots and histograms. We deduce information from these plots to draw conclusions.

What have you observed? Which values occur most frequently, and what is the range of each variable?
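If you want to reproduce the histograms yourself, here is a minimal sketch using the hist() method of a pandas data frame. The bin count and figure size are arbitrary choices of mine.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One histogram per feature. The tallest bar marks the most frequent
# range of values, and the x-axis shows the spread of each variable.
axes = df.hist(bins=20, figsize=(8, 6))
plt.tight_layout()
```

In a Jupyter notebook the figure renders inline. In a plain script, call plt.show() at the end.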

There are outliers in the sepal width. These may need to be removed or otherwise handled, since outliers can hurt the performance of some machine learning algorithms.
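The sketch below draws a boxplot for every feature and then lists the sepal width outliers. It assumes the usual boxplot rule, where a point is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. That rule is my assumption, not something stated in the plots.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Boxplot of every feature; points drawn beyond the whiskers are outliers.
ax = df.boxplot(figsize=(8, 5), rot=15)

# Apply the 1.5 * IQR rule to the sepal width by hand.
col = "sepal width (cm)"
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = df[(df[col] < lower) | (df[col] > upper)]
print(outliers[col])

# One way to handle them: keep only the rows inside the bounds.
df_clean = df[(df[col] >= lower) & (df[col] <= upper)]
```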

We find the correlation between different variables. A strong negative correlation is indicated by a coefficient closer to -1, a strong positive correlation by a coefficient closer to 1, and a coefficient near 0 means there is little linear relationship.
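A minimal sketch of computing the coefficients with the corr() method of a pandas data frame, which uses the Pearson correlation by default:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation between the four features.
corr = df.corr()
print(corr.round(2))
```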

This is a visual representation of correlation.
From the heat map we can see:
-Petal width and sepal length are positively correlated
-Petal length and sepal length are positively correlated
-Petal length and petal width are positively correlated
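A heat map like this one can be sketched with seaborn's heatmap() function. The colour map and the fixed -1 to 1 colour range are my own choices.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# annot=True writes each coefficient inside its cell; fixing vmin and vmax
# keeps the colour scale symmetric around zero.
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```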

The bar graph shows each column grouped by species. This makes it possible to compare the values of one species against those of the others. Observations could include:
-The sepal length and petal length of virginica are larger than those of the other species
-The sepal width of setosa is larger than the sepal width of the other species
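Here is one way such a bar graph could be produced. I am assuming it plots the mean of each column per species; another aggregate, such as the median, would work the same way.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Mean of every measurement for each species.
means = df.groupby("species").mean()
print(means)

# One group of bars per species, one bar per measurement.
ax = means.plot(kind="bar", figsize=(8, 5), rot=0)
ax.set_ylabel("mean (cm)")
```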

Conclusion

Based on the length and width of the petal and sepal alone, we can dare say that versicolor and virginica resemble each other in size, while the setosa species differs from the other two.
There are many ways you can conduct EDA, but I couldn't go through all of them because of time constraints, and also because you have to figure out some stuff by yourself. EDA is fun, data analysis is fun.
Until the next article, ciao

INTRODUCTION TO PYTHON IN DATA SCIENCE

kamandenduati — Tue, 28 Feb 2023 08:25:42 +0000

What is Python? What is data science? Let me do you one better: why is Python in data science? It seems my Infinity War reference may not have landed, but let's not diverge from our focus. These are the pertinent questions I will help you answer as you go through the article.
Data science is the domain of study that deals with vast volumes of data using modern tools and algorithms to find unseen patterns, derive meaningful information, and make business decisions. According to IBM, data science combines maths and statistics, specialized programming, advanced analytics, artificial intelligence (AI) and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization's data.
In simple terms, data science refers to the use of statistics, mathematics and computer science to extract insights from data that will assist in decision-making.

Applications of data science:
We encounter data science every day in ways we may not have realised:

Have you ever wondered how YouTube gets to know your taste? YouTube recommends channels and music that you'd prefer to watch and listen to. It has complex machine learning algorithms that analyze your preferences and give you recommendations based on the results.
The advertisements that you come across on YouTube are at times personalized. This, too, is done with the help of data science.

Python

Python is a high-level programming language, which means its syntax is closer to human language than to machine code and is therefore easy for users to understand. It is widely used in the field of data science due to its simplicity, flexibility, and powerful libraries.

Advantages of using Python for data science include:

Simple and User-Friendly Syntax: Python has a simple and easy-to-learn syntax that makes it accessible to everyone, even those who are new to programming. The code is easy to read and understand, which makes it easier to write and debug.
Large Community: Python is an open-source language with a large and active community of developers who constantly work on improving the language and developing new libraries. This community provides a wealth of resources, including tutorials, documentation, and forums, which makes it easier for beginners to learn the language.
Powerful Libraries: Python has a vast number of libraries that make it a powerful tool for data science. These libraries provide various tools for data manipulation, data analysis, machine learning and visualization, making it easier for data scientists to perform complex tasks.
Versatility: Python is a versatile language that can be used for a wide range of applications, including web development, scientific computing, machine learning, and data analysis. This versatility makes it a popular choice for data scientists who want to work on different projects.
Some of the most widely used Python libraries for data manipulation, data analysis, machine learning, and visualization are listed below.
NB: These libraries must first be imported to be used in your code.

1. NumPy: a Python library that provides support for large, multi-dimensional arrays and matrices, along with various mathematical functions that make it a powerful tool for scientific computing. import numpy as np

2. Pandas: provides data manipulation tools for tabular data and support for reading and writing data in various file formats. import pandas as pd

3. Matplotlib: a Python library that provides tools for visualization and supports various types of plots. import matplotlib.pyplot as plt

4. Scikit-learn: a Python library that provides tools for machine learning. import sklearn
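To give a feel for the first two libraries, here is a tiny sketch. The names and heights are made-up illustration data, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# NumPy: fast maths on whole arrays at once (made-up heights in metres).
heights = np.array([1.62, 1.75, 1.80, 1.68])
print(heights.mean())

# Pandas: the same numbers as a labelled table that can be filtered.
people = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cat", "Dan"], "height_m": heights}
)
tall = people[people["height_m"] > 1.70]
print(tall)
```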

Python plays a crucial role in the data science workflow, which involves the following steps:

-Data Collection: Data scientists collect data from various sources, including web scraping, APIs, and databases. Python provides various libraries, for example, Beautiful Soup, that make it easier to collect data from websites.

-Data Cleaning and Preprocessing: Data needs to be cleaned and preprocessed to remove any errors or inconsistencies. Python libraries, including Pandas and NumPy, provide various tools for data cleaning and preprocessing.
-Data Analysis: Data scientists analyze the data to extract insights and patterns. Python provides various libraries, including Pandas and Matplotlib, that make it easier to analyze and visualize the data.
-Machine Learning: Data scientists use machine learning algorithms to build predictive models.
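The four steps above can be sketched end to end in a few lines. This is only an illustration under my own assumptions: the iris dataset bundled with scikit-learn stands in for collected data, and logistic regression is just one of many possible models.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Data collection: a bundled dataset stands in for data gathered
#    from a website, an API or a database.
df = load_iris(as_frame=True).frame

# 2. Data cleaning and preprocessing: drop missing and duplicated rows.
df = df.dropna().drop_duplicates()

# 3. Data analysis: average measurements for each class.
print(df.groupby("target").mean())

# 4. Machine learning: train a classifier and score it on held-out data.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```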

Install Python from here.

Install an IDE (integrated development environment), which eases your coding experience. There are multiple IDEs on the market, but I would recommend Visual Studio Code as it is relatively lightweight. You will need to install the Python extension in Visual Studio Code. Here are some guidelines. Click for guidance

I would also recommend trying out Anaconda. It comes with Python pre-installed, and the Jupyter Notebook that ships with it is a powerful tool for building machine learning models and conducting EDA. Try it out.
To start your journey to understanding Python, I recommend the following website for its awesome tutorials: W3Schools

Conclusion

I tried to give you some insights into the fundamentals. It may not be enough, but I hope it gives you enough footing to start this long and exciting journey. There are many exciting resources all over the internet that you could use, and they are completely free of charge. You know what they say, "the best things are free". I don't know who exactly said this, but it works in this case. I wish you all the best as you start the journey.