<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alisha Rana</title>
    <description>The latest articles on DEV Community by Alisha Rana (@alisharana).</description>
    <link>https://dev.to/alisharana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F904516%2F0de7b94d-3ced-4cab-a407-a4bf0b0d58df.jpeg</url>
      <title>DEV Community: Alisha Rana</title>
      <link>https://dev.to/alisharana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alisharana"/>
    <language>en</language>
    <item>
      <title>Finding out the Missing Values Using Missingno and Pandas</title>
      <dc:creator>Alisha Rana</dc:creator>
      <pubDate>Mon, 29 Aug 2022 11:50:00 +0000</pubDate>
      <link>https://dev.to/alisharana/finding-out-the-missing-values-using-missingno-and-pandas-368a</link>
      <guid>https://dev.to/alisharana/finding-out-the-missing-values-using-missingno-and-pandas-368a</guid>
      <description>&lt;p&gt;The first step in data cleaning for me is typically looking for missing data, missing data can have different sources, maybe it isn't available, maybe it gets lost, maybe it gets damaged and normally its not an issue, we can fill it but I think often time missing data is very informative in itself, while we can fill the data with the average or something like that and I will show you how to do that frequently, &lt;br&gt;
For instance, if you have an online clothing store, if a customer never clicked on the baby category, it is likely that they do not have children. You can learn a lot by simply taking the information that is not there.&lt;/p&gt;
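&lt;p&gt;Filling with a column average, as mentioned above, can be sketched with pandas like this (a minimal, self-contained example with made-up numbers):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Made-up column with one gap, standing in for real data
df = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 4.0]})

# Replace the missing value with the column mean (here (2 + 4) / 2 = 3)
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].mean())
print(df["total_bedrooms"].tolist())  # [2.0, 3.0, 4.0]
```

&lt;p&gt;Whether mean-filling is appropriate depends on why the data is missing, which is exactly why inspecting the missingness first is worthwhile.&lt;/p&gt;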

&lt;p&gt;&lt;strong&gt;The missingno Library&lt;/strong&gt;&lt;br&gt;
Missingno is a great Python module that provides a set of visualisations to help you understand the presence and distribution of missing data within a pandas dataframe. These can take the shape of a dendrogram, heatmap, barplot, or matrix plot.&lt;br&gt;
Using these graphs, we can determine where missing values occur, the magnitude of the missingness, and whether any of the missing values are associated with each other.&lt;br&gt;
You can install the missingno library with the pip command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install missingno


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Importing Libraries and Loading the Data&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pandas as pd
import missingno as msno
df = pd.read_csv('housing.csv')
df.head()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcurncj464nznkjiktb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcurncj464nznkjiktb3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Analysis with Pandas&lt;/strong&gt;&lt;br&gt;
Before we utilise the missingno library, a few features of the pandas library can already give us an idea of how much data is missing.&lt;/p&gt;

&lt;p&gt;The first method is to use the .describe() method. It returns a table of summary statistics about the dataframe, such as the mean, maximum, and minimum values; its count row also tells you how many non-null entries each column has, which is a first clue about missingness.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

df.describe()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dvy6j9bl3x2b2ud47zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1dvy6j9bl3x2b2ud47zf.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Using the .info() method, we can go one step further. This provides a count of the non-null values in addition to a summary of the dataframe.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

df.info()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4xa9xa0hc1j8fupotvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo4xa9xa0hc1j8fupotvu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yet another quick technique is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

df.isna().sum()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This produces a summary of the number of missing values in each column of the dataframe. The isna() function flags missing values, returning a Boolean for each element, and sum() adds up the True values, each counting as 1.&lt;/p&gt;
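&lt;p&gt;To make the mechanics concrete, here is a tiny self-contained sketch with made-up data (the column names only mimic the housing dataset):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data: total_bedrooms has two gaps
df = pd.DataFrame({
    "total_bedrooms": [3.0, np.nan, 2.0, np.nan],
    "median_income": [1.5, 2.5, 3.5, 4.5],
})

# isna() returns a Boolean frame; sum() counts the True values per column
missing_counts = df.isna().sum()
print(missing_counts)
```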

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrx7ltvbsuf6d7w9impm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrx7ltvbsuf6d7w9impm.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Using missingno to Identify Missing Data&lt;/strong&gt;&lt;br&gt;
There are four types of plots in the missingno library for visualising data completeness: barplots, matrix plots, heatmaps, and dendrogram plots.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

msno.matrix(df)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uj0kp6r7jgk61v6jfz5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uj0kp6r7jgk61v6jfz5.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
In the resulting graphic, the total_bedrooms column shows a noticeable amount of missing data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

msno.bar(df)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab9z986194xiad4otp33.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab9z986194xiad4otp33.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The barplot is a simple plot in which each bar represents a column of the dataframe. The height of the bar indicates how complete that column is, i.e., how many non-null values are present.&lt;/p&gt;

&lt;p&gt;You can see that the bar for total_bedrooms is shorter than the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;br&gt;
Identifying missing data before applying machine learning is a critical step in the data-quality pipeline, and the missingno library makes it possible with a handful of visualisations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for your time!&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dealing with Huge Data</title>
      <dc:creator>Alisha Rana</dc:creator>
      <pubDate>Wed, 24 Aug 2022 12:31:00 +0000</pubDate>
      <link>https://dev.to/alisharana/dealing-with-huge-data-55e6</link>
      <guid>https://dev.to/alisharana/dealing-with-huge-data-55e6</guid>
<description>&lt;p&gt;It's quite common, especially in large companies, to have datasets that no longer fit in your computer's memory, or calculations on them take so long that they bore you. This means we must find ways to make the data smaller in memory, or to sample it so we have a subset. Often it is valid to take a sample that is representative of the full dataset and do the calculations, the data science, on that.&lt;br&gt;
&lt;strong&gt;We'll import Pandas and add our data to the dataframe.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To examine the memory footprint of our loaded data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.memory_usage(deep=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I9_n2AE4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gnwim8zx4zey8zu3jx73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I9_n2AE4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gnwim8zx4zey8zu3jx73.png" alt="Image description" width="275" height="284"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;We declare deep=True because:&lt;br&gt;
the real memory footprint of object-dtype columns is ignored by default, and we don't want our string columns to be undercounted here.&lt;/strong&gt;&lt;br&gt;
Checking the dtype of columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AHpBl8C7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v7gvb0ou6fengm9wsr1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AHpBl8C7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v7gvb0ou6fengm9wsr1v.png" alt="Image description" width="244" height="263"&gt;&lt;/a&gt;&lt;br&gt;
Notice the dtype of ocean_proximity.&lt;br&gt;
Keep in mind that strings can take up a lot of memory, whereas numbers are stored particularly efficiently.&lt;br&gt;
We will override the ocean_proximity dtype with the pandas-specific categorical datatype.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["ocean_proximity"] = df["ocean_proximity"].astype("category")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
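&lt;p&gt;If you don't have housing.csv at hand, the same effect can be seen on synthetic data; this sketch compares the column's footprint before and after the conversion:&lt;/p&gt;

```python
import pandas as pd

# A low-cardinality string column standing in for ocean_proximity
labels = ["NEAR BAY", "INLAND", "NEAR OCEAN"] * 2000
df = pd.DataFrame({"ocean_proximity": labels})

before = df.memory_usage(deep=True)["ocean_proximity"]
df["ocean_proximity"] = df["ocean_proximity"].astype("category")
after = df.memory_usage(deep=True)["ocean_proximity"]

print(before, after)  # the categorical column is far smaller
```

&lt;p&gt;A categorical stores each distinct string once plus a small integer code per row, which is where the saving comes from.&lt;/p&gt;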



&lt;p&gt;This improves our memory usage.&lt;br&gt;
Let's check the memory again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.memory_usage(deep=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g7Ad-z6B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evk7cwjgorsqk9ffvb0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g7Ad-z6B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/evk7cwjgorsqk9ffvb0l.png" alt="Image description" width="251" height="277"&gt;&lt;/a&gt;&lt;br&gt;
waoooooh!!! You can see the footprint shrinks by more than half.&lt;br&gt;
In this simple way you can make your dataframe much leaner.&lt;br&gt;
However, the issue with this technique is that the full data is first loaded into memory, so the initial footprint is still substantial.&lt;br&gt;
&lt;strong&gt;During the loading process, we can also modify the datatype&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_columns = pd.read_csv("data/housing.csv", usecols=["longitude", "latitude", "ocean_proximity"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we use a dictionary with the column name as the key and the datatype as the value, so you can list as many columns as you like.&lt;br&gt;
pandas applies the dtypes while loading, so the dataframe arrives with the smaller memory footprint from the start.&lt;/p&gt;

&lt;p&gt;We also might not require every column, so instead of importing the whole dataset we will construct a new dataframe and load the data as usual, but this time we define which columns to keep&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_columns = pd.read_csv("data/housing.csv", usecols=["longitude", "latitude", "ocean_proximity"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is another excellent method for saving space when loading the data.&lt;/p&gt;

&lt;p&gt;Sometimes the issue isn't the data loading but the computation itself, because we have a costly function. In these cases we can sample our data, which pandas makes easy: every dataframe has a sample method.&lt;/p&gt;

&lt;p&gt;The random state is really crucial if you want to replicate your analysis or hand it to a coworker or fellow data scientist. It is a really good habit to get into&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_columns.sample(100, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to repeat an analysis, you must ensure your random process is reproducible, for example by keeping the seed in one place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;random_state = 42
df_columns.sample(100, random_state=random_state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
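&lt;p&gt;A quick self-contained check of that reproducibility (synthetic data; any dataframe behaves the same):&lt;/p&gt;

```python
import pandas as pd

# Synthetic stand-in for the columns loaded above
df_columns = pd.DataFrame({"longitude": range(1000), "latitude": range(1000)})

random_state = 42
first = df_columns.sample(100, random_state=random_state)
second = df_columns.sample(100, random_state=random_state)

# The same seed yields the exact same rows, so a coworker can reproduce the analysis
print(first.equals(second))  # True
```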



&lt;p&gt;I hope you now know how to load data more efficiently and with a smaller memory footprint.&lt;br&gt;
See you next time!!!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Data Science Dependencies</title>
      <dc:creator>Alisha Rana</dc:creator>
      <pubDate>Mon, 22 Aug 2022 22:46:51 +0000</pubDate>
      <link>https://dev.to/alisharana/data-science-dependencies-3ade</link>
      <guid>https://dev.to/alisharana/data-science-dependencies-3ade</guid>
<description>&lt;p&gt;These are the prerequisites for getting started with data science. Install all of these packages on your computer; you may also use the most recent version of each.&lt;br&gt;
&lt;strong&gt;The dependencies are below.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;python=3.8.3&lt;/li&gt;
&lt;li&gt;pip=20.1.1&lt;/li&gt;
&lt;li&gt;eli5=0.10.1&lt;/li&gt;
&lt;li&gt;folium=0.11.0&lt;/li&gt;
&lt;li&gt;jupyter=1.0.0&lt;/li&gt;
&lt;li&gt;matplotlib=3.3.0&lt;/li&gt;
&lt;li&gt;missingno=0.4.2&lt;/li&gt;
&lt;li&gt;numpy=1.19.1&lt;/li&gt;
&lt;li&gt;pandas=1.0.5&lt;/li&gt;
&lt;li&gt;pandas-profiling=2.8.0&lt;/li&gt;
&lt;li&gt;pandera=0.4.4&lt;/li&gt;
&lt;li&gt;scikit-learn=0.23.1&lt;/li&gt;
&lt;li&gt;scipy=1.5.0&lt;/li&gt;
&lt;li&gt;seaborn=0.10.1&lt;/li&gt;
&lt;li&gt;shap=0.35.0&lt;/li&gt;
&lt;li&gt;sqlalchemy=1.3.18&lt;/li&gt;
&lt;li&gt;voila=0.1.21&lt;/li&gt;
&lt;li&gt;pip:

&lt;ul&gt;
&lt;li&gt;discover-feature-relationships==1.0.3&lt;/li&gt;
&lt;li&gt;quilt==2.9.15&lt;/li&gt;
&lt;li&gt;yellowbrick==1.1&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
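&lt;p&gt;The list above reads like a conda environment specification. One way to capture it, sketched here as a hypothetical environment.yml (the name and channel are assumptions; conda pins use a single =, pip pins use ==), is:&lt;/p&gt;

```yaml
name: datascience
channels:
  - conda-forge
dependencies:
  - python=3.8.3
  - pip=20.1.1
  - eli5=0.10.1
  - folium=0.11.0
  - jupyter=1.0.0
  - matplotlib=3.3.0
  - missingno=0.4.2
  - numpy=1.19.1
  - pandas=1.0.5
  - pandas-profiling=2.8.0
  - pandera=0.4.4
  - scikit-learn=0.23.1
  - scipy=1.5.0
  - seaborn=0.10.1
  - shap=0.35.0
  - sqlalchemy=1.3.18
  - voila=0.1.21
  - pip:
    - discover-feature-relationships==1.0.3
    - quilt==2.9.15
    - yellowbrick==1.1
```

&lt;p&gt;You could then create the environment with conda env create -f environment.yml.&lt;/p&gt;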

</description>
      <category>datascience</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data loading with Pandas: Loading Excel , CSV , SQL, and any data file</title>
      <dc:creator>Alisha Rana</dc:creator>
      <pubDate>Mon, 22 Aug 2022 22:26:35 +0000</pubDate>
      <link>https://dev.to/alisharana/data-loading-with-pandas-loading-excel-csv-sql-and-any-data-file-kli</link>
      <guid>https://dev.to/alisharana/data-loading-with-pandas-loading-excel-csv-sql-and-any-data-file-kli</guid>
<description>&lt;p&gt;Whether you want to begin with data analysis, extract useful information, or predict something from data, the first step is always loading the data; for that we will be using the pandas library.&lt;br&gt;
&lt;strong&gt;We will use pandas to import data from an Excel table, a CSV file, or a SQL database.&lt;/strong&gt;&lt;br&gt;
Before getting into loading data, you must have pandas installed on the platform where you are working.&lt;br&gt;
I will be using Jupyter Notebook; you can easily get it with Anaconda.&lt;br&gt;
To install pandas, run the following command in a Jupyter Notebook cell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install it in a plain Python environment, but that's not the focus today.&lt;br&gt;
&lt;strong&gt;This is the first lesson where we touch code, so open up Jupyter Notebook if you want to code along.&lt;/strong&gt; &lt;br&gt;
I have a CSV and an Excel file that I will work with.&lt;br&gt;
&lt;strong&gt;First, you must import the installed pandas library.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing this would be enough, but because we will be using pandas a lot, we usually give it a shorthand alias&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pd is the alias most people use. We execute the cell and now we have pandas available in Python.&lt;br&gt;
&lt;strong&gt;To import or read data&lt;/strong&gt;&lt;br&gt;
Type pd.read in your notebook and hit Tab: you will see the various ways you can load data. Here we'll look at the most common ones.&lt;br&gt;
&lt;strong&gt;Import Excel Files&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.read_excel("data/crypto.xlsx")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the parentheses you give the location where your file is stored.&lt;br&gt;
&lt;strong&gt;Once loading has completed, you can see the data in a pandas dataframe.&lt;/strong&gt;&lt;br&gt;
Here we didn't save it in a variable,&lt;br&gt;
but you can save the data in a variable as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data=pd.read_excel("data/crypto.xlsx")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Import CSV Files&lt;/strong&gt;&lt;br&gt;
CSV files are slightly different because they contain raw, plain-text data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.read_csv("data/crypto.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Loading Data From SQL&lt;/strong&gt;&lt;br&gt;
A great way to store data and make it available to data scientists is through SQL databases.&lt;br&gt;
Many businesses avoid passing Excel files around, since they are easily duplicated.&lt;br&gt;
&lt;strong&gt;In addition to pandas we have to import SQLAlchemy&lt;/strong&gt;&lt;br&gt;
SQLAlchemy is a package that helps Python programmes communicate with databases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlalchemy as sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code below creates the connection; it's called an Engine. If you have a PostgreSQL database, this string should point to the location of your database&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connect=sql.create_engine("postgresql://scott:tiger@localhost/test")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here we go: read the SQL table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_sql_table("sales", connect)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
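&lt;p&gt;If you don't have a PostgreSQL server handy, the same pattern can be tried against SQLite, which ships with Python. This is a sketch with a made-up sales table; for plain SQL queries, pd.read_sql also accepts a DB-API connection directly:&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database standing in for the PostgreSQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 9.99), ("gadget", 24.50)])
conn.commit()

# Read the table into a dataframe with a query
data = pd.read_sql("SELECT * FROM sales", conn)
print(data)
```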



&lt;p&gt;&lt;strong&gt;Loading any Data Files&lt;/strong&gt;&lt;br&gt;
Pandas works great on structured data, but sometimes data comes in weird formats. This is the general way to work with data files in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open("data/crypto.csv", mode='r') as cryptocurr:
    data = cryptocurr.read()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you only want to read the data and not alter it, you indicate that with &lt;strong&gt;mode='r'&lt;/strong&gt;.&lt;br&gt;
Then we give the opened file a name; here I call the file handle &lt;strong&gt;cryptocurr&lt;/strong&gt;.&lt;br&gt;
Inside the with block, where the file is open, we create a variable and use the read function on the handle. Run the cell, then call the variable to see the contents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
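&lt;p&gt;As a self-contained illustration (a temporary file stands in for data/crypto.csv), the raw text read this way can then be split into lines and fields by hand:&lt;/p&gt;

```python
import os
import tempfile

# Write a tiny stand-in for data/crypto.csv so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "crypto.csv")
with open(path, mode='w') as f:
    f.write("name,price\nBTC,20000\nETH,1500\n")

# Read it back exactly as in the block above
with open(path, mode='r') as cryptocurr:
    data = cryptocurr.read()

# The raw string can be parsed manually when the format is unusual
rows = [line.split(",") for line in data.strip().split("\n")]
print(rows)  # [['name', 'price'], ['BTC', '20000'], ['ETH', '1500']]
```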



&lt;p&gt;&lt;strong&gt;Hurraaah, we did it!!!!!&lt;/strong&gt;&lt;br&gt;
Loading data into pandas is extremely easy.&lt;br&gt;
Try it out with your own data: if you have an Excel file lying around on your computer, nothing leaves your machine, so you can just pd.read it and play around.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>data</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data Science for Beginners: How to Get Started</title>
      <dc:creator>Alisha Rana</dc:creator>
      <pubDate>Mon, 22 Aug 2022 12:43:00 +0000</pubDate>
      <link>https://dev.to/alisharana/data-science-for-beginners-how-to-get-started-p86</link>
      <guid>https://dev.to/alisharana/data-science-for-beginners-how-to-get-started-p86</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data Science&lt;/strong&gt;&lt;br&gt;
Data science is a trendy topic these days and the field is expanding quickly, but many people are unsure of what the term actually means. In this post, we'll try to clarify what data science is and how to utilise it in business analytics.&lt;br&gt;
&lt;strong&gt;Data&lt;/strong&gt;&lt;br&gt;
First of all, what exactly is data? Data is omnipresent, and people are terrified of it being stolen. Yet data can teach us a significant amount about a person, a company, or an international business.&lt;br&gt;
Using data effectively in data science means building analytical models from the data and making decisions based on them.&lt;br&gt;
&lt;strong&gt;Data Science&lt;/strong&gt;&lt;br&gt;
Three words combine to form the term data science: analysis, statistics, and machine learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analysis is performed to extract practical insights from the data.&lt;/li&gt;
&lt;li&gt;Statistics is used to identify and interpret patterns in the data.&lt;/li&gt;
&lt;li&gt;Machine learning is used to make forecasts from the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Approaching the literal definition: data science is the application of data to improve decision-making, built on three pillars,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analysis&lt;/li&gt;
&lt;li&gt;Statistics&lt;/li&gt;
&lt;li&gt;Machine Learning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that you understand what data science is and what it is used for, let's look at the prerequisites you should satisfy before you begin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools for Data Science&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Python&lt;/strong&gt;&lt;br&gt;
Other programming languages, such as R, are also used in data science, but here we'll talk about the one that is easiest to put into practice.&lt;br&gt;
Python is currently gaining popularity because of its simple syntax, and it runs on a variety of platforms, such as Windows and Mac.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Anaconda&lt;/strong&gt;&lt;br&gt;
Anaconda is convenient because most of the data science packages we need are already included, free of charge, so we don't have to install additional programmes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Jupyter Notebook&lt;/strong&gt;&lt;br&gt;
It is a web-based Python interface that makes learning Python very simple. You can use it to create and share documents containing text, mathematics, and live code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Numpy&lt;/strong&gt;&lt;br&gt;
A scientific computing toolkit for Python that we use whenever we need to perform calculations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Pandas&lt;/strong&gt;&lt;br&gt;
For me, it combines Excel and SQL: a tool for data manipulation and analysis.&lt;/p&gt;

&lt;p&gt;For the machine learning portion and model validation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Scikit-learn&lt;/strong&gt;&lt;br&gt;
It is Python's most practical and reliable machine learning library. It offers a variety of effective methods for statistical modelling and machine learning, including  dimensionality reduction, clustering, and regression, all through a Python interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Matplotlib&lt;/strong&gt;&lt;br&gt;
A cross-platform data-visualisation and graphical-charting library built on NumPy, Python's numerical extension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;- Seaborn&lt;/strong&gt;&lt;br&gt;
Built upon Matplotlib, Seaborn creates attractive visualisations of statistical data in single lines of code.&lt;/p&gt;
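&lt;p&gt;As a small taste of how two of these tools fit together (a sketch with made-up data):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# NumPy generates the numbers; pandas wraps them in a labelled dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({"height_cm": rng.normal(170, 10, size=100)})

# .describe() gives count, mean, std, min, quartiles, and max in one call
summary = df.describe()
print(summary)
```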

&lt;p&gt;All of these tools are free and open source, and they are a cornerstone of data science.&lt;br&gt;
I hope you found this blog fascinating; I hope to see you again soon.&lt;/p&gt;

</description>
<category>business</category>
      <category>data</category>
      <category>science</category>
      <category>python</category>
    </item>
  </channel>
</rss>
