Knowing your data is an important part of the data analysis process: it gives you a chance to get a "feel" for your data and understand it better. This can include knowing the number of rows and columns you have, the datatypes, or even the circumstances under which the data was collected and the objective it seeks to achieve.
In today's discussion we will talk about the essentials of Exploratory Data Analysis.
Our first question is:
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis. This involves inspecting the dataset from many angles, describing & summarizing it without making any assumptions about its contents.
It is a crucial step in data science projects because it allows data scientists to analyze and visualize data to understand its key characteristics, uncovering patterns, identifying relationships between different variables, and locating outliers. EDA is normally performed as a preliminary step before undertaking more formal statistical analyses or modeling.
Now, we will discuss why EDA is important:
Understanding data structure: EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or prediction techniques.
Identifying patterns and relationships: Through visualizations and statistical summaries, EDA can reveal hidden patterns and intrinsic relationships between variables. These insights can guide further analysis and enable more effective feature engineering and model building.
Catching anomalies and outliers in the data: It helps identify errors or unusual data points that may adversely affect or skew the results of your analysis. Detecting these early can prevent costly mistakes in predictive modeling and analysis.
Testing hypotheses: When doing a project, we normally make assumptions about the data, and to verify whether these assumptions hold, we have to perform hypothesis tests, comparing the null (H0) hypothesis against the alternative (Ha or H1). A small sketch follows this list.
Helps inform feature selection and engineering: Insights gained from EDA can inform which features are most relevant to include in a model and how to transform them (scaling, encoding) to improve model performance.
Optimizing model design: By understanding the data’s characteristics, analysts can choose appropriate modeling techniques, decide on the complexity of the model, and better tune model parameters.
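To make the hypothesis-testing point concrete, here is a minimal sketch using scipy.stats; the two samples (daily sales from two hypothetical stores) are made up purely for illustration:

```python
from scipy import stats

# Hypothetical daily sales for two stores (invented numbers).
store_a = [120, 135, 118, 140, 125, 133, 129]
store_b = [110, 115, 108, 122, 117, 112, 119]

# Two-sample t-test: H0 says the two means are equal, Ha says they differ.
t_stat, p_value = stats.ttest_ind(store_a, store_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

# A small p-value (commonly < 0.05) suggests rejecting H0.
if p_value < 0.05:
    print("Reject H0: the means appear to differ.")
else:
    print("Fail to reject H0: no strong evidence of a difference.")
```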
The Importance of Understanding Your Data.
To understand data, you first need to understand the characteristics of quality data, which are also the elements of data quality. For example, you need to make sure that your data is: accurate, accessible, complete, consistent, valid (has integrity), unique, current, reliable, and relevant to the study or model you're building.
These characteristics determine data quality. As we have already discussed, poor quality data has dire consequences: it can hurt model performance and lead businesses to make wrong and potentially costly decisions.
So, it is very important to make sure that the data meets the highest quality standards before being analyzed and used to provide insights for business decision making or training machine learning models.
Steps in EDA
Data collection and importing.
- Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. After data is collected, it can be organized and stored in software such as databases. Data is stored and preserved in different formats:
Textual data: XML, TXT, HTML, PDF/A (Archival PDF)
Tabular data (including spreadsheets): CSV
Databases: XML, CSV
Images: TIFF, PNG, JPEG (note: JPEGS are a 'lossy' format which lose information when re-saved, so only use them if you are not concerned about image quality)
Audio: FLAC, WAV, MP3
- Data structure and overview: To get a view of some of the data in the dataset and its characteristics, we can use Python. For example:
- To get the first five and last five rows in the dataset we use the head() and tail() methods, and to get a general description and information about the dataset we use info() and describe() in pandas.
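A minimal sketch of these calls in pandas (the file name data.csv is just a placeholder):

```python
import pandas as pd

# Load the dataset ('data.csv' is a placeholder file name).
df = pd.read_csv("data.csv")

print(df.head())      # first five rows
print(df.tail())      # last five rows
df.info()             # column names, dtypes, non-null counts, memory usage
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
print(df.shape)       # (number of rows, number of columns)
```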
- Handling missing values: Missing values can potentially affect our analysis, especially when the dataset is small. To identify them we use some of the following tools, to mention a few:
Spreadsheets: In the formula box, enter =ISBLANK(A1) (assuming A1 is the first cell of your selected range).
SQL: COUNT(*) along with a WHERE column IS NULL clause can be used to find the number of null values in a column.
Python: The following pandas functions are used to find and handle missing values: isnull(), notnull(), dropna(), fillna(), replace(), and interpolate().
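Here is a small sketch of how some of these pandas functions can be used; the toy DataFrame and its column names are invented for illustration:

```python
import pandas as pd
import numpy as np

# A toy DataFrame with some missing values (made-up data).
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "income": [40000, 52000, np.nan, 61000, 58000],
})

# Find missing values.
print(df.isnull())        # True where a value is missing
print(df.isnull().sum())  # count of missing values per column

# Handle missing values (each call returns a new DataFrame by default).
dropped = df.dropna()                           # drop rows with any missing value
filled = df.fillna(df.mean(numeric_only=True))  # fill with the column means
interpolated = df.interpolate()                 # fill by linear interpolation
```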
Data visualization and the techniques involved.
Data visualization is the practice of translating information into a visual context, such as maps, graphs, plots, and charts. This is a crucial step in the data analysis process because it helps your audience understand the data in simple terms and builds a picture of what the data contains. Visualizations also make storytelling easier, since insights are easier to explain to stakeholders when they are presented visually.
The most common visualizations used are:
Line graphs: these are used to track changes over a period of time.
Bar graphs: they are used to compare values across different groups and track changes over time. They do this by showing the frequency counts of values for the different levels of a categorical or nominal variable.
Histograms: used to show distributions of individual variables.
Box plots: they detect outliers and help data analysts and scientists understand variability in the data.
Scatter plots: used to visualize relationships between two variables.
Pair plots: to compare multiple variables in the data at once.
Correlation Heatmaps: these are also used to understand the relationships between multiple variables.
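To illustrate a few of these, here is a minimal sketch using Matplotlib and Seaborn on a synthetic dataset; the height and weight columns are invented for the example:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# A small synthetic dataset, just for illustration.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(df["height"], bins=20)          # histogram: distribution
axes[0, 0].set_title("Histogram of height")

axes[0, 1].boxplot(df["weight"])                # box plot: outliers, spread
axes[0, 1].set_title("Box plot of weight")

axes[1, 0].scatter(df["height"], df["weight"])  # scatter: relationship
axes[1, 0].set_title("Height vs weight")

sns.heatmap(df.corr(), annot=True, ax=axes[1, 1])  # correlation heatmap
axes[1, 1].set_title("Correlation heatmap")

plt.tight_layout()
plt.show()
```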
Statistical summaries.
Just as the headline states, they provide a summary of key information about the sample data. They include:
Mean: Also known as the expected value. It is a measure of central tendency of a probability distribution, alongside the median and mode, and it describes the entire dataset with a single value.
Median: This is the middle value of the dataset after the values have been sorted in ascending or descending order. It gives an idea of where the center of the data lies, and it is more reliable than the mean when the data is skewed.
Mode: The most commonly observed value in the data, i.e. the value with the highest frequency.
Standard deviation: This is a measure of how the data varies in relation to the mean. It helps analysts understand the variability of a dataset, identify trends, assess data reliability, detect outliers, compare datasets, and evaluate risk. It is denoted as σ.
Variance: This is a statistical measurement of the spread between numbers in a dataset. It is calculated as the square of the standard deviation and is usually denoted as σ².
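As a quick illustration, all of these summaries can be computed with pandas; the sample values below are made up:

```python
import pandas as pd

data = pd.Series([4, 8, 6, 5, 3, 8, 9, 7, 8])  # a toy sample

print("Mean:", data.mean())
print("Median:", data.median())
print("Mode:", data.mode().tolist())       # mode() can return several values
print("Standard deviation:", data.std())   # sample std (ddof=1) by default
print("Variance:", data.var())             # sample variance, the std squared
```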
Tools and libraries for EDA
In Python, the commonly used libraries for EDA are pandas and NumPy for manipulating data, and Matplotlib and Seaborn for visualizing it. A popular environment for running them is:
Jupyter Notebooks
This is an interactive environment for running and saving Python code in a step-by-step manner. It is commonly used in the data space because it provides a flexible environment for working with code and data.
Common pitfalls in EDA
Overfitting to patterns: This usually happens when analysts over-interpret the data to the point of arriving at wrong insights. In machine learning, it is a situation where a model learns the training data too closely and fails to generalize to new or unseen data, which can lead to poor performance, inaccurate predictions, and wasted resources.
Bias: Especially confirmation bias, which often leads analysts to provide wrong insights based on preconceived hypotheses. To avoid this, analysts must ensure that they review and interpret the data based only on the findings from the data, not their own prior conclusions.
Overall, the EDA process is crucial to the success of data science projects, as it ensures the models get the best data to learn from.
To understand the EDA process step by step with code, work through hands-on tutorials, and practice on datasets from public dataset repositories.
Remember to always document the insights gained from EDA, as they may serve as a reference later, especially during the modelling phase. The EDA process is iterative, so as the project proceeds, you might need to revisit it as new insights emerge.
Thank you for reading, and I would appreciate any feedback on areas I can improve.