In data science, Exploratory Data Analysis (EDA) is the first phase of work on a collected data set, carried out before any complex analysis. Data scientists and analysts use EDA to discover structure in the data, identify outliers, form hypotheses, and check assumptions. Before delving into complex modeling, you need a clear understanding of your data, and that is exactly where EDA comes in. This article is a brief guide to exploring your data, with the tools and techniques that come in handy during EDA.
_WHAT IS EXPLORATORY DATA ANALYSIS?_
Exploratory Data Analysis is one of the seven primary forms of data analysis (Harvard Business Review, 2009).
Exploratory Data Analysis covers a range of approaches for examining a data set and identifying its main characteristics, often visually. The aim of EDA is to see what the data can reveal beyond formal modeling or hypothesis testing. It is about finding patterns in the collected data that are not obvious at first glance.
EDA can combine graphical and quantitative analysis, and it normally proceeds as a cascade in which one observation leads to the next.
IMPORTANCE OF EDA.
EDA is important for several reasons:
a) Understanding Data Structure: It is important to understand the characteristics of your data before applying any machine learning model. EDA makes it easier to see what kinds of variables are involved, how they are distributed, and whether they contain missing values or anomalies such as outliers.
b) Informing Data Cleaning: EDA exposes problems that must be handled during preprocessing, such as nulls, missing values, noise, and outliers. This step ensures that any model is fed correct data.
c) Hypothesis Testing: EDA can also help in developing hypotheses or in deciding which hypotheses are worth pursuing. For example, a visualization may reveal a correlation or a distribution trend that hints at a relationship worth examining in further analytical work.
d) Feature Selection: EDA helps the modeler understand which variables are likely to be most useful or predictive, which makes the modeling process more efficient.
e) Detecting Anomalies: EDA is crucial for identifying observations that may be outliers. Such observations could skew the analysis if retained, or could be of interest for further study.
_ESSENTIAL TECHNIQUES IN EDA._
Descriptive Statistics
Descriptive statistics are usually the first step in EDA. They give a brief summary of the sample and its measures, and they are generally easy to understand. Common descriptive statistics include:
- Mean, Median, and Mode: Measures of central tendency that summarise the typical value of the data set.
- Standard Deviation and Variance: Measures of dispersion that show how far the values spread out from the mean.
- Minimum, Maximum, and Percentiles: Measures of the range of the data and of where individual points fall within it.

These are useful tools for a quick look at the main tendencies and characteristics of the data.
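
As a rough illustration of the statistics listed above, here is a minimal sketch that uses only Python's standard `statistics` module. The `orders` list is made-up sample data, not taken from any real data set, and in practice a data-frame library would usually compute the same summary for you.

```python
import statistics

# Hypothetical sample: daily order counts (made-up values for illustration).
orders = [12, 15, 15, 18, 20, 22, 22, 22, 25, 90]

summary = {
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "mode": statistics.mode(orders),
    "stdev": statistics.stdev(orders),        # sample standard deviation
    "variance": statistics.variance(orders),  # sample variance
    "min": min(orders),
    "max": max(orders),
}

# quantiles(n=4) returns three cut points: the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(orders, n=4)
summary.update({"q1": q1, "q3": q3})

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

In this made-up sample the single large value pulls the mean (26.1) above the median (21), which is the kind of gap that hints at skew or an outlier.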
Data Visualization
Visualization is among the most effective EDA tools, since it lets the analyst see relationships, trends, and patterns that cannot be seen by inspecting the numbers alone. Some common visualizations include:
- Histograms: Show the distribution of a single variable, including its general shape, its center, and its variability.
- Box Plots: Summarise the median, quartiles, and spread of a variable, and make potential outliers easy to spot.
- Scatter Plots: Help in exploring the relationship between two variables, including its direction or the absence of any relationship.
- Pair Plots: Especially valuable for building a quick understanding of the associations between several variables in a data set.
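
To make the first two plot types concrete, the sketch below computes what a histogram and a box plot actually display, using only the Python standard library and printing a text histogram. The `times` values and the `BIN_WIDTH` of 25 are assumptions chosen for illustration. A real workflow would normally hand the drawing to a plotting library.

```python
import statistics
from collections import Counter

# Hypothetical sample: response times in milliseconds (made-up values).
times = [101, 105, 110, 112, 118, 121, 125, 127, 133, 138, 142, 150, 155, 171, 240]

BIN_WIDTH = 25  # assumed bin width; the apparent shape can change with this choice

# Histogram: count how many values fall into each fixed-width bin.
bins = Counter((t // BIN_WIDTH) * BIN_WIDTH for t in times)
for start in range(min(bins), max(bins) + BIN_WIDTH, BIN_WIDTH):
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} | {'#' * bins[start]}")

# Box plot ingredients: the five-number summary.
q1, median, q3 = statistics.quantiles(times, n=4)
five_number = (min(times), q1, median, q3, max(times))
print("min, Q1, median, Q3, max:", five_number)
```

The empty bins between the main cluster and the last value are what a drawn histogram would show as a gap. Under the common 1.5 × IQR convention, that same value would appear as an isolated point beyond the upper whisker of a box plot.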
Data Profiling
Data profiling is the examination of data to identify its structure, relationships, and content. This includes:
- Missing Data Analysis: Identifying missing values and the patterns in which they occur, which helps in deciding how to handle them, for example by imputation or deletion.
- Outlier Detection: Flagging data points that lie far from the rest and may require special care and attention.
- Correlation Analysis: Measuring the correlation between pairs of variables to test assumptions about how they are related.
Dimensionality Reduction
When working with a large number of variables, dimensionality reduction methods simplify the data set without considerable loss of information. One such method is Principal Component Analysis (PCA), which combines the original features into a smaller set of components that capture most of the variance.
Hypothesis Testing
Although EDA is largely graphical, it can and should serve as a basis for a first round of hypothesis testing. For instance, analysts can use t-tests, chi-square tests, and other tests to assess whether an observed pattern could be due to mere chance.
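
Here is a minimal sketch of such a first-round test. It deliberately swaps in a permutation test for a difference in means rather than the t-test or chi-square test named above, because a permutation test needs only the Python standard library. The two groups are made-up values, and `permutation_test` is a hypothetical helper written for this example, not a library function.

```python
import random
import statistics

# Hypothetical measurements for two groups (made-up values).
group_a = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.8, 12.2]
group_b = [13.4, 13.9, 13.1, 14.2, 13.6, 13.8, 14.0, 13.3]


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the estimated share of random relabelings whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the gap between the groups would rarely arise from random labeling alone. It does not say why the groups differ, which is where the context of the data comes back in.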
BEST PRACTICES IN EDA.
- Iterative Process: EDA is not done once; it is repeated. As ideas emerge, new questions arise that draw more information out of the data.
- Document Your Process: Record the preprocessing steps you perform during EDA, the visualizations you create, and the findings you make. This documentation comes in handy when writing reports, explaining results in meetings, or moving on to modeling.
- Be Skeptical: Always remain skeptical about the patterns you identify. Ask whether they are real or simply a by-product of how the data was captured. Seek confirmation through another analysis of the data or from another source.
- Understand the Context: Whatever you find in the data should be considered in context. What is the source? What weaknesses accompanied the data collection? What systematic and non-systematic biases might be present? A closer look at the context permits a correct interpretation of the results.
# CONCLUSION.
Exploratory Data Analysis is an important part of the data science workflow. It provides the foundation from which to examine your data, spot important patterns, and prepare for further analysis. Descriptive statistics, graphs and charts, summary tables, and data profiling all help in identifying the patterns, relationships, and anomalies within your data.
Time spent on EDA in machine learning and data exploration is time well spent. It ensures that the subsequent analysis is anchored in a sound understanding of the data, which makes good results more likely. So before leaping into complicated models and algorithms, invest as much time as you can with your data: your future self will appreciate it.