Exploratory Data Analysis (EDA) is a technique data scientists and analysts use to analyze and understand data sets. EDA was pioneered by the American mathematician John Tukey in the 1970s; Tukey is also known for co-developing the Fast Fourier Transform. Data scientists use raw statistical data to derive insights and advise organizations on how to improve performance.
Without proper visualization tools, it would be a challenge for data scientists to present and analyze data. Without EDA, data scientists would be making assumptions that are not grounded in a statistical view of the data.
EDA applies graphical and statistical techniques to analyze and summarize data sets. Software such as Python and Power BI offers platforms for creating useful visualizations. Through data visualization, statisticians can discover patterns, spot anomalies, and investigate correlations between variables. EDA provides a better understanding of data sets and an understandable way to present data and findings, and it helps affirm that the results provided are valid and applicable.
In this article, we will look at various ways to perform Exploratory Data Analysis and at software that can be employed for it, such as Excel, Python, and Power BI. We will elaborate on how EDA works to identify errors, understand patterns within data through graphical and visual aids, detect outliers and anomalous events, and find interesting relations among variables. Outliers are values that fall far outside the range within which the other values fall.
Why we need Exploratory Data Analysis
EDA is useful in discovering hidden patterns in data. Information sourced from EDA can be used to build and refine machine learning models that are fine-tuned to the specific needs of the data. From EDA we gain understanding, discover underlying structures, identify outliers, and get answers to various assumptions about the data set.
Exploratory Data Analysis process
The EDA process includes:
Data collection
This is the process of gathering data that will be useful when drawing insights on the matter being investigated. To collect data, an analyst first has to define the problem they want to solve; this leads to the collection of relevant data.
Data can be extracted from public or private data sources. Public data sources include websites like Kaggle, which allow data access without restrictions. You can freely acquire data, perform analysis, and draw reasonable output from it.
Private data sources, however, require authorization and authentication to retrieve data from. These could be company records or private shared websites where companies upload their data.
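As an illustration, here is a minimal sketch of loading a public data set with pandas in Python; the file name "sales_data.csv" and its columns are hypothetical placeholders standing in for a file downloaded from a source like Kaggle.

```python
import pandas as pd

# Load a dataset downloaded from a public source such as Kaggle.
# "sales_data.csv" is a hypothetical file name used for illustration.
df = pd.read_csv("sales_data.csv")

# A first look at the data: shape, column names, and the first few rows.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```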
Data cleaning
This process involves checking the data type of each column, removing unused columns, removing duplicates, handling missing values, checking for outliers, renaming columns (e.g. those with spaces in their names), and checking that values have the correct format. Cleaning leaves data that can be used to draw useful insights.
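A minimal pandas sketch of some of these cleaning steps, assuming the same hypothetical "sales_data.csv" file; "unused_column" and "order date" are placeholder column names:

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical file from the earlier example

# Check the data type of each column.
print(df.dtypes)

# Remove an unused column and drop duplicate rows.
df = df.drop(columns=["unused_column"])  # "unused_column" is a placeholder name
df = df.drop_duplicates()

# Rename columns that contain spaces so they are easier to work with.
df = df.rename(columns={"order date": "order_date"})

# Make sure values have the correct format, e.g. parse dates.
df["order_date"] = pd.to_datetime(df["order_date"])
```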
There are often discussions on how long data cleaning should take. Technically there is no set period, since the process depends on several variables: the rawness of the data presented to the analyst, the information the analyst needs to extract, and the tools available, since different tools offer different functions for different purposes.
Data processing
This is the manipulation and transformation of data into useful insight. It involves several kinds of analysis, including univariate, bivariate, and multivariate analysis. Through data processing, we identify trends through manual or automated processes. In Python, modules such as pandas and NumPy are useful for automated data processing.
Univariate analysis is the process of analyzing each variable individually, mainly to understand its distribution and to identify outliers and anomalies.
Bivariate analysis is the process of analyzing the relationship between a pair of variables. This can be done by plotting graphs and charts from which the correlation between the two variables can be read.
Multivariate analysis involves analyzing relationships among several variables at once, which helps gain a deeper understanding of the data set.
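Here is a brief sketch of the three kinds of analysis using pandas and seaborn, again on the hypothetical data set; "price", "quantity", and "discount" are placeholder column names:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sales_data.csv")  # hypothetical data set

# Univariate analysis: distribution of a single variable.
print(df["price"].describe())  # summary statistics for one column
df["price"].hist(bins=30)      # histogram of its distribution
plt.show()

# Bivariate analysis: relationship between a pair of variables.
df.plot.scatter(x="price", y="quantity")
plt.show()
print(df["price"].corr(df["quantity"]))  # correlation coefficient

# Multivariate analysis: relationships among several variables at once.
sns.pairplot(df[["price", "quantity", "discount"]])
plt.show()
```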
Data visualization
This involves the graphical representation of data. Visual output enables even those without data expertise to draw conclusions based on what the models show, and it makes it easier to compare relationships between variables. Data can be visualized using, but not limited to, bar graphs, histograms, line graphs, charts, heatmaps for regression and correlation analysis, and pivot tables in Excel. Tools such as Excel, Python, and Power BI offer strong visualization features.
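As a sketch, bar and line graphs take only a few lines of Python with pandas and matplotlib; the grouping columns below ("category", "order_date", "quantity", "price") are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_data.csv")  # hypothetical data set

# Bar graph: total quantity sold per product category.
df.groupby("category")["quantity"].sum().plot.bar()
plt.title("Quantity sold per category")
plt.show()

# Line graph: total sales per day.
df.groupby("order_date")["price"].sum().plot.line()
plt.title("Sales over time")
plt.show()
```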
Tools for Exploratory Data Analysis
Excel
Excel is a useful tool for data analytics. It allows the user to manage data and develop reporting and insight from it. Excel has tools to perform the necessary EDA functionality, including data sourcing, data cleaning, data processing, and data visualization. Data stored in a worksheet can also be read from software like Python, as shown in the sketch after this list.
Data sourcing - Excel offers options to import data from the web or from local storage so you can perform analysis on it.
Data cleaning - Excel allows you to clean data by removing duplicates, filling in missing values, and transforming data. This will enable working with a more refined data set.
Data processing - Excel has formulas and functions that are useful for data processing. A few include INDEX and MATCH, VLOOKUP, IF tests, COUNTIF, COUNTIFS, SUMIF, and SUMIFS.
Data visualization - Excel can create charts, line graphs, bar graphs, and pivot tables to graphically visualize data. In pivot tables, you can add slicers to view data from different perspectives.
Excel, however, has a limit on the amount of data it can handle: 1,048,576 rows and 16,384 columns per worksheet.
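As mentioned above, worksheet data can be read from Python. Here is a minimal sketch with pandas, where "report.xlsx" and "Sheet1" are placeholder names; reading .xlsx files requires the openpyxl package to be installed:

```python
import pandas as pd

# Read one worksheet from an Excel workbook into a DataFrame.
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
print(df.head())
```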
Python
Python as a programming language is a useful tool for data analysis, data science, machine learning, and artificial intelligence. Python has built-in functions and libraries that allow it to source, clean, process, and visualize data.
Data sourcing - Python allows you to source data from the web, local storage, or Excel using the pandas module. It accepts data in different formats, including .csv, .xlsx, and JSON, among others.
Data cleaning - Python allows you to check for missing values using the isna() function from the pandas module. You can combine it with other functions to get the number of missing values and to show which columns contain them.
You can use the dropna() function in pandas to drop rows or columns with missing values. You can merge text columns using the "+" sign and replace column values. Python also has the NumPy module, which allows you to create and manipulate arrays and work with numerical data.
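A short sketch of these pandas and NumPy operations, assuming hypothetical columns such as "first_name", "last_name", and "price":

```python
import pandas as pd
import numpy as np

df = pd.read_csv("sales_data.csv")  # hypothetical data set

# Count missing values per column.
print(df.isna().sum())

# Drop rows with missing values; axis=1 would drop columns instead.
df = df.dropna()

# Merge two text columns into one using the "+" sign.
df["full_name"] = df["first_name"] + " " + df["last_name"]

# NumPy works alongside pandas for numerical operations.
prices = np.array(df["price"])
print(np.mean(prices))
```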
Data processing - The pandas module allows for data processing through functions like .describe(), which gives summary statistics for a data set's numeric columns.
Data visualization - Python has powerful modules for visualization, including seaborn and matplotlib. These modules allow you to create heatmaps for regression and correlation analysis, line charts, bar graphs, histograms, and other visuals.
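For example, a correlation heatmap takes only a few lines with seaborn; the data set is again the hypothetical one used above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sales_data.csv")  # hypothetical data set

# Heatmap of pairwise correlations between the numeric columns.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()
```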
Python code can be deployed on Git platforms such as GitHub to allow access by various people.
Power BI
Power BI is a powerful tool for data analysis, famous for its rich visualization features. In Power BI, you can import CSV and Excel files. It creates relationships among the tables in a dataset using primary keys and foreign keys.
After analyzing and processing the data, the tool tries to create a model for you that relates the tables in the data, typically using a star schema approach. The built-in model may not be specific to your needs, so you can edit it to suit them.
Power BI reports can be published to Microsoft's Power BI service to allow access by various users.
In this article, we have looked at the usefulness of EDA and how to perform it on different platforms. EDA allows users to derive useful insights from given data sets and offers understandable graphical visualizations. Data analysts and scientists are encouraged to keep learning as new modules and frameworks emerge to help them achieve their purpose of driving organizational growth using data.