<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Byron Morara</title>
    <description>The latest articles on DEV Community by Byron Morara (@byron_25abe71a080f93df530).</description>
    <link>https://dev.to/byron_25abe71a080f93df530</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1863243%2F20b40697-3a2e-4b28-9ab2-62e488849532.png</url>
      <title>DEV Community: Byron Morara</title>
      <link>https://dev.to/byron_25abe71a080f93df530</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/byron_25abe71a080f93df530"/>
    <language>en</language>
    <item>
      <title>The Ultimate Guide to Data Analytics</title>
      <dc:creator>Byron Morara</dc:creator>
      <pubDate>Sun, 25 Aug 2024 11:42:55 +0000</pubDate>
      <link>https://dev.to/byron_25abe71a080f93df530/the-ultimate-guide-to-data-analytics-2nc7</link>
      <guid>https://dev.to/byron_25abe71a080f93df530/the-ultimate-guide-to-data-analytics-2nc7</guid>
      <description>&lt;p&gt;In a world that has increasingly pivoted to data-driven decisions, those who can turn raw data into actionable insights hold the key to unlocking a business's true potential. In today's discussion, I will take you through the different types of data analytics, the data analysis process, tools and techniques used in the data analytics process, application of data analytics in various industries, challenges in the process, the future of data analytics, and some resources for learning and growing in data analytics. Let's get into it then!&lt;/p&gt;

&lt;p&gt;Let's first understand what data analytics is. Definitions differ slightly in wording, but they all point to the same core idea. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data analytics&lt;/strong&gt; is the process of collecting, transforming, and organizing data in order to draw meaningful insights, make predictions, and drive informed decision-making, utilizing a range of tools, technologies, and processes.&lt;/p&gt;

&lt;p&gt;The process grants businesses and institutions greater control over the direction of the business and the quality of the decisions made, because those decisions rest on objective data and concrete evidence, and their results can be measured to assess impact. Data-driven decisions are all around us; here are a few examples: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized recommendations from streaming platforms; Spotify and Netflix are good examples.&lt;/li&gt;
&lt;li&gt;Navigation and traffic predictions, apps like Google Maps analyze traffic data and provide you with the fastest route to your destination.&lt;/li&gt;
&lt;li&gt;Online shopping and product recommendations - companies like Amazon and Alibaba use customer purchase data to recommend products you would potentially like and to predict sales.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;The Data Analysis process&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;To convert raw data into meaningful insights and valuable data that can be used to make predictions or train machine learning models, data analysts and scientists follow the data analysis process. In this section we will discuss the process in detail from start to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identifying the business question&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before collecting data, an analyst must first answer some questions that will guide the data collection process. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What’s the goal or purpose of this research?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What kind of data is required for this research?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What methods and procedures will be used to collect, store, and process the data?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Data collection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After identifying the business question, you move to what is typically considered the first step in the data analysis process. This is the process of gathering, measuring, and recording information on variables of interest. We will discuss two methods of collecting data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary data collection&lt;/strong&gt;&lt;br&gt;
This process involves collecting data directly from the source or through direct interaction with the respondents. This method allows researchers to obtain firsthand information specifically tailored to their research objectives. There are various techniques for primary data collection, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect data from individuals or groups. These can be conducted through face-to-face interviews, telephone calls, mail, or online platforms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interviews: They involve direct interaction between the researcher and the respondent. They can be conducted in person, over the phone, or through video conferencing. Interviews can be structured (with predefined questions), semi-structured (allowing flexibility), or unstructured (more conversational).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Observations: Researchers observe and record behaviors, actions, or events in their natural setting. This method is useful for gathering data on human behavior, interactions, or phenomena without direct intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Experiments: Experimental studies involve the manipulation of variables to observe their impact on the outcome. Researchers control the conditions and collect data to draw conclusions about cause-and-effect relationships(mostly conducted in laboratories).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Secondary Data Collection&lt;/strong&gt;&lt;br&gt;
This involves the use of already existing data collected by somebody else. The individual might have collected the data for a purpose different from that of the current study, but it can still be relevant to it. Some of the techniques employed here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Government and Institutional Records: Government agencies, research institutions, and organizations often maintain databases or records that can be used for research purposes. For example: census data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Publicly Available Data: Data shared by individuals, organizations, or communities on public platforms, websites, or social media can be accessed and utilized for research.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research articles, statistical information, economic data, and social surveys.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Data Cleaning and Preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After data is collected it needs to be prepared for analysis. One such step is cleaning the data, which involves the systematic identification and correction of errors, inconsistencies, and inaccuracies within the dataset. Datasets may contain problems like outliers, missing data, spelling errors, and more. The data can be cleaned in the following ways:&lt;/p&gt;

&lt;p&gt;a). Handling outliers: Outliers are values or data points that lie far outside the normal sample range. These can either be removed or left in if they are few and would not skew the results.&lt;/p&gt;
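&lt;p&gt;As a rough sketch of the idea (the column name and sales figures below are made up), one common approach, the 1.5 x IQR rule, can flag and drop such points in pandas:&lt;/p&gt;

```python
import pandas as pd

# hypothetical sales figures with one extreme value
df = pd.DataFrame({"sales": [120, 130, 125, 128, 122, 900]})

# flag values lying more than 1.5 * IQR beyond the quartiles
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["sales"] < lower) | (df["sales"] > upper)]
df_clean = df[(df["sales"] >= lower) & (df["sales"] <= upper)]
print(outliers["sales"].tolist())   # [900]
```

&lt;p&gt;Whether to drop or keep the flagged rows is a judgment call, as noted above; the rule only surfaces the candidates.&lt;/p&gt;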

&lt;p&gt;b). Handling Missing Data: Devise strategies to handle missing data effectively. This may involve imputing missing values based on statistical methods, removing records with missing values, or employing advanced imputation techniques. This ensures a more complete dataset, preventing biases and maintaining the integrity of analyses.&lt;/p&gt;

&lt;p&gt;c). Removal of Unwanted Observations: Identify and eliminate irrelevant or redundant observations from the dataset. The step involves scrutinizing data entries for duplicate records, irrelevant information, or data points that do not contribute meaningfully to the analysis. This streamlines the dataset, reducing noise and improving the overall quality.&lt;/p&gt;

&lt;p&gt;d). Fixing Structure errors: Address structural issues in the dataset, such as inconsistencies in data formats, naming conventions, or variable types. Standardize formats, correct naming discrepancies, and ensure uniformity in &lt;a href="https://www.tutorialspoint.com/computer_concepts/computer_concepts_representation_data_information.htm" rel="noopener noreferrer"&gt;data representation&lt;/a&gt;. Fixing such errors enhances data consistency and facilitates accurate analysis and interpretation.&lt;/p&gt;
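&lt;p&gt;A minimal sketch of such structural fixes in pandas (the column names and values are invented for illustration): standardizing column names, trimming and casing text values, and parsing date strings into proper datetimes.&lt;/p&gt;

```python
import pandas as pd

# hypothetical dataset with inconsistent naming and formats
df = pd.DataFrame({
    "Customer Name": ["  alice ", "BOB"],
    "join_date": ["2024/01/05", "2024/02/05"],
})

# standardize column names: lower-case with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# standardize text values and convert the date column to a datetime type
df["customer_name"] = df["customer_name"].str.strip().str.title()
df["join_date"] = pd.to_datetime(df["join_date"])

print(df.columns.tolist())   # ['customer_name', 'join_date']
```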

&lt;p&gt;&lt;strong&gt;4. Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This step involves the application of statistical and machine learning methods to understand and gain meaningful insights from the data. Programming languages and statistical software are used to achieve this. The following are some of the analysis types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Regression analysis: Regression analysis is great for establishing trends and making predictions for the future. Using regression analysis, you can measure the relationship between variables by testing how different factors, known as independent variables, impact the dependent variable. Accountants can use regression analysis to help organizations make informed business decisions, while marketers and business owners can use this method to determine the factors influencing customer buying decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Discourse analysis: Discourse analysis is a qualitative method used to explore how language is used in real-world social contexts. You can better understand how cultural values, beliefs, and conventions influence communication by performing discourse analysis. This helps clarify misunderstandings and establish the meaning behind verbal and nonverbal communication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hypothesis analysis: During a hypothesis analysis, you will develop two different hypotheses: Null and alternative. The null hypothesis states that no difference exists between the two groups, while the alternative hypothesis usually states the opposite. The goal of a hypothesis analysis, also called hypothesis testing, is to disprove the null hypothesis by demonstrating the difference between the two groups, thus validating the alternative hypothesis. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Content analysis: Content analysis can be used when working with qualitative data, such as different forms of communication. This type of data analysis allows you to quantify relationships and meanings found within qualitative data, such as using certain words or concepts. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data mining: Data mining is the process of using computers to sort through large amounts of data to establish patterns or trends. With this method, you can automate the process of analyzing information and make predictions based on future probabilities and other useful insights. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cluster analysis: The cluster analysis method sorts data into clusters based on their similarity. Cluster analysis is an unsupervised learning method, which means the model does the sorting instead of you having to sort data into clusters yourself. Because of this, you don’t know what the clusters are or how many exist before the cluster analysis. Cluster analysis can be particularly helpful in market segmentation, machine learning, pattern recognition, bioinformatics, and image analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
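&lt;p&gt;The regression analysis described above can be sketched in a few lines of scikit-learn. The advertising-spend and sales figures here are made up and deliberately follow a straight line, so the fitted relationship is easy to check:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: advertising spend (independent) vs. sales (dependent)
spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 65, 85, 105])   # follows sales = 2 * spend + 5

# fit the model and read off the measured relationship
model = LinearRegression().fit(spend, sales)
print(model.coef_[0], model.intercept_)

# predict the dependent variable for a new value of the independent one
prediction = model.predict([[60]])
```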

&lt;p&gt;Factor analysis: Using factor analysis, you can take many variables and reduce them to a smaller number of factors to determine the amount of variance between the different variables and assign them a score. This method is especially helpful when working with complex data that has a high number of interconnected variables. &lt;/p&gt;
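&lt;p&gt;Cluster analysis, mentioned above, can be sketched with scikit-learn's KMeans. The two-dimensional customer data below is invented and contains two obvious groups; the model finds them without being told what the groups are:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical customer data: two well-separated groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
              [8.0, 9.0], [8.5, 9.5], [7.8, 8.8]])

# unsupervised: the model assigns each point to a cluster itself
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```

&lt;p&gt;Note that the analyst still chooses the number of clusters to look for; methods such as the elbow heuristic are often used to pick it.&lt;/p&gt;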

&lt;p&gt;&lt;strong&gt;5. Data Visualization, interpretation, and Reporting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data visualization is the process of putting data into charts, graphs, or any other visual format that helps inform analysis and interpretation. The visuals present analyzed data in ways that are accessible and easily understood by different stakeholders. Some of the most popular data visual formats include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Frequency tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cross-tabulation tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bar charts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Line graphs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pie charts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Heat Maps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scatter Plots&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
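&lt;p&gt;As a small illustration, a bar chart like those listed above can be produced with Matplotlib (the monthly figures are made up):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# hypothetical monthly sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, ax = plt.subplots()
ax.bar(months, sales)
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly Sales")
fig.savefig("monthly_sales.png")
```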

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Tools and Software in Data Analytics&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
The following are some of the tools used in data analytics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; Programming Languages: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/data-analysis-with-python/" rel="noopener noreferrer"&gt;Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/data-analysis-using-r/" rel="noopener noreferrer"&gt;R&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2.&lt;/strong&gt; Data Processing Tools: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://mode.com/sql-tutorial/introduction-to-sql" rel="noopener noreferrer"&gt;SQL&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/data-analysis-in-excel/" rel="noopener noreferrer"&gt;Excel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://support.google.com/docs/answer/9330962?hl=en" rel="noopener noreferrer"&gt;Google Sheets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
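&lt;p&gt;SQL is the workhorse among the processing tools above. A minimal sketch using Python's built-in sqlite3 module (the table and figures are invented) shows the kind of grouping and aggregation that data analysts run constantly:&lt;/p&gt;

```python
import sqlite3

# in-memory database with a hypothetical orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)])

# a typical analytics query: total spend per customer
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)   # [('alice', 80.0), ('bob', 20.0)]
conn.close()
```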

&lt;p&gt;&lt;strong&gt;3.&lt;/strong&gt; Visualization Tools: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://help.tableau.com/current/pro/desktop/en-us/analyze.htm" rel="noopener noreferrer"&gt;Tableau&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/training/paths/build-power-bi-visuals-reports/" rel="noopener noreferrer"&gt;Power BI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/data-visualization-using-matplotlib/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.geeksforgeeks.org/data-visualization-with-python-seaborn/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4.&lt;/strong&gt; Big Data Tools: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.altexsoft.com/blog/apache-spark-pros-cons/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5.&lt;/strong&gt; Cloud Platforms: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/free/analytics/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/pricing/" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://azure.microsoft.com/en-us/products/category/analytics" rel="noopener noreferrer"&gt;Azure&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data analytics is no longer just a competitive advantage; it’s a fundamental driver of success in today’s data-rich world. From improving business processes to enhancing customer experiences, the ability to analyze data effectively can unlock new opportunities and insights. As technology continues to evolve, so will the power of data analytics, becoming even more accessible, efficient, and impactful. By embracing data analytics and staying ahead of emerging trends, individuals and organizations can ensure they are well-positioned to thrive in a future where data is the key to informed decision-making and innovation.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>learning</category>
      <category>data</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Ultimate Guide to Data Analytics: Techniques and Tools</title>
      <dc:creator>Byron Morara</dc:creator>
      <pubDate>Wed, 21 Aug 2024 12:27:50 +0000</pubDate>
      <link>https://dev.to/byron_25abe71a080f93df530/the-ultimate-guide-to-data-analytics-techniques-and-tools-5h0</link>
      <guid>https://dev.to/byron_25abe71a080f93df530/the-ultimate-guide-to-data-analytics-techniques-and-tools-5h0</guid>
      <description>&lt;p&gt;In an age where the value of data has drastically increased, reliance on data to make important decisions has increasingly become the norm for businesses, institutions, and even individuals.&lt;br&gt;
&lt;strong&gt;Data analytics&lt;/strong&gt; involves the process of collecting, organizing, transforming and analyzing data to draw insights. It helps businesses and institutions optimize their operations, be it in terms of costs or the market. It is used in many different industries, as discussed in &lt;a href="https://www.simplilearn.com/tutorials/data-analytics-tutorial/applications-of-data-analytics" rel="noopener noreferrer"&gt;the following article.&lt;/a&gt;&lt;br&gt;
This guide aims to discuss the techniques and tools used in the process. I hope it will be helpful, and for anything that needs correction or clarification, don't hesitate to reach out, as this is my first ever technical article. So let's get into it then.&lt;/p&gt;

&lt;p&gt;These techniques are used to analyze both qualitative and quantitative data. &lt;a href="https://www.nnlm.gov/guides/data-glossary/quantitative-data#:~:text=Quantitative%20data%20are%20data%20represented,or%20given%20a%20numerical%20value." rel="noopener noreferrer"&gt;Quantitative data&lt;/a&gt; is analyzed using descriptive statistics and &lt;a href="https://www.sciencedirect.com/topics/medicine-and-dentistry/inferential-statistics#:~:text=Inferential%20statistics%20involves%20the%20use,from%20the%20population%20of%20interest." rel="noopener noreferrer"&gt;inferential statistics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Descriptive Analytics&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.simplilearn.com/what-is-descriptive-statistics-article#:~:text=Descriptive%20statistics%20refers%20to%20a%20set%20of%20methods%20used%20to,help%20identify%20patterns%20and%20relationships." rel="noopener noreferrer"&gt;Descriptive statistics&lt;/a&gt; involves summarizing and organizing data to describe the current situation. It uses measures like mean, median, mode, and standard deviation to describe the main features of a data set.&lt;/p&gt;

&lt;p&gt;Example: A company analyzes sales data to determine the monthly average sales over the past year. They calculate the mean sales figures and use charts to visualize the sales trends.&lt;/p&gt;
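&lt;p&gt;The descriptive measures in this example take only a few lines of pandas (the monthly figures below are made up):&lt;/p&gt;

```python
import pandas as pd

# hypothetical monthly sales for the past year
sales = pd.Series([200, 220, 210, 250, 230, 240,
                   260, 255, 245, 235, 225, 215])

# the core descriptive statistics named above
print(sales.mean())     # average monthly sales
print(sales.median())
print(sales.std())
```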

&lt;p&gt;&lt;strong&gt;2. Diagnostic Analytics&lt;/strong&gt;&lt;br&gt;
Diagnostic analysis goes beyond descriptive statistics to understand why something happened and determine the causes of trends and correlations between variables. It looks at data to find the causes of events, mostly seeking to answer the "why" questions.&lt;/p&gt;

&lt;p&gt;Example: After noticing a drop in sales, a retailer uses diagnostic analysis to investigate the reasons. They examine marketing efforts, economic conditions, and competitor actions to identify the cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Predictive Analytics&lt;/strong&gt;&lt;br&gt;
Predictive analytics is the process of using historical data to forecast future outcomes. The process uses data analysis, machine learning, artificial intelligence, and statistical models to find patterns that might predict future behavior. &lt;br&gt;
Example: An insurance company uses predictive analysis to assess the risk of claims by analyzing historical data on customer demographics, driving history, and claim history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prescriptive Analytics&lt;/strong&gt;&lt;br&gt;
Prescriptive analytics is the use of advanced processes and tools to analyze data and content to recommend the optimal course of action or strategy moving forward. Simply put, it seeks to answer the question, “What should we do?”&lt;br&gt;
Example: An online retailer uses prescriptive analysis to optimize its inventory management. The system recommends the best products to stock based on demand forecasts and supplier lead times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ESSENTIAL TOOLS FOR DATA ANALYTICS&lt;/strong&gt;&lt;br&gt;
There are many different tools that are used in data analytics. The different software and programming languages help to provide an environment that enables data analysts to efficiently clean, transform and analyze data.&lt;/p&gt;

&lt;p&gt;Programming languages include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.w3schools.com/datascience/ds_python.asp#:~:text=Python%20is%20a%20programming%20language,please%20visit%20our%20Python%20Tutorial." rel="noopener noreferrer"&gt;Python&lt;/a&gt;: Python has in-built mathematical libraries and functions, making it easier to calculate mathematical problems and to perform data analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.sisense.com/glossary/r-analytics/" rel="noopener noreferrer"&gt;R&lt;/a&gt;: R analytics (or the R programming language) is free, open-source software used for all kinds of data science, statistics, and visualization projects. R is powerful, versatile, and able to be integrated into BI platforms like Sisense, to help you get the most out of business-critical data.&lt;br&gt;
These integrations include everything from statistical functions to predictive models, such as linear regression. R also allows you to build and run statistical models using Sisense data, automatically updating these as new information flows into the model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data visualization tools include: Tableau, Power BI, &lt;a href="https://lookerstudio.google.com/u/0/navigation/reporting" rel="noopener noreferrer"&gt;Google Data Studio&lt;/a&gt;. I will discuss the first two tools with a little more detail, because they are commonly used.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.microsoft.com/en-us/power-platform/products/power-bi" rel="noopener noreferrer"&gt;Power BI&lt;/a&gt;: is a technology-driven business intelligence tool provided by Microsoft for analyzing and visualizing raw data to present actionable information. It combines business analytics, data visualization, and best practices that help an organization to make data-driven decisions. It converts data from different sources to build interactive dashboards and Business Intelligence reports. It is highly preferred because of the following reasons:&lt;/li&gt;
&lt;li&gt;Access to Volumes of Data from Multiple Sources
Power BI can access vast volumes of data from multiple sources. It allows you to view, analyze, and visualize quantities of data that cannot be opened in Excel. Some of the important data sources available for Power BI are Excel, CSV, XML, JSON, PDF, etc. Power BI uses powerful compression algorithms to import and cache the data within the .PBIX file.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Interactive UI/UX Features &lt;br&gt;
Power BI makes things visually appealing. It has easy drag-and-drop functionality, with features that allow you to copy formatting across similar visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exceptional Excel Integration &lt;br&gt;
Power BI helps to gather, analyze, publish, and share Excel business data. Anyone familiar with Office 365 can easily connect Excel queries, data models, and reports to Power BI Dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accelerate Big Data Preparation with Azure &lt;br&gt;
Using Power BI with Azure allows you to analyze and share massive volumes of data. An Azure data lake can reduce the time it takes to get insights and increase collaboration between business analysts, data engineers, and data scientists.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Turn Insights into Action &lt;br&gt;
Power BI allows you to gain insights from data and turn those insights into actions to make data-driven business decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time Stream Analytics&lt;br&gt;
Power BI will enable you to perform real-time stream analytics. It helps you fetch data from multiple sensors and social media sources to get access to real-time analytics, so you are always ready to make business decisions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tableau.com/" rel="noopener noreferrer"&gt;Tableau&lt;/a&gt;: is an analytics solution that allows users to connect, analyze, and share their data. The software started as a visualization tool, growing into an enterprise platform with several deployment options until they were acquired by Salesforce in 2019.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Statistical analysis tools include:&lt;br&gt;
&lt;a href="https://www.sas.com/en_us/home.html" rel="noopener noreferrer"&gt;SAS&lt;/a&gt;: is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.ibm.com/spss" rel="noopener noreferrer"&gt;SPSS&lt;/a&gt;: is a suite of software programs that analyzes scientific data related to the social sciences. SPSS offers a fast-visual modeling environment that ranges from the smallest to the most complex models. The data obtained from SPSS is used for surveys, data mining, market research, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Big data tools include:&lt;br&gt;
&lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The success of a data analytics project depends on the tools and techniques used, just as a good cook depends on knowing how to prepare their ingredients. The tools and techniques chosen will depend on many factors, including: the nature of the data, the purpose of the analysis, the complexity of the problem, the domain or industry, cost and resources, integration with existing systems, and data privacy and security, just to mention a few.&lt;/p&gt;

&lt;p&gt;You can read more about the data analysis process, its history, and its projected growth as we move into a world that relies more and more on data.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
      <category>learning</category>
    </item>
    <item>
      <title>Feature Engineering: Unlocking the Power of Data for Superior Machine Learning Models</title>
      <dc:creator>Byron Morara</dc:creator>
      <pubDate>Wed, 21 Aug 2024 12:13:24 +0000</pubDate>
      <link>https://dev.to/byron_25abe71a080f93df530/feature-engineering-unlocking-the-power-of-data-for-superior-machine-learning-models-50jg</link>
      <guid>https://dev.to/byron_25abe71a080f93df530/feature-engineering-unlocking-the-power-of-data-for-superior-machine-learning-models-50jg</guid>
      <description>&lt;p&gt;Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in machine learning, mostly in &lt;a href="https://cloud.google.com/discover/what-is-supervised-learning" rel="noopener noreferrer"&gt;supervised learning&lt;/a&gt;. It consists of five processes: feature creation, transformations, feature extraction, exploratory data analysis and benchmarking.  In this context, a 'feature' is any measurable input that can be used in a predictive model. It could be the sound of an animal, a color, or someone's voice.&lt;/p&gt;

&lt;p&gt;This technique enables data scientists to extract the most valuable insights from data which ensures more accurate predictions and actionable insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Types of features&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As stated above, a feature is any measurable input that can be used in a predictive model. Let's go through the types of features used in machine learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Numerical features:&lt;/strong&gt; These features are continuous variables that can be measured on a scale. For example: age, weight, height and income. These features can be used directly in machine learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Categorical features:&lt;/strong&gt; These are discrete values that can be grouped into categories. They include: gender, zip-code, and color. Categorical features in machine learning typically need to be converted to numerical features before they can be used in machine learning algorithms. You can easily do this using &lt;a href="https://www.geeksforgeeks.org/ml-one-hot-encoding/" rel="noopener noreferrer"&gt;one-hot&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/ml-one-hot-encoding/" rel="noopener noreferrer"&gt;label&lt;/a&gt;, and &lt;a href="https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/" rel="noopener noreferrer"&gt;ordinal&lt;/a&gt; encoding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Time-series features:&lt;/strong&gt; These features are measurements that are taken over time. Time-series features include stock prices, weather data, and sensor readings. These features can be used to train machine learning models that can predict future values or identify patterns in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text features:&lt;/strong&gt; These are text strings that can represent words, phrases, or sentences. Examples of text features include product reviews, social media posts, and medical records. You can use text features to train machine learning models that can understand the meaning of text or classify text into different categories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the most crucial processes in the machine learning pipeline is: &lt;strong&gt;feature selection&lt;/strong&gt;, which is the process of selecting the most relevant features in a dataset to facilitate model training. It enhances the model's predictive performance and robustness, making it less likely to overfit to the training data. The process is crucial as it helps to reduce overfitting, enhance model interpretability, improve accuracy, and reduce training times.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Techniques in feature engineering&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imputation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This technique deals with handling missing values, one of the issues you will encounter as you prepare your data for cleaning and standardization. Missing data is mostly caused by privacy concerns, human error, and interruptions in the data flow. Imputation can be classified into two categories: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Categorical Imputation&lt;/strong&gt;: Missing categorical values are usually replaced by the most commonly occurring value in other records (the mode). This approach works with both numerical and categorical values; however, it ignores feature correlation. You can use scikit-learn's &lt;code&gt;SimpleImputer&lt;/code&gt; class for this imputation method. The same class also handles imputation by mean and median, as shown below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# impute Graduated and Family_Size features with most_frequent values

from sklearn.impute import SimpleImputer
impute_mode = SimpleImputer(strategy = 'most_frequent')
impute_mode.fit(df[['Graduated', 'age']])

df[['Graduated', 'age']] = impute_mode.transform(df[['Graduated', 'age']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Numerical Imputation&lt;/strong&gt;: Missing numerical values are generally replaced by the mean of the corresponding column, which is why this is also called imputation by mean. The method is simple, fast, and works well with small datasets. However, it has limitations: outliers in a column can skew the resulting mean, which can hurt the accuracy of the ML model, and it fails to consider feature correlation while imputing the missing values.
You can use the &lt;code&gt;fillna&lt;/code&gt; function to impute the missing values with the column mean.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Impute Work_Experience feature by its mean in our dataset

df['Work_Experience'] = df['Work_Experience'].fillna(df['Work_Experience'].mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Encoding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the process of converting categorical data into numerical (continuous) data. The following are some of the techniques of feature encoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Label encoding&lt;/strong&gt;: Label encoding assigns each category a unique integer (for example, red = 0, blue = 1, green = 2). It is simple, but it implies an ordering between the categories, so it is best suited to ordinal data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One-hot encoding&lt;/strong&gt;: One-hot encoding creates a separate binary column for each category, so no artificial ordering is introduced. This is the most common way to convert categorical variables into a form that ML algorithms can use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Binary encoding&lt;/strong&gt;: Binary encoding first label-encodes each category to an integer, then represents that integer in binary, with each binary digit becoming its own column. This keeps the number of new columns small for high-cardinality features.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
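&lt;p&gt;To make these concrete, here is a minimal pandas sketch (the &lt;code&gt;color&lt;/code&gt; column and its values are hypothetical) showing label and one-hot encoding side by side:&lt;/p&gt;

```python
import pandas as pd

# A small hypothetical dataset with one categorical feature
df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Label encoding: map each category to an integer code
# (pandas orders categories alphabetically: blue=0, green=1, red=2)
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)

print(df.columns.tolist())
```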

&lt;p&gt;&lt;strong&gt;Scaling and Normalization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing it is also known as data normalization and is generally performed during the data preprocessing step. For example, if you have independent variables such as age (18–100 years), salary (25,000–75,000 euros), and height (1–2 meters), feature scaling brings them all into the same range, for example centered around 0 or within (0, 1), depending on the scaling technique. &lt;/p&gt;

&lt;p&gt;Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling: X' = (X − Xmin) / (Xmax − Xmin), where Xmax and Xmin are the maximum and the minimum values of the feature, respectively.&lt;/p&gt;
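&lt;p&gt;The Min-Max formula is simple enough to sketch in plain Python (the age values below are made up for illustration):&lt;/p&gt;

```python
def min_max_scale(values):
    """Rescale a list of numbers into the range [0, 1] using Min-Max scaling."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

ages = [18, 40, 59, 100]
scaled = min_max_scale(ages)
print(scaled)  # the smallest value maps to 0.0, the largest to 1.0
```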

&lt;p&gt;&lt;strong&gt;Binning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Binning (also called bucketing) is a feature engineering technique that groups different numerical subranges into bins or buckets. In many cases, binning turns numerical data into categorical data. For example, consider a feature named X whose lowest value is 15 and highest value is 425. Using binning, you could represent X with the following five bins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bin 1: 15 to 34&lt;/li&gt;
&lt;li&gt;Bin 2: 35 to 117&lt;/li&gt;
&lt;li&gt;Bin 3: 118 to 279&lt;/li&gt;
&lt;li&gt;Bin 4: 280 to 392&lt;/li&gt;
&lt;li&gt;Bin 5: 393 to 425&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bin 1 spans the range 15 to 34, so every value of X between 15 and 34 ends up in Bin 1. A model trained on these bins will react no differently to X values of 17 and 29 since both values are in Bin 1.&lt;/p&gt;
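&lt;p&gt;The five bins above can be implemented with a simple boundary lookup; this is a minimal pure-Python sketch:&lt;/p&gt;

```python
import bisect

# Lower edges of bins 2-5 from the example above
# (Bin 1: 15-34, Bin 2: 35-117, Bin 3: 118-279, Bin 4: 280-392, Bin 5: 393-425)
boundaries = [35, 118, 280, 393]

def to_bin(x):
    """Return the 1-based bin number for a value of X in [15, 425]."""
    return bisect.bisect_right(boundaries, x) + 1

# 17 and 29 both land in Bin 1, so a model treats them identically
print(to_bin(17), to_bin(29), to_bin(425))
```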

&lt;p&gt;&lt;strong&gt;Dimensionality Reduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a method for representing a given dataset using a lower number of features (i.e. dimensions) while still capturing the original data’s meaningful properties. This amounts to removing irrelevant, redundant, or simply noisy features to create a model with a lower number of variables; in other words, transforming high-dimensional data into low-dimensional data. There are two main approaches to dimensionality reduction: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature Selection&lt;/strong&gt;: Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. The goal is to reduce the dimensionality of the dataset while retaining the most important features. There are several methods for feature selection, including filter methods, wrapper methods, and embedded methods. Filter methods rank the features based on their relevance to the target variable, wrapper methods use the model performance as the criteria for selecting features, and embedded methods combine feature selection with the model training process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature Extraction&lt;/strong&gt;: Feature extraction involves creating new features by combining or transforming the original features. The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space. There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE). PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
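&lt;p&gt;As a rough sketch of feature extraction, PCA can be computed by hand with NumPy by centering the data and taking its SVD. The synthetic dataset below is made up so that one feature is nearly a copy of another, which is exactly the situation where dimensionality reduction helps:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples, 3 features; the third feature is nearly a copy of the first,
# so most of the variance lives in fewer than 3 dimensions
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)

# PCA by hand: center the data, then use the SVD to project
# onto the top-k directions of maximum variance
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T  # shape (100, 2)

print(X_reduced.shape)
```

In practice you would reach for &lt;code&gt;sklearn.decomposition.PCA&lt;/code&gt;, which wraps exactly this centering-plus-SVD computation.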

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Automated Feature Engineering Tools&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;There are several tools that are used to automate feature engineering, let's look at some of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FeatureTools&lt;/strong&gt; -This is a popular open-source Python framework for automated feature engineering. It works across multiple related tables and applies various transformations for feature generation. The entire process is carried out using a technique called “&lt;a href="https://featuretools.alteryx.com/en/stable/getting_started/afe.html" rel="noopener noreferrer"&gt;Deep Feature Synthesis&lt;/a&gt;” (DFS) which recursively applies transformations across entity sets to generate complex features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autofeat&lt;/strong&gt; - This is a Python library that provides automated feature engineering and feature selection along with models such as &lt;em&gt;AutoFeatRegressor&lt;/em&gt; and &lt;em&gt;AutoFeatClassifier&lt;/em&gt;. These perform many scientific calculations and need good computational power. The following are some of the features of the library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works similarly to scikit-learn models, using functions such as &lt;code&gt;fit()&lt;/code&gt;, &lt;code&gt;fit_transform()&lt;/code&gt;, &lt;code&gt;predict()&lt;/code&gt;, and &lt;code&gt;score()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Can handle categorical features with one-hot encoding.&lt;/li&gt;
&lt;li&gt;Contains a feature selector class for selecting suitable features.&lt;/li&gt;
&lt;li&gt;Physical units of features can be passed, and relatable features will be computed.&lt;/li&gt;
&lt;li&gt;Implements the Buckingham Pi theorem, used for computing dimensionless quantities.&lt;/li&gt;
&lt;li&gt;Only used for tabular data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AutoML&lt;/strong&gt; - Automated Machine Learning can, in simple terms, be defined as a search concept, with specialized search algorithms for finding the optimal solution for each component of the ML pipeline. It includes &lt;a href="https://www.pecan.ai/blog/what-is-automated-feature-engineering/" rel="noopener noreferrer"&gt;Automated Feature Engineering&lt;/a&gt;, &lt;a href="https://www.tensorflow.org/decision_forests/tutorials/automatic_tuning_colab" rel="noopener noreferrer"&gt;Automated Hyperparameter Optimization&lt;/a&gt;, and &lt;a href="https://www.automl.org/nas-overview/" rel="noopener noreferrer"&gt;Neural Architecture Search (NAS)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Common Issues and Best practices in Feature Engineering&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Issues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Including irrelevant features: This can result in a model with poor predictive performance, as irrelevant features don’t contribute to the output and may even add noise to the data. The mistake stems from a lack of understanding and analysis of the relationship between the features and the target variable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine a business that wants to use machine learning to predict monthly sales. They input data such as employee count and office size, which have no relationship with sales volume.&lt;br&gt;
Fix: Avoid this by conducting a thorough feature analysis to understand which data variables are necessary and remove those that are not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overfitting from too many features: The model may have perfect performance on training data (because it has effectively ‘memorized’ the data) but perform poorly on new, unseen data; this is known as overfitting. The mistake is usually due to the misconception that “more is better.” Adding too many features also increases model complexity, making the model harder to interpret.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider an app forecasting future user growth that inputs 100 features into their model, but most of them share overlapping information.&lt;br&gt;
Fix: Counter this by using strategies like dimensionality reduction and feature selection to minimize the number of inputs, thus reducing the model complexity.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not normalizing features: The algorithm may give more weight to features with a larger scale, which can lead to inaccurate predictions. This mistake often happens due to a lack of understanding of how machine learning algorithms work. Most algorithms perform better if all features are on a similar scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Imagine a healthcare provider uses patient age and income level to predict the risk of a certain disease but doesn’t normalize these features, which have different scales.&lt;br&gt;
Fix: Apply feature scaling techniques to bring all the variables into a similar scale to avoid this issue.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neglecting to handle missing values: Models can behave unpredictably when confronted with missing values, sometimes leading to faulty predictions. This pitfall often happens because of an oversight or the assumption that the presence of missing values won’t adversely affect the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, an online retailer predicting customer churn uses purchase history data but does not address instances where purchase data is absent.&lt;br&gt;
Fix: Implement strategies to deal with missing values, such as data imputation, where you replace missing values with statistical estimates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Make sure to handle missing data in your input features: In a real-world case where a project aims to predict housing prices, not all data entries may have information about a house’s age. Instead of discarding these entries, you may impute the missing data by using a strategy like “mean imputation,” where the average value of the house’s age from the dataset is used. By correctly handling missing data instead of just discarding it, the model will have more data to learn from, which could lead to better model performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use one-hot encoding for categorical data: For instance, if we have a feature “color” in a dataset about cars, with the possible values of “red,” “blue,” and “green,” we would transform this into three separate binary features: “is_red,” “is_blue,” and “is_green.” This strategy allows the model to correctly interpret categorical data, improving the quality of the model’s findings and predictions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consider feature scaling: As a real example, a dataset for predicting disease may have age in years (1–100) and glucose level measurements (70–180). Scaling places these two features on the same scale, allowing each to contribute equally to distance computations, as in the K-nearest neighbors (KNN) algorithm. Feature scaling may improve the performance of many machine learning algorithms, rendering them more efficient and reducing computation time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create interaction features where relevant: When predicting house prices, for example, creating a new feature that multiplies the number of bathrooms by the total square footage may give the model valuable new information. Interaction features can capture patterns in the data that linear models otherwise wouldn’t see, potentially improving model performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Remove irrelevant features: In a problem where we need to predict the price of a smartphone, the color of the smartphone may have little impact on the prediction and can be dropped. Removing irrelevant features can simplify your model, make it faster, more interpretable, and reduce the risk of overfitting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
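&lt;p&gt;The interaction-feature practice above can be sketched in a couple of lines of pandas (the housing columns and values are hypothetical):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical housing data
df = pd.DataFrame({
    "bathrooms": [1, 2, 3],
    "sqft": [800, 1200, 2000],
})

# Interaction feature: bathrooms x total square footage
df["bath_x_sqft"] = df["bathrooms"] * df["sqft"]

print(df["bath_x_sqft"].tolist())
```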

&lt;p&gt;Feature engineering is not just a pre-processing step in machine learning; it’s a fundamental aspect that can make or break the success of your models. Well-engineered features can lead to more accurate predictions and better generalization: features serve as the foundation upon which machine learning algorithms operate, and representing data effectively enables algorithms to discern meaningful patterns. Aspiring and experienced data scientists, machine learning enthusiasts, and engineers alike must therefore recognize the pivotal role feature engineering plays in extracting meaningful insights from data. By understanding the art of feature engineering and applying it well, one can unlock the true potential of machine learning algorithms and drive impactful solutions across various domains. &lt;/p&gt;

&lt;p&gt;If you have any questions, or any suggestions on how I could improve my article, please leave them in the comment section. Thank you!&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>learning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Byron Morara</dc:creator>
      <pubDate>Tue, 13 Aug 2024 13:06:41 +0000</pubDate>
      <link>https://dev.to/byron_25abe71a080f93df530/understanding-your-data-the-essentials-of-exploratory-data-analysis-24k8</link>
      <guid>https://dev.to/byron_25abe71a080f93df530/understanding-your-data-the-essentials-of-exploratory-data-analysis-24k8</guid>
      <description>&lt;p&gt;Knowing your data is a very important aspect in the data analysis process. You sort of get a chance to get a "feel" of your data and understand it more. This can include knowing the number of rows and columns you have, the datatypes or even the circumstances under which the data was collected or the objective which it seeks to achieve.&lt;/p&gt;

&lt;p&gt;In today's discussion we will talk about the Essentials of exploratory Data Analysis.&lt;/p&gt;

&lt;p&gt;Our first question is:&lt;br&gt;
&lt;strong&gt;What is Exploratory Data Analysis?&lt;/strong&gt;&lt;br&gt;
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis. This involves inspecting the dataset from many angles, describing &amp;amp; summarizing it without making any assumptions about its contents.&lt;/p&gt;

&lt;p&gt;It is a crucial step in data science projects because it allows data scientists to analyze and visualize data to understand its key characteristics, uncover patterns, identify relationships between different variables, and locate &lt;a href="https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/identifying-outliers-iqr-rule" rel="noopener noreferrer"&gt;outliers&lt;/a&gt;. The EDA process is normally performed as a preliminary step before undertaking more formal statistical analyses or modeling.&lt;/p&gt;

&lt;p&gt;Now, we will discuss why EDA is important:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understand data structures:&lt;/strong&gt; EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or prediction techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identifying patterns and relationships:&lt;/strong&gt; Through visualizations and statistical summaries, EDA can reveal hidden patterns and intrinsic relationships between variables. These insights can guide further analysis and enable more effective feature engineering and model building.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Catching anomalies and outliers in the data:&lt;/strong&gt; It helps identify errors or unusual data points that may adversely affect or skew the results of your analysis. Detecting these early can prevent costly mistakes in predictive modeling and analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testing hypotheses:&lt;/strong&gt; When we are doing a certain project, we normally make assumptions about the results of the data, and to verify whether these assumptions are true, we have to perform hypothesis tests on the data. You can read more about the null (H0) and alternative (Ha or H1) hypotheses &lt;a href="https://www.scribbr.com/statistics/null-and-alternative-hypotheses/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Helps inform feature selection and engineering:&lt;/strong&gt; Insights gained from EDA can inform which features are most relevant to include in a model and how to transform them (scaling, encoding) to improve model performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimizing model design:&lt;/strong&gt; By understanding the data’s characteristics, analysts can choose appropriate modeling techniques, decide on the complexity of the model, and better tune model parameters.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Importance of Understanding Your Data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For one to understand data, they first need to understand the characteristics, or elements, of quality data. You need to make sure that your data is accurate, accessible, complete, consistent, valid (has integrity), unique, current, reliable, and relevant to the study or model you're building.&lt;/p&gt;

&lt;p&gt;These characteristics determine data quality. As we already discussed, poor-quality data has dire consequences: it degrades model performance and can lead businesses to make wrong and potentially costly decisions.&lt;/p&gt;

&lt;p&gt;So, it is very important to make sure that the data meets the highest quality standards before being analyzed and used to provide insights for business decision making or training machine learning models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps in EDA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data collection and importing.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
After data is collected, it can be organized and stored in software such as databases. Data is stored and preserved in different formats: &lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Textual data: XML, TXT, HTML, PDF/A (Archival PDF)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tabular data (including spreadsheets): CSV&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Databases: XML, CSV&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Images: TIFF, PNG, JPEG (note: JPEGS are a 'lossy' format which lose information when re-saved, so only use them if you are not concerned about image quality)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Audio: FLAC, WAV, MP3&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Data structure and overview: To get a view of the data in the dataset and its characteristics, we can use Python. For example:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;To get the first five and the last five rows in the dataset we use &lt;code&gt;head()&lt;/code&gt; and &lt;code&gt;tail()&lt;/code&gt;. To get a general description of and information about the dataset we use &lt;code&gt;describe()&lt;/code&gt; and &lt;code&gt;info()&lt;/code&gt; in pandas.&lt;/li&gt;
&lt;/ul&gt;
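&lt;p&gt;A minimal sketch of these overview functions on a made-up dataset:&lt;/p&gt;

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30000, 45000, 60000, 80000, 52000, 41000],
})

print(df.head())      # first five rows
print(df.tail())      # last five rows
print(df.describe())  # summary statistics for numeric columns
df.info()             # column dtypes and non-null counts
```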

&lt;ol start="3"&gt;
&lt;li&gt;Handling missing values: Missing values can potentially affect our analysis, especially when the dataset is small. To identify and handle them we use some of the following tools, to mention a few:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spreadsheets: In the formula box, enter =ISBLANK(A1) (assuming A1 is the first cell of your selected range).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SQL: The COUNT() function along with a WHERE clause can be used to find the number of null values in a column.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python: The following functions are used to find and handle missing values in Python: &lt;a href="https://www.geeksforgeeks.org/python-pandas-isnull-and-notnull/" rel="noopener noreferrer"&gt;isnull()&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/python-pandas-isnull-and-notnull/" rel="noopener noreferrer"&gt;notnull()&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/python-pandas-dataframe-dropna/" rel="noopener noreferrer"&gt;dropna()&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/python-pandas-dataframe-fillna-to-replace-null-values-in-dataframe/" rel="noopener noreferrer"&gt;fillna()&lt;/a&gt;, &lt;a href="https://www.geeksforgeeks.org/python-string-replace/" rel="noopener noreferrer"&gt;replace()&lt;/a&gt;, and &lt;a href="https://www.geeksforgeeks.org/pandas-dataframe-interpolate/" rel="noopener noreferrer"&gt;interpolate()&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
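&lt;p&gt;In pandas, counting and filling missing values can be sketched as follows (the &lt;code&gt;Work_Experience&lt;/code&gt; values are made up):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
df = pd.DataFrame({"Work_Experience": [1.0, np.nan, 4.0, np.nan, 7.0]})

# Count missing values per column
missing = df.isnull().sum()
print(missing["Work_Experience"])  # number of NaNs in the column

# Fill the gaps with the column mean (imputation by mean)
df["Work_Experience"] = df["Work_Experience"].fillna(df["Work_Experience"].mean())
print(df["Work_Experience"].tolist())
```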

&lt;p&gt;&lt;strong&gt;Data visualization and the techniques involved.&lt;/strong&gt;&lt;br&gt;
Data visualization is the practice of translating information into a visual context, such as maps, graphs, plots, charts and many more. This is crucial  process in the data analysis process because it helps your audience understand the data in very simple terms. It helps build a picture  of what the data contains. Visualizations make the storytelling process easy as it is easy explain insights to stakeholders with visual features available.&lt;/p&gt;

&lt;p&gt;The most common visualizations used are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.investopedia.com/terms/l/line-graph.asp#:~:text=A%20line%20graph%E2%80%94also%20known,of%20an%20asset%20or%20security." rel="noopener noreferrer"&gt;Line graphs&lt;/a&gt;: these are used to track changes over a period of time. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.jmp.com/en_nl/statistics-knowledge-portal/exploratory-data-analysis/bar-chart.html" rel="noopener noreferrer"&gt;Bar graphs&lt;/a&gt;: they are used to compare performance over two different groups and track changes over time. They do this by show the frequency counts of values for the different levels of a categorical or nominal variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.investopedia.com/terms/h/histogram.asp#:~:text=A%20histogram%20is%20a%20graphical,into%20logical%20ranges%20or%20bins." rel="noopener noreferrer"&gt;Histograms&lt;/a&gt;: used to show distributions of individual variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.atlassian.com/data/charts/box-plot-complete-guide#:~:text=Box%20plots%20are%20used%20to,skew%2C%20variance%2C%20and%20outliers." rel="noopener noreferrer"&gt;Box plots&lt;/a&gt;: they detect outliers and help data analysts and scientists understand variability in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.atlassian.com/data/charts/what-is-a-scatter-plot#:~:text=What%20is%20a%20scatter%20plot,to%20observe%20relationships%20between%20variables." rel="noopener noreferrer"&gt;Scatter plots&lt;/a&gt;:used to visualize relationships between two variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2024/02/pair-plots-in-machine-learning/#:~:text=A%20pair%20plot%2C%20also%20known,the%20dataset's%20distributions%20and%20correlations." rel="noopener noreferrer"&gt;Pair plots&lt;/a&gt;: to compare multiple variables in the data at once.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/how-to-create-a-seaborn-correlation-heatmap-in-python/" rel="noopener noreferrer"&gt;Correlation Heatmaps&lt;/a&gt;: these are also used to understand the relationships between multiple variables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Statistical summaries.&lt;/strong&gt;&lt;br&gt;
Just as the headline states, they provide a summary of key information about the sample data. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean: Also known as the expected value, the mean is a measure of central tendency of a distribution, along with the median and mode, and usually describes the entire dataset with a single value. To read more on the mean go &lt;a href="https://fastercapital.com/topics/what-is-the-mean-and-why-is-it-important.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Median: This is the value in the middle of the dataset after the values have been sorted in ascending or descending order. It gives an idea of where the center value is, and it is more reliable than the mean when the data is skewed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mode: The most commonly observed value in the data, i.e. the value with the highest frequency. To understand the mode more, visit &lt;a href="https://www.isixsigma.com/dictionary/mode/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standard deviation: This is the measure of how the data varies in relation to the mean. It helps analysts understand the variability of a dataset, identify trends, assess data reliability, detect outliers, compare datasets, and evaluate risk. It is denoted σ. More on standard deviation can be found &lt;a href="https://statisticsbyjim.com/basics/standard-deviation/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Variance: This is a statistical measurement of the spread between numbers in a dataset. It is calculated as the square of the standard deviation and is mostly denoted σ². For more on variance, go &lt;a href="https://www.investopedia.com/terms/v/variance.asp#:~:text=Variance%20is%20a%20statistical%20measurement,by%20this%20symbol%3A%20%CF%832." rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
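&lt;p&gt;All five summaries can be computed with Python's built-in &lt;code&gt;statistics&lt;/code&gt; module; here is a small sketch on a made-up sample:&lt;/p&gt;

```python
import statistics

# Made-up sample data
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))       # mean
print(statistics.median(data))     # median (average of the two middle values here)
print(statistics.mode(data))       # mode (most frequent value)
print(statistics.pstdev(data))     # population standard deviation
print(statistics.pvariance(data))  # population variance (stdev squared)
```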

&lt;p&gt;&lt;strong&gt;Tools and libraries for EDA&lt;/strong&gt;&lt;br&gt;
In Python, the commonly used libraries for EDA are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/introduction-to-pandas-in-python/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/introduction-to-numpy/" rel="noopener noreferrer"&gt;Numpy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/python-introduction-matplotlib/" rel="noopener noreferrer"&gt;Matplotlib&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/introduction-to-seaborn-python/" rel="noopener noreferrer"&gt;Seaborn&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/python-plotly-tutorial/" rel="noopener noreferrer"&gt;Plotly&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Jupyter Notebooks&lt;/strong&gt;&lt;br&gt;
This is an interactive environment for running and saving Python code in a step-by-step manner. It is commonly used in the data space because it provides a flexible environment to work with code and data. For more on Jupyter notebooks click &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common pitfalls in EDA&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Overfitting to patterns: This usually happens when analysts overinterpret the data to a point where they arrive at wrong insights. In machine learning it is a situation where a model learns too much from the training data and fails to generalize well to new or unseen data which can lead to poor performance, inaccurate predictions, and wasted resources. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bias: Most especially &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/8-types-of-bias-in-data-analysis-and-how-to-avoid-them#:~:text=Confirmation%20bias,most%20often%20when%20evaluating%20results." rel="noopener noreferrer"&gt;confirmation bias&lt;/a&gt;, which mostly leads to analysts providing the wrong insights due to premeditated hypotheses. To avoid this, analysts must ensure that they review and interpret the data based only on the findings from the data and not on their own preconceived conclusions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Overall, the EDA process is crucial to the success of data science projects, as it ensures models get the best data to learn from.&lt;/p&gt;

&lt;p&gt;To understand the EDA process step by step with code, check out the following tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.kaggle.com/code/imoore/intro-to-exploratory-data-analysis-eda-in-python" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://deepnote.com/app/code-along-tutorials/A-Beginners-Guide-to-Exploratory-Data-Analysis-with-Python-f536530d-7195-4f68-ab5b-5dca4a4c3579" rel="noopener noreferrer"&gt;Deepnote&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/" rel="noopener noreferrer"&gt;Analytics Vidhya&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For datasets that you can use to practice EDA visit the following websites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://datasetsearch.research.google.com/" rel="noopener noreferrer"&gt;Dataset Search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://data.unhabitat.org/pages/datasets" rel="noopener noreferrer"&gt;UN-Habitat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tableau.com/learn/articles/free-public-data-sets" rel="noopener noreferrer"&gt;Tableau&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember to always document insights gained from EDA as they might be used as reference in future especially during the modelling phase. The EDA process is iterative, so as you proceed with the project, you might need to revisit the process as new insights emerge.&lt;/p&gt;

&lt;p&gt;Thank you for reading, and I would appreciate any feedback on areas I can improve on.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>sql</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
