<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Samuel Kamuli</title>
    <description>The latest articles on DEV Community by Samuel Kamuli (@samkamuli).</description>
    <link>https://dev.to/samkamuli</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1879108%2F95dee523-ddcb-4bfa-bd09-fa283563fc0f.jpeg</url>
      <title>DEV Community: Samuel Kamuli</title>
      <link>https://dev.to/samkamuli</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samkamuli"/>
    <language>en</language>
    <item>
      <title>The ultimate guide to Data Analytics</title>
      <dc:creator>Samuel Kamuli</dc:creator>
      <pubDate>Sun, 25 Aug 2024 18:36:34 +0000</pubDate>
      <link>https://dev.to/samkamuli/the-ultimate-guide-to-data-analytics-j3</link>
      <guid>https://dev.to/samkamuli/the-ultimate-guide-to-data-analytics-j3</guid>
      <description>&lt;p&gt;The field of data analytics is vast and encompasses several careers ranging from software engineering to data science. Though the core goal of data analysis is to uncover underlying patterns and trends, there is a lot that goes into collecting the data, planning database schemas, building virtual infrastructures that will manage the flow of the data, predictive analytics and machine learning among other activities. For each of these roles, we have designated roles for them such as data architect, data engineer, data scientist and data product engineer just to name a few.&lt;/p&gt;

&lt;p&gt;I have chosen the path of a data analyst. Data analysts explore, clean, analyze, visualize, and present information, providing valuable insights for the business. Structured Query Language (SQL) is the tool of choice for accessing the database we are working with. Next, we leverage a programming language like Python to clean and analyze the data, and rely on visualization tools, such as Power BI or Tableau, to present the findings. The essential technical skills a data analyst needs are data visualization, data cleaning, MATLAB, R, Python, SQL, machine learning, linear algebra and calculus, and finally Microsoft Excel. &lt;br&gt;
The key soft skills needed to be a good data analyst are critical thinking and communication.&lt;/p&gt;
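The workflow described above (SQL to pull the data, Python to process it, a visualization tool to present it) can be sketched with Python's built-in sqlite3 module. The table and figures below are hypothetical stand-ins for a production database:

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 150.0)],
)

# SQL pulls and aggregates the data; Python shapes it for presentation.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
totals = dict(rows)  # e.g. ready to hand off to a charting or BI tool
```

In practice the connection string, table, and column names would come from your own environment; only the division of labor between SQL and Python is the point here.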

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Visualization: the ability to present data findings via graphics or other illustrations. The purpose of visualizing data is to make the insights gained from analysis easy to understand. With data visualization, decision makers are able to identify patterns and grasp complex ideas at a glance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Cleaning: during this stage we perform several tasks to ensure the data is accurate, consistent, and ready for analysis. This is a crucial step because data that has not been properly cleaned will compromise the integrity of any insights you generate and impair the accuracy of your models. Some of the tasks performed during data cleaning are handling missing values, removing unnecessary duplicates, standardization, handling outliers, and integration, to name a few.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MATLAB, Python, R, SQL: these are some of the essential languages to master in order to perform basic data tasks such as data mining, data cleaning, and data visualization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine Learning: a branch of artificial intelligence (AI) that focuses on developing algorithms and models that allow computers to learn from data and make predictions or decisions based on it. A growing trend today is the automation of tasks, so boosting your skills as an analyst with a general understanding of AI tools and concepts, machine learning in particular, may give you an edge over competitors during your job search.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Linear Algebra and Calculus: when it comes to data analytics, solid mathematical skills are non-negotiable. Linear algebra has applications in machine and deep learning, where it supports vector, matrix, and tensor operations. Calculus is similarly used to build the objective/cost/loss functions that teach algorithms to achieve their objectives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microsoft Excel: MS Excel is essential for a data analyst to learn because it is a powerful and versatile tool widely used in the industry for data analysis, visualization, and reporting. Aside from its accessibility and familiarity, its pivot tables are among the most useful dynamic features in analysis, allowing an analyst to summarize, analyze, explore, and present data in a flexible way.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
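As a minimal, pure-Python illustration of the cleaning tasks listed in point 2 (the sales figures here are made up), the sketch below fills a missing value with the column mean, drops duplicates, and removes an obvious outlier:

```python
from statistics import mean, median

# Hypothetical monthly sales figures with a missing value (None),
# a duplicate record, and an outlier (900).
raw = [120, 135, None, 135, 128, 900, 120]

# Handle missing values: replace None with the mean of the observed values.
observed = [x for x in raw if x is not None]
filled = [x if x is not None else round(mean(observed), 1) for x in raw]

# Handle duplicates: keep only the first occurrence of each value.
deduped = list(dict.fromkeys(filled))

# Handle outliers: drop points far from the median (a crude threshold rule).
m = median(deduped)
cleaned = [x for x in deduped if not abs(x - m) > 3 * m]
```

Real pipelines would typically use a library such as pandas for these steps, but the logic (impute, deduplicate, trim outliers) is the same.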

&lt;p&gt;The key soft skills play the part of making one a well-rounded analyst. Remember, it is not enough to know how to generate insights and uncover underlying trends and patterns, because at the end of the day you need to be able to explain your findings to others.&lt;/p&gt;

&lt;p&gt;The journey to becoming a data analyst begins with a single step, but as many who have walked that path will tell you, consistency is key. Going the extra mile to learn a new language or explore your data from a different perspective will set you apart from your peers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>FEATURE ENGINEERING: THE ULTIMATE GUIDE</title>
      <dc:creator>Samuel Kamuli</dc:creator>
      <pubDate>Sun, 18 Aug 2024 18:58:26 +0000</pubDate>
      <link>https://dev.to/samkamuli/feature-engineering-the-ultimate-guide-37ln</link>
      <guid>https://dev.to/samkamuli/feature-engineering-the-ultimate-guide-37ln</guid>
      <description>&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
Feature engineering is a preprocessing step in supervised machine learning and statistical modeling which transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. &lt;br&gt;
A feature also known as a variable/attribute can be defined as an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm.&lt;br&gt;
The core purpose of feature engineering is to optimize machine learning model performance by transforming and selecting relevant features.&lt;/p&gt;
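To make the definition concrete, a single data point and its features might look like this in Python (the field names are illustrative, not drawn from any real dataset):

```python
# A data point is a record; each measurable property of it is a feature.
data_point = {"age": 34, "income": 52000.0, "owns_home": True}

# The model's input is simply the ordered numeric encoding of those
# properties (booleans become 0.0/1.0).
feature_names = ["age", "income", "owns_home"]
feature_vector = [float(data_point[name]) for name in feature_names]
```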

&lt;p&gt;&lt;strong&gt;FEATURE ENGINEERING&lt;/strong&gt;&lt;br&gt;
The process of feature engineering involves feature creation, transformation, extraction, selection, exploratory data analysis, and finally benchmarking. Each of these stages is geared towards engineering variables that are most conducive to making a machine learning model accurate. Below is an in-depth look at each of these stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Feature Creation: creating features involves identifying the variables that will be most useful in the predictive model. This is a subjective process that requires human intervention and creativity. Existing features are combined via addition, subtraction, multiplication, and ratios to create new derived features with greater predictive power.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transformation: manipulating the predictor/independent variables to improve model performance, e.g. ensuring the model is flexible in the variety of data it can ingest, putting variables on the same scale to make the model easier to understand, improving accuracy, and avoiding computational errors by ensuring all features are within an acceptable range for the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Extraction: Feature extraction is the automatic creation of new variables by extracting them from raw data. The purpose of this step is to automatically reduce the volume of data into a more manageable set for modeling. Some feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Feature Selection: Feature selection algorithms essentially analyze, judge, and rank various features to determine which features are irrelevant and should be removed, which features are redundant and should be removed, and which features are most useful for the model and should be prioritized.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exploratory Data Analysis: exploratory data analysis (EDA) is a powerful and simple tool that can be used to improve your understanding of your data by exploring its properties. The technique is often applied when the goal is to create new hypotheses or find patterns in the data, and it is often used on large amounts of qualitative or quantitative data that have not been analyzed before.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benchmarking: a benchmark model is the most user-friendly, dependable, transparent, and interpretable model against which you can measure your own. It is a good idea to run test datasets to see whether your new machine learning model outperforms a recognized benchmark. Benchmarks are often used to compare performance between different machine learning models, such as neural networks and support vector machines or linear and non-linear classifiers, or between approaches like bagging and boosting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
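The first two stages above, creation and transformation, can be sketched in a few lines of Python. The records and field names here are hypothetical: a ratio feature is derived from two existing ones, then min-max scaled onto [0, 1]:

```python
# Hypothetical housing records with two raw features.
records = [
    {"area": 50.0, "price": 150000.0},
    {"area": 80.0, "price": 200000.0},
    {"area": 120.0, "price": 360000.0},
]

# Feature creation: combine existing features via a ratio.
for r in records:
    r["price_per_sqm"] = r["price"] / r["area"]

# Transformation: min-max scale the derived feature so it sits on the
# same [0, 1] scale as other model inputs.
vals = [r["price_per_sqm"] for r in records]
lo, hi = min(vals), max(vals)
for r in records:
    r["scaled"] = (r["price_per_sqm"] - lo) / (hi - lo)
```

Libraries such as scikit-learn provide ready-made scalers for the transformation step; the hand-rolled version is shown only to make the arithmetic explicit.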

&lt;p&gt;Some of the best feature engineering tools that can automate the feature engineering process include &lt;strong&gt;FeatureTools&lt;/strong&gt;, &lt;strong&gt;AutoFeat&lt;/strong&gt;, &lt;strong&gt;TsFresh&lt;/strong&gt;, &lt;strong&gt;OneBM&lt;/strong&gt; and &lt;strong&gt;Explorekit&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>featureengineering</category>
    </item>
    <item>
      <title>Understanding Your Data: The Essentials of Exploratory Data Analysis</title>
      <dc:creator>Samuel Kamuli</dc:creator>
      <pubDate>Sat, 10 Aug 2024 23:37:57 +0000</pubDate>
      <link>https://dev.to/samkamuli/understanding-your-data-the-essentials-of-exploratory-data-analysis-35g</link>
      <guid>https://dev.to/samkamuli/understanding-your-data-the-essentials-of-exploratory-data-analysis-35g</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;INTRODUCTION&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data, in the simplest of terms, can be defined as factual information collected together for reference or analysis. Data can be grouped into two broad categories: qualitative/categorical data and quantitative/numerical data. Qualitative data represents information and concepts that cannot be expressed by numbers, whereas quantitative data is data that can be represented numerically, i.e. anything that can be counted or measured.&lt;br&gt;
Exploratory data analysis refers to an analytical approach used to examine datasets in order to test hypotheses, summarize their general characteristics, uncover underlying patterns, and spot anomalies.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;ESSENTIALS OF EDA&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understanding your Data Structure&lt;/strong&gt;&lt;br&gt;
Before you delve into your data and begin crunching the numbers, it is essential to first get a good grasp of your dataset. Know what data types are in your dataset, be they 'date', 'datetime', 'boolean', 'string', 'integer', 'floating point number', etc.&lt;br&gt;
It is also important to know whether your dataset falls under categorical or numerical data and the sub-categories found therein. Understanding your dataset will guide you in knowing which type of EDA to perform, whether it is multivariate non-graphical, univariate non-graphical or univariate graphical EDA.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cleaning your Dataset&lt;/strong&gt;&lt;br&gt;
When your dataset is first loaded into the coding environment of your choice, the most crucial step is to clean it before analysis begins, as a 'dirty' dataset is compromised and will affect the accuracy of your analysis. Some of the key steps in this stage include:&lt;br&gt;
&lt;strong&gt;checking for null values&lt;/strong&gt;: once you have identified any null values in your dataset, you can replace them using the mean, median or mode of that column. In instances where a column has too many null values, you can drop the entire column.&lt;br&gt;
&lt;strong&gt;checking for outliers&lt;/strong&gt;: outliers are data points that significantly deviate from the norm of your dataset. They can impact your data visualization, distort your summary statistics and negatively affect your models.&lt;br&gt;
&lt;strong&gt;identifying duplicate data&lt;/strong&gt;: duplicate data is another factor that affects the integrity of your data and the accuracy of your analysis. The most common practice when dealing with duplicate data is to drop the duplicates.&lt;br&gt;
The final stage of data cleaning is to ensure data uniformity in your columns: make sure none of your columns contains two or more distinct data types simultaneously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visualize your Dataset&lt;/strong&gt;&lt;br&gt;
Once you have cleaned up your original dataset, you can visualize what remains. Depending on the number and type of variables, you can choose an appropriate means of visualization. For instance, you can use correlation matrices or scatter plots to visualize data with two or more variables, bar graphs or pie charts to visualize categorical data, and box plots to visualize data with one variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Perform analyses on your variables&lt;/strong&gt;&lt;br&gt;
This step helps us gain insight into the distribution of, and correlation between, our variables. Once again, the technique of analysis varies depending on the number of variables and their data types. Once we have analyzed our variables, we can identify the relationships between them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identifying data Patterns&lt;/strong&gt;&lt;br&gt;
This step allows us to observe the behavior of our variables, both independent and dependent, and in the long term to make predictions from them. It is a major step because it is a core reason why EDA is performed in the first place.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
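The first steps above (inspecting structure, checking for nulls and duplicates, summarizing distributions) can be sketched with the standard library alone; the tiny dataset here is hypothetical, with None marking a missing value:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical dataset: rows of (category, value).
rows = [("a", 10), ("b", 12), ("a", 10), ("c", None), ("b", 14)]

# Understand the structure: how many rows, which columns.
n_rows = len(rows)

# Check for nulls and exact-duplicate rows (step 2).
n_null = sum(1 for _, v in rows if v is None)
n_dup = n_rows - len(set(rows))

# Univariate summaries (step 4): distribution of the categorical column,
# central tendency of the numeric one.
cat_counts = Counter(c for c, _ in rows)
nums = [v for _, v in rows if v is not None]
summary = {"mean": mean(nums), "median": median(nums)}
```

In day-to-day work the same checks are usually one-liners in pandas (`df.info()`, `df.isnull().sum()`, `df.duplicated().sum()`, `df.describe()`); the point here is what each check computes.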

&lt;p&gt;The final step of EDA is documentation and reporting, as you will need to present your findings in an easy-to-understand manner. After all, the whole point of data analysis is to make sense of facts and figures.&lt;br&gt;
Some of the tools necessary for EDA are Python, R and, in some cases, even SQL.&lt;/p&gt;

</description>
      <category>eventdriven</category>
      <category>datatypes</category>
    </item>
    <item>
      <title>Data Analysis. The Ultimate Guide to Data Analytics: Techniques and Tools</title>
      <dc:creator>Samuel Kamuli</dc:creator>
      <pubDate>Sun, 04 Aug 2024 13:45:38 +0000</pubDate>
      <link>https://dev.to/samkamuli/data-analysis-the-ultimate-guide-to-data-analytics-techniques-and-tools-4o63</link>
      <guid>https://dev.to/samkamuli/data-analysis-the-ultimate-guide-to-data-analytics-techniques-and-tools-4o63</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;u&gt;INTRODUCTION&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
Our world in its entirety can be broken down into numbers. Nearly every aspect of our lives can be subjected to either qualitative or quantitative analysis, and it is for this reason that data analysis is a crucial skill to have. In the simplest of terms, data analysis can be described as the practice of working with data to derive meaningful information from it. It is a useful skill to have, not just for career reasons but for personal ones as well. Think of it this way: the ability to analyze your monthly needs and allocate funds to meet them is an everyday example of data analysis that is often overlooked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;DATA ANALYSIS TECHNIQUES&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
Data analysis techniques can be broadly grouped into quantitative and qualitative methods. Quantitative data is based on numbers and information that can be counted, while qualitative data is based on interpretation. It therefore follows that the two methods use different techniques to analyze data. Some of the data analysis techniques that fall under quantitative methods are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inferential Statistics&lt;/strong&gt;; this technique enables an analyst to draw conclusions and make predictions about a population based on a sample of the same dataset, using t-tests, regression analysis and hypothesis testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Mining&lt;/strong&gt;; Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve problems using algorithms. Data mining techniques and tools help enterprises to predict future trends and make more informed business decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Descriptive Statistics&lt;/strong&gt;; a descriptive statistic is a summary statistic that captures features of a collection of information. Datasets can be summarized using the mean, median, mode and percentages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experimental design&lt;/strong&gt;; the process of carrying out research in an objective and controlled fashion so that precision is maximized. The major intention of experimental design is to determine the causal relationship between variables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
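The contrast between the first and third techniques above can be shown with Python's standard statistics module. The sample below is made up; descriptive statistics summarize the sample itself, while the rough interval at the end is an inferential estimate about the wider population:

```python
from math import sqrt
from statistics import mean, median, mode, stdev

# Hypothetical sample of daily orders drawn from a larger population.
sample = [23, 25, 22, 25, 30, 24, 26, 25]

# Descriptive statistics: summarize the sample we actually have.
desc = {"mean": mean(sample), "median": median(sample), "mode": mode(sample)}

# Inferential statistics: estimate the population mean with a rough
# 95% interval (sample mean plus or minus about 2 standard errors).
se = stdev(sample) / sqrt(len(sample))
interval = (desc["mean"] - 2 * se, desc["mean"] + 2 * se)
```

A proper analysis would use a t-distribution for small samples (e.g. scipy.stats); the 2-standard-error rule is only a back-of-the-envelope approximation.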

&lt;p&gt;For qualitative data, some of the most commonly used methods of analysis include;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Content analysis&lt;/strong&gt;; this is a research tool used to determine the presence of certain words, themes, or concepts by studying documents and communication artifacts, which may be in various formats such as picture, audio or video.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thematic analysis&lt;/strong&gt;; thematic analysis technique identifies recurring themes or patterns in qualitative data by categorizing the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Narrative analysis&lt;/strong&gt;; Examines stories or narratives to understand experiences, perspectives, and meanings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grounded Theory&lt;/strong&gt;; Develops theories or frameworks based on systematically collected and analyzed data, allowing theory development to be guided by the analysis process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the above in mind, we now need to be familiar with the software solutions designed to handle the various aspects of data evaluation, including collection, processing and analysis; in layman's terms, analytics tools. Some of the most prominent data analysis tools are SQL, Python, Excel and Tableau, to name a few. It is imperative for a data analyst to be acquainted with as many tools as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONCLUSION&lt;/strong&gt;&lt;br&gt;
Data analysis tools and techniques are broad in their applications, and this calls for anyone who intends to be an analyst to constantly train and practice whichever technique they want to specialize in.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
