<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elishiba Muigo</title>
    <description>The latest articles on DEV Community by Elishiba Muigo (@elimuigo).</description>
    <link>https://dev.to/elimuigo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1172190%2Faf02caca-8347-437d-88cf-587e1d138b6b.jpeg</url>
      <title>DEV Community: Elishiba Muigo</title>
      <link>https://dev.to/elimuigo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elimuigo"/>
    <language>en</language>
    <item>
      <title>HANDLING THE SAFE UPDATE MODE IN MYSQL</title>
      <dc:creator>Elishiba Muigo</dc:creator>
      <pubDate>Thu, 12 Jun 2025 16:18:17 +0000</pubDate>
      <link>https://dev.to/elimuigo/handling-the-safe-update-mode-in-mysql-249o</link>
      <guid>https://dev.to/elimuigo/handling-the-safe-update-mode-in-mysql-249o</guid>
      <description>&lt;p&gt;*&lt;em&gt;Safe Update Mode is a feature designed to prevent unintentional data loss or corruption during update and delete operations. It restricts the execution of UPDATE and DELETE statements that don't include a WHERE clause that uses a key column, thus preventing accidental modifications to entire tables. *&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;The Default (Safe) Mode&lt;/u&gt;
&lt;/h3&gt;

&lt;p&gt;When SQL_SAFE_UPDATES is set to 1 (the default setting in many MySQL client tools, such as MySQL Workbench), the database system will block UPDATE or DELETE statements that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not use a WHERE clause at all: for example, &lt;code&gt;UPDATE my_table SET column1 = 'new_value';&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Do not use a key column in the WHERE clause: for example, if id is your primary key, &lt;code&gt;UPDATE my_table SET column1 = 'new_value' WHERE some_other_column = 'value';&lt;/code&gt; would be blocked unless some_other_column is indexed or part of a unique key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SET SQL_SAFE_UPDATES = 1;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DELETE FROM database_name.table_name&lt;br&gt;
WHERE condition;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;u&gt;The Unsafe / Override Mode&lt;/u&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When SQL_SAFE_UPDATES is set to 0, the database system will not prevent UPDATE or DELETE statements that lack a WHERE clause or don't use a key in the WHERE clause.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is typically used when you intentionally want to perform a mass update or delete operation without specifying a key column in the WHERE clause, or if you are sure about your WHERE clause, even if it doesn't involve a key. It gives you more flexibility but also removes the safety net.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SET SQL_SAFE_UPDATES = 0;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DELETE FROM database_name.table_name&lt;br&gt;
WHERE condition;&lt;/code&gt;&lt;/p&gt;
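&lt;p&gt;Putting the two modes together: a common pattern, assuming you genuinely intend a mass operation, is to disable safe mode for the session, run the statement, and restore it immediately afterwards. The table and column names below are placeholders:&lt;/p&gt;

```sql
-- Temporarily disable safe update mode for this session only
SET SQL_SAFE_UPDATES = 0;

-- Intentional mass operation (placeholder names)
DELETE FROM database_name.table_name
WHERE some_other_column = 'value';

-- Restore the safety net right away
SET SQL_SAFE_UPDATES = 1;
```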

</description>
      <category>mysql</category>
      <category>sql</category>
      <category>sqlserver</category>
      <category>database</category>
    </item>
    <item>
      <title>A Step-by-Step Roadmap to Data Engineering</title>
      <dc:creator>Elishiba Muigo</dc:creator>
      <pubDate>Thu, 09 Nov 2023 20:18:43 +0000</pubDate>
      <link>https://dev.to/elimuigo/a-step-by-step-roadmap-to-data-engineering-2f9o</link>
      <guid>https://dev.to/elimuigo/a-step-by-step-roadmap-to-data-engineering-2f9o</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. &lt;br&gt;
Data engineers work in a variety of settings to build systems that collect, manage and convert raw data into usable information for data scientists and business analysts to interpret. They make data accessible so that organizations can use it to evaluate and optimize performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Basic programming languages
&lt;/h3&gt;

&lt;p&gt;- Knowledge of Python programming and Scala.&lt;br&gt;
- Knowledge of SQL and database programming. Know how to design and implement data models and databases using tools such as PostgreSQL and MySQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Learn Linux commands and Git
&lt;/h3&gt;

&lt;p&gt;- Knowledge of version control using Git, which allows you to manage changes to your work, along with basic Linux commands. &lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data integration, ETL and ELT
&lt;/h3&gt;

&lt;p&gt;Data integration is the process of combining data from multiple sources into a cohesive view. It involves gathering data from various sources, cleaning and transforming the data to make it consistent and compatible, and then storing it.&lt;/p&gt;

&lt;p&gt;ETL (Extract, Transform and Load) is a process in data warehousing and business intelligence that involves extracting data from various sources, transforming it into a format suitable for analysis and reporting, and then loading it into a data warehouse or other data repository.&lt;br&gt;
- Gain experience with ETL tools like Apache Kafka and Talend.&lt;/p&gt;

&lt;p&gt;ELT (Extract, Load and Transform) is a process that involves moving raw data from a source system to a destination such as a data warehouse, where the transformation happens after loading.&lt;/p&gt;

&lt;p&gt;Learn about ELT and ETL and when each method is the better choice.&lt;/p&gt;
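&lt;p&gt;The three ETL stages can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: SQLite stands in for the warehouse, and the records and names are invented.&lt;/p&gt;

```python
import sqlite3

# Extract: in a real pipeline this would read from an API, a CSV file
# or a source database; here an in-memory list stands in.
def extract():
    return [
        {"name": "alice", "signup": "2023-01-05", "spend": "120.50"},
        {"name": "BOB", "signup": "2023-02-11", "spend": "80.00"},
    ]

# Transform: clean and normalize the raw records.
def transform(rows):
    return [
        (r["name"].strip().title(), r["signup"], float(r["spend"]))
        for r in rows
    ]

# Load: write the cleaned records into a warehouse table (SQLite here).
def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (name TEXT, signup TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(conn):
    load(transform(extract()), conn)

conn = sqlite3.connect(":memory:")
run_etl(conn)
```

&lt;p&gt;In an ELT pipeline the same pieces would be reordered: the raw records would be loaded first, and the transformation would then run inside the warehouse, typically as SQL.&lt;/p&gt;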

&lt;h3&gt;
  
  
  4. Data storage and warehousing
&lt;/h3&gt;

&lt;p&gt;Data warehousing enables organizations to store, organize, and analyze data from various sources in a centralized location, providing a more complete view of the organization's data. &lt;/p&gt;

&lt;p&gt;- Learn data modelling and schema design for data warehouses.&lt;br&gt;
- Learn data warehousing concepts and tools like Snowflake and Amazon Redshift.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Data Pipelines
&lt;/h3&gt;

&lt;p&gt;A method in which raw data is ingested from various data sources and then moved to a data store, like a data lake or a data warehouse, for analysis. Data can be sourced from APIs, SQL and NoSQL databases.&lt;/p&gt;

&lt;p&gt;- Learn about batch processing and streaming data.&lt;br&gt;
- Learn data pipeline architecture.&lt;br&gt;
- Understand workflow management tools such as Apache Airflow, AWS and Azure Data Factory.&lt;br&gt;
- Familiarize yourself with containerization technologies such as Docker and Kubernetes for managing and deploying data pipelines.&lt;/p&gt;
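&lt;p&gt;The batch-versus-streaming distinction can be illustrated in plain Python: a batch job processes a complete, bounded dataset at once, while a streaming job updates its state record by record as events arrive. The event values here are invented.&lt;/p&gt;

```python
# Batch processing: the whole bounded dataset is available up front,
# so one pass produces one final answer.
def batch_total(events):
    return sum(events)

# Stream processing: records arrive one at a time and state is updated
# incrementally; a generator stands in for an unbounded event stream.
def stream_totals(event_stream):
    running = 0
    for value in event_stream:
        running += value
        yield running  # emit an up-to-date result after every event

events = [3, 1, 4, 1, 5]
batch_result = batch_total(events)                  # one answer, after all data
stream_results = list(stream_totals(iter(events)))  # an answer per event
```

&lt;p&gt;After the last event, the streaming result converges to the batch result; the difference is when the answers become available.&lt;/p&gt;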

&lt;h3&gt;
  
  
  6. Cloud Computing
&lt;/h3&gt;

&lt;p&gt;Cloud computing platforms offer advantages to data engineers, including scalable infrastructure and a range of tools for data processing and analysis.&lt;br&gt;
- Platforms like AWS, Google Cloud Platform and Microsoft Azure.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Data governance and security
&lt;/h3&gt;

&lt;p&gt;Data governance ensures data is managed in accordance with legal and regulatory requirements. Another important function of data governance is helping protect data from unauthorized access, theft, and misuse. This is critical for data engineers, who are in charge of designing and maintaining secure data systems. &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A COMPLETE GUIDE TO TIME SERIES MODELLING</title>
      <dc:creator>Elishiba Muigo</dc:creator>
      <pubDate>Mon, 30 Oct 2023 21:01:29 +0000</pubDate>
      <link>https://dev.to/elimuigo/a-complete-guide-to-time-series-modelling-1d9m</link>
      <guid>https://dev.to/elimuigo/a-complete-guide-to-time-series-modelling-1d9m</guid>
      <description>&lt;h3&gt;
  
  
  What is a Time Series Model?
&lt;/h3&gt;

&lt;p&gt;A time series is an ordered sequence of values of a variable at equally spaced time intervals. Accurately predicting patterns and trends in time-dependent data can offer valuable insights in fields such as climate analysis, stock market analysis and economics.&lt;br&gt;
&lt;b&gt;Time series modeling&lt;/b&gt; is a statistical and mathematical technique used to analyze and make predictions about data points collected and recorded over a series of time intervals.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;u&gt;Characteristics of Time Series Model&lt;/u&gt;:&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;1.&lt;u&gt;Autocorrelation&lt;/u&gt;&lt;/b&gt;-The degree of similarity between a given time series and a lagged version of itself over successive time intervals; it measures the relationship between a variable's current value and its past values. Autocorrelation helps in model selection and diagnostics. &lt;/p&gt;
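&lt;p&gt;Autocorrelation is straightforward to compute from first principles. A minimal sketch in plain Python (the example series is invented):&lt;/p&gt;

```python
# Lag-k autocorrelation: correlation between the series and a copy of
# itself shifted k steps back, computed from first principles.
def autocorrelation(series, lag):
    n = len(series)
    mean = sum(series) / n
    # Denominator: total variation of the series around its mean.
    var = sum((x - mean) ** 2 for x in series)
    # Numerator: covariance between the value at time t and at time t - lag.
    cov = sum(
        (series[t] - mean) * (series[t - lag] - mean)
        for t in range(lag, n)
    )
    return cov / var

# A steadily increasing series is positively correlated with its lagged self.
trend = [1, 2, 3, 4, 5, 6, 7, 8]
r1 = autocorrelation(trend, 1)
```

&lt;p&gt;Values near 1 indicate the series strongly resembles its lagged self; values near 0 mean past values carry little information about the present.&lt;/p&gt;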

&lt;p&gt;&lt;b&gt;2.&lt;u&gt;Seasonality&lt;/u&gt;&lt;/b&gt;-A characteristic of time series in which the data experiences regular, predictable changes that recur at fixed intervals, such as every year. Many time series exhibit seasonality: recurring patterns or cycles that occur at regular intervals.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;3.&lt;u&gt;Stationarity&lt;/u&gt;&lt;/b&gt;-A fundamental assumption in time series analysis. A stationary time series' mean, variance, and autocorrelation remain constant over time.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;u&gt;Time Series Analysis:&lt;/u&gt;&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;1.&lt;u&gt;Segmentation&lt;/u&gt;&lt;/b&gt;- splits the data into segments to reveal the underlying properties of the source information.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;2.&lt;u&gt;Explanative Analysis&lt;/u&gt;&lt;/b&gt;-Attempts to understand the data and the relationship between its causes and effects.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;3.&lt;u&gt;Classification&lt;/u&gt;&lt;/b&gt;-Identifies and assigns categories to the data.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;4.&lt;u&gt;Forecasting&lt;/u&gt;&lt;/b&gt;-Predicts future data. Time series models are used for forecasting future values of the series. Common techniques for forecasting include autoregressive (AR) models, moving average (MA) models, and their combinations in autoregressive integrated moving average (ARIMA) models.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;5.&lt;u&gt;Curve Fitting&lt;/u&gt;&lt;/b&gt;-Plots the data along a curve to study the relationships of variables within the data. &lt;/p&gt;

&lt;p&gt;&lt;b&gt;6.&lt;u&gt;Descriptive Analysis&lt;/u&gt;&lt;/b&gt;-Identifies patterns in the time series data such as trends, seasonal variations and cycles.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;7.&lt;u&gt;Exploratory analysis&lt;/u&gt;&lt;/b&gt;- Highlights the main characteristics of the time series data, usually in a visual format.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;u&gt;Models of Time Series:&lt;/u&gt;&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;u&gt;ARIMA (Autoregressive Integrated Moving Average)&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
-A time series forecasting model used for analyzing and forecasting time-dependent data. It combines three key components: autoregression (AR), differencing (I for Integrated), and moving averages (MA). ARIMA models can be applied in cases where the data show non-stationarity in the mean, since differencing removes the trend.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;&lt;u&gt;Parts of ARIMA&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
&lt;b&gt;AR (Autoregressive):&lt;/b&gt; refers to the number of previous values to consider for the forecast. Described by the parameter "p". The autoregressive terms are the lags of the stationary series in the estimation equation.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;I (Integrated):&lt;/b&gt; differencing of the time series data. Characterized by "d", the number of differencing passes applied with the objective of achieving a stationary time series. Integrated means the data values are replaced with the differences between their own values and previous values to make the series stable.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;MA (Moving Average):&lt;/b&gt; a linear combination of past error values instead of previous values of the variable of interest. Described by the parameter "q", which refers to the number of lags of the forecast errors.&lt;/p&gt;
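&lt;p&gt;The effect of the d (Integrated) component can be shown in plain Python: one round of differencing turns a steadily trending, non-stationary series into a constant one. The numbers are invented.&lt;/p&gt;

```python
# First-order differencing: replace each value with its change from the
# previous value. This is the "I" step of ARIMA, applied d times.
def difference(series):
    return [series[t] - series[t - 1] for t in range(1, len(series))]

# A series with a steady upward trend is non-stationary (its mean keeps
# rising)...
trend = [10, 12, 14, 16, 18, 20]

# ...but one round of differencing (d = 1) yields a constant series,
# whose mean no longer changes over time.
diffed = difference(trend)
```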

&lt;p&gt;&lt;b&gt;&lt;u&gt;SARIMA MODEL (Seasonal Autoregressive Integrated Moving Average)&lt;/u&gt;&lt;/b&gt;&lt;br&gt;
It's designed to handle time series data with seasonal patterns. SARIMA models are widely used for time series forecasting and analysis, particularly when the data exhibit recurring patterns at regular intervals, such as daily, monthly, or yearly seasonality.&lt;/p&gt;

&lt;p&gt;When there's seasonality in the series, use SARIMA instead of ARIMA. When there is only one variable in the data, SARIMA is a suitable model, as it supports univariate time series data.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A SUMMARY OF EXPLORATORY DATA ANALYSIS</title>
      <dc:creator>Elishiba Muigo</dc:creator>
      <pubDate>Thu, 12 Oct 2023 07:43:10 +0000</pubDate>
      <link>https://dev.to/elimuigo/a-summary-of-explatory-data-analysis-5ci4</link>
      <guid>https://dev.to/elimuigo/a-summary-of-explatory-data-analysis-5ci4</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;u&gt;The Significance of EDA&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EDA is an approach to analyzing datasets that involves using various tools and techniques to examine and understand data. EDA helps analysts gain insights into the data, identify relationships, detect outliers and prepare the data for further analysis or modelling. It is the process of visually and statistically summarizing data to discover its underlying structure, distribution and relationships between variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data Collection-&lt;/strong&gt; gather data relevant to the study or problem. This could be collected through various sources: databases, spreadsheets, APIs and web scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Cleaning-&lt;/strong&gt; this step involves removing or handling missing values, which would otherwise lead to incorrect insights. Deal with outliers that might skew the analysis, and address duplicates and inconsistent values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Wrangling-&lt;/strong&gt; this is converting raw data into a usable form. It involves merging multiple data sources into a single dataset for analysis. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Summary Statistics-&lt;/strong&gt; calculate and visualize basic summary statistics like the mean, median, standard deviation and quartiles for numerical variables.&lt;/p&gt;
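&lt;p&gt;Python's standard library alone can produce this summary. A small sketch using invented values:&lt;/p&gt;

```python
import statistics

# Invented numerical sample standing in for a dataset column.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    # Population standard deviation; use stdev() for the sample version.
    "std_dev": statistics.pstdev(values),
    # Quartiles split the sorted data into four equal parts.
    "quartiles": statistics.quantiles(values, n=4),
}
```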

&lt;p&gt;&lt;strong&gt;5. Data Visualization-&lt;/strong&gt; graphical representation of data to enhance understanding of the patterns, trends and insights within the data. Common charts include: &lt;br&gt;
        -&lt;em&gt;&lt;b&gt;Bar Charts&lt;/b&gt;&lt;/em&gt;: To show the frequency of categorical variables.&lt;br&gt;
        -&lt;em&gt;&lt;b&gt;Scatter plots&lt;/b&gt;&lt;/em&gt;: To explore relationships between numerical variables.&lt;br&gt;
        -&lt;em&gt;&lt;b&gt;Histograms&lt;/b&gt;&lt;/em&gt;: For visualizing the distribution of single variables.&lt;br&gt;
        -&lt;em&gt;&lt;b&gt;Heatmaps&lt;/b&gt;&lt;/em&gt;: To visualize correlations between variables.&lt;br&gt;
        -&lt;em&gt;&lt;b&gt;Box Plots&lt;/b&gt;&lt;/em&gt;: To display the distribution of a dataset, including the median, quartiles and potential outliers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Correlation Analysis-&lt;/strong&gt; calculate correlation coefficients (e.g. Pearson, Spearman) to understand the relationships between numerical variables, and then visualize correlations using matrices or heatmaps.&lt;/p&gt;
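&lt;p&gt;The Pearson coefficient can be computed from first principles in a few lines of Python; the example data is invented and chosen to be perfectly linear:&lt;/p&gt;

```python
import math

# Pearson correlation coefficient: covariance of the two variables
# divided by the product of their standard deviations, giving a value
# between -1 and 1.
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented example: a perfectly linear relationship gives r = 1.0.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
r = pearson(hours, scores)
```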

&lt;p&gt;&lt;strong&gt;7.Feature Engineering-&lt;/strong&gt;EDA can lead to the selection or creation of relevant features for predictive modeling. By exploring relationships between features and the target variable, data scientists can identify the most informative variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8.Model Assumptions-&lt;/strong&gt; Understanding data distributions and relationships helps in selecting appropriate modeling techniques and verifying model assumptions. For instance, linear regression assumes a linear relationship between variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Methods of EDA&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Time Series Analysis-&lt;/strong&gt; examines the temporal aspects of data, including trends, seasonality, and autocorrelation, through techniques like time series plots and autocorrelation functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Univariate Analysis-&lt;/strong&gt; focuses on a single variable at a time, summarizing its central tendencies, spread, and distribution using measures like mean, median, standard deviation, and visualizations such as histograms and box plots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Bivariate Analysis-&lt;/strong&gt; explores the relationships between two variables. Scatter plots, correlation coefficients, and contingency tables are useful tools in this context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Multivariate Analysis-&lt;/strong&gt; involves studying the relationships between multiple variables simultaneously. Techniques like principal component analysis (PCA) or clustering methods can be employed for dimensionality reduction and pattern discovery.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>ROADMAP TO BECOMING A DATA SCIENTIST</title>
      <dc:creator>Elishiba Muigo</dc:creator>
      <pubDate>Sat, 30 Sep 2023 01:12:14 +0000</pubDate>
      <link>https://dev.to/elimuigo/roadmap-to-becoming-a-data-scientist-4dph</link>
      <guid>https://dev.to/elimuigo/roadmap-to-becoming-a-data-scientist-4dph</guid>
      <description>

&lt;p&gt;&lt;strong&gt;-Learn the fundamentals of Python&lt;/strong&gt; (control structures, data types, syntax, functions, Object Oriented Programming and data structures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Build a core of statistics and statistical models:&lt;/strong&gt; (hypothesis testing, regression analysis, calculus, probability and linear algebra) to help you in drawing insights and making informed decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Data collection/scraping:&lt;/strong&gt; learn to collect data from various sources, including databases, web scraping (using Scrapy or Beautiful Soup) and APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Data cleaning:&lt;/strong&gt; learn to clean data in order to get rid of anomalies; this can be done using Python libraries (NumPy, pandas). Learn how to handle missing values and inconsistencies. &lt;/p&gt;
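&lt;p&gt;The same cleaning steps can be sketched in plain Python before reaching for pandas. The records below are invented:&lt;/p&gt;

```python
# Invented raw records with the usual problems: a missing value, a
# duplicate row and inconsistent capitalization.
raw = [
    {"city": "Nairobi", "temp": 25.0},
    {"city": "nairobi", "temp": 25.0},   # duplicate after normalization
    {"city": "Mombasa", "temp": None},   # missing value
    {"city": "Kisumu", "temp": 27.5},
]

def clean(records):
    seen = set()
    result = []
    for r in records:
        if r["temp"] is None:        # drop rows with missing values
            continue
        key = (r["city"].lower(), r["temp"])
        if key in seen:              # drop duplicates
            continue
        seen.add(key)
        # Normalize inconsistent capitalization.
        result.append({"city": r["city"].title(), "temp": r["temp"]})
    return result

cleaned = clean(raw)
```

&lt;p&gt;Libraries like pandas bundle these operations into methods such as dropping missing values and removing duplicates, but the underlying logic is the same.&lt;/p&gt;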

&lt;p&gt;&lt;strong&gt;-Learn the fundamentals of SQL&lt;/strong&gt; (aggregation functions, joins, nested queries,&lt;br&gt;
CREATE TABLE, GROUP BY, ALTER, INSERT, DELETE, DROP TABLE, ORDER BY, UPDATE). Gain proficiency in database management and working with large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Data visualization:&lt;/strong&gt; visualize data through charts and graphs in order to understand trends and variations and derive meaningful insights from the data. Gain familiarity with tools like Tableau, Power BI, and Python packages (Matplotlib, Seaborn).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Become proficient in Machine Learning:&lt;/strong&gt; be well versed in different machine learning models and when they are used based on the problem and data characteristics (Linear Regression, Naive Bayes, Reinforcement Learning, Random Forest, Decision Trees, Neural Networks). This helps in training models that achieve high accuracy and precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Learn Git and version control.&lt;/strong&gt; This helps in collaborating with others on projects, and Git tracks the changes you make and saves them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;-Learn soft skills, including communication&lt;/strong&gt;. Learn how to tell stories with data for both technical and non-technical audiences. Be able to summarize findings in a clear and understandable manner and provide recommendations based on those insights.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
