<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Awasume Marylin</title>
    <description>The latest articles on DEV Community by Awasume Marylin (@awasume_marylin_8).</description>
    <link>https://dev.to/awasume_marylin_8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2733886%2F4a8c5f7b-c683-4314-b19e-dd9134be10c0.png</url>
      <title>DEV Community: Awasume Marylin</title>
      <link>https://dev.to/awasume_marylin_8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/awasume_marylin_8"/>
    <language>en</language>
    <item>
      <title>Presentation Tips: What to know about data analysis presentations, Phase 1.</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Thu, 20 Mar 2025 17:43:39 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/presentations-tips-what-to-know-on-data-analysis-presentation-phase-1-517e</link>
      <guid>https://dev.to/awasume_marylin_8/presentations-tips-what-to-know-on-data-analysis-presentation-phase-1-517e</guid>
      <description>&lt;p&gt;I learnt about DATA STORYTELLING.&lt;br&gt;
 It is communicating the meaning of a dataset with visuals and a narrative that are customized for each particular audience. There are 3 data story steps.&lt;br&gt;
 💠 Engage your audience &lt;br&gt;
 💠 Create compelling visuals&lt;br&gt;
 💠 Tell the story in an interesting narrative.&lt;br&gt;
 SPOTLIGHTING is scanning through data to quickly identify the most important insights.&lt;br&gt;
 A DASHBOARD is a tool that organizes information from multiple datasets into one central location for tracking, analysis and simple visualization through tables, charts and graphs.&lt;br&gt;
 LIVE AND STATIC INSIGHTS&lt;br&gt;
 Whether data is live or static depends on certain factors:&lt;br&gt;
 - How old is the data?&lt;br&gt;
 - How long until the insights are stable or no longer valid for making decisions?&lt;br&gt;
 - Does this data or analysis need updating on a regular basis to remain valuable?&lt;br&gt;
 &lt;br&gt;
 ➡️ STATIC DATA: Involves providing screenshots or snapshots in presentations or building dashboards using snapshots of data. There are PROS and CONS to static data.&lt;br&gt;
 ⭕ PROS &lt;br&gt;
 - Can tightly control a point-in-time narrative of the data and insights. &lt;br&gt;
 - Allows for complex analysis to be explained in depth to a large audience &lt;br&gt;
 ⭕ CONS&lt;br&gt;
 - Insights immediately begin to lose value, and continue to do so the longer the data remains in a static state.&lt;br&gt;
 - Snapshots can't keep up with the pace of data change.&lt;br&gt;
 ➡️ LIVE DATA: Means that you can build dashboards, reports and views connected to automatically updated data.&lt;br&gt;
 ⭕ PROS &lt;br&gt;
 - Dashboards can be built to be more dynamic and scalable.&lt;br&gt;
 - Gives the most up-to-date data to the people who need it at the time when they need it.&lt;br&gt;
 - Allows for up-to-date curated views into data, with the ability to build a scalable "single source of truth" for various use cases.&lt;br&gt;
 - Allows for immediate action to be taken on data that changes frequently.&lt;br&gt;
 - Alleviates time and resources spent on processes for every analysis.&lt;br&gt;
 ⭕ CONS&lt;br&gt;
 - Can take engineering resources to keep pipelines live and scalable, which may be outside the scope of some companies' data resource allocation.&lt;br&gt;
 - Without the ability to interpret data, you can lose control of the narrative, which can cause data chaos (i.e. teams coming to conflicting conclusions based on the same data).&lt;br&gt;
 - Can potentially cause a lack of trust if the data isn't handled properly.&lt;br&gt;
I also learnt about how to SHARE DATA which is all about DATA PRESENTATION or SLIDE PRESENTATION.&lt;br&gt;
 When exploring a slide presentation, use your knowledge of effective presentation practices to evaluate it. This includes REVIEWING YOUR OWN WORK. When checking over slide presentations, there are some best practices you can check for.&lt;br&gt;
 💠 Include title, subtitle, and date. Making sure that your deck has a title, subtitle and date ensures that your audience knows exactly what you are presenting and when the information is from.&lt;br&gt;
 💠Use a logical sequence of slides&lt;br&gt;
 💠 Provide an agenda with a timeline &lt;br&gt;
 💠 Limit the amount of text on slides &lt;br&gt;
 💠 Start with business task&lt;br&gt;
 💠 Establish the initial hypothesis &lt;br&gt;
 💠 Show what business metrics you used&lt;br&gt;
 💠 Use visualizations&lt;br&gt;
 💠Introduce the graphic by name &lt;br&gt;
 💠 Provide a title for each graph&lt;br&gt;
 💠 Go from general to specifics&lt;br&gt;
 💠 Use speaker notes to help you remember talking points&lt;br&gt;
 💠 Include key takeaways.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastering data visualization: Turning numbers into something that matters</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Mon, 17 Feb 2025 21:32:50 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/week-5-of-my-data-analytics-journey-4dj8</link>
      <guid>https://dev.to/awasume_marylin_8/week-5-of-my-data-analytics-journey-4dj8</guid>
      <description>&lt;p&gt;I am in the 5th phase of data analytics, which is the SHARE PHASE.&lt;br&gt;
 Recall&lt;br&gt;
 ➡️ ASK PHASE&lt;br&gt;
 ➡️ PREPARE PHASE&lt;br&gt;
 ➡️ PROCESS PHASE&lt;br&gt;
 ➡️ ANALYZE PHASE&lt;br&gt;
 ⭕ SHARE PHASE (VISUALIZATION PHASE)&lt;br&gt;
 ➡️ ACT PHASE&lt;br&gt;
 &lt;br&gt;
 DATA VISUALIZATION is the graphical representation and presentation of your data. &lt;br&gt;
 FRAMEWORKS help organize your thoughts about data visualization and give you a useful checklist to reference as you plan and evaluate your data visualization.&lt;br&gt;
 There are 2 frameworks that employ slightly different techniques. Both are intended to improve the quality of your visuals.&lt;br&gt;
 - The McCandless Method: It has 4 elements of good data visualization.&lt;br&gt;
 💠 Information: The data which you're working with.&lt;br&gt;
 💠 Story: A clear and compelling narrative or concept.&lt;br&gt;
 💠 Goal: A specific objective or function for the visual.&lt;br&gt;
 💠 Visual Form: An effective use of metaphor or visual &lt;br&gt;
 expression.&lt;br&gt;
 - Kaiser Fung's Junk Charts Trifecta Checkup: This approach is a set of questions that can help consumers of data visualization critique what they are consuming and determine how effective it is.&lt;br&gt;
You can also use these questions to determine if your data visualization is effective.&lt;br&gt;
 💠 What is the practical question?&lt;br&gt;
 💠 What does the data say?&lt;br&gt;
 💠 What does the visual say?&lt;br&gt;
The essential building blocks that make visuals immediately understandable are called MARKS and CHANNELS.&lt;br&gt;
 ➡️ MARKS are basic visual objects such as points, lines and shapes. Every mark can be broken down into qualities:&lt;br&gt;
 ⭕ Position: Where is a specific mark in space relative to a scale or to other marks?&lt;br&gt;
 ⭕ Size: How big, small, long or tall is a mark?&lt;br&gt;
 ⭕ Shape: Does the shape of the object communicate something about it?&lt;br&gt;
 ⭕ Color: What color is the mark?&lt;br&gt;
 &lt;br&gt;
 ➡️ CHANNELS are visual aspects or variables that represent characteristics of the data in a visualization. They are basically marks that have been used to visualize data.&lt;br&gt;
 ⭕ Accuracy: Are channels helpful in accurately estimating the values being represented?&lt;br&gt;
 ⭕ Pop-out: How easy is it to distinguish certain values from others?&lt;br&gt;
 ⭕ Grouping: How effective is a channel at communicating groups that exist in the data?&lt;br&gt;
There are different types of visuals to use when you want to visualize your data.&lt;br&gt;
 &lt;br&gt;
 ➡️ BAR GRAPHS: Use size contrast to compare two or more values.&lt;br&gt;
 ➡️ PIE CHARTS: Show how much each part of something makes up the whole.&lt;br&gt;
 ➡️ MAPS: Help organize data geographically.&lt;br&gt;
 ➡️ HISTOGRAM: A chart that shows how often data values fall into a certain range.&lt;br&gt;
 ➡️ LINE CHART: Used to track changes over short and long periods of time. When smaller changes exist, line charts are better to use than BAR CHARTS.&lt;br&gt;
 ➡️ COLUMN CHARTS: Use size to contrast and compare two or more values, using height or length to represent the specific values.&lt;br&gt;
 ➡️ HEAT MAP: Similar to a BAR CHART, heat maps also use color to compare categories in a dataset.&lt;br&gt;
 ➡️ SCATTER PLOT: Shows relationships between different variables. Scatter plots are typically used for 2 variables in a set of data, although additional variables can be displayed.&lt;br&gt;
 ➡️ DISTRIBUTION GRAPH: Displays the spread of various outcomes in a dataset.&lt;br&gt;
 ➡️ CORRELATION CHARTS: Show relationships among data. Correlation is the measure of the degree to which 2 variables move in relation to each other. If one variable goes up and the other goes down, it is a NEGATIVE or INVERSE CORRELATION. If one variable goes up and the other also goes up, it is a POSITIVE CORRELATION. If one variable goes up and the other stays the same, there is NO CORRELATION.&lt;br&gt;
 CAUSATION refers to the idea that an event leads to a specific outcome.&lt;br&gt;
 STATIC VISUALIZATIONS do not change over time unless they are edited.&lt;br&gt;
 DYNAMIC VISUALIZATIONS are interactive and change over time.&lt;br&gt;
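 To make the correlation idea concrete, here is a minimal Python sketch (my own illustration, not from the course; the value lists are made up) that computes the Pearson correlation coefficient:&lt;br&gt;

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Positive correlation: both variables move up together (result near +1).
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))
# Negative (inverse) correlation: one goes up, the other goes down (near -1).
print(pearson([1, 2, 3, 4], [40, 30, 20, 10]))
```

 A result near +1 means the variables rise together, near -1 means one falls as the other rises, and near 0 means there is no linear relationship.&lt;br&gt;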
 There are meaningful patterns in data visualization, and they can take many forms.&lt;br&gt;
 &lt;br&gt;
 💠 CHANGE. This is a trend or instance of observation that becomes different over time. A great way to measure change in data is through a LINE CHART or COLUMN CHART.&lt;br&gt;
 💠 CLUSTERING. A collection of data points with similar or different values. This is best represented through a DISTRIBUTION GRAPH.&lt;br&gt;
 💠 RELATIVITY. These are observations considered in relation or in proportion to something else. They are best represented using a PIE CHART.&lt;br&gt;
 💠 RANKING. This is a position on a scale of achievement or status. Data that requires ranking is best represented by a COLUMN CHART.&lt;br&gt;
I also learnt about DECISION TREES. Data grows on DECISION TREES.&lt;br&gt;
 A DECISION TREE is a decision-making tool that allows you, the data analyst, to make decisions based on key questions that you can ask yourself.&lt;br&gt;
 I also learnt about the 9 basic principles of design;&lt;br&gt;
 💠 BALANCE. The design of a data visualization is balanced when the key visual elements like color and shape are distributed evenly.&lt;br&gt;
 💠 EMPHASIS. Your data visualization should have a focal point, so that your audience knows where to concentrate. In other words, your visualization should emphasize the most important data so that users recognize it first. You can use colors and values.&lt;br&gt;
 💠 MOVEMENT. This can refer to the path the viewer's eyes travel as they look at a data visualization, or literal movement created by animation.&lt;br&gt;
 💠 PATTERNS. You can use similar shapes and colors to create patterns in your data visualization. You can use patterns to highlight similarities between different datasets, or break up a pattern with a unique shape, color, or line to create more emphasis.&lt;br&gt;
 💠 REPETITION. Repeating chart types, shapes, or colors adds to the effectiveness of your visualization.&lt;br&gt;
 💠 PROPORTION. Using various colors and sizes helps demonstrate that you are calling attention to a specific visual over others.&lt;br&gt;
 These first 6 principles of design are key considerations that you can make while you are creating your data visualization.&lt;br&gt;
 These next 3 principles are useful checks once your data visualization is finished&lt;br&gt;
 💠 RHYTHM. This refers to creating a sense of movement or flow in your visualization. If your finished design doesn't successfully create a flow, you might want to rearrange some of the elements to improve the rhythm.&lt;br&gt;
 💠 VARIETY. Your visualizations should have some variety in the chart types, lines, shapes, colors and values you use. Variety keeps the audience engaged, but it is good to find balance, since too much variety can confuse people.&lt;br&gt;
 💠 UNITY. Your final data visualization should be cohesive. If the visual is disjointed or not well organized, it will be confusing and overwhelming.&lt;br&gt;
 ELEMENTS OF ART&lt;br&gt;
 - LINE&lt;br&gt;
 - SHAPE&lt;br&gt;
 - COLOR&lt;br&gt;
 - SPACE&lt;br&gt;
 - MOVEMENT&lt;br&gt;
 &lt;br&gt;
 &lt;br&gt;
 5 PHASES OF THE DESIGN PROCESS &lt;br&gt;
 ➡️ EMPATHIZE: Thinking about the emotions and needs of the target audience for the data visualization.&lt;br&gt;
 ➡️ DEFINE: Figuring out exactly what your audience needs from the data.&lt;br&gt;
 ➡️ IDEATE: Generating ideas for data visualization.&lt;br&gt;
 ➡️ PROTOTYPE: Putting visualization together for testing and feedback.&lt;br&gt;
 ➡️ TEST: Showing prototype visualizations to people before stakeholders see them.&lt;br&gt;
I learnt about data visualization with TABLEAU&lt;br&gt;
 TABLEAU is a business intelligence and analytics platform that helps people see, understand and make decisions with data.&lt;br&gt;
 There are DESIGN PRINCIPLES IN TABLEAU which are;&lt;br&gt;
 - Choose the right visuals&lt;br&gt;
 - Optimize the data-ink ratio&lt;br&gt;
 - Use orientation effectively &lt;br&gt;
 - Number of elements&lt;br&gt;
 - Avoid misleading and or deceptive charts&lt;br&gt;
 Some common mistakes to avoid so that your visualizations aren't accidentally misleading:&lt;br&gt;
 - Cutting off the y-axis&lt;br&gt;
 - Misleading use of a dual axis&lt;br&gt;
 - Artificially limiting the scope of your data (not showing all of the data)&lt;br&gt;
 - Problematic choices in how data is binned or grouped&lt;br&gt;
 - Using part-to-whole visuals when the totals do not sum up appropriately&lt;br&gt;
 - Hiding trends in cumulative charts&lt;br&gt;
 - Artificially smoothing trends&lt;br&gt;
 &lt;br&gt;
 A few rules about what makes a helpful data visualization:&lt;br&gt;
 &lt;br&gt;
 💠 FIVE-SECOND RULE: A data visualization should be clear, effective and convincing enough to be absorbed in five seconds or less.&lt;br&gt;
 💠 COLOR CONTRAST: Graphs and charts should use a diverging color palette to show contrast between elements.&lt;br&gt;
 💠 CONVENTIONS and EXPECTATIONS: Visuals and their organization should align with audience expectations and cultural conventions.&lt;br&gt;
 💠 MINIMAL LABELS: Titles, axes and annotations should use as few labels as it takes to make sense. Having too many labels makes your graphs or charts too busy; it takes up too much space and prevents the labels from being shown clearly.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Excel, SQL, and R: The power trio that transformed my data skills</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Tue, 11 Feb 2025 22:37:18 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/week-4-of-my-data-analytics-journey-3bie</link>
      <guid>https://dev.to/awasume_marylin_8/week-4-of-my-data-analytics-journey-3bie</guid>
      <description>&lt;p&gt;I am now at the ANALYSIS PHASE.&lt;br&gt;
 Recall&lt;br&gt;
➡️ ASK PHASE&lt;br&gt;
➡️ PREPARE PHASE&lt;br&gt;
➡️ PROCESS PHASE&lt;br&gt;
⭕ ANALYSIS PHASE&lt;br&gt;
➡️ SHARE PHASE&lt;br&gt;
➡️ ACT PHASE&lt;br&gt;
 This is the process used to make sense of the data collected. The goal of analysis is to identify trends and relationships within data so you can accurately answer the questions you are asking.&lt;br&gt;
 &lt;br&gt;
 The four phases of analysis&lt;br&gt;
 💠 Organize data&lt;br&gt;
 💠 Format and adjust data&lt;br&gt;
 💠 Get input from others&lt;br&gt;
 💠 Transform data&lt;br&gt;
This phase includes SORTING and FILTERING &lt;br&gt;
 - SORTING is the process of arranging data into a meaningful order to make it easier to understand, analyze and visualize.&lt;br&gt;
 - FILTERING is used to show only the data that meets specified criteria while hiding the rest. It is useful when you have a lot of data.&lt;br&gt;
 I also did practical work on SORTING and FILTERING data using MICROSOFT EXCEL and SQL&lt;br&gt;
 CASE STUDY: MOVIE DATA, COUNTY NATALITY.&lt;br&gt;
In SQL, I used the ORDER BY and WHERE clauses to sort and filter data.&lt;br&gt;
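As a rough sketch of those clauses (my own illustration using Python's built-in sqlite3 module and a made-up movies table, not the actual course dataset):&lt;br&gt;

```python
import sqlite3

# In-memory database with a small, made-up movie table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("Alpha", 2019, 7.2), ("Beta", 2021, 8.5), ("Gamma", 2018, 6.1)],
)

# WHERE filters rows to a criterion; ORDER BY sorts the result.
rows = conn.execute(
    "SELECT title, rating FROM movies WHERE rating >= 7 ORDER BY rating DESC"
).fetchall()
print(rows)  # [('Beta', 8.5), ('Alpha', 7.2)]
```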
I learnt FORMATTING FOR BETTER ANALYSIS.&lt;br&gt;
 ➡️ CONVERT functions to change units of measurement (in Excel)&lt;br&gt;
 &lt;br&gt;
Convert data in spreadsheet&lt;br&gt;
➡️ Convert from STRING to DATE, converting TEXT values containing both DATES and TIMES&lt;br&gt;
➡️ Convert from STRING to NUMBER&lt;br&gt;
➡️ Combine columns (merge text from 2 or more cells) using the AMPERSAND SYMBOL or CONCATENATE&lt;br&gt;
➡️ Convert from NUMBER to PERCENTAGE&lt;br&gt;
 DATA VALIDATION IN EXCEL&lt;br&gt;
 You can use data validation in Excel to:&lt;br&gt;
 - Add dropdown lists with predetermined options&lt;br&gt;
 - Create custom checkboxes&lt;br&gt;
 - Protect structured data and formulas&lt;br&gt;
 CONDITIONAL FORMATTING&lt;br&gt;
 A spreadsheet tool that changes how cells appear when values meet specific conditions.&lt;br&gt;
 Transforming data types in SQL using the CAST function:&lt;br&gt;
 - Converting a NUMBER to a STRING&lt;br&gt;
 - Converting a STRING to a NUMBER&lt;br&gt;
 - Converting a DATE to a STRING&lt;br&gt;
 - Converting a DATE to a DATETIME&lt;br&gt;
 I also learnt how to combine 2 or more cells in SQL using the CONCAT function. For data to be analyzed properly and efficiently, it has to be clean and understandable. Data analysts use functions like CONCAT to make data easier to work with, which may require combining multiple cells.&lt;br&gt;
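 A small illustration of CAST and concatenation, again with Python's sqlite3 (my own sketch; note that SQLite spells concatenation with the || operator, while many other dialects provide a CONCAT function):&lt;br&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CAST converts a value from one data type to another.
num_as_text = conn.execute("SELECT CAST(42 AS TEXT)").fetchone()[0]
text_as_num = conn.execute("SELECT CAST('3.14' AS REAL)").fetchone()[0]
print(num_as_text, text_as_num)  # 42 3.14

# SQLite spells concatenation with ||; the names here are made up.
full = conn.execute(
    "SELECT first || ' ' || last FROM (SELECT 'Ada' AS first, 'Lovelace' AS last)"
).fetchone()[0]
print(full)  # Ada Lovelace
```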
I learnt about DATA AGGREGATION, VLOOKUP and JOIN.&lt;br&gt;
DATA AGGREGATION is the process of gathering data from multiple sources in order to combine it into a single, summarized collection.&lt;br&gt;
 - Puzzle pieces = data&lt;br&gt;
 - Organization = aggregation&lt;br&gt;
 - Pile of pieces = summary&lt;br&gt;
 - Putting the pieces together = gaining insights&lt;br&gt;
 So DATA AGGREGATION helps analysts to:&lt;br&gt;
 ➡️ Identify trends&lt;br&gt;
 ➡️ Make comparisons&lt;br&gt;
 ➡️ Gain insights&lt;br&gt;
 Data can also be aggregated over a given time period to provide statistics such as&lt;br&gt;
 ➡️ MINS&lt;br&gt;
 ➡️ AVERAGES&lt;br&gt;
 ➡️ MAXS&lt;br&gt;
 ➡️ SUMS&lt;br&gt;
 VLOOKUP stands for vertical lookup. It is a function that searches for a certain value in a column to return a corresponding piece of information.&lt;br&gt;
 VLOOKUP can match 2 sheets together on a matching column to populate a single sheet.&lt;br&gt;
It only returns the data it finds to the right of the lookup column; it can't look left.&lt;br&gt;
JOIN is an SQL clause that is used to combine rows from 2 or more tables based on a related column. It is like VLOOKUP in SQL.&lt;br&gt;
 COMMON JOINS&lt;br&gt;
 💠 INNER JOIN: Returns only the records with matching values in both tables.&lt;br&gt;
 💠 LEFT JOIN: Returns all the records from the left table and only the matching records from the right table.&lt;br&gt;
 💠 RIGHT JOIN: Returns all records from the right table and only the matching records from the left.&lt;br&gt;
 💠 (FULL) OUTER JOIN: Combines the LEFT and RIGHT JOIN to return all records from both tables, matched where possible.&lt;br&gt;
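 A sketch of INNER and LEFT joins with sqlite3 and two made-up tables (older SQLite versions lack RIGHT and FULL OUTER JOIN, so this sticks to the first two; it also shows AS aliases in passing):&lt;br&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 'laptop');
""")

# INNER JOIN keeps only rows with a match in both tables.
inner = conn.execute(
    "SELECT c.name, o.item FROM customers AS c "
    "INNER JOIN orders AS o ON c.id = o.customer_id"
).fetchall()
print(inner)  # [('Ada', 'laptop')]

# LEFT JOIN keeps every customer, with NULL (None) where no order matches.
left = conn.execute(
    "SELECT c.name, o.item FROM customers AS c "
    "LEFT JOIN orders AS o ON c.id = o.customer_id"
).fetchall()
print(left)  # [('Ada', 'laptop'), ('Grace', None)]
```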
 I also learnt about the importance of ALIASES.&lt;br&gt;
 ALIASES are used in SQL queries to create temporary names for a column or table. They are implemented by making use of the AS command.&lt;br&gt;
I learnt about SUBQUERIES: queries within queries, also called inner or nested queries.&lt;br&gt;
 - They can make projects easier and more efficient by allowing complex operations to be performed in a single query, reducing the need for multiple trips to the database.&lt;br&gt;
 - They also make your code more readable and maintainable.&lt;br&gt;
 Rules to follow when using subqueries:&lt;br&gt;
 ➡️ Subqueries must be enclosed within parentheses.&lt;br&gt;
 ➡️ A subquery can have one or more columns specified in the SELECT clause.&lt;br&gt;
 ➡️ Subqueries that return more than one row can only be used with multiple-value operators, such as the IN operator, which allows you to specify multiple values in a WHERE clause.&lt;br&gt;
 ➡️ A subquery can't be nested in a SET command.&lt;br&gt;
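 A minimal subquery sketch with made-up sales data: the inner query computes an average that the outer query's WHERE clause then uses:&lt;br&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 100), ('south', 40), ('east', 90), ('west', 30);
""")

# The inner query (in parentheses) runs first; the outer query
# keeps only regions whose sales exceed the overall average (65).
above_avg = conn.execute(
    "SELECT region FROM sales "
    "WHERE amount > (SELECT AVG(amount) FROM sales) "
    "ORDER BY region"
).fetchall()
print(above_avg)  # [('east',), ('north',)]
```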
 I also learnt about PIVOT TABLES in Google Sheets. They make it possible to view data in multiple ways in order to identify insights and trends.&lt;br&gt;
 &lt;br&gt;
 They can help you quickly make sense of larger datasets by comparing metrics, performing calculations and generating reports.&lt;br&gt;
 Pivot tables have 4 basic parts:&lt;br&gt;
💠 ROWS: Organize and group the data you select horizontally.&lt;br&gt;
💠 COLUMNS: Organize and display values from your data vertically.&lt;br&gt;
💠 FILTERS: Enable you to apply filters based on specific criteria.&lt;br&gt;
💠 VALUES: Used to calculate and count data.&lt;br&gt;
I also learnt about DATA VALIDATIONS and the TYPES OF DATA VALIDATION&lt;br&gt;
 DATA VALIDATION is the process of checking and rechecking the quality of your data so that it is complete, accurate, secure and consistent.&lt;br&gt;
 TYPES OF DATA VALIDATION&lt;br&gt;
 ⭕ DATA TYPE: Check that the data type matches the data type defined for a field.&lt;br&gt;
 ⭕ DATA RANGE: Check that the data falls within an acceptable range of values defined for the field.&lt;br&gt;
 ⭕ DATA CONSTRAINTS: Check that the data meets certain conditions or criteria for a field. This includes the type of data entered as well as other attributes of the field such as number of characters.&lt;br&gt;
 ⭕ DATA CONSISTENCY: Check that the data makes sense in the context of other related data.&lt;br&gt;
 ⭕ DATA STRUCTURE: Check that the data follows or conforms to a set structure&lt;br&gt;
 ⭕ CODE VALIDATION: Check that the application code systematically performs any of the previously mentioned validation during user data input.&lt;br&gt;
 &lt;br&gt;
 I also learnt about TEMPORARY TABLES.&lt;br&gt;
 &lt;br&gt;
 They are tables that are created and exist temporarily on a database server. One way to create a temporary table in SQL is the WITH clause, which defines a type of temporary table that you can query from multiple times. They are extremely helpful for queries with complex calculations.&lt;/p&gt;
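A sketch of a WITH clause (a common table expression) over a made-up trips table, again using Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (city TEXT, distance INTEGER);
    INSERT INTO trips VALUES
        ('lagos', 12), ('lagos', 8), ('accra', 5);
""")

# WITH defines a named temporary result set (a CTE) that the
# main query below can reference like an ordinary table.
totals = conn.execute("""
    WITH city_totals AS (
        SELECT city, SUM(distance) AS total
        FROM trips
        GROUP BY city
    )
    SELECT city, total FROM city_totals ORDER BY total DESC
""").fetchall()
print(totals)  # [('lagos', 20), ('accra', 5)]
```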

</description>
    </item>
    <item>
      <title>Data-driven decisions: How I learned to ask the right questions with analytics</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Sun, 02 Feb 2025 19:27:38 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/week-3-of-my-data-analytics-journey-19k3</link>
      <guid>https://dev.to/awasume_marylin_8/week-3-of-my-data-analytics-journey-19k3</guid>
      <description>&lt;p&gt;In this new week I am learning all about the PROCESS PHASE of data analytics.&lt;br&gt;
 ‼️ Recall the 6 phases of data analytics&lt;br&gt;
 - ASK PHASE&lt;br&gt;
 - PREPARE PHASE&lt;br&gt;
 ⭕ PROCESS PHASE&lt;br&gt;
 - ANALYZE PHASE&lt;br&gt;
 - SHARE PHASE&lt;br&gt;
 - ACT PHASE&lt;br&gt;
 In today's lesson, I learnt all about DATA INTEGRITY.&lt;br&gt;
 It can be defined as the ACCURACY, COMPLETENESS, CONSISTENCY and TRUSTWORTHINESS of data throughout its life cycle.&lt;br&gt;
 &lt;br&gt;
One missing piece can make all of your data useless. Data can be compromised in many ways, for example during:&lt;br&gt;
 ➡️ DATA REPLICATION, which is the process of storing data in multiple locations.&lt;br&gt;
 ➡️ DATA TRANSFER, which is the process of copying data from a storage device to memory or from one computer to another. If the transfer is interrupted, you may end up with an incomplete dataset.&lt;br&gt;
 ➡️ DATA MANIPULATION, which involves changing data to make it more organized and easier to read.&lt;br&gt;
Data can also be compromised through &lt;br&gt;
 ➡️ Human error&lt;br&gt;
 ➡️ Viruses&lt;br&gt;
 ➡️ Malware&lt;br&gt;
 ➡️ Hacking&lt;br&gt;
 ➡️ System failures&lt;br&gt;
It is important to check that the data you use aligns with the business objectives.&lt;br&gt;
When you are getting ready for data analysis, you might realize you don't have the data you need, or you don't have enough of it. In some cases you use what is called PROXY DATA in place of the real data.&lt;br&gt;
 There are different types of insufficient data, which include:&lt;br&gt;
 💠 Data from only one source&lt;br&gt;
 💠 Data that keeps updating&lt;br&gt;
 💠 Outdated data&lt;br&gt;
 💠 Geographically limited data&lt;br&gt;
There are ways we can address insufficient data problems.&lt;br&gt;
 💠 Identify trends with the available data&lt;br&gt;
 💠 Wait for more data if time allows&lt;br&gt;
 💠 Talk with stakeholders and adjust your objectives&lt;br&gt;
 💠 Look for a new dataset&lt;br&gt;
 I also learnt about sample size calculation.&lt;br&gt;
 - POPULATION: The entire group that you are interested in for the study.&lt;br&gt;
 - SAMPLE: A subset of your population.&lt;br&gt;
 - MARGIN OF ERROR: Since a sample is used to represent a population, the sample results are expected to differ from what the results would have been if you had surveyed the entire population. This difference is called the MARGIN OF ERROR.&lt;br&gt;
 - CONFIDENCE LEVEL: How confident you are in the survey results.&lt;br&gt;
 - CONFIDENCE INTERVAL: The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error.&lt;br&gt;
 - STATISTICAL SIGNIFICANCE: The determination of whether your result could be due to random chance or not. The greater the significance, the less it is due to chance.&lt;br&gt;
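 These definitions can be made concrete with the standard margin-of-error formula for a survey proportion (my own sketch, not from the course; z = 1.96 corresponds to a 95% confidence level, and p = 0.5 is the conservative worst case):&lt;br&gt;

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Margin of error for a survey proportion at a given z-score.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5
    is the conservative (worst-case) choice.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

moe = margin_of_error(1000)
print(f"{moe:.1%}")  # about 3.1% for a sample of 1000 at 95% confidence
```

 The confidence interval is then the sample result +/- this margin of error.&lt;br&gt;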
I learned about the tasks you should complete before analyzing data, which are:&lt;br&gt;
 - Determine data integrity by assessing the overall accuracy, consistency and completeness of the data.&lt;br&gt;
 - Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.&lt;br&gt;
 - Know when to stop collecting data.&lt;br&gt;
STATISTICAL POWER is the probability of getting meaningful results from a test.&lt;br&gt;
HYPOTHESIS TESTING is a way to see if a survey or experiment has meaningful results.&lt;br&gt;
DATA CONSTRAINTS are the criteria that determine if a piece of data is clean and valid.&lt;br&gt;
DATEDIF is a spreadsheet function that calculates the number of days, months, or years between 2 dates.&lt;br&gt;
ESTIMATED RESPONSE RATE is the average number of people who typically complete a survey.&lt;br&gt;
A REGULAR EXPRESSION (RegEx) is a rule that says the values in a table must match a prescribed pattern.&lt;br&gt;
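 As a Python analogue of the spreadsheet DATEDIF day count (the dates here are hypothetical, chosen just for illustration):&lt;br&gt;

```python
from datetime import date

# Hypothetical dates; spreadsheet DATEDIF(start, end, "D")
# would return the same day count.
start = date(2024, 1, 15)
end = date(2025, 3, 1)

days = (end - start).days
print(days)  # 411
```

 DATEDIF also supports "M" and "Y" units for complete months and years, which have no one-line stdlib equivalent.&lt;br&gt;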
 &lt;br&gt;
 The most common cause of dirty data is HUMAN ERROR.&lt;br&gt;
Clean data is data that is complete, correct and relevant to the problem you're trying to solve.&lt;br&gt;
Dirty data is data that is incomplete, incorrect or irrelevant to the problem you are trying to solve. There are different types of dirty data, which include:&lt;br&gt;
 ➡️ Duplicate data: any data record that shows up more than once, which could be caused by manual data entry, batch data imports or data migration.&lt;br&gt;
 ➡️ Outdated data: any data that is old and should be replaced with newer and more accurate information.&lt;br&gt;
 ➡️ Incomplete data: any data that is missing important fields.&lt;br&gt;
 ➡️ Incorrect or inaccurate data: any data that is complete but inaccurate, which could be caused by human error during data input, fake information or mock data.&lt;br&gt;
 ➡️ Inconsistent data: data that uses different formats to represent the same thing, which could be caused by data being stored incorrectly or errors inserted during data transfer.&lt;br&gt;
Other causes of dirty data include:&lt;br&gt;
 ➡️ Manual data entry&lt;br&gt;
 ➡️ Errors in batch data imports&lt;br&gt;
 ➡️ Data migration&lt;br&gt;
 ➡️ Software obsolescence&lt;br&gt;
 ➡️ Improper data collection&lt;br&gt;
 ➡️ Human errors during data input&lt;br&gt;
 DATA VALIDATION is a tool for checking the accuracy and quality of data before adding or importing it. There are steps you have to follow to clean data:&lt;br&gt;
 💠 Removing unwanted data&lt;br&gt;
 💠 Getting rid of duplicates or data that is irrelevant to the problem you want to solve&lt;br&gt;
 💠 Fixing misspellings&lt;br&gt;
 💠 Fixing inconsistent capitalization&lt;br&gt;
 💠 Fixing punctuation and other typos&lt;br&gt;
DATA MERGING is the process of combining 2 or more datasets into a single dataset.&lt;br&gt;
DATA COMPATIBILITY describes how well 2 or more datasets are able to work together.&lt;br&gt;
 &lt;br&gt;
 Common errors you should avoid are:&lt;br&gt;
 💠 Not checking for spelling errors&lt;br&gt;
 💠 Forgetting to document errors&lt;br&gt;
 💠 Not checking for misfielded values (values entered into the wrong field)&lt;br&gt;
 💠 Overlooking missing values&lt;br&gt;
 💠 Only looking at a subset of the values&lt;br&gt;
 💠 Losing track of business objectives&lt;br&gt;
 💠 Not analyzing the system prior to data cleaning&lt;br&gt;
 💠 Not fixing the source of the error&lt;br&gt;
 💠 Not backing up your data prior to data cleaning&lt;br&gt;
 💠 Not accounting for data cleaning in your deadlines/process&lt;br&gt;
Today I also did 3 EXCEL exercises on data cleaning. It was very exciting, and I will continue to practice every day.&lt;br&gt;
I did more practical work, like cleaning data in Microsoft Excel using different datasets.&lt;br&gt;
I used different functions to make sure that my data was clean (removing duplicates, making the data format consistent, removing unwanted spaces, splitting text and numbers to make reading easy).&lt;br&gt;
I used different Excel functions to achieve my objective of having clean data:&lt;br&gt;
 💠 TRIM&lt;br&gt;
 💠 VLOOKUP&lt;br&gt;
 💠 SPLIT TEXT TO COLUMN&lt;br&gt;
 💠 DATA VALIDATION&lt;br&gt;
 💠 FILTER&lt;br&gt;
 💠 SPLIT&lt;br&gt;
 💠 CONDITIONAL FORMATTING&lt;br&gt;
 💠 PIVOT TABLES&lt;br&gt;
 💠 PASTE SPECIAL&lt;br&gt;
I learnt how to clean data using SQL:&lt;br&gt;
➡️ Removing duplicates using DISTINCT&lt;br&gt;
➡️ Used SUBSTRING to get only the first 2 letters of a country&lt;br&gt;
➡️ Used LENGTH to know how many letters are used to represent a country&lt;br&gt;
➡️ Used the TRIM function to remove spaces&lt;br&gt;
I also did a few exercises to learn how to clean and adjust my data using:&lt;br&gt;
➡️ UPDATE&lt;br&gt;
➡️ SET&lt;br&gt;
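Those cleaning steps can be sketched with Python's sqlite3 (my own illustration; SQLite names the substring function SUBSTR, and the country values are made up):&lt;br&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (name TEXT);
    INSERT INTO countries VALUES
        ('  Cameroon'), ('Ghana'), ('Ghana'), ('Nigeria  ');
""")

# UPDATE ... SET with TRIM removes stray leading/trailing spaces in place.
conn.execute("UPDATE countries SET name = TRIM(name)")

# DISTINCT removes the duplicate row; SUBSTR and LENGTH inspect the values.
rows = conn.execute(
    "SELECT DISTINCT SUBSTR(name, 1, 2), LENGTH(name) "
    "FROM countries ORDER BY 1"
).fetchall()
print(rows)  # [('Ca', 8), ('Gh', 5), ('Ni', 7)]
```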
DATA VERIFICATION is a process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable.&lt;br&gt;
 CHANGELOG is a file containing a chronological list of modifications made to a project. They are usually organized by version and include the date followed by a list of added, improved and removed features. &lt;br&gt;
 &lt;br&gt;
 See the big picture when verifying data cleaning &lt;br&gt;
 💠 Consider the business problem &lt;br&gt;
 💠 Consider the goal&lt;br&gt;
 💠 Consider the data&lt;br&gt;
 Data cleaning checklist: correct the most common problems&lt;br&gt;
 ➡️ Sources of error&lt;br&gt;
 ➡️ Null data (search using conditional formatting and filters)&lt;br&gt;
 ➡️ Misspelled words&lt;br&gt;
 ➡️ Mistyped numbers&lt;br&gt;
 ➡️ Extra spaces and characters (using TRIM functions)&lt;br&gt;
 ➡️ Duplicates (DISTINCT in SQL, or the Remove Duplicates function in&lt;br&gt;
 Excel or Google Sheets)&lt;br&gt;
 ➡️ Messy (inconsistent) strings&lt;br&gt;
 ➡️ Messy (inconsistent) data formats&lt;br&gt;
 ➡️ Misleading variable labels (column)&lt;br&gt;
 ➡️ Truncated or missing data&lt;br&gt;
 ➡️ Business logic (check that the data makes sense given your knowledge&lt;br&gt;
 of the business)&lt;br&gt;
 ➡️ Review the goal of your project to make sure that your data still aligns&lt;br&gt;
 with the goal. This is a continuous process that you will do throughout&lt;br&gt;
 your project.&lt;br&gt;
 DOCUMENTATION&lt;br&gt;
 The process of tracking changes, additions, deletions and errors involved in your data-cleaning efforts, for example a CHANGELOG.&lt;br&gt;
 &lt;br&gt;
 ⭕ Recover data-cleaning errors to reuse when you come across a similar &lt;br&gt;
 problem or when you want to redo the cleaning &lt;br&gt;
 ⭕ Inform other users of changes you have made (a reference sheet for other &lt;br&gt;
 analysts in case you resign)&lt;br&gt;
 ⭕ Determine the quality of the data.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data cleaning demystified: How I learned to transform messy data into gold</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Sun, 26 Jan 2025 20:29:07 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/my-data-analytics-jornal-35gf</link>
      <guid>https://dev.to/awasume_marylin_8/my-data-analytics-jornal-35gf</guid>
      <description>&lt;p&gt;WEEK 2 OF MY DATA ANALYTICS JOURNEY&lt;br&gt;
This week I learnt all about the PREPARE PHASE of data analytics. &lt;br&gt;
DATA EXPLORATION. After asking the right questions, which we saw in the previous days, PREPARING DATA CORRECTLY is the next step.&lt;br&gt;
N.B Data works best when it is organized.&lt;br&gt;
 Every piece of information is data, and it is generated as a result of our activities in the world. We have SURVEY DATA, which is generated using forms. INTERVIEWS can also generate data.&lt;br&gt;
 &lt;br&gt;
 💠 Forms, questionnaires and surveys are commonly used &lt;br&gt;
 ways to collect and generate data.&lt;br&gt;
 💠 Data that is generated online isn't always collected directly.&lt;br&gt;
 Websites use COOKIES, which help them remember your preferences.&lt;br&gt;
 &lt;br&gt;
 ⭕ COOKIES are small files stored on a computer that contain information about users.&lt;br&gt;
 There are 3 different data sources which include:&lt;br&gt;
➡️ First-party data, which is data collected by an individual or group using their own resources. This is typically the preferred method of collecting data because you know exactly where it came from.&lt;br&gt;
➡️ Second-party data, which is data collected by a group directly from its audience. &lt;br&gt;
➡️ Third-party data, which is data collected from outside sources who did not collect it directly. This source type might not be very reliable.&lt;br&gt;
⭕ A SAMPLE is part of a population that is representative of the population.&lt;br&gt;
 I learnt that selecting the right data is of key importance in data analysis.&lt;br&gt;
 💠 When solving your business problems, be sure to select data that can &lt;br&gt;
 answer your question.&lt;br&gt;
 💠 How much data to collect: If you are collecting your own data make &lt;br&gt;
 reasonable decisions about sample size.&lt;br&gt;
 💠 Time frame: If you are collecting your own data, decide how long you &lt;br&gt;
 would like to collect it, especially if you are tracking trends over a long&lt;br&gt;
 period of time.&lt;br&gt;
 &lt;br&gt;
 DATA COLLECTION CONSIDERATIONS&lt;br&gt;
 ‼️ Select the right type of data&lt;br&gt;
 ‼️ Determine the time frame&lt;br&gt;
 ‼️ Decide how data will be collected (collection of new data), &lt;br&gt;
 then decide how much data to collect.&lt;br&gt;
 OR &lt;br&gt;
 ‼️ Choose a data source (use existing data), &lt;br&gt;
 then decide what data to use.&lt;br&gt;
Learned about the different data formats, which include:&lt;br&gt;
 ➡️ DISCRETE DATA, which is data that is counted and has a &lt;br&gt;
 limited number of values&lt;br&gt;
 ➡️ CONTINUOUS DATA, which is data that can have any numeric value&lt;br&gt;
 ➡️ NOMINAL DATA, which is a type of qualitative data that is categorized &lt;br&gt;
 without a set order&lt;br&gt;
 ➡️ ORDINAL DATA, which is a type of qualitative data with a set order or &lt;br&gt;
 scale&lt;br&gt;
 ➡️ INTERNAL DATA, which is data that lives within a company's&lt;br&gt;
 own systems &lt;br&gt;
 ➡️ EXTERNAL DATA, which is data that doesn't live within a company's&lt;br&gt;
 own systems&lt;br&gt;
 ➡️ STRUCTURED DATA, which is data that is organized in a certain format, &lt;br&gt;
 such as rows and columns&lt;br&gt;
 ➡️ UNSTRUCTURED DATA, which is data that is not organized in a certain format, &lt;br&gt;
 such as audio, video and images&lt;br&gt;
 &lt;br&gt;
 Data model is a model that is used for organizing data elements and how they relate to one another. Structured data works nicely within a data model.&lt;br&gt;
Data elements are pieces of information such as people's names, account numbers and addresses.&lt;br&gt;
PHYSICAL DATA MODELING depicts how a database operates. A physical data model defines all entities and attributes used, e.g. it includes table names, column names and data types for the database.&lt;br&gt;
 Two common data modeling techniques are:&lt;br&gt;
 ‼️ ENTITY RELATIONSHIP DIAGRAMS (ERD), which are a visual way to &lt;br&gt;
 understand the relationships between entities in the data model.&lt;br&gt;
 ‼️ UNIFIED MODELING LANGUAGE (UML) diagrams are very detailed &lt;br&gt;
 diagrams that describe the structure of a system by showing the system's&lt;br&gt;
 entities, attributes, operations and their relationships.&lt;br&gt;
 Data type is a specific kind of data attribute that tells what kind of value the data is. It tells you what type of data you are working with. Data types can be different depending on the query language you are using.&lt;br&gt;
 ➡️ BOOLEAN DATA TYPE is a type of data with only two possible values:&lt;br&gt;
 TRUE and FALSE. Boolean logic uses the operators AND, OR and NOT.&lt;br&gt;
 ⭕ The AND operator lets you combine conditions so that both must be met&lt;br&gt;
 ⭕ The OR operator lets you move forward if either one of your two&lt;br&gt;
 conditions is met&lt;br&gt;
 ⭕ The NOT operator lets you filter by subtracting specific conditions from&lt;br&gt;
 the results.&lt;br&gt;
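The three Boolean operators above can be sketched with a simple filter in Python (the customer records below are made up for illustration):

```python
# Filtering a small, invented dataset with the Boolean operators AND, OR, NOT.
customers = [
    {"name": "Ada", "age": 34, "country": "CM"},
    {"name": "Bayo", "age": 17, "country": "NG"},
    {"name": "Chi", "age": 25, "country": "CM"},
]

# AND: both conditions must be met
adults_in_cm = [c for c in customers if c["age"] >= 18 and c["country"] == "CM"]

# OR: either condition may be met
young_or_ng = [c for c in customers if c["age"] < 18 or c["country"] == "NG"]

# NOT: subtract a condition from the results
not_cm = [c for c in customers if not c["country"] == "CM"]

print([c["name"] for c in adults_in_cm])  # ['Ada', 'Chi']
print([c["name"] for c in young_or_ng])   # ['Bayo']
print([c["name"] for c in not_cm])        # ['Bayo']
```

The same conditions work identically in a SQL WHERE clause, since SQL uses the same AND/OR/NOT keywords.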
 WIDE DATA: a dataset in which every data subject has a single row, with multiple columns holding the values of the various attributes of the subject&lt;br&gt;
 LONG DATA: a dataset in which each row represents one observation per subject, so each subject will be represented by multiple rows.&lt;br&gt;
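The wide-versus-long distinction can be shown with a small reshape in plain Python (the store sales figures are invented):

```python
# Reshaping an invented wide table of yearly sales into long format.
# Wide: one row per store, one column per year.
wide = [
    {"store": "A", "2023": 100, "2024": 120},
    {"store": "B", "2023": 80, "2024": 95},
]

# Long: one row per (store, year) observation, so each store gets two rows.
long_rows = [
    {"store": row["store"], "year": year, "sales": row[year]}
    for row in wide
    for year in ("2023", "2024")
]

for r in long_rows:
    print(r)
```

Two wide rows become four long rows, one per observation, which is the shape many charting and analysis tools expect.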
DATA TRANSFORMATION, which is the process of changing the data's format, structure, or values. It includes:&lt;br&gt;
 &lt;br&gt;
 💠 Adding, copying and replicating data&lt;br&gt;
 💠 Deleting fields or records&lt;br&gt;
 💠 Standardizing the names of variables&lt;br&gt;
 💠 Renaming, moving or combining columns in a database&lt;br&gt;
 💠 Joining one set of data with another &lt;br&gt;
 💠 Saving a file in a different format&lt;br&gt;
 There are goals for data transformation. They are&lt;br&gt;
 &lt;br&gt;
 ➡️ Data organization&lt;br&gt;
 ➡️ Data migration&lt;br&gt;
 ➡️ Data enhancement&lt;br&gt;
 ➡️ Data comparison&lt;br&gt;
 DATA BIAS is a type of error that systematically skews results in a certain direction. There are four types of bias:&lt;br&gt;
 - SAMPLE BIAS: when a sample isn't representative of the population as a &lt;br&gt;
 whole &lt;br&gt;
 - OBSERVER BIAS (experimenter bias/research bias): the tendency for &lt;br&gt;
 different people to observe things differently&lt;br&gt;
 - CONFIRMATION BIAS: the tendency to search for or interpret &lt;br&gt;
 information in a way that confirms pre-existing beliefs&lt;br&gt;
 - INTERPRETATION BIAS: the tendency to always interpret &lt;br&gt;
 ambiguous situations in a positive or negative way&lt;br&gt;
 In order to identify good data sources, you need to follow these practices:&lt;br&gt;
 ⭕ Your data must be RELIABLE (from a good source)&lt;br&gt;
 ⭕ Your data must be ORIGINAL&lt;br&gt;
 ⭕ Your data must be COMPREHENSIVE&lt;br&gt;
 ⭕ Your data must be CURRENT (up to date)&lt;br&gt;
 ⭕ Your data must be CITED&lt;br&gt;
 If you don't follow these practices, you will have bad data.&lt;br&gt;
 Just as humans have ethics, data also has ethics. DATA ETHICS is a well-founded standard of right and wrong that dictates how data is collected, shared and used.&lt;br&gt;
 &lt;br&gt;
 There are various aspects of data ethics which include;&lt;br&gt;
 - Ownership: Individuals own the raw data they provide, and they have primary control over its usage, how it's processed and how it's shared.&lt;br&gt;
 - Transaction transparency: All data-processing activities and algorithms should be completely explainable and understood by the individual who provides the data. &lt;br&gt;
 - Consent: An individual's right to know explicit details about how and why their data will be used before agreeing to provide it.&lt;br&gt;
 - Currency: Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.&lt;br&gt;
 - Privacy: Preserving a data subject's information and activities any time a data transaction occurs.&lt;br&gt;
 DATA ANONYMIZATION is the process of protecting people's private or sensitive data by eliminating PERSONALLY IDENTIFIABLE INFORMATION (PII), which is information that can be used by itself or with other data to track down a person's identity.&lt;br&gt;
OPEN DATA and its features.&lt;br&gt;
 OPEN DATA refers to free access, usage and sharing of data. Its features &lt;br&gt;
 include:&lt;br&gt;
 - Open data must be available as a whole. &lt;br&gt;
 - It must be provided under terms that allow reuse and distribution, &lt;br&gt;
 including the ability to use it with other datasets.&lt;br&gt;
 DATA INTEROPERABILITY is the ability of data systems and services to openly connect and share data.&lt;br&gt;
A DATABASE is a collection of data stored in a computer system. Key database concepts include:&lt;br&gt;
 &lt;br&gt;
 ⭕ PRIMARY KEY is a unique identifier of each record within a table.&lt;br&gt;
 ⭕ FOREIGN KEY is a field within a table that is a primary key in &lt;br&gt;
 another table.&lt;br&gt;
 ⭕ RELATIONAL DATABASE is a database that contains a series of tables&lt;br&gt;
 that can be connected to form relationships. Basically, they allow &lt;br&gt;
 analysts to organize and link data based on what the data has in &lt;br&gt;
 common.&lt;br&gt;
 ⭕ NORMALIZATION is a process of organizing data in a relational&lt;br&gt;
 database, for example creating tables and establishing &lt;br&gt;
 relationships between those tables. It is applied to eliminate&lt;br&gt;
 data redundancy, increase integrity and reduce complexity&lt;br&gt;
 in a database.&lt;br&gt;
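The primary key, foreign key and relational ideas above can be sketched with a minimal two-table schema in SQLite (the customer and order data are invented for illustration):

```python
import sqlite3

# A minimal, invented two-table schema illustrating primary and foreign keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- unique identifier of each record
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,  -- foreign key: a primary key from another table
        amount REAL,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Awasume')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")

# Linking the tables on what they have in common (the customer_id)
row = conn.execute("""
    SELECT c.name, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()
print(row)  # ('Awasume', 25.0)
```

The JOIN works precisely because `orders.customer_id` points back to the primary key of `customers`, which is the relationship the foreign key declares.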
I also learnt about METADATA. METADATA is data about data. It is used in database management to help data analysts interpret the contents of the data within the database. &lt;br&gt;
 There are three types of metadata:&lt;br&gt;
 ➡️ DESCRIPTIVE METADATA describes a piece of data and can be used to &lt;br&gt;
 identify it at a later point in time.&lt;br&gt;
 ➡️ STRUCTURAL METADATA indicates how a piece of data is organized&lt;br&gt;
 and whether it is part of one or more than one data collection.&lt;br&gt;
 ➡️ ADMINISTRATIVE METADATA indicates the technical source of a&lt;br&gt;
 digital asset.&lt;br&gt;
METADATA provides information about other data and helps data analysts interpret the contents of the data within a database. It tells the who, what, when, where, which, why and how of the data. You can find metadata in PHOTOS, EMAILS, SPREADSHEETS and ELECTRONICALLY CREATED DOCUMENTS.&lt;br&gt;
 Elements of metadata include:&lt;br&gt;
 &lt;br&gt;
💠 File or document type&lt;br&gt;
💠 Date, time and creator&lt;br&gt;
💠Title and description&lt;br&gt;
💠 Geolocation&lt;br&gt;
💠 Tags and categories&lt;br&gt;
💠 Who last modified it and when&lt;br&gt;
💠 Who can access or update it.&lt;br&gt;
 The benefit of metadata is that it makes data RELIABLE AND CONSISTENT. &lt;br&gt;
METADATA REPOSITORIES help data analysts ensure their data is reliable and consistent. They describe where the metadata came from and store that data in an accessible form with a common structure.&lt;br&gt;
BIGQUERY, which is a data warehouse on the Google Cloud Platform used to query and filter large datasets.&lt;br&gt;
 &lt;br&gt;
 Most importantly, I learnt how to use BigQuery and query datasets using the keywords:&lt;br&gt;
 ⭕ SELECT&lt;br&gt;
 ⭕ FROM&lt;br&gt;
 ⭕ WHERE&lt;br&gt;
 These are STRUCTURED QUERY LANGUAGE (SQL) keywords that help you easily get specific data (e.g. a customer ID, or the number of customers who bought sugar balls from the bakery) from a large range of data, hence making your analysis process easy. They also help you to retrieve and manipulate data from a database.&lt;br&gt;
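A SELECT ... FROM ... WHERE query can be tried locally with SQLite from Python; the bakery data below is invented, but BigQuery accepts the same basic SQL shape:

```python
import sqlite3

# A SELECT ... FROM ... WHERE query against an invented in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, product TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, "sugar balls"), (2, "bread"), (3, "sugar balls")],
)

# SELECT: which columns; FROM: which table; WHERE: which rows
rows = conn.execute(
    "SELECT customer_id FROM sales WHERE product = 'sugar balls'"
).fetchall()
print(rows)  # [(1,), (3,)]
```

Only the rows matching the WHERE condition come back, which is how a query narrows a large dataset down to the specific data you need.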
ORGANIZING and SAVING DATA.&lt;br&gt;
 &lt;br&gt;
There are best practices organisations can follow when organizing data, as seen below:&lt;br&gt;
 ⭕ NAMING CONVENTIONS, which are consistent guidelines that describe &lt;br&gt;
 the content, date, or version of a file in its name. This means using logical and &lt;br&gt;
 descriptive names for your files to make them easier to find and use.&lt;br&gt;
 ⭕ FOLDERING, which is organizing files into folders, helping you easily&lt;br&gt;
 find your files. &lt;br&gt;
 ⭕ ARCHIVING CONVENTIONS can be used to move old projects to a separate &lt;br&gt;
 location to create an archive and cut down on clutter.&lt;br&gt;
 ⭕ ALIGN your naming and storage practices with your team.&lt;br&gt;
 ⭕ Develop METADATA practices, like creating a file that outlines project&lt;br&gt;
 naming conventions for easy reference.&lt;br&gt;
 Storing data in different places (a database and a spreadsheet) can cause the data &lt;br&gt;
 to contradict itself and also takes up a lot of space. &lt;br&gt;
A RELATIONAL DATABASE can help you avoid data duplication and store data more efficiently. &lt;br&gt;
 File-naming conventions help you organize, access, process and analyze data because they act as quick reference points to identify what's in a file.&lt;br&gt;
 One important practice is to decide on file-naming conventions as a team or company early in a project. It's also critical to ensure that file names are meaningful, consistent and easy to read. File names should include:&lt;br&gt;
 &lt;br&gt;
 ➡️ Project's name&lt;br&gt;
 ➡️ File creation date &lt;br&gt;
 ➡️ Revision version&lt;br&gt;
 ➡️ Consistent style and order&lt;br&gt;
 DATA SECURITY is protecting data from unauthorized access or corruption by adopting safety measures. &lt;br&gt;
 - Excel and Google Sheets both have features that help you protect spreadsheets from being edited, from the entire worksheet right down to single cells in a table.&lt;br&gt;
 - They both have access control features like password protection and user permissions.&lt;br&gt;
 &lt;br&gt;
 We can keep our data safe and still access it at any time for analysis. Security measures that help companies do both are:&lt;br&gt;
 &lt;br&gt;
 ‼️ ENCRYPTION uses a unique algorithm to alter data and make it unusable by users and applications that don't know the algorithm. This algorithm is saved as a 'KEY', which can be used to reverse the encryption, so if you have the key you can still use the data in its original form.&lt;br&gt;
 &lt;br&gt;
 ‼️ TOKENIZATION replaces the data elements you want to protect with randomly generated data referred to as a TOKEN. The original data is stored in a separate location and mapped to the tokens. &lt;br&gt;
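A toy sketch of that token-mapping idea in Python (real systems use a hardened token vault; the `token_vault`, `tokenize` and `detokenize` names and the card number are all invented for illustration):

```python
import secrets

# Toy tokenization: sensitive values are swapped for random tokens, and the
# token-to-value mapping is kept in a separate store (the "token vault").
token_vault = {}  # separate location mapping tokens back to the original data

def tokenize(value: str) -> str:
    token = secrets.token_hex(8)  # randomly generated replacement
    token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    return token_vault[token]

card = "4111-1111-1111-1111"
token = tokenize(card)

print(token != card)           # the stored record no longer holds the real value
print(detokenize(token) == card)  # the vault mapping recovers the original
```

Anyone who obtains only the token learns nothing about the original value; recovering it requires access to the separate vault.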
 &lt;br&gt;
VERSION CONTROL enables all collaborators within a file to track changes over time. You can see who made what changes to a file, when they were made and why.&lt;/p&gt;

</description>
      <category>learning</category>
      <category>database</category>
      <category>datascience</category>
      <category>community</category>
    </item>
    <item>
      <title>From raw data to insights: my journey through foundations of data analytics</title>
      <dc:creator>Awasume Marylin</dc:creator>
      <pubDate>Sun, 19 Jan 2025 06:51:29 +0000</pubDate>
      <link>https://dev.to/awasume_marylin_8/my-data-analytics-jornal-5aaa</link>
      <guid>https://dev.to/awasume_marylin_8/my-data-analytics-jornal-5aaa</guid>
      <description>&lt;p&gt;During my first week of studying data analytics, i got to learn “ What data analytics is”. And it is defined below.&lt;/p&gt;

&lt;p&gt;Data analytics is the collection, transformation and organisation of data in order to draw conclusions, make predictions and drive informed decision-making.&lt;/p&gt;

&lt;p&gt;KEY WORDS&lt;/p&gt;

&lt;p&gt;Collection&lt;br&gt;
Transformation&lt;br&gt;
Organisation&lt;br&gt;
Drawing conclusions&lt;br&gt;
Making predictions&lt;br&gt;
Driving informed decision-making&lt;br&gt;
What makes you a strong data analyst is not just maths alone; it's:&lt;/p&gt;

&lt;p&gt;Asking the right questions&lt;br&gt;
Finding the best sources to answer your questions effectively&lt;br&gt;
Illustrating your findings clearly in visualizations&lt;br&gt;
I also got to learn the different types of business analytics.&lt;/p&gt;

&lt;p&gt;4 key types of Business Analytics&lt;/p&gt;

&lt;p&gt;Descriptive analytics: The interpretation of historical data to identify trends and patterns&lt;br&gt;
Predictive analytics: Centers on taking that information and using it to forecast future outcomes&lt;br&gt;
Diagnostic analytics: Used to identify the root cause of a problem&lt;br&gt;
Prescriptive analytics: Testing and other techniques are employed to determine which outcome will yield the best result in a given scenario&lt;br&gt;
I learnt that there are 6 phases a data analyst passes through to solve a problem and make DATA-DRIVEN INFORMED DECISIONS. These phases include:&lt;/p&gt;

&lt;p&gt;THE 6 PHASES OF DATA ANALYTICS&lt;/p&gt;

&lt;p&gt;Ask phase: Ask questions and define the problem&lt;br&gt;
Prepare phase: Prepare data by collecting and storing the information&lt;br&gt;
Process phase: Process data by cleaning and checking the information&lt;br&gt;
Analyze phase: Analyze data by finding patterns, relationships and trends&lt;br&gt;
Share phase: Share data with your audience&lt;br&gt;
Act phase: Act on the data and use the analysis results&lt;br&gt;
You can also work with your gut instinct, but your gut instinct needs to be based on the data, otherwise you will go off track. Data analysis is rooted in statistics.&lt;/p&gt;

&lt;p&gt;I saw something interesting and surprising while studying, which is DATA ECOSYSTEMS, and they also have their elements.&lt;/p&gt;

&lt;p&gt;Data ecosystems are the various elements that interact with one another in order to produce, manage, store, organize, analyze and share data, e.g. hardware and software tools and the people that use them.&lt;/p&gt;

&lt;p&gt;Data can also be found in the cloud.&lt;/p&gt;

&lt;p&gt;The cloud is a place where data is kept online rather than on a computer hard drive.&lt;/p&gt;

&lt;p&gt;While studying as a data analyst, you or other people might have these COMMON MISCONCEPTIONS.&lt;/p&gt;

&lt;p&gt;Data science is the same as data analytics.&lt;/p&gt;

&lt;p&gt;Data science creates new questions using data. In other words, it's creating new ways of modeling and understanding the unknown by using raw data, while analysts find answers to existing questions by creating insights from data sources.&lt;/p&gt;

&lt;p&gt;Data analysis and data analytics sound the same, but they are very different.&lt;/p&gt;

&lt;p&gt;Data analysis is the collection, processing and analyzing of data to make informed decisions, while data analytics is the science of data.&lt;/p&gt;

&lt;p&gt;One of the powerful ways you can put data to work is with DATA-DRIVEN DECISION-MAKING: using facts to guide business strategy.&lt;/p&gt;

&lt;p&gt;The first step in data-driven decision-making is FIGURING OUT THE BUSINESS NEED THAT HAS TO BE SOLVED.&lt;/p&gt;

&lt;p&gt;AS AN ANALYST YOU ALSO NEED TO HAVE SKILLS (ANALYTICAL SKILLS), WHICH ANYONE WHO WANTS TO BE A GOOD OR EXCELLENT ANALYST SHOULD POSSESS.&lt;/p&gt;

&lt;p&gt;ANALYTICAL SKILLS are qualities and characteristics associated with problem solving using facts.&lt;/p&gt;

&lt;p&gt;5 essential analytical skills&lt;/p&gt;

&lt;p&gt;Curiosity&lt;br&gt;
Understanding context (the condition in which something exists or happens)&lt;br&gt;
Having a technical mindset (the ability to break things down into smaller steps or pieces and work with them in an orderly and logical way)&lt;br&gt;
Data design (how you organize information)&lt;br&gt;
Data strategy (the management of the people, processes and tools used in data analysis)&lt;br&gt;
Analytical thinking involves identifying and defining a problem and solving it by using data in an organized, step-by-step manner. The 5 aspects of analytical thinking include:&lt;/p&gt;

&lt;p&gt;Visualization&lt;br&gt;
Strategy (helps data analysts see what they want to achieve with the data and how they can get there; it also helps improve the quality and usefulness of the data we collect)&lt;br&gt;
Problem orientation (all about keeping the problem at the top of your mind throughout the project)&lt;br&gt;
Correlation (a relationship; it doesn't equal causation)&lt;br&gt;
Big-picture and detail-oriented thinking (like looking at the whole puzzle instead of seeing only little pieces; it helps you zoom out and see possibilities and opportunities)&lt;br&gt;
There are also the 5 WHYs, which can help you find the solution or root cause of a problem.&lt;/p&gt;

&lt;p&gt;Gap analysis is a method for examining and evaluating how a process currently works in order to get where you want to be in the future.&lt;/p&gt;

&lt;p&gt;Data-driven decision-making: using facts to guide business strategy&lt;br&gt;
Dataset: a collection of data that can be manipulated or analyzed as one unit&lt;br&gt;
Root cause: the reason why a problem occurs&lt;br&gt;
STAGES OF THE DATA LIFE CYCLE&lt;/p&gt;

&lt;p&gt;Plan stage: Decide what kind of data is needed, how it will be managed and who will be responsible for it.&lt;br&gt;
Capture: Collect or bring in data from a variety of different sources.&lt;br&gt;
Manage: Care for and maintain the data. This includes determining how and where it is stored and the tools used to do so.&lt;br&gt;
Analyze: Use the data to solve problems, make decisions and support business goals.&lt;br&gt;
Archive: Keep relevant data stored for the long term and future reference.&lt;br&gt;
Destroy: Remove data from storage and delete any stored copies of the data.&lt;br&gt;
In order to collect, organize, manage and process data, you need tools. These tools are called DATA ANALYTICAL TOOLS.&lt;/p&gt;

&lt;p&gt;These tools include;&lt;/p&gt;

&lt;p&gt;Spreadsheet: A digital worksheet that stores, organizes and sorts data, e.g. Microsoft Excel and Google Sheets. It allows you to identify patterns and piece the data together in a way that works for each specific data project. Spreadsheets can also create excellent data visualizations like graphs and charts.&lt;br&gt;
Query language: A computer programming language that allows you to retrieve and manipulate data from a database. Queries allow analysts to select, create, add or download data from a database for analysis, e.g. SQL (Structured Query Language).&lt;br&gt;
Data visualization: Turns complex numbers into a story that people can understand. Visualizations also help stakeholders come up with conclusions that lead to informed decisions and effective business strategies, e.g. maps, graphs, tables.&lt;br&gt;
There is a relationship between the data analysis process and the data life cycle.&lt;/p&gt;

&lt;p&gt;While the data analysis process will help you drive your projects and reach your business goals, you must understand the life cycle of your data in order to use that process.&lt;/p&gt;

&lt;p&gt;To analyze your data you will need a thorough understanding of it. Similarly, you can collect all the data you want, but the data is only useful to you if you have a plan for analyzing it.&lt;/p&gt;

&lt;p&gt;It is very important as a data analyst to practice FAIRNESS.&lt;/p&gt;

&lt;p&gt;FAIRNESS is ensuring that your analysis doesn't create or reinforce bias (creating systems that include everyone).&lt;/p&gt;

&lt;p&gt;The best practices to ensure fairness while you are working are:&lt;/p&gt;

&lt;p&gt;CONSIDER ALL THE AVAILABLE DATA. We have to decide what data is useful. Often there will be data that isn't relevant to what you're focusing on or doesn't seem to align with your expectations, but you can't just ignore it. It is critical to consider all the available data so that your analysis reflects the truth and not just your own expectations.&lt;br&gt;
IDENTIFY SURROUNDING FACTORS. Context is key for you and your stakeholders to understand the final conclusions of any analysis. You must also consider all of the data to get more insights.&lt;br&gt;
INCLUDE SELF-REPORTED DATA. This is a data collection technique where participants provide information about themselves. It is a great way to introduce fairness into your collection process. People bring conscious and unconscious bias to their observations about the world, including about other people. This method can help you to avoid observer bias.&lt;br&gt;
USE OVERSAMPLING EFFECTIVELY. This is the process of increasing the sample size of nondominant groups in a population. It can help you better represent them and address imbalanced datasets.&lt;br&gt;
Think about fairness from beginning to end.&lt;br&gt;
Some questions are more effective than others. A yes-or-no question is a closed-ended question and cannot lead to valuable insights.&lt;/p&gt;

&lt;p&gt;Effective questions follow the SMART methodology&lt;/p&gt;

&lt;p&gt;-Specific: If the question is too general, try to narrow it down by focusing on just one element. It should not be a closed-ended question.&lt;/p&gt;

&lt;p&gt;-Measurable: Let your questions be measurable, like using figures, quantities, etc.&lt;/p&gt;

&lt;p&gt;-Action-oriented: Let your questions encourage change.&lt;/p&gt;

&lt;p&gt;-Relevant: When you ask relevant questions, they can help you with the problem you are trying to solve.&lt;/p&gt;

&lt;p&gt;-Time-bound: The question should specify the time period to be studied, which limits the range of possible data and allows the analyst to focus on relevant data.&lt;/p&gt;

&lt;p&gt;We need to ask fair questions (practice fairness), meaning there shouldn't be any bias in our questions, e.g. leading your questions a certain way or making assumptions.&lt;/p&gt;

&lt;p&gt;Some questions you might ask when given a project:&lt;/p&gt;

&lt;p&gt;⭕ OBJECTIVE: What are the goals of the project? What questions, if any, are expected to be answered?&lt;/p&gt;

&lt;p&gt;⭕ AUDIENCE: Who are the stakeholders? Who is interested in or concerned about the results of the deep dive? Who is the audience for the deep dive?&lt;/p&gt;

&lt;p&gt;⭕ TIME: What is the time frame for completion? By what date does this need to be done?&lt;/p&gt;

&lt;p&gt;⭕ RESOURCES: What resources are available to accomplish the deep dive?&lt;/p&gt;

&lt;p&gt;⭕ SECURITY: Who should have access to the information?&lt;/p&gt;

&lt;p&gt;I learnt about HOW DATA EMPOWERS DECISIONS. The two types of data-based decision-making include:&lt;/p&gt;

&lt;p&gt;DATA-INSPIRED DECISION-MAKING, which explores different data sources to find out what they have in common&lt;br&gt;
DATA-DRIVEN DECISION-MAKING, which uses facts to guide business strategies.&lt;/p&gt;

&lt;p&gt;Also, I learnt about the two types of data, which might help us answer a lot of different questions. They include:&lt;/p&gt;

&lt;p&gt;1) Quantitative data, which deals with measurements of numerical facts and asks questions like WHAT?, HOW MANY? and HOW OFTEN?&lt;br&gt;
2) Qualitative data, which deals with explanatory measures and asks questions like WHY?&lt;/p&gt;

&lt;p&gt;Also there are 2 types of data presentation tools, which are:&lt;br&gt;
-Reports: a static collection of data given to stakeholders periodically&lt;br&gt;
-Dashboards: monitor live incoming data.&lt;/p&gt;

&lt;p&gt;3 COMMON TYPES OF DASHBOARDS&lt;br&gt;
i) Strategic dashboards, which focus on long-term goals and strategy at the&lt;br&gt;
highest level of metrics.&lt;/p&gt;

&lt;p&gt;ii) Operational dashboards, which focus on short-term performance&lt;br&gt;
tracking and intermediate goals.&lt;/p&gt;

&lt;p&gt;iii) Analytical dashboards, which consist of the datasets and the mathematics&lt;br&gt;
used in these sets.&lt;/p&gt;

&lt;p&gt;I learnt that data can be Big or Small&lt;/p&gt;

&lt;p&gt;-SMALL DATA is a set of small, specific data points, typically involving a short period of time, which are useful in day-to-day decision-making.&lt;/p&gt;

&lt;p&gt;-BIG DATA refers to large, complex datasets, typically involving long periods of time, which enable data analysts to address far-reaching business patterns.&lt;/p&gt;

&lt;p&gt;As someone who studied ICT in school, I definitely know about spreadsheets, but I have mostly used MICROSOFT EXCEL.&lt;/p&gt;

&lt;p&gt;Recently, before beginning my data analytics journey, I studied more about Microsoft Excel and all its formulas and functions. It was a beautiful experience coming across new formulas and functions I had never seen before, and exploring the many things you can do with Microsoft Excel.&lt;/p&gt;

&lt;p&gt;Today I got to learn about how spreadsheets (Microsoft Excel and Google Sheets) can be used to analyse data and how they relate to the data life cycle, which includes:&lt;br&gt;
➡️ Plan&lt;br&gt;
➡️ Capture&lt;br&gt;
➡️ Manage&lt;br&gt;
➡️ Analyze&lt;br&gt;
➡️ Archive&lt;br&gt;
➡️ Destroy&lt;/p&gt;

&lt;p&gt;- PLAN means formatting your cells, the headings you choose to highlight, the color scheme, and the way you order your data points.&lt;/p&gt;

&lt;p&gt;- CAPTURE data from the source by connecting spreadsheets to other data sources, such as an online survey application or database.&lt;/p&gt;

&lt;p&gt;- MANAGE different types of data with a spreadsheet. This can involve storing, organizing, filtering and updating information. Spreadsheets also let you know who can access the data.&lt;/p&gt;

&lt;p&gt;- ANALYZE data in a spreadsheet to help make better decisions. Some of the most common spreadsheet tools include formulas and pivot tables.&lt;/p&gt;

&lt;p&gt;- ARCHIVE any spreadsheet that you don't use often but might need to reference later with built-in tools.&lt;/p&gt;

&lt;p&gt;- DESTROY your spreadsheet when you are certain that you will never need it again. If you might still have a use for it, keep backup copies, or retain it for legal or security reasons.&lt;/p&gt;

&lt;p&gt;⭕ OPERATOR is a symbol that names the type of operation or calculation to be performed.&lt;br&gt;
⭕ CELL REFERENCE is a cell or range of cells in a worksheet that can be used in a formula.&lt;br&gt;
⭕ RANGE is a collection of 2 or more cells.&lt;/p&gt;

&lt;p&gt;I learnt that even as an experienced data analyst there are some common errors that can be made when doing operations. These errors include&lt;/p&gt;

&lt;p&gt;1) #DIV/0! error&lt;br&gt;
2) #NAME? error&lt;br&gt;
3) #N/A error&lt;br&gt;
4) #REF! error&lt;/p&gt;

&lt;p&gt;I used Excel formulas to calculate data. I used basic formulas like SUM, AVERAGE, MIN and MAX, and operations like divide and multiply, amongst the many formulas that exist.&lt;/p&gt;

&lt;p&gt;I also learnt about structured thinking, which is the process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options.&lt;/p&gt;

&lt;p&gt;One way to practice structured thinking and avoid mistakes is by using a SCOPE OF WORK.&lt;/p&gt;

&lt;p&gt;⭕ A SCOPE OF WORK (SOW) is an agreed-upon outline of the work you are going to perform on a project, e.g. the work details, reports, and schedules the client can expect. Under this we have:&lt;/p&gt;

&lt;p&gt;➡️ DELIVERABLES, which focus on what work is being done and what is being created as a result of the project.&lt;br&gt;
➡️ MILESTONES, which are closely related to your timeline. What are the major milestones in your project, and how do you know when a given part of it is complete?&lt;br&gt;
➡️ TIMELINE, which is closely tied to the milestones you created. The timeline is a way of mapping expectations for how long each step of the process should take.&lt;br&gt;
➡️ REPORTS: you have to give status updates to your stakeholders. Will they be weekly or monthly?&lt;/p&gt;
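&lt;p&gt;One lightweight way to keep the four SOW sections in view is a simple checklist structure. Every deliverable, milestone, and date below is a hypothetical example, not from any real project:&lt;/p&gt;

```python
# A hypothetical SOW as a plain Python dict, one key per section.
scope_of_work = {
    "deliverables": ["cleaned sales dataset", "summary dashboard"],
    "milestones": ["data collected", "data cleaned", "report drafted"],
    "timeline": {"data collected": "week 1",
                 "data cleaned": "week 2",
                 "report drafted": "week 4"},
    "reports": "weekly status update to stakeholders",
}

# Sanity check: every milestone should have a slot on the timeline.
missing = [m for m in scope_of_work["milestones"]
           if m not in scope_of_work["timeline"]]
print(missing)  # [] means all milestones are scheduled
```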

&lt;p&gt;There are different stakeholders we might encounter while working on a project. They include:&lt;/p&gt;

&lt;p&gt;➡️ The executive team, who provide strategic and operational leadership to the company. They are made up of vice presidents, the chief marketing officer, and senior-level professionals.&lt;br&gt;
➡️ The customer-facing team: anyone in an organization who has some level of interaction with customers and potential customers.&lt;br&gt;
➡️ The data science team: organizing data within a company takes teamwork. There is a good chance you will find yourself working with other data analysts, data scientists, and data engineers.&lt;/p&gt;

&lt;p&gt;I have learnt that to work effectively with stakeholders you'll often have to go beyond the data. These tips help you communicate clearly, establish trust, and deliver your findings across groups:&lt;br&gt;
‼️ Discuss goals&lt;br&gt;
‼️ Feel empowered to say no&lt;br&gt;
‼️ Plan for the unexpected&lt;br&gt;
‼️ Know your project&lt;br&gt;
‼️ Start with words and visuals&lt;br&gt;
‼️ Communicate often.&lt;/p&gt;

&lt;p&gt;Clear communication is key. Before you communicate, think about:&lt;br&gt;
— WHO YOUR TARGET AUDIENCE IS&lt;br&gt;
— WHAT THEY ALREADY KNOW&lt;br&gt;
— WHAT THEY NEED TO KNOW&lt;br&gt;
— HOW YOU CAN COMMUNICATE THAT TO THEM EFFECTIVELY&lt;/p&gt;

&lt;p&gt;In today's studies I learnt that good writing, listening, and speaking skills also help you communicate effectively as a data analyst, especially because speed can be the enemy of accuracy.&lt;/p&gt;

&lt;p&gt;This is why communication is one of the most valuable tools for working with teams, and why it is important to start with structured thinking and a well-planned SCOPE OF WORK (SOW).&lt;/p&gt;

&lt;p&gt;I also learnt that data can have limitations. As a data analyst it is important to know the limits of data so you can prepare for them. The limitations of data are as follows:&lt;/p&gt;

&lt;p&gt;⭕ Watch for incomplete or nonexistent data.&lt;br&gt;
⭕ Don't miss misaligned data.&lt;br&gt;
⭕ Deal with dirty data (data cleaning).&lt;br&gt;
⭕ Tell a clear data story:&lt;/p&gt;

&lt;p&gt;💠 Compare the same types of data&lt;br&gt;
💠 Visualize with data&lt;br&gt;
💠 Leave out needless graphs&lt;br&gt;
💠 Test for statistical significance&lt;br&gt;
💠 Pay attention to sample size&lt;/p&gt;

&lt;p&gt;⭕ Be the judge&lt;/p&gt;

</description>
      <category>learning</category>
      <category>database</category>
      <category>datascience</category>
      <category>community</category>
    </item>
  </channel>
</rss>
