Data cleaning demystified: How I learned to transform messy data into gold

#learning #database #datascience #community

WEEK 2 OF MY DATA ANALYTICS JOURNEY
This week I learnt all about the PREPARE PHASE OF DATA ANALYTICS.
DATA EXPLORATION. After asking the right questions, which we covered in the previous days, PREPARING DATA CORRECTLY is the next step.
N.B. Data works best when it is organized.
Every piece of information is data, and data is generated as a result of our activities in the world. We have SURVEY DATA, which is generated using forms. INTERVIEWS can also generate data.

💠 Forms, questionnaires and surveys are commonly used
ways to collect and generate data.
💠 Data that is generated online isn't always collected directly.
Websites use COOKIES, which help them remember your preferences.

⭕ COOKIES are small files stored on your computer that contain information about users.
There are 3 different data sources:
➡️ First-party data, which is data collected by an individual or group using their own resources. This is typically the preferred method of collecting data because you know exactly where it came from.
➡️ Second-party data, which is collected by a group directly from its audience and then sold.
➡️ Third-party data, which is data collected from outside sources who did not collect it directly. This source type might not be very reliable.
⭕ A SAMPLE is a part of a population that is representative of the population.
I learnt that selecting the right data is key in data analysis.
💠 When solving a business problem, be sure to select data that can
answer your question.
💠 How much data to collect: If you are collecting your own data, make
reasonable decisions about sample size.
💠 Time frame: If you are collecting your own data, decide how long you
would like to collect it, especially if you are tracking trends over a long
period of time.
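To make the idea of a sample concrete, here is a minimal sketch of drawing a simple random sample in Python. The population of customer IDs, the sample size of 50 and the helper name `draw_sample` are all made up for illustration, and random sampling is only one of several ways to pick a sample.

```python
import random


def draw_sample(population, sample_size, seed=0):
    """Return a simple random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: 1,000 customer IDs.
customer_ids = list(range(1, 1001))
sample = draw_sample(customer_ids, sample_size=50)

print(len(sample))  # 50
```

Every customer has the same chance of being picked, which is what helps the sample stay representative of the population.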

DATA COLLECTION CONSIDERATIONS
‼️ Select the right type of data
‼️ Determine the time frame
‼️ Decide how data will be collected (collection of new data).
Then decide how much data to collect.
OR
‼️ Choose a data source (use existing data).
Then decide what data to use.
I also learned about the different data formats, which include:
➡️ DISCRETE DATA, which is data that is counted and has a
limited number of values
➡️ CONTINUOUS DATA is data that is measured and can have almost any numeric value
➡️ NOMINAL DATA is a type of qualitative data that is categorized without
a set order
➡️ ORDINAL DATA is a type of qualitative data with a set order or
scale
➡️ INTERNAL DATA, which is data that lives within a company's
own systems
➡️ EXTERNAL DATA, which is data that lives outside a company's
own systems
➡️ STRUCTURED DATA is data that is organized in a certain format, such
as rows and columns
➡️ UNSTRUCTURED DATA is data that is not organized in a certain format, such as
audio, video and images

A DATA MODEL is a model used for organizing data elements and how they relate to one another. Structured data works nicely within a data model.
Data elements are pieces of information such as people's names, account numbers and addresses.
PHYSICAL DATA MODELING depicts how a database operates. A physical data model defines all entities and attributes used, e.g. it includes table names, column names, and data types for the database.
2 common data modeling techniques include:
‼️ ENTITY RELATIONSHIP DIAGRAMS (ERD), which are a visual way to
understand the relationship between entities in the data model.
‼️ UNIFIED MODELING LANGUAGE (UML) diagrams, which are very detailed
diagrams that describe the structure of a system by showing the system's
entities, attributes, operations and their relationships.
A DATA TYPE is a specific kind of data attribute that tells you what kind of value the data is, that is, what type of data you are working with. Data types can be different depending on the query language you are using.
➡️ BOOLEAN DATA TYPE is a type of data with only 2 possible values:
TRUE or FALSE. Boolean logic combines conditions with the operators AND, OR and NOT.
⭕ AND OPERATOR lets you stack your conditions, so both of them
must be met
⭕ OR OPERATOR lets you move forward if either one of your 2
conditions is met
⭕ NOT OPERATOR lets you filter by subtracting specific conditions from
the results.
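Here is a small sketch of the three Boolean operators in Python, where they are written `and`, `or` and `not`. The list of shoes is made-up data used only to show how each operator changes the result.

```python
# Hypothetical product list used only for illustration.
shoes = [
    {"name": "runner", "color": "grey", "price": 40},
    {"name": "loafer", "color": "pink", "price": 90},
    {"name": "sandal", "color": "blue", "price": 25},
    {"name": "boot", "color": "grey", "price": 120},
]

# AND: both conditions must be true.
grey_and_cheap = [s["name"] for s in shoes
                  if s["color"] == "grey" and s["price"] < 50]

# OR: at least one of the conditions must be true.
grey_or_pink = [s["name"] for s in shoes
                if s["color"] == "grey" or s["color"] == "pink"]

# NOT: keep everything except the rows that match the condition.
not_grey = [s["name"] for s in shoes if not s["color"] == "grey"]

print(grey_and_cheap)  # ['runner']
print(grey_or_pink)    # ['runner', 'loafer', 'boot']
print(not_grey)        # ['loafer', 'sandal']
```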
WIDE DATA is a data set in which every data subject has a single row, with multiple columns to hold the values of the subject's various attributes.
LONG DATA is data in which each row represents one observation per subject, so each subject is represented by multiple rows.
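The difference is easier to see with a tiny example. This is a minimal sketch in plain Python that reshapes wide rows into long rows. The countries, years, sales figures and the helper name `wide_to_long` are all invented for illustration.

```python
# Wide format: one row per subject, one column per year (made-up numbers).
wide_rows = [
    {"country": "Ghana", "2019": 10, "2020": 12},
    {"country": "Kenya", "2019": 7, "2020": 9},
]


def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn each non-id column of a wide row into its own long row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every long row instead
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows


long_rows = wide_to_long(wide_rows, "country", "year", "sales")
for row in long_rows:
    print(row)
```

The 2 wide rows become 4 long rows, one per country and year.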
DATA TRANSFORMATION is the process of changing the data's format, structure, or values. It includes:

💠 Adding, copying and replicating data
💠 Deleting fields or records
💠 Standardizing the names of variables
💠 Renaming, moving or combining columns in a database
💠 Joining one set of data with another
💠 Saving a file in a different format
The goals of data transformation include:

➡️ Data organization
➡️ Data migration
➡️ Data enhancement
➡️ Data comparison
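As one example of a transformation, here is a rough sketch of standardizing the names of variables in Python. The snake_case rule, the helper name `standardize_name` and the sample record are my own assumptions, not a fixed standard.

```python
import re


def standardize_name(name):
    """Convert a column name to lowercase snake_case."""
    name = name.strip().lower()
    # Replace every run of characters that is not a letter or digit with "_".
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


# Hypothetical record with inconsistent column names.
raw_record = {"Customer ID": 101, " First Name ": "Ada", "Total-Spend($)": 59.5}
clean_record = {standardize_name(key): value for key, value in raw_record.items()}

print(clean_record)
# {'customer_id': 101, 'first_name': 'Ada', 'total_spend': 59.5}
```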
DATA BIAS is a type of error that systematically skews results in a certain direction. There are 4 types of bias:
- SAMPLING BIAS is when a sample isn't representative of the population as a
whole.
- OBSERVER BIAS (experimenter bias/research bias) is the tendency for
different people to observe things differently.
- CONFIRMATION BIAS is the tendency to search for or interpret
information in a way that confirms pre-existing beliefs.
- INTERPRETATION BIAS is the tendency to always interpret
ambiguous situations in a positive or negative way.
To identify good data sources, check that they meet these criteria:
⭕ Your data must be RELIABLE (from a good source)
⭕ Your data must be ORIGINAL
⭕ Your data must be COMPREHENSIVE
⭕ Your data must be CURRENT (up to date)
⭕ Your data must be CITED
If you don't follow these practices, you will end up with bad data.
Just as humans have ethics, data also has ethics. DATA ETHICS refers to well-founded standards of right and wrong that dictate how data is collected, shared and used.

There are various aspects of data ethics, which include:
- Ownership: Individuals own the raw data they provide, and they have primary control over its usage, how it's processed and how it's shared.
- Transaction transparency: All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
- Consent: An individual's right to know explicit details about how and why their data will be used before agreeing to provide it.
- Currency: Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
- Privacy: Preserving a data subject's information and activity any time a data transaction occurs.
DATA ANONYMIZATION is the process of protecting people's private or sensitive data by eliminating PERSONALLY IDENTIFIABLE INFORMATION (PII), which is information that can be used by itself or with other data to track down a person's identity.
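Here is a minimal sketch of two simple anonymization steps in Python: dropping PII fields and replacing an ID with a salted hash. The field names, the salt and the record are hypothetical. Strictly speaking, hashing an ID is closer to pseudonymization, so real projects usually need more than this.

```python
import hashlib

# Assumed list of PII fields for this made-up data set.
PII_FIELDS = {"name", "email", "phone"}


def anonymize(record, salt):
    """Drop PII fields and replace the customer ID with a salted hash."""
    cleaned = {key: value for key, value in record.items()
               if key not in PII_FIELDS}
    raw_id = salt + str(record["customer_id"])
    digest = hashlib.sha256(raw_id.encode("utf-8")).hexdigest()
    cleaned["customer_id"] = digest[:12]  # shortened only for readability
    return cleaned


record = {
    "customer_id": 101,
    "name": "Ada Obi",
    "email": "ada@example.com",
    "phone": "000-0000",
    "item": "bread",
}
safe_record = anonymize(record, salt="my-secret-salt")

print(sorted(safe_record))  # ['customer_id', 'item']
```

The analysis columns (like `item`) survive, while the fields that point to a real person are gone.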
OPEN DATA and its features.
OPEN DATA refers to free access, usage and sharing of data. Features of
open data include:
- Open data must be available as a whole.
- It must be provided under terms that allow reuse and redistribution,
including the ability to use it with other data sets.
DATA INTEROPERABILITY is the ability of data systems and services to openly connect and share data.
A DATABASE is a collection of data stored in a computer system. There are 4 database concepts I learnt about:

⭕ PRIMARY KEY is a unique identifier for each record within a table.
⭕ FOREIGN KEY is a field within a table that is a primary key in
another table.
⭕ RELATIONAL DATABASE is a database that contains a series of tables
that can be connected to form relationships. Basically, they allow
analysts to organize and link data based on what the data has in
common.
⭕ NORMALIZATION is a process of organizing data in a relational
database, for example creating tables and establishing
relationships between those tables. It is applied to eliminate
data redundancy, increase data integrity and reduce complexity
in a database.
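To see how these ideas fit together, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their rows are made up. Customer names are stored once and orders point to them through a foreign key, which is the kind of structure normalization aims for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,          -- primary key
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),  -- foreign key
    item        TEXT NOT NULL
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Kofi');
INSERT INTO orders VALUES (10, 1, 'bread'), (11, 1, 'cake'), (12, 2, 'bread');
""")

# Link the two tables through the value they have in common.
rows = conn.execute("""
    SELECT customers.name, orders.item
    FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    ORDER BY orders.order_id
""").fetchall()
print(rows)  # [('Ada', 'bread'), ('Ada', 'cake'), ('Kofi', 'bread')]

# An order for a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 'pie')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```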
I also learnt about METADATA. METADATA is data about data. It is used in database management to help data analysts interpret the contents of the data within a database.
There are 3 types of metadata:
➡️ DESCRIPTIVE METADATA describes a piece of data and can be used to
identify it at a later point in time.
➡️ STRUCTURAL METADATA indicates how a piece of data is organized
and whether it is part of one or more than one data collection.
➡️ ADMINISTRATIVE METADATA indicates the technical source of a
digital asset.
METADATA tells you the Who, What, When, Where, Which, Why and How of the data. You can find metadata in PHOTOS, EMAILS, SPREADSHEETS AND ELECTRONICALLY CREATED DOCUMENTS.
Elements of metadata include:

💠 File or document type
💠 Date, time and creator
💠 Title and description
💠 Geolocation
💠 Tags and categories
💠 Who last modified it and when
💠 Who can access or update it.
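Some of these elements can be read straight from a file. Below is a minimal sketch using Python's standard library on a temporary file. The file name `sales_report.csv` and the helper `describe_file` are made up, and this only covers the few elements the operating system keeps (type, size, last modified time).

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Collect a few basic metadata elements for a file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "file_type": os.path.splitext(path)[1],
        "size_bytes": info.st_size,
        "last_modified": modified.isoformat(),
    }


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sales_report.csv")
    # newline="" keeps the line endings exactly as written on every platform.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write("id,amount\n1,20\n")
    metadata = describe_file(path)

print(metadata["file_name"], metadata["file_type"])  # sales_report.csv .csv
```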
The benefit of metadata is that it makes data RELIABLE AND CONSISTENT.
METADATA REPOSITORIES help data analysts ensure their data is reliable and consistent. They describe where the metadata came from and store it in an accessible form with a common structure.
BigQuery is a data warehouse on Google Cloud Platform used to query and filter large datasets.

Most importantly, I learnt how to use BigQuery and query datasets using the clauses
⭕ SELECT
⭕ FROM
⭕ WHERE
These are STRUCTURED QUERY LANGUAGE (SQL) clauses that help you easily get specific data (e.g. a customer ID, or the number of customers who bought sugar balls from the bakery) from a large range of data, making your analysis process easier. SQL also helps you retrieve and manipulate data from a database.
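Here is a small sketch of a SELECT, FROM, WHERE query. I ran my queries in BigQuery, but to keep this example self-contained it uses Python's built-in `sqlite3` module as a stand-in, so the setup code is not BigQuery's. The `purchases` table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "sugar balls"), (2, "bread"), (3, "sugar balls"), (1, "bread")],
)

query = """
    SELECT customer_id            -- which column to return
    FROM purchases                -- which table to read
    WHERE item = 'sugar balls'    -- which rows to keep
    ORDER BY customer_id
"""
buyers = [row[0] for row in conn.execute(query)]

print(buyers)       # [1, 3]
print(len(buyers))  # 2
```

SELECT picks the column, FROM names the table, and WHERE filters the rows down to the customers who bought sugar balls.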
ORGANIZING and SAVING DATA.

There are best practices organisations can follow when organizing data. They are listed below:
⭕ NAMING CONVENTIONS, which are consistent guidelines that describe
the content, date, or version of a file in its name. This means using logical and
descriptive names for your files to make them easier to find and use.
⭕ FOLDERING, which is organizing files into folders so that you can easily
find your files.
⭕ ARCHIVING, which is moving old projects to a separate
location to create an archive and cut down on clutter.
⭕ ALIGN YOUR NAMING and STORAGE practices with your team.
⭕ Develop METADATA practices, like creating a file that outlines project
naming conventions for easy reference.
Storing the same data in different places (a database or a spreadsheet) can make the data
contradict itself, and it also takes up a lot of space.
A RELATIONAL DATABASE can help you avoid data duplication and store data more efficiently.
File-naming conventions help you organize, access, process and analyze data because they act as quick reference points to identify what's in a file.
One important practice is to decide on file-naming conventions as a team or company early in a project. It's also critical to ensure that file names are meaningful, consistent, and easy to read. File names should include:

➡️ Project's name
➡️ File creation date
➡️ Revision version
➡️ Consistent style and order
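A naming convention is easy to keep consistent if a small helper builds the name for you. This is just a sketch of one possible convention (project name, then date as YYYYMMDD, then a two-digit version). The pattern and the function name `build_file_name` are my own assumptions, so agree on the actual style with your team.

```python
from datetime import date


def build_file_name(project, created, version, extension):
    """Build a name like SalesReport_20230114_v02.xlsx."""
    project_part = "".join(word.capitalize() for word in project.split())
    return f"{project_part}_{created:%Y%m%d}_v{version:02d}.{extension}"


file_name = build_file_name("sales report", date(2023, 1, 14), 2, "xlsx")
print(file_name)  # SalesReport_20230114_v02.xlsx
```

Putting the date in year, month, day order means the files also sort chronologically in a folder.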
DATA SECURITY means protecting data from unauthorized access or corruption by adopting safety measures.
- Excel and Google Sheets both have features that help you protect spreadsheets from being edited, from the entire worksheet right down to single cells in a table.
- They both have access control features like password protection and user permissions.

We can keep our data safe and still have access to it any time we want to use it for analysis. Security measures that can help companies do both are:

‼️ ENCRYPTION. Uses a unique algorithm to alter data and make it unusable by users and applications that don't know the algorithm. This algorithm is saved as a 'KEY', which can be used to reverse the encryption, so if you have the key you can still use the data in its original form.

‼️ TOKENIZATION. Replaces the data elements you want to protect with randomly generated data referred to as a TOKEN. The original data is stored in a separate location and mapped to the tokens.
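Here is a toy sketch of tokenization in Python. The `TokenVault` class and the card number are made up, and a real system would keep the mapping in a separate, locked-down store rather than in memory.

```python
import secrets


class TokenVault:
    """Keeps the original values apart from the tokens that replace them."""

    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value):
        token = secrets.token_hex(8)  # 16 random hex characters
        self._token_to_value[token] = value
        return token

    def detokenize(self, token):
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111-1111-1111-1111"  # made-up value
token = vault.tokenize(card_number)

print(token != card_number)                    # True
print(vault.detokenize(token) == card_number)  # True
```

The token is random, so it reveals nothing about the original value. Only someone with access to the vault can map it back.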

VERSION CONTROL enables all collaborators on a file to track changes over time. You can understand who made what changes to a file, when they were made and why.
