DEV Community: Kipngeno Ruto.

Exploratory Data Analysis

Kipngeno Ruto. — Wed, 11 Oct 2023 19:16:56 +0000

Exploratory Data Analysis (EDA) is a critical step in the life cycle of every data science project. It helps you find patterns in the data and surface inconsistencies or data quality concerns such as outliers, missing values, and incorrect data types. Most importantly, it aids the preliminary selection of suitable models. In short, EDA is like a detective conducting an investigation.

Exploratory data analysis falls into four categories:
1. Univariate non-graphical
Univariate non-graphical analysis is typically used to examine a single variable or feature. Assume you have a dataset containing the income of people in a specific country; a summary statistic such as the mean gives the average income of people residing in that country.
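
As a minimal sketch of the income example, the snippet below computes a few summary statistics for a single variable using only Python's standard library. The income values are made up for illustration; with pandas, `Series.describe()` would give a similar summary.

```python
import statistics

# Hypothetical monthly incomes (made-up values, for illustration only)
incomes = [1200, 1500, 1500, 1800, 2100, 2400, 9000]

# Summary statistics for one variable
summary = {
    "count": len(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "stdev": statistics.stdev(incomes),
    "min": min(incomes),
    "max": max(incomes),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean sits well above the median, which hints that a single large value (a possible outlier) is pulling the average up.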

2. Graphical univariate
Apart from summary statistics, we can also visualize the data for better understanding. This is the most common method, particularly when presenting findings to non-technical teams, because the information is easy to convey. The graphical univariate method visualizes only one variable or feature at a time. You can use many graphical representations, such as bar graphs, line graphs, and so on.
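
To illustrate the idea without any plotting library, here is a rough text histogram of one variable. The income values and the bin width are assumptions for illustration; in practice you would more likely draw this with Matplotlib or Seaborn.

```python
from collections import Counter

# Hypothetical monthly incomes (made-up values, for illustration only)
incomes = [1200, 1500, 1500, 1800, 2100, 2400, 9000]

BIN_WIDTH = 1000  # assumed bin width; tune it to your data


def histogram(values, bin_width):
    """Count how many values fall into each [start, start + bin_width) bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bins = histogram(incomes, BIN_WIDTH)

# Draw one row per non-empty bin, one '#' per observation
for start, count in bins.items():
    print(f"{start:>5}-{start + BIN_WIDTH - 1:<5} {'#' * count}")
```

Empty bins are skipped in this sketch, so the gap between the 2000 bin and the 9000 bin shows up only in the labels.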

3. Multivariate non-graphical
Multivariate non-graphical analysis, as the name suggests, applies summary statistics to more than one variable. For example, management may want to understand the various factors contributing to customer churn. In this case, the data scientist looks at several features to see how each contributes to customer churn in the organization and produces tabular data containing the results.
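
A minimal sketch of the churn example follows. It relates two features to churn and prints the results as small tables. The customer records and the feature names (`contract_type`, `support_calls`) are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records (made-up values):
# (contract_type, support_calls, churned)
customers = [
    ("monthly", 5, True),
    ("monthly", 4, True),
    ("monthly", 1, False),
    ("yearly", 0, False),
    ("yearly", 2, False),
    ("yearly", 6, True),
]


def churn_rate_by_contract(records):
    """Return {contract type: share of customers who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for contract, _calls, has_churned in records:
        totals[contract] += 1
        churned[contract] += has_churned  # True counts as 1
    return {contract: churned[contract] / totals[contract] for contract in totals}


by_contract = churn_rate_by_contract(customers)

# Average number of support calls for churned vs. retained customers
calls_by_outcome = {
    outcome: mean(calls for _c, calls, has_churned in customers if has_churned == outcome)
    for outcome in (True, False)
}

print(f"{'contract':<10}{'churn rate':>12}")
for contract, rate in by_contract.items():
    print(f"{contract:<10}{rate:>12.0%}")

print(f"{'churned':<10}{'avg calls':>12}")
for outcome, avg_calls in calls_by_outcome.items():
    print(f"{str(outcome):<10}{avg_calls:>12.1f}")
```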

4. Graphical multivariate
This method visualizes more than one variable to show their distributions and the relationships between them. In the example above, instead of presenting the results in tabular format, the data scientist can visualize them so they are easier to understand. Here, too, you can use the various graphical methods available, e.g. line graphs, bar graphs, etc.

Database Keys

Kipngeno Ruto. — Sat, 30 Sep 2023 20:57:58 +0000

What is a key?

A key is a data item that allows us to uniquely identify an individual occurrence or record of an entity (table) in a database.

Types of keys:

1. Primary key: One or more columns or attributes that distinguish a specific record from another, e.g. employee_id in the employee table,
or employee_id and id_number used together as the primary key.

2. Foreign key: One or more attributes in an entity/table that enable a relationship to another entity,

e.g. deptno in the employee table.

3. Candidate key: Any column (or combination of columns) that could be used as the primary key.

4. Secondary key: A candidate key that is not chosen as the primary
key. The primary key is selected from the candidate keys, and those not selected are referred to as secondary keys.

5. Simple key: A single attribute that uniquely distinguishes a specific record from another, e.g. employee_id only or id_number only.

6. Compound key: A key made up of more than one attribute, where each attribute making up the compound key is a simple key on its own.

7. Composite key: A key made up of more than one attribute, where not every attribute making up the composite key is a simple key on its own,

e.g. employee_id and employee_jobtitle: employee_id is a simple key, but employee_jobtitle is not a simple key on its own.

8. Artificial key: A key generated by a business, e.g. VINs for vehicles, ISBNs for books.

9. Natural key: A key derived from real-world occurrences, e.g.
SSN, ID number.

10. Surrogate key: A key that uniquely identifies a row and has no relationship with the record beyond identifying it. It is system generated and is mainly used in OLAP systems to maintain history.

Advantages of surrogate keys:

• They are good for maintaining history and are used mainly in OLAP, e.g. in slowly changing dimensions.

• They are immutable and perform well because of their compact data type (such as a four-byte integer).

• They are less expensive to join (fewer columns to compare).

• Uniformity: when every table has a uniform surrogate key, some tasks can be easily automated by writing the code in a table-independent way.

Disadvantage:

• Disassociation: the values of generated surrogate keys have no relationship with the real-world meaning of the data held in a row.

11. Business key: Unlike a surrogate key, a business key uniquely identifies an object or a record in the
database using a value that carries business meaning.
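
To make several of these key types concrete, here is a small sketch using Python's built-in `sqlite3` module. The schema is hypothetical: the `department` and `employee_job` tables, the `employee_key` column, and the sample values are assumptions for illustration, and treating employee_id as the business key is a choice made for this sketch. Only employee_id, id_number, deptno, and employee_jobtitle come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE department (
    deptno INTEGER PRIMARY KEY,   -- simple primary key
    name   TEXT NOT NULL
);
CREATE TABLE employee (
    employee_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (system generated)
    employee_id  TEXT NOT NULL UNIQUE,   -- business key (assumed for this sketch)
    id_number    TEXT NOT NULL UNIQUE,   -- another candidate key, not chosen as primary
    deptno       INTEGER NOT NULL REFERENCES department(deptno)  -- foreign key
);
CREATE TABLE employee_job (
    employee_id       TEXT NOT NULL,
    employee_jobtitle TEXT NOT NULL,
    PRIMARY KEY (employee_id, employee_jobtitle)  -- key made of more than one column
);
""")


def try_insert(sql):
    """Run an INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


conn.execute("INSERT INTO department VALUES (10, 'Finance')")
conn.execute(
    "INSERT INTO employee (employee_id, id_number, deptno) "
    "VALUES ('E001', '12345678', 10)"
)

# Same employee_id again: the UNIQUE constraint rejects it
duplicate = try_insert(
    "INSERT INTO employee (employee_id, id_number, deptno) "
    "VALUES ('E001', '99999999', 10)"
)
# Department 99 does not exist: the foreign key rejects it
orphan = try_insert(
    "INSERT INTO employee (employee_id, id_number, deptno) "
    "VALUES ('E002', '87654321', 99)"
)
# Multi-column key: only the exact (employee_id, employee_jobtitle) pair must be unique
first = try_insert("INSERT INTO employee_job VALUES ('E001', 'Analyst')")
second = try_insert("INSERT INTO employee_job VALUES ('E001', 'Manager')")
repeat = try_insert("INSERT INTO employee_job VALUES ('E001', 'Analyst')")

# The surrogate key was generated by the database, not supplied by us
surrogate = conn.execute(
    "SELECT employee_key FROM employee WHERE employee_id = 'E001'"
).fetchone()[0]

for label, result in [("duplicate", duplicate), ("orphan", orphan),
                      ("first", first), ("second", second), ("repeat", repeat)]:
    print(f"{label}: {result}")
print("surrogate key for E001:", surrogate)
```

Note that the generated `employee_key` value says nothing about the employee, which is the disassociation trade-off mentioned above.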

Data Science for Beginners: 2023 - 2024 Complete Roadmap

Kipngeno Ruto. — Sat, 30 Sep 2023 12:28:22 +0000

Over the past couple of years, businesses and institutions in sectors ranging from retail, banking, and health to transport and agriculture have had a rising need for data scientists.

But what do data scientists do, and what does it take to become one?

Here is an example to break this down.

You operate a supermarket, and the common activities are receiving supplies from your various suppliers and selling these goods to customers. All these activities generate data which, if well utilized, can become a great asset for your management in running the business successfully.

Therefore, data scientists are professionals responsible for collecting the data generated by your business, then organizing, cleaning, and exploring it to extract insights or meaning that can be used for decision-making.

An example of such an insight: by analyzing sales data and inventory levels, you can identify patterns and trends in product demand.
This information can help the supermarket optimize its inventory management. For example, if the data shows that a particular product has a higher demand on weekdays but has lower demand during weekends, the supermarket can adjust its restocking schedule to ensure the product is available when it's needed most. This can lead to reduced carrying costs and increased sales, ultimately improving the supermarket's profitability.
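
As a rough sketch of the weekday versus weekend comparison, the snippet below averages daily unit sales for one product by day type. The sales figures are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical daily unit sales for one product (made-up values)
daily_sales = {
    date(2023, 9, 25): 40,  # Monday
    date(2023, 9, 26): 42,
    date(2023, 9, 27): 38,
    date(2023, 9, 28): 41,
    date(2023, 9, 29): 39,  # Friday
    date(2023, 9, 30): 20,  # Saturday
    date(2023, 10, 1): 18,  # Sunday
}


def average_by_day_type(sales):
    """Average units sold on weekdays vs. weekends."""
    groups = defaultdict(list)
    for day, units in sales.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday
        day_type = "weekend" if day.weekday() >= 5 else "weekday"
        groups[day_type].append(units)
    return {day_type: mean(units) for day_type, units in groups.items()}


averages = average_by_day_type(daily_sales)
for day_type, avg_units in averages.items():
    print(f"{day_type}: {avg_units:.1f} units per day on average")
```

With these numbers, weekday demand is roughly double weekend demand, which is the kind of pattern that could inform a restocking schedule.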

Data scientists then use this data to build prediction systems or ML models that can be used to predict sales in the coming days, weeks, etc., so that the supermarket can stock goods accordingly and minimize the losses that come from overstocking or understocking.

Now, what does it take to become a Data Scientist?

Here is the roadmap:
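
The simplest possible stand-in for such a prediction system is a moving-average forecast, sketched below. This is a naive baseline rather than an ML model, and the weekly sales figures and window size are assumptions for illustration; a real project would compare it against models built with a library such as Sklearn.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return sum(history[-window:]) / window


# Hypothetical weekly unit sales (made-up values)
weekly_sales = [120, 130, 125, 140, 150, 145]

forecast = moving_average_forecast(weekly_sales)
print(f"Forecast for next week: {forecast:.1f} units")
```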

1. Math Skills: Basic math skills, including probability, basic statistics, and some calculus, are essential for understanding and working with data.

2. Programming skills: The most common data science programming language is Python. Start by learning the basics through an online course, blogs, or YouTube, depending on your learning preference.

3. Data science packages: Most of the time you will be doing data cleaning and manipulation, and the Python ecosystem already provides packages for this.

The commonly used data science packages are:

Pandas for most data manipulation and exploration
NumPy for numeric operations
Matplotlib and Seaborn for visualization
Sklearn (scikit-learn), which contains algorithms used for building ML models

The most important thing to accelerate your learning is to connect with people already in the field, learn from their experiences, and build on what you're learning.