Gregory Vicent

Posted on Dec 11, 2022 • Edited on Dec 13, 2022

Basic Statistics for Data Science

#coding

A little warning:

Hi, I apologize in advance for any mistakes in my English. I write these posts partly because I love sharing what little I know and partly because I want to improve my English skills.

Please correct me if you find any errors here.

Oh no, Mathematics!!!

OK, I know that math can be very intimidating. Most of us struggled with it all through school with one question in mind:

'And what is this going to do for me?'

The truth is that mathematics helps us in many professions and in many areas of knowledge.

What's more, if you feel like it, mathematics can be really exciting.

And this is our case today.

Mathematics is very important for Data Science, in particular the branch of Statistics.

This is why we will review the basic concepts of Descriptive Statistics for Data Science.

Now, What's Descriptive Statistics?

Descriptive Statistics is the set of mathematical techniques that help us collect, organize, present and describe a data set in order to make the data useful.

We often work with data tables called Data Frames. This is the basic data structure in Data Science, but it's not the only one.

Many programming languages and libraries use this row-and-column structure to represent data. This is the case of Python with its famous library Pandas.

Now, software like Pandas interprets these data by their type. And this is another important statistical concept.

In Descriptive Statistics we have two data types:

Quantitative or Numerical: Any data that can be represented in numerical form. We can divide this category into two more.
- Discrete: Numerical data represented by integers. E.g: years, populations, age, etc.
- Continuous: Numerical data represented by decimal numbers. E.g: weight, height, temperature, etc.
Categorical: Any data that can be assigned to a category. In general these data are text, and they can be divided into two more categories.
- Binary: These data can only have two possible values. E.g: True or False, closed or open, 0 or 1, etc.
- Ordinal: Here the data have naturally ordered categories, and the distances between the categories are not known. E.g: days of the week, clothing sizes (S, M, L), satisfaction ratings, etc. Categories with no natural order, such as animals or countries, are called nominal rather than ordinal.
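As a rough illustration of how these types might look in Pandas, here is a small sketch. The column names and values are invented for this example, and the mapping from each statistical type to a Pandas dtype is only my assumption about a reasonable choice, not the only possible one.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "age": [31, 45, 27],               # quantitative, discrete (integers)
    "height_m": [1.72, 1.65, 1.80],    # quantitative, continuous (decimals)
    "is_member": [True, False, True],  # categorical, binary
    "shirt_size": pd.Categorical(      # categorical, ordinal
        ["M", "S", "L"],
        categories=["S", "M", "L"],    # the order of the categories: S < M < L
        ordered=True,
    ),
})

# Show the type Pandas assigned to each column.
print(df.dtypes)

# Because the category is ordered, Pandas can compare its values.
smallest_size = df["shirt_size"].min()
print(smallest_size)
```

Note that an ordered category only says that S comes before M and M before L. It says nothing about how far apart they are, which matches the definition of ordinal data.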

Now that we know a bit more about our individual data, it's time to explore the data set as a whole.

Measures of Central Tendency

There are many operations that we can do on data sets to get useful information.

The first are those that help us understand where the values of our data are concentrated.

Here we find operations such as the mean, median, trimmed mean, weighted mean and mode. Of course, these are not all the measures of central tendency, but they are a good place to start.

Mean

The most basic estimate of location is the mean. The mean is the sum of all values in a data set divided by the number of values.

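Its formula is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $n$ is the number of values and $x_i$ is the i-th value.

E.g: for the values 2, 4, 6 and 8, the mean is (2 + 4 + 6 + 8) / 4 = 5.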

However, the mean has a problem: it is easily influenced by extreme values, which are often errors made when the data were collected.
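To see how much one extreme value can move the mean, here is a minimal Python sketch. The ages are made up for illustration, and using the standard-library `statistics` module is just my choice for this example.

```python
from statistics import mean

# Hypothetical ages, invented for this example.
ages = [23, 25, 27, 29, 31]

# Mean = sum of all values divided by the number of values.
manual_mean = sum(ages) / len(ages)
print(manual_mean)  # 27.0

# The standard library gives the same result.
library_mean = mean(ages)
print(library_mean)

# One extreme value (perhaps a typing error: 290 instead of 29)
# drags the mean far above every ordinary value in the data.
ages_with_outlier = ages + [290]
mean_with_outlier = mean(ages_with_outlier)
print(mean_with_outlier)
```

With the extra value, the mean ends up larger than every one of the original ages, so it no longer describes a typical person in the data.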

To reduce this sensitivity to extreme values we have other measures of central tendency, such as the median, which depends on the position of the values rather than on their magnitude.

Median

The Median is the value that is right in the middle of our data once the data are sorted.

Since it depends on the order of the values and not on their magnitude, it is not influenced by extreme values.

These extreme values are called outliers and are not always errors. There are times when outliers are true values that are simply not representative of our overall data.

The Median is really easy to compute, but it has two cases: when the number of values is odd and when it is even.

In both cases the first step is to sort our data from smallest to largest.

Now, when the number of values is odd, we add one to the total number of elements and divide by two. The result is the position within our sorted data set of the value that we take as the median.

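This is its formula:

$$\text{Median} = x_{\left(\frac{n+1}{2}\right)}$$

where $n$ is the number of values and $x_{(k)}$ is the value at position $k$ in the sorted data.

E.g: for the sorted values 1, 3, 5, 7 and 9 we have n = 5, so the position is (5 + 1) / 2 = 3 and the median is 5.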

When the number of values is even, we only have to make a small change: now we take two values. The total number of values divided by two gives the position of the first value in our sorted data, and the second is the value at that position plus one, that is, the next value.

Both values are added and the result is divided by two. This is the median in this case.

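This is its formula:

$$\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$$

where $n$ is the number of values and $x_{(k)}$ is the value at position $k$ in the sorted data.

E.g: for the sorted values 1, 3, 5, 7, 9 and 11 we have n = 6, so we take the values at positions 3 and 4, which are 5 and 7, and the median is (5 + 7) / 2 = 6.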
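Both cases of the median can be sketched in a few lines of Python. The function name `median_by_position` and the sample values are hypothetical, chosen only for this example. In real work you would probably call `statistics.median` from the standard library, which I use here just to check the result.

```python
from statistics import median


def median_by_position(values):
    """Median using the odd/even position rule, with positions counted from 1."""
    ordered = sorted(values)  # first step: sort from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd number of values: take position (n + 1) / 2.
        position = (n + 1) // 2
        return ordered[position - 1]  # Python lists count from 0
    # Even number of values: average positions n / 2 and n / 2 + 1.
    first = ordered[n // 2 - 1]
    second = ordered[n // 2]
    return (first + second) / 2


# Hypothetical data, invented for this example.
odd_data = [7, 1, 5, 3, 9]
even_data = [7, 1, 5, 3, 9, 300]  # same data plus one outlier

odd_median = median_by_position(odd_data)
even_median = median_by_position(even_data)
print(odd_median)   # 5
print(even_median)  # 6.0
```

Notice that adding the outlier 300 only moves the median from 5 to 6, while it would change the mean a lot.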

However, because the median looks only at the middle positions, the magnitude of most values does not influence it, and this can be a problem in some cases. For this there is also a solution in another measure of central tendency: the Trimmed Mean.

Trimmed Mean

The Trimmed Mean is a variation of the Mean, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values.
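Here is a minimal sketch of that idea in Python. The function name `trimmed_mean`, the parameter `k` (how many values to drop at each end) and the sample data are my own choices for illustration, not a standard API.

```python
def trimmed_mean(values, k):
    """Mean of the data after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must be non-negative and leave at least one value")
    kept = ordered[k:len(ordered) - k]  # drop k values at each end
    return sum(kept) / len(kept)


# Hypothetical data, invented for this example, with one outlier (300).
data = [7, 1, 5, 3, 9, 300]

plain_mean = sum(data) / len(data)
trimmed = trimmed_mean(data, 1)  # drops 1 and 300, averages 3, 5, 7, 9
print(plain_mean)
print(trimmed)  # 6.0
```

With `k = 0` nothing is dropped and the result is the ordinary mean, so the trimmed mean sits somewhere between the mean and the median depending on how much you trim.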

Post in process...
