Shubham Singh

Understanding Data For Data Analytics, Data Science, and Machine Learning – Part-1

All the tech giants rely on data; in this era, it is one of the things that drives the world.
Data makes your outcomes more accurate and reliable, not only in computer science but also in other major fields of study such as medicine, archaeology, architecture, finance, and many more.

To understand data, you need to understand the basics of statistics, algebra, and probability. This includes measures of central tendency, the spread of data, estimation, chance of occurrence, shape and distribution, bias and variance, and much more.

The world of data is vast, so be patient and take one thing at a time.

Things to know beforehand

All the code is written in R.

  • Sample
    A sample statistic is a piece of statistical information you get from a handful of items.
    A sample is just a part of a population. For example, let’s say your population was every American, and you wanted to find out how much the average person earns. Time and finances stop you from knocking on every door in America, so you choose to ask 1,000 random people. These one thousand people are your sample.
    Once you have your sample, you’ll get some kind of statistic. A statistic is really just a piece of information—in this example, average earnings.

  • Population
    In statistics, a population is a set of similar items or events that is of interest for some question or experiment.

  • Outliers
    An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
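To make the sample and population idea above concrete, here is a minimal sketch in R. The population is simulated: the `rlnorm` parameters and the variable names are assumptions chosen for illustration, not real earnings data.

```r
# Hypothetical population: simulated yearly earnings for 1,000,000 people.
# The distribution and its parameters are made up for illustration.
set.seed(42)
population <- rlnorm(1e6, meanlog = 10.5, sdlog = 0.6)

# Draw a simple random sample of 1,000 people (without replacement).
earnings_sample <- sample(population, size = 1000)

# The sample mean is a statistic that estimates the population mean.
population_mean <- mean(population)
sample_mean <- mean(earnings_sample)
print(c(population = population_mean, sample = sample_mean))
```

The two numbers will not match exactly, but with 1,000 random draws the sample mean usually lands close to the population mean.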

[1] Central Tendency

Central tendency is the study of the center of the data, and there are multiple ways to measure that center.

Central tendency is defined as “the statistical measure that identifies a single value as representative of an entire distribution.”

[1] Mean

It is a measure of center that is affected by outliers.

In R, the mean function takes a trim argument, a value between 0 and 0.5, giving the fraction of observations to trim from each end of the sorted data before the mean is computed.
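A small sketch of how an outlier pulls the mean, and how `trim` limits that effect. The vector `values` is made-up example data.

```r
# Made-up data: nine ordinary values and one outlier (100).
values <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 100)

# Plain mean: the outlier drags the result upward.
plain_mean <- mean(values)                  # 15.5

# trim = 0.1 drops 10% of the observations from each end of the
# sorted data (here: the smallest value 2 and the largest value 100).
trimmed_mean <- mean(values, trim = 0.1)    # 6.625

print(c(plain = plain_mean, trimmed = trimmed_mean))
```

The trimmed result sits much closer to where most of the values are, which is why trimming is a common way to reduce the influence of outliers.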

[a] Arithmetic mean

Arithmetic mean (or, simply, “mean”) is nothing but the average. It is computed by adding all the values in the data set divided by the number of observations in it. If we have the raw data, mean is given by the formula.

$$MEAN = \overline{X} = \frac{\Sigma X}{n}$$

mean(data)
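As a quick check of the formula, the sketch below computes the mean by hand (sum divided by the number of observations) and compares it with the built-in `mean`. The vector `x` is made-up example data.

```r
# Made-up example data.
x <- c(4, 8, 15, 16, 23, 42)

# Mean from the formula: sum of the values divided by their count.
manual_mean <- sum(x) / length(x)   # 108 / 6 = 18

# Built-in function for comparison.
builtin_mean <- mean(x)

print(c(manual = manual_mean, builtin = builtin_mean))
```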

Advantages

  • Good representation of data.
  • Repeated samples drawn from the same population tend to have similar means. The mean is therefore the measure of central tendency that best resists the fluctuation between different samples.

Disadvantages

  • Sensitive to outliers.
  • Cannot be calculated for nominal or nonnominal ordinal data.

[b] Weighted mean

When every entry of data doesn't have the same level of significance, then a weight $w_{i}$ is attached to the entry to represent its significance.

$$\text{Weighted Mean} = \frac{\Sigma W X}{\Sigma W}$$
weighted.mean(data, w)  # w is a vector of weights, one per entry of data
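A minimal sketch of a weighted mean, assuming a hypothetical case of three course scores weighted by credit hours. The names `scores` and `credits` are illustrative only.

```r
# Made-up data: three scores and the credit hours used as weights.
scores  <- c(80, 90, 70)
credits <- c(2, 3, 5)

# Built-in weighted mean: weights are passed as the second argument.
builtin_wm <- weighted.mean(scores, credits)

# Same result from the formula: sum(w * x) / sum(w).
manual_wm <- sum(credits * scores) / sum(credits)   # 780 / 10 = 78

print(c(builtin = builtin_wm, manual = manual_wm, unweighted = mean(scores)))
```

Note that calling `weighted.mean` without weights gives every entry the same weight, so it simply returns the ordinary mean.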

[c] Geometric Mean

GM is an appropriate measure when values change exponentially and in the case of a skewed distribution that can be made symmetrical by a log transformation.

$$\text{Geometric Mean} = \sqrt[n]{x_{1}x_{2}...x_{n}}$$
exp(mean(log(data)))
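The geometric mean can be computed directly as the n-th root of the product, or through logarithms. The sketch below, on made-up positive values, shows that both forms agree. Both forms assume every value is strictly positive.

```r
# Made-up data that grows by a factor of 10 at each step.
x <- c(1, 10, 100)

# Log form: average the logs, then transform back.
gm_log <- exp(mean(log(x)))

# Root form: n-th root of the product of the values.
gm_root <- prod(x)^(1 / length(x))

# Both are approximately 10, up to floating-point rounding.
print(c(log_form = gm_log, root_form = gm_root, arithmetic = mean(x)))
```

The log form is usually the safer choice for long vectors, because the raw product can overflow or underflow before the root is taken.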

[d] Harmonic mean

HM is appropriate in situations where the reciprocals of values are more useful. HM is used when we want to determine the average sample size of a number of groups, each of which has a different sample size.

$$\text{Harmonic Mean} = \frac{n}{\Sigma(\frac{1}{x})}$$
1 / mean(1 / data)
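A minimal sketch of the harmonic mean for the group-size case described above: the number of values divided by the sum of their reciprocals. The vector `group_sizes` is made-up example data.

```r
# Made-up data: sample sizes of three groups.
group_sizes <- c(10, 20, 40)

# Harmonic mean: n divided by the sum of the reciprocals.
hm <- length(group_sizes) / sum(1 / group_sizes)   # 3 / 0.175

# Equivalent form: reciprocal of the mean of the reciprocals.
hm_alt <- 1 / mean(1 / group_sizes)

print(c(harmonic = hm, harmonic_alt = hm_alt, arithmetic = mean(group_sizes)))
```

The harmonic mean comes out smaller than the arithmetic mean here because the small groups carry more influence once the values are inverted.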

DEGREE OF VARIATION BETWEEN THE MEANS

If all the values in a data set are the same, then all three means (arithmetic mean, GM, and HM) will be identical. As the variability in the data increases, the difference among these means also increases. For positive values that are not all equal, the arithmetic mean is always greater than the GM, which in turn is always greater than the HM.
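The relationship between the three means can be checked with a short sketch. The helper functions `geometric_mean` and `harmonic_mean` are defined here for illustration (they are not built into base R), and both data vectors are made up.

```r
# Illustrative helpers (not part of base R); both assume positive values.
geometric_mean <- function(x) exp(mean(log(x)))
harmonic_mean  <- function(x) length(x) / sum(1 / x)

# Return all three means for one vector.
three_means <- function(x) {
  c(AM = mean(x), GM = geometric_mean(x), HM = harmonic_mean(x))
}

constant_data <- c(5, 5, 5)   # no variability
varied_data   <- c(2, 4, 8)   # some variability

means_constant <- three_means(constant_data)   # all three equal 5
means_varied   <- three_means(varied_data)     # AM > GM > HM

print(means_constant)
print(means_varied)
```

For `varied_data` the arithmetic mean is 14/3, the GM is 4, and the HM is 3/0.875, so the ordering AM > GM > HM holds, while for `constant_data` the three values coincide.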

For Part-2 go here
