Data Transformed into Knowledge, & Knowledge into Power

#numericalmethods #linearregressions #programming #computerscience

“It is about becoming a mathematical thinker, not a calculator”

Ask a crowd of people whether they agree that knowledge is power, and you will surely get a BIG YES as an answer. The 21st century is experiencing the fourth industrial revolution: data. We rarely think about the amount of new data we create every day without noticing it. Thanks to some mathematical thinkers, linear regressions can be used to give meaning to the data that surrounds us and transform it into knowledge, so that the people who manage industries and companies can make the right decisions for their businesses.

And once again, mathematics is behind this discovery: linear regressions. The method belongs to a field called Numerical Methods: high-performance, high-precision algorithms, often iterative, designed to solve numerical problems by converging on an approximation of the solution to a mathematical problem. What a linear regression does is find an approximate relation between one or more independent variables and a dependent variable, which lie on the x-axis and y-axis respectively. This relation is expressed as an equation chosen under the least squares criterion, that is, the line that minimizes the sum of squared differences between the observed and the predicted values, because we are seeking an equation accurate enough to make future predictions. For example, we could estimate the expected number of car accidents in a certain location, like Miami, over a certain period, like New Year's Eve.

Linear Regression Model

Now that we know what linear regressions are for, I'm going to describe the steps to implement one with a simple example. First, it is important to know that a linear regression needs data as input in order to give statistical results as output. The steps to follow are:

create a function for the least squares (lsq) method
create/import data
choose feature(s) as your independent variable(s)
choose target as your dependent variable
assign coefficients m & c with the lsq method
define regression line (linear, logarithmic, exponential)
graph dispersion data with regression line
create a table of approximated predictions
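The first step can be sketched as follows. The article's own definitions of lsq_m and lsq_c are not shown, so the bodies below are an assumption: they use the standard closed-form least squares formulas for a line y = m*x + c, and the small x and y vectors are made up only to check the result.

```r
# Hypothetical implementations of the lsq helpers used later in the article.
# The original definitions are not shown, so these are an assumption.

# Slope m of the least squares line y = m*x + c
lsq_m <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

# Intercept c of the least squares line y = m*x + c
lsq_c <- function(x, y) {
  mean(y) - lsq_m(x, y) * mean(x)
}

# Quick check on illustrative points that lie exactly on y = 2x + 1
x <- c(1, 2, 3, 4)
y <- 2 * x + 1
lsq_m(x, y)  # 2
lsq_c(x, y)  # 1
```

Because the check points sit exactly on a line, the helpers recover its slope and intercept; with real data they return the line that minimizes the sum of squared vertical distances.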

This is the process to follow if you want to run linear regression models with your own methods. There is another way in R: some packages and functions fit linear regression models for you, so you don't have to write the lsq method or assign the coefficients m & c yourself. If you want to see this other option, I'll leave you my complete code for the linear regression demo.
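As a rough idea of what that other option can look like, here is a minimal sketch using lm() from base R's stats package. It is not necessarily the approach taken in the complete demo code; the data are the population figures from the example in the next section, and the names fit_linear and fit_exp are made up for illustration.

```r
# Population figures from the example in this article (millions)
years  <- c(1900, 1950, 1970, 1980, 1990, 2000, 2010)
people <- c(400, 557, 825, 981, 1135, 1266, 1370)

# lm() fits by least squares, so no hand-written lsq function is needed
fit_linear <- lm(people ~ years)        # people = m*years + c
fit_exp    <- lm(log(people) ~ years)   # ln(people) = m*years + c

# "(Intercept)" plays the role of c, "years" the role of m
coef(fit_linear)
coef(fit_exp)
```

Fitting log(people) instead of people is what turns the exponential model into a straight-line problem that lm() can handle.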

Linear Regression Example

For this example, I'll use a dataset that shows China's population in millions from 1900 through 2010. We'll compare linear and exponential regressions to see which fits best, and predict China's population for the following years.

#Independent Variable
years <- c(1900,1950,1970,1980,1990,2000,2010)
years2 <- seq(1900,2010,1)

#Dependent Variable
people <- c(400,557,825,981,1135,1266,1370)
people_ln <- log(people)

#Coefficients m & c from the least squares helpers (step 1 above),
#fitted on ln(people) so that ln(people) = m2*years + c2
m2 <- lsq_m(years, people_ln)
c2 <- lsq_c(years, people_ln)

#Exponential curve for every year: people = exp(c2) * exp(m2*year)
people2 <- exp(c2) * exp(m2 * years2)

Dispersion of Data

library(ggplot2)

ggplot() +
  geom_point(aes(x = years, y = people), colour = 'red') +
  ggtitle('Population by Year Dispersion') + xlab('Year') + ylab('Population')

Linear Regression Model

#Straight-line fit on the log scale: ln(people) against years
ggplot() +
  geom_point(aes(x = years, y = people_ln), colour = 'red') +
  geom_line(aes(x = years, y = m2*years + c2), colour = 'blue') +
  ggtitle('Example 2: Linear Regression') + xlab('Years') + ylab('ln(Population)')

It returns an accuracy of 91.06%, which means that 91.06% of the variance in China's population is explained by the year (this is the coefficient of determination, R-squared). With this indicator, we could accept it as a fair model, but not so fast! Let's see how the exponential regression model behaves before we jump to conclusions.

Exponential Regression Model

ggplot() +
  geom_point(aes(x = years, y = people), colour = 'red') +
  geom_line(aes(x = years2, y = people2), colour = 'blue') +
  ggtitle('Example 2: Exponential Regression') + xlab('Years') + ylab('Population')

As I suspected, the exponential regression model returns a better accuracy: 96.27% of the variance in China's population is explained by the year.
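The two accuracy figures can be checked with a short sketch. It assumes that "accuracy" here means R-squared, and it relies on the fact that, for a least squares line with a single predictor, R-squared equals the squared Pearson correlation between x and y. The names r2_linear and r2_exp are made up for illustration.

```r
# Population figures from the example (millions)
years  <- c(1900, 1950, 1970, 1980, 1990, 2000, 2010)
people <- c(400, 557, 825, 981, 1135, 1266, 1370)

# Linear model: straight line fitted to the raw population
r2_linear <- cor(years, people)^2

# Exponential model: straight line fitted to ln(population),
# so this R-squared is measured on the log scale
r2_exp <- cor(years, log(people))^2

# Both values as percentages
100 * c(linear = r2_linear, exponential = r2_exp)
```

The results should land close to the 91.06% and 96.27% quoted in the text. Keep in mind that the second value is computed on the log-transformed population, so the two percentages are not measured on exactly the same scale.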

Don't get confused: the linear regression model isn't bad, but for this dispersion of data, the exponential regression model has proved to be more accurate. So we can conclude that China's population shows exponential growth rather than linear growth.

Predictions

For this section of the article, we want to predict China's population from 2011 to 2020. These are the results:

Year	China Approximated Population (millions)
2011	1386.023
2012	1402.740
2013	1419.658
2014	1436.779
2015	1454.108
2016	1471.645
2017	1489.394
2018	1507.357
2019	1525.537
2020	1543.936
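A table like the one above can be produced with a sketch along these lines. It uses base R's lm() and predict() in place of the lsq_m and lsq_c helpers, which is a substitution on my part; the names fit_exp and future are made up for illustration.

```r
# Population figures from the example (millions)
years  <- c(1900, 1950, 1970, 1980, 1990, 2000, 2010)
people <- c(400, 557, 825, 981, 1135, 1266, 1370)

# Exponential model fitted as a straight line on ln(population)
fit_exp <- lm(log(people) ~ years)

# Predict ln(population) for 2011-2020, then undo the log with exp()
future <- data.frame(years = 2011:2020)
future$population <- exp(predict(fit_exp, newdata = future))

print(future, row.names = FALSE)
```

The exp() step matters: the model predicts ln(population), so forgetting it would give values around 7 instead of around 1400.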

Conclusion

I hope you could see how data can be transformed into knowledge. An accurate approximation of China's population, for example, could be useful for many purposes and many people, such as governments, marketing agencies and investors. Personally, I consider this the objective of linear regressions: to explore the data that you, I and everybody else are generating, so that we can discover insights and conclusions that we couldn't have found without the right information in our hands.