DEV Community

shonavar

Linear Regression vs Logistic Regression (Apples to Oranges Comparison on Variable Selection)

All right, this is how the story goes. I am enrolled in a certificate program called "AI for Business Innovation" at my alma mater, the J. Mack Robinson College of Business at Georgia State University in Atlanta, Georgia, USA. In one of my Predictive Analytics Modeling courses, we are looking at a data set of 2,111 records and 17 columns, based on a study of obesity among Hispanic community members from 3 Spanish-speaking countries.

The team has calculated a new field called Body_Mass_Index using a simple formula: weight in kg / (height in meters)^2. We are trying to predict BMI, a continuous variable, using linear regression along with several variable reduction (automated feature selection) techniques and machine learning models such as decision trees, random forests and k-nearest neighbors. The comparison metric for the best-performing model is Mean Squared Prediction Error (we use an 80/20 split of the dataset with the same seed value).
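To make the BMI setup concrete, here is a minimal pure-Python sketch of the formula, a seeded 80/20 split and the Mean Squared Prediction Error. The records, the seed value of 42 and the helper names are made up for illustration and are not taken from the course data set; the "model" is just a naive baseline that predicts the training mean, standing in for whatever regression is actually fitted.

```python
import random

def body_mass_index(weight_kg, height_m):
    """BMI = weight in kg / (height in meters)^2."""
    return weight_kg / height_m ** 2

def split_80_20(rows, seed=42):
    """Shuffle with a fixed seed so every model is scored on the same split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def mean_squared_prediction_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical (weight_kg, height_m) records, NOT from the real data set.
records = [
    (64.0, 1.62), (56.0, 1.52), (77.0, 1.80), (87.0, 1.80), (89.8, 1.78),
    (53.0, 1.62), (55.0, 1.50), (110.0, 1.70), (64.0, 1.78), (68.0, 1.72),
]
bmi_values = [body_mass_index(w, h) for w, h in records]

train, test = split_80_20(bmi_values, seed=42)

# Naive baseline: predict the training-set mean BMI for every test row.
baseline = sum(train) / len(train)
mspe = mean_squared_prediction_error(test, [baseline] * len(test))
print(f"Baseline MSPE on the held-out 20%: {mspe:.2f}")
```

Any real model (linear regression, decision tree, random forest, k-nearest neighbors) would replace the baseline prediction, and the model with the lowest MSPE on the same held-out rows would be the one to prefer.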

Another thought is to predict obesity as a binary variable. Obesity is reflected in the original data set in a variable that has 3 values related to obese individuals (Obese-1, Obese-2 and Obese-3). Per the WHO, obesity is a BMI of 30 or above. So the team has created a new binary (Yes/No, or 0/1) variable that indicates whether a respondent is obese, to be predicted from many independent variables related to type of food, consumption habits, physical activity, hereditary overweight characteristics and many other factors. The performance metric for the logistic regression is the F1 score.
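As a rough illustration of the binary target and the F1 score, the sketch below flags obesity at the WHO cutoff of BMI 30 and computes F1 by hand. The BMI values and function names are hypothetical. Deriving the predicted labels by thresholding predicted BMI values is only one possible way to score a BMI regression with the same metric as a classifier, not necessarily what the team does.

```python
def is_obese(bmi, threshold=30.0):
    """Return 1 if BMI is at or above the WHO obesity cutoff, else 0."""
    return 1 if bmi >= threshold else 0

def f1_score(actual, predicted):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical values for illustration only.
actual_bmi = [31.2, 35.0, 30.0, 24.5, 29.9, 22.0]
predicted_bmi = [32.0, 33.1, 28.7, 30.4, 27.0, 23.5]

actual_labels = [is_obese(b) for b in actual_bmi]        # [1, 1, 1, 0, 0, 0]
predicted_labels = [is_obese(b) for b in predicted_bmi]  # [1, 1, 0, 1, 0, 0]

score = f1_score(actual_labels, predicted_labels)
print(f"F1 score: {score:.3f}")
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is the F1 score. A logistic regression would produce the predicted labels directly (from predicted probabilities) instead of from a predicted BMI.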

Now, the question is: if we are trying to predict obesity, which regression should we choose, and why?
