Linear Regression vs Logistic Regression (Apples to Oranges Comparison on Variable Selection)

shonavar — Fri, 19 Mar 2021 13:45:18 +0000

All right, this is how the story goes. I am enrolled in a certificate program called "AI for Business Innovation" at my alma mater, the J. Mack Robinson College of Business at Georgia State University in Atlanta, Georgia, USA. In one of my Predictive Analytics Modeling courses, we are looking at a data set of 2,111 records and 17 columns from a study of obesity among Hispanic community members in 3 Spanish-speaking countries.

The team has calculated a new field called Body_Mass_Index using a simple formula, weight in kg / (height in meters)^2, and is trying to predict BMI, a continuous variable, using linear regression along with several variable reduction (automated feature selection) techniques and machine learning models such as decision trees, random forests and k-nearest neighbors. The metric for picking the best-performing model is Mean Squared Prediction Error (we use an 80/20 split of the data set with the same seed value for every model).
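To make the setup concrete, here is a minimal sketch of the three pieces described above: the BMI formula, a seeded 80/20 split, and the Mean Squared Prediction Error. It uses only the Python standard library. The function names and the seed value 42 are my own placeholders for illustration, not what the team actually used.

```python
import random


def body_mass_index(weight_kg, height_m):
    """BMI = weight in kg divided by (height in meters) squared."""
    return weight_kg / height_m ** 2


def split_80_20(rows, seed=42):
    """Shuffle a copy of the rows with a fixed seed and return (train, test).

    The seed value is a placeholder; reusing the same seed for every
    model means all models are scored on the same test rows.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * 0.2)
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_prediction_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Example: 70 kg at 1.75 m is a BMI of roughly 22.9.
bmi = body_mass_index(70, 1.75)

# Splitting 2,111 record indices gives 1,689 training and 422 test rows.
train, test = split_80_20(range(2111))
```

Any model that outputs a predicted BMI for the test rows can then be compared on `mean_squared_prediction_error`, where lower is better.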

Another thought is to use a binomial (binary) variable to predict Obesity. Obesity is reflected in the original data set in a variable that has 3 values related to obese individuals (Obese-1, Obese-2 and Obese-3). Per the WHO, obesity is a BMI of 30 or above. So, the team has created a new Yes/No (0/1) variable that indicates whether a respondent is obese, to be predicted from many independent variables related to type of food, consumption habits, physical activity, hereditary overweight characteristics and many other factors. The performance metric for the logistic regression is the F1 score.
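As a rough sketch of the binary setup, the snippet below derives a 0/1 obesity flag from BMI using the WHO cutoff of 30 and computes an F1 score from actual and predicted flags. The function names are hypothetical, and deriving the flag from BMI (rather than from the three Obese levels in the original column) is an assumption on my part.

```python
def is_obese(bmi, threshold=30.0):
    """Return 1 if BMI is at or above the WHO obesity cutoff, else 0."""
    return 1 if bmi >= threshold else 0


def f1_score(actual, predicted):
    """Harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Flags for a few illustrative BMI values (made up, not from the data set).
flags = [is_obese(b) for b in (22.9, 30.0, 35.4)]

# Two of three obese respondents caught, one false alarm.
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Unlike Mean Squared Prediction Error, the F1 score ranges from 0 to 1 and higher is better, so the two metrics cannot be compared directly across the linear and logistic models.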

Now, the question is: if we are trying to predict Obesity, which regression should we choose, and why?

DEV Community: shonavar
