All right, here is how the story goes. I am enrolled in a certificate program called "AI for Business Innovation" at my alma mater, the J Mack Robinson College of Business at Georgia State University in Atlanta, Georgia, USA. In one of my Predictive Analytics Modeling courses, we are looking at a data set of 2,111 records and 17 columns, based on a study of obesity among Hispanic community members from 3 Spanish-speaking countries.
The team has calculated a new field called Body_Mass_Index using a simple formula, Weight in kg / (Height in meters)^2, and is trying to predict BMI, a continuous variable, using linear regression along with several variable reduction or automated feature selection techniques and machine learning models such as decision trees, random forests and K-nearest neighbors. The comparison metric for the best-performing model is Mean Squared Prediction Error (we use an 80/20 split of the data set with the same seed value).
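To make the setup concrete, here is a minimal sketch of the three pieces described above: the BMI formula, a seeded 80/20 split, and Mean Squared Prediction Error. It is plain Python and not the team's actual code; the function names and the seed value of 42 are my own assumptions for illustration.

```python
import random


def body_mass_index(weight_kg, height_m):
    """BMI = weight in kg / (height in meters)^2."""
    return weight_kg / height_m ** 2


def split_80_20(rows, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut at 80%.

    The seed value is a hypothetical choice; what matters is that every
    model is compared on the same train/test partition.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def mean_squared_prediction_error(actual, predicted):
    """Average squared difference between held-out values and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# With 2,111 records, the split gives 1,688 training and 423 test rows.
train, test = split_80_20(range(2111))
print(len(train), len(test))
```

Because Body_Mass_Index is computed directly from weight and height, any model that gets both as predictors can reconstruct it almost exactly, so it is worth deciding up front whether those two columns belong in the feature set.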
Another thought is to use a binary variable to predict Obesity. Obesity is reflected in the original data set in a variable that has 3 values related to obese individuals (Obese-1, Obese-2 and Obese-3). Obesity per WHO is a BMI of 30 or above. So, the team has created a new binary (Yes/No, or 0/1) variable that indicates whether a respondent is obese, to be predicted from many independent variables related to type of food, consumption habits, physical activity, hereditary overweight characteristics and many other factors. The performance metric for the logistic regression is the F1 score.
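For the classification framing, here is a small sketch of how the 0/1 target could be derived from BMI using the WHO cutoff of 30, and how an F1 score is computed from predicted and actual labels. Again, this is plain Python with hypothetical function names, not the team's implementation.

```python
def is_obese(bmi, threshold=30.0):
    """Return 1 if BMI is at or above the WHO obesity cutoff, else 0."""
    return 1 if bmi >= threshold else 0


def f1_score(actual, predicted):
    """Harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented purely to exercise the function.
actual = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
print(f1_score(actual, predicted))
```

Note that Mean Squared Prediction Error and F1 live on different scales and measure different things, so the two framings cannot be ranked by putting those numbers side by side. One way to compare them would be to threshold the regression's predicted BMI at 30 and score those labels with F1 as well.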
Now, the question is: if we are trying to predict Obesity, which regression should we choose, and why?