Haseeb Mohammed

Posted on Jul 9, 2020

Analyzing WFH Survey Data

#machinelearning #datascience

I was curious about how working from home has been for everyone, so I created and shared a survey and received 71 responses!

That's more responses than I was expecting, but too few to get granular. I had to get creative with some feature engineering to get some insights out of it.

Here's my first pass at this data, with a simple Logistic Regression at the end to analyze a person's optimism in their employer's plan to return to normal.

The survey can be found here:
https://www.linkedin.com/posts/bababrownbear_datascience-workingfromhome-activity-6678358398880227328-H0Ge

The notebook for this post can be found here:
https://github.com/bababrownbear/Analyzing_WFH_Survey_Data

Now it's time to do some #datascience. Let's start by loading up the data from the survey.

import pandas as pd
pd.options.display.max_rows = 30

wfh_data = pd.read_csv("wfh.csv")

Let's first take a look at the data, just to see what we're working with. It looks like some answers are stored as strings, some as percentages, and some on a scale of 1-5.

wfh_data.head()

	Timestamp	Where do you live? (City, State)	How old are you?	What is your gender?	How many years experience do you have?	How many adults are living at home with you?	How many kids are living at home with you?	Do you have an isolated workspace at home?	Have you worked from home previously?	If yes, how long have you worked from home previously?	Prior to the outbreak of COVID-19, what % were you working from home?	What % are you working from home now?	If the outbreak of COVID-19 subsided in the near future, what % of WFH would you prefer to do going forward?	On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I was productive working from home.	On a scale of 1 to 5, at the START of the outbreak of COVID-19, I was productive working from home.	On a scale of 1 to 5, in the last month, I was productive working from home.	On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I enjoyed working from home.	On a scale of 1 to 5, at the START of the outbreak of COVID-19, I enjoyed working from home.	On a scale of 1 to 5, in the last month, I enjoyed working from home.	On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your plan to return to normal?	On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your employer's plan to return to normal?	Please share how your experience has been working from home. Any pros/cons that you would like to call out?
0	6/15/2020 13:12:36	Pinckney, MI	40-50	Male	20-25 years	1	2	Yes	Yes	5-10 years	100%	100%	100%	5	5	5	5	5	5	3	3	Things will change...less people in offices, l...
1	6/15/2020 13:13:13	Chicago, IL	30-40	Male	15-20 years	2	4	Yes	Yes	1-5 years	30%	30%	30%	5	5	5	5	5	5	5	3	It’s a lot of fun if you have the right equipm...
2	6/15/2020 13:15:04	St louis mo	40-50	Male	15-20 years	2	1	Yes	Yes	5-10 years	50%	100%	100%	5	5	5	5	5	5	3	3	Wfh is most productive way to work
3	6/15/2020 13:20:27	St. Louis, MO	30-40	Male	10-15 years	1	1	No	Yes	5-10 years	70%	100%	80%	5	5	5	5	5	5	4	4	Fully remote meetings (everyone is virtual) ar...
4	6/15/2020 13:21:41	IL	30-40	Male	10-15 years	2	1	Yes	No	NaN	0%	100%	80%	4	4	5	5	5	5	1	4	even if you are great at WFH, the rest of your...

The column names are the actual survey questions, which won't be fun to work with, since I'll be referencing column names all through feature engineering.

wfh_data.columns

Index(['Timestamp', 'Where do you live? (City, State)', 'How old are you?',
       'What is your gender?', 'How many years experience do you have?',
       'How many adults are living at home with you?',
       'How many kids are living at home with you?',
       'Do you have an isolated workspace at home?',
       'Have you worked from home previously?',
       'If yes, how long have you worked from home previously?',
       'Prior to the outbreak of COVID-19, what % were you working from home?',
       'What % are you working from home now?',
       'If the outbreak of COVID-19 subsided in the near future, what % of WFH would you prefer to do going forward?',
       'On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I was productive working from home.',
       'On a scale of 1 to 5, at the START of the outbreak of COVID-19, I was productive working from home.',
       'On a scale of 1 to 5, in the last month, I was productive working from home.',
       'On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I enjoyed working from home.',
       'On a scale of 1 to 5, at the START of the outbreak of COVID-19, I enjoyed working from home.',
       'On a scale of 1 to 5, in the last month, I enjoyed working from home.',
       'On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your plan to return to normal?',
       'On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your employer's plan to return to normal?',
       'Please share how your experience has been working from home. Any pros/cons that you would like to call out?'],
      dtype='object')

What I'll do instead is pop in a new list of column names. Something easier to work with, and that still represents the column's data.

new_column_names = ['TIMESTAMP', 'CITY_STATE', 'AGE',
       'GENDER', 'YEARS_OF_EXPERIENCE',
       'ADULTS_AT_HOME',
       'KIDS_AT_HOME',
       'HAVE_ISOLATED_WORKSPACE',
       'PREVIOUSLY_WORKED_FROM_HOME',
       'PREVIOUSLY_WORKED_FROM_HOME_YEARS',
       'WFH_PERCENT_PRE_COVID',
       'WFH_PERCENT_CURRENT',
       'WFH_PERCENT_FUTURE_PREFERENCE',
       'PRODUCTIVITY_PRE_COVID',
       'PRODUCTIVITY_COVID_START',
       'PRODUCTIVITY_LAST_MONTH',
       'ENJOYED_WFH_PRE_COVID',
       'ENJOYED_WFH_COVID_START',
       'ENJOYED_WFH_LAST_MONTH',
       'RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF',
       'RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER',
       'FREE_FORM_COMMENTS']

wfh_data.columns = new_column_names
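Assigning a whole list to `columns` relies on the new names being in exactly the same order as the survey's questions. A minimal sketch of an order-independent alternative, assuming a toy frame with made-up values and only two of the survey's headers, is to rename through an explicit mapping and check that every expected header is present:

```python
import pandas as pd

# Toy frame with two of the survey's question-style headers (values are made up).
toy = pd.DataFrame({
    "How old are you?": ["40-50", "30-40"],
    "What is your gender?": ["Male", "Female"],
})

# Map each original header to its short name explicitly.
rename_map = {
    "How old are you?": "AGE",
    "What is your gender?": "GENDER",
}

# Fail loudly if a header in the export ever changes or goes missing.
missing = set(rename_map) - set(toy.columns)
if missing:
    raise KeyError(f"Columns not found: {missing}")

toy = toy.rename(columns=rename_map)
print(toy.columns.tolist())
```

The mapping is longer to type than a list, but a reordered or reworded question then raises an error instead of silently mislabeling a column.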

I'm gonna drop the timestamp, the city/state, and the free-form comments. The time someone answered the survey isn't relevant. 90% of the responses came from IL folks, so no insights can come from location, and the free-form comments are for another day, whenever I manage to learn natural language processing.

wfh = wfh_data.drop(columns=['TIMESTAMP', 'CITY_STATE','FREE_FORM_COMMENTS'])

Taking a look at the types of data, and keeping in mind that I have a very limited number of responses, I think engineering the columns into more yes/no questions would help with insights.

wfh.dtypes

AGE                                        object
GENDER                                     object
YEARS_OF_EXPERIENCE                        object
ADULTS_AT_HOME                              int64
KIDS_AT_HOME                                int64
HAVE_ISOLATED_WORKSPACE                    object
PREVIOUSLY_WORKED_FROM_HOME                object
PREVIOUSLY_WORKED_FROM_HOME_YEARS          object
WFH_PERCENT_PRE_COVID                      object
WFH_PERCENT_CURRENT                        object
WFH_PERCENT_FUTURE_PREFERENCE              object
PRODUCTIVITY_PRE_COVID                      int64
PRODUCTIVITY_COVID_START                    int64
PRODUCTIVITY_LAST_MONTH                     int64
ENJOYED_WFH_PRE_COVID                       int64
ENJOYED_WFH_COVID_START                     int64
ENJOYED_WFH_LAST_MONTH                      int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF         int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER     int64
dtype: object

Begin Feature Engineering:

Let's make Age a number.

wfh['AGE']

0     40-50
1     30-40
2     40-50
3     30-40
4     30-40
      ...  
66    20-30
67    30-40
68    30-40
69    30-40
70    30-40
Name: AGE, Length: 71, dtype: object

wfh['AGE'] = pd.to_numeric(wfh['AGE'].str[:2])

Let's make Gender a true/false question. ("Are you male or not?")

wfh['MALE'] = wfh['GENDER'] == 'Male'
wfh = wfh.drop(columns=['GENDER'])
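The single `MALE` flag is the simplest form of one-hot encoding. As a rough sketch on made-up values (the `GENDER_` prefix is just an illustrative choice), `pd.get_dummies` produces one boolean column per category, and the `== 'Male'` comparison matches one of those columns:

```python
import pandas as pd

# Made-up answers, only to show the shape of the result.
gender = pd.Series(["Male", "Female", "Male"], name="GENDER")

# Full one-hot encoding: one boolean column per distinct answer.
one_hot = pd.get_dummies(gender, prefix="GENDER").astype(bool)

# The single-flag version used in this post.
male = gender == "Male"

print(one_hot)
print(male.tolist())
```

One thing to keep in mind with the single flag: any answer other than "Male" ends up as `False`, so different non-male answers are no longer distinguishable.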

Let's make years of experience a number instead of a string.

yoe = wfh['YEARS_OF_EXPERIENCE'].str.split("-", n = 1, expand = True)
wfh['YEARS_OF_EXPERIENCE'] = yoe[0]
yoe = wfh['YEARS_OF_EXPERIENCE'].str.split("+", n = 1, expand = True)
wfh['YEARS_OF_EXPERIENCE'] = pd.to_numeric(yoe[0], downcast='integer')
wfh['YEARS_OF_EXPERIENCE']

0     20
1     15
2     15
3     10
4     10
      ..
66     5
67    15
68    15
69    10
70    15
Name: YEARS_OF_EXPERIENCE, Length: 71, dtype: int8
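The two `split` calls above keep the lower bound of each range. A possible shortcut, sketched here with a hypothetical helper named `lower_bound` and made-up input strings, is to pull out the first run of digits with a regular expression, which handles "20-25 years" and "25+ years" in one step:

```python
import pandas as pd

def lower_bound(series: pd.Series) -> pd.Series:
    """Return the first run of digits in each string as a number (NaN if none)."""
    digits = series.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits)

# Made-up examples in the same shape as the survey answers.
toy = pd.Series(["20-25 years", "1-5 years", "25+ years", None])
parsed = lower_bound(toy)
print(parsed)
```

The same helper would also work for the age ranges, without depending on every age being exactly two digits the way `.str[:2]` does.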

Let's make 'Do you have an isolated workspace' a true/false question.

wfh['HAVE_ISOLATED_WORKSPACE']

0     Yes
1     Yes
2     Yes
3      No
4     Yes
     ... 
66    Yes
67    Yes
68     No
69    Yes
70     No
Name: HAVE_ISOLATED_WORKSPACE, Length: 71, dtype: object

wfh['HAVE_ISOLATED_WORKSPACE'] = wfh['HAVE_ISOLATED_WORKSPACE'] == 'Yes'

wfh['HAVE_ISOLATED_WORKSPACE']

0      True
1      True
2      True
3     False
4      True
      ...  
66     True
67     True
68    False
69     True
70    False
Name: HAVE_ISOLATED_WORKSPACE, Length: 71, dtype: bool

Repeat for previously worked from home.

wfh['PREVIOUSLY_WORKED_FROM_HOME']

0     Yes
1     Yes
2     Yes
3     Yes
4      No
     ... 
66    Yes
67    Yes
68     No
69    Yes
70     No
Name: PREVIOUSLY_WORKED_FROM_HOME, Length: 71, dtype: object

wfh['PREVIOUSLY_WORKED_FROM_HOME'] = wfh['PREVIOUSLY_WORKED_FROM_HOME'] == 'Yes'

wfh['PREVIOUSLY_WORKED_FROM_HOME']

0      True
1      True
2      True
3      True
4     False
      ...  
66     True
67     True
68    False
69     True
70    False
Name: PREVIOUSLY_WORKED_FROM_HOME, Length: 71, dtype: bool

wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']

0     5-10 years
1      1-5 years
2     5-10 years
3     5-10 years
4            NaN
         ...    
66     1-5 years
67    5-10 years
68           NaN
69     1-5 years
70           NaN
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: object

Make the years worked from home a number as well.

wfh_yoe = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].str.split("-", n = 1, expand = True)
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = wfh_yoe[0]
wfh_yoe = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].str.split("+", n = 1, expand = True)
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = pd.to_numeric(wfh_yoe[0], downcast='integer')
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']

0     5.0
1     1.0
2     5.0
3     5.0
4     NaN
     ... 
66    1.0
67    5.0
68    NaN
69    1.0
70    NaN
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: float64

wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].fillna(0)

wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']

0     5.0
1     1.0
2     5.0
3     5.0
4     0.0
     ... 
66    1.0
67    5.0
68    0.0
69    1.0
70    0.0
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: float64

wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = pd.to_numeric(wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'], downcast='integer')

wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']

0     5
1     1
2     5
3     5
4     0
     ..
66    1
67    5
68    0
69    1
70    0
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: int8

Few more to go!

wfh.dtypes

AGE                                         int64
YEARS_OF_EXPERIENCE                          int8
ADULTS_AT_HOME                              int64
KIDS_AT_HOME                                int64
HAVE_ISOLATED_WORKSPACE                      bool
PREVIOUSLY_WORKED_FROM_HOME                  bool
PREVIOUSLY_WORKED_FROM_HOME_YEARS            int8
WFH_PERCENT_PRE_COVID                      object
WFH_PERCENT_CURRENT                        object
WFH_PERCENT_FUTURE_PREFERENCE              object
PRODUCTIVITY_PRE_COVID                      int64
PRODUCTIVITY_COVID_START                    int64
PRODUCTIVITY_LAST_MONTH                     int64
ENJOYED_WFH_PRE_COVID                       int64
ENJOYED_WFH_COVID_START                     int64
ENJOYED_WFH_LAST_MONTH                      int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF         int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER     int64
MALE                                         bool
dtype: object

Switching percents (that are technically strings) to numbers.

wfh['WFH_PERCENT_PRE_COVID']

0     100%
1      30%
2      50%
3      70%
4       0%
      ... 
66     10%
67     60%
68      0%
69     10%
70     10%
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: object

wfh_percent = wfh['WFH_PERCENT_PRE_COVID'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_PRE_COVID'] = pd.to_numeric(wfh_percent[0], downcast='integer')

wfh['WFH_PERCENT_PRE_COVID']

0     100
1      30
2      50
3      70
4       0
     ... 
66     10
67     60
68      0
69     10
70     10
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: int8

wfh['WFH_PERCENT_PRE_COVID'] = wfh['WFH_PERCENT_PRE_COVID'].fillna(0)

wfh['WFH_PERCENT_PRE_COVID'] = pd.to_numeric(wfh['WFH_PERCENT_PRE_COVID'], downcast='integer')

wfh['WFH_PERCENT_PRE_COVID']

0     100
1      30
2      50
3      70
4       0
     ... 
66     10
67     60
68      0
69     10
70     10
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: int8

wfh_percent = wfh['WFH_PERCENT_CURRENT'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_CURRENT'] = pd.to_numeric(wfh_percent[0], downcast='integer')
wfh['WFH_PERCENT_CURRENT'] = wfh['WFH_PERCENT_CURRENT'].fillna(0)
wfh['WFH_PERCENT_CURRENT'] = pd.to_numeric(wfh['WFH_PERCENT_CURRENT'], downcast='integer')

wfh_percent = wfh['WFH_PERCENT_FUTURE_PREFERENCE'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = pd.to_numeric(wfh_percent[0], downcast='integer')
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = wfh['WFH_PERCENT_FUTURE_PREFERENCE'].fillna(0)
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = pd.to_numeric(wfh['WFH_PERCENT_FUTURE_PREFERENCE'], downcast='integer')
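The same four steps repeat for each percent column, so they could be folded into one small function. This is only a sketch: `percent_to_int` is a hypothetical helper name, the toy values are made up, and it keeps the choice made above of treating a missing answer as 0.

```python
import pandas as pd

def percent_to_int(series: pd.Series) -> pd.Series:
    """Turn strings like '30%' into integers, treating missing answers as 0."""
    cleaned = series.str.rstrip("%")
    numbers = pd.to_numeric(cleaned, errors="coerce").fillna(0)
    return numbers.astype("int64")

# Made-up values in the same shape as the survey's percent answers.
toy = pd.DataFrame({
    "WFH_PERCENT_PRE_COVID": ["100%", "30%", None],
    "WFH_PERCENT_CURRENT": ["100%", "90%", "20%"],
})

for col in toy.columns:
    toy[col] = percent_to_int(toy[col])

print(toy)
```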

wfh.dtypes

AGE                                        int64
YEARS_OF_EXPERIENCE                         int8
ADULTS_AT_HOME                             int64
KIDS_AT_HOME                               int64
HAVE_ISOLATED_WORKSPACE                     bool
PREVIOUSLY_WORKED_FROM_HOME                 bool
PREVIOUSLY_WORKED_FROM_HOME_YEARS           int8
WFH_PERCENT_PRE_COVID                       int8
WFH_PERCENT_CURRENT                         int8
WFH_PERCENT_FUTURE_PREFERENCE               int8
PRODUCTIVITY_PRE_COVID                     int64
PRODUCTIVITY_COVID_START                   int64
PRODUCTIVITY_LAST_MONTH                    int64
ENJOYED_WFH_PRE_COVID                      int64
ENJOYED_WFH_COVID_START                    int64
ENJOYED_WFH_LAST_MONTH                     int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF        int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER    int64
MALE                                        bool
dtype: object

wfh

	AGE	YEARS_OF_EXPERIENCE	ADULTS_AT_HOME	KIDS_AT_HOME	HAVE_ISOLATED_WORKSPACE	PREVIOUSLY_WORKED_FROM_HOME	PREVIOUSLY_WORKED_FROM_HOME_YEARS	WFH_PERCENT_PRE_COVID	WFH_PERCENT_CURRENT	WFH_PERCENT_FUTURE_PREFERENCE	PRODUCTIVITY_PRE_COVID	PRODUCTIVITY_COVID_START	PRODUCTIVITY_LAST_MONTH	ENJOYED_WFH_PRE_COVID	ENJOYED_WFH_COVID_START	ENJOYED_WFH_LAST_MONTH	RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF	RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER	MALE
0	40	20	1	2	True	True	5	100	100	100	5	5	5	5	5	5	3	3	True
1	30	15	2	4	True	True	1	30	30	30	5	5	5	5	5	5	5	3	True
2	40	15	2	1	True	True	5	50	100	100	5	5	5	5	5	5	3	3	True
3	30	10	1	1	False	True	5	70	100	80	5	5	5	5	5	5	4	4	True
4	30	10	2	1	True	False	0	0	100	80	4	4	5	5	5	5	1	4	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
66	20	5	1	0	True	True	1	10	100	50	4	3	4	4	4	4	3	4	True
67	30	15	2	0	True	True	5	60	100	100	5	5	5	5	5	5	4	4	True
68	30	15	1	2	False	False	0	0	20	0	1	1	1	1	1	1	4	4	True
69	30	10	0	0	True	True	1	10	100	80	5	5	5	5	5	5	3	4	False
70	30	15	1	2	False	False	0	10	100	80	5	3	4	2	2	4	1	2	True

71 rows × 19 columns

Ok let's take a look at the spread of the answers, and see if we can't split some of the numbered answers into more true/false as well.

for col in wfh.columns:
  print(col,wfh[col].value_counts())

AGE 30    31
40    23
20    10
50     7
Name: AGE, dtype: int64
YEARS_OF_EXPERIENCE 10    18
15    16
20    14
1      9
25     8
5      6
Name: YEARS_OF_EXPERIENCE, dtype: int64
ADULTS_AT_HOME 1    37
2    24
0     5
3     4
4     1
Name: ADULTS_AT_HOME, dtype: int64
KIDS_AT_HOME 0    23
2    19
1    18
3     7
5     2
4     2
Name: KIDS_AT_HOME, dtype: int64
HAVE_ISOLATED_WORKSPACE True     52
False    19
Name: HAVE_ISOLATED_WORKSPACE, dtype: int64
PREVIOUSLY_WORKED_FROM_HOME True     50
False    21
Name: PREVIOUSLY_WORKED_FROM_HOME, dtype: int64
PREVIOUSLY_WORKED_FROM_HOME_YEARS 1     31
0     22
5     13
10     5
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, dtype: int64
WFH_PERCENT_PRE_COVID 10     20
0      16
20     11
40      6
100     4
60      4
30      4
50      3
90      2
70      1
Name: WFH_PERCENT_PRE_COVID, dtype: int64
WFH_PERCENT_CURRENT 100    66
90      3
30      1
20      1
Name: WFH_PERCENT_CURRENT, dtype: int64
WFH_PERCENT_FUTURE_PREFERENCE 100    14
50     14
80     11
70     11
60      7
30      7
40      4
90      1
20      1
0       1
Name: WFH_PERCENT_FUTURE_PREFERENCE, dtype: int64
PRODUCTIVITY_PRE_COVID 5    29
4    20
3    10
2     9
1     3
Name: PRODUCTIVITY_PRE_COVID, dtype: int64
PRODUCTIVITY_COVID_START 5    31
4    18
3    13
1     5
2     4
Name: PRODUCTIVITY_COVID_START, dtype: int64
PRODUCTIVITY_LAST_MONTH 5    40
4    17
3     6
2     6
1     2
Name: PRODUCTIVITY_LAST_MONTH, dtype: int64
ENJOYED_WFH_PRE_COVID 5    37
4    16
3    10
2     4
1     4
Name: ENJOYED_WFH_PRE_COVID, dtype: int64
ENJOYED_WFH_COVID_START 5    36
4    17
3     8
2     7
1     3
Name: ENJOYED_WFH_COVID_START, dtype: int64
ENJOYED_WFH_LAST_MONTH 5    42
4    13
3     9
2     5
1     2
Name: ENJOYED_WFH_LAST_MONTH, dtype: int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF 3    23
2    18
4    15
5    10
1     5
Name: RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF, dtype: int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER 4    25
3    21
5    11
2    10
1     4
Name: RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER, dtype: int64
MALE True     49
False    22
Name: MALE, dtype: int64

There's a lot going on in this block. I'm splitting the number of adults and kids at home into separate true/false columns: do you have none, do you have 1, do you have 2, do you have more than 2. (The 2_PLUS columns mean "more than 2".)

For answers on a 1-5 scale, I'm taking 4s and 5s as 'True', and the rest as 'False'.

For WFH %, I'm splitting at 50%.

For years, I'm splitting at 10 years of experience and at 5 years of previously working from home.

wfh['2_PLUS_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] > 2
wfh['2_PLUS_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] > 2
wfh['2_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 2
wfh['2_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] == 2
wfh['1_ADULT_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 1
wfh['1_KID_AT_HOME'] = wfh['KIDS_AT_HOME'] == 1
wfh['0_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 0
wfh['0_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] == 0
wfh['PRODUCTIVITY_PRE_COVID'] = wfh['PRODUCTIVITY_PRE_COVID'] > 3
wfh['PRODUCTIVITY_COVID_START'] = wfh['PRODUCTIVITY_COVID_START'] > 3
wfh['PRODUCTIVITY_LAST_MONTH'] = wfh['PRODUCTIVITY_LAST_MONTH'] > 3
wfh['ENJOYED_WFH_PRE_COVID'] = wfh['ENJOYED_WFH_PRE_COVID'] > 3
wfh['ENJOYED_WFH_COVID_START'] = wfh['ENJOYED_WFH_COVID_START'] > 3
wfh['ENJOYED_WFH_LAST_MONTH'] = wfh['ENJOYED_WFH_LAST_MONTH'] > 3
wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'] = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'] > 3
wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF'] = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF'] > 3
wfh['WFH_50_PERCENT_OR_GREATER_PRE_COVID'] = wfh['WFH_PERCENT_PRE_COVID'] >= 50
wfh['WFH_50_PERCENT_OR_GREATER_CURRENT'] = wfh['WFH_PERCENT_CURRENT'] >= 50
wfh['WFH_50_PERCENT_OR_GREATER_FUTURE_PREFERENCE'] = wfh['WFH_PERCENT_FUTURE_PREFERENCE'] >= 50
wfh['YEARS_OF_EXPERIENCE_10_YEARS_OR_GREATER'] = wfh['YEARS_OF_EXPERIENCE'] >= 10
wfh['PREVIOUSLY_WORKED_FROM_HOME_5_YEARS_OR_GREATER'] = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] >= 5

wfh = wfh.drop(columns=['ADULTS_AT_HOME','KIDS_AT_HOME','WFH_PERCENT_PRE_COVID','WFH_PERCENT_CURRENT', 'WFH_PERCENT_FUTURE_PREFERENCE', 'YEARS_OF_EXPERIENCE', 'PREVIOUSLY_WORKED_FROM_HOME_YEARS'])
wfh = wfh.drop(columns=['WFH_50_PERCENT_OR_GREATER_CURRENT'])
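For the count columns, the four hand-written comparisons can also be expressed as binning followed by one-hot encoding. The sketch below uses made-up counts and hypothetical bucket labels (`3_PLUS` says "more than 2" explicitly), and it checks that every row lands in exactly one bucket:

```python
import pandas as pd

# Made-up counts, only to illustrate the bucketing.
kids = pd.Series([0, 1, 2, 3, 5], name="KIDS_AT_HOME")

# Bins are right-inclusive: (-1, 0], (0, 1], (1, 2], (2, inf).
buckets = pd.cut(
    kids,
    bins=[-1, 0, 1, 2, float("inf")],
    labels=["0", "1", "2", "3_PLUS"],
)

# One boolean column per bucket.
dummies = pd.get_dummies(buckets, prefix="KIDS").astype(bool)

# Every row should be True in exactly one bucket column.
assert (dummies.sum(axis=1) == 1).all()

# The 1-5 scale questions use a plain threshold: 4s and 5s become True.
likert = pd.Series([1, 3, 4, 5])
agrees = likert > 3

print(dummies)
print(agrees.tolist())
```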

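This is a heatmap of the pairwise correlations between columns. Each cell shows how strongly two variables move together, on a scale from -1 to 1. The closer a value is to 1 or -1, the stronger the relationship, and the more useful one variable is likely to be for predicting the other. Values near 0 mean there's little linear relationship.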

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", color_codes=True)

fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(wfh.corr(), annot=True, ax=ax, vmin=-1, vmax=1, center= 0)

<matplotlib.axes._subplots.AxesSubplot at 0x7f89f58ebd68>

Saving the clean data for later. I might import it into Tableau.

wfh.to_csv('boolified.csv')
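A 15x15 heatmap is a lot to scan when only one column is of interest. One way to narrow it down, sketched here on a tiny made-up frame with hypothetical column names, is to take a single column of the correlation matrix and sort it by absolute value:

```python
import pandas as pd

# Made-up boolean data; TARGET stands in for the column being predicted.
toy = pd.DataFrame({
    "TARGET":    [True, True, False, False, True, False],
    "FEATURE_A": [True, True, False, False, True, True],
    "FEATURE_B": [False, True, False, True, True, False],
})

# Cast booleans to 0/1 so the Pearson correlation is well defined.
corr = toy.astype(int).corr()

# Correlation of every other column with TARGET, strongest first.
ranked = corr["TARGET"].drop("TARGET").sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top, which a plain descending sort would push to the bottom.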

I wanted to see what features are important to whether or not you're optimistic about your employer's return to normal plan.

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X = wfh.drop(columns=['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'])
y = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER']

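Here I ran a simple logistic regression to try to predict y, RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER. Using all 23 columns, I've got a model that will tell me with 72% accuracy, on the half of the data held out for testing, whether you're optimistic or not. That's really good! The test set is split evenly between the two classes, so random guessing would get 50%, and I'm 22 percentage points better than random. With only 36 test rows, though, that number is a rough estimate. 🔥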

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.70      0.78      0.74        18
        True       0.75      0.67      0.71        18

    accuracy                           0.72        36
   macro avg       0.72      0.72      0.72        36
weighted avg       0.73      0.72      0.72        36



/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
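The warning above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch of doing both with a scikit-learn pipeline follows. It runs on synthetic data from `make_classification`, not the survey, and one feature is stretched to an age-like range purely as an assumption to mimic an unscaled column:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey: 71 rows, 5 features.
X_toy, y_toy = make_classification(
    n_samples=71, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
# Stretch one feature to an age-like range so the columns are on different scales.
X_toy[:, 0] = X_toy[:, 0] * 10 + 35

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

# Scale inside the pipeline so the scaler is fit on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"accuracy on {len(y_te)} held-out rows: {test_accuracy:.2f}")
```

Putting the scaler in the pipeline, rather than scaling the whole frame up front, keeps information from the test rows out of the training step.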

The dashed line here is random. The closer the blue line goes to the top left corner of this chart, the better my algorithm is.

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Compute the area from predicted probabilities so it matches the plotted curve.
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
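One thing to watch with the area value shown in the legend: `roc_auc_score` gives different answers depending on whether it receives hard True/False predictions or predicted probabilities. Only the probability version corresponds to the curve that `roc_curve` draws. A small made-up example shows the gap:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for four rows.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.45, 0.8])

# Hard predictions at the usual 0.5 threshold: [0, 0, 0, 1].
y_hard = (y_prob >= 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(auc_from_probs, auc_from_labels)
```

Here the probabilities rank both positive rows above both negative rows, so the area is 1.0, while thresholding first throws away that ranking and gives 0.75.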

Here I can see which features actually mattered to the model. A coefficient close to 0 means the feature wasn't important; a large positive or negative value means it pushed the prediction strongly toward or away from "optimistic". Let's see if I can't figure out a smaller subset of features that will give me the same 72% accuracy, or better.

The first three columns are pretty important:

RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF
0_KIDS_AT_HOME
HAVE_ISOLATED_WORKSPACE

That paints a picture, doesn't it?

column_labels = X.columns.tolist()
coef = logreg.coef_.squeeze().tolist()
labels_coef = list(zip(column_labels, coef))
pd.DataFrame({"Feature":X.columns.tolist(),"Coefficients":logreg.coef_[0]}).sort_values(by=['Coefficients'], ascending=False)

	Feature	Coefficients
9	RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF	1.457509
18	0_KIDS_AT_HOME	0.873659
1	HAVE_ISOLATED_WORKSPACE	0.870934
5	PRODUCTIVITY_LAST_MONTH	0.298454
6	ENJOYED_WFH_PRE_COVID	0.260499
15	1_ADULT_AT_HOME	0.223816
2	PREVIOUSLY_WORKED_FROM_HOME	0.144423
11	2_PLUS_ADULTS_AT_HOME	0.113560
13	2_ADULTS_AT_HOME	0.047703
14	2_KIDS_AT_HOME	-0.018416
8	ENJOYED_WFH_LAST_MONTH	-0.019678
0	AGE	-0.043531
21	YEARS_OF_EXPERIENCE_10_YEARS_OR_GREATER	-0.084392
3	PRODUCTIVITY_PRE_COVID	-0.104972
22	PREVIOUSLY_WORKED_FROM_HOME_5_YEARS_OR_GREATER	-0.129090
7	ENJOYED_WFH_COVID_START	-0.148752
19	WFH_50_PERCENT_OR_GREATER_PRE_COVID	-0.190729
20	WFH_50_PERCENT_OR_GREATER_FUTURE_PREFERENCE	-0.312987
12	2_PLUS_KIDS_AT_HOME	-0.338904
17	0_ADULTS_AT_HOME	-0.386431
10	MALE	-0.420092
4	PRODUCTIVITY_COVID_START	-0.432348
16	1_KID_AT_HOME	-0.517691
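Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies four coefficients from the table above. Treat the results as rough: the fit used scikit-learn's default regularization, did not fully converge, and included the unscaled AGE column.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above.
coefs = pd.Series({
    "RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF": 1.457509,
    "0_KIDS_AT_HOME": 0.873659,
    "HAVE_ISOLATED_WORKSPACE": 0.870934,
    "1_KID_AT_HOME": -0.517691,
})

# exp(coefficient) = multiplicative change in the odds when the flag is True.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the flag raises the odds of being optimistic about the employer's plan, and one below 1 means it lowers them. For example, the first coefficient works out to an odds ratio of roughly 4.3.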

Using just those three columns, I'm up to 75% accuracy.

(For my data science folks, I'm taking note of the f1-score, but this time around the accuracy and f1-score are the same, and easier to explain!)

columns_to_use = ['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF','0_KIDS_AT_HOME','HAVE_ISOLATED_WORKSPACE']

X = wfh[columns_to_use]
y = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.74      0.78      0.76        18
        True       0.76      0.72      0.74        18

    accuracy                           0.75        36
   macro avg       0.75      0.75      0.75        36
weighted avg       0.75      0.75      0.75        36
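With 36 test rows, a single answer flipping moves accuracy by almost 3 points, so a jump from 72% to 75% on one split is hard to interpret. A common way to get a steadier estimate is stratified k-fold cross-validation, compared against a majority-class baseline. This sketch uses synthetic data from `make_classification`, not the survey:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 71 rows and 3 features, like the reduced model.
X_toy, y_toy = make_classification(
    n_samples=71, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Five stratified folds, so each fold keeps the class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv)
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=cv
)

print(f"model:    {model_scores.mean():.2f} +/- {model_scores.std():.2f}")
print(f"baseline: {baseline_scores.mean():.2f} +/- {baseline_scores.std():.2f}")
```

The spread across folds gives a sense of how much the accuracy would move with a different split, which a single train/test split can't show.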

What's my plan next with this data? Probably toss the boolified version of the data into Tableau and get some fancy charts.


Top comments (1)

Haseeb Mohammed • Jul 20 '20

I got a question via Slack: "What was the reasoning behind making gender a true/false 'Are you a Male' response?"

It was so I could evaluate it as a boolean (True/False) instead of comparing two strings ("Male" vs. "Female"). It's the simplest implementation of a technique called one-hot encoding.

machinelearningmastery.com/why-one...
