I was curious about how working from home has been for everyone, so I created and shared a survey and received 71 responses!
That's more responses than I was expecting, but too few to get granular. I had to get creative with some feature engineering to get some insights out of it.
Here's my first pass at this data, with a simple logistic regression at the end to analyze a person's optimism about their employer's plan to return to normal.
The survey can be found here:
https://www.linkedin.com/posts/bababrownbear_datascience-workingfromhome-activity-6678358398880227328-H0Ge
The notebook for this post can be found here:
https://github.com/bababrownbear/Analyzing_WFH_Survey_Data
Now it's time to do some #datascience. Let's start by loading up the data from the survey.
import pandas as pd
pd.options.display.max_rows = 30
wfh_data = pd.read_csv("wfh.csv")
Let's first take a look at the data, just to see what we're working with. Looks like some answers are stored as strings, some as percents, and some on a scale of 1-5.
wfh_data.head()
| | Timestamp | Where do you live? (City, State) | How old are you? | What is your gender? | How many years experience do you have? | How many adults are living at home with you? | How many kids are living at home with you? | Do you have an isolated workspace at home? | Have you worked from home previously? | If yes, how long have you worked from home previously? | Prior to the outbreak of COVID-19, what % were you working from home? | What % are you working from home now? | If the outbreak of COVID-19 subsided in the near future, what % of WFH would you prefer to do going forward? | On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I was productive working from home. | On a scale of 1 to 5, at the START of the outbreak of COVID-19, I was productive working from home. | On a scale of 1 to 5, in the last month, I was productive working from home. | On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I enjoyed working from home. | On a scale of 1 to 5, at the START of the outbreak of COVID-19, I enjoyed working from home. | On a scale of 1 to 5, in the last month, I enjoyed working from home. | On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your plan to return to normal? | On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your employer's plan to return to normal? | Please share how your experience has been working from home. Any pros/cons that you would like to call out? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6/15/2020 13:12:36 | Pinckney, MI | 40-50 | Male | 20-25 years | 1 | 2 | Yes | Yes | 5-10 years | 100% | 100% | 100% | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 3 | Things will change...less people in offices, l... |
1 | 6/15/2020 13:13:13 | Chicago, IL | 30-40 | Male | 15-20 years | 2 | 4 | Yes | Yes | 1-5 years | 30% | 30% | 30% | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | It’s a lot of fun if you have the right equipm... |
2 | 6/15/2020 13:15:04 | St louis mo | 40-50 | Male | 15-20 years | 2 | 1 | Yes | Yes | 5-10 years | 50% | 100% | 100% | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 3 | Wfh is most productive way to work |
3 | 6/15/2020 13:20:27 | St. Louis, MO | 30-40 | Male | 10-15 years | 1 | 1 | No | Yes | 5-10 years | 70% | 100% | 80% | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | Fully remote meetings (everyone is virtual) ar... |
4 | 6/15/2020 13:21:41 | IL | 30-40 | Male | 10-15 years | 2 | 1 | Yes | No | NaN | 0% | 100% | 80% | 4 | 4 | 5 | 5 | 5 | 5 | 1 | 4 | even if you are great at WFH, the rest of your... |
The column names are the actual questions themselves, which won't be fun to work with, since I'll be referencing the column names throughout feature engineering.
wfh_data.columns
Index(['Timestamp', 'Where do you live? (City, State)', 'How old are you?',
'What is your gender?', 'How many years experience do you have?',
'How many adults are living at home with you?',
'How many kids are living at home with you?',
'Do you have an isolated workspace at home?',
'Have you worked from home previously?',
'If yes, how long have you worked from home previously?',
'Prior to the outbreak of COVID-19, what % were you working from home?',
'What % are you working from home now?',
'If the outbreak of COVID-19 subsided in the near future, what % of WFH would you prefer to do going forward?',
'On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I was productive working from home.',
'On a scale of 1 to 5, at the START of the outbreak of COVID-19, I was productive working from home.',
'On a scale of 1 to 5, in the last month, I was productive working from home.',
'On a scale of 1 to 5, PRIOR to the outbreak of COVID-19, I enjoyed working from home.',
'On a scale of 1 to 5, at the START of the outbreak of COVID-19, I enjoyed working from home.',
'On a scale of 1 to 5, in the last month, I enjoyed working from home.',
'On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your plan to return to normal?',
'On a scale of 1 to 5, 5 being very optimistic, 1 being very pessimistic, how do you feel about your employer's plan to return to normal?',
'Please share how your experience has been working from home. Any pros/cons that you would like to call out?'],
dtype='object')
What I'll do instead is pop in a new list of column names. Something easier to work with, and that still represents the column's data.
new_column_names = ['TIMESTAMP', 'CITY_STATE', 'AGE',
'GENDER', 'YEARS_OF_EXPERIENCE',
'ADULTS_AT_HOME',
'KIDS_AT_HOME',
'HAVE_ISOLATED_WORKSPACE',
'PREVIOUSLY_WORKED_FROM_HOME',
'PREVIOUSLY_WORKED_FROM_HOME_YEARS',
'WFH_PERCENT_PRE_COVID',
'WFH_PERCENT_CURRENT',
'WFH_PERCENT_FUTURE_PREFERENCE',
'PRODUCTIVITY_PRE_COVID',
'PRODUCTIVITY_COVID_START',
'PRODUCTIVITY_LAST_MONTH',
'ENJOYED_WFH_PRE_COVID',
'ENJOYED_WFH_COVID_START',
'ENJOYED_WFH_LAST_MONTH',
'RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF',
'RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER',
'FREE_FORM_COMMENTS']
wfh_data.columns = new_column_names
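Assigning a list to `.columns` relies on the new names being in exactly the same order as the file's columns. A minimal sketch of an alternative, assuming a toy frame with just two of the survey questions: map old names to new ones explicitly with `rename`, so a reordered export can't silently mislabel a column. The `demo` frame and `column_map` dict are made up for illustration.

```python
import pandas as pd

# Toy frame with two of the survey's question-style headers (illustrative only).
demo = pd.DataFrame({
    "How old are you?": ["40-50", "30-40"],
    "What is your gender?": ["Male", "Female"],
})

# Explicit old -> new mapping; column order in the file no longer matters.
column_map = {
    "How old are you?": "AGE",
    "What is your gender?": "GENDER",
}

# Fail loudly if a question we expect is missing from the file.
missing = set(column_map) - set(demo.columns)
assert not missing, f"unexpected survey layout, missing: {missing}"

demo = demo.rename(columns=column_map)
print(demo.columns.tolist())
```

The trade-off is a longer mapping to type out, in exchange for not depending on column position.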
I'm gonna drop the timestamp, city/state, and the free-form comments. The time someone answered isn't relevant. About 90% of the respondents are from IL, so no insights can come from location. The free-form comments are for another day, whenever I manage to learn natural language processing.
wfh = wfh_data.drop(columns=['TIMESTAMP', 'CITY_STATE','FREE_FORM_COMMENTS'])
Taking a look at the data types, and keeping in mind that I have a very limited number of responses, I think engineering the columns into more yes/no questions would help with insights.
wfh.dtypes
AGE object
GENDER object
YEARS_OF_EXPERIENCE object
ADULTS_AT_HOME int64
KIDS_AT_HOME int64
HAVE_ISOLATED_WORKSPACE object
PREVIOUSLY_WORKED_FROM_HOME object
PREVIOUSLY_WORKED_FROM_HOME_YEARS object
WFH_PERCENT_PRE_COVID object
WFH_PERCENT_CURRENT object
WFH_PERCENT_FUTURE_PREFERENCE object
PRODUCTIVITY_PRE_COVID int64
PRODUCTIVITY_COVID_START int64
PRODUCTIVITY_LAST_MONTH int64
ENJOYED_WFH_PRE_COVID int64
ENJOYED_WFH_COVID_START int64
ENJOYED_WFH_LAST_MONTH int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER int64
dtype: object
Begin Feature Engineering:
Let's make Age a number by keeping the lower bound of each bracket (so '40-50' becomes 40).
wfh['AGE']
0 40-50
1 30-40
2 40-50
3 30-40
4 30-40
...
66 20-30
67 30-40
68 30-40
69 30-40
70 30-40
Name: AGE, Length: 71, dtype: object
wfh['AGE'] = pd.to_numeric(wfh['AGE'].str[:2])
Let's make Gender a true/false question. ("Are you male or not?") Any answer other than 'Male' becomes False.
wfh['MALE'] = wfh['GENDER'] == 'Male'
wfh = wfh.drop(columns=['GENDER'])
Let's make years of experience a number instead of a string.
yoe = wfh['YEARS_OF_EXPERIENCE'].str.split("-", n = 1, expand = True)
wfh['YEARS_OF_EXPERIENCE'] = yoe[0]
yoe = wfh['YEARS_OF_EXPERIENCE'].str.split("+", n = 1, expand = True)
wfh['YEARS_OF_EXPERIENCE'] = pd.to_numeric(yoe[0], downcast='integer')
wfh['YEARS_OF_EXPERIENCE']
0 20
1 15
2 15
3 10
4 10
..
66 5
67 15
68 15
69 10
70 15
Name: YEARS_OF_EXPERIENCE, Length: 71, dtype: int8
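The two-step split on "-" and then "+" can be collapsed into a single regular expression that grabs the first run of digits. Here's a small sketch of that idea; `leading_number` is a hypothetical helper, the sample strings are made up to match the answer formats seen above, and turning missing answers into 0 is an assumption that suits some columns better than others.

```python
import pandas as pd

def leading_number(series: pd.Series) -> pd.Series:
    """Pull the first run of digits out of each string; missing values become 0."""
    digits = series.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits).fillna(0).astype(int)

# Made-up answers in the same shapes as the survey's range strings.
demo = pd.Series(["20-25 years", "25+ years", "1-5 years", None])
parsed = leading_number(demo)
print(parsed.tolist())
```

Like the original approach, this keeps the lower bound of each range, so "1-5 years" becomes 1.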
Let's make 'Do you have an isolated workspace' a true/false question.
wfh['HAVE_ISOLATED_WORKSPACE']
0 Yes
1 Yes
2 Yes
3 No
4 Yes
...
66 Yes
67 Yes
68 No
69 Yes
70 No
Name: HAVE_ISOLATED_WORKSPACE, Length: 71, dtype: object
wfh['HAVE_ISOLATED_WORKSPACE'] = wfh['HAVE_ISOLATED_WORKSPACE'] == 'Yes'
wfh['HAVE_ISOLATED_WORKSPACE']
0 True
1 True
2 True
3 False
4 True
...
66 True
67 True
68 False
69 True
70 False
Name: HAVE_ISOLATED_WORKSPACE, Length: 71, dtype: bool
Repeat for previously worked from home.
wfh['PREVIOUSLY_WORKED_FROM_HOME']
0 Yes
1 Yes
2 Yes
3 Yes
4 No
...
66 Yes
67 Yes
68 No
69 Yes
70 No
Name: PREVIOUSLY_WORKED_FROM_HOME, Length: 71, dtype: object
wfh['PREVIOUSLY_WORKED_FROM_HOME'] = wfh['PREVIOUSLY_WORKED_FROM_HOME'] == 'Yes'
wfh['PREVIOUSLY_WORKED_FROM_HOME']
0 True
1 True
2 True
3 True
4 False
...
66 True
67 True
68 False
69 True
70 False
Name: PREVIOUSLY_WORKED_FROM_HOME, Length: 71, dtype: bool
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']
0 5-10 years
1 1-5 years
2 5-10 years
3 5-10 years
4 NaN
...
66 1-5 years
67 5-10 years
68 NaN
69 1-5 years
70 NaN
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: object
Make the years worked from home a number as well.
wfh_yoe = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].str.split("-", n = 1, expand = True)
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = wfh_yoe[0]
wfh_yoe = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].str.split("+", n = 1, expand = True)
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = pd.to_numeric(wfh_yoe[0], downcast='integer')
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']
0 5.0
1 1.0
2 5.0
3 5.0
4 NaN
...
66 1.0
67 5.0
68 NaN
69 1.0
70 NaN
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: float64
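# Assign the result back; inplace=True on a selected column is unreliable in newer pandas
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'].fillna(0)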
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']
0 5.0
1 1.0
2 5.0
3 5.0
4 0.0
...
66 1.0
67 5.0
68 0.0
69 1.0
70 0.0
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: float64
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] = pd.to_numeric(wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'], downcast='integer')
wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS']
0 5
1 1
2 5
3 5
4 0
..
66 1
67 5
68 0
69 1
70 0
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, Length: 71, dtype: int8
Few more to go!
wfh.dtypes
AGE int64
YEARS_OF_EXPERIENCE int8
ADULTS_AT_HOME int64
KIDS_AT_HOME int64
HAVE_ISOLATED_WORKSPACE bool
PREVIOUSLY_WORKED_FROM_HOME bool
PREVIOUSLY_WORKED_FROM_HOME_YEARS int8
WFH_PERCENT_PRE_COVID object
WFH_PERCENT_CURRENT object
WFH_PERCENT_FUTURE_PREFERENCE object
PRODUCTIVITY_PRE_COVID int64
PRODUCTIVITY_COVID_START int64
PRODUCTIVITY_LAST_MONTH int64
ENJOYED_WFH_PRE_COVID int64
ENJOYED_WFH_COVID_START int64
ENJOYED_WFH_LAST_MONTH int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER int64
MALE bool
dtype: object
Switching percents (that are technically strings) to numbers.
wfh['WFH_PERCENT_PRE_COVID']
0 100%
1 30%
2 50%
3 70%
4 0%
...
66 10%
67 60%
68 0%
69 10%
70 10%
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: object
wfh_percent = wfh['WFH_PERCENT_PRE_COVID'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_PRE_COVID'] = pd.to_numeric(wfh_percent[0], downcast='integer')
wfh['WFH_PERCENT_PRE_COVID']
0 100
1 30
2 50
3 70
4 0
...
66 10
67 60
68 0
69 10
70 10
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: int8
wfh['WFH_PERCENT_PRE_COVID'] = wfh['WFH_PERCENT_PRE_COVID'].fillna(0)
wfh['WFH_PERCENT_PRE_COVID'] = pd.to_numeric(wfh['WFH_PERCENT_PRE_COVID'], downcast='integer')
wfh['WFH_PERCENT_PRE_COVID']
0 100
1 30
2 50
3 70
4 0
...
66 10
67 60
68 0
69 10
70 10
Name: WFH_PERCENT_PRE_COVID, Length: 71, dtype: int8
wfh_percent = wfh['WFH_PERCENT_CURRENT'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_CURRENT'] = pd.to_numeric(wfh_percent[0], downcast='integer')
wfh['WFH_PERCENT_CURRENT'] = wfh['WFH_PERCENT_CURRENT'].fillna(0)
wfh['WFH_PERCENT_CURRENT'] = pd.to_numeric(wfh['WFH_PERCENT_CURRENT'], downcast='integer')
wfh_percent = wfh['WFH_PERCENT_FUTURE_PREFERENCE'].str.split("%", n = 1, expand = True)
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = pd.to_numeric(wfh_percent[0], downcast='integer')
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = wfh['WFH_PERCENT_FUTURE_PREFERENCE'].fillna(0)
wfh['WFH_PERCENT_FUTURE_PREFERENCE'] = pd.to_numeric(wfh['WFH_PERCENT_FUTURE_PREFERENCE'], downcast='integer')
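The same strip-convert-fill steps run three times above, once per percent column. A rough sketch of the same idea as a loop, using a made-up two-column frame rather than the real survey data; treating blanks as 0 mirrors the `fillna(0)` calls above.

```python
import pandas as pd

# Made-up stand-in for the percent columns (strings with a trailing % sign).
demo = pd.DataFrame({
    "WFH_PERCENT_PRE_COVID": ["100%", "30%", None],
    "WFH_PERCENT_CURRENT": ["100%", "90%", "20%"],
})

percent_columns = ["WFH_PERCENT_PRE_COVID", "WFH_PERCENT_CURRENT"]
for col in percent_columns:
    # Strip the trailing % sign, convert to a number, and treat blanks as 0.
    demo[col] = pd.to_numeric(demo[col].str.rstrip("%")).fillna(0).astype(int)

print(demo)
```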
wfh.dtypes
AGE int64
YEARS_OF_EXPERIENCE int8
ADULTS_AT_HOME int64
KIDS_AT_HOME int64
HAVE_ISOLATED_WORKSPACE bool
PREVIOUSLY_WORKED_FROM_HOME bool
PREVIOUSLY_WORKED_FROM_HOME_YEARS int8
WFH_PERCENT_PRE_COVID int8
WFH_PERCENT_CURRENT int8
WFH_PERCENT_FUTURE_PREFERENCE int8
PRODUCTIVITY_PRE_COVID int64
PRODUCTIVITY_COVID_START int64
PRODUCTIVITY_LAST_MONTH int64
ENJOYED_WFH_PRE_COVID int64
ENJOYED_WFH_COVID_START int64
ENJOYED_WFH_LAST_MONTH int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER int64
MALE bool
dtype: object
wfh
| | AGE | YEARS_OF_EXPERIENCE | ADULTS_AT_HOME | KIDS_AT_HOME | HAVE_ISOLATED_WORKSPACE | PREVIOUSLY_WORKED_FROM_HOME | PREVIOUSLY_WORKED_FROM_HOME_YEARS | WFH_PERCENT_PRE_COVID | WFH_PERCENT_CURRENT | WFH_PERCENT_FUTURE_PREFERENCE | PRODUCTIVITY_PRE_COVID | PRODUCTIVITY_COVID_START | PRODUCTIVITY_LAST_MONTH | ENJOYED_WFH_PRE_COVID | ENJOYED_WFH_COVID_START | ENJOYED_WFH_LAST_MONTH | RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF | RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER | MALE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | 20 | 1 | 2 | True | True | 5 | 100 | 100 | 100 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 3 | True |
1 | 30 | 15 | 2 | 4 | True | True | 1 | 30 | 30 | 30 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | True |
2 | 40 | 15 | 2 | 1 | True | True | 5 | 50 | 100 | 100 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 3 | True |
3 | 30 | 10 | 1 | 1 | False | True | 5 | 70 | 100 | 80 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | True |
4 | 30 | 10 | 2 | 1 | True | False | 0 | 0 | 100 | 80 | 4 | 4 | 5 | 5 | 5 | 5 | 1 | 4 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
66 | 20 | 5 | 1 | 0 | True | True | 1 | 10 | 100 | 50 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | True |
67 | 30 | 15 | 2 | 0 | True | True | 5 | 60 | 100 | 100 | 5 | 5 | 5 | 5 | 5 | 5 | 4 | 4 | True |
68 | 30 | 15 | 1 | 2 | False | False | 0 | 0 | 20 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | True |
69 | 30 | 10 | 0 | 0 | True | True | 1 | 10 | 100 | 80 | 5 | 5 | 5 | 5 | 5 | 5 | 3 | 4 | False |
70 | 30 | 15 | 1 | 2 | False | False | 0 | 10 | 100 | 80 | 5 | 3 | 4 | 2 | 2 | 4 | 1 | 2 | True |
71 rows × 19 columns
Ok let's take a look at the spread of the answers, and see if we can't split some of the numbered answers into more true/false as well.
for col in wfh.columns:
    print(col, wfh[col].value_counts())
AGE 30 31
40 23
20 10
50 7
Name: AGE, dtype: int64
YEARS_OF_EXPERIENCE 10 18
15 16
20 14
1 9
25 8
5 6
Name: YEARS_OF_EXPERIENCE, dtype: int64
ADULTS_AT_HOME 1 37
2 24
0 5
3 4
4 1
Name: ADULTS_AT_HOME, dtype: int64
KIDS_AT_HOME 0 23
2 19
1 18
3 7
5 2
4 2
Name: KIDS_AT_HOME, dtype: int64
HAVE_ISOLATED_WORKSPACE True 52
False 19
Name: HAVE_ISOLATED_WORKSPACE, dtype: int64
PREVIOUSLY_WORKED_FROM_HOME True 50
False 21
Name: PREVIOUSLY_WORKED_FROM_HOME, dtype: int64
PREVIOUSLY_WORKED_FROM_HOME_YEARS 1 31
0 22
5 13
10 5
Name: PREVIOUSLY_WORKED_FROM_HOME_YEARS, dtype: int64
WFH_PERCENT_PRE_COVID 10 20
0 16
20 11
40 6
100 4
60 4
30 4
50 3
90 2
70 1
Name: WFH_PERCENT_PRE_COVID, dtype: int64
WFH_PERCENT_CURRENT 100 66
90 3
30 1
20 1
Name: WFH_PERCENT_CURRENT, dtype: int64
WFH_PERCENT_FUTURE_PREFERENCE 100 14
50 14
80 11
70 11
60 7
30 7
40 4
90 1
20 1
0 1
Name: WFH_PERCENT_FUTURE_PREFERENCE, dtype: int64
PRODUCTIVITY_PRE_COVID 5 29
4 20
3 10
2 9
1 3
Name: PRODUCTIVITY_PRE_COVID, dtype: int64
PRODUCTIVITY_COVID_START 5 31
4 18
3 13
1 5
2 4
Name: PRODUCTIVITY_COVID_START, dtype: int64
PRODUCTIVITY_LAST_MONTH 5 40
4 17
3 6
2 6
1 2
Name: PRODUCTIVITY_LAST_MONTH, dtype: int64
ENJOYED_WFH_PRE_COVID 5 37
4 16
3 10
2 4
1 4
Name: ENJOYED_WFH_PRE_COVID, dtype: int64
ENJOYED_WFH_COVID_START 5 36
4 17
3 8
2 7
1 3
Name: ENJOYED_WFH_COVID_START, dtype: int64
ENJOYED_WFH_LAST_MONTH 5 42
4 13
3 9
2 5
1 2
Name: ENJOYED_WFH_LAST_MONTH, dtype: int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF 3 23
2 18
4 15
5 10
1 5
Name: RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF, dtype: int64
RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER 4 25
3 21
5 11
2 10
1 4
Name: RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER, dtype: int64
MALE True 49
False 22
Name: MALE, dtype: int64
There's a lot going on in this block. I'm splitting the number of kids (and adults) at home into separate true/false columns: do you have none, do you have 1, do you have 2, do you have more than 2. (The 2_PLUS columns mean more than 2.)
For answers on a 1-5 scale, I'm taking 4s and 5s as True, and the rest as False.
For WFH %, I'm splitting at 50%.
For years, I'm splitting at 10 years of experience and at 5 years of previous WFH.
wfh['2_PLUS_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] > 2
wfh['2_PLUS_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] > 2
wfh['2_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 2
wfh['2_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] == 2
wfh['1_ADULT_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 1
wfh['1_KID_AT_HOME'] = wfh['KIDS_AT_HOME'] == 1
wfh['0_ADULTS_AT_HOME'] = wfh['ADULTS_AT_HOME'] == 0
wfh['0_KIDS_AT_HOME'] = wfh['KIDS_AT_HOME'] == 0
wfh['PRODUCTIVITY_PRE_COVID'] = wfh['PRODUCTIVITY_PRE_COVID'] > 3
wfh['PRODUCTIVITY_COVID_START'] = wfh['PRODUCTIVITY_COVID_START'] > 3
wfh['PRODUCTIVITY_LAST_MONTH'] = wfh['PRODUCTIVITY_LAST_MONTH'] > 3
wfh['ENJOYED_WFH_PRE_COVID'] = wfh['ENJOYED_WFH_PRE_COVID'] > 3
wfh['ENJOYED_WFH_COVID_START'] = wfh['ENJOYED_WFH_COVID_START'] > 3
wfh['ENJOYED_WFH_LAST_MONTH'] = wfh['ENJOYED_WFH_LAST_MONTH'] > 3
wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'] = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'] > 3
wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF'] = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF'] > 3
wfh['WFH_50_PERCENT_OR_GREATER_PRE_COVID'] = wfh['WFH_PERCENT_PRE_COVID'] >= 50
wfh['WFH_50_PERCENT_OR_GREATER_CURRENT'] = wfh['WFH_PERCENT_CURRENT'] >= 50
wfh['WFH_50_PERCENT_OR_GREATER_FUTURE_PREFERENCE'] = wfh['WFH_PERCENT_FUTURE_PREFERENCE'] >= 50
wfh['YEARS_OF_EXPERIENCE_10_YEARS_OR_GREATER'] = wfh['YEARS_OF_EXPERIENCE'] >= 10
wfh['PREVIOUSLY_WORKED_FROM_HOME_5_YEARS_OR_GREATER'] = wfh['PREVIOUSLY_WORKED_FROM_HOME_YEARS'] >= 5
wfh = wfh.drop(columns=['ADULTS_AT_HOME','KIDS_AT_HOME','WFH_PERCENT_PRE_COVID','WFH_PERCENT_CURRENT', 'WFH_PERCENT_FUTURE_PREFERENCE', 'YEARS_OF_EXPERIENCE', 'PREVIOUSLY_WORKED_FROM_HOME_YEARS'])
# 69 of 71 respondents are at 50% or more right now, so this column is close to constant
wfh = wfh.drop(columns=['WFH_50_PERCENT_OR_GREATER_CURRENT'])
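The 1-5 scale columns all get the same "4 or 5 counts as True" treatment, so the rule can live in one place. A small sketch on made-up answers; the `likert_columns` list and `threshold` name are illustrative, not part of the original notebook.

```python
import pandas as pd

# Made-up 1-5 answers for two of the scale questions.
demo = pd.DataFrame({
    "PRODUCTIVITY_LAST_MONTH": [5, 3, 4, 1],
    "ENJOYED_WFH_LAST_MONTH": [2, 4, 5, 3],
})

likert_columns = ["PRODUCTIVITY_LAST_MONTH", "ENJOYED_WFH_LAST_MONTH"]
threshold = 3  # answers above this (4s and 5s) count as True

for col in likert_columns:
    demo[col] = demo[col] > threshold

print(demo)
```

Keeping the threshold in a variable makes it easy to rerun the analysis with a different cut-off and see whether the conclusions hold.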
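This is a heatmap of pairwise correlation: how strongly each pair of columns moves together. Values run from -1 to 1. The closer a value is to 1 or -1, the stronger the relationship between those two features; values near 0 mean little or no linear relationship.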
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks", color_codes=True)
fig, ax = plt.subplots(figsize=(15,15))
sns.heatmap(wfh.corr(), annot=True, ax=ax, vmin=-1, vmax=1, center= 0)
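A 23-by-23 heatmap is a lot to squint at. If the goal is one target column, the correlations can also be pulled out as a sorted list. This is a sketch on invented data with shortened, hypothetical column names (`OPTIMISM_EMPLOYER`, `OPTIMISM_SELF`), not the real survey answers.

```python
import pandas as pd

# Invented true/false answers; column names shortened for the example.
demo = pd.DataFrame({
    "OPTIMISM_EMPLOYER": [True, True, False, False, True, False],
    "OPTIMISM_SELF": [True, True, False, False, True, True],
    "HAVE_ISOLATED_WORKSPACE": [True, False, True, False, True, False],
})

target = "OPTIMISM_EMPLOYER"
corr_with_target = (
    demo.astype(int)      # booleans as 0/1 so corr() treats them as numbers
    .corr()[target]       # one column of the correlation matrix
    .drop(target)         # a column always correlates perfectly with itself
    .sort_values(key=abs, ascending=False)  # strongest relationships first
)
print(corr_with_target)
```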
Saving the clean data for later; I might import it into Tableau.
wfh.to_csv('boolified.csv')
I wanted to see what features are important to whether or not you're optimistic about your employer's return to normal plan.
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X = wfh.drop(columns=['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER'])
y = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER']
Here I ran a simple logistic regression to try to predict y, RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER. Using all 23 columns, I've got an equation here that will tell me with 72% accuracy whether you're optimistic or not. That's really good! If random is 50%, I'm 22 percentage points better than random, though with only 36 responses in the test set that number can move around a lot. :fireemoji:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.70 0.78 0.74 18
True 0.75 0.67 0.71 18
accuracy 0.72 36
macro avg 0.72 0.72 0.72 36
weighted avg 0.73 0.72 0.72 36
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
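The warning above suggests two remedies: raise `max_iter` or scale the data. AGE runs from 20 to 50 while every other column is 0/1, which is the kind of mismatch that slows the solver down. A minimal sketch of both remedies together, on synthetic data standing in for the survey (the arrays and the `max_iter=1000` value are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one wide-range column (like AGE) next to three 0/1 columns.
age = rng.choice([20, 30, 40, 50], size=70)
flags = rng.integers(0, 2, size=(70, 3))
X_demo = np.column_stack([age, flags])
y_demo = flags[:, 0]  # label tied to the first flag so there is something to learn

# Scale every column to mean 0 / variance 1, then fit with a higher iteration cap.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)
print(model.score(X_demo, y_demo))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only when the pipeline is used with a train/test split.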
The dashed line here is random guessing. The closer the blue line goes to the top left corner of this chart, the better my algorithm is.
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
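# AUC should be computed from predicted probabilities, the same scores the curve is drawn from
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])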
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
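One detail about the area number in the legend: `roc_auc_score` gives different answers depending on whether it is fed hard True/False predictions or probabilities. The curve itself is drawn from probabilities, so the area should be too. A tiny sketch with made-up numbers shows the gap:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.4, 0.9]             # made-up predicted probabilities
labels = [int(p >= 0.5) for p in proba]  # what a 0.5 cut-off would turn them into

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

Hard labels throw away the ranking information the ROC curve is built on, which is why the two numbers disagree.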
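Here, I can see which features actually mattered to the model. The closer a coefficient is to 0, the less that feature matters; the sign tells you whether it pushes the prediction toward optimistic (positive) or away from it (negative). AGE is the only column that isn't true/false, so its coefficient isn't directly comparable to the others. Let's see if I can't figure out a smaller subset of features that will give me the same 72% accuracy, or better.
The top three columns are pretty important:
- RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF
- 0_KIDS_AT_HOME
- HAVE_ISOLATED_WORKSPACE
...That paints a picture, doesn't it?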
column_labels = X.columns.tolist()
coef = logreg.coef_.squeeze().tolist()
labels_coef = list(zip(column_labels, coef))
pd.DataFrame({"Feature":X.columns.tolist(),"Coefficients":logreg.coef_[0]}).sort_values(by=['Coefficients'], ascending=False)
| | Feature | Coefficients |
---|---|---|
9 | RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF | 1.457509 |
18 | 0_KIDS_AT_HOME | 0.873659 |
1 | HAVE_ISOLATED_WORKSPACE | 0.870934 |
5 | PRODUCTIVITY_LAST_MONTH | 0.298454 |
6 | ENJOYED_WFH_PRE_COVID | 0.260499 |
15 | 1_ADULT_AT_HOME | 0.223816 |
2 | PREVIOUSLY_WORKED_FROM_HOME | 0.144423 |
11 | 2_PLUS_ADULTS_AT_HOME | 0.113560 |
13 | 2_ADULTS_AT_HOME | 0.047703 |
14 | 2_KIDS_AT_HOME | -0.018416 |
8 | ENJOYED_WFH_LAST_MONTH | -0.019678 |
0 | AGE | -0.043531 |
21 | YEARS_OF_EXPERIENCE_10_YEARS_OR_GREATER | -0.084392 |
3 | PRODUCTIVITY_PRE_COVID | -0.104972 |
22 | PREVIOUSLY_WORKED_FROM_HOME_5_YEARS_OR_GREATER | -0.129090 |
7 | ENJOYED_WFH_COVID_START | -0.148752 |
19 | WFH_50_PERCENT_OR_GREATER_PRE_COVID | -0.190729 |
20 | WFH_50_PERCENT_OR_GREATER_FUTURE_PREFERENCE | -0.312987 |
12 | 2_PLUS_KIDS_AT_HOME | -0.338904 |
17 | 0_ADULTS_AT_HOME | -0.386431 |
10 | MALE | -0.420092 |
4 | PRODUCTIVITY_COVID_START | -0.432348 |
16 | 1_KID_AT_HOME | -0.517691 |
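Logistic regression coefficients are on a log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: above 1 means the feature raises the odds of being optimistic, below 1 means it lowers them. A quick sketch using four coefficients copied from the table above (the selection of rows is just for illustration):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above.
coefs = pd.Series({
    "RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF": 1.457509,
    "0_KIDS_AT_HOME": 0.873659,
    "HAVE_ISOLATED_WORKSPACE": 0.870934,
    "1_KID_AT_HOME": -0.517691,
})

# exp(coefficient) = multiplicative change in the odds when the flag is True.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

With so few responses these ratios are rough, so treat them as a direction and a ballpark rather than precise effects.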
Using just those three columns, I'm up to 75% accuracy.
(For my data science folks, I'm taking note of the f1-score, but this time around the accuracy and f1-score are the same, and easier to explain!)
columns_to_use = ['RETURN_TO_NORMAL_PLAN_OPTIMISM_SELF','0_KIDS_AT_HOME','HAVE_ISOLATED_WORKSPACE']
X = wfh[columns_to_use]
y = wfh['RETURN_TO_NORMAL_PLAN_OPTIMISM_EMPLOYER']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision recall f1-score support
False 0.74 0.78 0.76 18
True 0.76 0.72 0.74 18
accuracy 0.75 36
macro avg 0.75 0.75 0.75 36
weighted avg 0.75 0.75 0.75 36
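With 36 test rows, the gap between 72% and 75% is a single respondent, so one train/test split can't say much about which model is better. Cross-validation averages over several splits and shows the spread. This is a sketch on synthetic data shaped like the three-column model, not a rerun of the survey; the noise rate and fold count are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 71 rows, three 0/1 features, label follows the first
# feature except for roughly a quarter of rows that are flipped as noise.
X_demo = rng.integers(0, 2, size=(71, 3))
noise = rng.random(71) < 0.25
y_demo = np.where(noise, 1 - X_demo[:, 0], X_demo[:, 0])

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

If the standard deviation across folds is larger than the difference between two models, the data can't tell them apart.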
What's my plan next with this data? Probably toss the boolified version of the data into Tableau, and get some fancy charts.
Top comments (1)
I got a question via Slack: "What was the reasoning behind making gender a true/false "Are you a Male" response?"
It was so I could evaluate it as a boolean (True/False) instead of comparing two strings ("Male" vs. "Female"). It's the simplest implementation of a technique called one-hot encoding.
machinelearningmastery.com/why-one...
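To make the one-hot idea concrete, here's a small sketch with pandas' `get_dummies` on a made-up three-row column. Full one-hot encoding creates one true/false column per category; with only two categories, one of those columns carries all the information, which is what the MALE flag amounts to.

```python
import pandas as pd

# Made-up answers, just to show the shape of the encoding.
demo = pd.DataFrame({"GENDER": ["Male", "Female", "Male"]})

# One column per category.
one_hot = pd.get_dummies(demo["GENDER"], prefix="GENDER")
print(one_hot)

# With two categories, a single column is enough.
male_only = one_hot["GENDER_Male"].astype(bool)
print(male_only.tolist())
```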