I'm having problems following this and getting working code. Specifically, on the random_forest.fit(train_X, train_y) call, I get the following error: "ValueError: could not convert string to float: 'setosa'"
I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?
Also, the steps seem to be out of order. Shouldn't you do the get_dummies(y) call before you do the train_test_split(x, y, ...)? Maybe this isn't intended to be a full working example?
Right! My bad 😅 The order is actually correct. Doing get_dummies before splitting the data might cause data leakage. We want to make sure that when we split our data, it is "pure". My mistake was the line y = pd.get_dummies(y). I've updated it so that it looks like this instead:
    train_y = pd.get_dummies(train_y)
    val_y = pd.get_dummies(val_y)
Sorry I took so long to reply 😅 You can easily reach me on Twitter, though: @heyimprax.
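For readers following along, here is a minimal sketch of the split-then-encode order described in the reply. It is not the article's code: it assumes scikit-learn's built-in iris data in place of the article's dataset, and the final reindex step is an added safeguard (not something the reply mentions) for the case where one split happens to be missing a class.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for the article's dataset.
iris = load_iris(as_frame=True)
X = iris.data
# Map the integer targets back to the string labels used in the thread.
y = pd.Series(iris.target_names[iris.target.to_numpy()], name="species")

# Split first, while y is still a single column of labels.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# One-hot encode each split separately, as in the reply.
train_y = pd.get_dummies(train_y)
val_y = pd.get_dummies(val_y)

# Added safeguard: make sure both splits expose the same columns in the same
# order, even if a class is absent from the validation split.
val_y = val_y.reindex(columns=train_y.columns, fill_value=0)

print(train_y.columns.tolist())
```

Encoding the two splits separately only lines up when every class appears in both, which is why the reindex call is there.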
I added the two lines above, but I still get the same error message: "ValueError: could not convert string to float: 'setosa'"
I think this may be because the train_X data still has the species in text format. How does the fit function know the relationship between the X species column and the Y setosa/versicolor/virginica columns? Do I need to do one-hot encoding on the X data?
Could you post a full, working Python script somewhere so I can see how this is supposed to work?
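One way to check the suspicion that train_X still holds a text column is to look at the column dtypes. The sketch below uses a tiny hypothetical stand-in for train_X (the column names and values are made up for illustration, not taken from the article).

```python
import pandas as pd

# Hypothetical stand-in for train_X: two numeric features plus a text column.
train_X = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Columns that are not numeric are the ones fit() cannot convert to float.
non_numeric = train_X.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['species']

# Keeping only the numeric columns gives a feature matrix that fit() can use.
numeric_X = train_X.select_dtypes(include="number")
```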
Oh right! Take species out of the features array. That should fix the "ValueError: could not convert string to float: 'setosa'"
Also, I've added the missing from sklearn.metrics import mean_absolute_error for the mean_absolute_error function.
Here's a link to a working Kaggle notebook: kaggle.com/interestedmike/iris-dat...
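For anyone who cannot open the notebook, the following is a rough end-to-end sketch of the fixes discussed in this thread. It is not the notebook's code. It assumes scikit-learn's built-in iris data, a RandomForestRegressor (the thread only shows the name random_forest), and illustrative column names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data, with renamed columns, replaces the article's dataset.
iris = load_iris(as_frame=True)
data = iris.data.copy()
data.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
data["species"] = iris.target_names[iris.target.to_numpy()]

# The fix from the reply: "species" is the target, so it stays out of the features.
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = data[features]
y = data["species"]

# stratify=y keeps all three classes in both splits, so the one-hot columns match.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1, stratify=y)

# Split first, then one-hot encode the labels of each split.
train_y = pd.get_dummies(train_y).astype(float)
val_y = pd.get_dummies(val_y).astype(float)

# Assumption: a regressor, since the thread scores with mean_absolute_error.
random_forest = RandomForestRegressor(n_estimators=100, random_state=1)
random_forest.fit(train_X, train_y)

predictions = random_forest.predict(val_X)
mae = mean_absolute_error(val_y, predictions)
print(f"Validation MAE: {mae:.3f}")
```

The model learns the link between the measurements and the species only through the y columns passed to fit, so X does not need a species column at all, encoded or otherwise.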