When I started studying data science, I became fascinated by neural networks and their power in complicated applications such as computer vision and natural language processing (NLP). Because of that power, I wanted to use them in every single problem. But I had to calm down! Sometimes a simple model can get a good score.
In this post I share my experience as a beginner in my first data science challenge and how it helped me grow as a student and as a data scientist. I'll never forget the power of a simple linear regression model!
The Challenge
Codenation is a website that periodically organizes challenges as the first step into its acceleration programs in different areas, one of them being data science. The last data science challenge was about predicting students' math scores in the ENEM (the Brazilian exam used for admission to public universities).
I started it so excited! But I was blind, because I didn't try any model other than Random Forest and a Neural Network to predict the math score. I did some preprocessing to replace NaN values and selected some features with high correlation. After that, I worked hard with RandomizedSearchCV to select the best parameters (roughly as in the sketch below). Despite all the hard work, I was unable to reach 90% and join Codenation. So I got frustrated and gave up.
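I no longer have that first notebook, so the snippet below is only an illustrative sketch of that tuning step: the parameter grid is a guess, RandomForestRegressor stands in for the model I tried, and X_train / y_train refer to the train split shown later in this post.

# Illustrative only: hypothetical parameter ranges, not the original notebook
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 5],
}
search = RandomizedSearchCV(RandomForestRegressor(), param_distributions=param_dist,
                            n_iter=10, cv=5, scoring='r2', n_jobs=-1)
search.fit(X_train, y_train)  # same train split used further below
print(search.best_params_, search.best_score_)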
A blessing in disguise...
Recently I came across the same dataset on Kaggle. It had been a while since I accepted the challenge, so I tried it again. Below I show a new approach to the challenge and how I had judged a simple model as weak without even trying it. That was a big mistake and a great learning experience.
A New Approach
Here I won't describe everything I did, for example the full data preprocessing. But if you want to see my notebook, you can access it on Kaggle.
First of all, I checked the dataset for NaN values. These values were replaced with 0, because I treated them as student dropouts. Afterwards, I realized that there were some correlations among the features. My idea was to select the most correlated features and use them to predict the math score. The heatmap below shows these correlations using the Pearson coefficient; a sketch of how it can be produced follows it.
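As a minimal sketch of this step, the snippet below fills the NaNs with 0 and draws a Pearson correlation heatmap with seaborn. Note that df_train and the listed score columns are assumptions for illustration; the exact selection in my Kaggle notebook may differ.

# Sketch only: df_train is the raw training DataFrame, columns are example ENEM scores
import seaborn as sns
import matplotlib.pyplot as plt

df_train_filled = df_train.fillna(0)  # treat missing scores as dropouts
score_cols = ['NU_NOTA_CN', 'NU_NOTA_CH', 'NU_NOTA_LC',
              'NU_NOTA_REDACAO', 'NU_NOTA_MT']
corr = df_train_filled[score_cols].corr(method='pearson')
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()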
As we can see, they have high correlations. So I decided to use them as predictor features in a simple linear regression model, as shown below.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Drop the target and NU_NOTA_COMP5; everything else is a predictor
X = df_train_filled.drop(columns=['NU_NOTA_COMP5', 'NU_NOTA_MT'])
y = df_train_filled['NU_NOTA_MT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# normalize=True was removed in scikit-learn 1.2; it only works on older versions
lr = LinearRegression(normalize=True)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)  # returns R² on the test set
Using a simple train/test split we got a score (R²) of about 0.90. This score was better than the random forest and neural network models. But maybe you are wondering: "Did you just use part of the database? For a complete picture, you need cross validation!" Ok, ok... you're right! I did that too, as you can see below.
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score

# making scores
mae = make_scorer(mean_absolute_error)
r2 = make_scorer(r2_score)
# 10-fold cross validation reporting both metrics
cvs = cross_validate(estimator=LinearRegression(normalize=True), X=X, y=y,
                     cv=10, verbose=10, scoring={'mae': mae, 'r2': r2})
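For reference, a minimal sketch of how the per-fold results in the cvs dictionary above can be averaged:

import numpy as np
# cross_validate stores one array per scorer, prefixed with 'test_'
print(np.mean(cvs['test_mae']))  # mean absolute error across the 10 folds
print(np.mean(cvs['test_r2']))   # mean R² across the 10 folds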
I used the mean absolute error and the R² score to evaluate the model. The mean scores were 50.027 and 0.902, respectively. So maybe this model can predict the math score with an R² of about 0.90 on unseen data. So I can be happy and make my submission! No, no...
Unfortunately, it is no longer possible to make a submission on Kaggle or on the original website. The Codenation challenge closed a long time ago, so I have to wait for a new chance in the future.
However, what can we learn from it?
It's important to notice that, even with the random forest and neural network models, I could have done better preprocessing or selected other features and gotten a good score. Yes, that's right! But this experience was important to me, because I learned from it and became a better data scientist.
Even if you think a model is too simple for a hard task, you should give it a chance. Maybe it won't reach a high score or result. However, it can become a starting point to verify whether other models actually help you improve the score.
I hope this post helps you!!