4 Lessons I learned using data science in a real scientific article

Lan — Tue, 20 Jul 2021 00:22:33 +0000

Over the past few years, interest in data science has grown enormously. However, the "new" field has mostly attracted two audiences:

Computer scientists who produce data science related articles (deep learning models, comparisons of transformation techniques).
People who want new skills to work in a high-paying industry.

Personally, I don't think I fit into either of those. I always thought it was good that data science's potential as a tool was finally being popularized, but one thing bothered me: at the time this article was written, data science still wasn't being broadly used as a research tool in fields other than computer science.

So, being a Physics major at UFRJ (one of the best Brazilian universities) and a self-taught data scientist, I looked for a research group at the university where I could use my data science skills to produce better evidence for a scientific work.

Soon, I joined a group that was already writing an article. The group used statistics to determine which comorbidities led, statistically, to more deaths from Covid-19. Along with that, I asked another question: how well can you predict someone's death based on the symptoms they have? The article was published internationally, and the results and discussion can be seen here.

Finally, here are 4 lessons I learned over 8 months of working on the article.

1 - Science isn't a straight line

What I mean by that is: science is not your beginner project, with a well-defined dataset and no responsibility attached. First of all, the data you have is usually a mess, and the longest part of the workflow is cleaning it. Second, deciding what to do with each variable is basically a way of molding reality before feeding it to your model. It is a tiring process, but when done well it makes all the difference to the outcome.

When working on an article, there is always a hypothesis. Those who first came up with the hypothesis surely have a guess about what the result will turn out to be.

However, that should always be a point of attention, since clinging too tightly to what you expect the result to be might lead you into confirmation bias (look it up if you haven't heard of it. It's important). That leads us to the second point:

2 - Science learns from bad results

If an analysis didn't result in pretty visualizations and excellent models, there's always something to learn from that.

Maybe the data just doesn't answer the question you're asking. Maybe there's something you don't know about how the data was collected, or about what exactly each variable means, that would make all the difference to the outcome. Data is not a human: it won't lie to you. So, try to listen to what it is trying to tell you, rather than only seeing what you expect.

3 - Model performance isn't always the main goal

Working on the article, our main goal was to measure the relative importance of our variables to the model's outcome. After working on it for a few weeks, we ran into a problem: the model performed better when we removed two specific variables. If it were an industry problem, this wouldn't be an issue: results are the main goal, so just delete them.

However, in our case, it was not wise to remove these variables. Although they did not help the model's performance, they were relevant healthcare indicators. Does it mean we designed bad features? Does it indicate a systematic problem in the way the data was collected?

In my example, our data was collected in hospitals by stressed-out healthcare workers from a developing country during a pandemic. We can't simply close our eyes to the logistical and humanitarian challenge. Or maybe it just means that the variable we thought was an important indicator for Covid-19 deaths turned out not to be. Don't forget the first point.

4 - Reproducibility matters

When working with data science, we (at least I did!) tend to develop models and analyses in a messy way. No one else is going to read it, so why bother making it easy to read and accessible?

First of all, reproducibility is not about making things look good: it is all about making sure that every decision you made is clear, including where you took the data from, what feature engineering decisions you made along the way, how you dealt with missing values, how you chose a model, and how and why you chose the metrics you're using.
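One way to keep track of those decisions is to write them down in a small machine-readable file that travels with the analysis. What follows is only a minimal sketch of that idea: the file name, the keys and every value in it are made up for illustration and are not the ones from our article.

```python
import hashlib
import json
import random


def fingerprint(raw_bytes):
    """Return a SHA-256 hex digest so readers can check they hold the same data."""
    return hashlib.sha256(raw_bytes).hexdigest()


def build_manifest(raw_bytes, seed):
    """Collect the analysis decisions in one place (all values are illustrative)."""
    return {
        "data_source": "hypothetical_hospital_records.csv",  # assumed file name
        "data_sha256": fingerprint(raw_bytes),
        "missing_values": "rows without an outcome were dropped",  # example decision
        "feature_engineering": ["age grouped into 10-year bins"],  # example decision
        "model": "gradient boosted trees",  # example choice
        "metric": "accuracy on a held-out test set",  # example choice
        "random_seed": seed,
    }


SEED = 42
random.seed(SEED)  # fix the seed so any random split can be repeated exactly

toy_data = b"age,diabetes,outcome\n63,1,0\n71,0,1\n"  # stand-in for the real file
manifest = build_manifest(toy_data, SEED)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Anyone holding the same file can recompute the hash and compare it with the one in the manifest before trying to reproduce the numbers.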

The scenario you have to imagine is this: suppose someone has the same data you have. Can they reproduce your exact results using only the information you put in the article? If not, it shouldn't be considered science. You're no better than a magician.

Don't get me wrong, it's not that I don't trust you: science isn't about trust. Would you trust a rocket that only one scientist said works? Or would you rather have it checked by dozens of experienced engineers?

It's only science if it can be reproduced.

How machine learning made me win more in League of Legends

Lan — Fri, 02 Oct 2020 19:14:17 +0000

I don't know you, but I'm sure one of these statements is true: either you have played League of Legends before, or you know at least one person who has. That's a fact.

The multiplayer online battle arena (MOBA) game has become a craze over the last decade, having reached in 2021 a simultaneous player count record of 11 million players. Since I'm a nerd myself, I also used to be a regular player until last year, just like that one person you know or are.

This year, searching through Kaggle for cool data that I could use in a data science project, I stumbled upon a dataset that contains information about the first 10 minutes, the early game, of several diamond-tier matches. After looking through the data for a while, I asked myself the question: how well can I predict which team is going to win?

Also, it is common knowledge in the data science field that, if your model is a good predictor of something, it can also tell you which features of your data were most helpful to the prediction (that's called explainability in AI). So, going back to our data: if our model is a good predictor of which team is going to win, it is also a good way to find out which aspects of the game you should focus on improving to win more matches and climb the ranks.
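To make the idea concrete, here is a minimal sketch of one generic way to ask a model which features it relies on: permutation importance, that is, shuffling one feature at a time and measuring how much accuracy drops. This is not necessarily the method behind the plot further down, and the toy model, the feature names and the random data are all assumptions made up for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(labels)


def permutation_importance(model, rows, labels, feature, rng):
    """Accuracy lost when one feature's values are shuffled across rows."""
    baseline = accuracy(model, rows, labels)
    values = [row[feature] for row in rows]
    rng.shuffle(values)
    shuffled_rows = [dict(row, **{feature: value}) for row, value in zip(rows, values)]
    return baseline - accuracy(model, shuffled_rows, labels)


def toy_model(row):
    """Stand-in for a trained model: predicts a blue win (1) when blue leads in gold."""
    return 1 if row["gold_diff"] > 0 else 0


rng = random.Random(0)  # seeded so the run is repeatable
rows = [
    {"gold_diff": rng.randint(-3000, 3000), "wards_diff": rng.randint(-10, 10)}
    for _ in range(200)
]
# Toy labels that depend only on gold, so the expected ranking is known in advance.
labels = [toy_model(row) for row in rows]

gold_importance = permutation_importance(toy_model, rows, labels, "gold_diff", rng)
wards_importance = permutation_importance(toy_model, rows, labels, "wards_diff", rng)
print("gold_diff:", gold_importance)
print("wards_diff:", wards_importance)
```

Shuffling a feature the model ignores costs nothing, while shuffling the one it depends on hurts accuracy a lot. That gap is what an importance ranking summarizes.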

For this analysis, I tried a few tree-based ML models and ended up with XGBoost, which performed best, with 75% accuracy on test data.

If you wish to go into more technical detail than that, check out the notebook. From now on I will only be showing results.

So, what can we learn from early game?

A plot of feature importance for our model:

Important: Diff in the variable names stands for (BlueTeam - RedTeam) of that quantity.
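As a quick illustration of that convention, here is a minimal sketch that builds a Diff value from one match. The column names and the numbers are assumptions made up for the example, so they may not match the ones in the dataset.

```python
def add_diff_features(match, quantities):
    """Return a copy of `match` with <quantity>Diff = blue value - red value added."""
    out = dict(match)
    for quantity in quantities:
        out[quantity + "Diff"] = match["blue" + quantity] - match["red" + quantity]
    return out


# One made-up match; the keys follow an assumed "blue..."/"red..." naming scheme.
match = {
    "blueTotalGold": 17210,
    "redTotalGold": 16567,
    "blueTotalExperience": 17039,
    "redTotalExperience": 17047,
}

with_diffs = add_diff_features(match, ["TotalGold", "TotalExperience"])
print(with_diffs["TotalGoldDiff"])        # 643: blue is ahead in gold
print(with_diffs["TotalExperienceDiff"])  # -8: red is slightly ahead in experience
```

A positive Diff means the blue team is ahead in that quantity and a negative one means the red team is.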

1 - Gold is what matters.

As seen in the feature importance graph, gold difference is by far the most important variable for our model. The fact that Exp difference is the second most important probably means that farming is one of the most efficient ways to earn gold (also, it gives you more experience at a lower risk). That probably means you should be paying more attention to farming and the laning phase than to getting kills.

2 - Don't forget elite monsters.

In the 3rd and 5th positions, we have elite monsters, so don't underestimate them. Their buffs really can change the outcome of a match.

3 - Blocking vision is just as valuable as setting vision.

A surprising fact is that destroying wards is just as important as placing them. Leaving the enemy team blind is just as important as giving your team vision.

4 - Assists matter.

That doesn't mean that an assist by itself is that relevant to gold difference. What matters is the meaning behind it: play with your team. Teams that are united usually get more kills, map pressure and objectives.

5 - Sacrificing yourself for an elite monster is not always worth it.

Elite monsters are important and can give you huge map pressure, but if that means having your entire team killed for it, maybe it's not worth it. Play smart.

6 - Early game matters more than you think.

A model with data from only the first 10 minutes of a match can predict its outcome with 75% accuracy. What does that tell you? The pace and attitude that you set at the beginning can, most of the time, be the difference between victory and defeat.

Final considerations

Considering how much information is missing, I think our model performed great, as it correctly guessed the outcome of ~75% of matches by looking only at a reduced amount of early game data.

DEV Community: Lan