## Background
Over the summer, I finally managed to get my feet wet with machine learning by following a tutorial from FastAI (found here). Deep learning is a complex subject, and building well-thought-out models with real-world implications is a skill that takes years to master. But modern advances have made the technology more accessible than ever, and people like Jeremy Howard (of FastAI) and Andrew Ng have created approachable courses that put artificial intelligence within reach for someone like me. I'm a high schooler with minimal knowledge of linear algebra and only basic coding skills, and it's remarkable that I was able to cobble together something that (kind of?) worked in just a few weeks.
## About FastAI
FastAI is a deep learning framework that's simpler to learn than many alternatives because it's built on top of PyTorch and bakes in best practices. It was created by Jeremy Howard, a legend in the deep learning community, along with many other contributors.
When you code a model, FastAI does a lot behind the scenes for you, from creating DataLoaders that simplify data prep to helping you find a good learning rate. This is illustrated when you compare this Kaggle notebook, which uses PyTorch, with this notebook, which uses FastAI. Because FastAI does so much hand-holding and time-saving, it's probably less flexible than working directly in PyTorch or TensorFlow in ways I don't yet understand. For my purposes, though, it's a great tool for actually making something I can see working on data I collected myself.
## The project
My project centered on predicting the locations of fish in the class Actinopterygii, the ray-finned fishes. I wanted to see whether I could predict a fish's location given other data points about it, such as its family and genus. I pulled data from the IUCN Red List, cleaned it with the pandas library, and then used FastAI to build a model that predicts a fish's latitude and longitude from its other characteristics.
## Pandas
After cleaning and normalizing, my data's head looked like this:
| presence | origin | seasonal | event_year | latitude | longitude | order_as_num | family_as_num | genus_as_num | category_as_num |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.17 | 0.20 | 1.00 | 0.98 | -0.38 | 0.22 | 0.00 | 0.00 | 0.00 |
| 1 | 0.17 | 0.20 | 1.00 | 0.99 | -0.24 | 0.21 | 0.00 | 0.00 | 0.00 |
| 2 | 0.17 | 0.20 | 1.00 | 0.99 | 0.54 | -0.54 | 0.07 | 0.01 | 0.00 |
| 3 | 0.17 | 0.20 | 1.00 | 0.98 | 0.22 | -0.43 | 0.07 | 0.03 | 0.01 |
| 4 | 0.17 | 0.20 | 1.00 | 1.00 | 0.17 | 0.50 | 0.00 | 0.04 | 0.01 |
Not bad!
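To give a feel for the kind of cleaning and normalization I mean, here's a minimal sketch. The tiny inline DataFrame is a stand-in for the real IUCN export, and the exact encoding and scaling scheme are illustrative, not my actual pipeline:

```python
import pandas as pd

# Stand-in for the raw IUCN Red List export (values are made up)
raw = pd.DataFrame({
    "order": ["Perciformes", "Perciformes", "Cypriniformes"],
    "family": ["Cichlidae", "Serranidae", "Cyprinidae"],
    "genus": ["Oreochromis", "Epinephelus", "Cyprinus"],
    "category": ["LC", "VU", "LC"],
    "event_year": [2001, 2010, 1998],
    "latitude": [-1.5, 24.3, 47.0],
    "longitude": [33.9, -81.2, 19.1],
})

# Encode taxonomic text columns as numeric codes
for col in ["order", "family", "genus", "category"]:
    raw[f"{col}_as_num"] = raw[col].astype("category").cat.codes

# Scale numeric columns into roughly [-1, 1] by dividing by the max magnitude
for col in ["event_year", "latitude", "longitude",
            "order_as_num", "family_as_num", "genus_as_num", "category_as_num"]:
    m = raw[col].abs().max()
    if m:
        raw[col] = raw[col] / m
```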
## FastAI
Then I packaged my data into dataloaders with FastAI:
```python
dls = TabularPandas(
    fish_data,
    splits=splits,
    procs=[Categorify],
    cat_names=["category_as_num", "order_as_num", "family_as_num",
               "genus_as_num", "seasonal", "origin", "presence"],
    cont_names=["event_year"],
    y_names=["latitude", "longitude"],
).dataloaders(path=".", bs=256)
```
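The `splits` variable comes from earlier in my notebook; conceptually it's just a shuffled division of row indices into training and validation sets. Here's a framework-free sketch of the idea (the function name, seed, and 80/20 split are illustrative):

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=42):
    """Shuffle row indices and split them into (train, valid) lists."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)
    cut = int(n_rows * valid_pct)
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(100)
```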
I then created a tabular learner with the help of FastAI:
```python
learn = tabular_learner(dls, metrics=accuracy_multi, layers=[10, 10])
learn.lr_find(suggest_funcs=(slide, valley))  # plot suggested learning rates
learn.fit(16, lr=0.01)  # train for 16 epochs
```
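In hindsight, `accuracy_multi` is a classification metric, and predicting coordinates is a regression problem, so something like root-mean-squared error would have been a better fit. Here's a plain-NumPy sketch of what RMSE measures (the sample coordinate values are made up):

```python
import numpy as np

def rmse(preds, targets):
    """Root-mean-squared error: typical distance between predictions and targets."""
    return np.sqrt(np.mean((np.asarray(preds) - np.asarray(targets)) ** 2))

# Predicted vs. true (latitude, longitude) pairs, in normalized units
preds = np.array([[0.98, -0.38], [0.99, -0.24]])
targets = np.array([[1.00, -0.40], [0.97, -0.20]])
print(rmse(preds, targets))  # ≈ 0.0265
```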
## Results
My loss plot ended up looking like this:
To tell the truth, I have no idea if that's even any good for a project like this. I'm just happy that everything ended up working. I went into this project pretty much blind, with no actual experience in how to build, test, or refine a machine learning model. I also had no idea how to "feature engineer" a dataset or even how to normalize it (and I still have a lot to learn in those areas!). As expected, I made a lot of mistakes, from accidentally setting my metrics incorrectly to not properly dividing up categorical and continuous variables.
All in all, I'm super grateful that the machine learning community has come so far that a random somebody like me can start experimenting with cool data in a Jupyter Notebook. Not having to build models from scratch in a library like PyTorch significantly lowers the barrier to entry for inexperienced coders, and everyone has to start somewhere. I'm really looking forward to my next model!
The code for this project can be found on my GitHub here.
