My Second PR: How I Bounced Back and Contributed to Scikit-learn

#opensource #python #beginners #machinelearning

Steven here again.
Where to start... It was a busy, busy week. Yes, I know, I used the word 'busy' twice. Last week I found an amazing project, BEHAVIOR-1K, and it took me hours to understand the code related to the issue. So this time I wanted to find a more easygoing issue to contribute to. As I mentioned before, I have always been interested in machine learning, so I browsed through the pandas, NumPy and scikit-learn repositories. They are three well-known libraries used in machine learning projects. I thought it would be a good chance to leave my mark on a project of this size.

Finding a "Good First Issue" in Scikit-learn
After the tough first experience, I knew I needed a different approach. Since I’ve always been interested in ML, I decided to choose the Scikit-learn repository. I figured, why not try to contribute to a tool I actually use? I went directly to their Issues tab on GitHub and started looking for labels like good first issue or help wanted. It felt way more focused, and that’s when I found it.

The issue
The issue was quite simple: change relative import paths to absolute import paths in some of their Cython files.

from ...utils._typedefs cimport ...

from sklearn.utils._typedefs cimport ...

The code change itself was just replacing a relative path with an absolute one. Even though it was a simple task, I had to go through the project files to find where the functions and types were defined, so I could be sure I wrote the correct absolute path. It may seem trivial, but I picked up some interesting new knowledge along the way.
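To make the relative-to-absolute conversion concrete, here is a minimal Python sketch of the idea. It is not what I ran for the PR (I edited the files by hand), and the package name `sklearn.some_pkg.sub_pkg` is a made-up placeholder for "the package that contains the file being edited". It relies on `importlib.util.resolve_name` from the standard library, which resolves the leading dots without importing anything.

```python
import re
from importlib.util import resolve_name

# Matches "from <relative module> import ..." and "from <relative module> cimport ...".
_RELATIVE_IMPORT = re.compile(r"^(\s*from\s+)(\.+[\w.]*)(\s+c?import\s+.*)$")


def to_absolute(line: str, package: str) -> str:
    """Rewrite one relative (c)import line as an absolute one.

    `package` is the dotted name of the package that contains the file.
    Lines that are not relative imports are returned unchanged.
    """
    match = _RELATIVE_IMPORT.match(line)
    if match is None:
        return line
    prefix, relative_module, rest = match.groups()
    # Each leading dot means "one level up", starting from `package`.
    return prefix + resolve_name(relative_module, package) + rest


# Hypothetical package path, used only for illustration.
package = "sklearn.some_pkg.sub_pkg"
print(to_absolute("from ...utils._typedefs cimport float64_t, intp_t", package))
```

Three dots climb from `sklearn.some_pkg.sub_pkg` up to `sklearn`, which is why the result starts with `sklearn.utils`. A file sitting one level higher would need only two dots for the same target, and that is exactly why I had to check each file's location.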

Cython???
Okay, so the task was just changing an import path. Simple, right? But I did learn something new that should be useful in the future.
The files I was editing weren't standard .py files; they were .pyx and .pxd files. This was my first real encounter with Cython, a superset of Python that compiles to C and is used to give Python code C-level speed. It’s one of the secret sauces that makes libraries like Scikit-learn so fast.

from ...utils._typedefs cimport float64_t, float32_t, intp_t

At first, I had no clear idea what cimport or float64_t meant. I learned that cimport is a Cython statement that imports C-level declarations, which are resolved when the file is compiled. And names like float64_t are basically C-style "nicknames" (type aliases) that line up with NumPy data types.
By using these C-style types, Cython knows to treat them as simple, super-fast C variables instead of slower Python objects. So, while I was just fixing a file path, I accidentally learned a fundamental concept about how Python libraries are optimized for performance.
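Here is a small, standard-library-only sketch of why that difference matters. It is plain Python, not Cython, and it assumes that float64_t corresponds to a C double, which is my reading of the typedefs file. It compares the size of a raw C double with the size of a full Python float object on CPython.

```python
import struct
import sys
from array import array

# Size in bytes of a raw C double, the kind of value a C-typed variable holds.
c_double_bytes = struct.calcsize("d")

# Size in bytes of a Python float object, including its object header (CPython).
py_float_bytes = sys.getsizeof(1.0)

# array("d") stores its items as packed C doubles rather than Python objects.
values = array("d", [0.5, 1.5, 2.5])

print("C double:", c_double_bytes, "bytes")
print("Python float object:", py_float_bytes, "bytes")
print("array('d') item:", values.itemsize, "bytes")
```

The Python object carries extra bookkeeping (a type pointer and a reference count) on top of the number itself, while the C value is just the number. Size is only part of the story, but it gives a feel for why typed C variables are cheaper to work with.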

I don't know if you can get this feeling from my blog posts but I am having lots of fun doing this. I hope everyone else is experiencing the same.
