theralavineela

My GSoC Journey So Far: Exploring Geographically Weighted Learning in PySAL

Introduction

Over the past couple of months, I’ve been gradually ramping up my involvement with the PySAL / gwlearn ecosystem as part of my preparation for Google Summer of Code (GSoC).

December and January were a bit of a balancing act — semester exams on one side, and open-source exploration on the other. While I couldn’t code full-time throughout this period, I stayed consistently engaged by following community discussions, testing compatibility issues, experimenting locally, and strengthening my understanding of geographically weighted (GW) models.

This post summarizes what I worked on, what I learned, and the research direction I’m currently exploring.

December: Foundations, Setup, and First Experiments

Revisiting the Basics

I began by revisiting geospatial data science fundamentals, using:

ISRO geospatial course notes

PySAL documentation and mentor-recommended resources

This helped solidify the theoretical grounding required for geographically weighted models, especially around spatial relationships and locality-aware learning.

Installing and Verifying gwlearn

I installed the gwlearn package following the official GitHub documentation and verified the installation by checking the package version in Python.

Instead of modifying library files directly, I created a playground script using the Guerry dataset — effectively a “Hello World” example for gwlearn — to ensure everything worked end to end.

This single example validated:

  • GeoPandas integration
  • Kernel computations (tricube kernel)
  • Scikit-learn–style .fit() API
  • Model execution through the core gwlearn code path

Seeing successful output gave me confidence that my environment and dependencies were set up correctly.
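As a rough sketch of what the kernel check above exercises: the tricube kernel assigns each neighbor a weight of (1 − (d/b)³)³ within the bandwidth b and zero beyond it, so influence decays smoothly with distance. This is a standalone illustration of the formula, not gwlearn's internal code:

```python
def tricube_weight(distance, bandwidth):
    """Tricube kernel: w = (1 - (d/b)**3)**3 inside the bandwidth, 0 outside."""
    u = abs(distance) / bandwidth
    if u >= 1.0:
        return 0.0
    return (1.0 - u ** 3) ** 3

# nearby points get weight near 1; weight decays smoothly to 0 at the bandwidth
weights = [tricube_weight(d, bandwidth=10.0) for d in (0.0, 5.0, 10.0, 25.0)]
```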

Learning About scikit-learn Metadata Routing

Later in December, I explored metadata routing, a feature introduced in recent versions of scikit-learn.

Metadata routing allows non-standard data (like spatial geometry) to flow through pipelines without breaking sklearn’s strict estimator API. This is especially important for spatial models, where geometry is essential but not traditionally part of X or y.
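To make the idea concrete without depending on sklearn itself, here is a toy, sklearn-free sketch of routing: each pipeline step declares which metadata keys it wants, and the pipeline forwards only those. All class and parameter names here are my own illustrations, not gwlearn's or sklearn's API:

```python
class Scaler:
    """A step that needs no metadata."""
    def fit(self, X, y=None, **metadata):
        return self

class GWModel:
    """A step that requires routed geometry."""
    def fit(self, X, y=None, **metadata):
        self.geometry_ = metadata["geometry"]
        return self

class ToyPipeline:
    """Routes each metadata key only to the steps that request it."""
    def __init__(self, steps, requests):
        self.steps = steps          # list of (name, estimator)
        self.requests = requests    # name -> set of requested metadata keys

    def fit(self, X, y, **metadata):
        for name, step in self.steps:
            wanted = {k: v for k, v in metadata.items()
                      if k in self.requests.get(name, set())}
            step.fit(X, y, **wanted)
        return self

pipe = ToyPipeline(
    steps=[("scale", Scaler()), ("gw", GWModel())],
    requests={"gw": {"geometry"}},
)
pipe.fit([[1.0], [2.0]], [0, 1], geometry=[(0.0, 0.0), (1.0, 1.0)])
```

In real scikit-learn, the request declarations are made through the metadata routing API rather than a hand-rolled dict, but the flow of information is the same.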

Understanding this concept turned out to be critical for my first real contribution.

Contribution: Making gwlearn More scikit-learn Compatible
The Problem

Scikit-learn expects estimators to follow this signature:

fit(X, y)

But gwlearn models inherently require geometry:

fit(X, y, geometry)

This mismatch caused:

Failures in sklearn.utils.estimator_checks

Incompatibility with Pipeline and GridSearchCV

Geometry being dropped during cloning or prediction

Passing geometry via the constructor or embedding it into X also caused additional issues.

The Solution

I worked on improving compatibility by:

Enabling sklearn metadata routing

Declaring geometry as required routed metadata

Refactoring fit() to accept **kwargs instead of explicitly requiring geometry

Introducing an adapter pattern so gwlearn remains expressive internally while presenting a clean sklearn-style interface externally

This approach allowed geometry to pass safely through pipelines, cross-validation, and hyperparameter search — without violating sklearn’s API constraints.
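A minimal sketch of the adapter idea described above: the internal model keeps its expressive, explicit signature, while the outward-facing wrapper accepts geometry as routed metadata via **kwargs. None of these names come from the actual PR; they only illustrate the pattern:

```python
class InternalGWModel:
    """Internal API: geometry is an explicit, required argument."""
    def fit(self, X, y, geometry):
        self.n_locations_ = len(geometry)   # stand-in for real GW fitting
        return self

class SklearnStyleAdapter:
    """External API: sklearn-style fit(X, y, **fit_params)."""
    def __init__(self, inner):
        self.inner = inner

    def fit(self, X, y, **fit_params):
        geometry = fit_params.get("geometry")
        if geometry is None:
            raise ValueError("geometry must be passed as routed metadata")
        self.inner.fit(X, y, geometry=geometry)
        return self

model = SklearnStyleAdapter(InternalGWModel())
model.fit([[0.0], [1.0]], [0, 1], geometry=[(0.0, 0.0), (1.0, 1.0)])
```

The benefit of this split is that sklearn tooling only ever sees the clean external signature, while the geometry-aware logic stays explicit internally.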

Pull Request

These changes were submitted as:

PR #45 — Add sklearn metadata routing support and stabilize Cross-Validation
https://github.com/pysal/gwlearn/pull/45

The PR:

  • Fixes critical sklearn compatibility issues
  • Stabilizes cross-validation behavior
  • Keeps changes minimal and backward compatible
  • Was verified using the Guerry dataset with Pipeline and GridSearchCV

It also sparked valuable design discussions with mentors, a welcome reminder that exploratory PRs are part of healthy open-source development.

January: Testing and Compatibility Work

Pandas 3.0 Testing

Even while focusing on semester exams, I kept checking community updates and running compatibility tests.

I ran the gwlearn test suite against both pandas 2.3.3 and pandas 3.0.0 using a GitHub Actions compatibility matrix:

  • pandas 2.3.3: all tests pass, including runs with -W error enabled
  • pandas 3.0.0: three test failures in test_base.py, where objects expected to be trained models become scalars (float)

This results in:

AttributeError: 'float' object has no attribute 'predict_proba'

I discussed these results with mentors and am currently investigating the underlying cause, which appears related to changes in pandas behavior affecting model persistence or reconstruction.

This highlighted the importance of testing against future dependency versions, not just current stable releases.
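A compatibility matrix like the one described above can be expressed in GitHub Actions along these lines. This is only a sketch; the workflow name, extras, and pinned versions are illustrative, not the project's actual CI configuration:

```yaml
name: pandas-compat
on: [push]
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        pandas: ["2.3.3", "3.0.0"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e . pandas==${{ matrix.pandas }} pytest
      - run: pytest -W error
```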

Current Focus: Geographically Weighted Matrix Decomposition

At the moment, I’m studying geographically weighted matrix decomposition algorithms, particularly:

Geographically Weighted Principal Component Analysis (GWPCA)

Graph-based spatial representations using libpysal.graph.Graph

Research Direction

Proposed Focus:
Implement geographically weighted matrix decomposition algorithms (such as GWPCA) on top of libpysal.graph.Graph, to be included in the gwlearn sub-package of the PySAL federation, with a scikit-learn compatible API.
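At its core, GWPCA computes a locally weighted covariance matrix at each focal location (using kernel weights over its neighbors) and eigendecomposes it, so the principal components vary across space. A minimal numpy sketch of the per-location step, under that reading of the algorithm (not gwlearn code, and the function name is mine):

```python
import numpy as np

def local_pca(X, weights):
    """PCA of X under per-observation kernel weights for one focal location."""
    w = weights / weights.sum()
    mean = w @ X                                  # weighted mean
    Xc = X - mean
    cov = (Xc * w[:, None]).T @ Xc                # weighted covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]             # largest variance first
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
vals, vecs = local_pca(X, weights=np.ones(50))    # uniform weights ~ ordinary PCA
```

A full GWPCA would repeat this for every location with kernel weights derived from the spatial graph, which is where libpysal.graph.Graph would come in.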

This direction aligns well with:

gwlearn’s goal of scalable, sklearn-friendly spatial models

libpysal’s evolving graph-based infrastructure

My interest in combining spatial statistics, machine learning, and clean API design

I’m currently working through the theory and existing implementations to understand how these algorithms can be expressed cleanly within gwlearn’s estimator framework.

Reflection

Even during a semester-heavy period, staying consistently engaged — through testing, reading, and discussion — helped me maintain momentum.

Key takeaways so far:

Compatibility work is as important as new features

Metadata routing is essential for spatial ML in sklearn ecosystems

Testing against upcoming dependency versions surfaces issues early

Open-source contribution is as much about communication as code

I’m excited to build on this foundation and move toward deeper implementation work in the coming months.
