A post-competition reproduction attempt by someone with a chemistry background
The NeurIPS Open Polymer Prediction 2025 competition challenged 2,240 teams to predict five physical properties of polymers from their chemical structure (SMILES strings).
I have a background in chemistry, so after the competition ended I read through James Day's 1st place writeup and tried to reproduce the approach as best I could. With a late submission, I got a Private LB score of 0.08180 — which should be roughly equivalent to 8th place out of 2,240 teams. Though of course, this was done after the competition closed, so it should be taken with a grain of salt.
Disclaimer: This article describes a scaled-down reproduction experiment. All solution design credit goes to James Day. My goal was simply to see how far I could get reproducing his approach with limited resources.
James Day's 1st Place Writeup:
https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/writeups/1st-place-solution
Reproducing the 1st Place NeurIPS Polymer Prediction 2025 Solution
Top-8 equivalent score (0.08180) on the Private Leaderboard using CodeBERTa-small
This repository contains a reproduction of James Day's 1st place solution for the NeurIPS - Open Polymer Prediction 2025 Kaggle competition.
Disclaimer: This is a reproduction study, not original work. All credit for the solution design goes to the original authors. The purpose is to verify reproducibility and provide a working codebase for the community.
Competition Overview
- Task: Predict 5 polymer properties (Tg, FFV, Tc, Density, Rg) from chemical structure (SMILES)
- Metric: Weighted Mean Absolute Error (wMAE)
- Scale: 2,240 teams, $50,000 prize pool
- Ground truth: Averaged from multiple molecular dynamics simulation runs
Results
| Rank | Team | Score (wMAE) |
|---|---|---|
| 1 | James Day | 0.07536 |
| 2 | Ezra | 0.07722 |
| 3 | Ghy HUST CS | 0.07820 |
| 7 | CoderGirlM | 0.08144 |
| - | This reproduction | 0.08180 |
| 8 | Dmitry Uarov | 0.08271 |
Late submission score of 0.08180 — equivalent to 8th place out…
Getting Started
Clone the repo and install dependencies to run the pipeline yourself:
```bash
git clone https://github.com/nkwork9999/NeurIP2025_mytrial_following_1st_solution.git
cd NeurIP2025_mytrial_following_1st_solution
pip install -r requirements.txt
```
The training notebooks are designed to run on Google Colab (L4 GPU, ~7 hours). Open the notebooks in order:
- Train CodeBERTa — Fine-tune with SMILES augmentation and 5-fold CV
- Train AutoGluon — Tabular model on RDKit descriptors
- Ensemble & submit — Combine predictions and apply post-processing
Refer to the repo's README for the full walkthrough.
Competition Overview
The Task
Given a polymer's SMILES notation, predict five molecular-dynamics-simulated physical properties:
| Property | Description | Unit |
|---|---|---|
| Tg | Glass transition temperature | °C |
| FFV | Fractional free volume | – (dimensionless) |
| Tc | Thermal conductivity | W/(m·K) |
| Density | Density | g/cm³ |
| Rg | Radius of gyration | Å |
The Sparse Label Problem
The defining challenge of this competition was extreme label sparsity. Not every polymer had all five properties measured — the number of available samples varied widely across targets.
Evaluation Metric
Weighted Mean Absolute Error (wMAE), normalized by each property's value range, with higher weights assigned to properties with fewer samples. In other words, the rarest labels matter most.
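To make the metric concrete, here is a schematic implementation. The exact weight normalization is defined by the competition rules; this sketch simply assumes weights proportional to the inverse square root of the sample count, rescaled to mean 1, with each property's MAE divided by its value range:

```python
import numpy as np

def weighted_mae(y_true, y_pred, ranges, n_samples):
    """Schematic wMAE: per-property MAE, scaled by value range,
    with higher weight for properties that have fewer labels.
    (Illustrative only; the competition defines the exact normalization.)
    """
    weights = np.sqrt(1.0 / np.asarray(n_samples))      # rarer labels -> larger weight
    weights = weights / weights.sum() * len(n_samples)  # rescale to mean 1
    maes = np.nanmean(np.abs(y_true - y_pred), axis=0)  # per-property MAE (NaN = missing label)
    return float(np.mean(weights * maes / np.asarray(ranges)))
```

The `np.nanmean` call reflects the sparse-label setup: missing targets simply drop out of a property's MAE.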
James Day's 1st Place Solution
Day's winning approach was an ensemble of three models:
- ModernBERT-base — Treats SMILES as text, with a regression head for property prediction
- AutoGluon — A tabular model using RDKit descriptors and Morgan fingerprints as features
- Uni-Mol 2 — A pretrained model that accounts for 3D molecular structure
The most surprising finding: a code-pretrained ModernBERT-base outperformed chemistry-specific models. SMILES notation resembles source code — parentheses, symbols, and repeating structural patterns — so a tokenizer trained on code captures SMILES structure remarkably well. I found this genuinely fascinating. It makes sense that SMILES, being string-based, can benefit from large language models, but the fact that a code-pretrained model specifically has an advantage was unexpected.
Insights from 2nd and 3rd Place
- 2nd place (Ezra): Simply converting Tg units from Celsius to Fahrenheit dramatically improved the score. Achieved a high rank with ExtraTreesRegressor — a relatively simple model.
- 3rd place (Hongyu Guo): GATv2Conv (6 layers) + Morgan fingerprints, with post-hoc linear regression calibration.
My Reproduction: Design Decisions
Why CodeBERTa-small?
Day used ModernBERT-base (125M parameters). I went with CodeBERTa-small (84M parameters) — honestly, I simply wanted to try things out easily on Google Colab. Training had to fit within the L4 GPU time limit (~7 hours), and I wanted to get something running quickly without too much setup.
Pipeline Architecture
```text
SMILES --> CodeBERTa-small --> regression head --> 5 properties --+
                                                                  +--> ensemble
SMILES --> RDKit features  --> AutoGluon ------->  5 properties --+
```
Each target is trained with 5-fold cross-validation. At inference, 30 rounds of test-time augmentation (TTA) are applied.
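The per-target cross-validation loop can be sketched as follows. This is illustrative: `train_one_fold` is a placeholder for the actual fine-tuning routine and is not a function from the original code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_oof_predictions(X, y, train_one_fold, n_splits=5, seed=42):
    """Train one model per fold and collect out-of-fold (OOF) predictions.

    `train_one_fold(X_train, y_train)` is assumed to return a fitted
    object exposing `.predict()`.
    """
    oof = np.full(len(y), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = []
    for tr_idx, va_idx in kf.split(X):
        model = train_one_fold(X[tr_idx], y[tr_idx])
        oof[va_idx] = model.predict(X[va_idx])  # each sample predicted exactly once
        models.append(model)                    # keep all folds for ensembling
    return models, oof
```

The OOF predictions are what make post-processing steps like the Tg shift search (below in the pipeline) possible without touching the test set.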
The Key Technique: Random SMILES Augmentation
This is the single most impactful technique in the pipeline.
SMILES Are Not Unique
The same molecule can be represented by hundreds of different SMILES strings, depending on the atom traversal order:
```text
Example: Ethanol
  Canonical: CCO
  Random 1:  OCC
  Random 2:  C(O)C
```
RDKit's MolToSmiles(mol, canonical=False, doRandom=True) generates randomized SMILES on the fly.
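Assuming RDKit is installed, the `random_smiles` helper used in the snippets below can be as small as:

```python
from rdkit import Chem

def random_smiles(smiles: str) -> str:
    """Return a randomized (non-canonical) SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)
```

For example, `random_smiles("CCO")` may return `OCC` or `C(O)C`; canonicalizing any variant recovers `CCO`.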
Training: 10x Augmentation
Each epoch presents the same molecule with a different SMILES representation (10x augmentation). This forces the model to learn representation-invariant molecular features rather than memorizing a specific string.
```python
from torch.utils.data import Dataset

class SMILESDataset(Dataset):
    def __init__(self, smiles_list, labels, tokenizer, max_len=128, aug_factor=10):
        self.smiles = smiles_list
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.aug = aug_factor

    def __len__(self):
        # Each molecule appears aug_factor times per epoch
        return len(self.smiles) * self.aug

    def __getitem__(self, idx):
        ri = idx % len(self.smiles)
        smi = random_smiles(self.smiles[ri])  # fresh random SMILES on every access
        ...  # tokenize `smi` with self.tokenizer and return tensors + labels
```
Inference: TTA with Median Aggregation
At test time, 30 random SMILES variants are generated per molecule, and the median prediction is taken as the final output. Median is preferred over mean to suppress outlier predictions.
```python
import numpy as np

def predict_tta(model, tokenizer, smiles_list, n_tta=30):
    all_preds = []
    for _ in range(n_tta):
        aug = [random_smiles(s) for s in smiles_list]  # new random variant per round
        preds = model.predict(aug)
        all_preds.append(preds)
    return np.median(all_preds, axis=0)  # aggregate with median
```
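A quick numeric illustration of why the median is the safer aggregator: with one wildly wrong TTA round, the mean is pulled far off while the median barely moves.

```python
import numpy as np

# Five TTA predictions for one molecule; one round went badly wrong
preds = np.array([1.00, 1.02, 0.98, 1.01, 9.0])

print(preds.mean())       # -> 2.602 (dragged toward the outlier)
print(np.median(preds))   # -> 1.01  (robust to the bad round)
```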
Personal note: I find this technique quite intriguing. It feels somewhat analogous to adding noise to images in computer vision augmentation. I would have expected that randomizing the SMILES string might introduce confusing signals, but it actually works surprisingly well.
Post-Processing Tricks
Tg Distribution Shift Correction
There is a known distribution shift between the train and test Tg values. Adding std * 0.5644 to the predictions corrects for this shift. It is a rather analog/manual method, but it turned out to be quite effective.
```python
tg_std = submission['Tg'].std()
tg_shift = tg_std * 0.5644  # coefficient from a grid search on OOF predictions
submission['Tg'] += tg_shift
```
The coefficient 0.5644 was found via grid search on out-of-fold predictions.
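That kind of search can be sketched in a few lines. This is an illustrative stand-in, not the original tuning code: it scans multiples of the prediction standard deviation and keeps the one that minimizes MAE against the out-of-fold targets.

```python
import numpy as np

def find_shift_coef(oof_pred, oof_true, coefs=np.linspace(0.0, 1.0, 201)):
    """Grid-search the multiple of std(oof_pred) that, added to the
    predictions, minimizes MAE against the true OOF values."""
    std = oof_pred.std()
    maes = [np.mean(np.abs(oof_pred + c * std - oof_true)) for c in coefs]
    return float(coefs[int(np.argmin(maes))])
```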
Direct Match
When a test SMILES exactly matches a training SMILES, the known training value is used directly instead of the model prediction. Simple but effective.
```python
for target in TARGETS:
    for _, row in test.iterrows():
        # All training rows with the same SMILES and a non-null label
        match = train.loc[
            (train['SMILES'] == row['SMILES']) & (train[target].notna()), target
        ]
        if len(match) > 0:
            submission.loc[submission['id'] == row['id'], target] = match.values[0]
```
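For larger test sets, the same lookup can be done without the nested loop via a groupby join. This is a sketch assuming the same column names (`id`, `SMILES`, one column per target); `apply_direct_match` is my own helper name, not from the original code.

```python
import pandas as pd

def apply_direct_match(submission, test, train, targets):
    """Overwrite predictions with known training values on exact SMILES matches."""
    # First non-null training value per SMILES for each target
    lookup = train.groupby('SMILES')[targets].first()
    matched = test[['id', 'SMILES']].join(lookup, on='SMILES')
    sub = submission.set_index('id')
    for t in targets:
        vals = matched.set_index('id')[t].dropna()  # only ids with a real match
        sub.loc[vals.index, t] = vals
    return sub.reset_index()
```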
Results
Leaderboard Comparison
| Rank | Team | Score |
|---|---|---|
| 1st | James Day | 0.07536 |
| 2nd | Ezra | 0.07722 |
| 3rd | Ghy HUST CS | 0.07820 |
| 7th | CoderGirlM | 0.08144 |
| -- | This reproduction | 0.08180 |
| 8th | Dmitry Uarov | 0.08271 |
What I Could Not Reproduce
The biggest missing piece was Uni-Mol 2, the 3D molecular structure model that was part of the 1st place ensemble. I was unable to reproduce this component — I did not have the required software environment set up, and the compute requirements were beyond what I had available. This was a clear limitation of my attempt.
It made me wish there were lighter-weight methods for incorporating 3D molecular structure information. If something like a lightweight 3D-aware featurizer existed that could run on a single GPU, it would make these kinds of reproduction experiments much more practical.
Wrapping Up
There are many summary articles about top competition solutions, but fewer attempts to actually write the code, run the pipeline, and verify the results. I wanted to try doing that here, even in a scaled-down form.
This article may contain errors or misunderstandings — I would appreciate any corrections or feedback. If you notice something off, please feel free to leave a comment or open an issue on the repo.