A post-competition reproduction attempt by someone with a chemistry background
The NeurIPS Open Polymer Prediction 2025 competition challenged 2,240 teams to predict five physical properties of polymers from their chemical structure (SMILES strings).
I have a background in chemistry, so after the competition ended I read through James Day's 1st place writeup and tried to reproduce the approach as best I could. With a late submission, I got a Private LB score of 0.08180 — which should be roughly equivalent to 8th place out of 2,240 teams. Though of course, this was done after the competition closed, so it should be taken with a grain of salt.
Disclaimer: This article describes a scaled-down reproduction experiment. All solution design credit goes to James Day. My goal was simply to see how far I could get reproducing his approach with limited resources.
James Day's 1st Place Writeup:
https://www.kaggle.com/competitions/neurips-open-polymer-prediction-2025/writeups/1st-place-solution
Reproducing the 1st Place NeurIPS Polymer Prediction 2025 Solution
Top-8 equivalent score (0.08180) on the Private Leaderboard using CodeBERTa-small
This repository contains a reproduction of James Day's 1st place solution for the NeurIPS - Open Polymer Prediction 2025 Kaggle competition.
Disclaimer: This is a reproduction study, not original work. All credit for the solution design goes to the original authors. The purpose is to verify reproducibility and provide a working codebase for the community.
Competition Overview
- Task: Predict 5 polymer properties (Tg, FFV, Tc, Density, Rg) from chemical structure (SMILES)
- Metric: Weighted Mean Absolute Error (wMAE)
- Scale: 2,240 teams, $50,000 prize pool
- Ground truth: Averaged from multiple molecular dynamics simulation runs
Results
| Rank | Team | Score (wMAE) |
|---|---|---|
| 1 | James Day | 0.07536 |
| 2 | Ezra | 0.07722 |
| 3 | Ghy HUST CS | 0.07820 |
| 7 | CoderGirlM | 0.08144 |
| - | This reproduction | 0.08180 |
| 8 | Dmitry Uarov | 0.08271 |
Late submission score of 0.08180 — equivalent to 8th place out…
Getting Started
Clone the repo and install dependencies to run the pipeline yourself:
```bash
git clone https://github.com/nkwork9999/NeurIP2025_mytrial_following_1st_solution.git
cd NeurIP2025_mytrial_following_1st_solution
pip install -r requirements.txt
```
The training notebooks are designed to run on Google Colab (L4 GPU, ~7 hours). Open the notebooks in order:
- Train CodeBERTa — Fine-tune with SMILES augmentation and 5-fold CV
- Train AutoGluon — Tabular model on RDKit descriptors
- Ensemble & submit — Combine predictions and apply post-processing
Refer to the repo's README for the full walkthrough.
Competition Overview
The Task
Given a polymer's SMILES notation, predict five molecular-dynamics-simulated physical properties:
| Property | Description | Unit |
|---|---|---|
| Tg | Glass transition temperature | °C |
| FFV | Fractional free volume | – (dimensionless) |
| Tc | Thermal conductivity | W/(m·K) |
| Density | Density | g/cm³ |
| Rg | Radius of gyration | Å |
The Sparse Label Problem
The defining challenge of this competition was extreme label sparsity. Not every polymer had all five properties measured — the number of available samples varied widely across targets.
Evaluation Metric
Weighted Mean Absolute Error (wMAE), normalized by each property's value range, with higher weights assigned to properties with fewer samples. In other words, the rarest labels matter most.
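To make the metric concrete, here is a schematic implementation. The exact weight normalization is defined by the competition rules; this sketch simply assumes weights proportional to the inverse square root of the sample count, rescaled to mean 1, with each property's MAE divided by its value range:

```python
import numpy as np

def weighted_mae(y_true, y_pred, ranges, n_samples):
    """Schematic wMAE: per-property MAE, scaled by value range,
    with higher weight for properties that have fewer labels.
    (Illustrative only; the competition defines the exact normalization.)
    """
    weights = np.sqrt(1.0 / np.asarray(n_samples))      # rarer labels -> larger weight
    weights = weights / weights.sum() * len(n_samples)  # rescale to mean 1
    maes = np.nanmean(np.abs(y_true - y_pred), axis=0)  # per-property MAE (NaN = missing label)
    return float(np.mean(weights * maes / np.asarray(ranges)))
```

The `np.nanmean` call reflects the sparse-label setup: missing targets simply drop out of a property's MAE.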
James Day's 1st Place Solution
Day's winning approach was an ensemble of three models:
- ModernBERT-base — Treats SMILES as text, with a regression head for property prediction
- AutoGluon — A tabular model using RDKit descriptors and Morgan fingerprints as features
- Uni-Mol 2 — A pretrained model that accounts for 3D molecular structure
The most surprising finding: a code-pretrained ModernBERT-base outperformed chemistry-specific models. SMILES notation resembles source code — parentheses, symbols, and repeating structural patterns — so a tokenizer trained on code captures SMILES structure remarkably well. I found this genuinely fascinating. It makes sense that SMILES, being string-based, can benefit from large language models, but the fact that a code-pretrained model specifically has an advantage was unexpected.
Insights from 2nd and 3rd Place
- 2nd place (Ezra): Simply converting Tg units from Celsius to Fahrenheit dramatically improved the score. Achieved a high rank with ExtraTreesRegressor — a relatively simple model.
- 3rd place (Hongyu Guo): GATv2Conv (6 layers) + Morgan fingerprints, with post-hoc linear regression calibration.
My Reproduction: Design Decisions
Why CodeBERTa-small?
Day used ModernBERT-base (125M parameters). I went with CodeBERTa-small (84M parameters) — honestly, I simply wanted to try things out easily on Google Colab. Training had to fit within the L4 GPU time limit (~7 hours), and I wanted to get something running quickly without too much setup.
Pipeline Architecture
```text
SMILES --> CodeBERTa-small --> regression head --> 5 properties --+
                                                                  +--> ensemble
SMILES --> RDKit features  --> AutoGluon ------->  5 properties --+
```
Each target is trained with 5-fold cross-validation. At inference, 30 rounds of test-time augmentation (TTA) are applied.
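The per-target cross-validation loop can be sketched as follows. This is illustrative: `train_one_fold` is a placeholder for the actual fine-tuning routine and is not a function from the original code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_oof_predictions(X, y, train_one_fold, n_splits=5, seed=42):
    """Train one model per fold and collect out-of-fold (OOF) predictions.

    `train_one_fold(X_train, y_train)` is assumed to return a fitted
    object exposing `.predict()`.
    """
    oof = np.full(len(y), np.nan)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = []
    for tr_idx, va_idx in kf.split(X):
        model = train_one_fold(X[tr_idx], y[tr_idx])
        oof[va_idx] = model.predict(X[va_idx])  # each sample predicted exactly once
        models.append(model)                    # keep all folds for ensembling
    return models, oof
```

The OOF predictions are what make post-processing steps like the Tg shift search (below in the pipeline) possible without touching the test set.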
The Key Technique: Random SMILES Augmentation
This is the single most impactful technique in the pipeline.
SMILES Are Not Unique
The same molecule can be represented by hundreds of different SMILES strings, depending on the atom traversal order:
```text
Example: Ethanol
  Canonical: CCO
  Random 1:  OCC
  Random 2:  C(O)C
```
RDKit's MolToSmiles(mol, canonical=False, doRandom=True) generates randomized SMILES on the fly.
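Assuming RDKit is installed, the `random_smiles` helper used in the snippets below can be as small as:

```python
from rdkit import Chem

def random_smiles(smiles: str) -> str:
    """Return a randomized (non-canonical) SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)
```

For example, `random_smiles("CCO")` may return `OCC` or `C(O)C`; canonicalizing any variant recovers `CCO`.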
Training: 10x Augmentation
Each epoch presents the same molecule with a different SMILES representation (10x augmentation). This forces the model to learn representation-invariant molecular features rather than memorizing a specific string.
```python
from torch.utils.data import Dataset

class SMILESDataset(Dataset):
    def __init__(self, smiles_list, labels, tokenizer, max_len=128, aug_factor=10):
        self.smiles = smiles_list
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.aug = aug_factor

    def __len__(self):
        # Each molecule appears aug_factor times per epoch
        return len(self.smiles) * self.aug

    def __getitem__(self, idx):
        ri = idx % len(self.smiles)
        smi = random_smiles(self.smiles[ri])  # fresh random SMILES on every access
        ...  # tokenize `smi` with self.tokenizer and return tensors + labels
```
Inference: TTA with Median Aggregation
At test time, 30 random SMILES variants are generated per molecule, and the median prediction is taken as the final output. Median is preferred over mean to suppress outlier predictions.
```python
import numpy as np

def predict_tta(model, tokenizer, smiles_list, n_tta=30):
    all_preds = []
    for _ in range(n_tta):
        aug = [random_smiles(s) for s in smiles_list]  # new random variant per round
        preds = model.predict(aug)
        all_preds.append(preds)
    return np.median(all_preds, axis=0)  # aggregate with median
```
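A quick numeric illustration of why the median is the safer aggregator: with one wildly wrong TTA round, the mean is pulled far off while the median barely moves.

```python
import numpy as np

# Five TTA predictions for one molecule; one round went badly wrong
preds = np.array([1.00, 1.02, 0.98, 1.01, 9.0])

print(preds.mean())       # -> 2.602 (dragged toward the outlier)
print(np.median(preds))   # -> 1.01  (robust to the bad round)
```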
Personal note: I find this technique quite intriguing. It feels somewhat analogous to adding noise to images in computer vision augmentation. I would have expected that randomizing the SMILES string might introduce confusing signals, but it actually works surprisingly well.
Post-Processing Tricks
Tg Distribution Shift Correction
There is a known distribution shift between the train and test Tg values. Adding std * 0.5644 to the predictions corrects for this shift. It is a rather analog/manual method, but it turned out to be quite effective.
```python
tg_std = submission['Tg'].std()
tg_shift = tg_std * 0.5644  # coefficient from a grid search on OOF predictions
submission['Tg'] += tg_shift
```
The coefficient 0.5644 was found via grid search on out-of-fold predictions.
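That kind of search can be sketched in a few lines. This is an illustrative stand-in, not the original tuning code: it scans multiples of the prediction standard deviation and keeps the one that minimizes MAE against the out-of-fold targets.

```python
import numpy as np

def find_shift_coef(oof_pred, oof_true, coefs=np.linspace(0.0, 1.0, 201)):
    """Grid-search the multiple of std(oof_pred) that, added to the
    predictions, minimizes MAE against the true OOF values."""
    std = oof_pred.std()
    maes = [np.mean(np.abs(oof_pred + c * std - oof_true)) for c in coefs]
    return float(coefs[int(np.argmin(maes))])
```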
Direct Match
When a test SMILES exactly matches a training SMILES, the known training value is used directly instead of the model prediction. Simple but effective.
```python
for target in TARGETS:
    for _, row in test.iterrows():
        # All training rows with the same SMILES and a non-null label
        match = train.loc[
            (train['SMILES'] == row['SMILES']) & (train[target].notna()), target
        ]
        if len(match) > 0:
            submission.loc[submission['id'] == row['id'], target] = match.values[0]
```
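For larger test sets, the same lookup can be done without the nested loop via a groupby join. This is a sketch assuming the same column names (`id`, `SMILES`, one column per target); `apply_direct_match` is my own helper name, not from the original code.

```python
import pandas as pd

def apply_direct_match(submission, test, train, targets):
    """Overwrite predictions with known training values on exact SMILES matches."""
    # First non-null training value per SMILES for each target
    lookup = train.groupby('SMILES')[targets].first()
    matched = test[['id', 'SMILES']].join(lookup, on='SMILES')
    sub = submission.set_index('id')
    for t in targets:
        vals = matched.set_index('id')[t].dropna()  # only ids with a real match
        sub.loc[vals.index, t] = vals
    return sub.reset_index()
```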
Results
Leaderboard Comparison
| Rank | Team | Score |
|---|---|---|
| 1st | James Day | 0.07536 |
| 2nd | Ezra | 0.07722 |
| 3rd | Ghy HUST CS | 0.07820 |
| 7th | CoderGirlM | 0.08144 |
| -- | This reproduction | 0.08180 |
| 8th | Dmitry Uarov | 0.08271 |
What I Could Not Reproduce
The biggest missing piece was Uni-Mol 2, the 3D molecular structure model that was part of the 1st place ensemble. I was unable to reproduce this component — I did not have the required software environment set up, and the compute requirements were beyond what I had available. This was a clear limitation of my attempt.
It made me wish there were lighter-weight methods for incorporating 3D molecular structure information. If something like a lightweight 3D-aware featurizer existed that could run on a single GPU, it would make these kinds of reproduction experiments much more practical.
Wrapping Up
There are many summary articles about top competition solutions, but fewer attempts to actually write the code, run the pipeline, and verify the results. I wanted to try doing that here, even in a scaled-down form.
This article may contain errors or misunderstandings — I would appreciate any corrections or feedback. If you notice something off, please feel free to leave a comment or open an issue on the repo.