How I redesigned a thermostable enzyme using ProteinMPNN inverse folding - and validated every design with AlphaFold2

#bioinformatics #machinelearning #python #protein

E155 and E215 at 6.03 Å - exactly the nucleophile/acid-base separation
expected for a GH5 retaining endoglucanase.

This step matters. If I had accepted the literature numbering without checking, I would have fixed the wrong residues and the constrained run would be biologically
meaningless.
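A minimal sketch of this kind of distance check, using only the standard library. It assumes the distance of interest is the closest approach between the side-chain carboxylate oxygens (OE1/OE2) of the two glutamates on chain A; the post does not state which atoms were measured, so treat the atom choice and the function names as illustrative.

```python
import math

def parse_atoms(pdb_text, chain="A"):
    """Return {(resseq, atom_name): (x, y, z)} for ATOM records of one chain."""
    atoms = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21] == chain:
            key = (int(line[22:26]), line[12:16].strip())
            atoms[key] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return atoms

def min_carboxylate_distance(atoms, res_a, res_b):
    """Shortest OE1/OE2-to-OE1/OE2 distance between two glutamates, in Å."""
    oxygens = ("OE1", "OE2")  # assumption: measure between carboxylate oxygens
    pairs = [(atoms[(res_a, p)], atoms[(res_b, q)])
             for p in oxygens for q in oxygens
             if (res_a, p) in atoms and (res_b, q) in atoms]
    if not pairs:
        raise ValueError("no carboxylate oxygens found; check residue numbering")
    return min(math.dist(u, v) for u, v in pairs)
```

Calling `min_carboxylate_distance(parse_atoms(open("cel5a_clean.pdb").read()), 155, 215)` would give the closest oxygen-oxygen distance. A `ValueError` here is itself informative: it means the numbering in the file does not put glutamates where the literature says they are.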

Step 2: B-factor profile

Before running ProteinMPNN, I extracted per-residue B-factors from the PDB file to identify
flexibility hotspots - regions that might benefit most from redesign:

High-flexibility regions (B > 30 Å²):

Residues 246–250: B-factor up to 96.85 Å² - extremely mobile surface loop
N-terminus (residues 1–3)
Loop 169–173

These are the regions where thermostabilising mutations are likely to be most
impactful in a follow-up study. Note that ProteinMPNN itself does not read B-factors; it conditions only on the backbone coordinates.
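The profile can be sketched in a few lines of plain Python. This version averages the B-factor column over all atoms of each residue and groups consecutive residues above the 30 Å² cutoff; whether the original analysis used all atoms or only Cα is not stated, so the averaging choice is an assumption.

```python
from collections import defaultdict

def per_residue_bfactor(pdb_text, chain="A"):
    """Mean B-factor (Å²) over all atoms of each residue in one chain."""
    values = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21] == chain:
            # Columns 61-66 of an ATOM record hold the temperature factor.
            values[int(line[22:26])].append(float(line[60:66]))
    return {res: sum(b) / len(b) for res, b in sorted(values.items())}

def flexible_segments(profile, cutoff=30.0):
    """Group consecutive residues whose mean B-factor exceeds the cutoff."""
    segments = []
    for res in sorted(profile):
        if profile[res] <= cutoff:
            continue
        if segments and res == segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], res)  # extend the current run
        else:
            segments.append((res, res))            # start a new run
    return segments
```

`flexible_segments(per_residue_bfactor(pdb_text))` returns `(start, end)` tuples that can be compared directly with the ranges listed above.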

Step 3: ProteinMPNN - unconstrained run

First I ran ProteinMPNN without any constraints to understand what sequences it naturally prefers for this backbone:

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --out_folder mpnn_output/temp_0.1 \
    --num_seq_per_target 100 \
    --sampling_temp 0.1 \
    --seed 42

Five temperatures (0.1, 0.2, 0.3, 0.5, 0.8), 100 sequences each = 500 total.
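One way to script the sweep is a small Python wrapper that builds the same command for each temperature. It uses only the flags shown above; the `mpnn_output/temp_<T>` folder pattern is extrapolated from the single command in this post, and reusing seed 42 for every temperature is an assumption.

```python
import subprocess

TEMPERATURES = [0.1, 0.2, 0.3, 0.5, 0.8]

def build_command(temp, pdb_path="cel5a_clean.pdb", n_seqs=100, seed=42):
    """Build the protein_mpnn_run.py argument list for one sampling temperature."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--out_folder", f"mpnn_output/temp_{temp}",  # assumed folder pattern
        "--num_seq_per_target", str(n_seqs),
        "--sampling_temp", str(temp),
        "--seed", str(seed),
    ]

def run_sweep(temps=TEMPERATURES, dry_run=True):
    """Print (dry run) or execute one ProteinMPNN run per temperature."""
    commands = [build_command(t) for t in temps]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop the sweep on the first failure
    return commands
```

Running `run_sweep()` first as a dry run is a cheap way to confirm the output folders before committing GPU time; `run_sweep(dry_run=False)` then executes the five runs in sequence.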

The result was surprising: at T=0.1, ProteinMPNN placed threonine at position 155
(93/100 sequences) and alanine at position 215 (97/100 sequences). Both catalytic glutamates were replaced with non-catalytic residues.

This is not a bug - it's the expected behaviour. ProteinMPNN optimises for sequence-backbone compatibility, not biological function. Threonine and alanine may pack better against the local structure than glutamate, but they cannot perform the retaining mechanism. This finding motivates the constrained run.
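Counts like 93/100 can be obtained by tallying the residue at a given position across the designed sequences. The sketch below assumes the ProteinMPNN output FASTA lists the native sequence as its first record (so it is skipped) and that the 1-based sequence index matches the PDB residue number, which holds only if the chain starts at residue 1 with no gaps.

```python
from collections import Counter

def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def residue_counts(records, position, skip_native=True):
    """Count amino acids at a 1-based sequence position across designs."""
    designs = records[1:] if skip_native else records  # assumed: record 0 is native
    return Counter(seq[position - 1] for _, seq in designs)
```

For example, `residue_counts(read_fasta(fasta_text), 155).most_common(3)` lists the three most frequent residues at position 155.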

Step 4: ProteinMPNN - catalytic-constrained run

Fix E155 and E215, let everything else vary:

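import json

# Key = PDB file name without extension; value = chain ID -> positions to keep fixed.
fixed_positions = {
    "cel5a_clean": {
        "A": [155, 215]
    }
}
# ProteinMPNN reads one JSON object per line (JSON Lines).
with open("fixed_positions.jsonl", "w") as f:
    f.write(json.dumps(fixed_positions) + "\n")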

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --fixed_positions_jsonl fixed_positions.jsonl \
    --num_seq_per_target 100 \
    --sampling_temp 0.1
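Before trusting the constrained run, it is worth confirming that the indices in the fixed-positions file actually point at glutamates in the sequence ProteinMPNN parsed. As far as I understand, the indices are 1-based positions along the parsed chain rather than PDB residue numbers, so the two can drift apart if the chain does not start at residue 1; treat that as an assumption to verify against your own output. The helper below is a hypothetical sanity check that works on any sequence string, for example the native record from an earlier ProteinMPNN output.

```python
def check_fixed_positions(native_seq, positions, expected="E"):
    """Return {position: residue} for every 1-based position that is not `expected`.

    An empty dict means all fixed positions hold the expected residue.
    A value of None means the index falls outside the sequence.
    """
    mismatches = {}
    for pos in positions:
        if not 1 <= pos <= len(native_seq):
            mismatches[pos] = None
        elif native_seq[pos - 1] != expected:
            mismatches[pos] = native_seq[pos - 1]
    return mismatches
```

If `check_fixed_positions(native_seq, [155, 215])` returns anything other than `{}`, the indices need to be shifted before the constrained run means what it is supposed to mean.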

Results:

Temperature	Mean score	Mean recovery	E155 preserved	E215 preserved
0.1	0.763	52.6%	100%	100%
0.2	0.788	52.1%	100%	100%
0.3	0.830	51.5%	100%	100%
0.5	0.983	49.1%	100%	100%
0.8	1.369	43.1%	100%	100%

100% catalytic preservation at all temperatures. And the score distributions are nearly identical to the unconstrained run - fixing 2 out of 605 residues costs essentially nothing in terms of backbone fit.
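The mean score and mean recovery columns can be recomputed from the FASTA headers. This sketch assumes each designed record's header carries `score=` and `seq_recovery=` fields (as in `T=0.1, sample=1, score=0.7291, global_score=0.9330, seq_recovery=0.4736`; the numbers in that example are made up) and that the native record lacks `seq_recovery`, so it drops out automatically.

```python
import re
from statistics import mean

# Match "score=" only at the start or after a comma, so "global_score=" is ignored.
SCORE = re.compile(r"(?:^|,\s*)score=([0-9.]+)")
RECOVERY = re.compile(r"seq_recovery=([0-9.]+)")

def summarise_headers(headers):
    """Mean MPNN score and mean sequence recovery (%) over designed records."""
    scores, recoveries = [], []
    for header in headers:
        s, r = SCORE.search(header), RECOVERY.search(header)
        if s and r:  # records without seq_recovery (e.g. native) are skipped
            scores.append(float(s.group(1)))
            recoveries.append(float(r.group(1)))
    if not scores:
        return {"n": 0, "mean_score": None, "mean_recovery_pct": None}
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "mean_recovery_pct": 100 * mean(recoveries),
    }
```

Running this once per temperature folder, for both the constrained and unconstrained runs, gives the side-by-side numbers needed to back up the claim that the constraint is nearly free.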

Step 5: AlphaFold2 validation

Top 20 constrained designs (lowest MPNN score, T=0.1) were folded using
ColabFold on a free T4 GPU:

colabfold_batch \
  top20_constrained_candidates.fasta \
  af2_structures/designs \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --msa-mode single_sequence

Also folded the wildtype under identical conditions as a baseline.
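Selecting the candidates for folding amounts to sorting the designed records by MPNN score and writing the best ones to a FASTA file. In the sketch below, the rank-based `CEL5A_FIXED_NNN` naming is a guess at how the design IDs in this post were assigned, and records are assumed to be `(header, sequence)` tuples whose designed headers contain `sample=` and `score=` fields.

```python
import re

# Match "score=" only at the start or after a comma, so "global_score=" is ignored.
SCORE = re.compile(r"(?:^|,\s*)score=([0-9.]+)")

def top_n_fasta(records, n=20, prefix="CEL5A_FIXED"):
    """Return (fasta_text, {design_id: score}) for the n lowest-scoring designs."""
    scored = []
    for header, seq in records:
        match = SCORE.search(header)
        if match and "sample=" in header:  # skip the native record
            scored.append((float(match.group(1)), seq))
    scored.sort(key=lambda item: item[0])  # lower MPNN score = better fit

    lines, scores = [], {}
    for rank, (score, seq) in enumerate(scored[:n], start=1):
        design_id = f"{prefix}_{rank:03d}"  # hypothetical naming scheme
        lines += [f">{design_id}", seq]
        scores[design_id] = score
    return "\n".join(lines) + "\n", scores
```

Writing the returned text to `top20_constrained_candidates.fasta` produces the input for the `colabfold_batch` command above; keeping the score dictionary makes it easy to join MPNN scores with pLDDT values afterwards.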

Result: 20/20 designs beat wildtype pLDDT.

Design	MPNN score	AF2 pLDDT	ΔpLDDT vs WT
CEL5A_FIXED_013	0.7532	46.04	+12.71
CEL5A_FIXED_012	0.7513	45.69	+12.37
CEL5A_FIXED_014	0.7535	45.61	+12.29
CEL5A_FIXED_010	0.7508	44.48	+11.16
CEL5A_FIXED_004	0.7471	43.72	+10.40

The per-residue pLDDT profiles show the biggest improvements in the catalytic domain region (residues 150–250) - exactly where the constrained redesign introduced the most changes around the fixed glutamates.
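AlphaFold2 and ColabFold write per-residue pLDDT into the B-factor column of the output PDB, so both the mean difference and the regional difference can be read straight from the structure files. The sketch below takes one value per residue from the Cα atom and compares a design against the wildtype, optionally restricted to a residue window such as 150-250; the function names are illustrative.

```python
def plddt_profile(pdb_text):
    """Per-residue pLDDT read from the B-factor column of CA atoms."""
    profile = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            profile[int(line[22:26])] = float(line[60:66])
    return profile

def delta_plddt(design, wildtype, window=None):
    """Mean pLDDT difference (design - wildtype) over shared residues.

    `window` is an optional inclusive (start, end) residue range.
    """
    shared = [r for r in design if r in wildtype
              and (window is None or window[0] <= r <= window[1])]
    if not shared:
        raise ValueError("no residues shared between design and wildtype")
    return sum(design[r] - wildtype[r] for r in shared) / len(shared)
```

Comparing `delta_plddt(design, wt)` with `delta_plddt(design, wt, window=(150, 250))` shows whether the gain is concentrated in the catalytic region or spread across the whole chain.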

A note on absolute pLDDT values

The pLDDT values (33–46) look low compared to typical small protein benchmarks. This is expected for a 605 aa two-domain protein in single-sequence mode - AlphaFold2 relies heavily on co-evolutionary information from MSAs to correctly position domains. Single-sequence mode lacks this signal.

The meaningful comparison is relative pLDDT under identical conditions, not absolute values. Every design is predicted with higher confidence than wildtype under the same conditions.

What I'd do next

MSA-mode validation for top 5 designs - proper pLDDT with full co-evolutionary information
Rosetta ΔΔG scoring - filter by predicted thermostability change
Molecular dynamics - simulate the top 3 designs at 55°C and 70°C to assess thermal stability of the catalytic dyad geometry (E155/E215)
Experimental validation - express in E. coli, measure Tm by DSF, compare CMC activity at elevated temperatures

Key lessons

1. Verify catalytic residues in the structure itself, not from literature numbering alone.
PDB numbering often differs from publication numbering. Checking the ~6 Å
separation between the two glutamates in the actual structure is more reliable than assuming the literature values transfer directly.

2. Run unconstrained first.
The unconstrained run revealed that ProteinMPNN replaces glutamate at both catalytic positions in most sequences (93/100 and 97/100 at T=0.1). Without this finding, there would be no evidence that the constraint is actually needed.

3. Fixing 2/605 residues is essentially free.
Score distributions between constrained and unconstrained runs are almost
identical. You can keep the catalytic residues in place without a measurable
penalty in backbone fit.

4. Low absolute pLDDT is not failure.
Single-sequence mode for large multi-domain proteins typically yields low absolute pLDDT. Compare designs on relative pLDDT, and always fold the wildtype under the same conditions as your baseline.

GitHub: github.com/Farhan89082/proteinmpnn-cel5a

If you have questions about any stage - especially the catalytic residue
identification or the constrained run setup - drop them in the comments.