Antimicrobial resistance is projected to kill 10 million people annually by 2050 - more than cancer. Antimicrobial peptides (AMPs) are one of the most promising therapeutic classes to fight it. They're cationic, membrane-disrupting, and bacteria struggle to develop resistance against them.
The problem: there are thousands of candidate sequences in databases. How do you identify the ones worth synthesising?
In this post I'll walk through a full computational screening pipeline I built from raw database download through machine learning activity prediction to ColabFold structure prediction - ending with 47 high-confidence drug candidates.
The full code is on GitHub: amp-colabfold
The pipeline at a glance
Five stages:
- Data curation - 24,076 sequences from DRAMP 4.0 down to 12,435 after filtering
- Feature engineering - 16 physicochemical features, 0% missing
- Activity modelling - gradient boosting classifier, ROC-AUC 0.77
- Candidate selection - 131 structured peptides sent to ColabFold
- Structure prediction - 69 high-confidence structures, 47 very high confidence
Stage 1: Data curation
I pulled sequences from DRAMP 4.0 (the Database of Antimicrobial Peptides)
using their REST API - no manual downloads needed.
Three filters applied in sequence:
# Length: 10-50 aa (therapeutic window for AMPs)
df = df[df["sequence"].str.len().between(10, 50)]
# Canonical amino acids only - no modified residues
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
df = df[df["sequence"].apply(lambda s: set(s).issubset(CANONICAL))]
# 90% identity clustering — Python reimplementation of CD-HIT
# (CD-HIT is Linux-only; I wrote a greedy longest-first algorithm in pure Python)
df_curated = cluster_cdhit_python(df, identity=0.90)
The CD-HIT reimplementation was the most interesting part of this stage.
CD-HIT's greedy algorithm is conceptually simple: sort sequences longest-first, then for each sequence check if it's ≥90% identical to any existing cluster representative. If yes, discard it. If no, it becomes a new representative. I added 3-mer pre-filtering to skip obviously dissimilar pairs, matching CD-HIT's default behaviour.
Result: 24,076 → 12,435 sequences
Stage 2: Feature engineering
For each of the 12,435 sequences I computed 16 physicochemical features using
the peptides Python package:
features = {
"net_charge_pH7": pep.charge(pH=7.0, pKscale="Lehninger"),
"hydrophobicity_eisenberg": pep.hydrophobicity(scale="Eisenberg"),
"hydrophobic_moment": pep.hydrophobic_moment(window=11, angle=100),
"boman_index": pep.boman(),
"aliphatic_index": pep.aliphatic_index(),
"instability_index": pep.instability_index(),
# + Chou-Fasman helix/sheet/turn propensities computed manually
# + aromaticity, fraction_positive, fraction_negative, length, pI, MW
}
One methodological note: aromaticity() was removed from newer versions of
peptides, so I computed it directly from residue frequencies
(F + Y + W) / length — same definition as Biopython, just implemented inline.
Result: 16 features, 0% missing across all 12,435 sequences
Stage 3: Activity modelling
Binary classification: antibacterial (confirmed MIC data) vs
general/natural (uncharacterised or broad labels).
clf = GradientBoostingClassifier(
n_estimators=300,
learning_rate=0.05,
max_depth=3, # shallow trees → interpretable SHAP values
subsample=0.8,
random_state=42,
)
ROC-AUC: 0.77 on a length-decile-stratified holdout set.
The class imbalance (28.9% positive) is real and intentional — confirmed
antibacterial sequences with MIC data are rarer than general database entries.
I didn't oversample, because the model's conservative predictions are
appropriate for a screening tool: better to miss some true positives than
flood the downstream ColabFold stage with false ones.
What drives the predictions?
The SHAP beeswarm tells a clean biological story:
- Length dominates - longer peptides push positive. This reflects the antibacterial subset containing more full-length defensins and cathelicidins.
- fraction_helix - high helix propensity → predicted antibacterial. Consistent with α-helical membrane disruption.
- fraction_positive - cationic residues push positive. Textbook electrostatic attraction to anionic bacterial membranes.
- fraction_negative - anionic residues suppress prediction. Expected.
This is not a black box. Every prediction is traceable to a physical property with a known biological mechanism. That's the point of using shallow trees with SHAP rather than a deep model.
Stage 4: Candidate selection
From the classifier output I selected peptides with ab_proba ≥ 0.938 and
then applied a low-complexity filter:
def is_low_complexity(seq: str, threshold: float = 0.5) -> bool:
# Flag homopolymers (KKKKKK, RRRRRR) and tandem repeats
max_freq = max(seq.count(aa) for aa in set(seq)) / len(seq)
if max_freq > threshold:
return True
for i in range(len(seq) - 2):
kmer = seq[i:i+3]
if seq.count(kmer) * 3 / len(seq) > 0.6:
return True
return False
This step is important. Homopolymeric sequences (poly-K, poly-R) score high
on fraction_positive and fool the classifier - they're real AMPs but
structurally trivial and not useful for ColabFold. The filter removed 44
low-complexity sequences.
Result: 131 structured candidates (15–44 aa) sent to ColabFold
Stage 5: Structure prediction with ColabFold
I ran ColabFold 1.6.1 on Google Colab (free T4 GPU) in single-sequence mode appropriate for short peptides where MSA depth is limited anyway.
colabfold_batch \
colabfold_input.fasta \
colabfold_output/ \
--model-type alphafold2_ptm \
--num-recycle 3 \
--num-models 1 \
--msa-mode single_sequence
For 131 peptides of 15-44 aa, the full run took ~15 minutes on a T4.
Parsing results locally
def parse_scores_json(json_path):
with open(json_path) as f:
data = json.load(f)
return {
"mean_plddt": float(np.mean(data["plddt"])),
"ptm": float(data["ptm"]),
"plddt_per_residue": data["plddt"],
}
pLDDT distribution
69 of 131 candidates (52.7%) have pLDDT ≥ 70, indicating meaningful structural confidence. For short peptides - which are inherently more disordered than globular proteins - this is a strong result.
Structural confidence vs activity prediction
The scatter shows no strong correlation between classifier confidence and
structural confidence - which is biologically expected. A peptide can be
highly predicted antibacterial (high ab_proba) but disordered in solution
(low pLDDT), because many AMPs adopt structure only upon membrane contact.
Final results
Applying both filters - pLDDT ≥ 90 AND ab_proba ≥ 0.95 - gives
47 very high-confidence candidates:
| AMP ID | Length | pLDDT | pTM | ab_proba |
|---|---|---|---|---|
| AMP_02388 | 30 aa | 97.3 | 0.54 | 0.952 |
| AMP_01454 | 36 aa | 97.2 | 0.60 | 0.985 |
| AMP_02689 | 28 aa | 97.2 | 0.50 | 0.954 |
| AMP_01282 | 37 aa | 97.1 | 0.61 | 0.981 |
| AMP_01129 | 38 aa | 97.0 | 0.62 | 0.959 |
Full list: candidate_amps.csv
What I'd do next
- Hemolysis screening - filter candidates through HemoPI2 to remove cytotoxic sequences before any wet-lab work
- ESM-2 embeddings - train a second classifier on sequence embeddings and compare against the physicochemical feature model
- MD simulation - for the top 5 candidates, run short molecular dynamics to assess membrane-binding stability
- Experimental validation - MIC assays against E. coli and S. aureus
Key lessons
1. Reproducibility over convenience. The CD-HIT reimplementation in pure Python adds ~100 lines but means the entire pipeline runs on any OS without binary dependencies.
2. Class imbalance is a feature, not a bug. The 3:1 negative:positive
ratio reflects reality. Don't oversample your way out of it.
3. Interpret your model before trusting it. The SHAP beeswarm told me
the model was learning real biology (charge, helix propensity) rather than
spurious correlations. Without that check I wouldn't have been confident
sending its predictions to a GPU.
4. pLDDT ≠ activity. Low structural confidence doesn't mean an AMP is
inactive - it may just be disordered until it hits a membrane. Keep both
metrics and let them inform each other.
GitHub: github.com/Farhan89082/amp-colabfold
If you found this useful or have questions about any stage, drop a comment below.




Top comments (0)