Farhan Rehman Sherief

Posted on Jul 1 • Edited on Jul 4

How I built a computational AMP screening pipeline: from 24,000 sequences to 47 drug candidates

#machinelearning #bioinformatics #python #datascience

Antimicrobial resistance is projected to kill 10 million people annually by 2050 - more than cancer. Antimicrobial peptides (AMPs) are one of the most promising therapeutic classes to fight it. They're cationic, membrane-disrupting, and bacteria struggle to develop resistance against them.

The problem: there are thousands of candidate sequences in databases. How do you identify the ones worth synthesising?

In this post I'll walk through a full computational screening pipeline I built from raw database download through machine learning activity prediction to ColabFold structure prediction - ending with 47 high-confidence drug candidates.

The full code is on GitHub: amp-colabfold

The pipeline at a glance

Five stages:

Data curation - 24,076 sequences from DRAMP 4.0 down to 12,435 after filtering
Feature engineering - 16 physicochemical features, 0% missing
Activity modelling - gradient boosting classifier, ROC-AUC 0.77
Candidate selection - 131 structured peptides sent to ColabFold
Structure prediction - 69 high-confidence structures, 47 very high confidence

Stage 1: Data curation

I pulled sequences from DRAMP 4.0 (the Database of Antimicrobial Peptides)
using their REST API - no manual downloads needed.

Three filters applied in sequence:

# Length: 10-50 aa (therapeutic window for AMPs)
df = df[df["sequence"].str.len().between(10, 50)]

# Canonical amino acids only - no modified residues
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
df = df[df["sequence"].apply(lambda s: set(s).issubset(CANONICAL))]

# 90% identity clustering — Python reimplementation of CD-HIT
# (CD-HIT is Linux-only; I wrote a greedy longest-first algorithm in pure Python)
df_curated = cluster_cdhit_python(df, identity=0.90)

The CD-HIT reimplementation was the most interesting part of this stage.
CD-HIT's greedy algorithm is conceptually simple: sort sequences longest-first, then for each sequence check if it's ≥90% identical to any existing cluster representative. If yes, discard it. If no, it becomes a new representative. I added 3-mer pre-filtering to skip obviously dissimilar pairs, matching CD-HIT's default behaviour.

Result: 24,076 → 12,435 sequences

Stage 2: Feature engineering

For each of the 12,435 sequences I computed 16 physicochemical features using
the peptides Python package:

features = {
    "net_charge_pH7":           pep.charge(pH=7.0, pKscale="Lehninger"),
    "hydrophobicity_eisenberg":  pep.hydrophobicity(scale="Eisenberg"),
    "hydrophobic_moment":        pep.hydrophobic_moment(window=11, angle=100),
    "boman_index":               pep.boman(),
    "aliphatic_index":           pep.aliphatic_index(),
    "instability_index":         pep.instability_index(),
    # + Chou-Fasman helix/sheet/turn propensities computed manually
    # + aromaticity, fraction_positive, fraction_negative, length, pI, MW
}

One methodological note: aromaticity() was removed from newer versions of
peptides, so I computed it directly from residue frequencies
(F + Y + W) / length — same definition as Biopython, just implemented inline.

Result: 16 features, 0% missing across all 12,435 sequences

Stage 3: Activity modelling

Binary classification: antibacterial (confirmed MIC data) vs
general/natural (uncharacterised or broad labels).

clf = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,        # shallow trees → interpretable SHAP values
    subsample=0.8,
    random_state=42,
)

ROC-AUC: 0.77 on a length-decile-stratified holdout set.

The class imbalance (28.9% positive) is real and intentional — confirmed
antibacterial sequences with MIC data are rarer than general database entries.

I didn't oversample, because the model's conservative predictions are
appropriate for a screening tool: better to miss some true positives than
flood the downstream ColabFold stage with false ones.

What drives the predictions?

The SHAP beeswarm tells a clean biological story:

Length dominates - longer peptides push positive. This reflects the antibacterial subset containing more full-length defensins and cathelicidins.
fraction_helix - high helix propensity → predicted antibacterial. Consistent with α-helical membrane disruption.
fraction_positive - cationic residues push positive. Textbook electrostatic attraction to anionic bacterial membranes.
fraction_negative - anionic residues suppress prediction. Expected.

This is not a black box. Every prediction is traceable to a physical property with a known biological mechanism. That's the point of using shallow trees with SHAP rather than a deep model.

Stage 4: Candidate selection

From the classifier output I selected peptides with ab_proba ≥ 0.938 and
then applied a low-complexity filter:

def is_low_complexity(seq: str, threshold: float = 0.5) -> bool:
    # Flag homopolymers (KKKKKK, RRRRRR) and tandem repeats
    max_freq = max(seq.count(aa) for aa in set(seq)) / len(seq)
    if max_freq > threshold:
        return True
    for i in range(len(seq) - 2):
        kmer = seq[i:i+3]
        if seq.count(kmer) * 3 / len(seq) > 0.6:
            return True
    return False

This step is important. Homopolymeric sequences (poly-K, poly-R) score high
on fraction_positive and fool the classifier - they're real AMPs but
structurally trivial and not useful for ColabFold. The filter removed 44
low-complexity sequences.

Result: 131 structured candidates (15–44 aa) sent to ColabFold

Stage 5: Structure prediction with ColabFold

I ran ColabFold 1.6.1 on Google Colab (free T4 GPU) in single-sequence mode appropriate for short peptides where MSA depth is limited anyway.

colabfold_batch \
  colabfold_input.fasta \
  colabfold_output/ \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --num-models 1 \
  --msa-mode single_sequence

For 131 peptides of 15-44 aa, the full run took ~15 minutes on a T4.

Parsing results locally

def parse_scores_json(json_path):
    with open(json_path) as f:
        data = json.load(f)
    return {
        "mean_plddt": float(np.mean(data["plddt"])),
        "ptm":        float(data["ptm"]),
        "plddt_per_residue": data["plddt"],
    }

pLDDT distribution

69 of 131 candidates (52.7%) have pLDDT ≥ 70, indicating meaningful structural confidence. For short peptides - which are inherently more disordered than globular proteins - this is a strong result.

Structural confidence vs activity prediction

The scatter shows no strong correlation between classifier confidence and
structural confidence - which is biologically expected. A peptide can be
highly predicted antibacterial (high ab_proba) but disordered in solution
(low pLDDT), because many AMPs adopt structure only upon membrane contact.

Final results

Applying both filters - pLDDT ≥ 90 AND ab_proba ≥ 0.95 - gives
47 very high-confidence candidates:

AMP ID	Length	pLDDT	pTM	ab_proba
AMP_02388	30 aa	97.3	0.54	0.952
AMP_01454	36 aa	97.2	0.60	0.985
AMP_02689	28 aa	97.2	0.50	0.954
AMP_01282	37 aa	97.1	0.61	0.981
AMP_01129	38 aa	97.0	0.62	0.959

Full list: candidate_amps.csv

What I'd do next

Hemolysis screening - filter candidates through HemoPI2 to remove cytotoxic sequences before any wet-lab work
ESM-2 embeddings - train a second classifier on sequence embeddings and compare against the physicochemical feature model
MD simulation - for the top 5 candidates, run short molecular dynamics to assess membrane-binding stability
Experimental validation - MIC assays against E. coli and S. aureus

Key lessons

1. Reproducibility over convenience. The CD-HIT reimplementation in pure Python adds ~100 lines but means the entire pipeline runs on any OS without binary dependencies.

2. Class imbalance is a feature, not a bug. The 3:1 negative:positive
ratio reflects reality. Don't oversample your way out of it.

3. Interpret your model before trusting it. The SHAP beeswarm told me
the model was learning real biology (charge, helix propensity) rather than
spurious correlations. Without that check I wouldn't have been confident
sending its predictions to a GPU.

4. pLDDT ≠ activity. Low structural confidence doesn't mean an AMP is
inactive - it may just be disordered until it hits a membrane. Keep both
metrics and let them inform each other.

GitHub: github.com/Farhan89082/amp-colabfold

If you found this useful or have questions about any stage, drop a comment below.

DEV Community