
Darkstalker
Predicting Protein Secondary Structures with Machine Learning

Introduction
Protein secondary structure prediction is a fundamental task in bioinformatics, helping researchers understand protein folding, interactions, and function. Traditionally, techniques like X-ray crystallography and NMR spectroscopy are used to determine secondary structures, but these methods are costly and time-consuming. Machine learning (ML) offers a promising alternative—predicting secondary structures from amino acid sequences with high accuracy.

This project focused on developing an ML model that combines a convolutional neural network (CNN) with a bidirectional long short-term memory (BiLSTM) network to predict protein secondary structures (Helix = H, Beta Sheet = E, Coil = C). Using Kaggle’s GPU resources, the model was trained, optimised, and evaluated, achieving over 71% overall accuracy.

Dataset and Preprocessing
Dataset Selection
The dataset for this project was sourced from Kaggle, containing peptide sequences and their corresponding secondary structures. The dataset was formatted into tabular data, including:

Amino acid sequences (input features).
Secondary structure labels in Q3 format (H, E, C).
Sequence metadata (length, presence of non-standard amino acids, etc.).
Preprocessing Steps
Encoding Amino Acid Sequences

One-hot encoding (simple and fast).
Pretrained embeddings (ProtBERT, TAPE, ESM2) for better representation.
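
For the one-hot option above, here is a minimal sketch in Python. The post does not show its preprocessing code, so the alphabet, the one_hot_encode helper, and the max_len of 128 are illustrative assumptions:

```python
import numpy as np

# Illustrative one-hot encoder; not the project's exact preprocessing code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str, max_len: int = 128) -> np.ndarray:
    """Encode a peptide sequence as a (max_len, 20) one-hot matrix,
    zero-padding or truncating to a fixed length."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(sequence[:max_len]):
        if aa in AA_INDEX:  # non-standard residues stay all-zero
            encoded[i, AA_INDEX[aa]] = 1.0
    return encoded

print(one_hot_encode("MKTAYIAKQR").shape)  # (128, 20)
```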
Label Encoding

Converted H, E, and C into numerical categories (0, 1, 2).
Train-Validation-Test Split

80% training, 10% validation, 10% test split.
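
Putting the label encoding and the split together, a hedged sketch follows. The label map matches the H/E/C categories above; the Coil padding choice, random_state, and placeholder arrays are assumptions standing in for the real encoded dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

LABEL_MAP = {"H": 0, "E": 1, "C": 2}  # Q3 labels to numerical categories

def encode_labels(structure: str, max_len: int = 128) -> np.ndarray:
    """Map a Q3 structure string to per-residue integer labels, padding with Coil."""
    labels = np.full(max_len, LABEL_MAP["C"], dtype=np.int64)
    for i, s in enumerate(structure[:max_len]):
        labels[i] = LABEL_MAP[s]
    return labels

# Placeholder arrays standing in for the encoded dataset.
X = np.random.rand(1000, 128, 20).astype(np.float32)
y = np.random.randint(0, 3, size=(1000, 128))

# 80/10/10 split: hold out 20%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```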
Model Architecture and Training
The model architecture was designed to capture both local and long-range sequence dependencies in protein structures.

Model Structure
CNN Layer—Extracts local spatial features from amino acid sequences.
BiLSTM Layer—Captures long-range dependencies, improving sequence learning.
Fully Connected Layer—Maps extracted features to secondary structure classes.
Softmax Activation—Outputs probability scores for each structure type.
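
The post does not name a framework, so here is a minimal Keras sketch of the four layers above; the filter count, kernel size, and LSTM units are assumptions:

```python
from tensorflow.keras import layers, models

def build_model(max_len: int = 128, n_features: int = 20, n_classes: int = 3):
    inputs = layers.Input(shape=(max_len, n_features))
    # CNN layer: local features over a window of neighbouring residues
    x = layers.Conv1D(64, kernel_size=7, padding="same", activation="relu")(inputs)
    # BiLSTM layer: long-range dependencies in both directions
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # Fully connected layer mapping features to the three structure classes
    x = layers.TimeDistributed(layers.Dense(n_classes))(x)
    # Softmax: per-residue probabilities over H / E / C
    outputs = layers.Activation("softmax")(x)
    return models.Model(inputs, outputs)

model = build_model()
model.summary()
```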

Training Strategy
Loss Function—Categorical cross-entropy (multi-class classification).
Optimiser—Adam (fast convergence).
Training Duration—30 epochs with early stopping to prevent overfitting.
Hardware—Kaggle’s T4 GPU, ensuring efficient training.
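
A sketch of that setup in Keras follows; the batch size and early-stopping patience are assumptions, and with integer labels the sparse variant of categorical cross-entropy applies:

```python
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    optimizer="adam",                        # fast convergence
    loss="sparse_categorical_crossentropy",  # multi-class, integer labels
    metrics=["accuracy"],
)

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=30,        # with early stopping
    batch_size=64,    # assumption
    callbacks=[early_stop],
)
```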

Optimization Techniques
Batch Processing & Caching—Reduced latency during training.
GPU Optimisation—Used the GPU only for training, minimising CPU-GPU data transfer overhead.
Learning Rate Adjustments—Fine-tuned for stable convergence.
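
The post does not show its input pipeline; one common way to get the batching and caching described above is tf.data, sketched here with assumed buffer and batch sizes:

```python
import tensorflow as tf

train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .cache()                     # keep encoded tensors in memory after epoch 1
    .shuffle(buffer_size=1024)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # overlap CPU batch prep with GPU training
)

model.fit(train_ds, epochs=30)
```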

Results and Evaluation
The trained model was evaluated on unseen test data, achieving the following results:

Performance Metrics
Structure       Accuracy
H (Helix)       76.21%
E (Beta Sheet)  63.26%
C (Coil)        70.92%

Overall Accuracy: 71.01%

Confidence Statistics
Structure  Mean Confidence  Std Deviation
H          0.8013           0.1763
E          0.7272           0.1723
C          0.6969           0.1511

Classification Report
Class  Precision  Recall  F1-Score
H      76.37%     76.21%  76.29%
E      66.37%     63.26%  64.78%
C      68.79%     70.92%  69.84%
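
For completeness, here is a sketch of how numbers like these can be derived from the model’s softmax output with scikit-learn; the array names follow the earlier snippets and are assumptions:

```python
from sklearn.metrics import classification_report

probs = model.predict(X_test)   # (n_samples, max_len, 3) softmax scores
y_pred = probs.argmax(axis=-1)  # predicted class per residue
confidence = probs.max(axis=-1) # probability of the predicted class

# Per-class precision / recall / F1 over all residues
print(classification_report(y_test.ravel(), y_pred.ravel(),
                            target_names=["H", "E", "C"], digits=4))

# Mean confidence and standard deviation per predicted class
for cls, name in enumerate(["H", "E", "C"]):
    mask = y_pred == cls
    print(f"{name}: mean={confidence[mask].mean():.4f}, std={confidence[mask].std():.4f}")
```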

Key Insights
The H (Helix) and C (Coil) structures were predicted with high accuracy.
The E (Beta Sheet) structure had the lowest accuracy (63%), indicating potential class imbalance.
Mean confidence scores were high across all three classes, indicating reliable predictions.
Early stopping at epoch 27 ensured stable training without overfitting.

Future Improvements
While the model performed well, there are several areas for enhancement:

Transformers for Protein Sequences—Replace BiLSTM with models like ESM2 or ProtBERT for better sequence representation.
Scaling with Distributed ML—Train on larger datasets using cloud-based ML frameworks.
Tertiary Structure Prediction—Implement Variational Autoencoders (VAEs) and Diffusion Models to predict 3D protein structures.
Generative AI for Protein Design—Extend to synthetic protein generation for drug discovery applications.

Conclusion
This project demonstrated the power of ML in bioinformatics, predicting protein secondary structures with a CNN + BiLSTM architecture. By leveraging Kaggle’s free compute resources, the model achieved 71% overall accuracy, with potential for further gains from transformer-based models. Future work will focus on scaling, generative modeling, and real-world biotech applications.

For those interested, the full code, dataset, and results are open-source, inviting contributions and collaboration.

Check out the project on Kaggle here: https://www.kaggle.com/code/allanwandia/secondary-structure-data-analysis


