DEV Community

Yadnyesh Pawar
AI Against Intruders: Reviving ML Models for Smarter Cybersecurity

Introduction: Rethinking Cyber Defense

Cyber threats are no longer rare disruptions—they’re persistent, evolving dangers. In 2023 alone, over 66% of small to mid-sized businesses experienced cyber attacks, with the average data breach costing $4.45 million (IBM). As digital infrastructure expands—especially with remote work and IoT—cybersecurity must evolve too.

This is where Machine Learning (ML) comes in. Traditional rule-based defenses fail against novel threats. ML enables dynamic defense by recognizing hidden patterns, adapting over time, and responding in real time.

The Power of Machine Learning in Cybersecurity

Beyond Rules: ML for Threat Detection
Machine learning doesn’t rely on static signatures. Instead, it detects suspicious behaviors or anomalies—like abnormal user activity or network traffic spikes—before they escalate into breaches.

Anomaly detection models excel at this task. They learn what "normal" looks like and flag anything out of place, acting as a digital immune system.
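As a minimal sketch of this "digital immune system" idea, here is an anomaly detector trained only on normal traffic (scikit-learn's IsolationForest is used for illustration; it is not necessarily the exact model from this project, and the feature values are synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Learn what "normal" looks like from baseline traffic (bytes, duration).
rng = np.random.default_rng(42)
normal_traffic = rng.normal(loc=[500, 60], scale=[50, 5], size=(500, 2))

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_traffic)

# A sudden traffic spike looks nothing like the learned baseline.
spike = np.array([[50_000, 1]])
print(detector.predict(spike))  # -1 means "anomaly", 1 means "normal"
```

The detector never saw an attack during training; it flags the spike simply because it falls outside the learned notion of normal.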

Project Spotlight: Network Anomaly Detection

To bring this into action, we built a Network Anomaly Detection (NAD) system using classic ML models. The goal? Spot malicious activity in real time from raw network traffic.

Dataset: KDD CUP 1999
We used the KDD CUP 1999 dataset, a gold standard in intrusion detection benchmarking. It includes normal and attack traffic labeled across four major threat categories:

  • DoS – Denial of Service
  • Probe – Surveillance or scanning
  • R2L – Remote to Local attacks
  • U2R – User to Root privilege escalation

Despite being dated, this dataset remains valuable for ML experimentation.

Data Preprocessing: Laying the Groundwork

Key Steps Taken:

  1. Label Encoding: Transformed categorical fields like protocol type and service into numerical values.
  2. Balancing with SMOTE: Since U2R attacks were underrepresented, SMOTE oversampling helped prevent model bias and boosted rare class detection.
  3. Feature Selection: From 42 original features, we cut it to 21 using feature importance metrics without sacrificing performance.
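The three steps above can be sketched roughly as follows. The toy DataFrame and column names are illustrative stand-ins for KDD-style records, and sklearn's `resample` is used as a simple oversampling stand-in for SMOTE to keep the example dependency-light:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier

# Toy frame standing in for KDD-style records (values are synthetic).
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"] * 25,
    "src_bytes": np.random.default_rng(0).integers(0, 1000, 100),
    "label": ["normal"] * 95 + ["u2r"] * 5,   # U2R is heavily underrepresented
})

# 1. Label-encode categorical fields like protocol type.
le = LabelEncoder()
df["protocol_type"] = le.fit_transform(df["protocol_type"])

# 2. Oversample the rare class (simple resampling in place of SMOTE).
minority = df[df["label"] == "u2r"]
upsampled = resample(minority, replace=True, n_samples=50, random_state=0)
balanced = pd.concat([df[df["label"] == "normal"], upsampled])

# 3. Rank features by importance; in the real project this cut 42 features to 21.
X, y = balanced[["protocol_type", "src_bytes"]], balanced["label"]
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranked = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
```

SMOTE itself (from the `imbalanced-learn` package) interpolates synthetic minority samples rather than duplicating rows, which is why it was preferred in the actual project.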

Statistical Testing: Digging into Data Behavior

Why Stats Matter:
Hypothesis testing helped validate which features were most significant. Key insights included:

  1. t-test & ANOVA: src_bytes significantly differed between normal and malicious connections, while dst_bytes mattered more across attack types.
  2. Chi-square Test: Protocol type and certain flags like S0, SH, and RSTR were strongly associated with anomalies.

These findings informed feature engineering and model refinement.
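The two families of tests above can be reproduced in a few lines with SciPy. The data here is synthetic, standing in for the KDD features; the point is the shape of the analysis, not the specific numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic src_bytes for normal vs. malicious connections (illustrative only).
normal_src = rng.normal(300, 40, 200)
attack_src = rng.normal(800, 60, 200)

# t-test: does src_bytes differ between normal and malicious traffic?
t_stat, p_val = stats.ttest_ind(normal_src, attack_src, equal_var=False)
print(f"t-test p-value: {p_val:.2e}")  # tiny p-value: the feature is significant

# Chi-square: is protocol type associated with the anomaly label?
contingency = np.array([[90, 10],    # tcp:  normal vs. anomalous counts
                        [20, 80]])   # icmp: normal vs. anomalous counts
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"chi-square p-value: {p:.2e}")
```

A significant result tells you the feature carries signal about the label, which is exactly what guided the feature engineering here.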

Model Building: Choosing RFC and XGBoost

We tried multiple models—Random Forest (RFC), Gradient Boosted Trees, LightGBM, AdaBoost—but RFC and XGBoost emerged as the most reliable performers.

Why These Two?

  1. Random Forest: Robust to overfitting, good with unbalanced data, and excellent at highlighting feature importance.
  2. XGBoost: Efficient, highly accurate, and great for tabular data like network logs.

We used both in a Voting Classifier ensemble to enhance prediction strength.
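A minimal version of such an ensemble might look like the following. Synthetic data stands in for the 21 KDD features, and sklearn's GradientBoostingClassifier is used as a fallback in case `xgboost` is not installed:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
    boost = XGBClassifier(n_estimators=100, eval_metric="logloss")
except ImportError:  # fall back to sklearn's boosted trees if xgboost is absent
    from sklearn.ensemble import GradientBoostingClassifier
    boost = GradientBoostingClassifier(n_estimators=100)

# Imbalanced synthetic stand-in for the preprocessed KDD features.
X, y = make_classification(n_samples=600, n_features=21,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("xgb", boost)],
    voting="soft",  # average predicted probabilities instead of hard votes
)
ensemble.fit(X_tr, y_tr)
print(f"test accuracy: {ensemble.score(X_te, y_te):.3f}")
```

Soft voting averages the two models' predicted probabilities, so a confident prediction from one model can outweigh a borderline one from the other.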


Model Evaluation & Experiment Tracking

We split the dataset into training, validation, and test sets and manually tuned hyperparameters. To manage experiments, we used MLflow, which allowed us to:

  1. Track parameters and metrics
  2. Compare models visually
  3. Save experiment history
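In code, the tracking workflow described above looks roughly like this. The experiment name, parameter values, and metric values below are placeholders for illustration, not the project's actual runs, and the MLflow calls are skipped gracefully if the package is not installed:

```python
try:
    import mlflow
    HAVE_MLFLOW = True
except ImportError:
    HAVE_MLFLOW = False

# Placeholder values: in practice these come from the tuning loop and evaluation.
params = {"n_estimators": 100, "max_depth": 12}
metrics = {"f1": 0.97, "recall": 0.95}

if HAVE_MLFLOW:
    mlflow.set_experiment("network-anomaly-detection")
    with mlflow.start_run(run_name="rfc-xgb-ensemble"):
        mlflow.log_params(params)    # track hyperparameters
        mlflow.log_metrics(metrics)  # track evaluation metrics
```

Every run logged this way shows up in the MLflow UI, where parameter/metric combinations can be compared side by side.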

Key Metrics Tracked:

  • Accuracy: General correctness (but not enough on its own)
  • Precision & Recall: Essential for imbalanced attack detection
  • F1 Score: Balanced view for precision and recall trade-offs
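A toy example makes the point about accuracy concrete. On an imbalanced set where attacks are rare, a model can miss many attacks and still score well on accuracy:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy predictions on an imbalanced set: 90 normal (0), 10 attacks (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 6 + [0] * 4   # the model misses 4 of the 10 attacks

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.96 (looks great)
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 1.00
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 0.60 (the real story)
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```

Recall exposes the missed attacks that accuracy hides, which is why precision, recall, and F1 were tracked alongside accuracy.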


Deployment with Streamlit: Making It Accessible

We built a lightweight web app using Streamlit, named SecureNetAI. Features include:

  • Upload interface for network traffic files
  • Real-time predictions with probability scores
  • Simple, clear UI for non-technical users

This app empowers network administrators to monitor traffic and detect threats live.

Why Streamlit?

  1. Rapid prototyping
  2. Clean UI without frontend expertise
  3. Free hosting for demos and PoCs
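A stripped-down sketch of what such a Streamlit app can look like is below. The actual SecureNetAI code lives in the linked repo; here the model scoring is a placeholder constant, and the app logic is wrapped in a function (with a lazy Streamlit import) so the small labeling helper can be exercised without Streamlit installed:

```python
import pandas as pd

def label_row(prob: float, threshold: float = 0.5) -> str:
    """Map an anomaly probability to a human-readable verdict."""
    return "anomalous" if prob >= threshold else "normal"

def main() -> None:
    import streamlit as st  # imported lazily so the helper stays testable
    st.title("SecureNetAI: Network Anomaly Detection")
    uploaded = st.file_uploader("Upload network traffic (CSV)", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        # In the real app, model.predict_proba(df) produces these scores;
        # a constant stands in for this sketch.
        df["anomaly_prob"] = 0.5
        df["verdict"] = df["anomaly_prob"].map(label_row)
        st.dataframe(df)

# Launch with: streamlit run app.py (calling main() at module level)
```

The whole interface is those few calls: a title, a file uploader, and a results table.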

Real-World Application

To make this project useful beyond development:

  • The full codebase and documentation are available on GitHub
  • A hosted version of SecureNetAI is live via Streamlit Cloud
  • Deployment-ready with CI/CD pipeline support for AWS

  • Repository: Network-Anomaly-Detection on GitHub

This project offers a plug-and-play model for organizations looking to incorporate ML-based anomaly detection into their network security.

Looking Ahead: Smarter, Faster Cyber Defense

This project showcases how a well-prepared dataset, basic statistical insights, and classic ML models can build a meaningful defense system. But the journey doesn’t end here.

Future Enhancements:

  • Model drift detection to handle evolving threat patterns
  • Real-time streaming with Kafka/Spark
  • Deep learning-based intrusion detection
  • AutoML pipelines for easier retraining
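As one example of the drift-detection idea, a two-sample Kolmogorov–Smirnov test can compare a feature's training-time distribution against live traffic. This is a simple illustrative approach with synthetic data, not part of the current project:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_src_bytes = rng.normal(300, 40, 1000)  # distribution seen at training time
live_src_bytes = rng.normal(450, 40, 1000)   # live traffic has shifted upward

# KS test: are the two samples drawn from the same distribution?
stat, p_value = ks_2samp(train_src_bytes, live_src_bytes)
drifted = p_value < 0.01
print(f"KS statistic {stat:.2f}, drift detected: {drifted}")
```

When drift is flagged on key features, that is the signal to retrain before detection quality silently degrades.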

Conclusion: ML is the New Cybersecurity Ally

Machine learning is transforming cybersecurity from a reactive game to a predictive one. With tools like XGBoost, RFC, and Streamlit, anyone can build systems that detect threats proactively, adapt to new patterns, and safeguard digital infrastructure intelligently.

In a world where threats evolve faster than rules can catch up, ML is not just an option—it’s a necessity.
