Most "AI security" training right now is about large language models: prompt injection, jailbreaks, RAG poisoning. That work matters, but it skips an older and still unsolved problem. If your organization runs a malware classifier, a phishing detector, a fraud model, or any ML system that makes a security decision, the relevant threat is adversarial machine learning, and most courses do not teach it.
Adversarial machine learning is the study of attacks against a model's learned decision boundary, and of the defenses against them. It predates the LLM wave by at least a decade, and its techniques transfer directly to the detection models security teams already depend on. Here is what training in this area should cover and where to find it.
What Adversarial ML Actually Covers
The field breaks into a few attack classes. A course worth taking treats each one, because the defenses differ.
- Evasion. Perturb an input at inference time so the model misclassifies it while a human sees nothing wrong. Classic methods are FGSM (Fast Gradient Sign Method), PGD (Projected Gradient Descent), and the Carlini-Wagner attack. In security this is a malware sample tweaked to slip past a static classifier (MITRE ATLAS AML.T0043).
- Poisoning. Corrupt the training data so the model learns the wrong thing. Label flipping degrades accuracy; a backdoor trigger makes the model misbehave only on inputs carrying a specific pattern (ATLAS AML.T0020 and AML.T0018). Any model that retrains on user feedback, like a spam filter, is exposed.
- Model extraction and inference. With only query access to an API, an attacker can approximate the model (stealing it) or recover facts about its training data through membership inference (ATLAS AML.T0024). A fraud or abuse model that answers queries in production is exposed to both.
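To make the extraction bullet concrete, here is a minimal sketch using only NumPy. Everything in it is a stand-in: `victim_api`, the hidden weights, and the budget of 2,000 queries are invented for illustration, and a real target would be a rate-limited remote endpoint, not a local function. The attacker only ever sees 0/1 decisions, yet a surrogate trained on those answers can end up agreeing with the victim on most fresh inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "victim": a linear detector whose weights the attacker never sees.
_secret_w = rng.normal(size=8)
_secret_b = 0.3

def victim_api(x):
    """Label-only query interface: returns 0/1 decisions, no scores."""
    return (x @ _secret_w + _secret_b > 0).astype(int)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression, used as the surrogate."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Attacker: send synthetic queries, record the answers, train a copy.
X_query = rng.normal(size=(2000, 8))
y_query = victim_api(X_query)
w_hat, b_hat = fit_logreg(X_query, y_query)

# Agreement on fresh inputs measures how faithful the stolen copy is.
X_fresh = rng.normal(size=(5000, 8))
surrogate_pred = (X_fresh @ w_hat + b_hat > 0).astype(int)
agreement = np.mean(surrogate_pred == victim_api(X_fresh))
print(f"surrogate agrees with victim on {agreement:.1%} of fresh inputs")
```

A linear victim is the easy case. The same query-and-copy loop applies to nonlinear models, but it typically needs more queries and a surrogate with enough capacity to match.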
The NIST AI 100-2 taxonomy is the reference that pins down this vocabulary. Read it early so you and the rest of your team use the same terms.
The Tools You Should Be Hands-On With
You learn this by running attacks, not reading about them. The libraries to know:
- Adversarial Robustness Toolbox (ART) is the broadest. It covers evasion, poisoning, extraction, and inference attacks plus defenses, and works across scikit-learn, PyTorch, TensorFlow, and XGBoost.
- Foolbox and CleverHans focus on evasion against neural networks, with clean implementations of the standard attacks.
- TextAttack handles NLP models, which matters for text-based phishing and abuse classifiers.
- RobustBench gives you a standardized robustness benchmark and pretrained robust models to test against.
- Counterfit from Microsoft wraps several of these into a security-team-oriented automation harness.
A short evasion attack with ART against a trained classifier looks like this:
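import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod
# clf is a trained scikit-learn classifier that ART can compute loss gradients
# for, such as LogisticRegression; tree ensembles expose no gradients, so FGSM
# will not run against them. X_test is your hold-out set and y_test its
# one-hot encoded labels.
classifier = SklearnClassifier(model=clf)
attack = FastGradientMethod(estimator=classifier, eps=0.2)
# No labels are passed, so ART attacks the model's own predictions on X_test.
X_adv = attack.generate(x=X_test)
# Compare accuracy on the clean inputs and on their perturbed copies.
clean_acc = np.mean(classifier.predict(X_test).argmax(1) == y_test.argmax(1))
adv_acc = np.mean(classifier.predict(X_adv).argmax(1) == y_test.argmax(1))
print(f"clean accuracy: {clean_acc:.3f} adversarial accuracy: {adv_acc:.3f}")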
The gap between those two numbers is the point. A model that scores 0.98 on clean data and 0.30 under a modest FGSM perturbation is not deployable in a contested setting, and clean-data accuracy hid that completely.
The Part Most Courses Skip: Evaluating Robustness Honestly
The common failure in this space is reporting accuracy on clean data and calling it security. Real training teaches robustness evaluation: attacking your own model with multiple methods at varying perturbation budgets, and treating the worst result as the truth.
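A minimal sketch of that evaluation loop, in plain NumPy so it runs without any attack library. The data is synthetic, the model is a small logistic regression, and the three "attacks" (FGSM, a basic PGD, and a random-noise baseline) and the budgets are assumptions chosen for illustration. The part to copy is the bookkeeping: a sample counts as robust only if it survives every attack at that budget.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, d=20, shift=0.5):
    """Synthetic two-class data standing in for a real detection dataset."""
    X = np.vstack([rng.normal(-shift, 1.0, (n, d)), rng.normal(shift, 1.0, (n, d))])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a small logistic regression by gradient descent.
X_train, y_train = make_data(1000)
X_test, y_test = make_data(500)
w, b = np.zeros(X_train.shape[1]), 0.0
for _ in range(300):
    err = sigmoid(X_train @ w + b) - y_train
    w -= 0.1 * X_train.T @ err / len(y_train)
    b -= 0.1 * err.mean()

def predict(X_in):
    return (X_in @ w + b > 0).astype(float)

def loss_grad(X_in, y_in):
    """Gradient of the cross-entropy loss with respect to the input."""
    return (sigmoid(X_in @ w + b) - y_in)[:, None] * w[None, :]

def fgsm(X_in, y_in, eps):
    return X_in + eps * np.sign(loss_grad(X_in, y_in))

def pgd(X_in, y_in, eps, steps=10):
    X_adv = X_in.copy()
    for _ in range(steps):
        X_adv = X_adv + (eps / 4) * np.sign(loss_grad(X_adv, y_in))
        X_adv = np.clip(X_adv, X_in - eps, X_in + eps)  # project back into the L-inf ball
    return X_adv

def random_noise(X_in, y_in, eps):
    return X_in + rng.uniform(-eps, eps, X_in.shape)

attacks = {"fgsm": fgsm, "pgd": pgd, "noise": random_noise}
budgets = (0.0, 0.1, 0.3, 0.5)
worst_case = {}
for eps in budgets:
    survived = np.ones(len(y_test), dtype=bool)
    report = []
    for name, attack in attacks.items():
        correct = predict(attack(X_test, y_test, eps)) == y_test
        survived &= correct  # robust only if no attack flips this sample
        report.append(f"{name}={correct.mean():.3f}")
    worst_case[eps] = survived.mean()
    print(f"eps={eps:.1f}  " + "  ".join(report) + f"  worst-case={worst_case[eps]:.3f}")
```

For a linear model FGSM is already the strongest L-infinity attack, so PGD adds nothing here; against a deep network the methods diverge, which is the reason to run several. The random-noise row is a sanity check: if it hurts as much as the gradient attacks, something in the setup is off.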
It also has to cover defenses honestly, because most are partial. Adversarial training (training on adversarial examples, the Madry et al. approach) is the strongest general defense and still degrades under stronger attacks. Input preprocessing and detector-based defenses are frequently broken by adaptive attackers who know the defense is there. A course that presents any single defense as a fix is selling something. The honest framing is a measurable increase in attacker cost, mapped to a threat model.
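Here is a toy sketch of adversarial training, again in plain NumPy. It is not the Madry et al. recipe as published: the inner attack is single-step FGSM, not multi-step PGD, which is only equivalent because the model is linear. The dataset is constructed for the demonstration (one sturdy feature plus many weak, easily shifted ones), and `EPS`, the feature counts, and the learning rate are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)
EPS = 0.4  # L-infinity budget the defender wants to withstand (assumed)

def make_data(n, n_weak=100, eta=0.2, p_correct=0.9):
    """One sturdy +/-1 feature (right 90% of the time) plus many weak, fragile ones."""
    y = rng.integers(0, 2, n).astype(float)
    y_pm = 2 * y - 1
    sturdy = y_pm * np.where(rng.random(n) < p_correct, 1.0, -1.0)
    weak = rng.normal(eta * y_pm[:, None], 1.0, (n, n_weak))
    return np.column_stack([sturdy, weak]), y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, b, eps):
    """Inner maximisation: worst-case L-inf perturbation (exact for a linear model)."""
    grad_x = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)

def train(X, y, adversarial, lr=0.05, steps=600):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        # Adversarial training swaps the batch for its attacked version each step.
        X_step = fgsm(X, y, w, b, EPS) if adversarial else X
        err = sigmoid(X_step @ w + b) - y
        w -= lr * X_step.T @ err / len(y)
        b -= lr * err.mean()
    return w, b

def accuracy(X, y, w, b):
    return np.mean((X @ w + b > 0) == (y == 1))

X_train, y_train = make_data(2000)
X_test, y_test = make_data(2000)

results = {}
for name, adversarial in (("standard", False), ("adversarial", True)):
    w, b = train(X_train, y_train, adversarial)
    clean = accuracy(X_test, y_test, w, b)
    robust = accuracy(fgsm(X_test, y_test, w, b, EPS), y_test, w, b)
    results[name] = (clean, robust)
    print(f"{name:12s} clean={clean:.3f}  robust@eps={EPS}={robust:.3f}")
```

The data is rigged so that the standard model leans on the weak features and loses most of its accuracy under attack, while the adversarially trained model falls back on the sturdy feature and holds up better at this budget. Expect the robust model to give up some clean accuracy in exchange, and expect real data to be far less clean-cut. Neither number says anything about budgets above `EPS` or attacks the training loop never saw.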
Where to Learn It
A vendor-neutral look at the options:
- Self-study. The ART example notebooks, the CleverHans tutorials, NIST AI 100-2, and the MITRE ATLAS case studies are free and good. What self-study lacks is a target you are cleared to attack and feedback on your method.
- Academic material. Groups like the Madry Lab at MIT publish the foundational work. Strong on theory, lighter on the security-operations framing.
- Conference trainings. Black Hat and Hack In The Box run multi-day intensives from independent specialists. Quality varies by instructor, so read the syllabus and the bio.
- GTK Cyber. Adversarial ML and AI red-teaming taught for security practitioners, with labs in a Python and Jupyter environment so you script your own attacks rather than only running canned scanners. It runs at Black Hat USA 2026 and as custom on-site engagements.
Whatever you pick, apply one test before registering: does the syllabus name specific tools and give you a model to break? If the answer is no, it is an awareness briefing, and you can get that from a paper for free. Adversarial machine learning is a hands-on discipline.
GTK Cyber built its applied AI and AI red-teaming courses around exactly this gap: security people with the adversarial instinct but no AI-specific training, and AI training that never touched a threat model. That intersection is where this work lives.