Who’s Really Controlling Your Hiring Algorithm?

#ai #mlsecurity #aisecurity #responsibleai

Research builds a focused benchmark and attack suite showing that resumes, the very documents we feed into LLM screeners can be weaponized. The authors demonstrate realistic manipulations and injection patterns that steer automated screening decisions (sometimes with high success rates), and they evaluate defenses that operate at inference and training time.

Why it matters: organizations increasingly rely on LLMs to triage candidates. These models make high-impact, high-scale decisions (who advances, who gets seen by humans). If attackers can subtly manipulate inputs to bias or poison outcomes, the result is systematic unfairness, legal exposure, and reputational damage and the attacks are often invisible to casual human review.

Key technical takeaways:

• Attack surface & effectiveness — The study constructs a resume-screening benchmark and shows several injection strategies that achieve high success rates for targeted manipulations.

• Stealthy manipulations — Adversarial content can be embedded in resumes in ways that bypass cursory human inspection but strongly influence LLM outputs.

• Defense taxonomy — Prompt-only mitigations give modest protection (small % reduction) while training-time defenses (e.g., FIDS via LoRA) produce larger reductions in attack success, combining inference and training defenses works best.

• Tradeoffs — Defenses reduce attack success but can increase false rejections or require retraining, detection remains imperfect.

• Human review limits — Even with humans in the loop, subtle attack patterns and model disagreement can let manipulated candidates slip through unless tooling and workflows are redesigned.

Practical implications for teams:

• Assume inputs are adversarial — treat resumes, cover letters, and uploaded docs as potential attack vectors, not trusted artifacts.

• Sanitize & canonicalize — parse resumes into structured features (experience, skills, dates) and score canonicalized views in addition to raw text.

• Combine inference + training defenses — apply prompt hardening and detection, but invest in training-time separation/detection (e.g., FIDS-style approaches) for stronger resilience.

• Audit trails & provenance — log raw inputs, sanitized versions, prompts, model outputs, and reviewer actions for post-hoc analysis and compliance.

• Human + metric gating — route high-uncertainty or high-impact cases to human panels and measure inter-annotator agreement to detect model drift or manipulation patterns.

• Test & red-team regularly — include resume-style adversarial examples in your robustness/regression suites and measure impact on fairness metrics.

arxiv.org