XGBoost is one of the most important libraries in machine learning. 26,000+ GitHub stars. Used by banks for fraud detection, insurance companies for risk modeling, tech companies for ranking systems, and pretty much every competitive ML team on Kaggle. If you've done production ML in the last decade, chances are XGBoost is somewhere in your stack.
I decided to audit it.
What I found were 5 distinct vulnerabilities spanning memory safety in C++, unsafe deserialization in Python, a concurrency bug in the model loader, and a fundamentally broken authentication scheme in the distributed training protocol. All confirmed with working proof-of-concept code against XGBoost 3.2.0 (latest release at the time of testing).
The XGBoost maintainers decided not to patch any of them. Instead, they published the project's first-ever security disclosure page. That page was directly informed by my research and explicitly references "the reports we received."
This post walks through each finding, the response, and what I think it means for ML security more broadly.
Finding 1: Heap Out-of-Bounds Read via Unvalidated Tree Node Indices
Severity: Critical
CWE: CWE-125 (Out-of-bounds Read)
Affected files: tree_model.cc, cpu_predictor.cc, predict_fn.h
XGBoost model files (.json and .ubj format) contain tree structures with indices pointing to parent nodes, child nodes, and split features. When XGBoost loads a model file, these indices are used directly to access arrays in memory. The problem: none of them are validated against the actual array bounds.
This means a crafted model file can specify an index like 999999 for an array that only has 100 elements. What happens next depends on what's sitting in memory at that offset.
I tested this systematically. Out of 6 test vectors with large out-of-bounds indices, 5 triggered SIGSEGV crashes (immediate denial of service). For the more interesting case, I tested 200 consecutive small offsets just past the valid array boundary. All 200 successfully read adjacent heap memory without crashing. That's a silent information leak.
The core issue is straightforward. When the prediction code walks the tree, it does something like:
```cpp
node = nodes[node_index];  // node_index comes directly from the model file
```
There's no check that node_index < nodes.size(). The model file is trusted implicitly.
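Until the library validates these indices itself, a caller can do it externally. Here's a minimal sketch of a pre-load check over one tree's child arrays; `left_children` and `right_children` follow XGBoost's documented JSON model schema (with -1 marking a leaf), but treat this as an illustrative pre-check I wrote for this post, not a complete model validator.

```python
def check_tree_indices(tree: dict) -> list[str]:
    """Flag child indices that fall outside the tree's node arrays.

    `left_children`/`right_children` come from XGBoost's JSON model
    schema; -1 marks a leaf. Illustrative pre-check, not exhaustive.
    """
    left = tree["left_children"]
    right = tree["right_children"]
    n = len(left)
    problems = []
    for i, (l, r) in enumerate(zip(left, right)):
        for label, child in (("left", l), ("right", r)):
            if child != -1 and not (0 <= child < n):
                problems.append(
                    f"node {i}: {label} child {child} out of range [0, {n})"
                )
    return problems
```

Run something like this over every tree under `learner.gradient_booster.model.trees` in the JSON file before handing it to XGBoost. It won't catch everything, but it catches exactly the crafted-index class described above.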
Finding 2: Memory Corruption in Custom UBJSON Parser
Severity: High
CWE: CWE-120 (Buffer Copy without Checking Size of Input)
Affected files: json_io.h, json.cc
XGBoost implements its own UBJSON parser rather than using an established library. Custom parsers are always interesting from a security perspective because they tend to have fewer eyes on them than battle-tested libraries.
I found multiple issues in this parser:
- Missing bounds checks in ReadStream(): The parser reads data from the input stream without verifying that enough bytes are available, leading to reads past the end of the buffer.
- Attacker-controlled memcpy sizes in DecodeStr(): String length values come from the UBJSON file and are passed directly to memory copy operations. A crafted file can specify a length that exceeds the available data, causing a read overflow.
- Integer truncation in Forward(): The stream position is advanced by a value that goes through an integer type conversion. Depending on the platform, this can wrap around, causing the parser to operate on the wrong region of memory.
- Attacker-controlled allocation sizes in ParseTypedArray(): Array length values from the file control allocation sizes. While this alone might just cause an out-of-memory condition, combined with the other issues it creates opportunities for heap corruption.
All of these trigger crashes on crafted .ubj files.
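The fix pattern for the DecodeStr() class of bug is simple: validate the declared length against the bytes actually remaining before copying. Here's a Python sketch of a hardened length-prefixed string read; the 4-byte little-endian length prefix is my assumption for illustration, not XGBoost's actual UBJSON layout.

```python
import struct

def decode_str(buf: bytes, pos: int) -> tuple[str, int]:
    """Read a length-prefixed string, checking bounds before copying.

    The second `if` is the check the vulnerable code skips: the length
    field is attacker-controlled and must be capped by the real buffer.
    """
    if pos + 4 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if length > len(buf) - pos:
        raise ValueError(
            f"declared length {length} exceeds remaining {len(buf) - pos} bytes"
        )
    s = buf[pos:pos + length].decode("utf-8")
    return s, pos + length
```

Without that second check, a file declaring a 1000-byte string backed by 2 bytes of data walks straight off the end of the buffer; that's the read overflow described above.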
Finding 3: Data Race and Double-Free in Parallel Tree Loading
Severity: High
CWE: CWE-415 (Double Free)
Affected files: gbtree_model.cc
This one is a concurrency bug. When XGBoost loads a model with multiple trees, it uses parallel execution (ParallelFor) to process them concurrently. Each tree entry has an ID that determines which slot it goes into.
If a crafted model file contains two tree entries with the same ID, two threads will simultaneously try to write to the same slot. Specifically, they both call reset() on the same unique_ptr. This is a textbook data race that results in a double-free: the memory backing the unique_ptr is freed twice, corrupting the heap allocator's metadata.
The result is a SIGSEGV crash. In theory, a carefully crafted heap layout could turn a double-free into something more dangerous, but I only demonstrated the crash.
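Since the race only fires when two entries share an ID, a caller can reject such models before loading. A minimal sketch of that pre-check (where exactly the tree IDs live in the file depends on the model format version, so this operates on an already-extracted list):

```python
from collections import Counter

def find_duplicate_tree_ids(tree_ids: list[int]) -> list[int]:
    """Return tree IDs that appear more than once.

    Duplicate IDs are what make two parallel loader threads reset()
    the same slot; rejecting the model up front sidesteps the race.
    """
    return [tid for tid, count in Counter(tree_ids).items() if count > 1]
```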
Finding 4: RCE via pickle.loads() on Network Data
Severity: High
CWE: CWE-502 (Deserialization of Untrusted Data)
Affected files: collective.py
This is the finding I feel strongest about technically.
XGBoost's distributed training module includes a broadcast() function that shares Python objects between workers. The implementation serializes objects with pickle.dumps() on the sending side and deserializes with pickle.loads() on the receiving side. There is zero validation, no allowlist, no signing, nothing.
If an attacker can join the training cluster as a rogue worker (see Finding 5 for how easy that is), they can send a crafted pickle payload that executes arbitrary code on every other worker when they deserialize it. This is a well-understood attack vector. Python's own documentation explicitly warns: "The pickle module is not secure. Only unpickle data you trust."
The PySpark integration has a similar issue. It uses cloudpickle.loads() to deserialize metadata from Parquet files containing saved models. If someone hands you a saved PySpark XGBoost model from an untrusted source, loading it can execute arbitrary code.
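The mechanism is worth seeing concretely. Pickle's `__reduce__` protocol lets an object specify a callable to invoke at load time, which is exactly what an attacker's payload does. Here's a self-contained demo with a harmless recorder function standing in for the attacker's callable (a real payload would return something like `(os.system, ("...",))` instead):

```python
import pickle

executed = []

def record(msg):
    executed.append(msg)

class Payload:
    # __reduce__ tells pickle to call an arbitrary callable at load
    # time. Harmless recorder here; os.system in a real attack.
    def __reduce__(self):
        return (record, ("code ran at unpickle time",))

blob = pickle.dumps(Payload())   # what a rogue worker would broadcast
pickle.loads(blob)               # what every receiving worker does
```

After `pickle.loads()` returns, `executed` contains the message: code chosen by the sender ran on the receiver, with no cooperation from the receiving application beyond calling `loads()`. That's the entire attack.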
Finding 5: Missing Authentication in Rabit Tracker Protocol
Severity: Critical
CWE: CWE-798 (Use of Hard-coded Credentials)
Affected files: Rabit tracker implementation
This finding chains with Finding 4 to create a full remote code execution path.
XGBoost's distributed training uses a tracker server that coordinates workers. Workers connect to the tracker and receive information about the cluster topology (which other workers to connect to, the communication ring structure, etc.).
The "authentication" for this connection is a hardcoded magic number: 0xff99. That's it. It's a constant in the public source code. Any attacker who can reach the tracker on the network can:
- Connect using the magic number
- Receive the full cluster topology
- Join as a fake worker
- Send malicious pickle payloads to all real workers via broadcast (Finding 4)
- Achieve arbitrary code execution on every machine in the training cluster
The federated learning server has a similar issue with insecure default credentials.
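To make concrete why a hardcoded constant isn't authentication: the 0xff99 value is quoted from the public source above, but the framing below (field width, byte order, the tracker's check) is my assumption for illustration. Anyone who can read the repository can produce a packet the tracker accepts.

```python
import struct

MAGIC = 0xff99  # the hardcoded "credential" from the public source

def hello_packet() -> bytes:
    """What a rogue client's opening bytes could look like."""
    return struct.pack("<i", MAGIC)

def tracker_accepts(packet: bytes) -> bool:
    """All the tracker can verify: does the public constant match?"""
    (value,) = struct.unpack("<i", packet[:4])
    return value == MAGIC
```

Because the secret is in the source code, `tracker_accepts()` distinguishes nothing: legitimate workers and attackers present identical credentials.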
The Response
I reported all 5 findings to the XGBoost security team via email (security@xgboost-ci.net) with full PoC code, CVSS scores, and suggested fixes. I also submitted 4 of the 5 through huntr.
The response:
"Thank you for your interest in the XGBoost project. After internal discussions, our team decided not to address the suggestions you submitted. There are multiple reasons for this decision, including: 1) Performance implications; and 2) Lack of developer resources."
They referred to the vulnerabilities as "suggestions."
What they did instead was publish a security disclosure page:
https://xgboost.readthedocs.io/en/latest/security.html
This page documents the threat model and explicitly acknowledges the vulnerability classes I reported. On the model file issues, it states: "The reports we received describe manipulating the JSON files to mislead XGBoost into reading out-of-bounds values or using conflicting tree indices." On pickle: "XGBoost as a machine learning library is not designed to protect against pickle data from an untrusted source." On the tracker authentication: "For performance reasons, we decided that the collective module will NOT support TLS authentication or encryption."
Before my report, this page didn't exist. XGBoost had zero documentation about its security boundaries.
My Take
I'm not going to pretend I'm not disappointed. Five findings with working PoCs, and zero patches. But I also think the outcome was still meaningful.
The reality is that most ML libraries weren't built with an adversarial threat model. XGBoost was designed to be fast, not to resist malicious inputs. When the maintainers say "performance implications," they're being honest. Bounds checking on every tree node access during prediction adds overhead in a library where microseconds matter.
But here's the thing: users need to know that. Before this security page existed, a developer loading an XGBoost model from a user upload, or running distributed training on shared infrastructure, had no way to know they were operating outside the library's threat model. Now they do.
The security page is essentially a contract: "Here's what we protect against. Here's what we don't. You're responsible for everything outside these boundaries."
That's actually useful. It lets security teams make informed decisions. If you're running XGBoost in an environment where model files could be tampered with, you now know you need to validate them externally. If you're running distributed training, you now know the protocol has no authentication and you need network-level isolation. That information didn't exist publicly before.
Lessons for Security Researchers
1. Not every project will fix what you find, especially in the ML ecosystem, where performance is the primary concern and security is often an afterthought. That doesn't mean the research was wasted.
2. Document everything. When the maintainers responded, they had my full PoCs, file:line references, and suggested fixes available. Even though they chose not to patch, they used my research to build comprehensive security documentation. The quality of your report determines the quality of the outcome, even when the outcome isn't what you wanted.
3. Understand the project's threat model before reporting. If I'd known upfront that XGBoost considers untrusted model files out of scope, I might have focused my effort differently. Findings 4 and 5 (pickle deserialization and tracker authentication) are harder to dismiss with "don't load untrusted files," but the maintainers bundled all 5 findings together in their response.
4. The doc-shield is real. If a project publishes documentation saying "this is unsafe by design," future reports about that exact issue will be rejected. Sometimes your report is the trigger that creates the doc-shield. That's frustrating but it's the reality of how open-source security works.
What You Should Do If You Use XGBoost
Read the security page: https://xgboost.readthedocs.io/en/latest/security.html
Specific recommendations:
Don't load model files from untrusted sources. If you must, validate them in an isolated environment first. XGBoost will not catch malformed indices or corrupted structures. It will either crash or read garbage memory.
Don't use pickle for model serialization in untrusted contexts. Use xgboost.Booster.save_model() and load_model() with .json or .ubj format instead of pickle.dump/load. The native formats have their own issues (Findings 1-3), but they don't give you arbitrary code execution the way pickle does.
Isolate your distributed training network. The tracker has no real authentication and the broadcast protocol uses pickle. If an attacker can reach your training cluster's network, they can join it and execute code on every worker. Use VPCs, network policies, or whatever your cloud provider offers for network isolation.
Don't load PySpark XGBoost models from untrusted sources. The cloudpickle deserialization in the PySpark integration means a malicious saved model can execute code when loaded.
Context
This research is part of my ongoing work auditing the AI/ML open-source ecosystem. Other recent findings include:
CVE-2026-33017 (Langflow): Unauthenticated remote code execution, CVSS 9.3 (Critical). Now on CISA KEV. Exploited in the wild within 20 hours of advisory publication with no public PoC available. Attackers built working exploits directly from the advisory description.
CVE-2026-32628 (AnythingLLM): SQL injection in the SQL Agent plugin via unsanitized table names across MySQL, PostgreSQL, and MSSQL connectors.
Additional accepted or pending findings in Flowise, promptfoo, ComfyUI, Dify, Open WebUI, ModelScan, DefenseClaw, and others.
The ML stack has the same vulnerability classes as traditional software (memory corruption, injection, deserialization, missing auth) but with less security scrutiny. That's the gap I'm working to close.