Hello fellow cybersecurity professionals and enthusiasts,
In this article, I will share my graduate capstone project, Large Language Models for One-Day Vulnerability Detection. It details a penetration testing framework that combines natural language processing with a large language model (LLM) driven multi-agent system to optimize one-day vulnerability detection, achieving an accuracy of 89.5% with a runtime of less than 30 seconds.
Problem
New software vulnerabilities are discovered and disclosed daily through Common Vulnerabilities and Exposures (CVE) records, which provide standardized documentation of these flaws. Once publicly accessible, these are known as one-day vulnerabilities. Despite disclosure, many systems remain unpatched during critical windows of exposure, allowing adversaries to exploit these flaws before organizations can mitigate them. Modern enterprises maintain expansive digital footprints, making timely vulnerability mitigation essential; delays in patching one-day vulnerabilities can lead to data breaches, operational disruption, and financial losses. The core issue lies in the latency between CVE disclosure and remediation, often due to insufficient automation and slow, manual vulnerability analysis.

Normally, a team of cyber security experts must extensively probe a target endpoint to detect all potential vulnerabilities. Once an exposure is identified, a patch is deployed along with a corresponding CVE report. These reports contain critical information on the vulnerability, its effects, mitigation strategies, and enumerated examples. This is tremendously helpful for security teams in recognizing potential bugs within their own complex systems, but it also pinpoints vulnerable attack surfaces for threat actors. Because flaws inevitably slip through undetected without an immediate response, there can be a considerable window in which an attacker holds the advantage. There is a particular need for an efficient, powerful tool that can scan intricate computer applications and report significant weaknesses along with the techniques to repair them. In this article, I will discuss a penetration testing solution to this urgent challenge that harnesses the capabilities of large language models (LLMs).
Solution and Proposed Methodology
The multi-agent LLM workflow, shown in the figure below, assigns each agent a segment of the penetration test against a target. A target in this experiment is defined as a purposefully vulnerable website (OWASP Vulnerable-Web-Application and Acunetix VulnWeb) on which the LLMs investigate known potential problems and report back to the user.
An initial Exploration Agent takes the specified target and analyzes all of its elements, such as input fields and endpoints. A Supervisor Agent then determines which areas are vulnerable and which CVEs are associated with those vulnerabilities. Once an attack type is determined, Vulnerability-Specific Agents that handle particular classes of attack, such as SQL Injection (SQLi) and Cross-Site Scripting (XSS), extract the relevant information from these CVEs. Lastly, the Fuzzer and Executor Agents produce and test payloads in order to achieve successful exploitation. At the end of testing, a detailed report lists the weaknesses found in the target and the best methods to fix them. Among previous solutions, GPT-4 achieved a success rate of 87%, while all other LLMs achieved 0% across the board. My goal for this project is to exceed an 87% success rate in identifying one-day vulnerabilities, helping ensure a more secure software ecosystem with proactive and reactive cyber security analysis for web-based systems.
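To make the workflow concrete, here is a minimal sketch of how these agent roles could be wired together in LangGraph. The node names, state fields, and routing logic are illustrative assumptions, not the exact project code; each node body is stubbed where the real system would call an LLM.

```python
# Minimal sketch of the agent pipeline wiring (illustrative assumptions,
# not the exact project code).
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class PentestState(TypedDict):
    target_url: str       # target under test
    findings: list[str]   # accumulated vulnerability findings
    next_agent: str       # set by the supervisor to route control

def exploration(state: PentestState) -> PentestState:
    # Enumerate input fields and endpoints on the target (stubbed here).
    return state

def supervisor(state: PentestState) -> PentestState:
    # Decide which vulnerability-specific agent should act next.
    state["next_agent"] = "sqli_agent"  # placeholder routing decision
    return state

def sqli_agent(state): return state   # tests SQL Injection patterns from CVE data
def xss_agent(state): return state    # tests Cross-Site Scripting patterns
def fuzzer(state): return state       # generates candidate payloads
def executor(state): return state     # submits payloads and records results

graph = StateGraph(PentestState)
graph.add_node("exploration", exploration)
graph.add_node("supervisor", supervisor)
graph.add_node("sqli_agent", sqli_agent)
graph.add_node("xss_agent", xss_agent)
graph.add_node("fuzzer", fuzzer)
graph.add_node("executor", executor)

graph.add_edge(START, "exploration")
graph.add_edge("exploration", "supervisor")
graph.add_conditional_edges("supervisor", lambda s: s["next_agent"],
                            {"sqli_agent": "sqli_agent", "xss_agent": "xss_agent"})
graph.add_edge("sqli_agent", "fuzzer")
graph.add_edge("xss_agent", "fuzzer")
graph.add_edge("fuzzer", "executor")
graph.add_edge("executor", END)

app = graph.compile()
```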
To grade the performance of the LLM system, each purposefully vulnerable website has a disclosed number of vulnerabilities, which is compared against the comprehensive report published by the multi-agent system at the end of each trial run. The evaluation will:
- Compare the number of true positive vulnerabilities detected by the system with the known vulnerabilities of each testbed.
- Measure precision, recall, and false positives to evaluate effectiveness.
- Assess time-to-detection and compare with manual testing methods.
The experiment will demonstrate not only how many vulnerabilities are found, but also how effectively and quickly the system can identify one-day vulnerabilities after disclosure, offering empirical support for the claim of improved cyber security readiness. The penetration testing system will utilize OpenAI's GPT-4o-mini together with LangChain/LangGraph for the multi-agent approach; LangGraph allows information to be exchanged and refined between stages of the workflow.
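As a concrete illustration of this scoring step, the sketch below computes precision, recall, and false positives from a trial report. The sets of findings are placeholders for illustration, not results from the actual testbeds.

```python
# Sketch of the scoring step: compare system findings against known testbed flaws.
# The sets below are placeholders, not the real testbed ground truth.
known = {"sqli:/login", "xss:/search", "sqli:/products"}      # disclosed vulnerabilities
reported = {"sqli:/login", "xss:/search", "xss:/comments"}    # system output

true_positives = len(known & reported)
false_positives = len(reported - known)
false_negatives = len(known - reported)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(f"precision={precision:.2f}, recall={recall:.2f}, FP={false_positives}")
```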
Multi-Agent Workflow
Multi-agent systems benefit significantly from direct collaboration between agents. To fully leverage this capability, an advanced conversational framework was designed that enables seamless inter-agent communication. By temporarily suspending centralized control, the system efficiently manages transfer requests, which improves context sharing and interaction between agents. In theory, this enhances operational efficiency and reduces unnecessary LLM calls, thereby lowering associated costs; breaking tasks into smaller components also improves execution speed and accuracy. Each agent is equipped with a customized prompt, a defined output schema, and a subset of documents that exemplify SQL Injection or XSS attack types.
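To make "customized prompt plus defined output schema" concrete, here is a hedged sketch of a vulnerability-specific agent using a Pydantic schema with LangChain's structured-output support. The field names and prompt wording are assumptions for illustration, not the project's actual prompts.

```python
# Sketch of a vulnerability-specific agent's output schema and prompt
# (field names and prompt wording are illustrative assumptions).
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class VulnFinding(BaseModel):
    vulnerability_type: str = Field(description="e.g. 'SQL Injection' or 'XSS'")
    location: str = Field(description="endpoint or input field affected")
    evidence: str = Field(description="observed behavior supporting the finding")
    mitigation: str = Field(description="recommended fix")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
sqli_llm = llm.with_structured_output(VulnFinding)

finding = sqli_llm.invoke(
    "You are a SQL Injection specialist. Given the page elements below, "
    "report the most likely injectable input and a mitigation.\n"
    "Elements: login form with fields 'username' and 'password' posting to /login"
)
print(finding.model_dump())
```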
It is recognized that one-day vulnerabilities may include previously unidentified signatures, which differ from the well-established patterns typically detected. This makes it particularly challenging for a team of specialized, task-specific agents to identify them accurately. To address this, the methodology is divided into two core components: one sub-team focuses on known vulnerability signatures drawn from the given CVE reports but enhanced with new elements, while the other sub-team concentrates on input randomization and execution.
Implementation
The structure is built on LangGraph's multi-agent design pattern, enabling each agent's language model to iteratively loop until a solution is found or a predetermined number of steps is exceeded. The LangGraph Python library permits each agent in the "chain" to transmit messages and contextual knowledge seamlessly. OpenAI's GPT-4o-mini model was used in this architecture because it is inexpensive while providing cutting-edge capabilities.
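The step ceiling described above maps naturally onto LangGraph's `recursion_limit` configuration. A hedged example of invoking a compiled graph with such a cap follows, reusing the `app` from the earlier wiring sketch; the limit value and fallback behavior are assumptions.

```python
# Capping the iterative loop: LangGraph raises GraphRecursionError once the
# configured number of supersteps is exceeded (the limit of 100 is illustrative).
from langgraph.errors import GraphRecursionError

initial_state = {"target_url": "http://testphp.vulnweb.com",
                 "findings": [], "next_agent": ""}
try:
    result = app.invoke(initial_state, config={"recursion_limit": 100})
except GraphRecursionError:
    # Fall back to whatever was gathered before the step limit was hit.
    result = {"findings": ["step limit reached; partial report only"]}
```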
Testing
In order to evaluate the performance of the multi-agent system, a testing environment must be established. If an arbitrary target URL were provided and the LLMs generated output detecting SQL Injection and XSS vulnerabilities on the site, there would be no way to confirm the validity of those findings without access to the website's underlying code. Purposefully vulnerable websites were therefore used as targets, so the confirmed number of SQL Injection and XSS vulnerabilities on each page is known in advance. This information is then compared with the multi-agent system's vulnerability assessment in order to grade its accuracy. The testing applications chosen for this project were Acunetix VulnWeb and OWASP Vulnerable-Web-Application, as shown in the figures below.
Evaluation
This section addresses the results, costs, and comprehensive findings regarding the performance of the multi-agent LLM penetration testing framework.
Trials and Cost
Nine trials were conducted on the Acunetix and OWASP targets to gauge the accuracy and efficiency of the LLM system. OWASP SQLi Levels 1-4, OWASP XSS Levels 1-4, and Acunetix VulnWeb were chosen as test cases for thorough evaluation. For each trial, the LLM model was given a target URL and expected to output the vulnerabilities detected on the site with descriptions and security recommendations. The duration of each trial was also measured to assess the system's speed.
During the testing phase, GPT-4o-mini incurs a modest cost of $0.15 per million input tokens and $0.60 per million output tokens. The number of steps accumulated in each trial corresponds to the communication between agents. Although full screenings can lead to higher runtime and costs, the added expense is nearly insignificant: for example, a trial that takes 100 steps costs only about $0.12, so this approach offers an efficient method to identify one-day vulnerabilities and software exploits.
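The cost claim can be checked with a quick back-of-the-envelope calculation. The per-step token counts below are assumptions chosen to match the reported figure, not measured values from the trials.

```python
# Back-of-the-envelope cost check for a 100-step trial.
# Token counts per step are assumed for illustration, not measured.
INPUT_PRICE = 0.15 / 1_000_000    # USD per input token (GPT-4o-mini)
OUTPUT_PRICE = 0.60 / 1_000_000   # USD per output token

steps = 100
input_tokens_per_step = 6_000     # assumed prompt + accumulated context
output_tokens_per_step = 500      # assumed agent response

cost = steps * (input_tokens_per_step * INPUT_PRICE +
                output_tokens_per_step * OUTPUT_PRICE)
print(f"${cost:.2f}")  # -> $0.12
```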
Effectiveness of the Architecture
The LLM-based multi-agent system identifies SQL Injection and XSS vulnerabilities in a target URL with an accuracy of 89.5% in an average trial time of 27.2 seconds. It provides the following:
- Elements Detected
- Vulnerabilities Detected
- Location Detected
- Possible Mitigation Strategies
- Fuzzing Payloads to Exploit Input Fields
- Findings from Controlled Execution
- Summary of Actions Taken (Final Report)
The system delivers a complete and in-depth cyber security analysis of the website for users to process and, in turn, implement critical security improvements that protect against future breaches. Furthermore, the original goal of exceeding 87% accuracy has been satisfied, surpassed by 2.5 percentage points.
Limitations
The language-model-driven agent network excels at generating accurate and extensive responses to support cyber security defense operations. Its primary objective of recognizing potential threats in browser-based applications is realized. However, there are drawbacks and limitations in the current implementation that, if amended, would enhance its overall promise as a solution for the future.
LLM Hallucinations
The figure below illustrates an agent transfer error that occurred sporadically due to confusion and sidetracking between nodes, attributable to LLM hallucinations. Conversations between the Exploration, Fuzzer, and Executor agents would occasionally derail, leading to wasted steps that failed to contribute new information to the vulnerability report until the application errored out. This happened very rarely in this framework, as opposed to prior solutions that were substantially affected. Unfortunately, issues of this sort are inherent to LLM-based platforms, and remedies will only come with improvements to GPT-4o-mini or the emergence of another economical and robust model.
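One pragmatic mitigation is to detect unproductive loops and abort early, rather than waiting for the hard error. The sketch below tracks whether recent steps have added any new findings; the window size of 10 is an arbitrary assumption, and this guard is a suggestion rather than part of the implemented system.

```python
# Sketch of a derailment guard: abort when recent steps stop adding findings.
# The window size of 10 is an arbitrary assumption.
def should_abort(step_findings: list[int], window: int = 10) -> bool:
    """step_findings[i] = cumulative finding count after step i."""
    if len(step_findings) < window:
        return False
    # No new findings across the last `window` steps => likely derailed.
    return step_findings[-1] == step_findings[-window]
```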
Narrow Attack Coverage
The multi-agent architecture currently covers only SQL Injection and Cross-Site Scripting vulnerabilities, another hurdle that must be overcome for practical implementation. Web-based applications are not limited to SQL Injection and XSS exploits, so the scope of the project must be broadened in future iterations of the design. This involves developing more nodes and specialized prompts to cover additional types of cyber security threats for a broader security assessment.
Conclusion
This project presents a machine-learning-based, optimized security testing framework that uses Large Language Models (LLMs) to autonomously parse and examine CVE records, supplemented by specialized AI agent review. The LLM agent network offers an organized and effective approach to identifying one-day vulnerabilities through extensive testing. Through meticulous development and trial verification, this approach has improved the accuracy and efficiency of LLM results compared to previous LLM security systems, establishing a feasible and versatile solution for detecting and addressing software vulnerabilities before or, in more critical stages, after exploitation. With a roadmap for future enhancements and corrections already in place, this robust and adaptable LLM multi-agent system demonstrates the capability to be a valuable resource within the cyber security domain.