DEV Community

Aryan
Aryan

Posted on

Building a 10-Agent Security Civilization with Qwen and Alibaba Cloud 🛡️🤖

Security auditing is broken.

If you’ve ever run a static analysis tool (SAST) on a large codebase, you know the pain: thousands of alerts, zero context, and a 90% false-positive rate. On the other end of the spectrum, hiring human penetration testers is incredibly expensive and impossible to scale alongside modern CI/CD pipelines.

For the Qwen Cloud Global AI Hackathon, we decided to rethink the problem entirely. What if, instead of using a single monolithic AI to "find bugs," we built an entire specialized civilization of agents?

Meet NEXUS, an autonomous society of 10 distinct AI agents that discovers, triages, exploits, patches, and reports security vulnerabilities in real open-source software.


🧬 The Agent Civilization

Instead of asking an LLM to "find a bug and fix it" (which usually results in hallucinations), we split the vulnerability lifecycle into 10 distinct, highly-specialized roles.

Powered by the DashScope API (using Qwen-Max and Qwen-Plus), our pipeline looks like this:

  1. 🔍 Scout (Qwen-Plus): Crawls GitHub and maps the repository's file structure.
  2. 📊 Recon (Qwen-Plus): Analyzes the attack surface, identifying high-risk entry points (auth middleware, raw SQL queries, etc.).
  3. 🎯 Hunter (Qwen-Max): Performs deep codebase analysis using Qwen-Max's massive 1M token context window.
  4. 💥 Exploit (Qwen-Max): Generates safe, proof-of-concept (PoC) code to cryptographically prove the vulnerability is real.
  5. Verify (Qwen-Max): Acts as a skeptic, cross-validating the PoC to eliminate false positives.
  6. ⚖️ Governance (3 Agents): A council of three distinct personas (CVSS Scorer, Impact Assessor, Exploitability Judge) that debate and vote on the vulnerability's severity.
  7. 🔧 Patch (Qwen-Max): Generates an AST-aware security fix with regression tests.
  8. 👁️ Review (Qwen-Max): Ensures the patch is clean and doesn't break existing functionality.
  9. 📝 Report (Qwen-Plus): Generates a CVE-ready security advisory.

By forcing the system to generate a PoC and independently verify it, we shifted from a model of guessing bugs to proving them. Zero false positives by design.


🏛️ The Governance Council

One of the coolest features we built is the Governance Council.

When the Hunter agent finds a verified vulnerability, we don't just ask a single LLM to rate its severity. Instead, we spin up three distinct agents with completely different system prompts:

  • The CVSS Scorer: Obsessed with the technical vector (Network vs. Local, High vs. Low complexity).
  • The Impact Assessor: Focuses entirely on business logic (Can this leak PII? Can this take down the database?).
  • The Exploitability Judge: Looks at how easy it is for a script kiddie to actually pull this off in the wild.

These three agents independently evaluate the finding, and the orchestrator mathematically averages their scores to reach a consensus. Watching them debate a vulnerability in real-time on our dashboard feels like a glimpse into the future of autonomous organizations.


🧠 The 3-Tier Memory System

To make NEXUS actually learn from its scans, we couldn't just rely on context windows. We built a 3-tier memory engine:

  1. L1 Working Memory (Redis): Handles real-time agent communication and short-term context. We also use Redis PubSub to stream agent "thoughts" live via WebSockets to our frontend dashboard.
  2. L2 Episodic Memory (PostgreSQL): Permanent, relational storage of scan results, exact PoC code, and historical governance votes.
  3. L3 Semantic Memory (pgvector): This is where it gets interesting. When NEXUS successfully exploits a bug, it generates a vector embedding of that specific code pattern. Before scanning a new file, it performs a similarity search against pgvector. Over time, NEXUS actively learns what vulnerabilities "look" like across different codebases.

☁️ Powered by Alibaba Cloud

NEXUS isn't just an API wrapper; it's deeply integrated into the Alibaba Cloud ecosystem.

  • Intelligence: We rely entirely on the dashscope-intl.aliyuncs.com API for Qwen inference. We routed high-reasoning tasks to Qwen-Max and summarization/routing tasks to Qwen-Plus to optimize our API credits.
  • Artifact Storage: Security advisories and exploit artifacts are highly sensitive. We integrated the oss2 Python SDK so that the moment the Report agent finishes its job, the final Markdown advisory is immutably uploaded to Alibaba Cloud Object Storage Service (OSS).
  • Deployment: The FastAPI backend, ARQ worker queues, and PostgreSQL/Redis databases are deployed seamlessly via Docker Compose onto Elastic Compute Service (ECS) instances.

🚀 The Result

We built a Next.js "Mission Control" dashboard that connects to our backend via WebSockets. When you paste a GitHub URL into NEXUS, you get to sit back and watch 10 AI agents systematically dismantle, exploit, and patch the codebase in real-time.

Building NEXUS taught us that the future of AI isn't a single, omniscient chatbot. It's specialized, communicative, and governed societies of agents working together to solve problems that humans simply don't have the scale to tackle alone.

Built for the Qwen Cloud Global AI Hackathon 2026. Check out the code on GitHub!

Top comments (0)