Ronaldo Modesto

Using local LLMs in API log analysis for near real-time attack detection

A practical Proof of Concept (PoC): running a compact language model (qwen2.5-coder:1.5b) locally via Ollama to inspect logs from a REST API and detect attacks — including semantic ones, such as IDOR — that slip past WAFs and static rules.

The project used for this PoC is available on GitHub:

https://github.com/R9n/api-logs-llm-analyze


1. Introduction

In recent years, Artificial Intelligence has gone from being a distant promise to becoming part of the day-to-day life of software engineering. Code generation, automated pull request review, assisted documentation, synthetic tests — the adoption curve of LLMs (Large Language Models) in the SDLC (Software Development Life Cycle) has grown almost exponentially. What once required specialized tools and dedicated teams can now be prototyped in an afternoon, on a personal machine.

This same movement opens a fascinating (and ambiguous) chapter for defensive security (AppSec). On one hand, we gain a new kind of "analyst": a model capable of reading unstructured text, understanding context, and reasoning about intent — something that tools based on fixed patterns were never able to do well. On the other hand, AI itself introduces new risks: prompt injection, leakage of sensitive data in prompts, hallucinations that generate false positives, and the dependence on expensive infrastructure for inference at scale.

At the center of this debate lies an old and increasingly acute problem: how do we monitor large volumes of API logs in real time? A modern application generates thousands of log lines per minute. Most of it is legitimate operational noise. Hidden in that stream are the signs of attacks — a sequence of login attempts, a swapped ID, an exposed debug endpoint. Finding these signals manually is unfeasible, and traditional tools, as we will see, have structural limitations in detecting attacks that depend on context rather than signature.

In this article we explore this idea by trying to identify context-based attacks with a small but fast local LLM.


2. Proposed Approach and Objectives

The core of this work is simple to state and challenging to execute: run a local LLM as an automated security analyst, feeding it logs from a REST API in time windows, and measure its ability to detect attacks in (near) real time. I say near real time because the model always adds overhead: it has to read the logs, build a response, and return it to us.

Why not just Regex and signatures?

Traditional detection approaches — rule-based WAFs, signature-based IDS, Regex filters — are excellent at what they were designed for: capturing known and textually explicit patterns. A SQL Injection payload with ' OR '1'='1, a Path Traversal attempt with ../../etc/passwd, an XSS <script> — all of these have a clear "signature" that a static pattern recognizes.

The problem is that an entire class of attacks has no signature at all. Consider IDOR (Insecure Direct Object Reference) / BOLA (Broken Object Level Authorization):

GET /users/47b4f0a3-d582-47e9-832f-1e80ac12fe4b
Authorization: Bearer <common-user-token>

Syntactically, this request is identical to a legitimate one. There is no malicious character, no suspicious payload. What makes it an attack is the context: a regular user accessing another user's resource, or a single token touching many distinct IDs in a few seconds. Regex has no way of knowing this — it sees the line, not the story.

This is exactly where AI changes the game. An LLM can understand the context of a set of logs: correlate IPs, observe the cadence of requests, notice that a PATCH /users/:id/promote-admin coming from a recently created token is semantically anomalous — even if each request, in isolation, seems legitimate. It analyzes the intent inferred from behavior, not just the form.
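To make the contrast concrete, here is a minimal sketch (not code from the repository) of why line-by-line signatures miss IDOR while a per-token view of the same window exposes it. The rule list, the `events` sample, and the function names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Simplified signature rules of the kind a WAF applies to each line in isolation.
SIGNATURES = [
    re.compile(r"'\s*OR\s*'1'\s*=\s*'1", re.IGNORECASE),  # SQL injection
    re.compile(r"\.\./"),                                  # path traversal
    re.compile(r"<script", re.IGNORECASE),                 # XSS
]


def matches_signature(line: str) -> bool:
    """Return True if any static rule matches the given line."""
    return any(rule.search(line) for rule in SIGNATURES)


def distinct_ids_per_token(events):
    """Map each token to the set of /users/<uuid> IDs it touched in the window."""
    seen = defaultdict(set)
    for token, path in events:
        match = re.fullmatch(r"/users/([0-9a-f-]{36})", path)
        if match:
            seen[token].add(match.group(1))
    return seen


# Hypothetical window: a single token reading four different user IDs.
events = [
    ("token-A", f"/users/{i:08d}-0000-0000-0000-000000000000") for i in range(4)
]

print(any(matches_signature(path) for _, path in events))  # no signature fires
print(len(distinct_ids_per_token(events)["token-A"]))      # but one token hit 4 IDs
```

A fixed counter like this only covers the one pattern it was written for; the article's argument is that an LLM can reason about such context without a hand-written rule per case.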

Scope: this is a PoC

It is important to make clear from the outset: this work is a Proof of Concept (PoC). It is not a production-ready product, nor a replacement for corporate SIEM/WAF. It is a "start" — a controlled and reproducible experiment to answer a feasibility question:

Is it possible to use a lightweight local LLM to point out real attacks in API logs, including the logical/semantic attacks that static tools don't catch?

The goal is to demonstrate the feasibility of the approach and map out its pros, cons, and evolution paths toward a production scenario.

THIS PROJECT IS NOT READY FOR PRODUCTION; IT LACKS ADEQUATE PROTECTION MECHANISMS.


3. Technology Stack

To run the experiment we adopt an "all local" policy: no log data leaves the machine, there are no calls to third-party APIs, and inference incurs no per-call fees. This is especially relevant for security, where logs frequently contain sensitive data.

Execution platform: Ollama

Inference runs on Ollama, which exposes the model locally at http://localhost:11434. It handles loading the model onto the GPU, managing the context, and serving the responses via HTTP API — which makes it trivial to integrate into any Python pipeline.
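As an illustration of that HTTP API, the sketch below sends a prompt to Ollama's `/api/generate` endpoint using only the standard library. The project itself goes through CrewAI rather than calling Ollama directly, so the helper names `build_payload` and `ask_ollama` and the timeout value are assumptions made for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "qwen2.5-coder:1.5b") -> dict:
    """Build a non-streaming request that asks Ollama for JSON-formatted output."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def ask_ollama(prompt: str, timeout: float = 120.0) -> str:
    """POST the prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server returns a single JSON document
        # whose "response" field holds the model output.
        return json.loads(response.read())["response"]
```

Calling `ask_ollama("Classify this log line: ...")` requires the Ollama server to be running and the model to be pulled beforehand.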

Model used: qwen2.5-coder:1.5b

The choice of model was deliberate and perhaps counterintuitive. Instead of a giant model with tens of billions of parameters, we opted for qwen2.5-coder:1.5b — a model with only 1.5 billion parameters. The rationale:

  • Lightweight and fast: it fits comfortably in the VRAM of a more modest GPU and responds fast enough for analysis in 1-minute windows.
  • Optimized for structured tasks: being a coder variant, it is particularly good at understanding and producing structured outputs (JSON), interpreting technical formats (HTTP, paths, query strings), and following format instructions — exactly the profile needed to classify logs.
  • Proof of efficiency: part of the experiment's hypothesis is that analysis of text/structured logs does not require colossal models. A compact, specialized model may be sufficient.

Hardware configuration

The experiment ran entirely on a developer workstation (consumer hardware, not a server):

| Component | Specification |
| --- | --- |
| GPU | NVIDIA RTX 3080 (10 GB VRAM) |
| CPU | Intel Core i5 13600KF |
| Memory | 32 GB RAM |
| Storage | Standard NVMe SSD |

Language and main libraries

The ecosystem is divided into two worlds, reflecting the real structure of the project:

Monitor and scripts (Python):

  • CrewAI — orchestration of the analysis agent (definition of Agent, Task, and Crew).
  • langchain-ollama / Ollama integration — bridge between CrewAI and the local model.
  • requests — collecting API logs and running the traffic/attack scripts.
  • schedule — scheduling the analysis windows (periodic execution).
  • pydantic — validation and typing of the model's output.

Target API (Node.js / TypeScript):

  • NestJS 11 — framework for the user management REST API. It's just an example API, only so we have something to analyze.
  • @nestjs/jwt + passport-jwt — JWT authentication.
  • bcrypt — password hashing.
  • class-validator / class-transformer — DTO validation.
  • @nestjs/swagger — automatic API documentation.

4. Architecture and Project Structure

The project was organized into four major blocks: the target API, the monitor (LLM agent), the simulation/attack scripts, and the consolidation utilities. The general data flow is as follows:

┌─────────────────────┐     ┌──────────────────────┐     ┌─────────────────────┐
│  Simulation / Attack│────▶│   NestJS API         │────▶│  operations.log     │
│  (Python scripts)   │     │   (generates JSON    │     │  (api/logs/)        │
│                     │     │    logs)             │     │                     │
└─────────────────────┘     └──────────────────────┘     └──────────┬──────────┘
                                                                     │
                                                                     ▼
┌─────────────────────┐     ┌──────────────────────┐     ┌─────────────────────┐
│  Results (.json)    │◀────│   CrewAI Monitor     │◀────│  GET /logs          │
│  monitor/results/   │     │   + Ollama (Qwen)    │     │  (time window)      │
└─────────────────────┘     └──────────────────────┘     └─────────────────────┘

Macro role of each component

  • api/ — The target application. A user management API with JWT authentication. Its key role for the experiment lies in the logging/ module: a global interceptor (LoggingInterceptor) captures each request and writes a structured JSON record (method, path, IP, userId, sanitized body, status code, latency). The GET /logs endpoint allows querying these records by time window (?start=&end=&limit=) — it is the source the monitor consumes.

  • monitor/ — The AI agent. main.py runs in a loop: every minute, it computes the time window, fetches the logs via GET /logs, assembles the prompt, and triggers the CrewAI Crew. The model's response is validated against the Pydantic schema and persisted as analysis-YYYYMMDD-HHMMSS.json.

  • scripts/trafic-simulation/ — Generates the legitimate background noise: sign-ups, correct and incorrect logins, profile editing, password reset, expected validation errors (400/401/409). This is fundamental for testing whether the model distinguishes normal traffic from attacks (and not simply "screams attack" at everything).

  • scripts/attacks/ — The controlled attacks: password spray, user enumeration, and IDOR. All run only against the local API.

  • scripts/utils/aggregate_results.py — Consolidates all result JSONs into a Markdown report with aggregated metrics (the basis of Section 7).
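The monitor's per-window bookkeeping described above can be sketched as follows. This is not the repository's main.py: the ISO 8601 timestamp format for `start`/`end` and the helper names are assumptions, while the `?start=&end=&limit=` parameters, the 50-line limit, and the `analysis-YYYYMMDD-HHMMSS.json` naming come from the article.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

API_BASE = "http://localhost:3000"


def window_bounds(now: datetime, minutes: int = 1):
    """Return (start, end) of the window that ends at `now`."""
    end = now.replace(microsecond=0)
    return end - timedelta(minutes=minutes), end


def logs_url(start: datetime, end: datetime, limit: int = 50) -> str:
    """Build the GET /logs query for one window (timestamp format is assumed)."""
    query = urlencode(
        {"start": start.isoformat(), "end": end.isoformat(), "limit": limit}
    )
    return f"{API_BASE}/logs?{query}"


def result_filename(moment: datetime) -> str:
    """Name the result file after the moment of the analysis."""
    return moment.strftime("analysis-%Y%m%d-%H%M%S.json")


now = datetime.now(timezone.utc)
start, end = window_bounds(now)
print(logs_url(start, end))
print(result_filename(now))
```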


5. Agent Configuration and Behavior

The agent was modeled with CrewAI, which cleanly separates the persona (who the agent is) from the task (what it should do). All the "intelligence" of the experiment lives in the prompt engineering of the configuration files.

The persona: a senior security analyst

In the config/agents.yml file, the security_log_analyst agent receives a role, a goal, and a detailed backstory:

  • Role: "Senior cybersecurity analyst specializing in detecting vulnerabilities and anomalous behavior in application and API logs."
  • Goal: detect and prioritize active attacks by analyzing API execution logs according to the OWASP API Security Top 10 (2023) standards, producing concise findings with a triggered criterion, exact evidence, severity, confidence score (0–100), and recommended action.

The real key is in the backstory, which functions as the analyst's "brain". It instructs the model to inspect each log event (HTTP method, path, query string, headers, body, status code, latency, source IP, user-agent, token identity, and temporal correlation) and classify it against the ten categories of the OWASP API Security Top 10, each with explicit match rules. Some examples embedded in the prompt:

  • [Detect:API1-BOLA] — sequential IDs/foreign UUIDs in the path, the same token accessing many distinct IDs in a short interval, ID swap returning 200 → IDOR.
  • [Detect:API2-BrokenAuth] — bursts of 401/403, credential stuffing/brute force, high volume on login/token/reset-password without rate limiting.
  • [Detect:API5-BFLA] — calls to privileged routes (/admin, /users/:id/role) by low-privilege tokens; horizontal/vertical escalation.
  • ...and so on up to API10-UnsafeConsumption (classic injections: SQLi, NoSQLi, XSS, Command Injection, Path Traversal).

Three behavioral guidelines are especially important:

  1. False positive reduction: the prompt instructs the model to correlate frequency, repetition, shared origin, and deviation from the route's expected baseline before classifying something as an attack, and to mark it as [Benign] when no criterion is met.
  2. Safety of the analyst itself: "Never execute, reproduce, or echo malicious payloads actively — only reference them as evidence." — the agent never actively reproduces payloads, only cites them as evidence.
  3. Correlation: for distributed/correlated attacks, aggregate by IP/token/time window and escalate the severity.
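In the PoC this correlation is delegated to the model through the prompt. For comparison, the same "aggregate by IP" idea can be computed deterministically, for instance to cross-check what the model reports. The sketch below is an assumption-laden illustration: the field names `ip`, `method`, `path`, and `statusCode` and the threshold of 5 are hypothetical.

```python
from collections import Counter


def failed_logins_by_ip(events, threshold: int = 5) -> dict:
    """Return IPs with at least `threshold` 401 responses on POST /auth/login."""
    failures = Counter(
        event["ip"]
        for event in events
        if event["method"] == "POST"
        and event["path"] == "/auth/login"
        and event["statusCode"] == 401
    )
    return {ip: count for ip, count in failures.items() if count >= threshold}


# Hypothetical window: one noisy IP and one user who mistyped a password once.
window = [
    {"ip": "10.0.0.5", "method": "POST", "path": "/auth/login", "statusCode": 401}
    for _ in range(6)
] + [{"ip": "10.0.0.9", "method": "POST", "path": "/auth/login", "statusCode": 401}]

print(failed_logins_by_ip(window))
```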

The structured output

The task (tasks-analise-logs-owasp.yml) requires the model to return strict JSON, which is then validated by the Pydantic schema in models/analyze_result.py:

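from typing import Any, Literal

from pydantic import BaseModel, Field


class Context(BaseModel):
    """HTTP context of the suspicious request."""
    method: str
    path: str
    ip: str
    params: str = ""
    body: Any = None


class Event(BaseModel):
    """One finding reported by the agent."""
    evidence: str
    severity: Literal["Critical", "High", "Medium"]
    description: str
    confidence: float  # 0-100, as requested in the agent's goal
    recommendedAction: str
    context: Context


class LogAnalyzeResult(BaseModel):
    """Result of one analysis window."""
    threatDetected: bool
    events: list[Event] = Field(default_factory=list)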

In other words: the agent reads the batch of logs, reasons about anomalies, and returns a list of suspicious events — each with evidence, severity, confidence (0–100%), recommended action, and the full HTTP context. This structuring is what allows aggregating and measuring the results objectively.
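Small models sometimes wrap their JSON in extra text, so it can help to isolate the JSON object before schema validation. The sketch below is a standard-library illustration of that idea plus a few sanity checks that mirror the schema; it is not the repository's code, which relies on Pydantic, and the function names are hypothetical.

```python
import json

ALLOWED_SEVERITIES = {"Critical", "High", "Medium"}


def extract_json(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


def check_result(result: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if not isinstance(result.get("threatDetected"), bool):
        problems.append("threatDetected must be a boolean")
    for index, event in enumerate(result.get("events", [])):
        if event.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"events[{index}].severity is invalid")
        confidence = event.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 100:
            problems.append(f"events[{index}].confidence must be within 0-100")
    return problems


reply = 'Result: {"threatDetected": true, "events": [{"severity": "Low", "confidence": 140}]}'
print(check_result(extract_json(reply)))
```

Rejecting or flagging a window when `check_result` returns problems keeps malformed output from reaching the aggregated metrics.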


6. Experiment Methodology (Execution Flow)

The experiment was designed to be reproducible. The chronological step-by-step:

Step 1 — Starting the target API

cd api
npm install
npm run start:dev

The API comes up at http://localhost:3000, with a pre-seeded administrator (admin123@company.com / admin) and the logging interceptor already active recording each request.

Step 2 — Starting the Agent Monitor

cd monitor
python main.py

The monitor enters a loop, "listening" to the API logs every 1 minute (time_window = 1), consuming the GET /logs endpoint by time window and triggering the CrewAI agent over each batch (up to 50 lines per window, as per defaultApiLimitLines).

Step 3 — Running legitimate traffic

python scripts/trafic-simulation/normal-trafic.py

This script generates the realistic background noise: successful and unsuccessful logins, valid and invalid sign-ups, profile edits, admin block/unblock cycles, password reset. It is the experiment's "control" — the traffic that the model should not classify as an attack.

Step 4 — Running the simulated attacks

cd scripts/attacks
python password-spray.py        # one common password against many users
python user-enumeration.py      # 401 vs 404 differentiation to map users
python idor.py                  # semantic attack: GET/PATCH on others' resources

The highlight is idor.py, the central semantic attack of the experiment: it creates a regular "attacker" account, logs in, and tries GET /users/:id, PATCH /users/:id/block, .../unblock, and .../promote-admin on a victim's ID. These requests look perfectly legitimate in form but are malicious in context.

Step 5 — Observation, collection, and analysis

While normal traffic and attacks run in parallel (generating the mixed traffic that makes the test realistic), the monitor produces one JSON file per window in monitor/results/. At the end, run the consolidation script:

python scripts/utils/aggregate_results.py

It generates the analise-resultados.md report with all the aggregated metrics, which are the data presented in the next section.
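The kind of consolidation this step performs can be sketched in a few lines. This is an illustrative approximation, not the contents of aggregate_results.py: the function names are hypothetical, and the sketch assumes each result file follows the LogAnalyzeResult schema shown earlier.

```python
import json
from collections import Counter
from pathlib import Path


def load_results(directory: str = "monitor/results") -> list:
    """Read every analysis-*.json file in the results directory."""
    paths = sorted(Path(directory).glob("analysis-*.json"))
    return [json.loads(path.read_text(encoding="utf-8")) for path in paths]


def aggregate(results: list) -> dict:
    """Compute run counts, event counts, severity distribution, mean confidence."""
    events = [event for result in results for event in result.get("events", [])]
    confidences = [event["confidence"] for event in events]
    return {
        "analyses": len(results),
        "with_threat": sum(1 for result in results if result.get("threatDetected")),
        "events": len(events),
        "severity": dict(Counter(event["severity"] for event in events)),
        "avg_confidence": (
            round(sum(confidences) / len(confidences), 1) if confidences else None
        ),
    }


# Hypothetical results from two windows.
sample = [
    {
        "threatDetected": True,
        "events": [
            {"severity": "High", "confidence": 80},
            {"severity": "Critical", "confidence": 90},
        ],
    },
    {"threatDetected": False, "events": []},
]
print(aggregate(sample))
```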


7. Observed Results and Critical Analysis

The consolidation covered 23 runs of the agent over API logs, processing 809 log lines in mixed traffic (normal + attacks) over approximately 50 minutes.

Experiment summary

| Metric | Value |
| --- | --- |
| Period analyzed | 06/20/2026 14:46 – 06/20/2026 15:35 UTC |
| Total analyses | 23 |
| Analyses with threat detected | 10 (43.5%) |
| Analyses without threat | 13 |
| Total logs processed | 809 |
| Total events generated by the LLM | 221 |
| Average events per analysis | 9.6 |
| Global average confidence | 79.9% |

Global severity distribution

| Severity | Occurrences | Share |
| --- | --- | --- |
| Critical | 23 | 10.4% |
| High | 187 | 84.6% |
| Medium | 11 | 5.0% |
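The derived figures in the two tables above follow directly from the raw counts. A short check, using only the numbers reported in this section:

```python
analyses, with_threat, events = 23, 10, 221
severity = {"Critical": 23, "High": 187, "Medium": 11}


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


detection_rate = pct(with_threat, analyses)         # share of windows with a threat
events_per_analysis = round(events / analyses, 1)   # average events per run
shares = {name: pct(count, events) for name, count in severity.items()}

print(detection_rate, events_per_analysis, shares)
```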

Detections by attack type

Heuristic classification of the events based on the endpoint, HTTP method, and description returned by the model, aligned with the repository's attack scripts.

| Attack | Detections (events) | Analyses with the pattern | Average severity | Average confidence | Most cited endpoint |
| --- | --- | --- | --- | --- | --- |
| Password Spray / Brute Force | 133 | 10/23 | High | 78.0% | POST /auth/login |
| Sign-up abuse (POST /users) | 38 | 5/23 | High | 83.8% | POST /users |
| API scanning / reconnaissance | 16 | 4/23 | High | 83.4% | GET /debug/database |
| Other / model noise | 12 | 3/23 | High | 93.3% | POST /api-documentation |
| User enumeration | 11 | 1/23 | High | 67.7% | GET /users/:id? |
| IDOR (privilege escalation) | 6 | 2/23 | High | 94.2% | PATCH /users/ee87a7c3-f69e-4a4b-ae0f-db425d26a741/unblock |
| Unauthorized access | 5 | 1/23 | High | 64.0% | GET /check-activity |
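A heuristic classification of this kind, keyed on endpoint, HTTP method, and description, might look like the sketch below. The specific rules are guesses written for illustration and are not the rules used to produce the table; only the category names come from it.

```python
def classify(event: dict) -> str:
    """Assign an event to an attack category from its context and description."""
    context = event.get("context", {})
    method = context.get("method", "").upper()
    path = context.get("path", "")
    text = event.get("description", "").lower()

    if path == "/auth/login":
        return "Password Spray / Brute Force"
    if method == "POST" and path == "/users":
        return "Sign-up abuse"
    if path.startswith("/users/") and method == "PATCH":
        return "IDOR (privilege escalation)"
    if path.startswith("/users/") and "enumerat" in text:
        return "User enumeration"
    if path.startswith("/debug"):
        return "API scanning / reconnaissance"
    # Anything that matches no rule, including hallucinated paths, lands here.
    return "Other / model noise"


print(classify({"context": {"method": "PATCH", "path": "/users/abc/unblock"}}))
```

Rule order matters in a classifier like this: the first matching rule wins, so more specific rules belong before broader ones.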

Below are some screenshots from the PoC run:

[Screenshot: no threats detected]

[Screenshots: two examples of threats detected]

Coverage by analysis window

How many monitor runs identified each attack pattern (regardless of the number of events within the same window).

| Attack | Windows with detection | Rate over 23 analyses |
| --- | --- | --- |
| Password Spray / Brute Force | 10 | 43.5% |
| Sign-up abuse (POST /users) | 5 | 21.7% |
| API scanning / reconnaissance | 4 | 17.4% |
| Other / model noise | 3 | 13.0% |
| IDOR (privilege escalation) | 2 | 8.7% |
| Unauthorized access | 1 | 4.3% |
| User enumeration | 1 | 4.3% |

Critical analysis of the data

The numbers tell an encouraging story for a PoC.

The model detected the attacks that mattered. In 10 of the 23 windows (43.5%), the agent raised a threat flag — and these detections are concentrated exactly on the vectors that were effectively simulated. Password Spray / Brute Force was the detection champion (133 events, present in 10 windows), which is entirely consistent: password-spray.py hammers POST /auth/login repeatedly, and the burst of attempts from the same IP is precisely the kind of temporal pattern that the LLM correlates well.

The high point: the semantic detection of IDOR. This was the most difficult and most relevant test of the PoC. Remember: IDOR requests are syntactically legitimate, so a traditional WAF would let them through without blinking. Even so, the agent identified the pattern of privilege escalation via PATCH /users/:id/unblock and administrative actions on others' resources, and did so with the highest average confidence of the entire experiment: 94.2%. In other words, the model not only caught the attack invisible to static rules, but this was also the finding it was most confident about. The sample is small (6 events in 2 windows), but the result supports the central hypothesis of the work: the LLM's contextual analysis sees logical attacks that signatures don't see.

Confidence consistent with the clarity of the signal. The global average confidence stood at 79.9%, but the variation by category is revealing and shows a "calibrated" model: patterns with a strong signature and unambiguous context (IDOR at 94.2%, sign-up abuse at 83.8%, reconnaissance at 83.4%) receive high confidence, while more ambiguous inferences (unauthorized access at 64.0%, enumeration at 67.7%) receive lower confidence. The model, in essence, "knows when it's not sure".

The severity was well distributed. The predominance of High (84.6%) is appropriate for the set of attacks tested, with Critical (10.4%) appearing mainly in the windows of the most intense password spray — consistent with the API2-BrokenAuth criterion, classified as critical in the prompt.

And the cons already show up in the data. The "Other / model noise" category (12 events) is honest about the limit of the approach: it captures malformed or hallucinated responses from the LLM — invented or duplicated paths (POST /api-documentation, /check-activity, /token-authenticate) that don't correspond to real API endpoints. Curiously, these events came with high average confidence (93.3%), which is an important reminder: self-reported confidence by an LLM is no guarantee of correctness. This connects directly to the next section.


8. Pros and Cons of the Approach

A frank assessment, based on what the experiment showed in practice.

Pros

1. Detection of advanced attacks (the real differentiator).
The biggest asset of the approach is mitigating vulnerabilities that go unnoticed by common firewalls and WAFs — especially business logic flaws and IDOR/BOLA. Since the traffic of these attacks is syntactically legitimate, signature-based tools simply don't see them. The LLM's ability to analyze the context (who, accessing what, with what frequency, deviating from which baseline) allowed it to detect the simulated IDOR with 94.2% confidence. This is the kind of coverage that normally requires custom and expensive business rules — and that here emerged from the model's semantic understanding.

2. Efficiency of compact models.
The experiment flagged all three simulated attack classes (password spray, user enumeration, and IDOR) using an extremely lightweight model (qwen2.5-coder:1.5b). This challenges the idea that AI-assisted security analysis requires gigantic models. For the task of interpreting text/structured logs, a small, well-instructed model (with a good prompt based on the OWASP API Top 10) was sufficient. The practical implications are significant: lower cost, lower latency, feasibility of running on-premise, and log data that stays private.

3. Total control over data privacy.
Another critical advantage of this architecture is how it handles data privacy. Because the model is local and lightweight, no confidential or sensitive data leaves the user's controlled environment. The entire pipeline, from log collection and context parsing to the final LLM evaluation, runs within your own infrastructure (on-premise or private cloud). This removes the compliance risk of exposing internal traffic data to third-party APIs and keeps governance of corporate information in-house.

Cons

1. False positives (and hallucinations).
Under mixed traffic and in stress tests, the model classified legitimate/normal traffic as an attack at some moments and, more worryingly, even "hallucinated" endpoints that don't exist in the API (the "Other / model noise" category), sometimes with high confidence. This shows that the approach requires continuous prompt refinement, stricter output validation (an allowlist of real endpoints, grounding against the API inventory), and, above all, human supervision. An alert with 93% confidence about a non-existent endpoint is proof that self-reported confidence does not replace verification.
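Grounding findings against the API inventory can be as simple as the sketch below, which drops any event whose method and path match no known route. The `KNOWN_ROUTES` list is illustrative, built only from routes mentioned in this article; a real inventory would come from the API itself (for example its Swagger document), and the helper names are hypothetical.

```python
import re

# Illustrative inventory; a real one would be generated from the API definition.
KNOWN_ROUTES = [
    ("POST", "/auth/login"),
    ("POST", "/users"),
    ("GET", "/users/:id"),
    ("PATCH", "/users/:id/block"),
    ("PATCH", "/users/:id/unblock"),
    ("PATCH", "/users/:id/promote-admin"),
    ("GET", "/logs"),
]


def route_to_regex(route: str) -> re.Pattern:
    """Turn '/users/:id/block' into a pattern where ':id' matches one segment."""
    parts = [
        r"[^/]+" if part.startswith(":") else re.escape(part)
        for part in route.split("/")
    ]
    return re.compile("/".join(parts))


COMPILED = [(method, route_to_regex(route)) for method, route in KNOWN_ROUTES]


def is_known_endpoint(method: str, path: str) -> bool:
    """Check a reported method and path (query string ignored) against the inventory."""
    path = path.split("?", 1)[0]
    return any(
        known == method.upper() and pattern.fullmatch(path) is not None
        for known, pattern in COMPILED
    )


print(is_known_endpoint("PATCH", "/users/ee87a7c3-f69e-4a4b-ae0f-db425d26a741/unblock"))
print(is_known_endpoint("POST", "/api-documentation"))
```

Events that fail this check are better routed to a separate "unverified" bucket than silently discarded, since a probe against a non-existent route can itself be a reconnaissance signal when it appears in the raw logs.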

2. Local computational cost (Resource Exhaustion).
Even with a model of only 1.5B parameters, the GPU and VRAM consumption on the RTX 3080 (10 GB) was significant during continuous inference. For a PoC with 1-minute windows and ~50 logs per batch, it is perfectly manageable. But the experiment makes it clear that the approach does not scale for free: in a real production scenario (thousands of requests per second, multiple APIs, smaller windows) it would be necessary either to migrate to cloud models via API (giving up privacy and accepting variable cost in exchange for scale) or to invest in dedicated GPU infrastructure (raising Capex/Opex considerably). There is no free lunch: the inference cost is the price of semantic analysis.
The screenshot below was taken while the monitor was running an analysis. As it shows, running an LLM, even a tiny one, takes considerable computational power.

[Screenshot: computational consumption]
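A back-of-the-envelope calculation shows the size of the gap. The PoC figures (809 log lines, 23 runs, batches of 50, 1-minute windows) come from this article; the rate of 1,000 requests per second is a hypothetical stand-in for "thousands of requests per second".

```python
import math


def batches_per_window(
    requests_per_second: float, window_seconds: int = 60, batch_size: int = 50
) -> int:
    """Number of batches needed to cover every log line produced in one window."""
    return math.ceil(requests_per_second * window_seconds / batch_size)


poc_lines_per_run = round(809 / 23, 1)           # average batch size in the PoC
production_batches = batches_per_window(1000)    # hypothetical production load
seconds_per_batch = 60 / production_batches      # time budget on a single GPU

print(poc_lines_per_run, production_batches, seconds_per_batch)
```

Under these assumptions, one model instance would have to finish each 50-line batch in a small fraction of a second to keep up, which is why sampling, pre-filtering, or more hardware becomes necessary at that scale.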


9. Conclusion and Next Steps

This experiment demonstrated, in practice, that AI opens up new horizons for application security, not only in runtime detection but at any stage of the SDLC. A lightweight local model, guided by a prompt grounded in the OWASP API Top 10, was able to detect not only classic attacks (password spray, enumeration, reconnaissance) but also semantic attacks like IDOR, which are the historical blind spot of signature-based tools. For a PoC running on consumer hardware, it is a significant result.

That said, the same PoC exposed the frontiers of the approach: false positives, occasional hallucinations, and non-trivial computational cost. Therefore, the most important conclusion is also the most sober one: the "Human-in-the-loop" factor is indispensable. AI should act as a force multiplier for the analyst — triaging, prioritizing, and contextualizing — and not as an autonomous judge. Human review remains the quality control that separates a useful alert from expensive noise.

Next Steps: evolving the PoC with Vector DB / RAG

The most structural limitation of the current design is that it analyzes isolated windows, without memory. Each run sees only the last minute of logs — and this amnesia is a conceptual vulnerability.

The next big leap is to integrate a Vector Database (Vector DB) in a RAG (Retrieval-Augmented Generation) architecture. The idea: store embeddings of the history of events and behaviors, so that each new analysis can retrieve relevant past context before deciding.

This would unlock the detection of a class of attacks that is impossible to capture today by analyzing isolated windows:

  • "Low and Slow" attacks: the attacker makes one malicious request today, another tomorrow, deliberately spacing out the actions to stay below rate limiting thresholds and correlation windows. Without memory, each request seems harmless; with semantic memory, the pattern distributed over time emerges.
  • Advanced Persistent Threats (APT): long, multi-stage campaigns, where reconnaissance, escalation, and exfiltration happen over days or weeks. Only a correlatable long-term memory allows connecting the dots.

With a Vector DB, the agent would stop being an analyst with amnesia every minute and become an analyst with long-term memory — able to ask "have I seen this IP/token/behavior before?" and to reconstruct the narrative of an attack that unfolds slowly. This is the natural path to transforming this PoC into something that approaches a truly contextual detection tool.
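The retrieval step of such a design can be illustrated with a toy in-memory store. To stay self-contained, this sketch swaps the embedding model for simple token counts and the Vector DB for a Python list, so it shows only the "retrieve similar past events" idea, not a real RAG stack. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(text.lower().replace("/", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EventMemory:
    """Stores summaries of past findings and retrieves the most similar ones."""

    def __init__(self) -> None:
        self._items = []

    def add(self, summary: str) -> None:
        self._items.append((summary, embed(summary)))

    def search(self, query: str, top_k: int = 3) -> list:
        vector = embed(query)
        scored = sorted(
            ((cosine(vector, stored), text) for text, stored in self._items),
            reverse=True,
        )
        return [(round(score, 2), text) for score, text in scored[:top_k] if score > 0]


memory = EventMemory()
memory.add("ip 10.0.0.5 PATCH /users/abc/promote-admin denied")
memory.add("ip 10.0.0.9 POST /auth/login success")
print(memory.search("ip 10.0.0.5 PATCH /users/xyz/promote-admin"))
```

In a full design, the retrieved summaries would be appended to the prompt of the current window so the model can weigh today's request against what the same IP or token did earlier.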


Legal disclaimer: all attack scripts in this experiment are intended exclusively for controlled lab environments and educational purposes. Never use them against systems without express authorization.

--

References

CrewAI Docs

OWASP Top 10 API

Ollama

Prompt Engineering Guide by Google

--

I hope this article serves at least to spark curiosity about how we can advance the monitoring of our applications.
Stay well and see you next time 🙂
