Originally published on CoreProse KB-incidents
Most engineering teams are still optimizing RAG stacks while AI quietly becomes core infrastructure. OpenAI’s APIs process over 15 billion tokens per minute, with enterprise customers already contributing more than 40% of revenue [5]. AWS reports roughly $15B in annualized AI revenue, a sign that workloads are moving from pilots into production backbones [5].
Frontier labs now demo models that find and reproduce real-world software vulnerabilities and managed agents that run as persistent workflows, not one-off prompts [5]. The same advances amplify systemic risks: weaponization, mass cyberattacks, disinformation, and lightly supervised autonomous systems [4].
💡 Goal of this article
A roadmap for engineers who already know RAG and fine-tuning, and want to explore where the stack is heading: AI monitoring AI, cyber reasoning, edge autonomy, and long-running agents—plus architectures you can prototype without wrecking SLOs or security.
1. Why Unconventional AI Use Cases Are Emerging Now
Enterprise AI is shifting from UX surface (chatbots, copilots) to infrastructure. OpenAI’s token volumes and AWS’s AI run rate suggest the main value now flows through backend APIs and embedded agents, not chat UIs [5].
At this scale, the problem becomes running an “AI fabric”: many models, tools, and pipelines wired into live data and production traffic. Examples [5]:
- Models that locate and reproduce vulnerabilities in complex stacks
- Persistent environments and reusable workflows that execute continuously instead of per-prompt
📊 AI as dual-use infrastructure
Advanced LLMs are general-purpose, dual-use tech [4]:
- High risk: mass cyberattacks, AI-augmented disinformation, autonomous robotics, supply-chain subversion
- High value: decision support, simulation, and complex planning
NIST’s Cyber AI Profile splits the space into [3]:
- Cybersecurity of AI systems
- AI-enabled cyberattacks
- AI-enabled cyber defense
Unconventional use cases often straddle these, e.g.:
- Autonomous red-teaming agents attacking your own stack
- Tools that monitor and protect AI pipelines themselves [3]
⚠️ Implication for engineers
If AI is now persistent infrastructure, you must engineer:
- Autonomy and long-horizon reasoning
- Safe environment interaction and tool use
- Operational controls for cost, latency, and safety
The rest of the article shows how.
2. AI Monitoring AI: Agentic Ops and Self-Observability
ThousandEyes’ Agentic Ops is an early pattern of “AI watching AI.” Using Model Context Protocol (MCP), they expose telemetry and topology to AI agents that reason about end-to-end paths—from browser and DNS/TLS, through LLM APIs (OpenAI, Anthropic), into vector databases like Pinecone [1].
Each hop is a failure domain: DNS, TLS, network, embeddings, vector search, model completion [1]. In AI-heavy products, silent degradation (stale embeddings, API deprecations) becomes a business risk, not just a bug [1].
💡 Anecdote: “ghost latency”
A SaaS SRE chased a 300–400 ms latency bump for two weeks. Root cause: an unmonitored regional routing change between their VPC and one embedding endpoint. An LLM observability agent over MCP-style telemetry could have correlated hop metrics and model changes into a plausible hypothesis in minutes [1].
Experimental watchdog architecture
A minimal LLM-powered watchdog:
- Data layer
  - Metrics/traces: Prometheus, OpenTelemetry
  - MCP-like adapters exposing typed queries over telemetry and topology [1]
- Agent core
  - LLM with function calling and a tight toolset [7]
  - Tools: `get_timeseries(metric, scope)`, `get_traces(query)`, `get_model_changes(service)`
- Loop:

  ```python
  while True:
      events = poll_alert_stream()
      context = fetch_recent_telemetry(events)
      plan = llm.plan_diagnosis(context)
      actions = execute_tools(plan.tools)
      hypothesis = llm.summarize(actions, format="structured_incident")
      emit_incident(hypothesis)
  ```

- Outputs
  - Structured incident hypotheses
  - Suggested runbook steps and business-impact narratives per stakeholder [1]
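The typed tools in this architecture can be sketched as thin validating wrappers. This is a minimal illustration, not a real MCP binding: the metric names, scopes, and limits below are hypothetical, and a real adapter would forward the validated query to Prometheus or an OpenTelemetry backend.

```python
from dataclasses import dataclass
from typing import Literal

# Whitelisted metrics the agent may query; anything else is rejected.
# These names are illustrative, not a real telemetry schema.
ALLOWED_METRICS = {"request_latency_ms", "embedding_staleness_s", "vector_recall"}

@dataclass(frozen=True)
class TimeseriesQuery:
    metric: str
    scope: Literal["service", "region", "endpoint"]
    window_minutes: int = 15

def get_timeseries(query: TimeseriesQuery) -> dict:
    """Typed wrapper: validate before touching real telemetry."""
    if query.metric not in ALLOWED_METRICS:
        raise ValueError(f"metric not allowed: {query.metric}")
    if not 1 <= query.window_minutes <= 60:
        raise ValueError("window must be 1-60 minutes")
    # A real adapter would query the telemetry backend here;
    # a stub payload keeps the contract visible.
    return {"metric": query.metric, "scope": query.scope, "points": []}
```

The point of the wrapper is that the LLM never composes raw queries: every argument passes through a schema the platform team controls.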
⚠️ Production concerns
- Diagnostic SLOs: targets for first hypothesis vs. full RCA
- Token costs: cap message length, frequency, and tools so loops don’t quietly burn budget [1]
- Evaluation: replay past incidents and compare diagnoses to ground truth to track precision/recall [5]
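The token-cost cap above can be sketched as a rolling-window budget that the agent loop consults before each LLM call. The window size and limits are illustrative tuning knobs; a production version would meter the provider's reported usage rather than an estimate.

```python
import time

class TokenBudget:
    """Hard cap on tokens an agent loop may spend per rolling window."""

    def __init__(self, max_tokens: int, window_s: float = 3600.0):
        self.max_tokens = max_tokens
        self.window_s = window_s
        self.spent: list[tuple[float, int]] = []  # (timestamp, tokens)

    def charge(self, tokens: int) -> bool:
        now = time.monotonic()
        # Drop spend that has aged out of the window.
        self.spent = [(t, n) for t, n in self.spent if now - t < self.window_s]
        if sum(n for _, n in self.spent) + tokens > self.max_tokens:
            return False  # caller should skip, defer, or downscope the call
        self.spent.append((now, tokens))
        return True

budget = TokenBudget(max_tokens=50_000)
assert budget.charge(40_000)       # within the hourly cap
assert not budget.charge(20_000)   # would exceed it, so the call is refused
```

Returning `False` instead of raising lets the loop degrade gracefully, for example by emitting a cheaper heuristic hypothesis instead of a full LLM diagnosis.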
Agentic monitoring adapts to LLM-specific failure modes (embedding drift, retrieval degradation), but still needs guardrails and human confirmation for impactful actions [1][9].
3. Offense-Grade Reasoning: Cybersecurity as an Experimental Playground
Anthropic’s Claude Mythos is a model with such strong cyber capabilities that access is restricted to vetted partners via Project Glasswing [2]. It can find and reproduce real-world vulnerabilities, including older but still-live exploits—powerful for secure development and potentially for abuse [2][5].
Security teams stress asymmetry: defenders must secure everything; attackers need one gap [2]. AI-accelerated vulnerability discovery enables:
- Systematic scanning of large monorepos and microservice fleets
- IAM and network-policy misconfiguration hunting
- CI/CD and supply-chain dependency analysis
📊 Defensive and offensive impact
Evidence from NIST, OWASP, MITRE, and vendors shows AI helps defense when tied to concrete tasks [3]:
- Faster detection and triage
- Deeper investigations and attack-path simulation
- Automated validation of controls
The same capabilities support:
- Automated phishing and identity abuse
- AI-guided lateral movement and privilege escalation [3][4]
A controlled red-team agent pipeline
A realistic but contained setup:
- Environment
  - Isolated lab: separate cloud account, sample apps, synthetic users and secrets [3]
- Agent loop
  - Recon: DNS/IP enumeration, public Git scraping, config discovery
  - Exploit generation: static analysis tools, CVE databases, LLM-synthesized PoCs
  - Escalation and lateral movement: graph-based planning over identities and assets
- Control plane
  - Sandboxed execution for payloads
  - Policy engine enforcing “lab-only” rules; full action logging [9]
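The “lab-only” policy engine can be sketched as a deny-by-default check in front of every agent action, with an append-only audit trail. The lab CIDR and action names below are invented for illustration.

```python
import ipaddress
import json
import time

# Hypothetical lab-only range: the agent may touch nothing outside it.
LAB_NETWORK = ipaddress.ip_network("10.99.0.0/16")
AUDIT_LOG: list[str] = []

def enforce_and_log(action: str, target_ip: str) -> bool:
    """Deny anything outside the lab network; log every decision either way."""
    allowed = ipaddress.ip_address(target_ip) in LAB_NETWORK
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),
        "action": action,
        "target": target_ip,
        "allowed": allowed,
    }))
    return allowed

assert enforce_and_log("port_scan", "10.99.3.7")        # inside the lab
assert not enforce_and_log("port_scan", "203.0.113.9")  # blocked: public IP
```

Logging denied attempts is as important as blocking them: a red-team agent that keeps probing outside its sandbox is itself a finding.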
💼 Governance first
Risk surveys flag AI-enabled mass cyberattacks and capability overhang as realistic concerns [4]. For cyber agents, prioritize:
- Strong access control and approvals
- Formal risk assessments before production use
- Evaluation of both defensive benefits and offensive potential [3][9]
Treat these agents like live explosives, not generic devtools.
4. Edge and Physical-World Experiments: Beyond the Data Center
Edge AI unlocks capabilities that never appear in chat UIs. In outdoor power tools (professional chain-saws, grass-cutters), embedded models enabled [6]:
- Self-calibration and adaptive sensing
- Selective data capture
- Usage-based reputation and maintenance tracking
These behave as dynamic capabilities: devices adapt sensing, maintenance schedules, and user feedback in real time [6].
💡 Hybrid edge–cloud agent architectures
Split responsibilities:
- On-device models [6]:
  - Low-latency perception and control (vibration, motor current, pose)
- Cloud LLM/agents [7]:
  - Planning, coordination, cross-fleet analysis
Example patterns:
- Self-calibrating sensor fleets that negotiate sampling rates and firmware updates via a coordinating agent watching drift and anomalies [6].
- Robotic tools streaming degradation traces to a central vector store for predictive maintenance and design feedback [6][8].
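The sampling-rate negotiation in the first pattern can be sketched as a simple control rule the coordinating agent applies per device. The thresholds and rate bounds are illustrative assumptions, not values from the cited deployments.

```python
def next_sampling_hz(current_hz: float, drift_score: float,
                     min_hz: float = 1.0, max_hz: float = 200.0) -> float:
    """Raise sampling when the drift/anomaly score is high, relax when quiet.

    drift_score is assumed normalized to [0, 1]; the 0.7/0.2 thresholds
    are hypothetical tuning knobs.
    """
    if drift_score > 0.7:      # anomalous: sample densely to catch the event
        proposed = current_hz * 2
    elif drift_score < 0.2:    # quiet: back off to save power and bandwidth
        proposed = current_hz / 2
    else:
        proposed = current_hz
    return max(min_hz, min(max_hz, proposed))

assert next_sampling_hz(50.0, 0.9) == 100.0  # doubled under drift
assert next_sampling_hz(50.0, 0.1) == 25.0   # halved when quiet
```

The coordinator runs this per device and ships only the resulting setpoint downstream, so the negotiation costs a few bytes, not a telemetry firehose.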
⚡ From devices to systems
Similar architectures power:
- Supply chains: agents track stock, lead times, transit, and propose or execute reorders [8].
- Energy grids: agents ingest sensor data, simulate interventions, and call control APIs to rebalance load or reconfigure topology [8].
⚠️ Engineering constraints
Key issues for ML and systems engineers:
- Model selection that fits edge hardware while meeting accuracy targets [6]
- Quantization/distillation to hit latency and power budgets
- Sync strategies that update edge knowledge without crushing bandwidth or privacy [6]
- Robust fallbacks for partial connectivity and noisy sensors
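The fallback requirement in the last bullet usually means a cloud-first call path that degrades to an on-device heuristic. A minimal sketch, with both models stubbed out (the thresholds and the simulated failure rate are invented for illustration):

```python
import random

def classify_local(reading: float) -> str:
    """Tiny on-device heuristic: coarse but always available."""
    return "fault" if reading > 0.8 else "ok"

def classify_cloud(reading: float) -> str:
    """Stand-in for a richer cloud model call; may fail on bad connectivity."""
    if random.random() < 0.3:  # simulated uplink failure
        raise ConnectionError("uplink unavailable")
    return "fault" if reading > 0.75 else "ok"

def classify(reading: float) -> tuple[str, str]:
    """Prefer the cloud model, degrade to the edge heuristic on failure."""
    try:
        return classify_cloud(reading), "cloud"
    except ConnectionError:
        return classify_local(reading), "edge-fallback"

label, source = classify(0.9)
assert label == "fault"  # both paths agree on a clear fault
assert source in ("cloud", "edge-fallback")
```

Tagging each result with its source matters downstream: fleet analytics should weight coarse edge-fallback labels differently from cloud-model labels.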
Done well, AI moves from cloud endpoints into the physical fabric of operations.
5. Long-Running Agentic Systems and Secure Experimentation Patterns
Agentic AI systems have autonomy, decision-making, tool use, and environment interaction—beyond narrow ML and one-shot LLM calls [7]. They:
- Plan, act, and reflect across web, software, and physical environments
- Run goal-directed loops instead of single prompts [7]
Analyses highlight high-stakes, long-horizon workflows as promising domains [8][5]:
- Healthcare diagnostics and trial operations
- Supply-chain and logistics optimization
- Fraud detection and complex investigations
Early deployments see higher task completion when agents can plan and revise instead of returning one answer [8][5].
📊 New security threat surface
Recent surveys of agentic AI security identify threats unlike classic bugs [9]:
- Tool abuse from over-permissive capabilities
- Jailbreaks via environment manipulation and indirect prompt injection
- Autonomous propagation and self-replication across systems [9]
These require threat models that explicitly cover:
- Agent loops and planning
- Tooling layers and credentials
- Memory, logs, and learned behaviors [9][3]
Safe tool gateways and experimentation blueprint
MCP-style layers like the one ThousandEyes uses illustrate how to expose tools to agents safely [1]:
- Typed, constrained functions (telemetry queries, controlled tests, topology views)
- No raw shell or arbitrary network access
Such context protocols can [1][9]:
- Scope what data agents can access
- Enforce schemas and limits on function arguments
- Log every invocation for audit, replay, and evaluation
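These three properties (scoped tools, validated arguments, logged invocations) fit in one small gateway object. A minimal sketch, not a real MCP implementation; the example tool and its schema are hypothetical.

```python
import json

class ToolGateway:
    """Only registered tools, only validated arguments, every call logged."""

    def __init__(self):
        self._tools = {}    # name -> (fn, validator)
        self.call_log = []  # structured records for audit and replay

    def register(self, name, fn, validator):
        self._tools[name] = (fn, validator)

    def invoke(self, name, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool not exposed: {name}")
        fn, validator = self._tools[name]
        validator(kwargs)  # raises on out-of-policy arguments
        result = fn(**kwargs)
        self.call_log.append(json.dumps({"tool": name, "args": kwargs}))
        return result

def topology_validator(args: dict) -> None:
    if not isinstance(args.get("service"), str):
        raise ValueError("service (str) is required")

gw = ToolGateway()
gw.register(
    "get_topology",
    fn=lambda service: {"service": service, "hops": ["dns", "lb", "api"]},
    validator=topology_validator,
)
assert gw.invoke("get_topology", service="checkout")["hops"][0] == "dns"
assert len(gw.call_log) == 1
```

Because the log captures tool name and arguments as structured JSON, past agent sessions can be replayed against new policies during evaluation.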
💼 Practical blueprint
A sane path for most orgs:
- Start with sandboxed, read-only agents that only observe and recommend [9].
- Instrument all tool calls and decisions with structured logs and correlation IDs.
- Insert human-in-the-loop checkpoints for irreversible or high-blast-radius actions.
- Continuously evaluate agents with red-team scenarios and security taxonomies from emerging research [3][9].
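The human-in-the-loop checkpoint in step 3 can be sketched as an approval gate keyed on an explicit list of irreversible actions. The action names and callback shape are illustrative assumptions.

```python
# Hypothetical set of high-blast-radius actions that always need sign-off.
IRREVERSIBLE = {"rotate_credentials", "restart_fleet", "delete_index"}

def run_action(action: str, approve) -> str:
    """Execute read-only actions directly; gate destructive ones on approval.

    `approve` is a callback (e.g. a paging/ticketing hook) returning True
    only after a human signs off.
    """
    if action in IRREVERSIBLE and not approve(action):
        return "blocked: awaiting human approval"
    return f"executed: {action}"

# Read-only actions pass; destructive ones require sign-off.
assert run_action("get_metrics", approve=lambda a: False) == "executed: get_metrics"
assert run_action("delete_index", approve=lambda a: False).startswith("blocked")
assert run_action("delete_index", approve=lambda a: True) == "executed: delete_index"
```

Keeping the irreversible-action list as data rather than code means security review can tighten it without touching the agent loop.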
Conclusion
AI’s next wave will look less like chatbots and more like distributed, safety-critical infrastructure: AI monitoring AI, offense-grade reasoning in tightly controlled labs, edge autonomy in physical systems, and long-running agents embedded in core operations.
For engineers, the opportunity is to prototype these architectures now—using tight tool scopes, strong observability, and explicit security models—so your stack evolves with the capabilities instead of being blindsided by them.
About CoreProse: Research-first AI content generation with verified citations. Zero hallucinations.