Originally published on CoreProse KB-incidents
Most engineering teams are still optimizing RAG stacks while AI quietly becomes core infrastructure. OpenAI’s APIs process over 15 billion tokens per minute, with enterprise customers already contributing more than 40% of revenue [5]. AWS reports roughly $15B in annualized AI revenue, a sign that workloads are moving from pilots into production backbones [5].
Frontier labs now demo models that find and reproduce real-world software vulnerabilities and managed agents that run as persistent workflows, not one-off prompts [5]. The same advances amplify systemic risks: weaponization, mass cyberattacks, disinformation, and lightly supervised autonomous systems [4].
💡 Goal of this article
A roadmap for engineers who already know RAG and fine-tuning, and want to explore where the stack is heading: AI monitoring AI, cyber reasoning, edge autonomy, and long-running agents—plus architectures you can prototype without wrecking SLOs or security.
1. Why Unconventional AI Use Cases Are Emerging Now
Enterprise AI is shifting from UX surface (chatbots, copilots) to infrastructure. OpenAI’s token volumes and AWS’s AI run rate suggest the main value now flows through backend APIs and embedded agents, not chat UIs [5].
At this scale, the problem becomes running an “AI fabric”: many models, tools, and pipelines wired into live data and production traffic. Examples [5]:
- Models that locate and reproduce vulnerabilities in complex stacks
- Persistent environments and reusable workflows that execute continuously instead of per-prompt
📊 AI as dual-use infrastructure
Advanced LLMs are general-purpose, dual-use tech [4]:
- High risk: mass cyberattacks, AI-augmented disinformation, autonomous robotics, supply-chain subversion
- High value: decision support, simulation, and complex planning
NIST’s Cyber AI Profile splits the space into [3]:
- Cybersecurity of AI systems
- AI-enabled cyberattacks
- AI-enabled cyber defense
Unconventional use cases often straddle these, e.g.:
- Autonomous red-teaming agents attacking your own stack
- Tools that monitor and protect AI pipelines themselves [3]
⚠️ Implication for engineers
If AI is now persistent infrastructure, you must engineer:
- Autonomy and long-horizon reasoning
- Safe environment interaction and tool use
- Operational controls for cost, latency, and safety
The rest of the article shows how.
2. AI Monitoring AI: Agentic Ops and Self-Observability
ThousandEyes’ Agentic Ops is an early pattern of “AI watching AI.” Using Model Context Protocol (MCP), they expose telemetry and topology to AI agents that reason about end-to-end paths—from browser and DNS/TLS, through LLM APIs (OpenAI, Anthropic), into vector databases like Pinecone [1].
Each hop is a failure domain: DNS, TLS, network, embeddings, vector search, model completion [1]. In AI-heavy products, silent degradation (stale embeddings, API deprecations) becomes a business risk, not just a bug [1].
💡 Anecdote: “ghost latency”
A SaaS SRE chased a 300–400 ms latency bump for two weeks. Root cause: an unmonitored regional routing change between their VPC and one embedding endpoint. An LLM observability agent over MCP-style telemetry could have correlated hop metrics and model changes into a plausible hypothesis in minutes [1].
Experimental watchdog architecture
A minimal LLM-powered watchdog:
- Data layer
  - Metrics/traces: Prometheus, OpenTelemetry
  - MCP-like adapters exposing typed queries over telemetry and topology [1]
- Agent core
  - LLM with function calling and a tight toolset [7]
  - Tools: `get_timeseries(metric, scope)`, `get_traces(query)`, `get_model_changes(service)`
- Loop:

  ```python
  while True:
      events = poll_alert_stream()
      context = fetch_recent_telemetry(events)
      plan = llm.plan_diagnosis(context)
      actions = execute_tools(plan.tools)
      hypothesis = llm.summarize(actions, format="structured_incident")
      emit_incident(hypothesis)
  ```

- Outputs
  - Structured incident hypotheses
  - Suggested runbook steps and business-impact narratives per stakeholder [1]
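The typed tools in this architecture can be sketched as thin validating wrappers. This is a minimal illustration, not a real MCP binding: the metric names, scopes, and limits below are hypothetical, and a real adapter would forward the validated query to Prometheus or an OpenTelemetry backend.

```python
from dataclasses import dataclass
from typing import Literal

# Whitelisted metrics the agent may query; anything else is rejected.
# These names are illustrative, not a real telemetry schema.
ALLOWED_METRICS = {"request_latency_ms", "embedding_staleness_s", "vector_recall"}

@dataclass(frozen=True)
class TimeseriesQuery:
    metric: str
    scope: Literal["service", "region", "endpoint"]
    window_minutes: int = 15

def get_timeseries(query: TimeseriesQuery) -> dict:
    """Typed wrapper: validate before touching real telemetry."""
    if query.metric not in ALLOWED_METRICS:
        raise ValueError(f"metric not allowed: {query.metric}")
    if not 1 <= query.window_minutes <= 60:
        raise ValueError("window must be 1-60 minutes")
    # A real adapter would query the telemetry backend here;
    # a stub payload keeps the contract visible.
    return {"metric": query.metric, "scope": query.scope, "points": []}
```

The point of the wrapper is that the LLM never composes raw queries: every argument passes through a schema the platform team controls.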
⚠️ Production concerns
- Diagnostic SLOs: targets for first hypothesis vs. full RCA
- Token costs: cap message length, frequency, and tools so loops don’t quietly burn budget [1]
- Evaluation: replay past incidents and compare diagnoses to ground truth to track precision/recall [5]
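The token-cost cap above can be sketched as a rolling-window budget that the agent loop consults before each LLM call. The window size and limits are illustrative tuning knobs; a production version would meter the provider's reported usage rather than an estimate.

```python
import time

class TokenBudget:
    """Hard cap on tokens an agent loop may spend per rolling window."""

    def __init__(self, max_tokens: int, window_s: float = 3600.0):
        self.max_tokens = max_tokens
        self.window_s = window_s
        self.spent: list[tuple[float, int]] = []  # (timestamp, tokens)

    def charge(self, tokens: int) -> bool:
        now = time.monotonic()
        # Drop spend that has aged out of the window.
        self.spent = [(t, n) for t, n in self.spent if now - t < self.window_s]
        if sum(n for _, n in self.spent) + tokens > self.max_tokens:
            return False  # caller should skip, defer, or downscope the call
        self.spent.append((now, tokens))
        return True

budget = TokenBudget(max_tokens=50_000)
assert budget.charge(40_000)       # within the hourly cap
assert not budget.charge(20_000)   # would exceed it, so the call is refused
```

Returning `False` instead of raising lets the loop degrade gracefully, for example by emitting a cheaper heuristic hypothesis instead of a full LLM diagnosis.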
Agentic monitoring adapts to LLM-specific failure modes (embedding drift, retrieval degradation), but still needs guardrails and human confirmation for impactful actions [1][9].
3. Offense-Grade Reasoning: Cybersecurity as an Experimental Playground
Anthropic’s Claude Mythos is a model with such strong cyber capabilities that access is restricted to vetted partners via Project Glasswing [2]. It can find and reproduce real-world vulnerabilities, including older but still-live exploits—powerful for secure development and potentially for abuse [2][5].
Security teams stress asymmetry: defenders must secure everything; attackers need one gap [2]. AI-accelerated vulnerability discovery enables:
- Systematic scanning of large monorepos and microservice fleets
- IAM and network-policy misconfiguration hunting
- CI/CD and supply-chain dependency analysis
📊 Defensive and offensive impact
Evidence from NIST, OWASP, MITRE, and vendors shows AI helps defense when tied to concrete tasks [3]:
- Faster detection and triage
- Deeper investigations and attack-path simulation
- Automated validation of controls
The same capabilities support:
- Automated phishing and identity abuse
- AI-guided lateral movement and privilege escalation [3][4]
A controlled red-team agent pipeline
A realistic but contained setup:
- Environment
  - Isolated lab: separate cloud account, sample apps, synthetic users and secrets [3]
- Agent loop
  - Recon: DNS/IP enumeration, public Git scraping, config discovery
  - Exploit generation: static analysis tools, CVE databases, LLM-synthesized PoCs
  - Escalation and lateral movement: graph-based planning over identities and assets
- Control plane
  - Sandboxed execution for payloads
  - Policy engine enforcing “lab-only” rules; full action logging [9]
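The “lab-only” policy engine can be sketched as a deny-by-default check in front of every agent action, with an append-only audit trail. The lab CIDR and action names below are invented for illustration.

```python
import ipaddress
import json
import time

# Hypothetical lab-only range: the agent may touch nothing outside it.
LAB_NETWORK = ipaddress.ip_network("10.99.0.0/16")
AUDIT_LOG: list[str] = []

def enforce_and_log(action: str, target_ip: str) -> bool:
    """Deny anything outside the lab network; log every decision either way."""
    allowed = ipaddress.ip_address(target_ip) in LAB_NETWORK
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(),
        "action": action,
        "target": target_ip,
        "allowed": allowed,
    }))
    return allowed

assert enforce_and_log("port_scan", "10.99.3.7")        # inside the lab
assert not enforce_and_log("port_scan", "203.0.113.9")  # blocked: public IP
```

Logging denied attempts is as important as blocking them: a red-team agent that keeps probing outside its sandbox is itself a finding.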
💼 Governance first
Risk surveys flag AI-enabled mass cyberattacks and capability overhang as realistic concerns [4]. For cyber agents, prioritize:
- Strong access control and approvals
- Formal risk assessments before production use
- Evaluation of both defensive benefits and offensive potential [3][9]
Treat these agents like live explosives, not generic devtools.
4. Edge and Physical-World Experiments: Beyond the Data Center
Edge AI unlocks capabilities that never appear in chat UIs. In outdoor power tools (professional chain-saws, grass-cutters), embedded models enabled [6]:
- Self-calibration and adaptive sensing
- Selective data capture
- Usage-based reputation and maintenance tracking
These behave as dynamic capabilities: devices adapt sensing, maintenance schedules, and user feedback in real time [6].
💡 Hybrid edge–cloud agent architectures
Split responsibilities:
- On-device models [6]:
  - Low-latency perception and control (vibration, motor current, pose)
- Cloud LLM/agents [7]:
  - Planning, coordination, cross-fleet analysis
Example patterns:
- Self-calibrating sensor fleets that negotiate sampling rates and firmware updates via a coordinating agent watching drift and anomalies [6].
- Robotic tools streaming degradation traces to a central vector store for predictive maintenance and design feedback [6][8].
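The sampling-rate negotiation in the first pattern can be sketched as a simple control rule the coordinating agent applies per device. The thresholds and rate bounds are illustrative assumptions, not values from the cited deployments.

```python
def next_sampling_hz(current_hz: float, drift_score: float,
                     min_hz: float = 1.0, max_hz: float = 200.0) -> float:
    """Raise sampling when the drift/anomaly score is high, relax when quiet.

    drift_score is assumed normalized to [0, 1]; the 0.7/0.2 thresholds
    are hypothetical tuning knobs.
    """
    if drift_score > 0.7:      # anomalous: sample densely to catch the event
        proposed = current_hz * 2
    elif drift_score < 0.2:    # quiet: back off to save power and bandwidth
        proposed = current_hz / 2
    else:
        proposed = current_hz
    return max(min_hz, min(max_hz, proposed))

assert next_sampling_hz(50.0, 0.9) == 100.0  # doubled under drift
assert next_sampling_hz(50.0, 0.1) == 25.0   # halved when quiet
```

The coordinator runs this per device and ships only the resulting setpoint downstream, so the negotiation costs a few bytes, not a telemetry firehose.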
⚡ From devices to systems
Similar architectures power:
- Supply chains: agents track stock, lead times, transit, and propose or execute reorders [8].
- Energy grids: agents ingest sensor data, simulate interventions, and call control APIs to rebalance load or reconfigure topology [8].
⚠️ Engineering constraints
Key issues for ML and systems engineers:
- Model selection that fits edge hardware while meeting accuracy targets [6]
- Quantization/distillation to hit latency and power budgets
- Sync strategies that update edge knowledge without crushing bandwidth or privacy [6]
- Robust fallbacks for partial connectivity and noisy sensors
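The fallback requirement in the last bullet usually means a cloud-first call path that degrades to an on-device heuristic. A minimal sketch, with both models stubbed out (the thresholds and the simulated failure rate are invented for illustration):

```python
import random

def classify_local(reading: float) -> str:
    """Tiny on-device heuristic: coarse but always available."""
    return "fault" if reading > 0.8 else "ok"

def classify_cloud(reading: float) -> str:
    """Stand-in for a richer cloud model call; may fail on bad connectivity."""
    if random.random() < 0.3:  # simulated uplink failure
        raise ConnectionError("uplink unavailable")
    return "fault" if reading > 0.75 else "ok"

def classify(reading: float) -> tuple[str, str]:
    """Prefer the cloud model, degrade to the edge heuristic on failure."""
    try:
        return classify_cloud(reading), "cloud"
    except ConnectionError:
        return classify_local(reading), "edge-fallback"

label, source = classify(0.9)
assert label == "fault"  # both paths agree on a clear fault
assert source in ("cloud", "edge-fallback")
```

Tagging each result with its source matters downstream: fleet analytics should weight coarse edge-fallback labels differently from cloud-model labels.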
Done well, AI moves from cloud endpoints into the physical fabric of operations.
5. Long-Running Agentic Systems and Secure Experimentation Patterns
Agentic AI systems have autonomy, decision-making, tool use, and environment interaction—beyond narrow ML and one-shot LLM calls [7]. They:
- Plan, act, and reflect across web, software, and physical environments
- Run goal-directed loops instead of single prompts [7]
Analyses highlight high-stakes, long-horizon workflows as promising domains [8][5]:
- Healthcare diagnostics and trial operations
- Supply-chain and logistics optimization
- Fraud detection and complex investigations
Early deployments see higher task completion when agents can plan and revise instead of returning one answer [8][5].
📊 New security threat surface
Recent surveys of agentic AI security identify threats unlike classic bugs [9]:
- Tool abuse from over-permissive capabilities
- Jailbreaks via environment manipulation and indirect prompt injection
- Autonomous propagation and self-replication across systems [9]
These require threat models that explicitly cover:
- Agent loops and planning
- Tooling layers and credentials
- Memory, logs, and learned behaviors [9][3]
Safe tool gateways and experimentation blueprint
MCP-style layers like the one ThousandEyes uses illustrate how to expose tools to agents safely [1]:
- Typed, constrained functions (telemetry queries, controlled tests, topology views)
- No raw shell or arbitrary network access
Such context protocols can [1][9]:
- Scope what data agents can access
- Enforce schemas and limits on function arguments
- Log every invocation for audit, replay, and evaluation
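These three properties (scoped tools, validated arguments, logged invocations) fit in one small gateway object. A minimal sketch, not a real MCP implementation; the example tool and its schema are hypothetical.

```python
import json

class ToolGateway:
    """Only registered tools, only validated arguments, every call logged."""

    def __init__(self):
        self._tools = {}    # name -> (fn, validator)
        self.call_log = []  # structured records for audit and replay

    def register(self, name, fn, validator):
        self._tools[name] = (fn, validator)

    def invoke(self, name, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool not exposed: {name}")
        fn, validator = self._tools[name]
        validator(kwargs)  # raises on out-of-policy arguments
        result = fn(**kwargs)
        self.call_log.append(json.dumps({"tool": name, "args": kwargs}))
        return result

def topology_validator(args: dict) -> None:
    if not isinstance(args.get("service"), str):
        raise ValueError("service (str) is required")

gw = ToolGateway()
gw.register(
    "get_topology",
    fn=lambda service: {"service": service, "hops": ["dns", "lb", "api"]},
    validator=topology_validator,
)
assert gw.invoke("get_topology", service="checkout")["hops"][0] == "dns"
assert len(gw.call_log) == 1
```

Because the log captures tool name and arguments as structured JSON, past agent sessions can be replayed against new policies during evaluation.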
💼 Practical blueprint
A sane path for most orgs:
- Start with sandboxed, read-only agents that only observe and recommend [9].
- Instrument all tool calls and decisions with structured logs and correlation IDs.
- Insert human-in-the-loop checkpoints for irreversible or high-blast-radius actions.
- Continuously evaluate agents with red-team scenarios and security taxonomies from emerging research [3][9].
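The human-in-the-loop checkpoint in step 3 can be sketched as an approval gate keyed on an explicit list of irreversible actions. The action names and callback shape are illustrative assumptions.

```python
# Hypothetical set of high-blast-radius actions that always need sign-off.
IRREVERSIBLE = {"rotate_credentials", "restart_fleet", "delete_index"}

def run_action(action: str, approve) -> str:
    """Execute read-only actions directly; gate destructive ones on approval.

    `approve` is a callback (e.g. a paging/ticketing hook) returning True
    only after a human signs off.
    """
    if action in IRREVERSIBLE and not approve(action):
        return "blocked: awaiting human approval"
    return f"executed: {action}"

# Read-only actions pass; destructive ones require sign-off.
assert run_action("get_metrics", approve=lambda a: False) == "executed: get_metrics"
assert run_action("delete_index", approve=lambda a: False).startswith("blocked")
assert run_action("delete_index", approve=lambda a: True) == "executed: delete_index"
```

Keeping the irreversible-action list as data rather than code means security review can tighten it without touching the agent loop.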
Conclusion
AI’s next wave will look less like chatbots and more like distributed, safety-critical infrastructure: AI monitoring AI, offense-grade reasoning in tightly controlled labs, edge autonomy in physical systems, and long-running agents embedded in core operations.
For engineers, the opportunity is to prototype these architectures now—using tight tool scopes, strong observability, and explicit security models—so your stack evolves with the capabilities instead of being blindsided by them.
About CoreProse: Research-first AI content generation with verified citations. Zero hallucinations.