m365-graph-incident-agent: AI-Powered Incident Triage via Microsoft Graph

#hermeschallenge #ai #python #agents

2:17 AM. Your Phone Is Ringing.

The alert fired 12 minutes ago. Someone paged you. You are on call. You open your laptop, still half asleep, and now you have to figure out what is happening before you can even start fixing it.

You open Teams. There are 47 messages in the incident channel from the last hour. You switch to SharePoint to find the runbook. It is three folders deep, last updated six months ago, maybe not current. You check Outlook because someone sent an escalation email that references a ticket you do not have open. All of this, before you have written a single line of remediation.

That is the problem m365-graph-incident-agent addresses. It connects to Microsoft Graph and does the search for you. It reads the Teams channel, finds the runbook, reads the escalation thread, and hands you a structured triage summary. The reading and synthesis that takes a tired engineer 15 minutes takes the agent about 30 seconds.

Shape of the Fix

The agent authenticates with Microsoft Graph, accepts an incident description, and runs a targeted read across Teams, SharePoint, and Outlook in parallel. It returns a triage summary and a full audit log of every API call made.

from m365_incident_agent import IncidentAgent

agent = IncidentAgent(
    tenant_id="your-tenant-id",
    client_id="your-app-client-id",
    client_secret="your-client-secret"
)

result = agent.triage(
    incident="Payment service returning 503 on checkout endpoint",
    teams_channel_id="19:abc123...",
    sharepoint_site="https://company.sharepoint.com/sites/ops",
    lookback_hours=2
)

print(result.summary)
# "503s started at 01:54 UTC. Three engineers in Teams thread.
#  Root cause hypothesis: upstream DB connection pool exhausted.
#  Runbook: /ops/runbooks/payment-service-recovery.md (found, step 3 is relevant).
#  Last escalation email: from Sarah at 02:08, ticket #INC-9182 open."

print(result.audit_log)
# [{"action": "teams.channel.read", "channel": "incident-p1", "messages": 12},
#  {"action": "sharepoint.search", "query": "payment service runbook", "hits": 1},
#  {"action": "outlook.search", "query": "INC checkout 503", "from_hours_ago": 2, "hits": 1}]

The audit_log is not optional. Every Microsoft Graph API call the agent makes is recorded with the query, the result count, and the timestamp (the example output above leaves the timestamps out). You can see exactly what the agent read and what it did not. If the triage summary is wrong, you can trace why.
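When a summary looks wrong, a quick first check is which sources came back empty. The sketch below assumes audit_log is a list of dicts shaped like the example output above. The helper name empty_sources is hypothetical and not part of the library.

```python
def empty_sources(audit_log):
    """Return the actions that returned no results (hypothetical helper)."""
    empty = []
    for entry in audit_log:
        # The example output uses "messages" for Teams reads and "hits" for searches.
        count = entry.get("messages", entry.get("hits", 0))
        if count == 0:
            empty.append(entry["action"])
    return empty


# Illustrative data only: same shape as the example log, with one empty search.
audit_log = [
    {"action": "teams.channel.read", "channel": "incident-p1", "messages": 12},
    {"action": "sharepoint.search", "query": "payment service runbook", "hits": 0},
    {"action": "outlook.search", "query": "INC checkout 503", "hits": 1},
]
print(empty_sources(audit_log))
```

If the SharePoint search shows up here, the summary had no runbook to work from, and that is where to start looking.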

What It Does NOT Do

This agent reads. It does not write. It will not post to Teams, reply to emails, or update any SharePoint documents. Those actions require different Graph permissions and deliberate human approval.

It does not diagnose infrastructure. It does not query your monitoring stack, check Datadog, or pull metrics. It reads the human communication layer: what people said in Teams, what the runbook says, what was in the email thread. You still need your monitoring tools for the actual signal.

It also does not do real-time streaming. It reads the incident channel for the lookback window you specify, typically one to two hours. If your incident is ongoing and new messages are coming in every 30 seconds, you would run the agent periodically rather than expecting a live feed.

Inside the Project

The Graph client wraps the Microsoft Graph SDK with a consistent retry and rate-limit policy. Graph has per-resource throttling that varies by endpoint. The client handles 429 responses with exponential backoff and surfaces clear errors when a token scope is missing rather than returning an empty result silently.
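The retry policy described above can be sketched roughly like this. It is not the project's actual client: call_with_backoff, ThrottledError, and the send callable are hypothetical names, and the sketch assumes Retry-After arrives as a number of seconds.

```python
import random
import time


class ThrottledError(Exception):
    """Raised when the request is still throttled after all retries."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429.

    send is any zero-argument callable returning (status_code, headers, body).
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status == 403:
            # Fail loudly instead of passing an empty result up the stack.
            raise PermissionError("Graph returned 403: check the app's granted permissions")
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint (seconds) when it is present.
        retry_after = headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)
        else:
            # Otherwise double the wait each attempt and add a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        sleep(delay)
    raise ThrottledError(f"still throttled after {max_retries} retries")
```

Passing sleep in as a parameter keeps the retry behavior testable without real waiting.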

The synthesis layer runs after all three reads complete. It uses an LLM to combine the Teams thread, the runbook snippet, and the Outlook context into a single summary. The LLM prompt is structured: it receives each source as a labeled block and is told to attribute claims. This makes the summary easier to audit. If the summary says "runbook step 3 is relevant," you can check the runbook yourself.
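A labeled-block prompt of this kind might be assembled as follows. This is a minimal sketch, not the project's real prompt: build_synthesis_prompt, the bracketed label format, and the instruction wording are all assumptions.

```python
def build_synthesis_prompt(incident, sources):
    """Assemble a prompt with one labeled block per source.

    sources maps a label such as "TEAMS" to the text read from that source.
    """
    blocks = []
    for label, text in sources.items():
        # Wrap each source in open/close markers so claims can point back to it.
        blocks.append(f"[{label}]\n{text.strip()}\n[/{label}]")
    instructions = (
        "Summarize the incident for the on-call engineer. "
        "Attribute every claim to a source label in brackets, e.g. [TEAMS]. "
        "If the sources do not support a claim, leave it out."
    )
    return f"{instructions}\n\nIncident: {incident}\n\n" + "\n\n".join(blocks)


prompt = build_synthesis_prompt(
    "Payment service returning 503 on checkout endpoint",
    {
        "TEAMS": "503s started at 01:54 UTC.",
        "RUNBOOK": "Step 3: restart the connection pool.",
        "OUTLOOK": "Ticket #INC-9182 open.",
    },
)
print(prompt)
```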

The audit log is built as a list of typed GraphAction objects throughout the run, then serialized to JSON at the end. This was a deliberate choice over writing to a file or a standard logger. You get a structured object you can inspect in code, pass to agent-decision-log, or write to a database. It also makes the test suite straightforward: 37 tests cover the Graph client, the synthesis prompt structure, the audit log serialization, and the retry behavior.
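The pattern looks roughly like the sketch below. The post names GraphAction but does not show its fields, so the field names here and the AuditLog wrapper class are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class GraphAction:
    """One recorded Graph call (field names are assumed, not the library's)."""
    action: str
    query: str
    result_count: int
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class AuditLog:
    """Collects typed actions during a run and serializes them at the end."""

    def __init__(self):
        self._actions = []

    def record(self, action, query, result_count):
        entry = GraphAction(action, query, result_count)
        self._actions.append(entry)
        return entry

    def to_json(self):
        # asdict turns each dataclass into a plain dict that json can handle.
        return json.dumps([asdict(a) for a in self._actions], indent=2)


log = AuditLog()
log.record("sharepoint.search", "payment service runbook", 1)
print(log.to_json())
```

Because the entries stay as objects until the end, a test can assert on them directly instead of parsing log lines.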

The agent requires three Graph API permissions: ChannelMessage.Read.All, Files.Read.All, and Mail.Read. These are application permissions, not delegated, so the agent runs without a signed-in user. You register the app in Azure AD (now Microsoft Entra ID) and grant admin consent. The README has the full setup steps.
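If a call fails after setup, it can help to confirm that admin consent actually landed. App-only tokens from the Microsoft identity platform list granted application permissions in a roles claim. The sketch below reads that claim; missing_roles is a hypothetical helper, not part of the library, and it does not verify the token signature, so treat it as a debugging aid only.

```python
import base64
import json

REQUIRED_ROLES = {"ChannelMessage.Read.All", "Files.Read.All", "Mail.Read"}


def missing_roles(access_token, required=REQUIRED_ROLES):
    """Return the required application permissions absent from the token.

    Decodes the JWT payload without verifying the signature.
    """
    payload_b64 = access_token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return set(required) - set(claims.get("roles", []))
```

An empty set means all three permissions are present in the token. Anything else names the permission that still needs consent.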

When This Is Useful

This is useful for teams that already live in Microsoft 365 and want to cut the manual search step out of incident triage. It works well when your incident communication is in Teams, your runbooks are in SharePoint, and escalation threads are in Outlook. That describes a lot of enterprise environments.

It is not useful if your incident communication is in Slack, PagerDuty, or another tool. The agent only speaks Microsoft Graph. It is also not useful as a real-time alerting system. It is a read-and-synthesize tool, not a monitoring replacement.

For teams with a mature runbook structure in SharePoint, runbook retrieval is the highest-value feature. Finding a runbook that takes a human two minutes of folder navigation takes the agent about two seconds.

Install or Try It

pip install m365-graph-incident-agent

# Configure your Azure AD app credentials
export M365_TENANT_ID="your-tenant-id"
export M365_CLIENT_ID="your-client-id"
export M365_CLIENT_SECRET="your-client-secret"

# Run a triage
python -m m365_incident_agent.cli \
  --incident "Database connection pool exhausted on payment service" \
  --teams-channel "19:your-channel-id" \
  --sharepoint-site "https://yourcompany.sharepoint.com/sites/ops" \
  --lookback-hours 2

# Or use directly in Python
from m365_incident_agent import IncidentAgent
agent = IncidentAgent.from_env()
result = agent.triage(incident="checkout 503", teams_channel_id="...")
print(result.summary)

Setup requires an Azure AD app registration with the three Graph permissions listed above. The README walks through the Azure portal steps.

Related Libraries

Library	What it does	When to pair it
agent-decision-log	Structured WHY-layer log of agent decisions	Pair to record why the agent chose each Graph query
agent-citation	Structured WHERE-layer source citations	Pair to attach source links to each claim in the triage summary
agentsnap	Snapshots agent call traces	Pair to capture full call traces for post-incident review
agent-event-bus	In-process pub/sub for agent events	Pair if you want to react to triage events in a wider system
llm-retry	Exponential backoff for LLM calls	Pair to make the synthesis step resilient to transient LLM errors

What Is Next

The biggest gap right now is SharePoint search quality. The agent does a keyword search against SharePoint's search API. For teams with well-organized runbook folders, this works fine. For teams with a messy SharePoint, the search results can be noisy. An optional vector index over the SharePoint content would improve precision significantly.

The second gap is cross-source correlation. Right now, the agent reads each source independently and lets the LLM synthesize. A structured correlation step that tries to link Teams message timestamps to Outlook email timestamps to runbook steps would make the triage more precise.
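One possible shape for that correlation step is a simple time-window join. This is only a sketch of the idea, not something the agent does today: the correlate function, the five-minute window, and the (timestamp, text) tuple format are all assumptions.

```python
from datetime import datetime, timedelta


def correlate(teams_events, email_events, window=timedelta(minutes=5)):
    """Pair each email with the Teams messages sent within `window` of it.

    Each event is an (iso_timestamp, text) tuple.
    """
    parsed_teams = [(datetime.fromisoformat(ts), text) for ts, text in teams_events]
    links = []
    for email_ts, email_text in email_events:
        sent = datetime.fromisoformat(email_ts)
        # Keep Teams messages whose timestamp is close to the email's.
        nearby = [text for ts, text in parsed_teams if abs(ts - sent) <= window]
        links.append((email_text, nearby))
    return links


# Illustrative data only; the date is made up.
teams = [
    ("2024-01-01T01:54:00+00:00", "503s started"),
    ("2024-01-01T02:06:00+00:00", "DB connection pool exhausted?"),
]
emails = [("2024-01-01T02:08:00+00:00", "Escalation: INC-9182")]
print(correlate(teams, emails))
```

The linked pairs could then be passed to the LLM as an extra labeled block, so it has explicit hints about which messages belong together.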

Real-time mode is on the backlog. Graph supports webhook subscriptions for Teams channel messages. An optional polling or webhook mode that keeps a live summary updated as new messages come in would be useful for long-running incidents.

Source: MukundaKatta/m365-graph-incident-agent