clover

Posted on Jun 15

Welcoming AI as an SRE and DevOps Teammate: Introducing Reili, an Open Source Slack-Based AI Agent

#ai #sre #devops #opensource

Let AI Handle Investigation and Reporting, So You Can Focus on Judgment and Creation

When you work as an SRE or DevOps engineer, you often see notifications throughout the day that are not necessarily urgent, but still feel risky to ignore.

It might be a Datadog alert, a deprecation notice from AWS or another external service, or a casual message in Slack saying, “This error looks a bit concerning.”

Not all of these immediately turn into major incidents. Still, someone needs to take a look at least once. You open dashboards, inspect logs, check recent changes, look through old notes, and sometimes end up replying in Slack with something like, “There doesn’t seem to be any impact for now. I’ll keep watching.”

Each check may be small, but work that requires jumping between multiple sources, piecing together context, and reaching a conclusion steadily drains engineers’ focus.

Could AI take over the process of investigating, organizing, and communicating the key points across these different sources?

That idea led me to build Reili, an open source AI agent that runs in Slack.

Of course, if your organization has a dedicated SRE team and a well-established on-call rotation, this kind of investigation may already be handled by the person on duty.

But not every team has that structure from the beginning. In smaller development teams, there may be no clearly assigned on-call owner, or the same people may be responsible for both infrastructure and application development. In those cases, alerts and small signs of trouble often become things that “someone checks if they notice.”

That is exactly the area Reili is designed to support.

What Is Reili?

Reili is an AI teammate for SRE and DevOps teams.

It investigates across Datadog, GitHub, Slack thread history, and knowledge bases, then posts evidence-based reports back to Slack threads.

Give Instructions by Mentioning Reili

When you want to explicitly ask Reili to do something, just mention it in Slack.

@Reili Please investigate this alert.

Once the investigation starts, Reili posts progress updates in the thread. When it finishes, it sends an investigation summary. There is also a cancel button, so you can stop a task at any time if it is no longer needed.
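The progress-and-cancel flow can be pictured as a cancellable background task. The following is a minimal Python sketch under stated assumptions: `investigate`, `run_in_thread`, and the message strings are hypothetical, a plain list stands in for the Slack thread, and the cancel button is modeled as a call to `asyncio.Task.cancel()`. It is not Reili's actual implementation.

```python
import asyncio
from typing import Callable


async def investigate(post: Callable[[str], None], steps: list[str]) -> None:
    """Hypothetical investigation: post one progress update per step, then a summary."""
    try:
        for step in steps:
            post(f"Progress: {step}")
            await asyncio.sleep(0.01)  # stand-in for a Datadog or GitHub query
        post("Summary: investigation finished")
    except asyncio.CancelledError:
        post("Investigation cancelled")
        raise  # re-raise so the task is actually marked as cancelled


async def run_in_thread(steps: list[str], press_cancel: bool = False) -> list[str]:
    thread: list[str] = []  # stands in for the Slack thread
    task = asyncio.create_task(investigate(thread.append, steps))
    if press_cancel:
        await asyncio.sleep(0)  # yield once so the task can post its first update
        task.cancel()  # roughly what a cancel-button handler might do
    try:
        await task
    except asyncio.CancelledError:
        pass  # the cancellation was requested, so it is not an error here
    return thread


print(asyncio.run(run_in_thread(["check monitors", "read recent diffs"])))
print(asyncio.run(run_in_thread(["check monitors", "read recent diffs"], press_cancel=True)))
```

Handling cancellation inside the task, rather than simply discarding it, gives the agent a chance to leave a final note in the thread so nobody is left wondering whether the investigation is still running.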

Autonomous Responses Without Mentions

One thing I especially focused on with Reili is that it should not only act when someone says @Reili. It should also be able to notice, within the natural flow of Slack conversations, when something seems worth picking up.

Reili monitors messages in specified channels. An internal mechanism evaluates each message and decides whether Reili should respond. If it determines that a response is needed, Reili behaves the same way it would when explicitly mentioned.

This allows Reili to respond not only to alert bot notifications, but also to casual comments from teammates like, “What is this error?”
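Conceptually, the auto-response path is a gate in front of the same handler that a mention triggers. The Python sketch below assumes a hypothetical `classify` callable; in practice this would presumably be an LLM call that receives the channel's `auto_response_policy`, but here a keyword stub takes its place so the example runs offline. All names are illustrative, not Reili's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical classifier signature: (policy, message_text) -> should_respond
Classifier = Callable[[str, str], bool]


@dataclass
class Message:
    text: str
    mentions_bot: bool = False


def handle_investigation(msg: Message) -> str:
    """Hypothetical shared handler used for both mentions and auto-responses."""
    return f"investigating: {msg.text}"


def on_message(msg: Message, policy: str, classify: Classifier) -> Optional[str]:
    # An explicit mention always triggers an investigation.
    if msg.mentions_bot:
        return handle_investigation(msg)
    # Otherwise, ask the classifier whether the channel policy calls for a response.
    if classify(policy, msg.text):
        return handle_investigation(msg)
    return None  # stay silent


def keyword_stub(policy: str, text: str) -> bool:
    """Offline stand-in for an LLM judgment; it ignores the policy text entirely."""
    signals = ("error", "alert", "latency", "timeout")
    return any(word in text.lower() for word in signals)


policy = "Respond to signs of production incidents; ignore casual chat."
print(on_message(Message("This error looks a bit concerning"), policy, keyword_stub))
```

Routing both paths into one handler is what makes an auto-response behave exactly like an explicit mention; only the decision of whether to start differs.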

For example, you can configure a team-wide channel like this:

[[channel.slack.channels]]
names = ["team-sre", "alerts-*", "incidents"]
auto_response = true
auto_response_policy = """
Respond to reports that may require investigation, such as signs of production incidents,
alerts, error bursts, latency degradation, technical questions, requests for help,
or troubleshooting discussions.
Do not respond to casual chat, announcements, or reports that have already been resolved.
"""

When someone writes, “This error looks a bit concerning,” Reili reads the context, investigates, and posts the results in the thread. There is no need to type @Reili, and no need for someone else to notice and pick it up manually.

The policy for “which messages Reili should respond to” can be freely defined per channel. You can configure it to respond only to AWS Health deprecation notices, or only to posts from specific users.
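Because `names` accepts entries such as `alerts-*`, each incoming channel has to be matched against the configured patterns before a policy can be applied. The Python sketch below assumes shell-style wildcard semantics (via the standard `fnmatch` module) and first-match-wins ordering; both are assumptions about how Reili interprets the setting. The first dictionary mirrors the TOML example above, while the second entry, its `aws-notifications` channel name, and the `find_channel_config` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Mirrors [[channel.slack.channels]] entries; the second one is a hypothetical example.
CHANNELS = [
    {
        "names": ["team-sre", "alerts-*", "incidents"],
        "auto_response": True,
        "auto_response_policy": "Respond to possible incidents, alerts, and troubleshooting.",
    },
    {
        "names": ["aws-notifications"],
        "auto_response": True,
        "auto_response_policy": "Respond only to AWS Health deprecation notices.",
    },
]


def find_channel_config(channel: str) -> Optional[dict]:
    """Return the first entry with a `names` pattern that matches the channel name."""
    for entry in CHANNELS:
        if any(fnmatchcase(channel, pattern) for pattern in entry["names"]):
            return entry
    return None


def auto_response_policy(channel: str) -> Optional[str]:
    """Return the policy text only if auto-response is enabled for the channel."""
    entry = find_channel_config(channel)
    if entry is None or not entry.get("auto_response", False):
        return None
    return entry["auto_response_policy"]


print(auto_response_policy("alerts-payments"))
```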

What I want Reili to handle is the investigation that happens before humans make decisions. To do that, Reili:

Reads Datadog dashboards, monitors, and metrics
Reviews diffs and code in related GitHub repositories
Refers to Slack thread history and past investigation notes
Searches esa documents and the web, when needed
Posts its investigation results back to Slack
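One way to keep a report evidence-based is to require every finding to carry a pointer back to where it came from. The following Python sketch is hypothetical: the `Finding` structure and the `render_report` helper are illustrative and are not Reili's report format. The output uses Slack's `<url|label>` link syntax from mrkdwn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str    # e.g. "Datadog", "GitHub", "Slack", "esa", "web"
    summary: str   # what was observed
    evidence: str  # URL pointing at the underlying data


def render_report(title: str, findings: list[Finding], conclusion: str) -> str:
    """Render findings as Slack mrkdwn, refusing any finding that lacks evidence."""
    missing = [f.summary for f in findings if not f.evidence]
    if missing:
        raise ValueError(f"findings without evidence: {missing}")
    lines = [f"*{title}*"]
    for f in findings:
        lines.append(f"• [{f.source}] {f.summary} (<{f.evidence}|source>)")
    lines.append(f"*Conclusion:* {conclusion}")
    return "\n".join(lines)
```

Rejecting unsupported findings at render time turns "evidence-based" from a guideline into a hard constraint on what can reach the thread.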

On the other hand, Reili does not have permission to directly modify production environments. Investigation can be delegated to Reili, but decisions about changes and recovery remain with humans. This separation is intentional.

Reili does not perform actions such as:

Infrastructure changes
Deployments or automatic recovery
Writes to GitHub

Designed for Delegating Real Work to AI

Security and Permission Transparency

It is completely natural to feel uneasy about giving AI access to production-related information. From the beginning, Reili has been designed around the question of what can be safely delegated.

No shell execution environment

Reili does not have shell access. Since it cannot execute arbitrary commands, the attack surface is smaller and unintended operations cannot happen through shell commands.

Read-only access

For both Datadog and GitHub, Reili only uses read permissions. It can read dashboards, but it cannot modify monitors or operate infrastructure.
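The no-shell, read-only design can also be enforced structurally: if the agent's tool registry only ever contains read operations, the model has nothing else to call. The Python sketch below is a hypothetical illustration of that idea; the `ToolRegistry` class and the tool names are invented for the example and are not taken from Reili's code, and the registered tools return stub values.

```python
from typing import Callable, Dict, List


class ToolRegistry:
    """Hypothetical registry that only accepts tools explicitly marked read-only."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object], read_only: bool) -> None:
        if not read_only:
            # Write-capable tools (deploys, shell, monitor edits) are refused outright.
            raise PermissionError(f"refusing to register write-capable tool: {name}")
        self._tools[name] = fn

    def call(self, name: str, *args: object) -> object:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](*args)

    def names(self) -> List[str]:
        return sorted(self._tools)


registry = ToolRegistry()
# Stub tools standing in for real read-only API calls.
registry.register("get_monitor", lambda monitor_id: {"id": monitor_id}, read_only=True)
registry.register("read_file", lambda path: f"contents of {path}", read_only=True)
```

A self-declared flag like `read_only` is only a first line of defense. The stronger guarantee is that the underlying Datadog and GitHub credentials only carry read permissions, so even a misregistered tool could not write.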

Users control the permissions

The permissions Reili requires for each connector, such as Datadog and GitHub, are explicitly documented. The actual permissions granted are decided by the user through the Slack App and GitHub App settings. You do not need to accept a provider-defined permission bundle like you often do with hosted AI services.

Because Reili is open source, you can inspect the code and see exactly what it does.
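Because the grants live in your own GitHub App settings, you can verify them yourself. GitHub reports an installation's granted permissions as a mapping from permission name to access level (`read`, `write`, or `admin`), for example in the `permissions` field of the installation object returned by its REST API. The hypothetical helper below flags anything beyond `read`. Fetching the installation (authentication and HTTP) is left out, so the dictionary in the example is a hand-written stand-in.

```python
from typing import Dict, List


def non_read_permissions(permissions: Dict[str, str]) -> List[str]:
    """Return the permission names granted at any level other than "read"."""
    return sorted(name for name, level in permissions.items() if level != "read")


def assert_read_only(permissions: Dict[str, str]) -> None:
    """Raise if the installation has been granted more than read access."""
    excess = non_read_permissions(permissions)
    if excess:
        raise PermissionError(f"GitHub App has more than read access: {', '.join(excess)}")


# Hand-written stand-in for an installation's `permissions` field.
granted = {"metadata": "read", "contents": "read", "pull_requests": "read"}
assert_read_only(granted)  # passes: every grant is read-only
```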

LLM-provider independent

You can choose the AI backend from OpenAI, Anthropic, AWS Bedrock, or Vertex AI Gemini. Since Reili is not tied to a specific provider, your team can continue using a provider it already trusts and has a contract with. You can also switch providers as pricing and performance change.
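Provider independence usually comes down to coding the agent against a small interface and choosing the implementation from configuration. The Python sketch below shows that pattern with `typing.Protocol`. The `ChatBackend` interface, the fake backend classes, and the provider keys are hypothetical, and the classes return canned strings instead of calling any real SDK.

```python
from typing import Callable, Dict, Protocol


class ChatBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class FakeOpenAIBackend:
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"  # a real backend would call the provider's SDK here


class FakeAnthropicBackend:
    def complete(self, prompt: str) -> str:
        return f"[anthropic] {prompt}"


# Maps a configured provider name to a factory for its backend.
BACKENDS: Dict[str, Callable[[], ChatBackend]] = {
    "openai": FakeOpenAIBackend,
    "anthropic": FakeAnthropicBackend,
}


def make_backend(provider: str) -> ChatBackend:
    try:
        return BACKENDS[provider]()
    except KeyError:
        raise ValueError(f"unsupported provider: {provider}") from None


def summarize_alert(backend: ChatBackend, alert: str) -> str:
    # Agent logic depends only on the interface, never on a concrete provider.
    return backend.complete(f"Summarize: {alert}")


print(summarize_alert(make_backend("anthropic"), "CPU usage high"))
```

Switching providers then means changing one configuration value rather than touching the investigation logic.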

Built to Be Operated Continuously

If adopting an AI agent requires setting up a dedicated database, managing a job queue, preparing a public endpoint, and managing state, it can feel heavy before you even start.

Reili is designed to avoid that. It has a stateless architecture with no database, and it runs as a single container connected to a Slack App. Because it connects to Slack over a WebSocket using Socket Mode, you do not need to expose any inbound ports to the internet.

The agent’s memory is also recorded and referenced through Slack messages. Reili leaves past investigation results in Slack and reuses them in future investigations. There is no need to prepare dedicated storage separately.
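Using Slack as memory means a new investigation starts by searching earlier messages instead of querying a database. The Python sketch below assumes messages shaped like Slack message objects (`ts`, `user`, `text`) and a hypothetical convention that the bot's reports begin with an "Investigation summary" marker. The ranking is plain word overlap, chosen only to keep the example self-contained; none of this is taken from Reili's source.

```python
import re
from typing import Dict, List, Set

MARKER = "Investigation summary"  # hypothetical prefix on the bot's report messages


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9_-]+", text.lower()))


def past_investigations(messages: List[Dict[str, str]], bot_user: str,
                        query: str, limit: int = 3) -> List[Dict[str, str]]:
    """Return earlier bot reports that share the most words with the query."""
    query_words = words(query)
    scored = []
    for msg in messages:
        if msg.get("user") != bot_user or not msg.get("text", "").startswith(MARKER):
            continue  # only reuse the bot's own finished reports
        overlap = len(query_words & words(msg["text"]))
        if overlap:
            scored.append((overlap, float(msg["ts"]), msg))
    # Most overlap first; newer Slack timestamps break ties.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [msg for _, _, msg in scored[:limit]]
```

Since the notes live in Slack, they stay readable and editable by the team, and the container itself can be restarted or replaced without losing anything.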

Setup

To run Reili, you need Docker, a Slack App, a Datadog API key, GitHub App credentials, and credentials for your chosen LLM provider.

docker run --rm \
  --env-file .env \
  -v "$(pwd)/reili.toml:/home/reili/reili.toml:ro" \
  ghcr.io/reilidev/reili:latest

You can choose the AI backend from OpenAI, Anthropic, AWS Bedrock, or Vertex AI Gemini, depending on the provider your team already uses.

Detailed setup instructions are available in the README.

Closing Thoughts

Alert investigation and impact assessment are not always highly visible work, but someone has to do them.

In many cases, the first step is to gather the situation, read potentially relevant information, and organize it into a form that humans can use to make decisions.

With Reili, that part can be delegated to an AI teammate.

Reili responds to alerts in Slack, deprecation notices from external services, and casual comments like, “This error looks a bit concerning.”

By noticing those small signs, investigating them, and reporting back with evidence, Reili helps engineers focus less on “starting the investigation” and more on “deciding what to do based on the report.”

Reili currently uses Datadog, GitHub, esa, and the web as information sources, and I plan to continue adding more connectors. As the number of external services your team uses grows, the scope of what Reili can investigate will grow as well.

Please give it a try.

→ https://github.com/reilidev/reili