Pavan Belagatti

Posted on Jun 1

Agentic Observability: How I Wired a Real App with Dynatrace MCP in Minutes!

#mcp #ai #agents #developers

Every engineering team runs into the same annoying problem sooner or later. Monitoring tells you that something is broken, but it usually stops right there. You can see error rates. You can see latency spikes. You can see failed requests. But the questions that matter during an incident are often still unanswered.

Who owns this service? What depends on it? Where is the runbook? Which Slack channel should I use? Is this a real outage or a known failure mode?

That gap is exactly why I put together this small Agentic Observability demo. I built a tiny shopping app, instrumented it with OpenTelemetry, sent the telemetry into Dynatrace, and then used Port as the context layer so I could connect operational signals with engineering knowledge. The result is a much more useful troubleshooting workflow. Instead of staring at dashboards and guessing, I can ask what is happening and get back both live health data and human context in one place.

This setup is intentionally small, but it maps well to the kind of confusion that happens in real systems. The app has products, a cart, a checkout flow, and a few baked-in failure scenarios so the observability story actually has something interesting to surface.

The real problem with observability today

Traditional observability is good at detection. It can tell me that a service is unhealthy, response times are increasing, or failures are climbing. That is valuable, of course. But during incident response, detection is only the starting point.

The painful part begins immediately after that.

I need to know which team owns the service.
I need to know the service tier and whether it is business critical.
I need to understand upstream and downstream dependencies.
I need the right runbook.
I need to know how to contact the people who can fix it.
I need enough context to understand whether the anomaly is expected, accidental, or part of a test.

This is where Agentic Observability becomes interesting. The goal is not just to collect telemetry. The goal is to make the telemetry actionable by connecting it to the operational and organizational context around the system.

The demo architecture at a glance

I kept the demo simple on purpose. There are only three major pieces involved, but together they create a much stronger workflow than any single tool would provide alone.

A Flask shopping app that simulates realistic user behavior and failures.
Dynatrace to ingest traces and analyze service health, latency, logs, and errors.
Port as the context layer, storing service ownership, tier, runbooks, Slack channels, and related metadata.

The connection point between the observability platform and the context layer is the MCP connector in Port. I used that to connect the Dynatrace MCP server, which lets Port access live monitoring data while still grounding the experience in engineering context.

That combination is really the whole idea behind this version of Agentic Observability. Dynatrace knows what is happening technically. Port knows what that service means inside the organization.

What I built: a tiny Flask e-commerce app

The application itself is intentionally modest. It is a small e-commerce-style service with a few common user actions:

Browsing products
Adding items to the cart
Checking out
Viewing orders

It is not meant to be production-grade commerce software. It is just realistic enough to behave like a real service and produce interesting telemetry.

I also added fake traffic and fake failures into the flow. That mattered because I did not want a perfect demo where everything stays green all the time. Real systems fail in messy ways, and a good Agentic Observability setup should help make sense of that mess.

Some checkout flows succeed. Some fail. Some traffic is generated artificially. The point is to create enough activity that the tools have something meaningful to detect and explain.
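To make the "some succeed, some fail" behavior concrete, here is a minimal sketch of a checkout handler with a deliberate failure path. The demo's actual code is not shown here, so the function name, the `fail_rate` knob, and the response shapes are all assumptions for illustration; a Flask route would simply wrap something like this.

```python
import random


def checkout(cart, fail_rate=0.3, rng=random):
    """Return (status_code, body) for one checkout attempt.

    fail_rate is a hypothetical knob, not something taken from the demo.
    """
    if not cart:
        return 400, {"error": "cart is empty"}
    # Deliberate, clearly labelled failure so the telemetry has something to surface.
    if rng.random() < fail_rate:
        return 500, {"error": "intentional demo failure in checkout"}
    total = sum(item["price"] * item["qty"] for item in cart)
    return 200, {"status": "confirmed", "total": total}
```

Labelling the failure message as intentional is a small design choice that pays off later: it gives a log query something unambiguous to match on.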

Step 1: Auto-instrument the app with OpenTelemetry

The first layer is instrumentation. I wrapped the Flask app with OpenTelemetry so requests automatically emit traces. I did not need to write a bunch of custom tracing logic for every endpoint. That keeps the setup cleaner and closer to how I would want to instrument a real service quickly.

Once that was in place, every request moving through the shop could generate telemetry data, including:

Request traces
Errors
Latency information
Operational signals around the application flow

This is the foundation. Without it, there is no visibility into what the app is actually doing.
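OpenTelemetry's Flask instrumentation does far more than this (context propagation, semantic attributes, exporting), but the core idea of "wrap the app once, get a span per request" can be sketched with the standard library alone. The block below is a conceptual stand-in, not OpenTelemetry's implementation; `SpanRecorder` and `shop` are hypothetical names.

```python
import time


class SpanRecorder:
    """Tiny WSGI middleware that records one span-like dict per request."""

    def __init__(self, app):
        self.app = app
        self.spans = []

    def __call__(self, environ, start_response):
        name = f"{environ['REQUEST_METHOD']} {environ['PATH_INFO']}"
        span = {"name": name, "status": None}
        started = time.perf_counter()

        def recording_start_response(status, headers, exc_info=None):
            # Capture the numeric status code before passing it through.
            span["status"] = int(status.split()[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            span["duration_ms"] = (time.perf_counter() - started) * 1000
            self.spans.append(span)


def shop(environ, start_response):
    """Stand-in WSGI app: /checkout always fails, everything else succeeds."""
    failed = environ["PATH_INFO"] == "/checkout"
    status = "500 Internal Server Error" if failed else "200 OK"
    start_response(status, [("Content-Type", "text/plain")])
    return [b"error" if failed else b"ok"]


traced_shop = SpanRecorder(shop)
```

No endpoint had to change to become observable, which is the same property that makes auto-instrumentation attractive for a real service.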

Step 2: Stream traces into Dynatrace

After instrumentation, the traces stream directly into Dynatrace. Dynatrace auto-detects the service and begins tracking the health of the application in real time.
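Pointing an OpenTelemetry exporter at Dynatrace usually comes down to a handful of standard `OTEL_*` environment variables. The helper below is a sketch that assembles them; the endpoint shape assumes a Dynatrace SaaS environment, and the function name, default service name, and token value are placeholders rather than details from this demo.

```python
from urllib.parse import quote


def otlp_env(environment_id, api_token, service_name="demo-shop"):
    """Build OpenTelemetry exporter environment variables for a Dynatrace tenant."""
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        # Assumed SaaS endpoint shape; confirm against your own environment URL.
        "OTEL_EXPORTER_OTLP_ENDPOINT": (
            f"https://{environment_id}.live.dynatrace.com/api/v2/otlp"
        ),
        # Header values are percent-encoded in this variable, so the space becomes %20.
        "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=" + quote(f"Api-Token {api_token}"),
    }
```

With these exported in the shell, launching the app through the `opentelemetry-instrument` wrapper is one common way to get traces flowing without touching application code.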

For this demo, that meant I could quickly see:

The service showing up as an active monitored workload
Traffic spikes from the generated activity
Error behavior during intentional checkout failures
Latency and service-level patterns over time

This part is classic observability. Dynatrace is doing exactly what an observability platform should do: gather the signals, analyze them, and make abnormal behavior visible.

But again, raw visibility is not the whole story.

Step 3: Add the missing context in Port

This is where things get a lot more useful.

I modeled the service in Port. Port is an agentic developer platform, and in this setup it works as a context layer over the telemetry coming from Dynatrace. That context includes the kind of information engineers usually have to hunt down manually during an incident.

For the service, I stored details like:

Owner of the service
Tier or criticality level
Environment
Runbook
Slack channel for communication
Dependencies related to the service
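As a rough picture of what that record might look like, here is a sketch that builds a service entity in the general shape Port uses (identifier, blueprint, properties, relations). The blueprint name, property keys, relation name, and the validation rule are assumptions for illustration, not the demo's actual data model.

```python
REQUIRED_PROPERTIES = ("owner", "tier", "environment", "runbook", "slack_channel")


def build_service_entity(identifier, properties, depends_on=()):
    """Return an entity payload, refusing to build one with missing context."""
    missing = [key for key in REQUIRED_PROPERTIES if not properties.get(key)]
    if missing:
        raise ValueError(f"missing context fields: {', '.join(missing)}")
    return {
        "identifier": identifier,
        "title": identifier,
        "blueprint": "service",  # hypothetical blueprint name
        "properties": dict(properties),
        "relations": {"depends_on": list(depends_on)},  # hypothetical relation
    }


# Placeholder values only; none of these come from the demo.
example_entity = build_service_entity(
    "demo-shop",
    {
        "owner": "team-checkout",
        "tier": "tier-2",
        "environment": "dev",
        "runbook": "https://example.com/runbooks/demo-shop",
        "slack_channel": "#demo-shop-alerts",
    },
    depends_on=["payments-stub"],
)
```

Rejecting incomplete entities up front is deliberate: a context layer is only useful during an incident if the owner and runbook fields are actually filled in.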

This is the missing half of incident response. When a metric turns red, I do not want to begin a scavenger hunt. I want the operational signal and the human context tied together.

How the Dynatrace MCP server fits into the workflow

The Port MCP connector is what ties everything together. I used it to connect the Dynatrace MCP server into Port, which means Port can reach into Dynatrace when needed and pull live monitoring data as part of a contextual query.

That matters because I no longer have to stitch disconnected tools together in my head. Instead, Port can combine:

Its own service metadata
Ownership and operational details
Live health information from Dynatrace
Relevant answers returned through agentic queries

Port supports multiple data source patterns, including APIs, GitOps, infrastructure-as-code, web integrations, and MCP servers. For this demo, the Dynatrace MCP integration was the key piece because it let me bridge observability data and service context directly.
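Under the hood, MCP is built on JSON-RPC 2.0, and a client invokes a server-side tool with a `tools/call` request. The sketch below only constructs such messages; the tool name and arguments are hypothetical, and a real client would first send `tools/list` to discover what the Dynatrace MCP server actually exposes.

```python
import json
from itertools import count

_request_ids = count(1)


def mcp_request(method, params=None):
    """Serialize one JSON-RPC 2.0 request as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def call_tool(name, arguments):
    """Build a tools/call request for a named tool."""
    return mcp_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool name and arguments, for illustration only.
example_call = call_tool("get_service_health", {"service": "demo-shop"})
```

In this setup Port's connector plays the client role, so none of this has to be written by hand; the sketch is only meant to show how thin the protocol layer is.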

Running the app and generating failures

Once the shop app was running locally, I exercised the common paths: browse products, add them to the cart, and go through checkout. I also generated some fake user activity and deliberately introduced failures during checkout.

That created the exact kind of mixed operational picture I wanted:

Normal requests
Confirmed orders
Periodic failures
Traffic increases over time

In the orders view, I could see the system state changing as synthetic traffic and failures were happening. In Dynatrace, the service activity became visible as spikes and behavioral changes. That gave me enough signal to test whether the full Agentic Observability flow could actually explain what was going on.

What the agentic query experience looks like

After connecting Dynatrace and Port, I could ask a plain-language question about the service rather than manually piecing everything together from dashboards and documents.

I queried the system about what was happening with the demo service. Port AI, which is the native chat experience inside Port, then began collecting data from both Port and Dynatrace in parallel.

That is an important detail. It was not just answering from one static metadata record. It was combining two different kinds of information:

Entity context from Port, such as owner, tier, environment, runbook, and communication channel
Health metrics from Dynatrace, such as traffic, recent behavior, and failures
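The "collect from both sources in parallel, then merge" step can be sketched in a few lines. The two fetch functions here are stubs returning made-up values; in the real workflow they would be calls into Port's catalog and the Dynatrace MCP server, and how Port AI actually schedules them is not something this sketch claims to reproduce.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_port_context(service):
    """Stub for the context lookup; values are placeholders."""
    return {"owner": "team-checkout", "tier": "tier-2", "runbook": "runbooks/demo-shop"}


def fetch_dynatrace_health(service):
    """Stub for the live health lookup; values are placeholders."""
    return {"requests_last_2h": 1200, "failed_requests_last_2h": 90}


def describe_service(service,
                     context_source=fetch_port_context,
                     health_source=fetch_dynatrace_health):
    """Query both sources concurrently and return one combined answer."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        context = pool.submit(context_source, service)
        health = pool.submit(health_source, service)
        return {
            "service": service,
            "context": context.result(),
            "health": health.result(),
        }
```

Passing the sources in as parameters keeps the merge logic independent of any one vendor, which fits the idea that the observability backend is swappable.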

That is the essence of Agentic Observability. The system is not merely showing a chart. It is assembling the context needed to reason about the chart.

The answer gets a lot more useful than a red metric

Once the query completed, I got back a consolidated view of the service. It identified the service and surfaced key metadata such as:

The owning team
The service tier
The environment
The communication channel
The runbook location
Whether there were any open incidents
Recent traffic behavior over the last couple of hours

That is already a huge improvement over standard monitoring alone. Instead of only knowing that a service is active or unhealthy, I immediately know how that service fits into the engineering organization.

Then I asked a deeper follow-up question about the cause of failures.

The system checked the logs and correlated what it found. The result was reassuring: the error was not some mysterious production bug. It was an intentionally hardcoded failure in the demo, which returned a 500 Internal Server Error during checkout.

That answer is exactly what I wanted to prove. With a good Agentic Observability flow, I should be able to distinguish quickly between:

A real incident
A synthetic test
A known intentional failure mode
An unexpected regression
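One way to picture that triage is as a small set of ordered rules over an error event. This is purely a sketch: the `ErrorEvent` fields, the one-hour regression window, and the rule order are assumptions, and an agent reasoning over logs and catalog context is far less rigid than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorEvent:
    message: str
    synthetic: bool = False  # hypothetical flag set by the traffic generator
    seconds_since_deploy: Optional[float] = None


def classify(event, known_failures, regression_window_s=3600):
    """Label an error using four coarse categories, checked in order."""
    if any(marker in event.message for marker in known_failures):
        return "known intentional failure mode"
    if event.synthetic:
        return "synthetic test"
    recent_deploy = (
        event.seconds_since_deploy is not None
        and event.seconds_since_deploy <= regression_window_s
    )
    if recent_deploy:
        return "unexpected regression"
    return "real incident"
```

The known-failure check comes first on purpose, so a deliberately injected error is never escalated just because it happened shortly after a deploy.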

Why this pattern matters for engineering teams

The demo is small, but the bottleneck it addresses is very real.

In many teams, observability data lives in one place, service ownership in another, runbooks in another, incident tools somewhere else, and tribal knowledge in Slack or people’s heads. During an outage, every extra click and every missing piece of context adds delay.

This approach reduces that friction by bringing the pieces together.

Agentic Observability is useful because it helps answer the operational questions that come right after detection:

What failed?
Why is it failing?
Who owns it?
What should happen next?
Where is the documentation?
Is this service connected to other important systems?

Instead of forcing an engineer to manually join that information, the platform can do it for them.

What Port contributes beyond simple metadata

It is easy to think of Port as just a catalog for services, but in this setup it does something more important. It serves as a reliable operational context layer for engineering teams.

Because the service entity in Port includes ownership, deployment-related knowledge, team details, and related service information, Port becomes the right place to anchor agentic queries. Dynatrace provides the live signal. Port provides the meaning around the signal.

That is why the answers become much more actionable. The system is not simply observing. It is interpreting the observation in the context of how the organization actually works.

You can extend the same pattern to other tools

Although this demo used Dynatrace, the broader pattern is not limited to one observability vendor. Port’s MCP connector approach makes it possible to connect multiple developer tools and bring them into the same context-rich workflow.

The same idea can be extended to tools like:

PagerDuty
New Relic
Other MCP-enabled developer and operations tools

So the bigger idea here is not “use one tool for everything.” The bigger idea is “build a context layer that can speak to the right tools and answer engineering questions with the full picture.”

The data flow behind this Agentic Observability demo

The end-to-end flow for the demo is straightforward:

A shopper interacts with the Flask application.
OpenTelemetry captures traces as requests move through the system.
Dynatrace ingests and analyzes those traces, logs, and errors.
The Dynatrace MCP server is connected into Port.
Port combines live monitoring data with service context.
Agentic queries return an operationally meaningful answer instead of isolated raw metrics.

That pipeline is the practical core of Agentic Observability. Instrument the app, collect the signals, connect the tools, add the missing human context, and let engineers query the system in a way that reflects how incidents actually happen.

What I liked most about this setup

The most useful part was not the dashboard itself. It was the reduction in ambiguity.

When something breaks, I do not want five tabs open and three separate searches just to figure out basic ownership and intent. I want one place that can tell me:

What changed
What is unhealthy
Whether the failure is real or expected
Who needs to be involved
What the next step should be

That is why this style of Agentic Observability feels promising. It closes the gap between telemetry and action.

Final thoughts

This demo was intentionally small, but the lesson is not small at all. Good observability should do more than report failures. It should help engineering teams respond with confidence.

Dynatrace handled the telemetry side beautifully. Port added the context that observability platforms often do not have on their own. Connecting the two through the MCP layer created a workflow where I could ask what is happening with a service and get back something genuinely useful.

That, to me, is the practical value of Agentic Observability. It is not just about smarter dashboards or nicer charts. It is about turning system signals into answers that are grounded in ownership, dependencies, documentation, and action.

If you are trying to make incident response less chaotic, this pattern is absolutely worth exploring.

Top comments (1)

Armorer Labs • Jun 22

This is where agentic observability starts to get interesting: the agent is not only being observed, it can act on observability data.

The part I would want to preserve is the handoff between signal and action. If the agent sees an error spike, queries traces, opens a ticket, or changes config, the run record should connect those steps: source signal, tool calls, reasoning summary, approval state, and final side effect.

Otherwise the dashboard gets smarter, but the operational trail can still be hard to reconstruct after the fact.