1. Introduction to MCP
MCP is an open protocol for standardizing how applications provide context to LLMs. MCP can be regarded as a USB-C port for AI applications. Much like USB-C provides a standardized way to connect devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.
● MCP Hosts: Programs such as Claude Desktop, IDEs, or AI tools that want to access data through MCP.
● MCP Clients: Protocol clients that maintain a 1:1 connection with a server.
● MCP Servers: Lightweight programs, each of which exposes specific functionality through the standardized Model Context Protocol.
● Local Data Sources: Your computer's files, databases, and services that MCP servers can securely access.
● Remote Services: External systems available over the Internet (for example, through APIs) that MCP servers can connect to.
Getting Started
You can think of the MCP framework as a "plug-in system for large models" that lets you easily write your own "plug-in" functionality.
● [Required] First, you need a client that supports MCP. For more information, see the next section. In this article, we use the Cherry Studio and DeepChat clients.
● [Required] Prepare an LLM API and have the API key ready. In this article, we use Alibaba Cloud Model Studio. If you have no API key, you can use a locally deployed model instead (for example, via Ollama).
● [Optional] Add MCP servers as needed. For more information, see the next two sections.
● [Optional] Write your own MCP server. Go to the https://github.com/modelcontextprotocol repository, select the SDK you want, then add tools with custom functions and start the server for the large model to call; a minimal sketch follows this list.
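For example, with the official Python SDK, a minimal server might look like the following sketch (the server name "demo" and the add tool are toy placeholders):

from mcp.server.fastmcp import FastMCP

# Minimal sketch using the official MCP Python SDK.
mcp = FastMCP("demo")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default

Once started, the server can be registered in any MCP-capable client and the tool becomes callable by the model.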
Configuration details: (client configuration screenshots omitted)
Recommended Awesome MCP Clients
For all clients, see https://github.com/punkpeye/awesome-mcp-clients
For comparison of clients, see https://modelcontextprotocol.io/clients
The author recommends:
Cherry Studio
DeepChat
AIaW (take note of internal data security)
Cursor
Continue
Recommended Awesome MCP Servers
For all servers, see https://github.com/punkpeye/awesome-mcp-servers
For the example list, see https://modelcontextprotocol.io/examples
For the official list, see https://github.com/modelcontextprotocol/servers
Add as needed.
2. MCP Server for Alibaba Cloud Observability 2.0
- Observability 2.0
Over the past year, we have moved toward universal observability. The shift from "monitoring" to "observability" is not only a technological upgrade but also a deeper understanding of complex systems. Through the three levels of causal inference (association, intervention, and counterfactual), observability lifts our capability from passive observation to active prediction. Digital transformation requires enterprises to turn black-box systems into white-box systems, optimizing operational efficiency by collecting, storing, and analyzing multimodal data (such as logs, traces, and metrics).
As distributed systems and technology stacks become more complex, observability becomes the key to ensuring system stability and performance. However, many enterprises face the problem of chaotic observability data, such as a lack of standardization, fragmented data storage, low analysis efficiency, and difficulty in knowledge accumulation. To address these problems, we propose UModel — the soul of observability data, a universal “interactive language” of observability.
Based on ontological ideas, UModel uses a graph model composed of sets and links to describe the IT world. It defines core types such as EntitySet, LogSet, and MetricSet, and associations such as EntitySetLink and DataLink. It can be scaled flexibly, and introduces CommonSchema to simplify usage. It also provides features such as Explorer, alert, and event set to improve usability. UModel not only realizes the standardization and unified modeling of data, but also supports the efficient use of data by algorithms and large models, helping to build a new “observability 2.0” system.
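As an illustration only, the "sets and links" idea can be sketched roughly as follows; these dataclasses are hypothetical shapes of our own, not the product's actual schema:

from dataclasses import dataclass, field
from typing import List

# Hypothetical shapes for illustration; the real UModel schema is defined by the product.
@dataclass
class EntitySet:
    name: str                                  # e.g., "apm.service"
    keys: List[str] = field(default_factory=list)

@dataclass
class LogSet:
    name: str                                  # a set of log data

@dataclass
class MetricSet:
    name: str                                  # a set of metric data

@dataclass
class EntitySetLink:
    src: str                                   # source EntitySet name
    dst: str                                   # destination EntitySet name

@dataclass
class DataLink:
    entity_set: str                            # EntitySet name
    data_set: str                              # LogSet or MetricSet name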
- Integration of MCP Capabilities
The integration of MCP capabilities into observability 2.0 is a natural fit. It lets users grasp the overall state of observability 2.0 through dialogue, and helps them explore systems and analyze problems, all in natural language.
- Effect Showcase
End-to-end Service Analysis
Features are as follows:
- Retrieve the upstream and downstream, dependent middleware, and infrastructure of a service.
- Analyze the metrics of the service.
- Query the trace IDs of the service's error requests and conduct further intelligent analysis.
Basic Observability Information Query Capabilities
Features are as follows:
- Query multiple types of information in observability 2.0:
- Obtain zones.
- Find a workspace.
- Entity and topology query capabilities.
- Query UModel schemas and generate images for display.
3. Hands-on Practices (Trials and Errors) for Designing MCP Servers
- Experience 1: Tool Interfaces — Simplified and Atomized
Preface: an MCP server is different from an SDK API; it is an interface that interacts with people. The "interaction with people" here is all-encompassing: the interfaces, parameters, and returns all need to be concise and easy to understand. A good MCP server tool is one that "even a three-year-old can understand".
Example:
The getLogs SDK API of SLS is as follows.
from typing import Any

def get_logs(
    ak: str,
    sk: str,
    region_id: str,
    project: str,
    logstore: str,
    query: str,
    from_timestamp: int,
    to_timestamp: int,
    topic: str,
    line: int,
    offset: int,
    reverse: bool,
    powerSql: bool,
) -> list[Any]:
    ...
If this interface is provided directly to the large model as a tool, the success rate of calls will be near zero, because it is very complex:
- The main bottleneck is how to write a query and what the syntax is, which is a great challenge for the model. This scenario is better suited to A2A (agent collaboration), where a dedicated agent generates a reliable SLS query; we will not elaborate further here.
- Even assuming the query problem is solved, much of the basic information in the API (AK/SK, region, project, and logstore) would have to be obtained through multiple rounds of conversation. In practice, these environment-level parameters are highly repetitive.
- The time window is part of the parameters, which must be used with caution. For more information, see Experience 2.
- The parameters include rarely used ones: topic, line, offset, reverse, and powerSql. Parameters whose default values can handle most requests should not be exposed.
After extensive practice, we believe that, to address these issues, the architecture for interacting with users through an MCP server on Alibaba Cloud may need some changes.
Simple idea vs. the truly applicable design:
First of all, the MCP server should be created after the user enters a workspace. The MCP interactive client and MCP server provided by Alibaba Cloud are very lightweight, and each user's access to a single workspace corresponds to one server lifecycle. In this case, the MCP server already holds environment information such as the AK/SK, workspace name, and region ID, so tools in the MCP server do not need to care about such basic parameters.
Then, as the user’s service scope expands, several tools of corresponding modules can be created for users to call as needed.
Finally, it is necessary to encapsulate common operations in MCP tools instead of giving a universal “query” interface.
Therefore, the optimized MCP server tool version of getLogs should look like this:
A2A mode:

from typing import Any
from datetime import datetime
from pydantic import Field

def get_log_tool(
    query: str,
    from_timestamp: int = Field(
        int(datetime.now().timestamp()) - 3600, description="from timestamp, unit is second"
    ),
    to_timestamp: int = Field(
        int(datetime.now().timestamp()), description="to timestamp, unit is second"
    ),
) -> list[Any]:
    ...

def gen_sls_query(
    text: str,
) -> str:
    ...
Functional modular mode (example only):

from typing import Any, List
from datetime import datetime
from pydantic import Field

# Obtain the index information of the Logstore
def get_index():
    pass

# Obtain aggregate statistics of the fields, for example:
# * and upstream_status >= 400 | SELECT request_uri, upstream_status, COUNT(*) AS cnt
# FROM log
# GROUP BY request_uri, upstream_status
# ORDER BY cnt DESC
# LIMIT 10
def get_fields_desc(
    filter: str,
    fields: List[str],
    from_timestamp: int = Field(
        int(datetime.now().timestamp()) - 3600, description="from timestamp, unit is second"
    ),
    to_timestamp: int = Field(
        int(datetime.now().timestamp()), description="to timestamp, unit is second"
    ),
) -> list[Any]:
    ...

# Count UV, for example:
# * | SELECT COUNT(DISTINCT client_ip) AS unique_visitors FROM log
def get_uv(
    filter: str,
    field: str,
    from_timestamp: int = Field(
        int(datetime.now().timestamp()) - 3600, description="from timestamp, unit is second"
    ),
    to_timestamp: int = Field(
        int(datetime.now().timestamp()), description="to timestamp, unit is second"
    ),
) -> list[Any]:
    ...
The reason why information such as the AK, SK, project, and Logstore is not required here is that the MCP server is started when the Logstore is opened, so the basic environment information is already known. This division of functional modules is only an example for illustration; a sketch of binding the environment at startup follows.
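A minimal sketch of this pattern, assuming hypothetical environment variable names, binds the basic information once at server startup so that no tool exposes it as a parameter:

import os
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sls-logstore")

# Hypothetical env var names: bound once per server lifecycle (one workspace access).
AK = os.environ.get("SLS_AK", "")
SK = os.environ.get("SLS_SK", "")
PROJECT = os.environ.get("SLS_PROJECT", "")
LOGSTORE = os.environ.get("SLS_LOGSTORE", "")

@mcp.tool()
def get_log_tool(query: str) -> list:
    """Query the pre-bound project/logstore; no ak/sk/project parameters needed."""
    ...  # call the SLS SDK here with AK/SK/PROJECT/LOGSTORE already in scope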
- Experience 2: Tool Interface Parameters — Default Values; Use Time Parameters with Caution
Case: a tool that runs SQL-style queries. The interface is as follows:
from pydantic import Field
from mcp.server.fastmcp import Context  # Context comes from the MCP Python SDK

def o11y_list_entities(
    ctx: Context,
    query: str = Field(default=None, description="query"),
    from_timestamp: int = Field(
        ..., description="from timestamp, unit is second"
    ),
    to_timestamp: int = Field(
        ..., description="to timestamp, unit is second"
    ),
) -> list[str]:
    ...
To make things easier for the large model, a dedicated tool was even added for retrieving time-related values. In practice, however, invalid from/to values are often passed in, and the interface then reports an error directly; with such invalid parameters, the client and server have to be restarted.
Sometimes an absurd time parameter is passed in, leading to unsatisfactory subsequent responses: in one case, the timestamp provided corresponded to 2023-06-25 00:00:00. The large model is vague about the concept of "current time", so the call still fails to retrieve data from a Logstore that retains data for only 30 days.
A better approach:
from datetime import datetime
from pydantic import Field
from mcp.server.fastmcp import Context

def o11y_list_entities(
    ctx: Context,
    query: str = Field(default=None, description="query"),
    from_timestamp: int = Field(
        int(datetime.now().timestamp()) - 3600, description="from timestamp, unit is second"
    ),
    to_timestamp: int = Field(
        int(datetime.now().timestamp()), description="to timestamp, unit is second"
    ),
) -> list[str]:
    ...
If the last hour is given directly as the default, it covers the expected query in most cases: the model only needs to think about the query input without worrying about time details. If the result is empty, the model can then consider whether the time window is the issue and give feedback. A hardening sketch follows.
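To harden the tool further, invalid values can be normalized instead of raising errors. A sketch, assuming a hypothetical helper and the 30-day retention described above:

from datetime import datetime, timedelta

def normalize_window(from_ts: int, to_ts: int) -> tuple[int, int]:
    """Clamp a model-supplied time window to something queryable."""
    now = int(datetime.now().timestamp())
    retention = int(timedelta(days=30).total_seconds())
    if to_ts <= 0 or to_ts > now:
        to_ts = now                        # absurd "to": fall back to now
    if from_ts <= 0 or from_ts >= to_ts or from_ts < now - retention:
        from_ts = to_ts - 3600             # absurd "from": fall back to the last hour
    return from_ts, to_ts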
- Experience 3: Simplified Output — Avoid Excessively Long Context Output
Some negative examples: querying the list of workspaces (returns 500 workspaces); querying the entity types accessed in a workspace (returns 200 types).
The functions of an MCP server and an SDK API cannot be treated alike. MCP output is essentially rendered in a terminal, so content that a human cannot take in at a glance (roughly anything beyond 20 items) is of little value. The output of every interface should have a controlled limit: support filters to narrow the search, or simply truncate the output to 10 items. Beyond human perception, an excessively long JSON response also seriously slows down subsequent model output and hurts context understanding. A truncation sketch follows.
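A minimal sketch of such a controlled limit (the field names and hint text are our own choices, not a fixed convention):

def truncate_result(items: list, limit: int = 10) -> dict:
    """Return at most `limit` items plus a hint the model can act on."""
    if len(items) <= limit:
        return {"items": items, "truncated": False}
    return {
        "items": items[:limit],
        "truncated": True,
        "total": len(items),
        "hint": "Result truncated; add a filter to narrow the search.",
    }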
The above covers the common JSON-interface experience. In practice, we also ran into scenarios where a tool should return an image (SVG or PNG). We tried several solutions:
● Return the image data/XML directly: clients do not display it properly, and the large payload slows everything down.
● Use the Image class of the MCP SDK and return "url": f"data:image/png;base64,{tool_result.data}": most clients do not display it properly, and the model must be multimodal.
● Use an auxiliary image service and return a markdown-formatted URL: this works best.
The tool's docstring then guides the model to display the returned markdown directly:

Args:
    ctx: MCP context for accessing observability clients
    workspace_name: Workspace name, which must exactly match the workspace
Returns:
    The markdown content of the schema, directly printed and displayed as your output
Effect:
- Experience 4: Avoid Chained Passing of Inputs and Outputs Between Tools
Good practice: tools are atomic; each tool performs an independent task.
Bad practice: Tool1 -> output -> Tool2 input -> output -> Tool3
In practice, chained passing does work, but there is a real probability that the input parameters do not match expectations, and this is hard to control. Models do show some self-repair capability: when a large model finds an error in a parameter, it may call the previous tool it depends on to verify the parameter's correctness. This repairs the situation some of the time, but not always; sometimes the model persistently insists that the parameter is correct.
Within a single question, chained passing usually performs well. But for a second query in the same session, the model may fail to carry over all the information from the interfaces called for the first query, or it may make a wrong judgment.
● Practice scenario 1:
Tool1: Query the associated upstream and downstream of a service and return the results.
[
    {
        "id": "apm@apm.service:dc7495605d8395b3788f9b54defcb826",
        "type": "upstream",
        "name": "gateway"
    },
    {
        "id": "apm@apm.service:97cd7e28783143028d7e8e9c8dbce99c",
        "type": "downstream",
        "name": "checkout"
    }
]
Then, some other analysis… omitted.
Later in the same conversation context, a second query asks: "Please help me query the information about the checkout service." The model then passes just "checkout" to Tool2 (whose only parameter is an ID, explained in detail in the prompt comment). To pass the parameter correctly, the second conversation must copy the full ID apm@apm.service:97cd7e28783143028d7e8e9c8dbce99c. One mitigation sketch follows.
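One way to make the prompt comment carry this requirement, sketched with a hypothetical tool name:

from pydantic import Field

def describe_service(
    service_id: str = Field(
        ...,
        description=(
            'Full entity ID, e.g. "apm@apm.service:97cd7e28783143028d7e8e9c8dbce99c". '
            'Never pass a bare service name such as "checkout".'
        ),
    ),
) -> dict:
    ...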
● Practice scenario 2:
One tool returns a list of trace IDs:
[
    "93b9111f425a4b6cab6548aa68d18f04",
    "e165fede431c4c178d1e7e399c92b70d",
    "ce7d4e9f74634252969adfd62b509fd6",
    "e824356ddf4f42d884abe7c6a2d55d66"
]
Each value is a trace ID. The expectation is that the model calls the analysis interface once per trace ID, but the model sometimes passes the wrong value.
- Experience 5: First Try with Mock Data Before Starting Interface Implementation
The way the model uses an interface may not match expectations, and the interface design may need many changes before the model works the way you want, especially for chained tool calls, where extra caution is needed. When designing an MCP interface, first define the tool's signature, then write the tools' annotation descriptions (the guidance prompt), mock some fake data in the expected shape, and return it directly. Keep adjusting the interface and annotations through interaction until the fake data works as expected, and only then start implementing the interface. Otherwise, if the interface is implemented first, the rework rate will be extremely high. A mock-first sketch follows.
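A mock-first sketch (the tool name and fields are hypothetical; the returned data simply mirrors the expected shape from the scenario above):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mock-first-demo")

@mcp.tool()
def list_service_dependencies(service_id: str) -> list[dict]:
    """Query the upstream and downstream services of the given service.

    service_id must be the full entity ID, e.g. "apm@apm.service:<hash>".
    """
    # Hard-coded mock data: iterate on the signature and description until the
    # model calls this correctly, then replace the body with the real backend.
    return [
        {"id": "apm@apm.service:mock-upstream-0001", "type": "upstream", "name": "gateway"},
        {"id": "apm@apm.service:mock-downstream-0001", "type": "downstream", "name": "checkout"},
    ]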
- Experience 6: Sometimes Models Show "Blind Confidence" and "Pretend to Work"; Appropriately Lower the Temperature
In one such case, the first issue was incorrect transmission of the trace ID, as mentioned before (Experience 4). After that came a time-parameter error (Experience 2). Once the error occurred, the model did not surface the problem; instead it quietly generated fake data based on the input and output of some correct interface calls, like a student fabricating data in a paper.
The upstream and downstream data here are also fake, as can be seen from the IDs. Lowering the temperature alleviates this problem but does not solve it completely. MCP ultimately relies on a generative model, and whether a generative model is suitable for interfaces in serious, strict scenarios remains to be confirmed.
In addition, we observed that if similar entities with different IDs are asked about multiple times in the same conversation context, the model may not call the interface at all; it directly generates the tool's input parameters and suggests that the user make the call. (Who is the boss here??)
In another scenario, the model is overly influenced by the context and simply reuses the last output, so the associations and upstream/downstream of the new entity come out identical to those of the previously queried entity. (Fool me, right?) A sketch of lowering the temperature follows.
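Where the client exposes it, the temperature can be lowered on the model call itself. A sketch against an OpenAI-compatible endpoint (the base URL, API key, and model name are placeholders):

from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="your-model-name",   # placeholder model name
    temperature=0.2,           # lower temperature to curb "blind confidence"
    messages=[{"role": "user", "content": "Analyze trace 93b9111f425a4b6cab6548aa68d18f04"}],
)
print(resp.choices[0].message.content)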
4. Conclusion
MCP is an open protocol for standardized LLM context interaction, comprising components such as hosts, clients, servers, and data sources, and supporting connections to both local and remote services. Users need to configure a client that supports MCP (such as Cherry Studio) and an LLM API (such as Alibaba Cloud Model Studio), and can add official or custom MCP servers. Alibaba Cloud observability 2.0 unifies multimodal data interaction through UModel, solving the observation challenges caused by system complexity and upgrading from passive monitoring to active prediction.
- Considerations and Reflections on MCP Server Design
As a medium and tool for integrating humans, LLMs, and business scenarios, MCP is the sprout of a new generation of interactive tools. However, there are still many considerations in designing for it. Judge a scenario against criteria such as independence, conciseness, tolerance, robustness, and security; if all items are judged "Yes", it is very well suited to MCP.
Some unsuitable scenarios can be transformed into suitable ones. For example, a stock trading system can avoid security issues by not providing a trading interface and only offering the ability to query stock codes. By providing the query while not returning sensitive stock details, the system guides the model to output a report with bullish or bearish views on the stock and the reasoning behind them, which improves tolerance.
It is also apparent that MCP is naturally suited to short-term, straightforward, fast-paced business needs. For very complex tasks, MCP interfaces may not be sufficient; they call for A2A plus more advanced collaboration and understanding of ultra-long context.
- Some Excellent MCP Application Scenarios
- Various search-engine MCPs, such as Bing search, GitHub search, and cloud drive search
Reasons: independent, concise (return Top 5 related results), tolerant (to a certain extent), robust, and secure.
- AutoNavi Map (AMAP) MCP for generating tour routes
Reasons: independent, tolerant (different each time, even with a sense of freshness), robust (supported by AMAP’s stable system), and secure.
- Automated web-browser crawling, statistics, and analysis
Reasons: independent, concise (with output constraints), tolerant, and robust.
- FileSystem
Reasons: independent, concise, tolerant (LLMs understand the Linux CLI well), and secure (without rm capabilities).
- Redis
Reasons: independent, concise, and tolerant (non-serious Redis databases).
Future Outlook for Observability 2.0 + AI
MCP is suitable for short-term, straightforward, fast-paced independent interface capabilities; for complex requirements, A2A or more powerful modes are more appropriate. As more AI capabilities are integrated, there is much more to explore for Observability 2.0 + AI beyond MCP.