Hello! I'm Sakasegawa (https://x.com/gyakuse_en)!
About This Article
Today, I will introduce MCP (Model Context Protocol), announced by Anthropic (https://modelcontextprotocol.io/), and show you how to create your own MCP server and how to use it with LLMs other than Claude. I chose to cover MCP because it embodies concepts that are important for building agent systems, which have been attracting a lot of attention recently.
What You Will Gain From Today's Article
- Understanding of MCP
- Specific implementation methods
After reading today's article, you'll be able to create a video recommendation app using MCP+ChatGPT+YouTube.
Before MCP
Now, before we get into what MCP is, let's think about AI agents.
MCP is actually a fairly simple protocol (so simple that everyone seems to be building something like it, and it is quite unclear whether MCP itself will become widespread), and it is something you naturally end up wanting once you start thinking about AI agents.
For example, an AI agent specialized in weather would understand various tools related to weather, select the appropriate tool in response to a user's command (e.g., "Tell me today's weather"), execute the tool, and provide an answer based on the information obtained.
At this point, you may be reminded of Function Calling, which lets a model select the appropriate tool from a user's natural-language instruction. MCP arises from the motivation that the tools themselves need to be properly defined in the first place.
A function-calling definition is a set consisting of a function name, a description, required parameters, and expected return values. For a weather-forecast function, the location and date would be defined as parameters, and the forecast information as the expected return value. However, this definition does not cover the part where the actual weather API is called. So we need a definition of the tool itself. How should we define it? (Correction: many function-calling APIs do not actually specify the function's return value; the main thing is defining the function and its required parameters.)
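For illustration, a weather-forecast definition in the OpenAI function-calling style might look like this (the name `get_weather` and its fields are hypothetical):

```python
# Hypothetical function-calling definition for a weather tool.
# It specifies the name, description, and parameters, but says
# nothing about how the weather API is actually called.
get_weather_def = {
    "name": "get_weather",
    "description": "Get the weather forecast for a location and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "date": {"type": "string", "description": "Date in YYYY-MM-DD"},
        },
        "required": ["location", "date"],
    },
}
```

Note that nothing here binds `get_weather` to an actual implementation; that gap is exactly what a tool server fills.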
A tool server that summarizes weather information would ideally do the following:
- When connected to the server, it tells you:
  - A list of available tools
  - The parameters and return values defined for each tool
  - Workflow-like bundles that combine tools sensibly
- When you send an appropriate request to a tool or workflow, it executes and returns the results nicely
  - For weather, it would be nice if images were returned along with text
MCP is a protocol that codifies what such a tool server should ideally look like.
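As a rough sketch of what such a protocol exchange looks like: MCP messages are JSON-RPC 2.0, and a `tools/list` request/response pair might look roughly like this (the tool shown is illustrative):

```python
# Illustrative JSON-RPC 2.0 messages for MCP's tools/list exchange.
# The client asks the server which tools it offers:
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# The server answers with tool names, descriptions, and input schemas:
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_forecast",
                "description": "Get the weather forecast.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            }
        ]
    },
}
```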
What is MCP?
Model Context Protocol (MCP) is an open protocol that enables seamless integration between LLM applications and external data sources or tools. Whether you're building an AI-powered IDE, enhancing a chat interface, or creating custom AI workflows, MCP provides a standardized way to connect and provide the necessary context for LLMs.
MCP is an open protocol and can be integrated into any application. IDEs, AI tools, and other software can connect using MCP in a standardized way for local integration.
Quoted from https://modelcontextprotocol.io/introduction
The above explanation might be a bit difficult to understand, but as explained earlier, you can think of it as a mechanism to provide various tools to LLMs.
Terminology Used in MCP
- MCP Hosts: Programs that want to access resources through MCP, such as Claude Desktop, IDEs, and AI tools.
  - In short, the front end of the interaction with the user.
- MCP Clients: Protocol clients that maintain a one-to-one connection with a server.
  - A service layer that mediates the connection between the host and the server.
- MCP Servers: Lightweight programs, each providing specific functionality through the standardized Model Context Protocol.
  - The actual servers.
- Local Resources: Resources on your computer, such as databases, files, and services, that the MCP server can safely access.
  - Resources handled locally by the server.
- Remote Resources: Resources on the internet that the MCP server connects to via APIs.
  - Resources in the cloud.
MCP Servers
Now, let's take a closer look at MCP servers.
An MCP server provides the following:
- Resources
  - Logs, images, API responses (if the MCP server communicates with an external API, for example), etc.
- Prompts
  - A mechanism for interactively running workflows (single or multiple tool executions).
  - From the MCP client there are two cases: tools are called directly, or they are called via prompts.
- Tools
  - The actual implementation of the execution.
  - The MCP client can retrieve the set of tools with `tools/list`.
- Notifications
  - The server notifies clients when resources, prompts, tools, etc. are added.
  - If you watch resources, you can conveniently surface progress information like "here is what is happening right now".
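On the wire these notifications are ordinary JSON-RPC notification messages. As a sketch (the method name follows my reading of the spec), a tool-list-change notification looks like:

```python
# Illustrative JSON-RPC notification an MCP server sends when its
# tool list changes. Notifications carry no "id" because no reply
# is expected from the client.
notification = {
    "jsonrpc": "2.0",
    "method": "notifications/tools/list_changed",
}
```

On receiving this, a client would typically re-issue `tools/list` to refresh its view of the server.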
Example of MCP Server Implementation
Implementing Tools
Here's a sample implementation. All you have to do is write the implementation under the `if name == ...` branch in `call_tool` and add the tool to `list_tools`.
```python
import mcp.types as types
from mcp.server import Server

app = Server("example-server")

@app.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="calculate_sum",
            description="Add two numbers together",
            inputSchema={
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"}
                },
                "required": ["a", "b"]
            }
        )
    ]

@app.call_tool()
async def call_tool(
    name: str,
    arguments: dict
) -> list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    if name == "calculate_sum":
        a = arguments["a"]
        b = arguments["b"]
        result = a + b
        return [types.TextContent(type="text", text=str(result))]
    raise ValueError(f"Tool not found: {name}")
```
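To see the tool in action, here is a hedged sketch of what a `tools/call` request for `calculate_sum` carries and how the handler above turns it into text content (the message shapes are illustrative):

```python
# Illustrative tools/call request for the calculate_sum tool.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "calculate_sum",
        "arguments": {"a": 1, "b": 2},
    },
}

# The handler adds a and b and wraps the result as text content:
args = request["params"]["arguments"]
result_text = str(args["a"] + args["b"])
response_content = [{"type": "text", "text": result_text}]  # text is "3"
```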
Concrete Example of SQLite MCP Server
Let's consider an SQLite MCP server to better understand.
This SQLite server contains sales data for a store.
Pre-stage
- Perform a handshake with the MCP server and get `tools/list`, etc.
  - It might be a good idea to dynamically convert `tools/list` and `prompts/list` into function-calling definitions.
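For reference, the handshake is MCP's `initialize` exchange. A hedged sketch of the client's side (the field values here are illustrative):

```python
# Rough sketch of the MCP handshake the client performs before
# asking for tools/list. Values are illustrative.
initialize_request = {
    "jsonrpc": "2.0",
    "id": 0,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "example-host", "version": "0.0.1"},
    },
}

# After the server replies and the client acknowledges with the
# "initialized" notification, the client can request the tool list:
tools_list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
```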
Usage Stage
- Tool detection on the MCP host
  - The host application analyzes the request content.
- Function Calling
  - It determines that the request corresponds to the operation "Get top-selling products".
  - It infers that the tool `top-selling` should be applied.
- Parameter validation
  - The tool `top-selling` requires the following arguments:
    - `date`: Date of the sales data (required)
    - `limit`: Number of top products (optional, default is 5)
  - The MCP host obtains or completes these arguments from the user request or conversation history.
    - Example: if `date` is not specified, use today's date as the default value.
- Generating and sending SQL statements
  - The MCP host generates the appropriate SQL statement.
  - This SQL statement is sent to the MCP endpoint corresponding to the tool `top-selling`.
  - The MCP server endpoint will be `tools/call`.
- Processing on the MCP server
  - The server receives the `tools/call` request.
    - Tool name: `top-selling`
    - Arguments: `{"date": "2024-12-01", "limit": 5}`
  - The server validates the request and executes the corresponding SQL statement.
  - It queries the SQLite database internally and gets the results.
- Returning results to the host
  - The MCP server returns the query results to the host in JSON format.
- Processing results on the host
  - The MCP host formats the received data and responds to the user in natural language.
  - Example: "Today's top-selling products are: 1st Apples (5,000 yen), 2nd Gorillas (4,500 yen), 3rd Trumpets (3,000 yen)..."
And that is roughly how an MCP server fits into the flow.
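As an illustration of the server-side step, here is a hypothetical implementation of the `top-selling` tool's body using Python's `sqlite3` (the table layout and column names are my own invention):

```python
import sqlite3

# Hypothetical body of the "top-selling" tool: given a date and a
# limit, return the products with the highest sales on that date.
def top_selling(conn: sqlite3.Connection, date: str, limit: int = 5):
    cur = conn.execute(
        "SELECT product, SUM(price) AS total FROM sales "
        "WHERE date = ? GROUP BY product ORDER BY total DESC LIMIT ?",
        (date, limit),
    )
    return cur.fetchall()

# In-memory demo data matching the example answer above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, product TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-12-01", "Apples", 5000),
        ("2024-12-01", "Gorillas", 4500),
        ("2024-12-01", "Trumpets", 3000),
    ],
)
print(top_selling(conn, "2024-12-01", limit=2))
# → [('Apples', 5000), ('Gorillas', 4500)]
```

The MCP server would wrap this result as text or JSON content in its `tools/call` response.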
LLMs Other Than Claude + MCP Server
Finally, let's actually create a simple MCP server and run it using an LLM other than Claude. It's a mystery whether it will work well with function calling. I hope it does. For now, let's implement it.
Creating an MCP Server
https://modelcontextprotocol.io/docs/first-server/python
https://github.com/modelcontextprotocol/create-python-server
This time, we will use the YouTube Data API v3 to create a tool that searches YouTube.
```shell
uvx create-mcp-server
```

I named it `mcp_server_youtube`.

- A directory named `mcp_server_youtube` will be created.
- The server implementation goes in `mcp_server_youtube/src/mcp_server_youtube/server.py`.
Implementation
The implementation of the MCP server was mostly done using gpt-4o.
- Key points
  - Since only the `youtube-search` tool is registered on this server, processing happens only when the request arriving at `handle_call_tool` matches `youtube-search`.
  - The YouTube Data API v3 is simply wrapped as-is.
- Benefits of this
  - If you're just calling an API from function calling, you don't need an MCP server. However, packaging it as an independent MCP server makes it easier to reuse.
```python
import asyncio
from googleapiclient.discovery import build
from mcp.server.models import InitializationOptions
import mcp.types as types
from mcp.server import NotificationOptions, Server
import mcp.server.stdio
from pydantic import AnyUrl
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set YouTube Data API key
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")
if not YOUTUBE_API_KEY:
    raise ValueError("YouTube API Key is not set in the environment variables.")

youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)

# Initialize MCP server
server = Server("mcp_server_youtube")

# Resource (server state) to hold video search results
search_results = {}

# Function to search YouTube videos using YouTube Data API
def search_youtube_videos(query: str, max_results: int = 5):
    """
    Search YouTube videos and return the results.
    """
    request = youtube.search().list(
        q=query,
        part="snippet",
        type="video",
        maxResults=max_results
    )
    response = request.execute()

    # Format and return video information
    videos = []
    for item in response["items"]:
        video_info = {
            "title": item["snippet"]["title"],
            "description": item["snippet"]["description"],
            "channel": item["snippet"]["channelTitle"],
            "url": f"https://www.youtube.com/watch?v={item['id']['videoId']}"
        }
        videos.append(video_info)
    return videos

# Endpoint to list tools
@server.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    """
    Provide a list of tools.
    """
    return [
        types.Tool(
            name="youtube-search",
            description="Search for YouTube videos by query.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query for YouTube videos."
                    },
                    "max_results": {
                        "type": "integer",
                        "description": "Maximum number of results to return.",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        )
    ]

# Endpoint to execute tools
@server.call_tool()
async def handle_call_tool(
    name: str, arguments: dict
) -> list[types.TextContent | types.EmbeddedResource]:
    """
    Execute the tool to perform a YouTube search.
    """
    if name != "youtube-search":
        raise ValueError(f"Unknown tool: {name}")

    # Get arguments
    query = arguments.get("query")
    max_results = arguments.get("max_results", 5)

    # Perform YouTube search
    videos = search_youtube_videos(query, max_results)

    # Save results to resources
    search_results[query] = videos

    # Return results to the client
    results_text = "\n".join(
        [f"{video['title']} ({video['url']}) - {video['channel']}" for video in videos]
    )
    return [
        types.TextContent(
            type="text",
            text=f"Search results for '{query}':\n{results_text}"
        )
    ]

# Endpoint to list resources
@server.list_resources()
async def handle_list_resources() -> list[types.Resource]:
    """
    List search results as resources.
    """
    return [
        types.Resource(
            uri=AnyUrl(f"youtube-search://{query}"),
            name=f"Search Results for '{query}'",
            description=f"Search results for query: '{query}'",
            mimeType="text/plain",
        )
        for query in search_results
    ]

# Endpoint to read resources
@server.read_resource()
async def handle_read_resource(uri: AnyUrl) -> str:
    """
    Return the results for the specified search query.
    """
    query = uri.path.lstrip("/")
    if query not in search_results:
        raise ValueError(f"No results found for query: {query}")

    # Format and return results
    videos = search_results[query]
    return "\n".join(
        [f"{video['title']} ({video['url']}) - {video['channel']}" for video in videos]
    )

# Main function
async def main():
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await server.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="mcp_server_youtube",
                server_version="0.0.1",
                capabilities=server.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

# Execution
if __name__ == "__main__":
    asyncio.run(main())
```
Implementing the Host Side
Next, let's implement the host side using gradio.
This was also implemented with the support of gpt-4o.
- Key points
  - Intent detection (the MCP-host role) using gpt-4o's Function Calling
  - The MCP-client part is handled by `mcp_search_youtube`
- If you want to go a bit further
  - Given a list of MCP servers to subscribe to, perform a handshake at startup, obtain `tools/list`, etc., and dynamically generate the function-calling definitions.
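As a sketch of that "go a bit further" idea: MCP tool definitions already use JSON Schema, just like OpenAI function definitions, so the conversion can be mechanical. The helper name below is hypothetical, and the renaming mirrors how the host code registers the tool as `youtube_search` with an underscore:

```python
# Hypothetical helper: convert MCP tool definitions (as returned by
# tools/list) into OpenAI function-calling definitions dynamically.
def mcp_tools_to_functions(tools: list[dict]) -> list[dict]:
    return [
        {
            # Use underscores to match the name style the host registers
            "name": t["name"].replace("-", "_"),
            "description": t.get("description", ""),
            "parameters": t["inputSchema"],
        }
        for t in tools
    ]

# Example: the youtube-search tool definition from the server above.
mcp_tools = [
    {
        "name": "youtube-search",
        "description": "Search for YouTube videos by query.",
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]
functions = mcp_tools_to_functions(mcp_tools)
```

With this in place, the `functions=` argument in the host could be generated at startup instead of hard-coded.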
```python
import gradio as gr
import openai
import json
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from dotenv import load_dotenv
import os
import asyncio

# Load environment variables
load_dotenv()

# OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# MCP server settings
SERVER_COMMAND = os.getenv("PYTHON_PATH")  # /path/to/venv/bin/python
SERVER_SCRIPT = os.getenv(
    "MCP_SERVER_PATH"
)  # /path/to/mcp_server_youtube/src/mcp_server_youtube/server.py
print(SERVER_COMMAND, SERVER_SCRIPT)

# MCP client's YouTube search tool function
def mcp_search_youtube(query: str, max_results: int = 3):
    """
    Search YouTube via the MCP server (synchronous version)
    """
    async def async_search():
        server_params = StdioServerParameters(
            command=SERVER_COMMAND, args=[SERVER_SCRIPT], env=None
        )
        # Asynchronous connection to the MCP server
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                result = await session.call_tool(
                    "youtube-search",
                    arguments={"query": query, "max_results": max_results},
                )
                return result.content[0].text

    # Run asynchronous processing synchronously
    return asyncio.run(async_search())

# Gradio chat function
def chat_with_mcp(history, user_input):
    """
    Implements conversation with ChatGPT (synchronous version)
    """
    # Initialize history on first call
    if history is None:
        history = []

    # Build current conversation history
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant that can search YouTube videos or have normal conversations.",
        }
    ]

    # Convert from Gradio history to conversation history
    for msg in history:
        conversation.append({"role": msg["role"], "content": msg["content"]})

    # Add user input
    conversation.append({"role": "user", "content": user_input})

    # Call ChatGPT API
    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        functions=[
            {
                "name": "youtube_search",
                "description": "Search YouTube videos using a query.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search query for YouTube.",
                        },
                        "max_results": {
                            "type": "integer",
                            "description": "Maximum number of results to return.",
                            "default": 3,
                        },
                    },
                    "required": ["query"],
                },
            }
        ],
        function_call="auto",
    )

    # Get response
    message = completion.choices[0].message
    print(message)

    # In case of Function Calling
    if message.function_call:
        print("Function Calling")
        function_call = message.function_call
        function_name = function_call.name
        print(f"Function Name: {function_name}")
        function_args = json.loads(function_call.arguments)
        if function_name == "youtube_search":
            # Call MCP tool
            query = function_args["query"]
            max_results = function_args.get("max_results", 3)

            # Get function call results
            search_results = mcp_search_youtube(query, max_results)
            print("Search Results:")
            print(search_results)

            # Add the assistant's function call, then the function result,
            # to the conversation (the API expects the call before the result)
            conversation.append(message)
            conversation.append(
                {
                    "role": "function",
                    "name": function_name,
                    "content": search_results,
                }
            )

            # Call the API again to get the final response
            completion = openai.chat.completions.create(
                model="gpt-4o",
                messages=conversation,
            )
            assistant_response = completion.choices[0].message.content

            # Update history
            history.append({"role": "user", "content": user_input})
            history.append({"role": "assistant", "content": assistant_response})
            return history
        else:
            # For unknown functions
            assistant_response = f"Unknown function: {function_name}"
            history.append({"role": "user", "content": user_input})
            history.append({"role": "assistant", "content": assistant_response})
            return history
    else:
        # For normal responses
        assistant_response = message.content
        history.append({"role": "user", "content": user_input})
        history.append({"role": "assistant", "content": assistant_response})
        return history

# Gradio interface
with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages")
    with gr.Row():
        txt = gr.Textbox(show_label=False, placeholder="Type your message here...")
        submit_btn = gr.Button("Submit")
    submit_btn.click(chat_with_mcp, [chatbot, txt], chatbot)
    txt.submit(chat_with_mcp, [chatbot, txt], chatbot)

# Run the app
demo.launch()
```
Summary
- MCP is a way to define tools effectively.
- Implementation is relatively quick with the help of LLMs, so go ahead and build your own.