<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dat Tran</title>
    <description>The latest articles on DEV Community by Dat Tran (@datitran).</description>
    <link>https://dev.to/datitran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F766568%2F6fc4c5fb-0bca-4d70-ab5c-2a0934cb2058.jpeg</url>
      <title>DEV Community: Dat Tran</title>
      <link>https://dev.to/datitran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datitran"/>
    <language>en</language>
    <item>
      <title>Secure MCP Server with NGINX + Supergateway + Render</title>
      <dc:creator>Dat Tran</dc:creator>
      <pubDate>Mon, 19 May 2025 06:54:17 +0000</pubDate>
      <link>https://dev.to/datitran/secure-mcp-server-with-nginx-supergateway-render-4i80</link>
      <guid>https://dev.to/datitran/secure-mcp-server-with-nginx-supergateway-render-4i80</guid>
      <description>&lt;p&gt;Joint article by &lt;a href="https://www.linkedin.com/in/dat-tran-a1602320/" rel="noopener noreferrer"&gt;Dat Tran&lt;/a&gt; (Partner &amp;amp; CTO at DATANOMIQ) and &lt;a href="https://www.linkedin.com/in/alexanderlammers/" rel="noopener noreferrer"&gt;Dr. Alexander Lammers&lt;/a&gt; (Chief Data Scientist at DATANOMIQ)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtv2zvd0q2bon2qvcbkb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtv2zvd0q2bon2qvcbkb.jpg" alt="Security Image" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://pixabay.com/users/tungnguyen0905-17946924/" rel="noopener noreferrer"&gt;tungnguyen0905&lt;/a&gt; on Pixabay&lt;/p&gt;

&lt;p&gt;Model Context Protocol (MCP) is the new open standard for connecting AI assistants to external tools and data. Typically, you run it over one of two transports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stdio (Standard Input/Output): Used when Client and Server run on the same machine. This is simple and effective for local integrations (e.g., accessing local files or running a local script).&lt;/li&gt;
&lt;li&gt;HTTP via SSE (Server-Sent Events): The Client connects to the Server via HTTP. After an initial setup, the Server can push messages (events) to the Client over a persistent connection using the SSE standard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most MCP servers at the moment run locally over stdio, but this is not ideal if you have a shared service that you want to deploy on a server. In this article, we explain how to do that and how to secure an MCP server with NGINX and Supergateway, deploying it on Render. Securing the endpoint is critical, as the MCP server might access sensitive data. In our example, we just show basic auth, but you can easily use OAuth as well.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;For this tutorial we use the &lt;a href="https://github.com/openbnb-org/mcp-server-airbnb" rel="noopener noreferrer"&gt;Airbnb MCP Server&lt;/a&gt; example. Normally, you can use it with Claude like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "airbnb": {
      "command": "npx",
      "args": [
        "-y",
        "@openbnb/mcp-server-airbnb"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes that you have &lt;code&gt;node&lt;/code&gt; installed and run the server locally. If you would rather have it running remotely, you can use &lt;a href="https://github.com/supercorp-ai/supergateway" rel="noopener noreferrer"&gt;Supergateway&lt;/a&gt;, which can run MCP stdio-based servers over SSE (Server-Sent Events) or WebSockets (WS) with one command.&lt;/p&gt;

&lt;p&gt;For example, you can easily expose the MCP service over SSE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx -y supergateway \
    --stdio "npx -y @openbnb/mcp-server-airbnb" \
    --port 8000 --baseUrl http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Claude, you can consume it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx -y supergateway --sse "http://localhost:8000/sse"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you get the point by now. Since we can easily convert stdio to SSE and vice versa, we can deploy the service to any cloud provider and then use it in Claude. For this particular service, deploying it without authentication would arguably be fine, as it only calls the Airbnb API. However, in many cases you have an API key or access to more sensitive systems, and you want the service to be secured, for example via OAuth. MCP itself provides a &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26#21-overview" rel="noopener noreferrer"&gt;specification for OAuth 2.1&lt;/a&gt;; however, this is still a draft and &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/205" rel="noopener noreferrer"&gt;there are flaws in the implementation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another method is to leverage NGINX for this. In this blog post we use NGINX with simple basic authentication (username and password) but you can easily use it with OAuth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events {
    worker_connections 1024;
}

http {
    upstream airbnb_server {
        server 127.0.0.1:5000;
    }

    # Enable error logging
    error_log /var/log/nginx/error.log debug;
    access_log /var/log/nginx/access.log;

    server {
        listen 8000;
        server_name localhost;

        # Basic Authentication
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Add a location for regular HTTP requests
        location / {
            proxy_pass http://airbnb_server;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;
            proxy_set_header X-NginX-Proxy true;
        }

        location /sse {
            proxy_pass http://airbnb_server;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;
            proxy_set_header X-NginX-Proxy true;

            # SSE specific settings
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
            proxy_send_timeout 86400s;
            keepalive_timeout 86400s;
            send_timeout 86400s;
        }
    }
} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can deploy the service to the cloud. In our case, we chose &lt;a href="https://render.com/" rel="noopener noreferrer"&gt;Render&lt;/a&gt; as it's easy to use. We just need a Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use Python base image
FROM python:3.10-slim-bookworm

# Install Node.js &amp;amp; nginx
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y \
    curl \
    nginx \
    apache2-utils \
    &amp;amp;&amp;amp; curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    &amp;amp;&amp;amp; apt-get install -y nodejs \
    &amp;amp;&amp;amp; rm -rf /var/lib/apt/lists/*


# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf

# Create password file (username "user", password "testuser1234!" - replace with your own)
RUN htpasswd -bc /etc/nginx/.htpasswd user testuser1234!

# Create log directories
RUN mkdir -p /var/log/nginx &amp;amp;&amp;amp; \
    touch /var/log/nginx/error.log &amp;amp;&amp;amp; \
    touch /var/log/nginx/access.log &amp;amp;&amp;amp; \
    chown -R www-data:www-data /var/log/nginx

# Expose ports
EXPOSE 8000 5000

CMD nginx &amp;amp;&amp;amp; npx -y supergateway --stdio "npx -y @openbnb/mcp-server-airbnb" --port 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once deployed to Render, you can go to the URL https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse. A login window will pop up, where you need to enter your username and password. Alright, this is nice, but how do we use it in Claude? Remember, we use Supergateway to convert SSE back to stdio. Supergateway also offers the possibility to pass a &lt;code&gt;--header&lt;/code&gt;, which is useful because we can pass an &lt;code&gt;Authorization&lt;/code&gt; header. But what exactly do we need to send? We can't simply send the username and password in plain text; for basic auth, &lt;code&gt;username:password&lt;/code&gt; is encoded as &lt;code&gt;base64&lt;/code&gt;. One way to obtain the header is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -v -u user:testuser1234! https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verbose output will show the header that curl sends, something like &lt;code&gt;Authorization: Basic &amp;lt;&amp;lt;your-base64-key&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;
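
&lt;p&gt;Alternatively, since basic auth simply base64-encodes &lt;code&gt;username:password&lt;/code&gt;, you can compute the value yourself (a quick sketch using the credentials from the Dockerfile above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prints the value to put after "Basic" in the Authorization header
echo -n 'user:testuser1234!' | base64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;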

&lt;p&gt;Now we can finally use it in Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "airbnb-server": {
      "command": "npx",
      "args": [
        "-y",
        "supergateway",
        "--sse",
        "https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse",
        "--header",
        "Authorization: Basic &amp;lt;&amp;lt;your-base64-key&amp;gt;&amp;gt;"
      ]
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We now have a secured MCP server with basic authentication. If you want to use OAuth instead, you simply need a provider like Auth0, Clerk, etc., and then use the &lt;code&gt;--oauth2Bearer&lt;/code&gt; flag.&lt;/p&gt;
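
&lt;p&gt;For example, the client-side Supergateway call would then look something like this (a sketch; the placeholder token is whatever your OAuth provider issues):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sends an "Authorization: Bearer ..." header instead of basic auth
npx -y supergateway \
    --sse "https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse" \
    --oauth2Bearer "&amp;lt;&amp;lt;your-access-token&amp;gt;&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;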

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Hope you enjoyed this short article. MCP is super new and a lot of things can change radically. It's far from production-ready, but hopefully you can use this for your next project. The full code is also on &lt;a href="https://github.com/DATANOMIQ/mcp-secure-server" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, which you can use directly to deploy on Render.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>programming</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Building a Fast and Efficient Semantic Search System Using OpenVINO and Postgres</title>
      <dc:creator>Dat Tran</dc:creator>
      <pubDate>Mon, 21 Oct 2024 07:37:39 +0000</pubDate>
      <link>https://dev.to/datitran/building-a-fast-and-efficient-semantic-search-system-using-openvino-and-postgres-fd6</link>
      <guid>https://dev.to/datitran/building-a-fast-and-efficient-semantic-search-system-using-openvino-and-postgres-fd6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfo2v5wwdfutt7xqd0rm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfo2v5wwdfutt7xqd0rm.jpg" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://pixabay.com/users/real-napster-323187/" rel="noopener noreferrer"&gt;real-napster&lt;/a&gt; on Pixabay&lt;/p&gt;

&lt;p&gt;In one of my recent projects, I had to build a semantic search system that could scale with high performance and deliver real-time responses for report searches. We used PostgreSQL with &lt;code&gt;pgvector&lt;/code&gt; on AWS RDS, paired with AWS Lambda, to achieve this. The challenge was to allow users to search using natural language queries instead of relying on rigid keywords, all while keeping responses under 1-2 seconds on CPU resources alone.&lt;/p&gt;

&lt;p&gt;In this post, I will walk through the steps I took to build this search system, from retrieval to reranking, and the optimizations made using &lt;a href="https://docs.openvino.ai/2024/index.html" rel="noopener noreferrer"&gt;OpenVINO&lt;/a&gt; and intelligent batching for tokenization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Semantic Search: Retrieval and Reranking
&lt;/h2&gt;

&lt;p&gt;Modern state-of-the-art search systems usually consist of two main steps: &lt;strong&gt;retrieval&lt;/strong&gt; and &lt;strong&gt;reranking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Retrieval:&lt;/strong&gt; The first step involves retrieving a subset of relevant documents based on the user query. This can be done using pre-trained embedding models, such as &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI's small and large embeddings&lt;/a&gt;, &lt;a href="https://cohere.com/embed" rel="noopener noreferrer"&gt;Cohere's Embed models&lt;/a&gt;, or &lt;a href="https://www.mixedbread.ai/docs/embeddings/overview" rel="noopener noreferrer"&gt;Mixedbread's mxbai embeddings&lt;/a&gt;. Retrieval focuses on narrowing down the pool of documents by measuring their similarity to the query.&lt;/p&gt;

&lt;p&gt;Here's a simplified example using Hugging Face's &lt;a href="https://sbert.net/" rel="noopener noreferrer"&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/a&gt; library, one of my favorite libraries for retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Load a pre-trained sentence transformer model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sample query and documents (vectorize the query and the documents)
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I fix a broken landing gear?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report 1 on landing gear failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report 2 on engine problems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Get embeddings for query and documents
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;document_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate cosine similarity between query and documents
&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve top-k most relevant documents
&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top 5 documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2) &lt;strong&gt;Reranking:&lt;/strong&gt; Once the most relevant documents have been retrieved, we further improve the ranking of these documents using a &lt;strong&gt;cross-encoder&lt;/strong&gt; model. This step re-evaluates each document in relation to the query more accurately, focusing on deeper contextual understanding.&lt;br&gt;
Reranking is beneficial because it adds an additional layer of refinement by scoring the relevance of each document more precisely.&lt;/p&gt;

&lt;p&gt;Here's a code example for reranking using &lt;code&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/code&gt;, a lightweight cross-encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;

&lt;span class="c1"&gt;# Load the cross-encoder model
&lt;/span&gt;&lt;span class="n"&gt;cross_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the cross-encoder to rerank top-k retrieved documents
&lt;/span&gt;&lt;span class="n"&gt;query_document_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_document_pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rank documents based on the new scores
&lt;/span&gt;&lt;span class="n"&gt;top_k_reranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top 5 reranked documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_k_reranked&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Identifying Bottlenecks: The Cost of Tokenization and Prediction
&lt;/h2&gt;

&lt;p&gt;During development, I found that the tokenization and prediction stages were taking quite a long time when handling 1,000 reports with the default settings of &lt;code&gt;sentence-transformers&lt;/code&gt;. This created a performance bottleneck, especially since we aimed for real-time responses.&lt;/p&gt;

&lt;p&gt;Below, I profiled my code using &lt;a href="https://jiffyclub.github.io/snakeviz/" rel="noopener noreferrer"&gt;SnakeViz&lt;/a&gt; to visualize the performance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31j3gq1k9uo7njpb3a4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31j3gq1k9uo7njpb3a4y.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the tokenization and prediction steps are disproportionately slow, leading to significant delays in serving search results. Overall, they took around 4-5 seconds on average. This is because there are blocking operations between the tokenization and prediction steps. If we also add other operations like database calls, filtering, etc., we easily ended up with 8-9 seconds in total.&lt;/p&gt;
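
&lt;p&gt;For reference, a profile like this can be produced with Python's built-in &lt;code&gt;cProfile&lt;/code&gt; and then opened in SnakeViz (a minimal sketch; &lt;code&gt;run_search&lt;/code&gt; is a placeholder for your own search entry point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cProfile

# run_search is a placeholder for the function doing retrieval + reranking
cProfile.run("run_search('How do I fix a broken landing gear?')", "search.prof")

# Then visualize the profile in the browser:
#   snakeviz search.prof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;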

&lt;h2&gt;
  
  
  Optimizing Performance with OpenVINO
&lt;/h2&gt;

&lt;p&gt;The question I faced was: &lt;strong&gt;Can we make it faster?&lt;/strong&gt; The answer is yes, by leveraging &lt;strong&gt;OpenVINO&lt;/strong&gt;, an optimized backend for CPU inference. OpenVINO accelerates deep learning model inference on Intel hardware, which is what we use on AWS Lambda.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Example for OpenVINO Optimization&lt;/strong&gt;&lt;br&gt;
Here’s how I integrated OpenVINO into the search system to speed up inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openvino.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_openvino_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.xml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compiled_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;compiled_model&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;compiled_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
    &lt;span class="n"&gt;all_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Split results into batches
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;batch_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch_results&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longest_first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;np&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract input tensors (convert to NumPy arrays)
&lt;/span&gt;        &lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;attention_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token_type_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;infer_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compiled_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_infer_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;infer_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;token_type_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;all_logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;all_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_logits&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_search_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Usually you would fetch the data from a database
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cnbc_headlines.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Headlines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Headlines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Load the model and rerank
&lt;/span&gt;    &lt;span class="n"&gt;openvino_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_openvino_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder-openvino-model/model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rerank_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openvino_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add the rerank scores to the DataFrame and sort by the new scores
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rerank_scores&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetch search results with reranking using OpenVINO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--search_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The search text to use for reranking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_search_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this approach, we got a &lt;code&gt;2-3x&lt;/code&gt; speedup, reducing the original 4-5 seconds to 1-2 seconds. The full working code is on &lt;a href="https://github.com/datitran/reranker_openvino" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
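
&lt;p&gt;Note that the script assumes the cross-encoder has already been converted to OpenVINO IR (the &lt;code&gt;model.xml&lt;/code&gt;/&lt;code&gt;model.bin&lt;/code&gt; pair it reads). One way to produce such an export is Hugging Face's Optimum integration for OpenVINO; a sketch (output file names may differ, so adjust the path the script reads accordingly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "optimum[openvino]"

# Export the Hugging Face model to OpenVINO IR (.xml/.bin)
optimum-cli export openvino \
    --model cross-encoder/ms-marco-TinyBERT-L-2-v2 \
    --task text-classification \
    cross-encoder-openvino-model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;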

&lt;h2&gt;
  
  
  Fine-Tuning for Speed: Batch Size and Tokenization
&lt;/h2&gt;

&lt;p&gt;Another critical factor in improving performance was optimizing the &lt;strong&gt;tokenization&lt;/strong&gt; process and adjusting the &lt;strong&gt;batch size&lt;/strong&gt; and &lt;strong&gt;token length&lt;/strong&gt;. By batching the inputs (e.g. &lt;code&gt;batch_size=16&lt;/code&gt;) and capping the token length, we could parallelize the tokenization and reduce the overhead of repetitive operations. In our experiments, we found that a &lt;code&gt;batch_size&lt;/code&gt; between &lt;code&gt;16&lt;/code&gt; and &lt;code&gt;64&lt;/code&gt; worked well, with anything larger degrading performance. Similarly, we settled on a &lt;code&gt;max_length&lt;/code&gt; of &lt;code&gt;128&lt;/code&gt; instead of the &lt;code&gt;512&lt;/code&gt; used in the script above, which is viable if the average length of your reports is relatively short. With these changes, we achieved an overall &lt;code&gt;8x&lt;/code&gt; speed-up, reducing the reranking time to under 1 second, even on CPU.&lt;/p&gt;

&lt;p&gt;In practice, this meant experimenting with different batch sizes and token lengths to find the right balance between speed and accuracy for our data. By doing so, we saw significant improvements in response times, making the search system scalable even with 1,000+ reports.&lt;/p&gt;
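
&lt;p&gt;A simple way to run such an experiment is to time the &lt;code&gt;rerank&lt;/code&gt; function over a range of batch sizes (a sketch reusing &lt;code&gt;openvino_model&lt;/code&gt;, &lt;code&gt;tokenizer&lt;/code&gt;, and &lt;code&gt;texts&lt;/code&gt; from the script above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

# Time the reranking step for different batch sizes
for batch_size in [8, 16, 32, 64, 128]:
    start = time.perf_counter()
    rerank(openvino_model, "landing gear failure", texts, tokenizer, batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;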

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By using OpenVINO and optimizing tokenization and batching, we were able to build a high-performance semantic search system that meets real-time requirements on a CPU-only setup. In fact, we achieved an &lt;code&gt;8x&lt;/code&gt; speedup overall. The combination of retrieval using &lt;code&gt;sentence-transformers&lt;/code&gt; and reranking with a cross-encoder model creates a powerful, user-friendly search experience.&lt;/p&gt;

&lt;p&gt;If you’re building similar systems with constraints on response time and computational resources, I highly recommend exploring OpenVINO and intelligent batching to unlock better performance.&lt;/p&gt;

&lt;p&gt;Hopefully, you enjoyed this article. If you found it useful, give me a like so others can find it too, and share it with your friends. Follow me on &lt;a href="https://www.linkedin.com/in/dat-tran-a1602320/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to stay up-to-date with my work. Thanks for reading!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>postgres</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
