<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dat Tran</title>
    <description>The latest articles on DEV Community by Dat Tran (@datitran).</description>
    <link>https://dev.to/datitran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F766568%2F6fc4c5fb-0bca-4d70-ab5c-2a0934cb2058.jpeg</url>
      <title>DEV Community: Dat Tran</title>
      <link>https://dev.to/datitran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datitran"/>
    <language>en</language>
    <item>
      <title>Secure MCP Server with NGINX + Supergateway + Render</title>
      <dc:creator>Dat Tran</dc:creator>
      <pubDate>Mon, 19 May 2025 06:54:17 +0000</pubDate>
      <link>https://dev.to/datitran/secure-mcp-server-with-nginx-supergateway-render-4i80</link>
      <guid>https://dev.to/datitran/secure-mcp-server-with-nginx-supergateway-render-4i80</guid>
      <description>&lt;p&gt;Joint article by &lt;a href="https://www.linkedin.com/in/dat-tran-a1602320/" rel="noopener noreferrer"&gt;Dat Tran&lt;/a&gt; (Partner &amp;amp; CTO at DATANOMIQ) and &lt;a href="https://www.linkedin.com/in/alexanderlammers/" rel="noopener noreferrer"&gt;Dr. Alexander Lammers&lt;/a&gt; (Chief Data Scientist at DATANOMIQ)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtv2zvd0q2bon2qvcbkb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtv2zvd0q2bon2qvcbkb.jpg" alt="Security Image" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
Photo by &lt;a href="https://pixabay.com/users/tungnguyen0905-17946924/" rel="noopener noreferrer"&gt;tungnguyen0905&lt;/a&gt; on Pixabay&lt;/p&gt;

&lt;p&gt;Model Context Protocol (MCP) is the new open standard for connecting AI assistants to external tools and data. Typically, you run it over one of two transports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stdio (Standard Input/Output): Used when Client and Server run on the same machine. This is simple and effective for local integrations (e.g., accessing local files or running a local script).&lt;/li&gt;
&lt;li&gt;HTTP via SSE (Server-Sent Events): The Client connects to the Server via HTTP. After an initial setup, the Server can push messages (events) to the Client over a persistent connection using the SSE standard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most MCP servers at the moment run locally over stdio, but this is not ideal if you have a shared service that you want to deploy on a server. In this article, we explain how to do that and how to secure an MCP server with NGINX and Supergateway, deploying it on Render. Securing the endpoint is critical, as the MCP server might access sensitive data. In our example, we just show basic auth, but you can easily use OAuth as well.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;For this tutorial we use the &lt;a href="https://github.com/openbnb-org/mcp-server-airbnb" rel="noopener noreferrer"&gt;Airbnb MCP Server&lt;/a&gt; example. Normally, you can use it with Claude like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "airbnb": {
      "command": "npx",
      "args": [
        "-y",
        "@openbnb/mcp-server-airbnb"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes that you have &lt;code&gt;node&lt;/code&gt; installed and run the server locally. If you would rather have it running remotely, you can use &lt;a href="https://github.com/supercorp-ai/supergateway" rel="noopener noreferrer"&gt;Supergateway&lt;/a&gt;, which can run MCP stdio-based servers over SSE (Server-Sent Events) or WebSockets (WS) with one command.&lt;/p&gt;

&lt;p&gt;For example, you can easily expose the MCP service over SSE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx -y supergateway \
    --stdio "npx -y @openbnb/mcp-server-airbnb" \
    --port 8000 --baseUrl http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in Claude, you can consume it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx -y supergateway --sse "http://localhost:8000/sse"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you get the point by now. Since we can easily convert stdio to SSE and vice versa, we can deploy the service to any cloud provider and then use it in Claude. For this particular service, deploying it without authentication would arguably be fine, as it only calls the Airbnb API. However, in many cases you have an API key or access to more sensitive systems, and you want the service to be secured, for example via OAuth. MCP itself provides a &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26#21-overview" rel="noopener noreferrer"&gt;specification for OAuth 2.1&lt;/a&gt;; however, this is still a draft and &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/205" rel="noopener noreferrer"&gt;there are flaws in the implementation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another method is to leverage NGINX for this. In this blog post we use NGINX with simple basic authentication (username and password) but you can easily use it with OAuth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events {
    worker_connections 1024;
}

http {
    upstream airbnb_server {
        server 127.0.0.1:5000;
    }

    # Enable error logging
    error_log /var/log/nginx/error.log debug;
    access_log /var/log/nginx/access.log;

    server {
        listen 8000;
        server_name localhost;

        # Basic Authentication
        auth_basic "Restricted Access";
        auth_basic_user_file /etc/nginx/.htpasswd;

        # Add a location for regular HTTP requests
        location / {
            proxy_pass http://airbnb_server;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;
            proxy_set_header X-NginX-Proxy true;
        }

        location /sse {
            proxy_pass http://airbnb_server;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $host;
            proxy_set_header X-NginX-Proxy true;

            # SSE specific settings
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
            proxy_send_timeout 86400s;
            keepalive_timeout 86400s;
            send_timeout 86400s;
        }
    }
} 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can deploy the service to the cloud. In our case, we chose &lt;a href="https://render.com/" rel="noopener noreferrer"&gt;Render&lt;/a&gt; as it's easy to use. We just need a Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Use Python base image
FROM python:3.10-slim-bookworm

# Install Node.js &amp;amp; nginx
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y \
    curl \
    nginx \
    apache2-utils \
    &amp;amp;&amp;amp; curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    &amp;amp;&amp;amp; apt-get install -y nodejs \
    &amp;amp;&amp;amp; rm -rf /var/lib/apt/lists/*


# Copy nginx configuration
COPY nginx.conf /etc/nginx/nginx.conf

# Create password file (username "user", password "testuser1234!" - replace with your own)
RUN htpasswd -bc /etc/nginx/.htpasswd user testuser1234!

# Create log directories
RUN mkdir -p /var/log/nginx &amp;amp;&amp;amp; \
    touch /var/log/nginx/error.log &amp;amp;&amp;amp; \
    touch /var/log/nginx/access.log &amp;amp;&amp;amp; \
    chown -R www-data:www-data /var/log/nginx

# Expose ports
EXPOSE 8000 5000

CMD nginx &amp;amp;&amp;amp; npx -y supergateway --stdio "npx -y @openbnb/mcp-server-airbnb" --port 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once deployed to Render, you can go to the URL https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse. A login window will pop up, where you need to enter your username and password. Alright, this is nice, but how do we use it in Claude? Remember, we use Supergateway to convert SSE back to stdio. Supergateway also offers the possibility to pass a &lt;code&gt;--header&lt;/code&gt;, which is useful because we can pass an &lt;code&gt;Authorization&lt;/code&gt; header. But what exactly do we need to send? We can't simply send the username and password in plain text; for basic auth, &lt;code&gt;username:password&lt;/code&gt; is encoded as &lt;code&gt;base64&lt;/code&gt;. One way to obtain the header is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -v -u user:testuser1234! https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verbose output will show the header that curl sends, something like &lt;code&gt;Authorization: Basic &amp;lt;&amp;lt;your-base64-key&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;
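
&lt;p&gt;Alternatively, since basic auth simply base64-encodes &lt;code&gt;username:password&lt;/code&gt;, you can compute the value yourself (a quick sketch using the credentials from the Dockerfile above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prints the value to put after "Basic" in the Authorization header
echo -n 'user:testuser1234!' | base64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;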

&lt;p&gt;Now we can finally use it in Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "airbnb-server": {
      "command": "npx",
      "args": [
        "-y",
        "supergateway",
        "--sse",
        "https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse",
        "--header",
        "Authorization: Basic &amp;lt;&amp;lt;your-base64-key&amp;gt;&amp;gt;"
      ]
    }
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voila! We now have a secured MCP server with basic authentication. If you want to use OAuth instead, you simply need a provider like Auth0, Clerk, etc., and then use the &lt;code&gt;--oauth2Bearer&lt;/code&gt; flag.&lt;/p&gt;
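
&lt;p&gt;For example, the client-side Supergateway call would then look something like this (a sketch; the placeholder token is whatever your OAuth provider issues):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sends an "Authorization: Bearer ..." header instead of basic auth
npx -y supergateway \
    --sse "https://&amp;lt;&amp;lt;your-project&amp;gt;&amp;gt;.onrender.com/sse" \
    --oauth2Bearer "&amp;lt;&amp;lt;your-access-token&amp;gt;&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;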

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Hope you enjoyed this short article. MCP is super new and a lot of things can change radically. It's far from production-ready, but hopefully you can use this for your next project. The full code is also on &lt;a href="https://github.com/DATANOMIQ/mcp-secure-server" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, which you can use directly to deploy on Render.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>security</category>
      <category>programming</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Building a Fast and Efficient Semantic Search System Using OpenVINO and Postgres</title>
      <dc:creator>Dat Tran</dc:creator>
      <pubDate>Mon, 21 Oct 2024 07:37:39 +0000</pubDate>
      <link>https://dev.to/datitran/building-a-fast-and-efficient-semantic-search-system-using-openvino-and-postgres-fd6</link>
      <guid>https://dev.to/datitran/building-a-fast-and-efficient-semantic-search-system-using-openvino-and-postgres-fd6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfo2v5wwdfutt7xqd0rm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfo2v5wwdfutt7xqd0rm.jpg" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://pixabay.com/users/real-napster-323187/" rel="noopener noreferrer"&gt;real-napster&lt;/a&gt; on Pixabay&lt;/p&gt;

&lt;p&gt;In one of my recent projects, I had to build a semantic search system that could scale with high performance and deliver real-time responses for report searches. We used PostgreSQL with &lt;code&gt;pgvector&lt;/code&gt; on AWS RDS, paired with AWS Lambda, to achieve this. The challenge was to allow users to search using natural language queries instead of relying on rigid keywords, all while keeping responses under 1-2 seconds on CPU resources alone.&lt;/p&gt;

&lt;p&gt;In this post, I will walk through the steps I took to build this search system, from retrieval to reranking, and the optimizations made using &lt;a href="https://docs.openvino.ai/2024/index.html" rel="noopener noreferrer"&gt;OpenVINO&lt;/a&gt; and intelligent batching for tokenization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of Semantic Search: Retrieval and Reranking
&lt;/h2&gt;

&lt;p&gt;Modern state-of-the-art search systems usually consist of two main steps: &lt;strong&gt;retrieval&lt;/strong&gt; and &lt;strong&gt;reranking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;1) &lt;strong&gt;Retrieval:&lt;/strong&gt; The first step involves retrieving a subset of relevant documents based on the user query. This can be done using pre-trained embedding models, such as &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI's small and large embeddings&lt;/a&gt;, &lt;a href="https://cohere.com/embed" rel="noopener noreferrer"&gt;Cohere's Embed models&lt;/a&gt;, or &lt;a href="https://www.mixedbread.ai/docs/embeddings/overview" rel="noopener noreferrer"&gt;Mixedbread's mxbai embeddings&lt;/a&gt;. Retrieval focuses on narrowing down the pool of documents by measuring their similarity to the query.&lt;/p&gt;

&lt;p&gt;Here's a simplified example using Hugging Face's &lt;a href="https://sbert.net/" rel="noopener noreferrer"&gt;&lt;code&gt;sentence-transformers&lt;/code&gt;&lt;/a&gt; library, one of my favorite libraries for retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Load a pre-trained sentence transformer model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sample query and documents (vectorize the query and the documents)
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I fix a broken landing gear?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report 1 on landing gear failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report 2 on engine problems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Get embeddings for query and documents
&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;document_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate cosine similarity between query and documents
&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve top-k most relevant documents
&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top 5 documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2) &lt;strong&gt;Reranking:&lt;/strong&gt; Once the most relevant documents have been retrieved, we further improve the ranking of these documents using a &lt;strong&gt;cross-encoder&lt;/strong&gt; model. This step re-evaluates each document in relation to the query more accurately, focusing on deeper contextual understanding.&lt;br&gt;
Reranking is beneficial because it adds an additional layer of refinement by scoring the relevance of each document more precisely.&lt;/p&gt;

&lt;p&gt;Here's a code example for reranking using &lt;code&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/code&gt;, a lightweight cross-encoder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;

&lt;span class="c1"&gt;# Load the cross-encoder model
&lt;/span&gt;&lt;span class="n"&gt;cross_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the cross-encoder to rerank top-k retrieved documents
&lt;/span&gt;&lt;span class="n"&gt;query_document_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_document_pairs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rank documents based on the new scores
&lt;/span&gt;&lt;span class="n"&gt;top_k_reranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top 5 reranked documents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_k_reranked&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Identifying Bottlenecks: The Cost of Tokenization and Prediction
&lt;/h2&gt;

&lt;p&gt;During development, I found that the tokenization and prediction stages were taking quite a long time when handling 1,000 reports with the default settings of &lt;code&gt;sentence-transformers&lt;/code&gt;. This created a performance bottleneck, especially since we aimed for real-time responses.&lt;/p&gt;

&lt;p&gt;Below, I profiled my code using &lt;a href="https://jiffyclub.github.io/snakeviz/" rel="noopener noreferrer"&gt;SnakeViz&lt;/a&gt; to visualize the performance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31j3gq1k9uo7njpb3a4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31j3gq1k9uo7njpb3a4y.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the tokenization and prediction steps are disproportionately slow, leading to significant delays in serving search results. Overall, they took around 4-5 seconds on average. This is because there are blocking operations between the tokenization and prediction steps. If we also add other operations like database calls, filtering, etc., we easily ended up with 8-9 seconds in total.&lt;/p&gt;
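
&lt;p&gt;For reference, a profile like this can be produced with Python's built-in &lt;code&gt;cProfile&lt;/code&gt; and then opened in SnakeViz (a minimal sketch; &lt;code&gt;run_search&lt;/code&gt; is a placeholder for your own search entry point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cProfile

# run_search is a placeholder for the function doing retrieval + reranking
cProfile.run("run_search('How do I fix a broken landing gear?')", "search.prof")

# Then visualize the profile in the browser:
#   snakeviz search.prof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;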

&lt;h2&gt;
  
  
  Optimizing Performance with OpenVINO
&lt;/h2&gt;

&lt;p&gt;The question I faced was: &lt;strong&gt;Can we make it faster?&lt;/strong&gt; The answer is yes, by leveraging &lt;strong&gt;OpenVINO&lt;/strong&gt;, an optimized backend for CPU inference. OpenVINO accelerates deep learning model inference on Intel hardware, which is what we use on AWS Lambda.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Example for OpenVINO Optimization&lt;/strong&gt;&lt;br&gt;
Here’s how I integrated OpenVINO into the search system to speed up inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openvino.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_openvino_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.xml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;compiled_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;compiled_model&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;compiled_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Core&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
    &lt;span class="n"&gt;all_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Split results into batches
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;batch_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch_results&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longest_first&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;np&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract input tensors (convert to NumPy arrays)
&lt;/span&gt;        &lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;attention_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token_type_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;infer_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compiled_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_infer_request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;infer_request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_type_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;token_type_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;all_logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;all_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_logits&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_search_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Usually you would fetch the data from a database
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cnbc_headlines.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Headlines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Headlines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Load the model and rerank
&lt;/span&gt;    &lt;span class="n"&gt;openvino_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_openvino_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder-openvino-model/model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-TinyBERT-L-2-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rerank_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openvino_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add the rerank scores to the DataFrame and sort by the new scores
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rerank_scores&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetch search results with reranking using OpenVINO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--search_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The search text to use for reranking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_search_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this approach, we got a &lt;code&gt;2-3x&lt;/code&gt; speedup, reducing the original 4-5 seconds to 1-2 seconds. The full working code is on &lt;a href="https://github.com/datitran/reranker_openvino" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
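
&lt;p&gt;Note that the script assumes the cross-encoder has already been converted to OpenVINO IR (the &lt;code&gt;model.xml&lt;/code&gt;/&lt;code&gt;model.bin&lt;/code&gt; pair it reads). One way to produce such an export is Hugging Face's Optimum integration for OpenVINO; a sketch (output file names may differ, so adjust the path the script reads accordingly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "optimum[openvino]"

# Export the Hugging Face model to OpenVINO IR (.xml/.bin)
optimum-cli export openvino \
    --model cross-encoder/ms-marco-TinyBERT-L-2-v2 \
    --task text-classification \
    cross-encoder-openvino-model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;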

&lt;h2&gt;
  
  
  Fine-Tuning for Speed: Batch Size and Tokenization
&lt;/h2&gt;

&lt;p&gt;Another critical factor in improving performance was optimizing the &lt;strong&gt;tokenization&lt;/strong&gt; process and adjusting the &lt;strong&gt;batch size&lt;/strong&gt; and &lt;strong&gt;token length&lt;/strong&gt;. By batching the inputs (e.g. &lt;code&gt;batch_size=16&lt;/code&gt;) and capping the token length, we could parallelize the tokenization and reduce the overhead of repetitive operations. In our experiments, we found that a &lt;code&gt;batch_size&lt;/code&gt; between &lt;code&gt;16&lt;/code&gt; and &lt;code&gt;64&lt;/code&gt; worked well, with anything larger degrading performance. Similarly, we settled on a &lt;code&gt;max_length&lt;/code&gt; of &lt;code&gt;128&lt;/code&gt; instead of the &lt;code&gt;512&lt;/code&gt; used in the script above, which is viable if the average length of your reports is relatively short. With these changes, we achieved an overall &lt;code&gt;8x&lt;/code&gt; speed-up, reducing the reranking time to under 1 second, even on CPU.&lt;/p&gt;

&lt;p&gt;In practice, this meant experimenting with different batch sizes and token lengths to find the right balance between speed and accuracy for our data. By doing so, we saw significant improvements in response times, making the search system scalable even with 1,000+ reports.&lt;/p&gt;
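
&lt;p&gt;A simple way to run such an experiment is to time the &lt;code&gt;rerank&lt;/code&gt; function over a range of batch sizes (a sketch reusing &lt;code&gt;openvino_model&lt;/code&gt;, &lt;code&gt;tokenizer&lt;/code&gt;, and &lt;code&gt;texts&lt;/code&gt; from the script above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

# Time the reranking step for different batch sizes
for batch_size in [8, 16, 32, 64, 128]:
    start = time.perf_counter()
    rerank(openvino_model, "landing gear failure", texts, tokenizer, batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;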

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By using OpenVINO and optimizing tokenization and batching, we were able to build a high-performance semantic search system that meets real-time requirements on a CPU-only setup. In fact, we achieved an &lt;code&gt;8x&lt;/code&gt; speedup overall. The combination of retrieval using &lt;code&gt;sentence-transformers&lt;/code&gt; and reranking with a cross-encoder model creates a powerful, user-friendly search experience.&lt;/p&gt;

&lt;p&gt;If you’re building similar systems with constraints on response time and computational resources, I highly recommend exploring OpenVINO and intelligent batching to unlock better performance.&lt;/p&gt;

&lt;p&gt;Hopefully, you enjoyed this article. If you found it useful, give me a like so others can find it too, and share it with your friends. Follow me on &lt;a href="https://www.linkedin.com/in/dat-tran-a1602320/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to stay up-to-date with my work. Thanks for reading!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>postgres</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
