Running an AI model as a one-shot script is useful, but it forces you to restart the model every time you need a result. Setting it up as a service lets any application send requests to it continuously, without reloading the model. This guide shows how to serve Reka Edge using vLLM and an open-source plugin, then connect a web app to it for image description and object detection.
The vLLM plugin is available at github.com/reka-ai/vllm-reka. The demo Media Library app is at github.com/fboucher/media-library.
Prerequisites
You need a machine with a GPU and Linux, macOS, or Windows (with WSL). This guide uses uv, a fast Python package and project manager; pip + venv works too if you prefer.
Clone the vLLM Reka Plugin
Reka models require a dedicated plugin to run under vLLM. Not all models need this extra step, but Reka's architecture does. Clone the plugin repository and enter the directory:
git clone https://github.com/reka-ai/vllm-reka
cd vllm-reka
The repository contains the plugin code and a serve.sh script you will use to start the service.
Download the Reka Edge Model
Before starting the service, you need the model weights locally. Install the Hugging Face Hub CLI and use it to pull the reka-edge-2603 model into your project directory:
uv sync
uv pip install huggingface_hub
uvx hf download RekaAI/reka-edge-2603 --local-dir ./models/reka-edge-2603
This is a large model, so make sure you have enough disk space and a stable connection.
Start the Service
Once the model is downloaded, start the vLLM service using the serve.sh script included in the plugin:
uv run bash serve.sh ./models/reka-edge-2603
The script accepts environment variables to configure which model to load and how much GPU memory to allocate. If your GPU cannot fit the model at default settings, open serve.sh and adjust the variables at the top. The repository README lists the available options. The service takes a few seconds to load the model weights, then starts listening for HTTP requests.
As an example with an NVIDIA GeForce RTX 5070, here are the settings I used to run the model:
export GPU_MEM=0.80
export MAX_LEN=4096
export MAX_BATCH_TOKENS=4096
export MAX_IMAGES=2
export MAX_VIDEOS=1
export VIDEO_NUM_FRAMES=4
uv run bash serve.sh ./models/reka-edge-2603
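Once the service is up, you can confirm it responds by sending a request to its OpenAI-compatible endpoint. The sketch below is illustrative, not part of the repository: it assumes vLLM's default port 8000 and that the service exposes the model under the path it was started with — check the serve.sh output for the actual port and model name.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default port


def build_chat_request(prompt: str, model: str = "./models/reka-edge-2603") -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def send(payload: dict) -> dict:
    """POST the payload to the local vLLM service and return the parsed reply."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the service to be running):
# reply = send(build_chat_request("Describe yourself in one sentence."))
# print(reply["choices"][0]["message"]["content"])
```

If this returns a completion, the backend is ready for the Media Library app.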
Connect the Media Library App
With the backend running, it's time to start the Media Library app. Clone the repository, jump into the directory, and run it with Docker:
git clone https://github.com/fboucher/media-library
cd media-library
docker compose up --build -d
Open http://localhost:8080 in your browser, then add a new connection with these settings:
- Name: local (or any label you want)
- IP address: your machine's local network IP (e.g. 192.168.x.x)
- API key: leave blank or enter anything — no key is required for a local connection
- Model: reka-edge-2603

Click Test to confirm the connection, then save it.
Try It: Image Description and Object Detection
Select an image in the app and choose your local connection, then click Fill with AI. The app sends the image to your vLLM service, and the model returns a natural language description. You can watch the request hit your backend in the terminal where the service is running.
Reka Edge also supports object detection. Type a prompt asking the model to locate a specific feature (e.g. "face") and the model returns bounding-box coordinates. The app renders these as red boxes overlaid on the image. This works for any region you can describe in a prompt.
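If you want to consume those coordinates in your own code rather than through the app, a small parser can pull them out of the model's reply. This is an illustrative sketch, not the Media Library's actual code, and it assumes the boxes appear as bracketed lists of four numbers:

```python
import re


def parse_boxes(text: str) -> list[tuple[int, int, int, int]]:
    """Extract [x1, y1, x2, y2] bounding boxes from a model reply.

    Assumes boxes appear as bracketed lists of four integers, e.g.
    "face: [120, 45, 310, 260]". Returns pixel-coordinate tuples.
    """
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(int(n) for n in match) for match in re.findall(pattern, text)]


print(parse_boxes("Found a face at [120, 45, 310, 260]."))
# [(120, 45, 310, 260)]
```

With the tuples in hand, any drawing library can render the same red overlays the app shows.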
Switch to the Reka Cloud API
If your local GPU is too slow for production use, you can point the app at the Reka APIs instead. Add a new connection in the app and set the base URL to the Reka API endpoint. Get your API key from platform.reka.ai. OpenRouter is another option if you prefer a unified API across providers.
The model name stays the same (reka-edge-2603), so switching between local and cloud is just a matter of selecting a different connection in the app. The cloud API is noticeably faster because Reka's servers are more powerful than a local GPU (at least mine :) ). During development, use the local service to avoid burning credits; switch to the API for speed when you need it.
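That local/cloud switch can be captured in a tiny config helper. A hedged sketch — the cloud base URL below is a placeholder, not a confirmed endpoint; check platform.reka.ai for the real one before using it:

```python
import os

# NOTE: "https://api.reka.ai/v1" is a placeholder URL — confirm the actual
# base URL on platform.reka.ai before pointing anything at it.
PROFILES = {
    "local": {"base_url": "http://localhost:8000/v1", "api_key": ""},
    "cloud": {
        "base_url": "https://api.reka.ai/v1",
        "api_key": os.environ.get("REKA_API_KEY", ""),
    },
}


def endpoint(profile: str) -> dict:
    """Return connection settings for a named profile.

    The model name stays the same either way; only the base URL
    and API key change between local and cloud.
    """
    cfg = dict(PROFILES[profile])
    cfg["model"] = "reka-edge-2603"
    return cfg
```

Development scripts can then take a single `--profile local|cloud` flag instead of hard-coding URLs.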
What You Can Build
The service you just set up accepts any image or video via HTTP — point a script at a folder and you have a batch pipeline for descriptions, tags, or bounding boxes. Swap the prompt and you change what it extracts. The workflow is the same whether you are running locally or through the API.
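That batch idea can be sketched as a loop that base64-encodes each image and builds an OpenAI-style multimodal request. The `image_url`/data-URL message shape below follows the common OpenAI chat convention — an assumption worth verifying against your vLLM setup before relying on it:

```python
import base64
from pathlib import Path


def image_message(path: Path, prompt: str) -> dict:
    """Build one OpenAI-style multimodal chat request for an image file.

    Encodes the image as a base64 data URL. The exact content schema your
    server expects may differ, so treat this shape as a starting point.
    """
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {
        "model": "reka-edge-2603",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }


def batch(folder: str, prompt: str) -> list[dict]:
    """One request payload per image in the folder. Swap the prompt to
    change what the model extracts: descriptions, tags, bounding boxes."""
    return [image_message(p, prompt) for p in sorted(Path(folder).glob("*.jpg"))]


# Example: payloads = batch("./photos", "Describe this image in one sentence.")
```

POST each payload to the chat completions endpoint (local or cloud) and collect the replies.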
References
- Reka Edge model: huggingface.co/RekaAI/reka-edge-2603
- vLLM Reka plugin: github.com/reka-ai/vllm-reka
- Media Library app: github.com/fboucher/media-library
- Reka API platform: platform.reka.ai