Running an AI model as a one-shot script is useful, but it forces you to restart the model every time you need a result. Setting it up as a service lets any application send requests to it continuously, without reloading the model. This guide shows how to serve Reka Edge using vLLM and an open-source plugin, then connect a web app to it for image description and object detection.
The vLLM plugin is available at github.com/reka-ai/vllm-reka. The demo Media Library app is at github.com/fboucher/media-library.
Prerequisites
You need a machine with a GPU and Linux, macOS, or Windows (with WSL). This guide uses uv, a fast Python package and project manager; pip + venv works too if you prefer.
Clone the vLLM Reka Plugin
Reka models require a dedicated plugin to run under vLLM. Not all models need this extra step, but Reka's architecture does. Clone the plugin repository and enter the directory:
git clone https://github.com/reka-ai/vllm-reka
cd vllm-reka
The repository contains the plugin code and a serve.sh script you will use to start the service.
Download the Reka Edge Model
Before starting the service, you need the model weights locally. Install the Hugging Face Hub CLI and use it to pull the reka-edge-2603 model into your project directory:
uv sync
uv pip install huggingface_hub
uvx hf download RekaAI/reka-edge-2603 --local-dir ./models/reka-edge-2603
This is a large model, so make sure you have enough disk space and a stable connection.
Start the Service
Once the model is downloaded, start the vLLM service using the serve.sh script included in the plugin:
uv run bash serve.sh ./models/reka-edge-2603
The script accepts environment variables to configure which model to load and how much GPU memory to allocate. If your GPU cannot fit the model at default settings, open serve.sh and adjust the variables at the top. The repository README lists the available options. The service takes a few seconds to load the model weights, then starts listening for HTTP requests.
As an example with an NVIDIA GeForce RTX 5070, here are the settings I used to run the model:
export GPU_MEM=0.80
export MAX_LEN=4096
export MAX_BATCH_TOKENS=4096
export MAX_IMAGES=2
export MAX_VIDEOS=1
export VIDEO_NUM_FRAMES=4
uv run bash serve.sh ./models/reka-edge-2603
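Once the service is up, you can confirm it responds by sending a request to its OpenAI-compatible endpoint. The sketch below is illustrative, not part of the repository: it assumes vLLM's default port 8000 and that the service exposes the model under the path it was started with — check the serve.sh output for the actual port and model name.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default port


def build_chat_request(prompt: str, model: str = "./models/reka-edge-2603") -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


def send(payload: dict) -> dict:
    """POST the payload to the local vLLM service and return the parsed reply."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires the service to be running):
# reply = send(build_chat_request("Describe yourself in one sentence."))
# print(reply["choices"][0]["message"]["content"])
```

If this returns a completion, the backend is ready for the Media Library app.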
Connect the Media Library App
With the backend running, it's time to start the Media Library app. Clone the repository, jump into the directory, and run it with Docker:
git clone https://github.com/fboucher/media-library
cd media-library
docker compose up --build -d
Open http://localhost:8080 in your browser, then add a new connection with these settings:
- Name: local (or any label you want)
- IP address: your machine's local network IP (e.g. 192.168.x.x)
- API key: leave blank or enter anything — no key is required for a local connection
- Model: reka-edge-2603

Click Test to confirm the connection, then save it.
Try It: Image Description and Object Detection
Select an image in the app and choose your local connection, then click Fill with AI. The app sends the image to your vLLM service, and the model returns a natural language description. You can watch the request hit your backend in the terminal where the service is running.
Reka Edge also supports object detection. Type a prompt asking the model to locate a specific feature (e.g. "face") and the model returns bounding-box coordinates. The app renders these as red boxes overlaid on the image. This works for any region you can describe in a prompt.
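If you want to consume those coordinates in your own code rather than through the app, a small parser can pull them out of the model's reply. This is an illustrative sketch, not the Media Library's actual code, and it assumes the boxes appear as bracketed lists of four numbers:

```python
import re


def parse_boxes(text: str) -> list[tuple[int, int, int, int]]:
    """Extract [x1, y1, x2, y2] bounding boxes from a model reply.

    Assumes boxes appear as bracketed lists of four integers, e.g.
    "face: [120, 45, 310, 260]". Returns pixel-coordinate tuples.
    """
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(int(n) for n in match) for match in re.findall(pattern, text)]


print(parse_boxes("Found a face at [120, 45, 310, 260]."))
# [(120, 45, 310, 260)]
```

With the tuples in hand, any drawing library can render the same red overlays the app shows.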
Switch to the Reka Cloud API
If your local GPU is too slow for production use, you can point the app at the Reka APIs instead. Add a new connection in the app and set the base URL to the Reka API endpoint. Get your API key from platform.reka.ai. OpenRouter is another option if you prefer a unified API across providers.
The model name stays the same (reka-edge-2603), so switching between local and cloud is just a matter of selecting a different connection in the app. The cloud API is noticeably faster because Reka's servers are more powerful than a local GPU (at least mine :) ). During development, use the local service to avoid burning credits; switch to the API for speed when you need it.
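That local/cloud switch can be captured in a tiny config helper. A hedged sketch — the cloud base URL below is a placeholder, not a confirmed endpoint; check platform.reka.ai for the real one before using it:

```python
import os

# NOTE: "https://api.reka.ai/v1" is a placeholder URL — confirm the actual
# base URL on platform.reka.ai before pointing anything at it.
PROFILES = {
    "local": {"base_url": "http://localhost:8000/v1", "api_key": ""},
    "cloud": {
        "base_url": "https://api.reka.ai/v1",
        "api_key": os.environ.get("REKA_API_KEY", ""),
    },
}


def endpoint(profile: str) -> dict:
    """Return connection settings for a named profile.

    The model name stays the same either way; only the base URL
    and API key change between local and cloud.
    """
    cfg = dict(PROFILES[profile])
    cfg["model"] = "reka-edge-2603"
    return cfg
```

Development scripts can then take a single `--profile local|cloud` flag instead of hard-coding URLs.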
What You Can Build
The service you just set up accepts any image or video via HTTP — point a script at a folder and you have a batch pipeline for descriptions, tags, or bounding boxes. Swap the prompt and you change what it extracts. The workflow is the same whether you are running locally or through the API.
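That batch idea can be sketched as a loop that base64-encodes each image and builds an OpenAI-style multimodal request. The `image_url`/data-URL message shape below follows the common OpenAI chat convention — an assumption worth verifying against your vLLM setup before relying on it:

```python
import base64
from pathlib import Path


def image_message(path: Path, prompt: str) -> dict:
    """Build one OpenAI-style multimodal chat request for an image file.

    Encodes the image as a base64 data URL. The exact content schema your
    server expects may differ, so treat this shape as a starting point.
    """
    b64 = base64.b64encode(path.read_bytes()).decode()
    return {
        "model": "reka-edge-2603",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }


def batch(folder: str, prompt: str) -> list[dict]:
    """One request payload per image in the folder. Swap the prompt to
    change what the model extracts: descriptions, tags, bounding boxes."""
    return [image_message(p, prompt) for p in sorted(Path(folder).glob("*.jpg"))]


# Example: payloads = batch("./photos", "Describe this image in one sentence.")
```

POST each payload to the chat completions endpoint (local or cloud) and collect the replies.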
References
- Reka Edge model: huggingface.co/RekaAI/reka-edge-2603
- vLLM Reka plugin: github.com/reka-ai/vllm-reka
- Media Library app: github.com/fboucher/media-library
- Reka API platform: platform.reka.ai