Kosmos-2.5 is Microsoft’s multimodal “literate” model for reading text-heavy images (receipts, invoices, forms, documents). It does two things out of the box using task prompts: (a) OCR with spatially-aware text blocks (text + bounding boxes) via the <ocr> prompt, and (b) image→Markdown conversion via the <md> prompt. It’s implemented in Transformers (supported from v4.56+) with ready-to-run Python snippets, and the paper details the shared decoder-only architecture and its document-understanding focus.
GPU Configuration (What Actually Works)
Ballpark VRAM is based on the 1.3B-parameter model running in bfloat16 with image patches; add headroom for long outputs and larger pages.
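As a rough back-of-the-envelope check (not an official requirement), you can estimate the weight footprint in bfloat16 and compare it against the VRAM of the GPU you plan to rent. A minimal sketch:

import torch

# Assumption: weights dominate memory; activations, image patches, and the KV cache add extra overhead on top.
params = 1.3e9                      # ~1.3B parameters
weight_gb = params * 2 / 1024**3    # 2 bytes per parameter in bfloat16
print(f"Approx. weight memory in bf16: {weight_gb:.1f} GB")

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total:.1f} GB")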
Resources
Link: https://huggingface.co/microsoft/kosmos-2.5
Step-by-Step Process to Install & Run Microsoft Kosmos-2.5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Microsoft Kosmos-2.5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Microsoft Kosmos-2.5
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Microsoft Kosmos-2.5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
Template Name:
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Kosmos-2.5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, then install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv kosmos
source kosmos/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
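Before moving on, it is worth a quick check that the CUDA build of PyTorch actually sees the GPU (exact version numbers will differ on your machine):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"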
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install "transformers>=4.56" accelerate pillow requests
Transformers ≥4.56 is required.
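A quick way to confirm that the installed version actually includes Kosmos-2.5 support (the class import fails on older releases):

python3 -c "import transformers; from transformers import Kosmos2_5ForConditionalGeneration; print(transformers.__version__)"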
Step 12: Install Wheel & Flash Attn
Run the following command to install wheel & flash-attn:
pip install wheel
pip install flash-attn --no-build-isolation
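flash-attn is optional and the build can take a while. If the install succeeds, you can verify the import with the one-liner below and later enable it by uncommenting the attn_implementation="flash_attention_2" line in the scripts that follow:

python3 -c "import flash_attn; print(flash_attn.__version__)"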
Step 13: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Microsoft Kosmos-2.5 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
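If your editor uses standard SSH remoting, a host entry like the one below makes connecting a one-click affair. The alias, IP, port, and key path here are example values; replace them with whatever the NodeShift Connect page shows for your instance (the port may differ if you were given a proxy SSH address):

# ~/.ssh/config (example values, replace with your own)
Host nodeshift-kosmos
    HostName 38.29.145.10
    User root
    Port 22
    IdentityFile ~/.ssh/id_ed25519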
Step 14: Smoke Test: Markdown Extraction
Create kosmos25_md.py and add the following code:
import torch, requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # If you installed flash-attn, uncomment the next line
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)
# Sample image from the model card
url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<md>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Keep & use the scaled dimensions from the model card example
height, width = inputs.pop("height"), inputs.pop("width")
inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
out_ids = model.generate(**inputs, max_new_tokens=1024)
text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
print(text)
Run the script with the following command:
python3 kosmos25_md.py
What kosmos25_md.py does
Imports libraries
- torch: for running the model on GPU/CPU.
- requests: to download a sample image from the Hugging Face repo.
- PIL.Image: to load and process that image.
- transformers: provides the AutoProcessor (for preprocessing text+images) and Kosmos2_5ForConditionalGeneration (the actual model).
Defines model + device setup
- Chooses repo = “microsoft/kosmos-2.5”.
- Sets device = "cuda:0" (so it uses your first GPU).
- Uses dtype = torch.bfloat16 (lighter precision for efficiency).
- Loads the model weights from Hugging Face into GPU memory.
- Loads the paired processor, which knows how to tokenize text and convert images into patches.
Fetches a sample image
- Downloads a receipt image (receipt_00008.png) directly from the Hugging Face repo.
- Opens it with PIL so it’s ready to feed to the model.
Prepares the task prompt
- Sets prompt = "<md>".
- This tells Kosmos-2.5 you want Markdown transcription (not OCR bounding boxes).
Processes input into tensors
- Calls the processor with the text prompt (<md>) + image.
- Returns model-ready tensors (input_ids, attention_mask, flattened_patches) along with height and width.
- Keeps track of height and width (for scaling purposes).
Moves data to GPU
- Iterates over input tensors and sends them to the CUDA device.
- Ensures flattened_patches are stored in bfloat16 for efficiency.
Runs generation with the model
- Calls model.generate() with inputs.
- max_new_tokens=1024 → allows up to 1024 tokens of output.
- The model produces a sequence representing Markdown text.
Decodes the output
- Uses processor.batch_decode() to convert model IDs back into text.
- Skips special tokens (the task prompt, end-of-sequence markers, etc.).
Prints result to terminal
- Displays the generated Markdown string representing the document layout.
- Example: headings, tables, or text blocks reflecting the receipt’s content.
Summary
When you run python kosmos25_md.py, the script:
- Loads Kosmos-2.5 on GPU in bf16.
- Downloads a sample receipt image.
- Sends the <md> prompt + image through the model.
- Generates structured Markdown output of the document.
- Prints the Markdown text to your terminal.
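If you’d rather keep the result than just print it, a small optional addition (assuming the same variables as in kosmos25_md.py above) writes the Markdown to disk:

# Optional: append to kosmos25_md.py to save the generated Markdown
# (assumes `text` already holds the decoded output from the script above).
from pathlib import Path

Path("receipt_00008.md").write_text(text, encoding="utf-8")
print("Saved receipt_00008.md")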
Step 15: OCR with bounding boxes
Create kosmos25_ocr.py and add the following code:
import re, torch, requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
repo = "microsoft/kosmos-2.5"
device = "cuda:0"; dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)
url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_h = raw_height / height
scale_w = raw_width / width
inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
out_ids = model.generate(**inputs, max_new_tokens=1024)
y = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
# Post-process (from model card example)
pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
boxes_raw = re.findall(pattern, y)
lines = re.split(pattern, y)[1:]
boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]
draw = ImageDraw.Draw(image)
for i, line in enumerate(lines):
    x0, y0, x1, y1 = boxes[i]
    if x0 < x1 and y0 < y1:
        x0, y0, x1, y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
        draw.polygon([x0, y0, x1, y0, x1, y1, x0, y1], outline="red")
image.save("ocr_output.png")
print("Saved ocr_output.png")
Run the script with the following command:
python3 kosmos25_ocr.py
What kosmos25_ocr.py does
Imports libraries
- Same as the Markdown script: torch, requests, PIL.Image, and transformers.
- Adds re (regular expressions) to parse bounding box tags in the model’s output.
- Adds ImageDraw from PIL to draw boxes on the image.
Defines model + device setup
- Loads the Kosmos-2.5 model (microsoft/kosmos-2.5) into GPU memory.
- Uses device = "cuda:0" and dtype = torch.bfloat16 for GPU execution.
- Loads the paired processor for tokenization and image preprocessing.
Fetches the sample image
- Downloads the same receipt image (receipt_00008.png) from Hugging Face.
- Opens it using PIL.
Prepares the task prompt
- Sets prompt = "<ocr>".
- This tells Kosmos-2.5 to generate text with bounding box coordinates for each block of text it detects.
Processes input into tensors
- Calls the processor with the text prompt (<ocr>) + image.
- Extracts height and width from the processed input for scaling.
- Keeps track of raw image dimensions (raw_width, raw_height).
- Computes scaling factors (scale_h, scale_w) so that bounding boxes from the model can be mapped correctly back to the real image size.
Moves data to GPU
- Just like in the Markdown script, pushes tensors to the GPU.
- Converts flattened_patches to bfloat16.
Runs generation with the model
- Calls model.generate() with max 1024 tokens.
- Output contains both text and bounding box tags (e.g., <bbox><x_...><y_...><x_...><y_...></bbox>).
Post-processes the output
- Decodes the model output back to text.
- Removes the prompt from the result.
- Uses regex to extract bounding box coordinates.
- Splits the text into lines associated with those bounding boxes.
- Scales the bounding boxes to match the original image resolution.
Overlays bounding boxes on the image
- Uses PIL’s ImageDraw.Draw to draw red polygons around detected text regions.
- Associates each bounding box with its recognized text.
Saves + prints results
- Saves a new image (ocr_output.png) with bounding boxes drawn.
- Prints a confirmation message (Saved ocr_output.png) to the terminal.
Key Difference vs Markdown script
- Markdown script (kosmos25_md.py) → Converts the entire document into structured Markdown text (no spatial layout).
- OCR script (kosmos25_ocr.py) → Extracts text with spatial coordinates and draws bounding boxes directly onto the image.
In short:
- Run Markdown mode when you want a neat Markdown document version of your image.
- Run OCR mode when you want raw text + bounding boxes for further analysis or visualization.
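If you want the OCR result as structured data rather than an annotated image, the same post-processing can be turned into a list of line/box records. A minimal sketch, reusing the lines, boxes, scale_w, and scale_h variables from kosmos25_ocr.py:

# Optional: pair each recognized line with its scaled box and dump to JSON
# (assumes lines, boxes, scale_w, scale_h from kosmos25_ocr.py are in scope).
import json

records = []
for line, (x0, y0, x1, y1) in zip(lines, boxes):
    records.append({
        "text": line.strip(),
        "box": [int(x0 * scale_w), int(y0 * scale_h), int(x1 * scale_w), int(y1 * scale_h)],
    })

with open("ocr_output.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
print(f"Wrote {len(records)} text regions to ocr_output.json")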
Step 16: Install Streamlit
Run the following command to install streamlit:
pip install streamlit
Step 17: Create an app.py
Create a file (e.g., app.py) and add the following code:
import streamlit as st
import torch, requests, re
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
# Load once at startup
repo = "microsoft/kosmos-2.5"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if "cuda" in device else torch.float32
@st.cache_resource
def load_model():
    model = Kosmos2_5ForConditionalGeneration.from_pretrained(
        repo,
        device_map=device,
        torch_dtype=dtype,
    )
    processor = AutoProcessor.from_pretrained(repo)
    return model, processor
model, processor = load_model()
st.title("Kosmos-2.5 WebUI (OCR + Markdown)")
mode = st.radio("Choose task:", ["Markdown (<md>)", "OCR (<ocr>)"])
uploaded = st.file_uploader("Upload an image", type=["png","jpg","jpeg"])
if uploaded:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded Image", use_column_width=True)
    if st.button("Run Kosmos-2.5"):
        prompt = "<md>" if mode.startswith("Markdown") else "<ocr>"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        height, width = inputs.pop("height"), inputs.pop("width")
        raw_w, raw_h = image.size
        scale_h, scale_w = raw_h / height, raw_w / width
        inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
        with torch.no_grad():
            out_ids = model.generate(**inputs, max_new_tokens=1024)
        text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
        if mode.startswith("Markdown"):
            st.subheader("Markdown Output")
            st.code(text, language="markdown")
        else:
            # Post-process OCR boxes
            pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
            boxes_raw = re.findall(pattern, text)
            lines = re.split(pattern, text)[1:]
            boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]
            draw = ImageDraw.Draw(image)
            for i, line in enumerate(lines):
                x0, y0, x1, y1 = boxes[i]
                if x0 < x1 and y0 < y1:
                    x0, y0, x1, y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
                    draw.polygon([x0, y0, x1, y0, x1, y1, x0, y1], outline="red")
            st.subheader("OCR with Bounding Boxes")
            st.image(image)
            st.text_area("OCR Text", "\n".join(lines), height=200)
Step 18: Launch Streamlit
Run the following command to launch streamlit:
streamlit run app.py
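By default Streamlit binds to port 8501. If you need to expose it explicitly on a remote VM, the standard server flags work (adjust the port if 8501 is already in use):

streamlit run app.py --server.address 0.0.0.0 --server.port 8501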
Step 19: Access the WebUI in Your Browser
Once Streamlit is running, it will display three links:
- Local URL → http://localhost:8501 (works if you’re running on your own machine).
- Network URL → http://<internal-ip>:8501 (for internal access inside your VM network).
- External URL → http://<external-ip>:8501 (use this to open from your laptop/PC browser).
Open the External URL in your browser.
Example:
http://38.29.145.10:8501
The Kosmos-2.5 WebUI will load with:
- A task selector (Markdown <md> or OCR <ocr>).
- An upload box to drag & drop or browse images.
Upload any PNG/JPG/JPEG image (e.g., receipts, invoices, documents).
Click Run and view:
- Markdown Mode → a structured Markdown transcription of the document.
- OCR Mode → text + bounding boxes drawn directly on your image.
Tip: If your VM is remote (e.g., NodeShift), ensure port 8501 is open in firewall/security settings, or use SSH port forwarding:
ssh -L 8501:localhost:8501 root@<your-vm-ip>
Step 20: Upload and Process Documents
In the WebUI, click Browse files (or drag & drop) to upload an image.
- Supported formats: PNG, JPG, JPEG
- File size limit: 200 MB
Once uploaded, the file name will appear below the upload box (e.g., receipt_00008.png).
Choose the task mode:
- Markdown (<md>) → generates a structured Markdown transcription.
- OCR (<ocr>) → extracts text with bounding boxes overlaid on the uploaded image.
The model will process the image and show results below:
- In Markdown Mode → you’ll see neatly formatted text output.
- In OCR Mode → the uploaded image will be re-rendered with red bounding boxes drawn around detected text regions, along with extracted text output.
Tip: If you see a warning about use_column_width being deprecated, you can safely ignore it — it’s a Streamlit UI message and doesn’t affect the model’s output.
Step 21: View OCR Results
Switch the task selector to OCR (<ocr>).
- This tells Kosmos-2.5 to extract text + bounding box coordinates instead of Markdown.
After uploading the image (e.g., receipt_00008.png), the model will process it and return:
- Annotated Image → your uploaded image will now display with red bounding boxes drawn around detected text areas.
- OCR Text Output → the recognized text lines will appear below the image (or in a text box), showing exactly what was extracted from each bounding box.
Use this mode when you need precise localization of text in documents (e.g., invoices, receipts, forms).
Tip: If you want to save the annotated output, you can add download buttons for both the Markdown text and the OCR image; a sketch follows below.
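A minimal, hedged sketch of those download buttons using Streamlit's st.download_button; these lines are hypothetical additions, not part of the app.py shown earlier, and assume the text and image variables from the respective branches:

# Hypothetical additions to app.py
import io

# Inside the Markdown branch, after st.code(text, language="markdown"):
st.download_button("Download Markdown", text, file_name="kosmos_output.md")

# Inside the OCR branch, after drawing the boxes:
buf = io.BytesIO()
image.save(buf, format="PNG")
st.download_button("Download annotated image", buf.getvalue(), file_name="ocr_output.png")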
Conclusion
Kosmos-2.5 makes working with text-heavy images simple — whether you need clean Markdown transcriptions or OCR with bounding boxes. By setting it up on a GPU-powered NodeShift VM and integrating it with a Streamlit WebUI, you now have an efficient, browser-based workflow for document understanding at scale.