Kosmos-2.5 is Microsoft’s multimodal “literate” model for reading text-heavy images (receipts, invoices, forms, documents). It does two things out of the box using task prompts: (a) OCR with spatially-aware text blocks (text + bounding boxes) via the <ocr> prompt, and (b) image→Markdown conversion via the <md> prompt. It’s implemented in Transformers (supported from v4.56+) with ready-to-run Python snippets, and the paper details the shared decoder-only architecture and its document-understanding focus.
GPU Configuration (What Actually Works)
Ballpark VRAM is based on the 1.3B-parameter model running in bfloat16 with image patches; add headroom for long outputs and larger pages.
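As a rough back-of-the-envelope check (not an official requirement), you can estimate the weight footprint in bfloat16 and compare it against the VRAM of the GPU you plan to rent. A minimal sketch:

import torch

# Assumption: weights dominate memory; activations, image patches, and the KV cache add extra overhead on top.
params = 1.3e9                      # ~1.3B parameters
weight_gb = params * 2 / 1024**3    # 2 bytes per parameter in bfloat16
print(f"Approx. weight memory in bf16: {weight_gb:.1f} GB")

if torch.cuda.is_available():
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0 total VRAM: {total:.1f} GB")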
Resources
Link: https://huggingface.co/microsoft/kosmos-2.5
Step-by-Step Process to Install & Run Microsoft Kosmos-2.5 Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and configure your first Virtual Machine deployment.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Microsoft Kosmos-2.5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Microsoft Kosmos-2.5
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Microsoft Kosmos-2.5.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
Template Name:
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that Kosmos-2.5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Verify Python Version & Install pip (if not present)
Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.
Step 8.1: Check Python Version
Run the following command to verify Python 3.10 is installed:
python3 --version
You should see output like:
Python 3.10.12
Step 8.2: Install pip (if not already installed)
Even if Python is installed, pip might not be available.
Check if pip exists:
pip3 --version
If you get an error like command not found, then install pip manually.
Install pip via get-pip.py:
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
This will download and install pip into your system.
You may see a warning about running as root — that’s okay for now.
After installation, verify:
pip3 --version
Expected output:
pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Now pip is ready to install packages like transformers, torch, etc.
Step 9: Create and Activate a Python 3.10 Virtual Environment
Run the following commands to create and activate a Python 3.10 virtual environment:
apt update && apt install -y python3.10-venv git wget
python3.10 -m venv kosmos
source kosmos/bin/activate
Step 10: Install PyTorch
Run the following command to install PyTorch:
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
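Before moving on, it is worth a quick check that the CUDA build of PyTorch actually sees the GPU (exact version numbers will differ on your machine):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"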
Step 11: Install Model Dependencies
Run the following command to install model dependencies:
pip install "transformers>=4.56" accelerate pillow requests
Transformers ≥4.56 is required.
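A quick way to confirm that the installed version actually includes Kosmos-2.5 support (the class import fails on older releases):

python3 -c "import transformers; from transformers import Kosmos2_5ForConditionalGeneration; print(transformers.__version__)"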
Step 12: Install Wheel & Flash Attn
Run the following command to install wheel & flash-attn:
pip install wheel
pip install flash-attn --no-build-isolation
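flash-attn is optional and the build can take a while. If the install succeeds, you can verify the import with the one-liner below and later enable it by uncommenting the attn_implementation="flash_attention_2" line in the scripts that follow:

python3 -c "import flash_attn; print(flash_attn.__version__)"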
Step 13: Connect to Your GPU VM with a Code Editor
Before you start running model script with the Microsoft Kosmos-2.5 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
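If your editor uses standard SSH remoting, a host entry like the one below makes connecting a one-click affair. The alias, IP, port, and key path here are example values; replace them with whatever the NodeShift Connect page shows for your instance (the port may differ if you were given a proxy SSH address):

# ~/.ssh/config (example values, replace with your own)
Host nodeshift-kosmos
    HostName 38.29.145.10
    User root
    Port 22
    IdentityFile ~/.ssh/id_ed25519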
Step 14: Smoke Test: Markdown Extraction
Create kosmos25_md.py and add the following code:
import torch, requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # If you installed flash-attn, uncomment the next line
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)
# Sample image from the model card
url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<md>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Keep & use the scaled dimensions from the model card example
height, width = inputs.pop("height"), inputs.pop("width")
inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
out_ids = model.generate(**inputs, max_new_tokens=1024)
text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
print(text)
Run the script with the following command:
python3 kosmos25_md.py
What kosmos25_md.py does
Imports libraries
- torch: for running the model on GPU/CPU.
- requests: to download a sample image from the Hugging Face repo.
- PIL.Image: to load and process that image.
- transformers: provides the AutoProcessor (for preprocessing text+images) and Kosmos2_5ForConditionalGeneration (the actual model).
Defines model + device setup
- Chooses repo = “microsoft/kosmos-2.5”.
- Sets device = "cuda:0" (so it uses your first GPU).
- Uses dtype = torch.bfloat16 (lighter precision for efficiency).
- Loads the model weights from Hugging Face into GPU memory.
- Loads the paired processor, which knows how to tokenize text and convert images into patches.
Fetches a sample image
- Downloads a receipt image (receipt_00008.png) directly from the Hugging Face repo.
- Opens it with PIL so it’s ready to feed to the model.
Prepares the task prompt
- Sets prompt = "<md>".
- This tells Kosmos-2.5 you want Markdown transcription (not OCR bounding boxes).
Processes input into tensors
- Calls the processor with the text prompt (<md>) + image.
- Returns model-ready tensors (input_ids, attention_mask, flattened_patches) along with height and width.
- Keeps track of height and width (for scaling purposes).
Moves data to GPU
- Iterates over input tensors and sends them to the CUDA device.
- Ensures flattened_patches are stored in bfloat16 for efficiency.
Runs generation with the model
- Calls model.generate() with inputs.
- max_new_tokens=1024 → allows up to 1024 tokens of output.
- The model produces a sequence representing Markdown text.
Decodes the output
- Uses processor.batch_decode() to convert model IDs back into text.
- Skips special tokens (the task prompt, end-of-sequence markers, etc.).
Prints result to terminal
- Displays the generated Markdown string representing the document layout.
- Example: headings, tables, or text blocks reflecting the receipt’s content.
Summary
When you run python kosmos25_md.py, the script:
- Loads Kosmos-2.5 on GPU in bf16.
- Downloads a sample receipt image.
- Sends the <md> prompt + image through the model.
- Generates structured Markdown output of the document.
- Prints the Markdown text to your terminal.
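If you’d rather keep the result than just print it, a small optional addition (assuming the same variables as in kosmos25_md.py above) writes the Markdown to disk:

# Optional: append to kosmos25_md.py to save the generated Markdown
# (assumes `text` already holds the decoded output from the script above).
from pathlib import Path

Path("receipt_00008.md").write_text(text, encoding="utf-8")
print("Saved receipt_00008.md")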
Step 15: OCR with bounding boxes
Create kosmos25_ocr.py and add the following code:
import re, torch, requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
repo = "microsoft/kosmos-2.5"
device = "cuda:0"; dtype = torch.bfloat16
model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)
url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<ocr>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_h = raw_height / height
scale_w = raw_width / width
inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
out_ids = model.generate(**inputs, max_new_tokens=1024)
y = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
# Post-process (from model card example)
pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
boxes_raw = re.findall(pattern, y)
lines = re.split(pattern, y)[1:]
boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]
draw = ImageDraw.Draw(image)
for i, line in enumerate(lines):
    x0, y0, x1, y1 = boxes[i]
    if x0 < x1 and y0 < y1:
        x0, y0, x1, y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
        draw.polygon([x0, y0, x1, y0, x1, y1, x0, y1], outline="red")
image.save("ocr_output.png")
print("Saved ocr_output.png")
Run the script with the following command:
python3 kosmos25_ocr.py
What kosmos25_ocr.py does
Imports libraries
- Same as the Markdown script: torch, requests, PIL.Image, and transformers.
- Adds re (regular expressions) to parse bounding box tags in the model’s output.
- Adds ImageDraw from PIL to draw boxes on the image.
Defines model + device setup
- Loads the Kosmos-2.5 model (microsoft/kosmos-2.5) into GPU memory.
- Uses device = "cuda:0" and dtype = torch.bfloat16 for GPU execution.
- Loads the paired processor for tokenization and image preprocessing.
Fetches the sample image
- Downloads the same receipt image (receipt_00008.png) from Hugging Face.
- Opens it using PIL.
Prepares the task prompt
- Sets prompt = "<ocr>".
- This tells Kosmos-2.5 to generate text with bounding box coordinates for each block of text it detects.
Processes input into tensors
- Calls the processor with the text prompt (<ocr>) + image.
- Extracts height and width from the processed input for scaling.
- Keeps track of raw image dimensions (raw_width, raw_height).
- Computes scaling factors (scale_h, scale_w) so that bounding boxes from the model can be mapped correctly back to the real image size.
Moves data to GPU
- Just like in the Markdown script, pushes tensors to the GPU.
- Converts flattened_patches to bfloat16.
Runs generation with the model
- Calls model.generate() with max 1024 tokens.
- Output contains both text and bounding box tags (e.g., <bbox><x_...><y_...><x_...><y_...></bbox>).
Post-processes the output
- Decodes the model output back to text.
- Removes the prompt from the result.
- Uses regex to extract bounding box coordinates.
- Splits the text into lines associated with those bounding boxes.
- Scales the bounding boxes to match the original image resolution.
Overlays bounding boxes on the image
- Uses PIL’s ImageDraw.Draw to draw red polygons around detected text regions.
- Associates each bounding box with its recognized text.
Saves + prints results
- Saves a new image (ocr_output.png) with bounding boxes drawn.
- Prints a confirmation message (Saved ocr_output.png) to the terminal.
Key Difference vs Markdown script
- Markdown script (kosmos25_md.py) → Converts the entire document into structured Markdown text (no spatial layout).
- OCR script (kosmos25_ocr.py) → Extracts text with spatial coordinates and draws bounding boxes directly onto the image.
In short:
- Run Markdown mode when you want a neat Markdown document version of your image.
- Run OCR mode when you want raw text + bounding boxes for further analysis or visualization.
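If you want the OCR result as structured data rather than an annotated image, the same post-processing can be turned into a list of line/box records. A minimal sketch, reusing the lines, boxes, scale_w, and scale_h variables from kosmos25_ocr.py:

# Optional: pair each recognized line with its scaled box and dump to JSON
# (assumes lines, boxes, scale_w, scale_h from kosmos25_ocr.py are in scope).
import json

records = []
for line, (x0, y0, x1, y1) in zip(lines, boxes):
    records.append({
        "text": line.strip(),
        "box": [int(x0 * scale_w), int(y0 * scale_h), int(x1 * scale_w), int(y1 * scale_h)],
    })

with open("ocr_output.json", "w") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
print(f"Wrote {len(records)} text regions to ocr_output.json")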
Step 16: Install Streamlit
Run the following command to install streamlit:
pip install streamlit
Step 17: Create an app.py
Create a file (e.g., app.py) and add the following code:
import streamlit as st
import torch, requests, re
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration
# Load once at startup
repo = "microsoft/kosmos-2.5"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if "cuda" in device else torch.float32
@st.cache_resource
def load_model():
    model = Kosmos2_5ForConditionalGeneration.from_pretrained(
        repo,
        device_map=device,
        torch_dtype=dtype,
    )
    processor = AutoProcessor.from_pretrained(repo)
    return model, processor
model, processor = load_model()
st.title("Kosmos-2.5 WebUI (OCR + Markdown)")
mode = st.radio("Choose task:", ["Markdown (<md>)", "OCR (<ocr>)"])
uploaded = st.file_uploader("Upload an image", type=["png","jpg","jpeg"])
if uploaded:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded Image", use_column_width=True)
    if st.button("Run Kosmos-2.5"):
        prompt = "<md>" if mode.startswith("Markdown") else "<ocr>"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        height, width = inputs.pop("height"), inputs.pop("width")
        raw_w, raw_h = image.size
        scale_h, scale_w = raw_h / height, raw_w / width
        inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)
        with torch.no_grad():
            out_ids = model.generate(**inputs, max_new_tokens=1024)
        text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
        if mode.startswith("Markdown"):
            st.subheader("Markdown Output")
            st.code(text, language="markdown")
        else:
            # Post-process OCR boxes
            pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>"
            boxes_raw = re.findall(pattern, text)
            lines = re.split(pattern, text)[1:]
            boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]
            draw = ImageDraw.Draw(image)
            for i, line in enumerate(lines):
                x0, y0, x1, y1 = boxes[i]
                if x0 < x1 and y0 < y1:
                    x0, y0, x1, y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
                    draw.polygon([x0, y0, x1, y0, x1, y1, x0, y1], outline="red")
            st.subheader("OCR with Bounding Boxes")
            st.image(image)
            st.text_area("OCR Text", "\n".join(lines), height=200)
Step 18: Launch Streamlit
Run the following command to launch streamlit:
streamlit run app.py
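By default Streamlit binds to port 8501. If you need to expose it explicitly on a remote VM, the standard server flags work (adjust the port if 8501 is already in use):

streamlit run app.py --server.address 0.0.0.0 --server.port 8501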
Step 19: Access the WebUI in Your Browser
Once Streamlit is running, it will display three links:
- Local URL → http://localhost:8501 (works if you’re running on your own machine).
- Network URL → http://<internal-ip>:8501 (for internal access inside your VM network).
- External URL → http://<external-ip>:8501 (use this to open from your laptop/PC browser).
Open the External URL in your browser.
Example:
http://38.29.145.10:8501
The Kosmos-2.5 WebUI will load with:
- A task selector (Markdown <md> or OCR <ocr>).
- An upload box to drag & drop or browse images.
Upload any PNG/JPG/JPEG image (e.g., receipts, invoices, documents).
Click Run and view:
- Markdown Mode → a structured Markdown transcription of the document.
- OCR Mode → text + bounding boxes drawn directly on your image.
Tip: If your VM is remote (e.g., NodeShift), ensure port 8501 is open in firewall/security settings, or use SSH port forwarding:
ssh -L 8501:localhost:8501 root@<your-vm-ip>
Step 20: Upload and Process Documents
In the WebUI, click Browse files (or drag & drop) to upload an image.
- Supported formats: PNG, JPG, JPEG
- File size limit: 200 MB
Once uploaded, the file name will appear below the upload box (e.g., receipt_00008.png).
Choose the task mode:
- Markdown (<md>) → generates a structured Markdown transcription.
- OCR (<ocr>) → extracts text with bounding boxes overlaid on the uploaded image.
The model will process the image and show results below:
- In Markdown Mode → you’ll see neatly formatted text output.
- In OCR Mode → the uploaded image will be re-rendered with red bounding boxes drawn around detected text regions, along with extracted text output.
Tip: If you see a warning about use_column_width being deprecated, you can safely ignore it — it’s a Streamlit UI message and doesn’t affect the model’s output.
Step 21: View OCR Results
Switch the task selector to OCR (<ocr>).
- This tells Kosmos-2.5 to extract text + bounding box coordinates instead of Markdown.
After uploading the image (e.g., receipt_00008.png), the model will process it and return:
- Annotated Image → your uploaded image will now display with red bounding boxes drawn around detected text areas.
- OCR Text Output → the recognized text lines will appear below the image (or in a text box), showing exactly what was extracted from each bounding box.
Use this mode when you need precise localization of text in documents (e.g., invoices, receipts, forms).
Tip: If you want to save the annotated output, you can add download buttons for both the Markdown text and the OCR image; a sketch follows below.
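A minimal, hedged sketch of those download buttons using Streamlit's st.download_button; these lines are hypothetical additions, not part of the app.py shown earlier, and assume the text and image variables from the respective branches:

# Hypothetical additions to app.py
import io

# Inside the Markdown branch, after st.code(text, language="markdown"):
st.download_button("Download Markdown", text, file_name="kosmos_output.md")

# Inside the OCR branch, after drawing the boxes:
buf = io.BytesIO()
image.save(buf, format="PNG")
st.download_button("Download annotated image", buf.getvalue(), file_name="ocr_output.png")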
Conclusion
Kosmos-2.5 makes working with text-heavy images simple — whether you need clean Markdown transcriptions or OCR with bounding boxes. By setting it up on a GPU-powered NodeShift VM and integrating it with a Streamlit WebUI, you now have an efficient, browser-based workflow for document understanding at scale.