<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NodeShift</title>
    <description>The latest articles on DEV Community by NodeShift (@nodeshiftcloud).</description>
    <link>https://dev.to/nodeshiftcloud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9342%2F593abc8f-fa38-4bf2-a081-fa5996a536d5.png</url>
      <title>DEV Community: NodeShift</title>
      <link>https://dev.to/nodeshiftcloud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nodeshiftcloud"/>
    <language>en</language>
    <item>
      <title>A Step-by-Step Guide to Install Qwen3-Next 80B</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 22 Sep 2025 07:15:58 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-next-80b-3dho</link>
      <guid>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-next-80b-3dho</guid>
<description>&lt;p&gt;If you're relentlessly following AI advancements, one thing is clear: the trend has been to go bigger. The new Qwen3-Next-80B series, however, challenges this paradigm by focusing on groundbreaking efficiency rather than raw scale. The model represents a monumental leap forward, delivering the performance of a much larger model at a fraction of the computational cost. At its core is a Hybrid Attention mechanism built to process ultra-long contexts, natively supporting 262,144 tokens and extensible to over a million. This is paired with a High-Sparsity Mixture-of-Experts (MoE) architecture that keeps a staggering 80 billion total parameters on tap while activating only 3 billion at any given time. The result? Drastically reduced computational load, with inference speeds up to 10 times faster than its predecessors on long-context tasks. With additional enhancements like Multi-Token Prediction for accelerated performance and advanced stability optimizations, Qwen3-Next-80B proves its worth by outperforming models like Qwen3-32B at only 10% of the training cost, and it performs on par with much larger models on key reasoning, coding, and alignment benchmarks.&lt;/p&gt;

&lt;p&gt;In this article, we'll walk through a simple and straightforward installation, setup, and usage of this model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 2x H200s or 4x H100s&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 1TB+ (preferred)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 160GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
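
&lt;p&gt;As a rough sanity check on these numbers (our own back-of-envelope arithmetic, not an official sizing guide): just holding 80 billion parameters in bf16 takes about 160GB, which is where the VRAM figure above comes from; activations and KV cache need extra headroom, hence the 2x H200 or 4x H100 recommendation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-envelope VRAM estimate (our own arithmetic, not from the model card)
total_params = 80e9      # total parameters (80B)
active_params = 3e9      # parameters activated per token (3B)
bytes_per_param = 2      # bf16 weights

print(f"Weights alone in bf16: ~{total_params * bytes_per_param / 1e9:.0f} GB")  # ~160 GB
print(f"Active fraction per token: {active_params / total_params:.1%}")          # ~3.8%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;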

&lt;h2&gt;
  
  
  Step-by-Step Process to Install Qwen3-Next-80B-A3B Locally
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift, which provides high-compute VMs at a very affordable cost, at a scale that meets GDPR, SOC2, and ISO27001 requirements. It also offers an intuitive, user-friendly interface, making it easier for beginners to get started with cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control the configuration of GPUs (ranging from H100s to A100s), CPUs, RAM, and storage according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 2x H200 GPU node; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 5 TB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" alt="Image-step3-1" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 2x H200 140GB GPU node with 192vCPUs/504GB RAM/5TB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbh4j66eg49118de0mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cbh4j66eg49118de0mi.png" alt="Image-step4-1" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;NVIDIA CUDA&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds to a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvw5wtv572xfkv01hoyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvw5wtv572xfkv01hoyp.png" alt="Image-step6-1" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you’ve copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flriuy01pg5znv68e8o5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flriuy01pg5znv68e8o5c.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install git+https://github.com/huggingface/transformers.git@main
pip install git+https://github.com/huggingface/accelerate
pip install huggingface_hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7x6myqz7f0exmlawxik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7x6myqz7f0exmlawxik.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Log in to Hugging Face with your HF READ token.&lt;/p&gt;

&lt;p&gt;This is a gated model, so make sure you have been granted access from the model card before downloading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hf auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ft3orv1ku3qro1jdpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8ft3orv1ku3qro1jdpn.png" alt="Image-step7-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Install and run Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5) If you’re on a remote machine (e.g., NodeShift GPU), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
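
&lt;p&gt;For example, with placeholder values filled in (port 12345, a key at ~/.ssh/id_ed25519, and server IP 203.0.113.7, all hypothetical), the command would look like: &lt;code&gt;ssh -L 8888:localhost:8888 -p 12345 -i ~/.ssh/id_ed25519 root@203.0.113.7&lt;/code&gt;&lt;/p&gt;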



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb4ojg5ic1gib6uigxtb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb4ojg5ic1gib6uigxtb.png" alt="Image-step7-4" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz74xo2mne7xx5flisla6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz74xo2mne7xx5flisla6.png" alt="Image-step7-5" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjeb8u2ttf96pxi3enag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjeb8u2ttf96pxi3enag.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;/p&gt;

&lt;p&gt;To download the thinking model, just replace the model_name value with &lt;code&gt;"Qwen/Qwen3-Next-80B-A3B-Thinking"&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnbi60mg9ihs7qyy6k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsnbi60mg9ihs7qyy6k2.png" alt="Image-step8-2" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Run the model for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For inference with the thinking model, use the following snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (&amp;lt;/think&amp;gt;)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content) # no opening &amp;lt;think&amp;gt; tag
print("content:", content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2tngy25r69t5d6tkd11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2tngy25r69t5d6tkd11.png" alt="Image-step8-3" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Qwen3-Next-80B model represents a shift in AI development, prioritizing efficiency over raw scale through its Hybrid Attention and High-Sparsity Mixture-of-Experts (MoE) architecture. This allows it to achieve high performance with a fraction of the computational load, enabling it to handle massive context lengths and accelerate inference speeds. NodeShift Cloud plays a crucial role in making this advanced technology accessible and practical by providing a cost-effective, secure platform for deploying and running such compute-intensive models. By offering affordable GPU resources, NodeShift Cloud democratizes access to state-of-the-art AI, allowing developers and businesses to leverage the power of models like Qwen3-Next-80B without the prohibitive costs and infrastructure management typically associated with large-scale AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>qwen</category>
    </item>
    <item>
      <title>How to Install &amp; Run EmbeddingGemma-300m Locally?</title>
      <dc:creator>Ayush kumar</dc:creator>
      <pubDate>Mon, 08 Sep 2025 09:32:42 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/how-to-install-run-embeddinggemma-300m-locally-223a</link>
      <guid>https://dev.to/nodeshiftcloud/how-to-install-run-embeddinggemma-300m-locally-223a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s8jvagxzyu2c0p4fo5i.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s8jvagxzyu2c0p4fo5i.webp" alt=" " width="640" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EmbeddingGemma-300M is Google DeepMind’s lightweight, multilingual (100+ languages) embedding model built on Gemma 3/T5Gemma foundations. It outputs 768-dim vectors (with Matryoshka down-projections to 512/256/128) optimized for retrieval, classification, clustering, semantic similarity, QA, and code retrieval. It’s designed for low-resource / on-device use, loads via SentenceTransformers, and does not support float16—use FP32 or bfloat16.&lt;/p&gt;
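
&lt;p&gt;Because the model is trained with Matryoshka representations, Sentence Transformers can truncate its 768-dim output at load time via the &lt;code&gt;truncate_dim&lt;/code&gt; argument. A minimal sketch (the 256-dim choice here is ours; 512 and 128 also work):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma and down-project embeddings to 256 dims
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)

emb = model.encode("The quick brown fox")
print(emb.shape)  # (256,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;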

&lt;h3&gt;
  
  
  Evaluation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Benchmark Results
&lt;/h4&gt;

&lt;p&gt;The model was evaluated against a large collection of different datasets and metrics to cover different aspects of text understanding.&lt;/p&gt;

&lt;h4&gt;
  
  
  Full Precision Checkpoint
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowbkwbpttefnyxmx696.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flowbkwbpttefnyxmx696.png" alt=" " width="728" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  QAT Checkpoints
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmolve0cxa1t19thcxcga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmolve0cxa1t19thcxcga.png" alt=" " width="727" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: QAT models are evaluated after quantization&lt;/p&gt;

&lt;p&gt;Mixed Precision refers to per-channel quantization with int4 for embeddings, feedforward, and projection layers, and int8 for attention (e4_a8_f4_p4).&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU/CPU Configuration Table
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftop29mw9z5evjqlj4uox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftop29mw9z5evjqlj4uox.png" alt=" " width="730" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F267y5bqmbmpp76kliwoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F267y5bqmbmpp76kliwoj.png" alt=" " width="730" height="1082"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Link: &lt;a href="https://huggingface.co/google/embeddinggemma-300m" rel="noopener noreferrer"&gt;https://huggingface.co/google/embeddinggemma-300m&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Process to Install &amp;amp; Run EmbeddingGemma-300m Locally
&lt;/h3&gt;

&lt;p&gt;For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Sign Up and Set Up a NodeShift Cloud Account
&lt;/h3&gt;

&lt;p&gt;Visit the &lt;a href="https://app.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;NodeShift Platform&lt;/a&gt; and create an account. Once you’ve signed up, log into your account.&lt;/p&gt;

&lt;p&gt;Follow the account setup process and provide the necessary details and information.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q7rsaawzyhravi6r02x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q7rsaawzyhravi6r02x.png" alt=" " width="640" height="393"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node (Virtual Machine)
&lt;/h3&gt;

&lt;p&gt;GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopxgo5fjs9g7oico94jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopxgo5fjs9g7oico94jk.png" alt=" " width="640" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4m3dhq1wr33a49ihvkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4m3dhq1wr33a49ihvkl.png" alt=" " width="640" height="399"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button, and deploy your first Virtual Machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Select a Model, Region, and Storage
&lt;/h3&gt;

&lt;p&gt;In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx1a2dn42bsv6umr30ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx1a2dn42bsv6umr30ae.png" alt=" " width="640" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoav4839vsrf8qgdksqq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoav4839vsrf8qgdksqq.png" alt=" " width="640" height="312"&gt;&lt;/a&gt;&lt;br&gt;
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Select Authentication Method
&lt;/h3&gt;

&lt;p&gt;There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running EmbeddingGemma-300m, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications like EmbeddingGemma-300m&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;
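
&lt;p&gt;Once the VM is up, you can confirm the toolkit is actually present by running &lt;code&gt;nvcc --version&lt;/code&gt; inside the environment; it should report release 12.1.&lt;/p&gt;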

&lt;h3&gt;
  
  
  Launch Mode
&lt;/h3&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like EmbeddingGemma-300m.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Repository Authentication
&lt;/h3&gt;

&lt;p&gt;We left all fields empty here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification
&lt;/h3&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are the CUDA and cuDNN images from gitlab.com/nvidia/cuda; the devel variant contains the full CUDA toolkit, including nvcc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rmzf8k4qok26mm8izj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rmzf8k4qok26mm8izj0.png" alt=" " width="640" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lf7i43zo43xkp3ea12y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lf7i43zo43xkp3ea12y.png" alt=" " width="640" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup ensures that the EmbeddingGemma-300m runs in a GPU-enabled environment with proper CUDA access and high compute performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j197wuier91oyk5vji0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j197wuier91oyk5vji0.png" alt=" " width="640" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxk4hb2qls0ihfd36ra6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxk4hb2qls0ihfd36ra6.png" alt=" " width="640" height="293"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Virtual Machine Successfully Deployed
&lt;/h3&gt;

&lt;p&gt;You will get visual confirmation that your node is up and running.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1jhmusnpsjq9u8nx21u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1jhmusnpsjq9u8nx21u.png" alt=" " width="640" height="249"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Connect to GPUs using SSH
&lt;/h3&gt;

&lt;p&gt;NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.&lt;/p&gt;

&lt;p&gt;Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sofyd8mrz7v0hp6m2kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sofyd8mrz7v0hp6m2kq.png" alt=" " width="640" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18zn8nuu0bwybljt5fdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18zn8nuu0bwybljt5fdi.png" alt=" " width="640" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open your terminal and paste the proxy SSH IP or direct SSH IP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dx438rmhgsrjx9p4koj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dx438rmhgsrjx9p4koj.png" alt=" " width="640" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7blq7r6dybq8cq0n38t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7blq7r6dybq8cq0n38t.png" alt=" " width="640" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Verify Python Version &amp;amp; Install pip (if not present)
&lt;/h3&gt;

&lt;p&gt;Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 8.1: Check Python Version
&lt;/h4&gt;

&lt;p&gt;Run the following command to verify Python 3.10 is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.10.12

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8.2: Install pip (if not already installed)
&lt;/h3&gt;

&lt;p&gt;Even if Python is installed, pip might not be available.&lt;/p&gt;

&lt;p&gt;Check if pip exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get an error like &lt;code&gt;command not found&lt;/code&gt;, install pip manually.&lt;/p&gt;

&lt;p&gt;Install pip via get-pip.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download and install pip into your system.&lt;/p&gt;

&lt;p&gt;You may see a warning about running as root — that’s okay for now.&lt;/p&gt;

&lt;p&gt;After installation, verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now pip is ready to install packages like transformers, torch, etc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcxf69324kayruz15xsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcxf69324kayruz15xsm.png" alt=" " width="640" height="341"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 9: Create and Activate a Python 3.10 Virtual Environment
&lt;/h3&gt;

&lt;p&gt;Run the following commands to create and activate a Python 3.10 virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt update &amp;amp;&amp;amp; apt install -y python3.10-venv git wget
python3.10 -m venv gemma
source gemma/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzfq8jzynessrjrogeeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzfq8jzynessrjrogeeu.png" alt=" " width="640" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Install Dependencies
&lt;/h3&gt;

&lt;p&gt;Run the following command to install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U sentence-transformers faiss-cpu

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu81nkj0if1309ewak5vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu81nkj0if1309ewak5vn.png" alt=" " width="640" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 11: Install Hugging Face Hub
&lt;/h3&gt;

&lt;p&gt;Run the following command to install huggingface_hub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U huggingface_hub

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filjf3m01qa2mkz6dje37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filjf3m01qa2mkz6dje37.png" alt=" " width="640" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 12: Log in to Hugging Face (CLI)
&lt;/h3&gt;

&lt;p&gt;Run the following command to log in to Hugging Face:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;huggingface-cli login

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When prompted, paste your HF token (from &lt;a href="https://huggingface.co/settings/tokens" rel="noopener noreferrer"&gt;https://huggingface.co/settings/tokens&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For “Add token as git credential? (Y/n)”:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Y if you plan to git clone models/repos.&lt;/li&gt;
&lt;li&gt;n if you only use huggingface_hub downloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should see: “Token is valid… saved to /root/.cache/huggingface/stored_tokens”.&lt;/p&gt;

&lt;p&gt;The red line “Cannot authenticate through git-credential…” just means no Git credential helper is set. It’s safe to ignore.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5qikxgm86loiip6zaee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5qikxgm86loiip6zaee.png" alt=" " width="640" height="345"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 13: Connect to Your GPU VM with a Code Editor
&lt;/h3&gt;

&lt;p&gt;Before you start running scripts with the EmbeddingGemma-300m model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.&lt;/li&gt;
&lt;li&gt;In this example, we’re using the Cursor code editor.&lt;/li&gt;
&lt;li&gt;Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why do this?&lt;br&gt;
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydkt41s289q3bjvnm6j4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydkt41s289q3bjvnm6j4.png" alt=" " width="640" height="350"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 14: Create app.py and Add the Following Code
&lt;/h3&gt;

&lt;p&gt;Create the file&lt;br&gt;
From your VM terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano app.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in VS Code, click New File → name it app.py.&lt;/p&gt;

&lt;p&gt;Paste this code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer
import numpy as np

# Load the EmbeddingGemma-300M model (Google’s open embedding model)
model = SentenceTransformer("google/embeddinggemma-300m")  # auto device (CPU/GPU)

# A sample query
query = "Which planet is known as the Red Planet?"

# A small list of candidate documents
docs = [
    "Venus is often called Earth's twin.",
    "Mars, with its reddish hue, is the Red Planet.",
    "Jupiter is the largest planet.",
    "Saturn has iconic rings."
]

# Encode the query → vector representation optimized for search
q = model.encode_query(query)

# Encode the documents → vector representations optimized for retrieval
D = model.encode_document(docs)

# Compute similarity between the query vector and each document vector
scores = model.similarity(q, D).squeeze().tolist()

# Pair each score with its document and sort (highest similarity first)
ranked = sorted(zip(scores, docs), reverse=True)

# Print top 3 results
print(ranked[:3])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What this file does (detailed)
&lt;/h4&gt;

&lt;p&gt;Imports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SentenceTransformer loads the EmbeddingGemma-300M model.&lt;/li&gt;
&lt;li&gt;numpy is imported for vector math (this minimal demo doesn’t use it directly).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads the Google EmbeddingGemma-300M embedding model, which converts text into vectors (embeddings).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Query + documents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Defines one query ("Which planet is known as the Red Planet?") and a small set of candidate sentences (our mini “document corpus”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Encoding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model.encode_query(query) → creates a vector representation of the query.&lt;/li&gt;
&lt;li&gt;model.encode_document(docs) → creates vector representations of the candidate docs.&lt;/li&gt;
&lt;li&gt;Using separate methods ensures query/document embeddings are tuned for retrieval (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
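
&lt;p&gt;Under the hood, these methods apply model-specific instruction prefixes to the raw text before embedding it. If you are curious which prompts your installed checkpoint defines, Sentence Transformers exposes them on the model object (exact keys and strings depend on the checkpoint's configuration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inspect the prompt templates bundled with the checkpoint
print(model.prompts)              # dict of prompt name -&amp;gt; prefix string
print(model.default_prompt_name)  # prompt used when none is specified
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;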

&lt;p&gt;Similarity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model.similarity(q, D) computes how close each doc is to the query in vector space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ranking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorts docs by similarity score (highest first). The result shows which document best answers the query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prints the top 3 results. You should see “Mars…” ranked highest, since it matches the Red Planet question.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;br&gt;
app.py is a minimal semantic search demo using EmbeddingGemma. It shows how to encode queries &amp;amp; docs, compute similarity, and rank results — the basic workflow behind search engines, chatbots, and RAG systems.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgv7jrq0h05wyxvabijn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgv7jrq0h05wyxvabijn.png" alt=" " width="640" height="326"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 15: Run the Script
&lt;/h3&gt;

&lt;p&gt;Run the script with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 app.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download the model and print the response in the terminal.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokisiub1bayg6h2lggn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokisiub1bayg6h2lggn7.png" alt=" " width="640" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtquv8ccogdva01cj30f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtquv8ccogdva01cj30f.png" alt=" " width="640" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 16: Create build_index.py and add the following code
&lt;/h3&gt;

&lt;p&gt;Create the file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano build_index.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in VS Code → New File → name it build_index.py.&lt;/p&gt;

&lt;p&gt;Paste the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, json, argparse, numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer
import faiss

def read_corpus(folder):
    paths = []
    texts = []
    for p in Path(folder).rglob("*"):
        if p.suffix.lower() in {".txt", ".md"} and p.stat().st_size &amp;gt; 0:
            paths.append(str(p))
            texts.append(p.read_text(encoding="utf-8", errors="ignore"))
    return paths, texts

def mrl_truncate_and_norm(X, k):
    X = X[:, :k]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X.astype("float32")

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--data_dir", required=True, help="Folder with .txt/.md")
    ap.add_argument("--dim", type=int, default=768, choices=[768,512,256,128])
    ap.add_argument("--out_dir", default="index")
    args = ap.parse_args()

    os.makedirs(args.out_dir, exist_ok=True)

    print("Loading model…")
    model = SentenceTransformer("google/embeddinggemma-300m")  # fp32/bf16 only

    print("Reading corpus…")
    paths, texts = read_corpus(args.data_dir)
    assert texts, "No .txt/.md files found"

    print(f"Encoding {len(texts)} docs…")
    D = model.encode_document(texts, batch_size=64, convert_to_numpy=True)
    # L2-normalize (cosine sim via inner product)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)

    if args.dim &amp;lt; 768:
        print(f"Applying Matryoshka truncation to {args.dim}…")
        D = mrl_truncate_and_norm(D, args.dim)

    index = faiss.IndexFlatIP(D.shape[1])
    index.add(D)

    faiss.write_index(index, f"{args.out_dir}/faiss_{args.dim}.index")
    np.save(f"{args.out_dir}/embeddings_{args.dim}.npy", D)
    with open(f"{args.out_dir}/mapping.json", "w") as f:
        json.dump(paths, f, indent=2)

    print(f"Saved index to {args.out_dir} (dim={args.dim}, N={len(texts)})")

if __name__ == "__main__":
    main()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What this script does
&lt;/h4&gt;

&lt;p&gt;read_corpus(folder):&lt;br&gt;
Reads all .txt and .md files in the given folder. Returns two lists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;paths → file paths&lt;/li&gt;
&lt;li&gt;texts → file contents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;mrl_truncate_and_norm(X, k):&lt;br&gt;
Implements Matryoshka Representation Learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes embeddings of size 768.&lt;/li&gt;
&lt;li&gt;Truncates to smaller dimension (512, 256, or 128).&lt;/li&gt;
&lt;li&gt;Re-normalizes them for cosine similarity search (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
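&lt;p&gt;For intuition, here is a minimal numpy sketch (using random stand-in vectors, not real embeddings) of the truncate-and-renormalize step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

# Random stand-ins for three 768-dim embeddings, L2-normalized
X = np.random.randn(3, 768)
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Keep only the first 256 dims, then re-normalize so cosine similarity still works
X_small = X[:, :256]
X_small = X_small / np.linalg.norm(X_small, axis=1, keepdims=True)

print(X_small.shape)                    # (3, 256)
print(np.linalg.norm(X_small, axis=1))  # all ~1.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
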

&lt;p&gt;main():&lt;br&gt;
Parses arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;--data_dir → where your text files are.&lt;/li&gt;
&lt;li&gt;--dim → embedding size (default 768).&lt;/li&gt;
&lt;li&gt;--out_dir → where to save the index (default index/).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Load the EmbeddingGemma-300M model.&lt;br&gt;
Read all docs from your folder.&lt;br&gt;
Encode them with model.encode_document().&lt;br&gt;
Normalize vectors.&lt;br&gt;
Optionally shrink with MRL.&lt;br&gt;
Create a FAISS index (cosine similarity using IndexFlatIP).&lt;/p&gt;

&lt;p&gt;Save:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faiss_{dim}.index → the FAISS index file.&lt;/li&gt;
&lt;li&gt;embeddings_{dim}.npy → the numpy array of embeddings.&lt;/li&gt;
&lt;li&gt;mapping.json → maps index rows back to file paths.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfvf4dl5olpot3vfest4.png" alt=" " width="640" height="330"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  How to run it
&lt;/h4&gt;

&lt;p&gt;Create some docs (if you don’t have any yet):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir docs
echo "Mars is the Red Planet." &amp;gt; docs/mars.txt
echo "Venus is Earth's twin." &amp;gt; docs/venus.txt
echo "Jupiter is the largest planet." &amp;gt; docs/jupiter.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvvdtb4v4xof85kdnkh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jvvdtb4v4xof85kdnkh.png" alt=" " width="640" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 build_index.py --data_dir ./docs

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read your .txt files in docs/&lt;/li&gt;
&lt;li&gt;Encode them with EmbeddingGemma-300M&lt;/li&gt;
&lt;li&gt;Save an index under ./index/&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loading model…
Reading corpus…
Encoding 3 docs…
Saved index to index (dim=768, N=3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqvphtaudele3w0oxp0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqvphtaudele3w0oxp0o.png" alt=" " width="640" height="137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get after running
&lt;/h4&gt;

&lt;p&gt;Inside the index/ folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faiss_768.index → FAISS index file&lt;/li&gt;
&lt;li&gt;embeddings_768.npy → stored embeddings&lt;/li&gt;
&lt;li&gt;mapping.json → JSON list of the indexed file paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: build_index.py prepares your text files into a searchable embedding index using EmbeddingGemma + FAISS.&lt;/p&gt;
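&lt;p&gt;As a natural next step, here is a minimal search sketch that loads the artifacts build_index.py just saved (assuming the default index/ folder and dim=768) and retrieves the top matches for a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
index = faiss.read_index("index/faiss_768.index")
with open("index/mapping.json") as f:
    paths = json.load(f)

# Encode and L2-normalize the query so inner product == cosine similarity
q = model.encode_query("Which planet is known as the Red Planet?")
q = (q / np.linalg.norm(q)).reshape(1, -1).astype("float32")

scores, ids = index.search(q, 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {paths[i]}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
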

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;EmbeddingGemma-300M is a powerful yet lightweight open embedding model from Google DeepMind, designed for retrieval, semantic similarity, classification, clustering, and more — all while being efficient enough to run on laptops, desktops, or modest GPUs. In this guide, we walked through setting up a NodeShift GPU VM, installing dependencies, and building two core scripts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app.py for a quick semantic search demo using queries and documents.&lt;/li&gt;
&lt;li&gt;build_index.py for preparing and indexing your own text corpus with FAISS, ready for scalable search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these steps, you now have everything you need to integrate EmbeddingGemma into search pipelines, recommendation systems, or retrieval-augmented applications. Whether on-device or in the cloud, EmbeddingGemma-300M provides a practical and cost-effective foundation for embedding-based workflows.&lt;/p&gt;

</description>
      <category>gemma</category>
      <category>opensource</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>How to Install &amp; Run Microsoft Kosmos-2.5 Locally?</title>
      <dc:creator>Ayush kumar</dc:creator>
      <pubDate>Mon, 08 Sep 2025 08:24:43 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/how-to-install-run-microsoft-kosmos-25-locally-l5a</link>
      <guid>https://dev.to/nodeshiftcloud/how-to-install-run-microsoft-kosmos-25-locally-l5a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuu8jvqu8jbo4ho3ulrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuu8jvqu8jbo4ho3ulrr.png" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kosmos-2.5 is Microsoft’s multimodal “literate” model for reading text-heavy images (receipts, invoices, forms, docs). It does two things out of the box using task prompts: (a) OCR with spatially-aware text blocks (text + bounding boxes) via the &amp;lt;ocr&amp;gt; prompt, and (b) image→Markdown conversion via the &amp;lt;md&amp;gt; prompt. It’s implemented in Transformers (supported from v4.56+) with ready-to-run Python snippets, and the paper details the shared decoder-only architecture and doc-understanding focus.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Configuration (What Actually Works)
&lt;/h3&gt;

&lt;p&gt;Ballpark VRAM, based on the 1.3B-param model running in bfloat16 with image patches; add headroom for long outputs / larger pages.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1hog1w5xqt0nz4vvxwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1hog1w5xqt0nz4vvxwo.png" alt=" " width="738" height="571"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Link: &lt;a href="https://huggingface.co/microsoft/kosmos-2.5" rel="noopener noreferrer"&gt;https://huggingface.co/microsoft/kosmos-2.5&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Process to Install &amp;amp; Run Microsoft Kosmos-2.5 Locally
&lt;/h3&gt;

&lt;p&gt;For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Sign Up and Set Up a NodeShift Cloud Account
&lt;/h3&gt;

&lt;p&gt;Visit the &lt;a href="https://app.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;NodeShift Platform&lt;/a&gt; and create an account. Once you’ve signed up, log into your account.&lt;/p&gt;

&lt;p&gt;Follow the account setup process and provide the necessary details and information.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8e06ybc9bl8jmcg81gh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8e06ybc9bl8jmcg81gh.png" alt=" " width="640" height="386"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node (Virtual Machine)
&lt;/h3&gt;

&lt;p&gt;GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2oycbf1l536gbkrsynq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2oycbf1l536gbkrsynq.png" alt=" " width="640" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu73a1bdseer5kzzn9enj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu73a1bdseer5kzzn9enj.png" alt=" " width="640" height="390"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the menu on the left side. Select the GPU Nodes option, create a GPU Node in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Select a Model, Region, and Storage
&lt;/h3&gt;

&lt;p&gt;In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1el6nd3ybk4p0uxghue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1el6nd3ybk4p0uxghue.png" alt=" " width="640" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t8yafbfs07z15pxzl3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4t8yafbfs07z15pxzl3m.png" alt=" " width="640" height="369"&gt;&lt;/a&gt;&lt;br&gt;
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Select Authentication Method
&lt;/h3&gt;

&lt;p&gt;There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5l2fe0el4zbqq1c7h39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5l2fe0el4zbqq1c7h39.png" alt=" " width="640" height="176"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Microsoft Kosmos-2.5, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications like Microsoft Kosmos-2.5&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;
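&lt;p&gt;Once the VM is up (see Step 7), you can quickly confirm the toolkit is actually present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvcc --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
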

&lt;h3&gt;
  
  
  Launch Mode
&lt;/h3&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching tools like Microsoft Kosmos-2.5.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Repository Authentication
&lt;/h3&gt;

&lt;p&gt;We left all fields empty here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification
&lt;/h3&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9mzpd3x3j7e1pfp0rha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9mzpd3x3j7e1pfp0rha.png" alt=" " width="640" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wegse6qk0918lzixzt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wegse6qk0918lzixzt5.png" alt=" " width="640" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup ensures that Microsoft Kosmos-2.5 runs in a GPU-enabled environment with proper CUDA access and high compute performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jjdsztl4zdcryc6cwn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jjdsztl4zdcryc6cwn4.png" alt=" " width="640" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwjvikx8vc9q24g3vpo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwjvikx8vc9q24g3vpo3.png" alt=" " width="640" height="317"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Connect to GPUs using SSH
&lt;/h3&gt;

&lt;p&gt;NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.&lt;/p&gt;

&lt;p&gt;Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zntu345gnwicehiu5uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zntu345gnwicehiu5uh.png" alt=" " width="640" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0f2vrxt4w6menx7ny5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0f2vrxt4w6menx7ny5g.png" alt=" " width="640" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open your terminal and paste the proxy SSH IP or direct SSH IP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4011a9majttmj69ked9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4011a9majttmj69ked9.png" alt=" " width="640" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, If you want to check the GPU details, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2yrbjv7jvo9ptgqbol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flf2yrbjv7jvo9ptgqbol.png" alt=" " width="640" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Verify Python Version &amp;amp; Install pip (if not present)
&lt;/h3&gt;

&lt;p&gt;Since Python 3.10 is already installed, we’ll confirm its version and ensure pip is available for package installation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 8.1: Check Python Version
&lt;/h4&gt;

&lt;p&gt;Run the following command to verify Python 3.10 is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Python 3.10.12

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8.2: Install pip (if not already installed)
&lt;/h3&gt;

&lt;p&gt;Even if Python is installed, pip might not be available.&lt;/p&gt;

&lt;p&gt;Check if pip exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get an error like command not found, then install pip manually.&lt;/p&gt;

&lt;p&gt;Install pip via get-pip.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download and install pip into your system.&lt;/p&gt;

&lt;p&gt;You may see a warning about running as root — that’s okay for now.&lt;/p&gt;

&lt;p&gt;After installation, verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip 25.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now pip is ready to install packages like transformers, torch, etc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp7tikoe370l4sir6glm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp7tikoe370l4sir6glm.png" alt=" " width="640" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
Step 9: Create and Activate a Python 3.10 Virtual Environment
&lt;/h3&gt;

&lt;p&gt;Run the following commands to create and activate a Python 3.10 virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt update &amp;amp;&amp;amp; apt install -y python3.10-venv git wget
python3.10 -m venv kosmos
source kosmos/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu71stjheh20nkesfgcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu71stjheh20nkesfgcu.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Install PyTorch
&lt;/h3&gt;

&lt;p&gt;Run the following command to install PyTorch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhxyq96g0fzo0q2trnx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhxyq96g0fzo0q2trnx6.png" alt=" " width="640" height="408"&gt;&lt;/a&gt;&lt;/p&gt;
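&lt;p&gt;Optionally, verify that this PyTorch build can actually see the GPU before moving on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
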

&lt;h3&gt;
  
  
  Step 11: Install Model Dependencies
&lt;/h3&gt;

&lt;p&gt;Run the following command to install model dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "transformers&amp;gt;=4.56" accelerate pillow requests

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transformers ≥4.56 is required.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85iq1bempki2g23kg8s4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F85iq1bempki2g23kg8s4.png" alt=" " width="640" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 12: Install Wheel &amp;amp; Flash Attn
&lt;/h3&gt;

&lt;p&gt;Run the following command to install wheel &amp;amp; flash-attn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install wheel
pip install flash-attn --no-build-isolation

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2btkzojw7sh4bs16icq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2btkzojw7sh4bs16icq.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 13: Connect to Your GPU VM with a Code Editor
&lt;/h3&gt;

&lt;p&gt;Before you start running scripts with the Microsoft Kosmos-2.5 model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.&lt;/li&gt;
&lt;li&gt;In this example, we’re using the Cursor code editor.&lt;/li&gt;
&lt;li&gt;Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why do this?&lt;br&gt;
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oia2dz8u3zcddnyp1nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oia2dz8u3zcddnyp1nu.png" alt=" " width="640" height="471"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 14: Smoke Test: Markdown Extraction
&lt;/h3&gt;

&lt;p&gt;Create kosmos25_md.py and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch, requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "microsoft/kosmos-2.5"
device = "cuda:0"
dtype = torch.bfloat16

model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # If you installed flash-attn, uncomment the next line
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)

# Sample image from the model card
url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "&amp;lt;md&amp;gt;"
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Keep &amp;amp; use the scaled dimensions from the model card example
height, width = inputs.pop("height"), inputs.pop("width")

inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

out_ids = model.generate(**inputs, max_new_tokens=1024)
text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
print(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8896qqc4sdouvlgh4z1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8896qqc4sdouvlgh4z1.png" alt=" " width="640" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the script with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 kosmos25_md.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What kosmos25_md.py does
&lt;/h4&gt;

&lt;p&gt;Imports libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;torch: for running the model on GPU/CPU.&lt;/li&gt;
&lt;li&gt;requests: to download a sample image from the Hugging Face repo.&lt;/li&gt;
&lt;li&gt;PIL.Image: to load and process that image.&lt;/li&gt;
&lt;li&gt;transformers: provides the AutoProcessor (for preprocessing text+images) and Kosmos2_5ForConditionalGeneration (the actual model).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Defines model + device setup&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chooses repo = “microsoft/kosmos-2.5”.&lt;/li&gt;
&lt;li&gt;Sets device = "cuda:0" (so it uses your first GPU).&lt;/li&gt;
&lt;li&gt;Uses dtype = torch.bfloat16 (lighter precision for efficiency).&lt;/li&gt;
&lt;li&gt;Loads the model weights from Hugging Face into GPU memory.&lt;/li&gt;
&lt;li&gt;Loads the paired processor, which knows how to tokenize text and convert images into patches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fetches a sample image&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads a receipt image (receipt_00008.png) directly from the Hugging Face repo.&lt;/li&gt;
&lt;li&gt;Opens it with PIL so it’s ready to feed to the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prepares the task prompt&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets prompt = "".&lt;/li&gt;
&lt;li&gt;This tells Kosmos-2.5 you want Markdown transcription (not OCR bounding boxes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Processes input into tensors&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls the processor with the text prompt (&amp;lt;md&amp;gt;) + image.&lt;/li&gt;
&lt;li&gt;Returns model-ready tensors (pixel_values, input_ids, flattened_patches, height, width).&lt;/li&gt;
&lt;li&gt;Keeps track of height and width (for scaling purposes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moves data to GPU&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iterates over input tensors and sends them to the CUDA device.&lt;/li&gt;
&lt;li&gt;Ensures flattened_patches are stored in bfloat16 for efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Runs generation with the model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls model.generate() with inputs.&lt;/li&gt;
&lt;li&gt;max_new_tokens=1024 → allows up to 1024 tokens of output.&lt;/li&gt;
&lt;li&gt;The model produces a sequence representing Markdown text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Decodes the output&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses processor.batch_decode() to convert model IDs back into text.&lt;/li&gt;
&lt;li&gt;Skips special tokens so that only the readable text remains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prints result to terminal&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Displays the generated Markdown string representing the document layout.&lt;/li&gt;
&lt;li&gt;Example: headings, tables, or text blocks reflecting the receipt’s content.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0cjfw7apajxjl723bxh.png" alt=" " width="640" height="409"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;When you run python3 kosmos25_md.py, the script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads Kosmos-2.5 on GPU in bf16.&lt;/li&gt;
&lt;li&gt;Downloads a sample receipt image.&lt;/li&gt;
&lt;li&gt;Sends the &amp;lt;md&amp;gt; prompt + image through the model.&lt;/li&gt;
&lt;li&gt;Generates structured Markdown output of the document.&lt;/li&gt;
&lt;li&gt;Prints the Markdown text to your terminal.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4wki15wcxzkz9uywkr1.png" alt=" " width="640" height="408"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 15: OCR with bounding boxes
&lt;/h3&gt;

&lt;p&gt;Create kosmos25_ocr.py and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re, torch, requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

repo = "microsoft/kosmos-2.5"
device = "cuda:0"; dtype = torch.bfloat16

model = Kosmos2_5ForConditionalGeneration.from_pretrained(
    repo,
    device_map=device,
    torch_dtype=dtype,
    # attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(repo)

url = "https://huggingface.co/microsoft/kosmos-2.5/resolve/main/receipt_00008.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "&amp;lt;ocr&amp;gt;"
inputs = processor(text=prompt, images=image, return_tensors="pt")
height, width = inputs.pop("height"), inputs.pop("width")
raw_width, raw_height = image.size
scale_h = raw_height / height
scale_w = raw_width / width

inputs = {k: (v.to(device) if v is not None else None) for k, v in inputs.items()}
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

out_ids = model.generate(**inputs, max_new_tokens=1024)
y = processor.batch_decode(out_ids, skip_special_tokens=True)[0]

# Post-process (from model card example)
pattern = r"&amp;lt;bbox&amp;gt;&amp;lt;x_\\d+&amp;gt;&amp;lt;y_\\d+&amp;gt;&amp;lt;x_\\d+&amp;gt;&amp;lt;y_\\d+&amp;gt;&amp;lt;/bbox&amp;gt;"
boxes_raw = re.findall(pattern, y)
lines = re.split(pattern, y)[1:]
boxes = [[int(j) for j in re.findall(r"\\d+", i)] for i in boxes_raw]

draw = ImageDraw.Draw(image)
for i, line in enumerate(lines):
    x0,y0,x1,y1 = boxes[i]
    if x0 &amp;lt; x1 and y0 &amp;lt; y1:
        x0,y0,x1,y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
        draw.polygon([x0,y0, x1,y0, x1,y1, x0,y1], outline="red")
image.save("ocr_output.png")
print("Saved ocr_output.png")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowi8zi007jp6n9rxfjr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowi8zi007jp6n9rxfjr6.png" alt=" " width="640" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the script with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 kosmos25_ocr.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What kosmos25_ocr.py does
&lt;/h4&gt;

&lt;p&gt;Imports libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same as the Markdown script: torch, requests, PIL.Image, and transformers.&lt;/li&gt;
&lt;li&gt;Adds re (regular expressions) to parse bounding box tags in the model’s output.&lt;/li&gt;
&lt;li&gt;Adds ImageDraw from PIL to draw boxes on the image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Defines model + device setup&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads the Kosmos-2.5 model (microsoft/kosmos-2.5) into GPU memory.&lt;/li&gt;
&lt;li&gt;Uses device = "cuda:0" and dtype = torch.bfloat16 for GPU execution.&lt;/li&gt;
&lt;li&gt;Loads the paired processor for tokenization and image preprocessing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fetches the sample image&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads the same receipt image (receipt_00008.png) from Hugging Face.&lt;/li&gt;
&lt;li&gt;Opens it using PIL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prepares the task prompt&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sets prompt = "".&lt;/li&gt;
&lt;li&gt;This tells Kosmos-2.5 to generate text with bounding box coordinates for each block of text it detects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Processes input into tensors&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls the processor with the text prompt (&amp;lt;ocr&amp;gt;) + image.&lt;/li&gt;
&lt;li&gt;Extracts height and width from the processed input for scaling.&lt;/li&gt;
&lt;li&gt;Keeps track of raw image dimensions (raw_width, raw_height).&lt;/li&gt;
&lt;li&gt;Computes scaling factors (scale_h, scale_w) so that bounding boxes from the model can be mapped correctly to the real image size (see the numeric sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
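&lt;p&gt;For intuition, here is the same scaling math with hypothetical numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sizes: the processor worked on a 1024x1024 canvas,
# while the raw receipt image is 500x1500 pixels
raw_width, raw_height = 500, 1500
width, height = 1024, 1024
scale_w = raw_width / width    # ~0.49
scale_h = raw_height / height  # ~1.46

# A box the model emits as (x0, y0, x1, y1) maps back to raw-image pixels
x0, y0, x1, y1 = 100, 200, 400, 260
print(int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h))
# prints: 48 292 195 380

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
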

&lt;p&gt;Moves data to GPU&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Just like in the Markdown script, pushes tensors to the GPU.&lt;/li&gt;
&lt;li&gt;Converts flattened_patches to bfloat16.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Runs generation with the model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calls model.generate() with max 1024 tokens.&lt;/li&gt;
&lt;li&gt;Output contains both text and bounding box tags (e.g., &amp;lt;bbox&amp;gt;...&amp;lt;/bbox&amp;gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Post-processes the output&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decodes the model output back to text.&lt;/li&gt;
&lt;li&gt;Removes the &amp;lt;ocr&amp;gt; prompt from the result.&lt;/li&gt;
&lt;li&gt;Uses regex to extract bounding box coordinates.&lt;/li&gt;
&lt;li&gt;Splits the text into lines associated with those bounding boxes (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Scales the bounding boxes to match the original image resolution.&lt;/li&gt;
&lt;/ul&gt;
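&lt;p&gt;To see the bbox parsing in isolation, here is a minimal sketch run on a short hypothetical decoded string (real model output follows the same tag format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

# Hypothetical decoded output: each text line is preceded by a bbox tag
y = "&amp;lt;bbox&amp;gt;&amp;lt;x_10&amp;gt;&amp;lt;y_20&amp;gt;&amp;lt;x_110&amp;gt;&amp;lt;y_40&amp;gt;&amp;lt;/bbox&amp;gt;TOTAL\n&amp;lt;bbox&amp;gt;&amp;lt;x_10&amp;gt;&amp;lt;y_50&amp;gt;&amp;lt;x_90&amp;gt;&amp;lt;y_70&amp;gt;&amp;lt;/bbox&amp;gt;$5.00\n"
pattern = r"&amp;lt;bbox&amp;gt;&amp;lt;x_\d+&amp;gt;&amp;lt;y_\d+&amp;gt;&amp;lt;x_\d+&amp;gt;&amp;lt;y_\d+&amp;gt;&amp;lt;/bbox&amp;gt;"

boxes_raw = re.findall(pattern, y)
lines = re.split(pattern, y)[1:]
boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]

print(boxes)  # [[10, 20, 110, 40], [10, 50, 90, 70]]
print(lines)  # ['TOTAL\n', '$5.00\n']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
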

&lt;p&gt;Overlays bounding boxes on the image&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses PIL’s ImageDraw.Draw to draw red polygons around detected text regions.&lt;/li&gt;
&lt;li&gt;Associates each bounding box with its recognized text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Saves + prints results&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Saves a new image (ocr_output.png) with the bounding boxes drawn.&lt;/li&gt;
&lt;li&gt;Prints a confirmation ("Saved ocr_output.png") in the terminal.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdgmfsf18py9cvub76pc.png" alt=" " width="640" height="408"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Key Difference vs Markdown script
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Markdown script (kosmos25_md.py) → Converts the entire document into structured Markdown text (no spatial layout).&lt;/li&gt;
&lt;li&gt;OCR script (kosmos25_ocr.py) → Extracts text with spatial coordinates and draws bounding boxes directly onto the image.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh0u4qb2gq4vuwbmq9h1.png" alt=" " width="640" height="409"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run Markdown mode when you want a neat Markdown document version of your image.&lt;/li&gt;
&lt;li&gt;Run OCR mode when you want raw text + bounding boxes for further analysis or visualization.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3t8hc1b1cy1oziip03c.png" alt=" " width="640" height="508"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 16: Install Streamlit
&lt;/h3&gt;

&lt;p&gt;Run the following command to install streamlit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install streamlit

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcyx7u4sltfw3y8bz71r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkcyx7u4sltfw3y8bz71r.png" alt=" " width="640" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
Step 17: Create an app.py
&lt;/h3&gt;

&lt;p&gt;Create a file (e.g., app.py) and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import torch, requests, re
from PIL import Image, ImageDraw
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration

# Load once at startup
repo = "microsoft/kosmos-2.5"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if "cuda" in device else torch.float32

@st.cache_resource
def load_model():
    model = Kosmos2_5ForConditionalGeneration.from_pretrained(
        repo,
        device_map=device,
        torch_dtype=dtype,
    )
    processor = AutoProcessor.from_pretrained(repo)
    return model, processor

model, processor = load_model()

st.title("Kosmos-2.5 WebUI (OCR + Markdown)")
mode = st.radio("Choose task:", ["Markdown (&amp;lt;md&amp;gt;)", "OCR (&amp;lt;ocr&amp;gt;)"])
uploaded = st.file_uploader("Upload an image", type=["png","jpg","jpeg"])

if uploaded:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded Image", use_column_width=True)

    if st.button("Run Kosmos-2.5"):
        prompt = "&amp;lt;md&amp;gt;" if mode.startswith("Markdown") else "&amp;lt;ocr&amp;gt;"
        inputs = processor(text=prompt, images=image, return_tensors="pt")
        height, width = inputs.pop("height"), inputs.pop("width")
        raw_w, raw_h = image.size
        scale_h, scale_w = raw_h/height, raw_w/width

        inputs = {k: (v.to(device) if v is not None else None) for k,v in inputs.items()}
        inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype)

        with torch.no_grad():
            out_ids = model.generate(**inputs, max_new_tokens=1024)
        text = processor.batch_decode(out_ids, skip_special_tokens=True)[0]

        if mode.startswith("Markdown"):
            st.subheader("Markdown Output")
            st.code(text, language="markdown")
        else:
            # Post-process OCR boxes
            pattern = r"&amp;lt;bbox&amp;gt;&amp;lt;x_\d+&amp;gt;&amp;lt;y_\d+&amp;gt;&amp;lt;x_\d+&amp;gt;&amp;lt;y_\d+&amp;gt;&amp;lt;/bbox&amp;gt;"
            boxes_raw = re.findall(pattern, text)
            lines = re.split(pattern, text)[1:]
            boxes = [[int(j) for j in re.findall(r"\d+", i)] for i in boxes_raw]

            draw = ImageDraw.Draw(image)
            for i, line in enumerate(lines):
                x0,y0,x1,y1 = boxes[i]
                if x0 &amp;lt; x1 and y0 &amp;lt; y1:
                    x0,y0,x1,y1 = int(x0*scale_w), int(y0*scale_h), int(x1*scale_w), int(y1*scale_h)
                    draw.polygon([x0,y0, x1,y0, x1,y1, x0,y1], outline="red")
            st.subheader("OCR with Bounding Boxes")
            st.image(image)
            st.text_area("OCR Text", "\n".join(lines), height=200)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2tn9ayk4vhw1267t5vu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2tn9ayk4vhw1267t5vu.png" alt=" " width="640" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 18: Launch Streamlit
&lt;/h3&gt;

&lt;p&gt;Run the following command to launch streamlit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run app.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoksxgt88xd74uxib5aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoksxgt88xd74uxib5aj.png" alt=" " width="640" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 19: Access the WebUI in Your Browser
&lt;/h3&gt;

&lt;p&gt;Once Streamlit is running, it will display three links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local URL → &lt;a href="http://localhost:8501" rel="noopener noreferrer"&gt;http://localhost:8501&lt;/a&gt; (works if you’re running on your own machine).&lt;/li&gt;
&lt;li&gt;Network URL → http://&amp;lt;internal-ip&amp;gt;:8501 (for internal access inside your VM network).&lt;/li&gt;
&lt;li&gt;External URL → http://&amp;lt;external-ip&amp;gt;:8501 (use this to open from your laptop/PC browser).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open the External URL in your browser.&lt;br&gt;
Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://38.29.145.10:8501

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Kosmos-2.5 WebUI will load with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A task selector (Markdown &amp;lt;md&amp;gt; or OCR &amp;lt;ocr&amp;gt;).&lt;/li&gt;
&lt;li&gt;An upload box to drag &amp;amp; drop or browse images.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Upload any PNG/JPG/JPEG image (e.g., receipts, invoices, documents).&lt;/p&gt;

&lt;p&gt;Click Run and view:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown Mode → a structured Markdown transcription of the document.&lt;/li&gt;
&lt;li&gt;OCR Mode → text + bounding boxes drawn directly on your image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: If your VM is remote (e.g., NodeShift), ensure port 8501 is open in firewall/security settings, or use SSH port forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8501:localhost:8501 root@&amp;lt;your-vm-ip&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flioss8i1oiikl1c2fg31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flioss8i1oiikl1c2fg31.png" alt=" " width="640" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 20: Upload and Process Documents
&lt;/h3&gt;

&lt;p&gt;In the WebUI, click Browse files (or drag &amp;amp; drop) to upload an image.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supported formats: PNG, JPG, JPEG&lt;/li&gt;
&lt;li&gt;File size limit: 200 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once uploaded, the file name will appear below the upload box (e.g., receipt_00008.png).&lt;/p&gt;

&lt;p&gt;Choose the task mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown (&amp;lt;md&amp;gt;) → generates a structured Markdown transcription.&lt;/li&gt;
&lt;li&gt;OCR (&amp;lt;ocr&amp;gt;) → extracts text with bounding boxes overlaid on the uploaded image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model will process the image and show results below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Markdown Mode → you’ll see neatly formatted text output.&lt;/li&gt;
&lt;li&gt;In OCR Mode → the uploaded image will be re-rendered with red bounding boxes drawn around detected text regions, along with extracted text output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: If you see a warning about use_column_width being deprecated, you can safely ignore it — it’s a Streamlit UI message and doesn’t affect the model’s output.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focs9l3xvuvki7tu7jy81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focs9l3xvuvki7tu7jy81.png" alt=" " width="640" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rava6bpalnc4j2o893b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rava6bpalnc4j2o893b.png" alt=" " width="640" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 21: View OCR Results
&lt;/h3&gt;

&lt;p&gt;Switch the task selector to OCR (&amp;lt;ocr&amp;gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This tells Kosmos-2.5 to extract text + bounding box coordinates instead of Markdown.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After uploading the image (e.g., receipt_00008.png), the model will process it and return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Annotated Image → your uploaded image will now display with red bounding boxes drawn around detected text areas.&lt;/li&gt;
&lt;li&gt;OCR Text Output → the recognized text lines will appear below the image (or in a text box), showing exactly what was extracted from each bounding box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use this mode when you need precise localization of text in documents (e.g., invoices, receipts, forms).&lt;/p&gt;

&lt;p&gt;Tip: If you want to save the annotated output, you can extend app.py with Streamlit download buttons for both the Markdown text and the OCR image; a sketch follows below.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29leqenm9awmbcyus7va.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29leqenm9awmbcyus7va.png" alt=" " width="640" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixm5sromrufg8vykaw49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixm5sromrufg8vykaw49.png" alt=" " width="640" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiwz9p6ucgklgx7afe1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsiwz9p6ucgklgx7afe1q.png" alt=" " width="640" height="483"&gt;&lt;/a&gt;&lt;/p&gt;
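&lt;p&gt;A minimal sketch of those download buttons (reusing the text and image variables from app.py; the labels and file names here are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import io

# In the Markdown branch: offer the transcription as a .md file
st.download_button("Download Markdown", data=text,
                   file_name="output.md", mime="text/markdown")

# In the OCR branch: offer the annotated image as a .png file
buf = io.BytesIO()
image.save(buf, format="PNG")
st.download_button("Download annotated image", data=buf.getvalue(),
                   file_name="ocr_output.png", mime="image/png")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
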

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Kosmos-2.5 makes working with text-heavy images simple — whether you need clean Markdown transcriptions or OCR with bounding boxes. By setting it up on a GPU-powered NodeShift VM and integrating it with a Streamlit WebUI, you now have an efficient, browser-based workflow for document understanding at scale.&lt;/p&gt;

</description>
      <category>microsoft</category>
      <category>ai</category>
      <category>llm</category>
      <category>kosmos</category>
    </item>
    <item>
      <title>Generate Expressive, Long Form Multi-Speaker Audios &amp; Podcasts with Microsoft's VibeVoice</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Wed, 03 Sep 2025 16:48:39 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/generate-expressive-long-form-multi-speaker-audios-podcasts-with-microsofts-vibevoice-4hfp</link>
      <guid>https://dev.to/nodeshiftcloud/generate-expressive-long-form-multi-speaker-audios-podcasts-with-microsofts-vibevoice-4hfp</guid>
      <description>&lt;p&gt;If you're looking for an open-source text-to-speech system that can generate podcasts, audiobooks, or multi-speaker conversations that actually sound real, Microsoft’s VibeVoice is a model you’ll want to try. Unlike traditional TTS systems that often feel robotic, inconsistent, or restricted to short clips, VibeVoice is designed from the ground up to produce expressive, long-form, multi-speaker audio with remarkable naturalness and flow. It can synthesize speech lasting up to 90 minutes and seamlessly handle up to four distinct speakers, an impressive upgrade over most existing models that struggle to maintain quality beyond a few minutes or across more than two voices. What makes this possible is its continuous speech tokenizers (acoustic and semantic) that operate at a very low frame rate (7.5 Hz), preserving audio richness while drastically reducing computation. On top of this, the model uses a next-token diffusion framework, powered by a Qwen2.5-based LLM, to understand dialogue context and generate nuanced turn-taking, while a lightweight diffusion head ensures high-fidelity acoustic detail. The result: smooth, consistent, and lifelike conversations that feel like they were recorded, not generated.&lt;/p&gt;

&lt;p&gt;In this guide, we have covered a simple and step-by-step walkthrough of how to get this model up and running locally or in GPU-accelerated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run VibeVoice
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you configure everything from the GPU (ranging from H100s to A100s) to CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 1x A100 SXM4 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x RTX A6000 GPU node with 64 vCPUs/63GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running a CUDA-dependent application like VibeVoice, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Launch Mode
&lt;/h4&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhol4bqjn2f6zi2sv4ab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhol4bqjn2f6zi2sv4ab.png" alt="Image-step5-2" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Repository Authentication
&lt;/h4&gt;

&lt;p&gt;We left all fields &lt;strong&gt;empty&lt;/strong&gt; here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h4&gt;
  
  
  Identification
&lt;/h4&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt7p0dw5c44p7u3axrs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt7p0dw5c44p7u3axrs9.png" alt="Image-step5-3" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-4" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-5" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwto9145nkb27vd3d71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqwto9145nkb27vd3d71.png" alt="Image-step6-1" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After copying the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
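
&lt;p&gt;Since we launched the VM from the CUDA devel image, it’s also worth confirming that the CUDA compiler is available; &lt;code&gt;flash-attn&lt;/code&gt;, installed in a later step, needs it to build. A quick sanity check, assuming the stock image layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# should print the CUDA 12.1 toolkit version bundled with the image
nvcc --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;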



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n vibe python=3.11 -y &amp;amp;&amp;amp; conda activate vibe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44mazey6p4kary0n39p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44mazey6p4kary0n39p.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Clone the official repository and move inside the project directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguwl3vseqa9vadrmvjkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguwl3vseqa9vadrmvjkg.png" alt="Image-step7-2" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -e .
pip install flash-attn --no-build-isolation
apt update &amp;amp;&amp;amp; apt install ffmpeg -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyux491lk91d70ve2hqir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyux491lk91d70ve2hqir.png" alt="Image-step7-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Launch the Gradio demo. This will automatically download the model checkpoints as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib4ai6y8vtdm1v9nfz9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib4ai6y8vtdm1v9nfz9u.png" alt="Image-step7-4" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhf2qf7pmjiw0eu590fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhf2qf7pmjiw0eu590fy.png" alt="Image-step7-5" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Gradio session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 7860:localhost:7860 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
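
&lt;p&gt;For example, with hypothetical placeholder values filled in (the port, key path, and IP below are illustrative only):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# forwards the VM's port 7860 to localhost:7860 on your machine
ssh -L 7860:localhost:7860 -p 11000 -i ~/.ssh/id_ed25519 root@203.0.113.42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;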



&lt;p&gt;After this, copy the URL printed in your remote server’s terminal, e.g. &lt;a href="http://0.0.0.0:7860" rel="noopener noreferrer"&gt;http://0.0.0.0:7860&lt;/a&gt;, and paste it into your local browser to access the Gradio session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Run the model
&lt;/h3&gt;

&lt;p&gt;1) Once you access the Gradio interface, it will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv99jk5854w87o84sfqiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv99jk5854w87o84sfqiw.png" alt="Image-step8-1" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Generate a podcast from a script.&lt;/p&gt;

&lt;p&gt;Paste any script of your choice; we’re using one of the &lt;a href="https://github.com/microsoft/VibeVoice/blob/main/demo/text_examples/4p_climate_45min.txt" rel="noopener noreferrer"&gt;example scripts&lt;/a&gt; given in the official repo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgm3ovx013se97c9o6nl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgm3ovx013se97c9o6nl.png" alt="Image-step8-2" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) The app will start streaming the generated podcast audio in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pit993yejnqhe0y45kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pit993yejnqhe0y45kj.png" alt="Image-step8-3" width="800" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;VibeVoice stands out as a groundbreaking open-source TTS framework that combines continuous speech tokenizers, a Qwen2.5-powered LLM, and a diffusion head to deliver expressive, long-form, multi-speaker audio that feels astonishingly real. Its ability to generate up to 90 minutes of consistent, multi-voice speech makes it a powerful tool for creators and researchers alike. And while running it locally is a great way to get started, NodeShift makes the experience even smoother by providing GPU-accelerated environments, simplified deployment, and scalability out of the box, so you can focus on exploring and scaling with the model’s capabilities without worrying about complex infrastructure setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microsoft</category>
      <category>podcast</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to Install DeepSeek V3.1</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 18:21:46 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-deepseek-v31-3dgg</link>
      <guid>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-deepseek-v31-3dgg</guid>
      <description>&lt;p&gt;DeepSeek has once again pushed the boundaries of what’s possible in open-source AI with the release of DeepSeek-V3.1, a next-generation hybrid model that seamlessly supports both thinking and non-thinking modes. Building on the foundation of its powerful V3 base checkpoint, this version introduces smarter tool calling, faster reasoning efficiency, and a more versatile chat template design that adapts effortlessly to different use cases. Its post-training optimization dramatically boosts performance in agent tasks and tool usage, making it a strong choice for developers working on automation, research assistance, and coding agents. Moreover, the model’s ability to process extended contexts has been expanded through a two-phase long context extension approach: a massive 10x increase in the 32K token phase to 630B tokens and a 3.3x increase in the 128K token phase to 209B tokens. Combined with training on the cutting-edge UE8M0 FP8 data format, DeepSeek-V3.1 not only ensures efficiency and scalability but also guarantees compatibility with modern microscaling data pipelines.&lt;/p&gt;

&lt;p&gt;Deploying a model of this caliber locally might seem daunting at first due to its substantial 671 billion parameters. However, Unsloth has made it entirely feasible. Unsloth has used selective quantization techniques to reduce the model's size without any significant loss of accuracy by targeting specific layers, such as the Mixture-of-Experts (MoE) layers, while preserving the precision of attention and other critical layers.&lt;/p&gt;

&lt;p&gt;In the following guide, we'll walk you through the step-by-step process of installing and running DeepSeek-V3.1 locally using LLaMA.cpp and Unsloth's dynamic quants, ensuring you can access its full potential efficiently and effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The system requirements for running DeepSeek-V3.1 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: Multiple H100s or H200s (the count varies with the quantization bit-width)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 1TB+ (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nvidia Cuda installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Disk Space requirements depending on the type of model are as follows:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rtfv01ia0jnqggux4nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7rtfv01ia0jnqggux4nb.png" alt="Image-prequisites" width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
Source: Unsloth&lt;/p&gt;

&lt;p&gt;We recommend taking a screenshot of this chart and saving it somewhere, so you can quickly look up the disk-space requirements before trying a specific bit-quantized version.&lt;/p&gt;

&lt;p&gt;For this article, we’ll download the 2.71-bit version (recommended).&lt;/p&gt;
&lt;h2&gt;
  
  
  Step-by-step process to install DeepSeek-V3.1 Locally
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you configure everything from the GPU (ranging from H100s to A100s) to CPUs, RAM, and storage, according to your needs.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using a 1x H200 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78ex5301m0jmnmar6gbe.png" alt="Image-step3-1" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8a0r7ibcqiyeu9v2oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8a0r7ibcqiyeu9v2oy.png" alt="Image-step4-1" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8bldddlk57tshry6ngf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8bldddlk57tshry6ngf.png" alt="Image-step6-1" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After copying the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Install and build LLaMA.cpp
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;llama.cpp&lt;/code&gt; is a C++ library for running LLaMA and other large language models efficiently on GPUs, CPUs and edge devices.&lt;/p&gt;

&lt;p&gt;We’ll first install &lt;code&gt;llama.cpp&lt;/code&gt;, as we’ll use it to download and run DeepSeek-V3.1.&lt;/p&gt;

&lt;p&gt;1) Start by creating a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n deepseek python=3.11 -y &amp;amp;&amp;amp; conda activate deepseek
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once inside the environment, update the Ubuntu package lists to fetch the latest repository updates and patches.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3) Install dependencies for llama.cpp.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgotwcz9o08de7837bj33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgotwcz9o08de7837bj33.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Clone the official repository of llama.cpp.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/ggml-org/llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpg8mmhvabkqtol1ldex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpg8mmhvabkqtol1ldex.png" alt="Image-step7-3" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5) Compile &lt;code&gt;llama.cpp&lt;/code&gt;‘s build files.&lt;/p&gt;

&lt;p&gt;In the command below, &lt;code&gt;-DGGML_CUDA=OFF&lt;/code&gt; disables the CUDA backend. Keep it OFF on a CPU-only system, and it is often worth keeping it OFF even on a GPU machine: skipping the CUDA kernels makes compilation considerably faster, and CUDA-enabled builds can occasionally throw unwanted errors. The trade-off is that a CPU-only build cannot offload layers to the GPU, so GPU-related flags in later commands will simply have no effect.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF -DLLAMA_CURL=ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpobvl0sy0x4vp1ivg5h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpobvl0sy0x4vp1ivg5h1.png" alt="Image-step7-4" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6) Build &lt;code&gt;llama.cpp&lt;/code&gt; from the build directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym4abt71d49jjhwb13s6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym4abt71d49jjhwb13s6.png" alt="Image-step7-5" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7) Finally, we’ll copy all the executables from llama.cpp/build/bin/ that start with llama- into the llama.cpp directory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cp llama.cpp/build/bin/llama-* llama.cpp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
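
&lt;p&gt;As a quick sanity check that the binaries were built and copied correctly, you can print the build info (a sketch; the &lt;code&gt;--version&lt;/code&gt; flag is available in recent llama.cpp builds):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prints the llama.cpp version and build configuration
./llama.cpp/llama-cli --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;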

&lt;h3&gt;
  
  
  Step 8: Download the Model Files
&lt;/h3&gt;

&lt;p&gt;We’ll download the model files from Hugging Face using a Python script.&lt;/p&gt;

&lt;p&gt;1) To do that, let’s first install the Hugging Face Python packages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install huggingface_hub hf_transfer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface_hub&lt;/code&gt;&lt;/strong&gt; – Provides an interface to interact with the Hugging Face Hub, allowing you to download, upload, and manage models, datasets, and other resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;hf_transfer&lt;/code&gt;&lt;/strong&gt; – A tool optimized for faster uploads and downloads of large files (e.g., LLaMA, DeepSeek models) from the Hugging Face Hub using a more efficient transfer protocol.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgdd0nqckkule24fogxo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgdd0nqckkule24fogxo.png" alt="Image-step8-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Run the model download script with Python.&lt;/p&gt;

&lt;p&gt;The one-liner below downloads every checkpoint shard of the chosen quant (UD-Q2_K_XL) from &lt;a href="https://huggingface.co/unsloth/DeepSeek-V3.1" rel="noopener noreferrer"&gt;unsloth/DeepSeek-V3.1&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -c "import os; os.environ['HF_HUB_ENABLE_HF_TRANSFER']='0'; from huggingface_hub import snapshot_download; snapshot_download(repo_id='unsloth/DeepSeek-V3.1-GGUF', local_dir='unsloth/DeepSeek-V3.1-GGUF', allow_patterns=['*UD-Q2_K_XL*'])"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8jthrsf9njkvdi9zyb3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8jthrsf9njkvdi9zyb3.png" alt="Image-step8-2" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on your network bandwidth, the download can be slow and take some time. The download might also seem stuck at some points, which is normal, so do not interrupt or kill the process in between.&lt;/p&gt;
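
&lt;p&gt;If you prefer a CLI over the Python one-liner, recent versions of &lt;code&gt;huggingface_hub&lt;/code&gt; ship an equivalent command (a sketch, assuming &lt;code&gt;huggingface-cli&lt;/code&gt; landed on your PATH with the earlier pip install):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# download only the UD-Q2_K_XL shards into the same local directory
huggingface-cli download unsloth/DeepSeek-V3.1-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir unsloth/DeepSeek-V3.1-GGUF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;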

&lt;h3&gt;
  
  
  Step 9: Run the model for Inference
&lt;/h3&gt;

&lt;p&gt;Finally, once all checkpoints are downloaded, we can proceed to the inference part.&lt;/p&gt;

&lt;p&gt;In the command below, we run the model through llama.cpp’s &lt;code&gt;llama-cli&lt;/code&gt; tool, with the prompt wrapped in the model’s chat template. The prompt asks the model to create a complete Flappy Bird game in Python, including the interface, logic, and controls.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q4_0 \
    --threads -1 \
    --n-gpu-layers 99 \
    --prio 3 \
    --temp 0.6 \
    --top_p 0.95 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU" \
    -no-cnv \
    --prompt "&amp;lt;｜User｜&amp;gt;Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.&amp;lt;｜Assistant｜&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;The model has started generating the code as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb15xhpd0wel4j1kt2mx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb15xhpd0wel4j1kt2mx.png" alt="Image-step9-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once generation is complete, the output may end like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma3y5j4w9b5fxm6nh1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ma3y5j4w9b5fxm6nh1r.png" alt="Image-step9-2" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running the Flappy Bird code generated by DeepSeek-V3.1 in the VS Code editor opens a game panel as shown below (note: install pygame in your environment before running the code):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvduyozxu1wf5th8n9ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvduyozxu1wf5th8n9ak.png" alt=" " width="655" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the live demonstration of the game in the video attached on the original article &lt;a href="https://nodeshift.cloud/blog/a-step-by-step-guide-to-install-deepseek-v3-1?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=content_share" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
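
&lt;p&gt;Note that the run above is a one-shot generation because of the &lt;code&gt;-no-cnv&lt;/code&gt; flag and the inline prompt. To chat with the model interactively instead, you can drop both and let &lt;code&gt;llama-cli&lt;/code&gt; use the model’s built-in chat template (a sketch; exact behavior may vary across llama.cpp versions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# without -no-cnv and --prompt, llama-cli starts an interactive chat session
./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3.1-GGUF/UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;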

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this guide, we explored how DeepSeek-V3.1 elevates open-source AI with its hybrid thinking modes, smarter tool calling, faster reasoning, and extended long-context capabilities, all supported by efficient training techniques like FP8 scaling and Unsloth’s dynamic quantization. While deploying such a massive model locally with LLaMA.cpp is now more accessible, it still demands considerable compute resources. This is where NodeShift Cloud steps in, offering a seamless alternative with scalable, cost-effective GPU and compute infrastructure. By offloading deployment to NodeShift’s intuitive cloud platform, developers can unlock the full potential of DeepSeek-V3.1 without the burden of managing heavy local infrastructure, making experimentation, scaling, and production use both faster and simpler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deepseek</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>A Complete Setup Guide to Powerful AI Image Editing with Qwen-Image-Edit</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:52:19 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/a-complete-setup-guide-to-powerful-ai-image-editing-with-qwen-image-edit-3kg1</link>
      <guid>https://dev.to/nodeshiftcloud/a-complete-setup-guide-to-powerful-ai-image-editing-with-qwen-image-edit-3kg1</guid>
      <description>&lt;p&gt;Image editing has always required a delicate balance between precision and creativity, and that’s exactly what Qwen-Image-Edit delivers. Built on the robust 20B Qwen-Image model, this cutting-edge tool takes image editing to the next level by combining semantic control (powered by Qwen2.5-VL) with appearance control (via its VAE Encoder). This dual-system approach allows users to seamlessly perform both low-level edits, like adding or removing objects while keeping the rest of the image untouched, and high-level transformations, such as rotating objects, transferring artistic styles, or even creating new concepts entirely. What truly sets Qwen-Image-Edit apart, however, is its precise text editing capability, enabling direct modification of text in English and Chinese while preserving the original font, size, and style. &lt;/p&gt;

&lt;p&gt;If you’re looking for an image editing model that’s powerful, versatile, and incredibly easy to use, Qwen-Image-Edit is a must-try. Let's see how to get it up and running on your machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H100&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen Image Edit
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to begin creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H100 GPU; however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffytezgl8rh29jhqc9dk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffytezgl8rh29jhqc9dk8.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/huggingface/diffusers
pip install transformers accelerate gradio pillow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluaqs2mc3pgm6b65qusv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluaqs2mc3pgm6b65qusv.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;
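
&lt;p&gt;(Optional) Before moving on, you can quickly verify that the CUDA build of PyTorch installed correctly and can see the GPU with a one-line check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If this prints &lt;code&gt;True&lt;/code&gt; along with your GPU’s name, the environment is ready for the next steps.&lt;/p&gt;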

&lt;p&gt;3) Install and run jupyter notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from PIL import Image
import torch

from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit")
print("pipeline loaded")
pipeline.to(torch.bfloat16)
pipeline.to("cuda")
pipeline.set_progress_bar_config(disable=None)
image = Image.open("./cat.jpg").convert("RGB")
prompt = "Add a pillow under cat's head and cover it with a blanket."
inputs = {
    "image": image,
    "prompt": prompt,
    "generator": torch.manual_seed(0),
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 50,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit.png")
    print("image saved at", os.path.abspath("output_image_edit.png"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F021cu0h7kfedcuvtv2pe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F021cu0h7kfedcuvtv2pe.png" alt="Image-step8-2" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Original Image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8o8vgqmw74kjunq1i7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8o8vgqmw74kjunq1i7n.png" alt="Image-step8-3" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edited Image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk26adm8z6p71vx7tn68a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk26adm8z6p71vx7tn68a.png" alt="Image-step8-4" width="786" height="524"&gt;&lt;/a&gt;&lt;/p&gt;
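
&lt;p&gt;Since gradio was installed alongside the other dependencies, you can optionally wrap the pipeline in a minimal web UI to try different edit instructions interactively. The snippet below is a rough sketch rather than an official interface, and it assumes the &lt;code&gt;pipeline&lt;/code&gt; object loaded in the previous step is still in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import gradio as gr
import torch

def edit_image(image, prompt):
    # Reuse the QwenImageEditPipeline loaded earlier in the notebook
    inputs = {
        "image": image.convert("RGB"),
        "prompt": prompt,
        "generator": torch.manual_seed(0),
        "true_cfg_scale": 4.0,
        "negative_prompt": " ",
        "num_inference_steps": 50,
    }
    with torch.inference_mode():
        return pipeline(**inputs).images[0]

demo = gr.Interface(
    fn=edit_image,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Edit instruction")],
    outputs=gr.Image(type="pil"),
    title="Qwen-Image-Edit",
)
demo.launch(server_name="0.0.0.0", server_port=7860)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As with the Jupyter session, you would forward port 7860 over SSH to reach the UI from your local browser.&lt;/p&gt;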

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Qwen-Image-Edit stands out as a next-generation image editing model, seamlessly blending semantic intelligence with appearance precision to enable everything from subtle object adjustments to bold creative transformations, all while offering unmatched text editing capabilities. By running it on NodeShift Cloud, you gain a frictionless way to harness this power, eliminating complex setup hurdles and ensuring a smooth, scalable environment for experimentation. Together, Qwen-Image-Edit and NodeShift Cloud make advanced image editing not just possible, but practical and accessible for creators, developers, and enterprises alike.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Get Started with MiniCPM-v4: The Next-Gen Multimodal AI Model by OpenBMB</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:24:49 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/get-started-with-minicpm-v4-the-next-gen-multimodal-ai-model-by-openbmb-4occ</link>
      <guid>https://dev.to/nodeshiftcloud/get-started-with-minicpm-v4-the-next-gen-multimodal-ai-model-by-openbmb-4occ</guid>
      <description>&lt;p&gt;Multimodal AI is rapidly evolving, MiniCPM-V 4.0 by OpenBMB emerges as a game-changer, combining cutting-edge visual understanding with unprecedented efficiency. Built on SigLIP2-400M and MiniCPM4-3B, this compact yet powerful model packs 4.1B parameters, but consistently punches above its weight. It not only inherits the strong single-image, multi-image, and video comprehension capabilities of its predecessor (MiniCPM-V 2.6), but also surpasses it with remarkable efficiency. Benchmark results on OpenCompass demonstrate this leap. MiniCPM-V 4.0 achieves a 69.0 average score, outperforming models like GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B), and Qwen2.5-VL-3B-Instruct, proving that smaller can indeed be smarter. What makes it even more exciting is its real-world usability: the model runs seamlessly on end devices, delivering under 2s first-token delay and over 17 tokens/s decoding on iPhone 16 Pro Max, all without heating issues, making on-device multimodal AI finally practical. With easy integration across frameworks like llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory, and even a native iOS app, MiniCPM-V 4.0 isn’t just another AI model, it’s a versatile, efficient, and deployment-ready multimodal powerhouse.&lt;/p&gt;

&lt;p&gt;In this article, we're going to see a step-by-step process to install and run this model locally or in GPU-accelerated environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x RTX 4090 or 1x RTX A6000&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 16GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run MiniCPM-v4
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to begin creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x A100 SXM4 GPU; however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 100GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6dbmr98w4nv70agy6bq7.png" alt="Image-step3-1" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 1x RTXA6000 GPU node with 64vCPUs/63GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs3xs9imbvtkd43jv002.png" alt="Image-step4-1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbg0gd8w65m7lkncav5p.png" alt="Image-step5-3" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zxg4ejw14rh5vwg7z7e.png" alt="Image-step6-1" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n minicpm python=3.11 -y &amp;amp;&amp;amp; conda activate minicpm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1g2ubh3ivrkyj8f4pyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1g2ubh3ivrkyj8f4pyo.png" alt="Image-step7-1" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkb0c52i63ff1t01k0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkb0c52i63ff1t01k0d.png" alt="Image-step7-2" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run jupyter notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you're on a remote machine (e.g., a NodeShift GPU), you'll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server - you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vq35fq4cbfchpj5wc90.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz3eexinsdfsvredbkcd.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paste it into your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jb500tp8lgtfj4pmh1m.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'openbmb/MiniCPM-V-4'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  # sdpa or flash_attention_2, no eager
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True)



image = Image.open('./landform.jpg').convert('RGB')

# First round chat 
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    image=image,
    tokenizer=tokenizer
)
print(answer)


# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": [
            "What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    image=None,
    tokenizer=tokenizer
)
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4wbgse6j4bjxhe2vidk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4wbgse6j4bjxhe2vidk.png" alt="Image-step8-2" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the image we used to test the model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqmmtouc957t3wd0uycp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqmmtouc957t3wd0uycp.png" alt="Image-step8-3" width="200" height="300"&gt;&lt;/a&gt;&lt;br&gt;
Picsum ID: 866&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgztxszybzjktbxkxa6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgztxszybzjktbxkxa6a.png" alt="Image-step8-4" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;
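
&lt;p&gt;If you don’t have a local test image handy, you can also fetch one directly from Lorem Picsum (the source of the sample above) and pass it to the same chat call. This is a small convenience sketch that assumes the &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;tokenizer&lt;/code&gt; objects from the previous step are still loaded; &lt;code&gt;requests&lt;/code&gt; comes in as a dependency of huggingface_hub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import io
import requests
from PIL import Image

# Fetch the same Picsum test image (ID 866) used above
url = "https://picsum.photos/id/866/200/300"
image = Image.open(io.BytesIO(requests.get(url, timeout=30).content)).convert("RGB")

msgs = [{"role": "user", "content": [image, "Describe this image in one sentence."]}]
answer = model.chat(msgs=msgs, image=image, tokenizer=tokenizer)
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;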

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To wrap up, MiniCPM-V 4.0 clearly demonstrates how multimodal AI is becoming more efficient, accessible, and deployment-ready, setting a new benchmark in balancing compact design with powerful visual and reasoning capabilities. From its ability to outperform larger models on benchmarks to its seamless real-world usability on devices like the iPhone 16 Pro Max, it proves that high performance no longer requires massive scale. At the same time, NodeShift Cloud makes experimenting with and deploying such state-of-the-art models far more practical, offering GPU-accelerated environments, simple setup workflows, and flexible scaling to match your needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Install &amp; Run Qwen Image</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 01 Sep 2025 16:10:21 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/how-to-install-run-qwen-image-1pjd</link>
      <guid>https://dev.to/nodeshiftcloud/how-to-install-run-qwen-image-1pjd</guid>
      <description>&lt;p&gt;Imagine transforming a simple text prompt into a high-quality image with just a few lines of code. Qwen-Image makes this possible by combining advanced image generation with precise text rendering, whether you’re working in English or Chinese. It handles everything from photorealistic scenes and impressionist-style paintings to clean, minimalist designs, adapting its output to your needs. On top of that, Qwen-Image offers powerful editing features: you can insert or remove objects, fine-tune colours and details, edit text directly within an image, and even adjust human poses—all through clear, natural-language commands. Behind the scenes, it also performs tasks like object detection, semantic segmentation, depth estimation and super-resolution, giving you a complete toolkit for creating and refining images with ease.&lt;/p&gt;

&lt;p&gt;Getting started is simple. In the next section, you’ll see exactly how to install Qwen-Image and run your first prompt in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H100&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen Image
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by &lt;a href="https://nodeshift.com" rel="noopener noreferrer"&gt;NodeShift&lt;/a&gt; since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;login&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image). Now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to begin creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. They are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H200 GPU; however, you can choose any GPU as per the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nwqblqu9dtn5vbnnpvm.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33ctuz0kf0n28kilc7zj.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filngygf82xfgk2o7lyxv.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see the status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqckok7vzis7m6g0pecxw.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have copied the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen-img python=3.11 -y &amp;amp;&amp;amp; conda activate qwen-img
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zy0g2h5aopcr1jenciq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zy0g2h5aopcr1jenciq.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Install required dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 
pip install einops timm pillow
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install huggingface_hub
pip install sentencepiece bitsandbytes protobuf decord numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq57nt1dltxdp38396921.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq57nt1dltxdp38396921.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Install and run jupyter notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook --allow-root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4) If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to set up SSH port forwarding to access the Jupyter Notebook session in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the PORT allotted to your remote server (For the NodeShift server – you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8888:localhost:8888 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r8j1owgltt9aq3dt6yq.png" alt="Image-step7-3" width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, copy the URL shown in your remote server’s terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfvxbys411rncwon4x0s.png" alt="Image-step7-4" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And paste this on your local browser to access the Jupyter Notebook session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Open a Python notebook inside Jupyter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tiohhqac3l7fo2zgy0n.png" alt="Image-step8-1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Download the model checkpoints and run your first image generation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from diffusers import DiffusionPipeline
import torch

model_name = "Qwen/Qwen-Image"

# Load the pipeline
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda"
else:
    torch_dtype = torch.float32
    device = "cpu"

pipe = DiffusionPipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
pipe = pipe.to(device)

positive_magic = {
    "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
    "zh": "超清，4K，电影级构图",  # for Chinese prompts
}

# Generate image
prompt = '''A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197". Ultra HD, 4K, cinematic composition'''

negative_prompt = " "  # use an empty string if there is no specific concept to remove

# Generate with different aspect ratios
aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

width, height = aspect_ratios["16:9"]

image = pipe(
    prompt=prompt + positive_magic["en"],
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42)
).images[0]

image.save("example.png")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704sfgamozztlieecl2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F704sfgamozztlieecl2r.png" alt="Image-step8-2" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71h42kbar10oiyz8qz71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71h42kbar10oiyz8qz71.png" alt="Image-step8-3" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You’ve seen how Qwen-Image turns simple text prompts into stunning, high-fidelity images, whether photorealistic, painterly, or minimal, and offers intuitive editing for objects, colour, text, and even human poses, all backed by robust image-understanding capabilities like segmentation and super-resolution. Getting up and running is equally straightforward: a few commands install the model via diffusers, and within minutes you’re generating your first visuals. By pairing Qwen-Image with NodeShift Cloud, you gain instant access to scalable GPU instances, automated deployment of your inference pipeline, and managed versioning, so you can focus on creativity while NodeShift ensures performance, reliability, and easy integration into your existing workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>ai</category>
      <category>genai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Install &amp; Run Gemma-3-270m, GGUF &amp; Instruct Locally?</title>
      <dc:creator>Ayush kumar</dc:creator>
      <pubDate>Fri, 22 Aug 2025 07:56:27 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/how-to-install-run-gemma-3-270m-gguf-instruct-locally-4nka</link>
      <guid>https://dev.to/nodeshiftcloud/how-to-install-run-gemma-3-270m-gguf-instruct-locally-4nka</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s5ktq191wnz7m0g5sdx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6s5ktq191wnz7m0g5sdx.jpg" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;google/gemma-3-270m (Pre-trained)&lt;br&gt;
A lightweight, open language model from Google DeepMind. Unlike the larger Gemma 3 variants, the 270M model is text-only. With a 32K context window, it’s suitable for general-purpose text generation, summarization, and reasoning. Trained on diverse multilingual, code, and math datasets, it offers strong performance in resource-constrained environments like laptops or small cloud VMs.&lt;/p&gt;

&lt;p&gt;google/gemma-3-270m-it (Instruction-Tuned)&lt;br&gt;
An instruction-optimized variant of Gemma 3-270M that’s fine-tuned to follow user prompts more accurately. It keeps the same lightweight footprint as the base model but excels in conversational AI, question answering, and structured output tasks, making it more user-friendly for chatbots, assistants, and guided content generation.&lt;/p&gt;

&lt;p&gt;unsloth/gemma-3-270m-it-GGUF&lt;br&gt;
A GGUF-format, instruction-tuned Gemma 3-270M released by Unsloth AI for efficient local inference with llama.cpp and similar tools. It’s optimized for faster performance and lower memory usage while retaining the base model’s capabilities, making it ideal for on-device or low-resource deployment scenarios.&lt;/p&gt;
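
&lt;p&gt;As a quick taste of that llama.cpp path (separate from the walkthrough below), here’s a minimal sketch using the llama-cpp-python bindings to pull a quantized file straight from the Hugging Face repo; the Q4_K_M filename glob is an assumption based on Unsloth’s usual quant naming:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch: run the GGUF build locally with llama-cpp-python
# (pip install llama-cpp-python huggingface_hub)
from llama_cpp import Llama

# Downloads a quantized file from the repo; the filename glob is an
# assumption based on Unsloth's usual naming for Q4_K_M quants.
llm = Llama.from_pretrained(
    repo_id="unsloth/gemma-3-270m-it-GGUF",
    filename="*Q4_K_M.gguf",
    n_ctx=8192,  # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=128,
    temperature=0.6,
)
print(out["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
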
&lt;h3&gt;
  
  
  Gemma 3 270M
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7nnc7qhgdxdfaapcan8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7nnc7qhgdxdfaapcan8.png" alt=" " width="740" height="664"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  GPU Configuration Table for Gemma-3-270m, GGUF &amp;amp; Instruct Models
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rekr6p6c95byubfqnwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rekr6p6c95byubfqnwx.png" alt=" " width="752" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Notes:
&lt;/h3&gt;

&lt;p&gt;The GGUF version is much lighter because it uses quantization, so it can run even on lower-end GPUs or CPUs.&lt;br&gt;
The pre-trained (PT) and instruction-tuned (IT) models from Google will require more VRAM if used in FP16 or BF16 formats.&lt;br&gt;
If you use CPU inference with GGUF, you should have at least 8–16 GB of system RAM for smooth execution.&lt;/p&gt;
&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Link 1: &lt;a href="https://huggingface.co/google/gemma-3-270m" rel="noopener noreferrer"&gt;https://huggingface.co/google/gemma-3-270m&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link 2: &lt;a href="https://huggingface.co/google/gemma-3-270m-it" rel="noopener noreferrer"&gt;https://huggingface.co/google/gemma-3-270m-it&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link 3: &lt;a href="https://huggingface.co/unsloth/gemma-3-270m-it-GGUF" rel="noopener noreferrer"&gt;https://huggingface.co/unsloth/gemma-3-270m-it-GGUF&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Process to Install &amp;amp; Run Gemma-3-270m, GGUF &amp;amp; Instruct Locally
&lt;/h3&gt;

&lt;p&gt;For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Sign Up and Set Up a NodeShift Cloud Account
&lt;/h3&gt;

&lt;p&gt;Visit the &lt;a href="https://app.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;NodeShift Platform&lt;/a&gt; and create an account. Once you’ve signed up, log into your account.&lt;/p&gt;

&lt;p&gt;Follow the account setup process and provide the necessary details and information.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr3yc1k41r8zsn2wlgki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr3yc1k41r8zsn2wlgki.png" alt=" " width="640" height="396"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node (Virtual Machine)
&lt;/h3&gt;

&lt;p&gt;GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9pbmixbvn8afjslbavp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9pbmixbvn8afjslbavp.png" alt=" " width="640" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5piva8ejsqy4zim9x1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5piva8ejsqy4zim9x1z.png" alt=" " width="640" height="399"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the menu on the left side. Select the GPU Nodes option, create a GPU Node in the Dashboard, click the Create GPU Node button, and deploy your first Virtual Machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Select a Model, Region, and Storage
&lt;/h3&gt;

&lt;p&gt;In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe52345dxx9fqevz9mkrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe52345dxx9fqevz9mkrf.png" alt=" " width="640" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bk92gan3fqftpbi048p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bk92gan3fqftpbi048p.png" alt=" " width="640" height="335"&gt;&lt;/a&gt;&lt;br&gt;
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Select Authentication Method
&lt;/h3&gt;

&lt;p&gt;There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g4h76hysmaqmvd62xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8g4h76hysmaqmvd62xi.png" alt=" " width="640" height="189"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Gemma-3-270m &amp;amp; Instruct, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications like Gemma-3-270m &amp;amp; Instruct&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Launch Mode
&lt;/h3&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Gemma-3-270m &amp;amp; Instruct.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Repository Authentication
&lt;/h3&gt;

&lt;p&gt;We left all fields empty here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification
&lt;/h3&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23pyqw7fmvetq7z751xd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23pyqw7fmvetq7z751xd.png" alt=" " width="640" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4sy6wwh77x0qxb49eaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4sy6wwh77x0qxb49eaj.png" alt=" " width="640" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup ensures that the Gemma-3-270m &amp;amp; Instruct models run in a GPU-enabled environment with proper CUDA access and high compute performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sazpjemi472wgkzrd1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sazpjemi472wgkzrd1i.png" alt=" " width="640" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxaswq35oyrrbaj9mkla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxaswq35oyrrbaj9mkla.png" alt=" " width="640" height="334"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Virtual Machine Successfully Deployed
&lt;/h3&gt;

&lt;p&gt;You will get visual confirmation that your node is up and running.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gizh1t0tp1ymzkyp6v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gizh1t0tp1ymzkyp6v6.png" alt=" " width="640" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Connect to GPUs using SSH
&lt;/h3&gt;

&lt;p&gt;NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.&lt;/p&gt;

&lt;p&gt;Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhcatazzsggdhlg6ecn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuhcatazzsggdhlg6ecn4.png" alt=" " width="640" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7zkmqkpxybjl11wvzur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7zkmqkpxybjl11wvzur.png" alt=" " width="640" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open your terminal and paste the proxy SSH IP or direct SSH IP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbsn5uszp609kurj6i3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbsn5uszp609kurj6i3y.png" alt=" " width="640" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpwhj2efowsnegl61ys1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpwhj2efowsnegl61ys1.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Check the Available Python version and Install the new version
&lt;/h3&gt;

&lt;p&gt;Run the following command to check the available Python version:&lt;/p&gt;
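
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;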

&lt;p&gt;If you check the Python version, you’ll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.&lt;/p&gt;

&lt;p&gt;Run the following commands to add the deadsnakes PPA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02vj084cooz94ya7e4jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02vj084cooz94ya7e4jk.png" alt=" " width="640" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Install Python 3.11
&lt;/h3&gt;

&lt;p&gt;Now, run the following command to install Python 3.11 or another desired version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y python3.11 python3.11-venv python3.11-dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17yuu799qp5sxpevgwhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17yuu799qp5sxpevgwhy.png" alt=" " width="640" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Update the Default Python3 Version
&lt;/h3&gt;

&lt;p&gt;Now, run the following command to link the new Python version as the default python3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the following command to verify that the new Python version is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfpjf9ntq0ghgik2nb8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfpjf9ntq0ghgik2nb8s.png" alt=" " width="640" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 11: Install and Update Pip
&lt;/h3&gt;

&lt;p&gt;Run the following command to install and update pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the following command to check the version of pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3rsmq4n5hl8o0ohjnw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3rsmq4n5hl8o0ohjnw1.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
Step 12: Create and Activate a Python 3.11 Virtual Environment
&lt;/h3&gt;

&lt;p&gt;Run the following commands to create and activate a Python 3.11 virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt update &amp;amp;&amp;amp; apt install -y python3.11-venv git wget
python3.11 -m venv openwebui
source openwebui/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99t9a3ynqtultuvixcn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99t9a3ynqtultuvixcn7.png" alt=" " width="640" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 13: Install Open-WebUI
&lt;/h3&gt;

&lt;p&gt;Run the following command to install open-webui:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install open-webui

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxo7chd2zumh0r9asee1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxo7chd2zumh0r9asee1.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 14: Serve Open-WebUI
&lt;/h3&gt;

&lt;p&gt;In your activated Python environment, start the Open-WebUI server by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;open-webui serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi3mpn6o9w3tz4m5vyce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi3mpn6o9w3tz4m5vyce.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.&lt;/li&gt;
&lt;li&gt;When setup is complete, the WebUI will be available and ready for you to access via your browser.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxavrj0hyi6lts0tulyl0.png" alt=" " width="640" height="381"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl4jy899dsn9u2tdlgyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl4jy899dsn9u2tdlgyo.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 15: Set up SSH port forwarding from your local machine
&lt;/h3&gt;

&lt;p&gt;On your local machine (Mac/Windows/Linux), open a terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8080:localhost:8080 -p 40128 root@38.29.145.10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forwards:&lt;/p&gt;

&lt;p&gt;Local localhost:8080 → Remote VM 127.0.0.1:8080&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t72eu2r91x8x98nctod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t72eu2r91x8x98nctod.png" alt=" " width="640" height="229"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 16: Access Open-WebUI in Your Browser
&lt;/h3&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You should see the Open-WebUI login or setup page.&lt;/li&gt;
&lt;li&gt;Log in or create a new account if this is your first time.&lt;/li&gt;
&lt;li&gt;You’re now ready to use Open-WebUI to interact with your models!
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2l3v1nf5j7cnbxiu2zg.png" alt=" " width="640" height="397"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 17: Install Ollama
&lt;/h3&gt;

&lt;p&gt;After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.&lt;/p&gt;

&lt;p&gt;Website Link: &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;https://ollama.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the following command to install Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://ollama.com/install.sh | sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaie54x9ehoqassbq9r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaie54x9ehoqassbq9r4.png" alt=" " width="640" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 18: Serve Ollama
&lt;/h3&gt;

&lt;p&gt;Run the following command to start the Ollama server so models can be pulled and accessed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcesqxii9jbhc87k0m60a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcesqxii9jbhc87k0m60a.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 19: Pull the Gemma3:270M Model
&lt;/h3&gt;

&lt;p&gt;Run this command to pull the gemma3:270m model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull gemma3:270m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqna8j5gnab52eb464qfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqna8j5gnab52eb464qfk.png" alt=" " width="640" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 20: Run the Gemma3:270M Model for Inference
&lt;/h3&gt;

&lt;p&gt;Now that your models are installed, you can start running them and interacting directly from the terminal.&lt;/p&gt;

&lt;p&gt;To run the gemma3:270m model, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama run gemma3:270m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rpouiiyzmupq6ptiimj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rpouiiyzmupq6ptiimj.png" alt=" " width="640" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 21 — Chat with Gemma-3-270M in Open WebUI (auto-detected from Ollama)
&lt;/h3&gt;

&lt;p&gt;You’ve already tested the model in the terminal with Ollama and installed Open WebUI earlier. Now we’ll use the Web UI to chat with the same local model.&lt;/p&gt;

&lt;p&gt;Make sure Ollama is running&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you’re in a VM, keep the Ollama service up.&lt;/li&gt;
&lt;li&gt;Quick check:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull gemma3:270m   # if not pulled yet
curl http://localhost:11434/api/tags | jq . # should list gemma3:270m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the Web UI&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit your Open WebUI URL (e.g., http://localhost:8080).&lt;/li&gt;
&lt;li&gt;Click the model dropdown at the top (“Select a model”).&lt;/li&gt;
&lt;li&gt;Pick the model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should see gemma3:270m under Local. Select it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That’s it—Open WebUI automatically detects any model you’ve pulled with Ollama and shows it in the list.&lt;/li&gt;
&lt;li&gt;(Your screen should look like the screenshot: gemma3:270m visible in the model picker.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start chatting&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type your prompt in the chat box and send.&lt;/li&gt;
&lt;li&gt;Use the gear icon (if available) to tweak temperature, max tokens, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model doesn’t appear&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the refresh icon next to the model list, or go to Settings → Providers → Ollama and confirm the Base URL (usually &lt;a href="http://localhost:11434" rel="noopener noreferrer"&gt;http://localhost:11434&lt;/a&gt;), then Save and Sync Models.&lt;/li&gt;
&lt;li&gt;If Ollama runs on another machine, set the Base URL to that host (make sure the port is reachable).
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7q2phky1venihnk9yxxm.png" alt=" " width="640" height="231"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 22 — Stress-test the model in Open WebUI (tune settings + quick rubric)
&lt;/h3&gt;

&lt;p&gt;Now that gemma3:270m shows up in Open WebUI and you can chat, do a fast quality check and tune generation so it behaves well.&lt;/p&gt;

&lt;p&gt;Open a new chat → pick gemma3:270m&lt;/p&gt;

&lt;p&gt;Click the gear (generation settings) and start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature: 0.6&lt;/li&gt;
&lt;li&gt;Top-p: 0.9&lt;/li&gt;
&lt;li&gt;Max new tokens: 512&lt;/li&gt;
&lt;li&gt;Repeat penalty: 1.1&lt;/li&gt;
&lt;li&gt;(Optional) Seed: 42 for reproducible runs&lt;/li&gt;
&lt;/ul&gt;
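
&lt;p&gt;To make these settings reproducible outside the UI, the same knobs map onto Ollama’s generation options. Here’s a minimal sketch using the official &lt;code&gt;ollama&lt;/code&gt; Python client (&lt;code&gt;pip install ollama&lt;/code&gt;); the prompt is borrowed from the test list below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal sketch: the same generation settings via the official
# `ollama` Python client (pip install ollama)
import ollama

response = ollama.generate(
    model="gemma3:270m",
    prompt="If five painters take five hours to paint five walls, "
           "how long would 100 painters take to paint 100 walls?",
    options={
        "temperature": 0.6,
        "top_p": 0.9,
        "num_predict": 512,    # max new tokens
        "repeat_penalty": 1.1,
        "seed": 42,            # optional, for reproducible runs
    },
)
print(response["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;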

&lt;p&gt;Paste 3 single-line “hard” prompts to probe reasoning &amp;amp; constraints&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If five painters take five hours to paint five walls, how long would 100 painters take to paint 100 walls? Explain without skipping steps.&lt;/li&gt;
&lt;li&gt;Summarize the book “The Little Prince” in exactly 7 words, keeping its emotional tone intact.&lt;/li&gt;
&lt;li&gt;Translate “La vie est belle” into English, reverse each word, and then write a haiku using the reversed words as the first line.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grade quickly with a mini-rubric (write notes in the chat or a doc)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correctness (math/logic right?)&lt;/li&gt;
&lt;li&gt;Constraint keeping (exact word count, formatting, “no synonyms” rules)&lt;/li&gt;
&lt;li&gt;Clarity (step-by-step, no hand-waving)&lt;/li&gt;
&lt;li&gt;Latency (tokens/sec acceptable?)&lt;/li&gt;
&lt;li&gt;Determinism (does it change across retries? if yes, lower temp)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If it struggles, tweak and retry&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning tasks: lower Temperature → 0.2–0.4.&lt;/li&gt;
&lt;li&gt;Short answers cut off: raise Max new tokens.&lt;/li&gt;
&lt;li&gt;Add a System message like: “Follow constraints strictly. Show numbered steps.”
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvg43qdpst6tjrfxpdxu.png" alt=" " width="640" height="438"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jojw85t5wjxpix9anfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jojw85t5wjxpix9anfo.png" alt=" " width="640" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsqg5kg3bbqv6gtyogve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsqg5kg3bbqv6gtyogve.png" alt=" " width="640" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Up to here, we’ve been interacting with google/gemma-3-270m via Ollama in the terminal and through Open WebUI in the browser (Open WebUI auto-detected the Ollama model, so chatting worked in both places). Now we’ll install the lightweight GGUF variant of this model directly from Hugging Face inside Open WebUI’s Manage Models panel, so you can run the llama.cpp-style build with lower memory usage and switch between the Ollama and GGUF versions from the same model dropdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 23 — Pull the GGUF build from Hugging Face (Unsloth)
&lt;/h3&gt;

&lt;p&gt;Unsloth publishes a ready-to-run GGUF pack for this model: unsloth/gemma-3-270m-it-GGUF.&lt;br&gt;
In Open WebUI → Settings → Models → Manage Models, paste this repo path into “Pull a model from Ollama.com” (it accepts hf.co/... too):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hf.co/unsloth/gemma-3-270m-it-GGUF

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click the download icon. When file choices appear, I recommend starting with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gemma-3-270m-it.Q4_K_M.gguf (best speed/quality balance)&lt;/li&gt;
&lt;li&gt;Lighter options if RAM/VRAM is tiny: IQ2_XXS / IQ3_XXS&lt;/li&gt;
&lt;li&gt;Higher quality: Q8_0 (or F16 if you want full precision)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the download finishes, the GGUF model will show up in your model selector alongside the Ollama one, and you can chat with either version directly in Open WebUI.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs35gnf8xaw55n45nwd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs35gnf8xaw55n45nwd1.png" alt=" " width="640" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3khwx9wx902tx0whkuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3khwx9wx902tx0whkuy.png" alt=" " width="640" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bkm9wf6yziasybsfkhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bkm9wf6yziasybsfkhl.png" alt=" " width="640" height="281"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 24 — Chat with the GGUF model in Open WebUI (verify + tune)
&lt;/h3&gt;

&lt;p&gt;Select the GGUF build&lt;br&gt;
Open a new chat and pick hf.co/unsloth/gemma-3-270m-it-GGUF:latest from the model dropdown (you’ll see the full HF path in the header, as in the screenshots below).&lt;/p&gt;

&lt;p&gt;Use the same stress prompts&lt;br&gt;
Paste the same three single-line tests from Step 22 (the painters puzzle, the 7-word “The Little Prince” summary, and the reversed-word haiku). This makes A/B comparison with the Ollama version straightforward.&lt;/p&gt;

&lt;p&gt;Tune generation for GGUF&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature 0.4–0.6 (start 0.5)&lt;/li&gt;
&lt;li&gt;Top-p 0.9&lt;/li&gt;
&lt;li&gt;Max new tokens 512&lt;/li&gt;
&lt;li&gt;Repeat penalty 1.1&lt;/li&gt;
&lt;li&gt;Context/window: 8192 (you can go higher if your RAM allows)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare vs. Ollama run&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correctness: does it keep constraints (exact word counts, banned words)?&lt;/li&gt;
&lt;li&gt;Coherence: fewer/random jumps → nudge temp down to 0.3–0.4.&lt;/li&gt;
&lt;li&gt;Latency: if slow on CPU, try a lighter quant (IQ3_XXS) or shorter max tokens. If quality feels thin, bump to Q6_K or Q8_0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optional: save a preset&lt;br&gt;
Click … → Save as preset (e.g., “Gemma3-270m-GGUF-Q4KM”) so future chats load your tuned settings instantly.&lt;/p&gt;

&lt;p&gt;If something’s off&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model not loading: re-open Settings → Models → Manage Models → Sync/Refresh.&lt;/li&gt;
&lt;li&gt;Quality too low: switch the file to a higher quant (Q6_K / Q8_0).&lt;/li&gt;
&lt;li&gt;Memory tight: keep quant at Q4_K_M and reduce context or max tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you can flip between Ollama (gemma3:270m) and GGUF (hf.co/unsloth/…) in the same UI and capture side-by-side behavior for your write-up.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adxl96tz1go62yxa8v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5adxl96tz1go62yxa8v9.png" alt=" " width="640" height="366"&gt;&lt;/a&gt;&lt;br&gt;
Up to this point, we’ve been chatting with google/gemma-3-270m, google/gemma-3-270m-it, and the unsloth/gemma-3-270m-it-GGUF build via Ollama in the terminal and Open WebUI in the browser (which auto-detected our Ollama pulls). Now we’ll move beyond the UI and run the original Hugging Face models google/gemma-3-270m (pretrained) and google/gemma-3-270m-it (instruction-tuned) directly via script—downloading them with Transformers using your HF token, so we can control settings programmatically, batch tests, and log clean benchmarks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 25 — Install Torch
&lt;/h3&gt;

&lt;p&gt;Run the following command to install torch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid7lmakbl1w9fq1j06cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid7lmakbl1w9fq1j06cq.png" alt=" " width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 26: Install Python Dependencies
&lt;/h3&gt;

&lt;p&gt;Run the following command to install python dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m pip install -U "transformers&amp;gt;=4.53" accelerate sentencepiece

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fs9wa1kkh1vjsy3g4n4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fs9wa1kkh1vjsy3g4n4.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 27 — Install/Verify Hugging Face Hub (CLI + token)
&lt;/h3&gt;

&lt;p&gt;Install (or update) the Hub tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U huggingface_hub "transformers&amp;gt;=4.53"
huggingface-cli --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcm6ridv2ndbscvxh0yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcm6ridv2ndbscvxh0yy.png" alt=" " width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Authenticate (same account that accepted Gemma access):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;huggingface-cli login            # paste HF_xxx token with read scope
# optional env var so scripts/daemons inherit it
export HF_TOKEN=HF_xxx
echo 'export HF_TOKEN=HF_xxx' &amp;gt;&amp;gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuowvuz04tfwh0vd7zr2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuowvuz04tfwh0vd7zr2i.png" alt=" " width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 28: Connect to Your GPU VM with a Code Editor
&lt;/h3&gt;

&lt;p&gt;Before you start running Python scripts with the Gemma-3-270m &amp;amp; Instruct models and Transformers, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.&lt;/li&gt;
&lt;li&gt;In this example, we’re using the Cursor code editor.&lt;/li&gt;
&lt;li&gt;Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why do this?&lt;br&gt;
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1dkzj289b5zwgjg6mae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1dkzj289b5zwgjg6mae.png" alt=" " width="640" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 29: Run Gemma-3-270M Models with Transformers in Python
&lt;/h3&gt;

&lt;p&gt;Now you’re ready to interact with Gemma-3-270M directly in your own Python scripts using the Transformers library.&lt;/p&gt;

&lt;p&gt;Here’s an example script (gemma3_run.py) you can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_id = "google/gemma-3-270m-it"  # or "google/gemma-3-27m" for base PT

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",   # GPU if present, else CPU
    attn_implementation="sdpa"  # good default in recent PyTorch
)

streamer = TextStreamer(tok)
inputs = tok("Explain Rust ownership like I'm 12:", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9ure15ucmclltas7hxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9ure15ucmclltas7hxs.png" alt=" " width="640" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 30: Run the script and generate a response
&lt;/h3&gt;

&lt;p&gt;Run the script with the following command to load google/gemma-3-270m-it and generate a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 gemma3_run.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv8gltsekn4mannr4w63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv8gltsekn4mannr4w63.png" alt=" " width="640" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha9rmrzfqtbvwy8r0m3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha9rmrzfqtbvwy8r0m3b.png" alt=" " width="640" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 31: Run the Base Gemma-3-270M Model with Transformers in Python
&lt;/h3&gt;

&lt;p&gt;Next, we’ll run the base (pre-trained) Gemma-3-270M model the same way; only the model ID changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_id = "google/gemma-3-270m"  # or "google/gemma-3-270m-it" for the instruct model

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",   # GPU if present, else CPU
    attn_implementation="sdpa"  # good default in recent PyTorch
)

streamer = TextStreamer(tok)
inputs = tok("Explain Rust ownership like I'm 12:", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6c4sfkrvdjuhn0hac6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6c4sfkrvdjuhn0hac6q.png" alt=" " width="640" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 32: Run the script and generate a response
&lt;/h3&gt;

&lt;p&gt;Run the script with the following command to load google/gemma-3-270m and generate a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 gemma3_run.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsnbq4j7q2og4b9hhi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nsnbq4j7q2og4b9hhi1.png" alt=" " width="640" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Gemma-3-270M is a perfect example of how cutting-edge AI can be scaled down without losing its versatility. Whether you’re experimenting with the pre-trained variant for raw, general-purpose tasks, the instruction-tuned version for natural conversations, or the GGUF build for low-resource deployments, you get a model that’s fast, flexible, and surprisingly capable for its size.&lt;/p&gt;

&lt;p&gt;With this guide, you’ve learned how to set up a GPU-powered environment, run Gemma models through Ollama, Open WebUI, and Transformers, and even optimize them for speed and memory efficiency. You can now seamlessly switch between interactive browser-based chats, terminal sessions, and custom Python scripts, all while taking advantage of the model’s compact footprint and fast inference.&lt;/p&gt;

&lt;p&gt;Whether you’re building a chatbot, testing reasoning skills, summarizing content, or just exploring model behavior, Gemma-3-270M gives you the freedom to run it your way—from high-end GPUs to modest local machines. Now, it’s your turn to put it to the test, push its limits, and see what’s possible when big ideas meet small but mighty AI.&lt;/p&gt;

</description>
      <category>google</category>
      <category>gemma3</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>The OCR Model That Outranks GPT-4o</title>
      <dc:creator>Ayush kumar</dc:creator>
      <pubDate>Fri, 22 Aug 2025 06:28:33 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/the-ocr-model-that-outranks-gpt-4o-586b</link>
      <guid>https://dev.to/nodeshiftcloud/the-ocr-model-that-outranks-gpt-4o-586b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj69kxosssvz0e0lnt6fd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj69kxosssvz0e0lnt6fd.png" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NuMarkdown-8B-Thinking is a reasoning-powered OCR Vision-Language Model (VLM) built to transform documents into clean, structured Markdown. Fine-tuned from Qwen2.5-VL-7B, it introduces thinking tokens that help the model analyze complex layouts, tables, and unusual document structures before generating output. This makes it especially useful for RAG pipelines, document extraction, and knowledge organization. With its reasoning-first approach, NuMarkdown-8B-Thinking consistently outperforms generic OCR and even rivals large closed-source reasoning models in accuracy and layout understanding.&lt;/p&gt;

&lt;p&gt;Arena ranking against popular alternatives (using the TrueSkill-2 ranking system, with around 500 model-anonymized votes):&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rmloqrur3gwd5ynej03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rmloqrur3gwd5ynej03.png" alt=" " width="738" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Win/draw/lose rate against other models&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tr6lsdjthdpstjmhbht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tr6lsdjthdpstjmhbht.png" alt=" " width="732" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  GPU Configuration Table – NuMarkdown-8B-Thinking
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqhm4lfucoq0cvtufw56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqhm4lfucoq0cvtufw56.png" alt=" " width="734" height="539"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step-by-Step Process to Install &amp;amp; Run NuMarkdown-8B-Thinking Locally
&lt;/h3&gt;

&lt;p&gt;For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Sign Up and Set Up a NodeShift Cloud Account
&lt;/h3&gt;

&lt;p&gt;Visit the &lt;a href="https://app.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;NodeShift Platform&lt;/a&gt; and create an account. Once you’ve signed up, log into your account.&lt;/p&gt;

&lt;p&gt;Follow the account setup process and provide the necessary details and information.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7pl0k2h5ne22f94had3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7pl0k2h5ne22f94had3.png" alt=" " width="640" height="396"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node (Virtual Machine)
&lt;/h3&gt;

&lt;p&gt;GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwas758krgsifp4ms94b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwas758krgsifp4ms94b.png" alt=" " width="640" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9j75dudu8d27881wlf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc9j75dudu8d27881wlf.png" alt=" " width="640" height="345"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Select a Model, Region, and Storage
&lt;/h3&gt;

&lt;p&gt;In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F342r3scgupbda6v431zy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F342r3scgupbda6v431zy.png" alt=" " width="640" height="403"&gt;&lt;/a&gt;&lt;br&gt;
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Select Authentication Method
&lt;/h3&gt;

&lt;p&gt;There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z4djtierpw4f7yhvi3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z4djtierpw4f7yhvi3c.png" alt=" " width="640" height="223"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running NuMarkdown-8B-Thinking, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.&lt;/p&gt;

&lt;p&gt;We chose the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image is essential because it includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full CUDA toolkit (including nvcc; see the quick check below)&lt;/li&gt;
&lt;li&gt;Proper support for building and running GPU-based applications like NuMarkdown-8B-Thinking&lt;/li&gt;
&lt;li&gt;Compatibility with CUDA 12.1.1 required by certain model operations&lt;/li&gt;
&lt;/ul&gt;
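
&lt;p&gt;Once the VM is running, you can confirm that the devel toolkit is actually present with a quick check of the compiler version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvcc --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If this prints a CUDA 12.1 release string, the image is set up as expected; runtime-only images would fail here because they ship without nvcc.&lt;/p&gt;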

&lt;h3&gt;
  
  
  Launch Mode
&lt;/h3&gt;

&lt;p&gt;We selected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Interactive shell server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like NuMarkdown-8B-Thinking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Repository Authentication
&lt;/h3&gt;

&lt;p&gt;We left all fields empty here.&lt;/p&gt;

&lt;p&gt;Since the Docker image is publicly available on Docker Hub, no login credentials are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identification
&lt;/h3&gt;

&lt;p&gt;Template Name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia/cuda:12.1.1-devel-ubuntu22.04

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CUDA and cuDNN images from gitlab.com/nvidia/cuda; the devel variant contains the full CUDA toolkit, including nvcc.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ly5rg2j4egz6ypclm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72ly5rg2j4egz6ypclm0.png" alt=" " width="640" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme9ccqm7m5p9rcmjhc51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme9ccqm7m5p9rcmjhc51.png" alt=" " width="640" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup ensures that NuMarkdown-8B-Thinking runs in a GPU-enabled environment with proper CUDA access and high compute performance.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8e11y90eopm053tp7q8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8e11y90eopm053tp7q8.png" alt=" " width="640" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foptfgkhjzsl4m644kh26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foptfgkhjzsl4m644kh26.png" alt=" " width="640" height="346"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Virtual Machine Successfully Deployed
&lt;/h3&gt;

&lt;p&gt;You will get visual confirmation that your node is up and running.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgke1fbd00363p3nbp3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgke1fbd00363p3nbp3c.png" alt=" " width="640" height="259"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Connect to GPUs using SSH
&lt;/h3&gt;

&lt;p&gt;NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.&lt;/p&gt;

&lt;p&gt;Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw3fiq6o3mkcuy5wva7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbw3fiq6o3mkcuy5wva7u.png" alt=" " width="640" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvga8mbrmri96vm2v970t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvga8mbrmri96vm2v970t.png" alt=" " width="640" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open your terminal and paste the proxy SSH IP or direct SSH IP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6n329idbds5hl5xd6uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6n329idbds5hl5xd6uo.png" alt=" " width="640" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlww3pfai7ny74qljgw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlww3pfai7ny74qljgw.png" alt=" " width="640" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Check the Available Python Version and Install a New Version
&lt;/h3&gt;

&lt;p&gt;Run the following command to check which Python version is currently available:&lt;/p&gt;
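
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;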

&lt;p&gt;On this image, Python 3.8.1 is available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.&lt;/p&gt;

&lt;p&gt;Run the following commands to add the deadsnakes PPA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02vj084cooz94ya7e4jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02vj084cooz94ya7e4jk.png" alt=" " width="640" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Install Python 3.11
&lt;/h3&gt;

&lt;p&gt;Now, run the following command to install Python 3.11 or another desired version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install -y python3.11 python3.11-venv python3.11-dev

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17yuu799qp5sxpevgwhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17yuu799qp5sxpevgwhy.png" alt=" " width="640" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Update the Default Python3 Version
&lt;/h3&gt;

&lt;p&gt;Now, run the following command to link the new Python version as the default python3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the following command to verify that the new Python version is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfpjf9ntq0ghgik2nb8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfpjf9ntq0ghgik2nb8s.png" alt=" " width="640" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 11: Install and Update Pip
&lt;/h3&gt;

&lt;p&gt;Run the following commands to install and update pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run the following command to check the version of pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3rsmq4n5hl8o0ohjnw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3rsmq4n5hl8o0ohjnw1.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 12: Create and Activate a Python 3.11 Virtual Environment
&lt;/h3&gt;

&lt;p&gt;Run the following commands to create and activate a Python 3.11 virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt update &amp;amp;&amp;amp; apt install -y python3.11-venv git wget
python3.11 -m venv numarkdown
source numarkdown/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey5pxc2rb93liy16aqww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey5pxc2rb93liy16aqww.png" alt=" " width="640" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 13: Install Torch
&lt;/h3&gt;

&lt;p&gt;Run the following command to install PyTorch; pinning torchvision to the cu121 build pulls in a matching CUDA-enabled torch wheel as a dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "torchvision==0.18.1+cu121" --index-url https://download.pytorch.org/whl/cu121

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
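
&lt;p&gt;Optionally, you can sanity-check that the CUDA build of PyTorch was pulled in correctly with a one-liner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You should see a +cu121 version string and True for CUDA availability.&lt;/p&gt;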



&lt;h3&gt;
  
  
  Step 14: Install Dependencies
&lt;/h3&gt;

&lt;p&gt;Run the following command to install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U pillow transformers accelerate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ckl1v1h6nbw9ce14gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ckl1v1h6nbw9ce14gy.png" alt=" " width="640" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 15: Connect to your GPU VM using Remote SSH
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open VS Code, Cursor, or your code editor of choice on your Mac.&lt;/li&gt;
&lt;li&gt;Press Cmd + Shift + P, then choose Remote-SSH: Connect to Host.&lt;/li&gt;
&lt;li&gt;Select your configured host.&lt;/li&gt;
&lt;li&gt;Once connected, you’ll see SSH: 149.7.4.3 (your VM IP) in the bottom-left status bar (like in the image).
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zj8bf1d4x73zyg0a8g2.png" alt=" " width="640" height="450"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 16: Create a New Python Script (numarkdown.py) and Add the Following Code
&lt;/h3&gt;

&lt;p&gt;Create a new Python script (for example, numarkdown.py) and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# --- Force stable attention backend (avoid FlashAttention-2) ---
os.environ["TRANSFORMERS_ATTENTION_IMPLEMENTATION"] = "sdpa"
os.environ["HF_USE_FLASH_ATTENTION_2"] = "0"

# --- Model &amp;amp; processor setup ---
model_id = "numind/NuMarkdown-8B-Thinking"

# Use slow processor to silence "fast vs slow" warnings (optional)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    use_fast=False,  # keep legacy processor
    min_pixels=100 * 28 * 28,
    max_pixels=5000 * 28 * 28
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="bfloat16",        # efficient on modern GPUs
    device_map="auto",             # auto-GPU placement
    trust_remote_code=True,
    attn_implementation="sdpa",    # force PyTorch SDPA attention
)

# --- Input image (replace with your doc image) ---
img = Image.open("sample.png").convert("RGB")

# Optional downscale: keep under ~3–4 MP to save VRAM
MAX_SIDE = 2200
img.thumbnail((MAX_SIDE, MAX_SIDE))

# --- Prompt &amp;amp; inputs ---
messages = [{"role": "user", "content": [{"type": "image"}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

# --- Run inference ---
with torch.no_grad():
    out = model.generate(
        **inputs,
        temperature=1e-5,
        max_new_tokens=2000  # adjust if you need longer markdown
    )

result = processor.decode(out[0])

# --- Extract &amp;lt;answer&amp;gt; cleanly ---
def between(s, a, b):
    i = s.find(a)
    j = s.find(b, i + len(a))
    return s[i + len(a):j] if i != -1 and j != -1 else s

answer = between(result, "&amp;lt;answer&amp;gt;", "&amp;lt;/answer&amp;gt;")
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8igvgad9lgpw7hfo7rhq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8igvgad9lgpw7hfo7rhq.png" alt=" " width="640" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 17: Upload Image via the Editor &amp;amp; Run the Script
&lt;/h3&gt;

&lt;h4&gt;
  
  
  17.1 Open the VM workspace in your editor
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In VS Code: Remote Explorer → SSH Targets → connect to your VM → open /root (or your chosen project folder).&lt;/li&gt;
&lt;li&gt;You should see your project files (numarkdown.py, etc.) in the left Explorer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  17.2 Upload your local image to the VM (drag &amp;amp; drop)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In VS Code Explorer (connected to the VM), right-click the folder where numarkdown.py lives (e.g., /root) and choose “Reveal in File Explorer” (optional) just to confirm location.&lt;/li&gt;
&lt;li&gt;Drag your local image file (e.g., sample.png or myscan.jpg) from your laptop’s file manager into the VS Code Explorer for the VM workspace.&lt;/li&gt;
&lt;li&gt;Confirm the upload when prompted. You should now see the image in the remote file list (e.g., /root/sample.png).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  17.3 (Optional) Rename the file to match the script
&lt;/h4&gt;

&lt;p&gt;If your script expects image.png:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In VS Code Explorer: right-click the uploaded file → Rename → image.png.
(Or skip this if your script accepts a CLI argument.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  17.4 Activate the venv in the editor’s terminal (remote)
&lt;/h4&gt;

&lt;p&gt;In VS Code, open a terminal (Terminal → New Terminal). It’s already running on the VM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source ~/numarkdown/bin/activate
cd ~

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  17.5 Run the extractor
&lt;/h4&gt;

&lt;p&gt;If your script expects image.png:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 numarkdown.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your script accepts a filename:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 numarkdown.py sample.png

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see the Markdown printed in the terminal.&lt;/p&gt;
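
&lt;p&gt;Note that the numarkdown.py from Step 16 hardcodes sample.png, so the filename-argument route needs a small tweak. Here’s a minimal sketch (this sys.argv handling is an assumption, not part of the original script) to replace the hardcoded Image.open line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

# Hypothetical tweak: use the first CLI argument, else fall back to sample.png
image_path = sys.argv[1] if len(sys.argv) &amp;gt; 1 else "sample.png"
img = Image.open(image_path).convert("RGB")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;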

&lt;h4&gt;
  
  
  17.6 Save the Markdown to a file (so you can open it in the editor)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# image.png route
python3 numarkdown.py &amp;gt; output.md

# argument route
python3 numarkdown.py sample.png &amp;gt; output.md

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In VS Code Explorer, click output.md to preview the formatted result right in your editor.&lt;/p&gt;

&lt;h4&gt;
  
  
  17.7 Quick checks &amp;amp; common fixes
&lt;/h4&gt;

&lt;p&gt;Don’t see the image in VS Code on the VM? You likely uploaded to a different folder. Check the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pwd &amp;amp;&amp;amp; ls -lh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the image sits next to numarkdown.py (or pass its full path).&lt;/p&gt;

&lt;p&gt;FileNotFoundError: 'image.png'&lt;br&gt;
Rename your uploaded file to image.png, or pass the real filename as an argument (e.g., python3 numarkdown.py sample.png).&lt;/p&gt;

&lt;p&gt;Large scans / VRAM: If you hit OOM, downscale locally before upload, or let the script handle it (our script already thumbnails to ~3–4 MP).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjslq6o4hizywnd3r6m2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjslq6o4hizywnd3r6m2a.png" alt=" " width="640" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoa2iogtsi7waan93yuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoa2iogtsi7waan93yuh.png" alt=" " width="640" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1hu0nrrfg7pkap5bdoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1hu0nrrfg7pkap5bdoh.png" alt=" " width="640" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokr5yca3skbyv3ikb2an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokr5yca3skbyv3ikb2an.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sf6s2sdq2l4jxbhcz5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sf6s2sdq2l4jxbhcz5j.png" alt=" " width="640" height="409"&gt;&lt;/a&gt;&lt;br&gt;
Up until now, we’ve been running and interacting with our model directly from the terminal. That worked fine for quick tests, but now let’s make things smoother and more user-friendly by running it inside a browser interface. For that, we’ll use Streamlit, a lightweight Python framework that lets us build interactive web apps in just a few lines of code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 18: Install Required Libraries for Browser App
&lt;/h3&gt;

&lt;p&gt;First, install Streamlit along with a few other helper libraries we’ll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install streamlit pillow pdf2image pypdf transformers accelerate timm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command installs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;streamlit → run the browser app&lt;/li&gt;
&lt;li&gt;pillow → handle image processing&lt;/li&gt;
&lt;li&gt;pdf2image &amp;amp; pypdf → process PDFs&lt;/li&gt;
&lt;li&gt;transformers, accelerate, timm → load and run the model efficiently
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf7mp6e0ce5uvey0ga9y.png" alt=" " width="640" height="408"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 19: Fix APT Sources, Update, and Install Poppler Utils
&lt;/h3&gt;

&lt;p&gt;We’ll switch the Ubuntu mirror to the official archive, clean out stale apt lists, update the package indexes with retries enabled, and finally install poppler-utils (which provides pdftoppm/pdftocairo), all in one command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo sed -i 's|http://mirror.serverion.com/ubuntu|http://archive.ubuntu.com/ubuntu|g' /etc/apt/sources.list &amp;amp;&amp;amp; \
sudo apt-get clean &amp;amp;&amp;amp; \
sudo rm -rf /var/lib/apt/lists/* &amp;amp;&amp;amp; \
sudo apt-get update -o Acquire::Retries=3 --fix-missing &amp;amp;&amp;amp; \
sudo apt-get install -y poppler-utils

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd9vxkmx1q9fjcxec33m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftd9vxkmx1q9fjcxec33m.png" alt=" " width="640" height="408"&gt;&lt;/a&gt;&lt;/p&gt;
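
&lt;p&gt;To verify that Poppler landed correctly, check the version of pdftoppm (the tool pdf2image calls under the hood):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pdftoppm -v

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;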

&lt;h3&gt;
  
  
  Step 20: Create the Streamlit App Script (app.py)
&lt;/h3&gt;

&lt;p&gt;We’ll write a full Streamlit UI that lets you upload an image or PDF, runs NuMarkdown-8B-Thinking, and returns clean Markdown (with an option to view the raw output that contains the &amp;lt;think&amp;gt; reasoning trace).&lt;/p&gt;

&lt;p&gt;Create app.py in your VM (inside your project folder) and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import io
import time
from typing import List, Tuple

import streamlit as st
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# --- Force stable attention backend (avoid FlashAttention-2) ---
os.environ["TRANSFORMERS_ATTENTION_IMPLEMENTATION"] = "sdpa"
os.environ["HF_USE_FLASH_ATTENTION_2"] = "0"

MODEL_ID = "numind/NuMarkdown-8B-Thinking"
MAX_SIDE = 2200                           # ~3–4MP safety
MIN_PIXELS = 100 * 28 * 28               # model hint
MAX_PIXELS = 5000 * 28 * 28              # model hint
DEFAULT_MAX_NEW_TOKENS = 2000

st.set_page_config(page_title="NuMarkdown-8B-Thinking UI", layout="wide")

@st.cache_resource(show_spinner=True)
def load_model_and_processor():
    processor = AutoProcessor.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,
        use_fast=False,          # quiet warnings, stable behavior
        min_pixels=MIN_PIXELS,
        max_pixels=MAX_PIXELS,
    )
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation="sdpa",
    )
    model.eval()
    return processor, model

def pil_from_upload(file) -&amp;gt; Image.Image:
    img = Image.open(file).convert("RGB")
    img.thumbnail((MAX_SIDE, MAX_SIDE))
    return img

def pdf_to_images(file_bytes: bytes, dpi: int = 200) -&amp;gt; List[Image.Image]:
    # Convert PDF bytes to a list of PIL images (requires poppler-utils)
    try:
        from pdf2image import convert_from_bytes
    except Exception as e:
        raise RuntimeError(
            "pdf2image is not available or Poppler is missing. "
            "Install with `pip install pdf2image` and `sudo apt-get install poppler-utils`."
        ) from e
    images = convert_from_bytes(file_bytes, dpi=dpi)
    # downscale each page to ~3–4MP max
    for i in range(len(images)):
        images[i] = images[i].convert("RGB")
        images[i].thumbnail((MAX_SIDE, MAX_SIDE))
    return images

def between(s: str, a: str, b: str) -&amp;gt; str:
    i = s.find(a)
    j = s.find(b, i + len(a))
    return s[i + len(a):j] if i != -1 and j != -1 else s

@torch.inference_mode()
def run_single_image(processor, model, img: Image.Image, temperature: float, max_new_tokens: int) -&amp;gt; Tuple[str, str]:
    messages = [{"role": "user", "content": [{"type": "image"}]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[img], return_tensors="pt").to(model.device)

    out = model.generate(
        **inputs,
        temperature=max(temperature, 1e-5),  # must be &amp;gt; 0 in recent transformers
        max_new_tokens=max_new_tokens,
    )
    text = processor.decode(out[0])
    answer = between(text, "&amp;lt;answer&amp;gt;", "&amp;lt;/answer&amp;gt;")
    return answer, text  # (markdown, raw_with_think)

def concat_markdown(pages_md: List[str]) -&amp;gt; str:
    # Add page separators for clarity
    parts = []
    for i, md in enumerate(pages_md, 1):
        parts.append(f"\n\n---\n\n&amp;lt;!-- Page {i} --&amp;gt;\n\n{md.strip()}\n")
    return "".join(parts).strip()

# ----------------- UI -----------------

st.title("🧠 NuMarkdown-8B-Thinking — Document → Markdown")
st.caption("Upload a scanned page (PNG/JPG) or a PDF. The model reasons about layout, tables, etc., then returns clean Markdown.")

col_left, col_right = st.columns([2, 1])

with col_right:
    st.subheader("Settings")
    temperature = st.number_input("Temperature", value=0.00001, min_value=0.00001, max_value=2.0, step=0.00001, format="%.5f")
    max_new_tokens = st.number_input("Max new tokens", value=DEFAULT_MAX_NEW_TOKENS, min_value=200, max_value=6000, step=100)
    show_think = st.toggle("Show &amp;lt;think&amp;gt; (reasoning) raw output", value=False)
    run_button = st.button("Run Extraction", type="primary", use_container_width=True)

with col_left:
    upload = st.file_uploader("Upload an image or a PDF", type=["png", "jpg", "jpeg", "pdf"])

st.divider()

if run_button:
    if not upload:
        st.error("Please upload a PNG/JPG or PDF first.")
        st.stop()

    processor, model = load_model_and_processor()

    filetype = (upload.type or "").lower()
    start_time = time.time()

    if "pdf" in filetype or upload.name.lower().endswith(".pdf"):
        # PDF → images
        with st.status("Converting PDF to images…", expanded=False):
            pdf_bytes = upload.read()
            images = pdf_to_images(pdf_bytes, dpi=200)
        st.success(f"PDF pages: {len(images)}")

        pages_md = []
        progress = st.progress(0, text="Running model on pages…")
        for i, img in enumerate(images, 1):
            md, raw = run_single_image(processor, model, img, temperature, max_new_tokens)
            pages_md.append(md)
            progress.progress(i / len(images), text=f"Processed page {i}/{len(images)}")

            if show_think:
                with st.expander(f"Raw output (page {i})"):
                    st.code(raw)

        markdown_all = concat_markdown(pages_md)
        dur = time.time() - start_time

        st.subheader("📄 Markdown (all pages)")
        st.code(markdown_all, language="markdown")
        st.download_button("Download Markdown", data=markdown_all.encode("utf-8"),
                           file_name=f"{upload.name.rsplit('.',1)[0]}_extracted.md", mime="text/markdown")
        st.caption(f"Done in {dur:.1f}s")

    else:
        # Single image
        img = pil_from_upload(upload)
        st.image(img, caption="Input image", use_column_width=True)

        with st.status("Running model…", expanded=False):
            md, raw = run_single_image(processor, model, img, temperature, max_new_tokens)
        dur = time.time() - start_time

        st.subheader("📝 Markdown")
        st.code(md, language="markdown")
        st.download_button("Download Markdown", data=md.encode("utf-8"),
                           file_name=f"{upload.name.rsplit('.',1)[0]}_extracted.md", mime="text/markdown")

        if show_think:
            st.subheader("🧩 Raw output (with &amp;lt;think&amp;gt;)")
            st.code(raw)

        st.caption(f"Done in {dur:.1f}s")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi02sukhps95cwzbe7f4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi02sukhps95cwzbe7f4.png" alt=" " width="640" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 21: Launch the Streamlit App
&lt;/h3&gt;

&lt;p&gt;Now that we’ve written our app.py Streamlit script, the next step is to launch the app from the terminal.&lt;/p&gt;

&lt;p&gt;Run the following command inside your VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;streamlit run app.py --server.port 7860 --server.address 0.0.0.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;--server.port 7860 → Runs the app on port 7860 (you can change it if needed).&lt;/li&gt;
&lt;li&gt;--server.address 0.0.0.0 → Ensures the app is accessible externally (not just inside the VM); if only SSH is exposed, see the tunnel sketch below.&lt;/li&gt;
&lt;/ul&gt;
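
&lt;p&gt;If your VM doesn’t expose port 7860 publicly and you only have SSH access, a common workaround is to tunnel the port to your laptop; the user and IP below are placeholders for your own VM’s SSH details:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Placeholder user/IP: forward the VM's port 7860 to your local machine
ssh -L 7860:localhost:7860 root@149.7.4.3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the tunnel open, the app is reachable at http://localhost:7860 in your local browser.&lt;/p&gt;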

&lt;p&gt;Once executed, Streamlit will start the web server and you’ll see a message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You can now view your Streamlit app in your browser.

URL: http://0.0.0.0:7860

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwhqfjzqegk4yogotsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvwhqfjzqegk4yogotsd.png" alt=" " width="640" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 22: Access the Streamlit App in Browser
&lt;/h3&gt;

&lt;p&gt;After launching the app, you’ll see the interface in your browser.&lt;/p&gt;

&lt;p&gt;Go to the following URL (replace 0.0.0.0 with your VM’s public IP, or use the SSH tunnel from Step 21):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://0.0.0.0:7860/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvs550n4bd256i6d8mfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvs550n4bd256i6d8mfw.png" alt=" " width="640" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 23: Upload and Extract Documents
&lt;/h3&gt;

&lt;p&gt;Use the Drag and Drop or Browse files button to upload a scanned image (.jpg/.png) or a PDF.&lt;br&gt;
Adjust Settings on the right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature → Controls randomness (keep it very low, e.g., 0.00001, for OCR).&lt;/li&gt;
&lt;li&gt;Max new tokens → Length of the output (default: 2000).&lt;/li&gt;
&lt;li&gt;Show &amp;lt;think&amp;gt; reasoning → Optional; shows the model’s raw reasoning output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click Run Extraction.&lt;/p&gt;

&lt;p&gt;The model will process your input file, convert images/PDF pages into clean Markdown output, and display it below. You can copy or download this Markdown directly.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtv2kernnip7t5brtvy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtv2kernnip7t5brtvy6.png" alt=" " width="640" height="369"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---

&amp;lt;!-- Page 1 --&amp;gt;

# Ayush Kumar

+91-998-4219-294 | ayushknj3@gmail.com | linktr.ee/Ayush7614
[in] ayush-kumar-984443191 | [Chat] Ayush7614 | [Twitter] @AyushKu38757918
Noida, Uttar Pradesh, India

### Objective
Developer Relations Engineer and Full-Stack Developer with deep expertise in open-source, cloud, LLMs, AI/ML, DevOps, and technical community building. Adept at creating large-scale developer education content and tools that empower engineers globally.

### Education
* ABES Engineering College
  * B.Tech in Electronics and Communication Engineering
  * – GPA: 7.7 / 10
  * – Courses: Operating Systems, Data Structures, Algorithms, AI, ML, Networking, Databases
  * July 2019 – August 2023
  * Ghaziabad, India

### Experience
* NodeShift AI Cloud
  * Lead Developer Relations Engineer
  * – Authored 150+ blogs on AI, LLMs, MCP, APIs, Web3, Gaming, Cloud, and TAK Server.
  * – Worked on the Dubai UAE Government’s TAK Server deployment project using NodeShift GPU and compute VMs.
  * – Designed and implemented marketing strategies to enhance brand visibility and audience engagement.
  * – Created developer-focused content in multiple formats (blogs, guides, videos) to educate and captivate our global community.
  * – Actively engaged with users across platforms to increase awareness and adoption of NodeShift services.
  * – Explored and initiated sponsorship and partnership opportunities across technical and developer communities.
  * – Reviewed customer feedback and usage patterns to refine developer experience and improve product documentation.
  * – Led efforts to improve and expand technical documentation to ensure a smoother onboarding experience and increased retention.
  * July 2024 – Present
  * Remote
* Techlatest.net
  * DevRel Engineer Consultant
  * – Content Lead – Developed strategy for AI/ML, DevOps, and GUI-based content.
  * – Authored 150+ blogs and tutorials across Cloud, Linux, Stable Diffusion, Flowise, Superset, etc.
  * – Built GUI Linux (Ubuntu, Kali, Rocky, Tails), Redash, VSCode, RStudio-based developer VMs.
  * – Created newsletters, video courses, and product documentation.
  * – Lead social media presence and SEO optimization; grow Discord and Twitter community.
  * – Worked across AWS, GCP, and Azure ecosystems for product testing and publishing.
  * March 2023 – July 2024
  * Estonia, Remote
* DEVs Dungeon
  * DevRel Engineer, Community Work (Part Time)
  * – Writing blogs for the DEVs Dungeon Community blog.
  * – Organizing Meetups and Hackathons in my Region.
  * – Participating in Events to Represent DEVs Dungeon.
  * – Social media marketing for DEVs Dungeon.
  * – Creating Content on GitHub, Twitter, and LinkedIn.
  * – Building and managing the community.
  * March 2023 – December 2023
  * Remote
* Google Summer of Code - Fossology
  * Student Developer
  * – Built REST APIs using ReactJs and improved legacy APIs.
  * – Created new endpoints with PHP and Slim Framework.
  * – Updated documentation using YAML files for API clarity.
  * May 2022 – August 2022
  * Remote


---

&amp;lt;!-- Page 2 --&amp;gt;

* **Humalect**
  * **DevRel Engineer (Intern)**
    – Content Lead for Humalect on social platforms.
    – Wrote blogs, newsletters, and planned podcasts.
    – Represented Humalect at events and built community.
  December 2022 – January 2023
  Remote

* **QwikSkills**
  * **Community Manager (Intern)**
    – Onboarded 300+ community members, hosted online events.
    – Managed Discord/Telegram and wrote community blogs.
    – Designed campaigns and handled technical support.
  August 2022 – January 2023
  Remote

* **NimbleEdge**
  * **Community Manager (Intern)**
    – Engaged OSS community and hosted global events.
    – Managed dev communities across GitHub, Discord, Meetup.
    – Created support content, handled social media and code issues.
  September 2022 – November 2022
  Remote

* **Keploy**
  * **Open Source Engineer (Intern)**
    – Set up CI/CD pipelines using GitHub Actions.
    – Built UI for Keploy website with ReactJs.
    – Contributed to the main platform.
  May 2022 – August 2022
  Remote

* **Keploy**
  * **DevRel Engineer (Intern)**
    – Provided API guidance and SDK support.
    – Built demo apps and participated in technical forums.
  April 2022 – July 2022
  Remote

* **CryptoCapable**
  * **DevRel Engineer (Intern)**
    – Promoted Web3, Crypto, Blockchain technologies.
    – Delivered talks and guided developer onboarding.
  February 2022 – April 2022
  Remote

* **Hyathi Technologies**
  * **Full Stack Developer (Intern)**
    – Built website MVP with React, Tailwind, NodeJS, MongoDB.
    – Implemented CI/CD using GitHub Actions.
  December 2021 – January 2022
  Remote

* **OneGo**
  * **Full Stack Developer (Intern)**
    – Developed startup site using HTML, CSS, Bootstrap.
    – Integrated Firebase backend, deployed via GitHub Actions.
  September 2021 – November 2021
  Ghaziabad, India

## Projects

* **Paanch-Editor**
  * **Responsive image editing tool using JS, HTML/CSS with 5+ effects**
    – Allows users to apply effects and download edited images directly in-browser.
  Remote

* **Etihaas Chrome Extension**
  * **Displays 'On this day' historical facts using public APIs**
    – Chrome extension shows history events for today’s date from API.
  Remote

* **Foody-Moody**
  * **Fusion food recipe site using React, Node, MongoDB**
    – Dynamic full-stack web app offering unique cuisine recipes.
  Remote

* **Tutorhuntz (Freelance)**
  * **Platform connecting tutors and students in 100+ subjects**
    – Built with React, Node.js, Express.js, Minimal UI, designed for academic support.
  Remote

* **Zipify**
  * **File compression web app built in Node.js**
    – Compress files into ZIPs using jszip and Express server.
  Remote

* **Women-Help Tracker**
  * **Health tracking web app for menstrual wellness**
    – Developed using HTML/CSS, Node.js, Python to support women’s wellness.
  Remote


---

&amp;lt;!-- Page 3 --&amp;gt;

## Honors and Awards

*   Winner – Smart India Hackathon 2022, led team of 5 to national victory.
*   First in college to become GitHub Campus Expert and GSoC contributor.
*   AWS Machine Learning and SUSE Cloud Native Scholarship by Udacity.
*   Top ranks: 3rd in KWOC, 5th SWOC, 17th JWOC, 81st DWOC, 6th CWOC.
*   Best Mentor Award – HSSOC, PSOC, DevicePT open source programs.

## Volunteer Experience

*   Founder – Nexus What The Hack: national-level hackathon community.
*   GitHub Campus Expert – Conducted 20+ technical events, meetups, and hackathons.
*   Auth0 Ambassador – Delivered tech sessions, supported community growth.
*   Mentor – SigmaHacks, CalHacks, Hack This November, HackVolunteer, Garuda Hacks.
*   Organized 15+ community bootcamps and mentored 2000+ budding OSS contributors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;NuMarkdown-8B-Thinking brings reasoning into OCR like never before. By combining the power of Qwen2.5-VL with fine-tuned thinking tokens, it doesn’t just extract text — it understands layouts, tables, and complex structures before producing clean Markdown. This reasoning-first approach makes it a strong choice for document extraction, RAG pipelines, and knowledge organization, often rivaling even closed-source models in accuracy.&lt;/p&gt;

&lt;p&gt;With the setup steps we walked through — from provisioning a GPU VM to running the model inside an intuitive Streamlit interface — you now have a complete end-to-end workflow. You can upload PDFs or images, watch them convert into structured Markdown in real time, and immediately use that output in your own applications.&lt;/p&gt;

&lt;p&gt;Whether you’re a researcher, developer, or enterprise team, NuMarkdown-8B-Thinking offers a practical, open, and high-performing solution for document intelligence. Try it on your own documents, plug it into your pipelines, and experience what reasoning-powered OCR can unlock.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>ocr</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Open-Source App Builder That Ate SaaS: Dyad + Ollama Setup</title>
      <dc:creator>Ayush kumar</dc:creator>
      <pubDate>Fri, 22 Aug 2025 05:50:28 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/the-open-source-app-builder-that-ate-saas-dyad-ollama-setup-47o2</link>
      <guid>https://dev.to/nodeshiftcloud/the-open-source-app-builder-that-ate-saas-dyad-ollama-setup-47o2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgim8q96vapz070ndpw6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgim8q96vapz070ndpw6.jpg" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dyad is a free, local, and open-source app builder that lets you create AI-powered apps without writing code. It’s a privacy-friendly alternative to platforms like Lovable, v0, Bolt, and Replit—designed to run entirely on your computer, with no lock-in or vendor dependency. With built-in Supabase integration, support for any AI model (including local ones via Ollama), and seamless connection to your existing tools, Dyad makes it easy to launch full-stack apps quickly. Fast, intuitive, and open-source, Dyad is built for makers who want control, speed, and limitless creativity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Website&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://www.dyad.sh/" rel="noopener noreferrer"&gt;https://www.dyad.sh/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://github.com/dyad-sh/dyad" rel="noopener noreferrer"&gt;https://github.com/dyad-sh/dyad&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-by-Step Process to Setup Dyad + Ollama
&lt;/h3&gt;

&lt;p&gt;For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Sign Up and Set Up a NodeShift Cloud Account
&lt;/h3&gt;

&lt;p&gt;Visit the &lt;a href="https://app.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;NodeShift Platform&lt;/a&gt; and create an account. Once you’ve signed up, log into your account.&lt;/p&gt;

&lt;p&gt;Follow the account setup process and provide the necessary details and information.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq99yae9g02o17tqnz4o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq99yae9g02o17tqnz4o7.png" alt=" " width="640" height="365"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create a GPU Node (Virtual Machine)
&lt;/h3&gt;

&lt;p&gt;GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagcwrm6k2vb6ee8sti6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagcwrm6k2vb6ee8sti6d.png" alt=" " width="640" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxe0couy017d9ou2pti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5bxe0couy017d9ou2pti.png" alt=" " width="640" height="391"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Select a Model, Region, and Storage
&lt;/h3&gt;

&lt;p&gt;In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwnhdzbb7yqnbc3bhe73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwnhdzbb7yqnbc3bhe73.png" alt=" " width="640" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2u29y8ecetgul6alh9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2u29y8ecetgul6alh9x.png" alt=" " width="640" height="319"&gt;&lt;/a&gt;&lt;br&gt;
We will use 1 x H100 SXM GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Select Authentication Method
&lt;/h3&gt;

&lt;p&gt;There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxidb6wc1ardza9s1dc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxidb6wc1ardza9s1dc7.png" alt=" " width="640" height="198"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;Next, you will need to choose an image for your Virtual Machine. We will deploy Ollama on an NVIDIA CUDA Virtual Machine. CUDA, NVIDIA’s proprietary parallel computing platform, is what lets Ollama use the GPU on your Node.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlev6zehrr58cf16hj4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlev6zehrr58cf16hj4l.png" alt=" " width="640" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuzxy64s9iqeoz5fepmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuzxy64s9iqeoz5fepmt.png" alt=" " width="640" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 6: Virtual Machine Successfully Deployed
&lt;/h3&gt;

&lt;p&gt;You will get visual confirmation that your node is up and running.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiekwby2h60ch0non5730.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiekwby2h60ch0non5730.png" alt=" " width="640" height="286"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 7: Connect to GPUs using SSH
&lt;/h3&gt;

&lt;p&gt;NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.&lt;/p&gt;

&lt;p&gt;Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtllnxpf0ajwhewol79a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtllnxpf0ajwhewol79a.png" alt=" " width="640" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleobtnt5surebbep2rfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleobtnt5surebbep2rfy.png" alt=" " width="640" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open your terminal and paste the proxy SSH IP or direct SSH IP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqnrrao74a2dg0mgg880.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqnrrao74a2dg0mgg880.png" alt=" " width="640" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia-smi

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08srbft3zdl39elgeeid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08srbft3zdl39elgeeid.png" alt=" " width="640" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Install Ollama
&lt;/h3&gt;

&lt;p&gt;After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.&lt;/p&gt;

&lt;p&gt;Website Link: &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;https://ollama.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the following command to install Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://ollama.com/install.sh | sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0bvyk9trbohbh3o0wvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0bvyk9trbohbh3o0wvs.png" alt=" " width="640" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Serve Ollama
&lt;/h3&gt;

&lt;p&gt;Run the following command to serve Ollama so it can be reached from other machines (binding to 0.0.0.0 exposes the API on all network interfaces):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OLLAMA_HOST=0.0.0.0:11434 ollama serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhaoet91c7j34hlmvaqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhaoet91c7j34hlmvaqh.png" alt=" " width="640" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Pull the GPT OSS 120B Model
&lt;/h3&gt;

&lt;p&gt;Run the following command to pull the GPT OSS 120B Model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull gpt-oss:120b

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the download and extraction to finish until the output reports success.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0ff89ert40qojksulbd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0ff89ert40qojksulbd.png" alt=" " width="640" height="273"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 11: Verify Downloaded Models
&lt;/h3&gt;

&lt;p&gt;After pulling the GPT-OSS models, you can check that they’ve been successfully downloaded and are available on your system.&lt;/p&gt;

&lt;p&gt;Just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama list

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           ID              SIZE   MODIFIED
gpt-oss:120b   735371f916a9    65 GB  50 seconds ago

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgnhcrt94zempg6hvbit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgnhcrt94zempg6hvbit.png" alt=" " width="640" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 12: Set Up SSH Port Forwarding (For Remote Models Like Ollama on a GPU VM)
&lt;/h3&gt;

&lt;p&gt;If you’re running a model like Ollama on a remote GPU Virtual Machine (e.g. via NodeShift, AWS, or your own server), you’ll need to port forward the Ollama server to your local machine so Dyad can connect to it.&lt;/p&gt;

&lt;p&gt;Here’s how to do it:&lt;/p&gt;

&lt;p&gt;Example (Mac/Linux Terminal):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 11434:localhost:11434 root@&amp;lt;your-vm-ip&amp;gt; -p &amp;lt;your-ssh-port&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected, your local machine will treat &lt;a href="http://localhost:11434" rel="noopener noreferrer"&gt;http://localhost:11434&lt;/a&gt; as if Ollama is running locally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;&amp;lt;your-vm-ip&amp;gt;&lt;/code&gt; with your VM’s IP address&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;&amp;lt;your-ssh-port&amp;gt;&lt;/code&gt; with your custom SSH port (e.g. 19257)&lt;/li&gt;
&lt;/ul&gt;
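&lt;p&gt;Once the tunnel is up, you can verify it end-to-end by listing models through the forwarded port (assuming the tunnel terminates at Ollama’s default port on the VM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# run this on your LOCAL machine; the request travels through the SSH tunnel
# and should list the models pulled on the VM (e.g., gpt-oss:120b)
curl http://localhost:11434/api/tags

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;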

&lt;p&gt;On Windows:&lt;br&gt;
Use a tool like &lt;a href="https://www.putty.org/" rel="noopener noreferrer"&gt;PuTTY&lt;/a&gt; or ssh from WSL/PowerShell with similar port forwarding.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78pzow1l1ilpojnm4q7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F78pzow1l1ilpojnm4q7o.png" alt=" " width="640" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re running large language models (like GPT-OSS 120b) on a remote GPU Virtual Machine, you’ll want Dyad on your local machine to talk to that remote Ollama instance.&lt;/p&gt;

&lt;p&gt;But since the model is running on the VM — not on your laptop — we need to bridge the gap.&lt;/p&gt;

&lt;p&gt;That’s where SSH port forwarding comes in.&lt;/p&gt;

&lt;p&gt;Why use a GPU VM?&lt;br&gt;
Large models require serious compute power. Your laptop might struggle or overheat trying to run them. So we spin up a GPU-powered VM in the cloud — it gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster responses&lt;/li&gt;
&lt;li&gt;Support for large models (7B, 13B, even 120B!)&lt;/li&gt;
&lt;li&gt;More RAM + VRAM for smoother inference&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 13: Download Dyad
&lt;/h3&gt;

&lt;p&gt;To get started with Dyad, you’ll need to download the installer from the official website:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open your web browser (Google Chrome, Safari, Firefox, or Edge).&lt;/li&gt;
&lt;li&gt;In the search bar, type “Dyad app” and press Enter.&lt;/li&gt;
&lt;li&gt;From the search results, click on the link to the official Dyad website (look for the domain that says it’s the official site).&lt;/li&gt;
&lt;li&gt;On the homepage, locate the “Download Dyad” button at the top right or center of the page.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the correct version for your operating system:&lt;br&gt;
macOS (Apple Silicon or Intel)&lt;br&gt;
Windows&lt;br&gt;
Linux (if available)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click the button to start the download. The file will automatically save to your computer’s default download folder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the download is complete, you’re ready to move on to installation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: Dyad is free, open-source, and works without vendor lock-in. It supports building full-stack AI apps with Supabase integration and can connect with popular models like Gemini, GPT, and Claude.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttsnam2bas5fhxtyw7uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttsnam2bas5fhxtyw7uw.png" alt=" " width="640" height="428"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 14: Set Up Dyad for the First Time
&lt;/h3&gt;

&lt;p&gt;Once Dyad is installed and launched, you’ll see a setup screen that helps you prepare your environment for building apps. Follow these steps carefully:&lt;/p&gt;

&lt;p&gt;Install Node.js (App Runtime)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dyad requires Node.js to run your applications locally.&lt;/li&gt;
&lt;li&gt;If Node.js is already installed on your machine, Dyad will detect it automatically and mark this step as complete (green check).&lt;/li&gt;
&lt;li&gt;If not, you’ll be prompted to download and install Node.js. Simply follow the link provided, install the latest LTS version, and restart Dyad (you can verify the install with the commands below).&lt;/li&gt;
&lt;/ul&gt;
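&lt;p&gt;If you’re unsure whether Node.js is already installed, a quick terminal check looks like this (any recent LTS version should work):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node --version   # prints something like v20.x.x if Node.js is installed
npm --version    # the package manager ships with Node.js

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;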

&lt;p&gt;Setup AI Model Access&lt;br&gt;
To generate and run apps, Dyad needs access to AI providers. You can connect one or multiple providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Gemini – Click “Setup Google Gemini API Key” to use Gemini for free. You’ll be redirected to create or retrieve your API key, then paste it back into Dyad.&lt;/li&gt;
&lt;li&gt;Other AI Providers – If you want more options, click “Setup other AI providers.” Dyad supports OpenAI, Anthropic, OpenRouter, and more. Enter the corresponding API keys in the fields provided.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Import or Start a New App&lt;br&gt;
Once setup is complete, you can either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click “Import App” to load an existing Dyad project.&lt;/li&gt;
&lt;li&gt;Or, type your idea directly in the “Ask Dyad to build…” box. For example, enter “Build a To-Do List App” or “Build a Recipe Finder App.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose from Starter Templates (Optional)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dyad also provides quick templates such as To-Do List App, Virtual Avatar Builder, Recipe Finder &amp;amp; Meal Planner, AI Image Generator, or 3D Portfolio Viewer.&lt;/li&gt;
&lt;li&gt;Select one to quickly spin up a project and start experimenting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: You can always switch between models (Auto/Pro) based on your needs and API access. Auto uses free/available models, while Pro unlocks premium capabilities.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws0mprnacf3f6n154ffy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws0mprnacf3f6n154ffy.png" alt=" " width="640" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 15: Configure AI Providers in Dyad
&lt;/h3&gt;

&lt;p&gt;To enable Dyad to build and run apps, you need to connect it with one or more AI providers. This allows Dyad to generate code using different models.&lt;/p&gt;

&lt;p&gt;Open Settings → AI → Model Providers&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the left sidebar, click Settings, then select AI &amp;gt; Model Providers.&lt;/li&gt;
&lt;li&gt;You’ll see a list of supported providers: OpenAI, Anthropic, Google (Gemini), OpenRouter, Dyad, and an option to add a custom provider.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose Your Provider&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google (Gemini) – Offers a free tier. Click Setup and follow the link to get your API key. Paste it into the input field in Dyad.&lt;/li&gt;
&lt;li&gt;OpenAI – If you have an API key, click Setup, then paste your key to enable GPT models.&lt;/li&gt;
&lt;li&gt;Anthropic – Enter your Claude API key if you use Anthropic.&lt;/li&gt;
&lt;li&gt;OpenRouter – Supports multiple models with a free tier. Setup is similar — retrieve your key from OpenRouter and paste it.&lt;/li&gt;
&lt;li&gt;Dyad – If you prefer, you can set up Dyad’s native model.&lt;/li&gt;
&lt;li&gt;Custom Provider – Advanced users can connect any LLM endpoint by clicking Add custom provider and entering endpoint details + API key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enable Telemetry (Optional)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Telemetry is enabled by default to anonymously record usage data and improve Dyad. You can toggle it ON or OFF based on your preference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enable Native Git (Optional)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under Experiments, you can enable Native Git for faster version control. This requires installing Git on your system if not already installed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save &amp;amp; Verify&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once you enter API keys, Dyad will validate them.&lt;/li&gt;
&lt;li&gt;If successful, the status will change from “Needs Setup” to Active.&lt;/li&gt;
&lt;li&gt;You’re now ready to start building apps with your chosen AI models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: You can set up multiple providers and switch between them depending on which model you want to use for a project.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggepfqo1m2gteg471uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbggepfqo1m2gteg471uv.png" alt=" " width="640" height="478"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 16: Add a Custom AI Provider
&lt;/h3&gt;

&lt;p&gt;If you want Dyad to use a language model that isn’t listed (e.g., a self-hosted model, private API, or enterprise endpoint), you can configure it as a Custom Provider.&lt;/p&gt;

&lt;p&gt;Click “Add Custom Provider”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the AI Providers section of the Settings menu, select Add Custom Provider.&lt;/li&gt;
&lt;li&gt;A setup form will appear (like in the screenshot).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fill Out Provider Details&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider ID – A unique identifier without spaces (e.g., my-provider).&lt;/li&gt;
&lt;li&gt;Display Name – The friendly name you want to appear in Dyad’s interface (e.g., My Enterprise LLM).&lt;/li&gt;
&lt;li&gt;API Base URL – The root URL of the model’s API (e.g., &lt;a href="https://api.example.com/v1" rel="noopener noreferrer"&gt;https://api.example.com/v1&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Environment Variable (Optional) – If you want Dyad to reference a stored API key, enter its environment variable name here (e.g., MY_PROVIDER_API_KEY).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Authentication&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure the API key or token required by the provider is properly stored in your system’s environment variables (see the example after this list).&lt;/li&gt;
&lt;li&gt;If not using environment variables, Dyad may prompt you to input the key directly when connecting.&lt;/li&gt;
&lt;/ul&gt;
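&lt;p&gt;For example, on macOS/Linux you could export the key before launching Dyad. MY_PROVIDER_API_KEY is just the illustrative variable name from the form above; use whatever name your provider setup expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# set the key for the current shell session (placeholder value)
export MY_PROVIDER_API_KEY="sk-your-key-here"

# optionally persist it so future sessions (and Dyad launched from them) see it
echo 'export MY_PROVIDER_API_KEY="sk-your-key-here"' &amp;gt;&amp;gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;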

&lt;p&gt;Save the Provider&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Once all fields are complete, click Add Provider.&lt;/li&gt;
&lt;li&gt;The provider will appear alongside OpenAI, Anthropic, Google, and others in your Model Providers list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Test the Connection&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After adding, Dyad will validate the provider by making a test API call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tip: This feature is powerful if you’re hosting open-source models locally, using private APIs like vLLM, or experimenting with custom endpoints. It gives you full flexibility without vendor lock-in.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi823bb2o751w92x1ois3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi823bb2o751w92x1ois3.png" alt=" " width="640" height="544"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 17: Connect Dyad with Ollama
&lt;/h3&gt;

&lt;p&gt;Now that you’ve filled out the Add Custom Provider form for Ollama:&lt;/p&gt;

&lt;p&gt;Enter Provider Details&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider ID: ollama&lt;/li&gt;
&lt;li&gt;Display Name: ollama (or any friendly name you prefer).&lt;/li&gt;
&lt;li&gt;API Base URL: &lt;a href="http://localhost:11434/v1" rel="noopener noreferrer"&gt;http://localhost:11434/v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This points Dyad to the local Ollama server that runs on port 11434.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save the Provider&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click Add Provider to save the configuration.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You should now see Ollama listed as an active provider in your Dyad AI Providers panel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run Ollama Locally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure Ollama is running on your machine. Start the Ollama server by opening a terminal and running:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;This ensures Dyad can connect to the Ollama API at localhost:11434 (a quick check follows below).&lt;/li&gt;
&lt;/ul&gt;
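&lt;p&gt;A quick way to confirm that the endpoint Dyad will call is live (assuming Ollama’s default port and its OpenAI-compatible API) is to list the available models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# should return a JSON list that includes gpt-oss:120b once it has been pulled
curl http://localhost:11434/v1/models

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;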

&lt;p&gt;Test the Connection&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Dyad, try generating a simple app idea (e.g., “Build a To-Do List app”).&lt;/li&gt;
&lt;li&gt;If the connection is successful, Dyad will use Ollama to generate the project code.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisocigyv7rpczek2zfw1.png" alt=" " width="640" height="541"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 18: Add Ollama Models in Dyad (and verify)
&lt;/h3&gt;

&lt;p&gt;Now that the Configure ollama panel shows Setup Complete, make the actual models available to Dyad.&lt;/p&gt;

&lt;p&gt;Make sure Ollama is running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register a model in Dyad&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Settings → AI → Model Providers → ollama → Models, click Add Custom Model.&lt;/li&gt;
&lt;li&gt;Fill in:&lt;br&gt;
Model ID: the exact Ollama model name (e.g., llama3:8b).&lt;br&gt;
Display Name: anything friendly (e.g., Llama 3 (8B)).&lt;br&gt;
Context Window: optional (set it if you know it; otherwise leave blank).&lt;br&gt;
Max Output Tokens: optional (e.g., 1024).&lt;/li&gt;
&lt;li&gt;Save. Repeat for any other Ollama models you want exposed.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg76fp6nmghnxxdu5e3c.png" alt=" " width="640" height="544"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oejlt0hvoor06u8n1jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oejlt0hvoor06u8n1jv.png" alt=" " width="640" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 19: Add and Register a Custom Model in Dyad
&lt;/h3&gt;

&lt;p&gt;Fill Out the Model Details&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Model ID: gpt-oss:120b&lt;br&gt;
This must exactly match the model name available in your Ollama installation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Name: gpt-oss (this is the display name that will appear in Dyad).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Description (Optional): You can write something like “Open-source GPT OSS 120B model via Ollama”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Max Output Tokens (Optional): e.g., 4096 (or adjust based on model capability).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context Window (Optional): e.g., 8192.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save the Model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click Add Model.&lt;/li&gt;
&lt;li&gt;The model will now appear under Models in the Ollama provider section.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fga2oe3k7fty0kr0c6e3x.png" alt=" " width="640" height="563"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 20: Build your first Dyad app with gpt-oss (Ollama)
&lt;/h3&gt;

&lt;p&gt;Now that gpt-oss:120b shows up under Models and Ollama is Setup Complete, let’s generate an app.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteq5kkeetzry7vzrz9a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteq5kkeetzry7vzrz9a2.png" alt=" " width="640" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 21: Select ollama → gpt-oss in the Builder and generate
&lt;/h3&gt;

&lt;p&gt;Open the model picker&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the build screen (the bar above “Ask Dyad to build…”), click the Model dropdown.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose the local provider&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to Local models → ollama (or directly ollama in the list).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick your model&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select gpt-oss (the one you registered as gpt-oss:120b).&lt;/li&gt;
&lt;li&gt;Optional: switch Auto → Pro if you want Dyad to always use your chosen model without auto-switching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set generation options (optional)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the small settings/gear near the prompt bar:&lt;/li&gt;
&lt;li&gt;Max output tokens: 2048–4096 (for long code generations).&lt;/li&gt;
&lt;li&gt;Temperature: 0.2–0.5 for reliable code; raise for creativity.&lt;/li&gt;
&lt;li&gt;Context window / system prompt: leave default unless you need custom guardrails.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt Dyad to build&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In “Ask Dyad to build…”, paste a concrete request, e.g.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a Newsletter Creator:
- Tech stack: React + Vite + Tailwind
- Features: editor with markdown preview, save drafts to localStorage, export to HTML/Markdown, simple dark UI, keyboard shortcuts
- Include README with setup &amp;amp; run steps

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Hit Send (paper-plane). Review the plan → Accept.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run and iterate&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When scaffolding completes, click Run (or open a terminal) and follow the start script (usually npm install &amp;amp;&amp;amp; npm run dev; see the example below).&lt;/li&gt;
&lt;li&gt;Iterate with follow-up prompts: “add image upload”, “add tags &amp;amp; search”, “deploy-ready build script”, etc.&lt;/li&gt;
&lt;/ul&gt;
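&lt;p&gt;For a React + Vite scaffold like the one above, the start sequence typically looks like this (the folder name is hypothetical; use the one Dyad actually generated):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd newsletter-creator   # hypothetical project folder created by Dyad
npm install             # install the generated dependencies
npm run dev             # start the Vite dev server (usually http://localhost:5173)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;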

&lt;p&gt;If the model dropdown doesn’t show ollama/gpt-oss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure ollama serve is running and the model exists (ollama list).&lt;/li&gt;
&lt;li&gt;Recheck the base URL &lt;a href="http://localhost:11434/v1" rel="noopener noreferrer"&gt;http://localhost:11434/v1&lt;/a&gt; in Settings → AI → Model Providers → ollama.&lt;/li&gt;
&lt;li&gt;If using a remote VM, use &lt;code&gt;http://&amp;lt;your-vm-ip&amp;gt;:11434/v1&lt;/code&gt; or tunnel via SSH: &lt;code&gt;ssh -L 11434:localhost:11434 user@&amp;lt;your-vm-ip&amp;gt;&lt;/code&gt;.
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvy8daprtkl01tavczuz.png" alt=" " width="640" height="532"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0xrk5kj2v9upzvdr85t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0xrk5kj2v9upzvdr85t.png" alt=" " width="640" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this video, I walk through the entire process of setting up and using Dyad with Ollama as the custom AI provider. Starting from downloading and installing Dyad, I show how to configure Node.js, connect API providers, and register a custom model inside Ollama (gpt-oss:120b). The video captures each step clearly—adding the API base URL, activating Ollama, registering the model in Dyad, and finally selecting it from the model picker. To demonstrate the workflow, I use Dyad’s builder interface to generate a project, including an AI Image Generator app, showing how prompts translate into scaffolded code in real time. By the end, viewers can see a complete pipeline: from local model setup → integration in Dyad → running their first functional AI app without vendor lock-in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/z4kaIEPcIEc" rel="noopener noreferrer"&gt;https://youtu.be/z4kaIEPcIEc&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Dyad makes building AI-powered apps simple, fast, and completely under your control. By combining it with Ollama on a GPU-powered VM, you unlock the ability to run powerful open-source models locally or remotely—without vendor lock-in. Whether you’re a developer, a tinkerer, or someone exploring no-code AI tools, Dyad gives you the flexibility to prototype, build, and scale apps in minutes. With this setup, you now have a private, efficient, and future-proof way to turn your ideas into fully functional apps.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>saas</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>A Step-By-Step Guide to Install Qwen3 30B Locally</title>
      <dc:creator>Aditi Bindal</dc:creator>
      <pubDate>Mon, 11 Aug 2025 13:43:54 +0000</pubDate>
      <link>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-30b-locally-o7j</link>
      <guid>https://dev.to/nodeshiftcloud/a-step-by-step-guide-to-install-qwen3-30b-locally-o7j</guid>
      <description>&lt;p&gt;The Qwen3-30B-A3B-Instruct-2507 is an advanced iteration of the Qwen3 series, marking a significant leap forward in the landscape of causal language models. Boasting an impressive 30.5 billion parameters with 3.3 billion actively engaged, this model excels across a diverse array of capabilities such as instruction following, complex logical reasoning, text comprehension, mathematics, and science. Its robust coding proficiency, demonstrated by high scores in benchmarks such as MultiPL-E and LiveCodeBench, makes it particularly attractive to developers and researchers. The model also excels in multilingual contexts and handles extensive 256K token contexts effortlessly, making it ideal for intricate, lengthy tasks. Furthermore, its refined alignment with user preferences in subjective and open-ended scenarios ensures that interactions feel natural, intuitive, and highly personalised.&lt;/p&gt;

&lt;p&gt;In this article, we guide you step-by-step through installing Qwen3-30B locally or in a GPU-accelerated environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;The minimum system requirements for running this model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPU: 1x H200&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage: 50 GB (preferable)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;VRAM: at least 64 GB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda installed&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-step process to install and run Qwen3-30B
&lt;/h2&gt;

&lt;p&gt;For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting up a NodeShift Account
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://app.nodeshift.com/sign-up" rel="noopener noreferrer"&gt;app.nodeshift.com&lt;/a&gt; and create an account by filling in basic details, or continue signing up with your Google/GitHub account.&lt;/p&gt;

&lt;p&gt;If you already have an account, &lt;a href="http://app.nodeshift.com" rel="noopener noreferrer"&gt;log in&lt;/a&gt; straight to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3p61u5r46mrb6vcsiqr.png" alt="Image-step1-1" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a GPU Node
&lt;/h3&gt;

&lt;p&gt;After accessing your account, you should see a dashboard (see image), now:&lt;/p&gt;

&lt;p&gt;1) Navigate to the menu on the left side.&lt;/p&gt;

&lt;p&gt;2) Click on the &lt;strong&gt;GPU Nodes&lt;/strong&gt; option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokdraa5tkg40fzgkn7fo.png" alt="Image-step2-1" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Click on &lt;strong&gt;Start&lt;/strong&gt; to start creating your very first GPU node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhk9s2i1dfe211zgfev.png" alt="Image-step2-2" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These GPU nodes are GPU-powered virtual machines by NodeShift. These nodes are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Selecting configuration for GPU (model, region, storage)
&lt;/h3&gt;

&lt;p&gt;1) For this tutorial, we’ll be using 1x H200 GPU; however, you can choose any GPU that meets the prerequisites.&lt;/p&gt;

&lt;p&gt;2) Similarly, we’ll opt for 200 GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eg1srvuvt289uey0baa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4eg1srvuvt289uey0baa.png" alt="Image-step3-1" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Choose GPU Configuration and Authentication method
&lt;/h3&gt;

&lt;p&gt;1) After selecting your required configuration options, you’ll see the available GPU nodes in your region and according to (or very close to) your configuration. In our case, we’ll choose a 1x H100 SXM 80GB GPU node with 192vCPUs/80GB RAM/200GB SSD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuuphp4rmseb1dr3c6qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuuphp4rmseb1dr3c6qt.png" alt="Image-step4-1" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Next, you'll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our &lt;a href="https://docs.nodeshift.com/gpus/create-gpu-deployment" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchyrp5ijzlmevkc7puaf.png" alt="Image-step4-2" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Choose an Image
&lt;/h3&gt;

&lt;p&gt;The final step is to choose an image for the VM, which in our case is &lt;strong&gt;Nvidia Cuda&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnm3gwe0tprkoeqnx5x51.png" alt="Image-step5-1" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click &lt;strong&gt;Create&lt;/strong&gt; to deploy the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F647pyrcdxwtp6gz0tieb.png" alt="Image-step5-2" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk810i78g0piq7z2jxu8j.png" alt="Image-step5-3" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Connect to active Compute Node using SSH
&lt;/h3&gt;

&lt;p&gt;1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see a status &lt;strong&gt;Running&lt;/strong&gt; in green, meaning that your Compute node is ready to use!&lt;/p&gt;

&lt;p&gt;2) Once your GPU shows this status, navigate to the three dots on the right, click on &lt;strong&gt;Connect with SSH&lt;/strong&gt;, and copy the SSH details that appear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15qhftsr1k75orubyj15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15qhftsr1k75orubyj15.png" alt="Image-step6-1" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After copying the details, follow the steps below to connect to the running GPU VM via SSH:&lt;/p&gt;

&lt;p&gt;1) Open your terminal, paste the SSH command, and run it.&lt;/p&gt;

&lt;p&gt;2) In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.&lt;/p&gt;

&lt;p&gt;3) A prompt will request a password. Type the SSH password, and you should be connected.&lt;/p&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7307nybljxnshe9dm4p2.png" alt="Image-step6-2" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, if you want to check the GPU details, run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Set up the project environment with dependencies
&lt;/h3&gt;

&lt;p&gt;1) Create a virtual environment using &lt;a href="https://nodeshift.com/blog/set-up-anaconda-on-ubuntu-22-04-in-minutes-simplify-your-ai-workflow" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda create -n qwen python=3.11 -y &amp;amp;&amp;amp; conda activate qwen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hwf3q01mgtlwhxavyue.png" alt="Image-step7-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2) Once you’re inside the environment, install vLLM along with its dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade vllm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l3uska8dmuptv4jyb96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8l3uska8dmuptv4jyb96.png" alt="Image-step7-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Also, open a second terminal, connect to the remote server via SSH, and install Open WebUI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Download and Run the model
&lt;/h3&gt;

&lt;p&gt;1) Download the model with vLLM and host the endpoint on port 8000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --max-model-len 32768 --gpu-memory-utilization 0.95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b7fry6idimqoi1sl2ur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b7fry6idimqoi1sl2ur.png" alt="Image-step8-1" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;
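
&lt;p&gt;Once the server reports that it’s ready, you can sanity-check the endpoint directly on the server before wiring up the frontend. vLLM exposes an OpenAI-compatible API, so a minimal request looks like this (the model name must match the one passed to &lt;code&gt;vllm serve&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;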

&lt;p&gt;2) In the second terminal (connected to the GPU host over SSH), serve the Open WebUI frontend on port 3000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;open-webui serve --port 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhszbpe7ynff5vyqgx36w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhszbpe7ynff5vyqgx36w.png" alt="Image-step8-2" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhk910xjcstcgte8buly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhk910xjcstcgte8buly.png" alt="Image-step8-3" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Forward both ports and tunnel them so you can access them in your local browser.&lt;/p&gt;

&lt;p&gt;If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to set up SSH port forwarding to access both the vLLM and Open WebUI sessions in your local browser.&lt;/p&gt;

&lt;p&gt;Run the following command in your local terminal after replacing:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_PORT&amp;gt;&lt;/code&gt; with the port allotted to your remote server (for a NodeShift server, you can find it in the deployed GPU details on the dashboard).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;PATH_TO_SSH_KEY&amp;gt;&lt;/code&gt; with the path to the location where your SSH key is stored.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;YOUR_SERVER_IP&amp;gt;&lt;/code&gt; with the IP address of your remote server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 3000:localhost:3000 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In another local terminal, forward the port for the vLLM endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 8000:localhost:8000 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
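
&lt;p&gt;Alternatively, since &lt;code&gt;ssh&lt;/code&gt; accepts multiple &lt;code&gt;-L&lt;/code&gt; flags, both forwards can be combined into a single tunnel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh -L 3000:localhost:3000 -L 8000:localhost:8000 -p &amp;lt;YOUR_SERVER_PORT&amp;gt; -i &amp;lt;PATH_TO_SSH_KEY&amp;gt; root@&amp;lt;YOUR_SERVER_IP&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;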



&lt;h3&gt;
  
  
  Step 9: Run the model via Open WebUI Interface
&lt;/h3&gt;

&lt;p&gt;Once the ports are forwarded, you can access the model via the Open WebUI interface and chat with it.&lt;/p&gt;

&lt;p&gt;1) Before running the model, connect Open WebUI to the vLLM API endpoint in the settings. vLLM serves an OpenAI-compatible API, so add &lt;code&gt;http://localhost:8000/v1&lt;/code&gt; as an OpenAI-compatible connection in Open WebUI’s connection settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2aoj18gjzcu8pvrori0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2aoj18gjzcu8pvrori0.png" alt="Image-step9-1" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;
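
&lt;p&gt;If the model doesn’t appear in the connection settings, confirm that the forwarded vLLM endpoint is reachable from your local machine by listing the served models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8000/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;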

&lt;p&gt;2) Select the Qwen3-Next-80B model on the chat page and run a prompt.&lt;/p&gt;

&lt;p&gt;For example, we’re testing the following prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Summarize the following passage in 3 bullet points.
2. Then, extract 3 key insights and explain their implications.
3. Finally, write a Python function that could analyze similar passages for sentiment.

---
Passage:

"The rapid advancement of AI technologies has transformed industries across the globe. In healthcare, AI models are diagnosing diseases earlier and more accurately. In finance, algorithmic trading and risk modeling are becoming more sophisticated. Yet, as AI grows more powerful, ethical questions around bias, privacy, and job displacement remain urgent. Policymakers and technologists must collaborate to create guardrails that ensure innovation benefits society as a whole."
---

Give your response in clearly separated sections.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtl4170rm5axn0mcm3ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtl4170rm5axn0mcm3ek.png" alt="Image-step9-2" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdp9zrtimcx7ageygde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxdp9zrtimcx7ageygde.png" alt="Image-step9-3" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm5ssf4mesaqw432140.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskm5ssf4mesaqw432140.png" alt="Image-step9-4" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Installing Qwen3-Next-80B-A3B locally equips developers and researchers with a cutting-edge language model, renowned for its powerful reasoning, extensive multilingual support, and exceptional handling of long-context tasks. Pairing it with NodeShift GPUs further enhances this experience, providing streamlined deployment, efficient resource management, and scalable infrastructure. Together, these tools empower users to harness advanced AI capabilities effectively, bridging innovation with accessibility and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more information about NodeShift:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nodeshift.com/?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/company/nodeshift/?%0Aref=blog.nodeshift.com" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/nodeshiftai?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;X&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/4dHNxnW7p7?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.daily.dev/nodeshift?ref=blog.nodeshift.com" rel="noopener noreferrer"&gt;daily.dev&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>qwen</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
