AutoJanitor
Self-Hosting a Vision Model on a Datacenter GPU: BAGEL-7B-MoT on a Tesla V100

I have an AI character named Sophia who lives inside a Godot game. She talks, she listens, she plays music, she controls the smart lights. And now she can see.

Not "process an image if you upload one" see. Real-time webcam-capture, face-detection, emotion-reading see. She looks through the camera, describes what she sees, reads your mood, and responds accordingly.

The vision model powering all of this is BAGEL-7B-MoT running on a Tesla V100 16GB GPU. Getting it there was not straightforward.

Why We Ditched LLaVA

We were running LLaVA 1.6 (7B) via Ollama for months. It worked, but it had problems:

  • Slow -- 8-15 seconds for a basic description on a V100
  • Hallucination-heavy -- it would confidently describe objects that weren't there
  • No generation capability -- LLaVA is understand-only. No image editing, no generation
  • Stale architecture -- the LLaVA project hasn't seen meaningful updates

BAGEL-7B-MoT (Mixture of Transformers) from ByteDance Research offered everything we needed: image understanding, image generation, and image editing in a single model. The MoT architecture routes different modalities through specialized transformer blocks instead of forcing everything through the same weights. Understanding is sharper. Descriptions are more grounded. And it fits in the same VRAM footprint.

The switch was a drop-in replacement at the API level -- BAGEL serves an Ollama-compatible /api/generate endpoint, so every HTTP call in our codebase stayed identical. Only the URL and model name changed.
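To illustrate the drop-in claim, here is a sketch of the client side (the helper name and the port numbers are our own, not part of BAGEL or Ollama): the request body is identical in both cases, so only the base URL and model string change.

```python
# Hypothetical helper: builds the same Ollama-style /api/generate payload
# we already sent to LLaVA. Swapping backends only changes the arguments.
def build_generate_payload(model, prompt, images_b64=None,
                           temperature=0.3, num_predict=150):
    return {
        "model": model,
        "prompt": prompt,
        "images": images_b64 or [],
        "stream": False,
        "options": {"temperature": temperature, "num_predict": num_predict},
    }

# Before: requests.post("http://192.168.0.160:11434/api/generate",
#                       json=build_generate_payload("llava:v1.6", prompt, imgs))
# After:  requests.post("http://192.168.0.160:8095/api/generate",
#                       json=build_generate_payload("bagel-7b-mot", prompt, imgs))
```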

The V100 Compatibility Nightmare

Here is where it gets ugly. BAGEL was built for A100s and H100s. The Tesla V100, despite being an absolute workhorse with 16GB of HBM2 at 900 GB/s bandwidth, has two fatal gaps:

1. No bfloat16 Support

The V100 (compute capability 7.0) does not support bfloat16. At all. The tensor cores do FP16 and INT8. BAGEL's default weights are bfloat16 everywhere -- attention projections, MLP layers, layer norms, the works.

If you just load the model naively, PyTorch will either crash or silently fall back to FP32 emulation that eats double the VRAM and runs at half speed.

The fix: force float16 at every level.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,   # NOT bfloat16
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "BAGEL-7B-MoT",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,              # NOT bfloat16
    device_map="auto",
)
```

Every single instance of bfloat16 in the model code, the config, the processing pipeline -- all of it has to become float16. Miss one and you get cryptic CUDA errors about unsupported dtypes that point to line numbers inside compiled PyTorch extensions.
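A defensive sweep after loading helps catch stragglers. This is a sketch (our own helper, not part of BAGEL or transformers) that downcasts any bfloat16 tensors left behind:

```python
import torch
import torch.nn as nn

def force_float16(module: nn.Module) -> nn.Module:
    """Downcast any remaining bfloat16 parameters or buffers to float16
    in place, so nothing bfloat16 ever reaches the V100 kernels."""
    for p in module.parameters():
        if p.dtype == torch.bfloat16:
            p.data = p.data.to(torch.float16)
    for b in module.buffers():
        if b.dtype == torch.bfloat16:
            b.data = b.data.to(torch.float16)
    return module
```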

2. No Flash Attention

Flash Attention 2 requires compute capability 8.0+. The V100 is 7.0. The BAGEL codebase calls flash_attn directly in several places.

The fix: replace every flash attention call with PyTorch's built-in scaled dot-product attention (SDPA):

```python
# Instead of:
# from flash_attn import flash_attn_func
# attn_output = flash_attn_func(q, k, v, causal=True)

# Use inline SDPA:
attn_output = torch.nn.functional.scaled_dot_product_attention(
    q, k, v,
    is_causal=True,
    attn_mask=None,
)
```

PyTorch's SDPA automatically selects the best available backend -- on V100 it uses the "math" fallback which is slower than flash attention but still plenty fast for 7B inference. On our hardware, it adds maybe 200ms per inference compared to what an A100 would do with flash attention. Acceptable.
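A quick CPU sanity check (arbitrary small shapes) confirms that SDPA computes the same thing as a hand-rolled causal attention, which is why the swap is safe:

```python
import math
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- small shapes, just for the check
q, k, v = (torch.randn(1, 2, 5, 8) for _ in range(3))

sdpa_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Hand-rolled reference: scaled scores, causal mask, softmax, weighted sum
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
causal_mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
ref_out = scores.softmax(dim=-1) @ v
```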

3. No torch.compile

We also had to disable torch.compile(). On V100 with CUDA 11.x, the Triton compiler that backs torch.compile often generates invalid PTX for older architectures. Every torch.compile decoration gets commented out or gated behind a compute capability check.
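The capability gate can be a one-liner; here is a sketch (the helper name is ours) that only compiles on Ampere or newer and passes the model through untouched on Volta or CPU-only machines:

```python
import torch

def maybe_compile(model):
    """Apply torch.compile only on compute capability 8.0+ (Ampere and
    newer). On Volta (7.0) the Triton backend can emit invalid PTX, so
    the model is returned unchanged there."""
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
        return torch.compile(model)
    return model
```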

NF4 Quantization: Fitting 7B in 9GB

BAGEL-7B-MoT in float16 would eat about 14GB of VRAM. That leaves only 2GB for KV cache, activations, and the image encoder. Not enough.

NF4 (Normal Float 4-bit) quantization via bitsandbytes brings the model weight footprint down to roughly 4.2GB. With the image encoder, KV cache, and runtime overhead, total VRAM usage lands at about 9GB. That leaves 7GB of headroom on the V100 -- enough to process high-resolution images without OOM.

The double_quant=True flag adds a second round of quantization to the quantization constants themselves. It saves about 0.4GB extra with negligible quality loss. On a 16GB card, that matters.
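A back-of-envelope estimate (decimal GB; the absmax-overhead figure is an approximation of how bitsandbytes stores one constant per 64-weight block) lines up with these numbers:

```python
PARAMS = 7e9  # 7B parameters

fp16_gb = PARAMS * 2 / 1e9           # 2 bytes/param -> ~14 GB
# NF4: 4-bit weights plus roughly one fp32 absmax constant per 64-weight block
nf4_bytes_per_param = 0.5 + 4 / 64   # ~0.5625 bytes/param
nf4_gb = PARAMS * nf4_bytes_per_param / 1e9   # ~3.9 GB

# Double quantization compresses the absmax constants themselves, clawing
# back a few hundred MB; embeddings and norms stay in fp16, which is
# roughly how the measured ~4.2 GB weight footprint comes about.
print(f"fp16: {fp16_gb:.1f} GB, nf4: {nf4_gb:.1f} GB")
```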

Key point: NF4 preserves the model's ability to understand images remarkably well. We tested the same 50 images through both float16 and NF4, and the descriptions were nearly identical. The only noticeable degradation is in very fine-grained spatial reasoning ("the book is to the left of the lamp" type queries), which we don't need for our use case.

The Flask API Wrapper

The actual API server is surprisingly simple. We wrap BAGEL in a Flask app that serves an Ollama-compatible endpoint, so existing code that talked to LLaVA via Ollama doesn't need to change:

```python
from flask import Flask, request, jsonify
import torch
import base64
from PIL import Image
from io import BytesIO

app = Flask(__name__)

# Model loaded at startup (see quantization config above)
model = None
processor = None

@app.route("/api/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt", "Describe this image.")
    images_b64 = data.get("images", [])
    options = data.get("options", {})

    temperature = options.get("temperature", 0.3)
    max_tokens = options.get("num_predict", 150)

    # Decode base64 images
    pil_images = []
    for img_b64 in images_b64:
        img_bytes = base64.b64decode(img_b64)
        pil_images.append(Image.open(BytesIO(img_bytes)).convert("RGB"))

    # Build inputs
    inputs = processor(
        text=prompt,
        images=pil_images if pil_images else None,
        return_tensors="pt",
    ).to("cuda", dtype=torch.float16)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
        )

    response_text = processor.decode(
        outputs[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

    return jsonify({
        "model": "bagel-7b-mot",
        "response": response_text.strip(),
        "done": True,
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8095)
```

This runs as a systemd service (bagel-api.service) on the NAS machine at 192.168.0.160. The GPU is explicitly assigned:

```ini
[Service]
Environment="CUDA_VISIBLE_DEVICES=1"
ExecStart=/home/sophia/models/venv/bin/python3 /home/sophia/models/bagel_api.py
```

GPU 0 runs Ollama (for text-only LLMs). GPU 1 runs BAGEL. They never fight over VRAM.

Wiring It Into a Godot Game

This is where it gets fun. Sophia lives in a Godot 4.3 game -- a Victorian-style study with bookshelves, a fireplace, and an AI character you talk to via voice. The vision module lets her see through your actual webcam.

The client code (sophia_vision.py) orchestrates a multi-stage pipeline:

```python
def webcam_vision_report(include_emotion=True):
    """Full webcam vision pipeline:
       capture -> face detect -> BAGEL describe -> emotion -> objects."""
    face_crop = None
    emotion = None

    # 1. Capture frame from webcam via OpenCV
    frame = capture_webcam_frame()

    # 2. Fast face detection with Haar cascades (<100ms)
    faces = detect_faces_opencv(frame)

    # 3. Crop the largest face with padding
    if faces:
        largest = max(faces, key=lambda f: f["w"] * f["h"])
        face_crop = crop_face(frame, largest)

    # 4. Send full frame to BAGEL for person description
    person_desc = bagel_describe_person(WEBCAM_FRAME_PATH)

    # 5. Send face crop to BAGEL for emotion reading
    if face_crop is not None and include_emotion:
        emotion = bagel_read_emotion(WEBCAM_FACE_PATH)

    # 6. Optional: Hailo-8L YOLOv8n for object detection
    hailo_result = hailo_detect(source="local", image_path=WEBCAM_FRAME_PATH)

    # 7. Assemble everything into a vision report for the LLM context
    return {"description": person_desc, "emotion": emotion, "objects": hailo_result}
```

When you say "look at me" or "how do I look," Sophia:

  1. Grabs a 1280x720 frame from /dev/video0
  2. Runs OpenCV Haar cascade face detection (under 100ms)
  3. Crops the largest face with 40% padding for expression context
  4. Sends the full frame to BAGEL with a detailed prompt asking for appearance, clothing, expression, emotion, body language, and environment
  5. Sends the face crop to BAGEL with a focused emotion-analysis prompt
  6. Optionally runs Hailo-8L YOLOv8n for fast object detection
  7. Assembles everything into a vision report that gets injected into the LLM context
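Step 7 is just string building. A sketch of the assembly (the field names and report format are our own illustration, not a fixed schema):

```python
def assemble_vision_report(person_desc, emotion=None, objects=None):
    """Combine the pipeline outputs into one text block for injection
    into the LLM context.

    person_desc: BAGEL's full-frame description (step 4)
    emotion:     BAGEL's face-crop emotion read (step 5), may be None
    objects:     YOLO object labels from the Hailo-8L (step 6), may be None
    """
    lines = ["[VISION REPORT]", f"Scene: {person_desc}"]
    if emotion:
        lines.append(f"Emotion: {emotion}")
    if objects:
        lines.append("Objects: " + ", ".join(objects))
    return "\n".join(lines)
```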

The BAGEL calls use targeted prompts that produce structured, useful output:

```python
prompt = (
    "Analyze this person's facial expression and emotional state. "
    "Consider: eye openness, mouth shape, eyebrow position, "
    "forehead tension, jaw clenching, eye contact direction. "
    "Give the PRIMARY emotion and a BRIEF explanation. "
    "One sentence. Example: 'Relaxed - soft eyes, slight smile, loose jaw.'"
)
```

Low temperature (0.2-0.3) keeps the descriptions factual. Higher values make BAGEL creative, which is the opposite of what you want for a vision report.

Performance Numbers

On our Tesla V100 16GB with NF4 quantization:

| Task | Time | Token count |
| --- | --- | --- |
| Short description (2-3 sentences) | ~2 seconds | ~50 tokens |
| Detailed person analysis | ~8 seconds | ~150 tokens |
| Full emotion + description | ~13 seconds | ~250 tokens |
| Scene description (security cam) | ~5 seconds | ~100 tokens |

For comparison, LLaVA 1.6 7B via Ollama on the same hardware:

| Task | Time |
| --- | --- |
| Short description | ~6 seconds |
| Detailed analysis | ~15 seconds |

BAGEL is 2-3x faster for short responses and produces noticeably better descriptions. The MoT architecture pays off -- routing image tokens through specialized vision transformer blocks instead of the generic language blocks means less wasted computation.

Running It Yourself

If you have a V100 (or any pre-Ampere GPU), here's the minimum viable setup:

```bash
pip install torch transformers bitsandbytes accelerate flask pillow

# Download the model (about 14GB)
git lfs install
git clone https://huggingface.co/ByteDance-Research/BAGEL-7B-MoT

# Set CUDA device
export CUDA_VISIBLE_DEVICES=0

# Run the API
python3 bagel_api.py
```

Test it:

```bash
# Encode an image
IMG_B64=$(base64 -w0 test_photo.jpg)

# Query
curl -X POST http://localhost:8095/api/generate \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"bagel-7b-mot\",
    \"prompt\": \"Describe what you see in this image.\",
    \"images\": [\"$IMG_B64\"],
    \"stream\": false,
    \"options\": {\"temperature\": 0.3, \"num_predict\": 150}
  }"
```

The three critical compatibility fixes for V100:

  1. bnb_4bit_compute_dtype=torch.float16 (not bfloat16)
  2. Replace flash_attn with torch.nn.functional.scaled_dot_product_attention
  3. Remove all torch.compile() calls

If you're on an A100 or newer, you can skip all three and just load normally.

The Bigger Picture

This vision model is one piece of a larger system we're building at Elyan Labs -- an ecosystem where AI agents have real capabilities, not just chat interfaces. Sophia can see, hear, speak, browse the web, control smart home devices, play music, and interact with other agents via the Beacon Protocol.

Her videos live on BoTTube, a platform built specifically for AI creators. The whole infrastructure runs on vintage and datacenter hardware -- including an IBM POWER8 server with 768GB of RAM and a blockchain that rewards vintage hardware for participating in consensus.

The agent internet is bigger than you think. Vision is just one more sense.



Built by Elyan Labs in Louisiana.

All code shown here runs in production on hardware we bought from pawn shops and eBay datacenter pulls. Total GPU fleet: 18 cards, 228GB VRAM, acquired for about $12K against $50K+ retail value. You don't need cloud credits to build real AI infrastructure.
