I wanted a Telegram bot that lets me chat with Allen AI's open-source language models — OLMo, Tülu, and Molmo 2 — without running any models locally. No GPU, no inference server, just a lightweight Python bot that talks to Allen AI's free public playground API.
The result is OLMo Bot, and it ended up with more capabilities than I initially planned: multi-model switching, web search, vision, and even visual object pointing with annotated image overlays.
Connecting to Allen AI
Allen AI runs a public playground with their latest models. There's no official API, but I built Web2API — a tool that turns websites into REST APIs — and created a recipe for it. The bot doesn't scrape anything itself; it just calls Web2API endpoints:
```python
async def query_model(model, prompt, history=None, file_path=None):
    endpoint = MODELS.get(model)  # e.g. "/allenai/olmo-32b"
    url = f"{WEB2API_URL}{endpoint}"
    # full_prompt is the user prompt, with conversation history
    # prepended when memory is enabled (see below)
    params = {"q": full_prompt}
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.get(url, params=params)
    items = resp.json().get("items", [])
    return items[0]["fields"]["response"]
```
The Allen AI recipe in Web2API uses a custom scraper that handles their streaming NDJSON chat API directly — no browser automation needed for this one.
Model Switching
The bot supports five text models and two vision models, switchable per user with simple commands:
| Command | Model |
|---|---|
| /olmo32b | OLMo 3.1 32B Instruct (default) |
| /think | OLMo 3.1 32B Think (reasoning) |
| /olmo7b | OLMo 3 7B Instruct |
| /tulu8b | Tülu 3 8B |
| /tulu70b | Tülu 3 70B |
| /molmo2 | Molmo 2 8B (vision) |
| /molmo2track | Molmo 2 8B Tracking |
Each user's model choice is stored in memory. Send /think, and all your subsequent messages go to the reasoning model until you switch again.
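The per-user selection can be sketched as a simple dict keyed by Telegram user ID (names here are illustrative, not the bot's actual internals):

```python
# Minimal sketch of per-user model selection.
# USER_MODELS / DEFAULT_MODEL are hypothetical names for illustration.
USER_MODELS: dict[int, str] = {}
DEFAULT_MODEL = "olmo32b"

def set_model(user_id: int, model: str) -> None:
    """Called by the /olmo32b, /think, ... command handlers."""
    USER_MODELS[user_id] = model

def get_model(user_id: int) -> str:
    """Every incoming message routes to the user's chosen model."""
    return USER_MODELS.get(user_id, DEFAULT_MODEL)
```

Since the mapping lives in memory, a bot restart resets everyone to the default model.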
The Think model is particularly interesting — it's Allen AI's chain-of-thought model that shows its reasoning process, similar to what you'd get from o1 or DeepSeek R1, but fully open-source.
Conversation Memory
Memory is off by default (stateless, each message is independent) but can be toggled with /memory:
```python
if mem_on:
    # Build context from history
    parts = []
    for msg in history:
        role = msg["role"]
        parts.append(f"{'User' if role == 'user' else 'Assistant'}: {msg['text']}")
    parts.append(f"User: {prompt}")
    full_prompt = "\n\n".join(parts)
```
When enabled, the bot maintains up to 20 turns of conversation per user. The full history is prepended to each prompt so the model has context. /clear wipes it.
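The bookkeeping around that cap could look like this (a sketch under the assumption that one "turn" is a user/assistant pair; the helper names are hypothetical):

```python
from collections import defaultdict

MAX_TURNS = 20  # per-user cap mentioned above; one turn = user + assistant message
HISTORY: dict[int, list[dict]] = defaultdict(list)

def remember(user_id: int, role: str, text: str) -> None:
    """Append a message and drop everything older than MAX_TURNS turns."""
    HISTORY[user_id].append({"role": role, "text": text})
    del HISTORY[user_id][:-MAX_TURNS * 2]  # no-op while under the cap

def clear(user_id: int) -> None:
    """Backs the /clear command."""
    HISTORY.pop(user_id, None)
```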
Web Search via Tool Calling
This is where Web2API's MCP bridge comes in. Allen AI's models support tool calling — you pass a tools_url parameter pointing to a tool endpoint, and the model can decide to call those tools during generation.
I configured the bot to always pass the Brave Search tool:
```python
# config.py
DEFAULT_TOOLS_URL = os.getenv(
    "OLMO_TOOLS_URL",
    "http://127.0.0.1:8000/mcp/only/brave-search",
)
```

```python
# bot.py — included in every text model request
params = {"q": full_prompt}
if DEFAULT_TOOLS_URL and model not in VISION_MODELS:
    params["tools_url"] = DEFAULT_TOOLS_URL
```
The flow works like this:
- User asks "What's the weather in Berlin?"
- Bot sends the prompt to Web2API with tools_url pointing to the Brave Search bridge
- Web2API's Allen AI scraper passes the tool definition to the model
- OLMo decides it needs current data and calls web_search
- The scraper executes the search via the MCP bridge and feeds the results back to the model
- OLMo generates a response incorporating the search results
- Bot sends the answer to the user
The model decides autonomously whether to search — if you ask "What is 2+2?", it just answers directly. If you ask about current events, it searches. All of this happens inside Web2API's Docker container.
One detail worth mentioning: the tools_url points to http://127.0.0.1:8000 (container-internal port), not the external 8010. Since the Allen AI scraper runs inside the same Docker container as the MCP bridge, it can reach it on localhost without going through nginx.
Vision models skip the tools parameter — Molmo 2 doesn't need web search.
Vision: Image and Video Analysis
Send a photo or video to the bot with a caption, and it analyzes it using Molmo 2:
# Auto-switch to molmo2 if current model doesn't support vision
if model not in VISION_MODELS:
model = "molmo2"
The bot downloads the file from Telegram, sends it as a multipart POST to Web2API, and returns the model's analysis. If no caption is provided, it defaults to "Describe this image in detail."
The auto-switch is key for usability — you don't have to manually switch to Molmo 2 before sending a photo. Send an image on any model, and the bot temporarily uses Molmo 2 for that message, then stays on your selected text model for the next.
Point Overlay: "Show Me Where"
This was the feature I didn't plan but couldn't resist building. Molmo 2 has a pointing capability — ask it to point at objects, and it returns coordinates in a normalized 0–1000 coordinate space:
```
User: "Point to the eyes" (with photo attached)
Molmo 2: <points coords="1 1 421 430 2 633 352">eyes</points>
```
The response format encodes multiple points: the first point has a two-number prefix plus x,y coordinates, subsequent points have an index plus x,y. All values are in a 0–1000 space relative to image dimensions.
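Decoding that layout takes only a few lines. A sketch of such a parser, assuming the reverse-engineered format described above (first point: two-number prefix plus x,y; later points: index plus x,y):

```python
import re

def parse_points(text: str, width: int, height: int):
    """Parse Molmo 2 <points> markup into (label, pixel coordinates).
    Coordinate layout is the reverse-engineered format described above."""
    m = re.search(r'<points coords="([^"]+)">([^<]*)</points>', text)
    if not m:
        return None, []
    nums = [int(n) for n in m.group(1).split()]
    label = m.group(2)
    norm = []
    if len(nums) >= 4:
        norm.append((nums[2], nums[3]))       # first point: skip 2-number prefix
        rest = nums[4:]
        for j in range(0, len(rest) - 2, 3):  # later points: index, x, y
            norm.append((rest[j + 1], rest[j + 2]))
    # Scale the 0-1000 space to actual pixel coordinates
    return label, [(round(x * width / 1000), round(y * height / 1000)) for x, y in norm]
```

For the example response above on a 1000×1000 image, this yields the label "eyes" and the two pixel points (421, 430) and (633, 352).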
The bot parses these coordinates and draws colored markers on the original image using Pillow:
```python
def _make_marker(color, radius, label, *, scale=4):
    """Render an anti-aliased marker via 4× supersampling."""
    sr = radius * scale   # supersampled radius
    size = sr * 2         # supersampled canvas size
    cx = cy = sr          # marker center
    marker = Image.new("RGBA", (size, size), (0, 0, 0, 0))
    draw = ImageDraw.Draw(marker)
    # White border ring
    draw.ellipse([...], fill=(255, 255, 255, 240))
    # Colored circle
    draw.ellipse([...], fill=(*color, 230))
    # Centered number label
    draw.text((cx, cy), label, fill=(255, 255, 255), font=font, anchor="mm")
    # Downscale for smooth anti-aliasing
    final_size = radius * 2
    return marker.resize((final_size, final_size), Image.LANCZOS)
```
The markers are rendered at 4× resolution and downscaled with LANCZOS filtering for smooth, anti-aliased edges — no jagged circles or pixel artifacts. Each point gets a distinct color (red, blue, green, orange...) with a white border and a numbered label.
The bot sends the annotated image back as a photo with a caption like "📍 eyes (2 points)". Prompts that trigger pointing include variations of "Point to...", "Find the...", "Where is the...", and "Locate the...".
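That trigger detection can be approximated with a case-insensitive regex. The phrase list here is illustrative, taken from the variations just mentioned; the bot's actual set may differ:

```python
import re

# Illustrative trigger phrases; the bot's real list may be longer.
POINT_TRIGGERS = re.compile(
    r"\b(point to|find the|where is the|locate the)\b",
    re.IGNORECASE,
)

def wants_pointing(caption: str) -> bool:
    """Heuristic: does this caption ask Molmo 2 to point at something?"""
    return bool(POINT_TRIGGERS.search(caption or ""))
```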
Setup
The bot is a single bot.py file plus a config and the pointing module. Dependencies are minimal: python-telegram-bot, httpx, and Pillow.
```bash
git clone https://github.com/Endogen/olmo-bot.git
cd olmo-bot
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set OLMO_BOT_TOKEN
python bot.py
```
It requires a running Web2API instance with the allenai recipe (and optionally brave-search for web search). Access can be restricted to specific Telegram user IDs via the OLMO_ALLOWED_USERS env var.
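The access check is straightforward to sketch, assuming the variable holds a comma-separated list of Telegram user IDs (the helper names are hypothetical):

```python
import os

def allowed_user_ids() -> set[int]:
    """Parse OLMO_ALLOWED_USERS, a comma-separated list of Telegram IDs.
    Empty or unset is assumed to mean the bot is open to everyone."""
    raw = os.getenv("OLMO_ALLOWED_USERS", "")
    return {int(x) for x in raw.split(",") if x.strip()}

def is_allowed(user_id: int) -> bool:
    ids = allowed_user_ids()
    return not ids or user_id in ids
```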
What's Next
The main limitation is Allen AI's native tool calling — while the model acknowledges tools and can call them, it doesn't always do so proactively. A bot-side tool loop (parsing tool-call JSON from the model output and executing tools locally) would make this more reliable.
The pointing coordinate format from Molmo 2 also isn't officially documented — I reverse-engineered it from testing. It works reliably, but the format could change.