I wanted a Telegram bot that lets me chat with Allen AI's open-source language models — OLMo, Tülu, and Molmo 2 — without running any models locally. No GPU, no inference server, just a lightweight Python bot that talks to Allen AI's free public playground API.
The result is OLMo Bot, and it ended up with more capabilities than I initially planned: multi-model switching, web search, vision, and even visual object pointing with annotated image overlays.
Connecting to Allen AI
Allen AI runs a public playground with their latest models. There's no official API, but I built Web2API — a tool that turns websites into REST APIs — and created a recipe for it. The bot doesn't scrape anything itself; it just calls Web2API endpoints:
```python
async def query_model(model, prompt, history=None, file_path=None):
    endpoint = MODELS.get(model)  # e.g. "/allenai/olmo-32b"
    url = f"{WEB2API_URL}{endpoint}"
    # full_prompt is the user prompt, with conversation history
    # prepended when memory is enabled (see below)
    params = {"q": full_prompt}
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.get(url, params=params)
    items = resp.json().get("items", [])
    return items[0]["fields"]["response"]
```
The Allen AI recipe in Web2API uses a custom scraper that handles their streaming NDJSON chat API directly — no browser automation needed for this one.
Model Switching
The bot supports five text models and two vision models, switchable per user with simple commands:
| Command | Model |
|---|---|
| /olmo32b | OLMo 3.1 32B Instruct (default) |
| /think | OLMo 3.1 32B Think (reasoning) |
| /olmo7b | OLMo 3 7B Instruct |
| /tulu8b | Tülu 3 8B |
| /tulu70b | Tülu 3 70B |
| /molmo2 | Molmo 2 8B (vision) |
| /molmo2track | Molmo 2 8B Tracking |
Each user's model choice is stored in memory. Send /think, and all your subsequent messages go to the reasoning model until you switch again.
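The per-user selection can be sketched as a simple dict keyed by Telegram user ID (names here are illustrative, not the bot's actual internals):

```python
# Minimal sketch of per-user model selection.
# USER_MODELS / DEFAULT_MODEL are hypothetical names for illustration.
USER_MODELS: dict[int, str] = {}
DEFAULT_MODEL = "olmo32b"

def set_model(user_id: int, model: str) -> None:
    """Called by the /olmo32b, /think, ... command handlers."""
    USER_MODELS[user_id] = model

def get_model(user_id: int) -> str:
    """Every incoming message routes to the user's chosen model."""
    return USER_MODELS.get(user_id, DEFAULT_MODEL)
```

Since the mapping lives in memory, a bot restart resets everyone to the default model.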
The Think model is particularly interesting — it's Allen AI's chain-of-thought model that shows its reasoning process, similar to what you'd get from o1 or DeepSeek R1, but fully open-source.
Conversation Memory
Memory is off by default (stateless, each message is independent) but can be toggled with /memory:
```python
if mem_on:
    # Build context from history
    parts = []
    for msg in history:
        role = msg["role"]
        parts.append(f"{'User' if role == 'user' else 'Assistant'}: {msg['text']}")
    parts.append(f"User: {prompt}")
    full_prompt = "\n\n".join(parts)
```
When enabled, the bot maintains up to 20 turns of conversation per user. The full history is prepended to each prompt so the model has context. /clear wipes it.
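The bookkeeping around that cap could look like this (a sketch under the assumption that one "turn" is a user/assistant pair; the helper names are hypothetical):

```python
from collections import defaultdict

MAX_TURNS = 20  # per-user cap mentioned above; one turn = user + assistant message
HISTORY: dict[int, list[dict]] = defaultdict(list)

def remember(user_id: int, role: str, text: str) -> None:
    """Append a message and drop everything older than MAX_TURNS turns."""
    HISTORY[user_id].append({"role": role, "text": text})
    del HISTORY[user_id][:-MAX_TURNS * 2]  # no-op while under the cap

def clear(user_id: int) -> None:
    """Backs the /clear command."""
    HISTORY.pop(user_id, None)
```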
Web Search via Tool Calling
This is where Web2API's MCP bridge comes in. Allen AI's models support tool calling — you pass a tools_url parameter pointing to a tool endpoint, and the model can decide to call those tools during generation.
I configured the bot to always pass the Brave Search tool:
```python
# config.py
DEFAULT_TOOLS_URL = os.getenv(
    "OLMO_TOOLS_URL",
    "http://127.0.0.1:8000/mcp/only/brave-search",
)
```

```python
# bot.py — included in every text model request
params = {"q": full_prompt}
if DEFAULT_TOOLS_URL and model not in VISION_MODELS:
    params["tools_url"] = DEFAULT_TOOLS_URL
```
The flow works like this:
- User asks "What's the weather in Berlin?"
- Bot sends the prompt to Web2API with tools_url pointing to the Brave Search bridge
- Web2API's Allen AI scraper passes the tool definition to the model
- OLMo decides it needs current data and calls web_search
- The scraper executes the search via the MCP bridge and feeds the results back to the model
- OLMo generates a response incorporating the search results
- Bot sends the answer to the user
The model decides autonomously whether to search — if you ask "What is 2+2?", it just answers directly. If you ask about current events, it searches. All of this happens inside Web2API's Docker container.
One detail worth mentioning: the tools_url points to http://127.0.0.1:8000 (container-internal port), not the external 8010. Since the Allen AI scraper runs inside the same Docker container as the MCP bridge, it can reach it on localhost without going through nginx.
Vision models skip the tools parameter — Molmo 2 doesn't need web search.
Vision: Image and Video Analysis
Send a photo or video to the bot with a caption, and it analyzes it using Molmo 2:
# Auto-switch to molmo2 if current model doesn't support vision
if model not in VISION_MODELS:
model = "molmo2"
The bot downloads the file from Telegram, sends it as a multipart POST to Web2API, and returns the model's analysis. If no caption is provided, it defaults to "Describe this image in detail."
The auto-switch is key for usability — you don't have to manually switch to Molmo 2 before sending a photo. Send an image on any model, and the bot temporarily uses Molmo 2 for that message, then stays on your selected text model for the next.
Point Overlay: "Show Me Where"
This was the feature I didn't plan but couldn't resist building. Molmo 2 has a pointing capability — ask it to point at objects, and it returns coordinates in a normalized 0–1000 coordinate space:
```
User: "Point to the eyes" (with photo attached)
Molmo 2: <points coords="1 1 421 430 2 633 352">eyes</points>
```
The response format encodes multiple points: the first point has a two-number prefix plus x,y coordinates, subsequent points have an index plus x,y. All values are in a 0–1000 space relative to image dimensions.
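Decoding that layout takes only a few lines. A sketch of such a parser, assuming the reverse-engineered format described above (first point: two-number prefix plus x,y; later points: index plus x,y):

```python
import re

def parse_points(text: str, width: int, height: int):
    """Parse Molmo 2 <points> markup into (label, pixel coordinates).
    Coordinate layout is the reverse-engineered format described above."""
    m = re.search(r'<points coords="([^"]+)">([^<]*)</points>', text)
    if not m:
        return None, []
    nums = [int(n) for n in m.group(1).split()]
    label = m.group(2)
    norm = []
    if len(nums) >= 4:
        norm.append((nums[2], nums[3]))       # first point: skip 2-number prefix
        rest = nums[4:]
        for j in range(0, len(rest) - 2, 3):  # later points: index, x, y
            norm.append((rest[j + 1], rest[j + 2]))
    # Scale the 0-1000 space to actual pixel coordinates
    return label, [(round(x * width / 1000), round(y * height / 1000)) for x, y in norm]
```

For the example response above on a 1000×1000 image, this yields the label "eyes" and the two pixel points (421, 430) and (633, 352).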
The bot parses these coordinates and draws colored markers on the original image using Pillow:
```python
def _make_marker(color, radius, label, *, scale=4):
    """Render an anti-aliased marker via 4× supersampling."""
    sr = radius * scale   # supersampled radius
    size = sr * 2         # supersampled canvas size
    cx = cy = sr          # marker center
    marker = Image.new("RGBA", (size, size), (0, 0, 0, 0))
    draw = ImageDraw.Draw(marker)
    # White border ring
    draw.ellipse([...], fill=(255, 255, 255, 240))
    # Colored circle
    draw.ellipse([...], fill=(*color, 230))
    # Centered number label
    draw.text((cx, cy), label, fill=(255, 255, 255), font=font, anchor="mm")
    # Downscale for smooth anti-aliasing
    final_size = radius * 2
    return marker.resize((final_size, final_size), Image.LANCZOS)
```
The markers are rendered at 4× resolution and downscaled with LANCZOS filtering for smooth, anti-aliased edges — no jagged circles or pixel artifacts. Each point gets a distinct color (red, blue, green, orange...) with a white border and a numbered label.
The bot sends the annotated image back as a photo with a caption like "📍 eyes (2 points)". Prompts that trigger pointing include variations of "Point to...", "Find the...", "Where is the...", and "Locate the...".
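That trigger detection can be approximated with a case-insensitive regex. The phrase list here is illustrative, taken from the variations just mentioned; the bot's actual set may differ:

```python
import re

# Illustrative trigger phrases; the bot's real list may be longer.
POINT_TRIGGERS = re.compile(
    r"\b(point to|find the|where is the|locate the)\b",
    re.IGNORECASE,
)

def wants_pointing(caption: str) -> bool:
    """Heuristic: does this caption ask Molmo 2 to point at something?"""
    return bool(POINT_TRIGGERS.search(caption or ""))
```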
Setup
The bot is a single bot.py file plus a config and the pointing module. Dependencies are minimal: python-telegram-bot, httpx, and Pillow.
```bash
git clone https://github.com/Endogen/olmo-bot.git
cd olmo-bot
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # set OLMO_BOT_TOKEN
python bot.py
```
It requires a running Web2API instance with the allenai recipe (and optionally brave-search for web search). Access can be restricted to specific Telegram user IDs via the OLMO_ALLOWED_USERS env var.
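The access check is straightforward to sketch, assuming the variable holds a comma-separated list of Telegram user IDs (the helper names are hypothetical):

```python
import os

def allowed_user_ids() -> set[int]:
    """Parse OLMO_ALLOWED_USERS, a comma-separated list of Telegram IDs.
    Empty or unset is assumed to mean the bot is open to everyone."""
    raw = os.getenv("OLMO_ALLOWED_USERS", "")
    return {int(x) for x in raw.split(",") if x.strip()}

def is_allowed(user_id: int) -> bool:
    ids = allowed_user_ids()
    return not ids or user_id in ids
```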
What's Next
The main limitation is Allen AI's native tool calling — while the model acknowledges tools and can call them, it doesn't always do so proactively. A bot-side tool loop (parsing tool-call JSON from the model output and executing tools locally) would make this more reliable.
The pointing coordinate format from Molmo 2 also isn't officially documented — I reverse-engineered it from testing. It works reliably, but the format could change.