Luke Manning

Posted on • Originally published at lukemanning.ie

Straico Has Great Models But No Streaming, So I Built a Proxy

I use OpenCode as my main AI coding tool. I switched from Claude Code after Anthropic started going after open source projects and I kept hitting session limits on my subscription extremely fast.

OpenCode works with any OpenAI-compatible API. Straico gives me access to Claude, GPT, Gemini, DeepSeek, and a bunch more through a single API key. Cheap too. Problem is, Straico's API is missing two things OpenCode needs: streaming responses and function calling.

Without streaming, OpenCode just hangs. It never gets a response, but Straico keeps eating tokens on its end anyway. Without function calling, the AI can't use tools like reading files or running bash commands. Both are non-negotiable for an agentic coding tool.

So I built a proxy. It sits between OpenCode and Straico, translating requests and responses to fill in the gaps.

```
OpenCode
  → localhost:8000 (my proxy)
    → Straico API
```

What started as "just simulate streaming and inject tool definitions" turned into a surprisingly full-featured thing. The codebase is at github.com/ManningWorks/DOAI-Proxy.

The Architecture I Ended Up With

I didn't start with a provider pattern. I started with four files: server.js, streaming.js, tools.js, utils.js. But once I started thinking about the possibility of adding other providers down the line, I refactored into something cleaner.

The provider pattern lives in providers/. BaseProvider is an abstract class that handles the interface contract and retry logic. StraicoProvider extends it with Straico-specific request/response transformation. ProviderFactory instantiates the right one based on the PROVIDER_TYPE env var.

Right now only Straico exists, but the factory already has stubs for OpenAI and Anthropic. The ADDING_PROVIDERS.md doc in the repo lays out how to add a new one.

The other modules:

  • server.js - Express server, routing, auth, request lifecycle
  • streaming.js - SSE simulation with two modes
  • tools.js - Tool injection and response parsing
  • utils.js - Logging, formatting, log rotation
  • utils/model-limits.js - Fetches context limits from Straico's API
  • summarizer.js - Conversation summarization for long sessions
  • scripts/sync-opencode-config.js - Syncs model list to OpenCode config

Streaming Without Streaming

Straico returns the full response at once. No SSE. No chunks. The proxy has to fake it.

Two modes: none and smart.

none is what I'd recommend as default. It sends the entire response in one SSE chunk, then the [DONE] marker. Fast, no formatting issues, still technically SSE.
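The shape of that single chunk is roughly this (a minimal sketch; `formatSingleChunk` is my illustrative name, not a function from the repo, but the chunk layout follows the OpenAI streaming format):

```javascript
// Sketch: wrap a complete response as one OpenAI-style SSE chunk,
// followed by the [DONE] marker. Illustrative, not the proxy's actual code.
function formatSingleChunk(model, text) {
  const chunk = {
    id: 'chatcmpl-proxy',
    object: 'chat.completion.chunk',
    model,
    choices: [{ index: 0, delta: { content: text }, finish_reason: 'stop' }],
  };
  // SSE frames: one data line carrying the JSON, then the terminator
  return `data: ${JSON.stringify(chunk)}\n\ndata: [DONE]\n\n`;
}
```

The client still speaks SSE end to end; it just receives one big delta instead of many small ones.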

smart is more interesting. It splits the response into chunks with delays to simulate real streaming. The naive approach is responseText.match(new RegExp('.{1,15}', 'g')) and that kind of works. But it breaks markdown. Split mid-bold, mid-code-block, mid-backtick and the rendering glitches.

So smartChunkText() in streaming.js looks for safe boundaries. It prefers splitting on newlines, then whitespace. It also checks for markdown delimiters and extends the chunk to avoid splitting them. There's a max size limit (targetSize * 10) to prevent infinite extension.

```javascript
// streaming.js - simplified version of the boundary logic
function findChunkBoundary(text, start, end, maxSize) {
  // Prefer breaking just after a newline inside the window
  for (let i = end; i > start; i--) {
    if (text[i] === '\n') return i + 1;
  }
  // If a markdown delimiter straddles the cut, extend the chunk past it,
  // capped at start + maxSize (targetSize * 10)
  for (const delim of ['**', '__', '`']) {
    const delimStart = text.indexOf(delim, start);
    if (delimStart !== -1 && delimStart < end) {
      const delimEnd = delimStart + delim.length;
      if (delimEnd > end) {
        return Math.min(delimEnd, text.length, start + maxSize);
      }
    }
  }
  return end; // no better boundary found
}
```

Default is 15 characters per chunk with 80ms delay. That feels about right for most models. Configurable via STREAM_CHUNK_SIZE and STREAM_DELAY_MS env vars.

I set STREAM_MODE=none as the recommended default. smart works but it's more of a showcase thing. The boundary detection catches most cases but I wouldn't trust it with complex nested markdown.

Function Calling via Prompt Injection

Straico doesn't support function calling natively. The workaround: inject tool definitions into the system prompt and parse the AI's response to detect tool calls.

injectToolsIntoSystem() in tools.js appends a formatted list of available tools to the system message:

```
You have access to the following tools:
- bash: Run bash commands
- read: Read file contents

When you need to use a tool, format your response like this:
TOOL_CALL: <tool_name>
ARGUMENTS: <json_arguments>
```

There's a sentinel comment (<!-- proxy-tools-injected -->) to prevent double-injection if the same messages get processed twice.
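The injection-plus-sentinel idea looks roughly like this (a simplified sketch; the real `injectToolsIntoSystem()` in tools.js differs in the details):

```javascript
// Sketch: append tool docs to the system prompt, guarded by a sentinel
// so a second pass over the same messages is a no-op.
const SENTINEL = '<!-- proxy-tools-injected -->';

function injectTools(systemText, tools) {
  if (systemText.includes(SENTINEL)) return systemText; // already injected
  const toolList = tools
    .map((t) => `- ${t.name}: ${t.description}`)
    .join('\n');
  return (
    `${systemText}\n\n${SENTINEL}\n` +
    `You have access to the following tools:\n${toolList}\n\n` +
    `When you need to use a tool, format your response like this:\n` +
    `TOOL_CALL: <tool_name>\nARGUMENTS: <json_arguments>`
  );
}
```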

The tricky part is parsing. Different models output tool calls in different formats. I ended up with four parsers that run in sequence:

  1. Minimax XML - <minimax:tool_call> with <invoke> tags
  2. Claude XML - <invoke name="..."> with <parameter_list> tags
  3. OpenAI Native - JSON with "tool_calls": [...] embedded in the response
  4. Text Format - The TOOL_CALL: / ARGUMENTS: format from the injection prompt

Each parser tries to extract tool calls from the response text. The first one that succeeds wins. This was a gradual thing. I started with just the text format parser. Then Minimax models returned XML. Then Claude models returned different XML. Then some models returned JSON that looked like OpenAI's format. Four parsers later and it handles most cases.
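The chaining itself is simple once each parser has a common shape. Here's a sketch with a deliberately naive text-format parser (the real one tracks brace depth, as described below; the XML and JSON parsers are elided, and these function names are my own):

```javascript
// Naive text-format parser: works when the JSON object ends the response.
// The real implementation tracks brace depth instead of trusting a regex.
function parseTextFormat(text) {
  const m = text.match(/TOOL_CALL:\s*(\w+)\s*ARGUMENTS:\s*(\{.*\})/s);
  if (!m) return [];
  try {
    return [{ name: m[1], arguments: JSON.parse(m[2]) }];
  } catch {
    return []; // malformed JSON: let another parser (or plain text) handle it
  }
}

// First parser that extracts at least one call wins.
function parseToolCalls(text, parsers) {
  for (const parse of parsers) {
    const calls = parse(text);
    if (calls.length > 0) return calls;
  }
  return [];
}
```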

The text format parser was the hardest to get right. Matching TOOL_CALL: tool_name ARGUMENTS: {json} seems simple until the JSON contains nested objects, strings with braces, or the model forgets the space between the tool name and ARGUMENTS. The implementation tracks brace depth to find where the JSON actually ends:

```javascript
// tools.js - track brace depth to find where the JSON arguments end
let braceCount = 0;
let argsEndIndex = -1;
let foundClosingBrace = false;
for (let i = argsStartIndex; i < responseText.length; i++) {
  const char = responseText[i];
  if (char === '{') braceCount++;
  else if (char === '}') {
    braceCount--;
    if (braceCount === 0) {
      argsEndIndex = i + 1;
      foundClosingBrace = true;
      break;
    }
  }
}
```

The proxy also validates tool calls against the list of available tools. If the model invents a tool that doesn't exist, it gets filtered out. If all tool calls are invalid, the response is treated as regular text.
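That filtering step can be sketched like this (`filterToolCalls` is my name for it, not necessarily the repo's):

```javascript
// Sketch: keep only tool calls whose names exist in the advertised tool list.
function filterToolCalls(calls, availableTools) {
  const known = new Set(availableTools.map((t) => t.name));
  const valid = calls.filter((c) => known.has(c.name));
  // If every call was invented, fall back to treating the response as prose
  return valid.length > 0 ? { toolCalls: valid } : { toolCalls: null };
}
```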

Tool Call Streaming

OpenCode expects tool calls to arrive as SSE chunks, same as regular text. streamToolCalls() in streaming.js sends an init chunk with the tool name and ID, then an args chunk with the arguments, then a final chunk with finish_reason: 'tool_calls'. Each chunk has a small delay (20ms, 10ms, 20ms) to feel like actual streaming.
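The three-chunk sequence follows the OpenAI streaming shape. A sketch of the payloads (function and variable names here are illustrative, not from streaming.js):

```javascript
// Sketch: build the init / args / finish chunk sequence for one tool call.
function buildToolCallChunks(id, name, args) {
  const delta = (d, finish = null) => ({
    object: 'chat.completion.chunk',
    choices: [{ index: 0, delta: d, finish_reason: finish }],
  });
  return [
    // 1) init chunk: tool id and name, empty arguments
    delta({ tool_calls: [{ index: 0, id, type: 'function',
            function: { name, arguments: '' } }] }),
    // 2) args chunk: the serialized arguments
    delta({ tool_calls: [{ index: 0,
            function: { arguments: JSON.stringify(args) } }] }),
    // 3) final chunk: signal the finish reason
    delta({}, 'tool_calls'),
  ];
}
```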

Conversation Summarization

This one sneaked up on me. Straico has model context limits. Some models have 8k tokens, some have 128k. OpenCode sends the entire conversation history with every request. In a long coding session, that history grows fast.

summarizer.js checks if the estimated token count is approaching the model's limit. When it hits a configurable threshold (default 70% of the model's word_limit), it takes all but the most recent messages, sends them to Straico for summarization, and replaces them with a single summary message.
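The replace step is the easy half. A sketch under my own assumptions about the message shape (the role used for the summary message and the `keepRecent` count are illustrative; summarizer.js handles the token estimation and the actual API call):

```javascript
// Sketch: drop older messages and splice in a single summary message.
// The summary role/prefix here is an assumption, not the repo's exact format.
function buildSummarizedHistory(messages, summaryText, keepRecent = 4) {
  if (messages.length <= keepRecent) return messages; // nothing to compress
  const recent = messages.slice(-keepRecent);
  return [
    { role: 'user', content: `[Conversation summary]: ${summaryText}` },
    ...recent,
  ];
}
```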

The summarization itself uses Straico's smart_llm_selector with pricing_method: balance, so it picks a cheap model for the summary. Configurable via SUMMARIZATION_MODEL.

I'm still not 100% sure this is the right approach. The summary is lossy. Sometimes the model needs context from earlier messages that the summary glossed over. But without it, long sessions just fail with context limit errors. Tradeoff.

Model Limits and Validation

utils/model-limits.js fetches all available models from Straico's /models endpoint at startup. It caches their context limits (word_limit) and max output tokens (max_output). The proxy uses this to validate incoming requests. If estimated_input_tokens + max_tokens > word_limit, it rejects the request with a 400 error before even hitting Straico.
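The check itself is one comparison (a sketch; `validateContextFit` is my illustrative name):

```javascript
// Sketch: reject before hitting Straico if the request can't fit the model.
function validateContextFit(estimatedInputTokens, maxTokens, wordLimit) {
  if (estimatedInputTokens + maxTokens > wordLimit) {
    return {
      ok: false,
      status: 400,
      error: `Request needs ${estimatedInputTokens + maxTokens} tokens ` +
             `but the model allows ${wordLimit}`,
    };
  }
  return { ok: true };
}
```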

The model list is also exposed at /v1/models so OpenCode can discover what's available. There's an admin endpoint at /v1/admin/refresh-models to force a refresh if Straico adds new models.

The sync script (scripts/sync-opencode-config.js) goes one step further. It fetches the model list from Straico, then updates ~/.config/opencode/opencode.json with all chat-type models. The Docker entrypoint runs this script before starting the server, so the model list is always current.

Authentication

Four modes, controlled by AUTH_MODE:

  • required - Needs PROXY_API_KEY, rejects requests without it. Default in production.
  • optional - Uses the key if set, warns if not. Default in development.
  • disabled - No auth. For isolated environments.
  • external - Trusts an external auth header. For when the proxy sits behind an API gateway or service mesh.

The key comparison uses crypto.timingSafeEqual to prevent timing attacks. Took me a moment to realise I needed buffer length checks too, since timingSafeEqual throws if the buffers are different lengths.

Retry and Graceful Shutdown

BaseProvider.makeRequestWithRetry() wraps every API call with exponential backoff. Retries on 429, 5xx, and network errors (ECONNREFUSED, ECONNRESET, ETIMEDOUT). Default is 3 attempts with a 1-second base delay.
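The two pieces of that policy can be sketched as pure functions (names are illustrative of BaseProvider's behavior, not its exact API):

```javascript
// Which failures are worth retrying, and how long to wait before each attempt.
const RETRYABLE_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT']);

function isRetryable(status, code) {
  return status === 429 ||
         (status >= 500 && status < 600) ||
         RETRYABLE_CODES.has(code);
}

function backoffDelayMs(attempt, baseMs = 1000) {
  // attempt 1 -> 1s, attempt 2 -> 2s, attempt 3 -> 4s
  return baseMs * 2 ** (attempt - 1);
}
```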

Graceful shutdown was one of those things I didn't think about until I ran into issues. When Docker sends SIGTERM, the proxy stops accepting new requests and waits for active ones to drain. There's a timeout (default 30 seconds) after which it force-exits. Without this, long-running streaming responses would get cut off mid-chunk when the container restarted.

Docker Setup

The Dockerfile uses node:18-alpine and an entrypoint script. The entrypoint runs the OpenCode config sync, then starts the server.

Docker Compose mounts two volumes. The .env file for config. And ~/.config/opencode so the sync script can write to the OpenCode config file.

```yaml
volumes:
  - ./.env:/app/.env:ro
  - ~/.config/opencode:/root/.config/opencode
```

One thing I got wrong initially was the Dockerfile CMD. I had CMD ["node", "server.js"] which meant the config sync never ran. Switched to ENTRYPOINT ["/app/docker-entrypoint.sh"] and that fixed it. Small thing, but it meant every container restart would have stale model lists.

The Straico-Specific Quirks

Straico's API is mostly OpenAI-compatible but with some differences that caught me out.

Tool result messages use role: "tool" in OpenAI format. Straico doesn't support that role. The proxy converts them to role: "user" with a [Tool Result]: prefix. Same with assistant messages that contain tool calls. Those get converted to the text format the injection prompt expects.

Empty assistant messages get filtered out entirely. Some models return an assistant message with empty content before making a tool call. Straico chokes on those.

There's a TOOL_RESULT_MAX_LENGTH env var that truncates large tool outputs. Some tool results (file reads, command output) can be massive. Without truncation, they blow out the context window and the next request fails.

The proxy also normalises messages. OpenAI sends content as arrays of objects (text parts, image parts, system reminders). The proxy flattens those into plain strings and strips out <system-reminder> tags. Straico doesn't know what to do with the array format.
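A sketch of that flattening (`normalizeContent` is my name; the content-part shapes follow OpenAI's format):

```javascript
// Sketch: flatten array content into a plain string and strip
// <system-reminder> blocks, since Straico expects plain text.
function normalizeContent(content) {
  const text = Array.isArray(content)
    ? content
        .filter((part) => part.type === 'text') // drop image parts etc.
        .map((part) => part.text)
        .join('\n')
    : String(content ?? '');
  return text.replace(/<system-reminder>[\s\S]*?<\/system-reminder>/g, '').trim();
}
```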

What I'd Do Differently

The provider pattern is solid but I'd start with it from the beginning rather than refactoring into it. The four-file structure worked fine until I wanted to add features that crossed module boundaries. The abstraction would have saved me some reshuffling.

The smart streaming mode is neat but I'd think harder about whether it's worth the complexity. The boundary detection handles most markdown but not all edge cases. none mode is faster and more reliable. I use none day to day.

The summarization feature is the part I'm least confident about. It works, but the lossy compression means sometimes context gets dropped at exactly the wrong moment. I might revisit this with a sliding window approach instead of a hard summarize-and-replace.

Where It Stands

The proxy handles:

  • All 90+ Straico models through a single endpoint
  • Streaming simulation (both modes)
  • Function calling with four parser strategies
  • Conversation summarization for long sessions
  • Model context validation
  • Authentication with four modes
  • Retry with exponential backoff
  • Graceful shutdown with request draining
  • Docker deployment with automatic model sync

It runs on my machine and OpenCode talks to it at http://localhost:8000/v1. Works well enough that I don't think about it most of the time. Which is exactly what a proxy should do.

The code is on GitHub if you want to look or use it. Or add a provider. The architecture supports it.
