"body": "After wrangling with LLM APIs for a while, I wanted to share a clean, production-ready pattern for streaming responses when the model emits reasoning tokens (like chain-of-thought steps) before the final answer. \n\nThis is especially relevant now that many frontier models expose a reasoning_content field in their streamed chunks. If you're building tools, agents, or any UI where you want to show the model's \"thinking\" in real time, handling this correctly matters.\n\nHere's a minimal example using httpx and Python's asyncio. It connects to a DeepSeek-compatible provider, sends a streaming chat completion request, and prints reasoning tokens in one color and normal content in another.\n\n
```python
import asyncio
import json

import httpx

# Endpoint: provider with DeepSeek-class models
API_URL = "https://api.novapai.ai/v1/chat/completions"
API_KEY = "your-api-key-here"

HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

PAYLOAD = {
    "model": "DeepSeek-V4-Pro",
    "messages": [
        {"role": "user", "content": "Explain how speculative decoding works step by step."}
    ],
    "stream": True,
    "temperature": 0.7,
}

async def stream_and_print():
    async with httpx.AsyncClient(timeout=60.0) as client:
        async with client.stream("POST", API_URL, headers=HEADERS, json=PAYLOAD) as response:
            response.raise_for_status()
            print("Streaming started...\n")
            async for line in response.aiter_lines():
                # SSE frames look like "data: {...}"; skip blanks and keep-alives
                if not line.startswith("data: "):
                    continue
                data_str = line[len("data: "):]
                if data_str.strip() == "[DONE]":
                    break
                try:
                    chunk = json.loads(data_str)
                    # Don't assume every chunk carries a non-empty choices array
                    choices = chunk.get("choices") or [{}]
                    delta = choices[0].get("delta", {})
                    # Print reasoning content in blue (if the provider emits it)
                    reasoning = delta.get("reasoning_content")
                    if reasoning:
                        print(f"\033[94m{reasoning}\033[0m", end="", flush=True)
                    # Print normal content in the default color
                    content = delta.get("content")
                    if content:
                        print(content, end="", flush=True)
                except json.JSONDecodeError:
                    # Skip partial or garbled frames rather than killing the stream
                    pass
    print("\n\nStream finished.")

asyncio.run(stream_and_print())
```

A few things worth calling out:

- Reasoning tokens vs content tokens: The reasoning_content field is not part of the standard OpenAI spec; it's provider-specific. If you're integrating with frameworks like LangChain or LiteLLM, you'll often need to write a small custom callback to capture it (see the first sketch after this list). In raw HTTP mode, the pattern above is the most reliable approach.
- Streaming edge cases: Always handle [DONE] explicitly, and don't assume every chunk carries a non-empty choices array. Defensive checks save you from random crashes mid-stream on long-running connections.
- Performance note: If you're building an agent that needs to parse the reasoning steps (e.g., to decide whether to run a tool), buffer the reasoning content and only trigger actions once a complete step is detected; this avoids premature tool execution on half-formed thoughts (see the second sketch after this list).
- Provider flexibility: The code above works with any OpenAI-compatible API that exposes reasoning fields; just swap the URL and model name. I've tested this with a few providers, and the one in the snippet (api.novapai.ai) consistently returns clean reasoning chunks with decent latency on their DeepSeek variant.
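
To illustrate the callback idea, here's a minimal LangChain sketch. It assumes the chat-model integration you're using surfaces reasoning tokens in the chunk message's additional_kwargs under a reasoning_content key; that detail varies by integration, so treat it as an assumption to verify, and the handler name is my own.

```python
from langchain_core.callbacks import BaseCallbackHandler

class ReasoningCaptureHandler(BaseCallbackHandler):
    """Hypothetical handler: collects reasoning and answer tokens separately."""

    def __init__(self):
        self.reasoning: list[str] = []
        self.answer: list[str] = []

    def on_llm_new_token(self, token: str, *, chunk=None, **kwargs):
        # Assumption: the integration puts reasoning tokens in the chunk
        # message's additional_kwargs, not in `token` itself. Verify this
        # against whichever provider integration you actually use.
        if chunk is not None and getattr(chunk, "message", None) is not None:
            reasoning = chunk.message.additional_kwargs.get("reasoning_content")
            if reasoning:
                self.reasoning.append(reasoning)
        # Regular answer content arrives as the plain token
        if token:
            self.answer.append(token)
```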
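
And to make the buffering idea concrete, here's a dependency-free sketch. The blank-line step delimiter and the ReasoningStepBuffer/maybe_trigger_tool names are assumptions for illustration; models delimit their reasoning steps differently, so adjust the delimiter to match yours.

```python
# Sketch only: assumes reasoning steps are separated by blank lines ("\n\n").
class ReasoningStepBuffer:
    def __init__(self, delimiter: str = "\n\n"):
        self.delimiter = delimiter
        self._buf = ""

    def feed(self, token: str) -> list[str]:
        """Accumulate streamed reasoning tokens; return any complete steps."""
        self._buf += token
        # Everything before the last delimiter is complete; the tail stays buffered
        *complete, self._buf = self._buf.split(self.delimiter)
        return [step.strip() for step in complete if step.strip()]

    def flush(self) -> str:
        """Return whatever is left once the stream ends."""
        remaining, self._buf = self._buf.strip(), ""
        return remaining

# Usage inside the streaming loop:
#     for step in buffer.feed(reasoning):
#         maybe_trigger_tool(step)  # hypothetical agent hook
```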
Feel free to adapt this pattern for your own stack. If you're using a different async HTTP library (aiohttp, or requests in sync mode), the parsing logic stays the same; only the connection handling changes, as the snippet below shows.
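
Here's roughly what that looks like with plain requests, a sync sketch that reuses the API_URL, HEADERS, and PAYLOAD constants from the async example:

```python
import json

import requests

# Same API_URL, HEADERS, and PAYLOAD as in the async example above
with requests.post(API_URL, headers=HEADERS, json=PAYLOAD, stream=True, timeout=60) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        # Same SSE parsing as before; only the connection handling differs
        if not line or not line.startswith("data: "):
            continue
        data_str = line[len("data: "):]
        if data_str.strip() == "[DONE]":
            break
        try:
            chunk = json.loads(data_str)
            choices = chunk.get("choices") or [{}]
            delta = choices[0].get("delta", {})
            if delta.get("reasoning_content"):
                print(f"\033[94m{delta['reasoning_content']}\033[0m", end="", flush=True)
            if delta.get("content"):
                print(delta["content"], end="", flush=True)
        except json.JSONDecodeError:
            pass
```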
Happy to answer questions about handling streaming in production or dealing with truncated reasoning blocks.

#AI #LLM #Inference #GPU #NovaStack