Patrick DeVos

Run DeepSeek in Your App Without a GPU — Streaming Chat API with Session Memory

Adding LLM chat to an app usually means one of two things: paying OpenAI per token indefinitely, or running your own model and dealing with GPU provisioning, quantization, and VRAM math at 1am. Neither is great when you just want AI chat in your app.

The LocalLLM Chat API is a third option: DeepSeek-powered, hosted, streaming, and session-aware. It also handles a privacy detail that hosted LLM proxies often gloss over: sessions are isolated per API subscriber, so your sessions are never visible to anyone else's key.

What It Does

  • Persistent chat sessions — create a session, send messages, sessions remember the full conversation history
  • SSE streaming — tokens arrive in real-time, same as ChatGPT's streaming interface
  • Per-user isolation — each API subscriber's sessions are completely invisible to every other subscriber; no cross-contamination of conversation history
  • DeepSeek backend — runs deepseek-chat, one of the strongest open-weight models at any price point
  • Simple REST — standard JSON in, text/event-stream out

Quick Start

Sign up at RapidAPI, search "LocalLLM Chat" by Circle of Wizards, subscribe to the free BASIC plan, and grab your X-RapidAPI-Key.

pip install requests sseclient-py
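
The examples in this post paste the key straight into the source as YOUR_RAPIDAPI_KEY. If you would rather keep it out of version control, one option is to read it from the environment. This is a minimal sketch: the variable name RAPIDAPI_KEY and the build_headers helper are my own choices for illustration, not something the API defines.

```python
import os

HOST = "localllm-chat.p.rapidapi.com"


def build_headers(env=os.environ) -> dict:
    """Build the RapidAPI request headers from an environment mapping.

    RAPIDAPI_KEY is a hypothetical variable name chosen for this sketch.
    """
    key = env.get("RAPIDAPI_KEY", "").strip()
    if not key:
        raise RuntimeError("Set RAPIDAPI_KEY before calling the API")
    return {
        "X-RapidAPI-Key": key,
        "X-RapidAPI-Host": HOST,
        "Content-Type": "application/json",
    }
```

Passing the mapping as a parameter keeps the helper easy to test: hand it a plain dict instead of touching the real environment.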

Step 1: Create a Session

A session holds conversation history. Each message you send is added to the context automatically — no need to re-send prior turns yourself.

import requests

KEY  = "YOUR_RAPIDAPI_KEY"
HOST = "localllm-chat.p.rapidapi.com"
BASE = f"https://{HOST}"

HEADERS = {
    "X-RapidAPI-Key":  KEY,
    "X-RapidAPI-Host": HOST,
    "Content-Type":    "application/json",
}

def create_session() -> str:
    r = requests.post(f"{BASE}/sessions", headers=HEADERS,
                      json={"backend": "deepseek"})
    r.raise_for_status()
    data = r.json()
    print(f"Session {data['id']} | model: {data['model']}")
    return data["id"]

session_id = create_session()

Response:

{
  "id": "43962dfa",
  "created_at": "2026-06-06T23:41:32.167345+00:00",
  "last_active": "2026-06-06T23:41:32.167384+00:00",
  "backend": "deepseek",
  "model": "deepseek-chat"
}
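
If you want typed access to those fields rather than raw dict lookups, the response above can be parsed into a small dataclass. This is a sketch that assumes every session response has exactly the five fields shown in the sample; SessionInfo is a name I made up, not part of the API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SessionInfo:
    """Typed view of the session JSON shown above (hypothetical helper)."""
    id: str
    created_at: datetime
    last_active: datetime
    backend: str
    model: str

    @classmethod
    def from_json(cls, data: dict) -> "SessionInfo":
        # The sample timestamps are ISO 8601 with a UTC offset, which
        # datetime.fromisoformat parses into timezone-aware datetimes.
        return cls(
            id=data["id"],
            created_at=datetime.fromisoformat(data["created_at"]),
            last_active=datetime.fromisoformat(data["last_active"]),
            backend=data["backend"],
            model=data["model"],
        )
```

With this in place, SessionInfo.from_json(r.json()) gives you real datetimes to compare instead of strings.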

Step 2: Stream a Response

Chat uses Server-Sent Events. Each token arrives as a separate event, so your UI can render progressively instead of waiting for the full response.

import json

import sseclient

def chat(session_id: str, message: str) -> str:
    # stream=True keeps the connection open so events can be read as they arrive
    r = requests.post(
        f"{BASE}/sessions/{session_id}/chat",
        headers={**HEADERS, "Accept": "text/event-stream"},
        json={"message": message},
        stream=True,
    )
    r.raise_for_status()

    full_text = ""
    client = sseclient.SSEClient(r)

    for event in client.events():
        if event.event == "token":
            # each token event carries a small JSON payload: {"text": "..."}
            token = json.loads(event.data)["text"]
            print(token, end="", flush=True)
            full_text += token
        elif event.event == "done":
            print()  # newline after stream ends
            break

    return full_text


response = chat(session_id, "In one sentence, what is quantum entanglement?")

Raw SSE stream:

event: token
data: {"text": "Quantum"}

event: token
data: {"text": " entanglement"}

event: token
data: {"text": " is"}

event: token
data: {"text": " a"}

event: token
data: {"text": " physical"}

event: token
data: {"text": " phenomenon"}

...

event: done
data: {}

The done event signals end of stream — check for it explicitly so you don't hang waiting for more tokens.
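
To make the wire format concrete, here is a rough, dependency-free parser for the kind of stream shown above. It is a simplified sketch rather than a full implementation of the SSE spec (it ignores id and retry fields, for example), and parse_sse and collect_text are names invented for this example. In practice sseclient-py does this work for you.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Drop the single optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


def collect_text(lines: Iterable[str]) -> str:
    """Join token payloads until the done event arrives."""
    parts = []
    for event, data in parse_sse(lines):
        if event == "token":
            parts.append(json.loads(data)["text"])
        elif event == "done":
            break  # stop explicitly instead of waiting for more tokens
    return "".join(parts)
```

Feeding it the lines of the sample stream returns the concatenated token text, and anything after the done event is ignored.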


Step 3: Multi-Turn Conversation

Sessions handle history server-side. Just keep sending to the same session ID:

session_id = create_session()

# Turn 1
chat(session_id, "My name is Alex. Remember that.")
# → "Of course, Alex! I'll remember your name..."

# Turn 2
chat(session_id, "What's my name?")
# → "Your name is Alex, as you mentioned earlier."

# Turn 3
chat(session_id, "Give me a haiku about that.")
# → "Alex speaks today / A name carried through the stream / Echo finds its mark"

The model has full context of every prior turn. You don't re-send history. Sessions persist as long as the subscription is active.


JavaScript / Browser Integration

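For JavaScript apps, use fetch and read the response body as a stream. Anything shipped to a browser is readable by its users, so in production keep the RapidAPI key on your server and proxy these calls through it: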

const KEY  = "YOUR_RAPIDAPI_KEY";
const HOST = "localllm-chat.p.rapidapi.com";
const BASE = `https://${HOST}`;

// Create session
async function createSession() {
  const res = await fetch(`${BASE}/sessions`, {
    method: "POST",
    headers: {
      "X-RapidAPI-Key":  KEY,
      "X-RapidAPI-Host": HOST,
      "Content-Type":    "application/json",
    },
    body: JSON.stringify({ backend: "deepseek" }),
  });
  if (!res.ok) throw new Error(`Session request failed: ${res.status}`);
  const data = await res.json();
  return data.id;
}

// Stream chat response
async function chat(sessionId, message, onToken) {
  const res = await fetch(`${BASE}/sessions/${sessionId}/chat`, {
    method: "POST",
    headers: {
      "X-RapidAPI-Key":  KEY,
      "X-RapidAPI-Host": HOST,
      "Content-Type":    "application/json",
      "Accept":          "text/event-stream",
    },
    body: JSON.stringify({ message }),
  });
  if (!res.ok) throw new Error(`Chat request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  // Declared outside the read loop: an event's "event:" and "data:" lines
  // can arrive in different chunks.
  let eventType = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep incomplete line

    for (const line of lines) {
      if (line.startsWith("event: ")) {
        eventType = line.slice(7).trim();
      } else if (line.startsWith("data: ") && eventType === "token") {
        onToken(JSON.parse(line.slice(6)).text);
      } else if (line.startsWith("data: ") && eventType === "done") {
        await reader.cancel(); // server signalled end of stream
        return;
      } else if (line.trim() === "") {
        eventType = ""; // a blank line ends the current event
      }
    }
  }
}

// Usage (top-level await needs an ES module, e.g. <script type="module">)
const sessionId = await createSession();
let output = "";
await chat(sessionId, "Explain WebSockets in 2 sentences.", (token) => {
  output += token;
  document.getElementById("output").textContent = output; // live update
});

Session Management

List your sessions — you only see your own. Other subscribers' sessions are invisible to you:

def list_sessions() -> list:
    r = requests.get(f"{BASE}/sessions", headers=HEADERS)
    r.raise_for_status()
    return r.json()

sessions = list_sessions()
for s in sessions:
    print(f"{s['id']} | {s['message_count']} messages | last: {s['last_active'][:10]}")

Output:

43962dfa | 3 messages | last: 2026-06-06
d09998f4 | 3 messages | last: 2026-06-06
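
Since each listed session carries a last_active timestamp, you can use it to spot sessions worth cleaning up. This sketch assumes the list endpoint returns last_active in the same ISO 8601 format as the create-session response; stale_session_ids is a hypothetical helper, not an API call.

```python
from datetime import datetime, timedelta, timezone


def stale_session_ids(sessions, max_idle, now=None):
    """Return IDs of sessions idle for longer than max_idle.

    `sessions` is the parsed list from GET /sessions. `now` is injectable
    so the function can be tested with a fixed clock.
    """
    now = now or datetime.now(timezone.utc)
    return [
        s["id"]
        for s in sessions
        if now - datetime.fromisoformat(s["last_active"]) > max_idle
    ]
```

For example, stale_session_ids(list_sessions(), timedelta(days=7)) would give you the IDs that have been idle for more than a week.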

Delete a session when done:

def delete_session(session_id: str):
    r = requests.delete(f"{BASE}/sessions/{session_id}", headers=HEADERS)
    r.raise_for_status()

delete_session(session_id)
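
For short-lived conversations it is easy to forget the delete call, especially when an exception interrupts the flow. One possible pattern is a context manager that always cleans up. This is a sketch: temporary_session is a name I made up, and it takes the create and delete functions as arguments so it does not depend on any particular HTTP client.

```python
from contextlib import contextmanager


@contextmanager
def temporary_session(create, delete):
    """Create a session, yield its ID, and delete it on exit.

    `create` returns a session ID; `delete` accepts one. The finally
    block runs even if the body raises, so the session never leaks.
    """
    session_id = create()
    try:
        yield session_id
    finally:
        delete(session_id)
```

With the helpers from this post, usage would look like: with temporary_session(create_session, delete_session) as sid: chat(sid, "Hello").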

A Note on Privacy

Some LLM proxy APIs run all users through shared conversation state or log everything to a central store. LocalLLM isolates sessions by API subscriber identity, using the X-RapidAPI-User header that the gateway injects. Your sessions never appear in another subscriber's session list, and vice versa.

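This matters if you're building multi-tenant apps or handling user conversations with any sensitivity. Note that the isolation boundary is the subscriber, not your app's end users: everyone using your app goes through your API key, so give each end user their own session to avoid bleeding one user's context into another's chat history.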
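
Because isolation is keyed to the subscriber, separating your own end users is your app's job. A minimal sketch of one approach is a registry that lazily creates one session per end user. SessionRegistry is a hypothetical helper; it is in-memory only and not thread-safe, so a real app would likely keep this mapping in its database.

```python
from typing import Callable, Dict, Optional


class SessionRegistry:
    """Map each of your app's user IDs to its own chat session ID."""

    def __init__(self, create_session: Callable[[], str]):
        self._create = create_session
        self._by_user: Dict[str, str] = {}

    def session_for(self, user_id: str) -> str:
        # Create the session on first use, then always reuse it.
        if user_id not in self._by_user:
            self._by_user[user_id] = self._create()
        return self._by_user[user_id]

    def forget(self, user_id: str) -> Optional[str]:
        # Return the removed session ID so the caller can delete it remotely.
        return self._by_user.pop(user_id, None)
```

You would construct it as SessionRegistry(create_session) and call chat(registry.session_for(user_id), message) for each incoming request.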


Building a Simple CLI Chatbot

Put it all together:

import requests
import sseclient
import json

KEY  = "YOUR_RAPIDAPI_KEY"
HOST = "localllm-chat.p.rapidapi.com"
BASE = f"https://{HOST}"
HEADERS = {
    "X-RapidAPI-Key":  KEY,
    "X-RapidAPI-Host": HOST,
    "Content-Type":    "application/json",
}


def create_session() -> str:
    r = requests.post(f"{BASE}/sessions", headers=HEADERS,
                      json={"backend": "deepseek"})
    r.raise_for_status()
    return r.json()["id"]


def chat_stream(session_id: str, message: str):
    r = requests.post(
        f"{BASE}/sessions/{session_id}/chat",
        headers={**HEADERS, "Accept": "text/event-stream"},
        json={"message": message},
        stream=True,
    )
    r.raise_for_status()
    for event in sseclient.SSEClient(r).events():
        if event.event == "token":
            print(json.loads(event.data)["text"], end="", flush=True)
        elif event.event == "done":
            print()
            return


def main():
    print("LocalLLM Chat (DeepSeek) — type 'quit' to exit\n")
    session_id = create_session()
    print(f"Session: {session_id}\n")

    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nBye.")
            break

        if user_input.lower() in ("quit", "exit", "q"):
            break
        if not user_input:
            continue

        print("AI: ", end="")
        chat_stream(session_id, user_input)


if __name__ == "__main__":
    main()

Run it, start a conversation — context carries across every turn automatically.


Why DeepSeek?

DeepSeek-V3 (deepseek-chat) is reported to be competitive with GPT-4o on coding and reasoning benchmarks, at a fraction of the inference cost. For apps that don't need OpenAI branding, it's a straightforward swap.

The BASIC plan covers personal projects and prototypes. PRO ($9.99/mo) unlocks higher request limits for production traffic.
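
Whichever plan you are on, it helps to handle hitting the request limit gracefully. The sketch below retries with exponential backoff. It assumes the gateway answers with HTTP 429 when you are over the limit, which is a common convention but not something this post confirms for this API. with_backoff is a hypothetical helper.

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `call` while it returns HTTP 429, doubling the wait each time.

    `call` must return an object with a status_code attribute, such as a
    requests.Response. The last response is returned even if it is a 429.
    """
    for attempt in range(retries + 1):
        response = call()
        if response.status_code != 429 or attempt == retries:
            return response
        sleep(base_delay * 2 ** attempt)
```

A call such as with_backoff(lambda: requests.get(f"{BASE}/sessions", headers=HEADERS)) would then wait 1, 2, and 4 seconds between attempts before giving up.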


The API is live on RapidAPI — search "LocalLLM Chat" by Circle of Wizards. Free to try. If you build something on top of it — a chatbot, a writing tool, a customer support prototype — drop a link in the comments.
