Hassann

Originally published at apidog.com

How to Use DeepSeek V4: Web Chat, API, and Self-Hosted Paths

DeepSeek V4 launched on April 23, 2026, offering four checkpoints, a live API, and open weights (MIT license) on Hugging Face. You can access it instantly in the web chat, call it in production via the API, or self-host it for on-premise deployment. This guide covers each path with actionable steps, tradeoffs, and a production-ready prompt workflow.


If you need an overview, start with what is DeepSeek V4. For API integration, read the DeepSeek V4 API guide. For zero-cost usage, see how to use DeepSeek V4 for free. When you’re ready to test, download Apidog to pre-build your API collection.

TL;DR

  • Fastest: chat.deepseek.com — Free chat UI, V4-Pro by default, three reasoning modes.
  • Production: Use https://api.deepseek.com/v1/chat/completions with deepseek-v4-pro or deepseek-v4-flash.
  • Self-hosted: Pull weights from Hugging Face, run the /inference code.
  • Choose Non-Think for fast routing/classification, Think High for code/analysis, Think Max for accuracy-critical tasks.
  • Recommended sampling: temperature=1.0, top_p=1.0.
  • Use Apidog as your API client. The OpenAI-compatible format allows easy replay across DeepSeek, OpenAI, and Anthropic.

Pick the right path for your workload

Choose the integration that fits your needs:

Path | Cost | Setup time | Best for
chat.deepseek.com | Free | 30 seconds | Quick tests, ad-hoc work
DeepSeek API | Per-token billing | 5 minutes | Production, agents, batch jobs
Self-hosted V4-Flash | Hardware cost only | A few hours | On-prem compliance, offline inference
Self-hosted V4-Pro | Cluster cost only | A day | Research, custom fine-tunes
OpenRouter / aggregator | Per-token billing | 2 minutes | Multi-provider fallback

Path 1: Use V4 in the web chat

For the quickest evaluation, use the official chat interface:

  1. Go to chat.deepseek.com.
  2. Sign in (email, Google, or WeChat).
  3. V4-Pro is the default. Toggle between Non-Think, Think High, and Think Max at the top.
  4. Enter your prompt and send.

Web chat supports file uploads, web search, and full 1M-token context. Rate limits are per account; heavy use may slow responses.

Best for: error trace diagnosis, summarizing large PDFs, quick benchmark tests. Not suitable for automation or repeatable workflows.

Path 2: Use the DeepSeek API

The API is OpenAI-compatible and ready for production. Model IDs (deepseek-v4-pro, deepseek-v4-flash) are stable.

Get an API key

  1. Sign up at platform.deepseek.com.
  2. Add a payment method (top-ups from $2).
  3. Create an API key under API Keys and copy it (you won’t see it again).

Set the key in your environment:

export DEEPSEEK_API_KEY="sk-..."

Minimum viable request (curl)

curl https://api.deepseek.com/v1/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-pro",
    "messages": [
      {"role": "user", "content": "Refactor this Python function to async. Reply with code only."}
    ],
    "thinking_mode": "thinking"
  }'

Switch deepseek-v4-pro to deepseek-v4-flash for lower cost. Use non-thinking mode for faster outputs.

Python client (OpenAI SDK)

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a concise senior engineer."},
        {"role": "user", "content": "Explain the CSA+HCA hybrid attention stack."},
    ],
    extra_body={"thinking_mode": "thinking_max"},  # DeepSeek-specific params go through extra_body
    temperature=1.0,
    top_p=1.0,
)

print(response.choices[0].message.content)

Any OpenAI-compatible library (LangChain, LlamaIndex, DSPy) works; just point it at the DeepSeek base URL.
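
For example, a LangChain setup needs only the base URL and key swapped; a minimal sketch, assuming a current langchain-openai release (the model ID and endpoint are the ones used throughout this guide):

import os

# pip install langchain-openai
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at DeepSeek's OpenAI-compatible endpoint.
llm = ChatOpenAI(
    model="deepseek-v4-pro",
    base_url="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    temperature=1.0,
)

print(llm.invoke("Summarize the tradeoffs between V4-Pro and V4-Flash.").content)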

Node client (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek-v4-flash",
  messages: [{ role: "user", content: "Write a fizzbuzz in Rust." }],
  temperature: 1.0,
  top_p: 1.0,
});

console.log(response.choices[0].message.content);

See the DeepSeek V4 API guide for endpoint details, parameters, and error handling.

Path 3: Iterate with Apidog

For repeated API calls, Apidog streamlines your workflow and helps manage credits.

  1. Download Apidog for your OS: Mac, Windows, Linux.
  2. Create a new API project. Add a POST request to https://api.deepseek.com/v1/chat/completions.
  3. Add Authorization: Bearer {{DEEPSEEK_API_KEY}} as a header. Store the key in environment variables.
  4. Paste your JSON request body and save. Replay or tweak with a click.
  5. Use the response viewer to compare outputs (e.g., Non-Think vs Think Max modes).

You can store multiple requests (OpenAI, Claude, DeepSeek) side by side for easy A/B testing and billing visibility. To migrate, simply change the base URL in your existing GPT-5.5 API collection.

Path 4: Self-host V4-Flash

The MIT license allows full self-hosting. This is ideal for compliance, air-gapped environments, or custom economics.

Hardware requirements

  • V4-Flash (13B active, 284B total): 2–4 H100/H200/MI300X GPUs (FP8). Quantized INT4 fits on a single 80GB card for small batches.
  • V4-Pro (49B active, 1.6T total): Requires 16–32 H100s for production inference.

Download model weights

# Install Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# (Optional) Log in to avoid anonymous download rate limits
huggingface-cli login

# Download V4-Flash weights
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./models/deepseek-v4-flash \
  --local-dir-use-symlinks False

V4-Flash is ~500GB at FP8; V4-Pro is several TB.

Run inference (vLLM)

pip install "vllm>=0.9.0"

# Shard across 4 GPUs and expose the full 1M-token context window
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --dtype auto

Once running, point OpenAI-compatible clients to http://localhost:8000/v1. You can use the same Apidog collection with the new base URL.
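
For example, the Python client from Path 2 needs only a different base URL; a minimal sketch (vLLM only checks the API key if you start the server with --api-key):

from openai import OpenAI

# Same OpenAI SDK as the hosted API, pointed at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; vLLM ignores it unless --api-key is set
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",  # served model name defaults to the HF repo ID
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)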

Prompting V4 effectively

DeepSeek V4’s prompt handling differs from GPT-5.5 or Claude. For best results:

  1. Set thinking_mode explicitly for each task.
  2. Use system prompts for persona only, not task instructions. Place the main task in the user message.
  3. For code tasks, provide a test harness or failing test case to increase solution accuracy.

For long-context prompts, keep the most relevant info at the beginning and end. V4’s attention is optimized, but recency and primacy effects remain.
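
Putting those guidelines together, a code-fix request might look like this sketch (it reuses the thinking_mode value shown earlier; the function and failing test are placeholders):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

task = """Fix parse_price so the failing test passes. Reply with code only.

def parse_price(s):
    return float(s.strip("$"))

def test_parse_price_with_commas():
    assert parse_price("$1,299.00") == 1299.0  # currently raises ValueError
"""

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        # System prompt carries persona only; the task lives in the user message.
        {"role": "system", "content": "You are a meticulous Python maintainer."},
        {"role": "user", "content": task},
    ],
    extra_body={"thinking_mode": "thinking"},  # set the mode explicitly per task
    temperature=1.0,
    top_p=1.0,
)
print(response.choices[0].message.content)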

Cost control

To prevent overspending, apply these safeguards:

  • Default to V4-Flash. Use V4-Pro only for proven quality needs.
  • Default to Non-Think. Use Think High or Think Max as required.
  • Set max_tokens. The 1M context is a limit, not a target. Most responses fit in 2,000 tokens.

In Apidog, use environment variables for DEEPSEEK_API_KEY to separate test and production billing. Apidog also records token counts per response to help spot runaway prompts.
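
A budget-friendly default request combines V4-Flash with a max_tokens cap; a sketch using the same OpenAI-compatible parameters as above (the classification prompt is a placeholder):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # cheaper default; upgrade only when quality demands it
    max_tokens=2000,            # cap output so the 1M context never becomes a cost target
    messages=[{"role": "user", "content": "Classify this ticket as bug, feature, or question: ..."}],
)

# Track spend per call; usage fields follow the OpenAI response shape.
print(response.usage.prompt_tokens, response.usage.completion_tokens)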

Migrating from DeepSeek V3 or other models

  • From deepseek-chat / deepseek-reasoner: Change the model ID to deepseek-v4-pro or deepseek-v4-flash. The old IDs are deprecated on July 24, 2026.
  • From OpenAI GPT-5.x: Change the base URL to https://api.deepseek.com/v1 and swap the model ID. The request shape is otherwise unchanged. See the GPT-5.5 API guide for reference.
  • From Anthropic Claude: Use https://api.deepseek.com/anthropic for the Anthropic message format, or convert to OpenAI format for the main endpoint.
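
If you stay on the Anthropic SDK, the switch is again a base URL change; a minimal sketch, assuming the Anthropic Python SDK and the endpoint listed above:

import os
import anthropic

# Anthropic SDK pointed at DeepSeek's Anthropic-compatible endpoint.
client = anthropic.Anthropic(
    base_url="https://api.deepseek.com/anthropic",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

message = client.messages.create(
    model="deepseek-v4-pro",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello from the Anthropic message format."}],
)
print(message.content[0].text)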

FAQ

Do I need a paid account? Web chat is free. API access requires a minimum $2 top-up. See how to use DeepSeek V4 for free for no-cost options.

Which variant should I use? Start with V4-Flash in Non-Think mode, measure quality, and only upgrade if necessary.

Can I run V4 on a MacBook? V4-Flash runs on M3/M4 Max with 128GB RAM (INT4, slow). V4-Pro requires much more. For laptops, use the API or web chat.

Does V4 support function calling? Yes. The OpenAI-compatible endpoint accepts the standard tools array, and responses return tool_calls. The Anthropic endpoint uses its native tool schema.
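
A minimal tool definition in the standard OpenAI format (a sketch; the weather function is a placeholder):

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as a JSON string.
print(response.choices[0].message.tool_calls)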

How do I stream responses? Set stream: true in your request body. Responses are OpenAI-compatible SSE streams; any compatible library works out of the box.
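
With the Python SDK, streaming looks like this sketch (deltas follow the standard OpenAI streaming shape):

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com/v1")

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Stream a haiku about GPUs."}],
    stream=True,
)

# Each chunk carries a small delta; print tokens as they arrive.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)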

Are there rate limits? Hosted API has per-tier limits (api-docs.deepseek.com). Self-hosted: limited only by hardware.
