Running DeepSeek-R1 Model on Your Local Machine

I remember that DeepSeek is a Chinese company that has developed various AI models, including the DeepSeek-R1-Large-Preview. They have open-sourced some models on Hugging Face.

First, I need to outline the prerequisites. The user will need Python installed, probably Python 3.8 or higher. They'll need a code editor like VS Code. Hardware-wise, since it's a large model, a GPU with sufficient VRAM is necessary. Maybe mention that without a GPU, it's possible but slow. Then, they need to install libraries like transformers, torch, accelerate, and sentencepiece. Oh right, the transformers library from Hugging Face is essential for loading the model.

Next, obtaining the model. The model is on Hugging Face, so they need to request access. Once granted, they can use the Hugging Face CLI or Git LFS to download it. Maybe provide the commands for that. Also, logging into Hugging Face from the CLI might be required.

Then, writing the Python script. Import AutoModelForCausalLM and AutoTokenizer from transformers. Load the model using the model name, and set the device map to auto for multi-GPU support. They need to make sure the model path is correct, pointing to where they downloaded it.

In the inference example, show how to create a prompt, tokenize it, generate a response with parameters like max_new_tokens and temperature. Then decode the tokens back to text. Also, mention that the model uses ChatML format, so the prompt should follow that structure with <|im_start|> and <|im_end|> tags.

Hardware requirements are important. The model's size affects VRAM usage. For the 7B parameter model, maybe 16GB of VRAM is needed, and for the 13B, 24GB. Quantization could help reduce memory usage, so briefly touch on using bitsandbytes for 4-bit quantization.

Common issues might include out-of-memory errors, which can be addressed by using a smaller model or quantization. Slow inference on CPU is another point to mention. Also, ensuring that the model is properly downloaded and the path is correct.

Finally, wrap it up with a conclusion about the steps and possible use cases. Maybe add a note about ethical use and computational resources.

Wait, did I miss anything? Oh, right, the user might not be familiar with Hugging Face's access process. Should explain that they need to go to the model's page, click on "Request Access," and wait for approval. Also, when using Git LFS, they need to have it installed. And in the code, using device_map="auto" allows the model to utilize available GPUs automatically. Also, mention the use of fp16 to save memory.

I should check if the model requires any specific tokenizer settings or special tokens. The example prompt uses ChatML, so that's correct. Maybe suggest adjusting temperature and max tokens for different results. Also, the difference between the base and instruct models: if they're using the instruct version, they need to format the prompt accordingly.

What about dependencies? They need to install transformers, torch, accelerate, and sentencepiece. Maybe specify the pip install commands. Also, note that PyTorch should be installed with CUDA support if using a GPU.

Possible troubleshooting: if they get an error about not being logged in, they need to run huggingface-cli login. Also, if the model path is incorrect, the script will fail. So make sure the path is correct.

I think that covers the main points. Structure the answer step-by-step, starting with prerequisites, setup, code example, hardware notes, common issues, and conclusion. Keep it clear and concise, avoiding unnecessary jargon.

Here's a step-by-step guide to running the DeepSeek-R1-Large-Preview model (or similar variants) on your local machine:

**Prerequisites**

Hardware Requirements:

- GPU with at least 16GB of VRAM (for 7B-parameter models) or 24GB of VRAM (for 13B-parameter models).
- Sufficient RAM (32GB+ recommended for larger models).
- NVIDIA drivers and the CUDA toolkit installed (for GPU acceleration).

Software Requirements:

- Python 3.8+.
- PyTorch with CUDA support.
- Hugging Face libraries (transformers, accelerate, sentencepiece).

Model Access:

- Request access to the DeepSeek-R1 model on Hugging Face Hub (example link). Note: approval may be required from the model's authors.

**Step 1: Install Dependencies**

Install the required Python packages:

```bash
pip install torch transformers accelerate sentencepiece
```
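
If you plan to use a GPU, make sure the PyTorch build you install includes CUDA support. A typical install command (the cu121 index below is just an example; pick the build that matches your driver and CUDA toolkit from pytorch.org):

```bash
# Example: PyTorch built against CUDA 12.1 (adjust cu121 to match your setup)
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```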
**Step 2: Download the Model**

After gaining access to the model on Hugging Face:

Option 1: Use git-lfs to clone the repository:

```bash
git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-r1-large-preview
```
Option 2: Load the model directly via the Hugging Face API (requires authentication):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-r1-large-preview"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```
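
Either way, a gated repository usually requires that your machine is authenticated with Hugging Face first. A minimal example, assuming you have created an access token in your Hugging Face account settings:

```bash
# Log in once; the token is cached locally and reused by transformers and git-lfs
huggingface-cli login
```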
**Step 3: Write Inference Code**

Create a Python script (e.g., deepseek_inference.py) to load the model and generate text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_path = "deepseek-r1-large-preview"  # Path to local model or Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # Auto-select GPU/CPU
    torch_dtype="auto",  # Automatically choose FP16/FP32
)

# Create a prompt
prompt = "<|im_start|>user\nWrite a poem about AI.<|im_end|>\n<|im_start|>assistant\n"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

# Decode and print output
response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
```
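
The decoded string above includes the prompt and the special tokens. As an optional refinement (not part of the original script), you can slice off the prompt tokens before decoding so only the model's reply is printed:

```python
# Decode only the newly generated tokens, dropping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```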
**Key Notes**

Model Format:

The DeepSeek-R1 model uses the ChatML format for prompts. Use the <|im_start|> and <|im_end|> tokens to structure your input (similar to OpenAI's ChatML).

Example prompt:

```
<|im_start|>user\nYour question here<|im_end|>\n<|im_start|>assistant\n
```
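
As a side note, many instruct-tuned checkpoints on Hugging Face ship a chat template with their tokenizer. If this one does (I have not verified that), you can let the tokenizer assemble the ChatML string instead of concatenating tags by hand:

```python
# Only works if tokenizer.chat_template is defined for this checkpoint;
# otherwise fall back to the manual <|im_start|>/<|im_end|> format above.
messages = [{"role": "user", "content": "Your question here"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string, not token IDs
    add_generation_prompt=True,  # append the assistant-turn opening tag
)
print(prompt)
```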
Quantization (Optional):

Reduce VRAM usage by loading the model in 4-bit or 8-bit mode:

```python
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,  # Requires bitsandbytes library
    device_map="auto",
)
```
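
Note that newer transformers releases deprecate passing load_in_4bit directly and prefer a BitsAndBytesConfig object; a roughly equivalent sketch (still assuming the bitsandbytes package is installed):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization via the newer config object
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto",
)
```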
Performance Tips:

- Use fp16 precision for faster inference: torch_dtype=torch.float16.
- Adjust max_new_tokens to control response length.
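
For example, the fp16 tip amounts to passing an explicit dtype when loading the model (a minimal sketch, assuming a CUDA-capable GPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Half precision roughly halves VRAM usage compared with FP32
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
```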

**Common Errors & Fixes**

Out-of-Memory (OOM) Errors:

- Reduce max_new_tokens.
- Use quantization (load_in_4bit=True).
- Upgrade your GPU hardware.

Model Not Found:

- Ensure you’ve downloaded the model locally or have access to the Hugging Face repository.

CUDA Errors:

- Verify CUDA and PyTorch compatibility (a quick check is sketched below).
- Reinstall PyTorch with CUDA support: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121.
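
A quick way to verify the CUDA/PyTorch pairing before digging further:

```python
import torch

# Sanity-check that PyTorch sees a usable GPU
print(torch.__version__)          # Installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # True if a GPU is detected and usable
```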

**Example Use Cases**

- Code Generation: Fine-tune for coding tasks.
- Chat Applications: Deploy as a local chatbot (a minimal loop is sketched below).
- Research: Experiment with alignment techniques or model behavior.
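
To make the chatbot use case concrete, here is a minimal interactive loop reusing the model and tokenizer from Step 3. This is only a sketch: conversation history is naively concatenated as a ChatML string and will eventually overflow the context window.

```python
# Minimal local chat loop (assumes `model` and `tokenizer` from Step 3 are loaded)
history = ""
while True:
    user_msg = input("You: ")
    if user_msg.strip().lower() in {"exit", "quit"}:
        break
    history += f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n"
    inputs = tokenizer(history, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
    reply = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    print(f"Assistant: {reply}")
    history += f"{reply}<|im_end|>\n"
```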

**Ethical Considerations**

- Ensure compliance with the model's license (check the Hugging Face model card).
- Avoid misuse for harmful or unethical purposes.

By following these steps, you can run DeepSeek-R1 locally for research or development. Adjust parameters based on your hardware and use case!


Top comments (2)

Clay Ferguson

Thanks for sharing this and especially for including hardware requirements!

Panchi Shaik

Deepseek giving deepseek tutorial
Even thought process is there lol