DeepSeek-R1 is making waves in the AI community as a powerful open-source reasoning model, offering advanced capabilities that challenge industry leaders like OpenAI’s o1 without the hefty price tag. This cutting-edge model is built on a Mixture of Experts (MoE) architecture and features a whopping 671 billion parameters while efficiently activating only 37 billion during each forward pass. This approach helps balance performance and efficiency, and makes this model highly scalable and cost-effective. What sets DeepSeek-R1 apart is its unique reinforcement learning (RL) methodology, which enables it to develop chain-of-thought reasoning, self-verification, and reflection autonomously. These qualities make it an exceptional tool for tackling complex challenges across diverse fields like math, coding, and logical reasoning.
Unlike traditional LLMs, DeepSeek-R1 provides better insights into its reasoning processes and delivers optimized performance on key benchmarks.
On several benchmarks, DeepSeek-R1 matches or outperforms top models such as OpenAI's o1 and Claude 3.5 Sonnet.
Several methods exist for installing DeepSeek-R1 locally on your machine (or a VM). In this guide, we cover the three best and simplest approaches to quickly setting up and running this model. By the end of this article, you'll be able to decide which method best suits your requirements.
Prerequisites
The minimum system requirements for running a DeepSeek-R1 model:
- Disk Space: 500 GB (may vary across models)
- Nvidia Cuda or Jupyter Notebook installed
GPU configuration requirements for each model variant are as follows:
We recommend taking a screenshot of this chart and saving it somewhere, so that you can quickly look up the GPU prerequisites before trying a model.
Step-by-step process to install DeepSeek-R1 locally
For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.
Step 1: Setting up a NodeShift Account
Visit app.nodeshift.com and create an account by filling in basic details, or continue signing up with your Google/GitHub account.
If you already have an account, log in and head straight to your dashboard.
Step 2: Create a GPU Node
After accessing your account, you should see a dashboard (see image). Now:
1) Navigate to the menu on the left side.
2) Click on the GPU Nodes option.
3) Click on Start to start creating your very first GPU node.
These GPU nodes are GPU-powered virtual machines provided by NodeShift. They are highly customizable and let you control the configuration, from GPUs (ranging from H100s to A100s) to vCPUs, RAM, and storage, according to your needs.
Step 3: Selecting configuration for GPU (model, region, storage)
1) For this tutorial, we'll be using an RTX 4090 GPU; however, you can choose any GPU that fits your needs.
2) Similarly, we'll opt for 700 GB of storage by sliding the bar. You can also select the region where you want your GPU to reside from the available options.
Step 4: Choose GPU Configuration and Authentication method
1) After selecting your required configuration options, you'll see the available VMs in your region that match (or closely match) your configuration. In our case, we'll choose a 2x RTX 4090 GPU node with 64 vCPUs/129GB RAM/700 GB SSD.
2) Next, select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are the more secure option. To create one, head over to our official documentation.
Step 5: Choose an Image
The final step is to choose an image for the VM. In our case, that's Nvidia Cuda, on which we'll deploy and run inference of our model through Ollama and vLLM. If you're deploying using Transformers, choose the Jupyter Notebook image instead.
That's it! You are now ready to deploy the node. Review the configuration summary, and if it looks good, click Create to deploy the node.
Step 6: Connect to active Compute Node using SSH
1) As soon as you create the node, it will be deployed in a few seconds or a minute. Once deployed, you will see a status Running in green, meaning that our Compute node is ready to use!
2) Once your GPU shows this status, navigate to the three dots on the right, click on Connect with SSH, and copy the SSH details that appear.
Once you've copied the details, follow the steps below to connect to the running GPU VM via SSH; an illustrative example of the command is shown after the steps:
1) Open your terminal, paste the SSH command, and run it.
2) In some cases, your terminal may ask for confirmation before connecting. Type 'yes'.
3) A prompt will then request a password. Type the SSH password, and you should be connected.
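For reference, the command you paste usually looks something like the sketch below; the key path, user, host IP, and port here are placeholders, so use the exact values shown in your own SSH details.
# Illustrative only -- replace the key path, user, host, and port with your node's SSH details
ssh -i ~/.ssh/id_ed25519 root@<node-ip> -p <port>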
Output:
Installation using Ollama
Ollama is a user-friendly option for quickly running DeepSeek-R1 locally with minimal configuration. It's best suited for individuals or small-scale projects that don't require extensive optimization or scaling.
Before starting the installation steps, feel free to check your GPU configuration details by using the following command:
nvidia-smi
Output:
The first installation method is through Ollama. To install DeepSeek-R1 with Ollama, follow the steps below:
1) Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Output:
2) Confirm installation by checking the version.
ollama --version
Output:
3) Start Ollama.
Once the installation is done, we'll start the Ollama server in the current terminal and do the rest of the operations in a new terminal.
ollama serve
Output:
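Optionally, if you'd rather not keep a dedicated terminal occupied, you can run the server in the background instead; this is a generic shell pattern rather than anything Ollama-specific, and the log file name is arbitrary:
# Run the Ollama server in the background and write its logs to ollama.log
nohup ollama serve > ollama.log 2>&1 &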
Now that our Ollama server has been started, let's install the model.
4) Open a new terminal window and run the ollama command to check if everything is up and running and to see a list of Ollama commands.
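For reference, running ollama with no arguments should print the usage help along with the list of available subcommands:
ollama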
Output:
5) Run the DeepSeek-R1 model with the following command (replace <MODEL_CODE> with your preferred model type, e.g., 70b):
ollama run deepseek-r1:<MODEL_CODE>
Output:
The model will take some time to finish downloading; once it's done, we can move forward with model inference.
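If you want to double-check which models are present on the machine at any point, Ollama can list the downloaded models:
ollama list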
6) Give prompts for model inference.
Once the download is complete, Ollama will automatically open a console for you to type and send prompts to the model. This is where you can chat with the model. For example, it generated the following response (shown in the images) for the prompt given below:
"Explain the difference between monorepos and turborepos"
Output:
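Besides the interactive console, the Ollama server you started earlier also exposes a local HTTP API (on port 11434 by default). As a minimal sketch, assuming you pulled the 70b variant, you can send the same prompt programmatically like this:
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:70b",
  "messages": [
    { "role": "user", "content": "Explain the difference between monorepos and turborepos" }
  ],
  "stream": false
}'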
Installation using vLLM
vLLM is designed for efficient inference with optimized memory usage and high throughput, which makes it ideal for production environments. Choose this if you need to serve large-scale applications with performance and cost efficiency in mind.
In the upcoming steps, you'll see how to install DeepSeek-R1 using vLLM.
Make sure you have a fresh server for this setup. If you've already installed the model using Ollama, either skip this method or run it on a new server to avoid running out of memory.
1) Confirm if Python is installed.
python3 -V
Output:
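If the command reports that Python is missing (unlikely on the Nvidia Cuda image, but possible on a very minimal one), you can install it with apt before continuing:
apt update && apt install -y python3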
2) Install pip.
apt install -y python3-pip
Output:
3) Install Rust and Cargo packages as dependencies for vLLM using rustup.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Output:
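After the rustup installer finishes, you may need to load Cargo's environment into the current shell so that the rustc and cargo commands below are found; rustup prints this hint at the end of its installation:
source "$HOME/.cargo/env"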
4) Confirm the installation.
rustc --version
cargo --version
Output:
5) Install vLLM.
pip install vllm
Output:
As shown in the above image, you may encounter an error in the middle of the installation process because of an incompatible version of transformers. To fix this, run the following command:
pip install transformers -U
Output:
After fixing the error, run the vllm installation command again, and it should complete without any errors.
6) Load and run the model.
For the scope of this tutorial, we'll run the DeepSeek-R1-Distill-Llama-8B model with vLLM. In the command, do not forget to include --max-model-len 4096 to cap the maximum sequence length; otherwise, the server may run out of memory.
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" --max-model-len 4096
Output:
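Once the server is up, you can quickly confirm it is reachable and see which model it is serving; the /v1/models endpoint is part of the OpenAI-compatible API that vLLM exposes:
curl http://localhost:8000/v1/models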
7) Open a new terminal and call the model server using the following command.
Replace the "content" attribute with your prompt. For example, our prompt is "Tell me the recipe for tea".
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {
        "role": "user",
        "content": "Tell me the recipe for tea"
      }
    ]
  }'
Output:
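If you'd rather call the server from Python than from curl, vLLM's OpenAI-compatible endpoint also works with the openai client library. The following is a minimal sketch under that assumption; install the client with pip install openai first, and note that the api_key value is a dummy placeholder, since the local server doesn't check it by default.
# Minimal sketch: query the local vLLM server through its OpenAI-compatible API
from openai import OpenAI

# The local server does not validate the API key, so any placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Tell me the recipe for tea"}],
)

print(response.choices[0].message.content)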
Installation using Transformers
Transformers offers maximum flexibility and control for fine-tuning and experimenting with DeepSeek-R1. It's the best choice for developers and researchers who need to customize models for their specific use cases and experiment with various training or inference configurations.
In this section, you will learn how to install the model using Transformers. We'll install and run the model with Python code in a Jupyter Notebook.
1) To use the built-in Jupyter Notebook functionality on your remote server, follow the same steps (Step 1 — Step 6) to create a new GPU instance, but this time, select the Jupyter Notebook option instead of Nvidia Cuda in the Choose an Image section and deploy the GPU.
2) After the GPU is running, click Connect with SSH to open a Jupyter Notebook session in your browser.
3) Open a Python Notebook.
4) Install dependencies to run the model with Transformers.
!pip install transformers accelerate
Output:
5) Load and run the model using a pipeline from Transformers.
To demonstrate this method, we are running the DeepSeek-R1-Distill-Qwen-1.5B model. You can replace it with your preferred one as per your requirements.
# Use a pipeline as a high-level helper
from transformers import pipeline

# Chat-style input: a list of messages with roles and contents
messages = [
    {"role": "user", "content": "How can you help me?"},
]

# Download the model weights (on first run) and build a text-generation pipeline
pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# Generate a response to the chat messages
pipe(messages)
Output:
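One practical note: reasoning models like DeepSeek-R1 tend to produce long chains of thought, so you may want to cap the output length when experimenting. A hedged variation of the call above (the 512 value is just an example, not a recommended setting):
# Limit the number of newly generated tokens so the cell doesn't run for too long
output = pipe(messages, max_new_tokens=512)
print(output)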
Conclusion
In this guide, we've explored three different methods to install DeepSeek-R1 locally: Ollama, vLLM, and Transformers. Each offers unique benefits depending on your requirements, whether that's ease of use, performance optimization, or flexibility. By understanding these approaches, you can deploy DeepSeek-R1 in a way that best suits your workflow. With NodeShift Cloud, managing such deployments becomes even more streamlined: it provides a robust infrastructure that simplifies setup and enhances scalability, giving developers a seamless experience when harnessing the power of DeepSeek-R1 with minimal operational overhead.