Pipeline Parallelism in PyTorch

#mlops #llmops #ai #llm

As LLMs keep growing in size, I started wondering whether a model could be deployed across many machines. Training is routinely distributed, but inference rarely goes beyond one machine because serving a model requires far fewer resources.

When the model is big, many of us would request a single-GPU machine with enough memory to serve it. As a rule of thumb, you need at least 2 bytes of GPU memory per model parameter for inference, because in half precision (float16) each parameter takes 2 bytes. For example, LLaMA 13B has 13 giga-parameters (a giga is a billion), hence it requires 26 GB of GPU memory. Choose an NVIDIA A100 40GB and we are good to go.
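The rule of thumb is plain arithmetic. Here is a minimal sketch, assuming 2 bytes per parameter and counting only the weights (activations and caches need extra headroom on top). The function name `weights_gb` is made up for illustration.

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# LLaMA 13B at 2 bytes per parameter: 13e9 parameters * 2 bytes each
print(weights_gb(13e9))     # 26.0
# The same model at 4 bytes per parameter (32-bit floats)
print(weights_gb(13e9, 4))  # 52.0
```

At 4 bytes per parameter the same model no longer fits on a single 40 GB device, which is where multi-GPU and multi-machine setups come in.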

If the model is too big for one GPU, then a tool like Hugging Face Accelerate can help us run it on a single machine with multiple GPUs. Most cloud providers let you attach up to 8 GPUs to one machine.

Then there are projects like Alpa that enable the deployment of really big models like OPT 175B and BLOOM 176B on multi-machine, multi-device clusters. One drawback, though: the model needs to be supported by the project.

Doing further research, I came across PyTorch's PiPPy project. It features automatic splitting of model code, which means that you do not have to modify the model code to make parallelism work.

While consulting the PiPPy docs and source code, I did the following exercise to grasp the basics of this tool and of pipeline parallelism in general: running a model across two Docker containers.

The idea is to deploy heavyweight models, but for the sake of simplicity, let's use this small model.

example.py

import os
import pippy
import torch  # needed for torch.randn further down
from torch import nn
from torch.distributed import rpc

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 128)
        self.fc2 = nn.Linear(128,   8)

    def forward(self, x):
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)

        return x

net = Net()
net.eval()

Let's assume that we have a bunch of machines, all on the same network. The same script runs on every node. One node is the head node and the rest are worker nodes. Worker nodes must be able to reach the head node at a host:port. Each node has a rank number (the head node has rank 0), and the total count of nodes is the world.
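To make the roles of these settings concrete, here is a small sketch that reads them and checks that they are consistent before any networking starts. `NodeConfig` and `read_node_config` are hypothetical helpers made up for illustration; they are not part of PyTorch or PiPPy.

```python
import os
from typing import Mapping, NamedTuple


class NodeConfig(NamedTuple):
    rank: int         # this node's index, 0 for the head node
    world: int        # total number of nodes
    init_method: str  # address where every node meets the head node


def read_node_config(env: Mapping[str, str] = os.environ) -> NodeConfig:
    """Read RANK, WORLD, HOST and PORT and check that they are consistent."""
    rank, world = int(env["RANK"]), int(env["WORLD"])
    if not 0 <= rank < world:
        raise ValueError(f"RANK must be in [0, {world}), got {rank}")
    return NodeConfig(rank, world, f"tcp://{env['HOST']}:{env['PORT']}")


# Example with an explicit mapping instead of the real environment
cfg = read_node_config({"RANK": "1", "WORLD": "2", "HOST": "head", "PORT": "3000"})
print(cfg.init_method)  # tcp://head:3000
```

Failing early on an out-of-range rank is cheaper than waiting for the RPC handshake to time out.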

example.py (continuation)

RANK  = int(os.environ["RANK"])
WORLD = int(os.environ["WORLD"])
HOST  = os.environ["HOST"]
PORT  = os.environ["PORT"]
print(f"My rank is {RANK}")


# the first thing to do is to init RPC
print("Waiting for all nodes...")
rpc.init_rpc(
    f"worker{RANK}", # just an identifier
    rank=RANK,
    world_size=WORLD,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        num_worker_threads=8,
        rpc_timeout=10, # seconds
        init_method=f"tcp://{HOST}:{PORT}", # head node's address and port
    )
)

# split the model, each process materializes its pipeline stage
driver, stage = pippy.all_compile(
    net,
    num_ranks=WORLD,
    num_chunks=WORLD, # microbatching
    schedule="FillDrain", # feed chunks through the pipeline sequentially
    split_policy=pippy.split_into_equal_size(WORLD), # split the model into specified number of equal-size stages
)
print(stage)

if RANK == 0:
    x = torch.randn(4, 128)
    y = driver(x) # only rank 0 is able to call the pipeline's driver
    print(y)

rpc.shutdown()
print("Bye!")
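To build some intuition for micro-batching, here is a toy simulation of a fill-drain style schedule for forward passes only. It is a sketch of the general idea, not PiPPy's implementation: it assumes the input batch is cut into `num_chunks` pieces and that chunk `c` reaches stage `s` at time step `c + s`. The function name `fill_drain_timeline` is made up for illustration.

```python
from typing import List, Optional


def fill_drain_timeline(num_stages: int, num_chunks: int) -> List[List[Optional[int]]]:
    """For each time step, list which chunk each stage works on (None = idle).

    Forward passes only: chunk c reaches stage s at time step c + s.
    """
    steps = num_stages + num_chunks - 1
    timeline = []
    for t in range(steps):
        row = []
        for s in range(num_stages):
            c = t - s  # chunk that would be at stage s at time t
            row.append(c if 0 <= c < num_chunks else None)
        timeline.append(row)
    return timeline


# Two stages and two chunks, matching WORLD=2 in this example
for t, row in enumerate(fill_drain_timeline(2, 2)):
    print(t, row)
# 0 [0, None]
# 1 [1, 0]
# 2 [None, 1]
```

At step 1 both stages are busy at the same time, which is the whole point of splitting the batch: with a single chunk, one of the two nodes would always sit idle.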

In a terminal:

$ docker build -t example .
$ docker network create rpc
$ docker run -e RANK=0 -e WORLD=2 -e HOST=head -e PORT=3000 \
  --net rpc --name head --rm -it example

Notice how the HOST address equals the container's name.

In another terminal:

$ docker run -e RANK=1 -e WORLD=2 -e HOST=head -e PORT=3000 \
  --net rpc --name worker --rm -it example

This container is named worker, and HOST points to the rank 0 container, head.

Head output, after all processes join.

My rank is 0
Waiting for all nodes...

PipeStageModule(
  (fc1): Linear(in_features=128, out_features=128, bias=True)
)

def forward(self, x):
    fc1 = self.fc1(x);  x = None
    return fc1

tensor([[-0.1043,  ..., -0.0093],
        ...
        [-0.1566,  ..., -0.0765]])
Bye!

Worker output.

My rank is 1
Waiting for all nodes...

PipeStageModule(
  (fc2): Linear(in_features=128, out_features=8, bias=True)
)

def forward(self, fc1):
    relu = nn.functional.relu(fc1, inplace = False); fc1 = None
    fc2 = self.fc2(relu);  relu = None
    return fc2

Bye!

Next steps in this topic are:

use a GPU,
use more than one GPU device per machine,
deploy on local Kubernetes,
deploy on Google Cloud GKE,
use some really big models.

requirements.txt — CPU-only PyTorch

packaging # missing dependency from PiPPy
numpy
--index-url https://download.pytorch.org/whl/cpu
torch

Dockerfile

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

RUN git clone https://github.com/pytorch/PiPPy.git \
    && cd PiPPy \
    && python setup.py install \
    && cd ..

COPY . .

ENTRYPOINT ["python", "example.py"]
