DEV Community

Shivay Lamba

Fine-Tuning LLMs for Legal Tech: Nebius AI Cloud vs Nebius Token Factory — A Developer's Honest Comparison

A hands-on walkthrough of fine-tuning the same legal Q&A dataset on two very different platforms — and what it really costs.


Why Fine-Tune for Legal Tech?

Large language models are impressively general, but "general" is the enemy of "trustworthy" in legal work. A model that confidently summarizes UK legislation one moment and hallucinates a fictional statute the next isn't useful in production. Fine-tuning on a curated domain dataset — in our case, legislation Q&A pairs derived from real UK statutory text — teaches the model to stay grounded, adopt the right tone, and answer in the format lawyers actually expect.

The question isn't whether to fine-tune. It's how. Nebius offers two distinct surfaces for this: the Nebius AI Cloud (raw GPU VMs, full infrastructure control) and Nebius Token Factory (a managed, API-driven fine-tuning and inference service). We built the same legal model on both, and the experience couldn't be more different.


The Dataset: UK Legislation Q&A

Both pipelines use the same starting point: legislation_qa_clean.jsonl, a ~160-row curated chat dataset in OpenAI message format. Each record is a user question grounded in a statutory context, paired with a legally accurate answer. It's a small, high-quality dataset — exactly the kind that rewards fine-tuning over few-shot prompting.

```json
{
  "messages": [
    {"role": "user", "content": "What obligations does section 4 impose on employers under the Health and Safety at Work Act?"},
    {"role": "assistant", "content": "Section 4 requires every person who has control of premises used as a workplace to ensure, so far as is reasonably practicable, that the premises, the means of access and egress, and any plant or substance in the premises are safe and without risks to health."}
  ]
}
```
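Before uploading or training on the file, it's worth a quick structural check that every line parses and contains both a user and an assistant turn. A minimal stdlib sketch (the helper name is ours, not part of the project's scripts):

```python
import json

def validate_chat_record(line: str) -> bool:
    """Return True if a JSONL line is a well-formed user/assistant chat record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list):
        return False
    roles = [m.get("role") for m in messages]
    # Need a user turn followed (eventually) by an assistant turn.
    return ("user" in roles and "assistant" in roles
            and roles.index("user") < roles.index("assistant"))

sample = '{"messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A."}]}'
print(validate_chat_record(sample))  # → True
```

Running it over all ~160 lines takes milliseconds and catches exactly the kind of malformed record that otherwise corrupts training silently.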

Path 1: Nebius AI Cloud — Full Control, Full Complexity

The Architecture

The Cloud approach gives you a dedicated H100 GPU VM on Nebius, and then you build everything yourself. The pipeline looks like this:

```text
legislation.jsonl
    ↓ convert_legislation_to_qa.py
legislation_qa_clean.jsonl
    ↓ train_gemma.py (TRL SFTTrainer + LoRA)
gemma-legal-qa-clean-lora/          ← LoRA adapter weights
    ↓ merge_lora.py
gemma-legal-qa-clean-merged/        ← full merged weights
    ↓ serve.sh (vLLM)
http://localhost:8100/v1            ← OpenAI-compatible API
    ↓ api.py (FastAPI)
http://localhost:8000               ← legal-specific routes
    ↓ scripts/cloudflare_tunnel.sh  (optional HTTPS)
https://your-subdomain.trycloudflare.com
```

That's six stages before you serve a single inference request.

Step 1: Provision the VM and Set Up the Environment

You start by spinning up an H100 VM on Nebius Cloud. On-demand H100 pricing sits around $2.00–$2.49/hour, with dedicated GPU hosts reaching up to $4.00/hour for full isolation and no sharing. Once your VM is live, you run the bootstrap script:

```bash
chmod +x setup_and_train.sh
export HF_TOKEN=hf_your_token_here
./setup_and_train.sh
```

That script does a lot:

  1. apt-get installs build tools, git, Python headers
  2. Verifies nvidia-smi is reachable
  3. Creates a Python venv and installs PyTorch for CUDA 12.4 (torch==2.6.0)
  4. Installs transformers 5.5.0, accelerate, peft, trl, datasets, bitsandbytes
  5. Attempts to build flash-attn 2 for H100 speedup (gracefully skips if it fails)
  6. Runs interactive Hugging Face login — Gemma is a gated model

This alone takes 15–30 minutes on a fresh VM. If your HF token isn't set, step 6 blocks on interactive input. If flash-attn fails to compile in step 5 (it often does the first time), you lose another few minutes watching the build fail before the fallback kicks in.

Step 2: Train with TRL SFTTrainer + LoRA

Once setup completes, training starts automatically. The train_gemma.py script wraps HuggingFace's TRL SFTTrainer with PEFT LoRA:

```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=(
        r"^model\.language_model\.layers\.\d+\."
        r"(self_attn\.(q_proj|k_proj|v_proj|o_proj)|"
        r"mlp\.(gate_proj|up_proj|down_proj))$"
    ),
)
```

The model is google/gemma-4-E4B — Gemma 4's 4-billion-parameter multimodal variant. Vision and audio towers are frozen so training only updates language weights. Default hyperparameters: 5 epochs, learning rate 1e-4, max sequence length 1024, gradient accumulation 4 steps. On a single H100, this finishes in roughly 5–10 minutes for 160 rows.
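Those defaults translate into very few optimizer updates, which is why the run is so short. A quick back-of-envelope, assuming a hypothetical per-device batch size of 2 (the article doesn't state it):

```python
import math

rows = 160            # dataset size
epochs = 5
per_device_batch = 2  # assumption: not stated in the training script excerpt
grad_accum = 4

effective_batch = per_device_batch * grad_accum      # examples per optimizer step
steps_per_epoch = math.ceil(rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)  # → 8 100
```

On an H100, a hundred optimizer steps over a 4B model with LoRA is minutes of work, consistent with the 5–10 minute figure above.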

One subtle gotcha in the training config: use_liger_kernel=False is hardcoded because Liger Kernel causes CUDA illegal access errors on Gemma 4. This is the kind of model-specific trap you only find by hitting the error.

Step 3: Merge the LoRA Adapter

vLLM doesn't load PEFT adapters natively in this setup — it expects a single merged checkpoint. You must run a separate merge step:

```bash
python merge_lora.py \
  --base_model google/gemma-4-E4B \
  --adapter_path ./gemma-legal-qa-clean-lora \
  --output_path ./gemma-legal-qa-clean-merged
```

Internally, this loads the full base model again into GPU memory (torch_dtype=torch.bfloat16, device_map="auto"), loads the adapter on top with PeftModel.from_pretrained, and then calls merge_and_unload(). On an H100 with 80GB VRAM this works fine, but it's another 5–10 minute step that consumes billable GPU time while producing no training progress.

Step 4: Serve with vLLM

```bash
./serve.sh
```

Which expands to:

```bash
exec vllm serve "$MODEL_PATH" \
    --host 0.0.0.0 --port 8100 \
    --served-model-name legal-lora \
    --chat-template "$SCRIPT_DIR/chat_template.jinja" \
    --chat-template-content-format string \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90
```

The --chat-template-content-format string flag is critical for Gemma. Without it, the Jinja template mis-handles messages and system prompts leak into generation. vLLM takes 2–4 minutes to load the merged weights and warm up before it starts accepting requests.
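Once the server is up, any OpenAI-compatible client can talk to port 8100. A sketch of the request payload (the helper and sampling values are illustrative; the served model name comes from serve.sh):

```python
def build_chat_request(question: str, model: str = "legal-lora") -> dict:
    """Build an OpenAI-style chat-completions payload for the local vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
        "temperature": 0.2,  # low temperature: legal answers should be conservative
    }

payload = build_chat_request("What does section 4 require of employers?")
# POST this to http://localhost:8100/v1/chat/completions,
# or use the openai SDK with base_url="http://localhost:8100/v1".
print(payload["model"])  # → legal-lora
```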

Step 5: FastAPI Legal Layer + Optional HTTPS Tunnel

The final layer adds legal-specific routes (/v1/legal/analyze, /v1/legal/chat) with a default UK legislation system prompt, plus optional Cloudflare quick tunnels for HTTPS without opening firewall ports:

```bash
WITH_TUNNEL=1 ./run.sh
```
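Stripped of the FastAPI scaffolding, the core of the legal layer is just prepending a system prompt before forwarding messages to vLLM. A sketch (the prompt wording and helper are illustrative, not the project's actual api.py code):

```python
# Assumed system prompt; the project's actual wording may differ.
LEGAL_SYSTEM_PROMPT = (
    "You are a UK legislation assistant. Answer only from the statutory "
    "context provided and say so when the answer is not in the text."
)

def to_upstream_messages(user_messages: list[dict]) -> list[dict]:
    """Prepend the legal system prompt unless the caller already set one."""
    if user_messages and user_messages[0].get("role") == "system":
        return user_messages
    return [{"role": "system", "content": LEGAL_SYSTEM_PROMPT}] + user_messages

msgs = to_upstream_messages([{"role": "user", "content": "Summarise section 4."}])
print(len(msgs))  # → 2
```

Everything else in the layer (routing, timeouts, streaming passthrough) is standard proxy plumbing.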

The Real Cost on Nebius Cloud

| Stage | Duration | GPU Running? | Cost @ $2.49/hr |
| --- | --- | --- | --- |
| VM setup + dependencies | ~25 min | Yes (idle) | ~$1.04 |
| Training (160 rows, 5 epochs) | ~10 min | Yes (active) | ~$0.42 |
| Merge LoRA | ~8 min | Yes (active) | ~$0.33 |
| vLLM startup | ~4 min | Yes (idle) | ~$0.17 |
| **Total to first inference** | **~47 min** | | **~$1.96** |

And then the VM keeps billing at $2.49/hr (or up to $4/hr for a fully dedicated host) for every hour you leave it running to serve requests. Traffic at 3am? You're paying the same rate.

For dedicated GPU hosting at $4/hour running 24/7, that's ~$2,880/month before storage or egress.


Path 2: Nebius Token Factory — Fewer Lines, Less Everything

The Architecture

```text
legislation_qa_clean.jsonl
    ↓ sanitize_dataset()          ← normalize roles, validate
artifacts/legislation_qa_clean.nebius.jsonl
    ↓ upload_training_file()      ← client.files.create()
file-id
    ↓ create_finetune_job()       ← client.fine_tuning.jobs.create()
job-id
    ↓ wait_for_job()              ← poll every 30s
fine-tuned model checkpoint
    ↓ create_custom_model()       ← POST /v0/models
deployed model endpoint
    ↓ smoke_test()                ← client.chat.completions.create()
```

No VM. No CUDA. No vLLM. No merge step. No FastAPI. No Cloudflare tunnel. The whole flow runs from your laptop.

Step 0: Install Dependencies

```bash
pip install openai
```

That is the entire dependency list. One package. The Token Factory API is OpenAI-compatible, so you're using the standard openai Python SDK against a different base_url.

Step 1: Sanitize the Dataset

The sanitizer isn't just cosmetic — Token Factory enforces strict message validation. It repairs malformed role names (the first record in legislation_qa_clean.jsonl actually stores the user question in the role field, which would silently corrupt training without this step), normalizes aliases ("human" → "user", "bot" → "assistant"), and drops records missing either a user or assistant turn.

```python
report = sanitize_dataset(DATASET_PATH, CLEAN_DATASET_PATH)
# → {"total_records": 160, "kept_records": 158, "dropped_records": 2, "repaired_records": 12}
```

Step 2: Upload, Train, Monitor — Three API Calls

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=NEBIUS_API_KEY)

# Upload
training_file = client.files.create(file=open(dataset_path, "rb"), purpose="fine-tune")

# Create job
job = client.fine_tuning.jobs.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    training_file=training_file.id,
    suffix="legislation-qa-lora",
    hyperparameters={
        "n_epochs": 4,
        "learning_rate": 1e-5,
        "lora": True,
        "lora_r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0.05,
        "packing": True,
    },
    seed=42,
)

# Poll until the job reaches a terminal state
while job.status not in {"succeeded", "failed", "cancelled"}:
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"status={job.status} trained_tokens={job.trained_tokens}")

Hyperparameters are still yours to control — rank, alpha, dropout, learning rate, epochs. The service handles GPU allocation, scheduling, and checkpointing transparently.
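The bare polling loop works, but a timeout guard is cheap insurance against a stuck job burning your patience. A hedged sketch of a wait_for_job helper, with the fetch call injected so it's testable without hitting the API (signature is ours, not the project's):

```python
import time

def wait_for_job(fetch, poll_seconds: float = 30.0, timeout_seconds: float = 7200.0):
    """Poll fetch() until the job reaches a terminal status or we time out.

    fetch is any zero-arg callable returning an object with a .status attribute,
    e.g. lambda: client.fine_tuning.jobs.retrieve(job_id).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch()
        if job.status in {"succeeded", "failed", "cancelled"}:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"fine-tune still {job.status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```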

Step 3: Deploy in Four Lines

```python
import requests

response = requests.post(
    "https://api.tokenfactory.nebius.com/v0/models",
    json={
        "source": f"{job_id}:{checkpoint_id}",
        "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "name": "legislation-qa-private",
    },
    headers={"Authorization": f"Bearer {NEBIUS_API_KEY}"},
)
```

After the model status becomes "active", you call it exactly like any other model — same SDK, same endpoint, just your model name:

```python
client.chat.completions.create(
    model="legislation-qa-private",
    messages=[{"role": "user", "content": "What does the Act require of employers?"}]
)
```

No vLLM install. No port management. No Cloudflare tunnel. The Token Factory handles all of that as a managed service, including a built-in smoke test after deployment.


Head-to-Head Comparison

Pricing

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Compute model | Per-hour VM billing | Per-token (training + inference) |
| H100 on-demand | ~$2.00–$2.49/hr | N/A (no GPU provisioning) |
| Dedicated host | Up to ~$4.00/hr | N/A |
| Training cost | GPU hours × rate (idle + active) | training_price_per_million_tokens × epochs |
| Idle serving cost | Full GPU rate 24/7 | Zero (serverless inference) |
| Minimum spend | 1hr minimum billing window | Only what you actually train/infer |
| Storage | Managed disk billed separately | Included |

Ease of Development

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Dependencies | PyTorch, CUDA, TRL, PEFT, vLLM, FastAPI, accelerate, bitsandbytes, flash-attn | openai |
| Environment setup time | 25+ minutes, CUDA version matching, gated model login | Zero |
| GPU expertise required | Yes (VRAM management, quantization, OOM debugging) | No |
| Steps to first inference | 7 (VM, setup, convert data, train, merge, serve, API) | 4 (sanitize, upload, train, deploy) |
| Lines of code (core flow) | ~600 across 5 files | ~50 in one file |
| Failure modes | CUDA errors, flash-attn build failures, OOM, vLLM startup failures | API errors (well-documented, retryable) |
| Platform knowledge needed | Linux, CUDA, PyTorch internals, vLLM configuration | REST / OpenAI SDK |
| Debugging | SSH into VM, nvidia-smi, training logs | JSON event stream from API |

Fine-Tuning Capabilities

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Model selection | Any model on HuggingFace (gated or not) | 30+ curated open-source models |
| Model used in this project | google/gemma-4-E4B ✅ | meta-llama/Llama-3.1-8B-Instruct ✅ |
| Custom architectures | Yes | No |
| Training approach | LoRA, QLoRA (4-bit), 8-bit, full fine-tune | LoRA (all models); full fine-tune (<20B only) |
| Hyperparameter control | Full (all TRL/SFTConfig params) | Partial (epochs, LR, LoRA rank/alpha/dropout) |
| Multi-GPU training | Yes (accelerate launch) | Managed (transparent to user) |
| Custom data formats | text, messages, prompt/completion, Alpaca | OpenAI message format only |
| Checkpoint control | Full (save_steps, save_total_limit) | Managed checkpoints, API-accessible |
| Post-training merge step | Required for vLLM | None |

Deployment and Serving

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Inference server | vLLM (self-managed) | Managed (Token Factory) |
| Scaling | Manual (provision more VMs) | Automatic |
| Concurrency | Limited by single-VM VRAM | Scales with demand |
| Uptime management | You | Nebius |
| Model versioning | File system | Named model endpoints |
| Rollback | Manual checkpoint swap | Swap model name in API call |
| Latency | Low (direct GPU, no shared infra) | Shared infrastructure latency |
| SLA | DIY | Service-level guarantees |

Which One Should You Choose?

This is the most important question, and the answer depends less on technical preference and more on your constraints.

Choose Nebius Token Factory when:

You're a developer or data scientist, not an MLOps engineer.
You shouldn't need to know what --gpu-memory-utilization 0.90 does to fine-tune a model for your product. Token Factory removes the infrastructure layer entirely.

You need to move fast.
Token Factory gets you from raw JSONL to a deployed, callable API endpoint in under an hour of calendar time. Nebius Cloud takes the better part of an afternoon just for environment setup.

Your traffic is variable or bursty.
Serverless per-token pricing means you pay nothing when no one is querying the model. A law firm's internal Q&A tool doesn't need an H100 running at 3am.

You're working with one of the 30+ supported models.
The catalog covers Llama 3 (1B–70B), Qwen (1.5B–72B), Mistral, DeepSeek, and others. Llama 3.1 8B is an excellent base for legal Q&A — instruction-tuned, well-documented, and small enough to iterate quickly.

You're building a prototype, internal tool, or low-to-medium traffic application.
The managed serving tier is production-quality for most team-facing use cases without the operational overhead.

You want predictable, pay-as-you-go costs.
Token Factory's per-token model means your fine-tuning and inference costs are directly proportional to actual usage. No surprises from an idle GPU you forgot to stop.


Choose Nebius AI Cloud when:

You need a specific model not in the Token Factory catalog.
We used google/gemma-4-E4B specifically because it's a multimodal model — Token Factory doesn't offer it. If your use case requires a custom or gated model from HuggingFace, Cloud is your only option.

You're doing research-grade fine-tuning.
QLoRA (4-bit), custom PEFT configurations, multi-GPU distributed training with accelerate launch, experimental architectures, non-standard data formats — Cloud gives you the full HuggingFace ecosystem with raw GPU access.

You have high, sustained inference volume.
At 24/7 near-capacity utilization, a dedicated H100 at ~$2/hr (~$1,460/month) can be significantly cheaper than per-token pricing at scale. If your model handles thousands of concurrent requests around the clock, the economics of dedicated infrastructure start to win.

You need complete data sovereignty.
Your training data never leaves a VM you control. There's no intermediary API call, no data transiting a managed service. For regulated industries where data residency and chain-of-custody matter, this can be a hard requirement.

Your team has MLOps capability.
If you already manage CUDA environments, operate vLLM or TGI, and have monitoring in place — Cloud is just another VM. The operational overhead is already built into your team's workflow.

You need to customize the inference stack.
Custom batching strategies, non-standard context windows, multi-modal inference pipelines, integration with proprietary serving infrastructure — these all require access to the serving layer itself, which Token Factory abstracts away.


The Decision at a Glance

| Situation | Recommended Path |
| --- | --- |
| First fine-tuned model, moving fast | Token Factory |
| Small team, no MLOps engineer | Token Factory |
| Variable or low traffic | Token Factory |
| Model not in Token Factory catalog | Cloud |
| Need QLoRA / 4-bit training | Cloud |
| Research or experimentation at scale | Cloud |
| 24/7 high-volume production inference | Cloud |
| Data sovereignty / regulated industry | Cloud |
| Prototype → internal tool | Token Factory |
| Custom inference server requirements | Cloud |

The Developer Experience, Honestly

What the Cloud Path Feels Like

You spend the first 30 minutes getting the environment right. There's a moment — usually around the flash-attn build — where you're watching compilation output scroll by and wondering if it's working or stuck. Then training logs start appearing and you feel good. Then you realize vLLM needs the merged weights, not the adapter, and you're back to loading the full model a second time. Then vLLM won't start because you forgot to run the merge. Then it starts but chat completions return garbled output because --chat-template-content-format string isn't set.

Every one of those steps is documented in the codebase, but you have to read carefully. And the bill is running the whole time.

The payoff is real: you get Gemma 4, full LoRA control, a production vLLM server, and a FastAPI legal endpoint you completely own. If your use case demands Gemma specifically, or you need to tune the inference server's memory utilization, or you want to run multi-GPU distributed training on a custom dataset of 100k rows — Cloud is the only option.

What the Token Factory Path Feels Like

You install openai. You call three functions. You poll until training finishes. You deploy. You call the model.

The sanitization step is the most "developer-y" thing in the whole pipeline, and it's still just Python dicts and a for loop. The learning curve is zero if you've used the OpenAI API before, because it literally is the OpenAI API — just pointed at api.tokenfactory.nebius.com.

The constraint you'll feel is the model catalog. Token Factory gives you 30+ models including the full Llama 3 family, Qwen, Mistral, and frontier models like Qwen3 Coder 480B. But if your legal team specifically wants Gemma 4 or a model not on the list, you're out of luck. You also can't do QLoRA or customize the optimizer — that's behind the service abstraction.


Cost Reality Check: A Legal Chatbot Scenario

Imagine a law firm running a legislation Q&A tool. The model serves 500 queries/day from 30 lawyers, each query ~800 tokens in / 400 tokens out.

Daily usage: 500 × (800 + 400) = 600,000 tokens

Token Factory (approximate inference at ~$0.13–0.20/M tokens for Llama 3.1 8B):

  • Daily inference cost: ~$0.08–$0.12
  • Monthly: ~$2.50–$3.60
  • Fine-tuning cost (one-time, 160 rows × 4 epochs): ~$0.01–$0.05
  • Total Month 1: ~$3–$4

Nebius Cloud H100 @ $2.49/hr:

  • VM running 24/7: $2.49 × 24 × 30 ≈ $1,793/month
  • VM running business hours only (8hr/day, weekdays): ~$398/month
  • Plus setup time, maintenance, monitoring overhead

For 500 queries/day from 30 lawyers, Token Factory wins by two orders of magnitude. The H100 dedicated GPU only makes economic sense when you're pushing hundreds of concurrent requests at sustained, near-100% GPU utilization.
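The arithmetic behind those numbers is simple enough to adapt to your own traffic (prices are the article's estimates, not quoted rates):

```python
queries_per_day = 500
tokens_per_query = 800 + 400            # input + output tokens per query
daily_tokens = queries_per_day * tokens_per_query

# Token Factory: serverless per-token inference, upper end of ~$0.13–0.20/M
price_per_million = 0.20
tf_monthly = daily_tokens / 1_000_000 * price_per_million * 30

# Nebius Cloud: H100 VM billed per hour regardless of traffic
vm_monthly_24x7 = 2.49 * 24 * 30

print(daily_tokens)                # → 600000
print(round(tf_monthly, 2))        # → 3.6
print(round(vm_monthly_24x7, 2))   # → 1792.8
```

Swap in your own query volume and token counts to find your crossover point; for this workload the gap is roughly 500×.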


The Verdict

Nebius Token Factory and Nebius AI Cloud solve genuinely different problems, and conflating them is the most common mistake when evaluating the two.

Token Factory is fine-tuning as a service. It abstracts the GPU, the training framework, the merge step, and the inference server into a handful of API calls. You pay for what you use. You deploy in minutes. You don't need to know what lora_alpha is to get a domain-specific model working — though you can tune it if you do.

Nebius AI Cloud is infrastructure. It's the right choice when the constraints of a managed service — model catalog, LoRA-only for large models, abstracted hyperparameters — are actual constraints for your use case.

For legal tech teams building their first domain-adapted model, Token Factory is where to start. It removes every obstacle between your training data and a callable API endpoint. When you outgrow it — because you need Gemma 4's multimodal capabilities, because you're training on 500k proprietary documents and need distributed multi-GPU runs, because you're at the scale where GPU utilization economics change — Nebius Cloud is right there, and the skills you built on Token Factory (LoRA, dataset formatting, hyperparameter intuition) transfer directly.

Both paths converge on the same outcome: a fine-tuned model that knows UK legislation. The question is how much of your time and your bill should go toward getting there.


Quick Reference: Files in Each Project

legal-tech-fine-tuning-nebius-cloud/

| File | Purpose |
| --- | --- |
| setup_and_train.sh | Full VM bootstrap: system packages, venv, PyTorch, training |
| train_gemma.py | TRL SFTTrainer + PEFT LoRA fine-tuning on Gemma 4 |
| merge_lora.py | Merge LoRA adapter → full weights for vLLM |
| serve.sh | vllm serve with merged Gemma checkpoint |
| api.py | FastAPI proxy with legal-specific routes |
| run.sh | One-command: vLLM + FastAPI (+ optional Cloudflare tunnel) |
| chat_template.jinja | Gemma chat template for training + vLLM |

legal-tech-fine-tuning-token-factory/

| File | Purpose |
| --- | --- |
| launch_legal_finetune.py | Dataset sanitizer + file upload + fine-tuning job creation |
| deploy_private_legal_model.py | Checkpoint → deployed custom model endpoint |
| legal.ipynb | Step-by-step notebook walkthrough of the full pipeline |
