A hands-on walkthrough of fine-tuning the same legal Q&A dataset on two very different platforms — and what it really costs.
Why Fine-Tune for Legal Tech?
Large language models are impressively general, but "general" is the enemy of "trustworthy" in legal work. A model that confidently summarizes UK legislation one moment and hallucinates a fictional statute the next isn't useful in production. Fine-tuning on a curated domain dataset — in our case, legislation Q&A pairs derived from real UK statutory text — teaches the model to stay grounded, adopt the right tone, and answer in the format lawyers actually expect.
The question isn't whether to fine-tune. It's how. Nebius offers two distinct surfaces for this: the Nebius AI Cloud (raw GPU VMs, full infrastructure control) and Nebius Token Factory (a managed, API-driven fine-tuning and inference service). We built the same legal model on both, and the experience couldn't be more different.
The Dataset: UK Legislation Q&A
Both pipelines use the same starting point: legislation_qa_clean.jsonl, a ~160-row curated chat dataset in OpenAI message format. Each record is a user question grounded in a statutory context, paired with a legally accurate answer. It's a small, high-quality dataset — exactly the kind that rewards fine-tuning over few-shot prompting.
{
  "messages": [
    {"role": "user", "content": "What obligations does section 4 impose on employers under the Health and Safety at Work Act?"},
    {"role": "assistant", "content": "Section 4 requires every person who has control of premises used as a workplace to ensure, so far as is reasonably practicable, that the premises, the means of access and egress, and any plant or substance in the premises are safe and without risks to health."}
  ]
}
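Before training on either platform, it's worth a two-minute sanity check that every record follows that schema. An illustrative snippet (not part of either repo):

```python
import json

def is_valid_record(line: str) -> bool:
    """Minimal schema check: each JSONL record needs at least one
    user turn and one assistant turn in its messages list."""
    messages = json.loads(line).get("messages", [])
    roles = {m.get("role") for m in messages}
    return {"user", "assistant"} <= roles

sample = (
    '{"messages": [{"role": "user", "content": "What does section 4 require?"}, '
    '{"role": "assistant", "content": "Section 4 requires..."}]}'
)
print(is_valid_record(sample))  # → True
```

Running this over all ~160 lines before spending any GPU time catches the malformed records that the Token Factory sanitizer (covered below) was built to repair.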
Path 1: Nebius AI Cloud — Full Control, Full Complexity
The Architecture
The Cloud approach gives you a dedicated H100 GPU VM on Nebius, and then you build everything yourself. The pipeline looks like this:
legislation.jsonl
↓ convert_legislation_to_qa.py
legislation_qa_clean.jsonl
↓ train_gemma.py (TRL SFTTrainer + LoRA)
gemma-legal-qa-clean-lora/ ← LoRA adapter weights
↓ merge_lora.py
gemma-legal-qa-clean-merged/ ← full merged weights
↓ serve.sh (vLLM)
http://localhost:8100/v1 ← OpenAI-compatible API
↓ api.py (FastAPI)
http://localhost:8000 ← legal-specific routes
↓ scripts/cloudflare_tunnel.sh (optional HTTPS)
https://your-subdomain.trycloudflare.com
That's six stages before you serve a single inference request.
Step 1: Provision the VM and Set Up the Environment
You start by spinning up an H100 VM on Nebius Cloud. On-demand H100 pricing sits around $2.00–$2.49/hour, with fully isolated, single-tenant GPU hosts reaching $4.00/hour. Once your VM is live, you run the bootstrap script:
chmod +x setup_and_train.sh
export HF_TOKEN=hf_your_token_here
./setup_and_train.sh
That script does a lot:
- `apt-get` installs build tools, git, and Python headers
- Verifies `nvidia-smi` is reachable
- Creates a Python venv and installs PyTorch for CUDA 12.4 (`torch==2.6.0`)
- Installs `transformers 5.5.0`, `accelerate`, `peft`, `trl`, `datasets`, `bitsandbytes`
- Attempts to build `flash-attn 2` for H100 speedup (gracefully skips if it fails)
- Runs the interactive Hugging Face login (Gemma is a gated model)
This alone takes 15–30 minutes on a fresh VM. If your HF token isn't set, the login step blocks on interactive input. If flash-attn fails to compile (it often does the first time), you lose another few minutes watching the build fail before the fallback kicks in.
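Both of those stalls are avoidable with a quick pre-flight check before launching the script. A sketch (illustrative, not part of the repo):

```python
import os
import shutil

def preflight(env=None) -> list[str]:
    """Flag the two most common setup blockers before running setup_and_train.sh."""
    env = os.environ if env is None else env
    problems = []
    # A missing/malformed HF_TOKEN makes the Hugging Face login step
    # block on interactive input while the GPU bills idle
    if not env.get("HF_TOKEN", "").startswith("hf_"):
        problems.append("HF_TOKEN missing or malformed; login will block on input")
    # No nvidia-smi usually means the GPU drivers aren't installed yet
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH; GPU drivers may be missing")
    return problems

for problem in preflight():
    print(f"WARNING: {problem}")
```

Thirty seconds of checking beats discovering a blocked prompt twenty minutes into a billable setup run.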
Step 2: Train with TRL SFTTrainer + LoRA
Once setup completes, training starts automatically. The train_gemma.py script wraps HuggingFace's TRL SFTTrainer with PEFT LoRA:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=(
        r"^model\.language_model\.layers\.\d+\."
        r"(self_attn\.(q_proj|k_proj|v_proj|o_proj)|"
        r"mlp\.(gate_proj|up_proj|down_proj))$"
    ),
)
The model is google/gemma-4-E4B — Gemma 4's 4-billion-parameter multimodal variant. Vision and audio towers are frozen so training only updates language weights. Default hyperparameters: 5 epochs, learning rate 1e-4, max sequence length 1024, gradient accumulation 4 steps. On a single H100, this finishes in roughly 5–10 minutes for 160 rows.
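The trainer setup implied by those defaults looks roughly like this. Argument names follow TRL's `SFTConfig` and vary between TRL releases, so treat it as a sketch rather than the exact contents of `train_gemma.py`:

```python
# Sketch only: TRL SFTConfig with the defaults quoted above.
# Field names (e.g. max_seq_length) differ across TRL versions.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="gemma-legal-qa-clean-lora",
    num_train_epochs=5,
    learning_rate=1e-4,
    max_seq_length=1024,
    gradient_accumulation_steps=4,
    bf16=True,                # bfloat16 training on the H100
    use_liger_kernel=False,   # see the Gemma-specific gotcha below
)
```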
One subtle gotcha in the training config: use_liger_kernel=False is hardcoded because Liger Kernel causes CUDA illegal access errors on Gemma 4. This is the kind of model-specific trap you only find by hitting the error.
Step 3: Merge the LoRA Adapter
vLLM doesn't load PEFT adapters natively in this setup — it expects a single merged checkpoint. You must run a separate merge step:
python merge_lora.py \
    --base_model google/gemma-4-E4B \
    --adapter_path ./gemma-legal-qa-clean-lora \
    --output_path ./gemma-legal-qa-clean-merged
Internally, this loads the full base model again into GPU memory (torch_dtype=torch.bfloat16, device_map="auto"), loads the adapter on top with PeftModel.from_pretrained, and then calls merge_and_unload(). On an H100 with 80GB VRAM this works fine, but it's another 5–10 minute step that consumes billable GPU time while producing no training progress.
Step 4: Serve with vLLM
./serve.sh
Which expands to:
exec vllm serve "$MODEL_PATH" \
    --host 0.0.0.0 --port 8100 \
    --served-model-name legal-lora \
    --chat-template "$SCRIPT_DIR/chat_template.jinja" \
    --chat-template-content-format string \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90
The --chat-template-content-format string flag is critical for Gemma. Without it, the Jinja template mis-handles messages and system prompts leak into generation. vLLM takes 2–4 minutes to load the merged weights and warm up before it starts accepting requests.
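Once vLLM reports ready, anything that speaks the OpenAI chat-completions wire format can call it. A minimal stdlib client sketch (no SDK required; assumes the local server and the `legal-lora` model name from `serve.sh` above):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8100/v1/chat/completions"

def build_request(question: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions POST for the local vLLM server."""
    payload = {
        "model": "legal-lora",  # matches --served-model-name in serve.sh
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(question: str) -> str:
    """Send the request and return the model's reply text."""
    with urllib.request.urlopen(build_request(question)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Pointing the official `openai` SDK at `base_url="http://localhost:8100/v1"` works just as well; that symmetry is what makes the Token Factory comparison later so direct.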
Step 5: FastAPI Legal Layer + Optional HTTPS Tunnel
The final layer adds legal-specific routes (/v1/legal/analyze, /v1/legal/chat) with a default UK legislation system prompt, plus optional Cloudflare quick tunnels for HTTPS without opening firewall ports:
WITH_TUNNEL=1 ./run.sh
The Real Cost on Nebius Cloud
| Stage | Duration | GPU Running? | Cost @ $2.49/hr |
|---|---|---|---|
| VM setup + dependencies | ~25 min | Yes (idle) | ~$1.04 |
| Training (160 rows, 5 epochs) | ~10 min | Yes (active) | ~$0.42 |
| Merge LoRA | ~8 min | Yes (active) | ~$0.33 |
| vLLM startup | ~4 min | Yes (idle) | ~$0.17 |
| Total to first inference | ~47 min | | ~$1.96 |
And then the VM keeps billing at $2.49/hr (or up to $4/hr for a fully dedicated host) every hour you leave it running for serving. Traffic at 3am? You're paying the same rate.
For dedicated GPU hosting at $4/hour running 24/7, that's ~$2,880/month before storage or egress.
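The monthly figures above are straightforward to reproduce (using the 30-day month the article assumes):

```python
# Back-of-envelope serving cost at the hourly rates quoted above
HOURS_PER_MONTH = 24 * 30  # 30-day month, GPU never stopped

def monthly_serving_cost(rate_per_hour: float) -> float:
    return rate_per_hour * HOURS_PER_MONTH

print(round(monthly_serving_cost(2.49)))  # on-demand H100, 24/7
print(round(monthly_serving_cost(4.00)))  # fully dedicated host, 24/7
```

The same function makes the business-hours alternative easy to evaluate before committing to an always-on VM.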
Path 2: Nebius Token Factory — Fewer Lines, Less Everything
The Architecture
legislation_qa_clean.jsonl
↓ sanitize_dataset() ← normalize roles, validate
artifacts/legislation_qa_clean.nebius.jsonl
↓ upload_training_file() ← client.files.create()
file-id
↓ create_finetune_job() ← client.fine_tuning.jobs.create()
job-id
↓ wait_for_job() ← poll every 30s
fine-tuned model checkpoint
↓ create_custom_model() ← POST /v0/models
deployed model endpoint
↓ smoke_test() ← client.chat.completions.create()
No VM. No CUDA. No vLLM. No merge step. No FastAPI. No Cloudflare tunnel. The whole flow runs from your laptop.
Step 0: Install Dependencies
pip install openai
That is the entire dependency list. One package. The Token Factory API is OpenAI-compatible, so you're using the standard openai Python SDK against a different base_url.
Step 1: Sanitize the Dataset
The sanitizer isn't just cosmetic — Token Factory enforces strict message validation. It repairs malformed role names (the first record in legislation_qa_clean.jsonl actually stores the user question in the role field, which would silently corrupt training without this step), normalizes aliases ("human" → "user", "bot" → "assistant"), and drops records missing either a user or assistant turn.
report = sanitize_dataset(DATASET_PATH, CLEAN_DATASET_PATH)
# → {"total_records": 160, "kept_records": 158, "dropped_records": 2, "repaired_records": 12}
Step 2: Upload, Train, Monitor — Three API Calls
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=NEBIUS_API_KEY)
# Upload
training_file = client.files.create(file=open(dataset_path, "rb"), purpose="fine-tune")
# Create job
job = client.fine_tuning.jobs.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    training_file=training_file.id,
    suffix="legislation-qa-lora",
    hyperparameters={
        "n_epochs": 4,
        "learning_rate": 1e-5,
        "lora": True,
        "lora_r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0.05,
        "packing": True,
    },
    seed=42,
)
# Poll
while job.status not in {"succeeded", "failed", "cancelled"}:
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"status={job.status} trained_tokens={job.trained_tokens}")
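For longer jobs it's worth bounding that loop with a timeout. A hedged sketch, where `fetch` is any zero-argument callable returning the job object (e.g. `lambda: client.fine_tuning.jobs.retrieve(job_id)`):

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal state or times out.
    `sleep` is injectable so the wrapper can be tested without waiting."""
    waited = 0
    job = fetch()
    while job.status not in TERMINAL:
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{job.status}' after {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds
        job = fetch()
    return job
```

A `"failed"` status still returns normally here, so the caller can inspect the job's error details instead of losing them to an exception.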
Hyperparameters are still yours to control — rank, alpha, dropout, learning rate, epochs. The service handles GPU allocation, scheduling, and checkpointing transparently.
Step 3: Deploy in Four Lines
import requests

response = requests.post(
    "https://api.tokenfactory.nebius.com/v0/models",
    json={
        "source": f"{job_id}:{checkpoint_id}",
        "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "name": "legislation-qa-private",
    },
    headers={"Authorization": f"Bearer {NEBIUS_API_KEY}"},
)
After the model status becomes "active", you call it exactly like any other model — same SDK, same endpoint, just your model name:
client.chat.completions.create(
    model="legislation-qa-private",
    messages=[{"role": "user", "content": "What does the Act require of employers?"}],
)
No vLLM install. No port management. No Cloudflare tunnel. The Token Factory handles all of that as a managed service, including a built-in smoke test after deployment.
Head-to-Head Comparison
Pricing
| Dimension | Nebius AI Cloud | Nebius Token Factory |
|---|---|---|
| Compute model | Per-hour VM billing | Per-token (training + inference) |
| H100 on-demand | ~$2.00–$2.49/hr | N/A (no GPU provisioning) |
| Dedicated host | Up to ~$4.00/hr | N/A |
| Training cost | GPU hours × rate (idle + active) | dataset tokens × epochs × per-million-token training rate |
| Idle serving cost | Full GPU rate 24/7 | Zero (serverless inference) |
| Minimum spend | 1-hour minimum billing window | Only what you actually train/infer |
| Storage | Managed disk billed separately | Included |
Ease of Development
| Dimension | Nebius AI Cloud | Nebius Token Factory |
|---|---|---|
| Dependencies | PyTorch, CUDA, TRL, PEFT, vLLM, FastAPI, accelerate, bitsandbytes, flash-attn | openai |
| Environment setup time | 25+ minutes, CUDA version matching, gated model login | Zero |
| GPU expertise required | Yes (VRAM management, quantization, OOM debugging) | No |
| Steps to first inference | 7 (VM, setup, convert data, train, merge, serve, API) | 4 (sanitize, upload, train, deploy) |
| Lines of code (core flow) | ~600 across 5 files | ~50 in one file |
| Failure modes | CUDA errors, flash-attn build failures, OOM, vLLM startup failures | API errors (well-documented, retryable) |
| Platform knowledge needed | Linux, CUDA, PyTorch internals, vLLM configuration | REST / OpenAI SDK |
| Debugging | SSH into VM, nvidia-smi, training logs | JSON event stream from API |
Fine-Tuning Capabilities
| Dimension | Nebius AI Cloud | Nebius Token Factory |
|---|---|---|
| Model selection | Any model on HuggingFace (gated or not) | 30+ curated open-source models |
| Model used in this project | google/gemma-4-E4B ✅ | meta-llama/Llama-3.1-8B-Instruct ✅ |
| Custom architectures | Yes | No |
| Training approach | LoRA, QLoRA (4-bit), 8-bit, full fine-tune | LoRA (all models); full fine-tune (<20B only) |
| Hyperparameter control | Full (all TRL/SFTConfig params) | Partial (epochs, LR, LoRA rank/alpha/dropout) |
| Multi-GPU training | Yes (accelerate launch) | Managed (transparent to user) |
| Custom data formats | text, messages, prompt/completion, Alpaca | OpenAI message format only |
| Checkpoint control | Full (save_steps, save_total_limit) | Managed checkpoints, API-accessible |
| Post-training merge step | Required for vLLM | None |
Deployment and Serving
| Dimension | Nebius AI Cloud | Nebius Token Factory |
|---|---|---|
| Inference server | vLLM (self-managed) | Managed (Token Factory) |
| Scaling | Manual (provision more VMs) | Automatic |
| Concurrency | Limited by single-VM VRAM | Scales with demand |
| Uptime management | You | Nebius |
| Model versioning | File system | Named model endpoints |
| Rollback | Manual checkpoint swap | Swap model name in API call |
| Latency | Low (direct GPU, no shared infra) | Shared infrastructure latency |
| SLA | DIY | Service-level guarantees |
Which One Should You Choose?
This is the most important question, and the answer depends less on technical preference and more on your constraints.
Choose Nebius Token Factory when:
You're a developer or data scientist, not an MLOps engineer.
You shouldn't need to know what --gpu-memory-utilization 0.90 does to fine-tune a model for your product. Token Factory removes the infrastructure layer entirely.
You need to move fast.
Token Factory gets you from raw JSONL to a deployed, callable API endpoint in under an hour of calendar time. Nebius Cloud takes the better part of an afternoon just for environment setup.
Your traffic is variable or bursty.
Serverless per-token pricing means you pay nothing when no one is querying the model. A law firm's internal Q&A tool doesn't need an H100 running at 3am.
You're working with one of the 30+ supported models.
The catalog covers Llama 3 (1B–70B), Qwen (1.5B–72B), Mistral, DeepSeek, and others. Llama 3.1 8B is an excellent base for legal Q&A — instruction-tuned, well-documented, and small enough to iterate quickly.
You're building a prototype, internal tool, or low-to-medium traffic application.
The managed serving tier is production-quality for most team-facing use cases without the operational overhead.
You want predictable, pay-as-you-go costs.
Token Factory's per-token model means your fine-tuning and inference costs are directly proportional to actual usage. No surprises from an idle GPU you forgot to stop.
Choose Nebius AI Cloud when:
You need a specific model not in the Token Factory catalog.
We used google/gemma-4-E4B specifically because it's a multimodal model — Token Factory doesn't offer it. If your use case requires a custom or gated model from HuggingFace, Cloud is your only option.
You're doing research-grade fine-tuning.
QLoRA (4-bit), custom PEFT configurations, multi-GPU distributed training with accelerate launch, experimental architectures, non-standard data formats — Cloud gives you the full HuggingFace ecosystem with raw GPU access.
You have high, sustained inference volume.
At 24/7 near-capacity utilization, a dedicated H100 at ~$2/hr (~$1,460/month) can be significantly cheaper than per-token pricing at scale. If your model handles thousands of concurrent requests around the clock, the economics of dedicated infrastructure start to win.
You need complete data sovereignty.
Your training data never leaves a VM you control. There's no intermediary API call, no data transiting a managed service. For regulated industries where data residency and chain-of-custody matter, this can be a hard requirement.
Your team has MLOps capability.
If you already manage CUDA environments, operate vLLM or TGI, and have monitoring in place — Cloud is just another VM. The operational overhead is already built into your team's workflow.
You need to customize the inference stack.
Custom batching strategies, non-standard context windows, multi-modal inference pipelines, integration with proprietary serving infrastructure — these all require access to the serving layer itself, which Token Factory abstracts away.
The Decision at a Glance
| Situation | Recommended Path |
|---|---|
| First fine-tuned model, moving fast | Token Factory |
| Small team, no MLOps engineer | Token Factory |
| Variable or low traffic | Token Factory |
| Model not in Token Factory catalog | Cloud |
| Need QLoRA / 4-bit training | Cloud |
| Research or experimentation at scale | Cloud |
| 24/7 high-volume production inference | Cloud |
| Data sovereignty / regulated industry | Cloud |
| Prototype → internal tool | Token Factory |
| Custom inference server requirements | Cloud |
The Developer Experience, Honestly
What the Cloud Path Feels Like
You spend the first 30 minutes getting the environment right. There's a moment — usually around the flash-attn build — where you're watching compilation output scroll by and wondering if it's working or stuck. Then training logs start appearing and you feel good. Then you realize vLLM needs the merged weights, not the adapter, and you're back to loading the full model a second time. Then vLLM won't start because you forgot to run the merge. Then it starts but chat completions return garbled output because --chat-template-content-format string isn't set.
Every one of those steps is documented in the codebase, but you have to read carefully. And the bill is running the whole time.
The payoff is real: you get Gemma 4, full LoRA control, a production vLLM server, and a FastAPI legal endpoint you completely own. If your use case demands Gemma specifically, or you need to tune the inference server's memory utilization, or you want to run multi-GPU distributed training on a custom dataset of 100k rows — Cloud is the only option.
What the Token Factory Path Feels Like
You install openai. You call three functions. You poll until training finishes. You deploy. You call the model.
The sanitization step is the most "developer-y" thing in the whole pipeline, and it's still just Python dicts and a for loop. The learning curve is zero if you've used the OpenAI API before, because it literally is the OpenAI API — just pointed at api.tokenfactory.nebius.com.
The constraint you'll feel is the model catalog. Token Factory gives you 30+ models including the full Llama 3 family, Qwen, Mistral, and frontier models like Qwen3 Coder 480B. But if your legal team specifically wants Gemma 4 or a model not on the list, you're out of luck. You also can't do QLoRA or customize the optimizer — that's behind the service abstraction.
Cost Reality Check: A Legal Chatbot Scenario
Imagine a law firm running a legislation Q&A tool. The model serves 500 queries/day from 30 lawyers, each query ~800 tokens in / 400 tokens out.
Daily usage: 500 × (800 + 400) = 600,000 tokens
Token Factory (approximate inference at ~$0.13–0.20/M tokens for Llama 3.1 8B):
- Daily inference cost: ~$0.08–$0.12
- Monthly: ~$2.50–$3.60
- Fine-tuning cost (one-time, 160 rows × 4 epochs): ~$0.01–$0.05
- Total Month 1: ~$3–$4
Nebius Cloud H100 @ $2.49/hr:
- VM running 24/7: $2.49 × 24 × 30 ≈ $1,793/month
- VM running business hours only (8hr/day, weekdays): ~$398/month
- Plus setup time, maintenance, monitoring overhead
For 500 queries/day from 30 lawyers, Token Factory wins by two orders of magnitude. The H100 dedicated GPU only makes economic sense when you're pushing hundreds of concurrent requests at sustained, near-100% GPU utilization.
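The scenario arithmetic is easy to check:

```python
# Reproducing the scenario numbers above (prices are the approximate
# per-million-token inference rates quoted for Llama 3.1 8B)
queries_per_day = 500
tokens_per_query = 800 + 400          # input + output
daily_tokens = queries_per_day * tokens_per_query

def monthly_inference_cost(price_per_million_tokens: float, days: int = 30) -> float:
    return daily_tokens * days / 1_000_000 * price_per_million_tokens

print(daily_tokens)                             # → 600000
print(round(monthly_inference_cost(0.13), 2))   # → 2.34
print(round(monthly_inference_cost(0.20), 2))   # → 3.6
```

Plug in your own query volume: even a 10× traffic increase leaves the Token Factory bill under $40/month, while the dedicated H100's cost doesn't move at all.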
The Verdict
Nebius Token Factory and Nebius AI Cloud solve genuinely different problems, and conflating them is the most common mistake when evaluating the two.
Token Factory is fine-tuning as a service. It abstracts the GPU, the training framework, the merge step, and the inference server into a handful of API calls. You pay for what you use. You deploy in minutes. You don't need to know what lora_alpha is to get a domain-specific model working — though you can tune it if you do.
Nebius AI Cloud is infrastructure. It's the right choice when the constraints of a managed service — model catalog, LoRA-only for large models, abstracted hyperparameters — are actual constraints for your use case.
For legal tech teams building their first domain-adapted model, Token Factory is where to start. It removes every obstacle between your training data and a callable API endpoint. When you outgrow it — because you need Gemma 4's multimodal capabilities, because you're training on 500k proprietary documents and need distributed multi-GPU runs, because you're at the scale where GPU utilization economics change — Nebius Cloud is right there, and the skills you built on Token Factory (LoRA, dataset formatting, hyperparameter intuition) transfer directly.
Both paths converge on the same outcome: a fine-tuned model that knows UK legislation. The question is how much of your time and your bill should go toward getting there.
Quick Reference: Files in Each Project
legal-tech-fine-tuning-nebius-cloud/
| File | Purpose |
|---|---|
setup_and_train.sh |
Full VM bootstrap: system packages, venv, PyTorch, training |
train_gemma.py |
TRL SFTTrainer + PEFT LoRA fine-tuning on Gemma 4 |
merge_lora.py |
Merge LoRA adapter → full weights for vLLM |
serve.sh |
vllm serve with merged Gemma checkpoint |
api.py |
FastAPI proxy with legal-specific routes |
run.sh |
One-command: vLLM + FastAPI (+ optional Cloudflare tunnel) |
chat_template.jinja |
Gemma chat template for training + vLLM |
legal-tech-fine-tuning-token-factory/
| File | Purpose |
|---|---|
launch_legal_finetune.py |
Dataset sanitizer + file upload + fine-tuning job creation |
deploy_private_legal_model.py |
Checkpoint → deployed custom model endpoint |
legal.ipynb |
Step-by-step notebook walkthrough of the full pipeline |
Sources
- Nebius Token Factory
- Fine-tune open models with Nebius Token Factory
- Pricing | Nebius Token Factory
- Models for fine-tuning in Nebius Token Factory
- How to fine-tune your custom model — Token Factory docs
- Post-training by Nebius Token Factory
- NVIDIA GPU Pricing | Nebius AI Cloud
- Compute pricing in Nebius AI Cloud
- H100 Rental Prices Compared: $1.49–$6.98/hr Across 15+ Cloud Providers (2026)
- GPU Cloud Pricing Comparison 2026 | Spheron Blog
- Nebius launches Nebius Token Factory — Announcement
- Nebius AI Studio Q1 2025 roundup: Fine-tuning, new models
- AI Model Fine Tuning | Nebius Solutions