DEV Community

Shivay Lamba

Fine-Tuning LLMs for Legal Tech: Nebius AI Cloud vs Nebius Token Factory — A Developer's Honest Comparison

A hands-on walkthrough of fine-tuning the same legal Q&A dataset on two very different platforms — and what it really costs.


Why Fine-Tune for Legal Tech?

Large language models are impressively general, but "general" is the enemy of "trustworthy" in legal work. A model that confidently summarizes UK legislation one moment and hallucinates a fictional statute the next isn't useful in production. Fine-tuning on a curated domain dataset — in our case, legislation Q&A pairs derived from real UK statutory text — teaches the model to stay grounded, adopt the right tone, and answer in the format lawyers actually expect.

The question isn't whether to fine-tune. It's how. Nebius offers two distinct surfaces for this: the Nebius AI Cloud (raw GPU VMs, full infrastructure control) and Nebius Token Factory (a managed, API-driven fine-tuning and inference service). We built the same legal model on both, and the experience couldn't be more different.


The Dataset: UK Legislation Q&A

Both pipelines use the same starting point: legislation_qa_clean.jsonl, a ~160-row curated chat dataset in OpenAI message format. Each record is a user question grounded in a statutory context, paired with a legally accurate answer. It's a small, high-quality dataset — exactly the kind that rewards fine-tuning over few-shot prompting.

```json
{
  "messages": [
    {"role": "user", "content": "What obligations does section 4 impose on employers under the Health and Safety at Work Act?"},
    {"role": "assistant", "content": "Section 4 requires every person who has control of premises used as a workplace to ensure, so far as is reasonably practicable, that the premises, the means of access and egress, and any plant or substance in the premises are safe and without risks to health."}
  ]
}
```
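Before uploading or training on the file, it's worth a quick structural check that every line parses and contains both a user and an assistant turn. A minimal stdlib sketch (the helper name is ours, not part of the project's scripts):

```python
import json

def validate_chat_record(line: str) -> bool:
    """Return True if a JSONL line is a well-formed user/assistant chat record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list):
        return False
    roles = [m.get("role") for m in messages]
    # Need a user turn followed (eventually) by an assistant turn.
    return ("user" in roles and "assistant" in roles
            and roles.index("user") < roles.index("assistant"))

sample = '{"messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A."}]}'
print(validate_chat_record(sample))  # → True
```

Running it over all ~160 lines takes milliseconds and catches exactly the kind of malformed record that otherwise corrupts training silently.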

Path 1: Nebius AI Cloud — Full Control, Full Complexity

The Architecture

The Cloud approach gives you a dedicated H100 GPU VM on Nebius, and then you build everything yourself. The pipeline looks like this:

```text
legislation.jsonl
    ↓ convert_legislation_to_qa.py
legislation_qa_clean.jsonl
    ↓ train_gemma.py (TRL SFTTrainer + LoRA)
gemma-legal-qa-clean-lora/          ← LoRA adapter weights
    ↓ merge_lora.py
gemma-legal-qa-clean-merged/        ← full merged weights
    ↓ serve.sh (vLLM)
http://localhost:8100/v1            ← OpenAI-compatible API
    ↓ api.py (FastAPI)
http://localhost:8000               ← legal-specific routes
    ↓ scripts/cloudflare_tunnel.sh  (optional HTTPS)
https://your-subdomain.trycloudflare.com
```

That's six stages before you serve a single inference request.

Step 1: Provision the VM and Set Up the Environment

You start by spinning up an H100 VM on Nebius Cloud. On-demand H100 pricing sits around $2.00–$2.49/hour, with dedicated GPU hosts reaching up to $4.00/hour for full isolation and no sharing. Once your VM is live, you run the bootstrap script:

```bash
chmod +x setup_and_train.sh
export HF_TOKEN=hf_your_token_here
./setup_and_train.sh
```

That script does a lot:

  1. apt-get installs build tools, git, Python headers
  2. Verifies nvidia-smi is reachable
  3. Creates a Python venv and installs PyTorch for CUDA 12.4 (torch==2.6.0)
  4. Installs transformers 5.5.0, accelerate, peft, trl, datasets, bitsandbytes
  5. Attempts to build flash-attn 2 for H100 speedup (gracefully skips if it fails)
  6. Runs interactive Hugging Face login — Gemma is a gated model

This alone takes 15–30 minutes on a fresh VM. If your HF token isn't set, step 6 blocks on interactive input. If flash-attn fails to compile in step 5 (it often does the first time), you lose another few minutes watching the build fail before the fallback kicks in.

Step 2: Train with TRL SFTTrainer + LoRA

Once setup completes, training starts automatically. The train_gemma.py script wraps HuggingFace's TRL SFTTrainer with PEFT LoRA:

```python
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=(
        r"^model\.language_model\.layers\.\d+\."
        r"(self_attn\.(q_proj|k_proj|v_proj|o_proj)|"
        r"mlp\.(gate_proj|up_proj|down_proj))$"
    ),
)
```

The model is google/gemma-4-E4B — Gemma 4's 4-billion-parameter multimodal variant. Vision and audio towers are frozen so training only updates language weights. Default hyperparameters: 5 epochs, learning rate 1e-4, max sequence length 1024, gradient accumulation 4 steps. On a single H100, this finishes in roughly 5–10 minutes for 160 rows.
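Those defaults translate into very few optimizer updates, which is why the run is so short. A quick back-of-envelope, assuming a hypothetical per-device batch size of 2 (the article doesn't state it):

```python
import math

rows = 160            # dataset size
epochs = 5
per_device_batch = 2  # assumption: not stated in the training script excerpt
grad_accum = 4

effective_batch = per_device_batch * grad_accum      # examples per optimizer step
steps_per_epoch = math.ceil(rows / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)  # → 8 100
```

On an H100, a hundred optimizer steps over a 4B model with LoRA is minutes of work, consistent with the 5–10 minute figure above.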

One subtle gotcha in the training config: use_liger_kernel=False is hardcoded because Liger Kernel causes CUDA illegal access errors on Gemma 4. This is the kind of model-specific trap you only find by hitting the error.

Step 3: Merge the LoRA Adapter

vLLM doesn't load PEFT adapters natively in this setup — it expects a single merged checkpoint. You must run a separate merge step:

```bash
python merge_lora.py \
  --base_model google/gemma-4-E4B \
  --adapter_path ./gemma-legal-qa-clean-lora \
  --output_path ./gemma-legal-qa-clean-merged
```

Internally, this loads the full base model again into GPU memory (torch_dtype=torch.bfloat16, device_map="auto"), loads the adapter on top with PeftModel.from_pretrained, and then calls merge_and_unload(). On an H100 with 80GB VRAM this works fine, but it's another 5–10 minute step that consumes billable GPU time while producing no training progress.

Step 4: Serve with vLLM

```bash
./serve.sh
```

Which expands to:

```bash
exec vllm serve "$MODEL_PATH" \
    --host 0.0.0.0 --port 8100 \
    --served-model-name legal-lora \
    --chat-template "$SCRIPT_DIR/chat_template.jinja" \
    --chat-template-content-format string \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90
```

The --chat-template-content-format string flag is critical for Gemma. Without it, the Jinja template mis-handles messages and system prompts leak into generation. vLLM takes 2–4 minutes to load the merged weights and warm up before it starts accepting requests.
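Once the server is up, any OpenAI-compatible client can talk to port 8100. A sketch of the request payload (the helper and sampling values are illustrative; the served model name comes from serve.sh):

```python
def build_chat_request(question: str, model: str = "legal-lora") -> dict:
    """Build an OpenAI-style chat-completions payload for the local vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
        "temperature": 0.2,  # low temperature: legal answers should be conservative
    }

payload = build_chat_request("What does section 4 require of employers?")
# POST this to http://localhost:8100/v1/chat/completions,
# or use the openai SDK with base_url="http://localhost:8100/v1".
print(payload["model"])  # → legal-lora
```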

Step 5: FastAPI Legal Layer + Optional HTTPS Tunnel

The final layer adds legal-specific routes (/v1/legal/analyze, /v1/legal/chat) with a default UK legislation system prompt, plus optional Cloudflare quick tunnels for HTTPS without opening firewall ports:

```bash
WITH_TUNNEL=1 ./run.sh
```
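Stripped of the FastAPI scaffolding, the core of the legal layer is just prepending a system prompt before forwarding messages to vLLM. A sketch (the prompt wording and helper are illustrative, not the project's actual api.py code):

```python
# Assumed system prompt; the project's actual wording may differ.
LEGAL_SYSTEM_PROMPT = (
    "You are a UK legislation assistant. Answer only from the statutory "
    "context provided and say so when the answer is not in the text."
)

def to_upstream_messages(user_messages: list[dict]) -> list[dict]:
    """Prepend the legal system prompt unless the caller already set one."""
    if user_messages and user_messages[0].get("role") == "system":
        return user_messages
    return [{"role": "system", "content": LEGAL_SYSTEM_PROMPT}] + user_messages

msgs = to_upstream_messages([{"role": "user", "content": "Summarise section 4."}])
print(len(msgs))  # → 2
```

Everything else in the layer (routing, timeouts, streaming passthrough) is standard proxy plumbing.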

The Real Cost on Nebius Cloud

| Stage | Duration | GPU Running? | Cost @ $2.49/hr |
| --- | --- | --- | --- |
| VM setup + dependencies | ~25 min | Yes (idle) | ~$1.04 |
| Training (160 rows, 5 epochs) | ~10 min | Yes (active) | ~$0.42 |
| Merge LoRA | ~8 min | Yes (active) | ~$0.33 |
| vLLM startup | ~4 min | Yes (idle) | ~$0.17 |
| **Total to first inference** | **~47 min** | | **~$1.96** |

And then the VM keeps billing at $2.49/hr (or up to $4/hr for a fully dedicated host) for every hour you leave it running to serve requests. Traffic at 3am? You're paying the same rate.

For dedicated GPU hosting at $4/hour running 24/7, that's ~$2,880/month before storage or egress.


Path 2: Nebius Token Factory — Fewer Lines, Less Everything

The Architecture

```text
legislation_qa_clean.jsonl
    ↓ sanitize_dataset()          ← normalize roles, validate
artifacts/legislation_qa_clean.nebius.jsonl
    ↓ upload_training_file()      ← client.files.create()
file-id
    ↓ create_finetune_job()       ← client.fine_tuning.jobs.create()
job-id
    ↓ wait_for_job()              ← poll every 30s
fine-tuned model checkpoint
    ↓ create_custom_model()       ← POST /v0/models
deployed model endpoint
    ↓ smoke_test()                ← client.chat.completions.create()
```

No VM. No CUDA. No vLLM. No merge step. No FastAPI. No Cloudflare tunnel. The whole flow runs from your laptop.

Step 0: Install Dependencies

```bash
pip install openai
```

That is the entire dependency list. One package. The Token Factory API is OpenAI-compatible, so you're using the standard openai Python SDK against a different base_url.

Step 1: Sanitize the Dataset

The sanitizer isn't just cosmetic — Token Factory enforces strict message validation. It repairs malformed role names (the first record in legislation_qa_clean.jsonl actually stores the user question in the role field, which would silently corrupt training without this step), normalizes aliases ("human" → "user", "bot" → "assistant"), and drops records missing either a user or assistant turn.

```python
report = sanitize_dataset(DATASET_PATH, CLEAN_DATASET_PATH)
# → {"total_records": 160, "kept_records": 158, "dropped_records": 2, "repaired_records": 12}
```

Step 2: Upload, Train, Monitor — Three API Calls

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.tokenfactory.nebius.com/v1/", api_key=NEBIUS_API_KEY)

# Upload
training_file = client.files.create(file=open(dataset_path, "rb"), purpose="fine-tune")

# Create job
job = client.fine_tuning.jobs.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    training_file=training_file.id,
    suffix="legislation-qa-lora",
    hyperparameters={
        "n_epochs": 4,
        "learning_rate": 1e-5,
        "lora": True,
        "lora_r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0.05,
        "packing": True,
    },
    seed=42,
)

# Poll until the job reaches a terminal state
while job.status not in {"succeeded", "failed", "cancelled"}:
    time.sleep(30)
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(f"status={job.status} trained_tokens={job.trained_tokens}")

Hyperparameters are still yours to control — rank, alpha, dropout, learning rate, epochs. The service handles GPU allocation, scheduling, and checkpointing transparently.
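The bare polling loop works, but a timeout guard is cheap insurance against a stuck job burning your patience. A hedged sketch of a wait_for_job helper, with the fetch call injected so it's testable without hitting the API (signature is ours, not the project's):

```python
import time

def wait_for_job(fetch, poll_seconds: float = 30.0, timeout_seconds: float = 7200.0):
    """Poll fetch() until the job reaches a terminal status or we time out.

    fetch is any zero-arg callable returning an object with a .status attribute,
    e.g. lambda: client.fine_tuning.jobs.retrieve(job_id).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch()
        if job.status in {"succeeded", "failed", "cancelled"}:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"fine-tune still {job.status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```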

Step 3: Deploy in Four Lines

```python
import requests

response = requests.post(
    "https://api.tokenfactory.nebius.com/v0/models",
    json={
        "source": f"{job_id}:{checkpoint_id}",
        "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "name": "legislation-qa-private",
    },
    headers={"Authorization": f"Bearer {NEBIUS_API_KEY}"},
)
```

After the model status becomes "active", you call it exactly like any other model — same SDK, same endpoint, just your model name:

```python
client.chat.completions.create(
    model="legislation-qa-private",
    messages=[{"role": "user", "content": "What does the Act require of employers?"}]
)
```

No vLLM install. No port management. No Cloudflare tunnel. The Token Factory handles all of that as a managed service, including a built-in smoke test after deployment.


Head-to-Head Comparison

Pricing

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Compute model | Per-hour VM billing | Per-token (training + inference) |
| H100 on-demand | ~$2.00–$2.49/hr | N/A (no GPU provisioning) |
| Dedicated host | Up to ~$4.00/hr | N/A |
| Training cost | GPU hours × rate (idle + active) | training_price_per_million_tokens × epochs |
| Idle serving cost | Full GPU rate 24/7 | Zero (serverless inference) |
| Minimum spend | 1hr minimum billing window | Only what you actually train/infer |
| Storage | Managed disk billed separately | Included |

Ease of Development

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Dependencies | PyTorch, CUDA, TRL, PEFT, vLLM, FastAPI, accelerate, bitsandbytes, flash-attn | openai |
| Environment setup time | 25+ minutes, CUDA version matching, gated model login | Zero |
| GPU expertise required | Yes (VRAM management, quantization, OOM debugging) | No |
| Steps to first inference | 7 (VM, setup, convert data, train, merge, serve, API) | 4 (sanitize, upload, train, deploy) |
| Lines of code (core flow) | ~600 across 5 files | ~50 in one file |
| Failure modes | CUDA errors, flash-attn build failures, OOM, vLLM startup failures | API errors (well-documented, retryable) |
| Platform knowledge needed | Linux, CUDA, PyTorch internals, vLLM configuration | REST / OpenAI SDK |
| Debugging | SSH into VM, nvidia-smi, training logs | JSON event stream from API |

Fine-Tuning Capabilities

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Model selection | Any model on HuggingFace (gated or not) | 30+ curated open-source models |
| Model used in this project | google/gemma-4-E4B ✅ | meta-llama/Llama-3.1-8B-Instruct ✅ |
| Custom architectures | Yes | No |
| Training approach | LoRA, QLoRA (4-bit), 8-bit, full fine-tune | LoRA (all models); full fine-tune (<20B only) |
| Hyperparameter control | Full (all TRL/SFTConfig params) | Partial (epochs, LR, LoRA rank/alpha/dropout) |
| Multi-GPU training | Yes (accelerate launch) | Managed (transparent to user) |
| Custom data formats | text, messages, prompt/completion, Alpaca | OpenAI message format only |
| Checkpoint control | Full (save_steps, save_total_limit) | Managed checkpoints, API-accessible |
| Post-training merge step | Required for vLLM | None |

Deployment and Serving

| Dimension | Nebius AI Cloud | Nebius Token Factory |
| --- | --- | --- |
| Inference server | vLLM (self-managed) | Managed (Token Factory) |
| Scaling | Manual (provision more VMs) | Automatic |
| Concurrency | Limited by single-VM VRAM | Scales with demand |
| Uptime management | You | Nebius |
| Model versioning | File system | Named model endpoints |
| Rollback | Manual checkpoint swap | Swap model name in API call |
| Latency | Low (direct GPU, no shared infra) | Shared infrastructure latency |
| SLA | DIY | Service-level guarantees |

Which One Should You Choose?

This is the most important question, and the answer depends less on technical preference and more on your constraints.

Choose Nebius Token Factory when:

You're a developer or data scientist, not an MLOps engineer.
You shouldn't need to know what --gpu-memory-utilization 0.90 does to fine-tune a model for your product. Token Factory removes the infrastructure layer entirely.

You need to move fast.
Token Factory gets you from raw JSONL to a deployed, callable API endpoint in under an hour of calendar time. Nebius Cloud takes the better part of an afternoon just for environment setup.

Your traffic is variable or bursty.
Serverless per-token pricing means you pay nothing when no one is querying the model. A law firm's internal Q&A tool doesn't need an H100 running at 3am.

You're working with one of the 30+ supported models.
The catalog covers Llama 3 (1B–70B), Qwen (1.5B–72B), Mistral, DeepSeek, and others. Llama 3.1 8B is an excellent base for legal Q&A — instruction-tuned, well-documented, and small enough to iterate quickly.

You're building a prototype, internal tool, or low-to-medium traffic application.
The managed serving tier is production-quality for most team-facing use cases without the operational overhead.

You want predictable, pay-as-you-go costs.
Token Factory's per-token model means your fine-tuning and inference costs are directly proportional to actual usage. No surprises from an idle GPU you forgot to stop.


Choose Nebius AI Cloud when:

You need a specific model not in the Token Factory catalog.
We used google/gemma-4-E4B specifically because it's a multimodal model — Token Factory doesn't offer it. If your use case requires a custom or gated model from HuggingFace, Cloud is your only option.

You're doing research-grade fine-tuning.
QLoRA (4-bit), custom PEFT configurations, multi-GPU distributed training with accelerate launch, experimental architectures, non-standard data formats — Cloud gives you the full HuggingFace ecosystem with raw GPU access.

You have high, sustained inference volume.
At 24/7 near-capacity utilization, a dedicated H100 at ~$2/hr (~$1,460/month) can be significantly cheaper than per-token pricing at scale. If your model handles thousands of concurrent requests around the clock, the economics of dedicated infrastructure start to win.

You need complete data sovereignty.
Your training data never leaves a VM you control. There's no intermediary API call, no data transiting a managed service. For regulated industries where data residency and chain-of-custody matter, this can be a hard requirement.

Your team has MLOps capability.
If you already manage CUDA environments, operate vLLM or TGI, and have monitoring in place — Cloud is just another VM. The operational overhead is already built into your team's workflow.

You need to customize the inference stack.
Custom batching strategies, non-standard context windows, multi-modal inference pipelines, integration with proprietary serving infrastructure — these all require access to the serving layer itself, which Token Factory abstracts away.


The Decision at a Glance

| Situation | Recommended Path |
| --- | --- |
| First fine-tuned model, moving fast | Token Factory |
| Small team, no MLOps engineer | Token Factory |
| Variable or low traffic | Token Factory |
| Model not in Token Factory catalog | Cloud |
| Need QLoRA / 4-bit training | Cloud |
| Research or experimentation at scale | Cloud |
| 24/7 high-volume production inference | Cloud |
| Data sovereignty / regulated industry | Cloud |
| Prototype → internal tool | Token Factory |
| Custom inference server requirements | Cloud |

The Developer Experience, Honestly

What the Cloud Path Feels Like

You spend the first 30 minutes getting the environment right. There's a moment — usually around the flash-attn build — where you're watching compilation output scroll by and wondering if it's working or stuck. Then training logs start appearing and you feel good. Then you realize vLLM needs the merged weights, not the adapter, and you're back to loading the full model a second time. Then vLLM won't start because you forgot to run the merge. Then it starts but chat completions return garbled output because --chat-template-content-format string isn't set.

Every one of those steps is documented in the codebase, but you have to read carefully. And the bill is running the whole time.

The payoff is real: you get Gemma 4, full LoRA control, a production vLLM server, and a FastAPI legal endpoint you completely own. If your use case demands Gemma specifically, or you need to tune the inference server's memory utilization, or you want to run multi-GPU distributed training on a custom dataset of 100k rows — Cloud is the only option.

What the Token Factory Path Feels Like

You install openai. You call three functions. You poll until training finishes. You deploy. You call the model.

The sanitization step is the most "developer-y" thing in the whole pipeline, and it's still just Python dicts and a for loop. The learning curve is zero if you've used the OpenAI API before, because it literally is the OpenAI API — just pointed at api.tokenfactory.nebius.com.

The constraint you'll feel is the model catalog. Token Factory gives you 30+ models including the full Llama 3 family, Qwen, Mistral, and frontier models like Qwen3 Coder 480B. But if your legal team specifically wants Gemma 4 or a model not on the list, you're out of luck. You also can't do QLoRA or customize the optimizer — that's behind the service abstraction.


Cost Reality Check: A Legal Chatbot Scenario

Imagine a law firm running a legislation Q&A tool. The model serves 500 queries/day from 30 lawyers, each query ~800 tokens in / 400 tokens out.

Daily usage: 500 × (800 + 400) = 600,000 tokens

Token Factory (approximate inference at ~$0.13–0.20/M tokens for Llama 3.1 8B):

  • Daily inference cost: ~$0.08–$0.12
  • Monthly: ~$2.50–$3.60
  • Fine-tuning cost (one-time, 160 rows × 4 epochs): ~$0.01–$0.05
  • Total Month 1: ~$3–$4

Nebius Cloud H100 @ $2.49/hr:

  • VM running 24/7: $2.49 × 24 × 30 ≈ $1,793/month
  • VM running business hours only (8hr/day, weekdays): ~$398/month
  • Plus setup time, maintenance, monitoring overhead

For 500 queries/day from 30 lawyers, Token Factory wins by two orders of magnitude. The H100 dedicated GPU only makes economic sense when you're pushing hundreds of concurrent requests at sustained, near-100% GPU utilization.
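The arithmetic behind those numbers is simple enough to adapt to your own traffic (prices are the article's estimates, not quoted rates):

```python
queries_per_day = 500
tokens_per_query = 800 + 400            # input + output tokens per query
daily_tokens = queries_per_day * tokens_per_query

# Token Factory: serverless per-token inference, upper end of ~$0.13–0.20/M
price_per_million = 0.20
tf_monthly = daily_tokens / 1_000_000 * price_per_million * 30

# Nebius Cloud: H100 VM billed per hour regardless of traffic
vm_monthly_24x7 = 2.49 * 24 * 30

print(daily_tokens)                # → 600000
print(round(tf_monthly, 2))        # → 3.6
print(round(vm_monthly_24x7, 2))   # → 1792.8
```

Swap in your own query volume and token counts to find your crossover point; for this workload the gap is roughly 500×.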


The Verdict

Nebius Token Factory and Nebius AI Cloud solve genuinely different problems, and conflating them is the most common mistake when evaluating the two.

Token Factory is fine-tuning as a service. It abstracts the GPU, the training framework, the merge step, and the inference server into a handful of API calls. You pay for what you use. You deploy in minutes. You don't need to know what lora_alpha is to get a domain-specific model working — though you can tune it if you do.

Nebius AI Cloud is infrastructure. It's the right choice when the constraints of a managed service — model catalog, LoRA-only for large models, abstracted hyperparameters — are actual constraints for your use case.

For legal tech teams building their first domain-adapted model, Token Factory is where to start. It removes every obstacle between your training data and a callable API endpoint. When you outgrow it — because you need Gemma 4's multimodal capabilities, because you're training on 500k proprietary documents and need distributed multi-GPU runs, because you're at the scale where GPU utilization economics change — Nebius Cloud is right there, and the skills you built on Token Factory (LoRA, dataset formatting, hyperparameter intuition) transfer directly.

Both paths converge on the same outcome: a fine-tuned model that knows UK legislation. The question is how much of your time and your bill should go toward getting there.


Quick Reference: Files in Each Project

legal-tech-fine-tuning-nebius-cloud/

| File | Purpose |
| --- | --- |
| setup_and_train.sh | Full VM bootstrap: system packages, venv, PyTorch, training |
| train_gemma.py | TRL SFTTrainer + PEFT LoRA fine-tuning on Gemma 4 |
| merge_lora.py | Merge LoRA adapter → full weights for vLLM |
| serve.sh | vllm serve with merged Gemma checkpoint |
| api.py | FastAPI proxy with legal-specific routes |
| run.sh | One-command: vLLM + FastAPI (+ optional Cloudflare tunnel) |
| chat_template.jinja | Gemma chat template for training + vLLM |

legal-tech-fine-tuning-token-factory/

| File | Purpose |
| --- | --- |
| launch_legal_finetune.py | Dataset sanitizer + file upload + fine-tuning job creation |
| deploy_private_legal_model.py | Checkpoint → deployed custom model endpoint |
| legal.ipynb | Step-by-step notebook walkthrough of the full pipeline |
