Fine-Tuning 14B SLMs for 3GPP Root Cause Analysis on Amazon SageMaker

#slm #sagemaker #aws #3gpp

Can fine-tuned small language models (SLMs) match frontier foundation models on automated root cause analysis of 5G core network logs? I built a benchmark to find out, using Amazon SageMaker Training Jobs as the primary compute.

The project: sagemaker-finetuning-telco-slm

The Problem

When something goes wrong in a 5G SA core network, NOC engineers sift through 3GPP protocol logs (NAS, NGAP, RRC) to identify the root cause. These logs are noisy — a single UPF degradation can trigger cascading heartbeat timeouts, keepalive failures, and secondary alarms. Separating the actual root cause from sympathetic noise is time-consuming and error-prone.

The question: can we automate this with language models?

The Approach

I benchmarked three fine-tuned SLMs in the 12B–14B range against two frontier models (Claude and Amazon Nova Pro) across 1,000 synthetic 3GPP scenarios covering 8 scenario types: core network failure, authentication failure, handover failure, congestion, QoS violation, transport jitter, radio failure, and normal (no fault).

Fine-Tuned SLMs

Mistral-Nemo-Base-2407 (12B) — QLoRA 4-bit on 1× A10G (ml.g5.2xlarge)
Qwen3-14B — QLoRA 4-bit on 4× A10G (ml.g5.12xlarge)
Gemma 3 12B — QLoRA 4-bit on 1× A10G (ml.g5.2xlarge)

Frontier Models (via Amazon Bedrock)

Claude 4.6 Opus — zero-shot, 5-shot, and 5-shot + Chain-of-Thought
Amazon Nova Pro — same three prompting strategies
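
The three prompting strategies differ only in how the prompt is assembled. Here is a minimal sketch of that assembly. The instruction text, the CoT suffix, and the `Example` format are placeholders I made up for illustration; the prompts actually used in the benchmark are not shown in this post.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical wording, not the benchmark's actual prompt.
INSTRUCTION = "Identify the root cause error code(s) in the following 3GPP log."
COT_SUFFIX = "Think step by step before giving the final answer."


@dataclass
class Example:
    log: str     # raw signaling log text
    answer: str  # ground-truth root cause label


def build_prompt(log: str, shots: Optional[List[Example]] = None, cot: bool = False) -> str:
    """Assemble a zero-shot, few-shot, or few-shot + CoT prompt."""
    parts = [INSTRUCTION]
    if cot:
        parts.append(COT_SUFFIX)
    # Few-shot: prepend solved examples in the same format as the final query.
    for ex in shots or []:
        parts.append(f"Log:\n{ex.log}\nRoot cause: {ex.answer}")
    # The query itself ends with an open "Root cause:" for the model to complete.
    parts.append(f"Log:\n{log}\nRoot cause:")
    return "\n\n".join(parts)
```

Calling `build_prompt(log)` gives the zero-shot variant, `build_prompt(log, shots)` the 5-shot variant when `shots` holds five examples, and `build_prompt(log, shots, cot=True)` the 5-shot + Chain-of-Thought variant.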

Training on SageMaker

All fine-tuning used Amazon SageMaker Training Jobs — no instance provisioning, no SSH, no manual teardown. You provide a training script and an S3 dataset path, specify the instance type, and SageMaker handles the rest.
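
To make "a script, an S3 path, and an instance type" concrete, here is a sketch of the request payload for the low-level `create_training_job` API in boto3. It only builds the payload and does not call AWS. The job name, image URI, role ARN, S3 paths, and hyperparameter names are all placeholders, and the sketch assumes the training script is already packaged in the container image. It is not necessarily how the project launches its jobs.

```python
from typing import Any, Dict, Optional


def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    train_s3_uri: str,
    output_s3_uri: str,
    instance_type: str = "ml.g5.2xlarge",
    max_runtime_s: int = 3 * 3600,
    hyperparameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the keyword arguments for SageMaker's CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # container that holds the training script
            "TrainingInputMode": "File",  # copy the dataset to local disk before training
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # SageMaker uploads model artifacts (here, adapter weights) to this prefix.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 200,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # The API requires hyperparameter values to be strings.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
    }


# Placeholder values for illustration only.
request = build_training_job_request(
    job_name="qlora-example-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<training-image>:latest",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    train_s3_uri="s3://<bucket>/datasets/train/",
    output_s3_uri="s3://<bucket>/adapters/",
    hyperparameters={"epochs": 2},
)
# To submit: boto3.client("sagemaker").create_training_job(**request)
```

The higher-level SageMaker Python SDK wraps the same fields in an estimator object, so either route ends up specifying the same handful of values.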

The training data: 1,300 synthetic 3GPP signaling logs generated via Amazon Bedrock, each labeled with ground-truth root cause error codes across the 8 failure types.

Why QLoRA 4-bit for all three models?

All three 12B–14B models exceed the 24GB VRAM of a single A10G when loaded in BF16 with training overhead. For example, Mistral-Nemo in BF16 needs ~24GB for weights alone (12B parameters × 2 bytes), leaving zero headroom for activations, gradients, or optimizer states. QLoRA compresses the weights to 4-bit (~6GB), leaving ~18GB for training.
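
The weight-memory arithmetic is easy to check in a few lines. This sketch uses nominal parameter counts (12B and 14B) and ignores quantization metadata and LoRA adapter weights, so treat the results as rough estimates rather than measured footprints.

```python
A10G_VRAM_GB = 24  # memory of a single A10G GPU


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8


# Nominal sizes taken from the model names, not exact parameter counts.
for name, params_b in [("Mistral-Nemo", 12), ("Gemma 3 12B", 12), ("Qwen3-14B", 14)]:
    bf16 = weight_memory_gb(params_b, 16)
    q4 = weight_memory_gb(params_b, 4)
    headroom = A10G_VRAM_GB - q4
    print(f"{name}: BF16 ~{bf16:.0f} GB, 4-bit ~{q4:.0f} GB, ~{headroom:.0f} GB left on one A10G")
```

For a 12B model this reproduces the ~24GB, ~6GB, and ~18GB figures quoted here.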

Qwen3-14B additionally needs 4× GPUs because its larger architecture generates heavier activations and optimizer states during training, pushing peak memory to 60–80GB.

Training Results

Model	Instance	Time	Final Loss	Avg Loss
Mistral-Nemo	ml.g5.2xlarge (1× A10G)	~41 min	1.035	1.359
Qwen3-14B	ml.g5.12xlarge (4× A10G)	~93 min	0.605	0.582
Gemma 3 12B	ml.g5.2xlarge (1× A10G)	~87 min	5.144	4.920

All three completed 325 steps (2 epochs over 1,300 examples). Training cost ranged from ~$1.31 (Mistral-Nemo) to ~$11.78 (Qwen3-14B on the larger instance).
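
A quick sanity check on those numbers: 2 epochs over 1,300 examples in 325 steps works out to 8 examples per optimizer step. That effective batch size is my inference from the figures, not a setting stated in this post. The per-step timings below are derived from the approximate wall-clock times in the table, which may include job startup overhead.

```python
def samples_per_step(n_examples: int, epochs: int, steps: int) -> float:
    """Average number of training examples consumed per optimizer step."""
    return n_examples * epochs / steps


def seconds_per_step(minutes: float, steps: int) -> float:
    """Average wall-clock seconds per optimizer step."""
    return minutes * 60 / steps


STEPS = 325
print(samples_per_step(1300, 2, STEPS))  # 8.0

# Approximate training times from the results table, in minutes.
for model, minutes in {"Mistral-Nemo": 41, "Qwen3-14B": 93, "Gemma 3 12B": 87}.items():
    print(f"{model}: ~{seconds_per_step(minutes, STEPS):.1f} s/step")
```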

Note: loss scales are not directly comparable across models due to different tokenizers and vocabulary sizes. The real comparison comes from the evaluation metrics (F1, precision, recall).

Key Takeaways

SageMaker Training Jobs simplify GPU fine-tuning — no instance management, automatic artifact upload to S3, and you only pay for the training time.
QLoRA 4-bit is essential for 12B–14B models on A10G GPUs — even models that "technically fit" in BF16 will OOM during training due to activation and optimizer overhead.
Synthetic data works for bootstrapping — Bedrock-generated 3GPP logs provide a viable starting point. Real operator data would be the next step for production validation.
A deterministic post-processing filter is critical — stripping sympathetic noise (heartbeat timeouts, keepalives) from model outputs before scoring ensures fair comparison across all models.
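
To illustrate the last takeaway, here is a minimal sketch of a deterministic noise filter applied before scoring. The regex patterns and error code names are hypothetical; the benchmark's actual filter rules and label vocabulary are not shown in this post. The sketch also assumes every scenario has at least one ground-truth label.

```python
import re
from typing import Iterable, Set, Tuple

# Hypothetical patterns for sympathetic noise (heartbeat timeouts, keepalives).
NOISE_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"heartbeat", r"keep[-_ ]?alive")]


def strip_noise(codes: Iterable[str]) -> Set[str]:
    """Drop predicted codes that match any known sympathetic-noise pattern."""
    return {c for c in codes if not any(p.search(c) for p in NOISE_PATTERNS)}


def precision_recall_f1(predicted: Iterable[str], truth: Iterable[str]) -> Tuple[float, float, float]:
    """Score one scenario after filtering noise from the model's output."""
    pred, gold = strip_noise(predicted), set(truth)
    tp = len(pred & gold)  # codes that are both predicted and correct
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up codes: the model names the real fault plus two cascading alarms.
predicted = ["UPF_DEGRADATION", "HEARTBEAT_TIMEOUT", "KEEPALIVE_FAILURE"]
print(precision_recall_f1(predicted, ["UPF_DEGRADATION"]))  # (1.0, 1.0, 1.0)
```

Without the filter, the same prediction would score a precision of 1/3, which penalizes a model for repeating alarms that are present in the log but are not the root cause. Because the filter is rule-based, it treats every model's output identically.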

Architecture

The full pipeline runs on AWS managed services:

Amazon Bedrock — synthetic data generation + frontier model evaluation
Amazon SageMaker Training Jobs — QLoRA fine-tuning of all three SLMs
Amazon S3 — dataset storage, adapter weights, evaluation results
Amazon EC2 (alternative) — for interactive development and debugging

For production deployment, the fine-tuned SLMs can run on SageMaker Real-Time Endpoints, self-hosted EC2, or even AWS Outposts for on-premises telco edge deployments where data residency is required.