Dextra Labs
How to Fine-Tune Claude on Amazon Bedrock for Your Domain (Complete Guide with Code)

Dataset prep, Bedrock setup, training configuration, evaluation, deployment, with real cost estimates for startups without ML teams.

Let me tell you when fine-tuning is actually the right answer.

Most of the time it isn't. A well-crafted system prompt with good examples handles 80% of domain adaptation problems faster, cheaper and with less operational overhead than fine-tuning. I'll come back to this at the end because it's genuinely important and most tutorials skip it.

But there's a specific category of problem where fine-tuning earns its complexity: when you need consistent output format that prompting alone can't reliably produce, when you're running high-volume inference where per-token costs compound, when your domain has terminology or reasoning patterns so specialised that few-shot examples don't transfer well, or when latency from long system prompts is measurably affecting your product experience.

If you're in that category, this guide gets you from zero to a deployed fine-tuned Claude model on Amazon Bedrock with working code throughout.

What Amazon Bedrock Fine-Tuning Actually Is

Bedrock's fine-tuning is model customisation: you're taking Anthropic's base Claude model and continuing its training on your domain-specific data. The result is a custom model variant that lives in your AWS account, responds to the same API you're already using and handles your specific use case with more consistency than the base model on the same prompts.

The key constraint: Bedrock fine-tuning uses the Claude models Anthropic makes available for customisation, which at the time of writing is Claude Haiku. The capability is narrower than you might expect from the marketing: you're adapting behaviour and format consistency, not teaching the model fundamentally new knowledge. If you need the model to reason differently, fine-tuning helps. If you need it to know things that aren't in its training data, you need RAG, not fine-tuning.

Prerequisites

Before the code:

  • AWS account with Bedrock access enabled in your target region (us-east-1 or us-west-2 for Bedrock availability)
  • IAM role with Bedrock full access and S3 read/write permissions
  • Python 3.9+
  • An S3 bucket for your training data and model artefacts
  • Training dataset (we'll build one)

Install the dependencies:

bash
pip install boto3 pandas jsonlines scikit-learn tqdm

Then configure your AWS clients:

python
import boto3
import json
import pandas as pd
import jsonlines
from pathlib import Path

# Configure your AWS session
session = boto3.Session(
    region_name='us-east-1'  # Confirm Bedrock availability in your region
)

bedrock = session.client('bedrock')
bedrock_runtime = session.client('bedrock-runtime')
s3 = session.client('s3')

BUCKET_NAME = "your-fine-tuning-bucket"
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # Confirm the exact identifier against Bedrock's customisable-model list

Step 1: Dataset Preparation

This is where most fine-tuning projects succeed or fail. The model learns what you show it; garbage in, garbage out is nowhere more true than in fine-tuning.

Bedrock's fine-tuning expects data in a specific JSONL format. Each line is a complete training example with a prompt and the ideal completion.

python
# Each training example must follow this structure
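# (Formats vary by base model: check the current Bedrock model customisation
# docs for the exact schema your chosen Claude version expects, as newer
# variants may expect a messages-style layout rather than prompt/completion.)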
example = {
    "prompt": "Your input prompt here",
    "completion": "The ideal output you want the model to produce"
}

For a domain adaptation use case (say, fine-tuning for a legal document summarisation task), your data preparation looks like this:

python
class DatasetPreparator:
    def __init__(self, output_path: str):
        self.output_path = Path(output_path)
        self.output_path.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
        self.examples = []

    def add_example(
        self, 
        document_text: str, 
        ideal_summary: str,
        document_type: str = None
    ):
        """Add a training example with optional metadata."""

        # Build the prompt that matches your production prompt structure
        # CRITICAL: Your fine-tuning prompt must match your inference prompt
        prompt = self._build_prompt(document_text, document_type)

        self.examples.append({
            "prompt": prompt,
            "completion": ideal_summary
        })

    def _build_prompt(self, text: str, doc_type: str = None) -> str:
        type_context = f" This is a {doc_type}." if doc_type else ""
        return (
            f"Summarise the following legal document in three sections: "
            f"Key Parties, Core Obligations and Risk Flags.{type_context}"
            f"\n\nDocument:\n{text}\n\nSummary:"
        )

    def validate_and_write(self, train_split: float = 0.9):
        """Validate examples and write train/validation splits."""

        # Validation checks
        issues = []
        for i, ex in enumerate(self.examples):
            if len(ex['prompt']) < 10:
                issues.append(f"Example {i}: prompt too short")
            if len(ex['completion']) < 20:
                issues.append(f"Example {i}: completion too short")
            if len(ex['prompt']) > 4000:
                issues.append(f"Example {i}: prompt too long (rough character-based check)")

        if issues:
            print(f"Found {len(issues)} issues:")
            for issue in issues[:10]:  # Show first 10
                print(f"  {issue}")
            return False

        # Split into train and validation
        split_idx = int(len(self.examples) * train_split)
        train_data = self.examples[:split_idx]
        val_data = self.examples[split_idx:]

        # Write JSONL files
        for filename, data in [
            ("train.jsonl", train_data), 
            ("validation.jsonl", val_data)
        ]:
            with jsonlines.open(self.output_path / filename, 'w') as writer:
                writer.write_all(data)

        print(f"Written {len(train_data)} training examples")
        print(f"Written {len(val_data)} validation examples")
        return True

# Usage
prep = DatasetPreparator("./training_data")

# Load your examples — minimum 32 for Bedrock, 
# recommend 200+ for meaningful results
for _, row in your_dataframe.iterrows():
    prep.add_example(
        document_text=row['document'],
        ideal_summary=row['expert_summary'],
        document_type=row['type']
    )

prep.validate_and_write()

Dataset size guidance: Bedrock requires a minimum of 32 training examples. In practice, you won't see meaningful domain adaptation below 100 examples and the sweet spot for most use cases is 300 to 1,000 high-quality examples. High quality beats high volume. 200 expert-written summaries will outperform 2,000 mediocre ones.

Step 2: Upload Training Data to S3

python
def upload_training_data(
    local_dir: str, 
    bucket: str, 
    prefix: str = "fine-tuning"
) -> dict:
    """Upload training files to S3 and return URIs."""

    s3_uris = {}

    for filename in ["train.jsonl", "validation.jsonl"]:
        local_path = Path(local_dir) / filename
        s3_key = f"{prefix}/{filename}"

        print(f"Uploading {filename}...")
        s3.upload_file(
            str(local_path),
            bucket,
            s3_key
        )

        s3_uris[filename] = f"s3://{bucket}/{s3_key}"
        print(f"Uploaded to {s3_uris[filename]}")

    return s3_uris

uris = upload_training_data(
    "./training_data",
    BUCKET_NAME,
    "legal-summarisation/v1"
)

Step 3: Configure and Launch Fine-Tuning Job

python
def launch_fine_tuning_job(
    job_name: str,
    training_uri: str,
    validation_uri: str,
    output_bucket: str,
    role_arn: str
) -> str:
    """Launch a Bedrock fine-tuning job and return job ARN."""

    response = bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName=f"{job_name}-model",
        roleArn=role_arn,
        baseModelIdentifier=MODEL_ID,

        # Training data configuration
        trainingDataConfig={
            "s3Uri": training_uri
        },
        validationDataConfig={
            "validators": [{
                "s3Uri": validation_uri
            }]
        },

        # Output configuration
        outputDataConfig={
            "s3Uri": f"s3://{output_bucket}/fine-tuning-output/{job_name}/"
        },

        # Hyperparameters
        # These are the defaults — adjust based on your dataset size
        hyperParameters={
            "epochCount": "3",        # Start with 3, increase if underfitting
            "batchSize": "32",        # 32 is standard for most cases  
            "learningRate": "0.00001" # Conservative default
        },

        customizationType="FINE_TUNING"
    )

    job_arn = response['jobArn']
    print(f"Fine-tuning job launched: {job_arn}")
    return job_arn

# Your IAM role ARN — must have Bedrock and S3 permissions
ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/BedrockFineTuningRole"

job_arn = launch_fine_tuning_job(
    job_name="legal-summarisation-v1",
    training_uri=uris["train.jsonl"],
    validation_uri=uris["validation.jsonl"],
    output_bucket=BUCKET_NAME,
    role_arn=ROLE_ARN
)

Hyperparameter guidance:

epochCount controls how many times the model sees your training data. Start at 3. If your validation loss is still improving at epoch 3, try 5. If it plateaus at epoch 1, your dataset may have quality issues.

learningRate at 0.00001 is conservative and safe. Going higher risks destabilising the base model's general capabilities. Lower if you're seeing erratic validation loss.

batchSize of 32 works for most datasets. Larger batches are more stable but require more memory.
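
If you want a starting point tied to dataset size, here's a minimal rule-of-thumb helper. The thresholds are assumptions, not Bedrock recommendations; treat the output as a first guess and adjust against your validation loss.

python
def suggest_hyperparameters(num_examples: int) -> dict:
    """Rule-of-thumb starting points by dataset size (heuristic, not official)."""
    if num_examples < 200:
        # Small datasets: more passes so the model actually sees the patterns
        return {"epochCount": "5", "batchSize": "8", "learningRate": "0.00001"}
    if num_examples < 1000:
        # The defaults used in the job above
        return {"epochCount": "3", "batchSize": "32", "learningRate": "0.00001"}
    # Large datasets: fewer passes are usually enough
    return {"epochCount": "2", "batchSize": "32", "learningRate": "0.00001"}

print(suggest_hyperparameters(500))
# {'epochCount': '3', 'batchSize': '32', 'learningRate': '0.00001'}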

Step 4: Monitor the Job

Fine-tuning a Claude Haiku model typically takes 30 to 90 minutes depending on dataset size. Don't just wait; monitor it.

python
import time

def monitor_job(job_arn: str, check_interval: int = 60) -> str:
    """Poll job status until completion and return the custom model ARN."""

    print(f"Monitoring job: {job_arn}")

    while True:
        response = bedrock.get_model_customization_job(
            jobIdentifier=job_arn
        )

        status = response['status']
        print(f"[{time.strftime('%H:%M:%S')}] Status: {status}")

        if status in ['Completed', 'Failed', 'Stopped']:
            if status == 'Completed':
                custom_model_arn = response['outputModelArn']
                print(f"Success! Model ARN: {custom_model_arn}")
                return custom_model_arn
            else:
                failure_msg = response.get('failureMessage', 'Unknown error')
                raise Exception(f"Job {status}: {failure_msg}")

        # Show metrics if available (guard against missing values before formatting)
        if 'trainingMetrics' in response:
            loss = response['trainingMetrics'].get('trainingLoss')
            if loss is not None:
                print(f"  Training loss: {loss:.4f}")

        time.sleep(check_interval)

custom_model_arn = monitor_job(job_arn)

Step 5: Evaluate Before You Deploy

Never skip evaluation. The fine-tuned model will be different from the base model; the question is whether it's different in the ways you wanted.

python
def evaluate_model(
    custom_model_arn: str,
    test_examples: list,
    base_model_id: str = MODEL_ID
) -> dict:
    """Compare fine-tuned model against base model on test examples."""

    results = {
        'fine_tuned': [],
        'base_model': [],
        'comparisons': []
    }

    for example in test_examples:
        prompt = example['prompt']
        reference = example['reference_output']

        # Run inference on both models
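        # Note: custom models are served behind provisioned throughput (see Step 6),
        # so you may need to provision a temporary model unit before this call succeeds.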
        ft_response = bedrock_runtime.invoke_model(
            modelId=custom_model_arn,
            body=json.dumps({
                "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "max_tokens_to_sample": 1000,
                "temperature": 0.1
            })
        )

        base_response = bedrock_runtime.invoke_model(
            modelId=base_model_id,
            body=json.dumps({
                "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "max_tokens_to_sample": 1000,
                "temperature": 0.1
            })
        )

        ft_output = json.loads(
            ft_response['body'].read()
        )['completion']

        base_output = json.loads(
            base_response['body'].read()
        )['completion']

        results['comparisons'].append({
            'prompt': prompt,
            'reference': reference,
            'fine_tuned': ft_output,
            'base_model': base_output
        })

    return results

# Run on 20-30 held-out examples that weren't in training
evaluation = evaluate_model(
    custom_model_arn,
    held_out_test_set
)

# Review comparisons manually — automated metrics 
# miss nuance that matters in production
for comp in evaluation['comparisons'][:5]:
    print(f"Prompt: {comp['prompt'][:100]}...")
    print(f"Reference: {comp['reference'][:200]}")
    print(f"Fine-tuned: {comp['fine_tuned'][:200]}")
    print(f"Base model: {comp['base_model'][:200]}")
    print("---")

Read the outputs. Don't just run BLEU scores and call it done. The qualitative assessment (does the fine-tuned model actually behave the way you wanted it to?) is what tells you whether to deploy or iterate.
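
If you want one automated number alongside the manual review, a simple format-adherence check works well for this kind of task. A minimal sketch, assuming the three section headings from the training prompt:

python
REQUIRED_SECTIONS = ["Key Parties", "Core Obligations", "Risk Flags"]

def format_adherence(outputs: list) -> float:
    """Fraction of outputs that contain all three required section headings."""
    if not outputs:
        return 0.0
    hits = sum(
        all(section in output for section in REQUIRED_SECTIONS)
        for output in outputs
    )
    return hits / len(outputs)

ft_rate = format_adherence([c['fine_tuned'] for c in evaluation['comparisons']])
base_rate = format_adherence([c['base_model'] for c in evaluation['comparisons']])
print(f"Format adherence: fine-tuned {ft_rate:.0%}, base {base_rate:.0%}")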

Step 6: Deploy via Provisioned Throughput

Custom models require provisioned throughput to serve inference. This is the ongoing cost commitment.

python
def provision_model(
    model_arn: str,
    provisioned_name: str,
    model_units: int = 1
) -> str:
    """Provision throughput for the fine-tuned model."""

    response = bedrock.create_provisioned_model_throughput(
        modelUnits=model_units,
        provisionedModelName=provisioned_name,
        modelId=model_arn
    )

    provisioned_arn = response['provisionedModelArn']
    print(f"Provisioned model ARN: {provisioned_arn}")
    return provisioned_arn

provisioned_arn = provision_model(
    custom_model_arn,
    "legal-summarisation-prod",
    model_units=1  # Scale up based on your throughput needs
)

Production inference:

python
def invoke_custom_model(
    prompt: str,
    provisioned_arn: str,
    max_tokens: int = 1000
) -> str:
    """Invoke the fine-tuned model for production inference."""

    response = bedrock_runtime.invoke_model(
        modelId=provisioned_arn,
        body=json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": max_tokens,
            "temperature": 0.1,
            "stop_sequences": ["\n\nHuman:"]
        }),
        contentType="application/json",
        accept="application/json"
    )

    result = json.loads(response['body'].read())
    return result['completion']

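A quick usage sketch. The document loading is illustrative; note that the prompt structure deliberately matches the one used in training, which is what keeps the output format consistent:

python
document = Path("contract.txt").read_text()  # illustrative input

summary = invoke_custom_model(
    prompt=(
        "Summarise the following legal document in three sections: "
        "Key Parties, Core Obligations and Risk Flags."
        f"\n\nDocument:\n{document}\n\nSummary:"
    ),
    provisioned_arn=provisioned_arn
)
print(summary)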

Cost Estimates

Honest numbers for a startup-scale use case:

Training costs:

  • Fine-tuning job: approximately $0.004 per 1,000 tokens in your training dataset
  • A 500-example dataset with average 800 tokens per example: ~$1.60 for training
  • Training runs multiple epochs: multiply by epoch count (~$5-8 total for 3 epochs)

Provisioned throughput:

  • 1 model unit: approximately $5.50 per hour
  • Running 24/7: ~$3,960 per month
  • Running 8 hours/day: ~$1,320 per month

The provisioned throughput cost is the real number to plan around. For most startups, a fine-tuned Claude Haiku model only makes economic sense at volume: thousands of requests per day, where the per-token efficiency gain or quality improvement justifies the fixed monthly cost.
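
To sanity-check these numbers against your own workload, here's a tiny calculator using the figures above. The prices are the estimates from this post, not quoted AWS rates; substitute current pricing for your region.

python
TRAINING_PRICE_PER_1K_TOKENS = 0.004  # estimate used above, not a quoted rate
MODEL_UNIT_PER_HOUR = 5.50            # estimate used above, not a quoted rate

def estimate_costs(n_examples, avg_tokens, epochs, hours_per_day):
    """Return (one-off training cost, monthly provisioned throughput cost)."""
    training_tokens = n_examples * avg_tokens * epochs
    training_cost = training_tokens / 1000 * TRAINING_PRICE_PER_1K_TOKENS
    monthly_serving = MODEL_UNIT_PER_HOUR * hours_per_day * 30
    return training_cost, monthly_serving

train_cost, serve_cost = estimate_costs(500, 800, epochs=3, hours_per_day=24)
print(f"One-off training: ~${train_cost:.2f}")                 # ~$4.80
print(f"Monthly provisioned throughput: ~${serve_cost:,.0f}")  # ~$3,960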

Before You Fine-Tune: The Honest Check

I promised to come back to this.

Fine-tuning is genuinely powerful for the right problems. It's also consistently reached for too early by teams who haven't fully explored what's achievable with well-engineered prompts.

Before committing to the complexity and cost of fine-tuning, spend a week on prompt engineering. A good system prompt with 5-10 examples often gets you to 90% of what fine-tuning would achieve, at zero training cost, with the ability to iterate in minutes rather than hours.
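
For contrast, here's roughly what the prompt-first alternative looks like for the same legal summarisation task: a system prompt plus a couple of worked examples sent to the base Haiku model through the Bedrock Messages API. The example content and model ID are illustrative; this is a sketch, not production code.

python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

SYSTEM_PROMPT = (
    "You summarise legal documents in three sections: "
    "Key Parties, Core Obligations and Risk Flags."
)

# Few-shot examples embedded as prior conversation turns (content is illustrative)
FEW_SHOT = [
    {"role": "user", "content": "Document:\n<example contract text>"},
    {"role": "assistant", "content": "Key Parties: ...\nCore Obligations: ...\nRisk Flags: ..."},
]

def summarise_with_prompting(document_text: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1000,
        "temperature": 0.1,
        "system": SYSTEM_PROMPT,
        "messages": FEW_SHOT + [
            {"role": "user", "content": f"Document:\n{document_text}"}
        ],
    }
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]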

For enterprise-grade prompt engineering (the methodology, the evaluation approach and the common mistakes that waste weeks of iteration), we wrote a complete guide on what prompt engineering actually is and how to do it systematically. Read it before you start a fine-tuning project.

If you've done the prompt work and you're still hitting the limitations, then the fine-tune Claude on Bedrock enterprise guide covers the production considerations (IAM architecture, multi-model versioning, A/B testing custom models) that go beyond what fits in a single tutorial.


Published by Dextra Labs | AI Consulting & Enterprise LLM Solutions
