OCI Generative AI: Dedicated Cluster Sizing, Pricing, and Fine-Tuning Deep Dive

Understanding how to properly size, price, and configure dedicated AI clusters is crucial for cost-effective deployment of custom LLMs on Oracle Cloud Infrastructure. This comprehensive guide walks you through cluster types, pricing calculations, fine-tuning configurations, and how to interpret your results.

Dedicated AI Cluster Unit Types

OCI Generative AI provides several dedicated cluster unit types, each optimized for specific models and workloads:

Available Unit Types

Small Cohere Dedicated (V2)

  • Unit identifier: SMALL_COHERE_V2
  • Suitable for: Cohere Command R models
  • Use case: Fine-tuning and hosting smaller Cohere models

Large Cohere Dedicated

  • Unit identifier: LARGE_COHERE
  • Suitable for: Cohere Command R+ models
  • Use case: Larger Cohere models requiring more compute

Embed Cohere Dedicated

  • Unit identifier: EMBED_COHERE
  • Suitable for: Cohere embedding models (English V3, Multilingual V3, Light variants)
  • Use case: Generating embeddings at scale

Large Meta Dedicated

  • Unit identifier: LARGE_META (model-specific variants also exist)
  • Suitable for: Meta Llama models (70B, 405B parameters)
  • Use case: Hosting and fine-tuning large Meta Llama models

Understanding Unit Requirements

Different models require different numbers of units for fine-tuning and hosting. Let's use the Cohere Command R 08-2024 model as an example:

Fine-Tuning Requirements:
To fine-tune a Cohere Command R 08-2024 model, you need a preset value of 8 Small Cohere V2 units. This is non-negotiable—the fine-tuning cluster is automatically sized for optimal training performance.

Hosting Requirements:
To host the fine-tuned Command R 08-2024 model, you need a minimum of 1 Small Cohere V2 unit. You can add more units (for example, 3) to handle higher call volumes.

Total for Complete Workflow:
If you want to both fine-tune and host a Command R model, you'll need:

  • 8 Small Cohere V2 units (fine-tuning cluster)
  • 1+ Small Cohere V2 units (hosting cluster)
  • Total: Minimum 9 Small Cohere V2 units
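
A minimal Python sketch of this sizing math, shown below, can serve as a quick sanity check. The preset and minimum values simply mirror the example above and are not pulled from any Oracle API; confirm the actual presets for your model in the OCI console.

# Sketch: total Small Cohere V2 units needed to fine-tune and host Command R.
# The values below are the preset/minimum sizes described above (assumptions
# to verify in the OCI console), not constants from any SDK.
FINE_TUNING_UNITS = 8   # preset fine-tuning cluster size (cannot be changed)
MIN_HOSTING_UNITS = 1   # minimum hosting cluster size

def total_units(hosting_units: int = MIN_HOSTING_UNITS) -> int:
    """Units required for the complete fine-tune-and-host workflow."""
    return FINE_TUNING_UNITS + hosting_units

print(total_units())    # 9  -> minimum workflow
print(total_units(3))   # 11 -> hosting scaled to 3 units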

Commitment Requirements and Billing

Understanding OCI's commitment model is essential for cost planning.

Minimum Commitments

Hosting Clusters: Require a minimum commitment of 744 unit-hours per cluster

  • 744 hours = 31 days (one month)
  • This is your minimum billing period for hosting

Fine-Tuning Clusters: Require a minimum of 1 unit-hour

  • You pay only for the actual time used
  • Much more flexible than hosting commitments

Billing Mechanics

Character-Based Billing (On-Demand):
A transaction is one character, so 10,000 transactions = 10,000 characters. On-demand usage of foundational models is charged per character for both prompts and responses (except embedding models, where the response isn't counted).

Unit-Hour Billing (Dedicated Clusters):
If you're hosting foundational models or fine-tuning them on dedicated AI clusters, you're charged by the unit-hour rather than by transaction.

Real-World Pricing Example: Ryan's Fine-Tuning Journey

Let's walk through a complete pricing scenario, adapted from the example in Oracle's documentation to fit Ryan's situation:

Ryan's Requirements

Ryan wants to:

  1. Fine-tune a Cohere Command R 08-2024 model
  2. Host the fine-tuned model for production use
  3. Create a new fine-tuned version weekly (4 times per month)

Step-by-Step Cost Calculation

Step 1: Fine-Tuning Cluster Costs

Ryan creates a fine-tuning dedicated AI cluster with the preset value of 8 Small Cohere V2 units. The fine-tuning job takes 5 hours to complete. Ryan creates a fine-tuning cluster every week.

Monthly Fine-Tuning Calculation:

Fine-tuning sessions per month: 4 (weekly)
Units per session: 8 Small Cohere V2
Hours per session: 5 hours
Total unit-hours: 4 sessions × 8 units × 5 hours = 160 unit-hours

Monthly fine-tuning cost = 160 unit-hours × $<Small-Cohere-dedicated-unit-per-hour-price>

Step 2: Hosting Cluster Costs

For hosting, dedicated AI clusters require a minimum commitment of 744 unit-hours per cluster.

Let's assume Ryan uses 1 Small Cohere V2 unit for hosting:

Hosting units: 1 Small Cohere V2
Hours per month: 744 (minimum commitment)
Total unit-hours: 1 unit × 744 hours = 744 unit-hours

Monthly hosting cost = 744 unit-hours × $<Small-Cohere-dedicated-unit-per-hour-price>

Step 3: Total Monthly Cost

Total monthly cost = Fine-tuning cost + Hosting cost
Total monthly cost = (160 + 744) unit-hours × $<Small-Cohere-dedicated-unit-per-hour-price>
Total monthly cost = 904 unit-hours × $<Small-Cohere-dedicated-unit-per-hour-price>
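
To make this arithmetic reusable, here's a small Python sketch that reproduces the calculation. The unit-hour price is a placeholder to fill in from the current OCI price list, and the session counts, durations, and unit counts are Ryan's example values rather than defaults from any SDK.

# Sketch: monthly dedicated-cluster cost for Ryan's scenario.
# UNIT_PRICE is a placeholder for the Small Cohere V2 dedicated unit-hour price.
UNIT_PRICE = 0.0  # fill in from the OCI price list

def fine_tuning_unit_hours(sessions_per_month=4, units_per_session=8, hours_per_session=5):
    return sessions_per_month * units_per_session * hours_per_session

def hosting_unit_hours(hosting_units=1, hours_per_month=744):
    return hosting_units * hours_per_month

ft = fine_tuning_unit_hours()     # 4 * 8 * 5 = 160 unit-hours
host = hosting_unit_hours()       # 1 * 744   = 744 unit-hours
total = ft + host                 # 904 unit-hours
print(f"{total} unit-hours -> ${total * UNIT_PRICE:,.2f} per month")

# Scaling hosting to 3 units (next section): 160 + 3*744 = 2392 unit-hours
print(fine_tuning_unit_hours() + hosting_unit_hours(hosting_units=3))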

Scaling Hosting for Higher Traffic

If Ryan decides to buy three units of Small Cohere V2 to handle higher call volume, the hosting calculation becomes:

Hosting units: 3 Small Cohere V2
Hours per month: 744 (a full month of hosting)
Total unit-hours: 3 units × 744 hours = 2,232 unit-hours

Monthly hosting cost = 2,232 unit-hours × $<Small-Cohere-dedicated-unit-per-hour-price>

Fine-Tuning Configuration Parameters

Understanding fine-tuning hyperparameters is crucial for achieving optimal model performance.

Training Methods

OCI Generative AI supports two main parameter-efficient fine-tuning methods:

T-Few (few-shot parameter-efficient fine-tuning):

  • Parameter-efficient method for Cohere models
  • Selectively updates only a fraction of model weights
  • Recommended for small datasets (<100,000 samples)
  • Ideal for changing instruction-following behavior

LoRA (Low-Rank Adaptation):

  • Parameter-efficient method for Llama models
  • Adds trainable adapter matrices while freezing base model
  • Balances efficiency with performance
  • Suitable for most fine-tuning scenarios

Hyperparameters Explained

OCI Generative AI fine-tunes each base model with a set of hyperparameters whose default values depend on the pretrained base model. It's best to start training with these defaults.

Total Training Epochs:
An epoch is one complete pass through the entire training dataset. The number of epochs determines how many times the model sees all training examples.

  • Too few epochs: Model underfits, doesn't learn patterns well
  • Too many epochs: Model overfits, memorizes training data instead of generalizing
  • Optimal range: Typically 3-10 epochs depending on dataset size

Learning Rate:
The learning rate controls how much the model parameters are adjusted at each iteration. A lower learning rate makes smaller updates, which lengthens training but can converge more reliably toward a low loss.

  • Too high: Model may overshoot optimal parameters, fail to converge
  • Too low: Training takes excessively long, may get stuck in local minima
  • Typical range: 1e-5 to 1e-3 for fine-tuning

Learning rate schedules that adjust the rate over time, and adaptive optimizers such as Adam or RMSprop, are commonly used.

Training Batch Size:
Number of training examples processed together before updating model parameters.

  • Larger batches: More stable gradient estimates, faster training, higher memory usage
  • Smaller batches: More gradient noise (can help escape local minima), lower memory usage
  • Typical range: 4-32 for LLM fine-tuning

Early Stopping Patience:
Number of epochs to wait for improvement before stopping training early. Prevents wasting compute on models that have stopped improving.

  • Example: If patience = 3, training stops if validation loss doesn't improve for 3 consecutive epochs

Early Stopping Threshold:
The minimum improvement required to reset the patience counter. Helps distinguish real improvements from noise.

  • Example: Threshold = 0.001 means validation loss must improve by at least 0.001 to count as improvement

Log Model Metrics Interval Steps:
Frequency of logging training metrics (loss, accuracy) during training.

  • Smaller intervals: More detailed monitoring, larger log files
  • Larger intervals: Less granular view, smaller storage requirements
  • Typical value: Every 10-100 steps
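
Pulling these together, here's a schematic Python loop showing how early stopping patience and threshold interact. The per-epoch validation losses are simulated; this is not the OCI service's internal training code (there, these values are simply set as job hyperparameters), and the log-metrics interval works the same way at step granularity.

# Schematic early-stopping logic: stop after PATIENCE epochs without an
# improvement of at least THRESHOLD. Loss values are simulated.
val_losses = [2.5, 1.8, 1.3, 1.1, 1.0995, 1.0994, 1.0993]

PATIENCE = 3        # epochs to wait without a meaningful improvement
THRESHOLD = 0.001   # minimum improvement that resets the patience counter

best = float("inf")
stalled = 0
for epoch, val_loss in enumerate(val_losses, start=1):
    if best - val_loss >= THRESHOLD:   # real improvement: reset the counter
        best = val_loss
        stalled = 0
    else:                              # improvement too small (or none)
        stalled += 1
    print(f"epoch {epoch}: val_loss={val_loss:.4f} stalled={stalled}")
    if stalled >= PATIENCE:
        print("early stopping triggered")
        break
# The log-metrics interval applies per step instead: log when step % interval == 0.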

Hyperparameter Calculation

The model calculates totalTrainingSteps using: totalTrainingSteps = (totalTrainingEpochs × size(trainingDataset)) / trainingBatchSize

Example Calculation:

Training dataset: 1,000 examples
Training epochs: 5
Batch size: 10

Total training steps = (5 × 1,000) / 10 = 500 steps
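
The same formula as a small Python function (names mirror the camelCase parameters above in snake_case; ignoring any final partial batch is an assumption of this sketch):

# Sketch of the step calculation from the formula above.
def total_training_steps(total_training_epochs: int,
                         training_dataset_size: int,
                         training_batch_size: int) -> int:
    # Straight division as in the documented formula; a final partial batch
    # would add one more step, which this sketch ignores.
    return (total_training_epochs * training_dataset_size) // training_batch_size

print(total_training_steps(5, 1_000, 10))  # 500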

Understanding Fine-Tuning Results

After training completes, evaluating model performance is critical. In the model's detail page, under Model Performance, check the values for accuracy and loss. If you're not happy with the results, create another model with either a larger dataset or different hyperparameters until performance improves.

Key Metrics Explained

Accuracy:
Accuracy measures the proportion of correct predictions among the total number of cases evaluated. In the context of LLM fine-tuning, accuracy measures how often the generated tokens match the annotated tokens in your training data.

Formula:

Accuracy = (Number of correct token predictions) / (Total number of tokens)

Interpretation:

  • Accuracy = 0.90 (90%): The model predicts the correct token 90% of the time
  • Higher is better, but perfect accuracy (1.0) often indicates overfitting

Example:
If fine-tuning on a medical terminology task, accuracy of 85% means the model correctly generates medical terms 85% of the time.
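
A toy Python illustration of this token-level definition (the token lists are made up; real evaluation happens inside the fine-tuning job):

# Toy example of token-level accuracy: the fraction of positions where the
# generated token matches the annotated (reference) token.
generated = ["the", "patient", "has", "acute", "bronchitis"]
reference = ["the", "patient", "has", "chronic", "bronchitis"]

correct = sum(g == r for g, r in zip(generated, reference))
accuracy = correct / len(reference)
print(f"accuracy = {accuracy:.2f}")  # 4 correct out of 5 -> 0.80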

Loss:
Loss (also called training loss) measures how wrong the generated outputs of the model are. The loss function quantifies the discrepancy between predicted outputs and actual targets.

Key Characteristics:

  • Loss decreases as the model improves
  • Lower loss indicates better fit to training data
  • Should decrease consistently across epochs (if not, check learning rate)

Common Loss Functions:

  • Cross-entropy loss: Standard for classification and language modeling
  • Mean squared error: Less common for LLMs

Interpreting Loss Curves:

For example, in one run the training loss started at 2.86 at step 25 and, by step 200, both training and validation loss had dropped to roughly 0.4.

Epoch 1: Loss = 2.5
Epoch 2: Loss = 1.8  ← Good! Decreasing
Epoch 3: Loss = 1.3  ← Still improving
Epoch 4: Loss = 1.1  ← Slowing down
Epoch 5: Loss = 1.05 ← Converging

Warning Signs:

  • Loss increases: Learning rate too high, or data quality issues
  • Loss plateaus early: Learning rate too low, or model capacity insufficient
  • Loss fluctuates wildly: Batch size too small, or learning rate too high

Training vs. Validation Loss

Training Loss:
Measured on the data the model is learning from. It generally decreases as training progresses (assuming a reasonable learning rate).

Validation Loss:
Measured on held-out data the model hasn't seen. Indicates generalization ability.

Ideal Pattern:

Epoch  | Training Loss | Validation Loss
-------|---------------|----------------
1      | 2.5          | 2.6
2      | 1.8          | 1.9
3      | 1.3          | 1.4
4      | 1.1          | 1.2

Both decreasing = healthy training

Overfitting Pattern:

Epoch  | Training Loss | Validation Loss
-------|---------------|----------------
1      | 2.5          | 2.6
2      | 1.8          | 1.9
3      | 1.3          | 1.6  ← Warning!
4      | 1.0          | 1.8  ← Overfitting!

Training loss decreases but validation loss increases = model memorizing training data
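
A small Python check can flag this pattern (training loss still falling while validation loss rises) so you know to stop and revisit the dataset or hyperparameters. The loss values below are just the numbers from the table above.

# Flag overfitting: training loss keeps improving while validation loss worsens.
train_loss = [2.5, 1.8, 1.3, 1.0]
val_loss   = [2.6, 1.9, 1.6, 1.8]

for epoch in range(1, len(train_loss)):
    train_improving = train_loss[epoch] < train_loss[epoch - 1]
    val_worsening = val_loss[epoch] > val_loss[epoch - 1]
    if train_improving and val_worsening:
        print(f"epoch {epoch + 1}: possible overfitting "
              f"(train {train_loss[epoch]:.2f}, val {val_loss[epoch]:.2f})")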

Additional Evaluation Metrics

While accuracy and loss are primary, consider these secondary metrics:

Perplexity:
Exponential of the loss, more interpretable for language models. Lower is better.

Perplexity = e^(loss)
If loss = 1.5, perplexity = 4.48
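
In Python this is a one-liner with math.exp:

import math

loss = 1.5
perplexity = math.exp(loss)   # e^1.5 ≈ 4.48
print(round(perplexity, 2))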

F1-Score:
The F1-score is the harmonic mean of precision and recall, offering a balanced assessment of LLM performance on language tasks.

Particularly useful for classification tasks within fine-tuning (e.g., intent classification, sentiment analysis).

BLEU/ROUGE Scores:
For text generation quality, though traditional metrics like BLEU and ROUGE are considered insufficient for natural conversations.

Best Practices for Cost Optimization

1. Fine-Tuning Strategy

Minimize Training Sessions:

  • Prepare high-quality datasets before fine-tuning
  • Test hyperparameters on small subsets first
  • Batch multiple improvements into single training runs

Optimize Training Time:

  • Use early stopping to avoid unnecessary epochs
  • Monitor loss curves—stop if converging
  • Start with smaller epochs, increase if needed

2. Hosting Strategy

Right-Size Hosting Units:
Start with 1 unit and add more only as call volume demands. Remember the 744-hour minimum commitment.

Consider On-Demand for Low Volume:
If usage is sporadic or low, on-demand pricing might be more cost-effective than dedicated hosting.

Calculation for Break-Even:

On-demand cost per 10,000 characters: $X
Dedicated cost per unit-hour: $Y
Characters processed per hour: Z

Break-even: (Z / 10,000) × $X = $Y
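
Here's a hedged Python sketch of the same break-even check. The prices and throughput are placeholders to replace with figures from the OCI price list and your own traffic estimates; it assumes, as described above, that on-demand is billed per character and dedicated hosting per unit-hour.

# Break-even sketch: on-demand character billing vs. dedicated unit-hour billing.
# All numbers below are placeholders, not real OCI prices.
ON_DEMAND_PRICE_PER_10K_CHARS = 0.0   # $X from the price list
DEDICATED_PRICE_PER_UNIT_HOUR = 0.0   # $Y from the price list
CHARS_PER_HOUR = 0                    # Z, your estimated traffic

def on_demand_cost_per_hour(chars_per_hour: float) -> float:
    return (chars_per_hour / 10_000) * ON_DEMAND_PRICE_PER_10K_CHARS

def dedicated_cost_per_hour(units: int = 1) -> float:
    return units * DEDICATED_PRICE_PER_UNIT_HOUR

if on_demand_cost_per_hour(CHARS_PER_HOUR) > dedicated_cost_per_hour():
    print("Dedicated hosting is cheaper at this volume")
else:
    print("On-demand is cheaper (or equal) at this volume")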

3. Development vs. Production

Development/Testing:

  • Use on-demand inferencing
  • Fine-tune infrequently with small datasets
  • No hosting commitment needed

Production:

  • Dedicated hosting for consistent performance
  • Batch fine-tuning updates (weekly/monthly)
  • Scale hosting units based on actual traffic

4. Monitoring and Optimization

Track Key Metrics:

  • Cost per fine-tuning session
  • Cost per inference (characters processed)
  • Utilization rate of hosting clusters
  • Model performance metrics (accuracy, loss)

Optimization Opportunities:

  • Reduce fine-tuning frequency if model performance stable
  • Adjust hosting capacity based on traffic patterns
  • Use cached responses for common queries

Common Pitfalls and Solutions

Pitfall 1: Over-Fine-Tuning

Problem: Creating new fine-tuned models too frequently
Cost Impact: Unnecessary compute charges
Solution: Fine-tune only when you have substantial new training data or performance degrades

Pitfall 2: Over-Provisioning Hosting

Problem: Allocating too many hosting units "just in case"
Cost Impact: Paying for unused capacity (each hosted unit accrues roughly 744 unit-hours per month)
Solution: Start with minimum units, monitor utilization, scale up as needed

Pitfall 3: Ignoring Model Metrics

Problem: Not evaluating accuracy and loss before deploying
Cost Impact: Poor model performance requiring additional fine-tuning sessions
Solution: Always check Model Performance metrics and iterate if needed

Pitfall 4: Poor Hyperparameter Choices

Problem: Using default hyperparameters without experimentation
Cost Impact: Suboptimal models requiring more training sessions
Solution: Test hyperparameters systematically, document what works

Practical Workflow Example

Here's a complete workflow with cost awareness:

Phase 1: Initial Fine-Tuning (Week 1)

1. Prepare 5,000 training examples
2. Create fine-tuning cluster (8 Small Cohere V2 units)
3. Fine-tune with default hyperparameters (5 hours)
4. Cost: 40 unit-hours × $price
5. Check accuracy: 82% ← Needs improvement

Phase 2: Iteration (Week 2)

1. Increase dataset to 10,000 examples
2. Adjust learning rate from 1e-4 to 5e-5
3. Fine-tune again (6 hours)
4. Cost: 48 unit-hours × $price
5. Check accuracy: 91% ← Acceptable!

Phase 3: Deployment (Week 3-4)

1. Create hosting cluster (1 Small Cohere V2 unit)
2. Deploy fine-tuned model
3. Cost: 744 unit-hours × $price (monthly minimum)
4. Monitor usage and performance

Total Month 1 Cost:

Fine-tuning: (40 + 48) = 88 unit-hours
Hosting: 744 unit-hours (minimum commitment)
Total: 832 unit-hours × $<price per unit-hour>

Conclusion

Effective use of OCI Generative AI dedicated clusters requires understanding three key areas:

1. Sizing: Match cluster unit types and counts to your specific models

  • Fine-tuning: 8 Small Cohere V2 units for Command R
  • Hosting: Start with 1 unit, scale based on traffic

2. Pricing: Plan for commitments and optimize costs

  • Fine-tuning: Pay for actual hours used (minimum 1 hour)
  • Hosting: 744 unit-hour minimum monthly commitment per cluster
  • Calculate break-even vs. on-demand based on usage

3. Configuration: Tune hyperparameters and monitor metrics

  • Start with defaults, iterate systematically
  • Watch accuracy (higher is better) and loss (lower is better)
  • Use early stopping to save compute time

By following these guidelines, you can deploy custom fine-tuned models on OCI cost-effectively while maintaining high performance.
