Chat and Embedding Models in OCI Generative AI

Oracle Cloud Infrastructure (OCI) Generative AI provides two powerful categories of models: chat models for conversational AI and text generation, and embedding models for semantic understanding and search. Understanding how to configure and optimize these models is essential for building effective AI applications.

Understanding Tokens

Before diving into model parameters, it's crucial to understand that LLMs process tokens rather than characters. A token can be a word, part of a word, or even punctuation. As a general rule, estimate approximately 4 characters per token, meaning 100 tokens is roughly 60-80 words.

This matters because all LLM limits, costs, and parameters are expressed in tokens, not words or characters.
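
To get a rough sense of token counts before sending a prompt, a quick character-based estimate is usually enough for budgeting. A minimal Python sketch follows; the 4-characters-per-token ratio is only a heuristic, and the real count depends on the model's tokenizer.

def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token."""
    return max(1, round(len(text) / 4))

prompt = "Summarize the quarterly sales report in three bullet points."
print(estimate_tokens(prompt))  # ~15; the model's tokenizer may count differently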

Chat Models: Configuration Parameters

Chat models in OCI Generative AI offer extensive configuration options to control their behavior and output characteristics.

Max Output Tokens

The maximum number of tokens that you want the model to generate per response. Each model has different maximum output token limits specified in their key features.

Important: Because you're prompting a chat model, the response depends on the prompt and each response doesn't necessarily use up the maximum allocated tokens. Setting this parameter essentially creates a ceiling—the model will stop generating once it hits this limit.

Preamble Override

An initial context or guiding message for a chat model that changes the model's overall chat behavior and conversation style. When you don't provide a preamble, the model uses its default.

Default Preamble for Cohere Command R Family:

You are Command. You are an extremely capable large language model built by Cohere. 
You are given instructions programmatically via an API that you follow to the best 
of your ability.

Custom Preamble Example:

You are a helpful assistant specialized in Italian cuisine. 
Respond with enthusiasm and include regional context when discussing dishes.

With this preamble override, when asked about cannolis, the model responds with regional details about Sicily and specific bakeries like Caffe Sierra and Pasticceria Cappello.

Alternative Approach: You can also include preamble-like instructions directly in your conversation. For example: "Answer the following question in a marketing tone. Where's the best place to go sailing?"
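
For reference, here is a minimal sketch of a chat call that sets a custom preamble and a max output token ceiling. It assumes the OCI Python SDK (oci.generative_ai_inference) and a Cohere chat model; the OCIDs, region endpoint, and response attribute names below are placeholders and assumptions to adapt to your own tenancy.

import oci
from oci.generative_ai_inference import GenerativeAiInferenceClient
from oci.generative_ai_inference.models import (
    ChatDetails, CohereChatRequest, OnDemandServingMode,
)

# Placeholders: substitute your own compartment/model OCIDs and region endpoint.
compartment_id = "ocid1.compartment.oc1..example"
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

config = oci.config.from_file()  # default ~/.oci/config profile
client = GenerativeAiInferenceClient(config, service_endpoint=endpoint)

chat_request = CohereChatRequest(
    message="What dessert should I try in Palermo?",
    preamble_override=(
        "You are a helpful assistant specialized in Italian cuisine. "
        "Respond with enthusiasm and include regional context when discussing dishes."
    ),
    max_tokens=400,     # ceiling on generated tokens per response
    temperature=0.3,
)

response = client.chat(
    ChatDetails(
        compartment_id=compartment_id,
        serving_mode=OnDemandServingMode(model_id="ocid1.generativeaimodel.oc1..example"),
        chat_request=chat_request,
    )
)
print(response.data.chat_response.text)  # attribute path per the SDK's Cohere chat models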

Temperature

Temperature controls the randomness of the output. It's one of the most important parameters for shaping model behavior.

How Temperature Works:

  • Temperature modulates the probability distribution over words in the vocabulary
  • Temperature = 0: Makes the model deterministic, always selecting the word with the highest probability (essentially greedy decoding)
  • Low temperature (< 1.0): Distribution is more peaked around the most likely tokens, producing focused, predictable outputs
  • Temperature = 1.0: Standard probability distribution with no modification
  • High temperature (> 1.0): Distribution flattens across all words, increasing randomness and creativity
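
To make the peaked-versus-flattened idea concrete, here is a small self-contained Python sketch (toy logits, no OCI dependency) that applies temperature before normalizing the scores into probabilities:

import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"food": 3.0, "book": 2.5, "zebra": 0.5}
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(t, dict(zip(logits, (round(p, 3) for p in probs))))
# Low temperature concentrates mass on "food"; high temperature spreads it toward "zebra".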

Best Practices:

Start with a low temperature (0, or at least below 1), and increase it as you regenerate prompts when you want more creative output. Be cautious: high temperatures can introduce hallucinations and factually incorrect information.

Use Cases:

  • Low temperature (0.0-0.3): Code generation, factual Q&A, technical documentation
  • Medium temperature (0.5-0.7): General chatbots, customer service, balanced responses
  • High temperature (0.8-1.0+): Creative writing, brainstorming, diverse idea generation

Top-K Sampling

Top-K tells the model to pick the next token from the top K tokens in its list sorted by probability.

A higher value for K generates more random output, making the text sound more natural. The parameter essentially limits the selection pool to the K most likely candidates at each step.

Range:

  • Minimum: 1 (considers only the single most probable token, essentially deterministic)
  • Maximum: Vocabulary size (considers all possible tokens)
  • OCI Default: 0 for Command models, -1 for Llama models (meaning the model considers all tokens and doesn't use this method)

Example: If K=3 and the top three tokens are:

  • "favorite" (40% probability)
  • "best" (30% probability)
  • "preferred" (15% probability)

The model randomly selects from only these three options, weighted by their probabilities.
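
A short sketch of that step, using the illustrative probabilities above (not real model output): keep the K most likely tokens, renormalize their probabilities, and sample.

import random

def top_k_sample(token_probs, k):
    """Keep the k highest-probability tokens, renormalize, and sample one."""
    top = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens, weights = zip(*[(t, p / total) for t, p in top])
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"favorite": 0.40, "best": 0.30, "preferred": 0.15, "zebra": 0.01}
print(top_k_sample(probs, k=3))  # "zebra" can never be chosen with K=3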

Top-P (Nucleus Sampling)

Top-P is similar to Top-K but picks from top tokens based on the sum of their probabilities. It's more dynamic than Top-K because the number of tokens considered varies.

Top-P limits the candidate pool to the smallest set of top tokens whose cumulative probability reaches the threshold p.

How It Works:

At each step, the model ranks tokens by probability, accumulates them from most to least likely, and stops adding candidates once their combined probability reaches p. The next token is then sampled from that set.

Example: If Top-P = 0.5 and tokens have probabilities:

  • Token A: 0.3 (30%)
  • Token B: 0.2 (20%)
  • Token C: 0.1 (10%)

The model only considers A and B (cumulative 50%), excluding C.

Range:

  • Minimum: 0.0 (considers only the single most probable token)
  • Maximum: 1.0 (considers all tokens)

A high top-p value means the model looks at more possible words, even less likely ones, making generated text more diverse.

Practical Guidelines:

  • Top-P = 0.5: Only considers words adding up to 50% probability—focused responses
  • Top-P = 0.9: Includes many more words—varied and original responses
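
A comparable sketch for Top-P, reusing the A/B/C example: accumulate tokens from most to least likely, stop once the cumulative probability reaches p, renormalize, and sample.

import random

def top_p_sample(token_probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    tokens, weights = zip(*[(t, pr / total) for t, pr in nucleus])
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"A": 0.3, "B": 0.2, "C": 0.1, "D": 0.05}
print(top_p_sample(probs, p=0.5))  # only A and B survive the cutoff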

The Interaction: Temperature, Top-K, and Top-P

The selection process works in stages: the candidate pool is first limited to the Top-K highest-probability tokens, then further filtered by Top-P, and the final token is sampled from the remaining candidates with temperature applied.

Decision Framework:

  • For tasks requiring accuracy and focus (like code completion), use lower temperature, lower Top-P/Top-K, and moderate max tokens
  • For creative writing, experiment with higher temperature, higher Top-P/Top-K, and larger max token limits
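
The exact internals of each hosted model aren't published, so treat the following Python sketch as illustrative of the staged ordering only: restrict to the top K tokens, apply the Top-P cutoff within that pool, then apply temperature and sample.

import math
import random

def sample_next_token(token_logits, k, p, temperature):
    """Illustrative ordering only: top-k -> top-p -> temperature -> sample."""
    ranked = sorted(token_logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Convert the surviving logits to probabilities for the top-p cutoff.
    exps = [math.exp(v) for _, v in ranked]
    probs = [e / sum(exps) for e in exps]
    nucleus, cumulative = [], 0.0
    for (token, logit), prob in zip(ranked, probs):
        nucleus.append((token, logit))
        cumulative += prob
        if cumulative >= p:
            break
    # Temperature reshapes the remaining distribution before the final draw.
    scaled = [math.exp(logit / temperature) for _, logit in nucleus]
    weights = [s / sum(scaled) for s in scaled]
    return random.choices([t for t, _ in nucleus], weights=weights, k=1)[0]

logits = {"favorite": 3.0, "best": 2.6, "preferred": 1.9, "zebra": -1.0}
print(sample_next_token(logits, k=3, p=0.9, temperature=0.5))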

Frequency and Presence Penalties

These parameters are useful if you want to get rid of repetition in your outputs.

Frequency Penalty:
A penalty assigned to a token when that token appears frequently. High penalties encourage fewer repeated tokens and produce more random output.

Presence Penalty:
A penalty assigned to each token when it appears in the output to encourage generating outputs with tokens that haven't been used.

Use Case: If your model keeps repeating the same phrases or concepts, increase these penalties to encourage more diverse vocabulary and concepts.
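
The exact penalty formulas used inside the hosted models aren't documented here, but the general mechanism can be sketched as follows: a frequency penalty grows with how many times a token has already appeared, while a presence penalty is a flat deduction once a token has appeared at all. A generic Python illustration:

from collections import Counter

def apply_penalties(logits, generated_tokens, frequency_penalty, presence_penalty):
    """Lower the scores of tokens already used, so repeats become less likely."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, score in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (score
                           - frequency_penalty * count
                           - (presence_penalty if count > 0 else 0))
    return adjusted

logits = {"great": 2.0, "innovative": 1.8, "solution": 1.5}
history = ["great", "great", "solution"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.3))
# "great" drops the most because it has already appeared twice.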

Seed Parameter (Reproducibility)

Assigning a seed is like tagging the request with an identifier: the model aims to generate the same set of tokens for the same integer across consecutive requests.

Important Notes:

  • Allowed values are integers; assigning large or small seed values doesn't affect the result
  • Especially useful for debugging and testing
  • Warning: The seed parameter might not produce the same result long-term because model updates might invalidate the seed

OCI Limits:

  • API: No maximum value
  • Console: Maximum value is 9999
  • Leave blank (Console) or null (API) to disable

Stop Sequences

A sequence of characters—such as a word, phrase, newline (\n), or period—that tells the model when to stop the generated output.

Example: If the stop sequence is a period (.), the model stops generating text once it reaches the end of the first sentence, even if the token limit is much higher.

Use Case: Useful for controlling output length and format, especially when you want responses limited to single sentences or specific sections.

Likelihood Display

Shows how likely a token is to follow the current generated token. Likelihood is defined by a number between -15 and 0, where more negative numbers mean less likely tokens.

Example: The word "favorite" is more likely to be followed by "food" or "book" rather than "zebra."

Important: This parameter doesn't influence the generation process itself but serves as a diagnostic tool to help understand the model's behavior.

Embedding Models: Converting Text to Vectors

Embedding models transform text into numerical representations that capture semantic meaning, enabling AI systems to understand relationships between words, sentences, and documents.

What Are Embeddings?

Embeddings are numerical representations of text: a piece of text converted into a sequence of numbers. The piece of text can be a phrase, a sentence, or one or more paragraphs.

A vector embedding is a mapping from input (like a word, list of words, or image) into a list of floating-point numbers. That list represents the input in the multidimensional embedding space of the model.

OCI Embedding Model Dimensions

OCI Generative AI embedding models transform each phrase, sentence, or paragraph into an array of 384 numbers (light models) or 1024 numbers, depending on the embedding model selected.

Available Models:

  • Cohere Embed English V3: 1024 dimensions
  • Cohere Embed English Light V3: 384 dimensions
  • Cohere Embed Multilingual V3: 1024 dimensions
  • Cohere Embed Multilingual Light V3: 384 dimensions
  • Cohere Embed 4: Latest version supporting text and images

Trade-offs:

  • Light models (384d): Faster processing, lower memory usage, suitable for most applications
  • Standard models (1024d): Richer semantic representation, better for complex similarity tasks

Types of Embeddings

Word Embeddings:
Capture properties and relationships of individual words. Word2vec was the most well-known embedding model for a long time, accepting only single words but very good at representing semantic meaning, typically outputting 300-dimensional vectors.

Sentence Embeddings:
Associate every sentence with a vector of numbers, capturing meaning at the sentence level rather than individual words.

Document Embeddings:
Represent entire paragraphs or documents as single vectors, useful for document similarity and classification tasks.

Semantic Similarity: The Core Principle

Embeddings that are numerically similar are also semantically similar. This is the fundamental principle that makes embeddings powerful.

Similar words, documents, or images have vectors that are also similar. For example, "basketball" and "baseball" have embedding vectors much closer to each other than either is to "rainforest".

Measuring Similarity:

The most popular way to compare vectors is cosine similarity, which measures the cosine of the angle between two vectors in multi-dimensional space. The closer the vectors, the smaller the angle.

Other Distance Metrics:

  • Euclidean distance: Straight-line distance between vectors
  • Dot product (inner product): Especially effective for unit vectors
  • Manhattan distance: Sum of absolute differences

Cosine similarity is popular for text embeddings, measuring the angle between vectors. Dot product can also work, especially for unit vectors, offering performance benefits in vector databases.
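
Cosine similarity is straightforward to compute directly. Here is a small Python sketch with toy 3-dimensional vectors; real OCI embeddings would have 384 or 1024 dimensions, and the numbers below are invented purely for illustration.

import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

basketball = [0.81, 0.45, 0.12]   # toy vectors for illustration only
baseball   = [0.78, 0.50, 0.10]
rainforest = [0.05, 0.20, 0.95]

print(cosine_similarity(basketball, baseball))    # high, ~0.99
print(cosine_similarity(basketball, rainforest))  # much lower, ~0.27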

Embedding Model Evolution

Word2vec was the pioneer, focusing on single words with 300 dimensions—lightweight and great for semantic meaning. Then came OpenAI's text-embedding-ada-002 in 2022, supporting up to 8192 tokens and outputting 1536 dimensions.

2025 State of the Art:

Advances in 2025, driven by LLMs and benchmarks, show transformer-based and instruction-tuned embeddings achieving top performance. Key trends include multilingual models (1000+ languages), domain-specific models (medicine, code), and multimodal models (text-image-audio).

Embedding Use Cases

1. Vector Databases and Semantic Search

The most common use case: semantic search, where the search focuses on the meaning of the text rather than on keyword matches.

Workflow:

  1. User query is converted to a vector representation using an embedding model
  2. Query vector is stored/compared in a vector database
  3. Similar vectors are retrieved from private content
  4. Retrieved content is sent to an LLM for response generation
  5. LLM sends response back to the user

We use the embedding model to create vector embeddings for content we want to index. The vector embedding is inserted into the vector database with some reference to the original content.
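
Below is a toy, self-contained Python version of that workflow. The embed() function is a keyword-count stand-in for a real embedding model call, and a plain list stands in for the vector database; in practice you would call an OCI embedding model and a real vector store.

import math

VOCAB = ["dental", "cleanings", "leave", "pay", "office", "fridays", "teeth"]

def embed(text):
    """Toy stand-in for an embedding model: a keyword-count vector over VOCAB."""
    words = [w.strip(".,?").lower() for w in text.split()]
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

documents = [
    "Our dental plan covers two cleanings per year.",
    "Parental leave lasts twelve weeks at full pay.",
    "The office closes at 6 pm on Fridays.",
]

# 1. Index: embed each document and keep a reference to the original text.
index = [(doc, embed(doc)) for doc in documents]

# 2-3. Query: embed the question and rank the stored vectors by similarity.
query_vector = embed("How many teeth cleanings are covered?")
ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]),
                reverse=True)

# 4-5. The best match would then be passed to a chat model as grounding context.
print(ranked[0][0])  # the dental coverage document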

2. Recommendation Systems

By embedding items (products, articles, movies) and user preferences, systems can recommend semantically similar items even if they don't share obvious keywords.

3. Text Classification and Clustering

Embeddings enable text classification and text clustering by grouping semantically similar items together.

Applications:

  • Classifying support tickets by department
  • Categorizing documents by topic
  • Detecting duplicate or similar content
  • Sentiment analysis grouping

4. Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation is the most common scenario for vector databases: chunks of private content are retrieved by vector similarity and supplied to the LLM as grounding context for its answer. Other possible use cases include recommendations, anomaly detection, and more.

5. Question Answering Systems

Create chatbots that respond to questions about your own data—for instance, a chatbot responding to employee questions on healthcare coverage. Hundreds of pages of documentation can be split into chunks, converted into embeddings, and searched based on vector similarity.

6. Semantic Caching

Reduce the cost and latency of LLMs by caching LLM completions. New queries are compared against cached queries using vector similarity; if a new query is similar enough to a previously cached one, the cached completion is returned instead of calling the model again.

7. LLM Conversation Memory

Persist conversation history with an LLM as embeddings in a vector database. Applications can use vector search to pull relevant history or "memories" into LLM responses.

8. Cross-Lingual and Multimodal Search

Because embedding models are frequently trained on more than just English data, vector search can work across languages. With multimodal embedding models trained on both text and images, vector search can also work with images.

Vector Databases: Storage and Retrieval

A vector database is a database that can store, manage, retrieve, and compare vectors.

Key Capabilities:

Vector databases are purpose-built to manage vector embeddings, offering data management (insert, delete, update), metadata storage and filtering, scalability with distributed processing, and real-time query performance.

Popular Vector Databases:

  • Pinecone
  • Weaviate
  • Qdrant
  • Milvus
  • Chroma
  • Oracle Database 23ai (with AI Vector Search)
  • Azure Redis
  • MyScale

Advanced Features:

Some vector databases can perform hybrid searches by first narrowing results based on characteristics or metadata before conducting vector search—making searches more effective and customizable.

OCI Embedding Model Input Requirements

Input data for text embeddings must meet the following requirements: maximum of 96 inputs allowed for each run, with each input less than 512 tokens for text-only models.

Handling Long Inputs:

If an input exceeds the 512 token limit, you can set the Truncate parameter to Start or End to cut off text to fit within the token limit. Setting it to None will produce an error if inputs exceed limits.
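
A minimal sketch of an embed request that respects these limits, again assuming the OCI Python SDK (oci.generative_ai_inference); the OCIDs, endpoint, and model name are placeholders, and the field names follow the SDK's EmbedTextDetails model as I understand it.

import oci
from oci.generative_ai_inference import GenerativeAiInferenceClient
from oci.generative_ai_inference.models import EmbedTextDetails, OnDemandServingMode

config = oci.config.from_file()  # default ~/.oci/config profile
client = GenerativeAiInferenceClient(
    config,
    service_endpoint="https://inference.generativeai.us-chicago-1.oci.oraclecloud.com",
)

details = EmbedTextDetails(
    inputs=["Dental coverage includes two cleanings per year.",
            "Parental leave lasts twelve weeks."],      # up to 96 inputs per call
    truncate="END",                                      # cut off inputs over 512 tokens
    compartment_id="ocid1.compartment.oc1..example",    # placeholder OCID
    serving_mode=OnDemandServingMode(
        model_id="cohere.embed-english-v3.0"             # or an embedding model OCID
    ),
)

response = client.embed_text(details)
print(len(response.data.embeddings), len(response.data.embeddings[0]))  # 2 vectors, 1024 dims each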

Multimodal Support:

For text and image embed models like Cohere Embed English Image V3, you can add either text or one image only.

Best Practices for OCI Generative AI

Chat Model Optimization

  1. Start Simple: Begin with default parameters, then adjust based on results
  2. Temperature First: Adjust temperature before other sampling parameters
  3. Use Preambles Wisely: Set clear behavioral expectations with custom preambles
  4. Implement Stop Sequences: Control output format and length precisely
  5. Test Reproducibility: Use seed parameter during development for consistent testing
  6. Monitor Penalties: Add frequency/presence penalties if seeing repetition

Embedding Model Selection

  1. Choose Appropriate Dimensions: Use light models (384d) for most applications; standard models (1024d) for complex similarity tasks
  2. Select Right Model: English vs. Multilingual based on your data
  3. Batch Process: Take advantage of the 96-input limit for efficient processing
  4. Index Strategically: Store embeddings in vector databases with appropriate indexes
  5. Measure Appropriately: Use cosine similarity for most text applications

Integration Patterns

OCI Generative AI integrates with LangChain for building context-augmented applications and RAG solutions. It also works with LlamaIndex for accessing pretrained models or creating custom models on dedicated AI clusters.

Understanding chat and embedding models in OCI Generative AI enables you to build sophisticated AI applications. Chat models with properly tuned parameters (temperature, top-k, top-p, penalties) produce high-quality, contextually appropriate responses. Embedding models convert text into semantic vectors that power search, recommendations, and retrieval systems.

Key takeaways:

  • Tokens are the fundamental unit—everything is measured in tokens, not characters
  • Temperature controls creativity—low for precision, high for creativity
  • Sampling parameters work together—Top-K and Top-P filter candidates before temperature applies
  • Embeddings enable semantic understanding—similar meanings produce similar vectors
  • Vector databases make it practical—efficient storage and retrieval of embeddings at scale

Whether building chatbots, implementing semantic search, or creating RAG applications, mastering these parameters and concepts is essential for production-grade AI systems on OCI.
