🤖 Exam Guide: AI Practitioner
Domain 3: Applications of Foundation Models
📘Task Statement 3.1
🎯 Objectives
This task shifts from “what GenAI is” (Domain 2) to “how to design real applications with it” (Domain 3).
You’re expected to understand how to pick a model, control its behaviour, ground it with RAG, store embeddings, choose customization strategies, and use agents for multi-step workflows.
1) Selection Criteria For Choosing Pre-trained Models
When selecting a pre-trained FM/LLM, consider:
1.1 Cost
Often driven by token pricing (input/output) and model tier.
Higher capability typically costs more.
1.2 Modality
Text-only vs multimodal (text+image) vs image generation, etc.
Choose based on required input/output types.
1.3 Latency
Interactive assistants need low latency, while batch workloads can tolerate more delay.
1.4 Multilingual Support
If you serve global users, ensure strong performance in required languages.
1.5 Model Size / Complexity
Larger models can be more capable but slower and more expensive.
Smaller models can be cheaper/faster and “good enough” for narrow tasks.
1.6 Customization Options
- Can you fine-tune?
- Is prompt engineering sufficient?
- Does the service support your needed controls?
1.7 Input/Output Length
Context window size determines how much the model can “see.”
Output limits matter for summarization length, structured responses, etc.
1.8 Prompt Caching
(if available in your architecture/service)
Reusing repeated prompt prefixes can reduce cost/latency for common templates (e.g., same system instructions used for every request).
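A minimal sketch of what this can look like with the Amazon Bedrock Converse API, assuming the chosen model supports prompt caching; the model ID is a placeholder and the exact cachePoint placement should be verified against current Bedrock documentation:

```python
import boto3

# Sketch only: a long, static system prompt reused for every request.
# The cachePoint block asks the service to cache the repeated prefix so later
# requests can skip reprocessing it (assumes the model/region supports caching).
client = boto3.client("bedrock-runtime", region_name="us-east-1")

SYSTEM_PROMPT = "You are a support assistant. Always follow the company style guide ..."

def ask(question: str) -> str:
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        system=[
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # marks everything above as cacheable
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```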
2) Effect Of Inference Parameters On Model Responses
Inference parameters change the “style” and consistency of outputs:
2.1 Temperature
Controls randomness/creativity.
- Lower temperature → more consistent, safer, more deterministic-like outputs (often better for extraction/structured formats).
- Higher temperature → more varied/creative outputs (often better for brainstorming, marketing copy).
2.2 Input Length
More input context can improve grounding, but increases cost and can introduce noise if irrelevant.
Longer prompts can also increase latency.
2.3 Output Length
Limits how much the model can generate.
Longer outputs cost more and may increase the chance of drifting off-topic; shorter outputs may omit needed details.
Know the directionality:
- Temperature up → more variability.
- More tokens → more cost.
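A minimal sketch of setting these parameters via the Amazon Bedrock Converse API; the model ID is a placeholder, and the parameter names follow the Converse inferenceConfig schema:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.titan-text-express-v1",  # placeholder; any Converse-capable model works
    messages=[{"role": "user", "content": [{"text": "List three taglines for a coffee shop."}]}],
    inferenceConfig={
        "temperature": 0.2,  # lower → more deterministic (good for extraction/structured output)
        "maxTokens": 200,    # caps output length, which bounds cost and response size
        "topP": 0.9,         # nucleus sampling; commonly tuned instead of temperature, not alongside it
    },
)
print(response["output"]["message"]["content"][0]["text"])
```

Raising temperature toward 1.0 in the same call would make repeated runs diverge more, while raising maxTokens allows longer answers at higher cost.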
3) Retrieval Augmented Generation (RAG)
3.1 Definition
RAG is an approach in which:
1 relevant information is retrieved from an external knowledge source (documents, wikis, policies, tickets), and
2 the FM generates an answer grounded in that retrieved context.
3.2 Why Businesses Use RAG
1 Keeps answers aligned with company-specific and up-to-date information.
2 Reduces hallucinations by grounding outputs in retrieved text.
3 Avoids retraining/fine-tuning for every knowledge update.
3.3 Business Applications
1 Knowledge base Q&A (HR policies, IT runbooks, product docs)
2 Customer support assistants (answers based on help-center articles)
3 Contract/policy summarization with citations to source text
4 Research assistants over internal document repositories
3.4 AWS Example
Amazon Bedrock Knowledge Bases provide managed building blocks for RAG-style workflows.
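A minimal sketch of a managed RAG call against a Bedrock knowledge base, assuming one has already been created and synced; the knowledge base ID and model ARN are placeholders:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Retrieve relevant chunks and generate a grounded answer in one managed call.
response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our laptop refresh policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])      # grounded answer
print(response.get("citations", []))   # references back to the retrieved source chunks
```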
4) AWS Services For Storing Embeddings In Vector Databases
RAG typically requires storing embeddings so you can do similarity search. AWS services commonly referenced include:
4.1 Amazon OpenSearch Service
Often used for search + vector similarity search use cases.
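A minimal sketch of a k-NN index and similarity query with the opensearch-py client; the endpoint, authentication (omitted here), and embedding dimension are placeholders:

```python
from opensearchpy import OpenSearch

# Placeholder endpoint; real clusters also need authentication (e.g., SigV4 or basic auth).
client = OpenSearch(hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}], use_ssl=True)

# Index with a vector field for embeddings plus the original text for display.
client.indices.create(
    index="docs",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {"properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 1024},
        }},
    },
)

# Similarity search: the 3 documents whose embeddings are closest to the query embedding.
results = client.search(index="docs", body={
    "size": 3,
    "query": {"knn": {"embedding": {"vector": [0.1] * 1024, "k": 3}}},
})
```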
4.2 Amazon Aurora
Aurora (PostgreSQL-Compatible Edition) can store vectors and metadata; similarity search is supported via the pgvector extension.
4.3 Amazon RDS for PostgreSQL
RDS for PostgreSQL supports vector storage and similarity search via the pgvector extension (availability depends on the engine version).
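A minimal sketch of similarity search with pgvector, which applies to both RDS for PostgreSQL and Aurora PostgreSQL; connection details, table layout, and dimensions are placeholders:

```python
import psycopg2

# Placeholder connection details; pgvector must be available on the engine version in use.
conn = psycopg2.connect(host="mydb.xyz.us-east-1.rds.amazonaws.com",
                        dbname="kb", user="app", password="...")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(1024)
    );
""")
conn.commit()

# Find the 5 stored chunks nearest to a query embedding (<=> is pgvector's cosine distance).
query_embedding = [0.1] * 1024
vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute("SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5;", (vec_literal,))
rows = cur.fetchall()
```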
4.4 Amazon Neptune
Amazon Neptune is a graph database that supports relationship-centric retrieval and can be part of retrieval strategies involving connected data.
4.5 Amazon DocumentDB (with MongoDB compatibility)
Amazon DocumentDB can store documents/metadata alongside embeddings and supports vector search, depending on cluster version and features.
For the exam, recognize these as AWS options for vector/embedding storage; deep implementation details are not required.
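For context, the embeddings these services store are produced by an embedding model. A minimal sketch using Amazon Titan Text Embeddings V2 through the Bedrock runtime (model ID and input text are illustrative):

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Turn a text chunk into a vector that can be written to any of the stores above.
response = runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",  # example embedding model
    body=json.dumps({"inputText": "Employees may request a laptop refresh every 3 years."}),
)
embedding = json.loads(response["body"].read())["embedding"]  # list of floats
print(len(embedding))
```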
5) Cost Tradeoffs Of FM Customization Approaches
You’re expected to understand that there are multiple ways to “customize” outcomes, each with different cost/effort:
5.1 Pre-training
Training a foundation model from scratch.
Highest cost and complexity; usually only large providers do this.
5.2 Fine-tuning
Adjusting a pre-trained model on your domain/task data, which usually improves consistency, domain style, and task performance.
Costs: training + governance + ongoing maintenance.
5.3 In-Context Learning (Prompting / Few-shot)
Provide instructions and examples in the prompt.
Fastest and lowest engineering overhead, but increases token usage and may be less consistent at scale.
5.4 RAG
Use retrieval to ground responses in external knowledge, which is often cheaper and more maintainable than fine-tuning for knowledge updates.
Costs: embedding generation + vector storage + retrieval + added prompt tokens.
Rules of thumb:
1 If the problem is “the model doesn’t know our private docs” → start with RAG.
2 If the problem is “the model won’t follow our format/tone reliably” → consider fine-tuning or stricter prompting + validation.
3 If the problem is “we need a brand-new capability at massive scale” → that’s closer to pre-training (rare).
6) Role Of Agents In Multi-Step Tasks
Agents extend a model from “responding” to “acting.”
Agents can plan and execute multi-step workflows, calling tools/APIs along the way.
Agents are useful when tasks require:
1 looking up information,
2 taking actions (create ticket, book meeting),
3 iterating through steps,
4 using multiple systems.
Examples
1 Agents for Amazon Bedrock: managed capability to build agentic workflows around FMs (see the sketch after this list).
2 Agentic AI: LLM + tools + planning + memory/context.
3 Model Context Protocol (MCP): a standard pattern for connecting models to external tools/context providers (exam-level awareness only).
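A minimal sketch of invoking a Bedrock agent that plans steps and calls tools (action groups) on your behalf; the agent ID and alias ID are placeholders, and the agent is assumed to already exist:

```python
import uuid
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.invoke_agent(
    agentId="AGENT1234",          # placeholder
    agentAliasId="ALIAS1234",     # placeholder
    sessionId=str(uuid.uuid4()),  # reusing the same session ID keeps multi-step context across calls
    inputText="Book a meeting room for Tuesday at 10am and email the team.",
)

# The response is an event stream; collect the generated text chunks.
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)
```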
💡 Quick Questions
1. Name three criteria you would use to select a pre-trained foundation model for a customer-facing assistant.
2. What does temperature control, and what happens when you increase it?
3. What is RAG, and why do organizations use it instead of retraining models for new documents?
4. Name two AWS services that can be used to store embeddings for vector similarity search.
5. In one sentence, what is the role of an agent in an FM-based application?
Additional Resources
- Prompt caching for faster model inference
- What is Prompt Caching?
- Prompt caching
- Prompt Caching: A Guide With Code Implementation
- Amazon Bedrock Knowledge Bases
- Vector database options
- How Amazon Bedrock knowledge bases work
- Amazon Bedrock Agents
- Amazon Bedrock AgentCore Beginner's Guide - AI Agent Development from Basics with Detailed Term Explanations
✅ Answers to Quick Questions
1. Latency (fast enough for interactive use), cost (token pricing/usage), and multilingual support (if serving multiple languages).
(Also valid: modality, input/output length, model size/complexity, customization options, prompt caching.)
2. Temperature controls randomness/creativity in the model’s outputs. Increasing it generally makes responses more varied and less predictable (often more creative, but potentially less consistent).
3. RAG (Retrieval Augmented Generation) retrieves relevant information from an external knowledge source and provides it to the model to generate a grounded response. Organizations use it because it can incorporate updated/internal documents without retraining, reducing cost and improving factual grounding.
4. Amazon OpenSearch Service and Amazon RDS for PostgreSQL.
(Also valid: Amazon Aurora, Amazon Neptune, Amazon DocumentDB.)
5. An agent enables the model to perform multi-step tasks by planning and calling external tools/APIs to take actions, not just generate text.