A practical guide to architecture, implementation, and lessons learned from a multi-agent AI system
Posts in this Series
Building FeedbackForge, a Multi-Agent AI System on Azure with MAF, Foundry & AI Gateway (Part 1)
Microsoft Foundry
The whole project uses Microsoft Foundry, a unified PaaS platform for designing, customizing, and deploying AI applications and agents. It offers access to over 11,000 models, including OpenAI, Meta Llama, and Mistral, alongside tools like Foundry Agent Service and Foundry IQ to build, manage, and secure AI apps from prototype to production.
Foundry Control Plane
Microsoft Foundry Control Plane is a unified management interface that provides visibility, governance, and control for AI agents, models, and tools across your Foundry enterprise. It centralizes management for your AI agent fleet, from build to production, and is most valuable for an organization that:
- Manages multiple AI agents across different projects or teams.
- Requires centralized compliance visibility and policy enforcement across an AI fleet.
- Integrates Microsoft Defender and Microsoft Purview for AI governance and threat protection.
- Operates agents from multiple platforms, including Foundry and other Microsoft and non-Microsoft sources.
- Needs to track cost, token usage, and resource consumption across an entire AI environment.
Without Foundry Control Plane, you manage agents, models, and compliance through individual Azure portal blades and separate per-project views. Foundry Control Plane adds cross-project visibility, unified compliance enforcement, and integrated security signals in a single interface.
Security
Each agent uses a standard Managed Identity, which acts as its identity when authenticating to Azure resources.
Observability
AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. Teams can trace and evaluate applications, integrate automated quality gates into CI/CD pipelines, and collect signals such as evaluation metrics, logs, traces, and model outputs to gain visibility into performance, quality, safety, and operational health.
Monitoring
Production monitoring ensures your deployed AI applications maintain quality and performance in real-world conditions. Integrated with Azure Monitor Application Insights, Microsoft Foundry delivers real-time dashboards tracking operational metrics, token consumption, latency, error rates, and quality scores. Teams can set up alerts when outputs fail quality thresholds or produce harmful content, enabling rapid issue resolution.
The Agent Monitoring Dashboard in Microsoft Foundry tracks operational metrics and evaluation results for your agents. This dashboard helps you understand token usage, latency, success rates, and evaluation outcomes for production traffic.
Tracing
Distributed tracing captures the execution flow of AI applications, providing visibility into LLM calls, tool invocations, agent decisions, and inter-service dependencies. Built on OpenTelemetry standards and integrated with Application Insights, tracing enables debugging complex agent behaviors, identifying performance bottlenecks, and understanding multi-step reasoning chains. Microsoft Foundry supports tracing for popular frameworks including LangChain, Semantic Kernel, and the OpenAI Agents SDK.
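To illustrate the shape of the data distributed tracing captures, here is a toy, stdlib-only sketch of a span tree for a single agent interaction. The span names (`agent.run`, `llm.call`, `tool.get_weekly_summary`) are made up for illustration; in the real system this structure comes from OpenTelemetry and lands in Application Insights.

```python
# Toy illustration of the span tree distributed tracing captures.
# In production this is OpenTelemetry + Application Insights, not this code.
import contextlib
import time

SPANS = []

@contextlib.contextmanager
def span(name, parent=None):
    """Record a named span with its parent and duration."""
    record = {"name": name, "parent": parent, "start": time.monotonic()}
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        SPANS.append(record)

# A multi-step agent interaction becomes a tree of nested spans:
with span("agent.run") as root:
    with span("llm.call", parent=root["name"]):
        pass  # the model decides to invoke a tool
    with span("tool.get_weekly_summary", parent=root["name"]):
        pass  # the tool executes

print([s["name"] for s in SPANS])
# ['llm.call', 'tool.get_weekly_summary', 'agent.run']
```

Child spans close (and are recorded) before their parent, which is why the root span appears last. Tracing backends reconstruct the tree from the parent references.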
Guardrails and Controls
Microsoft Foundry provides safety and security guardrails that you can apply to core models and agents. Guardrails consist of a set of controls. The controls define a risk to be detected, intervention points to scan for the risk, and the response action to take in the model or agent when the risk is detected.
The created guardrail contains controls for:
- Jailbreak
- Indirect Prompt Injections
- Sensitive Data Leakage (PII)
- Content Safety
- Protected Materials
These controls are assigned to all the agents used in the system.
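Conceptually, each control pairs a risk with intervention points and a response action, as described above. The sketch below models that structure as plain Python data; the field names and actions are assumptions for illustration, not the actual Foundry guardrail schema.

```python
# Hypothetical, illustrative model of the guardrail described above.
# Field names ("risk", "intervention_points", "action") are assumptions,
# not the real Microsoft Foundry schema.
guardrail = {
    "name": "feedbackforge-guardrail",
    "controls": [
        # Each control: a risk to detect, where to scan, what to do on detection.
        {"risk": "jailbreak", "intervention_points": ["user_input"], "action": "block"},
        {"risk": "indirect_prompt_injection", "intervention_points": ["tool_output"], "action": "block"},
        {"risk": "sensitive_data_leakage_pii", "intervention_points": ["model_output"], "action": "redact"},
        {"risk": "content_safety", "intervention_points": ["user_input", "model_output"], "action": "block"},
        {"risk": "protected_materials", "intervention_points": ["model_output"], "action": "block"},
    ],
}

def risks_covered(g):
    """Return the set of risks a guardrail detects."""
    return {c["risk"] for c in g["controls"]}

print(sorted(risks_covered(guardrail)))
```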
Evaluators
Evaluators measure the quality, safety, and reliability of AI responses throughout development. Microsoft Foundry provides built-in evaluators for general-purpose quality metrics (coherence, fluency), RAG-specific metrics (groundedness, relevance), safety and security (hate/unfairness, violence, protected materials), and agent-specific metrics (tool call accuracy, task completion). Teams can also build custom evaluators tailored to their domain-specific requirements.
An evaluator was created with the following JSON structure:
{
  "query": "What's the weekly summary?",
  "response": "",
  "context": "User is requesting a weekly feedback summary. Expected to use get_weekly_summary tool and return sentiment, top issues, feedback count.",
  "ground_truth": "The response should include sentiment breakdown, top issues, feedback count, and distinguish between positive and negative feedback. Required fields: total_feedback, sentiment_breakdown.",
  "conversation_id": "conv_001",
  "response_id": "test_001",
  "previous_response_id": null,
  "latency": null,
  "response_length": null
}
The evaluator criteria were left at their defaults. They could be tuned, since many do not apply to this scenario, but they were kept unchanged for simplicity.
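As a minimal sketch of what a custom evaluator could check for this test row, the function below verifies that the required fields named in `ground_truth` (`total_feedback`, `sentiment_breakdown`) appear in the agent's JSON response. This is purely illustrative; Foundry's built-in and custom evaluators are far richer.

```python
# Minimal custom-evaluator sketch for the test row above.
# Only checks presence of the required fields from ground_truth;
# real evaluators score quality, groundedness, safety, etc.
import json

REQUIRED_FIELDS = ["total_feedback", "sentiment_breakdown"]

def evaluate_response(response_text: str) -> dict:
    """Score 1.0 if every required field is present in the JSON response."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return {"score": 0.0, "missing": list(REQUIRED_FIELDS)}
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    return {"score": 1.0 if not missing else 0.0, "missing": missing}

result = evaluate_response(
    '{"total_feedback": 42, "sentiment_breakdown": {"positive": 30, "negative": 12}}'
)
print(result)  # {'score': 1.0, 'missing': []}
```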
Purview, Defender, and Entra
Scaling AI agents securely requires combining development capabilities with strong governance, security, and compliance controls.
- AI systems introduce new risks such as data leakage, prompt injection, and agent misuse.
- Security must be applied across all layers: agents, data, models, and interactions.
- Microsoft Foundry acts as the control plane to manage, observe, and secure AI workloads.
✔️ Purview focuses on data governance, compliance, and protection (e.g., sensitivity labels, DLP, auditing).
✔️ Defender provides security posture management and runtime threat protection for AI workloads.
✔️ Entra introduces the concept of managing agents as identities with controlled access.
AI security is not a single tool but a combination of:
- Governance (Purview)
- Threat protection (Defender)
- Identity and access control (Entra)
- Control plane and observability (Foundry)
This enables organizations to move from experimental AI solutions to secure, governed, and production-ready agent systems.
Enabling Purview or Defender for this project was considered overkill, but in a real production scenario they should be mandatory.
AI Gateway
The AI gateway in Azure API Management provides a set of capabilities to manage AI backends effectively. It enables control over security, reliability, observability, and cost.
The following sections describe the main capabilities, combining traditional API gateway features with AI-specific functionality.
Governance
Token Rate Limiting and Quotas
You can configure token-based limits on LLM APIs to control usage per consumer based on token consumption.
This allows defining:
- Tokens per minute (TPM)
- Token quotas over time (hourly, daily, monthly, etc.)
<llm-token-limit counter-key="@(context.Subscription.Id)"
tokens-per-minute="500"
estimate-prompt-tokens="false"
remaining-tokens-variable-name="remainingTokens">
</llm-token-limit>
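When the limit is exceeded, API Management returns HTTP 429. The sketch below shows one way a client could react, assuming the policy is configured to surface a `remaining-tokens` response header (via `remaining-tokens-header-name`); the thresholds and helper are illustrative, not part of the gateway.

```python
# Client-side sketch for handling token rate limiting at the gateway.
# Assumes the policy exposes a "remaining-tokens" response header and
# returns HTTP 429 when the TPM limit is exhausted.
def should_retry(status_code: int, headers: dict) -> tuple:
    """Return (retry, wait_seconds) for a gateway response."""
    remaining = int(headers.get("remaining-tokens", "0"))
    if status_code == 429:
        # Throttled: honor the Retry-After hint, or wait out a full window.
        return True, int(headers.get("Retry-After", "60"))
    if remaining < 50:
        # Close to the TPM limit: slow down proactively.
        return False, 5
    return False, 0

print(should_retry(429, {"Retry-After": "12"}))          # (True, 12)
print(should_retry(200, {"remaining-tokens": "480"}))    # (False, 0)
```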
Security and Safety
Security
All authentication is performed through the API Management subscription key, without relying on JWT or any Entra ID OAuth mechanism.
Content Safety
In this case, Content Safety is not configured at the AI Gateway level. Instead, it is managed through Microsoft Foundry.
Observability
Token Metrics
Token usage can be emitted using the llm-emit-token-metric policy, with custom dimensions for filtering in Azure Monitor. The following example emits token metrics with dimensions for client IP address, API ID, and user ID (from a custom header):
<llm-emit-token-metric namespace="llm-metrics">
<dimension name="Client IP" value="@(context.Request.IpAddress)" />
<dimension name="API ID" value="@(context.Api.Id)" />
<dimension name="User ID" value="@(context.Request.Headers.GetValueOrDefault("x-user-id", "N/A"))" />
</llm-emit-token-metric>
Prompt and Completion Logging
Logging can be enabled to track:
- Token usage
- Prompts and completions
- API consumption patterns
This data can be analyzed in Application Insights and visualized through built-in dashboards in API Management.
Scalability and Performance
Semantic Caching
Semantic caching improves the performance of LLM APIs by caching the completions of previous prompts and reusing them when a new prompt is semantically close, as measured by vector proximity, to a prior request.
Benefits:
- Reduced calls to backend AI services
- Lower latency
- Cost optimization
It can be implemented using Azure Managed Redis or any compatible cache.
<policies>
<inbound>
<base />
<llm-semantic-cache-lookup
score-threshold="0.05"
embeddings-backend-id="azure-openai-backend"
embeddings-backend-auth="system-assigned">
<vary-by>@(context.Subscription.Id)</vary-by>
</llm-semantic-cache-lookup>
<rate-limit calls="10" renewal-period="60" />
</inbound>
<outbound>
<llm-semantic-cache-store duration="60" />
<base />
</outbound>
</policies>
Note: A rate-limit policy should be applied after the cache lookup to prevent backend overload if the cache is unavailable.
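To make the mechanism concrete, here is a stdlib-only conceptual sketch of what `llm-semantic-cache-lookup` does: embed the prompt, compare against cached entries by vector distance, and return the cached completion on a close-enough match. The embeddings here are toy vectors and the threshold semantics (lower = stricter) is illustrative; the real policy calls an embeddings backend and uses Redis.

```python
# Conceptual sketch of semantic cache lookup/store. Toy embeddings only;
# the real APIM policy calls an embeddings backend and a Redis cache.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

class SemanticCache:
    def __init__(self, score_threshold=0.05):
        self.score_threshold = score_threshold  # lower = stricter match (illustrative)
        self.entries = []  # list of (embedding, completion)

    def lookup(self, embedding):
        for cached_embedding, completion in self.entries:
            if cosine_distance(embedding, cached_embedding) <= self.score_threshold:
                return completion  # cache hit: skip the backend LLM call
        return None

    def store(self, embedding, completion):
        self.entries.append((embedding, completion))

cache = SemanticCache(score_threshold=0.05)
cache.store([1.0, 0.0], "Weekly summary: ...")
print(cache.lookup([0.999, 0.02]))  # near-identical prompt: cache hit
print(cache.lookup([0.0, 1.0]))     # unrelated prompt: miss (None)
```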
Weight or Session-Aware Load Balancing
The backend load balancer supports round-robin, weighted, priority-based, and session-aware load balancing. You can define a load distribution strategy that meets your specific requirements.
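As a quick intuition for the weighted mode, the sketch below picks a backend in proportion to its weight. The backend names and weights are made up; API Management implements this inside the backend pool, not in client code.

```python
# Illustrative weighted backend selection, mirroring the weighted
# load-balancing mode. Backend names and weights are hypothetical.
import random

BACKENDS = [("eastus-gpt4o", 3), ("westeurope-gpt4o", 1)]  # 3:1 traffic split

def weighted_choice(backends, rng=random.random):
    """Pick a backend name with probability proportional to its weight."""
    total = sum(weight for _, weight in backends)
    r = rng() * total
    for name, weight in backends:
        if r < weight:
            return name
        r -= weight
    return backends[-1][0]  # guard against floating-point edge cases
```

With these weights, roughly three of every four requests land on `eastus-gpt4o`. Priority-based and session-aware modes layer further rules (failover order, sticky sessions) on top of the same idea.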
Priority Routing to Provisioned Capacity Models
Enabling priority processing at the request level is optional. Both the chat completions API and responses API have an optional attribute service_tier that specifies the processing type to use when serving a request. In this case, we will not use it.
Velocity
Import Azure OpenAI as an API
Azure API Management allows importing Azure OpenAI endpoints as APIs with a single action. This includes:
- Automatic OpenAPI schema generation
- Managed identity authentication
- Simplified onboarding
External Models and Integrations
No models outside Azure are used in this project. MCP servers and A2A agents are fully implemented within the Python backend rather than managed through the gateway.
Full Policy Example
Below is the full AI Gateway policy used in FeedbackForge. It uses policy fragments to simplify configuration and improve maintainability.
<policies>
<inbound>
<base />
<!--
FeedbackForge Policy v2
Purpose: Routes requests to appropriate backend pools based on model selection and RBAC permissions
Flow:
1. Validate Entra ID authentication (optional, controlled by entra-validate named value)
2. Extract and validate model parameter from request payload
3. Configure backend pools and routing rules
4. Determine target backend pool based on model and permissions
5. Set up authentication and route to selected backend
6. Configure collecting usage metrics
Configuration:
- Modify allowedBackendPools for RBAC (comma-separated pool IDs, empty = all allowed)
- Set defaultBackendPool for unmapped models (empty = return error)
- Backend pool definitions are in frag-set-backend-pools fragment -->
<set-backend-service id="apim-generated-policy" backend-id="feedbackforgev2-ai-endpoint" />
<!-- - Step 1: Validate Entra ID authentication (if enabled) -->
<!-- - <include-fragment fragment-id="aad-auth" /> -->
<!-- - Step 2: Extract and validate model parameter from request -->
<include-fragment fragment-id="set-llm-requested-model" />
<!-- - Remove api-key header to prevent it from being passed to backend endpoints -->
<set-header name="api-key" exists-action="delete" />
<!-- - Step 3: Configure RBAC and default routing behavior -->
<!-- - RBAC: Set allowed backend pools (comma-separated pool IDs, empty = all pools allowed) -->
<set-variable name="allowedBackendPools" value="" />
<!-- - Set default backend pool (empty = return error for unmapped models) -->
<set-variable name="defaultBackendPool" value="" />
<!-- - Step 4: Load backend pool configurations -->
<include-fragment fragment-id="set-backend-pools" />
<!-- - Step 5: Determine target backend pool based on model and permissions -->
<include-fragment fragment-id="set-target-backend-pool" />
<!-- - Step 6: Configure authentication and route to selected backend -->
<include-fragment fragment-id="set-backend-authorization" />
<!-- - Step 7: Configure collecting usage metrics -->
<include-fragment fragment-id="set-llm-usage" />
<!-- - CORS Configuration for AI Foundry Compatibility -->
<include-fragment fragment-id="ai-foundry-compatibility" />
<llm-semantic-cache-lookup score-threshold="0.0" embeddings-backend-id="AI-SC-text-embedding-aokaqomw6tctxm5" />
<rate-limit calls="10" renewal-period="60" />
<llm-token-limit remaining-quota-tokens-header-name="remaining-tokens" remaining-tokens-header-name="remaining-tokens" tokens-per-minute="1000" token-quota="100" token-quota-period="Hourly" counter-key="@(context.Subscription.Id)" estimate-prompt-tokens="true" tokens-consumed-header-name="consumed-tokens" />
<llm-emit-token-metric namespace="llm-metrics">
<dimension name="Client IP" value="@(context.Request.IpAddress)" />
<dimension name="API ID" value="@(context.Api.Id)" />
<dimension name="User ID" value="@(context.Request.Headers.GetValueOrDefault("x-user-id", "N/A"))" />
</llm-emit-token-metric>
</inbound>
<backend>
<!--
- Backend Retry Logic
Purpose: Implements retry mechanism for transient failures (429 throttling, 503 service unavailable)
Configuration:
- Retry count: Set to one less than number of backends in pool to try all backends
- Condition: Retries on 429 (throttling) or 503 (except when backend pool is unavailable)
- Strategy: First fast retry with zero interval, buffer request body for replay
-->
<retry count="2" interval="0" first-fast-retry="true" condition="@(context.Response.StatusCode == 429 || (context.Response.StatusCode == 503 && !context.Response.StatusReason.Contains("Backend pool") && !context.Response.StatusReason.Contains("is temporarily unavailable")))">
<forward-request buffer-request-body="true" />
</retry>
</backend>
<outbound>
<base />
</outbound>
<on-error>
<base />
<!--
- Error Handling and Custom Metrics
Purpose: Pushes custom metrics for 429 throttling errors to enable Azure Monitor alerts
Variables Set:
- service-name: Identifies the service generating the error
- target-deployment: The model that was requested when error occurred
Integration: Uses throttling-events fragment to send metrics to monitoring system
-->
<set-variable name="service-name" value="FeedbackForgev2" />
<set-variable name="target-deployment" value="@((string)context.Variables["requestedModel"])" />
<include-fragment fragment-id="throttling-events" />
</on-error>
</policies>
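The retry condition in the `<backend>` section above can be restated in Python to make its logic explicit: retry on 429, and on 503 only when the failure is not the entire backend pool being unavailable. This is a conceptual sketch, not gateway code.

```python
# Python restatement of the <retry> condition from the policy above:
# retry on 429, or on 503 unless the whole backend pool is down.
def is_transient(status_code: int, status_reason: str = "") -> bool:
    if status_code == 429:  # throttled: another backend in the pool may have capacity
        return True
    if status_code == 503:
        # Do not retry if the backend pool itself is unavailable.
        pool_down = ("Backend pool" in status_reason
                     or "is temporarily unavailable" in status_reason)
        return not pool_down
    return False

def call_with_retry(send, max_retries=2):
    """Try the request up to max_retries extra times on transient failures.
    The first retry is immediate, matching first-fast-retry="true"."""
    status, reason, body = send()
    retries = 0
    while retries < max_retries and is_transient(status, reason):
        retries += 1
        status, reason, body = send()
    return status, body
```

Setting the retry count to one less than the number of backends in the pool, as the policy comment notes, guarantees every backend gets one attempt.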
Azure Well-Architected Framework
The Azure Well-Architected Framework AI workload assessment is a review tool that can be used to self-assess the readiness of an AI workload for production.
Running AI workloads on Azure can be complex, and this assessment helps evaluate how well the system aligns with the best practices defined in the Well-Architected Framework pillars.
Although this is not a full professional project following all stages of a typical SDLC, this assessment highlights several missing areas and helps identify gaps that would need to be addressed in a real production scenario.
Assessment Results
The result of applying the assessment is not particularly strong, with most areas marked as Critical.
However, this is expected given the scope of the project, which focuses on architecture and experimentation rather than production readiness.
Vibe Coding
All the Python and React development for the agents, backend, and frontend systems was done using Claude.
This significantly accelerated the project, reducing the implementation time from months to weeks.
This was the first time building a non-trivial project of this scale using an LLM in a declarative way, and the results were a big surprise in terms of quality.
The project initially started with GitHub Copilot but later transitioned to Claude due to its stronger ability to understand context and generate more accurate and complete code.
It also proved to be a very effective companion for problem-solving, helping identify complex errors, suggest fixes, and resolve issues related to libraries and version compatibility.
Lessons Learned
- Feature overlap across MAF, Foundry, and AI Gateway: Some capabilities, such as memory management and content safety, appear across multiple layers, which can lead to duplication and requires clear architectural decisions on where to implement each concern.
- Vibe Coding: Refer to the previous section. While highly effective, it may not be suitable for everyone due to cost considerations. In this case, the total cost was approximately 260 euros over 3 months of development. Using a different model might reduce costs.
- MAF complexity: MAF provides multiple ways to instantiate and interact with agents, which can feel somewhat convoluted when deciding which approach to use.
- AI Gateway setup is non-trivial: Designing and implementing an AI Gateway landing zone requires significant effort and understanding of policies, security, and routing.
- Multi-agent orchestration: Complex orchestrations involving multiple agents are challenging by nature, but MAF significantly simplifies their implementation.
- Production readiness requires more effort: As highlighted in the Azure Well-Architected Framework assessment, a real production-grade system would require additional work across governance, security, and operational maturity.
- Observability is critical: Observability using OpenTelemetry (logs, traces, and metrics) is essential in AI systems to understand agent behavior and system execution at any point in time.
Source code: