DEV Community: Brian Spann

Presidio as an LLM Guardrail

Brian Spann — Fri, 12 Jun 2026 02:47:05 +0000

Every previous part of this series has been building toward this one. You can detect PII. You can anonymize it with the right operator for each entity type. You can build custom recognizers for your organization's specific data patterns. Now we put it all together into the architecture that matters most in 2026: a PII guardrail that sits between your users and your LLM.

The problem is straightforward. Users type personal information into prompts. Support agents paste customer records into chat interfaces. Developers pipe production data into debugging workflows. All of that PII flows to your model provider's API endpoint. Even if the provider says they don't train on your data, the information still transits their infrastructure. For regulated industries, that transit itself can be a compliance violation.

The PII Proxy Pattern

The solution is a proxy that intercepts every LLM request, scrubs PII from the prompt, forwards the clean version, and then restores the PII in the response.

The flow looks like this:

User sends a prompt containing PII
Proxy detects and encrypts all PII entities
Clean prompt (with encrypted tokens) goes to the LLM
LLM responds using the encrypted tokens
Proxy decrypts the tokens in the response, restoring original PII
User sees a response with their real data intact

The user never notices the proxy exists. The LLM never sees the real PII. The encryption key stays on your infrastructure.

Building the Proxy in Python

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig
import openai

# Initialize Presidio engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
deanonymizer = DeanonymizeEngine()

ENCRYPTION_KEY = "WmZq4t7w!z%C*F-J"  # In production, pull from Key Vault

def scrub_prompt(text: str) -> tuple:
    """Detect and encrypt PII in the prompt."""
    results = analyzer.analyze(text=text, language="en")

    if not results:
        return text, None

    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "DEFAULT": OperatorConfig("encrypt", {"key": ENCRYPTION_KEY})
        }
    )

    return anonymized.text, anonymized.items

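def restore_response(text: str, items: list) -> str:
    """Decrypt PII tokens in the LLM response."""
    # Imported here so this function can be copied on its own.
    from presidio_anonymizer.operators import Decrypt

    if not items:
        return text

    # DeanonymizeEngine replaces by character offset, and the offsets in
    # `items` describe the scrubbed prompt, not the model's reply. Look up
    # each encrypted token (item.text) wherever the model echoed it instead.
    decryptor = Decrypt()
    for item in items:
        if item.text in text:
            original = decryptor.operate(text=item.text, params={"key": ENCRYPTION_KEY})
            text = text.replace(item.text, original)

    return text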

def chat_with_guardrail(user_message: str) -> str:
    """Send a message to the LLM with PII protection."""
    # Step 1: Scrub
    clean_prompt, pii_items = scrub_prompt(user_message)

    # Step 2: Send to LLM
    client = openai.AzureOpenAI(
        azure_endpoint="https://your-endpoint.openai.azure.com/",
        api_key="your-api-key",
        api_version="2024-02-01"
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": clean_prompt}]
    )

    llm_response = response.choices[0].message.content

    # Step 3: Restore
    final_response = restore_response(llm_response, pii_items)

    return final_response

Test it:

user_input = """
Summarize this customer case: John Smith (john.smith@acme.com, 
SSN 123-45-6789) reported unauthorized charges on his Visa 
ending 4242. He can be reached at 206-555-0147.
"""

response = chat_with_guardrail(user_input)
print(response)

What the LLM sees: encrypted tokens where the PII was. What the user sees: a response with their real customer data. The LLM processes the request without ever handling the actual PII.
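The hard-coded key above is only for the demo. As a minimal sketch of the "pull it from somewhere safer" step, the helper below reads the key from an environment variable and checks its length before any request is served. The variable name PRESIDIO_ENCRYPTION_KEY and the helper itself are assumptions for illustration; the length check reflects the AES key sizes (128, 192, or 256 bits) that the encrypt operator expects.

```python
import os

VALID_KEY_BITS = (128, 192, 256)  # AES key sizes


def load_encryption_key(env_var: str = "PRESIDIO_ENCRYPTION_KEY") -> str:
    """Read the key from the environment and check it is a valid AES length."""
    key = os.environ.get(env_var)
    if key is None:
        raise RuntimeError(f"{env_var} is not set")
    bits = len(key.encode("utf-8")) * 8
    if bits not in VALID_KEY_BITS:
        raise ValueError(f"key is {bits} bits; expected one of {VALID_KEY_BITS}")
    return key
```

Calling this once at startup means a missing or truncated key fails the deployment instead of the first user request.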

Moving the Guardrail into Azure API Management

The Python proxy works, but it lives inside one application. Every team that wants the same protection has to wire in the same code and keep it current. A guardrail belongs at the edge, where every model call already passes through. On Azure, that edge is API Management.

Put APIM in front of Azure OpenAI and point your applications at the APIM endpoint instead of the model endpoint. Now APIM is the one place that sees every prompt and every completion. An inbound policy scrubs PII out of the prompt before it reaches the model. An outbound policy restores it on the way back, so the caller still gets their real values. You can run either direction on its own, or both.

The flow with APIM:

App calls the APIM endpoint with a prompt containing PII
Inbound policy sends the prompt to Presidio, which encrypts the PII entities
APIM stashes the entity map in a context variable and forwards the scrubbed prompt to Azure OpenAI
Azure OpenAI responds, echoing back the encrypted tokens
Outbound policy sends the response plus the saved entity map to Presidio to decrypt
APIM returns the restored response to the app

The model never sees real PII. The encryption key and the entity map never leave your APIM instance and its backend. No application code changes.

In this setup Presidio sits behind two small endpoints, /deidentify and /reidentify, that wrap the analyzer and anonymizer (a thin container that encrypts on the way in, decrypts on the way out, with the key pulled from Key Vault). The APIM policy calls them with send-request:

<policies>
  <inbound>
    <base />
    <!-- Pull the user's prompt out of the chat completion body -->
    <set-variable name="userPrompt"
      value="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"].Last["content"].ToString())" />

    <!-- De-identify: send the prompt to Presidio before the model sees it -->
    <send-request mode="new" response-variable-name="deidentified" timeout="10">
      <set-url>https://presidio.internal/deidentify</set-url>
      <set-method>POST</set-method>
      <set-header name="Content-Type" exists-action="override">
        <value>application/json</value>
      </set-header>
      <set-body>@(new JObject(new JProperty("text", (string)context.Variables["userPrompt"])).ToString())</set-body>
    </send-request>

    <!-- Save the entity map so the outbound step can re-identify.
         preserveContent keeps the body readable for the set-body step below. -->
    <set-variable name="entityMap"
      value="@(((IResponse)context.Variables["deidentified"]).Body.As<JObject>(preserveContent: true)["entities"].ToString())" />

    <!-- Swap the scrubbed prompt back into the request before it hits the model -->
    <set-body>@{
      var body = context.Request.Body.As<JObject>();
      var clean = ((IResponse)context.Variables["deidentified"]).Body.As<JObject>()["text"].ToString();
      body["messages"].Last["content"] = clean;
      return body.ToString();
    }</set-body>
  </inbound>

  <backend>
    <base />
  </backend>

  <outbound>
    <base />
    <!-- Re-identify: decrypt the PII back into the model's response -->
    <send-request mode="new" response-variable-name="reidentified" timeout="10">
      <set-url>https://presidio.internal/reidentify</set-url>
      <set-method>POST</set-method>
      <set-header name="Content-Type" exists-action="override">
        <value>application/json</value>
      </set-header>
      <set-body>@{
        var resp = context.Response.Body.As<JObject>(preserveContent: true);
        var content = resp["choices"][0]["message"]["content"].ToString();
        return new JObject(
          new JProperty("text", content),
          new JProperty("entities", JArray.Parse((string)context.Variables["entityMap"]))
        ).ToString();
      }</set-body>
    </send-request>

    <set-body>@{
      var resp = context.Response.Body.As<JObject>();
      var restored = ((IResponse)context.Variables["reidentified"]).Body.As<JObject>()["text"].ToString();
      resp["choices"][0]["message"]["content"] = restored;
      return resp.ToString();
    }</set-body>
  </outbound>

  <on-error>
    <base />
  </on-error>
</policies>

With this policy in place, every application pointed at the APIM endpoint gets PII protection without changing a line of its own code. The inbound and outbound blocks are independent: scrub on the way in only, restore on the way out only, or both, depending on whether you need the real values back in the response.
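The policy assumes a specific JSON contract from the wrapper: /deidentify takes {"text"} and returns {"text", "entities"}, and /reidentify takes {"text", "entities"} and returns {"text"}. Here is a rough sketch of that contract using only the standard library. Everything inside it is a stand-in: the email regex replaces Presidio's analyzer, base64 replaces the encrypt operator (it is not encryption), and the "token" and "entity_type" field names are hypothetical.

```python
import base64
import json
import re

# Stand-in detector: emails only. A real wrapper would call Presidio's analyzer.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def _seal(value: str) -> str:
    # Placeholder for the encrypt operator. base64 is NOT encryption.
    return base64.urlsafe_b64encode(value.encode()).decode()


def _unseal(token: str) -> str:
    return base64.urlsafe_b64decode(token.encode()).decode()


def deidentify(request_body: str) -> str:
    """Handle POST /deidentify: {"text"} -> {"text", "entities"}."""
    text = json.loads(request_body)["text"]
    entities = []

    def swap(match: re.Match) -> str:
        token = _seal(match.group(0))
        entities.append({"entity_type": "EMAIL_ADDRESS", "token": token})
        return token

    return json.dumps({"text": EMAIL.sub(swap, text), "entities": entities})


def reidentify(request_body: str) -> str:
    """Handle POST /reidentify: {"text", "entities"} -> {"text"}."""
    body = json.loads(request_body)
    text = body["text"]
    for entity in body["entities"]:
        # Look the token up by value: its position in the model's reply is unknown.
        text = text.replace(entity["token"], _unseal(entity["token"]))
    return json.dumps({"text": text})
```

The one design point worth carrying into a real wrapper is the lookup by token value in reidentify: the model's reply will not have the tokens at the offsets they had in the prompt.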

Two decisions shape the setup:

Reversibility. The policy above uses Presidio's encrypt operator so the outbound step can decrypt. If you only need to keep PII away from the model and never need it back, switch the wrapper to replace and drop the outbound policy. It's simpler and there's no key to manage.

Where Presidio runs. The send-request calls point at an internal Presidio endpoint. Keep it on the same VNet as APIM so prompts never touch the public internet. The next section covers those deployment options.

Deploying on Azure

For production, you need Presidio running as a service, not embedded in your application code. Here are the deployment options on Azure, from the quickest to stand up to the most production-ready.

Azure App Service

The fastest path to production. Deploy the Presidio Docker containers to App Service with minimal configuration.

# Create a resource group
az group create --name rg-presidio --location eastus

# Create an App Service plan
az appservice plan create \
  --name presidio-plan \
  --resource-group rg-presidio \
  --is-linux \
  --sku B2

# Deploy the analyzer
az webapp create \
  --name presidio-analyzer-prod \
  --resource-group rg-presidio \
  --plan presidio-plan \
  --deployment-container-image-name mcr.microsoft.com/presidio-analyzer:latest

# Deploy the anonymizer
az webapp create \
  --name presidio-anonymizer-prod \
  --resource-group rg-presidio \
  --plan presidio-plan \
  --deployment-container-image-name mcr.microsoft.com/presidio-anonymizer:latest

Azure Container Apps

For more control over scaling, networking, and multi-container deployments:

# Create an ACA environment
az containerapp env create \
  --name presidio-env \
  --resource-group rg-presidio \
  --location eastus

# Deploy analyzer
az containerapp create \
  --name presidio-analyzer \
  --resource-group rg-presidio \
  --environment presidio-env \
  --image mcr.microsoft.com/presidio-analyzer:latest \
  --target-port 3000 \
  --ingress internal \
  --min-replicas 1 \
  --max-replicas 10

# Deploy anonymizer
az containerapp create \
  --name presidio-anonymizer \
  --resource-group rg-presidio \
  --environment presidio-env \
  --image mcr.microsoft.com/presidio-anonymizer:latest \
  --target-port 3000 \
  --ingress internal \
  --min-replicas 1 \
  --max-replicas 10

Using --ingress internal means the Presidio services aren't exposed to the internet. Only other services in the same ACA environment (or VNet) can reach them. Your /deidentify and /reidentify wrapper sits in the same environment and calls the analyzer and anonymizer over the internal network, and APIM calls the wrapper the same way.

Kubernetes

For enterprise deployments with existing AKS clusters, Presidio publishes Helm charts. The setup is more involved but gives you full control over resource limits, HPA scaling, pod affinity, and network policies.

Production Hardening

Logging and Monitoring

Log every detection for audit trails, but never log the actual PII values. Log the entity types, confidence scores, and positions.

import logging

logger = logging.getLogger("presidio-guardrail")

def scrub_with_logging(text: str, request_id: str) -> tuple:
    results = analyzer.analyze(text=text, language="en")

    # Log detection summary (not the actual PII)
    for r in results:
        logger.info(
            f"request={request_id} "
            f"entity_type={r.entity_type} "
            f"score={r.score:.2f} "
            f"start={r.start} end={r.end}"
        )

    logger.info(f"request={request_id} total_entities={len(results)}")

    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={"DEFAULT": OperatorConfig("encrypt", {"key": ENCRYPTION_KEY})}
    )

    return anonymized.text, anonymized.items

False Positive Handling

Presidio will occasionally flag non-PII as PII. A place name like "Jordan" might be detected as a person name. A product SKU might match a phone number pattern. For production systems, keep an allow list of known false positives and feed new ones into it as they turn up:

# Maintain an allow list of known false positives
FALSE_POSITIVE_ALLOWLIST = {
    "PERSON": ["Jordan", "Phoenix", "Austin"],  # Place names that are also personal names
    "PHONE_NUMBER": ["555-0100"],  # Known test number
}

def filter_false_positives(text: str, results: list) -> list:
    filtered = []
    for r in results:
        value = text[r.start:r.end].strip()
        allowlist = FALSE_POSITIVE_ALLOWLIST.get(r.entity_type, [])
        if value not in allowlist:
            filtered.append(r)
    return filtered

Performance Considerations

Presidio's analyzer is CPU-intensive, especially with the large spaCy model. For high-throughput workloads:

Keep the analyzer engine warm. Initializing AnalyzerEngine() loads the NLP model, which takes a few seconds. Do it once at startup, not per request.

Set a score threshold. Processing low-confidence detections wastes CPU cycles and increases false positives. Start with 0.5 (the score_threshold argument to analyze()) and adjust based on your accuracy requirements.

Use the right NLP model size. en_core_web_lg is more accurate but slower. en_core_web_sm is faster but misses more entities. Profile your specific workload to find the right tradeoff.

Cache recognizer results for repeated text. If the same support template gets processed thousands of times, cache the detection results and only run the anonymizer.

When the guardrail runs inside APIM, two more things matter. Set a sane timeout on the send-request calls so a slow Presidio response can't hang the whole model call, and decide how to fail. Failing closed (block the request if Presidio is unreachable) protects PII at the cost of availability. Failing open does the reverse. For regulated workloads, fail closed and put Presidio behind enough replicas that it rarely comes to that.
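The same fail-closed versus fail-open choice applies if the guardrail lives in application code. A minimal sketch, assuming scrub is whatever callable performs the Presidio round trip (the GuardrailUnavailable exception and guarded_prompt helper are names invented for this example):

```python
class GuardrailUnavailable(RuntimeError):
    """Raised when the scrubbing step fails and the request must be blocked."""


def guarded_prompt(prompt: str, scrub, fail_closed: bool = True) -> str:
    """Return the prompt that may be forwarded to the model, or raise."""
    try:
        return scrub(prompt)
    except Exception as exc:
        if fail_closed:
            # Block the call: rejecting a request is cheaper than leaking PII.
            raise GuardrailUnavailable("PII scrubbing failed; request blocked") from exc
        # Fail open: availability wins and the original prompt goes through.
        return prompt
```

Making fail_closed the default means someone has to opt in to the riskier behavior explicitly.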

Series Wrap-Up

Over these five parts we've gone from zero to a production-ready PII detection and anonymization pipeline. You can install and run Presidio, detect PII in text, images, and structured data, build custom recognizers for your organization's specific patterns, choose the right anonymization strategy for each use case, and deploy Presidio as an LLM guardrail at the APIM edge that keeps sensitive data off third-party infrastructure.

The framework is actively maintained, the Docker images are production-ready, and the extensibility model (custom recognizers, custom operators, external NLP services) means it adapts to whatever compliance requirements your organization throws at it.

This is Part 5 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

Anonymization Strategies

Brian Spann — Wed, 10 Jun 2026 01:10:48 +0000

Detection tells you where the PII is. Anonymization decides what to do about it. Presidio's anonymizer ships with five built-in operators, each suited for different compliance requirements and use cases. Choosing wrong means either destroying data you needed to recover or leaving sensitive information exposed in ways you didn't intend.

This part covers every anonymization operator, when to use each one, how to build pseudonymization with consistent name mappings, and how to process PII in PDFs.

The Five Built-In Operators

Replace

Replaces the detected entity with a specified value. This is the default operator.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith called from 206-555-0147 about his account."

results = analyzer.analyze(text=text, language="en")

# Replace with custom labels (without new_value, replace falls back to <ENTITY_TYPE>)
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED NAME]"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "[REDACTED PHONE]"})
    }
)

print(anonymized.text)
# Output: [REDACTED NAME] called from [REDACTED PHONE] about his account.

Use replace when you want the output to be human-readable and when the original values don't need to be recovered. Good for sharing anonymized datasets with external teams, displaying sanitized text in dashboards, and audit logs where the PII type matters but the value doesn't.

Redact

Removes the entity entirely, leaving no placeholder behind.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("redact"),
        "PHONE_NUMBER": OperatorConfig("redact")
    }
)

print(anonymized.text)
# Output:  called from  about his account.

Redaction changes the text structure and can make sentences unreadable. It's appropriate for internal audit logs where readability isn't a priority, strict compliance scenarios where no trace of PII should remain, and automated pipelines where the text isn't shown to humans.

Mask

Replaces each character with a masking character, preserving the length of the original value.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("mask", {
            "masking_char": "*",
            "chars_to_mask": 100,  # Mask all characters
            "from_end": False
        }),
        "PHONE_NUMBER": OperatorConfig("mask", {
            "masking_char": "#",
            "chars_to_mask": 8,    # Mask first 8 chars
            "from_end": False
        })
    }
)

print(anonymized.text)
# Output: ********** called from ########0147 about his account.

Masking is useful when you need to preserve the length or partial value. Think credit card receipts showing the last four digits, or support screens where agents need to confirm partial identifiers.
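To make the chars_to_mask and from_end parameters concrete, here is a small pure-Python imitation of the masking arithmetic. It is an illustration of the parameters, not Presidio's implementation.

```python
def mask(value: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Mimic the mask operator's parameters on a single string."""
    n = max(0, min(chars_to_mask, len(value)))  # never mask more than exists
    if from_end:
        return value[: len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


print(mask("206-555-0147", "#", 8, False))  # ########0147
print(mask("206-555-0147", "#", 8, True))   # 206-########
print(mask("John Smith", "*", 100, False))  # **********
```

Note that separators count as characters: masking the first 8 characters of a 12-character phone number leaves the last four digits visible.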

Hash

Replaces the entity with a one-way hash. The same input always produces the same hash, which makes it useful for analytics without exposing raw PII.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("hash", {"hash_type": "sha256"}),
        "PHONE_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"})
    }
)

print(anonymized.text)
# Output: <64-character hex digest> called from <64-character hex digest> about his account.

Hash supports sha256 (default) and sha512. Hashing is irreversible. You can't get the original value back from the hash. But you can compare hashes to determine if two records refer to the same person without knowing who that person is. Good for analytics pipelines, deduplication, and cross-referencing anonymized datasets.
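The comparison property is easy to see with plain hashlib. This sketch hashes values directly rather than going through Presidio's operator, so treat it as an illustration of the idea only.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


a = sha256_hex("john.smith@acme.com")
b = sha256_hex("john.smith@acme.com")
c = sha256_hex("John.Smith@acme.com")

print(a == b)  # True: same input, same digest
print(a == c)  # False: a change in case is a different input
print(len(a))  # 64
```

The second comparison is the practical catch for deduplication: normalize values (case, whitespace, formatting) before hashing, or the same person will show up under several digests.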

Encrypt

Replaces the entity with an encrypted value that can be decrypted later with the right key.

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "DEFAULT": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C*F-J"})
    }
)

print(anonymized.text)
# Entities replaced with base64-encoded encrypted strings

Encrypt is the only reversible operator. You can deanonymize later:

from presidio_anonymizer import DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig

deanonymizer = DeanonymizeEngine()

deanonymized = deanonymizer.deanonymize(
    text=anonymized.text,
    entities=anonymized.items,
    operators={
        "DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C*F-J"})
    }
)

print(deanonymized.text)
# Output: John Smith called from 206-555-0147 about his account.

Use encrypt/decrypt for the PII proxy pattern (scrub before sending to LLM, decrypt after). We'll build that exact pipeline in Part 5.

Mixing Operators Per Entity Type

In practice you'll want different strategies for different entity types in the same document.

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
    "EMAIL_ADDRESS": OperatorConfig("hash", {"hash_type": "sha256"}),
    "PHONE_NUMBER": OperatorConfig("mask", {
        "masking_char": "*",
        "chars_to_mask": 8,
        "from_end": False
    }),
    "CREDIT_CARD": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C*F-J"}),
    "US_SSN": OperatorConfig("redact"),
    "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"})
}

The DEFAULT operator catches any entity type that doesn't have a specific operator assigned. Always set a default so nothing slips through unhandled.

Pseudonymization with Consistent Mappings

Standard replacement gives every entity of a type the same generic placeholder. If "John Smith" and "Jane Doe" both appear in a document, each occurrence of either name becomes <PERSON>, so you can no longer tell them apart. That's fine for redaction but breaks any analysis that needs to track individuals across records.

Pseudonymization maps each unique value to a consistent fake value. "John Smith" always becomes "Robert Chen." "Jane Doe" always becomes "Maria Santos." The mapping is consistent within a dataset but the original values are unrecoverable without the mapping table.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker

fake = Faker()
Faker.seed(42)  # Reproducible fake data

# Maintain a mapping for consistency
pii_mapping = {}

def get_consistent_replacement(original, entity_type):
    key = f"{entity_type}:{original}"
    if key not in pii_mapping:
        if entity_type == "PERSON":
            pii_mapping[key] = fake.name()
        elif entity_type == "EMAIL_ADDRESS":
            pii_mapping[key] = fake.email()
        elif entity_type == "PHONE_NUMBER":
            pii_mapping[key] = fake.phone_number()
        elif entity_type == "LOCATION":
            pii_mapping[key] = fake.city()
        else:
            pii_mapping[key] = f"[{entity_type}_{len(pii_mapping)}]"
    return pii_mapping[key]

To integrate this with Presidio, you can build a custom operator or post-process the results:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = """John Smith emailed john@example.com about the project.
Later, John Smith called to follow up. His colleague Jane Doe 
also reached out from jane@example.com."""

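results = analyzer.analyze(text=text, language="en")

# Replace from the end of the text so earlier offsets stay valid.
# Results can overlap (for example a URL detected inside an email address),
# so keep the span that ends last (the longer one on ties) and skip the rest.
sorted_results = sorted(results, key=lambda r: (r.end, r.end - r.start), reverse=True)

pseudonymized = text
last_start = len(text)
for result in sorted_results:
    if result.end > last_start:
        continue  # overlaps a span that was already replaced
    original = text[result.start:result.end]
    replacement = get_consistent_replacement(original, result.entity_type)
    pseudonymized = pseudonymized[:result.start] + replacement + pseudonymized[result.end:]
    last_start = result.start

print(pseudonymized)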

Both occurrences of "John Smith" map to the same fake name, and each email address maps to its own fake address every time it appears. The relationships in the data are preserved without exposing the real identities.

Reversible vs. Irreversible: When to Use Which

Irreversible (replace, redact, mask, hash): Use when the original values should never be recoverable. Compliance with GDPR right-to-erasure, publishing anonymized datasets, any scenario where re-identification is a risk.

Reversible (encrypt): Use when you need the original values back later. The PII proxy pattern (anonymize before LLM, deanonymize after), temporary anonymization for testing, workflows where an authorized user needs to see the real data.

The key question: does anyone, ever, need to get the original PII back? If yes, encrypt. If no, use one of the irreversible operators. Don't hash when you need reversibility (common mistake). Don't encrypt when you need true anonymization (the key becomes a liability).

Processing PDFs

Presidio doesn't process PDFs natively, but you can extract text, anonymize it, and annotate the original PDF with redaction boxes.

import fitz  # PyMuPDF
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Open the PDF
doc = fitz.open("customer_report.pdf")

for page in doc:
    text = page.get_text()

    # Detect PII
    results = analyzer.analyze(text=text, language="en")

    for result in results:
        # Find the text location on the page
        pii_text = text[result.start:result.end]
        instances = page.search_for(pii_text)

        # Draw redaction boxes
        for inst in instances:
            page.add_redact_annot(inst, fill=(0, 0, 0))

    # Apply all redactions on this page
    page.apply_redactions()

# Save the redacted PDF
doc.save("customer_report_redacted.pdf")
doc.close()

This approach searches for each detected PII string on the PDF page and draws a black box over it. The apply_redactions() call permanently removes the underlying text, so the PII is gone from the file, not just covered up visually.

What's Next

You now have the full anonymization toolkit. In Part 5, we'll put it all together as an LLM guardrail: building a PII proxy that intercepts prompts, scrubs PII with encrypt, forwards the clean prompt to the model, and deanonymizes the response. We'll also cover moving the guardrail into Azure API Management, deployment on Azure, and production hardening.

This is Part 4 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

Building Custom Recognizers

Brian Spann — Mon, 08 Jun 2026 22:20:14 +0000

Presidio's built-in recognizers cover the common PII types: names, emails, phone numbers, credit cards, SSNs. But every organization has PII that's specific to their business. Internal employee IDs that follow a custom format. Project codenames that shouldn't leak externally. Customer account numbers that don't match any standard pattern. Medical record numbers, policy IDs, internal ticket references. The built-in recognizers don't know about these.

This part covers four ways to build custom recognizers, from the simplest (a list of words to flag) to the most sophisticated (connecting an external NLP service).

Deny-List Recognizers

The fastest way to add a custom recognizer is a deny list. You give Presidio a list of words or phrases and it flags any exact match as a specific entity type.

Use case: your company has internal project codenames (like "Project Titan," "Sapphire," "Nightingale") that are confidential and should never appear in data sent to external services.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

# Create a deny-list recognizer
project_recognizer = PatternRecognizer(
    supported_entity="INTERNAL_PROJECT",
    deny_list=["Titan", "Sapphire", "Nightingale", "Ironclad", "Meridian"],
    deny_list_score=1.0
)

# Add it to the analyzer
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(project_recognizer)

# Test it
text = "The Titan rollout is scheduled for Q3. Contact sarah@company.com for details."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

Output:

INTERNAL_PROJECT: 'Titan' (score: 1.00)
EMAIL_ADDRESS: 'sarah@company.com' (score: 1.00)

The deny_list_score parameter sets the confidence level for matches. Set it to 1.0 if the deny list is curated and every match is definitely PII. Lower it if some terms might appear in non-sensitive contexts.

Deny lists are case-insensitive by default. "titan," "TITAN," and "Titan" all match.
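Conceptually, a deny list boils down to one case-insensitive, whole-word regex. The helper below is an approximation written for this explanation; Presidio's internal pattern may differ in its details.

```python
import re


def deny_list_regex(terms):
    """Build a case-insensitive, whole-word pattern from a list of terms."""
    escaped = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"(?<!\w)(?:{escaped})(?!\w)", re.IGNORECASE)


pattern = deny_list_regex(["Titan", "Sapphire", "Nightingale"])
print(pattern.findall("TITAN and sapphire shipped; the Titanic did not."))
# ['TITAN', 'sapphire']
```

The whole-word boundaries are what keep "Titanic" from matching "Titan". Short or common codenames will still collide with ordinary words, which is where a lower deny_list_score helps.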

Regex Recognizers

When your PII follows a pattern but the built-in recognizers don't cover it, write a regex recognizer.

Use case: your company uses employee IDs in the format EMP-XXXXX (EMP- followed by 5 digits) and customer account numbers in the format ACC-XXXX-XXXX.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Employee ID recognizer
emp_id_pattern = Pattern(
    name="employee_id_pattern",
    regex=r"\bEMP-\d{5}\b",
    score=0.9
)

emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_id_pattern],
    name="EmployeeIdRecognizer"
)

# Customer account recognizer
account_pattern = Pattern(
    name="account_number_pattern",
    regex=r"\bACC-\d{4}-\d{4}\b",
    score=0.9
)

account_recognizer = PatternRecognizer(
    supported_entity="CUSTOMER_ACCOUNT",
    patterns=[account_pattern],
    name="CustomerAccountRecognizer"
)

# Register both
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(emp_recognizer)
analyzer.registry.add_recognizer(account_recognizer)

text = "Employee EMP-28471 processed refund for account ACC-9921-0047."
results = analyzer.analyze(text=text, language="en")

for r in results:
    print(f"{r.entity_type}: '{text[r.start:r.end]}' (score: {r.score:.2f})")

Output:

EMPLOYEE_ID: 'EMP-28471' (score: 0.90)
CUSTOMER_ACCOUNT: 'ACC-9921-0047' (score: 0.90)

The score in the Pattern object sets the base confidence. You can define multiple patterns for the same entity type if the format varies (some systems might use EMP-XXXXX and others use E-XXXXXXX).
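Before wiring several patterns into a recognizer, it can help to check them with plain re against strings that should and should not match. A quick sketch, using the two employee ID formats mentioned above (the sample IDs are made up):

```python
import re

EMPLOYEE_ID_PATTERNS = [
    r"\bEMP-\d{5}\b",  # EMP- followed by exactly 5 digits
    r"\bE-\d{7}\b",    # alternate format: E- followed by exactly 7 digits
]


def find_ids(text: str) -> list:
    """Return every substring matched by any of the employee ID patterns."""
    hits = []
    for regex in EMPLOYEE_ID_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(regex, text))
    return hits


print(find_ids("EMP-28471 and E-0012345 are valid; EMP-123456 and EMP-12 are not."))
# ['EMP-28471', 'E-0012345']
```

The trailing \b is doing real work here: without it, EMP-123456 would match on its first five digits.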

Context Enhancement

Regex patterns alone can produce false positives. A pattern like \d{5} matches any 5-digit number, not just employee IDs. Context words help Presidio distinguish between a zip code and an employee number.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# A medical record number recognizer with context
mrn_pattern = Pattern(
    name="mrn_pattern",
    regex=r"\b\d{7,10}\b",
    score=0.3  # Low base score because 7-10 digit numbers are common
)

mrn_recognizer = PatternRecognizer(
    supported_entity="MEDICAL_RECORD",
    patterns=[mrn_pattern],
    # Single words: context is compared against individual tokens near the
    # match, so a phrase like "medical record" would not match as one entry.
    context=["medical", "record", "mrn", "patient", "chart", "health"],
    name="MedicalRecordRecognizer"
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(mrn_recognizer)

# With context: high confidence
text1 = "Patient medical record number: 4829173"
results1 = analyzer.analyze(text=text1, language="en")
# Score boosted because "medical" and "record" appear just before the match

# Without context: low confidence (might be filtered by threshold)
text2 = "Order 4829173 shipped on Tuesday"
results2 = analyzer.analyze(text=text2, language="en")
# Score stays at base 0.3 because no context words present

The pattern starts with a low base score (0.3). When context words appear within a configurable window around the match, Presidio boosts the score. When they don't, the score stays low and gets filtered out by your threshold.

This is the right approach for any pattern that's too generic on its own. Set a low base score, provide strong context words, and let the context scoring do the disambiguation.
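The arithmetic behind the boost is simple enough to sketch. The numbers below (a 0.35 boost and a 0.4 floor for matches with context) are what I understand the default context enhancer to use; treat them as assumptions and check the version you run.

```python
def score_with_context(base_score: float, has_context: bool,
                       boost: float = 0.35, floor: float = 0.4) -> float:
    """Approximate the context boost. The defaults are assumptions."""
    if not has_context:
        return base_score
    # Add the boost, lift to the floor if still below it, and cap at 1.0.
    return round(min(max(base_score + boost, floor), 1.0), 2)


print(score_with_context(0.3, has_context=True))   # 0.65
print(score_with_context(0.3, has_context=False))  # 0.3
print(score_with_context(0.9, has_context=True))   # 1.0
```

With a threshold of 0.5, the 0.3 pattern only survives when context is present, which is exactly the disambiguation you want from a generic pattern.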

No-Code Recognizers via YAML

For teams that want to manage recognizers without touching Python code, Presidio supports YAML-based configuration. You define recognizers in a YAML file and load them at startup.

# custom_recognizers.yaml
recognizers:
  - name: "Project Code Recognizer"
    supported_language: "en"
    supported_entity: "INTERNAL_PROJECT"
    deny_list:
      - "Titan"
      - "Sapphire"
      - "Nightingale"
      - "Ironclad"
    deny_list_score: 1.0

  - name: "Employee ID Recognizer"
    supported_language: "en"
    supported_entity: "EMPLOYEE_ID"
    patterns:
      - name: "emp_id"
        regex: "\\bEMP-\\d{5}\\b"
        score: 0.9
    context:
      - "employee"
      - "emp"
      - "staff"
      - "worker"

  - name: "Policy Number Recognizer"
    supported_language: "en"
    supported_entity: "POLICY_NUMBER"
    patterns:
      - name: "policy_format"
        regex: "\\bPOL-[A-Z]{2}-\\d{6}\\b"
        score: 0.95
    context:
      - "policy"
      - "insurance"
      - "coverage"
      - "claim"

Load them into the analyzer:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

# Start from the built-in recognizers, then add the custom ones from YAML
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizers_from_yaml("custom_recognizers.yaml")

analyzer = AnalyzerEngine(registry=registry)

The YAML approach is useful when non-developers (security teams, compliance officers) need to update the recognizer list. They edit a YAML file and the service restarts with the new configuration. No code changes, no deployments.
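When people who don't write regex every day are editing that file, a quick sanity check before the restart can save you an outage. Here's a rough sketch that works on the already-parsed config (the dict you'd get from a YAML loader). validate_recognizers is a hypothetical helper of mine, not part of Presidio, and the "Broken Recognizer" entry exists only to show a failure.

```python
import re


def validate_recognizers(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed recognizer config."""
    problems = []
    for rec in config.get("recognizers", []):
        name = rec.get("name", "<unnamed>")
        if not rec.get("supported_entity"):
            problems.append(f"{name}: missing supported_entity")
        if not rec.get("patterns") and not rec.get("deny_list"):
            problems.append(f"{name}: needs patterns or a deny_list")
        for pattern in rec.get("patterns", []):
            regex = pattern.get("regex", "")
            try:
                re.compile(regex)  # catches unbalanced parentheses and similar typos
            except re.error as exc:
                problems.append(f"{name}: invalid regex {regex!r} ({exc})")
            if not 0.0 <= pattern.get("score", 0.0) <= 1.0:
                problems.append(f"{name}: score must be between 0 and 1")
    return problems


# Same shape a YAML loader would produce for entries like the ones above.
config = {
    "recognizers": [
        {"name": "Employee ID Recognizer", "supported_language": "en",
         "supported_entity": "EMPLOYEE_ID",
         "patterns": [{"name": "emp_id", "regex": r"\bEMP-\d{5}\b", "score": 0.9}]},
        {"name": "Broken Recognizer", "supported_language": "en",
         "supported_entity": "BROKEN",
         "patterns": [{"name": "bad", "regex": r"\bEMP-(\d{5}", "score": 0.9}]},
    ]
}

problems = validate_recognizers(config)
for problem in problems:
    print(problem)
```

Run something like this in CI or as a pre-restart hook and reject the file if the list isn't empty.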

Connecting External Services

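For cases where local regex and spaCy's NER aren't enough, Presidio supports remote recognizers that call external NLP services. Azure AI Language is the most common integration. A lighter first step is to swap the NER model itself: the example below keeps everything local but replaces spaCy's NER with a Hugging Face transformer model.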

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configure the analyzer to use a transformer model instead of spaCy
nlp_config = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": "dslim/bert-base-NER"
            }
        }
    ]
}

nlp_engine = NlpEngineProvider(nlp_configuration=nlp_config).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine)

The transformer-based NER model (dslim/bert-base-NER or similar) often outperforms spaCy's default model on names and locations, especially for unusual name formats. dslim/bert-base-NER is trained on English, so for other languages pick a model trained on them. The tradeoff is speed: transformer models are slower than spaCy, so measure latency against your requirements before switching.

Testing Your Recognizers

Before deploying custom recognizers, test them against labeled data.

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# (add your custom recognizers)

# Test cases: (input_text, entity_type, expected_value)
# expected_value=None means entity_type must NOT be detected in the text
test_cases = [
    ("Employee EMP-12345 submitted the report", "EMPLOYEE_ID", "EMP-12345"),
    ("Contact acc-9921-0047 about the refund", "CUSTOMER_ACCOUNT", "ACC-9921-0047"),
    ("Project Titan launch is next month", "INTERNAL_PROJECT", "Titan"),
    ("The titan submarine was discovered", "INTERNAL_PROJECT", "titan"),  # Should this match?
    ("Order number 12345 shipped", "EMPLOYEE_ID", None),  # Should NOT match EMPLOYEE_ID
]

for text, entity_type, expected_value in test_cases:
    results = analyzer.analyze(text=text, language="en", score_threshold=0.5)
    # Only look at detections of the entity type under test
    found = [text[r.start:r.end] for r in results if r.entity_type == entity_type]

    if expected_value is None:
        status = "PASS" if not found else "FAIL"
    else:
        matched = any(v.lower() == expected_value.lower() for v in found)
        status = "PASS" if matched else "FAIL"

    print(f"[{status}] '{text}' -> {entity_type}: {expected_value or 'no match expected'}")

Pay particular attention to false positives (non-PII flagged as PII) and false negatives (actual PII missed). Adjust regex patterns, context words, and score thresholds based on your test results.
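Once you have more than a handful of cases, it helps to turn those two failure modes into numbers. Here's a small sketch that computes precision and recall over labeled spans. The spans are made-up values for illustration, and precision_recall is my own helper, not a Presidio API.

```python
def precision_recall(expected: set, detected: set) -> tuple[float, float]:
    """Compute precision and recall over (entity_type, start, end) spans."""
    true_positives = len(expected & detected)
    # Precision drops with false positives, recall drops with false negatives.
    precision = true_positives / len(detected) if detected else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


# Hand-labeled spans for one document: (entity_type, start, end)
expected = {("EMPLOYEE_ID", 9, 18), ("INTERNAL_PROJECT", 30, 35)}
# Spans the analyzer returned (hypothetical values for illustration)
detected = {("EMPLOYEE_ID", 9, 18), ("INTERNAL_PROJECT", 50, 55)}

precision, recall = precision_recall(expected, detected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Track both numbers per entity type as you tune. A threshold change that lifts precision usually costs recall, and you want to see that trade explicitly.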

What's Next

You can now extend Presidio to detect any entity type your business needs. In Part 4, we'll cover anonymization strategies: the full set of operators (replace, redact, mask, hash, encrypt), pseudonymization with consistent mappings, synthetic data generation, and when to use reversible vs. irreversible anonymization.

This is Part 3 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

Detecting PII in Real-World Text

Brian Spann — Sun, 07 Jun 2026 18:28:34 +0000

In Part 1 we installed Presidio and ran a basic detection on clean sample text. Real data is messier. Emails have signatures with phone numbers buried in HTML. Support tickets mix PII with technical jargon. Chat logs have informal name references that NER models struggle with. And sometimes the PII isn't in text at all. It's in screenshots and scanned documents.

This part covers how Presidio's detection engine actually works under the hood, how to process different text types you'll encounter in production, and how to handle structured data and images.

How the Analyzer Engine Works

Presidio doesn't rely on a single detection method. It layers three approaches and combines their results.

Named Entity Recognition (NER)

The NER model (spaCy by default) processes the text and identifies entities based on the language model's training. It's good at catching names, locations, and organizations even when they don't follow a fixed pattern. "John Smith" is easy. "Dr. J. Martinez-Garcia" is harder but the NER model handles it because it understands context and word patterns.

The tradeoff is that NER is probabilistic. It can miss unusual names or flag common words as entities. That's why Presidio doesn't stop here.

Pattern Matching (Regex)

For entities with predictable formats, Presidio uses regex recognizers. Credit card numbers, SSNs, email addresses, IP addresses, phone numbers all have known patterns. A Luhn-validated 16-digit number is almost certainly a credit card. A string matching \d{3}-\d{2}-\d{4} in the right context is probably an SSN.

Pattern-based detections typically get higher confidence scores than NER detections because the pattern itself is strong evidence.
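The Luhn check mentioned above is simple enough to show in full. This is a standalone sketch of the checksum, not Presidio's own code, and luhn_valid is a name I picked for it.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


for candidate in ("4111-1111-1111-1111", "5105105105105100", "4111-1111-1111-1112"):
    print(candidate, luhn_valid(candidate))
```

A random 16-digit number passes this check only about one time in ten, which is why a pattern match plus a valid checksum is such strong evidence.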

Context Scoring

Here's where it gets interesting. Presidio looks at the words surrounding a potential match to boost or lower confidence. If the text says "my SSN is 123-45-6789," the phrase "my SSN is" provides strong context that the number is actually a social security number and not some random ID. The context words push the confidence score higher.

Without context scoring, a 9-digit number in the format XXX-XX-XXXX could be an SSN or a product SKU or an internal reference number. The surrounding words help Presidio decide.

Each recognizer defines its own list of context words. The SSN recognizer looks for words like "social," "security," "ssn," "tax id." The credit card recognizer looks for "credit," "card," "visa," "mastercard," "payment."

Processing Different Text Types

Emails

Email bodies often contain PII in signatures, forwarded messages, and inline contact details. The challenge is separating the PII you care about from the structural noise (headers, disclaimer text, HTML tags).

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

email_body = """
From: Sarah Chen <sarah.chen@acme.com>
To: support@company.com
Subject: Account Issue

Hi, I'm having trouble with my account. My customer ID is CUS-2847391 
and the last four of my card are 4242. Please call me at (415) 555-0198 
or email me at sarah.chen@acme.com.

Thanks,
Sarah Chen
VP of Engineering, Acme Corp
Office: (415) 555-0100
Mobile: (415) 555-0198
"""

results = analyzer.analyze(text=email_body, language="en")

for result in results:
    print(f"{result.entity_type}: '{email_body[result.start:result.end].strip()}' "
          f"(score: {result.score:.2f})")

Presidio will pick up the email addresses, phone numbers, and the person's name from both the body and the signature. It will also likely flag "Acme Corp" as an organization. You'll notice the same phone number appears twice (in the body and the signature), and Presidio reports each occurrence separately with its own position.
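If you care about distinct values rather than every occurrence (for an audit report, say), you can group the spans by the text they cover. A minimal sketch, assuming you've already reduced results to (entity_type, start, end) tuples. The regex below only stands in for analyzer output so the example runs on its own, and group_by_value is a hypothetical helper.

```python
import re
from collections import defaultdict


def group_by_value(text: str, spans: list[tuple[str, int, int]]) -> dict:
    """Group (entity_type, start, end) spans by entity type and matched text."""
    grouped = defaultdict(list)
    for entity_type, start, end in spans:
        grouped[(entity_type, text[start:end])].append((start, end))
    return dict(grouped)


text = "Call (415) 555-0198 today. Mobile: (415) 555-0198"

# Stand-in for analyzer results: one span per phone number occurrence.
spans = [("PHONE_NUMBER", m.start(), m.end())
         for m in re.finditer(r"\(\d{3}\) \d{3}-\d{4}", text)]

grouped = group_by_value(text, spans)
for (entity_type, value), positions in grouped.items():
    print(f"{entity_type} {value!r} appears {len(positions)} time(s) at {positions}")
```

Keep the individual positions around, though. The anonymizer still needs every occurrence to scrub the text completely.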

Support Tickets

Support tickets mix PII with technical content. Users paste error messages, stack traces, and config snippets alongside their personal details.

ticket = """
User report from jane.doe@company.com:

I'm getting error 500 when trying to update my billing info. 
My account number is 7829-4451-2290 and I'm using the card 
ending in 8847. The error started after I changed my address 
to 1234 Oak Street, Portland, OR 97201.

Stack trace:
java.lang.NullPointerException at com.billing.PaymentService.update(PaymentService.java:142)
"""

results = analyzer.analyze(text=ticket, language="en")

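Presidio handles the structured parts well: the built-in regex recognizers catch the email address, and the NER model typically picks up locations such as Portland. The account number, street address, and ZIP code are less reliable, since Presidio ships no dedicated recognizer for those formats; plan on custom recognizers (Part 3) if you need them. The stack trace shouldn't trigger false positives because Java class names and file paths don't match PII patterns.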

Chat Logs

Chat logs are the hardest text type for PII detection. Messages are short, informal, and full of abbreviations. Names appear without context. Phone numbers get typed without dashes.

chat_log = """
[10:42] mike_t: hey can someone help with my acct? 
[10:42] mike_t: email is m.thompson@gmail.com
[10:43] support_bot: Sure Mike! What's the issue?
[10:44] mike_t: charge on my visa ending 4242 wasnt mine
[10:44] mike_t: my number is 5105105105105100
[10:45] support_bot: I'll look into that. Can you confirm your DOB?
[10:45] mike_t: march 15 1990
"""

results = analyzer.analyze(text=chat_log, language="en")

The credit card number without dashes or spaces is harder to spot by eye, but Presidio's credit card recognizer applies Luhn validation to sequences of digits, so it will still flag it. The date of birth is trickier. Presidio reports "march 15 1990" as a generic DATE_TIME entity and has no separate DOB type, so deciding that a date is a date of birth is up to your own logic. The preceding "confirm your DOB" message is the cue you'd key on.

Confidence Scores and Thresholds

Every result comes with a confidence score between 0 and 1. By default, Presidio returns everything above 0. In production you'll want to set thresholds.

# Only return high-confidence detections
results = analyzer.analyze(
    text=text,
    language="en",
    score_threshold=0.7
)

# Or filter after the fact for more control
high_confidence = [r for r in results if r.score >= 0.7]
medium_confidence = [r for r in results if 0.4 <= r.score < 0.7]
low_confidence = [r for r in results if r.score < 0.4]

A practical approach: use a high threshold (0.7 or above) for automated anonymization where false positives are costly, and a lower threshold (0.3-0.5) for audit/review workflows where a human checks the flagged items.
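That two-tier approach can be sketched as a small routing function. The Detection class and route helper here are illustrative stand-ins, not Presidio types, and the 0.7 and 0.4 cutoffs are just the example values from above.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    entity_type: str
    score: float


def route(detection: Detection, auto_threshold: float = 0.7,
          review_threshold: float = 0.4) -> str:
    """Decide what to do with a detection based on its confidence score."""
    if detection.score >= auto_threshold:
        return "anonymize"   # confident enough to act without a human
    if detection.score >= review_threshold:
        return "review"      # queue for a human to confirm
    return "ignore"          # too weak to act on


detections = [
    Detection("EMAIL_ADDRESS", 1.0),
    Detection("PERSON", 0.55),
    Detection("US_BANK_NUMBER", 0.05),
]
for d in detections:
    print(d.entity_type, "->", route(d))
```

Tune the two cutoffs separately per workflow. The review band is where you learn which recognizers need better context words.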

Batch Processing with presidio-structured

When your PII lives in CSVs, DataFrames, or JSON files, processing text column by column is tedious. The presidio-structured package handles this.

pip install presidio-structured

import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_structured import StructuredEngine, PandasAnalysisBuilder

# Sample DataFrame
df = pd.DataFrame({
    "customer_name": ["John Smith", "Jane Doe", "Bob Wilson"],
    "email": ["john@example.com", "jane@example.com", "bob@example.com"],
    "notes": [
        "Called about SSN 123-45-6789",
        "Address: 456 Elm St, Portland OR",
        "Card ending 4242, refund requested"
    ]
})

# The builder samples each column and maps it to the entity type it finds there
analyzer = AnalyzerEngine()
analysis = PandasAnalysisBuilder(analyzer=analyzer).generate_analysis(df)

# The structured engine defaults to a pandas data processor
structured_engine = StructuredEngine()

# Anonymize every column the analysis mapped to an entity type
anonymized_df = structured_engine.anonymize(df, analysis)
print(anonymized_df)

The analysis builder inspects each column with the same analyzer you've been using and decides which entity type, if any, that column holds. The structured engine then anonymizes the values in the mapped columns. You can pass an operators dictionary to anonymize to apply a different operator per entity type. Because the mapping is per column, free-text columns like notes, where PII is buried inside longer sentences, are usually better handled by running the regular analyzer and anonymizer on each cell.

Image Redaction with presidio-image-redactor

Sometimes PII isn't in text at all. It's in screenshots of forms, scanned documents, or photos of ID cards. Presidio's image redactor handles this by running OCR (via Tesseract) to extract text from images, detecting PII in the extracted text, and then drawing colored boxes over the PII regions in the original image.

# Install the image redactor
pip install presidio-image-redactor

# Make sure Tesseract is installed
# Mac: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr

from presidio_image_redactor import ImageRedactorEngine
from PIL import Image

# Load an image
image = Image.open("support_screenshot.png")

# Initialize the redactor
redactor = ImageRedactorEngine()

# Redact PII from the image
redacted_image = redactor.redact(image, fill=(0, 0, 0))

# Save the result
redacted_image.save("support_screenshot_redacted.png")

The fill parameter sets the color of the redaction boxes. Black (0, 0, 0) is the default. You can also limit redaction to specific entity types by passing analyzer arguments such as entities:

from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

image = Image.open("support_screenshot.png")
redactor = ImageRedactorEngine()

# Redact only the listed entity types
redacted = redactor.redact(
    image,
    fill=(0, 0, 0),       # Box color: black
    entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"]
)

Image redaction accuracy depends heavily on the OCR quality. Clean screenshots with standard fonts work well. Handwritten text, low-resolution scans, and images with complex backgrounds will produce lower accuracy. For those cases, you may want to preprocess the image (deskew, enhance contrast) before sending it to the redactor.

What's Next

Now you understand how Presidio's detection layers work together and how to process the text types you'll actually encounter. In Part 3, we'll build custom recognizers: deny-list recognizers for company-specific terms, regex recognizers for internal ID formats, rule-based recognizers with context enhancement, and no-code recognizers via YAML configuration.

This is Part 2 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

What Is Microsoft Presidio and Why You Need It (Setup + First Detection)

Brian Spann — Fri, 05 Jun 2026 12:24:35 +0000

If you're building anything that touches user data and sends it to an LLM, you have a PII problem. Names, emails, phone numbers, credit card numbers, social security numbers sitting in support tickets, chat logs, documents, and database fields. Every time you pipe that data into a prompt, you're sending someone's personal information to a third-party model endpoint. Maybe that's fine for your use case. Maybe it's not. Either way, you should know what's in your data before you make that call.

Microsoft Presidio is an open-source framework that detects and anonymizes PII in text, images, and structured data. It's been around since 2019, it's actively maintained, and it's what I reach for when I need to scrub data before it hits an LLM. This series walks through the entire framework from installation to production deployment. No toy examples. Real workloads.

What Presidio Actually Does

Presidio has two core modules that handle the detection and anonymization pipeline separately.

The Analyzer finds PII. It combines named entity recognition (NER) from spaCy or Hugging Face transformers with regex pattern matching and contextual scoring. When you feed it text, it returns a list of detected entities with types, confidence scores, and character positions. It doesn't modify the text. It just tells you what it found.

The Anonymizer takes the analyzer's output and does something with it. Replace detected names with <PERSON>. Redact phone numbers entirely. Mask credit card numbers with asterisks. Hash emails. Encrypt values you need to reverse later. The anonymizer is where you decide how to handle each entity type.

Beyond those two, Presidio has additional modules for specific use cases. presidio-image-redactor handles OCR on images and redacts PII from screenshots and scanned documents. presidio-structured processes tabular data in DataFrames and JSON. We'll get to those in later parts of this series.

Installing Presidio

You have two paths: Python packages via pip or Docker containers. I'll cover both because you'll want pip for development and experimentation, and Docker for anything that needs to serve an API.

pip Installation

Set up a virtual environment first. Presidio pulls in spaCy and NLP models that you don't want colliding with other projects.

# Create and activate a virtual environment
python -m venv presidio-env
source presidio-env/bin/activate  # Linux/Mac
# presidio-env\Scripts\activate   # Windows

# Install the core packages
pip install presidio-analyzer presidio-anonymizer

# Download a spaCy language model (the large model is more accurate)
python -m spacy download en_core_web_lg

The en_core_web_lg model is about 560MB. If you're tight on space or just experimenting, en_core_web_sm works but you'll see lower accuracy on name and location detection. For anything beyond a quick test, use the large model.

Docker Installation

Presidio publishes official images to Microsoft Container Registry. Each module runs as its own REST API.

# Pull the images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer

# Run the analyzer on port 5001
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-analyzer:latest

# Run the anonymizer on port 5002
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-anonymizer:latest

Both containers expose REST APIs on port 3000 internally. Map them to whatever ports you want on the host. Once they're running, you can hit them with curl or any HTTP client.

To verify they're up:

curl http://localhost:5001/health
curl http://localhost:5002/health

Your First Detection

Let's feed the analyzer some text and see what comes back. I'll show both the Python API and the REST API so you can pick whichever fits your workflow.

Python API

from presidio_analyzer import AnalyzerEngine

# Initialize the analyzer
analyzer = AnalyzerEngine()

# Sample text with multiple PII types
text = """
Hi, my name is John Smith and I live in Seattle. 
My email is john.smith@example.com and my phone 
number is 206-555-0147. My SSN is 123-45-6789 
and my credit card is 4111-1111-1111-1111.
"""

# Analyze the text
results = analyzer.analyze(text=text, language="en")

# Print what we found
for result in results:
    print(f"{result.entity_type}: '{text[result.start:result.end].strip()}' "
          f"(score: {result.score:.2f}, position: {result.start}-{result.end})")

Output (exact scores vary with the spaCy model and Presidio version):

PERSON: 'John Smith' (score: 0.85, position: 16-26)
LOCATION: 'Seattle' (score: 0.85, position: 41-48)
EMAIL_ADDRESS: 'john.smith@example.com' (score: 1.00, position: 63-85)
PHONE_NUMBER: '206-555-0147' (score: 0.75, position: 110-122)
US_SSN: '123-45-6789' (score: 0.85, position: 134-145)
CREDIT_CARD: '4111-1111-1111-1111' (score: 1.00, position: 169-188)

REST API (Docker)

curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My name is John Smith and my email is john.smith@example.com",
    "language": "en"
  }'

The response is a JSON array of detected entities with the same fields: entity type, start position, end position, and confidence score.
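If you'd rather call the container from Python than from curl, the standard library is enough. This is a sketch that assumes the analyzer is mapped to port 5001 as in the Docker section. build_analyze_request is a hypothetical helper, and the actual network call is left commented out so the snippet runs without a container.

```python
import json
import urllib.request


def build_analyze_request(text: str, base_url: str = "http://localhost:5001",
                          language: str = "en") -> urllib.request.Request:
    """Build a POST request for the analyzer's /analyze endpoint."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("My name is John Smith")
print(request.get_method(), request.full_url)

# Sending it requires the analyzer container to be running:
# with urllib.request.urlopen(request, timeout=10) as response:
#     for entity in json.loads(response.read()):
#         print(entity["entity_type"], entity["start"], entity["end"], entity["score"])
```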

Anatomy of a Recognizer Result

Every detection result contains five fields that matter:

entity_type is what Presidio thinks it found. PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, US_SSN, LOCATION, and dozens more.

start and end are character positions in the original text. This is how you know exactly which substring triggered the detection. It's also how the anonymizer knows what to replace.

score is a confidence value between 0 and 1. A credit card match returns 1.0 when the number both fits the pattern and passes the Luhn checksum. A name detected by NER might return 0.85 because the model is making a probabilistic judgment. You can set a threshold to filter out low-confidence detections. The default is 0.

analysis_explanation tells you which recognizer fired and why. It is populated only when you call analyze with return_decision_process=True. Useful for debugging false positives.
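To see why start and end matter so much, here's a bare-bones sketch of span replacement. It is not how Presidio's anonymizer is implemented, just the core idea: replace from the end of the text backward so earlier offsets stay valid. It assumes non-overlapping spans, and replace_spans is a hypothetical helper.

```python
def replace_spans(text: str, spans: list[tuple[str, int, int]]) -> str:
    """Replace each (entity_type, start, end) span with a <TYPE> placeholder."""
    # Work from the last span to the first so earlier offsets stay valid.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


text = "My name is John Smith and my email is john.smith@example.com"

# Offsets computed from the text itself so the example stays self-consistent.
name, email = "John Smith", "john.smith@example.com"
spans = [
    ("PERSON", text.index(name), text.index(name) + len(name)),
    ("EMAIL_ADDRESS", text.index(email), text.index(email) + len(email)),
]

scrubbed = replace_spans(text, spans)
print(scrubbed)
```

If you replaced left to right instead, the first substitution would shift every later offset and you'd cut into the wrong characters.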

Supported Entities Out of the Box

Presidio ships with recognizers for a wide range of entity types across multiple categories.

Global entities (work across languages): credit card numbers, crypto wallet addresses, email addresses, IBAN codes, IP addresses, phone numbers, URLs, domain names, dates.

US-specific: Social Security numbers, bank account numbers, driver's license numbers, ITIN, passport numbers.

UK-specific: NHS numbers.

Other regions: Singapore financial numbers, Australian business numbers, and more through community recognizers.

The full list is in the Presidio supported entities documentation. If your entity type isn't covered, you can build custom recognizers. That's Part 3 of this series.

Running Your First Anonymization

Detection is only half the job. Let's anonymize the results.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "My name is John Smith and my email is john.smith@example.com"

# Detect PII
results = analyzer.analyze(text=text, language="en")

# Anonymize with default settings (replaces with entity type labels)
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)
# Output: My name is <PERSON> and my email is <EMAIL_ADDRESS>

The default behavior replaces each detected entity with its type label wrapped in angle brackets. In Part 4 we'll dig into all the anonymization operators (replace, redact, mask, hash, encrypt) and when to use each one. For now, the point is that detection and anonymization are separate steps. You can detect without anonymizing, anonymize differently per entity type, or build a pipeline that does both in one shot.

What's Next

That's the foundation. Presidio installed, first detection running, and you understand what the output looks like. In Part 2, we'll go deeper on the analyzer: how the NER models, regex patterns, and context scoring work together, how to process different text types (emails, support tickets, chat logs), batch processing with presidio-structured, and image redaction with presidio-image-redactor.

This is Part 1 of the Hands-On Microsoft Presidio series. I write about PII detection, AI infrastructure, and building with Claude Code on Dev.to.

Azure Container Apps Express: The Agent-First Platform You've Been Waiting For

Brian Spann — Thu, 28 May 2026 03:34:09 +0000

I've been running AI workloads on Azure Container Apps for over a year. Every time I spin up a new agent backend, the ritual is the same: create an environment, configure networking, set scaling rules, wire up health probes, then deploy the actual container. For a prototype agent that might live for a week, that's too much ceremony for what you get.

ACA Express, which hit public preview in May 2026, kills most of that ceremony. And a separate but related announcement, Docker Compose for Agents, brings MCP gateways and model serving to standard ACA environments. They solve different problems and run on different infrastructure, but together they cover the full spectrum of agent deployment on Azure.

Let me break down both.

ACA Express: What It Actually Is

Express is a new environment tier within Azure Container Apps. You bring a container image. Express handles provisioning, HTTPS, scaling (including scale-from-zero with subsecond cold starts), and resource allocation. No environment to manually provision through the portal. No networking to configure. No scaling rules to write.

Under the hood, Express is built on ACA Sandboxes, a platform primitive that uses prewarmed pools to deliver that subsecond startup. This isn't the standard ACA cold-start experience with a fresh coat of paint. It's a different architecture.

The tradeoffs are real. Express is HTTP workloads only, consumption CPU only. No GPU. No VNet integration. No Dapr. No service discovery between apps. No managed identity at runtime. No health probes. If you need any of those, standard ACA environments are still there. But for stateless HTTP agent backends, Express is dramatically faster to deploy and cheaper to run.

Here's what it takes to get a container running:

# Create an express environment
az containerapp env create \
  --name my-express-env \
  --resource-group rg-my-agents \
  --environment-mode express \
  --logs-destination none

# Deploy your app
az containerapp create \
  --name my-agent-api \
  --resource-group rg-my-agents \
  --environment my-express-env \
  --image mcr.microsoft.com/k8se/quickstart:latest \
  --target-port 80 \
  --ingress external \
  --min-replicas 0 \
  --max-replicas 1

Your app is running in seconds. Not minutes. Seconds.

Express also has its own portal experience at containerapps.azure.com, separate from the Azure portal. If you're using the portal, you don't even need to create the environment yourself. It handles that automatically.

Why "Agent-First" Is the Right Framing

Microsoft is explicitly positioning Express for two audiences: developers who want to ship fast, and AI agents that deploy endpoints on demand. That second audience is the interesting one.

Think about how modern agent architectures work. An orchestrator spins up tool-use APIs, runs them for the duration of a task, and tears them down. The infrastructure needs to provision fast, scale from zero, and cost nothing when idle. That's exactly the Express model.

The platform is designed for MCP servers, tool-use endpoints, multi-step workflow APIs, and human-in-the-loop UIs that agents spin up dynamically. Scale-from-zero with subsecond cold starts means you're not paying for agent backends that aren't actively serving requests. And when a request does come in, the agent is ready almost instantly instead of waiting through a cold start.

Docker Compose for Agents: A Separate (and Complementary) Feature

Here's where a lot of early coverage got confused, and where I got it wrong in my first draft of this post. Docker Compose for Agents is not an Express feature. It deploys to standard ACA environments with workload profiles, not to Express.

Why? Because Compose for Agents supports GPU model serving, MCP gateway containers, sidecar processes, and multi-service stacks. All of those require capabilities that Express doesn't have (workload profiles, service discovery, sidecars). Different tool for a different job.

What Compose for Agents does is let you take the same compose.yml you use locally for development and deploy it directly to ACA. The CLI translates compose services into Container Apps resources automatically.

Here's what a compose file looks like for an agent stack:

services:
  my-agent-app:
    build: .
    ports:
      - "8080:8080"
    environment:
      - MCP_GATEWAY_URL=${MCP_GATEWAY_URL}

  mcp-gateway:
    image: docker/mcp-gateway
    x-azure-deployment:
      image: acateam.azurecr.io/preview-ai-compose/mcp-gateway:latest

models:
  gemma:
    model: ai/gemma3-qat
    x-azure-deployment:
      workloadProfiles:
        workloadProfileType: Consumption-GPU-NC8as-T4

The x-azure-deployment directive is the bridge between local and cloud. Docker ignores it locally. ACA uses it during deployment. Same file, both environments.

What the CLI creates behind the scenes:

Your agent app as a Container App with ingress
An MCP gateway running as its own Container App with managed identity, dynamically managing MCP tool containers
Model serving via Docker's model runner on serverless GPU

The MCP gateway handles stdio-to-SSE translation, so your MCP servers run as standard Container Apps without modification.

To deploy it:

# Install the preview CLI extension
az extension remove --name containerapp
az extension add --source "<preview-extension-url>" --yes

# Deploy your compose file to a standard ACA environment
az containerapp compose create \
  --compose-file-path compose.yml \
  --resource-group rg-my-agents \
  --environment my-standard-env

Notice that --environment flag. This deploys to a standard ACA environment, not Express. That's the distinction.

Where Each Fits in the Azure AI Stack

The Azure AI hosting landscape has gotten crowded. Here's how I think about the options as someone who's deployed on most of them:

Azure AI Foundry is for when you want managed model endpoints with built-in safety, content filtering, and enterprise governance. You're consuming models, not hosting infrastructure.

ACA Standard is for when you need GPU workloads (self-hosted Ollama, vLLM), microservices with Dapr, VNet isolation, or any enterprise feature that Express doesn't have yet. This is also where Docker Compose for Agents deploys.

ACA Express is for fast, cheap, stateless agent backends. Prototypes, MCP servers, tool-use APIs, webhook handlers, agent orchestrators that don't need GPU compute.

ACA Dynamic Sessions is for sandboxed execution of AI-generated code. Hyper-V isolated, millisecond provisioning, MCP-integrated.

Express isn't replacing anything. It's filling the gap for lightweight agent infrastructure that's too simple for standard ACA but too complex for a serverless function.

What's Missing in Express (For Now)

This is a public preview, and the supported feature list reflects that. The "No" column is long:

No secrets management (no Key Vault integration)
No managed identity at app runtime
No health probes
No custom domains or managed certificates
No VNet integration
No CORS, session affinity, or sidecar containers
No OpenTelemetry
No autoscaling rules (KEDA)
Region-limited to West Central US and East Asia

For production agent backends, these gaps matter. No managed identity means you're passing credentials through environment variables. No health probes means you're trusting the platform's defaults. No secrets means API keys sit in plain text config.

But for prototypes, internal tools, and agent backends in active development? These limitations are acceptable tradeoffs for the provisioning speed and cost model. And Microsoft is shipping features on what they describe as a "rapid cadence" through the preview period.

When to Use Which

If you're building a lightweight agent backend, an MCP server, or a tool-use API that handles HTTP requests and doesn't need GPU, go Express. You'll have a running endpoint in seconds with zero infrastructure decisions.

If you're building a full agent stack with model serving, an MCP gateway coordinating multiple tool containers, and GPU workloads, use Docker Compose for Agents on standard ACA. The compose file gives you local-to-cloud parity and the workload profiles give you the compute you need.

If you need both, use both. Express for the lightweight endpoints, standard ACA for the heavy lifting. They run on the same platform and can coexist in the same resource group.

BMAD Method + Claude Code: How I Actually Ship Projects with Spec-Driven AI Development

Brian Spann — Wed, 27 May 2026 02:29:17 +0000

I vibe-coded my way through three months of Claude Code projects before I admitted something was off. The code worked, mostly, but I kept losing hours to the same problem: Claude and I would drift from the original intent mid-session, and by session two or three, neither of us remembered why we'd made half the decisions in the codebase.

I'd been watching the BMAD Method since v3 introduced its orchestrator concept, but it felt like overhead I didn't need. Then v4 landed with a real architectural overhaul (NPM distribution, modular agents, multi-IDE support) and I gave it a real shot. It clicked almost immediately. I've spent years working on teams with PMs, architects, scrum masters, QA. The full SDLC cast. BMAD maps those same roles onto AI agents, so the workflow felt familiar instead of foreign. I wasn't learning a new process. I was running the one I already knew, just with different team members. That was roughly nine months ago. I don't build without it now.

BMAD in 60 Seconds

BMAD (Breakthrough Method for Agile AI-Driven Development) is an open-source framework that structures AI-assisted coding around specifications, role-based agents, and phased workflows. The spec is the source of truth. Code is the output.

As of v6, the project has 19+ specialized agents (PM, Architect, Scrum Master, Developer, QA, and others), 50+ named workflows, and a module system that hooks into Claude Code natively through skills, commands, and hooks. It's crossed 40,000 GitHub stars and the ecosystem has spawned several third-party Claude Code plugins.

Numbers aside, does it actually change how you work? For me, yes. Substantially.

How I Set It Up

I use BMAD across several active projects. Bridgely, CoinFolio, FiveCrowns, and Vela are all built this way. Different domains, same workflow.

On CLAUDE.md

A lot of BMAD guides will tell you to set up a CLAUDE.md file alongside it. I actually don't bother. BMAD's own agent configurations, skills, and workflow definitions carry enough context on their own. Adding a CLAUDE.md on top of that is redundant at best, and at worst you end up with conflicting instructions. Your CLAUDE.md says one thing, BMAD's agent config says another, and Claude Code picks whichever it sees last.

I keep project-level conventions (file naming, directory structure, don't-commit-secrets type rules) in BMAD's own config. One source of truth, not two.

The Actual Workflow

A feature goes through four phases. I'll walk through what this looks like on a real task, building a Dev.to publishing skill for my blog pipeline.

Spec first. I describe what I want. BMAD's PM agent writes a PRD. The Architect agent reviews it and produces a technical design. Both end up as markdown files in docs/specs/ that persist across sessions.

> "I need a skill that publishes blog posts to Dev.to via their API,
>  handles draft mode, and manages frontmatter validation."

The PM agent gives me user stories, acceptance criteria, and scope boundaries. The Architect maps that to file structure, dependencies, and integration points. No code yet. Just the blueprint.

Story breakdown. The Scrum Master agent splits the spec into implementable stories, each with clear done-criteria, each in its own file under docs/stories/. This is the part that replaced my old habit of writing one giant implementation prompt. Smaller chunks mean each piece is actually testable.

Implementation. Claude Code writes code against the spec and story files, not against a vague prompt I typed twenty minutes ago. The Dev agent pulls in the story file, the architecture doc, and project conventions from BMAD's config. Decisions trace back to a spec instead of disappearing into chat history.

Validation. The QA agent checks work against acceptance criteria, runs tests, and flags gaps. Vibe-coding skips this step entirely, which is exactly why vibe-coded projects accumulate the kind of debt they do.

What's Different Now

The context-drift problem I mentioned at the top? Gone. Spec files give Claude Code something persistent to anchor to, so I'm not re-explaining decisions from last Tuesday. Features that used to take two or three sessions now finish in one because the spec does the remembering.

The other shift was subtler. I used to treat "it runs" as done. Now done means the QA agent signed off against acceptance criteria. It sounds like a small distinction but it changes how much rework I do later. A lot less.

Refactoring got easier too. When you need to restructure something, having the original intent documented next to the implementation means you can tell Claude Code what the code was supposed to do, not just what it currently does.

I won't overstate it. The biggest improvement isn't raw speed. It's that I can predict what I'm going to get at the end of a session, because I defined it before I started.

Where It Falls Short

Context window pressure is real. On bigger projects, BMAD specs plus architecture docs plus story files eat context fast. I've gotten better at keeping specs concise, but there's a tension between "enough detail to be useful" and "not so much that Claude Code forgets the beginning by the time it reads the end."

Agent handoffs can be rough. The Architect agent sometimes makes assumptions that don't line up with what the PM agent specified. I've started adding explicit handoff checklists in my story files to catch this, but it's a manual workaround for what should probably be a tighter integration.

And for small stuff (a typo fix, a CSS tweak, a one-line config change) the full BMAD workflow is overkill. I skip it for anything that touches fewer than three files or doesn't involve a real design decision.

Trying It Out

The GitHub repo has install instructions. My advice: start with BMM Core (the base module) and don't install everything at once. Pick a real feature on a real project and spec it before you write any code.

The thing that took me longest to internalize is that the process matters more than the prompts. I spent months tweaking how I asked Claude Code to do things. BMAD shifted that energy toward defining what I wanted Claude Code to build, and the prompts mostly took care of themselves.

Spec-Driven Dev Is Bigger Than BMAD

BMAD isn't the only framework pushing this direction. Kiro, GSD, and RALPH-LOOP are all built on variations of the same thesis: AI-generated code is only as good as the structure you feed it.

BMAD works for me because it maps directly onto Claude Code's extension model. Skills, hooks, commands. It's not a wrapper around Claude Code. It's a playbook for the tools Claude Code already has.

Test Architect (TEA): AI-Driven Testing That Doesn't Rot (Part 5)

Brian Spann — Wed, 20 May 2026 01:23:33 +0000

Part 5 of the BMAD-Method series

We've covered the core framework, workflows, custom agents with BMad Builder, and the Creative Intelligence Suite. There's one piece of the BMAD ecosystem we haven't touched yet, and it solves the problem I hear about most from teams using AI for development: the tests are garbage.

Not "they don't run" garbage. They run fine. They pass. They look reasonable in a PR review. Then three sprints later, half of them are flaky, a quarter test implementation details instead of behavior, and nobody trusts the suite enough to block a deploy on it. The tests rotted — not because anyone wrote bad code, but because the AI that generated them had no testing strategy. It just wrote assertions that matched the current behavior.

TEA (Test Engineering Architect) is BMAD's answer to that problem. It's a module that brings the same structured, workflow-driven approach we use for product management and architecture to the testing side. Nine workflows covering everything from risk-based test planning to release gate decisions.

The Problem TEA Solves

Ask any AI coding tool to "write tests for this component" and you'll get tests. Lots of tests. They'll have descriptive names, reasonable assertions, and they'll pass on the first run. Ship it, right?

Here's what goes wrong. The AI doesn't know which parts of your system are high-risk and need deep coverage versus which parts are stable and need a smoke test. It doesn't know that your checkout flow handles real money and needs different test rigor than your settings page. It doesn't build fixtures that compose cleanly or follow network-first patterns that eliminate flakiness. It just generates test code that looks like test code.

TEA's thesis is that testing is an engineering discipline, not a code generation task. Before you write a single test, you should know what's risky, what the priorities are, and what "good enough" coverage looks like for this specific feature. TEA provides that structure through a knowledge base of testing patterns and a set of workflows that guide you from planning through execution to release decisions.

What TEA Actually Is

TEA is a BMAD module — you install it the same way you install any other BMAD component. It adds a specialized agent persona (Murat, the Test Architect) and nine workflows that cover the full testing lifecycle.

The workflows span BMAD's phases:

Phase 3 (Solutioning) — system-level test design, framework scaffolding, CI pipeline setup. This is where you answer "how do we test this system?" before anyone writes implementation code.

Phase 4 (Implementation) — per-epic test design, ATDD (writing failing tests before code), test automation, test review, and traceability. This is where tests get written, reviewed, and mapped to requirements.

Release Gate — NFR assessment and the trace workflow's gate decision (PASS / CONCERNS / FAIL / WAIVED). This is where you decide whether the build ships.

The Nine Workflows

Here's what each one does and when you'd use it:

Workflow	Trigger	What It Does
Teach Me Testing	`TMT`	Interactive 7-session learning path — fundamentals through advanced
Framework Setup	`TF`	Scaffolds Playwright or Cypress with config, fixtures, and sample structure
CI Pipeline	`CI`	Generates CI workflow with selective test scripts and secrets checklist
Test Design	`TD`	Risk-based test planning with P0–P3 prioritization
ATDD	`AT`	Generates failing acceptance tests before implementation (red phase TDD)
Automate	`TA`	Generates tests for existing features with fixture composition
Test Review	`RV`	Audits test quality against the knowledge base, scores 0–100
Trace	`TR`	Maps tests to requirements, generates coverage matrix, makes gate decisions
NFR Assessment	`NR`	Evaluates non-functional requirements — security, performance, reliability

Two workflows deserve special attention.

Test Design is the backbone. It produces a risk assessment using probability × impact scoring, then generates a prioritized test plan. P0 items are critical path — if these fail, users can't use the product. P3 items are edge cases that matter but won't block a release. This prioritization is what prevents the "generate 200 tests and hope for the best" approach. You know exactly where to invest testing effort.
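
To make the probability × impact idea concrete, here is a minimal Python sketch. The 1 to 3 scales, the score thresholds, and the names `Risk` and `priority_for` are assumptions chosen for illustration; they are not TEA's actual scoring rules or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    scenario: str
    probability: int  # 1 (unlikely) to 3 (likely), assumed scale
    impact: int       # 1 (minor) to 3 (critical), assumed scale

    @property
    def score(self) -> int:
        return self.probability * self.impact


def priority_for(risk: Risk) -> str:
    """Map a risk score (1-9) to a P0-P3 bucket using assumed thresholds."""
    if risk.score >= 9:
        return "P0"
    if risk.score >= 6:
        return "P1"
    if risk.score >= 3:
        return "P2"
    return "P3"


risks = [
    Risk("create and display todos", probability=3, impact=3),
    Risk("clear completed todos", probability=2, impact=2),
    Risk("long title truncation", probability=1, impact=2),
]

# Highest risk first, so testing effort goes to the top of the list.
plan = sorted(risks, key=lambda r: r.score, reverse=True)
for r in plan:
    print(priority_for(r), r.score, r.scenario)
```

The point of the sketch is the ordering, not the exact thresholds: once every scenario has a score, the decision about where to spend test effort is a sort, not a guess.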

Trace is the closer. It's a two-phase workflow: Phase 1 builds a traceability matrix mapping tests to requirements, and Phase 2 makes a gate decision. The gate isn't just "did the tests pass" — it evaluates coverage gaps, risk areas without tests, and NFR compliance. The output is a YAML artifact you can attach to your release process.
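
The two phases can be sketched in a few lines of Python. This is an illustration under stated assumptions: the decision rules (an uncovered P0 requirement fails the gate, any other open gap raises concerns, gaps that are all explicitly waived yield WAIVED), the dictionary shapes, and the function names are hypothetical, and the real workflow emits a YAML artifact rather than a Python dict.

```python
from collections import defaultdict


def build_matrix(requirements, tests):
    """Phase 1: map each requirement id to the tests that claim to cover it."""
    matrix = defaultdict(list)
    for test in tests:
        for req_id in test["covers"]:
            matrix[req_id].append(test)
    return {req["id"]: matrix.get(req["id"], []) for req in requirements}


def gate_decision(requirements, matrix, waived=frozenset()):
    """Phase 2: turn coverage gaps into a gate decision (assumed rules)."""
    # A requirement is a gap when no passing test covers it.
    gaps = [
        req for req in requirements
        if not any(t["passed"] for t in matrix[req["id"]])
    ]
    open_gaps = [g for g in gaps if g["id"] not in waived]
    if any(g["priority"] == "P0" for g in open_gaps):
        decision = "FAIL"
    elif open_gaps:
        decision = "CONCERNS"
    elif gaps:
        decision = "WAIVED"
    else:
        decision = "PASS"
    return {"decision": decision, "gaps": [g["id"] for g in gaps]}


requirements = [
    {"id": "REQ-1", "priority": "P0"},
    {"id": "REQ-2", "priority": "P2"},
]
tests = [
    {"name": "creates a todo", "covers": ["REQ-1"], "passed": True},
]

matrix = build_matrix(requirements, tests)
result = gate_decision(requirements, matrix)
print(result)
```

Note that the gate looks at requirements without passing tests, not just at the test run's exit code, which is the distinction the paragraph above is drawing.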

Getting Started: Zero to Passing Tests in 30 Minutes

TEA has five engagement models — you don't have to go all-in. Here's the fastest path.

Install

npx bmad-method install
# Select: Test Architect (TEA)

Load the Agent

In your AI coding tool (Claude Code, Cursor, Windsurf, etc.):

bmad-tea

This loads the TEA agent with its menu of workflows.

Scaffold Your Test Framework

framework

TEA asks about your stack (React? Node? What test runner?) and generates a production-ready Playwright or Cypress scaffold — config, directory structure, fixtures, .env.example, the works. Not a toy starter template. The generated structure follows TEA's knowledge base patterns for fixture architecture and network-first testing.

Create a Test Design

test-design

Tell TEA what you're testing and it produces a risk assessment with P0–P3 priorities. For a TodoMVC-style app, the output might flag "creating and displaying todos" as P0 (critical path) and "clearing completed todos" as P2 (medium value). Each priority level gets specific test scenarios.

Generate Tests

automate

Point it at your test design document and TEA generates tests that follow the priorities. P0 scenarios get thorough coverage. P3 scenarios get a smoke test. The generated code uses the fixture patterns from TEA's knowledge base — composable fixtures, network interception before navigation, explicit assertions instead of snapshot comparisons.

Run

npx playwright test

That's the TEA Lite path. You used three workflows (framework, test-design, automate), and you have a test suite that was designed before it was generated. The risk assessment stays with the project as documentation — when someone asks "why do we test X but not Y?" the test design document has the answer.

How TEA Fits Into Full BMAD Projects

If you're running the full BMAD workflow (Parts 1–4 of this series), TEA plugs into specific phases:

After your architect produces the architecture and ADRs (architecture decision records) in Phase 3, run test-design in system-level mode. This produces two documents: one for the architecture team (testability gaps, validation of architecturally significant requirements, or ASRs) and one for QA (test execution recipe, coverage plan, Sprint 0 setup). Both feed into the implementation-readiness gate.

During Phase 4, each epic gets its own test-design run. Then for each story: optionally run atdd to generate failing acceptance tests before development, run automate after the feature is built, and optionally run test-review to audit quality. The trace workflow refreshes the coverage matrix as tests accumulate.

At the release gate, trace Phase 2 evaluates everything and produces a PASS/CONCERNS/FAIL/WAIVED decision with evidence.

You don't have to use every workflow. Plenty of teams start with just automate and add the planning workflows as they see the value. TEA is designed to be adopted incrementally.

What Makes TEA's Tests Different

Three things separate TEA-generated tests from "just ask the AI to write tests":

Risk-based prioritization. Tests aren't generated uniformly. High-risk features get deep coverage. Low-risk features get a smoke test. This matches how experienced test architects actually think — you don't spend the same effort testing a payment flow and a color theme toggle.

Knowledge base patterns. TEA carries a knowledge base of 42 testing fragments covering fixture architecture, network-first patterns, step-file organization, and quality standards. Every generated test follows these patterns. The fixture architecture alone — pure function → fixture → composition — prevents the most common source of test rot: fixtures that are coupled to implementation details.
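
The pure function → fixture → composition layering is easier to see in code. TEA's knowledge base targets Playwright and Cypress; the sketch below uses plain Python context managers only to show the layering, and every name in it is hypothetical.

```python
from contextlib import contextmanager


def make_todo(title, done=False):
    """Pure function: builds data and touches nothing external."""
    return {"title": title, "done": done}


@contextmanager
def todo_store():
    """Fixture: owns setup and teardown of exactly one resource."""
    store = []
    try:
        yield store
    finally:
        store.clear()  # teardown runs even if the test body raises


@contextmanager
def seeded_store(*titles):
    """Composition: a richer fixture assembled from the two pieces above."""
    with todo_store() as store:
        store.extend(make_todo(t) for t in titles)
        yield store


with seeded_store("buy milk", "write tests") as store:
    titles = [todo["title"] for todo in store]
    print(titles)
```

Because `make_todo` knows nothing about storage and `todo_store` knows nothing about todo contents, a change to either one does not ripple into every test that uses the composed fixture.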

Network-first approach. Instead of using page.waitForTimeout(2000) or hoping the page loads fast enough, TEA's patterns intercept network calls before navigating. Tests wait for actual responses, not arbitrary delays. This is the single biggest factor in eliminating flakiness.
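
The ordering is what matters: register the wait first, trigger the action second, then wait for the real response. The asyncio sketch below simulates that ordering. `FakePage` is a hypothetical stand-in written for this example, not a Playwright or Cypress API, and the URL and payload are made up.

```python
import asyncio


class FakePage:
    """Stand-in for a browser page. NOT a real Playwright API."""

    def __init__(self):
        self._waiters = []

    def expect_response(self, url_part):
        # Register interest in a response before anything is triggered.
        future = asyncio.get_running_loop().create_future()
        self._waiters.append((url_part, future))
        return future

    async def goto(self, url):
        # Simulate the page firing an API call shortly after navigation.
        await asyncio.sleep(0.01)
        self._emit("/api/todos", {"status": 200, "items": ["buy milk"]})

    def _emit(self, url, payload):
        for url_part, future in self._waiters:
            if url_part in url and not future.done():
                future.set_result(payload)


async def load_todos(page):
    pending = page.expect_response("/api/todos")         # 1. register first
    await page.goto("/todos")                            # 2. then trigger
    return await asyncio.wait_for(pending, timeout=1.0)  # 3. wait for the real response


result = asyncio.run(load_todos(FakePage()))
print(result["status"], result["items"])
```

There is no fixed sleep in `load_todos`: it finishes as soon as the response arrives and fails loudly on timeout if it never does. A test built on an arbitrary delay can do neither.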

Enterprise and Brownfield Support

TEA handles more than greenfield projects.

For brownfield projects (existing codebase, existing tests), start with trace to baseline your current coverage. TEA maps what's tested and what's not, identifies regression hotspots, and focuses the test design on the areas where new work intersects with existing risk. You don't throw away your existing tests — you improve them incrementally.

For enterprise projects with compliance requirements, TEA's nfr-assess workflow captures security, performance, and reliability requirements early. The release gate produces audit-trail artifacts that map to SOC 2 and HIPAA evidence requirements.

Try It

The fastest way to see TEA in action:

npx bmad-method install  # Select TEA

Then in your AI coding tool: bmad-tea → framework → test-design → automate → npx playwright test.

Thirty minutes. Risk-based test plan plus passing tests. No test rot.

Full documentation: TEA Docs
GitHub: bmad-code-org/bmad-method-test-architecture-enterprise

Using TEA on a real project? I'd love to hear how the risk-based approach compares to your previous testing workflow — drop it in the comments.

Building a C# Agent with Microsoft Agent Framework and Ollama

Brian Spann — Tue, 21 Apr 2026 22:01:24 +0000

Part 3 of "Running LLMs & Agents on Azure Container Apps"

We've got Ollama running in Azure Container Apps with persistent storage and secure access. Now let's write an agent that talks to it.

Two weeks ago, Microsoft shipped Agent Framework 1.0 -- the production-ready successor to both Semantic Kernel and AutoGen. Same team, dramatically simpler API. If you've been building agents with Semantic Kernel's ChatCompletionAgent and Kernel objects, the new framework strips away most of that ceremony. You get an agent in three lines of code instead of fifteen.

I rewrote my Ollama agent code the week it shipped. This post walks through what that looks like.

Why Agent Framework Over Semantic Kernel

I used Semantic Kernel for everything up until this month. It's solid, and I still have projects running on it. But Agent Framework fixes the things that always bothered me.

In Semantic Kernel, every agent needs a Kernel instance. You build a kernel, configure providers, register plugins, then pass the kernel to the agent. It's a lot of plumbing for what amounts to "talk to this model and call these functions." Agent Framework collapses that into a single extension method. You take your chat client -- whatever provider -- and call .AsAIAgent(). Done.

Tool registration is the other big improvement. Semantic Kernel requires [KernelFunction] attributes on every method, a plugin class, and a kernel to register it on. Agent Framework uses AIFunctionFactory.Create() to wrap any C# method as a tool. You pass your tools directly when you create the agent. No attributes, no plugin classes, no kernel.

The underlying model abstraction is Microsoft.Extensions.AI, which means any provider that implements IChatClient works. Ollama, Azure OpenAI, OpenAI, Anthropic -- same agent code, different client. That portability is why I chose this stack for the series.

A note on Semantic Kernel: Microsoft will keep maintaining it and fixing bugs, but new features go into Agent Framework. If you're starting fresh, start here. If you have Semantic Kernel code running in production, there's no rush to migrate -- but new projects should use the new framework.

Project Setup

dotnet new console -n OllamaAgent
cd OllamaAgent
dotnet add package Microsoft.Agents.AI --prerelease
dotnet add package Microsoft.Extensions.AI.Ollama --prerelease

Two packages. Microsoft.Agents.AI is the agent framework itself. Microsoft.Extensions.AI.Ollama is the first-party Ollama connector built on the IChatClient abstraction. Both installs need the --prerelease flag because the NuGet packages shipped as 1.0.0-preview while the framework itself is GA. Microsoft does this sometimes. The APIs are stable.

Your First Agent

using System;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

var chatClient = new OllamaChatClient(
    new Uri("https://your-ollama.azurecontainerapps.io"),
    modelId: "llama3:8b");

AIAgent agent = chatClient.AsAIAgent(
    instructions: "You are a helpful assistant running on self-hosted infrastructure.");

Console.WriteLine(await agent.RunAsync("What is Azure Container Apps?"));

That's the whole thing. Three meaningful lines: create a client, make it an agent, run it. The endpoint is the internal FQDN of your Ollama container app from Part 2. If your code runs in the same ACA environment (which it will in Part 4), it reaches Ollama directly over the internal network.

Compare that to the Semantic Kernel equivalent, which needs Kernel.CreateBuilder(), AddOllamaChatCompletion(), builder.Build(), then kernel.InvokePromptAsync(). Same result, twice the ceremony.

Swappable Backends

This is the pattern I use on every project. Configure a local backend for development and a cloud backend for production, and a flag decides which one runs.

using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;
using Azure.AI.OpenAI;
using Azure;

AIAgent CreateAgent(bool useLocal = false)
{
    IChatClient client;

    if (useLocal)
    {
        client = new OllamaChatClient(
            new Uri(Environment.GetEnvironmentVariable("OLLAMA_URL")!),
            modelId: "llama3:8b");
    }
    else
    {
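        // GetChatClient returns the OpenAI SDK's ChatClient, which is not an
        // IChatClient. AsIChatClient() (from Microsoft.Extensions.AI.OpenAI)
        // adapts it so the assignment compiles.
        client = new AzureOpenAIClient(
            new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!),
            new AzureKeyCredential(
                Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!))
            .GetChatClient("gpt-4")
            .AsIChatClient();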
    }

    return client.AsAIAgent(
        instructions: "You are a helpful technical assistant.");
}

In development, useLocal is true. In production, it's false. Your agent instructions, tools, and orchestration stay identical. You're only changing the inference backend.

This pays off in ways beyond the obvious cost savings. You can run your full test suite against a local model in CI/CD without API charges. You can demo at a conference or customer site without depending on network connectivity. I've done both.

Multi-Turn Conversations

Agent Framework introduces sessions for managing conversation state. Each session tracks its own message history.

var agent = chatClient.AsAIAgent(
    instructions: "You are a technical advisor for Azure deployments.");

// Create a session for a multi-turn conversation
var session = await agent.CreateSessionAsync();

// First turn
var response1 = await agent.RunAsync(
    "I need to deploy a containerized ML model on Azure.", session);
Console.WriteLine(response1);

// Second turn -- the agent remembers the context
var response2 = await agent.RunAsync(
    "What about GPU support?", session);
Console.WriteLine(response2);

The session handles all the chat history management. In Semantic Kernel, you'd create a ChatHistory object, manually append messages, and pass it around. Here, the session does that behind the scenes. You can also serialize sessions to JSON for persistence, which is useful when you need conversations that survive container restarts.

Adding Tools (Function Calling)

Tools are where agents stop being chatbots and start doing useful work. Agent Framework makes tool registration dead simple compared to Semantic Kernel.

using System.ComponentModel;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

// Just a regular C# method -- no [KernelFunction] attribute needed
[Description("Gets the current weather for a location")]
static string GetWeather(string location)
{
    // In production, this calls a weather API
    return $"Weather in {location}: 72°F, Sunny";
}

[Description("Looks up an Azure resource's current status")]
static string CheckResourceStatus(string resourceName)
{
    return $"{resourceName}: Running, 0 errors in last 24h";
}

var agent = chatClient.AsAIAgent(
    instructions: "You are an operations assistant with access to weather and Azure monitoring tools.",
    tools: [
        AIFunctionFactory.Create(GetWeather),
        AIFunctionFactory.Create(CheckResourceStatus)
    ]);

var response = await agent.RunAsync("What's the weather in Seattle and is my ollama-prod app healthy?");
Console.WriteLine(response);

Notice what's missing: no plugin class, no kernel, no FunctionChoiceBehavior settings. You pass your tools as a list when you create the agent, and the framework handles the rest. The [Description] attribute is optional but I always include it -- it's what the LLM reads to decide whether to call the function. A good description is the difference between the model calling your function correctly and ignoring it entirely.

In Semantic Kernel, the same setup requires creating a plugin class with [KernelFunction] attributes, building a kernel, registering the plugin on the kernel, configuring FunctionChoiceBehavior.Auto() in execution settings, and then invoking. Agent Framework gets the same result with half the code and no framework-specific attributes on your business logic.

Function Calling with Local Models: What Actually Works

Function calling with self-hosted models is not as reliable as with GPT-4. It works, but you need to pick the right models. I've burned enough time on this to have opinions.

Llama 3.1 and later have solid function calling support. If you're on Llama 3 (without the .1), function calling will be flaky -- the model wasn't trained for tool use. This is the number one issue I see people hit.

Mistral and Mixtral handle tool use well. They're my go-to when you need function calling on Ollama at a smaller size than Llama 3.1 70B.

Qwen 2.5 is strong on structured output and function calling, especially the 7B and 14B sizes. It's become my default for agents that need reliable tool use on modest hardware.

Practical advice: write an integration test that sends a prompt requiring a function call and verifies the function actually fired. Takes five minutes, saves hours.

// Quick smoke test for function calling support
[Description("Returns the current UTC time")]
static string GetTime() => DateTime.UtcNow.ToString("o");

var testAgent = chatClient.AsAIAgent(
    instructions: "Use the GetTime tool to answer time questions.",
    tools: [AIFunctionFactory.Create(GetTime)]);

var result = await testAgent.RunAsync("What time is it?");
// If the response contains a real timestamp, function calling works
Console.WriteLine(result);

Run that against each model you're evaluating. If it returns something like "I don't have access to real-time information" instead of an actual timestamp, that model can't do tool use.
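
Reading the response text is a quick check, but a stricter test records whether the tool was invoked at all. The idea is language-neutral, so here is a minimal sketch in Python. The two `fake_model_*` functions are hypothetical placeholders for a real agent call, and `recorded` is an illustrative helper, not part of any framework.

```python
import functools
from datetime import datetime, timezone


def recorded(func):
    """Wrap a tool so a test can check whether it was actually invoked."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls.append((args, kwargs))
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper


@recorded
def get_time():
    """Returns the current UTC time."""
    return datetime.now(timezone.utc).isoformat()


def fake_model_with_tools(prompt, tools):
    """Placeholder for a model that supports tool use: it calls the tool."""
    return f"The time is {tools['get_time']()}."


def fake_model_without_tools(prompt, tools):
    """Placeholder for a model that ignores tools and answers from memory."""
    return "I don't have access to real-time information."


def tool_fired(model, tool):
    """Run one prompt and report whether the tool was invoked."""
    tool.calls.clear()
    model("What time is it?", {"get_time": tool})
    return len(tool.calls) > 0


print(tool_fired(fake_model_with_tools, get_time))
print(tool_fired(fake_model_without_tools, get_time))
```

Asserting on the call record instead of the wording avoids false negatives when a model calls the tool but paraphrases the timestamp in its answer.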

Smart Routing: Right Model for the Job

Once you have both backends available, you can route requests to the model that fits the task.

public class SmartRouter
{
    private readonly AIAgent _localAgent;
    private readonly AIAgent _cloudAgent;

    public SmartRouter(string ollamaUrl, string azureEndpoint, string azureKey)
    {
        var localClient = new OllamaChatClient(
            new Uri(ollamaUrl), modelId: "qwen2.5:14b");

        var cloudClient = new AzureOpenAIClient(
            new Uri(azureEndpoint),
            new AzureKeyCredential(azureKey))
            .GetChatClient("gpt-4");

        _localAgent = localClient.AsAIAgent(
            instructions: "You are a data processing assistant.");
        _cloudAgent = cloudClient.AsAIAgent(
            instructions: "You are an expert analyst and writer.");
    }

    public Task<AgentResponse> RouteAsync(string input, string taskType)
    {
        var agent = taskType switch
        {
            "classify" or "extract" or "summarize" => _localAgent,
            "reason" or "analyze" or "generate" => _cloudAgent,
            _ => _localAgent
        };

        return agent.RunAsync(input);
    }
}

Local models handle classification, extraction, and summarization almost as well as GPT-4 -- well enough for production. Where GPT-4 still pulls ahead is multi-step reasoning, complex code generation, and text that needs a specific tone. When most of your traffic falls into the first group, a routing layer like this can cut API costs by 60-80% without a noticeable quality drop.
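
The savings from routing follow directly from the traffic mix. Here is a back-of-the-envelope sketch in Python. Every number in it is invented for illustration, and treating local requests as free ignores what the Ollama container itself costs to run.

```python
def routed_cost(total_requests, local_share, cloud_cost_per_request,
                local_cost_per_request=0.0):
    """Return (baseline, routed, savings_fraction) for a given traffic mix."""
    baseline = total_requests * cloud_cost_per_request
    local = total_requests * local_share * local_cost_per_request
    cloud = total_requests * (1 - local_share) * cloud_cost_per_request
    routed = local + cloud
    return baseline, routed, 1 - routed / baseline


# Hypothetical month: 100k requests, 70% of them classify/extract/summarize.
baseline, routed, savings = routed_cost(
    total_requests=100_000,
    local_share=0.7,
    cloud_cost_per_request=0.01,
)
print(f"baseline=${baseline:.2f} routed=${routed:.2f} savings={savings:.0%}")
```

With local requests treated as free, the savings fraction equals the share of traffic routed locally, so the real figure depends on how much of your workload a local model can take and on what the local hosting costs.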

Complete Example: Document Triage Agent

Here's something closer to what I've built for real teams -- an agent that triages incoming documents, classifies them, extracts key fields, and routes them for review.

using System.ComponentModel;
using Microsoft.Agents.AI;
using Microsoft.Extensions.AI;

// Tool functions -- clean C# methods, no framework attributes required
[Description("Classifies a document as: invoice, contract, support-ticket, or other")]
static string ClassifyDocument(string content)
{
    // In production: fine-tuned classifier or pattern matching
    return "invoice";
}

[Description("Extracts vendor name, amount, and due date from an invoice")]
static string ExtractInvoiceFields(string content)
{
    return """{"vendor": "Contoso", "amount": 4250.00, "due": "2026-05-15"}""";
}

[Description("Routes a document to a review queue based on category and priority")]
static string RouteForReview(string category, string priority)
{
    return $"Routed {category} to {priority}-priority queue";
}

// Create the agent with Qwen 2.5 14B -- reliable tool use, runs on CPU
var chatClient = new OllamaChatClient(
    new Uri("https://your-ollama.azurecontainerapps.io"),
    modelId: "qwen2.5:14b");

var triageAgent = chatClient.AsAIAgent(
    instructions: """
        You are a document triage agent. When given a document:
        1. Classify its type
        2. If it's an invoice, extract the key fields
        3. Route it for review based on category and urgency
        Use the available tools for each step.
        """,
    tools: [
        AIFunctionFactory.Create(ClassifyDocument),
        AIFunctionFactory.Create(ExtractInvoiceFields),
        AIFunctionFactory.Create(RouteForReview)
    ]);

var session = await triageAgent.CreateSessionAsync();
var result = await triageAgent.RunAsync(
    "Process this document: Invoice from Contoso for $4,250 due May 15, 2026 for Azure consulting services.",
    session);

Console.WriteLine(result);

I'm using qwen2.5:14b because it chains multiple tool calls reliably -- classify, then extract, then route -- without dropping steps. It's small enough to run on CPU without painful latency. Llama 3 can't do this sequence consistently; Qwen 2.5 nails it.

This is a single-agent setup. In Part 4, we'll break this apart -- a classifier agent, an extraction agent, a routing agent -- each running as its own container on ACA, communicating through Dapr, with Dynamic Sessions for sandboxed code execution.

What Changed from Semantic Kernel (Quick Reference)

If you've been following this series and have Semantic Kernel code, here's what moves where:

Semantic Kernel	Agent Framework
`Kernel.CreateBuilder()`	`new OllamaChatClient(...)`
`builder.AddOllamaChatCompletion(...)`	(done in client constructor)
`kernel.InvokePromptAsync(...)`	`agent.RunAsync(...)`
`[KernelFunction]` attribute	`AIFunctionFactory.Create(method)`
`builder.Plugins.AddFromType<T>()`	`tools: [...]` parameter
`FunctionChoiceBehavior.Auto()`	(automatic -- no config needed)
`ChatHistory`	`AgentSession`
`Microsoft.SemanticKernel` namespace	`Microsoft.Agents.AI` + `Microsoft.Extensions.AI`

The Semantic Kernel packages still work. If you have production code on them, there's no fire to put out. But for new projects, Agent Framework is less code, less ceremony, and where Microsoft is putting new features.

Next Up

Part 4 is where the architecture gets interesting: multiple agents running as separate containers on ACA, passing messages through Dapr, with Azure Container Apps Dynamic Sessions for sandboxed code execution. We go from "one agent that triages documents" to "a team of agents that can research, code, and review."

Questions about migrating from Semantic Kernel or getting Ollama working with Agent Framework? Drop them in the comments -- I migrated a project last week and the gotchas are fresh.

Running Ollama on Azure Container Apps

Brian Spann — Sun, 19 Apr 2026 17:15:39 +0000

Part 2 of "Running LLMs & Agents on Azure Container Apps"

In Part 1, I made the case for why Azure Container Apps hits the sweet spot for self-hosted LLM inference. Now let's actually build it.

By the end of this post, you'll have Ollama running in Azure, serving Llama 3, with persistent model storage and a secure endpoint. The basic deployment takes about 20 minutes. The production hardening we'll add (persistent volumes, auth, GPU) takes it from a demo to something you'd actually run for a team.

A Quick Word on Ollama

If you haven't used Ollama before, the pitch is simple: it's the easiest way to run open-source LLMs. On your local machine, it's one command, ollama run llama3, and you've got a model running with an API endpoint.

The reason Ollama works so well for what we're building is the OpenAI-compatible API at /v1/chat/completions. Any code written against the OpenAI SDK, including frameworks built on it such as Semantic Kernel, works with Ollama after one change: swap the endpoint URL and you're done. That portability is why I chose Ollama for this series over vLLM or text-generation-inference.
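
Here is what "swap the endpoint URL" looks like at the HTTP level, as a minimal Python sketch using only the standard library. The `chat_request` helper and the placeholder API keys are assumptions for illustration. The sketch only builds the requests; it does not send them.

```python
import json
import urllib.request


def chat_request(base_url, model, prompt, api_key="ollama"):
    """Build an OpenAI-style chat completion request for a compatible server."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder value; substitute a real key for a hosted provider.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same function, different base URL. 11434 is Ollama's default port.
local = chat_request("http://localhost:11434", "llama3:8b", "Hello")
hosted = chat_request("https://api.openai.com", "gpt-4", "Hello",
                      api_key="YOUR_API_KEY")

print(local.full_url)
print(hosted.full_url)
# To send one: urllib.request.urlopen(local), then json.load() the response.
```

The request body is identical in both cases. Only the base URL, the model name, and the credential change.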

Step 1: Create the Environment

First, set up a resource group and an ACA environment. The environment is the shared boundary for your container apps: networking, Dapr configuration, and logging all live at this level.

az group create --name rg-ollama-demo --location eastus

az containerapp env create \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --location eastus

I'm using East US here because it has good availability for GPU workload profiles. If you're just doing CPU-only for development, any region works.

Step 2: Deploy Ollama

az containerapp create \
  --name ollama \
  --resource-group rg-ollama-demo \
  --environment ollama-env \
  --image ollama/ollama:latest \
  --target-port 11434 \
  --ingress internal \
  --cpu 4 \
  --memory 8Gi \
  --min-replicas 0 \
  --max-replicas 1

Two settings here that I want to call attention to.

--ingress internal means this endpoint is only accessible to other containers in the same ACA environment. I've seen people deploy Ollama with --ingress external in tutorials, and that's a real problem. An unauthenticated Ollama instance on the public internet means anyone who finds the URL can run arbitrary models on your hardware. You're handing out free compute (and free GPU time once you add a GPU profile). Start with internal ingress, and if you need external access later, add authentication first (I'll show you how below).

--min-replicas 0 enables scale-to-zero. When nobody's sending requests, ACA shuts down the container entirely and you stop paying. The first request after idle triggers a cold start: the container needs to spin up and (if models aren't persisted) re-download the model weights. We'll fix the cold start problem with persistent storage in a minute, but even with it, expect 15-30 seconds on the first request. That's fine for development. For production, you might want --min-replicas 1 to keep one instance warm.
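Clients that call a scale-to-zero app have to tolerate that slow first request. Here's a minimal Python sketch of one way to handle it, retrying with exponential backoff. The helper name, the attempt count, and the delays are illustrative assumptions, not something ACA or Ollama prescribes:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on network-style errors with exponential backoff.

    Hypothetical helper: attempts and base_delay are illustrative defaults.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            # urllib's URLError, ConnectionError, and TimeoutError are all
            # OSError subclasses, so this covers the usual cold-start failures.
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, 16s
    raise ValueError("attempts must be at least 1")
```

With the defaults, the waits add up to 30 seconds before the final attempt, which lines up with the cold start window described above. Tune both numbers to what you measure on your own deployment.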

Step 3: Pull a Model

With internal ingress, you can't hit the endpoint directly from your local machine. You need to either exec into the container or temporarily switch to external ingress to pull your first model.

# Get the internal FQDN
OLLAMA_URL=$(az containerapp show \
  --name ollama \
  --resource-group rg-ollama-demo \
  --query "properties.configuration.ingress.fqdn" -o tsv)

# From another container in the same environment, or temporarily with external ingress:
curl -X POST "https://$OLLAMA_URL/api/pull" \
  -d '{"name": "llama3:8b"}'

Practical tip: If you're just getting started, temporarily flip to --ingress external, pull your model, then flip back to internal. The endpoint is only exposed for as long as the pull takes, and it's much simpler than setting up a jump box. For production, use the pre-baked image approach I cover later in this post. It avoids runtime downloads entirely.

Step 4: Test It

curl "https://$OLLAMA_URL/api/generate" \
  -d '{"model": "llama3:8b", "prompt": "Hello!", "stream": false}'

You should get back a JSON response with the model's reply. If you do, you've got a self-hosted LLM running in Azure.

The OpenAI-compatible endpoint is what we'll actually use in code:

curl "https://$OLLAMA_URL/v1/chat/completions" \
  -d '{"model": "llama3:8b", "messages": [{"role": "user", "content": "Hello"}]}'

This is the endpoint that Semantic Kernel, LangChain, and anything else built against the OpenAI API will talk to. We'll wire it up in Part 3.
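For reference, here's the same call as a minimal Python sketch using only the standard library. The helper names and the hostname in the usage comment are placeholders I made up; the request and response shapes follow the OpenAI chat completions format that the curl example above uses:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to an OpenAI-compatible /v1/chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return response_json["choices"][0]["message"]["content"]


# Usage (needs network access to the endpoint; the hostname is a placeholder):
# req = build_chat_request("https://ollama.internal.example", "llama3:8b", "Hello")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(extract_reply(json.load(resp)))
```

Because only the base URL and model name are parameters, pointing the same code at a different OpenAI-compatible backend is a configuration change rather than a code change.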

Persistent Model Storage

Here's a gotcha that bites everyone the first time: when your container scales to zero and back up, it loses everything in ephemeral storage. That includes your downloaded models. Llama 3 8B is about 4.7 GB. Re-downloading it on every cold start means your first request takes minutes instead of seconds, and you're paying for compute the whole time the download runs.

The fix is to mount an Azure Files share so models survive container restarts.

# Create a storage account
az storage account create \
  --name stollamademo \
  --resource-group rg-ollama-demo \
  --location eastus \
  --sku Standard_LRS

# Create a file share
az storage share create \
  --name ollama-models \
  --account-name stollamademo

# Get the storage account key
STORAGE_KEY=$(az storage account keys list \
  --account-name stollamademo \
  --resource-group rg-ollama-demo \
  --query "[0].value" -o tsv)

# Register the storage with your ACA environment
az containerapp env storage set \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --storage-name ollama-storage \
  --azure-file-account-name stollamademo \
  --azure-file-account-key $STORAGE_KEY \
  --azure-file-share-name ollama-models \
  --access-mode ReadWrite

Now you need to mount that storage into the container. ACA requires a YAML file for volume mounts because there's no pure CLI flag for this. Create volume-mount.yaml:

properties:
  template:
    volumes:
      - name: ollama-models
        storageName: ollama-storage
        storageType: AzureFile
    containers:
      - image: ollama/ollama:latest
        name: ollama
        resources:
          cpu: 4
          memory: 8Gi
        env:
          - name: OLLAMA_MODELS
            value: /models
        volumeMounts:
          - volumeName: ollama-models
            mountPath: /models

Apply it:

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --yaml volume-mount.yaml

The OLLAMA_MODELS environment variable tells Ollama where to store and look for model files. With this in place, pull the model once more so it lands on the share. After that, a cold start still takes a few seconds (the container itself needs to start), but the model weights are already there on the mounted share, so nothing gets re-downloaded.
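A quick way to confirm the share is doing its job is to list the models after a restart. Ollama's /api/tags endpoint returns the locally available models; the sketch below only parses that response. The function name and the sample payload are my own illustration, with made-up values:

```python
import json


def model_is_present(tags_response: dict, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags response body."""
    names = {entry.get("name") for entry in tags_response.get("models", [])}
    return model in names


# Sample payload shaped like the /api/tags response (values are made up).
sample = json.loads('{"models": [{"name": "llama3:8b", "size": 4700000000}]}')
print(model_is_present(sample, "llama3:8b"))  # True
print(model_is_present(sample, "phi3:mini"))  # False
```

If the model is missing after a scale-to-zero cycle, the mount path or the OLLAMA_MODELS value is the first thing to check.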

Adding GPU Support

Everything we've done so far uses CPU-only compute. For development and testing with 7-8B parameter models, CPU is fine. Llama 3 8B generates tokens at a usable speed on 4 cores with 8 GB of RAM. Not fast, but fast enough to test your agent logic without waiting.

When you need production-level latency or you're working with larger models (70B+), you'll want a GPU. ACA supports this through workload profiles:

az containerapp env workload-profile add \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --workload-profile-name gpu \
  --workload-profile-type NC24-A100 \
  --min-nodes 0 \
  --max-nodes 1

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --workload-profile-name gpu

A word of caution on cost: A100 GPUs run about $2/hour on ACA. If you leave --min-nodes 1 (always on), that's roughly $1,440/month. With --min-nodes 0, you only pay when there's active inference traffic, but you take a cold start hit when the GPU node needs to spin up. For most development work, stick with CPU. Add GPU when you've validated your agent logic and need to optimize for latency.
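The arithmetic behind that figure, as a small Python sketch. The $2/hour rate comes from the paragraph above; the 60 active hours in the second case is a hypothetical usage level, picked only to show how much scale-to-zero changes the bill:

```python
HOURS_PER_MONTH = 24 * 30  # 720, using a 30-day month


def monthly_cost(hourly_rate: float, active_hours: float) -> float:
    """Cost of a node that is billed only for the hours it is running."""
    return hourly_rate * active_hours


always_on = monthly_cost(2.0, HOURS_PER_MONTH)  # --min-nodes 1
part_time = monthly_cost(2.0, 60)               # hypothetical: 60 active hours
print(always_on)  # 1440.0
print(part_time)  # 120.0
```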

Securing External Access

At some point you'll need external access. Maybe it's a frontend app, a mobile client, or a teammate who wants to test from their machine. Here are three approaches, in order of complexity.

Option 1: ACA Built-in Authentication

ACA has a built-in auth feature that can gate access behind Microsoft Entra ID (formerly Azure AD), Google, or other identity providers:

az containerapp auth update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --enabled true \
  --unauthenticated-client-action RedirectToLoginPage

This works well for interactive users (browser-based access), but it's clunky for programmatic API calls.

Option 2: API Key via Reverse Proxy

For programmatic access, deploy a lightweight proxy container in front of Ollama that validates a custom X-API-Key header before forwarding requests. This is what I typically set up for team development environments. Everyone gets an API key, and you can rotate or revoke keys without touching the Ollama deployment.

# Switch to external ingress
az containerapp ingress update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --type external

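Then add a sidecar or separate container app that acts as your auth gateway. If you go with a separate container app, give the gateway the external ingress and leave Ollama itself on internal ingress; otherwise the unauthenticated endpoint is still publicly reachable and callers can bypass the gateway.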
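To make the gateway idea concrete, here's a minimal Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the upstream URL, the port, and the hard-coded key set (in practice you'd load keys from a secret store). It buffers the whole upstream response, so it doesn't support streaming, and it doesn't handle upstream errors:

```python
import hmac
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumptions for this sketch: placeholder upstream URL and key set.
UPSTREAM = "http://localhost:11434"
VALID_KEYS = {"replace-with-a-real-secret"}


def is_authorized(presented_key, valid_keys) -> bool:
    """Constant-time comparison of the presented key against each valid key."""
    if not presented_key:
        return False
    presented = presented_key.encode("utf-8")
    return any(hmac.compare_digest(presented, key.encode("utf-8")) for key in valid_keys)


class GatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Reject the request before it ever reaches Ollama.
        if not is_authorized(self.headers.get("X-API-Key"), VALID_KEYS):
            self.send_error(401, "Missing or invalid API key")
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        upstream_req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Forward the request and relay the buffered response to the caller.
        with urllib.request.urlopen(upstream_req, timeout=300) as upstream:
            payload = upstream.read()
            self.send_response(upstream.status)
            self.send_header(
                "Content-Type",
                upstream.headers.get("Content-Type", "application/json"),
            )
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)


# To run the gateway (blocks forever):
# HTTPServer(("0.0.0.0", 8080), GatewayHandler).serve_forever()
```

The key check uses hmac.compare_digest so that comparison time doesn't leak how much of a key matched. Rotating or revoking a key is then just a change to the key set, with no change to the Ollama deployment.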

Option 3: VNet Integration

For enterprise scenarios where you need network-level isolation, keep ingress internal and access Ollama through VNet peering, a VPN gateway, or ExpressRoute. This is the option I recommend for production workloads handling sensitive data.

az containerapp env create \
  --name ollama-env \
  --resource-group rg-ollama-demo \
  --infrastructure-subnet-resource-id /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}

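You're putting your entire ACA environment inside your corporate network. External access goes through whatever VPN or gateway you already have. Note that the infrastructure subnet has to be specified when the environment is created, so this means creating a new environment rather than updating the one from Step 1.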

Pre-Baking Models into the Image

For production deployments, I recommend avoiding runtime model downloads entirely. Build a custom Docker image that includes the model weights:

FROM ollama/ollama:latest

# Pre-download the model during build
RUN ollama serve & sleep 5 && ollama pull llama3:8b && pkill ollama

Build and push to your Azure Container Registry:

docker build -t myregistry.azurecr.io/ollama-llama3:latest .
docker push myregistry.azurecr.io/ollama-llama3:latest

az containerapp update \
  --name ollama \
  --resource-group rg-ollama-demo \
  --image myregistry.azurecr.io/ollama-llama3:latest

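The downside is image size. You're looking at 5 GB+ for even a small model. But you get deterministic deployments: every release gets exactly the model version you tested against, and cold starts don't depend on network speed to a model registry. Combined with persistent storage (which acts as a cache for any additional models you pull at runtime), this is the fastest and most reliable startup configuration. One thing to watch when you combine the two: Ollama only looks in the directory OLLAMA_MODELS points to, so if that's still set to the mounted share, the weights baked into the image won't be picked up unless they live in the same location.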

Practical Cost Tips

A few things I've learned from running this setup across different projects.

Scale-to-zero is your biggest lever. If your workload is bursty (heavy during business hours, quiet at night), the difference between always-on and scale-to-zero can be 3-4x on your monthly bill. The cold start penalty is real, but for many use cases it's worth it.

I've seen teams default to GPU instances "just in case" and spend 10x more than they needed to. Llama 3 8B runs fine on 4 cores and 8 GB of RAM. Start with CPU, measure your token generation speed, and only upgrade if it's actually too slow for your use case.

Don't overlook smaller models either. Phi-3 Mini and Qwen 2.5 3B handle classification, extraction, and structured output at a fraction of the compute cost. Not everything needs a 70B model.

And persistent storage is cheap insurance. An Azure Files share costs pennies per GB per month. Re-downloading models on every cold start costs more in billed compute time and startup latency than the storage ever will.

Next Up

In Part 3, we'll build a C# agent with Semantic Kernel that talks to this Ollama endpoint, with swappable backends so you can use self-hosted models for development and Azure OpenAI for production without changing your code.

Questions about the deployment? Hit me in the comments. I've probably hit the same wall you're about to hit.

Why Azure Container Apps for AI Workloads

Brian Spann — Fri, 17 Apr 2026 23:25:02 +0000

Part 1 of "Running LLMs & Agents on Azure Container Apps"

I spend a lot of time helping teams at Microsoft figure out where to run their AI workloads. The conversation usually starts the same way: "We want to use LLMs, but we don't want to send our data to OpenAI, and we don't want to manage Kubernetes." That's a completely reasonable position. It's exactly the gap Azure Container Apps fills.

In this series, I'll walk you through deploying Ollama on ACA, building C# agents with Semantic Kernel, wiring up multi-agent architectures with Dapr, and hardening the whole thing for production. But first, let's talk about why ACA is the right platform for this kind of work, and when it isn't.

The Problem with Running Your Own LLMs

The moment you decide to self-host a model, you've signed up for a set of infrastructure decisions that most application developers aren't used to making. Where does the model live? How do you serve it? When nobody's using it at 2 AM, are you still paying for a GPU?

In my experience, teams end up in one of four places:

Running on a laptop works great for hacking on a Saturday afternoon, but it's a dead end for anything beyond that. You can't share it with a team, you can't scale it, and you can't keep it running when you close your lid.

A VM with a GPU solves the sharing problem but creates a new one: you're paying 24/7 whether the model is handling requests or sitting idle. I've seen teams burn through hundreds of dollars a month on GPU VMs that were doing real work less than 10% of the time.

Kubernetes (AKS) gives you everything: autoscaling, GPU scheduling, health checks, the works. But now you need someone who knows how to operate a Kubernetes cluster. For a team building AI features, not a platform team, that's a big ask. Projects stall for weeks while developers learn about node pools, taints, and GPU device plugins.

Azure Container Apps sits in the gap between "just give me a VM" and "I guess I need Kubernetes." You deploy a Docker image, ACA handles scaling, and you don't touch kubectl. It's built on Kubernetes under the hood, but that's an implementation detail you never have to think about.

What Azure Container Apps Actually Gives You

If you haven't worked with ACA before, the short version is: it's serverless containers. You give it a Docker image and tell it what port to listen on. ACA provisions the infrastructure, handles TLS, and scales your containers based on demand. That includes scaling to zero when there's no traffic, which means no cost when nobody's using your model.

What makes it interesting for AI workloads specifically is the combination of a few features that came together over the last year or so. Workload profiles now include GPU-enabled options, so you can run inference on actual GPU hardware without managing nodes. Dapr integration is built in, which matters when you start running multiple agents that need to talk to each other (we'll get deep into this in Part 4). And KEDA-based autoscaling means you can scale on custom metrics beyond HTTP concurrency, like queue depth or even custom telemetry from your model.

Think of it as the serverless experience of Azure Functions, but without being locked into the Functions programming model. You bring any container, and ACA runs it.

How ACA Compares to the Alternatives

Let me break this down the way I explain it to teams I work with.

Azure OpenAI Service

Azure OpenAI is the easiest path to production. Setup takes minutes, you get access to GPT-4 and the latest models, and Microsoft handles all the infrastructure. Your data stays within your Azure tenant, which satisfies most compliance requirements.

Where it gets expensive is token volume. Azure OpenAI charges per token, and that math gets uncomfortable fast. A chatbot processing a million tokens a day at GPT-4 prices will run you around $600/month. That's fine for a prototype or a low-volume internal tool, but high-traffic production apps feel it.

You also give up control. You get fine-tuning, but you don't get to run arbitrary open-source models, and you can't customize the serving infrastructure. If you need to run Llama 3 or Mistral or a fine-tuned domain model, Azure OpenAI isn't the answer.

Azure Kubernetes Service (AKS)

AKS is the power tool. You get full control over scheduling, GPU node pools, custom operators like KubeRay, and the entire CNCF ecosystem. If you're running large-scale inference with a dedicated ML ops team, AKS is probably the right choice.

But "full control" comes with "full responsibility." You're managing node pools, configuring GPU drivers, writing Helm charts, and debugging pod scheduling issues. One team I worked with spent more time operating their cluster than building their actual AI application. If you already have Kubernetes expertise on the team, great. Most teams building AI features don't, and for them it's a distraction.

Azure Container Apps

ACA gives you most of what AKS offers for inference workloads (containerized deployments, autoscaling, GPU support, health probes) without the operational overhead. Setup takes minutes instead of hours. You don't need to know what a DaemonSet is.

The catch is flexibility. ACA has fewer knobs than raw Kubernetes. GPU workload profiles are still relatively new, and you're limited to the instance types ACA supports. You can't install custom operators or run training workloads. But for inference, which is what most application teams actually need, it covers the use case well.

When ACA Is the Right Call

I've found ACA works best in a few specific scenarios, and I want to be honest about where it doesn't.

The strongest use case is development and iteration. When you're building an agent and experimenting with different models, the last thing you want is to burn through API credits every time you test a prompt. Deploy Ollama to ACA, point your code at it, and iterate as much as you want. Scale to zero means you're only paying when you're actually working.

It also makes sense for cost-sensitive production. If you've done the math and your token volume is high enough that self-hosting is cheaper than API calls (I'll show you exactly where that crossover is in a minute), ACA lets you capture those savings without the operational burden of Kubernetes.

Data sovereignty comes up a lot in the government and financial services teams I work with. Some workloads simply can't send data to a third-party API, even one hosted in Azure. Self-hosting on ACA means your data never leaves your subscription, your VNet, or your region. And increasingly, I'm seeing teams run hybrid architectures where a cheap local model handles classification, summarization, and simple tasks while complex reasoning gets routed to Azure OpenAI. ACA makes it easy to run the local piece alongside the rest of your application.
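The hybrid routing idea is simple enough to sketch in a few lines of Python. The endpoints, model names, and the set of "simple" task types below are hypothetical placeholders; a real router would probably also consider prompt length, latency budgets, or data sensitivity:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str
    model: str


# Hypothetical endpoints and model names, for illustration only.
LOCAL = Backend("ollama-on-aca", "https://ollama.internal.example", "llama3:8b")
HOSTED = Backend("azure-openai", "https://my-resource.openai.azure.com", "gpt-4")

# Assumption: these task labels are assigned by the calling application.
SIMPLE_TASKS = {"classification", "summarization"}


def pick_backend(task_type: str) -> Backend:
    """Send cheap, simple tasks to the local model and everything else to the hosted one."""
    return LOCAL if task_type in SIMPLE_TASKS else HOSTED


print(pick_backend("classification").name)   # ollama-on-aca
print(pick_backend("complex-reasoning").name)  # azure-openai
```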

Where ACA is not the right call: training workloads, multi-GPU inference (70B+ parameter models that need model parallelism across GPUs), or situations where you need fine-grained control over GPU scheduling. For those, you want AKS or Azure ML.

The Cost Crossover: Self-Hosted vs. API

This is the question everyone asks, so let me lay it out with real numbers.

At low token volumes, say 100K tokens per day, the math is roughly a wash. Azure OpenAI GPT-4 costs about $60/month at that volume. A self-hosted Llama 3 instance on ACA with CPU-only compute costs about the same, and GPT-4 is a better model, so the API wins on quality.

The crossover happens around 200-300K tokens per day. Above that, self-hosting costs stay relatively flat (you're paying for compute time, not tokens), while API costs scale linearly with usage. At 1M tokens/day, Azure OpenAI runs about $600/month. The same workload self-hosted on ACA? Still around $60/month, maybe $120 if you're on a GPU profile.

That's a 5-10x difference, and it only gets wider at higher volumes.
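Here's the comparison as a small Python sketch. The per-token rate isn't quoted directly in the post; it's the rate implied by $600/month at 1M tokens/day over a 30-day month, which works out to $20 per million tokens. The flat self-hosted costs are the $60 and $120 figures from above. Treat it as a simplified model, not a pricing calculator:

```python
DAYS_PER_MONTH = 30
API_RATE_PER_MILLION = 20.0  # implied by $600/month at 1M tokens/day


def api_monthly_cost(tokens_per_day: int) -> float:
    """Linear per-token cost over a 30-day month."""
    return tokens_per_day * DAYS_PER_MONTH / 1_000_000 * API_RATE_PER_MILLION


def break_even_tokens_per_day(self_hosted_monthly: float) -> float:
    """Daily token volume at which the API bill equals a flat self-hosted cost."""
    return self_hosted_monthly / API_RATE_PER_MILLION * 1_000_000 / DAYS_PER_MONTH


print(api_monthly_cost(100_000))        # 60.0
print(api_monthly_cost(1_000_000))      # 600.0
print(break_even_tokens_per_day(60))    # 100000.0 (CPU-only)
print(break_even_tokens_per_day(120))   # 200000.0 (GPU profile)
```

With these simplified inputs the break-even lands between 100K and 200K tokens per day, at or just below the range quoted above. Real bills include things this sketch ignores, such as different input and output token prices and storage, so plug in your own numbers before deciding.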

The caveat (and I always flag this) is that you're comparing different models. Llama 3 is good, but it's not GPT-4. For many tasks (classification, extraction, summarization, structured output), the quality gap is negligible. For complex multi-step reasoning, GPT-4 still has an edge. The hybrid approach I mentioned earlier lets you get the best of both.

Note: These cost estimates are based on Azure consumption pricing as of early 2026. Your actual costs will vary based on model size, workload profile, region, and usage patterns. Always check the Azure pricing calculator for current rates.

What We're Building in This Series

Over the next four posts, we'll go from zero to a production-ready, multi-agent AI system running entirely on Azure Container Apps.

We'll start in Part 2 by getting Ollama deployed and serving models, with persistent storage so you're not re-downloading 5 GB on every cold start, and proper security so you don't accidentally expose an unauthenticated GPU endpoint to the internet. From there, Part 3 connects Semantic Kernel to your Ollama instance and builds a C# agent with function calling, the kind that can actually do things, not just chat. Part 4 is where it starts to feel like a real system: multiple specialized agents communicating through Dapr, with Dynamic Sessions for safe code execution. Finally, Part 5 hardens everything for production: health probes that account for slow model loading, autoscaling that makes sense for LLM workloads, monitoring, and cost controls.

I'll include working code for everything, and I'll call out the gotchas I've hit so you don't have to discover them yourself.

Next up: Deploying Ollama to Azure Container Apps, with persistent model storage and proper security.

Creative Intelligence Suite: Innovation and Design Thinking for Developers (Part 4)

Brian Spann — Mon, 13 Apr 2026 19:44:53 +0000

Throughout this series, we've explored BMAD's core framework, workflows, and custom agent building. Now we tackle the fuzzy front-end of development: where ideas are born.

What is the Creative Intelligence Suite?

The Creative Intelligence Suite (CIS) extends BMAD with tools for structured creativity. It's designed for those moments when you:

Don't know what to build yet
Are stuck on a problem with no obvious solution
Need to think beyond conventional approaches
Want to validate ideas before investing in implementation

"Think differently." — CIS tagline

When to Use CIS

| Situation | What CIS Offers |
|-----------|-----------------|
| Stuck on a problem | Systematic diagnosis and root cause analysis |
| Need fresh ideas | Structured brainstorming with proven techniques |
| Designing for users | Human-centered design thinking process |
| Finding market gaps | Innovation and disruption analysis |
| Telling your product story | Narrative crafting for products and features |
| Preparing a pitch | Persuasive presentation structuring |

Installation

CIS is installed as a module during BMAD setup:

npx bmad-method install

Select Creative Intelligence Suite from the modules list.

The CIS Agent Lineup

🚀 Innovation Strategist

Identifies disruption opportunities and business model innovation.

You: /cis-innovation

Innovation Strategist: Let's identify opportunities for disruption.

                       What market or industry are you exploring?

You: Developer productivity tools

Strategist: Interesting space! Let me guide you through the analysis:

            🎯 MARKET ANALYSIS
            Current pain points in developer tools:
            - Tool fragmentation
            - Context switching costs
            - Configuration complexity

            📊 DISRUPTION OPPORTUNITIES
            1. AI-native development environments
            2. Zero-config tooling
            3. Intent-based programming

            Which angle interests you most?

🎨 Design Thinking Coach

Guides you through human-centered design with empathy, ideation, and prototyping.

You: /cis-design-thinking

Design Coach: Welcome to the design thinking journey!

              We'll work through 5 phases:
              1. 🎯 Empathize - Understand your users
              2. 🔍 Define - Frame the problem
              3. 💡 Ideate - Generate solutions
              4. 🔧 Prototype - Build to learn
              5. ✅ Test - Validate assumptions

              Let's start with empathy.
              Who are your target users?

You: Senior developers who mentor junior team members

Coach: Great! Let's understand their world:

       🎤 INTERVIEW QUESTIONS
       - What's most frustrating about mentoring?
       - How do you currently share knowledge?
       - What takes up most of your mentoring time?

       🔍 OBSERVATION PROMPTS
       - Watch a mentoring session
       - Review their code review comments
       - Analyze their documentation patterns

       What insights have you gathered?

🧠 Brainstorming Coach

Facilitates creative ideation sessions with proven techniques.

You: /cis-brainstorm

Brainstorming Coach: Ready to generate ideas!

                     What would you like to brainstorm about?

You: Ways to improve user onboarding

Coach: Let's use the SCAMPER technique!

       📝 SCAMPER ANALYSIS

       🔄 SUBSTITUTE
       - What if signup required no password?
       - Replace forms with conversation?

       🔗 COMBINE  
       - Merge onboarding with first use?
       - Combine tutorial with real tasks?

       🔧 ADAPT
       - How does gaming handle new players?
       - What can we learn from social apps?

       ✨ MODIFY
       - Make it 10x shorter?
       - Make it 10x more personal?

       🎯 PUT TO OTHER USE
       - Use onboarding for ongoing education?
       - Use it for feature discovery?

       ❌ ELIMINATE
       - Remove ALL steps - what happens?
       - No onboarding - users figure it out?

       🔃 REVERSE
       - End with signup, start with value?
       - User teaches the system?

       Which angle resonates? Let's dive deeper.

🔍 Problem Solver

Systematic problem diagnosis and root cause analysis.

You: /cis-problem-solve

Problem Solver: Let's diagnose this systematically.

                What problem are you facing?

You: Our API response times are inconsistent

Solver: I'll guide you through root cause analysis.

        📊 SYMPTOM MAPPING
        - When did this start?
        - What changed recently?
        - Is it all endpoints or specific ones?

        🔍 5 WHYS ANALYSIS
        Why are response times inconsistent?
        → [Your answer leads to next why]

        🐟 FISHBONE DIAGRAM
        Potential causes by category:

        PEOPLE: Team changes? Knowledge gaps?
        PROCESS: Deployment changes? Monitoring gaps?
        TECHNOLOGY: Dependencies? Infrastructure?
        ENVIRONMENT: Traffic patterns? External services?

        Let's explore each branch...

💡 Creative Problem Solver

Generates creative solutions using lateral thinking.

You: /cis-creative-solve

Creative Solver: Let's think sideways!

                 What conventional problem are you trying to solve?

You: Users forget to complete their profiles

Solver: 🔀 LATERAL THINKING TECHNIQUES

        REVERSAL: What if incomplete profiles were the goal?
        → Maybe minimal profiles are faster to scan?
        → Progressive disclosure as relationships form?

        ANALOGY: How does nature handle incomplete information?
        → Trees grow toward light - profiles grow toward value
        → Caterpillar → butterfly - staged transformation

        RANDOM ENTRY: [Picking random concept: "music"]
        → Playlists build over time through use
        → Profile as a "greatest hits" that evolves

        PROVOCATION: "Profiles should be invisible"
        → What if actions spoke louder than descriptions?
        → Infer profile from behavior?

        Which direction excites you?

📖 Storyteller

Crafts compelling narratives for products and features.

You: /cis-storytelling

Storyteller: Every great product has a story.

             What product or feature needs a narrative?

You: Our new AI code review feature

Storyteller: Let's build your product story.

             📖 STORY STRUCTURE

             THE HERO (Your User):
             A developer drowning in pull requests...

             THE STRUGGLE:
             Hours spent on tedious reviews, missing
             critical issues while catching style nits...

             THE DISCOVERY:
             An AI assistant that understands context,
             catches bugs, and learns team preferences...

             THE TRANSFORMATION:
             From code reviewer to code mentor,
             focusing on architecture and growth...

             THE NEW WORLD:
             Faster releases, fewer bugs, happier teams...

             Want me to develop any section further?

🎤 Presentation Master

Structures and delivers persuasive presentations.

You: /cis-presentation

Presentation Master: Let's build a compelling deck.

                     What's your presentation about?

You: Pitching our startup to investors

Master: 🎯 INVESTOR PITCH STRUCTURE

        1. HOOK (30 sec)
           One sentence that makes them lean in

        2. PROBLEM (1 min)
           Pain so real they feel it

        3. SOLUTION (2 min)
           Your unique insight

        4. TRACTION (1 min)
           Proof it's working

        5. MARKET (1 min)
           Why this can be huge

        6. TEAM (30 sec)
           Why you'll win

        7. ASK (30 sec)
           Specific and confident

        Let's craft your hook first.
        What's the one sentence?

CIS Workflows in Action

Complete Example: Ideating a New Feature

Scenario: Your SaaS product needs a new feature, but you're not sure what would have the most impact.

You: /cis-brainstorm
     We need a new feature for our project management tool.
     Users are asking for "something to help with planning."

Coach: Let's explore what "help with planning" means to users.

       DIVERGENT PHASE 🌊
       Generate as many ideas as possible:

       1. AI that predicts project timelines
       2. Visual roadmap builder
       3. Automatic dependency detection
       4. Team capacity planning
       5. Risk forecasting
       6. Meeting scheduler optimizer
       7. Sprint recommendation engine
       8. "What-if" scenario modeling
       9. Stakeholder communication templates
       10. Progress prediction vs actual

       [Generates 20+ ideas...]

       CONVERGENT PHASE 🎯
       Let's evaluate against criteria:

       | Idea | User Value | Feasibility | Differentiation |
       |------|------------|-------------|----------------|
       | AI timeline prediction | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
       | What-if modeling | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
       | Capacity planning | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |

       Top candidates: What-if modeling + AI predictions

       Want to validate these with design thinking?

You: Yes, let's do design thinking on what-if modeling

Coach: Switching to Design Thinking mode...
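The convergent phase in that session is a scoring matrix, and it's easy to reproduce outside the chat. A minimal Python sketch using the star ratings from the table above; the equal weighting is my assumption, since the transcript doesn't say how the coach combines the criteria:

```python
# Star ratings copied from the convergent-phase table.
scores = {
    "AI timeline prediction": {"user_value": 5, "feasibility": 3, "differentiation": 4},
    "What-if modeling": {"user_value": 4, "feasibility": 4, "differentiation": 5},
    "Capacity planning": {"user_value": 4, "feasibility": 4, "differentiation": 3},
}


def rank_ideas(scores: dict, weights: dict = None) -> list:
    """Return (idea, total) pairs sorted from highest to lowest weighted score."""
    weights = weights or {}
    totals = {
        idea: sum(value * weights.get(criterion, 1) for criterion, value in ratings.items())
        for idea, ratings in scores.items()
    }
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


for idea, total in rank_ideas(scores):
    print(idea, total)
# What-if modeling 13
# AI timeline prediction 12
# Capacity planning 11
```

With equal weights the top two match the coach's candidates. Passing something like weights={"feasibility": 2} shows how the ranking shifts when one criterion matters more to your team.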

Design Thinking: Validating the Idea

Design Coach: 🎯 PHASE 1: EMPATHIZE

              Who would use what-if modeling?

              PERSONA HYPOTHESES:

              "Project Paula" - PM at a startup
              - Manages 3-5 concurrent projects
              - Constantly asked "when will this ship?"
              - Burned by scope creep before

              "Executive Eric" - VP of Engineering  
              - Needs to commit to quarterly goals
              - Balances multiple team requests
              - Lives in spreadsheets currently

              Which persona should we focus on?

You: Project Paula - she's our core user

Coach: 🔍 PHASE 2: DEFINE

       Based on empathy work, here's the problem statement:

       "Project Paula needs a way to CONFIDENTLY answer
       timeline questions because her current tools only
       show ONE path forward, hiding the uncertainty
       she knows exists."

       Key insight: She KNOWS there's uncertainty,
       but has no way to COMMUNICATE it.

       💡 PHASE 3: IDEATE

       How might we help Paula communicate uncertainty?

       IDEAS:
       1. Probability ranges instead of dates
       2. "Optimistic / Realistic / Pessimistic" views
       3. Risk-adjusted timelines
       4. Shareable scenarios for stakeholders
       5. Historical accuracy tracking

       🔧 PHASE 4: PROTOTYPE

       Let's sketch the simplest test:

       Low-fidelity prototype:
       - Add "best case / worst case" to any date
       - Show range visually on timeline
       - One-click share with stakeholders

       Want to define the test criteria?

Problem-Solving: Debugging a Business Issue

You: /cis-problem-solve
     Our trial-to-paid conversion dropped 20% last month.

Problem Solver: Let's diagnose systematically.

                📊 SYMPTOM ANALYSIS

                First, some clarifying questions:
                1. Did trial signups change?
                2. Which plans saw the biggest drop?
                3. Any changes to pricing or features?
                4. What does the funnel look like?

You: Signups are stable. Drop is across all plans.
     We added a new onboarding flow 6 weeks ago.

Solver: 🔍 HYPOTHESIS FORMATION

        Timing correlates with onboarding change.

        Possible causes:
        H1: New onboarding is confusing users
        H2: New onboarding attracts wrong users
        H3: New onboarding front-loads value (no need to pay)
        H4: Technical issues in new flow

        Let's test H1 first:

        DATA TO CHECK:
        - Onboarding completion rates: old vs new
        - Time to first value: old vs new
        - Support tickets mentioning onboarding
        - User feedback/NPS scores

        What do you have access to?

You: Completion is UP 40%, but time-to-first-value
     increased from 5 min to 25 min.

Solver: 🎯 ROOT CAUSE IDENTIFIED

        The new onboarding is more thorough but
        DELAYS the "aha moment".

        Users who converted before:
        - Hit value quickly (5 min)
        - Explored on their own
        - Asked for features = signal to upgrade

        Users now:
        - Complete comprehensive tutorial (25 min)
        - Feel "trained" on free features
        - Don't discover premium value naturally

        💡 SOLUTION DIRECTIONS:
        1. Shorter path to first value, detailed training later
        2. Include premium feature preview in onboarding
        3. Guide toward premium use cases during trial

        Want to brainstorm implementations?
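
The "data to check" step in this session comes down to comparing the same metrics across the old and new onboarding cohorts. The sketch below shows one way to do that, assuming you can export one record per trial user. `TrialUser`, `summarize`, and the sample records are all hypothetical and chosen only to echo the 5-minute versus 25-minute pattern from the conversation.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TrialUser:
    cohort: str                    # "old" or "new" onboarding flow
    completed_onboarding: bool
    minutes_to_first_value: float
    converted: bool


def summarize(users: list[TrialUser], cohort: str) -> dict[str, float]:
    """Compute the funnel metrics discussed above for one cohort."""
    group = [u for u in users if u.cohort == cohort]
    if not group:
        raise ValueError(f"no users in cohort {cohort!r}")
    count = len(group)
    return {
        "completion_rate": sum(u.completed_onboarding for u in group) / count,
        "median_minutes_to_first_value": median(
            u.minutes_to_first_value for u in group
        ),
        "conversion_rate": sum(u.converted for u in group) / count,
    }


# Made-up sample data, not real measurements.
users = [
    TrialUser("old", True, 4, True),
    TrialUser("old", False, 5, True),
    TrialUser("old", False, 6, False),
    TrialUser("old", True, 5, False),
    TrialUser("new", True, 24, False),
    TrialUser("new", True, 25, True),
    TrialUser("new", True, 26, False),
    TrialUser("new", False, 25, False),
]

for cohort in ("old", "new"):
    print(cohort, summarize(users, cohort))
```

Putting the three numbers side by side is what exposes the pattern in the session: completion can go up while time-to-first-value and conversion both get worse.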

Combining CIS with Core BMAD

CIS shines when integrated with the full BMAD workflow:

┌──────────────────────────────────────────────────────────────┐
│                    FULL INNOVATION FLOW                       │
│                                                               │
│  CIS Phase                    BMAD Phase                     │
│  ─────────                    ──────────                     │
│  /cis-brainstorm      →      Product ideas generated        │
│  /cis-design-thinking →      User needs validated           │
│  /cis-innovation      →      Market opportunity confirmed   │
│                               ↓                              │
│                        /create-product-brief                 │
│                        /create-prd                           │
│                        /create-architecture                  │
│                        /dev-story                            │
│                               ↓                              │
│  /cis-storytelling    →      Launch narrative ready         │
│  /cis-presentation    →      Stakeholder buy-in secured     │
└──────────────────────────────────────────────────────────────┘

Practical Integration Example

# Phase 1: Discover (CIS)
/cis-brainstorm
# Generate and evaluate feature ideas

/cis-design-thinking
# Validate with user empathy

# Phase 2: Define (BMAD)
/create-product-brief
# Capture the validated idea

/create-prd
# Full requirements with PM agent

# Phase 3: Design (BMAD)
/create-architecture
# Technical solution

# Phase 4: Build (BMAD)
/sprint-planning
/dev-story
/code-review

# Phase 5: Launch (CIS)
/cis-storytelling
# Craft the launch narrative

/cis-presentation
# Prepare stakeholder communications
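
If you want to track where a project sits in this sequence, the phases can be written down as plain data. This is only a sketch: the command names come from the listing above, while `PIPELINE`, `checklist`, and `next_command` are hypothetical helpers and not something BMAD ships.

```python
from typing import Optional

# (phase name, module, commands), in the order used in the example above.
PIPELINE: list[tuple[str, str, list[str]]] = [
    ("Discover", "CIS", ["/cis-brainstorm", "/cis-design-thinking"]),
    ("Define", "BMAD", ["/create-product-brief", "/create-prd"]),
    ("Design", "BMAD", ["/create-architecture"]),
    ("Build", "BMAD", ["/sprint-planning", "/dev-story", "/code-review"]),
    ("Launch", "CIS", ["/cis-storytelling", "/cis-presentation"]),
]


def checklist(pipeline: list[tuple[str, str, list[str]]]) -> list[str]:
    """Render the pipeline as a printable checklist."""
    lines: list[str] = []
    for number, (phase, module, commands) in enumerate(pipeline, start=1):
        lines.append(f"Phase {number}: {phase} ({module})")
        lines.extend(f"  [ ] {command}" for command in commands)
    return lines


def next_command(
    pipeline: list[tuple[str, str, list[str]]], done: set[str]
) -> Optional[str]:
    """Return the first command not yet run, or None when all are done."""
    for _phase, _module, commands in pipeline:
        for command in commands:
            if command not in done:
                return command
    return None


print("\n".join(checklist(PIPELINE)))
print("Next:", next_command(PIPELINE, {"/cis-brainstorm"}))
```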

Team Collaboration with CIS

CIS includes team configurations for collaborative creativity:

Creative Squad

Bring together multiple CIS agents for cross-functional sessions:

You: /party-mode Innovation, Design, Storyteller

Innovation: 🚀 Looking at this from a market disruption angle...

Design: 🎨 Let me consider the user experience implications...

Storyteller: 📖 Here's how we might frame this for users...

Design Pair

Two-agent sessions that pair a systematic perspective with a creative one:

You: /party-mode Problem-Solver, Creative-Solver

Problem-Solver: 🔍 Systematically, the issue stems from...

Creative-Solver: 💡 But what if we flip that assumption...

CIS Quick Reference

| Workflow | Command | Best For |
| --- | --- | --- |
| Brainstorming | `/cis-brainstorm` | Generating many ideas |
| Design Thinking | `/cis-design-thinking` | User-centered solutions |
| Innovation | `/cis-innovation` | Market opportunities |
| Problem Solving | `/cis-problem-solve` | Root cause analysis |
| Creative Solving | `/cis-creative-solve` | Unconventional solutions |
| Storytelling | `/cis-storytelling` | Product narratives |
| Presentations | `/cis-presentation` | Persuasive decks |

Where CIS and BMAD Work Together

Here's how CIS enhances different BMAD phases:

Analysis Phase

CIS: /cis-brainstorm → Generate feature ideas
CIS: /cis-innovation → Identify market opportunities  
BMAD: /create-product-brief → Capture validated direction

Planning Phase

CIS: /cis-design-thinking → Validate user needs
CIS: /cis-problem-solve → Clarify problem space
BMAD: /create-prd → Document requirements

Solutioning Phase

CIS: /cis-creative-solve → Explore novel architectures
BMAD: /create-architecture → Document technical decisions

Launch Phase

CIS: /cis-storytelling → Craft product narrative
CIS: /cis-presentation → Prepare launch materials

Tips for Effective CIS Usage

1. Separate Divergent from Convergent Thinking

Don't evaluate ideas while generating them:

❌ "That won't work because..." (during brainstorm)
✅ "Let's generate 20 ideas, then evaluate" (process)

2. Trust the Structure

The CIS workflows follow established techniques such as design thinking. Even when a step feels uncomfortable, follow the process:

❌ Skipping empathy to jump to solutions
✅ Completing all design thinking phases

3. Document Insights

Capture what you learn for future reference:

/cis-brainstorm
→ Save promising ideas to _bmad-output/brainstorm-results.md

/cis-design-thinking
→ Save personas to _bmad-output/user-personas.md
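
The article does not say whether the workflows write these files for you, so here is a minimal sketch of doing it by hand. It assumes the `_bmad-output` directory named above and appends a dated section to a Markdown file. `save_insights` is a hypothetical helper, not a BMAD command.

```python
from datetime import date
from pathlib import Path

# Directory name taken from the paths above; adjust to your project layout.
OUTPUT_DIR = Path("_bmad-output")


def save_insights(
    filename: str,
    heading: str,
    items: list[str],
    output_dir: Path = OUTPUT_DIR,
) -> Path:
    """Append a dated, bulleted section to a notes file and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / filename
    section = [f"## {heading} ({date.today().isoformat()})", ""]
    section += [f"- {item}" for item in items]
    section.append("")
    # Append rather than overwrite so earlier sessions are preserved.
    with target.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(section) + "\n")
    return target
```

For example, `save_insights("brainstorm-results.md", "Promising ideas", ["Probability ranges instead of dates"])` adds one section to `_bmad-output/brainstorm-results.md` and leaves earlier sections in place.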

4. Combine Techniques

Different situations need different approaches:

Vague problem → /cis-problem-solve first
Clear problem → /cis-brainstorm directly
User-facing → /cis-design-thinking required
Technical → /cis-creative-solve useful
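
These four rules can be read as a small decision function. The sketch below is one interpretation: the commands come from the list above, but the `suggest_workflows` name, its flags, and the ordering (diagnose first, then brainstorm, then the situational workflows) are assumptions based on how the earlier examples sequence the commands.

```python
def suggest_workflows(
    *,
    problem_is_clear: bool,
    user_facing: bool = False,
    technical: bool = False,
) -> list[str]:
    """Return CIS commands to run, in a suggested order, for a situation."""
    commands: list[str] = []
    if not problem_is_clear:
        # Vague problem: diagnose before generating ideas.
        commands.append("/cis-problem-solve")
    commands.append("/cis-brainstorm")
    if user_facing:
        # User-facing work: design thinking is treated as required.
        commands.append("/cis-design-thinking")
    if technical:
        # Technical work: creative solving is optional but useful.
        commands.append("/cis-creative-solve")
    return commands


print(suggest_workflows(problem_is_clear=False, user_facing=True))
```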

5. Use CIS Help

CIS is integrated with BMAD's help system, so you can describe your situation and get a recommended workflow:

/bmad-help I'm stuck on a problem and don't know where to start

BMAD: Based on your situation, I recommend:
      - /cis-problem-solve for diagnosis
      - /cis-brainstorm once problem is clear

Series Wrap-Up

Over these four articles, we've covered:

Core BMAD: AI as collaborator with 12+ specialized agents
Workflows: Quick Flow, Full Planning, and Party Mode
BMad Builder: Creating custom agents and modules
Creative Intelligence Suite: Innovation and design thinking

The BMAD Philosophy Revisited

BMAD isn't about AI doing work for you—it's about AI as a thinking partner that:

Brings structure to chaos
Ensures nothing is forgotten
Provides expert perspectives on demand
Maintains context across sessions
Scales from bug fixes to enterprise systems

Getting Started

npx bmad-method install

Select the modules you need:

BMad Method (BMM) — Core workflows
BMad Builder (BMB) — Custom agents
Creative Intelligence Suite (CIS) — Innovation tools

Resources

The Creative Intelligence Suite completes the BMAD ecosystem—from the first spark of an idea through implementation and launch. Whether you're solving problems, generating ideas, or crafting narratives, CIS provides the structured creativity that turns good developers into innovative ones.

Thanks for following this series! Questions? Ideas for future topics? Drop them in the comments!

Series Index

BMAD-Method Core: AI-Driven Agile Development That Actually Works
BMAD-Method Workflows Deep Dive: From Idea to Production
BMad Builder: Creating Custom AI Agents for Your Domain
Creative Intelligence Suite: Innovation and Design Thinking for Developers (this article)