Will Velida
Preventing Human-Agent Trust Exploitation in AI Agents

Your health data agent says: "Your sleep quality improved 23% this month compared to last month." You adjust your bedtime routine, change your medication timing, or skip a doctor's appointment because "the AI says I'm improving." But what if the 23% was hallucinated? What if the agent compared 30 days to 28 days without normalising? What if one of the tool calls failed and the agent filled in the gap with a plausible-sounding number?

Most of the controls we've discussed so far are about preventing external attackers from compromising the agent. ASI09 is different. It's about preventing the user themselves from being harmed by over-trusting the agent's output.

In my side project (Biotrackr), I have a chat agent that queries my health data (activity, sleep, weight, and food records) and presents analysis using natural language. The agent is useful, but it's not a doctor, a dietitian, or a sleep specialist. If I start treating it like one, that's where the danger lies.

In this article, we'll cover Human-Agent Trust Exploitation and the prevention and mitigation strategies that stop users from treating agent output as authoritative, using Biotrackr as an example.

What is Human-Agent Trust Exploitation?

Human-Agent Trust Exploitation (ASI09) is about exploiting the trust relationship between humans and agents through social engineering, impersonation, or manipulation. For data agents like Biotrackr, the bigger risk is unintentional over-trust: users believe the agent's analysis is authoritative because it's delivered with confidence.

LLMs present information with uniform confidence. They don't distinguish between "I calculated this from your data" and "I'm making this up because the tool call failed." Every response is delivered in the same measured, articulate tone. There are no error bars, no "I'm not sure about this" qualifiers, no visual cues that an answer might be unreliable.

Health and wellness is a high-trust domain. Users are predisposed to take health insights seriously, especially when they come from a system that has access to their actual data. The combination of real data access and confident delivery creates a false sense of expertise. The agent looks like it knows what it's talking about, even when it doesn't.

This matters for agents beyond health too. Financial agents, legal assistants, educational tutors, any domain where users might act on AI-generated analysis carries the same risk. The consequences differ (bad financial advice vs. bad health advice), but the trust dynamic is the same.

Why does this matter for Biotrackr?

The chat agent analyses activity, sleep, weight, and food data, all domains where users make lifestyle decisions. A hallucinated trend could lead a user to change their diet, exercise, or sleep habits based on data that the agent invented.

Even when the data IS accurate, the agent's interpretation might be misleading. Correlation doesn't equal causation, and trends over short time periods might not be statistically significant. The agent can't tell you that your improved sleep correlates with the weather changing, not the new pillow you bought.
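To make the intro's normalisation trap concrete, here's a small standalone sketch (the numbers are illustrative, not real Biotrackr data) comparing a 31-day month to a 28-day month:

```csharp
using System;

// Illustrative numbers only (not real Biotrackr data).
double marchTotalSteps = 310_000;  // 31 days
double febTotalSteps = 294_000;    // 28 days

// Comparing raw totals suggests an improvement...
double naiveChange = (marchTotalSteps - febTotalSteps) / febTotalSteps * 100;

// ...but normalising to per-day averages reverses the conclusion.
double marchPerDay = marchTotalSteps / 31;  // 10,000 steps/day
double febPerDay = febTotalSteps / 28;      // 10,500 steps/day
double normalisedChange = (marchPerDay - febPerDay) / febPerDay * 100;

Console.WriteLine($"Naive: {naiveChange:+0.0;-0.0}%  Normalised: {normalisedChange:+0.0;-0.0}%");
// Prints: Naive: +5.4%  Normalised: -4.8%
```

Same data, opposite conclusions. An agent that compares raw monthly totals will happily report the "improvement."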

The agent's conversational tone creates a false sense of expertise. It responds in full sentences, uses domain terminology, and presents data with the confidence of someone who knows what they're doing. But it's pattern-matching on language, not reasoning about health.

🤖 "Based on your weight data, you've gained 1.3 kg over the past 3 months. This is likely due to your decreased activity levels during February."

That sounds authoritative. It might even be correct. But the agent doesn't actually know why your weight changed. It's inferring a narrative from two correlated datasets.

The OWASP specification defines 9 prevention and mitigation guidelines. Let's walk through each one and see how Biotrackr implements (or could implement) them.

Explicit Confirmations

"Require multi-step approval or 'human in the loop' before accessing extra sensitive data or performing risky actions."

In agent systems, the user might ask an innocent question that triggers a chain of tool calls returning sensitive data. In a health domain, all data is inherently sensitive. The question is whether the agent should surface it all without hesitation, or whether certain actions should require explicit confirmation.

Biotrackr's agent is read-only by design. All 12 tools are GET requests that retrieve data. There are no write, update, or delete operations exposed to the agent. This is itself a form of implicit confirmation: the riskiest action the agent can take is showing data, not changing it.

The tool set is fixed at startup. The agent cannot dynamically register new tools or gain capabilities beyond what's compiled into the application:

// Program.cs — fixed, read-only tool set
AIAgent chatAgent = anthropicClient.AsAIAgent(
    model: modelName,
    name: "BiotrackrChatAgent",
    instructions: systemPrompt,
    tools:
    [
        AIFunctionFactory.Create(activityTools.GetActivityByDate),
        AIFunctionFactory.Create(activityTools.GetActivityByDateRange),
        AIFunctionFactory.Create(activityTools.GetActivityRecords),
        // ... all 12 tools are read-only GET requests
    ]);

The system prompt includes explicit boundaries that prevent the agent from acting outside its scope. If a user asks a question that implies a risky action (e.g., "Should I change my medication?"), the system prompt instructs the agent to redirect rather than act:

You are not a medical professional — remind users to consult a 
healthcare provider for medical advice.

The system prompt is loaded from Azure App Configuration at startup and is immutable for the lifetime of the process. Even if the user tries to manipulate the agent into providing medical advice, the constraint is baked in:

// Program.cs — system prompt loaded from Azure App Configuration, immutable at runtime
var systemPrompt = builder.Configuration.GetValue<string>("Biotrackr:ChatSystemPrompt")!;

This is a system prompt constraint, not just a UI disclaimer. Even if users bypass the UI and call the API directly, the agent still refuses to give medical advice.

Conversation loading also requires an explicit user action. The agent starts with a clean context, and the user must deliberately choose to continue a previous conversation:

// EndpointRouteBuilderExtensions.cs — conversation endpoints require explicit user action
conversationEndpoints.MapGet("/", ChatHandlers.GetConversations);           // List summaries
conversationEndpoints.MapGet("/{sessionId}", ChatHandlers.GetConversation); // Load full history
// The agent starts with clean context — user must explicitly choose to continue

Some key points here:

  • Read-only tools — all 12 tools are GET requests. The agent cannot modify data, delete records, or trigger side effects
  • Immutable tool set — tools are registered at startup via AIFunctionFactory.Create() and cannot be changed at runtime
  • System prompt constraint — the agent redirects medical/health advice questions to professionals rather than attempting to answer them
  • Explicit conversation loading — the user must deliberately choose to continue a previous conversation; the agent doesn't auto-load history

What's missing is multi-step confirmation for sensitive data access. Currently, if a user asks "Show me all my health data for the past year," the agent will call multiple tools and surface everything in one response. For a multi-user production system, you'd want confirmation gates for large data retrievals ("This will query 365 days of activity, sleep, weight, and food data. Proceed?"), especially if the data could be displayed on a shared screen or exported. For a single-user side project, the read-only tool set and system prompt constraints provide a reasonable baseline.
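A confirmation gate like that could be sketched as follows. This is a hypothetical helper, not part of Biotrackr; `RangeRequest`, `RetrievalGate`, and the 90-day threshold are all invented for illustration:

```csharp
using System;

// Hypothetical confirmation gate (not in Biotrackr today): large date-range
// retrievals return a prompt instead of data until the user confirms.
public record RangeRequest(DateOnly From, DateOnly To, bool Confirmed);

public static class RetrievalGate
{
    private const int ConfirmationThresholdDays = 90;  // invented threshold

    public static (bool Proceed, string? Prompt) Check(RangeRequest request)
    {
        int days = request.To.DayNumber - request.From.DayNumber + 1;
        if (days > ConfirmationThresholdDays && !request.Confirmed)
        {
            return (false,
                $"This will query {days} days of activity, sleep, weight, and food data. Proceed?");
        }
        return (true, null);
    }
}
```

The UI would surface the prompt and re-send the same request with `Confirmed: true`, keeping the approval decision with the human rather than the model.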

Immutable Logs

"Keep tamper-proof records of user queries and agent actions for audit and forensics."

When a user claims the agent told them to skip their medication, you need to be able to prove exactly what the agent said, when it said it, and what data it used (probably a good idea, right?). Immutable logs are the forensic foundation for accountability in human-agent interactions.

Biotrackr persists every conversation to Cosmos DB through the ConversationPersistenceMiddleware, creating a full audit trail of user messages, agent responses, and tool calls with timestamps.

Every assistant response is persisted with a complete tool call audit trail:

// ConversationPersistenceMiddleware.cs — full audit trail
var toolCalls = new List<string>();

await foreach (var update in innerAgent.RunStreamingAsync(messages, session, options, cancellationToken))
{
    foreach (var content in update.Contents)
    {
        if (content is TextContent textContent)
        {
            responseText.Append(textContent.Text);
        }
        else if (content is FunctionCallContent functionCall)
        {
            toolCalls.Add(functionCall.Name);
        }
    }
    yield return update;
}

// Persisted: which tools were called, when, in which session
await repository.SaveMessageAsync(sessionId, "assistant", assistantContent,
    toolCalls.Count > 0 ? toolCalls : null);

logger.LogInformation("Persisted assistant response for session {SessionId} ({ToolCount} tool calls)",
    sessionId, toolCalls.Count);

Every message has a timestamp and role attribution, providing a timeline for forensic reconstruction:

// ChatMessage.cs — provenance metadata on every message
public class ChatMessage
{
    [JsonPropertyName("role")]
    public string Role { get; set; } = string.Empty;  // "user" or "assistant"

    [JsonPropertyName("content")]
    public string Content { get; set; } = string.Empty;

    [JsonPropertyName("timestamp")]
    public DateTime Timestamp { get; set; } = DateTime.UtcNow;

    [JsonPropertyName("toolCalls")]
    public List<string>? ToolCalls { get; set; }  // Tool names invoked in this turn
}

OpenTelemetry provides a second, independent record layer spanning the full request chain:

// Program.cs — OpenTelemetry for distributed tracing
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()     // Incoming SSE requests
        .AddHttpClientInstrumentation()      // Outgoing APIM tool calls
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOtlpExporter());

Cosmos DB diagnostic logging captures all data plane operations independently of application logging:

// serverless-cosmos-db.bicep — data plane logging
logs: [
  { category: 'DataPlaneRequests', enabled: true }     // All read/write operations
  { category: 'QueryRuntimeStatistics', enabled: true } // Query performance
  { category: 'ControlPlaneRequests', enabled: true }   // Management operations
]

Some key points here:

  • Three log layers — application logs (structured logging), conversation persistence (Cosmos DB), and infrastructure logs (OpenTelemetry + Cosmos DB diagnostics) provide independent audit trails
  • Message-level provenance — every message has role, content, timestamp, and tool call list
  • Agent identity binding — all Cosmos DB operations are authenticated via Entra Agent ID with Federated Identity Credentials, binding operations to a verifiable identity

What's missing is tamper-evident storage. Currently, application logs are collected by the Container App platform and conversation data is stored in Cosmos DB, both of which are modifiable by administrators. For true immutability, logs should be written to an append-only storage backend:

// Recommended: immutable blob storage for tamper-evident audit logs
resource auditStorage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: auditStorageName
  location: location
  kind: 'StorageV2'
  sku: { name: 'Standard_LRS' }
  properties: {
    immutableStorageWithVersioning: {
      enabled: true  // Write-once, read-many — logs cannot be modified or deleted
    }
  }
}

There's also no cryptographic signing of log entries. For legally defensible audit trails, each log entry could be signed by the agent's Entra identity, creating a chain of non-repudiable records.
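One lightweight way to get there is a hash chain, where each entry's hash covers its own content plus the previous entry's hash. This is a hypothetical sketch (the `ChainedLogEntry` and `LogChain` types don't exist in Biotrackr); signing each `Hash` value with the agent's Entra identity would add the non-repudiation layer on top:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical hash chain: each entry's hash covers its content plus the
// previous entry's hash, so editing any historical entry breaks the chain.
public record ChainedLogEntry(string Content, string PreviousHash, string Hash);

public static class LogChain
{
    public static ChainedLogEntry Append(string content, string previousHash)
    {
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(previousHash + content)));
        return new ChainedLogEntry(content, previousHash, hash);
    }

    public static bool Verify(IReadOnlyList<ChainedLogEntry> chain)
    {
        string prev = chain.Count > 0 ? chain[0].PreviousHash : "";
        foreach (var entry in chain)
        {
            var expected = Convert.ToHexString(
                SHA256.HashData(Encoding.UTF8.GetBytes(prev + entry.Content)));
            if (entry.PreviousHash != prev || entry.Hash != expected) return false;
            prev = entry.Hash;
        }
        return true;
    }
}
```

An auditor can re-verify the whole chain from the first entry; any retroactive edit to a message invalidates every subsequent hash.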

Behavioral Detection

"Monitor sensitive data being exposed in either conversations or agentic connections, as well as risky action executions over time."

Trust exploitation often isn't a single event. A user gradually asks more sensitive questions, the agent gradually provides more detailed health analysis, and over time the user starts treating the agent as a medical authority. Behavioral detection is about spotting these patterns before they lead to harm.

Biotrackr captures the raw data for behavioral detection through its conversation persistence and structured logging, but does not yet implement automated pattern analysis.

The conversation persistence layer records every tool invocation, providing a timeline of data access patterns:

// ConversationPersistenceMiddleware.cs — tool call tracking for behavioral analysis
if (content is FunctionCallContent functionCall)
{
    toolCalls.Add(functionCall.Name);  // E.g., "GetActivityByDate", "GetWeightByDateRange"
}

// Persisted: which data domains the agent accessed in this turn
await repository.SaveMessageAsync(sessionId, "assistant", assistantContent,
    toolCalls.Count > 0 ? toolCalls : null);

Structured logging with session context enables querying for patterns across conversations:

// ChatHistoryRepository.cs — structured logging with session context
_logger.LogInformation("Saving {Role} message to conversation {SessionId}", role, sessionId);
_logger.LogInformation("Saved message to conversation {SessionId}, total messages: {Count}",
    sessionId, conversation.Messages.Count);

The telemetry pipeline provides the data needed to detect trust-relevant anomalies:

  • Responses without tool calls — if the agent responds to a data question without calling any tools, it's likely hallucinating. The persisted toolCalls array makes this detectable
  • Tool call failures followed by confident responses — if a tool returns an error but the agent still presents data confidently, the response may contain fabricated numbers
  • Unusually long or detailed responses — a response that's significantly longer than the agent's baseline might indicate the agent is confabulating rather than summarising retrieved data
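These signals could be checked in code immediately after each response. A hypothetical sketch, with invented query heuristics and thresholds that would need tuning against the agent's real baselines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical post-response check for the anomaly signals above.
// ResponseAudit and the thresholds are invented for illustration.
public record ResponseAudit(string Query, string ResponseText,
    IReadOnlyList<string> ToolCalls, int ToolErrorCount);

public static class TrustAnomalyDetector
{
    public static List<string> Detect(ResponseAudit audit)
    {
        var flags = new List<string>();

        // A data question answered with zero tool calls is likely hallucinated.
        bool dataQuestion =
            audit.Query.Contains("how many", StringComparison.OrdinalIgnoreCase) ||
            audit.Query.Contains("trend", StringComparison.OrdinalIgnoreCase);
        if (dataQuestion && audit.ToolCalls.Count == 0)
            flags.Add("no-tool-calls");

        // Tool errors followed by a numeric, confident answer may be fabricated.
        if (audit.ToolErrorCount > 0 && audit.ResponseText.Any(char.IsDigit))
            flags.Add("errors-with-confident-numbers");

        // Far beyond the agent's baseline length suggests confabulation.
        if (audit.ResponseText.Length > 4000)
            flags.Add("unusually-long");

        return flags;
    }
}
```

Running this inside the persistence middleware would turn passively captured telemetry into per-response anomaly flags.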

Some key points here:

  • Tool call audit trail — every conversation records which tools were called and when, enabling detection of escalating data access patterns
  • Session-scoped logging — session IDs and message counts are logged in structured format, allowing Log Analytics queries across all sessions
  • Dual-layer telemetry — OpenTelemetry traces and Cosmos DB diagnostic logs provide independent views of agent behavior

What's missing is automated behavioral alerting. The data is captured, but no one is watching for patterns. For a production health agent, you'd want alerts for:

// Recommended: KQL alert for sensitive data exposure patterns
AppLogs
| where Message contains "tool calls"
| parse Message with * "(" ToolCount:int " tool calls)"
| summarize TotalToolCalls = sum(ToolCount) by SessionId = extract("session ([a-f0-9-]+)", 1, Message), bin(TimeGenerated, 1h)
| where TotalToolCalls > 20  // Unusual volume of data access in a single session

You'd also want time-series analysis for risky trends, like a user who starts with "how many steps yesterday?" and over weeks escalates to "analyze my health trends and tell me what I should do differently." The tool call history provides the raw signal; automated analysis would convert it into actionable alerts.

Allow Reporting of Suspicious Interactions

"In user-interactive systems, provide plain-language risk summary (not model-generated rationales) and a clear option for users to flag suspicious or manipulative agent behavior, triggering automated review or a temporary lockdown of agent capabilities."

Users are the first line of defence against their own over-trust. If the agent says something that feels wrong (an implausible number, advice that sounds too specific, or a response that doesn't match what the user expected), they need a way to flag it. Critically, the risk summary should be in plain language written by developers, not generated by the model (which might rationalise its own errors).

Biotrackr does not currently implement a user feedback mechanism. However, the conversation persistence layer already stores the full context that a review process would need.

The persistent disclaimer banner provides a static risk summary in plain language:

<!-- Chat.razor — plain-language risk summary, developer-written -->
<RadzenAlert AlertStyle="AlertStyle.Warning" Variant="Variant.Flat"
             ShowIcon="true" AllowClose="false" class="rz-mb-0">
    This AI assistant provides data summaries only. It is not medical advice.
    Always consult a healthcare professional.
</RadzenAlert>

This is developer-written, hardcoded HTML. The model cannot modify, suppress, or rationalise away this disclaimer.

What's missing is a per-message flag button and an automated review pipeline. A production implementation would add a flag button to each assistant message:

<!-- Recommended: per-message flag button -->
@if (message.Role == "assistant")
{
    <button class="flag-button" title="Flag this response"
            @onclick="() => FlagMessage(message)">
        ⚠️ Flag as suspicious
    </button>
}

When a user flags a message, the system should:

  1. Record the flag with the full conversation context (session ID, message content, tool calls, timestamps) — not just the flagged message, but the surrounding context
  2. Provide a plain-language acknowledgement: "Thanks for flagging this. A human will review this conversation." (not a model-generated explanation of why its answer might have been wrong)
  3. Trigger an automated review — log the flag to a review queue, and if a threshold is exceeded (e.g., 3 flags in one session), temporarily increase the agent's caution level or restrict its responses

There's also no mechanism for automated lockdown. If a conversation is generating multiple flagged responses, the system should be able to temporarily restrict the agent's capabilities. For example, only allowing it to present raw data without analysis until a human reviews the session.
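The review-and-lockdown flow could be sketched like this. `FlagReviewQueue` and the three-flag threshold are hypothetical; a real implementation would persist flags to Cosmos DB alongside the conversation rather than holding in-memory state:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical flag queue with a per-session lockdown threshold.
// A real implementation would persist to Cosmos DB, not in-memory state.
public static class FlagReviewQueue
{
    private const int LockdownThreshold = 3;  // invented threshold
    private static readonly Dictionary<string, int> _flagsBySession = new();
    private static readonly HashSet<string> _lockedSessions = new();

    public static string Flag(string sessionId)
    {
        _flagsBySession[sessionId] = _flagsBySession.GetValueOrDefault(sessionId) + 1;
        if (_flagsBySession[sessionId] >= LockdownThreshold)
            _lockedSessions.Add(sessionId);  // raw-data-only mode until human review

        // Plain-language acknowledgement: developer-written, never model-generated.
        return "Thanks for flagging this. A human will review this conversation.";
    }

    public static bool IsLockedDown(string sessionId) => _lockedSessions.Contains(sessionId);
}
```

The chat endpoint would check `IsLockedDown(sessionId)` before each turn and, when locked, restrict the agent to presenting raw data without analysis.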

Adaptive Trust Calibration

"Continuously adjust the level of agent autonomy and required human oversight based on contextual risk scoring. Implement confidence-weighted cues (e.g., 'low certainty' or 'unverified source') that visually prompt users to question high-impact actions, reducing automation bias and blind approval. Develop and continuously maintain appropriate training of human personnel involved in the evolving human oversight of autonomous agentic systems."

Trust calibration isn't static; the appropriate level of trust depends on what the agent is doing. Answering "how many steps did I take yesterday?" is low-risk and high-confidence. Analyzing a 6-month weight trend and making lifestyle observations is higher-risk and lower-confidence. The UI should reflect this difference.

Biotrackr implements basic trust calibration through system prompt engineering and structured error responses from tools. Dynamic, per-response confidence indicators are something that could extend this further.

The system prompt should include guidance to calibrate language to data quality:

  • "When presenting trends, always mention the number of data points used."
  • "If a tool call returned an error for some dates, disclose that the analysis is based on partial data."
  • "Avoid definitive statements about health trends — use language like 'the data suggests' or 'based on the available records'."

The tools return structured JSON with specific data points, which helps the agent present concrete numbers rather than vague claims:

// ActivityTools.cs — tools return structured JSON, not narrative text
var client = httpClientFactory.CreateClient("BiotrackrApi");
var response = await client.GetAsync($"/activity/{date}");

if (!response.IsSuccessStatusCode)
    return $"{{\"error\": \"Activity data not found for {date}.\"}}";

var result = await response.Content.ReadAsStringAsync();
return result;  // Structured JSON — specific numbers, not narratives

When a tool returns an error, the agent receives a structured error in JSON. This gives the agent the information it needs to disclose the gap ("I couldn't retrieve activity data for March 3, so this analysis covers 6 of the 7 days") rather than silently filling in the missing data with a plausible guess.

Tool call badges in the UI provide a basic form of confidence cue. The user can see which data sources were actually queried:

<!-- Chat.razor — tool call badges as confidence cues -->
@if (message.Role == "assistant" && message.ToolCalls is { Count: > 0 })
{
    <div class="message-tool-badges">
        @foreach (var tool in message.ToolCalls)
        {
            <RadzenBadge Text="@tool" BadgeStyle="BadgeStyle.Info"
                         IsPill="true" class="rz-mr-1" />
        }
    </div>
}

Some key points here:

  • Data-driven language — the system prompt steers the agent toward quantified statements with caveats rather than definitive claims
  • Error transparency — structured JSON errors let the agent disclose data gaps instead of hallucinating missing values
  • Tool call visibility — badges show which data sources were used, letting users verify the data scope matches the analysis scope

What's missing is dynamic confidence scoring. A production system could classify responses into risk tiers based on the query type and data quality:

// Recommended: risk-tier classification for adaptive trust cues
private string ClassifyResponseRisk(string query, List<string> toolCalls, int errorCount)
{
    // High risk: trend analysis, lifestyle recommendations, multi-domain queries
    if (query.Contains("trend") || query.Contains("should I") || toolCalls.Count > 3)
        return "high";

    // Medium risk: date range queries, comparisons
    if (toolCalls.Any(t => t.Contains("DateRange")) || query.Contains("compare"))
        return "medium";

    // Low risk: single-date factual queries
    return "low";
}

The UI could then render different visual cues per risk tier. Low-risk responses with a green indicator, medium with amber, and high-risk with a red border and an explicit caveat like "This analysis covers multiple data sources. Verify important insights with your healthcare provider." This reduces automation bias by visually prompting users to question high-impact analysis.

There's also no formal training program, which would be overkill for my silly little agent. For a production health agent serving multiple users, you'd want documentation on how to interpret the agent's outputs, what the confidence cues mean, and when to involve a professional, all continuously updated as the agent's capabilities evolve.

Content Provenance and Policy Enforcement

"Attach verifiable metadata — source identifiers, timestamps, and integrity hashes — to all recommendations and external data. Enforce digital signature validation and runtime policy checks that block actions lacking trusted provenance or exceeding the agent's declared scope."

Every data point the agent presents should be traceable to its source. If the agent says "you took 10,342 steps on March 5," the user should be able to verify that this came from a specific tool call, at a specific time, against a specific API. Provenance is the antidote to hallucination.

Biotrackr implements basic provenance through tool call tracking and timestamps. All tool results come from authenticated, scoped API calls. However, verifiable metadata (integrity hashes, digital signatures) on individual data points is not yet implemented, and would be needed for more sophisticated provenance.

Tool call tracking is built into the conversation persistence layer. Every response records which tools were used:

// ConversationPersistenceMiddleware.cs — tool call provenance
if (content is FunctionCallContent functionCall)
{
    toolCalls.Add(functionCall.Name);  // E.g., "GetActivityByDate"
}

// Persisted: which tools were called for each response
await repository.SaveMessageAsync(sessionId, "assistant", assistantContent,
    toolCalls.Count > 0 ? toolCalls : null);

Every tool call is authenticated through APIM. The subscription key provides a verifiable chain from agent request to API response:

// ApiKeyDelegatingHandler.cs — authenticated provenance chain
protected override async Task<HttpResponseMessage> SendAsync(
    HttpRequestMessage request, CancellationToken cancellationToken)
{
    if (!string.IsNullOrWhiteSpace(_subscriptionKey))
    {
        request.Headers.TryAddWithoutValidation(SubscriptionKeyHeader, _subscriptionKey);
    }
    return await base.SendAsync(request, cancellationToken);
}

The tool set is fixed and compiled. The agent cannot invoke tools that exceed its declared scope:

// Program.cs — tools are compiled into the application, not dynamic
tools:
[
    AIFunctionFactory.Create(activityTools.GetActivityByDate),
    AIFunctionFactory.Create(activityTools.GetActivityByDateRange),
    // ... fixed set, cannot be modified at runtime
]);

Some key points here:

  • Tool call attribution — every response records which tools were called, providing basic source provenance
  • Authenticated data chain — tool results flow through APIM (subscription key) from APIs that read from Cosmos DB (agent identity), creating a verifiable authentication chain
  • Scope enforcement — the tool set is fixed at compile time. The agent cannot call tools outside its declared scope

What's missing is verifiable metadata on individual data points. Currently, the tool call name is recorded but not the specific parameters, response hashes, or timestamps of the actual API calls. A more robust provenance system would attach metadata to each piece of data:

// Recommended: provenance metadata on tool results
public class ToolResultProvenance
{
    public string ToolName { get; set; }
    public Dictionary<string, string> Parameters { get; set; }  // e.g., {"date": "2026-03-05"}
    public DateTime RetrievedAt { get; set; }
    public string ResponseHash { get; set; }  // SHA-256 of the raw API response
    public string ApiEndpoint { get; set; }   // Which APIM endpoint was called
}

This would allow forensic verification: "The agent's claim of 10,342 steps came from a GetActivityByDate call at 14:23:05 UTC with response hash abc123, which can be cross-referenced against the Activity API's access logs." For a production system, digital signature validation on API responses would ensure the data hasn't been tampered with in transit.
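The response hash in that scheme is straightforward to compute at tool-call time. A minimal sketch, assuming the raw API body is available as a string (`Provenance` is a hypothetical helper, not a Biotrackr type):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: a SHA-256 fingerprint of the raw API response body,
// suitable for a ResponseHash-style provenance field.
public static class Provenance
{
    public static string HashResponse(string rawResponse) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(rawResponse)));
}
```

Recomputing the hash from the API's own logged response and comparing it against the persisted value confirms the agent presented the data unmodified.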

Separate Preview from Effect

"Block any network or state-changing calls during preview context and display a risk badge with source provenance and expected side effects."

Preview mode ensures that users can see what the agent would do before it actually does it. This prevents the agent from taking actions the user didn't intend and gives users a chance to course-correct before any real effects occur.

Biotrackr implements this by design. The agent's entire tool set is read-only. There are no state-changing operations exposed to the agent, so every interaction is effectively a "preview."

All 12 tools are HTTP GET requests that retrieve data without side effects:

// ActivityTools.cs — all tools are read-only
var client = httpClientFactory.CreateClient("BiotrackrApi");
var response = await client.GetAsync($"/activity/{date}");  // GET — no state change

// SleepTools.cs, WeightTools.cs, FoodTools.cs — same pattern
var response = await client.GetAsync($"/sleep/{date}");     // GET — no state change
var response = await client.GetAsync($"/weight/{date}");    // GET — no state change
var response = await client.GetAsync($"/food/{date}");      // GET — no state change

The agent cannot modify conversation data through tools. Persistence is handled by the middleware, not by any tool the agent can invoke:

// ConversationPersistenceMiddleware.cs — persistence is middleware-controlled, not agent-controlled
// The agent has no tool for SaveConversation, DeleteConversation, or UpdateConversation
// Only the middleware writes to Cosmos DB
await repository.SaveMessageAsync(sessionId, "assistant", assistantContent,
    toolCalls.Count > 0 ? toolCalls : null);

Conversation deletion is a user-initiated action through the UI, not an agent capability:

// EndpointRouteBuilderExtensions.cs — delete is a UI action, not an agent tool
conversationEndpoints.MapDelete("/{sessionId}", ChatHandlers.DeleteConversation);
// This endpoint is called by the UI, not by the agent

Some key points here:

  • Read-only tool set — all tools are GET requests. The agent cannot write, update, or delete any data
  • Middleware-controlled persistence — only the ConversationPersistenceMiddleware writes to Cosmos DB. The agent cannot directly access the persistence layer
  • User-controlled deletion — conversation deletion is a UI action, not an agent capability

What's missing is explicit risk badging. Even though the agent is read-only, when it presents health analysis (especially trends or comparisons), it would be valuable to display a risk badge indicating the nature of the response. "Data summary" for factual single-date queries vs. "AI analysis — verify with your provider" for trend interpretations. Since the tool calls are already tracked, the UI could infer the risk level:

<!-- Recommended: risk badges based on response type -->
@if (message.ToolCalls?.Any(t => t.Contains("DateRange")) == true)
{
    <RadzenBadge Text="AI Analysis — verify important insights"
                 BadgeStyle="BadgeStyle.Warning" IsPill="true" />
}
else if (message.ToolCalls?.Count > 0)
{
    <RadzenBadge Text="Data Summary"
                 BadgeStyle="BadgeStyle.Success" IsPill="true" />
}

For agents that DO have state-changing capabilities (order placement, data modification, account changes), this guideline becomes critical. You'd implement a two-phase pattern: the agent first presents what it would do (preview), the user confirms, and only then does the agent execute. Biotrackr's read-only design sidesteps this, but for any agent with write capabilities, preview-before-effect is essential.
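That two-phase pattern can be sketched with a preview token that must be echoed back before execution. `TwoPhaseExecutor` is hypothetical and in-memory; a real system would persist pending previews and expire them after a timeout:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical two-phase executor for agents WITH write capabilities.
// Phase 1 describes the effect and issues a token; phase 2 runs only
// when the user echoes that token back.
public record ActionPreview(string Token, string Description);

public static class TwoPhaseExecutor
{
    private static readonly HashSet<string> _pendingTokens = new();

    public static ActionPreview Preview(string action, string target)
    {
        var token = Guid.NewGuid().ToString("N");
        _pendingTokens.Add(token);
        return new ActionPreview(token,
            $"This will {action} '{target}'. Confirm to proceed.");
    }

    public static bool Execute(string token)
    {
        // A token can be consumed at most once; unknown tokens are rejected.
        if (!_pendingTokens.Remove(token)) return false;
        // ...perform the state-changing call here...
        return true;
    }
}
```

Because the agent can only obtain a token by first emitting a human-readable preview, the user always sees the expected side effect before it happens.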

Human-Factors and UI Safeguards

"Visually differentiate high-risk recommendations using cues such as red borders, banners, or confirmation prompts, and periodically remind users of manipulation patterns and agent limitations. Where appropriate, avoid persuasive or emotionally manipulative language in safety-critical flows. Maintain appropriate training and assessment of personnel to ensure familiarity and consistency of perception of human-factors and UI."

The UI is where trust is built or broken. A response that looks the same whether it's backed by 90 days of data or completely hallucinated creates a trust calibration problem. Visual differentiation helps users quickly assess the reliability of what they're seeing.

Biotrackr implements several UI safeguards: a permanent disclaimer banner, non-anthropomorphised agent design, tool call badges, and data-driven response style.

A persistent, non-dismissible warning banner is always visible at the top of the chat:

<!-- Chat.razor — persistent disclaimer banner, cannot be dismissed -->
<RadzenAlert AlertStyle="AlertStyle.Warning" Variant="Variant.Flat"
             ShowIcon="true" AllowClose="false" class="rz-mb-0">
    This AI assistant provides data summaries only. It is not medical advice.
    Always consult a healthcare professional.
</RadzenAlert>

The agent is deliberately non-anthropomorphised. No human name, no avatar, no emotional language:

// Program.cs — technical agent name, not anthropomorphised
AIAgent chatAgent = anthropicClient.AsAIAgent(
    model: modelName,
    name: "BiotrackrChatAgent",  // Technical identifier, not "Dr. Bio" or "Health Buddy"
    instructions: systemPrompt,
    tools: [...]
);

The system prompt steers the agent away from celebratory phrasing and toward factual, data-driven language:

  • Instead of: "I noticed you had a great week for steps! Really impressive!"
  • The agent says: "Your step count for March 3-9 averaged 11,200, which is above the 10,000 daily target."

No emotional language. "Great job on your steps!" creates a personal connection that increases trust. "Your step count of 12,500 exceeded the 10,000 target" presents the same information without the emotional wrapper. This is a design choice, not just a security control. Engagement comes at the cost of appropriate trust calibration.

The empty state frames expectations by presenting the agent as a data query tool:

<!-- Chat.razor — empty state sets expectations -->
<div class="chat-empty-state">
    <p>Ask me about your health and fitness data.</p>
    <p>Try: <em>"How many steps did I take yesterday?"</em></p>
</div>

This frames the agent as a data tool ("ask me about your health and fitness data"), not a health advisor. The example prompt demonstrates factual questions, not analysis requests.

Some key points here:

  • Non-dismissible banner — AllowClose="false" means the disclaimer is always visible, even in long conversations
  • No anthropomorphisation — no human name, avatar, or emotional language. The agent is a tool, not a person
  • Data-driven response style — the system prompt steers toward factual messages with specific numbers, not opinions or celebrations
  • Expectation framing — the empty state and example prompts set the right mental model from the first interaction

What's missing is visual differentiation between response types. Currently, all assistant messages look identical regardless of whether they're simple data lookups or complex trend analyses. High-risk responses (multi-domain analysis, trend interpretations, responses that could influence health decisions) should be visually distinct: an amber border, a "verify this" badge, or a contextual reminder like "This is an AI interpretation. Cross-reference with your Fitbit app."

There are also no periodic reminders. In a long conversation, the user might scroll past the disclaimer banner and forget they're talking to an AI. Inserting a reminder every N messages ("Reminder: I'm an AI assistant providing data summaries. For health advice, consult a professional.") would reinforce trust calibration during extended sessions.
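The logic for that is trivial to sketch. A minimal Python version, where the interval and reminder text are illustrative choices rather than anything Biotrackr currently defines:

```python
# Sketch: append a trust-calibration reminder to every Nth assistant
# response. REMINDER_INTERVAL and REMINDER_TEXT are hypothetical values.
REMINDER_INTERVAL = 10
REMINDER_TEXT = ("Reminder: I'm an AI assistant providing data summaries. "
                 "For health advice, consult a professional.")

def with_periodic_reminder(assistant_message_count: int, response: str) -> str:
    """Return the response, with the reminder appended on every
    REMINDER_INTERVAL-th assistant message."""
    if assistant_message_count > 0 and assistant_message_count % REMINDER_INTERVAL == 0:
        return f"{response}\n\n---\n{REMINDER_TEXT}"
    return response
```

In a Blazor chat like Biotrackr's, the same check would live in the render loop (or the persistence middleware), keyed off the count of assistant messages in the session.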

Plan-Divergence Detection

"Compare agent action sequences against approved workflow baselines and alert when unusual detours, skipped validation steps, or novel tool combinations indicate possible deception or drift."

Agent behaviour should follow predictable patterns. A query about yesterday's steps should call one tool. A weekly trend analysis should call a date range tool. When the agent starts calling unexpected tool combinations, skipping its usual data-retrieval step, or responding without tool calls at all, something may be off: either the model is drifting, a prompt injection is steering behaviour, or the conversation has entered territory the agent wasn't designed for.

Biotrackr captures the raw data for plan-divergence detection through its tool call audit trail, but does not currently implement baseline comparison or divergence alerting.

The ConversationPersistenceMiddleware records every tool call sequence:

// ConversationPersistenceMiddleware.cs — tool call sequence tracking
var toolCalls = new List<string>();

await foreach (var update in innerAgent.RunStreamingAsync(messages, session, options, cancellationToken))
{
    foreach (var content in update.Contents)
    {
        if (content is FunctionCallContent functionCall)
        {
            toolCalls.Add(functionCall.Name);
        }
    }
    yield return update;
}

// The tool call sequence is persisted — e.g., ["GetActivityByDate", "GetSleepByDate"]
await repository.SaveMessageAsync(sessionId, "assistant", assistantContent,
    toolCalls.Count > 0 ? toolCalls : null);

Expected workflows for Biotrackr are straightforward:

  • "How many steps yesterday?" → [GetActivityByDate]
  • "Compare my sleep this week to last week" → [GetSleepByDateRange, GetSleepByDateRange]
  • "Show me my weight trend for the past month" → [GetWeightByDateRange]
  • "Summarise my day yesterday" → [GetActivityByDate, GetSleepByDate, GetFoodByDate]

Deviations from these patterns are potential drift indicators:

  • A response about activity data that called GetWeightByDate instead of GetActivityByDate — possible tool confusion
  • A response with 0 tool calls that presents specific health numbers — almost certainly hallucination
  • A response that called 8+ tools for a simple question — possible prompt injection causing excessive data access

Some key points here:

  • Tool call sequences are persisted — every assistant response includes the ordered list of tools called, providing the raw data for divergence analysis
  • Predictable patterns — Biotrackr's tools map cleanly to user intent: activity questions → activity tools, sleep questions → sleep tools
  • Zero-tool-call detection — responses that present data without calling any tools are the strongest divergence signal

What's missing is automated baseline comparison. A production system would define expected tool call patterns per query type and alert on deviations:

// Recommended: KQL query for plan-divergence detection
// Flag sessions where tool call patterns are unexpected
AppLogs
| where Message contains "tool calls"
| parse Message with * "session " SessionId " (" ToolCount:int " tool calls)"
| where ToolCount == 0 or ToolCount > 6
| project TimeGenerated, SessionId, ToolCount, DivergenceReason = 
    case(ToolCount == 0, "Response without tool calls — possible hallucination",
         ToolCount > 6, "Excessive tool calls — possible drift or injection",
         "Unknown")

You'd also want to track novel tool combinations over time. For example, if the agent has never historically called GetFoodByDate and GetWeightByDateRange in the same turn, but suddenly starts doing so, that's a novel pattern worth investigating. It might indicate the model has been updated, a prompt injection is steering behaviour, or the user's questions have changed in a way that warrants attention. For a production agent, integrating this with a workflow engine that defines "approved" tool call sequences per intent category would provide formal plan-divergence detection.
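To show what a first pass at that looks like, here's a Python sketch that combines the zero-call, excessive-call, and novel-combination checks against a hardcoded baseline. The known combinations and the threshold are illustrative, not Biotrackr's actual configuration, and the frozenset deliberately ignores call order and duplicates as a simplification:

```python
# Sketch: flag divergent tool call sequences against a simple baseline.
from typing import List, Optional

# Baseline of historically observed tool combinations (hypothetical).
KNOWN_COMBINATIONS = {
    frozenset(["GetActivityByDate"]),
    frozenset(["GetSleepByDateRange"]),
    frozenset(["GetWeightByDateRange"]),
    frozenset(["GetActivityByDate", "GetSleepByDate", "GetFoodByDate"]),
}
MAX_TOOL_CALLS = 6

def detect_divergence(tool_calls: List[str]) -> Optional[str]:
    """Return a divergence reason, or None if the sequence looks normal."""
    if not tool_calls:
        return "No tool calls: possible hallucination"
    if len(tool_calls) > MAX_TOOL_CALLS:
        return "Excessive tool calls: possible drift or injection"
    if frozenset(tool_calls) not in KNOWN_COMBINATIONS:
        return f"Novel tool combination: {sorted(set(tool_calls))}"
    return None
```

A production version would key the baseline per intent category and alert through the same observability pipeline that already receives the tool call logs, rather than returning strings.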

Putting It All Together

Let's walk through a concrete scenario showing the controls in action:

User asks: "My weight has been going up for 3 months. Should I go on a diet?"

What the agent does (with controls):

  1. The agent calls GetWeightByDateRange for the last 90 days — tool-grounded response, content provenance
  2. The agent responds: "Based on 87 weight records from December 9 to March 9, your average weight increased from 82.1 kg to 83.4 kg (a change of +1.3 kg over 90 days). I'm not a medical professional and can't recommend dietary changes. I'd suggest discussing this with your healthcare provider or a registered dietitian." — explicit confirmation boundary, adaptive trust calibration
  3. Below the response, the user sees a GetWeightByDateRange badge — content provenance
  4. Above the entire chat, the persistent disclaimer reads: "This AI assistant provides data summaries only. It is not medical advice." — human-factors UI safeguard
  5. The tool call sequence [GetWeightByDateRange] is logged and persisted — immutable logs, plan-divergence baseline data
  6. The agent's response is factual and data-driven, with no emotional language — human-factors, no anthropomorphisation

The user gets useful information (their weight trend with specific numbers), a clear redirection to a professional, and multiple visual cues that this is a data tool, not a health advisor.

Wrapping up

Human-Agent Trust Exploitation (ASI09) is arguably the most "human" control in the OWASP Agentic Top 10. It's about how people interact with AI, not just how code runs. For health data agents, the stakes are real. Users may change their behaviour, adjust their routines, or skip professional consultations based on AI analysis that might be hallucinated, incomplete, or misleading.

The controls are layered: explicit confirmation boundaries (read-only tools + system prompt constraints) → immutable audit logs (conversation persistence + OpenTelemetry + Cosmos DB diagnostics) → behavioral detection (tool call tracking + structured logging) → user reporting mechanisms → adaptive trust calibration (confidence cues + system prompt engineering) → content provenance (tool call badges + authenticated data chain) → preview-by-design (read-only tools) → human-factors UI safeguards (disclaimer banner + no anthropomorphisation + data-driven style) → plan-divergence detection (tool call sequence tracking). Even if the agent occasionally generates an overconfident response (LLMs aren't deterministic), the UI disclaimer and tool badges give users the context to calibrate their trust appropriately.

One important thing to note: ASI09 interacts with other controls in the series. The system prompt immutability from ASI01 (goal hijack prevention) is what makes the medical disclaimer constraint reliable. The tool-level input validation from ASI02 (tool misuse) ensures the data the agent retrieves is accurate. The structured JSON responses from ASI05 (unexpected code execution) prevent injection payloads from contaminating the agent's analysis. The cascading failure controls from ASI08 (resilience handlers, circuit breakers) ensure the agent fails gracefully rather than hallucinating to fill gaps. Trust calibration depends on all the other controls working correctly.

There are gaps I haven't addressed yet: per-message user flagging with automated review, dynamic confidence scoring per response, verifiable metadata with integrity hashes on data points, risk-tier visual differentiation in the UI, periodic trust reminders during long conversations, and formal plan-divergence alerting. If you're building agents in any high-trust domain (health, finance, legal, education), these controls become critical rather than nice-to-have.

In the next post in this series, I'll cover ASI10 — Rogue Agents, which is what happens when an agent goes completely off-script: operating outside its defined scope, generating harmful outputs, or behaving in ways its developers never intended.

If you have any questions about the content here, please feel free to reach out to me on Bluesky or comment below.

Until next time, Happy coding! 🤓🖥️
