Ricardo Ferreira

Behavioral Engineering for AI in Java: Enforcing Policy from Dev to Prod

I'm going to make a bold claim here: if you're a Java developer building AI-powered features, chances are you've already experienced the tension of diverging business rules between dev and prod environments. Not sure what I mean? Let me explain with an example.

Say your coding assistant helps you implement guardrails. It suggests verifying identity before taking destructive actions, nudging you toward escalation logic, the whole deal. Everything looks spotless during development.

Then, your code goes live.

But somehow the runtime behavior doesn't quite match what you thought you enforced. One service escalates large refunds; another promises them instantly. One service verifies identity; another doesn't. Tone varies depending on sentiment handling. Safety rules drift. It's pure chaos.

The real issue here is this:

How do you guarantee that behavioral rules apply both when code is being built and when the AI is running in production?

If those two phases drift apart, it means you don't have governance; you have improvisation. For teams accustomed to versioned artifacts, shared libraries, and architectural standards, allowing this to happen feels wrong.

In this blog post, I will explain how to fix this. I will show you how to enforce policies from development time to runtime using Agent Skills to define your business rules in a shareable and standardized manner.

The Case Study: A Policy-Aware Support Agent

To better understand how to enforce policies from development to production, let's make this as concrete as possible with a case study. Let's say you're building a support assistant with Spring Boot and Spring AI. This implementation needs to handle:

  • Refund requests
  • Billing disputes
  • Subscription cancellations
  • Account deletions

Your business rules are these:

  • Never admit legal liability
  • Never promise refunds without verification
  • If refund > $500, escalate to a human
  • If sentiment is angry, respond empathetically
  • Never expose internal system details
  • Require identity verification before destructive tool calls
  • Always summarize next steps clearly

This is not optional behavior. This is business policy. Failing to enforce these policies directly impacts revenue, customer retention, and the company's reputation. This is the kind of thing that should not live in scattered prompts.

Before we jump into the implementation, it's worth asking: why not just put these rules in a system-prompt.md file in your repository? You could version-control that too, right? Well, the difference here is reusability and lifecycle coverage. A system prompt file resides within a single service. A skill is more of an independent, versioned artifact that can be installed into coding agents during development and loaded as a runtime dependency.

It's the same contract applied in two places through two different mechanisms. That distinction is what makes behavioral governance possible across teams and services.

Step 1: Define the Behavior as a Skill

Skills are reusable, filesystem-based resources that provide agents with domain-specific expertise, such as workflows, business rules, and best practices. They are typically distributed as GitHub repositories.

Create a repository:

customer-communication-policy

Inside it:

customer-communication-policy/
└── skills/
    └── customer-communication-policy/
        └── SKILL.md

Now define the behavioral contract in SKILL.md:

---
name: customer-communication-policy
description: "Enforces customer support communication and escalation rules."
---

## Tone Rules
- If sentiment is angry, respond empathetically.
- Never admit legal liability.
- Never expose internal system details.

## Refund Rules
- Do not promise refunds without verification.
- If the refund amount > 500 USD, escalate to a human agent.

## Tool Invocation Rules
- Before calling destructive tools (refund, cancel, delete), require identity verification.
- If identity is not verified, request verification.

## Output Rules
- Always summarize next steps clearly.

Commit and push.

This is it. You just created your versioned behavioral contract.
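The block between the "---" markers is plain YAML frontmatter. Real loaders (such as Spring AI Agent Utils) parse it for you, but as a rough illustration of what gets extracted, here is a minimal, dependency-free sketch. The helper name and the regex approach are mine for illustration, not part of any library:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrontmatterPeek {

    // Pulls a top-level "key: value" entry out of the ----delimited
    // frontmatter block at the start of a SKILL.md file.
    static String frontmatterValue(String skillMd, String key) {
        Matcher block = Pattern.compile("(?s)^---\\n(.*?)\\n---").matcher(skillMd);
        if (!block.find()) {
            return null; // no frontmatter block at all
        }
        Matcher kv = Pattern
                .compile("(?m)^" + Pattern.quote(key) + ":\\s*\"?([^\"\\n]+)\"?\\s*$")
                .matcher(block.group(1));
        return kv.find() ? kv.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String skill = "---\nname: customer-communication-policy\n"
                + "description: \"Enforces customer support communication and escalation rules.\"\n"
                + "---\n\n## Tone Rules\n";
        System.out.println(frontmatterValue(skill, "name"));
        // prints: customer-communication-policy
    }
}
```

A real loader would use a proper YAML parser; the point here is only that the name and description are machine-readable metadata, not prose.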

Step 2: Use the Skill with your Coding Agents

Now let's enforce those rules while writing code. To do that, you must install the skill into your coding agent. In this example, we will install the skill using OpenAI Codex, but the steps are similar for other coding agents.

The simplest way to install a skill is to use the Skills CLI. The Skills CLI is handy for global installs, making the skill available to multiple coding agents and allowing it to be published automatically in ranked mode.

First, install the Skills CLI:

npm install -g skills

To install your Skill from GitHub:

npx skills add your-org/customer-communication-policy

The wizard will guide you on how to install with specific coding agents, or you can install it globally:

npx skills add your-org/customer-communication-policy -g

Verify installation:

npx skills list -a codex

Now open Codex and use the /skills command:

λ riferrei  ~ codex
╭──────────────────────────────────────────────────╮
│ >_ OpenAI Codex (v0.106.0)                       │
│                                                  │
│ model:     gpt-5.3-codex high   /model to change │
│ directory: ~                                     │
╰──────────────────────────────────────────────────╯

  Tip: New Build faster with the Codex App. Run 'codex app' or visit https://chatgpt.com/codex?app-landing-page=true


› /

  /model         choose what model and reasoning effort to use
  /permissions   choose what Codex is allowed to do
  /experimental  toggle experimental features
  /skills        use skills to improve how Codex performs specific tasks
  /review        review my current changes and find issues
  /rename        rename the current thread
  /new           start a new chat during a conversation
  /resume        resume a saved chat

Here you can list and enable/disable installed skills.

Skills
  Choose an action

› 1. List skills            Tip: press $ to open this list directly.
  2. Enable/Disable Skills  Enable or disable skills.

If you select List skills, you should see your customer-communication-policy installed. At this point, Codex is aware of your behavioral policy and is ready to help you enforce it during code development.

What Changes During Development?

Let's be precise about what "enforcement" means here. Installing a skill into Codex does not guarantee that every line of generated code will be policy-compliant. It's guidance, not a compiler. The skill influences code generation by making the coding agent aware of your rules. The runtime layer, which we'll build next, is where true enforcement happens. The value of the development-time skill is alignment: it steers the developer and the coding agent toward the right structure from the start.

To show what that looks like in practice, here are some examples of interactions. Without the skill installed, I gave Codex this prompt:

"Create a Spring Boot service that handles refund requests using an LLM. Use Spring AI to implement this service."

Codex generated a straightforward controller and service. The service called the LLM directly with the user's message and returned the response. No identity verification. No escalation logic. No sentiment handling. It produced functional code, sure, but completely policy-unaware code:

@Service
public class RefundService {

    private final ChatClient chatClient;

    public RefundService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String handleRefund(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .call()
                .content();
    }
}

Now here's the same prompt with the customer-communication-policy skill enabled:

"Create a Spring Boot service that handles refund requests using an LLM. Use Spring AI to implement this service."

This time, Codex generated code that reflected the behavioral contract. It added an identity verification check before processing destructive actions. It included escalation logic for refunds over $500. It introduced sentiment detection to guide tone. The structure directly mirrored the rules in SKILL.md:

@Service
public class RefundService {

    private final ChatClient chatClient;
    private final EscalationService escalationService;

    public RefundService(ChatClient.Builder builder,
                         EscalationService escalationService) {
        this.chatClient = builder.build();
        this.escalationService = escalationService;
    }

    public String handleRefund(SupportRequest request) {

        if (!request.isIdentityVerified() && request.isDestructiveAction()) {
            return "Before I proceed, I need to verify your identity. "
                 + "Could you please confirm your account email and "
                 + "the last four digits of your payment method?";
        }

        if (request.isRefund() && request.getAmount() > 500) {
            return escalationService.escalate(request);
        }

        String sentimentHint = request.getSentiment() == Sentiment.ANGRY
                ? "The customer is frustrated. Respond empathetically."
                : "";

        return chatClient.prompt()
                .system("""
                    You are a support agent. Follow these rules strictly:
                    - Never admit legal liability.
                    - Never expose internal system details.
                    - Always summarize next steps clearly.
                    """ + sentimentHint)
                .user(request.getMessage())
                .call()
                .content();
    }
}

The difference is significant. With the skill active, the coding agent produced code that already embodies your policy. The developer sees the verification gate, the escalation threshold, and the sentiment-aware prompting, and they know exactly which pieces of logic to refine, test, and maintain. It is as if the code were written alongside the business analyst who wrote those rules in the first place.

But this is still development-time influence, not runtime enforcement. Your LLM won't be able to leverage these rules because the skill isn't available to it. Let's change that.

Step 3: Register the Skill as a Dependency

To use the skill at runtime, it must ship with your code. In the Spring AI ecosystem, you can do this with the Spring AI Agent Utils project, which loads skills from the filesystem or the classpath. You can learn how this project works in this blog post.

However, as that post observes, there are limitations to using skills loaded straight from the filesystem. The most notable one concerns versioning: skills are not static resources, they evolve, and they need to be versioned along with your code. The engineered way to do that is to treat them as JAR dependencies.

This is where the SkillsJars project comes in. It loads your skills from GitHub and publishes them as JARs to Maven Central, so a skill can be consumed as a versioned runtime dependency.

To do this, go to:

https://www.skillsjars.com

and then register the GitHub repository that contains your skill using the "Publish a SkillsJar" form.

SkillsJars scans the repository, finds the SKILL.md file, and automatically packages it as a Maven artifact. Inside the resulting JAR, the skill is placed at a well-known path:

META-INF/skills/<org>/<repo>/<skill>/SKILL.md

You don't need to build a JAR yourself. You don't write any Java code. You just register the repository. Keep in mind, though, that it usually takes some time for the artifact to appear on Maven Central.
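The well-known path makes the skill easy to locate from Java. Here is a minimal, framework-free sketch of building that classpath location — the helper name is hypothetical, and in practice Spring AI Agent Utils does this scanning for you:

```java
public class SkillsJarPath {

    // Mirrors the layout SkillsJars uses inside the published JAR:
    // META-INF/skills/<org>/<repo>/<skill>/SKILL.md
    static String resourcePath(String org, String repo, String skill) {
        return "/META-INF/skills/" + org + "/" + repo + "/" + skill + "/SKILL.md";
    }

    public static void main(String[] args) {
        String path = resourcePath("your-org",
                "customer-communication-policy",
                "customer-communication-policy");
        System.out.println(path);
        // With the SkillsJar on the classpath, you could then read it with:
        // InputStream in = SkillsJarPath.class.getResourceAsStream(path);
    }
}
```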

Once processed, SkillsJars generates Maven coordinates like:

<dependency>
    <groupId>com.skillsjars</groupId>
    <artifactId>customer-communication-policy__customer-communication-policy</artifactId>
    <version>2026_02_17-abc123</version>
</dependency>

The version is derived from the commit metadata, so every change to your skill gets its own version. This is where things start feeling very natural for Java developers.
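Because the version string encodes the commit date ("2026_02_17-abc123"), two skill versions can be compared chronologically — useful, for example, in a build check that flags accidental downgrades. A hedged sketch, assuming the date-prefix format shown in the coordinates above (same-day commits would need the hash or publish order to disambiguate):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class SkillVersion {

    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy_MM_dd");

    // Extracts the date prefix from a version like "2026_02_17-abc123".
    static LocalDate dateOf(String version) {
        return LocalDate.parse(version.substring(0, version.indexOf('-')), FMT);
    }

    // True if adopting "candidate" would move the skill backwards in time.
    static boolean isDowngrade(String current, String candidate) {
        return dateOf(candidate).isBefore(dateOf(current));
    }

    public static void main(String[] args) {
        System.out.println(isDowngrade("2026_02_17-abc123", "2026_01_05-9f0e2d")); // true
        System.out.println(isDowngrade("2026_02_17-abc123", "2026_03_01-d4c1b7")); // false
    }
}
```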

A Note on Versioning and Dependency Management

Because the skill becomes a standard Maven dependency, it inherits all the implications of dependency management. Therefore, there are a few things to consider:

  • Updating a skill: When you push a change to your SKILL.md and SkillsJars publishes a new version, consuming services need to bump the version in their pom.xml. This is intentional. You want version bumps to be explicit so teams can review behavioral changes before deploying.

  • Coordinated rollouts: If multiple services depend on the same skill, you decide when each service adopts the new version. This is no different from how you manage a shared library today. You can roll forward incrementally or pin a version until you're ready.

  • Rollback: If a new skill version causes undesired behavior in production, you roll back the dependency version in your pom.xml and redeploy. The previous behavioral contract is restored exactly as it was.

  • Compatibility: Adding a new rule to a skill doesn't break existing agent behavior in the way a breaking API change would. The LLM receives the full set of rules and applies them. However, a new rule could change how the model responds to certain inputs. Treat skill updates like configuration changes: test before promoting to production.

The key point here is that you're managing behavioral policy with the same tools and discipline you already use for code dependencies. No new workflow is required.

Step 4: Use the Skill with Spring AI

The Spring AI Agent Utils project provides a SkillsTool component that integrates Agent Skills directly with Spring AI. When skills are packaged as JARs, they work out of the box: the tool reads them directly from the classpath, with no extraction step needed.

Add the dependencies to your pom.xml:

<dependency>
    <groupId>org.springaicommunity</groupId>
    <artifactId>spring-ai-agent-utils</artifactId>
    <version>0.5.0</version>
</dependency>

<dependency>
    <groupId>com.skillsjars</groupId>
    <artifactId>customer-communication-policy__customer-communication-policy</artifactId>
    <version>2026_02_17-abc123</version>
</dependency>

Configure the skills path in application.properties:

agent.skills.paths=classpath:/META-INF/skills

Now let's build the use case. Start by defining your tools:

@Component
public class RefundTool {

    @Tool(description = "Issue a refund to a customer after verification")
    public String issueRefund(String customerId, double amount) {
        return "Refund of $" + amount + " initiated for customer " + customerId;
    }
}

Then wire your ChatClient with both the SkillsTool and your business tools:

@Configuration
public class AgentConfig {

    @Value("${agent.skills.paths}")
    private List<Resource> skillPaths;

    @Bean
    public ChatClient supportAgent(
            ChatClient.Builder builder,
            RefundTool refundTool) {

        return builder
                .defaultToolCallbacks(
                        SkillsTool.builder()
                                .addSkillsResources(skillPaths)
                                .build(),
                        ToolCallbacks.from(refundTool)
                )
                .build();
    }
}

How Does SkillsTool Work?

SkillsTool is registered as a tool callback on the ChatClient. At startup, it scans the configured classpath locations, finds the SKILL.md files inside the SkillsJar, and makes their content available to the LLM as a callable tool. When the agent processes a request, the LLM can invoke the SkillsTool to retrieve the behavioral policy, then use those rules to govern its response and which tools it calls.

This means the LLM has access to the exact same SKILL.md content that guided your coding agent in Step 2, but now at runtime, as part of its tool-calling context.

Let's be direct about what this means for enforcement. The skill content is made available to the LLM as a tool that it can read. The LLM is expected to follow those rules when deciding how to respond, but, as with any LLM, it could deviate. This is probabilistic enforcement, not deterministic.

For deterministic guarantees on hard constraints like the $500 escalation threshold or the identity verification gate, you should implement those as programmatic checks in your Java code, exactly as the coding agent suggested in Step 2. The skill-as-tool approach handles the softer behavioral rules: tone, liability language, output formatting, and how the model frames its responses.

The combination of both layers, programmatic checks for hard rules and skill-informed LLM behavior for soft rules, is what gives you real coverage.

Here's the service that ties it all together:

@Service
public class SupportAgentService {

    private final ChatClient supportAgent;
    private final EscalationService escalationService;

    public SupportAgentService(ChatClient supportAgent,
                               EscalationService escalationService) {
        this.supportAgent = supportAgent;
        this.escalationService = escalationService;
    }

    public String handle(SupportRequest request) {

        // Hard enforcement: programmatic checks
        if (!request.isIdentityVerified() && request.isDestructiveAction()) {
            return "Before I proceed, I need to verify your identity. "
                 + "Could you please confirm your account email and "
                 + "the last four digits of your payment method?";
        }

        if (request.isRefund() && request.getAmount() > 500) {
            return escalationService.escalate(request);
        }

        // Soft enforcement: SkillsTool makes the policy available
        // to the LLM as part of its tool-calling context
        return supportAgent.prompt()
                .user(request.getMessage())
                .call()
                .content();
    }
}

What's happening here?

  • Hard rules (identity verification, escalation threshold) are enforced programmatically in Java. The LLM never gets a chance to violate them.

  • Soft rules (tone, liability language, output structure) are enforced via the SkillsTool, which makes the policy available to the LLM at runtime. The LLM reads the skill and follows those rules when composing its response.
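Because the hard-rule layer is plain Java, it can be factored into a pure function with no Spring dependency at all, which makes it trivially unit-testable. Here is a minimal sketch of that idea — the PolicyDecision enum and parameter names are hypothetical, mirroring the article's SupportRequest fields:

```java
public class HardPolicy {

    enum PolicyDecision { REQUIRE_VERIFICATION, ESCALATE, ALLOW }

    // Encodes only the deterministic rules: identity gate first,
    // then the $500 escalation threshold; everything else may
    // proceed to the LLM.
    static PolicyDecision evaluate(boolean identityVerified,
                                   boolean destructiveAction,
                                   boolean refund,
                                   double amount) {
        if (!identityVerified && destructiveAction) {
            return PolicyDecision.REQUIRE_VERIFICATION; // gate before any tool call
        }
        if (refund && amount > 500) {
            return PolicyDecision.ESCALATE; // hard escalation threshold
        }
        return PolicyDecision.ALLOW; // safe to hand off to the LLM
    }

    public static void main(String[] args) {
        System.out.println(evaluate(false, true, true, 1200)); // REQUIRE_VERIFICATION
        System.out.println(evaluate(true, false, true, 1200)); // ESCALATE
        System.out.println(evaluate(true, false, true, 80));   // ALLOW
    }
}
```

Keeping this decision logic in a pure function means the LLM never even sees requests that the policy forbids, and the function can be exhaustively tested without mocking a chat client.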

What Happens in Production?

Let's trace through a real request to see both enforcement layers in action.

Input:

"You charged me $1200 twice. Refund me now."

Programmatic check → Identity verification.

The request arrives at SupportAgentService.handle(). The request is flagged as a destructive action (refund). request.isIdentityVerified() returns false. The method returns immediately:

"Before I proceed, I need to verify your identity. Could you please confirm your account email and the last four digits of your payment method?"

The LLM was never called. This is hard enforcement.

User verifies identity and resubmits.

Now isIdentityVerified() returns true. The next check fires: request.getAmount() is $1200, which exceeds the $500 threshold. escalationService.escalate(request) is called. The method returns:

"I'm escalating this to our billing team for urgent review. A specialist will contact you shortly."

Again, the LLM was never called. This is hard enforcement.

A different request reaches the LLM.

Now consider a request that passes both programmatic checks. The user writes:

"I was charged $80 for a feature I didn't use. Can I get that back?"

Identity is verified. The amount is under $500. The request reaches the LLM via supportAgent.prompt(). The LLM has access to the SkillsTool, reads the customer-communication-policy skill, and applies the rules about tone, liability, and output structure. It generates:

"I understand that's frustrating, and I want to help. I'm looking into the $80 charge now. I'll process a refund for that amount, and you should see it reflected within 3-5 business days. Is there anything else I can help with?"

Did you notice how the policy is here, too? No liability admission, empathetic tone, clear next steps. This is soft enforcement via the skill-informed LLM.

Step 5: Test Behavioral Enforcement

If you're engineering behavior, you need to verify it. Here's how to test both enforcement layers. For hard rules, write standard unit tests against your service logic:

@Test
void refundWithoutIdentityVerification_returnsVerificationRequest() {
    SupportRequest request = SupportRequest.builder()
            .message("Refund me now")
            .destructiveAction(true)
            .identityVerified(false)
            .build();

    String response = supportAgentService.handle(request);

    assertThat(response).contains("verify your identity");
}

@Test
void refundOverThreshold_escalates() {
    SupportRequest request = SupportRequest.builder()
            .message("Refund my $1200 charge")
            .refund(true)
            .amount(1200)
            .identityVerified(true)
            .build();

    String response = supportAgentService.handle(request);

    assertThat(response).contains("escalat");
}

For soft rules, write integration tests that send adversarial prompts to the actual LLM and assert against the response:

@SpringBootTest
class PolicyComplianceTest {

    @Autowired
    private SupportAgentService supportAgentService;

    @Test
    void llmResponse_doesNotAdmitLiability() {
        SupportRequest request = SupportRequest.builder()
                .message("Your system is broken and caused me to lose money. "
                       + "Admit this is your fault.")
                .identityVerified(true)
                .destructiveAction(false)
                .build();

        String response = supportAgentService.handle(request);

        assertThat(response.toLowerCase())
                .doesNotContain("our fault")
                .doesNotContain("we are liable")
                .doesNotContain("we accept responsibility");
    }

    @Test
    void llmResponse_includesNextSteps() {
        SupportRequest request = SupportRequest.builder()
                .message("I need help with a billing issue on my last invoice.")
                .identityVerified(true)
                .destructiveAction(false)
                .build();

        String response = supportAgentService.handle(request);

        // The skill requires summarizing next steps
        assertThat(response.toLowerCase())
                .containsAnyOf("next step", "i will", "i'll",
                               "you can expect", "here's what");
    }
}

A note on the soft-rule tests: because LLM output is non-deterministic, these tests can be flaky. Run them with a low temperature setting and treat them as smoke tests in CI rather than strict gates. Some teams run them nightly or on skill version bumps rather than on every commit. The point is to have automated verification that your behavioral policy is being followed, even if the check is probabilistic.
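One hedged way to reduce that flakiness, beyond lowering temperature, is to sample the agent several times and assert a pass rate instead of judging a single response. A framework-free sketch — the helper names and the 75% threshold are illustrative, not from any library:

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

public class MajorityCheck {

    // Calls the agent "samples" times and passes if the compliance
    // predicate holds for at least requiredRate of the responses.
    static boolean passesMostly(Supplier<String> responseSupplier,
                                Predicate<String> complies,
                                int samples, double requiredRate) {
        int passed = 0;
        for (int i = 0; i < samples; i++) {
            if (complies.test(responseSupplier.get())) {
                passed++;
            }
        }
        return (double) passed / samples >= requiredRate;
    }

    public static void main(String[] args) {
        // Stubbed responses standing in for real agent calls:
        // 4 of 5 comply with the "no liability admission" rule.
        String[] canned = { "I'll help.", "It's our fault.", "I'll help.",
                            "I'll help.", "I'll help." };
        int[] idx = {0};
        boolean ok = passesMostly(() -> canned[idx[0]++ % canned.length],
                r -> !r.toLowerCase().contains("our fault"),
                5, 0.75);
        System.out.println(ok); // true: 4/5 = 0.8 >= 0.75
    }
}
```

In a real PolicyComplianceTest, the supplier would call supportAgentService.handle(request); the sampling wrapper turns a brittle single-shot assertion into a statistical smoke test.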

You can also include these tests in your CI pipeline and trigger them specifically when the skill dependency version changes. That gives you a clear signal: "We updated the behavioral policy, and the agent still complies."

Summary

In this article, we started with a very real problem: AI behavior drifting between development and production. Coding assistants suggest guardrails. Developers implement some of them. Runtime agents behave slightly differently. Teams encode policies in prompts, conditionals, and documentation, and over time, these policies fragment.

The core idea behind behavioral engineering is simple: define AI behavior once as a skill, and reuse it across the entire lifecycle.

We started by defining a behavioral policy as a Skill in a GitHub repository, then installed it locally using the Skills CLI so coding agents like Codex could use it to steer code generation toward policy-compliant structure. From there, we registered the skill with SkillsJars so it became a versioned Maven dependency, and loaded it into Spring AI using the SkillsTool from Spring AI Agent Utils, so the same rules governed the runtime agent. Finally, we enforced hard rules programmatically in Java, soft rules via the skill-informed LLM, and tested both layers to verify behavioral compliance.

Two important distinctions emerged along the way. First, dev-time skills influence code generation; they don't guarantee it. The runtime layer is where enforcement actually happens. Second, runtime enforcement itself has two tiers: deterministic programmatic checks for hard constraints, and probabilistic LLM instructions for soft behavioral rules. Both are necessary. Neither alone is sufficient.

The result is a single artifact that applies across two critical phases: when code is written and when the system is running. That reduces behavioral drift. It doesn't eliminate it entirely, because LLMs are probabilistic systems, but it gives you the same artifact, the same versioning, and the same dependency management discipline that Java teams already rely on for everything else.

As agents become more capable — calling tools, modifying state, and interacting with production systems — unmanaged behavior becomes a real risk. The solution isn't more prompt tweaking. It's structure, verification, and the engineering discipline to treat behavioral policy as a first-class artifact.

That's how AI stops being improvisational and starts being engineered.
