Kenta Takeuchi

Posted on Mar 15 • Originally published at bmf-tech.com

Engineering in the AI Era: Evolution from Prompts to Autonomous Systems

#llmasjudge #promptascode #promptops #contextengineering

This article was originally published on bmf-tech.com.

Introduction: Overview of the Paradigm Shift

Between 2025 and 2026, software engineering is entering a fundamental turning point. This is not merely the "introduction of AI tools" but a transition to "Agentic Engineering," where AI becomes the main actor in the development process. In traditional engineering, developers wrote code while AI served as an assistant. However, AI agents are increasingly taking the lead in the entire SDLC (Software Development Life Cycle), from requirement definition to design, development, and testing.

The technological foundation supporting this transition has evolved through stages: prompt engineering → Prompt as Code → PromptOps → Context Engineering. This article explains these paradigms that engineers need to understand and practical implementation approaches.

Prompt as Code: Treating Prompts as Code

Why Prompt as Code is Necessary

In the early use of LLMs, prompts were adjusted haphazardly, like "magic spells." However, prompts are the "instruction set" in modern AI applications, and changes to them can have significant impacts on the behavior of the entire system. Treating prompts merely as text inputs can lead to semantic drift (a phenomenon where ambiguous instructions cause unstable outputs) and unpredictable errors, accumulating technical debt.

Prompt as Code is a paradigm that manages prompts with the same rigor as source code. This allows for version control, automated testing, continuous integration, and security audits.

Prompt-Layered Architecture (PLA)

PLA is a layered approach to systematically manage prompts. It consists of four layers:

Layer	Role	Effect
Prompt Configuration Layer	Template definition and variable embedding	Improved reusability
Orchestration Layer	Chaining multiple prompts and state management	Achieving complex reasoning
Response Interpretation Layer	Structuring, validating, and routing outputs	Ensuring deterministic processing
Domain Memory Layer	Persistence of business logic	Enabling continuous dialogue

Implementation Approach

Prompts should not be hardcoded; instead, separate variables using template engines like Jinja2. For example:

# prompts/extraction.yaml
version: 1.2.0
template: |
  Please extract {{entity_type}} from the following text.
  Output format: {{output_format}}

  Text: {{input_text}}
variables:
  - entity_type: required
  - output_format: required
  - input_text: required

Manage this prompt with Git and run automated tests in the CI/CD pipeline upon changes. Key tools include Promptfoo (CLI, local execution), LangSmith (integrated with LangChain ecosystem), and PromptLayer (no-code editor, A/B testing).

PromptOps: Establishing Operational Discipline

From DevOps to PromptOps

PromptOps is a systematic methodology that treats prompts as a "first-class operational asset." Just as DevOps accelerated code delivery, PromptOps standardizes and automates the entire lifecycle of prompts.

In traditional batch learning, models functioned as "frozen time capsules" unaware of environmental changes. In PromptOps, continuous monitoring, evaluation, and optimization of prompts address prompt drift (gradual degradation of performance).

LLM-as-a-Judge: Implementing Automated Evaluation

In evaluating the quality of prompts, traditional string matching (BLEU, ROUGE) fails to accurately assess outputs that are semantically correct but expressed differently. LLM-as-a-Judge is a method that utilizes high-performance LLMs (like GPT-4, Claude) as evaluators.

Five steps for implementation:

Define evaluation criteria (accuracy, tone, conciseness, etc.)
Prepare a Golden Dataset (test cases extracted from production data)
Create correct data (human labeling)
Design evaluation rubrics (scoring criteria)
Execute automated evaluation and integrate with CI

Be mindful of evaluator biases (self-preference, position bias, redundancy bias) and implement mitigation strategies, such as selecting different model families as evaluators or swapping the order of presentation for double evaluation.

Programmatic Optimization with DSPy

Stanford University's DSPy is a framework that automates prompt design. Developers define input-output contracts (Signatures) and simply pass them to modules like ChainOfThought or ReAct, allowing the optimizer to automatically generate the optimal prompt. When switching models, recompiling generates instructions optimized for the new model, dramatically reducing the cost of manual rewrites.

Context Engineering: Systematic Design of the Environment

Design Philosophy as Entropy Reduction

Context Engineering is the discipline of designing the "information environment" in which AI operates. Its theoretical foundation lies in entropy reduction. Since LLM outputs are based on probability distributions, ambiguous contexts increase uncertainty (entropy) and degrade quality. Proper context design constrains the interpretive space available to the model, achieving consistent outputs.

Multi-layered Memory Mechanism

A three-layer structure mimicking human memory systems is effective:

Short-term Memory: Information within the current dialogue session (system prompts, conversation history)
Working Memory: Temporary states during task execution (intermediate calculation results, tool invocation history)
Long-term Memory: User settings, past interactions, domain knowledge (vector DB, RAG)

Standardization of Model Context Protocol (MCP)

The MCP, announced by Anthropic at the end of 2024, is a standard protocol for connecting AI systems with data sources and tools. Previously, each tool integration required individual connectors, but MCP provides a common interface similar to HTTP.

The MCP server provides "tools (Infrastructure)," while SKILL.md (described later) provides "knowledge." This separation allows independent management of tool implementation details and business logic.

Retrieval-Augmented Generation (RAG) and Dynamic Data Integration

RAG is a technique that searches for relevant information from external knowledge bases and injects it into prompts. Reliable RAG systems require evaluation of the RAG Triad:

Context Relevance: Is the retrieved information relevant to the question?
Groundedness: Is the answer based solely on the retrieved information?
Question/Answer Relevance: Does the answer directly respond to the user's question?

Security and Governance

As we approach 2026, the threats to AI security are diversifying:

Prompt Injection: Malicious inputs overwrite system instructions
Indirect Injection: Activation of instructions embedded in external PDFs or web pages
Token Smuggling: Encoding harmful instructions in Base64 or similar to evade filters

Countermeasures include input sanitization, context isolation (explicitly marking untrusted content), and the implementation of guardrail models like PromptGuard and LlamaFirewall.

Transition to Agentic Engineering and Continuous AI

Agentic Engineering is a paradigm where AI evolves from a mere assistive tool to an autonomous entity capable of thinking, planning, and acting. The dramatic expansion of the context window (from 1 million to 2 million tokens) enables agents to understand entire codebases of thousands of lines and autonomously perform multi-file editing and tool execution.

The utilization of AI within organizations evolves through the following maturity levels:

Level	Characteristics	Role of AI
Level 0	Manual Processes	None
Level 1	AI Code Completion	Assistant
Level 2	AI Pair Programming	Collaborator
Level 3	Agent-based Development	Leader

Advanced companies define a "constitution of the project" in AGENTS.md, documenting the roles of agents, available tools, and decision-making processes. They also formalize specific workflows (e.g., PR review procedures, deployment flows) in SKILL.md for agents to learn.

Continuous AI: Evolving Systems

Traditional ML models are trained on static datasets and remain fixed after deployment. However, the real world is non-stationary, with user behavior and environments changing constantly. Continuous AI (Continual Learning) is a technique that allows for the incremental incorporation of new knowledge while maintaining past skills.

The biggest challenge is catastrophic forgetting: the phenomenon where knowledge of previous tasks is rapidly lost when learning new tasks. Three approaches to overcome this:

Replay Methods: Saving a portion of past data and mixing it with new data for learning (Experience Replay)
Regularization Methods: Penalizing changes to important parameters (EWC: Elastic Weight Consolidation)
Parameter Isolation: Using different subnetworks for each task (LoRA, Mixture-of-Experts)

The open-source library Avalanche is useful for implementation. It integrates benchmarks, training strategies, and evaluation metrics, allowing for the implementation of continual learning with just a few lines of code.

Organizational Success Factors for Prompt-Driven Development

To successfully operate LLM systems in production, it is essential to establish not only individual skills but also the overall organizational structure and processes.

Cross-Functional Governance Structure

To clarify responsibilities regarding prompts and AI behavior, establish a cross-functional AI governance committee comprising IT, legal, risk management, data science, and business units. This committee defines approval workflows based on risk classification (low, high, unacceptable) and has the authority to block the deployment of models or prompts that do not meet standards.

Formalization of Knowledge

Advanced companies formalize organizational knowledge through two documents: AGENTS.md (the constitution of the project) and SKILL.md (procedures for specific workflows). AGENTS.md documents the roles of agents, available tools, and decision-making processes, while SKILL.md describes specific workflows such as "PR review procedures" and "deployment flows." This prevents individualization and allows agents to learn the organization's standard processes.

Placement of Prompt Advocates

Rapidly growing AI startups have established dedicated roles called "Prompt Advocates." This role is responsible for reviewing prompt designs across teams, educating on best practices, and maintaining the overall quality of prompts in the organization. They function as "cultivators of engineering culture" that ensure quality while respecting the autonomy of each team rather than imposing centralized control.

Incorporating Security and Compliance

Security should not be an afterthought but integrated from the design phase. Adopt a zero-trust architecture and monitor all LLM communications through AI gateways (like Portkey or Helicone). Standardize automatic detection and masking of PII (Personally Identifiable Information) and the implementation of guardrails against prompt injection.

Skill Set Required for LLM Engineers

LLM engineers are required to acquire new technical domains in addition to traditional software development skills. Below is a practical learning roadmap.

Fundamental Skills: Prompt Design and Evaluation

Prompt Template Creation: The ability to consolidate scattered prompts into YAML files in a Git repository and template variables. Understanding of template engines like Jinja2 is necessary.

Evaluation Design: Creating a Golden Dataset (test cases extracted from production data) and designing evaluation rubrics using LLM-as-a-Judge. Understanding the limitations of superficial metrics like BLEU/ROUGE and the ability to select semantic evaluation methods.

Architectural Skills: Context Design

Implementation of RAG (Retrieval-Augmented Generation): Selecting vector DBs, understanding embedding models, and evaluating quality using the RAG Triad (context relevance, groundedness, question/answer relevance).

Utilization of MCP (Model Context Protocol): Understanding and implementing the standard protocol for integrating external tools and data sources with agents.

Multi-layered Memory Mechanism: The ability to design short-term memory (conversation history), working memory (intermediate states), and long-term memory (vector DB) appropriately.

Operational Skills: PromptOps and MLOps

CI/CD Integration: Implementing automated tests for prompt changes. Designing regression tests with tools like Promptfoo and integrating with GitHub Actions or Jenkins.

Drift Detection: The ability to monitor performance degradation of models or data and design triggers for continual learning. Building continual learning pipelines using frameworks like Avalanche.

Security Skills

Prompt Injection Countermeasures: Understanding attack methods such as direct/indirect injection and token smuggling, and implementing input sanitization, context isolation, and guardrail models (like PromptGuard).

Collaborative Skills

LLM systems cannot be completed by engineers alone. The ability to collaborate with product managers, legal personnel, and domain experts to balance technical constraints with business requirements is crucial. Particularly, skills to create an environment where non-engineers can directly edit prompts using no-code tools like PromptLayer are required.

Conclusion

Engineering in the AI era is shifting from "writing code" to "designing environments where AI can work effectively." Prompts are not just inputs; they are code, operational assets, and the intelligent interface of the system. Establishing the disciplines of PromptOps and Context Engineering and building a foundation for agents to evolve autonomously becomes a new responsibility for engineers.

By 2030, AI systems will evolve from "static artifacts" to "lifelong learning agents." The paradigms and practices explained in this article will technically support that transition. Engineers need to shift their roles from "users of AI" to "nurturers of AI," and then to "architects collaborating with AI."

DEV Community