HITL Human-AI Collaboration: Why AI Code Generation Still Needs Human Oversight

> **Cowork Forge** is an open-source AI multi-agent development platform that serves as both an embeddable AI coding engine and a standalone, production-grade development tool. GitHub: https://github.com/sopaco/cowork-forge

Introduction

Imagine this scenario: You tell AI "Help me write a user login feature," and within seconds, AI generates a large amount of code. Sounds efficient, right?

But here's the question: Is this code secure? Does it actually meet your requirements? Does it follow best practices?

If you fully trust AI and directly deploy this code to production, it could have serious consequences. This is why even with powerful AI, we still need human oversight.

HITL (Human-in-the-Loop) isn't just a simple confirmation dialog. It's a deliberate design philosophy that balances automation efficiency against output quality.

In this article, I'll take an in-depth look at how Cowork Forge applies HITL: why human verification is needed, how to design an effective HITL mechanism, and best practices for real-world use.


The Credibility Issue of AI-Generated Content

Before discussing HITL, we need to understand why AI-generated content needs human verification.

AI's Limitations

Although Large Language Models (LLMs) excel at code generation, they still have some inherent limitations.

First, hallucinations. AI may generate content that looks reasonable but is actually incorrect. In code generation, this shows up as calls to non-existent APIs or libraries, incorrect syntax or semantics, or code that looks right but contains logic errors. For example, AI might generate `let result = user.authenticate(password);` even though the `authenticate` method doesn't exist at all.

Second, insufficient context understanding. AI's context window is limited, so it may not grasp the entire project's context. The result is generated code that clashes with the existing code style, ignores project conventions and standards, or fails to consider interactions with other modules. Imagine AI writing new code without knowing the project already has a shared logging module: it implements its own logging, and the codebase ends up inconsistent.

Third, lack of domain knowledge. AI's training data comes largely from public code repositories, so it may lack expertise in specific domains, such as industry-specific compliance requirements, internal company coding standards, or complex business logic. For example, when developing a financial trading system, AI might not understand PCI DSS compliance requirements, know the company's internal security standards, or have experience designing high-performance trading systems.

Fourth, inability to make complex decisions. Certain decisions require human judgment and experience, such as technology selection trade-offs, architecture design compromises, and security policy formulation. AI may not understand the complex considerations behind these decisions. For example, when selecting a database, AI might only consider performance factors while ignoring team technology stack, operational costs, data migration difficulty, and other multi-dimensional trade-offs.

Risks of Full Automation

If we fully trust AI and skip human verification, it could bring serious risks.

First, security vulnerabilities. AI-generated code may contain SQL injection, XSS attacks, permission bypass, sensitive information leakage, and other security vulnerabilities. Once exploited, the consequences could be disastrous. For example, AI-generated query code might not use parameterized queries, leading to SQL injection vulnerabilities.
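To make this concrete, here's a minimal Rust sketch contrasting the two query styles. It uses the rusqlite crate purely for illustration; this is not Cowork Forge's actual code, just the general pattern AI-generated code can get wrong.

```rust
use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;
    conn.execute("CREATE TABLE users (name TEXT, pw TEXT)", [])?;
    conn.execute("INSERT INTO users VALUES ('alice', 'secret')", [])?;

    // Attacker-controlled input.
    let name = "nobody' OR '1'='1";

    // VULNERABLE: the input is spliced into the SQL text, so the
    // injected OR clause matches every row in the table.
    let unsafe_sql = format!("SELECT count(*) FROM users WHERE name = '{name}'");
    let hits: i64 = conn.query_row(&unsafe_sql, [], |row| row.get(0))?;
    println!("string-built query matched {hits} row(s)"); // 1, via injection

    // SAFE: ?1 is bound as data and never parsed as SQL.
    let hits: i64 = conn.query_row(
        "SELECT count(*) FROM users WHERE name = ?1",
        params![name],
        |row| row.get(0),
    )?;
    println!("parameterized query matched {hits} row(s)"); // 0

    Ok(())
}
```

Both queries look plausible at a glance, which is exactly why a human reviewer needs to check for the parameterized form.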

Second, quality issues. Code quality might not meet standards, such as performance problems, poor maintainability, insufficient test coverage, improper error handling. These issues increase subsequent maintenance costs. For example, AI-generated code might not consider concurrent access, leading to performance bottlenecks.

Third, business logic errors. AI might misunderstand business requirements, leading to feature implementation that doesn't meet expectations, improper handling of edge cases, or incorrect understanding of business rules. This causes the product to fail to meet user needs. For example, AI might misunderstand the requirement "users can delete their own tasks" and think administrators can also delete users' tasks.

Fourth, compliance issues. It might violate industry regulations or company policies, such as insufficient data privacy protection, missing audit logs, improper access control. These issues could lead to legal risks. For example, AI-generated user registration functionality might not record when and how users consented to terms, violating privacy regulations like GDPR.

HITL Design Philosophy

The core philosophy of HITL is: introduce human verification at key decision points to ensure the quality and controllability of AI-generated content.

This doesn't mean humans need to participate in every stage, but rather verify at key, high-impact decision points. This design philosophy is based on balancing efficiency and quality—automate repetitive, standardized tasks, and verify key decisions and complex logic manually, maintaining high efficiency while ensuring quality.

At the same time, it reflects gradual trust building—start with small tasks to build trust in AI, gradually improve AI reliability through human verification, and gradually reduce human intervention based on trust.

More importantly, it retains control—humans retain control over key decisions, AI serves as an assistant rather than a decision-maker, and can intervene and adjust when needed.

Finally, it establishes a learning and feedback loop—collect feedback through human verification, improve AI behavior and output quality, and establish a continuous improvement mechanism.


Identifying Key Decision Points

So which stages need human verification? Not all stages require human intervention—we need to identify key decision points.

Characteristics of Key Decision Points

Whether a decision point needs human verification depends on several factors.

First, scope of impact. If this decision affects multiple modules or stages, it needs human verification; if the scope of impact is small, AI can be trusted. For example, modifying a PRD affects subsequent design, coding, testing, and other stages, so it needs human verification. But optimizing a function's internal implementation has a small scope of impact, so AI can be trusted.

Second, reversibility. If a decision is irreversible or difficult to reverse, it needs human verification; if it can be easily rolled back, AI can be trusted. For example, deleting a core database table is hard to reverse once executed, so it needs human verification. But adding a new function can be easily deleted, so AI can be trusted.

Third, complexity. If a decision involves complex logic or trade-offs, it needs human verification; if it's a simple, clear choice, AI can be trusted. For example, technology selection involves trade-offs across performance, cost, team skills, and other factors, so it needs human verification. But which sorting algorithm to use can be chosen automatically based on data characteristics, so AI can be trusted.

Fourth, risk level. If decision failure leads to serious consequences, it needs human verification; if the risk is controllable, AI can be trusted. For example, deploying to production might cause service interruption if it fails, so it needs human verification. But running unit tests only requires fixing bugs if they fail, so the risk is controllable and AI can be trusted.
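These four factors combine into a simple routing rule. The sketch below is purely illustrative (the struct and function are hypothetical, not Cowork Forge internals), but it captures the logic: any single red flag is enough to pull a human into the loop.

```rust
/// A hypothetical model of the four factors discussed above.
struct DecisionPoint {
    affects_multiple_stages: bool, // scope of impact
    reversible: bool,              // can it be rolled back cheaply?
    complex_tradeoffs: bool,       // does it weigh competing concerns?
    high_risk: bool,               // serious consequences on failure
}

/// One red flag is enough to require human review; only local,
/// reversible, simple, low-risk decisions are left to AI.
fn needs_human_review(d: &DecisionPoint) -> bool {
    d.affects_multiple_stages || !d.reversible || d.complex_tradeoffs || d.high_risk
}

fn main() {
    // Modifying the PRD ripples into design, coding, and testing.
    let prd_change = DecisionPoint {
        affects_multiple_stages: true,
        reversible: true,
        complex_tradeoffs: false,
        high_risk: false,
    };
    // Optimizing one function body is local and easy to undo.
    let local_opt = DecisionPoint {
        affects_multiple_stages: false,
        reversible: true,
        complex_tradeoffs: false,
        high_risk: false,
    };
    assert!(needs_human_review(&prd_change));
    assert!(!needs_human_review(&local_opt));
}
```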

Key Verification Points in Cowork Forge

Based on the above principles, Cowork Forge sets human verification at several key nodes.

Requirements Collection Confirmation is the first verification point. Requirements are the foundation of the entire development process: if they're misunderstood, all subsequent work deviates. AI might misread the user's true intent, and inaccurate requirement scoping leads to scope creep during development. So verification is needed to confirm whether AI correctly understood the core goal, whether the functional scope is accurate, whether user roles are correctly identified, and whether the constraints are complete.

PRD Confirmation is the second verification point. The PRD is the blueprint for product development, directly affecting subsequent technical design and implementation. AI might miss important requirements or acceptance criteria, and requirement priorities might be unreasonable. So verification is needed to confirm whether functional requirements are complete, whether user stories are clear, whether acceptance criteria are testable, and whether non-functional requirements are thoroughly considered.

Design Confirmation is the third verification point. Technical design determines the system's architecture and implementation approach. AI might choose an inappropriate technology stack, and the architecture design might not match the project's actual situation. So verification is needed to confirm whether technology stack selection is reasonable, whether architecture design is feasible, whether component division is reasonable, and whether the data model is correct.

Code Plan Confirmation is the fourth verification point. The code plan determines which files will be modified. AI might miss files that need modification or include files that shouldn't be modified. So verification is needed to confirm the list of files to create, the list of files to modify, the change description for each file, and whether the change order is reasonable.

Trigger Conditions for Verification Points

HITL verification point triggers aren't fixed but dynamically determined based on conditions.

For example, in the PRD generation stage, if AI's confidence is below 0.8, human verification is triggered. In the technical design stage, if complexity reaches advanced level, human verification is triggered. In the coding stage, if confidence is below 0.7 or complexity reaches intermediate level, human verification is triggered. In the check stage, human verification usually isn't triggered because this is an automated verification process.
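Expressed as code, such a trigger table might look like the following sketch. The thresholds (0.8, 0.7, and the complexity levels) come from the description above; the type and function names are my own illustration, not the real Cowork Forge API.

```rust
#[derive(PartialEq, PartialOrd)]
enum Complexity {
    Basic,
    Intermediate,
    Advanced,
}

enum Stage {
    Prd,
    Design,
    Coding,
    Check,
}

/// Returns true when human verification should interrupt the
/// pipeline at this stage (thresholds as described above).
fn hitl_triggered(stage: &Stage, confidence: f64, complexity: Complexity) -> bool {
    match stage {
        Stage::Prd => confidence < 0.8,
        Stage::Design => complexity >= Complexity::Advanced,
        Stage::Coding => confidence < 0.7 || complexity >= Complexity::Intermediate,
        // The check stage is automated verification: no human gate.
        Stage::Check => false,
    }
}

fn main() {
    // Low-confidence PRD generation forces a human review...
    assert!(hitl_triggered(&Stage::Prd, 0.75, Complexity::Basic));
    // ...while the automated check stage never does.
    assert!(!hitl_triggered(&Stage::Check, 0.1, Complexity::Advanced));
}
```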

The benefit of this dynamic trigger mechanism is: for simple, clear tasks, human verification can be skipped to improve efficiency; for complex, uncertain tasks, human verification is mandatory to ensure quality.

Actual Case Analysis: Consequences of Wrong Decisions

Let's look at an actual case to see what happens if human verification is skipped.

Suppose you're developing a payment system. AI generates a PRD covering only two features, credit card payment and refund support, but misses key security requirements such as PCI DSS compliance, transaction logging and auditing, and risk control rules.

If you develop directly from this PRD, the system might ship with serious security vulnerabilities, could cause user financial loss, and might violate industry regulations. This is why human verification matters so much at the PRD confirmation stage: a human reviewer reads the PRD, spots the missing security requirements, and asks AI to add them.

This case tells us that while AI is powerful, it may lack expertise in specific domains. In key domains like finance, healthcare, and security, human verification is particularly important.


HITL Mechanism Design Philosophy

HITL isn't just a simple "confirm" button—it's a complete interaction mechanism that needs careful design.

Not Just a Simple Confirmation Dialog

Many systems' HITL mechanisms are just a simple confirmation dialog: "AI has completed PRD generation. Continue? [Cancel] [Confirm]"

The limitations of this design are obvious: users don't know what content AI generated, can't modify the content, and lack contextual information.

Cowork Forge's HITL Design

Cowork Forge's HITL mechanism is more comprehensive.

First, content display. It displays AI output in a clear, readable format, provides sufficient contextual information, and supports multiple output formats (Markdown, JSON, etc.). For example, when displaying a PRD, it uses Markdown format and includes product overview, functional requirements list, user stories and acceptance criteria, and non-functional requirements, with clear headings and structure for each section.
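For illustration, a displayed PRD might be structured like this (an invented example that follows the sections listed above):

```markdown
# PRD: Team Task Manager

## Product Overview
A lightweight task manager for small teams.

## Functional Requirements
1. Users can create, edit, and delete their own tasks.
2. Users can tag tasks and filter by tag.

## User Stories & Acceptance Criteria
- As a user, I can delete a task I created.
  - Given a task I own, when I click Delete, the task is removed.

## Non-Functional Requirements
- All destructive actions are audit-logged.
- P95 page load stays under 500 ms.
```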

Second, editing support. It allows users to directly edit AI output, supports external editor integration, and preserves edit history. For example, users can directly edit the PRD in the terminal or launch external editors like VS Code for editing. Edit history is preserved for easy user backtracking and comparison.

Third, regeneration. If users aren't satisfied, they can request AI to regenerate, can provide feedback to guide AI on how to improve. For example, users can say "This PRD is missing user tagging functionality, please add it," and AI will regenerate the PRD based on this feedback.

Fourth, interaction flow. It provides a complete interaction flow, allowing users to choose to confirm directly, need editing, request regeneration, or reject. Users can choose the most appropriate operation based on their judgment.

Together, these options form the complete HITL interaction flow: every choice has clear processing logic, and the user always knows what happens next. This design gives users full control, allowing them to choose the most appropriate operation based on their own judgment.
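In code, those four choices reduce to a small state machine. Here's a hypothetical Rust sketch (illustrative names, not the real Cowork Forge API) of how a HITL controller might dispatch on the user's decision:

```rust
/// The four outcomes a user can choose during HITL review.
enum HitlDecision {
    Confirm,            // accept AI output and move to the next stage
    Edit(String),       // user-modified content replaces AI output
    Regenerate(String), // feedback that guides the next attempt
    Reject,             // abort this stage entirely
}

/// Returns the content the pipeline should continue with,
/// or None when another round (or an abort) is needed.
fn handle(decision: HitlDecision, ai_output: String) -> Option<String> {
    match decision {
        HitlDecision::Confirm => Some(ai_output),
        HitlDecision::Edit(edited) => Some(edited),
        HitlDecision::Regenerate(feedback) => {
            // A real controller would re-invoke the agent with the
            // feedback folded into the prompt, then loop back to review.
            println!("regenerating with feedback: {feedback}");
            None
        }
        HitlDecision::Reject => None,
    }
}

fn main() {
    let approved = handle(HitlDecision::Confirm, "PRD v1".to_string());
    assert_eq!(approved.as_deref(), Some("PRD v1"));
}
```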

Content Display and Interaction Experience Design

Content display has several principles.

First, clarity. Use clear headings and structure, highlight important information, use appropriate formatting (tables, lists, code blocks). For example, when displaying technology stack selection, it not only lists the selected technologies but also explains the selection rationale and lists alternatives.

Second, completeness. Display all relevant information, don't miss important details, provide contextual explanations. For example, when displaying code plans, it not only lists files to modify but also explains the change content and reason for each file.

Third, actionability. Clearly tell users what they can do, provide operation guidance, explain the consequences of different operations. For example, it clearly prompts users that they can choose to confirm and continue, edit content, regenerate, or view more details.

Interaction experience design also has several key points.

First, responsiveness. Quickly respond to user operations, provide real-time feedback, avoid long waits. For example, after users click the "edit" button, the editor launches immediately; after users finish editing, the system immediately reprocesses the content.

Second, undo capability. Support undo operations, preserve operation history, allow rollback to previous states. For example, if users edit the PRD but later realize the edit was wrong, they can undo the edit and return to the previous version.

Third, smart suggestions. Provide intelligent suggestions, predict user intent, reduce user operation steps. For example, if it detects users modified functional requirements in the PRD, it prompts whether technical design needs to be regenerated.

User Editing and Regeneration Support

Cowork Forge supports integrating external editors, allowing users to conveniently edit AI output.

The external editor integration works like this:

1. The HITL controller writes the content to a temporary file.
2. It launches the external editor (for example, VS Code).
3. The user edits the content in the editor.
4. The editor returns the edited content.
5. The HITL controller submits the modified content to the agent.
6. The agent reprocesses it and returns the result.
7. The HITL controller displays the result to the user.
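The round trip through an external editor needs nothing exotic. Here's a minimal, std-only Rust sketch of the pattern; the file name and error handling are my own choices, and Cowork Forge's real integration may differ:

```rust
use std::{env, fs, process::Command};

/// Write `draft` to a temp file, open it in the user's editor,
/// block until the editor exits, then read the result back.
fn edit_externally(draft: &str) -> std::io::Result<String> {
    let path = env::temp_dir().join("cowork_hitl_draft.md");
    fs::write(&path, draft)?;

    // Fall back to a common default when $EDITOR is unset.
    let editor = env::var("EDITOR").unwrap_or_else(|_| "vi".to_string());
    let status = Command::new(editor).arg(&path).status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "editor exited with an error",
        ));
    }
    fs::read_to_string(&path)
}

fn main() -> std::io::Result<()> {
    let edited = edit_externally("# PRD draft\n\nEdit me, then save and quit.\n")?;
    println!("got {} bytes back from the editor", edited.len());
    Ok(())
}
```

Note that GUI editors must be told to block until the file is closed; VS Code, for instance, needs the `--wait` flag.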

If users aren't satisfied with AI output, they can request regeneration:

1. The user requests regeneration, and the HITL controller asks for feedback.
2. The user provides feedback, which the controller analyzes and forwards to the agent.
3. The agent adjusts its prompt and calls the LLM to regenerate.
4. The LLM returns new content, which the agent passes back.
5. The HITL controller compares old and new output and displays the differences to the user.
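The pivotal step is folding the user's feedback into the next prompt so the LLM revises rather than starts from scratch. A hypothetical sketch (the real prompt shape is internal to Cowork Forge):

```rust
/// Build the regeneration prompt from the task, the previous
/// attempt, and the reviewer's feedback.
fn regeneration_prompt(task: &str, previous_output: &str, feedback: &str) -> String {
    format!(
        "{task}\n\nYour previous attempt was:\n{previous_output}\n\n\
         The reviewer said: {feedback}\n\
         Revise the output to address this feedback, and keep everything \
         that was not criticized unchanged."
    )
}

fn main() {
    let prompt = regeneration_prompt(
        "Write a PRD for a team task manager.",
        "...previous PRD...",
        "This PRD is missing user tagging functionality, please add it.",
    );
    println!("{prompt}");
}
```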

Between direct editing and feedback-driven regeneration, users always have a path to the output they actually want.


Best Practices in Actual Use

After understanding HITL's design principles, let's see how to efficiently perform human verification in actual use.

How to Efficiently Perform Human Verification

First, preparation. Before starting to use Cowork Forge:

- Clarify requirements: before inputting requirements, work out what you actually need, write down the key functional points and constraints, and gather relevant reference materials.
- Understand the project: its technology stack and architecture, its coding standards and conventions, and its business logic.
- Configure your environment: set up an external editor (like VS Code), Git and version control, and any other development tools you need.

Second, verification strategy. Don't try to verify all the content at once; work in stages:

- Quick scan: skim AI output first to understand the overall structure.
- Key check: focus on critical parts like core features and security requirements.
- Detailed review: go through the important sections carefully.
- Cross-verify: compare AI output against the original requirements.

Using a checklist is also a good approach: prepare one for each verification stage so that no important check items are missed, as in the example below.
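For example, a PRD-stage checklist (assembled from the verification questions earlier in this article) might include:

- Are the functional requirements complete?
- Are the user stories clear?
- Are the acceptance criteria testable?
- Are non-functional requirements (security, performance, compliance) covered?
- Do the requirements still match the user's original intent?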

Third, editing techniques:

- Use external editors: take advantage of syntax highlighting and auto-completion, search and replace, and version control integration.
- Preserve context: keep the original content for comparison, add comments explaining why you changed something, and retain the modification history for backtracking.
- Modify progressively: make large structural adjustments first, then detail changes, and finally formatting.

Common Problems and Solutions

In actual use, you might encounter some common problems.

Problem 1: AI-generated content doesn't meet expectations

Possible causes include unclear requirement description, insufficient AI context information, or AI capability limitations.

Solutions include re-describing the requirements in clearer, more specific language, providing more context (background information and reference materials), editing AI output directly, or telling AI what's wrong and requesting regeneration.

Problem 2: AI misses important requirements

Possible causes include incomplete requirement description, AI not considering all scenarios, or AI training data lacking similar requirements.

Solutions include manually adding the missing requirements, providing examples of similar features, or using a requirements checklist to ensure nothing is missed.

Problem 3: AI chooses inappropriate technology stack

Possible causes include AI not understanding project technology constraints, AI training data biased toward certain technology stacks, or AI not considering team technical capabilities.

Solutions include stating technology constraints explicitly in the requirements, telling AI directly which technology stack to use, or modifying the technology selection by hand.

Problem 4: AI output format doesn't meet requirements

Possible causes include AI not understanding project format standards, or AI training data format inconsistency.

Solutions include providing a standard format template, adjusting the output format manually, or specifying format rules in configuration files.

HITL Application in Team Collaboration Scenarios

In team collaboration scenarios, HITL mechanisms need to support multi-user collaboration.

First, role division. Different team members are responsible for different verification stages: product manager responsible for requirements collection and PRD verification, technical lead responsible for technical design and code plan verification, development engineer responsible for code review.

Second, collaboration process. Requirements collection → product manager verification → PRD generation → product manager verification → technical design → technical lead verification → code planning → development engineer verification → code generation → code review → check verification.

Third, collaboration tools. Tools supporting team collaboration include comments and discussion (support adding comments on content), version management (support comparing different versions), permission control (different roles have different permissions), notification mechanism (notify relevant personnel for verification).

In this flow, different roles are responsible for different verification stages, ensuring that each stage is reviewed by the right person.


Summary

HITL (Human-in-the-Loop) is one of Cowork Forge's core design philosophies, finding a balance between automation efficiency and output quality.

HITL Value

First, balance between efficiency and quality. Automate repetitive tasks to improve efficiency, verify key decisions manually to ensure quality, maintaining high efficiency without sacrificing quality.

Second, gradual trust building. Start with small tasks to build trust in AI, gradually improve AI reliability through human verification, gradually reduce human intervention based on trust.

Third, retaining control. Humans retain control over key decisions, AI serves as an assistant rather than a decision-maker, can intervene and adjust when needed.

Fourth, learning and feedback. Collect feedback through human verification, improve AI behavior and output quality, establish a continuous improvement loop.

Future Automation Level Evolution

As AI technology develops, HITL mechanisms will also continue to evolve.

In the short term: optimize the HITL interaction experience, improve agents' ability to judge their own confidence, and cut unnecessary verification stages.

In the medium term: introduce automated testing and verification, build agent behavior evaluation mechanisms, and adjust verification strategies dynamically based on historical data.

In the long term: achieve highly reliable AI with less need for human verification, build a complete AI trust system, and reach deep human-AI collaboration.

Recommendations for Users

First, don't skip human verification. Even if AI output looks perfect, review it carefully. Human verification is a key link in ensuring quality.

Second, provide clear feedback. If AI output doesn't meet expectations, provide clear feedback to help AI improve.

Third, establish verification habits. Establish systematic verification habits, use checklists to ensure no important check items are missed.

Fourth, continuous learning and improvement. Through using HITL, continuously learn and improve, increase verification efficiency and quality.

Conclusion

HITL isn't distrust of AI capability, but reasonable use of AI capability. AI is a powerful tool, but it still needs human guidance and supervision.

Through HITL mechanisms, we can enjoy the efficiency improvement brought by AI while ensuring output quality and controllability. This is the true meaning of human-AI collaboration.

Future software development isn't AI replacing humans, but AI and humans collaborating deeply to create better software together.

