A collection of easily verifiable work (e.g. the output of an automated check script) seems like a perfect use case for AI automation, and it is. However, throwing a pile of work at an AI agent can backfire and produce low-quality slop written just to silence the errors, which humans then have to fix.
Natural conflict between helpfulness and harmlessness
The pivotal 2022 Anthropic paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback states:
Helpfulness and harmlessness often stand in opposition to each other. An excessive focus on avoiding harm can lead to ‘safe’ responses that don’t actually address the needs of the human. An excessive focus on being helpful can lead to responses that help humans cause harm or generate toxic content. We demonstrate this tension quantitatively by showing that preference models trained to primarily evaluate one of these qualities perform very poorly (much worse than chance) on the other.
In a business context, helpfulness is doing your best to perform a given task, while harmlessness is refusing to attempt a task when you lack the information or access to perform it responsibly. Human employees are generally good at striking this balance; AI agents, by default, are not.
AI companies invest considerable resources into making sure that their AI systems don't enable terrorism, reinforce hatred, encourage self-harm, etc. However, outside of those extremes, AI systems will attempt to be 'helpful' by default, even when doing so causes considerable business damage. See, for example, the incident where an AI agent deleted a company's production database while trying to be 'helpful'.
Human judgement is cheaper in most situations
As the Anthropic paper points out, making an LLM strike a balance between helpfulness and harmlessness is not an impossible problem, but it's not an easy one either. It can't be fixed by just tweaking the prompt or adding an extra step to the process; solving it well seems to require a substantial, iterative investment of AI engineer time, dataset creation, and fine-tuning. Those are resources most companies can't afford to spend on most problems.
Instead, it makes more financial sense for companies to lean on human judgement and create processes where humans decide whether a given problem should be attempted by an overly ambitious 'helpful' AI agent at all. This creates more work for humans, but in most situations it is cheaper than investing in an approach like RLHF, and it also helps prevent costly errors by AI agents.
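To make that concrete, here is a minimal sketch (in Python) of what such a process could look like: a human triage gate sitting in front of the agent. The Task structure, the run_agent stub, and the console approval prompt are all hypothetical placeholders, not any particular product's API; the only point is that a human decides which tasks the agent is allowed to attempt.

```python
# Minimal sketch of a human triage gate in front of an AI agent.
# Everything here (Task, run_agent, the console approval) is a placeholder;
# the key idea is that a human decides *whether* the agent attempts a task.

from dataclasses import dataclass


@dataclass
class Task:
    title: str
    description: str


def run_agent(task: Task) -> str:
    # Stand-in for the real agent call (e.g. an LLM-backed automation).
    return f"[agent output for: {task.title}]"


def human_approves(task: Task) -> bool:
    # In a real process this could be a ticket queue or a chat approval;
    # here it is just a console prompt.
    answer = input(f"Delegate '{task.title}' to the AI agent? [y/N] ")
    return answer.strip().lower() == "y"


def triage(tasks: list[Task]) -> None:
    for task in tasks:
        if human_approves(task):
            print(run_agent(task))
        else:
            print(f"'{task.title}' kept for a human to handle.")


if __name__ == "__main__":
    triage([
        Task("Fix lint warnings in module X", "Output of the CI check script"),
        Task("Migrate the production database", "Schema change with downtime risk"),
    ])
```

The design choice is deliberately boring: the expensive judgement call happens before the agent is invoked, so the agent never gets the chance to be 'helpful' on a task it shouldn't touch.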
Conclusion
Don't rely on AI agents to judge whether they should attempt a task. They don't have that judgement, and developing it for even a single problem takes an investment of millions of dollars. Instead, assume that AI agents are overly ambitious and 'helpful' by default, and only hand them work they should attempt.