Five Open AI Agent Jobs That Actually Involve Evals, Guardrails, and Production Systems
Five Open AI Agent Jobs That Actually Involve Evals, Guardrails, and Production Systems
Most "AI job" lists are too loose to be useful. They lump together anything with LLM, GenAI, or automation in the title, even when the actual work is generic data science or internal tooling.
For this shortlist, I used a stricter filter. On May 6, 2026, I reviewed live company-run Greenhouse and Lever listings and kept only jobs whose descriptions explicitly point to real agent work: prompt and context engineering, tool use, orchestration, retrieval, evals, guardrails, observability, memory, or production deployment. I also excluded listings that redirected to an error page or were clearly talent-pipeline placeholders.
The result is a focused list of five open roles that map to different parts of the agent stack: behavior design, enterprise orchestration, production platform engineering, customer-support agent architecture, and deployed agent operations.
Selection Standard
- Live application page visible on May 6, 2026
- Company-run listing, not a scraped repost
- Job description explicitly tied to AI agents or agentic systems
- Direct application URL included
- Role scope specific enough to be useful to a serious applicant or recruiter
1. Prompt Engineer, Agent Prompts & Evals — Anthropic
- Company: Anthropic
- Location: San Francisco, CA or New York City, NY
- Work model: Hybrid
- Direct apply: Anthropic job page
Why this is a real AI agent role
This is one of the clearest examples of agent behavior work in the current market. Anthropic is not hiring for vague prompt-writing. The listing explicitly says the role supports system prompts, tool prompts, skills, and evaluations across AI-first products.
What the listing says
Anthropic frames the job as the bridge between model capability and product behavior. Responsibilities include designing and optimizing prompts, building evaluation suites, supporting model launches, and helping product teams ship consistent, safe behavior across product surfaces. The job also asks for experience with LLMs, evaluation methodologies, and production engineering practices.
Why it made this top five
A lot of companies say they are building agents; far fewer hire for the hard part, which is making those agents predictable across releases. This role is directly about regression-catching, quality measurement, prompt architecture, and rollout support. That is agent work in the practical sense, not just in the marketing sense.
2. Machine Learning Engineer, AI Assistant & Autonomous AI Agents — Glean
- Company: Glean
- Location: San Francisco Bay Area
- Work model: Hybrid, 3 to 4 days per week in office
- Direct apply: Glean job page
Why this is a real AI agent role
Glean is hiring specifically for AI Assistant & Autonomous AI Agents, and the listing goes beyond buzzwords. It describes work on agentic frameworks, LLM orchestration, memory-augmented LLMs, reinforcement learning, and evaluation frameworks for complex enterprise tasks.
What the listing says
The job is positioned at the intersection of applied research and production engineering. Responsibilities include building frameworks for agents to use tools and knowledge sources, inventing new architectures for reasoning and planning, improving agent quality with fine-tuning and RL, and leading scalable evaluation loops for production systems.
Why it made this top five
This is a strong signal that enterprise agent hiring is maturing. Glean is not asking for a demo-builder. It wants someone who can handle orchestration, personalization, evaluation, and production-grade implementation in an enterprise environment where trust and latency matter.
3. Staff Software Engineer – AI Agents — GoodLeap
- Company: GoodLeap
- Location: San Francisco, CA
- Work model: Hybrid
- Direct apply: GoodLeap job page
Why this is a real AI agent role
GoodLeap is explicit that the hire will architect and deliver production-grade AI agent capabilities. The listing names concrete agent-building components: multi-modal interactions, multi-agent orchestration, memory systems, long-running tasks, secure tool access, vector databases, embeddings, semantic search, RAG pipelines, and MCP familiarity.
What the listing says
The role is a hands-on technical leadership position inside a software ecosystem serving sustainable home-financing and contractor workflows. Responsibilities include building backend services in Python and FastAPI, setting technical direction for AI-powered systems, integrating vector databases and semantic search, and driving reliability, observability, and security.
Why it made this top five
This role stands out because it shows how agent systems are moving into operational vertical software, not just frontier-model companies. The language around memory, tool access, orchestration, and observability makes it clear that GoodLeap is hiring for real agent infrastructure, not an experiment lab.
4. AI Agent Architect, Customer Experience — Airtable
- Company: Airtable
- Location: Remote, United States
- Work model: Remote
- Direct apply: Airtable job page
Why this is a real AI agent role
Airtable’s description is unusually concrete about what the agent is expected to do: reason, retrieve, decide, and act inside a customer-support setting. The job centers on retrieval quality, decision logic, guardrails, feedback loops, versioning, and integrations with external systems.
What the listing says
The role owns the technical foundation for Airtable’s AI-native support experience. Core responsibilities include improving retrieval precision and contextual relevance, reducing hallucinations, building decision frameworks for safe account actions, blocking prompt injection, instrumenting observability, running A/B tests, and integrating agents with billing platforms, CRMs, internal tools, and Airtable APIs.
Why it made this top five
This is exactly the kind of post that separates serious agent work from generic chatbot work. The listing talks about failure modes, action boundaries, feedback instrumentation, and week-over-week performance gains. In other words, it treats the agent like a production system with operational accountability.
5. Staff AI Agent Engineer — Liberate
- Company: Liberate
- Location: Boston or San Francisco (Berkeley)
- Work model: Hybrid, 2 days per week in office
- Direct apply: Liberate job page
Why this is a real AI agent role
Liberate builds AI agents for insurance operations, and this listing is explicitly about agent deployments and agent quality. It is one of the better examples of a role that sits between product, platform, and customer reality.
What the listing says
The Staff AI Agent Engineer owns complex deployments from design through production. The responsibilities include building and iterating on agent workflows, prompts, evals, and integrations, converting customer-specific learnings into reusable patterns, debugging behavior with structured evals and monitoring, and leading launch-readiness and post-launch quality reviews.
Why it made this top five
Many companies now want agent engineers who can ship in messy, high-stakes environments, not just prototype. Liberate’s description is strong because it emphasizes reuse, monitoring, operational rigor, and failure-mode thinking. That is what mature agent deployment work looks like in a regulated industry.
What These Five Roles Show About the Market
A useful pattern emerges from this shortlist.
First, prompting alone is no longer enough. The strongest roles now pair prompting with evals, rollout safety, or product behavior ownership.
Second, retrieval and orchestration have become core hiring signals. Glean, GoodLeap, and Airtable all point to some mix of tool use, orchestration, vector retrieval, memory, and feedback loops.
Third, agent jobs are splitting into distinct lanes:
- behavior and evals roles
- platform and orchestration roles
- forward deployment and customer-launch roles
- domain-specific agent operations roles
That matters for applicants. Someone strong at RAG, observability, and guardrails may fit Airtable far better than Liberate. Someone strong at reusable agent infrastructure may be a better match for GoodLeap or Glean. Someone who likes model behavior, prompt architecture, and evaluation suites should look hard at Anthropic.
Final Take
If I had to describe this batch in one sentence, it would be this: the best AI-agent openings right now are no longer hiring for "AI enthusiasm"; they are hiring for people who can make agents reliable, instrumented, and useful in production.
That is why these five listings stood out. They do not just mention AI agents in passing. They describe the actual work of building them: prompts, tool use, memory, evaluation, retrieval, deployment, guardrails, and operational accountability.
Top comments (0)