Veríssimo Cassange

Interview Protocol as Code: Standardizing Technical Hiring with OpenClaw

OpenClaw Challenge Submission 🦞

This is a submission for the OpenClaw Challenge.

What I Built

Every hiring manager has lived this: you interview five candidates for the same senior role and walk away with five wildly different assessment notes. One interviewer probes systems thinking. Another focuses on pure technical breadth. A third ranks personality fit. You have five hours of interviews and no systematic way to compare them.

I built interview_agent - an OpenClaw skill that standardizes the technical interview into a repeatable, machine-readable process. You give it a job description. It generates targeted questions, asks them one at a time, scores each answer against explicit criteria, and produces a hire/no-hire recommendation with the evidence backing it.

The skill has five sequential modes:

| Mode | What it does |
| --- | --- |
| 1 - Job Analysis | Parses the job description, extracts core skills, flags risk areas, infers seniority |
| 2 - Interview Plan | Builds a question roadmap with time estimates and a scoring rubric |
| 3 - Live Interview | Asks questions one by one; adapts the next question based on gaps observed |
| 4 - Answer Evaluation | Scores the response with evidence across three dimensions (technical, behavioral, domain depth) |
| 5 - Final Report | Synthesizes scores and delivers hire/no-hire with a confidence level |

The entire implementation is a single Markdown file. No backend. No database. No deployment nonsense.

I forked DioAugust/ws_dio_entrevistador and made four concrete changes:

  1. Bilingual (PT / EN) - The skill detects whether your job description is in Portuguese or English and responds in kind. You can also switch mid-session by saying "switch to English". This was a practical necessity: tech teams in Brazil run internal interviews in Portuguese but screen candidates with English-only résumés.

  2. Three-part scoring - Instead of a single global score, every candidate gets three sub-scores: tecnico (raw technical skill), comportamental (communication, collaboration), and dominio (depth in the specific domain). A candidate can be technically excellent but inarticulate, or vice versa. One number hides that truth; the schema sketch after this list makes the split concrete.

  3. Adaptive questions - During the live interview, if a candidate skips a critical topic (like observability or incident response), the next question deliberately targets that gap. You're not reading from a fixed script; you're drilling down on what matters.

  4. Machine-readable outputs - Added fields like idioma_principal, dificuldade_estimada, and feedback_sugestao so the JSON can be consumed downstream: fed into a hiring dashboard, sent to a candidate as constructive feedback, or piped into an applicant tracking system.

Full change log: ATTRIBUTION.md
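
To make changes 2 and 4 concrete, here is a minimal sketch of the evaluation payload as Python types, plus a toy comparison showing why one global score hides divergent profiles. The field names mirror the JSON examples later in this post; the averaging demo is my own illustration, not the skill's actual aggregation logic.

```python
from typing import TypedDict

class SubScores(TypedDict):
    technical: int         # raw technical skill, 1-5
    behavioral: int        # communication, collaboration, 1-5
    domain_knowledge: int  # depth in the specific domain, 1-5

class Evaluation(TypedDict):
    overall_score: int
    sub_scores: SubScores
    positive_signals: list[str]
    missing_signals: list[str]
    suggested_feedback: str

# Two candidates with the same average, for opposite reasons:
systems_strong = SubScores(technical=5, behavioral=1, domain_knowledge=3)
articulate_shallow = SubScores(technical=1, behavioral=5, domain_knowledge=3)

for label, s in [("systems-strong", systems_strong),
                 ("articulate-but-shallow", articulate_shallow)]:
    avg = sum(s.values()) / len(s)  # both collapse to 3.0
    print(f"{label}: average={avg:.1f}, sub-scores={s}")
```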

How I Used OpenClaw

An interview is a state machine: you're always somewhere in a defined sequence. OpenClaw's skill architecture maps to that exactly. I didn't write state management code or API wiring. I wrote the protocol itself in Markdown, and the framework executed it.
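
For illustration only, here is roughly how I picture that state machine in code. The actual skill encodes this sequence in Markdown instructions, not Python; the sketch just names the transitions, including the loop between asking and evaluating.

```python
from enum import Enum, auto

class Mode(Enum):
    JOB_ANALYSIS = auto()       # 1: parse the job description
    INTERVIEW_PLAN = auto()     # 2: build the question roadmap
    LIVE_INTERVIEW = auto()     # 3: ask one question
    ANSWER_EVALUATION = auto()  # 4: score the answer, note gaps
    FINAL_REPORT = auto()       # 5: synthesize hire/no-hire

def next_mode(current: Mode, questions_remaining: int) -> Mode:
    """Linear protocol, except modes 3 and 4 loop until the roadmap is done."""
    if current is Mode.LIVE_INTERVIEW:
        return Mode.ANSWER_EVALUATION
    if current is Mode.ANSWER_EVALUATION:
        return Mode.LIVE_INTERVIEW if questions_remaining > 0 else Mode.FINAL_REPORT
    transitions = {Mode.JOB_ANALYSIS: Mode.INTERVIEW_PLAN,
                   Mode.INTERVIEW_PLAN: Mode.LIVE_INTERVIEW}
    return transitions.get(current, Mode.FINAL_REPORT)
```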

Stack:

  • Runtime: ghcr.io/openclaw/openclaw:latest (Docker)
  • Model: Gemini 2.5 Flash
  • Skill: ./skills/interview-agent/SKILL.md
  • UI: localhost:18789

Why it worked: the near-zero friction. In the first week I rewrote prompts 50+ times. Each iteration was just edit the file, refresh the browser. No build. No deploy. That velocity let me test scoring rubrics, question phrasing, and JSON schemas fast enough to actually learn what works.

Demo

Repository: github.com/vec21/ws_dio_entrevistador

Run it:

```bash
git clone https://github.com/vec21/ws_dio_entrevistador
cd ws_dio_entrevistador
# Add your GOOGLE_API_KEY to docker-compose.yml
docker compose up -d
# Open http://localhost:18789
```

Mode 1 - Job Analysis

You provide (the demo used a Portuguese job description; shown here in English):

```
Use the skill interview_agent to analyze this job posting as JSON:

Senior Backend Engineer - Fintech
Responsibilities:
- Critical payment APIs
- Event-driven microservices
- Observability and reliability

Requirements: Go or Kotlin, Kafka, AWS
```

The skill responds with:

```json
{
  "job_title": "Senior Backend Engineer",
  "seniority": "senior",
  "primary_language": "pt",
  "technical_skills": ["Go/Kotlin", "Kafka", "AWS", "Observability"],
  "risk_flags": ["payment systems domain expertise required", "high fault tolerance expected"],
  "estimated_difficulty": "high"
}
```
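
Because this is plain JSON, it feeds downstream tooling directly. A minimal consumer sketch; the gating rule here is my own example policy, not part of the skill:

```python
import json

# Mode 1 output, as returned by the skill (abbreviated)
raw_response = """{
  "job_title": "Senior Backend Engineer",
  "seniority": "senior",
  "estimated_difficulty": "high",
  "risk_flags": ["payment systems domain expertise required",
                 "high fault tolerance expected"]
}"""

analysis = json.loads(raw_response)

# Hypothetical downstream rule: surface risk flags on high-difficulty
# senior roles before the interview plan is built.
if analysis["seniority"] == "senior" and analysis["estimated_difficulty"] == "high":
    for flag in analysis["risk_flags"]:
        print(f"plan extra probing: {flag}")
```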

Mode 4 - Answer Evaluation

You ask a question and the candidate responds:

```
Question: Tell me about a critical backend system you built.

Candidate response: I implemented idempotency keys, retries with exponential backoff, database transactions, and latency/error metrics on a payments API.
```

The skill evaluates:

```json
{
  "overall_score": 4,
  "sub_scores": {
    "technical": 5,
    "behavioral": 4,
    "domain_knowledge": 3
  },
  "positive_signals": ["idempotency correctly applied", "retry strategy with exponential backoff", "latency and error metrics instrumented"],
  "missing_signals": ["no incident response discussion", "missing scale and SLA context"],
  "suggested_feedback": "Ask about the biggest failure that occurred in this system and how recovery was handled."
}
```

The technical sub-score is high (idempotency + backoff are exactly right). But domain_knowledge is lower because describing a system without discussing failure modes or scale shows incomplete mastery of fintech reliability concerns. The feedback note guides the next question.
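
That gap-targeting step is easy to picture as code. A rough sketch of the adaptive selection, with a hypothetical topic-to-question map; the real skill expresses this behavior in prompt instructions, and these templates are mine, not its actual wording:

```python
# Hypothetical map from gap topics to follow-up question templates.
FOLLOW_UPS = {
    "incident response": "Walk me through the worst incident this system had. "
                         "How did you detect it, and how did you recover?",
    "scale": "What throughput and SLAs did this API have to meet, "
             "and how did you verify them?",
    "observability": "What signals would tell you this system is degrading "
                     "before users notice?",
}

def pick_next_question(missing_signals: list[str]) -> str | None:
    """Target the first critical gap observed in the evaluated answer."""
    for signal in missing_signals:
        for topic, question in FOLLOW_UPS.items():
            if topic in signal:
                return question
    return None  # no gap matched; fall back to the planned roadmap

print(pick_next_question(["no incident response discussion",
                          "missing scale and SLA context"]))
```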

What I Learned

1. Constraints force rigor.

I started with a single global score. Disaster. Two candidates would score the same "3" but for opposite reasons: one brilliant at systems but inarticulate; the other articulate but shallow on design. I split it into three scores and suddenly I could see clearly. The score becomes evidence, not a guess.

2. Flow beats features.

The single biggest quality lever wasn't smarter prompts or longer context windows. It was the flow: asking one question at a time, letting the candidate think, adapting the next question based on what you just learned. It feels like a conversation. It is a conversation. But underneath there's explicit structure. That combination of natural flow and explicit criteria is what makes interviews repeatable and fair.

3. Multilingual means redesigning, not translating.

I could have run Portuguese prompts through a translator. Instead I rewrote them from first principles in Portuguese. Because "leverage" is a loan word in Portuguese that carries different weight. Because what counts as "senior" differs culturally. Designing for two languages forced me to articulate what I was actually measuring instead of hiding behind vague English jargon.

ClawCon Michigan

I did not attend ClawCon Michigan.
