Overview
This tutorial demonstrates creating an AI-powered interview evaluation system that records conversations, converts audio to text, and applies structured scoring criteria. The approach separates the interviewing process from assessment, allowing evaluators to focus entirely on conversation quality while analyzing complete transcripts afterward with objective evidence.
Core Concept
Rather than simultaneously listening, note-taking, and assessing during interviews, this system enables:
- Natural interviewer-candidate interactions
- Complete preservation of dialogue through transcription
- Systematic post-interview analysis using searchable text
- Evidence-based scoring with supporting quotations and timestamps
Key Components
- Scoring criteria (4-6 job-specific competencies)
- Rating scale (1-5 with clear definitions)
- Evidence extraction (direct quotes proving competency levels)
- Complete transcripts (source of truth replacing handwritten notes)
Step 1: Define Role-Specific Criteria
Identify observable behaviors that predict success in the role. For engineering roles:
- Problem decomposition approaches
- System architecture decision-making
- Technical language proficiency
- Concept explanation clarity
Step 2: Rating Scale
- 1 - Far Below: No competence evidence or irrelevant responses
- 2 - Below: Minimal understanding, vague answers
- 3 - Meets: Adequate demonstration with relevant examples
- 4 - Exceeds: Strong evidence with multiple detailed examples
- 5 - Far Exceeds: Exceptional mastery with innovative approaches
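For use in the scoring code later, the scale can be kept as a simple lookup table. This sketch just mirrors the definitions above; the `describe_score` helper is an illustrative addition, not part of the later scorer:

```python
# Rating scale as a lookup table, mirroring the definitions above.
RATING_SCALE = {
    1: "Far Below: no competence evidence or irrelevant responses",
    2: "Below: minimal understanding, vague answers",
    3: "Meets: adequate demonstration with relevant examples",
    4: "Exceeds: strong evidence with multiple detailed examples",
    5: "Far Exceeds: exceptional mastery with innovative approaches",
}

def describe_score(score: int) -> str:
    """Render a numeric score with its human-readable definition."""
    return f"{score}/5 - {RATING_SCALE[score]}"
```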
Step 3: Recording & Transcription Setup
Recording requirements:
- External microphones (not built-in laptop audio)
- Quiet environments
- Clear, consistent volume levels
Legal considerations:
- Inform candidates during scheduling
- Obtain verbal consent at interview start
- Follow local consent laws
- Store securely and delete after hiring decisions
Step 4: AssemblyAI Implementation (Python)
pip install assemblyai python-dotenv
import assemblyai as aai
import json
import os
from datetime import datetime
from dotenv import load_dotenv
load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
def transcribe_interview(audio_file_path, candidate_name, position):
    config = aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.best,
        speaker_labels=True,
        speakers_expected=2,
        punctuate=True,
        format_text=True
    )
    transcriber = aai.Transcriber(config=config)
    transcript = transcriber.transcribe(audio_file_path)
    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcription failed: {transcript.error}")
        return None
    utterances = []
    for utterance in transcript.utterances:
        utterances.append({
            'speaker': utterance.speaker,
            'text': utterance.text,
            'start_time': utterance.start / 1000,
            'end_time': utterance.end / 1000,
            'confidence': utterance.confidence
        })
    result = {
        'candidate_name': candidate_name,
        'position': position,
        'interview_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'duration_minutes': round(transcript.audio_duration / 60, 2),
        'utterances': utterances,
        'full_text': transcript.text
    }
    output_filename = f"{candidate_name.replace(' ', '_')}_{position.replace(' ', '_')}.json"
    with open(output_filename, 'w') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    return result
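For reference, the JSON file this function writes has roughly the following shape. All values here are illustrative placeholders, not real output:

```python
# Illustrative shape of the saved transcript JSON (placeholder values).
example_transcript = {
    "candidate_name": "Jane Doe",            # hypothetical candidate
    "position": "Backend Engineer",
    "interview_date": "2025-01-15 10:00:00",
    "duration_minutes": 45.5,
    "utterances": [
        {
            "speaker": "A",
            "text": "Tell me about a recent project you led.",
            "start_time": 12.3,              # seconds from recording start
            "end_time": 15.1,
            "confidence": 0.97,
        },
    ],
    "full_text": "Tell me about a recent project you led. ...",
}
```

This is the file the scorer in Step 5 reads back in.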
Step 5: Scoring & Evidence Extraction
import json
from typing import Dict, List
class InterviewScorer:
    def __init__(self, transcript_file: str):
        with open(transcript_file, 'r') as f:
            self.transcript_data = json.load(f)
        self.candidate_responses = self._get_candidate_responses()

    def _get_candidate_responses(self) -> List[str]:
        # Heuristic: the candidate usually speaks the most words overall,
        # since interviewers ask short questions and candidates give long answers.
        word_counts = {}
        for utterance in self.transcript_data['utterances']:
            speaker = utterance['speaker']
            word_counts[speaker] = word_counts.get(speaker, 0) + len(utterance['text'].split())
        candidate_speaker = max(word_counts, key=word_counts.get)
        return [u['text'] for u in self.transcript_data['utterances']
                if u['speaker'] == candidate_speaker]

    def find_evidence_for_competency(self, competency_keywords: List[str]) -> List[str]:
        evidence = []
        for response in self.candidate_responses:
            response_lower = response.lower()
            keyword_matches = sum(1 for keyword in competency_keywords
                                  if keyword.lower() in response_lower)
            # Require at least one keyword hit and a substantive answer (>10 words).
            if keyword_matches > 0 and len(response.split()) > 10:
                evidence.append(response)
        return evidence[:3]  # keep the three best supporting quotes

    def score_competency(self, evidence: List[str]) -> int:
        if not evidence:
            return 1
        evidence_count = len(evidence)
        avg_length = sum(len(e.split()) for e in evidence) / evidence_count
        # Check the strongest bands first so the higher scores are reachable.
        if evidence_count >= 3 and avg_length >= 40:
            return 5
        elif evidence_count >= 2 and avg_length >= 30:
            return 4
        elif evidence_count == 1 and avg_length < 20:
            return 2
        else:
            return 3

    def generate_scorecard(self, competencies: Dict[str, List[str]]) -> Dict:
        scorecard = {
            'candidate': self.transcript_data['candidate_name'],
            'position': self.transcript_data['position'],
            'interview_date': self.transcript_data['interview_date'],
            'competency_scores': {},
            'supporting_evidence': {},
            'overall_score': 0
        }
        total_score = 0
        for competency_name, keywords in competencies.items():
            evidence = self.find_evidence_for_competency(keywords)
            score = self.score_competency(evidence)
            scorecard['competency_scores'][competency_name] = score
            scorecard['supporting_evidence'][competency_name] = evidence
            total_score += score
        scorecard['overall_score'] = round(total_score / len(competencies), 1)
        return scorecard
Usage example:
engineering_competencies = {
    'Problem Solving': [
        'analyze', 'debug', 'troubleshoot', 'solution', 'approach',
        'investigate', 'root cause', 'systematic', 'break down'
    ],
    'Technical Skills': [
        'python', 'javascript', 'react', 'database', 'api',
        'algorithm', 'architecture', 'testing', 'performance'
    ],
    'Communication': [
        'explain', 'clarify', 'example', 'understand', 'question',
        'discuss', 'present', 'document', 'feedback'
    ],
    'Experience': [
        'project', 'team', 'lead', 'built', 'developed',
        'implemented', 'managed', 'delivered', 'worked on'
    ]
}
scorer = InterviewScorer('candidate_transcript.json')
scorecard = scorer.generate_scorecard(engineering_competencies)
with open('scorecard.json', 'w') as f:
    json.dump(scorecard, f, indent=2, ensure_ascii=False)
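To see how the keyword heuristic inside find_evidence_for_competency behaves on its own, here is a minimal standalone version of the matching step. The sample responses are made up for illustration:

```python
def match_evidence(responses, keywords, min_words=10, max_quotes=3):
    """Keep responses that mention at least one keyword and are substantive."""
    evidence = []
    for response in responses:
        lower = response.lower()
        has_keyword = any(k.lower() in lower for k in keywords)
        if has_keyword and len(response.split()) > min_words:
            evidence.append(response)
    return evidence[:max_quotes]

responses = [
    "Yes.",  # too short to count as evidence
    "I took a systematic approach to debug the outage and traced the "
    "root cause to a misconfigured cache layer.",
]
print(match_evidence(responses, ["debug", "root cause", "systematic"]))
```

Note the limits of this heuristic: it rewards keyword mentions and answer length, not answer quality, so treat the extracted quotes as starting points for a human reviewer rather than a final judgment.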
Common Implementation Errors
1. Generic Criteria Across Roles
Different positions require different competencies: for a data scientist, "Communication" means explaining statistics clearly; for support staff, it means showing customer empathy.
2. Skipping Calibration Sessions
Without alignment, evaluators interpret identical evidence differently. Hold monthly calibration sessions in which all raters independently score the same sample transcripts and then compare results; this keeps scoring consistent.
3. Neglecting Audio Quality
Poor recording quality undermines transcription accuracy. Test setups beforehand and require external microphones.
Measurement & Validation
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(evaluator1_scores, evaluator2_scores):
    kappa = cohen_kappa_score(evaluator1_scores, evaluator2_scores)
    if kappa < 0.4:
        return "Poor agreement - needs calibration"
    elif kappa < 0.6:
        return "Fair agreement - some calibration needed"
    elif kappa < 0.8:
        return "Good agreement - system working well"
    else:
        return "Excellent agreement - very consistent"
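If scikit-learn is not available, Cohen's kappa for two raters is straightforward to compute directly. This pure-Python version gives the same value as cohen_kappa_score for lists of integer scores; the sample evaluator scores are made up:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[s] * counts_b.get(s, 0) for s in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

evaluator1 = [3, 4, 2, 5, 3, 4]
evaluator2 = [3, 4, 3, 5, 3, 4]
print(round(cohen_kappa(evaluator1, evaluator2), 2))  # 0.76 -> "good agreement" band
```

Kappa corrects raw agreement for agreement expected by chance, which is why it is preferred over a simple percent-match when validating the scoring system.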
Frequently Asked Questions
What transcription accuracy is needed?
AssemblyAI's Universal models achieve approximately 94.4% accuracy, ensuring scores reflect actual responses.
Does this work with Zoom/Teams recordings?
Yes. Most platforms allow downloading recordings as MP4 files, which AssemblyAI accepts directly.
How does transcript-based scoring compare to manual scoring?
Transcript-based evaluation achieves higher inter-rater consistency because evaluators review identical, complete information rather than relying on incomplete notes and memory.
What if a candidate refuses to be recorded?
Offer traditional live scoring as an alternative while explaining that recording ensures fairer assessment.
How many competencies should I evaluate?
Stick to 4-6 competencies per interview to maintain focus while avoiding cognitive overload.