Mart Schweiger

Posted on • Originally published at assemblyai.com

How to Build an AI-Powered Interview Scoring System with Speech-to-Text

Overview

This tutorial demonstrates building an AI-powered interview evaluation system that records conversations, converts audio to text, and applies structured scoring criteria. The approach separates interviewing from assessment: evaluators can focus entirely on conversation quality during the interview, then analyze the complete transcript afterward against objective evidence.

Core Concept

Rather than simultaneously listening, note-taking, and assessing during interviews, this system enables:

  • Natural interviewer-candidate interactions
  • Complete preservation of dialogue through transcription
  • Systematic post-interview analysis using searchable text
  • Evidence-based scoring with supporting quotations and timestamps

Key Components

  1. Scoring criteria (4-6 job-specific competencies)
  2. Rating scale (1-5 with clear definitions)
  3. Evidence extraction (direct quotes proving competency levels)
  4. Complete transcripts (source of truth replacing handwritten notes)

Step 1: Define Role-Specific Criteria

Identify observable behaviors predicting success. For engineering roles:

  • Problem decomposition approaches
  • System architecture decision-making
  • Technical language proficiency
  • Concept explanation clarity

Step 2: Rating Scale

  • 1 - Far Below: No competence evidence or irrelevant responses
  • 2 - Below: Minimal understanding, vague answers
  • 3 - Meets: Adequate demonstration with relevant examples
  • 4 - Exceeds: Strong evidence with multiple detailed examples
  • 5 - Far Exceeds: Exceptional mastery with innovative approaches
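To keep the scale consistent across evaluators and scripts, it helps to encode it directly in code. A minimal sketch (`RATING_SCALE` and `describe_rating` are illustrative names, not part of any library):

```python
# Hypothetical encoding of the 1-5 rating scale defined above.
RATING_SCALE = {
    1: "Far Below: no competence evidence or irrelevant responses",
    2: "Below: minimal understanding, vague answers",
    3: "Meets: adequate demonstration with relevant examples",
    4: "Exceeds: strong evidence with multiple detailed examples",
    5: "Far Exceeds: exceptional mastery with innovative approaches",
}

def describe_rating(score: int) -> str:
    """Return the label for a 1-5 rating, rejecting out-of-range input."""
    if score not in RATING_SCALE:
        raise ValueError(f"Rating must be 1-5, got {score}")
    return RATING_SCALE[score]
```

Centralizing the definitions this way means scorecards and reports always print the same wording for a given score.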

Step 3: Recording & Transcription Setup

Recording requirements:

  • External microphones (not built-in laptop audio)
  • Quiet environments
  • Clear, consistent volume levels

Legal considerations:

  • Inform candidates during scheduling
  • Obtain verbal consent at interview start
  • Follow local consent laws
  • Store securely and delete after hiring decisions

Step 4: AssemblyAI Implementation (Python)

pip install assemblyai python-dotenv

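The script below loads the API key from a local `.env` file via python-dotenv. A minimal example (the key value is a placeholder; use your own key from the AssemblyAI dashboard):

```shell
# .env — read by load_dotenv(); never commit this file
ASSEMBLYAI_API_KEY=your_api_key_here
```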
import assemblyai as aai
import json
import os
from datetime import datetime
from dotenv import load_dotenv

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

def transcribe_interview(audio_file_path, candidate_name, position):
    config = aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.best,
        speaker_labels=True,    # diarization: tag each utterance with its speaker
        speakers_expected=2,    # an interview normally has exactly two speakers
        punctuate=True,
        format_text=True
    )

    transcriber = aai.Transcriber(config=config)
    transcript = transcriber.transcribe(audio_file_path)

    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcription failed: {transcript.error}")
        return None

    utterances = []
    for utterance in transcript.utterances:
        utterances.append({
            'speaker': utterance.speaker,
            'text': utterance.text,
            'start_time': utterance.start / 1000,
            'end_time': utterance.end / 1000,
            'confidence': utterance.confidence
        })

    result = {
        'candidate_name': candidate_name,
        'position': position,
        'interview_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'duration_minutes': round(transcript.audio_duration / 60, 2),
        'utterances': utterances,
        'full_text': transcript.text
    }

    output_filename = f"{candidate_name.replace(' ', '_')}_{position.replace(' ', '_')}.json"
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)

    return result
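Once a transcript JSON is saved, the utterance records are easy to render for human review. A small sketch (the helper name and sample data are illustrative, not part of the AssemblyAI SDK):

```python
def format_transcript(utterances):
    """Render utterance dicts as 'Speaker X [mm:ss]: text' lines."""
    lines = []
    for u in utterances:
        minutes, seconds = divmod(int(u['start_time']), 60)
        lines.append(
            f"Speaker {u['speaker']} [{minutes:02d}:{seconds:02d}]: {u['text']}"
        )
    return "\n".join(lines)

# Illustrative records matching the structure saved by transcribe_interview.
sample = [
    {'speaker': 'A', 'text': 'Tell me about a recent project.', 'start_time': 12.4},
    {'speaker': 'B', 'text': 'I built a billing service in Python.', 'start_time': 15.9},
]
print(format_transcript(sample))
```

The timestamps make it easy to jump back to the recording when a quote needs context.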

Step 5: Scoring & Evidence Extraction

import json
from typing import Dict, List

class InterviewScorer:
    def __init__(self, transcript_file: str):
        with open(transcript_file, 'r') as f:
            self.transcript_data = json.load(f)
        self.candidate_responses = self._get_candidate_responses()

    def _get_candidate_responses(self) -> List[str]:
        # Heuristic: interviewers usually take more, shorter turns asking
        # questions, so the speaker with fewer utterances is assumed to be
        # the candidate. Spot-check this on short or unusual interviews.
        responses = []
        speaker_counts = {}

        for utterance in self.transcript_data['utterances']:
            speaker = utterance['speaker']
            speaker_counts[speaker] = speaker_counts.get(speaker, 0) + 1

        candidate_speaker = min(speaker_counts, key=speaker_counts.get)

        for utterance in self.transcript_data['utterances']:
            if utterance['speaker'] == candidate_speaker:
                responses.append(utterance['text'])

        return responses

    def find_evidence_for_competency(self, competency_keywords: List[str]) -> List[str]:
        evidence = []

        for response in self.candidate_responses:
            response_lower = response.lower()
            keyword_matches = sum(1 for keyword in competency_keywords 
                                 if keyword.lower() in response_lower)

            if keyword_matches > 0 and len(response.split()) > 10:
                evidence.append(response)

        return evidence[:3]

    def score_competency(self, evidence: List[str]) -> int:
        # Simple heuristic: more pieces of evidence and longer, more
        # detailed responses earn higher scores. Check the strongest
        # tier first so the 4 and 5 bands are actually reachable.
        if not evidence:
            return 1

        evidence_count = len(evidence)
        avg_length = sum(len(e.split()) for e in evidence) / evidence_count

        if evidence_count >= 3 and avg_length >= 40:
            return 5
        elif evidence_count >= 2 and avg_length >= 30:
            return 4
        elif evidence_count == 1 and avg_length < 20:
            return 2
        else:
            return 3

    def generate_scorecard(self, competencies: Dict[str, List[str]]) -> Dict:
        scorecard = {
            'candidate': self.transcript_data['candidate_name'],
            'position': self.transcript_data['position'],
            'interview_date': self.transcript_data['interview_date'],
            'competency_scores': {},
            'supporting_evidence': {},
            'overall_score': 0
        }

        total_score = 0

        for competency_name, keywords in competencies.items():
            evidence = self.find_evidence_for_competency(keywords)
            score = self.score_competency(evidence)

            scorecard['competency_scores'][competency_name] = score
            scorecard['supporting_evidence'][competency_name] = evidence
            total_score += score

        scorecard['overall_score'] = round(total_score / len(competencies), 1)

        return scorecard

Usage example:

engineering_competencies = {
    'Problem Solving': [
        'analyze', 'debug', 'troubleshoot', 'solution', 'approach',
        'investigate', 'root cause', 'systematic', 'break down'
    ],
    'Technical Skills': [
        'python', 'javascript', 'react', 'database', 'api',
        'algorithm', 'architecture', 'testing', 'performance'
    ],
    'Communication': [
        'explain', 'clarify', 'example', 'understand', 'question',
        'discuss', 'present', 'document', 'feedback'
    ],
    'Experience': [
        'project', 'team', 'lead', 'built', 'developed',
        'implemented', 'managed', 'delivered', 'worked on'
    ]
}

scorer = InterviewScorer('candidate_transcript.json')
scorecard = scorer.generate_scorecard(engineering_competencies)

with open('scorecard.json', 'w') as f:
    json.dump(scorecard, f, indent=2)
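For a quick human-readable review, the scorecard dict can be summarized per competency. A small sketch (the helper and sample data are hypothetical, mirroring the structure that `generate_scorecard` returns):

```python
def summarize_scorecard(scorecard):
    """Return a one-line-per-competency text summary of a scorecard dict."""
    lines = [f"{name}: {score}/5"
             for name, score in scorecard['competency_scores'].items()]
    lines.append(f"Overall: {scorecard['overall_score']}")
    return "\n".join(lines)

# Illustrative scorecard data.
sample = {
    'candidate': 'Jane Doe',
    'competency_scores': {'Problem Solving': 4, 'Technical Skills': 3},
    'overall_score': 3.5,
}
print(summarize_scorecard(sample))
```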

Common Implementation Errors

1. Generic Criteria Across Roles
Different positions require different competencies. "Communication" means statistical explanation for data scientists versus customer empathy for support staff.

2. Skipping Calibration Sessions
Without alignment, evaluators interpret identical evidence differently. Hold monthly calibration sessions in which all raters independently score the same sample transcripts, then compare results.

3. Neglecting Audio Quality
Poor recording quality undermines transcription accuracy. Test setups beforehand and require external microphones.

Measurement & Validation

from sklearn.metrics import cohen_kappa_score

def calculate_agreement(evaluator1_scores, evaluator2_scores):
    kappa = cohen_kappa_score(evaluator1_scores, evaluator2_scores)

    if kappa < 0.4:
        return "Poor agreement - needs calibration"
    elif kappa < 0.6:
        return "Fair agreement - some calibration needed"
    elif kappa < 0.8:
        return "Good agreement - system working well"
    else:
        return "Excellent agreement - very consistent"
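If pulling in scikit-learn just for this check feels heavy, unweighted Cohen's kappa is also straightforward to compute directly. A minimal pure-Python sketch (the sample scores are illustrative):

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    n = len(rater1)
    # Observed agreement: fraction of items both raters scored identically.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement by chance, from each rater's marginal distribution.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum((counts1[c] / n) * (counts2[c] / n)
                   for c in set(rater1) | set(rater2))
    if expected == 1:  # degenerate case: both raters use a single category
        return 1.0
    return (observed - expected) / (1 - expected)

# Two evaluators scoring the same six transcripts (illustrative data).
print(cohen_kappa([3, 4, 2, 5, 3, 4], [3, 4, 3, 5, 3, 4]))
```

The result should match `sklearn.metrics.cohen_kappa_score` on the same inputs, which is a useful sanity check.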

Frequently Asked Questions

What transcription accuracy is needed?
AssemblyAI's Universal models achieve approximately 94.4% accuracy, ensuring scores reflect actual responses.

Does this work with Zoom/Teams recordings?
Yes. Most platforms allow downloading recordings as MP4 files, which AssemblyAI accepts directly.

How does transcript-based scoring compare to manual scoring?
Transcript-based evaluation achieves higher inter-rater consistency because evaluators review identical, complete information rather than relying on incomplete notes and memory.

What if a candidate refuses to be recorded?
Offer traditional live scoring as an alternative while explaining that recording ensures fairer assessment.

How many competencies should I evaluate?
Stick to 4-6 competencies per interview to maintain focus while avoiding cognitive overload.
