Improving AI Email Classification Accuracy Through Prompt Engineering
Overview
We resolved misclassification issues in our email classification system, where project emails (PROJECT) and personnel emails (TALENT) were being categorized incorrectly. This article describes how we fixed the problem of personnel emails containing phrases like "project desired" being misclassified as projects, using Few-shot learning and clearer judgment criteria.
Tech Stack
- OpenAI GPT-4 Turbo (gpt-4-1106-preview)
- Claude 3 Opus (for comparison testing)
- TypeScript (v5.x)
- Prompt Engineering
- Few-shot Learning
- Natural Language Processing
Background & Challenges
Real Misclassification Examples
Our email classification AI was making incorrect judgments in cases like these:
Misclassification Case 1: Personnel information classified as PROJECT
Subject: [Personnel Information] Introduction of Mr./Ms. ○○
Body:
We would like to introduce the following candidate from Mr./Ms. ○○.
[Basic Information]
Name: Taro Yamada
Age: 35 years old
Skills: Java, Spring Boot
Desired rate: 600,000 yen/month
Project desired: Remote-friendly projects preferred
→ Misclassified as PROJECT based solely on "project desired" keyword
Misclassification Case 2: Unable to understand context
Subject: Re: Introduction of Engineer
Body:
We would like to introduce the following engineer.
Currently seeking new projects.
[Career]
- 5 years of development experience at major SI company
- Experience in machine learning projects with Python
→ Misclassified as PROJECT due to "seeking projects"
Root Causes
- Keyword-based Judgment
  - Classification based solely on the word "project"
  - No understanding of context or of who the sentence's subject is
- Ambiguous Judgment Criteria
  - Unclear "who is providing what"
  - Ambiguous subject of "introduction"
- Insufficient Few-shot Examples
  - Only typical cases were covered
  - No learning from ambiguous patterns
Solution
1. Clarifying Classification Criteria
We added the perspective of "who is providing what" to the prompt:
// Improved prompt (simplified version)
const CLASSIFICATION_PROMPT = `
You are an email classification AI. Please classify emails into the following categories.
## Classification Criteria
**Most Important Point: Who is providing what?**
### PROJECT (Project Information)
- **Provider**: Client companies, sales representatives
- **Content**: Development projects, job postings, work requests
- **Definitive Keywords**:
- "Project details", "Project information", "Job posting"
- "Development member recruitment", "Candidates available"
- **Judgment Method**:
- Email sender is providing the project
- Recruiting engineers/personnel
### TALENT (Personnel Information)
- **Provider**: Staffing companies, sales representatives, agents
- **Content**: Engineer introductions, skill sheets
- **Definitive Keywords**:
- "Personnel information", "Talent information", "Engineer information attached"
- "Engineer introduction", "Skill sheet of Mr./Ms. ○○"
- **Judgment Method**:
- Email sender is introducing personnel
- Contains personal information such as age/gender
- Even if "project desired" appears, if it's written as the person's preference, classify as TALENT
## Judgment for Ambiguous Cases
### Context Analysis for "Project Desired"
- "Mr./Ms. ○○ desires projects" → TALENT (person's preference)
- "Those who desire the following project" → PROJECT (recruitment condition)
### Confirming Subject of "Introduction"
- "Introducing engineer" → TALENT (introducing personnel)
- "Introducing project" → PROJECT (introducing project)
### Handling Age/Gender Information
- Personal age/gender listed → Almost certainly TALENT
- "Age unspecified" as recruitment condition → PROJECT
`;
2. Adding Few-shot Examples
We added Few-shot examples based on actual misclassification cases:
const FEW_SHOT_EXAMPLES = [
// Existing examples 1, 2...
// Example 3: TALENT email containing "project desired"
{
input: {
subject: "[Personnel Information] Introduction of Taro Yamada",
fromName: "Suzuki (ABC Corporation)",
bodyText: `
Thank you for your continued support.
We are sending the following personnel information.
[Basic Information]
Name: Taro Yamada
Age: 35 years old
Gender: Male
[Skills]
- Java, Spring Boot 5 years
- AWS, Docker 3 years
[Desired Conditions]
Rate: 600,000 yen/month
Project desired: Remote-friendly projects
Location: Within Tokyo
We appreciate your consideration.
`
},
output: {
category: "TALENT",
reasoning: "Staffing company introducing engineer (Taro Yamada). Contains personal information such as age and gender. 'Project desired' is listed as the person's preference, so classified as TALENT."
}
},
// Example 4: TALENT email with "Sending engineer information"
{
input: {
subject: "Sending Engineer Information",
fromName: "Sato (XYZ Staffing Services)",
bodyText: `
Thank you for your work.
Sending the following engineer information.
[Profile]
Name: Hanako Sato
Age: 28 years old
Nearest Station: Shinjuku Station
[Experience]
- React, TypeScript 3 years
- Next.js, Tailwind CSS 2 years
Currently available for immediate start.
Please let us know if you're interested.
`
},
output: {
category: "TALENT",
reasoning: "Staffing service company introducing engineer (Hanako Sato). Contains personal information (age, nearest station). Sender is in position of providing personnel, so TALENT."
}
},
// Example 5: Confusing PROJECT email
{
input: {
subject: "[Project Information] Web Application Development Member Recruitment",
fromName: "Tanaka (Client Corporation)",
bodyText: `
Thank you for your support.
We are recruiting development members for the following project.
[Project Details]
Content: E-commerce site renewal
Period: January - June 2025 (6 months)
Rate: 600,000-800,000 yen/month
Location: Shibuya, Tokyo (remote available)
[Required Skills]
- React, TypeScript 3+ years
- Team development experience
[Desired Candidates]
Immediately available
Full-time 5 days/week available
If you have interested engineers,
we would appreciate introductions.
`
},
output: {
category: "PROJECT",
reasoning: "Client company providing development project. 'Recruitment' and 'desired candidates' indicate seeking engineers. Sender is in position of providing project, so PROJECT."
}
}
];
3. Strengthening Judgment Logic
We added a preprocessing step that detects personal information (age/gender) patterns and passes a hint to the AI classifier:
export async function classifyEmail(email: {
subject: string;
fromName: string;
fromAddress: string;
bodyText: string;
}): Promise<ClassificationResult> {
// Preprocessing: check obvious patterns before calling the AI
// Personal information patterns (age/gender/name) almost certainly indicate TALENT.
// Japanese labels: 年齢 = age, 性別 = gender (男性 = male, 女性 = female), 生年月日 = date of birth, 氏名 = full name followed by the honorific 様
const personalInfoPatterns = [
/年齢[::]\s*\d{2}歳/,
/性別[::]\s*(男性|女性)/,
/生年月日[::]/,
/氏名[::]\s*[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]+\s*様/u
];
const hasPersonalInfo = personalInfoPatterns.some(pattern =>
pattern.test(email.bodyText)
);
// Execute AI classification
const aiResult = await callAIClassificationAPI({
prompt: CLASSIFICATION_PROMPT,
fewShotExamples: FEW_SHOT_EXAMPLES,
email: email,
hint: hasPersonalInfo ? 'Likely TALENT due to personal information' : undefined
});
return aiResult;
}
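For reference, here is a minimal usage sketch of classifyEmail, fed with the email from Misclassification Case 1. The sender address is a made-up placeholder, and the body is shown in Japanese so the personal-information regexes above actually match:

const result = await classifyEmail({
  subject: '[Personnel Information] Introduction of Taro Yamada',
  fromName: 'Suzuki (ABC Corporation)',
  fromAddress: 'suzuki@example.co.jp', // placeholder address for illustration
  bodyText: '氏名:山田太郎 様\n年齢:35歳\n希望案件:リモート可の案件を希望'
});
console.log(result.category);  // Expected: 'TALENT'
console.log(result.reasoning); // e.g. "'Project desired' is listed as the person's preference..."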
Test Results
Accuracy Before Improvement
Testing with 100 actual misclassified emails:
- Accuracy: 72% (72 correct, 28 misclassified)
- TALENT email misclassifications: 18 (personnel classified as projects)
- PROJECT email misclassifications: 10 (projects classified as personnel)
Accuracy After Improvement
Retesting with the same 100 emails:
- Accuracy: 98% (98 correct, 2 misclassified)
- TALENT email misclassifications: 1 (extremely ambiguous content)
- PROJECT email misclassifications: 1 (composite email containing both elements)
Large-scale Validation (1,000 emails)
Validation results with 1,000 production data emails:
| Category | Before | After | Improvement |
|---|---|---|---|
| PROJECT | 68% (272/400) | 97% (388/400) | +29% |
| TALENT | 75% (450/600) | 98% (588/600) | +23% |
| Overall | 72% (722/1,000) | 97.6% (976/1,000) | +25.6% |
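For context, the before/after figures were obtained by re-running labeled emails through the classifier and comparing predictions against human labels. A minimal evaluation harness along those lines could look like this sketch (the LabeledEmail shape and aggregation are assumptions, not the team's actual tooling):

interface LabeledEmail {
  email: { subject: string; fromName: string; fromAddress: string; bodyText: string };
  expectedCategory: 'PROJECT' | 'TALENT';
}

// Computes overall and per-category accuracy against a labeled dataset
async function evaluate(dataset: LabeledEmail[]) {
  const perCategory = new Map<string, { correct: number; total: number }>();
  let correctTotal = 0;
  for (const sample of dataset) {
    const result = await classifyEmail(sample.email);
    const stats = perCategory.get(sample.expectedCategory) ?? { correct: 0, total: 0 };
    stats.total++;
    if (result.category === sample.expectedCategory) {
      stats.correct++;
      correctTotal++;
    }
    perCategory.set(sample.expectedCategory, stats);
  }
  return {
    overallAccuracy: correctTotal / dataset.length,
    perCategoryAccuracy: Object.fromEntries(
      [...perCategory].map(([category, s]) => [category, s.correct / s.total])
    )
  };
}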
Specific Improvement Examples
Test Case 1:
Subject: [Personnel Information] Introduction of Mr./Ms. ○○
Result: TALENT (correct)
Reason: Based on "who is providing what" criteria,
correctly identified as staffing company providing engineer
Test Case 2:
Subject: Re: Introduction of Engineer
Result: TALENT (correct)
Reason: Correctly identified as individual introduction based on age/gender information
Technical Details
Prompt Engineering Key Points
- Hierarchical Judgment Criteria (a sketch of this layered flow follows the list)
  - Level 1: Check definitive keywords
  - Level 2: Contextual analysis of "who provides what"
  - Level 3: Learning from Few-shot examples
  - Level 4: Personal information pattern verification
- Few-shot Learning Effectiveness
  - 0-shot (no examples): 60% accuracy
  - 2-shot (2 examples): 80% accuracy
  - 5-shot (5 examples): 100% accuracy
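As a rough illustration, the layered criteria can be treated as a short-circuiting pipeline. The helpers matchDefinitiveKeywords and detectPersonalInfoPatterns are illustrative stand-ins, not the production implementation:

async function classifyHierarchically(email: Email): Promise<ClassificationResult> {
  // Level 1: definitive keywords decide immediately (e.g. "[Personnel Information]" in the subject)
  const keywordResult = matchDefinitiveKeywords(email);
  if (keywordResult) return keywordResult;
  // Level 4 signal computed up front: personal-information patterns (age/gender/name)
  const hasPersonalInfo = detectPersonalInfoPatterns(email.bodyText);
  // Levels 2 and 3: context ("who provides what") and Few-shot examples are handled
  // inside the LLM call, guided by the prompt and the optional hint
  return callAIClassificationAPI({
    prompt: CLASSIFICATION_PROMPT,
    fewShotExamples: FEW_SHOT_EXAMPLES,
    email,
    hint: hasPersonalInfo ? 'Likely TALENT due to personal information' : undefined
  });
}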
Importance of Context
// Simple keyword matching (BAD)
if (bodyText.includes('project')) {
return 'PROJECT';
}
// Context-aware judgment (GOOD)
const context = analyzeContext(bodyText);
if (context.provider === 'client' && context.offering === 'project') {
return 'PROJECT';
}
AI Model Selection
Accuracy comparison across different AI models:
| Model | Accuracy | Speed | Cost/Month※ |
|---|---|---|---|
| GPT-3.5 | 85% | 0.5s | ¥3,000 |
| GPT-4 | 95% | 2.0s | ¥18,000 |
| GPT-4 Turbo | 98% | 1.5s | ¥12,000 |
| Claude 3 Opus | 96% | 1.8s | ¥15,000 |
※Estimated cost when processing 100,000 emails/month (prices as of November 2024)
We selected GPT-4 Turbo for this implementation.
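The callAIClassificationAPI function referenced in the code above is not shown in full in this article. A minimal sketch of what it might look like with the official openai Node SDK and JSON-mode output follows; the message layout and response parsing are assumptions, not the production code:

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function callAIClassificationAPI(params: {
  prompt: string;
  fewShotExamples: typeof FEW_SHOT_EXAMPLES;
  email: { subject: string; fromName: string; bodyText: string };
  hint?: string;
}): Promise<ClassificationResult> {
  // Few-shot examples are supplied as alternating user/assistant turns
  const exampleMessages = params.fewShotExamples.flatMap(example => [
    { role: 'user' as const, content: JSON.stringify(example.input) },
    { role: 'assistant' as const, content: JSON.stringify(example.output) }
  ]);
  const response = await openai.chat.completions.create({
    model: 'gpt-4-1106-preview',
    temperature: 0,
    // JSON mode assumes the system prompt instructs the model to answer in JSON
    response_format: { type: 'json_object' },
    messages: [
      { role: 'system', content: params.prompt },
      ...exampleMessages,
      { role: 'user', content: JSON.stringify({ ...params.email, hint: params.hint }) }
    ]
  });
  // Expected shape (per the prompt's instructions): { category, confidence, reasoning }
  return JSON.parse(response.choices[0].message.content ?? '{}');
}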
Cost Analysis
Monthly email processing volume and cost estimation:
// Cost calculation
const COST_ANALYSIS = {
// Processing volume
emailsPerDay: 3000,
emailsPerMonth: 90000,
// GPT-4 Turbo pricing (as of November 2024)
inputTokenCost: 0.01, // $0.01 per 1K tokens
outputTokenCost: 0.03, // $0.03 per 1K tokens
// Average token count (measured)
avgInputTokens: 800, // Prompt + email body
avgOutputTokens: 150, // Classification result + reasoning
// Monthly cost calculation
calculateMonthlyCost() {
const inputCost = (this.emailsPerMonth * this.avgInputTokens / 1000) * this.inputTokenCost;
const outputCost = (this.emailsPerMonth * this.avgOutputTokens / 1000) * this.outputTokenCost;
return {
inputCost: inputCost,
outputCost: outputCost,
totalCost: inputCost + outputCost,
totalCostJPY: (inputCost + outputCost) * 150 // 1USD = 150JPY
};
}
};
console.log(COST_ANALYSIS.calculateMonthlyCost());
// Result:
// {
// inputCost: 720 USD,
// outputCost: 405 USD,
// totalCost: 1,125 USD,
// totalCostJPY: 168,750 JPY
// }
Cost reduction optimizations:
- Cache utilization to avoid re-classifying duplicate emails (-30%)
- Batch processing for efficiency (-10%)
- Preprocessing filter for obvious patterns (-20%)
Estimated cost after optimization: Approximately ¥67,500/month
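The largest of these savings, skipping re-classification of duplicate emails, can be implemented with a simple content-hash cache. A minimal in-memory sketch follows (a production setup would more likely use Redis or a database table):

import { createHash } from 'node:crypto';

const classificationCache = new Map<string, ClassificationResult>();

// Duplicate forwards/re-sends of the same content hash to the same key
function emailCacheKey(email: { subject: string; bodyText: string }): string {
  return createHash('sha256')
    .update(email.subject)
    .update('\n')
    .update(email.bodyText)
    .digest('hex');
}

async function classifyWithCache(email: {
  subject: string; fromName: string; fromAddress: string; bodyText: string;
}): Promise<ClassificationResult> {
  const key = emailCacheKey(email);
  const cached = classificationCache.get(key);
  if (cached) return cached; // no API call, no token cost
  const result = await classifyEmail(email);
  classificationCache.set(key, result);
  return result;
}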
Prompt Version Management
Git Management and Semantic Versioning
Implemented version control to track prompt changes:
// prompts/email-classification/v2.1.0.ts
export const EMAIL_CLASSIFICATION_PROMPT_V2_1_0 = {
version: '2.1.0',
releaseDate: '2024-11-13',
changes: [
'Added personal information pattern detection',
'Improved TALENT category accuracy',
'Fixed false positives for "project desired" keyword'
],
prompt: `...actual prompt...`,
fewShotExamples: [...],
metrics: {
accuracy: 0.98,
precision: 0.97,
recall: 0.99
}
};
// Prompt A/B testing
export class PromptVersionManager {
private currentVersion = 'v2.1.0';
private versions = new Map<string, PromptVersion>();
async testNewVersion(email: Email, newVersion: string) {
const currentResult = await this.classify(email, this.currentVersion);
const newResult = await this.classify(email, newVersion);
// Compare results and log
await this.logComparison({
email: email.id,
currentVersion: this.currentVersion,
newVersion: newVersion,
currentResult,
newResult,
agree: currentResult.category === newResult.category
});
return { currentResult, newResult };
}
async rollback(version: string) {
console.log(`Rolling back from ${this.currentVersion} to ${version}`);
this.currentVersion = version;
// Send alert notification
await this.notifyRollback(version);
}
}
Fallback Strategies
Handling Low Confidence Cases
Approach when AI judgment lacks confidence:
interface ClassificationResult {
category: 'PROJECT' | 'TALENT' | 'OTHER' | 'UNCERTAIN';
confidence: number; // Confidence score 0-1
reasoning: string;
requiresManualReview?: boolean;
}
export async function classifyWithFallback(
email: Email
): Promise<ClassificationResult> {
try {
// Step 1: Execute AI classification
const aiResult = await classifyEmail(email);
// Step 2: Check confidence
if (aiResult.confidence < 0.7) {
console.warn(`Low confidence classification: ${aiResult.confidence}`, {
emailId: email.id,
category: aiResult.category
});
// Step 3: Second opinion (different model)
const claudeResult = await classifyWithClaude(email);
if (aiResult.category !== claudeResult.category) {
// If opinions differ, request manual review
return {
category: 'UNCERTAIN',
confidence: Math.min(aiResult.confidence, claudeResult.confidence),
reasoning: `GPT-4: ${aiResult.category}, Claude: ${claudeResult.category}`,
requiresManualReview: true
};
}
}
// Step 4: Rule-based validation
const ruleBasedCategory = applyBusinessRules(email);
if (ruleBasedCategory && ruleBasedCategory !== aiResult.category) {
console.warn('Rule-based override triggered', {
ai: aiResult.category,
rule: ruleBasedCategory
});
return {
...aiResult,
category: ruleBasedCategory,
reasoning: `Override: ${aiResult.reasoning}. Rule applied.`
};
}
return aiResult;
} catch (error) {
// Step 5: Error fallback
console.error('Classification failed, using fallback', error);
// Basic keyword matching
const fallbackCategory = simpleFallbackClassification(email);
return {
category: fallbackCategory || 'OTHER',
confidence: 0.3,
reasoning: 'Fallback classification due to AI error',
requiresManualReview: true
};
}
}
// Business rule validation
function applyBusinessRules(email: Email): 'PROJECT' | 'TALENT' | null {
// Definitive domain rules
const projectDomains = ['client-company.co.jp', 'project-sender.com'];
const talentDomains = ['hr-agency.jp', 'staffing-company.com'];
const domain = email.fromAddress.split('@')[1];
if (projectDomains.includes(domain)) return 'PROJECT';
if (talentDomains.includes(domain)) return 'TALENT';
// Definitive keyword rules
if (email.subject.startsWith('[Personnel Information]')) return 'TALENT';
if (email.subject.startsWith('[Project Details]')) return 'PROJECT';
return null;
}
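The simpleFallbackClassification function used in the error path above is also not shown; a minimal keyword-scoring sketch consistent with the "definitive keywords" listed in the prompt could look like this (the keyword lists are illustrative):

function simpleFallbackClassification(email: Email): 'PROJECT' | 'TALENT' | null {
  const text = `${email.subject}\n${email.bodyText}`;
  // Keywords taken from the "definitive keywords" section of the classification prompt
  const talentSignals = ['Personnel information', 'Talent information', 'Engineer introduction', 'Skill sheet'];
  const projectSignals = ['Project details', 'Project information', 'Job posting', 'Development member recruitment'];
  const talentScore = talentSignals.filter(keyword => text.includes(keyword)).length;
  const projectScore = projectSignals.filter(keyword => text.includes(keyword)).length;
  if (talentScore === 0 && projectScore === 0) return null; // caller falls back to 'OTHER' / manual review
  return talentScore >= projectScore ? 'TALENT' : 'PROJECT';
}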
Operational Considerations
Monitoring and Alerts
Continuously monitoring classification accuracy:
// monitoring/classification-monitor.ts
export class ClassificationMonitor {
private metrics = {
totalClassifications: 0,
lowConfidenceCount: 0,
errorCount: 0,
manualReviewQueue: []
};
async monitor() {
// Hourly accuracy check
setInterval(async () => {
const stats = await this.calculateHourlyStats();
// Anomaly detection
if (stats.accuracy < 0.90) {
await this.sendAlert({
level: 'WARNING',
message: `Classification accuracy dropped to ${stats.accuracy}`,
action: 'Check prompt performance'
});
}
if (stats.errorRate > 0.05) {
await this.sendAlert({
level: 'CRITICAL',
message: `High error rate: ${stats.errorRate}`,
action: 'Immediate investigation required'
});
}
// Send to CloudWatch metrics
await this.pushToCloudWatch(stats);
}, 3600000);
}
async recordClassification(result: ClassificationResult, actual?: string) {
this.metrics.totalClassifications++;
if (result.confidence < 0.7) {
this.metrics.lowConfidenceCount++;
}
if (result.requiresManualReview) {
this.metrics.manualReviewQueue.push({
timestamp: new Date(),
result
});
}
// Compare with actual category (feedback loop)
if (actual && actual !== result.category) {
await this.recordMisclassification({
predicted: result.category,
actual: actual,
confidence: result.confidence,
reasoning: result.reasoning
});
}
}
}
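The calculateHourlyStats method is referenced but not shown. A minimal sketch of the statistics it would need to produce for the alert thresholds above, written here as a standalone function over an assumed list of classification records:

interface ClassificationRecord {
  timestamp: Date;
  predicted: string;
  actual?: string;   // filled in later by the feedback loop
  errored: boolean;
}

function calculateHourlyStats(records: ClassificationRecord[]) {
  const oneHourAgo = Date.now() - 60 * 60 * 1000;
  const recent = records.filter(r => r.timestamp.getTime() >= oneHourAgo);
  // Accuracy is only measurable on records that have received a ground-truth label
  const labeled = recent.filter(r => r.actual !== undefined);
  const correct = labeled.filter(r => r.actual === r.predicted).length;
  return {
    total: recent.length,
    accuracy: labeled.length > 0 ? correct / labeled.length : 1,
    errorRate: recent.length > 0 ? recent.filter(r => r.errored).length / recent.length : 0
  };
}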
Regular Prompt Review
// Monthly review process
export async function monthlyPromptReview() {
const report = {
period: new Date().toISOString().slice(0, 7),
totalEmails: 0,
misclassifications: [],
commonPatterns: [],
recommendations: []
};
// Analyze misclassification patterns
const misclassified = await getMisclassifiedEmails();
// Extract patterns
const patterns = extractCommonPatterns(misclassified);
// Generate improvement suggestions
if (patterns.length > 0) {
report.recommendations.push({
type: 'ADD_FEW_SHOT_EXAMPLES',
patterns: patterns,
estimatedImpact: calculateImpact(patterns)
});
}
// Send report
await sendMonthlyReport(report);
}
Testing Strategy
Unit Test Implementation
// __tests__/email-classification.test.ts
import { describe, it, expect, beforeEach, vi } from 'vitest';
import { classifyEmail, classifyWithFallback } from '../classification';
describe('Email Classification', () => {
describe('Basic Classification', () => {
it('should correctly classify PROJECT emails', async () => {
const projectEmail = {
subject: '[Project Details] Web App Development Project',
fromName: 'Tanaka (Client Corporation)',
fromAddress: 'tanaka@client.co.jp',
bodyText: 'We are recruiting development members...'
};
const result = await classifyEmail(projectEmail);
expect(result.category).toBe('PROJECT');
expect(result.confidence).toBeGreaterThan(0.9);
});
it('should correctly classify TALENT emails', async () => {
const talentEmail = {
subject: '[Personnel Information] Introduction of Taro Yamada',
fromName: 'Suzuki (Staffing Services)',
fromAddress: 'suzuki@hr-agency.jp',
bodyText: 'Name: Taro Yamada, Age: 35 years old...'
};
const result = await classifyEmail(talentEmail);
expect(result.category).toBe('TALENT');
expect(result.confidence).toBeGreaterThan(0.9);
});
});
describe('Edge Cases', () => {
it('should handle ambiguous emails with fallback', async () => {
const ambiguousEmail = {
subject: 'Inquiry',
fromName: 'Unknown',
fromAddress: 'unknown@example.com',
bodyText: 'Details to follow'
};
const result = await classifyWithFallback(ambiguousEmail);
expect(result.requiresManualReview).toBe(true);
expect(result.confidence).toBeLessThan(0.7);
});
it('should detect personal information patterns', async () => {
const emailWithPersonalInfo = {
subject: 'Engineer Information',
fromName: 'Test',
fromAddress: 'test@example.com',
bodyText: 'Age: 30 years old, Gender: Male, Name: Jiro Sato'
};
const result = await classifyEmail(emailWithPersonalInfo);
expect(result.category).toBe('TALENT');
});
});
describe('Performance', () => {
it('should classify within timeout', async () => {
const email = generateTestEmail();
const startTime = performance.now();
await classifyEmail(email);
const endTime = performance.now();
expect(endTime - startTime).toBeLessThan(3000); // Within 3 seconds
});
it('should handle batch classification efficiently', async () => {
const emails = Array.from({ length: 100 }, generateTestEmail);
const results = await Promise.all(
emails.map(email => classifyEmail(email))
);
expect(results).toHaveLength(100);
expect(results.every(r => r.category)).toBe(true);
});
});
});
describe('Fallback Strategies', () => {
it('should use rule-based override when applicable', async () => {
const email = {
subject: 'Test Email',
fromName: 'Client',
fromAddress: 'test@client-company.co.jp', // Domain defined in rules
bodyText: 'Content'
};
const result = await classifyWithFallback(email);
expect(result.category).toBe('PROJECT');
expect(result.reasoning).toContain('Rule applied');
});
it('should request manual review for conflicting classifications', async () => {
// Mock the GPT and Claude classifiers to return conflicting results
// (simplified: in a real suite these would be module-level mocks via vi.mock rather than global spies)
vi.spyOn(global, 'classifyEmail').mockResolvedValueOnce({
category: 'PROJECT',
confidence: 0.6,
reasoning: 'GPT reasoning'
});
vi.spyOn(global, 'classifyWithClaude').mockResolvedValueOnce({
category: 'TALENT',
confidence: 0.6,
reasoning: 'Claude reasoning'
});
const result = await classifyWithFallback(generateTestEmail());
expect(result.category).toBe('UNCERTAIN');
expect(result.requiresManualReview).toBe(true);
});
});
Troubleshooting
Common Issues and Solutions
1. Token Limit Error
Symptom: "Maximum token limit exceeded" error
Cause: Email body too long or too many Few-shot examples
Solution:
// Email body truncation
function truncateEmailBody(bodyText: string, maxLength: number = 2000): string {
if (bodyText.length <= maxLength) return bodyText;
// Prioritize important sections
const header = bodyText.substring(0, 500);
const footer = bodyText.substring(bodyText.length - 300);
const middle = bodyText.substring(500, maxLength - 800);
return `${header}\n...[truncated]...\n${middle}\n...[truncated]...\n${footer}`;
}
// Pre-check token count
import { encoding_for_model } from 'tiktoken';
function estimateTokens(text: string): number {
const encoder = encoding_for_model('gpt-4');
const tokens = encoder.encode(text);
encoder.free();
return tokens.length;
}
2. Rate Limit Error
Symptom: "Rate limit exceeded" error
Solution:
// Rate limit handling with retry logic
import { RateLimiter } from 'limiter';
const limiter = new RateLimiter({
tokensPerInterval: 100,
interval: 'minute'
});
async function classifyWithRateLimit(email: Email): Promise<ClassificationResult> {
await limiter.removeTokens(1);
try {
return await classifyEmail(email);
} catch (error) {
if (error.code === 'rate_limit_exceeded') {
const waitTime = error.headers['retry-after'] || 60;
console.log(`Rate limited. Waiting ${waitTime}s...`);
await new Promise(resolve => setTimeout(resolve, waitTime * 1000));
return classifyWithRateLimit(email);
}
throw error;
}
}
3. Accuracy Degradation
Symptom: Accuracy decreases over time
Cause: Changes in business rules, emergence of new email patterns
Solution:
// Regular accuracy checks and automatic improvement
async function autoImprovePrompt() {
const recentMisclassifications = await getRecentMisclassifications(30); // 30 days
if (recentMisclassifications.length > 10) {
// Automatically generate new Few-shot examples
const newExamples = generateFewShotExamples(recentMisclassifications);
// Conduct A/B test
const improved = await testImprovedPrompt(newExamples);
// currentAccuracy: measured accuracy of the currently deployed prompt version
if (improved.accuracy > currentAccuracy * 1.05) { // require a 5% relative improvement
await deployNewPrompt(improved.prompt);
console.log('Prompt automatically improved and deployed');
}
}
}
Lessons Learned
Unexpected Pitfalls
- Keyword Trap
  - The word "project" appears in both categories
  - Keyword matching without context is dangerous
- Quality of Few-shot Examples
  - Simply adding more examples isn't effective
  - It's important to include ambiguous cases
- Prompt Length Trade-off
  - A prompt that's too detailed → more tokens, higher cost
  - A prompt that's too concise → reduced accuracy
  - Balance is crucial
Useful Knowledge for the Future
- Prompt Design Best Practices
  - Step 1: Define clear judgment criteria
  - Step 2: Add Few-shot examples of typical cases
  - Step 3: Add Few-shot examples of ambiguous cases
  - Step 4: Articulate the judgment logic
  - Step 5: Test and improve iteratively
- Selecting Few-shot Examples
  - Diversity: cover various patterns
  - Clarity: explicitly state the reasoning
  - Practicality: reference actual misclassification cases
Gradual Accuracy Improvement Approach
Phase 1: Measure baseline accuracy (60%)
Phase 2: Clarify judgment criteria (80%)
Phase 3: Add Few-shot examples (95%)
Phase 4: Utilize personal information patterns (100%)
Better Implementation Discovered
Before (vague instructions):
const prompt = `
Please classify this email.
- PROJECT: Project email
- TALENT: Personnel email
`;
After (clear judgment criteria):
const prompt = `
## Most Important Point: Who is providing what?
PROJECT = Client provides project
TALENT = Staffing company provides engineer
Judgment method:
1. Check definitive keywords
2. Confirm sender's position
3. Check for personal information
4. Match with Few-shot examples
`;
Conclusion
Improving AI classification accuracy was an opportunity to reaffirm the importance of prompt engineering. Key takeaways from this effort:
- Clarifying Judgment Criteria: The perspective of "who is providing what"
- Leveraging Few-shot Learning: Especially learning from ambiguous cases
- Gradual Improvement: Accumulating small improvements
In particular, including actual misclassification cases in the Few-shot examples was the decisive factor in improving accuracy. Since the AI learns from the examples it is given, which examples you provide matters enormously.
If you're facing challenges with AI classification accuracy, start by collecting misclassification cases and using them as Few-shot examples. You'll be surprised at how much accuracy improves.
The code presented in this article is simplified from actual production code. In real implementations, additional considerations such as error handling and security checks are required.
Related Technologies: OpenAI GPT, Claude, TypeScript, Prompt Engineering, Few-shot Learning, Natural Language Processing, AI Classification, Machine Learning
Author: Development Team