Overview
By migrating our email system's search functionality from traditional LIKE queries to PostgreSQL's full-text search, we reduced search times by up to 71%. This article details the implementation approach using tsvector columns and GIN indexes across 260,000 existing records, along with a non-blocking migration strategy.
Tech Stack
- PostgreSQL (Full-Text Search)
- Prisma ORM
- TypeScript
- tsvector / GIN Index
- websearch_to_tsquery()
Background & Challenges
Existing Problems
Initially, the email search feature was implemented using simple LIKE queries:
// Traditional LIKE search (simplified)
const emails = await prisma.email.findMany({
where: {
OR: [
{ subject: { contains: keyword } },
{ fromName: { contains: keyword } },
{ bodyText: { contains: keyword } }
]
}
});
However, as data volume exceeded 260,000 records, the following issues became apparent:
- Performance degradation: Some multi-keyword searches took over 1.4 seconds
- Scalability limits: Search time increased proportionally with data volume
- Lack of search flexibility: Unable to implement weighting or relevance scoring
Specific Numbers
- Java search: 1,458ms
- Engineer search: 968ms
- Project search: 655ms
These search times needed improvement.
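A quick way to see why: running EXPLAIN ANALYZE on the equivalent SQL shows a sequential scan, since a plain b-tree index cannot serve infix LIKE '%...%' patterns. A debug-only sketch (it interpolates the keyword directly, so never use it with untrusted input):
// Debug helper: print the execution plan for the LIKE-based search.
// Expect "Seq Scan on emails" in the output, explaining the O(N) behavior.
async function explainLikeSearch(keyword: string): Promise<void> {
  const rows = await prisma.$queryRawUnsafe<Array<{ 'QUERY PLAN': string }>>(`
    EXPLAIN ANALYZE
    SELECT id FROM emails
    WHERE subject LIKE '%${keyword}%'
       OR "fromName" LIKE '%${keyword}%'
       OR "bodyText" LIKE '%${keyword}%'
  `);
  for (const row of rows) console.log(row['QUERY PLAN']);
}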
Implementation Details
1. Database Schema Design
First, we added a search_vector column to the emails table. The key point is applying weights to subject, sender name, and body:
-- Add search_vector column
ALTER TABLE emails ADD COLUMN search_vector tsvector;
-- Create weighted trigger function
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('simple', COALESCE(NEW.subject, '')), 'A') ||
setweight(to_tsvector('simple', COALESCE(NEW."fromName", '')), 'B') ||
setweight(to_tsvector('simple', COALESCE(NEW."fromAddress", '')), 'C') ||
setweight(to_tsvector('simple',
LEFT(COALESCE(NEW."bodyText", ''), 10000)), 'D');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Create trigger
CREATE TRIGGER emails_search_vector_update
BEFORE INSERT OR UPDATE ON emails
FOR EACH ROW
EXECUTE FUNCTION update_search_vector();
Weight Intent:
- A (Subject): Most important information source
- B (Sender Name): Second most important
- C (Email Address): Supplementary information
- D (Body): Overall context
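On the Prisma side, the client has no native tsvector type, so the column can be declared with Unsupported(...) to keep it in the schema while excluding it from typed queries. A sketch of the relevant model fragment (the id type and field list are assumptions, not taken from the actual schema):
model Email {
  id            String  @id
  subject       String?
  fromName      String?
  fromAddress   String?
  bodyText      String?
  // Unsupported() keeps the column in migrations, but the Prisma client
  // cannot read or filter on it; all access goes through raw SQL.
  search_vector Unsupported("tsvector")?

  @@map("emails")
}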
2. Creating GIN Index
We used the CONCURRENTLY option to avoid table locks:
// GIN index creation script (simplified)
async function createIndex() {
await prisma.$executeRawUnsafe(`
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_emails_search_vector
ON emails USING GIN(search_vector)
`);
}
Reasons for Using CONCURRENTLY:
- Doesn't block existing traffic
- Safe to run in production environments
- Allows reads and writes during index creation
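If you want visibility while the index builds, PostgreSQL 12+ exposes build progress in the pg_stat_progress_create_index view, queryable from a separate session. A minimal sketch:
// Watch the CONCURRENTLY build from another connection (PostgreSQL 12+).
async function reportIndexBuildProgress(): Promise<void> {
  const rows = await prisma.$queryRaw<
    Array<{ phase: string; blocks_done: bigint; blocks_total: bigint }>
  >`
    SELECT phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index
  `;
  for (const r of rows) {
    console.log(`${r.phase}: ${r.blocks_done}/${r.blocks_total} blocks`);
  }
}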
3. Backfilling Existing Data
We updated search_vector for 260,000 existing records using batch processing with error handling and retry logic:
const BATCH_SIZE = 1000;
const MAX_RETRIES = 3;
const RETRY_DELAY = 5000; // 5 seconds
async function backfillSearchVector() {
const totalCount = await prisma.email.count();
console.log(`Total emails to process: ${totalCount}`);
let successCount = 0;
let failureCount = 0;
for (let offset = 0; offset < totalCount; offset += BATCH_SIZE) {
let retryCount = 0;
let success = false;
while (retryCount < MAX_RETRIES && !success) {
try {
const startTime = Date.now();
// Efficient update using CTE (Common Table Expression)
await prisma.$executeRaw`
WITH batch AS (
SELECT id, subject, "fromName", "fromAddress", "bodyText"
FROM emails
ORDER BY id
LIMIT ${BATCH_SIZE} OFFSET ${offset}
)
UPDATE emails
SET search_vector =
setweight(to_tsvector('simple', COALESCE(batch.subject, '')), 'A') ||
setweight(to_tsvector('simple', COALESCE(batch."fromName", '')), 'B') ||
setweight(to_tsvector('simple', COALESCE(batch."fromAddress", '')), 'C') ||
setweight(to_tsvector('simple',
LEFT(COALESCE(batch."bodyText", ''), 10000)), 'D')
FROM batch
WHERE emails.id = batch.id
`;
const elapsed = Date.now() - startTime;
// Clamp so the final partial batch isn't overcounted
successCount += Math.min(BATCH_SIZE, totalCount - offset);
console.log(`✓ Processed ${Math.min(offset + BATCH_SIZE, totalCount)}/${totalCount} (${elapsed}ms)`);
success = true;
} catch (error) {
retryCount++;
console.error(`✗ Batch failed (attempt ${retryCount}/${MAX_RETRIES}):`, error);
if (retryCount < MAX_RETRIES) {
console.log(`Retrying in ${RETRY_DELAY}ms...`);
await new Promise(resolve => setTimeout(resolve, RETRY_DELAY));
} else {
failureCount += Math.min(BATCH_SIZE, totalCount - offset);
console.error(`✗ Batch permanently failed after ${MAX_RETRIES} attempts. Skipping...`);
}
}
}
}
console.log(`\nBackfill complete:`);
console.log(`- Success: ${successCount} records`);
console.log(`- Failure: ${failureCount} records`);
if (failureCount > 0) {
throw new Error(`Backfill completed with ${failureCount} failures`);
}
}
Processing Results:
- Completed 260,000 records in approximately 53 minutes
- Average 12 seconds per batch
- Up to 3 retries on errors
- Failed batches logged and skipped
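One refinement worth noting for future runs: OFFSET-based batching re-scans and discards all skipped rows on every iteration, so later batches get progressively slower. Keyset (id-based) pagination keeps per-batch cost flat. A minimal sketch under the assumption that id is a consistently sortable string key (reusing BATCH_SIZE from above, retry logic omitted):
// Keyset-paginated backfill: each batch resumes after the last processed id.
async function backfillWithKeyset(): Promise<void> {
  let lastId = '';
  for (;;) {
    // RETURNING gives us the ids touched in this batch so we can advance the cursor.
    const rows = await prisma.$queryRaw<Array<{ id: string }>>`
      WITH batch AS (
        SELECT id FROM emails
        WHERE id > ${lastId}
        ORDER BY id
        LIMIT ${BATCH_SIZE}
      )
      UPDATE emails e
      SET search_vector =
        setweight(to_tsvector('simple', COALESCE(e.subject, '')), 'A') ||
        setweight(to_tsvector('simple', COALESCE(e."fromName", '')), 'B') ||
        setweight(to_tsvector('simple', COALESCE(e."fromAddress", '')), 'C') ||
        setweight(to_tsvector('simple', LEFT(COALESCE(e."bodyText", ''), 10000)), 'D')
      FROM batch
      WHERE e.id = batch.id
      RETURNING e.id
    `;
    if (rows.length === 0) break;
    // RETURNING order is not guaranteed, so take the maximum id explicitly.
    lastId = rows.reduce((max, r) => (r.id > max ? r.id : max), lastId);
  }
}
Each batch starts exactly where the previous one ended, so throughput stays constant as the table grows.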
4. Search Query Implementation
We implemented a safe search function using websearch_to_tsquery() with error handling:
import { Prisma, type Email } from '@prisma/client';
import { prisma } from './prisma'; // path assumed; wherever your client instance lives

export async function findManyWithFullTextSearch(params: {
keywords: string[];
accountId: string;
category?: string;
skip?: number;
take?: number;
}): Promise<Email[]> {
const { keywords, accountId, category, skip = 0, take = 50 } = params;
// Input validation: Limit keyword count (DoS protection)
if (keywords.length > 10) {
throw new Error('Too many keywords. Maximum 10 keywords allowed.');
}
// Input validation: Limit keyword length
const invalidKeywords = keywords.filter(k => k.length > 100);
if (invalidKeywords.length > 0) {
throw new Error('Keyword too long. Maximum 100 characters per keyword.');
}
// Join keywords for an AND search. websearch_to_tsquery() treats
// space-separated terms as AND and tolerates stray operators, which
// makes it safer for raw user input than to_tsquery().
// SQL injection protection comes from Prisma's parameterization.
const searchQuery = keywords.join(' & ');
try {
const results = await prisma.$queryRaw<Email[]>`
SELECT e.*
FROM emails e
WHERE e.search_vector @@ websearch_to_tsquery('simple', ${searchQuery})
AND e."accountId" = ${accountId}
${category ? Prisma.sql`AND EXISTS (
SELECT 1 FROM "EmailClassificationResult" ecr
WHERE ecr."emailId" = e.id AND ecr.category = ${category}
LIMIT 1
)` : Prisma.empty}
ORDER BY ts_rank(e.search_vector, websearch_to_tsquery('simple', ${searchQuery})) DESC
LIMIT ${take} OFFSET ${skip}
`;
return results;
} catch (error) {
// Detect a tsquery syntax error (PostgreSQL error code 42601).
// Note: Prisma wraps driver errors, so the native code may surface on
// error.meta?.code while error.code holds Prisma's own code (e.g. 'P2010').
const pgCode =
(error as { code?: string; meta?: { code?: string } })?.meta?.code ??
(error as { code?: string })?.code;
if (pgCode === '42601') {
console.warn('Invalid tsquery syntax, returning empty results:', searchQuery);
return [];
}
// Log and rethrow other errors
console.error('Full-text search error:', error);
throw error;
}
}
Key Points:
- Security measures:
  - SQL injection prevention through Prisma.sql parameterization
  - DoS prevention via keyword count and length limits
- Error handling:
  - Returns an empty array on tsquery syntax errors so the application keeps working
  - Logs and rethrows all other errors
- Query optimization:
  - @@ operator: matches the tsvector against the tsquery
  - websearch_to_tsquery(): supports natural search syntax (e.g., "Java AND Python")
  - ts_rank(): sorts results by relevance score
  - WHERE clause ordered by selectivity: search_vector → accountId → category
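Here is a usage sketch (the account id and category values are placeholders):
// Example call: first page of results for "Java engineer" within one account.
const emails = await findManyWithFullTextSearch({
  keywords: ['Java', 'engineer'],
  accountId: 'acc_123',   // placeholder id
  category: 'recruiting', // hypothetical category value
  skip: 0,
  take: 20,
});
console.log(`Found ${emails.length} emails`);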
Performance Improvement Results
Search Time Comparison
| Search Keyword | Before (LIKE) | After (FTS) | Improvement |
|---|---|---|---|
| Java | 1,458ms | 596ms | 59% faster |
| Engineer | 968ms | 279ms | 71% faster |
| Project | 655ms | 203ms | 69% faster |
Why Such Significant Speed Improvements?
- Index scan efficiency
  - LIKE search: O(N), a full table scan on every query
  - GIN index: roughly O(log N), touching only matching index entries
- Query planner optimization
  - PostgreSQL's query planner uses the GIN index efficiently
  - WHERE clauses ordered by selectivity (most selective conditions first)
- Weighted relevance scoring
  - Subject matches are prioritized, improving the user experience
  - Natural ordering through ts_rank()
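For intuition on the weighted scoring: ts_rank() multiplies term frequency by a per-label weight, {0.1, 0.2, 0.4, 1.0} for labels D, C, B, A by default, so a subject (A) match outscores an otherwise identical body (D) match by roughly 10x. A standalone sketch that demonstrates this without touching any table:
// Compare the rank of the same term weighted as 'A' (subject) vs 'D' (body).
const ranks = await prisma.$queryRaw<Array<{ label: string; rank: number }>>`
  SELECT 'subject hit' AS label,
         ts_rank(setweight(to_tsvector('simple', 'Java'), 'A'),
                 websearch_to_tsquery('simple', 'Java')) AS rank
  UNION ALL
  SELECT 'body hit',
         ts_rank(setweight(to_tsvector('simple', 'Java'), 'D'),
                 websearch_to_tsquery('simple', 'Java'))
`;
console.table(ranks); // the A-weighted hit scores about 10x higher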
Lessons Learned
Unexpected Pitfalls
- Importance of character limits
  - A tsvector value is limited to 1MB
  - We stayed safely under the limit by truncating bodyText to 10,000 characters
- GIN index disk usage
  - Roughly 1.2GB of index for 260,000 records
  - About 30-40% of the original table size
  - Weigh storage cost against query performance
- Choosing the 'simple' configuration and Japanese search challenges
  - For mixed Japanese-English text, 'simple' suited us better than a language-specific configuration such as 'english' (PostgreSQL ships no built-in Japanese configuration)
  - Stemming is often unnecessary for this kind of data
  - Comparison with Japanese-aware approaches: morphological analysis (MeCab-based tooling) or bigram indexing (pg_bigm) offers higher accuracy at the cost of a more complex setup; 'simple' is easy to set up but has imperfect word boundary recognition
  - This project chose 'simple', prioritizing development speed
- CONCURRENTLY pitfalls
  - Cannot be used within a transaction
  - Index creation took longer than expected (about 45 minutes for 260,000 records)
  - Be careful when re-running after an error: a partially built index remains, marked INVALID, and must be dropped before retrying (see the sketch after this list)
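A failed CONCURRENTLY build leaves the half-built index behind flagged as invalid in pg_index. A sketch for detecting and clearing it before retrying (index name from section 2):
// Detect a leftover invalid index from a failed CONCURRENTLY build and drop it.
async function dropInvalidSearchIndex(): Promise<void> {
  const invalid = await prisma.$queryRaw<Array<{ relname: string }>>`
    SELECT c.relname
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    WHERE c.relname = 'idx_emails_search_vector'
      AND NOT i.indisvalid
  `;
  if (invalid.length > 0) {
    // DROP INDEX CONCURRENTLY also avoids blocking readers and writers,
    // and likewise cannot run inside a transaction.
    await prisma.$executeRawUnsafe(
      `DROP INDEX CONCURRENTLY IF EXISTS idx_emails_search_vector`
    );
    console.log('Dropped invalid index; safe to re-run createIndex()');
  }
}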
Knowledge for Future Use
- Gradual migration strategy
  - Step 1: Schema change (add search_vector column)
  - Step 2: Trigger setup (new data auto-updates)
  - Step 3: GIN index creation (CONCURRENTLY)
  - Step 4: Backfill existing data
  - Step 5: Switch the application code
- Performance tuning best practices
  - Verify execution plans with EXPLAIN ANALYZE
  - Order WHERE clauses by selectivity
  - LIMIT 1 inside an EXISTS clause documents intent, though EXISTS already stops at the first matching row
- Batch processing optimization
  - Efficient bulk updates using CTEs
  - Transparency through progress logging
  - Error handling and retry logic
Better Approaches Discovered
Before (Multiple OR Conditions):
where: {
OR: [
{ subject: { contains: 'Java' } },
{ fromName: { contains: 'Java' } },
{ bodyText: { contains: 'Java' } }
]
}
After (Full-Text Search, conceptual):
where: {
search_vector: {
search: 'Java',
mode: 'websearch'
}
}
(This snippet illustrates the shape of the query; our production code issues the raw SQL shown earlier, since the Prisma client cannot filter on an Unsupported tsvector column directly.)
The implementation became simpler, faster, and more flexible.
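For reference, Prisma does ship a native fullTextSearch preview feature that exposes a similar search operator, though on String fields rather than on a tsvector column. A sketch:
// schema.prisma: enable the preview feature
generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["fullTextSearch"]
}
With that enabled, String fields accept a search condition using tsquery-style operators:
// Prisma's built-in full-text search on a String field
const hits = await prisma.email.findMany({
  where: {
    subject: { search: 'Java & engineer' },
  },
});
This skips the custom weighting and ts_rank ordering of our raw-SQL approach, which is why we kept the hand-written query.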
Operational Considerations
Monitoring Setup
// Monitor index usage statistics
const indexStats = await prisma.$queryRaw`
SELECT
schemaname,
tablename,
indexname,
idx_scan as index_scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE indexname = 'idx_emails_search_vector'
`;
Alert Thresholds
- Search response time: Warning at 500ms+, Alert at 1 second+
- Index scan count: Warning when idx_scan grows more slowly than roughly 80% of expected search traffic (a sign the planner is bypassing the index)
- GIN index size: Warning when exceeding 30% of disk capacity
Log Output Design
// Structured log example
logger.info('full_text_search', {
keywords: keywords.join(','),
resultCount: results.length,
duration: Date.now() - startTime,
accountId,
category
});
Test Strategy
Unit Tests
describe('findManyWithFullTextSearch', () => {
it('should return emails matching single keyword', async () => {
const results = await findManyWithFullTextSearch({
keywords: ['Java'],
accountId: 'acc_123'
});
expect(results).toHaveLength(10);
expect(results[0].subject).toContain('Java');
});
it('should handle invalid tsquery syntax gracefully', async () => {
const results = await findManyWithFullTextSearch({
keywords: ['test & | invalid'],
accountId: 'acc_123'
});
expect(results).toEqual([]); // Returns empty array instead of error
});
it('should reject too many keywords', async () => {
await expect(
findManyWithFullTextSearch({
keywords: new Array(11).fill('test'),
accountId: 'acc_123'
})
).rejects.toThrow('Too many keywords');
});
});
Performance Tests
describe('Full-text search performance', () => {
it('should complete search within 500ms', async () => {
const startTime = Date.now();
await findManyWithFullTextSearch({
keywords: ['engineer'],
accountId: 'acc_123'
});
const duration = Date.now() - startTime;
expect(duration).toBeLessThan(500);
});
});
Troubleshooting
Common Problems and Solutions
- Search returns 0 results
  - Cause: search_vector may be NULL for some rows
  - Check: SELECT COUNT(*) FROM emails WHERE search_vector IS NULL
  - Solution: Re-run the backfill script
- Search is slow
  - Cause: the GIN index is not being used
  - Check: inspect the execution plan with EXPLAIN ANALYZE
  - Solution: refresh statistics with VACUUM ANALYZE emails
- tsvector size limit error
  - Cause: the 10,000-character truncation is not being applied, so a very long bodyText pushes the tsvector past its 1MB limit
  - Check: SELECT id, LENGTH("bodyText") FROM emails WHERE LENGTH("bodyText") > 10000
  - Solution: verify the LEFT(..., 10000) truncation in the trigger function and reapply it
Debugging Methods
-- Check search_vector for specific email
SELECT
id,
subject,
search_vector,
ts_rank(search_vector, websearch_to_tsquery('simple', 'Java')) as rank
FROM emails
WHERE id = 'email_123';
-- Verify index is being used
EXPLAIN ANALYZE
SELECT * FROM emails
WHERE search_vector @@ websearch_to_tsquery('simple', 'Java')
LIMIT 10;
Conclusion
PostgreSQL's full-text search feature is a powerful tool that can achieve fast search even on large-scale data when properly implemented. The following points were key to success in this implementation:
- Non-blocking migration: Minimized impact on production environment
- Weighted search: Relevance scoring focused on user experience
- Gradual migration: Risk-distributed implementation strategy
In particular, combining the CONCURRENTLY option with batched backfills is what let us migrate a large dataset without any service interruption.
If your data volume is growing, we recommend evaluating full-text search early. When LIKE search starts to show its limits, that is the time to migrate.
The code presented in this article is a simplified version of actual production code. Real implementations require additional considerations for error handling and security checks.
Related Technologies: PostgreSQL, Prisma ORM, TypeScript, Full-Text Search, GIN Index, tsvector, Performance Optimization, Database Design
Author: 91works Development Team