Overview
By migrating our email system's search functionality from traditional LIKE queries to PostgreSQL's full-text search, we reduced search times by up to 71%. This article details the implementation approach using tsvector columns and GIN indexes across 260,000 existing records, along with a non-blocking migration strategy.
Tech Stack
- PostgreSQL (Full-Text Search)
- Prisma ORM
- TypeScript
- tsvector / GIN Index
- websearch_to_tsquery()
Background & Challenges
Existing Problems
Initially, the email search feature was implemented using simple LIKE queries:
// Traditional LIKE search (simplified)
const emails = await prisma.email.findMany({
where: {
OR: [
{ subject: { contains: keyword } },
{ fromName: { contains: keyword } },
{ bodyText: { contains: keyword } }
]
}
});
However, as data volume exceeded 260,000 records, the following issues became apparent:
- Performance degradation: Some multi-keyword searches took over 1.4 seconds
- Scalability limits: Search time increased proportionally with data volume
- Lack of search flexibility: Unable to implement weighting or relevance scoring
Specific Numbers
- Java search: 1,458ms
- Engineer search: 968ms
- Project search: 655ms
These search times needed improvement.
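A quick way to see why: running EXPLAIN ANALYZE on the equivalent SQL shows a sequential scan, since a plain b-tree index cannot serve infix LIKE '%...%' patterns. A debug-only sketch (it interpolates the keyword directly, so never use it with untrusted input):
// Debug helper: print the execution plan for the LIKE-based search.
// Expect "Seq Scan on emails" in the output, explaining the O(N) behavior.
async function explainLikeSearch(keyword: string): Promise<void> {
  const rows = await prisma.$queryRawUnsafe<Array<{ 'QUERY PLAN': string }>>(`
    EXPLAIN ANALYZE
    SELECT id FROM emails
    WHERE subject LIKE '%${keyword}%'
       OR "fromName" LIKE '%${keyword}%'
       OR "bodyText" LIKE '%${keyword}%'
  `);
  for (const row of rows) console.log(row['QUERY PLAN']);
}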
Implementation Details
1. Database Schema Design
First, we added a search_vector column to the emails table. The key point is applying weights to subject, sender name, and body:
-- Add search_vector column
ALTER TABLE emails ADD COLUMN search_vector tsvector;
-- Create weighted trigger function
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS trigger AS $$
BEGIN
NEW.search_vector :=
setweight(to_tsvector('simple', COALESCE(NEW.subject, '')), 'A') ||
setweight(to_tsvector('simple', COALESCE(NEW."fromName", '')), 'B') ||
setweight(to_tsvector('simple', COALESCE(NEW."fromAddress", '')), 'C') ||
setweight(to_tsvector('simple',
LEFT(COALESCE(NEW."bodyText", ''), 10000)), 'D');
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Create trigger
CREATE TRIGGER emails_search_vector_update
BEFORE INSERT OR UPDATE ON emails
FOR EACH ROW
EXECUTE FUNCTION update_search_vector();
Weight Intent:
- A (Subject): Most important information source
- B (Sender Name): Second most important
- C (Email Address): Supplementary information
- D (Body): Overall context
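On the Prisma side, the client has no native tsvector type, so the column can be declared with Unsupported(...) to keep it in the schema while excluding it from typed queries. A sketch of the relevant model fragment (the id type and field list are assumptions, not taken from the actual schema):
model Email {
  id            String  @id
  subject       String?
  fromName      String?
  fromAddress   String?
  bodyText      String?
  // Unsupported() keeps the column in migrations, but the Prisma client
  // cannot read or filter on it; all access goes through raw SQL.
  search_vector Unsupported("tsvector")?

  @@map("emails")
}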
2. Creating GIN Index
We used the CONCURRENTLY option to avoid table locks:
// GIN index creation script (simplified)
async function createIndex() {
await prisma.$executeRawUnsafe(`
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_emails_search_vector
ON emails USING GIN(search_vector)
`);
}
Reasons for Using CONCURRENTLY:
- Doesn't block existing traffic
- Safe to run in production environments
- Allows reads and writes during index creation
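If you want visibility while the index builds, PostgreSQL 12+ exposes build progress in the pg_stat_progress_create_index view, queryable from a separate session. A minimal sketch:
// Watch the CONCURRENTLY build from another connection (PostgreSQL 12+).
async function reportIndexBuildProgress(): Promise<void> {
  const rows = await prisma.$queryRaw<
    Array<{ phase: string; blocks_done: bigint; blocks_total: bigint }>
  >`
    SELECT phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index
  `;
  for (const r of rows) {
    console.log(`${r.phase}: ${r.blocks_done}/${r.blocks_total} blocks`);
  }
}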
3. Backfilling Existing Data
We updated search_vector for 260,000 existing records using batch processing with error handling and retry logic:
const BATCH_SIZE = 1000;
const MAX_RETRIES = 3;
const RETRY_DELAY = 5000; // 5 seconds
async function backfillSearchVector() {
const totalCount = await prisma.email.count();
console.log(`Total emails to process: ${totalCount}`);
let successCount = 0;
let failureCount = 0;
for (let offset = 0; offset < totalCount; offset += BATCH_SIZE) {
let retryCount = 0;
let success = false;
while (retryCount < MAX_RETRIES && !success) {
try {
const startTime = Date.now();
// Efficient update using CTE (Common Table Expression)
await prisma.$executeRaw`
WITH batch AS (
SELECT id, subject, "fromName", "fromAddress", "bodyText"
FROM emails
ORDER BY id
LIMIT ${BATCH_SIZE} OFFSET ${offset}
)
UPDATE emails
SET search_vector =
setweight(to_tsvector('simple', COALESCE(batch.subject, '')), 'A') ||
setweight(to_tsvector('simple', COALESCE(batch."fromName", '')), 'B') ||
setweight(to_tsvector('simple', COALESCE(batch."fromAddress", '')), 'C') ||
setweight(to_tsvector('simple',
LEFT(COALESCE(batch."bodyText", ''), 10000)), 'D')
FROM batch
WHERE emails.id = batch.id
`;
const elapsed = Date.now() - startTime;
// Clamp so the final partial batch isn't overcounted
successCount += Math.min(BATCH_SIZE, totalCount - offset);
console.log(`✓ Processed ${Math.min(offset + BATCH_SIZE, totalCount)}/${totalCount} (${elapsed}ms)`);
success = true;
} catch (error) {
retryCount++;
console.error(`✗ Batch failed (attempt ${retryCount}/${MAX_RETRIES}):`, error);
if (retryCount < MAX_RETRIES) {
console.log(`Retrying in ${RETRY_DELAY}ms...`);
await new Promise(resolve => setTimeout(resolve, RETRY_DELAY));
} else {
failureCount += Math.min(BATCH_SIZE, totalCount - offset);
console.error(`✗ Batch permanently failed after ${MAX_RETRIES} attempts. Skipping...`);
}
}
}
}
console.log(`\nBackfill complete:`);
console.log(`- Success: ${successCount} records`);
console.log(`- Failure: ${failureCount} records`);
if (failureCount > 0) {
throw new Error(`Backfill completed with ${failureCount} failures`);
}
}
Processing Results:
- Completed 260,000 records in approximately 53 minutes
- Average 12 seconds per batch
- Up to 3 retries on errors
- Failed batches logged and skipped
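One refinement worth noting for future runs: OFFSET-based batching re-scans and discards all skipped rows on every iteration, so later batches get progressively slower. Keyset (id-based) pagination keeps per-batch cost flat. A minimal sketch under the assumption that id is a consistently sortable string key (reusing BATCH_SIZE from above, retry logic omitted):
// Keyset-paginated backfill: each batch resumes after the last processed id.
async function backfillWithKeyset(): Promise<void> {
  let lastId = '';
  for (;;) {
    // RETURNING gives us the ids touched in this batch so we can advance the cursor.
    const rows = await prisma.$queryRaw<Array<{ id: string }>>`
      WITH batch AS (
        SELECT id FROM emails
        WHERE id > ${lastId}
        ORDER BY id
        LIMIT ${BATCH_SIZE}
      )
      UPDATE emails e
      SET search_vector =
        setweight(to_tsvector('simple', COALESCE(e.subject, '')), 'A') ||
        setweight(to_tsvector('simple', COALESCE(e."fromName", '')), 'B') ||
        setweight(to_tsvector('simple', COALESCE(e."fromAddress", '')), 'C') ||
        setweight(to_tsvector('simple', LEFT(COALESCE(e."bodyText", ''), 10000)), 'D')
      FROM batch
      WHERE e.id = batch.id
      RETURNING e.id
    `;
    if (rows.length === 0) break;
    // RETURNING order is not guaranteed, so take the maximum id explicitly.
    lastId = rows.reduce((max, r) => (r.id > max ? r.id : max), lastId);
  }
}
Each batch starts exactly where the previous one ended, so throughput stays constant as the table grows.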
4. Search Query Implementation
We implemented a safe search function using websearch_to_tsquery() with error handling:
import { Prisma, type Email } from '@prisma/client';
import { prisma } from './prisma'; // path assumed; wherever your client instance lives

export async function findManyWithFullTextSearch(params: {
keywords: string[];
accountId: string;
category?: string;
skip?: number;
take?: number;
}): Promise<Email[]> {
const { keywords, accountId, category, skip = 0, take = 50 } = params;
// Input validation: Limit keyword count (DoS protection)
if (keywords.length > 10) {
throw new Error('Too many keywords. Maximum 10 keywords allowed.');
}
// Input validation: Limit keyword length
const invalidKeywords = keywords.filter(k => k.length > 100);
if (invalidKeywords.length > 0) {
throw new Error('Keyword too long. Maximum 100 characters per keyword.');
}
// Join keywords for an AND search. websearch_to_tsquery() treats
// space-separated terms as AND and tolerates stray operators, which
// makes it safer for raw user input than to_tsquery().
// SQL injection protection comes from Prisma's parameterization.
const searchQuery = keywords.join(' & ');
try {
const results = await prisma.$queryRaw<Email[]>`
SELECT e.*
FROM emails e
WHERE e.search_vector @@ websearch_to_tsquery('simple', ${searchQuery})
AND e."accountId" = ${accountId}
${category ? Prisma.sql`AND EXISTS (
SELECT 1 FROM "EmailClassificationResult" ecr
WHERE ecr."emailId" = e.id AND ecr.category = ${category}
LIMIT 1
)` : Prisma.empty}
ORDER BY ts_rank(e.search_vector, websearch_to_tsquery('simple', ${searchQuery})) DESC
LIMIT ${take} OFFSET ${skip}
`;
return results;
} catch (error) {
// Detect a tsquery syntax error (PostgreSQL error code 42601).
// Note: Prisma wraps driver errors, so the native code may surface on
// error.meta?.code while error.code holds Prisma's own code (e.g. 'P2010').
const pgCode =
(error as { code?: string; meta?: { code?: string } })?.meta?.code ??
(error as { code?: string })?.code;
if (pgCode === '42601') {
console.warn('Invalid tsquery syntax, returning empty results:', searchQuery);
return [];
}
// Log and rethrow other errors
console.error('Full-text search error:', error);
throw error;
}
}
Key Points:
- Security measures:
  - SQL injection prevention through Prisma.sql parameterization
  - DoS prevention via keyword count and length limits
- Error handling:
  - Returns an empty array on tsquery syntax errors so the application keeps working
  - Logs and rethrows all other errors
- Query optimization:
  - @@ operator: matches the tsvector against the tsquery
  - websearch_to_tsquery(): supports natural search syntax (e.g., "Java AND Python")
  - ts_rank(): sorts results by relevance score
  - WHERE clause ordered by selectivity: search_vector → accountId → category
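Here is a usage sketch (the account id and category values are placeholders):
// Example call: first page of results for "Java engineer" within one account.
const emails = await findManyWithFullTextSearch({
  keywords: ['Java', 'engineer'],
  accountId: 'acc_123',   // placeholder id
  category: 'recruiting', // hypothetical category value
  skip: 0,
  take: 20,
});
console.log(`Found ${emails.length} emails`);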
Performance Improvement Results
Search Time Comparison
| Search Keyword | Before (LIKE) | After (FTS) | Improvement |
|---|---|---|---|
| Java | 1,458ms | 596ms | 59% faster |
| Engineer | 968ms | 279ms | 71% faster |
| Project | 655ms | 203ms | 69% faster |
Why Such Significant Speed Improvements?
- Index scan efficiency
  - LIKE search: O(N), a full table scan on every query
  - GIN index: roughly O(log N), touching only matching index entries
- Query planner optimization
  - PostgreSQL's query planner uses the GIN index efficiently
  - WHERE clauses ordered by selectivity (most selective conditions first)
- Weighted relevance scoring
  - Subject matches are prioritized, improving the user experience
  - Natural ordering through ts_rank()
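For intuition on the weighted scoring: ts_rank() multiplies term frequency by a per-label weight, {0.1, 0.2, 0.4, 1.0} for labels D, C, B, A by default, so a subject (A) match outscores an otherwise identical body (D) match by roughly 10x. A standalone sketch that demonstrates this without touching any table:
// Compare the rank of the same term weighted as 'A' (subject) vs 'D' (body).
const ranks = await prisma.$queryRaw<Array<{ label: string; rank: number }>>`
  SELECT 'subject hit' AS label,
         ts_rank(setweight(to_tsvector('simple', 'Java'), 'A'),
                 websearch_to_tsquery('simple', 'Java')) AS rank
  UNION ALL
  SELECT 'body hit',
         ts_rank(setweight(to_tsvector('simple', 'Java'), 'D'),
                 websearch_to_tsquery('simple', 'Java'))
`;
console.table(ranks); // the A-weighted hit scores about 10x higher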
Lessons Learned
Unexpected Pitfalls
- Importance of character limits
  - A tsvector value is limited to 1MB
  - We stayed safely under the limit by truncating bodyText to 10,000 characters
- GIN index disk usage
  - Roughly 1.2GB of index for 260,000 records
  - About 30-40% of the original table size
  - Weigh storage cost against query performance
- Choosing the 'simple' configuration and Japanese search challenges
  - For mixed Japanese-English text, 'simple' suited us better than a language-specific configuration such as 'english' (PostgreSQL ships no built-in Japanese configuration)
  - Stemming is often unnecessary for this kind of data
  - Comparison with Japanese-aware approaches: morphological analysis (MeCab-based tooling) or bigram indexing (pg_bigm) offers higher accuracy at the cost of a more complex setup; 'simple' is easy to set up but has imperfect word boundary recognition
  - This project chose 'simple', prioritizing development speed
- CONCURRENTLY pitfalls
  - Cannot be used within a transaction
  - Index creation took longer than expected (about 45 minutes for 260,000 records)
  - Be careful when re-running after an error: a partially built index remains, marked INVALID, and must be dropped before retrying (see the sketch after this list)
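A failed CONCURRENTLY build leaves the half-built index behind flagged as invalid in pg_index. A sketch for detecting and clearing it before retrying (index name from section 2):
// Detect a leftover invalid index from a failed CONCURRENTLY build and drop it.
async function dropInvalidSearchIndex(): Promise<void> {
  const invalid = await prisma.$queryRaw<Array<{ relname: string }>>`
    SELECT c.relname
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indexrelid
    WHERE c.relname = 'idx_emails_search_vector'
      AND NOT i.indisvalid
  `;
  if (invalid.length > 0) {
    // DROP INDEX CONCURRENTLY also avoids blocking readers and writers,
    // and likewise cannot run inside a transaction.
    await prisma.$executeRawUnsafe(
      `DROP INDEX CONCURRENTLY IF EXISTS idx_emails_search_vector`
    );
    console.log('Dropped invalid index; safe to re-run createIndex()');
  }
}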
Knowledge for Future Use
- Gradual migration strategy
  - Step 1: Schema change (add search_vector column)
  - Step 2: Trigger setup (new data auto-updates)
  - Step 3: GIN index creation (CONCURRENTLY)
  - Step 4: Backfill existing data
  - Step 5: Switch the application code
- Performance tuning best practices
  - Verify execution plans with EXPLAIN ANALYZE
  - Order WHERE clauses by selectivity
  - LIMIT 1 inside an EXISTS clause documents intent, though EXISTS already stops at the first matching row
- Batch processing optimization
  - Efficient bulk updates using CTEs
  - Transparency through progress logging
  - Error handling and retry logic
Better Approaches Discovered
Before (Multiple OR Conditions):
where: {
OR: [
{ subject: { contains: 'Java' } },
{ fromName: { contains: 'Java' } },
{ bodyText: { contains: 'Java' } }
]
}
After (Full-Text Search, conceptual):
where: {
search_vector: {
search: 'Java',
mode: 'websearch'
}
}
(This snippet illustrates the shape of the query; our production code issues the raw SQL shown earlier, since the Prisma client cannot filter on an Unsupported tsvector column directly.)
The implementation became simpler, faster, and more flexible.
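For reference, Prisma does ship a native fullTextSearch preview feature that exposes a similar search operator, though on String fields rather than on a tsvector column. A sketch:
// schema.prisma: enable the preview feature
generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["fullTextSearch"]
}
With that enabled, String fields accept a search condition using tsquery-style operators:
// Prisma's built-in full-text search on a String field
const hits = await prisma.email.findMany({
  where: {
    subject: { search: 'Java & engineer' },
  },
});
This skips the custom weighting and ts_rank ordering of our raw-SQL approach, which is why we kept the hand-written query.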
Operational Considerations
Monitoring Setup
// Monitor index usage statistics
const indexStats = await prisma.$queryRaw`
SELECT
schemaname,
tablename,
indexname,
idx_scan as index_scans,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE indexname = 'idx_emails_search_vector'
`;
Alert Thresholds
- Search response time: Warning at 500ms+, Alert at 1 second+
- Index scan count: Warning when idx_scan grows more slowly than roughly 80% of expected search traffic (a sign the planner is bypassing the index)
- GIN index size: Warning when exceeding 30% of disk capacity
Log Output Design
// Structured log example
logger.info('full_text_search', {
keywords: keywords.join(','),
resultCount: results.length,
duration: Date.now() - startTime,
accountId,
category
});
Test Strategy
Unit Tests
describe('findManyWithFullTextSearch', () => {
it('should return emails matching single keyword', async () => {
const results = await findManyWithFullTextSearch({
keywords: ['Java'],
accountId: 'acc_123'
});
expect(results).toHaveLength(10);
expect(results[0].subject).toContain('Java');
});
it('should handle invalid tsquery syntax gracefully', async () => {
const results = await findManyWithFullTextSearch({
keywords: ['test & | invalid'],
accountId: 'acc_123'
});
expect(results).toEqual([]); // Returns empty array instead of error
});
it('should reject too many keywords', async () => {
await expect(
findManyWithFullTextSearch({
keywords: new Array(11).fill('test'),
accountId: 'acc_123'
})
).rejects.toThrow('Too many keywords');
});
});
Performance Tests
describe('Full-text search performance', () => {
it('should complete search within 500ms', async () => {
const startTime = Date.now();
await findManyWithFullTextSearch({
keywords: ['engineer'],
accountId: 'acc_123'
});
const duration = Date.now() - startTime;
expect(duration).toBeLessThan(500);
});
});
Troubleshooting
Common Problems and Solutions
- Search returns 0 results
  - Cause: search_vector may be NULL for some rows
  - Check: SELECT COUNT(*) FROM emails WHERE search_vector IS NULL
  - Solution: Re-run the backfill script
- Search is slow
  - Cause: the GIN index is not being used
  - Check: inspect the execution plan with EXPLAIN ANALYZE
  - Solution: refresh statistics with VACUUM ANALYZE emails
- tsvector size limit error
  - Cause: the 10,000-character truncation is not being applied, so a very long bodyText pushes the tsvector past its 1MB limit
  - Check: SELECT id, LENGTH("bodyText") FROM emails WHERE LENGTH("bodyText") > 10000
  - Solution: verify the LEFT(..., 10000) truncation in the trigger function and reapply it
Debugging Methods
-- Check search_vector for specific email
SELECT
id,
subject,
search_vector,
ts_rank(search_vector, websearch_to_tsquery('simple', 'Java')) as rank
FROM emails
WHERE id = 'email_123';
-- Verify index is being used
EXPLAIN ANALYZE
SELECT * FROM emails
WHERE search_vector @@ websearch_to_tsquery('simple', 'Java')
LIMIT 10;
Conclusion
PostgreSQL's full-text search feature is a powerful tool that can achieve fast search even on large-scale data when properly implemented. The following points were key to success in this implementation:
- Non-blocking migration: Minimized impact on production environment
- Weighted search: Relevance scoring focused on user experience
- Gradual migration: Risk-distributed implementation strategy
In particular, combining the CONCURRENTLY option with batched backfills is what let us migrate a large dataset without any service interruption.
If your data volume is growing, we recommend evaluating full-text search early. When LIKE search starts to show its limits, that is the time to migrate.
The code presented in this article is a simplified version of actual production code. Real implementations require additional considerations for error handling and security checks.
Related Technologies: PostgreSQL, Prisma ORM, TypeScript, Full-Text Search, GIN Index, tsvector, Performance Optimization, Database Design
Author: 91works Development Team