我妻良樹

How We Achieved Up to 71% Faster Email Search with PostgreSQL Full-Text Search

Overview

By migrating our email system's search functionality from traditional LIKE queries to PostgreSQL's full-text search, we reduced search times by up to 71%. This article details the implementation using tsvector and GIN indexes, the backfill of 260,000 existing records, and a non-blocking migration strategy.

Tech Stack

  • PostgreSQL (Full-Text Search)
  • Prisma ORM
  • TypeScript
  • tsvector / GIN Index
  • websearch_to_tsquery()

Background & Challenges

Existing Problems

Initially, the email search feature was implemented using simple LIKE queries:

// Traditional LIKE search (simplified)
const emails = await prisma.email.findMany({
  where: {
    OR: [
      { subject: { contains: keyword } },
      { fromName: { contains: keyword } },
      { bodyText: { contains: keyword } }
    ]
  }
});

However, as data volume exceeded 260,000 records, the following issues became apparent:

  • Performance degradation: Some multi-keyword searches took over 1.4 seconds
  • Scalability limits: Search time increased proportionally with data volume
  • Lack of search flexibility: Unable to implement weighting or relevance scoring

Specific Numbers

  • Java search: 1,458ms
  • Engineer search: 968ms
  • Project search: 655ms

These search times needed improvement.

Implementation Details

1. Database Schema Design

First, we added a search_vector column to the emails table. The key point is applying weights to subject, sender name, and body:

-- Add search_vector column
ALTER TABLE emails ADD COLUMN search_vector tsvector;

-- Create weighted trigger function
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS trigger AS $$
BEGIN
  NEW.search_vector :=
    setweight(to_tsvector('simple', COALESCE(NEW.subject, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(NEW."fromName", '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(NEW."fromAddress", '')), 'C') ||
    setweight(to_tsvector('simple',
      LEFT(COALESCE(NEW."bodyText", ''), 10000)), 'D');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Create trigger
CREATE TRIGGER emails_search_vector_update
  BEFORE INSERT OR UPDATE ON emails
  FOR EACH ROW
  EXECUTE FUNCTION update_search_vector();

Weight Intent:

  • A (Subject): Most important information source
  • B (Sender Name): Second most important
  • C (Email Address): Supplementary information
  • D (Body): Overall context
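As a rough intuition for what these weights mean at ranking time: PostgreSQL's default ts_rank weight array is {D, C, B, A} = {0.1, 0.2, 0.4, 1.0}, so a subject (A) hit contributes ten times as much as a body (D) hit. The sketch below is a toy model only; the real ts_rank also accounts for term frequency, proximity, and normalization options:

```typescript
// Toy model of weighted ranking (NOT PostgreSQL's actual ts_rank formula).
// PostgreSQL's default ts_rank weights per label: D=0.1, C=0.2, B=0.4, A=1.0.
type Label = 'A' | 'B' | 'C' | 'D';

const DEFAULT_WEIGHTS: Record<Label, number> = { A: 1.0, B: 0.4, C: 0.2, D: 0.1 };

// Sum the weight of every field label in which the keyword matched.
function toyRank(matchedLabels: Label[]): number {
  return matchedLabels.reduce((score, label) => score + DEFAULT_WEIGHTS[label], 0);
}

// A subject-only match outranks a match in both address (C) and body (D):
console.log(toyRank(['A']) > toyRank(['C', 'D'])); // true
```

This is why an email whose subject contains the keyword sorts above one that only mentions it in passing in the body.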

2. Creating GIN Index

We used the CONCURRENTLY option to avoid table locks:

// GIN index creation script (simplified)
// Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction,
// so issue it as a standalone statement (never wrap it in $transaction)
async function createIndex() {
  await prisma.$executeRawUnsafe(`
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_emails_search_vector
    ON emails USING GIN (search_vector)
  `);
}

Reasons for Using CONCURRENTLY:

  • Doesn't block existing traffic
  • Safe to run in production environments
  • Allows reads and writes during index creation

3. Backfilling Existing Data

We updated search_vector for 260,000 existing records using batch processing with error handling and retry logic:

const BATCH_SIZE = 1000;
const MAX_RETRIES = 3;
const RETRY_DELAY = 5000; // 5 seconds

async function backfillSearchVector() {
  const totalCount = await prisma.email.count();
  console.log(`Total emails to process: ${totalCount}`);

  let successCount = 0;
  let failureCount = 0;

  for (let offset = 0; offset < totalCount; offset += BATCH_SIZE) {
    let retryCount = 0;
    let success = false;

    while (retryCount < MAX_RETRIES && !success) {
      try {
        const startTime = Date.now();

        // Efficient update using CTE (Common Table Expression)
        await prisma.$executeRaw`
          WITH batch AS (
            SELECT id, subject, "fromName", "fromAddress", "bodyText"
            FROM emails
            ORDER BY id
            LIMIT ${BATCH_SIZE} OFFSET ${offset}
          )
          UPDATE emails
          SET search_vector =
            setweight(to_tsvector('simple', COALESCE(batch.subject, '')), 'A') ||
            setweight(to_tsvector('simple', COALESCE(batch."fromName", '')), 'B') ||
            setweight(to_tsvector('simple', COALESCE(batch."fromAddress", '')), 'C') ||
            setweight(to_tsvector('simple',
              LEFT(COALESCE(batch."bodyText", ''), 10000)), 'D')
          FROM batch
          WHERE emails.id = batch.id
        `;

        const elapsed = Date.now() - startTime;
        // The last batch may be smaller than BATCH_SIZE, so clamp the counters
        successCount += Math.min(BATCH_SIZE, totalCount - offset);
        console.log(`✓ Processed ${Math.min(offset + BATCH_SIZE, totalCount)}/${totalCount} (${elapsed}ms)`);
        success = true;

      } catch (error) {
        retryCount++;
        console.error(`✗ Batch failed (attempt ${retryCount}/${MAX_RETRIES}):`, error);

        if (retryCount < MAX_RETRIES) {
          console.log(`Retrying in ${RETRY_DELAY}ms...`);
          await new Promise(resolve => setTimeout(resolve, RETRY_DELAY));
        } else {
          failureCount += Math.min(BATCH_SIZE, totalCount - offset);
          console.error(`✗ Batch permanently failed after ${MAX_RETRIES} attempts. Skipping...`);
        }
      }
    }
  }

  console.log(`\nBackfill complete:`);
  console.log(`- Success: ${successCount} records`);
  console.log(`- Failure: ${failureCount} records`);

  if (failureCount > 0) {
    throw new Error(`Backfill completed with ${failureCount} failures`);
  }
}

Processing Results:

  • Completed 260,000 records in approximately 53 minutes
  • Average 12 seconds per batch
  • Up to 3 retries on errors
  • Failed batches logged and skipped
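One refinement worth noting (a hypothetical alternative, not what we shipped): LIMIT/OFFSET batching forces PostgreSQL to re-scan and discard all skipped rows on every batch, so later batches get progressively slower. Keyset (cursor) pagination keys each batch off the last processed id instead, keeping every batch equally cheap. Simulated here over an in-memory list of numeric ids:

```typescript
// Hypothetical keyset (cursor) batching sketch, as an alternative to OFFSET.
function* keysetBatches(ids: number[], batchSize: number): Generator<number[]> {
  const sorted = [...ids].sort((a, b) => a - b);
  let lastId = -Infinity;
  while (true) {
    // SQL equivalent: SELECT id FROM emails WHERE id > $lastId ORDER BY id LIMIT $batchSize
    const batch = sorted.filter((id) => id > lastId).slice(0, batchSize);
    if (batch.length === 0) return;
    lastId = batch[batch.length - 1];
    yield batch;
  }
}

const batches = [...keysetBatches([5, 1, 9, 3, 7, 2], 2)];
console.log(batches); // [ [ 1, 2 ], [ 3, 5 ], [ 7, 9 ] ]
```

In SQL this becomes `WHERE id > $lastId ORDER BY id LIMIT $batchSize`, with `lastId` carried between batches instead of a growing OFFSET.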

4. Search Query Implementation

We implemented a safe search function using websearch_to_tsquery() with error handling:

import { Prisma, PrismaClient } from '@prisma/client';
import type { Email } from '@prisma/client';

const prisma = new PrismaClient();

export async function findManyWithFullTextSearch(params: {
  keywords: string[];
  accountId: string;
  category?: string;
  skip?: number;
  take?: number;
}): Promise<Email[]> {
  const { keywords, accountId, category, skip = 0, take = 50 } = params;

  // Input validation: Limit keyword count (DoS protection)
  if (keywords.length > 10) {
    throw new Error('Too many keywords. Maximum 10 keywords allowed.');
  }

  // Input validation: Limit keyword length
  const invalidKeywords = keywords.filter(k => k.length > 100);
  if (invalidKeywords.length > 0) {
    throw new Error('Keyword too long. Maximum 100 characters per keyword.');
  }

  // Build the tsquery input (terms are AND-ed).
  // Values are parameterized by the $queryRaw tagged template below,
  // which prevents SQL injection.
  const searchQuery = keywords.join(' & ');

  try {
    const results = await prisma.$queryRaw<Email[]>`
      SELECT e.*
      FROM emails e
      WHERE e.search_vector @@ websearch_to_tsquery('simple', ${searchQuery})
        AND e."accountId" = ${accountId}
        ${category ? Prisma.sql`AND EXISTS (
          SELECT 1 FROM "EmailClassificationResult" ecr
          WHERE ecr."emailId" = e.id AND ecr.category = ${category}
          LIMIT 1
        )` : Prisma.empty}
      ORDER BY ts_rank(e.search_vector, websearch_to_tsquery('simple', ${searchQuery})) DESC
      LIMIT ${take} OFFSET ${skip}
    `;

    return results;

  } catch (error) {
    // Prisma wraps raw-query failures as PrismaClientKnownRequestError
    // (code P2010), with the underlying PostgreSQL error code in error.meta.
    // 42601 is PostgreSQL's syntax_error. (websearch_to_tsquery is designed
    // not to raise syntax errors, so this branch is defensive.)
    if (
      error instanceof Prisma.PrismaClientKnownRequestError &&
      error.code === 'P2010' &&
      (error.meta as { code?: string } | undefined)?.code === '42601'
    ) {
      console.warn('Invalid tsquery syntax, returning empty results:', searchQuery);
      return [];
    }

    // Log and rethrow other errors
    console.error('Full-text search error:', error);
    throw error;
  }
}

Key Points:

  • Security Measures:
    • SQL injection prevention through Prisma.sql parameterization
    • DoS attack prevention with keyword count and length limits
  • Error Handling:
    • Returns an empty array on tsquery syntax errors so the application keeps working
    • Logs and rethrows all other errors
  • Query Optimization:
    • @@ operator: matches a tsvector against a tsquery
    • websearch_to_tsquery(): accepts web-style syntax (unquoted terms are AND-ed; supports "quoted phrases", OR, and -exclusion)
    • ts_rank(): orders results by relevance score
    • Filter composition: the GIN-indexed search_vector match is combined with the accountId and category filters
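The validation rules above can be pulled into a small pure helper (a hypothetical refactor, not part of the original codebase) so they are unit-testable without a database:

```typescript
// Hypothetical pure helper extracting the validation rules above.
const MAX_KEYWORDS = 10;
const MAX_KEYWORD_LENGTH = 100;

function buildSearchQuery(keywords: string[]): string {
  if (keywords.length > MAX_KEYWORDS) {
    throw new Error('Too many keywords. Maximum 10 keywords allowed.');
  }
  if (keywords.some((k) => k.length > MAX_KEYWORD_LENGTH)) {
    throw new Error('Keyword too long. Maximum 100 characters per keyword.');
  }
  // Terms joined for an AND conjunction, as in the search function above
  return keywords.join(' & ');
}

console.log(buildSearchQuery(['Java', 'engineer'])); // Java & engineer
```

The search function then only has to call `buildSearchQuery(keywords)` and pass the result to the query.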

Performance Improvement Results

Search Time Comparison

Search Keyword | Before  | After | Improvement
---------------|---------|-------|------------
Java           | 1,458ms | 596ms | 59% faster
Engineer       | 968ms   | 279ms | 71% faster
Project        | 655ms   | 203ms | 69% faster
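The improvement column is simply (before − after) / before, rounded to the nearest percent:

```typescript
// Percentage improvement of "after" relative to "before" (rounded).
const improvement = (beforeMs: number, afterMs: number): number =>
  Math.round(((beforeMs - afterMs) / beforeMs) * 100);

console.log(improvement(1458, 596)); // 59
console.log(improvement(968, 279)); // 71
console.log(improvement(655, 203)); // 69
```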

Why Such Significant Speed Improvements?

  1. Index Scan Efficiency

    • LIKE search: O(N) - Full table scan
    • GIN index: O(log N) - Index scan only
  2. Query Planner Optimization

    • The planner drives the query from the GIN index scan and applies the remaining filters to the much smaller matched set
    • The textual order of AND-ed WHERE conditions does not change the plan; PostgreSQL reorders predicates by estimated cost
  3. Weighted Relevance Scoring

    • Prioritizes subject matches, improving user experience
    • Natural ordering through ts_rank()
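A back-of-envelope comparison makes point 1 concrete (GIN's real cost model is more involved, but the growth-rate contrast holds): at 260,000 rows, a sequential scan may inspect every row, while a balanced tree lookup needs only about log2(N) steps:

```typescript
// Back-of-envelope scaling sketch, not a GIN cost model.
function seqScanSteps(rows: number): number {
  return rows; // worst case: every row inspected
}

function indexLookupSteps(rows: number): number {
  return Math.ceil(Math.log2(rows)); // depth of a balanced search tree
}

console.log(seqScanSteps(260_000)); // 260000
console.log(indexLookupSteps(260_000)); // 18
```

The gap only widens as the table grows, which is why LIKE search degraded steadily while indexed search stayed flat.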

Lessons Learned

Unexpected Pitfalls

  1. Importance of Character Limits

    • tsvector has a 1MB limit
    • Ensured safety by limiting bodyText to 10,000 characters
  2. GIN Index Disk Usage

    • Approximately 1.2GB index size for 260,000 records
    • About 30-40% of original table size
    • Consider tradeoff between storage cost and performance
  3. Choosing the 'simple' Configuration and Japanese Search Challenges

    • For mixed Japanese-English text, 'simple' is more practical than a language-specific configuration such as 'english' (stock PostgreSQL ships no Japanese configuration)
    • Stemming is often unnecessary for this kind of data
    • Comparison with Japanese-aware alternatives:
      • Morphological or n-gram extensions (e.g. MeCab-based tokenizers, pg_bigm, PGroonga): better Japanese recall but more complex setup
      • 'simple' configuration: easy setup but imperfect word-boundary recognition for Japanese
      • This project chose 'simple', prioritizing development speed
  4. CONCURRENTLY Pitfalls

    • Cannot be used within transactions
    • Index creation took longer than expected (about 45 minutes for 260,000 records)
    • Caution needed on failure: the partially built index remains in the catalog marked INVALID and must be dropped (DROP INDEX) before retrying

Knowledge for Future Use

  1. Gradual Migration Strategy

   Step 1: Schema change (add search_vector column)
   Step 2: Trigger setup (new data auto-updates)
   Step 3: GIN index creation (CONCURRENTLY)
   Step 4: Backfill existing data
   Step 5: Application code switch

  2. Performance Tuning Best Practices

    • Verify execution plans with EXPLAIN ANALYZE
    • Make frequent filter conditions indexable; the planner orders AND-ed predicates by cost, not by how they are written
    • EXISTS already stops at its first matching row, so a LIMIT 1 inside it is redundant (though harmless)

  3. Batch Processing Optimization

    • Efficient bulk updates using a CTE
    • Transparency through progress logging
    • Error handling and retry logic
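The retry logic embedded in the backfill loop generalizes into a small reusable helper (an illustrative sketch; `withRetry` is not part of the original codebase):

```typescript
// Illustrative sketch: retry an async operation up to maxRetries times,
// waiting delayMs between attempts; rethrows the last error on exhaustion.
async function withRetry<T>(
  op: () => Promise<T>,
  maxRetries = 3,
  delayMs = 5000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await op();
    } catch (error) {
      lastError = error;
      console.error(`Attempt ${attempt}/${maxRetries} failed:`, error);
      if (attempt < maxRetries) {
        // Wait before the next attempt
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

With this, each backfill batch body shrinks to a single `await withRetry(...)` call wrapping the CTE update shown earlier.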

Better Approaches Discovered

Before (Multiple OR Conditions):

where: {
  OR: [
    { subject: { contains: 'Java' } },
    { fromName: { contains: 'Java' } },
    { bodyText: { contains: 'Java' } }
  ]
}

After (Full-Text Search):

// Conceptual sketch: relies on Prisma's fullTextSearch preview feature,
// whose exact syntax varies by Prisma version
where: {
  search_vector: {
    search: 'Java',
    mode: 'websearch'
  }
}

The implementation became simpler, faster, and more flexible.

Operational Considerations

Monitoring Setup

// Monitor index usage statistics
// (pg_stat_user_indexes exposes relname/indexrelname, not tablename/indexname)
const indexStats = await prisma.$queryRaw`
  SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE indexrelname = 'idx_emails_search_vector'
`;

Alert Thresholds

  • Search response time: Warning at 500ms+, Alert at 1 second+
  • Index scan count: Warning below 80% of expected traffic
  • GIN index size: Warning when exceeding 30% of disk capacity
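The response-time thresholds can be encoded directly in the alerting code (a hypothetical helper; names are illustrative):

```typescript
// Hypothetical helper encoding the response-time thresholds above:
// warning at 500ms+, alert at 1 second+.
type Severity = 'ok' | 'warning' | 'alert';

function classifySearchDuration(durationMs: number): Severity {
  if (durationMs >= 1000) return 'alert';
  if (durationMs >= 500) return 'warning';
  return 'ok';
}

console.log(classifySearchDuration(203)); // ok
console.log(classifySearchDuration(596)); // warning
console.log(classifySearchDuration(1458)); // alert
```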

Log Output Design

// Structured log example
logger.info('full_text_search', {
  keywords: keywords.join(','),
  resultCount: results.length,
  duration: Date.now() - startTime,
  accountId,
  category
});

Test Strategy

Unit Tests

describe('findManyWithFullTextSearch', () => {
  it('should return emails matching single keyword', async () => {
    const results = await findManyWithFullTextSearch({
      keywords: ['Java'],
      accountId: 'acc_123'
    });

    expect(results).toHaveLength(10);
    expect(results[0].subject).toContain('Java');
  });

  it('should handle invalid tsquery syntax gracefully', async () => {
    const results = await findManyWithFullTextSearch({
      keywords: ['test & | invalid'],
      accountId: 'acc_123'
    });

    expect(results).toEqual([]); // Returns empty array instead of error
  });

  it('should reject too many keywords', async () => {
    await expect(
      findManyWithFullTextSearch({
        keywords: new Array(11).fill('test'),
        accountId: 'acc_123'
      })
    ).rejects.toThrow('Too many keywords');
  });
});

Performance Tests

describe('Full-text search performance', () => {
  it('should complete search within 500ms', async () => {
    const startTime = Date.now();

    await findManyWithFullTextSearch({
      keywords: ['engineer'],
      accountId: 'acc_123'
    });

    const duration = Date.now() - startTime;
    expect(duration).toBeLessThan(500);
  });
});

Troubleshooting

Common Problems and Solutions

  1. Search returns 0 results

    • Cause: search_vector might be null
    • Check: SELECT COUNT(*) FROM emails WHERE search_vector IS NULL
    • Solution: Re-run backfill script
  2. Search is slow

    • Cause: GIN index not being used
    • Check: Verify execution plan with EXPLAIN ANALYZE
    • Solution: Update index statistics with VACUUM ANALYZE emails
  3. tsvector size limit error

    • Cause: bodyText exceeds 10,000 characters
    • Check: SELECT id, LENGTH("bodyText") FROM emails WHERE LENGTH("bodyText") > 10000
    • Solution: Verify and apply character limit in trigger function

Debugging Methods

-- Check search_vector for specific email
SELECT
  id,
  subject,
  search_vector,
  ts_rank(search_vector, websearch_to_tsquery('simple', 'Java')) as rank
FROM emails
WHERE id = 'email_123';

-- Verify index is being used
EXPLAIN ANALYZE
SELECT * FROM emails
WHERE search_vector @@ websearch_to_tsquery('simple', 'Java')
LIMIT 10;

Conclusion

PostgreSQL's full-text search feature is a powerful tool that can achieve fast search even on large-scale data when properly implemented. The following points were key to success in this implementation:

  • Non-blocking migration: Minimized impact on production environment
  • Weighted search: Relevance scoring focused on user experience
  • Gradual migration: Risk-distributed implementation strategy

Particularly, the gradual migration using the CONCURRENTLY option and batch processing was very effective in achieving large-scale data migration without service interruption.

If your data volume is growing, we recommend considering full-text search implementation early. When you feel the limits of LIKE search, that's the time to migrate.


The code presented in this article is a simplified version of actual production code. Real implementations require additional considerations for error handling and security checks.

Related Technologies: PostgreSQL, Prisma ORM, TypeScript, Full-Text Search, GIN Index, tsvector, Performance Optimization, Database Design

Author: 91works Development Team
