Meir
Building PeopleHub: An AI-Powered LinkedIn Intelligence Platform with LangGraph and Bright Data

I recently open-sourced PeopleHub, an AI-powered people search engine that combines natural language query parsing, LinkedIn profile scraping, and automated research report generation. In this post, I'll walk through the technical architecture, key design decisions, and implementation details.

Table of Contents

  1. What is PeopleHub?
  2. Architecture Overview
  3. Tech Stack
  4. AI Query Parser: Natural Language → Structured Search
  5. Bright Data Integration: The Data Pipeline
  6. Multi-Tier Caching Strategy
  7. LangGraph Research Engine: Agentic Workflows
  8. Database Design with Prisma
  9. Performance Optimizations
  10. Lessons Learned

What is PeopleHub?

PeopleHub solves a common problem: finding and researching professionals is either slow (manual LinkedIn searching) or expensive (premium tools charging $50+ per profile).

Key Features:

  • 🗣️ Natural language search - Just type "10 AI engineers in Israel"
  • ⚡ Smart caching - 70-90% cost reduction by reusing previously scraped profiles
  • 🔬 AI research reports - Automated due diligence with web scraping
  • 💾 Multi-tier persistence - PostgreSQL + Redis for optimal performance
  • 🤖 LangGraph workflows - Agentic multi-step research automation

Use Cases:

  • Recruiting and talent acquisition
  • Due diligence on executives/entrepreneurs
  • Competitive intelligence
  • Academic research on professional networks
  • Sales prospecting

Architecture Overview

┌─────────────────────────────────────────────────┐
│          Frontend (Next.js 15 + React)          │
│    SearchBar → Results → Research Reports       │
└────────────────────┬────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
   ┌────▼────┐  ┌───▼──┐  ┌─────▼──────┐
   │ Search  │  │Image │  │ Research   │
   │   API   │  │Proxy │  │    API     │
   └────┬────┘  └──────┘  └─────┬──────┘
        │                        │
        │     ┌──────────────────┴───────┐
        │     │  Prisma ORM + PostgreSQL │
        │     │  Redis Cache (Optional)  │
        │     └──────────┬─────────────┬─┘
        │                │             │
   ┌────▼────────────┐   │    ┌───────▼────────┐
   │  AI Query       │   │    │  LangGraph     │
   │  Parser         │   │    │  Research      │
   │ (Gemini 2.0)    │   │    │  Workflows     │
   └─────────────────┘   │    └───────┬────────┘
                         │            │
                    ┌────▼────────────▼──┐
                    │   Bright Data APIs │
                    │ • Google Search    │
                    │ • LinkedIn Scraper │
                    │ • Web Unblocker    │
                    │ • MCP Server       │
                    └────────────────────┘

Tech Stack

Backend

  • Framework: Next.js 15.5.4 with App Router (API Routes)
  • Runtime: Node.js 18+
  • Language: TypeScript 5 (strict mode)
  • ORM: Prisma 6.5.0
  • Database: PostgreSQL (Supabase)
  • Cache: Redis with ioredis 5.8.2 (optional, hot cache)

AI/LLM

  • Query Parsing: Google Gemini 2.0 Flash (gemini-2.0-flash-exp)
  • AI SDK: Vercel AI SDK 5.0.60 (@ai-sdk/google 2.0.17)
  • Research Workflows: LangChain + LangGraph 1.0.1
  • Schema Validation: Zod 3.25.76

External APIs

  • Bright Data: Google Search API, LinkedIn Scraper API, Web Scraper
  • Custom MCP Client: Model Context Protocol SDK 1.19.1 for advanced tool access

Frontend

  • UI: React 19.1.0 with Next.js
  • State: Zustand 5.0.2 + TanStack Query 5.62.18
  • Styling: Tailwind CSS 4 with custom animation utilities

AI Query Parser

The Problem

Users shouldn't need to learn complex query syntax. They should be able to search naturally:

✅ "10 AI engineers in Israel"
✅ "Software engineers at Google"
✅ "Elon Musk"
✅ "Product managers in San Francisco with startup experience"

The Solution: Structured Output with Gemini 2.0 Flash

I use Vercel's AI SDK with Gemini 2.0 Flash to convert natural language into structured search parameters using Zod schemas:

// src/lib/search/parser.ts
import { google } from '@ai-sdk/google';
import { generateObject } from 'ai';
import { z } from 'zod';

const SearchQuerySchema = z.object({
  count: z.number().min(1).max(50)
    .describe('Number of profiles to find'),
  role: z.string().nullable()
    .describe('Job title or role'),
  location: z.string().optional().nullable()
    .describe('Location or company name'),
  countryCode: z.string().length(2).optional().nullable()
    .describe('2-letter ISO country code'),
  keywords: z.array(z.string())
    .describe('Additional keywords or qualifications'),
  googleQuery: z.string()
    .describe('Optimized Google search query for LinkedIn'),
});

// Derive the return type directly from the Zod schema
export type ParsedSearchQuery = z.infer<typeof SearchQuerySchema>;

export async function parseSearchQuery(
  query: string
): Promise<ParsedSearchQuery> {
  const { object } = await generateObject({
    model: google('gemini-2.0-flash-exp'),
    schema: SearchQuerySchema,
    prompt: `Parse this search query: "${query}"

    Handle two types:
    1. Job/role search: "5 AI Engineers in Israel"
       → Extract count, role, location, keywords
    2. Individual search: "Elon Musk"
       → Set count=1, role=null, name in keywords

    Generate optimized Google query using:
    site:linkedin.com/in "Role" "Location" keywords`,
  });

  return object;
}

Example Output

Input: "5 AI Engineers in Israel"

Output:

{
  "count": 5,
  "role": "AI Engineer",
  "location": "Israel",
  "countryCode": "IL",
  "keywords": [],
  "googleQuery": "site:linkedin.com/in \"AI Engineer\" \"Israel\""
}

Why Gemini 2.0 Flash?

  • Fast: ~200-500ms response time
  • Structured Output: Native Zod schema support
  • Flexible: Handles both job searches and individual lookups
  • Cost-effective: $0.00001875 per 1K input tokens

Bright Data Integration

Bright Data is the backbone of PeopleHub's data acquisition. I use three of their APIs:

1. Google Search API

Purpose: Find LinkedIn profile URLs matching search criteria

Implementation:

// src/lib/brightdata/search.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/request';

export async function searchGoogle(
  query: string,
  page: number = 0,
  countryCode?: string | null,
): Promise<BrightDataGoogleSearchResponse> {
  const searchUrl = buildGoogleSearchUrl(query, page, countryCode);

  const response = await fetch(BRIGHTDATA_API_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.BRIGHTDATA_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: searchUrl,
      zone: 'unblocker',
      // 'raw' returns the page as-is; the brd_json=1 parameter on the
      // URL below is what makes Google respond with structured JSON
      format: 'raw',
    }),
  });

  if (!response.ok) {
    throw new Error(`Bright Data search failed: ${response.status}`);
  }

  return response.json();
}

function buildGoogleSearchUrl(
  query: string,
  page: number,
  countryCode?: string | null
): string {
  const start = page * 10;
  let url = `https://www.google.com/search?q=${encodeURIComponent(query)}&start=${start}&brd_json=1`;

  // Geo-targeting support
  if (countryCode) {
    url += `&gl=${countryCode.toUpperCase()}`;
  }

  return url;
}

Key Features:

  • JSON Response Format: brd_json=1 parameter returns structured data
  • Geo-Targeting: gl parameter filters results by country
  • Site-Specific Queries: site:linkedin.com/in narrows to LinkedIn profiles
  • Organic Results: Returns titles, links, snippets without ads (see the extraction sketch below)
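
Downstream, the search layer only needs the LinkedIn links out of that structured response. Here's a minimal sketch of the extraction step (the organic field follows Bright Data's brd_json output; the repo's actual response types may differ):

// Hypothetical response shape based on Bright Data's brd_json output
interface OrganicResult {
  title: string;
  link: string;
  description?: string;
}

export function extractLinkedInUrls(response: {
  organic?: OrganicResult[];
}): string[] {
  return (response.organic ?? [])
    .map((result) => result.link)
    .filter((link) => link.includes('linkedin.com/in/'));
}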

2. LinkedIn Scraper API

Purpose: Extract comprehensive LinkedIn profile data

Dataset ID: gd_l1viktl72bvl7bjuj0

Implementation:

// src/lib/brightdata/linkedin.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/datasets/v3';
const LINKEDIN_DATASET_ID = 'gd_l1viktl72bvl7bjuj0';
const MAX_POLLING_ATTEMPTS = 600; // 10 minutes
const POLLING_INTERVAL_MS = 1000; // 1 second

async function triggerLinkedInScrape(
  urls: string[]
): Promise<string> {
  const triggerUrl = `${BRIGHTDATA_API_URL}/trigger`;
  const params = new URLSearchParams({
    dataset_id: LINKEDIN_DATASET_ID,
    include_errors: 'true',
  });

  const payload = urls.map(url => ({ url }));

  const response = await fetch(`${triggerUrl}?${params}`, {
    method: 'POST',
    headers: getApiHeaders(),
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  return data.snapshot_id;
}

async function pollForSnapshot(
  snapshotId: string
): Promise<BrightDataLinkedInResponse[]> {
  let attempts = 0;

  while (attempts < MAX_POLLING_ATTEMPTS) {
    const snapshotUrl = `${BRIGHTDATA_API_URL}/snapshot/${snapshotId}`;
    const params = new URLSearchParams({ format: 'json' });

    const response = await fetch(`${snapshotUrl}?${params}`, {
      method: 'GET',
      headers: getApiHeaders(),
    });

    const data = await response.json();

    // Check if still processing
    if (!Array.isArray(data) && data.status === 'running') {
      attempts++;
      await new Promise(resolve =>
        setTimeout(resolve, POLLING_INTERVAL_MS)
      );
      continue;
    }

    // Data is ready
    return data;
  }

  throw new Error('Timeout waiting for LinkedIn data');
}

export async function fetchLinkedInProfiles(
  linkedinUrls: string[]
): Promise<ProfileData[]> {
  // Trigger async scraping job
  const snapshotId = await triggerLinkedInScrape(linkedinUrls);

  // Poll for results (max 10 minutes)
  const profiles = await pollForSnapshot(snapshotId);

  // Transform to database format
  return profiles.map(profile =>
    transformBrightDataProfile(profile)
  );
}
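
One helper these snippets lean on, getApiHeaders(), isn't shown above. It presumably just wraps the bearer token, something like:

// Assumed implementation of the getApiHeaders() helper used above
function getApiHeaders(): Record<string, string> {
  return {
    Authorization: `Bearer ${process.env.BRIGHTDATA_API_TOKEN}`,
    'Content-Type': 'application/json',
  };
}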

What Gets Scraped:

  • Basic info: name, headline, about section
  • Work experience with company logos and descriptions
  • Education history
  • Languages spoken
  • Connection and follower counts
  • Profile pictures and banner images
  • Current company details

Why This Approach?

  • Async by Design: Trigger job, get snapshot ID, poll for completion
  • Batch Operations: Scrape multiple profiles in one request
  • Retry Logic: Handles transient failures gracefully
  • Timeout Protection: Max 10 minutes prevents infinite loops

3. MCP (Model Context Protocol) Integration

Purpose: Advanced tooling for the research engine

// src/lib/brightdata/client.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';

let mcpClient: Client | null = null;

export async function getBrightDataMCPClient(): Promise<Client> {
  // Singleton pattern prevents repeated connections
  if (mcpClient) return mcpClient;

  const apiToken = process.env.BRIGHTDATA_API_TOKEN;
  const transport = new StreamableHTTPClientTransport(
    new URL(`https://mcp.brightdata.com/mcp?api_token=${apiToken}`)
  );

  mcpClient = new Client(
    {
      name: 'peoplehub-client',
      version: '1.0.0',
    },
    {
      capabilities: {},
    }
  );

  await mcpClient.connect(transport);
  return mcpClient;
}

Use Cases:

  • Web scraping for research reports
  • Advanced search capabilities
  • Tool discovery and execution
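
Once connected, the research engine can enumerate and invoke whatever tools the server exposes. A hedged sketch (the tool name here is an assumption - call listTools first to see what's actually available):

// Sketch: discovering and invoking a tool through the MCP client
const client = await getBrightDataMCPClient();

const { tools } = await client.listTools();
console.log(tools.map((tool) => tool.name));

const result = await client.callTool({
  name: 'scrape_as_markdown', // hypothetical tool name
  arguments: { url: 'https://example.com/article' },
});
console.log(result.content);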

Multi-Tier Caching Strategy

Caching is critical for cost reduction. PeopleHub uses a two-tier approach:

Tier 1: Redis (Hot Cache)

Purpose: Fast search result caching

// src/lib/redis/search-cache.ts
export async function getCachedSearchResults(
  query: string,
): Promise<CachedSearchResults | null> {
  const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
  const cached = await getCache<CachedSearchResults>(key);
  return cached;
}

export async function cacheSearchResults(
  query: string,
  parsedQuery: ParsedSearchQuery,
  results: ProfileSummary[],
): Promise<boolean> {
  const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
  const payload: CachedSearchResults = {
    query,
    parsedQuery,
    results,
    count: results.length,
    timestamp: Date.now(),
  };

  return setCache(key, payload, CacheTTL.SEARCH_RESULTS);
}

Benefits:

  • Sub-millisecond lookups: In-memory data structure
  • Reduced database load: Offloads hot queries
  • TTL-based expiration: Configurable freshness window
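
The getCacheKey/getCache/setCache helpers aren't shown above; with ioredis they're thin JSON wrappers. A sketch under that assumption:

// Assumed shape of the helpers behind the search cache (ioredis-based sketch)
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);

export function getCacheKey(prefix: string, query: string): string {
  // Normalize so "10 AI Engineers" and "10 ai engineers " share one entry
  return `${prefix}:${query.trim().toLowerCase()}`;
}

export async function getCache<T>(key: string): Promise<T | null> {
  const raw = await redis.get(key);
  return raw ? (JSON.parse(raw) as T) : null;
}

export async function setCache(
  key: string,
  value: unknown,
  ttlSeconds: number
): Promise<boolean> {
  const result = await redis.set(key, JSON.stringify(value), 'EX', ttlSeconds);
  return result === 'OK';
}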

Tier 2: PostgreSQL (Persistent Cache)

Purpose: Long-term profile storage with intelligent freshness

// src/lib/cache/index.ts
const CACHE_FRESHNESS_DAYS = 180;

export async function getCachedProfile(
  linkedinUrl: string
): Promise<ProfileData | null> {
  // Extract LinkedIn ID to handle regional URL variants
  const linkedinId = extractLinkedInId(linkedinUrl);

  const profile = await prisma.person.findUnique({
    where: { linkedinId },
  });

  if (!profile) return null;

  // Check freshness (< 180 days old)
  const daysSinceUpdate = Math.floor(
    (Date.now() - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
  );

  if (daysSinceUpdate >= CACHE_FRESHNESS_DAYS) {
    return null; // Stale, needs refresh
  }

  return transformToProfileData(profile);
}

export async function saveProfile(
  data: ProfileData
): Promise<ProfileData> {
  await prisma.person.upsert({
    where: { linkedinId: data.linkedinId },
    update: {
      ...data,
      searchCount: { increment: 1 }, // Track popularity
      lastViewed: new Date(),
    },
    create: {
      ...data,
    },
  });

  return data;
}
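
The extractLinkedInId call above is what makes regional URL variants hit the same row. A plausible sketch of it:

// Assumed shape of extractLinkedInId: reduce a URL to its profile slug so
// https://il.linkedin.com/in/jane-doe/ and https://www.linkedin.com/in/jane-doe
// both yield "jane-doe"
export function extractLinkedInId(linkedinUrl: string): string | null {
  const match = linkedinUrl.match(/linkedin\.com\/in\/([^/?#]+)/i);
  return match ? decodeURIComponent(match[1]).toLowerCase() : null;
}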

Batch Optimization:

export async function getCachedProfiles(
  linkedinUrls: string[]
): Promise<Record<string, ProfileData>> {
  const linkedinIds = linkedinUrls
    .map(extractLinkedInId)
    .filter(Boolean);

  // Single SQL query for all profiles
  const profiles = await prisma.person.findMany({
    where: {
      linkedinId: { in: linkedinIds },
    },
  });

  // Filter by freshness
  const result: Record<string, ProfileData> = {};
  const now = Date.now();

  for (const profile of profiles) {
    const daysSinceUpdate = Math.floor(
      (now - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
    );

    if (daysSinceUpdate < CACHE_FRESHNESS_DAYS) {
      result[profile.linkedinId] = transformToProfileData(profile);
    }
  }

  return result;
}

Performance Impact:

  • First search: ~120 seconds (LinkedIn scraping bottleneck)
  • Cached search: ~2.5 seconds (database lookup)
  • Batch lookup: 10-50ms for 100 profiles
  • Cost reduction: 70-90% (90% cache hit rate)

LangGraph Research Engine

This is PeopleHub's killer feature: automated due diligence reports using LangChain's LangGraph for multi-step agentic workflows.

What is LangGraph?

LangGraph is a framework for building stateful, multi-actor applications with LLMs. It uses a directed graph to define:

  • Nodes: Individual steps (fetch data, search web, summarize)
  • Edges: Transitions between steps
  • State: Shared data across the workflow

Research Workflow Graph

START
  ↓
Initialize Research
  ↓
  ├─→ Fetch LinkedIn Profile ──→ Aggregate Data
  │                                      ↓
  └─→ Execute Google Search          Write Report
         ↓                                ↓
      Scrape URLs (parallel)            END
         ↓
      Summarize Content (parallel)
         ↓
      Aggregate Data

Implementation

State Definition:

// src/lib/research/types.ts
import { Annotation } from '@langchain/langgraph';

export const ResearchStateAnnotation = Annotation.Root({
  personName: Annotation<string>,
  linkedinUrl: Annotation<string>,
  linkedinData: Annotation<ProfileData | undefined>,
  searchQuery: Annotation<string | undefined>,
  searchResults: Annotation<SearchResult[]>,
  // Reducers let parallel Send() branches append their results
  // instead of clobbering each other's writes
  scrapedContents: Annotation<ScrapedContent[]>({
    reducer: (prev, next) => prev.concat(next),
    default: () => [],
  }),
  webSummaries: Annotation<WebSummary[]>({
    reducer: (prev, next) => prev.concat(next),
    default: () => [],
  }),
  finalReport: Annotation<string | undefined>,
  // Last writer wins, so concurrent nodes can all report progress
  status: Annotation<string>({
    reducer: (_prev, next) => next,
    default: () => 'pending',
  }),
  errors: Annotation<string[]>,
});

Graph Builder:

// src/lib/research/graph.ts
import { StateGraph, START, END, Send, MemorySaver } from '@langchain/langgraph';

export function createResearchGraph() {
  const graph = new StateGraph(ResearchStateAnnotation);

  // Add nodes
  graph.addNode('start', startNode);
  graph.addNode('fetchLinkedIn', fetchLinkedInNode);
  graph.addNode('executeSearch', executeSearchNode);
  graph.addNode('scrapeWebPage', scrapeWebPageNode);
  graph.addNode('summarizeContent', summarizeContentNode);
  graph.addNode('aggregateData', aggregateDataNode);
  graph.addNode('writeReport', writeReportNode);

  // Define edges
  graph
    .addEdge(START, 'start')
    .addEdge('start', 'fetchLinkedIn')
    .addEdge('start', 'executeSearch')
    .addEdge('fetchLinkedIn', 'aggregateData')
    .addConditionalEdges('executeSearch', routeToScraping)
    .addConditionalEdges('scrapeWebPage', routeToSummarization)
    .addEdge('summarizeContent', 'aggregateData')
    .addEdge('aggregateData', 'writeReport')
    .addEdge('writeReport', END);

  return graph.compile({ checkpointer: new MemorySaver() });
}

Parallel Scraping with Send API:

// Route to scraping - creates parallel tasks
export function routeToScraping(
  state: ResearchGraphState
): Send[] {
  if (!state.searchResults?.length) return [];

  // Fan-out: Create one scraping task per URL
  return state.searchResults.map((result) =>
    new Send('scrapeWebPage', {
      ...state,
      url: result.url,
      metadata: {
        source: result.source,
        rank: result.rank,
        title: result.title,
      },
    })
  );
}

// Route to summarization - creates parallel tasks
export function routeToSummarization(
  state: ResearchGraphState
): Send[] {
  if (!state.scrapedContents?.length) return [];

  // Fan-out: Create one summarization task per scraped page
  return state.scrapedContents.map((content) =>
    new Send('summarizeContent', {
      ...state,
      scrapedContent: content,
    })
  );
}

Node Implementations:

// Fetch LinkedIn profile
export const fetchLinkedInNode: ResearchNodeHandler = async (state) => {
  const profile = await fetchLinkedInProfile(state.linkedinUrl);
  return {
    linkedinData: profile,
    status: 'LinkedIn profile fetched',
  };
};

// Execute Google search
export const executeSearchNode: ResearchNodeHandler = async (state) => {
  const results = await searchGoogleForPerson(
    state.personName,
    state.linkedinUrl,
    { maxResults: 15 }
  );

  return {
    searchResults: results,
    status: 'Web search completed',
  };
};

// Summarize scraped content
export const summarizeContentNode: ResearchNodeHandler = async (state) => {
  const { scrapedContent } = state;

  const summary = await summarizeWebContent(
    scrapedContent.url,
    scrapedContent.content,
    state.personName
  );

  return {
    webSummaries: [summary],
    status: `Summarized ${scrapedContent.url}`,
  };
};

// Generate final report
export const writeReportNode: ResearchNodeHandler = async (state) => {
  const bundle = {
    personName: state.personName,
    linkedinUrl: state.linkedinUrl,
    linkedinData: state.linkedinData,
    webSummaries: state.webSummaries,
  };

  const result = await generateResearchReport(bundle);

  return {
    finalReport: result.report,
    status: 'Report ready',
  };
};

Running the Graph:

// src/lib/research/runner.ts
export async function runResearchGraph(
  researchId: string,
  linkedinUrl: string,
  personName: string
): Promise<void> {
  // Update status to 'processing'
  await updateResearchStatus(researchId, 'processing');

  // Create and compile graph
  const compiledGraph = createResearchGraph();

  // Execute with checkpointing
  const result = await compiledGraph.invoke(
    { personName, linkedinUrl },
    { configurable: { thread_id: researchId } }
  );

  // Save report to database
  if (result.finalReport) {
    await saveResearchReport(
      researchId,
      result.finalReport,
      result.webSummaries.map(s => ({ url: s.url, summary: s.summary })),
      { /* metadata */ }
    );
  }
}

Why LangGraph?

Advantages:

  1. Stateful Workflows: Shared state across steps
  2. Parallel Execution: Send API for fan-out/fan-in patterns
  3. Checkpointing: Resume from failures with MemorySaver
  4. Type Safety: TypeScript-first with full type inference
  5. Debuggability: Visual graph representation with Mermaid (sketch below)
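
On the debuggability point, the compiled graph can render itself as Mermaid, which is handy for sanity-checking the fan-out edges (hedged: the drawing helpers vary a bit across @langchain/langgraph versions):

// Sketch: dump the compiled graph as Mermaid for debugging
const compiled = createResearchGraph();
const mermaid = compiled.getGraph().drawMermaid();
console.log(mermaid); // paste into any Mermaid renderer to see nodes and edges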

Example Research Report:

# Research Report: John Doe

## Professional Background
John Doe is a Senior Software Engineer at Google with 8 years of experience...

## Recent Projects and Achievements
- Led the development of Project X, a distributed system processing 1M+ requests/day
- Published research paper on machine learning optimization at ICML 2024
- Speaker at Google I/O 2024 on Kubernetes best practices

## Technical Expertise
Primary skills: Python, Go, Kubernetes, TensorFlow, distributed systems
Notable contributions to open-source projects...

## Industry Reputation
Recognized as a thought leader in cloud-native architecture...

## Sources
1. [Google Engineering Blog - Project X Launch](https://example.com)
2. [ICML 2024 Paper: ML Optimization Techniques](https://example.com)
3. [Google I/O 2024 Talk](https://example.com)

Database Design with Prisma

Schema Overview

// prisma/schema.prisma
datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
}

model Person {
  id                String   @id @default(cuid())
  linkedinUrl       String   @unique
  linkedinId        String   @unique
  linkedinNumId     String?

  // Basic Info
  firstName         String
  lastName          String
  fullName          String   @db.Text
  headline          String?  @db.Text
  about             String?  @db.Text

  // Location
  location          String?
  city              String?
  countryCode       String?

  // Profile Media
  profilePicUrl     String?
  bannerImage       String?
  defaultAvatar     Boolean  @default(false)

  // Current Company
  currentCompany    String?
  currentCompanyId  String?

  // Rich Data (JSON)
  experience        Json?
  education         Json?
  languages         Json?

  // Social Stats
  connections       Int?
  followers         Int?

  // Metadata
  searchCount       Int      @default(0)
  lastViewed        DateTime @default(now())
  createdAt         DateTime @default(now())
  updatedAt         DateTime @updatedAt

  // Relations
  researches        Research[]

  @@index([fullName])
  @@index([firstName, lastName])
  @@index([lastViewed])
  @@index([linkedinId])
  @@index([currentCompany])
  @@index([location])
  @@index([updatedAt])
  @@map("people")
}

model Search {
  id          String   @id @default(cuid())
  query       String   @db.Text
  results     Json     // Array of person IDs
  resultCount Int
  createdAt   DateTime @default(now())

  @@index([query])
  @@map("searches")
}

model Research {
  id              String   @id @default(cuid())
  personId        String?
  person          Person?  @relation(fields: [personId], references: [id])
  linkedinUrl     String
  personName      String
  report          String   @db.Text
  sources         Json     // Array of { url, summary }
  metadata        Json?    // Graph execution metadata
  status          String   // 'pending' | 'processing' | 'completed' | 'failed'
  errorMessage    String?  @db.Text
  createdAt       DateTime @default(now())
  updatedAt       DateTime @updatedAt

  @@index([personId])
  @@index([linkedinUrl])
  @@index([status])
  @@index([createdAt])
  @@map("researches")
}

Key Design Decisions

1. JSON Fields for Flexibility

// Experience is stored as JSON for flexibility
experience: [
  {
    title: "Senior Software Engineer",
    company: "Google",
    companyId: "1441",
    companyLogo: "https://...",
    location: "Mountain View, CA",
    startDate: "2020-01-01",
    endDate: null,
    description: "Led development of...",
    isCurrent: true
  },
  // ... more experiences
]

Why JSON?

  • Flexible schema (experiences vary widely)
  • No need for separate tables and joins
  • Easier to store Bright Data's nested structures
  • PostgreSQL JSON operators for querying (example below)
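
On that last point, Prisma exposes Postgres's jsonb operators directly, so the JSON columns stay queryable. A sketch, assuming experience entries keep the company field shown above:

// array_contains compiles to jsonb @>, matching anyone with a Google entry
// anywhere in their experience array
const googlers = await prisma.person.findMany({
  where: {
    experience: {
      array_contains: [{ company: 'Google' }],
    },
  },
});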

2. Strategic Indexes

// Full-text search on names
@@index([fullName])
@@index([firstName, lastName])

// Filter by recency
@@index([lastViewed])
@@index([updatedAt])

// Search by location/company
@@index([location])
@@index([currentCompany])

// Unique constraints prevent duplicates
linkedinId  String @unique
linkedinUrl String @unique

3. Metadata Tracking

searchCount: Int @default(0)  // Popularity metric
lastViewed: DateTime          // Cache invalidation
updatedAt: DateTime           // Freshness check

Benefits:

  • Identify popular profiles (high searchCount)
  • Prioritize cache refreshes for frequently viewed profiles
  • Track data staleness for intelligent revalidation

Performance Optimizations

1. Batch Database Queries

Problem: N+1 queries when checking the cache for multiple URLs

Solution: Single query for all profiles

// Instead of N queries
for (const url of urls) {
  await getCachedProfile(url); // ❌ N queries
}

// Do 1 query
const cached = await getCachedProfiles(urls); // ✅ 1 query

Impact: ~5 seconds → ~50ms for 100 profiles

2. Parallel API Calls

Problem: Sequential Bright Data calls block each other

Solution: Promise.all for independent operations

// ❌ Can't parallelize: each step depends on the previous one
const parsedQuery = await parseSearchQuery(query);
const summaries = await searchLinkedInProfiles(parsedQuery);
await cacheSearchResults(query, parsedQuery, summaries);

// ✅ Can parallelize: profile fetch and web search are independent,
// cutting wall-clock time roughly in half
const [profile, searchResults] = await Promise.all([
  fetchLinkedInProfile(linkedinUrl),
  searchGoogleForPerson(personName, linkedinUrl, { maxResults: 15 }),
]);

Caveat: Only parallelize truly independent operations!

3. Connection Pooling

Problem: Serverless functions create new DB connections per request

Solution: Singleton Prisma client

// src/lib/prisma.ts
import { PrismaClient } from '@prisma/client';

const globalForPrisma = global as unknown as { prisma: PrismaClient };

export const prisma = globalForPrisma.prisma ||
  new PrismaClient({
    log: process.env.NODE_ENV === 'development'
      ? ['query', 'error', 'warn']
      : ['error'],
  });

if (process.env.NODE_ENV !== 'production') {
  globalForPrisma.prisma = prisma;
}

4. Image Proxy for CORS

Problem: Direct LinkedIn image URLs blocked by CORS and ad blockers

Solution: Proxy images through our API

// src/app/api/proxy-image/route.ts
export async function GET(request: NextRequest) {
  const imageUrl = request.nextUrl.searchParams.get('url');

  // Guard against a missing URL before fetching
  if (!imageUrl) {
    return new Response('Missing url parameter', { status: 400 });
  }

  const response = await fetch(imageUrl, {
    headers: {
      'User-Agent': 'Mozilla/5.0...',
    },
  });

  const imageBuffer = await response.arrayBuffer();

  return new Response(imageBuffer, {
    headers: {
      'Content-Type': response.headers.get('Content-Type') ?? 'image/jpeg',
      'Cache-Control': 'public, max-age=86400', // 24 hours
    },
  });
}

5. Redis for Hot Cache

Problem: PostgreSQL queries still take 10-50ms for popular searches

Solution: Redis in-memory cache for search results

// Cache TTL: 30 minutes for hot searches
export const CacheTTL = {
  SEARCH_RESULTS: 30 * 60, // 30 minutes
  PROFILE_SUMMARY: 60 * 60, // 1 hour
};

// Check Redis first, fallback to PostgreSQL
const cached = await getCachedSearchResults(query);
if (cached) return cached; // ~2ms

// Cache miss, fetch from DB/API
const results = await searchLinkedInProfiles(query);
await cacheSearchResults(query, results);

Lessons Learned

1. Structured Output > Regex Parsing

Before: Manual regex parsing of queries

const match = query.match(/(\d+)\s+(.+)\s+in\s+(.+)/);
if (!match) throw new Error('Invalid query');
const [, count, role, location] = match;

After: Gemini 2.0 Flash with Zod schemas

const { object } = await generateObject({
  model: google('gemini-2.0-flash-exp'),
  schema: SearchQuerySchema,
  prompt: `Parse: "${query}"`,
});

Why Better:

  • Handles ambiguous queries ("software engineers at Google" - is Google a location or company?)
  • Extracts implicit information (maps "Israel" → "IL" country code)
  • No brittle regex maintenance

2. Async by Default for External APIs

Bright Data's LinkedIn scraper takes 60-120 seconds. Never block the user:

// ❌ Bad: User waits 2 minutes
const profiles = await fetchLinkedInProfiles(urls);
return profiles;

// ✅ Good: Return immediately, process in background
const researchId = await createResearch(linkedinUrl, personName);
runResearchGraph(researchId, linkedinUrl, personName); // Fire and forget
return { researchId, status: 'pending' };

Users can check status with polling or webhooks.
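
The polling side is simple. Here's a sketch against a hypothetical /api/research/[id] status route (the status values mirror the Research model from the schema above):

// Client-side status poll for a hypothetical /api/research/[id] route
async function waitForReport(
  researchId: string,
  intervalMs = 3000
): Promise<string> {
  for (;;) {
    const res = await fetch(`/api/research/${researchId}`);
    const data = await res.json();

    if (data.status === 'completed') return data.report;
    if (data.status === 'failed') {
      throw new Error(data.errorMessage ?? 'Research failed');
    }

    // Still pending/processing - wait and try again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}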

3. LangGraph for Complex Workflows

Before: Imperative spaghetti code

async function research(person) {
  const linkedin = await fetchLinkedIn(person);
  const query = await generateQuery(linkedin);
  const results = await search(query);
  const scraped = [];
  for (const url of results) {
    scraped.push(await scrape(url));
  }
  const summaries = [];
  for (const content of scraped) {
    summaries.push(await summarize(content));
  }
  return await writeReport(linkedin, summaries);
}

After: Declarative graph with parallel execution

const graph = new StateGraph(ResearchStateAnnotation)
  .addNode('fetchLinkedIn', fetchLinkedInNode)
  .addNode('search', searchNode)
  .addNode('scrape', scrapeNode) // Parallel
  .addNode('summarize', summarizeNode) // Parallel
  .addNode('writeReport', writeReportNode)
  .addConditionalEdges('search', routeToScraping)
  .compile();

const result = await graph.invoke({ person });

Benefits:

  • Parallel scraping (15 URLs in 10 seconds, not 150 seconds)
  • Checkpointing (resume from failures)
  • Easier to test individual nodes

4. Cache Invalidation is Hard

Challenge: When to refresh cached profiles?

Solution: Multi-factor scoring

function shouldRefresh(profile: Person): boolean {
  const daysSinceUpdate = getDaysSince(profile.updatedAt);
  const isPopular = profile.searchCount > 10;
  const isRecent = getDaysSince(profile.lastViewed) < 7;

  // Refresh if:
  // - Stale (> 180 days)
  // - Popular AND moderately old (> 30 days)
  // - Recently viewed AND old (> 60 days)
  return (
    daysSinceUpdate > 180 ||
    (isPopular && daysSinceUpdate > 30) ||
    (isRecent && daysSinceUpdate > 60)
  );
}
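
getDaysSince is just the day-difference helper implied throughout the caching code:

// Day-difference helper assumed by shouldRefresh above
function getDaysSince(date: Date): number {
  return Math.floor((Date.now() - date.getTime()) / (1000 * 60 * 60 * 24));
}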

5. Type Safety Everywhere

TypeScript + Zod + Prisma = Compile-time safety across the stack

// 1. Zod validates API input
const SearchQuerySchema = z.object({
  count: z.number().min(1).max(50),
  role: z.string().nullable(),
});

// 2. Prisma generates types from schema (findUnique returns Person | null)
const profile: Person | null = await prisma.person.findUnique(...);

// 3. TypeScript enforces contracts
function transformProfile(data: ProfileData): Person {
  // Compiler catches type mismatches
}

Result: Caught 50+ bugs at compile-time instead of runtime.


Conclusion

Building PeopleHub taught me:

  1. LLMs for structured extraction - Gemini 2.0 Flash + Zod is perfect for parsing
  2. Bright Data scales - Their APIs handle thousands of LinkedIn profiles without breaking
  3. LangGraph for workflows - State machines beat imperative code for complex flows
  4. Cache aggressively - 90% cache hit rate = 10x cost reduction
  5. Async everything - Never block users on slow external APIs

The entire codebase is open-source on GitHub. Check it out, try it yourself, and let me know what you think!

Tech Stack Summary:

  • Next.js 15.5.4 + React 19.1.0
  • TypeScript 5 + Zod 3.25.76
  • Prisma 6.5.0 + PostgreSQL + Redis
  • Google Gemini 2.0 Flash + Vercel AI SDK 5.0.60
  • LangChain + LangGraph 1.0.1
  • Bright Data APIs (Google Search, LinkedIn Scraper, MCP)

Questions?

Drop a comment below or reach out! Would love to hear your thoughts on:

  • Alternative approaches to query parsing
  • Other use cases for LangGraph
  • Optimization ideas for the research engine
  • Your experience with Bright Data APIs

Happy coding! 🚀
