Meir

Posted on Nov 28

Building PeopleHub: An AI-Powered LinkedIn Intelligence Platform with LangGraph and Bright Data

#ai #typescript #nextjs #opensource

I recently open-sourced PeopleHub, an AI-powered people search engine that combines natural language query parsing, LinkedIn profile scraping, and automated research report generation. In this post, I'll walk through the technical architecture, key design decisions, and implementation details.

What is PeopleHub?
Architecture Overview
Tech Stack
AI Query Parser: Natural Language → Structured Search
Bright Data Integration: The Data Pipeline
Multi-Tier Caching Strategy
LangGraph Research Engine: Agentic Workflows
Database Design with Prisma
Performance Optimizations
Lessons Learned

What is PeopleHub?

PeopleHub solves a common problem: finding and researching professionals is either slow (manual LinkedIn searching) or expensive (premium tools charging $50+ per profile).

Key Features:

🗣️ Natural language search - Just type "10 AI engineers in Israel"
⚡ Smart caching - 70-90% cost reduction through intelligent caching
🔬 AI research reports - Automated due diligence with web scraping
💾 Multi-tier persistence - PostgreSQL + Redis for optimal performance
🤖 LangGraph workflows - Agentic multi-step research automation

Use Cases:

Recruiting and talent acquisition
Due diligence on executives/entrepreneurs
Competitive intelligence
Academic research on professional networks
Sales prospecting

Architecture Overview

┌─────────────────────────────────────────────────┐
│          Frontend (Next.js 15 + React)          │
│    SearchBar → Results → Research Reports       │
└────────────────────┬────────────────────────────┘
                     │
        ┌────────────┼────────────┐
        │            │            │
   ┌────▼────┐  ┌───▼──┐  ┌─────▼──────┐
   │ Search  │  │Image │  │ Research   │
   │   API   │  │Proxy │  │    API     │
   └────┬────┘  └──────┘  └─────┬──────┘
        │                        │
        │     ┌──────────────────┴───────┐
        │     │  Prisma ORM + PostgreSQL │
        │     │  Redis Cache (Optional)  │
        │     └──────────┬─────────────┬─┘
        │                │             │
   ┌────▼────────────┐   │    ┌───────▼────────┐
   │  AI Query       │   │    │  LangGraph     │
   │  Parser         │   │    │  Research      │
   │ (Gemini 2.0)    │   │    │  Workflows     │
   └─────────────────┘   │    └───────┬────────┘
                         │            │
                    ┌────▼────────────▼──┐
                    │   Bright Data APIs │
                    │ • Google Search    │
                    │ • LinkedIn Scraper │
                    │ • Web Unblocker    │
                    │ • MCP Server       │
                    └────────────────────┘

Tech Stack

Backend

Framework: Next.js 15.5.4 with App Router (API Routes)
Runtime: Node.js 18+
Language: TypeScript 5 (strict mode)
ORM: Prisma 6.5.0
Database: PostgreSQL (Supabase)
Cache: Redis with ioredis 5.8.2 (optional, hot cache)

AI/LLM

Query Parsing: Google Gemini 2.0 Flash (gemini-2.0-flash-exp)
AI SDK: Vercel AI SDK 5.0.60 (@ai-sdk/google 2.0.17)
Research Workflows: LangChain + LangGraph 1.0.1
Schema Validation: Zod 3.25.76

External APIs

Bright Data: Google Search API, LinkedIn Scraper API, Web Scraper
Custom MCP Client: Model Context Protocol SDK 1.19.1 for advanced tool access

Frontend

UI: React 19.1.0 with Next.js
State: Zustand 5.0.2 + TanStack Query 5.62.18
Styling: Tailwind CSS 4 with custom animation utilities

AI Query Parser

The Problem

Users shouldn't need to learn complex query syntax. They should be able to search naturally:

✅ "10 AI engineers in Israel"
✅ "Software engineers at Google"
✅ "Elon Musk"
✅ "Product managers in San Francisco with startup experience"

The Solution: Structured Output with Gemini 2.0 Flash

I use Vercel's AI SDK with Gemini 2.0 Flash to convert natural language into structured search parameters using Zod schemas:

// src/lib/search/parser.ts
import { google } from '@ai-sdk/google';
import { generateObject } from 'ai';
import { z } from 'zod';

const SearchQuerySchema = z.object({
  count: z.number().min(1).max(50)
    .describe('Number of profiles to find'),
  role: z.string().nullable()
    .describe('Job title or role'),
  location: z.string().optional().nullable()
    .describe('Location or company name'),
  countryCode: z.string().length(2).optional().nullable()
    .describe('2-letter ISO country code'),
  keywords: z.array(z.string())
    .describe('Additional keywords or qualifications'),
  googleQuery: z.string()
    .describe('Optimized Google search query for LinkedIn'),
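});

// Inferred from the schema so the parser's return type cannot drift from it
export type ParsedSearchQuery = z.infer<typeof SearchQuerySchema>;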

export async function parseSearchQuery(
  query: string
): Promise<ParsedSearchQuery> {
  const { object } = await generateObject({
    model: google('gemini-2.0-flash-exp'),
    schema: SearchQuerySchema,
    prompt: `Parse this search query: "${query}"

    Handle two types:
    1. Job/role search: "5 AI Engineers in Israel"
       → Extract count, role, location, keywords
    2. Individual search: "Elon Musk"
       → Set count=1, role=null, name in keywords

    Generate optimized Google query using:
    site:linkedin.com/in "Role" "Location" keywords`,
  });

  return object;
}

Example Output

Input: "5 AI Engineers in Israel"

Output:

{
  "count": 5,
  "role": "AI Engineer",
  "location": "Israel",
  "countryCode": "IL",
  "keywords": [],
  "googleQuery": "site:linkedin.com/in \"AI Engineer\" \"Israel\""
}
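To make the shape of googleQuery concrete, here is a minimal sketch of how the same string could be assembled deterministically from the parsed fields. In PeopleHub the model generates this field itself; buildGoogleQuery and ParsedFields are hypothetical names used only for illustration.

```typescript
// Hypothetical helper (not part of PeopleHub): builds the Google query
// from fields that have already been parsed.
interface ParsedFields {
  role: string | null;
  location?: string | null;
  keywords: string[];
}

function buildGoogleQuery({ role, location, keywords }: ParsedFields): string {
  // Quote role and location so Google treats each as an exact phrase
  const quoted = [role, location]
    .filter((part): part is string => Boolean(part && part.trim()))
    .map((part) => `"${part.trim()}"`);

  return ['site:linkedin.com/in', ...quoted, ...keywords].join(' ');
}

console.log(
  buildGoogleQuery({ role: 'AI Engineer', location: 'Israel', keywords: [] })
);
```

A deterministic builder like this could also serve as a fallback if the model returns an empty or malformed googleQuery.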

Why Gemini 2.0 Flash?

Fast: ~200-500ms response time
Structured Output: Native Zod schema support
Flexible: Handles both job searches and individual lookups
Cost-effective: $0.00001875 per 1K input tokens

Bright Data Integration

Bright Data is the backbone of PeopleHub's data acquisition. I use three of their APIs:

1. Google Search API

Purpose: Find LinkedIn profile URLs matching search criteria

Implementation:

// src/lib/brightdata/search.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/request';

export async function searchGoogle(
  query: string,
  page: number = 0,
  countryCode?: string | null,
): Promise<BrightDataGoogleSearchResponse> {
  const searchUrl = buildGoogleSearchUrl(query, page, countryCode);

  const response = await fetch(BRIGHTDATA_API_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.BRIGHTDATA_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url: searchUrl,
      zone: 'unblocker',
      format: 'raw',
    }),
  });

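  // Surface HTTP failures instead of trying to parse an error body as results
  if (!response.ok) {
    throw new Error(`Bright Data search failed with HTTP ${response.status}`);
  }

  return response.json();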
}

function buildGoogleSearchUrl(
  query: string,
  page: number,
  countryCode?: string | null
): string {
  const start = page * 10;
  let url = `https://www.google.com/search?q=${encodeURIComponent(query)}&start=${start}&brd_json=1`;

  // Geo-targeting support
  if (countryCode) {
    url += `&gl=${countryCode.toUpperCase()}`;
  }

  return url;
}

Key Features:

JSON Response Format: brd_json=1 parameter returns structured data
Geo-Targeting: gl parameter filters results by country
Site-Specific Queries: site:linkedin.com/in narrows to LinkedIn profiles
Organic Results: Returns titles, links, snippets without ads
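Once the organic results come back, the next step is pulling out the profile URLs. The sketch below shows one way to do that. The OrganicResult shape (a link field per result) is an assumption about the parsed response, and extractLinkedInProfileUrls is a hypothetical helper rather than code from the repository.

```typescript
// Assumed shape of one organic result; the field names are an assumption.
interface OrganicResult {
  link: string;
  title?: string;
  description?: string;
}

function extractLinkedInProfileUrls(
  organic: OrganicResult[],
  limit: number
): string[] {
  const seen = new Set<string>();
  const urls: string[] = [];

  for (const { link } of organic) {
    // Accept regional subdomains (il.linkedin.com, uk.linkedin.com, ...)
    const match = link.match(
      /^https?:\/\/(?:[a-z0-9-]+\.)?linkedin\.com\/in\/([^/?#]+)/i
    );
    if (!match) continue; // company pages, posts, etc.

    // Deduplicate on the profile slug, not the full URL
    const id = match[1].toLowerCase();
    if (seen.has(id)) continue;
    seen.add(id);

    urls.push(link);
    if (urls.length >= limit) break;
  }

  return urls;
}
```

Deduplicating on the slug avoids scraping the same person twice when Google returns both a regional and a global URL.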

2. LinkedIn Scraper API

Purpose: Extract comprehensive LinkedIn profile data

Dataset ID: gd_l1viktl72bvl7bjuj0

Implementation:

// src/lib/brightdata/linkedin.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/datasets/v3';
const LINKEDIN_DATASET_ID = 'gd_l1viktl72bvl7bjuj0';
const MAX_POLLING_ATTEMPTS = 600; // 10 minutes
const POLLING_INTERVAL_MS = 1000; // 1 second

async function triggerLinkedInScrape(
  urls: string[]
): Promise<string> {
  const triggerUrl = `${BRIGHTDATA_API_URL}/trigger`;
  const params = new URLSearchParams({
    dataset_id: LINKEDIN_DATASET_ID,
    include_errors: 'true',
  });

  const payload = urls.map(url => ({ url }));

  const response = await fetch(`${triggerUrl}?${params}`, {
    method: 'POST',
    headers: getApiHeaders(),
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  return data.snapshot_id;
}

async function pollForSnapshot(
  snapshotId: string
): Promise<BrightDataLinkedInResponse[]> {
  let attempts = 0;

  while (attempts < MAX_POLLING_ATTEMPTS) {
    const snapshotUrl = `${BRIGHTDATA_API_URL}/snapshot/${snapshotId}`;
    const params = new URLSearchParams({ format: 'json' });

    const response = await fetch(`${snapshotUrl}?${params}`, {
      method: 'GET',
      headers: getApiHeaders(),
    });

    const data = await response.json();

    // Check if still processing
    if (!Array.isArray(data) && data.status === 'running') {
      attempts++;
      await new Promise(resolve =>
        setTimeout(resolve, POLLING_INTERVAL_MS)
      );
      continue;
    }

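    // Anything else that is not an array is an error payload, not profile data
    if (!Array.isArray(data)) {
      throw new Error(`Unexpected snapshot response: ${JSON.stringify(data)}`);
    }

    // Data is ready
    return data;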
  }

  throw new Error('Timeout waiting for LinkedIn data');
}

export async function fetchLinkedInProfiles(
  linkedinUrls: string[]
): Promise<ProfileData[]> {
  // Trigger async scraping job
  const snapshotId = await triggerLinkedInScrape(linkedinUrls);

  // Poll for results (max 10 minutes)
  const profiles = await pollForSnapshot(snapshotId);

  // Transform to database format
  return profiles.map(profile =>
    transformBrightDataProfile(profile)
  );
}

What Gets Scraped:

Basic info: name, headline, about section
Work experience with company logos and descriptions
Education history
Languages spoken
Connection and follower counts
Profile pictures and banner images
Current company details

Why This Approach?

Async by Design: Trigger job, get snapshot ID, poll for completion
Batch Operations: Scrape multiple profiles in one request
Retry Logic: Handles transient failures gracefully
Timeout Protection: Max 10 minutes prevents infinite loops

3. MCP (Model Context Protocol) Integration

Purpose: Advanced tooling for the research engine

// src/lib/brightdata/client.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';

let mcpClient: Client | null = null;

export async function getBrightDataMCPClient(): Promise<Client> {
  // Singleton pattern prevents repeated connections
  if (mcpClient) return mcpClient;

  const apiToken = process.env.BRIGHTDATA_API_TOKEN;
  const transport = new StreamableHTTPClientTransport(
    new URL(`https://mcp.brightdata.com/mcp?api_token=${apiToken}`)
  );

  mcpClient = new Client(
    {
      name: 'peoplehub-client',
      version: '1.0.0',
    },
    {
      capabilities: {},
    }
  );

  await mcpClient.connect(transport);
  return mcpClient;
}

Use Cases:

Web scraping for research reports
Advanced search capabilities
Tool discovery and execution

Multi-Tier Caching Strategy

Caching is critical for cost reduction. PeopleHub uses a two-tier approach:

Tier 1: Redis (Hot Cache)

Purpose: Fast search result caching

// src/lib/redis/search-cache.ts
export async function getCachedSearchResults(
  query: string,
): Promise<CachedSearchResults | null> {
  const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
  const cached = await getCache<CachedSearchResults>(key);
  return cached;
}

export async function cacheSearchResults(
  query: string,
  parsedQuery: ParsedSearchQuery,
  results: ProfileSummary[],
): Promise<boolean> {
  const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
  const payload: CachedSearchResults = {
    query,
    parsedQuery,
    results,
    count: results.length,
    timestamp: Date.now(),
  };

  return setCache(key, payload, CacheTTL.SEARCH_RESULTS);
}

Benefits:

Sub-millisecond lookups: In-memory data structure
Reduced database load: Offloads hot queries
TTL-based expiration: Configurable freshness window
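The hit rate of this tier depends on how the cache key is derived from the query. The getCacheKey implementation is not shown here, so the following is only a sketch of one reasonable approach, assuming that queries differing only in case or spacing should share an entry. The 'search' prefix value is a placeholder.

```typescript
// Hypothetical stand-ins for the CachePrefix / getCacheKey names used above.
const CachePrefix = { SEARCH_RESULTS: 'search' } as const;

function getCacheKey(prefix: string, query: string): string {
  // Normalize so "10 AI Engineers" and " 10 ai   engineers " map to one key
  const normalized = query.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${prefix}:${normalized}`;
}

console.log(getCacheKey(CachePrefix.SEARCH_RESULTS, '  10 AI   Engineers in Israel '));
```

Without some normalization, trivially different spellings of the same search would each trigger a fresh, paid lookup.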

Tier 2: PostgreSQL (Persistent Cache)

Purpose: Long-term profile storage with intelligent freshness

// src/lib/cache/index.ts
const CACHE_FRESHNESS_DAYS = 180;

export async function getCachedProfile(
  linkedinUrl: string
): Promise<ProfileData | null> {
  // Extract LinkedIn ID to handle regional URL variants
  const linkedinId = extractLinkedInId(linkedinUrl);

  const profile = await prisma.person.findUnique({
    where: { linkedinId },
  });

  if (!profile) return null;

  // Check freshness (< 180 days old)
  const daysSinceUpdate = Math.floor(
    (Date.now() - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
  );

  if (daysSinceUpdate >= CACHE_FRESHNESS_DAYS) {
    return null; // Stale, needs refresh
  }

  return transformToProfileData(profile);
}

export async function saveProfile(
  data: ProfileData
): Promise<ProfileData> {
  await prisma.person.upsert({
    where: { linkedinId: data.linkedinId },
    update: {
      ...data,
      searchCount: { increment: 1 }, // Track popularity
      lastViewed: new Date(),
    },
    create: {
      ...data,
    },
  });

  return data;
}
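The lookup above relies on extractLinkedInId to collapse regional URL variants into one key. That helper is not shown in this post, so here is a minimal sketch of how it might work, assuming the ID is the slug after /in/ and that slugs can be compared case-insensitively.

```typescript
// Hypothetical implementation; the real helper in the codebase may differ.
function extractLinkedInId(linkedinUrl: string): string | null {
  // Matches www.linkedin.com, il.linkedin.com, uk.linkedin.com, ...
  const match = linkedinUrl.match(/linkedin\.com\/in\/([^/?#]+)/i);
  if (!match) return null;

  const slug = match[1];
  try {
    // Slugs with non-ASCII characters arrive percent-encoded
    return decodeURIComponent(slug).toLowerCase();
  } catch {
    // Malformed escape sequence: fall back to the raw slug
    return slug.toLowerCase();
  }
}

console.log(extractLinkedInId('https://il.linkedin.com/in/john-doe/'));
```

Keying on the slug rather than the full URL means il.linkedin.com/in/john-doe and www.linkedin.com/in/john-doe resolve to the same cached row.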

Batch Optimization:

export async function getCachedProfiles(
  linkedinUrls: string[]
): Promise<Record<string, ProfileData>> {
  const linkedinIds = linkedinUrls
    .map(extractLinkedInId)
    // Type guard so TypeScript narrows the array to string[]
    .filter((id): id is string => Boolean(id));

  // Single SQL query for all profiles
  const profiles = await prisma.person.findMany({
    where: {
      linkedinId: { in: linkedinIds },
    },
  });

  // Filter by freshness
  const result: Record<string, ProfileData> = {};
  const now = Date.now();

  for (const profile of profiles) {
    const daysSinceUpdate = Math.floor(
      (now - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
    );

    if (daysSinceUpdate < CACHE_FRESHNESS_DAYS) {
      result[profile.linkedinId] = transformToProfileData(profile);
    }
  }

  return result;
}
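The batch result is only useful if the caller then scrapes just the misses. A minimal sketch of that split follows. splitByCache, the trimmed-down ProfileData type, and the toId parameter are illustrative assumptions rather than code from the repository.

```typescript
// Trimmed-down stand-in for the real ProfileData type
type ProfileData = { linkedinId: string; fullName: string };

function splitByCache(
  urls: string[],
  cached: Record<string, ProfileData>,
  toId: (url: string) => string | null
): { hits: ProfileData[]; misses: string[] } {
  const hits: ProfileData[] = [];
  const misses: string[] = [];

  for (const url of urls) {
    const id = toId(url);
    const profile = id ? cached[id] : undefined;
    if (profile) {
      hits.push(profile); // fresh in PostgreSQL, no scrape needed
    } else {
      misses.push(url); // must go to the LinkedIn scraper
    }
  }

  return { hits, misses };
}
```

Only the misses array would be passed to fetchLinkedInProfiles, which is where the cost saving comes from.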

Performance Impact:

First search: ~120 seconds (LinkedIn scraping bottleneck)
Cached search: ~2.5 seconds (database lookup)
Batch lookup: 10-50ms for 100 profiles
Cost reduction: 70-90% (90% cache hit rate)
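The cost figure follows directly from the hit rate. The sketch below does the arithmetic, assuming that spend is dominated by per-profile scraping and that cached lookups are effectively free. scrapeSavings is a hypothetical helper.

```typescript
// Assumption: only cache misses cost money; cache hits are treated as free.
function scrapeSavings(totalLookups: number, cacheHits: number) {
  const scraped = totalLookups - cacheHits;
  return {
    scraped,
    // Multiply before dividing to avoid floating-point noise
    reductionPercent: (cacheHits * 100) / totalLookups,
    // How many times cheaper than scraping every profile
    costFactor: scraped === 0 ? Infinity : totalLookups / scraped,
  };
}

console.log(scrapeSavings(100, 90));
```

With 90 hits out of 100 lookups, only 10 profiles are scraped, which is a 90% reduction, or 10x cheaper. At 70 hits the reduction is 70%, or roughly 3.3x.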

LangGraph Research Engine

This is PeopleHub's killer feature: automated due diligence reports using LangChain's LangGraph for multi-step agentic workflows.

What is LangGraph?

LangGraph is a framework for building stateful, multi-actor applications with LLMs. It uses a directed graph to define:

Nodes: Individual steps (fetch data, search web, summarize)
Edges: Transitions between steps
State: Shared data across the workflow

Research Workflow Graph

START
  ↓
Initialize Research
  ↓
  ├─→ Fetch LinkedIn Profile ──→ Aggregate Data
  │                                      ↓
  └─→ Execute Google Search          Write Report
         ↓                                ↓
      Scrape URLs (parallel)            END
         ↓
      Summarize Content (parallel)
         ↓
      Aggregate Data

Implementation

State Definition:

// src/lib/research/types.ts
import { Annotation } from '@langchain/langgraph';

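export const ResearchStateAnnotation = Annotation.Root({
  personName: Annotation<string>,
  linkedinUrl: Annotation<string>,
  linkedinData: Annotation<ProfileData | undefined>,
  searchQuery: Annotation<string | undefined>,
  searchResults: Annotation<SearchResult[]>,
  // Channels written by parallel branches need a reducer to merge concurrent updates
  scrapedContents: Annotation<ScrapedContent[]>({
    reducer: (current, update) => current.concat(update),
    default: () => [],
  }),
  webSummaries: Annotation<WebSummary[]>({
    reducer: (current, update) => current.concat(update),
    default: () => [],
  }),
  finalReport: Annotation<string | undefined>,
  // Last write wins when several nodes report a status in the same step
  status: Annotation<string>({ reducer: (_current, update) => update, default: () => '' }),
  errors: Annotation<string[]>({
    reducer: (current, update) => current.concat(update),
    default: () => [],
  }),
});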

Graph Builder:

// src/lib/research/graph.ts
import { StateGraph, START, END, Send, MemorySaver } from '@langchain/langgraph';

export function createResearchGraph() {
  const graph = new StateGraph(ResearchStateAnnotation)
    // Add nodes (chained so TypeScript tracks the node names used by the edges)
    .addNode('start', startNode)
    .addNode('fetchLinkedIn', fetchLinkedInNode)
    .addNode('executeSearch', executeSearchNode)
    .addNode('scrapeWebPage', scrapeWebPageNode)
    .addNode('summarizeContent', summarizeContentNode)
    .addNode('aggregateData', aggregateDataNode)
    .addNode('writeReport', writeReportNode)
    // Define edges
    .addEdge(START, 'start')
    .addEdge('start', 'fetchLinkedIn')
    .addEdge('start', 'executeSearch')
    .addEdge('fetchLinkedIn', 'aggregateData')
    .addConditionalEdges('executeSearch', routeToScraping)
    .addConditionalEdges('scrapeWebPage', routeToSummarization)
    .addEdge('summarizeContent', 'aggregateData')
    .addEdge('aggregateData', 'writeReport')
    .addEdge('writeReport', END);

  return graph.compile({ checkpointer: new MemorySaver() });
}

Parallel Scraping with Send API:

// Route to scraping - creates parallel tasks
export function routeToScraping(
  state: ResearchGraphState
): Send[] {
  if (!state.searchResults?.length) return [];

  // Fan-out: Create one scraping task per URL
  return state.searchResults.map((result) =>
    new Send('scrapeWebPage', {
      ...state,
      url: result.url,
      metadata: {
        source: result.source,
        rank: result.rank,
        title: result.title,
      },
    })
  );
}

// Route to summarization - creates parallel tasks
export function routeToSummarization(
  state: ResearchGraphState
): Send[] {
  if (!state.scrapedContents?.length) return [];

  // Fan-out: Create one summarization task per scraped page
  return state.scrapedContents.map((content) =>
    new Send('summarizeContent', {
      ...state,
      scrapedContent: content,
    })
  );
}

Node Implementations:

// Fetch LinkedIn profile
export const fetchLinkedInNode: ResearchNodeHandler = async (state) => {
  const profile = await fetchLinkedInProfile(state.linkedinUrl);
  return {
    linkedinData: profile,
    status: 'LinkedIn profile fetched',
  };
};

// Execute Google search
export const executeSearchNode: ResearchNodeHandler = async (state) => {
  const results = await searchGoogleForPerson(
    state.personName,
    state.linkedinUrl,
    { maxResults: 15 }
  );

  return {
    searchResults: results,
    status: 'Web search completed',
  };
};

// Summarize scraped content
export const summarizeContentNode: ResearchNodeHandler = async (state) => {
  const { scrapedContent } = state;

  const summary = await summarizeWebContent(
    scrapedContent.url,
    scrapedContent.content,
    state.personName
  );

  return {
    webSummaries: [summary],
    status: `Summarized ${scrapedContent.url}`,
  };
};

// Generate final report
export const writeReportNode: ResearchNodeHandler = async (state) => {
  const bundle = {
    personName: state.personName,
    linkedinUrl: state.linkedinUrl,
    linkedinData: state.linkedinData,
    webSummaries: state.webSummaries,
  };

  const result = await generateResearchReport(bundle);

  return {
    finalReport: result.report,
    status: 'Report ready',
  };
};
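Each parallel summarizeContent task returns webSummaries: [summary], so the fan-in only works if updates to that channel are merged rather than overwritten. The sketch below illustrates the reducer idea in plain TypeScript. It is a simplified model of the semantics, not LangGraph's implementation, and applyUpdates is a hypothetical name.

```typescript
type WebSummary = { url: string; summary: string };
type State = { webSummaries: WebSummary[]; status: string };
type Update = Partial<State>;

// One reducer per channel: how an incoming update combines with the current value
const reducers: {
  [K in keyof State]: (current: State[K], update: State[K]) => State[K];
} = {
  webSummaries: (current, update) => current.concat(update), // accumulate
  status: (_current, update) => update, // last write wins
};

function applyUpdates(state: State, updates: Update[]): State {
  return updates.reduce<State>(
    (acc, update) => ({
      webSummaries: update.webSummaries
        ? reducers.webSummaries(acc.webSummaries, update.webSummaries)
        : acc.webSummaries,
      status:
        update.status !== undefined
          ? reducers.status(acc.status, update.status)
          : acc.status,
    }),
    state
  );
}

// Two parallel branches, each contributing one summary
const merged = applyUpdates({ webSummaries: [], status: '' }, [
  { webSummaries: [{ url: 'a', summary: 'x' }], status: 'Summarized a' },
  { webSummaries: [{ url: 'b', summary: 'y' }], status: 'Summarized b' },
]);
console.log(merged.webSummaries.length, merged.status);
```

A channel with no reducer has no rule for combining two writes from the same step, which is why the accumulating channels need one.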

Running the Graph:

// src/lib/research/runner.ts
export async function runResearchGraph(
  researchId: string,
  linkedinUrl: string,
  personName: string
): Promise<void> {
  // Update status to 'processing'
  await updateResearchStatus(researchId, 'processing');

  // Create and compile graph
  const compiledGraph = createResearchGraph();

  // Execute with checkpointing
  const result = await compiledGraph.invoke(
    { personName, linkedinUrl },
    { configurable: { thread_id: researchId } }
  );

  // Save report to database
  if (result.finalReport) {
    await saveResearchReport(
      researchId,
      result.finalReport,
      result.webSummaries.map(s => ({ url: s.url, summary: s.summary })),
      { /* metadata */ }
    );
  }
}

Why LangGraph?

Advantages:

Stateful Workflows: Shared state across steps
Parallel Execution: Send API for fan-out/fan-in patterns
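Checkpointing: MemorySaver snapshots state per thread_id so a run can resume after a failure (in-memory only; surviving a process restart needs a persistent checkpointer)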
Type Safety: TypeScript-first with full type inference
Debuggability: Visual graph representation with Mermaid

Example Research Report:

# Research Report: John Doe

## Professional Background
John Doe is a Senior Software Engineer at Google with 8 years of experience...

## Recent Projects and Achievements
- Led the development of Project X, a distributed system processing 1M+ requests/day
- Published research paper on machine learning optimization at ICML 2024
- Speaker at Google I/O 2024 on Kubernetes best practices

## Technical Expertise
Primary skills: Python, Go, Kubernetes, TensorFlow, distributed systems
Notable contributions to open-source projects...

## Industry Reputation
Recognized as a thought leader in cloud-native architecture...

## Sources
1. [Google Engineering Blog - Project X Launch](https://example.com)
2. [ICML 2024 Paper: ML Optimization Techniques](https://example.com)
3. [Google I/O 2024 Talk](https://example.com)

Database Design with Prisma

Schema Overview

// prisma/schema.prisma
datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
}

model Person {
  id                String   @id @default(cuid())
  linkedinUrl       String   @unique
  linkedinId        String   @unique
  linkedinNumId     String?

  // Basic Info
  firstName         String
  lastName          String
  fullName          String   @db.Text
  headline          String?  @db.Text
  about             String?  @db.Text

  // Location
  location          String?
  city              String?
  countryCode       String?

  // Profile Media
  profilePicUrl     String?
  bannerImage       String?
  defaultAvatar     Boolean  @default(false)

  // Current Company
  currentCompany    String?
  currentCompanyId  String?

  // Rich Data (JSON)
  experience        Json?
  education         Json?
  languages         Json?

  // Social Stats
  connections       Int?
  followers         Int?

  // Metadata
  searchCount       Int      @default(0)
  lastViewed        DateTime @default(now())
  createdAt         DateTime @default(now())
  updatedAt         DateTime @updatedAt

  // Relations
  researches        Research[]

  @@index([fullName])
  @@index([firstName, lastName])
  @@index([lastViewed])
  @@index([linkedinId])
  @@index([currentCompany])
  @@index([location])
  @@index([updatedAt])
  @@map("people")
}

model Search {
  id          String   @id @default(cuid())
  query       String   @db.Text
  results     Json     // Array of person IDs
  resultCount Int
  createdAt   DateTime @default(now())

  @@index([query])
  @@map("searches")
}

model Research {
  id              String   @id @default(cuid())
  personId        String?
  person          Person?  @relation(fields: [personId], references: [id])
  linkedinUrl     String
  personName      String
  report          String   @db.Text
  sources         Json     // Array of { url, summary }
  metadata        Json?    // Graph execution metadata
  status          String   // 'pending' | 'processing' | 'completed' | 'failed'
  errorMessage    String?  @db.Text
  createdAt       DateTime @default(now())
  updatedAt       DateTime @updatedAt

  @@index([personId])
  @@index([linkedinUrl])
  @@index([status])
  @@index([createdAt])
  @@map("researches")
}

Key Design Decisions

1. JSON Fields for Flexibility

// Experience is stored as JSON for flexibility
experience: [
  {
    title: "Senior Software Engineer",
    company: "Google",
    companyId: "1441",
    companyLogo: "https://...",
    location: "Mountain View, CA",
    startDate: "2020-01-01",
    endDate: null,
    description: "Led development of...",
    isCurrent: true
  },
  // ... more experiences
]

Why JSON?

Flexible schema (experiences vary widely)
No need for separate tables and joins
Easier to store Bright Data's nested structures
PostgreSQL JSON operators for querying
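The trade-off is that Prisma returns Json fields loosely typed, so application code has to narrow them before use. The following is a minimal sketch of that narrowing. ExperienceEntry mirrors a subset of the example fields above, and parseExperience and getCurrentRole are hypothetical helpers.

```typescript
interface ExperienceEntry {
  title: string;
  company: string;
  startDate: string;
  endDate: string | null;
  isCurrent: boolean;
}

// Narrow an untyped JSON value to the entries that have the fields we need
function parseExperience(raw: unknown): ExperienceEntry[] {
  if (!Array.isArray(raw)) return [];
  return raw.filter(
    (e): e is ExperienceEntry =>
      typeof e === 'object' &&
      e !== null &&
      typeof (e as ExperienceEntry).title === 'string' &&
      typeof (e as ExperienceEntry).company === 'string'
  );
}

function getCurrentRole(raw: unknown): string | null {
  const current = parseExperience(raw).find((e) => e.isCurrent);
  return current ? `${current.title} at ${current.company}` : null;
}

console.log(
  getCurrentRole([
    {
      title: 'Senior Software Engineer',
      company: 'Google',
      startDate: '2020-01-01',
      endDate: null,
      isCurrent: true,
    },
  ])
);
```

Malformed or missing entries are dropped rather than thrown on, which suits scraped data where the structure is not guaranteed.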

2. Strategic Indexes

// Name lookups (B-tree indexes for exact-match lookups, not full-text search)
@@index([fullName])
@@index([firstName, lastName])

// Filter by recency
@@index([lastViewed])
@@index([updatedAt])

// Search by location/company
@@index([location])
@@index([currentCompany])

// Unique constraints prevent duplicates
linkedinId String @unique
linkedinUrl String @unique

3. Metadata Tracking

searchCount: Int @default(0)  // Popularity metric
lastViewed: DateTime          // Cache invalidation
updatedAt: DateTime           // Freshness check

Benefits:

Identify popular profiles (high searchCount)
Prioritize cache refreshes for frequently viewed profiles
Track data staleness for intelligent revalidation

Performance Optimizations

1. Batch Database Queries

Problem: N+1 query problem when checking cache for multiple URLs

Solution: Single query for all profiles

// Instead of N queries
for (const url of urls) {
  await getCachedProfile(url); // ❌ N queries
}

// Do 1 query
const cached = await getCachedProfiles(urls); // ✅ 1 query

Impact: 100 profiles: ~5 seconds → ~50ms

2. Parallel API Calls

Problem: Sequential Bright Data calls block each other

Solution: Promise.all for independent operations

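// Sequential: ~6 seconds total
const parsedQuery = await parseSearchQuery(query);
const summaries = await searchLinkedInProfiles(parsedQuery);
await cacheSearchResults(query, parsedQuery, summaries);

// Parallel (~3 seconds in theory), but broken here: searchLinkedInProfiles
// needs parsedQuery, which does not exist until parseSearchQuery resolves
const [parsedQuery, summaries] = await Promise.all([
  parseSearchQuery(query),
  searchLinkedInProfiles(parsedQuery), // ❌ Depends on parsedQuery
]);

Caveat: Only parallelize truly independent operations! The second snippet is an anti-example: the search depends on the parsed query, so these two calls have to stay sequential.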

3. Connection Pooling

Problem: Serverless functions create new DB connections per request

Solution: Singleton Prisma client

// src/lib/prisma.ts
import { PrismaClient } from '@prisma/client';

const globalForPrisma = global as unknown as { prisma: PrismaClient };

export const prisma = globalForPrisma.prisma ||
  new PrismaClient({
    log: process.env.NODE_ENV === 'development'
      ? ['query', 'error', 'warn']
      : ['error'],
  });

if (process.env.NODE_ENV !== 'production') {
  globalForPrisma.prisma = prisma;
}

4. Image Proxy for CORS

Problem: Direct LinkedIn image URLs blocked by CORS and ad blockers

Solution: Proxy images through our API

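// src/app/api/proxy-image/route.ts
import { NextRequest } from 'next/server';

export async function GET(request: NextRequest) {
  const imageUrl = request.nextUrl.searchParams.get('url');

  // Reject missing or non-HTTPS URLs before fetching anything
  if (!imageUrl || !imageUrl.startsWith('https://')) {
    return new Response('Invalid image URL', { status: 400 });
  }

  const response = await fetch(imageUrl, {
    headers: {
      'User-Agent': 'Mozilla/5.0...',
    },
  });

  if (!response.ok) {
    return new Response('Upstream image request failed', { status: 502 });
  }

  const imageBuffer = await response.arrayBuffer();

  return new Response(imageBuffer, {
    headers: {
      // headers.get() can return null, which is not a valid header value
      'Content-Type':
        response.headers.get('Content-Type') ?? 'application/octet-stream',
      'Cache-Control': 'public, max-age=86400', // 24 hours
    },
  });
}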
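Because the proxy fetches a caller-supplied URL, it is worth restricting it to known image hosts so it cannot be used to reach arbitrary servers. Here is a minimal sketch of such a check. The allowlist entry is illustrative, and isAllowedImageUrl is a hypothetical helper rather than code from the repository.

```typescript
// Assumption: illustrative allowlist; use the hosts your images actually come from.
const ALLOWED_IMAGE_HOSTS = ['media.licdn.com'];

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null; // not a valid absolute URL
  }
}

function isAllowedImageUrl(
  raw: string | null,
  allowedHosts: string[] = ALLOWED_IMAGE_HOSTS
): boolean {
  if (!raw) return false;

  const url = parseUrl(raw);
  if (!url || url.protocol !== 'https:') return false;

  // Exact host or a subdomain of it; a plain suffix match without the dot
  // would wrongly accept lookalike hosts
  return allowedHosts.some(
    (host) => url.hostname === host || url.hostname.endsWith(`.${host}`)
  );
}

console.log(isAllowedImageUrl('https://media.licdn.com/dms/image/abc'));
```

The route handler would call this before fetch and return a 400 response when it fails.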

5. Redis for Hot Cache

Problem: PostgreSQL queries still take 10-50ms for popular searches

Solution: Redis in-memory cache for search results

// Cache TTL: 30 minutes for hot searches
export const CacheTTL = {
  SEARCH_RESULTS: 30 * 60, // 30 minutes
  PROFILE_SUMMARY: 60 * 60, // 1 hour
};

// Check Redis first, fallback to PostgreSQL
const cached = await getCachedSearchResults(query);
if (cached) return cached; // ~2ms

// Cache miss, fetch from DB/API
const parsedQuery = await parseSearchQuery(query);
const results = await searchLinkedInProfiles(parsedQuery);
await cacheSearchResults(query, parsedQuery, results);

Lessons Learned

1. Structured Output > Regex Parsing

Before: Manual regex parsing of queries

const match = query.match(/(\d+)\s+(.+)\s+in\s+(.+)/);
if (!match) throw new Error('Invalid query');
const [, count, role, location] = match;

After: Gemini 2.0 Flash with Zod schemas

const { object } = await generateObject({
  model: google('gemini-2.0-flash-exp'),
  schema: SearchQuerySchema,
  prompt: `Parse: "${query}"`,
});

Why Better:

Handles ambiguous queries ("software engineers at Google" - is Google a location or company?)
Extracts implicit information (maps "Israel" → "IL" country code)
No brittle regex maintenance

2. Async by Default for External APIs

Bright Data's LinkedIn scraper takes 60-120 seconds. Never block the user:

// ❌ Bad: User waits 2 minutes
const profiles = await fetchLinkedInProfiles(urls);
return profiles;

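// ✅ Good: Return immediately, process in background
const researchId = await createResearch(linkedinUrl, personName);
// Fire and forget, but still catch failures so they are recorded
// instead of surfacing as an unhandled rejection
void runResearchGraph(researchId, linkedinUrl, personName).catch(() =>
  updateResearchStatus(researchId, 'failed')
);
return { researchId, status: 'pending' };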

Users can check status with polling or webhooks.

3. LangGraph for Complex Workflows

Before: Imperative spaghetti code

async function research(person) {
  const linkedin = await fetchLinkedIn(person);
  const query = await generateQuery(linkedin);
  const results = await search(query);
  const scraped = [];
  for (const url of results) {
    scraped.push(await scrape(url));
  }
  const summaries = [];
  for (const content of scraped) {
    summaries.push(await summarize(content));
  }
  return await writeReport(linkedin, summaries);
}

After: Declarative graph with parallel execution

const graph = new StateGraph(ResearchStateAnnotation)
  .addNode('fetchLinkedIn', fetchLinkedInNode)
  .addNode('search', searchNode)
  .addNode('scrape', scrapeNode) // Parallel
  .addNode('summarize', summarizeNode) // Parallel
  .addNode('writeReport', writeReportNode)
  .addConditionalEdges('search', routeToScraping)
  .compile();

const result = await graph.invoke({ person });

Benefits:

Parallel scraping (15 URLs in 10 seconds, not 150 seconds)
Checkpointing (resume from failures)
Easier to test individual nodes

4. Cache Invalidation is Hard

Challenge: When to refresh cached profiles?

Solution: Multi-factor scoring

function shouldRefresh(profile: Person): boolean {
  const daysSinceUpdate = getDaysSince(profile.updatedAt);
  const isPopular = profile.searchCount > 10;
  const isRecent = getDaysSince(profile.lastViewed) < 7;

  // Refresh if:
  // - Stale (> 180 days)
  // - Popular AND moderately old (> 30 days)
  // - Recently viewed AND old (> 60 days)
  return (
    daysSinceUpdate > 180 ||
    (isPopular && daysSinceUpdate > 30) ||
    (isRecent && daysSinceUpdate > 60)
  );
}
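getDaysSince is not shown above. The sketch below gives one way to define it and repeats the same three rules so they can be exercised against fixed dates. RefreshInput and the explicit now parameter are assumptions added to make the logic testable.

```typescript
// Only the fields the rule needs; a subset of the Person model
interface RefreshInput {
  updatedAt: Date;
  lastViewed: Date;
  searchCount: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between `date` and `now`
function getDaysSince(date: Date, now: Date): number {
  return Math.floor((now.getTime() - date.getTime()) / MS_PER_DAY);
}

function shouldRefresh(profile: RefreshInput, now: Date = new Date()): boolean {
  const daysSinceUpdate = getDaysSince(profile.updatedAt, now);
  const isPopular = profile.searchCount > 10;
  const isRecent = getDaysSince(profile.lastViewed, now) < 7;

  return (
    daysSinceUpdate > 180 || // stale
    (isPopular && daysSinceUpdate > 30) || // popular and moderately old
    (isRecent && daysSinceUpdate > 60) // recently viewed and old
  );
}

// Example: a popular profile last scraped 40 days ago
const now = new Date('2025-01-01T00:00:00Z');
const daysAgo = (n: number) => new Date(now.getTime() - n * MS_PER_DAY);
console.log(
  shouldRefresh(
    { updatedAt: daysAgo(40), lastViewed: daysAgo(100), searchCount: 11 },
    now
  )
);
```

Passing now explicitly keeps the function deterministic, so each rule can be unit-tested without mocking the clock.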

5. Type Safety Everywhere

TypeScript + Zod + Prisma = Type safety across the stack: compile-time checks from TypeScript and Prisma's generated types, runtime validation from Zod

// 1. Zod validates API input
const SearchQuerySchema = z.object({
  count: z.number().min(1).max(50),
  role: z.string().nullable(),
});

// 2. Prisma generates types from schema
const profile: Person | null = await prisma.person.findUnique(...);

// 3. TypeScript enforces contracts
function transformProfile(data: ProfileData): Person {
  // Compiler catches type mismatches
}

Result: Caught 50+ bugs at compile-time instead of runtime.

Conclusion

Building PeopleHub taught me:

LLMs for structured extraction - Gemini 2.0 Flash + Zod is perfect for parsing
Bright Data scales - Their APIs handle thousands of LinkedIn profiles without breaking
LangGraph for workflows - State machines beat imperative code for complex flows
Cache aggressively - 90% cache hit rate = 10x cost reduction
Async everything - Never block users on slow external APIs

The entire codebase is open-source on GitHub. Check it out, try it yourself, and let me know what you think!

Tech Stack Summary:

Next.js 15.5.4 + React 19.1.0
TypeScript 5 + Zod 3.25.76
Prisma 6.5.0 + PostgreSQL + Redis
Google Gemini 2.0 Flash + Vercel AI SDK 5.0.60
LangChain + LangGraph 1.0.1
Bright Data APIs (Google Search, LinkedIn Scraper, MCP)
