I recently open-sourced PeopleHub, an AI-powered people search engine that combines natural language query parsing, LinkedIn profile scraping, and automated research report generation. In this post, I'll walk through the technical architecture, key design decisions, and implementation details.
Table of Contents
- What is PeopleHub?
- Architecture Overview
- Tech Stack
- AI Query Parser: Natural Language → Structured Search
- Bright Data Integration: The Data Pipeline
- Multi-Tier Caching Strategy
- LangGraph Research Engine: Agentic Workflows
- Database Design with Prisma
- Performance Optimizations
- Lessons Learned
What is PeopleHub?
PeopleHub solves a common problem: finding and researching professionals is either slow (manual LinkedIn searching) or expensive (premium tools charging $50+ per profile).
Key Features:
- 🗣️ Natural language search - Just type "10 AI engineers in Israel"
- ⚡ Smart caching - 70-90% cost reduction through intelligent caching
- 🔬 AI research reports - Automated due diligence with web scraping
- 💾 Multi-tier persistence - PostgreSQL + Redis for optimal performance
- 🤖 LangGraph workflows - Agentic multi-step research automation
Use Cases:
- Recruiting and talent acquisition
- Due diligence on executives/entrepreneurs
- Competitive intelligence
- Academic research on professional networks
- Sales prospecting
Architecture Overview
┌─────────────────────────────────────────────────┐
│ Frontend (Next.js 15 + React) │
│ SearchBar → Results → Research Reports │
└────────────────────┬────────────────────────────┘
│
┌────────────┼────────────┐
│ │ │
┌────▼────┐ ┌───▼──┐ ┌─────▼──────┐
│ Search │ │Image │ │ Research │
│ API │ │Proxy │ │ API │
└────┬────┘ └──────┘ └─────┬──────┘
│ │
│ ┌──────────────────┴───────┐
│ │ Prisma ORM + PostgreSQL │
│ │ Redis Cache (Optional) │
│ └──────────┬─────────────┬─┘
│ │ │
┌────▼────────────┐ │ ┌───────▼────────┐
│ AI Query │ │ │ LangGraph │
│ Parser │ │ │ Research │
│ (Gemini 2.0) │ │ │ Workflows │
└─────────────────┘ │ └───────┬────────┘
│ │
┌────▼────────────▼──┐
│ Bright Data APIs │
│ • Google Search │
│ • LinkedIn Scraper │
│ • Web Unblocker │
│ • MCP Server │
└────────────────────┘
Tech Stack
Backend
- Framework: Next.js 15.5.4 with App Router (API Routes)
- Runtime: Node.js 18+
- Language: TypeScript 5 (strict mode)
- ORM: Prisma 6.5.0
- Database: PostgreSQL (Supabase)
- Cache: Redis with ioredis 5.8.2 (optional, hot cache)
AI/LLM
- Query Parsing: Google Gemini 2.0 Flash (gemini-2.0-flash-exp)
- AI SDK: Vercel AI SDK 5.0.60 (@ai-sdk/google 2.0.17)
- Research Workflows: LangChain + LangGraph 1.0.1
- Schema Validation: Zod 3.25.76
External APIs
- Bright Data: Google Search API, LinkedIn Scraper API, Web Scraper
- Custom MCP Client: Model Context Protocol SDK 1.19.1 for advanced tool access
Frontend
- UI: React 19.1.0 with Next.js
- State: Zustand 5.0.2 + TanStack Query 5.62.18
- Styling: Tailwind CSS 4 with custom animation utilities
AI Query Parser
The Problem
Users shouldn't need to learn complex query syntax. They should be able to search naturally:
✅ "10 AI engineers in Israel"
✅ "Software engineers at Google"
✅ "Elon Musk"
✅ "Product managers in San Francisco with startup experience"
The Solution: Structured Output with Gemini 2.0 Flash
I use Vercel's AI SDK with Gemini 2.0 Flash to convert natural language into structured search parameters using Zod schemas:
// src/lib/search/parser.ts
import { google } from '@ai-sdk/google';
import { generateObject } from 'ai';
import { z } from 'zod';
const SearchQuerySchema = z.object({
count: z.number().min(1).max(50)
.describe('Number of profiles to find'),
role: z.string().nullable()
.describe('Job title or role'),
location: z.string().optional().nullable()
.describe('Location or company name'),
countryCode: z.string().length(2).optional().nullable()
.describe('2-letter ISO country code'),
keywords: z.array(z.string())
.describe('Additional keywords or qualifications'),
googleQuery: z.string()
.describe('Optimized Google search query for LinkedIn'),
});
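// Derive the result type from the schema so the two cannot drift apart
export type ParsedSearchQuery = z.infer<typeof SearchQuerySchema>;
export async function parseSearchQuery(
query: string
): Promise<ParsedSearchQuery> {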
const { object } = await generateObject({
model: google('gemini-2.0-flash-exp'),
schema: SearchQuerySchema,
prompt: `Parse this search query: "${query}"
Handle two types:
1. Job/role search: "5 AI Engineers in Israel"
→ Extract count, role, location, keywords
2. Individual search: "Elon Musk"
→ Set count=1, role=null, name in keywords
Generate optimized Google query using:
site:linkedin.com/in "Role" "Location" keywords`,
});
return object;
}
Example Output
Input: "5 AI Engineers in Israel"
Output:
{
"count": 5,
"role": "AI Engineer",
"location": "Israel",
"countryCode": "IL",
"keywords": [],
"googleQuery": "site:linkedin.com/in \"AI Engineer\" \"Israel\""
}
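As a deterministic baseline to compare the model's `googleQuery` against, the template from the prompt can be sketched as a plain function. `buildGoogleQuery` and `QueryParts` are hypothetical names introduced for illustration; in PeopleHub the model produces this string itself.

```typescript
// Hypothetical helper, not part of PeopleHub: builds the LinkedIn-scoped
// Google query that the prompt describes.
interface QueryParts {
  role: string | null;
  location?: string | null;
  keywords: string[];
}

function buildGoogleQuery({ role, location, keywords }: QueryParts): string {
  // Role and location are quoted for exact-phrase matching; empty values are dropped.
  const quoted = [role, location]
    .filter((part): part is string => Boolean(part && part.trim()))
    .map((part) => `"${part.trim()}"`);
  // Keywords are appended unquoted, following the prompt's template.
  return ['site:linkedin.com/in', ...quoted, ...keywords].join(' ');
}

const exampleQuery = buildGoogleQuery({
  role: 'AI Engineer',
  location: 'Israel',
  keywords: [],
});
console.log(exampleQuery); // site:linkedin.com/in "AI Engineer" "Israel"
```

A baseline like this is handy in tests: if the model's query drifts far from the template, that is a signal to inspect the prompt.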
Why Gemini 2.0 Flash?
- Fast: ~200-500ms response time
- Structured Output: Native Zod schema support
- Flexible: Handles both job searches and individual lookups
- Cost-effective: $0.00001875 per 1K input tokens
Bright Data Integration
Bright Data is the backbone of PeopleHub's data acquisition. I use three of their APIs:
1. Google Search API
Purpose: Find LinkedIn profile URLs matching search criteria
Implementation:
// src/lib/brightdata/search.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/request';
export async function searchGoogle(
query: string,
page: number = 0,
countryCode?: string | null,
): Promise<BrightDataGoogleSearchResponse> {
const searchUrl = buildGoogleSearchUrl(query, page, countryCode);
const response = await fetch(BRIGHTDATA_API_URL, {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.BRIGHTDATA_API_TOKEN}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url: searchUrl,
zone: 'unblocker',
format: 'raw',
}),
});
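// Fail loudly on HTTP errors instead of parsing an error body as results
if (!response.ok) {
throw new Error(`Bright Data search failed: ${response.status} ${response.statusText}`);
}
return response.json();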
}
function buildGoogleSearchUrl(
query: string,
page: number,
countryCode?: string | null
): string {
const start = page * 10;
let url = `https://www.google.com/search?q=${encodeURIComponent(query)}&start=${start}&brd_json=1`;
// Geo-targeting support
if (countryCode) {
url += `&gl=${countryCode.toUpperCase()}`;
}
return url;
}
Key Features:
- JSON Response Format: brd_json=1 parameter returns structured data
- Geo-Targeting: gl parameter filters results by country
- Site-Specific Queries: site:linkedin.com/in narrows to LinkedIn profiles
- Organic Results: Returns titles, links, snippets without ads
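Even with a `site:` filter, search results can include non-profile pages and the same profile under several regional hosts. A minimal sketch of the filtering step follows, assuming each organic result exposes a `link` field; `OrganicResult` and `extractProfileUrls` are illustrative names, and the real response type may differ.

```typescript
// Assumed shape: only the field used here. The real
// BrightDataGoogleSearchResponse type may carry more or differently named fields.
interface OrganicResult {
  link: string;
}

// Keeps LinkedIn profile URLs (/in/<id>), collapsing regional hosts and
// tracking parameters so each profile appears once.
function extractProfileUrls(organic: OrganicResult[], limit: number): string[] {
  const seen = new Set<string>();
  for (const { link } of organic) {
    if (seen.size >= limit) break;
    let url: URL;
    try {
      url = new URL(link);
    } catch {
      continue; // skip malformed links
    }
    const onLinkedIn =
      url.hostname === 'linkedin.com' || url.hostname.endsWith('.linkedin.com');
    const id = url.pathname.match(/^\/in\/([^/]+)/)?.[1];
    if (onLinkedIn && id) seen.add(`https://www.linkedin.com/in/${id}`);
  }
  return Array.from(seen);
}

const profileUrls = extractProfileUrls(
  [
    { link: 'https://www.linkedin.com/in/jane-doe/?trk=public' },
    { link: 'https://il.linkedin.com/in/jane-doe' }, // regional duplicate
    { link: 'https://www.linkedin.com/company/acme' }, // not a profile
    { link: 'https://uk.linkedin.com/in/john-smith' },
  ],
  10,
);
console.log(profileUrls);
```

Normalizing to a single canonical host is a design choice made here for deduplication; it is not something the Bright Data API does for you.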
2. LinkedIn Scraper API
Purpose: Extract comprehensive LinkedIn profile data
Dataset ID: gd_l1viktl72bvl7bjuj0
Implementation:
// src/lib/brightdata/linkedin.ts
const BRIGHTDATA_API_URL = 'https://api.brightdata.com/datasets/v3';
const LINKEDIN_DATASET_ID = 'gd_l1viktl72bvl7bjuj0';
const MAX_POLLING_ATTEMPTS = 600; // 10 minutes
const POLLING_INTERVAL_MS = 1000; // 1 second
async function triggerLinkedInScrape(
urls: string[]
): Promise<string> {
const triggerUrl = `${BRIGHTDATA_API_URL}/trigger`;
const params = new URLSearchParams({
dataset_id: LINKEDIN_DATASET_ID,
include_errors: 'true',
});
const payload = urls.map(url => ({ url }));
const response = await fetch(`${triggerUrl}?${params}`, {
method: 'POST',
headers: getApiHeaders(),
body: JSON.stringify(payload),
});
const data = await response.json();
return data.snapshot_id;
}
async function pollForSnapshot(
snapshotId: string
): Promise<BrightDataLinkedInResponse[]> {
let attempts = 0;
while (attempts < MAX_POLLING_ATTEMPTS) {
const snapshotUrl = `${BRIGHTDATA_API_URL}/snapshot/${snapshotId}`;
const params = new URLSearchParams({ format: 'json' });
const response = await fetch(`${snapshotUrl}?${params}`, {
method: 'GET',
headers: getApiHeaders(),
});
const data = await response.json();
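// Check if still processing
if (!Array.isArray(data) && data.status === 'running') {
attempts++;
await new Promise(resolve =>
setTimeout(resolve, POLLING_INTERVAL_MS)
);
continue;
}
// Any other non-array payload is not profile data, so do not return it as such
if (!Array.isArray(data)) {
throw new Error(`Unexpected snapshot response: ${JSON.stringify(data)}`);
}
// Data is ready
return data;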
}
throw new Error('Timeout waiting for LinkedIn data');
}
export async function fetchLinkedInProfiles(
linkedinUrls: string[]
): Promise<ProfileData[]> {
// Trigger async scraping job
const snapshotId = await triggerLinkedInScrape(linkedinUrls);
// Poll for results (max 10 minutes)
const profiles = await pollForSnapshot(snapshotId);
// Transform to database format
return profiles.map(profile =>
transformBrightDataProfile(profile)
);
}
What Gets Scraped:
- Basic info: name, headline, about section
- Work experience with company logos and descriptions
- Education history
- Languages spoken
- Connection and follower counts
- Profile pictures and banner images
- Current company details
Why This Approach?
- Async by Design: Trigger job, get snapshot ID, poll for completion
- Batch Operations: Scrape multiple profiles in one request
- Retry Logic: Handles transient failures gracefully
- Timeout Protection: Max 10 minutes prevents infinite loops
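The loop above polls at a fixed one-second interval. One alternative, which is not what the code above does, is exponential backoff: fewer requests for long-running jobs at the cost of slightly later detection. The sketch below is generic; `pollWithBackoff` and `PollOptions` are hypothetical names, and the `sleep` option exists only so the behaviour can be exercised without real timers.

```typescript
interface PollOptions {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

// Calls `check` until it returns a non-null value, doubling the delay
// between attempts up to `maxDelayMs`.
async function pollWithBackoff<T>(
  check: () => Promise<T | null>,
  { maxAttempts, initialDelayMs, maxDelayMs, sleep }: PollOptions,
): Promise<T> {
  const wait =
    sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let delayMs = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await check();
    if (result !== null) return result;
    if (attempt < maxAttempts) {
      await wait(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
  throw new Error(`Gave up after ${maxAttempts} attempts`);
}

// Demo: a fake job that becomes ready on the fourth check.
const recordedDelays: number[] = [];
let checkCalls = 0;
const pollDemo = pollWithBackoff<string[]>(
  async () => (++checkCalls < 4 ? null : ['profile']),
  {
    maxAttempts: 10,
    initialDelayMs: 1000,
    maxDelayMs: 4000,
    sleep: async (ms) => {
      recordedDelays.push(ms); // record instead of actually waiting
    },
  },
);
pollDemo.then((data) => console.log(data, recordedDelays));
```

In the demo the recorded waits are 1000, 2000, and 4000 ms before the fourth check succeeds.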
3. MCP (Model Context Protocol) Integration
Purpose: Advanced tooling for the research engine
// src/lib/brightdata/client.ts
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js';
let mcpClient: Client | null = null;
export async function getBrightDataMCPClient(): Promise<Client> {
// Singleton pattern prevents repeated connections
if (mcpClient) return mcpClient;
const apiToken = process.env.BRIGHTDATA_API_TOKEN;
const transport = new StreamableHTTPClientTransport(
new URL(`https://mcp.brightdata.com/mcp?api_token=${apiToken}`)
);
mcpClient = new Client(
{
name: 'peoplehub-client',
version: '1.0.0',
},
{
capabilities: {},
}
);
await mcpClient.connect(transport);
return mcpClient;
}
Use Cases:
- Web scraping for research reports
- Advanced search capabilities
- Tool discovery and execution
Multi-Tier Caching Strategy
Caching is critical for cost reduction. PeopleHub uses a two-tier approach:
Tier 1: Redis (Hot Cache)
Purpose: Fast search result caching
// src/lib/redis/search-cache.ts
export async function getCachedSearchResults(
query: string,
): Promise<CachedSearchResults | null> {
const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
const cached = await getCache<CachedSearchResults>(key);
return cached;
}
export async function cacheSearchResults(
query: string,
parsedQuery: ParsedSearchQuery,
results: ProfileSummary[],
): Promise<boolean> {
const key = getCacheKey(CachePrefix.SEARCH_RESULTS, query);
const payload: CachedSearchResults = {
query,
parsedQuery,
results,
count: results.length,
timestamp: Date.now(),
};
return setCache(key, payload, CacheTTL.SEARCH_RESULTS);
}
Benefits:
- Fast lookups: in-memory store, typically a couple of milliseconds including the network round trip
- Reduced database load: Offloads hot queries
- TTL-based expiration: Configurable freshness window
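To make the key and TTL behaviour concrete, here is a small in-memory model. It is a sketch under assumptions: `getSearchCacheKey` is a hypothetical stand-in for `getCacheKey` (whose implementation is not shown), the lowercase and whitespace normalization is my assumption, and `TtlCache` only imitates what Redis does server-side when a key is written with an expiry.

```typescript
// Hypothetical stand-in for getCacheKey: normalizes the query so that
// differently spaced or capitalized versions of one search share an entry.
function getSearchCacheKey(prefix: string, query: string): string {
  const normalized = query.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${prefix}:${normalized}`;
}

// Minimal in-memory model of TTL expiry, with an injectable clock for tests.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.store.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  get(key: string): V | null {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired: behave like a miss
      return null;
    }
    return entry.value;
  }
}

let fakeClockMs = 0;
const searchCache = new TtlCache<string[]>(() => fakeClockMs);
const cacheKey = getSearchCacheKey('search', '  10 AI Engineers   in Israel ');
searchCache.set(cacheKey, ['jane-doe'], 30 * 60);

fakeClockMs = 29 * 60 * 1000; // 29 minutes later: still fresh
const freshHit = searchCache.get(getSearchCacheKey('search', '10 ai engineers in israel'));

fakeClockMs = 30 * 60 * 1000; // 30 minutes later: expired
const expiredMiss = searchCache.get(cacheKey);
console.log(cacheKey, freshHit, expiredMiss);
```

Normalizing keys raises the hit rate, but it also means two queries that differ only in case are treated as identical, which may or may not be what you want.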
Tier 2: PostgreSQL (Persistent Cache)
Purpose: Long-term profile storage with intelligent freshness
// src/lib/cache/index.ts
const CACHE_FRESHNESS_DAYS = 180;
export async function getCachedProfile(
linkedinUrl: string
): Promise<ProfileData | null> {
// Extract LinkedIn ID to handle regional URL variants
const linkedinId = extractLinkedInId(linkedinUrl);
const profile = await prisma.person.findUnique({
where: { linkedinId },
});
if (!profile) return null;
// Check freshness (< 180 days old)
const daysSinceUpdate = Math.floor(
(Date.now() - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
);
if (daysSinceUpdate >= CACHE_FRESHNESS_DAYS) {
return null; // Stale, needs refresh
}
return transformToProfileData(profile);
}
export async function saveProfile(
data: ProfileData
): Promise<ProfileData> {
await prisma.person.upsert({
where: { linkedinId: data.linkedinId },
update: {
...data,
searchCount: { increment: 1 }, // Track popularity
lastViewed: new Date(),
},
create: {
...data,
},
});
return data;
}
Batch Optimization:
export async function getCachedProfiles(
linkedinUrls: string[]
): Promise<Record<string, ProfileData>> {
const linkedinIds = linkedinUrls
.map(extractLinkedInId)
.filter(Boolean);
// Single SQL query for all profiles
const profiles = await prisma.person.findMany({
where: {
linkedinId: { in: linkedinIds },
},
});
// Filter by freshness
const result: Record<string, ProfileData> = {};
const now = Date.now();
for (const profile of profiles) {
const daysSinceUpdate = Math.floor(
(now - profile.updatedAt.getTime()) / (1000 * 60 * 60 * 24)
);
if (daysSinceUpdate < CACHE_FRESHNESS_DAYS) {
result[profile.linkedinId] = transformToProfileData(profile);
}
}
return result;
}
Performance Impact:
- First search: ~120 seconds (LinkedIn scraping bottleneck)
- Cached search: ~2.5 seconds (database lookup)
- Batch lookup: 10-50ms for 100 profiles
- Cost reduction: 70-90%, tracking the cache hit rate (a 90% hit rate avoids about 90% of scraping calls)
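The link between hit rate and cost can be checked with a back-of-the-envelope model in which only cache misses trigger a paid scrape. The function names are illustrative, and the per-scrape price is a placeholder rather than Bright Data's actual pricing.

```typescript
// Only cache misses trigger a paid scrape.
// pricePerScrape is a placeholder unit price, not real pricing.
function scrapeCost(lookups: number, hitRate: number, pricePerScrape: number): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError('hitRate must be between 0 and 1');
  }
  return lookups * (1 - hitRate) * pricePerScrape;
}

// Fraction of the uncached cost that is saved at a given hit rate.
function costReduction(hitRate: number): number {
  const baseline = scrapeCost(1000, 0, 1);
  return 1 - scrapeCost(1000, hitRate, 1) / baseline;
}

console.log(costReduction(0.7).toFixed(2)); // "0.70"
console.log(costReduction(0.9).toFixed(2)); // "0.90"
```

Under this model the saving equals the hit rate, so a 90% hit rate leaves one tenth of the original scraping bill.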
LangGraph Research Engine
This is PeopleHub's killer feature: automated due diligence reports using LangChain's LangGraph for multi-step agentic workflows.
What is LangGraph?
LangGraph is a framework for building stateful, multi-actor applications with LLMs. It uses a directed graph to define:
- Nodes: Individual steps (fetch data, search web, summarize)
- Edges: Transitions between steps
- State: Shared data across the workflow
Research Workflow Graph
START
↓
Initialize Research
├─→ Fetch LinkedIn Profile ──────────┐
└─→ Execute Google Search            │
    ↓                                │
    Scrape URLs (parallel)           │
    ↓                                │
    Summarize Content (parallel)     │
    ↓                                │
    Aggregate Data ←─────────────────┘
    ↓
    Write Report
    ↓
    END
Implementation
State Definition:
// src/lib/research/types.ts
import { Annotation } from '@langchain/langgraph';
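export const ResearchStateAnnotation = Annotation.Root({
personName: Annotation<string>,
linkedinUrl: Annotation<string>,
linkedinData: Annotation<ProfileData | undefined>,
searchQuery: Annotation<string | undefined>,
searchResults: Annotation<SearchResult[]>,
// Channels written by parallel branches need a reducer; without one,
// concurrent updates in the same step are rejected
scrapedContents: Annotation<ScrapedContent[]>({ reducer: (a, b) => a.concat(b), default: () => [] }),
webSummaries: Annotation<WebSummary[]>({ reducer: (a, b) => a.concat(b), default: () => [] }),
finalReport: Annotation<string | undefined>,
status: Annotation<string>({ reducer: (_current, update) => update }), // last write wins
errors: Annotation<string[]>({ reducer: (a, b) => a.concat(b), default: () => [] }),
});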
Graph Builder:
// src/lib/research/graph.ts
import { StateGraph, START, END, Send, MemorySaver } from '@langchain/langgraph';
export function createResearchGraph() {
const graph = new StateGraph(ResearchStateAnnotation);
// Add nodes
graph.addNode('start', startNode);
graph.addNode('fetchLinkedIn', fetchLinkedInNode);
graph.addNode('executeSearch', executeSearchNode);
graph.addNode('scrapeWebPage', scrapeWebPageNode);
graph.addNode('summarizeContent', summarizeContentNode);
graph.addNode('aggregateData', aggregateDataNode);
graph.addNode('writeReport', writeReportNode);
// Define edges
graph
.addEdge(START, 'start')
.addEdge('start', 'fetchLinkedIn')
.addEdge('start', 'executeSearch')
.addEdge('fetchLinkedIn', 'aggregateData')
.addConditionalEdges('executeSearch', routeToScraping)
.addConditionalEdges('scrapeWebPage', routeToSummarization)
.addEdge('summarizeContent', 'aggregateData')
.addEdge('aggregateData', 'writeReport')
.addEdge('writeReport', END);
return graph.compile({ checkpointer: new MemorySaver() });
}
Parallel Scraping with Send API:
// Route to scraping - creates parallel tasks
export function routeToScraping(
state: ResearchGraphState
): Send[] {
if (!state.searchResults?.length) return [];
// Fan-out: Create one scraping task per URL
return state.searchResults.map((result) =>
new Send('scrapeWebPage', {
...state,
url: result.url,
metadata: {
source: result.source,
rank: result.rank,
title: result.title,
},
})
);
}
// Route to summarization - creates parallel tasks
export function routeToSummarization(
state: ResearchGraphState
): Send[] {
if (!state.scrapedContents?.length) return [];
// Fan-out: Create one summarization task per scraped page
return state.scrapedContents.map((content) =>
new Send('summarizeContent', {
...state,
scrapedContent: content,
})
);
}
Node Implementations:
// Fetch LinkedIn profile
export const fetchLinkedInNode: ResearchNodeHandler = async (state) => {
const profile = await fetchLinkedInProfile(state.linkedinUrl);
return {
linkedinData: profile,
status: 'LinkedIn profile fetched',
};
};
// Execute Google search
export const executeSearchNode: ResearchNodeHandler = async (state) => {
const results = await searchGoogleForPerson(
state.personName,
state.linkedinUrl,
{ maxResults: 15 }
);
return {
searchResults: results,
status: 'Web search completed',
};
};
// Summarize scraped content
export const summarizeContentNode: ResearchNodeHandler = async (state) => {
const { scrapedContent } = state;
const summary = await summarizeWebContent(
scrapedContent.url,
scrapedContent.content,
state.personName
);
return {
webSummaries: [summary],
status: `Summarized ${scrapedContent.url}`,
};
};
// Generate final report
export const writeReportNode: ResearchNodeHandler = async (state) => {
const bundle = {
personName: state.personName,
linkedinUrl: state.linkedinUrl,
linkedinData: state.linkedinData,
webSummaries: state.webSummaries,
};
const result = await generateResearchReport(bundle);
return {
finalReport: result.report,
status: 'Report ready',
};
};
Running the Graph:
// src/lib/research/runner.ts
export async function runResearchGraph(
researchId: string,
linkedinUrl: string,
personName: string
): Promise<void> {
// Update status to 'processing'
await updateResearchStatus(researchId, 'processing');
// Create and compile graph
const compiledGraph = createResearchGraph();
// Execute with checkpointing
const result = await compiledGraph.invoke(
{ personName, linkedinUrl },
{ configurable: { thread_id: researchId } }
);
// Save report to database
if (result.finalReport) {
await saveResearchReport(
researchId,
result.finalReport,
result.webSummaries.map(s => ({ url: s.url, summary: s.summary })),
{ /* metadata */ }
);
}
}
Why LangGraph?
Advantages:
- Stateful Workflows: Shared state across steps
- Parallel Execution: Send API for fan-out/fan-in patterns
- Checkpointing: Resume from failures with MemorySaver
- Type Safety: TypeScript-first with full type inference
- Debuggability: Visual graph representation with Mermaid
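The fan-in half of the pattern is easy to get wrong, so here is a toy model of what happens when several parallel branches report back. This is not LangGraph's implementation; `mergeBranchUpdates` is a hypothetical function that only illustrates the idea of per-field merge rules (list fields accumulate, scalar fields keep the last write).

```typescript
// Toy model of fan-in, not LangGraph's implementation.
interface WebSummary {
  url: string;
  summary: string;
}

interface ResearchState {
  webSummaries: WebSummary[];
  status: string;
}

// Applies partial updates from parallel branches: list fields accumulate,
// scalar fields keep the last value written.
function mergeBranchUpdates(
  state: ResearchState,
  updates: Partial<ResearchState>[],
): ResearchState {
  return updates.reduce<ResearchState>(
    (next, update) => ({
      webSummaries: next.webSummaries.concat(update.webSummaries ?? []),
      status: update.status ?? next.status,
    }),
    state,
  );
}

// Three summarization branches, each returning a one-element list.
const branchUpdates: Partial<ResearchState>[] = ['a', 'b', 'c'].map((page) => ({
  webSummaries: [{ url: `https://example.com/${page}`, summary: `Summary of ${page}` }],
  status: `Summarized https://example.com/${page}`,
}));

const mergedState = mergeBranchUpdates(
  { webSummaries: [], status: 'Web search completed' },
  branchUpdates,
);
console.log(mergedState.webSummaries.length, mergedState.status);
```

This is why the summarization node returns `webSummaries: [summary]` rather than the full list: each branch contributes one element and the merge rule assembles the whole.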
Example Research Report:
# Research Report: John Doe
## Professional Background
John Doe is a Senior Software Engineer at Google with 8 years of experience...
## Recent Projects and Achievements
- Led the development of Project X, a distributed system processing 1M+ requests/day
- Published research paper on machine learning optimization at ICML 2024
- Speaker at Google I/O 2024 on Kubernetes best practices
## Technical Expertise
Primary skills: Python, Go, Kubernetes, TensorFlow, distributed systems
Notable contributions to open-source projects...
## Industry Reputation
Recognized as a thought leader in cloud-native architecture...
## Sources
1. [Google Engineering Blog - Project X Launch](https://example.com)
2. [ICML 2024 Paper: ML Optimization Techniques](https://example.com)
3. [Google I/O 2024 Talk](https://example.com)
Database Design with Prisma
Schema Overview
// prisma/schema.prisma
datasource db {
provider = "postgresql"
url = env("DATABASE_URL")
}
model Person {
id String @id @default(cuid())
linkedinUrl String @unique
linkedinId String @unique
linkedinNumId String?
// Basic Info
firstName String
lastName String
fullName String @db.Text
headline String? @db.Text
about String? @db.Text
// Location
location String?
city String?
countryCode String?
// Profile Media
profilePicUrl String?
bannerImage String?
defaultAvatar Boolean @default(false)
// Current Company
currentCompany String?
currentCompanyId String?
// Rich Data (JSON)
experience Json?
education Json?
languages Json?
// Social Stats
connections Int?
followers Int?
// Metadata
searchCount Int @default(0)
lastViewed DateTime @default(now())
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
// Relations
researches Research[]
@@index([fullName])
@@index([firstName, lastName])
@@index([lastViewed])
@@index([linkedinId])
@@index([currentCompany])
@@index([location])
@@index([updatedAt])
@@map("people")
}
model Search {
id String @id @default(cuid())
query String @db.Text
results Json // Array of person IDs
resultCount Int
createdAt DateTime @default(now())
@@index([query])
@@map("searches")
}
model Research {
id String @id @default(cuid())
personId String?
person Person? @relation(fields: [personId], references: [id])
linkedinUrl String
personName String
report String @db.Text
sources Json // Array of { url, summary }
metadata Json? // Graph execution metadata
status String // 'pending' | 'processing' | 'completed' | 'failed'
errorMessage String? @db.Text
createdAt DateTime @default(now())
updatedAt DateTime @updatedAt
@@index([personId])
@@index([linkedinUrl])
@@index([status])
@@index([createdAt])
@@map("researches")
}
Key Design Decisions
1. JSON Fields for Flexibility
// Experience is stored as JSON for flexibility
experience: [
{
title: "Senior Software Engineer",
company: "Google",
companyId: "1441",
companyLogo: "https://...",
location: "Mountain View, CA",
startDate: "2020-01-01",
endDate: null,
description: "Led development of...",
isCurrent: true
},
// ... more experiences
]
Why JSON?
- Flexible schema (experiences vary widely)
- No need for separate tables and joins
- Easier to store Bright Data's nested structures
- PostgreSQL JSON operators for querying
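One cost of JSON columns is that values come back untyped, so it helps to validate them when reading. The sketch below uses a hand-written type guard to stay dependency-free; a Zod schema would do the same job. The `Experience` fields follow the example entry above, and the exact shape should be treated as an assumption.

```typescript
// Field names follow the example entry; the real stored shape may differ.
interface Experience {
  title: string;
  company: string;
  startDate: string;
  endDate: string | null;
  isCurrent: boolean;
}

function isExperience(value: unknown): value is Experience {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v['title'] === 'string' &&
    typeof v['company'] === 'string' &&
    typeof v['startDate'] === 'string' &&
    (v['endDate'] === null || typeof v['endDate'] === 'string') &&
    typeof v['isCurrent'] === 'boolean'
  );
}

// Json columns come back untyped, so malformed entries are dropped here.
function parseExperience(json: unknown): Experience[] {
  return Array.isArray(json) ? json.filter(isExperience) : [];
}

const storedExperience: unknown = [
  {
    title: 'Senior Software Engineer',
    company: 'Google',
    startDate: '2020-01-01',
    endDate: null,
    isCurrent: true,
  },
  { title: 42 }, // malformed entry
];
const validExperience = parseExperience(storedExperience);
console.log(validExperience.length, validExperience[0]?.company);
```

Silently dropping malformed entries is one policy; logging or rejecting the whole record are equally reasonable, depending on how much you trust the upstream data.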
2. Strategic Indexes
// Name lookups (standard B-tree indexes, not full-text search)
@@index([fullName])
@@index([firstName, lastName])
// Filter by recency
@@index([lastViewed])
@@index([updatedAt])
// Search by location/company
@@index([location])
@@index([currentCompany])
// Unique constraints prevent duplicates
linkedinId String @unique
linkedinUrl String @unique
3. Metadata Tracking
searchCount Int @default(0) // Popularity metric
lastViewed DateTime @default(now()) // Recency signal for refresh priority
updatedAt DateTime @updatedAt // Freshness check
Benefits:
- Identify popular profiles (high searchCount)
- Prioritize cache refreshes for frequently viewed profiles
- Track data staleness for intelligent revalidation
Performance Optimizations
1. Batch Database Queries
Problem: N+1 query problem when checking cache for multiple URLs
Solution: Single query for all profiles
// Instead of N queries
for (const url of urls) {
await getCachedProfile(url); // ❌ N queries
}
// Do 1 query
const cached = await getCachedProfiles(urls); // ✅ 1 query
Impact: 100 profiles: ~5 seconds → ~50ms
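After the batch lookup, the caller still has to work out which profiles need scraping. A minimal sketch of that step follows; `partitionByCache` is a hypothetical helper, and it assumes the cache result is keyed by the same ID used in the request list.

```typescript
// Splits requested IDs into cache hits and the misses that still need scraping.
// Duplicate IDs in the request are collapsed so nothing is scraped twice.
function partitionByCache<T>(
  ids: string[],
  cached: Record<string, T>,
): { hits: T[]; misses: string[] } {
  const hits: T[] = [];
  const misses: string[] = [];
  for (const id of Array.from(new Set(ids))) {
    const value = Object.prototype.hasOwnProperty.call(cached, id)
      ? cached[id]
      : undefined;
    if (value !== undefined) {
      hits.push(value);
    } else {
      misses.push(id);
    }
  }
  return { hits, misses };
}

const partition = partitionByCache(
  ['jane-doe', 'john-smith', 'jane-doe', 'ada-l'],
  { 'jane-doe': { fullName: 'Jane Doe' } },
);
console.log(partition.hits.length, partition.misses);
```

Only the `misses` list goes to the scraper, which is where the cost saving from caching actually materializes.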
2. Parallel API Calls
Problem: Sequential Bright Data calls block each other
Solution: Promise.all for independent operations
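// Sequential: ~6 seconds total
const parsedQuery = await parseSearchQuery(query);
const summaries = await searchLinkedInProfiles(parsedQuery);
await cacheSearchResults(query, parsedQuery, summaries);
// Parallel: ~3 seconds total, but only if the calls are independent.
// This attempt is broken: the second call needs the result of the first.
const [parsedQuery, summaries] = await Promise.all([
parseSearchQuery(query),
searchLinkedInProfiles(parsedQuery), // ❌ Depends on parsedQuery
]);
Caveat: Only parallelize truly independent operations! Here searchLinkedInProfiles needs parsedQuery, so these two calls must stay sequential.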
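For contrast, here is a case where `Promise.all` does apply: once the query is parsed, separate result pages do not depend on each other. The functions below are stubs with made-up latencies standing in for the real calls, so treat the names and return shapes as illustrative.

```typescript
// Stubs standing in for the real calls; names and latencies are illustrative.
const pause = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function parseQueryStub(query: string): Promise<{ googleQuery: string }> {
  await pause(20);
  return { googleQuery: `site:linkedin.com/in ${query}` };
}

async function searchPageStub(googleQuery: string, page: number): Promise<string[]> {
  await pause(30);
  return [`${googleQuery} (page ${page})`];
}

async function searchAllPages(query: string, pages: number): Promise<string[]> {
  // Step 1 must finish first: every page request needs the parsed query.
  const { googleQuery } = await parseQueryStub(query);
  // Step 2: page requests are independent of each other, so they can overlap.
  const perPage = await Promise.all(
    Array.from({ length: pages }, (_, page) => searchPageStub(googleQuery, page)),
  );
  // Promise.all preserves input order, so results stay in page order.
  return ([] as string[]).concat(...perPage);
}

const allPagesDemo = searchAllPages('AI engineers', 3);
allPagesDemo.then((results) => console.log(results));
```

The dependent step stays sequential and only the independent fan-out runs concurrently, which is the distinction the caveat is pointing at.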
3. Connection Pooling
Problem: Serverless functions create new DB connections per request
Solution: Singleton Prisma client
// src/lib/prisma.ts
import { PrismaClient } from '@prisma/client';
const globalForPrisma = global as unknown as { prisma: PrismaClient };
export const prisma = globalForPrisma.prisma ||
new PrismaClient({
log: process.env.NODE_ENV === 'development'
? ['query', 'error', 'warn']
: ['error'],
});
if (process.env.NODE_ENV !== 'production') {
globalForPrisma.prisma = prisma;
}
4. Image Proxy for CORS
Problem: Direct LinkedIn image URLs blocked by CORS and ad blockers
Solution: Proxy images through our API
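// src/app/api/proxy-image/route.ts
import { NextRequest } from 'next/server';
export async function GET(request: NextRequest) {
const imageUrl = request.nextUrl.searchParams.get('url');
// Reject requests without a target URL
if (!imageUrl) {
return new Response('Missing url parameter', { status: 400 });
}
const response = await fetch(imageUrl, {
headers: {
'User-Agent': 'Mozilla/5.0...',
},
});
// Surface upstream failures instead of caching an error body for 24 hours
if (!response.ok) {
return new Response('Failed to fetch image', { status: 502 });
}
const imageBuffer = await response.arrayBuffer();
return new Response(imageBuffer, {
headers: {
'Content-Type': response.headers.get('Content-Type') ?? 'application/octet-stream',
'Cache-Control': 'public, max-age=86400', // 24 hours
},
});
}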
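A proxy that fetches any URL it is given can be abused to reach internal services, so it is worth restricting which hosts it will contact. The check below is a sketch: `isAllowedImageUrl` is a hypothetical helper, and the host list is an assumption for illustration that should be replaced with the CDN hosts your data actually references.

```typescript
// Hosts are an assumption for illustration; adjust to the CDNs your data uses.
const ALLOWED_IMAGE_HOSTS = ['media.licdn.com', 'static.licdn.com'];

// Accepts only https URLs whose hostname exactly matches an allowed host.
function isAllowedImageUrl(raw: string | null): boolean {
  if (!raw) return false;
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a valid absolute URL
  }
  return url.protocol === 'https:' && ALLOWED_IMAGE_HOSTS.indexOf(url.hostname) !== -1;
}

console.log(isAllowedImageUrl('https://media.licdn.com/dms/image/abc')); // true
console.log(isAllowedImageUrl('http://169.254.169.254/latest/meta-data')); // false
```

Exact hostname matching is deliberate: a suffix check such as `endsWith('licdn.com')` would also accept a lookalike host like `evil-licdn.com`.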
5. Redis for Hot Cache
Problem: PostgreSQL queries still take 10-50ms for popular searches
Solution: Redis in-memory cache for search results
// Cache TTL: 30 minutes for hot searches
export const CacheTTL = {
SEARCH_RESULTS: 30 * 60, // 30 minutes
PROFILE_SUMMARY: 60 * 60, // 1 hour
};
// Check Redis first, fallback to PostgreSQL
const cached = await getCachedSearchResults(query);
if (cached) return cached; // ~2ms
// Cache miss, fetch from DB/API
const parsedQuery = await parseSearchQuery(query);
const results = await searchLinkedInProfiles(parsedQuery);
await cacheSearchResults(query, parsedQuery, results);
Lessons Learned
1. Structured Output > Regex Parsing
Before: Manual regex parsing of queries
const match = query.match(/(\d+)\s+(.+)\s+in\s+(.+)/);
if (!match) throw new Error('Invalid query');
const [, count, role, location] = match;
After: Gemini 2.0 Flash with Zod schemas
const { object } = await generateObject({
model: google('gemini-2.0-flash-exp'),
schema: SearchQuerySchema,
prompt: `Parse: "${query}"`,
});
Why Better:
- Handles ambiguous queries ("software engineers at Google" - is Google a location or company?)
- Extracts implicit information (maps "Israel" → "IL" country code)
- No brittle regex maintenance
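To see how narrow the regex is, it can be run against the example queries from earlier in this post. The pattern is the one from the "Before" snippet; the variable names are mine.

```typescript
// The regex from the "Before" snippet, run against this post's example queries.
const LEGACY_PATTERN = /(\d+)\s+(.+)\s+in\s+(.+)/;

const sampleQueries = [
  '10 AI engineers in Israel',
  'Software engineers at Google',
  'Elon Musk',
  'Product managers in San Francisco with startup experience',
];

const regexCoverage = sampleQueries.map((query) => ({
  query,
  matched: LEGACY_PATTERN.test(query),
}));

const matchedCount = regexCoverage.filter((entry) => entry.matched).length;
console.log(`${matchedCount} of ${sampleQueries.length} queries parsed`);
// prints "1 of 4 queries parsed"
```

Only the first query fits the "count role in location" shape; the other three have no leading number, so the regex rejects them outright.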
2. Async by Default for External APIs
Bright Data's LinkedIn scraper takes 60-120 seconds. Never block the user:
// ❌ Bad: User waits 2 minutes
const profiles = await fetchLinkedInProfiles(urls);
return profiles;
// ✅ Good: Return immediately, process in background
const researchId = await createResearch(linkedinUrl, personName);
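runResearchGraph(researchId, linkedinUrl, personName) // Fire and forget
.catch((error) => console.error('Research run failed', error)); // never leave the rejection unhandled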
return { researchId, status: 'pending' };
Users can check status with polling or webhooks.
3. LangGraph for Complex Workflows
Before: Imperative spaghetti code
async function research(person) {
const linkedin = await fetchLinkedIn(person);
const query = await generateQuery(linkedin);
const results = await search(query);
const scraped = [];
for (const url of results) {
scraped.push(await scrape(url));
}
const summaries = [];
for (const content of scraped) {
summaries.push(await summarize(content));
}
return await writeReport(linkedin, summaries);
}
After: Declarative graph with parallel execution
const graph = new StateGraph(ResearchStateAnnotation)
.addNode('fetchLinkedIn', fetchLinkedInNode)
.addNode('search', searchNode)
.addNode('scrape', scrapeNode) // Parallel
.addNode('summarize', summarizeNode) // Parallel
.addNode('writeReport', writeReportNode)
.addConditionalEdges('search', routeToScraping)
// Remaining edges omitted for brevity
.compile();
const result = await graph.invoke({ personName, linkedinUrl });
Benefits:
- Parallel scraping (15 URLs in 10 seconds, not 150 seconds)
- Checkpointing (resume from failures)
- Easier to test individual nodes
4. Cache Invalidation is Hard
Challenge: When to refresh cached profiles?
Solution: Multi-factor scoring
function shouldRefresh(profile: Person): boolean {
const daysSinceUpdate = getDaysSince(profile.updatedAt);
const isPopular = profile.searchCount > 10;
const isRecent = getDaysSince(profile.lastViewed) < 7;
// Refresh if:
// - Stale (> 180 days)
// - Popular AND moderately old (> 30 days)
// - Recently viewed AND old (> 60 days)
return (
daysSinceUpdate > 180 ||
(isPopular && daysSinceUpdate > 30) ||
(isRecent && daysSinceUpdate > 60)
);
}
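The rule above relies on a `getDaysSince` helper that is not shown. A self-contained version with an injectable clock makes the thresholds easy to test. The helper and the `RefreshInfo` type are my own sketch; the thresholds are copied from the function above.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Only the fields the refresh rule needs.
interface RefreshInfo {
  updatedAt: Date;
  lastViewed: Date;
  searchCount: number;
}

function getDaysSince(date: Date, now: Date): number {
  return Math.floor((now.getTime() - date.getTime()) / MS_PER_DAY);
}

// Same thresholds as above, with `now` passed in so tests are deterministic.
function shouldRefreshAt(profile: RefreshInfo, now: Date): boolean {
  const daysSinceUpdate = getDaysSince(profile.updatedAt, now);
  const isPopular = profile.searchCount > 10;
  const isRecent = getDaysSince(profile.lastViewed, now) < 7;
  return (
    daysSinceUpdate > 180 ||
    (isPopular && daysSinceUpdate > 30) ||
    (isRecent && daysSinceUpdate > 60)
  );
}

const fixedNow = new Date('2025-01-01T00:00:00Z');
const daysAgo = (days: number) => new Date(fixedNow.getTime() - days * MS_PER_DAY);

const staleProfile = { updatedAt: daysAgo(200), lastViewed: daysAgo(100), searchCount: 0 };
const popularProfile = { updatedAt: daysAgo(45), lastViewed: daysAgo(20), searchCount: 25 };
const quietProfile = { updatedAt: daysAgo(45), lastViewed: daysAgo(3), searchCount: 2 };

console.log(shouldRefreshAt(staleProfile, fixedNow)); // true
console.log(shouldRefreshAt(popularProfile, fixedNow)); // true
console.log(shouldRefreshAt(quietProfile, fixedNow)); // false
```

The third case shows the gap between the rules: a recently viewed but unpopular profile is only refreshed once it is more than 60 days old.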
5. Type Safety Everywhere
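TypeScript + Zod + Prisma = type safety across the stack: Zod validates data at runtime, while Prisma's generated types and the TypeScript compiler catch mismatches at compile time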
// 1. Zod validates API input
const SearchQuerySchema = z.object({
count: z.number().min(1).max(50),
role: z.string().nullable(),
});
// 2. Prisma generates types from schema (findUnique can return null)
const profile: Person | null = await prisma.person.findUnique(...);
// 3. TypeScript enforces contracts
function transformProfile(data: ProfileData): Person {
// Compiler catches type mismatches
}
Result: Caught 50+ bugs at compile-time instead of runtime.
Conclusion
Building PeopleHub taught me:
- LLMs for structured extraction - Gemini 2.0 Flash + Zod is perfect for parsing
- Bright Data scales - Their APIs handle thousands of LinkedIn profiles without breaking
- LangGraph for workflows - State machines beat imperative code for complex flows
- Cache aggressively - 90% cache hit rate = 10x cost reduction
- Async everything - Never block users on slow external APIs
The entire codebase is open-source on GitHub. Check it out, try it yourself, and let me know what you think!
Tech Stack Summary:
- Next.js 15.5.4 + React 19.1.0
- TypeScript 5 + Zod 3.25.76
- Prisma 6.5.0 + PostgreSQL + Redis
- Google Gemini 2.0 Flash + Vercel AI SDK 5.0.60
- LangChain + LangGraph 1.0.1
- Bright Data APIs (Google Search, LinkedIn Scraper, MCP)
Questions?
Drop a comment below or reach out! Would love to hear your thoughts on:
- Alternative approaches to query parsing
- Other use cases for LangGraph
- Optimization ideas for the research engine
- Your experience with Bright Data APIs
Happy coding! 🚀