How I built a complete spreadsheet enrichment solution using free tiers of cutting-edge APIs
TL;DR
I built a full-stack data enrichment tool that transforms basic spreadsheet data into rich, contextual information using Bright Data's MCP (Model Context Protocol) server, Google Gemini, and Kimi K2 through OpenRouter. The best part? It's completely free to use within generous limits:
- Bright Data MCP: 5,000 requests/month (renewed monthly)
- Google Gemini: generous free tier for AI processing
- Kimi K2 via OpenRouter: usable on the free tier
- Result: professional-grade data enrichment without cost barriers
The Problem
Data enrichment is expensive. Whether you're a startup researching leads, a researcher gathering information, or a developer building demos, traditional data enrichment services can quickly become cost-prohibitive. I wanted to build something that democratizes access to real-time web intelligence.
Architecture Overview
The solution consists of three main components:
┌────────────────────┐      ┌────────────────────┐      ┌────────────────────┐
│  React Frontend    │      │  FastAPI Backend   │      │  AI + Web APIs     │
│                    │      │                    │      │                    │
│ • Spreadsheet      │─────▶│ • LangGraph        │─────▶│ • Bright Data      │
│ • CSV Upload       │      │ • Enrichment       │      │ • Gemini AI        │
│ • Real-time UI     │      │ • Batch Processing │      │ • Kimi K2          │
└────────────────────┘      └────────────────────┘      └────────────────────┘
Key Technologies
Bright Data MCP Server - The Game Changer
The Bright Data MCP server is what makes this project special. Unlike traditional web scraping or API approaches, MCP provides a standardized way for AI models to interact with external data sources.
Why MCP is revolutionary:
- Plug-and-play integration with AI agents
- Real-time web access without rate-limiting headaches
- Built-in compliance and ethical data collection
- Structured data extraction ready for AI consumption
LangGraph for Agent Orchestration
I used LangGraph to create a sophisticated enrichment pipeline that can handle both standard and deep research modes:
from langgraph.graph import StateGraph, START, END

class EnrichmentPipeline:
    def __init__(self, llm_provider: LLMProvider, deep_research: bool = False):
        self.llm = llm_provider
        self.bright_data_client = None
        self.tools = None
        self.deep_research = deep_research

    def build_graph(self):
        """Build and compile the enrichment graph"""
        graph = StateGraph(EnrichmentContext)

        if self.deep_research:
            graph.add_node("search", self.deep_search_bright_data)
        else:
            graph.add_node("search", self.search_bright_data)

        graph.add_node("extract", self.extract_minimal_answer)

        graph.add_edge(START, "search")
        graph.add_edge("search", "extract")
        graph.add_edge("extract", END)

        return graph.compile()
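The `EnrichmentContext` state isn't shown in the snippets. Here's a minimal sketch of what it might look like, inferred from the fields the nodes read and write (`column_name`, `target_value`, `search_results`, `answer`) - treat the exact shape as an assumption:

from pydantic import BaseModel

# Hypothetical state model, inferred from how the graph nodes use it
class EnrichmentContext(BaseModel):
    column_name: str          # e.g. "CEO" - the attribute to enrich
    target_value: str         # e.g. "Stripe" - the row's key value
    search_results: str = ""  # raw output written by the search node
    answer: str = ""          # minimal answer written by the extract node

# Usage: compile once, then invoke per cell
# pipeline = EnrichmentPipeline(llm_provider=gemini_provider)
# graph = pipeline.build_graph()
# result = await graph.ainvoke(
#     EnrichmentContext(column_name="CEO", target_value="Stripe")
# )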
Implementation Deep Dive
1. Setting Up Bright Data MCP
The MCP integration is surprisingly straightforward:
import os
from mcp_use import MCPClient
# LangChainAdapter also ships with the mcp-use package; the exact import
# path depends on the installed version.

async def initialize_bright_data(self):
    """Initialize Bright Data MCP client and tools"""
    if self.bright_data_client is None:
        bright_data_config = {
            "mcpServers": {
                "Bright Data": {
                    "command": "npx",
                    "args": ["@brightdata/mcp"],
                    "env": {
                        "API_TOKEN": os.getenv("BRIGHT_DATA_API_TOKEN"),
                        "WEB_UNLOCKER_ZONE": os.getenv("WEB_UNLOCKER_ZONE", "unblocker"),
                        "BROWSER_ZONE": os.getenv("BROWSER_ZONE", "scraping_browser")
                    }
                }
            }
        }
        self.bright_data_client = MCPClient.from_dict(bright_data_config)
        adapter = LangChainAdapter()
        self.tools = await adapter.create_tools(self.bright_data_client)
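Once the adapter has run, `self.tools` is a list of LangChain-compatible tools. A quick debugging sketch (not part of the pipeline) to see what the Bright Data server exposes:

# List the tool names the MCP server advertises
await pipeline.initialize_bright_data()
for tool in pipeline.tools:
    print(tool.name)  # e.g. a search tool, scraping tools, etc.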
2. Dual-Mode Intelligence
The system supports two enrichment modes:
Standard Mode (Gemini + Bright Data)
Fast, efficient enrichment for basic use cases:
import logging

logger = logging.getLogger(__name__)

async def search_bright_data(self, state: EnrichmentContext):
    """Run standard Bright Data search using MCP"""
    try:
        await self.initialize_bright_data()
        query = f"{state.column_name} of {state.target_value}"

        search_tool = next((tool for tool in self.tools if "search" in tool.name.lower()), None)
        if search_tool:
            search_result = await search_tool.ainvoke({"query": query})
            # Process and return structured results
    except Exception as e:
        logger.error(f"Error in search_bright_data: {str(e)}")
        raise
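The post leaves the "process and return" step as a comment. One plausible way to fill it in, given that LangGraph nodes return partial state updates as dicts (the field name and truncation limit are assumptions):

# Hypothetical helper for the elided processing step
def _to_state_update(search_result) -> dict:
    # Coerce the tool output to text and truncate so the
    # downstream extraction prompt stays within token limits
    text = search_result if isinstance(search_result, str) else str(search_result)
    return {"search_results": text[:8000]}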
Deep Research Mode (Kimi K2 + ReAct Agent)
For complex research requiring multi-source verification:
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

async def deep_search_bright_data(self, state: EnrichmentContext):
    """Run deep research using ReAct Agent with Kimi K2"""
    try:
        # Initialize Kimi K2 through OpenRouter
        kimi_llm = ChatOpenAI(
            openai_api_key=os.getenv("OPENROUTER_API_KEY"),
            openai_api_base="https://openrouter.ai/api/v1",
            model_name="moonshotai/kimi-k2"
        )

        # Same query shape as the standard mode
        query = f"{state.column_name} of {state.target_value}"

        # Create ReAct agent with Bright Data tools
        # (system_prompt is the research prompt, defined elsewhere)
        agent = create_react_agent(
            model=kimi_llm,
            tools=self.tools,
            prompt=system_prompt
        )

        result = await agent.ainvoke({
            "messages": [("human", query)]
        })
    except Exception as e:
        logger.error(f"Error in deep_search: {str(e)}")
        raise
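The snippet ends before the agent's result is unpacked. LangGraph's prebuilt ReAct agent returns its state as a dict with a `messages` list, so the node presumably hands the last message to the extract step - a hedged sketch of those final lines:

# Inside the try block, after agent.ainvoke(...):
final_message = result["messages"][-1]
return {"search_results": final_message.content}  # field name assumed, see state sketch above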
3. Batch Processing for Performance
One key optimization was implementing batch processing to handle entire columns efficiently:
from fastapi import HTTPException

@app.post("/api/enrich/batch", response_model=BatchEnrichmentResponse)
async def enrich_batch(request: BatchEnrichmentRequest):
    """Enrich multiple rows in parallel."""
    try:
        # Build one enrichment task per non-empty row
        tasks = []
        for row in request.rows:
            if row.strip():
                task = enrich_cell_with_graph(
                    column_name=request.column_name,
                    target_value=row,
                    context_values=request.context_values,
                    llm_provider=gemini_provider,
                    deep_research=request.deep_research
                )
                tasks.append(task)

        # Execute all enrichments in parallel
        enriched_values = await asyncio.gather(*tasks, return_exceptions=True)

        # Collapse failed rows to empty strings so every row still gets a
        # response (per-row source collection elided here)
        final_values = ["" if isinstance(v, Exception) else v for v in enriched_values]

        return BatchEnrichmentResponse(
            enriched_values=final_values,
            status="success",
            sources=all_sources
        )
    except Exception as e:
        # Error handling...
        raise HTTPException(status_code=500, detail=str(e))
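The request and response models aren't shown in the post. Based on the fields used above and in the frontend call below, they presumably look something like this (field names and types inferred - treat as a sketch):

from typing import List, Optional
from pydantic import BaseModel

class BatchEnrichmentRequest(BaseModel):
    column_name: str                             # header of the column being enriched
    rows: List[str]                              # key value for each row
    context_values: Optional[List[str]] = None   # extra context per row
    deep_research: bool = False                  # standard vs. deep research mode

class BatchEnrichmentResponse(BaseModel):
    enriched_values: List[str]                   # one answer per input row
    status: str
    sources: List[List[str]] = []                # per-row source URLs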
4. React Frontend with Real-time Updates
The frontend provides a smooth, spreadsheet-like experience with real-time loading states:
const enrichColumn = async (colIndex: number) => {
  // Set loading state for all cells
  const newRows = data.rows.map((row) => {
    const newRow = [...row];
    if (newRow.some((cell) => cell.value?.length)) {
      newRow[colIndex] = { ...newRow[colIndex], loading: true };
    }
    return newRow;
  });
  setData({ ...data, rows: newRows });

  try {
    const response = await fetch(`${API_URL}/api/enrich/batch`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        column_name: data.headers[colIndex],
        rows: targetValues,
        context_values: contextValues,
        deep_research: deepResearch,
      }),
    });
    const result = await response.json();

    // Update all cells with enriched data
    const enrichedRows = data.rows.map((row, rowIndex) => {
      const newRow = [...row];
      newRow[colIndex] = {
        value: result.enriched_values[rowIndex],
        sources: result.sources[rowIndex],
        enriched: result.enriched_values[rowIndex] !== "",
        loading: false,
      };
      return newRow;
    });
    setData({ ...data, rows: enrichedRows });
  } catch (error) {
    // Error handling...
  }
};
The Free Tier Advantage
Cost Breakdown (Monthly)
- Bright Data MCP: 5,000 requests - $0
- Google Gemini: ~15M tokens - $0 (within free tier)
- OpenRouter Kimi K2: ~500K tokens - $0 (within free tier)
- Hosting: Vercel/Railway free tiers - $0
Total: $0/month for moderate usage
Real-world Usage
With 5,000 Bright Data requests per month, you can realistically enrich:
- 1,000 companies with 5 data points each
- 500 research subjects with 10 data points each
- 250 comprehensive profiles with 20 data points each
Challenges and Solutions
1. Rate Limiting Coordination
Challenge: Managing rate limits across multiple free services.
Solution: Implemented exponential backoff and request queuing:
import asyncio

async def rate_limited_request(self, func, *args, **kwargs):
    """Execute request with exponential backoff"""
    for attempt in range(3):
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            if "rate limit" in str(e).lower():
                wait_time = (2 ** attempt) * 1  # 1s, 2s, 4s
                await asyncio.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")
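Wrapping an MCP tool call is then a one-liner (a hypothetical call site, assuming the method lives on the pipeline class):

# Retry the Bright Data search on rate-limit errors
search_result = await self.rate_limited_request(search_tool.ainvoke, {"query": query})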
2. Data Quality Consistency
Challenge: Ensuring consistent, structured output from different AI models.
Solution: Robust prompt engineering and output validation:
from typing import Dict

async def extract_minimal_answer(self, state: EnrichmentContext) -> Dict:
    """Use LLM to extract a minimal answer from Bright Data's results."""
    content = state.search_results  # populated by the search node (field name assumed)

    prompt = f"""
    Extract the {state.column_name} of {state.target_value} from this search result:

    {content}

    Rules:
    1. Provide ONLY the direct answer - no explanations
    2. Be concise
    3. If information is not found, respond "Information not found"
    4. Do not provide citations or references

    Direct Answer:
    """

    answer = await self.llm.generate(prompt)
    return {"answer": answer}
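The post mentions output validation alongside prompt engineering but doesn't show it. A minimal normalization pass might look like this (a sketch - the length cap and phrasing check are assumptions):

def validate_answer(raw: str) -> str:
    """Normalize LLM output: strip whitespace, map 'not found' replies to empty."""
    answer = raw.strip().strip('"')
    if not answer or "information not found" in answer.lower():
        return ""  # the frontend treats "" as not-enriched
    if len(answer) > 200:
        answer = answer[:200]  # guard against the model ignoring the "be concise" rule
    return answer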
3. CSV Parsing Robustness
Challenge: Handling CSV files where fields contain commas or escaped quotes inside quoted strings.
Solution: A custom CSV parser with proper quote handling - for example, `Acme,"New York, NY",2020` should parse into three fields, not four:
const parseCSVLine = (line: string): string[] => {
  const result: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    const nextChar = line[i + 1];

    if (char === '"') {
      if (inQuotes && nextChar === '"') {
        // Handle escaped quotes
        current += '"';
        i++; // Skip the next quote
      } else {
        // Toggle quote state
        inQuotes = !inQuotes;
      }
    } else if (char === ',' && !inQuotes) {
      result.push(current);
      current = '';
    } else {
      current += char;
    }
  }

  result.push(current);
  return result;
};
Deployment and Production Considerations
Environment Setup
# Backend dependencies
pip install fastapi uvicorn langgraph langchain-google-genai langchain-openai mcp-use
# Environment variables
BRIGHT_DATA_API_TOKEN=your_token
GOOGLE_API_KEY=your_gemini_key
OPENROUTER_API_KEY=your_openrouter_key
WEB_UNLOCKER_ZONE=unblocker
BROWSER_ZONE=scraping_browser
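The backend reads these with `os.getenv`. If you keep them in a local `.env` file during development, python-dotenv (an extra dependency, not listed above) loads them at startup:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read .env into the process environment
assert os.getenv("BRIGHT_DATA_API_TOKEN"), "Bright Data token missing"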
Docker Deployment
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "app.py"]
Future Enhancements
- Real-time Collaboration: WebSocket integration for multi-user editing
- Advanced Analytics: built-in data visualization and insights
- Custom Connectors: support for additional data sources via MCP
- Smart Suggestions: AI-powered column header and enrichment suggestions
- Usage Analytics: track enrichment patterns and optimize workflows
Key Takeaways
MCP is a game-changer for AI-web integration - it abstracts away the complexity of web scraping while providing reliable, structured data access.
Free tiers can be powerful when combined strategically. The combination of Bright Data MCP, Gemini, and Kimi K2 provides enterprise-grade capabilities at zero cost.
LangGraph simplifies complex workflows - building sophisticated agent pipelines becomes manageable with proper state management.
User experience matters - Real-time updates, smooth animations, and intuitive interfaces make technical tools accessible to non-technical users.
Batch processing is essential for performance - handling multiple enrichments in parallel dramatically improves user experience.
Get Started
The complete source code is available on GitHub, with detailed setup instructions for both development and production deployment. Whether you're building a research tool, lead generation system, or data analysis platform, this foundation provides a solid starting point for any data enrichment needs.
The future of data enrichment is democratized, intelligent, and surprisingly affordable. With tools like Bright Data's MCP server, we're moving toward a world where powerful data intelligence is accessible to everyone - not just enterprises with large budgets.
Want to see this in action? Check out the live demo or explore the GitHub repository to get started with your own data enrichment projects.
Tags: #brightdata #mcp #ai #datascience #webdev #python #react #typescript #langchain #openai #gemini