Meir

Building a Free AI-Powered Data Enrichment Tool with Bright Data MCP, Gemini, and Kimi K2

How I built a complete spreadsheet enrichment solution using free tiers of cutting-edge APIs


TL;DR

I built a full-stack data enrichment tool that transforms basic spreadsheet data into rich, contextual information using Bright Data's MCP (Model Context Protocol) server, Google Gemini, and Kimi K2 through OpenRouter. The best part? It's completely free to use within generous limits:

  • 🌐 Bright Data MCP: 5,000 requests/month (renewed monthly)
  • 🤖 Google Gemini: generous free tier for AI processing
  • 🧠 Kimi K2 via OpenRouter: a free model variant is available
  • 🚀 Result: professional-grade data enrichment without cost barriers

Demo Video

The Problem

Data enrichment is expensive. Whether you're a startup researching leads, a researcher gathering information, or a developer building demos, traditional data enrichment services can quickly become cost-prohibitive. I wanted to build something that democratizes access to real-time web intelligence.

Architecture Overview

The solution consists of three main components:

┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  React Frontend │    │  FastAPI Backend  │    │  AI + Web APIs  │
│                 │    │                   │    │                 │
│ • Spreadsheet   │◄──►│ • LangGraph       │◄──►│ • Bright Data   │
│ • CSV Upload    │    │ • Enrichment      │    │ • Gemini AI     │
│ • Real-time UI  │    │ • Batch Processing│    │ • Kimi K2       │
└─────────────────┘    └───────────────────┘    └─────────────────┘

Key Technologies

Bright Data MCP Server - The Game Changer

The Bright Data MCP server is what makes this project special. Unlike traditional web scraping or API approaches, MCP provides a standardized way for AI models to interact with external data sources.

Why MCP is revolutionary:

  • 🔌 Plug-and-play integration with AI agents
  • 🌍 Real-time web access without rate limiting headaches
  • 🛡️ Built-in compliance and ethical data collection
  • 📊 Structured data extraction ready for AI consumption

LangGraph for Agent Orchestration

I used LangGraph to create a sophisticated enrichment pipeline that can handle both standard and deep research modes:

class EnrichmentPipeline:
    def __init__(self, llm_provider: LLMProvider, deep_research: bool = False):
        self.llm = llm_provider
        self.bright_data_client = None
        self.tools = None
        self.deep_research = deep_research

    def build_graph(self):
        """Build and compile the enrichment graph"""
        graph = StateGraph(EnrichmentContext)

        if self.deep_research:
            graph.add_node("search", self.deep_search_bright_data)
        else:
            graph.add_node("search", self.search_bright_data)

        graph.add_node("extract", self.extract_minimal_answer)
        graph.add_edge(START, "search")
        graph.add_edge("search", "extract")
        graph.add_edge("extract", END)

        return graph.compile()

Implementation Deep Dive

1. Setting Up Bright Data MCP

The MCP integration is surprisingly straightforward:

async def initialize_bright_data(self):
    """Initialize Bright Data MCP client and tools"""
    if self.bright_data_client is None:
        bright_data_config = {
            "mcpServers": {
                "Bright Data": {
                    "command": "npx",
                    "args": ["@brightdata/mcp"],
                    "env": {
                        "API_TOKEN": os.getenv("BRIGHT_DATA_API_TOKEN"),
                        "WEB_UNLOCKER_ZONE": os.getenv("WEB_UNLOCKER_ZONE", "unblocker"),
                        "BROWSER_ZONE": os.getenv("BROWSER_ZONE", "scraping_browser")
                    }
                }
            }
        }

        self.bright_data_client = MCPClient.from_dict(bright_data_config)
        adapter = LangChainAdapter()
        self.tools = await adapter.create_tools(self.bright_data_client)

2. Dual-Mode Intelligence

The system supports two enrichment modes:

Standard Mode (Gemini + Bright Data)

Fast, efficient enrichment for basic use cases:

async def search_bright_data(self, state: EnrichmentContext):
    """Run standard Bright Data search using MCP"""
    try:
        await self.initialize_bright_data()

        query = f"{state.column_name} of {state.target_value}"
        search_tool = next((tool for tool in self.tools if "search" in tool.name.lower()), None)

        if search_tool:
            search_result = await search_tool.ainvoke({"query": query})
            # Process and return structured results

    except Exception as e:
        logger.error(f"Error in search_bright_data: {str(e)}")
        raise

Deep Research Mode (Kimi K2 + ReAct Agent)

For complex research requiring multi-source verification:

async def deep_search_bright_data(self, state: EnrichmentContext):
    """Run deep research using ReAct Agent with Kimi K2"""
    try:
        # Build the same query used in standard mode
        query = f"{state.column_name} of {state.target_value}"

        # Initialize Kimi K2 through OpenRouter
        kimi_llm = ChatOpenAI(
            openai_api_key=os.getenv("OPENROUTER_API_KEY"),
            openai_api_base="https://openrouter.ai/api/v1",
            model_name="moonshotai/kimi-k2"
        )

        # Create ReAct agent with Bright Data tools
        agent = create_react_agent(
            model=kimi_llm,
            tools=self.tools,
            prompt=system_prompt
        )

        result = await agent.ainvoke({
            "messages": [("human", query)]
        })
        # (post-processing of `result` into the graph state omitted for brevity)

    except Exception as e:
        logger.error(f"Error in deep_search: {str(e)}")
        raise

3. Batch Processing for Performance

One key optimization was implementing batch processing to handle entire columns efficiently:

@app.post("/api/enrich/batch", response_model=BatchEnrichmentResponse)
async def enrich_batch(request: BatchEnrichmentRequest):
    """Enrich multiple rows in parallel."""
    try:
        # Process each row
        tasks = []
        for row in request.rows:
            if row.strip():
                task = enrich_cell_with_graph(
                    column_name=request.column_name,
                    target_value=row,
                    context_values=request.context_values,
                    llm_provider=gemini_provider,
                    deep_research=request.deep_research
                )
                tasks.append(task)

        # Execute all enrichments in parallel
        enriched_values = await asyncio.gather(*tasks, return_exceptions=True)

        # Replace failed rows with empty strings before returning
        final_values = [
            "" if isinstance(value, Exception) else value
            for value in enriched_values
        ]

        return BatchEnrichmentResponse(
            enriched_values=final_values,
            status="success",
            sources=all_sources  # per-row sources collected during enrichment
        )

    except Exception as e:
        # Error handling...
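One caveat with a plain asyncio.gather: a column with hundreds of rows can fire every request at once and blow past free-tier rate limits. A sketch of one way to cap concurrency (the semaphore limit of 5 and the helper name are my own, not the project's actual code):

```python
import asyncio

async def gather_with_limit(tasks, limit: int = 5):
    """Run coroutines concurrently, but never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(
        *(bounded(t) for t in tasks), return_exceptions=True
    )

async def main():
    async def fake_enrich(row: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for a real enrichment call
        return row.upper()

    results = await gather_with_limit([fake_enrich(r) for r in ["acme", "globex"]])
    print(results)  # ['ACME', 'GLOBEX']

asyncio.run(main())
```

Because `gather` preserves input order, the enriched values still line up with the original rows.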

4. React Frontend with Real-time Updates

The frontend provides a smooth, spreadsheet-like experience with real-time loading states:

const enrichColumn = async (colIndex: number) => {
  // Set loading state for all cells
  const newRows = data.rows.map((row) => {
    const newRow = [...row];
    if (newRow.some((cell) => cell.value?.length)) {
      newRow[colIndex] = { ...newRow[colIndex], loading: true };
    }
    return newRow;
  });
  setData({ ...data, rows: newRows });

  try {
    const response = await fetch(`${API_URL}/api/enrich/batch`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        column_name: data.headers[colIndex],
        rows: targetValues,
        context_values: contextValues,
        deep_research: deepResearch,
      }),
    });

    const result = await response.json();

    // Update all cells with enriched data
    const enrichedRows = data.rows.map((row, rowIndex) => {
      const newRow = [...row];
      newRow[colIndex] = {
        value: result.enriched_values[rowIndex],
        sources: result.sources[rowIndex],
        enriched: result.enriched_values[rowIndex] !== "",
        loading: false,
      };
      return newRow;
    });

    setData({ ...data, rows: enrichedRows });
  } catch (error) {
    // Error handling...
  }
};

The Free Tier Advantage

Cost Breakdown (Monthly)

  • Bright Data MCP: 5,000 requests - $0
  • Google Gemini: ~15M tokens - $0 (within free tier)
  • OpenRouter Kimi K2: ~500K tokens - $0 (within free tier)
  • Hosting: Vercel/Railway free tiers - $0

Total: $0/month for moderate usage

Real-world Usage

With 5,000 Bright Data requests per month, you can realistically enrich:

  • 📊 1,000 companies with 5 data points each
  • 🔍 500 research subjects with 10 data points each
  • 📈 250 comprehensive profiles with 20 data points each
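The arithmetic behind those figures is just the monthly request budget divided by data points per row, assuming one MCP request per enriched cell:

```python
MONTHLY_REQUESTS = 5_000  # Bright Data MCP free tier

def rows_per_month(data_points_per_row: int, budget: int = MONTHLY_REQUESTS) -> int:
    """How many rows can be fully enriched at one request per data point."""
    return budget // data_points_per_row

print(rows_per_month(5))   # 1000 companies
print(rows_per_month(10))  # 500 research subjects
print(rows_per_month(20))  # 250 comprehensive profiles
```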

Challenges and Solutions

1. Rate Limiting Coordination

Challenge: Managing rate limits across multiple free services.
Solution: Implemented exponential backoff and request queuing:

async def rate_limited_request(self, func, *args, **kwargs):
    """Execute request with exponential backoff"""
    for attempt in range(3):
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            if "rate limit" in str(e).lower():
                wait_time = (2 ** attempt) * 1
                await asyncio.sleep(wait_time)
            else:
                raise
    raise Exception("Max retries exceeded")
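The same backoff pattern can be exercised as a standalone function against a flaky stub (the stub, its failure count, and the shortened demo sleeps are invented for illustration):

```python
import asyncio

async def rate_limited_request(func, *args, retries: int = 3, **kwargs):
    """Retry `func` with exponential backoff when it reports a rate limit."""
    for attempt in range(retries):
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < retries - 1:
                await asyncio.sleep(2 ** attempt * 0.01)  # short waits for the demo
            else:
                raise
    raise Exception("Max retries exceeded")

calls = {"n": 0}

async def flaky_search(query: str) -> str:
    # Fails with a rate-limit error on the first two calls, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise Exception("429: rate limit exceeded")
    return f"results for {query}"

print(asyncio.run(rate_limited_request(flaky_search, "CEO of Acme")))
# results for CEO of Acme
```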

2. Data Quality Consistency

Challenge: Ensuring consistent, structured output from different AI models.
Solution: Robust prompt engineering and output validation:

async def extract_minimal_answer(self, state: EnrichmentContext) -> Dict:
    """Use the LLM to extract a minimal answer from Bright Data's search results."""
    # `content` holds the raw search output produced by the preceding graph node
    prompt = f"""
    Extract the {state.column_name} of {state.target_value} from this search result:

    {content}

    Rules:
    1. Provide ONLY the direct answer - no explanations
    2. Be concise
    3. If information is not found, respond "Information not found"
    4. Do not provide citations or references
    Direct Answer:
    """

    answer = await self.llm.generate(prompt)
    return {"answer": answer}
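Even with strict prompt rules, models drift: stray whitespace, trailing periods, or a differently cased "not found" sentinel. A small normalization pass (a sketch of my own, not the project's exact validation code) keeps the spreadsheet cells clean:

```python
NOT_FOUND = "information not found"

def normalize_answer(raw: str) -> str:
    """Collapse whitespace and map the 'not found' sentinel to an empty cell."""
    answer = " ".join(raw.split())
    if answer.lower().strip(". ") == NOT_FOUND:
        return ""  # empty string renders as an un-enriched cell in the UI
    return answer

print(normalize_answer("  Sundar\n Pichai "))      # Sundar Pichai
print(normalize_answer("Information not found."))  # (empty string)
```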

3. CSV Parsing Robustness

Challenge: Handling malformed CSV files with quoted content.
Solution: Custom CSV parser with proper quote handling:

const parseCSVLine = (line: string): string[] => {
  const result: string[] = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const char = line[i];
    const nextChar = line[i + 1];

    if (char === '"') {
      if (inQuotes && nextChar === '"') {
        // Handle escaped quotes
        current += '"';
        i++; // Skip the next quote
      } else {
        // Toggle quote state
        inQuotes = !inQuotes;
      }
    } else if (char === ',' && !inQuotes) {
      result.push(current);
      current = '';
    } else {
      current += char;
    }
  }

  result.push(current);
  return result;
};
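On the Python side, the stdlib csv module already implements the same quoting rules (RFC 4180 style, including doubled escaped quotes), so the backend never needs a hand-rolled parser. A minimal per-line wrapper, assuming the input arrives one line at a time as in the frontend:

```python
import csv
import io

def parse_csv_line(line: str) -> list[str]:
    """Parse one CSV line, honoring quoted fields and escaped quotes."""
    return next(csv.reader(io.StringIO(line)))

print(parse_csv_line('Acme,"Widgets, Inc.","She said ""hi"""'))
# ['Acme', 'Widgets, Inc.', 'She said "hi"']
```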

Deployment and Production Considerations

Environment Setup

# Backend dependencies
pip install fastapi uvicorn langgraph langchain-google-genai langchain-openai

# Environment variables
BRIGHT_DATA_API_TOKEN=your_token
GOOGLE_API_KEY=your_gemini_key
OPENROUTER_API_KEY=your_openrouter_key
WEB_UNLOCKER_ZONE=unblocker
BROWSER_ZONE=scraping_browser
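A missing API key tends to surface as a cryptic error deep inside the pipeline, so it's worth failing fast at startup. A small sketch (the helper name is mine; the variable names come from the snippet above):

```python
import os

REQUIRED_VARS = ["BRIGHT_DATA_API_TOKEN", "GOOGLE_API_KEY", "OPENROUTER_API_KEY"]

def check_env() -> list[str]:
    """Return the names of any required environment variables that are unset."""
    return [name for name in REQUIRED_VARS if not os.getenv(name)]

missing = check_env()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```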

Docker Deployment

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["python", "app.py"]

Future Enhancements

  1. 🔄 Real-time Collaboration: WebSocket integration for multi-user editing
  2. 📊 Advanced Analytics: Built-in data visualization and insights
  3. 🔌 Custom Connectors: Support for additional data sources via MCP
  4. 🤖 Smart Suggestions: AI-powered column header and enrichment suggestions
  5. 📈 Usage Analytics: Track enrichment patterns and optimize workflows

Key Takeaways

  1. MCP is a game-changer for AI-web integration - it abstracts away the complexity of web scraping while providing reliable, structured data access.

  2. Free tiers can be powerful when combined strategically. The combination of Bright Data MCP, Gemini, and Kimi K2 provides enterprise-grade capabilities at zero cost.

  3. LangGraph simplifies complex workflows - building sophisticated agent pipelines becomes manageable with proper state management.

  4. User experience matters - Real-time updates, smooth animations, and intuitive interfaces make technical tools accessible to non-technical users.

  5. Batch processing is essential for performance - handling multiple enrichments in parallel dramatically improves user experience.

Get Started

The complete source code is available on GitHub, with detailed setup instructions for both development and production deployment. Whether you're building a research tool, lead generation system, or data analysis platform, this foundation provides a solid starting point for any data enrichment needs.

The future of data enrichment is democratized, intelligent, and surprisingly affordable. With tools like Bright Data's MCP server, we're moving toward a world where powerful data intelligence is accessible to everyone - not just enterprises with large budgets.


Want to see this in action? Check out the live demo or explore the GitHub repository to get started with your own data enrichment projects.

Tags: #brightdata #mcp #ai #datascience #webdev #python #react #typescript #langchain #openai #gemini
