DEV Community

Cover image for In search of a BigQuery agent in LangChain, I built a Toolkit that acts as a BigQuery agent for Text-to-SQL.
Mahadeva Prasad M M
Mahadeva Prasad M M

Posted on

1

In search of a BigQuery agent in LangChain, I built a Toolkit that acts as a BigQuery agent for Text-to-SQL.

Building an Intelligent BigQuery Chatbot with LLMs

In today's data-driven world, extracting insights from data warehouses like BigQuery remains challenging for many business users without SQL expertise. What if we could bridge this gap with natural language? In this post, I'll walk through how I built a BigQuery chatbot that translates natural language questions into optimized SQL queries using Google's Gemini AI.

The Challenge

Many organizations store terabytes of valuable data in BigQuery, but accessing this information typically requires:

  • SQL knowledge
  • Understanding of the database schema
  • Query optimization skills

This creates bottlenecks where data analysts become the gatekeepers of information, slowing down decision-making across organizations.

Image description

Solution Architecture

I developed a modular system that combines the power of large language models with BigQuery's performance. Here's how it works:

  1. Schema Understanding: The system first analyzes and documents the database schema, creating rich metadata
  2. Question Analysis: When a user asks a natural language question, the system identifies relevant tables and fields
  3. SQL Generation: Using Gemini AI, it converts the question into optimized SQL
  4. Query Refinement: The query undergoes multiple iterations of reflection and improvement
  5. Error Handling: If execution fails, the system automatically attempts to fix the query
  6. Results Delivery: The final data is returned to the user in an accessible format

Key Components

1. Schema Documentation Generator

The first module generates comprehensive documentation for the BigQuery schema:

def generate_table_description(table_name, schema_fields):
    prompt = f"""
    As a data analyst, generate a concise and precise description for a BigQuery table named '{table_name}' with the following fields:

    {', '.join([field.name for field in schema_fields])}

    Requirements:
    1. Explain the table's primary purpose and its business context
    2. Keep the description under 100 words
    3. Focus on how this table would be used in data analysis
    4. Mention any key relationships or important fields

    Description format: [Table Purpose] | [Key Fields] | [Common Use Cases]
    """

    response = model.generate_content(prompt)
    return response.text
Enter fullscreen mode Exit fullscreen mode

This creates a rich knowledge base that helps the LLM understand both the technical schema and business context of each table and field.

2. Question Analyzer

The system analyzes natural language questions to identify the most relevant tables:

def analyze_question_for_tables(question, schema_data):
    """Analyze a natural language question to identify relevant tables."""
    relevant_tables = {}
    question_terms = set(question.lower().split())

    for table_name, schema in schema_data.items():
        score = 0
        matched_terms = set()

        # Check table name, description, and fields for relevance
        # ...

        if score > 0:
            relevant_tables[table_name] = {
                "schema": schema,
                "relevance_score": score,
                "matched_terms": list(matched_terms)
            }

    return dict(sorted(relevant_tables.items(), 
                       key=lambda x: x[1]["relevance_score"], 
                       reverse=True))
Enter fullscreen mode Exit fullscreen mode

This semantic matching ensures only the most relevant tables are considered for query generation.

3. SQL Generator with Reflective Improvements

The core of the system converts questions to SQL and refines them through iterative reflection:

def generate_sql_query(question, schema_data, context_manager=None):
    relevant_tables = analyze_question_for_tables(question, schema_data)
    prompt = format_prompt_with_relevant_tables(question, relevant_tables)

    # Generate initial query
    response = model.generate_content(prompt)
    result = json.loads(json_match.group(0))

    # Apply reflective improvement
    current_query = result["query"]
    for i in range(MAX_REFLECTIONS):
        reflection_result = reflect_on_query(question, current_query, 
                                            relevant_tables, schema_data)

        if reflection_result.get("is_correct", False):
            break

        current_query = reflection_result["improved_query"]

    result["query"] = current_query
    return result
Enter fullscreen mode Exit fullscreen mode

This reflection stage is critical - it allows the system to scrutinize its own work and fix potential issues before execution.

4. Robust Error Handling and Query Execution

The system includes sophisticated error handling and automatic recovery:

def execute_query_with_retries(client, question, sql_query, schema_data, 
                               relevant_tables, max_retries=MAX_RETRIES):
    attempts = 0
    current_query = sql_query

    while attempts < max_retries:
        try:
            query_job = client.query(current_query)
            results = query_job.to_dataframe()
            return results, None, error_history

        except Exception as e:
            error_message = str(e)
            attempts += 1

            # Try to fix the query using LLM
            fix_result = fix_sql_query(question, current_query, 
                                       error_message, schema_data, 
                                       relevant_tables)

            if fix_result.get("fixed_query"):
                current_query = fix_result["fixed_query"]
            else:
                return None, "Failed to fix query", error_history
Enter fullscreen mode Exit fullscreen mode

Benefits and Results

Implementing this chatbot has transformed how our organization interacts with data:

  1. Democratized Data Access: Non-technical users can now get answers directly from BigQuery
  2. Reduced Analyst Workload: Data scientists focus on complex analysis rather than basic queries
  3. Faster Decision Making: Business insights are available in seconds rather than days
  4. Improved Data Literacy: Users learn about available data as they interact with the system
  5. Optimized Query Performance: Generated queries follow best practices for efficiency

Technical Considerations

When building similar systems, consider these factors:

  • LLM Selection: Use a technically proficient model like Gemini 2.0 that understands SQL nuances
  • Cost Optimization: Implement caching for common questions and rate limiting for API calls
  • Security: Ensure proper authentication and data access controls
  • Iterative Refinement: Multi-step generation with feedback loops produces better results than single-pass approaches
  • Error Handling: Robust retry mechanisms with intelligent fixes are essential for production use

Next Steps

Future enhancements to the system include:

  1. Adding conversational memory for follow-up questions
  2. Implementing visual explanation of query execution plans
  3. Extending to multi-database support beyond BigQuery
  4. Adding export functionality to BI tools
  5. Building a user feedback loop for continuous improvement

Conclusion

By combining the natural language capabilities of LLMs with the analytical power of BigQuery, we've created a tool that truly democratizes data access. This approach not only saves time but fundamentally changes how organizations can leverage their data assets.

The most exciting aspect is that this is just the beginning - as LLMs continue to improve, so will the capabilities of systems built on top of them. The future of data interaction is conversational, and that future is already here.


What challenges have you faced when trying to make data more accessible in your organization? Share your thoughts in the comments below!

Top comments (0)

AWS GenAI LIVE!

GenAI LIVE! is a dynamic live-streamed show exploring how AWS and our partners are helping organizations unlock real value with generative AI.

Tune in to the full event

DEV is partnering to bring live events to the community. Join us or dismiss this billboard if you're not interested. ❤️