How to Evaluate Your Text-to-SQL Agent in Cortex Analyst Using TruLens

Overview

Ever since LLMs caught attention in the data domain, many enterprises have tried to reap their benefits. However, given the fast pace of breakthroughs in foundation models and their ecosystems, many find it difficult to integrate these tools in a way that meets enterprise standards. One of the most popular discussions revolves around enabling talk-to-your-data capabilities. Many attempts have been made, but the results are still difficult to quantify. This differs from traditional machine learning, where evaluation methods are already mature. In this article, I will share one approach to LLM evaluation, particularly for text-to-SQL agents.

The rise of Text-to-SQL AI

Let’s face reality. Many analysts in the enterprise are stuck in the role of SQL translator: converting business questions into SQL against data that can come from many different sources. In addition, let’s admit that doing those mundane tasks is not great for your long-term career. On the other hand, enterprises see this translation process as a major bottleneck, with back-and-forth meetings to clarify definitions and make sure the numbers are correct.

Now the hope is that LLMs can solve this problem. While the critical component has been identified (namely the semantic view or semantic model), it is still unclear how to fold the evaluation process into the development cycle. The semantic model serves as an enterprise-wide governance layer for the text-to-SQL agent. More specifically, this layer provides value by:

  • Enriching Context: Informs the LLM of business-friendly terms (e.g., "Monthly Revenue") and how they map to complex, underlying schema objects.
  • Enforcing Compliance: Ensures the generated SQL adheres to predefined rules, preventing the LLM from accessing unauthorized or irrelevant data, which is crucial for security and compliance.

Given the nature of semantic models, you can see that this differs from the usual machine learning model evaluation with metrics like accuracy, recall, precision, RMSE, etc. The evaluation process is still mostly done manually, with analysts checking the accuracy of the generated SQL.

The Challenge: Running Evaluation in a Scalable Way

Doing the evaluation manually is tedious. Not only do analysts need to check the generated queries, they also need to check whether the SQL produces valid results. This is mainly because the output of SQL is a number or a table that has to be correct without any deviation. For example, if the question is about monthly revenue, the query should aggregate over the right columns and apply the correct formula to them. This is the kind of result that BI reporting then visualizes as a bar chart or line chart.
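To make this concrete, below is a naive sketch of what that manual check boils down to; the helper name and its exact-match comparison are my own illustration, not part of the setup later in this article:

# Hypothetical helper: run both the generated and the expected SQL, then
# compare the resulting values cell by cell. A real check would also need
# to tolerate column aliases, row order, and rounding differences.
def results_match(session, generated_sql: str, expected_sql: str) -> bool:
    gen_df = session.sql(generated_sql).to_pandas()
    exp_df = session.sql(expected_sql).to_pandas()
    if gen_df.shape != exp_df.shape:
        return False
    return bool((gen_df.to_numpy() == exp_df.to_numpy()).all())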

You can imagine what happens if there are changes in any of the following areas:

  • Semantic model: we add more tables, remove irrelevant columns, or modify table or column definitions
  • Response instructions: we modify the response instructions to refine the output
  • Verified queries: we add more sample valid queries to enrich the agent

Then the evaluation process must be done again to ensure the agent still works as expected.

Thanks to the rapid development in the evaluation space, we can now use TruLens as one option for doing this automatically. The high-level approach is quite similar to traditional evaluation: we need to prepare an evaluation dataset. The main difference is that the ground truth is no longer a label or value but the expected SQL output.

TruLens for Cortex Analyst

The TruLens library is quite versatile, so we can use it to evaluate many LLM-based processes. Snowflake already created documentation and a hands-on tutorial for RAG-based evaluation here. I modified this tutorial a bit so the same approach can be applied to Cortex Analyst (Snowflake’s text-to-SQL capability).

Why this matters for AI teams

  • Using TruLens in Snowflake is a key component for operationalizing and monitoring Text-to-SQL models in production. The evaluation is not a one-time test but a continuous observability loop.
  • The evaluation process ensures teams can rely on the generated SQL for critical reports, eliminating the tedious manual validation step and speeding up time-to-insight.

Libraries

We need several libraries to enable this:

import requests
import json
import pandas as pd
import os
from snowflake.snowpark import Session

# TruLens components imports
from trulens.apps.custom import TruCustomApp as TruApp
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
from trulens.core.run import Run, RunConfig

If any libraries are missing, we can install them using pip:

pip install snowflake-snowpark-python pandas pyarrow requests trulens-core trulens-connectors-snowflake trulens-providers-cortex

Create CortexAnalystApp with TruLens Instrumentation

Set up the configuration to create a connection to Snowflake from your local environment. For security reasons, please don’t put the credentials in the code; you can use environment variables or Snowflake Secrets (Medium, Documentation) to store them.

# ⚠️ CONFIGURATION

# Snowflake connection parameters
connection_parameters = {
    "account": "ORGNAME-ACCOUNTNAME", # using new format
    "user": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
    "warehouse": "YOUR_COMPUTE",
    "database": "YOUR_DB",
    "schema": "YOUR_SCHEMA",
    "role": "YOUR_ROLE"
}

# PAT Token (create in Snowsight → Profile → Personal Access Tokens)
PAT_TOKEN = os.environ.get("SNOWFLAKE_PAT_TOKEN")  # avoid hard-coding; env var name is up to you

# Semantic model path (must be uploaded to a Snowflake stage)
SEMANTIC_MODEL_PATH = "@YOUR_DB.YOUR_SCHEMA.YOUR_STAGE/YOUR_SEMANTIC.yaml"

# Evaluation dataset and observability configuration
EVAL_DATASET_TABLE = "YOUR_DB.YOUR_SCHEMA.ANALYST_EVAL_DATASET"  # Table with 'question' and 'expected_sql' columns

# You need to create a new database and schema for the observability results
OBSERVABILITY_DB = "OBSERVABILITY_DB"
OBSERVABILITY_SCHEMA = "OBSERVABILITY_SCHEMA"

session = Session.builder.configs(connection_parameters).create()
print(f"✅ Session created: {session.get_current_database()}.{session.get_current_schema()}")

A sample semantic model based on the TPCH dataset can be downloaded from here. I use the TPCH dataset for this exercise; it is readily available as SNOWFLAKE_SAMPLE_DATA when you sign up for a Snowflake account.
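If you maintain the YAML locally, one convenient way to upload it to the stage is Snowpark’s file API. Here is a minimal sketch; the stage name and the local path are placeholders:

# Minimal sketch: create the stage if needed and upload the semantic model.
session.sql("CREATE STAGE IF NOT EXISTS YOUR_DB.YOUR_SCHEMA.YOUR_STAGE").collect()
session.file.put(
    "local/path/YOUR_SEMANTIC.yaml",
    "@YOUR_DB.YOUR_SCHEMA.YOUR_STAGE",
    auto_compress=False,  # keep the raw .yaml so Cortex Analyst can read it
    overwrite=True,
)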

To keep things reusable, we can wrap the whole flow in a Python class for easier function calls later.

class CortexAnalystApp:
    """
    Cortex Analyst application with TruLens instrumentation.
    Uses REST API for SQL generation and includes observability.
    """

    def __init__(self, session, semantic_model_path, pat_token):
        """
        Initialize the Cortex Analyst app.

        Args:
            session: Snowpark session for SQL execution
            semantic_model_path: Path to semantic model in Snowflake stage
            pat_token: Personal Access Token for authentication
        """
        if not pat_token:
            raise ValueError("❌ PAT token is required!")

        self.session = session
        self.semantic_model_path = semantic_model_path
        self.pat_token = pat_token

        # Get proper Snowflake account URL
        self.host = self._get_snowflake_host()
        self.url = f"https://{self.host}/api/v2/cortex/analyst/message"

        # Load semantic model for context
        self.schema_context = self._load_semantic_model()

        print(f"✅ CortexAnalystApp initialized with TruLens instrumentation")
        print(f"   Host: {self.host}")
        print(f"   URL: {self.url}")
        print(f"   Semantic Model: {self.semantic_model_path}")

    def _get_snowflake_host(self):
        """Get the proper Snowflake account URL using new format."""
        try:
            result = self.session.sql("SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME()").collect()
            account = result[0][0].lower().replace('_', '-')
            host = f"{account}.snowflakecomputing.com"

            return host
        except Exception as e:
            raise ValueError(f"Could not determine Snowflake host: {str(e)}")

    def _load_semantic_model(self):
        """Load semantic model from stage."""
        try:
            stream = self.session.file.get_stream(self.semantic_model_path)
            return stream.read().decode('utf-8')
        except Exception as e:
            print(f"⚠️ Could not load semantic model: {e}")
            return ""

    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return"}
    )
    def retrieve_context(self):
        """Retrieve semantic model as context for evaluation."""
        return [self.schema_context]

    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate_sql(self, question: str) -> str:
        """
        Generate SQL using Cortex Analyst REST API.

        Args:
            question: Natural language question

        Returns:
            Generated SQL string
        """
        payload = {
            "messages": [
                {
                    "role": "user",
                    "content": [{"type": "text", "text": question}]
                }
            ],
            "semantic_model_file": self.semantic_model_path
        }

        headers = {
            "Authorization": f"Bearer {self.pat_token}",
            "X-Snowflake-Authorization-Token-Type": "PROGRAMMATIC_ACCESS_TOKEN",
            "Content-Type": "application/json"
        }

        try:
            response = requests.post(self.url, json=payload, headers=headers, timeout=60)

            if response.status_code != 200:
                error_msg = f"HTTP {response.status_code}: {response.text}"
                print(f"❌ API Error: {error_msg}")
                return f"Error: {error_msg}"

            resp_json = response.json()

            # Extract SQL from response
            if "message" in resp_json and "content" in resp_json["message"]:
                for item in resp_json["message"]["content"]:
                    if item.get("type") == "sql":
                        return item.get("statement", "")

            print(f"⚠️ Unexpected response format: {resp_json}")
            return "No SQL generated"

        except Exception as e:
            error_msg = str(e)
            print(f"❌ Error calling Cortex Analyst: {error_msg}")
            return f"Error: {error_msg}"

    def execute_sql(self, sql: str) -> str:
        """
        Execute generated SQL for validation.

        Args:
            sql: SQL query to execute

        Returns:
            String representation of results
        """
        if not sql or "Error" in sql:
            return "No valid SQL"
        try:
            df = self.session.sql(sql).to_pandas()
            return df.head(10).to_string()
        except Exception as e:
            return f"Execution Failed: {str(e)}"

    def query(self, question: str) -> str:
        """
        Main entry point - retrieves context, generates SQL, validates.
        This is the method that TruLens will automatically trace.

        Args:
            question: Natural language question

        Returns:
            Generated SQL string
        """
        # Retrieve context (semantic model)
        context = self.retrieve_context()

        # Generate SQL
        sql = self.generate_sql(question)

        # Execute for validation (optional, helps with evaluation)
        result = self.execute_sql(sql)

        return sql

# Initialize the app
analyst_app = CortexAnalystApp(session, SEMANTIC_MODEL_PATH, PAT_TOKEN)

You can run a sample question to test this class.

# Test query
test_question = "What was the total revenue in 1995?" # change this according to your question
print(f"📝 Question: {test_question}")

result_sql = analyst_app.query(test_question)
print(f"\n✅ Generated SQL:\n{'-'*60}")
print(result_sql)

Once SQL can be generated from the sample question, we can continue to register this class with TruLens.

Register the App to TruLens

We need to register the class so the evaluation process can be captured properly by the TruLens library and eventually appear in the Snowflake platform.

# Switch to observability schema
session.use_schema(f"{OBSERVABILITY_DB}.{OBSERVABILITY_SCHEMA}") # Choose where to store the evaluation result
print(f"✅ Using: {session.get_current_database()}.{session.get_current_schema()}")

# Create TruLens connector for Snowflake
tru_snowflake_connector = SnowflakeConnector(snowpark_session=session)

# Register the app with TruLens
tru_analyst = TruApp(
    analyst_app,
    app_name="cortex_analyst_sql_gen",
    app_version="v1_trulens_eval",
    connector=tru_snowflake_connector,
    main_method=analyst_app.query  # Specify the main entry point method
)

print("✅ App registered with TruLens for observability")

Configure Evaluation Run

We need a table with at least two columns containing the questions and the expected SQL statements.
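If the table does not exist yet, here is a minimal sketch to create it; the DDL simply mirrors the two columns referenced in the dataset_spec below:

# Minimal sketch: create the evaluation dataset table with the two columns
# the run configuration expects.
session.sql(f"""
    CREATE TABLE IF NOT EXISTS {EVAL_DATASET_TABLE} (
        QUESTION      VARCHAR,
        EXPECTED_SQL  VARCHAR
    )
""").collect()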

# Configure the evaluation run
run_config = RunConfig(
    run_name="analyst_eval_run_1",
    dataset_name=EVAL_DATASET_TABLE,
    description="Questions about the TPCH-SF1",
    label="tpch_analyst_eval",
    source_type="TABLE",
    dataset_spec={
        "input": "QUESTION",            # Business questions
        "ground_truth_output": "EXPECTED_SQL",  # Valid SQL
    },
)

# Add the run to the app
run = tru_analyst.add_run(run_config=run_config)
print(f"✅ Run configured: {run_config.run_name}")
print(f"   Dataset: {EVAL_DATASET_TABLE}")

The configuration above tells TruLens where to find the ground truth so it can later run the evaluation against it. Here are some samples from the evaluation dataset.
Question-1: What is the total revenue generated from all orders?

Expected SQL:

WITH __order_lineitems AS (
  SELECT
    l_shipdate AS lineitem_ship_date,
    l_extendedprice * (1 - l_discount) AS net_revenue
  FROM
    snowflake_sample_data.tpch_sf1.lineitem
)
SELECT
  MIN(ol.lineitem_ship_date) AS start_date,
  MAX(ol.lineitem_ship_date) AS end_date,
  SUM(ol.net_revenue) AS total_revenue
FROM
  __order_lineitems AS ol

Question-2: What is the monthly revenue trend?

Expected SQL:

WITH __order_lineitems AS (
  SELECT
    l_shipdate AS lineitem_ship_date,
    l_extendedprice * (1 - l_discount) AS net_revenue
  FROM
    snowflake_sample_data.tpch_sf1.lineitem
)
SELECT
  DATE_TRUNC('MONTH', lineitem_ship_date) AS month,
  MIN(lineitem_ship_date) AS start_date,
  MAX(lineitem_ship_date) AS end_date,
  SUM(net_revenue) AS monthly_net_revenue
FROM
  __order_lineitems
GROUP BY
  DATE_TRUNC('MONTH', lineitem_ship_date)
ORDER BY
  month DESC NULLS LAST

Question-3: What is the average part retail price by manufacturer?

Expected SQL:

WITH __parts AS (
  SELECT
    p_mfgr AS part_manufacturer_name,
    p_retailprice AS part_retail_price
  FROM
    snowflake_sample_data.tpch_sf1.part
)
SELECT
  p.part_manufacturer_name,
  AVG(p.part_retail_price) AS avg_retail_price
FROM
  __parts AS p
GROUP BY
  p.part_manufacturer_name
ORDER BY
  avg_retail_price DESC NULLS LAST
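As an alternative to inserting rows by hand, here is a hedged sketch that loads the first sample pair above into the evaluation table using Snowpark’s write_pandas; the database, schema, and table names are the placeholders used throughout this article:

# Sketch: load a sample question/expected-SQL pair into the evaluation table.
expected_sql_1 = """
WITH __order_lineitems AS (
  SELECT
    l_shipdate AS lineitem_ship_date,
    l_extendedprice * (1 - l_discount) AS net_revenue
  FROM
    snowflake_sample_data.tpch_sf1.lineitem
)
SELECT
  MIN(ol.lineitem_ship_date) AS start_date,
  MAX(ol.lineitem_ship_date) AS end_date,
  SUM(ol.net_revenue) AS total_revenue
FROM
  __order_lineitems AS ol
"""

eval_df = pd.DataFrame({
    "QUESTION": ["What is the total revenue generated from all orders?"],
    "EXPECTED_SQL": [expected_sql_1],
})
session.write_pandas(
    eval_df,
    "ANALYST_EVAL_DATASET",
    database="YOUR_DB",
    schema="YOUR_SCHEMA",
    auto_create_table=True,
    overwrite=True,
)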

Then we can start the evaluation:

# Start the evaluation run
print("🚀 Starting evaluation run...")

run.start()

print("✅ Evaluation run completed!")

The command above runs the evaluation process in Snowflake using the TruLens library. If you run into an error because the run_name already exists, you can pick a new run_name or use this command to remove the previous run:

run = tru_analyst.get_run(run_name="analyst_eval_run_1")
run.delete()

Compute Evaluation Metrics

Once the evaluation run is complete, we still need to compute evaluation metrics to quantify the text-to-SQL output. We can use several metrics to evaluate the quality of the agent.

# Compute evaluation metrics
print("📊 Computing evaluation metrics...")
print("   This may take a few minutes depending on dataset size\n")

metrics_to_compute = [
    "answer_relevance",    # How relevant is the SQL to the question
    "groundedness",        # Is SQL grounded in the semantic model
    "coherence",           # Is the SQL logically coherent
    "correctness"          # Is SQL correct compared to ground truth
]

print(f"   Metrics: {', '.join(metrics_to_compute)}\n")

try:
    run.compute_metrics(metrics_to_compute)
    print("✅ Metrics computed successfully!")
    print("   Results are stored in your observability database")
except Exception as e:
    print(f"⚠️ Error computing metrics: {e}")
    print("   Some metrics may have completed")

Depending on the evaluation dataset, this process can run for a while. Once the computation is complete, you can see the results in Snowsight under AI & ML → Cortex AI → Evaluations by selecting your app_name.

(Screenshot: Evaluation Result)

If you click on the evaluation run_name, it will show the metric results for each question, as defined in the metrics_to_compute parameter.

(Screenshot: Evaluation Result Details)

How to Interpret the Results

Here are some guidelines on what each metric measures in the evaluation process:

  • Low Correctness Score: Likely an issue with the semantic model (missing definitions or incorrect mappings) or a flaw in the evaluation dataset's Expected SQL. This requires an update to the semantic model or ground truth.
  • Low Groundedness Score: Indicates the model generated SQL that tries to query entities outside of the defined semantic model. This is a critical governance failure that requires immediate attention and model tuning.

These metrics are continuously logged back into Snowflake, providing a quantifiable basis for regression testing, so you can ensure future model updates do not degrade performance on historical questions.

Sample Results

Here is a sample result difference for the question: “How many parts are supplied by suppliers from different regions than where they're ordered, grouped by the region pair?”

Generated SQL:
...
WHERE
  sr.region_name <> cr.region_name
GROUP BY
...

Expected SQL:

...
WHERE
  sr.region_name <> cr.region_name
  AND NOT sr.region_name IS NULL
  AND NOT cr.region_name IS NULL
GROUP BY
...

Both queries produced the same CTEs; the main difference is in the WHERE clause, where the ground truth had additional NULL filters. TruLens caught this and scored the correctness metric at 0.67.

Conclusions

In this article, we have explored how the TruLens library can be integrated into your text-to-SQL evaluation process. Although this example uses Cortex Analyst, you can also apply the same approach to different API endpoints by modifying the generate_sql function, as sketched below. You can run this evaluation as part of your development process and enrich your evaluation dataset by adding more valid business questions and queries.
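A hedged sketch of that swap, where the endpoint URL and the response shape are hypothetical placeholders:

class CustomSQLAgentApp(CortexAnalystApp):
    """Same evaluation harness, different text-to-SQL backend."""

    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate_sql(self, question: str) -> str:
        # Hypothetical endpoint and response shape; replace with your agent's API.
        response = requests.post(
            "https://your-agent.example.com/generate-sql",
            json={"question": question},
            timeout=60,
        )
        response.raise_for_status()
        return response.json().get("sql", "")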

I hope this gives you some inspiration on how to efficiently evaluate your text-to-SQL AI tools and prevent yet another bottleneck from getting in the way of reaping the benefits of AI at the enterprise level.


Huge thanks to Vinut and Sheena for their guidance and review.
