Sagara

PDFs with Graphs? Just Ask the Agent: Cross-Analyzing Unstructured and Structured Data on Snowflake Cortex Agent

This is an English translation of the original Japanese article:
https://dev.classmethod.jp/articles/snowflake-multi-modal-analytics-with-cortex-agent/

Note: Since this is a translated article, some images contain Japanese text.

Previously, analyzing unstructured data on Snowflake required Cortex Search, which meant parsing text and loading it into tables — making it difficult to work with PDFs containing graphs and charts. However, now that the AI_COMPLETE function can directly query PDF files on stages, you can pass entire PDFs to an LLM without text extraction or chunk splitting.

https://docs.snowflake.com/en/user-guide/snowflake-cortex/ai-complete-document-intelligence

This means we can wrap AI_COMPLETE in a stored procedure as a PDF custom tool and combine it with Cortex Analyst (Semantic View) to enable natural language analysis across both "unstructured data like PDFs on stages" and "table data" — all within a single Cortex Agent. I decided to put this to the test.

The idea for this article was inspired by the following blog post. The approach of using a stored procedure that reads PDFs via AI_COMPLETE as a custom tool for Cortex Agent was extremely helpful.

https://zenn.dev/truestar/articles/d2431ccd4aa127

⚠️ Important note: The table data (monthly sales details) used in this article is entirely dummy data. While the PDF financial reports are actual published financial data from Classmethod, the monthly breakdown by service line and region is fictional. The dummy data was generated by proportionally distributing values so that the PDF's annual totals match the table's monthly totals.

Background & Challenges

When analyzing unstructured data on Snowflake, the main approaches available until now were:

  • Cortex Search: Extract and chunk text into tables, then search and answer via RAG (Retrieval-Augmented Generation)
  • Document AI: Structured data extraction (table conversion) from PDFs

These work well for text-centric documents, but had the following limitations:

  • PDFs with graphs and charts lose information when only text is extracted
  • A preprocessing pipeline for text parsing and table conversion is required, adding setup overhead

Technical Approach

Using AI_COMPLETE's document intelligence feature (TO_FILE + PROMPT), you can pass PDFs on stages directly to an LLM without text extraction. Since graphs and tables can be referenced as visual elements, information that was previously lost through text extraction can now be analyzed.

By wrapping this in a Python stored procedure and registering it as a custom tool for Cortex Agent, then combining it with Cortex Analyst (via Semantic View's cortex_analyst_text_to_sql tool), we achieve the following architecture:

AGENT_CM_FINANCIAL_ANALYST (Integrated Agent)
├── SP_ASK_CM_FINANCIALS (generic tool)
│     └── AI_COMPLETE + TO_FILE → Directly query PDFs on stage
└── AnalystMonthlySales (cortex_analyst_text_to_sql tool)
      └── Semantic View → SQL aggregation on monthly sales table

Tool selection is automatic by the Agent. Based on the question content, it calls the PDF tool, the Analyst tool, or both, and integrates the results into a unified answer.

Limitations

  • AI_COMPLETE with documents is in Public Preview as of March 30, 2026
    • Stages must use server-side encryption (SNOWFLAKE_SSE)
    • File size limits: Up to 10MB (900 pages) for Gemini 3.1 Pro, up to 4.5MB for Claude models
    • Document count limits per request: Up to 20 for Gemini, up to 5 for Claude

Cost

  • AI_COMPLETE costs are determined by processed token count, not file size. Both text and visual elements are tokenized, and input and output tokens are both billable
  • Cortex Analyst (Semantic View) has a lower unit price when accessed through Cortex Agents (see Consumption Table Table 6(f) and 6(h))
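As a rough, self-contained illustration of the token-based billing model described above (the credit rate below is a made-up placeholder, not Snowflake's actual price — check the Consumption Table for real per-model rates):

```python
def ai_complete_cost(input_tokens: int, output_tokens: int,
                     credits_per_million_tokens: float) -> float:
    """AI_COMPLETE bills on processed tokens (input + output), not file size."""
    return (input_tokens + output_tokens) / 1_000_000 * credits_per_million_tokens

# Hypothetical rate purely for illustration: a large PDF that tokenizes to
# 50K input tokens plus a 2K-token answer, at a placeholder 1.0 credits/M tokens
print(ai_complete_cost(50_000, 2_000, 1.0))  # 0.052
```

The practical takeaway: a visually dense PDF can cost more than a larger but text-sparse one, because visual elements are tokenized too.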

Prerequisites

  • Snowflake: Enterprise edition
  • Feature status: Public Preview as of March 30, 2026 (AI_COMPLETE with documents, Cortex Agents, Semantic View)
  • Required privileges: SYSADMIN role, SNOWFLAKE.CORTEX_USER database role grant
  • Cross-region inference: Setting CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION' may be required to use gemini-3.1-pro (set with ACCOUNTADMIN)
  • Model used: gemini-3.1-pro (chosen for PDF reading as it supports up to 10MB)

Setup

Overall Architecture

Here's the complete picture of Snowflake objects we'll create:

POC_CM.SAGARA_TEST
├── STG_CM_FINANCIAL_REPORTS    -- PDF stage
├── V_CM_FINANCIAL_METADATA     -- PDF metadata view
├── T_MONTHLY_SALES             -- Dummy sales table
├── SV_MONTHLY_SALES            -- Semantic View (for Cortex Analyst)
├── SP_ASK_CM_FINANCIALS        -- PDF custom tool (stored procedure)
└── AGENT_CM_FINANCIAL_ANALYST  -- Integrated Agent

Target Data

PDFs (Unstructured Data)

We use the following 4 PDFs. In addition to 3 fiscal years of financial reports, we include a company introduction presentation (slide format) with numerous graphs and charts to also verify AI_COMPLETE's visual reading capabilities.

File Name | Type | Period | Coverage
financial-results_202306.pdf | Financial Report | 19th Period | 2022/7/1 - 2023/6/30
financial-results_202406.pdf | Financial Report | 20th Period | 2023/7/1 - 2024/6/30
financial-results_202506.pdf | Financial Report | 21st Period | 2024/7/1 - 2025/6/30
会社紹介資料_20251031.pdf | Company Introduction | - | As of October 2025

The company introduction presentation contains visual information that would be lost through text extraction: bar graphs of revenue trends, bar graphs of employee count trends, pie charts of team composition, office location maps, etc. The key question is whether AI_COMPLETE can directly read this type of material, which was difficult to handle with the conventional Cortex Search (text parsing approach).

[Screenshots: 2026-03-30_16h45_55, 2026-03-30_17h37_17, 2026-03-30_16h48_04, 2026-03-30_16h49_34]

Table (Structured Data) — Dummy Data

We generate dummy data that aligns with the PDF financial figures. The key demo point is verifying that "the annual revenue from the PDF side matches the monthly sales aggregation from the table side."

Column | Type | Description
SALE_ID | NUMBER | Surrogate key (sequential)
FISCAL_YEAR | NUMBER | Fiscal year (19, 20, 21)
YEAR_MONTH | DATE | First day of month (36 months)
SERVICE_LINE | VARCHAR | Service line (6 categories)
CUSTOMER_SEGMENT | VARCHAR | Customer segment (5 categories)
REGION | VARCHAR | Region (6 categories)
REVENUE | NUMBER | Revenue (thousands of yen)
COGS | NUMBER | Cost of goods sold (thousands of yen)
GROSS_PROFIT | NUMBER | Gross profit (thousands of yen)
DEAL_COUNT | NUMBER | Deal count

Alignment rules with PDFs:

  • Total REVENUE per period = PDF's revenue (19th: ¥59,005,311K, 20th: ¥77,190,340K, 21st: ¥95,056,018K)
  • Total COGS per period = PDF's cost of sales (19th: ¥52,634,720K, 20th: ¥69,320,565K, 21st: ¥85,782,527K)
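A minimal sketch of the proportional-distribution idea (the weights and rounding repair here are my own illustration; the article's actual generator is not shown): spread an annual total across the months by weight, then push the rounding residual into the last month so the sum reconciles exactly with the PDF figure.

```python
def distribute_annual_total(annual_total: int, weights: list[float]) -> list[int]:
    """Split an annual total (thousands of yen) across periods in
    proportion to `weights`, forcing the sum to match exactly."""
    total_weight = sum(weights)
    monthly = [round(annual_total * w / total_weight) for w in weights]
    monthly[-1] += annual_total - sum(monthly)  # absorb the rounding residual
    return monthly

# 21st-period revenue from the PDF: ¥95,056,018K spread over 12 equal months
monthly = distribute_annual_total(95_056_018, [1] * 12)
assert sum(monthly) == 95_056_018  # table monthly totals reconcile with the PDF
```

The same trick applies per service line and region: as long as each split absorbs its residual, every roll-up matches the published annual figure.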

Implementation

1. Stage Creation & PDF Upload

First, create an internal stage for PDF storage. Note that the encryption type must be SNOWFLAKE_SSE or AI_COMPLETE won't be able to read the files.

CREATE OR REPLACE STAGE STG_CM_FINANCIAL_REPORTS
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
  COMMENT = 'Stage for Classmethod financial report PDFs';

After creating the stage, upload the 4 PDFs via Snowsight: Data > Databases > POC_CM > SAGARA_TEST > Stages > STG_CM_FINANCIAL_REPORTS > + Files button.

After uploading, it should look like the following:

[Screenshot: 2026-03-30_16h50_42]

After uploading, refresh the directory table and verify the files:

ALTER STAGE STG_CM_FINANCIAL_REPORTS REFRESH;

SELECT * FROM DIRECTORY(@STG_CM_FINANCIAL_REPORTS);

You should see 4 files listed.

[Screenshot: 2026-03-30_16h51_58]

Next, create a metadata view that extracts document type and fiscal year from file names. Since financial reports and company introduction materials have different naming conventions, we use CASE statements to handle both patterns.

CREATE OR REPLACE VIEW V_CM_FINANCIAL_METADATA AS
SELECT
    RELATIVE_PATH,
    FILE_URL,
    SIZE,
    LAST_MODIFIED,
    -- Determine document type
    CASE
        WHEN RELATIVE_PATH LIKE 'financial-results%' THEN '決算報告書'
        WHEN RELATIVE_PATH LIKE '会社紹介資料%' THEN '会社紹介資料'
        ELSE 'その他'
    END AS DOC_TYPE,
    -- Extract fiscal year only for financial reports
    CASE
        WHEN RELATIVE_PATH LIKE 'financial-results%' THEN
            CASE SPLIT_PART(REPLACE(RELATIVE_PATH, '.pdf', ''), '_', 2)
                WHEN '202306' THEN 19
                WHEN '202406' THEN 20
                WHEN '202506' THEN 21
            END
        ELSE NULL
    END AS FISCAL_YEAR,
    CASE
        WHEN RELATIVE_PATH LIKE 'financial-results%' THEN
            CASE SPLIT_PART(REPLACE(RELATIVE_PATH, '.pdf', ''), '_', 2)
                WHEN '202306' THEN '2022/7/1 - 2023/6/30'
                WHEN '202406' THEN '2023/7/1 - 2024/6/30'
                WHEN '202506' THEN '2024/7/1 - 2025/6/30'
            END
        WHEN RELATIVE_PATH LIKE '会社紹介資料%' THEN '2025年10月時点'
        ELSE NULL
    END AS FISCAL_PERIOD
FROM DIRECTORY(@STG_CM_FINANCIAL_REPORTS)
WHERE RELATIVE_PATH LIKE '%.pdf';

Querying the created view shows the following:

[Screenshot: 2026-03-30_16h52_47]

2. Standalone AI_COMPLETE Verification

Before building the PDF custom tool, let's verify that AI_COMPLETE can directly read PDFs.

SELECT AI_COMPLETE(
  MODEL => 'gemini-3.1-pro',
  PROMPT => PROMPT(
    'この決算報告書の売上高と売上原価を教えてください: {0}',
    TO_FILE('@POC_CM.SAGARA_TEST.STG_CM_FINANCIAL_REPORTS', 'financial-results_202506.pdf')
  )
);

The PDF specified with TO_FILE is passed to the {0} placeholder in the PROMPT function. If the correct revenue and cost of sales figures are returned, we're good to go.

[Screenshot: 2026-03-30_16h54_00]

3. Creating the Stored Procedure for the PDF Custom Tool

Since AI_COMPLETE cannot be directly registered as a Cortex Agent tool, we wrap it in a Python stored procedure. It dynamically retrieves the PDF file list from the directory table, filters by document type (DOC_TYPE) and fiscal year, then executes AI_COMPLETE against each PDF.

CREATE OR REPLACE PROCEDURE POC_CM.SAGARA_TEST.SP_ASK_CM_FINANCIALS(
    QUESTION VARCHAR,
    FILTER_DOC_TYPE VARCHAR DEFAULT NULL,
    FILTER_FISCAL_YEAR VARCHAR DEFAULT NULL
)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
EXECUTE AS OWNER
AS
$$
import json

def _normalize(val):
    """Treat string 'NULL'/'null'/empty string as None (ignore)"""
    if val is None:
        return None
    if val.strip().upper() in ('NULL', ''):
        return None
    return val

def main(session, question, filter_doc_type=None, filter_fiscal_year=None):
    filter_doc_type = _normalize(filter_doc_type)
    filter_fiscal_year = _normalize(filter_fiscal_year)

    query = """
        SELECT RELATIVE_PATH, DOC_TYPE, FISCAL_YEAR, FISCAL_PERIOD
        FROM POC_CM.SAGARA_TEST.V_CM_FINANCIAL_METADATA
        WHERE 1=1
    """
    if filter_doc_type:
        safe_doc_type = filter_doc_type.replace("'", "''")
        query += f" AND DOC_TYPE = '{safe_doc_type}'"
    if filter_fiscal_year:
        # Cast to int so the numeric filter cannot be used for SQL injection
        query += f" AND FISCAL_YEAR = {int(filter_fiscal_year)}"

    files_df = session.sql(query).collect()

    if not files_df:
        return json.dumps(
            {"error": "No PDF files matched the specified conditions"},
            ensure_ascii=False
        )

    stage_path = '@POC_CM.SAGARA_TEST.STG_CM_FINANCIAL_REPORTS'
    results = []

    for row in files_df:
        pdf_file = row['RELATIVE_PATH']
        safe_question = question.replace("'", "''")
        safe_pdf_file = pdf_file.replace("'", "''")
        ai_query = f"""
            SELECT AI_COMPLETE(
                MODEL => 'gemini-3.1-pro',
                PROMPT => PROMPT(
                    '{safe_question}: {{0}}',
                    TO_FILE('{stage_path}', '{safe_pdf_file}')
                )
            ) AS answer
        """
        try:
            result = session.sql(ai_query).collect()
            answer = result[0]['ANSWER'] if result else 'Error: No result'
        except Exception as e:
            answer = f'Error: {str(e)}'

        results.append({
            'file': pdf_file,
            'doc_type': row['DOC_TYPE'],
            'fiscal_year': str(row['FISCAL_YEAR']) if row['FISCAL_YEAR'] else None,
            'fiscal_period': row['FISCAL_PERIOD'],
            'answer': answer
        })

    return json.dumps(results, ensure_ascii=False)
$$;

The procedure includes a _normalize() function. When Cortex Agent passes parameters to the stored procedure, it sometimes sends the string "NULL" instead of SQL NULL. This function converts those to Python None.

For verification, let's call it with several patterns:

-- Across all PDFs
CALL POC_CM.SAGARA_TEST.SP_ASK_CM_FINANCIALS('Please summarize the key points of this document');

-- Financial reports only, 21st period
CALL POC_CM.SAGARA_TEST.SP_ASK_CM_FINANCIALS(
    'What are the revenue and operating profit?', '決算報告書', '21');

-- Company introduction only (graph reading)
CALL POC_CM.SAGARA_TEST.SP_ASK_CM_FINANCIALS(
    'Read the revenue trend from the performance graph', '会社紹介資料');

The third query in particular tests AI_COMPLETE's visual reading capability — reading numerical values from bar graphs.

If JSON-formatted answers from each PDF are returned as shown below, everything is working correctly.

[Screenshot: 2026-03-30_16h56_46]

4. Dummy Data Generation & Table Creation

Generate table data for Cortex Analyst.

We create a T_MONTHLY_SALES table and INSERT approximately 6,000 rows of dummy data that aligns with the PDF financial figures.

Column | Description
FISCAL_YEAR | Fiscal year (19, 20, 21)
YEAR_MONTH | First day of month (36 months)
SERVICE_LINE | Service line (AWS Resale, Cloud Migration Support, etc. — 6 categories)
CUSTOMER_SEGMENT | Customer segment (Enterprise, Mid-Market, etc. — 5 categories)
REGION | Region (Kanto, Kansai, etc. — 6 categories)
REVENUE / COGS / GROSS_PROFIT | Revenue / Cost of sales / Gross profit (thousands of yen)
DEAL_COUNT | Deal count

[Screenshot: 2026-03-30_16h58_53]

5. Semantic View Creation

Create a Semantic View for Cortex Analyst. Setting Japanese SYNONYMS improves the accuracy of natural language queries.

CREATE OR REPLACE SEMANTIC VIEW POC_CM.SAGARA_TEST.SV_MONTHLY_SALES

  TABLES (
    T_MONTHLY_SALES AS POC_CM.SAGARA_TEST.T_MONTHLY_SALES
      PRIMARY KEY (SALE_ID)
      WITH SYNONYMS = ('月次売上', '売上データ', '売上明細', '売上実績')
      COMMENT = 'Monthly sales detail data for Classmethod Inc.'
  )

  DIMENSIONS (
    T_MONTHLY_SALES.FISCAL_YEAR AS T_MONTHLY_SALES.FISCAL_YEAR
      WITH SYNONYMS = ('期', '年度', '会計期間')
      COMMENT = 'Fiscal year (19=19th period, 20=20th period, 21=21st period)',

    T_MONTHLY_SALES.YEAR_MONTH AS T_MONTHLY_SALES.YEAR_MONTH
      WITH SYNONYMS = ('月', '年月', '対象月')
      COMMENT = 'Target month (first day of month)',

    T_MONTHLY_SALES.SERVICE_LINE AS T_MONTHLY_SALES.SERVICE_LINE
      WITH SYNONYMS = ('サービス', '事業', '事業区分', 'サービス区分')
      COMMENT = 'Service line',

    T_MONTHLY_SALES.CUSTOMER_SEGMENT AS T_MONTHLY_SALES.CUSTOMER_SEGMENT
      WITH SYNONYMS = ('顧客区分', 'セグメント', '顧客タイプ')
      COMMENT = 'Customer segment',

    T_MONTHLY_SALES.REGION AS T_MONTHLY_SALES.REGION
      WITH SYNONYMS = ('地域', 'エリア', '拠点')
      COMMENT = 'Region'
  )

  METRICS (
    T_MONTHLY_SALES.TOTAL_REVENUE AS SUM(T_MONTHLY_SALES.REVENUE)
      WITH SYNONYMS = ('売上', '売上額', '売上金額', '収益', '売上高')
      COMMENT = 'Revenue (thousands of yen)',

    T_MONTHLY_SALES.TOTAL_COGS AS SUM(T_MONTHLY_SALES.COGS)
      WITH SYNONYMS = ('原価', '売上原価', 'コスト')
      COMMENT = 'Cost of goods sold (thousands of yen)',

    T_MONTHLY_SALES.TOTAL_GROSS_PROFIT AS SUM(T_MONTHLY_SALES.GROSS_PROFIT)
      WITH SYNONYMS = ('粗利', '粗利益', '売上総利益', 'GP')
      COMMENT = 'Gross profit (thousands of yen)',

    T_MONTHLY_SALES.TOTAL_DEAL_COUNT AS SUM(T_MONTHLY_SALES.DEAL_COUNT)
      WITH SYNONYMS = ('案件数', '取引件数', 'ディール数')
      COMMENT = 'Deal count'
  )

  COMMENT = 'Classmethod monthly sales dummy data (for Cortex Analyst)';

6. Integrated Agent Creation

Next, create the Agent that integrates the PDF custom tool and the Cortex Analyst tool. The key point is to clearly specify how each tool should be used in the instructions.

CREATE OR REPLACE AGENT POC_CM.SAGARA_TEST.AGENT_CM_FINANCIAL_ANALYST
  COMMENT = 'Classmethod financial report integrated analysis agent (PDF × Table)'
  FROM SPECIFICATION
$$
models:
  orchestration: auto

orchestration:
  budget:
    seconds: 120
    tokens: 32000

instructions:
  system: >
    あなたはクラスメソッド株式会社の企業情報・財務分析アシスタントです。
    以下の2つのデータソースを使って質問に回答できます。

    【データソース1: PDF資料】
    SP_ASK_CM_FINANCIALSツールで参照できます。
    (A) 決算報告書(3期分): FILTER_DOC_TYPE=決算報告書
    (B) 会社紹介資料: FILTER_DOC_TYPE=会社紹介資料

    【データソース2: 月次売上テーブル】
    AnalystMonthlySalesツールで参照できます。

    【ツール選択ルール】
    1. 決算書の定性的な内容 → PDFツール(FILTER_DOC_TYPE=決算報告書)
    2. 会社概要、拠点、従業員数、経営理念、業績推移グラフ等
        → PDFツール(FILTER_DOC_TYPE=会社紹介資料)
    3. 数値の集計・比較・推移分析 → Analystツール
    4. PDF×テーブルの横断分析 → 両方呼び出して比較

  response: >
    日本語で回答してください。
    数値を含む回答ではテーブル形式で見やすく整理してください。

  sample_questions:
    - question: "第21期の売上高は?"
    - question: "サービスライン別の売上推移を教えて"
    - question: "第20期の監査報告書の内容は?"
    - question: "PDFの年間売上とテーブルの月次合計を突合して"
    - question: "会社の拠点一覧を教えて"
    - question: "業績推移のグラフから売上高の成長率を教えて"
    - question: "従業員数の推移と売上高の推移を比較して"

tools:
  - tool_spec:
      type: "generic"
      name: "SP_ASK_CM_FINANCIALS"
      description: >
        クラスメソッドのPDF資料(決算報告書・会社紹介資料)に対して質問を行い、回答を取得するツール。
        FILTER_DOC_TYPE(決算報告書, 会社紹介資料)とFILTER_FISCAL_YEAR(19, 20, 21)でフィルタ可能。
      input_schema:
        type: "object"
        properties:
          QUESTION:
            type: "string"
            description: "PDF資料に対して問い合わせる質問文"
          FILTER_DOC_TYPE:
            type: "string"
            description: "資料種別フィルタ(決算報告書, 会社紹介資料)。指定しない場合はNULL。"
          FILTER_FISCAL_YEAR:
            type: "string"
            description: "会計年度フィルタ(19, 20, 21)。指定しない場合はNULL。"
        required:
          - "QUESTION"

  - tool_spec:
      type: "cortex_analyst_text_to_sql"
      name: "AnalystMonthlySales"
      description: >
        月次売上明細テーブルに対してSQL集計を行い、数値データを分析するツール。

tool_resources:
  SP_ASK_CM_FINANCIALS:
    type: "procedure"
    identifier: "POC_CM.SAGARA_TEST.SP_ASK_CM_FINANCIALS"
    execution_environment:
      type: "warehouse"
      warehouse: "COMPUTE_WH"
      query_timeout: 120
  AnalystMonthlySales:
    semantic_view: "POC_CM.SAGARA_TEST.SV_MONTHLY_SALES"
    execution_environment:
      type: "warehouse"
      warehouse: "COMPUTE_WH"
      query_timeout: 60
$$;

7. Testing via Snowflake Intelligence

Let's interactively test the created Agent from the Snowflake Intelligence UI.

⚠️ Once again, the table data (monthly sales details) used in this verification is entirely dummy data. While the PDF financial reports are actual published financial data from Classmethod, the monthly breakdown by service line and region is fictional. The dummy data was generated by proportionally distributing values so that the PDF's annual totals match the table's monthly totals. Please keep this in mind.

We'll run the following test questions to verify that tool selection works correctly.

Test 1: "What is the revenue for the 21st period?"

→ Answered using the Analyst tool or PDF tool. Either way, the same value (¥95,056,018K) should be returned.

[Screenshot: 2026-03-30_17h33_00]

Test 2: "Show me the revenue trend by service line"

→ The Analyst tool is selected, and SQL aggregation is executed against the table data.

[Screenshots: 2026-03-30_17h34_14, 2026-03-30_17h34_52]

Test 3: "What does the audit report for the 20th period say?"

→ The PDF tool is selected, and the content of the audit report within the financial report PDF is returned.

[Screenshot: 2026-03-30_17h36_11]

Test 4: "Reconcile the annual revenue from the PDF with the monthly totals from the table"

→ Both tools are called, and the response confirms that the annual revenue from the PDF side matches the monthly sales total from the table side. This is the highlight of this verification.

[Screenshots: 2026-03-30_17h38_10, 2026-03-30_17h38_40]

Test 5: "List the company's office locations"

→ The PDF tool (FILTER_DOC_TYPE=会社紹介資料) is selected, and a list of 8 domestic and 5 overseas offices is returned from the company introduction's office location page (slide with map).

[Screenshot: 2026-03-30_17h40_58]

Test 6: "What is the revenue growth rate based on the performance trend graph?"

→ The PDF tool (FILTER_DOC_TYPE=会社紹介資料) is selected, and the response reads numerical values from the bar graph and calculates growth rates. This demonstrates that graph reading — which was impossible with text extraction — works properly.

[Screenshot: 2026-03-30_17h42_50]

Test 7: "Tell me about Classmethod's evaluation system, career paths, and salary ranges"

→ The PDF tool (FILTER_DOC_TYPE=会社紹介資料) is selected, and it properly reads the grade list and salary correlation expressed in complex graphs.

[Screenshots: 2026-03-30_17h52_42, 2026-03-30_17h53_00]

Test 8: "Compare the employee count trend with the revenue trend"

→ Both tools are called — the PDF tool (employee count trend graph from the company introduction) and the Analyst tool (revenue aggregation from the table) — and the response provides a cross-cutting comparative analysis of employee growth and revenue growth.

[Screenshots: 2026-03-30_17h44_39, 2026-03-30_17h45_38]

Conclusion

By using AI_COMPLETE's document intelligence feature, PDFs can be passed directly to an LLM without building a RAG pipeline, making it extremely simple to incorporate as a Cortex Agent custom tool.

Here's what this verification confirmed:

  • Cross-analysis of PDFs and tables is now possible within a single Agent. Based on the question content, the Agent automatically selects the appropriate PDF tool, Analyst tool, or both
  • Reconciliation between PDF annual totals and table monthly aggregations is achievable by calling both tools and comparing results
  • Even PDFs containing graphs and charts can be analyzed directly without text extraction or chunk splitting, making the setup dramatically simpler compared to Cortex Search

On the other hand, it's important to note that AI_COMPLETE with documents is in Public Preview (as of March 2026), there are file size limitations, and costs are incurred based on processed token count.

This approach feels like an excellent fit for use cases where you want to perform cross-cutting analysis of unstructured and structured data on Snowflake. Give it a try!
