TL;DR
Built an AI skill that extracts structured data from 51 scanned Thai election PDFs (587 policies) in under 2 hours, with 100% valid JSON output. It uses Gemini 3 Pro's native OCR plus structured output with Pydantic schemas to guarantee data quality.
Package workflows as AI "skills" that agents execute via natural language instead of manual scripts.
Results: most PDFs processed in 1-2 minutes, zero manual intervention, and the skill works with agents that support skills, such as Gemini CLI, Claude Code, and Cursor.
The Problem: A Facebook Post and 51 Locked PDFs

The post that started it all: "Election Commission released 51 party policies as scanned PDFs that can't be used. When asked for CSV: 'If you want it, write it yourself'"
I was scrolling through Facebook when I saw this post from the Thai developer community. Thailand's Election Commission (กกต.) had released policy data for the 2026 election—all 587 policies from 51 political parties—but in the worst possible format: scanned image PDFs.
Not text PDFs you could copy from. Scanned photos of printed documents. Just images of tables filled with Thai text, numbers, and complex formatting.
When developers asked for a usable format like CSV or JSON, the response was essentially: "Do it yourself."
The challenge was enormous:
- 📄 51 scanned PDF files (no text layer)
- 🔤 Thai language requiring OCR
- 📊 Complex table structures with inconsistent formatting
- 💰 Thai numerical units needing normalization (ล้าน, พันล้าน)
- 🏷️ Unstructured categorization
- ⏱️ Tight timeline before election discussion season
I had three options:
- Manual typing: 3+ weeks of soul-crushing work with inevitable errors
- Traditional OCR: Days of cleanup fixing Thai character errors, then manual structuring
- Build an AI solution: Handle OCR + extraction + validation in one automated workflow
I chose option 3—and built something reusable for anyone facing similar challenges.
The result? ✅ 51 parties extracted, 587 policies structured, 100% valid JSON—in under 2 hours of automated processing.
Table of Contents
- The Problem
- Skills vs Scripts
- Technical Deep Dive
- Real Results
- Real-World Impact
- Installation
- Challenges & Improvements
- What I Learned
- Try It Yourself
What Makes This Different: Skills, Not Scripts
Most people would write a Python script and call it done. But I took a different approach: I packaged this as a skill for AI agents (Gemini CLI and Claude Code).
What's a Skill?
Think of a skill as a recipe that teaches an AI agent a complete workflow. Instead of:
- Running commands manually
- Remembering script parameters
- Setting up environments each time
You simply tell the AI in natural language:
Extract policies from all Thai election PDFs
The agent handles everything—setup, execution, error recovery, validation.
Why this matters: We're shifting from "writing scripts for humans to run" to "building tools for AI agents to use." This unlocks new levels of automation.
The Two Parts of a Skill
1. Instructions (SKILL.md) - The "user manual" for the AI
```markdown
## Agent Instructions

BEFORE running scripts, execute these commands:
1. Navigate to skill directory
2. Create virtual environment
3. Activate environment
4. Install dependencies
```
2. Tools (scripts/) - The actual code that does the work
- `extract_policy.py` - Single PDF extraction with OCR + validation
- `batch_extract_all.sh` - Batch processing with retry logic
- `json_to_csv.py` - Format conversion utilities
- `send_to_datadog.py` - Monitoring and observability integration
The Secret Sauce: Native Thai OCR + Structured Output
The real breakthrough came from combining two powerful capabilities:
1. Gemini's Native Vision for Thai OCR
The challenge: These PDFs are scanned images—no selectable text. Traditional OCR tools struggle with Thai characters (๐-๙) and complex table layouts.
The solution: Gemini 3 Pro's native vision capabilities handle Thai OCR seamlessly. No preprocessing, no separate OCR pipeline, no error cleanup. It just works.
2. Structured Output with Pydantic
Instead of hoping the LLM returns valid JSON, you define a Pydantic schema that guarantees the output format.
Before: Raw Scanned Image
```
๑. ระบบรางความเร็วสูง | ๓.๕ แสนล้าน | งบประมาณ, PPP, พันธบัตร
```
(Thai numerals, Thai units, inconsistent formatting)
After: Clean, Validated JSON
```json
{
  "policy_seq": 1,
  "policy_category": "โครงสร้างพื้นฐาน",
  "policy_name": "ระบบรางความเร็วสูง",
  "budget_baht": 350000000000,
  "funding_source": "๑) งบประมาณแผ่นดิน\n๒) PPP\n๓) พันธบัตร",
  "benefits": "๑) เพิ่มการเชื่อมต่อ\n๒) กระตุ้นเศรษฐกิจ"
}
```
(Converted numerals, normalized budget, structured data)
The Code: Pydantic + Gemini Structured Output
Here's the core implementation:
```python
import os
from typing import List

from google import genai
from google.genai import types
from pydantic import BaseModel, Field

# Define the exact schema you want - this becomes your data contract
class Policy(BaseModel):
    # Convert Thai numerals (๑,๒,๓) to Arabic (1,2,3) for sequence
    policy_seq: int = Field(
        description="Policy sequence (Thai numerals → Arabic)"
    )
    # AI categorizes into 15 predefined buckets based on content
    policy_category: str = Field(
        description="Category from predefined list"
    )
    # Extract word-by-word for accuracy, preserving Thai formatting
    policy_name: str = Field(
        description="Policy name extracted word-by-word"
    )
    # Normalize: ๓.๕ แสนล้าน → 350000000000 (pure Baht integer)
    budget_baht: int = Field(
        description="Budget in Baht (0 if none)"
    )
    # Preserve Thai numerals in lists (๑) ๒) ๓))
    funding_source: str = Field(
        description="Funding details, preserve Thai formatting"
    )
    cost_effectiveness: str
    benefits: str
    impacts: str
    risks: str

class PoliticalPartyPolicies(BaseModel):
    policies: List[Policy]

# Configure Gemini with your schema - this is the magic
config = types.GenerateContentConfig(
    temperature=0.5,  # Low temp for consistency
    response_mime_type="application/json",
    response_schema=PoliticalPartyPolicies.model_json_schema(),  # 🎯 Guaranteed structure
    thinking_config=types.ThinkingConfig(thinking_level="low")
)

# Process scanned PDF with native OCR + structured extraction
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
response = client.models.generate_content_stream(
    model="gemini-3-pro-preview",
    contents=[
        # Your extraction instructions
        types.Part.from_text(text=extraction_instructions),
        # The scanned PDF - Gemini handles OCR natively
        types.Part.from_bytes(
            data=pdf_bytes,
            mime_type="application/pdf"
        )
    ],
    config=config  # Pass the structured output config
)

# Accumulate the streamed chunks into the full JSON string
result_json = "".join(chunk.text for chunk in response)

# Validate output - raises a ValidationError if the schema doesn't match
policies = PoliticalPartyPolicies.model_validate_json(result_json)
print(f"✅ Validated: {len(policies.policies)} policies extracted")
```
Key Benefits:
- ✅ Native Thai OCR - Handles scanned images without preprocessing
- ✅ No manual validation - Pydantic handles schema validation automatically
- ✅ Type safety - Budget is always an integer, never a string
- ✅ Thai numeral conversion - LLM automatically converts Thai numerical characters (๐-๙) to computer-readable integers/floats
- ✅ Context-aware - Understands Thai units and preserves formatting where needed
- ✅ Clear errors - Know exactly what failed and where
- ✅ Documentation - Schema serves as specification
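To make "clear errors" concrete, here's a minimal standalone sketch (a trimmed-down schema, not the skill's full one) of Pydantic accepting a good record and rejecting a bad one:

```python
from pydantic import BaseModel, ValidationError

class Policy(BaseModel):
    policy_seq: int
    policy_name: str
    budget_baht: int

# Valid record: Thai text passes through untouched, budget stays an int
ok = Policy.model_validate_json(
    '{"policy_seq": 1, "policy_name": "ระบบรางความเร็วสูง", "budget_baht": 350000000000}'
)
print(ok.budget_baht)  # 350000000000

# Invalid record: budget as free text fails loudly, naming the exact field
try:
    Policy.model_validate_json(
        '{"policy_seq": 2, "policy_name": "x", "budget_baht": "๓.๕ แสนล้าน"}'
    )
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('budget_baht',)
```

The error points at `budget_baht` directly, so a failed PDF tells you which field to inspect instead of handing you a vaguely broken blob.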
Usage: As Simple As Asking
Once installed, using the skill is incredibly intuitive:
Extract a single PDF:
Extract policies from "เบอร์ 9 พรรคเพื่อไทย.pdf" using the Thailand election skill
Batch process all PDFs:
Batch extract all Thai election PDFs in the assets folder
Convert to CSV:
Convert party_9_policies.json to CSV format with pipe delimiter
Analyze results:
Show me all policies with budgets over 100 billion Baht
The AI agent handles:
- ✅ Environment setup (Python venv, dependencies)
- ✅ Script execution with correct parameters
- ✅ Error handling and automatic retries
- ✅ Progress monitoring
- ✅ Output validation
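For the CSV step specifically, the heavy lifting is ordinary Python. Here's a minimal sketch of a pipe-delimited export (the `policies_to_csv` name and file paths are mine for illustration, not the repo's `json_to_csv.py`):

```python
import csv
import json

def policies_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten an extracted policy JSON file into pipe-delimited CSV."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    rows = data["policies"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="|")
        writer.writeheader()
        # Newlines inside fields (e.g. "๑) ...\n๒) ...") get quoted by the csv module
        writer.writerows(rows)
```

The pipe delimiter matters because Thai policy text is full of commas; a comma-delimited export would need quoting on nearly every field.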
Real Results: Battle-Tested at Scale
This isn't a toy project. Here are the real metrics from extracting Thailand's 2026 election data:
| Metric | Result |
|---|---|
| Total parties | 51 (100% success) |
| Total policies | 587 extracted |
| Data quality | 100% valid JSON |
| Processing time | Under 2 hours total |
| Typical per PDF | 1-2 minutes (59% of files) |
| Automatic retries | Handled seamlessly |
| Manual intervention | Zero |
Processing Time Distribution
All 51 PDFs processed in under 5 minutes each:
| Time Range | Files | Avg Time | Percentage |
|---|---|---|---|
| 0-1 min | 4 files | 47 sec | 11% |
| 1-2 min | 22 files | 86 sec | 59% |
| 2-3 min | 7 files | 138 sec | 19% |
| 3-4 min | 2 files | 192 sec | 5% |
| 4-5 min | 2 files | 282 sec | 5% |
Most common processing time: 1-2 minutes (59% of files)
"Most PDFs processed in 90 seconds. That's faster than I could even open the file and copy-paste a single table manually."
This efficiency makes the solution practical for real-world use—you can extract data from dozens of documents in the time it takes to get a coffee.
Special Challenges Solved:
- 🔍 Thai OCR from scanned images - 100% success rate, no preprocessing needed
- 📄 Thai numerals - Auto-convert ๐-๙ to 0-9 in sequence fields, preserve in content
- 💰 Budget normalization - ๓.๕ แสนล้าน (3.5 hundred billion) → 350,000,000,000 Baht
- 🏷️ Smart categorization - 587 policies across 15 categories with context awareness
- 🔄 Retry logic - Automatically handles incomplete responses (up to 3 attempts)
- ⏱️ Stream timeout detection - Monitors chunk timing, auto-retries on stalls
- 📊 CSV consolidation - Automatic merging across all parties with pipe delimiter
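In the skill itself, the numeral and budget conversions are done by the model; for spot-checking its output, the same normalization can be written deterministically. A sketch (the unit table and `normalize_budget` helper are illustrative, not repo code):

```python
# Map Thai digits ๐-๙ onto Arabic 0-9
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Thai magnitude words and their multipliers (แสนล้าน = 10^5 × 10^6)
UNITS = {"ล้าน": 10**6, "พันล้าน": 10**9, "หมื่นล้าน": 10**10, "แสนล้าน": 10**11}

def normalize_budget(text: str) -> int:
    """Convert e.g. '๓.๕ แสนล้าน' to 350_000_000_000 Baht."""
    number, _, unit = text.translate(THAI_DIGITS).partition(" ")
    return int(float(number) * UNITS.get(unit, 1))

print(normalize_budget("๓.๕ แสนล้าน"))  # 350000000000
```

A helper like this makes a cheap cross-check: if the model's `budget_baht` disagrees with a rule-based parse of the raw cell, flag that row for review.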
Extracted Data in Action: Wevis Policy Comparison
The extracted data isn't just sitting in JSON files—it's already powering real civic tech applications. Wevis, a Thai civic tech organization, used this structured data to build an interactive policy comparison tool for the 2026 election.
Wevis's interactive policy comparison website using the extracted election data
What Wevis Built:
- 📊 Interactive comparison across all 51 political parties
- 🔍 Select and filter by policy categories
- 💰 Budget analysis and visualization
- 📱 Mobile-friendly interface for voters
Why This Matters:
This demonstrates the real-world impact of making locked government data accessible. What started as 51 scanned PDFs "you have to write yourself" is now an interactive tool that helps millions of Thai voters make informed decisions.
Check it out:
"From locked PDFs to civic engagement tools—this is why open data automation matters."
How to Install and Use
The skill is open source and ready to use:
Option 1: For Existing Agent Users (Recommended)
```bash
npx skills add nuttea/thailand-election-skills \
  --skill extract-thailand-election-policies \
  --agent gemini-cli \
  --agent claude-code \
  --yes

# Edit .env and add your GEMINI_API_KEY
# Start your agent, e.g. 'gemini' or 'claude'
```
Option 2: Clone and Use Locally
```bash
git clone https://github.com/nuttea/thailand-election-skills.git
cd thailand-election-skills

# Set up Gemini API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

# Use with Claude or Gemini CLI - the agent handles the rest!
```
That's it! The AI agent will handle Python environment setup, dependency installation, and script execution.
Key Takeaways
1. AI Skills > Scripts
Package your workflows as reusable skills that agents can execute with natural language.
2. Structured Output is Essential
Use Pydantic schemas with Gemini's structured output for guaranteed valid data.
"The difference between hoping for valid JSON and guaranteeing it is the difference between a prototype and production."
3. Agent-First Development
Design tools for AI agents to use, not just for manual execution.
4. Automation at Scale
What would take weeks manually can be done in hours with proper AI tooling.
5. Make It Reusable
This skill pattern works for invoices, research papers, financial reports, contracts—any structured document extraction.
Adapting This for Your Use Case
This approach isn't limited to election policies. You can adapt it for:
- 📄 Invoice Processing - Extract line items, totals, dates, vendor info
- 📚 Research Papers - Extract abstracts, citations, methodology, results
- 💼 Contracts - Extract clauses, dates, parties, obligations, terms
- 📊 Financial Reports - Extract metrics, tables, summaries, KPIs
- 🏥 Medical Records - Extract diagnoses, prescriptions, vitals (with proper HIPAA compliance)
The pattern is the same:
- Define your Pydantic schema (your data contract)
- Configure Gemini with structured output
- Package as a skill with clear agent instructions
- Let agents do the work via natural language
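Retargeting the pattern at, say, invoices only means swapping the schema; OCR, retries, and validation are unchanged. A sketch with illustrative field names (not from the repo):

```python
from datetime import date
from typing import List
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    vendor: str = Field(description="Vendor name as printed on the invoice")
    invoice_date: date          # Pydantic parses ISO strings like "2026-01-15"
    line_items: List[LineItem]
    total: float = Field(description="Grand total; should equal the sum of line items")

# Plugs into the same Gemini config shown earlier:
# response_schema=Invoice.model_json_schema()
```

Everything downstream, including the validation step, works on the new model for free.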
Challenges and Future Improvements
While this solution achieved 100% success across 51 parties, it's important to acknowledge the real-world challenges and areas for improvement:
Current Limitations
1. Non-Deterministic Extraction
LLM extraction is not 100% deterministic. Running the same PDF twice might produce slightly different results:
- Wording variations in descriptions
- Occasional budget calculation differences
- Inconsistent categorization on edge cases
2. Quality Depends on Source Documents
Low-quality or inconsistent-resolution scanned documents present challenges:
- Blurry text can lead to OCR errors
- Inconsistent table formatting requires manual verification
- Handwritten annotations may be misinterpreted
- Some PDFs required careful spot-checking
Recommended Next Steps
1. Build a Test Dataset for Automated Evaluation
Create a ground-truth dataset with known values:
```python
test_cases = {
    "party_9": {
        "expected_policies": 12,
        "expected_total_budget": 450000000000,
        "sample_policy_names": ["ระบบรางความเร็วสูง", "..."]
    }
}
```
This enables automated validation: does extraction match expected policy count? Total budget?
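A minimal sketch of that check, assuming extracted policies are loaded as plain dicts and the ground truth follows the structure above (`check_against_ground_truth` is a hypothetical helper):

```python
def check_against_ground_truth(party_id: str, policies: list, test_cases: dict) -> dict:
    """Compare one party's extraction run against known-good values."""
    expected = test_cases[party_id]
    return {
        "policy_count_ok": len(policies) == expected["expected_policies"],
        "total_budget_ok": sum(p["budget_baht"] for p in policies)
                           == expected["expected_total_budget"],
    }
```

Run it after every prompt or model change and you get regression testing for extraction quality, not just for code.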
2. Run Systematic Experiments
Compare different approaches with measurable metrics:
- Models: Gemini 3 Pro vs. Gemini 4 vs. GPT-4 Vision
- Parameters: Temperature (0.3 vs 0.5 vs 0.7), thinking levels, max tokens
- Prompts: Different instruction styles, few-shot examples, chain-of-thought
3. Track Key Metrics
Build dashboards to monitor:
- Cost: Token usage per PDF, per policy
- Latency: Processing time per page, per table row
- Accuracy: Match rate against ground truth test set
- Consistency: Variance across multiple runs of same PDF
4. Implement Confidence Scores
Add validation checks to flag suspicious extractions:
```python
def validate_extraction(policies, pdf_metadata):
    """Return confidence checks for extracted data"""
    checks = {
        "policy_count_reasonable": 5 <= len(policies) <= 50,
        "budgets_in_range": all(0 <= p.budget_baht <= 1e13 for p in policies),
        "has_required_fields": all(p.policy_name and p.policy_category for p in policies),
        "no_duplicate_sequences": len(set(p.policy_seq for p in policies)) == len(policies)
    }
    return checks
```
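Downstream, those checks become a review gate: any failed check routes the PDF to a human instead of being auto-accepted. A minimal sketch (the `needs_review` helper is illustrative):

```python
def needs_review(checks: dict) -> list:
    """Return names of failed checks; an empty list means auto-accept."""
    return [name for name, ok in checks.items() if not ok]

# Example: a duplicate sequence number trips exactly one check
checks = {
    "policy_count_reasonable": True,
    "budgets_in_range": True,
    "has_required_fields": True,
    "no_duplicate_sequences": False,
}
print(needs_review(checks))  # ['no_duplicate_sequences']
```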
5. Create a Human-in-the-Loop Review Process
For production use:
- Flag extractions with low confidence scores for review
- Sample random extractions (e.g., 10%) for spot-checking
- Track and learn from manual corrections
- Build feedback loop to improve prompts over time
6. Implement LLM Observability
Monitor and optimize your extraction pipeline in production:
Key questions to answer:
- Quality: How accurate are extractions across different document types?
- Cost: What's the token usage per PDF? Per policy extracted?
- Latency: Why do some PDFs take 5 minutes while others take 1 minute?
- Failures: What patterns lead to extraction errors?
Observability Tools:
- Datadog LLMObs: Track latency, costs, and quality metrics per extraction
- Custom Dashboards: Visualize processing time distribution and error rates
- A/B Testing: Compare different models (Gemini 3 vs 4, GPT-4 Vision)
- Cost Analysis: Monitor token usage trends and optimize prompts
Real Example:
In my next post, I'll show how Datadog LLMObs revealed that:
- PDFs with complex tables took 3-5 minutes vs. simple ones at 1-2 minutes
- Certain table layouts caused 30% more retries
- Optimizing prompts reduced average processing time by 40%
- Token costs varied 10x between smallest and largest documents
This data-driven approach helps you make informed decisions about model selection, prompt engineering, and infrastructure costs.
The Reality Check
This solution is production-ready for:
- ✅ Rapid prototyping and initial data extraction
- ✅ Cases where 95-98% accuracy is acceptable
- ✅ Projects with some budget for spot-checking
It needs more work for:
- ⚠️ Mission-critical financial calculations requiring 100% accuracy
- ⚠️ Legal document extraction with no tolerance for errors
- ⚠️ High-volume production without human review
The key is knowing your accuracy requirements and building appropriate validation layers.
What I Learned
1. Gemini's Native Vision is Underrated
No need for separate OCR pipeline—Gemini handles scanned Thai documents natively. This eliminated an entire preprocessing step and potential error source.
2. Structured Output Changes Everything
Going from "hope the JSON is valid" to "guaranteed valid JSON" transforms a prototype into production-ready code. Pydantic + Gemini structured output is a game-changer.
3. Agent-First Design is the Future
Building for AI agents to use (not just humans) unlocks new automation possibilities. The same skill works across Gemini CLI, Claude Code, and any agent that understands the pattern.
4. Observability is Non-Negotiable
You can't optimize what you don't measure. Tracking metrics revealed 40% efficiency gains and identified which PDFs needed manual review.
5. Start Small, Validate Early
I processed the first 5 PDFs manually to spot-check before automating all 51. This caught prompt issues early and saved hours of rework.
Try It Yourself
GitHub Repository:
https://github.com/nuttea/thailand-election-skills
The repo includes:
- ✅ Complete skill implementation
- ✅ Pydantic schemas and extraction logic
- ✅ Batch processing scripts with retry logic
- ✅ CSV conversion utilities
- ✅ Datadog integration (for monitoring)
- ✅ Real example outputs from 51 parties
- ✅ Comprehensive documentation
Questions or feedback? Open an issue on GitHub.
Happy automating! 🚀



