TL;DR
Built an AI skill that extracts structured data from 51 scanned Thai election PDFs (587 policies) in under 2 hours, with 100% valid JSON output. It uses Gemini 3 Pro's native OCR plus structured output with Pydantic schemas to guarantee data quality.
Package workflows as AI "skills" that agents execute via natural language instead of manual scripts.
Results: most PDFs processed in 1-2 minutes, zero manual intervention, and the skill works with agents that support skills, such as Gemini CLI, Claude Code, and Cursor.
The Problem: A Facebook Post and 51 Locked PDFs

The post that started it all: "Election Commission released 51 party policies as scanned PDFs that can't be used. When asked for CSV: 'If you want it, write it yourself'"
I was scrolling through Facebook when I saw this post from the Thai developer community. Thailand's Election Commission (กกต.) had released policy data for the 2026 election—all 587 policies from 51 political parties—but in the worst possible format: scanned image PDFs.
Not text PDFs you could copy from. Scanned photos of printed documents. Just images of tables filled with Thai text, numbers, and complex formatting.
When developers asked for a usable format like CSV or JSON, the response was essentially: "Do it yourself."
The challenge was enormous:
- 📄 51 scanned PDF files (no text layer)
- 🔤 Thai language requiring OCR
- 📊 Complex table structures with inconsistent formatting
- 💰 Thai numerical units needing normalization (ล้าน, พันล้าน)
- 🏷️ Unstructured categorization
- ⏱️ Tight timeline before election discussion season
I had three options:
- Manual typing: 3+ weeks of soul-crushing work with inevitable errors
- Traditional OCR: Days of cleanup fixing Thai character errors, then manual structuring
- Build an AI solution: Handle OCR + extraction + validation in one automated workflow
I chose option 3—and built something reusable for anyone facing similar challenges.
The result? ✅ 51 parties extracted, 587 policies structured, 100% valid JSON—in under 2 hours of automated processing.
Table of Contents
- The Problem
- Skills vs Scripts
- Technical Deep Dive
- Real Results
- Real-World Impact
- Installation
- Challenges & Improvements
- What I Learned
- Try It Yourself
What Makes This Different: Skills, Not Scripts
Most people would write a Python script and call it done. But I took a different approach: I packaged this as a skill for AI agents (Gemini CLI and Claude Code).
What's a Skill?
Think of a skill as a recipe that teaches an AI agent a complete workflow. Instead of:
- Running commands manually
- Remembering script parameters
- Setting up environments each time
You simply tell the AI in natural language:
Extract policies from all Thai election PDFs
The agent handles everything—setup, execution, error recovery, validation.
Why this matters: We're shifting from "writing scripts for humans to run" to "building tools for AI agents to use." This unlocks new levels of automation.
The Two Parts of a Skill
1. Instructions (SKILL.md) - The "user manual" for the AI
```markdown
## Agent Instructions

BEFORE running scripts, execute these commands:
1. Navigate to skill directory
2. Create virtual environment
3. Activate environment
4. Install dependencies
```
2. Tools (scripts/) - The actual code that does the work
- `extract_policy.py` - Single PDF extraction with OCR + validation
- `batch_extract_all.sh` - Batch processing with retry logic
- `json_to_csv.py` - Format conversion utilities
- `send_to_datadog.py` - Monitoring and observability integration
The Secret Sauce: Native Thai OCR + Structured Output
The real breakthrough came from combining two powerful capabilities:
1. Gemini's Native Vision for Thai OCR
The challenge: These PDFs are scanned images—no selectable text. Traditional OCR tools struggle with Thai characters (๐-๙) and complex table layouts.
The solution: Gemini 3 Pro's native vision capabilities handle Thai OCR seamlessly. No preprocessing, no separate OCR pipeline, no error cleanup. It just works.
2. Structured Output with Pydantic
Instead of hoping the LLM returns valid JSON, you define a Pydantic schema that guarantees the output format.
Before: Raw Scanned Image
```
๑. ระบบรางความเร็วสูง | ๓.๕ แสนล้าน | งบประมาณ, PPP, พันธบัตร
```
(Thai numerals, Thai units, inconsistent formatting)
After: Clean, Validated JSON
```json
{
  "policy_seq": 1,
  "policy_category": "โครงสร้างพื้นฐาน",
  "policy_name": "ระบบรางความเร็วสูง",
  "budget_baht": 350000000000,
  "funding_source": "๑) งบประมาณแผ่นดิน\n๒) PPP\n๓) พันธบัตร",
  "benefits": "๑) เพิ่มการเชื่อมต่อ\n๒) กระตุ้นเศรษฐกิจ"
}
```
(Converted numerals, normalized budget, structured data)
The Code: Pydantic + Gemini Structured Output
Here's the core implementation:
```python
import os
from typing import List

from google import genai
from google.genai import types
from pydantic import BaseModel, Field

# Define the exact schema you want - this becomes your data contract
class Policy(BaseModel):
    # Convert Thai numerals (๑,๒,๓) to Arabic (1,2,3) for sequence
    policy_seq: int = Field(
        description="Policy sequence (Thai numerals → Arabic)"
    )
    # AI categorizes into 15 predefined buckets based on content
    policy_category: str = Field(
        description="Category from predefined list"
    )
    # Extract word-by-word for accuracy, preserving Thai formatting
    policy_name: str = Field(
        description="Policy name extracted word-by-word"
    )
    # Normalize: ๓.๕ แสนล้าน → 350000000000 (pure Baht integer)
    budget_baht: int = Field(
        description="Budget in Baht (0 if none)"
    )
    # Preserve Thai numerals in lists (๑) ๒) ๓))
    funding_source: str = Field(
        description="Funding details, preserve Thai formatting"
    )
    cost_effectiveness: str
    benefits: str
    impacts: str
    risks: str

class PoliticalPartyPolicies(BaseModel):
    policies: List[Policy]

# Configure Gemini with your schema - this is the magic
config = types.GenerateContentConfig(
    temperature=0.5,  # Low temp for consistency
    response_mime_type="application/json",
    response_schema=PoliticalPartyPolicies.model_json_schema(),  # 🎯 Guaranteed structure
    thinking_config=types.ThinkingConfig(thinking_level="low")
)

# Process scanned PDF with native OCR + structured extraction
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
response = client.models.generate_content_stream(
    model="gemini-3-pro-preview",
    contents=[
        # Your extraction instructions
        types.Part.from_text(text=extraction_instructions),
        # The scanned PDF - Gemini handles OCR natively
        types.Part.from_bytes(
            data=pdf_bytes,
            mime_type="application/pdf"
        )
    ],
    config=config  # Pass the structured output config
)

# Accumulate the streamed chunks into the full JSON string
result_json = "".join(chunk.text for chunk in response)

# Validate output - raises a ValidationError if the schema doesn't match
policies = PoliticalPartyPolicies.model_validate_json(result_json)
print(f"✅ Validated: {len(policies.policies)} policies extracted")
```
Key Benefits:
- ✅ Native Thai OCR - Handles scanned images without preprocessing
- ✅ No manual validation - Pydantic handles schema validation automatically
- ✅ Type safety - Budget is always an integer, never a string
- ✅ Thai numeral conversion - LLM automatically converts Thai numerical characters (๐-๙) to computer-readable integers/floats
- ✅ Context-aware - Understands Thai units and preserves formatting where needed
- ✅ Clear errors - Know exactly what failed and where
- ✅ Documentation - Schema serves as specification
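To make "clear errors" concrete, here's a minimal standalone sketch (a trimmed-down schema, not the skill's full one) of Pydantic accepting a good record and rejecting a bad one:

```python
from pydantic import BaseModel, ValidationError

class Policy(BaseModel):
    policy_seq: int
    policy_name: str
    budget_baht: int

# Valid record: Thai text passes through untouched, budget stays an int
ok = Policy.model_validate_json(
    '{"policy_seq": 1, "policy_name": "ระบบรางความเร็วสูง", "budget_baht": 350000000000}'
)
print(ok.budget_baht)  # 350000000000

# Invalid record: budget as free text fails loudly, naming the exact field
try:
    Policy.model_validate_json(
        '{"policy_seq": 2, "policy_name": "x", "budget_baht": "๓.๕ แสนล้าน"}'
    )
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('budget_baht',)
```

The error points at `budget_baht` directly, so a failed PDF tells you which field to inspect instead of handing you a vaguely broken blob.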
Usage: As Simple As Asking
Once installed, using the skill is incredibly intuitive:
Extract a single PDF:
Extract policies from "เบอร์ 9 พรรคเพื่อไทย.pdf" using the Thailand election skill
Batch process all PDFs:
Batch extract all Thai election PDFs in the assets folder
Convert to CSV:
Convert party_9_policies.json to CSV format with pipe delimiter
Analyze results:
Show me all policies with budgets over 100 billion Baht
The AI agent handles:
- ✅ Environment setup (Python venv, dependencies)
- ✅ Script execution with correct parameters
- ✅ Error handling and automatic retries
- ✅ Progress monitoring
- ✅ Output validation
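For the CSV step specifically, the heavy lifting is ordinary Python. Here's a minimal sketch of a pipe-delimited export (the `policies_to_csv` name and file paths are mine for illustration, not the repo's `json_to_csv.py`):

```python
import csv
import json

def policies_to_csv(json_path: str, csv_path: str) -> None:
    """Flatten an extracted policy JSON file into pipe-delimited CSV."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    rows = data["policies"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter="|")
        writer.writeheader()
        # Newlines inside fields (e.g. "๑) ...\n๒) ...") get quoted by the csv module
        writer.writerows(rows)
```

The pipe delimiter matters because Thai policy text is full of commas; a comma-delimited export would need quoting on nearly every field.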
Real Results: Battle-Tested at Scale
This isn't a toy project. Here are the real metrics from extracting Thailand's 2026 election data:
| Metric | Result |
|---|---|
| Total parties | 51 (100% success) |
| Total policies | 587 extracted |
| Data quality | 100% valid JSON |
| Processing time | Under 2 hours total |
| Typical per PDF | 1-2 minutes (59% of files) |
| Automatic retries | Handled seamlessly |
| Manual intervention | Zero |
Processing Time Distribution
All 51 PDFs processed in under 5 minutes each:
| Time Range | Files | Avg Time | Percentage |
|---|---|---|---|
| 0-1 min | 4 files | 47 sec | 11% |
| 1-2 min | 22 files | 86 sec | 59% |
| 2-3 min | 7 files | 138 sec | 19% |
| 3-4 min | 2 files | 192 sec | 5% |
| 4-5 min | 2 files | 282 sec | 5% |
Most common processing time: 1-2 minutes (59% of files)
"Most PDFs processed in 90 seconds. That's faster than I could even open the file and copy-paste a single table manually."
This efficiency makes the solution practical for real-world use—you can extract data from dozens of documents in the time it takes to get a coffee.
Special Challenges Solved:
- 🔍 Thai OCR from scanned images - 100% success rate, no preprocessing needed
- 📄 Thai numerals - Auto-convert ๐-๙ to 0-9 in sequence fields, preserve in content
- 💰 Budget normalization - ๓.๕ แสนล้าน (3.5 hundred billion) → 350,000,000,000 Baht
- 🏷️ Smart categorization - 587 policies across 15 categories with context awareness
- 🔄 Retry logic - Automatically handles incomplete responses (up to 3 attempts)
- ⏱️ Stream timeout detection - Monitors chunk timing, auto-retries on stalls
- 📊 CSV consolidation - Automatic merging across all parties with pipe delimiter
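In the skill itself, the numeral and budget conversions are done by the model; for spot-checking its output, the same normalization can be written deterministically. A sketch (the unit table and `normalize_budget` helper are illustrative, not repo code):

```python
# Map Thai digits ๐-๙ onto Arabic 0-9
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")

# Thai magnitude words and their multipliers (แสนล้าน = 10^5 × 10^6)
UNITS = {"ล้าน": 10**6, "พันล้าน": 10**9, "หมื่นล้าน": 10**10, "แสนล้าน": 10**11}

def normalize_budget(text: str) -> int:
    """Convert e.g. '๓.๕ แสนล้าน' to 350_000_000_000 Baht."""
    number, _, unit = text.translate(THAI_DIGITS).partition(" ")
    return int(float(number) * UNITS.get(unit, 1))

print(normalize_budget("๓.๕ แสนล้าน"))  # 350000000000
```

A helper like this makes a cheap cross-check: if the model's `budget_baht` disagrees with a rule-based parse of the raw cell, flag that row for review.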
Extracted Data in Action: Wevis Policy Comparison
The extracted data isn't just sitting in JSON files—it's already powering real civic tech applications. Wevis, a Thai civic tech organization, used this structured data to build an interactive policy comparison tool for the 2026 election.
Wevis's interactive policy comparison website using the extracted election data
What Wevis Built:
- 📊 Interactive comparison across all 51 political parties
- 🔍 Select and filter by policy categories
- 💰 Budget analysis and visualization
- 📱 Mobile-friendly interface for voters
Why This Matters:
This demonstrates the real-world impact of making locked government data accessible. What started as 51 scanned PDFs "you have to write yourself" is now an interactive tool that helps millions of Thai voters make informed decisions.
Check it out:
"From locked PDFs to civic engagement tools—this is why open data automation matters."
How to Install and Use
The skill is open source and ready to use:
Option 1: For Existing Agent Users (Recommended)
```bash
npx skills add nuttea/thailand-election-skills \
  --skill extract-thailand-election-policies \
  --agent gemini-cli \
  --agent claude-code \
  --yes

# Edit .env and add your GEMINI_API_KEY
# Start your agent, e.g. 'gemini' or 'claude'
```
Option 2: Clone and Use Locally
```bash
git clone https://github.com/nuttea/thailand-election-skills.git
cd thailand-election-skills

# Set up Gemini API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY

# Use with Claude or Gemini CLI - the agent handles the rest!
```
That's it! The AI agent will handle Python environment setup, dependency installation, and script execution.
Key Takeaways
1. AI Skills > Scripts
Package your workflows as reusable skills that agents can execute with natural language.
2. Structured Output is Essential
Use Pydantic schemas with Gemini's structured output for guaranteed valid data.
"The difference between hoping for valid JSON and guaranteeing it is the difference between a prototype and production."
3. Agent-First Development
Design tools for AI agents to use, not just for manual execution.
4. Automation at Scale
What would take weeks manually can be done in hours with proper AI tooling.
5. Make It Reusable
This skill pattern works for invoices, research papers, financial reports, contracts—any structured document extraction.
Adapting This for Your Use Case
This approach isn't limited to election policies. You can adapt it for:
- 📄 Invoice Processing - Extract line items, totals, dates, vendor info
- 📚 Research Papers - Extract abstracts, citations, methodology, results
- 💼 Contracts - Extract clauses, dates, parties, obligations, terms
- 📊 Financial Reports - Extract metrics, tables, summaries, KPIs
- 🏥 Medical Records - Extract diagnoses, prescriptions, vitals (with proper HIPAA compliance)
The pattern is the same:
- Define your Pydantic schema (your data contract)
- Configure Gemini with structured output
- Package as a skill with clear agent instructions
- Let agents do the work via natural language
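Retargeting the pattern at, say, invoices only means swapping the schema; OCR, retries, and validation are unchanged. A sketch with illustrative field names (not from the repo):

```python
from datetime import date
from typing import List
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    vendor: str = Field(description="Vendor name as printed on the invoice")
    invoice_date: date          # Pydantic parses ISO strings like "2026-01-15"
    line_items: List[LineItem]
    total: float = Field(description="Grand total; should equal the sum of line items")

# Plugs into the same Gemini config shown earlier:
# response_schema=Invoice.model_json_schema()
```

Everything downstream, including the validation step, works on the new model for free.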
Challenges and Future Improvements
While this solution achieved 100% success across 51 parties, it's important to acknowledge the real-world challenges and areas for improvement:
Current Limitations
1. Non-Deterministic Extraction
LLM extraction is not 100% deterministic. Running the same PDF twice might produce slightly different results:
- Wording variations in descriptions
- Occasional budget calculation differences
- Inconsistent categorization on edge cases
2. Quality Depends on Source Documents
Low-quality or inconsistent-resolution scanned documents present challenges:
- Blurry text can lead to OCR errors
- Inconsistent table formatting requires manual verification
- Handwritten annotations may be misinterpreted
- Some PDFs required careful spot-checking
Recommended Next Steps
1. Build a Test Dataset for Automated Evaluation
Create a ground-truth dataset with known values:
```python
test_cases = {
    "party_9": {
        "expected_policies": 12,
        "expected_total_budget": 450000000000,
        "sample_policy_names": ["ระบบรางความเร็วสูง", "..."]
    }
}
```
This enables automated validation: does extraction match expected policy count? Total budget?
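A minimal sketch of that check, assuming extracted policies are loaded as plain dicts and the ground truth follows the structure above (`check_against_ground_truth` is a hypothetical helper):

```python
def check_against_ground_truth(party_id: str, policies: list, test_cases: dict) -> dict:
    """Compare one party's extraction run against known-good values."""
    expected = test_cases[party_id]
    return {
        "policy_count_ok": len(policies) == expected["expected_policies"],
        "total_budget_ok": sum(p["budget_baht"] for p in policies)
                           == expected["expected_total_budget"],
    }
```

Run it after every prompt or model change and you get regression testing for extraction quality, not just for code.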
2. Run Systematic Experiments
Compare different approaches with measurable metrics:
- Models: Gemini 3 Pro vs. Gemini 4 vs. GPT-4 Vision
- Parameters: Temperature (0.3 vs 0.5 vs 0.7), thinking levels, max tokens
- Prompts: Different instruction styles, few-shot examples, chain-of-thought
3. Track Key Metrics
Build dashboards to monitor:
- Cost: Token usage per PDF, per policy
- Latency: Processing time per page, per table row
- Accuracy: Match rate against ground truth test set
- Consistency: Variance across multiple runs of same PDF
4. Implement Confidence Scores
Add validation checks to flag suspicious extractions:
```python
def validate_extraction(policies, pdf_metadata):
    """Return confidence checks for extracted data"""
    checks = {
        "policy_count_reasonable": 5 <= len(policies) <= 50,
        "budgets_in_range": all(0 <= p.budget_baht <= 1e13 for p in policies),
        "has_required_fields": all(p.policy_name and p.policy_category for p in policies),
        "no_duplicate_sequences": len(set(p.policy_seq for p in policies)) == len(policies)
    }
    return checks
```
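Downstream, those checks become a review gate: any failed check routes the PDF to a human instead of being auto-accepted. A minimal sketch (the `needs_review` helper is illustrative):

```python
def needs_review(checks: dict) -> list:
    """Return names of failed checks; an empty list means auto-accept."""
    return [name for name, ok in checks.items() if not ok]

# Example: a duplicate sequence number trips exactly one check
checks = {
    "policy_count_reasonable": True,
    "budgets_in_range": True,
    "has_required_fields": True,
    "no_duplicate_sequences": False,
}
print(needs_review(checks))  # ['no_duplicate_sequences']
```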
5. Create a Human-in-the-Loop Review Process
For production use:
- Flag extractions with low confidence scores for review
- Sample random extractions (e.g., 10%) for spot-checking
- Track and learn from manual corrections
- Build feedback loop to improve prompts over time
6. Implement LLM Observability
Monitor and optimize your extraction pipeline in production:
Key questions to answer:
- Quality: How accurate are extractions across different document types?
- Cost: What's the token usage per PDF? Per policy extracted?
- Latency: Why do some PDFs take 5 minutes while others take 1 minute?
- Failures: What patterns lead to extraction errors?
Observability Tools:
- Datadog LLMObs: Track latency, costs, and quality metrics per extraction
- Custom Dashboards: Visualize processing time distribution and error rates
- A/B Testing: Compare different models (Gemini 3 vs 4, GPT-4 Vision)
- Cost Analysis: Monitor token usage trends and optimize prompts
Real Example:
In my next post, I'll show how Datadog LLMObs revealed that:
- PDFs with complex tables took 3-5 minutes vs. simple ones at 1-2 minutes
- Certain table layouts caused 30% more retries
- Optimizing prompts reduced average processing time by 40%
- Token costs varied 10x between smallest and largest documents
This data-driven approach helps you make informed decisions about model selection, prompt engineering, and infrastructure costs.
The Reality Check
This solution is production-ready for:
- ✅ Rapid prototyping and initial data extraction
- ✅ Cases where 95-98% accuracy is acceptable
- ✅ Projects with some budget for spot-checking
It needs more work for:
- ⚠️ Mission-critical financial calculations requiring 100% accuracy
- ⚠️ Legal document extraction with no tolerance for errors
- ⚠️ High-volume production without human review
The key is knowing your accuracy requirements and building appropriate validation layers.
What I Learned
1. Gemini's Native Vision is Underrated
No need for separate OCR pipeline—Gemini handles scanned Thai documents natively. This eliminated an entire preprocessing step and potential error source.
2. Structured Output Changes Everything
Going from "hope the JSON is valid" to "guaranteed valid JSON" transforms a prototype into production-ready code. Pydantic + Gemini structured output is a game-changer.
3. Agent-First Design is the Future
Building for AI agents to use (not just humans) unlocks new automation possibilities. The same skill works across Gemini CLI, Claude Code, and any agent that understands the pattern.
4. Observability is Non-Negotiable
You can't optimize what you don't measure. Tracking metrics revealed 40% efficiency gains and identified which PDFs needed manual review.
5. Start Small, Validate Early
I processed the first 5 PDFs manually to spot-check before automating all 51. This caught prompt issues early and saved hours of rework.
Try It Yourself
GitHub Repository:
https://github.com/nuttea/thailand-election-skills
The repo includes:
- ✅ Complete skill implementation
- ✅ Pydantic schemas and extraction logic
- ✅ Batch processing scripts with retry logic
- ✅ CSV conversion utilities
- ✅ Datadog integration (for monitoring)
- ✅ Real example outputs from 51 parties
- ✅ Comprehensive documentation
Questions or feedback? Open an issue on GitHub.
Happy automating! 🚀



