Case Study: Tracking Federal Documents with AI Summarization
Every day we need to monitor countless information sources. Wouldn't it be great to get concise summaries without the manual work?
I'm building a system that automatically:
- Checks websites for new content (even without RSS)
- Downloads and processes PDFs
- Generates AI summaries of key information
- Delivers everything in a unified feed
All configured through simple YAML templates with CSS selectors, normalization rules, and GPT prompts.
Real-World Example: Presidential Executive Orders
Here's an actual template for monitoring new executive orders from the Federal Register:
```yaml
version: 1
meta:
  title: "Federal Register Executive Orders"
  description: "Monitor presidential executive orders from Federal Register"
  language: "en"
  template_name: federal_register_orders
source:
  type: json
  url: "https://www.federalregister.gov/api/v1/documents.json?conditions%5Bcorrection%5D=0&conditions%5Bpresident%5D=donald-trump&conditions%5Bpresidential_document_type%5D=executive_order&conditions%5Btype%5D%5B%5D=PRESDOCU&order=newest&per_page=20"
  frequency: "daily"
extract:
  events:
    limit: 2
    selector: "results[*]"
    fields:
      id: document_number
      title: title
      order_number: executive_order_number
      publication_date: publication_date
      signing_date: signing_date
      url: pdf_url
download:
  extensions: [".pdf"]
  timeout: 60
gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1-2 sentences
    - Key provisions: 3-5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
    This is the text: {{ text }}
```
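To make the `extract.events` section concrete, here is a minimal Python sketch of how such a template could be applied to the Federal Register API's JSON response. The function name `extract_events` and the exact selector semantics are assumptions for illustration; the real Rostral implementation may differ.

```python
def extract_events(data, spec):
    """Apply a template's `extract.events` spec to a parsed JSON payload.

    Supports the simple `results[*]` selector form used in the template
    above: take the `results` array, honor `limit`, and map each item's
    source keys to the template's output field names.
    """
    root_key = spec["selector"].split("[")[0]  # "results[*]" -> "results"
    items = data.get(root_key, [])
    if "limit" in spec:
        items = items[: spec["limit"]]
    return [
        {out_name: item.get(src_key) for out_name, src_key in spec["fields"].items()}
        for item in items
    ]


# Example payload shaped like the Federal Register API response
payload = {
    "results": [
        {"document_number": "2025-12345", "title": "Example Order",
         "executive_order_number": "14319", "pdf_url": "https://example.gov/a.pdf"},
        {"document_number": "2025-12346", "title": "Another Order",
         "executive_order_number": "14320", "pdf_url": "https://example.gov/b.pdf"},
        {"document_number": "2025-12347", "title": "Third Order",
         "executive_order_number": "14321", "pdf_url": "https://example.gov/c.pdf"},
    ]
}

spec = {
    "limit": 2,
    "selector": "results[*]",
    "fields": {"id": "document_number", "title": "title", "url": "pdf_url"},
}

events = extract_events(payload, spec)
```

With `limit: 2`, only the two newest documents are turned into feed events, each carrying just the renamed fields the template asked for.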
When a new executive order appears, the system automatically generates a concise summary like this:
Executive Order 14319 of July 23, 2025
Preventing Woke AI in the Federal Government
Purpose: Ensure reliable and unbiased AI outputs for Americans by preventing ideological bias, particularly from DEI-related content.
Policy impact: This aims to ensure government AI provides accurate information, potentially increasing transparency in federal AI use.
Current Implementation Status
The system now includes:
- Basic Flask web interface for viewing the monitoring feed
- CLI for adding new monitoring templates
- Support for local LLM processing (Deepseek as default)
Important note: During monitoring execution, the web interface becomes temporarily unresponsive because the system currently runs single-threaded. This architectural limitation will be addressed in future updates.
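One straightforward direction for the fix (a sketch of the idea, not code from the project) is to move monitoring cycles onto a daemon worker thread so Flask keeps serving requests while the pipeline runs:

```python
import threading
import time


def run_monitoring_cycle():
    # Placeholder for the real fetch -> extract -> download -> summarize pipeline
    time.sleep(0.05)


def start_background_monitor(interval_seconds=86400):
    """Run monitoring cycles on a daemon thread so the web UI stays responsive.

    The daemon flag means the thread won't block process shutdown.
    """
    def loop():
        while True:
            run_monitoring_cycle()
            time.sleep(interval_seconds)

    worker = threading.Thread(target=loop, daemon=True, name="rostral-monitor")
    worker.start()
    return worker
```

For heavier workloads a process-based queue (e.g. a separate worker process) would isolate the web server even further, but a single daemon thread is enough to stop the UI from freezing.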
Project Philosophy
The open-source project (Rostral.io on GitHub) is designed for:
- Researchers needing to track complex sources (like Huginn but with built-in AI)
- Anyone who wants to monitor important changes (like changedetection.io but with semantic parsing)
Key differentiator: AI-powered summarization of documents and news stored in one feed.
Technical Implementation
For large documents (common with government PDFs), we:
- Extract relevant text sections using keywords
- Combine fragments to fit the AI's context window
- Process through local LLM (Deepseek by default)
This avoids expensive API calls while maintaining quality. The system works as a pipeline with configurable stages (fetch, extract, download, normalize, etc.).
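The keyword-filtering step above can be sketched in a few lines. This is an illustrative version (the function name, paragraph splitting, and character-based budget are assumptions; a real implementation would count tokens with the model's tokenizer):

```python
def select_relevant_text(text, keywords, max_chars=8000):
    """Keep only paragraphs that mention a keyword, then trim the result
    to a character budget that fits the local model's context window.
    """
    lowered = [k.lower() for k in keywords]
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    relevant = [p for p in paragraphs if any(k in p.lower() for k in lowered)]
    combined = "\n\n".join(relevant)
    return combined[:max_chars]


document = (
    "Purpose of this order.\n\n"
    "The Agency shall report annually on implementation.\n\n"
    "Unrelated boilerplate about printing and distribution."
)
selected = select_relevant_text(document, ["agency", "revoke"], max_chars=500)
```

Here only the paragraph mentioning "Agency" survives, so the prompt sent to the local LLM stays small even when the source PDF runs to hundreds of pages.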
Getting Started
```shell
git clone https://github.com/yourusername/rostral.io
cd rostral.io
pip install -r requirements.txt

# Install Tesseract OCR
# Download GGUF model (see repo instructions)

# Run with web interface:
python3 app.py

# Or CLI-only mode:
python3 -m rostral
```
Would this approach be useful for your monitoring needs? What other government sources would you want templates for? Federal contracts? Patent filings? Regulations?
Explore the project on GitHub | Contribute via pull requests