Building a Personal AI Info Pipeline: Semantic Monitoring with YAML and GPT

#opensource #python #ai #flask

Case Study: Tracking Federal Documents with AI Summarization

Keeping up with dozens of information sources by hand takes time every day. Wouldn't it be great to get concise summaries without the manual work?

I'm building a system that automatically:

Checks websites for new content (even without RSS)
Downloads and processes PDFs
Generates AI summaries of key information
Delivers everything in a unified feed

Everything is configured through simple YAML templates with CSS selectors, normalization rules, and GPT prompts.

Real-World Example: Presidential Executive Orders

Here's an actual template for monitoring new executive orders from the Federal Register:

version: 1

meta:
  title: "Federal Register Executive Orders"
  description: "Monitor presidential executive orders from Federal Register"
  language: "en"

template_name: federal_register_orders

source:
  type: json
  url: "https://www.federalregister.gov/api/v1/documents.json?conditions%5Bcorrection%5D=0&conditions%5Bpresident%5D=donald-trump&conditions%5Bpresidential_document_type%5D=executive_order&conditions%5Btype%5D%5B%5D=PRESDOCU&order=newest&per_page=20"
  frequency: "daily"

extract:
  events:
    limit: 2
    selector: "results[*]"
    fields:
      id: document_number
      title: title
      order_number: executive_order_number
      publication_date: publication_date
      signing_date: signing_date
      url: pdf_url

download:
  extensions: [".pdf"]
  timeout: 60

gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1-2 sentences
    - Key provisions: 3-5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis

    This is the text: {{ text }}
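
To make the extract section of the template above more concrete, here is a minimal Python sketch of how the `results[*]` selector, the `fields` mapping, and `limit` could be applied to a JSON response. The helper name `extract_events` and the sample payload are assumptions for illustration, not the project's actual code, and the sketch only handles this one selector rather than a full JSONPath engine.

```python
from typing import Any

# Field mapping copied from the template's extract.events.fields section:
# feed field name -> key in each JSON result.
FIELDS = {
    "id": "document_number",
    "title": "title",
    "order_number": "executive_order_number",
    "publication_date": "publication_date",
    "signing_date": "signing_date",
    "url": "pdf_url",
}


def extract_events(payload: dict[str, Any], fields: dict[str, str], limit: int) -> list[dict[str, Any]]:
    """Map the first `limit` entries of payload["results"] onto the template's field names.

    Hypothetical helper: keys missing from a result become None.
    """
    events = []
    for item in payload.get("results", [])[:limit]:
        events.append({name: item.get(source_key) for name, source_key in fields.items()})
    return events


# Made-up payload in the shape the template expects (not real API output).
sample = {
    "results": [
        {"document_number": "0000-00001", "title": "Example order A", "pdf_url": "https://example.org/a.pdf"},
        {"document_number": "0000-00002", "title": "Example order B", "pdf_url": "https://example.org/b.pdf"},
        {"document_number": "0000-00003", "title": "Example order C", "pdf_url": "https://example.org/c.pdf"},
    ]
}

events = extract_events(sample, FIELDS, limit=2)
print([event["id"] for event in events])
```

With `limit: 2`, only the two newest results are kept, and each one ends up as a flat dictionary that later stages (download, summarization) can consume.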

When a new executive order appears, the system automatically generates a concise summary like this:

Executive Order 14319 of July 23, 2025

Preventing Woke AI in the Federal Government

Purpose: Ensure reliable and unbiased AI outputs for Americans by preventing ideological bias, particularly from DEI-related content.

Policy impact: This aims to ensure government AI provides accurate information, potentially increasing transparency in federal AI use.

View full document
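
Deciding that an order is "new" comes down to remembering which IDs have already been processed. A minimal sketch of that idea, assuming the seen IDs are kept in a small JSON file; the function name `filter_new` and the file layout are hypothetical and may differ from how the project stores its state:

```python
import json
import tempfile
from pathlib import Path


def filter_new(events, state_path):
    """Return only events whose "id" has not been seen before, and record them as seen."""
    path = Path(state_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = []
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            fresh.append(event)
    # Persist the updated ID set so the next run skips these events.
    path.write_text(json.dumps(sorted(seen)))
    return fresh


# Usage with made-up IDs and a throwaway state file.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "seen_ids.json"
    first_run = filter_new([{"id": "a-1"}, {"id": "a-2"}], state_file)
    second_run = filter_new([{"id": "a-2"}, {"id": "a-3"}], state_file)

print(len(first_run), len(second_run))
```

Only events returned by `filter_new` would need to be downloaded and summarized, so each document is processed once.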

Current Implementation Status

The system now includes:

Basic Flask web interface for viewing the monitoring feed
CLI for adding new monitoring templates
Support for local LLM processing (DeepSeek as default)

Important note: the system currently runs single-threaded, so the web interface is temporarily unresponsive while a monitoring run is in progress. This architectural limitation will be addressed in a future update.
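
One common way around this kind of blocking is to move the monitoring run onto a background thread so the request handler can return immediately. The sketch below uses only the standard library; the `MonitorRunner` class and `slow_monitoring_job` are hypothetical names, and this is not necessarily the approach the project will adopt.

```python
import threading
import time


class MonitorRunner:
    """Runs at most one monitoring job at a time on a background thread."""

    def __init__(self, job):
        self._job = job
        self._lock = threading.Lock()
        self._thread = None

    def start(self):
        """Start the job unless one is already running. Returns True if a run was started."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False
            self._thread = threading.Thread(target=self._job, daemon=True)
            self._thread.start()
            return True

    def is_running(self):
        with self._lock:
            return self._thread is not None and self._thread.is_alive()

    def wait(self, timeout=None):
        """Block until the current run (if any) finishes."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)


def slow_monitoring_job():
    # Stand-in for a real monitoring run (fetch, download, summarize).
    time.sleep(0.2)


runner = MonitorRunner(slow_monitoring_job)
started = runner.start()   # returns right away while the job runs in the background
second = runner.start()    # refused because the first run is still active
runner.wait()
print(started, second, runner.is_running())
```

In a Flask app, a route could call `runner.start()` and respond at once, while another route reports `runner.is_running()` so the feed page can show progress.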

Project Philosophy

The open-source project (Rostral.io on GitHub) is designed for:

Researchers needing to track complex sources (like Huginn but with built-in AI)
Anyone who wants to monitor important changes (like changedetection.io but with semantic parsing)

Key differentiator: AI-powered summaries of documents and news, all collected in one feed.

Technical Implementation

For large documents (common with government PDFs), we:

Extract relevant text sections using keywords
Combine fragments to fit the AI's context window
Process through a local LLM (DeepSeek by default)

Running the model locally avoids paid API calls, and sending only the relevant fragments keeps the input within the model's context window. The system works as a pipeline with configurable stages (fetch, extract, download, normalize, etc.).
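
The first two steps for large documents (keyword-based selection and fitting fragments into the context window) can be sketched as follows. This is a simplified illustration under stated assumptions: paragraphs are scored by raw keyword counts, the budget is measured in characters as a rough stand-in for tokens, and the names `select_fragments` and `build_prompt` are hypothetical rather than taken from the project.

```python
import re


def select_fragments(text, keywords, max_chars):
    """Keep the paragraphs that mention the most keywords, within a character budget."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    scored = []
    for index, paragraph in enumerate(paragraphs):
        hits = sum(paragraph.lower().count(k) for k in lowered)
        if hits:
            scored.append((hits, index, paragraph))
    # Highest-scoring paragraphs first; ties keep document order.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    chosen = []
    used = 0
    for _hits, index, paragraph in scored:
        cost = len(paragraph) + 2  # +2 for the blank line used when joining
        if used + cost > max_chars:
            continue
        chosen.append((index, paragraph))
        used += cost
    # Restore the original order so the model reads fragments in sequence.
    chosen.sort()
    return "\n\n".join(paragraph for _, paragraph in chosen)


def build_prompt(template, text):
    """Fill the {{ text }} placeholder used in the YAML prompt."""
    return re.sub(r"\{\{\s*text\s*\}\}", lambda _match: text, template)


# Made-up document text for illustration.
document = (
    "Section 1. Purpose. This order sets policy for agencies.\n\n"
    "Page 2 of 7\n\n"
    "Sec. 2. The Director of OMB shall issue guidance to agencies.\n\n"
    "Sec. 3. General provisions."
)
fragments = select_fragments(document, ["purpose", "agencies", "shall"], max_chars=200)
prompt = build_prompt("This is the text: {{ text }}", fragments)
print(prompt)
```

Paragraphs with no keyword hits (page footers, boilerplate sections) are dropped, and anything that would exceed the budget is skipped, so the prompt stays within a size the local model can handle.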

Getting Started

git clone https://github.com/yourusername/rostral.io
cd rostral.io
pip install -r requirements.txt

# Install Tesseract OCR
# Download GGUF model (see repo instructions)

# Run with web interface:
python3 app.py

# Or CLI-only mode:
python3 -m rostral

Would this approach be useful for your monitoring needs? What other government sources would you want templates for? Federal contracts? Patent filings? Regulations?

Explore the project on GitHub | Contribute via pull requests