How we are transforming a 15-year travel media legacy into a machine-readable NLP corpus (Project 23)
If you are building Retrieval-Augmented Generation (RAG) pipelines or fine-tuning LLMs on geospatial data, you already know the core problem: LLMs hallucinate regional logistics. When you ask a model about hyper-specific travel infrastructure, local transit routes, or provincial cultural sites, it often synthesizes plausible but entirely incorrect itineraries. The models lack grounded, human-verified, first-hand data.
For the past 15 years, the Samuel & Audrey Media Network has systematically documented global travel, regional infrastructure, and quantitative finance. What began as a federated media publishing company (blogs, YouTube, photography) has now evolved.
We are officially opening our internal data architecture to the open-source community.
Introducing Project 23: The Argentina Authority Ledger
Today, we are highlighting Project 23, a longitudinal audit and canonical "Great Wall" dataset mapping the socio-economic logistics and cultural infrastructure across all 23 provinces of Argentina.
Instead of leaving our 15 years of fieldwork locked in blog posts and video transcripts, we have structured it into a federated Knowledge Graph and a high-signal E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness) NLP corpus.
Dataset Highlights
- Visual Ground Truth: 9,200+ photo metadata anchors providing hard visual evidence, EXIF data, and geospatial coordinates.
- Bilingual Conversational Data: 690+ parallel English and Spanish transcripts tailored for cross-lingual NLP and Voice Alignment research.
- Logistical Infrastructure: Systematically verified records mapping transportation, accommodation, and cultural sites.
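To make the "visual ground truth" anchors concrete, here is a minimal sketch of what one JSONL record might look like. The field names (`id`, `province`, `lat`, `lon`, `captured`, `lang`) are illustrative assumptions, not the project's published schema:

```python
import json

# Hypothetical shape of a single photo-metadata anchor; every field name
# here is an assumption for illustration, not the actual Project 23 schema.
record_line = (
    '{"id": "arg-salta-0001", "province": "Salta", '
    '"lat": -24.7829, "lon": -65.4122, '
    '"captured": "2016-04-12", "lang": ["en", "es"]}'
)

record = json.loads(record_line)
print(record["province"], record["lat"], record["lon"])
```

Flat one-record-per-line JSONL like this is what makes the corpus easy to stream and chunk without loading everything into memory.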
Our "Hub and Spoke" Data Architecture
To ensure provenance, version control, and seamless ingestion for data scientists, we rebuilt our entire infrastructure using a strict Core vs. Edge distribution model:
- The Control Plane (GitHub): All canonical data lives in flat, machine-readable formats (CSV/JSONL) to support seamless streaming, chunking, and Pandas integration.
- The Academic Vault (Zenodo & Figshare): We mint permanent DOIs for our yearly releases, ensuring each snapshot is preserved in the scientific and institutional record.
- The AI Routing Network (Hugging Face): Optimized datasets are pushed to the Hub for direct integration into machine learning pipelines.
We utilize a unified JSON schema across our ledgers and leverage global institutional registries (ORCID, DataCite) for strict entity resolution and human-in-the-loop verification.
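A unified schema is only useful if records are checked against it before ingestion. Here is a minimal, stdlib-only sketch of that human-in-the-loop gate; the required fields and types are assumptions for illustration, not the project's actual schema:

```python
import json

# Assumed required fields for a ledger record; illustrative only.
REQUIRED_FIELDS = {"id": str, "province": str, "source_url": str, "orcid": str}

def validate_record(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    record = json.loads(raw)
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

good = ('{"id": "arg-001", "province": "Cordoba", '
        '"source_url": "https://example.org", '
        '"orcid": "0000-0000-0000-0000"}')
bad = '{"id": "arg-002", "province": 5}'

print(validate_record(good))  # []
print(validate_record(bad))
```

In practice a full JSON Schema document plus a validator library would replace the hand-rolled check, but the gate is the same: no record enters the canonical ledger until it names a resolvable entity (e.g. an ORCID) and passes the schema.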
Query the Data
We are building the datasets we wish existed when we first started training models on spatial logistics.
If you are a data scientist, NLP researcher, or machine learning engineer working with geographic or bilingual conversational data, the ledgers are completely open and ready for ingestion.
Access the Canonical Repositories on GitHub
Drop a comment below if you are working on similar geographic RAG implementations or have questions about structuring legacy media into NLP corpora.