Source Score: Using AI to automate addition of new sources

Amit Singh

This post continues a series about a microservice I've been building. You can check out the previous post in the series here.


Automate News Source Ingestion with Firecrawl, OpenRouter, and GitHub Actions

TL;DR: I turned a manual, copy-paste routine for adding news outlets into a fully automated monthly GitHub Actions workflow. The pipeline scrapes a ranking page with Firecrawl, extracts clean URLs using three free-tier LLMs on OpenRouter, generates Source YAML files, and opens a PR that lands these sources straight on the live dashboard after merge.


How it all began

When I first ingested sources into the source-score database, I seeded it with five manually added outlets. It was enough to verify that the endpoints work, but I kept thinking about the “real” world: the top global media brands that people actually read. I wanted the repo to stay fresh without someone constantly creating PRs to add new sources.

The idea was simple on paper: fetch the latest list of popular English-language news sites, turn each entry into a valid Source document, and let the existing CI validate and ingest them. In practice, though, I ran into three big hurdles (each of which breaks down into its own little challenges, as you'll see):

  1. Finding a reliable source – I discovered a page on the PressGazette website that publishes the top 50 news sites each month, but the raw HTML was going to be a mess.
  2. Extracting just the URLs – The page’s markup mixed headlines, ads, and footnotes; a regex alone wasn't going to be enough.
  3. Keeping the process cheap – I didn’t want to spin up a paid scraper or a paid LLM every month.

The first breakthrough: Firecrawl does the heavy lifting

Firecrawl’s Python SDK made the scraping step painless, and their free-tier limits are high enough for my requirements. A tiny wrapper takes a URL and returns clean Markdown:

import os
import sys

from firecrawl import Firecrawl


def main():
    if len(sys.argv) < 2:
        print("Usage: scrape_firecrawl.py URL", file=sys.stderr)
        sys.exit(1)

    url = sys.argv[1]
    api_key = os.getenv("FIRECRAWL_API_KEY")

    fc = Firecrawl(api_key=api_key)
    try:
        # Ask Firecrawl for a Markdown rendering of the page
        doc = fc.scrape(url, formats=["markdown"])
    except Exception as e:
        sys.exit(f"Error scraping {url}: {e}")

    print(doc.markdown)


if __name__ == "__main__":
    main()

Running it against the ranking page gives me a tidy Markdown blob that still contains a lot of noise, but it’s a far better starting point than raw HTML. I want to store the scraped content in plaintext and forward it to the LLMs as-is. For this requirement, Markdown seemed the best of the available formats.


Using LLMs to extract news source URLs from the scraped data

Before diving into the technical details, I'd like to add that so far my AI usage had been limited to Claude CLI for high-level project management, the Copilot VS Code extension for code generation, and various chat services (ChatGPT, Gemini, Kimi, etc.) for questions or learning new stuff. None of them, however, were going to meet my new requirements. This, and the fact that I love free stuff, led me to OpenRouter, and I'm so glad I signed up. Shout-out to them for their generous free-tier limits and for giving people like me the option to use these new, super-capable models for free.

Coming back to the main problem, the next step is where OpenRouter shines. I wrote a small helper script that:

  1. Loads a Markdown-processing skill from my repo, which I wrote (using the power of googling) to help the model process MD files. I have some other skills there as well that I've been experimenting with to help me write these blogs; maybe I'll cover them later 😉
  2. Sends the scraped Markdown file content to three free models (gemma-4-31b-it, nemotron-3-nano-omni-30b, gemma-4-26b) until one returns a non-empty answer. Requests to the OpenRouter API sometimes fail when a model doesn't respond, so I iterate over these three models; so far at least one of them has always worked.
  3. Asks the model to list the top-10 latest most popular news outlets and output only URLs, one per line.
SOURCE_QUESTION = (
    "What are the top 10 latest most popular news outlets in the world "
    "listed in this document? Only output URLs of these news outlets "
    "separated by new lines. Do not output anything else."
)

# FREE_MODELS_DOC holds the OpenRouter IDs of the three free models mentioned above
def process_raw_doc(md_processing_skill: str, md_doc: str, api_key: str) -> str:
    raw_source_list = ""
    # Try each free model in turn until one returns a non-empty answer
    for model in FREE_MODELS_DOC:
        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": md_processing_skill},
                {"role": "user",
                 "content": f"Here is a web page in Markdown:\n\n{md_doc}\n\n"
                            f"Answer this question:\n{SOURCE_QUESTION}"}
            ],
        }
        raw_source_list = req_openrouter(payload, api_key)
        if raw_source_list:
            break
    if not raw_source_list:
        sys.exit("Error: All models failed.")
    return raw_source_list
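The req_openrouter helper isn't shown above; here's a minimal sketch of what it might look like (my assumption, not the repo's exact code), using OpenRouter's standard chat completions endpoint:

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def req_openrouter(payload: dict, api_key: str) -> str:
    """POST a chat completion request to OpenRouter; return the reply text or ''."""
    try:
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            json=payload,
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"].strip()
    except Exception:
        # Treat any failure (HTTP error, empty reply) as "this model didn't answer"
        # so the caller can fall through to the next model
        return ""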

The model’s answer still contains stray characters and occasional combined or shortened URLs.

[Image: new source URLs on the web page]

To get a clean list of URLs, I run a second OpenRouter call that uses the built-in web-search tool to validate each line and verify that each URL is correct. Here is the payload for the second OpenRouter request:

content = (f"Here is a raw list of URLs of news outlets with each line containing one or more unformatted URLs:\n\n"
    f"{raw_source_list}\n\n"
    f"Use web search to access these URLs and discard those that are invalid. Do not scrape the web page, only check if the URL is valid\n"
    "Based on successful web searches, return a list of corresponding properly formatted URLs without any extra test. Keep only one URL per line."
)
payload = {
    "model": model,
    "messages": [
        {
            "role": "user",
            "content": content
        },
    ],
    "tools": [
        {"type": "openrouter:web_search"}
    ]
}
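The post doesn't show the follow-up step, but reusing the req_openrouter helper sketched earlier, turning the response into a Python list could be as simple as this (an assumption on my part):

# Send the validation payload and keep only lines that look like URLs
verified = req_openrouter(payload, api_key)
source_urls = [line.strip() for line in verified.splitlines()
               if line.strip().startswith("http")]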

From source URLs to proper Source YAML

With a verified list of URLs in hand, the next step is to generate a full Source document for each outlet. To achieve that goal, yes, you guessed it: another OpenRouter API call.
I add an already-ingested Source YAML doc as a schema example, then ask the model to fill in the blanks by using the web-search tool again:

content = (f"Extract schema from the following yaml document and store it as source_schema:\n\n"
    f"{sample}\n\n"
    f"Following is a list of urls of media outlets separated by new lines\n\n"
    f"{sources}\n\n"
    "Use web search to fetch information about these media outlets and create yaml docs for each of them following the source_schema schema. Do not output anything except for the yaml documents for these medial outlets separated by ---."
)

payload = {
    "model": model,
    "messages": [
        {
            "role": "user",
            "content": content
        },
    ],
    "tools": [
        {"type": "openrouter:web_search"}
    ]
}
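Each generated doc comes back looking roughly like this (a hypothetical example with made-up values; the real schema lives in the repo, but name and uri are the fields the dedup step relies on):

name: Example News
uri: https://www.example-news.com
description: A short, model-generated summary of the outlet.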

The response is a string of YAML manifests separated by ---. I split, filter, and deduplicate against existing files to avoid overwriting anything.
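The split itself is a one-liner; a minimal sketch, assuming response holds the model's reply and a standalone --- line separates the docs:

import re

# Split the model's reply on standalone '---' lines into individual YAML docs
src_docs = [chunk.strip()
            for chunk in re.split(r"(?m)^---\s*$", response)
            if chunk.strip()]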


Writing the new files (safely)

The deduplication logic avoids creating ingestion docs for sources that already exist in the repository. It loads every YAML under sources/, compares name and uri, and returns only truly new entries (a sketch of that check follows the write loop below). The final loop writes each doc to a sanitized filename:

for doc_str in unique_src_docs:
    parsed = yaml.safe_load(doc_str)
    filename = parsed.get('name')
    if not filename:
        continue
    # Sanitize the source name into a safe filename
    filename = re.sub(r"[^A-Za-z0-9._-]+", "-", filename.strip()) + ".yaml"
    path = os.path.join(sources_dir, filename)
    if os.path.exists(path):
        # Never overwrite an existing source doc
        continue
    with open(path, 'w', encoding='utf-8') as f:
        f.write(doc_str.strip() + "\n")
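The dedup check itself runs before this loop; here's a minimal sketch of what it might look like (filter_new_docs is my hypothetical name, and the repo's actual helper may differ):

import glob
import os

import yaml

def filter_new_docs(src_docs: list[str], sources_dir: str) -> list[str]:
    """Keep only docs whose name/uri pair isn't already under sources/."""
    existing = set()
    for path in glob.glob(os.path.join(sources_dir, "*.yaml")):
        with open(path, encoding="utf-8") as f:
            doc = yaml.safe_load(f)
        if doc:
            existing.add((doc.get("name"), doc.get("uri")))

    unique = []
    for doc_str in src_docs:
        parsed = yaml.safe_load(doc_str)
        if parsed and (parsed.get("name"), parsed.get("uri")) not in existing:
            unique.append(doc_str)
    return unique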

All files land in sources/ ready for the existing validation workflow.


The actual automation: GitHub Actions workflow

Now that all the scripts are ready, it's time for the final piece of the puzzle: the scheduled CI job. It:

  1. runs on the first day of each month,
  2. checks out the repo,
  3. creates a new branch,
  4. sets up Python,
  5. runs a tiny wrapper shell script that strings the three Python helpers together,
  6. commits any new YAML files,
  7. and opens a PR, as sketched below.
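
Here's a minimal sketch of such a workflow; the action versions, branch name, script path, and the peter-evans/create-pull-request step are my assumptions, not necessarily what the repo uses:

name: monthly-source-refresh

on:
  schedule:
    - cron: "0 0 1 * *"   # first day of each month, 00:00 UTC
  workflow_dispatch:       # allow manual runs too

jobs:
  refresh-sources:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Generate new source docs
        env:
          FIRECRAWL_API_KEY: ${{ secrets.FIRECRAWL_API_KEY }}
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
        run: ./scripts/refresh_sources.sh

      - name: Open a PR with any new YAML files
        uses: peter-evans/create-pull-request@v6
        with:
          branch: auto/source-refresh
          title: "Add new sources (monthly refresh)"
          commit-message: "Add new sources from monthly refresh"

Note that create-pull-request handles the branch creation and the commit on its own, which covers steps 3 and 6.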

When the PR lands, the existing validate.yml workflow validates the new YAML files; once the new source docs are merged, the post_on_merge.yml workflow posts them to the API, ready to be fetched by the live dashboard.


The results so far

All the new sources appear with correct URLs and short descriptions, and the CI passes without a hitch. The whole process now takes less than a minute of human time each month, and that's just to merge the PR 😁

[Image: updated live dashboard]


What’s next?

Adding sources was the (relatively) low-hanging fruit. The next challenge is to replicate the same pattern for claims and proofs: a larger data set, more complex validation, and a higher chance of model hallucination.

Completing this goal led me to discover OpenRouter and learn how to use AI agent skills. I can't wait to see what the next challenge brings.


Conclusion

By chaining Firecrawl, OpenRouter, and GitHub Actions, I turned a tedious, error-prone task into a reliable monthly automation. The result is instantly visible on the live dashboard, and the repo stays in sync with the world’s most popular news outlets.
If you try this yourself and have a suggestion or question, feel free to open an issue or drop a comment.

[Image: OpenRouter token usage]

