Building a Serverless News Articles Monitor

#news #scraper #serverless #awslambda

Having our own API monitoring news sources in a programmatic way can be handy. Being able to extract news articles in a structured format and enriched with metadata would be even better.

That’s exactly what we’ve done and published freely for you to use: a news articles monitor that runs on low-cost, easy-to-maintain serverless infrastructure.

The app is powered by the Python Newspaper library and built on top of the Chalice framework to run on AWS Lambda, with Dashbird monitoring it for peace of mind. It’s released under the MIT License, so you can use it freely and modify it however you like. Take a look at the source code if you’re curious about how it works under the hood; it’s actually quite simple.

Quick Demo

We even published a live demo for you to try. Run these URLs in your browser to experiment:

API Endpoint

The API has basically one endpoint. It’s intended for demonstration use only. If you plan to use it regularly and/or at large scale, we kindly ask you to fork/clone our repository and run it in your own AWS account. ;)

Note: this demo API has request throttling in place. Please don’t abuse it (3 requests/min should be fine); otherwise the API will be shut down temporarily for everyone else trying it. Our Lambda function is also configured to time out after 5 seconds and to serve only one request at a time.

https://vt7xjvnaw1.execute-api.us-east-1.amazonaws.com/Stage/{action}/{param}/

The action attribute takes three possible values:

build: collect the latest articles and additional data about a given news source
parse-article: parse and extract a specific article
get-meta: provide metadata about the current state of the news environment

Each action expects its own value in the param attribute:

The build and parse-article actions expect a URL string as param
The get-meta action expects one or both of these values: hot_topics and/or popular_urls. To request both at the same time, separate by comma: hot_topics,popular_urls.
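
The action and param rules above can be sketched as a small URL builder. This is only an illustration: the function name `build_request_url` is hypothetical, and percent-encoding the target URL into a single path segment is an assumption about how the endpoint expects its param, so check the source code before relying on it.

```python
from urllib.parse import quote

# Demo endpoint from this article; swap in your own deployment's base URL.
BASE_URL = "https://vt7xjvnaw1.execute-api.us-east-1.amazonaws.com/Stage"

URL_ACTIONS = {"build", "parse-article"}        # actions whose param is a URL
META_VALUES = {"hot_topics", "popular_urls"}    # values accepted by get-meta


def build_request_url(action, param):
    """Return the endpoint URL for an action and its param."""
    if action in URL_ACTIONS:
        # Assumption: the target URL is percent-encoded so that it fits
        # into one path segment. The real API may expect something else.
        encoded = quote(param, safe="")
    elif action == "get-meta":
        values = [value.strip() for value in param.split(",")]
        unknown = [value for value in values if value not in META_VALUES]
        if unknown:
            raise ValueError(f"unsupported get-meta value(s): {unknown}")
        encoded = ",".join(values)
    else:
        raise ValueError(f"unknown action: {action}")
    return f"{BASE_URL}/{action}/{encoded}/"
```

Calling `build_request_url("get-meta", "hot_topics,popular_urls")` returns the base URL followed by `/get-meta/hot_topics,popular_urls/`, while an unknown action or get-meta value raises a `ValueError` before any request is sent.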

Data Extracted

The build action returns basic information about the news source and a list of its latest articles. The get-meta action is very simple: it provides a list of hot keywords and a list of news source URLs.
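
To give a rough idea of what sits behind these two actions, here is a minimal sketch using the Newspaper library directly. The helper names (`describe_source`, `build_source`, `get_meta`) and the returned field names are our own assumptions for illustration; the actual app may shape its responses differently.

```python
def describe_source(paper):
    """Summarize a built Newspaper Source object as a plain dict."""
    return {
        "brand": paper.brand,
        "description": paper.description,
        "article_urls": [article.url for article in paper.articles],
    }


def build_source(url):
    """Build a news source from its home page URL (needs network access)."""
    import newspaper  # third-party package: pip install newspaper3k

    # memoize_articles=False makes every call return all articles found,
    # instead of only the ones not seen in a previous run.
    paper = newspaper.build(url, memoize_articles=False)
    return describe_source(paper)


def get_meta(values):
    """Collect the requested metadata, e.g. ["hot_topics", "popular_urls"]."""
    import newspaper

    providers = {
        "hot_topics": newspaper.hot,
        "popular_urls": newspaper.popular_urls,
    }
    return {name: providers[name]() for name in values}
```

The library import lives inside the functions that need it, so `describe_source` can be exercised with a stand-in object without installing anything.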

Parse-article is a bit more exciting. It will automatically extract the article data in a structured, machine-readable format: title, publishing date, text, etc. Some metadata is also extracted, such as authors and images.

What’s cool is that the Newspaper library uses NLP features from NLTK to generate content based on the original article: it summarizes the article text into a short version of a few sentences and identifies the most relevant keywords. This could come in handy when you’re trying to aggregate articles from multiple sources by topic, for example, or could make it easier to index articles by subject.
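
The parse-article flow described above looks roughly like this when calling the Newspaper library yourself. Treat it as a sketch under assumptions: `article_to_dict` and `parse_article` are hypothetical names, and the output keys are our choice rather than the app's actual response format.

```python
def article_to_dict(article):
    """Flatten a parsed Newspaper Article into JSON-friendly fields."""
    published = article.publish_date  # a datetime, or empty when not found
    return {
        "title": article.title,
        "authors": list(article.authors),
        "publish_date": published.isoformat() if published else None,
        "text": article.text,
        "top_image": article.top_image,
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


def parse_article(url):
    """Download, parse, and summarize one article (needs network access)."""
    from newspaper import Article  # third-party: pip install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    # nlp() fills in the keywords and summary; it relies on NLTK's
    # "punkt" tokenizer data being available.
    article.nlp()
    return article_to_dict(article)
```

Converting the publishing date to an ISO 8601 string keeps the result directly serializable with `json.dumps`, which is convenient when returning it from a Lambda function.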

Why Serverless

We chose to deploy the application on a FaaS (Function as a Service) infrastructure because it greatly simplifies infrastructure setup and management.

Depending on how many news sources and how many articles you plan to extract with this app, a traditional server-based infrastructure could quickly become cumbersome to maintain. And if you’re planning to use it only occasionally, serverless will save you a LOT of money, because you won’t pay a penny for idle time.

By using AWS Lambda we also have a fleet of IP addresses at our disposal. Each Lambda invocation could (not necessarily, but likely) run in a new container with a different IP address, which should help spread the requests sent to any given news source across multiple addresses.

We integrated our function with Dashbird.io, so that we can leave the app running on auto-pilot. Dashbird scans our logs and alerts us by email or Slack in case anything goes south. It also employs anomaly detection algorithms to point out odd behavior or performance degradation. Since usage for this demo is going to be low, Dashbird even does all that for free!

Renato Byrro is a Developer Advocate for Dashbird. You can follow him on Twitter.