Make a Pocket app like html parser in Python

#python #pocket

This tutorial will show how to extract only the relevant html from any article or blog post by their URL in Python.

Most of us have used Pocket app on our phones or browsers to save links and read them later. It is kind of like a bookmark app but which also saves the link's contents. After adding the link to Pocket, you can see that it extracts only the main content of the article and discards other things like the websites's footer, menu, sidebar (if any), user comments, etc. We will not be getting into the algorithm related to identifying the html tag with the most amount of text content. You can read the discussion here on stackoverflow about the algorithms.

Newspaper

Newspaper is a Python module that deals with extracting text/html from URLs. Besides this, it can also extract article's title, author, publish time, images, videos etc. And if used in conjunction with nltk it can also extract article's keywords and summary.

In this tutorial, we will be using newspaper to extract the html tag containing the relevant contents of the URL, and then we will expose this functionality on web using flask.

Get started

We will be working in the global Python environment for simplicity of the tutorial. But you should do all the process described below in a virtual environment.

Install newspaper and flask using pip.
- For python2.x, pip install newspaper flask
- For python3.x, pip install newspaper3k flask
- If there is some error while installing newspaper you can read the detailed guide specific to your platform here.
Create a file called extractor.py with the following code.

from newspaper import Article, Config

config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()
    article.parse()
    return dict(
        title=article.title,
        text=article.text,
        html=article.html,
        image=article.top_image,
        authors=article.authors,
    )

This is the only code we need for the main part thanks to newspaper which does all the heavy lifting. It identifies the html element containing the most relevant data, cleans it up, removes any script, style and other irrelevant tags that are not likely to make up the main article.

In the above code Article is newspaper's way of representing the article from a URL. By default, newspaper does not save the article content's html to save some extra processing. That is why we are importing Config from newspaper and creating a custom configuration telling newspaper to keep the article html.

In the extract function that accepts a url, we first create an Article instance passing in the url and custom config.
Then we download the full article html with article.download(). newspaper still hasn't processed the full html yet.
Now we call article.parse(). After this, all the necessary data related to the article in url will be generated. The data includes article's title, full text, author, image etc.
Then we return the data that we need in a dict.

Exposing as an API

Now that we have created the functionality to extract articles, we will be making this available on the web so that we can test it out in our browsers. We will be using flask to make an API. Here is the code.

from flask import (
    Flask,
    jsonify,
    request
)

from extractor import extract

app = Flask(__name__)


@app.route('/')
def index():
    return """
    <form action="/extract">
        <input type="text" name="url" placeholder="Enter a URL" />
        <button type="submit">Submit</button>
    </form>
    """


@app.route('/extract')
def extract_url():
    url = request.args.get('url', '')
    if not url:
        return jsonify(
            type='error', result='Provide a URL'), 406
    return jsonify(type='success', result=extract(url))


 if __name__ == '__main__':
     app.run(debug=True, port=5000)

We first create a Flask app.
In its index route we return a simple form with text input where users will paste the url and then submit the form to the /extract route specified in action.
In the extract_url function, we get the url from request.args and check if it is empty. If empty, we return an error. Otherwise will pass the url to our extract function that we created and then return the result using jsonify.
Now you can simply run python app.py and head over to http://localhost:5000 in your browser it test the implemetation.

Full Code

Few things to remember

Instead of checking the url for just empty string, we can also use regex to verify that the url, is in fact a url.
We should also check that the url is not for our own domain as it will lead to an infinite loop calling the same extract_url function again and again.
newspaper will not always be able to extract the most relevant html. Its functionality completely depends on how the content is organized in the source url's website. Sometimes, it may give one or two extra paragraphs or sometimes less. But for most of the standard news websites and blogs, it will always return the most relevant html.

Things to do next

The above demonstration is a very simple application that takes in a URL, returns the html and then forgets about it. But to make this more useful, you can take it a step further by:

Adding a database
Save the url and its extracted contents so that you can return the result from DB if the same URL is provided again.
You can add some more advanced APIs like returning a list of most recent queried/saved URLs and their title and other contents.
Then you can use the API service to create a web or android/ios app similar in features to what Pocket is.