This tutorial shows how to extract only the relevant html from an article or blog post, given its URL, in Python.
Most of us have used the Pocket app on our phones or browsers to save links and read them later. It is like a bookmarking app that also saves the link's contents. After adding a link to Pocket, you can see that it extracts only the main content of the article and discards everything else, such as the website's footer, menu, sidebar (if any), user comments, etc. We will not get into the algorithm for identifying the html tag with the most text content. You can read the discussion here on Stack Overflow about the algorithms.
Newspaper is a Python module that extracts text/html from URLs. Besides this, it can also extract an article's title, author, publish time, images, videos, etc. Used in conjunction with nltk, it can also extract an article's keywords and summary.
We will work in the global Python environment to keep the tutorial simple, but you should do everything described below in a virtual environment.
pip install newspaper flask      # Python 2
pip install newspaper3k flask    # Python 3
- If you get an error while installing newspaper, you can read the detailed guide specific to your platform here.
Create a file called extractor.py with the following code.
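from newspaper import Article, Config

# Ask newspaper to keep the cleaned html of the main article body.
config = Config()
config.keep_article_html = True


def extract(url):
    article = Article(url=url, config=config)
    article.download()  # fetch the raw page html
    article.parse()     # extract title, text, authors, images, ...
    return dict(
        title=article.title,
        text=article.text,
        # article.html is the whole page; article_html holds only the
        # extracted article body when keep_article_html is True.
        html=article.article_html,
        image=article.top_image,
        authors=article.authors,
    )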
This is the only code we need for the main part, thanks to newspaper, which does all the heavy lifting. It identifies the html element containing the most relevant data, cleans it up, and removes any style and other irrelevant tags that are not likely to make up the main article.
In the above code, Article is newspaper's way of representing the article from a URL. By default, newspaper does not keep the article content's html, to save some extra processing. That is why we import Config from newspaper and create a custom configuration telling newspaper to keep the article html.
- In the extract function, which accepts a url, we first create an Article instance, passing in the url and our config.
- Then we download the full article html with article.download(). At this point newspaper still hasn't processed the html.
- Now we call article.parse(). After this, all the necessary data related to the article at url is available. The data includes the article's title, full text, authors, image, etc.
- Then we return the data that we need in a dict.
Now that we have the functionality to extract articles, we will make it available on the web so that we can test it in our browsers. We will use flask to make an API. Here is the code, saved as app.py.
from flask import (
    Flask,
    jsonify,
    request
)

from extractor import extract

app = Flask(__name__)


@app.route('/')
def index():
    return """
    <form action="/extract">
        <input type="text" name="url" placeholder="Enter a URL" />
        <button type="submit">Submit</button>
    </form>
    """


@app.route('/extract')
def extract_url():
    url = request.args.get('url', '')
    if not url:
        return jsonify(
            type='error', result='Provide a URL'), 406
    return jsonify(type='success', result=extract(url))


if __name__ == '__main__':
    app.run(debug=True, port=5000)
- We first create a Flask app.
- In its index route we return a simple form with a text input where users paste the url. Submitting the form sends it to the /extract route specified in the form's action.
- In the extract_url function, we get the url from request.args and check if it is empty. If it is empty, we return an error. Otherwise we pass the url to the extract function that we created and return the result using jsonify.
- Now you can simply run python app.py and head over to http://localhost:5000 in your browser to test the implementation.
- Instead of only checking the url for an empty string, we can also use a regex to verify that the url is in fact a url.
- We should also check that the url does not point to our own domain, as that would lead to an infinite loop calling the same extract_url function again and again.
newspaper will not always be able to extract the most relevant html. How well it works depends on how the content is organized on the source url's website. Sometimes it may return one or two extra paragraphs, and sometimes fewer. But for most standard news websites and blogs, it usually returns the most relevant html.
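The two url checks suggested above could be sketched roughly as follows. This is only one possible approach: it uses the standard library's urllib.parse instead of a regex, and the helper names (is_valid_url, is_own_url) and the OWN_HOSTS set are hypothetical names chosen for illustration, not part of newspaper or flask.

```python
from urllib.parse import urlparse

# Assumption: the hosts this app is served from. Adjust for your deployment.
OWN_HOSTS = {"localhost", "127.0.0.1"}


def _hostname(url):
    """Return the lowercased host of url, or None if it cannot be parsed."""
    try:
        return urlparse(url).hostname
    except ValueError:
        # urlparse raises ValueError for some malformed inputs.
        return None


def is_valid_url(url):
    """Return True if url looks like an absolute http(s) URL with a host."""
    try:
        scheme = urlparse(url).scheme
    except ValueError:
        return False
    return scheme in ("http", "https") and bool(_hostname(url))


def is_own_url(url, own_hosts=OWN_HOSTS):
    """Return True if url points back at one of our own hosts."""
    return _hostname(url) in own_hosts
```

In extract_url, these checks could run right after the empty-string check, returning the same kind of error response when is_valid_url(url) is false or is_own_url(url) is true.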
The above demonstration is a very simple application that takes in a URL, returns the html and then forgets about it. But to make this more useful, you can take it a step further by:
- Adding a database
- Saving the url and its extracted contents so that you can return the result from the database if the same URL is provided again.
- Adding more advanced APIs, such as returning a list of the most recently queried/saved URLs with their titles and other contents.
- Using the API service to create a web or Android/iOS app similar in features to Pocket.
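As a rough illustration of the database idea, here is a minimal sketch using the standard library's sqlite3 module. The table layout and the function names (open_cache, cached_extract, recent) are assumptions made for this example. The extract argument stands for any function that takes a url and returns a JSON-serializable dict, such as the extract function from extractor.py.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small SQLite cache. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, "
        "data TEXT NOT NULL, "
        "saved_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def cached_extract(conn, url, extract):
    """Return the stored result for url, calling extract(url) only on a miss."""
    row = conn.execute(
        "SELECT data FROM articles WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = extract(url)
    conn.execute(
        "INSERT INTO articles (url, data) VALUES (?, ?)",
        (url, json.dumps(result)),
    )
    conn.commit()
    return result


def recent(conn, limit=10):
    """List the most recently saved URLs with their titles, newest first."""
    rows = conn.execute(
        "SELECT url, data FROM articles "
        "ORDER BY saved_at DESC, rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [{"url": u, "title": json.loads(d).get("title")} for u, d in rows]
```

With something like this in place, the /extract route could call cached_extract(conn, url, extract) instead of extract(url), and a new route could return recent(conn) through jsonify.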
This post was originally published on bitwiser.in