Jayson DeLancey

Posted on Jan 6, 2019 • Originally published at developer.here.com

Turn Text Into HERE Maps with Python NLTK

#python #nltk #maps #heredev

Originally published at developer.here.com so check out the original too.

What is in your Top 5 travel destinations? I asked my 8-year-old son this recently and the response surprised me. The answer was very specific and unfamiliar.

OK, I know Venice but I’m a software engineer in Northern California and haven’t studied geography in quite some time. I have to admit I have no idea where some of those places are.

Maybe you have found yourself reading a travel article that beautifully describes a glamorous locale, places to stay, sights to visit, and where to eat. These articles can be swell, but a simple map can go a long way toward adding the context of place that even expertly crafted prose cannot provide. If the author or publisher didn’t include a map for you and you don’t have an 8-year-old geography savant in your house, Python can help.

Solution

To solve this problem we will stand up a Python Flask server that exposes a few APIs to:

Download a given URL and parse the HTML with BeautifulSoup.
Extract locations from the text based on some clues with the Natural Language Toolkit (NLTK).
Geocode the location to determine a latitude and longitude with the HERE Geocoder API.
Place markers on a map to identify the recognized places with the HERE Map Image API.

Server Setup

For this section I assume you are running an environment like OSX or Linux. If you are running Windows you will need to adjust some of the commands a bit.

Configuration

The Twelve-Factor App makes the case that storing config in the environment is a best practice. I agree and like to store my API credentials in the variables APP_ID_HERE and APP_CODE_HERE, defined in a file called HERE.sh.

#!/bin/bash
export APP_ID_HERE=your-app-id-here
export APP_CODE_HERE=your-app-code-here

I source it into my environment with . HERE.sh to avoid any hard-coded credentials accidentally being released with my source.
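A forgotten `. HERE.sh` is easy to miss until a request fails later on. As a minimal sketch, a helper like the one below could fail fast at startup when a variable is missing. The name `require_env` is hypothetical and is not part of the original project.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if unset or empty.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Calling `require_env('APP_ID_HERE')` while building the config would then surface the problem immediately instead of sending an empty credential to the API.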

Structure

The web server component needs several files, summarized in the listing below. Start by running mkdir -p app/api_1_0 so that both directories are created.

├── app
│   ├── __init__.py
│   └── api_1_0
│       ├── __init__.py
│       ├── demo.py
│       └── health.py
├── HERE.sh
├── manage.py
├── config.py
└── requirements.txt

If you aren't using Virtual Environments for Python you should be. You can find more from the Hitchhiker's Guide to Python to get off on the right footing. Initialize your environment with pip install -r requirements.txt, where requirements.txt contains the following dependencies.

beautifulsoup4
Flask
Flask-Script
gunicorn
nltk
requests

App

We need manage.py as the main entrypoint to our application. It looks like the following listing:

import os
import app
from flask_script import Manager, Server

app = app.create_app('default')
manager = Manager(app)

if __name__ == '__main__':
  # os.environ is a mapping, not a function; env values are strings, so cast to int
  port = int(os.environ.get('PORT', 8000))
  manager.add_command('runserver', Server(port=port))
  manager.run()

I've left out a few niceties like logging and printing the URL for brevity. This isn't particularly interesting to our task and is just some housekeeping to run a simple server for our APIs.

The config.py is also important for pulling in some of those environment variables we'll need to reference later.

import os

class Config(object):
  SECRET_KEY = os.environ.get('FLASK_SECRET_KEY')
  APP_ID_HERE = os.environ.get('APP_ID_HERE')
  APP_CODE_HERE = os.environ.get('APP_CODE_HERE')

  @staticmethod
  def init_app(app):
    pass

config = {'default': Config}

Unlike other Python projects, our init files are pretty important on this one. In app/__init__.py we define the create_app function we saw in manage.py.

from config import config
from flask import Flask

def create_app(config_name):
  app = Flask(__name__)
  app.config.from_object(config[config_name])
  config[config_name].init_app(app)

  from .api_1_0 import api as api_1_0_blueprint
  app.register_blueprint(api_1_0_blueprint, url_prefix='/api/1.0')

  return app

This gives us nice clean API versioning for any resources in our API. We also need to define app/api_1_0/__init__.py with some configuration:

from flask import Blueprint
api = Blueprint('api', __name__)

from . import health
from . import demo

As you can see, we need to import each module we create so that its routes are registered on the blueprint.

Healthcheck

To make sure our server is running properly we can add a quick healthcheck endpoint in the file app/api_1_0/health.py.

from flask import jsonify
from flask import current_app as app
from . import api

@api.route('/health', methods=['GET'])
def handle_health():
  return jsonify({
    'hello': 'world',
    'app_id_here': app.config['APP_ID_HERE'],
    'app_code_here': app.config['APP_CODE_HERE']
    })

At this point we should be able to run python manage.py runserver and have proof of life. If you use your browser to go to http://localhost:8000/api/1.0/health you should get a response that confirms our server is up and has our app_id and app_code properly configured.

You may not want to display these credentials once you hit production, but it is fine while we're at a "hello world" stage.
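If you would rather not echo full credentials even during development, one option is to mask them before returning them. The sketch below is an assumption about how that might look; `mask_secret` is a hypothetical helper and not part of the original project.

```python
def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters of a secret.

    Hypothetical helper: returns None when the value is missing, so the
    health response still reveals whether the variable was configured.
    """
    if not value:
        return None
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

The health handler could then return `mask_secret(app.config['APP_ID_HERE'])` instead of the raw value.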

Text

For the purposes of getting started I will use a simple text file with just our locations from before.

Mdina
Aswan
Soro
Gryfino
Venice

For more complex data sets to test with I recommend just trying out something from the New York Times, Wall Street Journal, or BBC travel sections.

Extract

We need to extract text from HTML and tokenize any words found that might be a location. We will define a method to handle requests for the resource /tokens so that we can look at each step independently.

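import tempfile

import requests
from bs4 import BeautifulSoup
from flask import current_app as app
from flask import jsonify, request
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

from . import api

# Reuse a single HTTP session for all outbound requests
session = requests.Session()

@api.route('/tokens', methods=['GET'])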
def handle_tokenize():
  # Take URL as input and fetch the body
  url = request.args.get('url')
  response = session.get(url)

  # Parse HTML from the given URL
  body = BeautifulSoup(response.content, 'html.parser')

  # Remove JavaScript and CSS from our life
  for script in body(['script', 'style']):
    script.decompose()

  text = body.get_text()

  # Ignore punctuation
  tokenizer = RegexpTokenizer(r'\w+')

  # Ignore duplicates
  tokens = set(tokenizer.tokenize(text))

  # Remove any stop words
  stop_words_set = set(stopwords.words())
  tokens = [w for w in tokens if w not in stop_words_set]

  # Now just get proper nouns
  tagged = pos_tag(tokens)
  tokens = [w for w,pos in tagged if pos in ['NNP', 'NNPS']]

  return jsonify(list(tokens))

Before this will work, we need to download NLTK resources. This is a one-time operation you can do in an interactive python shell or by executing a simple script.

$ python
...
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('averaged_perceptron_tagger')

With this demo.py in place we can restart our server and call the following endpoint to get back a list of any terms that could potentially be locations.

#!/bin/bash
# Quote the URL so the shell does not interpret ? or & characters
curl "http://localhost:8000/api/1.0/tokens?url=$1"

This returns a response from one sample article that looks like:

["Canada", "Many", "Conners", "Kelly", "Americans", "Biodinamica", "Milan", "Fabio", ... ]

We've trimmed the word count down dramatically from the original article, but there is still more work that could be done to fine-tune this recognition process. This is good enough for a first pass without adding more complexity, so let's see if we can start recognizing these places with the geocoder.
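For readers who want to see the filtering steps in isolation before moving on, here is a rough standard-library approximation. It is not the NLTK pipeline: a tiny hand-written stop-word set stands in for the stopwords corpus, and a capitalization check stands in for part-of-speech tagging. The name `candidate_places` is hypothetical.

```python
import re

# Tiny illustrative stop-word set; NLTK's stopwords corpus is far larger.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "we", "is", "many"}


def candidate_places(text, stop_words=STOP_WORDS):
    """Return sorted, unique, capitalized words that are not stop words."""
    # Same pattern the handler passes to RegexpTokenizer: runs of word characters
    words = re.findall(r"\w+", text)
    # A set drops duplicates, as in the handler
    unique = set(words)
    # Compare lowercase so capitalized stop words such as "Many" are removed;
    # the capitalization check is a crude stand-in for proper-noun tagging
    return sorted(
        w for w in unique
        if w[0].isupper() and w.lower() not in stop_words
    )
```

Comparing in lowercase is one small refinement worth considering for the real handler as well, since a capitalized stop word would otherwise slip past a lowercase stop-word list.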

Geocode

The HERE Geocoder API takes a human-readable location and turns it into geocoordinates. If you put in an address, you get back a latitude and longitude.

Here's the listing for a geocoder endpoint:

@api.route('/geocode', methods=['GET'])
def handle_geocode():
  uri = 'https://geocoder.api.here.com/6.2/geocode.json'
  headers = {}
  params = {
    'app_id': app.config['APP_ID_HERE'],
    'app_code': app.config['APP_CODE_HERE'],
    'searchtext': request.args.get('searchtext')
  }

  response = session.get(uri, headers=headers, params=params)
  return jsonify(response.json())

Restart the python webserver and send a request for a city like "Gryfino":

#!/bin/bash
# Quote the URL so the shell does not interpret ? or & characters
curl "http://localhost:8000/api/1.0/geocode?searchtext=$1"

Among other things, the response includes the position where I might place a marker to show this location on a map.

"DisplayPosition": {
  "Latitude": 53.25676,
  "Longitude": 14.48947
},
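To pull that position out of the full response, something like the following sketch could work. It assumes the Geocoder 6.2 JSON nests matches under Response, View, Result, and Location; only the DisplayPosition fragment appears in this article, so treat the outer keys as an assumption to verify against a real response. The function name is hypothetical.

```python
def first_display_position(payload):
    """Return (latitude, longitude) of the first geocoder match, or None.

    Assumes the nesting Response -> View[] -> Result[] -> Location ->
    DisplayPosition; check this against an actual API response.
    """
    views = payload.get("Response", {}).get("View", [])
    for view in views:
        for result in view.get("Result", []):
            position = result.get("Location", {}).get("DisplayPosition")
            if position:
                return position["Latitude"], position["Longitude"]
    return None
```

Returning None for an empty result set lets the caller skip tokens that turned out not to be places.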

Map

Finally, we're going to take the latitude and longitude we received from our geocode request and generate a simple render with the HERE Map Image API.

This listing looks like the following:

@api.route('/mapview', methods=['GET'])
def handle_mapview():
  uri = 'https://image.maps.api.here.com/mia/1.6/mapview'
  headers = {}
  params = {
    'app_id': app.config['APP_ID_HERE'],
    'app_code': app.config['APP_CODE_HERE'],
    'poi': request.args.get('poi')
  }

  response = session.get(uri, headers=headers, params=params)
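  # mktemp() is deprecated and race-prone; create the file directly and
  # keep it on disk (delete=False) so the returned path stays valid
  with tempfile.NamedTemporaryFile(delete=False) as image_file:
    image_file.write(response.content)
    image_path = image_file.name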

  return image_path

For simplicity and brevity I haven't included any of the error / response handling you should do here. I've also cheated a bit by just storing the image to the local filesystem for illustration.

Calling this endpoint with a comma-separated list of latitude,longitude pairs in the poi parameter generates a map with a marker at each location.
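Building that poi value by hand gets tedious with more than a couple of places. A small sketch of a helper is shown below; `build_poi` and `mapview_url` are hypothetical names, and the base URL simply assumes the local server from this article.

```python
from urllib.parse import urlencode


def build_poi(positions):
    """Flatten (latitude, longitude) pairs into 'lat1,lon1,lat2,lon2,...'."""
    return ",".join(f"{lat},{lon}" for lat, lon in positions)


def mapview_url(positions, base="http://localhost:8000/api/1.0/mapview"):
    """Return a request URL for the local /mapview endpoint (assumed base)."""
    # urlencode percent-encodes the commas, which Flask decodes again
    return f"{base}?{urlencode({'poi': build_poi(positions)})}"
```

For Gryfino alone, `build_poi([(53.25676, 14.48947)])` yields the string 53.25676,14.48947.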

Place names without additional context can be ambiguous, so in some cases there was more than one match. This map only shows the first match, despite how much fun Venice Beach may be.

Summary

The reason for making /tokens, /geocode, and /mapview separate endpoints is that it illustrates how you might set up microservices with Python + Flask for each operation you want to perform. This would allow a deployment to scale them independently.
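Because the steps are independent, a client is free to glue them together however it likes. The sketch below shows one possible shape for that glue, with the geocoding step passed in as a plain callable so it can be a real HTTP call or a stub. `build_map_request` is a hypothetical name, and the callable's contract (return a latitude, longitude tuple or None) is an assumption.

```python
def build_map_request(tokens, geocode):
    """Geocode each token and return a poi string for those that resolved.

    `geocode` is any callable taking a place name and returning a
    (latitude, longitude) tuple, or None when there is no match.
    """
    coords = []
    for token in tokens:
        position = geocode(token)
        if position is not None:
            # Keep latitude and longitude adjacent, in token order
            coords.extend(position)
    return ",".join(str(c) for c in coords)
```

Tokens that are not places, such as people's names, simply drop out when the geocoder returns no match.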

You can find the full source code listing in the GitHub project.

For an extra dose of inception, try running the server and processing this article itself.
