DEV Community

Angelica Lo Duca

How to Transform Data Extracted from Wikipedia into a Map in Python

In this tutorial I describe a strategy for extracting geographical items, organised in lists, from Wikipedia, and then displaying them on a geographical map. I exploit the following Python libraries:

  • selenium, a Python library for extracting data from any website. For more details on how to use selenium, you can read my previous article, entitled Scraping Data from Nested HTML Pages with Python Selenium.
  • geopy, a Python library that works as a client for the most popular geocoding services. More details can be found in this interesting article by Eunjoo Byeon, entitled Introduction to Geopy: Using Your Latitude & Longitude Data in Python.
  • folium, a Python library for geographical data visualisation. For more details, you can read this interesting article by Dario Radečić, entitled How to Make Stunning Interactive Maps with Python and Folium in Minutes.

As an example, I exploit five Wikipedia pages related to Italian Jewish communities:

  • Communities
  • Museums
  • Cemeteries
  • Ghettos
  • Synagogues

All the considered Wikipedia pages contain a list of items, each representing a geographical entity, i.e. an Italian city. Thus, the idea is to build a geographical map with all the localities extracted from Wikipedia. The procedure is organised in four steps:

  • Data Extraction
  • Data Cleaning
  • Data Enrichment
  • Data Visualisation

1 Data Extraction

In all the considered Wikipedia pages, the localities are represented as bullets of unordered lists. Thus, they can be easily extracted through a common procedure, implemented by means of the selenium library. In order to make your code work, you should install the correct selenium driver for your browser, as explained in this video.

Now, I am ready to write the code.

Firstly, I import the driver:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Then, I define a function called extract_list, which receives as input the URL of the Wikipedia page as well as the XPath expression used to extract data from that page. The function extracts all the text matching that XPath, splits the extracted text into lines and returns the list of items as a result:

from selenium.webdriver.common.by import By

def extract_list(url, xpath):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--lang=it")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    table = []
    # get the list of terms (Selenium 4 API: find_element_by_xpath was removed)
    words = driver.find_element(By.XPATH, xpath).text
    table.extend(words.split('\n'))
    driver.close()
    return table
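The core of extract_list is simply splitting the element's text by newlines. Without launching a browser, that post-processing can be sanity-checked on a sample string (a hypothetical fragment of what selenium's .text attribute would return):

```python
# Hypothetical text that selenium's .text would return for a list of localities
raw_text = "Roma\nVenezia\nFerrara"

# same line-splitting used inside extract_list
table = []
table.extend(raw_text.split('\n'))
print(table)  # → ['Roma', 'Venezia', 'Ferrara']
```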

Now, I can invoke the function for each considered Wikipedia page. First, I define the list of (percent-encoded) page titles:

pages = ['Comunit%C3%A0_ebraiche_italiane', 'Cimiteri_ebraici_in_Italia', 'Musei_ebraici_in_Italia','Ghetti_ebraici_in_Italia','Sinagoghe_in_Italia']
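The page titles above are percent-encoded, exactly as they appear in Wikipedia URLs. If you prefer not to encode titles by hand, the standard library's urllib.parse can convert between the two forms (a small sketch, using one of the titles above):

```python
from urllib.parse import quote, unquote

encoded = 'Comunit%C3%A0_ebraiche_italiane'
title = unquote(encoded)  # decode the percent-escapes
print(title)              # → Comunità_ebraiche_italiane

# and back again: quote escapes the non-ASCII characters,
# while unreserved characters such as '_' are left untouched
assert quote(title) == encoded
```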

Then, I loop over the list of pages and invoke the extract_list function. I also convert each extracted table into a pandas DataFrame and associate with each extracted item a category, derived from the page title (with some stylistic changes):

import pandas as pd
df_dict = {}
xpath = '//*[@id="mw-content-text"]'
table = {}
base_url = 'https://it.wikipedia.org/wiki/'
for page in pages:
    name = page.replace('_', ' ').title().replace('%C3%A0', 'à')
    print(name)
    url = base_url + page
    table[page] = extract_list(url,xpath)
    df_dict[page] = pd.DataFrame(table[page], columns=['value'])
    df_dict[page]['category'] = name

Finally, I build a single DataFrame by concatenating the previously built DataFrames:

df = pd.DataFrame(df_dict[pages[0]])
for i in range(1,len(pages)):
    df = pd.concat([df, df_dict[pages[i]]])
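The same concatenation can also be written in a single call, since pd.concat accepts any iterable of DataFrames (passing ignore_index=True rebuilds a clean 0..n-1 index; drop it if you prefer to keep the per-page indexes). A minimal sketch, using hypothetical stand-ins for two of the per-page DataFrames:

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-page DataFrames
df_a = pd.DataFrame({'value': ['Roma', 'Venezia'],
                     'category': 'Comunità Ebraiche Italiane'})
df_b = pd.DataFrame({'value': ['Ferrara'],
                     'category': 'Sinagoghe In Italia'})

# one call replaces the explicit loop over pages
df_all = pd.concat([df_a, df_b], ignore_index=True)
print(len(df_all))  # → 3
```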

The extracted data contain many errors, which will need to be corrected. For now, I store this first raw dataset as a CSV file:

df.to_csv('data/raw_data.csv')

Continue reading...
https://towardsdatascience.com/how-to-transform-data-extracted-from-wikipedia-into-a-map-in-python-8325dce0710b
