DEV Community

Andreas

My attempt to build a worldwide zip code data set

I always thought that collecting worldwide postal codes myself would be an easy task, because postal codes seem to be nothing more than simple, publicly available short codes. I quickly realized this was not the case, because:

  • There is no single source of truth
  • Most sources are incomplete
  • Data is often presented in an unstructured way

After some general research, I understood that these problems have their origin in the history of postal codes. Each country has its own format, area granularity, and way of structuring postal codes as a whole.
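
To make the format differences concrete, here is a minimal sketch that checks a postal code against a per-country pattern. The patterns are deliberately simplified and the names `POSTCODE_PATTERNS` and `looks_valid` are made up for this illustration; real-world rules have more exceptions than a single regular expression can capture.

```python
import re

# Simplified, illustrative patterns (assumptions, not official definitions).
POSTCODE_PATTERNS = {
    "AT": re.compile(r"[1-9]\d{3}"),            # Austria: 4 digits
    "DE": re.compile(r"\d{5}"),                 # Germany: 5 digits
    "NL": re.compile(r"[1-9]\d{3} ?[A-Z]{2}"),  # Netherlands: 4 digits + 2 letters
    "US": re.compile(r"\d{5}(-\d{4})?"),        # USA: ZIP or ZIP+4
}


def looks_valid(country: str, postcode: str) -> bool:
    """Return True if the postcode matches the simplified pattern for the country."""
    pattern = POSTCODE_PATTERNS.get(country)
    if pattern is None:
        raise KeyError(f"no pattern registered for {country!r}")
    # fullmatch ensures the whole string fits the pattern, not just a prefix
    return pattern.fullmatch(postcode.strip()) is not None


print(looks_valid("AT", "1010"))        # four digits fit the Austrian pattern
print(looks_valid("DE", "1010"))        # too short for the German pattern
print(looks_valid("US", "12345-6789"))  # ZIP+4 form
```

Even this small table shows why one parser cannot cover every country: length, character set, and separators all differ.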

I first tried to scrape Wikipedia. For this post, I will use Austria as the example.

For this, I wrote a small Python script.
Before running it, make sure to install all dependencies:

  • pip3 install lxml
  • pip3 install requests
  • pip3 install bs4

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_in_Austria'

# fire GET request and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()

# parse content
content = BeautifulSoup(response.text, 'lxml')

# get postal codes: keep only list items that contain ' - '
postcodes = [
    postcode.text for postcode in content.find_all('li')
    if ' - ' in postcode.text
]

# filter edge cases: keep entries with 3 or 4 whitespace-separated tokens
# and take the first token (the postal code)
postcodes = [
    postcode.split()[0] for postcode in postcodes
    if len(postcode.split()) in (3, 4)
]

# write output to file ('w' overwrites, so reruns do not duplicate entries)
with open('at_postcodes.txt', 'w') as f:
    for postcode in postcodes:
        f.write(postcode + '\n')
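
The two filtering steps in the script can also be tried without a network request. The sketch below pulls them into a function that works on plain strings; `filter_postcodes` and `SAMPLE_ITEMS` are names invented for this example, and the sample entries are made up rather than copied from Wikipedia.

```python
def filter_postcodes(items: list[str]) -> list[str]:
    """Apply the same two filters as the script to a list of list-item texts."""
    # step 1: keep only entries that contain ' - '
    candidates = [text for text in items if " - " in text]
    # step 2: keep entries with 3 or 4 tokens and take the first token
    return [text.split()[0] for text in candidates if len(text.split()) in (3, 4)]


# Hypothetical list-item texts for illustration only.
SAMPLE_ITEMS = [
    "1010 - Wien",
    "3100 - St. Pölten",
    "4600 - Wels and surrounding villages",
    "Main page",
]

print(filter_postcodes(SAMPLE_ITEMS))  # ['1010', '3100']
```

Note that the third entry is dropped because it has more than four tokens. A token-count filter like this may silently discard valid entries with longer place names, which is one possible source of gaps in the output.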


The obtained data set and the related approach might be enough for some use cases, but since I wanted to get global postal code data, I was not satisfied.

I live in Austria and realized very quickly that the data I had just scraped was not complete (some postal codes are missing). Considering the time it took me to build the parser and the fact that I would have to adapt it for every single data source (adaptations are needed even across Wikipedia, since every article is written differently), I decided to give up.
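
One simple way to notice such gaps is a spot check against a handful of codes you already know. This is only a sketch: `SPOT_CHECK`, `load_codes`, and `missing_codes` are names made up for this example, and the hand-picked set (central Vienna, Linz, Salzburg, Innsbruck, Graz) is nowhere near a complete reference.

```python
from pathlib import Path

# Hand-picked codes for a spot check (an assumption, not a full reference list).
SPOT_CHECK = {"1010", "4020", "5020", "6020", "8010"}


def load_codes(path: str) -> set[str]:
    """Read one postal code per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def missing_codes(scraped: set[str], expected: set[str]) -> list[str]:
    """Return the expected codes that are absent from the scraped set, sorted."""
    return sorted(expected - scraped)


# Example with an in-memory set instead of the real output file.
print(missing_codes({"1010", "5020", "8010"}, SPOT_CHECK))  # ['4020', '6020']
```

For the real output, `load_codes('at_postcodes.txt')` would supply the scraped set. A passing spot check does not prove the data is complete; it only catches obvious holes.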

That was when I started to look for ready-to-use solutions.

I hope this article will save you some time, in case you are trying to achieve the same.
