Hyunho Richard Lee for Meadowrun

Use AWS to unzip all of Wikipedia in 10 minutes

This is the first article in a series that walks through how to use Meadowrun to quickly run regular expressions over a large text dataset. This first article covers parsing the Wikipedia dump file format and then walks through using Meadowrun and EC2 to unzip ~67 GB (uncompressed) of articles from the English-language Wikipedia dump file. The second article will cover running regular expressions over the extracted data.

Background and Motivation

The goal of this first post is mostly to walk through creating a large text dataset (the contents of English-language Wikipedia) so that we have a real dataset to work with for the second article in this series. This article also introduces Meadowrun as a tool that makes it easy to scale your Python code out to the cloud.

If you want to understand some of the details of the Wikipedia dataset, start with this article. If you’re interested in generally applicable examples of searching large text datasets very quickly, start with the second article, and come back if you want to be able to follow along using the same dataset.

Unzipping Wikipedia

Wikipedia (as explained here) provides a “multistream” XML dump (caution! that’s a link to a ~19GB file). This is a single file that is effectively multiple bz2 files concatenated together. It’s meant to be read using the “index” file, which is a single bz2 text file whose contents look like:

602:594:Apollo
602:595:Andre Agassi
683215:596:Artificial languages
683215:597:Austroasiatic languages

The first line says that there's an article called "Apollo" with article ID 594, which is in the section of the file starting 602 bytes into the multistream dump. The next article, "Andre Agassi", has article ID 595 and is in the same section starting at 602 bytes (each section has 100 articles). The article after that, "Artificial languages", has article ID 596 and is in the next section, which starts at 683215 bytes.
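
To make this concrete, here's a quick way to pull out a single section with nothing but the standard library, assuming you've downloaded the dump locally and using the offsets from the sample index lines above:

import bz2

# From the sample index entries above: the section containing "Apollo" and
# "Andre Agassi" starts 602 bytes in, and the next section starts at 683215
with open("enwiki-latest-pages-articles-multistream.xml.bz2", "rb") as multistream:
    multistream.seek(602)
    compressed_section = multistream.read(683215 - 602)

# Each section is an independent bz2 stream, so it decompresses on its own
section_xml = bz2.BZ2Decompressor().decompress(compressed_section)
print(section_xml.decode("utf-8")[:500])  # the start of that section's <page> elements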

So we’ll write a function iterate_articles_chunk that takes an index_file and a multistream_file, skips the first article_offset articles, and then reads the next num_articles articles.
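
The complete implementation is in the repo linked at the end of this post; here's a minimal sketch of what iterate_articles_chunk might look like. It assumes a recent version of smart_open (which auto-decompresses the bz2 index file based on its extension, and lets us disable that for the multistream file so we can seek to raw byte offsets), and it assumes article_offset falls on a section boundary, which is true for the chunk sizes we use below:

import bz2
import itertools
import xml.etree.ElementTree as ET

import smart_open


def iterate_articles_chunk(index_file, multistream_file, article_offset, num_articles):
    """Skips the first article_offset articles, then yields (title, text) for
    the next num_articles articles. Paths can be local files or s3:// URLs."""
    # The index maps each article to the byte offset of the bz2 section
    # (~100 articles) that contains it. Collect the distinct offsets we need.
    offsets = []
    with smart_open.open(index_file, encoding="utf-8") as index_lines:
        for line in itertools.islice(
            index_lines, article_offset, article_offset + num_articles
        ):
            offset = int(line.split(":", 1)[0])
            if not offsets or offsets[-1] != offset:
                offsets.append(offset)

    articles_yielded = 0
    # compression="disable" stops smart_open from bz2-decompressing the whole
    # file, so that we can seek straight to the sections we care about
    with smart_open.open(multistream_file, "rb", compression="disable") as multistream:
        for offset in offsets:
            multistream.seek(offset)
            decompressor = bz2.BZ2Decompressor()
            # Each section is a fragment of <page> elements, so give the
            # streaming parser a dummy root element that we never close
            parser = ET.XMLPullParser(events=["end"])
            parser.feed(b"<pages>")
            while not decompressor.eof:
                compressed = multistream.read(256 * 1024)
                if not compressed:
                    break
                parser.feed(decompressor.decompress(compressed))
                for _, element in parser.read_events():
                    if element.tag == "page":
                        yield (
                            element.findtext("title") or "",
                            element.findtext("revision/text") or "",
                        )
                        articles_yielded += 1
                        if articles_yielded >= num_articles:
                            return
                        element.clear()

A few notes on this code: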

  • We’re using smart_open, which is an amazing library that lets you open objects in S3 (and other cloud object stores) as if they’re files on your filesystem. It’s obviously critical that we’re able to seek to an arbitrary position in an S3 file without first downloading the whole thing. We’ll assume you’re using Poetry, but you should be able to follow along with any other package manager:
poetry add smart_open[s3]
  • We’re ignoring a ton of metadata about each Wikipedia article, but that doesn’t matter for our purposes.
  • A bit of an aside, but the above code is also a good example of how to use xml.etree.ElementTree.XMLPullParser to parse an XML file as a stream, which makes sense for large files, as it means you don’t need to hold the entire file in memory. In contrast, xml.dom.minidom requires enough memory for your entire file, but it does allow processing elements in any order.

Let’s try it out!

import time
from unzip_wikipedia_articles import iterate_articles_chunk

n = 1000
t0 = time.perf_counter()
bytes_read = 0
for title, text in iterate_articles_chunk(
    "enwiki-latest-pages-articles-multistream-index.txt.bz2",
    "enwiki-latest-pages-articles-multistream.xml.bz2",
    0,
    n,
):
    bytes_read += len(title) + len(text)
print(
    f"Read ~{bytes_read:,d} bytes from {n} articles in {time.perf_counter() - t0:.2f}s"
)
This prints:

Read ~34,795,421 bytes from 1000 articles in 2.17s

Man that’s slow! Counting the lines in the index file tells us there are 22,114,834 articles (this is as of the 2022–06–20 dump). So at 2.17s per 1000 articles times 22 million articles, I’m looking at around 13 hours to unzip this entire file on my i7-8550U processor. Presumably most of this time is decompressing bz2, so as a sanity check, let’s see what others are getting for bz2 decompression speeds. This article gives 24MB/s for decompression speed, and we’re in the same ballpark at 16MB/s (we’re not counting the bytes for XML tags we’re ignoring, so our true decompression speed is a bit faster than this).
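
Spelling out that back-of-the-envelope math:

total_articles = 22_114_834
seconds_per_1000_articles = 2.17

# ~13.3 hours to decompress every article at the rate we measured
print(total_articles / 1000 * seconds_per_1000_articles / 3600)

# ~16 MB/s of article text (not counting the XML tags we skip)
print(34_795_421 / seconds_per_1000_articles / 1_000_000)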

Scaling with Meadowrun

We could use multiprocessing to make use of all the cores on my laptop, but that would still only get us to 1–2 hours of runtime at best. In order to get through this in a more reasonable amount of time, we’ll need to use multiple machines in EC2. Meadowrun makes this easy!

We’ll assume you’ve configured your AWS CLI, and we’ll continue using Poetry. (See the docs for more context, as well as for using Meadowrun with Azure, pip, or conda.) To get started, install the Meadowrun package and then run Meadowrun’s install command to set up the resources Meadowrun needs in your AWS account.

poetry add meadowrun
poetry run meadowrun-manage-ec2 install

Next, we’ll need to create an S3 bucket, upload the data files there, and then give the Meadowrun IAM role access to that bucket:

aws s3 mb s3://wikipedia-meadowrun-demo
aws s3 cp enwiki-latest-pages-articles-multistream-index.txt.bz2 s3://wikipedia-meadowrun-demo
aws s3 cp enwiki-latest-pages-articles-multistream.xml.bz2 s3://wikipedia-meadowrun-demo

poetry run meadowrun-manage-ec2 grant-permission-to-s3-bucket wikipedia-meadowrun-demo

Now we’re ready to run our unzipping on the cloud:

import asyncio

import meadowrun

from convert_to_tar import convert_articles_chunk_to_tar_gz


async def unzip_all_articles():
    total_articles = 22_114_834
    chunk_size = 100_000

    await meadowrun.run_map(
        lambda i: convert_articles_chunk_to_tar_gz(i, chunk_size),
        [i * chunk_size for i in range(total_articles // chunk_size + 1)],
        meadowrun.AllocCloudInstance("EC2"),
        meadowrun.Resources(
            logical_cpu=1,
            memory_gb=2,
            max_eviction_rate=80,
        ),
        await meadowrun.Deployment.mirror_local(),
        num_concurrent_tasks=64,
    )


if __name__ == "__main__":
    asyncio.run(unzip_all_articles())

In this snippet, we’re splitting all of the articles into chunks of 100,000, which gives us 222 tasks. We’re telling Meadowrun to start up enough EC2 instances to run 64 of these tasks at a time, and that each task will need 1 CPU and 2 GB of RAM. And we’re okay with spot instances that have up to an 80% chance of eviction (aka interruption).

Each task will run convert_articles_chunk_to_tar_gz (sketched after this list), which:

  • Calls iterate_articles_chunk to read its chunk of 100,000 articles
  • Gets just the title and text of those articles
  • Packs those into a .tar.gz file of plain text files where the name of each file in the archive is the title of the article (a .gz file is much faster to decompress than a bz2 file)
  • And finally writes that new file back to S3.
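
Here's a rough sketch of what such a task function could look like; the output file naming, the lack of any title sanitization, and the hard-coded bucket (the one we created above) are simplifications, and the complete version is in the repo linked below:

import io
import tarfile

import smart_open

from unzip_wikipedia_articles import iterate_articles_chunk

BUCKET = "s3://wikipedia-meadowrun-demo"


def convert_articles_chunk_to_tar_gz(article_offset: int, num_articles: int) -> None:
    # Disable smart_open's automatic gzip so that tarfile's own gz compression
    # is the only compression applied to the output stream
    with smart_open.open(
        f"{BUCKET}/extracted-{article_offset}.tar.gz", "wb", compression="disable"
    ) as out_file:
        with tarfile.open(fileobj=out_file, mode="w|gz") as tar:
            for title, text in iterate_articles_chunk(
                f"{BUCKET}/enwiki-latest-pages-articles-multistream-index.txt.bz2",
                f"{BUCKET}/enwiki-latest-pages-articles-multistream.xml.bz2",
                article_offset,
                num_articles,
            ):
                # One plain-text file per article, named after the article title
                data = text.encode("utf-8")
                tar_entry = tarfile.TarInfo(name=title)
                tar_entry.size = len(data)
                tar.addfile(tar_entry, io.BytesIO(data))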

Using Meadowrun and EC2, this takes about 10 minutes from start to finish, whereas it would have taken about 13 hours on my laptop just to read the articles, not even counting the time to recompress them into a .tar.gz.

The exact instance type that Meadowrun selects will vary based on spot instance availability and real-time pricing, but in an example run Meadowrun prints out:

Launched 1 new instance(s) (total $0.9107/hr) for the remaining 64 workers:
    ec2-3-12-160-131.us-east-2.compute.amazonaws.com: r6i.16xlarge (64 CPU/512.0 GB), spot ($0.9107/hr, 2.5% chance of interruption), will run 64 job/worker

At $0.9107/hr, this whole process costs us less than a quarter!

Closing remarks

  • EC2 is amazing! (And so are Azure and GCP.) Spot pricing makes really powerful machines accessible for not very much money.
  • On the other hand, using EC2 for a task like this can require a decent amount of setup in terms of selecting an instance, remembering to turn it off when you’re done, and getting your code and libraries onto the machine. Meadowrun makes all of that easy!
  • The complete code for this series is here in case you want to use it as a template.

To stay updated on Meadowrun, star us on GitHub or follow us on Twitter!
