Italian Laws Unigram Viewer on the Edge With Cloudflare Pages

Months ago I shared my Italian law mapping project, where I mapped 13K Italian laws and extracted the relationships between them (labs.marcocimolai.xyz/tessuto-normativo). It went viral on Reddit and LinkedIn and was covered by some of Italy's leading newspapers. Today, I will share my journey of building and deploying a "Google NGram viewer" for Italian laws.

Let's start with the basic idea: you search for a word and the site returns how many Italian laws containing that word have been published for each year, from the Constitution to 2022.

Google NGram Viewer

I didn't know much about search engines or information retrieval. So, as usual, my journey began with a Google search.

That's where I found the well-known Apache Lucene. I started digging and learning all about it. I discovered that there are many other optimizations and pre-processing steps that are essential for serving a search endpoint, and that there is a project called Solr that does all of this for me.

Solr is a search engine built on top of Apache Lucene, and it comes with sane defaults and a handy HTTP API. The first part is indexing and processing the laws: I used the pysolr client to loop through each law and add it to the index. The document format only contains the text (not stored) and the publication date (stored), which is all I need to reconstruct the term usage plot.

import pysolr
solr = pysolr.Solr('http://localhost:8983/solr/norms', always_commit=True)

for norm in norms:
  solr.add([
      {
          "text": norm.text,
          "date": norm.date
      }
  ])

It's true that I left every configuration at its default (so it may not be very optimized for my use case), but I didn't expect the process to be so fast and easy. Performing queries was also very straightforward: I just needed to retrieve the date field, with no additional scoring.

query = "lire"
results = solr.search('text:"%s"' % query, **{
    'fl': 'date',
    'rows': 15000
})

After processing this data with pandas, here's the result of the query. The Y axis represents how many norms containing the term "lire" were published each year, relative to the total count of published norms: that is, term_occurrences / total_norms.
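In pandas terms, the processing looks roughly like this (total_norms_by_year is a stand-in for the per-year totals of published norms, which aren't shown here):

import pandas as pd

# Stand-in: total number of norms published per year, used for normalization
total_per_year = pd.Series(total_norms_by_year)

# One row per matching norm; keep only the year of the date field
df = pd.DataFrame(list(results))
df['year'] = pd.to_datetime(df['date']).dt.year

# term_occurrences / total_norms, per year
usage = df.groupby('year').size() / total_per_year
usage.plot()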

Thinking about the problems

Thanks to Solr I had just indexed the Italian norms, performed queries and retrieved the term usage graph: exactly what I needed. Or not really, because there are also some downsides:

  • I have to pay for and maintain an additional server
  • Not easily scalable
  • Solr is doing too much for what I need to achieve

The previous tool was deployed entirely on Cloudflare Pages with no server compute. The graph is downloaded, opened and processed client side, hassle-free and with zero deployment costs.

With this new Solr architecture, I have to use and maintain an external server. In addition, this custom solution is not easily scalable, difficult to distribute, and, as it stands, acts as a central point of failure. During high demand spikes (which were very common in my previous project), I doubt that Solr would be able to serve all those users.

On top of that, Solr does pre-processing I don't need, like lemmatization and stemming, and it stores word positions and multiple dates for the same year. Solr is great, but overkill for my needs.

Solution number two

I need a simple inverted index without stemming/lemmatization, where documents are the pre-computed term usage graphs.

The key part here is pre-computed. In games, lighting is too demanding to compute in real time (not anymore, actually). That's why we've always used _baked lighting_: light is pre-calculated and applied to the world textures.

My idea is similar: querying a large corpus of text is too demanding (in terms of computation, time, and cost) for real-time use. So I will do the search on the precomputed results and deliver them already baked.

Building the inverted index

An inverted index consists of two things: terms and documents. In this case, terms are words that appear in each law, and documents are usage distributions.

Inverted index structure, each term is associated to its usage graph

I started by extracting tokens from a sample law using nltk. The tokens are processed and the words are filtered (e.g. by removing stopwords or odd characters). As mentioned before, I don't need any stemming/lemmatization since I'm searching for exact terms.

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
import nltk

stop_words = set(stopwords.words('italian'))

#extract tokens from the norm text
def tokenize(norm):
  word_tokens = word_tokenize(norm.lower(), language="italian")
  tokens = []

  for word in word_tokens:

    #remove special characters
    token = process_word(word)

    #skip stop words
    if token in stop_words:
      continue

    #skip invalid tokens (weird strings such as html)
    if not is_valid_token(token):
      continue

    #skip tokens that don't fit in 32 bytes.
    #Note: remember this step for later
    token_bytes = bytes(token, 'utf-8')
    if len(token_bytes) > 32:
        continue

    tokens.append(token)

  return tokens

The tokenize(norm) function returns the list of filtered tokens extracted from the input text (uniqueness is enforced later with a set). The next part is to build the inverted index.

norm = open('urn:nir:stato:legge:1967-03-09;150.txt').read()
year = 1967

inverted_index = {}

#Only take unique tokens.
# If a word occurs just once in the norm, it's counted in the final result.
unique_tokens = set(tokenize(norm))

for token in unique_tokens:
  #Get the count of this token
  freq = inverted_index.get(token, [[year, 0]])

  #Add +1 to the count
  freq[-1][1] += 1

  #Assign it back to the inverted index
  inverted_index[token] = freq

#The sorted keys of the inverted index is our vocabulary
vocab = sorted(inverted_index)

This code is a simplification of the final result. Here the year is fixed, but the final code should also be able to update the index at a specific year.

With the code above, I extracted 407 unique tokens from legge:1967-03-09;150. Here are some examples:

[
  'periodo',
  'ciascun',
  'termine',
  'incarichi',
  'liquidazione',
  'esso',
  'qualsiasi',
  'agrarie',
  'leggi',
  ...
]

🧊 Freezing the index

I am able to process norms and create inverted indexes, but the results are only available in RAM and in Python data structures (dictionaries and lists). To use this index online, I need to export it to a file. I call this the freezing part.

The simplest solution would be to serialize the dictionary into JSON, then have the user download it and start looking up terms offline. This way I don't have to maintain any servers and everything is automatically hosted and distributed by Cloudflare or some other CDN.
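On the Python side this would be little more than a json.dump (just a sketch of the idea, not what I ended up shipping):

import json

# Serialize the whole inverted index; the client would download this file
# and look up terms directly in the parsed object
with open('inverted_index.json', 'w') as f:
    json.dump(inverted_index, f)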

representation of the proposed flow, from python dictionary, to json, to javascript hashmap

Unfortunately, there are multiple problems with this approach:

  • JSON is a text format and produces files that are too big, with many repeated bytes. We mostly need to store numbers.
  • We need to parse the entire JSON file to perform a search
  • Users download every document, even though they will probably not search all of them.

I came up with a better solution that is more efficient and separates the index from the documents. The idea is to have an index.bin file and a documents.bin file.

The index.bin file contains the ordered list of tokens (the vocabulary). Each token is stored in a fixed 32-byte slot, padded with trailing zero bytes if necessary. The most important thing to note here is that every token is byte aligned: this makes it possible to do a binary search without having to read or parse the whole file.

The documents.bin file contains the term usage distributions, each stored in 152 bytes. Why 152? Because I want to analyze norms from the Constitution (1947) through 2022, exactly 76 years. Each year's term count is an unsigned 2-byte integer, so 2 bytes times 76 years equals 152 bytes.

representation of the new solution, building two files index and documents

This way of storing documents consumes a lot of space (many term usage distributions contain a lot of zeros) but again, it is byte aligned and can be easily indexed by reading at an offset.
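The freezing script boils down to something like this (a sketch, assuming the inverted_index dictionary built above and little-endian unsigned shorts):

import struct

YEAR_START, YEAR_END = 1947, 2022
NUM_YEARS = YEAR_END - YEAR_START + 1   # 76 years -> 152 bytes per record

vocab = sorted(inverted_index)

with open('index.bin', 'wb') as idx, open('documents.bin', 'wb') as docs:
    for token in vocab:
        # index.bin: fixed 32-byte slot per token, padded with zero bytes
        # (tokens longer than 32 bytes were skipped during tokenization)
        idx.write(bytes(token, 'utf-8').ljust(32, b'\x00'))

        # documents.bin: 76 unsigned 2-byte counts, one per year, written
        # at the same position as the token in the vocabulary
        counts = [0] * NUM_YEARS
        for year, count in inverted_index[token]:
            counts[year - YEAR_START] = count
        docs.write(struct.pack('<%dH' % NUM_YEARS, *counts))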

In production, I decided to index 53,036 laws. This resulted in a vocabulary of 196,082 tokens, a 6 MB index and a 28 MB documents file. The entire script ran in about 16 minutes.

The compressed index.bin is about 800 KB, and the compressed documents.bin is about 1.8 MB. This is not bad when you consider that the original Solr index took up more than 100 MB.

________________________________________________________
Executed in   16.75 mins    fish           external
   usr time  727.07 secs    0.00 micros  727.07 secs
   sys time    4.71 secs  779.00 micros    4.71 secs

To recap, storing the inverted index in two parts and in binary format is way better because:

  • Users will only download the index file (smaller and can be compressed with gzip/brotli)
  • No need to read and parse the entire index to perform search
  • Documents are retrieved on-demand, only the parts that users need

As I said before, these two files will be distributed by a CDN. But how can the client download only a tiny part of the documents.bin file? After all, CDNs only serve static files and don't perform any kind of computation.

Leveraging HTTP functionalities

Introducing Range headers. Not every server supports them (GitHub does), but basically they allow you to download specific parts of a file. This is mainly used for playing MP4 videos on the web (without having to use HLS or DASH): you don't have to download the whole file just to watch a tiny part of it.

Range: bytes={byteFrom}-{byteTo}

This suits my needs perfectly. Finally, the whole system can be divided into three steps (sketched below):

1) Perform binary search on index.bin
2) Find the offset
3) Download the specific documents.bin part with Range header
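In Python, the client-side logic would look roughly like this (the real client runs in the browser, and the URLs are placeholders):

import requests

TOKEN_SIZE = 32    # bytes per vocabulary slot in index.bin
DOC_SIZE = 152     # bytes per usage record in documents.bin

# Placeholder URLs for the statically hosted files
INDEX_URL = "https://example.com/index.bin"
DOCS_URL = "https://example.com/documents.bin"

index_bytes = requests.get(INDEX_URL).content
num_tokens = len(index_bytes) // TOKEN_SIZE

def token_at(i):
    slot = index_bytes[i * TOKEN_SIZE:(i + 1) * TOKEN_SIZE]
    return slot.rstrip(b'\x00').decode('utf-8')

def lookup(word):
    # 1) binary search over the byte-aligned vocabulary
    lo, hi = 0, num_tokens - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        current = token_at(mid)
        if current == word:
            # 2) the token's position doubles as the record offset in documents.bin
            start = mid * DOC_SIZE
            # 3) download only that 152-byte record with a Range request
            headers = {'Range': 'bytes=%d-%d' % (start, start + DOC_SIZE - 1)}
            return requests.get(DOCS_URL, headers=headers).content
        elif current < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return None   # term not in the vocabulary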

More problems, thank you CORS

Before using this solution with Range headers, I tested if GitHub was accepting them, and it was. I also checked to see if it was serving files with the Access-Control-Allow-Origin: * header (to access it from other domains), and it was. I did this with Postman and everything worked as expected.

I only found the problem when I tried to run the app against the files hosted on GitHub (not from a local server). Little did I know, browsers don't just look at the origin domain to allow cross-origin requests. During the CORS preflight, they also check headers, methods, credentials and other options. In particular, GitHub does not accept any additional headers, including Range :(

➗ Divide and conquer

Again, I had to find another solution, without resorting to hosting the files on my server.

I've decided to split the documents.bin file into chunks. By choosing the number of chunks, I can reduce the load on the client to a reasonable amount: too many chunks and the user has a higher chance of typing words that live in different files; too few and the user waits longer to download larger files.

I decided to split it into 10 parts, each of which weighs about 200kb compressed. The client knows what files to download simply by doing word_position / ( 196082 / 10), where 196082 is the vocabulary size.
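In code, picking the right chunk is just that formula (a sketch, assuming each chunk covers an equal slice of the vocabulary):

VOCAB_SIZE = 196082   # tokens in index.bin
NUM_CHUNKS = 10       # documents.bin was split into 10 files

def chunk_for(word_position):
    # word_position is the token's index found by the binary search;
    # this picks which documents chunk holds its usage record
    return int(word_position / (VOCAB_SIZE / NUM_CHUNKS))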

One index file indexes into multiple chunks of the documents

Many ups and one big down

After all this long ping-pong between solutions and problems, I think I've found the (almost) perfect solution. It has numerous advantages and reduces maintenance and costs. The only major drawback is that it only searches for single words (unigrams).

I made a conscious decision to store usage graphs without considering word positions, which limits the ability to search for sequences of words (phrases). Adding this feature would have significantly increased the index and document size, making it challenging to deliver to clients.

I can easily change the Python script to include n-grams, such as bigrams (two-word combinations). However, a bigram is just a combination of two single words, so there are far more distinct bigrams than unigrams. The resulting vocabulary is much more sparse and diverse, which leads to much larger indexes.
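The change to the tokenizer would be small, something like this sketch using nltk's ngrams helper (not what I shipped):

from nltk.util import ngrams
from nltk.tokenize import word_tokenize

def bigram_tokens(norm):
    # Pair up consecutive words; the same filtering used in tokenize()
    # could then be applied to both words of each pair
    words = word_tokenize(norm.lower(), language="italian")
    return [' '.join(pair) for pair in ngrams(words, 2)]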

When I tried to extract bigrams from the 53,036 norms, the vocabulary grew to 38 million bigrams, resulting in a 293 MB index.bin and a 696 MB documents.bin.

I'm still exploring new methods for searching phrases with static files and no server computation, as this remains an ongoing area of development.

Conclusion

This was quite a challenge: reduce cost and maintenance time as much as possible. I wanted to push myself to new limits, building custom and "low level" stuff with technologies I'd never really understood before.

So this is the complete architecture. One index.bin and 10 chunked documents files are enough to perform search over 50,000 Italian laws.

You can view the final result here: labs.marcocimolai.xyz/term-trend

screenshot of the final result webpage
