<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frankie</title>
    <description>The latest articles on DEV Community by Frankie (@fprime).</description>
    <link>https://dev.to/fprime</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F189420%2Fb3ec0226-0386-4a99-b13c-366455a2afb7.png</url>
      <title>DEV Community: Frankie</title>
      <link>https://dev.to/fprime</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fprime"/>
    <language>en</language>
    <item>
      <title>Testing a website from different countries using Python 3 and Proxy Orbit</title>
      <dc:creator>Frankie</dc:creator>
      <pubDate>Wed, 07 Aug 2019 05:42:04 +0000</pubDate>
      <link>https://dev.to/fprime/testing-a-website-from-different-countries-using-python-3-and-proxy-orbit-3889</link>
      <guid>https://dev.to/fprime/testing-a-website-from-different-countries-using-python-3-and-proxy-orbit-3889</guid>
      <description>&lt;p&gt;It can be tricky to test web applications from different countries. Services exist that make it easier, but they often cost money for the really useful features. In this post I am going to go over how we can automate the testing of a web application in different regions using free web proxies. To do this, I will be using Python 3 and Proxy Orbit. &lt;/p&gt;

&lt;p&gt;To start out, I have created a very basic demo application that detects the location of incoming users and returns a page that looks like the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RmKFYP-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/SAf12NF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RmKFYP-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/SAf12NF.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flag and text will change depending on the IP address that the user is using at the time.&lt;/p&gt;

&lt;p&gt;The code for this is below. It uses Python, Flask, and the geoip2 library. It also requires that you have an &lt;a href="https://maxmind.github.io/MaxMind-DB/"&gt;mmdb&lt;/a&gt; file for IP address lookup named "ip_country_db.mmdb".&lt;/p&gt;

&lt;p&gt;app.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import geoip2.database as geoip
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/")
def index():
    ip_addr = request.remote_addr
    reader = geoip.Reader("ip_country_db.mmdb")
    country = reader.country(ip_addr)
    return render_template("index.html", country=country, name=country.country.names['en'])

if __name__ == "__main__":
    app.run(port=8080, host="0.0.0.0")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;index.html&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;html&amp;gt;
    &amp;lt;head&amp;gt;
        &amp;lt;title&amp;gt;Location Greeter&amp;lt;/title&amp;gt;
    &amp;lt;/head&amp;gt;
    &amp;lt;body&amp;gt;
        &amp;lt;div id="greeting"&amp;gt;
            &amp;lt;img src="https://www.countryflags.io/{{ country.country.iso_code }}/flat/64.png"&amp;gt;
            &amp;lt;div id="words"&amp;gt;
                Hello {{ name }}!
            &amp;lt;/div&amp;gt;
        &amp;lt;/div&amp;gt;
    &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, getting to the fun stuff. To write the tests we are going to create a Python script that loads the web page and checks for specific text on the page associated with the location being tested.&lt;/p&gt;

&lt;p&gt;To start off let's create a basic Python script that loads a web page&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os                         
import requests    

url = os.getenv("TEST_URL")                                                                                                                                    

resp = requests.get(url)                                                                                                                                          

print(resp.content)  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This script simply loads the web page at the URL stored in the &lt;code&gt;TEST_URL&lt;/code&gt; environment variable and prints its contents.&lt;/p&gt;

&lt;p&gt;Now to write the code that actually tests the contents of the webpage for the correct location name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import requests

url = os.getenv("TEST_URL")

countries = [
    {"code": "US", "name": "United States"},
    {"code": "GB", "name": "United Kingdom"},
    {"code": "CN", "name": "China"},
]

for country in countries:
    resp = requests.get(url)
    if country['name'] not in resp.content.decode():
        raise Exception(f"{country['name']} not on webpage")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code will obviously fail because each request will be made from the same IP address. To fix this we will be using Proxy Orbit (&lt;a href="https://proxyorbit.com"&gt;https://proxyorbit.com&lt;/a&gt;). Proxy Orbit is a rotating proxy API. Each request will give us a new proxy that we can use in our script. We can specify where the proxy is located using the API query arguments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os    
import requests    

url = os.getenv("TEST_URL")    
proxy_orbit_url = "https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;location={country}"                              
countries = [                                                                                                                                                  
    {"code": "US", "name": "United States"},                                                                                                                   
    {"code": "GB", "name": "United Kingdom"},                                                                                                                  
    {"code": "CN", "name": "China"},                                                                                                                        
]                                                                                                                                                           

for country in countries:                                                                                                                                   
    presp = requests.get(proxy_orbit_url.format(country=country["code"]))                                                                                   
    if presp.status_code != 200:                                                                                                                            
        raise Exception("Could not get proxy")                                                                                                              
    proxy_json = presp.json()    
    pcurl = proxy_json["curl"]    
    resp = requests.get(url, proxies={"http": pcurl})    
    if country["name"] not in resp.content.decode():    
        print(country['name'], "Failed")                        
    else:                                                       
        print(country["name"], "Passed")   
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We didn't have to add much to integrate proxy support. We added a line at the beginning of the script that defines the Proxy Orbit API URL, which contains a &lt;code&gt;{country}&lt;/code&gt; placeholder that is filled in with each country code inside the loop.&lt;/p&gt;

&lt;p&gt;In the for loop we first make a request to the Proxy Orbit API, formatting in the country code, to get a new proxy in the region we need. Then we attach that proxy to the request made to the URL we are trying to test.&lt;/p&gt;
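&lt;p&gt;To make that formatting step concrete, here is a minimal sketch of how &lt;code&gt;str.format&lt;/code&gt; fills the &lt;code&gt;{country}&lt;/code&gt; placeholder (the token below is a made-up placeholder, not a real key):&lt;/p&gt;

```python
# Hypothetical template mirroring the script's proxy_orbit_url; the
# token value is a placeholder for illustration only.
template = "https://api.proxyorbit.com/v1/?token={token}&location={country}"

url = template.format(token="YOUR_TOKEN", country="GB")
print(url)  # https://api.proxyorbit.com/v1/?token=YOUR_TOKEN&location=GB
```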

&lt;p&gt;The script then prints whether or not the country name was found on the web page when the page was loaded through the proxy. &lt;/p&gt;

&lt;p&gt;More countries can be added by adding new country dictionaries to the &lt;code&gt;countries&lt;/code&gt; list.&lt;/p&gt;
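&lt;p&gt;For example, a quick sketch of extending the list with two more regions (the location filter appears to use two-letter ISO country codes, as in the entries above):&lt;/p&gt;

```python
# The same list as in the script, extended with two extra entries.
countries = [
    {"code": "US", "name": "United States"},
    {"code": "GB", "name": "United Kingdom"},
    {"code": "CN", "name": "China"},
]

countries.append({"code": "DE", "name": "Germany"})
countries.append({"code": "JP", "name": "Japan"})

print([c["code"] for c in countries])  # ['US', 'GB', 'CN', 'DE', 'JP']
```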

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>ProxyOrbit.com - Building a rotating proxy API</title>
      <dc:creator>Frankie</dc:creator>
      <pubDate>Wed, 31 Jul 2019 06:47:38 +0000</pubDate>
      <link>https://dev.to/fprime/proxyorbit-com-rotating-proxy-api-f23</link>
      <guid>https://dev.to/fprime/proxyorbit-com-rotating-proxy-api-f23</guid>
      <description>&lt;h1&gt;
  
  
  Creating another proxy service
&lt;/h1&gt;

&lt;p&gt;There are a lot of services out there that provide proxies in many different ways. Some provide them in bulk, others provide them via an API, others just provide them for free on their website. The problem is that many of these services serve dead or sluggish proxies making it very difficult to use without having to verify that they are working yourself. &lt;/p&gt;

&lt;p&gt;For a few projects I have planned, I need an API that would provide working proxies every time I requested them. I needed a service that was extremely aggressive about checking its proxies so that no proxy that was dead would last in the pool long enough to be served.&lt;/p&gt;

&lt;p&gt;From this Proxy Orbit was born.&lt;/p&gt;

&lt;p&gt;Proxy Orbit is a service that aims to do one set of tasks very well: &lt;/p&gt;

&lt;p&gt;Scan the web for open proxies, save them, check them constantly, return a random proxy when an API request is made, and allow for filtering of the results.&lt;/p&gt;

&lt;h1&gt;Basic API Usage&lt;/h1&gt;

&lt;p&gt;A basic API request currently looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA"&gt;https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which returns an object that looks like the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    anonymous: true,
    cookies: true,
    curl: "http://182.96.147.126:8118",
    get: true,
    ip: "182.96.147.126",
    isp: "No.31,Jin-rong Street",
    lastChecked: 1564551731.418154,
    location: "CN",
    port: 8118,
    post: true,
    protocol: "http",
    rtt: 1.008148431777954,
    ssl: false,
    websites: {
       amazon: false,
       facebook: false,
       google: false,
       instagram: false,
       netflix: false,
       nike: false,
       shopify: false,
       stubhub: false,
       supreme: false,
       ticketmaster: false,
       twitch: false,
       twitter: false,
       youtube: false
    }
}

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;All proxies returned have been checked within the last hour.&lt;/p&gt;
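&lt;p&gt;If you want to double-check that freshness claim on the client side, the &lt;code&gt;lastChecked&lt;/code&gt; timestamp in the response makes it a one-liner. A small sketch (the proxy object below is a trimmed-down, made-up example):&lt;/p&gt;

```python
import time

def is_fresh(proxy, max_age=3600):
    # True if the proxy was checked within the last max_age seconds
    return (time.time() - proxy["lastChecked"]) < max_age

# A trimmed-down proxy object whose lastChecked is ten minutes old
proxy = {"curl": "http://182.96.147.126:8118", "lastChecked": time.time() - 600}
print(is_fresh(proxy))  # True
```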

&lt;p&gt;Proxies can also be returned in bulk using the &lt;code&gt;count&lt;/code&gt; query argument.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;count=5"&gt;https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;count=5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above will return five proxy objects in a JSON array.&lt;/p&gt;

&lt;p&gt;Proxies can also be filtered by all parameters in the object. For example&lt;/p&gt;

&lt;p&gt;&lt;a href="https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;amazon=true&amp;amp;protocol=http&amp;amp;ssl=true"&gt;https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;amazon=true&amp;amp;protocol=http&amp;amp;ssl=true&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;will return HTTP proxies that can load Amazon.com and support HTTPS.&lt;/p&gt;
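&lt;p&gt;Rather than concatenating query arguments by hand, the filters can also be assembled with &lt;code&gt;urllib.parse.urlencode&lt;/code&gt;. A sketch, using a placeholder token:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Build the filtered request URL from a dict of query arguments.
# The token value is a placeholder, not a real key.
base = "https://api.proxyorbit.com/v1/"
params = {"token": "YOUR_TOKEN", "amazon": "true", "protocol": "http", "ssl": "true"}

url = base + "?" + urlencode(params)
print(url)
```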

&lt;h1&gt;Examples in Python and NodeJS&lt;/h1&gt;

&lt;p&gt;Since everything is provided over a basic REST API, integration into existing code should be trivial. Here are two examples of using Proxy Orbit in Python and NodeJS. The scripts request a new proxy and check the proxy's IP.&lt;/p&gt;

&lt;h3&gt;Python Code&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

resp = requests.get("https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;ssl=true&amp;amp;protocol=http")
if resp.status_code != 200:
    raise Exception("Token Expired!")

json = resp.json()
curl = json['curl']

resp2 = requests.get("https://api.proxyorbit.com/ip", proxies={"https":curl, "http":curl})
print("Proxy IP is:", resp2.content)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;NodeJS&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const request = require("request");

request.get("https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;ssl=true&amp;amp;protocol=http", (err, resp, body) =&amp;gt; {
    if(err) {
        console.log(err)    
    } else if(resp.statusCode != 200) {
        console.log("Token has expired");
    } else {
        json = JSON.parse(body);
        request({
            url:"https://api.proxyorbit.com/ip",
            method:"GET",
            proxy:json.curl
        }, (err, resp, body) =&amp;gt; {
            console.log("Proxy IP is:", body);
        })
    }
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;Checking pool quality&lt;/h1&gt;

&lt;p&gt;To test the quality of the pool I have been using the following script&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

total = 100
failed = 0
url = input("URL: ")
for x in range(total):
    resp = requests.get("https://api.proxyorbit.com/v1/?token=Iut1LQeCvN7bxHRkplubawI75qGiWxiXrKR1ARflDKA&amp;amp;protocols=http&amp;amp;ssl=true").json()
    curl = resp['curl']
    try:
        r = requests.get(url, proxies={"http":curl, "https":curl}, timeout=10)
    except Exception as e:
        print(e)
        print(curl, "failed")
        failed += 1
    else:
        print(r.status_code)
        #print(r.content)
        print(curl)
        if r.status_code != 200:
            failed += 1

error_rate = (failed / total) * 100

print(f"{error_rate}% failed")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The script makes 100 requests to a specific URL and tracks the number of failures. It then computes &lt;code&gt;failed / total * 100&lt;/code&gt; to get the error percentage. So far my tests against &lt;code&gt;https://proxyorbit.com&lt;/code&gt; have shown roughly 3% of the proxies failing (3 out of 100 do not work), though proxyorbit.com isn't very strict about blocking proxy IPs. &lt;/p&gt;
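&lt;p&gt;The error-rate arithmetic is simple enough to pull into a small helper, which also makes it easy to sanity-check in isolation:&lt;/p&gt;

```python
def error_rate(failed, total):
    # Percentage of requests that failed
    return (failed / total) * 100

print(error_rate(3, 100))  # 3.0
```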

&lt;p&gt;Open proxies are bound to fail: other people are using them as well, and they get banned much more quickly than private proxy servers. Though it is difficult to keep 100% of the pool working at any given time, the proxy checking algorithm ensures that broken proxies are removed as soon as they are noticed to be dead, normally within minutes of the proxy no longer working.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Proxy Orbit will be used as a foundation for a couple of other products that are currently in the works. It's definitely been a lot of fun creating this one and I hope that some other people could find use in it as well.&lt;/p&gt;

&lt;p&gt;As for the tech stack, Proxy Orbit uses Python 3, Flask, and MongoDB.&lt;/p&gt;

&lt;p&gt;If you are interested you can check the site out at &lt;a href="https://proxyorbit.com"&gt;https://proxyorbit.com&lt;/a&gt;. If you aren't interested in paying for the service, we do provide 2000 API requests per month for free when you sign up.&lt;/p&gt;

&lt;p&gt;Feedback is always greatly appreciated!&lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

</description>
      <category>python</category>
      <category>saas</category>
      <category>career</category>
      <category>startup</category>
    </item>
    <item>
      <title>How to Create a Web Crawler From Scratch in Python</title>
      <dc:creator>Frankie</dc:creator>
      <pubDate>Fri, 26 Jul 2019 14:36:51 +0000</pubDate>
      <link>https://dev.to/fprime/how-to-create-a-web-crawler-from-scratch-in-python-2p46</link>
      <guid>https://dev.to/fprime/how-to-create-a-web-crawler-from-scratch-in-python-2p46</guid>
      <description>&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;Most Python web crawling/scraping tutorials use some kind of crawling library. This is great if you want to get things done quickly, but if you do not understand how scraping works under the hood then when problems arise it will be difficult to know how to fix them.&lt;/p&gt;

&lt;p&gt;In this tutorial I will be going over how to write a web crawler completely from scratch in Python using only the Python Standard Library and the requests module (&lt;a href="https://pypi.org/project/requests/2.7.0/"&gt;https://pypi.org/project/requests/2.7.0/&lt;/a&gt;). I will also be going over how you can use a proxy API (&lt;a href="https://proxyorbit.com"&gt;https://proxyorbit.com&lt;/a&gt;) to prevent your crawler from getting blacklisted.&lt;/p&gt;

&lt;p&gt;This is mainly for educational purposes, but with a little attention and care this crawler can become as robust and useful as any scraper written using a library. Not only that, but it will most likely be lighter and more portable as well.&lt;/p&gt;

&lt;p&gt;I am going to assume that you have a basic understanding of Python and programming in general. Understanding of how HTTP requests work and how Regular Expressions work will be needed to fully understand the code. I won't be going into deep detail on the implementation of each individual function. Instead, I will give high level overviews of how the code samples work and why certain things work the way they do.&lt;/p&gt;

&lt;p&gt;The crawler that we'll be making in this tutorial will have the goal of "indexing the internet" similar to the way Google's crawlers work. Obviously we won't be able to index the internet, but the idea is that this crawler will follow links all over the internet and save those links somewhere as well as some information on the page.&lt;/p&gt;

&lt;h1&gt;Start Small&lt;/h1&gt;

&lt;p&gt;The first task is to set the groundwork of our scraper. We're going to use a class to house all our functions. We'll also need the re and requests modules so we'll import them&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests    
import re    

class PyCrawler(object):    
    def __init__(self, starting_url):    
        self.starting_url = starting_url                       
        self.visited = set()    

    def start(self):    
        pass                 

if __name__ == "__main__":    
    crawler = PyCrawler()    
    crawler.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that this is very simple to start. It's important to build these kinds of things incrementally. Code a little, test a little, etc.&lt;/p&gt;

&lt;p&gt;We have two instance variables that will help us in our crawling endeavors later. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;starting_url&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The initial URL that our crawler will start out from.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;visited&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will allow us to keep track of the URLs that we have already visited, to prevent visiting the same URL twice. Using a set() keeps visited-URL lookups at O(1) time, making them very fast.&lt;/p&gt;
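&lt;p&gt;The deduplication behavior is easy to see in isolation (the URLs here are made up):&lt;/p&gt;

```python
# Visiting a list of URLs that contains a duplicate; the set makes sure
# each URL is processed only once.
visited = set()
urls = ["https://a.com", "https://b.com", "https://a.com"]

for url in urls:
    if url in visited:  # O(1) average-case membership test
        continue
    visited.add(url)

print(sorted(visited))  # ['https://a.com', 'https://b.com']
```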

&lt;h1&gt;Crawl Sites&lt;/h1&gt;

&lt;p&gt;Now we will get started actually writing the crawler. The code below makes a request to the starting_url and extracts all the links on the page. It then iterates over those links and gathers new links from the new pages, continuing this recursive process until every link reachable from the starting point has been scraped. Some websites don't link outside of themselves, so crawls of those sites will stop sooner than crawls of sites that do link out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests    
import re    
from urllib.parse import urlparse    

class PyCrawler(object):    
    def __init__(self, starting_url):    
        self.starting_url = starting_url    
        self.visited = set()    

    def get_html(self, url):    
        try:    
            html = requests.get(url)    
        except Exception as e:    
            print(e)    
            return ""    
        return html.content.decode('latin-1')    

    def get_links(self, url):    
        html = self.get_html(url)    
        parsed = urlparse(url)    
        base = f"{parsed.scheme}://{parsed.netloc}"    
        links = re.findall('''&amp;lt;a\s+(?:[^&amp;gt;]*?\s+)?href="([^"]*)"''', html)    
        for i, link in enumerate(links):    
            if not urlparse(link).netloc:    
                link_with_base = base + link    
                links[i] = link_with_base       

        return set(filter(lambda x: 'mailto' not in x, links))    

    def extract_info(self, url):                                
        html = self.get_html(url)                               
        return None                  

    def crawl(self, url):                   
        for link in self.get_links(url):    
            if link in self.visited:        
                continue                    
            print(link)                 
            self.visited.add(link)            
            info = self.extract_info(link)    
            self.crawl(link)                  

    def start(self):                     
        self.crawl(self.starting_url)    

if __name__ == "__main__":                           
    crawler = PyCrawler("https://google.com")        
    crawler.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see a fair bit of new code has been added.&lt;/p&gt;

&lt;p&gt;To start, the get_html, get_links, crawl, and extract_info methods were added. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;get_html()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Fetches the HTML at the given URL.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;get_links()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Extracts the links from the current page.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;extract_info()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Will be used to extract specific info from the page.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;crawl()&lt;/code&gt; method has also been added, and it is probably the most important and complicated piece of this code. "crawl" works recursively: it starts at the starting_url, extracts links from that page, iterates over those links, and then feeds the links back into itself recursively. &lt;/p&gt;

&lt;p&gt;If you think of the web like a series of doors and rooms, then essentially what this code is doing is looking for those doors and walking through them until it gets to a room with no doors. When this happens it works its way back to a room that has unexplored doors and enters that one. It does this forever until all doors accessible from the starting location have been accessed. This kind of process lends itself very nicely to recursive code. &lt;/p&gt;
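&lt;p&gt;That door-and-room walk can be sketched without any HTTP at all, using a made-up in-memory "web" where each page maps to the links it contains:&lt;/p&gt;

```python
# A toy stand-in for crawl(): the page names and links are invented.
site = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post1", "post2"],
    "post1": [],
    "post2": ["home"],
}

visited = set()

def crawl(page):
    for link in site.get(page, []):
        if link in visited:
            continue
        visited.add(link)
        crawl(link)  # walk through the "door" and keep exploring

crawl("home")
print(sorted(visited))  # ['about', 'blog', 'home', 'post1', 'post2']
```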

&lt;p&gt;If you run this script now as is it will explore and print all the new URLs it finds starting from google.com&lt;/p&gt;

&lt;h1&gt;Extract Content&lt;/h1&gt;

&lt;p&gt;Now we will extract data from the pages. This method (extract_info) is largely based on what you are trying to do with your scraper. For the sake of this tutorial, all we are going to do is extract meta tag information if we can find it on the page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests    
import re    
from urllib.parse import urlparse    

class PyCrawler(object):    
    def __init__(self, starting_url):    
        self.starting_url = starting_url    
        self.visited = set()    

    def get_html(self, url):    
        try:    
            html = requests.get(url)    
        except Exception as e:    
            print(e)    
            return ""    
        return html.content.decode('latin-1')    

    def get_links(self, url):    
        html = self.get_html(url)    
        parsed = urlparse(url)    
        base = f"{parsed.scheme}://{parsed.netloc}"    
        links = re.findall('''&amp;lt;a\s+(?:[^&amp;gt;]*?\s+)?href="([^"]*)"''', html)    
        for i, link in enumerate(links):    
            if not urlparse(link).netloc:    
                link_with_base = base + link    
                links[i] = link_with_base    

        return set(filter(lambda x: 'mailto' not in x, links))    

    def extract_info(self, url):    
        html = self.get_html(url)    
        meta = re.findall("&amp;lt;meta .*?name=[\"'](.*?)['\"].*?content=[\"'](.*?)['\"].*?&amp;gt;", html)    
        return dict(meta)    

    def crawl(self, url):    
        for link in self.get_links(url):    
            if link in self.visited:    
                continue    
            self.visited.add(link)    
            info = self.extract_info(link)    

            print(f"""Link: {link}    
Description: {info.get('description')}    
Keywords: {info.get('keywords')}    
            """)    

            self.crawl(link)    

    def start(self):    
        self.crawl(self.starting_url)    

if __name__ == "__main__":    
    crawler = PyCrawler("https://google.com")     
    crawler.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not much has changed here besides the new print formatting and the extract_info method.&lt;/p&gt;

&lt;p&gt;The magic here is in the regular expression in the extract_info method. It searches in the HTML for all meta tags that follow the format &lt;code&gt;&amp;lt;meta name=X content=Y&amp;gt;&lt;/code&gt; and returns a Python dictionary of the format {X:Y}&lt;/p&gt;
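&lt;p&gt;Running that same pattern against a small hand-written snippet shows the shape of the result:&lt;/p&gt;

```python
import re

# A tiny, made-up page with two meta tags.
html = """
<head>
<meta name="description" content="A tiny test page">
<meta name="keywords" content="python, crawler">
</head>
"""

meta = dict(re.findall(
    "<meta .*?name=[\"'](.*?)['\"].*?content=[\"'](.*?)['\"].*?>", html))
print(meta)  # {'description': 'A tiny test page', 'keywords': 'python, crawler'}
```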

&lt;p&gt;This information is then printed to the screen for every URL for every request.&lt;/p&gt;

&lt;h1&gt;Integrate Rotating Proxy API&lt;/h1&gt;

&lt;p&gt;One of the main problems with web crawling and web scraping is that sites will ban you if you make too many requests, don't use an acceptable user agent, and so on. One way to limit this is by using proxies and setting a different user agent for the crawler. Normally the proxy approach requires you to go out and purchase, or manually source, a list of proxies from somewhere else. A lot of the time these proxies don't even work or are incredibly slow, making web crawling much more difficult.&lt;/p&gt;

&lt;p&gt;To avoid this problem we are going to use what is called a "rotating proxy API": an API that takes care of managing the proxies for us. All we have to do is make a request to its endpoint and boom, we get a new working proxy for our crawler. Integrating the service into our crawler will require no more than a few extra lines of Python.&lt;/p&gt;

&lt;p&gt;The service we will be using is Proxy Orbit (&lt;a href="https://proxyorbit.com"&gt;https://proxyorbit.com&lt;/a&gt;). Full disclosure, I do own and run Proxy Orbit. &lt;/p&gt;

&lt;p&gt;The service specializes in creating proxy solutions for web crawling applications. The proxies are checked continually to make sure that only the best working proxies are in the pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests    
import re    
from urllib.parse import urlparse    
import os    

class PyCrawler(object):    
    def __init__(self, starting_url):    
        self.starting_url = starting_url    
        self.visited = set()    
        self.proxy_orbit_key = os.getenv("PROXY_ORBIT_TOKEN")    
        self.user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"    
        self.proxy_orbit_url = f"https://api.proxyorbit.com/v1/?token={self.proxy_orbit_key}&amp;amp;ssl=true&amp;amp;rtt=0.3&amp;amp;protocols=http&amp;amp;lastChecked=30"    

    def get_html(self, url):                                                                                                               
        try:                                                                                                                               
            proxy_info = requests.get(self.proxy_orbit_url).json()                                                                         
            proxy = proxy_info['curl']                                                                                                
            html = requests.get(url, headers={"User-Agent":self.user_agent}, proxies={"http":proxy, "https":proxy}, timeout=5)        
        except Exception as e:                                                                                                        
            print(e)                                                                                                                  
            return ""                                                                                                                 
        return html.content.decode('latin-1')                                                                                         

    def get_links(self, url):    
        html = self.get_html(url)    
        parsed = urlparse(url)    
        base = f"{parsed.scheme}://{parsed.netloc}"    
        links = re.findall('''&amp;lt;a\s+(?:[^&amp;gt;]*?\s+)?href="([^"]*)"''', html)    
        for i, link in enumerate(links):    
            if not urlparse(link).netloc:    
                link_with_base = base + link    
                links[i] = link_with_base    

        return set(filter(lambda x: 'mailto' not in x, links))    

    def extract_info(self, url):    
        html = self.get_html(url)    
        meta = re.findall("&amp;lt;meta .*?name=[\"'](.*?)['\"].*?content=[\"'](.*?)['\"].*?&amp;gt;", html)    
        return dict(meta)    

    def crawl(self, url):    
        for link in self.get_links(url):    
            if link in self.visited:    
                continue    
            self.visited.add(link)    
            info = self.extract_info(link)    

            print(f"""Link: {link}    
Description: {info.get('description')}    
Keywords: {info.get('keywords')}    
            """)    

            self.crawl(link)    

    def start(self):    
        self.crawl(self.starting_url)    

if __name__ == "__main__":    
    crawler = PyCrawler("https://google.com")    
    crawler.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, not much has really changed here. Three new class variables were created: &lt;code&gt;proxy_orbit_key&lt;/code&gt;, &lt;code&gt;user_agent&lt;/code&gt;, and &lt;code&gt;proxy_orbit_url&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;proxy_orbit_key gets the Proxy Orbit API Token from an environment variable named &lt;code&gt;PROXY_ORBIT_TOKEN&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user_agent&lt;/code&gt; sets the crawler's User-Agent header to Firefox so that requests look like they are coming from a real browser.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;proxy_orbit_url&lt;/code&gt; is the Proxy Orbit API endpoint that we will be hitting. We filter the results, requesting only HTTP proxies that support SSL and have been checked in the last 30 minutes.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;get_html&lt;/code&gt;, a new HTTP request is made to the Proxy Orbit API to fetch a random proxy, which is then passed to the requests module so that the URL we are trying to crawl is fetched from behind that proxy.&lt;/p&gt;
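&lt;p&gt;To make the proxy handoff concrete, here is a minimal sketch of turning the API response into the dictionary that the requests module expects. The sample JSON is made up for illustration; only the &lt;code&gt;curl&lt;/code&gt; key comes from this post.&lt;/p&gt;

```python
import json

def build_proxies(response_text):
    """Turn a Proxy Orbit response into a requests-style proxies dict."""
    proxy_info = json.loads(response_text)
    proxy = proxy_info["curl"]  # curl-ready proxy URL
    # Route both plain and SSL traffic through the same proxy
    return {"http": proxy, "https": proxy}

# Hypothetical API response, used only for illustration
sample = '{"curl": "http://1.2.3.4:8080"}'
print(build_proxies(sample))
```

&lt;p&gt;The resulting dict is what you would pass as the &lt;code&gt;proxies&lt;/code&gt; argument to &lt;code&gt;requests.get&lt;/code&gt;.&lt;/p&gt;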

&lt;p&gt;If all goes well then that's it! We should now have a real working web crawler that pulls data from web pages and supports rotating proxies.&lt;/p&gt;

&lt;p&gt;UPDATE:&lt;/p&gt;

&lt;p&gt;It seems that some people have been having trouble with the final script, specifically the &lt;code&gt;get_html&lt;/code&gt; method. This is likely due to a Proxy Orbit API token not being set. In the constructor there is a line, &lt;code&gt;self.proxy_orbit_key = os.getenv("PROXY_ORBIT_TOKEN")&lt;/code&gt;. This line attempts to get an environment variable named &lt;code&gt;PROXY_ORBIT_TOKEN&lt;/code&gt;, which is where your API token should be set. If the token is not set, the line &lt;code&gt;proxy = proxy_info['curl']&lt;/code&gt; will fail because the proxy API will return JSON signifying an unauthenticated request and won't contain any key &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;
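&lt;p&gt;A quick way to see (and guard against) that failure mode is to use &lt;code&gt;.get()&lt;/code&gt; instead of indexing directly. The error payload below is hypothetical; the point is only that an unauthenticated response has no &lt;code&gt;curl&lt;/code&gt; key.&lt;/p&gt;

```python
import json

def extract_proxy(response_text):
    """Return the proxy from the API response, or None if the request was rejected."""
    proxy_info = json.loads(response_text)
    # .get() returns None instead of raising KeyError when "curl" is missing
    proxy = proxy_info.get("curl")
    if proxy is None:
        print("No proxy in response - is PROXY_ORBIT_TOKEN set?")
    return proxy

print(extract_proxy('{"error": "unauthenticated request"}'))
```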

&lt;p&gt;There are two ways to get around this. The first is to sign up at Proxy Orbit, get your token, and set your &lt;code&gt;PROXY_ORBIT_TOKEN&lt;/code&gt; env variable properly. The second way is to replace your &lt;code&gt;get_html&lt;/code&gt; function with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def get_html(self, url):                                                                                                               
        try:                                                                                                                                                                                                                            
            html = requests.get(url, headers={"User-Agent":self.user_agent}, timeout=5)        
        except Exception as e:                                                                                                        
            print(e)                                                                                                                  
            return ""                                                                                                                 
        return html.content.decode('latin-1')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that if you replace the &lt;code&gt;get_html&lt;/code&gt; function, all of your crawler requests will use your own IP address.&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Optimizing a Full-Text Search Engine - Compression</title>
      <dc:creator>Frankie</dc:creator>
      <pubDate>Thu, 11 Jul 2019 01:23:32 +0000</pubDate>
      <link>https://dev.to/fprime/optimizing-a-full-text-search-engine-compression-49jm</link>
      <guid>https://dev.to/fprime/optimizing-a-full-text-search-engine-compression-49jm</guid>
<description>&lt;p&gt;About three months ago I was presented with a problem. The problem was that I wanted to be able to search for Podcasts based on an episode's transcript instead of an episode's title, tags, etc. However, searching large amounts of text gets very hard very quickly. Normal database engines (MySQL, MongoDB, PostgreSQL, etc.) did not provide the level of performance I was looking for. I wanted something that was extremely fast and easy to deploy without having to write much code.&lt;/p&gt;

&lt;p&gt;So, in order to solve my dilemma, I decided to take matters into my own hands and build my own solution: a full-text indexing and search engine server that could handle large quantities of information and provide near-instant search results.&lt;/p&gt;

&lt;p&gt;Since &lt;a href="https://github.com/f-prime/fist"&gt;Fist&lt;/a&gt; has been in development, a few people have taken interest and have greatly helped advance the project to the next level. We are moving fast toward the goal of creating a production-ready full-text search engine that is lightweight and easy to deploy.&lt;/p&gt;

&lt;p&gt;The problem is that it is not there yet. There are problems, quite a few problems. Problems that make Fist unfit for any kind of production use. In order to take Fist to that next level we need to make it lighter, faster, and way more efficient.&lt;/p&gt;

&lt;p&gt;There are currently three pressing problems that Fist faces. The first is that the index file that is saved to disk grows too big too fast. The second is that large numbers of simultaneous requests are not processed quickly enough. The third is that the indexing algorithm is too slow.&lt;/p&gt;

&lt;p&gt;In this three part blogging series we will be implementing the necessary changes to help alleviate these problems. &lt;/p&gt;

&lt;p&gt;We will be: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Implementing index file compression&lt;/li&gt;
&lt;li&gt;Optimizing request handling&lt;/li&gt;
&lt;li&gt;Optimizing indexing &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post, which is the first in the series, will go over how we can use &lt;a href="http://software.schmorp.de/pkg/liblzf.html"&gt;liblzf&lt;/a&gt; to compress our index file before it is saved to disk. This greatly increases the amount of index data we can store on our system without eating unnecessary amounts of storage space.&lt;/p&gt;

&lt;p&gt;liblzf is an incredibly small data compression library. The entire library consists of four files: two .c files and two .h files. This makes it easily embeddable inside a code base while still keeping things lightweight. Pretty much, it's exactly what we are looking for.&lt;/p&gt;

&lt;p&gt;LZF compression is a very fast and efficient data compression algorithm. It is used by both Redis and OpenVPN for their data compression needs. It has been proven to work well in real production environments and is a good choice if things need to stay lightweight.  &lt;/p&gt;

&lt;p&gt;Fist currently does not use any kind of data compression when saving the database file to disk. This made it easier to implement the index serialization and write it to the disk in the beginning. We are past that point now though and the rate at which the index grows is starting to become a problem.&lt;/p&gt;

&lt;p&gt;Since I am an avid listener of the Joe Rogan Experience Podcast, I thought it would be a good place to start to provide Fist with real data. Episodes are also upwards of three hours long, so there is a lot of dialog that can be indexed. Fortunately for me, Aaron Hendrick was kind enough to open-source the transcripts for a bunch of episodes of the JRE podcast for us to use: &lt;a href="https://github.com/achendrick/jrescribe-transcripts"&gt;https://github.com/achendrick/jrescribe-transcripts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I threw a Python script together that parsed through the transcript files, extracted just the dialog, and indexed it in Fist using &lt;a href="https://github.com/dyne/fistpy"&gt;fistpy&lt;/a&gt;. Each file that contained the episode's transcript (some didn't have the transcript) was roughly 120KB - 170KB in size. &lt;/p&gt;

&lt;p&gt;The script I used is below. It requires that the &lt;code&gt;jrescribe-transcripts&lt;/code&gt; folder be in the same directory as where it is run and it depends on the &lt;a href="https://github.com/dyne/fistpy"&gt;fistpy&lt;/a&gt; client library. It also requires &lt;a href="https://github.com/f-prime/fist"&gt;Fist&lt;/a&gt; to be running locally on port 5575, which is the default port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import json
import re
import string
from fist.client import Fist

f = Fist("localhost", 5575)
files = os.listdir("jrescribe-transcripts")

bytes_indexed = 0

for fname in files:
    fpath = "jrescribe-transcripts/{}".format(fname)
    if not os.path.isfile(fpath):
        continue
    with open(fpath, 'r') as transcript_file:
        data = transcript_file.read()

    data = data.replace("---json", '')
    data = data.split("---")
    try:
        json_data = json.loads(data[0])
    except Exception:
        # Some files have malformed front matter (or no transcript); skip them.
        print(f"Failed to read JSON data from {fname}")
        continue

    script = data[1].lower()
    remove_strings = ["&amp;lt;episode-header /&amp;gt;", "&amp;lt;episode-footer /&amp;gt;", "&amp;lt;transcribe-call-to-action /&amp;gt;", '\n']

    for string_to_remove in remove_strings:
        script = script.replace(string_to_remove, '')
    for i in re.findall("&amp;lt;timemark seconds=\"[0-9]+\" /&amp;gt;", script):
        script = script.replace(i, '')
    for p in string.punctuation:
        script = script.replace(p, '')

    bytes_indexed += len(script)
    f.index(f"{fname} {script}")

print(f"Indexed {bytes_indexed} bytes")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;After indexing just 358 of the 1395 files available the index was already a whopping 607MB. This didn't come as too much of a surprise, and luckily there is a lot of room for improvement. Since our index naturally contains a lot of duplicate words, LZF should be able to do a very good job of compressing the information. &lt;/p&gt;
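&lt;p&gt;The effect of duplicate words on compression is easy to demonstrate. liblzf has no Python standard library binding, so the sketch below uses zlib as a stand-in; the exact ratios differ from LZF, but the principle is the same: highly repetitive text compresses far better than unique text would.&lt;/p&gt;

```python
import zlib

# A crude stand-in for index data: the same words repeated many times
repetitive = ("podcast episode transcript " * 1000).encode()
compressed = zlib.compress(repetitive)

ratio = len(compressed) / len(repetitive)
print(f"original: {len(repetitive)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {ratio:.3f}")
```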

&lt;p&gt;Below is what the serialization function looks like before implementing compression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void sdump(hashmap *hmap) {
    // Write size of hashmap to file. (# keys)

    FILE *dump = fopen("fist.db", "wb");

    uint32_t num_indices = 0;

    for(int i = 0; i &amp;lt; HMAP_SIZE; i++) {
        // Get number of indices that have values
        hashmap on = hmap[i];
        if(on.length &amp;gt; 0)
            num_indices++;
    }

    fwrite(&amp;amp;num_indices, sizeof(num_indices), 1, dump);
    // Iterate through hashmap and write key and array of values to file

    for(int i = 0; i &amp;lt; HMAP_SIZE; i++) {
        hashmap on = hmap[i];
        if(on.length &amp;gt; 0) {
            for(int key = 0; key &amp;lt; on.length; key++) {
                keyval object = on.maps[key];
                uint32_t length = object.key.length;
                // Writes key length and key name to db file
                fwrite(&amp;amp;length, sizeof(length), 1, dump);
                fwrite(dtext(object.key), object.key.length, 1, dump);

                uint32_t num_values = object.values.length;
                // Writes number of values associated with key to db file
                fwrite(&amp;amp;num_values, sizeof(num_values), 1, dump);
                for(int value = 0; value &amp;lt; object.values.length; value++) {
                    // Writes value to db file
                    dstring value_on = object.values.values[value];
                    uint32_t val_length = value_on.length;
                    fwrite(&amp;amp;val_length, sizeof(val_length), 1, dump);
                    fwrite(dtext(value_on), value_on.length, 1, dump);
                }
            }
        }
    }
    fclose(dump);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The function, named &lt;code&gt;sdump()&lt;/code&gt;, is very simple. It takes in a pointer to a &lt;code&gt;hashmap&lt;/code&gt; and serializes the data into a custom binary format that will ultimately get written to a file on the disk named &lt;code&gt;fist.db&lt;/code&gt;. The goal here is to take the data that would be written and compress it before it is written. &lt;/p&gt;
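&lt;p&gt;The format &lt;code&gt;sdump()&lt;/code&gt; writes is essentially a sequence of length-prefixed records. A rough Python sketch of the same layout might look like this (fixed-size native-endian integers are an assumption here; the C code writes whatever the platform's byte order is):&lt;/p&gt;

```python
import struct

def serialize(index):
    """Serialize {key: [values]} into the length-prefixed layout sdump() uses."""
    out = bytearray()
    out += struct.pack("=I", len(index))            # number of keys
    for key, values in index.items():
        kb = key.encode()
        out += struct.pack("=I", len(kb)) + kb      # key length, key bytes
        out += struct.pack("=I", len(values))       # number of values for this key
        for value in values:
            vb = value.encode()
            out += struct.pack("=I", len(vb)) + vb  # value length, value bytes
    return bytes(out)

def deserialize(blob):
    """Inverse of serialize(), for round-trip checking."""
    pos = 0
    def read_u32():
        nonlocal pos
        (n,) = struct.unpack_from("=I", blob, pos)
        pos += 4
        return n
    def read_str():
        nonlocal pos
        n = read_u32()
        s = blob[pos:pos + n].decode()
        pos += n
        return s
    index = {}
    for _ in range(read_u32()):
        key = read_str()
        index[key] = [read_str() for _ in range(read_u32())]
    return index

index = {"rogan": ["ep100", "ep101"], "elon": ["ep1169"]}
assert deserialize(serialize(index)) == index
```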

&lt;p&gt;In order to do this we have to make a few changes to &lt;code&gt;sdump()&lt;/code&gt; and implement a new function called &lt;code&gt;sdump_compress()&lt;/code&gt; that allows us to first write the binary data to a temporary file, compress that data, then write the information to the disk. The code for this is below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void sdump_compress(unsigned char *data, uint64_t original_size) {
    FILE *compressed = fopen("fist.db", "wb");
    fwrite(&amp;amp;original_size, sizeof(original_size), 1, compressed);

    char *buffer;
    if((buffer = malloc(original_size * 3)) == NULL) {
        perror("Could not allocate memory during compression. DB file will not be saved.");
        fclose(compressed);
        return;
    }
    long size;
    if(!(size = lzf_compress(data, original_size, buffer, original_size * 3))) {
        printf("Compression error\n");
        fclose(compressed);
        free(buffer);
        return;
    }
    fwrite(buffer, size, 1, compressed);
    fclose(compressed);
    free(buffer);
}

void sdump(hashmap *hmap) {
    // Write binary data to a temporary file, load the temp file into memory, compress it, save it
    // to disk.

    FILE *dump = tmpfile();

    if(dump == NULL) {
        perror("Could not create tmpfile during sdump. DB file will not be saved.");
        return;
    }

    uint32_t num_indices = 0;

    for(int i = 0; i &amp;lt; HMAP_SIZE; i++) {
        // Get number of indices that have values
        hashmap on = hmap[i];
        if(on.length &amp;gt; 0)
            num_indices++;
    }

    fwrite(&amp;amp;num_indices, sizeof(num_indices), 1, dump);
    // Iterate through hashmap and write key and array of values to file

    for(int i = 0; i &amp;lt; HMAP_SIZE; i++) {
        hashmap on = hmap[i];
        if(on.length &amp;gt; 0) {
            for(int key = 0; key &amp;lt; on.length; key++) {
                keyval object = on.maps[key];
                uint32_t length = object.key.length;
                // Writes key length and key name to db file

                fwrite(&amp;amp;length, sizeof(length), 1, dump);
                fwrite(dtext(object.key), object.key.length, 1, dump);

                uint32_t num_values = object.values.length;
                // Writes number of values associated with key to db file
                fwrite(&amp;amp;num_values, sizeof(num_values), 1, dump);
                for(int value = 0; value &amp;lt; object.values.length; value++) {
                    // Writes value to db file
                    dstring value_on = object.values.values[value];
                    uint32_t val_length = value_on.length;

                    fwrite(&amp;amp;val_length, sizeof(val_length), 1, dump);
                    fwrite(dtext(value_on), value_on.length, 1, dump);
                }
            }
        }
    }

    fseek(dump, 0, SEEK_END);
    uint64_t len = ftell(dump);
    fseek(dump, 0, SEEK_SET);
    unsigned char *buffer;
    if((buffer = malloc(len)) == NULL) {
        perror("Could not allocate memory during sdump. DB file will not be saved.");
        fclose(dump);
        return;
    }
    fread(buffer, 1, len, dump);

    sdump_compress(buffer, len);
    fclose(dump);
    free(buffer);
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Only a few changes had to be made to &lt;code&gt;sdump()&lt;/code&gt;. Firstly, we are now opening a new &lt;code&gt;tmpfile()&lt;/code&gt; instead of &lt;code&gt;fist.db&lt;/code&gt; for writing. We then write the binary data to this temporary file and allocate a buffer the size of the file. That data then gets passed to a new function &lt;code&gt;sdump_compress()&lt;/code&gt; that compresses the binary data using &lt;code&gt;lzf_compress()&lt;/code&gt; and writes that information to a file named &lt;code&gt;fist.db&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Notice in &lt;code&gt;sdump_compress()&lt;/code&gt; that before writing the compressed binary information to &lt;code&gt;fist.db&lt;/code&gt; we write the original size of the decompressed information to the first 8 bytes of the output file. This is needed during decompression.&lt;/p&gt;
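&lt;p&gt;That header is what makes decompression possible: the decompressor needs to know how big the output buffer must be. The round trip can be sketched in Python, with zlib standing in for liblzf (which has no standard library binding); the 8-byte size header works the same way:&lt;/p&gt;

```python
import struct
import zlib

def compress_with_header(data):
    """Prefix the compressed blob with the original size, like sdump_compress()."""
    return struct.pack("=Q", len(data)) + zlib.compress(data)

def decompress_with_header(blob):
    """Read the 8-byte size header, then decompress the rest."""
    (original_size,) = struct.unpack_from("=Q", blob, 0)
    data = zlib.decompress(blob[8:])
    assert len(data) == original_size  # header lets us size and verify the output
    return data

payload = b"some serialized index data " * 100
blob = compress_with_header(payload)
assert decompress_with_header(blob) == payload
```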

&lt;p&gt;Decompression is the same process in reverse. If you are interested in seeing how that works, you can view the entire &lt;code&gt;serializer.c&lt;/code&gt; file &lt;a href="https://github.com/f-prime/fist/blob/master/fist/serializer.c"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's most important is how much this helped. Going back to our original test with the JRE podcast files, running the same Python script over the same number of transcript files created a &lt;code&gt;fist.db&lt;/code&gt; file size of ~300MB as opposed to the 600MB+ file from the initial test. The file size was effectively cut in half. This isn't too surprising since there is a lot of duplicate information being compressed.&lt;/p&gt;

&lt;p&gt;Timing the new compression code shows that it is roughly 45% slower than the old approach. However, saving to disk only happens occasionally, at two-minute intervals and when the server is stopped. This has no impact on indexing or search speed, so it is acceptable for now. The dramatic decrease in file size also makes up for the decrease in speed.&lt;/p&gt;

&lt;p&gt;The speed hit could be reduced by skipping the temporary file entirely, but this implementation was chosen to save time and code.&lt;/p&gt;

</description>
      <category>c</category>
      <category>computerscience</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
