Search autocomplete for 2 million records with React & AWS CloudSearch

jlei523 ・5 min read

True Home is a bootstrapped web app that provides a home value estimate for every property in Hong Kong — this means over 2 million homes.

Allowing users to look up their home as efficiently as possible became a challenge.

To make it easy for users to find their home, we built a search autocomplete service using the following stack:

  • AWS CloudSearch
  • React
  • react-autosuggest & autosuggest-highlight modules
  • Express.js server

A few tidbits on Hong Kong real estate:

  • The majority of the population lives in tall apartment buildings that can have hundreds of units.
  • Every building in Hong Kong has a unique name like “The Kennedy on Belchers”.
  • Hong Kongers don’t generally refer to where they live by an address such as “123 Main Street”. Rather, they use their building name and district like “The Belchers Block A in Causeway Bay”.

[Image: HK buildings]

People in Hong Kong live like this. The middle building also happens to be my current home!

Our search autocomplete requirements:

  • Can’t use Google Places API because there is no way to connect the results to our database records.
  • Can’t use Algolia because it’s simply far too expensive for our bootstrapped app ($700 USD/month for 2 million records).
  • Users should be able to search by the building’s name.
  • Users should be able to search for their exact unit by the building name and the unit number.

Before we build, we design!

To design the look and feel of our search functionality, I used Sketch and drew inspiration from Redfin, where I once worked.

Nailing the design early was important because it helped me figure out what tools I needed to use and what data was required.

[Image: design]

Now let’s prep the data for AWS CloudSearch

You can prep your data in JSON, CSV, XML, or plain-text formats. We chose JSON because batch uploading only supports JSON and XML.

True Home has two search categories: buildings and units.

Here’s an example of what our JSON file looks like:

[
  {
    "buildingaddress": "8 LEUNG TAK STREET",
    "Name": "EIGHT REGENCY (Tuen Mun)",
    "type": "building"
  },
  {
    "buildingaddress": "8 LEUNG TAK STREET",
    "Name": "31/F FLAT N - NA EIGHT REGENCY (Tuen Mun)",
    "type": "unit"
  }
] 
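The records above are the raw data. As far as I know, the upload-documents command expects each record wrapped in CloudSearch's batch format (a type, a unique id, and a fields object), with field names limited to lowercase letters, digits, and underscores. Here's a minimal sketch of that conversion. The toBatch helper and its id scheme are hypothetical, made up for illustration, so adjust them to however you identify your own records.

```javascript
// Hypothetical helper: wraps raw records in the CloudSearch batch format.
// Assumption: ids only need to be unique, so they are built from a prefix and the index.
function toBatch(records, idPrefix = "doc") {
  return records.map((record, index) => {
    const fields = {};
    for (const [key, value] of Object.entries(record)) {
      // Lowercase field names, e.g. "Name" becomes "name".
      fields[key.toLowerCase()] = value;
    }
    return { type: "add", id: `${idPrefix}_${index}`, fields };
  });
}

const sample = [
  {
    buildingaddress: "8 LEUNG TAK STREET",
    Name: "EIGHT REGENCY (Tuen Mun)",
    type: "building",
  },
];
const batch = toBatch(sample);
console.log(JSON.stringify(batch, null, 2));
```

The console upload can apparently do this conversion for you, but a script like this is handy once you're generating files for the command line.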

Uploading data to AWS CloudSearch

You can upload data in two ways: via the AWS console or from the terminal with the aws command-line tool.

The maximum file size you can upload to AWS CloudSearch is only 5 MB. This presented a problem for us because we had 2 million records totaling 900 MB of data to upload!

To solve this problem, we generated 180 JSON files, each slightly under 5 MB, and batch uploaded them with the aws command-line tool.
Here’s the bash script we used to loop through all 180 JSON files and upload them to our endpoint:

# Loop over the files with a glob instead of parsing ls output, and quote the filename.
for file in *.json; do
  echo "$file"
  aws cloudsearchdomain --endpoint-url {ENDPOINT URL here} upload-documents --content-type application/json --documents "$file"
  sleep 1
done
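Splitting 900 MB of records into files that each stay under the limit is easy to script. Here's a minimal sketch of one way to do it, not necessarily how our files were generated. The chunkBySize helper, the MAX_BATCH_BYTES constant, and the 1 KB safety margin are all assumptions for illustration.

```javascript
// Assumed limit: stay a little under 5 MB per batch file.
const MAX_BATCH_BYTES = 5 * 1024 * 1024 - 1024;

// Splits documents into arrays whose JSON serialization stays under maxBytes.
function chunkBySize(docs, maxBytes = MAX_BATCH_BYTES) {
  const chunks = [];
  let current = [];
  let currentBytes = 2; // the surrounding "[" and "]"

  for (const doc of docs) {
    // +1 accounts for the comma separating array items.
    const docBytes = Buffer.byteLength(JSON.stringify(doc), "utf8") + 1;
    if (current.length > 0 && currentBytes + docBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 2;
    }
    current.push(doc);
    currentBytes += docBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Tiny demonstration with an artificially small limit.
const docs = Array.from({ length: 10 }, (_, i) => ({
  id: `doc_${i}`,
  name: "x".repeat(20),
}));
const chunks = chunkBySize(docs, 100);
console.log(chunks.map((chunk) => chunk.length));
```

Each chunk can then be written to its own file with JSON.stringify and picked up by the upload loop. A single document larger than the limit would still end up in a chunk of its own, so that case needs separate handling.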

Testing the search results

One nice thing about AWS CloudSearch is that you can run test searches against your data in the console immediately.

Here, we can test our newly uploaded data:

[Image: aws]

Setting up an Express.js endpoint for AWS CloudSearch

Once you’ve verified that you can search your newly uploaded data, spin up an API on the server to query it. True Home happens to use Express.js.

The data flow works like this:

User types a search → search query is sent to Express server → Express server gets data from CloudSearch endpoint → Express sends search results back to browser

Wait a minute! Why do we need to go through a server? Why not just query the CloudSearch endpoint directly from the browser?

Unfortunately, CloudSearch doesn’t support CORS, which means you either have to go through a server like Express.js or set up some kind of proxy service. Both add latency to each query.

Luckily for us, the latency hit isn’t too big because our server and CloudSearch instance are hosted in the same AWS location.

Here’s an example of how to set up the Express API:

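const axios = require("axios");

// `server` is the Express app, created elsewhere.
server.get("/autocomplete/:searchString", async (req, res) => {
  const cloudSearchEndpoint = "your endpoint here";
  // Encode user input so spaces and special characters don't break the URL.
  const query = encodeURIComponent(req.params.searchString);

  try {
    // axios puts the parsed response body on `data`.
    const { data } = await axios.get(
      `${cloudSearchEndpoint}/2013-01-01/search?q=~${query}&return=_all_fields%2C_score&highlight.label=%7B%22max_phrases%22%3A3%2C%22format%22%3A%22text%22%2C%22pre_tag%22%3A%22*%23*%22%2C%22post_tag%22%3A%22*%25*%22%7D&highlight.unitcode=%7B%22max_phrases%22%3A3%2C%22format%22%3A%22text%22%2C%22pre_tag%22%3A%22*%23*%22%2C%22post_tag%22%3A%22*%25*%22%7D&sort=_score+desc`
    );
    res.json(data);
  } catch (err) {
    // Report a failed CloudSearch request instead of leaving the browser hanging.
    res.status(502).json({ error: "Search request failed" });
  }
});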
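The percent-encoded query string is hard to read and easy to break by hand. Here's a minimal sketch that builds the same parameters with the standard URLSearchParams class. The buildSearchUrl helper and the example endpoint are hypothetical. The parameter values are simply the ones decoded from the URL above.

```javascript
// Highlight options decoded from the hand-encoded query string.
const HIGHLIGHT_OPTIONS = JSON.stringify({
  max_phrases: 3,
  format: "text",
  pre_tag: "*#*",
  post_tag: "*%*",
});

// Hypothetical helper: builds the CloudSearch search URL for a search string.
function buildSearchUrl(endpoint, searchString) {
  const params = new URLSearchParams({
    q: `~${searchString}`,
    return: "_all_fields,_score",
    "highlight.label": HIGHLIGHT_OPTIONS,
    "highlight.unitcode": HIGHLIGHT_OPTIONS,
    sort: "_score desc",
  });
  return `${endpoint}/2013-01-01/search?${params.toString()}`;
}

// Placeholder endpoint, not a real CloudSearch domain.
const url = buildSearchUrl("https://search-example.invalid", "eight regency");
console.log(url);
```

URLSearchParams takes care of the encoding, so changing a highlight tag or adding a field becomes a one-line edit.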

CloudSearch has official JavaScript support, but I had a hard time getting it to work. AWS documentation, in general, is lacking.
As a workaround, I simply used the auto-generated URL from the testing tool as the URL my Express server fetches.

Building the React component

[Image: react]

True Home’s React search component is built with react-autosuggest. We chose this module because it has excellent documentation and easy-to-follow examples.

Initially, I was worried about the difficulty of highlighting the words as the user typed but autosuggest-highlight made this a breeze.

All in all, the front-end code took about 4 hours to complete. Most of the time was spent on formatting the data from CloudSearch and the rest was spent on styling the component.
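To give a sense of the formatting work, here's a minimal sketch of one piece of it: turning a highlighted string from CloudSearch into parts that a component can render. It assumes the highlight text uses the *#* and *%* tags requested in the query URL. The parseHighlight helper is hypothetical, not code from our component, and its { text, highlight } output simply mirrors the shape of the parts that autosuggest-highlight's parse function returns.

```javascript
const PRE_TAG = "*#*";
const POST_TAG = "*%*";

// Hypothetical helper: splits a highlighted string into { text, highlight } parts.
function parseHighlight(marked) {
  const parts = [];
  let rest = marked;
  while (rest.length > 0) {
    const start = rest.indexOf(PRE_TAG);
    if (start === -1) {
      // No more tags: the remainder is plain text.
      parts.push({ text: rest, highlight: false });
      break;
    }
    if (start > 0) parts.push({ text: rest.slice(0, start), highlight: false });
    const afterPre = start + PRE_TAG.length;
    const end = rest.indexOf(POST_TAG, afterPre);
    if (end === -1) {
      // Unclosed tag: treat the remainder as highlighted.
      parts.push({ text: rest.slice(afterPre), highlight: true });
      break;
    }
    parts.push({ text: rest.slice(afterPre, end), highlight: true });
    rest = rest.slice(end + POST_TAG.length);
  }
  return parts;
}

const parts = parseHighlight("*#*EIGHT*%* REGENCY (Tuen Mun)");
console.log(parts);
```

Each part can then be rendered as a span, with a highlight class on the parts where highlight is true.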

Here’s True Home’s search component in its entirety for reference.

The result

https://thumbs.gfycat.com/WhisperedAfraidAoudad-mobile.mp4

Conclusion: Search autocomplete is surprisingly easy to build with modern tools, but I wouldn’t use CloudSearch again

The whole feature took about 32 hours to complete, much faster than I expected given that I had no prior experience with search.

As a comparison, it took a far more experienced engineer well over a month to build Redfin’s search functionality back in 2014. Granted, Redfin’s search had more requirements, more data, and more platforms to support.

The most time-consuming parts were prepping the data for CloudSearch and looking up CloudSearch’s awful and sparse documentation.

Amazon doesn’t seem to be improving CloudSearch anymore. The last major update was all the way back in 2013. I suspect that this is due to Elasticsearch surpassing Solr (what CloudSearch is based on) in popularity.

If I had to do this over again, I would choose Elasticsearch over CloudSearch because the former has better documentation and supports CORS.

And that’s it!
