Authored by Ben Hannel
Operational analytics on real-time data streams requires being able to slice and dice it along all the axes that matter to people, including time and space. We can see how important it is to analyze data spatially by looking at an app that’s all about location: Airbnb. Major events in San Francisco cause huge influxes of people, and Airbnb prices increase accordingly. However, these price increases are highly localized around these events. Airbnb publishes pricing data for the past and future, and we can use this data to see how prices spike around major events well before they happen.
We’ll look at three major events. The first is Outside Lands Music and Art Festival, which brought over 90,000 people to Golden Gate Park in August. We’ll also look at prices around Oracle OpenWorld and Dreamforce, two large conferences at Moscone Center. We ran the queries using Rockset’s new geospatial functions.
For all three events, there is a noticeable increase in the average price of Airbnbs within a one kilometer radius of the event. In the case of Outside Lands, the mean price spiked by over 30%!
In order to make geospatial queries fast, we reimagined Rockset’s search index. Rockset is built on three types of indexes- columnar storage, row storage, and a search index. We store each of the indexes in RocksDB, an ordered key-value store. The search index allows queries for all documents with a particular value, or a range of values, to run quickly. For each value a field takes on, the search index stores a sorted list of document IDs which have that value. This allows a query like this one to run quickly:
SELECT * FROM people WHERE name='Ben'
All we need to do is look up the key “name.Ben” in the search index.
When we introduced the geography type to the IValue framework, we needed to extend the capabilities of the search index. Typical geospatial queries are not usually searching for exactly one point, but for some compact region of points, like all points within a given distance, or within a polygon. To serve this need, we repurposed the search index to work differently for geographies. First, we partition the surface of the earth into a hierarchical grid of roughly square cells using the S2 library.
For each point in a collection, we add an entry in the search index for each cell which contains it. Since these cells form a hierarchy, a single point is contained by many cells- its immediate parent, and all of that cell’s ancestors. This increases space usage, but pays off with better query performance. In the figure above, a point in cell A in the figure will also be added to cells B, C, and D, because each of these cells contains cell A.
To find all points in a given region, we find a set of cells which covers that region. While every cell in the region (in this case Florida) is in the set of cells which covers it, some of the cells fall partly outside the target region. To ensure our results are exact, we check if these candidate points are contained by the region after retrieving them from storage, and discard those which are not. Thanks to the index, we never have to examine any points outside this set of cells, vastly reducing the query time for selective queries.
First, download and extract calendar.csv.gz and listings.csv.gz from Airbnb for your location and time of interest (I used the data for August in San Francisco). Then create a Rockset account if you don’t already have one, and upload each CSV to a separate Rockset collection.
Create a collection and upload calendar.csv. Specify that the format is CSV/TSV, and the default format options should be correct.
Create another collection and upload listings.csv, but this time you’ll need to specify a transformation. Before the geography type and geospatial queries, you had to do the math to compute distances between latitude/longitude points yourself (as we did when analyzing SF car break-ins). With geographies, we can specify a transformation which combines the latitude and longitude field into one object, and tells Rockset to create an geospatial index on it. The fields are initially strings, so we first cast them to floats, then convert them to a geography with the following transformation:
ST_GEOGPOINT(CAST(:longitude AS float), CAST(:latitude AS float))
Note that the longitude comes first.
Once your data has been ingested and indexed, you can run this query to get the daily average price near Moscone Center:
Again, for easy copy and paste:
SELECT c.date, AVG(CAST(replace(REPLACE(c.price, '$'), ',') as FLOAT)) average_price FROM commons.AirbnbCalendar c JOIN commons.AirbnbListings l on c.listing_id = l.id WHERE c.date < '2019-12-01' AND ST_DISTANCE(ST_GEOGPOINT(-122.400658, 37.784035), l.geography) < 1000 GROUP BY c.date ORDER BY date;
We can also look at how the prices between Airbnb’s very close to Golden Gate Park, and further away. I created this visualization using Rockset’s Tableau integration. Prices for the weekend of Outside Lands are in orange. Prices in blue are the average over all of August.
Again, for easy copy and paste:
SELECT CAST(replace(REPLACE(c.price, '$'), ',') as FLOAT) price, ST_DISTANCE( ST_GEOGPOINT(-122.491341, 37.768761), l.geography ) distance, IF(( c.date >= '2019-08-09' AND c.date <= '2019-08-11' ), 'During Outside Lands', 'August Average') AS during_outside_lands FROM commons.AirbnbCalendar c JOIN commons.AirbnbListings l on c.listing_id = l.id WHERE c.date >= '2019-08-01' AND c.date <= '2019-08-30' AND ST_DISTANCE( ST_GEOGPOINT(-122.491341, 37.768761), l.geography ) < 1300
As you can see, you’ll pay a substantial premium to get an Airbnb near Golden Gate Park during the weekend of the Outside Lands. However, if you can settle for a place a little further away, the prices look more like typical Airbnb prices. With Rockset, you can go from deeply nested data with latitude and longitude fields to fast, expressive geospatial queries in under an hour.