Brief
In this post, we'll discuss the data we need, how we can get it, and how we might want to format it.
What data do we need?
As discussed in the last post, the data we need is the ratings received for each restaurant per minimum interval. For the sake of brevity, we'll just assume for now that the minimum interval is a month.
The biggest problem that comes to mind is that we need this data not only for the present, but also at each interval in the past.
For example, we can easily get the current rating data at the time of writing, which is April 2022. But we also need the state of reviews as of March 2022, February 2022, January 2022, and so on.
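One way to picture the format we're after is a row per restaurant per month. The sketch below is only an illustration: it assumes each review arrives as a (date, rating) pair, and the function name and row fields are made up for this example rather than taken from any API.

```python
from collections import defaultdict
from datetime import date


def monthly_snapshots(reviews):
    """Return one row per month that has reviews, oldest first.

    `reviews` is an iterable of (date, rating) pairs for one restaurant.
    The row layout is a hypothetical format, not something an API returns.
    """
    # Group ratings by (year, month).
    buckets = defaultdict(list)
    for posted_on, rating in reviews:
        buckets[(posted_on.year, posted_on.month)].append(rating)

    rows = []
    total_count = 0
    total_sum = 0
    for year, month in sorted(buckets):
        ratings = buckets[(year, month)]
        total_count += len(ratings)
        total_sum += sum(ratings)
        rows.append({
            "month": f"{year}-{month:02d}",
            "new_reviews": len(ratings),
            "month_average": sum(ratings) / len(ratings),
            # State of the restaurant's reviews as of the end of this month.
            "cumulative_reviews": total_count,
            "cumulative_average": total_sum / total_count,
        })
    return rows


sample = [
    (date(2022, 1, 14), 5),
    (date(2022, 1, 30), 3),
    (date(2022, 3, 2), 4),
    (date(2022, 4, 9), 2),
]
snapshots = monthly_snapshots(sample)
```

Months without any reviews (February in the sample) produce no row here; whether to fill those gaps is a formatting decision for later.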
Where could we get this data?
I was hoping to use Google's own API, as it has a generous monthly free tier, but unfortunately it is quite limited for our purposes. It only offers current review data, and only a very limited view of a location's review history.
This means we'll have to lean on an alternative data source. After looking through a few options on my own, I've landed on DataForSEO, which provides us with a way to get scraped data from Google's search engine results.
How will we get this data?
Our input options are limited to whatever Google has decided to make available on their SERPs (Search Engine Result Pages), because that's what DataForSEO operates on.
We'll get the data with the following steps:
- Get a list of different neighbourhoods in Singapore, provided by the Google Locations API on DataForSEO.
- Request 'food nearby' search results at each of these locations using DataForSEO's SERP API.
- For each unique result, we will request available reviews using DataForSEO's Business API, using the Google Place IDs returned in the previous step.
This will give us - based on Google's discretion - up to 5300 locations, some of which may be duplicates, or non-food locations, which we will have to filter out.
After doing some initial testing, I've found that out of the 4985 locations returned by the SERP API, 4005 were duplicates, and 127 were non-food locations, leaving us with only 890 unique results.
It may be worth toying with increasing the number of results per location, or trying to look into the coverage of the areas provided by step 1, but this seems sufficient for our initial proof-of-concept.
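The de-duplication and filtering described above could look roughly like the following. This is a sketch under assumptions: the `place_id` and `category` field names and the category values are hypothetical stand-ins, and the real response fields may well differ.

```python
def unique_food_places(results, food_categories):
    """Keep the first occurrence of each place, then drop non-food places.

    `results` is a list of dicts with hypothetical "place_id" and
    "category" keys; `food_categories` is the set of categories we accept.
    """
    seen = set()
    kept = []
    for place in results:
        place_id = place["place_id"]
        if place_id in seen:
            continue  # the same place came back for another neighbourhood
        seen.add(place_id)
        if place["category"] in food_categories:
            kept.append(place)
    return kept


# Made-up sample data for illustration only.
raw_results = [
    {"place_id": "A", "category": "restaurant"},
    {"place_id": "B", "category": "hawker_centre"},
    {"place_id": "A", "category": "restaurant"},   # duplicate
    {"place_id": "C", "category": "supermarket"},  # non-food
]
places = unique_food_places(raw_results, {"restaurant", "hawker_centre"})
```

Keying on the Place ID rather than the name avoids merging different branches of the same chain.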
Obstacles
What obstacles do we have?
Price
The first obstacle is pricing. The SERP API is very affordable, but the Business API is much more expensive, priced at $0.00075 per 10 reviews.
The aforementioned 890 unique locations add up to a total of 660,750 reviews (rounded up to the nearest 10 for each location). At $0.00075 per 10 reviews, requesting all of them would cost approximately USD 50. While that's not prohibitively expensive, it's not something I'd like to spend before ensuring that all of this will pan out the way I'm imagining.
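To make the arithmetic explicit, here is a small sketch of the estimate. The price and the 660,750 total come from the figures above; the function names and the sample review counts are made up for illustration.

```python
import math

PRICE_PER_10_REVIEWS = 0.00075  # USD, the Business API price quoted above


def billable_reviews(review_count):
    """Round one location's review count up to the nearest 10."""
    return math.ceil(review_count / 10) * 10


def estimated_cost(review_counts):
    """Estimated cost in USD of requesting every review for every location."""
    batches = sum(math.ceil(count / 10) for count in review_counts)
    return batches * PRICE_PER_10_REVIEWS


# 660,750 reviews (already rounded per location) is 66,075 batches of 10.
total_cost = (660_750 // 10) * PRICE_PER_10_REVIEWS

# Hypothetical per-location counts, only to show the rounding.
sample_cost = estimated_cost([7, 10, 123])
```

With these numbers `total_cost` works out to roughly USD 49.56.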
I don't believe this to be a major obstacle - if the concept proves to be useful, $50 is not a large price to pay, even if it is not paid back. However, it may be prudent to rely instead on mock data, at least for the testing phase.
Limitations
The second obstacle is a limitation of DataForSEO's API. The Business API returns at most 4490 reviews per location, which is not enough to cover the full review history of some of the more popular locations.
Although it would be ideal to cover each location's entire history, I believe it is not an essential part of our website, given that anything beyond 4490 results is probably reaching into the mid-2010s.
This may be a time to refine our requirements to include a target timeframe - perhaps 'recent trends over the last two years' or something of the sort, which should be covered by 4490 reviews.
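If we do settle on a target timeframe, we can check per location whether the capped results actually reach back far enough. The sketch below assumes that a capped request returns the most recent reviews and that each review has a date; the function names are hypothetical.

```python
from datetime import date

REVIEW_CAP = 4490  # maximum results per location, as described above


def window_start(as_of, years):
    """Date `years` years before `as_of`."""
    try:
        return as_of.replace(year=as_of.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year.
        return as_of.replace(year=as_of.year - years, day=28)


def covers_window(review_dates, as_of, years=2, cap=REVIEW_CAP):
    """True if the fetched reviews cover the last `years` years.

    Assumes a capped fetch returns the most recent reviews.
    """
    if len(review_dates) < cap:
        return True  # we were given everything the location has
    return min(review_dates) <= window_start(as_of, years)


today = date(2022, 4, 30)
# cap=2 keeps the example small; these two fetches both hit the cap.
deep_enough = covers_window([date(2022, 4, 1), date(2019, 5, 1)], today, cap=2)
too_shallow = covers_window([date(2022, 4, 1), date(2021, 5, 1)], today, cap=2)
```

Any location that fails this check is one where the cap cuts into the timeframe we care about.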
Where are we at now?
We've laid out the steps for getting the data from DataForSEO, and discussed the major obstacles in working with their API along with how we can 'deal' with them.
I believe in the next step, we can finally start looking at how we will approach the design of our software.