
Will Hadoop-based Recommendation Engines Make Search Obsolete?

Jasmine Morgan ・ 4 min read

Do you feel overwhelmed by information? Would you love to get precisely what you need without answering numerous questions or going through a difficult filtering process? Enter the recommendation engine, a tool designed to give you more of what you love, based on your own past preferences or those of people like you. The raw matter that is converted into recommendations is data, and lots of it.
The challenge is processing customer-related data that is so large it doesn't fit on a single machine. Hence the term Big Data, which is characterized by the 3 Vs: volume, velocity, and variety. Google proposed the MapReduce model, and the Hadoop tool emerged as a way of distributing data across a cluster of machines, analyzing the pieces in parallel, and assembling the results.

The MapReduce Architecture and Hadoop

The large datasets previously mentioned are more than traditional processing tools can handle. To understand the MapReduce approach, imagine an organization has just won a data processing project. The project manager distributes similar tasks to employees in the mapping phase, then each worker does their part, essentially the same job distributed over many nodes. When they are done, the project manager (master node) centralizes the results, in what is called the reduce phase.
Instead of moving data toward a processing unit, the new approach runs the analysis where the data is stored and sends only the final results over the network. This saves time and bandwidth. Hadoop is a tool that facilitates these actions in an accessible, robust, scalable, and straightforward manner.

Collaborative filtering, Content-based and Hybrid

Recommendation engines can be built either by taking all past choices of a single user and trying to guess the trend and future preferences or by aggregating the likes of similar users and computing the most likely next move.
Content-based filtering takes as input the tags of the data. For example, in the case of a movie, it could use the genre, names of the actors, director, keywords for the storyline, place it was filmed, and so on. It also requires building a user profile with past movies watched, favorite stars, preferred genres, and some demographics. A score is then computed with a technique called cosine similarity, which measures how closely the item's feature vector matches the user's profile vector.
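As a minimal sketch of that scoring step, the snippet below computes cosine similarity between a movie's tag vector and a user profile vector. The binary tag encoding and the genre categories are illustrative assumptions, not part of any specific system.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no information in one of the vectors
    return dot / (norm_a * norm_b)

# Hypothetical binary tag vector: [action, comedy, drama, sci-fi]
movie = [1, 0, 0, 1]          # an action/sci-fi title
user_profile = [1, 0, 1, 1]   # aggregated from the user's watch history

score = cosine_similarity(movie, user_profile)  # ~0.816
```

A score close to 1 means the item's tags align closely with the profile; items are then ranked by this score.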
For collaborative filtering, user behavior is more important, and the data gathered reflects preferences, ratings, shares, and comments, signaling interest in a specific item. Again, the profile of the user is created from third-party data gathered from browser history or through their social media profile if they used a social login. The significant advantage of collaborative filtering is that, when seeded with such third-party data, it can offer recommendations even to new users, since it doesn't require any direct input from them. This is useful for sites that want to personalize the experience for first-time users.
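A toy version of user-based collaborative filtering is sketched below: find the neighbor whose liked items overlap most with the target user's, then suggest items that neighbor liked but the target hasn't rated. The rating data, the 4-star "liked" threshold, and the overlap-count similarity are simplifying assumptions; production systems use richer similarity measures over far more users.

```python
def overlap_similarity(a, b):
    """Crude similarity: count of items both users rated 4 stars or higher."""
    liked_a = {item for item, r in a.items() if r >= 4}
    liked_b = {item for item, r in b.items() if r >= 4}
    return len(liked_a & liked_b)

def recommend(target, neighbors):
    """Suggest items the most similar neighbor liked but the target hasn't rated."""
    best = max(neighbors, key=lambda n: overlap_similarity(target, n))
    return sorted(item for item, r in best.items()
                  if r >= 4 and item not in target)

# Hypothetical item -> rating dictionaries
alice = {"dune": 5, "arrival": 4}
others = [
    {"dune": 5, "arrival": 5, "solaris": 4},  # similar taste
    {"notebook": 5, "titanic": 4},            # different taste
]

recommend(alice, others)  # ['solaris']
```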
Of course, the two systems can also be mixed into a laser-precise hybrid, where collaborative filtering adds diversity and content-based filtering narrows the list to only those items that fit the user, saving them time.

Putting the pieces together

Both techniques for building recommendation engines are suitable for Hadoop MapReduce. For example, in a collaborative filtering scenario, the data loaded into HDFS (Hadoop Distributed File System) could be the ratings and reviews of books on Amazon. The system performs distributed text parsing: extracting words, counting them, and classifying them into sentiment clusters.
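The parse-count-classify flow maps naturally onto the two MapReduce phases. Below is a single-process sketch of that idea, assuming tiny hard-coded sentiment word lists; a real Hadoop job would run the mapper on each node holding a data split and ship only the (key, count) pairs to the reducers.

```python
from collections import defaultdict

# Hypothetical opinion-word lists; real systems use sentiment lexicons or models
POSITIVE = {"great", "excellent", "loved"}
NEGATIVE = {"boring", "bad", "awful"}

def mapper(review):
    """Map phase: emit a (sentiment, 1) pair for each opinion word in a review."""
    for word in review.lower().split():
        if word in POSITIVE:
            yield ("positive", 1)
        elif word in NEGATIVE:
            yield ("negative", 1)

def reducer(pairs):
    """Reduce phase: sum the counts per sentiment key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

reviews = ["Loved it, great plot", "Boring and bad pacing", "Great cast"]
pairs = [p for r in reviews for p in mapper(r)]
totals = reducer(pairs)  # {'positive': 3, 'negative': 2}
```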
Based on this information, in a hybrid system, if a buyer is logged into their account, they will receive items ranked in a preference order. This order is determined by cluster analysis: each user is assigned to a cluster by measuring the distance between their preferences and the average preferences of that group, known as the cluster center.
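The assignment step can be sketched as follows, assuming hypothetical cluster centers over average category ratings. This is the nearest-center rule used in k-means-style clustering; computing the centers themselves is a separate (and typically distributed) job.

```python
def nearest_cluster(user_prefs, centers):
    """Assign a user to the cluster with the closest center
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: sq_dist(user_prefs, centers[i]))

# Hypothetical centers: average ratings over [fiction, non-fiction, tech]
centers = [
    [4.5, 1.0, 0.5],  # cluster 0: fiction lovers
    [1.0, 4.0, 2.0],  # cluster 1: non-fiction readers
    [0.5, 2.0, 4.5],  # cluster 2: tech readers
]

user = [0.8, 1.5, 4.0]
nearest_cluster(user, centers)  # 2: recommend from the tech readers' favorites
```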
For anonymous users, the system can use a handful of signals, such as location, device type, and previously browsed pages, to create a general set of recommendations.

Data issues

In the implementation phase, data science consultants from InData Labs warn about the importance of data quality and consistency. Since the input is gathered from unstructured sources and generated by users under no formatting constraints, a cleaning and structuring step is necessary.
Multilingual sites pose another challenge: should results be presented in an aggregated way, using translation, or individually, per region? Of course, this is dictated by the application's logic and business goals.

How good is a recommendation engine?

Of course, regardless of the chosen architecture, one can ask just how good an individual recommendation engine is. Just building it is not enough; it needs to be validated for accuracy and precision.
Accuracy can be defined as the number of right guesses made by the system (true positives and true negatives) divided by the total number of predictions. Precision is the ratio between true positives and the sum of true positives and false positives, showing what share of the system's positive guesses were actually correct.
Positives and negatives are determined by sentiment analysis, classifying the words used by clients in reviews into the appropriate clusters. Special attention needs to be paid to irony, which can be misclassified as positive sentiment.
One more metric is worth measuring: recall. This is defined as the number of relevant reviews retrieved by a search divided by the total number of relevant documents in existence.
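The three metrics reduce to simple ratios over the counts of true/false positives and negatives. A minimal sketch, with an assumed set of confusion-matrix counts from a hypothetical validation run:

```python
def accuracy(tp, tn, fp, fn):
    """Correct guesses (both kinds) over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of positive predictions that were actually relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of all relevant items that were actually retrieved."""
    return tp / (tp + fn)

# Hypothetical counts: 70 true pos., 15 true neg., 10 false pos., 5 false neg.
accuracy(70, 15, 10, 5)   # 0.85
precision(70, 10)          # 0.875
recall(70, 5)              # ~0.933
```

Note the trade-off: a system that recommends everything has perfect recall but poor precision, which is why both are tracked together.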

The search of the next decade

In a world where recommendation engines know the customer so well, we can expect to reach the pinnacle of search, defined as "no search." The user is supposed to be so satisfied with the result computed by the machine that they accept it without further refinement. This is a significant difference compared to the complicated query syntax of SQL, and even the specialized Google queries available today. We could expect a "mind reader" search option based on third-party data such as browser cookies and social media preferences, all neatly filtered through Hadoop.



We've seen this develop as a trend – Netflix is the most obvious example, but Yelp is a good business case too, I believe – and I think it's generally positive. I am, however, concerned that there's a capacity for negative social/cultural impact with these recommendation engines when applied to topics that are matters of taste (as opposed to something like "config steps users like you found helpful"). My concern is that Netflix's recommendations and crowdsourced tools like Yelp fuel a regression to the mean in terms of content/experience, and that mean may not actually exist. There might not be an "average taste" when it comes to food or content, and these systems may filter out great, idiosyncratic options. Do you think there's a way the precision vs accuracy measures can be used to control for this, or that these recommendation engines can be implemented in such a way that still surfaces odd choices?