DEV Community: Bloomreach

How to Use the Bloomreach Content SaaS Developer Application

Adam Pengh — Tue, 20 Jun 2023 16:27:51 +0000

Overview
Featured Functionality
Getting Started
- Create an API Token for Each CMS Instance
- Configure the Developer Extension Application
Sample Actions
- Copying Content Types
- Copying Components

Overview

Every customer that uses our Bloomreach Content SaaS product has two CMS instances provisioned.

A Sandbox CMS instance, which is typically used for Development and QA environments
A Production CMS instance, which is typically use for Acceptance/Staging and Production environments

In order to facilitate the transfer of content and/or configuration from one CMS instance to another, we've created a standalone application that utilizes the Content SaaS Management APIs.

The application can be accessed at https://bloomreach-content-ui-extension.netlify.app/ or it can be cloned/forked from the public GitHub repository: https://github.com/adampengh/bloomreach-content-saas-ui-extensions

Featured Functionality

Copy Content Types between instances
Copy Pages between channels in a single instance
Edit Channel configuration/properties
Copy Channel configuration between instances or between channels in a single instance
- Copy/Delete Components
- Copy/Delete Layouts
- Copy/Delete Routes
- Copy/Delete Menus
Batch Export Content
Batch Import Content

Getting Started

Create an API Token for Each CMS Instance

Log in to Bloomreach Content using an account with Site Developer privileges.
Navigate to Setup > brXM API token management
Click on the + API token button in the top right
Fill in a Token name, choose an Expiration date, check the Read and/or Write checkboxes for the APIs you want to use the token with, and click on Create. We recommend setting the expiration date as soon as possible, to automatically limit the risk of accidental credential exposure.
Copy the token to the clipboard or write it down. This is the only time it will be displayed. We recommend simply creating new tokens at need, rather than attempting to store the token value in a password manager or other long-term storage.

Configure the Developer Extension Application

When first accessing the application, you will be redirected to the Configuration page. Here you will see configuration settings for a Source Environment and Target Environment.

The Source Environment is the CMS instance you want to copy content/configuration from or make configuration changes in.
The Target Environment is the CMS instance where the content/configuration will be copied to.

In the Source Environment:

Enter the namespace of the environment. This is the subdomain portion of the URL immediately before bloomreach.io.
Enter the API Token generated earlier.
Click Save.
From the Projects dropdown, either select an existing Developer Project or click the "+" button to create a new Developer Project.
- To copy Content Types between instances, there needs to be an active Developer Project that includes Content Type Changes in the Target Environment.

In order to copy content/configuration to the Target Environment, follow the above steps for the Target Environment configuration section.

After the configuration has been set, the configuration is saved in the browser's localStorage.

Sample Actions

Copying Content Types

Click on Content Types from the sidebar navigation
Select the Content Types to copy in one of two ways:
- Click on the copy button next to a single Content Type or
- Select the checkbox of multiple Content Types, then click the Copy button on the top right of the page
In the confirmation modal, review the changes, then click the Copy button

Copying Components

Add the channel to the currently selected Source Environment project. If the channel has already been added, skip to Step 6
Click on Projects from the sidebar navigation
Click on the Project from the list of Source Projects
Click the "Add Channels" button
In the modal, select the channels to add and click "Add Channels"
Click on Channels from the sidebar navigation
Click on the channel from the list of channels in the currently selected Source Environment project
Click on the Components tab
Select the Components to copy in one of two ways:
- Click on the copy button next to a single Component or
- Select the checkbox of multiple Components, then click the Copy button on the top right of the page

Building Shared Components

developersblogs — Mon, 12 Jun 2023 17:50:15 +0000

Hey Babu! Shared Components!

In the world of CSS, there’s the concept of using only one stylesheet file to optimize web loading time, such that only a single request is needed when the client opens the page. As we moved into web-based application, the practice of writing a single file carried over.

Styles.css or themes.css are some of the most common names seen in projects and often come with hundreds, if not thousands, of definitions within a single file. As people come and go, this file tends to become a dumping ground and no one knows what’s in there anymore.

Aside from the occasional updates and modifications to individual items, the number of definitions in there that no longer have a corresponding DOM element, increases as time passes. But as the motto goes, "if it ain't broke, don't fix it." That single stylesheet usually turns into the technical debt that gets carried over and over.

At Bloomreach

During our initial phases, code duplication was a common phenomenon. For instance, our “common” stylesheet would exist in various forms across different dashboard projects.

After all sorts of mutation and cloning, it becomes impossible to track down what’s relevant and what's not.

From a couple of lines to more than 6 thousand lines, the work required to combine and clean up the files without causing user interface (UI) issues, is as daunting as replacing the columns in the basement of a skyscraper and hoping it doesn't fall down.

First Pass

During the rewrite of Bloomreach Dashboard in late 2016, one of the key objectives was to modularize our UI Components, i.e. the ability to import the specific stylings related to each component only.
Together with Webpack, React, and Sass, the entire br-theme file of 6000+ lines in Dashboard was deprecated.

This particularly enabled future engineers to find the related code/stylesheet way faster. This along with the introduction of Sass improved the readability and maintainability in a significant manner.

But...

Everything sounds great now, with an average of 70 lines per file, specific to the component that imported it. The styling becomes much more readable and manageable. But, what actually happened next was the issue crept into the component level.

Instead of cloning the stylesheet, now we clone the entire module into a new project. Then, as the project develops, so do those cloned components, taking on their own tweaks as new use cases occur.

This wasn’t too much of a concern initially. However, as we moved into the phase of combining dashboards, we had multiple copies of the same components, such as BrButtons.
With slight differentiation between each, it was a nightmare to maintain each individual component. A stricter and cleaner approach for reusable components became necessary.

Shared Components / Design Systems

Looking across the industry, we are certainly not the only company facing this issue. Atlassian, with some existing applications in React and migration of old ones into React, were stung with the pain of maintaining 45 different dropdown implementations. This is part of the reasoning for creating a unified internal design system that is codified and visualized.

The same can be seen across the board, such as with Airbnb and the well-known material design by Google. Many of these companies have gone through multiple rounds of evolutions.

Our Setup

Combining with the effort of unifying dashboards, we started to host the in-house shared components library, Babu Library, on our internal NPM server. Beginning with migrating UI components from all Dashboards one by one into Babu. All the while making sure the library is generic and flexible and, at the same time, adheres to our standardized styling guidelines.

In addition, the dependencies that some components have on Redux need to be eliminated. This allowed for all components to be usable out-of-the-box, without a complicated setup.

As we roll on with the development, a couple more requirements became obvious:

The capability for future engineers and product team to view what’s available through demo pages, instead of reinventing the wheel
The ability to quickly understand how to integrate without having to flip through pages of code
Clear guidelines to follow for development “on” and “with” the library
Utilities that could be shareable aside from the UI based components, such as Javascript utility functions

Hurdles in the Process

Some of the speed bumps along the way include:

1. Bundle Size and Version Conflicts

Given the requirement of providing a visualizable demo page, we included an entry point to serve the Babu library as a standalone application, which requires the React library and a few dependencies to be bundled up during the Webpack build process.

However, although the requirement is met, the bundled library size became too big for the consuming applications. In addition, if the consuming application also imported libraries that were bundled already, we now effectively included two copies of such libraries in the consuming application build, which further increased the size of it, and can also run into version conflicts between the two copies.

After analyzing the webpack process and resulting library sizes with webpack-bundle-analyzer, by utilizing npm peerDependencies setting and webpack exclusion (config.externals), we were able to reduce Babu library size from 2.6 MB to 523 KB, which is further reducible by minification.

2. Live Editing

To allow developers to quickly grasp how each component is used, the best way is to show the code required and how it is rendered. However, given the number of parameters different components accepts, there is no easy way to show it all in a simple static display.

By utilizing acorn-jsx, developers can then perform editing and rendering of the components live on the page. However, this only works for basic static components that simply renders based on the props passed in, not for components where states need to be maintained and changed based on user interactions or callbacks.

<Wrapper
         Component={BrEditableLabel}
         states={{value: 'text'}}
         handlers={{onSetLabel: function (value) { this.setState({value}) }}} />

To achieve this, a wrapper is developed to allow simulating the relationship of a consuming component and the shared component. It hosts the states and function callbacks, such that when a user interacts with the component UI, a callback is triggered to make updates to the states, which then feeds back into the component to trigger the update livecycle, for a fully interactive experience flow.

3. Continuous Development

During development, developers can symlink the Babu Library into their consuming project, such that any modifications on the library will be rebuilt and then triggers the build flow of the consuming application. This provides a much faster cycle from modification to seeing effects in the consuming application.

During testing, we now utilize tags to allow deployment of the library into a separate branch. This allows developers to publish non-production ready library without influencing the main branch, especially in cases where a full build is required for deploying to a server for QA purpose.

How is This Used Today?

The entire Babu codebase and usage can be thought of from three perspectives

Product Team (Designers/PMs) - Looking at the demo pages to explore the current set of components and its stylings, when designing new products/features or updating existing ones, as well as development for a unified UI/UX.
Consuming Application Developers - Looking mostly at the demo pages for how to use/integrate the components into your applications.
Babu Library Developers - Working together with the product team and engineering teams to update and enhance the library. As well as maintaining the demo page and develop best practices on the design system.

The Unfinished Path

Our original goal was completed after shipping the fully rewritten UI components. However, quite a number of improvements and additional requirements surfaced along the way, and we are looking to have these addressed as V3 in future. Some of the major items include:

Selenium Testing Framework
Flow Type Checking
Automatic documentation and demo page generations
Change-log / release documentation and notifications
Improved versioning and deployment flow
UI visualizable code change comparator
Improve and open source our Live Editing Feature for React Components (based on acorn-jsx).
Templates to enforce standardized styling and speed up new component development

Where did the Name Come from?

This is an interesting one. As we were releasing the library, the team urged me to choose a better name to replace the wordy “shared components library”. Without anything particular in mind or any maple trees outside my window (that was one of our internal Maple Dashboard got its name), this question was simply put aside as we have bigger challenges to tackle.

Until one of our global team meetings. I was reminded of my first ever trip to India back in June 2018, when I was asked for the name again. Aside from all the support and bonding, I am grateful for, one particular thing that stuck in my mind was when our teammate Shekhar referred me as Jason Babu.

I was confused initially - what’s a Babu? Is that a synonym for Baby? Wait...what? I didn’t know the culture in India is this open! After some inquiries, it turns out I was just hallucinating. The word “babu” is simply a suffix for showing respect, similar to the use of “Sir” or “Madam”.

So with my full appreciation, respect for the team, and the catchy name, we are proud to introduce to you to Babu Library 😊

Identity and Access Management at Bloomreach: Who are you, what are you doing here and other existential questions

developersblogs — Mon, 12 Jun 2023 17:39:54 +0000

The Integration team at Bloomreach has a very well defined charter: Make integrating Bloomreach products as easy as possible. Our primary customer is the IT professional that has been tasked with integrating Bloomreach products into their existing technology stack. Our efforts to make the integration easy generally involves building tooling that either 1-) Streamlines integrations behind the scenes or 2-) Provides more insight and control to that IT Professional. Since the team’s inception, we’ve had a lot of success behind the scenes, but our focus has now firmly moved on to providing tools to empower our customers to be as successful as possible as quickly as possible.

Historically, the process for the user and access management was to contact our support team and file a ticket. This led to a suboptimal experience for all involved, and it really made sense to hand over this functionality to our customers. After all, they know what makes sense for their organization infinitely better than we ever could. Let’s empower them to quickly make the changes that make sense to them. Moreover, whatever tool we build, let’s ensure that our internal support team uses the same tooling we’re providing to our customers. This proved to be an incredibly fun and challenging engineering problem! We have many customers from many completely unrelated industries, which meant that organizational structures varied wildly. We really needed a tool that was both flexible and intuitive enough that could meet the needs of a wide array of customers. This forced us to deal with a lot of assumptions around authorization that none of us even realized we were making.

This post is part of a series in which we’ll talk in detail about the solution we developed and the challenges we encountered along the way. By the end of this first post, you should have an idea of why it made sense to tackle this problem, why the common off the shelf solutions didn’t quite fit our use case and the basic structure of what eventually worked for us.

When we think about authorization, what generally leaps into mind is some form of the question: Is allowed to perform ? This question can easily be answered by a traditional authorization paradigm such as Role-Based-Access-Control (RBAC). At least, it can be, until you become a company with many customers and those customers have many nested sub-organizations that can potentially have organizations nested inside them, etc, etc. To tackle that problem with RBAC, the number of roles you have to support grows exponentially, which quickly becomes painful to manage.

We discovered that our implementation of identity and access management could not be a flat role-based access control system, it had to incorporate the additional concept of an organizational hierarchy.

As mentioned above: Bloomreach has a lot of customers, and those customers typically have several accounts. Accounts, from a Bloomreach perspective, are really just an abstract project. It could represent the entirety of a single customer or just a specific environment that belongs to a customer. From a programmatic perspective it looked a bit like this:

A flat structure, of all the accounts. Some of these accounts are clearly related to each other, but those links were often implicit.

We already had internal, and programmatic access to these links. But more than just a set of links, we really need to take into account our customers' organizational structure. Something that looked more like this:

Bloomreach, from an architectural perspective, is a multi-tenant company. Our customers are also often multi-tenant internal to their interactions with Bloomreach. A global office supply company, for example, has many brands, sites, and org units that all exist under the parent umbrella of the office supply company. This company will want to manage these as if they are separate tenants that they own.

Once this hierarchy was in place, plus the surrounding tooling, our customer could control their users and access. Support would no longer need to create users for one of our customers, a customer with the require roles could create users for their own organization, and grant that user whatever roles that user needed. In the example diagram below, jamie.fakeuser@officesupplies.ca would have the ability to create other users that would live under the www.officesupplies.ca node. Jamie could then grant that new user whatever access they needed.

In addition to Users having Roles at a location in the organizational hierarchy, organization and accounts needed something similar to a role. Each of our customers uses a subset of our product offerings, so those customers should only be able to grant their users access to those applications. Internally, we think of these as PermissionGroups, but we’ll talk more about those later.

The “organizational node” types we support are https://www.sumologic.com/, https://www.sumologic.com/, https://www.sumologic.com/, and https://www.sumologic.com/ and we plan to implement more over time, and we ultimately plan to turn over complete control of a customer's organizational hierarchy to the customer. They shouldn’t have to learn and conform to what we think their structure is, they should define it.

Once we had settled on a design that would meet our needs, we rolled up our sleeves and started hunting for our tool stack. Authentication and password management was already being managed by Auth0, so that was off our plate. We needed to provide good logging and auditing, but since we already used SumoLogic internally it made sense to just plug that in. That left choosing a data store. We tried many approaches including a GIT backed file system (we were already drawing a lot of inspiration from hierarchical file systems anyway!), zookeeper, and home growing our own hierarchical structure in MySQL, before landing on Postgres. Postgres has custom data types that align closely with our needs and enables us to do a one to one translation from concept to data - no hacks involved. In a later post, we’ll talk a bit more in-depth about these tools.

I hope this post gave you a little context about the problems we had to solve, why traditional solutions wouldn’t quite work, and what we decided to do. Keep an eye out for future posts where we’ll dive a little deeper into this architecture, as well as some of the design challenges we faced when trying to present users with an intuitive view of the described hierarchy. We’ll also discuss the Gatekeeper, which is a tool we wrote that actually attempts to answer all Authorization questions before a service is even aware of a request.

Discovery | Synonym Generation at Bloomreach

Rory Warin — Fri, 09 Jun 2023 13:31:48 +0000

Blog written by: Apurva Gupta, Antariksh Bothale & Soubhik Bhattacharya from Bloomreach, 2016

Abstract

As a company that’s in the business of helping people find things and helping our customers “Get Found,” it’s important for Bloomreach to accurately understand what people mean when they search for something. User queries are written in human language, which leads to dialectal variations in vocabulary (different names for the same thing—cookies/biscuits). Then there is the problem of context. In the phrases “frozen yogurt” and “frozen meat,” the word “frozen” carries a very different meaning from the meaning it carries in the phrase, “Frozen toys,” which refers to the hit Disney movie. This problem isn’t helped by the fact that search queries are often just bunches of words, devoid of capitalization, punctuation and/or syntactical clues that can make the meaning clearer – ”nj devils tees” would need to be understood as ”New Jersey Devils T-shirts.” Over- and under-specificity of search queries is another common problem – people searching for “hp printers,” might also be interested in buying printers from other brands.

From new fashion trends and terminology (jorts, anyone?), to words acquiring new connotations (such as the frozen → Frozen™ example), the eCommerce world springs several traps on us on our way to improving query understanding and improving search quality.

Synonym extraction and generation is one of the ways in which we deal with the continuously evolving, contextual nature of human language. To bridge the gap between the query and actual content, we figure out different possible representations of a word and process them to generate synonyms. To mine for synonym pairs, we look at descriptions of 100 million products, process billions of lines of text downloaded from the Web and also look for possible synonyms for queries amongst other queries [30 million queries].

At Bloomreach, we do this every week, learning new words and meanings from new data.

Why do synonyms need automated extraction/generation?

Traditionally eCommerce search engines have relied on following approaches to attain synonyms:

Thesauri: There are many freely available thesauri on the market, which capture different synonyms of words.
Stemmers: e.g. Porter stemmer, which contains rules to reduce words to their root forms — “mangoes => mango.”
Curated Dictionary: Manual entries that define synonyms.

In our tests we found these approaches to have a slew of problems. For example:

Context: A thesaurus will have “black => dark” as synonyms. This is applicable in certain contexts such as “black night => dark night”, but not in “black dress => dark dress”.
Usage: Thesauri do not cover general use of language. “rolling laptop case => rolling computer bag”. In other words, thesauri are not designed for eCommerce or products.
Evolution: “frozen princess => elsa” is a relationship with which all viewers of the movie “Frozen” would be familiar, but which dictionaries would not be.
Precision: In Web search, a user is presented with millions of results and generally finds relevant results among the first few. Whereas in eCommerce, people tend to go through hundreds of results to find the product they wanted. In such a scenario, tail results also need to be relevant, e.g. showing “short stemmed roses” for a search of “shorts” is a very bad user experience.
Sorting: Also, eCommerce portals allow users to sort by price, popularity etc. Such sorting prominently shows users results that otherwise would have been at the end. Hence adding a bad or context-free synonym such as “black dress => dark dress,” may, under certain sorts, show users dresses which are not black.
Hyponyms: It is difficult to find mappings such as “computer accessories => mouse, keyboard” in a dictionary. These are not synonyms but nevertheless are required in order to bridge the gap from user query to the product.

Although these problems can be resolved by using heuristics or by cleaning/augmenting the dictionaries manually, such an approach requires significant investment of manpower; and as language keeps evolving, it is a continuous investment.

Different types of synonyms

Given a query Q, a phrase Q’ will be called synonym of Q, if results of Q’ are also relevant for Q.

Adhering to the above definition, we can classify synonyms into various classes:

Spelling or Lemma variants (e.g. for the query “wifi adapter”)
a) Abbreviation: wireless USB adapter
b) Space variation: wi-fi USB adapter
c) Lemma Variants: wifi USB adaptor
d) Stem: wifi USB adapters, wireless USB adapters
Same meaning
a) women’s => ladies’
b) graph paper => grid paper
Hyponyms/Hypernyms. In this class, synonyms describe a superset of the query:
a) desktop pc => computers.
b) football gear => football cleats, football gloves, football helmet.
Related. These are not synonyms, but related phrases. They can be substituted when results are not available for true synonyms.
a) paper plates => plastic plates
b) abacus => calculator

Different types of synonyms are applied with varying confidence and methodologies: e.g. adapter => adapters can be applied with very high confidence and irrespective of context but “apple => apples” cannot be. Similarly, “plastic plates” should be shown for a search of “paper plates,” only when paper plates are not available. Synonyms need to be graded by quality and type when they are generated in order to account for the complexity of properly applying them.

Generation

The process of synonym generation is closer to “mining” than “searching:” e.g. we do not specifically generate synonyms for a query Q, but we generate all possible synonym pairs and expect pairs with Q or its subparts to have been generated. Just like mining, we can increase the probability of finding useful pairs by looking into the right data and using the right techniques. And just like mining, it will be all in the background, hidden from search users.

Representation

In a few words, the process of synonym generation can be classified into one which filters candidate synonym pairs, where each pair is a tuple of two phrases . Let us take the example of this pair . To us (humans) it is obvious that these two phrases are talking about the same objects. But to a machine, which has no context of what a presentation is or what it means to click, these two phrases are just sequences of letters, as randomly paired as any two other phrases. A simple English phrase, such as “wireless presenter,” contains a lot of contextual information, which goes amiss when we use just these two words to describe something. In representation phase we get a more complete description about this phrase using different sources of information. A few such possible representations and the information they augment are:

Neighboring Context (Bag or embeddings): You shall know a word by the company it keeps (Firth, J. R.)We collect phrases surrounding a phrase (e.g. frozen toys) from all over the Web. Then we count and sort these phrases by their frequency. Thus we obtain a list of frequently occurring phrases around “frozen toys.” These phrases describe the contexts in which “frozen toys” occur. Intuition behind such representation is that words/phrases with similar meaning would occur in similar contexts. Thus by comparing contexts of two phrases, we can compute a similarity score – which gives a quantitative association between two words, or phrases:
P(P1,P2) = probability of P1 and P2 occurring adjacently in same query
Rep(P1) = P(P1,Pi) for all Pi where {Pi} is the set of all phrases in corpus and Rep(P1) is the final representation of P1 in this method.
Recently, word embeddings such as word vectors generated by word2vec have been discovered to represent such contextual information in many fewer dimensions.
Product Clicks (Behavioral): Each phrase is represented as a probability distribution of products/documents that have been viewed/clicked/bought etc., after a user used this phrase as a query. Intuition behind such a representation is that for two search queries, q1 and q2, if users have the same/similar intent, they will click on similar documents and will have similar choices of documents. For example, if a product document is a relevant swimsuit, it will also be relevant swimwear. So, intuitively it should get user clicks for both the search queries and it will also be clicked more frequently than a less relevant swimwear/swimsuit.
P(d1,q) = probability of document d1 being clicked after q was queried and before any other query is done
REP(q) = {P(di,q)} for all documents di
Document vector: A phrase is represented by the “average of documents“ that contain this phrase. To achieve this, we first represent each document containing the phrase as a tf-idf vector of terms/phrases in that document. Then we take the average of all these vectors to achieve a vector representation for the “average of documents.”
Rep(D1) = for all phrases Pi in document D1 {this is representation of document D1}
Rep(P1) = average of Rep(Di) for all Di where P1 occurs {average of all documents containing P1 , represents P1}
Nearby queries: Whenever a user queries for “X” and then issues another query, “Y”, we add “Y” to representation of “X”. Thus we obtain a frequency count of phrases (such as “Y”), which can be used to represent “X”.P(P1,P2) = probability of P1 and P2 occurring in same sessionRep(P1) = P(P1,Pi) for all Pi where {Pi} is the set of all phrases in corpus and Rep(P1) is the final representation of P1 in this method.

Such representations map the humble two-word pairs to a pair of vectors, which have hundreds of dimensions. This new pair contains far more information and is more obvious for an algorithm as a pair.

Similarity

We borrow different distance and divergence metrics from various spaces to judge the quality of a candidate pair.

Different similarity measures:

Asymmetric: Asymmetric measures are used to figure out hyponym/hypernym pairs. For example a low value of |A∩B| / |A| and high value of |A∩B| / |B|, indicates that B is close to being a subset of A. If A and B were products viewed for queries A’ and B’, this measure can be used to determine if A is hypernym of B or vice versa.
Symmetric: Symmetric measures determine if A and B are approximately equivalent or contain similar information. To compare the distributional representation of clicks (2) of two phrases, we use KL – divergence, which is a measure of the information lost when distribution Q is used to approximate distribution P.

KL – Divergence of two distributions for phrase P and phrase Q.

Where summation is over the set of documents and P(i) is the probability of a document being clicked for phrase P. But this is a directional measure and is valid only when Q = 0 implies P = 0. Hence we use a symmetrized version which is JS divergence.

A good JSD score implies that P and Q are both good approximations of each other. Although, if P and Q are over a very small event space, confidence on this score will not be enough to use P and Q as synonym. Having another measure of confidence has proven to help in such cases.

Augment with extra similarities: Besides computing the similarities on the basis of the above representations, we can augment certain similarities such as edit distance.

After computing all similarities and augmentation, we have various scores which tell us the grade and type of synonym pairs. We observed that simple heuristics such as thresholding over the vector of scores works well. But these scores are continuously distributed and choosing a threshold manually caused a lot of good synonyms to get removed. This led us to the finding that thresholds would have to be phrase specific. For example, the distance between “black” and “dark” is less than the distance between “timer => stopwatch,” whereas, the second one happens to be a better synonym pair. We found that using scores of some obvious synonyms for a query helped in determining the threshold: e.g. Any phrase which is closer to “shoe” than “shoes” would be a good synonym for shoe. Thus, D(P, Q) < D(P,P’) where P’ is an obvious variation of P implies that P, Q are good candidates for synonymy.

Since we had already represented phrases and similarity of pairs as vectors , we can also feed these vectors to a supervised classifier for extracting synonyms.

Diagram

Illustration of synonym generation system

Conclusion

In this blo,g we have described how we generate different types of synonyms at BloomReach. It helps us in bridging the gap between our customers and the end users who search on eCommerce websites. We have tried our best to keep the methods independent of language, by not relying on grammar, etc. This will help us in serving people searching for products in different languages. Just like spoken language, our algorithms have their share of problems. They also make mistakes and have their good and bad days. We continuously work at improving them and welcome any suggestions or feedback.

References

http://web.stanford.edu/class/cs124/lec/sem
http://www.aclweb.org/anthology/C10-2151
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36252.pdf
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.4300
http://research.microsoft.com/pubs/167835/idg811-cheng.pdf

Discovery | SolrCloud Rebalance API

Rory Warin — Fri, 09 Jun 2023 13:22:25 +0000

Blog written by: Nitin Sharma & Suruchi Shah from Bloomreach, 2015

Introduction

In a multi-tenant search architecture, as the size of data grows, the manual management of collections, ranking/search configurations becomes non-trivial and cumbersome. This blog describes an innovative approach we implemented at Bloomreach that helps with an effective index and a dynamic config management system for massive multi-tenant search infrastructure in SolrCloud.

Problem

The inability to have granular control over index and config management for Solr collections introduces complexities in geographically spanned, massive multi-tenant architectures. Some common scenarios, involving adding and removing nodes, growing collections and their configs, make cluster management a significant challenge. Currently, Solr doesn’t offer a scaling framework to enable any of these operations. Although there are some basic Solr APIs to do trivial core manipulation, they don’t satisfy the scaling requirements at Bloomreach.

Innovative Data Management in SolrCloud Architecture

To address the scaling and index management issues, we have designed and implemented the Rebalance API, as shown in Figure 1. This API allows robust index and config manipulation in SolrCloud, while guaranteeing zero downtime using various scaling and allocation strategies. It has two dimensions:

The seven scaling strategies are as follows:

Auto Shard allows re-sharding an entire collection to any number of destination shards. The process includes re-distributing the index and configs consistently across the new shards, while avoiding any heavy re-indexing processes. It also offers the following flavors:
Flip Alias Flag controls whether or not the alias name of a collection (if it already had an alias) should automatically switch to the new collection.
Size-based sharding allows the user to specify the desired size of the destination shards for the collection. As a result, the system defines the final number of shards depending on the total index size.
Redistribute enables distribution of cores/replicas across unused nodes. Oftentimes, the cores are concentrated within a few nodes. Redistribute allows load sharing by balancing the replicas across all nodes.
Replace allows migrating all the cores from a source node to a destination node. It is useful in cases requiring replacement of an entire node.
Scale Up adds new replicas for a shard. The default allocation strategy for scaling up is unused nodes. Scale up also has the ability to replicate additional custom per-merchant configs in addition to the index replication (as an extension to the existing replication handler, which only syncs the index files)
Scale Down removes the given number of replicas from a shard.
Remove Dead Nodes is an extension of Scale Down, which allows removal of the replicas/shards from dead nodes for a given collection. In the process, the logic unregisters the replicas from Zookeeper. This in-turn saves a lot of back-and-forth communication between Solr and Zookeeper in their constant attempt to find the replicas on dead nodes.
Discovery-based Redistribution allows distribution of all collections as new nodes are introduced into a cluster. Currently, when a node is added to a cluster, no operations take place by default. With redistribution, we introduce the ability to rearrange the existing collections across all the nodes evenly.

Figure 1: Rebalance API overview

Scenarios

Let’s take a quick view at some resolved uses cases using the new Rebalance API.

Scenario 1: Re-sharding to meet latency SLAs

Collections often grow dynamically, resulting in an increased number of documents to retrieve (up to ~160M documents) and slowing down the process. In order to meet our latency SLAs, we decide to re-shard the collection. The process of increasing shards, for instance from nine to 12, for a given collection, is challenging since there is no accessible method to divide the index data evenly while controlling the placement of new shards on desired nodes.

API call
End Result: As observed in the diagram below, adding a shard doesn’t add any documents by default. Additionally, the node on which the new shard resides is based on the machine triggering the action. With the Rebalance API, we automatically distribute the documents by segmenting into even parts across the new shards.

Figure 2: Auto-Sharding to an increased number of shards.

Scenario 2: Data Migration from node to node.

We have two nodes, one of them is out or dead and we want to migrate the replicas/cores to a different live node. OR we encounter an uneven number of replicas on a set number of nodes, leading to a skewed distribution of data load, and we need to redistribute it across nodes.

API Call:
End Result: As observed in the diagram below, the BEFORE section demonstrates the uneven distribution of replicas/cores across the three nodes. Upon calling the REDISTRIBUTE strategy, we divide the replicas/cores across all nodes evenly.

Figure 3: Redistributing replicas/cores across the nodes.

Scenario 3: Dynamic Horizontal Scaling

Dynamic Horizontal Scaling is very useful when, for instance, we have two cores for a collection and want to temporarily scale up or scale down based on traffic needs and allocation strategy.

API Call: http://host:port/solr/admin/collections?action=REBALANCE&scaling_strategy=SCALE_UP &num_replicas=2&collection=collection_name

End Result: We observe in the diagram below that when new replicas are added, they have to be added one at the time, without control over node allocation. Furthermore, only the index files get replicated. According to the new Rebalance API, all the custom configs are replicated in addition to the index files on the new replicas where the nodes are chosen based on the allocation strategy.

Figure 4: Scaling up both shards by adding 2 replicas each

The chart below compares the states gathered from running tests to calculate the average time to split indexes using various approaches. It is important to note that while the second method only accounts for index distribution, the REBALANCE API (third and fourth) methods also include replication of custom configs.

As we notice in the table below, the Bloomreach Rebalance API performs much better, compared to the first two methods in terms of time. Furthermore, we parallelized the split and sync operation by making the Rebalance API more efficient as demonstrated in the fourth method (for collections over 150M it gives 95% savings in auto-sharding compared to re-indexing).

Conclusion

In a nutshell, Bloomreach’s new Rebalance API helps scaling SolrCloud architecture by ensuring high availability, zero downtime, seamless shard management and by providing a lot more control over index and config manipulation. Additionally, this faster and more robust mechanism has paved the way to automated recovery by allowing dynamic resizing of collections.

And that’s not all! We have implemented the Rebalance API in a generic way so that it can be open sourced. So stay tuned for more details!

Discovery | Solr Compute Cloud - An Elastic Solr Infrastructure

Rory Warin — Fri, 09 Jun 2023 13:14:53 +0000

Blog written by: Nitin Sharma & Li Ding from Bloomreach, 2015

Introduction

Scaling a multi-tenant search platform that has high availability while maintaining low latency is a hard problem to solve. It’s especially hard when the platform is running a heterogeneous workload on hundreds of millions of documents and hundreds of collections in SolrCloud.

Typically search platforms have a shared cluster setup. It does not scale out of the box for heterogenous use cases. A few of the shortcomings are listed below.

Uneven workload distribution causing noisy neighbor problem. One search pipeline affects another pipeline’s performance. (especially if they are running against the same Solr collection).
Impossible to tune Solr cache for the same collection for different query patterns.
Commit Frequency varies across indexing jobs causing unpredictable write load in the SolrCloud cluster.
Bad clients leaking connections that could potentially bring down the cluster.
Provisioning for the peak causes un-optimal resource utilization.

The key to solve these problems is isolation. We isolate the write and read of each the collections. At Bloomreach, we have implemented an Elastic Solr Infrastructure that dynamically grows/shrinks. It helps provide the right amount of isolation among pipelines while improving resource utilization.The SC2 API and HAFT services ( built in house) give us the ability to do the isolation and scale the platform in an elastic manner while guaranteeing high availability, low latency and low operational cost.

This blog describes our innovative solution in greater detail and how we scaled our infrastructure to be truly elastic and cost optimized. We plan to open source HAFT Service in the future for anyone who is interested in building their own highly available Solr search platform.

Problem Statement

Below is a diagram that describes our workload.

In this scheme, our production Solr cluster is the center for everything. As for the other elements:

At the bottom, the public API is serving an average of 100 QPS (query per second).
The blue boxes labeled indexing will commit index updates (full and partial indexing) to the system. Every indexing job represents a different customer’s data, which commit at different frequencies from everyday to every hour to every half hour.
The red boxes labeled pipeline are jobs to run ranking and relevance queries. The pipelines represent various types of read/analytical workload issued against customer data. Two or more pipelines can run against the same collections at the same time, which increases the complexity.

With this kind of workload, we are facing several key challenges with each client we serve. The graph below illustrates some the challenges with this architecture:

For indexing jobs:

Commits at same time as heavy reads: Indexing jobs running at the same time as pipelines and customer queries, which impact both pipelines and the latency of customer queries.
Frequent commits: Frequent commits and non-batched updates cause performance issues in Solr.
Leaked indexing: Indexing jobs might fail resulting in leaked clients, which get accumulated over time.

For the pipeline jobs:

Cache tuning: Impossible for Solr to tune the cache. The query pattern varies between pipeline jobs when working on the same collections.
OOM and high CPU usage: Unevenly distributed workload among Solr hosts in the clusters. Some nodes might have OOM error while other nodes have high CPU usage.
Bad pipeline: One bad client or query could bring down the performance of the entire cluster or make one node unresponsive.
Heavy load pipeline: One heavy load pipeline would affect other smaller pipelines.
Concurrent pipelines: The more concurrent pipeline jobs we ran, the more failures we saw.

Bloomreach Search Architecture

Left unchecked, these problems would eventually affect the availability and latency SLA with our customers. The key to solving these problems is isolation. Imagine if every pipeline and indexing job had its own Solr cluster, containing the collections they need, and every cluster was optimized for that job in terms of system, application and cache requirements. The production Solr cluster wouldn’t have any impact from those workloads. At Bloomreach, we designed and implemented a system we call Solr Compute Cloud (SC2) to isolate all the workload to scale the Solr search platform.

The architecture overview of SC2 is shown in the diagram below:

We have an elastic layer of clusters which is the primary source of data for large indexing and analysis MapReduce pipelines. This prevents direct access to production clusters from any pipelines. Only search queries from customers are allowed to access production clusters. The technologies behind elastic layer are SC2 API and Solr HAFT (High Availability and Fault Tolerance) Service (both built in-house).

SC2 API features include:

Provisioning and dynamic resource allocation: Fulfill client requests by creating SC2 clusters using cost-optimized instances that match resources necessary for requested collections.
Garbage collection: Will automatically terminate the SC2 clusters that exceed an allowed lifetime setting or idle for a certain amount of time.
Pipeline and indexing job performance monitoring: Monitor the cost and performance of each running job.
Reusability: Create an instance pool based on the request type to provision clusters faster. The API will also find an existing cluster based on the request data for read pipelines instead of provisioning a new cluster. Those clusters are at low utilization.

Solr HAFT Service provides several key features to support our SC2 infrastructure.

Replace node: When one node is down in a Solr cluster, this feature automatically adds a new node to replace that node.
Add replicas: Add extra replicas to existing collections if the query performance is getting worse.
Repair collection: When a collection is down, this feature repairs the collection by deleting the existing collection. Then it re-creates and streams data from backup Solr clusters.
Collection versioning: Config of each collection can be versioned and rolled back to previous known healthy config if a bad config was uploaded to Solr.
Dynamic replica creation: Creates and streams data to a new collection based on the replica requirement of the new collection.
Cluster clone: Automatically creates a new cluster based on existing serving cluster setup and streaming data from backup cluster.
Cluster swap: Automatically switches Solr clusters so that the bad Solr cluster can be moved out of serving traffic and the good or newly cloned cluster can be moved in to serve traffic.
Cluster state reconstruction: Reconstructs the state of newly cloned Solr cluster from existing serving cluster.

Read Workflow

We will describe the detailed steps of how read pipeline jobs work in SC2:

Read pipeline requests collection and desired replicas from SC2 API.
SC2 API provisions SC2 cluster dynamically with needed setup (and streams Solr data).
SC2 calls HAFT service to request data replication.
HAFT service replicate data from production to provisioned cluster.
Pipeline uses this cluster to run job.
After pipeline job finishes, call SC2 API to terminate the cluster.

Indexing Workflow

Below are detailed steps describing how an indexing job works in SC2.

The indexing job uses SC2 API to create a SC2 cluster of collection A with two replicas.
SC2 API provisions SC2 cluster dynamically with needed setup (and streams Solr data).
Indexer uses this cluster to index the data.
Indexer calls HAFT service to replicate the index from SC2 cluster to production.
HAFT service reads data from dynamic cluster and replicates to production Solr.

After the job finishes, it will call SC2 API to terminate the SC2 cluster.

Solr/Lucene Revolution Talk 2014 at Washington, D.C.

We spoke in detail about the Elastic Infrastructure at Bloomreach in last year’s Solr Conference. The link to the video of the talk and the slides are below.

Conclusion

Scaling a search platform with heterogeneous workload for hundreds of millions of documents and a massive number of collections in SolrCloud is nontrivial. A kitchen-sink shared cluster approach does not scale well and has a lot of shortcomings such as uneven workload distribution, sub-optimal cache tuning, unpredictable commit frequency and misbehaving clients leaking connections.

The key to solve these problems is isolation. Not only do we isolate the read and write jobs as a whole but also isolate write and read of each the collection. The in-house built SC2 API and HAFT services give us the ability to do the isolation and scale the platform in an elastic manner.

The SC2 infrastructure gives us high availability and low latency with low cost by isolating heterogeneous workloads from production clusters. We plan to open source HAFT Service in the future for anyone who is interested in building their own highly available Solr search platform.

Discovery | Introducing Briefly: A Python DSL to Scale Complex Mapreduce Pipelines

Rory Warin — Fri, 09 Jun 2023 13:07:35 +0000

Blog written by: Chou-han Yang from Bloomreach, 2015

Briefly

Today we are excited to announce Briefly, a new open-source project designed to tackle the challenge of simultaneously handling the flow of Hadoop and non-Hadoop tasks. In short, Briefly is a Python-based, meta-programming job-flow control engine for big data processing pipelines. We called it Briefly because it provides us with a way to describe complex data processing flows in a very concise way.

At Bloomreach, we have hundreds of Hadoop clusters running with different applications at any given time. From parsing HTML pages and creating indexes to aggregating page visits, we all rely on Hadoop for our day to day work. The job sounds simple, but the challenge is to handle complex operational issues without compromising code quality, as well as the ability to control a group of Hadoop clusters to maximize efficiency.

Challenges:

Data skew, or the high variance of data volume from different customers.
Non-Hadoop tasks are mixed in with the Hadoop tasks and need to be completed before or after the Hadoop tasks.
The fact that we run Elastic Map Reduce (EMR) with spot instances, which means clusters might be killed because of spot instance price spikes.

There are several different approaches to solve similar problems with pipeline abstraction, including cascading and Luigi, but they all solve our problems partially. They all provide some features, but none of them help in the case of multiple Hadoop clusters. That’s why we turned to Briefly to solve our large-scale pipeline processing problems.

The main idea behind Briefly, is to wrap all types of jobs into a concise Python function, so we only need to focus on the job flow logic instead of operational issues (such as fail/retry). For example, a typical Hadoop job in Briefly is wrapped like this:

@simple_hadoop_process

def preprocess(self):

  self.config.defaults(

    main_class = 'com.bloomreach.html.preprocess',

    args = ['${input}', '${output}']

  )



@simple_hadoop_process

def parse(self, params):

  self.config.defaults(

    main_class = 'com.bloomreach.html.parser',

    args = ['${input}', '${output}', params]

  )

And similarly, a Java process and a python process look like this:

@simple_java_process

def gen_index(self):

  self.config.defaults(

    classpath = ['.', 'some.other.classpath']

    main_class = 'com.bloomreach.html.genIndex',

    args = ['${input}', '${output}']

  )



@simple_process

def gen_stats(self):

  for line in self.read():

    # Do something here to analyze each line

  self.write('stats output')

And here’s what it looks like when we chain the jobs together to create dependencies:

objs = Pipeline("My first pipeline")

prop = objs.prop



parsed_html = source(raw_html) | preprocess() | parse(params)

index = parsed_html | gen_index()

stats = parsed_html | gen_stats()



targets = [stats, index]

objs.run(targets)

This script creates the workflow like this:

The pipeline can be executed locally (with Hadoop local mode), on Amazon EMR, or on Qubole simply by supplying different configurations. For example, running on Amazon EMR would require the following configuration:

# emr.conf



# Where to run your pipeline. It can be local, emr, or qubole

hadoop.runner = "emr"



# Max number of concurrent EMR clusters to be created

emr.max_cluster = 10



# Instance groups for each cluster

emr.instance_groups = [[1, "MASTER", "m2.2xlarge"], [9, "CORE", "m2.2xlarge"]]



# Name of your EMR cluster

emr.cluster_name = "my-emr-cluster"



# A unique name for the project for cost tracking purpose

emr.project_name = "my-emr-project"



# Where EMR is going to put yoru log

emr.log_uri = "s3://log-bucket/log-path/"



# EC2 key pairs if you want to login into your EMR cluster

emr.keyname = "ec2-keypair"



# Spot instance price upgrade strategy. The multipliers to the EC2 on-demand price you want

# to bid against the spot instances. 0 means use on-demand instances.

emr.price_upgrade_rate = [0.8, 1.5, 0]

Extra keys also need to be provided for specific platforms, such as Amazon EMR:

# your_pipeline.conf



ec2.key = "your_ec2_key"

ec2.secret = "your_ec2_secret"

And then your are good go. Run your Briefly pipeline with all your configuration files:

python your_pipeline.py -p your_pipeline.conf -p emr.conf -Dbuild_dir=build_dir_path

Now you can have several different configurations for running job locally, on Qubole, or with different cluster sizes. One thing we find useful is to subdivide a big cluster into smaller clusters which increases the survivability of the entire group of clusters, especially when running on spot instances.

The number of clusters and the cluster size can be adjusted according to the jobs being executed. Many small clusters provides better throughput when running with a lot of small jobs. On the other hand, a large job may run longer on a few clusters while other clusters may be terminated after a predetermined idle time. The setting can be changed easily in the configuration for performance tests.

Other features Briefly

Use of a Hartman pipeline to create job flow.
Resource management for multiple Hadoop clusters (Amazon EMR, Qubole) for parallel execution, also allowing customized Hadoop clusters.
Individual logs for each process to make debugging easier.
Fully resumable pipeline with customizable execution check and error handling.
Encapsulated local and remote filesystem (s3) for unified access.
Automatic download of files from s3 for local processes and upload of files to s3 for remote processes with s4cmd.
Automatic fail/retry logic for all failed processes.
Automatic price upgrades for EMR clusters with spot instances.
Timeout for Hadoop jobs to prevent long-running clusters.

Conclusion

We use Briefly to build and operate complex data processing pipelines across multiple mapreduce clusters. Briefly provides us with an ability to simplify the pipeline building and separate the operational logic from business logic, which makes each component reusable.

Please leave us feedback, file issues and submit pull requests if you find this useful. The code is available on GitHub at https://github.com/bloomreach/briefly.

Discovery | Strategies for Reducing Your Amazon EMR Costs

Rory Warin — Fri, 09 Jun 2023 13:02:07 +0000

Blog written by: Prateek Gupta from Bloomreach, 2015

Introduction

Bloomreach has built a personalized discovery platform with applications for organic search, site search, content marketing and merchandizing. Bloomreach ingests data from a variety of sources such as merchant inventory feed, sitefetch data from merchants’ websites and pixel data. The data is collected, parsed, stored and used to match user intent to content on merchants’ websites and to provide merchants with insights into consumer behavior and the performance of products on their sites.

Merchant Data

A sample data ingestion flow for merchant data is shown in the figure below. Bloomreach ingests merchant data including crawled merchant pages, merchant feed, and pixel data. There are ETL (extract-transform-load) flows that clean, filter and normalize the data and put it into the product database. Individual applications may use this data to produce derived relations. The product database also supports many applications including the “What’s Hot” application that displays relevant trending products to the user on merchant website.

Below is a sample workflow for personalization:

At Bloomreach, we launch 1,500 to 2,000 Amazon EMR clusters and run 6,000 Hadoop jobs every day. As a growing company, we’ve seen our use of Amazon EMR rise dramatically in a short time:

Amazon EMR Costs

It is critical that we keep our Amazon EMR costs down as we scale up. To that end, we’ve adopted the following strategies

1) Use AWS Spot Instances rather than On-Demand Instances whenever possible. Amazon Elastic Cloud Compute (Amazon EC2) Spot Instances are unused Amazon EC2 capacity that you bid on; the price you pay is determined by the supply and demand for Spot Instances. The cost of using Spot Instances can be 80% less than using On-Demand Instances. It’s important to manage Spot Instances because they can be terminated if the Spot market price exceeds your bid price. At Bloomreach, we have written an orchestration system that schedules jobs on Amazon EMR. The system implements a Hartmann pipeline that can run a variety of jobs both locally and on Amazon EMR. It can also detect failures such as Spot Instance termination and reschedule jobs on different clusters as needed.

2) Create a system that shares clusters among several small jobs rather than launching a separate cluster for every job. Remember, whether your job takes 10 minutes or 60 minutes, you’re paying for an hour of access. If you have four 10-minute jobs, you could share one cluster to do them all and be charged for one hour. Or you could employ one cluster for each and be charged for four hours. Sharing clusters among jobs also allows you to save the time and cost of bootstrapping a new cluster. The time savings alone can be a significant factor for real-time jobs.

3) Use Amazon EMR tags for cost tracking. Using EMR tags lets you track the cost of your cloud usage by project or by department, which gives you deeper insight into return on investment and provides transparency for budgeting purposes.

4) Create a lifecycle management system that allows you to track clusters and eliminate idle clusters.

5) Use the right instance types for your jobs. For example, use c3 instance type for compute-heavy jobs. This can significantly reduce waste and costs based on the scale of your jobs. Below is an algorithm we have found useful for selecting the instance type with the best value for compute capacity based on its Spot price:


maxCpuPerUnitPrice = 0

optimalInstanceType = null

For each instance_type in (Availability Zone, Region) {

cpuPerUnitPrice = instance.cpuCores/instance.spotPrice

if (maxCpuPerUnitPrice < cpuPerUnitPrice) {

optimalInstanceType = instance_type;

}

}

Incorporating these Amazon EMR strategies can reduce EMR costs, increase efficiency and make a good thing even better.

Discovery | Crawling Billions of Pages: Building Large Scale Crawling Cluster (Pt 2)

Rory Warin — Fri, 09 Jun 2023 12:58:36 +0000

Blog written by: Chou-han Yang from Bloomreach, 2015

Introduction

Previously in “Crawling Billions of Pages: Building Large Scale Crawling Cluster (Pt 2)” we talked about the way to build an asynchronous fetcher to download raw HTML pages effectively. Now we have to go from a single machine to a cluster of fetchers, therefore, we need a way to synchronize all the fetcher nodes so that they won’t overlap. Essentially we want a queue. The requirements for the queuing system are:

Handle billions of URLs.
Maintain constant web crawling speed per each domain.
Dynamically reschedule failed jobs.
Provide status report for crawling cluster in realtime.
Support multiple queues with different priorities.

In Bloomreach’s case, we use a single Redis server with 16GB memory, which can handle about 16 billion URLs. It is equivalent to one URL per one byte of memory. What follows is a description of our design process and some of its highlights.

Minimize the size of each entry

The first challenge to design a queuing system is to figure out what data you need to put into the queue. Once the data size for each entry is determined, it is a lot easier to estimate the capacity of the queuing system. In our case, if we put all the required URLs into the queue, assuming each URL consumes 512 bytes, our 16GB system would hold only 32 million URLs. This design is definitely not enough to handle billions of URLs.

So we have to combine multiple URLs into a single run corresponding to a task. In our system, we have 1,000 URLs per run. So, we can use a single path pointing to a S3 (Amazon Simple Storage Service) file containing a list of 1,000 URLs for this task. Effectively, we only need to have 0.5K bytes for 1,000 URLs in memory. Therefore, we should be able to handle 16 billion URLs with just 8GB of memory. Sounds easy, doesn’t it?

Transfer of Task Ownership

A typical queuing system allows task entries being pushed into or popped out. Effectively the ownership of a task is transferred from the producer of the task entry to the queue, and then from the queue to the task executor when the task is popped out from the queue.

The downside of this naive approach is that if the fetcher node crashes for any reason, the tasks currently running will be lost. There are several ways to avoid this:

1) Provide task timeout. The queuing system can reclaim the task after certain period of time.
Pros: Clients of the queue don’t need to change.
Cons: Once the task is reclaimed and rescheduled, it may run concurrently with the stale task.

2) Clients of the queue keep tasks in a file, and restart after crash.
Pros: Queue server doesn’t need to change.
Cons: Clients need to have a persistent database to maintain the running tasks.

Yet, there is still the need to transfer task ownership between client and server, which makes the system unnecessarily complicated. What we can do instead is keep the ownership of those tasks in the server after they are enqueued. That way the client doesn’t have any ownership of the task at any time. Instead we have a stateless list of tasks at any given time. It is very similar to how our library system works. Image every task is a book in the library and readers are clients who would like to read as much as possible. If we allow readers to ‘check-out’ books, we face the risk that some readers may lose their books. One solution is to never allow readers to ‘check-out’ books, and instead have them read digital versions online. That way, all the content stays in the library and there is no risk of losing books.

Each fetcher, the client of the queuing system, can obtain several tasks running in parallel at any moment. Like the library system, a reader is allowed to check out several books at the same time. So internally, the queuing system maintains a list of tasks each client is currently running. From the task list, each client would be able to check what task it should execute without actually looking into the queue. This approach dramatically reduces the complexity of the system because each client doesn’t need to know how the queue is implemented inside the queuing system. We will see how this helps build a more complex scheduling algorithm later.

As shown above, each client periodically syncs its own task list with the server. New tasks will be started once a client discovers new items on the list. On the other hand, once a task is marked done by a client, the server will remove the task and add a new task to the list.

Scheduling Algorithm

Currently, more than 60 percent of global internet traffic consists of requests from crawlers or some type of automated Web discovery system. A typical website with a single node can handle from one request per second to hundreds of requests per second, depending on the scalability and implementation. So web crawling is likely to put loads on the Web servers, therefore, a Web crawler should be polite and crawl each domain at very moderate speed, usually slower than 10 seconds a page. Crawling with extremely high speed could be considered a distributed denial-of-service (DDoS) attack. The perceived DDoS attack will result in your crawler quickly being blacklisted. One straightforward way to control the overall crawling speed is to control the number of fetching jobs running across the whole cluster.

On the other hand, one-off requests may come for various of reasons. People want to debug a certain set of pages, or more commonly, part of a website changes faster than others. For example, the homepage is usually updated more often than other pages. Therefore, having multiple queues for each domain is very useful in many cases.

The temptation is to create a priority queue with each task preassigned a priority. That way you can always process the priority tasks first. A priority queue has two issues: the priority value itself in each entry may increase the size of the task in memory. And secondly, high priority tasks will block low priority tasks. This is called starvation in scheduling algorithms. Monitoring the queue is also a challenge since most of the queuing systems use a linked list as the underlying data structure. Therefore, traversing the entire queue to read associated metadata is not practical.

A better approach is to use a randomized scheduler, for example, weighted fair queuing. With this approach, we can randomly schedule jobs from a group of queues without having different priorities at a job level. The chance of a queue being selected for scheduling is proportional to the priority of the queue.

A cool thing about this approach is that there is no starvation. Even jobs on queues with very low priority will get scheduled at some point. Reading progress for each queue is now a trivial matter because tasks with different priorities are put in different queues instead of mixing into a single priority queue.

Multi-tiered Aggregations

On top of the scheduling and task dispatching, monitoring and showing stats is crucial for production operation. The monitoring server doesn’t have to be on the same node as the queue, but we put them on the same node for simplicity.

The goal is to show the real-time aggregation for different criteria, such as per queue or per fetcher node. Every second, each node will send back the effective web crawling speed and error rates for each task (set of 1000 URLs). We can selectively aggregate the data and display the final results. If you imagine the input values as a single vector, the whole aggregation process can be written down as a single matrix multiplication. Each element in the matrix will have a value of either zero or one. Zero means the corresponding input value is not selected for aggregation, and one means the value is selected.

Matrix operations can be optimized with numerical libraries which take advantage of vectorized hardware operations, such as numPy. Instead of hard-coding those operations in a nested loop, we transform the whole aggregation process into a single matrix multiplication and then leverage numPy to perform the operation for us. After that, we can get the aggregated value every second, and then push those values upward to the minute, hour, and daily aggregation.

The Web UI can get real-time graph from the server without joining data every second. This approach basically amortizes the expansive joining for a given time period by running a partial join operation every second.

Conclusion

Handling billions of URLs in a single queuing system is definitely achievable. The key takeaway is to have the right architecture to help with data sharding and failure handling. Making every component stateless is essential for scaling out. Our queuing system, which has been running for more than a year, has handled more than 12 billion URLs without any sign of slowing down. The queuing system is a cornerstone for the robustness of our web crawling system.

Discovery | Crawling Billions of Pages: Building Large Scale Crawling Cluster (Pt 1)

Rory Warin — Fri, 09 Jun 2023 12:53:12 +0000

Blog written by: Chou-han Yang from Bloomreach, 2015

Introduction

At Bloomreach, we are constantly crawling our customers’ websites to ensure their quality and to obtain the information we need to run our marketing applications. It is fairly easy to build a prototype with a few lines of scripts and there are a bunch of open source tools available to do that, such as Apache Nutch. We chose to build our own web crawling cluster at Bloomreach after evaluating several open source options. Our requirements were:

Handle more than billions of pages (in about a week).
Use Hadoop (with Amazon EMR) to parse those pages efficiently.
Have constant QPS (query per second) for each website.
Have multiple task queues per each website with different priorities.
Handle long latency for slow websites.

A very typical architecture for web crawling clusters includes three main components:

A fetcher that send HTTP requests and reads content.
A centralized queuing system for job management and distribution.
A backend pipeline to parse and post-process pages.

For part one of this series, I would like to focus on the fetcher that we use to crawl our customers’ pages. I’ll cover other components separately in future posts.

First attempt: single process loop

To kick-start our discussion, we will just use simple code snippets to demonstrate a very simple fetcher:

import java.io.BufferedReader;

import java.io.IOException;

import java.io.InputStreamReader;

import java.net.URL;

import java.util.List;



public class Crawler {



  public static void crawl(List urls) throws IOException {

    for (String urlStr : urls) {

      URL url = new URL(urlStr);



      BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

      processHTML(in);

      in.close();

    }

  }



  public static void processHTML(BufferedReader in) {

    // ...

  }



}

This is a very straightforward implementation of the fetcher with several potential scaling issues:

It uses single thread and only one thread per process. In order to concurrently fetch from more than one website, the fetcher needs multiple processes.
If a single page takes a long time to process, or even worse, the server times out without any response, the whole process will be stuck.

As straightforward as it is, this approach won’t go very far before some operational headaches set in. So naturally, a better approach would be to use multiple threads in a single process. Unfortunately, with this system, the memory overhead for each process will quickly consume all your memory space.

Second attempt: multithreaded HTTP client

package scratch;



import java.io.BufferedReader;

import java.io.IOException;

import java.io.InputStreamReader;

import java.net.URL;

import java.util.List;



public class Crawler implements Runnable {



  private String urlStr = null;



  public Crawler(String urlStr) {

    this.urlStr = urlStr;

  }



  public static void crawl(List urls) {

    for (String urlStr : urls) {

      new Thread(new Crawler(urlStr)).run();

    }

  }



  @Override

  public void run() {

    try {

      URL url = new URL(urlStr);



      BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

      processHTML(in);

      in.close();

    } catch (IOException e) {

      // Deal with exception.

    }

  }



  public static void processHTML(BufferedReader in) {

    // ...

  }



}

This process seems more modern and it removes the requirements to run more than one process on a single machine. But the shortcoming that a single page can stop the whole loop remains. Compared to multiple process, multiple thread has better memory efficiency, but it will reach its limit when you are running at least 400 to 500 threads on a quad core machine.

Third attempt: asynchronous HTTP (Windows style)

To solve the problem of blocking threads for each website loop, people long ago developed solutions for Windows. An experienced Windows IIS programmer would be very familiar with the event-driven programming paradigm. Coming up with the same code in Java isn’t easy, but it might look something like:

package scratch;



import java.io.BufferedReader;

import java.io.IOException;

import java.io.InputStreamReader;

import java.net.URL;

import java.util.List;



public class Crawler {



  public static void crawl(List urls) {

    for (String urlStr : urls) {

      AsyncHttpClient client = new AsyncHttpClient();

      Response response = client.prepareGet(url).execute(new AsyncHandler&lt;T&gt;() {



        void onThrowable(Throwable t) {

        }



        public STATE onBodyPartReceived(HttpResponseBodyPart bodyPart) throws Exception {

          processHTML(bodyPart);

          return STATE.CONTINUE;

        }



        public STATE onStatusReceived(HttpResponseStatus responseStatus) throws Exception {

          return STATE.CONTINUE;

        }



        public STATE onHeadersReceived(HttpResponseHeaders headers) throws Exception {

          return STATE.CONTINUE;

        }



        T onCompleted() throws Exception {

          return T;

        }

      });

    }

  }



  public static void processHTML(BufferedReader in) {

    // ...

  }



}

Windows usually uses a single thread to process all events, but you can allow multiple threads by changing the setting of the IIS Web server. The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren’t able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.

Fourth attempt: HTTP client with asynchronous I/O

The potential bottleneck has been removed by kernel 2.5.44 with the introduction to epoll system call. This allows a process to monitor a huge number of TCP connections without polling from each connection one-by-one. This also triggered the creation of series non-blocking libraries such as Java NIO.

Network libraries based on Java NIO have the benefit of easily scaling from a few thousands to tens of thousands of TCP connection per machine. The CPU no longer spends time in a waiting state or context switching between a huge number of threads. Therefore, performance and throughput both increase.

import java.net.*;

import java.util.List;



import org.jboss.netty.bootstrap.ClientBootstrap;

import org.jboss.netty.buffer.ChannelBuffer;

import org.jboss.netty.channel.*;

import org.jboss.netty.handler.codec.http.*;



import static org.jboss.netty.channel.Channels.pipeline;



public class Crawler extends SimpleChannelUpstreamHandler {



  public static void crawl(List urls) throws URISyntaxException {

    for (String urlStr : urls) {

      new Crawler().asyncRead(urlStr);

    }

  }



  public void asyncRead(String urlStr) throws URISyntaxException {

    URI uri = new URI(urlStr);



    // Configure the client.

    ClientBootstrap bootstrap = new ClientBootstrap();



    final SimpleChannelUpstreamHandler handler = this;



    // Set up the event pipeline factory.

    bootstrap.setPipelineFactory(new ChannelPipelineFactory() {



      @Override

      public ChannelPipeline getPipeline() throws Exception {

        ChannelPipeline pipeline = pipeline();

        pipeline.addLast("handler", handler);

        return pipeline;

      }

    });



    // Start the connection attempt.

    ChannelFuture future = bootstrap.connect(new InetSocketAddress(uri.getHost(), uri.getPort()));



    // Wait until the connection attempt succeeds or fails.

    Channel channel = future.awaitUninterruptibly().getChannel();

    if (!future.isSuccess()) {

      future.getCause().printStackTrace();

      bootstrap.releaseExternalResources();

      return;

    }



    // Prepare the HTTP request.

    HttpRequest request = new DefaultHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, uri.getRawPath());



    // Send the HTTP request.

    channel.write(request);



    // Wait for the server to close the connection.

    channel.getCloseFuture().awaitUninterruptibly();



    // Shut down executor threads to exit.

    bootstrap.releaseExternalResources();

  }



  @Override

  public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) throws Exception {

    HttpResponse response = (HttpResponse) e.getMessage();

    processHTML(response.getContent());

  }



  public void processHTML(ChannelBuffer content) {

    // ...

  }



}

We use Netty to build our crawler, not only because it uses Java NIO, but because it also provides a good pipeline abstraction to the network stack. It is easy to insert a handler to HTTPS, compression, or time-out without compromising the code structure.

Conclusion

Based on our stress tests, each node with a quad-core CPU can go up to 600 queries per second, reaching the maximum network bandwidth, with its average HTML of size 400K bytes. With a six-node cluster, we can crawl at 3,600 QPS, which is about 311 million pages a day, or 1.2 billion pages in four days.

Next time, we will talk about how to store tasks with a very long list of URLs with efficient queuing and scheduling.

Discovery | The Evolution of Fault Tolerant Redis Cluster

Rory Warin — Fri, 09 Jun 2023 12:44:43 +0000

Blog written by: Hongche Liu & Jurgen Philippaerts from Bloomreach, 2015

Introduction

At Bloomreach, we use Redis, an open source advanced key-value cache and store, which is often referred to as a data structure server since values can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs. In one application, we use Redis to store 16 billion URLs in our massive parallel crawlers in. We use Redis to store/compute Cassandra clients’ access rate for rate limiting purpose in another. But this post is focused on yet another particular application — real-time personalization, in which we use Redis to store in-session user activities.

For this job, fault tolerance is a requirement. Without it, there would be no real-time, in-session personalization. As far as a Redis cluster development is concerned, fault tolerance and scalability have received attention only recently. Some features, like sharding, are not even available as a stable version yet in the main Redis repository. Fortunately, there are some industry solutions to fill the gap. This article covers the operations and administration of fault tolerance and scalability of the Redis cluster architecture used at Bloomreach. Many of the topics are discoverable but scattered around the Web, indicating the high commonality, but wide variety, among many industry applications using Redis.

Failsafe Redis Setup - we had a humble beginning.

The simple setup here suits our need at the time because our data is less than the memory capacity of the instance we are using. If our user in-session data grows out of the current instance’s memory capacity, we will have to migrate the Redis node to a bigger instance. If you are asking, “How the heck can this system scale up?” — good. You are ahead of the game. We will cover scaling in a later section. Here we will focus on fault tolerance.

Fault Tolerance Against Instance Failure

For fault tolerance against instance failure, we take advantage of the DNS failover setup (for example, AWS Route 53 Failover Policy) as in the diagram below.

We set up two CNAME records in DNS. In each of them, we configure the routing policy to be failover. For Redis1 CNAME, we set up the failover record type to be primary and attach the proper health check. For Redis2 (the hot backup), we set up the failover record type to be secondary and attach the proper health check.

In this setup, under normal circumstances, DNS for Redis.bloomreach.com (mock name) returns Redis1 (mock name). When the health check for Redis1 detects that it is down, the DNS resolution will point to Redis2 (mock name) automatically. When Redis1 is back, the DNS server will resolve Redis.bloomreach.com back to Redis1 again.

T0 is the time when Redis1 goes down.

TN is the time when the DNS service’s health check determines that Redis1 is down. It is also the time when the DNS resolution will point to Redis2, the backup live one.

TD is the time when the application’s DNS TTL (Time to Live) elapsed. It is also the time when application will get Redis2 host for the DNS lookup for Redis.bloomreach.com.

TR is the time when Redis1 comes back.

Between T0 and TD, the application would try to write or read from Redis1, which would fail.

So the application down time is

TD – T0 = health_check_grace_period + DNS_TTL

Between TD and TR, all the data, say D, go to Redis2, not replicated to Redis1.

So at TR when DNS points back to Redis1, all the written data D will be non-accessible.

To prevent the loss, we set up pager alert on Redis1 down, with the instruction to flip the replication from Redis2 to Redis1. Before we tried to automate this manual task. Redis has since come up with a good solution with sentinel in version 2.8, which is what we moved to next. It will be covered in the next section.

Fault Prevention Tips

But before we go there, I’d like to cover some topics that prevent faults (instead of tolerating faults):

1) If you are building a production grade Redis cluster, you must follow the tips on this Redis Administration page. During the pre-production baking stage, I personally encountered an issue “fork() error during background save”. The problem and the solution (setting overcommit on Linux to 1) has been noted here. Various production level issues and troubleshooting tips have been well documented on the Redis Administration page and it’s really helpful to pay attention to the issues listed there.
2) One item not covered on the above page is eviction policy. Without setting it up properly, we have encountered the case of Redis being stuck. Our application uses a Redis cluster mainly as a data store for in-session user activities. During peak traffic time, or season, the amount of data could spike beyond the capacity. The default eviction policy is “noeviction.”

maxmemory-policy noeviction

It means when the memory is full, no new data can be written into Redis, not a behavior we would like. After studying the industry experiences and testing, we settled on the following eviction policy. This is the policy that when memory is full, it evicts the data that is closest to expiring. It is the safest behavior in our application.

maxmemory-policy volatile-ttl

3) Another configuration issue we ran into was with Linux open file ulimit. Every TCP connection to Redis is an open file in Linux. The standard AWS Ubuntu image comes with open file size limit of 1024. When we provisioned more application servers for a stress test in preparation for the holiday season, we encountered the serious problem of application servers getting stuck in startup phase when they initialized sessions with Redis and the Redis hosts ran out of open file handles. It is particularly difficult to trace the problem correctly to the ulimit setting because restarting Redis (the most intuitive operation) resolves the symptoms temporarily. There are also many other settings (wrong port, authentication setting) that can result in the similar symptoms. The key thing to observe is the error message, “Connection reset by peer.” It is an Linux level error, not Redis. Using lsof command confirmed the connection count.
Do not confuse the Linux open files ulimit with Redis configuration of maxclient.

maxclients 10000

Although both must be sufficient for your application’s architecture. There are many resources pointing to the solutions to this problem. We set up an Ubuntu upstart conf with the following line:

limit nofile 10240 10240

This helped us pass the stress test.

Automatic Failover with Redis Sentinel

The previous section describes a simple fault-tolerant Redis setup that does not handle recovery well. So during the holiday season, we upgraded from the default Ubuntu Redis version 2.2 to 2.8, which has a new Redis component called sentinel, with the distinct feature of automatic recovery during a server fault. The setup is depicted in the following diagram:

In this setup, Redis sentinel plays a crucial role in system monitoring and failover situation. It keeps constant watch on the status of the master and of the replication status of each slave.

Leader Election

In the case of the master crashing, like in the following diagram, all the surviving sentinels get together and agree on the master being incapacitated and then proceed to vote for the next master, which is the slave with the most up to date replicated data.

The reason we have 4 sentinels installed is for decisive majority vote among 3 when one of the hosts is down. Note that one of the sentinels is actually running on the staging Redis.

Added to this new architecture is a set of load balancers. In our case, we use HAProxy. Now we are no longer relying on just DNS to send clients to the active Redis master node. Thanks to HAProxy’s health check capabilities, we can now reduce the load on the master Redis node by sending read traffic to all nodes, not just to the master.

HAProxy has some built in Redis check capabilities, but unfortunately, it only checks to see if a Redis node is healthy. It doesn’t report back if it is a master or slave node. Luckily, it allows you to write your own set of commands into a tcp check. The following configuration snippet allows us to configure a backend pool to be used only for Redis write commands.


tcp-check send info\ replication\r\n

tcp-check expect string role:master

tcp-check send QUIT\r\n

tcp-check expect string +OK

Another backend pool that takes Redis read commands can be setup with the built-in Redis check:

option Redis-check

A configuration like this needs one more customization in your application, as you now have two endpoints for your Redis connections, one for reads and one for writes.

Compared with the first approach we used, this setup is a cabinet model of government, where the cabinet members get together to decide if the executive head is incapacitated and then to take proper actions. For example, selecting the next executive head. The previous model is like the vice president model, where the vice president, in normal situations, does not serve any active purposes, but if the president is out, he/she is the automatic successor.

Cross Data Center Serving Infrastructure and Automatic Failover

When we evolve into the multi-region distributed serving infrastructure (e.g. east coast and west coast), we have the following setup:

The reason we need to have a separate Redis cluster is for the high SLA (10ms response time average), which cannot be achieved with cross-continent access due to high network latency.

In this setup, we have event handlers that register user activities in real-time to Redis. Our application server mostly performs read operations on Redis. The SLA on the event handler is not as high as the app server requirement (next page response time is OK), so we can pipe all event handlers’ traffic to the main Redis cluster and then replicate the data to the secondary clusters.

In this setup, one particular sentinel configuration is worth noting — slave priority. On the secondary cluster, you must set:

slave-priority = 0

This is because you would never want the new master to be in the secondary cluster. If you did, you would send all events from the primary cluster to this new master, replicating the data from the secondary cluster to the primary, increasing latency unnecessarily.

Scalable Redis Using Twemproxy

As our business grows, our data outgrows the capacity of any single node’s memory with acceptable cost. Unfortunately, Redis cluster is still a work in progress. However, others in the industry (more specifically, Twitter) have encountered and solved the data distribution problem in an acceptable way and then open-sourced it. The project is called twemproxy, which we adopted after some study and testing.

The following diagram depicts our setup. We have the production Redis sharded N ways behind 2 twemproxy servers.

In this setup, a fault in twemproxy server is handled by DNS setup with health checks. Unlike the first setup we used, the multiple CNAME records but use weighted routing policy with the same weight, so both twemproxy servers can route traffic to Redis, avoiding a single bottleneck while achieving higher availability.

A fault in a Redis master is handled by twemproxy using a configuration called auto_eject_hosts.

auto_eject_hosts: true

We followed the liveness recommendation section of the twemproxy document. In short, twemproxy detects Redis host failure with some retries and timeout built-in to avoid false alarms. Once such a situation is detected, it ejects the host from the Redis data distribution pool, using:

distribution: katema

The accessibility of data on the surviving nodes is preserved.

Tips for Twemproxy Configuration

1) Similar to Fault Prevention Tip 3, remember to set the Linux host’s ulimit for open files to an appropriately high number for your application, now that the application is actually connecting to a twemproxy host instead of Redis hosts.
2) If you, like us, have a staging replication setup similar to the diagram above, here is an important configuration to note. In nutcracker.yml, for each pool of Redis nodes, there is a servers configuration like this:

servers:

- 10.34.145.16:6379:1 server1

- 10.32.54.221:6379:1 server2

- 10.65.7.21:6379:1 server3

In the production settings, the Redis servers’ IP addresses, naturally, are different from those of the staging Redis servers’ IP addresses. However, make sure to put in the server names (in the above case, server1, server2, server3), even though the server name is optional. Also make sure the order of the servers correspond to the replication setup. That is, production server1 replicating to staging server1, and so on. The reason for this is that the data distribution hash will be based on the server name when it is provided. Making the production and staging server names and sequence the same ensures that the data distribution is consistent between production and staging. We had started without this configuration and found out only one-third of the data was accessible in staging.

Trade-offs of Using Twemproxy vs. Previous Solutions

Drawbacks of using twemproxy:

The data on the crashed node is not accessible until it comes back into service.
The data written to the other nodes during the downtime of the crashed node is not accessible after it is revived.
Only a subset of the Redis commands are supported. We had to rewrite part of our application to make this migration work. For example, we used to use Redis pipelining for higher throughput but it is not supported by twemproxy (I guess, due to the incompatibility between the pipelining intrinsic sequential nature and twemproxy’s parallel nature).
During the downtime of a production Redis node, say Redis A, staging twemproxy would be out of sync with its Redis cluster because: the staging data is replicated from production Redis nodes directly, not distributed through the staging twemproxy; the production twemproxies distribute data among the live nodes, Redis B and Redis C; the staging twemproxy, not knowing production Redis A is down, seeing all nodes (staging Redis A, B, C) alive, following the original data distribution scheme based on 3 shards, would sometimes go to the staging Redis A, where the replication has already stopped from production Redis A, resulting in a miss.

Gains of using Twemproxy:

Naturally, with this architecture, we are able the scale up our Redis capacity three times. If data grows even more, we can linearly scale up.
Additionally, in our throughput test, we discovered that twemproxy’s throughput is actually higher than the previous setup, even though there is an extra hop in data transmission. Twemproxy’s own documentation claims that the internal pipelining implementation cuts down the overhead and improves performance.

To us, the gains are more than the sacrifices. So we continue to use twemproxy.

Cross Data Center Scaling

When we scale the multi-region distributed serving infrastructure, we have the following setup:

In this setup, all regions can be scaled up at the same time.

Conclusions

In our evolution of Redis deployment strategy:

We used DNS failover settings to achieve fault tolerant Redis cluster for our first implementation.
We used Redis Sentinel for automatic failover for our second version.
We are using the open source twemproxy package to scale the capacity and the twemproxy automatic failover setting for fault tolerance.

Along this journey, Redis has proven to:

Have high performance (high throughput and low latency).
Contain convenient data structures for diverse application use cases.
Require a lot of thought and surrounding infrastructure when fault tolerance is a requirement.

And so you can see our journey has been an interesting one — so far. We’re sharing these tips and pitfalls in the hope that they will help smooth your way in adopting this wonderful technology.

Discovery | Introduction to Distributed Solr Components

Rory Warin — Fri, 09 Jun 2023 12:28:03 +0000

Blog written by: Suchi Amalapurapu & Ronak Kothari from Bloomreach, 2015

Introduction

We use Solr to power Bloomreach’s multi-tenant search infrastructure. The multi-tenant search solution caters to diverse requirements/features for tenants belonging to different verticals like apparel, office supplies, flowers & gifts, toys, furniture, housewares, home furnishings, sporting goods, health & beauty. Bloomreach’s search platform provides high-quality search results with a merchandising service that supports a number of different configurations. Hence the need for a number of search features which are implemented as distributed Solr components in SolrCloud.

This blog goes over how Solr distributed requests work and includes an illustration of a custom autocorrect component design and discusses the various design considerations for implementing distributed search features.

Lifecycle of a Solr search query

Search requests in Solr are served via the SearchHandler, which internally invokes a set of callbacks as defined by SearchComponent. These components can be chained to create custom search functionality without actually defining new handlers. A typical search application uses several search components implementing custom search features in each component. Some sample examples are QueryComponent, FacetComponent, MoreLikeThis, Highlighting, Statistics, Debug, QueryElevation.

The lifecycle of a typical search query in Solr is as follows:

Request Flow

This section goes over the execution flow of requests in SearchHandler and how the callbacks are invoked for SearchComponent. Solr requests are of two types — single shard (non-distributed mode) and multi-shard (distributed mode).

Non-distributed or single shard mode

Architecture

In Solr’s non-distributed mode (or single shard mode), the index data needed to serve a search request resides on a single shard. The following diagram describes the request flow for this scenario:

Implementation details

SearchHandler invokes each search component’s prepare methods in a loop followed by each component’s process methods. Components can work on the response of components invoked before itself.

Example – Autocorrect

Let’s consider the autocorrect feature wherein, the user query is spell-corrected when the query does not have any search results in the index. For example, user query “sheos” is autocorrected to “shoes”, since there are no search results for “sheos”.

This feature needs three components QueryComponent, SpellcheckComponent and AutocorrectComponent. QueryComponent is used for default search. If the user-defined query does not return any results from the index, SpellcheckComponent adds a spellcheck suggestion. AutocorrectComponent further uses the spellchecked query to return search results. If the user-defined query is spelled correctly, AutocorrectComponent does not modify the response. In this case, SpellcheckComponent and AutocorrectComponent are added as last components. The following are the key steps involved:

A search request, response builder objects are created in SolrDispatcher.
A distributed search request is initiated for query “sheos.”
QueryComponent returns 0 numResults for the query.
SpellcheckComponent, which gets invoked after QueryComponent, adds a spell-corrected suggestion, “shoes.”
Based on numResults being 0 and SpellcheckComponent’s suggestion, AutocorrectComponent reissues the search request with corrected query “shoes”
Response formatting.

Distributed or multi-shard mode

Architecture

A request is considered distributed (or multi-shard) when the index of a collection is partitioned into multiple sub-indexes called shards. The request could hit any node in the cluster. This node needn’t necessarily have data for the intended search request. The node that received the request executes the distributed request in multiple stages and on different shards that contain the actual index. The responses from each of these shards is further merged to get the final response for the query.

SearchComponent callbacks for a distributed search request differ from its non-distributed counterpart. A search request gets executed by the SearchHandler, which is distributed aware. The callback distributedProcess is used to determine the search component’s next stage across all search components. Search components have the ability to influence the next stage of a distributed request via this callback. Search components can spawn more requests in each stage and all these requests get added to a pending queue. Eventually non-distributed requests get spawned for each of these pending distributed requests, which in turn get executed on each shard. These responses are collated in handleResponses callback. Any post processing work can be further done in finishStage. Please refer to WritingDistributedSearchComponents for a detailed understanding of how distributed requests work.

The life cycle of a distributed Solr search request can be depicted as follows:

Example – Autocorrect

Let’s consider the functionality of autocorrect in distributed mode. The difference from its non-distributed counterpart is that the spell-corrected response could come from any shard and might not necessarily be present on each shard. So the autocorrect component that we had defined earlier has to process SpellcheckComponent’s response only after letting it collate the responses from all shards. (SpellcheckComponent in Solr is already distributed aware.)

Please note that this can be better achieved as two search requests instead of a single complicated request. However, the hypothetical implementation in this section illustrates a custom distributed autocorrect component.

The new autocorrect component has to be defined after the SpellcheckComponent in the list of search components defined in solrconfig.xml. This is to facilitate stage modification of the request by processing the response of SpellcheckComponent.

Implementation details

A new stage called STAGE_AUTOCORRECT is added right after STAGE_GET_FIELDS. This functionality goes in distributedProcess of AutocorrectComponent, which checks if the number of results for the given query are 0 and if SpellcheckComponent provides a suggestion for a corrected query. Autocorrect component modifies the request query to the spell suggestion and resets the stage to STAGE_START.

The new search request is then executed as a distributed search request. A couple of flags have to be used to avoid getting stuck in an infinite loop here, since we are actually resetting the stage. These steps can be listed as:

A search request, response builder objects are created in SolrDispatcher.
A distributed search request is initiated for query “sheos” on any node in the cluster with STAGE_START.
The request goes smoothly till STAGE_GET_FIELDS, where the next stage is set to STAGE_AUTOCORRECT.
STAGE_AUTOCORRECT is where the numResults condition and SpellcheckComponent suggestion is checked. If the query needs to be autocorrected, a new user query is set on the request and the stage set back to STAGE_START. Set an additional flag on the response to skip this stage next time.
Search request goes through all the stages again with the spell-checked query.
Response formatting.

Learnings

There are several key takeaways from our experiences with scaling distributed search features in Solr. These can be summarized as:

System Memory and resources

Any additional search requests created in custom search components should be closed explicitly to avoid leaking resources. Otherwise, this could cause subtle memory leaks in the system. Many search features that we authored had to be tested thoroughly to avoid performance overheads.
Solr components that load additional data in the form of external data will either need core reloads or ability to be updated via request handlers. Such components should avoid using locks to avoid performance issues when serving requests.
The throughput of a multi-sharded Solr cluster is usually lower than a non-sharded one (depending on factors like number of shards and stages of a query).

Design practices

Its better to split a complicated search request into multiple requests, rather than adding too many stages in the same request — more so in a multi-tenant architecture, where we can mix and match features, based on customer requirements via request-based params or collection-specific default settings.
In some cases, moving this functionality out of Solr simplified the component design.
One more caveat that we realized: The single shard collections hosted on SolrCloud do not follow the distributed request life cycle and instead use the non-distributed callbacks. So in general, Solr components should be designed to work in both modes, so that resharding or merging shards has no effect on search functionality later on.

Conclusion

Solr exposes interfaces to allow users to customize search components. Both the handlers, as well as search components, can be extended for custom functionality. However these components have to be carefully designed in a multi-shard mode to capture the intended functionality. Thorough testing and monitoring for a sustained period of time to detect any subtle leaks or performance degradation is essential in such production systems.

References

http://wiki.apache.org/solr/WritingDistributedSearchComponents
https://wiki.apache.org/solr/SolrCloud
https://wiki.apache.org/solr/SearchComponent
http://lifelongprogrammer.blogspot.in/2013/05/solr-refcounted-dont-forget-to-close.html

DEV Community: Bloomreach

How to Use the Bloomreach Content SaaS Developer Application

Table of Contents

Overview

Featured Functionality

Getting Started

Create an API Token for Each CMS Instance

Configure the Developer Extension Application

Sample Actions

Copying Content Types

Copying Components

Building Shared Components

At Bloomreach

First Pass

But...

Shared Components / Design Systems

Our Setup

Hurdles in the Process

1. Bundle Size and Version Conflicts

2. Live Editing

3. Continuous Development

How is This Used Today?

The Unfinished Path

Where did the Name Come from?

Identity and Access Management at Bloomreach: Who are you, what are you doing here and other existential questions

Discovery | Synonym Generation at Bloomreach

Abstract

Why do synonyms need automated extraction/generation?

Different types of synonyms

Generation

Representation

Similarity

Diagram

Conclusion

References

Discovery | SolrCloud Rebalance API

Introduction

Problem

Innovative Data Management in SolrCloud Architecture

Scenarios

Conclusion

Discovery | Solr Compute Cloud - An Elastic Solr Infrastructure

Introduction

Problem Statement

Bloomreach Search Architecture

Conclusion

Discovery | Introducing Briefly: A Python DSL to Scale Complex Mapreduce Pipelines

Briefly

Other features Briefly

Conclusion

Discovery | Strategies for Reducing Your Amazon EMR Costs

Introduction

Merchant Data

Amazon EMR Costs

Discovery | Crawling Billions of Pages: Building Large Scale Crawling Cluster (Pt 2)

Introduction

Minimize the size of each entry

Transfer of Task Ownership

Scheduling Algorithm

Multi-tiered Aggregations

Conclusion

Discovery | Crawling Billions of Pages: Building Large Scale Crawling Cluster (Pt 1)

Introduction

First attempt: single process loop

Third attempt: asynchronous HTTP (Windows style)

Fourth attempt: HTTP client with asynchronous I/O

Conclusion

Discovery | The Evolution of Fault Tolerant Redis Cluster

Introduction

Fault Tolerance Against Instance Failure

Fault Prevention Tips

Automatic Failover with Redis Sentinel

Leader Election

Cross Data Center Serving Infrastructure and Automatic Failover

Scalable Redis Using Twemproxy

Trade-offs of Using Twemproxy vs. Previous Solutions

Cross Data Center Scaling

Conclusions

Discovery | Introduction to Distributed Solr Components

Introduction