avital trifsik for Memphis.dev

Posted on Oct 23, 2022 • Edited on May 29, 2023 • Originally published at memphis.dev

Building a Scalable Search Architecture

#scalability #database #dataflow

Creating a scalable search architecture is a popular and important task for many systems. There are different solutions for this task. Choosing the right one depends on the requirements of your project.

Sometimes, as a project grows and its requirements change, you may run into new problems that you cannot solve with the search architecture you are using. For example, when increasing the amount of data, including synonyms in the search, adding multilingual search, etc. In this case, you need to think about creating a new, more efficient, scalable search architecture.

The search architecture must support the rapid read and write scaling required for most use cases.

In this article, we will talk about the main challenge that an effective search architecture must solve. We will also learn the main ways to implement such an architecture and what tools can be used to do this. In the end, we will tell you how to speed up the search engine.

Current challenges of search architecture
Building a scalable search architecture is a challenge for most programs of varying sizes and complexity. To solve it, there are several different search architectures.

However, due to the rapid development of information technology, the creation of new fast-growing marketplaces, and SaaS applications, existing solutions can no longer meet all requirements. Therefore, there is a need to build a new efficient scalable architecture that can solve modern problems.

Let’s take a look at the challenges developers of scalable search architectures are currently facing.

1.Dynamic scalability. The number of search queries the system receives can vary greatly depending on different events. Sometimes the amount of traffic can increase dozens of times compared to the average daily traffic. To efficiently process queries, search engines must be able to anticipate traffic growth and scale their infrastructure. It is important to add and remove machines in less than a minute and temporarily handle increased traffic before a new machine becomes available. Dynamic scalability saves costs while opening up new opportunities for the creative use of search engines.

2.A dynamic number of shards. The performance of the system and its ability to scale depends on this indicator. For optimal performance of the search engine, this number should be dynamic. In order to have optimal performance and the ability to scale, it is important to be able to tune this value automatically.

3.Separation of search and indexing. Simultaneous scaling in both query volume and data volume is a challenge for search engines. Increasing search or indexing capabilities requires more resources, which can make the infrastructure cost very high. For optimal performance of the search engine and to avoid the negative impact of indexing on search, you should scale indexing and search separately.

4.Parallel data transmission over the network. Over the past decade, the speed of data transfer over the network has grown by about 100 times and continues to grow further. At the same time, processors and data storage are not developing as fast. For efficient scaling, this situation requires the creation of a different way of transmitting data with a greater percentage of parallelization.

Differences between search engines and databases
While search engines and relational databases have a lot in common, there are also key differences between them.

Relational databases store structured data in the form of interrelated tables. They allow you to process much more information than search engines. However, the advantage of search engines is that they can parse unstructured data. They store flat objects instead of interconnected tables. Search engines allow you to improve the performance of data reading and writing operations by parallelizing them.

In relational databases, information is well organized and more reliable. In contrast, in search engines, information is not systematized and is not stable, since its location and content can constantly change.

Search engines are easy and fast to implement. Unlike databases, search engines need to be updated frequently. Their main goal is to provide a relevant set of high-quality search results that respond quickly to customer needs.

Scaling architecture patterns:

Primary/replicas architecture
A primary/replica architecture is used to support a large number of reads. It involves replicating the data of the main server to several replicas. This architecture allows you to process several times more requests than if you use only one copy of the data on one server. For example, if you use three copies of data on three servers, then you can process three times more requests than with one copy on one server.

To support more write operations, you need to split the data into several smaller parts and add more CPU to create these parts. Scaling reads and writes requires adding shards and having multiple copies on multiple machines.

To implement a primary/replica architecture that replicates the master server’s data across multiple replicas, each shard needs to have one version that accepts writes. Other replicas use the primary as a source of truth. A log file that is stored on the primary shard is often used to synchronize master data and replicas. This log file contains all the writes received by the primary shard in sequential order. Each replica shard reads writes from this log and applies them locally.

The main disadvantage of this approach is the co-location of indexing and search on the same machine. Since we need to create multiple copies of the data, the CPU and memory indexing needs to be duplicated. This greatly increases costs. If you need to scale indexing and search at the same time, then the replication factor increases and applies to more data, requiring a huge amount of additional resources.

Increases in CPU and memory usage for indexing can also negatively affect the end-user experience if the same computer processes both indexing and search queries. If the search traffic generated by end users increases dramatically, then the resources used for indexing may limit the ability to account for this traffic surge.

In addition, this approach limits the auto-scaling capabilities, since adding a new replica requires pulling data from existing machines, often takes several hours, and adds additional load on the machines. This results in you having to greatly increase the size of your architecture and expect a significant increase in data or queries.

Replication of a binary data structure
Another way to create a scalable search architecture is to replicate a binary data structure committed to disk after an indexing task has been completed.

This approach avoids duplicating the CPU and memory used for indexing. However, while overwriting all data structures, the binaries can be large, resulting in some delay.

Most often, search architectures process a large amount of indexing and searching operations at a rate of less than a minute. Therefore, in most cases, this approach is not used.

In addition, search engines rely on generational data structures. This means that instead of a single binary file, there is a set of files. When a segment receives new indexing operations, it is stored in a smaller data structure on disk. New indexing operations are performed in generation zero until parts of the files reach a certain size and must be merged with generation 1. This is necessary to remove duplicates and optimize search efficiency. The disadvantage of this approach is that all files on the disk are modified and each replica needs to get a new version containing all of the data shards.

The process of merging data and its transfer for replication will be influenced by such factors:

The shard size determines the maximum amount of data to transfer after all layers have been merged.
The number of generations. This will directly affect the merger frequency of all generations (the number of times we have to transfer the maximum data size).

Search architecture main services
To create a search architecture you should use three services: crawlers, webpage processors, and indexing.

Crawlers are bots that are used to visit a web page, get all the links that are on it, and follow those links. This allows search engines to constantly find new content.

Web page processors read the page content and metadata. Then you need to break down the content of the web page into simpler forms that can be grouped according to different criteria. For example, by topics, keywords, etc. Metadata contains useful information such as keywords, descriptions, and more.

Indexing is used to organize found information so that it can be read quickly and easily. You can use keywords and page rank for this purpose. However, more efficient indexing requires some research and development.

Set up search services using Kubernetes
Kubernetes is a scalable Docker container platform that can be used for all the services needed to create a search architecture. It allows you to configure services in such a way that they work regardless of what hardware is used for this. In addition, you can scale each service separately according to your needs.

Kubernetes allows you to create services and assign them unique IP addresses. This allows services to communicate with each other without creating special connections. In addition, it ensures the security of your services.

The main benefits of using Kubernetes are:

1.Automated operations. In Kubernetes, a lot of the complex work involved in managing applications is done automatically by built-in commands.

2.Infrastructure abstraction. Kubernetes handles calculations, networking, and storage on behalf of your workloads. This allows programmers not to worry about setting up the underlying environment but to focus on application development.

3.Service health monitoring. Kubernetes constantly checks the health of your services. It restarts containers that have crashed or stopped and makes services available to users only after confirming their operation.

Control data flow with Memphis
Memphis is a real-time data processing platform. It allows you to work with streaming data and supports asynchrony. It is well suited for engineers who work with a large amount of data and require writing a large amount of complex code.

The Memphis platform is high-performance and fault tolerant. It has a built-in monitoring capability which is very useful and handy for troubleshooting.

Using Memphis solves most of the data flow control problems you face when building a search architecture. Among them are:

A large number of data sources are very difficult to process.
Analysis of streaming data for each source.
Loss of messages due to lack of relaying, failures and monitoring.
Streaming data processing requires the integration of several applications, which greatly slows down the data processing process and makes it impossible to analyze data in real-time.
Difficulty in deploying, managing, protecting, updating, integrating, and configuring other existing data processing systems.

Memphis uses a producer-consumer pattern and will control the entire data flow for the services you use in your search architecture.

Let’s list the main Memphis benefits.

Team deployment. Runs natively in any Kubernetes and Docker environments and is fully optimized from the start
Resiliency. Out-of-the-box dead-letter queue, resource optimization, and event recovery without code changes.
Easy-to-use. Memphis has a simple and easy-to-use user interface for data engineers and developers to troubleshoot and track data in real-time.
Unique schema control. Memphis offers unique, reliable, and easy-to-use schema control with built-in streaming data conversion.

Using Elasticsearch for creating search indexes
Elasticsearch is an application that allows you to create a search index. It is often used for log analysis, full-text search, intelligent security systems, business intelligence, monitoring of ongoing processes, etc.

You can send data to Elasticsearch as JSON documents using APIs or using other tools like Logstash. Then Elasticsearch automatically saves the document and adds a link to it in the clustered index, including searchability. You can also find and retrieve a document using the Elasticsearch API.

Elasticsearch provides horizontally scalable search and supports multithreading. You can use several different search fields and prioritize which values are the best indicator that a record matches the search.

Fast files replication
The development of cloud infrastructure and increased throughput allows for greater isolation between indexing and search without negatively impacting indexing latency. This made possible more efficient and dynamic scaling.

Let’s take a look at how scalable architecture works.

The indexing virtual machine keeps a local copy of the file. In case of failure, a new machine is started, which downloads the file from the cloud storage.
New files are computed in memory during the indexing process.
When uploading to cloud storage, large files are divided into several small segments and uploaded in parallel.
The files are also stored locally on the disk of the indexing virtual machine.
After sending the data, the search virtual machine downloads all the files (as in the case of downloading a cloud, the files are divided into segments and downloaded in parallel).
The search uses the new version of the data.

This search architecture has a list of advantages.

Speed up data merging with parallel loading.
The ability to reduce the segment size and indexing in near real-time.
Search operations and index operations scale independently because they use separate virtual machines.
Quick add/remove virtual machine.
Adding/removing additional resources without long and expensive data rebalancing.
Ability to use new types of features, such as changing the number of shards (the parts of files) for a single index using an asynchronous process without pausing indexing.

Wrapping up
We figured out how modern scalable search architectures work, and what are their advantages and disadvantages. We also told what main services the search architecture consists of, and what they are used for. In addition, the article contains a description of useful tools that you can use when developing your scalable search architecture.