Evolution trends of big data systems
On the demand side, big data systems began around 1995 with transaction processing (TP) scenarios, such as banks' daily online transaction processing. By 2005 came analytical scenarios (AP), such as building inverted indexes over search keywords, which do not require complex SQL features and focus more on concurrent performance. Around 2010, hybrid scenarios (HTAP) used a single system for both transaction processing and real-time analysis, reducing operational complexity. Around 2015 came complex analytics scenarios, that is, converged analytics across multiple sources such as public cloud, private cloud, and edge cloud. Finally, the real-time hybrid scenario (HSAP) converges real-time business insights, serving, and analytics.

From the supply side, big data systems began around 1995 with relational databases (MySQL), oriented toward point writes and queries and scaled horizontally through database and table sharding plus middleware. By 2005, non-relational databases (NoSQL) stored large amounts of unstructured data and scaled horizontally well. Around 2010, hybrid databases (NewSQL) combined the expressiveness and consistency of MySQL with the scalability of NoSQL. By 2015, data lakes and data warehouses enabled data integration across business lines and systems. Now we have reached the era of the next generation of big data systems.

When analyzing the properties of the big data generated by operations, its development trends, and the problems it can solve, three questions must not be ignored: the customer's data volume, the number of schemas and how often they and the business logic change, and the main ways in which and how often the data is used. Answering them requires going back to the most basic logic of data processing. Data processing operations are ultimately reads and writes, so there are four forms: write less, read less; write more, read less; write less, read more; and write more, read more. Each corresponds to a different technical system.
- Write less, read less: OLTP-type applications, which focus on point storage and queries, are well addressed by MySQL.
- Write more, read less: a common but underappreciated example is the debug log of application code, which consumes a great deal of storage; developers rarely optimize it and only search through the massive logs when something goes wrong. For a growing Internet enterprise, Elasticsearch can account for 50% of the big data bill. One reason is that a search engine must maintain a full index, so storage cannot be saved. Another is that companies often do not yet use big data to serve their businesses, so other big data applications have not materialized. This cost is hidden inside the overall technical cost and not visible on its own, so it receives no special optimization.
- Write less, read more: BI data analysis, i.e., OLAP, falls into this category; data is generally written sequentially, then computed over, and the results are output. Almost all big data cloud-service startups crowd into this field, making it a small red ocean.
- Write more, read more: real-time computing in the form of search, advertising, and recommendation. The business scenario is dynamic marketing based on user profiles, especially recommendation, which keeps spreading: any information flow ordered by user characteristics is a recommendation. The large data volume comes from recording user behavior in detail, and real-time computing makes dynamic predictions and judgments through algorithms.
In terms of application scenarios, the latter two read/write patterns have evolved into Hybrid Transactional & Analytical Processing (HTAP) and Hybrid Serving & Analytical Processing (HSAP), respectively. In terms of volume, the HTAP direction has attracted more startups recently, but it solves technical problems that are already well defined. Along the timeline, HSAP will eventually subsume HTAP, because HSAP uses technology to solve business problems.

Data engineers and developers need to focus on future industry trends and business pain points to improve their technology. Those working in directions such as HTAP, whose practitioner numbers are likely to shrink, need to think harder about their career choices. More importantly, why are there so few practitioners and companies in a direction that is promising and able to solve the problems of current technology? The answers to that question are the industry's breakthrough points and are vital to practitioners.
Challenges of HSAP
First, HSAP and HTAP are not antagonistic; HSAP even borrows many of HTAP's design ideas. Consider, for example, how HTAP replaces MySQL by changing the storage model.

HTAP is an upgrade to databases typically used in "transaction" scenarios to process structured data. Traditional databases logically use row storage, where each row is one data item. The whole row must be read into memory for computation even though generally only a few fields of the row are processed, so computing efficiency and CPU utilization are low.

With search engines and big data, it is often necessary to scan data at large scale while processing only certain fields of each row. Based on these usage characteristics, column storage emerged. Column storage is algorithm-friendly because adding a column (a "feature" used by an algorithm) is very convenient. Another benefit of column storage is that it enables a CPU optimization known as vectorization: executing a single instruction on multiple data items simultaneously, which greatly improves computing efficiency. Therefore, HTAP tends to emphasize column storage, vectorization, MPP, and so on, improving the efficiency of big data processing through these technologies.
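To make the contrast concrete, here is a minimal Python sketch (the order fields such as price and quantity are hypothetical) comparing row-at-a-time processing with a vectorized computation over columnar arrays:

```python
import numpy as np

# Hypothetical order data: one million rows, each with several fields.
n = 1_000_000
rows = [{"order_id": i, "price": 9.9, "quantity": 2, "comment": "..."} for i in range(n)]

# Row-oriented processing: every whole row is touched even though
# only two fields are needed, so cache usage and CPU efficiency suffer.
total_row = 0.0
for row in rows:
    total_row += row["price"] * row["quantity"]

# Column-oriented processing: each field is a contiguous array, and the
# computation is vectorized (one operation applied across many values).
price = np.full(n, 9.9)
quantity = np.full(n, 2)
total_col = float(np.sum(price * quantity))

print(total_row, total_col)
```

The row loop reads every field of every record, while the columnar version touches only the two arrays it needs and applies one operation across many values at once, which is the essence of vectorization.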
However, this does not mean that row storage is made obsolete by column storage. Both row and column storage are tied to usage scenarios and carry costs; the choice is a balance between cost and efficiency. Therefore, in terms of storage format and computing efficiency, HSAP does not need to innovate for innovation's sake.

The biggest difference between HSAP and HTAP is that HSAP is both a technology and a business, so the first question it must answer is how to model data from the business scenario.

Traditional databases are also known as relational databases; their data modeling is very mature and takes the form of a schema. HSAP can be considered to have evolved from search engines. The earliest search engines retrieved text, so they fall under NoSQL, that is, non-relational databases. Since then, Internet businesses have become increasingly diversified, mixing transactions and information flows. E-commerce, for example, has both a large-scale data business and complex transaction links.

Moreover, for search, advertising, and recommendation, e-commerce also needs structured data, such as commodity prices, discounts, and logistics information. Therefore, the data serving base of e-commerce requires very good modeling. This is not the job of the engineer who builds the transaction link but of the search engine architect. Modeling the data serving layer is critical and greatly impacts the storage and computing efficiency of the search engine.
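Purely as an illustration (all field names here are hypothetical), such a serving-oriented model might separate slowly changing catalog attributes from the rapidly changing signals consumed by search, ads, and recommendation:

```python
from dataclasses import dataclass

# Slowly changing catalog attributes, owned by the transaction side.
@dataclass
class Product:
    product_id: int
    title: str
    category: str

# Rapidly changing serving signals consumed by search / ads / recommendation.
@dataclass
class ProductServingFeatures:
    product_id: int
    price: float        # current price after discounts
    discount: float     # active promotion, e.g. 0.15 for 15% off
    shippable: bool     # derived from logistics information
    ctr_7d: float       # behavioral feature appended as a new column
```

Keeping serving features in their own columnar table makes it cheap to append new behavioral features without touching the transactional catalog.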
So the prerequisites for using HSAP well are good business data modeling, storage optimization, query acceleration, and so on. Data modeling has no well-standardized solution, because it requires understanding both the complex big data infrastructure and the business. One possible evolution path is that big data architects discover more scenarios while practicing HSAP, abstract those scenarios through data modeling, gradually accumulate experience, and eventually turn it into good products.
Application analysis of HSAP
What are the core customer problems in the HSAP space? Instead of taking an Internet platform with huge demands for big data analysis and serving as the example, consider a generic XX Bank. The basic scenarios are as follows:
- Marketing financial products according to the dynamics of user groups;
- Attracting the users of the neighboring YY Bank with reasonable incentives.
The core pain points of the bank's big data architecture team come from the scenarios above, which can basically be classified as "user growth." They require integrating big data analysis with serving, i.e., a typical HSAP problem. The bank's BI demands, however, are already well covered by existing products, so that pain is not acute. **The current data warehouse architecture has the following problems:**

- Data delay: production and batch jobs in the warehouse usually produce output at T+1 and do not support stream-batch integration, so business scenarios with high timeliness requirements are hard to support.
- Weak metadata scaling: expanding and shrinking metadata is difficult, and performance bottlenecks appear when the number of partitions grows rapidly.
- Insufficient resource scheduling: elastic scaling through containers is not possible.
**Requirements for the technology:**
- Stream-batch integration: based on unified real-time storage, with upstream and downstream computation triggered by events, the downstream data output delay is greatly shortened (see the sketch after this list);
- Horizontal metadata scaling: supports managing tables with very large numbers of partitions and files.
- Elastic resource scheduling: container-based elastic scaling, on-demand resource usage, and support for both public and private cloud deployment.
- Open systems and interfaces: serving is the main workload, but other complex offline and BI analysis is best done on the same unified storage system, both to connect easily with existing systems and to let other engines pull data out for processing. SQL compatibility is therefore also a must. Without looking too far ahead, a company that solves these problems well in the next 2-3 years will be very successful.
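The event-trigger idea behind stream-batch integration can be sketched minimally in Python; the queue stands in for commits to unified storage, and downstream_job is a hypothetical placeholder for the real computation:

```python
import queue
import threading
import time

# Unified storage is modeled as a queue of newly committed partitions.
new_partitions: queue.Queue = queue.Queue()

def downstream_job(partition: str) -> None:
    # Placeholder for downstream computation (aggregation, feature update, ...).
    print(f"processing {partition} at {time.strftime('%X')}")

def event_trigger_loop() -> None:
    # Event-trigger mode: each commit wakes the downstream job immediately,
    # so output delay is bounded by processing time, not a batch schedule.
    while True:
        partition = new_partitions.get()
        if partition is None:
            break
        downstream_job(partition)

worker = threading.Thread(target=event_trigger_loop)
worker.start()

# Upstream commits arriving over time.
for p in ["orders/2023-01-01-00", "orders/2023-01-01-01"]:
    new_partitions.put(p)
    time.sleep(0.1)

new_partitions.put(None)  # shut down the worker
worker.join()
```

The point is that downstream delay is bounded by processing time rather than by a T+1 batch window.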
Case
This article takes as examples Snowflake, an American public company, and LakeSoul, an open-source product from a Chinese startup.

Snowflake is a typical PLG (product-led growth) company. On the product side, Snowflake delivers real customer value through the elastic expansion and shrinkage of cloud storage and compute. Specifically:
- Truly taking advantage of the cloud's virtually unlimited storage and computing power;
- Truly giving customers zero operations and maintenance and high availability, so they do not have to worry;
- Truly saving customers money.
These principles coincide with how new products in the consumer goods field are launched to meet users' unmet needs, and the product details are well executed. For example, Snowflake designed the virtual warehouse, which comes in T-shirt sizes from X-Small to 4X-Large, to isolate users' workloads from one another. Such product design requires a deep understanding of requirements and provides great customer value.
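As a rough sketch only (the warehouse name, credentials, and options here are placeholders; check Snowflake's documentation for exact syntax), the T-shirt sizing appears directly when a virtual warehouse is provisioned through Snowflake's Python connector:

```python
import snowflake.connector

# Placeholder credentials; in practice these come from a secrets manager.
conn = snowflake.connector.connect(
    user="REPORTING_USER",
    password="********",
    account="my_account",
)

cur = conn.cursor()
# Each team or workload gets its own isolated virtual warehouse,
# sized on the T-shirt scale (X-Small up to the largest sizes).
cur.execute(
    "CREATE WAREHOUSE IF NOT EXISTS reporting_wh "
    "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60"
)
cur.execute("USE WAREHOUSE reporting_wh")
cur.execute("SELECT CURRENT_WAREHOUSE()")
print(cur.fetchone())
conn.close()
```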
In addition, from a business perspective, Snowflake has executed a sound "L-shaped" strategy. In the healthcare sector, public information shows that it amplifies the value of data by enabling "data exchange" and even achieves network effects. There is more to it than that, and Snowflake has been suspected of inflating a bubble. But based on second-hand information (not available online), Snowflake's bet on a company providing digital SaaS services in healthcare makes logical sense.
LakeSoul meets the technical requirements for solving customers' core problems in the HSAP space:
- Stream-batch integration: based on unified real-time storage, with upstream and downstream computation triggered by events, the downstream data output delay is greatly shortened (a usage sketch follows this list);
- Horizontal metadata scaling: supports managing tables with very large numbers of partitions and files.
- Elastic resource scheduling: containerized elastic scaling, on-demand resource usage, and support for public and private cloud deployment;
- Open system and interfaces: serving is the main workload, but other complex offline and BI analysis and processing should also run on the same unified storage system, on the one hand to connect with existing systems, and on the other to let other engines pull data out for processing. SQL compatibility is therefore also a must.
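A minimal PySpark sketch of what working against such unified storage might look like; the "lakesoul" format name and the table path are assumptions based on LakeSoul's Spark integration, so consult the project documentation for exact usage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hsap-sketch").getOrCreate()

table_path = "s3://warehouse/user_events"  # placeholder location

# Batch (or micro-batch) ingestion into the unified table.
events = spark.createDataFrame(
    [("u1", "click", "2023-01-01"), ("u2", "purchase", "2023-01-01")],
    ["user_id", "action", "dt"],
)
events.write.format("lakesoul").mode("append").save(table_path)

# The same table can then be read by a BI/offline engine for analysis ...
daily_counts = (
    spark.read.format("lakesoul").load(table_path)
    .groupBy("dt", "action")
    .count()
)
daily_counts.show()

# ... while, under stream-batch integration, serving-side jobs consume the
# same storage incrementally (streaming-read APIs vary by LakeSoul version).
```

Because serving, streaming, and offline analysis share one copy of the data, SQL compatibility and open interfaces are what let other engines pull it out for processing.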