When it comes to machine learning, well-known frameworks such as TensorFlow and PyTorch come to mind, and many model implementations can be found on GitHub. However, an algorithm model still has to be put to work in a specific business scenario, and that is where many problems arise. Taking recommendation systems as an example, the following issues typically come up when building a real service:
**1. How is training data (samples) generated?** In recommendation scenarios, user feedback signals such as exposures and clicks have to be joined with various feature sources, followed by feature cleaning and extraction, splitting off a validation set, negative sampling, and other complex processing. An extensive data system is needed to handle the batch or streaming data involved (see the PySpark sketch after this list).
**2. After the training data is generated on the big data platform, how is it delivered to the deep learning framework?** Frameworks such as TensorFlow and PyTorch each have their own input formats and corresponding DataLoader interfaces that require data parsing. Problems such as data sharding and variable-length features often puzzle algorithm engineers (a minimal DataLoader sketch also follows this list).
**3. How do you run distributed training that can freely schedule cluster resources, including GPUs?** A dedicated operations team may be required just to manage the hardware scheduling for machine learning.
**4. How do you train models with sparse features and use pre-trained NLP models?** In recommendation scenarios, sparse features must be handled at large scale to model the interest relationships between users and items. At the same time, multi-modal model fusion is gradually becoming a frontier direction.
**5. After model training is completed, how can online prediction be carried out efficiently?** Besides cluster resources, elastic scaling and load balancing are also involved, and resources must be allocated dynamically in a heterogeneous environment with CPUs, GPUs, and NPUs. For complex models, distillation, quantization, and other methods are needed to keep prediction performance acceptable.
**6. After the model goes online, how does the online system extract and join features, ensure online/offline feature consistency, and quantitatively evaluate the algorithm's effect?** We need an online algorithm application framework that integrates organically with online systems, reads all kinds of online data sources, and provides multi-layer A/B-test traffic experiments.
**7. How do you iterate efficiently on algorithm experiments?** We want to be able to run multiple experiments in parallel quickly to improve business performance, rather than being tied down by complex system and environment configuration.
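To make point 1 concrete, below is a minimal PySpark sketch of sample generation: exposure and click logs are joined into labeled samples, feature tables are spliced in, negatives are down-sampled, and the result is split by date. The table paths, column names, date boundary, and the 10% negative-sampling ratio are illustrative assumptions, not part of any particular platform.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sample-generation").getOrCreate()

# Exposure (impression) and click feedback logs plus feature tables; all paths are hypothetical.
impressions = spark.read.parquet("s3://warehouse/logs/impressions/")
clicks = spark.read.parquet("s3://warehouse/logs/clicks/")
user_features = spark.read.parquet("s3://warehouse/features/user/")
item_features = spark.read.parquet("s3://warehouse/features/item/")

# Label an impression as positive if a matching click exists, otherwise negative.
samples = (
    impressions
    .join(clicks.select("request_id", "user_id", "item_id", F.lit(1).alias("label")),
          on=["request_id", "user_id", "item_id"], how="left")
    .fillna({"label": 0})
    # Splice in user-side and item-side features.
    .join(user_features, on="user_id", how="left")
    .join(item_features, on="item_id", how="left")
)

# Keep all positives, down-sample negatives to 10%, then split into train/validation by date
# (assumes the impression log carries a 'dt' partition column).
positives = samples.filter("label = 1")
negatives = samples.filter("label = 0").sample(fraction=0.1, seed=42)
dataset = positives.union(negatives)
dataset.filter("dt < '2022-06-01'").write.mode("overwrite").parquet("s3://warehouse/samples/train/")
dataset.filter("dt >= '2022-06-01'").write.mode("overwrite").parquet("s3://warehouse/samples/valid/")
```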
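For point 2, those samples still have to be parsed into the training framework's input format. Here is a minimal sketch, assuming the samples are parquet files with a few dense feature columns, read with pyarrow into a plain PyTorch Dataset; one of MetaSpore's goals is to integrate this step, so the code below is only a generic illustration of what it replaces.

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset, DataLoader

class ParquetSamples(Dataset):
    """Loads a parquet sample file into memory as dense feature and label tensors."""

    def __init__(self, path, feature_cols, label_col="label"):
        table = pq.read_table(path, columns=feature_cols + [label_col])
        columns = [table.column(c).to_pylist() for c in feature_cols]
        self.features = torch.tensor(columns, dtype=torch.float32).T  # (rows, features)
        self.labels = torch.tensor(table.column(label_col).to_pylist(), dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Column names and the file path are hypothetical; real sample tables have many more features.
train_data = ParquetSamples("samples/train/part-00000.parquet",
                            feature_cols=["age", "price", "ctr_7d"])
loader = DataLoader(train_data, batch_size=256, shuffle=True)
```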
At large Internet companies, the problems above are usually solved by multiple teams running multiple different systems. As the figure below shows, building a complete, industrial-grade recommendation system is complicated and tedious, requiring considerable knowledge across different fields and a significant engineering investment. Small and medium-sized enterprises lack both the staffing and a one-stop platform to solve these problems in a standardized way.
*The overall architecture of a recommendation system, as shared by the Netflix algorithm engineering team*
But now, MetaSpore, a new machine learning platform, can solve these problems.
With MetaSpore, enterprises and developers can solve the various problems encountered in algorithmic business development. Its standardized components and development interfaces provide a one-stop development experience, giving enterprises and developers access to best practices in algorithmic business development. Specifically, MetaSpore is built around the following core design concepts:
**1. Seamless integration of model training and big data systems.** MetaSpore can directly read structured and unstructured data from data lakes and warehouses for training, and it seamlessly connects data access, feature preprocessing, and model training, eliminating tedious data import, export, and format-conversion steps.
**2. Support for sparse features.** Training large-scale sparse Embedding layers is often required in search, recommendation, and advertising scenarios, and sparse features need special handling such as cross-feature combination and variable-length feature pooling, which calls for dedicated support in the training framework (see the pooling sketch after this list).
**3. High-performance online prediction services.** The online prediction service supports neural networks (including sparse Embedding layers), decision trees, and a variety of traditional machine learning models. It also supports computing acceleration on heterogeneous hardware, lowering the engineering threshold for online deployment.
**4. Unified online and offline feature computation.** By unifying the feature format and the computation logic, the same feature computation is shared across systems, saving the cost of repeated development and ensuring online/offline feature consistency.
**5. Online algorithm application framework.** The online algorithm application framework covers the common function points of online systems, such as automatic feature extraction from multiple data sources, feature computation, a prediction service interface, dynamic experiment configuration, and dynamic A/B-test traffic splitting (see the bucketing sketch after this list).
**6. Embrace open source.** MetaSpore provides several independently developed components to implement these core functions. At the same time, its development philosophy is to embrace the mature open-source ecosystem as much as possible. Rather than fragmenting that ecosystem, this lowers the barrier to learning and lets developers who want to move quickly build on their existing experience.
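To illustrate the sparse-feature point above: a variable-length ID feature (for example, the categories a user clicked recently) is typically looked up in a large embedding table and pooled into a fixed-size vector. Below is a minimal PyTorch sketch with `nn.EmbeddingBag`; the vocabulary size, dimension, and inputs are made up, and a platform-level sparse Embedding layer targets far larger tables than this.

```python
import torch
import torch.nn as nn

# A small embedding table for one sparse ID feature; production tables can have
# hundreds of millions of rows, which is what dedicated sparse training support targets.
embedding = nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=16, mode="mean")

# Two samples with variable-length ID lists, flattened into one tensor plus offsets:
# sample 0 -> ids [17, 42, 99], sample 1 -> ids [7]
ids = torch.tensor([17, 42, 99, 7])
offsets = torch.tensor([0, 3])

pooled = embedding(ids, offsets)  # shape (2, 16): one fixed-size vector per sample
print(pooled.shape)
```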
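And for the online application framework point, dynamic A/B-test traffic splitting usually comes down to deterministically hashing a user into a bucket and mapping bucket ranges to experiment groups. A minimal sketch is below; the layer name, bucket count, and group shares are assumptions, and a real framework stacks many such layers and reloads the configuration dynamically.

```python
import hashlib

def assign_group(user_id: str, layer: str, groups: dict, buckets: int = 100) -> str:
    """Deterministically assign a user to an experiment group within one layer.

    `groups` maps group name -> traffic share, e.g. {"control": 0.5, "treatment": 0.5}.
    Salting the hash with the layer name keeps different experiment layers independent.
    """
    digest = hashlib.md5(f"{layer}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % buckets
    threshold = 0.0
    for name, share in groups.items():
        threshold += share * buckets
        if bucket < threshold:
            return name
    return list(groups)[-1]  # guard against rounding error in the shares

# Example: a 50/50 split on a hypothetical "ranking_model" experiment layer.
print(assign_group("user_12345", "ranking_model", {"control": 0.5, "treatment": 0.5}))
```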
It has to be said that MetaSpore is a new machine learning platform with distinctive qualities, tackling in one place problems that other products leave scattered across many systems. As a young open source project, however, it still has a long way to go, and I'll keep an eye on MetaSpore and share more about it.