ACE Co-innovation Ecosystem

Streamlining Federated Learning Workflows with MLOps Platform

Author: Fangchi Wang, Staff Engineer at VMware AI Labs, Office of the CTO.

Federated Learning, or FL, has gained significant attention recently for its privacy-preserving, communication-efficient approach to applying AI/ML to distributed data. VMware has been actively participating in the FL community by contributing to open source projects, publishing solution whitepapers, and promoting related techniques at various events. Our primary focus is providing secure, robust infrastructure and deployment management solutions for FL systems and workloads, leveraging VMware products and solutions. We are excited to introduce our recent collaboration with One Convergence™ Inc. to integrate Federated Learning into MLOps solutions, particularly the DKube platform, to give our customers a seamless FL workflow experience.

Federated Learning and FATE (Federated AI Technology Enabler)
The success of artificial intelligence depends critically on the quantity and quality of the data used to train effective prediction models. In real-world applications, however, data often remains isolated in individual data silos. This isolation poses a crucial challenge for data sharing, primarily due to business competition and the need to comply with privacy-protection laws and regulations such as the General Data Protection Regulation (GDPR). The inability to fully utilize the data thus impedes the training process required to develop meaningful models. Federated learning has emerged to tackle this issue, allowing organizations to overcome data silos while ensuring data privacy and security in alignment with regulations.
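To make the idea concrete, the core of many FL algorithms is federated averaging: each silo trains on its own data and shares only model parameters, never raw records, with an aggregator. The sketch below is an illustration of the concept in plain Python, not FATE's actual implementation.

```python
# Minimal federated-averaging (FedAvg) sketch: each party trains locally
# and shares only model weights; the aggregator computes a weighted
# average. Conceptual illustration only, not FATE's implementation.

def fed_avg(party_weights, party_sizes):
    """Weighted average of per-party model weights (lists of floats)."""
    total = sum(party_sizes)
    dim = len(party_weights[0])
    return [
        sum(w[i] * n for w, n in zip(party_weights, party_sizes)) / total
        for i in range(dim)
    ]

# Two parties holding 100 and 300 local samples respectively:
global_w = fed_avg([[1.0, 2.0], [3.0, 4.0]], party_sizes=[100, 300])
# 0.25 * [1, 2] + 0.75 * [3, 4] -> [2.5, 3.5]
```

In a real deployment this exchange repeats over many rounds, with each round's global model redistributed to the parties for further local training.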

FATE, an open source project hosted by the LF AI & Data Foundation, provides a secure computing framework that underpins the federated AI ecosystem. It has garnered contributions from industry leaders such as WeBank, VMware, Tencent, UnionPay, and many others. Originating in the financial industry, FATE strongly emphasizes privacy preservation and is designed for industrial applications. Its primary objective is to implement secure computation protocols, leveraging advanced techniques such as homomorphic encryption and multi-party computation. By adopting these protocols, FATE enables the use of various machine learning algorithms while ensuring robust data privacy and security measures are in place.
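To give a flavor of what such secure computation protocols achieve, the toy example below uses additive secret sharing: an aggregator can reconstruct the sum of the parties' values without ever seeing any individual value. This is a simplified illustration of the principle, not FATE's production protocol.

```python
import random

# Toy additive secret sharing over a modulus: each party splits its value
# into random shares, so any single share reveals nothing; only the SUM
# can be reconstructed. Illustrative only -- not FATE's actual protocol.

MOD = 2**31 - 1

def share(value, n_parties):
    """Split `value` into n additive shares mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Recover the sum encoded by a list of shares."""
    return sum(shares) % MOD

# Three parties secretly hold 10, 20, 30; each splits its value:
secrets = [10, 20, 30]
all_shares = [share(s, 3) for s in secrets]
# Each column is summed independently, then the column sums are combined:
column_sums = [reconstruct(col) for col in zip(*all_shares)]
total = reconstruct(column_sums)  # 60, with no single input revealed
```

Homomorphic encryption serves a similar end by allowing arithmetic directly on ciphertexts.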

As a Technical Steering Committee (TSC) board member of the FATE community, the VMware AI Labs team has made significant contributions to the FATE ecosystem, including key features in FATE releases and the creation of cloud-native FL solutions such as KubeFATE and FedLCM. To learn more about Federated Learning and VMware’s cloud-native FL efforts, please refer to the following previous blogs:

Federated Machine Learning: Overcoming Data Silos and Strengthening Privacy
Cloud-Native Federated Learning and Projects

As with any other machine learning task, applying FATE and FL involves the typical MLOps workflow. For cloud-native machine learning, Kubeflow is one of the top choices.

“Implement” Kubeflow with DKube
In recent years, Kubeflow has evolved into a leading AI/ML platform, integrating many open-source advancements into a cost-effective solution. Notably, it transitioned from a Google project to an independent CNCF project in July 2023.

However, “implementing” Kubeflow in your preferred cloud or on-prem environment still requires significant work. Deploying Kubeflow successfully and operationalizing your data, model preparation, tuning, deployment, and monitoring, while managing security, compliance, and governance, remains challenging. Doing it yourself can take several people many months, for every new organization and almost every new installation. The productivity and time loss are significant: the cost savings of using Kubeflow are offset by added expense and time that can run to hundreds of thousands of dollars and many months per installation. For this reason, many Kubeflow installation projects at large Fortune 100 companies have stalled.

But there is good news on the Kubeflow front. New AI/ML platforms built natively on top of Kubeflow can address this challenge for you. DKube from One Convergence™ Inc., for example, packages standard Kubeflow with a better, more modern UI and integrates with AWS EKS, Azure AKS, and any cloud or on-prem Kubernetes distribution such as VMware Tanzu Kubernetes Grid. As shown in the graphic below, it also integrates with storage backends (Azure Blob, Azure NFS, AWS S3, and on-prem S3/NFS/Ceph), with Active Directory or LDAP authentication in any cloud or on-prem installation, with version control systems (Git, GitOps, Bitbucket, and Azure DevOps), and with healthcare data sources such as Arvados and Flywheel. In other words, you get a shrink-wrapped package that, with a few simple commands at install and configuration time, gets you going on AWS, Azure, GCP, or on-prem on a Kubernetes distribution of your choice. From installation start to onboarded users can take as little as a few hours to a day, depending on the complexity of the cluster.

[Graphic: DKube’s integrations with clouds, Kubernetes distributions, storage, authentication, version control, and data sources]

Accelerating FATE Workflow through DKube Integration
Through a collaborative effort between VMware AI Labs and the DKube engineering team, support for FATE has been integrated into DKube. As shown in the diagram below, once the FATE systems are deployed and configured, FL workflows can be streamlined via DKube IDEs, Runs, and Model Management. In the following sections, we explore the detailed steps of this integration.

[Diagram: streamlining FATE FL workflows via DKube IDEs, Runs, and Model Management]

Deploying and Configuring FATE Clusters
As previously mentioned, the VMware AI Labs team maintains two open-source projects, KubeFATE and FedLCM, which offer the capability to deploy and manage FATE systems in a cloud-native manner. KubeFATE facilitates the provisioning and management of FATE systems (also known as FATE clusters) on Kubernetes in data centers and multi-cloud environments, while FedLCM orchestrates FATE deployments from a multi-party perspective, enabling the operation and connection of distributed FATE clusters to form the federated learning “federation.”

Once the FATE federation is created, each participant uses DKube to interact with its FATE system and manage FATE jobs. To enable this functionality, the FATE cluster needs to be added in the Operational View of the DKube UI: simply navigate to the Clusters page and enter the cluster’s FATE-Flow access details.
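The access details boil down to the FATE-Flow HTTP endpoint of each cluster. The sketch below shows what such a per-cluster record might look like; the field names are illustrative (not DKube's actual schema), and 9380 is FATE-Flow's conventional HTTP port.

```python
# Sketch of the FATE-Flow access details a platform like DKube would
# store per cluster. Field names are illustrative, not DKube's actual
# schema; 9380 is FATE-Flow's conventional HTTP port.

def flow_endpoint(host, port=9380, use_tls=False):
    """Build the base URL of a FATE-Flow service."""
    scheme = "https" if use_tls else "http"
    return f"{scheme}://{host}:{port}"

cluster = {
    "name": "party-a-fate",    # display name on the Clusters page
    "party_id": 9999,          # FATE party ID of this participant
    "flow_url": flow_endpoint("fateflow.party-a.example.com"),
}
# cluster["flow_url"] -> "http://fateflow.party-a.example.com:9380"
```

With FedLCM-managed federations, each participant would register its own cluster's endpoint this way.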

[Screenshot: adding a FATE cluster’s FATE-Flow access details on the DKube Clusters page]

Developing FATE Training Code Using DKube IDEs
Once the FATE cluster information is added to DKube, we can start working with it in the IDEs tab of the Data Science View. On the IDE creation page, we can select FATE as the ML framework, which creates a JupyterLab instance with the FATE client SDK pre-installed. In the configuration section, we can select the newly added FATE cluster; the IDE instance will then automatically configure the FATE client SDK to connect to that specific cluster, letting users seamlessly write and test their FATE client code and manage data and jobs within the FATE cluster.
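Inside such an IDE, a typical first step is registering local data with the FATE cluster. The sketch below builds an upload-job configuration modeled on FATE's official examples; exact keys can vary across FATE versions, so verify against your release.

```python
# Sketch of a FATE data-upload job configuration, modeled on FATE's
# official examples (keys may differ slightly across FATE versions).

def upload_conf(csv_path, namespace, table_name, head=True, partitions=4):
    """Build the config dict used to register a dataset with FATE."""
    return {
        "file": csv_path,          # path visible to the FATE cluster
        "head": 1 if head else 0,  # whether the CSV has a header row
        "partition": partitions,   # storage partitions for the table
        "namespace": namespace,    # logical grouping of tables
        "table_name": table_name,  # name referenced later by training jobs
    }

conf = upload_conf("/data/breast_hetero_guest.csv",
                   namespace="experiment",
                   table_name="breast_hetero_guest")
# In the IDE, this dict would be handed to the pre-configured FATE
# client, e.g. via `flow data upload -c conf.json` or the pipeline SDK.
```

Because the SDK is already pointed at the selected cluster, no endpoint details need to appear in the notebook itself.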

[Screenshot: selecting FATE as the framework when creating a DKube IDE]

Launching FATE FL Job via DKube Runs
Besides interacting with the FATE cluster via DKube IDEs, we can also launch FATE jobs as DKube Runs. As with the IDEs, we specify FATE as the framework and select the target FATE cluster to execute the job. Moreover, for horizontal federated learning, the trained model can be retrieved and saved into DKube. Once a Run completes, the trained model appears on the DKube Models page, and we can deploy it as an online serving service following the standard DKube model deployment workflow.
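Under the hood, a Run like this amounts to submitting a job to FATE-Flow, polling until it reaches a terminal state, and then pulling the model. A sketch of that polling logic follows; the status names follow FATE-Flow conventions, and `fetch_status` is a hypothetical stand-in for the real API call.

```python
import time

# Sketch of the poll-until-done loop a Run performs against FATE-Flow.
# Status names follow FATE-Flow conventions; `fetch_status` is a
# hypothetical stand-in for the actual API call.

TERMINAL = {"success", "failed", "canceled", "timeout"}

def wait_for_job(fetch_status, job_id, interval=0.0, max_polls=100):
    """Poll `fetch_status(job_id)` until a terminal state or poll budget."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")

# Simulated FATE-Flow responses for demonstration:
responses = iter(["waiting", "running", "running", "success"])
final = wait_for_job(lambda job_id: next(responses), "202401010001")
# final == "success"; at this point the Run would download the trained
# model and register it on the DKube Models page.
```

Only on a `success` status does the model land on the Models page; failed Runs surface the terminal status instead.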

[Screenshot: launching a FATE FL job via DKube Runs]

DKube IDEs and Runs support all FATE federated learning algorithms, including FATE-LLM, a recently released module that enables parameter-efficient fine-tuning of large language models through federated learning. We have verified that the official FATE-LLM examples can be executed within DKube.
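Parameter-efficient methods are a natural fit for FL because parties exchange only small adapter weights rather than full model weights. The back-of-the-envelope arithmetic below, for a LoRA-style low-rank adapter, is illustrative only and not FATE-LLM code.

```python
# Why parameter-efficient fine-tuning suits FL: with LoRA-style low-rank
# adapters, parties exchange only the small adapter matrices instead of
# full model weights. Illustrative arithmetic, not FATE-LLM code.

def lora_params(d_in, d_out, rank):
    """Trainable params of one adapter: A (d_in x r) plus B (r x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                        # one full 4096x4096 projection
adapter = lora_params(4096, 4096, rank=8) # 65,536 adapter weights
ratio = adapter / full                    # ~0.4% of the full matrix
```

Smaller exchanged payloads mean lower communication cost per FL round, which is often the bottleneck in cross-silo training.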

What’s Next?
Besides KubeFATE and FedLCM, VMware has actively engaged with and made substantial contributions to the FL community. One notable contribution is the FATE-Operator for Kubeflow, which enables FATE management through the operator pattern. We are also collaborating with and contributing to OpenFL, another open-source federated learning project hosted by LF AI & Data. These contributions can be integrated into MLOps platforms such as DKube, enabling an end-to-end FL process that covers everything from deployment and operation to freely selecting the most suitable framework from a set of FL options. We continue to work closely with partners to bring together the best of each solution and accelerate our customers’ success on their AI/ML journey.