Real-Time Distributed Training of a Time-Series Model on Streaming Data
Objective:
Develop a distributed AI/ML system that enables real-time training of a time-series model on streaming data in a cloud-scale environment.
Constraints:
- The system must handle 10 million IoT devices, each generating 1,000 bytes of sensor data per second (about 10 GB/s in aggregate). This data is streamed into the training platform via WebSockets.
- The model architecture is a custom-designed graph neural network (GNN) with 5 million parameters and an average training iteration time of 30 seconds on a single NVIDIA V100 GPU.
- The system requires real-time predictions with a latency of less than 5 seconds for new, unseen IoT data points.
- Training must occur in parallel on multiple AWS EC2 p3.16xlarge instances (8 NVIDIA V100 GPUs per instance) to achieve training times of less than 4 hours for the entire dataset.
- The model must be deployed as a cloud-agnostic, containerized microservice using Kubernetes and be accessible via a RESTful API.
- System monitoring and logging are critical to ensure that all nodes are functioning and the training process is stable. Log messages and training metrics must be streamed in real-time to an Elasticsearch cluster.
- Finally, the model should handle data drift and concept drift by automatically updating its parameters on a fixed schedule every 24 hours to adapt to changing patterns in the input data.
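To get a feel for the scale these constraints imply, here is a rough back-of-envelope calculation. It uses only the numbers stated above, plus one labeled assumption: that parameters are stored as 32-bit floats. It ignores protocol overhead, compression, and communication cost between GPUs, so treat the results as order-of-magnitude estimates rather than a sizing plan.

```python
# Back-of-envelope sizing from the stated constraints (rough estimates only).
DEVICES = 10_000_000            # IoT devices
BYTES_PER_DEVICE_PER_S = 1_000  # payload per device per second
ITERATION_SECONDS = 30          # average iteration time on one V100
TRAINING_BUDGET_S = 4 * 3600    # 4-hour training target
PARAMS = 5_000_000              # model parameters
BYTES_PER_PARAM = 4             # ASSUMPTION: float32 weights (not stated in the brief)

# Aggregate ingest rate and raw daily volume.
ingest_bytes_per_s = DEVICES * BYTES_PER_DEVICE_PER_S
daily_bytes = ingest_bytes_per_s * 86_400

# Sequential iterations that fit in the budget if the per-iteration time
# stays at 30 s (i.e. ignoring any multi-GPU synchronization overhead).
iterations_in_budget = TRAINING_BUDGET_S // ITERATION_SECONDS

# Size of one full copy of the weights under the float32 assumption.
model_bytes = PARAMS * BYTES_PER_PARAM

print(f"ingest rate: {ingest_bytes_per_s / 1e9:.0f} GB/s")
print(f"daily volume: {daily_bytes / 1e12:.0f} TB")
print(f"iterations in 4 h: {iterations_in_budget}")
print(f"model size: {model_bytes / 1e6:.0f} MB")
```

Under these assumptions the stream carries about 10 GB/s (864 TB of raw data per day), and a 4-hour window leaves room for 480 iterations at 30 seconds each, so the design has to decide how much data each iteration consumes across all GPUs.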
Technical Requirements:
- Choose a suitable distributed ML framework (e.g., PyTorch Distributed, TensorFlow Distributed, or Ray) to handle the training.
- Develop a real-time data ingestion and transformation pipeline using Apache Kafka or Amazon Kinesis to process the incoming IoT data.
- Implement a model-server architecture (e.g., TensorFlow Serving) to deploy the trained model as a cloud-agnostic, containerized microservice.
- Use a cloud-agnostic scheduling system like Apache Airflow to manage the automatic update of model parameters.
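As one illustration of the transformation step in the ingestion pipeline, the sketch below groups a stream of readings into fixed-length sliding windows per device, which is a common input shape for time-series models. It is a minimal, framework-free sketch: the `(device_id, timestamp, value)` record schema, the `sliding_windows` function, and the window size are all hypothetical, and in a real deployment the input iterable would be fed by a Kafka or Kinesis consumer rather than an in-memory list.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, List, Tuple

# Hypothetical record schema: (device_id, timestamp, value).
Reading = Tuple[str, float, float]


def sliding_windows(
    readings: Iterable[Reading], window_size: int
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (device_id, last `window_size` values) once a device has enough data."""
    # One bounded buffer per device; the oldest value is dropped automatically.
    buffers: Dict[str, Deque[float]] = defaultdict(lambda: deque(maxlen=window_size))
    for device_id, _timestamp, value in readings:
        buf = buffers[device_id]
        buf.append(value)
        if len(buf) == window_size:
            yield device_id, list(buf)


# Stand-in for a stream consumer: interleaved readings from two devices.
stream = [
    ("a", 0.0, 1.0),
    ("b", 0.0, 10.0),
    ("a", 1.0, 2.0),
    ("a", 2.0, 3.0),
    ("b", 1.0, 20.0),
    ("a", 3.0, 4.0),
]
windows = list(sliding_windows(stream, window_size=3))
print(windows)
```

Here device `a` produces two windows and device `b`, with only two readings, produces none. Keeping per-device state in memory like this would not hold for 10 million devices on a single process; a real pipeline would partition the stream by device ID so each worker only tracks its own share.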
Your Challenge: Design, implement, and deploy a scalable and real-time distributed AI/ML system that meets all the constraints and requirements above, and showcases your expertise in distributed AI/ML and cloud-scale deployments.