Dr. Carlos Ruiz Viquez

Real-time Distributed Training of a Time-Series Model on Streaming Data

Objective:

Develop a distributed AI/ML system that enables real-time training of a time-series model on streaming data in a cloud-scale environment.

Constraints:

  1. The system must handle 10 million IoT devices, each generating 1,000 bytes of sensor data per second (roughly 10 GB/s in aggregate). The data is streamed into the training platform via WebSockets.
  2. The model architecture is a custom-designed graph neural network (GNN) with 5 million parameters and an average training iteration time of 30 seconds on a single NVIDIA V100 GPU.
  3. The system requires real-time predictions with a latency of less than 5 seconds for new, unseen IoT data points.
  4. Training must occur in parallel on multiple AWS EC2 p3.16xlarge instances (8 NVIDIA V100 GPUs per instance) to achieve training times of less than 4 hours for the entire dataset.
  5. The model must be deployed as a cloud-agnostic, containerized microservice using Kubernetes and be accessible via a RESTful API.
  6. System monitoring and logging are critical to ensure that all nodes are functioning and the training process is stable. Log messages and training metrics must be streamed in real-time to an Elasticsearch cluster.
  7. Finally, the model should handle data drift and concept drift by automatically updating its parameters on a fixed 24-hour schedule, so that it adapts to changing patterns in the input data.
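
The figures in the constraints above imply some back-of-envelope numbers that are worth checking before choosing infrastructure. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative and not part of the original challenge, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Back-of-envelope sizing from the stated constraints.
# All names below are illustrative; the values come from constraints 1, 2 and 4.

DEVICES = 10_000_000               # constraint 1: number of IoT devices
BYTES_PER_DEVICE_PER_SEC = 1_000   # constraint 1: payload per device per second
ITERATION_SECONDS = 30             # constraint 2: one training iteration on one V100
TRAINING_BUDGET_HOURS = 4          # constraint 4: wall-clock training budget
SECONDS_PER_DAY = 86_400


def ingest_bytes_per_second(devices: int = DEVICES,
                            payload: int = BYTES_PER_DEVICE_PER_SEC) -> int:
    """Aggregate raw ingest rate across all devices, in bytes per second."""
    return devices * payload


def daily_volume_bytes() -> int:
    """Raw data accumulated over one 24-hour retraining cycle, in bytes."""
    return ingest_bytes_per_second() * SECONDS_PER_DAY


def sequential_iterations_in_budget(budget_hours: int = TRAINING_BUDGET_HOURS,
                                    iteration_seconds: int = ITERATION_SECONDS) -> int:
    """How many 30-second iterations fit in the budget if run one after another."""
    return (budget_hours * 3600) // iteration_seconds


if __name__ == "__main__":
    print(f"Ingest rate: {ingest_bytes_per_second() / 1e9:.0f} GB/s")
    print(f"Daily volume: {daily_volume_bytes() / 1e12:.0f} TB")
    print(f"Sequential iterations in budget: {sequential_iterations_in_budget()}")
```

Under these assumptions, a single GPU fits only a few hundred sequential iterations into the 4-hour window, while a 24-hour cycle accumulates hundreds of terabytes of raw data. This is one way to see why the challenge calls for data-parallel training and for aggregation or downsampling in the ingestion pipeline.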

Technical Requirements:

  1. Choose a suitable distributed ML framework (e.g., PyTorch Distributed, TensorFlow Distributed, or Ray) to handle the training.
  2. Develop a real-time data ingestion and transformation pipeline using Apache Kafka or Amazon Kinesis to process the incoming IoT data.
  3. Implement a model-server architecture (e.g., TensorFlow Serving) to deploy the trained model as a cloud-agnostic, containerized microservice.
  4. Use a cloud-agnostic scheduling system like Apache Airflow to manage the automatic update of model parameters.
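
As one illustration of the transformation step in requirement 2, the sketch below groups a stream of readings into fixed-length sliding windows per device, which is a common input shape for time-series models. It is only a sketch under stated assumptions: the `(device_id, timestamp, value)` record shape and the `sliding_windows` function are hypothetical, and a plain Python iterable stands in for a Kafka or Kinesis consumer.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (device_id, timestamp, value).
Reading = Tuple[str, float, float]


def sliding_windows(readings: Iterable[Reading],
                    window_size: int) -> Iterator[Tuple[str, List[float]]]:
    """Yield (device_id, window) whenever a device has `window_size` recent values.

    Each device keeps its own bounded buffer, so memory per device is
    capped at `window_size` values regardless of how long the stream runs.
    """
    buffers: Dict[str, Deque[float]] = defaultdict(
        lambda: deque(maxlen=window_size)
    )
    for device_id, _timestamp, value in readings:
        buf = buffers[device_id]
        buf.append(value)  # the oldest value is dropped automatically when full
        if len(buf) == window_size:
            yield device_id, list(buf)


if __name__ == "__main__":
    # Interleaved readings from two hypothetical devices.
    stream = [
        ("a", 0.0, 1.0), ("b", 0.0, 10.0),
        ("a", 1.0, 2.0), ("a", 2.0, 3.0),
        ("b", 1.0, 20.0), ("a", 3.0, 4.0),
    ]
    for device_id, window in sliding_windows(stream, window_size=3):
        print(device_id, window)
```

In a real deployment the per-device buffers would likely be partitioned by device ID across consumers so that each worker only holds state for its own share of the 10 million devices; that partitioning is not shown here.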

Your Challenge: Design, implement, and deploy a scalable and real-time distributed AI/ML system that meets all the constraints and requirements above, and showcases your expertise in distributed AI/ML and cloud-scale deployments.


Published automatically
