Satish Chandra Gupta

Posted on Apr 25, 2020 • Edited on Sep 6, 2022 • Originally published at ml4devs.com

Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud

#bigdata #analytics #cloud #architecture

Scalable and efficient data pipelines are as important for the success of analytics and ML as reliable supply lines are for winning a war.

For deploying big-data analytics, data science, and machine learning (ML) applications in the real world, analytics tuning and model training are only around 25% of the work. Approximately 50% of the effort goes into making data ready for analytics and ML. The remaining 25% of the effort goes into making insights and model inferences easily consumable at scale. The data pipeline puts it all together. It is the railroad on which heavy and marvelous wagons of ML run. Long-term success depends on getting the data pipeline right.

This article gives an introduction to the data pipeline and an overview of architecture alternatives.