Scalable and efficient data pipelines are as important for the success of analytics and ML as reliable supply lines are for winning a war.
When deploying big-data analytics, data science, and machine learning (ML) applications in the real world, analytics tuning and model training account for only around 25% of the work. Approximately 50% of the effort goes into making data ready for analytics and ML. The remaining 25% goes into making insights and model inferences easily consumable at scale. The data pipeline puts it all together. It is the railroad on which the heavy and marvelous wagons of ML run. Long-term success depends on getting the data pipeline right.
This article gives an introduction to the data pipeline and an overview of architecture alternatives.