Introduction
"According to incomplete statistics, the proportion of white hair of data workers is higher than the average of the same age group." by a data worker
Data, a familiar yet mysterious word, has become a totem that everyone pursues. Managers love fancy data reports, data analysts are keen on building complicated statistical models, and salespeople use dashboards as compasses to check whether they will hit their KPIs. Over the past ten-plus years the data industry has grown fast, bringing novel yet formidable jargon such as Big Data, Data Science, Data Lake, Data Mesh, and Data Governance. Yet the "traditional" terms remain abstruse: Data Warehouse, Business Intelligence, Data Mart, Data Mining. A bigger headache is that many people still cannot see how these terms relate to recently popular concepts such as Artificial Intelligence, Machine Learning, and Deep Learning. All of these buzzwords are products of the aggressive development of the data field.
Professional Doctor or Fortune Teller?
Years ago, as the Internet industry boomed, the bubble of the data industry kept growing. Data, the by-product of Internet applications, comes in large volumes and great variety. Data owners want to get the most out of it and regard it as a gold mine. As a result, data mining engineers became some of the most sought-after professionals. Later, a brand-new and even more popular position, Data Scientist, emerged as "the sexiest job of the 21st century".
The role is so sought after because it demands abilities and experience in several areas:
- Programming Skills: at least able to use Python or R to do data cleansing, analysis and modeling.
- Mathematics and Statistics: familiar with probability theory, calculus, and discrete mathematics.
- Business Knowledge: deep understanding of the markets, processes and macro trends in related areas.
- Communication Skills: able to convey insights and analysis results in a human-friendly way.
The appeal of the data scientist role therefore comes from its high barriers to entry, because people who excel in all of the above are rare. Yet even with such versatile talent, many data science projects ultimately fail. The two major issues are scale and quality. According to CrowdFlower, in data science projects in 2016 about 80% of the time was spent on data collection and data cleansing, whilst only about 20% was spent on analysis and modeling. This is a huge waste.
Because most enterprise system architectures cannot support large-scale, high-quality data processing pipelines, some of the work has to be done manually, the so-called "human intelligence". The data models built this way have low prediction accuracy, so the data scientists behind them get labelled as "quacks" or "fortune tellers". To become a true "professional doctor", you need not only "professional medical knowledge" (core abilities), but also the support of "professional medical equipment" (architecture and process).
Where Is the Way Forward?
Many data workers complain about the fierce competition in the data field. Fortunately, the situation seems to be improving. Data analysts used to inspect distribution charts by hand to find deep insights, but now they can use machine learning models to automate part of this process. Traditional data analysis and modeling skills are gradually becoming easier to acquire. For instance, Power BI and Tableau let users generate visual charts and models in a drag-and-drop, low-code fashion, whereas the old way is to import Python libraries such as pandas, matplotlib and sklearn and do the same in a Jupyter Notebook. The open-source projects Apache Superset and Metabase let users analyze data easily in a web browser. This is quite similar to the evolution of cameras, from film cameras to digital cameras and then to the smartphone cameras everyone uses. As the technical barriers keep falling, the whole industry can develop quickly. "Everyone can be a data analyst" will no longer be a fantasy.
However, data quality is still an issue. Although machine learning models can automatically fill in missing data and correct wrong data, manual intervention is still needed most of the time. Even the powerful AI models based on deep learning are trained on large amounts of manually annotated data. As a result, many organizations are promoting data standardization, an essential part of data governance. Garbage in, garbage out.
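To make the mix of automatic fixing and manual intervention concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the sample records, the `COUNTRY_ALIASES` lookup table, the `clean` function and the `max_age` threshold are hypothetical, and a simple median fill stands in for the "intelligent" models mentioned above. It standardizes one field, fills missing values automatically, and routes implausible values to a queue for a human to review rather than guessing a correction.

```python
from statistics import median

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "us"},
    {"id": 2, "age": None, "country": "US "},
    {"id": 3, "age": 290, "country": "United States"},
    {"id": 4, "age": 41, "country": "de"},
]

# Hypothetical standardization table: raw spellings -> one canonical code.
COUNTRY_ALIASES = {"us": "US", "united states": "US", "de": "DE"}


def clean(rows, max_age=120):
    """Return (cleaned_rows, review_queue) without mutating the input."""
    # Only plausible, non-missing ages are used to compute the fill value.
    valid_ages = [
        r["age"] for r in rows
        if r["age"] is not None and 0 <= r["age"] <= max_age
    ]
    fill_value = median(valid_ages)

    cleaned, review_queue = [], []
    for r in rows:
        row = dict(r)  # work on a copy

        # Standardization: normalize spelling, then map to a canonical code.
        key = row["country"].strip().lower()
        row["country"] = COUNTRY_ALIASES.get(key, row["country"])

        if row["age"] is None:
            # Automatic fix: fill the gap and record that it was imputed.
            row["age"] = fill_value
            row["age_imputed"] = True
        elif not 0 <= row["age"] <= max_age:
            # Implausible value: leave it as-is and ask a human to check.
            review_queue.append(row["id"])

        cleaned.append(row)
    return cleaned, review_queue


cleaned, review_queue = clean(records)
for row in cleaned:
    print(row)
print("needs manual review:", review_queue)
```

The design choice worth noting is that the sketch never silently overwrites a suspicious value. It marks what it filled in and hands the rest to a person, which is the manual step that automation has not removed.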
No Silver Bullet
Data automation and data standardization are the mega trends of future development. However, given how wide the range of application areas is, we should not regard them as the only solution to data issues. Apart from fundamental professional data skills, the more important skills for today's data workers are data sensitivity and logical thinking, which are not taught in textbooks or courses. They have to come from project experience. Some seemingly high-profile terms may not be as useful as simple and practical methodologies.