If data is the new oil, then getting and enriching your own data is like fracking and refining it, at least in the case of textual data. This post gives you an overall picture of how to think about gathering and labeling data. You also get some tips on which business questions to consider.
The Data Science Hierarchy of Needs
These days more and more people try to build a so-called vertical AI startup/solution. These endeavors intend to solve industry-specific problems by combining AI and subject matter expertise. They have four distinct features: 1) they are full stack products, 2) they rely on subject matter expertise, 3) they are built on top of a proprietary dataset, and 4) AI delivers the core value. Our experience suggests that the third point, getting the right proprietary dataset, is the hardest and most decisive factor in every data-driven project, whether it is an intrapreneurial or an entrepreneurial endeavor.
Most people take data for granted. We get news about the newest deep learning algorithms every day. We live in the era of big data. Those of us who work in the tech field hear about new machine learning/artificial intelligence startups every day. So it must be easy to get data!
On the one hand, yes, there are awesome data repositories, like the UCI Machine Learning Repository. Governments are opening up and publishing their data via their own platforms or through something like CKAN. But keep in mind, your competitors can access these data too!
On the other hand, you have to get your own, domain-specific dataset, and annotate it to train your model(s)! Deep learning and other fancy ML algorithms are just the tip of the iceberg. There are plenty of things to do underneath. If you can’t get the underlying levels right, even the sexiest new deep learning algorithm will perform badly on your specific problem. Again, you can start by combining open datasets, but your competitors are doing the same. If you want to deliver real value that sets you apart from your competitors (i.e. better or more precise results), you have to build and annotate your own dataset. The popular data science hierarchy of needs pyramid should look as follows.
Source of the original picture: https://miro.medium.com/max/3760/1*jmk4Q2GAeUM_eqUtMh99oQ.png
Separate your tasks
Harvesting and annotating data are two separate tasks done by two different groups. Data collection is often carried out by traditional software engineers or by the data infrastructure team, while annotation is often led (and sometimes even done) by Data Scientists/Analysts. A good product manager keeps his or her hands on the data and involves every stakeholder in the process. A PM should keep reminding the team that getting and annotating data is a process, so you should constantly check the quality and scope of your raw and annotated data. The performance of the model you built using the data should also be monitored. You can use evaluation metrics and even some user feedback to plan further data gathering and data annotation task(s), which will help you build even better models.
Before you consider various options to gather and label data, keep in mind that you should build your initial dataset AND a pipeline/process that will help you train better and better models. Choosing a solution at one phase doesn’t mean that you cannot move to another one at a later phase. But note that transitioning from outsourcing to in-house scraping and labeling can be hard and very costly.
Your options
In theory, you have an idea about a product, and you need a special purpose dataset to train its magical AI part. Before you think over your options, you have to answer a few questions. What kind of data do you need in order to train a model? How can you get the data? Should you clean up the raw data before annotation? How much data should be annotated for the first model(s)? What does it mean to make a representative dataset in your case? You probably won’t get final answers at first, but don’t worry: a rough idea is enough initially.
As a next step, you should consider your options for data gathering and annotation, like
- building in-house competency
- crowdsourcing
- outsourcing
Source: https://cdn.pixabay.com/photo/2017/12/12/17/59/traffic-sign-3015228_960_720.png
Your constraints
You should know about your constraints like
- budget
- time
- law
- ethics
- technology
If you know your data sources, check them! Are they plain text or HTML? If they are websites, do you have to log in to these sites? Do they use modern JavaScript frameworks, like React? Do these sites/texts contain sensitive information about humans? If you have to scrape a site, check its robots.txt to learn what the owners let you scrape! Different regions have different laws to regulate scraping and storing publicly available data. Re-using data gained from scraping is often regulated by law. Although it can be pretty expensive, ask your lawyer first!
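As an illustration of the robots.txt check, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `my-research-bot` user agent, and the example.com URLs are all made up for the example; on a real project you would point the parser at the site's actual robots.txt. Passing this check only tells you what the site owner asks crawlers to do; it does not answer the legal questions above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical name for our own crawler.
USER_AGENT = "my-research-bot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text that we already have."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our crawler may fetch specific (made-up) URLs.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/articles/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = parser.crawl_delay(USER_AGENT)

print(allowed, blocked, delay)
```

Against a live site, you would instead call `parser.set_url(...)` with the address of the site's robots.txt and then `parser.read()` to download and parse it.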
Keep in mind that if something is legal, it is not necessarily ethical. Your project should be legal AND ethical. It is hard to define what ethical means. Probably your colleagues follow the ethical regulations and guidelines published by professional bodies and governments in your region. If not, ask them to do so! Also, the team should agree that the goal of the project is in accordance with the members’ ethical norms. Scraping sites that require login is a shady part of the business. Imagine that your colleague thinks it is actually stealing data and harming the privacy of the users of this site. Will such a colleague build the best scraper for the task? Presumably not. So, even if you have nothing against scraping data from certain sources, accept the fact that someone may think that it is not acceptable, even if it is legal.
Furthermore, getting data from the web is not as easy as it sounds. For example, sites built with modern JavaScript technologies render their content in the browser, so they require a so-called pre-renderer or browser automation tool, like Selenium, which loads the site as a real browser would so that its content shows up.
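One rough way to find out whether a source falls into this category is to look at how much visible text the raw HTML contains before any JavaScript runs. The sketch below uses only Python's standard `html.parser`. It is a heuristic, not a reliable detector: the `probably_needs_rendering` helper, the 50-character threshold, and both sample pages are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a reader would see without running any JavaScript."""
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def probably_needs_rendering(html: str, min_chars: int = 50) -> bool:
    """Guess that a page is a JavaScript shell if it has very little text.

    The threshold is an arbitrary assumption; tune it on your own sources.
    """
    return len(visible_text(html)) < min_chars


# Two made-up pages: one server-rendered, one an empty JavaScript shell.
STATIC_PAGE = (
    "<html><body><h1>Annual report</h1>"
    "<p>Revenue grew by ten percent compared with the previous year.</p>"
    "</body></html>"
)
JS_SHELL = (
    '<html><body><div id="root"></div>'
    "<script>window.__DATA__ = {};</script>"
    '<script src="/static/app.js"></script></body></html>'
)

print(probably_needs_rendering(STATIC_PAGE))
print(probably_needs_rendering(JS_SHELL))
```

If a page looks like a JavaScript shell, plan for a browser-based tool in your scraping pipeline and budget for the extra time it takes to run.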
Last but not least, you have budgetary and time constraints too. The more ready-made a solution is, the more expensive it is, but usually the less time it requires to deliver the data. In-house solutions require hiring permanent and temporary workers. Finding the right people takes time. You can employ juniors who are willing to learn a new field, but again, this takes time. If you have enough money, start by outsourcing the tasks to reliable partners. Later you can build up your own capabilities. If you are very short of money, bring data scraping in-house and crowdsource annotation. Otherwise read on and consider the tools and options you have.
That’s all for now. If you’d like to learn more about tools used for data gathering and annotation, stay tuned. The second part of this series will come soon!
Hire us
If you face any issues during data gathering and annotation, don’t hesitate to contact us at crowintelligence@gmail.com
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.