DEV Community

Limarc Ambalina


15 Best Chatbot Datasets for Machine Learning

To resolve user inquiries quickly and without human intervention, an effective chatbot needs a large amount of training data. However, the primary bottleneck in chatbot development is obtaining realistic, task-oriented dialog data to train these machine learning-based systems.

We’ve put together the ultimate list of the best conversational datasets to train a chatbot, broken down into question-answer data, customer support data, dialogue data and multilingual data.

Question-Answer Datasets for Chatbot Training

  1. Question-Answer Dataset: This corpus includes Wikipedia articles, factoid questions manually generated from them, and manually generated answers to these questions, for use in academic research.

  2. The WikiQA Corpus: A publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. To reflect the true information needs of general users, the authors used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer.

  3. Yahoo Language Data: This page features manually curated QA datasets from Yahoo Answers.

  4. TREC QA Collection: TREC has had a question answering track since 1999. In each track, systems were asked to retrieve small snippets of text containing an answer to open-domain, closed-class questions.
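
As a rough illustration of how question and sentence pair data like the WikiQA Corpus might be turned into chatbot training pairs, the sketch below keeps only the sentences labeled as correct answers. The column names (`Question`, `Sentence`, `Label`), the tab-separated layout, and the helper `load_answer_pairs` are assumptions made for this example, and the sample rows are invented; check the dataset's own documentation before relying on them.

```python
import csv
import io

# Invented sample rows in an ASSUMED tab-separated layout; not real dataset content.
SAMPLE_TSV = (
    "Question\tSentence\tLabel\n"
    "what is a chatbot\tA chatbot is a program that simulates conversation.\t1\n"
    "what is a chatbot\tChatbots are discussed in many articles.\t0\n"
    "who develops ubuntu\tUbuntu is developed by Canonical.\t1\n"
)

def load_answer_pairs(tsv_file):
    """Return (question, answer) tuples for rows labeled as correct answers."""
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["Question"], row["Sentence"])
        for row in reader
        if row["Label"] == "1"
    ]

pairs = load_answer_pairs(io.StringIO(SAMPLE_TSV))
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```

To read from disk instead of the in-memory sample, the same function can be given a file opened with `open(path, encoding="utf-8", newline="")`.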

Customer Support Datasets for Chatbot Training

  1. Ubuntu Dialogue Corpus: Consists of almost one million two-person conversations extracted from the Ubuntu chat logs, in which users sought technical support for various Ubuntu-related problems. The full dataset contains 930,000 dialogues and over 100,000,000 words.

  2. Relational Strategies in Customer Service Dataset: A collection of travel-related customer service data from four sources: the conversation logs of three commercial customer service intelligent virtual assistants (IVAs) and the airline forums on TripAdvisor.com during August 2016.

  3. Customer Support on Twitter: This dataset on Kaggle includes over 3 million tweets and replies from the biggest brands on Twitter.

Dialogue Datasets for Chatbot Training

  1. Semantic Web Interest Group IRC Chat Logs: These automatically generated IRC chat logs are available in RDF on a daily basis going back to 2004, and include timestamps and nicknames.

  2. Cornell Movie-Dialogs Corpus: This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies.

  3. ConvAI2 Dataset: The dataset contains more than 2,000 dialogues for a PersonaChat competition, in which human evaluators recruited via the crowdsourcing platform Yandex.Toloka chatted with bots submitted by teams.

  4. Santa Barbara Corpus of Spoken American English: This dataset includes approximately 249,000 words of transcription, audio, and timestamps at the level of individual intonation units.

  5. The NPS Chat Corpus: This corpus consists of 10,567 posts out of approximately 500,000 posts gathered from various online chat services in accordance with their terms of service.

  6. Maluuba Goal-Oriented Dialogue: An open dialogue dataset in which the conversation aims at accomplishing a task or making a decision, specifically finding flights and a hotel. The dataset contains complex conversations and decision-making covering 250+ hotels, flights, and destinations.

  7. Multi-Domain Wizard-of-Oz dataset (MultiWOZ): A fully labeled collection of written conversations spanning multiple domains and topics. The dataset contains 10k dialogues, and is at least one order of magnitude larger than all previous annotated task-oriented corpora.
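
Whichever dialogue corpus is used, a common preprocessing step is to turn each conversation into (context, response) training pairs. The following is a minimal sketch, assuming a conversation has already been parsed into an ordered list of utterances. The function name `make_pairs`, the `max_context` parameter, and the sample dialogue are hypothetical and not taken from any of the datasets above.

```python
def make_pairs(utterances, max_context=2):
    """Pair each utterance (from the second onward) with up to
    max_context preceding utterances as its context."""
    pairs = []
    for i in range(1, len(utterances)):
        context = utterances[max(0, i - max_context):i]
        pairs.append((context, utterances[i]))
    return pairs

# Invented example dialogue, for illustration only.
dialogue = [
    "I need a flight to Boston.",
    "What date would you like to leave?",
    "Next Friday.",
    "I found three options for Friday.",
]

training_pairs = make_pairs(dialogue)
for context, response in training_pairs:
    print(" | ".join(context), "->", response)
```

Limiting the context window keeps inputs short; a larger `max_context` gives a model more history at the cost of longer sequences.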

Multilingual Chatbot Training Datasets

  1. NUS Corpus: This corpus was created for social media text normalization and translation. It was built by randomly selecting 2,000 messages from the NUS English SMS corpus and translating them into formal Chinese.

  2. EXCITEMENT Datasets: These datasets, available in English and Italian, contain negative feedback from customers stating their reasons for dissatisfaction with a given company.

View the original article here for links to all datasets:
https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/
