We provide related resources for end-to-end task-oriented dialogue (EToD) systems, including datasets and our resources repo.
Datasets
We list several commonly used EToD datasets below:
Modularly EToD datasets
- MultiWOZ. MultiWOZ 2.0 and 2.1 are both used for evaluation in different papers. MultiWOZ is one of the most widely used ToD datasets. It contains over 8,000 dialogue sessions across 7 domains: restaurant, hotel, attraction, taxi, train, hospital, and police.
Fully EToD datasets
- SMD. The Stanford Multi-turn Multi-domain Task-oriented Dialogue Dataset (SMD) includes three domains: navigation, weather, and calendar.
- CamRest676. CamRest676 is a relatively small-scale restaurant-domain dataset. It consists of 408/136/136 dialogues for training/validation/testing.
Other ToD dataset resources that might help EToD research
Multi-modal ToD Datasets
- SIMMC. The dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes.
- MMConv. The Multimodal Multi-domain Conversational dataset (MMConv) is a fully annotated collection of human-to-human role-playing dialogues spanning multiple domains and tasks.
Survey of Datasets for EToD
- AtmaHou/Task-Oriented-Dialogue-Research-Progress-Survey. This repo includes leaderboards for popular datasets to present research progress in the task-oriented dialogue field.
- Survey of Available Datasets for Designing Task Oriented Dialogue Agents. This paper provides a survey of available datasets for designing task-oriented dialogue agents. It also provides a detailed analysis of the datasets and their characteristics.
Metrics and Evaluation Methods
We list some common metrics used for evaluating EToD systems:
Modularly EToD Metrics
- BLEU is used to measure the fluency of the generated response by calculating n-gram overlaps between the generated response and the gold response.
- Inform and Success. Inform measures whether the system provides an appropriate entity, and Success measures whether the system answers all requested attributes.
- Combined is a comprehensive metric that considers BLEU, Inform, and Success. It is calculated as: Combined = (Inform + Success) × 0.5 + BLEU.
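The n-gram overlap behind BLEU and the Combined formula can be sketched in a few lines of Python. This is a minimal illustration, not an official evaluation script: the function names `clipped_ngram_precision` and `combined_score` are hypothetical, the inputs are assumed to be pre-tokenized, and all three scores passed to `combined_score` are assumed to be on the same 0-100 scale. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch leaves out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


def combined_score(inform, success, bleu):
    """Combined = (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


# Illustrative values only; these are not results from any paper.
generated = "the hotel is in the centre".split()
gold = "the hotel is in the north".split()
print(clipped_ngram_precision(generated, gold, 2))
print(combined_score(80.0, 70.0, 20.0))
```

For real evaluations, an established BLEU implementation should be used so that scores are comparable with published numbers.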
Fully EToD Metrics
- BLEU is used to measure the fluency of the generated response by calculating n-gram overlaps between the generated response and the gold response.
- Entity F1 is used to measure the difference between entities in the system and gold responses by micro-averaging the precision and recall.
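As a rough illustration of the micro-averaging in Entity F1, the sketch below pools true positives, false positives, and false negatives over all responses before computing precision, recall, and F1. The function name `entity_f1` and the input format (one list of already-extracted entity strings per response) are assumptions made for this example; how entities are extracted from responses differs between evaluation scripts and is not covered here.

```python
def entity_f1(predicted, gold):
    """Micro-averaged entity F1 over a corpus of responses.

    predicted, gold: lists of equal length, where each item is the list
    of entity strings found in one system or gold response.
    """
    tp = fp = fn = 0
    for pred_entities, gold_entities in zip(predicted, gold):
        pred_set, gold_set = set(pred_entities), set(gold_entities)
        tp += len(pred_set & gold_set)   # entities the system got right
        fp += len(pred_set - gold_set)   # entities only in the system response
        fn += len(gold_set - pred_set)   # gold entities the system missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative entities only; these are not taken from any dataset.
system_entities = [["pizza_hut", "centre"], ["5_miles"]]
gold_entities = [["pizza_hut", "north"], ["5_miles"]]
print(entity_f1(system_entities, gold_entities))
```

Because the counts are pooled before dividing, responses with many entities weigh more than responses with few, which is what distinguishes micro-averaging from averaging per-response F1 scores.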