MultiWOZ Corpus

The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully labeled collection of human-human written conversations spanning multiple domains and topics. At 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora.

The dataset used in the EMNLP publication can be accessed at: MultiWOZ_2.0

The dataset used in the ACL publication can be accessed at: MultiWOZ_1.0

Data Structure

There are 3,406 single-domain dialogues, which include booking where the domain allows it, and 7,032 multi-domain dialogues, each covering between 2 and 5 domains. To support reproducibility of results, the corpus was randomly split into train, test and development sets. The test and development sets contain 1k examples each. Although all dialogues are coherent, some were not finished with respect to the task description. The validation and test sets therefore contain only fully successful dialogues, which enables a fair comparison of models. There are no dialogues from the hospital and police domains in the validation and test sets.

Each dialogue consists of a goal, multiple user and system utterances, and a belief state. The natural-language task description presented to the turkers working on the visitor's side is also included. Dialogues with MUL in the name are multi-domain dialogues. Dialogues with SNG in the name are single-domain dialogues (although a booking sub-domain is possible). If the fail_book option in the goal specification is not empty, the booking might not have been possible to complete; the turkers did not know about this.
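The naming convention and the fail_book check described above can be sketched in Python. This is only an illustration under stated assumptions: the dialogue names, the goal layout (one dictionary per domain, each optionally holding a fail_book entry) and the helper names dialogue_kind and booking_may_fail are hypothetical and should be checked against the released files.

```python
def dialogue_kind(name):
    """Classify a dialogue by its name: MUL means multi-domain, SNG single-domain."""
    if "MUL" in name:
        return "multi"
    if "SNG" in name:
        return "single"
    return "unknown"


def booking_may_fail(goal):
    """Return True if any domain in the goal has a non-empty fail_book entry.

    Assumption: the goal maps domain names to dictionaries; entries that are
    not dictionaries (for example a task description) are skipped.
    """
    return any(
        isinstance(spec, dict) and bool(spec.get("fail_book"))
        for spec in goal.values()
    )


# Hypothetical, hand-written examples (not taken from the corpus).
dialogues = {
    "SNG0001.json": {
        "goal": {"restaurant": {"info": {"food": "italian"}, "fail_book": {}}},
    },
    "MUL0002.json": {
        "goal": {
            "hotel": {"book": {"stay": "2"}, "fail_book": {"stay": "5"}},
            "train": {"info": {"destination": "cambridge"}},
        },
    },
}

for name, dialogue in dialogues.items():
    print(name, dialogue_kind(name), booking_may_fail(dialogue["goal"]))
```

A non-empty fail_book only signals that the first booking attempt was set up to fail, so such dialogues may contain a retry with different booking slots.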

The belief state has three sections: semi, book and booked. Semi refers to slots from a particular domain. Book refers to the booking slots for a particular domain, and booked is a sub-list of the book dictionary holding information about the booked entity (once the booking has been made).
The turkers sometimes followed the goal incorrectly, which may result in a wrong belief state.
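To make the three sections concrete, here is a minimal Python sketch that reads one belief state and keeps only the filled slots. It assumes the belief state is a dictionary keyed by domain, that each domain holds a semi dictionary and a book dictionary with a booked list inside it, and that unfilled slots are empty strings. The function name summarize_belief_state and the example values are hypothetical.

```python
def summarize_belief_state(belief_state):
    """Collect filled semi slots, filled book slots and booked entities per domain."""
    summary = {}
    for domain, sections in belief_state.items():
        book = sections.get("book", {})
        # booked is stored inside the book dictionary as a list of entities.
        booked = book.get("booked", [])
        # Remaining book entries are booking slots; keep only the filled ones.
        book_slots = {k: v for k, v in book.items() if k != "booked" and v}
        semi_slots = {k: v for k, v in sections.get("semi", {}).items() if v}
        if semi_slots or book_slots or booked:
            summary[domain] = {
                "semi": semi_slots,
                "book": book_slots,
                "booked": booked,
            }
    return summary


# Hypothetical belief state, written by hand for illustration.
example = {
    "hotel": {
        "semi": {"area": "north", "parking": ""},
        "book": {
            "people": "2",
            "day": "",
            "booked": [{"name": "example hotel", "reference": "ABC123"}],
        },
    },
    "taxi": {"semi": {"destination": ""}, "book": {"booked": []}},
}

print(summarize_belief_state(example))
```

Domains with no filled slots and no booked entity are dropped from the summary, which makes it easier to spot turns where the recorded belief state disagrees with the goal.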


Belief Tracking

Joint Accuracy: 25.83%
Overall accuracy: 97.19%
F1 score: 85.26%

Context-to-Text Generation

60.66% 49.04% 0.184

Natural Language Generation

2.99 0.632


If you use the dataset presented here in your work, please cite the corresponding papers.

Bug Report

If you have found any bugs in the dataset, please contact: pfb30 at cam dot ac dot uk