Annotated Al Jazeera Dialectal Speech Corpus

About

This speech corpus contains dialect-level labels for 57 hours of dialectal Arabic speech (Egyptian, Levantine, North African, and Gulf) from Al Jazeera, dating from June 2014 to January 2015, as well as the confidence levels on which those labels are based. The corpus also contains 94 hours of dialectal Arabic speech that was labeled automatically by linking speaker information from the human-labeled set.
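As a rough illustration of what labeling by linking speaker information could look like, the sketch below gives each unlabeled segment the dialect most often assigned to the same speaker in the human-labeled set. This is an assumption made for illustration, not the authors' documented procedure: the function name, the speaker and segment identifiers, and the majority-vote rule are all hypothetical.

```python
from collections import Counter, defaultdict


def propagate_labels(labeled, unlabeled):
    """Assign dialects to unlabeled segments via their speaker (hypothetical sketch).

    labeled:   iterable of (speaker_id, dialect) pairs from human annotation
    unlabeled: iterable of (segment_id, speaker_id) pairs without a label
    Returns a dict {segment_id: dialect} covering only segments whose
    speaker also appears in the human-labeled set.
    """
    # Count how often each dialect was assigned to each speaker.
    votes = defaultdict(Counter)
    for speaker_id, dialect in labeled:
        votes[speaker_id][dialect] += 1

    propagated = {}
    for segment_id, speaker_id in unlabeled:
        if speaker_id in votes:
            # Use the speaker's most frequent human-assigned dialect.
            propagated[segment_id] = votes[speaker_id].most_common(1)[0][0]
    return propagated


# Made-up example data; the identifiers do not come from the corpus.
labeled = [
    ("spk1", "Egyptian"),
    ("spk1", "Egyptian"),
    ("spk1", "Gulf"),
    ("spk2", "Levantine"),
]
unlabeled = [("seg_a", "spk1"), ("seg_b", "spk2"), ("seg_c", "spk3")]
result = propagate_labels(labeled, unlabeled)
```

Segments whose speaker never appears in the human-labeled set (here `seg_c`) are left out of the result rather than guessed.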

Related publications

  • Wray, Samantha and Ali, Ahmed. 2015. Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic. Interspeech.
    @InProceedings{wrayaliclassify,
      author    = {Samantha Wray and Ahmed Ali},
      title     = {{Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic}},
      booktitle = {Interspeech},
      year      = {2015},
      note      = {(in press)}
    }

Download

License