Modern Standard Arabic Pronunciation Lexicon

This package includes a pronunciation dictionary for Modern Standard Arabic ASR. It has been used in combination with the Kaldi Gale Recipe.

Kaldi Gale Recipe

This package includes files for building an Arabic ASR system using the GALE database from LDC and the Kaldi Speech Recognition Toolkit. The test set is a mix of conversational and report speech.

QCRI Educational Domain (QED) Corpus

The QED Corpus is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated over the AMARA web-based platform. The current release, QED Corpus v1.4, contains 20 languages distributed over 44,620 files.

Annotated Al Jazeera Dialectal Speech Corpus

This corpus contains speech from Al Jazeera with both human-annotated and automatically assigned labels for MSA and four major dialect groups (Egyptian, Levantine, North African, and Gulf).

Arabic Fact-Checking and Stance Detection Corpus

This is a novel Arabic corpus that unifies stance detection, stance rationale, relevant document retrieval, and fact checking. The corpus contains 422 claims made about the war in Syria and related Middle East political issues, where each claim is labeled for factuality, indicating whether it is True or False. The corpus also contains 3,042 articles retrieved for these claims, where each claim-article pair is annotated for stance, indicating whether the article agrees with, disagrees with, discusses, or is unrelated to the claim. The corpus also points to which sentence(s) from the articles correspond to the stance rationale. This is the first corpus to offer such a combination.

Bilingual Corpus of Parallel Tweets

A collection of parallel Arabic-English tweets and an additional list of Twitter accounts that post parallel tweets.