Papers
2021
HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion
IndoNLI: A Natural Language Inference Dataset for Indonesian
Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs
DanFEVER: claim verification dataset for Danish
MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering
Universal Joy: A Data Set and Results for Classifying Emotions Across Languages
ParsiNLU: A Suite of Language Understanding Challenges for Persian
Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19
StoryDB: Broad Multi-language Narrative Dataset
MFAQ: a Multilingual FAQ Dataset
Fine-grained Named Entity Annotation for Finnish
A Dataset and Baselines for Multilingual Reply Suggestion
A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization
Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages
MasakhaNER: Named Entity Recognition for African Languages
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Multilingual Entity and Relation Extraction Dataset and Model
A New Dataset and Efficient Baselines for Document-level Text Simplification in German
Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer
XOR QA: Cross-lingual Open-Retrieval Question Answering
Never guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baseline
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization
X-Fact: A New Benchmark Dataset for Multilingual Fact Checking
Models and Datasets for Cross-Lingual Summarisation
GerDaLIR: A German Dataset for Legal Information Retrieval
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network
KLUE: Korean Language Understanding Evaluation
Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark
Hope Speech detection in under-resourced Kannada language
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset
2020
Liputan6: A Large-scale Indonesian Dataset for Text Summarization
K-SNACS: Annotating Korean Adposition Semantics
MLQA: Evaluating Cross-lingual Extractive Question Answering
Multilingual Culture-Independent Word Analogy Datasets
The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages
An Annotated Dataset of Discourse Modes in Hindi Stories
A Dataset for Multi-lingual Epidemiological Event Extraction
A Multilingual Parallel Corpora Collection Effort for Indian Languages
FQuAD: French Question Answering Dataset
TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
From Web Crawl to Clean Register-Annotated Corpora
TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains
KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi
KLEJ: Comprehensive Benchmark for Polish Language Understanding
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark
MLSUM: The Multilingual Summarization Corpus
A New Dataset for Natural Language Inference from Code-mixed Conversations
Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment
OCNLI: Original Chinese Natural Language Inference
Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis
Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Books of Hours. The First Liturgical Data Set for Text Segmentation
COSTRA 1.0: A Dataset of Complex Sentence Transformations
The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque
Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
CLUE: A Chinese Language Understanding Evaluation Benchmark
A Summarization Dataset of Slovak News Articles
2019
KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension
HEAD-QA: A Healthcare Dataset for Complex Reasoning
A Turkish Dataset for Gender Identification of Twitter Users
On the Cross-lingual Transferability of Monolingual Representations
XQA: A Cross-lingual Open-domain Question Answering Dataset
ChID: A Large-scale Chinese IDiom Dataset for Cloze Test
Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension
A Span-Extraction Dataset for Chinese Machine Reading Comprehension
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
2018
Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers
MGAD: Multilingual Generation of Analogy Datasets
XNLI: Evaluating Cross-lingual Sentence Representations
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank
LIdioms: A Multilingual Linked Idioms Data Set
Semi-supervised Training Data Generation for Multilingual Question Answering
Multi-Dialect Arabic POS Tagging: A CRF Approach
A Multilingual Wikified Data Set of Educational Material
SemEval-2018 Task 1: Affect in Tweets
2017
ACTSA: Annotated Corpus for Telugu Sentiment Analysis
Cross-lingual Name Tagging and Linking for 282 Languages
2016
PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors
An Open Corpus for Named Entity Recognition in Historic Newspapers
2015
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets
A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining
2014
Developing Text Resources for Ten South African Languages
Multilingual corpora with coreferential annotation of person entities
2013
Universal Dependency Annotation for Multilingual Parsing
2011
Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)
2010
Cross-Language Text Classification using Structural Correspondence Learning
2008
AnCora: Multilevel Annotated Corpora for Catalan and Spanish