Papers

2021

HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion

IndoNLI: A Natural Language Inference Dataset for Indonesian

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs

DanFEVER: claim verification dataset for Danish

MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering

Universal Joy: A Data Set and Results for Classifying Emotions Across Languages

ParsiNLU: A Suite of Language Understanding Challenges for Persian

Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

StoryDB: Broad Multi-language Narrative Dataset

MFAQ: a Multilingual FAQ Dataset

Fine-grained Named Entity Annotation for Finnish

A Dataset and Baselines for Multilingual Reply Suggestion

A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization

Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

MasakhaNER: Named Entity Recognition for African Languages

XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

Multilingual Entity and Relation Extraction Dataset and Model

A New Dataset and Efficient Baselines for Document-level Text Simplification in German

Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer

XOR QA: Cross-lingual Open-Retrieval Question Answering

Never guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baseline

Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels

MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization

X-Fact: A New Benchmark Dataset for Multilingual Fact Checking

Models and Datasets for Cross-Lingual Summarisation

GerDaLIR: A German Dataset for Legal Information Retrieval

Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada

MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network

KLUE: Korean Language Understanding Evaluation

Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark

MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

Hope Speech detection in under-resourced Kannada language

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

2020

Liputan6: A Large-scale Indonesian Dataset for Text Summarization

K-SNACS: Annotating Korean Adposition Semantics

MLQA: Evaluating Cross-lingual Extractive Question Answering

Multilingual Culture-Independent Word Analogy Datasets

The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages

An Annotated Dataset of Discourse Modes in Hindi Stories

A Dataset for Multi-lingual Epidemiological Event Extraction

A Multilingual Parallel Corpora Collection Effort for Indian Languages

FQuAD: French Question Answering Dataset

TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

From Web Crawl to Clean Register-Annotated Corpora

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains

KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi

KLEJ: Comprehensive Benchmark for Polish Language Understanding

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

CoDEx: A Comprehensive Knowledge Graph Completion Benchmark

MLSUM: The Multilingual Summarization Corpus

A New Dataset for Natural Language Inference from Code-mixed Conversations

Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment

OCNLI: Original Chinese Natural Language Inference

EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering

Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis

Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus

XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning

Books of Hours. The First Liturgical Data Set for Text Segmentation.

The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation

COSTRA 1.0: A Dataset of Complex Sentence Transformations

The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation

XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque

CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

CLUE: A Chinese Language Understanding Evaluation Benchmark

A Summarization Dataset of Slovak News Articles

2019

KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

HEAD-QA: A Healthcare Dataset for Complex Reasoning

A Turkish Dataset for Gender Identification of Twitter Users

On the Cross-lingual Transferability of Monolingual Representations

XQA: A Cross-lingual Open-domain Question Answering Dataset

ChID: A Large-scale Chinese IDiom Dataset for Cloze Test

Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish

Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension

Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.

A Span-Extraction Dataset for Chinese Machine Reading Comprehension

PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

2018

Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers

MGAD: Multilingual Generation of Analogy Datasets

XNLI: Evaluating Cross-lingual Sentence Representations

Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank

LIdioms: A Multilingual Linked Idioms Data Set

Semi-supervised Training Data Generation for Multilingual Question Answering

Multi-Dialect Arabic POS Tagging: A CRF Approach

A Multilingual Wikified Data Set of Educational Material

SemEval-2018 Task 1: Affect in Tweets

2017

ACTSA: Annotated Corpus for Telugu Sentiment Analysis

Cross-lingual Name Tagging and Linking for 282 Languages

2016

PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors

An Open Corpus for Named Entity Recognition in Historic Newspapers

SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines

2015

A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets

A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining

2014

Developing Text Resources for Ten South African Languages

PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone

Multilingual corpora with coreferential annotation of person entities

2013

Universal Dependency Annotation for Multilingual Parsing

OntoNotes

2011

Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

2010

Cross-Language Text Classification using Structural Correspondence Learning

2008

AnCora: Multilevel Annotated Corpora for Catalan and Spanish