Dataset Database

dataset nametitlelink to paperdata linkmotivation of the paper writer (how they were originally intended)task typehas train data?“data size (rough avg # of examples PER language, excluding english)”input data sourcecrowdsource platforms / background (if any)original languageinput data - automatic processingtranslationlabel sourcelabel language (at collection time / language used by annotators)languagepublication yearpublished venuereusing existing datasets?who created the dataset?# citationin_huggingfacedataset released?  
A New Dataset for Natural Language Inference from Code-mixed ConversationsA New Dataset for Natural Language Inference from Code-mixed Conversationshttps://arxiv.org/pdf/2004.05051.pdf task-oriented (multilingual)sentence pair tasknot mentioned100~1000“collected from curated source (exams, scientific papers, etc)”not mentionedits own languagen/an/acrowdsourcedits own languageen hi2020LRECNOcombination of university and industry0NONO  
MultiHumES: Multilingual Humanitarian Dataset for Extractive SummarizationMultiHumES: Multilingual Humanitarian Dataset for Extractive Summarizationhttps://aclanthology.org/2021.eacl-main.146.pdf task-oriented (multilingual)summarizationNO>10kcollected from media (news)”"”humanitarian experts”””its own languagen/an/a“annotated (authors, linguists)”in its own languageen fr es2021EACLNOcombination of university and industry0NONO  
A Multilingual Wikified Data Set of Educational MaterialA Multilingual Wikified Data Set of Educational Materialhttps://aclanthology.org/L18-1073.pdfNot availablecross-lingual transferclassification (non-sentiment analysis)not mentioned1000~10k“collected from curated source (exams, scientific papers, etc)”crowdflowerits own languagealignmentautomatic translationcrowdsourcedin its own languagebg cs de el hr it nl pl pt ru zh2018LRECNOcombination of university and industry0NONO  
Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)https://aclanthology.org/Y11-1038.pdfhttp://linguistics.hss.ntu.edu.sg/ResearchinLMS/Pages/NTUMultilingualCorpuscross-lingual transferstructured predictionnot mentioned1000~10kcollected from webn/aits own languagen/an/a“annotated (authors, linguists)”in its own languageen zh ja ko id vi2011“Pacific Asia Conference on Language, Information and Computation”NOuniversity51NONO  
XNLIXNLI: Evaluating Cross-lingual Sentence Representationshttps://arxiv.org/pdf/1809.05053.pdfhttps://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zipcross-lingual transfersentence pair taskNO1000~10Kcrowdsourcedgethybrid.ioEnglishalignmentcrowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglishen fr es de el bg ru tr ar vi th zh hi sw ur2018EMNLPYES (English)industry502YESYES  
PAWS-XPAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identificationhttps://arxiv.org/pdf/1908.11828.pdfhttps://github.com/google-research-datasets/pawscross-lingual transfersentence pair taskNO1000~10Kcrowdsourcednot mentioned (maybe google internal according to the acknowledgements)Englishn/aautomatic translation & crowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglishfr es de zh ja ko2019EMNLPYES (English)industry91YESYES  
MLSUMMLSUM: The Multilingual Summarization Corpushttps://arxiv.org/pdf/2004.14900v1.pdfhttps://github.com/huggingface/datasets/tree/master/datasets/mlsumtask-oriented (multilingual)summarizationYES>10Kcollected from media (news)n/aits own languagealignmentn/aautomatically inducedin its own languagefr de es ru tr2020EMNLPNOuniversity20YESYES  
XL-WiCXL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization - ACL Anthologyhttps://aclanthology.org/2020.emnlp-main.584.pdfhttps://pilehvar.github.io/xlwic/cross-lingual transferotherPARTIAL1000~10Kcurated linguistic resourcesn/aits own languagen/an/a“derived from linguistic resources (wordnet, etc)”in its own languagebg da de et fa fr hr it ja ko nl zh en2020EMNLPYES (English & other language)university14YESYES  
MLQAMLQA: Evaluating Cross-lingual Extractive Question Answeringhttps://arxiv.org/abs/1910.07475https://github.com/facebookresearch/MLQAcross-lingual transfermachine reading comprehensionNO1000~10Kcrowdsourced & collected from WikipediaamtEnglishalignmentcrowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedin its own languageen de es ar zh vi hi2020ACLYES (English & other language) 150YESYES  
XQuADOn the Cross-lingual Transferability of Monolingual Representationshttps://arxiv.org/abs/1910.11856https://github.com/deepmind/xquadcross-lingual transfermachine reading comprehensionNO1000~10Kcrowdsourced & collected from WikipediaamtEnglishn/acrowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglishen es de el ru tr ar vi th zh hi2019EMNLPYES (English)industry218YESYES  
TyDi QATyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languageshttps://arxiv.org/abs/2003.05002https://ai.google.com/research/tydiqa/task-oriented (multilingual)machine reading comprehensionYES>10Kcrowdsourcednot mentionedits own languagen/an/acrowdsourcedits own languagear bn fi id ja ki ko ru te th2020TACLNOindustry118YESYES  
XOR QAXOR QA: Cross-lingual Open-Retrieval Question Answeringhttps://arxiv.org/pdf/2010.11856.pdfhttps://nlp.cs.washington.edu/xorqa/task-oriented (multilingual)QA + IRYES>10Kcrowdsourcedamt (and maybe undergrad students)its own languagen/acrowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglishar bn fi ja ko te ru2021NAACLYES (other language)combination of university and industry13YESYES  
XQAXQA: A Cross-lingual Open-domain Question Answering Dataset - ACL Anthologyhttps://aclanthology.org/P19-1227/http://github.com/thunlp/XQAtask-oriented (multilingual)QA + IRPARTIAL1000~10Ktemplate-basedn/aits own languagen/an/aautomatically inducedits own languageen ch fr de po pt ru ta uk2019ACLNOuniversity41NOYES  
MKQAMKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answeringhttps://arxiv.org/abs/2007.15207https://github.com/apple/ml-mkqacross-lingual transferQA + IRNO1000~10KcrowdsourcedtryratingEnglishn/acrowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglishar da de en es fi fr he hu it ja ko km ms nl no pl pt ru sv th tr vi zh2021TACLYES (English)industry19YESYES  
POS-tagged Arabic tweets for four dialectMulti-Dialect Arabic POS Tagging: A CRF Approachhttp://www.lrec-conf.org/proceedings/lrec2018/pdf/562.pdfhttps://huggingface.co/datasets/arabic_pos_dialecttask-oriented (target language)sequence taggingYES100~1000collected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languagear egy lev glf mgr2018LRECNOcombination of university and industry25YESYES  
WikiANNCross-lingual Name Tagging and Linking for 282 Languageshttps://www.aclweb.org/anthology/P17-1178https://huggingface.co/datasets/wikianntask-oriented (multilingual)sequence taggingYES1000~10Kcollected from Wikipedian/aEnglishalignedautomatic translationautomatically inducedEnglishace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh2017ACLNOuniversity71YESYES  
MFAQMFAQ: a Multilingual FAQ Datasethttps://arxiv.org/abs/2109.12870https://huggingface.co/datasets/clips/mfaqtask-oriented (multilingual)machine reading comprehensionYES>10Kcollected from webn/aits own languagen/an/aautomatically inducedin its own languagecs da de en es fi fr he hr hu id it nl no pl pt ro ru sv tr vi2021EMNLPNOindustry0YESYES  
XL-SumXL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languageshttps://aclanthology.org/2021.findings-acl.413/https://github.com/csebuetnlp/xl-sumtask-oriented (multilingual)summarizationYES>10Kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languageam ar az bn my zh en fr gu ha hi ig id ja rn ko ky mr ne om ps fa pcm pt pa ru gd sr si so es sw ta te th ti tr uk ur uz vi cy yo2021ACLNOcombination of university and industry3YESYES  
OntoNotes 5.0OntoNoteshttps://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdfhttps://catalog.ldc.upenn.edu/LDC2013T19task-oriented (multilingual)structured predictionYES>10Kcollected from media (news)n/aits own languagen/an/a“annotated (authors, linguists)”in its own languageen cm ar zh2013N/ANOuniversityN/ANOYES  
euronewsAn Open Corpus for Named Entity Recognition in Historic Newspapershttps://aclanthology.org/L16-1689.pdfhttps://github.com/EuropeanaNewspapers/ner-corporatask-oriented (multilingual)sequence taggingYES>10Kcollected from media (news)n/aits own languagen/an/acrowdsourcedin its own languagede fr nl2016LRECNOindustry21YESYES  
examsEXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answeringhttps://arxiv.org/pdf/2011.03080.pdfhttps://github.com/mhardalov/exams-qatask-oriented (multilingual)machine reading comprehensionYES1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/aautomatically inducedin its own languagear bg de es fr hr hu it lt mk pl pt sq sr tr vi2020EMNLPNOuniversity6YESYES  
hope_edi“HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion”https://aclanthology.org/2020.peoples-1.5/https://github.com/huggingface/datasets/blob/master/datasets/hope_edi/README.mdtask-oriented (multilingual)classification (sentiment analysis)YES>10Kcollected from social media or commercial sourcesgoogle formits own languagen/an/acrowdsourcedits own languageen ml ta2021EACLNOuniversity40YESYES  
kan_hopeHope Speech detection in under-resourced Kannada languagehttps://arxiv.org/abs/2108.04616#https://github.com/adeepH/kan_hopetask-oriented (target language)classification (sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedits own languageen kn2021 NOuniversity3YESYES  
masakhanerMasakhaNER: Named Entity Recognition for African Languageshttps://arxiv.org/pdf/2103.11811.pdfhttps://github.com/masakhane-io/masakhane-ner/task-oriented (multilingual)sequence taggingYES>10Kcollected from media (news)n/aits own languagen/an/acrowdsourcedits own languageam ha ig rw lg luo pcm sw wo yo2021TACLNOcombination of university and industry1YESYES  
multi_eurlexMultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transferhttps://arxiv.org/abs/2109.00904https://github.com/huggingface/datasets/blob/master/datasets/multi_eurlex/README.mdcross-lingual transferclassification (non-sentiment analysis)NO>10Kcollected from curated sourcen/aits own languagen/an/a“annotated (authors, linguists)”in its own languageen da de nl sv bg cs hr pl sk sl es fr it p ro et fi hu lt lv el mt2021EMNLPNOcombination of university and industry1YESYES  
nchltDeveloping Text Resources for Ten South African Languageshttp://www.lrec-conf.org/proceedings/lrec2014/pdf/1151_Paper.pdfhttps://repo.sadilar.org/handle/20.500.12185/7/discover?filtertype_0=database&filtertype_1=title&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Monolingual+Text+Corpora%3A+Annotated&filtertype=project&filter_relational_operator=equals&filter=NCHLT+Text+IItask-oriented (multilingual)sequence taggingYES>10kcollected from web & collected from curated sourcen/aits own languagen/an/aautomatically inducedin its own languageaf nr nso ss tn ts ve xh zu2014LRECNOuniversity68YESYES  
offenseval_dravidian“Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada”https://aclanthology.org/2021.dravidianlangtech-1.17.pdfN/Atask-oriented (multilingual)classification (sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/aautomatically inducedin its own languageen kn ml ta2021EACLNOuniversity3YESYES  
sem_eval_2018_task_1SemEval-2018 Task 1: Affect in Tweetshttp://saifmohammad.com/WebDocs/semeval2018-task1.pdfhttps://competitions.codalab.org/competitions/20948task-oriented (multilingual)classification (non-sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languageen ar es2018*ACL WorkshopNOuniversity427YESYES  
stsb_multi_mtN/AN/Ahttps://github.com/PhilipMay/stsb-multi-mttask-oriented (multilingual)classification (non-sentiment analysis)YES1000~10Kcollected from media (news)n/aEnglishn/aautomatic translationautomatically inducedEnglishen de es fr it nl pl pt ru zh2021N/AYES (English)industryN/AYESYES  
20MinutenA New Dataset and Efficient Baselines for Document-level Text Simplification in Germanhttps://aclanthology.org/2021.newsum-1.16/https://github.com/ZurichNLP/20Minutentask-oriented (target language)sentence-level-generation taskYES>10kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagede2021EMNLPNOuniversity0NOYES  
XFORMAL“Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer”https://aclanthology.org/2021.naacl-main.256.pdfhttps://github.com/Elbria/xformal-FoSTtask-oriented (multilingual)sentence-level-generation taskNO1000~10kcollected from webn/aits own languagen/an/a“annotated (authors, linguists)”in its own languagept fr it2021NAACLNOcombination of university and industry4NOYES yes
Mr. TyDiMr. TyDi: A Multi-lingual Benchmark for Dense Retrievalhttps://arxiv.org/abs/2108.08787https://github.com/castorini/mr.tyditask-oriented (multilingual)machine reading comprehensionYES1000~10kcrowdsourcednot mentioned (from tydi)its own languagen/an/acrowdsourcedin its own languagear bn en fi id ja ko ru sw te th2021*ACL WorkshopYES (English & other language)university0NOYES  
XWikisModels and Datasets for Cross-Lingual Summarisationhttps://aclanthology.org/2021.emnlp-main.742/https://github.com/lauhaide/clads/blob/main/fairseq2020/examples/clads/README.mdtask-oriented (multilingual)summarizationYES>10kcollected from Wikipedian/aits own languagen/an/aautomatically inducedin its own languagecs fr en de2021EMNLPNOuniversity0NOYES  
MultiHumESMultiHumES: Multilingual Humanitarian Dataset for Extractive Summarizationhttps://aclanthology.org/2021.eacl-main.146.pdfhttps://deephelp.zendesk.com/hc/en-us/sections/360011925552-MultiHumEStask-oriented (multilingual)summarizationYES1000~10k“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/a“annotated (authors, linguists)”in its own languageen fr es2021EACLNOcombination of university and industry0NOYES  
SMiLERMultilingual Entity and Relation Extraction Dataset and Modelhttps://aclanthology.org/2021.eacl-main.166.pdfhttps://github.com/samsungnlp/smiler/task-oriented (multilingual)classification (non-sentiment analysis)YES>10kcollected from Wikipedian/aEnglishn/aautomatic translationautomatically inducedin its own languageit fr de pt es ko2021EACLNOcombination of university and industry0NOYES  
swiss_judgment_predictionSwiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmarkhttps://aclanthology.org/2021.nllp-1.3.pdfhttps://github.com/JoelNiklaus/SwissCourtRulingCorpustask-oriented (multilingual)classification (non-sentiment analysis)YES>10kcollected from curated sourcesn/aits own languagen/an/acrowdsourcedEnglishde fr it2021*ACL WorkshopNouniversity0YESYES  
tamilmixsentimentCorpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Texthttps://aclanthology.org/2020.sltu-1.28/https://dravidian-codemix.github.io/2020/index.htmltask-oriented (target language)classification (sentiment analysis)YES>10kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languageen ta2020*ACL WorkshopNouniversity106YESYES  
C^3Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehensionhttps://arxiv.org/pdf/1904.09679https://dataset.org/c3/task-oriented (target language)machine reading comprehensionYES>10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagealignmentn/aautomatically inducedin its own languagezh2019TACLNOcombination of university and industry20YESYES* in huggingface as part of clue 
ChIDChID: A Large-scale Chinese IDiom Dataset for Cloze Testhttps://aclanthology.org/P19-1075.pdfhttps://drive.google.com/drive/folders/1qdcMgCuK9d93vLVYJRvaSLunHUsGf50u?usp=sharingtask-oriented (target language)machine reading comprehensionYES>10Kcollected from media (news) & collected from webn/aits own languageword embedding similarity scoresn/aautomatically inducedin its own languagezh2019ACLYES (other language)university33YESYES* in huggingface as part of clue 
CLUE - IFLYTEK Long Text classificationCLUE: A Chinese Language Understanding Evaluation Benchmarkhttps://arxiv.org/abs/2004.05986https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zipmulti-task (target language)classification (sentiment analysis)YES>10Kcollected from social media or commercial sourcesn/aits own languagen/an/anot mentionedin its own languagezh2020COLINGNOindustry59NOYES  
CLUE - Ant Financial Question Matching CorpusCLUE: A Chinese Language Understanding Evaluation Benchmarkhttps://arxiv.org/abs/2004.05986https://storage.googleapis.com/cluebenchmark/tasks/afqmc_public.zipmulti-task (target language)sentence pair taskYES>10Kcollected from social media or commercial sourcesn/aits own languagen/an/anot mentionedin its own languagezh2020COLINGYES (other language)individual researchers59YESYES  
CLUE - Chinese Scientific LiteratureCLUE: A Chinese Language Understanding Evaluation Benchmarkhttps://arxiv.org/abs/2004.05986https://storage.googleapis.com/cluebenchmark/tasks/csl_public.zipmulti-task (target language)sentence pair taskYES>10Kcurated linguistic resourcesn/aits own languagetf-idf generatedn/aautomatically inducedin its own languagezh2020COLINGNOindividual researchers59NOYES* citation & published venue is clue’s because the dataset itself wasn’t published 
CLUE - CLUEWSC 2020CLUE: A Chinese Language Understanding Evaluation Benchmarkhttps://arxiv.org/abs/2004.05986https://storage.googleapis.com/cluebenchmark/tasks/cluewsc2020_public.zipmulti-task (target language)structured predictionYES1000~10Kcurated linguistic resourcesn/aits own languagen/an/a“annotated (authors, linguists)”in its own languagezh2020COLINGNOindividual researchers59NOYES* citation & published venue is clue’s because the dataset itself wasn’t published 
CLUE - Toutiao Short Text Classificaiton for NewsCLUE: A Chinese Language Understanding Evaluation Benchmarkhttps://arxiv.org/abs/2004.05986https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zipmulti-task (target language)classification (non-sentiment analysis)YES>10Kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagezh2020COLINGNOindividual researchers59NOYES* citation & published venue is clue’s because the dataset itself wasn’t published 
CMRC 2018A Span-Extraction Dataset for Chinese Machine Reading Comprehensionhttps://aclanthology.org/D19-1600.pdfhttps://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9cetask-oriented (target language)machine reading comprehensionYES>10Kcollected from Wikipedian/aits own languagen/an/a“annotated (authors, linguists)”in its own languagezh2019EMNLPNOcombination of university and industry91YESYES  
FQuAD 1.1FQuAD: French Question Answering Datasethttps://aclanthology.org/2020.findings-emnlp.107/https://fquad.illuin.tech/task-oriented (target language)machine reading comprehensionYES>10Kcrowdsourced & collected from Wikipediauniversity studentsits own languagen/an/acrowdsourcedin its own languagefr2020FindingsNOcombination of university and industry25YESYES  
CLSCross-Language Text Classification using Structural Correspondence Learninghttps://aclanthology.org/P10-1114.pdfhttps://github.com/getalp/Flaubert/tree/master/fluetask-oriented (multilingual)classification (sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/aautomatic translationautomatically inducedin its own languagefr de en ja2010ACLNOuniversity291NOYES  
IndoNLIIndoNLI: A Natural Language Inference Dataset for Indonesianhttps://arxiv.org/pdf/2110.14566.pdfhttps://github.com/ir-nlp-csui/indonli/tree/main/datatask-oriented (target language)sentence pair taskYES>10Kcollected from Wikipedia & collected from web & curated linguistic resourcesuniversity studentsits own languagen/an/a“crowdsourced & annotated (authors, linguists)”in its own languageid2021EMNLPNOcombination of university and industry0NOYES  
K-QuADSemi-supervised Training Data Generation for Multilingual Question Answeringhttps://aclanthology.org/L18-1437.pdfhttps://github.com/Di-lab-Yonsei/K-QuADtask-oriented (target language)machine reading comprehensionNO>10Kcrowdsourced & collected from Wikipedianot mentionedEnglish & its own languagen/aautomatic translation & crowdsourced translation (incl. Gengo / One Hour Translation)crowdsourcedEnglish & in its own languageko2018LRECYES (English)university26NOYES  
KLUE - DPKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000071/data/klue-dp-v1.1.tar.gzmulti-task (target language)structured predictionYES>10Kcollected from web & collected from social media or commercial sourcesdeepnaturalits own languagen/an/acrowdsourced & automatically inducedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - DST (WoS)KLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000073/data/wos-v1.1.tar.gzmulti-task (target language)structured predictionYES1000~10Kcrowdsourcednot mentionedits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - MRCKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000072/data/klue-mrc-v1.1.tar.gzmulti-task (target language)machine reading comprehensionYES>10Kcollected from media (news) & collected from Wikipedia & crowdsourcedselectstarits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - NERKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000069/data/klue-ner-v1.1.tar.gzmulti-task (target language)sequence taggingYES>10Kcollected from webdeepnaturalits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - NLIKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000068/data/klue-nli-v1.1.tar.gzmulti-task (target language)sentence pair taskYES>10Kcollected from web & collected from Wikipedia & collected from social media or commercial sources & collected from media (news)selectstarits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - REKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000070/data/klue-re-v1.1.tar.gzmulti-task (target language)classification (non-sentiment analysis)YES>10Kcollected from web & collected from Wikipedia & collected from media (news)deepnaturalits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackNOcombination of university and industry19NOYES  
KLUE - STSKLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000067/data/klue-sts-v1.1.tar.gzmulti-task (target language)sentence pair taskYES>10Kcollected from web & collected from Wikipedia & collected from media (news)selectstarits own languageRTT & greedy sentence matchingn/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackYES (other language)combination of university and industry19NOYES  
KLUE - TC(YNAT)KLUE: Korean Language Understanding Evaluationhttps://arxiv.org/pdf/2105.09680.pdfhttps://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000066/data/ynat-v1.1.tar.gzmulti-task (target language)classification (sentiment analysis)YES>10Kcollected from media (news)selectstarits own languagen/an/acrowdsourcedin its own languageko2021NeurIPS Datasets and Benchmarks TrackYES (other language)combination of university and industry19NOYES  
KorQuAD1.0KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehensionhttps://arxiv.org/pdf/1909.07005.pdfhttps://korquad.github.io/task-oriented (target language)machine reading comprehensionYES>10Kcollected from Wikipedia & crowdsourcednot mentionedits own languagen/an/acrowdsourcedin its own languageko2019arxivNOindustry39YESYES  
OCNLIOCNLI: Original Chinese Natural Language Inferencehttps://arxiv.org/pdf/2010.05444.pdfhttps://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.ziptask-oriented (target language)sentence pair taskYES>10K“collected from media (news) & collected from curated source (exams, scientific papers, etc) & curated linguistic resources”university studentsits own languagen/an/acrowdsourcedin its own languagezh2020FindingsNOcombination of university and industry21NOYES  
ParsiNLU - Multiple Choice QAParsiNLU: A Suite of Language Understanding Challenges for Persianhttps://arxiv.org/pdf/2012.06154.pdfhttps://github.com/persiannlp/parsinlu/tree/master/data/multiple-choicemulti-task (target language)QA + IRYES1000~10K“collected from curated source (exams, scientific papers, etc)”native speakersits own languagen/an/acrowdsourcedin its own languagefa2021TACLNOcombination of university and industry3NOYES  
ParsiNLU - Query ParaphrasingParsiNLU: A Suite of Language Understanding Challenges for Persianhttps://arxiv.org/pdf/2012.06154.pdfhttps://github.com/persiannlp/parsinlu/tree/master/data/qqpmulti-task (target language)sentence pair taskYES1000~10Kcollected from webnative speakersits own languagegoogle auto completeautomatic translation & expert translationcrowdsourcedin its own languagefa2021TACLYES (English)combination of university and industry3YESYES  
ParsiNLU - Reading ComprehensionParsiNLU: A Suite of Language Understanding Challenges for Persianhttps://arxiv.org/pdf/2012.06154.pdfhttps://github.com/persiannlp/parsinlu/tree/master/data/reading_comprehensionmulti-task (target language)machine reading comprehensionYES1000~10Kcollected from webnative speakersits own languagen/an/acrowdsourcedin its own languagefa2021TACLNOcombination of university and industry3YESYES  
ParsiNLU - Sentiment AnalysisParsiNLU: A Suite of Language Understanding Challenges for Persianhttps://arxiv.org/pdf/2012.06154.pdfhttps://github.com/persiannlp/parsinlu/tree/master/data/sentiment-analysissmulti-task (target language)classification (sentiment analysis)YES1000~10Kcollected from social media or commercial sources & collected from webnative speakersits own languagen/an/acrowdsourcedin its own languagefa2021TACLNOcombination of university and industry3YESYES  
ParsiNLU - Textual EntailmentParsiNLU: A Suite of Language Understanding Challenges for Persianhttps://arxiv.org/pdf/2012.06154.pdfhttps://github.com/persiannlp/parsinlu/tree/master/data/entailmentmulti-task (target language)sentence pair taskYES1000~10Kcollected from Wikipedia& collected from web & curated linguistic resourcesnative speakersits own languagen/an/acrowdsourcedin its own languagefa2021TACLYES (English)combination of university and industry3YESYES  
XGLUE - NTG“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transfersentence-level-generation taskPARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden de fr es ru2020EMNLPNOindustry57YESYES  
XGLUE - QG“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transfersentence-level-generation taskPARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden fr de es it pt2020EMNLPNOindustry57YESYES  
XGLUE - QAM“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transfersentence pair taskPARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden fr de2020EMNLPNOindustry57YESYES  
XGLUE - WPR“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transferclassification (sentiment analysis)PARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden de fr es it pt zh2020EMNLPNOindustry57YESYES  
XGLUE - QADSM“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transferclassification (sentiment analysis)PARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden fr de2020EMNLPNOindustry57YESYES  
XGLUE - NC“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transferclassification (sentiment analysis)PARTIAL>10Kcollected from webn/anot mentionednot mentionednot clear whether translation is usednot mentionednot mentioneden es de fr ru2020EMNLPNOindustry57YESYES  
XGLUE - POS Tagging“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transferstructured predictionPARTIAL1000~10Kcurated linguistic resourcesn/anot mentionednot mentionedn/anot mentionednot mentionedar bg de el en es fr hi it nl pl pt ru th tr ur vi zh2020EMNLPNOindustry57YESYES  
XGLUE - NER“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”https://arxiv.org/pdf/2004.01401.pdfhttps://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gzcross-lingual transfersequence taggingPARTIAL1000~10Kcollected from media (news)people from universityits own languagen/an/acrowdsourcedin its own languageen de es dl2020EMNLPYES (English & other language)industry57YESYES  
negationminpairsA Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairshttps://aclanthology.org/2021.conll-1.19.pdfhttps://github.com/mahartmann/negationminpairstask-oriented (multilingual)classification (sentiment analysis)YES1000~10Kcrowdsourced“annotated by native speakers (except english), xnli: gethybrid.io”its own languagealignmentautomatic translation“annotated (authors, linguists)”in its own languageen bg de fr zh2021CoNLLYES (English & other languages)university0NOYES  
malayammixsentimentA Sentiment Analysis Dataset for Code-Mixed Malayalam-Englishhttps://arxiv.org/pdf/2006.00210v1.pdfhttps://github.com/bharathichezhiyan/MalayalamMixSentimenttask-oriented (target language)classification (sentiment analysis)YES>10kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languageen ml2020*ACL WorkshopNOuniversity48YESYES  
TUNIZIIntroducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysishttps://arxiv.org/pdf/2004.14303v1.pdfhttps://github.com/chaymafourati/TUNIZI-Sentiment-Analysis-Tunisian-Arabizi-Datasettask-oriented (target language)classification (sentiment analysis)NO1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languagear2020ICLRNOindustry8NOYES  
CoDExCoDEx: A Comprehensive Knowledge Graph Completion Benchmarkhttps://arxiv.org/pdf/2009.07810.pdfhttps://github.com/tsafavi/codextask-oriented (multilingual)classification (non-sentiment analysis)YES>10kcollected from Wikipedian/aits own languagen/an/a“annotated (authors, linguists) & automatically induced”Englishar de en es ru zh2020EMNLPNOuniversity16NOYES  
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse BankMultilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bankhttp://www.lrec-conf.org/proceedings/lrec2018/pdf/141.pdfhttps://github.com/MurathanKurfali/Ted-MDB-Annotationstask-oriented (multilingual)structured predictionNO1000~10Kcollected from webn/aits own languagen/aexpert translation“annotated (authors, linguists)”in its own languageen de pl pt ru tr2018LRECNOuniversity18NOYES  
TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED TalksTED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talkshttps://aclanthology.org/2020.emnlp-main.223.pdfhttps://github.com/wanqiulong0923/TED-CDBtask-oriented (target language)structured predictionNO>10kcollected from webn/abothn/aexpert translation“annotated (authors, linguists)”in its own languagezh2020EMNLPNOuniversity0NOYES  
A Dataset for Multi-lingual Epidemiological Event ExtractionA Dataset for Multi-lingual Epidemiological Event Extractionhttps://aclanthology.org/2020.lrec-1.509.pdfhttps://zenodo.org/record/3709617#.YcCvOhPMITUtask-oriented (multilingual)classification (non-sentiment analysis)YES>10kcollected from media (news)n/aits own languagen/an/acrowdsourcednot mentioneden fr es pt2020LRECNOuniversity3NOYES  
Multilingual Culture-Independent Word Analogy DatasetsMultilingual Culture-Independent Word Analogy Datasetshttps://aclanthology.org/2020.lrec-1.501.pdfhttps://www.clarin.si/repository/xmlui/handle/11356/1261task-oriented (multilingual)otherNO>10kcollected from webn/aits own languagen/aautomatic translation“annotated (authors, linguists)”in its own languageen et fi lv lt ru sl sv2020LRECNOuniversity6 YES  
SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and GuidelinesSemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelineshttps://aclanthology.org/L16-1656.pdf task-oriented (multilingual)classification (non-sentiment analysis)NO>10kcollected from web & collected from Wikipedian/aits own languagen/an/acrowdsourcedin its own languageen de ru2016LRECNOuniversity2NONO  
20MinutenA New Dataset and Efficient Baselines for Document-level Text Simplification in Germanhttps://aclanthology.org/2021.newsum-1.16.pdf task-oriented (target language)sentence-level-generation taskYES>10kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagede2021ACLNOuniversity0NONO  
SpektrumA Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarizationhttps://aclanthology.org/2021.newsum-1.5.pdfhttps://github.com/MehwishFatimah/wsdtask-oriented (multilingual)summarizationYES1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/aautomatically inducedin its own languageen de2021EMNLPNOuniversity0NOYES  
A Summarization Dataset of Slovak News ArticlesA Summarization Dataset of Slovak News Articleshttps://aclanthology.org/2020.lrec-1.830.pdfhttps://github.com/NaiveNeuron/sme-sumtask-oriented (target language)summarizationnot mentioned>10kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagesk2020LRECNOuniversity1NONO  
Liputan6: A Large-scale Indonesian Dataset for Text SummarizationLiputan6: A Large-scale Indonesian Dataset for Text Summarizationhttps://aclanthology.org/2020.aacl-main.60.pdfhttps://github.com/fajri91/sum_liputan6task-oriented (target language)summarizationYES>10kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languageid2020AACLNOuniversity8NOYES  
Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for BasqueConversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basquehttps://aclanthology.org/2020.lrec-1.55/http://ixa.si.ehu.es/node/12934task-oriented (target language)machine reading comprehensionNO1000~10kcollected from Wikipedian/aits own languagen/an/a“annotated (authors, linguists)”in its own languageeu2020LRECNOuniversity7NOYES  
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity DatasetsA Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasetshttps://aclanthology.org/P15-2001.pdfNot availabletask-oriented (multilingual)classification (non-sentiment analysis)not mentioned<100“annotated (authors, linguists)”n/aEnglishn/aexpert translationcrowdsourcedEnglishen es fr de pt fa2015ACLYES (English)university61NONO  
A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion MiningA Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mininghttps://aclanthology.org/D15-1302.pdfhttps://github.com/diegma/trip-mamltask-oriented (multilingual)classification (sentiment analysis)YES1000~10kcollected from social media or commercial sourcesn/aits own languagen/an/acrowdsourcedin its own languageen es it2015ACLYES (English)university10NOYES  
“Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers”“Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers”https://aclanthology.org/L18-1558.pdfhttps://github.com/AlonEirew/cross-doc-event-coreftask-oriented (multilingual)structured predictionYES>10kcollected from webn/aits own languagen/an/acrowdsourcedin its own languagezh en es2018LRECNOuniversity6NOYES  
DanFEVER: claim verification dataset for DanishDanFEVER: claim verification dataset for Danishhttps://aclanthology.org/2021.nodalida-main.47.pdfhttps://figshare.com/articles/dataset/DanFEVER_claim_verification_dataset_for_Danish/14380970task-oriented (target language)classification (non-sentiment analysis)NO>10kcollected from Wikipedian/aits own languagen/an/acrowdsourcedin its own languageda2021NoDaLiDaNOuniversity5NOYES  
From Web Crawl to Clean Register-Annotated CorporaFrom Web Crawl to Clean Register-Annotated Corporahttps://aclanthology.org/2020.wac-1.3.pdfhttps://github.com/TurkuNLP/WAC-XIItask-oriented (multilingual)classification (non-sentiment analysis)NO>10kcollected from webn/aits own languagen/an/acrowdsourcedin its own languagefr sv2020LRECNOuniversity2NOYES  
LIdioms: A Multilingual Linked Idioms Data SetLIdioms: A Multilingual Linked Idioms Data Sethttps://arxiv.org/pdf/1802.08148.pdfhttps://github.com/dice-group/LIdioms/blob/master/en/english.ttltask-oriented (multilingual)otherNO100~1000collected from webn/aits own languagen/an/aautomatically inducedEnglishen pt it de ru2018LRECNOuniversity6NOYES  
Mega-COVMega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19https://arxiv.org/pdf/2005.06012.pdfhttps://github.com/UBC-NLP/megacov/tree/master/tweet_idstask-oriented (multilingual)classification (non-sentiment analysis)YES>10Kcollected from social media or commercial sourcesn/aEnglish & its own languagen/an/aautomatically inducedEnglish & in its own language 2021EACLNOuniversity7NOYES  
Finnish Rumor Detection DatasetNever guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baselinehttps://arxiv.org/pdf/2106.03389.pdfhttps://zenodo.org/record/4697529#.YcKICy-B2tUtask-oriented (target language)classification (sentiment analysis)YES1000~10Kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagefi2021*ACL WorkshopNOuniversity0NOYES  
PROMETHEUSPROMETHEUS: A Corpus of Proverbs Annotated with Metaphorshttps://aclanthology.org/L16-1600.pdf(not available)task-oriented (multilingual)structured predictionnot mentioned1000~10Kcurated linguistic resources & collected from social media or commercial sourcesn/aEnglishn/aexpert translation“annotated (authors, linguists)”in its own languageen it2016LRECNOuniversity7NOYES  
OneSecSense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domainshttps://aclanthology.org/2020.lrec-1.723.pdfhttp://trainomatic.org/data/onesec_lrec.tar.gztask-oriented (multilingual)otherYES>10Kcurated linguistic resources & collected from Wikipedian/aits own languagen/an/aautomatically inducedin its own languageen it fr de es2020LRECNOuniversity4NOYES  
StoryDBStoryDB: Broad Multi-language Narrative Datasethttps://aclanthology.org/2021.eval4nlp-1.4.pdfhttps://drive.google.com/drive/folders/1RCWk7pyvIpubtsf-f2pIsfqTkvtV80Yvtask-oriented (multilingual)classification (non-sentiment analysis)YES1000~10Kcollected from Wikipedian/aEnglish & its own languagealignmentn/aautomatically inducedEnglish & its own languageen it fr ru de nl uk pl pt es sv ja he fi eu hy fa no ar id ko vi bg el hu zh da gl th sr hr lb mk ta ms cs ro te ka ca lt sl2021*ACL WorkshopNOcombination of university and industry0NOYES  
DReaMThe DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languageshttps://aclanthology.org/2020.lrec-1.110.pdf“https://spraakbanken.gu.se/korp/?mode=dream#?cqp=%5B%5D&corpus=dream-en-open,dream-de-open,dream-es-open,dream-fr-open,dream-it-open,dream-nl-open,dream-ru-open”task-oriented (multilingual)structured predictionnot mentioned100~1000“collected from curated source (exams, scientific papers, etc)”n/aEnglish & its own languagen/an/a“annotated (authors, linguists)”English & its own languageen fr de es pt ru id nl it zh2020LRECNOuniversity5NOYES  
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.https://aclanthology.org/P19-1392.pdfhttp://www.grupolys.org/~marcos/pub/collocations.zipcross-lingual transferstructured predictionNO1000~10Kcurated linguistic resourcesn/aEnglish & its own languagen/an/a“annotated (authors, linguists)”English & its own languageen pt es2019ACLNOuniversity6NOYES  
Universal Dependency Annotation for Multilingual ParsingUniversal Dependency Annotation for Multilingual Parsinghttps://aclanthology.org/P13-2017.pdfhttps://code.google.com/p/uni-dep-tb//task-oriented (multilingual)structured predictionNO1000~10Kcurated linguistic resourcesnot mentionedEnglish & its own languageparsersn/acrowdsourcedEnglish & its own languageen de sv es fr ko2013ACLNOcombination of university and industry561NOYES  
KINNEWSKINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundihttps://arxiv.org/pdf/2010.12174.pdfhttps://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpustask-oriented (multilingual)classification (non-sentiment analysis)YES>10Kcollected from media (news)n/aits own languagegoogle auto completen/acrowdsourcedEnglish & in its own languagerw2020COLINGNOuniversity3YESYES  
KIRNEWSKINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundihttps://arxiv.org/pdf/2010.12174.pdfhttps://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpustask-oriented (multilingual)classification (non-sentiment analysis)YES1000~10Kcollected from media (news)n/aits own languagegoogle auto completen/acrowdsourcedEnglish & in its own languagern2020COLINGNOuniversity3YESYES  
SQuAD-esAutomatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answeringhttps://arxiv.org/pdf/1912.05200.pdfhttps://github.com/ccasimiro88/TranslateAlignRetrievetask-oriented (target language)machine reading comprehensionYES>10Kcrowdsourced & collected from WikipediaamtEnglishalignmentautomatic translationcrowdsourcedEnglishes2020LRECYES (English)university18NOYES  
HEAD-QA: A Healthcare Dataset for Complex ReasoningHEAD-QA: A Healthcare Dataset for Complex Reasoninghttps://aclanthology.org/P19-1092.pdfhttp://aghie.github.io/head-qa/task-oriented (multilingual)QA + IRYES1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/aautomatic translation“derived from linguistic resources (wordnet, etc)”in its own languagees en2019ACLNOuniversity10YESYES  
RuCoSRead and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russianhttps://aclanthology.org/2020.coling-main.570.pdfhttps://github.com/RussianNLP/RussianSuperGLUEtask-oriented (target language)machine reading comprehensionYES>10Kcollected from webtolokaits own languagetf-idf generated & othern/acrowdsourced & automatically inducedin its own languageru2020COLINGYES (other language)combination of university and industry3YESYES  
MuSeRCRead and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russianhttps://aclanthology.org/2020.coling-main.570.pdfhttps://github.com/RussianNLP/RussianSuperGLUEtask-oriented (target language)machine reading comprehensionYES1000~10K“collected from media (news) & collected from curated source (exams, scientific papers, etc)”tolokaits own languagen/an/acrowdsourcedin its own languageru2020COLINGYES (other language)combination of university and industry3YESYES  
BI-139CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrievalhttps://aclanthology.org/2020.emnlp-main.340.pdfhttps://www.cs.jhu.edu/~shuosun/clirmatrix/task-oriented (multilingual)QA + IRYES>10Kcollected from Wikipedian/aEnglish & its own languagealignmentn/aautomatically inducedin its own languageaf als am an ar arz ast az azb ba bar be bg bn bpy br bs bug ca cdo ce ceb ckb cs cv cy da de diq el eml eo es et eu fa fi fo fr fy ga gd gl gu he hi hr hsb ht hu hy ia id ilo io is it ja jv ka kk kn ko ku ky la lb li lmo lt lv mai mg mhr min mk ml mn mr mrj ms my mzn nap nds ne new nl nn no oc or os pa pl pms pnb ps pt qu ro ru sa sah scn sco sd sh si sk sl sq sr su sv sw szl ta te tg th tl tr tt uk ur uz vec vi vo wa war wuu xmf yi yo zh2020EMNLPNOuniversity3NOYES  
MULTI-8CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrievalhttps://aclanthology.org/2020.emnlp-main.340.pdfhttps://www.cs.jhu.edu/~shuosun/clirmatrix/task-oriented (multilingual)QA + IRYES>10Kcollected from Wikipedian/aEnglish & its own languagealignmentn/aautomatically inducedin its own languagear de en es fr ja ru zh2020EMNLPNOuniversity3NOYES  
GerDaLIR: A German Dataset for Legal Information RetrievalGerDaLIR: A German Dataset for Legal Information Retrievalhttps://aclanthology.org/2021.nllp-1.13.pdfhttps://github.com/lavis-nlp/GerDaLIRtask-oriented (target language)QA + IRYES>10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/aautomatically inducedin its own languagede2021*ACL WorkshopNOuniversity0NOYES  
MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation NetworkMOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Networkhttps://arxiv.org/pdf/2106.07352.pdfnot availabletask-oriented (multilingual)classification (non-sentiment analysis)YES>10Kcollected from Wikipedia & collected from webn/aEnglish & its own languagen/an/aautomatically inducedin its own language 2021ACLYES (English & other language)industry2NONO  
A Turkish Dataset for Gender Identification of Twitter UsersA Turkish Dataset for Gender Identification of Twitter Usershttps://aclanthology.org/W19-4023v1.pdfhttps://cloud.iyte.edu.tr/index.php/s/5DhqdlUCCdB60qGtask-oriented (target language)classification (non-sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesuniversity students & academic personnelits own languagen/an/acrowdsourcedin its own languagetr2019*ACL WorkshopNOuniversity11NOYES  
AnCora-CaAnCora: Multilevel Annotated Corpora for Catalan and Spanishhttp://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdfhttp://clic.ub.edu/corpus/entask-oriented (target language)sequence tagging & structured predictionNO>10Kcollected from media (news)not mentionedits own languagen/an/a“annotated (authors, linguists) & automatically induced”in its own languageca2008LRECYES (other language)university345PARTIALYES  
AnCora-EsAnCora: Multilevel Annotated Corpora for Catalan and Spanishhttp://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdfhttp://clic.ub.edu/corpus/entask-oriented (target language)sequence tagging & structured predictionNO>10Kcollected from media (news)not mentionedits own languagen/an/a“annotated (authors, linguists) & automatically induced”in its own languagees2008LRECNOuniversity345NOYES  
NCTTIAssessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levelshttps://aclanthology.org/2021.acl-long.212.pdfhttps://github.com/marcospln/ncttitask-oriented (multilingual)sequence taggingYES1000~10Kcollected from web & collected from Wikipediaamt & online platforms for portuguese in cordeiro et al (2019)its own languageparsersn/a“crowdsourced & annotated (authors, linguists)”in its own languageen pt2021ACLYES (English & other language)university2NOYES  
Books of Hours. the First Liturgical Data Set for Text Segmentation.Books of Hours. the First Liturgical Data Set for Text Segmentation.https://aclanthology.org/2020.lrec-1.97.pdf task-oriented (target language)sequence taggingNO1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languageothern/a“annotated (authors, linguists)”in its own languagela2020LRECNOuniversity1NOYES  
COSTRA 1.0: A Dataset of Complex Sentence TransformationsCOSTRA 1.0: A Dataset of Complex Sentence Transformationshttps://aclanthology.org/2020.lrec-1.434/https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3123task-oriented (target language)sentence pair taskNO1000~10Kcollected from media (news)n/aits own languagealignmentn/acrowdsourcedin its own languagecs2020LRECNOuniversity1NOYES  
Fine-grained Named Entity Annotation for FinnishFine-grained Named Entity Annotation for Finnishhttps://aclanthology.org/2021.nodalida-main.14/https://github.com/TurkuNLP/turku-onetask-oriented (target language)sequence taggingYES>10K n/aits own languagen/an/a“derived from linguistic resources (wordnet, etc)”in its own languagefi2021NoDaLiDaYES (other language)university0NOYES  
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical ErrorsGitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errorshttps://aclanthology.org/2020.lrec-1.835/https://github.com/mhagiwara/github-typo-corpustask-oriented (multilingual)sequence taggingNO>10Kcollected from social media or commercial sourcesn/aits own languagen/an/aautomatically inducedin its own languageen zh ja ru fr de pt es ko hi2020LRECNOcombination of university and industry13NOYES  
Hindi TimeBank: An ISO-TimeML Annotated Reference CorpusHindi TimeBank: An ISO-TimeML Annotated Reference Corpushttps://aclanthology.org/2020.isa-1.2/ task-oriented (target language)sequence taggingNO>10Kcollected from media (news)not mentionedits own languagen/an/acrowdsourcedin its own languagehi2020*ACL WorkshopNOuniversity3NONO  
K-SNACS: Annotating Korean Adposition SemanticsK-SNACS: Annotating Korean Adposition Semanticshttps://aclanthology.org/2020.dmr-1.6/https://github.com/jdch00/k-snacstask-oriented (target language)sequence taggingNO1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/a“annotated (authors, linguists)”in its own languageko2020*ACL WorkshopNOuniversity4NONO  
“MassiveSumm: a very large-scale, very multilingual, news summarisation dataset”“MassiveSumm: a very large-scale, very multilingual, news summarisation dataset”https://aclanthology.org/2021.emnlp-main.797/https://github.com/natschluter/massive-summtask-oriented (multilingual)summarizationYES>10Kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languageaf am ar as ay az bm bn bo bs bg ca cs cy da de el en eo fa fil fr ff ga gu ht ha he hi hr hu hy ig id is it ja kn ka km rw ky ko ku lo lv ln lt ml mr mk mg mn my nd ne nl or om pa pl pt prs ps ro rn ru si sk sl sn so es sq sr sw sv ta te tet tg th ti tr uk ur uz vi xh yo yue zh bi gd2021EMNLPNOuniversity0NONO  
Models and Datasets for Cross-Lingual SummarisationModels and Datasets for Cross-Lingual Summarisationhttps://aclanthology.org/2021.emnlp-main.742/https://github.com/lauhaide/cladstask-oriented (multilingual)summarizationYES>10Kcollected from Wikipedian/aits own languagealignmentn/aautomatically inducedin its own languagecs fr en de2021EMNLPNOuniversity0YESYES  
Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in DanishNeural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danishhttps://aclanthology.org/W19-6143/https://github.com/UniversalDependencies/UD_Danish-DDTtask-oriented (target language)sequence taggingYES1000~10Kcurated linguistic resourcesn/aits own languagen/an/a“annotated (authors, linguists)”in its own languageda2019*ACL WorkshopYES (English)university8YESYES  
Universal Joy A Data Set and Results for Classifying Emotions Across LanguagesUniversal Joy A Data Set and Results for Classifying Emotions Across Languageshttps://aclanthology.org/2021.wassa-1.7.pdfhttps://github.com/sotlampr/universal-joytask-oriented (multilingual)classification (sentiment analysis)YES>10Kcollected from social media or commercial sourcesn/aits own languagen/an/aautomatically inducedin its own languagebn zh de en fr hi id it km my nl pt ro es tl th vi ms2021*ACL WorkshopNOuniversity6NOYES  
X-Fact: A New Benchmark Dataset for Multilingual Fact CheckingX-Fact: A New Benchmark Dataset for Multilingual Fact Checkinghttps://aclanthology.org/2021.acl-short.86/https://github.com/utahnlp/x-fact/task-oriented (multilingual)classification (non-sentiment analysis)YES1000~10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/aautomatically inducedin its own languagesi nl mr no tr hi id it sr ru fa sq gu ka pl az bn ta de es pa fr ro pt ar2021ACLNOuniversity3NOYES  
XCOPA: A Multilingual Dataset for Causal Commonsense ReasoningXCOPA: A Multilingual Dataset for Causal Commonsense Reasoninghttps://aclanthology.org/2020.emnlp-main.185/https://github.com/cambridgeltl/xcopacross-lingual transferclassification (non-sentiment analysis)NO100~1000“collected from curated source (exams, scientific papers, etc)”n/aEnglishn/aexpert translationautomatically inducedEnglishet ht id it qu sw ta th tr vi zh2020EMNLPYES (English)university36YESYES  
KLEJ - NKJP-NERKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)classification (non-sentiment analysis)YES>10K“collected from curated source (exams, scientific papers, etc)”n/aits own languagen/an/a“annotated (authors, linguists)”in its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - CBDKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)classification (sentiment analysis)YES>10Kcollected from social media or commercial sourcesnot mentionedits own languagen/an/a“crowdsourced & annotated (authors, linguists)”in its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - PolEmo2.0-INKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)classification (non-sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/a“annotated (authors, linguists)”in its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - PolEmo2.0-OUTKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)classification (non-sentiment analysis)YES1000~10Kcollected from social media or commercial sourcesn/aits own languagen/an/a“annotated (authors, linguists)”in its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - Czy wiesz?KLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)QA + IRYES1000~10Kcollected from Wikipedian/aits own languageRTT & greedy sentence matchingn/aautomatically inducedin its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - PSCKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)sentence-level-generation taskYES1000~10Kcollected from media (news)n/aits own languagen/an/aautomatically inducedin its own languagepl2020ACLYES (other language)university22YESYES  
KLEJ - ARKLEJ: Comprehensive Benchmark for Polish Language Understandinghttps://aclanthology.org/2020.acl-main.111/https://klejbenchmark.com/multi-task (target language)classification (sentiment analysis)YES>10Kcollected from social media or commercial sourcesn/aits own languagen/an/aautomatically inducedin its own languagepl2020ACLYES (other language)university22YESYES  
PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphonePACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphonehttps://aclanthology.org/L14-1240/ task-oriented (multilingual)classification (sentiment analysis)NO1000~10K“collected from curated source (exams, scientific papers, etc)”not mentionedoriginal languagen/an/acrowdsourcedin its own languageen de2014LRECNOcombination of university and industry1NOYES  
RussianSuperGLUE-LiDiRusRussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)sentence pair taskNO1000~10Kcollected from media (news)n/aEnglishn/aexpert translation“annotated (authors, linguists)”Englishru2020EMNLPYES (English)combination of university and industry11YESYES  
RussianSuperGLUE-RUSSERussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)otherYES>10Kcollected from Wikipedia & curated linguistic resourcestolokaits own languagen/an/acrowdsourcedin its own languageru2020EMNLPYES (other language)combination of university and industry11YESYES  
RussianSuperGLUE-PARusRussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)classification (non-sentiment analysis)YES100~1000“collected from web & collected from curated source (exams, scientific papers, etc)”amtEnglishn/aexpert translationcrowdsourcedEnglishru2020EMNLPYES (English)combination of university and industry11YESYES  
RussianSuperGLUE-TERRaRussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)sentence pair taskYES1000~10Kcollected from media (news) & collected from webn/aits own languagen/an/a“automatically induced & annotated (authors, linguists)”in its own languageru2020EMNLPYES (other language)combination of university and industry11YESYES  
RussianSuperGLUE-RCBRussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)sentence pair taskYES1000~10Kcollected from media (news) & collected from webn/aits own languagen/an/a“automatically induced & annotated (authors, linguists)”in its own languageru2020EMNLPYES (other language)combination of university and industry11YESYES  
RussianSuperGLUE-RWSDRussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)structured predictionYES100~1000“collected from curated source (exams, scientific papers, etc)”n/aEnglishn/adetails not provided“annotated (authors, linguists)”Englishru2020EMNLPYES (English)combination of university and industry11YESYES  
RussianSuperGLUE-DaNetQARussianSuperGLUE: A Russian Language Understanding Evaluation Benchmarkhttps://aclanthology.org/2020.emnlp-main.381/https://github.com/RussianNLP/RussianSuperGLUEmulti-task (target language)machine reading comprehensionYES100~1000collected from Wikipedia & crowdsourcedtolokaits own languageothern/a“crowdsourced & annotated (authors, linguists)”in its own languageru2020EMNLPNOcombination of university and industry11YESYES  
Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic LanguagesVyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languageshttps://arxiv.org/abs/2103.00854https://github.com/rajaswa/indic-syntax-evaluationtask-oriented (target language)structured predictionYES>10Kcurated linguistic resourcesn/aoriginal languagen/an/a“derived from linguistic resources (wordnet, etc)”in its own languagehi ta2021*ACL WorkshopYES (other language)university0NOYES  
IndicNLPSuite-Soham News Article Classification“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)classification (non-sentiment analysis)YES1000~10Kcollected from media (news)n/aoriginal languagen/an/aautomatically inducedin its own languagepa bn or gu mr kn te ml ta2020Findings combination of university and industry59YESYES  
IndicNLPSuite-iNLTK Headline Classification“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)classification (non-sentiment analysis)YES1000~10Kcollected from media (news)n/aoriginal languagen/an/aautomatically inducedin its own languagepa bn or gu mr kn te ml ta2020Findings combination of university and industry59YESYES  
IndicNLPSuite-AI4Bharat Cloze-style Question Answering“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)machine reading comprehensionYES>10Kcollected from Wikipedian/aoriginal languagen/an/aautomatically inducedin its own languagepa hi bn or as gu mr kn te ml ta2020Findings combination of university and industry59YESYES  
IndicNLPSuite-AI4Bharat Winograd Natural Language Inference“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)classification (non-sentiment analysis)NO100~1000“collected from curated source (exams, scientific papers, etc)”n/aEnglishn/aauthor translation“annotated (authors, linguists)”Englishhi mr gu2020FindingsYES (English)combination of university and industry59YESYES  
IndicNLPSuite-AI4Bharat Choice of Plausible Alternatives“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)classification (non-sentiment analysis)NO100~1000“annotated (authors, linguists)”n/aEnglishn/aauthor translation“annotated (authors, linguists)”Englishhi mr gu2020FindingsYES (English)combination of university and industry59YESYES  
IndicNLPSuite-WikiAnnNER“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”https://aclanthology.org/2020.findings-emnlp.445/https://indicnlp.ai4bharat.org/home/multi-task (target language)sequence taggingYES>10Kcollected from Wikipedian/aEnglishalignedautomatic translationautomatically inducedEnglishace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh2020FindingsYES (other language)combination of university and industry59YESYES  
CVIT-MKB Cross-lingual Sentence RetrievalA Multilingual Parallel Corpora Collection Effort for Indian Languageshttps://aclanthology.org/2020.lrec-1.462.pdfhttps://anoopkunchukuttan.github.io/indic_nlp_library/task-oriented (multilingual)QA + IRYES>10Kcollected from webn/aits own languagealignmentn/aautomatically inducedin its own languagehi te ta ml gu kn ur bn or mr pa as en2020LREC university18YESYES  
ACTSAACTSA: Annotated Corpus for Telugu Sentiment Analysishttps://aclanthology.org/W17-5408/https://drive.google.com/drive/folders/0B8HHvMMuHYdWdnJZZl9rWkY5bk0?usp=sharingtask-oriented (target language)classification (non-sentiment analysis)YES1000~10Kcollected from media (news)native speakersoriginal languagen/an/acrowdsourcedin its own languagete2017*ACL Workshop university31YESNO  
MIDAS DiscourseAn Annotated Dataset of Discourse Modes in Hindi Storieshttps://aclanthology.org/2020.lrec-1.149.pdfhttps://github.com/midas-research/hindi-discoursetask-oriented (target language)classification (non-sentiment analysis)YES1000~10K“collected from curated source (exams, scientific papers, etc)”native speakersoriginal languagen/an/acrowdsourcedin its own languagehi2020LREC combination of university and industry4YESYES  
A Multilingual Evaluation Dataset for Monolingual Word Sense AlignmentA Multilingual Evaluation Dataset for Monolingual Word Sense Alignmenthttps://aclanthology.org/2020.lrec-1.395.pdfhttps://github.com/elexis-eu/MWSAcross-lingual transferotherNO1000~10Kcurated linguistic resourcesn/aoriginal languagen/an/a“derived from linguistic resources (wordnet, etc)”in its own languageeu bg da nl en et de hu ga it sr sl es pt ru2020LRECYES (other language)university12NOYES  
Multilingual corpora with coreferential annotation of person entitiesMultilingual corpora with coreferential annotation of person entitieshttps://aclanthology.org/L14-1701/https://gramatica.usc.es/~marcos/lrec.tar.bz2cross-lingual transferstructured predictionNO1000~10Kcollected from Wikipedia & collected from media (news)n/aoriginal languagen/an/acrowdsourcedin its own languagegl pt es2014LREC university21NOYES  
MGAD: Multilingual Generation of Analogy DatasetsMGAD: Multilingual Generation of Analogy Datasetshttps://aclanthology.org/L18-1320.pdfhttps://github.com/rutrastone/MGADtask-oriented (multilingual)otherNO>10Ktemplate-basedn/aits own languagen/an/aautomatically inducedin its own languagehi ar ru2018LREC university8NOYES  
“The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation”“The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation”https://arxiv.org/abs/2011.03287https://yovakem.github.io/#ApposCorpustask-oriented (multilingual)sentence-level-generation taskYES>10Kcollected from Wikipedia & collected from media (news)n/aits own languagen/an/aautomatically inducedin its own languageen es de pl2020COLING combination of university and industry0NOYES  
“The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation”“The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation”https://aclanthology.org/2020.inlg-1.10/https://github.com/TallChris91/CACAPO-Datasettask-oriented (multilingual)sentence-level-generation taskYES>10Kcollected from media (news)n/aits own languagealignmentn/a“annotated (authors, linguists)”in its own languagenl en2020INLG university2NOYES  
A Dataset and Baselines for Multilingual Reply SuggestionA Dataset and Baselines for Multilingual Reply Suggestionhttps://arxiv.org/abs/2106.02017https://github.com/zhangmozhi/mrstask-oriented (multilingual)sentence-level-generation taskYES>10Kcollected from social media or commercial sourcesn/aits own languagen/an/aautomatically inducedin its own languageen es de pt fr ja sv it nl ru2021ACL combination of university and industry1NOYES  
                         
   task-oriented (multilingual): 61
   cross-lingual transfer: 21
   task-oriented (target language): 37
   multi-task (target language): 38