A New Dataset for Natural Language Inference from Code-mixed Conversations | A New Dataset for Natural Language Inference from Code-mixed Conversations | https://arxiv.org/pdf/2004.05051.pdf | | task-oriented (multilingual) | sentence pair task | not mentioned | 100~1000 | “collected from curated source (exams, scientific papers, etc)” | not mentioned | its own language | n/a | n/a | crowdsourced | its own language | en hi | 2020 | LREC | NO | combination of university and industry | 0 | NO | NO | | |
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization | MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization | https://aclanthology.org/2021.eacl-main.146.pdf | | task-oriented (multilingual) | summarization | NO | >10k | collected from media (news) | humanitarian experts | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | en fr es | 2021 | EACL | NO | combination of university and industry | 0 | NO | NO | | |
A Multilingual Wikified Data Set of Educational Material | A Multilingual Wikified Data Set of Educational Material | https://aclanthology.org/L18-1073.pdf | Not available | cross-lingual transfer | classification (non-sentiment analysis) | not mentioned | 1000~10k | “collected from curated source (exams, scientific papers, etc)” | crowdflower | its own language | alignment | automatic translation | crowdsourced | in its own language | bg cs de el hr it nl pl pt ru zh | 2018 | LREC | NO | combination of university and industry | 0 | NO | NO | | |
Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus) | Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus) | https://aclanthology.org/Y11-1038.pdf | http://linguistics.hss.ntu.edu.sg/ResearchinLMS/Pages/NTUMultilingualCorpus | cross-lingual transfer | structured prediction | not mentioned | 1000~10k | collected from web | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | en zh ja ko id vi | 2011 | “Pacific Asia Conference on Language, Information and Computation” | NO | university | 51 | NO | NO | | |
XNLI | XNLI: Evaluating Cross-lingual Sentence Representations | https://arxiv.org/pdf/1809.05053.pdf | https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip | cross-lingual transfer | sentence pair task | NO | 1000~10K | crowdsourced | gethybrid.io | English | alignment | crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English | en fr es de el bg ru tr ar vi th zh hi sw ur | 2018 | EMNLP | YES (English) | industry | 502 | YES | YES | | |
PAWS-X | PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification | https://arxiv.org/pdf/1908.11828.pdf | https://github.com/google-research-datasets/paws | cross-lingual transfer | sentence pair task | NO | 1000~10K | crowdsourced | not mentioned (maybe google internal according to the acknowledgements) | English | n/a | automatic translation & crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English | fr es de zh ja ko | 2019 | EMNLP | YES (English) | industry | 91 | YES | YES | | |
MLSUM | MLSUM: The Multilingual Summarization Corpus | https://arxiv.org/pdf/2004.14900v1.pdf | https://github.com/huggingface/datasets/tree/master/datasets/mlsum | task-oriented (multilingual) | summarization | YES | >10K | collected from media (news) | n/a | its own language | alignment | n/a | automatically induced | in its own language | fr de es ru tr | 2020 | EMNLP | NO | university | 20 | YES | YES | | |
XL-WiC | XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization | https://aclanthology.org/2020.emnlp-main.584.pdf | https://pilehvar.github.io/xlwic/ | cross-lingual transfer | other | PARTIAL | 1000~10K | curated linguistic resources | n/a | its own language | n/a | n/a | “derived from linguistic resources (wordnet, etc)” | in its own language | bg da de et fa fr hr it ja ko nl zh en | 2020 | EMNLP | YES (English & other language) | university | 14 | YES | YES | | |
MLQA | MLQA: Evaluating Cross-lingual Extractive Question Answering | https://arxiv.org/abs/1910.07475 | https://github.com/facebookresearch/MLQA | cross-lingual transfer | machine reading comprehension | NO | 1000~10K | crowdsourced & collected from Wikipedia | amt | English | alignment | crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | in its own language | en de es ar zh vi hi | 2020 | ACL | YES (English & other language) | | 150 | YES | YES | | |
XQuAD | On the Cross-lingual Transferability of Monolingual Representations | https://arxiv.org/abs/1910.11856 | https://github.com/deepmind/xquad | cross-lingual transfer | machine reading comprehension | NO | 1000~10K | crowdsourced & collected from Wikipedia | amt | English | n/a | crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English | en es de el ru tr ar vi th zh hi | 2019 | EMNLP | YES (English) | industry | 218 | YES | YES | | |
TyDi QA | TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages | https://arxiv.org/abs/2003.05002 | https://ai.google.com/research/tydiqa/ | task-oriented (multilingual) | machine reading comprehension | YES | >10K | crowdsourced | not mentioned | its own language | n/a | n/a | crowdsourced | its own language | ar bn fi id ja sw ko ru te th | 2020 | TACL | NO | industry | 118 | YES | YES | | |
XOR QA | XOR QA: Cross-lingual Open-Retrieval Question Answering | https://arxiv.org/pdf/2010.11856.pdf | https://nlp.cs.washington.edu/xorqa/ | task-oriented (multilingual) | QA + IR | YES | >10K | crowdsourced | amt (and maybe undergrad students) | its own language | n/a | crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English | ar bn fi ja ko te ru | 2021 | NAACL | YES (other language) | combination of university and industry | 13 | YES | YES | | |
XQA | XQA: A Cross-lingual Open-domain Question Answering Dataset | https://aclanthology.org/P19-1227/ | http://github.com/thunlp/XQA | task-oriented (multilingual) | QA + IR | PARTIAL | 1000~10K | template-based | n/a | its own language | n/a | n/a | automatically induced | its own language | en zh fr de pl pt ru ta uk | 2019 | ACL | NO | university | 41 | NO | YES | | |
MKQA | MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering | https://arxiv.org/abs/2007.15207 | https://github.com/apple/ml-mkqa | cross-lingual transfer | QA + IR | NO | 1000~10K | crowdsourced | tryrating | English | n/a | crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English | ar da de en es fi fr he hu it ja ko km ms nl no pl pt ru sv th tr vi zh | 2021 | TACL | YES (English) | industry | 19 | YES | YES | | |
POS-tagged Arabic tweets for four dialects | Multi-Dialect Arabic POS Tagging: A CRF Approach | http://www.lrec-conf.org/proceedings/lrec2018/pdf/562.pdf | https://huggingface.co/datasets/arabic_pos_dialect | task-oriented (target language) | sequence tagging | YES | 100~1000 | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | ar egy lev glf mgr | 2018 | LREC | NO | combination of university and industry | 25 | YES | YES | | |
WikiANN | Cross-lingual Name Tagging and Linking for 282 Languages | https://www.aclweb.org/anthology/P17-1178 | https://huggingface.co/datasets/wikiann | task-oriented (multilingual) | sequence tagging | YES | 1000~10K | collected from Wikipedia | n/a | English | aligned | automatic translation | automatically induced | English | ace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh | 2017 | ACL | NO | university | 71 | YES | YES | | |
MFAQ | MFAQ: a Multilingual FAQ Dataset | https://arxiv.org/abs/2109.12870 | https://huggingface.co/datasets/clips/mfaq | task-oriented (multilingual) | machine reading comprehension | YES | >10K | collected from web | n/a | its own language | n/a | n/a | automatically induced | in its own language | cs da de en es fi fr he hr hu id it nl no pl pt ro ru sv tr vi | 2021 | EMNLP | NO | industry | 0 | YES | YES | | |
XL-Sum | XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages | https://aclanthology.org/2021.findings-acl.413/ | https://github.com/csebuetnlp/xl-sum | task-oriented (multilingual) | summarization | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | am ar az bn my zh en fr gu ha hi ig id ja rn ko ky mr ne om ps fa pcm pt pa ru gd sr si so es sw ta te th ti tr uk ur uz vi cy yo | 2021 | ACL | NO | combination of university and industry | 3 | YES | YES | | |
OntoNotes 5.0 | OntoNotes | https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf | https://catalog.ldc.upenn.edu/LDC2013T19 | task-oriented (multilingual) | structured prediction | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | en cm ar zh | 2013 | N/A | NO | university | N/A | NO | YES | | |
euronews | An Open Corpus for Named Entity Recognition in Historic Newspapers | https://aclanthology.org/L16-1689.pdf | https://github.com/EuropeanaNewspapers/ner-corpora | task-oriented (multilingual) | sequence tagging | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | crowdsourced | in its own language | de fr nl | 2016 | LREC | NO | industry | 21 | YES | YES | | |
exams | EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering | https://arxiv.org/pdf/2011.03080.pdf | https://github.com/mhardalov/exams-qa | task-oriented (multilingual) | machine reading comprehension | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | automatically induced | in its own language | ar bg de es fr hr hu it lt mk pl pt sq sr tr vi | 2020 | EMNLP | NO | university | 6 | YES | YES | | |
hope_edi | “HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion” | https://aclanthology.org/2020.peoples-1.5/ | https://github.com/huggingface/datasets/blob/master/datasets/hope_edi/README.md | task-oriented (multilingual) | classification (sentiment analysis) | YES | >10K | collected from social media or commercial sources | google form | its own language | n/a | n/a | crowdsourced | its own language | en ml ta | 2021 | EACL | NO | university | 40 | YES | YES | | |
kan_hope | Hope Speech detection in under-resourced Kannada language | https://arxiv.org/abs/2108.04616 | https://github.com/adeepH/kan_hope | task-oriented (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | its own language | en kn | 2021 | | NO | university | 3 | YES | YES | | |
masakhaner | MasakhaNER: Named Entity Recognition for African Languages | https://arxiv.org/pdf/2103.11811.pdf | https://github.com/masakhane-io/masakhane-ner/ | task-oriented (multilingual) | sequence tagging | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | crowdsourced | its own language | am ha ig rw lg luo pcm sw wo yo | 2021 | TACL | NO | combination of university and industry | 1 | YES | YES | | |
multi_eurlex | MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer | https://arxiv.org/abs/2109.00904 | https://github.com/huggingface/datasets/blob/master/datasets/multi_eurlex/README.md | cross-lingual transfer | classification (non-sentiment analysis) | NO | >10K | collected from curated source | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | en da de nl sv bg cs hr pl sk sl es fr it pt ro et fi hu lt lv el mt | 2021 | EMNLP | NO | combination of university and industry | 1 | YES | YES | | |
nchlt | Developing Text Resources for Ten South African Languages | http://www.lrec-conf.org/proceedings/lrec2014/pdf/1151_Paper.pdf | https://repo.sadilar.org/handle/20.500.12185/7/discover?filtertype_0=database&filtertype_1=title&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Monolingual+Text+Corpora%3A+Annotated&filtertype=project&filter_relational_operator=equals&filter=NCHLT+Text+II | task-oriented (multilingual) | sequence tagging | YES | >10k | collected from web & collected from curated source | n/a | its own language | n/a | n/a | automatically induced | in its own language | af nr nso ss tn ts ve xh zu | 2014 | LREC | NO | university | 68 | YES | YES | | |
offenseval_dravidian | “Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada” | https://aclanthology.org/2021.dravidianlangtech-1.17.pdf | N/A | task-oriented (multilingual) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | automatically induced | in its own language | en kn ml ta | 2021 | EACL | NO | university | 3 | YES | YES | | |
sem_eval_2018_task_1 | SemEval-2018 Task 1: Affect in Tweets | http://saifmohammad.com/WebDocs/semeval2018-task1.pdf | https://competitions.codalab.org/competitions/20948 | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | en ar es | 2018 | *ACL Workshop | NO | university | 427 | YES | YES | | |
stsb_multi_mt | N/A | N/A | https://github.com/PhilipMay/stsb-multi-mt | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | 1000~10K | collected from media (news) | n/a | English | n/a | automatic translation | automatically induced | English | en de es fr it nl pl pt ru zh | 2021 | N/A | YES (English) | industry | N/A | YES | YES | | |
20Minuten | A New Dataset and Efficient Baselines for Document-level Text Simplification in German | https://aclanthology.org/2021.newsum-1.16/ | https://github.com/ZurichNLP/20Minuten | task-oriented (target language) | sentence-level-generation task | YES | >10k | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | de | 2021 | EMNLP | NO | university | 0 | NO | YES | | |
XFORMAL | “Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer” | https://aclanthology.org/2021.naacl-main.256.pdf | https://github.com/Elbria/xformal-FoST | task-oriented (multilingual) | sentence-level-generation task | NO | 1000~10k | collected from web | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | pt fr it | 2021 | NAACL | NO | combination of university and industry | 4 | NO | YES | | yes |
Mr. TyDi | Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval | https://arxiv.org/abs/2108.08787 | https://github.com/castorini/mr.tydi | task-oriented (multilingual) | machine reading comprehension | YES | 1000~10k | crowdsourced | not mentioned (from tydi) | its own language | n/a | n/a | crowdsourced | in its own language | ar bn en fi id ja ko ru sw te th | 2021 | *ACL Workshop | YES (English & other language) | university | 0 | NO | YES | | |
XWikis | Models and Datasets for Cross-Lingual Summarisation | https://aclanthology.org/2021.emnlp-main.742/ | https://github.com/lauhaide/clads/blob/main/fairseq2020/examples/clads/README.md | task-oriented (multilingual) | summarization | YES | >10k | collected from Wikipedia | n/a | its own language | n/a | n/a | automatically induced | in its own language | cs fr en de | 2021 | EMNLP | NO | university | 0 | NO | YES | | |
MultiHumES | MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization | https://aclanthology.org/2021.eacl-main.146.pdf | https://deephelp.zendesk.com/hc/en-us/sections/360011925552-MultiHumES | task-oriented (multilingual) | summarization | YES | 1000~10k | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | en fr es | 2021 | EACL | NO | combination of university and industry | 0 | NO | YES | | |
SMiLER | Multilingual Entity and Relation Extraction Dataset and Model | https://aclanthology.org/2021.eacl-main.166.pdf | https://github.com/samsungnlp/smiler/ | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10k | collected from Wikipedia | n/a | English | n/a | automatic translation | automatically induced | in its own language | it fr de pt es ko | 2021 | EACL | NO | combination of university and industry | 0 | NO | YES | | |
swiss_judgment_prediction | Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark | https://aclanthology.org/2021.nllp-1.3.pdf | https://github.com/JoelNiklaus/SwissCourtRulingCorpus | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10k | collected from curated sources | n/a | its own language | n/a | n/a | crowdsourced | English | de fr it | 2021 | *ACL Workshop | NO | university | 0 | YES | YES | | |
tamilmixsentiment | Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text | https://aclanthology.org/2020.sltu-1.28/ | https://dravidian-codemix.github.io/2020/index.html | task-oriented (target language) | classification (sentiment analysis) | YES | >10k | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | en ta | 2020 | *ACL Workshop | NO | university | 106 | YES | YES | | |
C^3 | Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension | https://arxiv.org/pdf/1904.09679 | https://dataset.org/c3/ | task-oriented (target language) | machine reading comprehension | YES | >10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | alignment | n/a | automatically induced | in its own language | zh | 2019 | TACL | NO | combination of university and industry | 20 | YES | YES | * in huggingface as part of clue | |
ChID | ChID: A Large-scale Chinese IDiom Dataset for Cloze Test | https://aclanthology.org/P19-1075.pdf | https://drive.google.com/drive/folders/1qdcMgCuK9d93vLVYJRvaSLunHUsGf50u?usp=sharing | task-oriented (target language) | machine reading comprehension | YES | >10K | collected from media (news) & collected from web | n/a | its own language | word embedding similarity scores | n/a | automatically induced | in its own language | zh | 2019 | ACL | YES (other language) | university | 33 | YES | YES | * in huggingface as part of clue | |
CLUE - IFLYTEK Long Text classification | CLUE: A Chinese Language Understanding Evaluation Benchmark | https://arxiv.org/abs/2004.05986 | https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip | multi-task (target language) | classification (sentiment analysis) | YES | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | not mentioned | in its own language | zh | 2020 | COLING | NO | industry | 59 | NO | YES | | |
CLUE - Ant Financial Question Matching Corpus | CLUE: A Chinese Language Understanding Evaluation Benchmark | https://arxiv.org/abs/2004.05986 | https://storage.googleapis.com/cluebenchmark/tasks/afqmc_public.zip | multi-task (target language) | sentence pair task | YES | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | not mentioned | in its own language | zh | 2020 | COLING | YES (other language) | individual researchers | 59 | YES | YES | | |
CLUE - Chinese Scientific Literature | CLUE: A Chinese Language Understanding Evaluation Benchmark | https://arxiv.org/abs/2004.05986 | https://storage.googleapis.com/cluebenchmark/tasks/csl_public.zip | multi-task (target language) | sentence pair task | YES | >10K | curated linguistic resources | n/a | its own language | tf-idf generated | n/a | automatically induced | in its own language | zh | 2020 | COLING | NO | individual researchers | 59 | NO | YES | * citation & published venue is clue’s because the dataset itself wasn’t published | |
CLUE - CLUEWSC 2020 | CLUE: A Chinese Language Understanding Evaluation Benchmark | https://arxiv.org/abs/2004.05986 | https://storage.googleapis.com/cluebenchmark/tasks/cluewsc2020_public.zip | multi-task (target language) | structured prediction | YES | 1000~10K | curated linguistic resources | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | zh | 2020 | COLING | NO | individual researchers | 59 | NO | YES | * citation & published venue is clue’s because the dataset itself wasn’t published | |
CLUE - Toutiao Short Text Classification for News | CLUE: A Chinese Language Understanding Evaluation Benchmark | https://arxiv.org/abs/2004.05986 | https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip | multi-task (target language) | classification (non-sentiment analysis) | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | zh | 2020 | COLING | NO | individual researchers | 59 | NO | YES | * citation & published venue is clue’s because the dataset itself wasn’t published | |
CMRC 2018 | A Span-Extraction Dataset for Chinese Machine Reading Comprehension | https://aclanthology.org/D19-1600.pdf | https://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9ce | task-oriented (target language) | machine reading comprehension | YES | >10K | collected from Wikipedia | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | zh | 2019 | EMNLP | NO | combination of university and industry | 91 | YES | YES | | |
FQuAD 1.1 | FQuAD: French Question Answering Dataset | https://aclanthology.org/2020.findings-emnlp.107/ | https://fquad.illuin.tech/ | task-oriented (target language) | machine reading comprehension | YES | >10K | crowdsourced & collected from Wikipedia | university students | its own language | n/a | n/a | crowdsourced | in its own language | fr | 2020 | Findings | NO | combination of university and industry | 25 | YES | YES | | |
CLS | Cross-Language Text Classification using Structural Correspondence Learning | https://aclanthology.org/P10-1114.pdf | https://github.com/getalp/Flaubert/tree/master/flue | task-oriented (multilingual) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | automatic translation | automatically induced | in its own language | fr de en ja | 2010 | ACL | NO | university | 291 | NO | YES | | |
IndoNLI | IndoNLI: A Natural Language Inference Dataset for Indonesian | https://arxiv.org/pdf/2110.14566.pdf | https://github.com/ir-nlp-csui/indonli/tree/main/data | task-oriented (target language) | sentence pair task | YES | >10K | collected from Wikipedia & collected from web & curated linguistic resources | university students | its own language | n/a | n/a | “crowdsourced & annotated (authors, linguists)” | in its own language | id | 2021 | EMNLP | NO | combination of university and industry | 0 | NO | YES | | |
K-QuAD | Semi-supervised Training Data Generation for Multilingual Question Answering | https://aclanthology.org/L18-1437.pdf | https://github.com/Di-lab-Yonsei/K-QuAD | task-oriented (target language) | machine reading comprehension | NO | >10K | crowdsourced & collected from Wikipedia | not mentioned | English & its own language | n/a | automatic translation & crowdsourced translation (incl. Gengo / One Hour Translation) | crowdsourced | English & in its own language | ko | 2018 | LREC | YES (English) | university | 26 | NO | YES | | |
KLUE - DP | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000071/data/klue-dp-v1.1.tar.gz | multi-task (target language) | structured prediction | YES | >10K | collected from web & collected from social media or commercial sources | deepnatural | its own language | n/a | n/a | crowdsourced & automatically induced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - DST (WoS) | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000073/data/wos-v1.1.tar.gz | multi-task (target language) | structured prediction | YES | 1000~10K | crowdsourced | not mentioned | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - MRC | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000072/data/klue-mrc-v1.1.tar.gz | multi-task (target language) | machine reading comprehension | YES | >10K | collected from media (news) & collected from Wikipedia & crowdsourced | selectstar | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - NER | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000069/data/klue-ner-v1.1.tar.gz | multi-task (target language) | sequence tagging | YES | >10K | collected from web | deepnatural | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - NLI | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000068/data/klue-nli-v1.1.tar.gz | multi-task (target language) | sentence pair task | YES | >10K | collected from web & collected from Wikipedia & collected from social media or commercial sources & collected from media (news) | selectstar | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - RE | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000070/data/klue-re-v1.1.tar.gz | multi-task (target language) | classification (non-sentiment analysis) | YES | >10K | collected from web & collected from Wikipedia & collected from media (news) | deepnatural | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | NO | combination of university and industry | 19 | NO | YES | | |
KLUE - STS | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000067/data/klue-sts-v1.1.tar.gz | multi-task (target language) | sentence pair task | YES | >10K | collected from web & collected from Wikipedia & collected from media (news) | selectstar | its own language | RTT & greedy sentence matching | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | YES (other language) | combination of university and industry | 19 | NO | YES | | |
KLUE - TC(YNAT) | KLUE: Korean Language Understanding Evaluation | https://arxiv.org/pdf/2105.09680.pdf | https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000066/data/ynat-v1.1.tar.gz | multi-task (target language) | classification (sentiment analysis) | YES | >10K | collected from media (news) | selectstar | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2021 | NeurIPS Datasets and Benchmarks Track | YES (other language) | combination of university and industry | 19 | NO | YES | | |
KorQuAD1.0 | KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension | https://arxiv.org/pdf/1909.07005.pdf | https://korquad.github.io/ | task-oriented (target language) | machine reading comprehension | YES | >10K | collected from Wikipedia & crowdsourced | not mentioned | its own language | n/a | n/a | crowdsourced | in its own language | ko | 2019 | arxiv | NO | industry | 39 | YES | YES | | |
OCNLI | OCNLI: Original Chinese Natural Language Inference | https://arxiv.org/pdf/2010.05444.pdf | https://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.zip | task-oriented (target language) | sentence pair task | YES | >10K | “collected from media (news) & collected from curated source (exams, scientific papers, etc) & curated linguistic resources” | university students | its own language | n/a | n/a | crowdsourced | in its own language | zh | 2020 | Findings | NO | combination of university and industry | 21 | NO | YES | | |
ParsiNLU - Multiple Choice QA | ParsiNLU: A Suite of Language Understanding Challenges for Persian | https://arxiv.org/pdf/2012.06154.pdf | https://github.com/persiannlp/parsinlu/tree/master/data/multiple-choice | multi-task (target language) | QA + IR | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | native speakers | its own language | n/a | n/a | crowdsourced | in its own language | fa | 2021 | TACL | NO | combination of university and industry | 3 | NO | YES | | |
ParsiNLU - Query Paraphrasing | ParsiNLU: A Suite of Language Understanding Challenges for Persian | https://arxiv.org/pdf/2012.06154.pdf | https://github.com/persiannlp/parsinlu/tree/master/data/qqp | multi-task (target language) | sentence pair task | YES | 1000~10K | collected from web | native speakers | its own language | google auto complete | automatic translation & expert translation | crowdsourced | in its own language | fa | 2021 | TACL | YES (English) | combination of university and industry | 3 | YES | YES | | |
ParsiNLU - Reading Comprehension | ParsiNLU: A Suite of Language Understanding Challenges for Persian | https://arxiv.org/pdf/2012.06154.pdf | https://github.com/persiannlp/parsinlu/tree/master/data/reading_comprehension | multi-task (target language) | machine reading comprehension | YES | 1000~10K | collected from web | native speakers | its own language | n/a | n/a | crowdsourced | in its own language | fa | 2021 | TACL | NO | combination of university and industry | 3 | YES | YES | | |
ParsiNLU - Sentiment Analysis | ParsiNLU: A Suite of Language Understanding Challenges for Persian | https://arxiv.org/pdf/2012.06154.pdf | https://github.com/persiannlp/parsinlu/tree/master/data/sentiment-analysis | multi-task (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources & collected from web | native speakers | its own language | n/a | n/a | crowdsourced | in its own language | fa | 2021 | TACL | NO | combination of university and industry | 3 | YES | YES | | |
ParsiNLU - Textual Entailment | ParsiNLU: A Suite of Language Understanding Challenges for Persian | https://arxiv.org/pdf/2012.06154.pdf | https://github.com/persiannlp/parsinlu/tree/master/data/entailment | multi-task (target language) | sentence pair task | YES | 1000~10K | collected from Wikipedia& collected from web & curated linguistic resources | native speakers | its own language | n/a | n/a | crowdsourced | in its own language | fa | 2021 | TACL | YES (English) | combination of university and industry | 3 | YES | YES | | |
XGLUE - NTG | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | sentence-level-generation task | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en de fr es ru | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - QG | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | sentence-level-generation task | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en fr de es it pt | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - QAM | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | sentence pair task | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en fr de | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - WPR | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | classification (sentiment analysis) | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en de fr es it pt zh | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - QADSM | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | classification (sentiment analysis) | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en fr de | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - NC | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | classification (sentiment analysis) | PARTIAL | >10K | collected from web | n/a | not mentioned | not mentioned | not clear whether translation is used | not mentioned | not mentioned | en es de fr ru | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - POS Tagging | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | structured prediction | PARTIAL | 1000~10K | curated linguistic resources | n/a | not mentioned | not mentioned | n/a | not mentioned | not mentioned | ar bg de el en es fr hi it nl pl pt ru th tr ur vi zh | 2020 | EMNLP | NO | industry | 57 | YES | YES | | |
XGLUE - NER | “XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation” | https://arxiv.org/pdf/2004.01401.pdf | https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz | cross-lingual transfer | sequence tagging | PARTIAL | 1000~10K | collected from media (news) | people from university | its own language | n/a | n/a | crowdsourced | in its own language | en de es nl | 2020 | EMNLP | YES (English & other language) | industry | 57 | YES | YES | | |
negationminpairs | A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs | https://aclanthology.org/2021.conll-1.19.pdf | https://github.com/mahartmann/negationminpairs | task-oriented (multilingual) | classification (sentiment analysis) | YES | 1000~10K | crowdsourced | “annotated by native speakers (except english), xnli: gethybrid.io” | its own language | alignment | automatic translation | “annotated (authors, linguists)” | in its own language | en bg de fr zh | 2021 | CoNLL | YES (English & other languages) | university | 0 | NO | YES | | |
malayalammixsentiment | A Sentiment Analysis Dataset for Code-Mixed Malayalam-English | https://arxiv.org/pdf/2006.00210v1.pdf | https://github.com/bharathichezhiyan/MalayalamMixSentiment | task-oriented (target language) | classification (sentiment analysis) | YES | >10k | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | en ml | 2020 | *ACL Workshop | NO | university | 48 | YES | YES | | |
TUNIZI | Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis | https://arxiv.org/pdf/2004.14303v1.pdf | https://github.com/chaymafourati/TUNIZI-Sentiment-Analysis-Tunisian-Arabizi-Dataset | task-oriented (target language) | classification (sentiment analysis) | NO | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | ar | 2020 | ICLR | NO | industry | 8 | NO | YES | | |
CoDEx | CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | https://arxiv.org/pdf/2009.07810.pdf | https://github.com/tsafavi/codex | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10k | collected from Wikipedia | n/a | its own language | n/a | n/a | “annotated (authors, linguists) & automatically induced” | English | ar de en es ru zh | 2020 | EMNLP | NO | university | 16 | NO | YES | | |
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank | Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank | http://www.lrec-conf.org/proceedings/lrec2018/pdf/141.pdf | https://github.com/MurathanKurfali/Ted-MDB-Annotations | task-oriented (multilingual) | structured prediction | NO | 1000~10K | collected from web | n/a | its own language | n/a | expert translation | “annotated (authors, linguists)” | in its own language | en de pl pt ru tr | 2018 | LREC | NO | university | 18 | NO | YES | | |
TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks | TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks | https://aclanthology.org/2020.emnlp-main.223.pdf | https://github.com/wanqiulong0923/TED-CDB | task-oriented (target language) | structured prediction | NO | >10k | collected from web | n/a | both | n/a | expert translation | “annotated (authors, linguists)” | in its own language | zh | 2020 | EMNLP | NO | university | 0 | NO | YES | | |
A Dataset for Multi-lingual Epidemiological Event Extraction | A Dataset for Multi-lingual Epidemiological Event Extraction | https://aclanthology.org/2020.lrec-1.509.pdf | https://zenodo.org/record/3709617#.YcCvOhPMITU | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10k | collected from media (news) | n/a | its own language | n/a | n/a | crowdsourced | not mentioned | en fr es pt | 2020 | LREC | NO | university | 3 | NO | YES | | |
Multilingual Culture-Independent Word Analogy Datasets | Multilingual Culture-Independent Word Analogy Datasets | https://aclanthology.org/2020.lrec-1.501.pdf | https://www.clarin.si/repository/xmlui/handle/11356/1261 | task-oriented (multilingual) | other | NO | >10k | collected from web | n/a | its own language | n/a | automatic translation | “annotated (authors, linguists)” | in its own language | en et fi lv lt ru sl sv | 2020 | LREC | NO | university | 6 | | YES | | |
SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines | SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines | https://aclanthology.org/L16-1656.pdf | | task-oriented (multilingual) | classification (non-sentiment analysis) | NO | >10k | collected from web & collected from Wikipedia | n/a | its own language | n/a | n/a | crowdsourced | in its own language | en de ru | 2016 | LREC | NO | university | 2 | NO | NO | | |
20Minuten | A New Dataset and Efficient Baselines for Document-level Text Simplification in German | https://aclanthology.org/2021.newsum-1.16.pdf | | task-oriented (target language) | sentence-level-generation task | YES | >10k | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | de | 2021 | ACL | NO | university | 0 | NO | NO | | |
Spektrum | A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization | https://aclanthology.org/2021.newsum-1.5.pdf | https://github.com/MehwishFatimah/wsd | task-oriented (multilingual) | summarization | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | automatically induced | in its own language | en de | 2021 | EMNLP | NO | university | 0 | NO | YES | | |
A Summarization Dataset of Slovak News Articles | A Summarization Dataset of Slovak News Articles | https://aclanthology.org/2020.lrec-1.830.pdf | https://github.com/NaiveNeuron/sme-sum | task-oriented (target language) | summarization | not mentioned | >10k | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | sk | 2020 | LREC | NO | university | 1 | NO | NO | | |
Liputan6: A Large-scale Indonesian Dataset for Text Summarization | Liputan6: A Large-scale Indonesian Dataset for Text Summarization | https://aclanthology.org/2020.aacl-main.60.pdf | https://github.com/fajri91/sum_liputan6 | task-oriented (target language) | summarization | YES | >10k | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | id | 2020 | AACL | NO | university | 8 | NO | YES | | |
Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque | Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque | https://aclanthology.org/2020.lrec-1.55/ | http://ixa.si.ehu.es/node/12934 | task-oriented (target language) | machine reading comprehension | NO | 1000~10k | collected from Wikipedia | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | eu | 2020 | LREC | NO | university | 7 | NO | YES | | |
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets | A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets | https://aclanthology.org/P15-2001.pdf | Not available | task-oriented (multilingual) | classification (non-sentiment analysis) | not mentioned | <100 | “annotated (authors, linguists)” | n/a | English | n/a | expert translation | crowdsourced | English | en es fr de pt fa | 2015 | ACL | YES (English) | university | 61 | NO | NO | | |
A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining | A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining | https://aclanthology.org/D15-1302.pdf | https://github.com/diegma/trip-maml | task-oriented (multilingual) | classification (sentiment analysis) | YES | 1000~10k | collected from social media or commercial sources | n/a | its own language | n/a | n/a | crowdsourced | in its own language | en es it | 2015 | ACL | YES (English) | university | 10 | NO | YES | | |
“Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers” | “Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers” | https://aclanthology.org/L18-1558.pdf | https://github.com/AlonEirew/cross-doc-event-coref | task-oriented (multilingual) | structured prediction | YES | >10k | collected from web | n/a | its own language | n/a | n/a | crowdsourced | in its own language | zh en es | 2018 | LREC | NO | university | 6 | NO | YES | | |
DanFEVER: claim verification dataset for Danish | DanFEVER: claim verification dataset for Danish | https://aclanthology.org/2021.nodalida-main.47.pdf | https://figshare.com/articles/dataset/DanFEVER_claim_verification_dataset_for_Danish/14380970 | task-oriented (target language) | classification (non-sentiment analysis) | NO | >10k | collected from Wikipedia | n/a | its own language | n/a | n/a | crowdsourced | in its own language | da | 2021 | NoDaLiDa | NO | university | 5 | NO | YES | | |
From Web Crawl to Clean Register-Annotated Corpora | From Web Crawl to Clean Register-Annotated Corpora | https://aclanthology.org/2020.wac-1.3.pdf | https://github.com/TurkuNLP/WAC-XII | task-oriented (multilingual) | classification (non-sentiment analysis) | NO | >10k | collected from web | n/a | its own language | n/a | n/a | crowdsourced | in its own language | fr sv | 2020 | LREC | NO | university | 2 | NO | YES | | |
LIdioms: A Multilingual Linked Idioms Data Set | LIdioms: A Multilingual Linked Idioms Data Set | https://arxiv.org/pdf/1802.08148.pdf | https://github.com/dice-group/LIdioms/blob/master/en/english.ttl | task-oriented (multilingual) | other | NO | 100~1000 | collected from web | n/a | its own language | n/a | n/a | automatically induced | English | en pt it de ru | 2018 | LREC | NO | university | 6 | NO | YES | | |
Mega-COV | Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19 | https://arxiv.org/pdf/2005.06012.pdf | https://github.com/UBC-NLP/megacov/tree/master/tweet_ids | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10K | collected from social media or commercial sources | n/a | English & its own language | n/a | n/a | automatically induced | English & in its own language | | 2021 | EACL | NO | university | 7 | NO | YES | | |
Finnish Rumor Detection Dataset | Never guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baseline | https://arxiv.org/pdf/2106.03389.pdf | https://zenodo.org/record/4697529#.YcKICy-B2tU | task-oriented (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | fi | 2021 | *ACL Workshop | NO | university | 0 | NO | YES | | |
PROMETHEUS | PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors | https://aclanthology.org/L16-1600.pdf | Not available | task-oriented (multilingual) | structured prediction | not mentioned | 1000~10K | curated linguistic resources & collected from social media or commercial sources | n/a | English | n/a | expert translation | “annotated (authors, linguists)” | in its own language | en it | 2016 | LREC | NO | university | 7 | NO | YES | | |
OneSec | Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains | https://aclanthology.org/2020.lrec-1.723.pdf | http://trainomatic.org/data/onesec_lrec.tar.gz | task-oriented (multilingual) | other | YES | >10K | curated linguistic resources & collected from Wikipedia | n/a | its own language | n/a | n/a | automatically induced | in its own language | en it fr de es | 2020 | LREC | NO | university | 4 | NO | YES | | |
StoryDB | StoryDB: Broad Multi-language Narrative Dataset | https://aclanthology.org/2021.eval4nlp-1.4.pdf | https://drive.google.com/drive/folders/1RCWk7pyvIpubtsf-f2pIsfqTkvtV80Yv | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | 1000~10K | collected from Wikipedia | n/a | English & its own language | alignment | n/a | automatically induced | English & its own language | en it fr ru de nl uk pl pt es sv ja he fi eu hy fa no ar id ko vi bg el hu zh da gl th sr hr lb mk ta ms cs ro te ka ca lt sl | 2021 | *ACL Workshop | NO | combination of university and industry | 0 | NO | YES | | |
DReaM | The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages | https://aclanthology.org/2020.lrec-1.110.pdf | “https://spraakbanken.gu.se/korp/?mode=dream#?cqp=%5B%5D&corpus=dream-en-open,dream-de-open,dream-es-open,dream-fr-open,dream-it-open,dream-nl-open,dream-ru-open” | task-oriented (multilingual) | structured prediction | not mentioned | 100~1000 | “collected from curated source (exams, scientific papers, etc)” | n/a | English & its own language | n/a | n/a | “annotated (authors, linguists)” | English & its own language | en fr de es pt ru id nl it zh | 2020 | LREC | NO | university | 5 | NO | YES | | |
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations. | Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations. | https://aclanthology.org/P19-1392.pdf | http://www.grupolys.org/~marcos/pub/collocations.zip | cross-lingual transfer | structured prediction | NO | 1000~10K | curated linguistic resources | n/a | English & its own language | n/a | n/a | “annotated (authors, linguists)” | English & its own language | en pt es | 2019 | ACL | NO | university | 6 | NO | YES | | |
Universal Dependency Annotation for Multilingual Parsing | Universal Dependency Annotation for Multilingual Parsing | https://aclanthology.org/P13-2017.pdf | https://code.google.com/p/uni-dep-tb/ | task-oriented (multilingual) | structured prediction | NO | 1000~10K | curated linguistic resources | not mentioned | English & its own language | parsers | n/a | crowdsourced | English & its own language | en de sv es fr ko | 2013 | ACL | NO | combination of university and industry | 561 | NO | YES | | |
KINNEWS | KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi | https://arxiv.org/pdf/2010.12174.pdf | https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10K | collected from media (news) | n/a | its own language | google auto complete | n/a | crowdsourced | English & in its own language | rw | 2020 | COLING | NO | university | 3 | YES | YES | | |
KIRNEWS | KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi | https://arxiv.org/pdf/2010.12174.pdf | https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | 1000~10K | collected from media (news) | n/a | its own language | google auto complete | n/a | crowdsourced | English & in its own language | rn | 2020 | COLING | NO | university | 3 | YES | YES | | |
SQuAD-es | Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering | https://arxiv.org/pdf/1912.05200.pdf | https://github.com/ccasimiro88/TranslateAlignRetrieve | task-oriented (target language) | machine reading comprehension | YES | >10K | crowdsourced & collected from Wikipedia | amt | English | alignment | automatic translation | crowdsourced | English | es | 2020 | LREC | YES (English) | university | 18 | NO | YES | | |
HEAD-QA: A Healthcare Dataset for Complex Reasoning | HEAD-QA: A Healthcare Dataset for Complex Reasoning | https://aclanthology.org/P19-1092.pdf | http://aghie.github.io/head-qa/ | task-oriented (multilingual) | QA + IR | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | automatic translation | “derived from linguistic resources (wordnet, etc)” | in its own language | es en | 2019 | ACL | NO | university | 10 | YES | YES | | |
RuCoS | Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian | https://aclanthology.org/2020.coling-main.570.pdf | https://github.com/RussianNLP/RussianSuperGLUE | task-oriented (target language) | machine reading comprehension | YES | >10K | collected from web | toloka | its own language | tf-idf generated & other | n/a | crowdsourced & automatically induced | in its own language | ru | 2020 | COLING | YES (other language) | combination of university and industry | 3 | YES | YES | | |
MuSeRC | Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian | https://aclanthology.org/2020.coling-main.570.pdf | https://github.com/RussianNLP/RussianSuperGLUE | task-oriented (target language) | machine reading comprehension | YES | 1000~10K | “collected from media (news) & collected from curated source (exams, scientific papers, etc)” | toloka | its own language | n/a | n/a | crowdsourced | in its own language | ru | 2020 | COLING | YES (other language) | combination of university and industry | 3 | YES | YES | | |
BI-139 | CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval | https://aclanthology.org/2020.emnlp-main.340.pdf | https://www.cs.jhu.edu/~shuosun/clirmatrix/ | task-oriented (multilingual) | QA + IR | YES | >10K | collected from Wikipedia | n/a | English & its own language | alignment | n/a | automatically induced | in its own language | af als am an ar arz ast az azb ba bar be bg bn bpy br bs bug ca cdo ce ceb ckb cs cv cy da de diq el eml eo es et eu fa fi fo fr fy ga gd gl gu he hi hr hsb ht hu hy ia id ilo io is it ja jv ka kk kn ko ku ky la lb li lmo lt lv mai mg mhr min mk ml mn mr mrj ms my mzn nap nds ne new nl nn no oc or os pa pl pms pnb ps pt qu ro ru sa sah scn sco sd sh si sk sl sq sr su sv sw szl ta te tg th tl tr tt uk ur uz vec vi vo wa war wuu xmf yi yo zh | 2020 | EMNLP | NO | university | 3 | NO | YES | | |
MULTI-8 | CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval | https://aclanthology.org/2020.emnlp-main.340.pdf | https://www.cs.jhu.edu/~shuosun/clirmatrix/ | task-oriented (multilingual) | QA + IR | YES | >10K | collected from Wikipedia | n/a | English & its own language | alignment | n/a | automatically induced | in its own language | ar de en es fr ja ru zh | 2020 | EMNLP | NO | university | 3 | NO | YES | | |
GerDaLIR: A German Dataset for Legal Information Retrieval | GerDaLIR: A German Dataset for Legal Information Retrieval | https://aclanthology.org/2021.nllp-1.13.pdf | https://github.com/lavis-nlp/GerDaLIR | task-oriented (target language) | QA + IR | YES | >10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | automatically induced | in its own language | de | 2021 | *ACL Workshop | NO | university | 0 | NO | YES | | |
MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network | MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network | https://arxiv.org/pdf/2106.07352.pdf | not available | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | >10K | collected from Wikipedia & collected from web | n/a | English & its own language | n/a | n/a | automatically induced | in its own language | | 2021 | ACL | YES (English & other language) | industry | 2 | NO | NO | | |
A Turkish Dataset for Gender Identification of Twitter Users | A Turkish Dataset for Gender Identification of Twitter Users | https://aclanthology.org/W19-4023v1.pdf | https://cloud.iyte.edu.tr/index.php/s/5DhqdlUCCdB60qG | task-oriented (target language) | classification (non-sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | university students & academic personnel | its own language | n/a | n/a | crowdsourced | in its own language | tr | 2019 | *ACL Workshop | NO | university | 11 | NO | YES | | |
AnCora-Ca | AnCora: Multilevel Annotated Corpora for Catalan and Spanish | http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf | http://clic.ub.edu/corpus/en | task-oriented (target language) | sequence tagging & structured prediction | NO | >10K | collected from media (news) | not mentioned | its own language | n/a | n/a | “annotated (authors, linguists) & automatically induced” | in its own language | ca | 2008 | LREC | YES (other language) | university | 345 | PARTIAL | YES | | |
AnCora-Es | AnCora: Multilevel Annotated Corpora for Catalan and Spanish | http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf | http://clic.ub.edu/corpus/en | task-oriented (target language) | sequence tagging & structured prediction | NO | >10K | collected from media (news) | not mentioned | its own language | n/a | n/a | “annotated (authors, linguists) & automatically induced” | in its own language | es | 2008 | LREC | NO | university | 345 | NO | YES | | |
NCTTI | Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels | https://aclanthology.org/2021.acl-long.212.pdf | https://github.com/marcospln/nctti | task-oriented (multilingual) | sequence tagging | YES | 1000~10K | collected from web & collected from Wikipedia | amt & online platforms for portuguese in cordeiro et al (2019) | its own language | parsers | n/a | “crowdsourced & annotated (authors, linguists)” | in its own language | en pt | 2021 | ACL | YES (English & other language) | university | 2 | NO | YES | | |
Books of Hours. the First Liturgical Data Set for Text Segmentation. | Books of Hours. the First Liturgical Data Set for Text Segmentation. | https://aclanthology.org/2020.lrec-1.97.pdf | | task-oriented (target language) | sequence tagging | NO | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | other | n/a | “annotated (authors, linguists)” | in its own language | la | 2020 | LREC | NO | university | 1 | NO | YES | | |
COSTRA 1.0: A Dataset of Complex Sentence Transformations | COSTRA 1.0: A Dataset of Complex Sentence Transformations | https://aclanthology.org/2020.lrec-1.434/ | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3123 | task-oriented (target language) | sentence pair task | NO | 1000~10K | collected from media (news) | n/a | its own language | alignment | n/a | crowdsourced | in its own language | cs | 2020 | LREC | NO | university | 1 | NO | YES | | |
Fine-grained Named Entity Annotation for Finnish | Fine-grained Named Entity Annotation for Finnish | https://aclanthology.org/2021.nodalida-main.14/ | https://github.com/TurkuNLP/turku-one | task-oriented (target language) | sequence tagging | YES | >10K | | n/a | its own language | n/a | n/a | “derived from linguistic resources (wordnet, etc)” | in its own language | fi | 2021 | NoDaLiDa | YES (other language) | university | 0 | NO | YES | | |
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors | GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors | https://aclanthology.org/2020.lrec-1.835/ | https://github.com/mhagiwara/github-typo-corpus | task-oriented (multilingual) | sequence tagging | NO | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | automatically induced | in its own language | en zh ja ru fr de pt es ko hi | 2020 | LREC | NO | combination of university and industry | 13 | NO | YES | | |
Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus | Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus | https://aclanthology.org/2020.isa-1.2/ | | task-oriented (target language) | sequence tagging | NO | >10K | collected from media (news) | not mentioned | its own language | n/a | n/a | crowdsourced | in its own language | hi | 2020 | *ACL Workshop | NO | university | 3 | NO | NO | | |
K-SNACS: Annotating Korean Adposition Semantics | K-SNACS: Annotating Korean Adposition Semantics | https://aclanthology.org/2020.dmr-1.6/ | https://github.com/jdch00/k-snacs | task-oriented (target language) | sequence tagging | NO | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | ko | 2020 | *ACL Workshop | NO | university | 4 | NO | NO | | |
“MassiveSumm: a very large-scale, very multilingual, news summarisation dataset” | “MassiveSumm: a very large-scale, very multilingual, news summarisation dataset” | https://aclanthology.org/2021.emnlp-main.797/ | https://github.com/natschluter/massive-summ | task-oriented (multilingual) | summarization | YES | >10K | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | af am ar as ay az bm bn bo bs bg ca cs cy da de el en eo fa fil fr ff ga gu ht ha he hi hr hu hy ig id is it ja kn ka km rw ky ko ku lo lv ln lt ml mr mk mg mn my nd ne nl or om pa pl pt prs ps ro rn ru si sk sl sn so es sq sr sw sv ta te tet tg th ti tr uk ur uz vi xh yo yue zh bi gd | 2021 | EMNLP | NO | university | 0 | NO | NO | | |
Models and Datasets for Cross-Lingual Summarisation | Models and Datasets for Cross-Lingual Summarisation | https://aclanthology.org/2021.emnlp-main.742/ | https://github.com/lauhaide/clads | task-oriented (multilingual) | summarization | YES | >10K | collected from Wikipedia | n/a | its own language | alignment | n/a | automatically induced | in its own language | cs fr en de | 2021 | EMNLP | NO | university | 0 | YES | YES | | |
Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish | Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish | https://aclanthology.org/W19-6143/ | https://github.com/UniversalDependencies/UD_Danish-DDT | task-oriented (target language) | sequence tagging | YES | 1000~10K | curated linguistic resources | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | da | 2019 | *ACL Workshop | YES (English) | university | 8 | YES | YES | | |
Universal Joy A Data Set and Results for Classifying Emotions Across Languages | Universal Joy A Data Set and Results for Classifying Emotions Across Languages | https://aclanthology.org/2021.wassa-1.7.pdf | https://github.com/sotlampr/universal-joy | task-oriented (multilingual) | classification (sentiment analysis) | YES | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | automatically induced | in its own language | bn zh de en fr hi id it km my nl pt ro es tl th vi ms | 2021 | *ACL Workshop | NO | university | 6 | NO | YES | | |
X-Fact: A New Benchmark Dataset for Multilingual Fact Checking | X-Fact: A New Benchmark Dataset for Multilingual Fact Checking | https://aclanthology.org/2021.acl-short.86/ | https://github.com/utahnlp/x-fact/ | task-oriented (multilingual) | classification (non-sentiment analysis) | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | automatically induced | in its own language | si nl mr no tr hi id it sr ru fa sq gu ka pl az bn ta de es pa fr ro pt ar | 2021 | ACL | NO | university | 3 | NO | YES | | |
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning | XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning | https://aclanthology.org/2020.emnlp-main.185/ | https://github.com/cambridgeltl/xcopa | cross-lingual transfer | classification (non-sentiment analysis) | NO | 100~1000 | “collected from curated source (exams, scientific papers, etc)” | n/a | English | n/a | expert translation | automatically induced | English | et ht id it qu sw ta th tr vi zh | 2020 | EMNLP | YES (English) | university | 36 | YES | YES | | |
KLEJ - NKJP-NER | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | classification (non-sentiment analysis) | YES | >10K | “collected from curated source (exams, scientific papers, etc)” | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - CBD | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | classification (sentiment analysis) | YES | >10K | collected from social media or commercial sources | not mentioned | its own language | n/a | n/a | “crowdsourced & annotated (authors, linguists)” | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - PolEmo2.0-IN | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - PolEmo2.0-OUT | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | “annotated (authors, linguists)” | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - Czy wiesz? | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | QA + IR | YES | 1000~10K | collected from Wikipedia | n/a | its own language | RTT & greedy sentence matching | n/a | automatically induced | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - PSC | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | sentence-level-generation task | YES | 1000~10K | collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
KLEJ - AR | KLEJ: Comprehensive Benchmark for Polish Language Understanding | https://aclanthology.org/2020.acl-main.111/ | https://klejbenchmark.com/ | multi-task (target language) | classification (sentiment analysis) | YES | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | automatically induced | in its own language | pl | 2020 | ACL | YES (other language) | university | 22 | YES | YES | | |
PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone | PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone | https://aclanthology.org/L14-1240/ | | task-oriented (multilingual) | classification (sentiment analysis) | NO | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | not mentioned | original language | n/a | n/a | crowdsourced | in its own language | en de | 2014 | LREC | NO | combination of university and industry | 1 | NO | YES | | |
RussianSuperGLUE-LiDiRus | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | sentence pair task | NO | 1000~10K | collected from media (news) | n/a | English | n/a | expert translation | “annotated (authors, linguists)” | English | ru | 2020 | EMNLP | YES (English) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-RUSSE | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | other | YES | >10K | collected from Wikipedia & curated linguistic resources | toloka | its own language | n/a | n/a | crowdsourced | in its own language | ru | 2020 | EMNLP | YES (other language) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-PARus | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | classification (non-sentiment analysis) | YES | 100~1000 | “collected from web & collected from curated source (exams, scientific papers, etc)” | amt | English | n/a | expert translation | crowdsourced | English | ru | 2020 | EMNLP | YES (English) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-TERRa | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | sentence pair task | YES | 1000~10K | collected from media (news) & collected from web | n/a | its own language | n/a | n/a | “automatically induced & annotated (authors, linguists)” | in its own language | ru | 2020 | EMNLP | YES (other language) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-RCB | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | sentence pair task | YES | 1000~10K | collected from media (news) & collected from web | n/a | its own language | n/a | n/a | “automatically induced & annotated (authors, linguists)” | in its own language | ru | 2020 | EMNLP | YES (other language) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-RWSD | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | structured prediction | YES | 100~1000 | “collected from curated source (exams, scientific papers, etc)” | n/a | English | n/a | details not provided | “annotated (authors, linguists)” | English | ru | 2020 | EMNLP | YES (English) | combination of university and industry | 11 | YES | YES | | |
RussianSuperGLUE-DaNetQA | RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark | https://aclanthology.org/2020.emnlp-main.381/ | https://github.com/RussianNLP/RussianSuperGLUE | multi-task (target language) | machine reading comprehension | YES | 100~1000 | collected from Wikipedia & crowdsourced | toloka | its own language | other | n/a | “crowdsourced & annotated (authors, linguists)” | in its own language | ru | 2020 | EMNLP | NO | combination of university and industry | 11 | YES | YES | | |
Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages | Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages | https://arxiv.org/abs/2103.00854 | https://github.com/rajaswa/indic-syntax-evaluation | task-oriented (target language) | structured prediction | YES | >10K | curated linguistic resources | n/a | original language | n/a | n/a | “derived from linguistic resources (wordnet, etc)” | in its own language | hi ta | 2021 | *ACL Workshop | YES (other language) | university | 0 | NO | YES | | |
IndicNLPSuite-Soham News Article Classification | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | classification (non-sentiment analysis) | YES | 1000~10K | collected from media (news) | n/a | original language | n/a | n/a | automatically induced | in its own language | pa bn or gu mr kn te ml ta | 2020 | Findings | | combination of university and industry | 59 | YES | YES | | |
IndicNLPSuite-iNLTK Headline Classification | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | classification (non-sentiment analysis) | YES | 1000~10K | collected from media (news) | n/a | original language | n/a | n/a | automatically induced | in its own language | pa bn or gu mr kn te ml ta | 2020 | Findings | | combination of university and industry | 59 | YES | YES | | |
IndicNLPSuite-AI4Bharat Cloze-style Question Answering | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | machine reading comprehension | YES | >10K | collected from Wikipedia | n/a | original language | n/a | n/a | automatically induced | in its own language | pa hi bn or as gu mr kn te ml ta | 2020 | Findings | | combination of university and industry | 59 | YES | YES | | |
IndicNLPSuite-AI4Bharat Winograd Natural Language Inference | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | classification (non-sentiment analysis) | NO | 100~1000 | “collected from curated source (exams, scientific papers, etc)” | n/a | English | n/a | author translation | “annotated (authors, linguists)” | English | hi mr gu | 2020 | Findings | YES (English) | combination of university and industry | 59 | YES | YES | | |
IndicNLPSuite-AI4Bharat Choice of Plausible Alternatives | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | classification (non-sentiment analysis) | NO | 100~1000 | “annotated (authors, linguists)” | n/a | English | n/a | author translation | “annotated (authors, linguists)” | English | hi mr gu | 2020 | Findings | YES (English) | combination of university and industry | 59 | YES | YES | | |
IndicNLPSuite-WikiAnnNER | “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” | https://aclanthology.org/2020.findings-emnlp.445/ | https://indicnlp.ai4bharat.org/home/ | multi-task (target language) | sequence tagging | YES | >10K | collected from Wikipedia | n/a | English | aligned | automatic translation | automatically induced | English | ace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh | 2020 | Findings | YES (other language) | combination of university and industry | 59 | YES | YES | | |
CVIT-MKB Cross-lingual Sentence Retrieval | A Multilingual Parallel Corpora Collection Effort for Indian Languages | https://aclanthology.org/2020.lrec-1.462.pdf | https://anoopkunchukuttan.github.io/indic_nlp_library/ | task-oriented (multilingual) | QA + IR | YES | >10K | collected from web | n/a | its own language | alignment | n/a | automatically induced | in its own language | hi te ta ml gu kn ur bn or mr pa as en | 2020 | LREC | | university | 18 | YES | YES | | |
ACTSA | ACTSA: Annotated Corpus for Telugu Sentiment Analysis | https://aclanthology.org/W17-5408/ | https://drive.google.com/drive/folders/0B8HHvMMuHYdWdnJZZl9rWkY5bk0?usp=sharing | task-oriented (target language) | classification (sentiment analysis) | YES | 1000~10K | collected from media (news) | native speakers | original language | n/a | n/a | crowdsourced | in its own language | te | 2017 | *ACL Workshop | | university | 31 | YES | NO | | |
MIDAS Discourse | An Annotated Dataset of Discourse Modes in Hindi Stories | https://aclanthology.org/2020.lrec-1.149.pdf | https://github.com/midas-research/hindi-discourse | task-oriented (target language) | classification (non-sentiment analysis) | YES | 1000~10K | “collected from curated source (exams, scientific papers, etc)” | native speakers | original language | n/a | n/a | crowdsourced | in its own language | hi | 2020 | LREC | | combination of university and industry | 4 | YES | YES | | |
A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment | A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment | https://aclanthology.org/2020.lrec-1.395.pdf | https://github.com/elexis-eu/MWSA | cross-lingual transfer | other | NO | 1000~10K | curated linguistic resources | n/a | original language | n/a | n/a | “derived from linguistic resources (wordnet, etc)” | in its own language | eu bg da nl en et de hu ga it sr sl es pt ru | 2020 | LREC | YES (other language) | university | 12 | NO | YES | | |
Multilingual corpora with coreferential annotation of person entities | Multilingual corpora with coreferential annotation of person entities | https://aclanthology.org/L14-1701/ | https://gramatica.usc.es/~marcos/lrec.tar.bz2 | cross-lingual transfer | structured prediction | NO | 1000~10K | collected from Wikipedia & collected from media (news) | n/a | original language | n/a | n/a | crowdsourced | in its own language | gl pt es | 2014 | LREC | | university | 21 | NO | YES | | |
MGAD: Multilingual Generation of Analogy Datasets | MGAD: Multilingual Generation of Analogy Datasets | https://aclanthology.org/L18-1320.pdf | https://github.com/rutrastone/MGAD | task-oriented (multilingual) | other | NO | >10K | template-based | n/a | its own language | n/a | n/a | automatically induced | in its own language | hi ar ru | 2018 | LREC | | university | 8 | NO | YES | | |
“The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation” | “The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation” | https://arxiv.org/abs/2011.03287 | https://yovakem.github.io/#ApposCorpus | task-oriented (multilingual) | sentence-level-generation task | YES | >10K | collected from Wikipedia & collected from media (news) | n/a | its own language | n/a | n/a | automatically induced | in its own language | en es de pl | 2020 | COLING | | combination of university and industry | 0 | NO | YES | | |
“The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation” | “The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation” | https://aclanthology.org/2020.inlg-1.10/ | https://github.com/TallChris91/CACAPO-Dataset | task-oriented (multilingual) | sentence-level-generation task | YES | >10K | collected from media (news) | n/a | its own language | alignment | n/a | “annotated (authors, linguists)” | in its own language | nl en | 2020 | INLG | | university | 2 | NO | YES | | |
A Dataset and Baselines for Multilingual Reply Suggestion | A Dataset and Baselines for Multilingual Reply Suggestion | https://arxiv.org/abs/2106.02017 | https://github.com/zhangmozhi/mrs | task-oriented (multilingual) | sentence-level-generation task | YES | >10K | collected from social media or commercial sources | n/a | its own language | n/a | n/a | automatically induced | in its own language | en es de pt fr ja sv it nl ru | 2021 | ACL | | combination of university and industry | 1 | NO | YES | | |
| | | | | | | | | | | | | | | | | | | | | | | | |
| | | task-oriented (multilingual) | 61 | | | | | | | | | | | | | | | | | | | | |
| | | cross-lingual transfer | 21 | | | | | | | | | | | | | | | | | | | | |
| | | task-oriented (target language) | 37 | | | | | | | | | | | | | | | | | | | | |
| | | multi-task (target language) | 38 | | | | | | | | | | | | | | | | | | | | |