Dataset Database

dataset name	title	link to paper	data link	motivation of the paper writer (how they were originally intended)	task type	has train data?	data size (rough avg # of examples PER language, excluding English)	input data source	crowdsource platforms / background (if any)	original language	input data - automatic processing	translation	label source	label language (at collection time / language used by annotators)	language	publication year	published venue	reusing existing datasets?	who created the dataset?	# citations	in_huggingface	dataset released?
A New Dataset for Natural Language Inference from Code-mixed Conversations	A New Dataset for Natural Language Inference from Code-mixed Conversations	https://arxiv.org/pdf/2004.05051.pdf		task-oriented (multilingual)	sentence pair task	not mentioned	100~1000	“collected from curated source (exams, scientific papers, etc)”	not mentioned	its own language	n/a	n/a	crowdsourced	its own language	en hi	2020	LREC	NO	combination of university and industry	0	NO	NO
MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization	MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization	https://aclanthology.org/2021.eacl-main.146.pdf		task-oriented (multilingual)	summarization	NO	>10k	collected from media (news)	humanitarian experts	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	en fr es	2021	EACL	NO	combination of university and industry	0	NO	NO
A Multilingual Wikified Data Set of Educational Material	A Multilingual Wikified Data Set of Educational Material	https://aclanthology.org/L18-1073.pdf	Not available	cross-lingual transfer	classification (non-sentiment analysis)	not mentioned	1000~10k	“collected from curated source (exams, scientific papers, etc)”	crowdflower	its own language	alignment	automatic translation	crowdsourced	in its own language	bg cs de el hr it nl pl pt ru zh	2018	LREC	NO	combination of university and industry	0	NO	NO
Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)	Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)	https://aclanthology.org/Y11-1038.pdf	http://linguistics.hss.ntu.edu.sg/ResearchinLMS/Pages/NTUMultilingualCorpus	cross-lingual transfer	structured prediction	not mentioned	1000~10k	collected from web	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	en zh ja ko id vi	2011	“Pacific Asia Conference on Language, Information and Computation”	NO	university	51	NO	NO
XNLI	XNLI: Evaluating Cross-lingual Sentence Representations	https://arxiv.org/pdf/1809.05053.pdf	https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip	cross-lingual transfer	sentence pair task	NO	1000~10K	crowdsourced	gethybrid.io	English	alignment	crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English	en fr es de el bg ru tr ar vi th zh hi sw ur	2018	EMNLP	YES (English)	industry	502	YES	YES
PAWS-X	PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification	https://arxiv.org/pdf/1908.11828.pdf	https://github.com/google-research-datasets/paws	cross-lingual transfer	sentence pair task	NO	1000~10K	crowdsourced	not mentioned (maybe google internal according to the acknowledgements)	English	n/a	automatic translation & crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English	fr es de zh ja ko	2019	EMNLP	YES (English)	industry	91	YES	YES
MLSUM	MLSUM: The Multilingual Summarization Corpus	https://arxiv.org/pdf/2004.14900v1.pdf	https://github.com/huggingface/datasets/tree/master/datasets/mlsum	task-oriented (multilingual)	summarization	YES	>10K	collected from media (news)	n/a	its own language	alignment	n/a	automatically induced	in its own language	fr de es ru tr	2020	EMNLP	NO	university	20	YES	YES
XL-WiC	XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization	https://aclanthology.org/2020.emnlp-main.584.pdf	https://pilehvar.github.io/xlwic/	cross-lingual transfer	other	PARTIAL	1000~10K	curated linguistic resources	n/a	its own language	n/a	n/a	“derived from linguistic resources (wordnet, etc)”	in its own language	bg da de et fa fr hr it ja ko nl zh en	2020	EMNLP	YES (English & other language)	university	14	YES	YES
MLQA	MLQA: Evaluating Cross-lingual Extractive Question Answering	https://arxiv.org/abs/1910.07475	https://github.com/facebookresearch/MLQA	cross-lingual transfer	machine reading comprehension	NO	1000~10K	crowdsourced & collected from Wikipedia	amt	English	alignment	crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	in its own language	en de es ar zh vi hi	2020	ACL	YES (English & other language)		150	YES	YES
XQuAD	On the Cross-lingual Transferability of Monolingual Representations	https://arxiv.org/abs/1910.11856	https://github.com/deepmind/xquad	cross-lingual transfer	machine reading comprehension	NO	1000~10K	crowdsourced & collected from Wikipedia	amt	English	n/a	crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English	en es de el ru tr ar vi th zh hi	2019	EMNLP	YES (English)	industry	218	YES	YES
TyDi QA	TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages	https://arxiv.org/abs/2003.05002	https://ai.google.com/research/tydiqa/	task-oriented (multilingual)	machine reading comprehension	YES	>10K	crowdsourced	not mentioned	its own language	n/a	n/a	crowdsourced	its own language	ar bn fi id ja sw ko ru te th	2020	TACL	NO	industry	118	YES	YES
XOR QA	XOR QA: Cross-lingual Open-Retrieval Question Answering	https://arxiv.org/pdf/2010.11856.pdf	https://nlp.cs.washington.edu/xorqa/	task-oriented (multilingual)	QA + IR	YES	>10K	crowdsourced	amt (and maybe undergrad students)	its own language	n/a	crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English	ar bn fi ja ko te ru	2021	NAACL	YES (other language)	combination of university and industry	13	YES	YES
XQA	XQA: A Cross-lingual Open-domain Question Answering Dataset	https://aclanthology.org/P19-1227/	http://github.com/thunlp/XQA	task-oriented (multilingual)	QA + IR	PARTIAL	1000~10K	template-based	n/a	its own language	n/a	n/a	automatically induced	its own language	en zh fr de pl pt ru ta uk	2019	ACL	NO	university	41	NO	YES
MKQA	MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering	https://arxiv.org/abs/2007.15207	https://github.com/apple/ml-mkqa	cross-lingual transfer	QA + IR	NO	1000~10K	crowdsourced	tryrating	English	n/a	crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English	ar da de en es fi fr he hu it ja ko km ms nl no pl pt ru sv th tr vi zh	2021	TACL	YES (English)	industry	19	YES	YES
POS-tagged Arabic tweets for four dialects	Multi-Dialect Arabic POS Tagging: A CRF Approach	http://www.lrec-conf.org/proceedings/lrec2018/pdf/562.pdf	https://huggingface.co/datasets/arabic_pos_dialect	task-oriented (target language)	sequence tagging	YES	100~1000	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	ar egy lev glf mgr	2018	LREC	NO	combination of university and industry	25	YES	YES
WikiANN	Cross-lingual Name Tagging and Linking for 282 Languages	https://www.aclweb.org/anthology/P17-1178	https://huggingface.co/datasets/wikiann	task-oriented (multilingual)	sequence tagging	YES	1000~10K	collected from Wikipedia	n/a	English	aligned	automatic translation	automatically induced	English	ace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh	2017	ACL	NO	university	71	YES	YES
MFAQ	MFAQ: a Multilingual FAQ Dataset	https://arxiv.org/abs/2109.12870	https://huggingface.co/datasets/clips/mfaq	task-oriented (multilingual)	machine reading comprehension	YES	>10K	collected from web	n/a	its own language	n/a	n/a	automatically induced	in its own language	cs da de en es fi fr he hr hu id it nl no pl pt ro ru sv tr vi	2021	EMNLP	NO	industry	0	YES	YES
XL-Sum	XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages	https://aclanthology.org/2021.findings-acl.413/	https://github.com/csebuetnlp/xl-sum	task-oriented (multilingual)	summarization	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	am ar az bn my zh en fr gu ha hi ig id ja rn ko ky mr ne om ps fa pcm pt pa ru gd sr si so es sw ta te th ti tr uk ur uz vi cy yo	2021	ACL	NO	combination of university and industry	3	YES	YES
OntoNotes 5.0	OntoNotes	https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf	https://catalog.ldc.upenn.edu/LDC2013T19	task-oriented (multilingual)	structured prediction	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	en cm ar zh	2013	N/A	NO	university	N/A	NO	YES
euronews	An Open Corpus for Named Entity Recognition in Historic Newspapers	https://aclanthology.org/L16-1689.pdf	https://github.com/EuropeanaNewspapers/ner-corpora	task-oriented (multilingual)	sequence tagging	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	crowdsourced	in its own language	de fr nl	2016	LREC	NO	industry	21	YES	YES
exams	EXAMS: A Multi-Subject High School Examinations Dataset for Cross-Lingual and Multilingual Question Answering	https://arxiv.org/pdf/2011.03080.pdf	https://github.com/mhardalov/exams-qa	task-oriented (multilingual)	machine reading comprehension	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	automatically induced	in its own language	ar bg de es fr hr hu it lt mk pl pt sq sr tr vi	2020	EMNLP	NO	university	6	YES	YES
hope_edi	“HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion”	https://aclanthology.org/2020.peoples-1.5/	https://github.com/huggingface/datasets/blob/master/datasets/hope_edi/README.md	task-oriented (multilingual)	classification (sentiment analysis)	YES	>10K	collected from social media or commercial sources	google form	its own language	n/a	n/a	crowdsourced	its own language	en ml ta	2021	EACL	NO	university	40	YES	YES
kan_hope	Hope Speech detection in under-resourced Kannada language	https://arxiv.org/abs/2108.04616#	https://github.com/adeepH/kan_hope	task-oriented (target language)	classification (sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	its own language	en kn	2021		NO	university	3	YES	YES
masakhaner	MasakhaNER: Named Entity Recognition for African Languages	https://arxiv.org/pdf/2103.11811.pdf	https://github.com/masakhane-io/masakhane-ner/	task-oriented (multilingual)	sequence tagging	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	crowdsourced	its own language	am ha ig rw lg luo pcm sw wo yo	2021	TACL	NO	combination of university and industry	1	YES	YES
multi_eurlex	MultiEURLEX – A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer	https://arxiv.org/abs/2109.00904	https://github.com/huggingface/datasets/blob/master/datasets/multi_eurlex/README.md	cross-lingual transfer	classification (non-sentiment analysis)	NO	>10K	collected from curated source	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	en da de nl sv bg cs hr pl sk sl es fr it pt ro et fi hu lt lv el mt	2021	EMNLP	NO	combination of university and industry	1	YES	YES
nchlt	Developing Text Resources for Ten South African Languages	http://www.lrec-conf.org/proceedings/lrec2014/pdf/1151_Paper.pdf	https://repo.sadilar.org/handle/20.500.12185/7/discover?filtertype_0=database&filtertype_1=title&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Monolingual+Text+Corpora%3A+Annotated&filtertype=project&filter_relational_operator=equals&filter=NCHLT+Text+II	task-oriented (multilingual)	sequence tagging	YES	>10k	collected from web & collected from curated source	n/a	its own language	n/a	n/a	automatically induced	in its own language	af nr nso ss tn ts ve xh zu	2014	LREC	NO	university	68	YES	YES
offenseval_dravidian	“Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada”	https://aclanthology.org/2021.dravidianlangtech-1.17.pdf	N/A	task-oriented (multilingual)	classification (sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	automatically induced	in its own language	en kn ml ta	2021	EACL	NO	university	3	YES	YES
sem_eval_2018_task_1	SemEval-2018 Task 1: Affect in Tweets	http://saifmohammad.com/WebDocs/semeval2018-task1.pdf	https://competitions.codalab.org/competitions/20948	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	en ar es	2018	*ACL Workshop	NO	university	427	YES	YES
stsb_multi_mt	N/A	N/A	https://github.com/PhilipMay/stsb-multi-mt	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	1000~10K	collected from media (news)	n/a	English	n/a	automatic translation	automatically induced	English	en de es fr it nl pl pt ru zh	2021	N/A	YES (English)	industry	N/A	YES	YES
20Minuten	A New Dataset and Efficient Baselines for Document-level Text Simplification in German	https://aclanthology.org/2021.newsum-1.16/	https://github.com/ZurichNLP/20Minuten	task-oriented (target language)	sentence-level-generation task	YES	>10k	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	de	2021	EMNLP	NO	university	0	NO	YES
XFORMAL	“Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer”	https://aclanthology.org/2021.naacl-main.256.pdf	https://github.com/Elbria/xformal-FoST	task-oriented (multilingual)	sentence-level-generation task	NO	1000~10k	collected from web	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	pt fr it	2021	NAACL	NO	combination of university and industry	4	NO	YES		yes
Mr. TyDi	Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval	https://arxiv.org/abs/2108.08787	https://github.com/castorini/mr.tydi	task-oriented (multilingual)	machine reading comprehension	YES	1000~10k	crowdsourced	not mentioned (from tydi)	its own language	n/a	n/a	crowdsourced	in its own language	ar bn en fi id ja ko ru sw te th	2021	*ACL Workshop	YES (English & other language)	university	0	NO	YES
XWikis	Models and Datasets for Cross-Lingual Summarisation	https://aclanthology.org/2021.emnlp-main.742/	https://github.com/lauhaide/clads/blob/main/fairseq2020/examples/clads/README.md	task-oriented (multilingual)	summarization	YES	>10k	collected from Wikipedia	n/a	its own language	n/a	n/a	automatically induced	in its own language	cs fr en de	2021	EMNLP	NO	university	0	NO	YES
MultiHumES	MultiHumES: Multilingual Humanitarian Dataset for Extractive Summarization	https://aclanthology.org/2021.eacl-main.146.pdf	https://deephelp.zendesk.com/hc/en-us/sections/360011925552-MultiHumES	task-oriented (multilingual)	summarization	YES	1000~10k	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	en fr es	2021	EACL	NO	combination of university and industry	0	NO	YES
SMiLER	Multilingual Entity and Relation Extraction Dataset and Model	https://aclanthology.org/2021.eacl-main.166.pdf	https://github.com/samsungnlp/smiler/	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10k	collected from Wikipedia	n/a	English	n/a	automatic translation	automatically induced	in its own language	it fr de pt es ko	2021	EACL	NO	combination of university and industry	0	NO	YES
swiss_judgment_prediction	Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark	https://aclanthology.org/2021.nllp-1.3.pdf	https://github.com/JoelNiklaus/SwissCourtRulingCorpus	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10k	collected from curated sources	n/a	its own language	n/a	n/a	crowdsourced	English	de fr it	2021	*ACL Workshop	NO	university	0	YES	YES
tamilmixsentiment	Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text	https://aclanthology.org/2020.sltu-1.28/	https://dravidian-codemix.github.io/2020/index.html	task-oriented (target language)	classification (sentiment analysis)	YES	>10k	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	en ta	2020	*ACL Workshop	NO	university	106	YES	YES
C^3	Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension	https://arxiv.org/pdf/1904.09679	https://dataset.org/c3/	task-oriented (target language)	machine reading comprehension	YES	>10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	alignment	n/a	automatically induced	in its own language	zh	2019	TACL	NO	combination of university and industry	20	YES	YES	* in huggingface as part of clue
ChID	ChID: A Large-scale Chinese IDiom Dataset for Cloze Test	https://aclanthology.org/P19-1075.pdf	https://drive.google.com/drive/folders/1qdcMgCuK9d93vLVYJRvaSLunHUsGf50u?usp=sharing	task-oriented (target language)	machine reading comprehension	YES	>10K	collected from media (news) & collected from web	n/a	its own language	word embedding similarity scores	n/a	automatically induced	in its own language	zh	2019	ACL	YES (other language)	university	33	YES	YES	* in huggingface as part of clue
CLUE - IFLYTEK Long Text classification	CLUE: A Chinese Language Understanding Evaluation Benchmark	https://arxiv.org/abs/2004.05986	https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip	multi-task (target language)	classification (sentiment analysis)	YES	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	not mentioned	in its own language	zh	2020	COLING	NO	industry	59	NO	YES
CLUE - Ant Financial Question Matching Corpus	CLUE: A Chinese Language Understanding Evaluation Benchmark	https://arxiv.org/abs/2004.05986	https://storage.googleapis.com/cluebenchmark/tasks/afqmc_public.zip	multi-task (target language)	sentence pair task	YES	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	not mentioned	in its own language	zh	2020	COLING	YES (other language)	individual researchers	59	YES	YES
CLUE - Chinese Scientific Literature	CLUE: A Chinese Language Understanding Evaluation Benchmark	https://arxiv.org/abs/2004.05986	https://storage.googleapis.com/cluebenchmark/tasks/csl_public.zip	multi-task (target language)	sentence pair task	YES	>10K	curated linguistic resources	n/a	its own language	tf-idf generated	n/a	automatically induced	in its own language	zh	2020	COLING	NO	individual researchers	59	NO	YES	* citation & published venue is clue’s because the dataset itself wasn’t published
CLUE - CLUEWSC 2020	CLUE: A Chinese Language Understanding Evaluation Benchmark	https://arxiv.org/abs/2004.05986	https://storage.googleapis.com/cluebenchmark/tasks/cluewsc2020_public.zip	multi-task (target language)	structured prediction	YES	1000~10K	curated linguistic resources	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	zh	2020	COLING	NO	individual researchers	59	NO	YES	* citation & published venue is clue’s because the dataset itself wasn’t published
CLUE - Toutiao Short Text Classification for News	CLUE: A Chinese Language Understanding Evaluation Benchmark	https://arxiv.org/abs/2004.05986	https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip	multi-task (target language)	classification (non-sentiment analysis)	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	zh	2020	COLING	NO	individual researchers	59	NO	YES	* citation & published venue is clue’s because the dataset itself wasn’t published
CMRC 2018	A Span-Extraction Dataset for Chinese Machine Reading Comprehension	https://aclanthology.org/D19-1600.pdf	https://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9ce	task-oriented (target language)	machine reading comprehension	YES	>10K	collected from Wikipedia	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	zh	2019	EMNLP	NO	combination of university and industry	91	YES	YES
FQuAD 1.1	FQuAD: French Question Answering Dataset	https://aclanthology.org/2020.findings-emnlp.107/	https://fquad.illuin.tech/	task-oriented (target language)	machine reading comprehension	YES	>10K	crowdsourced & collected from Wikipedia	university students	its own language	n/a	n/a	crowdsourced	in its own language	fr	2020	Findings	NO	combination of university and industry	25	YES	YES
CLS	Cross-Language Text Classification using Structural Correspondence Learning	https://aclanthology.org/P10-1114.pdf	https://github.com/getalp/Flaubert/tree/master/flue	task-oriented (multilingual)	classification (sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	automatic translation	automatically induced	in its own language	fr de en ja	2010	ACL	NO	university	291	NO	YES
IndoNLI	IndoNLI: A Natural Language Inference Dataset for Indonesian	https://arxiv.org/pdf/2110.14566.pdf	https://github.com/ir-nlp-csui/indonli/tree/main/data	task-oriented (target language)	sentence pair task	YES	>10K	collected from Wikipedia & collected from web & curated linguistic resources	university students	its own language	n/a	n/a	“crowdsourced & annotated (authors, linguists)”	in its own language	id	2021	EMNLP	NO	combination of university and industry	0	NO	YES
K-QuAD	Semi-supervised Training Data Generation for Multilingual Question Answering	https://aclanthology.org/L18-1437.pdf	https://github.com/Di-lab-Yonsei/K-QuAD	task-oriented (target language)	machine reading comprehension	NO	>10K	crowdsourced & collected from Wikipedia	not mentioned	English & its own language	n/a	automatic translation & crowdsourced translation (incl. Gengo / One Hour Translation)	crowdsourced	English & in its own language	ko	2018	LREC	YES (English)	university	26	NO	YES
KLUE - DP	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000071/data/klue-dp-v1.1.tar.gz	multi-task (target language)	structured prediction	YES	>10K	collected from web & collected from social media or commercial sources	deepnatural	its own language	n/a	n/a	crowdsourced & automatically induced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - DST (WoS)	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000073/data/wos-v1.1.tar.gz	multi-task (target language)	structured prediction	YES	1000~10K	crowdsourced	not mentioned	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - MRC	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000072/data/klue-mrc-v1.1.tar.gz	multi-task (target language)	machine reading comprehension	YES	>10K	collected from media (news) & collected from Wikipedia & crowdsourced	selectstar	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - NER	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000069/data/klue-ner-v1.1.tar.gz	multi-task (target language)	sequence tagging	YES	>10K	collected from web	deepnatural	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - NLI	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000068/data/klue-nli-v1.1.tar.gz	multi-task (target language)	sentence pair task	YES	>10K	collected from web & collected from Wikipedia & collected from social media or commercial sources & collected from media (news)	selectstar	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - RE	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000070/data/klue-re-v1.1.tar.gz	multi-task (target language)	classification (non-sentiment analysis)	YES	>10K	collected from web & collected from Wikipedia & collected from media (news)	deepnatural	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	NO	combination of university and industry	19	NO	YES
KLUE - STS	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000067/data/klue-sts-v1.1.tar.gz	multi-task (target language)	sentence pair task	YES	>10K	collected from web & collected from Wikipedia & collected from media (news)	selectstar	its own language	RTT & greedy sentence matching	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	YES (other language)	combination of university and industry	19	NO	YES
KLUE - TC(YNAT)	KLUE: Korean Language Understanding Evaluation	https://arxiv.org/pdf/2105.09680.pdf	https://aistages-prod-server-public.s3.amazonaws.com/app/Competitions/000066/data/ynat-v1.1.tar.gz	multi-task (target language)	classification (sentiment analysis)	YES	>10K	collected from media (news)	selectstar	its own language	n/a	n/a	crowdsourced	in its own language	ko	2021	NeurIPS Datasets and Benchmarks Track	YES (other language)	combination of university and industry	19	NO	YES
KorQuAD1.0	KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension	https://arxiv.org/pdf/1909.07005.pdf	https://korquad.github.io/	task-oriented (target language)	machine reading comprehension	YES	>10K	collected from Wikipedia & crowdsourced	not mentioned	its own language	n/a	n/a	crowdsourced	in its own language	ko	2019	arxiv	NO	industry	39	YES	YES
OCNLI	OCNLI: Original Chinese Natural Language Inference	https://arxiv.org/pdf/2010.05444.pdf	https://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.zip	task-oriented (target language)	sentence pair task	YES	>10K	“collected from media (news) & collected from curated source (exams, scientific papers, etc) & curated linguistic resources”	university students	its own language	n/a	n/a	crowdsourced	in its own language	zh	2020	Findings	NO	combination of university and industry	21	NO	YES
ParsiNLU - Multiple Choice QA	ParsiNLU: A Suite of Language Understanding Challenges for Persian	https://arxiv.org/pdf/2012.06154.pdf	https://github.com/persiannlp/parsinlu/tree/master/data/multiple-choice	multi-task (target language)	QA + IR	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	native speakers	its own language	n/a	n/a	crowdsourced	in its own language	fa	2021	TACL	NO	combination of university and industry	3	NO	YES
ParsiNLU - Query Paraphrasing	ParsiNLU: A Suite of Language Understanding Challenges for Persian	https://arxiv.org/pdf/2012.06154.pdf	https://github.com/persiannlp/parsinlu/tree/master/data/qqp	multi-task (target language)	sentence pair task	YES	1000~10K	collected from web	native speakers	its own language	google auto complete	automatic translation & expert translation	crowdsourced	in its own language	fa	2021	TACL	YES (English)	combination of university and industry	3	YES	YES
ParsiNLU - Reading Comprehension	ParsiNLU: A Suite of Language Understanding Challenges for Persian	https://arxiv.org/pdf/2012.06154.pdf	https://github.com/persiannlp/parsinlu/tree/master/data/reading_comprehension	multi-task (target language)	machine reading comprehension	YES	1000~10K	collected from web	native speakers	its own language	n/a	n/a	crowdsourced	in its own language	fa	2021	TACL	NO	combination of university and industry	3	YES	YES
ParsiNLU - Sentiment Analysis	ParsiNLU: A Suite of Language Understanding Challenges for Persian	https://arxiv.org/pdf/2012.06154.pdf	https://github.com/persiannlp/parsinlu/tree/master/data/sentiment-analysiss	multi-task (target language)	classification (sentiment analysis)	YES	1000~10K	collected from social media or commercial sources & collected from web	native speakers	its own language	n/a	n/a	crowdsourced	in its own language	fa	2021	TACL	NO	combination of university and industry	3	YES	YES
ParsiNLU - Textual Entailment	ParsiNLU: A Suite of Language Understanding Challenges for Persian	https://arxiv.org/pdf/2012.06154.pdf	https://github.com/persiannlp/parsinlu/tree/master/data/entailment	multi-task (target language)	sentence pair task	YES	1000~10K	collected from Wikipedia & collected from web & curated linguistic resources	native speakers	its own language	n/a	n/a	crowdsourced	in its own language	fa	2021	TACL	YES (English)	combination of university and industry	3	YES	YES
XGLUE - NTG	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	sentence-level-generation task	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en de fr es ru	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - QG	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	sentence-level-generation task	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en fr de es it pt	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - QAM	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	sentence pair task	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en fr de	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - WPR	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	classification (sentiment analysis)	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en de fr es it pt zh	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - QADSM	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	classification (sentiment analysis)	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en fr de	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - NC	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	classification (sentiment analysis)	PARTIAL	>10K	collected from web	n/a	not mentioned	not mentioned	not clear whether translation is used	not mentioned	not mentioned	en es de fr ru	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - POS Tagging	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	structured prediction	PARTIAL	1000~10K	curated linguistic resources	n/a	not mentioned	not mentioned	n/a	not mentioned	not mentioned	ar bg de el en es fr hi it nl pl pt ru th tr ur vi zh	2020	EMNLP	NO	industry	57	YES	YES
XGLUE - NER	“XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation”	https://arxiv.org/pdf/2004.01401.pdf	https://xglue.blob.core.windows.net/xglue/xglue_full_dataset.tar.gz	cross-lingual transfer	sequence tagging	PARTIAL	1000~10K	collected from media (news)	people from university	its own language	n/a	n/a	crowdsourced	in its own language	en de es nl	2020	EMNLP	YES (English & other language)	industry	57	YES	YES
negationminpairs	A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs	https://aclanthology.org/2021.conll-1.19.pdf	https://github.com/mahartmann/negationminpairs	task-oriented (multilingual)	classification (sentiment analysis)	YES	1000~10K	crowdsourced	“annotated by native speakers (except english), xnli: gethybrid.io”	its own language	alignment	automatic translation	“annotated (authors, linguists)”	in its own language	en bg de fr zh	2021	CoNLL	YES (English & other languages)	university	0	NO	YES
malayammixsentiment	A Sentiment Analysis Dataset for Code-Mixed Malayalam-English	https://arxiv.org/pdf/2006.00210v1.pdf	https://github.com/bharathichezhiyan/MalayalamMixSentiment	task-oriented (target language)	classification (sentiment analysis)	YES	>10k	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	en ml	2020	*ACL Workshop	NO	university	48	YES	YES
TUNIZI	Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis	https://arxiv.org/pdf/2004.14303v1.pdf	https://github.com/chaymafourati/TUNIZI-Sentiment-Analysis-Tunisian-Arabizi-Dataset	task-oriented (target language)	classification (sentiment analysis)	NO	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	ar	2020	ICLR	NO	industry	8	NO	YES
CoDEx	CoDEx: A Comprehensive Knowledge Graph Completion Benchmark	https://arxiv.org/pdf/2009.07810.pdf	https://github.com/tsafavi/codex	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10k	collected from Wikipedia	n/a	its own language	n/a	n/a	“annotated (authors, linguists) & automatically induced”	English	ar de en es ru zh	2020	EMNLP	NO	university	16	NO	YES
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank	Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank	http://www.lrec-conf.org/proceedings/lrec2018/pdf/141.pdf	https://github.com/MurathanKurfali/Ted-MDB-Annotations	task-oriented (multilingual)	structured prediction	NO	1000~10K	collected from web	n/a	its own language	n/a	expert translation	“annotated (authors, linguists)”	in its own language	en de pl pt ru tr	2018	LREC	NO	university	18	NO	YES
TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks	TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks	https://aclanthology.org/2020.emnlp-main.223.pdf	https://github.com/wanqiulong0923/TED-CDB	task-oriented (target language)	structured prediction	NO	>10k	collected from web	n/a	both	n/a	expert translation	“annotated (authors, linguists)”	in its own language	zh	2020	EMNLP	NO	university	0	NO	YES
A Dataset for Multi-lingual Epidemiological Event Extraction	A Dataset for Multi-lingual Epidemiological Event Extraction	https://aclanthology.org/2020.lrec-1.509.pdf	https://zenodo.org/record/3709617#.YcCvOhPMITU	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10k	collected from media (news)	n/a	its own language	n/a	n/a	crowdsourced	not mentioned	en fr es pt	2020	LREC	NO	university	3	NO	YES
Multilingual Culture-Independent Word Analogy Datasets	Multilingual Culture-Independent Word Analogy Datasets	https://aclanthology.org/2020.lrec-1.501.pdf	https://www.clarin.si/repository/xmlui/handle/11356/1261	task-oriented (multilingual)	other	NO	>10k	collected from web	n/a	its own language	n/a	automatic translation	“annotated (authors, linguists)”	in its own language	en et fi lv lt ru sl sv	2020	LREC	NO	university	6		YES
SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines	SemRelData ― Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines	https://aclanthology.org/L16-1656.pdf		task-oriented (multilingual)	classification (non-sentiment analysis)	NO	>10k	collected from web & collected from Wikipedia	n/a	its own language	n/a	n/a	crowdsourced	in its own language	en de ru	2016	LREC	NO	university	2	NO	NO
20Minuten	A New Dataset and Efficient Baselines for Document-level Text Simplification in German	https://aclanthology.org/2021.newsum-1.16.pdf		task-oriented (target language)	sentence-level-generation task	YES	>10k	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	de	2021	ACL	NO	university	0	NO	NO
Spektrum	A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization	https://aclanthology.org/2021.newsum-1.5.pdf	https://github.com/MehwishFatimah/wsd	task-oriented (multilingual)	summarization	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	automatically induced	in its own language	en de	2021	EMNLP	NO	university	0	NO	YES
A Summarization Dataset of Slovak News Articles	A Summarization Dataset of Slovak News Articles	https://aclanthology.org/2020.lrec-1.830.pdf	https://github.com/NaiveNeuron/sme-sum	task-oriented (target language)	summarization	not mentioned	>10k	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	sk	2020	LREC	NO	university	1	NO	NO
Liputan6: A Large-scale Indonesian Dataset for Text Summarization	Liputan6: A Large-scale Indonesian Dataset for Text Summarization	https://aclanthology.org/2020.aacl-main.60.pdf	https://github.com/fajri91/sum_liputan6	task-oriented (target language)	summarization	YES	>10k	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	id	2020	AACL	NO	university	8	NO	YES
Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque	Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque	https://aclanthology.org/2020.lrec-1.55/	http://ixa.si.ehu.es/node/12934	task-oriented (target language)	machine reading comprehension	NO	1000~10k	collected from Wikipedia	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	eu	2020	LREC	NO	university	7	NO	YES
A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets	A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets	https://aclanthology.org/P15-2001.pdf	Not available	task-oriented (multilingual)	classification (non-sentiment analysis)	not mentioned	<100	“annotated (authors, linguists)”	n/a	English	n/a	expert translation	crowdsourced	English	en es fr de pt fa	2015	ACL	YES (English)	university	61	NO	NO
A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining	A Multi-lingual Annotated Dataset for Aspect-Oriented Opinion Mining	https://aclanthology.org/D15-1302.pdf	https://github.com/diegma/trip-maml	task-oriented (multilingual)	classification (sentiment analysis)	YES	1000~10k	collected from social media or commercial sources	n/a	its own language	n/a	n/a	crowdsourced	in its own language	en es it	2015	ACL	YES (English)	university	10	NO	YES
“Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers”	“Cross-Document, Cross-Language Event Coreference Annotation Using Event Hoppers”	https://aclanthology.org/L18-1558.pdf	https://github.com/AlonEirew/cross-doc-event-coref	task-oriented (multilingual)	structured prediction	YES	>10k	collected from web	n/a	its own language	n/a	n/a	crowdsourced	in its own language	zh en es	2018	LREC	NO	university	6	NO	YES
DanFEVER: claim verification dataset for Danish	DanFEVER: claim verification dataset for Danish	https://aclanthology.org/2021.nodalida-main.47.pdf	https://figshare.com/articles/dataset/DanFEVER_claim_verification_dataset_for_Danish/14380970	task-oriented (target language)	classification (non-sentiment analysis)	NO	>10k	collected from Wikipedia	n/a	its own language	n/a	n/a	crowdsourced	in its own language	da	2021	NoDaLiDa	NO	university	5	NO	YES
From Web Crawl to Clean Register-Annotated Corpora	From Web Crawl to Clean Register-Annotated Corpora	https://aclanthology.org/2020.wac-1.3.pdf	https://github.com/TurkuNLP/WAC-XII	task-oriented (multilingual)	classification (non-sentiment analysis)	NO	>10k	collected from web	n/a	its own language	n/a	n/a	crowdsourced	in its own language	fr sv	2020	LREC	NO	university	2	NO	YES
LIdioms: A Multilingual Linked Idioms Data Set	LIdioms: A Multilingual Linked Idioms Data Set	https://arxiv.org/pdf/1802.08148.pdf	https://github.com/dice-group/LIdioms/blob/master/en/english.ttl	task-oriented (multilingual)	other	NO	100~1000	collected from web	n/a	its own language	n/a	n/a	automatically induced	English	en pt it de ru	2018	LREC	NO	university	6	NO	YES
Mega-COV	Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19	https://arxiv.org/pdf/2005.06012.pdf	https://github.com/UBC-NLP/megacov/tree/master/tweet_ids	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10K	collected from social media or commercial sources	n/a	English & its own language	n/a	n/a	automatically induced	English & in its own language		2021	EACL	NO	university	7	NO	YES
Finnish Rumor Detection Dataset	Never guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baseline	https://arxiv.org/pdf/2106.03389.pdf	https://zenodo.org/record/4697529#.YcKICy-B2tU	task-oriented (target language)	classification (sentiment analysis)	YES	1000~10K	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	fi	2021	*ACL Workshop	NO	university	0	NO	YES
PROMETHEUS	PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors	https://aclanthology.org/L16-1600.pdf	(not available)	task-oriented (multilingual)	structured prediction	not mentioned	1000~10K	curated linguistic resources & collected from social media or commercial sources	n/a	English	n/a	expert translation	“annotated (authors, linguists)”	in its own language	en it	2016	LREC	NO	university	7	NO	YES
OneSec	Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains	https://aclanthology.org/2020.lrec-1.723.pdf	http://trainomatic.org/data/onesec_lrec.tar.gz	task-oriented (multilingual)	other	YES	>10K	curated linguistic resources & collected from Wikipedia	n/a	its own language	n/a	n/a	automatically induced	in its own language	en it fr de es	2020	LREC	NO	university	4	NO	YES
StoryDB	StoryDB: Broad Multi-language Narrative Dataset	https://aclanthology.org/2021.eval4nlp-1.4.pdf	https://drive.google.com/drive/folders/1RCWk7pyvIpubtsf-f2pIsfqTkvtV80Yv	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	1000~10K	collected from Wikipedia	n/a	English & its own language	alignment	n/a	automatically induced	English & its own language	en it fr ru de nl uk pl pt es sv ja he fi eu hy fa no ar id ko vi bg el hu zh da gl th sr hr lb mk ta ms cs ro te ka ca lt sl	2021	*ACL Workshop	NO	combination of university and industry	0	NO	YES
DReaM	The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages	https://aclanthology.org/2020.lrec-1.110.pdf	“https://spraakbanken.gu.se/korp/?mode=dream#?cqp=%5B%5D&corpus=dream-en-open,dream-de-open,dream-es-open,dream-fr-open,dream-it-open,dream-nl-open,dream-ru-open”	task-oriented (multilingual)	structured prediction	not mentioned	100~1000	“collected from curated source (exams, scientific papers, etc)”	n/a	English & its own language	n/a	n/a	“annotated (authors, linguists)”	English & its own language	en fr de es pt ru id nl it zh	2020	LREC	NO	university	5	NO	YES
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.	Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.	https://aclanthology.org/P19-1392.pdf	http://www.grupolys.org/~marcos/pub/collocations.zip	cross-lingual transfer	structured prediction	NO	1000~10K	curated linguistic resources	n/a	English & its own language	n/a	n/a	“annotated (authors, linguists)”	English & its own language	en pt es	2019	ACL	NO	university	6	NO	YES
Universal Dependency Annotation for Multilingual Parsing	Universal Dependency Annotation for Multilingual Parsing	https://aclanthology.org/P13-2017.pdf	https://code.google.com/p/uni-dep-tb/	task-oriented (multilingual)	structured prediction	NO	1000~10K	curated linguistic resources	not mentioned	English & its own language	parsers	n/a	crowdsourced	English & its own language	en de sv es fr ko	2013	ACL	NO	combination of university and industry	561	NO	YES
KINNEWS	KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi	https://arxiv.org/pdf/2010.12174.pdf	https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10K	collected from media (news)	n/a	its own language	google auto complete	n/a	crowdsourced	English & in its own language	rw	2020	COLING	NO	university	3	YES	YES
KIRNEWS	KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi	https://arxiv.org/pdf/2010.12174.pdf	https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	1000~10K	collected from media (news)	n/a	its own language	google auto complete	n/a	crowdsourced	English & in its own language	rn	2020	COLING	NO	university	3	YES	YES
SQuAD-es	Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering	https://arxiv.org/pdf/1912.05200.pdf	https://github.com/ccasimiro88/TranslateAlignRetrieve	task-oriented (target language)	machine reading comprehension	YES	>10K	crowdsourced & collected from Wikipedia	amt	English	alignment	automatic translation	crowdsourced	English	es	2020	LREC	YES (English)	university	18	NO	YES
HEAD-QA: A Healthcare Dataset for Complex Reasoning	HEAD-QA: A Healthcare Dataset for Complex Reasoning	https://aclanthology.org/P19-1092.pdf	http://aghie.github.io/head-qa/	task-oriented (multilingual)	QA + IR	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	automatic translation	“derived from linguistic resources (wordnet, etc)”	in its own language	es en	2019	ACL	NO	university	10	YES	YES
RuCoS	Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian	https://aclanthology.org/2020.coling-main.570.pdf	https://github.com/RussianNLP/RussianSuperGLUE	task-oriented (target language)	machine reading comprehension	YES	>10K	collected from web	toloka	its own language	tf-idf generated & other	n/a	crowdsourced & automatically induced	in its own language	ru	2020	COLING	YES (other language)	combination of university and industry	3	YES	YES
MuSeRC	Read and Reason with MuSeRC and RuCoS: Datasets for Machine Reading Comprehension for Russian	https://aclanthology.org/2020.coling-main.570.pdf	https://github.com/RussianNLP/RussianSuperGLUE	task-oriented (target language)	machine reading comprehension	YES	1000~10K	“collected from media (news) & collected from curated source (exams, scientific papers, etc)”	toloka	its own language	n/a	n/a	crowdsourced	in its own language	ru	2020	COLING	YES (other language)	combination of university and industry	3	YES	YES
BI-139	CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval	https://aclanthology.org/2020.emnlp-main.340.pdf	https://www.cs.jhu.edu/~shuosun/clirmatrix/	task-oriented (multilingual)	QA + IR	YES	>10K	collected from Wikipedia	n/a	English & its own language	alignment	n/a	automatically induced	in its own language	af als am an ar arz ast az azb ba bar be bg bn bpy br bs bug ca cdo ce ceb ckb cs cv cy da de diq el eml eo es et eu fa fi fo fr fy ga gd gl gu he hi hr hsb ht hu hy ia id ilo io is it ja jv ka kk kn ko ku ky la lb li lmo lt lv mai mg mhr min mk ml mn mr mrj ms my mzn nap nds ne new nl nn no oc or os pa pl pms pnb ps pt qu ro ru sa sah scn sco sd sh si sk sl sq sr su sv sw szl ta te tg th tl tr tt uk ur uz vec vi vo wa war wuu xmf yi yo zh	2020	EMNLP	NO	university	3	NO	YES
MULTI-8	CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval	https://aclanthology.org/2020.emnlp-main.340.pdf	https://www.cs.jhu.edu/~shuosun/clirmatrix/	task-oriented (multilingual)	QA + IR	YES	>10K	collected from Wikipedia	n/a	English & its own language	alignment	n/a	automatically induced	in its own language	ar de en es fr ja ru zh	2020	EMNLP	NO	university	3	NO	YES
GerDaLIR: A German Dataset for Legal Information Retrieval	GerDaLIR: A German Dataset for Legal Information Retrieval	https://aclanthology.org/2021.nllp-1.13.pdf	https://github.com/lavis-nlp/GerDaLIR	task-oriented (target language)	QA + IR	YES	>10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	automatically induced	in its own language	de	2021	*ACL Workshop	NO	university	0	NO	YES
MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network	MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network	https://arxiv.org/pdf/2106.07352.pdf	not available	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	>10K	collected from Wikipedia & collected from web	n/a	English & its own language	n/a	n/a	automatically induced	in its own language		2021	ACL	YES (English & other language)	industry	2	NO	NO
A Turkish Dataset for Gender Identification of Twitter Users	A Turkish Dataset for Gender Identification of Twitter Users	https://aclanthology.org/W19-4023v1.pdf	https://cloud.iyte.edu.tr/index.php/s/5DhqdlUCCdB60qG	task-oriented (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	university students & academic personnel	its own language	n/a	n/a	crowdsourced	in its own language	tr	2019	*ACL Workshop	NO	university	11	NO	YES
AnCora-Ca	AnCora: Multilevel Annotated Corpora for Catalan and Spanish	http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf	http://clic.ub.edu/corpus/en	task-oriented (target language)	sequence tagging & structured prediction	NO	>10K	collected from media (news)	not mentioned	its own language	n/a	n/a	“annotated (authors, linguists) & automatically induced”	in its own language	ca	2008	LREC	YES (other language)	university	345	PARTIAL	YES
AnCora-Es	AnCora: Multilevel Annotated Corpora for Catalan and Spanish	http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf	http://clic.ub.edu/corpus/en	task-oriented (target language)	sequence tagging & structured prediction	NO	>10K	collected from media (news)	not mentioned	its own language	n/a	n/a	“annotated (authors, linguists) & automatically induced”	in its own language	es	2008	LREC	NO	university	345	NO	YES
NCTTI	Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels	https://aclanthology.org/2021.acl-long.212.pdf	https://github.com/marcospln/nctti	task-oriented (multilingual)	sequence tagging	YES	1000~10K	collected from web & collected from Wikipedia	amt & online platforms for portuguese in cordeiro et al (2019)	its own language	parsers	n/a	“crowdsourced & annotated (authors, linguists)”	in its own language	en pt	2021	ACL	YES (English & other language)	university	2	NO	YES
Books of Hours. the First Liturgical Data Set for Text Segmentation.	Books of Hours. the First Liturgical Data Set for Text Segmentation.	https://aclanthology.org/2020.lrec-1.97.pdf		task-oriented (target language)	sequence tagging	NO	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	other	n/a	“annotated (authors, linguists)”	in its own language	la	2020	LREC	NO	university	1	NO	YES
COSTRA 1.0: A Dataset of Complex Sentence Transformations	COSTRA 1.0: A Dataset of Complex Sentence Transformations	https://aclanthology.org/2020.lrec-1.434/	https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3123	task-oriented (target language)	sentence pair task	NO	1000~10K	collected from media (news)	n/a	its own language	alignment	n/a	crowdsourced	in its own language	cs	2020	LREC	NO	university	1	NO	YES
Fine-grained Named Entity Annotation for Finnish	Fine-grained Named Entity Annotation for Finnish	https://aclanthology.org/2021.nodalida-main.14/	https://github.com/TurkuNLP/turku-one	task-oriented (target language)	sequence tagging	YES	>10K		n/a	its own language	n/a	n/a	“derived from linguistic resources (wordnet, etc)”	in its own language	fi	2021	NoDaLiDa	YES (other language)	university	0	NO	YES
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors	GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors	https://aclanthology.org/2020.lrec-1.835/	https://github.com/mhagiwara/github-typo-corpus	task-oriented (multilingual)	sequence tagging	NO	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	automatically induced	in its own language	en zh ja ru fr de pt es ko hi	2020	LREC	NO	combination of university and industry	13	NO	YES
Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus	Hindi TimeBank: An ISO-TimeML Annotated Reference Corpus	https://aclanthology.org/2020.isa-1.2/		task-oriented (target language)	sequence tagging	NO	>10K	collected from media (news)	not mentioned	its own language	n/a	n/a	crowdsourced	in its own language	hi	2020	*ACL Workshop	NO	university	3	NO	NO
K-SNACS: Annotating Korean Adposition Semantics	K-SNACS: Annotating Korean Adposition Semantics	https://aclanthology.org/2020.dmr-1.6/	https://github.com/jdch00/k-snacs	task-oriented (target language)	sequence tagging	NO	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	ko	2020	*ACL Workshop	NO	university	4	NO	NO
“MassiveSumm: a very large-scale, very multilingual, news summarisation dataset”	“MassiveSumm: a very large-scale, very multilingual, news summarisation dataset”	https://aclanthology.org/2021.emnlp-main.797/	https://github.com/natschluter/massive-summ	task-oriented (multilingual)	summarization	YES	>10K	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	af am ar as ay az bm bn bo bs bg ca cs cy da de el en eo fa fil fr ff ga gu ht ha he hi hr hu hy ig id is it ja kn ka km rw ky ko ku lo lv ln lt ml mr mk mg mn my nd ne nl or om pa pl pt prs ps ro rn ru si sk sl sn so es sq sr sw sv ta te tet tg th ti tr uk ur uz vi xh yo yue zh bi gd	2021	EMNLP	NO	university	0	NO	NO
Models and Datasets for Cross-Lingual Summarisation	Models and Datasets for Cross-Lingual Summarisation	https://aclanthology.org/2021.emnlp-main.742/	https://github.com/lauhaide/clads	task-oriented (multilingual)	summarization	YES	>10K	collected from Wikipedia	n/a	its own language	alignment	n/a	automatically induced	in its own language	cs fr en de	2021	EMNLP	NO	university	0	YES	YES
Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish	Neural Cross-Lingual Transfer and Limited Annotated Data for Named Entity Recognition in Danish	https://aclanthology.org/W19-6143/	https://github.com/UniversalDependencies/UD_Danish-DDT	task-oriented (target language)	sequence tagging	YES	1000~10K	curated linguistic resources	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	da	2019	*ACL Workshop	YES (English)	university	8	YES	YES
Universal Joy A Data Set and Results for Classifying Emotions Across Languages	Universal Joy A Data Set and Results for Classifying Emotions Across Languages	https://aclanthology.org/2021.wassa-1.7.pdf	https://github.com/sotlampr/universal-joy	task-oriented (multilingual)	classification (sentiment analysis)	YES	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	automatically induced	in its own language	bn zh de en fr hi id it km my nl pt ro es tl th vi ms	2021	*ACL Workshop	NO	university	6	NO	YES
X-Fact: A New Benchmark Dataset for Multilingual Fact Checking	X-Fact: A New Benchmark Dataset for Multilingual Fact Checking	https://aclanthology.org/2021.acl-short.86/	https://github.com/utahnlp/x-fact/	task-oriented (multilingual)	classification (non-sentiment analysis)	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	automatically induced	in its own language	si nl mr no tr hi id it sr ru fa sq gu ka pl az bn ta de es pa fr ro pt ar	2021	ACL	NO	university	3	NO	YES
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning	XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning	https://aclanthology.org/2020.emnlp-main.185/	https://github.com/cambridgeltl/xcopa	cross-lingual transfer	classification (non-sentiment analysis)	NO	100~1000	“collected from curated source (exams, scientific papers, etc)”	n/a	English	n/a	expert translation	automatically induced	English	et ht id it qu sw ta th tr vi zh	2020	EMNLP	YES (English)	university	36	YES	YES
KLEJ - NKJP-NER	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	classification (non-sentiment analysis)	YES	>10K	“collected from curated source (exams, scientific papers, etc)”	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - CBD	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	classification (sentiment analysis)	YES	>10K	collected from social media or commercial sources	not mentioned	its own language	n/a	n/a	“crowdsourced & annotated (authors, linguists)”	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - PolEmo2.0-IN	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - PolEmo2.0-OUT	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	“annotated (authors, linguists)”	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - Czy wiesz?	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	QA + IR	YES	1000~10K	collected from Wikipedia	n/a	its own language	RTT & greedy sentence matching	n/a	automatically induced	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - PSC	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	sentence-level-generation task	YES	1000~10K	collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
KLEJ - AR	KLEJ: Comprehensive Benchmark for Polish Language Understanding	https://aclanthology.org/2020.acl-main.111/	https://klejbenchmark.com/	multi-task (target language)	classification (sentiment analysis)	YES	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	automatically induced	in its own language	pl	2020	ACL	YES (other language)	university	22	YES	YES
PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone	PACE Corpus: a multilingual corpus of Polarity-annotated textual data from the domains Automotive and CEllphone	https://aclanthology.org/L14-1240/		task-oriented (multilingual)	classification (sentiment analysis)	NO	1000~10K	“collected from curated source (exams, scientific papers, etc)”	not mentioned	original language	n/a	n/a	crowdsourced	in its own language	en de	2014	LREC	NO	combination of university and industry	1	NO	YES
RussianSuperGLUE-LiDiRus	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	sentence pair task	NO	1000~10K	collected from media (news)	n/a	English	n/a	expert translation	“annotated (authors, linguists)”	English	ru	2020	EMNLP	YES (English)	combination of university and industry	11	YES	YES
RussianSuperGLUE-RUSSE	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	other	YES	>10K	collected from Wikipedia & curated linguistic resources	toloka	its own language	n/a	n/a	crowdsourced	in its own language	ru	2020	EMNLP	YES (other language)	combination of university and industry	11	YES	YES
RussianSuperGLUE-PARus	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	classification (non-sentiment analysis)	YES	100~1000	“collected from web & collected from curated source (exams, scientific papers, etc)”	amt	English	n/a	expert translation	crowdsourced	English	ru	2020	EMNLP	YES (English)	combination of university and industry	11	YES	YES
RussianSuperGLUE-TERRa	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	sentence pair task	YES	1000~10K	collected from media (news) & collected from web	n/a	its own language	n/a	n/a	“automatically induced & annotated (authors, linguists)”	in its own language	ru	2020	EMNLP	YES (other language)	combination of university and industry	11	YES	YES
RussianSuperGLUE-RCB	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	sentence pair task	YES	1000~10K	collected from media (news) & collected from web	n/a	its own language	n/a	n/a	“automatically induced & annotated (authors, linguists)”	in its own language	ru	2020	EMNLP	YES (other language)	combination of university and industry	11	YES	YES
RussianSuperGLUE-RWSD	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	structured prediction	YES	100~1000	“collected from curated source (exams, scientific papers, etc)”	n/a	English	n/a	details not provided	“annotated (authors, linguists)”	English	ru	2020	EMNLP	YES (English)	combination of university and industry	11	YES	YES
RussianSuperGLUE-DaNetQA	RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark	https://aclanthology.org/2020.emnlp-main.381/	https://github.com/RussianNLP/RussianSuperGLUE	multi-task (target language)	machine reading comprehension	YES	100~1000	collected from Wikipedia & crowdsourced	toloka	its own language	other	n/a	“crowdsourced & annotated (authors, linguists)”	in its own language	ru	2020	EMNLP	NO	combination of university and industry	11	YES	YES
Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages	Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages	https://arxiv.org/abs/2103.00854	https://github.com/rajaswa/indic-syntax-evaluation	task-oriented (target language)	structured prediction	YES	>10K	curated linguistic resources	n/a	original language	n/a	n/a	“derived from linguistic resources (wordnet, etc)”	in its own language	hi ta	2021	*ACL Workshop	YES (other language)	university	0	NO	YES
IndicNLPSuite-Soham News Article Classification	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from media (news)	n/a	original language	n/a	n/a	automatically induced	in its own language	pa bn or gu mr kn te ml ta	2020	Findings		combination of university and industry	59	YES	YES
IndicNLPSuite-iNLTK Headline Classification	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from media (news)	n/a	original language	n/a	n/a	automatically induced	in its own language	pa bn or gu mr kn te ml ta	2020	Findings		combination of university and industry	59	YES	YES
IndicNLPSuite-AI4Bharat Cloze-style Question Answering	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	machine reading comprehension	YES	>10K	collected from Wikipedia	n/a	original language	n/a	n/a	automatically induced	in its own language	pa hi bn or as gu mr kn te ml ta	2020	Findings		combination of university and industry	59	YES	YES
IndicNLPSuite-AI4Bharat Winograd Natural Language Inference	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	classification (non-sentiment analysis)	NO	100~1000	“collected from curated source (exams, scientific papers, etc)”	n/a	English	n/a	author translation	“annotated (authors, linguists)”	English	hi mr gu	2020	Findings	YES (English)	combination of university and industry	59	YES	YES
IndicNLPSuite-AI4Bharat Choice of Plausible Alternatives	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	classification (non-sentiment analysis)	NO	100~1000	“annotated (authors, linguists)”	n/a	English	n/a	author translation	“annotated (authors, linguists)”	English	hi mr gu	2020	Findings	YES (English)	combination of university and industry	59	YES	YES
IndicNLPSuite-WikiAnnNER	“IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages”	https://aclanthology.org/2020.findings-emnlp.445/	https://indicnlp.ai4bharat.org/home/	multi-task (target language)	sequence tagging	YES	>10K	collected from Wikipedia	n/a	English	alignment	automatic translation	automatically induced	English	ace af als am an ang ar arc arz as ast ay az ba bar be bg bh bn bo br bs ca cdo ce ceb ckb co crh cs csb cv cy da de diq dv el en eo es et eu ext fa fi fo fr frr fur fy ga gan gd gl gn gu hak he hi hr hsb hu hy ia id ig ilo io is it ja jbo jv ka kk km kn ko ksh ku ky la lb li lij lmo ln lt lv mg mhr mi min mk ml mn mr ms mt mwl my mzn nap nds ne nl nn no nov oc or os sgs be-tarask cbk eml vro jv-x-bms en-basiceng lzh nan yue pa pdc pl pms pnb ps pt qu rm ro ru rw sa sah scn sco sd sh si sk sl so sq sr su sv sw szl ta te tg th tk tl tr tt ug uk ur uz vec vep vi vls vo wa war wuu xmf yi yo zea zh	2020	Findings	YES (other language)	combination of university and industry	59	YES	YES
CVIT-MKB Cross-lingual Sentence Retrieval	A Multilingual Parallel Corpora Collection Effort for Indian Languages	https://aclanthology.org/2020.lrec-1.462.pdf	https://anoopkunchukuttan.github.io/indic_nlp_library/	task-oriented (multilingual)	QA + IR	YES	>10K	collected from web	n/a	its own language	alignment	n/a	automatically induced	in its own language	hi te ta ml gu kn ur bn or mr pa as en	2020	LREC		university	18	YES	YES
ACTSA	ACTSA: Annotated Corpus for Telugu Sentiment Analysis	https://aclanthology.org/W17-5408/	https://drive.google.com/drive/folders/0B8HHvMMuHYdWdnJZZl9rWkY5bk0?usp=sharing	task-oriented (target language)	classification (non-sentiment analysis)	YES	1000~10K	collected from media (news)	native speakers	original language	n/a	n/a	crowdsourced	in its own language	te	2017	*ACL Workshop		university	31	YES	NO
MIDAS Discourse	An Annotated Dataset of Discourse Modes in Hindi Stories	https://aclanthology.org/2020.lrec-1.149.pdf	https://github.com/midas-research/hindi-discourse	task-oriented (target language)	classification (non-sentiment analysis)	YES	1000~10K	“collected from curated source (exams, scientific papers, etc)”	native speakers	original language	n/a	n/a	crowdsourced	in its own language	hi	2020	LREC		combination of university and industry	4	YES	YES
A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment	A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment	https://aclanthology.org/2020.lrec-1.395.pdf	https://github.com/elexis-eu/MWSA	cross-lingual transfer	other	NO	1000~10K	curated linguistic resources	n/a	original language	n/a	n/a	“derived from linguistic resources (wordnet, etc)”	in its own language	eu bg da nl en et de hu ga it sr sl es pt ru	2020	LREC	YES (other language)	university	12	NO	YES
Multilingual corpora with coreferential annotation of person entities	Multilingual corpora with coreferential annotation of person entities	https://aclanthology.org/L14-1701/	https://gramatica.usc.es/~marcos/lrec.tar.bz2	cross-lingual transfer	structured prediction	NO	1000~10K	collected from Wikipedia & collected from media (news)	n/a	original language	n/a	n/a	crowdsourced	in its own language	gl pt es	2014	LREC		university	21	NO	YES
MGAD: Multilingual Generation of Analogy Datasets	MGAD: Multilingual Generation of Analogy Datasets	https://aclanthology.org/L18-1320.pdf	https://github.com/rutrastone/MGAD	task-oriented (multilingual)	other	NO	>10K	template-based	n/a	its own language	n/a	n/a	automatically induced	in its own language	hi ar ru	2018	LREC		university	8	NO	YES
“The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation”	“The ApposCorpus: a new multilingual, multi-domain dataset for factual appositive generation”	https://arxiv.org/abs/2011.03287	https://yovakem.github.io/#ApposCorpus	task-oriented (multilingual)	sentence-level-generation task	YES	>10K	collected from Wikipedia & collected from media (news)	n/a	its own language	n/a	n/a	automatically induced	in its own language	en es de pl	2020	COLING		combination of university and industry	0	NO	YES
“The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation”	“The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation”	https://aclanthology.org/2020.inlg-1.10/	https://github.com/TallChris91/CACAPO-Dataset	task-oriented (multilingual)	sentence-level-generation task	YES	>10K	collected from media (news)	n/a	its own language	alignment	n/a	“annotated (authors, linguists)”	in its own language	nl en	2020	INLG		university	2	NO	YES
A Dataset and Baselines for Multilingual Reply Suggestion	A Dataset and Baselines for Multilingual Reply Suggestion	https://arxiv.org/abs/2106.02017	https://github.com/zhangmozhi/mrs	task-oriented (multilingual)	sentence-level-generation task	YES	>10K	collected from social media or commercial sources	n/a	its own language	n/a	n/a	automatically induced	in its own language	en es de pt fr ja sv it nl ru	2021	ACL		combination of university and industry	1	NO	YES

			task-oriented (multilingual)	61
			cross-lingual transfer	21
			task-oriented (target language)	37
			multi-task (target language)	38