Motivation
cross-lingual transfer (evaluate an NLP system trained on one language on other languages), single task (multilingual) w/ ML training (improve performance on a particular task while covering multiple languages), single task (single lang) (improve performance on a particular task, focusing on a single non-English language), multi-task (single lang) (test models for a particular language on many tasks, often reusing existing datasets)
Task Type
The task type that the dataset addresses:
classification (sentiment analysis), classification (sentence pair), classification (other), QA (w/ retrieval), QA (machine reading), structured prediction, sequence tagging, generation (summarization), generation (other), other
Size
less than 100, 100 ~ 1000, 1000 ~ 10K, greater than 10K
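The Size categories share their endpoints (100, 1000 and 10K), so any tool that assigns them needs a boundary convention. The following is a minimal sketch, not taken from the source: the function name `size_bucket` is hypothetical, and it assumes that each lower bound is inclusive and that exactly 10K falls under "1000 ~ 10K", since "greater than 10K" excludes it.

```python
def size_bucket(n_examples: int) -> str:
    """Map an example count to one of the four Size categories.

    Assumed convention (not stated in the source): lower bounds are
    inclusive, and exactly 10,000 belongs to "1000 ~ 10K".
    """
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if n_examples < 100:
        return "less than 100"
    if n_examples < 1000:
        return "100 ~ 1000"
    if n_examples <= 10_000:
        return "1000 ~ 10K"
    return "greater than 10K"
```

A different convention for the shared endpoints would only move the counts 100, 1000 and 10,000 themselves between adjacent categories.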
Input Data Source
The source of input text of the dataset, multiple sources allowed:
annotated (authors, linguists) (manual annotation by domain experts), crowdsourced (manual annotation by crowdworkers), curated linguistic resources (derived from curated linguistic resources, e.g., WordNet), curated source (exams, scientific papers, etc.), media (newspapers, etc.), template-based, web (sources other than commercial websites and Wikipedia), Wikipedia (Wikipedia text, wikidump, Wikidata, etc.), not mentioned
Original Language
The language in which the input text was collected:
English, its own language (the language of the dataset), other language (a language that is neither English nor included in the dataset), both (both English and its own language), not mentioned
Label Source
The source of the label, multiple sources allowed:
annotated (authors, linguists) (manual annotation by domain experts), crowdsourced (manual annotation by crowdworkers), automatically induced (automatically aligned or deduced from labeled or unlabeled data, e.g., bullet points in news used as a summary), curated linguistic resources (derived from curated linguistic resources, e.g., WordNet), not mentioned
Published Venue
The venue where the paper was published:
ACL, EMNLP, NAACL, LREC, *ACL Workshop, Findings, NeurIPS Datasets and Benchmarks Track, arXiv, N/A
Reused Dataset
Whether the authors reused any dataset and, if so, the language of the reused dataset:
Yes-Eng (reuses English datasets), Yes-other-lang (reuses datasets in other languages, including the languages of the new dataset), Yes-Eng & other-lang (reuses datasets in both English and other languages), No (does not reuse any dataset)
Creators
Who created the dataset:
industry, individual researchers (usually the joint effort of many people to create benchmark datasets for a specific language), university, combination of industry and university
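The attributes above mix multi-valued fields ("multiple sources allowed") with single-valued ones. As an illustration only, one annotation record could be checked against the category lists as sketched below. The class `DatasetAnnotation`, its field names, and the example dataset name are hypothetical and not part of the source. Only three attributes are covered to keep the sketch short, and it assumes that a multi-valued field must contain at least one value.

```python
from dataclasses import dataclass

# Allowed values, copied from the category lists above (descriptions omitted).
LABEL_SOURCES = frozenset({
    "annotated (authors, linguists)",
    "crowdsourced",
    "automatically induced",
    "curated linguistic resources",
    "not mentioned",
})
REUSED_DATASET = frozenset({
    "Yes-Eng", "Yes-other-lang", "Yes-Eng & other-lang", "No",
})
CREATORS = frozenset({
    "industry",
    "individual researchers",
    "university",
    "combination of industry and university",
})


@dataclass(frozen=True)
class DatasetAnnotation:
    """Hypothetical record holding three of the annotated attributes."""
    dataset: str
    label_source: frozenset  # multi-valued: multiple sources allowed
    reused_dataset: str      # single-valued
    creators: str            # single-valued

    def __post_init__(self):
        # Reject empty label sets (an assumption) and unknown categories.
        unknown = set(self.label_source) - LABEL_SOURCES
        if not self.label_source or unknown:
            raise ValueError(f"invalid label source(s): {sorted(unknown)}")
        if self.reused_dataset not in REUSED_DATASET:
            raise ValueError(f"invalid reused-dataset value: {self.reused_dataset!r}")
        if self.creators not in CREATORS:
            raise ValueError(f"invalid creators value: {self.creators!r}")


# Usage: a made-up dataset whose labels come from two sources.
example = DatasetAnnotation(
    dataset="ExampleBench",
    label_source=frozenset({"crowdsourced", "automatically induced"}),
    reused_dataset="Yes-Eng",
    creators="university",
)
```

Storing the multi-valued attribute as a set rather than a delimited string keeps each source separately countable when aggregating over many annotated datasets.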