Full Survey

Scroll to the bottom of the table to search by column.

Motivation

The motivation behind the dataset:
cross-lingual transfer (evaluate an NLP system trained on one language on other languages), single task (multilingual) w/ ML training (improve the performance of a particular task and cover multiple languages), single task (single lang) (improve the performance of a particular task and focus on a single, non-English language), multi-task (single lang) (test models for a particular language on many tasks, often re-using existing datasets)

Task Type

The task type that the dataset addresses:
classification (sentiment analysis), classification (sentence pair), classification (other), QA (w/ retrieval), QA (machine reading), structured prediction, sequence tagging, generation (summarization), generation (other), other

Size

less than 100, 100 ~ 1000, 1000 ~ 10K, greater than 10K
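
The four size categories above can be read as buckets over a count. A minimal sketch of that mapping follows, under stated assumptions: the legend does not say what unit Size is measured in or which bucket a boundary value belongs to, so the function name `size_bucket`, the choice of "number of examples" as the unit, and the boundary handling (100 and 1000 go to the higher bucket, exactly 10K stays in "1000 ~ 10K") are all illustrative.

```python
def size_bucket(n_examples: int) -> str:
    """Map a count to one of the four Size categories.

    Assumptions (not stated in the legend): the count is a number of
    examples, lower bounds are inclusive, and exactly 10,000 falls in
    "1000 ~ 10K" because the last category is "greater than 10K".
    """
    if n_examples < 0:
        raise ValueError("n_examples must be non-negative")
    if n_examples < 100:
        return "less than 100"
    if n_examples < 1000:
        return "100 ~ 1000"
    if n_examples <= 10_000:
        return "1000 ~ 10K"
    return "greater than 10K"
```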

Input Data Source

The source of the dataset's input text; multiple sources allowed:
annotated (authors, linguists) (manual annotation by domain experts), crowdsourced (manual annotation by crowdworkers), curated linguistic resources (derived from curated linguistic resources, e.g., WordNet), curated source (exams, scientific papers, etc.), media (newspapers, etc.), template-based, web (sources other than commercial websites and Wikipedia), Wikipedia (Wikipedia text, wikidump, Wikidata, etc.), not mentioned

Original Language

The language in which the input text was collected:
English, its own language (language of the dataset), other language (languages that are not included in the dataset and are not English), both (both English and its own language), not mentioned

Label Source

The source of the labels; multiple sources allowed:
annotated (authors, linguists) (manual annotation by domain experts), crowdsourced (manual annotation by crowdworkers), automatically induced (automatically aligned or deduced from labeled or unlabeled data, e.g., bullet points in news as summary), curated linguistic resources (derived from curated linguistic resources, e.g., WordNet), not mentioned

Published Venue

The venue where the paper was published:
ACL, EMNLP, NAACL, LREC, *ACL Workshop, Findings, NeurIPS Datasets and Benchmarks Track, arXiv, N/A

Reused Dataset

Whether the authors reused any dataset and, if so, the language of the reused dataset:
Yes-Eng (reuses English datasets), Yes-other-lang (reuses datasets in other languages, including the languages of the new dataset), Yes-Eng & other-lang (reuses datasets in both English and other languages), No (does not reuse any dataset)

Creators

Who created the dataset:
industry, individual researchers (usually a joint effort of many people to create benchmark datasets for a specific language), university, combination of industry and university

Dataset Name | Title | Motivation | Task Type | Has Train Data | Size | Input Data Source | Original Language | Translation Used | Label Source | Languages | Publication Year | Published Venue | Reused Dataset | Creators | In Huggingface
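
For readers working with an exported copy of the table, searching by column could be approximated as below. This is only a sketch under stated assumptions: each row is taken to be a dictionary keyed by the column names in the table header, the helper `search_by_column` and the two sample rows are hypothetical, and a case-insensitive substring match is one possible reading of "search", not a description of how the page's own search works.

```python
from typing import Dict, List

# Column names as they appear in the table header.
COLUMNS = [
    "Dataset Name", "Title", "Motivation", "Task Type", "Has Train Data",
    "Size", "Input Data Source", "Original Language", "Translation Used",
    "Label Source", "Languages", "Publication Year", "Published Venue",
    "Reused Dataset", "Creators", "In Huggingface",
]


def search_by_column(rows: List[Dict[str, str]], column: str, query: str) -> List[Dict[str, str]]:
    """Return rows whose value in `column` contains `query` (case-insensitive).

    Substring matching also covers columns that allow multiple values,
    such as Input Data Source and Label Source.
    """
    if column not in COLUMNS:
        raise KeyError(f"unknown column: {column}")
    needle = query.casefold()
    return [row for row in rows if needle in row.get(column, "").casefold()]


# Hypothetical rows for illustration only; not entries from the survey.
rows = [
    {"Dataset Name": "example-a", "Task Type": "QA (machine reading)", "Published Venue": "ACL"},
    {"Dataset Name": "example-b", "Task Type": "classification (sentiment analysis)", "Published Venue": "LREC"},
]

matches = search_by_column(rows, "Task Type", "qa")
print([row["Dataset Name"] for row in matches])
```

Rows that lack the requested column are treated as non-matching instead of raising an error, which keeps the helper usable on partially filled exports.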