Following Bender and Friedman (2018), we provide all the relevant information about the datasets of this benchmark. All the commissioned annotators of these datasets (1) were native speakers of Catalan, (2) had a background in translation, editing, philology, or linguistics, and (3) had previous experience in language-related tasks. Our curation rationale was to make these datasets both representative of contemporary Catalan and directly comparable to similar reference datasets from the General Language Understanding Evaluation (GLUE) benchmark. Since our datasets are designed for Machine Learning and Language Modelling, we provide training, evaluation, and test splits on HuggingFace. In what follows, we describe these datasets.
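As an illustration, the splits can be loaded directly with the HuggingFace datasets library; the hub identifier below is an assumption and may differ from the actual IDs of the benchmark datasets:

```python
from datasets import load_dataset

# Hypothetical hub identifier; the actual IDs of the benchmark datasets may differ.
dataset = load_dataset("projecte-aina/tecla")

# Each dataset ships with the three standard splits described above.
print({split: len(dataset[split]) for split in ("train", "validation", "test")})
```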
For Named Entity Recognition (NER) we use annotations from the AnCora corpus. We extracted the Named Entities from the original AnCora version, filtering out some unconventional ones, and transcribed them into the standard CoNLL-IOB format. The original annotation was produced as part of the AnCora project at the University of Barcelona and mainly covers newswire texts.
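To make the format concrete, the minimal sketch below parses CoNLL-IOB data (one token and tag per line, blank lines separating sentences); the Catalan example sentence and its entity spans are invented for illustration:

```python
# Invented CoNLL-IOB sample: one token per line with its IOB tag,
# sentences separated by blank lines.
sample = """La O
Universitat B-ORG
de I-ORG
Barcelona I-ORG
és O
a O
Catalunya B-LOC
"""

def read_conll_iob(text):
    """Parse CoNLL-IOB text into a list of (token, tag) sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

print(read_conll_iob(sample))
```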
For Part-of-Speech Tagging (POS) we use annotations from the AnCora corpus, projected onto the Universal Dependencies treebank. The original annotation was done in a constituency framework as part of the AnCora project at the University of Barcelona. It was converted to dependencies by the Universal Dependencies team and used in the CoNLL 2009 shared task. The CoNLL 2009 version was later converted to HamleDT and then to Universal Dependencies.
For Semantic Textual Similarity (STS) we created a new dataset from scratch: STS-ca. It contains more than 3,000 sentence pairs annotated with their semantic similarity on a scale from 0 (no similarity at all) to 5 (semantic equivalence). To develop STS-ca, we pre-selected potential sentence pairs from the Catalan Textual Corpus using different similarity measures (Jaccard, Doc2Vec, and DistilBERT embedding cosine similarity). A final manual review ensured that the selection covered both superficial and deep similarities in subject matter and lexicon. Following the guidelines set in the SemEval challenges, we commissioned 4 native speakers from 2 independent companies to rate the similarity of the sentence pairs on a scale from 0 to 5. For each sentence pair, we then computed the mean of the four annotations and discarded single annotations that deviated from it by more than 1. After this filtering, we used the mean of the remaining annotations as the final score. Finally, to assess the quality of the dataset, we measured the correlation of each annotator's labels with the average of the other annotators' labels and averaged the individual correlations, obtaining a Pearson correlation of 0.739.
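The aggregation and quality check can be summarized in a short sketch; the scores below are toy values, and we assume the deviation threshold is applied against the mean of all four annotations as described above:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy annotation matrix: one row per sentence pair, one column per annotator.
ratings = np.array([
    [4.0, 4.5, 4.0, 2.0],  # 2.0 deviates from the row mean by more than 1
    [1.0, 0.5, 1.0, 1.5],
    [3.0, 3.5, 2.5, 3.0],
])

def final_score(row, max_dev=1.0):
    """Mean of the annotations left after discarding those > max_dev from the mean."""
    kept = row[np.abs(row - row.mean()) <= max_dev]
    return kept.mean()

scores = [final_score(row) for row in ratings]

# Quality check: correlate each annotator with the mean of the others and
# average the individual correlations (0.739 on the real data).
loo_corr = [pearsonr(ratings[:, i], np.delete(ratings, i, axis=1).mean(axis=1))[0]
            for i in range(ratings.shape[1])]
print(scores, float(np.mean(loo_corr)))
```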
TeCla (Textual Classification for Catalan) is a Catalan news corpus for thematic Text Classification. It contains 153,265 articles classified under 30 different categories, although these are editorially oriented categories rather than strictly encyclopedic labels. We crawled 219,586 articles from the Catalan News Agency (ACN) newswire archive, the most recent dated October 11th, 2020. We used the subsection category as the classification label, after excluding territorial labels and labels with fewer than 2,000 occurrences. With these criteria we compiled a total of 153,265 articles for this text classification dataset.
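A minimal sketch of these filtering criteria, assuming articles arrive as (text, label) pairs and with an invented list of territorial labels:

```python
from collections import Counter

TERRITORIAL = {"Barcelona", "Girona", "Lleida", "Tarragona"}  # illustrative labels
MIN_OCCURRENCES = 2000

def filter_articles(articles):
    """Keep only articles whose label is frequent enough and not territorial."""
    counts = Counter(label for _, label in articles)
    keep = {label for label, n in counts.items()
            if n >= MIN_OCCURRENCES and label not in TERRITORIAL}
    return [(text, label) for text, label in articles if label in keep]
```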
TE-ca is a dataset for Textual Entailment in Catalan. It contains more than 20,000 pairs of sentences annotated with their relation label: Neutral, Inference, or Contradiction. The source sentences were extracted from the Catalan Textual Corpus and from the VilaWeb newswire. We randomly chose 18,000 sentences from these sources and filtered them by different criteria, such as length and stand-alone intelligibility. For each of the remaining sentences, we commissioned a team of annotators to write three hypotheses (one per entailment category), obtaining more than 20,000 annotated sentence pairs. For quality assurance, we cross-annotated 600 randomly selected samples and obtained an inter-annotator agreement of 83.57%.
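A sketch of that quality check, assuming simple percentage agreement between two annotators over the cross-annotated samples (the annotations below are invented):

```python
def percent_agreement(labels_a, labels_b):
    """Raw percentage agreement between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Toy example over the three TE-ca labels:
a = ["Neutral", "Inference", "Contradiction", "Neutral"]
b = ["Neutral", "Inference", "Neutral", "Neutral"]
print(percent_agreement(a, b))  # 75.0
```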
CatalanQA is an extractive Question Answering dataset that gathers the two previous QA datasets ViquiQuAD and VilaQuAD. ViquiQuAD was developed using content from the Catalan Wikipedia (Viquipèdia). From a set of high-quality, non-translated articles in the Catalan Wikipedia, 597 were randomly chosen, and from them contexts of 5 to 8 sentences were extracted. We commissioned between 1 and 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines. In total, 15,153 pairs of a question and an extracted fragment containing the answer were created. Following the same guidelines as ViquiQuAD, we developed VilaQuAD, an extractive QA dataset from newswire: from the online edition of the Catalan newspaper VilaWeb, 2,095 article headlines were randomly selected. In total, 6,282 pairs of a question and an extracted fragment containing the answer were created.
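A sketch of the SQuAD-style record format both datasets follow, where every answer is a literal fragment of its context together with a character offset; the record below is invented:

```python
# Invented SQuAD-style record: the answer must be extractable from the context.
record = {
    "context": "VilaWeb és un diari electrònic en català fundat l'any 1995.",
    "question": "Quan es va fundar VilaWeb?",
    "answers": {"text": ["1995"], "answer_start": [54]},
}

# Sanity check that the answer span really is extractive.
start = record["answers"]["answer_start"][0]
text = record["answers"]["text"][0]
assert record["context"][start:start + len(text)] == text
```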
The Cross-lingual Question Answering Dataset (XQuAD) is a multilingual benchmark for evaluating question-answering performance. It consists of a subset of 240 paragraphs from Wikipedia and 1,190 question-answer pairs from the development set of SQuAD v1.1 (Rajpurkar et al., 2016), together with their professional translations into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi; Romanian was added later. We created a Catalan version, XQuAD-ca, using professional translators. The model for this task is trained on CatalanQA and tested on XQuAD-ca.
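As a minimal sketch of this setup, a QA model fine-tuned on CatalanQA can be applied to XQuAD-ca examples with the transformers pipeline API; the checkpoint name and the example pair below are hypothetical:

```python
from transformers import pipeline

# Hypothetical checkpoint fine-tuned on CatalanQA.
qa = pipeline("question-answering", model="some-org/roberta-base-ca-catalanqa")

# Invented question/context pair standing in for an XQuAD-ca example.
prediction = qa(
    question="A quantes llengües s'ha traduït XQuAD?",
    context="XQuAD conté traduccions professionals dels paràgrafs a deu llengües.",
)
print(prediction["answer"], prediction["score"])
```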
MASSIVE is a parallel dataset of more than 1M utterances across 51 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, which is composed of general single-shot interactions with an Intelligent Voice Assistant.
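To make the slot annotation concrete, the sketch below parses MASSIVE-style inline annotations, where slots are marked as "[slot_type : value]" inside the utterance; the Catalan utterance is invented:

```python
import re

# Invented Catalan utterance with MASSIVE-style inline slot annotations.
annot_utt = "desperta'm [time : a les nou] [date : divendres]"

def extract_slots(annot):
    """Return the (slot_type, value) pairs marked in an annotated utterance."""
    return re.findall(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]", annot)

print(extract_slots(annot_utt))  # [('time', 'a les nou'), ('date', 'divendres')]
```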