Data
The contents of this module are mostly concerned with providing compatibility with the
Huggingface transformers package.
Specifically, the nlp_uncertainty_zoo.utils.data.DatasetBuilder
class tries to provide an easy interface for
loading local datasets while reusing the Huggingface code.
Furthermore, it supports the easy use of custom samplers defined in the nlp_uncertainty_zoo.utils.samplers
module,
which produce representative sub-samples of training sets for different tasks.
These were for instance used in Ulmer et al. (2022) to show how the quality of uncertainty estimates behaves as a function of the available training data.
The following dataset builder classes are included:
nlp_uncertainty_zoo.utils.data.DatasetBuilder
: Abstract superclass that can be used for inheritance in order to support new task types.
nlp_uncertainty_zoo.utils.data.LanguageModellingDatasetBuilder
: Dataset builder used for language modelling, including both “classic” language modelling and masked language modelling. To indicate which type of language modelling is used, either “next_token_prediction” or “mlm” should be specified for the type_ argument during initialization.
nlp_uncertainty_zoo.utils.data.ClassificationDatasetBuilder
: As the name suggests, this class is aimed at classification problems, both in terms of sequence labelling and sequence prediction. This is again specified by passing “sequence_classification” or “token_classification” in the type_ argument during initialization. Dataset files are expected to be in the .csv format with tab-separated columns containing the sentence and label(s). When using sequence labelling, labels spanning multiple subword tokens will only be assigned to the first part, while the other subword tokens receive a -100 label.
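The subword labelling rule described for sequence labelling can be illustrated with a minimal sketch. This is not the library's own code: the function name `align_labels` and the example inputs are hypothetical, and the sketch assumes that each subword token comes with the index of the word it belongs to (with `None` for special tokens, which are also given the -100 label here as an assumption).

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value for subword tokens that should not contribute to the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Assign each word's label to its first subword token and -100 to the rest.

    word_ids: for every subword token, the index of the word it came from,
              or None for special tokens (assumption of this sketch).
    word_labels: one label per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: no word label available, so it is ignored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword token of a word: receives the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subword token of the same word: ignored.
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hypothetical example: "[CLS] Wash ##ington is nice [SEP]"
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [3, 0, 0]  # one label per word: "Washington", "is", "nice"
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, -100, 0, 0, -100]
```

The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, so tokens carrying it are skipped when the loss is computed.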
Furthermore, the module contains a modified version of Huggingface’s DataCollatorForLanguageModeling:
For classical next token prediction, the original collator did not seem to produce the right offset between tokens and labels, i.e. the labels did not correspond to the next tokens to be predicted.
The nlp_uncertainty_zoo.utils.data.ModifiedDataCollatorForLanguageModeling
provides a minimal modification of the original code to ensure this property.
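To make the intended offset between tokens and labels concrete, here is a minimal sketch of one common way to build labels for next token prediction. It does not reproduce the code of ModifiedDataCollatorForLanguageModeling: the function name `next_token_labels`, the token ids, and the choice to pad the final position with -100 are assumptions made for illustration.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # the last position has no next token, so its label is ignored


def next_token_labels(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, labels) such that labels[i] is the token following inputs[i]."""
    # Shift the sequence left by one position and pad the end with the ignore index.
    labels = token_ids[1:] + [IGNORE_INDEX]
    return list(token_ids), labels


token_ids = [101, 7, 8, 9, 102]  # hypothetical token ids
inputs, labels = next_token_labels(token_ids)
for inp, lab in zip(inputs, labels):
    print(inp, "->", lab)
```

Whether this shift is applied in the collator or inside the model depends on the model implementation, so it should only be applied once.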