Data
====

The contents of this module are mostly concerned with providing compatibility with the Huggingface transformers package.
Specifically, the :py:class:`nlp_uncertainty_zoo.utils.data.DatasetBuilder` class provides an easy interface for loading local datasets while reusing the Huggingface code.
Furthermore, it supports the custom samplers defined in the :py:mod:`nlp_uncertainty_zoo.utils.samplers` module, which produce representative sub-samples of training sets for different tasks.
These were used, for instance, in Ulmer et al. (2022) to show how the quality of uncertainty estimates behaves as a function of the available training data.

The following dataset builder classes are included:

* :py:class:`nlp_uncertainty_zoo.utils.data.DatasetBuilder`: Abstract superclass that can be inherited from in order to support new task types.
* :py:class:`nlp_uncertainty_zoo.utils.data.LanguageModellingDatasetBuilder`: Dataset builder used for language modelling, including both "classic" language modelling and masked language modelling. To indicate which type of language modelling is used, either ``"next_token_prediction"`` or ``"mlm"`` should be specified for the ``type_`` argument during initialization.
* :py:class:`nlp_uncertainty_zoo.utils.data.ClassificationDatasetBuilder`: As the name suggests, this class is aimed at classification problems, both sequence labelling and sequence prediction. The variant is again selected by passing ``"sequence_classification"`` or ``"token_classification"`` for the ``type_`` argument during initialization. Dataset files are expected to be in the ``.csv`` format with tab-separated columns containing the sentence and label(s). When using sequence labelling, a label for a word spanning multiple subword tokens is only assigned to the first subword token, while the remaining subword tokens receive a ``-100`` label.

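
The subword labelling rule described for sequence labelling can be illustrated with a small, self-contained sketch.
This is not the package's implementation: the ``align_labels`` helper and the example inputs are hypothetical, and the sketch assumes that each subword token is mapped to the index of the word it came from (with ``None`` for special tokens), similar to what Huggingface fast tokenizers expose via ``word_ids()``.
It also assumes that special tokens receive the ``-100`` label, which the text above does not state.

.. code-block:: python

    from typing import List, Optional

    # -100 is the default ignore_index of PyTorch's cross-entropy loss,
    # so positions carrying this label do not contribute to the loss.
    IGNORE_INDEX = -100


    def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
        """Hypothetical helper: give each word's label to its first subword token only."""
        aligned = []
        previous = None
        for word_id in word_ids:
            if word_id is None:
                # Assumption: special tokens (e.g. [CLS], [SEP]) are ignored.
                aligned.append(IGNORE_INDEX)
            elif word_id != previous:
                # First subword token of a word keeps the word's label.
                aligned.append(word_labels[word_id])
            else:
                # Remaining subword tokens of the same word are ignored.
                aligned.append(IGNORE_INDEX)
            previous = word_id
        return aligned


    # Three words; the second word is split into three subword tokens.
    word_ids = [None, 0, 1, 1, 1, 2, None]
    word_labels = [0, 3, 0]
    aligned = align_labels(word_ids, word_labels)
    print(aligned)  # [-100, 0, 3, -100, -100, 0, -100]

Because ``-100`` matches the default ``ignore_index`` of PyTorch's cross-entropy loss, the ignored positions drop out of the loss without further masking.
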
Furthermore, the module contains a modified version of Huggingface's ``DataCollatorForLanguageModeling``.
For classical next-token prediction, the original collator did not appear to produce the right offset between tokens and labels, i.e. the next tokens to be predicted.
The :py:class:`nlp_uncertainty_zoo.utils.data.ModifiedDataCollatorForLanguageModeling` provides a minimal modification of the original code to ensure this property.

Data Module Documentation
=========================

.. automodule:: nlp_uncertainty_zoo.utils.data
    :members:
    :show-inheritance:
    :undoc-members: