Samplers¶
This module contains dataset samplers that are used with the nlp_uncertainty_zoo.utils.data.DatasetBuilder class
to create representative sub-samples of training data.
These were for instance used in Ulmer et al. (2022) to show how the quality of uncertainty estimates behaves as a function of the available training data.
Right now, the module comprises three different samplers:
nlp_uncertainty_zoo.utils.samplers.LanguageModellingSampler
: Here, inputs are sub-sampled primarily to maintain the original distribution of sentence lengths in the text. Additionally, multiple contiguous blocks of sentences are sampled to maintain a notion of paragraphs.
nlp_uncertainty_zoo.utils.samplers.SequenceClassificationSampler
: Inputs are sub-sampled primarily to maintain the same class distribution as in the original corpus, and secondarily to maintain the same distribution of sequence lengths.
nlp_uncertainty_zoo.utils.samplers.TokenClassificationSampler
: Same as the previous one, except that all labels of a sequence are considered. In order to maintain the same class distribution, sequences are primarily sampled in proportion to the cross-entropy between a sequence’s label distribution and the global label distribution.
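The cross-entropy weighting described above can be illustrated with a short sketch. Note that the helper names below (`label_distribution`, `cross_entropy_weight`) are hypothetical and chosen for illustration; they are not part of the library's API:

```python
import math
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[int]) -> Dict[int, float]:
    """Normalize label counts into a categorical distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def cross_entropy_weight(
    seq_labels: List[int], global_dist: Dict[int, float]
) -> float:
    """Cross-entropy between a sequence's label distribution and the global
    label distribution; sequences containing globally rare labels receive
    higher weight."""
    seq_dist = label_distribution(seq_labels)
    return -sum(
        p * math.log(global_dist[label]) for label, p in seq_dist.items()
    )


# Sequences whose labels are rare in the corpus get larger weights and are
# thus more likely to be kept, which helps preserve the class distribution.
corpus = [[0, 0, 1], [0, 0, 0], [0, 1, 1]]
global_dist = label_distribution([l for seq in corpus for l in seq])
```

Under this scheme, a sequence dominated by the rare label 1 would be weighted more heavily than one containing only the frequent label 0.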
Samplers Module Documentation¶
Samplers used to sub-sample different types of datasets. In each class, statistics about the distribution of inputs are built, and indices of instances from the dataset are then sub-sampled based on these statistics.
- class nlp_uncertainty_zoo.utils.samplers.LanguageModellingSampler(data_source: Sized, target_size: int, sample_range: Tuple[int, int], num_jobs: int = 1, seed: Optional[int] = None)¶
Bases: Subsampler
Sampler specific to language modelling. The sub-sampling strategy here is to approximately maintain the same distribution of sentence lengths as in the original corpus, and to maintain contiguous paragraphs of text spanning multiple sentences.
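The length-preserving strategy can be sketched as follows. This is a simplified illustration, not the library's actual implementation, and the function name `sample_by_length` is hypothetical; it draws sentence lengths from the empirical length distribution and then picks a not-yet-sampled instance of that length:

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_by_length(
    lengths: List[int], target_size: int, seed: int = 0
) -> List[int]:
    """Sub-sample instance indices so that the length distribution of the
    sample approximates the length distribution of the full corpus."""
    rng = random.Random(seed)
    # Group instance indices by their sentence length.
    by_length: Dict[int, List[int]] = defaultdict(list)
    for idx, length in enumerate(lengths):
        by_length[length].append(idx)
    # Draw lengths from the empirical distribution, then take an unused
    # instance of that length; frequent lengths stay frequent in the sample.
    sampled: List[int] = []
    while len(sampled) < target_size:
        length = rng.choice(lengths)
        if by_length[length]:
            sampled.append(by_length[length].pop())
    return sampled
```

The actual `LanguageModellingSampler` additionally samples contiguous blocks of sentences (controlled by `sample_range`) to preserve paragraph structure, which this sketch omits.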
- class nlp_uncertainty_zoo.utils.samplers.SequenceClassificationSampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)¶
Bases: Subsampler
Sampler specific to sequence classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution.
- class nlp_uncertainty_zoo.utils.samplers.Subsampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)¶
Bases: Sampler, ABC
Abstract base class of any sampler that sub-samples a dataset to a given target size.
- class nlp_uncertainty_zoo.utils.samplers.TokenClassificationSampler(data_source: Sized, target_size: int, ignore_label: int = -100, num_jobs: int = 1, seed: Optional[int] = None)¶
Bases: Subsampler
Sampler specific to token classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution. Compared to the SequenceClassificationSampler, all labels of a sequence are considered.
- nlp_uncertainty_zoo.utils.samplers.create_probs_from_dict(freq_dict: Dict[int, int], max_label: Optional[int] = None) → array¶
Auxiliary function creating a numpy array containing a categorical distribution over integers from a dictionary of frequencies.
- Parameters:
- freq_dict: Dict[int, int]
Dictionary mapping from class labels to frequencies.
- max_label: Optional[int]
Maximum value of a class label, i.e. the number of classes minus one. If None, this is inferred from the maximum key in freq_dict.
- Returns:
- np.array
Distribution over class labels as a numpy array.
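A possible implementation consistent with the documented signature might look like the following. This is a sketch inferred from the description above, not the library's actual code:

```python
from typing import Dict, Optional

import numpy as np


def create_probs_from_dict(
    freq_dict: Dict[int, int], max_label: Optional[int] = None
) -> np.ndarray:
    """Turn a label-frequency dictionary into a categorical distribution
    over integer labels, with zero probability for unseen labels."""
    if max_label is None:
        # Infer the number of classes from the largest observed label.
        max_label = max(freq_dict.keys())
    probs = np.zeros(max_label + 1)
    for label, freq in freq_dict.items():
        probs[label] = freq
    return probs / probs.sum()
```

Labels absent from `freq_dict` but below `max_label` simply receive probability zero, which keeps the array indexable by label.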
- nlp_uncertainty_zoo.utils.samplers.merge_freq_dicts(freqs_a: Dict[int, int], freqs_b: Dict[int, int]) → Dict[int, int]¶
Merge two dictionaries of frequencies. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.
- Parameters:
- freqs_a: Dict[int, int]
First frequency dictionary.
- freqs_b: Dict[int, int]
Second frequency dictionary.
- Returns:
- Dict[int, int]
New frequency dictionary.
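Given the description, a minimal implementation would sum the counts of both dictionaries per key. The following is a sketch consistent with the documented signature, not necessarily the library's exact code:

```python
from collections import Counter
from typing import Dict


def merge_freq_dicts(
    freqs_a: Dict[int, int], freqs_b: Dict[int, int]
) -> Dict[int, int]:
    """Merge two frequency dictionaries by summing counts per key; keys
    present in only one dictionary keep their original counts."""
    merged = Counter(freqs_a)
    merged.update(freqs_b)  # Counter.update adds counts instead of replacing
    return dict(merged)
```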
- nlp_uncertainty_zoo.utils.samplers.merge_instance_dicts(lengths_a: Dict[int, List[int]], lengths_b: Dict[int, List[int]]) → Dict[int, List[int]]¶
Merge two dictionaries of instance lists, where inputs are grouped by a common characteristic, e.g. length. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.
- Parameters:
- lengths_a: Dict[int, List[int]]
First instance dictionary.
- lengths_b: Dict[int, List[int]]
Second instance dictionary.
- Returns:
- Dict[int, List[int]]
New instance dictionary.
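Analogous to merge_freq_dicts, a plausible implementation concatenates the per-key index lists instead of summing counts. Again, this is a sketch based on the description above, not the library's actual code:

```python
from typing import Dict, List


def merge_instance_dicts(
    lengths_a: Dict[int, List[int]], lengths_b: Dict[int, List[int]]
) -> Dict[int, List[int]]:
    """Merge two dicts mapping a shared characteristic (e.g. sequence
    length) to lists of instance indices, concatenating lists per key."""
    # Copy the lists so neither input dictionary is mutated.
    merged = {key: list(indices) for key, indices in lengths_a.items()}
    for key, indices in lengths_b.items():
        merged.setdefault(key, []).extend(indices)
    return merged
```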