Samplers

This module contains dataset samplers that are used with the nlp_uncertainty_zoo.utils.data.DatasetBuilder class to create representative sub-samples of training data. These were used, for instance, in Ulmer et al. (2022) to show how the quality of uncertainty estimates behaves as a function of the amount of available training data.

Right now, the module comprises three different samplers: LanguageModellingSampler, SequenceClassificationSampler, and TokenClassificationSampler, all of which derive from the abstract Subsampler base class.

Samplers used to sub-sample different types of datasets. In each class, statistics about the distribution of inputs are collected, and indices of dataset instances are then sub-sampled based on these statistics.

class nlp_uncertainty_zoo.utils.samplers.LanguageModellingSampler(data_source: Sized, target_size: int, sample_range: Tuple[int, int], num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to language modelling. The sub-sampling strategy here is to approximately maintain the same distribution of sentence lengths as in the original corpus, and to maintain contiguous paragraphs of text spanning multiple sentences.
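
As a minimal usage sketch: the toy corpus below is illustrative, the interpretation of sample_range as the number of contiguous sentences drawn per paragraph is an assumption, and the Sampler base is presumably torch.utils.data.Sampler, so the result plugs into a standard DataLoader.

    from torch.utils.data import DataLoader

    from nlp_uncertainty_zoo.utils.samplers import LanguageModellingSampler

    # Toy stand-in corpus: a Sized container of token-id sequences.
    corpus = [[1, 2, 3, 4], [5, 6], [7, 8, 9], [2, 5, 7, 8, 1]] * 250

    sampler = LanguageModellingSampler(
        data_source=corpus,
        target_size=100,      # sub-sample down to 100 instances
        sample_range=(3, 6),  # assumption: contiguous sentences drawn per paragraph
        num_jobs=1,
        seed=1234,
    )

    # A Sampler yields dataset indices, so it can be handed to a DataLoader;
    # the identity collate_fn avoids tensor collation of ragged sequences.
    loader = DataLoader(corpus, sampler=sampler, batch_size=16, collate_fn=lambda b: b)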

class nlp_uncertainty_zoo.utils.samplers.SequenceClassificationSampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to sequence classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution.
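
A sketch of the intended effect, assuming (token_ids, label) pairs as the instance format; the actual format the sampler inspects may differ, so treat this as illustrative only.

    from collections import Counter

    from nlp_uncertainty_zoo.utils.samplers import SequenceClassificationSampler

    # Toy dataset of (token_ids, label) pairs; the instance format is an assumption.
    train_set = [([1, 2, 3], 0), ([4, 5], 1), ([6, 7, 8, 9], 0)] * 200

    sampler = SequenceClassificationSampler(
        data_source=train_set,
        target_size=150,
        num_jobs=1,
        seed=42,
    )

    # Iterating a Sampler yields dataset indices; the label distribution of the
    # sub-sample should approximately match that of the full dataset.
    print(Counter(train_set[i][1] for i in sampler))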

class nlp_uncertainty_zoo.utils.samplers.Subsampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Sampler, ABC

Abstract base class of any sampler that sub-samples a dataset to a given target size.

class nlp_uncertainty_zoo.utils.samplers.TokenClassificationSampler(data_source: Sized, target_size: int, ignore_label: int = -100, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to token classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution. Compared to the SequenceClassificationSampler, all labels of a sequence are considered.
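
A hedged instantiation sketch; the (token_ids, per-token labels) instance format is an assumption, while ignore_label=-100 matches the default ignore index of PyTorch's cross-entropy loss.

    from nlp_uncertainty_zoo.utils.samplers import TokenClassificationSampler

    # Toy dataset: (token_ids, per-token labels) pairs, with -100 marking positions
    # (e.g. special tokens) that should not count towards the label distribution.
    train_set = [
        ([101, 7, 8, 102], [-100, 1, 0, -100]),
        ([101, 9, 102], [-100, 2, -100]),
    ] * 300

    sampler = TokenClassificationSampler(
        data_source=train_set,
        target_size=200,
        ignore_label=-100,  # default; this label value is excluded from the statistics
        num_jobs=1,
        seed=7,
    )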

nlp_uncertainty_zoo.utils.samplers.create_probs_from_dict(freq_dict: Dict[int, int], max_label: Optional[int] = None) → array

Auxiliary function creating a numpy array containing a categorical distribution over integers from a dictionary of frequencies.

Parameters:
freq_dict: Dict[int, int]

Dictionary mapping from class labels to frequencies.

max_label: Optional[int]

Maximum value of a class label, i.e. the number of classes minus one. If None, this is inferred from the maximum key in freq_dict.

Returns:
np.array

Distribution over class labels as a numpy array.
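
To illustrate the intended semantics (the concrete numbers below follow from the description rather than from the library's tests):

    from nlp_uncertainty_zoo.utils.samplers import create_probs_from_dict

    # Labels 0 and 2 were observed 3 and 1 times; label 1 was never observed.
    probs = create_probs_from_dict({0: 3, 2: 1})
    # Expected by the description: array([0.75, 0.0, 0.25]), since the maximum
    # key (2) determines the number of classes when max_label is None.

    probs = create_probs_from_dict({0: 3, 2: 1}, max_label=4)
    # Expected: array([0.75, 0.0, 0.25, 0.0, 0.0]), covering labels 0..4.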

nlp_uncertainty_zoo.utils.samplers.merge_freq_dicts(freqs_a: Dict[int, int], freqs_b: Dict[int, int]) → Dict[int, int]

Merge two dictionaries of frequencies. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.

Parameters:
freqs_a: Dict[int, int]

First frequency dictionary.

freqs_b: Dict[int, int]

Second frequency dictionary.

Returns:
Dict[int, int]

New frequency dictionary.
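
Expected behaviour, as a sketch (the output shown is what the description implies, i.e. per-key addition of counts):

    from nlp_uncertainty_zoo.utils.samplers import merge_freq_dicts

    freqs_a = {0: 2, 1: 5}  # counts collected by job 1
    freqs_b = {1: 3, 2: 1}  # counts collected by job 2

    merged = merge_freq_dicts(freqs_a, freqs_b)
    # Implied result: {0: 2, 1: 8, 2: 1} (counts are summed per class label).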

nlp_uncertainty_zoo.utils.samplers.merge_instance_dicts(lengths_a: Dict[int, List[int]], lengths_b: Dict[int, List[int]]) → Dict[int, List[int]]

Merge two dictionaries of instance lists, where instances are grouped by a common characteristic, e.g. length. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.

Parameters:
lengths_a: Dict[int, List[int]]

First instance dictionary.

lengths_b: Dict[int, List[int]]

Second instance dictionary.

Returns:
Dict[int, List[int]]

New instance dictionary.
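
Expected behaviour, sketched analogously (per-key concatenation of instance lists, as the description implies):

    from nlp_uncertainty_zoo.utils.samplers import merge_instance_dicts

    # Instance indices grouped by sequence length, collected by two separate jobs.
    lengths_a = {3: [0, 4], 5: [1]}
    lengths_b = {3: [7], 8: [2, 6]}

    merged = merge_instance_dicts(lengths_a, lengths_b)
    # Implied result: {3: [0, 4, 7], 5: [1], 8: [2, 6]}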