Samplers

This module contains dataset samplers that are used with the nlp_uncertainty_zoo.utils.data.DatasetBuilder class to create representative sub-samples of training data. These were used, for instance, in Ulmer et al. (2022) to show how the quality of uncertainty estimates behaves as a function of the amount of available training data.

Right now, the module comprises three different samplers: LanguageModellingSampler, SequenceClassificationSampler, and TokenClassificationSampler, all of which derive from the abstract Subsampler base class.

Samplers used to sub-sample different types of datasets. In each class, statistics about the distribution of inputs are collected, and indices of dataset instances are then sub-sampled based on these statistics.

class nlp_uncertainty_zoo.utils.samplers.LanguageModellingSampler(data_source: Sized, target_size: int, sample_range: Tuple[int, int], num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to language modelling. The sub-sampling strategy here is to approximately maintain the same distribution of sentence lengths as in the original corpus, and to maintain contiguous paragraphs of text spanning multiple sentences.
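
As a minimal usage sketch: the toy corpus below is illustrative, the interpretation of sample_range as the number of contiguous sentences drawn per paragraph is an assumption, and the Sampler base is presumably torch.utils.data.Sampler, so the result plugs into a standard DataLoader.

    from torch.utils.data import DataLoader

    from nlp_uncertainty_zoo.utils.samplers import LanguageModellingSampler

    # Toy stand-in corpus: a Sized container of token-id sequences.
    corpus = [[1, 2, 3, 4], [5, 6], [7, 8, 9], [2, 5, 7, 8, 1]] * 250

    sampler = LanguageModellingSampler(
        data_source=corpus,
        target_size=100,      # sub-sample down to 100 instances
        sample_range=(3, 6),  # assumption: contiguous sentences drawn per paragraph
        num_jobs=1,
        seed=1234,
    )

    # A Sampler yields dataset indices, so it can be handed to a DataLoader;
    # the identity collate_fn avoids tensor collation of ragged sequences.
    loader = DataLoader(corpus, sampler=sampler, batch_size=16, collate_fn=lambda b: b)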

class nlp_uncertainty_zoo.utils.samplers.SequenceClassificationSampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to sequence classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution.
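
A sketch of the intended effect, assuming (token_ids, label) pairs as the instance format; the actual format the sampler inspects may differ, so treat this as illustrative only.

    from collections import Counter

    from nlp_uncertainty_zoo.utils.samplers import SequenceClassificationSampler

    # Toy dataset of (token_ids, label) pairs; the instance format is an assumption.
    train_set = [([1, 2, 3], 0), ([4, 5], 1), ([6, 7, 8, 9], 0)] * 200

    sampler = SequenceClassificationSampler(
        data_source=train_set,
        target_size=150,
        num_jobs=1,
        seed=42,
    )

    # Iterating a Sampler yields dataset indices; the label distribution of the
    # sub-sample should approximately match that of the full dataset.
    print(Counter(train_set[i][1] for i in sampler))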

class nlp_uncertainty_zoo.utils.samplers.Subsampler(data_source: Sized, target_size: int, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Sampler, ABC

Abstract base class of any sampler that sub-samples a dataset to a given target size.

class nlp_uncertainty_zoo.utils.samplers.TokenClassificationSampler(data_source: Sized, target_size: int, ignore_label: int = -100, num_jobs: int = 1, seed: Optional[int] = None)

Bases: Subsampler

Sampler specific to token classification. The strategy here is to approximately maintain the same class distribution as in the original corpus, and to a lesser extent the same sequence length distribution. Compared to the SequenceClassificationSampler, all labels of a sequence are considered.
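
A hedged instantiation sketch; the (token_ids, per-token labels) instance format is an assumption, while ignore_label=-100 matches the default ignore index of PyTorch's cross-entropy loss.

    from nlp_uncertainty_zoo.utils.samplers import TokenClassificationSampler

    # Toy dataset: (token_ids, per-token labels) pairs, with -100 marking positions
    # (e.g. special tokens) that should not count towards the label distribution.
    train_set = [
        ([101, 7, 8, 102], [-100, 1, 0, -100]),
        ([101, 9, 102], [-100, 2, -100]),
    ] * 300

    sampler = TokenClassificationSampler(
        data_source=train_set,
        target_size=200,
        ignore_label=-100,  # default; this label value is excluded from the statistics
        num_jobs=1,
        seed=7,
    )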

nlp_uncertainty_zoo.utils.samplers.create_probs_from_dict(freq_dict: Dict[int, int], max_label: Optional[int] = None) → array

Auxiliary function creating a numpy array containing a categorical distribution over integers from a dictionary of frequencies.

Parameters:
freq_dict: Dict[int, int]

Dictionary mapping from class labels to frequencies.

max_label: Optional[int]

Maximum value of a class label, i.e. the number of classes minus one. If None, this is inferred from the maximum key in freq_dict.

Returns:
np.array

Distribution over class labels as a numpy array.
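
To illustrate the intended semantics (the concrete numbers below follow from the description rather than from the library's tests):

    from nlp_uncertainty_zoo.utils.samplers import create_probs_from_dict

    # Labels 0 and 2 were observed 3 and 1 times; label 1 was never observed.
    probs = create_probs_from_dict({0: 3, 2: 1})
    # Expected by the description: array([0.75, 0.0, 0.25]), since the maximum
    # key (2) determines the number of classes when max_label is None.

    probs = create_probs_from_dict({0: 3, 2: 1}, max_label=4)
    # Expected: array([0.75, 0.0, 0.25, 0.0, 0.0]), covering labels 0..4.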

nlp_uncertainty_zoo.utils.samplers.merge_freq_dicts(freqs_a: Dict[int, int], freqs_b: Dict[int, int]) → Dict[int, int]

Merge two dictionaries of frequencies. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.

Parameters:
freqs_a: Dict[int, int]

First frequency dictionary.

freqs_b: Dict[int, int]

Second frequency dictionary.

Returns:
Dict[int, int]

New frequency dictionary.
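
Expected behaviour, as a sketch (the output shown is what the description implies, i.e. per-key addition of counts):

    from nlp_uncertainty_zoo.utils.samplers import merge_freq_dicts

    freqs_a = {0: 2, 1: 5}  # counts collected by job 1
    freqs_b = {1: 3, 2: 1}  # counts collected by job 2

    merged = merge_freq_dicts(freqs_a, freqs_b)
    # Implied result: {0: 2, 1: 8, 2: 1} (counts are summed per class label).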

nlp_uncertainty_zoo.utils.samplers.merge_instance_dicts(lengths_a: Dict[int, List[int]], lengths_b: Dict[int, List[int]]) → Dict[int, List[int]]

Merge two dictionaries of instance lists, where instances are grouped by a common characteristic, e.g. length. Used for creating data statistics before sub-sampling, where statistics for each instance are collected via different jobs and then merged.

Parameters:
lengths_a: Dict[int, List[int]]

First instance dictionary.

lengths_b: Dict[int, List[int]]

Second instance dictionary.

Returns:
Dict[int, List[int]]

New instance dictionary.
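
Expected behaviour, sketched analogously (per-key concatenation of instance lists, as the description implies):

    from nlp_uncertainty_zoo.utils.samplers import merge_instance_dicts

    # Instance indices grouped by sequence length, collected by two separate jobs.
    lengths_a = {3: [0, 4], 5: [1]}
    lengths_b = {3: [7], 8: [2, 6]}

    merged = merge_instance_dicts(lengths_a, lengths_b)
    # Implied result: {3: [0, 4, 7], 5: [1], 8: [2, 6]}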