Uncertainty Eval

The quality of uncertainty estimates can be tricky to evaluate, since there are no gold labels as there are for classification tasks. For that reason, this module contains methods dedicated to exactly this purpose.

To measure the quality of general uncertainty estimates, we use the common evaluation method of defining a proxy OOD detection task, where we quantify how well in- and out-of-distribution inputs can be distinguished based on the uncertainty scores given by a model. This is realized using the area under the receiver operating characteristic curve (nlp_uncertainty_zoo.utils.uncertainty_eval.auroc()) and the area under the precision-recall curve (nlp_uncertainty_zoo.utils.uncertainty_eval.aupr()).
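
As a minimal sketch of this proxy task (with made-up uncertainty scores; the label convention of 1 for in- and 0 for out-of-distribution follows the function documentation below):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import aupr, auroc

    # Hypothetical uncertainty scores for in-distribution (ID) and
    # out-of-distribution (OOD) inputs.
    id_scores = np.array([0.12, 0.08, 0.20, 0.15])
    ood_scores = np.array([0.65, 0.80, 0.55, 0.90])

    # 1 marks ID samples, 0 marks OOD samples.
    y_true = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    y_pred = np.concatenate([id_scores, ood_scores])

    print(auroc(y_true, y_pred), aupr(y_true, y_pred))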

As introduced in Ulmer et al. (2022), we also evaluate how indicative the uncertainty score is of a model's potential loss. For this purpose, this module also implements Kendall's tau correlation coefficient (nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau()).
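
A sketch of the underlying computation, applying scipy.stats.kendalltau (presumably the same function shown in the default eval_funcs of evaluate_uncertainty() below) to made-up losses and uncertainties:

    import numpy as np
    from scipy.stats import kendalltau

    # Made-up per-sequence losses and matching uncertainty scores.
    losses = np.array([0.3, 1.2, 0.7, 2.5])
    uncertainties = np.array([0.05, 0.40, 0.20, 0.90])

    tau, _ = kendalltau(losses, uncertainties)
    print(tau)  # 1.0 here, since both arrays induce the same ranking.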

Instances of the nlp_uncertainty_zoo.models.model.Model class can be evaluated with all of the above metrics in a single call to nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty().
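
A minimal usage sketch, assuming a trained model and two DataLoaders are already available (the variable names below are placeholders):

    from nlp_uncertainty_zoo.utils.uncertainty_eval import evaluate_uncertainty

    # model: a trained Model instance (placeholder).
    # id_dataloader / ood_dataloader: DataLoaders over the in- and
    # out-of-distribution evaluation splits (placeholders).
    results = evaluate_uncertainty(
        model,
        id_eval_split=id_dataloader,
        ood_eval_split=ood_dataloader,
    )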

Uncertainty Eval Module Documentation

Module implementing different metrics to measure the quality of uncertainty estimates.

nlp_uncertainty_zoo.utils.uncertainty_eval.aupr(y_true: array, y_pred: array) → float

Return the area under the precision-recall curve for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.

Parameters:
y_true: np.array

True labels, where 1 corresponds to in- and 0 to out-of-distribution.

y_pred: np.array

Uncertainty scores as predictions to distinguish the two classes.

Returns:
float

Area under the precision-recall curve.
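
For illustration, a direct call on toy arrays following the label convention above (all values are made up):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import aupr

    y_true = np.array([1, 1, 1, 0, 0, 0])               # 1 = ID, 0 = OOD
    y_pred = np.array([0.1, 0.2, 0.15, 0.8, 0.9, 0.7])  # uncertainty scores

    score = aupr(y_true, y_pred)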

nlp_uncertainty_zoo.utils.uncertainty_eval.auroc(y_true: array, y_pred: array) → float

Return the area under the receiver-operator characteristic for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.

Parameters:
y_true: np.array

True labels, where 1 corresponds to in- and 0 to out-of-distribution.

y_pred: np.array

Uncertainty scores as predictions to distinguish the two classes.

Returns:
float

Area under the receiver-operator characteristic.
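
The call pattern mirrors aupr(); as a point of reference, uncertainty scores that carry no information about ID vs. OOD membership yield an AUROC of roughly 0.5 (toy values below):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import auroc

    y_true = np.array([1, 1, 1, 0, 0, 0])             # 1 = ID, 0 = OOD
    # Made-up, uninformative scores: expect an AUROC in the vicinity of 0.5.
    y_pred = np.array([0.5, 0.1, 0.9, 0.4, 0.2, 0.8])

    score = auroc(y_true, y_pred)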

nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty(model: nlp_uncertainty_zoo.models.model.Model, id_eval_split: torch.utils.data.dataloader.DataLoader, ood_eval_split: typing.Optional[torch.utils.data.dataloader.DataLoader] = None, eval_funcs: typing.Dict[str, typing.Callable] = frozendict.frozendict({'kendalls_tau': <function kendalltau>}), contrastive_eval_funcs: typing.Tuple[typing.Callable] = frozendict.frozendict({'aupr': <function aupr>, 'auroc': <function auroc>}), ignore_token_ids: typing.Tuple[int] = (-100,), verbose: bool = True) → Dict[str, Any]

Evaluate the uncertainty properties of a model. Evaluation happens in two ways:

1. Eval functions that are applied to the uncertainty metrics of the model on the id_eval_split (and the ood_eval_split, if specified).

2. Eval functions that take measurements on an in- and an out-of-distribution dataset to evaluate a proxy binary anomaly detection task; for this, the functions specified by contrastive_eval_funcs are used, and the ood_eval_split argument has to be specified.

Parameters:
model: Model

Model to be evaluated.

id_eval_split: DataLoader

Main evaluation split.

ood_eval_split: Optional[DataLoader]

OOD evaluation split. Needs to be specified for contrastive evaluation functions to work.

eval_funcs: Dict[str, Callable]

Evaluation functions that evaluate uncertainty by comparing it to model losses on a single split.

contrastive_eval_funcs: Dict[str, Callable]

Evaluation functions that evaluate uncertainty by comparing uncertainties on an ID and OOD test set.

ignore_token_ids: Tuple[int]

IDs of tokens that should be ignored by the model during evaluation.

verbose: bool

Whether to display information about the current progress.

Returns:
Dict[str, Any]

Results as a dictionary from uncertainty metric / split / eval metric to result.
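
Since the exact key layout depends on the model's uncertainty metrics and on whether an OOD split was passed, iterating over the returned dictionary is a quick way to inspect the results:

    # `results` is the dictionary returned by evaluate_uncertainty().
    for name, value in results.items():
        print(name, value)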

nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau(losses: array, uncertainties: array) → float

Compute Kendall’s tau for a list of losses and uncertainties for a set of inputs. If the two lists are concordant, i.e. the points with the highest uncertainty incur the highest loss, Kendall’s tau is 1. If they are completely discordant, it is -1.

Parameters:
losses: np.array

List of losses for a set of points.

uncertainties: np.array

List of uncertainties for a set of points.

Returns:
float

Kendall’s tau, between -1 and 1.
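
A toy check of the two extremes described above (the numbers are made up):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import kendalls_tau

    losses = np.array([0.2, 0.5, 1.0, 2.0])

    # Concordant: higher uncertainty always goes with higher loss -> tau = 1.
    print(kendalls_tau(losses, np.array([0.1, 0.3, 0.6, 0.9])))
    # Discordant: higher uncertainty always goes with lower loss -> tau = -1.
    print(kendalls_tau(losses, np.array([0.9, 0.6, 0.3, 0.1])))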