Uncertainty Eval

The quality of uncertainty estimates can be tricky to evaluate, since there are no gold labels as there are for classification tasks. For that reason, this module contains methods dedicated to exactly this purpose.

To measure the quality of general uncertainty estimates, we use the common evaluation method of defining a proxy OOD detection task, where we quantify how well in- and out-of-distribution inputs can be distinguished based on the uncertainty scores given by a model. This is realized using the area under the receiver operating characteristic curve (nlp_uncertainty_zoo.utils.uncertainty_eval.auroc()) and the area under the precision-recall curve (nlp_uncertainty_zoo.utils.uncertainty_eval.aupr()).
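
As a minimal sketch of this proxy task (with made-up uncertainty scores; the label convention of 1 for in- and 0 for out-of-distribution follows the function documentation below):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import aupr, auroc

    # Hypothetical uncertainty scores for in-distribution (ID) and
    # out-of-distribution (OOD) inputs.
    id_scores = np.array([0.12, 0.08, 0.20, 0.15])
    ood_scores = np.array([0.65, 0.80, 0.55, 0.90])

    # 1 marks ID samples, 0 marks OOD samples.
    y_true = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    y_pred = np.concatenate([id_scores, ood_scores])

    print(auroc(y_true, y_pred), aupr(y_true, y_pred))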

As introduced in Ulmer et al. (2022), we also evaluate how indicative the uncertainty score is of a model's potential loss. For this purpose, this module also implements Kendall's tau correlation coefficient (nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau()).
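
A sketch of the underlying computation, applying scipy.stats.kendalltau (presumably the same function shown in the default eval_funcs of evaluate_uncertainty() below) to made-up losses and uncertainties:

    import numpy as np
    from scipy.stats import kendalltau

    # Made-up per-sequence losses and matching uncertainty scores.
    losses = np.array([0.3, 1.2, 0.7, 2.5])
    uncertainties = np.array([0.05, 0.40, 0.20, 0.90])

    tau, _ = kendalltau(losses, uncertainties)
    print(tau)  # 1.0 here, since both arrays induce the same ranking.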

Instances of the nlp_uncertainty_zoo.models.model.Model class can be evaluated with all of the above metrics in a single call to nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty().
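
A minimal usage sketch, assuming a trained model and two DataLoaders are already available (the variable names below are placeholders):

    from nlp_uncertainty_zoo.utils.uncertainty_eval import evaluate_uncertainty

    # model: a trained Model instance (placeholder).
    # id_dataloader / ood_dataloader: DataLoaders over the in- and
    # out-of-distribution evaluation splits (placeholders).
    results = evaluate_uncertainty(
        model,
        id_eval_split=id_dataloader,
        ood_eval_split=ood_dataloader,
    )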

Uncertainty Eval Module Documentation

Module implementing different metrics to measure the quality of uncertainty estimates.

nlp_uncertainty_zoo.utils.uncertainty_eval.aupr(y_true: array, y_pred: array) → float

Return the area under the precision-recall curve for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.

Parameters:
y_true: np.array

True labels, where 1 corresponds to in- and 0 to out-of-distribution.

y_pred: np.array

Uncertainty scores as predictions to distinguish the two classes.

Returns:
float

Area under the precision-recall curve.
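
For illustration, a direct call on toy arrays following the label convention above (all values are made up):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import aupr

    y_true = np.array([1, 1, 1, 0, 0, 0])               # 1 = ID, 0 = OOD
    y_pred = np.array([0.1, 0.2, 0.15, 0.8, 0.9, 0.7])  # uncertainty scores

    score = aupr(y_true, y_pred)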

nlp_uncertainty_zoo.utils.uncertainty_eval.auroc(y_true: array, y_pred: array) → float

Return the area under the receiver-operator characteristic for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.

Parameters:
y_true: np.array

True labels, where 1 corresponds to in- and 0 to out-of-distribution.

y_pred: np.array

Uncertainty scores as predictions to distinguish the two classes.

Returns:
float

Area under the receiver-operator characteristic.
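
The call pattern mirrors aupr(); as a point of reference, uncertainty scores that carry no information about ID vs. OOD membership yield an AUROC of roughly 0.5 (toy values below):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import auroc

    y_true = np.array([1, 1, 1, 0, 0, 0])             # 1 = ID, 0 = OOD
    # Made-up, uninformative scores: expect an AUROC in the vicinity of 0.5.
    y_pred = np.array([0.5, 0.1, 0.9, 0.4, 0.2, 0.8])

    score = auroc(y_true, y_pred)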

nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty(model: nlp_uncertainty_zoo.models.model.Model, id_eval_split: torch.utils.data.dataloader.DataLoader, ood_eval_split: typing.Optional[torch.utils.data.dataloader.DataLoader] = None, eval_funcs: typing.Dict[str, typing.Callable] = frozendict.frozendict({'kendalls_tau': <function kendalltau>}), contrastive_eval_funcs: typing.Tuple[typing.Callable] = frozendict.frozendict({'aupr': <function aupr>, 'auroc': <function auroc>}), ignore_token_ids: typing.Tuple[int] = (-100,), verbose: bool = True) → Dict[str, Any]

Evaluate the uncertainty properties of a model. Evaluation happens in two ways:

1. Eval functions that are applied to the uncertainty metrics of the model on the id_eval_split (and the ood_eval_split, if specified).

2. Eval functions that take measurements on an in- and an out-of-distribution dataset to evaluate a proxy binary anomaly detection task; for this, the functions specified by contrastive_eval_funcs are used, and the ood_eval_split argument has to be specified.

Parameters:
model: Model

Model to be evaluated.

id_eval_split: DataLoader

Main evaluation split.

ood_eval_split: Optional[DataLoader]

OOD evaluation split. Needs to be specified for contrastive evaluation functions to work.

eval_funcs: Dict[str, Callable]

Evaluation functions that evaluate uncertainty by comparing it to model losses on a single split.

contrastive_eval_funcs: Dict[str, Callable]

Evaluation functions that evaluate uncertainty by comparing uncertainties on an ID and OOD test set.

ignore_token_ids: Tuple[int]

IDs of tokens that should be ignored by the model during evaluation.

verbose: bool

Whether to display information about the current progress.

Returns:
Dict[str, Any]

Results as a dictionary from uncertainty metric / split / eval metric to result.
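
Since the exact key layout depends on the model's uncertainty metrics and on whether an OOD split was passed, iterating over the returned dictionary is a quick way to inspect the results:

    # `results` is the dictionary returned by evaluate_uncertainty().
    for name, value in results.items():
        print(name, value)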

nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau(losses: array, uncertainties: array) → float

Compute Kendall’s tau for a list of losses and uncertainties for a set of inputs. If the two lists are concordant, i.e. the points with the highest uncertainty incur the highest loss, Kendall’s tau is 1. If they are completely discordant, it is -1.

Parameters:
losses: np.array

List of losses for a set of points.

uncertainties: np.array

List of uncertainties for a set of points.

Returns:
float

Kendall’s tau, between -1 and 1.
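
A toy check of the two extremes described above (the numbers are made up):

    import numpy as np
    from nlp_uncertainty_zoo.utils.uncertainty_eval import kendalls_tau

    losses = np.array([0.2, 0.5, 1.0, 2.0])

    # Concordant: higher uncertainty always goes with higher loss -> tau = 1.
    print(kendalls_tau(losses, np.array([0.1, 0.3, 0.6, 0.9])))
    # Discordant: higher uncertainty always goes with lower loss -> tau = -1.
    print(kendalls_tau(losses, np.array([0.9, 0.6, 0.3, 0.1])))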