Uncertainty Eval¶
The quality of uncertainty estimates can be tricky to evaluate, since, unlike for classification tasks, there are no gold labels. This module therefore contains methods for exactly this purpose.
To measure the quality of general uncertainty estimates, we use the common evaluation method of defining a proxy OOD detection task,
where we quantify how well we can distinguish in- and out-of-distribution inputs based on the uncertainty scores given by a model.
This is realized using the area under the receiver operating characteristic (nlp_uncertainty_zoo.utils.uncertainty_eval.auroc()) and
the area under the precision-recall curve (nlp_uncertainty_zoo.utils.uncertainty_eval.aupr()).
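As a minimal sketch of this proxy task (the uncertainty scores below are invented for illustration; the label convention of 1 for in-distribution and 0 for out-of-distribution follows the function documentation further down):

    import numpy as np

    from nlp_uncertainty_zoo.utils.uncertainty_eval import aupr, auroc

    # Invented uncertainty scores; OOD inputs typically receive higher uncertainty.
    id_uncertainties = np.array([0.12, 0.08, 0.25, 0.18])   # in-distribution inputs
    ood_uncertainties = np.array([0.64, 0.91, 0.47, 0.73])  # out-of-distribution inputs

    # 1 marks in-distribution samples, 0 marks out-of-distribution samples.
    y_true = np.concatenate([np.ones_like(id_uncertainties), np.zeros_like(ood_uncertainties)])
    y_pred = np.concatenate([id_uncertainties, ood_uncertainties])

    print(f"AUROC: {auroc(y_true, y_pred):.3f}, AUPR: {aupr(y_true, y_pred):.3f}")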
Following Ulmer et al. (2022), we also evaluate how indicative the uncertainty score is of the potential loss of a model.
For this purpose, this module also implements the Kendall's tau correlation coefficient (nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau()).
Instances of the nlp_uncertainty_zoo.models.model.Model class can be evaluated with all of the above metrics in a single call to nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty().
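A minimal usage sketch, assuming a trained model and a torch DataLoader over the in-distribution test split already exist (the names model and id_loader are placeholders, not part of the library); when no ood_eval_split is passed, only the single-split metrics such as Kendall's tau are computed:

    from nlp_uncertainty_zoo.utils.uncertainty_eval import evaluate_uncertainty

    # Placeholders: `model` is a trained nlp_uncertainty_zoo Model and `id_loader`
    # a torch DataLoader over the in-distribution test split.
    results = evaluate_uncertainty(model, id_eval_split=id_loader)

    # Results are a dictionary from uncertainty metric / split / eval metric to score.
    print(results)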
Uncertainty Eval Module Documentation¶
Module implementing different metrics to measure the quality of uncertainty estimates.
- nlp_uncertainty_zoo.utils.uncertainty_eval.aupr(y_true: array, y_pred: array) → float ¶
Return the area under the precision-recall curve for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.
- Parameters:
- y_true: np.array
True labels, where 1 corresponds to in- and 0 to out-of-distribution.
- y_pred: np.array
Uncertainty scores as predictions to distinguish the two classes.
- Returns:
- float
Area under the precision-recall curve.
- nlp_uncertainty_zoo.utils.uncertainty_eval.auroc(y_true: array, y_pred: array) → float ¶
Return the area under the receiver-operator characteristic for a pseudo binary classification task, where in- and out-of-distribution samples correspond to two different classes, which are differentiated using uncertainty scores.
- Parameters:
- y_true: np.array
True labels, where 1 corresponds to in- and 0 to out-of-distribution.
- y_pred: np.array
Uncertainty scores as predictions to distinguish the two classes.
- Returns:
- float
Area under the receiver-operator characteristic.
- nlp_uncertainty_zoo.utils.uncertainty_eval.evaluate_uncertainty(model: ~nlp_uncertainty_zoo.models.model.Model, id_eval_split: ~torch.utils.data.dataloader.DataLoader, ood_eval_split: ~typing.Optional[~torch.utils.data.dataloader.DataLoader] = None, eval_funcs: ~typing.Dict[str, ~typing.Callable] = frozendict.frozendict({'kendalls_tau': <function kendalltau>}), contrastive_eval_funcs: ~typing.Tuple[~typing.Callable] = frozendict.frozendict({'aupr': <function aupr>, 'auroc': <function auroc>}), ignore_token_ids: ~typing.Tuple[int] = (-100, ), verbose: bool = True) → Dict[str, Any] ¶
Evaluate the uncertainty properties of a model. Evaluation happens in two ways:
1. Eval functions that are applied to uncertainty metrics of the model on the eval_split (and the ood_eval_split, if specified).
2. Eval functions that take measurements on an in- and an out-of-distribution dataset to evaluate a proxy binary anomaly detection task, using the functions specified by contrastive_eval_funcs. For this, the ood_eval_split argument also has to be specified. (A usage sketch follows the parameter list below.)
- Parameters:
- model: Model
Model to be evaluated.
- id_eval_split: DataLoader
Main evaluation split.
- ood_eval_split: Optional[DataLoader]
OOD evaluation split. Needs to be specified for contrastive evaluation functions to work.
- eval_funcs: Dict[str, Callable]
Evaluation functions that evaluate uncertainty by comparing it to model losses on a single split.
- contrastive_eval_funcs: Dict[str, Callable]
Evaluation functions that evaluate uncertainty by comparing uncertainties on an ID and OOD test set.
- ignore_token_ids: Tuple[int]
IDs of tokens that should be ignored by the model during evaluation.
- verbose: bool
Whether to display information about the current progress.
- Returns:
- Dict[str, Any]
Results as a dictionary from uncertainty metric / split / eval metric to result.
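As a sketch under the same assumptions as above (model, id_loader and ood_loader are placeholders for a trained Model and two DataLoaders), a call that exercises both evaluation modes and passes this module's own metric functions explicitly instead of relying on the defaults:

    from nlp_uncertainty_zoo.utils.uncertainty_eval import (
        aupr,
        auroc,
        evaluate_uncertainty,
        kendalls_tau,
    )

    # Placeholders: `model` is a trained Model, `id_loader` / `ood_loader` are DataLoaders
    # over the in-distribution and out-of-distribution evaluation splits.
    results = evaluate_uncertainty(
        model,
        id_eval_split=id_loader,
        ood_eval_split=ood_loader,  # required for the contrastive (ID vs. OOD) metrics
        eval_funcs={"kendalls_tau": kendalls_tau},  # single-split metrics vs. model loss
        contrastive_eval_funcs={"aupr": aupr, "auroc": auroc},  # ID-vs-OOD metrics
        ignore_token_ids=(-100,),
        verbose=False,
    )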
- nlp_uncertainty_zoo.utils.uncertainty_eval.kendalls_tau(losses: array, uncertainties: array) → float ¶
Compute Kendall’s tau for a list of losses and uncertainties for a set of inputs. If the two lists are concordant, i.e. the points with the highest uncertainty incur the highest loss, Kendall’s tau is 1. If they are completely discordant, it is -1. (A small numeric sketch follows the parameter list below.)
- Parameters:
- losses: np.array
List of losses for a set of points.
- uncertainties: np.array
List of uncertainties for a set of points.
- Returns:
- float
Kendall’s tau, between -1 and 1.
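To make the concordance intuition concrete, a small sketch with invented numbers (a perfectly concordant ranking of losses and uncertainties should yield a value of 1, a perfectly reversed ranking a value of -1):

    import numpy as np

    from nlp_uncertainty_zoo.utils.uncertainty_eval import kendalls_tau

    losses = np.array([0.2, 0.5, 1.1, 2.3])      # losses for four inputs

    concordant = np.array([0.1, 0.3, 0.6, 0.9])  # uncertainty rises with loss
    discordant = np.array([0.9, 0.6, 0.3, 0.1])  # uncertainty falls as loss rises

    print(kendalls_tau(losses, concordant))   # expected to be close to 1
    print(kendalls_tau(losses, discordant))   # expected to be close to -1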