Sources of Uncertainty

Modeling uncertainty is a useful part of the drug discovery process, with a Bayesian approach significantly outperforming a naive one.

Modeling uncertainty is a key aspect of our research process at Evariste. In this short post, we will discuss the different sources of uncertainty that we consider when building our models. Using a simple example, we will show how a naive approach to modeling uncertainty can lead to overconfident predictions.

The toy dataset we will consider is an instance of sklearn.datasets.make_moons consisting of 200 samples with 2 features and 2 classes. The samples arise from 2 interleaving half circles with a small amount of Gaussian noise.
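
For concreteness, the dataset can be generated with a couple of lines of scikit-learn. This is a minimal sketch; the noise level and random seed below are illustrative assumptions, not necessarily the values used for the plots in this post.

```python
# A minimal sketch of the toy dataset; the noise level and seed are assumptions.
from sklearn.datasets import make_moons

# 200 samples, 2 features, 2 classes: two interleaving half circles with Gaussian noise.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
print(X.shape, y.shape)  # (200, 2) (200,)
```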

Suppose we wish to construct a classifier for this dataset which both predicts the class of a new sample and provides a measure of the uncertainty of this prediction. There are two main sources of uncertainty that we need to consider.

Aleatoric Uncertainty

The first source, aleatoric uncertainty, is uncertainty due to the inherent stochasticity of the data. In our example, this corresponds to the uncertainty introduced by the Gaussian noise added to the interleaving half circles: for a sample at the interface between the two half circles, we are uncertain which class it belongs to.

In the context of drug discovery, this uncertainty could arise from the stochasticity of the biological system or the in vitro assay under study. This type of uncertainty is often referred to as irreducible uncertainty, as it cannot be reduced by collecting more data. No matter how many samples we collect, we will always be uncertain about the class of a sample at the interface between the two half circles.

Epistemic Uncertainty

The second source, epistemic uncertainty, is uncertainty due to the lack of knowledge about the system under study. In our example, this corresponds to the uncertainty introduced by the limited number of samples for which we have data available. In particular, we are uncertain about the class of a sample in a region where we have no data at all.

In the context of drug discovery, this uncertainty is normally the dominant source of uncertainty, as we typically have a limited number of data points available for a given assay we wish to model. This type of uncertainty is often referred to as reducible uncertainty, as it can be reduced by collecting more data.

It is important to consider both sources of uncertainty in drug discovery in order to identify the most promising compounds for further investigation.

A Naive Approach

A naive approach to modeling uncertainty is to train a classifier on the available data, and then use the classifier to predict class probabilities for a new sample. The uncertainty of the prediction is then estimated by the entropy of the predicted class probabilities.
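
As a concrete sketch of this recipe, the snippet below fits a scikit-learn classifier and scores uncertainty by predictive entropy; the choice of gradient boosting and the query points are illustrative assumptions, and `X, y` come from the make_moons snippet above.

```python
# Naive uncertainty: the entropy of the predicted class probabilities.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier().fit(X, y)  # X, y from the make_moons snippet above

def predictive_entropy(probs):
    """Entropy (in nats) of each row of an (n_points, n_classes) probability array."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

# One query point near the data and one far away from it.
X_new = np.array([[0.5, 0.25], [4.0, 4.0]])
probs = clf.predict_proba(X_new)
print(probs)
print(predictive_entropy(probs))  # can be close to zero even for the far-away point
```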

Whether our model consists of gradient boosted trees, a neural network, or a k-nearest neighbours classifier, the naive approach to modeling uncertainty faces the same issue: the models are unreasonably confident in their predictions in regions of high epistemic uncertainty. We can see this in the figure below. The background colour in the left column represents the model's predicted probability that a point belongs to the upper left semi-circle class, and the right column represents the uncertainty in this prediction, given by the cross-entropy of the predicted class probabilities. The uncertainty is low not only in the semi-circular regions where we have a lot of consistent data, but also in the regions where we have no data. This is clearly incorrect: we should be highly uncertain about the class of a sample in a region where we have no data at all. The models have correctly identified the regions of high aleatoric uncertainty, but have failed to identify the regions of high epistemic uncertainty.

Techniques such as ensembling can help to reduce the uncertainty of predictions, but they do not address the issue of overconfidence in regions of high epistemic uncertainty: each of the models in the ensemble will be highly confident in its predictions in the regions where we have no data, and so the ensemble will also be highly confident in its predictions in these regions.
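
A minimal sketch of why averaging does not rescue the naive approach: if every ensemble member puts almost all of its probability mass on the same class at a point far from the data, the averaged probabilities, and hence the entropy, are just as overconfident. The bagged k-NN ensemble below is an illustrative assumption rather than a specific model used in this post.

```python
# Bagging: each member is fit on a bootstrap resample and the predicted
# probabilities are averaged; confident members give a confident average.
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

ensemble = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5), n_estimators=20, random_state=0
).fit(X, y)

print(ensemble.predict_proba(X_new))  # can still be close to 0/1 far from the data
```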


A Bayesian Approach

A more principled, Bayesian approach to modeling uncertainty, given a model \(\theta\) and data \(\mathcal{D}\), is to consider the posterior distribution over the model parameters given the data, \(p(\theta|\mathcal{D})\). Given a new data point \(x\), we can then marginalise over the posterior distribution to obtain the predictive distribution \(p(y|x,\mathcal{D})\) for the label \(y\) of the new data point.

That is to say we can obtain the predictive distribution by averaging over all possible models, weighting each model by its posterior probability given the data. This is known as Bayesian model averaging.

$$p(y|x,\mathcal{D}) = {\color{orchid}\int}_{\color{maroon}\Theta} {\color{seagreen}p(y|x,\tilde{\theta})} \cdot {\color{skyblue}p(\tilde{\theta}|\mathcal{D})} {\color{orchid}d\tilde{\theta}} $$
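
In practice this integral is typically approximated by a Monte Carlo average over samples drawn from the posterior. The sketch below assumes such samples and a per-model predictive function are already available; both names are hypothetical placeholders rather than part of any particular library.

```python
# Monte Carlo Bayesian model averaging:
#   p(y | x, D)  ≈  (1/M) * sum_m p(y | x, theta_m),   theta_m ~ p(theta | D)
import numpy as np

def model_average(posterior_models, predict_proba, X_new):
    """posterior_models: models theta_1, ..., theta_M sampled from p(theta | D).
    predict_proba(theta, X): p(y | x, theta) as an (n_points, n_classes) array."""
    per_model = np.stack([predict_proba(theta, X_new) for theta in posterior_models])
    return per_model.mean(axis=0)  # the approximate predictive distribution p(y | x, D)
```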

In a similar fashion, we can estimate the uncertainty of a model \(\theta\) at a given point by averaging an uncertainty measure, such as \(\textsf{CrossEntropy}\), over possible models with respect to the posterior.

$$ {\color{orchid}\int}_{\color{maroon}\Theta} {{\color{gold}{\textsf{CrossEntropy}}}\big(p(y|x,\theta);\ {\color{seagreen}p(y|x,\tilde{\theta})}\big)} \cdot {\color{skyblue}p(\tilde{\theta}|\mathcal{D})} {\color{orchid}d\tilde{\theta}} $$

Estimating this integral efficiently is non-trivial, and we use a proprietary method. In order not to underestimate uncertainty, it is important to be able to sample adversarial models for which both the posterior probability and the uncertainty measure are high. Our proprietary method allows us to do this, and furthermore we are able to decompose the uncertainty into aleatoric and epistemic components.
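
For illustration only, a plain Monte Carlo estimate of the integral above takes the following shape; this is not Evariste's proprietary estimator, and it says nothing about how the posterior samples are obtained or how adversarial models are found.

```python
# Monte Carlo estimate of the posterior-averaged cross-entropy uncertainty:
#   U(x)  ≈  (1/M) * sum_m CrossEntropy( p(y | x, theta), p(y | x, theta_m) )
import numpy as np

def cross_entropy(p, q):
    """Row-wise cross-entropy between two (n_points, n_classes) probability arrays."""
    return -np.sum(p * np.log(q + 1e-12), axis=1)

def averaged_uncertainty(theta, posterior_models, predict_proba, X_new):
    p_ref = predict_proba(theta, X_new)  # the reference model's prediction p(y | x, theta)
    return np.mean(
        [cross_entropy(p_ref, predict_proba(t, X_new)) for t in posterior_models],
        axis=0,
    )
```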

Example

As an example, we'll apply the Bayesian approach to uncertainty estimation outlined above to the toy dataset, using scikit-learn's KNeighborsClassifier.
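
As a rough, non-proprietary stand-in for posterior sampling, the sketch below treats KNeighborsClassifier models fit on bootstrap resamples as the sampled models and plugs them into the estimators sketched earlier. It illustrates the shape of the computation rather than reproducing the plots in this post; in particular, bootstrapping is only a crude surrogate for the posterior and will generally understate epistemic uncertainty far from the data.

```python
# Bootstrap resamples of the data as a crude stand-in for posterior samples.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def knn_predict_proba(model, X_query):
    return model.predict_proba(X_query)

theta = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # reference model on the full data
posterior_models = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # sample indices with replacement
    posterior_models.append(KNeighborsClassifier(n_neighbors=5).fit(X[idx], y[idx]))

# Approximate predictive distribution and averaged cross-entropy uncertainty at X_new.
bma_probs = model_average(posterior_models, knn_predict_proba, X_new)
uncertainty = averaged_uncertainty(theta, posterior_models, knn_predict_proba, X_new)
print(bma_probs)
print(uncertainty)
```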

Using this more sophisticated approach, we get a granular view of uncertainty. We see that our uncertainty measure identifies the highest level of uncertainty at the interface between the two clusters, and that our model is also highly uncertain in the regions where we have no data. With our decomposition we are also able to identify the region of high aleatoric uncertainty, as we did with the naive approach.

Upshot

We have discussed the different sources of uncertainty that we consider when building our models. We have shown how a naive approach to modeling uncertainty can lead to incorrect, overconfident conclusions, and how a Bayesian approach can be used to correctly identify regions of high epistemic uncertainty. In this toy example the regions of high epistemic uncertainty are clear from the plots, but in the setting of drug discovery, involving more complex models, multiple predicted endpoints and highly noisy datasets, managing epistemic uncertainty is a lot trickier.

Beyond academic interest, the upshot of accurately modeling uncertainty for small molecule drug discovery is faster project progression. With knowledge of the uncertainty of our predictions, we can identify compounds for which we are highly confident in our predictions, and compounds for which we are less confident. In this way we can precisely balance the trade-off between exploration and exploitation, and through a series of calculated bets, maximise the chance of identifying a successful drug candidate.

If you would like to learn more about how we model uncertainty at Evariste, and whether it is applicable to your datasets, please get in touch with our Head Quant, Oliver Vipond.

