Python: Full Bayesian Predictive Distribution

Background

Follow this link to download the full Jupyter notebook.

In a previous post, we looked at the maximum likelihood estimate (MLE) for a linear regression model. Instead of using the deterministic model directly, we also looked at the predictive distribution: a stochastic model that incorporates information about the data uncertainty into the prediction process.
When using the MLE parameter (or the least-squares estimate in the case of a linear model), all other parameters with smaller likelihood are automatically discarded. This is problematic because the maximum likelihood parameter is not necessarily correct or even close to the true parameter value; it simply has the highest data likelihood for the given data. If we have only a few data points supporting our hypothesis, the likelihood function becomes very flat. Other parameters then show very similar likelihood values, and the true parameter might not be the MLE and be falsely discarded just because its likelihood is a few decimal places off. The short sketch at the end of this section illustrates this flattening effect.
In this post, we will take a different approach that involves a full Bayesian treatment of parameter uncertainty. We will not ditch all parameters except for the MLE but instead include the knowledge about each and every parameter in the prediction process.
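To see why a flat likelihood is dangerous, here is a minimal sketch (not part of the original notebook) comparing the likelihood curve of a slope-only line model $y = m\cdot x$ for 5 and for 50 noisy data points; the slope grid and noise level are assumptions chosen purely for illustration:

import numpy as np
import scipy.stats as stats

np.random.seed(0)
m_true, noise_std = -0.3, 0.3
ms = np.linspace(-1.0, 0.4, 200)  # candidate slopes

def likelihood_curve(n):
    x = np.linspace(0, 3, n)
    y = m_true * x + np.random.normal(0, noise_std, n)
    # likelihood of each candidate slope, normalized to its own maximum
    l = np.array([np.prod(stats.norm.pdf(y, m * x, noise_std)) for m in ms])
    return l / l.max()

# with n=5 the curve is broad (many slopes are almost equally likely),
# while with n=50 it concentrates sharply around m_true
l_small, l_large = likelihood_curve(5), likelihood_curve(50)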

Integrating parameter uncertainty into the predictive distribution

As a reminder, we are interested in predicting a new value $y$ for a given, yet unseen, data point $x$. Before, we did this using the predictive distribution of the MLE model, which gave us the probability for the predicted value. That probability, however, was based on the assumption that the MLE was in fact correct, so the uncertainty stemmed only from data noise.
The full Bayesian treatment of parameter uncertainty means that we include the uncertainty of all parameters. Therefore, we sum over the predictions for all parameters and weight them with respect to the probability of each parameter. This is equivalent to integrating the predictions, weighted by their probability, over the whole parameter space.

$$p(y\ |\ x) = \int_\theta \underbrace{p(y\ |\ x,\theta)}_{\text{predictive distribution}}\cdot \underbrace{p(\theta\ |\ Z)}_{\text{parameter probability}}\ d\theta\\ \text{Equation 1: Predictive distribution considering parameter uncertainty}$$By integrating over the whole parameter space, we include the uncertainty of all parameters, and thus the overall prediction uncertainty will unavoidably increase. Intuitively, this makes sense, as the prediction not only accounts for data uncertainty as before, but also for uncertainty in the parameters.
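To make Equation 1 concrete before we apply it to the line model, here is a minimal sketch (an illustration, not the method used later in this post) that approximates the integral on a discretized 1-D parameter grid for a slope-only model $y = \theta\cdot x$; the posterior shape and noise level are assumed values:

import numpy as np
import scipy.stats as stats

# discretized 1-D parameter grid for a slope-only model y = theta * x
thetas = np.linspace(-1, 1, 400)
dtheta = thetas[1] - thetas[0]
# assumed posterior p(theta | Z), here a Gaussian around -0.3
posterior = stats.norm.pdf(thetas, loc=-0.3, scale=0.15)

def predictive_pdf(y, x, noise_std=0.3):
    # p(y | x, theta) for every grid point, weighted by p(theta | Z)
    lik = stats.norm.pdf(y, loc=thetas * x, scale=noise_std)
    return np.sum(lik * posterior) * dtheta  # Riemann sum over theta

# density of observing y = 0.0 at x = 2.0 under the predictive distribution
print(predictive_pdf(0.0, 2.0))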

Parameter probability: Likelihood, Prior & Posterior

In the above equation, we weight each prediction with respect to its parameter probability. Using the data likelihood makes sense when no prior knowledge about the parameter exists. But what if we already have rough knowledge about the true parameter value, for example from experience? We could model this prior knowledge using a normal distribution with its peak at the assumed true value. In this case, we want to incorporate the prior knowledge into the calculation of our parameter probability. We then obtain a better estimate by multiplying the data likelihood with the prior distribution, which brings us to the definition of the posterior distribution.

$$p(\theta\ |\ Z) \propto \underbrace{p(Z\ |\ \theta)}_{\text{Likelihood}}\quad \cdot \underbrace{p(\theta)}_{\text{Prior distribution}}\\ \text{Equation 2: Posterior distribution}$$The resulting posterior distribution has a more pronounced peak than both the likelihood and the prior distribution. Intuitively, this makes sense, as using two sources of information should result in increased accuracy. If, however, no prior information about the parameter is available, one usually uses a uniform prior distribution, which is equivalent to weighting all parameters equally. We can now replace the likelihood with the posterior distribution in equation (1).

Note: For a more comprehensive understanding, you should look into the topic of Bayesian inference. As a side note, mathematically the likelihood function is not really a probability distribution: its integral does not necessarily equal one, and a scaling factor is required to make it a true probability distribution. However, the resulting posterior distribution has the same shape and is only scaled by a constant factor, which is indicated by the $\propto$ symbol.
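As a small numerical illustration of this note (with made-up observations and prior values), the product of likelihood and prior can be turned into a proper distribution by dividing by its integral:

import numpy as np
import scipy.stats as stats

thetas = np.linspace(-2, 2, 1000)
dtheta = thetas[1] - thetas[0]
data = np.array([0.4, 0.6, 0.5])  # made-up observations

# unnormalized posterior: data likelihood times a N(0.7, 0.3) prior
lik = np.array([np.prod(stats.norm.pdf(data, th, 0.3)) for th in thetas])
prior = stats.norm.pdf(thetas, 0.7, 0.3)
unnormalized = lik * prior

# divide by the integral to obtain a true probability distribution
posterior = unnormalized / (np.sum(unnormalized) * dtheta)
print(np.sum(posterior) * dtheta)  # ~1.0 after normalization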

Hands on

In the following, we will calculate the prediction model as described above using a simple line model that is characterized by two parameters, m and b. First, we draw the line of the “true” parameters, which are set to $m = -0.3$ and $b=0.5$. Then, we generate the sample data set which we use for fitting by generating a few points on the line and adding white Gaussian noise to model measurement uncertainty. We use noise with a standard deviation of $0.3$.

import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
%matplotlib inline
%config IPCompleter.greedy=True

def label(theta):
    return "[%2.2f, %2.2f]" %(theta[0],theta[1]);

def lineRegressor(x):
    #regression (design) matrix for the line y = m*x + b with w = [m, b]^T:
    #a column with the inputs x for the slope and a column of ones for the intercept
    x = np.array(x,ndmin=2); #make sure scalars are treated as matrices
    ones = np.ones((np.shape(x)[0],1));
    return np.concatenate((x,ones), axis=1);

def linearmodel(x,w):
    phi = lineRegressor(x);
    return phi.dot(w);

#create data from the line model
N = 5;
x_std = 0.3; #noise standard deviation, assumed to be known a priori
mb0 = np.array([[-0.3],[0.5]]); #true parameters [m, b]
x = np.vstack(np.linspace(0,3,int(N)));
t = linearmodel(x,mb0) + np.vstack(np.random.normal(0,x_std,x.shape[0]))

#draw data
tempx = np.vstack(np.linspace(-5,5,10));
plt.plot(tempx,linearmodel(tempx,mb0),'k--',label=label(mb0.ravel()));
plt.plot(x,t,'k.',markersize=20,label='data points',markeredgecolor='w');
plt.xlabel('x');
plt.ylabel('y');
plt.legend();

Calculate posterior probability distribution

In the following, we calculate the posterior distribution.

  • First, we calculate the data likelihood function. We assume the noise follows a normal distribution with known precision $\beta = 1$, i.e. a standard deviation of $1$ (our first hyperparameter).
  • Second, we generate the prior distribution. Here, we assume that we have a rough idea of the true parameter value being something close to $[m,b] = [-0.2,0.7]$. We model this prior knowledge using a normal distribution with its mean set to this value. To reflect our fairly high confidence in this prior guess, we set the diagonal entries of the covariance matrix $S_0$ to 0.1 (our second hyperparameter).
  • Then, we multiply the prior distribution and the likelihood function to obtain the posterior distribution.
  • Finally, we plot all three functions using a contour plot.

From the plots, we can make some interesting observations. The posterior distribution has a much more prominent peak than both the prior and the likelihood function. Its peak is also closer to the true parameter value: the maximum of the likelihood function does not coincide with the true parameter value. This example shows that blindly relying on the MLE can be dangerous!

def likelihood(t,model,x,w,std):
    #product of the independent per-point Gaussian densities
    mu = model(x,w);
    ps = stats.norm.pdf(t,mu,std)
    return np.prod(ps);


#create array to cover parameter space
res = 100;
M,B = np.meshgrid(np.linspace(-1,1,res),np.linspace(-1,1,res));
MB = np.c_[M.ravel(),B.ravel()];

#calculate likelihood function over the parameter grid
beta = 1; #noise precision; with beta = 1 the standard deviation is also 1
L = np.array([likelihood(t,linearmodel,x,mb.reshape(2,1),1.0/np.sqrt(beta)) for mb in MB]).reshape(M.shape)

#draw
f,(ax2,ax3,ax4) = plt.subplots(1,3,figsize=(15,5),sharey=True);

#draw likelihood function
ax2.contourf(M,B,L);
ax2.set_title('Data Likelihood')
ax2.set_xlabel('m');
ax2.set_ylabel('b')
ax2.plot(mb0[0],mb0[1],'k*',markersize=10,label='True Parameter: '+label(mb0));
mbMLE = MB[L.argmax()]
ax2.plot(mbMLE[0],mbMLE[1],'r*',markersize=10,label='Max Likelihood: '+label(mbMLE));
ax2.legend(loc='lower center')

#prior distribution
S0 = np.array([[0.1,0],[0.0,0.1]]);
m0 = np.array([[-0.2],[0.7]]);
Prior = stats.multivariate_normal.pdf(MB,m0.ravel(),S0);
Prior = Prior.reshape(M.shape);
ax3.contourf(M,B,Prior)
ax3.set_title('Prior Prob. Dist.')
ax3.set_xlabel('m');
ax3.plot(mb0[0],mb0[1],'k*',markersize=10,label='True Parameter: '+label(mb0));
ax3.plot(m0[0],m0[1],'g*',markersize=10,label='Max Prior: '+label(m0));
ax3.legend(loc='lower center')

#posterior (unnormalized): pointwise product of prior and likelihood
Posterior = np.multiply(Prior,L)
ax4.contourf(M,B,Posterior)
ax4.set_title('Posterior Prob. Dist.')
ax4.set_xlabel('m');
mbPost = MB[Posterior.argmax()]
ax4.plot(mb0[0],mb0[1],'k*',markersize=10,label='True Parameter: '+label(mb0));
ax4.plot(mbMLE[0],mbMLE[1],'r*',markersize=10,label='Max Likelihood: '+label(mbMLE));
ax4.plot(m0[0],m0[1],'g*',markersize=10,label='Max Prior: '+label(m0));
ax4.plot(mbPost[0],mbPost[1],'b*',markersize=10,label='Max Posterior: '+label(mbPost));
ax4.legend(loc='lower center');

Calculating the posterior more efficiently

As we can see, multiplying two normal distributions again results in a (scaled) normal distribution. Instead of calculating the posterior distribution numerically, as above, we can use an analytical solution. The following equations are taken from Bishop, Pattern Recognition and Machine Learning, ch. 3.3.

$$m_N = S_N(S_0^{-1}\cdot m_0+\beta\cdot\Phi^T\cdot Y)\\ S_N^{-1} = S_0^{-1}+\beta\cdot\Phi^T\cdot\Phi\\ \text{Equation 3: Analytical solution for the posterior normal distribution.}$$

  • $S_N$: Covariance matrix of the posterior distribution
  • $m_N$: Mean value of the posterior distribution
  • $\Phi$: Regression matrix introduced in the previous blog post (note that it was called X, not $\Phi$)
  • $Y$: Vector of output values in the dataset $[y_0,\dots,y_N]$
  • $\beta$: Assumed noise precision (the inverse of the noise variance)

In the following code snippet, we calculate both quantities and plot the distribution using the stats package. As we can see, the resulting distribution is the same as the numerically calculated one. However, the maximum is somewhat different from the numerical solution. This is because of the limited resolution with which we discretized the parameter space. If we had used a higher resolution, the result would have been more accurate, but also more computationally intensive.

#Calculate posterior distribution using the analytical solution

#generate regression matrix from input values for the line model
Phi = lineRegressor(x);

#calculate covariance matrix
SN_inv = np.linalg.inv(S0) + beta*Phi.T.dot(Phi)
SN = np.linalg.inv(SN_inv);

#calculate the mean value (which is also the maximum of the posterior, i.e. the MAP estimate)
mN = SN.dot( np.linalg.inv(S0).dot(m0) + beta*Phi.T.dot(t));

#generate distribution object using the stats package for plotting
PosteriorAnalytical = stats.multivariate_normal.pdf(MB,mN.ravel(),SN);

#plot the distribution
plt.figure(figsize=(4,4))
plt.contourf(M,B,PosteriorAnalytical.reshape(M.shape))
plt.plot(mbPost[0],mbPost[1],'b*',markersize=10,label='Posterior numerical: '+label(mbPost));
plt.plot(mN[0],mN[1],'bx',markersize=10,label='mN: '+label(mN))
plt.title('Posterior (analytically)')
plt.xlabel('m');
plt.ylabel('b');
plt.legend(loc='lower center');
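As a quick sanity check (a small follow-up sketch, reusing the variables from the snippets above), we can print the distance between the grid-based maximum and the analytical posterior mean; it is bounded by the grid spacing of roughly $2/res = 0.02$ per axis and shrinks as res grows:

#distance between the numerical and the analytical posterior maximum
print(np.abs(mbPost - mN.ravel()))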

Drawing models from the distribution

The posterior distribution now tells us how likely each and every parameter is after integrating our prior knowledge and the knowledge obtained from the data in the form of the likelihood function.
To get an idea of what the posterior tells us about the possible model parameters, we will now draw a number of parameters from the distribution. Parameters with higher probability are more likely to be drawn, while parameters far from the peak will be drawn much less often.
In the following code snippet, we plot 60 models (lines) drawn from the distribution. As we can see, all lines are very close together in the area of the data points. This is because the data points “restrict” the direction of the lines, or at least lower the probability of lines pointing in different directions. Towards the sides, however, the lines are “allowed” to spread apart because no further data points constrain them.

f = plt.figure(figsize=(12,4))
ax1 = plt.subplot2grid((1, 3), (0, 0), colspan=1)
ax2 = plt.subplot2grid((1, 3), (0, 1), colspan=2)


#draw N parameters from the posterior distribution
N = 60;
draws = stats.multivariate_normal.rvs(mN.ravel(),SN,N)

#visualize the drawn parameters
ax1.scatter(draws[:,0],draws[:,1],c='b',s = 50,label='drawn parameters')
ax1.set_xlim(-1,1);
ax1.set_ylim(-1,1);
ax1.set_title('drawn parameters');

#draw the posterior mean (MAP estimate)
ax1.plot(mN[0],mN[1],'r*',markersize=20,label ='mN')
ax1.legend();

#draw the line model for each parameter
tempx = np.array([[-5.0],[10]]);
for draw in draws:
    ax2.plot(tempx,linearmodel(tempx,draw),'b-',linewidth=1,alpha=0.5);
ax2.set_title('according models');

#draw the line for the posterior mean parameter in red
ax2.plot(tempx,linearmodel(tempx,mN),linewidth=3,c='r');

#also show the original data points 
ax2.plot(x,t,'ko',label='data points',markeredgecolor='w');
ax2.legend();

Visualizing the prediction probability

We can already see that the spread of the lines towards the edges is related to the prediction uncertainty, which grows when extrapolating into yet unknown regions. We can use a little trick to generate a visual representation of this prediction probability. First, we make each line a little thicker to represent the stochastic part of the model due to noise (defined by $\beta$). Then, we give each line a very high transparency.
Now, if we draw many more samples from the distribution, we essentially cover the whole parameter space (or at least the part with significant probability). Lines that are drawn more often will overlap, leading to a higher opacity and thus indicating a higher prediction probability. In contrast, lines corresponding to less probable parameters overlap less and end up with much lower opacity; their predicted values accordingly have a very low probability.

f = plt.figure(figsize=(12,4))
ax1 = plt.subplot2grid((1, 3), (0, 0), colspan=1)
ax2 = plt.subplot2grid((1, 3), (0, 1), colspan=2)

#draw 300 parameter samples from the posterior distribution
#using the stats framework for multivariate normal distributions,
#providing the posterior mean and our calculated covariance matrix
draws = stats.multivariate_normal.rvs(mN.ravel(),SN,300)

#plot the drawn parameters 
ax1.scatter(draws[:,0],draws[:,1],c='b',s = 20,label='drawn parameters')
ax1.plot(mN[0],mN[1],'r*',markersize=20,label ='mN')
ax1.set_xlim(-1,1);
ax1.set_ylim(-1,1);
ax1.legend();
ax1.set_title('drawn parameters');

#plot the models with a small alpha value
for draw in draws:
    tempx = np.array([[-10.0],[10.0]]);
    ax2.plot(tempx,linearmodel(tempx,draw),linewidth=20,alpha=1/100.0,color='b');

#draw the posterior mean model
ax2.plot(tempx,linearmodel(tempx,mN),'r-',linewidth=3);
#also show the original data points
ax2.plot(x,t,'ko',label ='data points',markeredgecolor='w');
ax2.set_xlim(-5,10);
ax2.set_ylim(-5,7.5);
ax2.set_title('Visualization of Prediction Probability');

Analytical solution to the Bayesian predictive distribution

We see that, in contrast to the predictive distribution of the MLE, which only modeled the data uncertainty, the obtained distribution has a varying variance which depends on $x$. In order to obtain the true distribution, we have to integrate over the whole parameter space and weight each stochastic prediction model with its posterior probability. Luckily, an analytical solution can be derived, which is taken from Bishop as well. The variance $\sigma^2$ around the mean value (which coincides with the posterior mean $m_N$) can be calculated using the following equation. Of course, the variance depends on the two hyperparameters $S_0$ and $\beta$. The dependency on $x$ shows once more that the variance is not constant but depends on the input $x$.

$$\sigma^2(x) = \frac{1}{\beta} + \phi(x)^T\cdot S_N\cdot\phi(x)\\ \text{Equation 4: Variance of the Bayesian predictive distribution}$$

Visualization of the analytical solution

To visualize the obtained predictive distribution, we first draw the mean and then use the above equation to calculate the variance for a range of x values. We then draw polygons for the first four standard deviations around the mean. As we can see, the predictive distribution looks very similar to the distribution which we obtained using the trick above.

plt.figure(figsize=(8,4));

def predictive_distribution_variance(beta,x,SN): #variance of the predictive distribution (Equation 4)
    Phi = lineRegressor(x);
    return np.vstack(np.array([1.0/beta + phi.dot(SN).dot(phi) for phi in Phi]));

def variancePolygon(axes,model,std,X,alpha=1,color='r',drawUpTo=1,label=''):
    #draw shaded polygons at 1..drawUpTo standard deviations around the mean
    Y = list()
    Std = list();
    for x in X:
        Y.append(model(x));
        Std.append(std(x));
    Y   = np.array(Y).ravel();
    Std = np.array(Std).ravel();
    axes.plot(X,Y,color+'-',label=label);
    for i in range(1,drawUpTo+1):
        temp_std = Std * float(i);
        temp_alpha = alpha/drawUpTo;
        axes.fill(np.append(X,np.flipud(X)),np.append(Y-temp_std,np.flipud(Y+temp_std)),color,alpha=temp_alpha)

variancePolygon(plt.gca(),
                lambda x: linearmodel(x,mN),
                lambda x: np.sqrt(predictive_distribution_variance(beta,x,SN)), #std = sqrt of the variance
                np.linspace(-10,10,100),
                alpha=1.0,color='b',drawUpTo=4,label='predictive distribution');

plt.gca().set_xlim([-5,10]);
plt.gca().set_ylim([-5,7.5]);
plt.plot(tempx,linearmodel(tempx,mN),'r-',linewidth=3,label='mN');
plt.plot(x,t,'ko',label ='data points',markeredgecolor='w');
plt.legend();
plt.title('Analytical Predictive Distribution');

Conclusion

We can now assign a probability to every predicted value, or even generate predictions by adding a normally distributed random variable with variance set to $\sigma^2(x)$. In the area around the data points, we have very high confidence in our predictions. Going further away, the models are not restricted by any data points and, naturally, the confidence in our predictions becomes smaller; the uncertainty on the right side of the plot is considerable.
Predicting with the MLE model alone would be dangerous because it hides this high level of uncertainty. Using the predictive distribution, we do not discard other parameter solutions but instead incorporate them into the prediction process in the form of parameter uncertainty.
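As a closing illustration (a sketch reusing mN, SN, beta and the helper functions defined above, with hypothetical input values), such predictions could be generated like this:

def sample_predictions(x_new, n_samples=1000):
    #mean prediction from the posterior mean parameters
    mean = linearmodel(x_new, mN).ravel()
    #input-dependent predictive variance from Equation 4
    var = predictive_distribution_variance(beta, x_new, SN).ravel()
    return np.random.normal(mean, np.sqrt(var), size=(n_samples, mean.size))

#predictions far from the data spread much more than those near it
near, far = sample_predictions(1.5), sample_predictions(8.0)
print(near.std(), far.std())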
