Python: Predictive Distribution of the Least Square Estimate

If you want to see the code with syntax highlighting, download the gist for this post from my GitHub.

In the previous post, we looked at the numerical calculation of the maximum likelihood estimate (MLE). As you might know, we can obtain the same solution in a much easier way using the method of least squares. It can be shown that maximizing the likelihood under the assumption of normally distributed data gives the same solution as minimizing the quadratic error given below. Here, $\hat{y}$ and $y$ represent the predicted (model output) and the measured output values, respectively.

$$\sum_{n=1}^N (\hat{y}_n-y_n)^2$$
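To see why both criteria lead to the same estimate, consider the log-likelihood of $N$ independent measurements that are normally distributed around the model output $\hat{y}_n$ with a fixed standard deviation $\sigma$:

$$\log \prod_{n=1}^{N} \mathcal{N}\!\left(y_n \mid \hat{y}_n, \sigma^2\right) = -\frac{N}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(\hat{y}_n - y_n\right)^2$$

The first term and the factor $\frac{1}{2\sigma^2}$ do not depend on the model parameters, so maximizing the likelihood is equivalent to minimizing the sum of squared errors above.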

Instead of calculating likelihood values over the whole range of parameters and picking the combination with the maximum, the least squares problem can be solved in a single step, yielding the linear least squares estimate (LSE). We can calculate this solution using the NumPy function np.linalg.lstsq(X,Y). Here, X and Y are the so-called regression matrix and output vector. The least squares problem can only be solved directly if the underlying model is linear in its parameters, meaning that the output is a weighted sum of (functions of) the inputs with the parameters as weights. For the line model this is the case, and we can write it in the following vectorized notation, also called a linear regression model.

$$y = \underbrace{[1,x]}_{\text{regression vector}}\cdot \begin{bmatrix} b\\ m \end{bmatrix}$$

For each data point, the regression vector has only two entries: the constant $1$, which allows a constant offset given by the bias parameter $b$, and the $n$-th input value $x_n$, which is multiplied by the slope $m$. For each of the $N$ measured data points we create the corresponding regression vector, and stacking them for all outputs $y_n$ we end up with a set of linear equations that can be written in the following matrix notation. The lstsq(X,Y) function takes exactly this regression matrix $X$ and output vector $Y$ as arguments and solves for the parameter vector $\theta = [b, m]^T$ that minimizes the quadratic error; a short closed-form cross-check is sketched right after the matrix equation. Now let's apply this to the problem of the previous post and compare both solutions.

$$ Y = \begin{bmatrix} y_1 \\ \vdots\\ y_N \end{bmatrix} = \underbrace{\begin{bmatrix} [1,x_1] \\ \vdots\\ [1,x_N] \end{bmatrix}}_{X} \cdot \begin{bmatrix} b\\ m \end{bmatrix} $$
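As an aside before running the comparison: for a full-rank regression matrix the minimizer of the quadratic error also has the closed-form solution $\hat{\theta} = (X^T X)^{-1} X^T Y$, the so-called normal equations; lstsq solves the same problem in a numerically more robust way. A minimal, self-contained sketch with made-up numbers:

import numpy as np

#hypothetical example data roughly following y = 2 + 1*x
x_demo = np.array([0.0, 1.0, 2.0, 3.0]);
y_demo = np.array([2.1, 2.9, 4.2, 4.9]);
X_demo = np.column_stack((np.ones_like(x_demo), x_demo)); #regression matrix with rows [1, x_n]

theta_lstsq = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0];
theta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo);
print(theta_lstsq, theta_normal); #both yield the same [b, m] up to floating point precision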

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

def line(x,b,m):
    return b+m*x;

#create N = 30 noisy data points around the line y = b + m*x with b = 2, m = 1
mb_true = [2.0,1.0]; #true parameters [b, m]
std_true = 1.0;
N = 30;
x = np.vstack(np.linspace(0,4,N));
y = line(x,mb_true[0],mb_true[1]) + np.vstack(np.random.normal(0,std_true,x.shape[0]));


#Calculate MLE
def gaussian(x, mu, sig):
    #unnormalized Gaussian density; the missing constant factor does not affect the argmax
    return np.exp(-np.power(x - mu, 2.0) / (2.0 * np.power(sig, 2.0)))
def likelihood(x,y,model,std):
    mu = model(x); #model prediction for every input
    ps = gaussian(y,mu,std); #per-point densities
    return np.prod(ps); #data likelihood = product over all points
#create a grid covering the parameter space (M holds candidate offsets b, B holds candidate slopes m)
res = 20;
M,B = np.meshgrid(np.linspace(-3.5,9,res),np.linspace(-0.5,2.5,res));
MB = np.c_[M.ravel(),B.ravel()]; #each row is one candidate pair [b,m]
#calculate likelihoods
sigma = 2.0 #assumed noise standard deviation; for a fixed sigma the location of the maximum does not change
L = np.array([likelihood(x,y, lambda x: line(x,mb[0],mb[1]),sigma) for mb in MB]).reshape(M.shape)
#select parameter with maximum likelihood
mb_max = np.array([M[np.unravel_index(L.argmax(),L.shape)],B[np.unravel_index(L.argmax(),L.shape)]]) 


#Calculate the LSE
def lineRegressor(x):
    #build the regression matrix X with one row [1, x_n] per data point
    phi = np.concatenate((np.ones([len(x),1]),x),axis=1);
    return phi;
X = lineRegressor(x);
least_square_result = np.linalg.lstsq(X,y,rcond=None)
mb_ls = least_square_result[0].ravel();


#Draw results
x_temp = np.linspace(-5,15,2); #used to draw the lines beyond the data points
f,(ax1,ax2) = plt.subplots(1,2,figsize=(12,4));
#draw data points and true line
ax1.plot(x, y,'k.',markersize=15,label='data points');
ax1.plot(x_temp, line(x_temp,mb_true[0],mb_true[1]),'k--',label='True');
ax1.set_xlabel('x');
ax1.set_ylabel('y');
#draw estimated lines
ax1.plot(x_temp, line(x_temp,mb_max[0],mb_max[1]),'y-',label='MLE');
ax1.plot(x_temp, line(x_temp,mb_ls[0],mb_ls[1]),'m-',label='LSE');
#draw likelihood
ax2.contourf(M,B,np.power(L,0.1),cmap=plt.cm.binary);
ax2.plot(M,B,'w+',markersize=4,alpha=0.5);
ax2.set_xlabel('b');
ax2.set_ylabel('m');
ax2.set_title('data likelihood');
#mark parameter estimate
ax2.plot(mb_max[0],mb_max[1],'yo',markersize=10,label='MLE');
ax2.plot(mb_ls[0],mb_ls[1],'mo',markersize=10,label='LSE');
ax1.legend();
ax2.legend();

As we can see, we end up with almost the same estimates for the MLE and the LSE. The small difference is due to the quantization error of the parameter grid. If we had used a higher resolution (more grid points, indicated by the white + signs in the right plot), we would have found an estimate closer to the true maximum. At the same time, however, a higher resolution increases the number of computations; in our case with two parameters, the number of computations grows quadratically with the resolution. This already shows that calculating the LSE directly is far more efficient than numerically searching for the MLE.
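To make the cost difference tangible, one can time the grid search against the direct solution, reusing the functions and data from above (absolute timings will of course depend on the machine and on the chosen resolution):

import time

res_fine = 100; #finer grid: res_fine**2 = 10000 likelihood evaluations
M2,B2 = np.meshgrid(np.linspace(-3.5,9,res_fine),np.linspace(-0.5,2.5,res_fine));
MB2 = np.c_[M2.ravel(),B2.ravel()];

t0 = time.time();
L2 = np.array([likelihood(x,y, lambda x: line(x,mb[0],mb[1]),sigma) for mb in MB2]);
t1 = time.time();
np.linalg.lstsq(X,y,rcond=None);
t2 = time.time();
print('grid MLE: %.3f s, direct LSE: %.6f s' % (t1-t0, t2-t1));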

Predictive distribution

Now that we have estimated the parameters $b$ and $m$ using the method of least squares, we can use the model to predict new $y$ values for yet unseen inputs $x$. In the left plot below, we predict points for $x$ values below zero and above five, which are not covered by the data.
It becomes obvious that the predicted values, while lying on the estimated line, do not reflect the stochastic part of the data. In practice it can be important to also model this uncertainty when predicting new values. We can do so by adding a random variable $v$ to the model.

$$y = b + m x + v,\quad v \sim \mathcal{N}(0,\sigma^2)$$
Again, we assume that $v$ is normally distributed noise with zero mean (so the mean of $y$ is the line itself) and standard deviation $\sigma$. To find $\sigma$, we calculate the variance $\sigma^2$ of the residuals, defined as the normalized sum of squared errors. Note that we used the same assumption about our data when we calculated the estimate with the maximum likelihood method. In our case the assumption holds, since the data was indeed disturbed by white Gaussian noise.

$$\sigma^2 = \frac{1}{N}\underbrace{\sum_{n=1}^{N} \left(\hat{y}(x_n)-y_n\right)^2}_{\text{RSS}}$$

Instead of calculating $\sigma$ manually, the lstsq(X,Y) function already returns the sum of squared residuals (RSS) as its second return value. This value is the non-normalized variance of the residuals; dividing it by the number of data points and taking the square root yields the standard deviation $\sigma$.

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\text{RSS}}{N}}$$
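To see that the two routes agree, the RSS returned by lstsq can be cross-checked against a manual computation of the residuals (this reuses X, y, N and least_square_result from the code above):

#manual computation of the residual sum of squares and the standard deviation
residuals = y.ravel() - X @ least_square_result[0].ravel();
rss_manual = np.sum(residuals**2);
sigma_manual = np.sqrt(rss_manual/N);
print(rss_manual, least_square_result[1][0]); #both values should agree
print(sigma_manual);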

Now, we predict a new value by adding normally distributed noise with standard deviation $\sigma$ to the model output. The predicted output thus becomes a random variable itself. We can visualize this distribution by drawing the first few standard deviations $\sigma$ around the model.
Note: Bishop calls this distribution the predictive distribution, as it assigns a probability to each predicted value. This can be confusing, since he also uses the term for the fully Bayesian approach, which additionally models the uncertainty of the model parameters and which we will cover in the next post.

f, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(14,4),sharey=True);

#predict new data points
x_predict = np.concatenate((np.linspace(-5,-1,10),np.linspace(6,15,20)));
y_predict = line(x_predict,mb_ls[0],mb_ls[1]);
#draw
ax1.plot(x_temp,line(x_temp,mb_ls[0],mb_ls[1]),'m--',label='LSE');
ax1.plot(x, y,'k.',markersize=15,label='$y$');
ax1.plot(x_predict,y_predict,'m.',markersize=15,label=r'$\hat{y}$');
ax1.set_title('predictions');
ax1.set_xlabel('x');
ax1.set_ylabel('y');
ax1.legend();

#stochastic prediction
rss_ls = least_square_result[1][0]; #sum of squared residuals (RSS) returned by lstsq
std_ls = np.sqrt(rss_ls/N); #standard deviation of the residuals
y_predict = line(x_predict,mb_ls[0],mb_ls[1]) + np.random.normal(0,std_ls,len(x_predict));
#draw
ax2.plot(x_temp,line(x_temp,mb_ls[0],mb_ls[1]),'m--');
ax2.plot(x, y,'k.',markersize=15);
ax2.plot(x_predict,y_predict,'m.',markersize=15);
ax2.set_title(r'stochastic prediction with $\sigma$ = %2.2f' % std_ls);
ax2.set_xlabel('x');

#visualize predictive distribution by shading bands of +/- 1,2,... standard deviations around the model
def variancePolygon(axes,model,xmin,xmax,resolution,std,alpha=1,color='r',drawUpTo=1):
    X = np.linspace(xmin,xmax,resolution);
    Y = np.array([model(x) for x in X]);
    for i in range(1,drawUpTo+1):
        temp_std = std*float(i);
        temp_alpha = 0.5*float(drawUpTo-i+1)*alpha/drawUpTo;
        #closed polygon around the band y(x) +/- i*std (fading out with increasing i)
        axes.fill(np.append(X,np.flipud(X)),np.append(Y-temp_std,np.flipud(Y)+temp_std),color,alpha=temp_alpha,label=r"%i $\sigma$"%i)
    axes.legend();
variancePolygon(ax3,lambda x: line(x,mb_ls[0],mb_ls[1]),-5,15,2, std_ls,1.0,'m',3);
ax3.plot(x, y,'k.',markersize=15,label='data points');
ax3.plot(x_predict,y_predict,'m.',markersize=15,label='predictions');
ax3.set_title('predictive distribution');
ax3.set_xlabel('x');
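Since the prediction is now a random variable, we can also evaluate its density and intervals explicitly instead of only sampling from it. A small sketch of what that could look like for a single new input, using scipy.stats (an extra dependency not used elsewhere in this post) and the values estimated above:

from scipy.stats import norm

x_new = 10.0 #hypothetical new input
y_mean = line(x_new, mb_ls[0], mb_ls[1]) #plug-in mean of the prediction
sigma_pred = float(std_ls) #plug-in standard deviation of the prediction

#density of a candidate output value under the predictive distribution N(y_mean, sigma_pred^2)
print(norm.pdf(12.0, loc=y_mean, scale=sigma_pred))
#central 95% interval for the predicted output
print(norm.interval(0.95, loc=y_mean, scale=sigma_pred))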
 