In this post I want to talk about regression and the maximum likelihood estimate. Instead of going the usual way of deriving the least squares (LS) estimate, which coincides with the maximum likelihood (ML) estimate under the assumption of normally distributed noise, I want to take a different route: instead of using the analytical LS solution, I will show you how we can compute the ML estimate numerically.
Note: If you want to see the python code, check out this link.
To give you a smooth start, let's quickly recap the definition of likelihood (also called data likelihood). In the context of supervised learning, the likelihood function tells us how likely it is to observe a data set given an assumed structure of the underlying system that produced the data (here called the model structure). By using a parameter vector to describe a concrete instance of the model, the likelihood function then describes the conditional probability of seeing the data given a concrete model.
The data consists of measured input and output data points (here, x and y values). Note that each x and y value could also be a vector for systems with multiple inputs and/or outputs.
Maximum Likelihood Estimate (MLE)
If we assume that the structure of the underlying system is known, our job is to find those model parameters which most likely represent the underlying true system. Of course, the above assumption of knowing the model structure a priori is a strong one and cannot always be assumed to hold. Now, if we hypothetically had a likelihood function that assigns a likelihood to each and every parameter, we could simply choose the parameter with the highest likelihood, which is called the maximum likelihood estimate. In the following, we will discuss how to calculate this likelihood function.
Note: In contrast to using the least squares estimate to directly calculate the MLE, calculating the likelihood function manually has a great advantage which we will exploit in the next article. With the least squares estimate, we only retrieve the most likely parameter but not its likelihood. Moreover, we dismiss all other parameters, no matter how likely they might be. This way, one might end up with a parameter that is only marginally more likely than others while completely ignoring those other, also plausible solutions.
Example with Line Model
To make things a little more intuitive, let's look at the example below. In the first plot, the black points represent our data set. It is easy to see that the points follow a line with some added noise. Remember that a line can be described by two parameters, its offset and its gradient (the formula is given above).
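Such a data set can be sketched in a few lines; the line parameters, noise level, and number of points below are illustrative assumptions, not the post's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true line: y = offset + gradient * x
theta_true = (1.0, 2.0)  # (offset, gradient) -- illustrative values

# Sample inputs and disturb the line's outputs with Gaussian noise
x = np.linspace(0.0, 1.0, 20)
sigma_true = 0.1  # noise standard deviation used to generate the data
y = theta_true[0] + theta_true[1] * x + rng.normal(0.0, sigma_true, size=x.shape)
```

Plotting `x` against `y` reproduces the kind of scattered-around-a-line data shown in the first plot.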
In the second plot, we added three lines with different parameters to the plot. It is obvious that the blue line best describes the data points because it deviates least from them. To put this in a different, more probabilistic perspective, we could also argue that observing the given data is more likely if the measurements were taken from a system that has the shape of the blue rather than the red or the green line. As a consequence, the parameter vectors describing the green and red lines should be assigned a smaller data likelihood.
Numerical calculation of likelihood
The example above illustrates that we can intuitively assign likelihoods to different lines. But how do we numerically calculate the likelihood for different parametrizations? Again, we have to make some more assumptions before we can perform the calculation.
We assume that the data has been disturbed by white Gaussian noise with zero mean and standard deviation sigma, which must be known a priori (see below). Then, the data can be described by the following formula. It simply states that the data is normally distributed around the model itself, which can be seen by rearranging the formula.
In practice, sigma could be set to the measurement variance, which might be known from a data sheet. In the code example above, we used a fixed standard deviation to create the artificial data. To make things more realistic, we assume that we have no prior knowledge and thus use a rough estimate for sigma instead.
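This noise model can be sketched as follows; the two-parameter line model and the concrete values are hypothetical stand-ins for the post's formula:

```python
import numpy as np

def point_probability(y, x, theta, sigma):
    """Gaussian density of one measurement y around the line model
    theta[0] + theta[1] * x, assuming noise standard deviation sigma."""
    mu = theta[0] + theta[1] * x  # model prediction at input x
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
```

The density is highest when `y` equals the model prediction and falls off symmetrically on both sides, which is exactly the bell shape drawn around the lines in the plot below.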
As the probability is assumed to be normally distributed around the model, it is highest if the data and model values coincide and becomes smaller the further the data deviates from the model. The plot below illustrates the probability around the model by plotting the first two standard deviations around the three lines.
We can now calculate the likelihood for each parameter configuration by making use of the above assumption. We define the likelihood as the product of the probabilities of the individual data points. Note that in the equation below, we left out the constant factor of the normal distribution because it does not depend on the model, and we are only interested in finding the model with the highest likelihood, not its exact likelihood value.
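A sketch of this likelihood, with the constant factor dropped as described (the line model is again the assumed two-parameter line):

```python
import numpy as np

def likelihood(theta, x, y, sigma):
    """Unnormalized data likelihood: the product over all data points of
    exp(-0.5 * ((y_i - model(x_i)) / sigma)**2). The constant factor
    1 / (sigma * sqrt(2*pi)) is dropped, as in the text."""
    residuals = y - (theta[0] + theta[1] * x)
    return np.prod(np.exp(-0.5 * (residuals / sigma) ** 2))
```

A parametrization that passes exactly through all points attains the maximum value 1; every deviation from the data multiplies in a factor below 1.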
If we calculate the likelihood for the three lines, we can see that the blue line has the highest likelihood because the data points deviate the least from the model (left plot). Note that we use the 10th root of the likelihood because the differences in likelihood are so huge that the values of the red and green lines would be completely invisible if we used them directly. This is because we assumed a rather small level of noise (represented by sigma).
By increasing sigma, we assume that our data is distorted by noise with a much greater amplitude. As a consequence, it becomes harder to tell which line the data points belong to. The example below shows how the likelihoods of the three lines become increasingly similar as we increase sigma (right plot).
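This effect can be reproduced numerically; the "bad" parametrization and the sigma values below are illustrative assumptions:

```python
import numpy as np

def likelihood(theta, x, y, sigma):
    # Unnormalized likelihood, as defined above
    residuals = y - (theta[0] + theta[1] * x)
    return np.prod(np.exp(-0.5 * (residuals / sigma) ** 2))

x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x          # noise-free data from a hypothetical "blue" line

bad = (0.5, 1.0)           # a deliberately worse parametrization
for sigma in (0.1, 1.0, 10.0):
    # Ratio of the worse line's likelihood to the best line's likelihood:
    # the larger sigma, the closer this ratio gets to 1
    ratio = likelihood(bad, x, y, sigma) / likelihood((1.0, 2.0), x, y, sigma)
    print(sigma, ratio)
```

With a small sigma the worse line is effectively ruled out; with a large sigma the two lines become nearly indistinguishable in likelihood.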
So far, we have only calculated the likelihood for the three randomly chosen lines with fixed parameters. Now, to find the maximum likelihood estimate for the offset and gradient, we calculate the likelihood for a whole range of parameters. To do so, we create a whole range of parameter pairs (meshgrid). To limit the computational demands, we restrict the parameter boundaries and use a resolution of 50 per parameter, resulting in 2500 parameter combinations for which we then calculate the likelihood.
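This step can be sketched as follows; the parameter boundaries, the assumed sigma, and the data-generating line are illustrative assumptions, since the post's exact values are not shown here:

```python
import numpy as np

# Illustrative data from a hypothetical true line y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Grid of parameter pairs (offset, gradient); boundaries are assumptions
offsets = np.linspace(-2.0, 4.0, 50)
gradients = np.linspace(-2.0, 4.0, 50)
T0, T1 = np.meshgrid(offsets, gradients)

# Unnormalized likelihood for every parameter pair at once: broadcast the
# (50, 50) parameter grids against the 20 data points, then take the
# product over the data-point axis
residuals = y - (T0[..., None] + T1[..., None] * x)
L = np.prod(np.exp(-0.5 * (residuals / 0.5) ** 2), axis=-1)

# 10th root for visualization, as described in the text
L_vis = L ** 0.1
```

Broadcasting avoids an explicit double loop over the 2500 parameter pairs.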
Since we have two parameters, we can visualize the likelihood using a two-dimensional contour plot. The darkness then indicates the likelihood of the parameter pair. Since the likelihoods differ by many orders of magnitude, we again transform the values using the 10th root to make the shape of the likelihood function visible.
We also added the parameter values of the three lines to the plot. As expected, the blue parameter pair is much closer to the maximum of the likelihood function than the green and red line parameters.
Then, we get the index of the maximum value in the meshgrid of likelihood values and use this array index to access the corresponding parameter pair from the array (meshgrid) of parameters. As we can see, the maximum likelihood parameter (yellow) is slightly offset from the true parameter value, that is, the value which was used to generate the data. This is because of the random noise that has been added to the data: due to the normally distributed noise and the limited number of data points, the data points are not evenly distributed along the true line. As a result, a different, slightly offset line is estimated.
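The index lookup can be sketched like this; the grid setup repeats the illustrative assumptions from the meshgrid step so the snippet stands alone:

```python
import numpy as np

# Illustrative data and parameter grid (hypothetical values, as before)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

offsets = np.linspace(-2.0, 4.0, 50)
gradients = np.linspace(-2.0, 4.0, 50)
T0, T1 = np.meshgrid(offsets, gradients)
residuals = y - (T0[..., None] + T1[..., None] * x)
L = np.prod(np.exp(-0.5 * (residuals / 0.5) ** 2), axis=-1)

# Flat argmax -> 2D grid index -> corresponding parameter pair
idx = np.unravel_index(np.argmax(L), L.shape)
theta_mle = (T0[idx], T1[idx])
print(theta_mle)  # close to, but not exactly, the generating parameters
```

`np.unravel_index` converts the flat position returned by `np.argmax` into the row/column index needed to read the parameter value out of the meshgrid arrays.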
The error becomes increasingly small as the number of data points grows. In the plot below, we compare the MLE with respect to the number of data points used for the calculation of the likelihood function.
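A rough sketch of this comparison; the grid, ranges, and data-generating line are again illustrative assumptions:

```python
import numpy as np

def grid_mle(n_points, sigma_assumed=0.5, seed=0):
    """Grid-based MLE of (offset, gradient) from n_points noisy samples
    of a hypothetical line y = 1 + 2x (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=n_points)
    offsets = np.linspace(-2.0, 4.0, 50)
    gradients = np.linspace(-2.0, 4.0, 50)
    T0, T1 = np.meshgrid(offsets, gradients)
    residuals = y - (T0[..., None] + T1[..., None] * x)
    L = np.prod(np.exp(-0.5 * (residuals / sigma_assumed) ** 2), axis=-1)
    idx = np.unravel_index(np.argmax(L), L.shape)
    return T0[idx], T1[idx]

# Distance of the grid MLE from the generating parameters for growing n
for n in (2, 5, 50):
    t0, t1 = grid_mle(n)
    print(n, abs(t0 - 1.0), abs(t1 - 2.0))
```

With more data points the likelihood surface concentrates around the generating parameters, so the grid maximum lands closer to them.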
As one can see, using only a single data point creates a very broad likelihood function with no clear maximum. As a consequence, no single maximum exists. Since we are only using a single point, we basically assign equal likelihood to every line that goes through this point.
Using more data points, the maximum slowly builds up around the original parameter value. As we use more and more data points, the maximum becomes more prominent, reflecting a higher likelihood for parameters in this region. While in theory two points would be enough to exactly define a line, the likelihood function also accounts for the uncertainty of the measurements.