Evaluating Optimal Fits of Functions to Data

Method of Maximum Likelihood

Generally in science, we make a set of measurements in an experiment, gather the data, and attempt to learn something from it. Since no experiment can be done perfectly, we must always allow for the fact that there is an uncertainty associated even with the most quantitative measurements. That is to say, there is no certainty except that all measurements are uncertain to some degree. Repeat the experiment again with exactly the same conditions and you will find similar numbers to the previous experiment, but they will not be exactly the same. So to, for a single experiment with multiple measurements we do not get the exactly the same measurements each time in the great majority of cases. We see very similar results, but still there is a random variation in the last decimal place of the numbers. Although each case must be analyzed separately, it is generally true that these variations are caused by small imperfections or errors in the method of measurement and in the measurement apparatus. Hence, there is some uncertainty about the result we conclude on the basis of our data with its inherent uncertainty. For example, we can measure the acceleration due to gravity to be 9.8 m/s^2, but is it 9.81 m/s^2 or 9.79 m/s^2? The round-off takes care of having to know the last decimal place, but to be good scientists, we must specify our uncertainty on the last decimal place so that others who make measurements of the same quantity can validly compare their measurement results to our own.

A great deal of science therefore depends on knowledge of the mathematical laws of probability and statistics. We have to estimate the amount of uncertainty in our measurements and take that uncertainty into account when calculating a final result. In many cases, we make measurements in order to relate experimental variables to some measureable result, i.e. we change a voltage or position or velocity which we can determine and we watch the system under study change in some measureable way. The desired result is to determine the functional form of the connection between what we changed and what we observe. On a graph, what we change would be specified in units along the x axis, for example, and what we observe would have its value specified along the y axis. Hence a measurement consists of a given (x,y) combination. We wish to find the form of the function, y(x), i.e. how y depends on x. Since we don't measure y exactly, we expect that several measurements of y for the same given value of x will give us some idea of the uncertainty on y for that point in x. In real-world cases, we will also have to worry about the fact that our knowledge of x is also not perfect.

The measured value of y(x) is said by mathematicians to be determined by the parent distribution which governs the probability of making any particular observation. In other words, there is a true functional form y(x) which tells us the true value of y for any given x. We do not see this ``true'' value, but we measure something close to it. The small errors mentioned earlier contrive to keep us from seeing what y really is. However, the more measurements of y we make for a given x, the more certain we are of the true value. It can be mathematically proven that, in general, an infinite number of measurements would give us exact knowledge of y. In practice, the ultimate limits set by quantum mechanics prevent us from ever having exact knowledge of any physical parameter, but, in the cases scientists usually deal with, we will never need to know exactly what the parent distribution is, we simply need a good approximation to it. The parent distribution describes what the shape of the distribution is for measurements of y at some particular value of x.

To get at the form of y(x), we take a sample of measurements or observations. This particular sample is just one of a great many that we ``extract'' from the parent distribution. Take another sample and we get a slightly different ``extraction'' which mimics the parent distribution. The parent distribution then determines the probability of making any particular set of observations for a sample.

How do we determine what the parent distribution is given a sample? The interesting thing is that physicists and mathematicians who work in a branch of physics called statistical mechanics have shown that the many small perturbations (what we referred to earlier as errors) that effect a physical system almost always force the measurement to follow the Gaussian distribution. This scattering of random numbers is so common, it's usually referred to as simply the normal distribution.Hence, our first guess for any measurement method should be that the measurement uncertainty follows a Gaussian distribution. If that's so, then we can follow a rather simple mathematical trail to formulating a procedure which is guaranteed to give us the most probable shape of the functional form for y(x). If we define the symbol y_i to be the measured value of y for some particular value of x made in observation i (defined as seeing y_i at x_i), then the probability P_i of having observed y_i is given by

The probability of observing any particular sample of N measurements of giving the y_i values seen is the product of the probabilities for getting any particular measurement value. So, the probability is

where the products and sums are taken over the measurements 1 to N.

Fitting Functions

If the function y(x) has parameters a_j such that y is any well-behaved function, e.g. y(x) = a_1 * cos(a_2*x) or y(x) = a_1*exp[-(x - a_2)^2/(2*a_3^2)] + a_4 + a_5*x + a_6*x^2, then our task is to find the set of a_j corresponding to the true functional form of y(x). The maximum likelihood method assumes that the observed set of measurements are more likely to have come from the parent distribution (i.e. the formula for y(x) with all the correct coefficients a_j) than from any similar distribution with different coefficients a_j. Therefore, the probability given above for N measurements is the maximum probability if the values of a_j are just those of the true parent distribution. That means that the ``correct'' values of a_1, a_2, a_3, etc. are those which maximize the probability function shown for N measurements. Now, the first term is a constant and independent of the parameters a_j. Therefore, maximizing the probability is equivalent to minimizing the sum in the exponential. Generally, this sum is defined by the symbol chi-square

Hence, to find the function which is the optimal fit to the data, we minimize this weighted sum of squares of deviations between the measurements and the predicted values for the measurements. The fit which gives the smallest sum of squares or the least-squares fit maximizes the probability that the measurements came from the parent distribution described by our fit. The chi-square also quantifies the goodness-of-fit, i.e. it gives a measure of how likely it is that any particular functional form we put in for y(x_i) fits the data, y_i.

To minimize the chi-square, we must take the derivative of it with respect to each parameter, a_j, set the derivatives equal to zero and simultaneously solve the set of equations we get. Another way of saying it is that setting a derivative of chi-square with respect to a particular parameter, a_j, equal to zero and solving minimizes the chi-square with respect to that parameter. We need to simultaneously minimize the chi-square with respect to each parameter to obtain the true minimum of least squares deviations. Hence, we solve the simultaneous equations

This is called the analytic method. In cases where the function y(x) is something like a Gaussian distribution, it is generally difficult to obtain a closed form solution for the minimization equations. Instead, we consider the chi-square function to be a continuous function of the parameters, a_j. This function can be visualized as a function in a multidimensional space where each independent parameter, a_j, defines a coordinate. What we look for is the position in this space which forms the lowest point for the surface. The coordinates of this position are the values of the parameters, a_j, that minimize the chi-square.

The difficulty with finding the lowest point is that we have to develop an algorithm which will search for it. The parameter space can be vast, hence it can take quite a long time for an algorithm to find the minimum. Also, many chi-square functions are complex enough to have several local minima, i.e. low points which are the lowest in a neighborhood, but are not the lowest point overall. Therefore, our algorithm must be clever if it is quickly find the true minimum and not be fooled by local minima. There are a number of algorithms that do this. They fall into several types. One type is a set of algorithms that simply search the parameter space in a clever way in order to find its way to the minimum. An example of this type is the grid search method. Another type of algorithm is one which uses approximate analytical methods which approximate the answer by finding an approximate closed-form solution for the simultaneous minimization equations. The approximation is improved by executing a prescribed set of mathematical steps. These methods are often faster than parameter space searching, but require more care in programming. An example is the gradient search method.


The Analytic Method

Fitting Straight Lines

The simplest set of observations that can be analyzed by the analytic method is one in which a linear relationship is assumed as the parent distribution from which the data is drawn. For example, let us assume that we have a set of data which appears to follow a linear relationship for the dependence of y on x. The parent distribution is then given by particular values for the slope and intercept, represented by the symbols a_0 and b_0, such that the actual relationship between y and x is

y(x) = a_0 * x + b_0

For any given value of x we measure, let's say we measure the value x_i on the i_th observation, we can find the probability, P_i for making the observed measurement y_i. As usual, we assume a Gaussian distribution with standard deviation of sigma_i and mean equal to the true value y(x_i), governs our observations. Then, as argued in the section above, we can write down the chi-square whose minimum represents the maximum probability for the probability of our particular data set given a particular parent distribution is

This approach is copied from Bevington's Data Reduction and Error Analysis for the Physical Sciences The minimum value of this function is the one which gives a value of zero for both of the partial derivatives with respect to each of the coefficients. We will ease the difficulty with notation by assuming that the uncertainty associated with each measurement is the same (this is usually a good approximation), so, for N measurements

These equations can be rearranged to yield a pair of simultaneous equations

since it must be true that the sum over N measurements of a constant is simply N times the constant. All that remains is to find the values of a and b which solve these equations. Those values of a and b will minimize the sum of the squares of the deviations of the data points from the calculated fit, i.e. the deviations between y_i and y(x_i). In so doing, we will maximize the probability that the line y(x_i) = a * x_i + b is the parent distribution from which our data, namely N measurements of (x_i, y_i), was derived. Hence, it is natural to say that the values of a and b which minimize chi-square are indeed the ``true'' values, a_0 and b_0, we referred to earlier.

It is trivial to use algebra to find the values of a and b that solve our simulataneous equations. Since we will soon run into harder problems, however, let's do the general method of using determinants (consult a book on linear algebra to get a general introduction to this method). The solutions are then given as

As might be expected, we estimate the uncertainty on these parameters by using the rule that the uncertainty squared for a parameter is equal to the sum of the squares of the standard deviation for each data point contributing to the parameter multiplied by the effect which that data point has on the determination of the parameter. Specifically, for our N measured data points, we have an uncertainty of sigma for each point. The effect of that point on the determination of a or b is given, respectively, by

where j is an index that can take on values from 1 to N. The formula for the uncertainty on the parameter b is then

By the same argument, the uncertainty in the slope, a, is

The uncertainty can be obtained from looking at the spread in the data. Namely, we can approximate this by observing the square of the deviation of the data set from the predicted behavior. We assert without proof that

Fitting Polynomials

If the general form of the function describing y for observed values of x is a polynomial, then we express it as a power-series, namely

y = a + bx + cx^2 + dx^3 + ...

The method of least squares allows us to find the values of the constants a, b, c, d, ... that give us a functional form that best fits our data y(x_i). The method is a trivial extension of that used to fit straight lines. In order to make a concrete example, we consider only terms up to second order for now. The chi-square we wish to minimize is

We next find and set to zero the derivative of the chi-square with respect to each coefficient:

If we isolate the first term of each of these equations, we find a certain symmetry:

The optimal values for the coefficients are now easily written in the form of determinants


The Grid Search Method


The Gradient Search Method


larryg@upenn5.hep.upenn.edu