A great deal of science therefore depends on knowledge of the mathematical laws of probability and statistics. We have to estimate the amount of uncertainty in our measurements and take that uncertainty into account when calculating a final result. In many cases, we make measurements in order to relate experimental variables to some measureable result, i.e. we change a voltage or position or velocity which we can determine and we watch the system under study change in some measureable way. The desired result is to determine the functional form of the connection between what we changed and what we observe. On a graph, what we change would be specified in units along the x axis, for example, and what we observe would have its value specified along the y axis. Hence a measurement consists of a given (x,y) combination. We wish to find the form of the function, y(x), i.e. how y depends on x. Since we don't measure y exactly, we expect that several measurements of y for the same given value of x will give us some idea of the uncertainty on y for that point in x. In real-world cases, we will also have to worry about the fact that our knowledge of x is also not perfect.
The measured value of y(x) is said by mathematicians to be determined by the parent distribution which governs the probability of making any particular observation. In other words, there is a true functional form y(x) which tells us the true value of y for any given x. We do not see this ``true'' value, but we measure something close to it. The small errors mentioned earlier contrive to keep us from seeing what y really is. However, the more measurements of y we make for a given x, the more certain we are of the true value. It can be mathematically proven that, in general, an infinite number of measurements would give us exact knowledge of y. In practice, the ultimate limits set by quantum mechanics prevent us from ever having exact knowledge of any physical parameter, but, in the cases scientists usually deal with, we will never need to know exactly what the parent distribution is, we simply need a good approximation to it. The parent distribution describes what the shape of the distribution is for measurements of y at some particular value of x.
To get at the form of y(x), we take a sample of measurements or observations. This particular sample is just one of a great many that we ``extract'' from the parent distribution. Take another sample and we get a slightly different ``extraction'' which mimics the parent distribution. The parent distribution then determines the probability of making any particular set of observations for a sample.
How do we determine what the parent distribution is given a sample? The interesting thing is that physicists and mathematicians who work in a branch of physics called statistical mechanics have shown that the many small perturbations (what we referred to earlier as errors) that effect a physical system almost always force the measurement to follow the Gaussian distribution. This scattering of random numbers is so common, it's usually referred to as simply the normal distribution.Hence, our first guess for any measurement method should be that the measurement uncertainty follows a Gaussian distribution. If that's so, then we can follow a rather simple mathematical trail to formulating a procedure which is guaranteed to give us the most probable shape of the functional form for y(x). If we define the symbol y_i to be the measured value of y for some particular value of x made in observation i (defined as seeing y_i at x_i), then the probability P_i of having observed y_i is given by

The probability of observing any particular sample of N measurements of giving the y_i values seen is the product of the probabilities for getting any particular measurement value. So, the probability is

where the products and sums are taken over the measurements 1 to N.

Hence, to find the function which is the optimal fit to the data, we minimize this weighted sum of squares of deviations between the measurements and the predicted values for the measurements. The fit which gives the smallest sum of squares or the least-squares fit maximizes the probability that the measurements came from the parent distribution described by our fit. The chi-square also quantifies the goodness-of-fit, i.e. it gives a measure of how likely it is that any particular functional form we put in for y(x_i) fits the data, y_i.
To minimize the chi-square, we must take the derivative of it with respect to each parameter, a_j, set the derivatives equal to zero and simultaneously solve the set of equations we get. Another way of saying it is that setting a derivative of chi-square with respect to a particular parameter, a_j, equal to zero and solving minimizes the chi-square with respect to that parameter. We need to simultaneously minimize the chi-square with respect to each parameter to obtain the true minimum of least squares deviations. Hence, we solve the simultaneous equations

This is called the analytic method. In cases where the function y(x) is something like a Gaussian distribution, it is generally difficult to obtain a closed form solution for the minimization equations. Instead, we consider the chi-square function to be a continuous function of the parameters, a_j. This function can be visualized as a function in a multidimensional space where each independent parameter, a_j, defines a coordinate. What we look for is the position in this space which forms the lowest point for the surface. The coordinates of this position are the values of the parameters, a_j, that minimize the chi-square.
The difficulty with finding the lowest point is that we have to develop an algorithm which will search for it. The parameter space can be vast, hence it can take quite a long time for an algorithm to find the minimum. Also, many chi-square functions are complex enough to have several local minima, i.e. low points which are the lowest in a neighborhood, but are not the lowest point overall. Therefore, our algorithm must be clever if it is quickly find the true minimum and not be fooled by local minima. There are a number of algorithms that do this. They fall into several types. One type is a set of algorithms that simply search the parameter space in a clever way in order to find its way to the minimum. An example of this type is the grid search method. Another type of algorithm is one which uses approximate analytical methods which approximate the answer by finding an approximate closed-form solution for the simultaneous minimization equations. The approximation is improved by executing a prescribed set of mathematical steps. These methods are often faster than parameter space searching, but require more care in programming. An example is the gradient search method.
y(x) = a_0 * x + b_0
For any given value of x we measure, let's say we measure the value x_i on the i_th observation, we can find the probability, P_i for making the observed measurement y_i. As usual, we assume a Gaussian distribution with standard deviation of sigma_i and mean equal to the true value y(x_i), governs our observations. Then, as argued in the section above, we can write down the chi-square whose minimum represents the maximum probability for the probability of our particular data set given a particular parent distribution is

This approach is copied from Bevington's Data Reduction and Error Analysis for the Physical Sciences The minimum value of this function is the one which gives a value of zero for both of the partial derivatives with respect to each of the coefficients. We will ease the difficulty with notation by assuming that the uncertainty associated with each measurement is the same (this is usually a good approximation), so, for N measurements

These equations can be rearranged to yield a pair of simultaneous equations

since it must be true that the sum over N measurements of a constant is simply N times the constant. All that remains is to find the values of a and b which solve these equations. Those values of a and b will minimize the sum of the squares of the deviations of the data points from the calculated fit, i.e. the deviations between y_i and y(x_i). In so doing, we will maximize the probability that the line y(x_i) = a * x_i + b is the parent distribution from which our data, namely N measurements of (x_i, y_i), was derived. Hence, it is natural to say that the values of a and b which minimize chi-square are indeed the ``true'' values, a_0 and b_0, we referred to earlier.
It is trivial to use algebra to find the values of a and b that solve our simulataneous equations. Since we will soon run into harder problems, however, let's do the general method of using determinants (consult a book on linear algebra to get a general introduction to this method). The solutions are then given as

As might be expected, we estimate the uncertainty on these parameters by using the rule that the uncertainty squared for a parameter is equal to the sum of the squares of the standard deviation for each data point contributing to the parameter multiplied by the effect which that data point has on the determination of the parameter. Specifically, for our N measured data points, we have an uncertainty of sigma for each point. The effect of that point on the determination of a or b is given, respectively, by

where j is an index that can take on values from 1 to N. The formula for the uncertainty on the parameter b is then

By the same argument, the uncertainty in the slope, a, is

The uncertainty can be obtained from looking at the spread in the data. Namely, we can approximate this by observing the square of the deviation of the data set from the predicted behavior. We assert without proof that

y = a + bx + cx^2 + dx^3 + ...
The method of least squares allows us to find the
values of the constants a, b, c, d, ... that give us
a functional form that best fits our data y(x_i). The
method is a trivial extension of that used to fit
straight lines. In order to make a concrete example,
we consider only terms up to second order for now.
The chi-square we wish to minimize is

We next find and set to zero the derivative of the chi-square with respect to each coefficient:

If we isolate the first term of each of these equations,
we find a certain symmetry:

The optimal values for the coefficients are now easily
written in the form of determinants
