Fitting a Line to (x,y) Data

Beyond Ordinary Least Squares [OLS]

Historical detour: origin of OLS in astrometry

Ushered in by Tycho Brahe (1546-1601), the era of accurate astrometry (measurement of the positions of things in the sky) was well underway by the 1700s. In 1740 the astronomer Jacques Cassini compiled the list below of observations of the tilt of the Earth's equator compared to the plane of the Earth's orbit about the Sun (the "obliquity of the ecliptic"). The list seems to demonstrate that the obliquity has slowly decreased from almost 24° to about 23½° during the nearly 2,000 years covered.

  Date          Obliquity

   140 B.C.      23.853°
   140           23.856
   390 A.D.      23.500
   880           23.583
  1070           23.567
  1300           23.533
  1460           23.500
  1500           23.473
  1500           23.488
  1570           23.499
  1570           23.525
  1600           23.517
  1656           23.484
  1672           23.482
  1738           23.472

Data like this presented a problem for the scientists of the day.

On one hand, given the well-known difficulties of accurate measurement, it came as no surprise that different astronomers found slightly different results for the same quantity. For example, Cassini's list has two obliquity measurements for 1570: Tycho's and Danti's; they differ by 0.026°.

On the other hand, mathematically working with inexact numbers is a problem. It is well known that given two distinct data points (x_i, y_i), i = 1, 2, there is exactly one line (given by the formula y = a + b x) that passes exactly through both points. (In this case a and b can be determined by simultaneously solving the two equations for the unknowns a and b in terms of the known, constant values (x_i, y_i).) However, given three or more data points it is highly unlikely that a single line will go through all of them. (Such problems are said to be "over-determined".) To a mathematician, missing by an inch is the same as missing by a mile: "missing" turns equations into non-equations. There are no values of a and b that allow y_i = a + b x_i for all i. SO: two known data pairs [equations] (x_i, y_i) and two unknowns a and b; result: success! Three known data pairs (x_i, y_i) and two unknowns a and b; result: failure!
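
A concrete illustration (my own arithmetic sketch in Python, not part of Cassini's analysis): two of the data pairs above determine a line exactly, but a third pair generally breaks the exact fit.

  # Two data pairs (x1,y1), (x2,y2) determine the line y = a + b*x exactly.
  x1, y1 = 140.0, 23.856     # the "140" entry from the table above
  x2, y2 = 390.0, 23.500     # the "390 A.D." entry

  b = (y2 - y1) / (x2 - x1)  # slope from solving the two equations
  a = y1 - b * x1            # intercept

  # A third pair over-determines the system: the same line misses it.
  x3, y3 = 880.0, 23.583
  print(a + b * x3 - y3)     # nonzero "error" -- no single line fits all three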

Clearly it is impossible (or at best a waste of effort) to find mathematical equations that exactly reproduce inexact results. Thus folks tried to find a "good" equation that came "close" to the necessarily inexact measurements (much as an average value comes "close" to as much as possible of the data). A solution was published in 1805 by Adrien Marie Legendre in his book on determining the orbits of comets. He correctly noted that there is arbitrariness in the choice of the best equation, but proposed a solution:

Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply, than that which we have used in this work; it consists of making the sum of the squares of the errors [deviation of measurement from equation] a minimum. By this method, a kind of equilibrium is established among the errors which, since it prevents the extremes from dominating, is appropriate for revealing the state of the system which most nearly approaches the truth.
Thus when we can't make

"error"_i = y_i - (a + b x_i)

zero for all the data, we compromise so that some "errors" are positive and others negative, but the sum of the squares of the "errors" over all the data pairs is as small as possible.
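
In modern terms this is a small computation; the sketch below (my own Python and variable names, not the page's fitting tool) applies the usual closed-form least-squares solution to Cassini's table.

  import numpy as np

  def ols_fit(x, y):
      """Return (a, b) minimizing sum_i (y_i - (a + b*x_i))**2."""
      x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
      b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
      a = y.mean() - b * x.mean()
      return a, b

  # Cassini's dates (B.C. entered as negative; the second 140 assumed A.D.)
  dates = [-140, 140, 390, 880, 1070, 1300, 1460, 1500, 1500,
           1570, 1570, 1600, 1656, 1672, 1738]
  obliq = [23.853, 23.856, 23.500, 23.583, 23.567, 23.533, 23.500,
           23.473, 23.488, 23.499, 23.525, 23.517, 23.484, 23.482, 23.472]

  a, b = ols_fit(dates, obliq)
  print("obliquity ~ %.3f + (%.2e) * year" % (a, b))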

In his 1809 book on planetary orbits Carl Friedrich Gauss published the method of least squares and in addition provided a hint as to why least-squares is preferred if errors are distributed "normally" (i.e., with a Gaussian distribution). (He also reported that he had been using least squares since 1795, thus initiating a priority dispute with Legendre.)

By 1810 another great mathematician and physicist, Pierre Simon Laplace, had endorsed the least squares method, and within the next decade this simple method became the standard method of analyzing data in astronomy. (Interestingly, it took nearly a century for the technique to find widespread use in biology and the social sciences.)

Other options besides OLS

While the method of least squares is simple and in many cases the proper method to apply, it is by no means the only proper way to fit lines to data. For example, why focus on minimizing the square of the "error"? Instead, one might minimize the absolute value of the "error". The method of least squares also puts x and y on decidedly unequal footings: "error" means vertical (y) deviation rather than horizontal (x) deviation. This is appropriate if the x values are exactly set while the y values have measurement error. However, in the Cassini data the y measurement error seems to be about 0.01°, whereas the date (x) of the early measurements is uncertain by decades. Which is bigger: 10 years or 0.01°? (If you provided an answer to my rhetorical question, would your answer change if I made the comparison 1 decade versus 36 arc-seconds?) If we don't have the problem of different units for x and y, one reasonable compromise would be to count as "error" the shortest distance from the data point to the line.

Thus one might distinguish a long list of possible fitting methods (a few of which are sketched in code after the list):

  1. "ordinary" least squares (where "error" is just the y deviation)
  2. least squares, but where "error" is just the x deviation
  3. the "average" of the above two: the line that bisects the above two lines (this only makes sense if x and y have the same units)
  4. least squares, but where "error" is the shortest distance between the point and the line (this only makes sense if x and y have the same units)
  5. "ordinary" least absolute deviation (where we minimize the sum of the absolute value of the "error", i.e., the y deviation)
  6. least absolute deviation where "error" is just the x deviation
  7. the "average" of the above two
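
Minimal Python sketches of a few of these (methods 1, 2, 4, and 5; my own code, not the fitting tool linked below; x and y are assumed to be numpy arrays of equal length):

  import numpy as np
  from scipy.optimize import minimize

  def ols_y(x, y):                  # 1. minimize squared vertical (y) deviations
      b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
      return y.mean() - b * x.mean(), b

  def ols_x(x, y):                  # 2. minimize squared horizontal (x) deviations
      a, b = ols_y(y, x)            # regress x on y: x = a + b*y, then invert
      return -a / b, 1.0 / b

  def perpendicular_ls(x, y):       # 4. minimize squared point-to-line distances
      X = np.column_stack([x - x.mean(), y - y.mean()])
      _, _, vt = np.linalg.svd(X)   # first principal direction of the scatter
      dx, dy = vt[0]
      b = dy / dx
      return y.mean() - b * x.mean(), b

  def lad_y(x, y):                  # 5. minimize the sum of |y deviations|
      a0, b0 = ols_y(x, y)          # start the search from the OLS answer
      cost = lambda p: np.sum(np.abs(y - p[0] - p[1] * x))
      return minimize(cost, x0=[a0, b0], method="Nelder-Mead").x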

Click here to enter data via copy & paste and then do all the above fits and display four trendlines

Here are PostScript, PDF, SVG, and text files showing the results of these fits to Cassini's data. Our current understanding of how the obliquity changed (along with Cassini's data) is displayed here... Upshot: the slope is shallower than all of the fits suggest, but essentially within the errors of OLS(y).

In more recent history (ApJ 364, 104 (1990) & ApJ 397, 55 (1992)), statisticians and astronomers have provided detailed guidance as to which method to choose, which I greatly oversimplify below.

Chi-square

In discussing orbit calculations for the comet of 1680, Newton caught the gist of the truth. He wrote:

"From all this it is plain that these observations agree with theory, so far as they agree with one another."

If different folks (correctly) measure the same quantity and get different results, the (standard) deviation of their results tells us the accuracy of the measurement. If theory lies within a few standard deviations of all the measurements, we must count that as a confirmation of theory.

Now we come to the actual use Cassini made of his data. He argued (1) from his own experience measuring the obliquity (in 1656) and (2) from the near consistency of measurements made at nearly the same time, that the measurement errors for the obliquity were in fact much smaller than the deviations between the points and the line. Thus, while the line has a very small P, it misses the data points by unbelievably large amounts: many times the measurement errors. Hence, he argued, the obliquity is not decreasing uniformly.

The main point here is that without some idea of the measurement errors, we cannot say whether or not the line comes adequately close to the data points. A small P is no guarantee that a least-squares line represents the actual behavior.

Unfortunately the best way to know your measurement errors is to repeat the experiment 10 times and take the standard deviation of the resulting measurements. Thus experiments that properly report the uncertainty of their results are 10 times harder than those that don't. Hence the old saying that physicists are judged more for their uncertainties than for their values.

Equipment manufacturers' specifications often report expected errors of measurement; these calibration errors are almost never relevant to the statistical measurement errors discussed here. Sorry: no shortcuts in real life. Elementary lab periods rarely have the time to allow 10 repeats, so they often use extremely crude estimates of measurement errors (like equipment manufacturers' specifications). Sorry: follow the methods your physics instructor uses when he actually does physics, not what he preaches in introductory lab.

SO if we really want a meaningful P, in addition to our (x_i, y_i) data we must also enter some sort of tolerance that reports how close the fitted line should be to the data. In simple cases that might be an overall summary tolerance: say, that the fitted curve should be within 0.01° of the measured obliquity. (Of course Cassini knew that errors were larger in the distant past, before telescopes and other accurate measuring instruments; he should have entered smaller tolerances for the more recent dates.)
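
As a rough sketch of what such a fit does with tolerances (my own notation and code, not necessarily what the linked tool computes): the "chi-square" statistic counts each miss in units of its tolerance sigma_i, and the fit chooses a and b to make that total as small as possible.

  import numpy as np

  def chi_square(x, y, sigma, a, b):
      """Sum of squared misses, each measured in units of its tolerance."""
      return np.sum(((y - (a + b * x)) / sigma) ** 2)

  def weighted_fit(x, y, sigma):
      """Weighted least squares: np.polyfit with weights 1/sigma minimizes
      sum ((y_i - a - b*x_i)/sigma_i)**2, i.e. chi-square, directly."""
      b, a = np.polyfit(x, y, 1, w=1.0 / sigma)
      return a, b

A fit that is "adequately close" in this sense has a chi-square comparable to N - 2 (the number of data points minus the two fitted parameters); a chi-square many times larger is the situation Cassini described.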

Click here for more on this sort of fit

Finally, if we really want to do a good job, we should recognize that there are errors in both x and y.
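
One standard approach when both coordinates have errors is orthogonal distance regression; the sketch below uses scipy.odr as an illustration (WAPP's own algorithm may differ, and sx, sy here are assumed per-point x and y uncertainties):

  import numpy as np
  from scipy import odr

  def line(p, x):                   # model: y = a + b*x with p = [a, b]
      a, b = p
      return a + b * x

  def fit_with_xy_errors(x, y, sx, sy):
      data = odr.RealData(x, y, sx=sx, sy=sy)
      b0, a0 = np.polyfit(x, y, 1)  # ordinary fit as a crude starting guess
      result = odr.ODR(data, odr.Model(line), beta0=[a0, b0]).run()
      return result.beta            # fitted [a, b]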

Click here for WAPP: data entry with error in both coordinates possible