How to do a linear analysis of data:
The first step is to see if indeed this process makes any
sense at all. That is, do the data appear to be linearly related? This step is
usually accomplished by doing a "scatter plot" of the data. This
means plotting them as x-y points on a graph. The choice of ranges for the axes
matters here a great deal. A poor choice for the range of values on one of the
axes could lead to all the points falling over a very small portion of the
graph, making it impossible to tell whether there is any potential relationship
in the data set. Thus, choose your axes so that the range of them is
approximately the same as that in your data. So if you have a set of body
weights ranging from 5 grams to 160 grams, it would be reasonable to choose the
axis for body weight to run from 0 – 160 grams or so.
Which is the x (horizontal) axis and which is the y
(vertical)? If you have some reason to expect that one of the measurements is
the dependent one (e.g. body weight depends on age, rather than the reverse),
then choose the one that is dependent (weight) as the y axis and the
independent one (age) as the horizontal axis. If you do not have any reason to
expect that one of the measurements is caused by the other, then the choice of
which measurement is plotted on which axis doesn't matter. Just choose one and
go with it.
The next step is to eyeball the data and see if there
appears to be any relationship. If the scatter plot looks like the points might
be described at least approximately by a line (e.g. don't worry if there is a
lot of scatter about any line you might draw on the graph), then it is
reasonable to proceed with fitting a line. The next step then is eyeballing the
data and making a guess at a line without doing any calculations or using any
program. This will not give you an exact answer, but it will be useful later in
checking to see that the line that is obtained from a computer or other method
you choose makes sense. To eyeball, just quickly draw a line through the data,
eyeball a rough
slope and then write down the formula for the line using
the point-slope form, that is y - y1 = m * (x - x1) where
m is the slope and the point you have chosen is (x1,y1),
or use the point-intersept form if it can easily be estimated where the line crosses
the y-axis y = m * x + b where b
is the y-intersept. Be sure in
doing this that you keep your units straight, so that you know what units each
measurement, and thus the slope, is in.
What if the data do not appear to be linearly related?
Then don't try to fit a straight line. Later we'll discuss ways to transform
the data to see if a different relationship might be a better choice.
If it looks like a linear relationship is a reasonable
assumption, the next step is to find the "least squares fit". This
can be done automatically within Matlab, and many calculators can also do this.
The idea is to choose a line that "best fits" the data in that the
data points have the minimum sum of their vertical distances from this
particular line. One way that you might code a computer to do this is:
(i)
Guess at the equation for the line - y = a*x + b.
(ii)
Measure the vertical distance from each data point to the line
you have chosen.
(iii)
Sum up the distances chosen in (ii).
(iv)
Change the slope a and the intercept b to see if you can
reduce the sum obtained in (iii).
(v)
Continue this until you get tired or you've done the best you
can.
One of the potential problems with the above is what
"distance" to use in (ii). If you allow some distances to be positive
and some negative (e.g. for a point above the line and a point below the line),
then the distances can cancel, which is not appropriate. So we want all
distances to be positive and the standard way to do this is the just square
each vertical distance found in (ii), and produce the sum of the square
distances to get (iii). The line we would get by going through step (v) would
then be called the "least squares fit" since it chooses the line so
as to minimize the sum of the square deviations of points from the line.
It turns out that it is not necessary to go through the
steps (1)-(v) above at all. It can be proven that the "best" values
of the slope a and the intercept b can be obtained from a relatively simple
formula that just uses the x and y values for all the data points. Matlab does
this easily for you using the command "C=polyfit(A,W,1)" which will
produce a vector C in which the first value is the best fit for the slope a and
the second is the best fit for the intercept b for the least squares fit of the
vector of data W (on the vertical axis) to the vector of data A (on the
horizontal axis). Think of A as giving a vector of ages (in days) of bats and
the vector W giving the weights of these bats (in grams). Note that the units
of the components of C depend upon the units the data are measured in. The
first component of C is a slope so it has units grams per day for the bat
example, and the second component of C (the y-intercept) has the same units as
the measurement on the y-axis (grams in the case of the bats).
Once you have a linear least squares fitted line, you can
proceed to use it to interpolate (find the y-value predicted by the linear fit
for an x-value that falls in the range of the x-values in your data set), or to
extrapolate (find the y-value predicted by the linear fit for an x-value that
falls outside the range of the x-values in your data set). Thus if you have
values for body weights W (in grams) and ages of bats A (in days) 9, 15, 20,
22, 34, 44 and 49, and it appears that a linear fit to the data is reasonable,
you can interpolate to find the weight of a bat of age 30 days, or you can
extrapolate to find the weight of a bat of age 60 days. All you do to find
these weights is to plug the age into the equation of the line and calculate
the associated y-value (weight in grams). Matlab makes this easy by using the
command "Y=polyval(C,30)" which will give the best guess according to
the linear fit for the weight of a bat of age 30 days.
How can we tell if a linear fit is any good?
This is where we make use of the notion of a correlation.
In common parlance, we say two measurements are "correlated" if there
appears to be some relationship between them, though this relationship need not
be causal. Thus leaf length and width might be related to each other, but
neither is caused by the other. They might simply be related due to the age of
the leaf or the environmental
conditions under which the leaf developed (e.g. better nutrients and
water could lead to a larger leaf).
There is a formal definition of correlation that we will
use which essentially tells how close to linearly related two measurements are.
Note that this is restricted to being a measure for linear relationships. If
two measurements are related, but not linearly, then the correlation we
estimate may not imply the measurements are closely related when they actually
are. For example, human body weight is certainly related to age as an
individual grows, but growth is not
linear at all and so a correlation may not be the best
way of saying that these two variables are related.
Correlation is measured by the "correlation
coefficient" for which the small Greek letter rho (r) is
typically used. The calculation of r follows easily from the x and y
values of the data set. The coefficient r is a measure of the strength
of the relationship between the two measurements, scaled in such a way that if
two measurements fell exactly on a straight line with positive slope, then r = 1,
while if they fell exactly on a line with negative slope, r =
-1. In these cases we say the data are "perfectly positively
correlated" or "perfectly negatively correlated". If r = 0
we say the data are "uncorrelated" but again this doesn't mean the
data are not related - for example if the data fell on a parabola y = (x-1)2
for values of x between 0 and 2, the correlation would be near zero but the
data are certainly closely related.
Again Matlab makes it easy to calculate the correlation
coefficient of two vectors using "corrcoef(A,W)" which computes the
correlation coefficient of the data given in the two vectors A and W. Many
calculators will compute this as well. Note that the correlation coefficient is
a dimensionless number - in calculating it the dimensions of the measurements
cancel out.