Constructing Bar Charts and Histograms:
Bar charts are easy to construct if the underlying variable is nominal (e.g. the number of female or male meerkats who are babysitting at some time), since you simply let the height of each bar be the number of observations associated with each nominal value. There is no issue in this case with defining the width of the range covered by each bar. If the underlying variable is numerical but discrete
(for example, the number of tomatoes produced per plant must be an integer),
then the only decision to be made is how to lump together the possible outcomes
in deciding the range covered by each bar. In this case, generally you should
assign equal ranges to each bar and use 10-20 bars, assuming the underlying
data set has sufficient points to allow for more than 1-2 points per bar. For
example, if the number of tomatoes per plant varied from 1 to 50, you could use 10 bars each of width 5 (1-5, 6-10, ..., 46-50). Difficulties arise if the number of bars does not divide the range of the data evenly. There is no single solution to this: either one bar must be wider or narrower than the others, or the range covered by the bars must extend beyond the observations. The usual rule is to have one bar cover a wider range of values than the others.
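As a concrete illustration of this kind of binning, here is a minimal Python sketch; the tomato counts are hypothetical, invented purely for illustration:

    from collections import Counter

    # Hypothetical tomato counts per plant (invented for illustration).
    tomatoes = [3, 12, 7, 25, 48, 33, 5, 19, 41, 28, 16, 9, 37, 22, 50]

    # Ten bars of width 5 covering 1-5, 6-10, ..., 46-50.
    bar_width = 5
    counts = Counter((t - 1) // bar_width for t in tomatoes)

    for i in range(10):
        lo, hi = i * bar_width + 1, (i + 1) * bar_width
        print(f"{lo:2d}-{hi:2d}: {counts[i]}")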
Constructing a histogram for a continuous data set involves the following steps (a code sketch implementing them follows step (6)):
(1) Decide on the number of classes (bars) for the histogram. Typically choose 10-20 classes, but choose fewer for a small data set. A rule of thumb is that the number of observations divided by the number of classes should be at least 4.
(2) Choose the width of the classes (all widths will be the same) by dividing the range of the data (maximum value observed minus minimum value observed) by the number of classes from step (1), then rounding this value up to the same precision (number of decimal places) as the measurements. So if we are measuring body weights of black bears, the smallest weight is 16.0 kg and the largest weight is 118.0 kg, and we wish to use 10 classes, then the class width is (118.0 - 16.0)/10 = 10.2 kg.
(3) Select the smallest observed value as the lower limit of the first class (i.e. the left-hand value for the first class), and add multiples of the class width from step (2) to obtain the lower limits for the remaining classes. So in the black bear case, the lower class limits would be
16.0, 26.2, 36.4, 46.6, ..., 107.8
(4) Find the upper class limits (the right-hand value for each class) by adding the class width to the lower class limits and subtracting from this the precision of the observations (the smallest unit in which the data are recorded). So if the class width is 0.3 and the lower limit of the first class is 3.7 (and the finest precision of the original data was 0.1 unit), then the upper class limit for the first class is 3.7 + 0.3 - 0.1 = 3.9, and the next class would have lower class limit 4.0 (so that there is a gap between the upper class limit of one class and the lower class limit of the next class, and this gap is the precision of the original data). In the black bear case, since the finest precision of the original data is 0.1 kg, the upper class limits are
26.1, 36.3, 46.5, ..., 117.9
Note that the last class must go up through the highest value in the data set. Here 117.9 falls 0.1 kg short of the largest observation (118.0 kg), so extend the last class so that its upper limit is 118.0.
(5) For the boundaries between classes, use the value which is halfway between the upper class limit of one class and the lower class limit of the next class. So in the above example, the first class boundary would be 3.95, and the lowest class would be 3.7-3.95 with the next class being 3.95-4.25, and so on. In the black bear example, the class boundaries would be
26.15, 36.35, 46.55, ..., 107.75
with the last class extending from 107.75 up through the largest observation, 118.0 kg.
(6) Count the number of observations in each class and make the height of the bar for each class equal to this count. You can also construct a "relative frequency histogram" by dividing the height of each bar (the number of observations falling in that class) by the total number of observations (the sample size). For a relative frequency histogram, the sum of the heights of all the bars is one.
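To make the procedure concrete, here is a minimal Python sketch of steps (1)-(6). The bear weights are hypothetical, invented for illustration; only the minimum (16.0 kg) and maximum (118.0 kg) match the example in the text.

    import math

    # Hypothetical black bear weights in kg (invented for illustration).
    weights = [16.0, 23.5, 31.2, 45.0, 52.8, 58.1, 64.3, 70.9, 77.7,
               83.4, 90.0, 96.2, 101.5, 109.9, 114.2, 118.0]

    num_classes = 10        # step (1): 10-20 classes, fewer for small data sets
    precision = 0.1         # finest unit in which the data were recorded

    # Step (2): class width = range / number of classes, rounded up to the
    # precision of the data (10.2 kg here).
    width = math.ceil((max(weights) - min(weights)) / num_classes / precision) * precision

    # Step (3): lower class limits start at the smallest observation.
    lower = [round(min(weights) + i * width, 1) for i in range(num_classes)]

    # Step (4): upper limit = lower limit + width - precision; extend the last
    # class if needed so it reaches the largest observation.
    upper = [round(lo + width - precision, 1) for lo in lower]
    upper[-1] = max(upper[-1], max(weights))

    # Step (5): class boundaries, halfway between adjacent class limits.
    boundaries = [round((upper[i] + lower[i + 1]) / 2, 2)
                  for i in range(num_classes - 1)]

    # Step (6): count the observations falling in each class; dividing by the
    # sample size gives the heights for a relative frequency histogram.
    counts = [sum(lo <= w <= hi for w in weights) for lo, hi in zip(lower, upper)]
    rel_freq = [c / len(weights) for c in counts]

    for lo, hi, c in zip(lower, upper, counts):
        print(f"{lo:6.1f} - {hi:6.1f}: {c}")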
Linear Regression:
One objective in using descriptive statistics is to aid
in understanding whether there is a relationship between two measurements
associated with an observation (e.g. are leaf length and leaf width related,
are weights of bats related to their age since birth, is respiration rate of an
individual related to the temperature of the environment around that
individual). The simplest type of relationship that could exist between two measurements is proportionality (so that doubling one value leads to a doubling of the other). A slight extension of this is a linear relationship (e.g. if the measurements are L and W, then L = a*W + b, where a and b are constants); note that when b is not zero, doubling W no longer doubles L.
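A quick numeric check makes the distinction concrete; the constants a = 2.0 and b = 3.0 below are arbitrary, chosen only for illustration:

    a, b = 2.0, 3.0   # arbitrary constants for illustration

    def proportional(w):
        return a * w       # L = a*W: doubling W doubles L

    def linear(w):
        return a * w + b   # L = a*W + b: doubling W does not double L when b != 0

    print(proportional(2.0), proportional(4.0))   # 4.0  8.0  (doubled)
    print(linear(2.0), linear(4.0))               # 7.0 11.0  (not doubled)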
A regression is a formula that provides the "best
fit" of a particular mathematical equation to a set of data. A simple
regression occurs if there are only two variables being considered, and a
linear regression is one in which the relationship is assumed to be linear and
the "best line" which goes through the data on a standard x-y graph
is obtained. This is not meant to imply that one of the variables necessarily
"determines" the other one, but rather that any complex dependence
between them may be adequately summarized as a linear one. Strictly speaking, the term regression is used when there is reason to believe that there is a causal relationship between the measurements (e.g. variation in the weights of bats is due to differences in their ages, but variation in the ages of bats is in no way "caused" by their weights); otherwise we say there is a "correlation". Thus variation in leaf length is correlated with leaf width, but not "caused" by it.
The standard method to find the "best fit" line through a data set is called least squares: it finds the line which minimizes the sum of the squares of the differences between the height of the line (the predicted y-value) and the height (the observed y-value) of each data point, where this sum is taken over all data points. These differences are called residuals, so the best fit line minimizes the sum of the squared residuals. There are formulas for the slope and y-intercept of the best fit line (the regression line) that can be calculated from the original data points.
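With data points (x_i, y_i) and sample means xbar and ybar, these formulas are: slope a = sum((x_i - xbar)*(y_i - ybar)) / sum((x_i - xbar)^2) and y-intercept b = ybar - a*xbar. A minimal Python sketch of the calculation follows; the bat age and weight numbers are hypothetical, invented for illustration:

    # Least squares fit of the line y = a*x + b to a set of (x, y) data points.
    # Hypothetical data: bat ages in days (x) and weights in grams (y).
    x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
    y = [4.1, 6.0, 7.8, 10.1, 11.9, 14.2]

    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n

    # slope: a = sum((x_i - xbar) * (y_i - ybar)) / sum((x_i - xbar)^2)
    a = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    # intercept: b = ybar - a * xbar
    b = ybar - a * xbar

    # The residuals are the vertical differences whose squares the fit minimizes.
    residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
    print("sum of squared residuals:", sum(r * r for r in residuals))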