Statistics, Descriptive
I. Location and Dispersion: Hans Kellerer
II. Association: Robert H. Somers
I. LOCATION AND DISPERSION
A basic statistical need is that of describing a set of observations in terms of a few calculated quantities—descriptive statistics—that express compactly the most salient features of the observational material. Some common descriptive statistics are the sample average, the median, the standard deviation, and the correlation coefficient.
Of course, one is also interested in corresponding descriptive quantities for the underlying population from which the sample of observations was drawn; these population descriptive statistics may usually be thought of as sample descriptive statistics for very large hypothetical samples, so large that sampling variability becomes negligible.
The present article deals with descriptive statistics relating to location or position (for instance, the sample average or the median) and to dispersion or variability (for instance, the standard deviation). The accompanying article deals with descriptive statistics for aspects of association between two or more statistical variates (for instance, the correlation coefficient).
Most, although not all, descriptive statistics for location deal with a generalized mean—that is, some function of the observations satisfying intuitive restrictions of the following kind: (a) a generalized mean must take values between the lowest and highest observations; (b) it must be unchanged under reorderings of the observations; and (c) if all the observations are equal, the generalized mean must have their common value. There are many possible generalized means; those selected for discussion here have useful interpretations, are computationally reasonable, and have a tradition of use.
Descriptive statistics of dispersion supply information about the scatter of individual observations. Such statistics are usually constructed so that they become larger as the sample becomes less homogeneous.
Descriptive statistics of location
Generalized means. An important family of location measures represents the so-called central tendency of a set of observations in one of various senses. Suppose the observations are denoted by x₁, x₂, ..., xₙ. Then the ordinary average or arithmetic mean is

(1) x̄ = (1/n) ∑ xᵢ.

If, however, a function, f, is defined and the average of the f(xᵢ)'s is considered, then an associated generalized mean, M, is defined by

f(M) = (1/n) ∑ f(xᵢ).

The summation is from 1 to n, and f has the same meaning on both sides of the defining equation. For the arithmetic mean, f is the identity function. For the geometric mean (when all xᵢ's are positive), f is the logarithmic function—that is, log M = (1/n) ∑ log xᵢ, so that

M = (x₁x₂ ⋯ xₙ)^(1/n),

the nth root of the product of the observations.
For this procedure to make sense, f must provide a one-to-one relationship between possible values of the xᵢ's and possible values of the f(xᵢ)'s. Sometimes special conventions are necessary.
For any such generalized mean, the three intuitive restrictions listed earlier are clearly satisfied when f is monotone increasing. In addition, a change in any single xᵢ, with the others fixed, changes the value of M. Four of the many generalized means having these properties are listed in Table 1. Since the quadratic mean is important primarily in measuring dispersion, only the first three means listed in Table 1 will be discussed in detail.
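The defining relation f(M) = (1/n) ∑ f(xᵢ) translates directly into a short computation. The following sketch, in Python, is purely illustrative (the data and function names are not part of the article's apparatus); it evaluates all four means of Table 1 for three observations:

```python
import math

def generalized_mean(xs, f, f_inv):
    """The value M satisfying f(M) = (1/n) * sum of the f(x_i)."""
    return f_inv(sum(f(x) for x in xs) / len(xs))

xs = [2.0, 4.0, 8.0]   # illustrative positive observations

arithmetic = generalized_mean(xs, lambda x: x, lambda y: y)          # 4.67
geometric  = generalized_mean(xs, math.log, math.exp)                # 4.0 (needs x > 0)
harmonic   = generalized_mean(xs, lambda x: 1 / x, lambda y: 1 / y)  # 3.43
quadratic  = generalized_mean(xs, lambda x: x * x, math.sqrt)        # 5.29
```

For these data the ordering harmonic ≤ geometric ≤ arithmetic, noted below under "Further comments on mean values," is already visible.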
The arithmetic mean. The arithmetic mean is perhaps the most common of all location statistics because of its clear meaning and ease of computation. It is usually denoted by placing a bar above the generic symbol describing the observations: thus x̄ = (1/n) ∑ xᵢ, ā = (1/n) ∑ aᵢ, etc. The population analogue of the arithmetic mean is simply the expectation of a random variable describing the population, that is, E(X), if X is the random variable. [See Probability, article on Formal Probability.]
If the xᵢ's represent, for example, the wages received by the ith individual in a group, then ∑xᵢ represents the total wages received, and x̄, the arithmetic mean or ordinary average, represents the wages that would have been received by each person if everyone in the group had received the same amount.
The major formal properties of the arithmetic mean are:
(a) The sum of the deviations of the xᵢ's from x̄ is zero: ∑(xᵢ − x̄) = 0.
(b) The sum of squared deviations of the xᵢ's from x, considered as a function of x, is minimized by x̄: ∑(xᵢ − x̄)² ≤ ∑(xᵢ − x)² for all x.
(c) If a and b are any two numbers, and if one sets yᵢ = axᵢ + b, then ȳ = ax̄ + b, and x̄ = (ȳ − b)/a when a ≠ 0. This linear invariance of the arithmetic mean is the basis for so-called coding, changing the origin and scale of the observations for computational convenience. For example,
Table 1 — Four important generalized means

| | f(x) | Equation for the mean value, M |
|---|---|---|
| Arithmetic mean | x | M = (1/n) ∑ xᵢ |
| Geometric mean | log x | log M = (1/n) ∑ log xᵢ |
| Harmonic mean | 1/x | 1/M = (1/n) ∑ (1/xᵢ) |
| Quadratic mean | x² | M² = (1/n) ∑ xᵢ² |
(i) instead of finding the average of the xᵢ's 110, 125, 145, 190, 210, and 300 directly, it is simpler to subtract 100 from each number and multiply by 1/5 (yᵢ = (xᵢ − 100)/5). After finding ȳ (16 for the xᵢ's listed here), it is necessary only to reverse the coding process to obtain x̄ = 5ȳ + 100 (180 in this case). (ii) Suppose the xᵢ's are divided into m subgroups, where the jth subgroup includes nⱼ of the xᵢ's (∑nⱼ = n) and has arithmetic mean x̄ⱼ. Then x̄, the arithmetic mean for all the xᵢ's, is

(2) x̄ = (1/n) ∑ nⱼx̄ⱼ.
When n is large, it is often advisable to summarize the original observations in a frequency table. Table 2 shows how x̄ is calculated in this case.
Table 2 — Observations summarized in a frequency table

| Class (j) | Class interval | Class mark wⱼ | Frequency fⱼ | wⱼfⱼ |
|---|---|---|---|---|
| 1 | 10 ≤ x < 15 | 12.5 | 3 | 37.5 |
| 2 | 15 ≤ x < 20 | 17.5 | 22 | 385.0 |
| 3 | 20 ≤ x < 25 | 22.5 | 38 | 855.0 |
| 4 | 25 ≤ x < 30 | 27.5 | 29 | 797.5 |
| 5 | 30 ≤ x < 35 | 32.5 | 8 | 260.0 |
| Totals | | | 100 | 2,335.0 |
The table shows, for example, that 22 of the 100 xᵢ lie in the interval from 15 to 20. By eq. (2),

x̄ = (1/100) ∑ fⱼx̄ⱼ,

where fⱼ is the frequency and x̄ⱼ the (unknown) arithmetic mean of the observations in class j. The numerator may be determined approximately by resorting to the "hypothesis of the class mark": the average value of the observations in a class is the class mark—that is, x̄ⱼ = wⱼ for all j. Up to the accuracy of the approximation this yields

(3) x̄ = (1/n) ∑ wⱼfⱼ.

The summation is now from 1 to m (m = the number of classes). For the data in Table 2, x̄ = 2,335/100 = 23.35.
A coding transformation may be employed for easier computation:

yⱼ = (wⱼ − x₀)/c, so that x̄ = cȳ + x₀.

The numbers x₀ and c should be chosen for convenience in computation. In the example, one might take x₀ = 22.5 (the class mark of the central class) and c = 5 (the class width). This transformation
Table 3 — Frequency table with coded observations

| Class (j) | yⱼ = (wⱼ − 22.5)/5 | fⱼ | yⱼfⱼ |
|---|---|---|---|
| 1 | −2 | 3 | −6 |
| 2 | −1 | 22 | −22 |
| 3 | 0 | 38 | 0 |
| 4 | 1 | 29 | 29 |
| 5 | 2 | 8 | 16 |
| Totals | | 100 | 17 |
yields Table 3, from which x̄ = 5(17/100) + 22.5 = 23.35, as before.
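As a check on the arithmetic just performed, the grouped-mean computation can be sketched in a few lines of Python (illustrative only; coding was a device for hand computation, and a machine gains nothing from it, but the sketch exhibits the linear invariance that justifies it):

```python
# Class marks and frequencies from Table 2
marks = [12.5, 17.5, 22.5, 27.5, 32.5]
freqs = [3, 22, 38, 29, 8]
n = sum(freqs)

# Direct computation under the hypothesis of the class mark, formula (3)
x_bar = sum(w * f for w, f in zip(marks, freqs)) / n               # 23.35

# The same computation after coding, y_j = (w_j - x0) / c
x0, c = 22.5, 5.0
y_bar = sum((w - x0) / c * f for w, f in zip(marks, freqs)) / n    # 0.17
print(x_bar, c * y_bar + x0)   # both print 23.35 (up to rounding)
```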
It often happens that the first or last class is "open"—that is, the lowermost or uppermost class boundary is unknown. If there are only a few elements in the open classes, a reasonable choice of class marks may often be made without fear that the arithmetic mean will be much affected. [Considerations about choice of class width and position, and about the errors incurred in adopting the hypothesis of the class mark, are given in Statistical Analysis, Special Problems of, article on Grouped Observations.]
Weighted averages, of the form ∑aᵢxᵢ/∑aᵢ, are often employed, especially in the formation of index numbers. These bear a clear formal relationship to the arithmetic means computed by formula (3) [see Index Numbers].
The geometric mean. One of the subtle problems that concerned thinkers in the Middle Ages was the following: The true, unknown value of a horse is, say, 100 units of money. Two persons, A and B, estimate this unknown value. A's estimate is 10 units, and B's is 1,000. Which is the worse estimate? There are two objections to the answer "B's estimate is worse, because he is off by 900 units, whereas A is only 90 units off": first, the available room for estimating downward is only 100 units, because negative units do not apply, whereas the room for estimating upward (above 100 units) is unlimited; second, since A's estimate is 1/10 of the true value and B's is 10 times the true value, both estimates are equally poor, relatively speaking. The two errors balance out on the multiplicative average, for 0.1 × 10 = 1.
An example will illustrate the use of the geometric mean: The number of television viewers in a certain area was 50 on January 1, 1960, and 4,050 on January 1, 1964. It might be in order to assume the same relative increase (by a factor of r) for each of the four years involved. Table 4 can then be drawn up. Under the assumption of constant relative increase per year, 50r⁴ = 4,050, so that r⁴ = 81 and r = 3. The corresponding imputed values are given in parentheses in the last
Table 4 — Television viewers in a certain areaᵃ

| Date (Jan. 1) | Observed number | Assumed formula | Observed and imputedᵇ |
|---|---|---|---|
| 1960 | 50 | 50 | 50 |
| 1961 | | 50r | (150) |
| 1962 | | 50r² | (450) |
| 1963 | | 50r³ | (1,350) |
| 1964 | 4,050 | 50r⁴ | 4,050 |

a. Hypothetical data.
b. Imputed values are shown in parentheses.
column of Table 4. Note that the square root of the product of the initial and final numbers, √(50 × 4,050) = √202,500 = 450, is the imputed value at the middle year.
The example indicates that the geometric mean may be appropriate in instances of relative change. This is the situation, for example, in several index problems [see Index Numbers]. Here multiplication takes the place of the addition employed in the arithmetic mean; hence, the procedure is to extract roots instead of to divide. One justification for the term "geometric mean" comes from considering the geometric sequence 1, 2, 4, 8, 16, 32, 64; the central element, 8, is the same as the geometric mean. Calculating the geometric mean makes sense only when each original value is greater than zero. If the data are grouped as in Table 2, the corresponding geometric mean is given by

log M = (1/n) ∑ fⱼ log wⱼ.
The geometric mean of n fractions, xᵢ/yᵢ, i = 1, ..., n, is equal to the quotient of the geometric mean of the xᵢ's divided by the geometric mean of the yᵢ's. If X is a positive random variable that describes the underlying population, then the population geometric mean is 10^(E(log X)), where logarithms are taken to base 10.
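The television example can be verified mechanically. A brief Python fragment (variable names are illustrative) recovers the growth factor r, the imputed values of Table 4, and the fact that r is the geometric mean of the year-to-year growth factors:

```python
import math

viewers_1960, viewers_1964 = 50, 4050
years = 4

# Constant relative growth: r satisfies 50 * r**4 = 4050
r = (viewers_1964 / viewers_1960) ** (1 / years)            # 3.0

imputed = [viewers_1960 * r ** k for k in range(years + 1)]
print([round(v) for v in imputed])                          # [50, 150, 450, 1350, 4050]

# r is also the geometric mean of the year-to-year growth factors
factors = [b / a for a, b in zip(imputed, imputed[1:])]
g = math.exp(sum(map(math.log, factors)) / len(factors))    # 3.0
```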
The harmonic mean. The feature common to the following examples is the employment of reciprocals of the original values.
Example 1: A man travels 2 kilometers, traveling the first at the speed x₁ = 10 km/hr and the second at the speed x₂ = 20 km/hr. What is the "average" speed? The answer (10 + 20)/2 = 15 would usually be misleading, for the man travels a longer time at the lower speed, 10 km/hr, than at the higher speed, 20 km/hr—that is, 1/10 as against 1/20 of an hour. Since speed is defined as distance/time, a better average might be the harmonic mean

M = 2/(1/10 + 1/20) = 13⅓ km/hr.

The point is that speed averaged over time must satisfy the relationship average speed × total time = total distance, and only the harmonic average does this (when the individual distances are equal).
The example may be generalized as follows: At first a man travels 60 km at a speed of 15 km/hr and then 90 km at a speed of 45 km/hr. The corresponding weighted harmonic mean is

(60 + 90)/(60/15 + 90/45) = 150/6 = 25 km/hr.
Example 2: A market research institute wants to determine the average daily consumption of razor blades by polling 500 persons. Experience has shown that it is better to ask, "How many days does a razor blade last you?" because that is how people usually think. The results of the poll are shown in Table 5, where xᵢ denotes the number of
Table 5 — Useful life of razor bladesᵃ

| Number of days xᵢ | Number of persons fᵢ | Total consumption per day (1/xᵢ)fᵢ |
|---|---|---|
| 2 | 100 | 50 |
| 3 | 150 | 50 |
| 4 | 200 | 50 |
| 5 | 50 | 10 |
| Totals | 500 | 160 |

a. Hypothetical data.
days a razor blade lasts. Column 3 indicates that the 100 persons in the first group use ½ × 100 = 50 blades a day. The average consumption per person per day is

160/500 = 0.32 blades,

the reciprocal of the weighted harmonic mean of the useful life, 500/160 = 3.125 days.
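Both harmonic-mean examples reduce to a few lines. The following Python sketch (data taken from the examples above) reproduces the 13⅓ km/hr, 25 km/hr, and 0.32 blades-per-day results:

```python
# Example 1: equal distances at 10 and 20 km/hr
speeds = [10, 20]
print(len(speeds) / sum(1 / v for v in speeds))              # 13.33... km/hr

# Weighted version: 60 km at 15 km/hr, then 90 km at 45 km/hr
dists, vels = [60, 90], [15, 45]
print(sum(dists) / sum(d / v for d, v in zip(dists, vels)))  # 25.0 km/hr

# Example 2 (Table 5): average blade consumption per person per day
days   = [2, 3, 4, 5]
people = [100, 150, 200, 50]
blades_per_day = sum(f / x for x, f in zip(days, people))    # 160.0
print(blades_per_day / sum(people))                          # 0.32
```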
Other descriptive statistics of location. There are several location measures that are not conveniently described in the form f(M) = (1/n) ∑ f(xᵢ). Most of these other location measures are based on the order statistics of the sample [see Nonparametric Statistics, article on Order Statistics].
The median. If a set of n observations (n odd) with no ties is arranged in algebraic order, the median of the observations is defined as the middle member of the ordering. For example, the median, Me(x), of the (ordered) xᵢ's

2, 17, 19, 23, 38, 47, 98

is 23. If n is even, it is conventional, although arbitrary, to take as median the average of the two middle observations; for example, the conventional median of

2, 17, 19, 23, 38, 47

is (19 + 23)/2 = 21. The same definitions apply even when ties are present. Thus, the median cuts the set of observations in half according to order.
Unlike the descriptive statistics discussed earlier, the median is unaffected by changes in most individual observations, provided that the changes are not too large. For example, in the first set of numbers above, 2 could be changed to any other number not greater than 23 without affecting the median. Thus, the median is less sensitive to outliers than is the arithmetic mean [see Statistical Analysis, Special Problems of, article on Outliers]. On the other hand, the median is still a symmetric function of all the observations, and not, as is sometimes said, a function of only one observation.
The median minimizes the sum of absolute residuals—that is, ∑|xᵢ − x| is minimized by x = Me(x).
A disadvantage of the median is that if the x's are divided into subgroups, one cannot in general compute Me(x) from the medians and sizes of the subgroups.
For a population described by a random variable X the median is any number, Med, such that

Pr{X < Med} ≤ ½ ≤ Pr{X ≤ Med}.

In general, Med is not uniquely defined, although it is when X has a probability density function that is positive near its middle [see Probability]; then Med has a clear interpretation via

Pr{X < Med} = Pr{X ≤ Med} = Pr{X > Med} = Pr{X ≥ Med} = ½.

(See Figure 1.)
Table 6 shows how the value of Me is determined approximately for grouped observations. The first two columns might represent the frequency distribution of the test scores achieved by 141 subjects. With this as a basis column 3 is derived. The figures in column 3 are found by continued addition (accumulation) of the frequencies given in
Table 6 — Frequencies and cumulative frequencies for approximate determination of the median

| Class interval | Frequency | Cumulative frequency |
|---|---|---|
| x < 20 | 4 | 4 |
| 20 ≤ x < 30 | 17 | 21 |
| 30 ≤ x < 40 | 38 | 59 |
| 40 ≤ x < 60 | 49 | 108 |
| 60 ≤ x < 80 | 29 | 137 |
| 80 ≤ x | 4 | 141 |
column 2. The number 59, for example, means that 59 subjects achieved scores of less than 40. In accordance with the definition, the median, Me, is the score achieved by the 71st subject. Since the 59 smallest scores are all less than 40 and since each of the other scores is greater than that, Me = 40 + a, where 0 ≤ a < 20, because the fourth class, within which the median lies, has an interval of 20.
The hypothesis that the 49 elements of the fourth class are uniformly distributed [see Distributions, STATISTICAL] over the interval yields the following relationship for a (linear interpolation): 20 : a = 49 : (71 - 59), whence a = 4.9 and Me = 44.9. The value 71 - 59 = 12 enters into the equation because the median is the twelfth score in the median class.
The frequency distribution is a good starting point for computing Me because it already involves an arrangement in successive size classes. The assumption of uniform distribution within the median class provides the necessary supplement; whether this hypothesis is valid must be decided in each individual case. Only a small portion of the cumulative sum table is required for the actual calculation.
In the calculation it was presupposed that the variate was continuous—for instance, that a score of 39.9 could be achieved. If only integral values are involved, on the other hand, the upper boundaries of the third and fourth classes are 39 and 59 respectively. In that case we obtain Me = 39 + 5 = 44.
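The interpolation just described is easily mechanized. The following Python sketch reproduces Me = 44.9 for Table 6; it uses the text's convention that the median of 141 observations is the 71st, and it assigns the open first class a nominal lower bound of 0, a harmless assumption here because the median class is interior:

```python
def grouped_median(lower_bounds, widths, freqs):
    """Median from a frequency table, interpolating linearly within
    the median class (the uniform-within-class hypothesis)."""
    n = sum(freqs)
    pos = (n + 1) / 2                  # position of the median; 71 when n = 141
    cum = 0
    for lo, w, f in zip(lower_bounds, widths, freqs):
        if cum + f >= pos:
            return lo + w * (pos - cum) / f
        cum += f

lows   = [0, 20, 30, 40, 60, 80]       # nominal bound 0 for the open class
widths = [20, 10, 10, 20, 20, 20]
freqs  = [4, 17, 38, 49, 29, 4]
print(round(grouped_median(lows, widths, freqs), 1))   # 44.9
```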
Quartiles, deciles, percentiles, midrange. The concept that led to the median may be generalized. If there are, say, n = 801 observations, which are again arranged in algebraic order, starting with the smallest, it may be of interest to specify the variate values of the 201st, 401st, and 601st observations—that is, the first, second, and third quartiles, Q₁, Q₂, and Q₃. Obviously, Q₂ is the same as Me. From the associated cumulative distribution, Q₁ and Q₃ may be obtained as follows (see Figure 2, where, for convenience, a smoothed curve has been used to represent a step function): Draw parallels to the x-axis at the heights 201 and 601; let their intersections with the cumulative distribution be P₁ and P₃. The x-coordinates of P₁ and P₃ then yield Q₁ and Q₃ respectively.
If Q1 = $800, Q2 = $1,400, and Q3 = $3,000 in an income distribution, the maximum income of the families in the lower fourth of income recipients is $800 and that of the bottom half is $1,400, while 25 per cent of the population has an income in excess of $3,000.
The variate values of the 81st, 161st, 241st, ..., 721st observations in the example where n = 801 determine the 1st, 2d, 3d, ..., 9th deciles. Similarly, the percentiles in the example are given by the 9th, 17th, 25th, ..., 793d ordered observations. In general, if the sample size is n, the ith percentile is given by the sample element with ordinal number [ni/100] + 1 when ni/100 is not an integer and by the simple average of the two elements with ordinal numbers ni/100 and (ni/100) + 1 when ni/100 is an integer. Here "[x]" stands for the largest integer in x. [For further discussion, see Nonparametric Statistics, article on Order Statistics.]
If the smallest observation is denoted by x₍₁₎ and the largest by x₍ₙ₎, the midrange is ½(x₍₁₎ + x₍ₙ₎) and the quartile average is ½(Q₁ + Q₃).
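The percentile rule just stated is compact enough to implement directly; the sketch below (the function name is illustrative) recovers the median and first quartile of the seven ordered observations used earlier:

```python
import math

def percentile(xs, i):
    """The i-th percentile by the rule in the text: the ([ni/100] + 1)-th
    order statistic when ni/100 is not an integer, otherwise the average
    of the (ni/100)-th and (ni/100 + 1)-th order statistics."""
    ordered = sorted(xs)
    k = len(ordered) * i / 100
    if k != int(k):
        return ordered[math.floor(k)]   # 0-based index floor(k) is element floor(k) + 1
    k = int(k)
    return (ordered[k - 1] + ordered[k]) / 2

xs = [2, 17, 19, 23, 38, 47, 98]
print(percentile(xs, 50))   # 23, the median
print(percentile(xs, 25))   # 17, the first quartile under this convention
```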
Mode. If a variate can take on only discrete values, another useful measure of location is the variate value that occurs most frequently, the so-called mode (abbreviated as Mo). If, for instance, the number of children born of 1,000 marriages, each of which has lasted 20 years, is counted, Mo is the most frequent number of children.
The mode is often encountered in daily life: the most frequent mark in a test, the shoe size called for most often, the range of prices most frequently requested. Generally speaking, the concept of the typical is associated with the mode; consider, for instance, wage distribution in a large factory. The mode is highly intuitive, but it possesses concrete significance only when there is a tendency toward concentration about it and when enough observations are available.
In considering a frequency distribution with class intervals that are all of equal width, one convention is to take as the mode the midpoint of the class containing the largest number of elements. This procedure is open to the objection that it does not take into consideration which of the adjoining classes contains a larger number of elements; the mode would be better calculated in such a way that it is closer to the adjacent class that contains the larger number of elements. Several more or less arbitrary ways of doing this have been suggested— for example, by Croxton and Cowden (1939, p. 213) and by Hagood and Price ([1941] 1952, p. 113). On the other hand, for a continuous distribution it makes sense to take the value of x that yields a (relative) maximum as the mode. There may be two or more relative maxima, as may be readily seen from a graph of the distribution.
Further comments on mean values. There exist some quantitative relationships between the mean values.
(a) Harmonic mean ≤ geometric mean ≤ arithmetic mean. The equal signs apply only to the trivial case where all the observations have the same variate value. This is a purely mathematical result.
(b) For a symmetrical distribution (one whose density or frequency function is unchanged by reflection through a vertical line) with exactly one mode, x̄, Me, and Mo coincide. For a distribution that is slightly positively skew (that is, one that slopes more gently to the right than to the left; this is not a precise concept), an approximate relationship is x̄ − Mo = 3(x̄ − Me). In a positively skew distribution the sequence of the three parameters is Mo, Me, and x̄; in negatively skew ones, on the other hand, it is x̄, Me, and Mo. Kendall and Stuart write as follows about this: "It is a useful mnemonic to observe that the mean, median, and mode occur in the same order (or reverse order) as in the dictionary; and that the median is nearer to the mean than to the mode, just as the corresponding words are nearer together in the dictionary" ([1943] 1958, p. 40).
Scales of measurement. There is usually a distinction made between four types of scales on which observations may be made; the distinction is briefly characterized by Senders as follows:
If all you can say is that one object is different from another, you have a nominal scale.
If you can say that one object is bigger or better or more of anything than another, you have an ordinal scale.
If you can say that one object is so many units (inches, degrees, etc.) more than another, you have an interval scale.
If you can say that one object is so many times as big or bright or tall or heavy as another, you have a ratio scale. (1958, p. 51)
Scales of measurement are also discussed by Gini (1958), Pfanzagl (1959), Siegel (1956), and Stevens (1959).
For an interval scale there is a meaningful unit (foot, second, etc.) but not necessarily an intrinsically meaningful origin. A ratio scale has both a meaningful unit and a meaningful origin.
Some examples of these four types of scales are:
(a) Nominal scale: classifying marital status into four categories—single, married, widowed, and divorced; a listing of diseases.
(b) Ordinal scale: answers to the question “How do you like the new method of teaching?” with the alternative replies “Very good,” “Good,” “Fair,” “Poor,” “Very poor.”
(c) Interval scale: measuring temperature on the centigrade, Reaumur, or Fahrenheit scales; measuring calendar time (7:09 A.M., January 3, 1968).
(d) Ratio scale: measuring a person's height in centimeters; measuring temperature on the absolute (Kelvin) scale; measuring age or duration of time.
It is apparent that the four scales are arranged in our listing according to an increasing degree of stringency. The nominal scale, in particular, is the weakest method of measurement.
The applicability of a given location measure depends upon the scale utilized, as is shown in Table 7. Only the most frequent value can be calculated for every scale, although Gini (1958) and his school have endeavored to introduce other averages appropriate for a nominal scale.
Table 7 indicates that the median, the quartiles,
Table 7 — Scales of measurement required for use of location measures

| Location measure | Scale required |
|---|---|
| Mode | at least a nominal scale |
| Median, quartiles, and percentiles | at least an ordinal scale |
| Arithmetic mean | at least an interval scale |
| Geometric and harmonic means | ratio scale |
and the percentiles are relatively invariant under any strictly monotonic transformation, in the sense that, for example, φ(median X) = median(φ(X)), where φ is a strictly monotone transformation. Similarly, the arithmetic mean is relatively invariant under any linear transformation y = ax + b, and the geometric and harmonic means are relatively invariant under the transformation y = ax (with a > 0 for the geometric mean).
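These invariance statements are easy to check numerically. A brief illustration in Python, using the seven observations from the median example, is the following:

```python
import math
import statistics

xs = [2, 17, 19, 23, 38, 47, 98]

# The median commutes with a strictly increasing transformation phi:
lhs = statistics.median(math.sqrt(x) for x in xs)
rhs = math.sqrt(statistics.median(xs))
print(lhs == rhs)   # True: phi(median of x) = median of phi(x)

# The arithmetic mean does not commute with square roots in general:
print(statistics.fmean(math.sqrt(x) for x in xs), math.sqrt(statistics.fmean(xs)))

# But the mean is invariant under linear transformations y = a*x + b:
a, b = 5, 100
print(math.isclose(statistics.fmean(a * x + b for x in xs),
                   a * statistics.fmean(xs) + b))   # True
```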
Descriptive statistics of dispersion
If there were no dispersion among the observations of a sample, much of statistical methodology would be unnecessary. In fact, there almost always is dispersion, arising from two general kinds of sources: (a) inherent variability among individuals (for instance, households have different annual savings) and (b) measurement error (for instance, reported savings may differ erratically from actual savings).
Absolute measures of dispersion. A first intuitive method of measuring dispersion might be in terms of ∑(xᵢ − x̄), the sum of residuals about the arithmetic mean. This attempt immediately fails, since the sum is always zero. A useful measure of dispersion, however, is obtained by introducing absolute values, yielding the mean deviation (1/n) ∑|xᵢ − x̄|.

Variance and standard deviation. An even more useful measure of dispersion is the variance,

s² = (1/n) ∑(xᵢ − x̄)²,

the average squared deviation between an observation and the arithmetic mean. The positive square root of s² is called the standard deviation. (Often n, in the above definition, is replaced by n − 1, primarily as a convention to simplify formulas that arise in the theory of sampling, although in some contexts more fundamental arguments for the use of n − 1 may be given.) [See Sample Surveys.]
It is of interpretative interest to rewrite s² as

s² = (1/(2n²)) ∑ᵢ∑ⱼ (xᵢ − xⱼ)²,

half the average squared difference between all pairs of observations, an expression that makes no mention of x̄.
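The equivalence of the two expressions for s² is easily verified numerically, as in this sketch (using the six wage figures from the coding example above):

```python
def variance(xs):
    """Average squared deviation about the mean, divisor n as in the text."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n

def variance_pairwise(xs):
    """The same quantity with no mention of the mean: half the average
    squared difference over all ordered pairs of observations."""
    n = len(xs)
    return sum((xi - xj) ** 2 for xi in xs for xj in xs) / (2 * n * n)

xs = [110, 125, 145, 190, 210, 300]
print(variance(xs), variance_pairwise(xs))   # equal, up to rounding
```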
The population variance is E(X − μ)², where μ = E(X) [see Probability, article on Formal Probability].
The variance and standard deviation may be computed from a table of grouped observations in a manner analogous to that for the arithmetic mean; coding is particularly useful. Suggestions have been made toward compensating for the class width in a grouping [see Statistical Analysis, Special Problems Of, article on Grouped observations].
Variance has an additivity property that is akin to additivity for the arithmetic mean over subgroups. Suppose that the xᵢ's are divided into two subgroups of sizes n₁, n₂, with arithmetic means x̄₁, x̄₂, and with variances s₁², s₂². For convenience, let p₁ = n₁/n, p₂ = n₂/n. Then, not only does

(8) x̄ = p₁x̄₁ + p₂x̄₂,

but also

(9) s² = p₁s₁² + p₂s₂² + p₁p₂(x̄₁ − x̄₂)².
This relationship may be extended to more than two subgroups, and, in fact, it corresponds to a classic decomposition of s2 into within and between subgroup terms. [See Linear Hypotheses, article on Analysis Of variance.]
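A small numerical check of eqs. (8) and (9), with two arbitrarily chosen illustrative subgroups, makes the decomposition concrete:

```python
def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

g1 = [10, 12, 14]        # illustrative subgroup 1
g2 = [20, 22, 24, 26]    # illustrative subgroup 2
pooled = g1 + g2

n1, n2, n = len(g1), len(g2), len(pooled)
p1, p2 = n1 / n, n2 / n

within  = p1 * var(g1) + p2 * var(g2)               # within-subgroup part
between = p1 * p2 * (mean(g1) - mean(g2)) ** 2      # between-subgroup part
print(var(pooled), within + between)                # the two agree
```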
Perhaps a more important additivity property of variance refers to sums of observations. [This is discussed in Probability, article on Formal Probability, after consideration of correlation and independence.]
Other deviation measures of dispersion. The probable error is a measure of dispersion that has lost its former popularity. It is most simply defined as 0.6745 × s, although there has been controversy about the most useful mode of definition. If the xᵢ's appear roughly normal and n is large, then about half the xᵢ's will lie in the interval x̄ ± probable error.
Gini has proposed a dispersion measure based on absolute deviations, the mean difference

g = (1/[n(n − 1)]) ∑|xᵢ − xⱼ| (summed over all pairs with i ≠ j),

that has attracted some attention, although it is not widely used.
When the observations are ordered in time or in some other exogenous variate, the mean square successive difference,

d² = (1/(n − 1)) ∑(xᵢ₊₁ − xᵢ)² (summed from i = 1 to n − 1),

has been useful. [See Variances, Statistical Study of.]
Dispersion measures using order statistics. The simplest measure of dispersion based on the order statistics of a sample is the difference between the largest and smallest xᵢ: the range [see Nonparametric Statistics, article on Order Statistics]. As against its advantages of vividness and ease of calculation there is the disadvantage that the range is fixed by the two extreme values alone. Outliers often occur, especially for large n, so that the range is of limited usefulness precisely in such cases. This is not so serious in small samples, where the range is readily employed. There is a simple relationship connecting the sample size, n, the expected range, and the standard deviation under certain conditions.
The basic shortcoming of the range is avoided by using, say, the 10-90 interpercentile range. This is determined by eliminating the top and bottom 10 per cent of the elements that are ordered according to size and determining the range of the remaining 80 per cent. The 5-95 interpercentile range can be determined similarly. The semi-interquartile range is ½(Q₃ − Q₁), in which the bottom and top 25 per cent of all the elements are ignored and half the range of the remaining 50 per cent is calculated.
Relative measures of dispersion. The expressions set forth above are absolute measures of dispersion. They measure the dispersion in the same unit chosen for the several variate values or in the square of that unit. The following example shows that there is also need for relative measures of dispersion, however: A study is to be made of the annual savings (a) of 1,000 pensioners and (b) of 1,000 corporation executives. The average income in (a) might be $1,800, as against $18,000 in (b). The absolute dispersion in (b) will presumably be considerably greater than that in (a), and a measure of dispersion that allows for the gross differences in magnitude may be desired. Two of the numerous possibilities available are (1) the coefficient of variation, s/x̄ or (s/x̄) × 100, which measures s as a fraction or a percentage of x̄ (this parameter loses its significance when x̄ equals 0 or is close to 0), and (2) the quartile dispersion coefficient, (Q₃ − Q₁)/Me, which indicates the interquartile range as a fraction of the median. Such measures are dimensionless numbers.
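Both relative measures are immediate to compute. The sketch below uses Python's standard statistics module on hypothetical savings figures; note that quartile conventions differ slightly among texts and programs, so the quartile coefficient may vary somewhat from one convention to another:

```python
import statistics

savings = [1200, 1500, 1800, 2100, 2400]    # hypothetical annual savings

s = statistics.pstdev(savings)              # standard deviation, divisor n
m = statistics.fmean(savings)
cv = s / m                                  # coefficient of variation

q1, me, q3 = statistics.quantiles(savings, n=4)   # one common quartile convention
qdc = (q3 - q1) / me                              # quartile dispersion coefficient

print(round(cv, 3), round(qdc, 3))          # dimensionless numbers
```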
Concentration curves. Another way of describing dispersion is the construction of a concentration (or Lorenz) curve. [These techniques are discussed in Graphic Presentation.]

Populations and samples. In the above discussion a number of descriptive statistics have been defined, both for sets of observations and for populations. A fundamental statistical question is the inductive query "What can be said about the population mean, variance, etc., on the basis of a sample's mean, variance, etc.?" [Many of the statistical articles in this encyclopedia deal with aspects of this question, particularly Sample Surveys; Estimation; and Hypothesis Testing.]
Hans Kellerer
BIBLIOGRAPHY
Discussions of the statistics of location and dispersion are found in every statistics text. The following are only a few selected references.
Croxton, Frederick E.; and Cowden, Dudley J. (1939) 1955 Applied General Statistics. 2d ed. Englewood Cliffs, N.J.: Prentice-Hall. → Sidney Klein became a joint author with the third edition, published in 1967.
Dalenius, T. 1965 The Mode: A Neglected Statistical Parameter. Journal of the Royal Statistical Society Series A 128:110-117.
Fechner, Gustav T. 1897 Kollektivmasslehre. Leipzig: Engelmann.
Gini, Corrado 1958 Le medie. Turin (Italy): Unione Tipografico.
Hagood, Margaret J.; and Price, David O. (1941) 1952 Statistics for Sociologists. Rev. ed. New York: Holt.
Kendall, Maurice G.; and Buckland, William R. (1957) 1960 A Dictionary of Statistical Terms. 2d ed. Published for the International Statistical Institute with the assistance of UNESCO. Edinburgh and London: Oliver & Boyd.
Kendall, Maurice G.; and Stuart, Alan (1943) 1958 The Advanced Theory of Statistics. Volume 1: Distribution Theory. New York: Hafner; London: Griffin.
Pfanzagl, J. 1959 Die axiomatischen Grundlagen einer allgemeinen Theorie des Messens. Schriftenreihe des Statistischen Instituts der Universität Wien, New Series, No. 1. Würzburg (Germany): Physica-Verlag.
Senders, Virginia L. 1958 Measurement and Statistics: A Basic Text Emphasizing Behavioral Science Application. New York: Oxford Univ. Press.
Siegel, Sidney 1956 Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.
Stevens, S. S. 1959 Measurement, Psychophysics, and Utility. Pages 18-63 in Charles W. Churchman and Philburn Ratoosh (editors), Measurement: Definitions and Theories. New York: Wiley.
Vergottini, Mario de 1957 Medie, variabilità, rapporti. Turin (Italy): Einaudi.
Yule, G. Udny; and Kendall, Maurice G. (1911) 1958 An Introduction to the Theory of Statistics. 14th ed., rev. & enl. London: Griffin. → Kendall has been a joint author since the eleventh edition (1937); the 1958 edition was revised by him.
Žižek, Franz (1921) 1923 Grundriss der Statistik. 2d ed., rev. Munich: Duncker & Humblot.
II. ASSOCIATION
When two or more variables or attributes are observed for each individual of a group, statistical description is often based on cross-tabulations showing the number of individuals having each combination of score values on the variables. Further summarization is often desired, and, in particular, a need is commonly felt for measures (or indices, or coefficients) that show how strongly one variable is associated with another. In the bivariate case, the primary topic of this article, well-known indices of this kind include the correlation coefficient, the Spearman rank correlation measure, and the mean square contingency coefficient. In this article, the motivations for such measures are discussed, criteria are explained, and some important specific measures are treated, with emphasis throughout on the cross-tabulations of nominal and ordinal measurements. [For further material on the metric case, see Multivariate Analysis, articles on Correlation.]
Current work in this area is generally careful to make an explicit distinction between a parameter—that is, a summary of a population of values (often taken as infinitely large)—and a statistic, which is a function of sample observations and most sensibly regarded as a means of making an estimate of some clearly defined population parameter. It is of doubtful value to have a quantity defined in terms of an observed sample unless one is clear about the meaning of the population quantity that it serves to estimate. Recent work on sampling theory, presenting, among other things, methods for computing a confidence interval for their γ (illustrated below), is contained in Goodman and Kruskal (1954-1963, part 3).
The correlation coefficient and related measures. In a bivariate normal distribution that has been standardized so that both marginal standard deviations are equal, the correlation coefficient nicely describes the extent to which the elliptical pattern associated with the distribution is elongated. At one extreme, when the distribution is nearly concentrated along a straight line, the correlation coefficient is nearly 1 in absolute value, and each variate is nearly a linear function of the other. At the other extreme, the elliptical pattern is a circle if and only if the correlation coefficient, whose population value is often designated by ρ, is 0. In this case, knowledge of one variate provides no information about the other.
In this bivariate normal case, with equal marginal standard deviations, the regression (conditional expectation) of either variate on the other is a straight line with slope ρ. For the above reasons, it is not unreasonable to adopt ρ as a measure of degree of association in a bivariate normal distribution. The essential idea behind the correlation coefficient was Francis Galton's, but Karl Pearson defined it carefully and was the first to study it in detail. [See the biographies of Galton and Pearson.]
It was pointed out by G. Udny Yule, a student of Pearson's, that ρ also has a useful interpretation even if the bivariate distribution is not normal. This interpretation is actually of ρ², since 1 − ρ² measures how closely one variate clusters about the least squares linear prediction of it from the other. More generally, it remains true that the two variates have a linear functional relation if and only if ρ = ±1, but ρ may be 0 without the existence of stochastic independence; in fact, the variates may be functionally related by a properly shaped function at the same time that ρ = 0.
More recently, Kruskal (1958, pp. 816-818) has reviewed attempts to provide an interpretation of both Pearson's ρ and the correlation ratios, whether or not normality obtains. Both of these measures may be interpreted in terms of expected or average squared deviations, but Kruskal feels that interpretations in these terms are not always appropriate.
In Pearson's early work the assumption of normality was basic, but he soon recognized that one often deals with observations in the form of discrete categories, such as "pass" or "fail" on a test of mental ability. Consequently, he introduced additional coefficients to deal with a bivariate distribution in which one or both of the variables are normal and continuous in principle but become grouped into two or more classes in the process of making the observations. Thus, his biserial correlation coefficient is designed to estimate the value of ρ when one of the normal variables is grouped into two classes, and his tetrachoric coefficient is designed for the same purpose when both variables are grouped into two classes, yielding in this latter case a bivariate distribution in the form of a fourfold table (see Kendall & Stuart [1946] 1961, pp. 304-312). More recently, Tate (1955) has pointed out the value of having a measure of association between a discrete random variable that takes the values 0 and 1, and a continuous random variable, especially in the sort of problems that often arise in psychology, where one wishes to correlate a dichotomous trait and a numerically measurable characteristic, such as an ability. He has reviewed two alternative models and both asymptotic and exact distribution theory, all based on normality, and has applied these models to illustrative data.
Yule's "attribute statistics" and recent continuations. Yule's reluctance to accept the assumption of normality in order to measure relationships soon led him to an interest in the relationship between discrete variables, which he referred to as attributes. Most of this work concentrated on dichotomous attributes and, thus, the measurement of relationships in fourfold tables. For this purpose his first proposal was a measure designated Q (for Quetelet, the Belgian astronomer-statesman-statistician); the aspect of the relationship that this represented in quantitative form he designated the "association" of the two attributes (see Yule 1912). The attempt to coin a special term for the aspect of a relationship measured by a particular coefficient has since been largely discarded, and the term "association" is now used more broadly. [See the biographies of Quetelet; Yule.]
The computation of Yule's Q is illustrated in Table 1, where it is adapted to a sociological study of the relationship between race and an attitude of alienation from one's social environment. Alienation was measured by a version of Srole's Anomia Scale (1956); the sample is of adults in Berkeley, California, in 1960. The quantification of this relationship in this way, which in the present case gives Q = (ad − bc)/(ad + bc) = .66, might be especially useful if the purpose of the study were to compare different years or communities with regard to the extent to which race may condition an attitude of alienation. Of course,
Table 1 — Illustrative fourfold table, relating race (X) to an attitude of alienation (Y) from one's social environment*

| | Negro | White | Total |
|---|---|---|---|
| Alienation high | a = 19 | b = 55 | 74 |
| Alienation low | c = 6 | d = 85 | 91 |
| Total | 25 | 140 | 165 |
| Per cent highly alienated | 76% | 39% | |

* Cell entries are frequencies. Source: Templeton 1966.
neither an association nor a correlation is in itself evidence of causation. Although the relationship shown here is an objective fact, and therefore a descriptive reflection of social reality with possible consequences for social relations in the United States, it is possible that some other variable, such as socioeconomic status, accounts for at least part of the relationship, and to that extent the relation observed in Table 1 would be considered “spurious” in the Lazarsfeld ([1955] 1962, p. 123) sense.
It might seem more natural to measure the relation between race and alienation by a comparison of rates of alienation in the two racial groups. Thus, there was a difference of 76 − 39 = 37 percentage points (see the bottom row of Table 1) in the measured alienation for these two groups at that time and place. This difference might be considered a measure of association, since it, too, would be 0 if the tabulated frequencies showed statistical independence. Theoretically, such a coefficient (sometimes called the "percentage difference") could achieve a maximum of 100 if the distribution were as in Table 2. For comparability, it is customary to divide by 100, producing a coefficient sometimes referred to as d_yx (for X an independent variable and Y a dependent one), which has a maximum of 1.0, as do most other measures of association.

A problem in the use of d_yx, the difference between rates of alienation, arises when it is noted that the maximum absolute value of d_yx (that is, unity) is achieved only in the special situation illustrated in Table 2, where the row and column marginal distributions must necessarily be equal. This dependence of the maximum value on equality of the marginal distributions can be misleading—for example, in comparisons of different communities where the number of Negroes remains about the same but the average level of alienation varies. In Table 3, Q is 1.0, while d_yx is less than 1.0. Table 3 represents the maximum discrepancy that could be observed in the rates of alienation, given the marginal distributions of the obtained sample of Table 1, and for this situation the maximum value of d_yx is only .65.
In addition to this independence of its maximum as marginals vary, Q has another desirable property, which is shared by d_yx: it has an operational interpretation. Both of these coefficients may be interpreted in a way that involves paired comparisons. Thus, Q = ad/(ad + bc) − bc/(ad + bc) represents the probability of a pair having a "concordant ordering of the variables" less the probability of a pair having a "discordant ordering" when the pair is chosen at random from the set of pairs of individuals constructed by pairing each Negro member of the population with, in turn, each white member of the population and disregarding all pairs except those for which the alienation levels of the two individuals differ. By a "concordant ordering" is meant, in this illustration, a situation in which the Negro member of the pair has high alienation and the white has low alienation.
Closely related to this is the interpretation of d_yx = ad/[(a + c)(b + d)] − bc/[(a + c)(b + d)], which is also the difference between the probability of a concordant pair and the probability of a discordant pair. In this case, however, the pair is chosen at random from a somewhat extended set of pairs: those in which the two individuals are of different races (here race is taken as X, the independent variable), whether or not they have different levels of alienation. Thus, d_yx asks directly: To what extent can one predict, in this population, the ordering of alienation levels of two persons, one of whom is Negro, the other white? On the other hand, Q asks: To what extent can one predict the ordering of alienation levels of persons of different races among those whose alienation levels are not equal? Thus, Q does not distinguish between the situation of Table 2, in which no white persons are highly alienated, and that of Table 3, in which about 35 per cent of the whites are alienated.
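Both coefficients for the fourfold table are one-line computations. The following Python sketch (function names are illustrative) reproduces the values Q = .66 and d_yx = .37 obtained from Table 1:

```python
def yules_q(a, b, c, d):
    """Yule's Q for the fourfold table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)

def d_yx(a, b, c, d):
    """The asymmetric coefficient with the column variable X independent:
    concordant minus discordant probability among pairs differing on X."""
    return (a * d - b * c) / ((a + c) * (b + d))

a, b, c, d = 19, 55, 6, 85             # Table 1: race by alienation
print(round(yules_q(a, b, c, d), 2))   # 0.66
print(round(d_yx(a, b, c, d), 2))      # 0.37
```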
Table 2 — Hypothetical fourfold table showing d_yx = 1

| | Negro | White | Total |
|---|---|---|---|
| Alienation high | 25 | 0 | 25 |
| Alienation low | 0 | 140 | 140 |
| Total | 25 | 140 | 165 |
| Per cent highly alienated | 100% | 0% | |

Association measures: Q = 1; d_yx = 1.
Table 3 — Hypothetical fourfold table showing the maximum possible value of d_yx given the observed marginal distributions

| | Negro | White | Total |
|---|---|---|---|
| Alienation high | 25 | 49 | 74 |
| Alienation low | 0 | 91 | 91 |
| Total | 25 | 140 | 165 |
| Per cent highly alienated | 100% | 35% | |

Association measures: Q = 1; d_yx = .65.
There is, of course, no reason why one should expect to be able to summarize more than one aspect of a bivariate distribution in a single coefficient. Indeed, some investigators assume that several quantities will be necessary to describe a cross-tabulation adequately, and some go so far as to question the value of summary statistics at all, preferring to analyze and present the whole tabulation with r rows and c columns. This latter approach, however, leads to difficulties when two or more cross-classifications are being compared. The measures of association Q and dyx may be extended to cross-tabulations with more than two categories per variable, as described below. The problems mentioned above extend to such cross-tabulations.
Although correlation theory developed historically from the theory of probability of continuous variables, one may consider the antecedents of “association theory” as lying more in the realm of the logic of classes. Yule's early paper on association traces the origin of his ideas to De Morgan, Boole, and Jevons (in addition to Quetelet); more recently it has been recalled that the American logician Peirce also worked on this problem toward the close of the nineteenth century [see the biography of Peirce].
The fact that most of the detailed remarks in this article refer to the work of British and American statisticians does not imply that developments have been confined to these countries. In Germany important early work was done by Lipps and Deuchler, and Gini in Italy has developed a series of statistical measures for various types of situations. [See Gini.] A comprehensive review of the scattered history of this work is contained in Goodman and Kruskal (1954-1963, parts 1 and 2) and Kruskal (1958). Characteristically, investigators in one country have developed ad hoc measures without a knowledge of earlier work, and some coefficients have been “rediscovered” several times in recent decades.
Measurement characteristics and association
In recent years the distinction between the continuous and the discrete—between quantity and quality—often running through the work of logicians has become less sharp because of the addition of the idea of ordinal scales (rankings), which may be either discrete or continuous. Indeed, a whole series of "levels of measurement" has been introduced in the past twenty-five years (see Torgerson 1958, chapters 1-3), and because of their relevance to the measurement of relationships the matter warrants a brief statement here. [Other discussions of scales of measurement can be found in Scaling; and in Statistics, Descriptive, article on Location and Dispersion.]
As described above, in the past the term "correlation" has referred to a relation between quantities, while the term "association" has been reserved for the relation between qualities. For present purposes it is sufficient to elaborate this distinction only slightly by defining the three categories "metrics," "ordinals," and "nominals." Metrics (for example, height and weight), roughly corresponding to the earlier quantitative variables, are characterized by having an unambiguous and relevant unit of measurement. (Metrics are often further classified according to whether a meaningful zero point exists.) Ordinals (for example, amount of subjective agreement or appreciation) lack a unit of measurement but permit comparisons, so that of two objects one can always say that one is less than the other, or that the two are tied, on the relevant dimension. Ordinal scales are for this reason designated "comparative concepts" by Hempel (1952, pp. 58-62). Nominals (for example, types of ideology, geographical regions) are simply classificatory
categories lacking any relevant ordering but tied together by some more generic conception. Each of these scales may be either continuous or discrete, except nominals, which are inherently discrete.
Present usage generally retains the term "correlation" for the relationship between metrics (sometimes "correlation" is restricted to ρ and the correlation ratio), and "rank correlation" for the relationship between continuous ordinals, although the term "association" has also been used here. The terms "order association" and "monotonic correlation" have been used for the relationship between discrete ordinals, and the terms "association" and "contingency" are used to refer to the relationship between nominals. (Some statisticians prefer to reserve the term "contingency" for data arising from a particular sampling scheme.)
With ordinal data of a social or psychological origin the number of ordered categories is often small relative to the number of objects classified, in which case the bivariate distribution is most conveniently represented in the form of a cross-classification that is identical in format to the joint distribution over two nominals. Because of the identity of format, however, the different cases are sometimes confused, and as a result a measure of association is used that ignores the ordering of the categories, an aspect of the data that may be crucial to the purposes of the analysis. For this reason it is helpful to conceive of a cross-classification of two ordinals as rankings of a set of objects on two ordinals simultaneously, with ties among the rankings leading to a number of objects falling in each cell of the cross-classification.
Illustrative data, of a type very common in contemporary survey research, are presented in Table 4. In this instance the investigator utilized an additive index of income, education, and occupational prestige to obtain a socioeconomic ranking, which was then grouped into three ordered classes, with observations distributed uniformly over these classes insofar as possible. The alienation scale used in this study is derived from questionnaire responses indicating agreement with 24 such statements as “We are just so many cogs in the machinery of life”; “With so many religions around, one doesn't really know which to believe”; and “Sometimes I feel all alone in the world.” Again the results are grouped into three nearly equal classes.
An appropriate coefficient for the measurement of the relation between socioeconomic status and alienation in Table 4 would be gamma (γ), a generalization of Yule's Q, introduced by Goodman and Kruskal (1954-1963, part 1, p. 747). Omitting details of computation, this measure provides
Table 4 — Cross-classification of socioeconomic status and alienation*

| Alienation | Status high | Status medium | Status low | Total |
|---|---|---|---|---|
| High | 23 | 62 | 107 | 192 |
| Medium | 61 | 65 | 61 | 187 |
| Low | 112 | 60 | 23 | 195 |
| Total | 196 | 187 | 191 | 574 |

* Cell entries are frequencies. Source: Reconstructed from Erbe 1964, table 4(6), p. 207.
information of the following sort: Suppose one is presented with a randomly chosen pair of individuals from a population distributed in the form of Table 4, the pair being chosen with the restriction that the members be located in different status categories as well as different alienation categories. What are the chances that the individual in the higher status category, say, will also be more alienated? The probability of this event for the population of Table 4 is .215. Similarly, one may ask for the chance that in such a pair of individuals the person with higher status has less alienation: this complementary event has probability 1 − .215 = .785. The difference between these probabilities, −.570, is the value of γ, a measure of association with many convenient properties, most of which have been noted in the comments on Q, which is a special case of γ. A slight modification of γ, retaining many of the same properties but yielding an asymmetric coefficient, d_yx, has recently been presented by Somers (1962) and was also illustrated above in the special case of a fourfold table. It is asymmetric in that, unlike γ, it will in general have a different value depending on which variable is taken as independent.
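The paired-comparison interpretation of γ translates directly into a counting algorithm. The following sketch (a brute-force count, adequate for small tables; the arrangement of Table 4 is first reversed so that both variables run from low to high) reproduces the probabilities .215 and .785 and the value γ ≈ −.57:

```python
def concordance_counts(table):
    """Concordant and discordant pair counts for a cross-classification
    whose rows and columns are both arranged from low to high."""
    cells = [(i, j, f) for i, row in enumerate(table) for j, f in enumerate(row)]
    P = N = 0
    for i, j, f in cells:
        for k, l, g in cells:
            if k > i and l > j:
                P += f * g    # pair ordered the same way on both variables
            elif k > i and l < j:
                N += f * g    # pair ordered oppositely
    return P, N

# Table 4 with rows alienation Low/Medium/High, columns status Low/Medium/High
table = [[23, 60, 112],
         [61, 65, 61],
         [107, 62, 23]]

P, N = concordance_counts(table)
print(round(P / (P + N), 3))        # 0.215, chance of a concordant pair
print(round(N / (P + N), 3))        # 0.785, chance of a discordant pair
print(round((P - N) / (P + N), 3))  # gamma = -0.569 (the text's -.570)
```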
This example shows clearly that a distinction must be made between ordinal and nominal measurement; this distinction becomes especially important when, as in Table 4, the number of categories exceeds two. Table 4 cross-tabulates two ordinal variables, while Table 5 presents a representative cross-tabulation of two variables treated by the analysts as nominal variables, that is, categorical groupings with no inherent ordering.
The investigators in the latter case were interested in quantifying, for purposes of comparison, the extent of occupational segregation within the school districts shown in Table 5 and accomplished this by means of a measure of association introduced by Goodman and Kruskal (1954-1963, part 1, p. 759), designated τ_b (not to be confused with the very different τ_b of Kendall [1948] 1963, chapter 3). As a measure of association, τ_b is 0 when
Table 5 — Distribution of occupations of fathers of children in three school districts*

| School district | Professional | White-collar | Self-employed | Manual | Total |
|---|---|---|---|---|---|
| A | 92 | 174 | 68 | 39 | 373 |
| B | 39 | 138 | 90 | 140 | 407 |
| C | 11 | 111 | 37 | 221 | 380 |
| Total | 142 | 423 | 195 | 400 | 1,160 |

* Cell entries are frequencies; columns give father's occupation. Source: Data from Wilson 1959, p. 839, as presented and analyzed in Rhodes et al. 1965, pp. 687-688.
the frequencies are statistically independent and is +1 when the frequencies are so distributed that a knowledge of the location of an individual on one variable—here, school district—enables perfect prediction of his location on the other variable. In Table 5 the value of τ_b is .075, indicating that, in the language of Goodman and Kruskal's interpretation, the error of prediction (employing a "proportional" prediction) would be reduced only 7.5 per cent in making predictions of occupational status of father given a knowledge of school district (and the joint frequency distribution), as opposed to making that prediction with a knowledge only of the occupational distribution in the margin. Changing the order of rows or columns does not affect τ_b.
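The proportional-prediction interpretation yields an equally short computation; this sketch (the function name is illustrative) reproduces τ_b = .075 for Table 5:

```python
def goodman_kruskal_tau(table):
    """Goodman and Kruskal's tau: proportional reduction in the error of a
    'proportional' prediction of the column category, given the row."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]

    e1 = n - sum(c * c for c in col_totals) / n   # error from the margin alone
    e2 = 0.0                                      # error given the row variable
    for row in table:
        ni = sum(row)
        e2 += ni - sum(f * f for f in row) / ni
    return (e1 - e2) / e1

# Table 5: school districts A, B, C (rows) by father's occupation (columns)
table = [[92, 174, 68, 39],
         [39, 138, 90, 140],
         [11, 111, 37, 221]]
print(round(goodman_kruskal_tau(table), 3))   # 0.075
```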
If the frequencies had been distributed as in Table 6, then τ_b would have been 1.0, since knowledge of school district would permit perfect prediction of occupational category.
Because τ_b is intended for nominal variables, the idea of a "negative" association has no meaning, and the measure therefore varies only between 0 and +1.
Whereas Goodman and Kruskal interpreted τ_b in prediction terms as described above, Rhodes and others (1965) in their application have presented an interpretation that motivates this measure as an index of spatial segregation, based on a model of chance interaction. Their work thus provides a
Table 6 — Hypothetical distribution of frequencies enabling perfect prediction of occupation of father, given knowledge of school district

| School district | Professional | White-collar | Self-employed | Manual | Total |
|---|---|---|---|---|---|
| A | 0 | 373 | 0 | 0 | 373 |
| B | 407 | 0 | 0 | 0 | 407 |
| C | 0 | 0 | 0 | 380 | 380 |
| Total | 407 | 373 | 0 | 380 | 1,160 |
good example of the derivation of a summary statistic that is appropriate to, and interpretable in the light of, the specific research hypothesis that is the investigator's concern. It is, for their research purposes, a largely irrelevant coincidence that their coefficient happens to be Goodman and Kruskal's τ_b, interpretable in another way.
Another special kind of problem may arise in certain analyses where association is of interest. In many types of sociological analysis data are not available for cross-tabulating the individual observations but rather are available only on the “ecological” level, in the form of rates. Thus, for example, one might have available only the proportion of persons of high income and the proportion of persons voting for each political party for collectivities such as election districts. The investigator, however, may still be interested in the relation at the individual level. Duncan and others (1961) have discussed this problem in detail and have presented methods for establishing upper and lower bounds for the value of the individual association, given ecological data. Goodman (1959) has discussed and elaborated upon these procedures, and he has shown how it is possible, with minimal assumptions, to make estimates of the value of the individual association from the ecological data.
Multiple and partial association
There is an area in which much work remains to be done: the problem of the analysis of simultaneous relations among more than two variables of an ordinal or nominal sort. An article by Lewis (1962) summarizes much of the literature on particular aspects of the problem for nominals, more from the point of view of testing hypotheses of no relation, but dealing indirectly with the choice of descriptive statistics; Goodman (1963) has corrected and modified certain remarks of Lewis'. From a different point of view, several new methods are introduced in Coleman (1964, chapter 6). Goodman and Kruskal (1954-1963, part 1, secs. 11-12) have also discussed ways of extending their conceptions to the partial-association and multiple-association situations, and somewhat different ideas are contained in a brief discussion of the triple dichotomy by Somers (1964).
Robert H. Somers
[See also Multivariate analysis, articles on Correlation; Survey analysis, article on The analysis of attribute data; Tabular presentation.]
BIBLIOGRAPHY
Coleman, James S. 1964 Introduction to Mathematical Sociology. New York: Free Press.
Duncan, Otis Dudley; Cuzzort, Ray P.; and Duncan, Beverly 1961 Statistical Geography: Problems in Analyzing Areal Data. New York: Free Press.
Erbe, William 1964 Social Involvement and Political Activity: A Replication and Elaboration. American Sociological Review 29:198–215.
Goodman, Leo A. 1959 Some Alternatives to Ecological Correlation. American Journal of Sociology 64:610–625.
Goodman, Leo A. 1963 On Methods for Comparing Contingency Tables. Journal of the Royal Statistical Society Series A 126:94–108.
Goodman, Leo A.; and Kruskal, William H. 1954–1963 Measures of Association for Cross-classifications. Parts 1–3. Journal of the American Statistical Association 49:732–764; 54:123–163; 58:310–364.
Guttman, Louis 1941 An Outline of the Statistical Theory of Prediction: Supplementary Study B-1. Pages 253–318 in Social Science Research Council, Committee on Social Adjustment, The Prediction of Personal Adjustment, by Paul Horst et al. New York: The Council.
Hempel, Carl G. 1952 Fundamentals of Concept Formation in Empirical Science. Volume 2, number 7, in International Encyclopedia of Unified Science. Univ. of Chicago Press.
Kendall, M. G. (1948) 1963 Rank Correlation Methods. 3d ed., rev. & enl. New York: Hafner. → The first edition was published by Griffin.
Kendall, M. G.; and Stuart, Alan (1946) 1961 The Advanced Theory of Statistics. Volume 2: Inference and Relationship. New York: Hafner; London: Griffin.
Kruskal, William H. 1958 Ordinal Measures of Association. Journal of the American Statistical Association 53:814–861.
Lazarsfeld, Paul F. (1955) 1962 Interpretation of Statistical Relations as a Research Operation. Pages 115–125 in Paul F. Lazarsfeld and Morris Rosenberg (editors), The Language of Social Research: A Reader in the Methodology of Social Research. New York: Free Press.
Lewis, B. N. 1962 On the Analysis of Interaction in Multi-dimensional Contingency Tables. Journal of the Royal Statistical Society Series A 125:88–117.
Rhodes, Albert L.; Reiss, Albert J., Jr.; and Duncan, Otis Dudley 1965 Occupational Segregation in a Metropolitan School System. American Journal of Sociology 70:682–694.
Somers, Robert H. 1962 A New Asymmetric Measure of Association for Ordinal Variables. American Sociological Review 27:799–811.
Somers, Robert H. 1964 Simple Measures of Association for the Triple Dichotomy. Journal of the Royal Statistical Society Series A 127:409–415.
Srole, Leo 1956 Social Integration and Certain Corollaries: An Exploratory Study. American Sociological Review 21:709–716.
Tate, Robert F. 1955 Applications of Correlation Models for Biserial Data. Journal of the American Statistical Association 50:1078–1095.
Templeton, Fredric 1966 Alienation and Political Participation: Some Research Findings. Public Opinion Quarterly 30:249–261.
Torgerson, Warren S. 1958 Theory and Methods of Scaling. New York: Wiley.
Wilson, Alan B. 1959 Residential Segregation of Social Classes and Aspirations of High School Boys. American Sociological Review 24:836–845.
Yule, G. Udny 1912 On the Methods of Measuring Association Between Two Attributes. Journal of the Royal Statistical Society 75:579–652. → Contains ten pages of discussion on Yule's paper.
DESCRIPTIVE STATISTICS
Descriptive statistics summarize the distribution and salient features of a set of data. Researchers use descriptive statistics to organize and describe the data of a sample or population. The characteristics of a sample are statistics, while those of a population are parameters. Descriptive statistics are usually used to describe the characteristics of a sample; the procedures and methods for generalizing from sample statistics to population parameters constitute statistical inference. Descriptive statistics do not include statistical inference.
Though descriptive statistics are usually used to examine the distribution of a single variable, they may also be used to measure the relationship between two or more variables. That is, descriptive statistics may describe either a univariate distribution or a bivariate relationship. The level of measurement of a variable (nominal, ordinal, interval, or ratio) also influences the choice of method.
DATA DISTRIBUTION
To describe a set of data effectively, one should order the data and examine the distribution. For a small data set, an eyeball examination of the ordered array is often sufficient. For a large data set, tables and graphs are necessary aids.
Tabulation. A table expresses the distribution in counts or rates. A frequency table displays the distribution of one variable: it lists attributes, categories, or intervals along with the number of observations in each. Data expressed in a frequency distribution are grouped data. For a large data set, it is easier to examine central tendency and dispersion with grouped data than with ungrouped data. Data are usually categorized into intervals that are mutually exclusive, so that each case or data point falls into one category only. Displaying the frequency distribution of quantitative or continuous variables by intervals is especially efficient. For example, the frequency distribution of age in an imaginary sample can be seen in Table 1.
Here, age has been categorized into five mutually exclusive intervals: 15 and below, 16–20, 21–25, 26–30, and 31–35. Any age falls into one category only. This display is very efficient for understanding the age distribution in our imaginary sample. The distribution shows that twenty cases are aged fifteen or younger, twenty-five cases are sixteen to twenty years old, thirty-six cases are twenty-one to twenty-five years old, twenty cases are twenty-six to thirty years old, and nineteen cases are thirty-one to thirty-five years old. To compare categories or intervals, and to compare different samples or populations, reporting the percent or relative frequency of each category is important. The third column shows the percent of the sample in each interval or category; for example, 30 percent of the sample falls into the range of twenty-one to twenty-five years old. The fourth column shows the proportion of observations in each interval or category; this proportion is called the relative frequency. The cumulative frequency, cumulative percent, and cumulative relative frequency are other common elements in frequency tabulation. They are the sums of the counts, percents, or proportions at or below the corresponding category or interval. For instance, the cumulative frequency at age thirty shows that 101 persons, or 84.2 percent of the sample, are age thirty or younger.
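As an illustration, the following Python sketch builds such a frequency table. The ages are invented so that the counts reproduce Table 1; the variable names are likewise illustrative.

```python
from bisect import bisect_left

# Invented ages whose counts match Table 1 (20, 25, 36, 20, 19).
ages = [14] * 20 + [18] * 25 + [23] * 36 + [28] * 20 + [33] * 19

labels = ["15 and below", "16-20", "21-25", "26-30", "31-35"]
upper = [15, 20, 25, 30, 35]            # upper bound of each interval

counts = [0] * len(labels)
for age in ages:
    counts[bisect_left(upper, age)] += 1  # first interval whose bound >= age

n, cum = len(ages), 0
for label, c in zip(labels, counts):
    cum += c
    print(f"{label:13s} {c:3d} {100 * c / n:5.1f}% "
          f"rel={c / n:.2f} cum={cum:3d} ({100 * cum / n:.1f}%)")
```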
The frequency distribution displays one variable at a time. To study the joint distribution of two or more variables, we cross-tabulate them first. For example, the joint distribution of age and sex in the imaginary sample can be expressed in Table 2.
This table is a two-dimensional table: age is the column variable and sex is the row variable. We call it a "two-by-five" table: two categories for sex and five categories for age. The marginal frequencies can be seen as the frequency distributions of the corresponding variables; for example, there are fifty-seven men in this sample. The marginal frequencies for age are called column frequencies, and the marginal frequencies for sex are called row frequencies. The joint frequency of age and sex is the cell frequency; for example, there are seventeen women twenty-one to twenty-five years old in this sample. The second number in each cell is the column percentage, that is, the cell frequency divided by the column frequency, multiplied by 100; for example, 47 percent of the group of twenty-one- to twenty-five-year-olds are women. The third number in each cell is the row percentage, that is, the cell frequency divided by the row frequency, multiplied by 100; for example, 27 percent of women are twenty-one to twenty-five years old. The row and column percentages are useful in examining the distribution of one variable conditional on the other variable.
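A cross-tabulation with column and row percentages can be computed in the same spirit. The sketch below rebuilds the counts of Table 2 from hypothetical individual-level records (the record layout is an illustrative assumption):

```python
from collections import Counter

# Hypothetical (sex, age group) records reproducing the counts in Table 2.
records = (
      [("male", "15 and below")] * 9    + [("male", "16-20")] * 12
    + [("male", "21-25")] * 19          + [("male", "26-30")] * 9
    + [("male", "31-35")] * 8
    + [("female", "15 and below")] * 11 + [("female", "16-20")] * 13
    + [("female", "21-25")] * 17        + [("female", "26-30")] * 11
    + [("female", "31-35")] * 11
)

cells = Counter(records)                # joint (cell) frequencies
rows = Counter(s for s, _ in records)   # row margin: sex
cols = Counter(g for _, g in records)   # column margin: age group

f = cells[("female", "21-25")]
print(f, 100 * f / cols["21-25"], 100 * f / rows["female"])
# 17 women aged 21-25: 47.2% of that age column, 27.0% of all women
```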
Charts and Graphs. Charts and graphs are efficient ways to show data distribution. Popular
Table 1 — Age distribution of an imaginary sample

| age | frequency | percent | relative frequency | cumulative frequency | cumulative percent |
|---|---|---|---|---|---|
| 15 and below | 20 | 16.7 | .17 | 20 | 16.7 |
| 16–20 | 25 | 20.8 | .21 | 45 | 37.5 |
| 21–25 | 36 | 30.0 | .30 | 81 | 67.5 |
| 26–30 | 20 | 16.7 | .17 | 101 | 84.2 |
| 31–35 | 19 | 15.8 | .16 | 120 | 100 |
| total | 120 | 100 | 1.0 | | |
graphs for single variables are bar graphs, histograms, and stem-and-leaf plots. The bar graph shows the relative frequency distribution of a discrete variable. A bar is drawn over each category, with the height of the bar representing the relative frequency of observations in that category. The histogram can be seen as a bar graph for a continuous variable. By connecting the midpoints of the tops of all bars, a histogram becomes a frequency polygon. Histograms effectively show the shape of the distribution.
Stem-and-leaf plots represent each observation by its higher digit(s) and its final digit. The value of the higher digits is the stem, while the value of the final digit of each observation is the leaf. The stem-and-leaf plot conveys the same information as the bar graph or histogram; additionally, it tells the exact value of each observation. Despite providing more information than bar graphs and histograms, stem-and-leaf plots are used mostly for small data sets.
Other frequently used graphs include line graphs, ogives, and scatter plots. Line graphs and ogives often show the relationship between time and a variable; the line graph usually shows trends. The ogive is a line graph of the cumulative relative frequency or percentage and is commonly used for survival data. The scatter plot shows the relationship between two variables. In a two-dimensional scatter plot, the x and y axes mark the values of the data. Conventionally, we use the horizontal axis (x-axis) for the explanatory variable and the vertical axis (y-axis) for the outcome variable. The plane is divided into four quadrants by the two axes. For continuous variables, the value at the intersection of the two axes is zero; values ascend as the x-axis goes to the right or the y-axis goes up, and descend as the x-axis goes to the left or the y-axis goes down. The data points, determined by the joint values of the two variables, are scattered across the quadrants or along the axes.
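As a minimal sketch, a two-dimensional scatter plot can be drawn with the widely used matplotlib library (the data values here are invented for illustration):

```python
import matplotlib.pyplot as plt

x = [1.0, 2.0, 3.5, 4.0, 5.5]   # explanatory variable (horizontal axis)
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # outcome variable (vertical axis)

plt.scatter(x, y)
plt.xlabel("explanatory variable")
plt.ylabel("outcome variable")
plt.savefig("scatter.png")      # write the figure to a file
```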
SUMMARY STATISTICS
We may use measures of central tendency and dispersion to summarize the data. To measure the central tendency of a distribution is to measure its center or typicality. To measure the dispersion of a distribution is to measure its variation, heterogeneity, or deviation.
Central Tendency. Three popular measures of central tendency are the mean, median, and mode. The arithmetic mean, or average, is computed by taking the sum of the values and dividing by the number of values. It is the balance point of the sample or population. The mean is an appropriate measure for continuous (ratio or interval) variables. However, it can be misleading because the arithmetic mean is sensitive to extreme values or outliers in a distribution. For example, the ages of five students are 21, 19, 20, 18, and 20. The ages of another five students are 53, 9, 12, 13, and 11. Though the distributions are very different, the mean age of both groups is 19.6.
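The example can be verified directly with Python's standard library:

```python
from statistics import mean

group_1 = [21, 19, 20, 18, 20]  # homogeneous ages
group_2 = [53, 9, 12, 13, 11]   # contains an outlier (53)

# Both print 19.6: identical means from very different distributions.
print(mean(group_1), mean(group_2))
```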
Median is the value or attribute of the central case in an ordered distribution. If the number of cases is even, the median is the arithmetic average of the central two cases. In an ordered age distribution of thirty-five persons, the median is the age of the eighteenth person, while, in a distribution of thirty-six persons, the median is the average age of
Table 2 — Joint distribution of age and sex in an imaginary sample (each cell shows the frequency, column percentage, and row percentage)

| sex | 15 and below | 16–20 | 21–25 | 26–30 | 31–35 | total |
|---|---|---|---|---|---|---|
| male | 9 | 12 | 19 | 9 | 8 | 57 |
| | 45.0% | 48.0% | 52.8% | 45.0% | 42.1% | 47.5% |
| | 15.8% | 21.1% | 33.3% | 15.8% | 14.0% | |
| female | 11 | 13 | 17 | 11 | 11 | 63 |
| | 55.0% | 52.0% | 47.2% | 55.0% | 57.9% | 52.5% |
| | 17.5% | 20.6% | 27.0% | 17.5% | 17.5% | |
| total | 20 | 25 | 36 | 20 | 19 | 120 |
| | 16.7% | 20.8% | 30.0% | 16.7% | 15.8% | 100% |
the eighteenth and nineteenth persons. The median, like the mean, tells only the value of the physical center of an array of numbers; it cannot tell the dispersion. For example, the median of 21, 25, 30, and 100 is 27.5, and the median of 0, 27, 28, and 29 is also 27.5, but the two distributions are quite different. The mode is the most common value, category, or attribute in a distribution. Like the median, the mode has its limitations. For a set of values of 0, 2, 2, 4, 4, 4, 4, 5, and 10, the mode is four. For a set of values of 0, 0, 1, 4, 4, 4, 5, and 6, the mode is also four. One cannot tell one distribution from the other by examining the mode or median alone. The mode and median can be used to describe the central tendency of both continuous and discrete variables, and the values of the mode and median are less affected by extreme values or outliers than the mean.
One may also use the upper and lower quartiles and percentiles to describe a distribution. The nth percentile is a number such that n percent of the distribution falls below it and (100−n) percent falls above it. The lower quartile is the twenty-fifth percentile, the upper quartile is the seventy-fifth percentile, and the median is the fiftieth percentile. For example, for a set of values of 1, 2, 3, 4, 5, 6, 7, and 8, the lower quartile (twenty-fifth percentile) is two and the upper quartile (seventy-fifth percentile) is seven under one common convention; interpolation methods differ slightly. The quartiles and percentiles can provide more information about a distribution than the other measures of central tendency.
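These measures are equally easy to compute; the sketch below uses Python's statistics module. Note that statistics.quantiles interpolates, so its quartiles for the values 1 through 8 differ slightly from the simple textbook values of two and seven.

```python
from statistics import median, mode, quantiles

print(median([21, 25, 30, 100]))           # 27.5
print(median([0, 27, 28, 29]))             # 27.5: same median, different data
print(mode([0, 2, 2, 4, 4, 4, 4, 5, 10]))  # 4

# 25th, 50th, and 75th percentiles (interpolated, so not exactly 2 and 7).
print(quantiles([1, 2, 3, 4, 5, 6, 7, 8], n=4))
```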
Dispersion. The central tendency per se does not provide much information about a distribution; combining measures of central tendency and dispersion is far more informative. The most popular measures of dispersion are the range, standard deviation, and variance. The range is a crude measure of a distribution, reported either as the span from the lowest value to the highest value or as the difference between them. For example, the range for a set of values of 1, 2, 3, 4, and 5 is one to five, or a difference of four. The range is sensitive to extreme values and may not provide sufficient information about the distribution. Alternatively, dispersion can be measured by the distance between each value and the mean. The standard deviation is defined as the square root of the arithmetic mean of the squared deviations from the mean. For example, the standard deviation for a set of values of 1, 2, 3, 4, and 5 is 1.41. The deviations from the mean are squared because their sum is always zero; taking the square root of the average squared deviation returns the measure to the original units. The variance is the square of the standard deviation; in the previous array of numbers the variance is two. The standard deviation is used as a standardized unit in statistical inference. Compared with the standard deviation, the unit of the variance is not substantively meaningful; the variance is valuable, however, in explaining relationships between variables. Mathematically, the variance governs the spread of the normal curve, while the standard deviation measures the average distance between the mean and the data points. Since they are derived from distances from the mean, the standard deviation and variance are sensitive to extreme values.
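The computation for the example above, written out in plain Python:

```python
from math import sqrt

data = [1, 2, 3, 4, 5]
m = sum(data) / len(data)                          # mean = 3.0
var = sum((x - m) ** 2 for x in data) / len(data)  # variance = 2.0
sd = sqrt(var)                                     # about 1.41

print(m, var, round(sd, 2))
```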
The interquartile range (IQR) and the mean absolute deviation (MAD) are also commonly used to measure dispersion. The IQR is defined as the difference between the first and third quartiles; it is more stable than the range. The MAD is the average of the absolute deviations of the observations from the mean. Like the standard deviation, the MAD avoids the problem that the sum of the deviations from the mean is zero, but it is not as useful in statistical inference as the variance and standard deviation.
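Both measures follow directly from their definitions; a short sketch for the same five values:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]
m = sum(data) / len(data)

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                    # 4 - 2 = 2.0
mad = sum(abs(x - m) for x in data) / len(data)  # 6 / 5 = 1.2

print(iqr, mad)
```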
Bivariate Relationship. One may use the covariance and the correlation coefficient to measure the direction and size of a relationship between two variables. The covariance is defined as the average product of the deviations of the two variables from their respective means; it reports the extent to which the variables vary together. A positive covariance suggests that as the value of one variable increases, the value of the other tends to increase; a negative covariance suggests that as one increases, the other tends to decrease. The correlation coefficient is defined as the ratio of the covariance to the product of the standard deviations of the two variables; it can be seen as the covariance rescaled by the standard deviations of both variables. The value of the correlation coefficient ranges from −1 to 1, where zero means no correlation, −1 means perfectly negatively related, and 1 means perfectly positively related. The covariance and correlation measure the bivariate relationship between continuous variables. Many measures of association between categorical variables are calculated from cell frequencies or percentages in a cross-tabulation, for example, Yule's Q, phi, Goodman and Kruskal's tau and gamma, and Somers' d. Though measures of association alone show the direction and size of a bivariate relationship, testing for the existence of such a relationship is a matter of statistical inference.
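The definitions translate directly into code; the following sketch (with invented data) computes the population covariance and the correlation coefficient:

```python
from math import sqrt

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    # Average product of the paired deviations from the means.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Covariance rescaled by the standard deviations of both variables.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y), correlation(x, y))  # 1.2 and about 0.77
```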
RELATIONSHIPS BETWEEN GRAPHS AND SUMMARY STATISTICS
The box plot is a useful tool for summarizing a distribution and its summary statistics simultaneously. It consists of a rectangular box, divided by a line, with two lines extending from the ends of the box. The ends of the box mark the upper and lower quartiles, the extended line attached to each quartile shows the range of the distribution on that side, and the line dividing the box marks the median. The plot can be placed vertically or horizontally. The box plot became popular because it expresses the center and spread of the data at the same time, and several boxes may be placed next to one another for comparison.
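A box plot can be produced with matplotlib; the data here are invented for illustration:

```python
import matplotlib.pyplot as plt

data = [2, 3, 5, 7, 8, 9, 12, 13, 14, 18, 21]
plt.boxplot(data, vert=False)  # horizontal box plot
plt.savefig("boxplot.png")
```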
The order of the mode, median, and mean is related to the shape of the distribution of a continuous variable. If the mean, median, and mode are equal to one another, the histogram approximates a bell curve. A uniform distribution, in which all cases are equally distributed among all values and the three measures of central tendency are also equal, has a square shape, with the range as its width and the counts or relative frequencies as its height. In a bimodal distribution, the two modes lie at the two ends of the distribution, equally distant from the center, where the median and mean are located. We seldom see truly bell-curved, uniform, or bimodal distributions; most distributions are more or less skewed to the left or to the right. If the mean is greater than the median and the median is greater than the mode, the shape is skewed to the right. If the mean is smaller than the median and the median is smaller than the mode, the shape is skewed to the left. Outliers largely determine the direction of the skew.
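The relation between the mean and the median gives a quick numerical check on skewness (invented data):

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large outlier on the right
# mean (4.75) > median (3.0): the distribution is skewed to the right.
print(mean(right_skewed), median(right_skewed))
```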
The shape and direction of a scatter plot help diagnose the relationship between two variables. When the points trend from the lower-left to the upper-right side, the correlation coefficient is positive; when they trend from the upper-left to the lower-right side, the correlation coefficient is negative. The correlation of a loosely scattered plot is weaker than that of a tightly scattered plot. A three-dimensional scatter plot can be used to show either a bivariate relationship together with its frequency distribution or a relationship among three variables; the former is commonly used to examine a joint distribution.
Descriptive statistics are the first step in studying a data distribution. Omitting this step, one might misuse more advanced methods and be led to wrong estimates and conclusions. Some summary statistics, such as the standard deviation, variance, mean, correlation, and covariance, are also essential elements in statistical inference and advanced statistical methods.
Daphne Kuo
Descriptive Statistics
Descriptive statistics, which are widely used in empirical research in the social sciences, summarize features of a data set. The data set nearly always consists of lists of numbers describing a population. Descriptive statistics summarize the information in the data with simple measures and can thus represent large data sets compactly. However, incautious use of descriptive statistics can produce a distorted picture of the data by leaving out potentially important details.
THE HISTOGRAM
Descriptive statistics take as their starting point observations from a population. Suppose we have observed n > 1 draws from a population, and let x₁, …, xₙ denote these observations. These could, for example, come from a survey of income levels in n individual households, in which case x₁ would be the income level of the first household, and so forth. One way of summarizing the observations is through the distribution of the data, which gives the frequency of individual observations. The distribution is calculated by grouping the raw observations into categories according to ranges of values. As a simple example, Table 1 reports the distribution of a data set of income levels for 1,654 households in the United Kingdom. The data set has been grouped into five income categories, representing income in U.S. dollars within the following ranges: $0–$700; $701–$1,400; $1,401–$2,100; $2,101–$2,800; and $2,801–$3,500. The second row in Table 1 shows the number of households in each income
Table 1 — Distribution of weekly salaries

| Weekly salary ($) | 0–700 | 701–1,400 | 1,401–2,100 | 2,101–2,800 | 2,801–3,500 |
|---|---|---|---|---|---|
| Number of households | 1,160 | 429 | 41 | 17 | 7 |
| Percentage of households (%) | 70.13 | 25.94 | 2.48 | 1.03 | 0.42 |

SOURCE: UK Family Expenditure Survey, 1995.
range. The corresponding frequencies, given in the third row, are found by dividing each cell count by the total number of observations.
One can also present the frequencies as a graph. This type of graph is normally referred to as a histogram. The frequencies in Table 1 are depicted as a histogram in Figure 1.
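The grouped frequencies from Table 1 can be drawn with matplotlib; because the data are already grouped, the histogram is rendered here as adjacent bars over the intervals:

```python
import matplotlib.pyplot as plt

# Grouped frequencies from Table 1.
labels = ["0-700", "701-1400", "1401-2100", "2101-2800", "2801-3500"]
households = [1160, 429, 41, 17, 7]

plt.bar(labels, households, width=1.0, edgecolor="black")
plt.xlabel("Weekly salary ($)")
plt.ylabel("Number of households")
plt.savefig("histogram.png")
```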
SUMMARY STATISTICS
An even more parsimonious representation of the data set can be achieved through summary statistics. The most typical ones are measures of the center and dispersion of the data. Other standard summary statistics are skewness and kurtosis.
The three most popular measures of the center of the distribution are the mean, median, and mode. The mean, or average, is calculated by adding up all the observed values and dividing by the number of observations:

x̄ = (1/n) ∑ xᵢ,

where the summation runs from 1 to n.
The median represents the middle of the set of observations when these are ordered by value. Thus, 50 percent of the observations are smaller and 50 percent are greater than the median. Finally, the mode is calculated as the most frequently occurring value in the data set.
The dispersion of the data set tells how much the observations are spread around the center. Three frequently used measures of this are the variance (and its associated standard deviation), the mean deviation, and the range. The variance (VAR) is calculated as the sum of squared deviations from the mean, divided by the number of observations:

VAR = (1/n) ∑ (xᵢ − x̄)².
The standard deviation (SD) is the square root of the variance, SD = √VAR. The mean deviation (MD) measures the average absolute deviation from the mean, MD = (1/n) ∑ |xᵢ − x̄|.
The range is calculated as the highest minus the lowest observed value. The range is very sensitive to extremely large or extremely small values (outliers) and may therefore not always give an accurate picture of the data.
Skewness is a measure of the degree of asymmetry of the distribution relative to the center. Roughly speaking, a distribution has positive skew if it has a long right tail, with most of the observations concentrated to the left of the mean, and negative skew if it has a long left tail. Skewness is calculated as:

skewness = (1/n) ∑ (xᵢ − x̄)³ / SD³.
Kurtosis measures the "peakedness" of the distribution. Higher kurtosis means that more of the variance is due to infrequent extreme deviations. Kurtosis is calculated as:

kurtosis = (1/n) ∑ (xᵢ − x̄)⁴ / SD⁴.
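All of these summary statistics follow directly from the formulas above; a compact Python sketch (the function name and test data are invented):

```python
from math import sqrt

def summarize(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n             # VAR
    sd = sqrt(var)                                         # SD
    md = sum(abs(x - mean) for x in xs) / n                # mean deviation
    skew = sum((x - mean) ** 3 for x in xs) / (n * sd**3)  # skewness
    kurt = sum((x - mean) ** 4 for x in xs) / (n * sd**4)  # kurtosis
    return mean, var, sd, md, skew, kurt

print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
# (5.0, 4.0, 2.0, 1.5, 0.65625, 2.78125)
```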
SEE ALSO Mean, The; Mode, The; Moment Generating Function; Random Samples; Standard Deviation
Dennis Kristensen