Tabular Analysis
In its most general form, tabular analysis includes any analysis that uses tables, in other words, almost any form of quantitative analysis. In this article, however, it refers only to the analysis of categorical variables (both nominal and ordered) when that analysis relies on cross-classified tables in the form of frequencies, probabilities, or conditional probabilities (percentages). In general, the use of such cross-tabulated data is practical only with variables that have a limited number of categories. Therefore, this article deals with some of the analytic problems of categorical data analysis. Although it sometimes is difficult to separate analysis from methods of data presentation, the emphasis here is decidedly on analysis (see Davis and Jacobs 1968).
Tabular analysis can take many different forms, but two methods deserve special attention. The first is known as subgroup analysis. The underlying logic of this type of analysis was codified under the name "elaboration paradigm" by Lazarsfeld and his colleagues (Kendall and Lazarsfeld 1950; Lazarsfeld 1955; Hyman 1955; Rosenberg 1968; Lazarsfeld et al. 1972; Zeisel 1985). Because of the simplicity of the method and the ease with which it can facilitate communication with others, subgroup analysis has been the mainstay of research reports dealing with categorical data.
The second is based on the use of log-linear and related models and has become increasingly popular (Bishop et al. 1975; Goodman 1978; Haberman 1978, 1979; Fienberg 1980; Agresti 1984, 1990; Clogg and Shihadeh 1994). (For other related models, see McCullagh and Nelder 1983; Thiel 1991; Long 1997; Press and Wilson 1998.) This method is flexible, can handle more complex data (with many variables), and is more readily amenable to statistical modeling and testing (Clogg et al. 1990). For these reasons, the log-linear method is rapidly emerging as the standard method for analyzing multivariate categorical data. Its results, however, are not easily accessible because the resulting tabular data are expressed as multiplicative functions of the parameters (i.e., log-linear rather than linear), and the parameters of these models tend to obscure descriptive information that often is needed in making intelligent comparisons (Davis 1984; Kaufman and Schervish 1986; Alba 1988; Clogg et al. 1990).
These two methods share a set of analytic strategies and problems and are complementary in their strengths and weaknesses. To understand both the promises and the problems of tabular analysis, it is important to understand the logic of analysis and the problems that tabular analyses share with the simpler statistical analysis of linear systems. As a multivariate analysis tool, tabular analysis faces the same problems that other well-developed linear statistical models face in analyzing data collected under less than ideal experimental conditions. It is therefore important to have a full understanding of this foundation, and the best way to do that is to examine the simplest linear system.
STATISTICAL CONTROLS, CAUSAL ORDERING, AND IMPROPER SPECIFICATIONS
Consider the simplest linear multivariate system:

$$Y = b_{yx \cdot z} X + b_{yz \cdot x} Z + e \qquad (1)$$
where all the variables, including the error term, are assumed to be measured from their respective means. When this equation is used merely to describe the relationship between a dependent variable Y and two other variables X and Z, the issue of misspecification (in other words, whether the coefficients accurately reflect an intended relationship) does not arise, because the coefficients are well-known partial regression coefficients. However, when the linear model depicted in equation (1) is considered a representation of an underlying theory, these coefficients receive meaning under that theory. In that case, the issue of whether the coefficients really capture the intended relationship becomes important. Causal relationships are not the only important relationships, but it is informative to examine this equation with reference to such relationships, since a causal system is the type of system implicitly implied here.
Many different conceptions of causality exist in the literature (Blalock 1964, 1985a, 1985b; Duncan 1966, 1975; Simon 1954, 1979; Heise 1975; Mosteller and Tukey 1977; Bunge 1979; Singer and Marini 1987). However, the one undisputed criterion of causality seems to be the existence of a relationship between manipulated changes in one variable (X) and attendant changes in another variable (Y) in an ideal experiment. That is, a causal connection exists between X and Y if changes in X and X alone produce changes in Y. This is a very restrictive criterion and may not be general enough to cover all important cases, but it is sufficient as a point of reference. This definition is consistent with the way in which effects are measured in controlled experiments. In general, even in an ideal experiment, it is often impossible to eliminate or control all the variations in other variables, but their effects are made random by design. A simple linear causal system describing a relationship produced in an ideal experiment thus takes the following familiar form:

$$Y = d_{yx} X + e \qquad (2)$$
where e stands for all the effects of other variables that are randomized. The randomization makes the expected correlation between X and e zero. (Without loss of generality, it is assumed that all the variables [X, Y, and e] are measured as deviations from their respective means.) For the sake of simplicity, it is assumed for now that Y does not affect X. (For an examination of causal models dealing with reciprocal causation and with more complex systems in general, see Fisher 1966; Goldberger and Duncan 1973; Alwin and Hauser 1975; Duncan 1975; Blalock 1985a, 1985b.)
The coefficient dyx measures the expected change in Y given a unit change in X. It does not matter whether changes in X affect other variables and whether some of those variables in turn affect Y. As long as all the changes in Y ultimately are produced by the manipulated initial changes in X and X alone, X receives total credit for them. Therefore, dyx is a coefficient of total causal effect (referred to as an effect coefficient for short).
The customary symbol for a simple regression coefficient, byx, is not used in equation (2) because byx is equivalent to dyx only under these very special conditions. If one uses a simple regression equation in a form similar to equation (2) above and assumes that byx is equivalent to dyx, the model is misspecified as long as the data do not meet all the assumptions made about the ideal experiment. Such errors in model specification yield biased estimates in general. Implications of some specification errors may be trivial, but they can be serious when one is analyzing nonexperimental data (see Kish 1959; Campbell and Stanley 1966; Leamer 1978; Cook and Campbell 1979; Lieberson 1985; Arminger and Bohrnstedt 1987).
Many underlying causal systems are compatible with the three-variable linear equation shown above. For the purpose at hand, it is enough to examine the simple causal systems shown in Figure 1. These causal systems imply critical assumptions about the error term and the causal ordering. If these assumptions are correct, there is a definite connection between the underlying causal parameters and the regression coefficients in equation (1). However, if some of these assumptions are wrong, equation (1) is a misrepresentation of the assumed causal model (for a fuller description of other possible systems, see Duncan 1975).
The notation for causal hierarchy (≥) means that the preceding variable may affect the variables after it, but variables after (≥) may not affect the preceding variables. A connecting arrow between two variables indicates both the existence and the direction of effects; lack of a connecting arrow indicates no known effects. (For convenience, these diagrams do not show random errors, but their presence is assumed.)
For each causal system in Figure 1, the key relationships among simple regression coefficients, partial regression coefficients, and effect coefficients are listed below each causal diagram. Look at the simple causal chain (or a cascading system) shown in A1, for instance. The introduction of Z as a control variable has no effect on the observed relationship between X and Y. Note also that the simple regression coefficient is equivalent to the effect coefficient (byx = byx·z = dyx); similarly, the simple byz is equivalent to dyz, but the partial byz·x becomes zero. (If one were to control Y, the X–Z relationship would not change, but such control is superfluous given the assumptions about the causal ordering.) In fact, one could argue that these two conditions, given the assumptions about the causal hierarchy, uniquely define a simple causal chain. If the control variable Z enters the X–Y causal system only through X (or the effects of a set of variables are mediated completely through [an]other variable[s] in the system), there is no need to introduce Z (or a set of such variables) as a control to correctly specify the X–Y relationship.
In A2, the two partials (byx·z and byz·x) are different from the respective bivariate coefficients (byx and byz). The key point is that the partial byx∙z is equivalent to dyx, while the partial between Z and Y (byz∙x) simply reflects the portion of the causal effect from Z to Y that is not mediated by X.
In A3, there is no direct connection between X and Y once the effect of Z is controlled: The observed bivariate relation between X and Y is spurious or, more accurately, the observed association between X and Y is explained by the existence of a common cause. In this case, the introduction of Z, controlling its effects on both X and Y, is critical in ascertaining the true causal parameter of the system (dyx), which happens to be zero.
All the causal systems shown in B share similar patterns with A; the pattern of the relationship between the bivariate coefficients and the partials remains the same. For this reason, the X–Y relationship in particular is examined in the same way by introducing Z as a control variable regardless of the specification of causal hierarchy between X and Z. Note in particular that introducing Z as a control variable in B1 and B3 is a misspecification of the model, but such misspecifications (including an irrelevant variable in the equation) do not lead to biased estimation (for a related discussion, see Arminger and Bohrnstedt 1987).
The systems shown in C do not require additional comments. Except for the changes in the order of the two variables X and Z, they are exact replicas of the systems in A. The resulting statistics show the same patterns observed in A. Nevertheless, the attendant interpretation of the results is radically different. For instance, when the partial byx∙z disappears, one does not consider that there is no causal relationship between X and Y; instead, one's conviction about the causal relationship is reinforced by the fact that an intervening causal agent is found.
In summary, the assumptions about the causal ordering play a critical role in the interpretation of the coefficients of the linear model shown in equation (1). The assumptions about the order must come from outside knowledge.
There is one more type to note. All the systems examined so far are linear and additive. The partial coefficients reflect the expected change in the dependent variable given a unit change in a given independent variable while the other independent variables are kept constant. If two or more independent variables interact, such simplicity does not exist. A simple example of such a system is given below:

$$Y = b_1 X_1 + b_2 X_2 + b_3 (X_1 X_2) + e \qquad (3)$$
which is the same as equation (1) except for the simplification of labels for the variables and coefficients and the addition of a multiplicative term (X1·X2).
The partial for X1 in such a system, for example, no longer properly represents the expected change in Y for a unit change in X1, even if the assumptions about the causal order are correct. A partial differentiation of the equation with respect to X1, for instance, gives b1 + b3X2, which implies that the rate of change introduced by a change in X1 also depends on the values of the other causal variable (X2) and the associated coefficient (b3). One therefore cannot interpret the individual coefficients as measuring something independently of the others. This point is important for a fuller understanding of the log-linear models introduced below, because a bivariate relationship is represented by interaction terms. The notion of control often invoked with ceteris paribus (other things being unchanged) also becomes ambiguous.
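In the notation of equation (3), the point can be stated compactly as

$$\frac{\partial Y}{\partial X_1} = b_1 + b_3 X_2,$$

so the "effect" of X1 cannot be quoted without specifying a value of X2.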
The logic of causal analysis for additive systems can be extended easily to a system with more variables. If the assumptions about the causal order, the form of the relationship, and the random errors are correct, one can identify the causal parameters, such as dyx, and decompose the linear connection between any pair of variables into spurious (noncausal) and genuine (causal, dyx) components, and the latter into indirect (mediated) and direct (residual) components.
To identify dyx, one must control all the potentially relevant variables that precede X in causal ordering but not the variables that might intervene between X and Y. Under this assumption, then, the partial byx·(z···), where the variables in parentheses represent all such "antecedent" variables, is equivalent to dyx. In identifying this component, one must not control the variables that X may affect; these variables may work as mediating causal agents and transmit part of the effect of X to Y.
The partial of a linear system in which both antecedent variables (Zs) and intervening variables (Ws) are included, byx·(z··· w···), will represent the residual causal connection between X and Y that is not mediated by any of the variables included in the model. As more Ws are included, this residual component may change. However, the linear representation of a causal system without these additional intervening variables is not misspecified. By contrast, if the introduction of additional Zs would change the X–Y partial, the omission of such variables from the equation indicates a misspecification of the causal system, because some of the spurious components will be confounded with the genuine causal components.
For nonexperimental data, the problems of misspecification and misinterpretation are serious. Many factors may confound the relationships under consideration (Campbell and Stanley 1966; Cook and Campbell 1979; Lieberson 1985; Arminger and Bohrnstedt 1987; Singer and Marini 1987). There is no guarantee that a set of variables one is considering constitutes a closed system, but the situation is not totally hopeless. The important point is that one should not ignore these issues and assume away potentially serious problems. Selection biases, contagion effects, limited variations in the data, threshold effects, and so on, can be modeled if they are faced seriously (Rubin 1977; Leamer 1978; Hausman 1978; Heckman 1979; Berk 1983, 1986; Heckman and Robb 1986; Arminger and Bohrnstedt 1987; Long 1988; Bollen 1989; Xie 1989). Furthermore, this does not mean that one has to control (introduce) every conceivable variable. Once a few key variables are controlled, additional variables usually do not affect the remaining variables too much. (This observation is a corollary to the well-known fact that social scientists often have great difficulty finding any variable that can substantially improve R2 in regression analysis.)
FREQUENCY TABLES, CONDITIONAL PROBABILITIES, AND ODDS RATIOS
To fix ideas and make the following discussion concrete, it is useful to introduce basic notation and define two indicators of association for a bivariate table. Consider the simplest contingency table, one given by the cross-classification of two dichotomous variables. Let fij denote the observed frequencies; then the observed frequency distribution will have the following form:

 | x1 | x2 | total
y1 | f11 | f12 | f1·
y2 | f21 | f22 | f2·
total | f·1 | f·2 | N
Note the form of the marginal frequencies. Now let pij denote the corresponding observed probabilities: pij = fij/N. Let the uppercase letters, Fij and Pij, denote the corresponding expected frequencies and probabilities under some model or hypothesis.
If X and Y are statistically independent,

$$P_{ij} = P_{i \cdot} P_{\cdot j}, \qquad \text{equivalently} \qquad P(Y_i \mid X_j) = \frac{P_{ij}}{P_{\cdot j}} = P_{i \cdot}$$
That is, the conditional probability of Yi given Xj is the same as the marginal probability of Yi. Thus, a convenient descriptive indicator of statistical independence is that byx = p11/p·1 − p12/p·2 = 0. The percentage difference is simply 100 times byx. The symbol byx is quite appropriate in this case, for it is equivalent to the regression coefficient. The fact that byx ≠ 0 implies a lack of statistical independence between X and Y.
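As a minimal illustration (our own sketch, using hypothetical frequencies rather than data from the text), byx can be computed directly from the cell counts of a 2 × 2 table:

```python
# Percentage difference (b_yx) for a 2 x 2 table.
# Rows index Y (y1, y2); columns index X (x1, x2). Frequencies are hypothetical.
f = [[60, 40],   # y1: f11, f12
     [40, 60]]   # y2: f21, f22

col_totals = [f[0][0] + f[1][0], f[0][1] + f[1][1]]  # f.1 and f.2

# b_yx = p11/p.1 - p12/p.2: difference between the conditional probabilities of y1
b_yx = f[0][0] / col_totals[0] - f[0][1] / col_totals[1]
print(f"b_yx = {b_yx:.2f}")  # 0.20, i.e., a 20-point percentage difference
```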
Another equally good measure is the odds ratio or cross-product ratio:

$$\begin{aligned} t &= \frac{p_{11}/p_{21}}{p_{12}/p_{22}} \\ &= \frac{p_{11}/p_{12}}{p_{21}/p_{22}} \\ &= \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}} \end{aligned}$$
The first line shows that the odds ratio is a ratio of ratios. The second line shows that it is immaterial whether one starts with the odds in one direction or the opposite direction. The final line indicates that the odds ratio is equivalent to the cross-product ratio. In general, if all the odds ratios in a table for two variables are 1, the two variables are statistically independent; the converse is also true. The fact that t equals 1 thus implies that X is independent of Y. Therefore, both the odds ratio (t) and the percentage difference (byx) can serve equally well as descriptive indicators of association between variables.
Given that observed frequencies are unstable because of sampling variability, it is useful to test the null hypothesis that t = 1 (equivalently, byx = 0) in the population. Such a hypothesis is evaluated by using either the conventional chi-square statistic or the likelihood-ratio statistic L2 (−2 times the log of the likelihood ratio):

$$\chi^2 = \sum_{ij} \frac{(f_{ij} - F_{ij})^2}{F_{ij}}, \qquad L^2 = 2 \sum_{ij} f_{ij} \log\!\left(\frac{f_{ij}}{F_{ij}}\right)$$
These values are evaluated against the theoretical chi-square distribution with the appropriate degrees of freedom. The two tests are equivalent for large samples. (For a related discussion, see Williams 1976; Tamas et al. 1994.)
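The same hypothetical table can be used to compute the odds ratio and both test statistics. This sketch uses only the Python standard library; the expected frequencies Fij are those of the independence model:

```python
import math

# The same hypothetical 2 x 2 table: f[i][j], i indexing Y and j indexing X.
f = [[60, 40],
     [40, 60]]
N = sum(map(sum, f))
row_tot = [sum(row) for row in f]
col_tot = [f[0][j] + f[1][j] for j in range(2)]

# Odds (cross-product) ratio
t = (f[0][0] * f[1][1]) / (f[0][1] * f[1][0])

# Expected frequencies under statistical independence: F_ij = f_i. * f._j / N
F = [[row_tot[i] * col_tot[j] / N for j in range(2)] for i in range(2)]

chi_sq = sum((f[i][j] - F[i][j]) ** 2 / F[i][j] for i in range(2) for j in range(2))
L_sq = 2 * sum(f[i][j] * math.log(f[i][j] / F[i][j]) for i in range(2) for j in range(2))
print(t, round(chi_sq, 2), round(L_sq, 2))  # 2.25 8.0 8.05 (cf. Table 4a below)
```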
ELABORATION AND SUBGROUP ANALYSIS
The logic of linear systems that was presented earlier was introduced to social scientists through the elaboration paradigm and through an informal demonstration of certain patterns of relationship among variables (Kendall and Lazarsfeld 1950; Lazarsfeld 1955). Statistical control is achieved by examining relationships within each subgroup that is formed by the relevant categories of the control variable. The typical strategy is to start the analysis with an examination of the association between two variables of interest, say, X and Y. If there is an association of some sort between X and Y, the following two questions become relevant: (1) Is the observed relationship spurious or genuine? (2) If some part of the relationship is genuine, which variables mediate the relationship between the two? (The question of sampling variability is handled rather informally, relying on the magnitude of the percentage differences as a simple guide. Moreover, two variables that seemingly are unrelated at the bivariate level may show a stronger association after suppressor variables are controlled. Therefore, in some situations, applying such a test may be premature and uncalled for.)
To answer these questions adequately, one must have a fairly good knowledge of the variables under consideration and the implications of different causal systems. It is clear from the earlier examination of the linear causal systems that to answer the first question, one must examine the X–Y relationship while controlling for the factors that are antecedent to X (assuming that X ≥ Y). To answer the second question, one also must control factors that X may affect and that in turn may affect Y. Controlling for many variables is possible in theory but is impractical for two quite different reasons: (1) One runs out of cases very quickly as the number of subgroups increases, and (2) as the number of subgroups increases, so does the number of partial tables to examine and evaluate. Nevertheless, it is quite possible that one might find a strategically critical variable that might help explain the observed relationship either by proving that the observed relationship is spurious or by confirming a causal connection between the two variables.
To make the discussion more concrete, consider the hypothetical bivariate percentage table (Table 1a) between involvement in car accidents (Y) and the gender of the driver (X). The percentage difference (10% = 30% − 20%) indicates that men are more likely to be involved in car accidents while driving than are women. Because there are only two categories in Y, this percentage difference (byx) captures all the relationship in the table. Given the large sample size and the magnitude of the percentage difference, it is safe to assume that this is not an artifact of sampling variability.
Suppose a third variable (Z = amount of driving) is suspected to be related to both gender (X) and involvement in accidents (Y). It therefore is prudent to examine whether the X–Y relationship remains the same after the amount of driving is controlled or eliminated. Whether this conjecture is reasonable can be checked before one examines the three-variable subgroup analysis: There has to be some relationship between X and Z and between X and Y. Table 1b shows the bivariate relationship between gender (X) and driving (Z). Note that there is a very strong association: bzx = .333 (a 33.3 percentage-point difference between the genders).
The conditional tables may show one of the following four patterns: (1) the observed relationship between X and Y disappears within each subgroup: byx·z = 0; (2) the relationship remains the same: byx·z = byx; (3) the relationships change in magnitude but remain the same across the groups: byx·z(1) = byx·z(2) ≠ byx; (4) the X–Y relationship in one group is different from the relationship in the other group: byx·z(1) ≠ byx·z(2).
Table 1. Hypothetical Bivariate Tables
men | women | |
source: adapted from zeisel (1985), p. 146.
a) car accidents (y ) by gender (x) | ||
had at least one accident while driving | 30% | 20% |
never had an accident while driving | 70% | 80% |
total | 100% | 100% |
(number of cases) | (3,000) | (3,000) |
b) amount of driving (z ) by gender (x) | ||
more than 10,000 miles | 66.7% | 33.3% |
less than 10,000 miles | 33.3% | 66.7% |
total | 100% | 100% |
(number of cases) | (3,000) | (3,000) |
These examples are shown in Table 2. Compare these patterns with the corresponding causal systems shown in Figure 1.
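The mechanics of subgroup analysis amount to recomputing the percentage difference within each category of Z. A minimal sketch in Python (the data layout is our own; the counts follow the hypothetical Table 2a):

```python
# Conditional percentage differences b_yx.z within each category of Z,
# using the counts behind Table 2a: (accidents, cases) for each gender.
counts = {
    "> 10,000 miles": {"men": (800, 2000), "women": (400, 1000)},
    "< 10,000 miles": {"men": (100, 1000), "women": (200, 2000)},
}

for z, cell in counts.items():
    p_men = cell["men"][0] / cell["men"][1]
    p_women = cell["women"][0] / cell["women"][1]
    print(f"{z}: b_yx.z = {p_men - p_women:.2f}")  # 0.00 in both subgroups: pattern (1)
```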
Whether Z should be considered as antecedent or intervening depends on the theory one is entertaining. One's first interpretation might be that the original relationship has sexist implications in that it may mean that men are either more aggressive or less careful. Against such a hypothesis, the amount of driving is an extraneous variable. By contrast, one may entertain a social role theory stating that in this society men's roles require more driving and that more driving leads to more accidents. Then Z can be considered an intervening variable.
Pattern (1) will help undermine the psychological or biological hypothesis, and pattern (2) will enhance that hypothesis. Pattern (1) also will lend weight to the social role hypothesis. These patterns are the simplest to deal with but rarely are encountered in real life (see Lazarsfeld 1955; Rosenberg 1968; Zeisel 1985 for interesting examples). If one were lucky enough to come across such a pattern, the results would be considered important findings. Note that there are three causal systems in Figure 1 that share the same statistical pattern (the relationship between partials and original coefficients) with each of these two. Of course, the choice must be dictated by the theory and assumptions about the causal ordering that one is willing to entertain.
Patterns (3) and (4) are more likely outcomes in real life. In (3), the magnitude of the X–Y relationship within each subgroup is reduced. (Sometimes the X–Y relationship may turn out to be even stronger.) This pattern is compatible with three causal systems—A2, B2, and C2—in Figure 1. Assume that one takes the causal order indicated in C; that is, one takes the gender role theory to account for the observed relationship. Part of the original relationship (.04 out of .10) is mediated by the amount of driving, but a greater part (.06) remains unexplained. If one believes that all the difference in the accident rate has nothing to do with psychological or biological differences between the genders, one has several other potential role-related connections to consider: Men may drive more during the rush hours than women do, men may drive during worse weather conditions than women do, and so on. One could introduce these variables as additional controls. By contrast, if one believes in the validity of the psychological explanation, one could collect data on the aggressiveness of each individual and introduce aggressiveness as a control variable.
Table 2d illustrates a pattern in which the effects of the two explanatory variables interact: X's effect on Y varies across the categories of Z, and Z's effect on Y varies across the categories of X. A corresponding example in linear systems was given by equation (3). One must consider both variables at the same time because the effect of one variable depends on the other.
In general, empirical data may exhibit patterns that are mixtures of 2c and 2d. In cross-tabulations of variables with more than two categories, it is often not easy, purely on the basis of eyeballing, to discern the underlying pattern. At this point, there is a need for more refined and systematic tools. Moreover, in some instances, an application of a log-linear model may indicate patterns that are different from what a linear model (such as using percentage tables) might indicate.
Before ending this section, it should be mentioned that some examples in the literature use subgroup analysis as a full-fledged multivariate analysis tool.
Table 2. Percent Ever Had Accident (Y) by Gender (X) by Amount of Driving (Z)
note: number of cases for the percentage base are in parentheses. throughout these tables, bzx = .333 and byx = .10 remain constant. to assess the effect of each variable, compare percentages across the categories of that variable.
a) original x–y relationship disappears | |||
(compatible with causal systems a3, b3, and c1) | |||
amount of driving (z ) | |||
gender (x) | > 10,000 miles | < 10,000 miles | |
men | 40% (2,000) | 10% (1,000) | byx ∙z = 0 |
women | 40% (1,000) | 10% (2,000) | byz ∙x = .30 |
b) original x–y relationship unchanged | |||
(compatible with causal systems al, b1, and c3) | |||
gender (x) | > 10,000 miles | < 10,000 miles | |
men | 30% (2,000) | 30% (1,000) | byx ∙z = .10 |
women | 20% (1,000) | 20% (2,000) | byz ∙x = 0 |
c) original x–y relationship diminishes | |||
(compatible with causal systems a2, b2, and c2) | |||
gender (x) | > 10,000 miles | < 10,000 miles | |
men | 34% (2,000) | 24% (1,000) | byx ∙z = .06 |
women | 28% (1,000) | 18% (2,000) | byz ∙x = .10 |
d) x–y relationship varies | |||
gender (x) | > 10,000 miles | < 10,000 miles | |
men | 40% (2,000) | 20% (1,000) | byx ∙z (1) = .20 |
women | 20% (1,000) | 20% (2,000) | byx ∙z (2) = 0 |
byz ∙x (1) = .20 | |||
byz ∙x (2) = 0 |
For instance, Davis (1984) shows how the logic of elaboration can be combined with the standardization technique to derive, among other things, a decomposition of the relationship between the father's and the son's occupational statuses into spurious, mediated, and residual components, where Zs represent the father's education and the mother's education and W represents the son's education.
The power of subgroup analysis comes mainly from the close analogy between the percentage differences and the coefficients of the linear system illustrated in Figure 1, but its uses need not be confined to the analysis of causal systems. There are various applications of this logic to survey data (Hyman 1955; Rosenberg 1968; Zeisel 1985). These accounts remain one of the best sources for learning the method as well as the art of pursuing research ideas through the use of percentage tables.
ODDS RATIOS AND LOG-LINEAR MODELS
A more formal approach to categorical data analysis is provided by the log-linear model and related models (Bishop et al. 1975; Goodman 1978; Haberman 1978, 1979; Fienberg 1980; Agresti 1984; Clogg and Shihadeh 1994; Long 1997). Some of these models are not even log-linear (Clogg 1982a, 1982b; Goodman 1984, 1990; Wong 1995; Xie 1992). Only the log-linear models are examined here.
By means of an ingenious device, the log-linear model describes the relationships among categorical variables in a linear form. The trick is to treat the logarithms of the cell frequencies as the (titular) dependent variable and treat design vectors as independent variables. The design vectors represent relevant features of the contingency table and hypotheses about them.
Once again consider a concrete example: the simplest bivariate table, in which each variable has only two categories. Such a table contains four frequencies. Logarithms of these frequencies (log-frequencies for short) can be expressed as an exact function of the following linear equation:

$$Y = b_0 X_0 + b_1 X_1 + b_2 X_2 + b_3 (X_1 X_2) \qquad (4)$$
In this equation, Y stands for the log-frequencies (log(Fij)). X1 is a design vector for the first (row) variable, and X2 is a design vector for the second (column) variable. The last vector (X1·X2) is a design vector for interaction between X1 and X2, and it is produced literally by multiplying the respective components of X1 and X2. It is important to note that the model is linear only in its parameters and that there is an interaction term. As is the case with linear models that contain interaction terms, one must be careful in interpreting the coefficients for the variables involved in the interaction term.
This type of model in which the observed frequencies are reproduced exactly also is known as a saturated model. (The model is saturated because all the available degrees of freedom are used up. For instance, there are only four data points, but this model requires that many parameters.) Of course, if one can reproduce the exact log-frequencies, one also can reproduce the actual frequencies by taking the exponential of Y: Fij = exp(Yij). Note also the similarities between equations (3) and (4); both contain a multiplicative term as a variable. (For more general models, a maximum likelihood estimation requires an iterative solution, but that is a technical detail for which readers should consult standard texts, such as Nelder and Wedderburn 1972; Plackett 1974; Goodman 1978, 1984; Haberman 1978, 1979; Fleiss 1981; Agresti 1984. Many computer packages routinely provide solutions to these types of equations. Therefore, what is important is the logic underlying such analysis, not the actual calculation needed.)
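To make the estimation concrete, a log-linear model can be fit as a Poisson regression of the cell frequencies on the design vectors. The sketch below is our own illustration, not part of the original text; it assumes the Python statsmodels library and uses the effect coding of Table 3 with the frequencies of Table 4a:

```python
import numpy as np
import statsmodels.api as sm

# Cell frequencies of a hypothetical 2 x 2 table, ordered f11, f12, f21, f22.
f = np.array([60, 40, 40, 60])

# Effect-coded design vectors: row, column, and their elementwise product (interaction).
x1 = np.array([1, 1, -1, -1])
x2 = np.array([1, -1, 1, -1])
X = np.column_stack([np.ones(4), x1, x2, x1 * x2])

# The log-linear model as a Poisson GLM with the (default) log link.
fit = sm.GLM(f, X, family=sm.families.Poisson()).fit()
print(np.exp(fit.params))  # multiplicative t-parameters; t3 is about 1.225 (cf. Table 4a)
print(fit.fittedvalues)    # the saturated model reproduces 60, 40, 40, 60 exactly
```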
It is no exaggeration to say that in more advanced uses of the model, what distinguishes a good and creative analysis from a mundane analysis is how well one can translate one's substantive research ideas into appropriate design vectors. Thus, it is worthwhile to examine these design vectors more carefully. Constructing a design matrix (the collection of vectors mentioned above) for a saturated model is easy, because one is not pursuing any specific hypothesis or special pattern that might exist in the relationship. Categories of each variable have to be represented, and there are many equivalent ways of doing that. This section will examine only the two most often used ones: effect coding and dummy coding. These design matrices for a 2 × 2 table are shown in Table 3.
The first column (X0) in each coding represents a design vector for the constant term (b0); X1 is for the row categories, and X2 is for the column categories. The last column (X3) is the product of the preceding two, needed to represent interaction between X1 and X2. Note the pattern of these design vectors. In the effect coding, except for the constant vector, each vector or column sums to zero. Moreover, the interaction vector sums to zero for each column and row of the original bivariate table. This pattern assures that each effect is measured as a deviation from its respective mean.
In dummy coding, the category effect is expressed as a deviation from one reference category, in this case, the category that is represented by zero. Whatever codings are used to represent the categories of each variable, the interaction design vector is produced by multiplying the design vector for the column variable by the design vector for the row variable.
Table 3. Design Vectors Used in Log-Linear Model for 2 × 2 Table
a) design matrices for saturated model | ||||||||
effect coding | dummy coding | |||||||
frequency | x0 | x1 | x2 | x3 | x0 | x1 | x2 | x3 |
y11 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
y12 | 1 | 1 | –1 | –1 | 1 | 1 | 0 | 0 |
y21 | 1 | –1 | 1 | –1 | 1 | 0 | 1 | 0 |
y22 | 1 | –1 | –1 | 1 | 1 | 0 | 0 | 0 |
b) representation of log-frequencies in terms of parameter | ||||||||
b0+b1+b2+b3 | b0+b1–b2–b3 | b0+b1+b2+b3 | b0+b1 | |||||
b0–b1+b2–b3 | b0–b1–b2+b3 | b0+b2 | b0 | |||||
c) representation of frequencies in terms of multiplicative parameters, where ti = exp(bi) | ||||||||
t0*t1*t2*t3 | t0*t1/(t2*t3) | t0*t1*t2*t3 | t0*t1 | |||||
t0*t2/(t1*t3) | t0*t3/(t1*t2) | t0*t2 | t0 |
d) parameters for interaction in log-linear model | ||||||||
b3 | –b3 | b3 | 0 | |||||
–b3 | b3 | 0 | 0 | |||||
log (odds ratio) | ||||||||
4*b3 | b3 | |||||||
e) multiplicative parameter for interaction (t3 = exp (b3)) | ||||||||
t3 | 1/t3 | t3 | 1 | |||||
1/t3 | t3 | 1 | 1 | |||||
odds ratio | ||||||||
t3*t3*t3*t3 = t3^4 | t3 |
Normally, one needs as many design vectors for a given variable as there are categories minus one: (R−1) for the row variable and (C−1) for the column variable. In that case, there will be (R−1)(C−1) interaction design vectors for the saturated model. These interaction vectors are created by cross-multiplying the vectors in one set with those of the other set. There is only one vector for each of the three independent variables in equation (4) because both variables are dichotomous.
The names for these codings come from the fact that the first coding is customarily used as a convenient way of expressing factor effects in an analysis of variance (ANOVA), while the second coding often is used in regression with dummy variables. As a result of coding differences in the representation of each variable, the constant term in each coding has a different meaning: In effect coding, it measures the unweighted grand mean, while in dummy coding, it measures the value of the category with all zeros (in this particular case, Y22). (For other coding schemes, see Haberman 1979; Agresti 1984; Long 1984.) Some parameter estimates are invariant under different types of coding, and some are not (Long 1984); therefore, it is important to understand fully the implications of a particular design matrix for a proper interpretation of the analysis results.
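A short sketch of the two codings (our own construction, mirroring Table 3; note that the interaction column is the elementwise product under either coding):

```python
import numpy as np

# Cells ordered y11, y12, y21, y22, as in Table 3.
x1_effect = np.array([1, 1, -1, -1])   # row variable, effect coding (1 / -1)
x2_effect = np.array([1, -1, 1, -1])   # column variable, effect coding
x1_dummy = np.array([1, 1, 0, 0])      # row variable, dummy coding (1 / 0)
x2_dummy = np.array([1, 0, 1, 0])      # column variable, dummy coding

# In either coding, the interaction vector is the elementwise product.
effect = np.column_stack([np.ones(4), x1_effect, x2_effect, x1_effect * x2_effect])
dummy = np.column_stack([np.ones(4), x1_dummy, x2_dummy, x1_dummy * x2_dummy])
print(effect)  # columns x0, x1, x2, x3 of the effect-coding panel of Table 3
print(dummy)   # columns x0, x1, x2, x3 of the dummy-coding panel
```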
Panel (b) of Table 3 expresses each cell as a product of the design matrix and the corresponding parameters. Since the particular vectors used contain 1, −1, or 0, the vectors do not seem to appear in these cell representations. However, when design vectors contain other numbers (as will be shown below), they will be reflected in the cell representation. Panel (c) is obtained by exponentiation of the respective cell entries in (b), each t-parameter being the corresponding exponential of the log-linear parameter in panel (b).
Panel (d) isolates parameters associated with the interaction design vector. Panel (e) contains corresponding antilogs or multiplicative coefficients. These parameters play a critical role in representing the degree and nature of association between the row variables and the column variables. If all the odds ratios are 1, one variable is statistically independent from the other; in other words, information about the association between variables is totally contained in the pattern of odds ratios. Panels (d) and (e) show that the odds ratio in turn is completely specified by the parameter(s) of the interaction vector(s). In forming the odds ratio, all the other parameters cancel out (in logarithms, multiplication becomes addition and division becomes subtraction).
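The cancellation can be verified directly from panel (b) of Table 3: the log odds ratio is a difference of differences of the four cell representations, and every parameter except b3 drops out:

$$\log t = Y_{11} - Y_{12} - Y_{21} + Y_{22} = \begin{cases} 4b_3 & \text{(effect coding)} \\ b_3 & \text{(dummy coding)} \end{cases}$$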
In short, this is an indirect way to describe a pattern of association in a bivariate table. Unfortunately, doing this requires a titular dependent variable and multiplicative terms as independent variables. Also, in effect coding, the log-odds ratio is given by 4 × b3, but in dummy coding, it is given by b3. This is a clear indication that one cannot assume that there is only one way of describing the parameters of a log-linear model. These facts make the interpretation of these parameters tricky, but the process is worth it for two reasons.
First, although the advantage of this method for analyzing a 2 × 2 table is trivial, the model can be generalized and then applied to more complex contingency tables. Because of the ANOVA-like structure, it is easy to deal with higher-level interaction effects. Second, the parameters of the log-linear models (obtained through the maximum likelihood procedure) have very nice sampling properties for large samples. Therefore, better tools for statistical testing and estimating are available. Without this second advantage, the fact that the approach allows the construction of ANOVA-like models may not be of much value, for the log-linear models only indirectly and by analogy reflect the relationship between variables.
Consider the bivariate tables in Table 4. In all these tables, the frequencies are such that they add up to 100 in each column. Thus, one can read these frequencies as percentages as well. The first table shows a 20 percent difference and an odds ratio of 2.25. The second table shows only half the percentage difference of the first but the same odds ratio. The last table shows the same percentage difference as the second one, but its odds ratio is greater at 6.68. These descriptive measures indicate that there is some association between the two variables in each table.
Whether this observed association is statistically significant can be tested by applying a model in which the coefficient for the interaction design vector is constrained to be zero. (Here one is utilizing the properties of the log-linear model that were asserted earlier.) Constraining the interaction parameter to zero is the same as deleting the interaction design vector from the model. This type of a design matrix imposes the model of statistical independence (independence model for short) on the data. If such a log-linear model does not fit the data (on the basis of some predetermined criteria), the observed association is accepted as significant. For large samples, both the conventional chi-square test and the likelihood ratio (L2) test can be used for this purpose. The results of these tests are included in each table, and they indicate that all three associations are statistically significant at the conventional α level of .05.
Thus, to describe fully the underlying pattern of the association in Table 4, one needs to introduce the interaction parameter, which in these cases is the same as the one obtained using the saturated model. The right-hand tables show the multiplicative parameters (t-parameters) for the interaction term. (Here only the results of applying effect coding are included.) First, examine the patterns of these parameters. In each of the three tables, the t-parameters indicate that the main diagonal cells have higher rates than do the off-diagonal cells. This tendency is slightly higher in the last table than it is in the first two. This interpretation follows from the fact that to reproduce the observed frequency in each cell, the respective t-parameter must be multiplied by whatever value is implied by the other parameters in the model.
Table 4. Odds Ratios (t) and Percentage Differences
frequencies | multiplicative parameters | ||||||
a) | x1 | x2 | effect coding | dummy coding | |||
y1 | 60 | 40 | 1.225 | .816 | 2.25 | 1 | |
y2 | 40 | 60 | .816 | 1.225 | 1 | 1 | |
100 | 100 | ||||||
byx= .20 | t = 2.25 | ||||||
l2= 8.05 | p = .005 | ||||||
b) | x1 | x2 | |||||
y1 | 20 | 10 | 1.225 | .816 | 2.25 | 1 | |
y2 | 80 | 90 | .816 | 1.225 | 1 | 1 | |
100 | 100 | ||||||
byx = .10 | t = 2.25 | ||||||
l2 = 3.99 | p = .046 | ||||||
c) | x1 | x2 | |||||
y1 | 12 | 2 | 1.608 | .622 | 6.68 | 1 | |
y2 | 88 | 98 | .622 | 1.608 | 1 | 1 | |
100 | 100 | ||||||
byx = .10 | t = 6.68 | ||||||
l2 = 8.46 | p = .004 |
In the first and second tables, the frequencies in the main diagonal are about 22 percent higher (1.22 times) than they would be without the interaction effect. The frequencies in the off-diagonal cells are about 18 percent lower than they otherwise would be. If one were to examine only the statistics generated by log-linear models, however, it would be easy to overlook the fact that the percentage of the first cell in the last table is only 12 percent (see Kaufman and Schervish 1986 for a more extended discussion). This is one of the reasons why it is advisable to examine the percentage tables even if one is using the log-linear model almost exclusively.
There are other reasons, too. By the linear standard (percentage difference), the first table shows a greater degree of association than does the second or the third. By the standard of a log-linear model or odds ratio, the last table shows the greatest degree of association. In most cases, where the percentages remain within the range of 20 to 80 percent, the two standards are roughly comparable, and the linear and log-linear models may produce similar results (see Goodman 1981). More important, in examining three-way interactions, if two subtables have the patterns shown in Tables 4a and 4b, log-linear models will indicate no three-way interaction, while linear models will indicate one. There are models in which a particular standard is justified explicitly by the phenomenon under consideration, but one should not adopt a standard merely because a particular statistical model does so. It is important to understand the differences in the implicit standards that are used in different methods.
SOME MODELS OF ASSOCIATION
The flexibility of log-linear models is not obvious until one deals with several variables. However, even in a bivariate table, if there is an underlying order in the categories of variables involved and the pattern of association, the model allows some flexibility for exploring this pattern. Consider the hypothetical table shown in Table 5a. The marginal totals are such that each column may be read as percentages. There is a definite pattern in the direction of the relationship, although the tendency is fairly weak. If one were to apply a test of independence, such a null hypothesis would not be rejected. (L2 = 6.56 with four degrees of freedom has a probability of .161.) Against an unspecified alternative hypothesis, the null hypothesis cannot be rejected at the conventional level of α.
Knowing that almost every society values these two variables in the same order, one may expect that the underlying pattern of association reflects the advantages of the upper class over the lower class in obtaining valued objects. Both the pattern of percentage differences and the odds ratios seem to indicate such an ordering in the pattern: The advantage the upper class has over the lower class is greater than the one it has over the middle class. Furthermore, the upper class does better in relation to educational levels that are farther apart (the odds ratio involving the corner cells is 2.87).
A conjecture or hypothesis like this can be translated into a design vector. Assign any consecutive numbers to the categories of each variable, but to be consistent with the effect coding, express them as deviations from the mean. One such scaling is to use (R + 1)/2 − i for the row variable and (C + 1)/2 − j for the column variable. (The mean and category values can be reversed, but this scheme assigns a higher value to a higher class and a higher educational level to be consistent with everyday language.) Recalling once again that only the interaction terms are relevant for the description of association, one needs to create such an interaction term by multiplying these two vectors component by component. An example is shown in Table 6.
The log-linear model, then, will include design vectors for the constant term, two vectors for the row and two vectors for the column, and one vector for the "linear-by-linear" interaction. This type of model is known as a uniform association model (for a fuller discussion of this and related models, see McCullagh 1978; Haberman 1979; Clogg 1982a, 1982b; Anderson 1984; Goodman 1984, 1985, 1987, 1990, 1991; Clogg and Shihadeh 1994). The results of applying such a model to Table 5a are presented in Tables 5b and 5c. First, this model fits the data extremely well. Moreover, the reduction of the L2 statistic (6.557 − .008 = 6.549) with one degree of freedom is statistically significant. Therefore, the null hypothesis cannot be accepted against this specific alternative hypothesis (see Agresti 1984 for a fuller discussion of hypothesis testing of this type). Note the pattern of the expected frequencies and the interaction parameters. Both indicate that the odds ratio for every consecutive set of four cells is uniform. Moreover, the other odds ratios are exact functions of this basic odds ratio and the distances involved. For instance, the odds ratio for the four corner cells is 2.87 (= 1.303^(2×2)), with each exponent indicating the number of steps between the respective categories of each variable. A degree of parsimony has been achieved in describing the pattern of association, and some statistical power has been gained by proposing a more definite alternative hypothesis than the general one that stipulates any lack of statistical independence (and hence uses up four degrees of freedom).
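The same machinery accommodates the uniform association model: one simply appends the linear-by-linear vector to the independence design. A sketch under the same statsmodels assumption as before, using the frequencies of Table 5a:

```python
import numpy as np
import statsmodels.api as sm

# Table 5a frequencies in row-major order: rows = education (college, high
# school, less than high school), columns = class (high, middle, low).
f = np.array([17, 12, 8, 41, 38, 34, 42, 50, 58], dtype=float)

# Effect coding for the row and column categories (cf. Table 6, vectors a and b).
r1 = np.array([1, 1, 1, 0, 0, 0, -1, -1, -1])
r2 = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1])
c1 = np.array([1, 0, -1] * 3)
c2 = np.array([0, 1, -1] * 3)

# Linear-by-linear (uniform association) vector. For three categories, the
# linear contrast (R+1)/2 - i equals the first effect-coded vector, so the
# product r1 * c1 is exactly vector e of Table 6.
uniform = r1 * c1

X = np.column_stack([np.ones(9), r1, r2, c1, c2, uniform])
fit = sm.GLM(f, X, family=sm.families.Poisson()).fit()
print(np.exp(fit.params[-1]))     # basic odds ratio, about 1.30 (cf. Table 5d)
print(fit.fittedvalues.round(1))  # expected frequencies, cf. Table 5c
```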
The introduction of different design matrices allows one to explore different patterns very easily. Just two are examined here. Consider the hypothetical tables shown in Table 7. In the first table, the odds ratios remain the same across the columns but vary across the rows, perhaps indicating that the order inherent in the row categories is not uniform, while that in the column category is. Differently stated, the distance between two consecutive row categories varies, while it remains constant for the column categories. Such an association pattern is known as the row-effects association model not because the column variable does not have any effect but because an equal-interval scale works well for it. In this case, one needs two design vectors to accommodate the unequal distances in the row categories. In general, the most one needs is the number of categories in the row minus one. As is shown in Table 6, these design vectors are obtained by cross-multiplying the linear distance vector of the column and the two vectors that already have been used to represent the row categories. (It works just as well to use the dummy coding.) The column-effects model is obtained if one reverses the role of these variables.
Table 7b is an example of the simplest possible homogeneous row–column effects model.
Table 5. A Hypothetical Table: Level of Educational Attainment (Y) by Social Class (X)
a) observed table | ||||||
social class | ||||||
level of education | high | middle | low | |||
college | 17 | 12 | 8 | |||
high school | 41 | 38 | 34 | |||
less than high school | 42 | 50 | 58 | |||
total | 100 | 100 | 100 | |||
b) expected frequencies under the assumption of independence | ||||||
12.33 | 12.33 | 12.33 | l2 = 6.56 | |||
37.67 | 37.67 | 37.67 | df1 = 4 | |||
50.00 | 50.00 | 50.00 | p = .161 | |||
c) expected frequencies under the assumption of uniform association | ||||||
16.90 | 11.90 | 8.20 | l2 = .0082 | |||
41.30 | 38.00 | 33.80 | df2 = 3 | |||
41.80 | 50.10 | 58.10 | p = .9998 |
d) log-linear and multiplicative parameters | ||||||
.264 | 0 | –.264 | 1.303 | 1 | .768 | |
0 | 0 | 0 | 1 | 1 | 1 | |
–.264 | 0 | .264 | .768 | 1 | 1.303 | |
l12 – l22 = 6.549; | df1 – df2 = 1; | p = .0105 |
The odds ratios change across the row and across the column, but the corresponding pairs of categories in the row and in the column share the same odds ratio. In this particular example, there is a greater distance between the first two categories than there is between the second two. In general, a homogeneous row–column effects model can accommodate different intervals in each variable as long as the corresponding intervals are homogeneous across the variables. The design matrix for such a pattern is easily obtained by adding the row-effects model vectors and the column-effects model vectors. (This is also how two variables are constrained to have equal coefficients in any linear model.) Such a design matrix for a 3 × 3 table is also contained in Table 6. The examples shown in that table should be sufficient to indicate strategies for generalizing to a larger table.
There are many other possibilities in formulating specific hypotheses. These relatively simple models are introduced not only for their intrinsic value but also as a reminder that one can incorporate a variety of specialized hypotheses into the log-linear model (for other possibilities, see Goodman 1984; Clogg 1982a, 1982b; Agresti 1983, 1984). Before ending this section, it should be noted that when design vectors such as the ones for the homogeneous row–column effects model are used, the connection between the parameters for linear models indicated in this article and the usual ANOVA notation used in the literature is not obvious.
Table 6. Design Matrices for Row–Column Association Models for 3 × 3 Table
t | a | b | c | d | e | f | g | h | i† | j | |||||||||||
c*d | a*~d | b*~d | f+g | (f~g)† | a*~b | ||||||||||||||||
note: ~ (horizontal concatenation); * (multiplication); *~ (horizontal direct product); † (excluding redundant vector). | |||||||||||||||||||||
1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | |
1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
1 | 1 | 0 | –1 | –1 | 1 | –1 | –1 | –1 | 0 | –1 | –1 | –2 | –1 | –1 | 0 | –1 | –1 | 0 | –1 | 0 | |
1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | |
1 | 0 | 1 | –1 | –1 | 0 | –1 | 0 | 0 | –1 | 0 | 0 | 0 | –1 | 0 | –1 | 0 | 0 | –1 | 0 | –1 | |
1 | –1 | –1 | 1 | 0 | –1 | 1 | –1 | –1 | –1 | –1 | 0 | –2 | –1 | –1 | –1 | 0 | –1 | –1 | 0 | 0 | |
1 | –1 | –1 | 0 | 1 | –1 | 0 | 0 | 0 | 0 | 0 | –1 | 0 | –1 | 0 | 0 | –1 | 0 | 0 | –1 | –1 | |
1 | –1 | –1 | –1 | –1 | –1 | –1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |
t: design vector for the constant term. | |||||||||||||||||||||
a: effect coding for row variable. | |||||||||||||||||||||
b: effect coding for column variable. | |||||||||||||||||||||
c: linear contrasts for row variable—(r + 1)/2 – i; any consecutive numbering will do; for variables with three categories, this is the same as the first code for the row variable. | |||||||||||||||||||||
d: linear contrasts for column variable—(c + 1)/2 – j. | |||||||||||||||||||||
e: design for the linear-by-linear interaction or uniform association, obtained by multiplying the linear contrast vector for the row and for the column. | |||||||||||||||||||||
f: design vectors for the row effects model, obtained by multiplying the design vectors for the row categories and the linear contrast vector for the column. | |||||||||||||||||||||
g: design vectors for column effects model, obtained by multiplying the design vectors for the column variable and the linear contrast for the row variable. | |||||||||||||||||||||
h: homogeneous row-column effects model, obtained by adding each vector in the matrix for the row and the corresponding vector in the matrix for the column. | |||||||||||||||||||||
i: row and column effects model—concatenation of f and g minus the redundant linear-by-linear interaction vector. | |||||||||||||||||||||
j: interaction vectors for saturated model, obtained by multiplying each vector in a with each vector in b. | |||||||||||||||||||||
design matrix for each type of model is obtained by concatenating relevant vectors from above, and the degrees of freedom by number of cells in the table minus the number of columns in the design matrix. | |||||||||||||||||||||
vectors | df | ||||||||||||||||||||
independence model | t~a~b | 4 | |||||||||||||||||||
uniform association model | t~a~b~e | 3 | |||||||||||||||||||
row-effects model | t~a~b~f | 2 | |||||||||||||||||||
column-effects model | t~a~b~g | 2 | |||||||||||||||||||
homogeneous row–column effects model | t~a~b~h | 2 | |||||||||||||||||||
row and column effects model | t~a~b~i | 1 | |||||||||||||||||||
saturated model | t~a~b~j | 0 |
Those parameters pertaining to each cell, denoted by tij, are equivalent to the product of the relevant part of the design matrix and the corresponding coefficients.
SOME EXTENSIONS
There are several ways in which one can extend the basic features of the log-linear models examined so far. Among these, the following three seem important:
Table 7. Hypothetical Tables Illustrating Some Association Models
a) row-effects association model | ||||||
frequency | odds ratio | |||||
x | ||||||
400 | 400 | 50 | 4 | 4 | ||
y | 200 | 800 | 400 | 2 | 2 | |
100 | 800 | 800 | ||||
log parameters | multiplicative parameter | |||||
1.155 | 0 | –1.155 | 3.175 | 1 | .315 |
–.231 | 0 | .231 | .794 | 1 | 1.260 | |
–.924 | 0 | .924 | .397 | 1 | 2.520 | |
b) homogeneous row–column effects model | ||||||
frequency | odds ratio | |||||
x | ||||||
400 | 100 | 100 | 4 | 2 | ||
y | 100 | 100 | 200 | 2 | 1 | |
100 | 200 | 400 | ||||
log parameters | multiplicative parameters | |||||
.924 | –.231 | –.693 | 2.520 | .794 | .500 | |
–.231 | 0 | .231 | .794 | 1 | 1.260 | |
–.693 | .231 | .462 | .500 | 1.260 | 1.587 |
(1) utilizing the ANOVA-like structure of the log-linear model and the well-developed sampling theory to explore interaction patterns of multivariate categorical data; (2) manipulating the design matrices to examine more specific hypotheses and models; and (3) combining the strategic features of subgroup analysis and the flexibility and power of the log-linear models to produce more readily accessible analysis results. These three extensions are discussed below.
General Extension of Log-Linear Models. The most straightforward and widely used application of the log-linear model is to explore the interaction pattern of multivariate data by exploiting the ANOVA-like structure of the model. Given several variables to examine, especially when each variable contains more than two categories, it is almost impossible to examine the data structure in detail. The ANOVA-like structure allows one to develop a convenient strategy to explore the existence of multiway relationships among the variables.
This strategy requires that one start with a design matrix for each variable (containing k − 1 vectors, where k is the number of categories in the variable). It does not matter whether one uses dummy coding or effect coding. To examine all the possible interrelationships in the data, one needs design matrices corresponding to each two-way interaction up to the m-way interaction, where m is the number of variables. To construct a design matrix for a two-way interaction between variable A and variable B, simply cross-multiply the design vectors for A with those for B. (This method is illustrated in Table 6.) This general approach to design matrices extends to m-way interactions. For example, a three-way interaction is handled by cross-multiplying each two-way vector with the basic design vectors for a third variable, and so on.
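A minimal sketch of the cross-multiplication step (the helper function and the 2 × 2 × 2 layout are our own illustration):

```python
import numpy as np

def cross(u_mat, v_mat):
    """All pairwise elementwise products of the columns of two design matrices."""
    return np.column_stack([u * v for u in u_mat.T for v in v_mat.T])

# Effect-coded design vectors for three dichotomous variables over the eight
# cells of a 2 x 2 x 2 table (cells in standard row-major order).
A = np.array([1, 1, 1, 1, -1, -1, -1, -1]).reshape(-1, 1)
B = np.array([1, 1, -1, -1, 1, 1, -1, -1]).reshape(-1, 1)
C = np.array([1, -1, 1, -1, 1, -1, 1, -1]).reshape(-1, 1)

AB, AC, BC = cross(A, B), cross(A, C), cross(B, C)  # two-way interaction vectors
ABC = cross(AB, C)                                  # three-way interaction vector
# Concatenating T (a constant column) with A, B, C, AB, AC, BC, and ABC gives
# the design matrix of the saturated model H4 below.
```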
If one includes in the model all the vectors covering up to m-way interactions, the resulting model is saturated, and each frequency in the multiway table is completely described. In general, one wants to explore and, if possible, find a parsimonious way to describe the data structure. One general strategy, perhaps overused, is to examine systematically the hierarchical pattern inherent in the design constraints and serially examine a nested set of models. To illustrate, consider that there are three variables and that the basic design vectors for each variable are represented by A, B, and C, respectively. Let T stand for the constant vector. Then an example of a nested set of models is illustrated below. The commas indicate concatenation, and two or more letters together indicate cross-multiplication of the basic design vectors for each variable.
H1: T (equiprobability)
H2: T, A, B, C (total independence)
H3a: T, A, B, C, AB (one two-way interaction)
H3b: T, A, B, C, AB, AC (two two-way interactions)
H3c: T, A, B, C, AB, AC, BC (no three-way interaction)
H4: T, A, B, C, AB, AC, BC, ABC (saturated model)
Each hypothesis is tested using the χ² or L² statistic associated with the model and the appropriate degrees of freedom, given by the number of cells in the frequency table minus the number of vectors contained in the design matrix. The sequence of hypotheses shown above is arbitrary; one may choose any nested set or directly examine H3c. One usually accepts the simplest hypothesis that is compatible with the data.
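This testing routine is easy to mechanize because a log-linear model can be fit as a Poisson regression of the cell counts on the design matrix, with the model deviance serving as L². A minimal sketch, assuming the statsmodels and scipy libraries and a hypothetical 2 × 2 × 2 table (the counts are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def l2_test(counts, design):
    # Fit the log-linear model by Poisson regression; the deviance is L2,
    # and df = number of cells minus number of design vectors.
    fit = sm.GLM(counts, design, family=sm.families.Poisson()).fit()
    df = counts.size - design.shape[1]
    return fit.deviance, df, chi2.sf(fit.deviance, df)

# Effect-coded basic vectors for three dichotomies over the 8 cells.
A = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
B = np.array([ 1,  1, -1, -1,  1,  1, -1, -1])
C = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])
T = np.ones(8)

counts = np.array([120., 60., 40., 80., 30., 90., 70., 10.])  # hypothetical

designs = {                                    # the nested set H2 through H3c
    "H2 (total independence)": np.column_stack([T, A, B, C]),
    "H3a (adds AB)":           np.column_stack([T, A, B, C, A*B]),
    "H3b (adds AC)":           np.column_stack([T, A, B, C, A*B, A*C]),
    "H3c (no three-way)":      np.column_stack([T, A, B, C, A*B, A*C, B*C]),
}
for name, X in designs.items():
    l2, df, p = l2_test(counts, X)
    print(f"{name}: L2 = {l2:.2f}, df = {df}, p = {p:.3f}")
```

Scanning the output, one would accept the simplest design whose L² remains compatible with its degrees of freedom.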
If variables contain many categories, even the simplest two-way interactions will use up many degrees of freedom. This type of generic testing does not incorporate into the design matrix any special relationships that may exist between variables. Models of this type are routinely available in standard computer packages and therefore are quite accessible. For that reason, they are overused. Moreover, the sequential nature of the testing violates some of the assumptions of classical hypothesis testing. Nevertheless, in the hands of an experienced researcher, they become a flexible tool for exploring the multivariate data structure.
The Uses of Constrained Models. The flexibility and power of log-linear models are fully realized only when one incorporates a specific hypothesis about the data into the design matrices. There are virtually endless varieties one can consider. Some of the simple but strategic models of association were introduced in the preceding section.
Incorporating such models into a multivariate analysis is not difficult if one views the task in the context of design matrices. For instance, suppose one suspects that a certain pattern of relationship exists between X and Y (for instance, between the social class of origin and the class of destination in intergenerational mobility). One may have the additional hypothesis that this relationship varies systematically across political systems (or across societies with different levels of economic development). If one can translate these ideas into appropriate design matrices, such a model will provide a much more powerful test than the generic statistical models described in the previous section can provide. Many social mobility studies build design matrices of this kind into the overall design to represent a specific pattern of mobility (for some examples, see Duncan 1979; Hout 1984; Yamaguchi 1987; for new developments, see DiPrete 1990; Stier and Grusky 1990; Wong 1990, 1992, 1995; Xie 1992).
In general, there are two problems in using such design matrices. The first, which depends in part on the researcher's creativity, is translating theoretically relevant models into appropriate design matrices. The second is obtaining a good statistical solution for the model; this is no longer much of a problem, given the wide availability of computer programs that accept design matrices (see Breen 1984 for a discussion of preparing design matrices for a program that handles generalized linear models). A simple constrained design is sketched below.
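As one concrete illustration, the following sketch sets up a quasi-independence ("mover-stayer") design for a hypothetical three-class mobility table: indicators for the diagonal cells absorb the excess of stayers, so the origin-destination association is judged from the movers alone. The coding is one common choice, not the specific matrix of any study cited above:

```python
import numpy as np

k = 3                                              # hypothetical 3-class mobility table
cells = [(i, j) for i in range(k) for j in range(k)]

def effect(level, k):
    # Effect-coded entries (k-1 of them) for one category of a k-level variable.
    v = np.zeros(k - 1)
    if level < k - 1:
        v[level] = 1.0
    else:
        v[:] = -1.0                                # last category scored -1 throughout
    return v

T = np.ones((k * k, 1))                            # constant vector
ORIG = np.array([effect(i, k) for i, j in cells])  # origin (row) main effect
DEST = np.array([effect(j, k) for i, j in cells])  # destination (column) main effect

# One indicator per diagonal cell: these vectors absorb the "stayers,"
# leaving independence to hold among the off-diagonal "movers."
DIAG = np.array([[1.0 if i == j == d else 0.0 for d in range(k)]
                 for i, j in cells])

X = np.hstack([T, ORIG, DEST, DIAG])               # 9 cells x 8 vectors: df = 1
```

The design can be handed directly to the Poisson-regression routine sketched earlier; richer mobility hypotheses (symmetry, uniform association, crossing barriers) differ only in which extra vectors are appended.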
A general problem has been that researchers often do not make the underlying design matrices explicit and, as a result, sometimes misinterpret the results. The solution is to think explicitly in terms of the design matrices rather than by analogy to a generic (presumed) ANOVA model.
Use of Percentage Tables in Log-Linear Modeling. Multivariate analysis is in general complex, and categorical analysis is especially so, because one conceptual variable has to be treated as if it were k-1 variables, with k being the number of categories in the variable. Even with a limited number of variables, then, if each variable contains more than two categories, examining the multivariate pattern becomes extremely difficult, and the tendency is to rely on the generic hypothesis testing discussed earlier.
It is useful to borrow two strategies from subgroup analysis: focusing on a bivariate relationship and using percentage distributions. After an acceptable log-linear model is identified, one may display the relationship between two key variables while the effects of other variables are controlled or purged, and the percentage distribution for that bivariate relationship may be compared with the corresponding distributions when different sets of variables are so controlled. Fortunately, log-linear modeling provides a very attractive way of purging the confounding effects of many variables from the relationship under special scrutiny (Clogg 1978; Kaufman and Schervish 1986; Clogg and Eliason 1988a; Clogg et al. 1990). Clogg et al. (1990) present a general framework under which almost all known variants of such adjustment can be treated as special cases, and they describe statistical testing procedures for a variety of statistics associated with the adjustments.
Tables 8 and 9 contain examples of traditional subgroup analysis, log-linear analysis, and the uses of standardization or purging methods. The upper panel of Table 8 is a bivariate table relating the race of the defendant (X) to the verdict (Y), death penalty versus other penalties, while the lower panel contains the result of a traditional three-variable subgroup analysis, in which the original relationship between X and Y is reexamined within the categories of a third variable, the race of the victims (Z). (These data are based on individuals who were convicted of multiple homicides in Florida; see Radelet and Pierce 1991; Agresti 1996.)

Table 8. Death Penalty Verdict (Y) by Defendant's Race (X) and Victim's Race (Z)

| Victim's race | Defendant's race | Death penalty: yes | Death penalty: no | Percentage yes |
|---|---|---:|---:|---:|
| Total | White | 53 | 430 | 11.0 |
|  | Black | 15 | 176 | 7.9 |
| White | White | 53 | 414 | 11.3 |
|  | Black | 11 | 37 | 22.9 |
| Black | White | 0† | 16 | 0.0 |
|  | Black | 4 | 139 | 2.8 |

Source: Radelet and Pierce (1991), p. 25, and Agresti (1996), p. 54.
Note: (1) For the log-linear analysis, 0.5 is added to the zero cell (†). (2) In the original data, there are two cases that involve both white and black victims. (3) The data do not consistently identify Spanish ancestry; most defendants and victims with Spanish ancestry are coded as white. For detailed information, see Radelet and Pierce (1991).
The original bivariate relationship seems to indicate that whites are more likely to receive the death penalty than are blacks. However, when the race of the victims is controlled, the partial relationship between X and Y is reversed: Within each category of victim's race, blacks are more likely to receive the death penalty than are whites. The reversal rests on two related facts: (1) there is a strong association between the race of the defendant (X) and the race of the victim (Z): white defendants are more likely to kill whites than blacks, while black defendants are more likely to kill blacks than whites; and (2) there is a strong relationship between the race of the victim and the death penalty: those who killed white victims are more likely to receive the death penalty than are those who killed blacks. Once these relationships are taken into consideration, blacks receive the death penalty at a higher rate than do whites.
Table 9. Design Matrix, Expected Frequencies, and Standardized Percentages under the Model without Three-Way Interaction

a) Design matrix and coefficients for the model without three-way interaction

| t | z | x | y | xz | yz | xy | Parameter | Coefficient | z-value |
|---:|---:|---:|---:|---:|---:|---:|---|---:|---:|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | constant (t) | 2.959 | 21.0 |
| 1 | 1 | 1 | –1 | 1 | –1 | –1 | z | 1.039 | 6.9 |
| 1 | 1 | –1 | 1 | –1 | 1 | –1 | x | –0.135 | –1.3 |
| 1 | 1 | –1 | –1 | –1 | –1 | 1 | y | –1.382 | –11.1 |
| 1 | –1 | 1 | 1 | –1 | –1 | 1 | xz | 1.137 | 14.7 |
| 1 | –1 | 1 | –1 | –1 | 1 | –1 | yz | 0.558 | 3.9 |
| 1 | –1 | –1 | 1 | 1 | –1 | –1 | xy | –0.201 | –2.2 |
| 1 | –1 | –1 | –1 | 1 | 1 | 1 | L² = 0.284 | df = 1 | prob. = 0.594 |

b) Expected frequencies under the model without three-way interaction

| Victim's race | Defendant's race | Death penalty: yes | Death penalty: no | Percentage yes |
|---|---|---:|---:|---:|
| Total | White | 53.5 | 430.0 | 11.1 |
|  | Black | 15.0 | 176.0 | 7.9 |
| White | White | 53.3 | 413.7 | 11.4 |
|  | Black | 10.7 | 37.3 | 22.3 |
| Black | White | 0.2 | 16.3 | 1.2 |
|  | Black | 4.3 | 138.7 | 3.0 |

c) Direct standardization

| Victim's race | Defendant's race | Death penalty: yes | Death penalty: no | Percentage yes |
|---|---|---:|---:|---:|
| Total | White | 35.87 | 447.62 | 7.4 |
|  | Black | 28.16 | 162.84 | 14.7 |
| White | White | 33.58 | 260.67 | 11.4 |
|  | Black | 25.91 | 90.33 | 22.3 |
| Black | White | 2.29 | 186.95 | 1.2 |
|  | Black | 2.25 | 72.51 | 3.0 |

d) Purging xz (= purging xz and xyz)

| Victim's race | Defendant's race | Death penalty: yes | Death penalty: no | Percentage yes |
|---|---|---:|---:|---:|
| Total | White | 17.7 | 183.2 | 8.8 |
|  | Black | 34.9 | 160.9 | 17.8 |
| White | White | 17.0 | 132.3 | 11.4 |
|  | Black | 33.5 | 116.6 | 22.3 |
| Black | White | 0.7 | 50.9 | 1.4 |
|  | Black | 1.4 | 44.4 | 3.1 |

e) Purging xz and yz (= purging xz, yz, and xyz)

| Victim's race | Defendant's race | Death penalty: yes | Death penalty: no | Percentage yes |
|---|---|---:|---:|---:|
| Total | White | 11.0 | 260.9 | 4.0 |
|  | Black | 21.5 | 228.7 | 8.6 |
| White | White | 9.8 | 231.9 | 4.0 |
|  | Black | 19.1 | 203.2 | 8.6 |
| Black | White | 1.2 | 29.0 | 4.0 |
|  | Black | 2.4 | 25.4 | 8.6 |
The log-linear analysis can supplement such a traditional subgroup analysis in several convenient ways. Panel (a) of Table 9 shows the design matrix, coefficients, and z-values under the model without three-way interaction. These statistics show several things that are not obvious in the conventional subgroup analysis: (1) the three-way interaction is not statistically significant, (2) all three bivariate relationships are statistically significant, and (3) in some sense, the association between X and Z is the strongest and that between X and Y the weakest of the three bivariate relationships.
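Panel (a) can be reproduced, at least approximately, with the Poisson-regression device sketched earlier. The fragment below assumes the statsmodels and scipy libraries and follows note (1) of Table 8 in adding 0.5 to the zero cell; exact values may differ slightly depending on how that adjustment is handled:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

# Effect-coded vectors over the 8 cells, in the row order of panel (a):
# z = victim's race, x = defendant's race, y = verdict (white / death = +1).
Z = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
X = np.array([ 1,  1, -1, -1,  1,  1, -1, -1])
Y = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])
T = np.ones(8)

# Observed counts from Table 8 in the same order; 0.5 replaces the zero cell.
counts = np.array([53., 414., 11., 37., 0.5, 16., 4., 139.])

design = np.column_stack([T, Z, X, Y, X*Z, Y*Z, X*Y])   # no three-way vector
fit = sm.GLM(counts, design, family=sm.families.Poisson()).fit()

print(fit.params)                  # should approximate the panel (a) coefficients
print(fit.deviance)                # L2, about 0.28 on 8 - 7 = 1 df
print(chi2.sf(fit.deviance, 1))    # about 0.59
```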
Panel (b) shows the expected frequencies and relevant percentages under the model (where the three-way interaction is assumed to be zero). The pattern revealed in each subtable is very similar to that under the traditional subgroup analysis shown in Table 8. (This is as it should be, given no three-way interaction effect.) Within each category of victim's race, black defendants are more likely to receive the death penalty than are white defendants. Standardization or purging then allows one to summarize this underlying relationship between X and Y under the hypothetical condition that the effects of the third variable are controlled or purged. There are several ways of controlling those effects: direct standardization, shown in panel (c); purging the XZ relationship in addition to the three-way interaction, shown in panel (d); and purging the YZ relationship as well, shown in panel (e). Although the percentage differences vary, the underlying log-linear effect remains constant: Blacks are about twice as likely to receive the death penalty as are whites when both kill a victim of the same race (see Clogg 1978; Clogg and Eliason 1988b; Clogg et al. 1990).
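The purging operation itself is simple. Using the panel (b) expected frequencies and the fitted xz coefficient (1.137) from panel (a), dividing each cell by its xz multiplicative factor reproduces panel (d) up to rounding; all numbers below come from Table 9, and only the code is illustrative:

```python
import numpy as np

# Expected frequencies from panel (b), in the cell order of panel (a).
F = np.array([53.3, 413.7, 10.7, 37.3, 0.2, 16.3, 4.3, 138.7])

Z = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])   # victim's race
X = np.array([ 1,  1, -1, -1,  1,  1, -1, -1])   # defendant's race
lam_xz = 1.137                                   # fitted xz coefficient, panel (a)

# Purge the XZ association: divide each cell by its XZ multiplicative factor.
F_purged = F / np.exp(lam_xz * X * Z)

# Percentage receiving the death penalty within each victim-defendant group.
yes, no = F_purged[0::2], F_purged[1::2]
print(np.round(100 * yes / (yes + no), 1))
# -> [11.4 22.3  1.2  3. ]; cf. panel (d) -- the two tiny cells drift slightly
#    because the panel (b) inputs are rounded to one decimal.
```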
(see also: Analysis of Variance and Covariance; Causal Inference Models; Measures of Association; Nonparametric Statistics; Statistical Methods)
REFERENCES
Agresti, Alan 1983 "A Survey of Strategies for Modeling Cross-Classifications Having Ordinal Variables." Journal of the American Statistical Association 78:184–198.
—— 1984 Analysis of Ordinal Categorical Data. New York: Wiley.
—— 1990 Categorical Data Analysis. New York: Wiley.
—— 1996 An Introduction to Categorical Data Analysis. New York: Wiley.
Alba, Richard D. 1988 "Interpreting the Parameters of Log-Linear Models." In J. Scott Long, ed., Common Problems/Proper Solutions. Beverly Hills, Calif.: Sage.
Alwin, Duane F., and Robert M. Hauser 1975 "The Decomposition of Effects in Path Analysis." American Sociological Review 40:37–47.
Anderson, J. A. 1984 "Regression and Ordered Categorical Variables." Journal of the Royal Statistical Society B46:1–30.
Arminger, G., and G. W. Bohrnstedt 1987 "Making It Count Even More: A Review and Critique of Stanley Lieberson's Making It Count: The Improvement of Social Theory and Research." Sociological Methodology 17:347–362.
Berk, R. A. 1983 "An Introduction to Sample Selection Bias in Sociological Data." American Sociological Review 48:386–398.
—— 1986 "Review of Making It Count: The Improvement of Social Research and Theory." American Journal of Sociology 92:462–465.
Bishop, Yvonne M. M., Stephen E. Fienberg, and Paul W. Holland 1975 Discrete Multivariate Analysis: Theory and Practice. Cambridge, Mass.: MIT Press.
Blalock, Hubert M., Jr. 1964 Causal Inferences in Nonexperimental Research. Chapel Hill: University of North Carolina Press.
——, ed. 1985a Causal Models in the Social Sciences, 2nd ed. New York: Aldine.
——, ed. 1985b Causal Models in Panel and Experimental Designs. New York: Aldine.
Bollen, K. A. 1989 Structural Equations with Latent Variables. New York: Wiley.
Breen, Richard 1984 "Fitting Non-Hierarchical and Association Models Using GLIM." Sociological Methods and Research 13:77–107.
Bunge, Mario 1979 Causality and Modern Science, 3rd rev. ed. New York: Dover.
Campbell, D. T., and J. C. Stanley 1966 Experimental and Quasi-Experimental Designs for Research. Boston: Houghton Mifflin.
Clogg, Clifford C. 1978 "Adjustment of Rates Using Multiplicative Models." Demography 15:523–539.
—— 1982a "Using Association Models in Sociological Research: Some Examples." American Journal of Sociology 88:114–134.
—— 1982b "Some Models for the Analysis of Association in Multiway Cross-Classifications Having Ordered Categories." Journal of the American Statistical Association 77:803–815.
——, and Scott R. Eliason 1988a "A Flexible Procedure for Adjusting Rates and Proportions, Including Statistical Methods for Group Comparisons." American Sociological Review 53:267–283.
—— 1988b "Some Common Problems in Log-Linear Analysis." In J. Scott Long, ed., Common Problems/ Proper Solutions. Beverly Hills, Calif.: Sage.
——, and Edward S. Shihadeh 1994 Statistical Models for Ordinal Variables. Thousand Oaks, Calif.: Sage.
——, James W. Shockey, and Scott R. Eliason 1990 "A General Statistical Framework for Adjustment of Rates." Sociological Methods and Research 19:156–195.
Cook, Thomas D., and Donald T. Campbell 1979 Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally.
Davis, James A. 1984 "Extending Rosenberg's Technique for Standardizing Percentage Tables." Social Forces 62:679–708.
——, and Ann M. Jacobs 1968 "Tabular Presentations." In David L. Sills, ed., The International Encyclopedia of the Social Sciences, vol. 15. New York: Macmillan and Free Press.
Duncan, Otis Dudley 1966 "Path Analysis: Sociological Examples." American Journal of Sociology 72:1–16.
—— 1975 Introduction to Structural Equation Models. New York: Academic Press.
—— 1979 "How Destination Depends on Origin in the Occupational Mobility Table." American Journal of Sociology 84:793–803.
Fienberg, Stephen E. 1980 The Analysis of Cross-Classified Data, 2nd ed. Cambridge, Mass.: MIT Press.
Fisher, F. M. 1966 The Identification Problem in Econometrics. New York: McGraw-Hill.
Fleiss, J. L. 1981 Statistical Methods for Rates and Proportions, 2nd ed. New York: Wiley Interscience.
Goldberger, Arthur S., and Otis Dudley Duncan (eds.) 1973 Structural Equation Models in the Social Sciences. New York and London: Seminar Press.
Goodman, Leo A. 1978 Analyzing Qualitative/Categorical Data: Log-Linear Analysis and Latent Structure Analysis. Cambridge, Mass.: Abt.
—— 1981 "Three Elementary Views of Loglinear Models for the Analysis of Cross-Classifications Having Ordered Categories." In Karl F. Schuessler, ed., Sociological Methodology. San Francisco: Jossey-Bass.
—— 1984 The Analysis of Cross-Classified Categorical Data Having Ordered Categories. Cambridge, Mass.: Harvard University Press.
—— 1985 "The Analysis of Cross-Classified Data Having Ordered and/or Unordered Categories: Association Models, Correlation Models, and Asymmetry Models for Contingency Tables with or without Missing Entries." Annals of Statistics 13:10–69.
—— 1987 "The Analysis of a Set of Multidimensional Contingency Tables Using Log-Linear Models, Latent Class Models, and Correlation Models: The Solomon Data Revisited." In A. E. Gelfand, ed., Contributions to the Theory and Applications of Statistics: A Volume in Honor of Herbert Solomon. New York: Academic Press.
—— 1990 "Total-Score Models and Rasch-Type Models for the Analysis of a Multidimensional Contingency Table, or a Set of Multidimensional Contingency Tables, with Specified and/or Unspecified Order for Response Categories." In Karl F. Schuessler, ed., Sociological Methodology. San Francisco: Jossey-Bass.
Haberman, Shelby J. 1978 Analysis of Qualitative Data, vol. 1: Introductory Topics. New York: Academic Press.
—— 1979 Analysis of Qualitative Data, vol. 2: New Developments. New York: Academic Press.
Hausman, J. A. 1978 "Specification Tests in Econometrics." Econometrica 46:1251–1272.
Heckman, J. J. 1979 "Sample Selection Bias as a Specification Error." Econometrica 47:153–161.
——, and R. Robb 1986 "Alternative Methods for Solving the Problem of Selection Bias in Evaluating the Impact of Treatments on Outcomes." In H. Wainer, ed., Drawing Inferences from Self-Selected Samples. New York: Springer-Verlag.
Heise, David R. 1975 Causal Analysis. New York: Wiley.
Hout, Michael 1984 "Status, Autonomy, and Training in Occupational Mobility." American Journal of Sociology 89:1379–1409.
Hyman, Herbert 1955 Survey Design and Analysis: Principles, Cases and Procedures. Glencoe, Ill.: Free Press.
Kaufman, Robert L., and Paul G. Schervish 1986 "Using Adjusted Crosstabulations to Interpret Log-Linear Relationships." American Sociological Review 51:717–733.
Kendall, Patricia L., and Paul Lazarsfeld 1950 "Problems of Survey Analysis." In Robert K. Merton and Paul F. Lazarsfeld, eds., Continuities in Social Research: Studies in the Scope and Method of the American Soldier. Glencoe, Ill.: Free Press.
Kish, Leslie 1959 "Some Statistical Problems in Research Design." American Sociological Review 24:328–338.
Lazarsfeld, Paul F. 1955 "Interpretation of Statistical Relations as a Research Operation." In Paul F. Lazarsfeld and Morris Rosenberg, eds., The Language of Social Research. Glencoe, Ill.: Free Press.
——, Ann K. Pasanella, and Morris Rosenberg, eds. 1972 Continuities in the Language of Social Research. New York: Free Press.
Leamer, E. E. 1978 Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: Wiley Interscience.
Lieberson, Stanley 1985 Making It Count: The Improvement of Social Research and Theory. Berkeley and Los Angeles: University of California Press.
Long, J. Scott 1984 "Estimable Functions in Loglinear Models." Sociological Methods and Research 12:399–432.
——, ed. 1988 Common Problems/Proper Solutions: Avoiding Error in Quantitative Research. Beverly Hills, Calif.: Sage.
Mare, Robert D., and Christopher Winship 1988 "Endogenous Switching Regression Models for the Causes and Effects of Discrete Variables." In J. Scott Long, ed., Common Problems/Proper Solutions. Beverly Hills, Calif.: Sage.
McCullagh, P. 1978 "A Class of Parametric Models for the Analysis of Square Contingency Tables with Ordered Categories." Biometrika 65:413–418.
——, and J. Nelder 1983 Generalized Linear Models. London: Chapman and Hall.
Mosteller, F. 1968 "Association and Estimation in Contingency Tables." Journal of the American Statistical Association 63:1–28.
——, and John W. Tukey 1977 Data Analysis and Regression. Reading, Mass.: Addison-Wesley.
Nelder, J. A., and R. W. M. Wedderburn 1972 "Generalized Linear Models." Journal of the Royal Statistical Society A135:370–384.
Plackett, R. L. 1974 The Analysis of Categorical Data. London: Griffin.
Press, S. J., and S. Wilson 1978 "Choosing between Logistic Regression and Discriminant Analysis." Journal of the American Statistical Association 73:699–705.
Radelet, Michael L., and Glenn L. Pierce 1991 "Choosing Those Who Will Die: Race and the Death Penalty in Florida." Florida Law Review 43:1–34.
Rosenberg, Morris 1968 The Logic of Survey Analysis. New York: Basic Books.
Rubin, D. B. 1977 "Assignment to Treatment Group on the Basis of a Covariate." Journal of Educational Statistics 2:1–26.
Simon, Herbert A. 1954 "Spurious Correlation: A Causal Interpretation." Journal of the American Statistical Association 49:467–479.
—— 1979 "The Meaning of Causal Ordering." In Robert K. Merton, James S. Coleman, and Peter H. Rossi, eds., Qualitative and Quantitative Social Research: Papers in Honor of Paul F. Lazarsfeld. New York: Free Press.
Singer, Burton, and Margaret Mooney Marini 1987 "Advancing Social Research: An Essay Based on Stanley Lieberson's Making It Count." In Clifford C. Clogg, ed., Sociological Methodology. Washington, D.C.: American Sociological Association.
Thiel, Henri 1971 Principles of Econometrics. New York: Wiley.
Williams, D. A. 1976 "Improved Likelihood Ratio Tests for Complete Contingency Tables." Biometrika 63:33–37.
Xie, Yu 1989 "An Alternative Purging Method: Controlling the Composition-Dependent Interaction in an Analysis of Rates." Demography 26:711–716.
Yamaguchi, Kazuo 1987 "Models for Comparing Mobility Tables: Toward Parsimony and Substance." American Sociological Review 52:482–494.
Zeisel, Hans 1985 Say It with Figures, 6th ed. New York: Harper & Row.
Jae-On Kim
Myoung-Jin Lee