Event History Analysis
Event history analysis is a collection of statistical methods for the analysis of longitudinal data on the occurrence and timing of events. As used in sociology, event history analysis is very similar to linear or logistic regression analysis, except that the dependent variable is a measure of the likelihood or speed of event occurrence. As with other regression methods, event history analysis is often used to develop causal or predictive models for the occurrence of events. Event history analysis has become quite popular in sociology since the mid-1980s, with applications to such diverse events as divorces (Bennett et al. 1988), births (Kallan and Udry 1986), deaths (Moore and Hayward 1990), job changes (Carroll and Mayer 1986), organizational foundings (Halliday et al. 1987), migrations (Baydar et al. 1990), and friendship choices (Hallinan and Williams 1987).
Although event history methods have been developed and utilized by statistical practitioners in a variety of disciplines, the term event history analysis is primarily used in sociology and closely allied disciplines. Elsewhere the methodology is known as survival analysis (biology and medicine), failure-time analysis (engineering), or duration analysis (economics). Introductory treatments for social scientists can be found in Teachman (1983), Allison (1984, 1995), Tuma and Hannan (1984), Kiefer (1988), and Blossfeld and Rohwer (1995). For a biostatistical point of view, see Collett (1994), Kleinbaum (1996), or Klein and Moeschberger (1997).
EVENT HISTORY DATA
The first requirement for an event history analysis is event history data. An event history is simply a longitudinal record of when events occurred for an individual or a sample of individuals. For example, an event history might be constructed by asking a sample of people to report the dates of any past changes in marital status. If the goal is a causal analysis, the event history should also include information on explanatory variables. Some of these, such as race and gender, will be constant over time while others, such as income, will vary. If the timing of each event is known with considerable precision (as with exact dates of marriages), the data are called continuous-time data. Frequently, however, events are only known to have occurred within some relatively large interval of time, for example, the year of a marriage. Such data are referred to as discrete-time data or grouped data.
Event history data are often contrasted with panel data, in which the individual's status is known only at a set of regular, fixed points in time. For example, employment status and other variables may be measured in annual interviews. Panel data collected at frequent intervals can often be treated as discrete-time event history data. But if the intervals between data collections are long, one of the major attractions of event history analysis can be lost—the ability to disentangle causal ordering. While this capability is by no means guaranteed, the combination of event history data and event history analysis is perhaps the best available nonexperimental methodology for studying causal relationships.
PROBLEMS WITH CONVENTIONAL METHODS
Despite the attractiveness of event history data, they typically possess two characteristics that make conventional statistical methods highly unsuitable. Censoring is the most common problem. Suppose, for example, that the aim is to study the causes of divorce. The sample might consist of a number of couples who married in 1990 and who are followed for the next five years. For the couples who get divorced, the length of the marriage is the principal variable of interest. But a large fraction of the couples will not divorce during the five-year interval. Marriages that are still in progress when the study ends are said to be censored. The problem is to combine the data on timing with the data on occurrence in a statistically consistent fashion. Ad hoc methods, such as excluding the censored cases or assigning the maximum length of time observed, can lead to substantial biases or loss of precision.
The second problem is time-varying explanatory variables (also known as time-dependent covariates). Suppose, in our divorce example, that the researcher wants to include number of children as a predictor of divorce. But number of children may change over the marriage, and it is not obvious how such a variable should be included in a regression model. If there were no censored cases, one might be tempted to regress the length of the marriage on the number of children at the end of the marriage. But longer marriages are likely to have produced more children simply because more time is available to have them. This would produce a spurious positive relationship between number of children and the length of the marriage.
One method for dealing with the censoring problem has been around since the seventeenth century and is still widely used—the life table. The life table is one example of a variety of methods that are primarily concerned with estimating the distribution of event times without regard for the effects of explanatory variables. For a comprehensive survey of such methods, see Elandt-Johnson and Johnson (1980). The remainder of this article focuses on regression methods that estimate the effects of explanatory variables on the occurrence and timing of events.
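As an illustration, the Kaplan-Meier estimator, a close relative of the life table for continuous-time data, can be computed with standard software. The sketch below is one possible implementation, using the Python lifelines library and its bundled copy of the recidivism data analyzed later in this article; the library and function choices are mine, not part of the original discussion.

```python
from lifelines import KaplanMeierFitter
from lifelines.datasets import load_rossi  # the Rossi et al. (1980) recidivism data

df = load_rossi()  # includes week (time to arrest or censoring) and arrest (1/0)
kmf = KaplanMeierFitter()
# Censored cases (arrest == 0) are handled automatically by the estimator.
kmf.fit(durations=df["week"], event_observed=df["arrest"])
print(kmf.survival_function_.tail())  # estimated probability of remaining arrest-free
```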
ACCELERATED FAILURE-TIME MODELS
Suppose the goal is to estimate a model predicting the timing of first marriages, and the sample consists of women who are interviewed at age twenty-five. For each woman (i = 1, . . . , n), we learn her age in days at the time of the marriage, denoted by Ti. For women who were still unmarried at age twenty-five (the censored cases), Ti* is their age in days at the time of the interview. We also have data on a set of explanatory variables xi1, . . . , xik. For the moment, let us suppose that these are variables that do not change over time, such as race, parents' education, and eighth-grade test scores.
One class of models that is appropriate for data such as these is the accelerated failure-time (AFT) model. The general formulation is

log Ti = β0 + β1xi1 + . . . + βkxik + εi

By taking the logarithm on the left-hand side, we ensure that Ti is always greater than 0, regardless of the values of the x variables. Specific cases of this general model are obtained by choosing particular distributions for the random disturbance εi. The most common distributions are normal, extreme-value, logistic, and log-gamma. These imply that Ti has distributions that are, respectively, lognormal, Weibull, log-logistic, and gamma, which are the names usually given to these models. The disturbance term εi is assumed to be independent of the x's and to have constant variance.
If there are no censored data, these models can be easily estimated by ordinary least-squares regression of log T on the x's. The resulting coefficients are best linear unbiased estimators. But the presence of censored data requires a different method. The standard approach is maximum likelihood, which combines the censored and uncensored data in an optimal fashion. Maximum likelihood estimation for these models is now widely available in several statistical packages (e.g., BMDP, LIMDEP, SAS, SYSTAT, Stata).
Here's an example from criminology. In the early 1970s, 432 inmates from Maryland state prisons were followed for one year after their release (Rossi et al. 1980). The event of interest is the first arrest that occurred to each person during the one-year observation period. Only 26 percent of the released inmates were arrested. We'll use the following variables:
- ARREST 1 if arrested, otherwise 0
- WEEK Week of first arrest for those who were arrested (ranges 1 to 52); for those not arrested, week 52
- FIN 1 if they received financial aid after release, otherwise 0
- AGE Age in years at the time of release
- RACE 1 if black, otherwise 0
- MAR 1 if married at the time of release, otherwise 0
- PRIO Number of prior convictions
Using these data, I estimated a Weibull version of the accelerated failure time model by maximum likelihood with the SAS® statistical package. Results are shown in Table 1. Looking first at the p-values, we see that race and marital status do not have a significant impact on the timing of arrests. On the other hand, we see highly significant effects of age and number of prior convictions, and a just barely significant effect of financial aid.
The negative coefficient for PRIO tells us that having more prior convictions is associated with shorter times to arrest. The positive coefficient for AGE tells us that inmates who were older when they were released have longer times to arrest. Similarly, those who got financial aid have longer times to arrest. We can interpret the magnitudes of the coefficients by applying the transformation 100[exp(β)−1], which gives the percentage change in time to event for a 1-unit increase in a particular independent variable. For PRIO we get 100[exp(−.071)−1] = −6.8, which tells us that each additional conviction lowers the time to arrest by 6.8 percent, controlling for other variables in the model. For FIN we get 100[exp(.268)−1] = 31. Those who got financial aid have times to arrest that are 31 percent longer than those who did not get financial aid.
In addition to the Weibull model, I also estimated gamma, lognormal, and log-logistic models. Results were very similar across the different models.

Table 1
Results from Weibull Regression Model Predicting the Time of First Arrest

variable | coefficient | standard error | chi-square | p-value
fin | .268 | .137 | 3.79 | .05
age | .043 | .015 | 8.00 | .004
race | -.226 | .220 | 1.06 | .30
mar | .353 | .269 | 1.73 | .19
prio | -.071 | .020 | 12.85 | .0003
intercept | 4.037 | .402
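For readers who want to replicate these analyses, here is a minimal sketch using the Python lifelines library rather than SAS; the choice of library is my own, and the column names match the copy of the Rossi data that ships with it. It fits the Weibull model and, for comparison, the log-normal and log-logistic alternatives.

```python
from lifelines import WeibullAFTFitter, LogNormalAFTFitter, LogLogisticAFTFitter
from lifelines.datasets import load_rossi

df = load_rossi()[["week", "arrest", "fin", "age", "race", "mar", "prio"]]

# Weibull AFT model fit by maximum likelihood; censored cases (arrest == 0)
# contribute the probability of remaining arrest-free past week 52.
aft = WeibullAFTFitter()
aft.fit(df, duration_col="week", event_col="arrest")
aft.print_summary()  # coefficients are on the log time-to-event scale

# Alternative error distributions, for comparison with the Weibull results:
for Fitter in (LogNormalAFTFitter, LogLogisticAFTFitter):
    Fitter().fit(df, duration_col="week", event_col="arrest").print_summary()
```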
PROPORTIONAL HAZARDS MODELS
A second class of regression models for continuous-time data is the proportional hazards model. To explain this model, it is first necessary to define the hazard function, denoted by h(t), which is the fundamental dependent variable. Let P(t, t + Δt) be the conditional probability that an event occurs in the interval (t, t + Δt), given that it has not already occurred prior to t. To get the hazard function, we divide this probability by the length of the interval Δt and take the limit as Δt goes to 0:

h(t) = lim(Δt→0) P(t, t + Δt)/Δt
Other common symbols for the hazard function are r(t) and λ(t). The hazard may be regarded as the instantaneous likelihood that an event will occur at exactly time t. It is not a probability, however, since it may be greater than 1.0 (although never less than 0).
If h(t) has a constant value c, it can be interpreted as the expected number of events in a 1-unit interval of time. Alternatively, 1/c is the expected length of time until the next event. Suppose, for example, that the events are residence changes, time is measured in years, and the estimated hazard of a residence change is .20. That would imply that, for a given individual, the expected number of changes in a year is .20 and the expected length of time between changes is 1/.20 = 5 years.
Like a probability (from which it is derived), the hazard is never directly observed. Nevertheless, it governs both the occurrence and timing of events, and models formulated in terms of the hazard may be estimated from observed data.
The general proportional hazards (PH) model is given by

log h(t) = α(t) + β1x1 + . . . + βkxk
where α(t) may be any function of time. It is called the proportional hazards model because the ratio of the hazards for any two individuals is a constant over time. Notice that, unlike the AFT model, there is no disturbance term in this equation. That does not mean that the model is deterministic, however, because there is random variation in the relationship between h(t) and the observed occurrence and timing of events.
Different versions of the PH model are obtained by choosing specific forms for α(t). For example, the Gompertz model sets α(t) = α1 + α2t, which says that the hazard is an increasing (or decreasing) function of time. Similarly, the Weibull model has α(t) = α1 + α2 log t. (The Weibull model is the only model that is a member of both the AFT class and the PH class.) The exponential model—a special case of both the Weibull and the Gompertz models—sets α(t) = α, a constant over time. For any specific member of the PH class, maximum likelihood is the standard approach to estimation.
In a path-breaking paper, the British statistician David Cox (1972) showed how the PH model could be estimated without choosing a specific functional form for α(t), using a method called partial likelihood. This method is very much like maximum likelihood, except that only a part of the likelihood function is maximized. Specifically, partial likelihood takes account only of the ordering of events, not their exact timing. The combination of partial likelihood and the proportional hazards model has come to be known as Cox regression. The method has become extremely popular because, although some precision is sacrificed, the resulting estimates are much more robust. Computer programs that implement this method are now available in most full-featured statistical packages (SPSS, SAS, LIMDEP, BMDP, S-Plus, Stata, SYSTAT).
As an example, we'll estimate a proportional hazards model for the recidivism data discussed earlier. The chi-squares and p-values shown in Table 2 for the five variables are remarkably similar to those in Table 1, which were obtained with maximum likelihood estimation of a Weibull model. On the other hand, the coefficients are noticeably different in magnitude and even have signs that are reversed from those in Table 1. The sign reversal is a quite general phenomenon that stems from the fact that the dependent variable is the time of the event in the AFT model and the hazard of the event in the PH model. People with high hazards are very likely to have events at any point in time, so their times to events tend to be short. By contrast, people with low hazards tend to have long times until event occurrence. (Note that Table 2 reports no intercept: because partial likelihood leaves α(t) unspecified, no intercept is estimated.)

Table 2
Results from Cox Regression Model Predicting the Hazard of First Arrest

variable | coefficient | standard error | chi-square | p-value
fin | -.373 | .191 | 3.82 | .05
age | -.061 | .021 | 8.47 | .004
race | .317 | .308 | 1.06 | .30
mar | -.493 | .375 | 1.73 | .19
prio | .099 | .027 | 13.39 | .0003
To interpret the magnitudes of the coefficients, we can use the same transformation used for the AFT models. Specifically, 100[exp(β)−1] gives the percentage change in the hazard of an event for a 1-unit increase in a particular independent variable. Thus, for FIN we have 100[exp(−.373)−1] = −31, which says that those who got financial aid have hazards of arrest that are 31 percent lower than those who did not get aid. For AGE we have 100[exp(−.061)−1] = −6. Each additional year of age at release yields a 6 percent reduction in the hazard of arrest. Finally, each additional conviction is associated with a 100[exp(.099)−1] = 10.4 percent increase in the hazard of an arrest.
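A corresponding Cox regression sketch, again using the lifelines library (my choice, not the software used in the original analysis), shows how Table 2 and the percentage-change interpretation might be reproduced:

```python
import numpy as np
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()[["week", "arrest", "fin", "age", "race", "mar", "prio"]]
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")  # partial likelihood estimation
cph.print_summary()

# 100[exp(beta) - 1]: percentage change in the hazard per 1-unit covariate increase
print(100 * (np.exp(cph.params_) - 1))
```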
The partial likelihood method also allows one to easily introduce time-varying explanatory variables. For example, suppose that the hazard for arrest is thought to depend on both financial aid (x1) and employment status (x2(t), coded 1 if employed at time t, otherwise 0). A suitable PH model might be

log h(t) = α(t) + β1x1 + β2x2(t)
which says that the hazard at time t depends on financial aid, on employment status at time t, and on time itself. If longitudinal data on employment status are available, models such as this can be estimated in a straightforward fashion with the partial likelihood method. How to do this is shown in Chapter 5 of Allison (1995).
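In practice, estimation with time-varying covariates requires the data in "counting process" form, with one record per interval over which the covariates are constant. Here is a sketch with lifelines' CoxTimeVaryingFitter, applied to the Stanford heart transplant data that ship with that library; this is a different example from the recidivism data, used only because it comes ready-made in this format.

```python
from lifelines import CoxTimeVaryingFitter
from lifelines.datasets import load_stanford_heart_transplants

# One row per (start, stop] interval; the covariate 'transplant' switches from
# 0 to 1 mid-history for patients who received a transplant.
df = load_stanford_heart_transplants()
ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()
```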
MULTIPLE KINDS OF EVENTS
To this point, it has been assumed that all events can be treated alike. In many applications, however, there is a compelling need to distinguish among two or more types of events. For example, if the events of interest are job terminations, one might expect that explanatory variables would have vastly different effects on voluntary and involuntary terminations. For recidivism studies, it might be desirable to distinguish arrests for crimes against persons and crimes against property. The statistical analysis should take these distinctions into account.
All of the methods already discussed can be easily applied to multiple kinds of events. In essence, a separate model is estimated for each kind of event. In doing an analysis for one kind of event, one simply treats other kinds of events as though the individual were censored at the time when the event occurred, a method known as "competing risks." Thus, no new methodology is required to handle this situation.
An alternative approach is to estimate a single event history model for the timing of events, without distinguishing different event types. Then, after eliminating all the individuals who did not have events (the censored cases), one estimates a logistic regression model for the determinants of the type of event. This method of analysis is most appropriate when the different kinds of events are functionally alternative ways of achieving a single objective. For example, the event might be purchase of a computer and the two different types might be a Windows-based computer versus a Macintosh computer.
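Both approaches can be sketched in a few lines. Since the recidivism data do not record crime types, the event-type labels below are simulated purely for illustration; everything involving the hypothetical 'etype' variable is my own construction.

```python
import numpy as np
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()
rng = np.random.default_rng(0)
# Hypothetical event types: 0 = censored, 1 = person crime, 2 = property crime.
df["etype"] = np.where(df["arrest"] == 1, rng.integers(1, 3, size=len(df)), 0)

# Competing-risks approach: one Cox model per event type, with the other
# event type treated as censoring.
for etype in (1, 2):
    d = df.assign(ev=(df["etype"] == etype).astype(int))
    CoxPHFitter().fit(d[["week", "ev", "fin", "age", "prio"]],
                      duration_col="week", event_col="ev").print_summary()

# Alternative approach: drop the censored cases, then model which type of
# event occurred with an ordinary logistic regression.
events = df[df["etype"] > 0].assign(person=lambda d: (d["etype"] == 1).astype(int))
print(smf.logit("person ~ fin + age + prio", data=events).fit().summary())
```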
REPEATED EVENTS
The discussion so far has presumed that each individual experiences no more than one event. Obviously, however, such events as childbirths, job changes, arrests, and car purchases can occur many times over the life of an individual. The methods already described have been routinely applied to cases of repeated events, taking one of two alternative approaches. One approach is to do a separate analysis for each successive event. For example, one event history model is estimated for the birth of the first child, a second model is estimated for the birth of the second child, and so on. The alternative approach is to break each individual's event history into a set of intervals between events, treat each of these intervals as a distinct observation, and then pool all the intervals into a single analysis.
Neither of these alternatives is entirely satisfactory. The sequential analysis is rather tedious, wastes information if the process is invariant across the sequence, and is prone to selection biases for later events in the sequence. The pooled analysis, on the other hand, makes the rather questionable assumption that the multiple intervals for a single individual are independent. This can lead to standard errors that are biased downward and test statistics that are biased upward.
Several methods are available for dealing with the lack of independence. One is to estimate standard errors and test statistics using the robust method developed by White (1982). These robust standard errors have been incorporated into some Cox regression programs (e.g., Stata, S-Plus). Another is to do a "fixed-effects" Cox regression that stratifies on the individual (Allison 1996; Yamaguchi 1986). These models can be estimated with most Cox regression software, and they have the advantage of automatically controlling for all stable characteristics of the individual. On the other hand, fixed-effects models cannot produce coefficient estimates for stable characteristics such as sex or race. Finally, there are "random-effects" or "frailty" models that explicitly build the dependence into the model as an additional random disturbance. Unfortunately, there is little commercial software available to estimate random-effects event history models.
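The first two corrections can be sketched with the cluster and strata options of lifelines' CoxPHFitter; the simulated gap-time data below are entirely hypothetical and exist only to make the sketch self-contained.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulate repeated events: each person contributes several intervals between
# events, and the covariate x varies across a person's intervals.
rng = np.random.default_rng(1)
rows = []
for i in range(200):
    for j in range(3):
        x = (i + j) % 2
        rows.append({"id": i, "x": x, "event": 1,
                     "gap": rng.exponential(scale=15 if x == 0 else 10)})
df = pd.DataFrame(rows)

# Pooled analysis with robust (sandwich) standard errors clustered on the person:
CoxPHFitter().fit(df, duration_col="gap", event_col="event",
                  cluster_col="id").print_summary()

# Fixed-effects Cox regression: stratify on the person; all stable personal
# characteristics are controlled, but their coefficients cannot be estimated.
CoxPHFitter().fit(df, duration_col="gap", event_col="event",
                  strata="id").print_summary()
```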
DISCRETE-TIME METHODS
When event times are measured coarsely, the continuous-time methods already discussed may yield somewhat biased estimates. In such cases, methods specifically designed for discrete-time data are more appropriate (Allison 1982). Moreover, such methods are easily employed and are particularly attractive for handling large numbers of time-varying explanatory variables.
Suppose that the time scale is divided into a set of equal intervals, indexed by t = 1, 2, 3, . . . The discrete-time analog of the hazard function, denoted by Pt, is the conditional probability that an event occurs in interval t, given that it has not occurred prior to t. A popular model for expressing the dependence of Pt on explanatory variables is the logit model

log[Pt/(1 − Pt)] = αt + β1x1t + . . . + βkxkt

where the subscript on αt indicates that the intercept may differ for each interval of time. Similarly, the explanatory variables may take on different values at each interval of time. This model can be estimated by the method of maximum likelihood, using the following computational strategy (a code sketch follows the list):
- Break each individual's event history into a set of discrete time units, for example, person-years.
- Create a dependent variable equal to 1 for time units in which an event occurred and 0 otherwise. Explanatory variables are assigned whatever values they had at the beginning of the time unit.
- Pool all these time units, and estimate a logistic regression model using a standard maximum likelihood logit program.
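A minimal sketch of this strategy in Python (pandas plus statsmodels), applied to the recidivism data; grouping the 52 weeks into 13-week "quarters" is my own simplification to keep the number of time dummies manageable.

```python
import pandas as pd
import statsmodels.formula.api as smf
from lifelines.datasets import load_rossi

df = load_rossi().reset_index().rename(columns={"index": "id"})

# 1. Break each history into discrete person-periods (13-week quarters here).
rows = []
for _, r in df.iterrows():
    last_q = (int(r["week"]) - 1) // 13 + 1  # quarter in which the history ends
    for q in range(1, last_q + 1):
        rows.append({"id": r["id"], "quarter": q,
                     # 2. Dependent variable: 1 only in the period of the event.
                     "event": int(q == last_q and r["arrest"] == 1),
                     "fin": r["fin"], "age": r["age"], "prio": r["prio"]})
pp = pd.DataFrame(rows)

# 3. Pool the person-periods and fit a logit model; C(quarter) gives a separate
#    intercept for each time interval (the alpha_t in the equation above).
model = smf.logit("event ~ C(quarter) + fin + age + prio", data=pp).fit()
print(model.summary())
```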
Other models and computational methods are also available for the discrete-time case. These methods can be easily extended to allow for multiple kinds of events and repeated events.
REFERENCES
Allison, Paul D. 1982 "Discrete Time Methods for the Analysis of Event Histories." In Samuel Leinhardt, ed., Sociological Methodology 1982. San Francisco: Jossey-Bass.
——1984 Event History Analysis. Beverly Hills, Calif.: Sage.
——1995 Survival Analysis Using the SAS® System: A Practical Guide. Cary, N.C.: SAS Institute.
——1996 "Fixed Effects Partial Likelihood for Repeated Events." Sociological Methods and Research 25:207–222.
Baydar, Nazli, Michael J. White, Charles Simkins, and Ozer Babakol 1990 "Effects of Agricultural Development Policies on Migration in Peninsular Malaysia." Demography 27:97–110.
Bennett, Neil G., Ann Klimas Blanc, and David E. Bloom 1988 "Commitment and the Modern Union: Assessing the Link Between Premarital Cohabitation and Subsequent Marital Stability." American Sociological Review 53:127–138.
Blossfeld, Hans-Peter, and Götz Rohwer 1995 Techniques of Event History Modeling: New Approaches to Causal Modeling. Mahwah, N.J.: Lawrence Erlbaum.
Carroll, Glenn R., and Karl Ulrich Mayer 1986 "Job-Shift Patterns in the Federal Republic of Germany: The Effects of Social Class, Industrial Sector, and Organizational Size." American Sociological Review 51:323–341.
Collett, D. 1994 Modelling Survival Data in Medical Research. London: Chapman and Hall.
Cox, David R. 1972 "Regression Models and Life Tables." Journal of the Royal Statistical Society, Series B, 34:187–202.
Elandt-Johnson, R. C., and N. L. Johnson 1980 Survival Models and Data Analysis. New York: Wiley.
Halliday, Terence C., Michael J. Powell, and Mark W. Granfors 1987 "Minimalist Organizations: Vital Events in State Bar Associations, 1870–1930." American Sociological Review 52:456–471.
Hallinan, Maureen, and Richard A. Williams 1987 "The Stability of Students' Interracial Friendships." American Sociological Review 52:653–664.
Kallan, Jeffrey, and J. R. Udry 1986 "The Determinants of Effective Fecundability Based on the First Birth Interval." Demography 23:53–66.
Kiefer, Nicholas M. 1988 "Economic Duration Data and Hazard Functions." Journal of Economic Literature 26:646–679.
Klein, John P., and Melvin Moeschberger 1997 Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer-Verlag.
Kleinbaum, David G. 1996 Survival Analysis: A Self-Learning Text. New York: Springer-Verlag.
Moore, David E., and Mark D. Hayward 1990 "Occupational Careers and Mortality of Elderly Men." Demography 27:31–53.
Rossi, P. H., R. A. Berk, and K. J. Lenihan 1980 Money, Work and Crime. New York: Academic Press.
Teachman, Jay D. 1983 "Analyzing Social Processes: Life Tables and Proportional Hazards Models." Social Science Research 12:263–301.
Tuma, Nancy Brandon, and Michael T. Hannan 1984 Social Dynamics: Models and Methods. Orlando, Fla.: Academic Press.
White, Halbert 1982 "Maximum Likelihood Estimation of Misspecified Models." Econometrica 50:1–25.
Yamaguchi, Kazuo 1986 "Alternative Approaches to Unobserved Heterogeneity in the Analysis of Repeated Events." In Nancy Brandon Tuma, ed., Sociological Methodology 1986. Washington, D.C.: American Sociological Association.
Paul D. Allison