Event-History Analysis
Event-history analysis is a set of statistical methods designed to analyze categorical or discrete data on processes or events that are time-dependent (i.e., for which the timing of occurrence is as meaningful as whether they occurred or not). One example of such time-dependent processes is mortality: variation across individuals is not captured by the lifetime probability of dying (which is one for every individual), but by differences in the age at which death occurs. Another example is marriage: here, variation across individuals is captured by both the lifetime probability of getting married and differences in age at marriage.
Event-history analysis, sometimes called survival analysis, has applications in many fields, including sociology, economics, biology, medicine, and engineering. Applications in demography are particularly numerous, given demography's focus on age and cohorts. In addition to mortality, demographic events that can be investigated with event-history analysis include marriage, divorce, birth, migration, and household formation.
Comparison to Life Table Analysis
Event-history analysis has its roots in classical life table analysis. In fact, life table analysis is one of the methods covered by event-history analysis, and many of the concepts of event-history analysis, such as survival curves and hazard rates, have equivalents in a conventional life table. One difference from life table analysis is that event-history analysis is based on data at the individual level and aims at describing processes operating at that level. Also, whereas conventional life table analysis is deterministic, event-history analysis is probabilistic. Hence, many event-history analysis outcomes will have confidence intervals attached to them. Another feature of event-history analysis relative to conventional life table analysis is the use of covariates. Event-history analysis makes it possible to identify factors associated with timing of events. These factors can be fixed through time (such as ethnicity or parents' education), or vary with time (such as income and marital status).
Whereas conventional life table analysis can be applied to both longitudinal and cross-sectional data, event-history analysis requires longitudinal data. Longitudinal data can be collected either in a prospective fashion by following individuals through time, or retrospectively by asking individuals about past events.
Censored Data and Time-Varying Covariates
Because of their longitudinal nature, event history data have some features that make traditional statistical techniques inadequate. One such feature is censoring, which means that information on events and exposure to the risk of experiencing them is incomplete. Right censoring, the most common type of censoring in event-history analysis, occurs when recording of events is discontinued before the process is completed. For example, in longitudinal data collection, individuals previously included in a sample may stop contributing information, either because the study is discontinued before they experience the event of interest, or because they discontinue their participation in the study before they experience the event. Another, less common, type of censoring is left censoring, which occurs when recording is initiated after the process has started. In the remainder of this article, censoring will refer to right censoring.
It is important to include censored individuals in event-history analysis, because the fact that they did not experience the event of interest in spite of their exposure is in itself meaningful. Censoring can be handled adequately as long as it is independent, that is, as long as the risk of being censored is not related to the risk of experiencing the event, or, equivalently, provided that individuals censored at any given time are representative of all other individuals. If the two risks are related, however, the estimates obtained can be seriously biased.
Another particular feature of survival data is the potential presence of time-varying covariates. For example, an individual's income may vary over time, and these variations may have an effect on the risk of experiencing events. If this is the case, it is important to include information on these variations in the analysis.
Unlike traditional statistical techniques such as ordinary least squares (OLS), event-history analysis can handle both censoring and time-varying covariates, using the method of maximum likelihood estimation. With the maximum likelihood approach, the estimated regression coefficients are the ones that maximize the likelihood of the observations being what they are. That is, the set of estimated coefficients is more likely than any other set of coefficient values to have given rise to the observed set of events and censored cases.
Hazard Rates
An important concept in event-history analysis is the hazard rate, h(t). The hazard rate is the risk or hazard that an event will occur during a small time interval, (t, t+dt). It corresponds to the rate of occurrence of an event (number of occurrences/amount of exposure to the risk of occurrence) during an infinitesimal time or age interval. If the event under study is death, then the hazard rate is called the force of mortality, μ(x), where x is age. Event-history analysis can be used to explore how hazard rates vary with time, or how certain covariates affect the level of the hazard rate.
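As a rough illustration of the occurrence/exposure definition, the sketch below computes a single hazard rate from follow-up records. The function name and the data are hypothetical. Under the added assumption that the hazard is constant over time (an exponential model), this ratio is also the maximum likelihood estimate of the rate, which shows how censored individuals enter the calculation: they contribute exposure but no event.

```python
def occurrence_exposure_rate(durations, observed):
    """Number of events divided by total exposure time.

    durations: time each individual was followed (hypothetical unit: years)
    observed:  True if follow-up ended with the event,
               False if the individual was right-censored
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    events = sum(1 for o in observed if o)
    exposure = sum(durations)  # censored cases still add exposure
    if exposure <= 0:
        raise ValueError("total exposure must be positive")
    return events / exposure


# Hypothetical data: five individuals, two of them censored.
durations = [2.0, 5.0, 3.0, 6.0, 4.0]
observed = [True, False, True, True, False]
rate = occurrence_exposure_rate(durations, observed)  # 3 events / 20 person-years
```

Dropping the two censored individuals would give 3 events over 11 person-years, a higher rate, which is why censored cases should be kept in the analysis.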
Types of Analysis
Methods of event-history analysis fall into three categories:
- Nonparametric, in which no assumption is made about the shape of the hazard function;
- Parametric, requiring an assumption about how the hazard rate varies with time; and
- Semiparametric, requiring an assumption about how the hazard rate varies across individuals but no assumption about its overall shape.
Nonparametric Models
The life table approach to analyzing event history data is a nonparametric method. It is very similar to traditional life table construction in demography, although it is based on cohort rather than period data. The logic behind the life table approach is to calculate Q(ti), the probability of "failing" (for instance, dying) in the interval [ti, ti+n], from data on N(ti), the number of individuals at risk of failing at time ti, and D(ti), the number of failures between ti and ti+n. The number of individuals at risk needs to be adjusted for the fact that some individuals, C(ti), will be censored, that is, removed from the risk of experiencing the event during the interval. Under the usual assumption that censored individuals are at risk for half of the interval on average, Q(ti) can be expressed as:
$$Q(t_i) = \frac{D(t_i)}{N(t_i) - \tfrac{1}{2}\,C(t_i)}$$
The proportion of persons surviving at time ti, S(ti), is then obtained as the product of the probabilities of surviving over all earlier time intervals:
$$S(t_i) = \prod_{j=0}^{i-1} \left[1 - Q(t_j)\right]$$
Another output of the life table method is the hazard rate, h(ti), which is simply calculated by dividing the number of events experienced during the interval beginning at ti by the number of person-years lived during the interval. The number of person-years is estimated by assuming that both failures and censored cases occur uniformly through the interval of length n. Hence h(ti) is given by:
$$h(t_i) = \frac{D(t_i)}{n\left[N(t_i) - \tfrac{1}{2}\,D(t_i) - \tfrac{1}{2}\,C(t_i)\right]}$$
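The life table calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the input counts, and the half-interval adjustment for censored cases and failures (the uniformity assumption) are stated assumptions.

```python
def life_table(n_at_start, failures, censored, width=1.0):
    """Cohort life table from interval counts.

    n_at_start: number at risk at the start of the first interval
    failures:   D(t_i), failures in each interval
    censored:   C(t_i), censored cases in each interval
    width:      interval length n
    Returns one dict per interval with N, Q, S (survival at the
    start of the interval) and h.
    """
    rows = []
    at_risk = n_at_start
    surviving = 1.0
    for d, c in zip(failures, censored):
        # Assumption: censored cases are exposed for half the interval on average.
        effective_n = at_risk - c / 2.0
        q = d / effective_n if effective_n > 0 else 0.0
        # Person-years lived: failures and censored cases each count for half.
        person_years = width * (at_risk - d / 2.0 - c / 2.0)
        h = d / person_years if person_years > 0 else 0.0
        rows.append({"N": at_risk, "Q": q, "S": surviving, "h": h})
        surviving *= 1.0 - q          # survival at the start of the next interval
        at_risk -= d + c              # failures and censored cases leave the risk set
    return rows


# Hypothetical cohort of 100 followed over three one-year intervals.
table = life_table(100, failures=[10, 8, 6], censored=[4, 2, 0])
```

In this hypothetical cohort the first interval has 98 effective persons at risk (100 minus half of the 4 censored cases), and 86 individuals enter the second interval.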
The above equations can produce biased results when time intervals are large relative to the rate at which events occur. If failures and censored cases are recorded with exact time, it is possible to correct for these biases by use of what is known as the Kaplan-Meier method. Suppose that dj is the number of deaths at exact time tj, and that Nj is the number of persons at risk at time tj. The Kaplan-Meier estimator of the survival curve S(t) is defined as:
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{N_j}\right)$$
where Nj is obtained by subtracting all failures and censored cases that occurred before tj from the initial size of the cohort. Compared to the life table method, the Kaplan-Meier method produces a more detailed contour of the survival curve. It is more appropriate than the life table approach when the recording of events is precise. The Kaplan-Meier method permits calculation of confidence intervals around the survival curve and the hazard rate. It also makes it possible to calculate survival curves for two or more groups with different characteristics, and to test the null hypothesis that survival functions are identical for these groups.
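A minimal sketch of the Kaplan-Meier calculation follows. The function name and the data are hypothetical, and the code adopts the common convention that an individual censored at the same time as a failure is still counted as at risk at that time.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t_j, S(t_j))] at each distinct failure time t_j.

    durations: exact time of failure or censoring for each individual
    observed:  True for a failure, False for a right-censored case
    """
    deaths = Counter(t for t, o in zip(durations, observed) if o)
    exits = Counter(durations)      # failures plus censored cases at each time
    at_risk = len(durations)        # N_j, updated as individuals leave
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # remove everyone who exited at time t
    return curve


# Hypothetical data: six individuals, two censored (at times 3 and 6).
durations = [2, 3, 3, 5, 6, 8]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
```

The curve steps down only at failure times; censored cases change nothing at the moment they leave, but they shrink the risk set used at later failure times.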
Parametric and Semiparametric Models
Although nonparametric life table approaches can perform some tests across groups, they do not permit direct estimation of the effect of specific variables on the timing of events or on the hazard rate. In order to estimate such effects, one needs to use regression models that fall into the category of fully parametric or semiparametric methods.
Accelerated failure-time models. The most common fully parametric models are called accelerated failure-time models. They postulate that covariates have multiplicative effects both on the hazard rate and on timing of events. They commonly take Ti, the time at which the event occurs, as a dependent variable. A general representation of accelerated failure-time models is:
$$\log T_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \sigma \varepsilon_i$$
where Ti is the time at which the event of interest occurs for individual i, xi1, …, xik is a set of k explanatory variables with coefficients β1, …, βk, β0 is an intercept, εi is an error term, and σ is a scale parameter. (Taking the logarithm of Ti ensures that the timing of events will be positive whatever the values of the covariates for a specific individual.)
This model can be adapted to various situations by choosing a specific distribution for the error term εi. Common distributions chosen include normal (when the distribution of Ti is log-normal), extreme value (when the distribution of Ti is Weibull), logistic (when the distribution of Ti is log-logistic), and log-gamma (when the distribution of Ti is gamma). Accelerated failure-time models are fully parametric precisely because they require the choice of a model distribution of failure times. Although the above equation resembles that of an OLS regression, the estimation must be performed using the maximum likelihood procedure in order to accommodate the presence of censored cases. Regression coefficients in accelerated failure-time models can be interpreted by calculating 100(e^β − 1), which is an estimate of the percentage change in the time at which the event occurs for a one-unit increase in a particular independent variable.
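The percentage-change interpretation can be checked with a short calculation. In the sketch below, the coefficient values, the intercept, and the function names are hypothetical; the second function simply evaluates a log-linear model for the event time with the error term held fixed, to show that a one-unit increase in a covariate multiplies the event time by e^β.

```python
import math


def percent_change(beta):
    """Percentage change in event time for a one-unit covariate increase."""
    return 100.0 * (math.exp(beta) - 1.0)


def event_time(x, beta0, beta1, sigma=1.0, epsilon=0.0):
    """Event time implied by log T = beta0 + beta1 * x + sigma * epsilon."""
    return math.exp(beta0 + beta1 * x + sigma * epsilon)


# Hypothetical coefficient: beta1 = 0.20 for some covariate x.
beta0, beta1 = 1.0, 0.20
change = percent_change(beta1)

# Ratio of event times for x = 3 versus x = 2, same error term.
time_ratio = event_time(3, beta0, beta1) / event_time(2, beta0, beta1)
```

Here the time ratio equals e^0.20, so the event occurs roughly 22 percent later for each additional unit of x; a negative coefficient would indicate an earlier event.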
Proportional hazard models. Another type of regression model in event-history analysis is the proportional hazard model. Such models postulate that the set of covariates acts in a multiplicative way on the hazard rate. A general formulation of proportional hazard models is:
$$h_i(t) = h_0(t)\,\exp\left(\beta_1 x_{i1} + \dots + \beta_k x_{ik}\right)$$
where h0(t) is the baseline hazard that is increased or decreased by the effects of the covariates.
This model is called proportional hazard because for any two individuals the ratio of their hazard rates is constant over time. If the form for h0(t) is specified, the result is a fully parametric model. The most common specifications for h0(t) are the exponential, Weibull, and Gompertz models. Like accelerated failure-time models, fully parametric proportional hazard models are estimated using the maximum likelihood procedure.
Proportional hazard models can also be estimated without specifying the shape of h0(t). In an influential paper, D.R. Cox (1972) showed that if one assumes that the ratio of the hazards for any two individuals is constant over time, one can estimate the effect of covariates on hazard rates with no assumption regarding the shape of h0(t), using a "partial likelihood" approach. These models, commonly called Cox regression models, are semiparametric because of the absence of any assumption regarding the time structure of the baseline hazard rate. In order to interpret the coefficients (βi) of such regressions, one can calculate the percent change in the hazard rate for a one-unit increase in the variable, using again the transformation 100(e^β − 1). Cox regression models, which also can be easily adapted to accommodate time-varying covariates, are probably the most popular of available event history models.
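To make the partial likelihood idea concrete, the following sketch fits a Cox model with a single fixed covariate. It is an illustration under stated assumptions rather than a production routine: the data and function names are hypothetical, ties are handled with the simple Breslow form, and the maximum is found by a golden-section search, which relies on the log partial likelihood being concave in β. Note that the baseline hazard never appears in the calculation.

```python
import math


def cox_log_partial_likelihood(beta, durations, observed, x):
    """Log partial likelihood for one covariate (Breslow form for ties)."""
    total = 0.0
    for t_i, event_i, x_i in zip(durations, observed, x):
        if not event_i:
            continue  # censored cases contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(beta * x_j)
                   for t_j, x_j in zip(durations, x) if t_j >= t_i)
        total += beta * x_i - math.log(risk)
    return total


def fit_cox(durations, observed, x, lo=-5.0, hi=5.0, tol=1e-8):
    """Maximize the log partial likelihood by golden-section search."""
    def f(b):
        return cox_log_partial_likelihood(b, durations, observed, x)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if f(c) < f(d):
            a = c   # maximum lies in [c, b]
        else:
            b = d   # maximum lies in [a, d]
    return (a + b) / 2.0


# Hypothetical data: six individuals, one censored, binary covariate x.
durations = [1, 2, 3, 4, 5, 6]
observed = [True, True, True, True, False, True]
x = [1, 0, 1, 0, 1, 0]

beta_hat = fit_cox(durations, observed, x)
hazard_ratio = math.exp(beta_hat)                 # hazard for x = 1 relative to x = 0
pct_change = 100.0 * (math.exp(beta_hat) - 1.0)   # percent change in the hazard
```

With data in which one group always fails first, the estimate would run to the boundary of the search interval, so the bounds `lo` and `hi` should be read as a safeguard, not as part of the method.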
Generalizations
In some cases it is important to distinguish among different kinds of events. For example, in demography it is sometimes necessary to focus on deaths from particular causes rather than on deaths from all causes. In such situations, individuals are being exposed to "competing risks," which means that at any time they face the risk of experiencing two or more alternative events. All the methods described above can be adapted to handle multiple events by estimating separate models for each alternative event, treating other events as censored cases. As in the case of censoring, the assumption is that risks of experiencing alternative events are independent of one another; violation of this assumption leads to biased estimates.
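The recoding step behind this approach can be sketched as follows. The cause labels and the function name are hypothetical; the point is only that, for each cause analyzed, exits from any other cause are treated exactly like right-censored cases.

```python
def cause_specific_indicator(causes, cause_of_interest):
    """Event indicator for one cause.

    causes: cause of exit for each individual, or None if right-censored.
    Exits from other causes are coded False, i.e. treated as censored.
    """
    return [cause == cause_of_interest for cause in causes]


# Hypothetical records: cause of death, or None for a censored case.
causes = ["cancer", None, "heart", "cancer", "heart"]

cancer_events = cause_specific_indicator(causes, "cancer")
heart_events = cause_specific_indicator(causes, "heart")
```

Each indicator list can then be paired with the same durations and passed to any single-event method, keeping in mind that the results depend on the independence assumption discussed above.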
There are cases where the event of interest occurs in discrete time intervals. This can happen because of the nature of the event, or because the timing of events is not exactly recorded. Event-history analysis includes methods that are specifically designed for dealing with discrete time. The basic principle behind these models is to use discrete time units rather than individuals as the unit of observation. By breaking down each individual's survival history into discrete time units and pooling these observations, it is possible to estimate a model predicting the probability that the event occurs during a time interval, given that it has not occurred before. Such models are easy to implement and are computationally efficient. Also, since the unit of observation is a time interval, it is easy to include covariates taking different values for different time intervals.
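The restructuring into discrete time units can be illustrated with a short sketch. The record layout and function names are assumptions made for the example. Each individual contributes one row per period at risk, and the event indicator is 1 only in the final period of an individual who experienced the event.

```python
def person_period(records):
    """Expand (person_id, duration_in_periods, observed) into person-period rows."""
    rows = []
    for person_id, duration, observed in records:
        for period in range(1, duration + 1):
            event = 1 if (observed and period == duration) else 0
            rows.append((person_id, period, event))
    return rows


def discrete_hazards(rows):
    """Events divided by rows at risk, for each period."""
    at_risk, events = {}, {}
    for _, period, event in rows:
        at_risk[period] = at_risk.get(period, 0) + 1
        events[period] = events.get(period, 0) + event
    return {p: events[p] / at_risk[p] for p in sorted(at_risk)}


# Hypothetical records: "b" is censored after two periods.
records = [("a", 3, True), ("b", 2, False), ("c", 1, True)]
rows = person_period(records)
hazards = discrete_hazards(rows)
```

In practice the pooled rows would usually be passed to a regression for a binary outcome, such as a logistic regression, with time-varying covariates added as extra columns that take a different value in each row.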
All the models presented here assume that two individuals with identical values of covariates have identical risks of experiencing the event of interest. If there are no covariates in the model, the assumption is that risks are identical for all individuals. Such assumptions can be problematic in survival analysis. In fact, if some important characteristics are not accounted for, the aggregate risk may appear to decrease with time because the proportion of individuals with lower risks increases as time passes. Thus, in the presence of unobserved heterogeneity, it may be erroneous to use survival analysis to make inferences about individuals' risks. Although there are solutions to handle this potential bias, options for dealing with unobserved heterogeneity are limited and are highly sensitive to the underlying assumptions of the models.
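The selection effect described here can be reproduced numerically. The sketch below assumes a population made of two unobserved subgroups, each with a constant hazard; the shares and hazard values are hypothetical. Although no individual's risk changes with time, the aggregate hazard falls because high-risk individuals leave the population faster.

```python
import math


def aggregate_hazard(t, share_low, hazard_low, hazard_high):
    """Population hazard at time t for a mixture of two constant-hazard groups."""
    # Survivors from each group, as a fraction of the initial population.
    surv_low = share_low * math.exp(-hazard_low * t)
    surv_high = (1.0 - share_low) * math.exp(-hazard_high * t)
    # Weighted average of the group hazards among those still at risk.
    return (hazard_low * surv_low + hazard_high * surv_high) / (surv_low + surv_high)


# Hypothetical population: half low risk (0.1 per year), half high risk (0.5 per year).
hazards = [aggregate_hazard(t, 0.5, 0.1, 0.5) for t in range(0, 11)]
```

The aggregate hazard starts at 0.3, the simple average of the two groups, and declines toward 0.1 as the surviving population becomes dominated by the low-risk group.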
Another implicit assumption in all the models discussed above is that events can be experienced only once, which implies that individuals are removed from the population "at risk" after they experience the event. There are many situations, however, in which events are repeatable. For example, a person who had a child or changed jobs can experience those events again. Under these circumstances, it is still possible to use single-event methods by analyzing each successive event separately, or by using a discrete-time analysis where the unit of observation is a time interval and where all time intervals, assumed to be independent for a single individual, are pooled together. However, these strategies are unsatisfactory for many reasons, and specific methods exist to deal with repeatable events. As in the case of unobserved heterogeneity, options for dealing with repeatable events are still limited.
See also: Cohort Analysis; Estimation Methods, Demographic; Life Tables; Multistate Demography; Stochastic Population Theory.
Bibliography
Allison, Paul D. 1995. Survival Analysis Using the SAS System: A Practical Guide. Cary, NC: SAS Institute.
Cleves, Mario, William W. Gould, and Roberto Gutierrez. 2002. An Introduction to Survival Analysis Using Stata. College Station, TX: Stata Corporation.
Collett, David. 1994. Modelling Survival Data in Medical Research. London: Chapman and Hall.
Courgeau, Daniel, and Eva Lelièvre. 1992. Event History Analysis in Demography. Oxford, Eng.: Clarendon Press.
Cox, David R. 1972. "Regression Models and Life Tables." Journal of the Royal Statistical Society, Series B 34: 187–220.
Manton, Kenneth, Eric Stallard, and James W. Vaupel. 1986. "Alternative Models for the Heterogeneity of Mortality Risks among the Aged." Journal of the American Statistical Association 81: 635–44.
Palloni, Alberto, and Aage B. Sorensen. 1990. "Methods for the Analysis of Event History Data: A Didactic Overview." In Life Span Development and Behavior, ed. Paul B. Baltes, David L. Featherman, and Richard M. Lerner. Hillsdale, NJ: Erlbaum.
Trussell, James, Richard K. B. Hankinson, and Judith Tilton. 1992. Demographic Applications of Event History Analysis. Oxford, Eng.: Clarendon Press.
Wu, Lawrence L. 2003. "Event History Models for Life Course Analysis." In Handbook of the Life Course, ed. Jeylan Mortimer and Michael Shanahan. New York: Plenum.
Michel Guillot