Instrumental Variables Regression
In regression analysis social scientists use data to understand relationships among the people, institutions, and conditions represented by the measurements contained in the data’s variables. These observations of the variables, along with some assumptions about the relationships, are used to test hypotheses generated by the model. The method of instrumental variables (IV) estimation addresses a particular difficulty encountered in ordinary least squares (OLS) regression: One or more of our explanatory variables may be “endogenous.” Consider either the simple regression model with observations i = 1, …, n, of variables x and y,
y_i = α + βx_i + ε_i     (1)
or the multiple regression model analogue with additional explanatory variables (without loss of generality, one additional variable)
y_i = α + β_1 x_{1i} + β_2 x_{2i} + ε_i     (2)
The goal of regression analysis (using OLS, IV, or other methods) is to estimate the values of the β parameters and thus the relationship between x and y and, in the process, calculate values for each observation’s error, ε_i, as well. IV estimation can improve our interpretation of this relationship as causation rather than simply correlation. IV estimation attempts to address the concern that correlation between x and ε confounds the causation running from x to y. There are several reasons to suspect this “endogeneity,” the correlation between the error and an explanatory variable.
First, the explanatory variable(s) and the dependent variable may be determined simultaneously—there may be feedback from the dependent variable to the explanatory variable. For example, when estimating the demand curve, x is a good’s price, and y is the quantity of that good purchased. The observed quantity is an equilibrium quantity, which is jointly determined with price. Higher prices are associated with larger error terms. Due to the law of supply, a deviation of price above the demand curve induces a deviation of quantity to the right of the demand curve, which would bias OLS regression to measure a weaker relationship than actually exists. In fact the earliest known application of IV estimation was Philip Wright’s (1928) study of the butter and flaxseed markets.
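The direction of this simultaneity bias can be seen in a small simulation. The sketch below is illustrative only: the linear demand and supply curves, their slopes and intercepts, and the shock variances are all assumptions chosen for the example, not values from any study cited here.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

# Hypothetical market (every number here is an assumption for illustration):
#   demand: q = 10 - 1.0 * p + u
#   supply: q =  2 + 1.0 * p + v
TRUE_DEMAND_SLOPE = -1.0
SUPPLY_SLOPE = 1.0

rng = random.Random(0)
prices, quantities = [], []
for _ in range(20000):
    u = rng.gauss(0, 1)  # demand shock: the error in the demand regression
    v = rng.gauss(0, 2)  # supply shock
    # The equilibrium price solves demand = supply, so it depends on u.
    p = (10 - 2 + u - v) / (SUPPLY_SLOPE - TRUE_DEMAND_SLOPE)
    q = 10 + TRUE_DEMAND_SLOPE * p + u
    prices.append(p)
    quantities.append(q)

# OLS slope from regressing the equilibrium quantity on the equilibrium price.
ols_slope = cov(prices, quantities) / cov(prices, prices)
print(f"true demand slope: {TRUE_DEMAND_SLOPE:.2f}")
print(f"OLS slope:         {ols_slope:.2f}")
```

Under these assumed parameters the OLS slope converges to about -0.6 rather than the true -1.0, because the equilibrium price is positively correlated with the demand shock. The estimated relationship is therefore weaker than the true one.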
Second, there may be an omitted variable that is correlated with both the explanatory and explained variables. IV methods are often associated with labor economics, where the return to education may be of interest: the effect of another year of education (x) on a person’s future earnings (y). Ability is an omitted variable in this regression, either because it is unmeasured or because it is difficult to measure accurately. The estimate of β may be biased if individuals who are more likely to pursue additional education also have more innate ability that is rewarded in the labor market by higher wages. The correlation between education and wages calculated by OLS would be larger than the correlation holding ability constant.
Third, an explanatory variable may be measured with error. Because an explanatory variable’s (x ’s) measurement error is part of the regression’s unobserved error (ε ), larger errors are associated with larger values of the explanatory variable, and thus x is endogenous. The importance of this bias will be affected by the size of the measurement errors relative to other error components.
Although Wright (1928) is recognized as the first appearance of the method of instrumental variables, there is some controversy as to the actual author of the technique—it may have been Philip Wright’s son Sewall. James Stock and Francesco Trebbi (2003) confirm that Philip deserves the credit. Olav Reiersøl (1941) was the first to use the term instrumental variables when the method was “rediscovered” decades after the Wrights did their work (Reiersøl 1945; Geary 1949). The Cowles Commission (Christ 1994) and Trygve Haavelmo (1944) pursued issues of model identification, to which IV techniques are linked. The development of two-stage least squares (Theil 1953; Basmann 1957; Theil 1958) was an important step in computational feasibility and statistical efficiency for equations with multiple endogenous variables and instruments. In the early twenty-first century the discussion of instrumental variable estimation’s merits and faults continues (Angrist and Krueger 2001; Rosenzweig and Wolpin 2000), even in the popular press (Hilsenrath 2005; Whitehouse 2007).
REGRESSION TECHNIQUES
OLS regression estimates β in equation (1) by calculating the covariance of each side of the equation with x. Taking that covariance on each side of (1) produces the equation
cov(x, y) = βcov(x, x) + cov(x, ε)     (3)
If x and ε are uncorrelated, so that cov(x, ε) = 0, then we can estimate β = cov(x, y)/cov(x, x) from (3), and this also produces estimates of each observation’s ε_i. Without the population-level assumption of zero correlation between the explanatory variable and the equation errors, we do not have enough information to “identify” both the β and ε_i values.
IV regression addresses the problem of cov(x, ε ) ≠ 0 by assuming that the correlation between ε and another variable z, the “instrumental variable,” is zero: cov(z, ε) = 0. Taking the covariance between this instrumental variable and our equation (1) (replacing x in the first terms of equation (3)) produces
cov(z, y ) = βcov(z, x ) + cov(z, ε )
which can be solved for β = cov(z, y )/cov(z, x ), as long as cov(z, ε ) = 0. Note that we do not replace x entirely in the equation; we are still investigating the relationship between x and y, not that between z and y. In fact the condition that z and ε be uncorrelated implies that there is no relationship between z and y other than through the relationship between z and x. This solution for β also shows mathematically why it is important that cov(z,x ) ≠ 0.
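The two covariance ratios can be compared on simulated data. This is a minimal sketch under stated assumptions: the data-generating process, the coefficient values, and the helper names `cov` and `simulate` are hypothetical and serve only to show the formulas at work.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def simulate(n=20000, beta=2.0, seed=0):
    """Draw (x, y, z) where x is endogenous and z is a valid instrument.

    Assumed process (illustrative only):
        x = z + 0.8 * eps + noise,   y = 1 + beta * x + eps
    so cov(x, eps) != 0 while cov(z, eps) = 0 and cov(z, x) != 0.
    """
    rng = random.Random(seed)
    x, y, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)    # instrument, independent of the error
        eps = rng.gauss(0, 1)   # structural error
        xi = zi + 0.8 * eps + rng.gauss(0, 1)
        x.append(xi)
        y.append(1.0 + beta * xi + eps)
        z.append(zi)
    return x, y, z

x, y, z = simulate()
beta_ols = cov(x, y) / cov(x, x)  # inconsistent here because cov(x, eps) != 0
beta_iv = cov(z, y) / cov(z, x)   # consistent because cov(z, eps) = 0
print(f"OLS estimate: {beta_ols:.3f}")
print(f"IV estimate:  {beta_iv:.3f}")
```

With the assumed true β of 2, the IV ratio settles close to 2 in a large sample, while the OLS ratio stays above it. Shrinking the coefficient on z in the equation for x moves cov(z, x) toward zero and makes the IV ratio unstable.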
The two important assumptions of IV are reflected by the connection between the instrument(s) z and x and the lack of any other connection between the instrument(s) and the dependent variable. The first assumption can be checked: The instrumental variable z must be correlated with the endogenous explanatory variable x. The second assumption relies on our knowledge of the social phenomenon under study. The only effect of the instrument z on the dependent variable y is through the explanatory variable x, so that there is no correlation between the instrument and the error ε in equation (1). Because we do not actually know the true value of the error, we cannot fully test this second assumption. Although we can test the first assumption statistically, it too should be supported by our understanding of the underlying theory and relationships. Finally, IV estimates are consistent but biased in finite samples, so large samples are needed for the estimates to be reliable.
Using multiple regression to estimate equation (2) requires calculation of a column vector β. Each observation’s explanatory variables can now be expressed as a row vector, and stacking these rows produces the matrix X containing all observations’ explanatory variables. Together with the column vector Y containing all observations’ dependent variable, the model is Y = Xβ + ε. OLS calculates β as (X′X)⁻¹X′Y, again requiring zero correlation between the error ε and any of the explanatory x variables. The IV method instead calculates β using a row vector of instrumental variables for each observation, collected over all observations in the matrix Z, as (Z′X)⁻¹Z′Y.
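As a check on the matrix formula, the following derivation works through the simplest case. It assumes one endogenous regressor, one instrument, and a constant term that serves as its own instrument. The hats and bars are notation introduced here for sample quantities.

```latex
\[
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},
\qquad
Z = \begin{pmatrix} 1 & z_1 \\ \vdots & \vdots \\ 1 & z_n \end{pmatrix},
\qquad
Z'X = \begin{pmatrix} n & \sum_i x_i \\ \sum_i z_i & \sum_i z_i x_i \end{pmatrix},
\qquad
Z'Y = \begin{pmatrix} \sum_i y_i \\ \sum_i z_i y_i \end{pmatrix}.
\]
Inverting the $2 \times 2$ matrix $Z'X$ and reading off the second element of
$(Z'X)^{-1} Z'Y$ gives the slope:
\[
\hat{\beta}_{IV}
= \frac{n \sum_i z_i y_i - \sum_i z_i \sum_i y_i}
       {n \sum_i z_i x_i - \sum_i z_i \sum_i x_i}
= \frac{\sum_i (z_i - \bar{z})(y_i - \bar{y})}
       {\sum_i (z_i - \bar{z})(x_i - \bar{x})}
= \frac{\widehat{\operatorname{cov}}(z, y)}{\widehat{\operatorname{cov}}(z, x)} .
\]
```

Setting Z = X in the same steps recovers the OLS slope, the sample analogue of cov(x, y)/cov(x, x), so the matrix expressions generalize the covariance ratios used for the simple regression.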
Returning to the examples, instrumental variables addresses each of them as follows. First, in the estimation of a demand curve, an instrument (z) for price is the cost of production: It is correlated with the price that buyers pay (x), but it affects quantity demanded (y) only through that price, so it is not correlated with the demand error (ε). Second, a particularly well-known instrument for years of education is Joshua Angrist and Alan Krueger’s (1991) quarter of birth, which is explained in more detail below. Quarter of birth is unlikely to have a direct effect on future earnings, but if a child is born close to the cutoff date, he or she will start school either substantially earlier or substantially later than other children but will still be required to remain in school through age sixteen. Thus some children are required to have up to one year of additional education. Third, if the explanatory variable is measured with error, we can use another measure of that variable as an instrument, so long as the two measures’ errors are not correlated.
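The measurement-error case lends itself to a short numerical illustration. The sketch below assumes a mismeasured explanatory variable with two independent noisy measures. The variances, the true coefficient, and all variable names are hypothetical.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

TRUE_BETA = 2.0  # assumed for illustration

rng = random.Random(0)
measure_1, measure_2, outcome = [], [], []
for _ in range(20000):
    x_true = rng.gauss(0, 1)               # the variable we cannot observe exactly
    measure_1.append(x_true + rng.gauss(0, 1))  # first noisy measure (the regressor)
    measure_2.append(x_true + rng.gauss(0, 1))  # second measure with independent noise
    outcome.append(TRUE_BETA * x_true + rng.gauss(0, 1))

# OLS on the noisy regressor is pulled toward zero (attenuation).
beta_ols = cov(measure_1, outcome) / cov(measure_1, measure_1)
# The second measure is correlated with the first only through x_true,
# so it can serve as the instrument z.
beta_iv = cov(measure_2, outcome) / cov(measure_2, measure_1)
print(f"OLS estimate: {beta_ols:.3f}")
print(f"IV estimate:  {beta_iv:.3f}")
```

With equal variances assumed for the true variable and the measurement noise, OLS recovers roughly half of the true coefficient, while the IV ratio is close to the true value. If the two measures shared a common error component, the second measure would no longer be a valid instrument.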
The method most often used to implement IV is two-stage least squares (2SLS); it efficiently takes advantage of multiple instruments in addition to multiple explanatory variables (both endogenous and exogenous), as in equation (2). Because OLS is more efficient than IV when x is exogenous but inconsistent when it is not, we can compare the two sets of estimates to test for endogeneity with a Hausman (1978) test. When there are more instruments than endogenous variables, we can also use the “extra” instrumental variables to test their exogeneity.
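The two stages can be written out directly for one endogenous regressor and two instruments. This is a minimal sketch rather than a full implementation: the function name `two_stage_least_squares`, the simulated data, and the coefficient values are assumptions. It returns only the point estimate, because standard errors computed naively from the second stage would be incorrect.

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def two_stage_least_squares(y, x, z1, z2):
    """2SLS slope for one endogenous x and two instruments (constant included)."""
    # Demeaning every series is equivalent to including a constant in both stages.
    y, x, z1, z2 = (demean(v) for v in (y, x, z1, z2))
    # First stage: OLS of x on (z1, z2), solving the 2x2 normal equations
    # with Cramer's rule.
    s11, s12, s22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)
    s1x, s2x = dot(z1, x), dot(z2, x)
    det = s11 * s22 - s12 * s12
    a1 = (s1x * s22 - s12 * s2x) / det
    a2 = (s11 * s2x - s12 * s1x) / det
    x_hat = [a1 * u + a2 * w for u, w in zip(z1, z2)]
    # Second stage: OLS of y on the first-stage fitted values.
    return dot(x_hat, y) / dot(x_hat, x_hat)

# Simulated data under an assumed process with true beta = 2.
rng = random.Random(0)
x, y, z1, z2 = [], [], [], []
for _ in range(20000):
    a, b = rng.gauss(0, 1), rng.gauss(0, 1)  # two valid instruments
    eps = rng.gauss(0, 1)                    # structural error
    xi = 0.7 * a + 0.5 * b + 0.8 * eps + rng.gauss(0, 1)
    x.append(xi)
    y.append(1.0 + 2.0 * xi + eps)
    z1.append(a)
    z2.append(b)

xd, yd = demean(x), demean(y)
beta_ols = dot(xd, yd) / dot(xd, xd)
beta_2sls = two_stage_least_squares(y, x, z1, z2)
print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_2sls:.3f}")
```

The first stage combines the instruments into the single linear combination most strongly correlated with x, which is how 2SLS makes use of more instruments than there are endogenous variables.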
SHORTCOMINGS OF THE INSTRUMENTAL VARIABLES METHOD
Before considering the drawbacks to the instrumental variables method, note that we are only concerned with the issue of endogenous variables if causality is an important part of the analysis. If the goal is only to describe statistical relationships in the data, IV estimation is not necessary (Moffitt 2003a).
We could also imagine that the omitted variable bias could be addressed by choosing participants randomly, avoiding the need to use IV methods, yet this is typically difficult in the social sciences. Although there have been some innovative public policy experiments (Moffitt 2003b) and field experiments (Smith 2002), many important questions in the social sciences require natural experiment methods, of which IV is an important part.
An important drawback to the method of instrumental variables is low correlation between the instrument(s) and an endogenous variable. The explanatory power of the regression is limited as a result. Not only are standard errors large, but a lack of variation in the instrument will translate into a limit on the range of behaviors that we can understand from the IV regression. We can return once more to our three categories of endogeneity and their associated examples. First, we learn about the demand curve only to the extent that changes in the supply curve reveal it. The portions of the demand curve that are not explored by changes in supply are no longer part of the estimation. Second, Angrist and Krueger (1991) estimated the return to education by instrumenting for education by quarter of birth. The combination of compulsory starting dates and compulsory schooling through the age of sixteen means that students born in different seasons may have up to a year’s difference in compulsory education. Yet when instrumenting years of education by birth date, the regression estimate now has little relevance to those students who continue on to college, for example, because entrance to college is not determined by age but by high school graduation and performance in high school. Compulsory schooling does not affect the education decision of these people. Third, errors in measurement require an alternate measurement technique that covers a range of values similar to the measurement technique of interest.
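The cost of a weakly correlated instrument can be illustrated by repeating the estimation over many simulated samples. The sketch below is illustrative only: the sample size, the number of replications, the two first-stage coefficients, and the helper names are assumptions. Spread is measured with the interquartile range because IV estimates from weak instruments can have very heavy tails.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def iv_estimate(rng, n, first_stage_coef, beta=2.0):
    """Draw one sample of size n and return cov(z, y) / cov(z, x)."""
    x, y, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        eps = rng.gauss(0, 1)
        # first_stage_coef controls how strongly the instrument moves x.
        xi = first_stage_coef * zi + 0.8 * eps + rng.gauss(0, 1)
        x.append(xi)
        y.append(beta * xi + eps)
        z.append(zi)
    return cov(z, y) / cov(z, x)

def interquartile_range(values):
    """Approximate IQR from order statistics (robust to extreme estimates)."""
    s = sorted(values)
    return s[(3 * len(s)) // 4] - s[len(s) // 4]

rng = random.Random(0)
strong = [iv_estimate(rng, 500, 1.0) for _ in range(200)]
weak = [iv_estimate(rng, 500, 0.05) for _ in range(200)]
iqr_strong = interquartile_range(strong)
iqr_weak = interquartile_range(weak)
print(f"IQR of IV estimates, strong instrument: {iqr_strong:.3f}")
print(f"IQR of IV estimates, weak instrument:   {iqr_weak:.3f}")
```

The estimates from the weakly correlated instrument are far more dispersed, even though both instruments satisfy cov(z, ε) = 0 by construction.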
Various authors also highlight the lack of specific theory in many instrumental variable studies. Mark Rosenzweig and Kenneth Wolpin (2000) carefully review many results from the labor economics literature, describing the theoretical assumptions implied by those studies’ use of IV estimation. They argue that implicit assumptions are made whether the practitioners employing IV realize it or not. In either case, what these critiques call for is an improved understanding of the people, institutions, and conditions that are interacting as we study them. With better understanding we can improve our ability to generate relevant testable hypotheses about the actors’ decisions and their connections to the conditions and institutions they face.
SEE ALSO Least Squares, Ordinary; Least Squares, Three-Stage; Least Squares, Two-Stage; Regression; Regression Analysis; Simultaneous Equation Bias
BIBLIOGRAPHY
Angrist, Joshua D., and Alan B. Krueger. 1991. Does Compulsory School Attendance Affect Schooling and Earnings? Quarterly Journal of Economics 106 (4): 979–1014.
Angrist, Joshua D., and Alan B. Krueger. 2001. Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Journal of Economic Perspectives 15 (4): 69–85.
Basmann, Robert L. 1957. A Generalized Classical Method of Linear Estimation of Coefficients in a Structural Equation. Econometrica 25: 77–83.
Christ, Carl F. 1994. The Cowles Commission’s Contributions to Econometrics at Chicago, 1939–1955. Journal of Economic Literature 32 (1): 30–59.
Geary, Robert C. 1949. Determination of Linear Relations between Systematic Parts of Variables with Errors of Observation the Variances of Which Are Unknown. Econometrica 17 (1): 30–58.
Haavelmo, Trygve. 1944. The Probability Approach to Econometrics. Econometrica 12 (Suppl): 1–118.
Hausman, Jerry A. 1978. Specification Tests in Econometrics. Econometrica 46: 1251–1271.
Hilsenrath, Jon E. 2005. Novel Way to Assess School Competition Stirs Academic Row. Wall Street Journal, October 24.
Moffitt, Robert. 2003a. Causal Analysis in Population Research: An Economist’s Perspective. Population and Development Review 29 (3): 448–458.
Moffitt, Robert. 2003b. The Negative Income Tax and the Evolution of U.S. Welfare Policy. Journal of Economic Perspectives 17 (3): 119–140.
Reiersøl, Olav. 1941. Confluence Analysis by Means of Lag Moments and Other Methods of Confluence Analysis. Econometrica 9 (1): 1–24.
Reiersøl, Olav. 1945. Confluence Analysis by Means of Instrumental Sets of Variables. Arkiv för Matematik, Astronomi och Fysik 32: 1–119.
Rosenzweig, Mark R., and Kenneth I. Wolpin. 2000. Natural “Natural Experiments” in Economics. Journal of Economic Literature 38 (4): 827–874.
Smith, Vernon L. 2002. Method in Experiment: Rhetoric and Reality. Experimental Economics 5 (2): 91–132.
Stock, James H., and Francesco Trebbi. 2003. Retrospectives: Who Invented Instrumental Variable Regression? Journal of Economic Perspectives 17 (3): 177–194.
Theil, Henri. 1953. Repeated Least Squares Applied to Complete Equation Systems. Mimeo. The Hague, Netherlands: Central Planning Bureau.
Theil, Henri. 1958. Economic Forecasts and Policy. Amsterdam: North-Holland.
Whitehouse, Mark. 2007. Is an Economist Qualified to Solve Puzzle of Autism? Wall Street Journal, February 27.
Wright, Philip G. 1928. The Tariff on Animal and Vegetable Oils. New York: Macmillan.
Christopher S. Ruebeck