Subset Selection in Regression Analysis
Subset Selection in Regression Analysis. M.Sc. Project Report (Final Stage) submitted in partial fulfilment of the requirements for the degree of Master of Science
by
Ariful Islam Mondal (02528019) under the guidance of Prof. Alladi Subramanyam Department Of Mathematics IIT Bombay
Department of Mathematics INDIAN INSTITUTE OF TECHNOLOGY, BOMBAY April, 2004
Acknowledgment
I am extremely grateful to my guide, Prof. Alladi Subramanyam, for giving me such a nice project on applied statistics. Without his support and suggestions I could not have finished my project; indeed, he devoted a lot of time to me. Moreover, this project helped me a lot in getting a good job. Lastly, I would like to thank my guide for everything he did for me, and also my friends and batchmates who spent time with me while I was typing the report.
April, 2004
Ariful Islam Mondal
Contents

1 Introduction
  1.0.1 Why Subset Selection?
  1.0.2 Uses of regression
  1.0.3 Organization of report and Future Plan

2 Multiple Linear Regression
  2.1 Multiple Regression Models
  2.2 Estimation of the Model Parameters by Least Squares
    2.2.1 Least Squares Estimation Method
    2.2.2 Geometrical Interpretation of Least Squares
    2.2.3 Properties of the Least Squares Estimators
    2.2.4 Estimation of σ²
  2.3 Confidence Intervals in Multiple Regression
    2.3.1 Confidence Interval on the Regression Coefficients
    2.3.2 Confidence Interval Estimation of the Mean Response
  2.4 Gauss-Markov Conditions
    2.4.1 Mean and Variance of Estimates Under G-M Conditions
    2.4.2 The Gauss-Markov Theorem
  2.5 Maximum Likelihood Estimator (MLE)
  2.6 Explanatory Power - Goodness of Fit
  2.7 Testing of Hypothesis in Multiple Linear Regression
    2.7.1 Test for Significance of Regression
    2.7.2 Tests on Individual Regression Coefficients
    2.7.3 Special Cases of Orthogonal Columns in X
    2.7.4 Likelihood Ratio Test

3 Regression Diagnosis and Measures of Model Adequacy
  3.1 Residual Analysis
    3.1.1 Definition of Residuals
    3.1.2 Estimates
    3.1.3 Estimates of σ²
    3.1.4 Coefficient of Multiple Determination
    3.1.5 Methods of Scaling Residuals
    3.1.6 Residual Plots

4 Subset Selection and Model Building
  4.1 Model Building or Formulation of the Problem
  4.2 Consequences of Variable Deletion
    4.2.1 Properties of β̂p
  4.3 Criteria for Evaluating Subset Regression Models
    4.3.1 Coefficient of Multiple Determination
    4.3.2 Adjusted R²
    4.3.3 Residual Mean Square
    4.3.4 Mallows' Cp Statistic
  4.4 Computational Techniques for Variable Selection
    4.4.1 All Possible Regressions
    4.4.2 Directed Search on t
    4.4.3 Stepwise Variable Selection

5 Dealing with Multicollinearity
  5.1 Sources of Multicollinearity
    5.1.1 Effects of Multicollinearity
  5.2 Multicollinearity Diagnosis
    5.2.1 Estimation of the Correlation Matrix
    5.2.2 Variance Inflation Factors
    5.2.3 Eigensystem Analysis of X'X
    5.2.4 Other Diagnostics
  5.3 Methods for Dealing with Multicollinearity
    5.3.1 Collecting Additional Data
    5.3.2 Model Respecification
    5.3.3 Ridge Regression
    5.3.4 Principal Components Regression

6 Ridge Regression
  6.1 Ridge Estimation
  6.2 Methods for Choosing k
  6.3 Ridge Regression and Variable Selection
  6.4 Ridge Regression: Some Remarks

7 Better Subset Regression Using the Nonnegative Garrote
  7.1 Nonnegative Garrote
  7.2 Model Selection
    7.2.1 Prediction and Model Error
  7.3 Estimation of Error
    7.3.1 X-Controlled Estimates
    7.3.2 X-Random Estimates
  7.4 X-Orthonormal
  7.5 Conclusions and Remarks

8 Subset Auto Regression
  8.1 Method of Estimation
  8.2 Pagano's Algorithm
  8.3 Computation of the Increase in the Residual Variance
  8.4 Conclusion

A Data Analysis
  A.1 Correlation Coefficients
  A.2 Forward Selection for Fitness Data
  A.3 Backward Elimination Procedure for Fitness Data
  A.4 Stepwise Selection for Fitness Data
  A.5 Nonnegative Garrote Estimation of Fitness Data
  A.6 Ridge Estimation for Fitness Data
  A.7 Conclusion
Chapter 1

Introduction

The study of the analysis of data aimed at discovering how one or more variables (called independent variables, predictor variables or regressors) affect other variables (called dependent variables or response variables) is known as "regression analysis". Consider the model
y = β0 + β1 x + ε  (1.1)
This is a simple linear regression model, where x is an independent variable and y is a dependent or response variable. In general, the response may be related to p regressors x1, x2, ..., xp, so that
y = β0 + β1 x1 + β2 x2 + ... + βp xp  (1.2)
This is called a multiple linear regression model. The adjective linear is employed to indicate that the model is linear in the parameters β0, β1, ..., βp, and not because y is a linear function of the x's. Many models in which y is related to the x's in a non-linear fashion can still be treated as linear regression models as long as the equation is linear in the β's. An important objective of regression analysis is to estimate the unknown parameters in the regression model. This process is also called fitting the model to the data. The next phase of regression analysis is called model adequacy checking, in which the appropriateness of the model is studied and the quality of the fit ascertained. Through such analysis the usefulness of the regression model may be determined. The outcome of the adequacy checking may indicate either that the model is reasonable or that the original fit must be modified.
1.0.1 Why Subset Selection?
In most practical problems the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors to be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem. Building a regression model that includes only a subset of the available regressors involves two conflicting objectives. On the one hand, we would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y. On the other hand, we want the model to include as few regressors as possible because the variance of the prediction increases as the number of regressors increases. Moreover, the more regressors there are in a model, the greater the cost of data collection and model maintenance. The process of finding a model that compromises between these two objectives is called selecting the "best" regression equation.
1.0.2 Uses of regression
Regression models are used for several purposes, such as
1. data description,
2. parameter estimation and prediction of events,
3. control.
Engineers and scientists frequently use equations to summarize or describe a set of data, and regression analysis is helpful in developing such equations. Some parameter estimation problems can also be solved by regression techniques. Many applications of regression involve prediction of the response variable. Regression models may also be used for control purposes, for example in chemical engineering processes.
1.0.3 Organization of report and Future Plan
This report is divided into eight chapters. Chapter 3 includes Regression Diagnosis and Measures of Model Adequacy from the first-stage report. Chapter 4 deals with subset selection using ordinary subset regression (forward, backward and stepwise methods). Multicollinearity is discussed in Chapter 5 and ridge regression in Chapter 6. Chapter 7 presents the idea of an intermediate selection procedure (Leo Breiman's nonnegative garrote for subset selection), and in Chapter 8 we try to use subset selection criteria for order selection of an autoregressive scheme. Lastly, the data analysis is provided in the Appendix.
∗∗∗∗∗
Chapter 2

Multiple Linear Regression

A regression model that involves more than one regressor variable is called a multiple regression model. Fitting and analyzing these models is discussed in this chapter. We shall also discuss measures of model adequacy that are useful in multiple regression.
2.1 Multiple Regression Models
Suppose that n > p observations are available; let yi denote the i-th observed response and xij denote the i-th observation or level of regressor xj, i = 1, 2, ..., n and j = 1, 2, ..., p. Then the general multiple linear regression model can be written as
y = Xβ + ε  (2.1)
where
y = (y1, y2, ..., yn)',  β = (β0, β1, ..., βp)',  ε = (ε1, ε2, ..., εn)'
and X is the matrix whose i-th row is (1, xi,1, xi,2, ..., xi,p).
In general, y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors.
2.2 Estimation of the Model Parameters by Least Squares

2.2.1 Least Squares Estimation Method
Model: y = Xβ + ε. We assume that the error term ε in the model has E(ε) = 0 and V(ε) = σ²I, and that the errors are uncorrelated.

We wish to find the vector of least squares (LS) estimators, β̂, that minimizes
S(β) = Σ_{i=1}^{n} εi² = ε'ε = (y − Xβ)'(y − Xβ).  (2.2)
The LS estimator β̂ must satisfy
∂S/∂β = −2X'y + 2X'Xβ = 0  (2.3)
which simplifies to
X'Xβ = X'y  (2.4)
These equations are called the least squares normal equations.
To find the solution of the normal equations we shall consider the following cases.

Case I: (X'X) is non-singular. If X'X is non-singular, i.e. if no column of the X matrix is a linear combination of the other columns, i.e. if the regressors are linearly independent, then (X'X)⁻¹ exists and the solution of equations (2.4) is unique. The least squares estimator of β is then given by
β̂ = (X'X)⁻¹X'y  (2.5)
Case II: (X'X) is singular. If (X'X) is singular, then we can solve the normal equations X'Xβ = X'y by using a generalized inverse (g-inverse; A⁻ is a g-inverse of A if AA⁻A = A):
β̂ = (X'X)⁻X'y  (2.6)
Here the estimates are not unique, and the g-inverse (X'X)⁻ itself is not unique.
The predicted value ŷ corresponding to the observed value y is
ŷ = Xβ̂  (2.7)
  = X(X'X)⁻¹X'y  (2.8)
  = Hy  (2.9)
The n × n matrix H = X(X'X)⁻¹X' is usually called the "hat" matrix, because it maps the vector of observed values into the vector of fitted values; its properties play a central role in regression analysis. Let M = I − H. Then MX = (I − H)X = X − X(X'X)⁻¹X'X = X − X = 0. The residuals are
ê = (ê1, ê2, ..., ên)' = y − ŷ  (2.10)
where
ŷ = (ŷ1, ŷ2, ..., ŷn)' = Xβ̂ = X(X'X)⁻¹X'y  (2.11)
Therefore we can express the residual vector ê in terms of y or ε as follows:
ê = y − Hy = My = MXβ + Mε = Mε  (2.12)
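As a concrete illustration of the estimation formulas above, the following short Python sketch computes β̂, the hat matrix H and the residuals ê on a small simulated data set. It is only an illustration under assumed data; the variable names (X, y, beta_hat) are not from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p regressors (assumed values for illustration)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1), first column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimator: beta_hat = (X'X)^{-1} X'y, eq. (2.5)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' and residuals e = (I - H) y, eqs. (2.9)-(2.12)
H = X @ np.linalg.solve(X.T @ X, X.T)
residuals = y - H @ y

print(beta_hat)
print(residuals[:5])
```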
2.2.2 Geometrical Interpretation of Least Squares
An intuitive geometrical interpretation of least squares is sometimes helpful. We may think of the vector of observations y' = [y1, y2, ..., yn] as defining a vector from the origin to the point A in Figure 2.1. Note that [y1, y2, ..., yn] form the coordinates of an n-dimensional sample space. The sample space in Figure 2.1 is three-dimensional.
The matrix X consists of p + 1 column vectors of length n, namely 1 (a column vector of 1's), x1, x2, ..., xp. Each of these columns defines a vector from the origin in the sample space. These p + 1 vectors form a (p + 1)-dimensional subspace called the estimation space; the figure shows a two-dimensional estimation space. We may represent any point in this subspace by a linear combination of the vectors 1, x1, x2, ..., xp. Thus any point in the estimation space is of the form Xβ. Let the vector Xβ determine the point B in Figure 2.1. The squared distance from B to A is just S(β) = (y − Xβ)'(y − Xβ). Therefore minimizing the squared distance from point A, defined by the observation vector y, to the estimation space requires finding the point in the estimation space that is closest to A. The squared distance is a minimum when the point in the estimation space is the foot of the line from A normal (perpendicular) to the estimation space. This is point C in Figure 2.1, defined by the vector (y − Xβ̂).

Figure 2.1: A geometrical representation of least squares

Therefore, since y − ŷ = y − Xβ̂ is perpendicular to the estimation space, we may write
X'(y − Xβ̂) = 0   or   X'Xβ̂ = X'y
which we recognize as the least squares normal equations.
2.2.3 Properties of the Least Squares Estimators

The statistical properties of the least squares estimator β̂ may be easily demonstrated. Consider first bias:
E(β̂) = E[(X'X)⁻¹X'y] = E[(X'X)⁻¹X'(Xβ + ε)] = E[(X'X)⁻¹X'Xβ + (X'X)⁻¹X'ε] = β
since E(ε) = 0. Thus β̂ is an unbiased estimator of β. The covariance matrix of β̂ is given by
Cov(β̂) = σ²(X'X)⁻¹
Therefore if we let C = (X'X)⁻¹, the variance of β̂j is σ²Cjj and the covariance between β̂i and β̂j is σ²Cij. The least squares estimator β̂ is the best linear unbiased estimator of β (the Gauss-Markov theorem). If we further assume that the errors εi are normally distributed, then β̂ is also the maximum likelihood estimator (MLE) of β. The maximum likelihood estimator is the minimum variance unbiased estimator of β.
2.2.4 Estimation of σ²

As in simple linear regression, we may develop an estimator of σ² from the residual sum of squares
SSE = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} êi² = ê'ê
Substituting ê = (y − Xβ̂), we have
SSE = (y − Xβ̂)'(y − Xβ̂) = y'y − 2β̂'X'y + β̂'X'Xβ̂
Since X'Xβ̂ = X'y, this last equation becomes
SSE = y'y − β̂'X'y  (2.13)
The residual sum of squares has n − p − 1 degrees of freedom associated with it, since p + 1 parameters are estimated in the regression model. The residual mean square is
MSE = SSE/(n − p − 1)  (2.14)
We can show that the expected value of MSE is σ², so an unbiased estimator of σ² is given by
σ̂² = MSE  (2.15)
This estimator of σ² is model dependent.
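Continuing the hypothetical numerical example above, the residual sum of squares and the residual mean square can be computed directly from equations (2.13)-(2.15); the names below are assumptions carried over from the earlier sketch.

```python
# Residual sum of squares via eq. (2.13): SSE = y'y - beta_hat' X'y
SSE = y @ y - beta_hat @ (X.T @ y)

# Residual mean square, eq. (2.14): n - p - 1 degrees of freedom
n, p_plus_1 = X.shape
MSE = SSE / (n - p_plus_1)          # here p_plus_1 = p + 1
sigma2_hat = MSE                    # unbiased estimator of sigma^2, eq. (2.15)
```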
2.3 Confidence Intervals in Multiple Regression
Confidence intervals on individual regression coefficients and confidence intervals on the mean response at given levels of the regressors play the same important role in multiple regression that they do in simple linear regression. This section develops one-at-a-time confidence intervals for these cases.
2.3.1 Confidence Interval on the Regression Coefficients
To construct confidence interval estimates for the regression coefficients βj, we must assume that the errors εi are normally and independently distributed with mean zero and variance σ². Therefore the observations yi are normally and independently distributed with mean β0 + Σ_{j=1}^{p} βj xij and variance σ². Since the least squares estimator β̂ is a linear combination of the observations, it follows that β̂ is normally distributed with mean vector β and covariance matrix σ²(X'X)⁻¹. This implies that the marginal distribution of any regression coefficient β̂j is normal with mean βj and variance σ²Cjj, where Cjj is the j-th diagonal element of the (X'X)⁻¹ matrix. Consequently each of the statistics
(β̂j − βj)/√(σ̂²Cjj), j = 0, 1, ..., p  (2.16)
is distributed as t with n − p − 1 degrees of freedom, where σ̂² is the estimate of the error variance obtained from (2.15). Therefore a 100(1 − α)% confidence interval for the regression coefficient βj, j = 0, 1, ..., p, is
β̂j − t_{α/2,n−p−1} √(σ̂²Cjj) ≤ βj ≤ β̂j + t_{α/2,n−p−1} √(σ̂²Cjj)  (2.17)
We usually call the quantity
se(β̂j) = √(σ̂²Cjj)  (2.18)
the standard error of the regression coefficient β̂j.
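A minimal sketch of equations (2.16)-(2.18) in Python, assuming the design matrix X, response y, estimate beta_hat and MSE from the earlier hypothetical example; scipy's t distribution supplies the critical value.

```python
from scipy import stats
import numpy as np

C = np.linalg.inv(X.T @ X)                 # C = (X'X)^{-1}
se = np.sqrt(MSE * np.diag(C))             # standard errors, eq. (2.18)

alpha = 0.05
df = n - X.shape[1]                        # n - p - 1 residual degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)

# 95% confidence intervals for each beta_j, eq. (2.17)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```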
2.3.2 Confidence Interval Estimation of the Mean Response

We may construct a confidence interval on the mean response at a particular point, such as x01, x02, ..., x0p. Define the vector x0 = (1, x01, x02, ..., x0p)'. The fitted value at this point is
ŷ0 = x0'β̂  (2.19)
This is an unbiased estimator of the mean response at x0, since E(ŷ0) = x0'β, and the variance of ŷ0 is
V(ŷ0) = σ² x0'(X'X)⁻¹x0  (2.20)
Therefore a 100(1 − α)% confidence interval on the mean response at the point x01, x02, ..., x0p is
ŷ0 − t_{α/2,n−p−1} √(σ̂² x0'(X'X)⁻¹x0) ≤ E(y0) ≤ ŷ0 + t_{α/2,n−p−1} √(σ̂² x0'(X'X)⁻¹x0)  (2.21)
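The interval (2.21) for the mean response at a chosen point can be sketched the same way; x0 below is an assumed illustrative point, not one from the report's data, and C, t_crit and MSE are carried over from the previous sketch.

```python
x0 = np.array([1.0, 0.2, -0.3, 1.5])       # (1, x01, ..., x0p), hypothetical point

y0_hat = x0 @ beta_hat                      # fitted value, eq. (2.19)
var_y0 = MSE * (x0 @ C @ x0)                # estimated variance, eq. (2.20)
half_width = t_crit * np.sqrt(var_y0)

ci_mean_response = (y0_hat - half_width, y0_hat + half_width)   # eq. (2.21)
```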
2.4 Gauss-Markov Conditions
In order for estimates of β to have some desirable statistical properties, we need the following assumptions, called the Gauss-Markov conditions, which have already been introduced:
E(ε) = 0  (2.22)
E(εε') = σ²I  (2.23)
We shall use these conditions repeatedly in the sequel. Note that the G-M conditions imply that
E(y) = Xβ  (2.24)
and
Cov(y) = E[(y − Xβ)(y − Xβ)'] = E(εε') = σ²I  (2.25)
It also follows that (see 2.12)
E(êê') = M E[εε'] M = σ²M  (2.26)
since M = (I − H) is symmetric and idempotent. Therefore
Var(êi) = σ²mii = σ²[1 − hii]  (2.27)
where mij and hij are the ij-th elements of M and H respectively. Because a variance is non-negative and a covariance matrix is at least positive semi-definite, it follows that hii ≤ 1 and M is at least positive semi-definite.
2.4.1 Mean and Variance of Estimates Under G-M Conditions

From equation (2.22),
E(β̂) = β  (2.28)
Now we know that, if for any parameter θ its estimate T has the property that E(T) = θ, then T is an unbiased estimator of θ. Thus under the G-M conditions β̂ is an unbiased estimator of β. Note that we only use the first G-M condition (2.22) to prove this; therefore violation of condition (2.23) will not lead to bias. Further, under the G-M condition (2.23),
Cov(β̂) = σ²(X'X)⁻¹  (2.29)
2.4.2 The Gauss-Markov Theorem
In most applications of regression we are interested in estimates of some linear function Lβ or l'β of β, where l is a vector and L is a matrix. Estimates of this type include the predicted values ŷi, the estimate ŷ0 of a future observation, ŷ itself, and even β̂. We consider here l'β.

Although there may be several possible estimators, we confine ourselves to linear estimators, i.e. estimators that are linear functions of y1, y2, ..., yn, say c'y. We also require that these linear functions be unbiased estimators of l'β, and assume that such linear unbiased estimators for l'β exist; l'β is then called estimable. In the following theorem we show that among all linear unbiased estimators, the least squares estimator l'β̂ = l'(X'X)⁻¹X'y, which is also a linear function of y1, y2, ..., yn and which is unbiased for l'β, has the smallest variance. That is, Var(l'β̂) ≤ Var(c'y) for all c such that E(c'y) = l'β. Such an estimator is called a best linear unbiased estimator (BLUE).

Theorem 2.4.1 (Gauss-Markov) Let β̂ = (X'X)⁻¹X'y and y = Xβ + ε. Then, under the G-M conditions, the estimator l'β̂ of the estimable function l'β is BLUE.

Before proving the G-M theorem, let X1, X2, ..., Xn be a random sample of size n from the distribution f(x, θ), where θ ∈ R^k, and let T = (T1, T2, ..., Tk)' with Tj = Tj(X1, X2, ..., Xn). If E(T) = θ, then we say that T is an unbiased estimator of θ. Again, let T and S be two unbiased estimators of θ. We say that T is better than S if Σ_S(θ) − Σ_T(θ) is non-negative definite for all θ ∈ Ω and for every unbiased S, where Σ_T(θ) denotes the variance-covariance matrix of the estimator T.
Proof of the G-M Theorem: The model is
y = Xβ + ε  (2.30)
with E(ε) = 0 and V(ε) = σ²I. We concentrate only on linear combinations of y. Then
(i) a'y is an unbiased estimator of a'Xβ, and
(ii) X'y is an unbiased estimator of X'Xβ.
Now let M'y be another unbiased estimator of X'Xβ. Then
M'Xβ = X'Xβ for all β ∈ R^p  ⇒  M'X = X'X.
Write M'y as
M'y = (M'y − X'y) + X'y
and let u = M'y − X'y. Then the variance-covariance matrices satisfy
Σ_θ(M'y) = Σ_θ(u) + Σ_θ(X'y)  ⇒  Σ_θ(M'y) − Σ_θ(X'y) = Σ_θ(u),
which is non-negative definite. Hence X'y is the optimal unbiased estimator of X'Xβ. If X'X is non-singular, then β̂ = (X'X)⁻¹X'y.

To show that the BLUE is unique, suppose u'X'y and v'X'y are two BLUEs of λ'β. Then
E(u'X'y) = λ'β = E(v'X'y)  ⇒  u'X'Xβ = λ'β = v'X'Xβ  ⇒  u'X'X = λ' = v'X'X,
i.e. E((u' − v')X'y) = 0. Now since u'X'y is a BLUE, V(u'X'y) ≤ V(v'X'y); and since v'X'y is also a BLUE, V(v'X'y) ≤ V(u'X'y). Therefore V(u'X'y) = V(v'X'y), and hence the BLUE is unique.
Alternative proof of the G-M theorem: Let c'y be another linear unbiased estimator of the (estimable) function l'β. Since c'y is an unbiased estimator of l'β, l'β = E(c'y) = c'Xβ for all β, and hence we have
c'X = l'  (2.31)
Now
Var(c'y) = c'Cov(y)c = c'(σ²I)c = σ²c'c,
and
Var(l'β̂) = l'Cov(β̂)l = σ²l'(X'X)⁻¹l = σ²c'X(X'X)⁻¹X'c
from (2.28) and (2.29). Therefore
Var(c'y) − Var(l'β̂) = σ²[c'c − c'X(X'X)⁻¹X'c] = σ²c'[I − X(X'X)⁻¹X']c ≥ 0
since I − X(X'X)⁻¹X' = M is positive semi-definite (see Section 2.4). This proves the theorem.

A slight generalization of the Gauss-Markov theorem is the following:

Theorem 2.4.2 Under the G-M conditions, the estimator Lβ̂ of the estimable function Lβ is BLUE in the sense that
Cov(Cy) − Cov(Lβ̂)
is positive semi-definite, where L is an arbitrary matrix and Cy is another unbiased linear estimator of Lβ.

This theorem implies that if we wish to estimate several (possibly related) linear functions of the βj's, we cannot do better (in the BLUE sense) than use the least squares estimates.

Proof: As in the proof of the Gauss-Markov theorem, the unbiasedness of Cy yields Lβ = CE(y) = CXβ for all β, whence L = CX. Since Cov(Cy) = σ²CC' and
Cov(Lβ̂) = σ²L(X'X)⁻¹L' = σ²CX(X'X)⁻¹X'C',
it follows that
Cov(Cy) − Cov(Lβ̂) = σ²C[I − X(X'X)⁻¹X']C',
which is positive semi-definite since, as shown before, the matrix [I − X(X'X)⁻¹X'] = M is positive semi-definite.
2.5 Maximum Likelihood Estimator (MLE)
Model: y = Xβ + ε. We assume that the Gauss-Markov conditions hold and that the yi's are normally distributed; that is, the error terms εi are independent and identically distributed as normal with mean zero and variance σ², i.e. εi ~ iid N(0, σ²), i.e. ε ~ N(0, σ²I). Then the probability density function of y1, y2, ..., yn is given by
(2πσ²)^{-n/2} exp[−(1/2σ²)(y − Xβ)'(y − Xβ)].  (2.32)
The same probability density function, when considered as a function of β and σ² given the observations y1, y2, ..., yn, i.e. f(β, σ² | y), is called the likelihood function and is denoted by L(β, σ² | y). The maximum likelihood estimates of β and σ² are obtained by maximizing L(β, σ² | y) with respect to β and σ². Since log[z] is an increasing function of z, the same maximum likelihood estimates can be found by maximizing the logarithm of L. Since maximizing (2.32) with respect to β is equivalent to minimizing (y − Xβ)'(y − Xβ), the maximum likelihood estimate of β is the same as the least squares estimate, i.e. β̂ = (X'X)⁻¹X'y. The maximum likelihood estimate of σ², obtained by equating to zero the derivative of the log-likelihood with respect to σ² after substituting β by β̂, is
σ̂² = (1/n)(y − Xβ̂)'(y − Xβ̂) = (1/n) ê'ê  (2.33)
To obtain the maximum likelihood estimate of β under the constraint Cβ = γ, we need to minimize (y − Xβ)'(y − Xβ) subject to Cβ − γ = 0.
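For contrast with the unbiased estimator (2.15), here is a two-line sketch of the maximum likelihood variance estimate (2.33), using the assumed SSE, X and n from the earlier examples.

```python
sigma2_mle = SSE / n                        # ML estimate, eq. (2.33): divides by n, biased downward
sigma2_unbiased = SSE / (n - X.shape[1])    # MSE, eq. (2.15): divides by n - p - 1
```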
2.6 Explanatory Power - Goodness of Fit
In this section we discuss a measure of how well our model explains the data, that is, a measure of goodness of fit. One way to do this is to consider what a good model would look like. A good model should make small mistakes, so that
y ≈ Xβ̂
Therefore our estimated errors, or residuals, provide a useful measure of how well our model approximates the data. It would also be useful if we could scale this quantity so that the value associated with the goodness of fit has some meaning. To see this, let us examine our estimated equation and break it into two parts: the explained portion Xβ̂ and the unexplained portion ê. Ideally, we would like the unexplained portion to be negligible. Hence our estimated model is
y = Xβ̂ + ê  (2.34)
where Xβ̂ is the explained portion and ê is the unexplained portion.

To gauge how well our model fits, we first compute the sum of squares of the dependent variable y, to get a sense of the variation of y; this naturally leads to the variation of the explained and unexplained portions of our regression. So
y'y = (Xβ̂ + ê)'(Xβ̂ + ê)
which gives
y'y = β̂'X'Xβ̂ + ê'Xβ̂ + β̂'X'ê + ê'ê  (2.35)
But from our assumptions we know that X'ê = 0, so its transpose ê'X must also be zero. Therefore
y'y = β̂'X'Xβ̂ + ê'ê  (2.36)
which has a good intuitive meaning:
SST = SSR + SSE  (2.37)
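A short sketch of (2.36)-(2.39) using the assumed quantities from the earlier examples; note that this R² uses the uncorrected total sum of squares y'y, following the decomposition above.

```python
y_hat = X @ beta_hat

SST = y @ y                         # total sum of squares, y'y (uncorrected)
SSR = y_hat @ y_hat                 # regression sum of squares, beta_hat' X'X beta_hat
SSE = (y - y_hat) @ (y - y_hat)     # error sum of squares, e'e

R2 = 1 - SSE / SST                  # eq. (2.39)
R2_adj = 1 - (n / (n - X.shape[1])) * (1 - R2)   # adjusted R^2, cf. eq. (3.6)
```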
2.7 Testing of Hypothesis in Multiple Linear Regression
In multiple regression problems certain tests of hypotheses about the model parameters are useful in measuring model adequacy. In this section we shall describe several important hypothesis-testing procedures. We shall continue to require the normality assumption on the errors introduced in previous sections.
2.7.1 Test for Significance of Regression

The test for significance of regression is a test to determine whether there is a linear relationship between the response y and any of the regressor variables x1, x2, ..., xp. The appropriate hypotheses are
H0: β0 = β1 = ... = βp = 0 against H1: βj ≠ 0 for at least one j  (2.40)
Rejection of H0 implies that at least one of the regressors x1, x2, ..., xp contributes significantly to the model.

Test procedure: The total sum of squares SST is partitioned into a sum of squares due to regression and a residual sum of squares:
SST = SSR + SSE  (2.41)
If H0 is true, then SSR/σ² ~ χ²_{p+1}, where the number of degrees of freedom (d.f.) for χ² equals the number of regressor variables including the constant, or in other words the number of parameters to be estimated. Also we can show that SSE/σ² ~ χ²_{n−p−1} and that SSE and SSR are independent. The test procedure for H0 is to compute
F0 = [SSR/(p + 1)] / [SSE/(n − p − 1)] = MSR/MSE  (2.42)
and reject H0 if F0 > F_{α,(p+1),(n−p−1)}. The procedure is usually summarized in an analysis-of-variance table such as Table 2.1.

Table 2.1 - Analysis of Variance for Significance of Regression in Multiple Regression

Source of Variation   d.f.        Sum of Squares   Mean Square   F0
Regression            p + 1       SSR              MSR           MSR/MSE
Residual              n − p − 1   SSE              MSE
Total                 n           SST
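A hedged numerical sketch of the overall F test (2.42) and its p-value, reusing the assumed X, y, SSR, SSE and the scipy import from the previous examples; the degrees of freedom follow the convention of Table 2.1.

```python
df_reg = X.shape[1]            # p + 1 (including the constant, as in Table 2.1)
df_res = n - X.shape[1]        # n - p - 1

MSR = SSR / df_reg
MSE = SSE / df_res

F0 = MSR / MSE                              # eq. (2.42)
p_value = stats.f.sf(F0, df_reg, df_res)    # reject H0 if p_value < alpha
```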
2.7.2 Tests on Individual Regression Coefficients
Tests of hypotheses on the individual regression coefficients are helpful in determining the value of each regressor in the model. For example, the model might be more effective with the inclusion of additional regressors, or perhaps with the deletion of one or more regressors presently in the model.

Adding a variable to a regression model always causes the sum of squares for regression to increase and the residual sum of squares to decrease. We must decide whether the increase in the regression sum of squares is sufficient to warrant using the additional regressor in the model. The addition of a regressor also increases the variance of the fitted value ŷ, so we must be careful to include only regressors that are of real value in explaining the response. Furthermore, adding an unimportant regressor may increase the residual mean square, which may decrease the usefulness of the model.

The hypotheses for testing the significance of any individual regression coefficient, such as βj, are
H0: βj = 0 against H1: βj ≠ 0  (2.43)
If H0: βj = 0 is not rejected, this indicates that the regressor xj can be deleted from the model. The test statistic for this hypothesis is
t0 = β̂j/√(σ̂²Cjj) = β̂j/se(β̂j)  (2.44)
where Cjj is the diagonal element of (X'X)⁻¹ corresponding to β̂j. The null hypothesis H0: βj = 0 is rejected if |t0| > t_{α/2,n−p−1}. Note that this is really a partial or marginal test, because the regression coefficient β̂j depends on all the other regressor variables xi (i ≠ j) that are in the model. Thus this is a test of the contribution of xj given the other regressors in the model.

We may also directly determine the contribution to the regression sum of squares of a regressor, for example xj, given that the other regressors xi (i ≠ j) are included in the model, by using the "extra-sum-of-squares" method. This procedure can also be used to investigate the contribution of a subset of the regressor variables to the model. Consider the regression model with p regressors
y = Xβ + ε  (2.45)
where y is an n × 1 vector of the observations, X is an n × (p + 1) matrix of the levels of the regressor variables, β is a (p + 1) × 1 vector of the regression coefficients, and ε is an n × 1 vector of random errors. We would like to determine whether some subset of r < p regressors contributes significantly to the regression model. Let the regression coefficients be partitioned as
β = [β1', β2']'
where β1 is (p − r + 1) × 1 and β2 is r × 1. We want to test the hypotheses
H0: β2 = 0 against H1: β2 ≠ 0  (2.46)
The model can be written as
y = Xβ + ε = X1β1 + X2β2 + ε  (2.47)
where the n × (p − r + 1) matrix X1 represents the columns of X associated with β1 and the n × r matrix X2 represents the columns of X associated with β2. This is called the full model.

For the full model, we know that β̂ = (X'X)⁻¹X'y. The regression sum of squares for this model is
SSR(β) = β̂'X'y  (p + 1 d.f.)
and
MSE = (y'y − β̂'X'y)/(n − p − 1)
To find the contribution of the terms in β2 to the regression, fit the model assuming that the null hypothesis H0: β2 = 0 is true. This reduced model is
y = X1β1 + ε  (2.48)
The least squares estimator of β1 in the reduced model is β̂1 = (X1'X1)⁻¹X1'y. The regression sum of squares is
SSR(β1) = β̂1'X1'y  (p − r + 1 d.f.)  (2.49)
The regression sum of squares due to β2, given that β1 is already in the model, is
SSR(β2|β1) = SSR(β) − SSR(β1)  (2.50)
with p + 1 − (p − r + 1) = r degrees of freedom. This sum of squares is called the extra sum of squares due to β2, because it measures the increase in the regression sum of squares that results from adding the regressors x_{p−r+1}, x_{p−r+2}, ..., xp to a model that already contains x1, x2, ..., x_{p−r}. Now SSR(β2|β1) is independent of MSE, and the null hypothesis H0: β2 = 0 may be tested by the statistic
F0 = [SSR(β2|β1)/r]/MSE  (2.51)
If F0 > F_{α,r,n−p−1}, we reject H0, concluding that at least one of the parameters in β2 is not zero and consequently at least one of the regressors x_{p−r+1}, x_{p−r+2}, ..., xp in X2 contributes significantly to the regression model. Some authors call the above test a partial F-test, because it measures the contribution of the regressors in X2 given that the other regressors in X1 are in the model. To illustrate the usefulness of this procedure, consider the model
y = β0 + β1x1 + β2x2 + β3x3 + ε  (2.52)
The sums of squares
SSR(β1|β0, β2, β3), SSR(β2|β0, β1, β3) and SSR(β3|β0, β1, β2)
are single-degree-of-freedom sums of squares that measure the contribution of each regressor xj, j = 1, 2, 3, to the model given that all the other regressors were already in the model. That is, we are assessing the value of adding xj to a model that did not include this regressor. In general, we could find
SSR(βj|β0, β1, ..., β_{j−1}, β_{j+1}, ..., βp), 1 ≤ j ≤ p
which is the increase in the regression sum of squares due to adding xj to a model that already contains x1, ..., x_{j−1}, x_{j+1}, ..., xp. Some find it helpful to think of this as measuring the contribution of xj as if it were the last variable added to the model.
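The extra-sum-of-squares test (2.50)-(2.51) can be sketched as a small helper that compares a full and a reduced fit; the function name and the choice of which columns to drop are illustrative assumptions, not part of the report.

```python
def partial_f_test(X_full, X_reduced, y):
    """Partial F test for the regressors present in X_full but not in X_reduced."""
    def ssr(Xm):
        b = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)
        return b @ (Xm.T @ y)                      # SSR = beta_hat' X'y

    n = len(y)
    r = X_full.shape[1] - X_reduced.shape[1]       # number of regressors being tested
    df_res = n - X_full.shape[1]                   # n - p - 1 for the full model

    sse_full = y @ y - ssr(X_full)
    extra_ss = ssr(X_full) - ssr(X_reduced)        # SSR(beta_2 | beta_1), eq. (2.50)
    F0 = (extra_ss / r) / (sse_full / df_res)      # eq. (2.51)
    return F0, stats.f.sf(F0, r, df_res)

# Example: test the last regressor of the hypothetical X from earlier
F0, p_val = partial_f_test(X, X[:, :-1], y)
```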
2.7.3 Special Cases of Orthogonal Columns in X
Consider the model (2.47),
y = Xβ + ε = X1β1 + X2β2 + ε.
The extra-sum-of-squares method allows us to measure the effect of the regressors in X2 conditional on those in X1 by computing SSR(β2|β1). In general we cannot talk about finding the sum of squares due to β2, SSR(β2), without accounting for the dependence of this quantity on the regressors in X1. However, if the columns of X1 are orthogonal to the columns of X2, we can determine a sum of squares due to β2 that is free of any dependence on the regressors in X1.

To demonstrate this, form the normal equations X'Xβ̂ = X'y for the above model. The normal equations are
[X1'X1  X1'X2; X2'X1  X2'X2] [β̂1; β̂2] = [X1'y; X2'y]
Now if the columns of X1 are orthogonal to the columns of X2, then X1'X2 = 0 and X2'X1 = 0, and the normal equations become
X1'X1β̂1 = X1'y,  X2'X2β̂2 = X2'y
with solutions
β̂1 = (X1'X1)⁻¹X1'y,  β̂2 = (X2'X2)⁻¹X2'y
Note that the least squares estimator of β1 is β̂1 regardless of whether or not X2 is in the model, and the least squares estimator of β2 is β̂2 regardless of whether or not X1 is in the model.

The regression sum of squares for the full model is
SSR(β) = β̂'X'y = β̂1'X1'y + β̂2'X2'y = y'X1(X1'X1)⁻¹X1'y + y'X2(X2'X2)⁻¹X2'y  (2.53)
However, the normal equations form two separate sets, and for each set we note that
SSR(β1) = β̂1'X1'y,  SSR(β2) = β̂2'X2'y  (2.54)
which implies
SSR(β) = SSR(β1) + SSR(β2)  (2.55)
Therefore
SSR(β1|β2) = SSR(β) − SSR(β2) = SSR(β1)
and
SSR(β2|β1) = SSR(β) − SSR(β1) = SSR(β2)
Consequently SSR(β1) measures the contribution of the regressors in X1 to the model unconditionally, and SSR(β2) measures the contribution of the regressors in X2 to the model unconditionally. Because we can unambiguously determine the effect of each regressor when the regressors are orthogonal, data collection experiments are often designed to have orthogonal variables.
2.7.4 Likelihood Ratio Test
Model:
y = Xβ + ε  (2.56)
Assumptions: ε ~ iid N(0, σ²I). The MLEs of β and σ² are
β̂ = (X'X)⁻¹X'y  and  σ̂² = (1/n)(y − Xβ̂)'(y − Xβ̂) = (1/n)ê'ê
Let us partition β as β = [β(1)', β(2)']', where β(1) is a vector of order q + 1 and β(2) is a vector of order p − q. Our aim is to test
H0: β(2) = 0 against H1: β(2) ≠ 0
Now write X = [X(1) : X(2)]. Then the model becomes
y = X(1)β(1) + X(2)β(2) + ε  (2.57)
and under H0 the model becomes
y = X(1)β(1) + ε  (2.58)
Let
R0²(X) = (y − Xβ̂)'(y − Xβ̂)  and  R0²(X(1)) = (y − X(1)β̂(1))'(y − X(1)β̂(1))
where β̂(1) = (X(1)'X(1))⁻¹X(1)'y. Also let D = R0²(X(1)) − R0²(X) be the extra sum of squares.

Result 2.7.1 Let y = Xβ + ε, ε ~ N_n(0, σ²I_n) and β = [β(1)', β(2)']'. Then the likelihood ratio (LR) test for testing H0: β(2) = 0 is to reject H0 if
F = [R0²(X(1)) − R0²(X)]/[(p − q)S²] = D/[(p − q)S²] > F_{p−q,n−p−1}(α)
where S² = R0²(X)/(n − p − 1) and F_{m,n}(α) is the upper α-point of the F distribution with d.f. (m, n).
Proof: The likelihood function is
L(β, σ²) = (2πσ²)^{-n/2} exp[−(1/2σ²)(y − Xβ)'(y − Xβ)]  (2.59)
Therefore
max_{β,σ²} L(β, σ²) = L(β̂, σ̂²) = (2πσ̂²)^{-n/2} e^{-n/2}  (2.60)
Now under H0,
β̂(1) = (X(1)'X(1))⁻¹X(1)'y  and  σ̂(1)² = (1/n)(y − X(1)β̂(1))'(y − X(1)β̂(1))
Therefore
max_{H0} L(β, σ²) = L(β̂(1), σ̂(1)²) = (2πσ̂(1)²)^{-n/2} e^{-n/2}
Hence the LR test statistic is given by
λ = max_{H0} L(β, σ²) / max_{β,σ²} L(β, σ²) = (σ̂(1)²/σ̂²)^{-n/2} = [1 + (σ̂(1)² − σ̂²)/σ̂²]^{-n/2}
Therefore λ is small when (σ̂(1)² − σ̂²)/σ̂² is very large, or equivalently when
n(σ̂(1)² − σ̂²)/(nσ̂²) ≥ c1, say,
i.e. R0²(X(1)) − R0²(X) ≥ c1 R0²(X), or
T = [(R0²(X(1)) − R0²(X))/(p − q)] / [R0²(X)/(n − p − 1)] > c2  (2.61)
where c1, c2 are constants. Therefore we reject H0 if T = D/[(p − q)S²] > c2. Note that under H0, T ~ F_{p−q,n−p−1}. Hence we reject H0 if T > F_{p−q,n−p−1}(α).
Chapter 3

Regression Diagnosis and Measures of Model Adequacy

Evaluating model adequacy is an important part of a multiple regression problem. This chapter presents several methods for measuring model adequacy; many of these techniques are extensions of those used in simple linear regression. The major assumptions that we have made so far in our study of regression analysis are as follows:
1. The relationship between y and X is linear, or at least it is well approximated by a straight line.
2. The error term ε has zero mean.
3. The error term εi has constant variance σ² for all i.
4. The errors are uncorrelated.
5. The errors are normally distributed.
We should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model we have tentatively entertained. Gross violation of the assumptions may yield an unstable model, in the sense that a different sample could lead to a totally different model with opposite conclusions. We usually cannot detect departures from the underlying assumptions by examination of the standard summary statistics, such as the t- or F-statistics or R². These are "global" model properties, and as such they do not ensure model adequacy. In this chapter we present several methods useful for diagnosing and treating violations of the basic regression assumptions.
3.1 Residual Analysis

3.1.1 Definition of Residuals
We have defined the residuals as
ê = y − ŷ = y − Xβ̂  (3.1)
viewed as the deviation between the data and the fit. This is a measure of the variability not explained by the regression model, and departures from the underlying assumptions on the errors should show up in the residuals. Analysis of residuals is an effective method for discovering several types of model deficiencies.
Properties
1. E(êi) = 0 for all i.
2. The approximate variance is given by
Σ_{i=1}^{n} êi²/(n − p − 1) = ê'ê/(n − p − 1) = SSE/(n − p − 1) = MSE,
although the êi are not independent (as we shall see later in this section).
3. Sometimes it is useful to work with the standardized residuals
di = êi/√MSE, for all i,  (3.2)
where E(di) = 0 and var(di) ≈ 1.
Equation (3.2) scales the residuals by dividing them by their average standard deviation. In some regression data sets the residuals may have standard deviations that differ greatly. In simple linear regression,
V(êi) = V(yi − ŷi) = V(yi) + V(ŷi) − 2Cov(yi, ŷi) = σ² + σ²[1/n + (xi − x̄)²/Sxx] − 2Cov(yi, ŷi)
Now we can show that
Cov(yi, ŷi) = Cov(yi, ȳ + (Sxy/Sxx)(xi − x̄)) = σ²[1/n + (xi − x̄)²/Sxx] ≠ 0
Therefore the êi are not independent. The studentized residuals are defined (in simple linear regression) as
ri = êi/√(MSE[1 − 1/n − (xi − x̄)²/Sxx]), i = 1, 2, ..., n  (3.3)
Notice that in equation (3.3) the ordinary least squares residual êi has been divided by its exact standard error rather than by the average value as in the standardized residuals (see 3.2). Studentized residuals are extremely useful in diagnostics. In small data sets the studentized residuals are often more appropriate than the standardized residuals because the differences in residual variances will be more dramatic.
3.1.2 Estimates

ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy
where β̂ = (X'X)⁻¹X'y and H = X(X'X)⁻¹X' is a symmetric and idempotent matrix. We have seen that E(β̂) = β and Cov(β̂) = σ²(X'X)⁻¹.
3.1.3 Estimates of σ²

The residual sum of squares is
SSE = ê'ê = (y − Xβ̂)'(y − Xβ̂) = y'y − β̂'X'y, with n − p − 1 d.f.,
and the residual mean square is
MSE = SSE/(n − p − 1).  (3.4)
Then E(MSE) = σ², i.e. σ̂² = MSE; therefore MSE is an unbiased estimator of σ².
3.1.4 Coefficient of Multiple Determination

The coefficient of multiple determination R² is defined as
R² = SSR/SST = 1 − SSE/SST  (3.5)
It is customary to think of R² as a measure of the reduction in the variability of y obtained by using x1, x2, ..., xp. As in the simple linear regression case, we must have 0 ≤ R² ≤ 1. However, a large value of R² does not necessarily imply that the regression model is a good one. Adding a regressor to the model will always increase R², regardless of whether or not the additional regressor contributes to the model. Thus it is possible for models that have large values of R² to perform poorly in prediction or estimation. The positive square root of R² is the multiple correlation coefficient between y and the set of regressor variables x1, x2, ..., xp. That is, R is a measure of the linear association between y and x1, x2, ..., xp. We may also show that R² is the square of the correlation between y and the vector of fitted values ŷ.
Adjusted R²

Some analysts prefer to use an adjusted R² statistic, because the ordinary R² defined above will always increase (at least not decrease) when a new term is added to the regression model. We shall see that in variable selection and model building procedures it is helpful to have a criterion that guards against overfitting the model, that is, adding terms that are unnecessary. The adjusted R² penalizes the analyst who includes unnecessary variables in the model. We define the adjusted R², Ra², by replacing SSE and SST in equation (3.5) by the corresponding mean squares; that is,
Ra² = 1 − [SSE/(n − p − 1)]/[SST/n] = 1 − [n/(n − p − 1)](1 − R²)  (3.6)

3.1.5 Methods of Scaling Residuals
I. Standardized and Studentized Residuals: We have already introduced two types of scaled residuals, the standardized residuals
di = êi/√MSE, i = 1, 2, ..., n
and the studentized residuals. We now give a general development of the studentized residual scaling. Recall,
ê = (I − H)y  (3.7)
Since H is symmetric (H' = H) and idempotent (HH = H), the matrix (I − H) is also symmetric and idempotent. Substituting y = Xβ + ε into the above equation yields
ê = (I − H)(Xβ + ε) = Xβ − HXβ + (I − H)ε = (I − H)ε  (3.8)
Thus the residuals are the same linear transformation of the observations y and of the errors ε.

The covariance matrix of the residuals is
V(ê) = V[(I − H)ε] = (I − H)V(ε)(I − H)' = σ²(I − H)  (3.9)
since V(ε) = σ²I and (I − H) is symmetric and idempotent. The matrix (I − H) is generally not diagonal, so the residuals have different variances and they are correlated. The variance of the i-th residual is
V(êi) = σ²(1 − hii)  (3.10)
where hii is the i-th diagonal element of H. Since 0 ≤ hii ≤ 1, using the residual mean square MSE to estimate the variance of the residuals actually overestimates V(êi). Furthermore, since hii is a measure of the location of the i-th point in x-space, the variance of êi depends upon where the point xi lies. Generally points near the center of the x-space have larger residual variance (poorer least squares fit) than points at more remote locations. Violations of the model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of êi (or di) because their residuals will usually be smaller.

Several authors (Behnken and Draper [1972], Davies and Hutton [1975], and Huber [1975]) suggest taking this inequality of variance into account when scaling the residuals. They recommend plotting the "studentized" residuals
ri = êi/√(MSE(1 − hii)), i = 1, 2, ..., n  (3.11)
instead of êi (or di). The studentized residuals have constant variance V(ri) = 1, regardless of the location of xi, when the form of the model is correct. In many situations the variance of the residuals stabilizes, particularly for large data sets. In these cases there may be little difference between the standardized and studentized residuals, so they often convey equivalent information. However, since any point with a large residual and a large hii is potentially highly influential on the least squares fit, examination of the studentized residuals is generally recommended.

The covariance between êi and êj is
Cov(êi, êj) = −σ²hij  (3.12)
so another approach to scaling the residuals is to transform the n dependent residuals into n − p orthogonal functions of the errors ε. These transformed residuals are normally and independently distributed with constant variance σ². Several procedures have been proposed to investigate departures from the underlying assumptions using transformed residuals. These procedures are not widely used in practice because it is difficult to make specific inferences about the transformed residuals, such as the interpretation of outliers. Furthermore, dependence between the residuals does not affect interpretation of the usual residual plots unless p is large relative to n.
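The hat diagonals hii and the scaled residuals (3.2) and (3.11) can be computed as follows; this is a sketch reusing the assumed X, y, beta_hat and MSE from the earlier examples.

```python
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverages h_ii, diagonal of the hat matrix
e = y - X @ beta_hat                              # ordinary residuals

d = e / np.sqrt(MSE)                              # standardized residuals, eq. (3.2)
r = e / np.sqrt(MSE * (1.0 - h))                  # studentized residuals, eq. (3.11)
```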
II. Prediction Error Sum of Squares Residuals: The prediction error sum of squares (PRESS), proposed by Allen [1971b, 1974], provides a useful residual scaling. To calculate PRESS, select an observation, for example i. Fit the regression model to the remaining n − 1 observations and use this equation to predict the withheld observation yi. Denoting this predicted value by ŷ(i), we may find the prediction error for point i as ê(i) = yi − ŷ(i). The prediction error is often called the i-th PRESS residual. This procedure is repeated for each observation i = 1, 2, ..., n, producing a set of n PRESS residuals ê(1), ê(2), ..., ê(n). The PRESS statistic is then defined as the sum of squares of the n PRESS residuals,
PRESS = Σ_{i=1}^{n} ê(i)² = Σ_{i=1}^{n} (yi − ŷ(i))²  (3.13)
Thus PRESS uses each possible subset of n − 1 observations as an estimation data set, and every observation in turn is used to form the prediction data set.

It would initially seem that calculating PRESS requires fitting n different regressions. However, it is possible to calculate PRESS from the results of a single least squares fit to all n observations. It turns out that the i-th PRESS residual is
ê(i) = êi/(1 − hii)  (3.14)
Thus, since PRESS is just the sum of squares of the PRESS residuals, a simple computing formula is
PRESS = Σ_{i=1}^{n} [êi/(1 − hii)]²  (3.15)
From (3.14) it is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal element hii of the hat matrix. Residuals associated with points for which hii is large will have large PRESS residuals; these points will generally be high-influence points. Generally, a large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well but a model built without that point predicts poorly. Finally, note that the variance of the i-th PRESS residual is
V(ê(i)) = V[êi/(1 − hii)] = σ²(1 − hii)/(1 − hii)² = σ²/(1 − hii)
so that a standardized PRESS residual is
ê(i)/√V(ê(i)) = [êi/(1 − hii)]/√(σ²/(1 − hii)) = êi/√(σ²(1 − hii))
which, if we use MSE to estimate σ², is just the studentized residual discussed previously.

III. R-Student: The studentized residual ri discussed above is often considered an outlier diagnostic. It is customary to use MSE as an estimate of σ² in computing ri. This is referred to as internal scaling of the residual, because MSE is an internally generated estimate of σ² obtained from fitting the model to all n observations. Another approach is to use an estimate of σ² based on the data set with the i-th observation removed. Denote the estimate of σ² so obtained by S(i)². We can show that
S(i)² = [(n − p)MSE − êi²/(1 − hii)]/(n − p − 1)  (3.16)
The estimate of σ² in (3.16) is used instead of MSE to produce an externally studentized residual, usually called R-student, given by
ti = êi/√(S(i)²(1 − hii)), i = 1, 2, ..., n  (3.17)
In many situations ti will differ little from the studentized residual ri. However, if the i-th observation is influential, then S(i)² can differ significantly from MSE, and thus the R-student statistic will be more sensitive to this point. Furthermore, under the standard assumptions ti follows the t_{n−p−1} distribution. Thus R-student offers a more formal procedure for outlier detection via hypothesis testing. Detection of outliers needs to be considered simultaneously with the detection of influential observations.
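A brief sketch of the single-fit computing formulas (3.15)-(3.17), again under the assumed quantities e, h, MSE, X and n from the previous sketches; here n_par counts all estimated parameters, so n − n_par plays the role of the residual degrees of freedom in (3.16).

```python
# PRESS from a single fit, eq. (3.15)
press_residuals = e / (1.0 - h)                   # PRESS residuals, eq. (3.14)
PRESS = np.sum(press_residuals ** 2)

# Externally studentized residuals (R-student), eqs. (3.16)-(3.17)
n_par = X.shape[1]                                # number of estimated parameters (p + 1)
S2_i = ((n - n_par) * MSE - e**2 / (1.0 - h)) / (n - n_par - 1)
t_i = e / np.sqrt(S2_i * (1.0 - h))
```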
IV. Estimation of Pure Error: This procedure involves partitioning the error (or residual) sum of squares into a sum of squares due to "pure error" and a sum of squares due to "lack of fit",
SSE = SSPE + SSLOF
where SSPE is computed using responses at repeated observations at the same levels of x. This is a model-independent estimate of σ². The calculation of SSPE requires repeated observations on y at the same set of levels of the regressor variables x1, x2, ..., xp, i.e. some of the rows of the X matrix must be the same. However, repeated observations do not often occur in multiple regression.
3.1.6 Residual Plots
The residuals êi from the multiple linear regression model play an important role in judging model adequacy, just as they do in simple linear regression. Specifically, we often find it instructive to plot the following:
1. Residuals on normal probability paper.
2. Residuals versus each regressor xj, j = 1, 2, ..., p.
3. Residuals versus the fitted values ŷi, i = 1, 2, ..., n.
4. Residuals in time sequence (if known).
These plots are used to detect departures from normality, outliers, inequality of variance and the wrong functional specification for a regressor. There are several other residual plots useful in multiple regression analysis, some of which are as follows:
1. Plots of residuals against regressors omitted from the model.
2. Partial residual plots (the i-th partial residual for regressor xj is êij* = yi − β̂1xi1 − ... − β̂j−1xi,j−1 − β̂j+1xi,j+1 − ... − β̂pxi,p).
3. Partial regression plots: plots of the residuals from which the linear dependence of y on all regressors other than xj has been removed, against the regressor xj with its linear dependence on the other regressors removed.
4. Plots of regressor xj against xi (checking multicollinearity): if two or more regressors are highly correlated, we say that multicollinearity is present in the data. Multicollinearity can seriously disturb the least squares fit and in some situations render the regression model almost useless.
We shall discuss some of these here.

Normal Probability Plot

A very simple method for checking the normality assumption is to plot the residuals on normal probability paper. This graph paper is designed so that the cumulative normal distribution plots as a straight line. Let e[1] < e[2] < ... < e[n] be the residuals ranked in increasing order. Plot e[i] against the cumulative probabilities Pi = (i − 1/2)/n, i = 1, 2, ..., n. The resulting points should lie approximately on a straight line. The straight line is usually determined visually, with emphasis on the central values (e.g. the .33 and .67 cumulative probability points) rather than the extremes. Substantial departures from a straight line indicate that the distribution is not normal.

Figure 3.1: Normal probability plots

Figure 3.1(a) displays an "idealized" normal probability plot; the points lie approximately along a straight line. Figures 3.1(b)-(e) represent typical problems. Figure 3.1(b) shows a sharp upward and downward curve at both extremes, indicating that the tails of the distribution are too heavy for it to be considered normal. Conversely, Figure 3.1(c) shows flattening at the extremes, which is a pattern typical of samples from a distribution with thinner tails than the normal. Figures 3.1(d)-(e) exhibit patterns associated with positive and negative skew, respectively.

Andrews [1970] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors εi are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals can be shown to be linear combinations of the model errors (the εi). Thus fitting the parameters tends to destroy the evidence of non-normality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality. A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals; sometimes this is an indication that the corresponding observations are outliers.

Residual Plot against ŷi
Figure 3.2: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.
A plot of the residuals e_i (or the scaled residuals d_i or r_i) versus the corresponding fitted values ŷ_i is useful for detecting several common types of model inadequacies. (The residuals should be plotted against the fitted values ŷ_i and not against the observed values y_i, because the e_i and the ŷ_i are uncorrelated while the e_i and the y_i are usually correlated.) If this plot resembles Figure 3.2(a), which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of e_i against ŷ_i that resemble any of the patterns in Figure 3.2(b)-(d) are symptomatic of model deficiencies.
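As a brief illustration of these diagnostics, the following minimal sketch fits a linear model to simulated (hypothetical) data and produces the two plots discussed above: a normal probability plot of the ranked residuals against the normal quantiles of P_i = (i − 1/2)/n (the computational equivalent of normal probability paper), and a plot of residuals against fitted values. It assumes only that NumPy, SciPy, and Matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 3 regressors
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Least squares fit, fitted values and residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat

# Normal probability plot: ranked residuals vs. normal quantiles of P_i = (i - 1/2)/n
e_sorted = np.sort(e)
P = (np.arange(1, n + 1) - 0.5) / n
z = stats.norm.ppf(P)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(z, e_sorted, "o")
axes[0].set_xlabel("normal quantile")
axes[0].set_ylabel("ranked residual")
axes[0].set_title("Normal probability plot")

# Residuals versus fitted values: look for a horizontal band
axes[1].plot(y_hat, e, "o")
axes[1].axhline(0.0, color="grey")
axes[1].set_xlabel("fitted value")
axes[1].set_ylabel("residual")
axes[1].set_title("Residuals vs. fitted")
plt.tight_layout()
plt.show()
```
∗∗∗∗∗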
Chapter 4
Subset Selection and Model Building

So far we have assumed that the variables that go into the regression equation were chosen in advance. Our analysis involved examining the equation to see whether the functional specification was correct and whether the underlying assumptions about the error term were valid. The analysis presupposed that the regressor variables included in the model are known to be influential. However, in most practical applications the analyst has a pool of candidate regressors that should include all the influential factors, but the actual subset of regressors to be used in the model needs to be determined. Finding an appropriate subset of regressors for the model is called the variable selection problem.
Building a regression model that includes only a subset of the candidate regressors involves two conflicting objectives.
1. We would like the model to include as many regressors as possible so that the "information content" in these factors can influence the predicted value of y.
2. We want the model to include as few regressors as possible because the variance of the prediction ŷ increases as the number of regressors increases. Also, the more regressors there are in a model, the greater the costs of data collection and model maintenance.
The process of finding a model that is a compromise between these two objectives is called finding the "best" regression equation. Unfortunately there is no unique definition of best. Furthermore, there are several algorithms that can be used for variable selection, and these procedures frequently specify different subsets of the candidate regressors as best.
The variable selection problem is often discussed in an idealized setting. It is usually assumed that the correct functional specification of the regressors is known, and that no outliers or influential observations are present. In practice, these assumptions are rarely met. Investigation of model adequacy is linked to the variable selection problem. Although ideally these problems should be solved simultaneously, an iterative approach is often employed, in which
1. a particular variable selection strategy is employed, and then
2. the resulting subset model is checked for correct functional specification, outliers, and influential observations.
Several iterations may be required to produce an adequate model.
4.1 Model Building Or Formulation Of The Problem
Suppose we have a response variable Y and q predictor variables X_1, X_2, ..., X_q. A linear model that represents Y in terms of the q variables is

$$y_i = \sum_{j=1}^{q} \beta_j x_{ij} + \varepsilon_i, \qquad (4.1)$$

where the β_j are parameters and ε_i represents a random disturbance. Instead of dealing with the full set of variables (particularly when q is large), we might delete a number of variables and construct an equation with a subset of variables. This chapter is concerned with determining which variables are to be retained in the equation. Let us denote the set of variables retained by X_1, X_2, ..., X_p and those deleted by X_{p+1}, X_{p+2}, ..., X_q.
Let us examine the effect of variable deletion under two general conditions:
1. The model that connects Y to the X's has all β's (β_0, β_1, ..., β_q) nonzero.
2. The model has β_0, β_1, ..., β_p nonzero, but β_{p+1}, β_{p+2}, ..., β_q zero.
Suppose that instead of fitting (4.1) we fit the subset model

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \qquad (4.2)$$
We will examine the effect of deleting variables on the estimates of the parameters and on the predicted values of Y. The solution to the problem of variable selection becomes a little clearer once the effects of retaining unessential variables, or of deleting essential variables, are known.
4.2 Consequences Of Variable Deletion
To provide motivation for variable selection, we will briefly review the consequences of incorrect model specification. Assume that there are K candidate regressors x_1, x_2, ..., x_K and n ≥ K + 1 observations on these regressors and the response y. The full model, containing all K regressors, is

$$y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \cdots, n \qquad (4.3)$$
or equivalently

$$y = X\beta + \varepsilon \qquad (4.4)$$

We assume that the list of candidate regressors contains all the influential variables and that all equations include an intercept term. Let r be the number of regressors that are deleted from (4.4). Then the number of variables that are retained is p = K + 1 − r. Since the intercept is included, the subset model contains p − 1 = K − r of the original regressors. The model (4.4) can be written as

$$y = X_p \beta_p + X_r \beta_r + \varepsilon \qquad (4.5)$$

where the X matrix has been partitioned into X_p, an n × p matrix whose columns represent the intercept and the p − 1 regressors to be retained in the subset model, and X_r, an n × r matrix whose columns represent the regressors to be deleted from the full model. Let β be partitioned conformably into β_p and β_r. For the full model the least squares estimate of β is

$$\hat{\beta}^{*} = (X'X)^{-1} X'y \qquad (4.6)$$

and an estimate of the residual variance σ² is

$$\hat{\sigma}^{*2} = \frac{y'y - \hat{\beta}^{*\prime} X'y}{n - K - 1} = \frac{y'\left[I - X(X'X)^{-1}X'\right]y}{n - K - 1} \qquad (4.7)$$

The components of β̂* are denoted by β̂*_p and β̂*_r, and the ŷ*_i are the fitted values. For the subset model

$$y = X_p \beta_p + \varepsilon \qquad (4.8)$$

the least squares estimate of β_p is

$$\hat{\beta}_p = (X_p'X_p)^{-1} X_p' y \qquad (4.9)$$

the estimate of the residual variance is

$$\hat{\sigma}^2 = \frac{y'y - \hat{\beta}_p' X_p'y}{n - p} = \frac{y'\left[I - X_p(X_p'X_p)^{-1}X_p'\right]y}{n - p} \qquad (4.10)$$

and the fitted values are ŷ_i.
4.2.1 Properties of β̂_p
The properties of the estimates β̂_p and σ̂² from the subset model have been investigated by several authors. The results can be summarized as follows:
1. Bias in β̂_p

$$E(\hat{\beta}_p) = \beta_p + (X_p'X_p)^{-1} X_p' X_r \beta_r = \beta_p + A\beta_r$$

where A = (X_p'X_p)^{-1} X_p' X_r. Thus β̂_p is a biased estimate of β_p unless the regression coefficients corresponding to the deleted variables (β_r) are zero or the retained variables are orthogonal to the deleted variables (X_p'X_r = 0).

2. Variance of β̂_p
The variances of β̂_p and β̂* are V(β̂_p) = σ²(X_p'X_p)^{-1} and V(β̂*) = σ²(X'X)^{-1}, respectively. Also, the matrix V(β̂*_p) − V(β̂_p) is positive semidefinite; that is, the variances of the least squares estimates of the parameters in the full model are greater than or equal to the variances of the corresponding parameters in the subset model. Consequently, deleting variables never increases the variances of the estimates of the remaining parameters.

3. Precision of the Parameter Estimates
Since β̂_p is a biased estimate of β_p and β̂*_p is not, it is more reasonable to compare the precision of the parameter estimates from the full and subset models in terms of mean square error. The mean square error of β̂_p is

$$MSE(\hat{\beta}_p) = \sigma^2 (X_p'X_p)^{-1} + A\beta_r \beta_r' A'$$

If the matrix V(β̂*_r) − β_r β_r' is positive semidefinite, then the matrix V(β̂*_p) − MSE(β̂_p) is positive semidefinite. This means that the least squares estimates of the parameters in the subset model have smaller mean square error than the corresponding parameter estimates from the full model when the deleted variables have regression coefficients that are smaller than the standard errors of their estimates in the full model.
4. Precision in Prediction
Suppose we wish to predict the response at the point x' = [x_p', x_r']. If we use the full model, the predicted value is ŷ* = x'β̂*, with mean x'β and prediction variance

$$V(\hat{y}^{*}) = \sigma^2 \left[1 + x'(X'X)^{-1}x\right]$$

However, if the subset model is used, ŷ = x_p'β̂_p, with mean

$$E(\hat{y}) = x_p'\beta_p + x_p' A \beta_r$$
and prediction mean square error

$$MSE(\hat{y}) = \sigma^2 \left[1 + x_p'(X_p'X_p)^{-1}x_p\right] + (x_p'A\beta_r - x_r'\beta_r)^2$$

Note that ŷ is a biased estimate of y unless x_p'Aβ_r = x_r'β_r. Furthermore, the variance of ŷ* from the full model is not less than the variance of ŷ from the subset model. In terms of mean square error, V(ŷ*) ≥ MSE(ŷ) provided that the matrix V(β̂*_r) − β_r β_r' is positive semidefinite.
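A small simulation can make the bias result in property 1 concrete. The sketch below (hypothetical data and coefficient values) repeatedly fits the subset model after omitting one influential regressor and compares the average of β̂_p with β_p + Aβ_r; it is only an illustration of the algebra above, not part of the formal development.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
Xp = np.column_stack([np.ones(n), rng.normal(size=n)])          # intercept + retained regressor
Xr = (0.7 * Xp[:, 1] + rng.normal(scale=0.7, size=n))[:, None]  # deleted regressor, correlated with retained
beta_p, beta_r = np.array([1.0, 2.0]), np.array([1.5])

A = np.linalg.solve(Xp.T @ Xp, Xp.T @ Xr)      # A = (Xp'Xp)^{-1} Xp'Xr
expected_bias_target = beta_p + (A @ beta_r)   # E(beta_hat_p) under the full model truth

estimates = []
for _ in range(2000):
    y = Xp @ beta_p + Xr @ beta_r + rng.normal(size=n)
    bhat_p = np.linalg.solve(Xp.T @ Xp, Xp.T @ y)   # subset-model least squares
    estimates.append(bhat_p)

print("average subset estimate :", np.round(np.mean(estimates, axis=0), 3))
print("beta_p + A beta_r       :", np.round(expected_bias_target, 3))
```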
4.3 Criteria for Evaluating Subset Regression Models
Two key aspects of the variable selection problem are generating the subset models and deciding if one subset is better than another. In this section we discuss criteria for evaluating and comparing subset regression models.
4.3.1 Coefficient of Multiple Determination
A measure of the adequacy of a regression model that has been widely used is the coefficient of multiple determination, R². Let R_p² denote the coefficient of multiple determination for a subset regression model with p terms, that is, p − 1 regressors and an intercept term β_0. Computationally,

$$R_p^2 = \frac{SS_R(p)}{S_{yy}} = 1 - \frac{SS_E(p)}{S_{yy}} \qquad (4.11)$$
where SS_R(p) and SS_E(p) denote the regression sum of squares and the residual sum of squares, respectively, for a p-term subset model. There are $\binom{K}{p-1}$ values of R_p² for each value of p, one for each possible subset model of size p. Now R_p² increases as p increases and is a maximum when p = K + 1. Therefore the analyst uses this criterion by adding regressors to the model up to the point where an additional variable is not useful in that it provides only a small increase in R_p². The general approach is illustrated in Figure 4.1, which represents a hypothetical plot of the maximum value of R_p² for each subset of size p against p.

Figure 4.1: Plot of R_p² against p

Typically one examines a display such as this and then specifies the number of regressors for the final model as the point at which the "knee" in the curve becomes apparent. Since we cannot find an "optimum" value of R² for a subset regression model, we must look for a "satisfactory" value. Aitkin [1974] has proposed one solution to this problem by providing a test with which all subset regression models that have an R² not significantly different from the R² for the full model can be identified. Let

$$R_0^2 = 1 - (1 - R_{K+1}^2)(1 + d_{\alpha,n,K}) \qquad (4.12)$$

where

$$d_{\alpha,n,K} = \frac{K F_{\alpha,K,n-K-1}}{n - K - 1}$$

and R²_{K+1} is the value of R² for the full model. Aitkin calls any subset of regressor variables producing an R² greater than R_0² an R²-adequate (α) subset.
Generally it is not straightforward to use R² as a criterion for choosing the number of regressors to include in the model. However, for a fixed number of variables p, R_p² can be used to compare the $\binom{K}{p-1}$ subset models so generated. Models having large values of R_p² are preferred.
4.3.2 Adjusted R²
To avoid the difficulties of interpreting R², some analysts prefer to use the adjusted R² statistic, defined for a p-term equation as

$$\bar{R}_p^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R_p^2) \qquad (4.13)$$

The R̄_p² statistic does not necessarily increase as additional regressors are introduced into the model. In fact, Edwards [1969], Haitovsky [1969], and Seber [1977] showed that if s regressors are added to the model, R̄²_{p+s} will exceed R̄²_p if and only if the partial F-statistic for testing the significance of the s additional regressors exceeds 1. Therefore the optimum subset model can be chosen as the one with maximum R̄_p².
4.3.3 Residual Mean Square
The residual mean square for a subset regression model with p variables,

$$MS_E(p) = \frac{SS_E(p)}{n - p} \qquad (4.14)$$
can be used as a model evaluation criterion. The general behavior of MS_E(p) as p increases is shown in Figure 4.2.

Figure 4.2: Plot of MS_E(p) against p

Because SS_E(p) always decreases as p increases, MS_E(p) initially decreases, then stabilizes, and eventually may increase. The eventual increase in MS_E(p) occurs when the reduction in SS_E(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator of (4.14). That is, at that point adding a regressor to a p-term model causes MS_E(p + 1) to be greater than MS_E(p). Advocates of the MS_E(p) criterion will plot MS_E(p) against p and base the choice of p on
1. the minimum MS_E(p),
2. the value of p such that MS_E(p) is approximately equal to MS_E for the full model, or
3. a value of p near the point where the smallest MS_E(p) turns upward.
The subset regression model that minimizes MS_E(p) will also maximize R̄_p². To see this, note that

$$\bar{R}_p^2 = 1 - \frac{n-1}{n-p}(1 - R_p^2) = 1 - \frac{n-1}{n-p}\,\frac{SS_E(p)}{S_{yy}} = 1 - \frac{n-1}{S_{yy}}\,\frac{SS_E(p)}{n-p} = 1 - \frac{n-1}{S_{yy}}\,MS_E(p)$$

Thus the criteria minimum MS_E(p) and maximum R̄_p² are equivalent.
4.3.4 Mallows' C_p-Statistics
Mallows [1964, 1966, 1973] has proposed a criterion that is related to the mean square error of a fitted value, that is,

$$E\left[\hat{y}_i - E(y_i)\right]^2 = \left[E(y_i) - E(\hat{y}_i)\right]^2 + V(\hat{y}_i) \qquad (4.15)$$
where E(y_i) and E(ŷ_i) are the expected responses from the true regression model and the p-term subset model, respectively. Thus E(y_i) − E(ŷ_i) is the bias at the i-th data point. Consequently the two terms on the right-hand side of (4.15) are the squared bias and variance components, respectively, of the mean square error. Let the total squared bias for a p-term equation be

$$SS_B(p) = \sum_{i=1}^{n} \left[E(y_i) - E(\hat{y}_i)\right]^2$$

and define the standardized total mean square error as

$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_{i=1}^{n}\left[E(y_i) - E(\hat{y}_i)\right]^2 + \sum_{i=1}^{n} V(\hat{y}_i)\right\} = \frac{SS_B(p)}{\sigma^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} V(\hat{y}_i) \qquad (4.16)$$
It can be shown that

$$\sum_{i=1}^{n} V(\hat{y}_i) = p\sigma^2$$

and that the expected value of the residual sum of squares from a p-term equation is

$$E\left[SS_E(p)\right] = SS_B(p) + (n - p)\sigma^2$$

Substituting for $\sum_{i=1}^{n} V(\hat{y}_i)$ and SS_B(p) in (4.16) gives

$$\Gamma_p = \frac{1}{\sigma^2}\left\{E\left[SS_E(p)\right] - (n-p)\sigma^2 + p\sigma^2\right\} = \frac{E\left[SS_E(p)\right]}{\sigma^2} - n + 2p \qquad (4.17)$$
Suppose that σ̂² is a good estimate of σ². Then replacing E[SS_E(p)] by the observed value SS_E(p) produces an estimate of Γ_p, say

$$C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p \qquad (4.18)$$
If the p-term model has negligible bias, then SS_B(p) = 0. Consequently E[SS_E(p)] = (n − p)σ², and

$$E\left[C_p \mid \mathrm{Bias} = 0\right] = \frac{(n-p)\sigma^2}{\sigma^2} - n + 2p = p$$
When using the C_p criterion, it is helpful to construct a plot of C_p as a function of p for each regression equation, such as shown in Figure 4.3.

Figure 4.3: Plot of C_p against p

Regression equations with little bias will have values of C_p that fall near the line C_p = p (point A in Figure 4.3), while those equations with substantial bias will fall above this line (point B in Figure 4.3). Generally small values of C_p are desirable. For example, although point C in Figure 4.3 is above the line C_p = p, it is below point A and thus represents a model with lower total error. It may be preferable to accept some bias in the equation to reduce the average prediction error.
To calculate C_p, we need an unbiased estimate of σ². Generally, we use the residual mean square of the full model, which gives C_p = p = K + 1 for the full model. Using MS_E(K + 1) from the full model as an estimate of σ² assumes that the full model has negligible bias. If the full model has several regressors that do not contribute significantly to the model (zero regression coefficients), then MS_E(K + 1) will often overestimate σ², and consequently the values of C_p will be small. If the C_p statistic is to work properly, a good estimate of σ² must be used.
4.4 Computational Techniques For Variable Selection
In this section we will discuss several computational techniques for generating subset regression models.
4.4.1 All Possible Regression
This procedure requires that the analyst fit all the regression equations involving one candidate regressor, two candidate regressors, and so on. These equations are evaluated according to some suitable criterion and the "best" regression model selected. If we assume that the intercept term β_0 is included in all equations, then if there are K candidate regressors, there are 2^K total equations to be estimated and examined. For example, if K = 5, there are 32 possible equations, while if K = 10, there are 1024 possible regression equations. Clearly the number of equations to be examined increases rapidly as the number of candidate regressors increases. Generally all possible regression is impractical for problems involving more than a few regressors. This method is highly computer oriented. There are presently several algorithms available for efficiently generating all possible regressions. The basic idea underlying all these algorithms is to perform the calculations for the 2^K possible subset models in such a way that sequential subset models differ by only one variable.
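For modest K, a brute-force version of this idea is easy to sketch. The fragment below (a minimal illustration on hypothetical data, not one of the efficient one-variable-swap algorithms mentioned above) enumerates every subset of candidate regressors and evaluates each by R_p², the adjusted R̄_p², and Mallows' C_p, with σ² estimated from the full model.

```python
import itertools
import numpy as np

def fit_sse(X, y):
    """Return the residual sum of squares for least squares regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(1)
n, K = 50, 5
Z = rng.normal(size=(n, K))                      # candidate regressors (hypothetical data)
y = 2.0 + 3.0 * Z[:, 0] - 1.5 * Z[:, 2] + rng.normal(size=n)

ones = np.ones((n, 1))
Syy = np.sum((y - y.mean()) ** 2)
sse_full = fit_sse(np.hstack([ones, Z]), y)
sigma2_hat = sse_full / (n - K - 1)              # residual mean square of the full model

results = []
for r in range(1, K + 1):
    for subset in itertools.combinations(range(K), r):
        Xp = np.hstack([ones, Z[:, list(subset)]])
        p = Xp.shape[1]                          # number of terms including the intercept
        sse = fit_sse(Xp, y)
        r2 = 1.0 - sse / Syy
        r2_adj = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
        cp = sse / sigma2_hat - n + 2 * p
        results.append((subset, r2, r2_adj, cp))

# Show the subsets with the smallest Cp (values near p indicate little bias)
for subset, r2, r2_adj, cp in sorted(results, key=lambda t: t[3])[:5]:
    print(subset, round(r2, 3), round(r2_adj, 3), round(cp, 2))
```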
4.4.2 Directed Search on t
The test statistic for testing H_0: β_j = 0 in the full model with p = K + 1 terms is

$$t_{K,j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$$

Regressors that contribute significantly to the full model will have a large |t_{K,j}| and will tend to be included in the best p-regressor subset, where best implies minimum residual sum of squares or minimum C_p. Consequently, ranking the regressors in decreasing order of magnitude of the |t_{K,j}|, j = 1, 2, ..., K, and then introducing the regressors into the model one at a time in this order should lead to the best (or one of the best) subset models for each p. Daniel and Wood [1980] call this procedure the directed search on t. It is often a very effective variable selection strategy when the number of candidate regressors is relatively large, for example, K > 20 or 30.
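A sketch of the directed search on t, on the same kind of hypothetical data as the previous fragment: fit the full model, rank the regressors by |t_j|, and grow nested subsets in that order, recording the residual sum of squares of each subset.

```python
import numpy as np

def ols(X, y):
    """Least squares fit returning coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid

rng = np.random.default_rng(2)
n, K = 80, 6
Z = rng.normal(size=(n, K))
y = 1.0 + 2.5 * Z[:, 1] - 2.0 * Z[:, 4] + rng.normal(size=n)
X_full = np.hstack([np.ones((n, 1)), Z])

# t-statistics from the full model: t_j = beta_j / se(beta_j)
beta, sse = ols(X_full, y)
sigma2 = sse / (n - K - 1)
cov = sigma2 * np.linalg.inv(X_full.T @ X_full)
t = beta[1:] / np.sqrt(np.diag(cov)[1:])

order = np.argsort(-np.abs(t))        # regressors ranked by decreasing |t|
for m in range(1, K + 1):
    cols = order[:m]
    Xp = np.hstack([np.ones((n, 1)), Z[:, cols]])
    _, sse_p = ols(Xp, y)
    print(f"p-1 = {m}, regressors {sorted(cols.tolist())}, SSE = {sse_p:.2f}")
```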
4.4.3 Stepwise Variable Selection
For cases where there is a large number of potential predictor variables, a set of procedures has been proposed that does not require computing all possible equations. These procedures have the feature that variables are introduced into or deleted from the equation one at a time, and they involve examining only a subset of all possible equations. With p variables these procedures involve evaluation of at most (p + 1) equations, as contrasted with the 2^p equations necessary for examining all possible equations. The procedures can be classified into two broad categories: (1) the forward selection procedure (FS), and (2) the backward elimination procedure (BE). There is also a very popular modification of the FS procedure called the stepwise method.

Forward Selection Procedure
This procedure begins with the assumption that there are no regressors in the model other than the intercept. The first variable included in the equation is the one which has the highest simple correlation with the response Y. Suppose that this regressor is x_1. This is also the regressor that will produce the largest value of the F-statistic for testing significance of regression. This regressor is entered if the F-statistic exceeds a preselected F value, say F_IN (or F-to-enter). The second regressor chosen for entry is the one that now has the largest correlation with Y after adjusting for the effect of the first regressor entered (x_1) on Y. These correlations are called partial correlations. They are the simple correlations between the residuals from the regression ŷ = β̂_0 + β̂_1 x_1 and the residuals from the regressions of the other candidate regressors on x_1, say x̂_j = α̂_1j x_1, j = 2, 3, ..., K. Suppose that at step 2 the regressor with the highest partial correlation with y is x_2. This implies that the largest partial F-statistic is
SSR (x2 |x1 ) M SE (x1 , x2 )
If this F value exceeds FIN , then x2 is added to the model. In general at each step the regressor having the highest partial correlation with Y is added to the model if its partial F -statistic exceeds the preselected entry level FIN . The procedure terminates either when the partial F -statistic at a particular step does not exceed FIN or when the last candidate regressor is added to the model. Backward Elimination Forward selection begins with no regressors in the model and attempts to insert variables until a suitable model is obtained. Backward elimination attempts to find a good model by working in the opposite direction. Here we eliminate stepwise variables without influence. We first calculate the linear regression for the full model. Eliminate the variable xk with one of the following (equivalent) properties: (1) xk has the smallest simple partial correlation among all remaining variables. (2) The removing of xk causes the smallest change of R2 . (3) Of the remaining variables, xk has the smallest t- or F -values. Repeat the whole procedure until one of the following stopping rules is valid: 43
1. The order p of the model has reached a predetermined p*.
2. The smallest F-value is larger than a predetermined value F_OUT.
3. Removing x_k does not significantly change the model fit.

Stepwise Selection
A kind of compromise between forward selection and backward elimination is given by the stepwise selection method. Beginning with one variable, just as in forward selection, we have to choose one of four alternatives:
1. Add a variable.
2. Remove a variable.
3. Exchange two variables.
4. Stop the selection.
This can be done with the following rules:
1. Add the variable x_k if one of the forward selection criteria is satisfied.
2. Remove the variable x_k with the smallest F-value if there are (possibly more than one) variables with an F-value smaller than F_OUT.
3. Remove the variable x_k with the smallest F-value if this removal results in a larger R²-value than was obtained with the same number of variables before.
4. Exchange the variables x_k in the model and x_l not in the model if this will increase the R²-value.
5. Stop the selection if none of the above criteria is satisfied.
Rules 1, 2, and 3 only make sense if there are two or more variables in the model; that is why they are only admitted in this case. Considering rule 3, we see the possibility that the same variable can be added and removed in several steps of the procedure.
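To make the forward selection mechanics concrete, here is a minimal sketch on hypothetical data (with an arbitrarily chosen F_IN) that at each step adds the candidate regressor with the largest partial F-statistic and stops when no candidate exceeds F_IN. Backward elimination and the stepwise method can be built from the same ingredients by also testing whether included variables fall below an F_OUT threshold.

```python
import numpy as np

def sse_of(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_selection(Z, y, f_in=4.0):
    """Greedy forward selection using partial F-statistics."""
    n, K = Z.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(K))
    sse_cur = sse_of(ones, y)                    # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            Xp = np.hstack([ones, Z[:, selected + [j]]])
            sse_new = sse_of(Xp, y)
            df_resid = n - Xp.shape[1]
            f = (sse_cur - sse_new) / (sse_new / df_resid)   # partial F for adding x_j
            if best is None or f > best[1]:
                best = (j, f, sse_new)
        j, f, sse_new = best
        if f < f_in:                             # no candidate passes F-to-enter
            break
        selected.append(j)
        remaining.remove(j)
        sse_cur = sse_new
    return selected

rng = np.random.default_rng(3)
n, K = 100, 8
Z = rng.normal(size=(n, K))
y = 0.5 + 2.0 * Z[:, 0] + 1.0 * Z[:, 3] + rng.normal(size=n)
print("selected regressors:", forward_selection(Z, y))
```
∗∗∗∗∗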
Chapter 5
Dealing with Multicollinearity

The use and interpretation of a multiple regression model often depend explicitly or implicitly on the estimates of the individual regression coefficients. Some examples of inferences that are frequently made include:
1. Identifying the relative effects of the regressor variables,
2. Prediction and/or estimation, and
3. Selection of an appropriate set of variables for the model.
If there is no linear relationship between the regressors, they are said to be orthogonal. When the regressors are orthogonal, inferences such as those illustrated above can be made easily. Unfortunately, in most applications of regression the regressors are not orthogonal. Sometimes the lack of orthogonality is not serious. However, in some situations the regressors are nearly perfectly linearly related, and in such cases the inferences based on the regression model can be misleading or erroneous. When there are near linear dependencies between the regressors, the problem of multicollinearity is said to exist.
5.1 Sources of Multicollinearity
There are four primary sources of multicollinearity:
1. The data collection method can lead to multicollinearity problems when the analyst samples only a subspace of the region of the regressors, i.e., when there is a set of constants t_1, t_2, ..., t_p, not all zero, such that $\sum_{j=1}^{p} t_j X_j \approx 0$.
2. Constraints on the model or in the population being sampled can cause multicollinearity problems. For example, suppose that an electric utility is investigating the effect of family income (x_1) and house size (x_2) on residential electricity consumption. The levels of the two regressors in the sample data will show a potential multicollinearity problem; a physical constraint in the population causes it, namely that families with higher incomes generally have larger houses
than families with lower incomes. When physical constraints such as this are present, multicollinearity will exist regardless of the sampling method employed. Constraints often occur in problems involving production or chemical processes, where the regressors are the components of a product and these components add to a constant.
3. Multicollinearity may also be induced by the choice of the model. For example, adding polynomial terms to a regression model causes ill-conditioning in X'X. Furthermore, if the range of x is small, adding an x² term can result in significant multicollinearity. We often encounter situations such as these where two or more regressors are nearly dependent, and retaining all of these regressors may contribute to multicollinearity. In these cases some subset of the regressors is usually preferable from the standpoint of multicollinearity.
4. An overdefined model has more regressor variables than observations. These models are sometimes encountered in medical and behavioral research, where there may be only a small number of subjects (sample units) available and information is collected on a large number of regressors for each subject. The usual approach to dealing with multicollinearity in this context is to eliminate some of the regressors from consideration.
5.1.1 Effects Of Multicollinearity
The presence of multicollinearity has a number of potentially serious effects on the least squares estimates of the regression coefficients. Suppose that there are only two regressor variables, x_1 and x_2. The model, assuming that x_1, x_2, and y are scaled to unit length, is

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon \qquad (5.1)$$

and the least squares normal equations X'X\hat{\beta} = X'y are

$$\begin{pmatrix} 1 & r_{12} \\ r_{12} & 1 \end{pmatrix} \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} r_{1y} \\ r_{2y} \end{pmatrix}$$

where r_{12} is the simple correlation between x_1 and x_2 and r_{jy} is the simple correlation between x_j and y, j = 1, 2. Now the inverse of X'X is

$$C = (X'X)^{-1} = \begin{pmatrix} \frac{1}{1-r_{12}^2} & \frac{-r_{12}}{1-r_{12}^2} \\ \frac{-r_{12}}{1-r_{12}^2} & \frac{1}{1-r_{12}^2} \end{pmatrix} \qquad (5.2)$$

and the estimates of the regression coefficients are

$$\hat{\beta}_1 = \frac{r_{1y} - r_{12}\, r_{2y}}{1 - r_{12}^2}, \qquad \hat{\beta}_2 = \frac{r_{2y} - r_{12}\, r_{1y}}{1 - r_{12}^2} \qquad (5.3)$$
If there is strong multicollinearity between x_1 and x_2, then the correlation coefficient r_{12} will be large. From (5.2) we see that as |r_{12}| → 1, V(β̂_j) = C_{jj}σ² → ∞ and Cov(β̂_1, β̂_2) = C_{12}σ² → ±∞, depending on whether r_{12} → +1 or r_{12} → −1. Therefore strong multicollinearity between x_1 and x_2 results in large variances and covariances for the least squares estimators of the regression coefficients. This implies that different samples taken at the same x levels could lead to widely different estimates of the model parameters.
When there are more than two regressor variables, multicollinearity produces similar effects. It can be shown that the diagonal elements of the matrix C = (X'X)^{-1} are

$$C_{jj} = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, p \qquad (5.4)$$
where R_j² is the coefficient of multiple determination from the regression of x_j on the remaining p − 1 regressor variables. If there is strong multicollinearity between x_j and any subset of the other p − 1 regressors, then the value of R_j² will be close to unity. Since the variance of β̂_j is V(β̂_j) = C_{jj}σ² = (1 − R_j²)^{-1}σ², strong multicollinearity implies that the variance of the least squares estimate of the regression coefficient β_j is very large. Generally the covariance of β̂_i and β̂_j will also be large if the regressors x_i and x_j are involved in a multicollinear relationship.
Multicollinearity also tends to produce least squares estimates β̂_j that are too large in absolute value. To see this, consider the squared distance from β̂ to the true parameter vector β, for example,

$$L_1^2 = (\hat{\beta} - \beta)'(\hat{\beta} - \beta) \qquad (5.5)$$

The expected squared distance, E(L_1²), is

$$E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = \sum_{j=1}^{p} E(\hat{\beta}_j - \beta_j)^2 = \sum_{j=1}^{p} V(\hat{\beta}_j) = \sigma^2\, \mathrm{Tr}(X'X)^{-1} \qquad (5.6)$$
where the trace of a matrix (denoted by Tr) is just the sum of its main diagonal elements. When multicollinearity is present, some of the eigenvalues of X'X will be small. Since the trace of a matrix is also equal to the sum of its eigenvalues, (5.6) becomes

$$E(L_1^2) = \sigma^2 \sum_{j=1}^{p} \frac{1}{\lambda_j} \qquad (5.7)$$
where λ_j > 0, j = 1, 2, ..., p, are the eigenvalues of X'X. Thus if the X'X matrix is ill-conditioned because of multicollinearity, at least one of the λ_j will be small, and (5.7)
implies that the distance from the least squares estimate β̂ to the true parameter vector β may be large. Equivalently we can show that

$$E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = E(\hat{\beta}'\hat{\beta} - 2\hat{\beta}'\beta + \beta'\beta)$$

or

$$E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2\, \mathrm{Tr}(X'X)^{-1} \qquad (5.8)$$
That is, the vector β̂ is generally longer than the vector β. This implies that the method of least squares produces estimated regression coefficients that are too large in absolute value.
While the method of least squares will generally produce poor estimates of the individual model parameters when strong multicollinearity is present, this does not necessarily imply that the fitted model is a poor predictor. If predictions are confined to regions of the x-space where the multicollinearity holds approximately, the fitted model often produces satisfactory predictions. This can occur because the linear combination $\sum_{j=1}^{p} \beta_j x_{ij}$ may be estimated quite well, even though the individual parameters β_j are estimated poorly. That is, the response may often be precisely predicted despite the inadequate estimates of the individual model parameters. In general, however, if a model is to extrapolate well, good estimates of the individual coefficients are required. When multicollinearity is suspected, the least squares estimates of the regression coefficients may be very poor. This may seriously limit the usefulness of the regression model for inference and prediction.
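A quick numeric illustration of (5.2)-(5.7), using hypothetical correlation values for the two-regressor case: the sketch below evaluates C = (X'X)^{-1} for increasingly severe r_{12} and prints the variance factor C_{11} and the trace Σ 1/λ_j, both of which blow up as |r_{12}| → 1.

```python
import numpy as np

for r12 in (0.0, 0.90, 0.99, 0.999):
    XtX = np.array([[1.0, r12],
                    [r12, 1.0]])          # correlation-form X'X for two regressors
    C = np.linalg.inv(XtX)
    eigvals = np.linalg.eigvalsh(XtX)
    print(f"r12 = {r12:6.3f}  C11 = {C[0, 0]:10.2f}  "
          f"Tr(X'X)^-1 = {np.sum(1.0 / eigvals):10.2f}")
```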
5.2 Multicollinearity Diagnosis
Several techniques have been proposed for detecting multicollinearity. We will now discuss some of these diagnostic measures. Desirable characteristics of a diagnostic procedure are that it directly reflect the degree of the multicollinearity problem and provide information helpful in determining which regressors are involved.
5.2.1 Estimation of the Correlation Matrix
A very simple measure of multicollinearity is inspection of the off-diagonal elements r_{ij} of X'X (in correlation form). If regressors x_i and x_j are nearly linearly dependent, then |r_{ij}| will be near unity. Examining the simple correlations r_{ij} between the regressors is helpful in detecting near linear dependency between pairs of regressors only. Unfortunately, when more than two regressors are involved in a near linear dependency, there is no assurance that any of the pairwise correlations r_{ij} will be large. Generally, inspection of the r_{ij} is not sufficient for detecting anything more complex than pairwise multicollinearity.
5.2.2 Variance Inflation Factors
The diagonal elements of the matrix C = (X'X)^{-1} are very useful in detecting multicollinearity. The jth diagonal element of C can be written as C_{jj} = (1 − R_j²)^{-1}, where R_j² is the coefficient of determination obtained when x_j is regressed on the remaining p − 1 regressors. If x_j is nearly orthogonal to the remaining p − 1 regressors, R_j² is small and C_{jj} is close to unity, while if x_j is nearly linearly dependent on some subset of the remaining regressors, R_j² is near unity and C_{jj} is large. Since the variance of the jth regression coefficient is C_{jj}σ², we can view C_{jj} as the factor by which the variance of β̂_j is increased due to near linear dependencies among the regressors. We call

$$VIF_j = C_{jj} = (1 - R_j^2)^{-1}$$

the variance inflation factor (Marquardt [1970]). The VIF for each term in the model measures the combined effect of the dependencies among the regressors on the variance of that term. One or more large VIFs indicate multicollinearity. Practical experience indicates that if any of the VIFs exceeds 5 or 10, it is an indication that the associated regression coefficients are poorly estimated because of multicollinearity.
The VIFs have another interesting interpretation. The length of the normal-theory confidence interval on the jth regression coefficient may be written as

$$L_j = 2\,(C_{jj}\,\hat{\sigma}^2)^{1/2}\, t_{\alpha/2,\, n-p-1}$$

and the length of the corresponding interval based on an orthogonal reference design with the same sample size and the same root-mean-square values [i.e., rms = $\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$, a measure of the spread of the regressor x_j] as the original design is

$$L^{*} = 2\,\hat{\sigma}\, t_{\alpha/2,\, n-p-1}$$

The ratio of these two confidence intervals is L_j/L* = C_{jj}^{1/2}. Thus the square root of the jth VIF indicates how much longer the confidence interval for the jth regression coefficient is because of multicollinearity. The VIFs can help identify which regressors are involved in multicollinearity.
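A brief sketch of the VIF computation from (hypothetical) data: scale the regressors to correlation form, invert X'X, and read the VIFs off the diagonal, which is equivalent to regressing each x_j on the others and computing 1/(1 − R_j²).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Unit-length (correlation-form) scaling of the regressors
Xc = X - X.mean(axis=0)
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
C = np.linalg.inv(Xs.T @ Xs)             # (X'X)^{-1} in correlation form

for j, v in enumerate(np.diag(C), start=1):
    print(f"VIF_{j} = {v:9.1f}")          # values >> 10 flag x1 and x2
```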
5.2.3 Eigensystem Analysis of X'X
The characteristic roots, or eigenvalues, of X'X, say λ_1, λ_2, ..., λ_p, can be used to measure the extent of multicollinearity in the data. If there are one or more near linear dependencies in the data, then one or more of the characteristic roots will be small. One or more small eigenvalues imply that there are near linear dependencies among the columns of X. Some analysts prefer to examine the condition number of X'X, defined as

$$\kappa = \frac{\lambda_{max}}{\lambda_{min}} \qquad (5.9)$$

This is just a measure of the spread in the eigenvalue spectrum of X'X. Generally, if the condition number is less than 100, there is no serious problem with multicollinearity. Condition numbers between 100 and 1000 imply moderate to strong multicollinearity, and if κ exceeds 1000, severe multicollinearity is indicated.
The condition indices of the X'X matrix are

$$\kappa_j = \frac{\lambda_{max}}{\lambda_j}, \qquad j = 1, 2, \ldots, p$$

Clearly the largest condition index is the condition number defined in (5.9). The number of condition indices that are large (say ≥ 1000) is a useful measure of the number of near linear dependencies in X'X.
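The eigensystem diagnostics are only a few lines of NumPy; the sketch below (hypothetical collinear data, scaled to correlation form) computes the eigenvalues, the condition number, and the eigenvector paired with the smallest eigenvalue, which points at the regressors involved in the near dependency.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n), rng.normal(size=n)])
Xc = X - X.mean(axis=0)
Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))          # correlation-form scaling

eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)      # eigenvalues in ascending order
kappa = eigvals.max() / eigvals.min()             # condition number (5.9)
indices = eigvals.max() / eigvals                 # condition indices

print("eigenvalues      :", np.round(eigvals, 4))
print("condition number :", round(kappa, 1))
print("indices >= 1000  :", int(np.sum(indices >= 1000)))
# The eigenvector for the smallest eigenvalue loads on the first two columns,
# identifying them as the regressors involved in the near linear dependency.
print("eigenvector for smallest eigenvalue:", np.round(eigvecs[:, 0], 3))
```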
5.2.4 Other Diagnostics
There are several other techniques that are occasionally useful in diagnosing multicollinearity. The determinant of X'X can be used as an index of multicollinearity. Since the X'X matrix is in correlation form, the possible range of values of the determinant is 0 ≤ |X'X| ≤ 1. If |X'X| = 1, the regressors are orthogonal, while if |X'X| = 0, there is an exact linear dependency among the regressors. The degree of multicollinearity becomes more severe as |X'X| approaches zero. While this measure of multicollinearity is easy to apply, it does not provide any information on the source of the multicollinearity. Several further diagnostics exist but are not discussed here.
5.3 Methods for Dealing with Multicollinearity
Several techniques have been proposed for dealing with the problems caused by multicollinearity. The general approaches include collecting additional data, model respecification, and the use of estimation methods other than least squares that are specially designed to combat the problems induced by multicollinearity.
5.3.1 Collecting Additional Data
Collecting additional data has been suggested as the best method of combating multicollinearity. The additional data should be collected in a manner designed to break up the multicollinearity in the existing data. Unfortunately, collecting additional data is not always possible because of economic constraints or because the process being studied is no longer available for sampling. Even when additional data are available, it may be inappropriate to use them if the new data extend the range of the regressor variables far beyond the analyst's region of interest. Furthermore, if the new data points are unusual or atypical of the process being studied, their presence in the sample could be highly influential on the fitted model. Finally, note that collecting additional data is not a viable solution to the multicollinearity problem when the multicollinearity is due to constraints on the model or in the population.
5.3.2 Model Respecification
Multicollinearity is often caused by the choice of model, such as when two highly correlated regressors are used in the regression equation. In these situations some respecification of the regression equation may lessen the impact of multicollinearity. One approach to model respecification is to redefine the regressors. For example, if x_1, x_2, and x_3 are nearly linearly dependent, it may be possible to find some function such as x = (x_1 + x_2)/x_3 or x = x_1 x_2 x_3 that preserves the information content in the original regressors but reduces the ill-conditioning.
Another widely used approach to model respecification is variable elimination. That is, if x_1, x_2, and x_3 are nearly linearly dependent, eliminating one regressor (say x_3) may be helpful in combating multicollinearity. Variable elimination is often a highly effective technique. However, it may not provide a satisfactory solution if the regressors dropped from the model have significant explanatory power relative to the response y. That is, eliminating regressors to reduce multicollinearity may damage the predictive power of the model. Care must be exercised in variable selection because many of the selection procedures are seriously distorted by multicollinearity, and there is no assurance that the final model will exhibit any lesser degree of multicollinearity than was present in the original data.
5.3.3 Ridge Regression
When the method of least squares is applied to nonorthogonal data, very poor estimates of the regression coefficients are usually obtained. In such cases the variance of the least squares estimates of the regression coefficients may be considerably inflated, and the length of the vector of least squares parameter estimates is too long on the average. In such cases we generally estimate the parameters using ridge regression, which we discuss in the next chapter.
5.3.4 Principal Components Regression
Biased estimators of the regression coefficients can also be obtained by using a procedure known as principal components regression. Consider the canonical form of the model,

$$y = Z\alpha + \varepsilon \qquad (5.10)$$

where

$$Z = XT, \qquad \alpha = T'\beta, \qquad T'X'XT = Z'Z = \Lambda$$

Recall that Λ = diag(λ_1, λ_2, ..., λ_p) is a p × p diagonal matrix of the eigenvalues of X'X and T is a p × p orthogonal matrix whose columns are the eigenvectors associated with λ_1, λ_2, ..., λ_p. The columns of Z, which define a new set of orthogonal regressors Z = [Z_1, Z_2, ..., Z_p], are referred to as principal components.
The least squares estimator of α is

$$\hat{\alpha} = (Z'Z)^{-1} Z'y = \Lambda^{-1} Z'y \qquad (5.11)$$

and the covariance matrix of α̂ is

$$V(\hat{\alpha}) = \sigma^2 (Z'Z)^{-1} = \sigma^2 \Lambda^{-1} \qquad (5.12)$$
Thus a small eigenvalue of X'X means that the variance of the corresponding orthogonal regression coefficient will be large. Since Z'Z = Λ, with Z_j'Z_j = λ_j, we often refer to the eigenvalue λ_j as the variance of the jth principal component. If all the λ_j are equal to unity, the original regressors are orthogonal, while if a λ_j is exactly equal to zero, this implies a perfect linear relationship between the original regressors. One or more of the λ_j near zero implies that multicollinearity is present. Note also that the covariance matrix of the standardized regression coefficients β̂ is

$$V(\hat{\beta}) = V(T\hat{\alpha}) = T\Lambda^{-1}T'\sigma^2$$

This implies that the variance of β̂_j is σ² $\sum_{i=1}^{p} t_{ji}^2/\lambda_i$, where t_{ji} is the (j, i) element of T. Therefore the variance of β̂_j is a linear combination of the reciprocals of the eigenvalues. This demonstrates how one or more small eigenvalues can destroy the precision of the least squares estimates β̂_j.
We have observed previously how the eigenvalues and eigenvectors of X'X provide specific information on the nature of the multicollinearity. Since Z = XT, we have

$$Z_j = \sum_{i=1}^{p} t_{ij} X_i \qquad (5.13)$$

where X_i is the ith column of the X matrix and the t_{ij} are the elements of the jth column of T (the jth eigenvector of X'X). If the variance of the jth principal component (λ_j) is small, this implies that Z_j is nearly constant, and (5.13) indicates that there is then a linear combination of the original regressors that is nearly constant. This is the definition of multicollinearity. Therefore (5.13) explains why the elements of the eigenvector associated with a small eigenvalue of X'X identify the regressors involved in the multicollinearity.
The principal components regression approach combats multicollinearity by using less than the full set of principal components in the model. To obtain the principal components estimator, assume that the regressors are arranged in order of decreasing
eigenvalues, λ_1 ≥ λ_2 ≥ ... ≥ λ_p > 0. Suppose that the last s of these eigenvalues are approximately equal to zero. In principal components regression the principal components corresponding to the near-zero eigenvalues are removed from the analysis and least squares is applied to the remaining components. That is,

$$\hat{\alpha}_{PC} = B\hat{\alpha} \qquad (5.14)$$

where B = diag(b_1, ..., b_p) with b_1 = b_2 = ... = b_{p−s} = 1 and b_{p−s+1} = b_{p−s+2} = ... = b_p = 0. Thus the principal components estimator is

$$\hat{\alpha}_{PC} = (\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{p-s}, 0, \ldots, 0)'$$

or, in terms of the standardized regressors,

$$\hat{\beta}_{PC} = T\hat{\alpha}_{PC} = \sum_{j=1}^{p-s} \lambda_j^{-1}\, t_j' X' y\; t_j \qquad (5.15)$$
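As a minimal sketch of the estimator in (5.14)-(5.15) on hypothetical data (with the number of dropped components chosen by a crude eigenvalue cutoff), the principal components estimator can be computed directly from the eigendecomposition of the correlation-form X'X:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 120, 4
W = rng.normal(size=(n, p))
W[:, 3] = W[:, 0] + W[:, 1] + 0.01 * rng.normal(size=n)   # near linear dependency
y = W @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

# Center and scale X to correlation form; center y
Wc = W - W.mean(axis=0)
X = Wc / np.sqrt((Wc ** 2).sum(axis=0))
yc = y - y.mean()

lam, T = np.linalg.eigh(X.T @ X)           # columns of T are eigenvectors
order = np.argsort(lam)[::-1]              # arrange eigenvalues in decreasing order
lam, T = lam[order], T[:, order]

alpha_hat = (T.T @ X.T @ yc) / lam         # alpha_hat = Lambda^{-1} Z'y
keep = lam > 1e-2 * lam[0]                 # drop components with near-zero eigenvalues
alpha_pc = np.where(keep, alpha_hat, 0.0)  # (5.14) with b_j equal to 1 or 0

beta_pc = T @ alpha_pc                     # (5.15): back to standardized coefficients
beta_ls = np.linalg.solve(X.T @ X, X.T @ yc)
print("least squares       :", np.round(beta_ls, 2))
print("principal components:", np.round(beta_pc, 2))
```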
A simulation study by Gunst and Mason [1977] showed that principal components regression offers considerable improvement over least squares when the data are ill-conditioned. They also point out that another advantage of principal components regression is that exact distribution theory and variable selection procedures are available.
∗∗∗∗∗
Chapter 6
Ridge Regression

Ridge regression provides another alternative estimation method that may be used to advantage when the predictor variables are highly collinear. There are a number of alternative ways to define and compute ridge estimates. We have chosen to present the method associated with the ridge trace. It is a graphical approach and may be viewed as an exploratory technique. Ridge analysis using the ridge trace represents a unified approach to problems of detection and estimation when multicollinearity is suspected. The estimators produced are biased but tend to have a smaller mean squared error than the OLS estimator (Hoerl and Kennard, 1970).
The prime competitor of subset regression in terms of variance reduction is ridge regression. Here the coefficients are estimated by (X'X + λI)^{-1}X'y, where λ is a shrinkage parameter. Increasing λ shrinks the coefficient estimates, but none are set equal to zero. Gruber (1990) gave a recent overview of ridge methods. Some studies (e.g., Frank and Friedman 1993; Hoerl, Schuenemeyer 1986) have shown that ridge regression gives more accurate predictions than subset regression unless, assuming that y is of the form

$$y = \sum_k \beta_k x_k + \varepsilon,$$
all but a few of the {β_k} are nearly zero and the rest are large. Thus, although subset regression can improve accuracy if p (the number of regressors) is large, it is usually not as accurate as ridge regression.
Ridge regression has its own drawbacks. It gives a regression equation no simpler than the original OLS equation. Furthermore, it is not scale invariant: if the scales used to express the individual predictor variables are changed, the ridge coefficients do not change inversely proportionally to the changes in the variable scales. The usual recipe is to standardize the {x_m} to mean 0, variance 1, and then apply ridge regression. But this recipe is arbitrary; interquartile ranges, for instance, could be used to normalize instead, giving a different regression predictor.
6.1 Ridge Estimation
As noted in the last chapter, when the method of least squares is applied to nonorthogonal data, very poor estimates of the regression coefficients are
usually obtained. In such cases the variance of the least squares estimates of the regression coefficients may be considerably inflated, and the length of the vector of least squares parameter estimates is too long on the average. The problem with the method of least squares is the requirement that β̂ be an unbiased estimator of β. The Gauss-Markov property referred to earlier assures us that the least squares estimator has minimum variance in the class of unbiased linear estimators, but there is no guarantee that this variance will be small. When the variance of β̂ is large, confidence intervals on β will be wide and the point estimate β̂ is very unstable.
One way to alleviate this problem is to drop the requirement that the estimator of β be unbiased. Suppose that we can find a biased estimator of β, say β̂*, that has a smaller variance than the unbiased estimator β̂. The mean square error of the estimator β̂* is defined as

$$MSE(\hat{\beta}^{*}) = E(\hat{\beta}^{*} - \beta)^2 = V(\hat{\beta}^{*}) + \left[E(\hat{\beta}^{*} - \beta)\right]^2 \qquad (6.1)$$

or

$$MSE(\hat{\beta}^{*}) = V(\hat{\beta}^{*}) + \left(\text{bias in } \hat{\beta}^{*}\right)^2$$

Note that the MSE is just the expected squared distance from β̂* to β. By allowing a small amount of bias in β̂*, the variance of β̂* can be made small such that the MSE of β̂* is less than the variance of the unbiased estimator β̂. Figure 6.1(b) illustrates a situation where the variance of the biased estimator is considerably smaller than the variance of
Figure 6.1: Sampling distribution of (a) unbiased and (b) biased estimators of β

the unbiased estimator (Figure 6.1(a)). Consequently, confidence intervals on β would be
much narrower using the biased estimator. The small variance of the biased estimator also implies that β̂* is a more stable estimator of β than is the unbiased estimator β̂.
A number of procedures have been developed for obtaining biased estimators of regression coefficients. One of these procedures is ridge regression. The ridge estimator is found by solving a slightly modified version of the normal equations. Specifically, we define the ridge estimator β̂_R as the solution to

$$(X'X + kI)\,\hat{\beta}_R = X'y \qquad (6.2)$$

or

$$\hat{\beta}_R = (X'X + kI)^{-1} X'y \qquad (6.3)$$

where k ≥ 0 is a constant selected by the analyst. The procedure is called ridge regression. Note that when k = 0, the ridge estimator is the least squares estimator. The ridge estimator is a linear transformation of the least squares estimator, since

$$\hat{\beta}_R = (X'X + kI)^{-1} X'y = (X'X + kI)^{-1} (X'X)\,\hat{\beta} = Z_k \hat{\beta}$$

Therefore, since E(β̂_R) = E(Z_k β̂) = Z_k β, β̂_R is a biased estimator of β. We usually refer to the constant k as the biasing parameter. The covariance matrix of β̂_R is

$$V(\hat{\beta}_R) = \sigma^2 (X'X + kI)^{-1} X'X\, (X'X + kI)^{-1} \qquad (6.4)$$

The mean square error of the ridge estimator is

$$MSE(\hat{\beta}_R) = V(\hat{\beta}_R) + \left(\text{bias in } \hat{\beta}_R\right)^2 = \sigma^2\, \mathrm{Tr}\left[(X'X + kI)^{-1} X'X (X'X + kI)^{-1}\right] + k^2 \beta'(X'X + kI)^{-2}\beta = \sigma^2 \sum_{j=1}^{p} \frac{\lambda_j}{(\lambda_j + k)^2} + k^2 \beta'(X'X + kI)^{-2}\beta \qquad (6.5)$$
where λ_1, λ_2, ..., λ_p are the eigenvalues of X'X. If k > 0, note that the bias in β̂_R increases with k. However, the variance decreases as k increases.
In using ridge regression we would like to choose a value of k such that the reduction in the variance term is greater than the increase in the squared bias. If this can be done, the mean square error of the ridge estimator β̂_R will be less than the variance of the least squares estimator β̂, provided that β'β is bounded. The residual sum of squares is

$$SS_E = (y - X\hat{\beta}_R)'(y - X\hat{\beta}_R) \qquad (6.6)$$
$$\;\;\;\;= (y - X\hat{\beta})'(y - X\hat{\beta}) + (\hat{\beta}_R - \hat{\beta})'\, X'X\, (\hat{\beta}_R - \hat{\beta}) \qquad (6.7)$$
Since the first term on the right-hand side of (6.7) is the residual sum of squares for the least squares estimate β̂, we see that as k increases, the residual sum of squares increases. Consequently, because the total sum of squares is fixed, R² decreases as k increases. Therefore the ridge estimate will not necessarily provide the best "fit" to the data, but this should not overly concern us, since we are more interested in obtaining a stable set of parameter estimates. The ridge estimates may result in an equation that does a better job of predicting future observations than would least squares.

Ridge Trace
Hoerl and Kennard have suggested that an appropriate value of k may be determined by inspection of the ridge trace. The ridge trace is a plot of the elements of β̂_R versus k, for values of k usually in the interval 0-1. Marquardt and Snee [1975] suggest using up to about 25 values of k, spaced approximately logarithmically over the interval [0, 1]. If multicollinearity is severe, the instability in the regression coefficients will be obvious from the ridge trace. As k is increased, some of the ridge estimates will vary dramatically. At some value of k, the ridge estimates β̂_R stabilize. Hopefully this will produce a set of estimates with smaller MSE than the least squares estimates.
The ridge regression estimates may be computed by using an ordinary least squares computer program and augmenting the standardized data as follows:

$$X_A = \begin{pmatrix} X \\ \sqrt{k}\, I_p \end{pmatrix}, \qquad y_A = \begin{pmatrix} y \\ 0_p \end{pmatrix}$$

where √k I_p is a p × p diagonal matrix with diagonal elements equal to the square root of the biasing parameter and 0_p is a p × 1 vector of zeros. The ridge estimates are then computed from

$$\hat{\beta}_R = (X_A'X_A)^{-1} X_A' y_A = (X'X + kI_p)^{-1} X'y$$
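A short sketch of this augmentation trick (hypothetical standardized data, arbitrary k): appending √k I_p to X and p zeros to y and then running ordinary least squares reproduces the ridge estimate (X'X + kI)^{-1}X'y.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)       # induce collinearity
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

# Standardize (centered, unit-length columns) as the text assumes
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc ** 2).sum(axis=0))
y = y - y.mean()

k = 0.1
X_aug = np.vstack([X, np.sqrt(k) * np.eye(p)])      # augmented regressor matrix X_A
y_aug = np.concatenate([y, np.zeros(p)])            # augmented response y_A

beta_ridge_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
beta_ridge_dir = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
print(np.allclose(beta_ridge_aug, beta_ridge_dir))  # True: the two computations agree
```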
Some Other Properties of Ridge Regression
Figure 6.2 illustrates the geometry of ridge regression for a two-regressor problem. The point β̂ at the center of the ellipses corresponds to the least squares solution, where the residual sum of squares takes on its minimum value. The small ellipse represents the locus of points in the (β_1, β_2) plane where the residual sum of squares is constant at some value greater than the minimum. The ridge estimate β̂_R is the shortest vector from the origin that produces a residual sum of squares equal to the value represented by the small ellipse. That is, the ridge estimate β̂_R is the vector of regression coefficients with the smallest norm consistent with a specified increase in the residual sum of squares. We note that the ridge estimator shrinks the least squares estimator toward the origin. Consequently, ridge estimators (and other biased estimators generally) are sometimes called shrinkage estimators. Hocking [1976] has observed that the ridge
Figure 6.2: A geometrical interpretation of ridge regression

estimator shrinks the least squares estimator with respect to the contours of X'X. That is, β̂_R is the solution to

$$\min_{\beta}\; (\beta - \hat{\beta})'\, X'X\, (\beta - \hat{\beta}) \quad \text{subject to} \quad \beta'\beta \le d^2 \qquad (6.8)$$

where the radius d depends on k.
Many of the properties of the ridge estimator assume that the value of k is fixed. In practice, since k is estimated from the data by inspection of the ridge trace, k is stochastic. It is of interest to ask whether the optimality properties cited by Hoerl and Kennard hold if k is stochastic. Several authors have shown through simulations that ridge regression generally offers improvement in mean square error over least squares when k is estimated from the data. Theobald [1974] has generalized the conditions under which ridge regression leads to smaller MSE than least squares; the expected improvement depends on the orientation of the β vector relative to the eigenvectors of X'X.
Obenchain [1977] has shown that non-stochastically shrunken ridge estimators yield the same t- and F-statistics for testing hypotheses as does least squares. Thus, although ridge regression leads to biased point estimates, it does not generally require a new distribution theory. However, distributional properties are still unknown for stochastic choices of k.
6.2 Methods for Choosing k
Much of the controversy concerning ridge regression centers around the choice of the biasing parameter k. Choosing k by inspection of the ridge trace is a subjective procedure
requiring judgment on the part of the analyst. Several authors have proposed procedures for choosing k that are more analytical. Hoerl, Kennard, and Baldwin [1975] have suggested that an appropriate choice for k is

$$k = \frac{p\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}} \qquad (6.9)$$

where β̂ and σ̂² are found from the least squares solution. They showed via simulation that the resulting ridge estimator had significant improvement in MSE over least squares. In a subsequent paper, Hoerl and Kennard [1976] proposed an iterative estimation procedure based on (6.9). Specifically, they suggested the following sequence of estimates of k:

$$k_0 = \frac{p\hat{\sigma}^2}{\hat{\beta}'\hat{\beta}}, \qquad k_1 = \frac{p\hat{\sigma}^2}{\hat{\beta}_R(k_0)'\hat{\beta}_R(k_0)}, \qquad k_2 = \frac{p\hat{\sigma}^2}{\hat{\beta}_R(k_1)'\hat{\beta}_R(k_1)}, \qquad \ldots$$

The relative change in k_j is used to terminate the procedure: if

$$\frac{k_{j+1} - k_j}{k_j} > 20\,T^{-1.3}$$

the algorithm continues; otherwise it terminates and β̂_R(k_j) is used, where T = Tr[(X'X)^{-1}]/p. This choice of termination criterion was selected because T increases with the spread in the eigenvalues of X'X, allowing further shrinkage as the degree of ill-conditioning in the data increases.
McDonald and Galarneau [1975] suggest choosing k so that

$$\hat{\beta}_R'\hat{\beta}_R = \hat{\beta}'\hat{\beta} - \hat{\sigma}^2 \sum_{j=1}^{p} \frac{1}{\lambda_j} \qquad (6.10)$$

For cases where the right-hand side of (6.10) is negative, they investigated letting either k = 0 (least squares) or k = ∞ (β̂_R = 0).
The methods of choosing k that we have described so far are focused on improving the estimates of the regression coefficients. If the model is to be used for prediction purposes, then it may be more appropriate to consider prediction-oriented criteria for choosing k. Mallows modified the C_p statistic to a C_k statistic that can be used to determine k. He proposed plotting C_k against V_k, where

$$C_k = \frac{SS_E(k)}{\hat{\sigma}^2} - n + 2 + 2\,\mathrm{Tr}(XL), \qquad V_k = 1 + \mathrm{Tr}(X'X L L'), \qquad L = (X'X + kI)^{-1} X'$$
and SS_E(k) is the residual sum of squares as a function of k. The suggestion is to choose k to minimize C_k. Note that XL = X(X'X + kI)^{-1}X' ≡ H_k, and H_k is equivalent to the hat matrix in ordinary least squares. Another possibility is to use a PRESS_ridge procedure involving

$$PRESS_{ridge} = \sum_{i=1}^{n} \left(\frac{e_{i,k}}{1 - h_{ii,k}}\right)^2$$

where e_{i,k} is the ith residual for a specific value of k and h_{ii,k} is the ith diagonal element of H_k. The value of k is chosen so as to minimize PRESS_ridge.
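The following sketch (the same kind of hypothetical standardized data as before) computes the Hoerl-Kennard-Baldwin starting value k = pσ̂²/(β̂'β̂) from the least squares fit and then evaluates the ridge coefficients over a logarithmic grid of k in (0, 1], which is exactly the information one would plot as a ridge trace.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)
y = X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=n)
Xc = X - X.mean(axis=0)
X = Xc / np.sqrt((Xc ** 2).sum(axis=0))
y = y - y.mean()

# Least squares fit and the Hoerl-Kennard-Baldwin choice of k (6.9)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
sigma2 = np.sum((y - X @ beta_ls) ** 2) / (n - p - 1)
k_hkb = p * sigma2 / (beta_ls @ beta_ls)
print("HKB k =", round(k_hkb, 4))

# Ridge trace: coefficients over a logarithmic grid of k
for k in np.logspace(-4, 0, 9):
    beta_k = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
    print(f"k = {k:8.4f}  beta_R = {np.round(beta_k, 3)}")
```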
6.3 Ridge Regression and Variable Selection
Standard variable selection algorithms often do not perform well when the data are highly multicollinear. However, once the regressors have been made more nearly orthogonal by the use of biased estimators, variable selection can be a good strategy. Hoerl and Kennard [1970] suggest that the ridge trace can be used as a guide for variable selection. They propose the following rules for removing regressors from the full model:
1. Remove regressors that are stable but have small prediction power, that is, regressors with small standardized coefficients.
2. Remove regressors with unstable coefficients that do not hold their prediction power, that is, unstable coefficients that are driven to zero.
3. Remove one or more of the remaining regressors that have unstable coefficients.
The subset of remaining regressors, say p in number, is used in the "final" model. We may examine these regressors to see if they form a nearly orthogonal subset. This may be done by plotting β̂_R(k)'β̂_R(k), the squared length of the ridge coefficient vector, as a function of k. If the regressors are orthogonal, the squared length of the vector of ridge estimates should be β̂'β̂/(1 + k)², where β̂ is the ordinary least squares estimate of β. Therefore, if the subset model contains nearly orthogonal regressors, the functions β̂_R(k)'β̂_R(k) and β̂'β̂/(1 + k)² plotted against k should be very similar.
6.4 Ridge Regression: Some Remarks
Ridge regression provides a tool for judging the stability of a given body of data.....................update from hardcopy.....cess under study. Another practical problem with ridge regression is that it has not been implemented in some statistical packages. If a statistical package does not have a routine for ridge 60
regression, ridge regression estimates can be obtained from a standard least squares package by using a slightly altered data set. Specifically, the ridge estimates of the regression coefficients can be obtained from the regression of Y* on X*_1, X*_2, ..., X*_p. The new response variable Y* is obtained by augmenting Ỹ with p new fictitious observations, each of which is equal to zero. Similarly, the new predictor variable X*_j is obtained by augmenting X̃_j with p new fictitious observations, each of which is equal to zero except the one in the jth position, which is equal to √k, where k is the chosen value of the ridge parameter. It can be shown that the ridge estimates θ̂_1(k), ..., θ̂_p(k) are obtained by the least squares regression of Y* on X*_1, X*_2, ..., X*_p without a constant term in the model.
∗∗∗∗∗
Chapter 7
Better Subset Regression Using the Nonnegative Garrote

A new method, called the nonnegative (nn) garrote, is proposed by Leo Breiman for doing subset regression. It both shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error than ordinary subset selection. It is also compared to ridge regression. If the regression equations generated by a procedure do not change drastically with small changes in the data, the procedure is called stable. Subset selection is unstable, ridge is very stable, and the nonnegative garrote is intermediate.
7.1
Nonnegative Garrote
Subset regression is unstable with respect to small perturbations in the data. Say that n = 100, p = 40, and that using stepwise deletion of variables a sequence of subsets of variables {x_m; m ∈ ζ_k} of dimension k (|ζ_k| = k), k = 1, ..., p, has been selected. Now remove a single data case (y_i, x_i) and use the same selection procedure, getting a sequence of subsets {x_m; m ∈ ζ'_k}. Usually the {ζ'_k} and {ζ_k} are different, so that for some k a slight data perturbation leads to a drastic change in the prediction equation. On the other hand, if one uses ridge estimates and deletes a single data case, the new ridge estimates, for some λ, will be close to the old. Much work and research have gone into subset-selection regression, but the basic method remains flawed by its relative lack of accuracy and by its instability. Subset regression either zeroes a coefficient, if it is not in the selected subset, or inflates it. Ridge regression gains its accuracy by selective shrinking. Methods that select subsets, are stable, and shrink are needed. Here is one: let β̂_k be the original OLS estimates, and take {c_k} to minimize

Σ_i ( y_i − Σ_k c_k β̂_k x_{ki} )²

under the constraints

c_k ≥ 0,   Σ_k c_k ≤ s.
The β̃_k(s) = c_k β̂_k are the new predictor coefficients. As the garrote is drawn tighter by decreasing s, more of the {c_k} become zero and the remaining nonzero β̃_k(s) are shrunken. This procedure is called the nonnegative (nn) garrote. The garrote eliminates some variables, shrinks others, and is relatively stable. It is also scale invariant. Breiman shows that it is almost always more accurate than subset selection and that its accuracy is competitive with ridge. In general the nn-garrote produces regression equations having more nonzero coefficients than subset regression, but the loss in simplicity is offset by substantial gains in accuracy.
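A minimal computational sketch of this definition follows. It uses a general-purpose constrained optimizer (SLSQP from scipy) rather than Breiman's own algorithm, and every name in it is illustrative.

import numpy as np
from scipy.optimize import minimize

def nn_garrote(X, y, s):
    """Nonnegative garrote coefficients for a given budget s (sketch).

    Minimizes sum_i (y_i - sum_k c_k * beta_ols_k * x_ik)^2 subject to
    c_k >= 0 and sum_k c_k <= s, then returns c_k * beta_ols_k."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    Z = X * beta_ols                      # column k of Z is beta_ols_k * x_k

    def rss(c):
        r = y - Z @ c
        return r @ r

    res = minimize(
        rss,
        x0=np.full(p, s / p),             # feasible starting point
        method="SLSQP",
        bounds=[(0.0, None)] * p,         # c_k >= 0
        constraints=[{"type": "ineq", "fun": lambda c: s - c.sum()}],  # sum_k c_k <= s
    )
    return res.x * beta_ols               # shrunken / zeroed coefficients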
7.2
Model Selection
7.2.1
Prediction and Model Error
The prediction error is defined as the average error in predicting y from x for future cases not used in the construction of the prediction equation. There are two regression situations, X-controlled and X-random. In the controlled situation, the x_i are selected by the experimenter and only y is random. In the X-random situation, both y and x are randomly selected.

Case I: X-controlled

In the controlled situation, future data are assumed to be gathered using the same {x_i} as in the present data and thus have the form {(y_i^new, x_i), i = 1, ..., n}. If µ̂(x) is the prediction equation derived from the present data, then define the prediction error as

PE(µ̂) = E Σ_i (y_i^new − µ̂(x_i))²,

where the expectation is over {y_i^new}. If the data are generated by the mechanism y_i = µ(x_i) + ε_i, where the ε_i are mean zero, uncorrelated, with average variance σ², then

PE(µ̂) = nσ² + Σ_i (µ(x_i) − µ̂(x_i))².

The first component is the inherent prediction error due to the noise. The second component is the prediction error due to lack of fit to the underlying model. This component is called the model error and is denoted by ME(µ̂). The size of the model error reflects different methods of model estimation. If µ = Σ_m β_m x_m and µ̂ = Σ_m β̂_m x_m, then

ME(µ̂) = (β̂ − β)'(X'X)(β̂ − β).
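A small illustrative sketch of the model-error computation in the X-controlled case, on simulated data; the helper name and the simulated setup are ours and are not part of the text.

import numpy as np

def model_error(X, beta_true, beta_hat):
    """ME = (beta_hat - beta_true)' X'X (beta_hat - beta_true) for fixed X."""
    d = beta_hat - beta_true
    return d @ (X.T @ X) @ d

# Small simulated example: the model error of an OLS fit
rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 1.0
X = rng.standard_normal((n, p))
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta + sigma * rng.standard_normal(n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("ME of the OLS fit:", model_error(X, beta, beta_hat))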
Case II: X-random

If the {x_i} are random, then it is assumed that the (y_i, x_i) are iid selections from the parent distribution (Y, X). Then if µ̂(x) is the prediction equation constructed using the present data,

PE(µ̂) = E(Y − µ̂(X))².

Assuming that Y = µ(X) + ε, where E(ε|X) = 0, then

PE(µ̂) = E(ε²) + E(µ(X) − µ̂(X))².

Again, the relevant error is the second component. If µ = Σ_m β_m x_m and µ̂ = Σ_m β̂_m x_m, then the model error is given by

ME(µ̂) = (β̂ − β)'(nΓ)(β̂ − β),

where Γ_ij = E(X_i X_j).
7.3
Estimation of Error
In each regression procedure we generally get a sequence of models {µ̂_k(x)}. Variable selection gives a sequence of subsets of variables {x_m, m ∈ ζ_k}, |ζ_k| = k, k = 1, ..., p, and µ̂_k(x) is the OLS linear regression based on {x_m, m ∈ ζ_k}. In the nn-garrote, a sequence of s-parameter values s_1, s_2, ..., s_K is selected, and µ̂_k(x), k = 1, ..., K, is the prediction equation using parameter s_k. In ridge regression, a sequence of λ-parameter values λ_1, λ_2, ..., λ_K is selected, and µ̂_k(x) is the ridge regression based on λ_k.

If we knew the true values of PE(µ̂_k), the model selected would be the minimizer of PE(µ̂_k). We refer to these selections as the crystal-ball models. Otherwise, the selection process constructs an estimate P̂E(µ̂_k) and selects the µ̂_k that minimizes P̂E. The estimation methods differ between the X-controlled and X-random cases.
7.3.1
X-Controlled Estimates
The most widely used estimate in subset selection is Mallows' C_p. If k is the number of variables in the subset, RSS(k) is the residual sum of squares using µ̂_k, and σ̂² is the noise variance estimate derived from the full model, then the C_p estimate is

P̂E(µ̂_k) = RSS(k) + 2kσ̂².

But Breiman [1992] showed that this estimate is heavily biased and does poorly in model selection. He also showed that a better estimate for PE(µ̂_k) is

P̂E(µ̂_k) = RSS(k) + 2B_t(k),

where B_t(k) is defined as follows. Let σ² be the noise variance, and add iid N(0, t²σ²), 0 < t ≤ 1, noise {ε̃_i} to the {y_i}, getting {ỹ_i}. Now, using the data {(ỹ_i, x_i)}, repeat the subset-selection process, getting a new sequence of OLS predictors {µ̃_k, k = 1, ..., p}. Then

B_t(k) = (1/t²) E( Σ_i ε̃_i µ̃_k(x_i) ),

where the expectation is over the {ε̃_i} only. This is made computable by replacing σ² by the noise variance estimate σ̂² and the expectation over {ε̃_i} by the average over many repetitions. This procedure is called the little bootstrap.

The little bootstrap can also be applied to the nn-garrote and ridge. Suppose that the nn-garrote predictor µ̂_k has been computed for parameter values s_k with resulting residual sum of squares RSS(s_k), k = 1, ..., K. Now add {ε̃_i} to the {y_i}, getting {ỹ_i}, where the {ε̃_i} are iid N(0, t²σ̂²). Using the {(ỹ_i, x_i)} data, derive the nn-garrote predictor µ̃_k for the parameter value s_k, and compute the quantity (1/t²) Σ_i ε̃_i µ̃_k(x_i). Repeat several times, average these quantities, and denote the result by B_t(s_k). The estimate of PE is

P̂E(µ̂_k) = RSS(s_k) + 2B_t(s_k).

In ridge regression, let µ̂_k be the predictor using parameter λ_k. The little bootstrap estimate is RSS(λ_k) + 2B_t(λ_k), where B_t(λ_k) is computed just as in subset selection and the nn-garrote.

Breiman [1992] showed that, for subset selection, the bias of the little bootstrap estimate is small for small t. The same proof holds, almost word for word, for the nn-garrote and ridge. But what happens in subset selection is that as t ↓ 0, the variance of B_t increases rapidly, and B_t has no sensible limiting value. Experiments by Breiman [1992] indicated that the best range for t is [0.6, 0.8] and that averaging over 25 repetitions to form B_t is usually sufficient. On the other hand, in ridge regression the variance of B_t does not increase appreciably as t ↓ 0, and taking this limit results in the more nearly unbiased estimate

P̂E(µ̂_k) = RSS(λ_k) + 2σ̂² tr(X'X(X'X + λ_k I)^{-1}).     (7.1)

This turns out to be an excellent PE estimate that selects regression equations µ̂_k with PE(µ̂_k) close to min_k PE(µ̂_k). The situation for the nn-garrote is intermediate between subset selection and ridge: the variance of B_t increases as t gets small, but a finite variance limit exists. It does not perform as well as using t in the range [0.6, 0.8], however. Therefore, our preferred PE estimates for subset selection and the nn-garrote use t ∈ [0.6, 0.8], and (7.1) is used for the ridge PE estimate. The behavior of B_t for small t is a reflection of the stability of the regression procedure being used.
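A minimal sketch of the little bootstrap penalty B_t. The fitting routine is passed in as a callback, since the same recipe applies to subset selection, the nn-garrote and ridge; this interface, and all names below, are assumptions of the sketch rather than anything specified in the text.

import numpy as np

def little_bootstrap_penalty(X, y, fit_predict, sigma2_hat, t=0.7, reps=25, seed=None):
    """Estimate B_t by averaging (1/t^2) * sum_i eps_i * mu_tilde(x_i) over reps.

    fit_predict(X, y) must refit the chosen procedure on (X, y) and return
    its fitted values on X.  The PE estimate is then RSS + 2 * B_t."""
    rng = np.random.default_rng(seed)
    n = len(y)
    vals = []
    for _ in range(reps):
        eps = rng.normal(scale=np.sqrt(t ** 2 * sigma2_hat), size=n)  # iid N(0, t^2 sigma^2)
        mu_tilde = fit_predict(X, y + eps)                            # refit on perturbed responses
        vals.append((eps @ mu_tilde) / t ** 2)
    return float(np.mean(vals))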
7.3.2
X-Random Estimates
For subset regressions {µ̂_k} in the X-random situation, the most frequently encountered PE estimate is

P̂E(µ̂_k) = RSS(k) / (1 − k/n)².

This estimate can be strongly biased and does poorly in selecting accurate models. In such cases, cross-validation (CV) is useful. V-fold CV is used to estimate PE(µ̂_k) for subset selection and the nn-garrote. The data L = {(y_i, x_i), i = 1, ..., n} are split into V subsets L_1, ..., L_V. Let L^(v) = L − L_v. Using subset selection (nn-garrote) and the data in L^(v), form the predictors µ̂_k^(v)(x). Then the CV estimate is

P̂E(µ̂_k) = Σ_v Σ_{(y_i, x_i) ∈ L_v} (y_i − µ̂_k^(v)(x_i))²     (7.2)

and

M̂E(µ̂_k) = P̂E(µ̂_k) − pσ̂².     (7.3)

To get an accurate PE estimate for the ridge regression µ̂_λ(x), remove the ith case (y_i, x_i) from the data and recompute µ̂_λ(x), getting µ̂_λ^(−i)(x). Then the estimate is

P̂E(λ) = Σ_i (y_i − µ̂_λ^(−i)(x_i))².     (7.4)

This is the leave-one-out CV estimate. If r_i(λ) = y_i − µ̂_λ(x_i) and h_i(λ) = x_i'(X'X + λI)^{-1}x_i, then

P̂E(λ) = Σ_i ( r_i(λ)/(1 − h_i(λ)) )².     (7.5)

Usually h_i(λ) ≈ h̄(λ) is a good approximation, where h̄(λ) = tr(X'X(X'X + λI)^{-1})/n. With this approximation,

P̂E(λ) = RSS(λ)/(1 − h̄(λ))².     (7.6)

This estimate was first derived by Golub, Heath, and Wahba [1979] and is called the GCV (generalized cross-validation) estimate of PE. Breiman and Spector [1992] found that the "infinitesimal" version of CV, that is, leave-one-out, gave poorer results in subset selection than five- or tenfold CV. But leave-one-out works well in ridge regression.
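A minimal sketch of the GCV computation (7.6) for ridge regression; all names are illustrative.

import numpy as np

def gcv_ridge(X, y, lam_values):
    """GCV estimate RSS(lambda) / (1 - tr(H_lambda)/n)^2 for each lambda."""
    n, p = X.shape
    out = []
    for lam in lam_values:
        H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge hat matrix
        rss = np.sum((y - H @ y) ** 2)
        h_bar = np.trace(H) / n                                   # average leverage
        out.append(rss / (1.0 - h_bar) ** 2)
    return np.array(out)

# The lambda minimizing gcv_ridge(X, y, lams) over a grid lams is the GCV choice.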
7.4
X-Orthonormal
In the X-controlled case, assume that X'X = I and that y is generated as

y_i = Σ_m β_m x_{mi} + ε_i,     (7.7)

where the {ε_i} are iid N(0, 1). Then the OLS estimates are β̂_m = β_m + Z_m, where the Z_m are iid N(0, 1). Although this is a highly simplified situation, it can give interesting insights into the comparative behavior of the subset selection, ridge, and nn-garrote regressions.

The best subset of k variables consists of those x_m corresponding to the k largest |β̂_m|, so the coefficients of best subset regression are

β̂_m^(S) = I(|β̂_m| ≥ λ)β̂_m,   m = 1, ..., p,     (7.8)

for some λ ≥ 0, where I(·) is the indicator function. In the nn-garrote, the expression

Σ_i ( y_i − Σ_m c_m β̂_m x_{mi} )²

is minimized under the constraints c_m ≥ 0 for all m and Σ_m c_m = s. The solution is of the form

c_m = ( 1 − λ²/β̂_m² )⁺,     (7.9)

where λ is determined from s by the condition Σ_m c_m = s and the superscript + indicates the positive part of the expression. The nn-garrote coefficients are therefore

β̂_m^(G) = ( 1 − λ²/β̂_m² )⁺ β̂_m,     (7.10)

and the ridge coefficients are

β̂_m^(R) = β̂_m/(1 + λ).     (7.11)

All three of these estimates are of the form β̂' = θ(β̂, λ)β̂, where θ is a shrinkage factor. OLS estimates correspond to θ ≡ 1. Ridge regression gives a constant shrinkage, θ = 1/(1 + λ). The subset-selection shrinkage is 0 for |β̂| ≤ λ and 1 otherwise. The nn-garrote shrinkage is continuous: 0 if |β̂| ≤ λ and then increasing to 1. If the {β̂'_m} are any estimates of the {β_m}, then the model error is

ME(β̂') = Σ_m (β_m − β̂'_m)².
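The three shrinkage rules (7.8)-(7.11) can be tabulated directly; the following short sketch is purely illustrative.

import numpy as np

def shrinkage_factors(beta_hat, lam):
    """Shrinkage factors in the orthonormal case: (subset, nn-garrote, ridge).

    beta_hat is assumed nonzero componentwise (the garrote factor divides
    by beta_hat^2)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    subset = (np.abs(beta_hat) >= lam).astype(float)           # hard 0/1 threshold
    garrote = np.maximum(0.0, 1.0 - lam ** 2 / beta_hat ** 2)  # continuous, zeroes small beta
    ridge = np.full_like(beta_hat, 1.0 / (1.0 + lam))          # constant shrinkage
    return subset, garrote, ridge

# Example: shrinkage_factors(np.linspace(0.25, 3.0, 12), lam=1.0)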
7.5
Conclusions And Remarks
Breiman has given evidence that the nn-garrote is a worthy competitor to subset selection methods. It provides simple regression equations with better predictive accuracy. Unless a large proportion of the "true" coefficients are non-negligible, it gives accuracy better than or comparable to ridge methods. Data reuse methods such as the little bootstrap or V-fold CV do well in estimating good values of the garrote parameter.

Some simulation results can be viewed as intriguing aspects of stability. Each regression procedure chooses from a collection of regression equations. Instability is intuitively defined as meaning that small changes in the data can bring large changes in the equations selected. If, by use of a crystal ball, we could choose the lowest-PE equations among the subset selection, nn-garrote, and ridge collections, the differences in accuracy between the three procedures would be sharply reduced. But the more unstable a procedure is, the more difficult it is to accurately estimate PE or ME. Thus, subset-selection accuracy is severely affected by the relatively poor performance of the PE estimates in picking a low-PE subset. On the other hand, ridge regression, which offers only a small diversity of models but is very stable, sometimes wins because the PE estimates are able to accurately locate low-PE ridge predictors. The nn-garrote is intermediate: its crystal-ball selection is usually somewhat better than the crystal-ball subset selection, but its increased stability allows a better location of low-PE nn-garrote predictions, and this increases its edge.

The ideas used in the nn-garrote can be applied to get other regression shrinkage schemes. For instance, let β̂_k be the original OLS estimates. Take {c_k} to minimize

Σ_i ( y_i − Σ_k c_k β̂_k x_{ki} )²

under the constraint Σ_k c_k² ≤ s. This leads to a procedure intermediate between the nn-garrote and ridge regression. In the X'X = I case, its shrinkage factor is

θ(β̂, λ) = β̂² / (β̂² + λ²).

Unlike ridge, it is scale invariant. Our expectation is that it will be uniformly more accurate than ridge regression while being almost as stable. Like ridge regression, it does not zero coefficients and so does not produce simplified predictors.
∗∗∗∗∗
Chapter 8
Subset Auto Regression
A stationary model which is often fitted to detrended (mean estimated and removed) time series data is the autoregressive (AR) model. If T is the set of all integers, then {X_t; t ∈ T} is a mean-zero AR scheme of order p if

X_t + α_1 X_{t−1} + · · · + α_p X_{t−p} = ε_t,     (8.1)

where {ε_t} is white noise (WN), i.e.,

E(ε_t) = 0,   Cov(ε_t, ε_s) = δ_{t,s} σ².     (8.2)

Fitting an AR model is a two-step procedure:
1. Choose the order, p, of the model.
2. Estimate the model parameters, {α_1, ..., α_p}.
We shall primarily consider step (1). The optimal choice of order has been considered by a number of statisticians (Quenouille, Walker, Whittle, Akaike and Hannan). Sequential hypothesis testing procedures have been adopted. The application of these stepwise procedures is generally restricted to testing the sequence {H₀ : p = k; k = 1, 2, ...} until the first acceptance, or the sequence {H₀ : p = k; k = K, K − 1, ...} until the first acceptance, for some arbitrarily chosen K. After the order is chosen, one may wish to determine whether certain intermediate lags have non-zero coefficients. Cleveland has proposed the use of inverse autocorrelations for determining the order of an AR model, analogous to the use of the autocorrelations for moving average models. Using a sequence of t-tests for determining the importance of individual terms in a regression model, we can decide whether to retain a particular lag in the final model. Akaike has developed a decision-theoretic approach for determining the order. The decision function is the estimated one-step-ahead prediction error for each fitted AR model, called by Akaike the final prediction error (FPE). The model for which the FPE is minimized is chosen as the best model. This technique is most readily adapted to subset regression methodology, since the decision function can be calculated for any AR model.
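For illustration only, here is a minimal Python sketch that simulates a series from the scheme (8.1); the sign convention follows the equation above, and the function name, burn-in and example coefficients are our own choices.

import numpy as np

def simulate_ar(alpha, n, sigma=1.0, burn=500, seed=None):
    """Simulate the mean-zero AR scheme X_t + alpha_1 X_{t-1} + ... + alpha_p X_{t-p} = eps_t,
    i.e. X_t = -sum_j alpha_j X_{t-j} + eps_t, with eps_t iid N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    x = np.zeros(n + burn)
    eps = rng.normal(scale=sigma, size=n + burn)
    for t in range(p, n + burn):
        x[t] = -np.dot(alpha, x[t - p:t][::-1]) + eps[t]
    return x[burn:]            # drop the burn-in so the retained series is close to stationary

# Example: alpha = (-1.1, 0.3) gives X_t = 1.1 X_{t-1} - 0.3 X_{t-2} + eps_t (a stationary AR(2))
# series = simulate_ar([-1.1, 0.3], n=500)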
8.1
Method of Estimation
Suppose that the observed (detrended) time series is X_1, X_2, ..., X_N. Define the sample autocovariance and autocorrelation functions as

c(h) = (1/N) Σ_{t=1}^{N−h} X_t X_{t+h},     (8.3)

r(h) = c(h)/c(0),     (8.4)

respectively, for h = −N + 1, −N + 2, ..., N − 1, where c(−h) = c(h). These estimate the true autocovariance and autocorrelation functions, γ(h) = E(X_t X_{t+h}) and ρ(h) = [γ(0)]^{−1} γ(h), respectively. Let

R_K = Toepl{1, r_1, ..., r_{K−1}}     (8.5)

and

r_K^T = −{r_1, r_2, ..., r_K},     (8.6)

where Toepl{b_1, b_2, ..., b_n} is the n × n Toeplitz matrix with (i, j)-th element b_{|i−j|+1}, and A^T is the transpose of the matrix A. Then, assuming the true model is a p-th order AR with p ≤ K, the Yule-Walker system of equations for calculating the estimate a_K = {a_K(1), a_K(2), ..., a_K(K)}^T of α_K = {α_1, ..., α_p, 0, ..., 0}^T is

R_K a_K = r_K.     (8.7)

The asymptotic variance-covariance matrix, Σ_K, of a_K is

Σ_K = (σ²/N) Γ_K^{−1},     (8.8)

where Γ_K = Toepl{γ(0), γ(1), ..., γ(K − 1)}. Pagano's algorithm, which is both fast and numerically stable, is based upon the Cholesky decomposition of R_{K+1}. This is

R_{K+1} = L_{K+1} D_{K+1} L_{K+1}^T,     (8.9)

where L_{K+1} is unit lower triangular and D_{K+1} is diagonal with positive elements {d_i}_{i=1}^{K+1}. From this decomposition one obtains a_1, a_2, ..., a_K and

σ̂_k² = c(0)[1 − r_k^T R_k^{−1} r_k] = c(0) d_{k+1},   k = 1, 2, ..., K,     (8.10)

the sequence of estimated AR coefficient vectors and residual variances, respectively, for each of the K AR models with lags {(1), (1, 2), ..., (1, 2, ..., K)}. One criterion for choosing the order of the scheme is to observe the non-increasing sequence {σ̂_1², σ̂_2², ...} until it has "leveled out", choosing the order to be the point at which no further significant decrease in σ̂_k² seems likely.
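A minimal sketch of the computations (8.3)-(8.10). It uses scipy's Toeplitz solver instead of Pagano's Cholesky-based recursion, so it reproduces the same Yule-Walker quantities but not the algorithm's speed; all names are illustrative.

import numpy as np
from scipy.linalg import solve_toeplitz

def yule_walker_sequence(x, K):
    """Sample autocorrelations r(h), Yule-Walker estimates a_k and residual
    variances sigma_k^2 for k = 1, ..., K, for a detrended series x."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    c = np.array([np.dot(x[:N - h], x[h:]) / N for h in range(K + 1)])   # c(h), (8.3)
    r = c / c[0]                                                         # r(h), (8.4)
    coefs, sig2 = [], []
    for k in range(1, K + 1):
        # Solve R_k a_k = r_k with R_k = Toepl{1, r_1, ..., r_{k-1}} and r_k = -(r_1, ..., r_k)
        a_k = solve_toeplitz(r[:k], -r[1:k + 1])
        coefs.append(a_k)
        sig2.append(c[0] * (1.0 + np.dot(a_k, r[1:k + 1])))              # c(0)[1 - r_k' R_k^{-1} r_k]
    return coefs, np.array(sig2)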
Akaike's decision function is of the form

FPE_δ(k) = σ̂_k² (1 + δ k/N),     (8.11)

where k is the order being considered and δ is a positive constant. If p ≤ k, the value of δ determines the amount by which the use of a_k for α_k increases the FPE. Suppose that in estimating the prediction error we use the approximately unbiased estimator of σ²,

σ̃_k² = [N/(N − k)] σ̂_k²;     (8.12)

this leads to the choice δ = 2. The additional prediction error attributable to the estimation of α_k is approximately (k/N)σ². Thus the one-step-ahead prediction error is

σ̃_k² (1 + k/N) = (1 − k/N)^{−1} (1 + k/N) σ̂_k² = FPE_2(k) + O(N^{−1}).     (8.13)

The sequential application of FPE_δ(k), k = 1, 2, ..., K, will yield an estimated order p̂, where p̂ is the minimizer of {FPE_δ(k); k = 1, 2, ..., K}. However, the resulting AR model of order p̂ will not necessarily be the model which, among those employing p̂ of the first K lags, minimizes the FPE. That is, the estimated residual variance may be smaller for another p̂-subset of the K lags in which some intermediate lags are constrained to have zero coefficients. The first problem is to find the k-lag model with minimum residual variance, preferably without checking all of the K-choose-k possible subsets. Then we can apply the FPE criterion to the subsequence of best k-subsets to determine p̂.
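A minimal sketch of the order selection by (8.11), assuming the sequence of residual variances σ̂_k² has already been computed (for example, by the sketch in Section 8.1); names are illustrative.

import numpy as np

def fpe_order_selection(sig2, N, delta=2.0):
    """Given sigma_k^2 for k = 1..K and the series length N, compute
    FPE_delta(k) = sigma_k^2 * (1 + delta*k/N) and return the minimizing
    order together with the whole FPE sequence."""
    sig2 = np.asarray(sig2, dtype=float)
    k = np.arange(1, len(sig2) + 1)
    fpe = sig2 * (1.0 + delta * k / N)
    return int(k[np.argmin(fpe)]), fpe

# Example: p_hat, fpe_values = fpe_order_selection(sig2, N=len(series))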
8.2
Pagano’s Algorithm
Hocking and Leslie have described an algorithm which almost always avoids checking every k-subset of a K-variate regression model before finding the one with minimum residual variance. For the AR model, define θ_i as the increase in the residual variance, σ̂_K², when lag i is removed from the complete K-th order model. Let θ_{i_1} ≤ θ_{i_2} ≤ · · · ≤ θ_{i_K}. Suppose we wish to determine the k-lag model with minimum residual variance among those whose maximum lag does not exceed K. Let q = K − k, and consider any subset of lags {j_1, j_2, ..., j_q}, ordered according to the system {i_1, i_2, ..., i_K}. We remove the lags i_1, i_2, ..., i_q from the model and compare the resulting increase in residual variance with θ_{i_{q+1}}. If the increase does not exceed θ_{i_{q+1}}, the best k-subset consists of the lags complementing i_1, i_2, ..., i_q in {1, 2, ..., K}. If the increase exceeds θ_{i_{q+1}}, we find the q-subset of the lags i_1, i_2, ..., i_{q+1} which, upon removal from the model, results in the minimum increase in residual variance. If the minimum increase does not exceed θ_{i_{q+2}}, the set of complementary lags corresponding to the minimum increase is the k-subset having minimum residual variance among all of the K-choose-k possible subsets. If the minimum increase exceeds θ_{i_{q+2}}, we proceed with the (q + 2)-choose-q q-subsets of i_1, i_2, ..., i_{q+2}.
Note that

θ_i = σ̂_K² a_K²(i) / σ̂²_{a_K(i)},     (8.14)

where

σ̂²_{a_K(i)} = (Σ̂_K)_{ii} = [σ̂_K² / (N c(0))] {R_K^{−1}}_{ii},     (8.15)

the estimated variance of a_K(i). The form of R_K^{−1} is such that

{R_K^{−1}}_{ii} = [N c(0) / σ̂_K²] Σ_{j=1}^{i} [a_K²(j − 1) − a_K²(K + 1 − j)],     (8.16)

where a_K(0) = 1, so that

θ_i = σ̂_K² a_K²(i) / Σ_{j=1}^{i} [a_K²(j − 1) − a_K²(K + 1 − j)],     (8.17)

which is easily computed from Pagano's algorithm.
8.3
Computation of The Increase in The Residual Variance
Define the K × K matrix P_K = {p_{jk}} by

p_{jk} = 1 if k = K − j + 1, j = 1, 2, ..., K;   p_{jk} = 0 otherwise,     (8.18)

which implies that R_K = P_K R_K P_K. Now define a modification of R_K, written R_K(i_1, i_2, ..., i_q), whose (m, n)-th element is

r_{mn} = 0 if n = i_j, m ≠ i_j, or m = i_j, n ≠ i_j, j = 1, ..., q;
r_{mn} = 1 if n = m = i_j, j = 1, ..., q;     (8.19)
r_{mn} = r(n − m) otherwise.

Thus R_K(i_1, i_2, ..., i_q) is formed by placing ones on the diagonal and zeros in the off-diagonal positions of the i_1, i_2, ..., i_q rows and columns of R_K. If the true underlying model is an AR of maximum order k = K − q with α_{i_1} = · · · = α_{i_q} = 0, the estimator

a_K(i_1, ..., i_q) = R_K^{−1}(i_1, ..., i_q) r_K(i_1, ..., i_q)     (8.20)

has the same properties as a_K does when the model is no more than K-th order with no constrained coefficients. The residual variance for the reduced model is estimated by

σ̂_K²(i_1, ..., i_q) = c(0)[1 − r_K^T(i_1, ..., i_q) R_K^{−1}(i_1, ..., i_q) r_K(i_1, ..., i_q)].     (8.21)

Define the Cholesky decomposition

R_K = L_K D_K L_K^T.     (8.22)

Lemma.

σ̂_K²(i_1, ..., i_q) = c(0) d_{K+1}(j_1, ..., j_q),     (8.23)

where j_m = K + 1 − i_m, m = 1, 2, ..., q.

Proof. Note that

R_{K+1}(j_1, ..., j_q) =
[ R_K(j_1, ..., j_q)              P_K r_K(i_1, ..., i_q) ]
[ r_K^T(i_1, ..., i_q) P_K        1                      ]     (8.24)

implies

det[R_{K+1}(j_1, ..., j_q)] =     (8.25)

——–update from notes————
8.4
Conclusion
The main steps of the algorithm are:
1. Choose K, the maximum lag to be considered in the AR model (Akaike used K ≈ N/10), and perform the Cholesky decomposition of R_{K+1}.
2. Calculate a_K and σ̂_K² using Pagano's algorithm, and order the {θ_i, i = 1, ..., K} computed by (8.17).
3. For k = 1, 2, ..., K, find the best k-lag AR model (with maximum lag K) using the procedure of Sections 8.2 and 8.3.
4. Calculate {FPE_2(k + m); k = 1, ..., K}, where m is the number of estimated trend parameters, for the sequence of best k-lag models. Choose

p̂ = min_k {FPE_2(k + m)}.     (8.26)

This p̂-lag model has minimum estimated one-step-ahead prediction error among all AR models with maximum lag K. The subset autoregression algorithm is useful for determining model seasonality, for improving autoregressive spectral estimation, for determining the order of an autoregressive model, and even for making inferences about the suitability of an autoregressive model for a particular time series. Its speed makes it possible to find the best subset autoregression of each order.
∗∗∗∗∗
Appendix A
Data Analysis
Data on the consumption of oxygen have been considered for 31 Americans aged between 35 and 60 years. The variables involved in the experiment were Age, Weight, Oxygen, RunTime, RestPulse, RunPulse and MaxPulse. In this project we have used subset selection procedures, ridge estimation and nonnegative garrote estimation on these data.

Data for fitness of American people in a certain experiment
--------------------------------------------------------------------
Age  Weight  Oxygen  RunTime  RestPulse  RunPulse  MaxPulse
--------------------------------------------------------------------
44   89.47   44.609  11.37    62         178       182
40   75.07   45.313  10.07    62         185       185
44   85.84   54.297   8.65    45         156       168
42   68.15   59.571   8.17    40         166       172
38   89.02   49.874   9.22    55         178       180
47   77.45   44.811  11.63    58         176       176
40   75.98   45.681  11.95    70         176       180
43   81.19   49.091  10.85    64         162       170
44   81.42   39.442  13.08    63         174       176
38   81.87   60.055   8.63    48         170       186
44   73.03   50.541  10.13    45         168       168
45   87.66   37.388  14.03    56         186       192
45   66.45   44.754  11.12    51         176       176
47   79.15   47.273  10.60    47         162       164
54   83.12   51.855  10.33    50         166       170
49   81.42   49.156   8.95    44         180       185
51   69.63   40.836  10.95    57         168       172
51   77.91   46.672  10.00    48         162       168
48   91.63   46.774  10.25    48         162       164
49   73.37   50.388  10.08    67         168       168
57   73.37   39.407  12.63    58         174       176
54   79.38   46.080  11.17    62         156       165
52   76.32   45.441   9.63    48         164       166
50   70.87   54.625   8.92    48         146       155
51   67.25   45.118  11.08    48         172       172
54   91.63   39.203  12.88    44         168       172
51   73.71   45.790  10.47    59         186       188
57   59.08   50.545   9.93    49         148       155
49   76.32   48.673   9.40    56         186       188
48   61.24   47.920  11.50    52         170       176
52   82.78   47.467  10.50    53         170       172
--------------------------------------------------------------------
A.1
Correlation Coefficients

The CORR Procedure

7 Variables: Oxygen RunTime Age Weight RunPulse MaxPulse RestPulse

                              Simple Statistics
Variable      N        Mean     Std Dev          Sum     Minimum     Maximum
Oxygen       31    47.37581     5.32723         1469    37.38800    60.05500
RunTime      31    10.58613     1.38741    328.17000     8.17000    14.03000
Age          31    47.67742     5.21144         1478    38.00000    57.00000
Weight       31    77.44452     8.32857         2401    59.08000    91.63000
RunPulse     31   169.64516    10.25199         5259   146.00000   186.00000
MaxPulse     31   173.77419     9.16410         5387   155.00000   192.00000
RestPulse    31    53.45161     7.61944         1657    40.00000    70.00000
Pearson Correlation Coefficients, N = 31
Prob > |r| under H0: Rho=0

               Oxygen      RunTime      Age         Weight
Oxygen        1.00000     -0.86219     -0.30459