# ch09

#### Description

Chapter 9 Multiple Regression STAT 3022 School of Statistics, University of Minnesota

2014 spring

Introduction

Example: Consider the following study, designed to investigate how to elevate meadowfoam production to a profitable crop.

Explanatory variables: two light-related factors
- light intensity (150, 300, 450, 600, 750, and 900 µmol/m²/sec)
- timing of the onset of the light treatment (at PFI or 24 days before PFI)

Response variable: number of flowers per meadowfoam plant

Questions of interest:
- What are the effects of differing light intensity levels?
- What is the effect of the timing?
- Does the effect of intensity depend on the timing?

PFI = photoperiodic floral induction

Graphical Summary

[Figure: "Scatterplot of flowers vs. intensity". Horizontal axis: Light intensity; vertical axis: Number of flowers per plant.]

Graphical Summary

    > plot(Flowers ~ Intens, data = case0901,
    +      col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time),
    +      xlab = "Light intensity", ylab = "Number of flowers per plant",
    +      main = "Scatterplot of flowers vs. intensity")
    > legend(800, 75, c("Late", "Early"), col = c(1, 2), pch = c(1, 2))

[Figure: "Scatterplot of flowers vs. intensity", with points distinguished by timing (Late, Early). Horizontal axis: Light intensity; vertical axis: Number of flowers per plant.]

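Including “Time” Variable

The scatterplot on the previous slide suggests that two regression lines, one for the Late treatment (at PFI) and one for the Early treatment (24 days before PFI), might be appropriate.

    > # the right-hand sides of the two assignments are reconstructed;
    > # only the variable names survive in the original
    > data_late <- subset(case0901, Time == "Late", select = c(Flowers, Intens))
    > data_early <- subset(case0901, Time == "Early", select = c(Flowers, Intens))
    > head(data_late)
      Flowers Intens
    1    62.3    150
    2    77.4    150
    3    55.3    300
    4    54.2    300
    5    49.6    450
    6    61.9    450
    > head(data_early)
       Flowers Intens
    13    77.8    150
    14    75.6    150
    15    69.1    300
    16    78.0    300
    17    57.0    450
    18    71.1    450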

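Including “Time” Variable (2)

    > # the two lm() calls are reconstructed from the variable names
    > # and the coefficients printed below
    > m_late <- lm(Flowers ~ Intens, data = data_late)
    > m_early <- lm(Flowers ~ Intens, data = data_early)
    > m_late$coefficients
    (Intercept)      Intens
    71.62333349 -0.04107619
    > m_early$coefficients
    (Intercept)      Intens
    83.14666684 -0.03986667

Consider plotting both lines on the scatterplot from before.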

Two Regression Lines

    > plot(Flowers ~ Intens, data = case0901, xlab = "", ylab = "",
    +      col = as.numeric(case0901$Time), pch = as.numeric(case0901$Time))
    > abline(m_late, col = 1, lty = 1)
    > abline(m_early, col = 2, lty = 2)
    > legend(750, 80, col = c(1, 2), pch = c(1, 2), lty = c(1, 2),
    +        legend = c("Late", "Early"))

[Figure: scatterplot of flowers vs. intensity with the fitted Late (solid) and Early (dashed) regression lines.]

Summary

Fitting separate regression lines suggests that the lines are parallel, i.e., only their intercepts differ.

What if we wanted to include both variables, “Intensity” and “Time”, in our model?

$$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

where $X_1$ = intensity, and

$$X_2 = \begin{cases} 0, & \text{Late (at PFI)} \\ 1, & \text{Early (24 days before PFI)} \end{cases}$$

Q: Using this model, how would we interpret $\beta_2$?

A: $\beta_2$ is the difference in the intercepts, i.e., the vertical distance between the two regression lines.

R

    > m <- lm(Flowers ~ Intens + Time, data = case0901)
    > summary(m)

    Call:
    lm(formula = Flowers ~ Intens + Time, data = case0901)

    Residuals:
       Min     1Q Median     3Q    Max
    -9.652 -4.139 -1.558  5.632 12.165

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 71.305834   3.273772  21.781 6.77e-16 ***
    Intens      -0.040471   0.005132  -7.886 1.04e-07 ***
    TimeEarly   12.158333   2.629557   4.624 0.000146 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.441 on 21 degrees of freedom
    Multiple R-squared:  0.7992,  Adjusted R-squared:  0.78
    F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08

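R - Summary

[R code and figure not recoverable: only the leading identifier `cf` of the code and the horizontal-axis ticks (200 to 800) of the plot survive.]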

case0902: Why Do Some Mammals Have Large Brains?

Example: Brain size is an interesting variable for studying evolution. Bigger brains are not always better: they are associated with fewer offspring and longer pregnancies. After controlling for body size, what characteristics are associated with large brains?

Data: For 96 species of mammals, the data consist of average values for
- brain weight
- body weight
- gestation length (length of pregnancy)
- litter size

The Multiple Regression Model

Definition: The regression of $Y$ on $X_1$ and $X_2$, $\mu\{Y \mid X_1, X_2\}$, is an equation that describes the mean of $Y$ for particular values of $X_1$ and $X_2$.

Some examples of multiple linear regression models are

$$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

$$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$

$$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 \log(X_1) + \beta_2 \log(X_2)$$

$$\mu\{Y \mid X_1\} = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2$$

In multiple regression there is a single response $Y$ and two or more explanatory variables, $X_1, X_2, \ldots, X_p$. Note that the constant term $\beta_0$ is included in all models, unless a specific reason for excluding it exists.

Constant Variance

The ideal regression model assumes constant variation:

$$\mathrm{Var}\{Y \mid X_1, X_2, \ldots, X_p\} = \sigma^2$$

For the meadowfoam example, this means that the variation of points about the regression lines is the same for all values of light and time.

The constant variance assumption is important for two reasons:
- the regression interpretation is more straightforward when explanatory variables are only associated with the mean of the response distribution
- the assumption justifies the standard inferential tools

Regression Coefficients

Regression analysis involves:
- finding a model for the response mean that fits well
- wording the question of interest in terms of model parameters
- estimating the parameters from the available data
- employing appropriate inferential tools for answering the questions of interest

First we must discuss the meaning of regression coefficients. What questions can they help answer?

Regression Surfaces

Consider the multiple linear regression model with two explanatory variables

$$\mu\{Y \mid X_1, X_2\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

The model describes the regression surface as a plane, rather than a line. Imagine a 3-dimensional space with $Y$ as the vertical axis, $X_1$ as the horizontal axis, and $X_2$ as the “out-of-page” axis:
- $\beta_0$ = height of the plane when $X_1 = X_2 = 0$
- $\beta_1$ = slope of the plane as a function of $X_1$ for any fixed value of $X_2$
- $\beta_2$ = slope of the plane as a function of $X_2$ for any fixed value of $X_1$

Effects of Explanatory Variables

Definition: The effect of an explanatory variable is the change in the mean response that is associated with a one-unit increase in that variable while holding all other explanatory variables fixed.

Example: In the meadowfoam study,

$$\begin{aligned}
\text{light effect} &= \mu\{\text{flowers} \mid \text{light} + 1, \text{time}\} - \mu\{\text{flowers} \mid \text{light}, \text{time}\} \\
&= (\beta_0 + \beta_1(\text{light} + 1) + \beta_2\,\text{time}) - (\beta_0 + \beta_1\,\text{light} + \beta_2\,\text{time}) \\
&= \beta_1
\end{aligned}$$
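The definition of an effect can be checked numerically. The sketch below is illustrative only: the data frame name `meadowfoam` and the 0/1 coding of `early` are assumptions made here, with the values transcribed from the case0901 listing in this chapter. It fits the additive model and compares the change in fitted mean for a one-unit increase in intensity, at fixed timing, with the estimated coefficient.

```r
# Meadowfoam data, transcribed from the case0901 listing in this chapter.
# The name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions.
meadowfoam <- data.frame(
  Flowers = c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9,
              77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
  Intens  = rep(rep(c(150, 300, 450, 600, 750, 900), each = 2), times = 2),
  early   = rep(c(0, 1), each = 12)  # 0 = Late (at PFI), 1 = Early
)

fit <- lm(Flowers ~ Intens + early, data = meadowfoam)

# Fitted means at intensity 300 and 301, holding timing fixed at Late
base     <- data.frame(Intens = 300, early = 0)
plus_one <- data.frame(Intens = 301, early = 0)
light_effect <- unname(predict(fit, plus_one) - predict(fit, base))

light_effect           # change in fitted mean for a one-unit increase
coef(fit)[["Intens"]]  # estimated beta_1; the two values agree
```

Because the model has no product term, the same difference is obtained at any starting intensity and for either timing.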

Causal vs. Associative Effects

If the regression analysis involves results of a randomized experiment, the “effect” of an explanatory variable can be interpreted causally.

“A one-unit increase in light intensity causes the mean number of flowers to increase by $\beta_1$.”

For observational studies, interpretation is less straightforward.
- We cannot draw causal conclusions from statistical association.
- The $X$’s cannot be held fixed independently of one another because they were not controlled.

“For any subpopulation of mammal species with the same body weight and litter size, a one-day increase in the species’ gestation length is associated with an increase in mean brain weight of $\beta_2$ grams.” (read case0902)

Interpretation of Coefficients

The interpretation of $\beta_1$ in the model

$$\mu\{\text{brain} \mid \text{gestation}\} = \beta_0 + \beta_1\,\text{gestation}$$

differs from the interpretation of $\beta_1$ in the model

$$\mu\{\text{brain} \mid \text{gestation}, \text{body}\} = \beta_0 + \beta_1\,\text{gestation} + \beta_2\,\text{body}$$

- First model: $\beta_1$ measures the rate of change in mean brain weight with changes in gestation length in the population of all mammal species.
- Second model: $\beta_1$ measures the rate of change in mean brain weight with changes in gestation length within subpopulations of fixed body size.

Furthermore, the coefficients themselves will likely change depending on which $X$’s are included (unless the correlation between ‘gestation’ and ‘body’ is 0).
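The zero-correlation exception can be illustrated with the meadowfoam data rather than the mammal data, since the meadowfoam design is balanced: every intensity level appears equally often at each timing, so intensity and the timing indicator are uncorrelated. The sketch below is illustrative; the data frame name `meadowfoam` and the 0/1 coding of `early` are assumptions, with values transcribed from the case0901 listing in this chapter.

```r
# Meadowfoam data, transcribed from the case0901 listing in this chapter.
# The name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions.
meadowfoam <- data.frame(
  Flowers = c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9,
              77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
  Intens  = rep(rep(c(150, 300, 450, 600, 750, 900), each = 2), times = 2),
  early   = rep(c(0, 1), each = 12)  # 0 = Late (at PFI), 1 = Early
)

# Sample correlation between the two explanatory variables (zero by design)
r_x <- cor(meadowfoam$Intens, meadowfoam$early)

# Slope for intensity with and without the timing indicator in the model
slope_alone <- coef(lm(Flowers ~ Intens, data = meadowfoam))[["Intens"]]
slope_both  <- coef(lm(Flowers ~ Intens + early, data = meadowfoam))[["Intens"]]

c(correlation = r_x, slope_alone = slope_alone, slope_both = slope_both)
```

With observational data such as case0902 the explanatory variables are typically correlated, and the two slopes would generally differ.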

Mammals Brains

    > head(case0902)
                Species Brain  Body Gestation Litter
    1            Quokka 17.50 3.500        26    1.0
    2          Hedgehog  3.50 0.930        34    4.6
    3        Tree shrew  3.15 0.150        46    3.0
    4  Elephant shrew I  1.14 0.049        51    1.5
    5 Elephant shrew II  1.37 0.064        46    1.5
    6             Lemur 22.00 2.100       135    1.0
    > summary(case0902[, -1])
         Brain              Body             Gestation         Litter
     Min.   :   0.45   Min.   :   0.017   Min.   : 16.0   Min.   :1.00
     1st Qu.:  12.60   1st Qu.:   2.075   1st Qu.: 63.0   1st Qu.:1.00
     Median :  74.00   Median :   8.900   Median :133.5   Median :1.20
     Mean   : 218.98   Mean   : 108.328   Mean   :151.3   Mean   :2.31
     3rd Qu.: 260.00   3rd Qu.:  94.750   3rd Qu.:226.2   3rd Qu.:3.20
     Max.   :4480.00   Max.   :2800.000   Max.   :655.0   Max.   :8.00

We are interested in the regression of ‘Brain’ on ‘Body’, ‘Gestation’ and ‘Litter’.

A Matrix of Pairwise Scatterplots

Definition: A scatterplot matrix is a consolidation of all possible pairwise scatterplots from a set of variables.

The variable that determines each row is represented on the vertical axis of each scatterplot in that row, while the variable that determines each column is represented on the horizontal axis of each scatterplot in that column.

- Do any relationships appear to be linear?
- Which relationships are the strongest?
- Are there any outliers?

Typically, first compare the response to each explanatory variable.

Graphical Summary

    > pairs(case0902[, -1])

[Figure: scatterplot matrix of Brain, Body, Gestation, and Litter on their original scales.]

Scatterplots for Brain Weight Data

Consider the top row first.
- The plot of brain weight versus body weight is not helpful: the data are clustered in the bottom-left corner because of an outlier (African elephant).
- Mammals differ in size by orders of magnitude (differences are bigger for bigger mammals), so use a log transformation for brain and body weight.
- Notice that gestation and litter size are also positive variables whose observations become more spread out for larger values.

We will consider log transformations for all 4 variables.

Before Transformations

[Figure: four panels on the original scales: Brain Weight (g), Body Weight (kg), Gestation (days), and Litter Size.]

After Transformations

[Figure: four panels on the log scale: Brain Weight, Body Weight, Gestation, and Litter Size.]

Updated Scatterplot Matrix

[Figure: scatterplot matrix of Brain, Body, Gestation, and Litter after log transformation.]

- There is a pronounced relationship between log brain weight and each of the explanatory variables.
- Gestation and litter size are also related to body weight.
- Is there an association between gestation and brain weight, after accounting for the effect of body weight?
- Is there an association between litter size and brain weight, after accounting for the effect of body weight?

Scatterplots do not resolve these questions.

Next course of action: fit a regression model for log brain weight on log body weight, log gestation, and log litter size.

Regression Output

    > m <- lm(log(Brain) ~ log(Body) + log(Gestation) + log(Litter), data = case0902)
    > summary(m)

    Call:
    lm(formula = log(Brain) ~ log(Body) + log(Gestation) + log(Litter),
        data = case0902)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.95415 -0.29639 -0.03105  0.28111  1.57491

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)     0.85482    0.66167   1.292  0.19962
    log(Body)       0.57507    0.03259  17.647  < 2e-16 ***
    log(Gestation)  0.41794    0.14078   2.969  0.00381 **
    log(Litter)    -0.31007    0.11593  -2.675  0.00885 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.4748 on 92 degrees of freedom
    Multiple R-squared:  0.9537,  Adjusted R-squared:  0.9522
    F-statistic: 631.6 on 3 and 92 DF,  p-value: < 2.2e-16

Conclusions

Controlling for body weight and litter size, an increase in gestation length of one unit on the log scale is associated with an increase in mean log brain weight of 0.418. Controlling for body weight and gestation length, an increase in litter size of one unit on the log scale is associated with a decrease in mean log brain weight of 0.310. In the next chapter we will discuss inferential procedures for multiple regression in-depth.
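The log-scale coefficients above can also be read on the original scale. A minimal sketch, assuming the usual back-transformation for a log-log model (a doubling of an explanatory variable multiplies the median response by $2^{\beta}$, other variables held fixed); the variable names are illustrative and the estimates are copied from the regression output.

```r
# Estimates copied from the regression output for case0902
b_gestation <- 0.41794
b_litter    <- -0.31007

# Multiplicative change in median brain weight associated with doubling
# one explanatory variable while the others are held fixed
ratio_gestation <- 2^b_gestation
ratio_litter    <- 2^b_litter

round(c(gestation = ratio_gestation, litter = ratio_litter), 3)
```

Under this reading, doubling gestation length is associated with roughly a 34% increase in median brain weight, and doubling litter size with roughly a 19% decrease.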


Introduction

We can include specially constructed explanatory variables in order to exhibit
- curvature in the regression model
- interactive effects of explanatory variables
- effects of categorical variables

We accomplish these goals by including
- quadratic terms (e.g. $X_1^2$)
- product terms (e.g. $X_1 X_2$)
- indicator terms

$$X_3 = \begin{cases} 0, & \text{if group A} \\ 1, & \text{if group B} \end{cases}$$

A Squared Term for Curvature

Consider the scatterplot of yearly corn yield vs. rainfall (1890 to 1927) in six U.S. states:

[Figure: scatterplot of Yield (vertical axis) vs. Rainfall (horizontal axis).]

Incorporating Curvature

A straight-line regression model is not adequate. One model for incorporating curvature includes squared rainfall:

$$\mu\{\text{yield} \mid \text{rain}\} = \beta_0 + \beta_1\,\text{rain} + \beta_2\,\text{rain}^2$$

This allows the effect of rainfall to be different at different levels of rainfall:

$$\begin{aligned}
\mu\{\text{yield} \mid \text{rain} + 1\} - \mu\{\text{yield} \mid \text{rain}\}
&= (\beta_0 + \beta_1(\text{rain} + 1) + \beta_2(\text{rain} + 1)^2) - (\beta_0 + \beta_1\,\text{rain} + \beta_2\,\text{rain}^2) \\
&= \beta_1 + \beta_2(2 \times \text{rain} + 1)
\end{aligned}$$

As rainfall increases, its effect changes.

Squared Term in R

In R, there is no need to create a new variable to include a squared term:

    > m <- lm(Yield ~ Rainfall + I(Rainfall^2), data = ex0915)
    > summary(m)

    Call:
    lm(formula = Yield ~ Rainfall + I(Rainfall^2), data = ex0915)

    Residuals:
        Min      1Q  Median      3Q     Max
    -8.4642 -2.3236 -0.1265  3.5151  7.1597

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)   -5.01466   11.44158  -0.438  0.66387
    Rainfall       6.00428    2.03895   2.945  0.00571 **
    I(Rainfall^2) -0.22936    0.08864  -2.588  0.01397 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 3.763 on 35 degrees of freedom
    Multiple R-squared:  0.2967,  Adjusted R-squared:  0.2565
    F-statistic: 7.382 on 2 and 35 DF,  p-value: 0.002115

Plotting the Fitted Model

    > plot(ex0915$Yield ~ ex0915$Rainfall, pch = 16,
    +      xlab = "Rainfall (inches)", ylab = "Corn Yield (bu/acre)")
    > # the remaining commands, starting with `xx`, did not survive

[Figure: scatterplot of Corn Yield (bu/acre) vs. Rainfall (inches) with the fitted model.]

Interpretations

Effect of rainfall:
- An increase from 8 to 9 inches is associated with an increase in mean yield of 2.1 bushels of corn per acre.
- An increase from 14 to 15 inches is associated with a decrease in mean yield of 0.6 bushels of corn per acre.

Interpretation of individual coefficients is difficult and unnecessary.
- The fitted model suggests that increasing rainfall is associated with increasing yield only up to a point.
- In many situations, the squared term is just there to incorporate curvature; its coefficient need not be interpreted.
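The two rainfall effects quoted above follow from the formula $\beta_1 + \beta_2(2 \times \text{rain} + 1)$ evaluated at the fitted coefficients. A small sketch, with the estimates copied from the regression output; the helper `rain_effect` and the variable `peak_rain` are names introduced here for illustration. The rainfall level at which the fitted parabola peaks, $-\hat\beta_1 / (2\hat\beta_2)$, makes "only up to a point" concrete.

```r
# Estimates copied from the fitted quadratic model for ex0915
b1 <- 6.00428   # coefficient of Rainfall
b2 <- -0.22936  # coefficient of Rainfall^2

# Change in fitted mean yield when rainfall goes from `rain` to `rain + 1`
rain_effect <- function(rain) b1 + b2 * (2 * rain + 1)

effect_8_to_9   <- rain_effect(8)   # about  2.1 bushels per acre
effect_14_to_15 <- rain_effect(14)  # about -0.6 bushels per acre

# Rainfall at which the fitted quadratic reaches its maximum
peak_rain <- -b1 / (2 * b2)

round(c(effect_8_to_9, effect_14_to_15, peak_rain), 2)
```

The fitted peak is at roughly 13 inches of rainfall, which lies between the two intervals used in the interpretation.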

Distinguishing Between Groups

Definition: An indicator variable (or “dummy variable”) takes on one of two values:
- 1 indicates that an attribute is present
- 0 indicates that the attribute is absent

Example: In the meadowfoam study, consider the variable

$$\text{early} = \begin{cases} 1, & \text{if time} = 24 \\ 0, & \text{if time} = 0 \end{cases}$$

Indicators in R

In R, a variable that should be coded as a factor may be coded as numeric at first.

    > # the right-hand side of this assignment is reconstructed;
    > # only `case0901$early` survives in the original
    > case0901$early <- factor(ifelse(case0901$Time == "Early", 1, 0))
    > case0901
       Flowers  Time Intens early
    1     62.3  Late    150     0
    2     77.4  Late    150     0
    3     55.3  Late    300     0
    4     54.2  Late    300     0
    5     49.6  Late    450     0
    6     61.9  Late    450     0
    7     39.4  Late    600     0
    8     45.7  Late    600     0
    9     31.3  Late    750     0
    10    44.9  Late    750     0
    11    36.8  Late    900     0
    12    41.9  Late    900     0
    13    77.8 Early    150     1
    14    75.6 Early    150     1
    15    69.1 Early    300     1
    16    78.0 Early    300     1
    17    57.0 Early    450     1
    18    71.1 Early    450     1
    19    62.9 Early    600     1
    20    52.2 Early    600     1
    21    60.3 Early    750     1
    22    45.6 Early    750     1
    23    52.6 Early    900     1
    24    44.4 Early    900     1

Indicators in R (2)

    > summary(case0901)
        Flowers         Time        Intens    early
     Min.   :31.30   Late :12   Min.   :150   0:12
     1st Qu.:45.42   Early:12   1st Qu.:300   1:12
     Median :54.75              Median :525
     Mean   :56.14              Mean   :525
     3rd Qu.:64.45              3rd Qu.:750
     Max.   :78.00              Max.   :900

Note that “early” is a factor with 2 levels, not a numeric.

Modeling an Indicator

Consider the regression model

$$\mu\{\text{flowers} \mid \text{light}, \text{early}\} = \beta_0 + \beta_1\,\text{light} + \beta_2\,\text{early}$$

If time = 0, then early = 0, and the regression line is

$$\mu\{\text{flowers} \mid \text{light}, \text{early} = 0\} = \beta_0 + \beta_1\,\text{light}$$

If time = 24, then early = 1, and the regression line is

$$\mu\{\text{flowers} \mid \text{light}, \text{early} = 1\} = \beta_0 + \beta_1\,\text{light} + \beta_2$$

- The slope of both lines is $\beta_1$.
- The intercept for timing at PFI (“late”) is $\beta_0$.
- The intercept for timing 24 days prior to PFI (“early”) is $\beta_0 + \beta_2$.
- This is the “parallel lines model”.
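The two parallel lines can be read off a single fit. The following sketch is illustrative: the data frame name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions, with values transcribed from the case0901 listing in this chapter. It extracts the shared slope and the two intercepts from the indicator model.

```r
# Meadowfoam data, transcribed from the case0901 listing in this chapter.
# The name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions.
meadowfoam <- data.frame(
  Flowers = c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9,
              77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
  Intens  = rep(rep(c(150, 300, 450, 600, 750, 900), each = 2), times = 2),
  early   = rep(c(0, 1), each = 12)  # 0 = Late (at PFI), 1 = Early
)

fit_parallel <- lm(Flowers ~ Intens + early, data = meadowfoam)
b <- coef(fit_parallel)

shared_slope    <- b[["Intens"]]                     # beta_1, common to both lines
intercept_late  <- b[["(Intercept)"]]                # beta_0
intercept_early <- b[["(Intercept)"]] + b[["early"]] # beta_0 + beta_2

c(slope = shared_slope, late = intercept_late, early = intercept_early)
```

The vertical gap between the lines, `intercept_early - intercept_late`, is the estimate of $\beta_2$ at every light intensity.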

Fitting Parallel Lines Model

    > m_parallel <- lm(Flowers ~ Intens + early, data = case0901)
    > summary(m_parallel)

    Call:
    lm(formula = Flowers ~ Intens + early, data = case0901)

    Residuals:
       Min     1Q Median     3Q    Max
    -9.652 -4.139 -1.558  5.632 12.165

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 71.305834   3.273772  21.781 6.77e-16 ***
    Intens      -0.040471   0.005132  -7.886 1.04e-07 ***
    early1      12.158333   2.629557   4.624 0.000146 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.441 on 21 degrees of freedom
    Multiple R-squared:  0.7992,  Adjusted R-squared:  0.78
    F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08

Interpretations

This regression model states that:
- the mean number of flowers is a straight-line function of light intensity for both levels of timing.
- the slope of both lines is estimated to be $\hat\beta_1 = -0.0405$ flowers per plant per µmol/m²/sec.
- $\hat\beta_2 = 12.158$ means that the mean number of flowers with prior timing at 24 days exceeds the mean number of flowers with no prior timing by about 12.158 flowers per plant.

Sets of Indicator Variables

What if an explanatory variable has more than two categories?

Definition: When a categorical variable is used in regression it is called a factor, and the individual categories are called the levels of the factor. If there are $k$ levels, then $k - 1$ indicator variables are needed as explanatory variables.

Example: In the meadowfoam study, light intensity can be viewed as a categorical variable with 6 levels. How many indicator variables will be associated with this factor?
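One way to see the $k - 1$ rule in R is to inspect the design matrix that `model.matrix()` builds for a factor. This is a small sketch using only the six intensity levels from the study; the variable names `light` and `X` are illustrative, and R's default treatment contrasts are assumed.

```r
# The six intensity levels, four plants per level as in the meadowfoam study
intens <- rep(c(150, 300, 450, 600, 750, 900), each = 4)
light  <- factor(intens)

# Design matrix under R's default treatment contrasts:
# one intercept column plus one indicator per non-reference level
X <- model.matrix(~ light)

nlevels(light)  # number of levels, k
ncol(X) - 1     # number of indicator columns, k - 1
colnames(X)     # the first level (150) is absorbed into the intercept
```

The first level has no column of its own; it serves as the reference against which the other levels are compared.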

Factorizing Intensity

Since “Intens” is numeric, we need to create a new factor:

    > # the right-hand side of this assignment is reconstructed;
    > # only `case0901$light` survives in the original
    > case0901$light <- factor(case0901$Intens)
    > summary(case0901)
        Flowers         Time        Intens    early  light
     Min.   :31.30   Late :12   Min.   :150   0:12   150:4
     1st Qu.:45.42   Early:12   1st Qu.:300   1:12   300:4
     Median :54.75              Median :525          450:4
     Mean   :56.14              Mean   :525          600:4
     3rd Qu.:64.45              3rd Qu.:750          750:4
     Max.   :78.00              Max.   :900          900:4

Note that “light” is a factor with 6 levels.

Modeling k-Level Factor

With 6 levels, we can set the first level, 150 µmol/m²/sec, as the “reference level”. The multiple linear regression model is then:

$$\mu\{\text{flowers} \mid \text{light}, \text{early}\} = \beta_0 + \beta_1 L_{300} + \beta_2 L_{450} + \beta_3 L_{600} + \beta_4 L_{750} + \beta_5 L_{900} + \beta_6\,\text{early}$$

By “reference level”, we mean that when $L_{300} = L_{450} = L_{600} = L_{750} = L_{900} = 0$, the model gives the mean for “light” = 150.

Consider the regression output on the following slide. In practice we would treat light as a numerical variable, but we briefly treat it as a factor for the sake of illustration.

Fitting Model with Factors

    > m_light <- lm(Flowers ~ light + early, data = case0901)
    > summary(m_light)

    Call:
    lm(formula = Flowers ~ light + early, data = case0901)

    Residuals:
       Min     1Q Median     3Q    Max
    -8.979 -4.308 -1.342  5.204 10.204

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   67.196      3.629  18.518 1.05e-12 ***
    light300      -9.125      4.751  -1.921 0.071715 .
    light450     -13.375      4.751  -2.815 0.011919 *
    light600     -23.225      4.751  -4.888 0.000138 ***
    light750     -27.750      4.751  -5.841 1.97e-05 ***
    light900     -29.350      4.751  -6.178 1.01e-05 ***
    early1        12.158      2.743   4.432 0.000365 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.719 on 17 degrees of freedom
    Multiple R-squared:  0.8231,  Adjusted R-squared:  0.7606
    F-statistic: 13.18 on 6 and 17 DF,  p-value: 1.427e-05

Q: Can you use this model to estimate the flowers per meadowfoam plant for light intensity = 800?

A Product Term for Interaction

Definition: Two explanatory variables are said to interact if the effect of one of them depends on the value of the other.

In multiple regression, an explanatory variable for interaction is constructed as the product of the two explanatory variables thought to interact.

Example: Recall a question of interest from the meadowfoam study: Does the effect of light intensity on mean number of flowers depend on the timing of the light treatment? Answer this question by including a product term for interaction.

Interaction Model

In the meadowfoam study, consider the product variable light × early (where light is numeric, but early is a factor). Consider the model

$$\mu\{\text{flowers} \mid \text{light}, \text{early}\} = \beta_0 + \beta_1\,\text{light} + \beta_2\,\text{early} + \beta_3(\text{light} \times \text{early})$$

When early = 0, what is the slope? What is the intercept?
- slope = $\beta_1$, intercept = $\beta_0$

When early = 1, what is the slope? What is the intercept?
- slope = $\beta_1 + \beta_3$, intercept = $\beta_0 + \beta_2$

If $\beta_3 \neq 0$, then the model is not parallel lines.
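The slopes implied by the interaction model can be compared with the separate per-group fits. The sketch below is illustrative: the data frame name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions, with values transcribed from the case0901 listing in this chapter. It checks that $\hat\beta_1$ and $\hat\beta_1 + \hat\beta_3$ reproduce the Late and Early slopes.

```r
# Meadowfoam data, transcribed from the case0901 listing in this chapter.
# The name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions.
meadowfoam <- data.frame(
  Flowers = c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9,
              77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
  Intens  = rep(rep(c(150, 300, 450, 600, 750, 900), each = 2), times = 2),
  early   = rep(c(0, 1), each = 12)  # 0 = Late (at PFI), 1 = Early
)

# Intens * early expands to Intens + early + Intens:early
fit_int <- lm(Flowers ~ Intens * early, data = meadowfoam)
b <- coef(fit_int)

slope_late  <- b[["Intens"]]                        # beta_1
slope_early <- b[["Intens"]] + b[["Intens:early"]]  # beta_1 + beta_3

# Slopes from fitting each timing group on its own
sep_late  <- coef(lm(Flowers ~ Intens, data = subset(meadowfoam, early == 0)))[["Intens"]]
sep_early <- coef(lm(Flowers ~ Intens, data = subset(meadowfoam, early == 1)))[["Intens"]]

rbind(interaction_model = c(late = slope_late, early = slope_early),
      separate_fits     = c(late = sep_late,   early = sep_early))
```

The point estimates agree because the interaction model gives each group its own intercept and slope; only the pooled residual variance, and hence the standard errors, differ from the separate fits.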

Interaction in R

    > m_interaction <- lm(Flowers ~ Intens * early, data = case0901)
    > summary(m_interaction)

    Call:
    lm(formula = Flowers ~ Intens * early, data = case0901)

    Residuals:
       Min     1Q Median     3Q    Max
    -9.516 -4.276 -1.422  5.473 11.938

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)   71.623333   4.343305  16.491 4.14e-13 ***
    Intens        -0.041076   0.007435  -5.525 2.08e-05 ***
    early1        11.523333   6.142361   1.876   0.0753 .
    Intens:early1  0.001210   0.010515   0.115   0.9096
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.598 on 20 degrees of freedom
    Multiple R-squared:  0.7993,  Adjusted R-squared:  0.7692
    F-statistic: 26.55 on 3 and 20 DF,  p-value: 3.549e-07

Interpretations

Consider rearranging the model as

$$\mu\{\text{flowers} \mid \text{light}, \text{early}\} = (\beta_0 + \beta_2\,\text{early}) + (\beta_1 + \beta_3\,\text{early})\,\text{light}$$

Both the intercept and the slope depend on the timing.
- the effect of light intensity is $(\beta_1 + \beta_3\,\text{early})$
- the effect of timing is $(\beta_2 + \beta_3\,\text{light})$

So there are 3 different fitted models:
1. separate lines ($\beta_2 \neq 0$, $\beta_3 \neq 0$)
2. parallel lines ($\beta_2 \neq 0$, $\beta_3 = 0$)
3. equal lines ($\beta_2 = 0$, $\beta_3 = 0$)

Further Interpretations

It is often difficult to interpret individual coefficients in an interaction model.
- The coefficient of light, $\beta_1$, changes from being a “global slope” to being the slope when time = 0.
- The coefficient of the product term, $\beta_3$, is the difference between the slope when time = 24 and the slope when time = 0.

To test for the presence of an interaction effect, consider testing

$$H_0: \beta_3 = 0 \quad \text{vs.} \quad H_a: \beta_3 \neq 0$$

In the meadowfoam example, the p-value for this test is 0.9096. What can we conclude?
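The test of $H_0: \beta_3 = 0$ can equivalently be run as a comparison of the parallel-lines and interaction models. The sketch below is illustrative: the data frame name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions, with values transcribed from the case0901 listing in this chapter. It uses `anova()` for the extra-sum-of-squares F-test and compares its p-value with the t-test p-value for the product term.

```r
# Meadowfoam data, transcribed from the case0901 listing in this chapter.
# The name `meadowfoam` and the numeric 0/1 coding of `early` are assumptions.
meadowfoam <- data.frame(
  Flowers = c(62.3, 77.4, 55.3, 54.2, 49.6, 61.9, 39.4, 45.7, 31.3, 44.9, 36.8, 41.9,
              77.8, 75.6, 69.1, 78.0, 57.0, 71.1, 62.9, 52.2, 60.3, 45.6, 52.6, 44.4),
  Intens  = rep(rep(c(150, 300, 450, 600, 750, 900), each = 2), times = 2),
  early   = rep(c(0, 1), each = 12)  # 0 = Late (at PFI), 1 = Early
)

fit_parallel    <- lm(Flowers ~ Intens + early, data = meadowfoam)
fit_interaction <- lm(Flowers ~ Intens * early, data = meadowfoam)

# F-test comparing the reduced (parallel) and full (interaction) models
comparison <- anova(fit_parallel, fit_interaction)
p_f <- comparison[["Pr(>F)"]][2]

# t-test p-value for the product term in the full model
p_t <- summary(fit_interaction)$coefficients["Intens:early", "Pr(>|t|)"]

c(F_test = p_f, t_test = p_t)
```

With a single added term the two p-values coincide. A p-value this large gives no evidence that the effect of light intensity depends on timing, so the data are consistent with the parallel lines model.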

When to Include Interaction Terms

We do not routinely include interaction terms in regression models. We include them:
- when a question of interest pertains to an interaction (as in the meadowfoam study)
- when good reason exists to suspect interaction
- when interactions are proposed as a more general model for the purpose of examining the goodness of fit of a model without interaction (i.e., does the model with interaction terms fit better than the one without interaction terms?)

Also, if we include a product term in a model, we should also include the individual terms unless otherwise specified. If we have a light × time interaction, make sure both ‘light’ and ‘time’ are in the model.

Strategy for Data Analysis

After defining the questions of interest, reviewing the study design and model assumptions, and correcting any errors in the data:

1. Explore the data graphically.
   - Look for initial answers to questions.
   - Consider transformations.
   - Check outliers.
2. Formulate an inferential model.
   - Word questions of interest in terms of model parameters.
3. Check the model.
   - Check for nonconstant variance, outliers.
   - If appropriate, fit interactions or curvature.
   - See if extra terms can be dropped from model.
4. Infer the answers to the questions of interest, using appropriate inferential tools.
   - Confidence intervals and/or tests for regression coefficients.
   - Prediction intervals and/or confidence intervals for mean.

Then present your results, in the context of the problem.