Predictive Analytics Using Regression
PREDICTIVE ANALYTICS USING REGRESSION
Sumeet Gupta, Associate Professor, Indian Institute of Management Raipur
Outline • Basic Concepts • Applications of Predictive Modeling • Linear Regression in One Variable using OLS • Multiple Linear Regression • Assumptions in Regression • Explanatory Vs Predictive Modeling • Performance Evaluation of Predictive Models • Practical Exercises • Case: Nils Baker • Case: Pedigree Vs Grit
BASIC CONCEPTS
Predictive Modeling: Applications • Predicting customer activity on credit cards from their
demographic and historical activity patterns • Predicting the time to failure of equipment based on utilization and environmental conditions • Predicting expenditures on vacation travel based on historical frequent flyer data • Predicting staffing requirements at help desks based on historical data and product and sales information • Predicting sales from cross-selling of products from historical information • Predicting the impact of discounts on sales in retail outlets
Basic Concept: Relationships Examples of relationships: • Sales and earnings • Cost and number produced • Microsoft and the stock market • Effort and results
• Scatterplot • A picture to explore the relationship in bivariate data • Correlation r • Measures strength of the relationship (from –1 to 1) • Regression • Predicting one variable from the other
Basic Concept: Correlation • r = 1 • A perfect straight line tilting up to the right • r = 0 • No overall tilt • No relationship? • r = – 1 • A perfect straight line tilting down to the right
Basic Concepts: Simple Linear Model • Linear Model for the Population • The foundation for statistical inference in regression • Observed Y is a straight line, plus randomness
Y = α + βX + ε, where α + βX is the population relationship (on average) and ε is the randomness of individuals
Basic Concepts: Simple Linear Model • Time Spent vs. Internet Pages Viewed • Two measures of the abilities of 25 Internet sites • At the top right are eBay, Yahoo!, and MSN
• Correlation is r = 0.964
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
• Very strong positive association (since r is close to 1)
[Scatterplot: Minutes per person vs. Pages per person, with eBay, Yahoo!, and MSN at the top right]
Basic Concepts: Simple Linear Model • Dollars vs. Deals • For mergers and acquisitions by investment bankers • 244 deals worth $756 billion by Goldman Sachs
• Correlation is r = 0.419
• Positive association
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
[Scatterplot: Dollars (billions) vs. Deals]
Basic Concepts: Simple Linear Model • Interest Rate vs. Loan Fee • For mortgages • If the interest rate is lower, does the bank make it up with a higher loan
fee?
• Correlation is r = –0.890
• Linear relationship: straight line with scatter
• Decreasing relationship: tilts down and to the right
• Strong negative association
[Scatterplot: Interest rate vs. Loan fee]
Basic Concepts: Simple Linear Model • Today’s vs. Yesterday’s Percent Change • Is there momentum? • If the market was up yesterday, is it more likely to be up today? Or is
each day’s performance independent?
• Correlation is r = 0.11
• No relationship? Tilt is neither up nor down
• A weak relationship?
[Scatterplot: Today's change vs. Yesterday's change]
Basic Concepts: Simple Linear Model • Call Price vs. Strike Price • For stock options • “Call Price” is the price of the option contract to buy stock at the
“Strike Price”
• The right to buy at a lower strike price has more value
• A nonlinear relationship: not a straight line, but a curved relationship
• Correlation r = –0.895
• A negative relationship: higher strike price goes with lower call price
[Scatterplot: Call Price vs. Strike Price]
Basic Concepts: Simple Linear Model • Output Yield vs. Temperature • For an industrial process • With a “best” optimal temperature setting
• A nonlinear relationship: not a straight line, but a curved relationship
• Correlation r = –0.0155: r suggests no relationship
• But the relationship is strong; it tilts neither up nor down
[Scatterplot: Yield of process vs. Temperature]
Basic Concepts: Simple Linear Model • Circuit Miles vs. Investment (lower left) • For telecommunications firms • A relationship with unequal variability • More vertical variation at the right than at the left • Variability is stabilized by taking logarithms (lower right)
• Correlation r = 0.820 for the raw data; r = 0.957 after taking logs
[Scatterplots: Circuit miles (millions) vs. Investment ($millions), and Log of miles vs. Log of investment]
Basic Concepts: Simple Linear Model • Price vs. Coupon Payment • For trading in the bond market • Bonds paying a higher coupon generally cost more
• Two clusters are visible
• Ordinary bonds (value is from coupon)
• Inflation-indexed bonds (payout rises with inflation)
• Correlation r = 0.994 for all bonds
• Correlation r = 0.950 for ordinary bonds only
[Scatterplot: Bid price vs. Coupon rate]
Basic Concepts: Simple Linear Model • Cost vs. Number Produced • For a production facility • It usually costs more to produce more
• An outlier is visible: a disaster (a fire at the factory) with high cost, but few produced
• With the outlier included, correlation r = –0.623
• With the outlier removed, more detail is visible and r = 0.869
[Scatterplots: Cost vs. Number produced, with and without the outlier]
Basic Concepts: OLS Modeling • Salary vs. Years Experience • For n = 6 employees • Linear (straight line) relationship • Increasing relationship • higher salary generally goes with higher experience
Experience: 15, 10, 20, 5, 15, 5
Salary ($thousand): 30, 35, 55, 22, 40, 27
• Correlation r = 0.8667
[Scatterplot: Salary ($thousand) vs. Experience]
Basic Concepts: OLS Modeling • Summarizes bivariate data: Predicts Y from X • with smallest errors (in vertical direction, for Y axis) • Intercept is 15.32 salary (at 0 years of experience) • Slope is 1.673 salary (for each additional year of experience, on average)
[Plot: least-squares line for Salary (Y) vs. Experience (X)]
Basic Concepts: OLS Modeling • Predicted Value comes from Least-Squares Line • For example, Mary (with 20 years of experience) has predicted salary 15.32+1.673(20) = 48.8 • So does anyone with 20 years of experience
• Residual is actual Y minus predicted Y • Mary’s residual is 55 – 48.8 = 6.2 • She earns about $6,200 more than the predicted salary for a person
with 20 years of experience • A person who earns less than predicted will have a negative residual
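As an illustrative sketch, NumPy's polyfit reproduces the least-squares intercept, slope, and Mary's residual for the six-employee data:

```python
import numpy as np

# Six-employee example from the slides
experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)   # X
salary = np.array([30, 35, 55, 22, 40, 27], dtype=float)     # Y, in $thousand

# Least-squares slope and intercept (OLS in one variable)
slope, intercept = np.polyfit(experience, salary, deg=1)
print(f"intercept a = {intercept:.2f}, slope b = {slope:.3f}")   # ~15.32 and ~1.673

# Predicted value and residual for Mary (20 years of experience, salary 55)
predicted_mary = intercept + slope * 20
residual_mary = 55 - predicted_mary
print(f"predicted = {predicted_mary:.1f}, residual = {residual_mary:.1f}")  # ~48.8 and ~6.2
```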
Basic Concepts: OLS Modeling
[Plot: Salary vs. Experience with the least-squares line; Mary earns 55 thousand, her predicted value is 48.8, so her residual is 6.2]
Basic Concepts: OLS Modeling • Standard Error of Estimate
Se = SY √[(1 − r²)(n − 1)/(n − 2)]
• Approximate size of prediction errors (residuals): actual Y minus predicted Y, i.e., Y − [a + bX]
• Example (Salary vs. Experience):
Se = 11.686 × √[(1 − 0.8667²)(6 − 1)/(6 − 2)] = 6.52
Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
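A minimal sketch recomputing Se from SY, r, and n for the same salary data:

```python
import numpy as np

experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)
salary = np.array([30, 35, 55, 22, 40, 27], dtype=float)

n = len(salary)
s_y = salary.std(ddof=1)                      # sample standard deviation of Y (~11.686)
r = np.corrcoef(experience, salary)[0, 1]     # correlation (~0.8667)

# Standard error of estimate: Se = SY * sqrt((1 - r^2)(n - 1)/(n - 2))
se = s_y * np.sqrt((1 - r**2) * (n - 1) / (n - 2))
print(f"Se = {se:.2f}")                       # ~6.52, i.e. about $6,520
```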
Basic Concepts: OLS Modeling • Interpretation: similar to standard deviation • Can move Least-Squares Line up and down by Se • About 68% of the data are within one “standard error of estimate” of the least-squares line • (For a bivariate normal distribution)
Multiple Linear Regression • Linear Model for the Population Y = (α + β1 X1 + β2 X2 + … + βk Xk) + ε = (Population relationship) + Randomness • Where ε has a normal distribution with mean 0 and constant
standard deviation σ, and this randomness is independent from one case to another • An assumption needed for statistical inference
Multiple Linear Regression: Results • Intercept: a • Predicted value for Y when every X is 0 • Regression Coefficients: b1, b2, …bk • The effect of each X on Y, holding all other X variables constant • Prediction Equation or Regression Equation (Predicted Y) = a+b1 X1+b2 X2+…+bk Xk • The predicted Y, given the values for all X variables • Prediction Errors or Residuals (Actual Y) – (Predicted Y)
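An illustrative sketch of fitting such a model with statsmodels; the data and variable names here are simulated, not the magazine data discussed below:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated illustration (invented variable names, not the magazine data)
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 4.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + 0.5 * df["x3"] + rng.normal(scale=1.0, size=n)

X = sm.add_constant(df[["x1", "x2", "x3"]])   # adds the intercept term a
model = sm.OLS(df["y"], X).fit()

print(model.params)        # intercept a and coefficients b1, b2, b3
print(model.fittedvalues)  # predicted Y for each case
print(model.resid)         # residuals: actual Y minus predicted Y
```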
Multiple Linear Regression: Results • t Tests for Individual Regression Coefficients • Significant or not significant, for each X variable • Tests whether a particular X variable has an effect on Y, holding the other X variables constant • Should be performed only if the F test is significant • Standard Errors of the Regression Coefficients
Sb1, Sb2, …, Sbk
(with n – k – 1 degrees of freedom)
• Indicates the estimated sampling standard deviation of each
regression coefficient • Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients
Multiple Linear Regression: Results • Predicted Page Costs for Audubon = a + b1 X1 + b2 X2 + b3 X3 = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income) = $4,043 + 3.79(1,645) – 124(51.1) + 0.903(38,787)
= $38,966 • Actual Page Costs are $25,315 • Residual is $25,315 – 38,966 = –$13,651 • Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
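A few lines reproduce the arithmetic, using the coefficients and values quoted above:

```python
# Plugging Audubon's values into the prediction equation from the slide
a, b_audience, b_male, b_income = 4043, 3.79, -124, 0.903

predicted = a + b_audience * 1645 + b_male * 51.1 + b_income * 38787
residual = 25315 - predicted           # actual minus predicted

print(round(predicted))                # ~38,966
print(round(residual))                 # ~ -13,651
```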
Standard Error • Standard Error of Estimate Se • Indicates the approximate size of the prediction errors • About how far are the Y values from their predictions? • For the magazine data • Se = $21,578 • Actual Page Costs are about $21,578 from their predictions for this
group of magazines (using regression) • Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression) • Using the regression equation to predict Page Costs (instead of simply using the average, Ȳ), the typical error is reduced from $45,446 to $21,578
Coeff. of Determination The strength of association is measured by the square of the multiple correlation coefficient, R2, which is also called the coefficient of multiple determination.
R² = SSreg / SSy
R² is adjusted for the number of independent variables and the sample size by using the following formula:
Adjusted R² = R² − k(1 − R²)/(n − k − 1)
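A small illustrative helper applying these two formulas; the adjusted value shown is derived from the R², n, and k reported for the magazine example later in the slides:

```python
def r_squared(ss_reg, ss_y):
    """Coefficient of determination: share of the variation in Y explained by the X variables."""
    return ss_reg / ss_y

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = R^2 - k(1 - R^2) / (n - k - 1)."""
    return r2 - k * (1 - r2) / (n - k - 1)

# Magazine example from the slides: R^2 = 0.787 with n = 55 magazines and k = 3 predictors
print(round(adjusted_r_squared(0.787, n=55, k=3), 3))   # ~0.774
```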
Coeff. of Determination • Coefficient of Determination R2 • Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables • How well do the X variables explain Y? • For the magazine data • R2 = 0.787 = 78.7% • The X variables (Audience, Percent Male, and Median Income) taken
together explain 78.7% of the variance of Page Costs • This leaves 100% – 78.7% = 21.3% of the variation in Page Costs unexplained
The F test • Is the regression significant? • Do the X variables, taken together, explain a significant amount of the variation in Y? • The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0 H0: β1 = β2 = … = βk = 0 • The research hypothesis claims that, in the population, at least
one of the X variables does help explain Y H1: At least one of β1, β2, …, βk ≠ 0
The F test
H0: R²pop = 0, which is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:
F = (SSreg / k) / (SSres / (n − k − 1)) = (R² / k) / ((1 − R²) / (n − k − 1))
which has an F distribution with k and (n − k − 1) degrees of freedom.
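The same F statistic can be computed from R² alone, as in this sketch (SciPy is assumed available for the p-value):

```python
from scipy import stats

def f_statistic(r2, n, k):
    """Overall F statistic from R^2, with k and (n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Magazine example from the slides: R^2 = 0.787, n = 55, k = 3
f = f_statistic(0.787, n=55, k=3)
print(round(f, 1))              # ~62.8, consistent with the F = 62.84 reported later

# p-value from the F distribution with 3 and 51 degrees of freedom
print(stats.f.sf(f, 3, 51))     # very small, i.e. highly significant
```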
Performing the F test • Three equivalent methods for performing F test; they
always give the same result • Use the p-value • If p < 0.05, then the test is significant • Same interpretation as p-values in Chapter 10
• Use the R2 value • If R2 is larger than the value in the R2 table, then the result is significant • Do the X variables explain more than just randomness?
• Use the F statistic • If the F statistic is larger than the value in the F table, then the result is
significant
Example: F test • For the magazine data, The X variables (Audience, Percent Male, and Median Income) explain a very highly significant
percentage of the variation in Page Costs • The p-value, listed as 0.000, is less than 0.0005, and is therefore
very highly significant (since it is less than 0.001) • The R2 value, 78.7%, is greater than 27.1% (from the R2 table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant • The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant
t Tests • A t test for each regression coefficient • To be used only if the F test is significant • If F is not significant, you should not look at the t tests
• Does the jth X variable have a significant effect on Y, holding the
other X variables constant? • Hypotheses are H0: βj = 0, H1: βj ≠ 0 • Test using the confidence interval
bj ± t × Sbj
• use the t table with n – k – 1 degrees of freedom
• Or use the t statistic
t statistic = bj / Sbj
• compare to the t table value with n – k – 1 degrees of freedom
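An illustrative sketch of the t statistic and confidence interval for a single coefficient; the standard error used in the example call is back-derived from the b1 and t values on the next slide, since it is not listed directly:

```python
from scipy import stats

def coefficient_t_test(b, s_b, n, k, alpha=0.05):
    """t statistic, two-sided p-value, and confidence interval for one coefficient (df = n - k - 1)."""
    df = n - k - 1
    t_stat = b / s_b
    p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-sided test of H0: beta_j = 0
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci = (b - t_crit * s_b, b + t_crit * s_b)      # b_j +/- t * S_bj
    return t_stat, p_value, ci

# Audience coefficient from the magazine example: b1 = 3.79, t = 13.5, n = 55, k = 3
# (S_b1 below is back-derived as b1 / t for illustration)
print(coefficient_t_test(b=3.79, s_b=3.79 / 13.5, n=55, k=3))
```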
Example: t Tests • Testing b1, the coefficient for Audience b1 = 3.79, t = 13.5, p = 0.000 • Audience has a very highly significant effect on Page Costs, after
adjusting for Percent Male and Median Income
• Testing b2, the coefficient for Percent Male b2 = – 124, t = – 0.90, p = 0.374 • Percent Male does not have a significant effect on Page Costs, after
adjusting for Audience and Median Income
• Testing b3, the coefficient for Median Income b3 = 0.903, t = 2.44, p = 0.018 • Median Income has a significant effect on Page Costs, after adjusting
for Audience and Percent Male
Assumptions in Regression • Assumptions underlying the statistical techniques
should be tested twice • First for the separate variables • Second for the multivariate model variate, which acts
collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables • This differs across multivariate techniques
Assumptions in Regression • Linearity • The independent variable has a linear relationship with the dependent variable • Normality • The residuals or the dependent variable follow a normal distribution • Multicollinearity • When some X variables are too similar to one another • Homoskedasticity • The variability in Y values for a given set of predictors is the same regardless of the values of the predictors • Independence among cases (Absence of correlated errors) • The cases are independent of each other
Assumptions in Regression Normality • The residuals or the dependent variable follow a normal
distribution • If the variation from normality is sufficiently large, the resulting statistical tests are invalid • Graphical Analysis • Histogram and normal probability plot • Peaked and skewed distributions result in non-normality
• Statistical Analysis
• If the Z value exceeds the critical value, then the distribution is non-normal
• Kolmogorov–Smirnov test; Shapiro–Wilk test
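A sketch of both tests with SciPy; the residuals here are simulated for illustration, and in practice you would pass the residuals of your fitted regression (e.g., model.resid from statsmodels):

```python
import numpy as np
from scipy import stats

# Simulated residuals for illustration only
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk test: null hypothesis is that the data come from a normal distribution
w_stat, p_shapiro = stats.shapiro(residuals)

# Kolmogorov-Smirnov test against a standard normal, after standardizing the residuals
z_scores = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, p_ks = stats.kstest(z_scores, "norm")

print(p_shapiro, p_ks)   # small p-values (e.g. < 0.05) suggest non-normality
```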
Assumptions in Regression Normality
Assumptions in Regression Homoskedasticity • Assumption related primarily to dependence
relationships between variables • Assumption that the dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s). • The variance of the dependent variable should not be concentrated in only a limited range of the independent values • Source • Type of variable • Skewed distribution
Assumptions in Regression Homoskedasticity • Graphical Analysis • Analysis of residuals in case of Regression • Statistical Analysis • Variances within groups formed by non-metric variables • Levene Test • Box’s M Test • Remedy • Data Transformation
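A rough sketch of both checks on simulated data: a residuals-versus-fitted plot, and a Levene test comparing residual variances across an illustrative low/high split of the fitted values:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated data for illustration; in practice use the residuals and fitted values of your regression
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Graphical check: residuals vs. fitted values should form a constant band, not a cone
plt.scatter(fitted, residuals)
plt.axhline(0.0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Levene test across two illustrative groups; a small p-value suggests unequal variances
groups = np.where(fitted < np.median(fitted), "low", "high")
stat, p_value = stats.levene(residuals[groups == "low"], residuals[groups == "high"])
print(p_value)
```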
Assumptions in Regression Homoskedasticity • Graphical Analysis
Assumptions in Regression Linearity • Assumption for all multivariate techniques based on
correlational measures such as • multiple regression, • logistic regression, • factor analysis, and • structural equation modeling
• Correlation represents only the linear association
between variables • Identification
• Scatterplots or examination of residuals using regression
• Remedy • Data Transformations
Assumptions in Regression Linearity
Assumptions in Regression Absence of Correlated Errors • Prediction errors should not be correlated with each
other • Identification • Most possible cause is the data collection process, such as
two separate groups in the data collection process • Remedy • Including the omitted causal factor into the multivariate analysis
Assumptions in Regression Multicollinearity • Multicollinearity arises when intercorrelations among the predictors
are very high. • Multicollinearity can result in several problems, including: • The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high. • The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample. • It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable. • Predictor variables may be incorrectly included or removed in stepwise regression.
Assumptions in Regression Multicollinearity • The ability of an independent variable to improve the prediction of the
dependent variable is related not only to its correlation to the dependent variable, but also to the correlation(s) of the additional independent variable to the independent variable(s) already in the regression equation • Collinearity is the association, measured as the correlation,
between two independent variables • Multicollinearity refers to the correlation among three or more independent variables • Impact • Reduces any single IV's predictive power by the extent to which it is
associated with the other independent variables
Assumptions in Regression Multicollinearity • Measuring Multicollinearity • Tolerance • Amount of variability of the selected independent variable not explained
by the other independent variables • Tolerance Values should be high • Cut-off is 0.1 but greater than 0.5 gives better results • VIF • Inverse of Tolerance • Should be low (typically below 2.0 and usually below 10)
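A sketch of computing Tolerance and VIF with statsmodels; the predictors below are simulated, with x2 deliberately correlated with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Simulated predictors; replace with your own X variables
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.3, size=100),   # deliberately correlated with x1
    "x3": rng.normal(size=100),
})

X_const = add_constant(X)   # compute VIF with the intercept included
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```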
Assumptions in Regression Multicollinearity • Remedy for Multicollinearity • A simple procedure for adjusting for multicollinearity consists of using only one of the variables in a highly correlated set of variables. • Omit highly correlated independent variables and identify other independent variables to help the prediction • Alternatively, the set of independent variables can be transformed into a new set of predictors that are mutually independent by using techniques such as principal components analysis. • More specialized techniques, such as ridge regression and latent root regression, can also be used.
Assumptions in Regression Data Transformations • To correct violations of the statistical assumptions
underlying the multivariate techniques • To improve the relationship between variables • Transformation to achieve Normality and Homoscedasticity • Flat Distribution – Inverse transformation • Negatively Skewed Distribution – Square Root
Transformation • Positively Skewed Distribution – Logarithmic Transformation • If the residuals in regression are cone shaped then • Cone opens to right – Inverse transformation • Cone opens to left – Square root transformation
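A minimal sketch of these transformations in NumPy; the data are simulated, and the appropriate transformation depends on the shape of your own distribution:

```python
import numpy as np

# Simulated, positively skewed variable for illustration
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

log_x = np.log(x)      # logarithmic transformation: suggested for positively skewed distributions
sqrt_x = np.sqrt(x)    # square-root transformation: suggested for negatively skewed distributions
inv_x = 1.0 / x        # inverse transformation: suggested for flat distributions

# All three require strictly positive values; shift the variable first if needed,
# e.g. np.log(x - x.min() + 1)
```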
Assumptions in Regression Data Transformations • Transformation to achieve Linearity
Assumptions in Regression Data Transformations
Assumptions in Regression • General guidelines for transformation • For a noticeable effect of transformation, the ratio of a variable’s mean to its standard deviation should be less than 4.0 • When the transformation can be performed on either of two variables, select the one with the smaller mean/sd ratio • Transformations should be applied to the independent variables, except in the case of heteroscedasticity • Heteroscedasticity can only be remedied by transforming the dependent variable in a dependence relationship • If the heteroscedastic relationship is also non-linear, the dependent variable, and perhaps the independent variables, must be transformed • Transformations may change the interpretation of the variables
Issues in Regression • Variable Selection • How to choose from a long list of X variables? • Too many: waste the information in the data • Too few: risk ignoring useful predictive information
• Model Misspecification • Perhaps the multiple regression linear model is wrong • Unequal variability? Nonlinearity? Interaction?
EXPLANATORY VS PREDICTIVE MODELING
Explanatory Vs Predictive Modeling • An explanatory model fits the data closely, whereas a good
predictive model predicts new cases accurately • Explanatory models use the entire dataset to estimate the best-fit model and maximize explained variance (R2). Predictive models estimate the model on a training set and assess it on new, unobserved data • Performance measures for explanatory models measure how closely the data fit the model, whereas for predictive models performance is measured by predictive accuracy
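An illustrative sketch of the contrast using scikit-learn on simulated data: the explanatory fit uses all cases and reports in-sample R², while the predictive fit is estimated on a training set and assessed on held-out cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Simulated data; replace with your own X matrix and y vector
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Explanatory use: fit on all the data and report in-sample R^2
explanatory = LinearRegression().fit(X, y)
print("in-sample R^2:", r2_score(y, explanatory.predict(X)))

# Predictive use: estimate on a training set, assess on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
predictive = LinearRegression().fit(X_train, y_train)
print("holdout R^2:", r2_score(y_test, predictive.predict(X_test)))
```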
Performance Evaluation • Prediction Error for observation ‘i’= Actual y value –
predicted y value • Popular numerical measures of predictive accuracy • MAE or MAD (Mean absolute error / deviation)
• Average Error
• MAPE (Mean Absolute Percentage Error)
Performance Evaluation • RMSE (Root mean squared error)
• Total SSE (total sum of squared errors)
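These measures can be computed directly from actual and predicted values, as in this illustrative sketch (the numbers in the example call are made up):

```python
import numpy as np

def prediction_metrics(actual, predicted):
    """Common numerical measures of predictive accuracy."""
    errors = np.asarray(actual) - np.asarray(predicted)     # actual y minus predicted y
    return {
        "MAE/MAD": np.mean(np.abs(errors)),
        "Average Error": np.mean(errors),
        "MAPE (%)": 100 * np.mean(np.abs(errors / np.asarray(actual))),
        "RMSE": np.sqrt(np.mean(errors ** 2)),
        "Total SSE": np.sum(errors ** 2),
    }

# Tiny made-up example
print(prediction_metrics(actual=[30, 35, 55, 22], predicted=[32, 33, 50, 25]))
```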
CASE
Case: Pedigree Vs Grit
• Why does a low R2 not make the regression useless?
• Describe a situation in which a useless regression has a high R2.
• Check the validity of the linear regression model assumptions.
• Estimate the excess returns of Bob’s and Putney’s funds. Between them, who is expected to obtain higher returns at their current funds and by how much?
• If hired by the firm, who is expected to obtain higher returns and by how much?
• Can you prove at the 5% level of significance that Bob would get higher expected returns if he had attended Princeton instead of Ohio State?
• Can you prove at the 10% level of significance that Bob would get at least 1% higher expected returns by managing a growth fund?
• Is there strong evidence that fund managers with an MBA perform worse than fund managers without an MBA? What is held constant in this comparison?
• Based on your analysis of the case, which candidate do you support for AMBTPM’s job opening: Bob or Putney? Discuss.
Case: Nils Baker • Is the presence of a physical Bank Branch creating
demand for checking accounts?
Thank You