Predictive Analytics Using Regression
PREDICTIVE ANALYTICS USING REGRESSION
Sumeet Gupta, Associate Professor, Indian Institute of Management Raipur
Outline • Basic Concepts • Applications of Predictive Modeling • Linear Regression in One Variable using OLS • Multiple Linear Regression • Assumptions in Regression • Explanatory Vs Predictive Modeling • Performance Evaluation of Predictive Models • Practical Exercises • Case: Nils Baker • Case: Pedigree Vs Grit
BASIC CONCEPTS
Predictive Modeling: Applications • Predicting customer activity on credit cards from their
demographic and historical activity patterns • Predicting the time to failure of equipment based on utilization and environmental conditions • Predicting expenditures on vacation travel based on historical frequent flyer data • Predicting staffing requirements at help desks based on historical data and product and sales information • Predicting sales from cross-selling of products from historical information • Predicting the impact of discounts on sales in retail outlets
Basic Concept: Relationships Examples of relationships: • Sales and earnings • Cost and number produced • Microsoft and the stock market • Effort and results
• Scatterplot • A picture to explore the relationship in bivariate data • Correlation r • Measures strength of the relationship (from –1 to 1) • Regression • Predicting one variable from the other
Basic Concept: Correlation • r = 1 • A perfect straight line tilting up to the right • r = 0 • No overall tilt • No relationship? • r = – 1 • A perfect straight line tilting down to the right
Basic Concepts: Simple Linear Model • Linear Model for the Population • The foundation for statistical inference in regression • Observed Y is a straight line, plus randomness
Y = α + βX + ε, where α + βX is the population relationship (on average) and ε is the randomness of individuals
Basic Concepts: Simple Linear Model • Time Spent vs. Internet Pages Viewed • Two measures of the abilities of 25 Internet sites • At the top right are eBay, Yahoo!, and MSN
• Correlation is r = 0.964
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
• Very strong positive association (since r is close to 1)
[Scatterplot: Minutes per person vs. Pages per person, with eBay, Yahoo!, and MSN at the top right]
Basic Concepts: Simple Linear Model • Dollars vs. Deals • For mergers and acquisitions by investment bankers • 244 deals worth $756 billion by Goldman Sachs
• Correlation is r = 0.419
• Positive association
• Linear relationship: straight line with scatter
• Increasing relationship: tilts up and to the right
[Scatterplot: Dollars (billions) vs. Deals]
Basic Concepts: Simple Linear Model • Interest Rate vs. Loan Fee • For mortgages • If the interest rate is lower, does the bank make it up with a higher loan
fee?
• Correlation is r = –0.890
• Linear relationship: straight line with scatter
• Decreasing relationship: tilts down and to the right
• Strong negative association
[Scatterplot: Interest rate vs. Loan fee]
Basic Concepts: Simple Linear Model • Today’s vs. Yesterday’s Percent Change • Is there momentum? • If the market was up yesterday, is it more likely to be up today? Or is
each day’s performance independent?
• Correlation is r = 0.11
• No relationship? Tilt is neither up nor down
• A weak relationship?
[Scatterplot: Today's change vs. Yesterday's change]
Basic Concepts: Simple Linear Model • Call Price vs. Strike Price • For stock options • “Call Price” is the price of the option contract to buy stock at the
“Strike Price”
• The right to buy at a lower strike price has more value
• A nonlinear relationship: not a straight line, but a curved relationship
• Correlation r = –0.895
• A negative relationship: higher strike price goes with lower call price
[Scatterplot: Call Price vs. Strike Price]
Basic Concepts: Simple Linear Model • Output Yield vs. Temperature • For an industrial process • With a “best” optimal temperature setting
• A nonlinear relationship: not a straight line, but a curved relationship
• Correlation r = –0.0155: r suggests no relationship
• But the relationship is strong; it tilts neither up nor down
[Scatterplot: Yield of process vs. Temperature]
Basic Concepts: Simple Linear Model • Circuit Miles vs. Investment (lower left) • For telecommunications firms • A relationship with unequal variability • More vertical variation at the right than at the left • Variability is stabilized by taking logarithms (lower right)
• Correlation r = 0.820 for the raw data; r = 0.957 after taking logs
[Scatterplots: Circuit miles (millions) vs. Investment ($millions), and Log of miles vs. Log of investment]
Basic Concepts: Simple Linear Model • Price vs. Coupon Payment • For trading in the bond market • Bonds paying a higher coupon generally cost more
• Two clusters are visible
• Ordinary bonds (value is from coupon)
• Inflation-indexed bonds (payout rises with inflation)
• Correlation r = 0.994 for all bonds
• Correlation r = 0.950 for ordinary bonds only
[Scatterplot: Bid price vs. Coupon rate]
Basic Concepts: Simple Linear Model • Cost vs. Number Produced • For a production facility • It usually costs more to produce more
• An outlier is visible: a disaster (a fire at the factory) with high cost, but few produced
• With the outlier included, correlation r = –0.623
• With the outlier removed, more detail is visible and r = 0.869
[Scatterplots: Cost vs. Number produced, with and without the outlier]
Basic Concepts: OLS Modeling • Salary vs. Years Experience • For n = 6 employees • Linear (straight line) relationship • Increasing relationship • higher salary generally goes with higher experience
Experience: 15, 10, 20, 5, 15, 5
Salary ($thousand): 30, 35, 55, 22, 40, 27
• Correlation r = 0.8667
[Scatterplot: Salary ($thousand) vs. Experience]
Basic Concepts: OLS Modeling • Summarizes bivariate data: Predicts Y from X • with smallest errors (in vertical direction, for Y axis) • Intercept is 15.32 salary (at 0 years of experience) • Slope is 1.673 salary (for each additional year of experience, on average)
[Plot: least-squares line for Salary (Y) vs. Experience (X)]
Basic Concepts: OLS Modeling • Predicted Value comes from Least-Squares Line • For example, Mary (with 20 years of experience) has predicted salary 15.32+1.673(20) = 48.8 • So does anyone with 20 years of experience
• Residual is actual Y minus predicted Y • Mary’s residual is 55 – 48.8 = 6.2 • She earns about $6,200 more than the predicted salary for a person
with 20 years of experience • A person who earns less than predicted will have a negative residual
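As an illustrative sketch, NumPy's polyfit reproduces the least-squares intercept, slope, and Mary's residual for the six-employee data:

```python
import numpy as np

# Six-employee example from the slides
experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)   # X
salary = np.array([30, 35, 55, 22, 40, 27], dtype=float)     # Y, in $thousand

# Least-squares slope and intercept (OLS in one variable)
slope, intercept = np.polyfit(experience, salary, deg=1)
print(f"intercept a = {intercept:.2f}, slope b = {slope:.3f}")   # ~15.32 and ~1.673

# Predicted value and residual for Mary (20 years of experience, salary 55)
predicted_mary = intercept + slope * 20
residual_mary = 55 - predicted_mary
print(f"predicted = {predicted_mary:.1f}, residual = {residual_mary:.1f}")  # ~48.8 and ~6.2
```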
Basic Concepts: OLS Modeling
[Plot: Salary vs. Experience with the least-squares line; Mary earns 55 thousand, her predicted value is 48.8, so her residual is 6.2]
Basic Concepts: OLS Modeling • Standard Error of Estimate
Se = SY √[(1 − r²)(n − 1)/(n − 2)]
• Approximate size of prediction errors (residuals): actual Y minus predicted Y, i.e., Y − [a + bX]
• Example (Salary vs. Experience):
Se = 11.686 × √[(1 − 0.8667²)(6 − 1)/(6 − 2)] = 6.52
Predicted salaries are about 6.52 (i.e., $6,520) away from actual salaries
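A minimal sketch recomputing Se from SY, r, and n for the same salary data:

```python
import numpy as np

experience = np.array([15, 10, 20, 5, 15, 5], dtype=float)
salary = np.array([30, 35, 55, 22, 40, 27], dtype=float)

n = len(salary)
s_y = salary.std(ddof=1)                      # sample standard deviation of Y (~11.686)
r = np.corrcoef(experience, salary)[0, 1]     # correlation (~0.8667)

# Standard error of estimate: Se = SY * sqrt((1 - r^2)(n - 1)/(n - 2))
se = s_y * np.sqrt((1 - r**2) * (n - 1) / (n - 2))
print(f"Se = {se:.2f}")                       # ~6.52, i.e. about $6,520
```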
Basic Concepts: OLS Modeling • Interpretation: similar to standard deviation • Can move Least-Squares Line up and down by Se • About 68% of the data are within one “standard error of estimate” of the least-squares line • (For a bivariate normal distribution)
Multiple Linear Regression • Linear Model for the Population Y = (α + β1 X1 + β2 X2 + … + βk Xk) + ε = (Population relationship) + Randomness • Where ε has a normal distribution with mean 0 and constant
standard deviation σ, and this randomness is independent from one case to another • An assumption needed for statistical inference
Multiple Linear Regression: Results • Intercept: a • Predicted value for Y when every X is 0 • Regression Coefficients: b1, b2, …bk • The effect of each X on Y, holding all other X variables constant • Prediction Equation or Regression Equation (Predicted Y) = a+b1 X1+b2 X2+…+bk Xk • The predicted Y, given the values for all X variables • Prediction Errors or Residuals (Actual Y) – (Predicted Y)
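An illustrative sketch of fitting such a model with statsmodels; the data and variable names here are simulated, not the magazine data discussed below:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated illustration (invented variable names, not the magazine data)
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 4.0 + 2.0 * df["x1"] - 1.0 * df["x2"] + 0.5 * df["x3"] + rng.normal(scale=1.0, size=n)

X = sm.add_constant(df[["x1", "x2", "x3"]])   # adds the intercept term a
model = sm.OLS(df["y"], X).fit()

print(model.params)        # intercept a and coefficients b1, b2, b3
print(model.fittedvalues)  # predicted Y for each case
print(model.resid)         # residuals: actual Y minus predicted Y
```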
Multiple Linear Regression: Results • t Tests for Individual Regression Coefficients • Significant or not significant, for each X variable • Tests whether a particular X variable has an effect on Y, holding the other X variables constant • Should be performed only if the F test is significant • Standard Errors of the Regression Coefficients
Sb1, Sb2, …, Sbk
(with n – k – 1 degrees of freedom)
• Indicates the estimated sampling standard deviation of each
regression coefficient • Used in the usual way to find confidence intervals and hypothesis tests for individual regression coefficients
Multiple Linear Regression: Results • Predicted Page Costs for Audubon = a + b1 X1 + b2 X2 + b3 X3 = $4,043 + 3.79(Audience) – 124(Percent Male) + 0.903(Median Income) = $4,043 + 3.79(1,645) – 124(51.1) + 0.903(38,787)
= $38,966 • Actual Page Costs are $25,315 • Residual is $25,315 – 38,966 = –$13,651 • Audubon has Page Costs $13,651 lower than you would expect for a magazine with its characteristics (Audience, Percent Male, and Median Income)
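A few lines reproduce the arithmetic, using the coefficients and values quoted above:

```python
# Plugging Audubon's values into the prediction equation from the slide
a, b_audience, b_male, b_income = 4043, 3.79, -124, 0.903

predicted = a + b_audience * 1645 + b_male * 51.1 + b_income * 38787
residual = 25315 - predicted           # actual minus predicted

print(round(predicted))                # ~38,966
print(round(residual))                 # ~ -13,651
```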
Standard Error • Standard Error of Estimate Se • Indicates the approximate size of the prediction errors • About how far are the Y values from their predictions? • For the magazine data • Se = $21,578 • Actual Page Costs are about $21,578 from their predictions for this
group of magazines (using regression) • Compare to SY = $45,446: Actual Page Costs are about $45,446 from their average (not using regression) • Using the regression equation to predict Page Costs (instead of simply using the average, Ȳ), the typical error is reduced from $45,446 to $21,578
Coeff. of Determination The strength of association is measured by the square of the multiple correlation coefficient, R2, which is also called the coefficient of multiple determination.
R² = SSreg / SSy
R² is adjusted for the number of independent variables and the sample size by using the following formula:
Adjusted R² = R² − k(1 − R²)/(n − k − 1)
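A small illustrative helper applying these two formulas; the adjusted value shown is derived from the R², n, and k reported for the magazine example later in the slides:

```python
def r_squared(ss_reg, ss_y):
    """Coefficient of determination: share of the variation in Y explained by the X variables."""
    return ss_reg / ss_y

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = R^2 - k(1 - R^2) / (n - k - 1)."""
    return r2 - k * (1 - r2) / (n - k - 1)

# Magazine example from the slides: R^2 = 0.787 with n = 55 magazines and k = 3 predictors
print(round(adjusted_r_squared(0.787, n=55, k=3), 3))   # ~0.774
```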
Coeff. of Determination • Coefficient of Determination R2 • Indicates the percentage of the variation in Y that is explained by (or attributed to) all of the X variables • How well do the X variables explain Y? • For the magazine data • R2 = 0.787 = 78.7% • The X variables (Audience, Percent Male, and Median Income) taken
together explain 78.7% of the variance of Page Costs • This leaves 100% – 78.7% = 21.3% of the variation in Page Costs unexplained
The F test • Is the regression significant? • Do the X variables, taken together, explain a significant amount of the variation in Y? • The null hypothesis claims that, in the population, the X variables do not help explain Y; all coefficients are 0 H0: β1 = β2 = … = βk = 0 • The research hypothesis claims that, in the population, at least
one of the X variables does help explain Y H1: At least one of β1, β2, …, βk ≠ 0
The F test
H0: R²pop = 0, which is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:
F = (SSreg / k) / (SSres / (n − k − 1)) = (R² / k) / ((1 − R²) / (n − k − 1))
which has an F distribution with k and (n − k − 1) degrees of freedom.
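The same F statistic can be computed from R² alone, as in this sketch (SciPy is assumed available for the p-value):

```python
from scipy import stats

def f_statistic(r2, n, k):
    """Overall F statistic from R^2, with k and (n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Magazine example from the slides: R^2 = 0.787, n = 55, k = 3
f = f_statistic(0.787, n=55, k=3)
print(round(f, 1))              # ~62.8, consistent with the F = 62.84 reported later

# p-value from the F distribution with 3 and 51 degrees of freedom
print(stats.f.sf(f, 3, 51))     # very small, i.e. highly significant
```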
Performing the F test • Three equivalent methods for performing F test; they
always give the same result • Use the p-value • If p < 0.05, then the test is significant • Same interpretation as p-values in Chapter 10
• Use the R2 value • If R2 is larger than the value in the R2 table, then the result is significant • Do the X variables explain more than just randomness?
• Use the F statistic • If the F statistic is larger than the value in the F table, then the result is
significant
Example: F test • For the magazine data, The X variables (Audience, Percent Male, and Median Income) explain a very highly significant
percentage of the variation in Page Costs • The p-value, listed as 0.000, is less than 0.0005, and is therefore
very highly significant (since it is less than 0.001) • The R2 value, 78.7%, is greater than 27.1% (from the R2 table at level 0.1% with n = 55 and k = 3), and is therefore very highly significant • The F statistic, 62.84, is greater than the value (between 7.054 and 6.171) from the F table at level 0.1%, and is therefore very highly significant
t Tests • A t test for each regression coefficient • To be used only if the F test is significant • If F is not significant, you should not look at the t tests
• Does the jth X variable have a significant effect on Y, holding the
other X variables constant? • Hypotheses are H0: βj = 0, H1: βj ≠ 0 • Test using the confidence interval
bj ± t × Sbj
• use the t table with n – k – 1 degrees of freedom
• Or use the t statistic
t statistic = bj / Sbj
• compare to the t table value with n – k – 1 degrees of freedom
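An illustrative sketch of the t statistic and confidence interval for a single coefficient; the standard error used in the example call is back-derived from the b1 and t values on the next slide, since it is not listed directly:

```python
from scipy import stats

def coefficient_t_test(b, s_b, n, k, alpha=0.05):
    """t statistic, two-sided p-value, and confidence interval for one coefficient (df = n - k - 1)."""
    df = n - k - 1
    t_stat = b / s_b
    p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-sided test of H0: beta_j = 0
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci = (b - t_crit * s_b, b + t_crit * s_b)      # b_j +/- t * S_bj
    return t_stat, p_value, ci

# Audience coefficient from the magazine example: b1 = 3.79, t = 13.5, n = 55, k = 3
# (S_b1 below is back-derived as b1 / t for illustration)
print(coefficient_t_test(b=3.79, s_b=3.79 / 13.5, n=55, k=3))
```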
Example: t Tests • Testing b1, the coefficient for Audience b1 = 3.79, t = 13.5, p = 0.000 • Audience has a very highly significant effect on Page Costs, after
adjusting for Percent Male and Median Income
• Testing b2, the coefficient for Percent Male b2 = – 124, t = – 0.90, p = 0.374 • Percent Male does not have a significant effect on Page Costs, after
adjusting for Audience and Median Income
• Testing b3, the coefficient for Median Income b3 = 0.903, t = 2.44, p = 0.018 • Median Income has a significant effect on Page Costs, after adjusting
for Audience and Percent Male
Assumptions in Regression • Assumptions underlying the statistical techniques
should be tested twice • First for the separate variables • Second for the multivariate model variate, which acts
collectively for the variables in the analysis and thus must meet the same assumptions as the individual variables • This differs across multivariate techniques
Assumptions in Regression • Linearity • The independent variable has a linear relationship with the dependent variable • Normality • The residuals or the dependent variable follow a normal distribution • Multicollinearity • When some X variables are too similar to one another • Homoskedasticity • The variability in Y values for a given set of predictors is the same regardless of the values of the predictors • Independence among cases (Absence of correlated errors) • The cases are independent of each other
Assumptions in Regression Normality • The residuals or the dependent variable follow a normal
distribution • If the variation from normality is sufficiently large, the resulting statistical tests are invalid • Graphical Analysis • Histogram and normal probability plot • Peaked and skewed distributions result in non-normality
• Statistical Analysis
• If the Z value exceeds the critical value, then the distribution is non-normal
• Kolmogorov–Smirnov test; Shapiro–Wilk test
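A sketch of both tests with SciPy; the residuals here are simulated for illustration, and in practice you would pass the residuals of your fitted regression (e.g., model.resid from statsmodels):

```python
import numpy as np
from scipy import stats

# Simulated residuals for illustration only
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk test: null hypothesis is that the data come from a normal distribution
w_stat, p_shapiro = stats.shapiro(residuals)

# Kolmogorov-Smirnov test against a standard normal, after standardizing the residuals
z_scores = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, p_ks = stats.kstest(z_scores, "norm")

print(p_shapiro, p_ks)   # small p-values (e.g. < 0.05) suggest non-normality
```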
Assumptions in Regression Normality
Assumptions in Regression Homoskedasticity • Assumption related primarily to dependence
relationships between variables • Assumption that the dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s). • The variance of the dependent variable should not be concentrated in only a limited range of the independent values • Source • Type of variable • Skewed distribution
Assumptions in Regression Homoskedasticity • Graphical Analysis • Analysis of residuals in case of Regression • Statistical Analysis • Variances within groups formed by non-metric variables • Levene Test • Box’s M Test • Remedy • Data Transformation
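A rough sketch of both checks on simulated data: a residuals-versus-fitted plot, and a Levene test comparing residual variances across an illustrative low/high split of the fitted values:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated data for illustration; in practice use the residuals and fitted values of your regression
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Graphical check: residuals vs. fitted values should form a constant band, not a cone
plt.scatter(fitted, residuals)
plt.axhline(0.0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Levene test across two illustrative groups; a small p-value suggests unequal variances
groups = np.where(fitted < np.median(fitted), "low", "high")
stat, p_value = stats.levene(residuals[groups == "low"], residuals[groups == "high"])
print(p_value)
```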
Assumptions in Regression Homoskedasticity • Graphical Analysis
Assumptions in Regression Linearity • Assumption for all multivariate techniques based on
correlational measures such as • multiple regression, • logistic regression, • factor analysis, and • structural equation modeling
• Correlation represents only the linear association
between variables • Identification
• Scatterplots or examination of residuals using regression
• Remedy • Data Transformations
Assumptions in Regression Linearity
Assumptions in Regression Absence of Correlated Errors • Prediction errors should not be correlated with each
other • Identification • Most possible cause is the data collection process, such as
two separate groups in the data collection process • Remedy • Including the omitted causal factor into the multivariate analysis
Assumptions in Regression Multicollinearity • Multicollinearity arises when intercorrelations among the predictors
are very high. • Multicollinearity can result in several problems, including: • The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high. • The magnitudes as well as the signs of the partial regression coefficients may change from sample to sample. • It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable. • Predictor variables may be incorrectly included or removed in stepwise regression.
Assumptions in Regression Multicollinearity • The ability of an independent variable to improve the prediction of the
dependent variable is related not only to its correlation to the dependent variable, but also to the correlation(s) of the additional independent variable to the independent variable(s) already in the regression equation • Collinearity is the association, measured as the correlation,
between two independent variables • Multicollinearity refers to the correlation among three or more independent variables • Impact • Reduces any single IV's predictive power by the extent to which it is
associated with the other independent variables
Assumptions in Regression Multicollinearity • Measuring Multicollinearity • Tolerance • Amount of variability of the selected independent variable not explained
by the other independent variables • Tolerance Values should be high • Cut-off is 0.1 but greater than 0.5 gives better results • VIF • Inverse of Tolerance • Should be low (typically below 2.0 and usually below 10)
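A sketch of computing Tolerance and VIF with statsmodels; the predictors below are simulated, with x2 deliberately correlated with x1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Simulated predictors; replace with your own X variables
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.3, size=100),   # deliberately correlated with x1
    "x3": rng.normal(size=100),
})

X_const = add_constant(X)   # compute VIF with the intercept included
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```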
Assumptions in Regression Multicollinearity • Remedy for Multicollinearity • A simple procedure for adjusting for multicollinearity consists of using only one of the variables in a highly correlated set of variables. • Omit highly correlated independent variables and identify other independent variables to help the prediction • Alternatively, the set of independent variables can be transformed into a new set of predictors that are mutually independent by using techniques such as principal components analysis. • More specialized techniques, such as ridge regression and latent root regression, can also be used.
Assumptions in Regression Data Transformations • To correct violations of the statistical assumptions
underlying the multivariate techniques • To improve the relationship between variables • Transformation to achieve Normality and Homoscedasticity • Flat Distribution – Inverse transformation • Negatively Skewed Distribution – Square Root
Transformation • Positively Skewed Distribution – Logarithmic Transformation • If the residuals in regression are cone shaped then • Cone opens to right – Inverse transformation • Cone opens to left – Square root transformation
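A minimal sketch of these transformations in NumPy; the data are simulated, and the appropriate transformation depends on the shape of your own distribution:

```python
import numpy as np

# Simulated, positively skewed variable for illustration
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

log_x = np.log(x)      # logarithmic transformation: suggested for positively skewed distributions
sqrt_x = np.sqrt(x)    # square-root transformation: suggested for negatively skewed distributions
inv_x = 1.0 / x        # inverse transformation: suggested for flat distributions

# All three require strictly positive values; shift the variable first if needed,
# e.g. np.log(x - x.min() + 1)
```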
Assumptions in Regression Data Transformations • Transformation to achieve Linearity
Assumptions in Regression Data Transformations
Assumptions in Regression • General guidelines for transformation • For a noticeable effect of transformation, the ratio of a variable’s mean to its standard deviation should be less than 4.0 • When the transformation can be performed on either of two variables, select the one with the smaller mean/sd ratio • Transformations should be applied to the independent variables, except in the case of heteroscedasticity • Heteroscedasticity can only be remedied by transforming the dependent variable in a dependence relationship • If the heteroscedastic relationship is also non-linear, the dependent variable, and perhaps the independent variables, must be transformed • Transformations may change the interpretation of the variables
Issues in Regression • Variable Selection • How to choose from a long list of X variables? • Too many: waste the information in the data • Too few: risk ignoring useful predictive information
• Model Misspecification • Perhaps the multiple regression linear model is wrong • Unequal variability? Nonlinearity? Interaction?
EXPLANATORY VS PREDICTIVE MODELING
Explanatory Vs Predictive Modeling • An explanatory model fits the data closely, whereas a good
predictive model predicts new cases accurately • Explanatory models use the entire dataset to estimate the best-fit model and maximize explained variance (R2). Predictive models estimate the model on a training set and assess it on new, unobserved data • Performance measures for explanatory models measure how closely the data fit the model, whereas for predictive models performance is measured by predictive accuracy
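An illustrative sketch of the contrast using scikit-learn on simulated data: the explanatory fit uses all cases and reports in-sample R², while the predictive fit is estimated on a training set and assessed on held-out cases:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Simulated data; replace with your own X matrix and y vector
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Explanatory use: fit on all the data and report in-sample R^2
explanatory = LinearRegression().fit(X, y)
print("in-sample R^2:", r2_score(y, explanatory.predict(X)))

# Predictive use: estimate on a training set, assess on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
predictive = LinearRegression().fit(X_train, y_train)
print("holdout R^2:", r2_score(y_test, predictive.predict(X_test)))
```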
Performance Evaluation • Prediction Error for observation ‘i’= Actual y value –
predicted y value • Popular numerical measures of predictive accuracy • MAE or MAD (Mean absolute error / deviation)
• Average Error
• MAPE (Mean Absolute Percentage Error)
Performance Evaluation • RMSE (Root mean squared error)
• Total SSE (total sum of squared errors)
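These measures can be computed directly from actual and predicted values, as in this illustrative sketch (the numbers in the example call are made up):

```python
import numpy as np

def prediction_metrics(actual, predicted):
    """Common numerical measures of predictive accuracy."""
    errors = np.asarray(actual) - np.asarray(predicted)     # actual y minus predicted y
    return {
        "MAE/MAD": np.mean(np.abs(errors)),
        "Average Error": np.mean(errors),
        "MAPE (%)": 100 * np.mean(np.abs(errors / np.asarray(actual))),
        "RMSE": np.sqrt(np.mean(errors ** 2)),
        "Total SSE": np.sum(errors ** 2),
    }

# Tiny made-up example
print(prediction_metrics(actual=[30, 35, 55, 22], predicted=[32, 33, 50, 25]))
```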
CASE
Case: Pedigree Vs Grit
• Why does a low R2 not make the regression useless?
• Describe a situation in which a useless regression has a high R2.
• Check the validity of the linear regression model assumptions.
• Estimate the excess returns of Bob’s and Putney’s funds. Between them, who is expected to obtain higher returns at their current funds and by how much?
• If hired by the firm, who is expected to obtain higher returns and by how much?
• Can you prove at the 5% level of significance that Bob would get higher expected returns if he had attended Princeton instead of Ohio State?
• Can you prove at the 10% level of significance that Bob would get at least 1% higher expected returns by managing a growth fund?
• Is there strong evidence that fund managers with an MBA perform worse than fund managers without an MBA? What is held constant in this comparison?
• Based on your analysis of the case, which candidate do you support for AMBTPM’s job opening: Bob or Putney? Discuss.
Case: Nils Baker • Is the presence of a physical Bank Branch creating
demand for checking accounts?
Thank You