Logistic Regression
Training Document: Logistic Regression
Table of Contents
Objective
About Logistic Regression
  Concept
Steps of Developing a Logistic Regression Model
  Key Metrics Finalization
    Rolling Performance Windows
  Data Preparation
    Data Treatment
    Derived Variables Creation
    Data Split
    Oversampling
  Variable Selection/Reduction
    Data Distribution Related Issues
    Information Value
    WOE Approach
    Multicollinearity Check
    Standardization of Variables
  Logistic Regression Procedure
  Key Model Statistics
Model Fit Statistics
  Model Description
  KS Statistic and Rank Ordering
  Gini and Lorentz Curves
  Divergence Index Test
  Clustering Checks
  Deviance and Residual Test
  Hosmer and Lemeshow Test
Model Validation
  1) Re-estimation on Hold-out Sample
  2) Rescoring on Bootstrap Samples
Objective
The purpose of this document is to guide new joiners, or people new to logistic modelling, through each step from data collection and preparation to logistic modelling results and validation. The level of detail at each stage is introductory. The document lets the reader start at the beginning and see how each stage of the process contributes to the overall problem, and how the stages interact and flow together towards a final solution and its presentation. The focus is on the execution of each step of the process and the methods used to verify the integrity of the process.
About Logistic Regression
Logistic regression uses maximum likelihood estimation to develop models. It is a form of statistical modelling that is often appropriate for dichotomous outcomes, for example good and bad. It is a method of describing the relationship between a binary dependent (predicted) variable and a set of independent explanatory variables, based on a set of observations. The independent variables typically comprise demographic characteristics, past performance characteristics, and product-related characteristics. Essentially, it is a method of finding the best fit to a set of data points.
CONCEPT
Logistic regression predicts the probability (P) of an event (Y) occurring through the following equation:

Log(P/(1-P)) = α + β1X1 + β2X2 + ... + βnXn

where P is the probability that the event Y occurs, P(Y=1); the odds are P/(1-P); and Log(P/(1-P)) is the log of the odds.

METHOD OF ESTIMATION
• Maximum Likelihood Estimation: the coefficients α, β1, β2, ..., βn are estimated such that the log of the likelihood function is as large as possible.
• Maximum likelihood solves the condition Σ (Yi – p(Yi=1)) Xi = 0, summed over all observations i = 1, 2, ..., n.
• Assumption: Yi and Yj are independent for all i ≠ j.
• There are no distributional assumptions on the independent predictors.
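For reference, the quantity that maximum likelihood estimation maximizes, and the resulting score equations, can be written compactly (a standard restatement of the above, with p_i = P(Y_i = 1)):

\[
\log L(\alpha,\beta) = \sum_{i=1}^{n}\big[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\big],
\qquad
p_i = \frac{1}{1 + e^{-(\alpha + \beta_1 x_{i1} + \dots + \beta_n x_{in})}}
\]

Setting the derivative of log L with respect to each coefficient to zero gives exactly the condition quoted above: Σ (y_i − p_i) x_ij = 0 for every predictor j, and Σ (y_i − p_i) = 0 for the intercept.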
Steps of Developing a Logistic Regression Model

Key Metrics Finalization
• Observation Window: the time frame from which the independent variables (X's) are taken.
• Observation Point: the point at which the population will be scored.
• Performance Window: the time frame from which the dependent variable (Y) is taken.
[Figure: timeline showing the Observation Window, the Observation Point, and the Performance Window]
Rolling Performance Windows
The above example uses Jan'14 to Mar'14 as the Observation Window and May'14 to Aug'14 as the Performance Window, i.e. a single observation and performance window. Multiple rolling performance windows are used in the following cases:

1. To capture data seasonality. When a single performance window is used, the assumption is that the parameters of the model are constant over time. However, the economic environment often changes considerably, and it may not be reasonable to assume that a model's parameters are constant. A common technique to assess the constancy of a model's parameters is to compute parameter estimates over a rolling window of fixed size through the sample. If the parameters are truly constant over the entire sample, the estimates over the rolling windows should not differ much; if the parameters change at some point during the sample, the rolling estimates should capture this instability. For example, the illustration below utilizes 3 performance windows of 3 months each. Using multiple performance windows, data from 10 months (Jan'13 to Oct'13) is utilized in model development, which would not be possible with a single performance window, and seasonality in the data is catered for.
2. Utilizing campaign data of multiple months for model development. If campaign data from multiple months is to be utilized for a campaign response model, multiple performance windows can be used. For example, suppose credit card campaign data for 3 months (Jan'13 to Mar'13) is available for campaign response model development. Instead of a single performance window, the following rolling windows can be utilized. Since a different set of customers is targeted in the campaign each month, there will be no duplicates across windows.
Performance Window 1: customers targeted in Jan'13 for the campaign – whether they bought a credit card from Feb'13 to Apr'13.
Performance Window 2: customers targeted in Feb'13 for the campaign – whether they bought a credit card from Mar'13 to May'13.
Performance Window 3: customers targeted in Mar'13 for the campaign – whether they bought a credit card from Apr'13 to Jun'13.

Target variable: Once the objective and scope of the analysis are defined, it is important to identify the target variable. For example, a risk model, depending on the data and the business, can have multiple candidate target variables, such as 90DPD, a bankruptcy indicator, charge-off, or BAD12 (account becoming bad within the first 12 months of activation). Different target variables will lead to different models, performance, and business insights. Sometimes a combination of target variables is used to build the overall model; for example, an overall BAD flag can be a combination of 90DPD, bankruptcy, or charge-off.
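As a rough illustration of building such a combined BAD flag, a minimal SAS sketch is given below. The dataset and flag names (perf_window, dpd90_flag, bankrupt_flag, chargeoff_flag) are hypothetical placeholders, not fields from the sample data.

data model_base;
    set perf_window;                     /* performance-window extract */
    /* overall BAD = 90+ DPD, bankruptcy, or charge-off in the window */
    if dpd90_flag = 1 or bankrupt_flag = 1 or chargeoff_flag = 1 then bad = 1;
    else bad = 0;
run;

proc freq data=model_base;               /* check the resulting event rate */
    tables bad;
run;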
In some problems, the target variable needs to be created and possibly defined. For example, the client may want to build a model to identify potential churn but might not have a clear definition of attrition. In such situations, it often helps to look at the data and come up with a set of rules or an algorithm to identify the dependent variable. Again, the definition of the dependent variable may influence the overall value of the model. For example, say the objective is to predict bankruptcy of cardholders: we can choose to define the dependent variable to capture bankruptcy next month or bankruptcy within 3 months. Clearly the latter model is more useful if the objective of the analysis is to take pre-emptive action against those likely to go bankrupt.

In the current sample data, the target variable is defined as follows:
• Positive responders to the campaign in the population data are tagged as '1'; the rest of the population (non-targets) is tagged as '0'.
Exclusion Criteria: Policy exclusions and any other exclusions need to be applied prior to model development to ensure the data are not biased and the model base is representative of the actual population.
Data Preparation
The goal of this step is to prepare a "master" dataset to be used in the modelling phase. This dataset should contain at least:
• A key, or set of keys, that identifies each record uniquely
• The dependent variable relevant to the problem
• All independent variables that are relevant or may be important to the problem solution

In the early stages of a solution, it can be hard to determine an exact set of independent variables. Often nothing is left out to begin with, and the list of relevant variables is derived and continually updated as the process unfolds. If the required master data is spread across several datasets, the pertinent records and variables need to be extracted from each dataset and merged together to form the master dataset. If this must be done, it is very important that proper keys are used across the datasets so that we not only end up with all the needed variables in the final dataset, but also merge the datasets correctly. For example, you may have a customer dataset with customer-level information such as name, date of birth, age, sex, address, etc. (a "static" dataset), and another "account" dataset containing account-level information such as account number, account type (savings/current/mortgage/fixed deposit), total balance, date of opening, last transaction date, etc. The account-level dataset needs to be rolled up to customer level before merging with the customer dataset to create the master dataset.

Note: If you try to merge two datasets by a common numeric variable whose lengths were defined differently in each dataset, you may see a warning in the log file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may cause unexpected results.
It is generally not wise to overlook log file warnings unless you have a very good reason to. A short data step redefining the length of the shorter variable in one of the datasets before merging will suffice to get rid of the warning, and could reveal important data problems, such as information being truncated from some values of the BY variable in the dataset with the shorter length.
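A minimal sketch of this roll-up and merge is given below. The dataset and variable names (cust, acct, cust_id, balance, open_date) are illustrative assumptions, not the actual client layout.

/* Roll account-level data up to customer level */
proc sql;
    create table acct_cust as
    select cust_id,
           count(*)       as num_accounts,
           sum(balance)   as total_balance,
           min(open_date) as first_open_date format=date9.
    from acct
    group by cust_id;
quit;

/* Align the numeric BY-variable length before merging to avoid the
   "Multiple lengths were specified" warning                         */
data acct_cust;
    length cust_id 8;
    set acct_cust;
run;

proc sort data=cust;      by cust_id; run;
proc sort data=acct_cust; by cust_id; run;

data master;
    merge cust (in=a) acct_cust (in=b);
    by cust_id;
    if a;                 /* keep all customers */
run;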
Data Treatment
Once the master dataset has been created, the univariate macro (if available) should be run to understand the data. Characteristics of the data that need to be looked at include:
• Variable name
• Format
• Number of unique values
• Number of missing values
• Distribution (PROC MEANS output for numeric variables; highest and lowest frequency categories for a categorical variable)
  o Numeric variables: standard numerical distribution including the mean, min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
  o Categorical variables: number of times the variable takes each categorical value
Output of the univariate macro for a few variables is given below (intermediate percentile columns are omitted):

Obs  name  type  var_length  n_pos  numobs  nmiss  unique  mean_or_top1  min_or_top2  p1_or_top3  ...  p99_or_bot2  max_or_bot1
1    Var1  num   8           0      46187   0      929     0.12648       0            0           ...  0.769        0.998
2    Var2  num   8           8      46187   0      505     0.06473       0            0           ...  0.285        0.944
3    Var3  num   8           16     46187   0      175     714.42876     0            650         ...  756          794
4    Var4  num   8           24     46187   0      257     656.30054     0            0           ...  755          794
5    Var5  num   8           32     46187   0      10673   100.50368     0            0           ...  1710.922     136318.123
6    Var6  num   8           40     46187   0      33123   305.97356     0            0           ...  2552.431     221315.614
7    Var7  num   8           48     46187   0      1332    0.11786       0            0           ...  1.073        47.794
8    Var8  char  1           56     46187   0      10      2::524952     ::429733     1::376441   ...  7::5468      8::2781
Code for getting univariate output of variables:
Univariate_Macro.txt
Put the library path location (where the dataset exists) and the dataset name (on which the univariate will run) in place of XXX at the bottom of the univariate code before running it.

Basic things that should be looked for when first assessing the data:
• Are data formats correct?
  o Are numerical variables stored as text? Do date variables need to be converted in order to be useful?
• Which variables have missing values?
• Are there data outliers?
• Do any variables exhibit invalid values (many "9999999", "101010101…", "0/1" values, etc.)?
  o If you have a data dictionary provided by the client, there may be information on invalid values, so this would be the first thing to check.
• Are any distributions clearly out of line with expectations?
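If the univariate macro is not to hand, a rough manual profile along the same lines can be produced with PROC MEANS and PROC FREQ; the dataset and variable names below are placeholders.

proc means data=master n nmiss mean min p1 p5 p25 median p75 p95 p99 max;
    var var1 var2 var3;                 /* numeric candidates */
run;

proc freq data=master nlevels;
    tables var8 / missing;              /* categorical candidates, including missing */
run;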
Missing Value Imputation It is important to impute the missing values in a dataset before analysis can be performed on it. Below are some popular techniques:
• Replace missing values with zero
• Replace missing values with the value of the same variable whose records have the closest mean value of the response variable
• Regression on other predictors
• Replace missing values with the mean
• Replace missing values with the median
• Inter-correlation
• Do not impute missing values
Replace missing values with zero
Use in situations where the absence of a value is implicitly zero – for example, NUMBER OF LATE PAYMENTS, where the value would be expected to be zero for most customers. Check related fields to justify this decision. Also, if some records have a 0 and others are blank (missing), check with the client whether a blank has a different interpretation.
Regression on other predictors
Create a linear model on the populated records to predict the values of this variable using other variables in the dataset as predictors, then score the records with missing values using the model coefficients. This is a very good method when there is sufficient covariance among the variables in the dataset to produce a precise and accurate regression.

Replace missing values with the mean
Use in situations where the great majority of records (roughly 85%+) are populated and other methods are not feasible. It can also be used where the variable is a predictor with low influence in the model but needs to be included.

Replace missing values with the median
Use in the same situations as mean imputation (roughly 85%+ of records populated, other methods not feasible), but prefer it to the mean when the distribution is highly skewed. It can also be used where the variable is a predictor with low influence in the model but needs to be included.

Inter-correlation
Find another predictor variable that has a very high fill rate and is very highly correlated with the variable being imputed. The other predictor is binned and the median value of the variable being imputed is calculated for each bin. The variable is then imputed based on the bin into which the record falls. This is a good method but needs a very high correlation among predictors and a very high (close to 100%) fill rate in the other predictor.

Do not impute missing values
• The response variable for the model.
• A predictor variable that has low correlation with other predictors, where imputation of zero, the mean, or the median would bias model results.
• A predictor variable that has low correlation with the response and is unlikely to play a significant role in the model – exclude it from modelling.
• A predictor variable that could be important in the model but has a large percentage of values missing – either exclude the records with missing values or exclude the variable from the model (imputation using the above techniques would result in a model with inflated performance statistics, or one reflecting data manipulation rather than the original source data).
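Where mean or median imputation is chosen, PROC STDIZE with the REPONLY option is one convenient way to apply it; the sketch below is illustrative and the dataset and variable names are placeholders.

/* Median (or mean) imputation of missing values only */
proc stdize data=master out=master_imp
            reponly                      /* replace missing values only */
            method=median;               /* or METHOD=MEAN              */
    var var3 var4;
run;

/* Simple zero-imputation where a missing value genuinely means "none" */
data master_imp;
    set master_imp;
    if missing(num_late_payments) then num_late_payments = 0;
run;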
Outlier Treatment
It is very important to treat outliers in the dataset before any analysis is performed. Outliers can be detected using the PROC UNIVARIATE output: comparing the P99 and max values (or the P1 and min values), we can identify the variables with possible outliers. Some common ways of dealing with outliers are listed below, followed by a sketch of the simple capping approach.
• Cap all outliers at P1 or P99.
• Cap all outliers at (P1 − δ) or (P99 + δ). The value of δ will be subjective.
• Use exponential smoothing for all values beyond the P1–P99 range.
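A minimal sketch of the first approach (capping at P1/P99) is given below; var5 and the dataset names are placeholders.

/* Get the P1 and P99 cut-offs for the variable */
proc means data=master noprint;
    var var5;
    output out=pct p1=var5_p1 p99=var5_p99;
run;

/* Cap values outside the P1-P99 range */
data master_cap;
    if _n_ = 1 then set pct;                       /* bring in the cut-offs */
    set master;
    if var5 > var5_p99 then var5 = var5_p99;       /* cap high outliers     */
    if not missing(var5) and var5 < var5_p1
        then var5 = var5_p1;                       /* cap low outliers      */
run;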
The capping methods are easier to implement but lose some of the ordinality of the data at the capped tails. The exponential smoothing method addresses the outlier problem without losing the ordinality of the data.

Derived variables creation
Derived variables are created in order to capture underlying trends and aspects of the data. Rather than just using the raw variables in the model, taking proportions and ratios and building indexes can help reduce bias and identify new trends in the data. For example, taking average monthly spend instead of total spend in the last 12 months is more insightful because it neutralizes the effect of new customers having lower spend due to their shorter tenure; the normalized average provides a more accurate comparison among customers.

Data Split
• Development dataset – fit the model on this dataset.
• Validation dataset (hold-out sample) – validate the model using the hold-out sample.
• Out-of-time sample – validate the model on a different time period to ensure the model holds up.
The development and validation samples can be split in any ratio, with 50–80% of records in the development sample. Sample code for a 70-30 split:

data temp;
    set xxx;
    n = ranuni(8);
run;
proc sort data=temp; by n; run;
data training validation;
    set temp nobs=nobs;
    if _n_ <= 0.7*nobs then output training;
    else output validation;
run;

Information Value: a characteristic with information value greater than 0.1 is very predictive – use it in modelling.

Coarse Classing: Coarse classing is a method to identify similar categories. To coarse class the population, group categories with similar log odds and the same sign, then calculate the log odds for the grouped category; the new log odds is the Weight of Evidence of the group. Each of the characteristics deemed to be predictive (information value > 0.03) should be grouped (normally performed using the fine class output and a ruler) into larger, more robust groups such that the underlying trend in the characteristic is preserved. A rule of thumb suggests that at least 3% of the goods and 3% of the bads should fall within a group. For continuous characteristics, 10% bands are used to get an initial indication of the predictive patterns and the strength of the characteristic; for the grouping of attributes, more detailed reports are produced using 5% bands. Try to make classes with around 5–10% of the population: classes with less than 5% might not give a true picture of the data distribution and might lead to model instability. The trend post coarse classing should be monotonically increasing, monotonically decreasing, a parabola, or an inverted parabola; polytonic trends (multiple changes of direction) are usually not acceptable. Business inputs from the SMEs in the markets are essential for the coarse classing process, as fluctuations in variables can be better explained and classes then make business sense.
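A rough sketch of the fine-classing step described above (10% bands with the log odds per band) is given below; the variable var3, the bad flag, and the dataset name are illustrative assumptions.

/* Split a candidate variable into 10% bands (deciles) */
proc rank data=training groups=10 out=fineclass;
    var var3;
    ranks var3_band;
run;

/* Log odds (goods vs. bads) per band, used to decide how to group bands */
proc sql;
    create table band_logodds as
    select var3_band,
           count(*)         as n_obs,
           sum(bad = 0)     as goods,
           sum(bad = 1)     as bads,
           log(calculated goods / calculated bads) as log_odds
    from fineclass
    group by var3_band
    order by var3_band;
quit;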
WOE Approach
Concept: In the standard WOE approach every variable is replaced by its binned counterpart. The binned variable is created by assigning to each record the WOE of the bin it falls into, where the bins are those formed during coarse classing:

WOE = ln(% Good / % Bad)

WOE = 0 is imputed for the bins containing missing records and for bins that consist of less than 2% of the population.

Advantage: every attribute of the variable is weighted differently, which takes care of the neutral weight assignment that occurs in the dummy-variable approach.

Disadvantage: fewer degrees of freedom, hence the chance of the variable being represented in the model is lower in comparison to the dummy approach.
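A minimal sketch of computing bin-level WOE and replacing the raw variable with it is given below; the pre-built bin variable var3_bin, the bad flag, and the 0-for-sparse-bins rule follow the description above, while all names are placeholders.

/* WOE per coarse-classed bin */
proc sql;
    create table woe_map as
    select var3_bin,
           sum(bad = 0) / (select count(*) from training where bad = 0) as pct_good,
           sum(bad = 1) / (select count(*) from training where bad = 1) as pct_bad,
           log(calculated pct_good / calculated pct_bad)               as woe
    from training
    group by var3_bin;
quit;

/* Replace the raw variable with its binned WOE value */
proc sql;
    create table training_woe as
    select t.*,
           coalesce(m.woe, 0) as var3_woe      /* WOE = 0 for missing/sparse bins */
    from training t
    left join woe_map m
      on t.var3_bin = m.var3_bin;
quit;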
MULTICOLLINEARITY CHECK
Multicollinearity Macro
Introduction: The MULTI COLLINEARITY macro is used to detect and remove multicollinearity. It identifies the variables that are correlated and helps in removing the correlated and/or insignificant variables.

Logic:
1. Capture the outputs of the REG and LOGISTIC procedures.
2. Transpose the factor-loading matrix and attach the parameter estimate, VIF, and Wald chi-square value to each independent variable.
3. Go to the last eigenvector (with the lowest eigenvalue) and find the variables that are correlated (those with factor loadings greater than the cut-off factor loading specified in the Excel sheet).
4. Remove those variables (not more than 3 variables at any iteration) that have a high VIF (more than 1.75) and a lower Wald chi-square value.
5. Go to each eigenvector in ascending order of eigenvalue and find the variables that are correlated (those with factor loadings greater than the cut-off factor loading specified in the Excel sheet).
6. Return to step 4 after each eigenvector selection.

Points to note:
1. If the factor loadings on a particular eigenvector are not above the cut-off, that vector is ignored and the next eigenvector is examined.
2. Not more than 3 variables can be dropped at a time.
3. Not more than 250 variables can be used, because of the Excel limitation on the number of rows.
4. Clear the contents in columns M and P of the Multicol8.xls sheet before starting to use the macro for each new project.
Programs – Overview of Inputs & Outputs
Multi Collinearity.sas
1. SAS Program:
Inputs required, apart from the library and dataset name:
1. List of variables for the REG and LOGISTIC procedures
2. Response variable name

Output: one Excel sheet, "mc.xls", created in the C:\ directory (the location and name of the file can be changed). Please go through the program and input the values at the appropriate places (comments will guide you in doing that).
MultiCol8.ZIP
2. Excel Sheet: Multicol8.xls. Save this in the C:\ directory. It contains the VB macro that runs on the file "C:\mc.xls" created by the SAS program. Save this file on your computer and keep it open while removing multicollinearity.
3. Outputs:
List of Variables Retained (column M) and Removed (column P) will get pasted in the same excel sheet "MultiCol8.xls"
Tracking of the variables removed, from the first iteration to the last. The name of the tracking file is to be specified each time the macro is run (e.g. c:\log.txt). You can use the same file name throughout a project; it holds the history of variables removed and the corresponding correlated variables. Make sure you change the file name when you start a new project, otherwise the existing "log.txt" gets appended. The purpose of this tracking file is to find replacement variables for any variable that was dropped at any point. Open this txt file in Excel as delimited text, selecting Comma and Other = "-" as delimiters. Each row gives the variables that are correlated: columns B, C, and D give the variables removed at a point, and the variables from column F onwards are those correlated with the removed variables and retained at that respective point.
Frequently Asked Questions Q1: Why should we do a multicollinearity check? Ans1: Multicollinearity refers to correlation among independent variables and leads to an increase in the standard error. This in turn makes the model unreliable.
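If the macro and its Excel workbook are not available, a simplified manual check covers the same inputs the macro consumes (VIF, collinearity diagnostics, and Wald chi-square); &varlist and the dataset name below are placeholders.

/* VIF and eigenvalue/condition-index diagnostics from PROC REG
   (the binary response is used only to define the model; the
    collinearity diagnostics depend only on the predictors)     */
proc reg data=training;
    model bad = &varlist / vif collin;
run;
quit;

/* Wald chi-square per variable from PROC LOGISTIC */
proc logistic data=training descending;
    model bad = &varlist;
run;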
Clustering of variables
VARCLUS is a SAS procedure that groups variables into clusters based on the variance of the data they explain. It is unsupervised in that it does not depend on the dependent variable. In the background it performs a principal-components-like analysis and then makes the components orthogonal so that they contain distinct sets of variables. Here are some practical implementation steps for running VARCLUS:
Varclus will group the identical variables into clusters
Ideal representative(s) from cluster can be retained
Selected variable should have high r-square with own cluster and low r-square with next closest cluster
The 1−R² ratio is a good selection criterion: (1 − R² with own cluster) / (1 − R² with next closest cluster)
Multiple selections can be made from clusters if necessary
Business judgment might often determine variable selection
Here is the sample code:
proc varclus data=imputed maxeigen=.7 short hi; var &list; run; Here is the sample output:
Cluster    Variable    R-squared with Own Cluster    R-squared with Next Closest Cluster    1-R-squared Ratio
69         bb          0.5804                        0.468                                  0.788722
69         cc          0.443                         0.1362                                 0.644825
70         kk          0.8057                        0.2993                                 0.277294
70         ll          0.6345                        0.2918                                 0.516097
71         mm          0.5625                        0.0013                                 0.438069
72         nn          0.7797                        0.2811                                 0.30644
So in the above output we have 4 clusters. We want to select one variable to represent each cluster, but we might use more than one variable from a cluster for business reasons. Ideally we want to choose a variable that has a high R-square with its own cluster and a low R-square with its next closest cluster; choosing the variable with the lowest 1−R² ratio accomplishes this. This implies the choice of cc, kk, mm, and nn.

Standardization of Variables
Once the final set of variables on which the logistic regression will be run is decided, standardized coefficients can be obtained by standardizing the variables. For example, the z-score for age would be calculated as (age − mean[age]) / std dev[age]. The output would then give standardized coefficients as results.
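One way to do this is PROC STANDARD, as sketched below with placeholder names; alternatively, the STB option on the MODEL statement of PROC LOGISTIC reports standardized estimates without transforming the data.

/* Standardize selected predictors to mean 0, standard deviation 1 */
proc standard data=training mean=0 std=1 out=training_std;
    var age var3 var5;
run;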
Logistic Regression Procedure
Before running a logistic model in SAS you first need to check that your dataset is ready for logistic modelling. SAS will discard any observations with missing values for any of the variables specified for the model, so general modelling data preparation and missing-value imputation must already have occurred. Also, PROC LOGISTIC will only accept numeric variables as predictors, unless categorical character variables are specified in the CLASS statement. You must also be careful with categorical variables that are coded with numerals – the procedure will treat these as continuous numerics unless they are specified in the CLASS statement. To run a logistic regression in SAS and estimate the coefficients of the model, one would run code similar to:

proc logistic data = <input dataset> DESCENDING;
    model dependent_variable = var1 var2 var3;
run;
The DESCENDING option tells SAS that the value of the dependent variable we wish to predict is "1", not "0". With no other options selected, SAS will estimate the "full model", meaning all variables will be included regardless of whether their coefficients are significantly different from 0. PROC LOGISTIC also permits:
• forward selection
• backward elimination
• forward stepwise selection
• selection of 'optimal subsets' of independent variables
The default significance levels for entry into and removal from the model can be modified with the SLENTRY and SLSTAY options.

Key Model Statistics
PROC LOGISTIC provides several means of assessing model fit. (All the tables/graphs below are illustrative and yet to be developed for the sample data.)
Model Fit Statistics

Criterion    Intercept Only    Intercept and Covariates
AIC          501.977           470.517
SC           505.968           494.466
-2 Log L     499.977           458.517
Testing Global Null Hypothesis: BETA=0

Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    41.4590       5     <.0001
Gini and Lorentz Curves
A value above 20% would be classified as a good model. Let G(s) be the number of goods with a score less than s, and B(s) be the number of bads with a score less than s. A Lorentz curve is a plot of G(s) against B(s). The Gini coefficient captures the degree to which the distributions differ by calculating the difference between the areas under the G(s) and B(s) curves, i.e. the integral of (G(s) − B(s)) ds. If G(s) and B(s) are identical, the Lorentz curve is the straight line between (0, 0) and (1, 1), and the required integral is 0.
The Lorentz curves above plot the cumulative percentage of bads against the cumulative percentage of goods in the development and validation samples. The dark blue line shows the distribution of goods and bads under random scoring, whereas the brown curve (development sample) and the green curve (validation sample) show the 'lift' provided by the Conversion Rate Model over and above random selection. The model exhibits a similar level of performance across the development and validation samples, as can be seen from the almost overlapping Lorentz curves.
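As a rough way to put numbers on this separation, the two-sample Kolmogorov-Smirnov statistic on the model score can be obtained with PROC NPAR1WAY; the scored dataset and variable names are placeholders.

proc npar1way data=scored edf;
    class bad;                /* 0 = good, 1 = bad          */
    var score;                /* model score or probability */
run;

/* The Kolmogorov-Smirnov D statistic reported here is the KS measure;
   the Gini coefficient can be read off PROC LOGISTIC output as 2*c - 1,
   where c is the c-statistic (area under the ROC curve).               */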
Frequently Asked Questions
Q: What is the C-stat (c)?
Ans: c is the area under the ROC curve = (# of concordant pairs + 0.5 * # of tied pairs) / # of pairs = % concordance + 0.5 * % ties = (1 + Gini)/2.
Q: How can we determine which variable has the maximum contribution in the model?
Ans: To measure the contribution of variables we standardize them, since the variables are on different scales (if the variables share the same unit of measurement, their magnitudes can be compared directly). The contribution of variables in the model can then be measured by: a) the Wald chi-square; b) the point estimate.
Q: What does the point estimate tell us?
Ans: The point estimate (the odds ratio in the PROC LOGISTIC output) tells us how the odds of the dependent variable change for a one-unit change in the predictor.
Divergence Index Test
The divergence index, D = (Xg − Xb) / s, where Xg and Xb are the mean scores of the goods and the bads and s is the pooled standard deviation, is a commonly employed measure of the separation achieved by a model. It is related to a t-distribution (multiply by (G+B)^1/2) if the two population variances are equal. This measure tells us how well the means of the goods and bads are differentiated; a t statistic > |6| shows a high level of differentiation.

Null Hypothesis (H0): the mean score of the goods in the population is less than or equal to the mean score of the bads. A robust model implies that the mean score for goods will be significantly greater than the mean score for bads, i.e. the null hypothesis needs to be rejected. As shown by the p-value in Table 4.2.4, the null hypothesis is rejected at the 1% level of significance.
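A quick practical read on this separation (mean scores of goods vs. bads, pooled standard deviation, and the associated t statistic and p-value) can be obtained with PROC TTEST; the scored dataset and variable names are placeholders.

proc ttest data=scored;
    class bad;        /* 0 = good, 1 = bad */
    var score;
run;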
Clustering checks – A good model should not have significant clustering of the population at any particular score and the population must be well scattered across score points.
Clustering refers to the proportion of accounts falling at various integral values of the model-generated score, where the score is defined as the probability of being a responder (as per the Conversion Rate Model) multiplied by 1000.
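A simple check along these lines is sketched below; the scored dataset and variable names are placeholders.

/* Count accounts at each integral score point and list the largest concentrations */
data score_int;
    set scored;
    score_point = floor(score);    /* score = responder probability * 1000 */
run;

proc freq data=score_int order=freq;
    tables score_point / maxlevels=20;
run;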
Deviance and Residual Test
Both the Pearson and deviance tests assess whether there is sufficient evidence that the observed data do not fit the model. The null hypothesis is that the data fit the model, so if the tests are not significant there is no reason to assume that the model is incorrect, i.e. we accept that the model generally fits the data. For this model both the Pearson and deviance tests come out insignificant, further confirming that the model fits the data.

Hosmer and Lemeshow Test
The Hosmer-Lemeshow goodness-of-fit test tells us whether we have constructed a valid overall model. If the model is a good fit to the data, the test should have an associated p-value greater than 0.05. In our case the associated p-value is high for both the development and the validation samples, signalling that the model is a good fit for the data.
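These tests can be requested directly from PROC LOGISTIC: the LACKFIT option produces the Hosmer-Lemeshow test, and AGGREGATE with SCALE=NONE produces the Pearson and deviance goodness-of-fit tests. Variable and dataset names below are placeholders.

proc logistic data=training descending;
    model bad = var1 var2 var3 / lackfit aggregate scale=none;
run;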
Frequently Asked Questions
Q: What drawback does the Hosmer-Lemeshow test have?
Ans: Hosmer-Lemeshow is a goodness-of-fit test, but the statistic is sensitive to the number of groups (degrees of freedom) used, which makes it volatile.
Confusion Matrix (1 is the Event Indicator) – Development Data
65% of the consumer completions were correctly predicted by the model.
Confusion Matrix (1 is the Event Indicator) – Validation Data
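If needed, a classification table like the confusion matrices above can be produced with the CTABLE option of PROC LOGISTIC; the 0.5 cut-off and the names below are illustrative choices.

proc logistic data=training descending;
    model bad = var1 var2 var3 / ctable pprob=0.5;
run;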
Model Validation
1) Re-estimation on hold-out sample – We re-estimate the model parameters on the hold-out validation sample to ensure they are in close proximity to those from the development sample and that all the other model performance measures hold.
2) Rescoring on bootstrap samples – Samples are selected with replacement, and the equation from the development sample is used to re-score the model on several bootstrap samples of varying sample proportions (20%–80%). The model should satisfy the performance measures stated above. Test statistics show that the model validates for all 5 bootstrap samples within the confidence interval and achieves complete rank-ordering.
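A rough sketch of the bootstrap rescoring step is given below. The sampling rate, the number of replicates, and the assumption that the development model was saved earlier with OUTMODEL= are illustrative, as are all dataset names.

/* Draw bootstrap samples with replacement */
proc surveyselect data=master out=boot_samples
                  method=urs              /* unrestricted random sampling = with replacement */
                  samprate=0.5            /* vary between 20% and 80% as described above     */
                  reps=5                  /* number of bootstrap samples                     */
                  outhits seed=1234;
run;

/* Re-score each sample with the frozen development equation */
proc logistic inmodel=dev_model;          /* assumes OUTMODEL=dev_model was saved at fit time */
    score data=boot_samples out=boot_scored;
run;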