Machine Learning

Class notes from a Yale course on machine learning.

Data Mining and Machine Learning
Sarah Constantin
April 16, 2012

1 Lecture 1

Textbook: Hastie's "Elements of Statistical Learning." Grade: 60 percent bi-weekly homework, 40 percent final project. Class demonstrations are in R; work may be in R or MATLAB.

There are two basic types of problems: classification and regression. Regression relates input variables to a numerical response, trying to predict the response variable from the input variables. Classification is the same thing, except the output variable is discrete. Regression example: predict house price from things like size, school district, number of bedrooms, etc. Classification example: distinguishing spam from ham emails based on the text of an email. We'll start with traditional linear models for classification and regression, and from there try to make linear models more flexible. The other main issue to consider is high-dimensional data – many possible features. For example: a grayscale image with many pixels.

Example: autism. One of the characteristics of autism is impaired social interaction. When watching a movie, autistic subjects pay less attention to social scenes than neurotypical subjects do. Eye-tracking data record where on the screen each subject is looking. Subjects watched "Who's Afraid of Virginia Woolf?" Subjects have classification labels (autistic or neurotypical), and each frame has a data point indicating where the subjects are looking. Can we use this data to build a binary classifier? Some frames have little discriminatory power, but some frames show a significant difference between autistic and neurotypical subjects. So part of this is a variable selection problem. Additionally, we need to take into account the time ordering of the frames.


1.1 Least Squares and Nearest Neighbors

Toy example: a simple prediction method. Two input variables, $x_1$ and $x_2$, and a class label, red or green. Fit a linear function of the inputs,
\[ \hat{Y} = \hat\beta_0 + \sum_{j} X_j \hat\beta_j, \]
or, in matrix form, $\hat{Y} = X^T \hat\beta$. The residual sum of squares is
\[ \mathrm{RSS}(\beta) = \sum_i (y_i - x_i^T\beta)^2 = (y - X\beta)^T (y - X\beta). \]
This is a measure of goodness of fit of the linear model.

K-nearest-neighbors algorithm: for each data point, rank the distances to all other points and identify the $k$ nearest neighboring points. For each grid point, find the $k$ nearest neighbors in the data set; the majority vote among them assigns the classification. This creates a classification boundary which is not necessarily linear. Note that not every data point will have an effect on the classification boundary. K-nearest-neighbors is a local method, building the classification rule from local features; the linear method is a global method.
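A minimal R sketch of the two classifiers on simulated two-class data (the data, the 0.5 threshold for the linear rule, and the choice k = 15 are illustrative assumptions, not from the notes); it uses knn() from the class package.

# Simulated two-class data: class 0 centered at (0,0), class 1 at (1.5,1.5)
set.seed(1)
n <- 100
x <- rbind(matrix(rnorm(2 * n), ncol = 2),
           matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
y <- rep(c(0, 1), each = n)
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# Linear classifier: least squares fit to the 0/1 label, threshold at 0.5
fit <- lm(y ~ x1 + x2, data = train)
yhat_lin <- as.numeric(predict(fit, train) > 0.5)
mean(yhat_lin != train$y)            # training misclassification rate

# K-nearest-neighbors classifier (majority vote among the k = 15 neighbors)
library(class)
yhat_knn <- knn(train = x, test = x, cl = factor(y), k = 15)
mean(yhat_knn != factor(y))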

2 Lecture 2

Simulation example: red points and green points. We assume we don't know the underlying probability distribution. Two input variables, $X_1$ and $X_2$, and classification labels. Can we find a classification rule that predicts the label of a new data point? Procedures: least squares and k-nearest-neighbors. Least squares treats $Y$ as a linear function of $X_1$ and $X_2$: if the predicted value is greater than 0.5, predict red; otherwise predict green. K-nearest-neighbors says that if two data points are close in terms of their input variables, we expect their labels to be similar, so prediction is based on neighboring points: for any data point, look at its $k$ nearest neighbors and give the new point the label of the majority class. The smaller $k$ is, the more seriously the rule takes outliers. How to choose $k$ is a very important question. As the size of the neighborhood increases, are you using more degrees of freedom or fewer? Fewer: at the extreme where the neighborhood is all the data, there is only one possible prediction. Define $N/k$, roughly the number of distinct neighborhoods, as the approximate degrees of freedom. Comparing the linear method and the k-nearest-neighbor procedure by plotting degrees of freedom against error, there is a minimum-error point for test data (on training data, of course, more fitting always looks better).
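A small sketch of that comparison (the simulated data and grid of k values are my own choices for illustration): compute training and test misclassification error of k-NN as k varies, so the error curves can be plotted against the effective degrees of freedom N/k.

library(class)
set.seed(2)
make_data <- function(n) {
  x <- rbind(matrix(rnorm(2 * n), ncol = 2),
             matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
  list(x = x, y = factor(rep(c(0, 1), each = n)))
}
train <- make_data(100)
test  <- make_data(500)

ks <- c(1, 3, 5, 9, 15, 25, 51, 101)
err <- sapply(ks, function(k) {
  tr <- knn(train$x, train$x, train$y, k = k)   # training predictions
  te <- knn(train$x, test$x,  train$y, k = k)   # test predictions
  c(train = mean(tr != train$y), test = mean(te != test$y))
})
# Degrees of freedom is roughly N/k; training error keeps falling as k shrinks,
# while test error has a minimum at an intermediate k.
rbind(df = nrow(train$x) / ks, err)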

2.1 Statistical Decision Theory

Choose $f(x)$ to minimize the expected squared error loss $(Y - f(X))^2$. The expected prediction error is
\[ \mathrm{EPE}(f) = \int (y - f(x))^2\, p(y|x)\, p(x)\, dx\, dy = \int \Big[ \int (y - f(x))^2 p(y|x)\, dy \Big] p(x)\, dx = E_X E_{Y|X}\big( [Y - f(X)]^2 \mid X \big), \]
so it is sufficient to minimize $E_{Y|X}([Y - f(X)]^2 \mid X)$ pointwise. The minimizer is the regression function $f(x) = E[Y \mid X = x]$. K-nearest-neighbors approximates this by $\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$. For linear regression, $f(x) = x^T\beta$, and the EPE is minimized by $\beta = [E(XX^T)]^{-1} E[XY]$.

K-nearest-neighbors directly approximates the Bayes classifier; the conditional probability at a point is relaxed to the conditional probability within a neighborhood of the point, and probabilities are estimated by training-sample proportions.

3 Lecture 3

Linear regression: assume $y_i = x_i^T\beta + \epsilon_i$ where $E[\epsilon_i] = 0$, $\mathrm{Var}[\epsilon_i] = \sigma^2$, and $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$. We minimize the sum of squared errors
\[ \sum_i (y_i - x_i^T\beta)^2 = (y - X\beta)^T (y - X\beta) \]
by setting
\[ 0 = \frac{dS}{d\beta} = \frac{d}{d\beta}\big( y^T y - \beta^T X^T y - y^T X\beta + \beta^T X^T X \beta \big) = -2X^T y + 2 X^T X \beta, \]
so $\hat\beta = (X^T X)^{-1} X^T y$.

The least-squares estimator is unbiased: $E[\hat\beta] = \beta$. Indeed,
\[ E[\hat\beta] = E[(X^T X)^{-1} X^T (X\beta + \epsilon)] = \beta + E[(X^T X)^{-1} X^T \epsilon] = \beta + E\big[ E[(X^T X)^{-1} X^T \epsilon \mid X] \big] = \beta, \]
because $E[\epsilon \mid X] = 0$. Its covariance is $\mathrm{Cov}(\hat\beta) = (X^T X)^{-1}\sigma^2$. The typical estimate of $\sigma^2$,
\[ \hat\sigma^2 = \frac{1}{N - p - 1}\sum_i (y_i - \hat y_i)^2, \]
is an unbiased estimator of $\sigma^2$. If we further assume that the errors $\epsilon \sim N(0, \sigma^2)$, then $\hat\beta$ follows a multivariate normal distribution, $\hat\beta \sim N(\beta, (X^T X)^{-1}\sigma^2)$, so we can do statistical tests of whether $\beta_j = 0$ or not. If we simulate data from an underlying distribution, with a randomly generated error term, and look at the coefficient estimate for each sample so generated, the center of the sampling distribution will be the underlying truth.

Gauss-Markov theorem: the least squares estimate has the smallest variance among all linear unbiased estimates. Since
\[ \mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = \mathrm{Bias}^2 + \mathrm{Variance}, \]
if the bias is 0, minimum MSE means minimum variance. There may, however, exist a biased estimator with smaller MSE: we can trade a little bias for a large reduction in variance.

If variables are highly correlated, the residual vector from regressing one predictor on the others will be very close to zero and the coefficient $\hat\beta_p$ will be unstable. Think of each coefficient as a coefficient in a simple linear regression. If the predictors $x_j$, $j = 1, \dots, p$, are all independent, then in the regression $y \sim x_1 + \dots + x_p$ the effect of each $x_j$ is the same as if you did a simple linear regression on it alone. But in practice the $x$'s very often have some dependence; then the coefficients obtained from multiple linear regression differ from the simple linear regression coefficients. How different? The multiple-regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$ after $x_j$ has been adjusted for all the other $x$'s. When we do the linear regression of $y$ on $x$, the residual is orthogonal to the input space, therefore orthogonal to all the input variables. You can orthogonalize: regress $x_j$ on the other inputs, take the residual, and regress $y$ on that residual; this gives the correct coefficient $\hat\beta_j$. It's a way of finding the pure effect of a variable. Collinearity can lead to unstable estimates because the residual of ($x_j \sim x_i$) has small variance.

Problems with least squares estimates: prediction accuracy. Least squares estimates often have low bias but large variance. Can we select variables (subset selection) so that the estimator is biased but has much lower variance? Best subset regression finds, for each $k < p$, the subset of size $k$ that gives the smallest residual sum of squares. Unfortunately the number of subsets is $2^p$, so it's hard to search through all of subset space. There are two commonly used search procedures: forward stepwise selection and backward stepwise selection. Forward: start with no variables; among all predictors not in the model, choose the one that optimizes a variable selection criterion such as AIC or BIC; continue until no new predictor improves the criterion. Backward is the same but reversed: start with all the variables and remove one at a time until you can't improve the criterion any more. Akaike Information Criterion:
\[ \mathrm{AIC}(\hat\theta) = -2\log(\text{likelihood}) + 2k, \]
where $k$ is the number of parameters. Bayesian Information Criterion:
\[ \mathrm{BIC}(\hat\theta) = -2\log(\text{likelihood}) + (\log N)\,k, \]
where $N$ is the sample size. Consistency of model selection: if the true $f$ is among the candidate families of regression functions, the probability of selecting the true model by BIC approaches 1 as $n \to \infty$.
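A brief R sketch of both pieces on simulated data (the data and variable names are illustrative assumptions): the least-squares estimator computed directly from the formula above, checked against lm(), and then forward stepwise selection driven by AIC using the base-R step() function (passing k = log(N) switches the criterion to BIC).

set.seed(3)
N <- 200; p <- 5
X <- matrix(rnorm(N * p), ncol = p)
colnames(X) <- paste0("x", 1:p)
beta <- c(2, -1, 0, 0, 0)                      # only x1 and x2 matter
y <- drop(X %*% beta + rnorm(N))
dat <- data.frame(y = y, X)

# Least-squares estimate beta_hat = (X^T X)^{-1} X^T y, with an intercept column
Xd <- cbind(1, X)
beta_hat <- solve(t(Xd) %*% Xd, t(Xd) %*% y)
cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ ., data = dat)))

# Forward stepwise selection by AIC (use k = log(N) for BIC instead)
null_fit <- lm(y ~ 1, data = dat)
step(null_fit, scope = ~ x1 + x2 + x3 + x4 + x5, direction = "forward", k = 2)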


4 Lecture 4

Now we look at coefficient shrinkage by ridge regression. This gives a more stable estimator: shrink the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,
\[ \hat\beta^{\mathrm{ridge}} = \arg\min_\beta \sum_i \Big( y_i - \beta_0 - \sum_j x_{ij}\beta_j \Big)^2 + \lambda\sum_j \beta_j^2 . \]
As $\lambda$ increases, you shrink the coefficients harder. This can make the mean squared error smaller than least squares. For the prostate cancer example (Hastie, Chapter 3), the coefficients of some of the input variables fall as $\lambda$ increases, while others grow. The OLS estimate corresponds to $\lambda = 0$; if $\lambda = \infty$ then all coefficients are forced to zero. If the OLS coefficient for a variable is nonzero, then the ridge coefficient is also nonzero: basically everything has a nonzero coefficient. If you want to reduce the number of input variables, this is not ideal.

The lasso instead penalizes the $\ell_1$ norm,
\[ \hat\beta^{\mathrm{lasso}} = \arg\min_\beta \sum_i \Big( y_i - \beta_0 - \sum_j x_{ij}\beta_j \Big)^2 + \lambda\sum_j |\beta_j| . \]
The effect of raising $\lambda$ on the coefficients is quite different: the number of nonzero coefficients gets smaller and smaller. Geometric intuition: an elliptical RSS contour expanding toward the $\ell_1$ ball tends to hit it at a corner such as $(1, 0)$, setting some coefficients exactly to zero, whereas a contour approaching the $\ell_2$ ball hits it at a point where typically no coordinate is zero.

Cross-validation: build your model using your training set, evaluate it using your test set. This avoids overfitting: you can minimize training error without being good at predicting new data. If you have a lot of data, you can split it into different subsets, train a model on one of them and test it on the others. K-fold cross-validation:
\[ E_k(\lambda) = \sum_{i \in \mathrm{part}_k}\big( y_i - x_i^T\hat\beta^{-k}(\lambda) \big)^2, \]
where $\hat\beta^{-k}(\lambda)$ is fit with the $k$th part held out. Ridge regression tends to shrink the coefficients of highly correlated variables toward each other. The shrinkage methods standardize the input variables so they have the same standard deviation and the coefficients are comparable to each other. You get a prediction error estimate from each held-out part, for each of several choices of $\lambda$. Least Angle Regression is a commonly used algorithm for implementing the lasso. You can then see the range of cross-validated MSE and how it changes with the $\ell_1$ norm, so that you can be more confident about choosing the minimum.
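A short sketch with the glmnet package (the simulated data are an assumption; in glmnet's notation alpha = 0 gives ridge and alpha = 1 gives the lasso):

library(glmnet)
set.seed(4)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), ncol = p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)            # only the first two variables matter

ridge <- glmnet(X, y, alpha = 0)               # ridge: l2 penalty
lasso <- glmnet(X, y, alpha = 1)               # lasso: l1 penalty
plot(lasso, xvar = "lambda")                   # coefficients shrink to exactly zero

# 10-fold cross-validation to pick lambda for the lasso
cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cv$lambda.min                                  # lambda minimizing CV error
coef(cv, s = "lambda.min")                     # sparse coefficient vector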

5 Lecture 5

There are still other regression techniques – group lasso, principal components regression, etc. Fused lasso: when the variables $x_1, \dots, x_p$ have some time ordering or spatial structure, penalize
\[ \lambda_1 \sum_{t} |\beta_t| + \lambda_2 \sum_{t=2}^{p} |\beta_t - \beta_{t-1}|, \]
i.e., penalize differences between adjacent coefficients as well as their sizes.

Today is all about classification problems: you're interested in predicting categories. Linear methods: this means the decision boundaries are linear, of the form $\alpha_0 + \sum_j \alpha_j x_j = 0$ – two regions separated by a hyperplane. The expected prediction error is $E[L(G, \hat G(X))]$, where the loss is 0 if you assign a point to the right category and 1 otherwise; this is called 0-1 loss. The Bayes classifier classifies to the most probable class using the conditional distribution: choose the $g$ maximizing $P(g \mid X = x)$.

Linear regression of an indicator matrix: form $K$ class indicators $Y_k$, where $Y_k = 1$ if $G = k$ and 0 otherwise, so you can treat them as numeric responses in a regression,
\[ \hat\beta_k = (X^T X)^{-1} X^T y_k, \qquad \hat y_k = X\hat\beta_k . \]
Training data are of the form $(x_i, g_i)$, data point and class label. Compare the fitted values $\hat Y_1, \hat Y_2, \dots$ and pick the class with the highest $\hat Y_k$. The actual $\hat Y_k$ can be negative or above one, because of the nature of the linear regression line; this is one problem with linear regression for classification – you are not guaranteed that your predicted value is in the appropriate range.

Alternative: model each class density as a Gaussian, so the marginal distribution of $X$ is a mixture of Gaussians with different means and covariances:
\[ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\Big( -\tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) \Big). \]
Linear discriminant analysis (common covariance $\Sigma$): the linear discriminant functions
\[ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k, \]
where the $\pi_k$ are the class priors, give an equivalent description of the decision rule: choose the $k$ that maximizes $\delta_k(x)$. You only retain the terms that depend on $k$.
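A small sketch of indicator-matrix regression in R for a three-class problem (the data are simulated as an illustration, not from the lecture): fit one least-squares regression per class indicator and classify each point to the largest fitted value.

set.seed(5)
n <- 60
x <- rbind(cbind(rnorm(n, 0), rnorm(n, 0)),
           cbind(rnorm(n, 2), rnorm(n, 2)),
           cbind(rnorm(n, -2), rnorm(n, 2)))
g <- rep(1:3, each = n)

Y <- sapply(1:3, function(k) as.numeric(g == k))   # n x K indicator matrix
Xd <- cbind(1, x)                                   # add an intercept column
B <- solve(t(Xd) %*% Xd, t(Xd) %*% Y)               # one column of coefficients per class
Yhat <- Xd %*% B                                    # fitted indicator values
ghat <- max.col(Yhat)                               # classify to the largest fitted value
table(truth = g, predicted = ghat)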

6 Lecture 6

Masking problem: if there are more than two classes, regression on class indicators can completely miss (mask) a middle class, so we need other options – for instance, linear discriminant analysis. The linear discriminant functions are
\[ \delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k . \]
Maximizing this is equivalent to maximizing the posterior probability of the data if we model each class density as a multivariate Gaussian,
\[ f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\Big( -\tfrac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k) \Big), \]
with a common covariance matrix $\Sigma$. Then
\[ P(G = k, X = x) = P(X = x \mid G = k)\,P(G = k) = f_k(x)\,\pi_k, \]
the class densities are $f_1(x), \dots, f_K(x)$, and the marginal is $P(X = x) = \sum_k P(X = x, G = k)$. To maximize the posterior over $k$, maximize the log-likelihood,
\[ \log f_k(x) + \log\pi_k \sim \delta_k(x) \]
modulo terms constant in $k$. The classification boundary between classes $k$ and $l$ is where $\delta_k = \delta_l$, which turns out to be a linear boundary:
\[ \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k - \mu_l)^T\Sigma^{-1}(\mu_k + \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l) = 0 . \]

What are the estimates of the Gaussian parameters? Quite simple:
\[ \hat\pi_k = N_k / N, \]
the proportion in the $k$th class;
\[ \hat\mu_k = \sum_{g_i = k} x_i / N_k, \]
the mean of each class; and the pooled covariance
\[ \hat\Sigma = \sum_{k}\sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K). \]
Each data point gets equal weight. This is reasonable: it estimates the variance-covariance structure for each Gaussian and assumes they are equal. LDA and linear regression of the class indicator are equivalent in the two-class case when $N_1 = N_2$.

Quadratic discriminant analysis is what happens if you don't assume all $\Sigma_k$ to be equal. The discriminant functions are quadratic,
\[ \delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k, \]
so the decision boundary between each pair of classes is a quadratic curve. Where does it come from? Diagonalize $\hat\Sigma_k = V_k D_k^2 V_k^T$, where $V_k$ is $p \times p$ orthonormal and $D_k$ is a diagonal matrix of non-negative values $d_{kl}$. Then
\[ (x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = \big[ D_k^{-1}V_k^T(x - \hat\mu_k) \big]^T\big[ D_k^{-1}V_k^T(x - \hat\mu_k) \big] \]
and
\[ \log|\hat\Sigma_k| = 2\sum_l \log d_{kl} . \]

What's going on: the within-class variance is $W = \hat\Sigma$, and the between-class variance is $B = \sum_k \hat\pi_k(\hat\mu_k - \hat\mu)(\hat\mu_k - \hat\mu)^T$ – how much the class means differ from the center of all the classes. The Fisher method is to spread out the between-class variance as much as possible relative to the within-class variance: for the projection $Z = a^T X$, maximize
\[ \frac{a^T B a}{a^T W a}. \]
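A minimal sketch with the MASS package (the iris data and the train/test split are my own illustrative choices): lda() fits the pooled-covariance model and qda() the class-specific covariances described above.

library(MASS)
set.seed(6)
idx <- sample(nrow(iris), 100)                 # simple train/test split
train <- iris[idx, ]; test <- iris[-idx, ]

fit_lda <- lda(Species ~ ., data = train)      # linear discriminant analysis
fit_qda <- qda(Species ~ ., data = train)      # quadratic discriminant analysis

pred_lda <- predict(fit_lda, test)$class
pred_qda <- predict(fit_qda, test)$class
mean(pred_lda != test$Species)                 # test misclassification rates
mean(pred_qda != test$Species)

fit_lda$prior                                  # estimated class priors pi_k
fit_lda$means                                  # estimated class means mu_k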

7 Lecture 7

One approach to compromise between linear and quadratic discriminant analysis is to allow each class covariance to be a weighted sum of the pooled covariance and the individual one (regularized discriminant analysis). At one extreme it's QDA – a different covariance for each class – and at the other extreme it's LDA – pooled across all the classes. The tuning parameter can be decided by the data.

Dimensionality reduction perspective: project the data onto a sequence of directions so the classes are as well separated as possible. Two-dimensional data with two classes, concentrated around two overlapping ellipses: how can we project the data onto a one-dimensional direction so we can separate the classes as well as possible? Finding this direction is a generalized eigendecomposition problem; both the within-class and between-class variance-covariance structures play a role. Two goals: separate the classes, and be orthogonal to (uncorrelated with) the previous directions.

Logistic regression: a generalized linear model. It's still a linear model, in the sense that you're modeling a linear function of your input variables, but the generalization comes from the fact that you have a classification problem and you model a transformation of the class probability:
\[ \log\frac{p_i}{1 - p_i} = \beta_0 + \beta^T x_i . \]
The 0-1 response $y_i$ is generated from a Bernoulli distribution with probability $p_i$; the link function connects the underlying parameter to the 0-1 response. The generalization to $K$ classes:
\[ \log\frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x . \]
The coefficients are estimated by maximum likelihood. The likelihood is
\[ L(\beta) = \prod_i p_{g_i}(x_i; \beta), \qquad P(Y = y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}, \]
so the log-likelihood for the two-class case is
\[ \ell(\beta) = \sum_i \log p_{g_i}(x_i; \beta) = \sum_i \big[ y_i\log p(x_i; \beta) + (1 - y_i)\log(1 - p(x_i; \beta)) \big]. \]
Maximizing this usually can't be done explicitly – you have to use Newton's method to find the roots of the score equations. The $p_i$ should get close to 1/2 as we get close to the classification boundary; far away from the boundary they should get closer to 0 or 1. The fitting puts more emphasis on the more difficult cases: $p(1 - p)$ is large when $p$ is close to 1/2 and small when $p$ is close to 0 or 1.

Logistic regression or LDA? Under LDA the log-posterior odds between class $k$ and $K$ are also linear functions of $x$:
\[ \log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \tfrac{1}{2}(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x . \]
This linearity comes from the Gaussian assumption and the assumption of a common covariance matrix.
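A short sketch of two-class logistic regression in base R (simulated data; glm() fits the model by iteratively reweighted least squares, the Newton-type method mentioned above):

set.seed(7)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
p <- plogis(-1 + 2 * x1 - x2)                  # true P(Y = 1 | x) under a logistic model
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)     # maximum likelihood via IRLS
summary(fit)$coefficients                      # estimates, SEs, Wald z-tests

phat <- predict(fit, type = "response")        # fitted probabilities
yhat <- as.numeric(phat > 0.5)
mean(yhat != y)                                # training misclassification rate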

8 Lecture 9

(Missed lecture 8.) Univariate input variable; there is an unknown true relationship between $y$ and $x$, and adding noise gives the observed data points. We want to fit a nonlinear trend to the data. How can we do this systematically? What basis functions? Piecewise constant; piecewise linear; continuous piecewise linear. The "knots" are the points where the continuity constraints are imposed. (You can also force the first and second derivatives to be continuous at the knots for more smoothness.) Piecewise cubic polynomials can be discontinuous, continuous, have a continuous first derivative, or a continuous second derivative. This is the idea of the so-called "regression spline": in each local region, to which degree do you want to fit a polynomial, and where are the knots? Then you have a set of basis functions and you can treat the problem as a linear regression problem.

Another method is smoothing splines. This avoids the knot selection problem by using a maximal set of knots. Minimize
\[ \mathrm{RSS}(f, \lambda) = \sum_i (y_i - f(x_i))^2 + \lambda\int f''(t)^2\, dt, \]
penalizing curvature, or wiggling; we assume $f$ has continuous second derivatives. If $\lambda = \infty$ we get the least-squares line fit (no curvature allowed), and if $\lambda = 0$, $f$ can be any interpolating function – no penalty on wiggling. The solution can be written
\[ f(x) = \sum_{j} N_j(x)\,\theta_j, \]
where $N_1(x), \dots, N_N(x)$ are basis functions for a natural spline basis with knots at the data points. For a specific choice of knots you are fitting up to a third-degree polynomial, but beyond the boundary knots the fit is constrained to be linear rather than cubic: the data become sparse there and you don't want to overfit. The criterion reduces to
\[ \mathrm{RSS}(\theta, \lambda) = (y - N\theta)^T(y - N\theta) + \lambda\,\theta^T\Omega_N\theta, \qquad \text{where } \{\Omega_N\}_{jk} = \int N_j''(t)\,N_k''(t)\, dt . \]
Compare to ridge regression, $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$: here, instead of the identity, the penalty uses the matrix $\Omega_N$. The effective degrees of freedom is the trace of $S_\lambda$, where
\[ S_\lambda = N(N^T N + \lambda\Omega_N)^{-1}N^T \quad \text{in} \quad \hat f = N(N^T N + \lambda\Omega_N)^{-1}N^T y . \]
Recall $y = f(x) + \epsilon$; this is analogous to ridge regression. The expected prediction error combines bias and variance:
\[ \mathrm{EPE}(\hat f_\lambda) = E_{X,Y}E_{\mathcal{T}}\big( Y - \hat f_\lambda(X) \big)^2 = E_{X,Y}\big( Y - f(X) \big)^2 + E_X\Big[ \big( f(X) - E_{\mathcal{T}}\hat f_\lambda(X) \big)^2 + E_{\mathcal{T}}\big( E_{\mathcal{T}}\hat f_\lambda(X) - \hat f_\lambda(X) \big)^2 \Big], \]
i.e., irreducible error plus squared bias plus variance.
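A minimal base-R sketch (the simulated sine-curve data are an assumption): smooth.spline() fits the penalized criterion above, with the smoothing parameter chosen by generalized cross-validation or fixed through the effective degrees of freedom.

set.seed(8)
n <- 150
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)               # true trend plus noise

fit_cv  <- smooth.spline(x, y)                 # lambda chosen by generalized CV
fit_df5 <- smooth.spline(x, y, df = 5)         # or fix the effective degrees of freedom

fit_cv$df                                      # effective df = trace of the smoother matrix
plot(x, y, col = "grey")
lines(predict(fit_cv, x), lwd = 2)
lines(predict(fit_df5, x), lty = 2)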

9 Lecture 10

Nonparametric logistic regression: smoothing splines for classification. Two-class logistic regression with an input $X$:
\[ \log\frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = f(x), \]
which implies that the probability that $Y = 1$ given $X = x$ is
\[ p(x) = \frac{e^{f(x)}}{1 + e^{f(x)}} . \]
Fit by maximizing the penalized log-likelihood criterion – the log-likelihood penalized by curvature:
\[ \sum_i \big[ y_i\log p(x_i) + (1 - y_i)\log(1 - p(x_i)) \big] - \tfrac{1}{2}\lambda\int f''(t)^2\, dt = \sum_i \big[ y_i f(x_i) - \log(1 + e^{f(x_i)}) \big] - \tfrac{1}{2}\lambda\int f''(t)^2\, dt . \]
The optimal $f$ is a finite-dimensional natural spline with knots at the values of $x_i$.

Suppose you have more than one input and want to model $Y \sim f(x_1, x_2)$. Use a nonlinear basis expansion of $x_1$ and $x_2$ separately, and also interaction terms between $x_1$ and $x_2$: this is the tensor product of the two sets of basis functions. If $h_j(x_1)$ are basis functions for $x_1$ and $g_l(x_2)$ are basis functions for $x_2$, the tensor product is the set of all possible products of $h$'s and $g$'s. By contrast, $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$ is an additive model without an interaction term. The generalized additive model is
\[ E(Y \mid X_1, X_2, \dots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p), \]
where the $f_j$'s are smooth functions, each fit with, say, a cubic smoothing spline. A penalized residual sum of squares can be specified as the criterion to minimize,
\[ \sum_i \Big( y_i - \alpha - \sum_j f_j(x_{ij}) \Big)^2 + \sum_j \lambda_j\int f_j''(t_j)^2\, dt_j, \]
with a different $\lambda_j$ for each component. The fit uses an iterative (backfitting) approach: fit the smooth functions one at a time. Set $\alpha$ to the mean of $y$, then repeatedly smooth the partial residuals
\[ y_i - \alpha - \sum_{k \neq j} f_k(x_{ik}) \]
against $x_j$ to update $f_j$. Generalized cross-validation is used to choose each value of $\lambda_j$.
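A brief sketch of an additive logistic model with the mgcv package (the data are simulated for illustration); s() requests a smooth term, and mgcv picks each smoothing parameter automatically.

library(mgcv)
set.seed(9)
n <- 300
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
eta <- sin(2 * x1) + 0.5 * x2^2 - 0.5          # true additive log-odds f1(x1) + f2(x2)
y <- rbinom(n, 1, plogis(eta))

fit <- gam(y ~ s(x1) + s(x2), family = binomial)
summary(fit)                                   # effective df and significance of each smooth
plot(fit, pages = 1)                           # estimated smooth functions f1, f2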

10 Lecture 11

Tree-based methods. So far we have extended linear methods to nonlinear models via basis expansion: take a nonlinear transformation of the input variables and fit a linear model in the transformed variables. In the multivariate case this requires the tensor product space, and the dimensionality of the problem grows quickly; the generalized additive model controls this by ignoring the interaction terms between pairs of variables, so the size of the problem grows linearly instead of exponentially.

Alternative: tree-based methods. In the 2-d case, split the input space into small subregions – rectangles – and fit the simplest possible function in each subregion: a constant predictor within each region. It's not a smooth model, but it takes account of interactions, and if $Y$ does not vary very much within each subregion, it's not that bad. The idea is to identify smaller regions where you notice different values of $Y$: divide the input space into smaller regions so that each region is as homogeneous as possible; that's the criterion for the partition. A binary tree: the first splitting point is, say, $X_1 \le t_1$, giving two subsets, and so on.

Growing a regression tree: look at the marginal distributions of all input variables; note spikes – this suggests things about how the data were collected. For some pairs of variables a linear relationship is not enough; the curve looks like the slope is changing. How do we choose the splitting value? Fit one constant in one subset and another in the other, and search over splitting values. Look at the residual sum of squares $\sum_i (y_i - \hat y_i)^2$ for all splitting values: for a split at the $i$th sorted value into regions $R_1$ and $R_2$, minimize
\[ \sum_{x_i \in R_1} (y_i - \hat c_1)^2 + \sum_{x_i \in R_2} (y_i - \hat c_2)^2 . \]
The fewer data points in a subset, the easier it is to fit with a constant. If a split barely reduces the residual sum of squares, we could stop – but that only finds local minima, so instead grow a large tree and prune it. Tree pruning: take a subtree and collapse some nodes together.
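A quick sketch with the rpart package (simulated data with an interaction; the cp values are illustrative): grow a large regression tree, then prune it back using the cross-validated complexity table.

library(rpart)
set.seed(10)
n <- 400
x1 <- runif(n); x2 <- runif(n)
y <- ifelse(x1 < 0.5, 2, ifelse(x2 < 0.5, -1, 4)) + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

big <- rpart(y ~ x1 + x2, data = dat,
             control = rpart.control(cp = 0.001, minsplit = 5))  # grow a large tree
printcp(big)                                   # cross-validated error for each subtree
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned <- prune(big, cp = best_cp)             # prune back to the best subtree
pruned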

11 Lecture 12

This is from Chapter 8 of Hastie. Divide the data into 200 bootstrap samples and fit a classification tree to each of them; if the trees are highly variable, then the model has high variance. In the simulated example, we have two classes, 5 features, each Gaussian-distributed with pairwise correlation 0.95, and the response $Y$ only depends on the first input.

Bootstrap estimation: give the observed values equal probability based on your data. With training data $(x_i, y_i)$, a bootstrap observation $(x^*, y^*)$ is drawn from the empirical distribution putting equal probability $1/N$ on each of the training data points. A bootstrapped sample of the original data draws the same number of data points, but with replacement, so you may see some repeated points. From the bootstrapped samples you can repeat the parameter estimation and see how variable your parameter estimates are.

Bagging averages the fits over bootstrap samples,
\[ \hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x), \]
where $\hat f^{*b}$ is the fitted model based on bootstrap sample $Z^{*b} \sim \hat P$. For K-class classification you can take the majority vote of the trees, or average the class probabilities of the $B$ trees. This gives each data point a better chance of being repeatedly used in training the model. Why does bagging (averaging multiple trees) work? Consider the ideal case, where the $(x_i, y_i)$ are drawn from the population distribution $P$, and define the ideal aggregation $f_{\mathrm{ag}}(x) = E_P\hat f^*(x)$, the expectation of a model fitted to a sample from $P$. Then the prediction error satisfies $E[(Y - f_{\mathrm{ag}}(x))^2] \le E[(Y - \hat f^*(x))^2]$. Why? A bias-variance decomposition:
\[ E\big[ (Y - \hat f^*(x))^2 \big] = E\big[ (Y - f_{\mathrm{ag}}(x) + f_{\mathrm{ag}}(x) - \hat f^*(x))^2 \big] = (Y - f_{\mathrm{ag}}(x))^2 + E\big[ (f_{\mathrm{ag}}(x) - \hat f^*(x))^2 \big] \ge (Y - f_{\mathrm{ag}}(x))^2, \]
since the cross term vanishes. So aggregation never increases mean squared error.
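A compact sketch of bagging regression trees by hand in R (simulated data; B = 100 is chosen for illustration), using rpart as the base learner:

library(rpart)
set.seed(11)
n <- 300
x1 <- runif(n); x2 <- runif(n)
y <- sin(2 * pi * x1) + x2 + rnorm(n, sd = 0.3)
dat <- data.frame(y, x1, x2)
newx <- data.frame(x1 = 0.25, x2 = 0.5)        # a point to predict at

B <- 100
preds <- replicate(B, {
  boot <- dat[sample(n, n, replace = TRUE), ]  # bootstrap sample of the training data
  fit <- rpart(y ~ x1 + x2, data = boot)
  predict(fit, newx)
})
mean(preds)                                    # bagged prediction: average over the B trees
sd(preds)                                      # spread shows how variable a single tree is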

12 Lecture 13

Bagging, or bootstrap aggregation, is best for unstable methods. The tree method is an example of an unstable method that responds very well to bagging: construct a tree for each bootstrap sample, and base the prediction on the average of the trees. For classification there are two ways of combining the trees – majority vote, or the average of the classification probabilities.

Random forests are an improved version of the bagging idea: reduce the correlation between the trees grown on the bootstrapped samples. You reduce variance by averaging independent identically distributed random variables: if $\mathrm{Var}(x_i) = \sigma^2$, then
\[ \mathrm{Var}\Big( \frac{1}{B}\sum_{i=1}^{B} x_i \Big) = \sigma^2/B . \]
But different trees are grown from the same data, so they have some positive correlation, $\mathrm{cor}(x_i, x_j) = \rho$, and we can't use the independence assumption:
\[ \mathrm{Var}\Big( \frac{1}{B}\sum_{i=1}^{B} x_i \Big) = \frac{1}{B^2}\Big( \sum_{i=1}^{B}\mathrm{Var}(x_i) + \sum_{i\ne j}\mathrm{Cov}(x_i, x_j) \Big) = \frac{1}{B^2}\big( B\sigma^2 + B(B-1)\rho\sigma^2 \big) = \sigma^2\big( 1/B + (1 - 1/B)\rho \big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 . \]

Random forest algorithm (see Ch. 15): draw a bootstrap sample from the training data, and grow a random-forest tree on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree: 1. select $m$ variables at random from the $p$ variables; 2. pick the best variable/split-point among the $m$ predictor variables; 3. split the node into two daughter nodes. The smaller $m$, the lower the chance that any given tree picks the best split, but the tradeoff is that smaller $m$ gives less correlated trees. You have to tune the parameter.

In a bootstrap sample, some observations are repeated and some are left out. In a random forest, what do you do with the left-out (out-of-bag, OOB) observations? You evaluate the prediction accuracy on $(x_i, y_i)$ based on the predictions from those trees in which $(x_i, y_i)$ didn't show up, so you get an out-of-sample performance evaluation automatically. The error drops sharply after only a fairly small number of trees.

Variable importance measures the prediction strength of each variable: record the out-of-bag prediction accuracy, then randomly permute the values of the $j$th variable in the OOB samples and compute the accuracy again. Permuting $x_j$ removes its predictive effect, since the relationship between $x_j$ and $y$ is broken. If $x_j$ was irrelevant, this won't change the prediction much; but if it is relevant, the prediction accuracy will decrease.
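A short sketch with the randomForest package (simulated data; mtry plays the role of m above, and importance = TRUE records the permutation-based variable importance):

library(randomForest)
set.seed(12)
n <- 300; p <- 6
X <- data.frame(matrix(rnorm(n * p), ncol = p))
names(X) <- paste0("x", 1:p)
y <- 2 * X$x1 - X$x2 + rnorm(n)                # only x1 and x2 are relevant

fit <- randomForest(X, y, mtry = 2, ntree = 500, importance = TRUE)
fit                                            # includes the OOB mean squared error
importance(fit)                                # permutation importance: x1, x2 should dominate
varImpPlot(fit)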

13 Lecture 14

The idea of boosting is to build a classifier
\[ G(x) = \operatorname{sign}\Big[ \sum_{m=1}^{M}\alpha_m G_m(x) \Big]. \]
This is a weighted sum rather than a plain average. Also, the $G_m$ are not generated from bootstrap samples; each is fit to a reweighted version of the data, reweighted in some intelligent way. The samples are not independent, as in bootstrapping: each depends on the performance of the previous one, so there is a sequential order.

AdaBoost defines how to generate the weights and samples. At step $m$, fit a classifier $G_m$ to the training data using weights $w_i$. Compute the weighted error rate
\[ \mathrm{err}_m = \frac{\sum_i w_i\,I(y_i \ne G_m(x_i))}{\sum_i w_i} \]
– how many mistakes did you make, weighted by the $w_i$. Compute $\alpha_m = \log\big( (1 - \mathrm{err}_m)/\mathrm{err}_m \big)$, the logit of the mistakes. Set
\[ w_i \leftarrow w_i\exp\big( \alpha_m I(y_i \ne G_m(x_i)) \big), \]
upweighting the misclassified points. Repeat for $m = 1, \dots, M$, then output the weighted sum of classifiers
\[ G(x) = \operatorname{sign}\Big[ \sum_{m=1}^{M}\alpha_m G_m(x) \Big]. \]

Boosting fits an additive model,
\[ f(x) = \sum_m \beta_m b(x; \gamma_m). \]
Tree-based methods are examples of this: the basis functions are step functions and $\gamma$ parametrizes the split variables and split points. Fit these functions by minimizing an average loss over the training data,
\[ \min_{\{\beta_m, \gamma_m\}} \sum_i L\Big( y_i, \sum_m \beta_m b(x_i; \gamma_m) \Big). \]
Forward stagewise additive modeling, in general, computes the best $\beta$ and $\gamma$ one term at a time,
\[ (\beta_m, \gamma_m) = \arg\min_{\beta,\gamma} \sum_i L\big( y_i, f_{m-1}(x_i) + \beta\,b(x_i; \gamma) \big), \]
and sets $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$. AdaBoost is an example of this: it is equivalent to forward stagewise additive modeling with the exponential loss $L(y, f(x)) = e^{-yf(x)}$ (but there are other possible monotone decreasing functions of the margin). At each step,
\[ (\beta_m, G_m) = \arg\min_{\beta,G}\sum_i w_i^{(m)}\exp\big( -y_i\,\beta\,G(x_i) \big). \]
If you fix $\beta$ and optimize with respect to $G$, you're minimizing weighted classification error,
\[ G_m = \arg\min_G \sum_i w_i^{(m)} I(y_i \ne G(x_i)). \]
Plugging this $G_m$ in and solving for $\beta$, one obtains
\[ \beta_m = \tfrac{1}{2}\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m} . \]
The exponential loss is more sensitive to changes in the estimated class probabilities than the 0-1 loss; the misclassification error rate will suggest you stop sooner than the exponential loss does.
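A compact AdaBoost.M1 sketch in R using depth-one rpart trees (stumps) as the weak learner G_m; the data, M = 50, and the use of rpart's case weights are illustrative assumptions.

library(rpart)
set.seed(13)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- ifelse(X$x1^2 + X$x2^2 > 1.5, 1, -1)      # labels coded as +1 / -1
dat <- data.frame(y = factor(y), X)

M <- 50
w <- rep(1 / n, n)                             # initial observation weights
alpha <- numeric(M)
stumps <- vector("list", M)
for (m in 1:M) {
  stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  pred <- ifelse(predict(stumps[[m]], X)[, "1"] > 0.5, 1, -1)
  err <- sum(w * (pred != y)) / sum(w)         # weighted error rate
  alpha[m] <- log((1 - err) / err)
  w <- w * exp(alpha[m] * (pred != y))         # upweight the misclassified points
}
# Final classifier: sign of the weighted vote of the stumps
score <- Reduce(`+`, lapply(1:M, function(m) {
  alpha[m] * ifelse(predict(stumps[[m]], X)[, "1"] > 0.5, 1, -1)
}))
mean(sign(score) != y)                         # training error of the boosted classifier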

14 Lecture 15

Generalize the idea of AdaBoost to a model of boosted trees,
\[ f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m), \]
a sum of individual trees, each parametrized by $\Theta_m$, which encodes the tree structure – the splitting variables, splitting values, and terminal-node constants. The training loss is
\[ \sum_{i=1}^{N} L(y_i, f_M(x_i)); \]
for example, the loss function can be $(y_i - f_M(x_i))^2$, the regression or $L_2$ loss, or the exponential loss $e^{-y_i f_M(x_i)}$. In practice it should be differentiable. Solving
\[ \Theta_m = \arg\min_{\Theta}\sum_{i=1}^{N} L\big( y_i, f_{m-1}(x_i) + T(x_i; \Theta) \big) \]
gives you the best choice of the next tree, given all the trees you have so far. To find the fitted value for each subregion, you just need to optimize the loss function over the points falling in that subregion to choose the optimal constant.

We fit the tree model by gradient descent. Go in the direction of steepest descent: take the step $-\rho_m g_m$, where $\rho_m$ is a scalar and $g_m$ is the gradient of $L(f)$ evaluated at $f = f_{m-1}$, and update the solution, $f_m = f_{m-1} - \rho_m g_m$. Gradient tree boosting (MART): at each step, compute the derivative of the loss function at each point $i$, and fit a regression tree to the targets $r_{im}$, the negative gradients, giving terminal regions $R_{jm}$. Choose the optimal coefficients
\[ \gamma_{jm} = \arg\min_{\gamma}\sum_{x_i \in R_{jm}} L\big( y_i, f_{m-1}(x_i) + \gamma \big), \]
and update
\[ f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm} I(x \in R_{jm}). \]
Shrinkage: shrink the contribution of each tree by a factor $\nu$ when it is added to the current approximation,
\[ f_m(x) = f_{m-1}(x) + \nu\sum_{j}\gamma_{jm} I(x \in R_{jm}), \qquad 0 < \nu < 1. \]
This is analogous to penalized least squares (like ridge regression or the lasso). $J$, the tree size, is a meta-parameter that determines how much overfitting happens.
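A brief sketch with the gbm package (data simulated for illustration; in gbm's arguments, n.trees is M, interaction.depth is the tree size J, and shrinkage is the factor ν):

library(gbm)
set.seed(14)
n <- 500
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y <- sin(3 * x1) + 2 * (x2 > 0.5) * x3 + rnorm(n, sd = 0.2)
dat <- data.frame(y, x1, x2, x3)

fit <- gbm(y ~ x1 + x2 + x3, data = dat, distribution = "gaussian",
           n.trees = 2000, interaction.depth = 2, shrinkage = 0.05,
           cv.folds = 5)
best_M <- gbm.perf(fit, method = "cv")         # number of trees minimizing CV error
summary(fit)                                   # relative influence of each variable
head(predict(fit, dat, n.trees = best_M))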

15 Lecture 16

Non-parametric, unsupervised learning techniques: we have no training labels. PCA: a linear approximation of the data which captures the most variation in the data – a projection onto a smaller number of dimensions. Useful in high-dimensional situations, and also for visualization and compression. The first linear component is
\[ z_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1p}x_p, \]
where the sample variance of the projection $z_1$ is greatest among all such linear combinations with $\|a_1\| = 1$. The second linear component is
\[ z_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2p}x_p \]
such that $a_2^T a_1 = 0$ (orthogonal to the first direction), $\|a_2\| = 1$, and the variance is maximized. In general, the $j$th principal component $z_j$ is the linear combination $z_j = a_j^T X$ which has the greatest variance subject to the conditions that $\|a_j\| = 1$ and $a_j$ is orthogonal to all previous components. Note
\[ \mathrm{var}(z_1) = \mathrm{var}(Xa_1) = a_1^T S a_1, \qquad \text{where } S = \tfrac{1}{N-1}X^T X \]
is the sample variance-covariance matrix (of the centered data), and $\|a_1\| = \sqrt{\sum_l a_{1l}^2}$. The optimization problem – maximize the variance of the linear combination subject to unit norm – can be solved as an eigendecomposition problem.

Intuitively: you have a data cloud; the direction of greatest eccentricity, the greatest radius, is the first principal direction. There is a theorem that any symmetric matrix has a decomposition $A = \Gamma\Lambda\Gamma^T$, where $\Lambda$ is diagonal and $\Gamma$ is orthogonal. Why is PCA optimal? Consider a rank-$q$ linear model for representing the observations, $x_i \approx \mu + V_q\lambda_i$. Fitting this by least squares means minimizing the reconstruction error
\[ \sum_i \|x_i - \mu - V_q\lambda_i\|^2, \quad \text{or} \quad \sum_i \big\| (x_i - \bar x) - V_q V_q^T(x_i - \bar x) \big\|^2 \]
after optimizing over $\mu$ and the $\lambda_i$. If we assume $\bar x = 0$, this yields the projection $V_q V_q^T x_i$, and the singular value decomposition $X = UDV^T$ gives an optimal choice for $V_q$ (the first $q$ columns of $V$).
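A minimal base-R sketch (simulated correlated data): prcomp() computes the principal components via the SVD, and the same directions can be recovered from the eigendecomposition of the sample covariance matrix.

set.seed(15)
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n, sd = 0.3),
           x2 = z + rnorm(n, sd = 0.3),
           x3 = rnorm(n))                      # x1 and x2 share a common factor

pc <- prcomp(X, center = TRUE, scale. = FALSE)
summary(pc)                                    # proportion of variance explained
pc$rotation[, 1]                               # loadings a_1 of the first component
head(pc$x[, 1])                                # scores z_1 = a_1^T (x - xbar)

# Same direction from the eigendecomposition of the sample covariance matrix
eigen(cov(X))$vectors[, 1]                     # equal to pc$rotation[, 1] up to sign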

16 Lecture 17

PCA is a linear method: it projects onto eigenvectors, which are linear combinations of the original coordinates. Kernel PCA: map the input domain to a feature space, $\Phi: \mathcal{X} \to \mathcal{H}$; this transformation is generally nonlinear, and we look at the data in the new feature space. Consider the covariance matrix in the feature space, $S = \frac{1}{n}\sum_{j=1}^{n}\Phi(x_j)\Phi(x_j)^T$, and find its eigendecomposition. Any eigenvector $v$ with eigenvalue $\lambda$ satisfies
\[ \lambda\langle\Phi(x_i), v\rangle = \langle\Phi(x_i), Sv\rangle \text{ for all } i, \qquad v = \sum_i a_i\Phi(x_i). \]
The $a_i$ are unknown. Putting these together, with $K_{ij} = \langle\Phi(x_i), \Phi(x_j)\rangle$, we get
\[ n\lambda Ka = K^2 a, \]
which reduces to the eigenproblem $n\lambda a = Ka$. Kernel trick: formulate PCA as an eigendecomposition of the kernel matrix. Center $K$ by subtracting the column means and row means (and adding back the grand mean). Compute the projections onto the eigenvectors,
\[ \langle v^j, \Phi(x)\rangle = \sum_i \alpha_i^j\, k(x_i, x), \]
where the $\alpha_i^j$ are the eigenvector expansion coefficients. Examples of positive definite kernels: the linear kernel $K(x, x') = x^T x'$; polynomial kernels $K(x, x') = (c + x^T x')^d$; Gaussian kernels $K(x, x') = e^{-\|x - x'\|^2/2\sigma^2}$. Mercer's theorem: a positive definite kernel corresponds to a feature map
\[ \Phi(x) = \big( \sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \dots \big). \]
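A self-contained sketch of kernel PCA with a Gaussian kernel in base R (the two-circles data and σ = 1 are illustrative choices; the eigenvector normalization is omitted, so projections are correct only up to a per-component scaling):

set.seed(16)
n <- 100
theta <- runif(2 * n, 0, 2 * pi)
r <- rep(c(1, 3), each = n)                    # two concentric circles
X <- cbind(r * cos(theta), r * sin(theta)) + matrix(rnorm(4 * n, sd = 0.1), ncol = 2)

sigma <- 1
K <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))   # Gaussian kernel matrix

# Center the kernel matrix: Kc = K - 1K - K1 + 1K1
m <- matrix(1 / (2 * n), 2 * n, 2 * n)
Kc <- K - m %*% K - K %*% m + m %*% K %*% m

eig <- eigen(Kc, symmetric = TRUE)
alpha <- eig$vectors[, 1:2]                    # expansion coefficients for the first 2 components
scores <- Kc %*% alpha                         # projections of each point onto v^1, v^2
plot(scores, col = rep(1:2, each = n))         # the two circles tend to separate in kernel PC space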

17 Lecture 18

Sparse PCA: formulate PCA as a regression-type optimization problem, impose the lasso constraint, and solve the penalized optimization problem. The lasso penalty, $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$, imposes sparseness; the ridge penalty is $\|Y - X\beta\|^2 + \lambda\|\beta\|^2$. If
\[ (\hat\alpha, \hat\beta) = \arg\min_{\alpha,\beta}\sum_i \|x_i - \alpha\beta^T x_i\|^2 + \lambda\|\beta\|^2 \]
with $\|\alpha\|^2 = 1$, then $\hat\beta$ is proportional to $V_1$, the first principal component direction. This is the same thing as PCA, written as a regression with a slackness condition. For the first $k$ principal components, with $A_{p\times k} = [\alpha_1 \dots \alpha_k]$ and $B_{p\times k} = [\beta_1 \dots \beta_k]$,
\[ (\hat A, \hat B) = \arg\min_{A,B}\sum_i \|x_i - AB^T x_i\|^2 + \lambda\sum_{j=1}^{k}\|\beta_j\|^2 \]
subject to $A^T A = I_{k\times k}$; then $\hat\beta_j$ is proportional to $V_j$. Adding an $\ell_1$ penalty on the $\beta_j$ to this criterion – an elastic-net-type penalty – gives sparse loadings. The motivation for the elastic net: you have variables $x_1, x_2, \dots$ and a subset of them captures most of the variation in the data, but they may be highly collinear; instead of picking just one of them to include in the model, pick them as a group.

Why does the ridge-penalized regression recover the principal component directions? Use the SVD $X = UDV^T$, so the $i$th principal component score vector is $XV_i$. Ridge regression of $XV_i$ on $X$ gives
\[ \hat\beta^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1}X^T(XV_i) = V\Big( \frac{D^2}{D^2 + \lambda I} \Big)V^T V_i = V_i\,\frac{d_i^2}{d_i^2 + \lambda}, \]
which is proportional to $V_i$.
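A small base-R check of that last identity on simulated data (purely illustrative): ridge regression of the first principal component score XV_1 on X returns a coefficient vector proportional to V_1.

set.seed(17)
n <- 200; p <- 4
X <- scale(matrix(rnorm(n * p), ncol = p) %*% matrix(runif(p * p), p, p),
           center = TRUE, scale = FALSE)       # centered, correlated predictors

sv <- svd(X)                                   # X = U D V^T
V1 <- sv$v[, 1]
z1 <- X %*% V1                                 # first principal component scores

lambda <- 10
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% z1)

cbind(ridge = drop(beta_ridge) / sqrt(sum(beta_ridge^2)),  # normalized ridge coefficients...
      V1    = V1)                              # ...should match V1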

18 Lecture 19

Clustering: partition objects into homogeneous groups, so that objects are more similar within each cluster. The obvious measure of dissimilarity is distance: Euclidean distance, or the sum of absolute differences ($L_1$ distance). In practice you might have categorical data, binary data, etc. For categorical variables, use a cost matrix: zero along the diagonal, and otherwise $d(A, B)$ is a cost value based on your domain knowledge – how bad is that wrong choice?

K-means algorithm: choose $K$ centroids, one for each cluster, then iterate the following: (1) assign each object to the cluster with the closest centroid; (2) update the centroid of each cluster to the mean of all objects in the cluster. This minimizes the squared error criterion
\[ W(S, c) = \sum_{k=1}^{K}\sum_{i \in S_k} d(i, c_k). \]
The assignment step minimizes the criterion given a choice of centroids $c$, and the update step minimizes it given a set of clusters $S$; the algorithm converges in a finite number of steps. Let
\[ T(Y) = \sum_{i=1}^{N}(Y_i - \bar Y)^2 \]
be the total variance, and define
\[ W(S, C) = \sum_{k=1}^{K}\sum_{i \in S_k}(Y_i - \bar Y_k)^2, \qquad B(S, C) = \sum_{k=1}^{K} N_k(\bar Y_k - \bar Y)^2, \]
the within-cluster variance and the between-cluster variance. Then $T(Y) = W(S, C) + B(S, C)$ – a decomposition reminiscent of the bias-variance tradeoff. The cross term equals 0:
\[ \sum_k\sum_{i \in S_k}(Y_i - \bar Y_k)(\bar Y_k - \bar Y) = \sum_k(\bar Y_k - \bar Y)\sum_{i \in S_k}(Y_i - \bar Y_k), \]
and the inner sum is 0, so the total is 0. How do we choose $K$? Plot the within-cluster variance against $K$ and look for a kink, where the $(K-1)$-cluster solution doesn't resemble the $K$-cluster solution at all. Consistency: how much does the cluster assignment change over different random subsets of the data? (Cross-validation.)
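A short base-R sketch (three simulated clusters; nstart = 20 restarts guards against bad local minima): run kmeans() for a range of K and plot the within-cluster sum of squares to look for the kink.

set.seed(18)
X <- rbind(cbind(rnorm(50, 0), rnorm(50, 0)),
           cbind(rnorm(50, 4), rnorm(50, 4)),
           cbind(rnorm(50, 0), rnorm(50, 8)))

Ks <- 1:8
wss <- sapply(Ks, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(Ks, wss, type = "b", xlab = "K", ylab = "within-cluster sum of squares")
# The curve drops steeply up to K = 3 and flattens afterward: the kink suggests K = 3.

fit <- kmeans(X, centers = 3, nstart = 20)
table(fit$cluster, rep(1:3, each = 50))        # the clusters recover the three groups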

19 Lecture 20

Last time: K-means. The between-cluster sum of squares should be large – it measures how separated the clusters are – and the within-cluster sum of squares should be as small as possible.

Model-based clustering: assume there's an underlying model, a mixture density
\[ f(y) = \sum_k p_k\,\phi(y; \mu_k, \Sigma_k), \]
where the $p_k$ are the mixture probabilities. Estimate $p_k, \mu_k, \Sigma_k$ by maximum likelihood, computed by an iterative procedure, the EM algorithm. The latent variable is the class membership: if you knew it, the estimates of $\mu$ and $\Sigma$ would be straightforward. In the E-step, estimate the expectations of the latent variables given the current parameters; in the M-step, re-estimate the parameters given the expectations of the latent variables. At each E-step, estimate the posterior probability for each data point,
\[ g_{ik} = \frac{p_k\,\phi(y_i; \mu_k, \Sigma_k)}{\sum_l p_l\,\phi(y_i; \mu_l, \Sigma_l)}. \]
At each M-step, given the $g_{ik}$, find the $p_k, \mu_k, \Sigma_k$ maximizing the log-likelihood:
\[ \mu_k = \sum_i g_{ik}\,y_i / g_k, \qquad \Sigma_k = \sum_i g_{ik}\,(y_i - \mu_k)(y_i - \mu_k)^T / g_k, \qquad \text{where } g_k = \sum_i g_{ik}. \]
Each data point is assigned to the $k$ for which $g_{ik}$ is maximal.

Connection with K-means: assume $\Sigma_k = \sigma^2 I$, i.e., no correlation between the coordinates of the multivariate data. Then maximizing the log-likelihood amounts to minimizing
\[ \ell(\mu_k, \sigma^2, S_k) = \sum_k\sum_{i \in S_k}(y_i - \mu_k)^T(y_i - \mu_k)/2\sigma^2, \]
which is the squared error criterion of the K-means algorithm.

In the extreme case, if you observe two well-separated clusters, what would you expect to see if you fit a single regression line? A line that follows the axis between the clusters – but the relationship within each cluster would be two separate regression lines.
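A compact sketch of the E- and M-steps for a univariate two-component Gaussian mixture in base R (simulated data; the starting values and 50 iterations are illustrative choices):

set.seed(19)
y <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1.5))

# Initial guesses for the mixture weights, means, and standard deviations
p  <- c(0.5, 0.5); mu <- c(min(y), max(y)); s <- c(1, 1)

for (iter in 1:50) {
  # E-step: posterior probability g_ik that point i belongs to component k
  d <- cbind(p[1] * dnorm(y, mu[1], s[1]),
             p[2] * dnorm(y, mu[2], s[2]))
  g <- d / rowSums(d)
  # M-step: re-estimate weights, means, and variances from the g_ik
  gk <- colSums(g)
  p  <- gk / length(y)
  mu <- colSums(g * y) / gk
  s  <- sqrt(c(sum(g[, 1] * (y - mu[1])^2),
               sum(g[, 2] * (y - mu[2])^2)) / gk)
}
rbind(weight = p, mean = mu, sd = s)           # close to the true (0.6, 0.4), (0, 5), (1, 1.5)
cluster <- max.col(g)                          # assign each point to its most probable component
table(cluster)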
