Machine Learning Algorithms - Summary + R Code
Summary and R code for several machine learning algorithms.
Supervised Learning Algorithms
1 Supervised Learning by Empirical Risk Minimization (ERM)
 1.1 Empirical Risk Minimization and Inductive Bias
 1.2 Ordinary Least Squares (OLS)
 1.3 Ridge Regression
 1.4 LASSO
 1.5 Logistic Regression
 1.6 Regression Classifier
 1.7 Linear Support Vector Machines (SVM)
 1.8 Generalized Additive Models (GAMs)
 1.9 Projection Pursuit Regression (PPR)
 1.10 Neural Networks (NNETs)
 1.11 Classification and Regression Trees (CARTs)
 1.12 Random Forests
 1.13 Rotation Forest
 1.14 Smoothing Splines
2 Non-ERM Supervised Learning
 2.1 k-Nearest Neighbour (KNN)
 2.2 Kernel Regression
 2.3 Local Likelihood and Local ERM
 2.4 Boosting
 2.5 Learning Vector Quantizations (LVQ)
3 Dimensionality Reduction in Supervised Learning
 3.1 Variable Selection
 3.2 LASSO
 3.3 Principal Component Regression (PCAR)
 3.4 Partial Least Squares (PLS)
 3.5 Canonical Correlation Analysis (CCA)
 3.6 Reduced Rank Regression (RRR)
4 Generative Models in Supervised Learning
 4.1 Fisher's Linear Discriminant Analysis (LDA)
 4.2 Fisher's Quadratic Discriminant Analysis (QDA)
 4.3 Naive Bayes
5 Ensembles
 5.1 Committee Methods
 5.2 Bayesian Model Averaging
 5.3 Stacking
 5.4 Bootstrap Averaging (Bagging)
 5.5 Boosting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Classification:
• Kernel Density Classification.
• Naive Bayes Classifier - has the form of a generalized additive model, though the models are fit in quite different ways.
• Mixture Models for Density Estimation and Classification - can be viewed as a kind of kernel method.
1.9 Projection Pursuit Regression (PPR)
Another way to generalize the hypothesis class F, which generalizes the GAM model, is to allow f to be some simple function of a linear combination of the predictors, of the form

f(x) = Σ_{m=1}^{M} g_m(w_m′ x),    (1.9)

where both g_m and w_m are learned from the data. The regularization is now performed by choosing M and the class of {g_m}_{m=1}^{M}.
Note: PPR is not a pure ERM. Just like the GAM problem, in the PPR problem {g_m}_{m=1}^{M} are learned by Kernel Regression. Solving the PPR problem is thus a hybrid of ERM and Kernel Regression algorithms.
Note: If M is taken arbitrarily large, then for an appropriate choice of g_m the PPR model can approximate any continuous function in R^p arbitrarily well. Such a class of models is called a universal approximator. However, this generality comes at a price: interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data. Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions g_m(v), while the neural network uses a far simpler function based on the sigmoid σ(v).
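As an added minimal sketch (not from the original notes), base R's ppr() fits a projection pursuit regression; the simulated data and the choice of nterms below are illustrative assumptions.

# Projection pursuit regression with stats::ppr()
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- sin(x1 + x2) + 0.5 * (x1 - x2)^2 + rnorm(n, sd = 0.1)
dat <- data.frame(y, x1, x2)

fit.ppr <- ppr(y ~ x1 + x2, data = dat, nterms = 2)  # M = 2 ridge functions g_m(w_m'x)
summary(fit.ppr)                                     # the estimated directions w_m
predict(fit.ppr, newdata = dat[1:5, ])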
1.10 Neural Networks (NNETs) - Single Hidden Layer
We introduce the NNET model via the PPR model, and not through its historically original construction. In the language of Eq.(1.9), a single-hidden-layer, feed-forward neural network is a model where {g_m}_{m=1}^{M} are not learned from the data, but rather assumed a-priori:

g_m(x) := β_m σ(α_m′ x),

where only {α_m, β_m}_{m=1}^{M} are learned from the data. A typical activation function is the standard logistic CDF: σ(t) = 1/(1 + e^{−t}). As can be seen, the NNET is merely a non-linear regression model, the parameters of which are often called weights.
Loss Functions: Like any other ERM problem, we are free to choose the appropriate loss function.
Universal Approximator: Like the PPR, even when {g_m}_{m=1}^{M} are fixed beforehand, the class is still a universal approximator.
Regularization: Regularization of the model is done via the selection of σ, the number of nodes/variables in the network, and the number of layers.
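A minimal added sketch of a single-hidden-layer network, assuming the nnet package (the data and tuning values are made up for illustration).

library(nnet)
set.seed(1)
dat   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- with(dat, sin(x1) + x2^2 + rnorm(200, sd = 0.1))

# size = number of hidden nodes; decay = l2 penalty on the weights (regularization);
# linout = TRUE for a continuous outcome, i.e., squared loss.
fit.nnet <- nnet(y ~ x1 + x2, data = dat, size = 5, decay = 0.01,
                 linout = TRUE, maxit = 500, trace = FALSE)
mean((dat$y - predict(fit.nnet, dat))^2)   # in-sample MSE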
1.11 Classification and Regression Trees (CARTs)
CARTs are a type of ERM where f(x) can be a very non-smooth function that can be interpreted as a set of "if-then" rules, also known as a decision tree. The hypothesis class of CARTs includes functions of the form

f(x) = Σ_{m=1}^{M} c_m I{x ∈ R_m}.

The parameters of the model are the different conditions {R_m}_{m=1}^{M} and the function's value at each condition, {c_m}_{m=1}^{M}.
Regularization: is done by the choice of M, which is called the tree depth.
Loss Functions: As usual, a squared loss can be used for continuous outcomes y. For categorical outcomes, the loss function is called the impurity measure.
Impurity Measure: One can use either the misclassification error, the multinomial likelihood (known as the deviance, or cross-entropy), or a first-order approximation of the latter known as the Gini Index.
Universal Approximator: CART is a universal approximator.
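An added minimal sketch of a classification tree, assuming the rpart package and the built-in iris data.

library(rpart)
# rpart uses the Gini index impurity by default for categorical outcomes;
# cp (complexity parameter) regularizes the size of the tree.
fit.tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0.01))
printcp(fit.tree)                           # complexity-parameter (pruning) table
pred <- predict(fit.tree, iris, type = "class")
table(pred, iris$Species)                   # confusion table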
1.12 Random Forests
Trees are very flexible hypothesis classes. They thus have small bias but large variance. Bagging trees reduces this variance by averaging trees fit to different bootstrap samples. Alas, the variance (thus the MSE) of bagged trees is bounded from below by the fact that the trees use the same variables and are thus correlated. To remedy this, [Breiman, 2001] proposed to fit trees to bootstrapped samples using only a random subset of the variables at each split. This decorrelates the trees, thus allowing a further reduction in their variance (and hence their MSE).
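An added minimal sketch, assuming the randomForest package; ntree and mtry below are illustrative choices.

library(randomForest)
set.seed(1)
fit.rf <- randomForest(Species ~ ., data = iris,
                       ntree = 500,   # number of bootstrapped trees to average
                       mtry  = 2)     # random subset of variables tried at each split
fit.rf                                # out-of-bag (OOB) error estimate
importance(fit.rf)                    # variable importance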
1.14 Smoothing Splines
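A minimal added base-R sketch of a smoothing spline fit (the simulated data are illustrative); smooth.spline() chooses the smoothing parameter by generalized cross-validation by default.

set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
fit.ss <- smooth.spline(x, y)            # lambda chosen by generalized cross-validation
plot(x, y)
lines(predict(fit.ss, x), col = "red")   # the fitted smooth curve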
Unsupervised Learning
1 Introduction to Unsupervised Learning
2 Density Estimation
 2.1 Parametric Density Estimation
 2.2 Kernel Density Estimation
 2.3 Graphical Models
3 High Density Regions
 3.1 Association Rules
4 Linear-Space Embeddings
 4.1 Principal Components Analysis (PCA)
 4.2 Random Projections
 4.3 Sparse Principal Component Analysis (sPCA)
 4.4 Multidimensional Scaling (MDS)
 4.5 Local MDS
 4.6 Isometric Feature Mapping (Isomap)
5 Non-Linear-Space Embeddings
 5.1 Kernel Principal Component Analysis (kPCA)
 5.2 Self Organizing Maps (SOM)
 5.3 Principal Curves and Surfaces
 5.4 Local Linear Embedding (LLE)
 5.5 Auto Encoders
 5.6 Matrix Factorization
 5.7 Information Bottleneck
6 Latent Space Generative Models
 6.1 Factor Analysis (FA)
 6.2 Independent Component Analysis (ICA)
 6.3 Exploratory Projection Pursuit
 6.4 Compressed Sensing
 6.5 Generative Topographic Map (GTM)
 6.6 Finite Mixtures
 6.7 Hidden Markov Models (HMM)
 6.8 Latent Space Graphical Models
 6.9 Latent Dirichlet Allocation (LDA)
 6.10 Probabilistic Latent Semantic Indexing (PLSI)
 6.11 Prediction by Partial Matching (PPM)
 6.12 Dynamic Markov Compression (DMC)
7 Random Graph Models
 7.1 Erdos Renyi
 7.2 Exchangeable Graph Model
 7.3 p1 Graph Model
 7.4 p2 Graph Model
 7.5 Stochastic Block Graph Model
 7.6 Latent Space Graph Model
 7.7 Exponential Random Graphs (ERGMs)
8 Cluster Analysis
 8.1 K-Means Clustering
 8.2 K-Medoids Clustering (PAM)
 8.3 Quality Threshold Clustering (QT)
 8.4 Hierarchical Clustering
 8.5 Fuzzy Clustering
 8.6 Self Organizing Maps (SOM)
 8.7 Spectral Clustering
 8.8 Bi Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3.1 Association Rules (Market Basket Analysis; Apriori Algorithm)
Association rules, or market basket analysis, or affinity analysis, can be seen as approximating the joint distribution with a region-wise constant function.
Apriori Algorithm Terminology
The algorithm (use dummy variables for a 0/1 response = "in basket"/"not in basket"): The first pass over the data computes the support (relative frequency) of all single-item sets. Those whose support is less than the threshold are discarded. The second pass computes the support of all item sets of size two that can be formed from pairs of the single items surviving the first pass. In other words, to generate all frequent item sets with |K| = m, we need to consider only candidates such that all of their m ancestral item sets of size m − 1 are frequent. Those size-two item sets with support less than the threshold are discarded. Each successive pass over the data considers only those item sets that can be formed by combining those that survived the previous pass with those retained from the first pass. Passes over the data continue until all candidate rules from the previous pass have support less than the specified threshold.

Example: suppose the item set is K = {peanut butter, jelly, bread} and consider the rule {peanut butter, jelly} => {bread}. A support value of 0.03 for this rule means that peanut butter, jelly, and bread appeared together in 3% of the market baskets. A confidence of 0.82 for this rule implies that when peanut butter and jelly were purchased, 82% of the time bread was also purchased. If bread appeared in 43% of all market baskets, then the rule {peanut butter, jelly} => {bread} would have a lift of confidence/P(bread) = 0.82/0.43 ≈ 1.9. The goal of this analysis is to produce association rules (A => B) with both high support and high confidence.
Examples of Association Rules:
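As an added minimal sketch (not from the original notes), rules like the one above can be mined with the arules package; the bundled Groceries transactions data and the support/confidence thresholds are illustrative choices.

library(arules)
data("Groceries")                                    # a transactions object shipped with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5))
# Rank by lift = confidence(A => B) / P(B) and look at the strongest rules:
inspect(head(sort(rules, by = "lift"), 5))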
4 Linear Space Embedding Methods
Linear-space embeddings are a class of dimensionality reduction techniques that map the data X into a lower dimensional linear space M. The mapping itself, f : X → M, can be linear or non-linear. We denote the low dimensional representation of the data by X̂ := f(X) ∈ M. The idea of ERM and inductive bias also applies to unsupervised learning: we seek some f that does not incur too much loss on average, i.e., we seek to minimize the risk R(f).
Remark: Two interpretations of "linear" can be found in the literature. It may refer to the nature of the low dimensional space approximating the data, or to the nature of the embedding operation.
4.1 Principal Components Analysis (PCA)
Maximizing the variance under a constraint, using Lagrange multipliers: where Cov[X] = Σ, Cov[vX] = vΣv′.
PCA is such a basic technique that it has been rediscovered and renamed independently in many fields. It can be found under the names of: discrete Karhunen-Loeve Transform; Hotelling Transform; Proper Orthogonal Decomposition (POD); Eckart-Young Theorem; Schmidt-Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral Decomposition; Empirical Modal Analysis; and possibly more.

Example: Consider human height and weight data. While clearly two-dimensional data, you don't really need both to understand how "big" the people in the data are. This is because height and weight vary mostly along a single dimension, which can be interpreted as the "bigness" of an individual. This is why physicians use the Body Mass Index (BMI) as an indicator of size, instead of a two-dimensional measurement. Assume now that you wish to give each individual a size score that is a linear combination of height and weight; PCA does just that. It returns the linear combination that has the most variability, i.e., the combination which best distinguishes between individuals.

Notice we have currently offered two motivations for PCA: (i) find the linear combinations that best distinguish between observations, i.e., maximize variance; (ii) find the linear subspace that best approximates the data. The reason these two problems are equivalent is the use of the squared error. Informally speaking, the data has some total variance; this variance can be decomposed into the part captured in M and the part not captured.

Note: Usually, for simplicity of exposition, we will assume that the data X has been mean centered.

Terminology:
Principal Components: The linear combinations of the features which best separate between observations. In our example - the "bigness" index of each individual. The first component captures the most variance, the second component the second most variance, etc. In terms of M, the principal components are an orthogonal basis for M.
Scores: Synonymous with Principal Components.
Loadings: The weights of each original feature in each principal component. In our example, the importance of the height and weight in constructing the "bigness" score.

PCA as a Graph Method
Starting from the maximal variance motivation, it is perhaps not surprising that PCA depends only on the similarities between features, as measured by their empirical covariance. The linearity of the target manifold was there by assumption. The building blocks of all these graph-based dimensionality reduction methods are:
1. Compute some similarity graph G (or dissimilarity graph D) from the raw features.
2. Call upon graph embedding theory to map the data points into the target manifold M.
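To make the height/weight example above concrete, here is a minimal added base-R sketch (the simulated data are made up):

# Simulated height (cm) and weight (kg) that vary mostly along one "bigness" axis.
set.seed(1)
bigness <- rnorm(100)
X <- cbind(height = 170 + 10 * bigness + rnorm(100, sd = 3),
           weight =  70 +  8 * bigness + rnorm(100, sd = 3))

pc <- prcomp(X, scale. = TRUE)   # prcomp mean-centers; scale. = TRUE also standardizes
pc$rotation                      # loadings: weight of each variable in each component
head(pc$x[, 1])                  # scores on the first principal component (the "bigness" index)
summary(pc)                      # proportion of variance captured by each component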
To summarize: Task = dim. reduction; Type = optimization; Input = graph (G); Output = embedding function.

Sparse Principal Component Analysis (sPCA)
When analyzing the PCA results, we often wish to understand which features contribute to which component. This is much easier when the loadings (A) are sparse, i.e., include many zeroes. sPCA achieves this in LASSO style, by means of l1 regularization.

4.4 Multidimensional Scaling (MDS)
Both self-organizing maps and principal curves and surfaces map data points in R^p to a lower dimensional manifold. Multidimensional scaling (MDS) has a similar goal, but approaches the problem in a somewhat different way: MDS represents high-dimensional data in a low-dimensional coordinate system. MDS requires only the dissimilarities d_ij, in contrast to the SOM and principal curves and surfaces, which need the data points x_i. MDS aims at representing a network (= a weighted graph) of distances (or similarities) between observations, by embedding the observations in a q dimensional linear subspace, while preserving the original distances.
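As an added minimal sketch: classical MDS needs only a dissimilarity matrix, and base R's cmdscale() embeds the observations in q dimensions (iris is used purely for illustration).

D   <- dist(scale(iris[, 1:4]))   # Euclidean dissimilarities d_ij between observations
mds <- cmdscale(D, k = 2)         # classical (metric) MDS: embed in a q = 2 dimensional space
plot(mds, col = iris$Species, xlab = "Coordinate 1", ylab = "Coordinate 2")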
5 Non-Linear Space Embedding Methods
The fact that the linear-space embedding of the data depends only on some similarity graph has laid a bridge between feature embedding methods, such as PCA, and graph embedding methods, such as MDS. Moreover, it has opened the door to replacing the covariance similarity with many other similarity measures. Classic MDS is simply PCA when starting from G, thus viewed as a graph embedding problem. kPCA plugs in kernel similarities instead of covariance similarities. LocalMDS and LLE follow a similar motivation using local measures of similarity. The PCA solution can be cast in terms of the covariance between individuals (G = X′X) or the Euclidean distances (D). In particular, we show that all the information on the location (mean) of X needed for the PCA reconstruction is actually encoded in G (or D).

Kernel Principal Component Analysis (kPCA)
The optimization problem is: argmax_g Cov[g(X)], where g(X) is the best separating score (function).
We thus have two matters to attend: (i) We need to constrain g(x) so that it does not overfit. (ii) We need the problem to be computable. This is precisely the goal of kPCA. We have already encountered a similar problem with Smoothing Splines. It is thus not surprising that the solution has the same form. Namely, if we choose the right g's, the solution
of the optimization problem takes a very simple form. The classes of such g's are known as Reproducing Kernel Hilbert Spaces (RKHS).
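An added minimal sketch of kPCA, assuming the kernlab package; the Gaussian (RBF) kernel and its sigma are illustrative choices.

library(kernlab)
kpc <- kpca(~ ., data = iris[, 1:4],
            kernel = "rbfdot", kpar = list(sigma = 0.2),  # kernel similarity instead of covariance
            features = 2)
head(rotated(kpc))   # the data projected on the first two kernel principal components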
Nonlinear Dimension Reduction and Local Multidimensional Scaling - These methods can be thought of as "flattening" the manifold, and hence reducing the data to a set of low-dimensional coordinates that represent their relative positions in the manifold. They are useful for problems where the signal-to-noise ratio is very high (e.g., physical systems), and are probably not as useful for observational data with lower signal-to-noise ratios.
Three Methods of Nonlinear MDS:
ISOMAP = Isometric feature mapping (Tenenbaum et al., 2000) - constructs a graph to approximate the geodesic distance between points along the manifold. Specifically, for each data point we find its neighbors: points within some small Euclidean distance of that point. We construct a graph with an edge between any two neighboring points. The geodesic distance between any two points is then approximated by the shortest path between points on the graph. Finally, classical scaling is applied to the graph distances to produce a low-dimensional mapping.
LLE = Local linear embedding (Roweis and Saul, 2000) - takes a very different approach, trying to preserve the local affine structure of the high-dimensional data. Each data point is approximated by a linear combination of neighboring points. Then a lower dimensional representation is constructed that best preserves these local approximations.
LLE aims at finding linear subspaces that are good approximations of small neighborhoods of the whole data X. It is similar in spirit to Isomap and Local MDS. It differs, however, in the way similarities are computed, and in the way embeddings are performed. In particular, as the name may suggest, LLE performs local embeddings into linear subspaces.
To summarize: Task = dim. reduction; Type = algorithm; Input = graph (G); Output = data embedding; Concept = local distance.

Local MDS (Chen and Buja, 2008) - takes the simplest and arguably the most direct approach. We define N to be the symmetric set of nearby pairs of points; specifically, a pair (i, i′) is in N if point i is among the K-nearest neighbors of i′, or vice-versa.

Self Organizing Maps (SOM)
SOMs are a non-linear-subspace dimensionality reduction method, aimed at good clustering. It is non-linear because the algorithm (which cannot be cast as an ERM, i.e., optimization, problem) returns an embedding into a non-linear manifold.
To summarize: Task = dim. reduction; Type = algorithm; Input = X (data); Output = parametric curve or surface; Concept = self consistency, i.e., a curve with a path that is the average of all its closest data points.
Self Consistency: Roughly speaking, one can think of this curve as a parameterized function, connecting all the k-means cluster centers in the smoothest way possible.
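An added minimal SOM sketch, assuming the kohonen package; the 5x5 hexagonal grid is an illustrative choice.

library(kohonen)
set.seed(1)
som.fit <- som(scale(as.matrix(iris[, 1:4])),
               grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"))
plot(som.fit, type = "mapping")   # where each observation lands on the 5x5 grid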
8 Cluster Analysis
• Gaussian Mixtures as Soft K-means Clustering.
• K-means Clustering - the algorithm is appropriate when the dissimilarity measure is taken to be squared Euclidean distance. This requires all of the variables to be of the quantitative type. In addition, using squared Euclidean distance places the highest influence on the largest distances. This causes the procedure to lack robustness against outliers that produce very large distances. (See the R sketch after this list.)
• K-medoids Clustering - For a given cluster assignment C, find the observation in the cluster minimizing total distance to other points in that cluster. This algorithm assumes attribute data, but the approach can also be applied to data described only by proximity matrices. There is no need to explicitly compute cluster centers.
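An added minimal sketch of k-means and k-medoids, assuming base R and the cluster package (iris is used purely for illustration).

X  <- scale(iris[, 1:4])                    # quantitative variables, standardized
km <- kmeans(X, centers = 3, nstart = 25)   # squared Euclidean distance to cluster means
table(km$cluster, iris$Species)

library(cluster)                            # pam() = Partitioning Around Medoids (k-medoids)
pm <- pam(X, k = 3)                         # medoids are actual observations, so any dissimilarity works
table(pm$clustering, iris$Species)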
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recommender Systems Algorithms
1. Content Filtering
2. Collaborative Filtering
3. Hybrid Filtering
4. Recommender Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The two main approaches to recommender systems are content filtering and collaborative filtering.

1. Content Filtering
In content filtering, the system is assumed to have some background information on the user (say, because he logged in), and uses this information to give him recommendations. The recommendation in this case is approached as a supervised learning problem: the system learns to predict a product's rating based on the user's features.

2. Collaborative Filtering
Unlike content filtering, in collaborative filtering there is no external information on the user or the products, besides the ratings of other users. Collaborative filtering can be approached as a supervised learning problem or as an unsupervised learning problem; this is because it is essentially neither, but rather a missing data problem. The two main approaches to collaborative filtering are neighborhood methods and latent factor models.
a. The neighborhood methods to collaborative filtering rest on the assumption that similar individuals have similar tastes. If someone similar to individual i has seen movie j, then i should have a similar opinion.
b. The latent factor models approach to collaborative filtering rests on the assumption that the rankings are a function of some latent user attributes and latent movie attributes. This idea is not a new one; we have seen it in the context of unsupervised learning in factor analysis (FA) and independent component analysis (ICA). This is why this approach is more commonly known as the Matrix Factorization approach to collaborative filtering. We can present several matrix factorization problems in the ERM framework (a minimal sketch appears after the terminology list below).

3. Hybrid Filtering
After introducing the ideas of content filtering and collaborative filtering, why not marry the two? Hybrid filtering is the idea of imputing the missing data, thus making recommendations, using both a viewer's attributes and other viewers' preferences. It can be presented as an ERM problem.

Recommender Systems Terminology
• Content Based Filtering: A supervised learning approach to recommendations.
• Collaborative Filtering: A missing data imputation approach to recommendations.
• Memory Based Filtering: A non-parametric (neighborhood) approach to collaborative filtering.
• Model Based Filtering: A latent space generative model approach to collaborative filtering.
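An added minimal latent-factor (matrix factorization) sketch in base R: gradient descent on the squared error of the observed ratings only. All names, dimensions, and step sizes are illustrative assumptions, and the ratings are simulated.

set.seed(1)
n.users <- 50; n.items <- 30; q <- 2               # q = number of latent factors
R <- matrix(NA, n.users, n.items)
obs <- sample(length(R), 500)                      # indices of observed ratings
R[obs] <- sample(1:5, 500, replace = TRUE)         # fake ratings

U <- matrix(rnorm(n.users * q, sd = 0.1), n.users, q)   # latent user attributes
V <- matrix(rnorm(n.items * q, sd = 0.1), n.items, q)   # latent item attributes
lambda <- 0.1; step <- 0.01
for (it in 1:200) {
  E <- R - U %*% t(V)                 # residuals; NA where the rating is missing
  E[is.na(E)] <- 0                    # missing entries contribute nothing to the gradient
  U <- U + step * (E %*% V  - lambda * U)   # gradient step on the penalized squared loss
  V <- V + step * (t(E) %*% U - lambda * V)
}
R.hat <- U %*% t(V)                   # imputed ratings = the recommendations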
Misc notes:
========
The Relation Between Supervised and Unsupervised Learning
It may be surprising that collaborative filtering can be seen as both an unsupervised and a supervised learning problem, but these are not mutually exclusive. Since in unsupervised learning we try to learn the joint distribution of x, i.e., the relationship between each variable in x and the rest, we may see it as several supervised learning problems, in each of which a different variable in x plays the role of y.
The Kernel Trick
Applies to: SVM, PCA, canonical correlation analysis, ridge regression, spectral clustering, Gaussian processes, and more (k-nearest neighbors (kNN) is also a kernel method). Think of smoothing splines: it was quite magical that, without constraining the hypothesis class F, the ERM problem has a finite dimensional closed form solution. The property of an infinite dimensional problem having a solution in a finite dimensional space is known as the kernel property. The question is then: what type of penalties J(f) will return simple solutions to the penalized ERM problem

argmin_f { Σ_i L(y_i, f(x_i)) + λ J(f) }?    (1)
The answer is: functions that belong to a Reproducing Kernel Hilbert Space (RKHS), a function space.
The Bayesian View of RKHS: Just as ridge regression has a Bayesian interpretation, so does the kernel trick. Informally, the functions solving Eq.(1) can be seen as the posterior mode if our prior beliefs postulate that the function we are trying to recover is a zero-mean Gaussian process with covariance given by K.
Generative Models
By generative model we mean that we specify the whole data distribution. This is particularly relevant to supervised learning, where many methods only assume the distribution of P(y|x) without stating the distribution of P(x). LDA, QDA, and Naive Bayes follow this exact same rationale.

Dimensionality Reduction
- It is intimately related to lossy compression in information theory.
- Dimensionality reduction is often performed before supervised learning to keep computational complexity low.
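An added minimal sketch of two generative classifiers, assuming the MASS and e1071 packages (iris is used purely for illustration).

library(MASS)
fit.lda <- lda(Species ~ ., data = iris)        # class-conditional Gaussians with a shared covariance
table(predict(fit.lda, iris)$class, iris$Species)

library(e1071)
fit.nb <- naiveBayes(Species ~ ., data = iris)  # assumes features independent given the class
table(predict(fit.nb, iris), iris$Species)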
R code
Supervised Learning Code

library(magrittr) # for piping
library(dplyr)    # for handling data frames

# Some utility functions (function bodies reconstructed around the surviving %>% pipelines):
l2  <- function(x) x^2 %>% sum %>% sqrt    # l2 (Euclidean) norm
l1  <- function(x) abs(x) %>% sum          # l1 norm
MSE <- function(x) x^2 %>% mean            # mean squared error of a residual vector
# missclassification: a misclassification-rate utility (its definition is cut off in the source)