CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU
Instance-based Learning
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor (a minimal sketch follows below).
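A minimal sketch of this 1-NN rule, assuming numpy arrays X_train, y_train and a query point x_query (names are illustrative, not from the slides):

import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # No weighting function: just copy the nearest neighbor's output
    return y_train[np.argmin(dists)]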
Consistency of 1-NN
• Consider an estimator f_n trained on n examples
  – e.g., 1-NN, regression, ...
• The estimator is consistent if its true error goes to zero as the amount of data increases
  – e.g., for noise-free data, consistent if the error of f_n tends to 0 as n → ∞
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)
What about variance???
1-NN overfits?
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors (see the sketch below).
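A matching sketch for the k-NN rule above, again with illustrative names; the only change from 1-NN is averaging the outputs of the k closest points:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=9):
    # Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors (unweighted)
    nearest = np.argsort(dists)[:k]
    # Predict the average output among them
    return y_train[nearest].mean()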
k-Nearest Neighbor (here k=9)
k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
  – Most real work done during testing
  – For every test sample, must search through the whole dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods and hashing methods (a sketch follows below)
• Instance-based learning is often poor with noisy or irrelevant features
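The sketch referenced above: one common tree-based method is the k-d tree. A minimal example using scipy, with an illustrative random dataset (the exact speedup depends heavily on the dimensionality):

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)   # 10,000 stored points in 3 dimensions
x_query = np.random.rand(3)

tree = cKDTree(X_train)                  # build the index once
dists, idx = tree.query(x_query, k=5)    # 5 nearest neighbors without a full scan

In high dimensions the tree tends to degrade toward a linear scan, which is one face of the curse of dimensionality.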
Support Vector Machines
Linear classifiers – Which line is better?
Data: example i, scored by w.x = Σ_j w^(j) x^(j)
[Figure: labeled data points with several candidate separating lines w.x + b = 0]
Pick the one with the largest margin!
w.x = Σ_j w^(j) x^(j)
[Figure: the chosen separating line w.x + b = 0 with its margin]
Maximize the margin
[Figure: separating hyperplane w.x + b = 0 with the margin shown]
But there are many planes…
[Figure: hyperplane w.x + b = 0]
Review: Normal to a plane
(The weight vector w is normal to the hyperplane w.x + b = 0.)
Normalized margin – Canonical hyperplanes
[Figure: points x+ and x- on the canonical hyperplanes w.x + b = +1 and w.x + b = -1, with w.x + b = 0 between them and margin 2γ]
Margin maximization using canonical hyperplanes
[Figure: hyperplanes w.x + b = +1, w.x + b = 0, w.x + b = -1 and margin 2γ]
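A standard reconstruction of the derivation on this slide: with the canonical hyperplanes w.x + b = ±1, the margin is 2γ = 2/||w||, so maximizing the margin is equivalent to the quadratic program

\[ \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \text{for all } i. \]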
Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors
[Figure: canonical hyperplanes w.x + b = +1, 0, -1 with margin 2γ]
What if the data is not linearly separable? Use features of features of features of features….
What if the data is still not linearly separable?
• Minimize w.w and number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes
Slack variables – Hinge loss
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
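In standard form, the slack-variable program and its hinge-loss reading are

\[ \min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \]

where at the optimum \( \xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr) \): no penalty when the margin is at least 1, a linear penalty otherwise.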
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss. Logistic regression: log loss.
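Writing the classifier score as f(x) = w.x + b, the two losses in standard form are

\[ \text{SVM (hinge loss):}\quad \max\bigl(0,\ 1 - y\,f(x)\bigr), \qquad \text{Logistic regression (log loss):}\quad \log\bigl(1 + e^{-y\,f(x)}\bigr). \]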
Constrained optimization
Lagrange multipliers – Dual variables
Moving the constraint into the objective function gives the Lagrangian, which we then solve.
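In its standard form, for a problem \( \min_x f(x) \) subject to \( g(x) \le 0 \), the constraint moves into the objective through a dual variable \( \lambda \ge 0 \):

\[ L(x, \lambda) = f(x) + \lambda\, g(x), \qquad \text{solve}\quad \min_x\ \max_{\lambda \ge 0}\ L(x, \lambda). \]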
Lagrange multipliers – Dual variables: solving
Dual SVM derivation (1) – the linearly separable case
Dual SVM derivation (2) – the linearly separable case
Dual SVM interpretation
[Figure: separating hyperplane w.x + b = 0 defined by the support vectors]
Dual SVM formulation – the linearly separable case
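The standard dual that this derivation arrives at for the separable case is

\[ \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0, \]

with \( w = \sum_i \alpha_i y_i x_i \); the support vectors are the training points with \( \alpha_i > 0 \).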
Dual SVM derivation – the non-separable case
Dual SVM formulation – the non-separable case
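In standard form, the non-separable (soft-margin) dual keeps the same objective and simply boxes the multipliers by the slack penalty:

\[ \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0. \]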
Why did we learn about the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…
Reminder from last time: What if the data is not linearly separable? Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
m – number of input features; d – degree of polynomial
[Figure: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]
The number of terms grows fast! For d = 6 and m = 100: about 1.6 billion terms.
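As a worked check of that figure, assuming the slide counts monomials of degree exactly d in m input features:

\[ \#\{\text{monomials of degree } d\} = \binom{m + d - 1}{d}, \qquad \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6 \text{ billion for } d = 6,\ m = 100. \]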
Dual formulation only depends on dot-products, not on w!
Dot-product of polynomials
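The standard degree-2 instance of what this slide works out: if Φ(x) lists all products of pairs of coordinates, the dot product of the feature vectors collapses to a function of x.z alone:

\[ \Phi(x) \cdot \Phi(z) = \sum_{j,k} x^{(j)} x^{(k)}\, z^{(j)} z^{(k)} = \Bigl(\sum_j x^{(j)} z^{(j)}\Bigr)^2 = (x \cdot z)^2. \]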
Finally: the "kernel trick"!
• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces
Polynomial kernels
• All monomials of degree d in O(d) operations
• How about all monomials of degree up to d?
  – Solution 0
  – Better solution
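In standard kernel notation, the two cases above are usually written as

\[ K(x, z) = (x \cdot z)^d \ \ \text{(all monomials of degree exactly } d\text{)}, \qquad K(x, z) = (1 + x \cdot z)^d \ \ \text{(all monomials of degree up to } d\text{)}, \]

either of which is evaluated from the single dot product x.z, never from the explicit monomial features.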
Common kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
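The standard forms of these kernels (the bandwidth σ and the sigmoid parameters η, ν are generic symbols, not taken from the slides):

\[ K(x,z) = (x \cdot z)^d, \qquad K(x,z) = (1 + x \cdot z)^d, \qquad K(x,z) = \exp\Bigl(-\tfrac{\|x - z\|^2}{2\sigma^2}\Bigr), \qquad K(x,z) = \tanh(\eta\, x \cdot z + \nu). \]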
Overfitting?
• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and coefficients αi
• At classification time, compute the kernel sum over the support vectors and classify by its sign (see the sketch below)
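The decision rule these steps describe is, in standard form, \( \hat{y}(x) = \operatorname{sign}\bigl(\sum_i \alpha_i y_i K(x_i, x) + b\bigr) \), where the sum runs only over the support vectors. The sketch below shows the same pipeline with scikit-learn's SVC on illustrative toy data (the dataset and kernel choice are assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian kernel; the dual is solved internally
clf.fit(X, y)

print(clf.support_vectors_.shape)   # only the support vectors are stored
print(clf.predict([[1.5, 1.5]]))    # sign of sum_i alpha_i y_i K(x_i, x) + b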
What's the difference between SVMs and Logistic Regression?

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          No
Kernels in logistic regression
• Define weights in terms of support vectors
• Derive a simple gradient descent rule on αi
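In the representer form these bullets describe (with generic notation), the weights become

\[ w = \sum_i \alpha_i\, \Phi(x_i) \quad \Rightarrow \quad w \cdot \Phi(x) + b = \sum_i \alpha_i\, K(x_i, x) + b, \]

so the log loss can be written entirely in terms of K and minimized by gradient descent on the αi.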
What's the difference between SVMs and Logistic Regression? (Revisited)

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          Yes!
  Solution sparse                           Often yes!    Almost always no!
  Semantics of output                       "Margin"      Real probabilities