CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU
Instance-based Learning
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor (a minimal sketch follows below).
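A minimal sketch of this 1-NN rule, assuming numpy arrays X_train, y_train and a query point x_query (names are illustrative, not from the slides):

import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # No weighting function: just copy the nearest neighbor's output
    return y_train[np.argmin(dists)]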
Consistency of 1-NN
• Consider an estimator f_n trained on n examples
  – e.g., 1-NN, regression, ...
• The estimator is consistent if its true error goes to zero as the amount of data increases
  – e.g., for noise-free data, consistent if the error of f_n tends to 0 as n → ∞
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)
What about variance???
1-NN overfits?
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors (see the sketch below).
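A matching sketch for the k-NN rule above, again with illustrative names; the only change from 1-NN is averaging the outputs of the k closest points:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=9):
    # Euclidean distance to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors (unweighted)
    nearest = np.argsort(dists)[:k]
    # Predict the average output among them
    return y_train[nearest].mean()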
k-Nearest Neighbor (here k=9)
k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
  – Most real work done during testing
  – For every test sample, must search through the whole dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods and hashing methods (a sketch follows below)
• Instance-based learning is often poor with noisy or irrelevant features
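The sketch referenced above: one common tree-based method is the k-d tree. A minimal example using scipy, with an illustrative random dataset (the exact speedup depends heavily on the dimensionality):

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)   # 10,000 stored points in 3 dimensions
x_query = np.random.rand(3)

tree = cKDTree(X_train)                  # build the index once
dists, idx = tree.query(x_query, k=5)    # 5 nearest neighbors without a full scan

In high dimensions the tree tends to degrade toward a linear scan, which is one face of the curse of dimensionality.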
Support Vector Machines
Linear classifiers – Which line is better?
Data: example i, scored by w.x = Σ_j w^(j) x^(j)
[Figure: labeled data points with several candidate separating lines w.x + b = 0]
Pick the one with the largest margin!
w.x = Σ_j w^(j) x^(j)
[Figure: the chosen separating line w.x + b = 0 with its margin]
Maximize the margin
[Figure: separating hyperplane w.x + b = 0 with the margin shown]
But there are many planes…
[Figure: hyperplane w.x + b = 0]
Review: Normal to a plane
(The weight vector w is normal to the hyperplane w.x + b = 0.)
Normalized margin – Canonical hyperplanes
[Figure: points x+ and x- on the canonical hyperplanes w.x + b = +1 and w.x + b = -1, with w.x + b = 0 between them and margin 2γ]
Margin maximization using canonical hyperplanes
[Figure: hyperplanes w.x + b = +1, w.x + b = 0, w.x + b = -1 and margin 2γ]
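A standard reconstruction of the derivation on this slide: with the canonical hyperplanes w.x + b = ±1, the margin is 2γ = 2/||w||, so maximizing the margin is equivalent to the quadratic program

\[ \min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \text{for all } i. \]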
Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors
[Figure: canonical hyperplanes w.x + b = +1, 0, -1 with margin 2γ]
What if the data is not linearly separable? Use features of features of features of features….
What if the data is still not linearly separable?
• Minimize w.w and number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes
Slack variables – Hinge loss
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
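In standard form, the slack-variable program and its hinge-loss reading are

\[ \min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \]

where at the optimum \( \xi_i = \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr) \): no penalty when the margin is at least 1, a linear penalty otherwise.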
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss. Logistic regression: log loss.
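Writing the classifier score as f(x) = w.x + b, the two losses in standard form are

\[ \text{SVM (hinge loss):}\quad \max\bigl(0,\ 1 - y\,f(x)\bigr), \qquad \text{Logistic regression (log loss):}\quad \log\bigl(1 + e^{-y\,f(x)}\bigr). \]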
Constrained optimization
Lagrange multipliers – Dual variables
Moving the constraint into the objective function gives the Lagrangian, which we then solve.
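In its standard form, for a problem \( \min_x f(x) \) subject to \( g(x) \le 0 \), the constraint moves into the objective through a dual variable \( \lambda \ge 0 \):

\[ L(x, \lambda) = f(x) + \lambda\, g(x), \qquad \text{solve}\quad \min_x\ \max_{\lambda \ge 0}\ L(x, \lambda). \]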
Lagrange multipliers – Dual variables: solving
Dual SVM derivation (1) – the linearly separable case
Dual SVM derivation (2) – the linearly separable case
Dual SVM interpretation
[Figure: separating hyperplane w.x + b = 0 defined by the support vectors]
Dual SVM formulation – the linearly separable case
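The standard dual that this derivation arrives at for the separable case is

\[ \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0, \]

with \( w = \sum_i \alpha_i y_i x_i \); the support vectors are the training points with \( \alpha_i > 0 \).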
Dual SVM derivation – the non-separable case
Dual SVM formulation – the non-separable case
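In standard form, the non-separable (soft-margin) dual keeps the same objective and simply boxes the multipliers by the slack penalty:

\[ \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0. \]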
Why did we learn about the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…
Reminder from last time: What if the data is not linearly separable? Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
m – number of input features; d – degree of polynomial
[Figure: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]
The number of terms grows fast! For d = 6 and m = 100: about 1.6 billion terms.
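As a worked check of that figure, assuming the slide counts monomials of degree exactly d in m input features:

\[ \#\{\text{monomials of degree } d\} = \binom{m + d - 1}{d}, \qquad \binom{105}{6} = 1{,}609{,}344{,}100 \approx 1.6 \text{ billion for } d = 6,\ m = 100. \]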
Dual formulation only depends on dot-products, not on w!
Dot-product of polynomials
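The standard degree-2 instance of what this slide works out: if Φ(x) lists all products of pairs of coordinates, the dot product of the feature vectors collapses to a function of x.z alone:

\[ \Phi(x) \cdot \Phi(z) = \sum_{j,k} x^{(j)} x^{(k)}\, z^{(j)} z^{(k)} = \Bigl(\sum_j x^{(j)} z^{(j)}\Bigr)^2 = (x \cdot z)^2. \]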
Finally: the "kernel trick"!
• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces
Polynomial kernels
• All monomials of degree d in O(d) operations
• How about all monomials of degree up to d?
  – Solution 0
  – Better solution
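In standard kernel notation, the two cases above are usually written as

\[ K(x, z) = (x \cdot z)^d \ \ \text{(all monomials of degree exactly } d\text{)}, \qquad K(x, z) = (1 + x \cdot z)^d \ \ \text{(all monomials of degree up to } d\text{)}, \]

either of which is evaluated from the single dot product x.z, never from the explicit monomial features.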
Common kernels
• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
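The standard forms of these kernels (the bandwidth σ and the sigmoid parameters η, ν are generic symbols, not taken from the slides):

\[ K(x,z) = (x \cdot z)^d, \qquad K(x,z) = (1 + x \cdot z)^d, \qquad K(x,z) = \exp\Bigl(-\tfrac{\|x - z\|^2}{2\sigma^2}\Bigr), \qquad K(x,z) = \tanh(\eta\, x \cdot z + \nu). \]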
Overfitting?
• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and coefficients αi
• At classification time, compute the kernel sum over the support vectors and classify by its sign (see the sketch below)
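The decision rule these steps describe is, in standard form, \( \hat{y}(x) = \operatorname{sign}\bigl(\sum_i \alpha_i y_i K(x_i, x) + b\bigr) \), where the sum runs only over the support vectors. The sketch below shows the same pipeline with scikit-learn's SVC on illustrative toy data (the dataset and kernel choice are assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian kernel; the dual is solved internally
clf.fit(X, y)

print(clf.support_vectors_.shape)   # only the support vectors are stored
print(clf.predict([[1.5, 1.5]]))    # sign of sum_i alpha_i y_i K(x_i, x) + b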
What's the difference between SVMs and Logistic Regression?

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          No
Kernels in logistic regression
• Define weights in terms of support vectors
• Derive a simple gradient descent rule on αi
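In the representer form these bullets describe (with generic notation), the weights become

\[ w = \sum_i \alpha_i\, \Phi(x_i) \quad \Rightarrow \quad w \cdot \Phi(x) + b = \sum_i \alpha_i\, K(x_i, x) + b, \]

so the log loss can be written entirely in terms of K and minimized by gradient descent on the αi.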
What's the difference between SVMs and Logistic Regression? (Revisited)

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          Yes!
  Solution sparse                           Often yes!    Almost always no!
  Semantics of output                       "Margin"      Real probabilities