CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU

Instance-based Learning

1-Nearest Neighbor

Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor (sketch below).
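To make these four choices concrete, here is a minimal 1-NN prediction sketch in Python with plain NumPy; the function and variable names are illustrative, not from the course materials.

import numpy as np

def one_nn_predict(X_train, y_train, x_query):
    # 1. Distance metric: squared Euclidean distance to every stored example
    dists = np.sum((X_train - x_query) ** 2, axis=1)
    # 2. Number of neighbors: one; 3. weighting function: unused
    nearest = np.argmin(dists)
    # 4. How to fit: predict the same output as the nearest neighbor
    return y_train[nearest]

Note that all the real work happens at query time; nothing is "trained" beyond storing X_train and y_train.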

 

Consistency of 1-NN
• Consider an estimator f_n trained on n examples
  – e.g., 1-NN, regression, ...
• An estimator is consistent if its true error goes to zero as the amount of data increases
  – e.g., for noise-free data, consistent if lim_{n→∞} error_true(f_n) = 0
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)

What about variance???

1-NN overfits?

k-Nearest Neighbor

Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors (sketch below).
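A matching k-NN sketch, again with illustrative names; the only change from 1-NN is that the prediction averages the outputs of the k nearest neighbors.

import numpy as np

def knn_predict(X_train, y_train, x_query, k=9):
    # Distance metric: Euclidean distance to every stored example
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # Indices of the k closest points (argpartition avoids a full sort)
    nearest = np.argpartition(dists, k - 1)[:k]
    # Predict the average output among the k nearest neighbors
    return y_train[nearest].mean()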

   

   

k-Nearest Neighbor (here k = 9)

K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?

Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
  – Most real work is done during testing
  – For every test sample, must search through the entire dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods and hashing methods (sketch below)
• Instance-based learning is often poor with noisy or irrelevant features
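As one concrete example of the tree-based methods mentioned above, a k-d tree (here SciPy's cKDTree; the data and query are made up for illustration) answers nearest-neighbor queries without scanning every stored example:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))   # 10,000 stored examples in 3 dimensions
tree = cKDTree(X_train)                  # built once, before any queries arrive

x_query = np.zeros(3)
dists, idx = tree.query(x_query, k=9)    # 9 nearest neighbors without a linear scan

In high dimensions, however, such structures lose much of their advantage, which is part of the curse the slide title refers to.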

Support Vector Machines

Linear classifiers – Which line is better?

Data: example i: (x_i, y_i), with label y_i ∈ {+1, −1}

w·x = ∑_j w^(j) x^(j)

Pick the one with the largest margin!

[Figure: separating hyperplane w·x + b = 0]

w·x = ∑_j w^(j) x^(j)

Maximize the margin

[Figure: separating hyperplane w·x + b = 0 with margin]

But there are many planes…

[Figure: separating hyperplane w·x + b = 0]

Review: Normal to a plane

[Figure: plane w·x + b = 0 and its normal vector]

Normalized margin – Canonical hyperplanes

[Figure: canonical hyperplanes w·x + b = +1 and w·x + b = −1 around the decision boundary w·x + b = 0, with closest points x+ and x− and margin 2γ]


Margin maximization using canonical hyperplanes

[Figure: hyperplanes w·x + b = −1, w·x + b = 0, w·x + b = +1 and margin 2γ]
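The equations on this slide were lost in extraction; the standard margin-maximization problem with canonical hyperplanes, which this slide appears to set up, is:

\begin{align*}
\text{margin: } 2\gamma = \frac{2}{\|w\|},
\qquad\qquad
\min_{w,\,b}\ \tfrac{1}{2}\, w\cdot w
\quad \text{s.t.}\quad y_i\,(w\cdot x_i + b) \ge 1 \ \ \forall i.
\end{align*}

Maximizing 2/‖w‖ is equivalent to minimizing ½ w·w, which is the quadratic program referred to on the next slide.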

Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors

[Figure: hyperplanes w·x + b = −1, w·x + b = 0, w·x + b = +1 and margin 2γ]

What if the data is not linearly separable?
Use features of features of features of features….

What if the data is still not linearly separable?
• Minimize w·w and number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w·w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes

Slack variables – Hinge loss
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
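The slack-variable program itself did not survive extraction; a standard way to write it, with C the slack penalty from the previous slide, is:

\begin{align*}
\min_{w,\,b,\,\xi}\ & \tfrac{1}{2}\, w\cdot w \;+\; C \sum_i \xi_i\\
\text{s.t. } & y_i\,(w\cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \ \ \forall i.
\end{align*}

Eliminating the slack variables gives the hinge-loss view: each example contributes max(0, 1 − y_i (w·x_i + b)), i.e. nothing if its margin is at least 1 and a linear penalty otherwise.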

Side note: What's the difference between SVMs and logistic regression?

SVM – hinge loss: max(0, 1 − y (w·x + b))
Logistic regression – log loss: log(1 + exp(−y (w·x + b)))

Constrained optimization

Lagrange multipliers – Dual variables
Moving the constraint to the objective function
Lagrangian:
Solve:

Lagrange multipliers – Dual variables
Solving:
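The worked equations on these two slides were lost in extraction; as a reminder of the general recipe they follow, for a problem min_x f(x) subject to g(x) ≤ 0:

\begin{align*}
L(x,\alpha) &= f(x) + \alpha\, g(x), \qquad \alpha \ge 0,\\
\text{solve: }\quad & \max_{\alpha \ge 0}\ \min_x\ L(x,\alpha),
\end{align*}

where the inner minimization is handled by setting ∂L/∂x = 0 and substituting back, leaving a problem in the dual variable α alone.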

Dual SVM derivation (1) – the linearly separable case

Dual SVM derivation (2) – the linearly separable case

Dual SVM interpretation

[Figure: decision boundary w·x + b = 0]

Dual SVM formulation – the linearly separable case
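The formulation itself was lost in extraction; the standard dual of the separable-case margin-maximization QP is:

\begin{align*}
\max_{\alpha}\ & \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)\\
\text{s.t. } & \sum_i \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \ \ \forall i,
\end{align*}

with w = ∑_i α_i y_i x_i recovered from the solution; only the support vectors end up with α_i > 0.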

Dual SVM derivation – the non-separable case

Dual SVM formulation – the non-separable case
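Again the equations did not survive; the standard soft-margin dual is identical to the separable case except that the slack penalty C caps the multipliers:

\begin{align*}
\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \ \ \forall i.
\end{align*}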

Why did we learn about the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…

Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features….

Feature space can get really large really quickly!

Higher order polynomials

[Plot: number of monomial terms vs. number of input dimensions, for degree d = 2, 3, 4; m – input features, d – degree of polynomial]

The number of terms grows fast! For d = 6 and m = 100: about 1.6 billion terms.
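A quick check of that count in Python, assuming the slide counts monomials of degree exactly d in m variables, i.e. C(m + d − 1, d):

from math import comb

m, d = 100, 6
# Number of distinct monomials of degree exactly d in m variables
print(comb(m + d - 1, d))   # 1609344100, about 1.6 billion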

Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
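The worked dot product on this slide was lost; the usual degree-2 illustration is: for u, v ∈ R² with features Φ(x) = (x_1², √2 x_1 x_2, x_2²),

\begin{align*}
\Phi(u)\cdot\Phi(v)
= u_1^2 v_1^2 + 2\,u_1 u_2\, v_1 v_2 + u_2^2 v_2^2
= (u_1 v_1 + u_2 v_2)^2
= (u\cdot v)^2,
\end{align*}

so the dot product in the degree-2 feature space can be computed from the original dot product alone, without ever writing down Φ.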

Finally: the "kernel trick"!
• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces

Polynomial kernels
• All monomials of degree d in O(d) operations: K(u, v) = (u·v)^d
• How about all monomials of degree up to d?
  – Solution 0: sum the degree-j kernels (u·v)^j for j = 0, …, d
  – Better solution: K(u, v) = (u·v + 1)^d

Common kernels
• Polynomials of degree d: K(u, v) = (u·v)^d
• Polynomials of degree up to d: K(u, v) = (u·v + 1)^d
• Gaussian kernels: K(u, v) = exp(−‖u − v‖² / (2σ²))
• Sigmoid: K(u, v) = tanh(η u·v + ν)

Overfitting?
• Huge feature space with kernels – what about overfitting???
  – Maximizing margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting

What about at classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b)
• Using kernels we are cool!

SVMs with kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their α_i
• At classification time, compute f(x) = ∑_i α_i y_i K(x_i, x) + b and classify as sign(f(x)) (sketch below)
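A minimal prediction-time sketch in Python, assuming the dual solution (support vectors, their labels, the α_i, and b) has already been obtained from some QP solver; the names and the Gaussian-kernel choice are illustrative:

import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # K(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    diff = u - v
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def svm_predict(x, support_vectors, sv_labels, alphas, b, kernel=gaussian_kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, then classify by its sign
    f = b
    for x_i, y_i, a_i in zip(support_vectors, sv_labels, alphas):
        f += a_i * y_i * kernel(x_i, x)
    return np.sign(f)

Only the support vectors appear in the sum, so prediction cost scales with their number, not with the (possibly infinite-dimensional) feature space.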

What's the difference between SVMs and Logistic Regression?

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          No

Kernels in logistic regression
• Define weights in terms of support vectors:
• Derive simple gradient descent rule on the α_i (sketch below)
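One way these two bullets can look in code, as a rough sketch: parameterize the score of a point x_j as ∑_i α_i K(x_i, x_j) (i.e. w = ∑_i α_i Φ(x_i)) and run plain gradient descent on the log loss over the training set. The names are illustrative and regularization is omitted.

import numpy as np

def kernel_logistic_gd(K, y, lr=0.1, n_iters=200):
    # K: (n, n) kernel matrix with K[i, j] = K(x_i, x_j); y: labels in {-1, +1}
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iters):
        f = K @ alpha                             # score f_j = sum_i alpha_i K(x_i, x_j)
        grad_f = -y / (1.0 + np.exp(y * f)) / n   # derivative of mean log(1 + exp(-y_j f_j)) wrt f_j
        alpha -= lr * (K @ grad_f)                # chain rule: d f_j / d alpha_i = K[i, j]
    return alpha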

What's the difference between SVMs and Logistic Regression? (Revisited)

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High dimensional features with kernels    Yes!          Yes!
  Solution sparse                           Often yes!    Almost always no!
  Semantics of output                       "Margin"      Real probabilities
