Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork

John L. Weatherwax∗

February 24, 2008

∗ [email protected]

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules)

Part (a): Let $R(x)$ be the average risk given the measurement $x$. Since this risk depends on the action taken and, correspondingly, on the probability of taking that action, it can be expressed as
$$R(x) = \sum_{i=1}^{m} R(a_i|x) P(a_i|x).$$
Then the total risk (over all possible $x$) is given by the average of $R(x)$, or
$$R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x) P(a_i|x) \right) p(x)\,dx.$$

Part (b): One would pick
$$P(a_i|x) = \begin{cases} 1 & i = \operatorname{ArgMin}_j\, R(a_j|x) \\ 0 & \text{otherwise.} \end{cases}$$
Thus for every $x$ we pick the action that minimizes the local risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.

Part (c): Yes, since in a deterministic decision rule setting, randomizing some decisions might further improve performance, though how this would be done would be difficult to say.
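To make part (b) concrete, the following minimal Python sketch (an illustration, not part of the original solution; the numeric risks are made up) compares the conditional risk of the deterministic argmin rule against an arbitrary randomized rule at a single measurement $x$.

```python
import numpy as np

# Conditional risks R(a_i | x) for m = 3 actions at a single measurement x
# (illustrative numbers only).
risks = np.array([0.7, 0.2, 0.5])

# Deterministic rule of part (b): put all probability on the minimum-risk action.
p_det = np.zeros_like(risks)
p_det[np.argmin(risks)] = 1.0

# An arbitrary randomized rule over the same actions.
p_rand = np.array([0.2, 0.5, 0.3])

# R(x) = sum_i R(a_i | x) P(a_i | x); randomizing can never beat the argmin rule.
print(risks @ p_det)    # 0.2
print(risks @ p_rand)   # 0.39 >= 0.2
```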

Problem 13

Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j, \quad i, j = 1, \dots, c \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise.} \end{cases}$$
Then, since the risk is defined as the expected loss, we have
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) P(\omega_j|x), \qquad i = 1, 2, \dots, c \qquad (1)$$
$$R(\alpha_{c+1}|x) = \lambda_r. \qquad (2)$$
Now for $i = 1, 2, \dots, c$ the risk can be simplified as
$$R(\alpha_i|x) = \lambda_s \sum_{j=1,\, j \neq i}^{c} P(\omega_j|x) = \lambda_s \left(1 - P(\omega_i|x)\right).$$

Our decision function is to choose the action with the smallest risk. This means that we pick the reject option if and only if
$$\lambda_r < \lambda_s \left(1 - P(\omega_i|x)\right) \quad \text{for all } i,$$
which is equivalent to
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \text{for all } i.$$
Otherwise we pick the class $\omega_i$ with the smallest risk $\lambda_s(1 - P(\omega_i|x))$, i.e. the class for which
$$P(\omega_i|x) \geq P(\omega_j|x) \quad \forall j \neq i.$$
This can be interpreted as selecting the class with the largest posterior probability. In summary, we should reject if and only if
$$\max_i P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s}.$$
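As a quick illustration of this rule, here is a minimal Python sketch (not part of the original solution; the function name and the example posteriors are made up) that applies the reject option to a vector of posterior probabilities.

```python
import numpy as np

def classify_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or -1 for the reject option.

    posteriors : array of P(omega_i | x) for i = 1..c (sums to 1).
    lambda_r   : cost of rejecting.
    lambda_s   : cost of a substitution (misclassification) error.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    threshold = 1.0 - lambda_r / lambda_s
    best = int(np.argmax(posteriors))       # class with the largest posterior
    if posteriors[best] < threshold:        # every posterior is below the threshold
        return -1                           # reject
    return best

# Example: a confident posterior is classified, an ambiguous one is rejected.
print(classify_with_reject([0.85, 0.10, 0.05], lambda_r=0.3, lambda_s=1.0))  # -> 0
print(classify_with_reject([0.40, 0.35, 0.25], lambda_r=0.3, lambda_s=1.0))  # -> -1
```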

If $\lambda_r = 0$ there is no loss for rejecting, the threshold becomes $1 - \lambda_r/\lambda_s = 1$, and since no posterior probability can exceed one we should always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above, since $\lambda_r/\lambda_s > 1$ implies $1 - \lambda_r/\lambda_s < 0$, and no posterior probability can be smaller than this threshold.

Problem 31 (probability of error)

Here $P(H_i)$ are the priors for each of the two classes. We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test: we decide class $H_2$ if

$$\frac{p(x|H_2)}{p(x|H_1)} \geq \frac{P(H_1)}{P(H_2)} \left( \frac{C_{21} - C_{11}}{C_{12} - C_{22}} \right),$$
and classify the point as a member of $H_1$ otherwise. The decision boundary $\xi$ is the point at which this inequality is an equality. If we use a minimum probability of error criterion, the costs $C_{ij}$ are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2) = \frac{1}{2}$ the decision boundary reduces to
$$p(\xi|H_1) = p(\xi|H_2).$$
If each conditional distribution is Gaussian, $p(x|H_i) = \mathcal{N}(\mu_i, \sigma_i^2)$, then the above expression becomes
$$\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_1)^2}{\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_2)^2}{\sigma_2^2} \right\}.$$
If we assume (for simplicity) that $\sigma_1 = \sigma_2 = \sigma$, the above expression simplifies to $(\xi - \mu_1)^2 = (\xi - \mu_2)^2$, or, on taking the square root of both sides, $\xi - \mu_1 = \pm(\xi - \mu_2)$. If we take the plus sign in this equation we get the contradiction $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$ we find our decision threshold given by
$$\xi = \frac{1}{2}(\mu_1 + \mu_2).$$

Again, since we have uniform priors ($P(H_1) = P(H_2) = \frac{1}{2}$), we have an error probability given by
$$2 P_e = \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx,$$
or
$$2\sqrt{2\pi}\,\sigma P_e = \int_{-\infty}^{\xi} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx.$$
To evaluate the first integral use the substitution $v = \frac{x - \mu_2}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$, and in the second use $v = \frac{x - \mu_1}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$. This then gives
$$2\sqrt{2\pi}\,\sigma P_e = \sqrt{2}\sigma \int_{-\infty}^{\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv + \sqrt{2}\sigma \int_{-\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2}\,dv.$$
Both integrals can be expressed in terms of the error function, and the resulting error probability $P_e$ decreases to zero as $\frac{|\mu_1 - \mu_2|}{\sigma} \to +\infty$. This can happen if $\mu_1$ and $\mu_2$ get far apart or if $\sigma$ shrinks to zero; in each case the classification problem gets easier.
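The closed form that follows from the above, $P_e = \frac{1}{2}\operatorname{erfc}\!\left(\frac{|\mu_1 - \mu_2|}{2\sqrt{2}\sigma}\right)$, is easy to check numerically. Below is a small Python sketch (an illustration, not part of the original solution; the particular means, variance, and sample size are made up) comparing it against a Monte Carlo estimate.

```python
import numpy as np
from math import erfc, sqrt

def bayes_error_equal_variance(mu1, mu2, sigma):
    """P_e for two equal-prior Gaussian classes N(mu1, sigma^2) and N(mu2, sigma^2).

    The decision threshold is xi = (mu1 + mu2)/2 and the two error integrals
    collapse to a single complementary-error-function term.
    """
    return 0.5 * erfc(abs(mu1 - mu2) / (2.0 * sqrt(2.0) * sigma))

# Monte Carlo check of the closed form.
rng = np.random.default_rng(0)
mu1, mu2, sigma, n = -1.0, 1.0, 1.5, 200_000
xi = 0.5 * (mu1 + mu2)
x1 = rng.normal(mu1, sigma, n)   # samples from class H1
x2 = rng.normal(mu2, sigma, n)   # samples from class H2
empirical = 0.5 * (np.mean(x1 > xi) + np.mean(x2 < xi))
print(bayes_error_equal_variance(mu1, mu2, sigma), empirical)
```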

Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
$$\|x - \mu_1\|^2 = \|x - \mu_2\|^2.$$
Expanding each side of this expression we find
$$\|x\|^2 - 2 x^t \mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2 x^t \mu_2 + \|\mu_2\|^2.$$
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$ we have that
$$x^t (\mu_1 - \mu_2) = \frac{1}{2}\left( \|\mu_1\|^2 - \|\mu_2\|^2 \right),$$
which is the equation of a hyperplane.
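A brief numerical check of this hyperplane form (an illustrative sketch, not part of the original solution; the random vectors are made up): any point of the form midpoint plus a vector orthogonal to $\mu_1 - \mu_2$ is equidistant from the two means and satisfies the linear equation above.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = rng.normal(size=3), rng.normal(size=3)

# Any point x = midpoint + v, with v orthogonal to (mu1 - mu2), is equidistant
# from mu1 and mu2 and hence lies on the decision boundary.
w = mu1 - mu2
v = rng.normal(size=3)
v -= (v @ w) / (w @ w) * w          # project out the component along w
x = 0.5 * (mu1 + mu2) + v

lhs = x @ (mu1 - mu2)
rhs = 0.5 * (mu1 @ mu1 - mu2 @ mu2)
print(np.isclose(lhs, rhs),
      np.isclose(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)))  # True True
```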


Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound $\min(a, b) \leq a^{\beta} b^{1-\beta}$, for $a > 0$, $b > 0$, and $0 \leq \beta \leq 1$, one can show that
$$P(\text{error}) \leq P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} e^{-k(\beta)} \qquad \text{for } 0 \leq \beta \leq 1.$$
Here the expression $k(\beta)$ is given by
$$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta \Sigma_1 + (1-\beta) \Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\left( \frac{|\beta \Sigma_1 + (1-\beta) \Sigma_2|}{|\Sigma_1|^{\beta} |\Sigma_2|^{1-\beta}} \right).$$

When we desire to compute the Bhattacharyya bound we assign $\beta = 1/2$. For a two-class scalar problem with symmetric means and equal variances, that is where $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1 = \sigma_2 = \sigma$, the expression for $k(\beta)$ above becomes
$$k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta \sigma^2 + (1-\beta) \sigma^2 \right]^{-1} (2\mu) + \frac{1}{2} \ln\left( \frac{|\beta \sigma^2 + (1-\beta)\sigma^2|}{\sigma^{2\beta} \sigma^{2(1-\beta)}} \right) = \frac{2\beta(1-\beta)\mu^2}{\sigma^2}.$$

Since the above expression is valid for all $\beta$ in the range $0 \leq \beta \leq 1$, we can obtain the tightest possible bound by computing the maximum of the expression $k(\beta)$ (since the dominant $\beta$-dependent factor is $e^{-k(\beta)}$). To evaluate this maximum of $k(\beta)$ we take the derivative with respect to $\beta$ and set that result equal to zero, obtaining
$$\frac{d}{d\beta}\left( \frac{2\beta(1-\beta)\mu^2}{\sigma^2} \right) = 0,$$
which gives
$$1 - 2\beta = 0,$$
or $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability
$$P(\text{error}) \leq \sqrt{P(\omega_1) P(\omega_2)}\, e^{-\frac{\mu^2}{2\sigma^2}}.$$
To further simplify things, if we assume that our variance is such that $\sigma^2 = \mu^2$, and that our priors over the classes are taken to be uniform ($P(\omega_1) = P(\omega_2) = 1/2$), the above becomes
$$P(\text{error}) \leq \frac{1}{2} e^{-1/2} \approx 0.3033.$$
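A small numerical sketch (not part of the original solution) that evaluates $k(\beta) = 2\beta(1-\beta)\mu^2/\sigma^2$ on a grid, confirming that the maximizer is $\beta = 1/2$ and that the resulting bound is $\frac{1}{2} e^{-1/2} \approx 0.3033$ when $\sigma^2 = \mu^2$ and the priors are uniform:

```python
import numpy as np

# Evaluate the Chernoff exponent k(beta) = 2*beta*(1-beta)*mu^2/sigma^2 for the
# symmetric scalar case and confirm that it is maximized at beta = 1/2.
mu, sigma = 1.0, 1.0                      # sigma^2 = mu^2 as in the text
betas = np.linspace(0.0, 1.0, 10_001)
k = 2.0 * betas * (1.0 - betas) * mu**2 / sigma**2

beta_star = betas[np.argmax(k)]
bound = 0.5 * np.exp(-k.max())            # uniform priors: sqrt(P1 * P2) = 1/2
print(beta_star)                          # 0.5
print(bound)                              # ~0.3033
```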

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that to operate on data of "size" $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds of time to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$, then the largest data structure that we can process must satisfy
$$10^{-9} f(n) \leq T,$$
or
$$n \leq f^{-1}(10^{9} T).$$
Converting the given range of times into seconds we find that
$$T = \{ 1 \text{ sec}, 1 \text{ hour}, 1 \text{ day}, 1 \text{ year} \} = \{ 1 \text{ sec}, 3600 \text{ sec}, 86400 \text{ sec}, 31536000 \text{ sec} \}.$$
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \leq 10^{9} T$ by simply incrementing $n$ repeatedly. This is done in the Matlab script prob_34_chap_3.m and the results are displayed in Table XXX.
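For reference, an equivalent Python sketch of this naive search is given below (the original uses the Matlab script named above). The complexity functions listed here are only illustrative stand-ins; the problem's own list of $f(n)$ should be substituted, and the simple incrementing search is only practical for functions that grow reasonably quickly.

```python
import math

def largest_n(f, budget_ops):
    """Largest n with f(n) <= budget_ops, found by naive incrementing."""
    n = 1
    while f(n + 1) <= budget_ops:
        n += 1
    return n

budgets = {"1 sec": 1, "1 hour": 3600, "1 day": 86400, "1 year": 31536000}
examples = {                     # illustrative f(n) only
    "n^3": lambda n: n**3,
    "2^n": lambda n: 2**n,
    "n!": math.factorial,
}
for fname, f in examples.items():
    row = {tname: largest_n(f, 1e9 * T) for tname, T in budgets.items()}
    print(fname, row)
```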

Problem 35 (recursive v.s. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$ $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $n$ computations. Thus in total we have
$$nd + n = (1 + d)n$$
calculations.

To compute the sample covariance matrix $C_n$ we have to first compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$, we have a total of $nd$ calculations required to compute all of the $x_k - \hat{\mu}_n$ difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'$, which require $d^2$ calculations each (in a naive implementation). Since we have $n$ such outer products, computing them all requires $n d^2$ operations. Next, we have to add all of these $n$ outer products together, resulting in another set of $(n-1) d^2$ operations. Finally, dividing by the scalar $n - 1$ requires another $d^2$ operations. Thus in total we have
$$nd + n d^2 + (n-1) d^2 + d^2 = 2 n d^2 + nd$$
calculations.

Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First, for $\hat{\mu}_{n+1}$ we note that
$$\hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right) = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \hat{\mu}_n.$$
Writing $\frac{n}{n+1}$ as $\frac{n+1-1}{n+1} = 1 - \frac{1}{n+1}$, the above becomes
$$\hat{\mu}_{n+1} = \frac{1}{n+1} x_{n+1} + \hat{\mu}_n - \frac{1}{n+1} \hat{\mu}_n = \hat{\mu}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{\mu}_n \right),$$
as expected.

To derive the recursive update formula for the covariance matrix $C_n$ we begin by recalling the definition of $C_{n+1}$,
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})'.$$

Now introducing the recursive expression for the mean $\hat{\mu}_{n+1}$ computed above, we can write
$$x_k - \hat{\mu}_{n+1} = (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n),$$
so that
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} \left[ (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right] \left[ (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right]'$$
$$= \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_k - \hat{\mu}_n)' + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'.$$
Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$, we see that
$$n \hat{\mu}_n - \sum_{k=1}^{n} x_k = 0 \qquad \text{or} \qquad \sum_{k=1}^{n} (\hat{\mu}_n - x_k) = 0.$$
Using this fact, the second and third terms above can be greatly simplified. For example, we have
$$\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = \sum_{k=1}^{n} (x_k - \hat{\mu}_n) + (x_{n+1} - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n.$$
Using this identity, and the fact that the fourth term has a summand that does not depend on $k$, we have $C_{n+1}$ given by
$$C_{n+1} = \frac{n-1}{n} \left( \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' \right) + \frac{1}{n} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$\qquad - \frac{2}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \left( \frac{1}{n} - \frac{1}{n(n+1)} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \frac{1}{n} \left( \frac{n}{n+1} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)',$$
which is the result in the book.

Part (c): To compute $\hat{\mu}_n$ using the recursive formulation we have $d$ subtractions to compute $x_n - \hat{\mu}_{n-1}$, the scalar division requires another $d$ operations, and finally the addition of $\hat{\mu}_{n-1}$ requires another $d$ operations, giving in total $3d$ operations. This is to be compared with the $(d+1)n$ operations in the non-recursive case.



To compute the number of operations required to compute C n  recursively we have d  computations to compute the difference  x n µ ˆn−1  and then d then  d 2 operations to compute the outer 2 product (x (xn µ ˆn−1)(x )(xn µˆn−1 )′ , another d another  d2 operations to multiply C  multiply  C n−1  by nn− and finally 1 − d2 operations to add this product to the other one. Thus in total we have







 

 + d2 + d2 + d2 + d2 = 4d2 + d . d + d To be compare with 2nd 2 nd2 +  nd   in the non recur recursiv sivee case. case. For large n   the non-recursive calculations are much more computationally intensive.
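Both recursions are easy to check numerically. Below is a minimal Python sketch (an illustration, not part of the original solution; the data are random) that applies the recursive updates sample by sample and compares the final results against the direct batch formulas; note that numpy's cov uses the same $1/(n-1)$ normalization as $C_n$.

```python
import numpy as np

def recursive_mean_cov(X):
    """Apply mu_{n+1} = mu_n + (x_{n+1} - mu_n)/(n+1) and
    C_{n+1} = ((n-1)/n) C_n + (x_{n+1} - mu_n)(x_{n+1} - mu_n)^T / (n+1)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    mu = X[0].copy()                # mu_1
    C = np.zeros((d, d))            # C_1; its coefficient (n-1)/n vanishes at n = 1
    for n in range(1, X.shape[0]):  # n samples seen so far, X[n] is x_{n+1}
        diff = X[n] - mu            # x_{n+1} - mu_n (uses the old mean)
        C = ((n - 1) / n) * C + np.outer(diff, diff) / (n + 1)
        mu = mu + diff / (n + 1)
    return mu, C

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
mu_rec, C_rec = recursive_mean_cov(X)
print(np.allclose(mu_rec, X.mean(axis=0)))           # True
print(np.allclose(C_rec, np.cov(X, rowvar=False)))   # True (1/(n-1) normalization)
```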

Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function
$$\phi = x^T K x.$$
The $i$-th component of this derivative is given by
$$\frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i,$$
where $e_i$ is the $i$-th elementary basis vector for $\mathbb{R}^n$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since $x^T K e_i$ is a scalar, we have
$$x^T K e_i = \left( x^T K e_i \right)^T = e_i^T K^T x,$$
and the above becomes
$$\frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T (K + K^T) x.$$
Since multiplying by $e_i^T$ on the left selects the $i$-th row of the expression to its right, we see that the full gradient expression is given by
$$\nabla \phi = (K + K^T) x,$$
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.

Part (b): If $K$ is symmetric, then since $K^T = K$ the above expression simplifies to
$$\nabla \phi = 2 K x,$$
as claimed in the book.
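A quick finite-difference check of the identity $\nabla(x^T K x) = (K + K^T)x$ (an illustrative sketch, not part of the original text; the matrix and vector are random):

```python
import numpy as np

# Verify grad(x^T K x) = (K + K^T) x against central finite differences.
rng = np.random.default_rng(3)
n = 4
K = rng.normal(size=(n, n))          # a general (non-symmetric) matrix
x = rng.normal(size=n)

phi = lambda z: z @ K @ z
grad_analytic = (K + K.T) @ x

eps = 1e-6
grad_numeric = np.array([
    (phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
    for e in np.eye(n)               # perturb one coordinate at a time
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```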

Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem, which is
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
If we let $x = 1$ and $y = 1$, then $x + y = 2$ and the sum above becomes
$$2^n = \sum_{k=0}^{n} \binom{n}{k},$$
which is the stated identity. As an aside, note also that if we let $x = 1$ and $y = -1$, then $x + y = 0$ and we obtain another interesting identity involving the binomial coefficients,
$$0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k.$$
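Both identities are easy to confirm numerically, for instance (a quick sketch, not from the original text, with n = 12 chosen arbitrarily):

```python
from math import comb

n = 12
print(sum(comb(n, k) for k in range(n + 1)) == 2**n)           # True
print(sum((-1)**k * comb(n, k) for k in range(n + 1)) == 0)    # True
```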

Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j.$$
This can obviously be computed for every $i$. Now we want to consider the mean of the leave one out means. To do so, first notice that $\mu_{(i)}$ can be written as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right) = \frac{n}{n-1} \left( \frac{1}{n} \sum_{j=1}^{n} x_j \right) - \frac{x_i}{n-1} = \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1}.$$
With this we see that the mean of all of the leave one out means is given by
$$\frac{1}{n} \sum_{i=1}^{n} \mu_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \hat{\mu} = \hat{\mu},$$
as expected.
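A short numerical confirmation (illustrative, not part of the original solution; the data are random) that the average of the leave one out means recovers the sample mean $\hat{\mu}$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
n = x.size

# mu_(i) = (sum of all samples except x_i) / (n - 1), computed for every i at once.
loo_means = (x.sum() - x) / (n - 1)

print(np.isclose(loo_means.mean(), x.mean()))   # True
```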

Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n', p)$ we have that
$$P(k) = \binom{n'}{k} p^k (1 - p)^{n' - k}.$$
The $p$ that maximizes this expression is found by taking the derivative of the above with respect to $p$, setting the resulting expression equal to zero, and solving for $p$. We find that this derivative is given by
$$\frac{d}{dp} P(k) = \binom{n'}{k} k p^{k-1} (1 - p)^{n' - k} + \binom{n'}{k} p^k (n' - k)(1 - p)^{n' - k - 1} (-1).$$
Setting this equal to zero and dividing out the common factors gives $k(1 - p) - (n' - k)p = 0$, or $k = n' p$, so that $p = \frac{k}{n'}$, which is the empirical counting estimate of the probability of success, as we were asked to show.
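A small numeric sketch (not part of the original solution; the values of $n'$ and $k$ are made up) that evaluates the binomial likelihood on a grid of $p$ and confirms that it peaks at the counting estimate $p = k/n'$:

```python
from math import comb

# Evaluate P(k) = C(n', k) p^k (1-p)^(n'-k) on a fine grid of p.
n_prime, k = 20, 7
ps = [i / 10_000 for i in range(1, 10_000)]
likelihood = [comb(n_prime, k) * p**k * (1 - p)**(n_prime - k) for p in ps]

p_hat = ps[max(range(len(ps)), key=likelihood.__getitem__)]
print(p_hat, k / n_prime)   # both 0.35
```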
