Solutions to Selected Problems In: Pattern Classification by Duda, Hart, Stork

John L. Weatherwax∗

February 24, 2008

∗ [email protected]

Problem Solutions

Chapter 2 (Bayesian Decision Theory)

Problem 11 (randomized rules)

Part (a): Let $R(x)$ be the average risk given the measurement $x$. Since this risk depends on the action taken and, correspondingly, on the probability of taking that action, it can be expressed as
$$R(x) = \sum_{i=1}^{m} R(a_i|x) P(a_i|x).$$
Then the total risk (over all possible $x$) is given by the average of $R(x)$, or
$$R = \int_{\Omega_x} \left( \sum_{i=1}^{m} R(a_i|x) P(a_i|x) \right) p(x)\,dx.$$

Part (b): One would pick
$$P(a_i|x) = \begin{cases} 1 & i = \operatorname{ArgMin}_j\, R(a_j|x) \\ 0 & \text{otherwise.} \end{cases}$$
Thus for every $x$ we pick the action that minimizes the local risk. This amounts to a deterministic rule, i.e. there is no benefit to using randomness.

Part (c): Yes, since in a deterministic decision rule setting, randomizing some decisions might further improve performance, though how this would be done would be difficult to say.
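To make part (b) concrete, the following minimal Python sketch (an illustration, not part of the original solution; the numeric risks are made up) compares the conditional risk of the deterministic argmin rule against an arbitrary randomized rule at a single measurement $x$.

```python
import numpy as np

# Conditional risks R(a_i | x) for m = 3 actions at a single measurement x
# (illustrative numbers only).
risks = np.array([0.7, 0.2, 0.5])

# Deterministic rule of part (b): put all probability on the minimum-risk action.
p_det = np.zeros_like(risks)
p_det[np.argmin(risks)] = 1.0

# An arbitrary randomized rule over the same actions.
p_rand = np.array([0.2, 0.5, 0.3])

# R(x) = sum_i R(a_i | x) P(a_i | x); randomizing can never beat the argmin rule.
print(risks @ p_det)    # 0.2
print(risks @ p_rand)   # 0.39 >= 0.2
```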

Problem 13

Let
$$\lambda(\alpha_i|\omega_j) = \begin{cases} 0 & i = j, \quad i, j = 1, \dots, c \\ \lambda_r & i = c+1 \\ \lambda_s & \text{otherwise.} \end{cases}$$
Then, since the risk is defined as the expected loss, we have
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j) P(\omega_j|x), \qquad i = 1, 2, \dots, c \qquad (1)$$
$$R(\alpha_{c+1}|x) = \lambda_r. \qquad (2)$$
Now for $i = 1, 2, \dots, c$ the risk can be simplified as
$$R(\alpha_i|x) = \lambda_s \sum_{j=1,\, j \neq i}^{c} P(\omega_j|x) = \lambda_s \left(1 - P(\omega_i|x)\right).$$

Our decision function is to choose the action with the smallest risk. This means that we pick the reject option if and only if
$$\lambda_r < \lambda_s \left(1 - P(\omega_i|x)\right) \quad \text{for all } i,$$
which is equivalent to
$$P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s} \quad \text{for all } i.$$
Otherwise we pick the class $\omega_i$ with the smallest risk $\lambda_s(1 - P(\omega_i|x))$, i.e. the class for which
$$P(\omega_i|x) \geq P(\omega_j|x) \quad \forall j \neq i.$$
This can be interpreted as selecting the class with the largest posterior probability. In summary, we should reject if and only if
$$\max_i P(\omega_i|x) < 1 - \frac{\lambda_r}{\lambda_s}.$$
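As a quick illustration of this rule, here is a minimal Python sketch (not part of the original solution; the function name and the example posteriors are made up) that applies the reject option to a vector of posterior probabilities.

```python
import numpy as np

def classify_with_reject(posteriors, lambda_r, lambda_s):
    """Return the index of the chosen class, or -1 for the reject option.

    posteriors : array of P(omega_i | x) for i = 1..c (sums to 1).
    lambda_r   : cost of rejecting.
    lambda_s   : cost of a substitution (misclassification) error.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    threshold = 1.0 - lambda_r / lambda_s
    best = int(np.argmax(posteriors))       # class with the largest posterior
    if posteriors[best] < threshold:        # every posterior is below the threshold
        return -1                           # reject
    return best

# Example: a confident posterior is classified, an ambiguous one is rejected.
print(classify_with_reject([0.85, 0.10, 0.05], lambda_r=0.3, lambda_s=1.0))  # -> 0
print(classify_with_reject([0.40, 0.35, 0.25], lambda_r=0.3, lambda_s=1.0))  # -> -1
```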

If $\lambda_r = 0$ there is no loss for rejecting, the threshold becomes $1 - \lambda_r/\lambda_s = 1$, and since no posterior probability can exceed one we should always reject. If $\lambda_r > \lambda_s$, the loss from rejecting is greater than the loss from misclassifying and we should never reject. This can be seen from the above, since $\lambda_r/\lambda_s > 1$ implies $1 - \lambda_r/\lambda_s < 0$, and no posterior probability can be smaller than this threshold.

Problem 31 (probability of error)

Here $P(H_i)$ are the priors for each of the two classes. We begin by finding the Bayes decision boundary $\xi$ under the minimum error criterion. From the discussion in the book, the decision boundary is given by a likelihood ratio test: we decide class $H_2$ if

$$\frac{p(x|H_2)}{p(x|H_1)} \geq \frac{P(H_1)}{P(H_2)} \left( \frac{C_{21} - C_{11}}{C_{12} - C_{22}} \right),$$
and classify the point as a member of $H_1$ otherwise. The decision boundary $\xi$ is the point at which this inequality is an equality. If we use a minimum probability of error criterion, the costs $C_{ij}$ are given by $C_{ij} = 1 - \delta_{ij}$, and when $P(H_1) = P(H_2) = \frac{1}{2}$ the decision boundary reduces to
$$p(\xi|H_1) = p(\xi|H_2).$$
If each conditional distribution is Gaussian, $p(x|H_i) = \mathcal{N}(\mu_i, \sigma_i^2)$, then the above expression becomes
$$\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_1)^2}{\sigma_1^2} \right\} = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ -\frac{1}{2} \frac{(\xi - \mu_2)^2}{\sigma_2^2} \right\}.$$
If we assume (for simplicity) that $\sigma_1 = \sigma_2 = \sigma$, the above expression simplifies to $(\xi - \mu_1)^2 = (\xi - \mu_2)^2$, or, on taking the square root of both sides, $\xi - \mu_1 = \pm(\xi - \mu_2)$. If we take the plus sign in this equation we get the contradiction $\mu_1 = \mu_2$. Taking the minus sign and solving for $\xi$ we find our decision threshold given by
$$\xi = \frac{1}{2}(\mu_1 + \mu_2).$$

Again, since we have uniform priors ($P(H_1) = P(H_2) = \frac{1}{2}$), we have an error probability given by
$$2 P_e = \int_{-\infty}^{\xi} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx,$$
or
$$2\sqrt{2\pi}\,\sigma P_e = \int_{-\infty}^{\xi} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_2)^2}{\sigma^2} \right\} dx + \int_{\xi}^{\infty} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_1)^2}{\sigma^2} \right\} dx.$$
To evaluate the first integral use the substitution $v = \frac{x - \mu_2}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$, and in the second use $v = \frac{x - \mu_1}{\sqrt{2}\sigma}$, so $dv = \frac{dx}{\sqrt{2}\sigma}$. This then gives
$$2\sqrt{2\pi}\,\sigma P_e = \sqrt{2}\sigma \int_{-\infty}^{\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}} e^{-v^2}\,dv + \sqrt{2}\sigma \int_{-\frac{\mu_1 - \mu_2}{2\sqrt{2}\sigma}}^{\infty} e^{-v^2}\,dv.$$
Both integrals can be expressed in terms of the error function, and the resulting error probability $P_e$ decreases to zero as $\frac{|\mu_1 - \mu_2|}{\sigma} \to +\infty$. This can happen if $\mu_1$ and $\mu_2$ get far apart or if $\sigma$ shrinks to zero; in each case the classification problem gets easier.
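The closed form that follows from the above, $P_e = \frac{1}{2}\operatorname{erfc}\!\left(\frac{|\mu_1 - \mu_2|}{2\sqrt{2}\sigma}\right)$, is easy to check numerically. Below is a small Python sketch (an illustration, not part of the original solution; the particular means, variance, and sample size are made up) comparing it against a Monte Carlo estimate.

```python
import numpy as np
from math import erfc, sqrt

def bayes_error_equal_variance(mu1, mu2, sigma):
    """P_e for two equal-prior Gaussian classes N(mu1, sigma^2) and N(mu2, sigma^2).

    The decision threshold is xi = (mu1 + mu2)/2 and the two error integrals
    collapse to a single complementary-error-function term.
    """
    return 0.5 * erfc(abs(mu1 - mu2) / (2.0 * sqrt(2.0) * sigma))

# Monte Carlo check of the closed form.
rng = np.random.default_rng(0)
mu1, mu2, sigma, n = -1.0, 1.0, 1.5, 200_000
xi = 0.5 * (mu1 + mu2)
x1 = rng.normal(mu1, sigma, n)   # samples from class H1
x2 = rng.normal(mu2, sigma, n)   # samples from class H2
empirical = 0.5 * (np.mean(x1 > xi) + np.mean(x2 < xi))
print(bayes_error_equal_variance(mu1, mu2, sigma), empirical)
```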

Problem 32 (probability of error in higher dimensions)

As in Problem 31, the classification decision boundary is now the set of points $x$ that satisfy
$$\|x - \mu_1\|^2 = \|x - \mu_2\|^2.$$
Expanding each side of this expression we find
$$\|x\|^2 - 2 x^t \mu_1 + \|\mu_1\|^2 = \|x\|^2 - 2 x^t \mu_2 + \|\mu_2\|^2.$$
Canceling the common $\|x\|^2$ from each side and grouping the terms linear in $x$ we have that
$$x^t (\mu_1 - \mu_2) = \frac{1}{2}\left( \|\mu_1\|^2 - \|\mu_2\|^2 \right),$$
which is the equation of a hyperplane.
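A brief numerical check of this hyperplane form (an illustrative sketch, not part of the original solution; the random vectors are made up): any point of the form midpoint plus a vector orthogonal to $\mu_1 - \mu_2$ is equidistant from the two means and satisfies the linear equation above.

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = rng.normal(size=3), rng.normal(size=3)

# Any point x = midpoint + v, with v orthogonal to (mu1 - mu2), is equidistant
# from mu1 and mu2 and hence lies on the decision boundary.
w = mu1 - mu2
v = rng.normal(size=3)
v -= (v @ w) / (w @ w) * w          # project out the component along w
x = 0.5 * (mu1 + mu2) + v

lhs = x @ (mu1 - mu2)
rhs = 0.5 * (mu1 @ mu1 - mu2 @ mu2)
print(np.isclose(lhs, rhs),
      np.isclose(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2)))  # True True
```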


Problem 34 (error bounds on non-Gaussian data)

Part (a): Since the Bhattacharyya bound is a specialization of the Chernoff bound, we begin by considering the Chernoff bound. Specifically, using the bound $\min(a, b) \leq a^{\beta} b^{1-\beta}$, for $a > 0$, $b > 0$, and $0 \leq \beta \leq 1$, one can show that
$$P(\text{error}) \leq P(\omega_1)^{\beta} P(\omega_2)^{1-\beta} e^{-k(\beta)} \qquad \text{for } 0 \leq \beta \leq 1.$$
Here the expression $k(\beta)$ is given by
$$k(\beta) = \frac{\beta(1-\beta)}{2} (\mu_2 - \mu_1)^t \left[ \beta \Sigma_1 + (1-\beta) \Sigma_2 \right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln\left( \frac{|\beta \Sigma_1 + (1-\beta) \Sigma_2|}{|\Sigma_1|^{\beta} |\Sigma_2|^{1-\beta}} \right).$$

When we desire to compute the Bhattacharyya bound we assign $\beta = 1/2$. For a two-class scalar problem with symmetric means and equal variances, that is where $\mu_1 = -\mu$, $\mu_2 = +\mu$, and $\sigma_1 = \sigma_2 = \sigma$, the expression for $k(\beta)$ above becomes
$$k(\beta) = \frac{\beta(1-\beta)}{2} (2\mu) \left[ \beta \sigma^2 + (1-\beta) \sigma^2 \right]^{-1} (2\mu) + \frac{1}{2} \ln\left( \frac{|\beta \sigma^2 + (1-\beta)\sigma^2|}{\sigma^{2\beta} \sigma^{2(1-\beta)}} \right) = \frac{2\beta(1-\beta)\mu^2}{\sigma^2}.$$

Since the above expression is valid for all $\beta$ in the range $0 \leq \beta \leq 1$, we can obtain the tightest possible bound by computing the maximum of the expression $k(\beta)$ (since the dominant $\beta$-dependent factor is $e^{-k(\beta)}$). To evaluate this maximum of $k(\beta)$ we take the derivative with respect to $\beta$ and set that result equal to zero, obtaining
$$\frac{d}{d\beta}\left( \frac{2\beta(1-\beta)\mu^2}{\sigma^2} \right) = 0,$$
which gives
$$1 - 2\beta = 0,$$
or $\beta = 1/2$. Thus in this case the Chernoff bound equals the Bhattacharyya bound. When $\beta = 1/2$ we have the following bound on our error probability
$$P(\text{error}) \leq \sqrt{P(\omega_1) P(\omega_2)}\, e^{-\frac{\mu^2}{2\sigma^2}}.$$
To further simplify things, if we assume that our variance is such that $\sigma^2 = \mu^2$, and that our priors over the classes are taken to be uniform ($P(\omega_1) = P(\omega_2) = 1/2$), the above becomes
$$P(\text{error}) \leq \frac{1}{2} e^{-1/2} \approx 0.3033.$$
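A small numerical sketch (not part of the original solution) that evaluates $k(\beta) = 2\beta(1-\beta)\mu^2/\sigma^2$ on a grid, confirming that the maximizer is $\beta = 1/2$ and that the resulting bound is $\frac{1}{2} e^{-1/2} \approx 0.3033$ when $\sigma^2 = \mu^2$ and the priors are uniform:

```python
import numpy as np

# Evaluate the Chernoff exponent k(beta) = 2*beta*(1-beta)*mu^2/sigma^2 for the
# symmetric scalar case and confirm that it is maximized at beta = 1/2.
mu, sigma = 1.0, 1.0                      # sigma^2 = mu^2 as in the text
betas = np.linspace(0.0, 1.0, 10_001)
k = 2.0 * betas * (1.0 - betas) * mu**2 / sigma**2

beta_star = betas[np.argmax(k)]
bound = 0.5 * np.exp(-k.max())            # uniform priors: sqrt(P1 * P2) = 1/2
print(beta_star)                          # 0.5
print(bound)                              # ~0.3033
```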

Chapter 3 (Maximum Likelihood and Bayesian Estimation)

Problem 34

In the problem statement we are told that to operate on data of "size" $n$ requires $f(n)$ numeric calculations. Assuming that each calculation requires $10^{-9}$ seconds of time to compute, we can process a data structure of size $n$ in $10^{-9} f(n)$ seconds. If we want our results finished by a time $T$, then the largest data structure that we can process must satisfy
$$10^{-9} f(n) \leq T,$$
or
$$n \leq f^{-1}(10^{9} T).$$
Converting the given range of times into seconds we find that
$$T = \{ 1 \text{ sec}, 1 \text{ hour}, 1 \text{ day}, 1 \text{ year} \} = \{ 1 \text{ sec}, 3600 \text{ sec}, 86400 \text{ sec}, 31536000 \text{ sec} \}.$$
Using a very naive algorithm we can compute the largest $n$ such that $f(n) \leq 10^{9} T$ by simply incrementing $n$ repeatedly. This is done in the Matlab script prob_34_chap_3.m and the results are displayed in Table XXX.
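For reference, an equivalent Python sketch of this naive search is given below (the original uses the Matlab script named above). The complexity functions listed here are only illustrative stand-ins; the problem's own list of $f(n)$ should be substituted, and the simple incrementing search is only practical for functions that grow reasonably quickly.

```python
import math

def largest_n(f, budget_ops):
    """Largest n with f(n) <= budget_ops, found by naive incrementing."""
    n = 1
    while f(n + 1) <= budget_ops:
        n += 1
    return n

budgets = {"1 sec": 1, "1 hour": 3600, "1 day": 86400, "1 year": 31536000}
examples = {                     # illustrative f(n) only
    "n^3": lambda n: n**3,
    "2^n": lambda n: 2**n,
    "n!": math.factorial,
}
for fname, f in examples.items():
    row = {tname: largest_n(f, 1e9 * T) for tname, T in budgets.items()}
    print(fname, row)
```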

Problem 35 (recursive v.s. direct calculation of means and covariances)

Part (a): To compute $\hat{\mu}_n$ we have to add $n$ $d$-dimensional vectors, resulting in $nd$ computations. We then have to divide each component by the scalar $n$, resulting in another $n$ computations. Thus in total we have
$$nd + n = (1 + d)n$$
calculations.

To compute the sample covariance matrix $C_n$ we have to first compute the differences $x_k - \hat{\mu}_n$, requiring $d$ subtractions for each vector $x_k$. Since we must do this for each of the $n$ vectors $x_k$, we have a total of $nd$ calculations required to compute all of the $x_k - \hat{\mu}_n$ difference vectors. We then have to compute the outer products $(x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)'$, which require $d^2$ calculations each (in a naive implementation). Since we have $n$ such outer products, computing them all requires $n d^2$ operations. Next, we have to add all of these $n$ outer products together, resulting in another set of $(n-1) d^2$ operations. Finally, dividing by the scalar $n - 1$ requires another $d^2$ operations. Thus in total we have
$$nd + n d^2 + (n-1) d^2 + d^2 = 2 n d^2 + nd$$
calculations.

Part (b): We can derive recursive formulas for $\hat{\mu}_n$ and $C_n$ in the following way. First, for $\hat{\mu}_{n+1}$ we note that
$$\hat{\mu}_{n+1} = \frac{1}{n+1} \sum_{k=1}^{n+1} x_k = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \left( \frac{1}{n} \sum_{k=1}^{n} x_k \right) = \frac{1}{n+1} x_{n+1} + \frac{n}{n+1} \hat{\mu}_n.$$
Writing $\frac{n}{n+1}$ as $\frac{n+1-1}{n+1} = 1 - \frac{1}{n+1}$, the above becomes
$$\hat{\mu}_{n+1} = \frac{1}{n+1} x_{n+1} + \hat{\mu}_n - \frac{1}{n+1} \hat{\mu}_n = \hat{\mu}_n + \frac{1}{n+1} \left( x_{n+1} - \hat{\mu}_n \right),$$
as expected.

To derive the recursive update formula for the covariance matrix $C_n$ we begin by recalling the definition of $C_{n+1}$,
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_{n+1})(x_k - \hat{\mu}_{n+1})'.$$

Now introducing the recursive expression for the mean $\hat{\mu}_{n+1}$ computed above, we can write
$$x_k - \hat{\mu}_{n+1} = (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n),$$
so that
$$C_{n+1} = \frac{1}{n} \sum_{k=1}^{n+1} \left[ (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right] \left[ (x_k - \hat{\mu}_n) - \frac{1}{n+1}(x_{n+1} - \hat{\mu}_n) \right]'$$
$$= \frac{1}{n} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_k - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' - \frac{1}{n(n+1)} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_k - \hat{\mu}_n)' + \frac{1}{n(n+1)^2} \sum_{k=1}^{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'.$$
Since $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$, we see that
$$n \hat{\mu}_n - \sum_{k=1}^{n} x_k = 0 \qquad \text{or} \qquad \sum_{k=1}^{n} (\hat{\mu}_n - x_k) = 0.$$
Using this fact, the second and third terms above can be greatly simplified. For example, we have
$$\sum_{k=1}^{n+1} (x_k - \hat{\mu}_n) = \sum_{k=1}^{n} (x_k - \hat{\mu}_n) + (x_{n+1} - \hat{\mu}_n) = x_{n+1} - \hat{\mu}_n.$$
Using this identity, and the fact that the fourth term has a summand that does not depend on $k$, we have $C_{n+1}$ given by
$$C_{n+1} = \frac{n-1}{n} \left( \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu}_n)(x_k - \hat{\mu}_n)' \right) + \frac{1}{n} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$\qquad - \frac{2}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)' + \frac{1}{n(n+1)} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \left( \frac{1}{n} - \frac{1}{n(n+1)} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \frac{1}{n} \left( \frac{n}{n+1} \right) (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)'$$
$$= \frac{n-1}{n} C_n + \frac{1}{n+1} (x_{n+1} - \hat{\mu}_n)(x_{n+1} - \hat{\mu}_n)',$$
which is the result in the book.

Part (c): To compute $\hat{\mu}_n$ using the recursive formulation we have $d$ subtractions to compute $x_n - \hat{\mu}_{n-1}$, the scalar division requires another $d$ operations, and finally the addition of $\hat{\mu}_{n-1}$ requires another $d$ operations, giving in total $3d$ operations. This is to be compared with the $(d+1)n$ operations in the non-recursive case.



To compute the number of operations required to compute C n  recursively we have d  computations to compute the difference  x n µ ˆn−1  and then d then  d 2 operations to compute the outer 2 product (x (xn µ ˆn−1)(x )(xn µˆn−1 )′ , another d another  d2 operations to multiply C  multiply  C n−1  by nn− and finally 1 − d2 operations to add this product to the other one. Thus in total we have







 

 + d2 + d2 + d2 + d2 = 4d2 + d . d + d To be compare with 2nd 2 nd2 +  nd   in the non recur recursiv sivee case. case. For large n   the non-recursive calculations are much more computationally intensive.
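Both recursions are easy to check numerically. Below is a minimal Python sketch (an illustration, not part of the original solution; the data are random) that applies the recursive updates sample by sample and compares the final results against the direct batch formulas; note that numpy's cov uses the same $1/(n-1)$ normalization as $C_n$.

```python
import numpy as np

def recursive_mean_cov(X):
    """Apply mu_{n+1} = mu_n + (x_{n+1} - mu_n)/(n+1) and
    C_{n+1} = ((n-1)/n) C_n + (x_{n+1} - mu_n)(x_{n+1} - mu_n)^T / (n+1)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    mu = X[0].copy()                # mu_1
    C = np.zeros((d, d))            # C_1; its coefficient (n-1)/n vanishes at n = 1
    for n in range(1, X.shape[0]):  # n samples seen so far, X[n] is x_{n+1}
        diff = X[n] - mu            # x_{n+1} - mu_n (uses the old mean)
        C = ((n - 1) / n) * C + np.outer(diff, diff) / (n + 1)
        mu = mu + diff / (n + 1)
    return mu, C

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
mu_rec, C_rec = recursive_mean_cov(X)
print(np.allclose(mu_rec, X.mean(axis=0)))           # True
print(np.allclose(C_rec, np.cov(X, rowvar=False)))   # True (1/(n-1) normalization)
```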

Chapter 6 (Multilayer Neural Networks)

Problem 39 (derivatives of matrix inner products)

Part (a): We present a slightly different method to prove this here. We desire the gradient (or derivative) of the function
$$\phi = x^T K x.$$
The $i$-th component of this derivative is given by
$$\frac{\partial \phi}{\partial x_i} = \frac{\partial}{\partial x_i} \left( x^T K x \right) = e_i^T K x + x^T K e_i,$$
where $e_i$ is the $i$-th elementary basis vector for $\mathbb{R}^n$, i.e. it has a 1 in the $i$-th position and zeros everywhere else. Now since $x^T K e_i$ is a scalar, we have
$$x^T K e_i = \left( x^T K e_i \right)^T = e_i^T K^T x,$$
and the above becomes
$$\frac{\partial \phi}{\partial x_i} = e_i^T K x + e_i^T K^T x = e_i^T (K + K^T) x.$$
Since multiplying by $e_i^T$ on the left selects the $i$-th row of the expression to its right, we see that the full gradient expression is given by
$$\nabla \phi = (K + K^T) x,$$
as requested in the text. Note that this expression can also be proved easily by writing each term in component notation, as suggested in the text.

Part (b): If $K$ is symmetric, then since $K^T = K$ the above expression simplifies to
$$\nabla \phi = 2 K x,$$
as claimed in the book.
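A quick finite-difference check of the identity $\nabla(x^T K x) = (K + K^T)x$ (an illustrative sketch, not part of the original text; the matrix and vector are random):

```python
import numpy as np

# Verify grad(x^T K x) = (K + K^T) x against central finite differences.
rng = np.random.default_rng(3)
n = 4
K = rng.normal(size=(n, n))          # a general (non-symmetric) matrix
x = rng.normal(size=n)

phi = lambda z: z @ K @ z
grad_analytic = (K + K.T) @ x

eps = 1e-6
grad_numeric = np.array([
    (phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
    for e in np.eye(n)               # perturb one coordinate at a time
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```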

Chapter 9 (Algorithm-Independent Machine Learning)

Problem 9 (sums of binomial coefficients)

Part (a): We first recall the binomial theorem, which is
$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
If we let $x = 1$ and $y = 1$, then $x + y = 2$ and the sum above becomes
$$2^n = \sum_{k=0}^{n} \binom{n}{k},$$
which is the stated identity. As an aside, note also that if we let $x = 1$ and $y = -1$, then $x + y = 0$ and we obtain another interesting identity involving the binomial coefficients,
$$0 = \sum_{k=0}^{n} \binom{n}{k} (-1)^k.$$
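Both identities are easy to confirm numerically, for instance (a quick sketch, not from the original text, with n = 12 chosen arbitrarily):

```python
from math import comb

n = 12
print(sum(comb(n, k) for k in range(n + 1)) == 2**n)           # True
print(sum((-1)**k * comb(n, k) for k in range(n + 1)) == 0)    # True
```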

Problem 23 (the average of the leave one out means)

We define the leave one out mean $\mu_{(i)}$ as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j.$$
This can obviously be computed for every $i$. Now we want to consider the mean of the leave one out means. To do so, first notice that $\mu_{(i)}$ can be written as
$$\mu_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j = \frac{1}{n-1} \left( \sum_{j=1}^{n} x_j - x_i \right) = \frac{n}{n-1} \left( \frac{1}{n} \sum_{j=1}^{n} x_j \right) - \frac{x_i}{n-1} = \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1}.$$
With this we see that the mean of all of the leave one out means is given by
$$\frac{1}{n} \sum_{i=1}^{n} \mu_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{n}{n-1} \hat{\mu} - \frac{x_i}{n-1} \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) = \frac{n}{n-1} \hat{\mu} - \frac{1}{n-1} \hat{\mu} = \hat{\mu},$$
as expected.
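A short numerical confirmation (illustrative, not part of the original solution; the data are random) that the average of the leave one out means recovers the sample mean $\hat{\mu}$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
n = x.size

# mu_(i) = (sum of all samples except x_i) / (n - 1), computed for every i at once.
loo_means = (x.sum() - x) / (n - 1)

print(np.isclose(loo_means.mean(), x.mean()))   # True
```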

Problem 40 (maximum likelihood estimation with a binomial random variable)

Since $k$ has a binomial distribution with parameters $(n', p)$ we have that
$$P(k) = \binom{n'}{k} p^k (1 - p)^{n' - k}.$$
The $p$ that maximizes this expression is found by taking the derivative of the above with respect to $p$, setting the resulting expression equal to zero, and solving for $p$. We find that this derivative is given by
$$\frac{d}{dp} P(k) = \binom{n'}{k} k p^{k-1} (1 - p)^{n' - k} + \binom{n'}{k} p^k (n' - k)(1 - p)^{n' - k - 1} (-1).$$
Setting this equal to zero and dividing out the common factors gives $k(1 - p) - (n' - k)p = 0$, or $k = n' p$, so that $p = \frac{k}{n'}$, which is the empirical counting estimate of the probability of success, as we were asked to show.
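A small numeric sketch (not part of the original solution; the values of $n'$ and $k$ are made up) that evaluates the binomial likelihood on a grid of $p$ and confirms that it peaks at the counting estimate $p = k/n'$:

```python
from math import comb

# Evaluate P(k) = C(n', k) p^k (1-p)^(n'-k) on a fine grid of p.
n_prime, k = 20, 7
ps = [i / 10_000 for i in range(1, 10_000)]
likelihood = [comb(n_prime, k) * p**k * (1 - p)**(n_prime - k) for p in ps]

p_hat = ps[max(range(len(ps)), key=likelihood.__getitem__)]
print(p_hat, k / n_prime)   # both 0.35
```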
