CS 7015 - Deep Learning - Assignment 1

Shubham Patel,  CS17M051

 

Instructions:

•  This assignment is meant to help you grok certain concepts we will use in the course. Please don't copy solutions from any sources.
•  Avoid verbosity.
•  Questions marked with * are relatively difficult. Don't be discouraged if you cannot solve them right away!
•  The assignment needs to be written in LaTeX using the attached tex file. The solution for each question should be written in the solution block in the space already provided in the tex file. Handwritten assignments will not be accepted.

1. Partial Derivatives

(a) Consider the following computation,

[Computation graph: $x$, $w$, $b$ feed into $f(x)$, which feeds into $L$]

where

$$f(x) = \frac{1 + \tanh\left(\frac{wx+b}{2}\right)}{2}$$

and by definition: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$.

The value $L$ is given by

$$L = \frac{1}{2}\,(y - f(x))^2$$

Here, $x$ and $y$ are constants and $w$ and $b$ are parameters that can be modified. In other words, $L$ is a function of $w$ and $b$. Derive the partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$.

Solution: We know

$$f(x) = \frac{1 + \tanh\left(\frac{wx+b}{2}\right)}{2}, \qquad \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad L = \frac{1}{2}(y - f(x))^2.$$

a. Solving for $\frac{\partial L}{\partial w}$:

$$\frac{\partial L}{\partial w} = \frac{1}{2}\frac{\partial}{\partial w}(y - f(x))^2 = \frac{1}{2} \cdot 2 \cdot (y - f(x)) \cdot \left(0 - \frac{\partial}{\partial w} f(x)\right)$$

Putting $f(x)$ from the definition and taking the minus sign out,

$$= -(y - f(x))\,\frac{\partial}{\partial w}\left[\frac{1 + \tanh\left(\frac{wx+b}{2}\right)}{2}\right] = -\frac{y - f(x)}{2}\,\frac{\partial}{\partial w}\tanh\left(\frac{wx+b}{2}\right) \tag{1}$$

Simplifying the tanh definition,

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1}$$

and replacing $z$ by $\frac{wx+b}{2}$,

$$\tanh\left(\frac{wx+b}{2}\right) = \frac{e^{wx+b} - 1}{e^{wx+b} + 1} \tag{2}$$

Using (2) in (1) and applying the quotient rule,

$$\frac{\partial L}{\partial w} = -\frac{y - f(x)}{2}\,\frac{\partial}{\partial w}\left[\frac{e^{wx+b} - 1}{e^{wx+b} + 1}\right]
= -\frac{y - f(x)}{2}\left[\frac{(x\,e^{wx+b})(e^{wx+b} + 1) - (x\,e^{wx+b})(e^{wx+b} - 1)}{(e^{wx+b} + 1)^2}\right]$$

$$= -\frac{y - f(x)}{2}\left[\frac{x\,e^{wx+b}\big[(e^{wx+b} + 1) - (e^{wx+b} - 1)\big]}{(e^{wx+b} + 1)^2}\right]
= -\frac{y - f(x)}{2} \cdot 2 \cdot \frac{x\,e^{wx+b}}{(e^{wx+b} + 1)^2}$$

$$\frac{\partial L}{\partial w} = -(y - f(x))\,\frac{x\,e^{wx+b}}{(e^{wx+b} + 1)^2}$$

b. Solving for $\frac{\partial L}{\partial b}$:

$$\frac{\partial L}{\partial b} = \frac{1}{2}\frac{\partial}{\partial b}(y - f(x))^2 = \frac{1}{2} \cdot 2 \cdot (y - f(x)) \cdot \left(0 - \frac{\partial}{\partial b} f(x)\right)$$

Putting $f(x)$ from the definition and taking the minus sign out,

$$= -(y - f(x))\,\frac{\partial}{\partial b}\left[\frac{1 + \tanh\left(\frac{wx+b}{2}\right)}{2}\right] = -\frac{y - f(x)}{2}\,\frac{\partial}{\partial b}\tanh\left(\frac{wx+b}{2}\right) \tag{3}$$

Using (2) in (3),

$$= -\frac{y - f(x)}{2}\,\frac{\partial}{\partial b}\left[\frac{e^{wx+b} - 1}{e^{wx+b} + 1}\right]
= -\frac{y - f(x)}{2}\left[\frac{(e^{wx+b})(e^{wx+b} + 1) - (e^{wx+b})(e^{wx+b} - 1)}{(e^{wx+b} + 1)^2}\right]$$

$$= -\frac{y - f(x)}{2}\left[\frac{e^{wx+b}\big[(e^{wx+b} + 1) - (e^{wx+b} - 1)\big]}{(e^{wx+b} + 1)^2}\right]
= -\frac{y - f(x)}{2} \cdot 2 \cdot \frac{e^{wx+b}}{(e^{wx+b} + 1)^2}$$

$$\frac{\partial L}{\partial b} = -(y - f(x))\,\frac{e^{wx+b}}{(e^{wx+b} + 1)^2}$$
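As a sanity check (not part of the original submission), the two derivatives above can be verified against central finite differences; the numeric values of x, y, w, b below are arbitrary assumptions chosen only for the check.

```python
import numpy as np

# Arbitrary test values (any real values work for this check).
x, y, w, b = 0.7, 0.3, -1.2, 0.5

f = lambda w, b: (1 + np.tanh((w * x + b) / 2)) / 2   # f(x) as defined above
L = lambda w, b: 0.5 * (y - f(w, b)) ** 2             # squared-error loss

# Analytic derivatives from parts a and b.
e = np.exp(w * x + b)
dL_dw = -(y - f(w, b)) * x * e / (e + 1) ** 2
dL_db = -(y - f(w, b)) * e / (e + 1) ** 2

# Central finite differences.
h = 1e-6
num_dw = (L(w + h, b) - L(w - h, b)) / (2 * h)
num_db = (L(w, b + h) - L(w, b - h)) / (2 * h)

print(dL_dw, num_dw)   # should agree to ~6 decimal places
print(dL_db, num_db)
```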

(b) Consider the evaluation of $E$ as given below,

$$E = g(x, y, z) = \sigma(c(ax + by) + dz)$$

Here $x, y, z$ are inputs (constants) and $a, b, c, d$ are parameters (variables). $\sigma$ is the logistic sigmoid function defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Note that here $E$ is a function of $a, b, c, d$. Compute the partial derivatives of $E$ with respect to the parameters $a$, $b$ and $d$, i.e. $\frac{\partial E}{\partial a}$, $\frac{\partial E}{\partial b}$ and $\frac{\partial E}{\partial d}$.

Solution:

$$E = g(x, y, z) = \sigma\big[c(ax + by) + dz\big] \tag{1}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

a. $\frac{\partial E}{\partial a}$:

$$\frac{\partial E}{\partial a} = \frac{\partial}{\partial a}\, g(x, y, z) = \frac{\partial}{\partial a}\,\sigma(c(ax + by) + dz)
= \frac{\partial}{\partial a}\left[\frac{1}{1 + e^{-(c(ax+by)+dz)}}\right]
= \frac{\partial}{\partial a}\left[\frac{e^{c(ax+by)+dz}}{e^{c(ax+by)+dz} + 1}\right]$$

$$= \frac{(cx\,e^{acx+bcy+dz})(e^{acx+bcy+dz} + 1) - (cx\,e^{acx+bcy+dz})(e^{acx+bcy+dz})}{(e^{acx+bcy+dz} + 1)^2}
= \frac{cx\,e^{acx+bcy+dz}}{(e^{acx+bcy+dz} + 1)^2}$$

b. $\frac{\partial E}{\partial b}$:

$$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b}\,\sigma(c(ax + by) + dz)
= \frac{\partial}{\partial b}\left[\frac{e^{c(ax+by)+dz}}{e^{c(ax+by)+dz} + 1}\right]$$

$$= \frac{(cy\,e^{acx+bcy+dz})(e^{acx+bcy+dz} + 1) - (cy\,e^{acx+bcy+dz})(e^{acx+bcy+dz})}{(e^{acx+bcy+dz} + 1)^2}
= \frac{cy\,e^{acx+bcy+dz}}{(e^{acx+bcy+dz} + 1)^2}$$

c. $\frac{\partial E}{\partial d}$ (the multiplying factor here is $z$, the coefficient of $d$ in the exponent):

$$\frac{\partial E}{\partial d} = \frac{\partial}{\partial d}\,\sigma(c(ax + by) + dz)
= \frac{\partial}{\partial d}\left[\frac{e^{c(ax+by)+dz}}{e^{c(ax+by)+dz} + 1}\right]$$

$$= \frac{(z\,e^{acx+bcy+dz})(e^{acx+bcy+dz} + 1) - (z\,e^{acx+bcy+dz})(e^{acx+bcy+dz})}{(e^{acx+bcy+dz} + 1)^2}
= \frac{z\,e^{acx+bcy+dz}}{(e^{acx+bcy+dz} + 1)^2}$$

)(e ) =   (d(acx (e +bcy+dz) + 1)2 2.  Erroneous Estimates The first order derivative of a real valued function   f  is defined by the following limit (if  it exists), df (x) f (x + h ) − f (x)   = lim   (3) dx

h→0

h

On observing the above definition we see that the derivative of a function is the ratio of  change in the function value to the change in the function input, when we change the input by a small quantit quantity y (infin (infinitesi itesimally mally small). 2

Consider the function   f (x) =  x − 2x + 1.

(x)  = (a) Using the limit definition definition of deriv derivativ ative, e, show that the deriv derivativ ativee of   f (x) is   df dx 2x − 2.

Solution:

$$f(x) = x^2 - 2x + 1, \qquad \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

From these two equations,

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{\big((x + h)^2 - 2(x + h) + 1\big) - \big(x^2 - 2x + 1\big)}{h}
= \lim_{h \to 0} \frac{x^2 + 2hx + h^2 - 2x - 2h + 1 - x^2 + 2x - 1}{h}$$

$$= \lim_{h \to 0} \frac{h^2 + 2hx - 2h}{h} = \lim_{h \to 0} (h + 2x - 2)$$

Applying the limit,

$$\frac{df(x)}{dx} = 2x - 2$$

(b) The function evaluates to 0 at 1, i.e. $f(1) = 0$. Say we wanted to estimate the value of $f(1.01)$ and $f(1.5)$ without using the definition of $f(x)$. We could think of using the definition of the derivative to "extrapolate" the value of $f(1)$ to obtain $f(1.01)$ and $f(1.5)$. A first degree approximation based on (3) would be the following.

$$f(x + h) \approx f(x) + h\,\frac{df(x)}{dx}$$

Estimate $f(1.01)$ and $f(1.5)$ using the above formula.

Solution:

$$f(x + h) = f(x) + h\,\frac{df(x)}{dx}$$

$$f(1 + 0.01) = f(1) + 0.01 \cdot (2(1) - 2) = 0 + 0.01 \cdot 0 = 0$$

For $f(1 + 0.5)$:

$$f(1 + 0.5) = f(1) + 0.5 \cdot (2(1) - 2) = 0 + 0.5 \cdot 0 = 0$$

(c) Compare it to the actual value of $f(1.01) = 0.0001$, and $f(1.5) = 0.25$.

Solution: The error in $f(1.01)$ is $0.0001$. The error in $f(1.5)$ is $0.25$.

(d) Explain the discrepancy from the actual value. Why does it increase/decrease when we move further away from 1?

Solution: The error of the approximation increases as the distance from 1 increases. The first degree approximation is meant to be applied for a very small step; as the step grows, the approximation fails. In the limit where the distance between the expansion point and the point being estimated tends to zero, we get the correct result.

(e) Can we get a better estimate of $f(1.01)$ and $f(1.5)$ by "correcting" our estimate from part (a)? Can you suggest a way of doing this?

Solution: A better estimate can be obtained by using higher order derivatives in the approximation equation, adding terms until the remaining ones vanish. It is an application of the Taylor series.

$$f(1.01) = f(1) + (0.01)(2(1) - 2) + (0.01)^2 \cdot \frac{2}{2!} = 0 + 0 + 0.0001 = 0.0001$$

In the second case,

$$f(1 + 0.5) = f(1) + (0.5)(2(1) - 2) + (0.5)^2 \cdot \frac{2}{2!} = 0 + 0 + 0.25 = 0.25$$

Our approximation is better here (in fact exact, since $f$ is a quadratic and the second order expansion recovers it fully).
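For concreteness, here is a small numerical sketch (not part of the original solution) comparing the first order and second order Taylor estimates of $f(1+h)$ against the true value; the step sizes 0.01 and 0.5 are the ones used above.

```python
f = lambda x: x**2 - 2*x + 1      # f(x) = x^2 - 2x + 1
df = lambda x: 2*x - 2            # first derivative
d2f = lambda x: 2.0               # second derivative (constant)

x0 = 1.0
for h in (0.01, 0.5):
    first_order = f(x0) + h * df(x0)
    second_order = first_order + (h**2 / 2) * d2f(x0)
    print(f"h={h}: true={f(x0 + h):.4f}, "
          f"1st-order={first_order:.4f}, 2nd-order={second_order:.4f}")
```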

3. Differentiation w.r.t. Vectors and Matrices

Consider vectors $u, x \in \mathbb{R}^{d}$, and matrix $A \in \mathbb{R}^{n \times n}$. The gradient of a scalar function $f$ w.r.t. a vector $x$ is a vector by itself, given by

$$\nabla_{x} f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$

The gradient of a scalar function w.r.t. a matrix is a matrix.

$$\nabla_{A} f = \begin{pmatrix}
\frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\
\frac{\partial f}{\partial A_{21}} & \frac{\partial f}{\partial A_{22}} & \cdots & \frac{\partial f}{\partial A_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial A_{n1}} & \frac{\partial f}{\partial A_{n2}} & \cdots & \frac{\partial f}{\partial A_{nn}}
\end{pmatrix}$$

The gradient of the gradient of a function w.r.t. a vector is a matrix. It is referred to as the Hessian.

$$H_{x} f = \nabla^{2}_{x} f = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

(a) Derive the expressions for the following gradients.

1. $\nabla_{x}\, u^{T} x$
2. $\nabla_{x}\, x^{T} x$
3. $\nabla_{x}\, x^{T} A x$
4. $\nabla_{A}\, x^{T} A x$
5. $\nabla^{2}_{x}\, x^{T} A x$

(Aside: Compare your results with the derivatives for the scalar equivalents of the above expressions, $ax$ and $x^2$.) The gradient of a scalar $f$ w.r.t. a matrix $X$ is a matrix whose $(i, j)$ component is $\frac{\partial f}{\partial X_{ij}}$, where $X_{ij}$ is the $(i, j)$ component of the matrix $X$.

Solution:

1. $\nabla_{x}\, u^{T} x$

$$\nabla_{x}\left[(u_1 \;\; u_2 \;\; \ldots \;\; u_n)\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}\right] = \nabla_{x}\,[u_1 x_1 + u_2 x_2 + \cdots + u_n x_n]$$

$$= \left(\frac{\partial}{\partial x_1}(u_1 x_1 + \cdots + u_n x_n),\; \frac{\partial}{\partial x_2}(u_1 x_1 + \cdots + u_n x_n),\; \ldots,\; \frac{\partial}{\partial x_n}(u_1 x_1 + \cdots + u_n x_n)\right)$$

$$= (u_1, u_2, \ldots, u_n) = u^{T}$$

2. $\nabla_{x}\, x^{T} x$

$$\nabla_{x}\left[x_1^2 + x_2^2 + \cdots + x_n^2\right]
= \left(\frac{\partial}{\partial x_1}(x_1^2 + \cdots + x_n^2),\; \ldots,\; \frac{\partial}{\partial x_n}(x_1^2 + \cdots + x_n^2)\right)
= (2x_1, 2x_2, \ldots, 2x_n) = 2x^{T}$$

3. $\nabla_{x}\, x^{T} A x$

Expanding the quadratic form,

$$f = x^{T} A x = (x_1 a_{11} + x_2 a_{21} + \cdots + x_n a_{n1})\,x_1 + \cdots + (x_1 a_{1n} + x_2 a_{2n} + \cdots + x_n a_{nn})\,x_n$$

Differentiating w.r.t. each component (each $x_k$ appears once through column $k$ of $A$ and once through row $k$),

$$\frac{\partial f}{\partial x_1} = (x_1 a_{11} + x_2 a_{21} + \cdots + x_n a_{n1}) + (x_1 a_{11} + x_2 a_{12} + \cdots + x_n a_{1n})$$

and similarly,

$$\frac{\partial f}{\partial x_k} = (x_1 a_{1k} + x_2 a_{2k} + \cdots + x_n a_{nk}) + (x_1 a_{k1} + x_2 a_{k2} + \cdots + x_n a_{kn}), \quad k = 1, \ldots, n$$

Collecting the components,

$$\nabla_{x}\, x^{T} A x = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right) = (A^{T} x)^{T} + (A x)^{T} = x^{T}(A + A^{T})$$

4. $\nabla_{A}\, x^{T} A x$

Operating on the $f$ obtained in the above part of the question, $\frac{\partial f}{\partial A_{ij}} = x_i x_j$, so

$$\nabla_{A} f = \begin{pmatrix}
\frac{\partial f}{\partial A_{11}} & \frac{\partial f}{\partial A_{12}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial A_{n1}} & \frac{\partial f}{\partial A_{n2}} & \cdots & \frac{\partial f}{\partial A_{nn}}
\end{pmatrix}
= \begin{pmatrix}
x_1^2 & x_1 x_2 & \cdots & x_1 x_n \\
x_2 x_1 & x_2^2 & \cdots & x_2 x_n \\
\vdots & \vdots & \ddots & \vdots \\
x_n x_1 & x_n x_2 & \cdots & x_n^2
\end{pmatrix} = x x^{T}$$

5. $\nabla^{2}_{x}\, x^{T} A x$

Using the $f$ derived in the 3rd part,

$$H_{x} f = \nabla^{2}_{x} f = \begin{pmatrix}
2a_{11} & a_{12} + a_{21} & \cdots & a_{1n} + a_{n1} \\
a_{21} + a_{12} & 2a_{22} & \cdots & a_{2n} + a_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} + a_{1n} & a_{n2} + a_{2n} & \cdots & 2a_{nn}
\end{pmatrix} = A + A^{T}$$
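As an illustrative check (not part of the original assignment), the identities above can be verified numerically with finite differences on a random $A$ and $x$; the helper `num_grad` below is a small utility written just for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
u = rng.normal(size=n)
x = rng.normal(size=n)

def num_grad(f, p, h=1e-6):
    """Central finite-difference gradient of a scalar function f at point p (1-D array)."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p); e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# 1. grad of u^T x is u
print(np.allclose(num_grad(lambda v: u @ v, x), u))
# 2. grad of x^T x is 2x
print(np.allclose(num_grad(lambda v: v @ v, x), 2 * x))
# 3. grad of x^T A x is (A + A^T) x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))
# 4. grad w.r.t. A of x^T A x is x x^T (checked entrywise on the flattened matrix)
print(np.allclose(np.outer(x, x),
                  num_grad(lambda a: x @ a.reshape(n, n) @ x, A.ravel()).reshape(n, n)))
```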

  

 

(b) Use the equations obtained in the previous part to get the linear regression solution that you studied in ML or PR. Suppose $X$ is the input example-feature matrix, $Y$ the vector of given outputs and $w$ the weight vector.

Solution:

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m} \\
1 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}, \qquad
W = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{pmatrix}$$

$$\text{error}_j = y_j - \sum_{i=0}^{m} w_i\, x_{ji}, \quad j = 1, \ldots, n \quad (\text{with } x_{j0} = 1)$$

$$\text{Error} = \frac{1}{2}\left(\text{error}_1^2 + \text{error}_2^2 + \cdots + \text{error}_n^2\right)
= \frac{1}{2}\sum_{j=1}^{n}\left(y_j - \sum_{i=0}^{m} w_i\, x_{ji}\right)^2
= \frac{1}{2}\sum_{j=1}^{n}\left(y_j - X_{j,\text{row}}\, W\right)^2
= \frac{1}{2}(Y - XW)^{T}(Y - XW)$$

Setting the gradient w.r.t. $W$ to zero,

$$\frac{\partial\,\text{Error}}{\partial W} = \frac{\partial}{\partial W}\,\frac{1}{2}(Y - XW)^{T}(Y - XW)
= \frac{1}{2} \cdot 2 \cdot (Y - XW)^{T}\,\frac{\partial}{\partial W}(Y - XW)$$

$$0 = (Y - XW)^{T}(-X)$$

$$0 = (Y^{T} - W^{T} X^{T})X \;\;\Rightarrow\;\; W^{T} X^{T} X = Y^{T} X \;\;\Rightarrow\;\; (X^{T} X)W = X^{T} Y$$

$$W = (X^{T} X)^{-1} X^{T} Y$$
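A minimal sketch (not from the original document) of the closed-form solution above on synthetic data; in practice `np.linalg.lstsq` or `solve` is preferred over forming the explicit inverse, which is used here only in spirit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
X_raw = rng.normal(size=(n, m))
X = np.hstack([np.ones((n, 1)), X_raw])          # prepend a column of ones for w0
true_w = np.array([0.5, -1.0, 2.0, 3.0])
Y = X @ true_w + 0.01 * rng.normal(size=n)        # slightly noisy targets

# Normal equation: W = (X^T X)^{-1} X^T Y, solved without an explicit inverse.
W = np.linalg.solve(X.T @ X, X.T @ Y)

print(W)                                          # close to true_w
print(np.allclose(W, np.linalg.lstsq(X, Y, rcond=None)[0]))
```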

(c) By now you must have the intuition. Gradient w.r.t. a 1-dimensional array was 1-dimensional. Gradient w.r.t. a 2-dimensional array was 2-dimensional. Higher order arrays are referred to as tensors. Let $T$ be a 3-dimensional tensor. Write the gradient in the expression of $\nabla_{T} f$, using the gradients w.r.t. a vector or a matrix. You can use the grad expression.

Solution:

If $T \in \mathbb{R}^{p \times q \times r}$, then $\nabla_{T} f$ is itself a $p \times q \times r$ tensor whose $(i, j, k)$ entry is $\frac{\partial f}{\partial T_{ijk}}$. It can be written as a stack of $r$ matrices (one per slice along the third index), each of which is a gradient w.r.t. a matrix:

$$\nabla_{T} f = \left[\begin{pmatrix}
\frac{\partial f}{\partial T_{11k}} & \frac{\partial f}{\partial T_{12k}} & \cdots & \frac{\partial f}{\partial T_{1qk}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial T_{p1k}} & \frac{\partial f}{\partial T_{p2k}} & \cdots & \frac{\partial f}{\partial T_{pqk}}
\end{pmatrix}\right]_{k = 1, \ldots, r}$$

4. Ordered Derivatives

An ordered network is a network where the state variables can be computed one at a time in a specified order. Answer the following questions regarding such a network.

(a) Given the ordered network below, give a formula for calculating the ordered derivative $\frac{\partial y_4}{\partial y_1}$ in terms of partial derivatives w.r.t. $y_1$ and $y_2$, where $y_1$, $y_2$ and $y_3$ are the outputs of nodes 1, 2 and 3 respectively.

[Ordered network: node 1 (output $y_1$) feeds nodes 2 and 3; node 2 (output $y_2$) feeds nodes 3 and 4; node 3 (output $y_3$) feeds node 4 (output $y_4$).]

Solution:

$$\frac{\partial y_4}{\partial y_1} = \frac{\partial y_4}{\partial y_3}\left(\frac{\partial y_3}{\partial y_2}\frac{\partial y_2}{\partial y_1} + \frac{\partial y_3}{\partial y_1}\right) + \frac{\partial y_4}{\partial y_2}\frac{\partial y_2}{\partial y_1}$$

(b) The figure above can be viewed as a dependency graph as it tells us which variables in the system depend on which other variables. For example, we see that $y_3$ depends on $y_1$ and $y_2$, which in turn also depends on $y_1$. Now consider the network given below,

[Recurrent network: states $s_{-1}, s_0, s_1, s_2, s_3$ and inputs $x_1, x_2, x_3$; each $s_i$ receives $W s_{i-1}$, $Y s_{i-2}$ and $U x_i$.]

Here, $s_i = \sigma(W s_{i-1} + Y s_{i-2} + U x_i + b) \quad (\forall i \geq 1)$.

Can you draw a dependency graph involving the variables $s_3, s_2, s_1, W, Y$?

Solution:

[Dependency graph: $s_3$ depends on $s_2$, $s_1$, $W$ and $Y$; $s_2$ depends on $s_1$, $W$ and $Y$; $s_1$ depends on $W$ and $Y$.]

(c) Give a formula for computing $\frac{\partial s_3}{\partial W}$, $\frac{\partial s_3}{\partial Y}$ and $\frac{\partial s_3}{\partial U}$ for the network shown in part (b).

Solution: We know (dropping the bias $b$, which does not affect these derivatives),

$$s_3 = \sigma(U x_3 + W s_2 + Y s_1) \tag{1}$$
$$s_2 = \sigma(U x_2 + W s_1 + Y s_0) \tag{2}$$
$$s_1 = \sigma(U x_1 + W s_0 + Y s_{-1}) \tag{3}$$

Write $u_i = U x_i + W s_{i-1} + Y s_{i-2}$, so that $s_i = \sigma(u_i)$ and

$$\sigma'(u_i) = \frac{e^{-u_i}}{(1 + e^{-u_i})^2}.$$

a. Derivative w.r.t. $W$. Since $s_2$ and $s_1$ themselves depend on $W$ (while $s_0$ and $s_{-1}$ are constants),

$$\frac{\partial s_3}{\partial W} = \sigma'(u_3)\left(s_2 + W\frac{\partial s_2}{\partial W} + Y\frac{\partial s_1}{\partial W}\right) \tag{4}$$

$$\frac{\partial s_2}{\partial W} = \sigma'(u_2)\left(s_1 + W\frac{\partial s_1}{\partial W}\right), \qquad \frac{\partial s_1}{\partial W} = \sigma'(u_1)\, s_0 \tag{5}$$

Substituting (5) back into (4),

$$\frac{\partial s_3}{\partial W} = \sigma'(u_3)\Big[s_2 + W\,\sigma'(u_2)\big(s_1 + W\,\sigma'(u_1)\, s_0\big) + Y\,\sigma'(u_1)\, s_0\Big]$$

b. Derivative w.r.t. $Y$. Here the direct term is $s_1$,

$$\frac{\partial s_3}{\partial Y} = \sigma'(u_3)\left(s_1 + W\frac{\partial s_2}{\partial Y} + Y\frac{\partial s_1}{\partial Y}\right), \qquad
\frac{\partial s_2}{\partial Y} = \sigma'(u_2)\left(s_0 + W\frac{\partial s_1}{\partial Y}\right), \qquad
\frac{\partial s_1}{\partial Y} = \sigma'(u_1)\, s_{-1}$$

$$\frac{\partial s_3}{\partial Y} = \sigma'(u_3)\Big[s_1 + W\,\sigma'(u_2)\big(s_0 + W\,\sigma'(u_1)\, s_{-1}\big) + Y\,\sigma'(u_1)\, s_{-1}\Big]$$

c. Derivative w.r.t. $U$. Here the direct term is $x_3$,

$$\frac{\partial s_3}{\partial U} = \sigma'(u_3)\left(x_3 + W\frac{\partial s_2}{\partial U} + Y\frac{\partial s_1}{\partial U}\right), \qquad
\frac{\partial s_2}{\partial U} = \sigma'(u_2)\left(x_2 + W\frac{\partial s_1}{\partial U}\right), \qquad
\frac{\partial s_1}{\partial U} = \sigma'(u_1)\, x_1$$

$$\frac{\partial s_3}{\partial U} = \sigma'(u_3)\Big[x_3 + W\,\sigma'(u_2)\big(x_2 + W\,\sigma'(u_1)\, x_1\big) + Y\,\sigma'(u_1)\, x_1\Big]$$
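To ground the recursion, here is a small scalar sketch (not in the original) that checks the formula for $\frac{\partial s_3}{\partial W}$ against a finite difference; all numeric values below are arbitrary assumptions made only for the check.

```python
import numpy as np

sigmoid = lambda u: 1 / (1 + np.exp(-u))
dsigmoid = lambda u: np.exp(-u) / (1 + np.exp(-u)) ** 2   # sigma'(u)

# Arbitrary scalar parameters, inputs and initial states (assumed for the check).
W, Y, U = 0.8, -0.3, 0.5
x = {1: 0.2, 2: -0.7, 3: 1.1}
s_init = {-1: 0.1, 0: 0.4}

def forward(W):
    s = dict(s_init)
    for i in (1, 2, 3):
        s[i] = sigmoid(U * x[i] + W * s[i - 1] + Y * s[i - 2])
    return s

s = forward(W)
u = {i: U * x[i] + W * s[i - 1] + Y * s[i - 2] for i in (1, 2, 3)}

# Recursive formula from part (c).
ds1 = dsigmoid(u[1]) * s[0]
ds2 = dsigmoid(u[2]) * (s[1] + W * ds1)
ds3 = dsigmoid(u[3]) * (s[2] + W * ds2 + Y * ds1)

# Finite-difference check of ds3/dW.
h = 1e-6
numeric = (forward(W + h)[3] - forward(W - h)[3]) / (2 * h)
print(ds3, numeric)   # should agree closely
```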

5. Baby Steps

From basic calculus, we know that we can find the minima (local and global) of a function by finding the first and second order derivatives. We set the first derivative to zero and verify that the second derivative at the same point is positive. The reasoning behind this procedure is based on the interpretation of the derivative of a function as the slope of the function at any given point.

The above procedure, even though correct, can be intractable in practice while trying to minimize functions. And this is not just a problem for the multivariable case, but even for single variable functions. Consider minimizing the function $f(x) = x^5 + 5\sin(x) + 10\tan(x)$. Although the function $f$ is a contrived example, the point is that the standard derivative approach might not always be a feasible way to find minima of functions. In this course, we will be routinely dealing with minimizing functions of multiple variables (in fact millions of variables). Of course we will not be solving them by hand, but we need a more efficient way of minimizing functions. For the sake of this problem, consider we are trying to minimize a convex function of one variable $f(x)$ (see https://en.wikipedia.org/wiki/Convex_function), which is guaranteed to have a single minimum. We will now build an iterative approach to finding the minima of functions. The high level idea is the following: start at a (random) point $x_0$. Verify if we are at the minimum. If not, change the value so that we are moving closer to the minimum. Keep repeating until we hit the minimum.

(a) Use the intuition built from Q.3 to find a way to change the current value of $x$ while still ensuring that we are improving (i.e. minimizing) the function.

Solution: The slope can be checked at every step until convergence: if the function is increasing at $x_0$ (positive slope), $x_0$ should be decreased, and if it is decreasing (negative slope), $x_0$ should be increased. Both cases are captured by the update below (a gradient descent step, sketched in code after part (b)).

while (!converged) {
    $x_0 = x_0 - \alpha\,\frac{\partial}{\partial x} f(x_0)$
}

After a number of steps the value of $x_0$ saturates; then we know that it has converged.

(b) How would you use the same idea if you had to minimize a function of multiple variables?

Solution: The same idea as above can also be used in the multivariable case.

while (!converged) {
    $x_t = x_0 - \alpha\,\frac{\partial}{\partial x} f(x_0, y_0)$
    $y_t = y_0 - \alpha\,\frac{\partial}{\partial y} f(x_0, y_0)$
    $x_0 = x_t$
    $y_0 = y_t$
}
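Below is a minimal runnable sketch of the update rule just described (not part of the submitted solution); the test functions, learning rate and stopping tolerance are arbitrary assumptions.

```python
import numpy as np

def grad_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Plain gradient descent: repeat x <- x - lr * grad(x) until the step is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lr * grad(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Single variable: f(x) = (x - 3)^2, gradient 2(x - 3); minimum at x = 3.
print(grad_descent(lambda x: 2 * (x - 3), x0=[0.0]))

# Two variables: f(x, y) = x^2 + 2y^2, gradient (2x, 4y); minimum at (0, 0).
print(grad_descent(lambda v: np.array([2 * v[0], 4 * v[1]]), x0=[1.0, -2.0]))
```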

(c) Does your method always lead to the global minimum (smallest value) for non convex functions (which may have multiple local minima)? If yes, can you explain (prove or argue) why? If not, can you give a concrete example of a case where it fails?

Solution: The above method may fail in the case of a non-convex function. It may return a local minimum as the solution instead of the global minimum. Also, it can end up in different local minima depending on its initialization $x_0$.

[Figure: a non-convex curve with three starting points $x_1, x_2, x_3$ converging to three different minima $M_1, M_2, M_3$.]

From the figure we can see that if the randomly chosen starting point is $x_1$, the resulting minimum will be $M_1$, which is not the global minimum. Similarly, starting from $x_2$ the minimum will be $M_2$, and starting from $x_3$ the minimum will be $M_3$.

(d) Do you think this procedure always works for convex functions? (i.e., are we always guaranteed to reach the minimum)

Solution: Yes, the above procedure always works for convex functions and is guaranteed to reach the minimum, provided the learning rate is selected properly.

(e) (Extra) Can you think of the number of steps needed to reach the minimum?

Solution: It depends on the learning rate. If the learning rate is low it will take a long time to converge. If the learning rate is set too large it is possible that it may not converge at all.

(f) (Extra) Can you think of ways to improve the number of steps needed to reach the minimum?

Solution: Tuning the learning rate appropriately is the main way to reduce the number of steps needed to reach the minimum.

6. Constrained Optimization

Let $f(x, y)$ and $g(x, y)$ be smooth (continuous, differentiable etc.) real valued functions of two variables. We want to minimize $f$, which is a convex function of $x$ and $y$.

(a) Argue that at the minima of $f$, the partial derivatives $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ will be zero. Thus setting the partial derivatives to zero is a possible method for finding the minima.

Solution: At a minimum the slope changes its course from negative to positive, and this holds along every variable present. During this transition the slope at the minimum becomes zero, and it is zero with respect to every variable. So considering a single variable at a time, while treating the others as constants, and setting each partial derivative to zero is a satisfactory condition and an indicator that a minimum may be present.

(b) Suppose we are only interested in minimizing $f$ in the region where $g(x, y) = c$, where $c$ is some constant. Suppose this region is a curve in the x-y plane. Call this the feasible curve. Will our previous technique still work in this case? Why or why not?

Solution: No, the previous technique will not work. The scenario has changed: earlier we were considering the minimum of the function on its own, but now we are looking for a minimum that is constrained to lie on the feasible curve. Both the function and the constraint have to be taken into consideration together.

(c) What is the component of $\nabla g$ along the feasible curve, when computed at points lying on the curve?

Solution: $\nabla g$ will be perpendicular to the curve, so its component along the curve is zero.

(d) * At the point on the feasible curve which achieves the minimum value of $f$, what will be the component of $\nabla f$ along the curve?

Solution: The component along the curve will be zero, since the gradient will be perpendicular to the curve at that point.

(e) Using the previous answers, show that at the point on the feasible curve achieving the minimum value of $f$, $\nabla f = \lambda \nabla g$ for some real number $\lambda$. Thus, this equation, combined with the constraint $g(x, y) = c$, should enable us to find the minima.

Solution: At the constrained minimum both $\nabla f$ and $\nabla g$ are perpendicular to the feasible curve, hence they point along the same direction. So

$$\nabla f \propto \nabla g$$

and replacing the proportionality by a constant $\lambda$,

$$\nabla f = \lambda \nabla g$$

where $\lambda$ is the Lagrange multiplier.

(f) * Using the insights from the discussion so far, solve the following optimization problem:

$$\max_{x, y, z} \; x^a y^b z^c \quad \text{where } x + y + z = 1 \text{ and given } a, b, c > 0.$$

 

Solution:

$$f(x, y, z) = x^a y^b z^c, \qquad g(x, y, z) = x + y + z - 1$$

$$\nabla\, x^a y^b z^c = \left(\frac{\partial}{\partial x} x^a y^b z^c,\; \frac{\partial}{\partial y} x^a y^b z^c,\; \frac{\partial}{\partial z} x^a y^b z^c\right)
= \left(a x^{a-1} y^b z^c,\; b x^a y^{b-1} z^c,\; c x^a y^b z^{c-1}\right) \tag{1}$$

$$\nabla (x + y + z - 1) = \left(\frac{\partial}{\partial x}(x + y + z - 1),\; \frac{\partial}{\partial y}(x + y + z - 1),\; \frac{\partial}{\partial z}(x + y + z - 1)\right) = (1, 1, 1) \tag{2}$$

From (1) and (2), $\nabla f = \lambda \nabla g$ gives

$$\left(a x^{a-1} y^b z^c,\; b x^a y^{b-1} z^c,\; c x^a y^b z^{c-1}\right) = \lambda (1, 1, 1)$$

$$a x^{a-1} y^b z^c = \lambda \tag{3}$$
$$b x^a y^{b-1} z^c = \lambda \tag{4}$$
$$c x^a y^b z^{c-1} = \lambda \tag{5}$$

From (3) and (4),

$$y = \frac{b}{a}\, x \tag{6}$$

From (3) and (5),

$$z = \frac{c}{a}\, x \tag{7}$$

From (6), (7) and $x + y + z = 1$,

$$x = \frac{a}{a + b + c}, \qquad y = \frac{b}{a + b + c}, \qquad z = \frac{c}{a + b + c}$$

$$f\left(\frac{a}{a + b + c}, \frac{b}{a + b + c}, \frac{c}{a + b + c}\right) = \frac{a^a\, b^b\, c^c}{(a + b + c)^{a + b + c}}$$
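As an optional numeric cross-check (not in the original), the closed-form maximizer can be compared with a general-purpose constrained optimizer; `scipy.optimize.minimize` with an equality constraint is assumed to be available, and the values of a, b, c below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

a, b, c = 2.0, 3.0, 5.0

# Minimize the negative objective subject to x + y + z = 1, with positive variables.
res = minimize(
    lambda v: -(v[0] ** a * v[1] ** b * v[2] ** c),
    x0=np.array([0.3, 0.3, 0.4]),
    constraints=[{"type": "eq", "fun": lambda v: v.sum() - 1.0}],
    bounds=[(1e-9, 1.0)] * 3,
)

analytic = np.array([a, b, c]) / (a + b + c)
print(res.x)        # numeric maximizer
print(analytic)     # a/(a+b+c), b/(a+b+c), c/(a+b+c)
print(-res.fun, (a**a * b**b * c**c) / (a + b + c) ** (a + b + c))
```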

7. Billions of Balloons

Consider a large playground filled with 1 billion balloons. Of these there are $k_1$ blue, $k_2$ green and $k_3$ red balloons. The values of $k_1$, $k_2$ and $k_3$ are not known to you but you are interested in estimating them. Of course, you cannot go over all the 1 billion balloons and count the number of blue, green and red balloons. So you decide to randomly sample 1000 balloons and note down the number of blue, green and red balloons. Let these counts be $\hat{k}_1$, $\hat{k}_2$ and $\hat{k}_3$ respectively. You then estimate the total number of blue, green and red balloons as $1000000 \cdot \hat{k}_1$, $1000000 \cdot \hat{k}_2$ and $1000000 \cdot \hat{k}_3$.

(a) Your friend knows the values of $k_1$, $k_2$ and $k_3$ and wants to see how bad your estimates are compared to the true values. Can you suggest some ways of calculating this difference? [Hint: Think about probability!]

Solution:

$$\text{Error} = \frac{|1000000\,\hat{k}_1 - k_1| + |1000000\,\hat{k}_2 - k_2| + |1000000\,\hat{k}_3 - k_3|}{10^9}$$

$$\text{Accuracy} = \frac{10^9 - \big(|1000000\,\hat{k}_1 - k_1| + |1000000\,\hat{k}_2 - k_2| + |1000000\,\hat{k}_3 - k_3|\big)}{10^9}$$

Alternatively, compare the estimated and true colour distributions:

$$P(\hat{k}_1) = \frac{\hat{k}_1}{\hat{k}_1 + \hat{k}_2 + \hat{k}_3}, \quad
P(\hat{k}_2) = \frac{\hat{k}_2}{\hat{k}_1 + \hat{k}_2 + \hat{k}_3}, \quad
P(\hat{k}_3) = \frac{\hat{k}_3}{\hat{k}_1 + \hat{k}_2 + \hat{k}_3}$$

$$P(k_1) = \frac{k_1}{k_1 + k_2 + k_3}, \quad
P(k_2) = \frac{k_2}{k_1 + k_2 + k_3}, \quad
P(k_3) = \frac{k_3}{k_1 + k_2 + k_3}$$

$$\text{Probability difference} = |P(\hat{k}_1) - P(k_1)| + |P(\hat{k}_2) - P(k_2)| + |P(\hat{k}_3) - P(k_3)|$$

(b) * Consider two ways of converting $\hat{k}_1$, $\hat{k}_2$ and $\hat{k}_3$ to a probability distribution:

$$p_i = \frac{\hat{k}_i}{\sum_i \hat{k}_i}, \qquad q_i = \frac{e^{\hat{k}_i}}{\sum_i e^{\hat{k}_i}}$$

Would you prefer the distribution $q = [q_1, q_2, \ldots, q_n]$ over $p = [p_1, p_2, \ldots, p_n]$ for the above task? Give reasons and provide an example to support your choice.

Solution: No, I would not prefer $q$ over $p$. $q$ is a biased probability distribution: it gives a much higher estimate to the colour observed most often and diminishes the ones observed less often, whereas $p$ reproduces the observed proportions exactly.
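A tiny illustration (not in the original) of why the softmax-style $q$ distorts the counts; the example counts below are made up.

```python
import numpy as np

counts = np.array([600.0, 300.0, 100.0])    # hypothetical sampled counts k^1, k^2, k^3

p = counts / counts.sum()                   # normalised counts: [0.6, 0.3, 0.1]
q = np.exp(counts - counts.max())           # softmax (shifted for numerical stability)
q = q / q.sum()

print(p)   # matches the observed proportions
print(q)   # ~[1, 0, 0]: essentially all mass on the largest count
```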

8. ** Let $X$ be a real-valued random variable with $p$ as its probability density function (PDF). We define the cumulative density function (CDF) of $X$ as

$$F(x) = \Pr(X \leq x) = \int_{y = -\infty}^{y = x} p(y)\, dy$$

What is the value of $E_X[F(X)]$ (the expected value of the CDF of $X$)? The answer is a real number. (Hint: The expectation can be formulated as a double integral. Try to plot the area over which you need to integrate in the x-y plane. Now look at the area over which you are not integrating. Do you notice any symmetries?)

Solution:

$$F(x) = \Pr(X \leq x) = \int_{y = -\infty}^{y = x} p(y)\, dy \tag{1}$$

By the fundamental theorem of calculus, a function of the form

$$F(x) = \int_{a}^{x} f(t)\, dt \tag{2}$$

has derivative

$$\frac{d}{dx} F(x) = f(x) \tag{3}$$

From (1) and (3),

$$\frac{d}{dx} F(x) = p(x) \tag{4}$$

Also we know

$$E[F(X)] = \int_{-\infty}^{\infty} F(x)\, p(x)\, dx \tag{5}$$

Placing (4) in (5), and noting that $F$ ranges from 0 to 1,

$$E[F(X)] = \int_{0}^{1} F(x)\, d(F(x)) = \left[\frac{(F(x))^2}{2}\right]_{0}^{1} = \frac{1}{2}$$
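A quick Monte Carlo sanity check (not part of the original solution): for any continuous distribution, $F(X)$ is uniform on $[0, 1]$, so its mean should be close to $\frac{1}{2}$. A normal distribution is used below purely as an example, with `scipy.stats.norm` assumed available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
mu, sigma = 2.0, 3.0
samples = rng.normal(loc=mu, scale=sigma, size=1_000_000)

# Apply the CDF of the same distribution to its own samples.
u = norm.cdf(samples, loc=mu, scale=sigma)

print(u.mean())   # ~0.5, i.e. E[F(X)] = 1/2
```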

 

9. * Intuitive Urns

An urn initially contains 3 red balls and 3 blue balls. One of the balls is removed without being observed. To find out the color of the removed ball, Alice and Bob independently perform the same experiment: they randomly draw a ball, record the color, and put it back. This is repeated several times and the number of red and blue balls observed by each of them is recorded. Alice draws 6 times and observes 6 red balls and 0 blue balls. Bob draws 600 times and observes 303 red balls and 297 blue balls. Obviously, both of them will predict that the removed ball was blue.

(a) Intuitively, who do you think has stronger evidence for claiming that the removed ball was blue, and why? (Don't cheat by computing the answer. This subquestion has no marks, but is compulsory!)

Solution: Bob, since he has a larger sample. From the central limit theorem I would assume his estimate is closer to the truth.

(b) What is the exact probability that the removed ball was blue, given Alice's observations? (Hint: Think Bayesian Probability)

Solution: Let us consider three events:

1. A : Alice's observation.
2. B : Bob's observation.
3. C : the removed ball was blue.

We need to find $P(C \mid A)$. Using Bayes' theorem we can write

$$P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A \mid C)\, P(C) + P(A \mid \bar{C})\, P(\bar{C})}, \qquad P(C) = \frac{3}{6} = \frac{1}{2}$$

After removing a blue ball, the probability of drawing a red ball is $\frac{3}{5}$ and a blue ball is $\frac{2}{5}$. After removing a red ball, the probability of drawing a red ball is $\frac{2}{5}$ and a blue ball is $\frac{3}{5}$.

Getting 6 red balls in 6 draws with replacement when a blue ball was removed: $P(A \mid C) = \left(\frac{3}{5}\right)^6$. Getting 6 red balls in 6 draws with replacement when a red ball was removed: $P(A \mid \bar{C}) = \left(\frac{2}{5}\right)^6$.

$$P(C \mid A) = \frac{\left(\frac{3}{5}\right)^6 \cdot \frac{1}{2}}{\left(\frac{3}{5}\right)^6 \cdot \frac{1}{2} + \left(\frac{2}{5}\right)^6 \cdot \frac{1}{2}} = \frac{3^6}{2^6 + 3^6}$$

(c) What is the exact probability that the removed ball was blue, given Bob's observations? (Hint: Think Bayesian Probability)

Solution: We need to find $P(C \mid B)$. Using Bayes' theorem we can write

$$P(C \mid B) = \frac{P(B \mid C)\, P(C)}{P(B \mid C)\, P(C) + P(B \mid \bar{C})\, P(\bar{C})}, \qquad P(C) = \frac{1}{2}$$

Getting 303 red and 297 blue balls with replacement when a blue ball was removed: $P(B \mid C) \propto \left(\frac{3}{5}\right)^{303}\left(\frac{2}{5}\right)^{297}$. Getting 303 red and 297 blue balls with replacement when a red ball was removed: $P(B \mid \bar{C}) \propto \left(\frac{2}{5}\right)^{303}\left(\frac{3}{5}\right)^{297}$. (The binomial coefficient counting the orderings is the same in both cases and cancels.)

$$P(C \mid B) = \frac{\left(\frac{3}{5}\right)^{303}\left(\frac{2}{5}\right)^{297} \cdot \frac{1}{2}}{\left(\frac{3}{5}\right)^{303}\left(\frac{2}{5}\right)^{297} \cdot \frac{1}{2} + \left(\frac{2}{5}\right)^{303}\left(\frac{3}{5}\right)^{297} \cdot \frac{1}{2}} = \frac{3^6}{2^6 + 3^6}$$

(d) Computationally, who do you think has stronger evidence for claiming that the removed ball was blue?

Solution: Computationally, both have exactly the same evidence for claiming that the removed ball was blue.

Did your intuition match up with the computations? If yes, awesome! If not, remember that probability can often seem deceptively straightforward. Try to avoid intuition when dealing with probability by grounding it in formalism.

removed ball was blue. Did your your int intuit uition ion match match up wit with h the com comput putati ations ons?? If yes yes,, aw aweso esome! me! If not, remember reme mber that probab probabilit ility y can ofte often n be seem decepti deceptivel vely y straig straightf htforw orward. ard. Try to avoid intuition when dealing with probability by grounding it in formalism. 10.   Plotting Functions for Great Good Page 24

 

(a) Consi Consider der the var variable iable  x  and functions   h11 (x),   h12 (x) and   h21 (x) such that   1 1 +  e (400x+24)   1 h12 (x) = 1 +  e (400x 24) h11 (x) =







h21  =  h 11 (x) − h 12 (x)

The above set of functions are summarized in the graph below.

Plot the following functions:   h11 (x),   h12 (x) and   h21 (x) for   x  ∈  (−1, 1) Solution:

Page 25
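Since the original plots are not reproduced here, figures of this kind can be regenerated with a short script like the following sketch (matplotlib assumed available).

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Numerically stable logistic: avoids overflow for large |z|.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

x = np.linspace(-1, 1, 2001)
h11 = sigmoid(400 * x + 24)
h12 = sigmoid(400 * x - 24)
h21 = h11 - h12                      # a narrow "bump" centred at x = 0

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, y, title in zip(axes, (h11, h12, h21), ("h11(x)", "h12(x)", "h21(x)")):
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_xlabel("x")
plt.tight_layout()
plt.show()
```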

 

(b) Now consider the variables $x_1, x_2$ and the functions $h_{11}(x_1, x_2)$, $h_{12}(x_1, x_2)$, $h_{13}(x_1, x_2)$, $h_{14}(x_1, x_2)$, $h_{21}(x_1, x_2)$, $h_{22}(x_1, x_2)$, $h_{31}(x_1, x_2)$ and $f(x_1, x_2)$ such that

$$h_{11}(x_1, x_2) = \frac{1}{1 + e^{-(x_1 + 100x_2 + 200)}}$$
$$h_{12}(x_1, x_2) = \frac{1}{1 + e^{-(x_1 + 100x_2 - 200)}}$$
$$h_{13}(x_1, x_2) = \frac{1}{1 + e^{-(100x_1 + x_2 + 200)}}$$
$$h_{14}(x_1, x_2) = \frac{1}{1 + e^{-(100x_1 + x_2 - 200)}}$$
$$h_{21}(x_1, x_2) = h_{11}(x_1, x_2) - h_{12}(x_1, x_2)$$
$$h_{22}(x_1, x_2) = h_{13}(x_1, x_2) - h_{14}(x_1, x_2)$$
$$h_{31}(x_1, x_2) = h_{21}(x_1, x_2) + h_{22}(x_1, x_2)$$
$$f(x_1, x_2) = \frac{1}{1 + e^{-(50\,h_{31}(x) - 100)}}$$

The above set of functions are summarized in the graph below. [Figure: computation graph of $h_{11}, \ldots, h_{14}, h_{21}, h_{22}, h_{31}$ and $f$; not reproduced here.]

Plot the following functions: $h_{11}(x_1, x_2)$, $h_{12}(x_1, x_2)$, $h_{13}(x_1, x_2)$, $h_{14}(x_1, x_2)$, $h_{21}(x_1, x_2)$, $h_{22}(x_1, x_2)$, $h_{31}(x_1, x_2)$ and $f(x_1, x_2)$ for $x_1 \in (-5, 5)$ and $x_2 \in (-5, 5)$.

Solution: [Surface plots of the eight functions over $x_1, x_2 \in (-5, 5)$; figures not reproduced here.]
