Deep Learning: Chapter 8 (Optimization for Training Deep Models)

Japanese translation by matsuolab. September 18, 2017.

Deep learning involves optimization in many contexts; inference in models such as PCA, for instance, can itself be cast as an optimization problem. The most difficult optimization problem in deep learning is neural network training. This chapter focuses on that case: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), typically a performance measure evaluated on the training set plus additional regularization terms. It builds on the basic numerical optimization techniques of Chapter 4.

8.1 How Learning Differs from Pure Optimization

In machine learning we usually care about some performance measure P that is defined with respect to the test set and may be intractable to optimize directly. We therefore reduce a different cost function J(θ) in the hope that doing so will improve P; optimization is a means to an end rather than the goal itself. Typically the cost function can be written as an average over the training set,

J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),   (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, p̂_data is the empirical distribution, and y is the target output. We would usually prefer to minimize the corresponding objective taken over the true data-generating distribution p_data rather than over the finite training set:

J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).   (8.2)

8.1.1 Empirical Risk Minimization

The quantity in equation 8.2 is known as the risk. If we knew the true distribution p_data(x, y), risk minimization would be an ordinary optimization task; because we only have a training set of samples, we have a machine learning problem instead. The simplest way to convert it back into an optimization problem is to replace the true distribution p(x, y) with the empirical distribution p̂(x, y) and minimize the empirical risk

E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)),   (8.3)

where m is the number of training examples. Training based on minimizing this average training error is known as empirical risk minimization. However, empirical risk minimization is prone to overfitting, and many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). For these two reasons, empirical risk minimization is rarely used directly in deep learning.

8.1.2 Surrogate Loss Functions and Early Stopping

Sometimes the loss we actually care about cannot be optimized efficiently; exactly minimizing expected 0-1 loss, for example, is typically intractable even for a linear classifier (Marcotte and Savard, 1992). In such situations we optimize a surrogate loss function that acts as a proxy, such as the negative log-likelihood of the correct class in place of the 0-1 loss. The surrogate can even allow learning to continue improving: the test 0-1 loss often keeps decreasing long after the training 0-1 loss has reached zero, because the surrogate keeps pushing the classes further apart. Another important difference from pure optimization is that training algorithms do not usually halt at a local minimum; instead they halt based on early stopping (section 7.8), which is driven by the true underlying loss, such as 0-1 loss measured on a validation set, and often triggers while the surrogate loss still has large derivatives.

8.1.3 Batch and Minibatch Algorithms

Objective functions in machine learning decompose as a sum over training examples. Maximum likelihood estimation, for example, solves

θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i), y^(i); θ),   (8.4)

which is equivalent to maximizing the expectation over the empirical distribution,

J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).   (8.5)

The property of J used most often by optimization algorithms is its gradient,

∇_θ J(θ) = E_{x,y∼p̂_data} ∇_θ log p_model(x, y; θ).   (8.6)

Computing this expectation over the whole training set is expensive, so in practice we estimate it from a randomly sampled subset of examples. The standard error of a mean estimated from n samples is σ/√n (equation 5.46), where σ is the true standard deviation of the sample values, so the estimate improves less than linearly with n.

For example, a gradient estimate based on 10,000 examples requires 100 times more computation than one based on 100 examples but reduces the standard error only by a factor of 10. Most optimization algorithms therefore converge faster, in terms of total computation, if they use rapid approximate gradient estimates rather than slowly computed exact ones. A second motivation is redundancy in the training set: in the worst case, all m examples could be identical copies, in which case a single example yields the correct gradient with m times less computation, or equivalently 1/m of the cost.

Algorithms that use the entire training set for each update are called batch or deterministic gradient methods. Those that use a single example at a time are called stochastic or online methods. Most deep learning algorithms fall in between, using more than one but fewer than all of the examples; these were traditionally called minibatch or minibatch stochastic methods, and are now commonly called simply stochastic methods. The canonical example is stochastic gradient descent, presented in detail in section 8.3.1.

Minibatch sizes are generally driven by the following factors:

• Larger batches give a more accurate gradient estimate, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches, motivating some absolute minimum batch size.
• If all examples in the batch are processed in parallel, memory consumption scales with the batch size; this is often the limiting factor.
• Some hardware achieves better runtime with specific array sizes. With GPUs in particular, power-of-2 batch sizes, typically between 32 and 256 (sometimes 16 for large models), tend to run faster.
• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps because of the noise they add; generalization error is often best with a batch size of 1, at the cost of a small learning rate and long total runtime.

Different algorithms use different kinds of information from the minibatch and so tolerate different batch sizes. Methods that compute updates based only on the gradient g are usually robust and can handle small batches, around 100. Second-order methods, which also use the Hessian H to compute updates such as H^{-1}g, typically require much larger batches, around 10,000, to minimize fluctuations in the estimate of H^{-1}g: even if H is estimated perfectly, a poor condition number means that multiplying by H^{-1} amplifies pre-existing errors in the estimate of g, so small changes in g cause large changes in the update.

It is also crucial that minibatches be selected randomly. Computing an unbiased estimate of the expected gradient requires independent examples, and we would also like two subsequent gradient estimates to be independent of each other. Many datasets are arranged so that successive examples are highly correlated; a list of medical test results, for instance, might contain five blood samples from the first patient, then samples from the second and third patients, and so on. It is therefore necessary to shuffle the examples; for very large datasets it is sufficient to shuffle the order once and store the data in shuffled form. Many optimization problems also decompose over examples well enough that updates for different minibatches X, each reducing J(X), can be computed in parallel, which underlies the asynchronous distributed approaches of section 12.1.3.

An interesting motivation for minibatch SGD is that it follows the gradient of the true generalization error (equation 8.2) as long as no examples are repeated. To see this, suppose both x and y are discrete; the generalization error can then be written as the sum

J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),   (8.7)

with exact gradient

g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).   (8.8)

We have already seen this demonstrated for the log-likelihood in equations 8.5 and 8.6; it holds for other loss functions L as well, and a similar result holds for continuous x and y. Hence, sampling a minibatch {x^(1), ..., x^(m)} with targets y^(i) from p_data and computing

ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))   (8.9)

gives an unbiased estimate of the gradient of the generalization error, and updating θ in the direction of ĝ performs SGD on the generalization error. This interpretation only holds on the first pass through the data; on later passes the estimate becomes biased, though the reduction in training error usually outweighs the bias. With rapidly growing datasets it is becoming common to use each example only once or to make an incomplete pass, in which case overfitting is not an issue and underfitting and computational efficiency become the dominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error.

8.2 Challenges in Neural Network Optimization

Optimization in general is an extremely difficult task. When training neural networks we must confront the general nonconvex case, and even convex problems are not without complications. This section summarizes the most prominent challenges.

8.2.1 Ill-Conditioning

Some challenges arise even when optimizing convex functions, the most prominent being ill-conditioning of the Hessian matrix H (section 4.3.1). Ill-conditioning can cause SGD to get stuck in the sense that even very small steps increase the cost. Recall from equation 4.9 that a second-order Taylor expansion predicts that a gradient descent step of −ϵg adds

(1/2) ϵ² g⊤Hg − ϵ g⊤g   (8.10)

to the cost. Ill-conditioning becomes a problem when (1/2)ϵ²g⊤Hg exceeds ϵg⊤g. To determine whether this is hurting a training task, one can monitor the squared gradient norm g⊤g and the g⊤Hg term. In many cases the gradient norm does not shrink significantly during learning while g⊤Hg grows by more than an order of magnitude; learning then becomes very slow despite a strong gradient, because the learning rate must shrink to compensate for the stronger curvature. Figure 8.1 shows an example of the gradient growing significantly during the successful training of a neural network.

Figure 8.1: Gradient norm (left) and classification error rate (right) as a function of training time in epochs. Gradient descent often does not arrive at a critical point of any kind: here the gradient norm keeps increasing, yet the classification error still decreases to a low value.

8.2.2 Local Minima

With a convex problem, any local minimum is guaranteed to be a global minimum. Neural networks, in contrast, are nonconvex and not identifiable: exchanging the incoming and outgoing weight vectors of unit i with those of unit j leaves the function unchanged, so a network with m layers of n units each has n!^m equivalent ways of arranging its hidden units. This nonidentifiability is known as weight space symmetry. In many networks we can also scale the incoming weights and biases of a unit by α and its outgoing weights by 1/α without changing the function. Neural network cost functions therefore have an extremely large, even uncountably infinite, number of local minima, but the minima arising from nonidentifiability are all equivalent in cost, so this is not a problematic form of nonconvexity. Local minima are problematic only if they have high cost compared with the global minimum. Small networks can be shown to have such problematic minima (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992), but experts now suspect that, for sufficiently large networks, most local minima have a low cost value, and that it matters more to find a point of low cost than a true minimum (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).

8.2.3 Plateaus, Saddle Points and Other Flat Regions

For many high-dimensional nonconvex functions, local minima and maxima are rare compared with another kind of zero-gradient point: the saddle point (see figure 4.5). For many classes of random functions f: R^n → R, the expected ratio of saddle points to local minima grows exponentially with n. Intuitively, a local minimum requires every Hessian eigenvalue to be positive, which becomes exponentially unlikely as n grows; moreover the eigenvalues tend to become positive only in regions of low cost, so local minima tend to have low cost while high-cost critical points tend to be saddle points. Dauphin et al. (2014) showed experimentally that real neural network loss functions also contain very many high-cost saddle points. Baldi and Hornik (1989) showed theoretically that shallow linear autoencoders (chapter 14) have saddle points but no local minima with cost higher than the global minimum, Saxe et al. (2013) analyzed the exact learning dynamics of deep networks without nonlinearities, and Dauphin et al. (2014) and Choromanska et al. (2014) argued that analogous results hold for nonlinear networks.

What does this imply for training? For first-order methods the situation is not fully understood: the gradient can become very small near a saddle point, yet in practice gradient descent often escapes quickly, as in the visualizations of Goodfellow et al. (2015) shown in figure 8.2. For Newton's method, which seeks any point of zero gradient, saddle points are a clear problem and may explain why second-order methods have not replaced gradient descent for neural networks; Dauphin et al. (2014) introduced a saddle-free Newton method to address this. Besides minima and saddle points, there are other structures with zero gradient, such as maxima and wide flat plateaus where both the gradient and the Hessian are zero, which pose serious problems for all numerical optimization algorithms.

Figure 8.2: Visualization of the cost function J(θ) of a neural network, plotted over two projections of θ (adapted from Goodfellow et al., 2015). Such visualizations show few conspicuous obstacles; before the success of SGD at training very large models, beginning roughly in 2012, neural network cost surfaces were believed to be far more irregular.

8.2.4 Cliffs and Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3, resulting from the multiplication of several large weights together. On the face of a cliff, a gradient update can move the parameters extremely far, usually jumping off the cliff structure altogether and undoing much of the work already done. The most serious consequences can be avoided with the gradient clipping heuristic (section 10.12.1): the gradient specifies only the desired direction, not the optimal step size, so when the proposed step is too large, clipping reduces its magnitude. Cliff structures are most common in the cost functions of recurrent networks, which multiply together many factors, one per time step of a long sequence.

Figure 8.3: The cost function J(w, b) of a highly nonlinear deep network or a recurrent network can contain sharp, cliff-like regions in (w, b) space, where a single gradient step can catapult the parameters very far. Figure adapted from Pascanu et al. (2013).
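The chapter describes gradient clipping only at the level of the heuristic; a minimal norm-based sketch in NumPy follows. The function name and threshold value are illustrative assumptions, not from the text.

```python
import numpy as np

def clip_gradient(g, threshold=1.0):
    """Rescale the gradient g if its norm exceeds the threshold (norm clipping)."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

# Example: a cliff-like gradient keeps its direction but its step size is bounded.
g = np.array([250.0, -400.0])
print(clip_gradient(g, threshold=5.0))   # same direction, norm 5
```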

8.2.5 Long-Term Dependencies

Another difficulty arises when the computational graph becomes extremely deep, as in feedforward networks with many layers or recurrent networks (chapter 10), which apply the same operation repeatedly at each time step of a long sequence. Suppose a path in the graph consists of repeatedly multiplying by a matrix W; after t steps this is equivalent to multiplying by W^t. If W has the eigendecomposition W = V diag(λ) V^{-1}, then

W^t = (V diag(λ) V^{-1})^t = V diag(λ)^t V^{-1}.   (8.11)

Any eigenvalue λ_i explodes if its magnitude is greater than 1 and vanishes if it is less than 1. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are likewise scaled by diag(λ)^t: vanishing gradients make it hard to know which direction to move, while exploding gradients make learning unstable (the cliffs of the previous section are an instance of the latter). Repeated multiplication by W resembles the power method for finding the largest eigenvalue of W, so x⊤W^t eventually discards all components of x orthogonal to the principal eigenvector of W. Recurrent networks use the same W at every time step; feedforward networks use different weights at each layer and can largely avoid the problem even at great depth (Sussillo, 2014). Further discussion of training recurrent networks is deferred to section 10.7.
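The scaling behavior of equation 8.11 is easy to see numerically. The following sketch uses an arbitrary diagonal matrix chosen purely for illustration; the eigenvalues 1.1 and 0.9 are assumptions, not values from the text.

```python
import numpy as np

# Repeated multiplication by W scales each eigen-direction by lambda_i ** t,
# so components with |lambda_i| < 1 vanish and components with |lambda_i| > 1 explode.
W = np.array([[1.1, 0.0],
              [0.0, 0.9]])
x = np.array([1.0, 1.0])

for t in [1, 10, 50, 100]:
    xt = np.linalg.matrix_power(W, t) @ x
    print(t, xt)
# After 100 steps the first component has grown by about 1.1**100 (~1.4e4)
# while the second has shrunk by about 0.9**100 (~2.7e-5).
```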

8.2.6 Inexact Gradients

Most optimization algorithms are designed assuming access to the exact gradient or Hessian. In practice we usually have only a noisy or biased estimate, since nearly every deep learning algorithm relies on minibatch sampling, and in some cases the objective itself is intractable, so that only an approximation to its gradient is available, as with several of the advanced models of part III. Training algorithms are therefore designed to tolerate imperfect gradient estimates, or one chooses a surrogate loss that is easier to approximate than the true loss.

8.2.7 Poor Correspondence between Local and Global Structure

The problems discussed so far concern properties of the loss function at a single point: a single step is difficult if J(θ) is poorly conditioned at θ, if θ lies on a cliff, or if θ is a saddle point. It is possible to overcome all of these and still perform poorly if the locally best direction does not point toward distant regions of much lower cost. Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to reach a solution; figure 8.2 shows a trajectory spending most of its time tracing a wide arc around a mountain-shaped structure. In practice neural networks often do not arrive at a critical point of any kind, as in figure 8.1, and such points need not even exist. The loss −log p(y | x; θ), for example, can lack a minimum and instead approach a value asymptotically as the model becomes more confident: a classifier with discrete y can drive the loss arbitrarily close to zero without reaching it, and a model of real values p(y | x) = N(y; f(θ), β^{-1}) can drive the negative log-likelihood toward negative infinity by collapsing its variance.

Figure 8.4: A cost function with no critical points, only an asymptote toward low values. Purely local, downhill moves fail when the parameters are initialized on the wrong side of the "mountain"; in higher-dimensional spaces learning can often circumnavigate such structures, but the trajectory may be long, as in figure 8.2.

Gradient descent and essentially all learning algorithms that work for neural networks are based on making small local moves. The objective may be well modeled by the gradient only within a small region, so the correct direction may only be computable for steps of size δ ≪ ϵ even when steps of size ϵ would be efficient. Much current research therefore aims at finding good initial points for problems with difficult global structure, rather than at algorithms that use nonlocal moves.

8.2.8 Theoretical Limits of Optimization

Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on practice: some apply only to networks with discrete-valued units, some show that intractable problem classes exist without telling us whether a given problem belongs to them, and some show that finding a network of a given size is intractable even though a larger network can represent a solution that is easy to find. Moreover, in practice we do not seek an exact minimum, only a cost low enough to generalize well, and theoretical analysis of that goal is extremely difficult. Developing more realistic performance bounds for optimization algorithms remains an important research goal.

8.3 Basic Algorithms

We previously introduced the gradient descent algorithm (section 4.3), which follows the gradient of the entire training set downhill. As discussed in sections 5.9 and 8.1.3, learning can be accelerated considerably by following the gradient of randomly selected minibatches instead.

8.3.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms in deep learning. As discussed in section 8.1.3, an unbiased estimate of the gradient can be obtained from a minibatch of m examples drawn i.i.d. from the data-generating distribution; algorithm 8.1 follows this estimate downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ϵ_k.
Require: Initial parameter θ
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Apply update: θ ← θ − ϵ ĝ
end while

A crucial parameter of SGD is the learning rate. In practice it must be decreased gradually over time, so we denote the learning rate at iteration k by ϵ_k. The minibatch sampling introduces noise that does not vanish even at a minimum, whereas the true gradient of the total cost becomes small and then 0 as a minimum is approached, which is why batch gradient descent can use a fixed learning rate. A sufficient condition for convergence of SGD is

Σ_{k=1}^∞ ϵ_k = ∞,   (8.12)

and

Σ_{k=1}^∞ ϵ_k² < ∞.   (8.13)

In practice it is common to decay the learning rate linearly until iteration τ,

ϵ_k = (1 − α) ϵ_0 + α ϵ_τ,   (8.14)

with α = k/τ, and to leave ϵ constant after iteration τ.

With the linear schedule, the parameters to choose are ϵ_0, ϵ_τ, and τ. Usually τ is set to the number of iterations needed for a few hundred passes through the training set, and ϵ_τ to roughly 1% of ϵ_0. The main question is how to set ϵ_0: too large, and the learning curve shows violent oscillations; too small, and learning is slow and may become stuck at a high cost. Typically the best initial learning rate, in terms of total training time and final cost, is somewhat higher than the rate that performs best after the first 100 or so iterations, so it helps to monitor the first iterations and pick a rate above that value but not so high that it causes severe instability.

The most important property of SGD is that computation time per update does not grow with the number of training examples, so convergence is possible even on enormous datasets. To study convergence it is common to measure the excess error J(θ) − min_θ J(θ). For SGD on a convex problem the excess error is O(1/√k) after k iterations and O(1/k) in the strongly convex case, and these bounds cannot be improved without extra assumptions. Batch gradient descent enjoys better theoretical convergence rates, but the Cramér-Rao bound (Cramér, 1946; Rao, 1945) implies that generalization error cannot decrease faster than O(1/k); Bottou and Bousquet (2008) therefore argue that it may not be worthwhile for a machine learning algorithm to converge faster than O(1/k), since faster convergence presumably corresponds to overfitting. In practice, SGD's ability to make rapid initial progress while evaluating the gradient on only a few examples outweighs its slow asymptotic O(1/k) convergence, which can in any case be improved, for example with the averaging methods of section 8.7.3. For more on SGD see Bottou (1998).
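As a concrete illustration of algorithm 8.1 together with the linear schedule of equation 8.14, here is a minimal NumPy sketch; `grad_fn`, `data`, and the hyperparameter values are assumptions supplied by the caller, not prescriptions from the text.

```python
import numpy as np

def lr_linear_decay(k, eps0, eps_tau, tau):
    """Learning rate schedule of equation 8.14: linear decay until iteration tau, then constant."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

def sgd(grad_fn, theta, data, eps0=0.1, tau=100, batch_size=32, n_steps=500, rng=None):
    """Minimal minibatch SGD (algorithm 8.1) with the linear schedule."""
    rng = rng or np.random.default_rng(0)
    eps_tau = 0.01 * eps0                  # the text suggests roughly 1% of eps0
    for k in range(n_steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        g = grad_fn(theta, batch)          # unbiased minibatch gradient estimate
        theta = theta - lr_linear_decay(k, eps0, eps_tau, tau) * g
    return theta
```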

8.3.2 Momentum

While SGD remains very popular, learning with it can be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of past gradients and continues to move in their direction; its effect is illustrated in figure 8.5.

Figure 8.5: Momentum addresses two problems: poor conditioning of the Hessian and variance in the stochastic gradient. The contour lines depict a quadratic loss with a poorly conditioned Hessian, a long narrow valley; the path crossing the contours shows momentum traversing the valley lengthwise, while the arrow at each point shows the step that ordinary gradient descent would take, oscillating across the narrow axis. Compare figure 4.6, which shows gradient descent without momentum.

Formally, the momentum algorithm introduces a velocity variable v, the direction and speed at which the parameters move through parameter space, set to an exponentially decaying average of the negative gradient. A hyperparameter α ∈ [0, 1] determines how quickly the contributions of previous gradients decay. The update rule is

v ← α v − ϵ ∇_θ [(1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))],   (8.15)
θ ← θ + v.   (8.16)

The velocity v accumulates the gradient elements ∇_θ [(1/m) Σ_i L(f(x^(i); θ), y^(i))]; the larger α is relative to ϵ, the more previous gradients affect the current direction. SGD with momentum is given in algorithm 8.2.

Previously the step size was simply the gradient norm multiplied by the learning rate; now it depends on how large and how aligned a sequence of gradients is, and is largest when many successive gradients point in the same direction. If the algorithm always observes the same gradient g, it accelerates in the direction of −g until reaching a terminal velocity where each step has size

ϵ ‖g‖ / (1 − α).   (8.17)

Because the terminal step size is

ϵ ‖g‖ / (1 − α),   (8.18)

it is helpful to think of the momentum hyperparameter in terms of 1/(1 − α): α = 0.9, for example, corresponds to multiplying the maximum step size by 10 relative to plain gradient descent. Common values of α are 0.5, 0.9 and 0.99; like the learning rate, α may be adapted over time, typically starting small and being raised later, though adapting α is less important than shrinking ϵ.

We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The position of the particle at time t is given by θ(t), and the particle experiences a net force f(t) that accelerates it:

f(t) = ∂²θ(t)/∂t².   (8.19)

Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ϵ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute velocity update: v ← α v − ϵ g
  Apply update: θ ← θ + v
end while

Rather than viewing equation 8.19 as a second-order differential equation, we can introduce the velocity v(t) of the particle at time t and rewrite the Newtonian dynamics as a first-order system:

v(t) = ∂θ(t)/∂t,   (8.20)
f(t) = ∂v(t)/∂t.   (8.21)

The momentum algorithm then consists of solving these equations by numerical simulation. One force on the particle is proportional to the negative gradient of the cost, −∇_θ J(θ), pushing the particle downhill; unlike gradient descent, which takes a single step per gradient, the particle uses this force to change its velocity, like a hockey puck gathering speed on the downhill parts of an icy surface. One other force is necessary, or the particle would never come to rest: a viscous drag proportional to −v(t). Viscous drag is the right choice among possible drag forces. Turbulent drag, proportional to the squared velocity, becomes so weak at low speed that the particle's velocity decays only like 1/t and its displacement grows as O(log t), so it never settles near a minimum; dry friction of constant magnitude can stop the particle too early, where the gradient is small but not 0. Viscous drag is weak enough that the gradient can continue to cause motion until a minimum is reached, yet strong enough to prevent motion when the gradient does not justify it.
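A minimal sketch of algorithm 8.2 in NumPy follows; as in the earlier SGD sketch, `grad_fn`, `data`, and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def sgd_momentum(grad_fn, theta, data, eps=0.01, alpha=0.9,
                 batch_size=32, n_steps=500, rng=None):
    """Minimal SGD with momentum (algorithm 8.2)."""
    rng = rng or np.random.default_rng(0)
    v = np.zeros_like(theta)                  # velocity
    for _ in range(n_steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        g = grad_fn(theta, batch)             # minibatch gradient estimate
        v = alpha * v - eps * g               # exponentially decaying average of past gradients
        theta = theta + v
    return theta
```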

8.3.3 Nesterov Momentum

Sutskever et al. (2013) introduced a variant of momentum inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules are

v ← α v − ϵ ∇_θ [(1/m) Σ_{i=1}^m L(f(x^(i); θ + αv), y^(i))],   (8.22)
θ ← θ + v,   (8.23)

where α and ϵ play the same roles as in standard momentum. The difference is where the gradient is evaluated: with Nesterov momentum it is evaluated after the current velocity has been applied, which can be interpreted as adding a correction factor to standard momentum. The complete procedure is given in algorithm 8.3. In the convex batch gradient case, Nesterov momentum improves the convergence rate of the excess error from O(1/k) after k steps to O(1/k²) (Nesterov, 1983); unfortunately, in the stochastic gradient case it does not improve the rate of convergence.

Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate ϵ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  Apply interim update: θ̃ ← θ + α v
  Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Compute velocity update: v ← α v − ϵ g
  Apply update: θ ← θ + v
end while
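The corresponding sketch of algorithm 8.3 differs from the momentum sketch above only in evaluating the gradient at the interim point θ + αv; again, `grad_fn` and `data` are assumed inputs.

```python
import numpy as np

def sgd_nesterov(grad_fn, theta, data, eps=0.01, alpha=0.9,
                 batch_size=32, n_steps=500, rng=None):
    """Minimal SGD with Nesterov momentum (algorithm 8.3)."""
    rng = rng or np.random.default_rng(0)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        theta_interim = theta + alpha * v     # apply the interim update
        g = grad_fn(theta_interim, batch)     # gradient evaluated after the velocity is applied
        v = alpha * v - eps * g
        theta = theta + v
    return theta
```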

8.4 Parameter Initialization Strategies

Training algorithms for deep models are iterative and require the user to specify an initial point. Training deep models is difficult enough that most algorithms are strongly affected by this choice: the initial point can determine whether learning converges at all, how quickly it converges, whether it reaches a point of high or low cost, and how well the resulting point generalizes.

Modern initialization strategies are simple and heuristic, because neural network optimization is not yet well understood, and our understanding of how the initial point affects generalization is especially primitive. Perhaps the only property known with complete certainty is that the initial parameters must break symmetry between units: two hidden units with the same activation function connected to the same inputs must have different initial parameters, or deterministic learning will update them identically forever. Typically the biases are set to heuristically chosen constants and only the weights are initialized randomly, almost always from a Gaussian or uniform distribution; the particular distribution seems to matter little, but its scale matters a great deal. Larger initial weights break symmetry more strongly and better preserve signal during forward and back-propagation, but weights that are too large cause exploding values, saturation of activation functions, or chaos in recurrent networks (extreme sensitivity of the output to small input perturbations). Regularization considerations such as early stopping (section 7.8) pull in the opposite direction: the prior that the final parameters should stay close to the initial parameters θ_0, together with gradient descent's tendency to stop in regions near its starting point, favors small initial weights, while the optimization perspective asks for weights large enough to propagate information.

Some heuristics are available for choosing the scale. A fully connected layer with m inputs and n outputs is commonly initialized by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest the normalized initialization

W_{i,j} ∼ U(−√(6/(m+n)), √(6/(m+n))),   (8.24)

a compromise between giving all layers the same activation variance and giving them the same gradient variance. Saxe et al. (2013) instead recommend initializing with random orthogonal matrices, together with a carefully chosen gain factor g that accounts for the nonlinearity applied at each layer.
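Both heuristics are one-liners in practice. The following sketch implements the U(−1/√m, 1/√m) rule and the normalized initialization of equation 8.24; the function names are illustrative.

```python
import numpy as np

def normalized_init(m, n, rng=None):
    """Normalized (Glorot) initialization of equation 8.24 for a layer with m inputs, n outputs."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def simple_uniform_init(m, n, rng=None):
    """The simpler heuristic U(-1/sqrt(m), 1/sqrt(m)) mentioned in the text."""
    rng = rng or np.random.default_rng(0)
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))
```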

Sussillo (2014) showed that correctly setting such a gain factor g is enough to train networks as deep as 1,000 layers without orthogonal initialization, by analyzing how activations and gradients grow or shrink at each layer, much like the repeated multiplications of section 8.2.5. Unfortunately these optimality criteria for the initial weights often do not translate into optimal performance, for three reasons: the criterion itself may be wrong (preserving the norm of a signal throughout the whole network may not actually be beneficial); the properties imposed at initialization may not persist once learning begins; and the criterion may speed up optimization while increasing generalization error.

One drawback of rules that give every weight the same standard deviation, such as 1/√m, is that each individual weight becomes extremely small when the layer is large. Martens (2010) introduced sparse initialization, in which each unit receives exactly k nonzero incoming weights, keeping the total input to the unit independent of the number of inputs m without shrinking individual weights with m; it imposes a strong prior on the weights chosen to be large, however, and can cause problems for units such as maxout that must carefully coordinate many weights.

When computational resources allow, it is usually a good idea to treat the initial weight scale of each layer as a hyperparameter and select it with a search algorithm (section 11.4.2), along with the choice of dense or sparse initialization. Alternatively one can search the scales manually: a good rule of thumb is to inspect the range or standard deviation of the activations or gradients on a single minibatch, since weights that are too small cause the activations to shrink as they propagate forward through the network. Mishkin and Matas (2015) describe a protocol for automating this kind of scale selection.



So far we have focused on the weights; initializing the other parameters is usually easier. Setting the biases to zero is compatible with most weight initialization schemes, but there are a few situations where nonzero biases help:

• For an output unit, it is often beneficial to initialize the bias to produce the right marginal statistics of the output, assuming the initial weights are small enough that the output is determined mainly by the bias. For example, if the output is a distribution over classes with marginal probabilities given by a vector c whose element c_i is the frequency of class i, we can set the bias vector b by solving softmax(b) = c. This applies not only to classifiers but also to models of part III, such as autoencoders and Boltzmann machines, whose layer outputs should resemble the input data x.

• Sometimes the bias is chosen to avoid too much saturation at initialization; for example, setting the bias of a ReLU hidden unit to 0.1 rather than 0 keeps the ReLU active. This is not compatible with schemes that do not expect strong input from the biases, such as the random walk initialization of Sussillo (2014).

• Sometimes a unit h ∈ [0, 1] acts as a gate controlling another unit with output u, so that uh ≈ u when h ≈ 1 and uh ≈ 0 otherwise. In such cases the bias of h should be set so that h ≈ 1 most of the time at initialization, otherwise u has no chance to learn. Jozefowicz et al. (2015), for example, advocate setting the bias of the LSTM forget gate (section 10.11) to 1.

Another common kind of parameter is a variance or precision parameter, as in linear regression with a conditional variance estimate,

p(y | x) = N(y | w⊤x + b, 1/β),   (8.25)

where β is a precision parameter; such parameters can usually be initialized to 1 safely. Beyond these constant or random schemes, it is also possible to initialize parameters using machine learning itself, for instance with an unsupervised model trained on the same inputs or with supervised training on a related task, strategies discussed in part III.

8.5 Algorithms with Adaptive Learning Rates

The learning rate is reliably one of the most difficult hyperparameters to set, because it significantly affects model performance. As discussed in sections 4.3 and 8.2, the cost is often highly sensitive to some directions in parameter space and insensitive to others; momentum mitigates this somewhat but introduces another hyperparameter. If the directions of sensitivity are roughly axis aligned, it makes sense to use a separate learning rate for each parameter and to adapt these rates during learning. The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic of this kind: increase a parameter's learning rate if the sign of its partial derivative stays the same, decrease it if the sign flips. This rule can only be applied to full batch optimization; more recently a number of minibatch-based methods have been introduced that adapt per-parameter learning rates, reviewed briefly below.

8.5.1 AdaGrad

The AdaGrad algorithm (Duchi et al., 2011), shown in algorithm 8.4, individually adapts the learning rates of all parameters by scaling them inversely proportional to the square root of the sum of all historical squared gradient values. Parameters with large partial derivatives of the loss see their learning rate decrease rapidly, while parameters with small partial derivatives see a smaller decrease, yielding greater progress along the gently sloped directions of parameter space. AdaGrad has desirable theoretical properties for convex optimization, but when training deep networks the accumulation of squared gradients from the beginning of training can shrink the effective learning rate prematurely and excessively; AdaGrad performs well for some but not all deep learning models.

8.

Algorithm 8.4 AdaGrad Require:

ϵ

Require:

θ

Require:

10−7

δ r=0

while

do {x

(1)

(m)

,...,x

.

: g←

y (i)

}

1 m ∇θ

!

i

m

L(f (x(i) ; θ), y (i) )

: r ←r+g⊙g

: ∆θ ← − δ+ϵ√r ⊙ g.

(

)

: θ ← θ + ∆θ

end while
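A single AdaGrad step can be written directly from algorithm 8.4; this sketch keeps the accumulator r as explicit state passed in and out rather than hiding it in an object.

```python
import numpy as np

def adagrad_update(theta, g, r, eps=0.01, delta=1e-7):
    """One AdaGrad step (algorithm 8.4): r accumulates the squared gradients."""
    r = r + g * g
    theta = theta - (eps / (delta + np.sqrt(r))) * g   # element-wise scaling of the gradient
    return theta, r
```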

8.5.2 RMSProp

The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly on a convex function; when applied to the nonconvex loss of a neural network, the learning trajectory may pass through many different structures before reaching a locally convex bowl, and because AdaGrad shrinks the learning rate according to the entire history of squared gradients, the rate may already be too small by then. RMSProp discards history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. It is shown in its standard form in algorithm 8.5 and combined with Nesterov momentum in algorithm 8.6. Compared with AdaGrad, the moving average introduces a new hyperparameter, ρ (rho), controlling its length scale. Empirically, RMSProp is an effective and practical optimizer and is one of the methods routinely employed by deep learning practitioners.

Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ϵ, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10^{-6}, used to stabilize division by small numbers
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρ r + (1 − ρ) g ⊙ g
  Compute parameter update: ∆θ = −(ϵ / √(δ + r)) ⊙ g   (1/√(δ + r) applied element-wise)
  Apply update: θ ← θ + ∆θ
end while

Algorithm 8.6 RMSProp with Nesterov momentum
Require: Global learning rate ϵ, decay rate ρ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  Compute interim update: θ̃ ← θ + α v
  Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Accumulate squared gradient: r ← ρ r + (1 − ρ) g ⊙ g
  Compute velocity update: v ← α v − (ϵ / √r) ⊙ g   (1/√r applied element-wise)
  Apply update: θ ← θ + v
end while
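For comparison with the AdaGrad sketch above, one RMSProp step (algorithm 8.5) differs only in how r is accumulated; the default hyperparameters here are commonly used values and are assumptions, not values fixed by the text.

```python
import numpy as np

def rmsprop_update(theta, g, r, eps=0.001, rho=0.9, delta=1e-6):
    """One RMSProp step (algorithm 8.5): r is an exponentially decaying average of squared gradients."""
    r = rho * r + (1 - rho) * g * g
    theta = theta - (eps / np.sqrt(delta + r)) * g
    return theta, r
```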

8.5.3 Adam

Adam (Kingma and Ba, 2014), presented in algorithm 8.7, is another adaptive learning rate algorithm; its name derives from "adaptive moments." It is perhaps best seen as a variant of the combination of RMSProp and momentum, with two important distinctions. First, momentum is incorporated directly as an estimate of the first-order moment (an exponentially weighted average) of the gradient, whereas the most straightforward way to add momentum to RMSProp, applying momentum to the rescaled gradients, lacks a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moment and the (uncentered) second-order moment to account for their initialization at zero; RMSProp also estimates the second-order moment but lacks this correction, so its estimate may be highly biased early in training. Adam is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.

Algorithm 8.7 The Adam algorithm
Require: Step size ϵ (suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization (suggested default: 10^{-8})
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set, with corresponding targets y^(i).
  t ← t + 1
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
  Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
  Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
  Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
  Compute update: ∆θ = −ϵ ŝ / (√r̂ + δ)   (operations applied element-wise)
  Apply update: θ ← θ + ∆θ
end while
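The bias corrections are the main thing algorithm 8.7 adds over the RMSProp sketch above; here is one Adam step with the moment estimates and the time step kept as explicit state.

```python
import numpy as np

def adam_update(theta, g, s, r, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step (algorithm 8.7). s and r are the biased first and second moment estimates."""
    t = t + 1
    s = rho1 * s + (1 - rho1) * g
    r = rho2 * r + (1 - rho2) * g * g
    s_hat = s / (1 - rho1 ** t)                          # bias-corrected first moment
    r_hat = r / (1 - rho2 ** t)                          # bias-corrected second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t
```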

8.5.4 Choosing the Right Optimization Algorithm

Which of these algorithms should one choose? Unfortunately there is currently no consensus. Schaul et al. (2014) presented a valuable comparison of a large number of optimizers across a wide range of learning tasks; the algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, but no single best algorithm emerged. Currently, the most popular algorithms in active use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam, and the choice seems to depend largely on the user's familiarity with the algorithm, which makes hyperparameter tuning easier.

8.6 Approximate Second-Order Methods

This section discusses second-order methods for training deep networks; see LeCun et al. (1998a) for an earlier treatment of the subject. For simplicity of exposition, the only objective we examine is the empirical risk,

J(θ) = E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)),   (8.26)

though the methods extend readily to objectives that include regularization terms such as those of chapter 7.

8.6.1 Newton's Method

In contrast to first-order methods, second-order methods use second derivatives to improve optimization; the most widely used is Newton's method (section 4.3), reviewed here in more detail.

Newton's method is based on a second-order Taylor expansion of J(θ) around a point θ_0, ignoring higher-order derivatives:

J(θ) ≈ J(θ_0) + (θ − θ_0)⊤ ∇_θ J(θ_0) + (1/2)(θ − θ_0)⊤ H (θ − θ_0),   (8.27)

where H is the Hessian of J with respect to θ evaluated at θ_0. Solving for the critical point of this approximation gives the Newton update rule

θ* = θ_0 − H^{-1} ∇_θ J(θ_0).   (8.28)

For a locally quadratic function with positive definite H, rescaling the gradient by H^{-1} jumps directly to the minimum; for a convex but nonquadratic objective the update is iterated, yielding the training procedure of algorithm 8.8.

Algorithm 8.8 Newton's method with objective J(θ) = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i))
Require: Initial parameter θ_0
Require: Training set of m examples
while stopping criterion not met do
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian: H ← (1/m) ∇_θ² Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian inverse: H^{-1}
  Compute update: ∆θ = −H^{-1} g
  Apply update: θ = θ + ∆θ
end while

This two-step procedure, updating the quadratic approximation and then updating the parameters according to equation 8.28, can be iterated as long as the Hessian remains positive definite. As discussed in section 8.2.3, however, the objective surface in deep learning is nonconvex, with features such as saddle points where the Hessian has negative eigenvalues, and there Newton's method can cause updates to move in the wrong direction. This can be avoided by regularizing the Hessian, most commonly by adding a constant α along its diagonal:

θ* = θ_0 − [H(f(θ_0)) + αI]^{-1} ∇_θ f(θ_0).   (8.29)

This regularization is used in approximations to Newton's method such as the Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963) and works fairly well as long as the negative eigenvalues of the Hessian are close to zero. With strong negative curvature, α must be large enough to offset the negative eigenvalues, but as α grows the Hessian becomes dominated by the αI diagonal and the chosen direction approaches the plain gradient divided by α, so Newton's method may end up taking smaller steps than well-tuned gradient descent. Beyond saddle points, the computational burden is prohibitive: with k parameters, Newton's method requires inverting a k × k matrix at every iteration, with complexity O(k³), so only networks with very small numbers of parameters can practically be trained this way.
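For a small problem where the gradient and Hessian can be computed explicitly, the regularized update of equation 8.29 is a few lines; solving the linear system instead of forming the inverse is a standard implementation choice, and `grad`/`hess` are assumed to be supplied by the caller.

```python
import numpy as np

def regularized_newton_step(theta, grad, hess, alpha=1e-2):
    """One step of Newton's method with a regularized Hessian (equation 8.29)."""
    k = theta.shape[0]
    step = np.linalg.solve(hess + alpha * np.eye(k), grad)  # avoids forming the explicit inverse
    return theta - step
```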

8.6.2 Conjugate Gradients

Conjugate gradients efficiently avoids computing the inverse Hessian by iteratively descending along conjugate directions. The inspiration comes from the weakness of the method of steepest descent (section 4.3), which applies line searches iteratively along the gradient direction. Figure 8.6 illustrates how steepest descent in a quadratic bowl progresses in an ineffective back-and-forth zigzag, because each line search direction is guaranteed to be orthogonal to the previous one: if the previous direction is d_{t−1}, then at the line search minimum the directional derivative vanishes, ∇_θ J(θ) · d_{t−1} = 0, and the gradient at that point, which defines the new direction, is orthogonal to d_{t−1}.

Figure 8.6: The method of steepest descent applied to a quadratic cost surface, jumping at each step to the point of lowest cost along the current gradient direction. This resolves some of the problems of the fixed learning rate seen in figure 4.6, but progress toward the optimum is still a back-and-forth zigzag, since each direction is orthogonal to the previous one.

Descending along orthogonal directions does not preserve the minimum along previous search directions, giving rise to the zigzag pattern: after minimizing along the current gradient, we must re-minimize the objective along the previous one, undoing progress already made. The method of conjugate gradients addresses this by seeking a search direction that is conjugate to the previous direction, that is, one that does not undo progress made along it. At training iteration t the new direction takes the form

d_t = ∇_θ J(θ) + β_t d_{t−1},   (8.30)

where the coefficient β_t controls how much of d_{t−1} is added back. Two directions d_t and d_{t−1} are conjugate if d_t⊤ H d_{t−1} = 0, where H is the Hessian. Imposing conjugacy directly would require computing eigenvectors of H, defeating the goal of avoiding heavy second-order computation, but β_t can fortunately be computed without it. Two popular choices are:

1. Fletcher-Reeves:
β_t = (∇_θ J(θ_t)⊤ ∇_θ J(θ_t)) / (∇_θ J(θ_{t−1})⊤ ∇_θ J(θ_{t−1}))   (8.31)

2. Polak-Ribière:
β_t = ((∇_θ J(θ_t) − ∇_θ J(θ_{t−1}))⊤ ∇_θ J(θ_t)) / (∇_θ J(θ_{t−1})⊤ ∇_θ J(θ_{t−1}))   (8.32)

For a quadratic surface, the conjugate directions ensure that the gradient along a previous direction does not grow again in magnitude.

In a k-dimensional parameter space, conjugate gradients requires at most k line searches to reach the minimum of a quadratic objective. The full procedure is given in algorithm 8.9.

Algorithm 8.9 The conjugate gradient method
Require: Initial parameters θ_0
Require: Training set of m examples
Initialize ρ_0 = 0, g_0 = 0, t = 1
while stopping criterion not met do
  Initialize the gradient g_t = 0
  Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute β_t = ((g_t − g_{t−1})⊤ g_t) / (g_{t−1}⊤ g_{t−1})   (Polak-Ribière)
  (Nonlinear conjugate gradient: optionally reset β_t to zero, for example when t is a multiple of some constant k, such as k = 5)
  Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
  Perform line search to find: ϵ* = argmin_ϵ (1/m) Σ_{i=1}^m L(f(x^(i); θ_t + ϵρ_t), y^(i))
  (On a truly quadratic cost function, solve for ϵ* analytically rather than searching for it explicitly)
  Apply update: θ_{t+1} = θ_t + ϵ* ρ_t
  t ← t + 1
end while

So far we have described conjugate gradients for quadratic objectives, but neural network objectives are far from quadratic. The method still applies with some modification as nonlinear conjugate gradients: without the quadratic guarantee, conjugate directions no longer remain at the minimum along previous directions, so the algorithm includes occasional resets in which it restarts with a line search along the unaltered gradient. Practitioners report reasonable results training neural networks this way, though it is often beneficial to begin with a few iterations of SGD before switching. Also, while conjugate gradients is traditionally a batch method, minibatch versions have been used successfully (Le et al., 2011), and adaptations such as scaled conjugate gradients (Moller, 1993) were designed specifically for neural networks.
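A compact sketch of nonlinear conjugate gradients with the Polak-Ribière coefficient (equation 8.32) and periodic restarts follows; the crude backtracking line search stands in for the exact line minimization of algorithm 8.9 and is an assumption of this sketch, as are `f` and `grad_fn`.

```python
import numpy as np

def nonlinear_cg(f, grad_fn, theta, n_iters=100, restart_every=5):
    """Minimal nonlinear conjugate gradient sketch (Polak-Ribiere beta, algorithm 8.9)."""
    g_prev = grad_fn(theta)
    rho = -g_prev
    for t in range(1, n_iters + 1):
        # Backtracking line search along rho: halve the step until the cost decreases.
        eps, f0 = 1.0, f(theta)
        while f(theta + eps * rho) > f0 and eps > 1e-10:
            eps *= 0.5
        theta = theta + eps * rho
        g = grad_fn(theta)
        beta = g @ (g - g_prev) / (g_prev @ g_prev)     # Polak-Ribiere (equation 8.32)
        if t % restart_every == 0:                      # occasional reset to the plain gradient
            beta = 0.0
        rho = -g + beta * rho
        g_prev = g
    return theta
```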

8.6.3 BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm attempts to gain some of the advantages of Newton's method without its computational burden; in that respect it is similar to conjugate gradients, but it approximates Newton's update

θ* = θ_0 − H^{-1} ∇_θ J(θ_0),   (8.33)

where H is the Hessian of J evaluated at θ_0, more directly. The main difficulty is computing H^{-1}; quasi-Newton methods, of which BFGS is the most prominent, approximate it with a matrix M_t that is refined iteratively by low-rank updates (see Luenberger (1984) for the derivation). Once M_t is updated, the descent direction is ρ_t = M_t g_t, a line search along ρ_t determines the step size ϵ*, and the parameter update is

θ_{t+1} = θ_t + ϵ* ρ_t.   (8.34)

Like conjugate gradients, BFGS iterates line searches along directions that incorporate second-order information, but its success does not depend as heavily on finding a point very close to the true line minimum, so it can spend less time per line search. On the other hand, BFGS must store the inverse Hessian approximation M, requiring O(n²) memory, which is impractical for modern deep models with millions of parameters.

Limited-memory BFGS (L-BFGS) avoids this memory cost by not storing the full approximation: it computes M in the same way as BFGS but starts each step from the assumption that M^{(t−1)} is the identity matrix. With exact line searches the directions defined by L-BFGS are mutually conjugate, and unlike conjugate gradients the procedure remains well behaved when the line minimum is reached only approximately. The strategy can also store some of the vectors used to update M at each step, retaining more information about the Hessian at a cost of only O(n) per step.
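In practice L-BFGS is almost always used through an existing implementation. As an illustration only (the toy objective is an assumption, not from the text), SciPy's optimizer can be called as follows:

```python
import numpy as np
from scipy.optimize import minimize

def f(theta):
    """Toy smooth objective used only to demonstrate the call."""
    return 0.5 * theta @ theta + np.sin(theta).sum()

def grad_f(theta):
    return theta + np.cos(theta)

theta0 = np.ones(5)
result = minimize(f, theta0, jac=grad_f, method='L-BFGS-B')
print(result.x, result.fun)
```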

8.7 Optimization Strategies and Meta-Algorithms

Many optimization techniques are not algorithms per se but general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.

8.7.1 Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep networks, and it is not actually an optimization algorithm at all; it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve compositions of several functions or layers. The gradient tells us how to update each parameter under the assumption that the other layers do not change, but in practice we update all layers simultaneously, so unexpected results can occur. As a simple example, consider a deep network with one unit per layer and no activation functions: ŷ = x w_1 w_2 w_3 ... w_l, where w_i is the weight of layer i and the output of layer i is h_i = h_{i−1} w_i. The output ŷ is linear in x but nonlinear in the weights w_i. Suppose the cost has placed a gradient of 1 on ŷ, so we wish to decrease ŷ slightly; back-propagation gives g = ∇_w ŷ. Consider the update w ← w − ϵg. The first-order Taylor approximation predicts that ŷ decreases by ϵ g⊤g, so to decrease ŷ by 0.1 we might set the learning rate ϵ to 0.1/(g⊤g). The actual update, however, includes second-order and higher effects up to order l; the new value of ŷ is

x (w_1 − ϵg_1)(w_2 − ϵg_2) ... (w_l − ϵg_l).   (8.35)

One second-order term arising from this update is ϵ² g_1 g_2 Π_{i=3}^l w_i, which may be negligible if Π_{i=3}^l w_i is small or exponentially large if the weights of layers 3 through l are greater than 1. This makes it very hard to choose a learning rate, because the effect of updating one layer's parameters depends so strongly on all the other layers. Second-order methods account for second-order interactions, but in very deep networks even higher-order interactions are significant, and building an n-th order optimization algorithm for n > 2 seems hopeless. What can we do instead?

Batch normalization provides an elegant way of reparametrizing almost any deep network, significantly reducing the problem of coordinating updates across many layers. It can be applied to any input or hidden layer. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix with one example per row. To normalize H we replace it with

H' = (H − µ) / σ,   (8.36)

where µ is a vector of per-unit means and σ a vector of per-unit standard deviations, broadcast across the rows so that each element H_{i,j} is normalized by subtracting µ_j and dividing by σ_j. The rest of the network then operates on H' exactly as the original network operated on H. At training time,

µ = (1/m) Σ_i H_{i,:}   (8.37)

and

σ = √(δ + (1/m) Σ_i (H − µ)_i²),   (8.38)

where δ is a small positive value such as 10^{-8} imposed to avoid the undefined gradient of √z at z = 0. Crucially, we back-propagate through these normalization operations, so the gradient can never propose an action that merely increases the mean or standard deviation of a unit h_i: the normalization removes the effect of such an action and zeroes its component of the gradient. This was the major innovation of batch normalization relative to earlier approaches, which added penalties encouraging normalized statistics or renormalized the statistics after each gradient step.

To see why this helps, consider the deep linear example ŷ = x w_1 w_2 ... w_l again. Suppose x is drawn from a unit Gaussian; then h_{l−1} is also Gaussian, being a linear function of x, but it no longer has zero mean and unit variance, and its statistics depend in a complicated way on all the lower-layer weights. Batch normalization replaces h_{l−1} with a normalized ĥ_{l−1} whose mean and variance are restored to 0 and 1.

For almost any update to the lower layers, ĥ_{l−1} remains a unit Gaussian, so the output can be learned as the simple linear function ŷ = w_l ĥ_{l−1}, and learning becomes very easy, because the lower layers no longer influence the first and second order statistics that reach the top. In this linear example the lower layers thereby become useless, but in a deep network with nonlinear activations they remain useful, since they can still perform nonlinear transformations of the data; batch normalization standardizes only the mean and variance of each unit to stabilize learning, while allowing the relationships between units and the nonlinear statistics of each unit to change. A related approach that removes all linear interactions between the units of a layer, and which inspired batch normalization, is given by Desjardins et al. (2015), but it is much more expensive than standardizing each unit separately.

Because normalizing a unit can reduce its expressive power, it is common to replace the batch of hidden activations H with γH' + β rather than with the normalized H', where γ and β are learned parameters that allow the new variable to take any mean and standard deviation. This may seem to undo the normalization (why set the mean to 0 and then allow β to set it back to an arbitrary value?), but the new parametrization has different learning dynamics: in the old parametrization the mean of H was determined by a complicated interaction among all the lower-layer parameters, while in the new one the mean of γH' + β is determined by β alone, which is much easier to learn with gradient descent.

Most layers take the form φ(XW + b), where φ is a fixed nonlinearity such as the rectified linear function. It is natural to ask whether the normalization should be applied to the input X or to the pre-activation XW + b. Ioffe and Szegedy (2015) recommend the latter; more precisely, XW + b is replaced by a normalized version of XW, and the bias b is omitted because it becomes redundant with the β parameter introduced by the reparametrization. The input X is usually the output of a previous nonlinearity such as a ReLU, so its statistics are more non-Gaussian and less amenable to standardization by linear operations. In convolutional networks (chapter 9), the same µ and σ must be applied at every spatial location within a feature map, so that the statistics of the feature map do not depend on location.
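A training-time forward pass for batch normalization (equations 8.36 to 8.38, followed by the γH' + β rescaling) can be sketched as follows; the layer sizes and the ReLU placement mirror the φ(XW) recommendation above but are otherwise illustrative assumptions.

```python
import numpy as np

def batch_norm_forward(H, gamma, beta, delta=1e-8):
    """Batch normalization at training time, followed by the learned rescaling gamma * H' + beta.
    H holds one minibatch, one example per row."""
    mu = H.mean(axis=0)                                       # per-unit mean over the minibatch
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))     # per-unit standard deviation
    H_norm = (H - mu) / sigma
    return gamma * H_norm + beta

# Example: normalize the pre-activations XW of a layer (the bias b is omitted, since it
# would be redundant with beta), then apply the nonlinearity.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 100))
W = rng.normal(size=(100, 50)) * 0.1
gamma, beta = np.ones(50), np.zeros(50)
h = np.maximum(0.0, batch_norm_forward(X @ W, gamma, beta))   # ReLU(BN(XW))
```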

8.7.2 Coordinate Descent

In some cases an optimization problem can be solved quickly by breaking it into pieces. If we minimize f(x) with respect to a single variable x_i, then with respect to another variable x_j, and so on, repeatedly cycling through all the variables, we are guaranteed to arrive at a (local) minimum; this practice is called coordinate descent. More generally, block coordinate descent minimizes with respect to a subset of the variables at a time, and the term coordinate descent is often used for both. The approach makes most sense when the variables separate into groups that play relatively isolated roles, or when optimizing over one group is much more efficient than optimizing over all variables at once. Consider, for example, the sparse coding cost

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} (X − W⊤H)²_{i,j},   (8.39)

whose goal is to find a weight (dictionary) matrix W that linearly decodes a matrix of activation values H to reconstruct the training set X (most applications also constrain the norms of the columns of W to prevent a pathological solution with extremely small H and large W). The function J is not jointly convex, but if we divide its inputs into two sets, the dictionary parameters W and the code representations H, minimizing over either set alone is a convex problem; block coordinate descent, alternating between optimizing W with H fixed and H with W fixed, therefore lets us use efficient convex solvers throughout.

Coordinate descent is not a good strategy when the value of one variable strongly influences the optimal value of another, as in f(x) = (x_1 − x_2)² + α(x_1² + x_2²) with a small positive constant α: the first term wants the two variables to be similar, the second wants them near zero, and the solution is to set both to zero. Newton's method solves this positive definite quadratic problem in a single step, but for small α coordinate descent makes very slow progress, because the first term prevents either variable from being moved far from the current value of the other.
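The slow progress on this example is easy to verify, because each coordinate minimization has a closed form: setting ∂f/∂x_1 = 0 with x_2 fixed gives x_1 = x_2/(1 + α), and symmetrically for x_2. A small sketch:

```python
import numpy as np

def coordinate_descent_quadratic(alpha, n_sweeps=20):
    """Coordinate descent on f(x) = (x1 - x2)**2 + alpha*(x1**2 + x2**2)."""
    x1, x2 = 1.0, -1.0
    for _ in range(n_sweeps):
        x1 = x2 / (1.0 + alpha)   # exact minimizer over x1 with x2 fixed
        x2 = x1 / (1.0 + alpha)   # exact minimizer over x2 with x1 fixed
        print(x1, x2)
    return x1, x2

coordinate_descent_quadratic(alpha=0.05)   # converges toward (0, 0), but only slowly
```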

8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging several points in the trajectory through parameter space visited by an optimization algorithm: if t iterations of gradient descent visit points θ^(1), ..., θ^(t), the output is θ̂^(t) = (1/t) Σ_i θ^(i). On some problem classes, such as gradient descent applied to convex problems, this has strong convergence guarantees; for neural networks the justification is more heuristic, but it performs well in practice. The intuition is that the optimizer may leap back and forth across a valley without ever visiting a point near its bottom, while the average of the points on either side lands close to the bottom. In nonconvex problems the trajectory can be very complicated and visit many regions, so including points from the distant past, possibly separated from the current point by large cost barriers, is not useful; instead one typically uses an exponentially decaying running average,

θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t).   (8.40)

This running-average approach is used in many applications; see Szegedy et al. (2015) for a recent example.
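Maintaining the running average of equation 8.40 alongside a training loop takes one extra line per step; in this sketch the "SGD step" is a random placeholder and the decay value 0.999 is an illustrative assumption.

```python
import numpy as np

def update_average(theta_avg, theta, alpha=0.999):
    """Exponentially decaying running average of the parameters (equation 8.40)."""
    return alpha * theta_avg + (1 - alpha) * theta

# Typical usage: keep a shadow copy of the parameters during training and
# evaluate or deploy the model with theta_avg rather than the raw iterate theta.
rng = np.random.default_rng(0)
theta = np.zeros(10)
theta_avg = theta.copy()
for step in range(1000):
    theta = theta - 0.01 * rng.normal(size=10)   # placeholder for an actual SGD update
    theta_avg = update_average(theta_avg, theta)
```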

8.7.4 Supervised Pretraining

Directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or the task is very difficult. It is sometimes more effective to train a simpler model first, or to train on a simpler task first; strategies that train simple models on simple tasks before confronting the full challenge are collectively known as pretraining. Greedy algorithms break a problem into components and solve each in isolation; combining the individually optimal components is not guaranteed to give an optimal joint solution, but greedy algorithms are much cheaper and the quality of the result is often acceptable. They may also be followed by a fine-tuning stage in which a joint optimization algorithm searches for a solution to the full problem, initialized with the greedy solution, which can greatly speed it up and improve the quality of the solution found.

Pretraining algorithms that break a supervised learning problem into simpler supervised problems are known as greedy supervised pretraining. In the original version (Bengio et al., 2007), illustrated in figure 8.7, each stage trains a shallow supervised MLP whose input is the output of the previously trained hidden layer. Instead of pretraining one layer at a time, Simonyan and Zisserman (2015) pretrain a deep convolutional network (eleven weight layers) and use its first four and last three layers to initialize even deeper networks of up to nineteen layers, whose middle layers are initialized randomly before joint training. Another option, explored by Yu et al. (2010), is to feed each added stage both the outputs of the previously trained MLPs and the raw input. The hypothesis initially advanced by Bengio et al. (2007) for why greedy supervised pretraining helps is that it provides better guidance to the intermediate levels of a deep hierarchy. A related line of work is transfer learning: Yosinski et al. (2014) pretrain a convolutional network with eight layers of weights on a subset of the 1,000 ImageNet object categories and then use its first k layers to initialize a second network of the same size, trained jointly on a different subset of categories with fewer training examples.

Figure 8.7: One form of greedy supervised pretraining (Bengio et al., 2007). (a) A sufficiently shallow architecture, with input x, hidden layer h^(1) (weights W^(1)), and output y (weights U^(1)), is trained first. (b) Another drawing of the same architecture. (c) The output layer of the original network is discarded, and the output of the first hidden layer is fed as input to a second single-hidden-layer MLP (weights W^(2), U^(2)) trained with the same objective, adding a second hidden layer h^(2); this can be repeated for as many layers as desired. (d) The result viewed as a single feedforward network; all layers can also be jointly fine-tuned, either at the end or at each stage of the process.

This is an instance of transfer learning, discussed further in section 15.2, with both sets of tasks drawn from the 1,000 ImageNet categories. A related approach is FitNets (Romero et al., 2015). It begins by training a network with low enough depth and great enough width to be easy to train; this network, the teacher, then becomes a resource for a second network, the student, which is much deeper and thinner (eleven to nineteen layers) and would be hard to train with SGD under normal circumstances. The student is trained not only to predict the output for the original task but also to predict the values of the teacher's middle layer; these extra hints about how the hidden layers should be used simplify the optimization problem, and thin, deep students trained this way can outperform their wider, shallower teacher.

8.7.5 Designing Models to Aid Optimization

The best strategy for improving optimization is not always a better optimization algorithm; many improvements in optimizing deep models have come from designing the models themselves to be easier to optimize. In practice, choosing a model family that is easy to optimize matters more than using a powerful optimizer: most advances in neural network learning over the past 30 years have come from changing the model family rather than the optimization procedure, and SGD with momentum, already in use in the 1980s, remains in use in state-of-the-art applications today. Modern networks reflect a design choice in favor of linear transformations between layers and activation functions that are differentiable almost everywhere with significant slope over much of their domain; innovations such as the LSTM, rectified linear units, and maxout units all move toward more nearly linear functions than earlier sigmoidal designs, keeping the gradient informative even far from the optimum. Other design strategies help as well: linear paths or skip connections between layers shorten the path from lower-layer parameters to the output, mitigating the vanishing gradient problem (Srivastava et al., 2015), and auxiliary copies of the output attached to intermediate hidden layers, as in GoogLeNet (Szegedy et al., 2014a) and deeply supervised nets (Lee et al., 2014), are trained on the same task as the primary output so that the lower layers receive a large gradient.

8.7.6 Continuation Methods and Curriculum Learning

As argued in section 8.2.7, many optimization difficulties arise from the global structure of the cost function and cannot be resolved merely by better local update directions; the dominant strategy for overcoming them is to initialize the parameters in a region connected to the solution by a short path that local descent can discover. Continuation methods make optimization easier by constructing a series of objective functions {J^(0), ..., J^(n)} over the same parameters, designed to be increasingly difficult: J^(0) is easy to minimize and J^(n), the hardest, is the true cost J(θ). Saying that J^(i) is easier than J^(i+1) means it is well behaved over more of θ space, so that the solution of one function is a good starting point for the next; we solve an easy problem and refine its solution through incrementally harder problems until we reach the true one. See Wu (1997) for an example and review of traditional continuation methods and Mobahi and Fisher (2015) for an overview of the recent literature, especially for AI applications.

Traditional continuation methods were designed to overcome local minima by "blurring" the cost function, approximating

J^(i)(θ) = E_{θ'∼N(θ'; θ, σ^(i)²)} J(θ')   (8.41)

via sampling, on the intuition that some nonconvex functions become approximately convex when blurred while preserving the location of the global minimum. This approach can break down in three ways. First, it might require so many incremental cost functions that the cost of the whole procedure remains high, and NP-hard problems remain NP-hard even when continuation methods are applicable. Second, the function may never become convex no matter how much it is blurred (consider J(θ) = −θ⊤θ). Third, the blurred function may become convex but its minimum may track to a local rather than a global minimum of the original cost. Local minima are no longer believed to be the primary difficulty in neural network optimization, but continuation methods can still help: the easier objectives can eliminate flat regions, decrease the variance of gradient estimates, improve the conditioning of the Hessian, or otherwise make local updates more informative.

Curriculum learning, or shaping, can be interpreted as a continuation method. It is based on planning the learning process to begin with simple concepts and progress to more complex concepts that depend on them, a strategy previously known to accelerate progress in animal training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and in machine learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009) justified it as a continuation method in which the earlier J^(i) are made easier by increasing the influence of simpler examples, either by weighting their contribution to the cost more heavily or by sampling them more often. Curricula have been shown to be effective on natural language tasks (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer vision tasks (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013), and curriculum learning is consistent with the way humans teach (Khan et al., 2011), starting with easy, prototypical examples before refining the decision surface with less obvious cases; curriculum-based strategies can also increase the effectiveness of other teaching strategies (Basu and Christensen, 2013). An important refinement arose in training recurrent networks to capture long-term dependencies: Zaremba and Sutskever (2014) found much better results with a stochastic curriculum, in which a random mix of easy and difficult examples is always presented but the average proportion of the more difficult examples (those with longer-term dependencies) is gradually increased, whereas a deterministic curriculum brought no improvement over the baseline of training on the full training set.
