Deep Learning, Chapter 8: Optimization for Training Deep Models
Japanese translation of Deep Learning (matsuolab translation project), September 18, 2017.
This chapter focuses on one particular case of optimization, building on the basic numerical optimization of chapter 4: finding the parameters θ of a neural network that significantly reduce a cost function J(θ). Unlike models such as PCA, which admit closed-form solutions, neural networks must be trained iteratively.

8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways. In machine learning we usually care about some performance measure P, defined with respect to the test set and possibly intractable, and we reduce a different cost function J(θ) only in the hope of improving P. This is in contrast to pure optimization, where minimizing J is itself the goal.

Typically the cost function can be written as an average over the training data, such as

    J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y),    (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, p̂_data is the empirical distribution, and, in the supervised case, y is the target output. We would usually prefer to minimize the corresponding objective taken over the true data-generating distribution p_data rather than over the training set:

    J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y).    (8.2)
8.1.1 Empirical Risk Minimization

The quantity in equation 8.2 is known as the risk. If we knew the true distribution p_data(x, y), risk minimization would be an ordinary optimization task. When we do not know p(x, y) but have only a training set, we replace the true distribution with the empirical distribution p̂(x, y) and minimize the empirical risk,

    E_{(x,y)∼p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),    (8.3)

where m is the number of training examples. This training process is known as empirical risk minimization. However, empirical risk minimization is prone to overfitting, and many useful loss functions, such as the 0-1 loss, have no useful derivatives (the derivative is either 0 or undefined everywhere), so in deep learning we rarely use it directly.
8.1.2 Surrogate Loss Functions and Early Stopping

Sometimes the loss function we actually care about, say 0-1 classification error, cannot be optimized efficiently; exactly minimizing expected 0-1 loss is typically intractable (Marcotte and Savard, 1992). In such situations we minimize a surrogate loss function instead, such as the negative log-likelihood of the correct class, which acts as a proxy for the 0-1 loss but is easier to optimize. In some cases the surrogate even lets us learn more: test-set 0-1 loss often continues to decrease after training-set 0-1 loss has reached zero, because the surrogate keeps pushing the classes further apart, yielding a more robust classifier.

A second important difference from pure optimization is that training algorithms do not usually halt at a local minimum. Instead, training often halts based on an early stopping criterion (section 7.8) that depends on the true underlying loss, such as 0-1 loss measured on a validation set, typically while the surrogate loss still has large derivatives.
8.1.3 Batch and Minibatch Algorithms

One aspect that separates machine learning from general optimization is that the objective function usually decomposes as a sum over the training examples. For example, maximum likelihood estimation is

    θ_ML = argmax_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ),    (8.4)

which is equivalent to maximizing the expectation over the empirical distribution,

    J(θ) = E_{(x,y)∼p̂_data} log p_model(x, y; θ).    (8.5)

Most optimization algorithms use the gradient of this objective, which is itself an expectation:

    ∇_θ J(θ) = E_{(x,y)∼p̂_data} ∇_θ log p_model(x, y; θ).    (8.6)

Computing this expectation exactly over the whole training set is expensive, so in practice we estimate it from a randomly sampled subset of examples. Recall that the standard error of a mean estimated from n samples is σ/√n (equation 5.46), where σ is the true standard deviation: the √n in the denominator shows that the returns from using more examples are less than linear. Compare two gradient estimates, one based on 100 examples and the other on 10,000: the latter requires 100 times more computation but reduces the standard error only by a factor of 10. Most optimization algorithms therefore converge much faster, in terms of total computation, when allowed to compute rapid approximate gradient estimates rather than slow exact ones. A further motivation is redundancy in the training set: in the worst case, all m examples could be identical copies, and a sample of a single example would give the correct gradient with 1/m the computation.

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods. Those that use only a single example at a time are called stochastic or online methods. Most algorithms used for deep learning fall in between, using more than one but fewer than all the examples; these were traditionally called minibatch or minibatch stochastic methods, and are now commonly called simply stochastic methods.
A canonical example of a stochastic method is stochastic gradient descent, presented in detail in section 8.3.1. Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches, motivating some absolute minimum batch size below which processing a minibatch takes no less time.
• If all examples in the batch are processed in parallel, the amount of memory scales with the batch size; for many hardware setups this is the limiting factor.
• Some kinds of hardware achieve better runtime with specific array sizes. Especially with GPUs, power-of-2 batch sizes commonly offer better runtime, with typical sizes ranging from 32 to 256, and 16 sometimes attempted for large models.
• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process.

Different kinds of algorithms use different kinds of information from the minibatch, and some are more sensitive to sampling error than others. Methods that compute updates based only on the gradient g are usually relatively robust and can handle small batch sizes, like 100. Second-order methods, which also use the Hessian H to compute updates such as H^{-1} g, typically require much larger batch sizes, like 10,000. When H has a poor condition number, it amplifies whatever estimation error is present in g, so small fluctuations in the estimate of g cause large changes in the update H^{-1} g, and errors in the estimate of H itself make matters even worse.
It is also crucial that the minibatches be selected randomly. Many datasets are arranged so that successive examples are highly correlated, so it is necessary to shuffle the examples; for very large datasets it is usually sufficient to shuffle once and then consume the examples in that fixed order. Many optimization problems also decompose well enough over examples that entire separate updates can be computed over different minibatches in parallel (section 12.1.3).

An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (equation 8.2), so long as no examples are repeated. Writing the generalization error as a sum over a discrete (x, y) space,

    J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),    (8.7)

the exact gradient is

    g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).    (8.8)

Sampling a minibatch of examples {x^(1), . . . , x^(m)} with corresponding targets y^(i) from p_data and computing

    ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))    (8.9)

therefore yields an unbiased estimate of the gradient of the generalization error. On the first pass through the data, each minibatch is drawn from fresh samples of p_data, so SGD minimizes generalization error directly; from the second epoch onward examples are reused and the estimate becomes biased, though the decrease in training error usually outweighs the harm. See Bottou and Bousquet (2008) for a discussion of the trade-offs involved as dataset sizes grow.

8.2 Challenges in Neural Network Optimization

Optimization in general is an extremely difficult task. When training neural networks we must confront the general nonconvex case. This section summarizes several of the most prominent challenges.
8.2.1 Ill-Conditioning

Ill-conditioning of the Hessian matrix H is a prominent problem in most numerical optimization (section 4.3.1). It can manifest in SGD by causing the algorithm to become "stuck," in the sense that even very small steps increase the cost function. Recall from equation 4.9 that a second-order Taylor series expansion predicts that a gradient descent step of −ϵg will change the cost by

    (1/2) ϵ² g⊤H g − ϵ g⊤g.    (8.10)

Ill-conditioning becomes a problem when (1/2) ϵ² g⊤H g exceeds ϵ g⊤g. One can monitor the squared gradient norm g⊤g and the curvature term g⊤H g to determine whether ill-conditioning is hurting training. In many cases the gradient norm does not shrink significantly throughout learning, but g⊤H g grows by more than an order of magnitude, so learning becomes very slow despite a strong gradient, because the learning rate must shrink to compensate. Figure 8.1 shows an example where the gradient increases markedly during successful training of a neural network.
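As a small illustration, the sketch below (a hypothetical quadratic objective, numpy only) evaluates the two terms of equation 8.10 to check whether a proposed step −ϵg is predicted to increase the cost:

import numpy as np

def predicted_change(H, g, eps):
    """Second-order prediction of the cost change for a step of -eps * g (eq. 8.10)."""
    curvature_term = 0.5 * eps**2 * g @ H @ g   # (1/2) eps^2 g^T H g
    descent_term = eps * g @ g                  # eps g^T g
    return curvature_term - descent_term

# Ill-conditioned quadratic: condition number 1e4.
H = np.diag([1e4, 1.0])
g = np.array([1.0, 1.0])
for eps in [1e-1, 1e-3, 1e-5]:
    print(eps, predicted_change(H, g, eps))
# For larger eps the curvature term dominates and the step is predicted
# to *increase* the cost; only a tiny eps yields a decrease.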
[Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network, even as the classification error rate decreases. (Left panel: gradient norm, roughly 0 to 16, versus training time in epochs, 0 to 250; right panel: classification error rate, roughly 0.1 to 1.0, versus training time in epochs.)]

8.2.2 Local Minima

One of the most prominent features of a convex optimization problem is that it can be reduced to finding a local minimum: any local minimum is guaranteed to be a global minimum. With nonconvex functions, such as neural networks, many local minima are possible; indeed, nearly any deep model is essentially guaranteed to have an extremely large number of them, largely because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of its parameters. Models with latent variables are often not identifiable because equivalent models can be obtained by exchanging latent variables with each other: with m layers of n units each, there are n!^m ways of arranging the hidden units to obtain equivalent models. This kind of nonidentifiability is known as weight space symmetry.

Many networks have further nonidentifiabilities. For example, in a network with rectified linear or maxout units, we can scale the incoming weights and biases of a unit by α and its outgoing weights by 1/α, for any positive α, and obtain the same function; if the cost function does not depend directly on the weights (e.g., through weight decay), every local minimum of such a network lies on an (m × n)-dimensional hyperbola of equivalent local minima. These symmetries mean the cost function can have an extremely large, even uncountably infinite, number of local minima, but all the minima arising this way are equivalent in cost, so they are not a problematic form of nonconvexity. Local minima are problematic only if they have high cost compared with the global minimum. It was once widely believed that such bad local minima were a common problem (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992); today, experts suspect that for sufficiently large networks most local minima have low cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014). A simple diagnostic: if the gradient norm does not shrink to insignificant size over time, the problem is neither a local minimum nor any other kind of critical point.
8.2.3 Plateaus, Saddle Points and Other Flat Regions

For many high-dimensional nonconvex functions, local minima (and maxima) are in fact rare compared with another kind of zero-gradient point: the saddle point (section 4.5). At a saddle point the Hessian has both positive and negative eigenvalues, so some neighboring points have greater cost and others lower cost. For a random function f : R^n → R, the expected ratio of saddle points to local minima grows exponentially with n: intuitively, a local minimum requires every Hessian eigenvalue to be positive, which becomes exponentially unlikely as n grows. Moreover, the eigenvalues become more likely to be positive in regions of low cost, so local minima are much more likely than high-cost saddle points to have low cost. Dauphin et al. (2014) showed experimentally that many classes of random functions, and real neural networks, exhibit this behavior.

Theory supports this picture for some network families: shallow autoencoders with no nonlinearities (described in chapter 14) have global minima and saddle points but no local minima with higher cost than the global minimum (Baldi and Hornik, 1989); Saxe et al. (2013) extended this kind of analysis to deep linear networks, and Dauphin et al. (2014) and Choromanska et al. (2014) argued theoretically that real nonlinear networks should behave similarly.

For first-order methods the implications are not entirely clear: the gradient can become very small near a saddle point, yet gradient descent empirically seems able to escape saddle points in many cases, as in the visualizations of Goodfellow et al. (2015) shown in figure 8.2. For second-order methods saddle points are a more serious problem: Newton's method explicitly seeks a point of zero gradient and can jump to a saddle point, which may help explain why second-order methods have not displaced gradient descent for neural networks. Dauphin et al. (2014) introduced a saddle-free Newton method designed to avoid this problem. There are also other flat regions, such as wide plateaus where both the gradient and the Hessian are zero; these pose a problem for all numerical optimization algorithms.

[Figure 8.2: A visualization of the cost function J(θ) of a neural network, plotted over two projections of θ; adapted from Goodfellow et al. (2015). Such visualizations usually do not show many conspicuous obstacles. Prior to the success of SGD for training very large models, beginning around 2012, neural network cost surfaces were believed to have much more nonconvex structure than these projections reveal. The main obstacle visible here is a high-cost saddle point near the initialization, which the SGD trajectory escapes readily.]
8.2.4 Cliffs and Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, a gradient update can move the parameters extremely far, usually jumping off the cliff structure altogether.

[Figure 8.3: The objective function J(w, b) of a highly nonlinear deep network, plotted over a weight w and a bias b, contains a sharp "cliff." The gradient on the cliff face is so large that a naive update catapults the parameters far from the solution.]

The cliff can be dangerous whether approached from above or from below, but its most serious consequences can be avoided with the gradient clipping heuristic (Pascanu et al., 2013): when the gradient becomes very large, reduce the step size so that the step stays within the region where the gradient direction is informative. Cliff structures are most common in the cost functions of recurrent networks, because such models multiply many factors together, one per time step (see section 10.12.1).
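A minimal sketch of norm-based gradient clipping (the threshold value here is an arbitrary choice, not one prescribed by the text):

import numpy as np

def clip_gradient(g, max_norm=1.0):
    """Rescale g so its L2 norm never exceeds max_norm (in the style of Pascanu et al., 2013)."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

g = np.array([300.0, -400.0])   # an exploding gradient (norm 500)
print(clip_gradient(g))         # direction preserved, norm reduced to 1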
8.2.5 Long-Term Dependencies

Another difficulty arises when the computational graph becomes extremely deep, as in feedforward networks with many layers and in recurrent networks, which build very deep graphs by repeatedly applying the same operation at each time step (chapter 10). Suppose a computational path involves repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. If W has an eigendecomposition W = V diag(λ) V^{-1}, then

    W^t = (V diag(λ) V^{-1})^t = V diag(λ)^t V^{-1}.    (8.11)

Any eigenvalue λ_i with absolute value not near 1 will either explode, if |λ_i| > 1, or vanish, if |λ_i| < 1. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled by diag(λ)^t: vanishing gradients make it difficult to know which direction to move to improve the cost, and exploding gradients make learning unstable (the cliff structures of section 8.2.4 are an example). Repeated multiplication by W closely resembles the power method for finding the largest eigenvalue of W: x⊤W^t eventually discards all components of x orthogonal to the principal eigenvector of W.

Recurrent networks use the same matrix W at each time step, so they suffer from this problem acutely; feedforward networks use a different matrix at each layer and largely avoid it, so even very deep feedforward networks can do a reasonable job of avoiding vanishing and exploding gradients (Sussillo, 2014). We defer further discussion of recurrent networks to section 10.7.
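The effect of diag(λ)^t is easy to see numerically; a small demonstration (matrix and scale chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.3    # spectral radius well below 1 here
x = rng.normal(size=4)

for t in [1, 10, 50]:
    print(t, np.linalg.norm(x @ np.linalg.matrix_power(W, t)))
# With all |lambda_i| < 1 the product vanishes; rescaling W by a factor
# that pushes an eigenvalue above 1 makes the same product explode.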
8.2.6 Inexact Gradients

Most optimization algorithms are designed assuming access to the exact gradient or Hessian. In practice we usually have only a noisy or even biased estimate: nearly every deep learning algorithm relies on sampling-based estimates, such as using a minibatch of training examples to compute the gradient. In other cases, the objective function itself is intractable, so even its gradient must be approximated. Various neural network optimization methods account for such imperfections, for example by choosing a surrogate loss that is easier to approximate than the true loss; these more advanced models are covered in part III.
8.2.7 Poor Correspondence between Local and Global Structure

The problems discussed so far concern properties of the loss function at a single point θ. It is also possible that all the local improvement available at the current point yields little global progress: J(θ) may be well behaved locally, and yet the direction that reduces the cost most locally may not point anywhere near the distant region of much lower cost. (Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution; figure 8.2 shows that the typical learning trajectory is a wide arc rather than a short path.)

Many cost functions also have no global minimum point at all. The negative log-likelihood −log p(y | x; θ) can asymptotically approach some value as the model becomes more confident without ever attaining it: a classifier with a softmax output over y can drive the cross-entropy arbitrarily close to zero without reaching it, and a model of real-valued targets, p(y | x) = N(y; f(θ), β^{-1}), can drive its negative log-likelihood toward negative infinity by increasing the precision β without bound when f(θ) fits all the training targets exactly. In such cases, optimization based on local moves may fail to find a good solution path, as illustrated in figure 8.4.

[Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here, J(θ) has no critical points, and moving downhill from the initial point travels away from the region of low cost.]

Much current research is directed at finding good initial points for problems with difficult global structure, rather than at algorithms that use nonlocal moves.
8.2.8 Theoretical Limits of Optimization

Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on practice: some apply only to exact solutions when in practice we accept approximate ones; some concern network families more pathological than those used in practice; and the theoretical notion of "solving" a task is less relevant than reducing the error enough to generalize well. Developing realistic bounds on the performance of optimization algorithms remains an important goal of machine learning research.
8.3 Basic Algorithms

We have previously introduced the gradient descent algorithm (section 4.3), which follows the gradient of the entire training set downhill, and the stochastic gradient descent variant, which follows the gradient of randomly selected minibatches downhill (sections 5.9 and 8.1.3).
8.3.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms in machine learning. As discussed in section 8.1.3, an unbiased estimate of the gradient can be obtained from the average gradient over a minibatch of m examples drawn i.i.d. from the data-generating distribution. Algorithm 8.1 shows how to follow this estimate downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ϵ_k.
Require: Initial parameter θ.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ϵ ĝ
end while
A crucial parameter for SGD is the learning rate. In practice, it is necessary to gradually decrease the learning rate over time, so we denote the learning rate at iteration k as ϵ_k. This is because the SGD gradient estimator introduces noise, from the random sampling of the m training examples, that does not vanish even at a minimum; the true gradient of the total cost becomes small and then 0 as we approach a minimum, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is

    Σ_{k=1}^{∞} ϵ_k = ∞,    (8.12)

and

    Σ_{k=1}^{∞} ϵ_k² < ∞.    (8.13)

In practice, it is common to decay the learning rate linearly until iteration τ,

    ϵ_k = (1 − α) ϵ_0 + α ϵ_τ,    (8.14)

with α = k/τ, and to leave ϵ constant after iteration τ.

The learning rate may be chosen by trial and error, but it is usually best chosen by monitoring learning curves that plot the objective as a function of time; this is more of an art than a science. With the linear schedule, the parameters to choose are ϵ_0, ϵ_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set, and ϵ_τ should be set to roughly 1% of ϵ_0. The main question is how to set ϵ_0: too large, and the learning curve shows violent oscillations with the cost often increasing significantly; too small, and learning proceeds slowly and may become stuck with a high cost. Typically the optimal initial learning rate, in terms of total training time and final cost, is higher than the learning rate that performs best after the first 100 iterations or so, so it helps to monitor the first several iterations and then use a rate higher than the best-performing one, while avoiding severe instability.

The most important property of SGD, and of related minibatch or online gradient-based optimization, is that the computation time per update does not grow with the number of training examples. To study convergence rates it is common to measure the excess error J(θ) − min_θ J(θ). For SGD applied to a convex problem, the excess error after k iterations is O(1/√k), and in the strongly convex case O(1/k); these bounds cannot be improved without additional assumptions. Batch gradient descent enjoys better theoretical convergence rates, but the Cramér-Rao bound (Cramér, 1946; Rao, 1945) implies that generalization error cannot decrease faster than O(1/k), so Bottou and Bousquet (2008) argue that it may not be worthwhile for machine learning to pursue optimization that converges faster than O(1/k): faster convergence presumably corresponds to overfitting. Moreover, asymptotic analysis obscures the advantages SGD enjoys early in learning: with a large dataset, SGD's ability to make rapid initial progress while evaluating the gradient on only a few examples outweighs its slow asymptotic convergence, and most of the algorithms in the remainder of this chapter achieve benefits that are real in practice but lost in the constant factors of the O(1/k) analysis. One can also trade off the benefits of batch and stochastic gradient descent by gradually increasing the minibatch size during learning. For more information on SGD, see Bottou (1998).
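A minimal sketch of SGD with the linear learning-rate schedule of equation 8.14 (the loss, data, and schedule constants are illustrative placeholders):

import numpy as np

def sgd(grad_fn, theta, data, eps0=0.1, eps_tau=0.001, tau=100, steps=200, m=32,
        rng=np.random.default_rng(0)):
    """SGD (algorithm 8.1) with the linear decay of eq. 8.14."""
    for k in range(steps):
        alpha = min(k / tau, 1.0)
        eps_k = (1 - alpha) * eps0 + alpha * eps_tau   # constant after iteration tau
        batch = data[rng.choice(len(data), size=m)]    # sample a minibatch
        theta = theta - eps_k * grad_fn(theta, batch)
    return theta

# Toy problem: estimate the mean of the data by minimizing squared error.
data = np.random.default_rng(1).normal(loc=3.0, size=1000)
grad_fn = lambda theta, batch: 2 * np.mean(theta - batch)
print(sgd(grad_fn, theta=0.0, data=data))   # approaches 3.0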
8.3.2 Momentum

While stochastic gradient descent remains a popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. Its effect is illustrated in figure 8.5.

[Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the Hessian and variance in the stochastic gradient. Contour lines depict a quadratic loss function with a poorly conditioned Hessian (axes roughly -30 to 20 in each coordinate); the path crossing the contours is the route followed by the momentum learning rule. Whereas plain gradient descent would waste time oscillating back and forth across the narrow valley, momentum correctly traverses it lengthwise.]
Formally, the momentum algorithm introduces a variable v that plays the role of velocity: the direction and speed at which the parameters move through parameter space, set to an exponentially decaying average of the negative gradient. The name derives from a physical analogy (compare section 4.6) in which the negative gradient is a force moving a particle of unit mass, so that v is also the particle's momentum. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients decay. The update rule is

    v ← α v − ϵ ∇_θ [ (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)) ],    (8.15)
    θ ← θ + v.    (8.16)

The velocity v accumulates the gradient elements ∇_θ [(1/m) Σ_i L(f(x^(i); θ), y^(i))]; the larger α is relative to ϵ, the more previous gradients affect the current direction. The SGD algorithm with momentum is given as algorithm 8.2.

Previously, the size of each step was simply the gradient norm multiplied by the learning rate; now it depends on how large and how aligned a sequence of gradients is, and is largest when many successive gradients point in the same direction. If the momentum algorithm always observes the same gradient g, it accelerates in the direction of −g until reaching a terminal velocity at which the size of each step is

    ϵ ||g|| / (1 − α).    (8.17)

The ratio of this terminal step size to the plain gradient step ϵ ||g|| is thus 1/(1 − α) (8.18), so it is helpful to think of the momentum hyperparameter in those terms: α = 0.9, for example, corresponds to multiplying the maximum speed by 10 relative to plain gradient descent. Common values of α are 0.5, 0.9 and 0.99. Like the learning rate, α may also be adapted over time, typically beginning with a small value that is later raised, though adapting α is less important than shrinking ϵ over time.

We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The position of the particle at time t is θ(t); the particle experiences a net force f(t) that causes it to accelerate:

    f(t) = ∂²θ(t)/∂t².    (8.19)
Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ϵ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← α v − ϵ g
    Apply update: θ ← θ + v
end while
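A sketch of the loop in algorithm 8.2 (the toy quadratic objective and hyperparameter values are illustrative; a deterministic gradient is used for clarity):

import numpy as np

def sgd_momentum(grad_fn, theta, eps=0.01, alpha=0.9, steps=500):
    """SGD with momentum (algorithm 8.2)."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = alpha * v - eps * g    # velocity accumulates a decaying gradient average
        theta = theta + v
    return theta

# Poorly conditioned quadratic: J(theta) = 0.5 * theta^T H theta, H = diag(50, 1).
H = np.diag([50.0, 1.0])
print(sgd_momentum(lambda th: H @ th, np.array([1.0, 1.0])))  # approaches [0, 0]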
Rather than solving a second-order dynamical system, it is simpler to introduce the velocity v(t) and solve a first-order system:

    v(t) = ∂θ(t)/∂t,    (8.20)
    f(t) = ∂v(t)/∂t.    (8.21)

The momentum algorithm then consists of numerically integrating these differential equations by Euler's method, taking small finite steps in the direction of each gradient.

One force acting on the particle is proportional to the negative gradient of the cost function, −∇_θ J(θ), pushing it downhill. If that were the only force, the particle might never come to rest, oscillating forever in a valley, so we need a second, dissipative force proportional to −v(t) that gradually removes energy and eventually brings the particle to rest near a minimum. Why a force proportional to −v(t) in particular? Two physical alternatives are unsuitable. Dry friction, a constant-magnitude force opposing motion, is too strong when the gradient is small and can stop the particle before it reaches a minimum. Turbulent drag, proportional to the square of the velocity, becomes very weak at low speeds, so weak that the particle, though slowing, can fail to come to rest: integrating the dynamics of a particle subject only to turbulent drag shows its distance from the origin growing like O(log t), without converging. Viscous drag, proportional to −v(t), is just right: strong enough to stop the particle, weak enough that the gradient can keep driving motion until a minimum is reached.
8.3.3 Nesterov Momentum

Sutskever et al. (2013) introduced a variant of the momentum algorithm inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules are

    v ← α v − ϵ ∇_θ [ (1/m) Σ_{i=1}^{m} L(f(x^(i); θ + αv), y^(i)) ],    (8.22)
    θ ← θ + v,    (8.23)

where α and ϵ play roles similar to those in standard momentum. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated: with Nesterov momentum, the gradient is evaluated after the current velocity has been applied, which can be interpreted as adding a correction factor to the standard method. The complete algorithm is presented as algorithm 8.3.

In the convex batch gradient case, Nesterov momentum improves the convergence rate of the excess error from O(1/k) after k steps to O(1/k²) (Nesterov, 1983). Unfortunately, in the stochastic gradient case, Nesterov momentum does not improve the rate of convergence.
Algorithm 8.3 Stochastic gradient descent (SGD) with Nesterov momentum
Require: Learning rate ϵ, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Apply interim update: θ̃ ← θ + α v
    Compute gradient (at interim point): g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Compute velocity update: v ← α v − ϵ g
    Apply update: θ ← θ + v
end while
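A sketch of algorithm 8.3; the only change from algorithm 8.2 is the interim point at which the gradient is evaluated (toy objective and hyperparameters are illustrative):

import numpy as np

def sgd_nesterov(grad_fn, theta, eps=0.01, alpha=0.9, steps=500):
    """SGD with Nesterov momentum (algorithm 8.3)."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta + alpha * v)   # gradient at the interim point theta + alpha*v
        v = alpha * v - eps * g
        theta = theta + v
    return theta

H = np.diag([50.0, 1.0])                 # same ill-conditioned quadratic as before
print(sgd_nesterov(lambda th: H @ th, np.array([1.0, 1.0])))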
8.4 Parameter Initialization Strategies

Some optimization algorithms are not iterative by nature and simply solve for a solution point; others are iterative but, for the right class of problems, converge to acceptable solutions in acceptable time regardless of initialization. Deep learning training algorithms have neither of these luxuries: they are iterative, and the result depends strongly on the starting point. The initial point can determine whether the algorithm converges at all, with some initial points so unstable that the algorithm encounters numerical difficulty and fails altogether. Where learning does converge, the initial point can determine how quickly it converges and whether it converges to a point with high or low cost, and points of comparable cost can have wildly different generalization error, so the initial point affects generalization as well. Modern initialization strategies are simple and heuristic; designing better ones is difficult because neural network optimization is not yet well understood.
Perhaps the only property known with complete certainty is that the initial parameters need to break symmetry between different units: if two hidden units with the same activation function are connected to the same inputs, they must have different initial parameters, or a deterministic learning algorithm will update both in the same way forever. The goal of having each unit compute a different function motivates random initialization. Typically, we set the biases to heuristically chosen constants and initialize only the weights randomly, drawing from a Gaussian or uniform distribution; the choice between Gaussian and uniform seems not to matter much, but the scale of the initial distribution has a large effect.

Larger initial weights yield a stronger symmetry-breaking effect and help avoid losing signal during forward and back-propagation, but weights that are too large may cause exploding values or saturation of the activation function. Perspectives beyond optimization also inform the scale. As reviewed in section 7.8, gradient descent with early stopping resembles weight decay: initializing the parameters θ to θ_0 is similar to imposing a Gaussian prior p(θ) with mean θ_0, and choosing θ_0 near 0 expresses the prior belief that units do not interact unless the likelihood expresses a strong preference that they do. From the optimization point of view, however, we want the weights large enough to propagate information successfully.

These perspectives yield heuristics. A common choice for a fully connected layer with m inputs and n outputs is U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest the normalized initialization

    W_{i,j} ∼ U( −√(6/(m+n)), √(6/(m+n)) ),    (8.24)

a compromise between having the same activation variance and the same gradient variance in every layer, derived assuming a chain of matrix multiplications without nonlinearities. Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen gain g that accounts for the nonlinearity applied at each layer; different values of g make the norms of activations grow or shrink as they propagate. Sussillo (2014) showed that setting the gain correctly is sufficient to train networks as deep as 1,000 layers, without needing orthogonal initializations; increasing g counteracts the vanishing effect described in section 8.2.5.

One drawback of initializing all weights with the same scale is that every individual weight becomes extremely small (of order 1/√m) when a layer is large. Martens (2010) introduced sparse initialization, in which each unit is initialized with exactly k nonzero weights. This keeps the total amount of input to a unit independent of m without shrinking the magnitude of individual weights, though it imposes a strong prior that gradient descent can take a long time to undo when it is wrong (for example, for maxout units).
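A sketch of the two weight initializations just described (layer sizes are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(m, n):
    """Normalized initialization of eq. 8.24 for a layer with m inputs and n outputs."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

def sparse_init(m, n, k=25):
    """Sparse initialization (Martens, 2010): exactly k nonzero weights per unit."""
    W = np.zeros((m, n))
    for j in range(n):
        idx = rng.choice(m, size=min(k, m), replace=False)
        W[idx, j] = rng.normal(size=len(idx))
    return W

W1 = glorot_uniform(784, 256)
W2 = sparse_init(784, 256)
print(W1.std(), (W2 != 0).sum(axis=0))   # per-unit nonzero count equals k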
If computational resources allow, it is usually a good idea to treat the initial scale of the weights of each layer as a hyperparameter, selected by a search algorithm (section 11.4.2); whether to use dense or sparse initialization can also be made a hyperparameter. An alternative is to inspect the range or standard deviation of activations or gradients on a single minibatch of data and increase or shrink the initial weights of a layer accordingly, an idea formalized as an initialization protocol by Mishkin and Matas (2015).

So far we have discussed the weights. Biases are usually set to heuristically chosen constants, most often zero, but there are situations where nonzero values help:

• If a bias is for an output unit, it is often beneficial to initialize it to obtain the right marginal statistics of the output. For example, for a classifier whose class marginal distribution is c, we can solve softmax(b) = c for the output bias vector b (a short numeric sketch appears at the end of this subsection). This applies not only to classifiers but to any model whose output has a marginal distribution we can match (see part III), assuming that at initialization the weights are close enough to zero that the output for input x is determined primarily by the bias.

• Sometimes we choose the bias to avoid causing too much saturation at initialization. For example, setting the bias of ReLU hidden units to 0.1 rather than 0 avoids saturating the ReLU at initialization (although this runs counter to initialization schemes that assume zero bias, e.g., Sussillo, 2014).

• Sometimes a unit controls whether other units are able to participate in a function. In such situations a unit with output u is combined with a gating unit h ∈ [0, 1] as uh, so h must be near 1 at initialization for u to pass through (uh ≈ u); if h ≈ 0 at initialization (uh ≈ 0), the gradient through u vanishes before the gate has learned when to open. For example, Jozefowicz et al. (2015) advocate setting the forget gate bias of the LSTM model to 1, as described in section 10.11.

Besides weights and biases, some models have other parameters whose initialization matters. For example, a model of conditional variance,

    p(y | x) = N(y | w⊤x + b, 1/β),    (8.25)

has a precision parameter β, which we can usually initialize to 1 safely. Another approach is to assume the initial weights are close enough to zero that the output is determined by the biases alone, and set the biases and variance parameters to match the marginal statistics of the output in the training set.

Beyond these simple constant or random methods, it is possible to initialize model parameters using machine learning itself, for example from a model trained unsupervised on the same inputs or from a model trained on a related task; these strategies are discussed in part III.
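The numeric sketch promised above, for the output-bias heuristic: softmax(b) = c can be solved, up to an additive constant, by b = log c (the class marginals here are made up for illustration):

import numpy as np

c = np.array([0.7, 0.2, 0.1])        # desired class marginals (hypothetical)
b = np.log(c)                         # one solution of softmax(b) = c

softmax = lambda z: np.exp(z) / np.exp(z).sum()
print(softmax(b))                     # recovers [0.7, 0.2, 0.1]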
8.5 Algorithms with Adaptive Learning Rates

Neural network researchers have long realized that the learning rate is reliably one of the most difficult hyperparameters to set, because it significantly affects model performance: as seen in sections 4.3 and 8.2, the cost is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm mitigates these issues somewhat, at the price of another hyperparameter. If we believe the directions of sensitivity are somewhat axis aligned, it makes sense to use a separate learning rate for each parameter and to adapt these rates automatically during learning. The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic of this kind: if the partial derivative of the loss with respect to a parameter keeps the same sign, increase that parameter's learning rate; if it changes sign, decrease it. (This rule applies only to full batch optimization.) More recently, a number of incremental methods have been introduced that adapt the learning rates of model parameters from minibatch statistics.

8.5.1 AdaGrad

The AdaGrad algorithm (Duchi et al., 2011), shown as algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all the historical squared values of the gradient. Parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease, so greater progress is made in the more gently sloped directions of parameter space. In the context of convex optimization, AdaGrad enjoys desirable theoretical properties; empirically, however, for training deep networks the accumulation of squared gradients from the beginning of training can cause a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate ϵ
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^{-7}, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← r + g ⊙ g
    Compute update: ∆θ ← −(ϵ / (δ + √r)) ⊙ g    (division and square root applied element-wise)
    Apply update: θ ← θ + ∆θ
end while
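A sketch of the AdaGrad update (toy objective; δ follows the default above, ϵ is an arbitrary choice):

import numpy as np

def adagrad(grad_fn, theta, eps=0.1, delta=1e-7, steps=500):
    """AdaGrad (algorithm 8.4): per-parameter rates from accumulated squared gradients."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = r + g * g                               # accumulate squared gradient
        theta = theta - (eps / (delta + np.sqrt(r))) * g
    return theta

H = np.diag([50.0, 1.0])
print(adagrad(lambda th: H @ th, np.array([1.0, 1.0])))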
8.5.2 RMSProp

The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function; applied to a nonconvex function, the learning trajectory may pass through many different structures before arriving at a locally convex bowl, and AdaGrad, which shrinks the learning rate according to the entire history of the squared gradient, may have made the learning rate too small before arriving there. RMSProp uses an exponentially decaying average to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl.

RMSProp is shown in its standard form as algorithm 8.5 and combined with Nesterov momentum as algorithm 8.6. Compared with AdaGrad, the moving average introduces a new hyperparameter, ρ, that controls the length scale of the average. Empirically, RMSProp is an effective and practical optimization algorithm for deep neural networks, and it is currently one of the go-to methods employed routinely by deep learning practitioners.
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ϵ, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10^{-6}, to stabilize division by small numbers
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Accumulate squared gradient: r ← ρ r + (1 − ρ) g ⊙ g
    Compute parameter update: ∆θ = −(ϵ / √(δ + r)) ⊙ g    (1/√(δ + r) applied element-wise)
    Apply update: θ ← θ + ∆θ
end while

Algorithm 8.6 RMSProp with Nesterov momentum
Require: Global learning rate ϵ, decay rate ρ, momentum coefficient α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute interim update: θ̃ ← θ + α v
    Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
    Accumulate gradient: r ← ρ r + (1 − ρ) g ⊙ g
    Compute velocity update: v ← α v − (ϵ / √r) ⊙ g    (1/√r applied element-wise)
    Apply update: θ ← θ + v
end while
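A sketch of algorithm 8.5 (same toy quadratic as before; ρ follows a common choice):

import numpy as np

def rmsprop(grad_fn, theta, eps=0.01, rho=0.9, delta=1e-6, steps=500):
    """RMSProp (algorithm 8.5): exponentially decaying average of squared gradients."""
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = rho * r + (1 - rho) * g * g
        theta = theta - (eps / np.sqrt(delta + r)) * g
    return theta

H = np.diag([50.0, 1.0])
print(rmsprop(lambda th: H @ th, np.array([1.0, 1.0])))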
8.5.3 Adam

Adam (Kingma and Ba, 2014), presented as algorithm 8.7, is yet another adaptive learning rate optimization algorithm; the name derives from "adaptive moments." It is perhaps best seen as a variant on the combination of RMSProp and momentum, with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient; the most straightforward way to add momentum to RMSProp would instead apply momentum to the rescaled gradients, which is less well motivated. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments, accounting for their initialization at the origin; RMSProp (algorithm 8.5) uses the second-order moment estimate without the correction, so its estimate may have high bias early in training. Adam is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
Algorithm 8.7 The Adam algorithm
Require: Step size ϵ (suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1) (suggested defaults: 0.9 and 0.999, respectively)
Require: Small constant δ for numerical stabilization (suggested default: 10^{-8})
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    t ← t + 1
    Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
    Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
    Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
    Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
    Compute update: ∆θ = −ϵ ŝ / (√r̂ + δ)    (operations applied element-wise)
    Apply update: θ ← θ + ∆θ
end while
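A sketch of algorithm 8.7 with the suggested defaults:

import numpy as np

def adam(grad_fn, theta, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=2000):
    """Adam (algorithm 8.7): bias-corrected first and second moment estimates."""
    s = np.zeros_like(theta)
    r = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        s = rho1 * s + (1 - rho1) * g            # first moment (momentum)
        r = rho2 * r + (1 - rho2) * g * g        # second moment
        s_hat = s / (1 - rho1 ** t)              # bias corrections
        r_hat = r / (1 - rho2 ** t)
        theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta

H = np.diag([50.0, 1.0])
print(adam(lambda th: H @ th, np.array([1.0, 1.0])))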
8.5.4 Choosing the Right Optimization Algorithm

Which algorithm should one choose? Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks: while the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm emerged. Currently, the most popular algorithms in active use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which to use seems to depend largely on the user's familiarity with the algorithm, which eases hyperparameter tuning.
8.6 Approximate Second-Order Methods

In this section we discuss the application of second-order methods to the training of deep networks; see LeCun et al. (1998a) for an earlier treatment. For simplicity of exposition, the only objective function we examine is the empirical risk,

    J(θ) = E_{(x,y)∼p̂_data} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),    (8.26)

though the methods extend readily to more general objectives, such as those including regularization terms (chapter 7).
8.6.1 Newton's Method

In contrast to first-order methods, which use only the gradient, second-order methods also make use of second derivatives. The most widely used second-order method is Newton's method (section 4.3), based on a second-order Taylor series expansion of J(θ) around a point θ_0, ignoring higher-order derivatives:

    J(θ) ≈ J(θ_0) + (θ − θ_0)⊤ ∇_θ J(θ_0) + (1/2)(θ − θ_0)⊤ H (θ − θ_0),    (8.27)

where H is the Hessian of J with respect to θ, evaluated at θ_0. Solving for the critical point of this approximation gives the Newton parameter update rule:

    θ* = θ_0 − H^{-1} ∇_θ J(θ_0).    (8.28)

For a locally quadratic function with positive definite H, Newton's method jumps directly to the minimum by rescaling the gradient by H^{-1}; if the objective is convex but not quadratic, the update can be iterated, yielding algorithm 8.8.

Algorithm 8.8 Newton's method with objective J(θ) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))
Require: Initial parameter θ_0
Require: Training set of m examples
while stopping criterion not met do
    Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute Hessian: H ← (1/m) ∇²_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute Hessian inverse: H^{-1}
    Compute update: ∆θ = −H^{-1} g
    Apply update: θ = θ + ∆θ
end while

Iterating the update of equation 8.28 is appropriate only while the Hessian remains positive definite. In deep learning the surface of the objective is typically nonconvex, with saddle points (section 8.2.3) that are problematic for Newton's method: if the eigenvalues of the Hessian are not all positive, the update can move in the wrong direction, toward a saddle point. This can be avoided by regularizing the Hessian; a common strategy adds a constant α along its diagonal:

    θ* = θ_0 − [H(f(θ_0)) + αI]^{-1} ∇_θ f(θ_0).    (8.29)

This regularization is used in approaches such as the Levenberg-Marquardt algorithm (Levenberg, 1944; Marquardt, 1963) and works fairly well as long as the negative eigenvalues are close to zero. When curvature is strongly negative, α must be large enough to offset it; but then the αI term dominates the Hessian, and the direction degenerates toward the plain gradient with a step size of 1/α, which may be smaller than a well-chosen learning rate would give.

Beyond saddle points, Newton's method is impractical for large networks because of computational cost: with k parameters, the Hessian is a k × k matrix, so inverting it requires O(k³) time, and the Hessian changes at every iteration. Only networks with very small parameter counts can realistically be trained this way, so the remainder of this section discusses alternatives that gain some of the advantages of Newton's method at lower cost.
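A sketch of a damped Newton step (eq. 8.29) on a small quadratic; for deep networks this is impractical, which motivates the methods below:

import numpy as np

def damped_newton_step(grad, hessian, theta, alpha=1e-2):
    """One regularized Newton update: theta - (H + alpha*I)^{-1} g."""
    H = hessian(theta)
    g = grad(theta)
    return theta - np.linalg.solve(H + alpha * np.eye(len(theta)), g)

# Quadratic J(theta) = 0.5 theta^T A theta with an ill-conditioned but positive definite A.
A = np.diag([50.0, 1.0])
theta = np.array([1.0, 1.0])
theta = damped_newton_step(lambda th: A @ th, lambda th: A, theta)
print(theta)   # a single step lands near the minimum at the origin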
8.6.2 Conjugate Gradients

Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending along conjugate directions. The inspiration comes from a weakness of the method of steepest descent (section 4.3), where a line search is applied repeatedly along the direction of the gradient. Because each line search terminates where the new gradient is orthogonal to the search direction (∇_θ J(θ) · d_{t−1} = 0 at the line minimum), successive directions are orthogonal, and the descent path follows an inefficient back-and-forth zig-zag, as illustrated in figure 8.6: each new direction partly undoes the progress made along the previous one.

[Figure 8.6: The method of steepest descent applied to a quadratic bowl (compare figure 4.6). Each line search direction, being orthogonal to the previous one, produces a characteristic zig-zag trajectory (contour plot, axes roughly -30 to 20).]

In the method of conjugate gradients, we seek a search direction that is conjugate to the previous line search direction, that is, one that will not undo progress made in that direction. At training iteration t, the next search direction takes the form

    d_t = ∇_θ J(θ) + β_t d_{t−1},    (8.30)

where the coefficient β_t controls how much of the direction d_{t−1} to add back. Two directions d_t and d_{t−1} are defined as conjugate if d_t⊤ H d_{t−1} = 0, where H is the Hessian. Imposing conjugacy directly would require eigendecompositions of H, defeating the purpose; fortunately, β_t can be computed without the Hessian. The two most popular choices are:

1. Fletcher-Reeves:
    β_t = ( ∇_θ J(θ_t)⊤ ∇_θ J(θ_t) ) / ( ∇_θ J(θ_{t−1})⊤ ∇_θ J(θ_{t−1}) )    (8.31)

2. Polak-Ribière:
    β_t = ( (∇_θ J(θ_t) − ∇_θ J(θ_{t−1}))⊤ ∇_θ J(θ_t) ) / ( ∇_θ J(θ_{t−1})⊤ ∇_θ J(θ_{t−1}) )    (8.32)

For a quadratic surface, conjugate directions ensure that the gradient along the previous direction does not increase in magnitude, so we stay at the minimum along each previous direction, and in a k-dimensional parameter space only k line searches are needed to reach the minimum. The procedure is given as algorithm 8.9.

Algorithm 8.9 The conjugate gradient method
Require: Initial parameters θ_0
Require: Training set of m examples
Initialize ρ_0 = 0, g_0 = 0, t = 1
while stopping criterion not met do
    Initialize the gradient g_t = 0
    Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute β_t = ((g_t − g_{t−1})⊤ g_t) / (g_{t−1}⊤ g_{t−1})    (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example when t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ϵ* = argmin_ϵ (1/m) Σ_{i=1}^{m} L(f(x^(i); θ_t + ϵρ_t), y^(i))
    (On a truly quadratic cost function, solve for ϵ* analytically rather than searching.)
    Apply update: θ_{t+1} = θ_t + ϵ* ρ_t
    t ← t + 1
end while

Although designed for quadratic objectives, the method extends with minor modifications to neural network cost functions as nonlinear conjugate gradients: without the guarantee a quadratic objective provides, the conjugate directions no longer stay at the minimum along previous directions, so the algorithm includes occasional resets, restarting the line search along the unaltered gradient. Practitioners report reasonable results training neural networks this way, though it is often beneficial to initialize with a few iterations of SGD. Minibatch versions have also been used successfully (Le et al., 2011), and adaptations such as scaled conjugate gradients (Moller, 1993) are specific to neural networks.
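A compact sketch of nonlinear conjugate gradients with the Polak-Ribière choice of eq. 8.32; the crude backtracking line search here stands in for a proper one:

import numpy as np

def nonlinear_cg(f, grad, theta, iters=50):
    """Nonlinear conjugate gradients (algorithm 8.9, Polak-Ribiere variant)."""
    g_prev = grad(theta)
    rho = -g_prev
    for t in range(iters):
        eps = 1.0                         # crude backtracking line search along rho
        while f(theta + eps * rho) > f(theta) and eps > 1e-12:
            eps *= 0.5
        theta = theta + eps * rho
        g = grad(theta)
        beta = max(0.0, (g - g_prev) @ g / max(g_prev @ g_prev, 1e-30))  # reset if negative
        rho = -g + beta * rho
        g_prev = g
    return theta

A = np.diag([50.0, 1.0])
f = lambda th: 0.5 * th @ A @ th
print(nonlinear_cg(f, lambda th: A @ th, np.array([1.0, 1.0])))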
8.6.3 BFGS

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm attempts to gain some of the advantages of Newton's method without its computational burden. In that respect BFGS is similar to conjugate gradients, but it takes a more direct approach to approximating the Newton update

    θ* = θ_0 − H^{-1} ∇_θ J(θ_0),    (8.33)

where H is the Hessian of J with respect to θ, evaluated at θ_0. The primary difficulty is the cost of computing H^{-1}. Quasi-Newton methods, of which BFGS is the most prominent, approximate H^{-1} with a matrix M_t that is refined iteratively by low-rank updates; see Luenberger (1984) for the exact derivation. Once the approximation M_t is available, the search direction is ρ_t = M_t g_t, and a line search along this direction determines the step size ϵ*, giving the update

    θ_{t+1} = θ_t + ϵ* ρ_t.    (8.34)

Compared with conjugate gradients, BFGS spends less time refining each line search, because the line searches need not be performed so exactly. On the other hand, BFGS must store the matrix M, requiring O(n²) memory, which makes it impractical for modern deep models with millions of parameters.

Limited memory BFGS (L-BFGS) removes this memory cost by computing the approximation M implicitly: it assumes that M^(t−1) is the identity matrix and stores only a handful of recent update vectors, requiring O(n) memory per stored vector. Used with exact line searches, the L-BFGS directions are conjugate, and the method remains well behaved even with only approximate line searches. The strategy can be generalized to include more information about the Hessian by storing some of the vectors used to update M at each step, at a cost of only O(n) per step.
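In practice one rarely implements BFGS by hand; the sketch below uses SciPy's L-BFGS implementation on the same toy quadratic:

import numpy as np
from scipy.optimize import minimize

A = np.diag([50.0, 1.0])
f = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="L-BFGS-B")
print(result.x, result.nit)   # near the origin in a handful of iterations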
8.7 Optimization Strategies and Meta-Algorithms

Many optimization techniques are not exactly algorithms but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.

8.7.1 Batch Normalization

Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep neural networks, and it is actually not an optimization algorithm at all: it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve the composition of several functions, or layers. The gradient tells how to update each parameter under the assumption that the other layers do not change; in practice we update all layers simultaneously, and unexpected results can occur because many composed functions change at once. As a simple example, consider a deep network with one unit per layer and no nonlinearity: ŷ = x w1 w2 w3 · · · wl. Here w_i is the weight of layer i, and the output ŷ is a linear function of the input x but a nonlinear function of the weights w_i, via the hidden activations h_i = h_{i−1} w_i. Suppose the gradient g = ∇_w ŷ has been computed and we wish to decrease ŷ slightly, say by 0.1; a first-order Taylor approximation predicts that the update w ← w − ϵg decreases ŷ by ϵ g⊤g, so we could set the learning rate to 0.1 / (g⊤g). However, the actual update includes higher-order effects, since afterward the network computes

    ŷ = x (w1 − ϵg1)(w2 − ϵg2) · · · (wl − ϵgl).    (8.35)

An example of a second-order term arising from this update is ϵ² g1 g2 Π_{i=3}^{l} w_i, which may be negligible if Π_{i=3}^{l} w_i is small, or exponentially large if the weights of layers 3 through l are all greater than 1. This makes it very hard to choose a learning rate, because the effect of an update to one layer's parameters depends so strongly on all the other layers. Second-order methods address this by accounting for pairwise interactions, but in very deep networks even higher-order interactions (of order n > 2) are significant, and full n-th order methods are hopeless; batch normalization offers a much cheaper alternative.

Batch normalization can be applied to any input or hidden layer. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix with the activations for each example in a row. To normalize H, we replace it with

    H' = (H − µ) / σ,    (8.36)

where µ is a vector containing the mean of each unit and σ is a vector of each unit's standard deviation. The arithmetic is based on broadcasting: µ and σ are applied to every row of H, so within each row the normalization of H_{i,j} uses µ_j and σ_j. The rest of the network then operates on H' exactly as the original network operated on H. At training time,

    µ = (1/m) Σ_i H_{i,:}    (8.37)

and

    σ = √( δ + (1/m) Σ_i (H − µ)²_i ),    (8.38)

where δ is a small positive value, such as 10^{-8}, imposed to avoid the undefined gradient of √z at z = 0. Crucially, we back-propagate through these operations for computing the mean and standard deviation, and through the normalization itself. This means the gradient can never propose an update whose only effect is to change the mean or standard deviation of the activations: the normalization removes those statistics, and with them the gradient's incentive to manipulate them. At test time, µ and σ may be replaced by running averages collected during training, so the model can be evaluated on a single example.

Revisiting the example ŷ = x w1 w2 · · · wl, most of its difficulties can be resolved by normalizing h_{l−1}. Suppose x is drawn from a unit Gaussian; then h_{l−1} is also Gaussian, since the network is linear in x, though no longer with zero mean and unit variance, because the lower layers scale and shift it. After batch normalization we obtain ĥ_{l−1}, which restores zero mean and unit variance, and the output becomes

    ŷ = w_l ĥ_{l−1}.

Now learning is simple: the lower-layer parameters mostly have no effect, since their changes to the mean and variance of h_{l−1} are normalized away, and the statistics of the output are controlled by w_l alone. In the original model, the output statistics depended on complicated interactions among all layers' weights; in the normalized model the gradient of w_l controls them directly. The lower layers are not useless in general networks, where nonlinearities let them change higher-order statistics and the relationship between x and y; in this purely linear example they simply become unnecessary, because a zero-mean, unit-variance ĥ_{l−1} standardizes the only statistics a linear path can alter.

Because the normalization can reduce the expressive power of a unit, it is common to replace the batch of hidden activations H not with the normalized H' but with γH' + β, where γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance this may seem useless: why set the mean to 0 and then reintroduce a parameter β that sets it back to any value? The answer is that the new parametrization has different learning dynamics: in the old parametrization the mean of H was determined by a complicated interaction among all the layers below, whereas in the new parametrization the mean of γH' + β is determined by β alone, which is much easier to learn with gradient descent.

Most neural network layers take the form φ(XW + b), where φ is a fixed nonlinearity. It is natural to ask whether batch normalization should be applied to the input X or to the transformed value XW + b; Ioffe and Szegedy (2015) recommend the latter. More precisely, XW + b should be replaced by a normalized version of XW: the bias b should be omitted, because it becomes redundant with the β parameter of the reparametrization. The input to a layer is usually the output of a preceding nonlinearity such as ReLU, so its statistics are more non-Gaussian and less amenable to standardization by linear operations. In convolutional networks, described in chapter 9, it is important to apply the same normalizing µ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of location.
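A sketch of the training-time batch normalization transform of equations 8.36 to 8.38, including the learned γ and β:

import numpy as np

def batch_norm(H, gamma, beta, delta=1e-8):
    """Normalize each column (unit) of minibatch H, then rescale and reshift."""
    mu = H.mean(axis=0)                                      # eq. 8.37
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))    # eq. 8.38
    H_norm = (H - mu) / sigma                                # eq. 8.36, broadcast over rows
    return gamma * H_norm + beta                             # learned scale and shift

rng = np.random.default_rng(0)
H = rng.normal(loc=5.0, scale=3.0, size=(64, 10))            # minibatch of 64, 10 units
out = batch_norm(H, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 mean, ~1 std per unit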
8.7.2 Coordinate Descent

In some cases it is possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then with respect to another variable x_j, and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent; more generally, block coordinate descent minimizes with respect to a subset of the variables simultaneously.

Coordinate descent makes the most sense when the variables can be clearly separated into groups that play relatively isolated roles, or when optimization with respect to one group is significantly more efficient than with respect to all of them. A classic example is sparse coding, whose learning problem minimizes

    J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} ( X − W⊤H )²_{i,j},    (8.39)

seeking a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training data X, with the first term encouraging the entries of H to be sparse. The function J is not convex jointly in W and H, but it is convex in W with H fixed and convex in H with W fixed, so block coordinate descent, alternating between the two convex subproblems, gives an effective (though not guaranteed globally optimal) strategy.

Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another, as in the function f(x) = (x1 − x2)² + α(x1² + x2²), where α is a positive constant. The first term encourages the two variables to be close; the second encourages them to be near zero, and the global minimum is at the origin. Newton's method solves the problem in a single step, since it is a positive definite quadratic; coordinate descent with small α makes painfully slow progress, because the (x1 − x2)² term does not allow a single variable to move far from the current value of the other.
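A sketch of exact coordinate descent on the example above; minimizing over x1 with x2 fixed gives x1 = x2/(1 + α), and symmetrically for x2:

import numpy as np

def coordinate_descent(alpha=0.01, sweeps=100):
    """Exact coordinate descent on f(x) = (x1 - x2)^2 + alpha*(x1^2 + x2^2)."""
    x1, x2 = 1.0, -1.0
    for _ in range(sweeps):
        x1 = x2 / (1 + alpha)   # argmin over x1 with x2 held fixed
        x2 = x1 / (1 + alpha)   # argmin over x2 with x1 held fixed
    return x1, x2

print(coordinate_descent(alpha=0.01))  # creeps toward (0, 0): slow for small alpha
print(coordinate_descent(alpha=1.0))   # much faster when the coupling term is weaker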
8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging several points in the trajectory through parameter space visited by an optimization algorithm. If t iterations of gradient descent visit points θ^(1), . . . , θ^(t), the output of the Polyak averaging algorithm is θ̂^(t) = (1/t) Σ_i θ^(i). On some problem classes, such as gradient descent applied to convex problems, this approach has strong convergence guarantees. For neural networks its justification is more heuristic, but it performs well in practice: the intuition is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near its bottom, while the average of the points on either side should be close to the bottom.

On nonconvex problems, the optimization trajectory can be very complicated and visit many different regions, so including points from the distant past, possibly separated from the current point by large barriers, does not seem useful. When applying Polyak averaging to nonconvex problems it is therefore typical to use an exponentially decaying running average:

    θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t).    (8.40)

The running average approach is used in numerous applications; see Szegedy et al. (2015) for a recent example.
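A sketch of the running average of eq. 8.40, wrapped around any optimizer's iterates (the α value is a typical but arbitrary choice):

import numpy as np

def polyak_average(iterates, alpha=0.99):
    """Exponentially decaying running average of parameter iterates (eq. 8.40)."""
    theta_hat = iterates[0]
    for theta in iterates[1:]:
        theta_hat = alpha * theta_hat + (1 - alpha) * theta
    return theta_hat

# Iterates oscillating around the optimum 0.0 average out to something closer to it.
rng = np.random.default_rng(0)
iterates = [np.array([x]) for x in rng.normal(scale=0.5, size=1000)]
print(polyak_average(iterates))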
8.7.4 Supervised Pretraining

Sometimes, directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or if the task is very difficult. It can then be more effective to train a simpler model first and then make it more complex, or to train the model on a simpler task first. Strategies that train simple models on simple tasks before confronting the final problem are collectively known as pretraining.

Greedy algorithms break a problem into many components and solve for the optimal version of each component in isolation. Combining the individually optimal components is not guaranteed to yield an optimal complete solution, but greedy algorithms can be computationally much cheaper than algorithms that solve for the best joint solution, and the quality of a greedy solution is often acceptable. Greedy algorithms may also be followed by a fine-tuning stage, in which a joint optimization algorithm searches for an optimum of the full problem, initialized from the greedy solution.

Pretraining algorithms, and greedy pretraining in particular, are pervasive in deep learning. The strategy of breaking supervised learning into simpler supervised problems is known as greedy supervised pretraining. In the original version (Bengio et al., 2007), each stage consists of a supervised learning task involving only a subset of the layers in the final MLP: each added hidden layer is pretrained as part of a shallow supervised network whose input is the output of the previously trained layers, as illustrated in figure 8.7. Instead of pretraining one layer at a time, Simonyan and Zisserman (2015) pretrain a deep convolutional network (11 weight layers) and then use its first four and last three layers to initialize even deeper networks, whose middle layers are initialized randomly before the new network is jointly trained. Another option, explored by Yu et al. (2010), is to use the outputs of the previously trained MLPs, together with the raw input, as inputs for each added stage.

Why does greedy supervised pretraining help? The hypothesis originally discussed by Bengio et al. (2007) is that it gives better guidance to the intermediate levels of a deep hierarchy; in general, pretraining may help in terms of both optimization and generalization. A related approach extends this idea to transfer learning: Yosinski et al. (2014) pretrain a deep convolutional network with 8 layers of weights on a set of tasks (a subset of the 1,000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first network, with all remaining layers initialized randomly, before training on a different set of tasks. This is an example of transfer learning, discussed further in section 15.2.
[Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007). (a) We start by training a sufficiently shallow architecture: an MLP with one hidden layer h^(1), input weights W^(1) on x, and output weights U^(1) predicting y. (b) Another drawing of the same architecture. (c) We keep only the input-to-hidden layer of the original network and discard the output layer, sending the output of the first hidden layer h^(1) as input to a second supervised MLP with hidden layer h^(2), weights W^(2), and output weights U^(2). (d) Another drawing of step (c), viewed as a single feedforward network. Further layers can be added in the same way.]
Another related line of work is the FitNets approach (Romero et al., 2015). It begins by training a teacher network that has low enough depth and great enough width (number of units per layer) to be easy to train. The teacher then provides hints for training a student network that is much deeper and thinner (11 to 19 layers) and would be difficult to train with SGD under normal circumstances. Besides predicting the final classification target, the student is also trained to predict the value of the teacher's middle hidden layer with its own middle layer, with an additional set of parameters regressing the middle layer of the 5-layer teacher from the middle layer of the deeper student. This extra task provides guidance on how the hidden layers should be used and simplifies optimization. Experimentally, wide shallow networks of comparable size trained directly failed where thin, deep student networks trained with hints succeeded, suggesting that training very deep networks may be less a question of representational capacity than of optimization.
8.7.5 Designing Models to Aid Optimization

To improve optimization, the best strategy is not always to improve the optimization algorithm: many improvements in the optimization of deep models have instead come from designing the models to be easier to optimize. In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm. Most of the advances in neural network learning over the past 30 years have been obtained by changing the model family rather than the optimization procedure; SGD with momentum, which was already used to train neural networks in the 1980s, remains in use in state-of-the-art applications today.

Specifically, modern neural networks reflect a design choice to use linear transformations between layers and activation functions that are differentiable almost everywhere, with significant slope in much of their domain. Innovations such as the LSTM, rectified linear units and maxout units have all moved toward more linear functions than previous models based on, say, sigmoidal units. Linear functions consistently increase in a single direction, so even when the model's output is very far from correct, the gradient makes clear which direction will reduce the loss; in other words, modern networks are designed so that local gradient information corresponds reasonably well to moving toward a distant solution.

Other model design strategies also make optimization easier. Linear paths or skip connections between layers reduce the shortest path from the lower layers' parameters to the output, mitigating the vanishing gradient problem (Srivastava et al., 2015). A related idea is attaching auxiliary copies of the output to intermediate hidden layers, as in GoogLeNet (Szegedy et al., 2014a) and deeply supervised nets (Lee et al., 2014): these extra "heads" are trained to perform the same task as the primary output, injecting gradient directly into lower layers, and may be discarded once training is complete.
8.7.6 Continuation Methods and Curriculum Learning

As argued in section 8.2.7, many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions. The predominant strategy for overcoming this is to attempt to initialize the parameters in a region connected to the solution by a short path that local descent can discover.

Continuation methods make optimization easier by choosing a series of initial points, ensuring that local optimization spends most of its time in well-behaved regions of space. The idea is to construct a series of objective functions over the same parameters, {J^(0), . . . , J^(n)}, designed to be increasingly difficult, with J^(0) fairly easy to minimize and J^(n), the most difficult, being the true cost function J(θ). We minimize J^(0) first, then use each solution as the initial point for minimizing J^(i+1). Traditional continuation methods, predating their use in neural networks, are usually based on smoothing the objective function; see Wu (1997) for an example and review. The idea is closely related to simulated annealing, which adds noise to the parameters. Continuation methods have been extremely successful in recent years; see Mobahi and Fisher (2015) for an overview of the recent literature, especially as it pertains to AI.

Traditionally, these methods were designed with the goal of ensuring that each local minimum of the easier objectives leads toward the global minimum of the hardest. A standard way to define a family of easier cost functions is blurring:

    J^(i)(θ) = E_{θ'∼N(θ'; θ, σ^(i)²)} J(θ'),    (8.41)

approximated in practice by sampling. The intuition is that some nonconvex functions become approximately convex when blurred, and that the blurred version can preserve enough information about the location of the global minimum that it can be found by solving progressively less blurred versions of the problem. This approach can break down in three ways: the function might never become convex no matter how much it is blurred (consider J(θ) = −θ⊤θ); it might become convex but with a minimum that tracks to a local rather than global minimum of the original; or it might require so much blurring that the approach is too expensive to be useful. (NP-hard optimization problems remain NP-hard, so continuation methods cannot always succeed.) Nonetheless, continuation methods applied to neural networks can improve the local structure along the optimization path even when they do not find a global minimum, making local updates easier or improving the correspondence between local updates and progress toward the final solution.
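A sketch of a Monte-Carlo version of the blurring approach of eq. 8.41: optimize a sequence of smoothed objectives with decreasing σ, warm-starting each stage (the objective and schedule here are made-up illustrations):

import numpy as np

rng = np.random.default_rng(0)

def smoothed_grad(grad_fn, theta, sigma, n_samples=64):
    """Monte-Carlo gradient of E_{theta' ~ N(theta, sigma^2)} J(theta') (eq. 8.41)."""
    noise = rng.normal(scale=sigma, size=(n_samples, len(theta)))
    return np.mean([grad_fn(theta + z) for z in noise], axis=0)

def continuation(grad_fn, theta, sigmas=(2.0, 1.0, 0.5, 0.0), eps=0.02, steps=300):
    for sigma in sigmas:                     # progressively less blurred objectives
        for _ in range(steps):
            g = smoothed_grad(grad_fn, theta, sigma) if sigma > 0 else grad_fn(theta)
            theta = theta - eps * g
    return theta

# Wiggly nonconvex 1-d objective J(x) = x^2 + 2*sin(5x), via its gradient:
grad_fn = lambda th: 2 * th + 10 * np.cos(5 * th)
print(continuation(grad_fn, theta=np.array([3.0])))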
Curriculum learning can be interpreted as a continuation method. It is based on the idea of planning a learning process to begin with simple concepts and progress to more complex concepts that depend on the simpler ones. This strategy has a long history in animal training, where it is called shaping (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009), and was anticipated in machine learning as well (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009) justified it as a continuation method: the earlier objectives J^(i) are made easier by increasing the influence of simpler examples, either by assigning them larger coefficients in the cost function or by sampling them more frequently. Curriculum learning has been successful on a wide range of natural language tasks (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer vision tasks (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013), and was verified to be consistent with how humans teach (Khan et al., 2011): teachers start by showing easier and more prototypical examples, then help the learner refine the decision surface with the less obvious cases. Curriculum strategies that are effective for teaching humans are not necessarily the same as those effective for machines (Basu and Christensen, 2013). An important contribution of Zaremba and Sutskever (2014) is the observation that a stochastic curriculum, in which a random mix of easy and difficult examples is always presented with the average proportion of difficult examples gradually increasing, outperformed a deterministic curriculum on their recurrent-network task, where the deterministic curriculum showed no improvement over the baseline.