Deep Learning: Chapter 10 — Sequence Modeling: Recurrent and Recursive Nets
(From the Japanese translation of Deep Learning by the matsuolab translation project, rendered here in English.)
Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data. Much as a convolutional network is specialized for processing a grid of values X such as an image, a recurrent neural network is specialized for processing a sequence of values x^{(1)}, …, x^{(τ)}. Just as convolutional networks can readily scale to images with large width and height, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization, and most recurrent networks can also process sequences of variable length.

To go from multilayer networks to recurrent networks, we need to take advantage of an idea found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model. Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and different positions in time.

Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. For example, consider the two sentences "I went to Nepal in 2009" and "In 2009, I went to Nepal." If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information whether it appears in the sixth word or the second word of the sentence. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.

A related idea is the use of convolution across a 1-D temporal sequence, the basis for time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). Convolution allows a network to share parameters across time, but is shallow: each member of the output is a function of a small number of neighboring members of the input. Recurrent networks share parameters in a different way: each member of the output is a function of the previous members of the output, produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

For simplicity, we refer to RNNs as operating on a sequence that contains vectors x^{(t)} with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length τ for each member of the minibatch, and the time step index need not literally refer to the passage of time in the real world.

This chapter extends the idea of a computational graph to include cycles, which represent the influence of the present value of a variable on its own value at a future time step. Such computational graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train and use them. For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).
10.1 Unfolding Computational Graphs

Section 6.5.1 introduced the idea of a computational graph. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

s^{(t)} = f(s^{(t−1)}; θ),   (10.1)

where s^{(t)} is called the state of the system. Equation 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t−1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ−1 times. For example, if we unfold equation 10.1 for τ = 3 time steps, we obtain

s^{(3)} = f(s^{(2)}; θ) = f(f(s^{(1)}; θ); θ).   (10.2, 10.3)

Unfolding the equation by repeatedly applying the definition in this way has yielded an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph.

[Figure 10.1: The classical dynamical system described by equation 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t (…, s^{(t−1)}, s^{(t)}, s^{(t+1)}, …), and the function f maps the state at time t to the state at t+1. The same parameters θ (the same f) are used for all time steps.]

As another example, let us consider a dynamical system driven by an external signal x^{(t)}:

s^{(t)} = f(s^{(t−1)}, x^{(t)}; θ),   (10.4)

where we see that the state now contains information about the whole past sequence.
Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

We can define the hidden units of a recurrent network using equation 10.4 or a similar equation. To indicate that the state is the hidden units of the network, we now rewrite it using the variable h to represent the state:

h^{(t)} = f(h^{(t−1)}, x^{(t)}; θ),   (10.5)

illustrated in figure 10.2. Typical RNNs will add extra architectural features, such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h^{(t)} as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary-length sequence (x^{(t)}, x^{(t−1)}, x^{(t−2)}, …, x^{(2)}, x^{(1)}) to a fixed-length vector h^{(t)}. Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than others. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to store all the information in the input sequence up to time t, but only enough information to predict the rest of the sentence. The most demanding situation is when we ask h^{(t)} to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (chapter 14).

[Figure 10.2: A recurrent network with no outputs. This network just processes the information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram, where a black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.]

Equation 10.5 can be drawn in two different ways, as figure 10.2 shows: as a circuit diagram with recurrent connections, or as an unfolded computational graph in which each component is represented by many different variables, one per time step. We can express the unfolded recurrence after t steps with a function g^{(t)}:

h^{(t)} = g^{(t)}(x^{(t)}, x^{(t−1)}, x^{(t−2)}, …, x^{(2)}, x^{(1)})   (10.6)
= f(h^{(t−1)}, x^{(t)}; θ).   (10.7)

The function g^{(t)} takes the whole past sequence (x^{(t)}, x^{(t−1)}, …, x^{(1)}) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^{(t)} into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another state, rather than in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^{(t)} for all possible time steps. Learning a single shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform, and helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.

10.2 Recurrent Neural Networks

Armed with the graph-unrolling and parameter-sharing ideas of section 10.1, we can design a wide variety of recurrent neural networks. Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4.
• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5.
10.8 10.3
10.8
RNN (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996) RNN RNN
1
Siegelmann and Sontag (1995)
886
RNN 10.3
RNN RNN 358
[Figure 10.3: The computational graph for an RNN that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities, and the loss internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. (Left) Circuit diagram. (Right) The same network as an unfolded computational graph, where each node is now associated with one particular time instance.]

We now develop the forward propagation equations for the RNN of figure 10.3. The figure does not specify the choice of activation function for the hidden units; here we assume the hyperbolic tangent. We also assume the output is discrete, as if the RNN is used to predict words or characters, so o^{(t)} gives the unnormalized log probabilities of each possible value of the discrete variable, and we can apply the softmax operation to obtain a vector ŷ^{(t)} of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^{(0)}; then, for each time step from t = 1 to t = τ, we apply the update equations

a^{(t)} = b + W h^{(t−1)} + U x^{(t)},   (10.8)
h^{(t)} = tanh(a^{(t)}),   (10.9)
o^{(t)} = c + V h^{(t)},   (10.10)
ŷ^{(t)} = softmax(o^{(t)}),   (10.11)

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length.
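To make equations 10.8–10.11 concrete, here is a minimal NumPy sketch of this forward pass (not from the original text; the array names `U`, `W`, `V`, `b`, `c` are chosen to mirror the equations, and the toy dimensions are arbitrary):

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Forward pass for the vanilla RNN of eqs. 10.8-10.11.
    x_seq: list of input vectors x^(t); h0: initial hidden state.
    Returns hidden states and softmax outputs for each time step."""
    h = h0
    hs, y_hats = [], []
    for x in x_seq:
        a = b + W @ h + U @ x          # eq. 10.8
        h = np.tanh(a)                 # eq. 10.9
        o = c + V @ h                  # eq. 10.10
        e = np.exp(o - o.max())        # numerically stable softmax
        y_hat = e / e.sum()            # eq. 10.11
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats

# Toy example: 3 input dims, 4 hidden units, 2 output classes.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 4, 2
U = rng.normal(scale=0.1, size=(n_h, n_in))
W = rng.normal(scale=0.1, size=(n_h, n_h))
V = rng.normal(scale=0.1, size=(n_out, n_h))
b, c = np.zeros(n_h), np.zeros(n_out)
x_seq = [rng.normal(size=n_in) for _ in range(5)]
hs, y_hats = rnn_forward(x_seq, np.zeros(n_h), U, W, V, b, c)
print(y_hats[-1])  # distribution over the 2 classes at t = tau
```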
[Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^{(t)}, the hidden-layer activations are h^{(t)}, the outputs are o^{(t)}, the targets are y^{(t)}, and the loss is L^{(t)}. (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by figure 10.3. The RNN of figure 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN of this figure is instead trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward; the previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN of this figure less powerful, but it may be easier to train, because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in section 10.2.1.]
The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^{(t)} is the negative log-likelihood of y^{(t)} given x^{(1)}, …, x^{(t)}, then

L({x^{(1)}, …, x^{(τ)}}, {y^{(1)}, …, y^{(τ)}}) = Σ_t L^{(t)}   (10.12, 10.13)
= −Σ_t log p_model( y^{(t)} | {x^{(1)}, …, x^{(t)}} ),   (10.14)

where p_model( y^{(t)} | {x^{(1)}, …, x^{(t)}} ) is given by reading the entry for y^{(t)} from the model's output vector ŷ^{(t)}.
Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through the unrolled graph of figure 10.3, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step may be computed only after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time (BPTT) and is discussed further in section 10.2.2. The network with recurrence between hidden units is thus very powerful but also expensive to train. Is there an alternative?

10.2.1 Teacher Forcing and Networks with Output Recurrence

The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in figure 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. It requires that the output units capture all the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can then be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground-truth output y^{(t)} as input at time t+1. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

log p( y^{(1)}, y^{(2)} | x^{(1)}, x^{(2)} )   (10.15)
[Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence, producing o^{(τ)} with loss L^{(τ)} against target y^{(τ)}. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as depicted here), or the gradient on the output o^{(τ)} can be obtained by back-propagating from further downstream modules.]

= log p( y^{(2)} | y^{(1)}, x^{(1)}, x^{(2)} ) + log p( y^{(1)} | x^{(1)}, x^{(2)} ).   (10.16)

In this example, we see that at time t = 2, the model is trained to maximize the conditional probability of y^{(2)} given both the x sequence so far and the previous y value from the training set. Maximum likelihood thus specifies that during training, rather than feeding the model's own output back into itself, these connections should be fed with the target values specifying what the correct output should be. This is illustrated in figure 10.6.

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. Teacher forcing may still be applied to models that have hidden-to-hidden connections, as long as they have connections from the output at one time step to values computed in the next time step. As soon as the hidden units become a function of earlier time steps, however, the BPTT algorithm is necessary; some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network's outputs (or samples from the output distribution) fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time, in "free-running" mode.
[Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^{(t)} drawn from the train set as input to h^{(t+1)}. (Right) When the model is deployed, the true output is generally not known; in this case, we approximate the correct output y^{(t)} with the model's output o^{(t)}, and feed the output back into the model.]

One way to mitigate the gap between the inputs seen at train time and the inputs seen at test time is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in the free-running mode) not seen during training, and how to map the state back toward one that will make the network generate proper outputs after a few steps. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at train time and test time randomly chooses between generated values and actual data values as input, exploiting a curriculum learning strategy that gradually uses more of the generated values as input.
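The contrast between the two input regimes is easy to express in code. The following sketch is not from the original text; it is a minimal illustration for an output-recurrence model, where `step` is a hypothetical one-step transition (names `R`, `step`, `run` are all illustrative):

```python
import numpy as np

def step(h, y_prev, x, params):
    """One hypothetical transition of an output-recurrence RNN:
    h^(t) = tanh(b + W h^(t-1) + R y^(t-1) + U x^(t)),
    followed by a softmax output distribution."""
    W, R, U, V, b, c = params
    h = np.tanh(b + W @ h + R @ y_prev + U @ x)
    o = c + V @ h
    e = np.exp(o - o.max())
    return h, e / e.sum()

def run(x_seq, y_seq, h0, y0, params, teacher_forcing):
    h, y_prev, outputs = h0, y0, []
    for t, x in enumerate(x_seq):
        h, y_hat = step(h, y_prev, x, params)
        outputs.append(y_hat)
        # Train time: feed the ground-truth target into the next step.
        # Test (free-running) time: feed back the model's own output.
        y_prev = y_seq[t] if teacher_forcing else y_hat
    return outputs

rng = np.random.default_rng(8)
n_in, n_h, n_out = 3, 4, 2
params = (rng.normal(scale=0.1, size=(n_h, n_h)),    # W
          rng.normal(scale=0.1, size=(n_h, n_out)),  # R
          rng.normal(scale=0.1, size=(n_h, n_in)),   # U
          rng.normal(scale=0.1, size=(n_out, n_h)),  # V
          np.zeros(n_h), np.zeros(n_out))            # b, c
x_seq = [rng.normal(size=n_in) for _ in range(4)]
y_seq = [np.eye(n_out)[rng.integers(n_out)] for _ in range(4)]
print(run(x_seq, y_seq, np.zeros(n_h), np.zeros(n_out), params, True)[-1])
```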
10.2.2 Computing the Gradient in a Recurrent Neural Network

Computing the gradient through a recurrent neural network is straightforward. One simply applies the generalized back-propagation algorithm of section 6.5.6 to the unrolled computational graph; no specialized algorithms are necessary. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based technique to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above (equations 10.8 and 10.12). The nodes of our computational graph include the parameters U, V, W, b and c, as well as the sequence of nodes indexed by t for x^{(t)}, h^{(t)}, o^{(t)} and L^{(t)}. For each node N we need to compute the gradient ∇_N L recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss:

∂L / ∂L^{(t)} = 1.   (10.17)

In this derivation we assume that the outputs o^{(t)} are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the output, and that the loss is the negative log-likelihood of the true target y^{(t)} given the input so far. The gradient on the outputs at time step t, for all i and t, is then

(∇_{o^{(t)}} L)_i = ∂L / ∂o_i^{(t)} = ŷ_i^{(t)} − 1_{i, y^{(t)}}.   (10.18)

We work our way backward, starting from the end of the sequence. At the final time step τ, h^{(τ)} has only o^{(τ)} as a descendant, so its gradient is simple:

∇_{h^{(τ)}} L = V^⊤ ∇_{o^{(τ)}} L.   (10.19)

We can then iterate backward in time to back-propagate gradients through time, from t = τ−1 down to t = 1, noting that h^{(t)} (for t < τ) has as descendants both o^{(t)} and h^{(t+1)}. Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes b, c, V, W and U by summing the contributions made at every time step, taking care that each parameter is shared across all time steps.
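The whole backward recursion fits in a few lines of NumPy. The following is a minimal sketch, not the book's pseudocode, implementing equations 10.17–10.19 plus the parameter gradients for the tanh/softmax RNN above:

```python
import numpy as np

def rnn_bptt(x_seq, y_seq, h0, U, W, V, b, c):
    """BPTT for the RNN of eqs. 10.8-10.11 with softmax outputs and
    negative log-likelihood loss. y_seq holds integer class labels."""
    tau = len(x_seq)
    # Forward pass, storing states for reuse in the backward pass.
    hs, y_hats = [h0], []
    for x in x_seq:
        h = np.tanh(b + W @ hs[-1] + U @ x)      # eqs. 10.8-10.9
        o = c + V @ h                            # eq. 10.10
        e = np.exp(o - o.max())
        hs.append(h)
        y_hats.append(e / e.sum())               # eq. 10.11
    # Backward pass.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)                  # gradient arriving from t+1
    for t in reversed(range(tau)):
        do = y_hats[t].copy()
        do[y_seq[t]] -= 1.0                      # eq. 10.18: yhat - 1_{i,y}
        dh = V.T @ do + dh_next                  # eq. 10.19 when t = tau
        da = (1.0 - hs[t + 1] ** 2) * dh         # back through tanh
        dV += np.outer(do, hs[t + 1]); dc += do
        dW += np.outer(da, hs[t]);     db += da
        dU += np.outer(da, x_seq[t])
        dh_next = W.T @ da                       # propagate to step t-1
    return dU, dW, dV, db, dc
```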
10.2.3 Recurrent Networks as Directed Graphical Models

In the example recurrent network we have developed so far, the losses L^{(t)} were cross-entropies between training targets y^{(t)} and outputs o^{(t)}. As with a feedforward network, it is in principle possible to use almost any loss with a recurrent network, with the loss chosen based on the task. As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with that distribution to define the loss.

When we use a predictive log-likelihood training objective, such as equation 10.12, we train the RNN to estimate the conditional distribution of the next sequence element y^{(t)} given the past inputs. To keep things simple, consider first the case where the RNN models a sequence of random variables Y = {y^{(1)}, …, y^{(τ)}} with no additional inputs x; the input at time step t is simply the output at time step t−1. The RNN then defines a directed graphical model over the y variables.
[Figure 10.8: Introducing the state variable h^{(t)} in the graphical model of the RNN, even though it is a deterministic function of its inputs, helps to see how we can obtain a very efficient parametrization based on equation 10.5. Every stage in the sequence (for h^{(t)} and y^{(t)}) involves the same structure (the same input size for every node) and can share the same parameters with the other stages.]
We parametrize the joint distribution over the observations using the chain rule of conditional probability (section 3.6):

P(Y) = P(y^{(1)}, …, y^{(τ)}) = Π_{t=1}^{τ} P( y^{(t)} | y^{(t−1)}, y^{(t−2)}, …, y^{(1)} ),   (10.31)

where the right-hand side of the bar is empty for t = 1. Hence the negative log-likelihood of a set of values {y^{(1)}, …, y^{(τ)}} according to such a model is

L = Σ_t L^{(t)},   (10.32)
where
L^{(t)} = −log P( y^{(t)} = y^{(t)} | y^{(t−1)}, y^{(t−2)}, …, y^{(1)} ).   (10.33)

The edges in a graphical model indicate which variables depend directly on other variables. Many graphical models aim to achieve statistical and computational efficiency by omitting edges that do not correspond to strong interactions, for example by making the Markov assumption that the graphical model should contain only edges from {y^{(t−k)}, …, y^{(t−1)}} to y^{(t)}. In some cases, however, we believe that every element of the past may have an influence on the next element of the sequence. RNNs are useful when we believe that the distribution over y^{(t)} may depend on a value of y^{(i)} from the distant past in a way that is not captured by the effect of y^{(i)} on y^{(t−1)}.

One way to interpret an RNN as a graphical model is to view it as defining a graphical model whose structure is the complete graph, able to represent direct dependencies between any pair of y values. Interpreted directly over the y values, such a fully connected model would have an ever-growing number of inputs and parameters for each element of the sequence. RNNs obtain the same full connectivity but with an efficient parametrization, as illustrated in figure 10.8: the state h^{(t)}, although a deterministic function of its inputs, mediates the dependence between any y^{(i)} in the past and y^{(t)} in the future. Incorporating the h^{(t)} nodes in the graphical model decouples the past and the future, with h acting as an intermediate quantity between them.
A variable y^{(i)} in the distant past may influence y^{(t)} via its effect on h. The structure of the graph of figure 10.8 shows that the model can be efficiently parametrized by using the same conditional probability distributions at each time step and that, when the variables are all observed, the probability of the joint assignment of all variables can be evaluated efficiently.

To illustrate the efficiency, suppose y can take k different values. A tabular representation of the conditional distributions over contexts of all lengths up to τ would need on the order of k^τ parameters, whereas the RNN, thanks to parameter sharing, requires a number of parameters that is O(1) as a function of sequence length (the number of parameters may still be adjusted through the dimension of h, trading capacity against the amount of sharing). The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. Equivalently, the assumption is that the conditional probability distribution over the variables at time t+1 given the variables at time t is stationary, meaning that the relationship between the previous time step and the next time step does not depend on t. In principle, it would be possible to use t as an extra input at each time step and let the learner discover any time dependence while sharing as much as it can between different time steps.
To complete the view of an RNN as a graphical model, the model must also have some mechanism for determining the length of the sequence when sampling from it. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012). When that symbol is generated, the sampling process stops. In the training set, we insert this symbol as an extra member of each sequence, immediately after its last element.

Another option is to introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step. This approach is more general, because it may be applied to any RNN rather than only to RNNs that output a sequence of symbols. The new output unit is usually a sigmoid unit trained with the cross-entropy loss to maximize the log-probability of the correct prediction as to whether the sequence ends at each time step.

A third way to determine the sequence length τ is to add an extra output to the model that predicts the integer τ itself. The model can sample a value of τ and then sample τ steps worth of data. This approach requires adding an extra input to the recurrent update at each time step, so that the recurrent update is aware of whether it is near the end of the generated sequence; this extra input can consist of either the value of τ or the number of remaining time steps τ − t. Without this extra input, the RNN might generate sequences that end abruptly, such as a sentence that ends before it is complete. This approach is based on the decomposition

P(x^{(1)}, …, x^{(τ)}) = P(τ) P(x^{(1)}, …, x^{(τ)} | τ).   (10.34)

The strategy of predicting τ directly is used, for example, by Goodfellow et al. (2014d).
10.2.4 Modeling Sequences Conditioned on Context with RNNs

In the previous section we described how an RNN could correspond to a directed graphical model over a sequence of random variables y^{(t)} with no inputs x. Of course, our development of RNNs as in equation 10.8 included a sequence of inputs x^{(1)}, x^{(2)}, …, x^{(τ)}. In general, RNNs allow the extension of the graphical model view to represent not only a joint distribution over the y variables but also a conditional distribution over y given x. As discussed in the context of feedforward networks in section 6.2.1.1, any model representing a variable P(y; θ) can be reinterpreted as a model representing a conditional distribution P(y | ω) with ω = θ. We can extend such a model to represent a distribution P(y | x) by using the same P(y | ω) as before but making ω a function of x. In the case of an RNN, this can be achieved in different ways; we review here the most common and obvious choices.

Previously, we have discussed RNNs that take a sequence of vectors x^{(t)} for t = 1, …, τ as input. Another option is to take only a single vector x as input. When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence. Some common ways of providing an extra input to an RNN are:

1. as an extra input at each time step, or
2. as the initial state h^{(0)}, or
3. both.

The first and most common approach is illustrated in figure 10.9. The interaction between the input x and each hidden unit vector h^{(t)} is parametrized by a newly introduced weight matrix R that was absent from the model of only the sequence of y values. The same product x^⊤R is added as an additional input to the hidden units at every time step. We can think of the choice of x as determining the value of x^⊤R, which acts effectively as a new bias parameter for each of the hidden units. The weights remain independent of the input. We can think of this model as taking the parameters θ of the nonconditional model and turning them into ω, where the bias parameters within ω are now a function of the input.
[Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences Y. This RNN is appropriate for tasks such as image captioning, where a single image is used as input to a model that then produces a sequence of words describing the image. Each element y^{(t)} of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step).]
Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors x^{(t)} as input. The RNN described in equation 10.8 corresponds to a conditional distribution P(y^{(1)}, …, y^{(τ)} | x^{(1)}, …, x^{(τ)}) that makes a conditional independence assumption that this distribution factorizes as

Π_t P( y^{(t)} | x^{(1)}, …, x^{(t)} ).   (10.35)

To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t+1, as shown in figure 10.10. The model can then represent arbitrary probability distributions over the y sequence. This kind of model representing a distribution over a sequence given another sequence still has one restriction: the length of both sequences must be the same. We describe how to remove this restriction in section 10.4.
[Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence of x values into a distribution over sequences of y values of the same length. Compared with figure 10.3, this RNN contains connections from the previous output to the current state. These connections allow this RNN to model an arbitrary distribution over sequences of y given sequences of x of the same length. The RNN of figure 10.3 is only able to represent distributions in which the y values are conditionally independent from each other given the x values.]
10.3 Bidirectional RNNs

All the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time t captures only information from the past, x^{(1)}, …, x^{(t−1)}, and the present input x^{(t)} (some of the models also incorporate information from past y values, when those are available). In many applications, however, we want to output a prediction of y^{(t)} that may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and potentially even on the next few words because of the linguistic dependencies between nearby words.

Bidirectional recurrent neural networks were invented to address that need (Schuster and Paliwal, 1997). They have been extremely successful in applications where that need arises, such as handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999); see also Graves (2012).

As the name suggests, a bidirectional RNN combines an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. Figure 10.11 illustrates the typical bidirectional RNN, with h^{(t)} standing for the state of the sub-RNN that moves forward through time and g^{(t)} standing for the state of the sub-RNN that moves backward through time. This allows the output units o^{(t)} to compute a representation that depends on both the past and the future, but is most sensitive to the input values around time t, without having to specify a fixed-size window around t.
[Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L^{(t)} at each step t. The h recurrence propagates information forward in time (toward the right), while the g recurrence propagates information backward in time (toward the left). Thus at each point t, the output units o^{(t)} can benefit from a relevant summary of the past in the h^{(t)} input and from a relevant summary of the future in the g^{(t)} input.]

This idea can be naturally extended to two-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point (i, j) of a 2-D grid, an output O_{i,j} could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information. Compared with a convolutional network, RNNs applied to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution to compute the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
10.4 Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in figure 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in figure 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in figures 10.3, 10.4, 10.10 and 10.11 how an RNN can map an input sequence to an output sequence of the same length. Here we discuss how an RNN can be trained to map an input sequence to an output sequence that is not necessarily of the same length.

[Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y^{(1)}, …, y^{(n_y)}) given an input sequence (x^{(1)}, x^{(2)}, …, x^{(n_x)}). It is composed of an encoder RNN that reads the input sequence as well as a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C, which represents a semantic summary of the input sequence and is given as input to the decoder RNN.]
This setting comes up often in applications such as speech recognition, machine translation and question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related). We often call the input to the RNN the "context." We want to produce a representation of this context, C. The context C might be a vector or a sequence of vectors that summarize the input sequence X = (x^{(1)}, …, x^{(n_x)}).

The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this the encoder-decoder and the sequence-to-sequence architecture, shown in figure 10.12. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence and emits the context C, usually as a simple function of its final hidden state; (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just as in figure 10.9) to generate the output sequence Y = (y^{(1)}, …, y^{(n_y)}).

The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of log P( y^{(1)}, …, y^{(n_y)} | x^{(1)}, …, x^{(n_x)} ) over all the pairs of x and y sequences in the training set. The last state h^{(n_x)} of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.

If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in section 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input: the input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step; these two options can also be combined. There is no constraint that the encoder must have the same size of hidden layer as the decoder.

One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector, and additionally introduced an attention mechanism that learns to associate elements of the sequence C with elements of the output sequence. See section 12.4.5.1 for more details.
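To make the two-stage structure concrete, here is a minimal sketch of the encoder-decoder idea, assuming greedy decoding and a simple tanh RNN for both halves. Everything here (parameter names `enc`, `dec`, the embedding matrix `E`, the toy sizes) is an illustrative assumption, not the architecture of any cited paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, n_vocab = 3, 8, 5

# Hypothetical parameters; in practice both RNNs are trained jointly
# to maximize log P(y | x) over the training pairs.
enc = {"U": rng.normal(scale=0.1, size=(n_h, n_in)),
       "W": rng.normal(scale=0.1, size=(n_h, n_h))}
dec = {"E": rng.normal(scale=0.1, size=(n_h, n_vocab)),  # embeds y^(t-1)
       "W": rng.normal(scale=0.1, size=(n_h, n_h)),
       "V": rng.normal(scale=0.1, size=(n_vocab, n_h))}

def encode(x_seq):
    h = np.zeros(n_h)
    for x in x_seq:                      # read the whole input sequence
        h = np.tanh(enc["U"] @ x + enc["W"] @ h)
    return h                             # context C = final hidden state

def decode(C, max_len=10, end_token=0):
    h, y, out = C, end_token, []         # C initializes the decoder state
    for _ in range(max_len):
        y_onehot = np.eye(n_vocab)[y]
        h = np.tanh(dec["E"] @ y_onehot + dec["W"] @ h)
        o = dec["V"] @ h
        y = int(np.argmax(o))            # greedy choice of next token
        out.append(y)
        if y == end_token:               # output length n_y need not be n_x
            break
    return out

print(decode(encode([rng.normal(size=n_in) for _ in range(4)])))
```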
10.5 Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of figure 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation, representable by a single layer within a deep MLP.

Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so, and agrees with the idea that we need enough depth to perform the required mappings. See also Schmidhuber (1992), El Hihi and Bengio (1996) and Jaeger (2007a) for earlier work on deep RNNs.

Graves et al. (2013) were the first to show a significant benefit of decomposing the state of an RNN into multiple layers, as in figure 10.13a. We can think of the lower layers in the hierarchy depicted in that figure as playing a role in transforming the raw input into a representation that is more appropriate for the higher levels of the hidden state. Pascanu et al. (2014a) go a step further and propose to have a separate MLP (possibly deep) for each of the three blocks enumerated above, as illustrated in figure 10.13b. Considerations of representational capacity suggest allocating enough capacity in each of these three steps, but doing so by adding depth may hurt learning by making optimization difficult: in general, it is easier to optimize shallower architectures, and adding the extra depth of figure 10.13b makes the shortest path from a variable in time step t to a variable in time step t+1 longer. This can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in figure 10.13c.

[Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu et al., 2014a). (a) The hidden recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts; this may lengthen the shortest path linking different time steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.]
10.6 Recursive Neural Networks

Recursive neural networks*2 represent yet another generalization of recurrent networks, with a different kind of computational graph, structured as a deep tree rather than the chain-like structure of RNNs. A typical computational graph for a recursive network is illustrated in figure 10.14. Recursive neural networks were introduced by Pollack (1990), and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).

[Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. A variable-size sequence x^{(1)}, x^{(2)}, …, x^{(t)} can be mapped to a fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U, V, W). The figure illustrates a supervised learning case in which a target y is provided that is associated with the whole sequence, with loss L.]

One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies. An open question is how to best structure the tree. One option is to have a tree structure that does not depend on the data, such as a balanced binary tree. In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).

Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation; for example, Socher et al. (2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors.

*2 We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with "recurrent neural network."
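The O(log τ) depth claim is easy to see in code. The following is a minimal sketch, assuming a balanced pairwise reduction with one shared composition function (the names `compose` and `reduce_tree` and the weight shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6  # representation size

# One shared weight matrix pair applied at every internal tree node.
W_l = rng.normal(scale=0.3, size=(n, n))
W_r = rng.normal(scale=0.3, size=(n, n))
b = np.zeros(n)

def compose(left, right):
    """Merge two child representations into a parent representation."""
    return np.tanh(W_l @ left + W_r @ right + b)

def reduce_tree(leaves):
    """Combine a sequence pairwise, halving its length at each level,
    so a sequence of length tau is reduced in O(log tau) depth."""
    while len(leaves) > 1:
        nxt = [compose(leaves[i], leaves[i + 1])
               for i in range(0, len(leaves) - 1, 2)]
        if len(leaves) % 2:              # carry an odd leftover upward
            nxt.append(leaves[-1])
        leaves = nxt
    return leaves[0]                     # fixed-size representation

x_seq = [rng.normal(size=n) for _ in range(8)]
print(reduce_tree(x_seq))                # 8 leaves -> 3 levels of compose
```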
10.7 The Challenge of Long-Term Dependencies

The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in section 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other references provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013). In this section, we describe the problem in more detail; the remaining sections describe approaches to overcoming the problem.

Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in figure 10.15. In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

h^{(t)} = W^⊤ h^{(t−1)}   (10.36)

as a very simple recurrent neural network lacking a nonlinear activation function and lacking inputs x. As described in section 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

h^{(t)} = (W^t)^⊤ h^{(0)},   (10.37)

and if W admits an eigendecomposition of the form

W = QΛQ^⊤   (10.38)

with orthogonal Q, the recurrence may be simplified further to
[Figure 10.15: Repeated function composition. When composing many nonlinear functions (like the linear-tanh layer shown here), the result is highly nonlinear, typically with most of the values associated with a tiny derivative, some values with a large derivative, and many alternations between increasing and decreasing. Here, we plot a linear projection of a 100-dimensional hidden state down to a single dimension, plotted on the vertical axis ("Projection of output") against the input coordinate along a random direction in the 100-dimensional space (horizontal axis, "Input coordinate"). We can thus view this plot as a linear cross-section of a high-dimensional function; the curves show the function after different numbers of compositions of the transition function.]

h^{(t)} = Q^⊤ Λ^t Q h^{(0)}.   (10.39)
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h^{(0)} that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. If we instead make a nonrecurrent network that has a different weight w^{(t)} at each time step, however, the situation is different. If the initial state is given by 1, then the state at time t is given by Π_t w^{(t)}. Suppose that the w^{(t)} values are generated randomly, independently from one another, with zero mean and variance v. The variance of the product is O(v^n). To obtain some desired variance v* we may choose the individual weights with variance v = (v*)^{1/n}. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One might hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction, meaning that it can take a very long time to learn long-term dependencies, because the signal about these dependencies will tend to be hidden by the smallest fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we increase the span of the dependencies that need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20. For a deeper treatment of recurrent networks as dynamical systems, see Doya (1993), Bengio et al. (1994), Siegelmann and Sontag (1995) and Pascanu et al. (2013).

The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning. One such approach is the echo state network (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b), described next.
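The behavior of equations 10.36–10.39 can be checked numerically. The following small demonstration (an illustrative sketch, not from the original text; the seed and dimensions are arbitrary) scales a symmetric random matrix to a chosen spectral radius and watches ‖h^{(t)}‖ decay or explode:

```python
import numpy as np

# Numerical illustration of eqs. 10.36-10.39: repeatedly applying the
# same linear map scales each eigendirection by lambda^t.
rng = np.random.default_rng(3)
n = 10
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                 # symmetric, so W = Q Lambda Q^T is real
W *= 0.9 / np.max(np.abs(np.linalg.eigvalsh(W)))  # spectral radius 0.9

h = rng.normal(size=n)
for t in (1, 10, 100):
    ht = np.linalg.matrix_power(W.T, t) @ h       # h^(t) = (W^t)^T h^(0)
    print(t, np.linalg.norm(ht))  # norm decays roughly like 0.9^t

# With spectral radius 1.1 instead, the same product explodes:
W *= 1.1 / 0.9
print(np.linalg.norm(np.linalg.matrix_power(W.T, 100) @ h))
```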
10.8 Echo State Networks

The recurrent weights mapping from h^{(t−1)} to h^{(t)} and the input weights mapping from x^{(t)} to h^{(t)} are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and to learn only the output weights. This is the idea behind echo state networks, or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b), as well as liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009), to denote the fact that the hidden units form a reservoir of temporal features that may capture different aspects of the history of inputs.

One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary-length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h^{(t)}), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then easily be designed to be convex as a function of the output weights.

The important question is: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability (Jaeger, 2003).

The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As explained in section 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians

J^{(t)} = ∂s^{(t)} / ∂s^{(t−1)}.

Of particular importance is the spectral radius of J^{(t)}, defined to be the maximum of the absolute values of its eigenvalues.

To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix J that does not change with t. This happens, for example, when the network is purely linear. Suppose that J has an eigenvector v with corresponding eigenvalue λ. Consider what happens as we propagate a gradient vector backward through time. If we begin with a gradient vector g, then after one step of back-propagation we will have Jg, and after n steps we will have J^n g. Now consider what happens if we instead back-propagate a perturbed version of g, beginning with g + δv. After n steps, we will have J^n (g + δv). From this we see that the two executions have diverged by δ J^n v = δ λ^n v after n steps of back-propagation; that is, the separation grows as δ|λ|^n. When |λ| > 1, the deviation size δ|λ|^n grows exponentially large; when |λ| < 1, it becomes exponentially small.

Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state h^{(t+1)} = (h^{(t)})^⊤ W. When a linear map W^⊤ always shrinks h as measured by the L2 norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from h^{(t)} to h^{(t+1)} is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the past when we use a finite level of precision (such as 32 bits) to store the state vector.

The Jacobian matrix tells us how a small change of h^{(t)} propagates one step forward, or equivalently, how the gradient on h^{(t+1)} propagates one step backward, during back-propagation. Note that neither W nor J need to be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian were applied iteratively). What matters is what happens to the magnitude (complex absolute value) of the possibly complex-valued basis coefficients of h^{(t)}, or of a small variation of h^{(t)}, when we multiply the matrix by the vector: an eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively), and a magnitude smaller than one to shrinking (exponential decay).

With a nonlinear map, the Jacobian is free to change at each step, so the dynamics become more complicated. It remains true, however, that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that back-propagation can retain unbounded dynamics even when forward propagation has bounded dynamics, since back-propagation linearizes around the forward trajectory.

The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in section 8.4.
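A minimal reservoir-computing sketch follows (an illustration, not from the original text; the toy "recall the input from 3 steps ago" task, the spectral radius of 1.2, the washout of 50 steps and all names are assumptions). Only the linear readout is fit, which makes the training problem convex:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_res, T = 1, 50, 300

# Fixed random input and reservoir weights; only the readout is learned.
U = rng.normal(scale=0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))   # set spectral radius

x = rng.uniform(-1, 1, size=(T, n_in))
target = np.roll(x[:, 0], 3)      # toy task: recall the input 3 steps back

h = np.zeros(n_res)
H = np.empty((T, n_res))
for t in range(T):
    h = np.tanh(U @ x[t] + W @ h)  # reservoir states are never trained
    H[t] = h

# Convex problem: fit the linear readout weights by least squares,
# discarding an initial "washout" period of transient states.
w_out, *_ = np.linalg.lstsq(H[50:], target[50:], rcond=None)
print(np.mean((H[50:] @ w_out - target[50:]) ** 2))  # small test error
```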
10.9 Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.

10.9.1 Adding Skip Connections through Time

One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present. The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t+1; it is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in section 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single-step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be represented well in this way.
10.9.2 Leaky Units and a Spectrum of Different Time Scales

Another way to obtain paths on which the product of derivatives is close to one is to have units with linear self-connections and a weight near one on these connections.

When we accumulate a running average μ^{(t)} of some value v^{(t)} by applying the update

μ^{(t)} ← α μ^{(t−1)} + (1 − α) v^{(t)},

the α parameter is an example of a linear self-connection from μ^{(t−1)} to μ^{(t)}. When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.

Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them. Having such leaky units at different time scales appears to be helpful for long-term dependencies (Mozer, 1992; Pascanu et al., 2013).
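The time-constant interpretation of α is easy to see numerically. This short sketch (illustrative, not from the original text) runs the running-average update above with several values of α; the effective memory spans roughly 1/(1−α) time steps:

```python
import numpy as np

# Running-average view of a leaky unit: mu <- alpha*mu + (1-alpha)*v.
# With alpha near 1 the unit integrates over long horizons; with alpha
# near 0 it tracks only the most recent value.
rng = np.random.default_rng(5)
v = rng.normal(size=200)              # some signal to accumulate

for alpha in (0.05, 0.5, 0.99):
    mu = 0.0
    for vt in v:
        mu = alpha * mu + (1 - alpha) * vt
    print(f"alpha={alpha}: final mu={mu:+.3f}, "
          f"time constant ~{1 / (1 - alpha):.0f} steps")
```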
10.9.3 Removing Connections

Another approach to handling long-term dependencies is the idea of organizing the state of the RNN at multiple time scales (El Hihi and Bengio, 1996), with information flowing more easily through long distances at the slower time scales.

This idea differs from the skip connections through time discussed earlier because it involves actively removing length-one connections and replacing them with longer connections. Units modified in such a way are forced to operate on a long time scale. Skip connections through time add edges; units receiving such new connections may learn to operate on a long time scale but may also choose to focus on their other, short-term connections.

There are different ways in which a group of recurrent units can be forced to operate at different time scales. One option is to make the recurrent units leaky, but to have different groups of units associated with different fixed time scales. This was the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013). Another option is to have explicit and discrete updates taking place at different times, with a different frequency for different groups of units. This is the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014); it worked well on a number of benchmark datasets.
10.10 Gated RNNs

As of this writing, the most effective sequence models used in practical applications are called gated RNNs. These include the long short-term memory and networks based on the gated recurrent unit (GRU).

Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or parameters. Gated RNNs generalize this to connection weights that may change at each time step.

Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. Once that information has been used, however, it might be useful for the neural network to forget the old state. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.

10.11 LSTM

The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997).
[Figure 10.16: Block diagram of the LSTM recurrent network "cell." Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units.]

A crucial addition since the original model has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself.

The LSTM has been found extremely successful in many applications, such as unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and parsing (Vinyals et al., 2014a).

The LSTM block diagram is illustrated in figure 10.16. The corresponding forward propagation equations are given below, in the form used in a shallow recurrent network architecture; deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an elementwise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have "LSTM cells" that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN.
Each cell has the same inputs and outputs as an ordinary recurrent network, but also has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)}, which has a linear self-loop similar to the leaky units described in the previous section. Here, however, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:

f_i^{(t)} = σ( b_i^f + Σ_j U_{i,j}^f x_j^{(t)} + Σ_j W_{i,j}^f h_j^{(t−1)} ),   (10.40)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is then updated as follows, but with a conditional self-loop weight f_i^{(t)}:

s_i^{(t)} = f_i^{(t)} s_i^{(t−1)} + g_i^{(t)} σ( b_i + Σ_j U_{i,j} x_j^{(t)} + Σ_j W_{i,j} h_j^{(t−1)} ),   (10.41)

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

g_i^{(t)} = σ( b_i^g + Σ_j U_{i,j}^g x_j^{(t)} + Σ_j W_{i,j}^g h_j^{(t−1)} ).   (10.42)

The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

h_i^{(t)} = tanh( s_i^{(t)} ) q_i^{(t)},   (10.43)
q_i^{(t)} = σ( b_i^o + Σ_j U_{i,j}^o x_j^{(t)} + Σ_j W_{i,j}^o h_j^{(t−1)} ),   (10.44)

which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in figure 10.16; this would require three additional parameters.

LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial datasets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2001), then on challenging sequence-processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.
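Equations 10.40–10.44 translate directly into code. Below is a minimal vectorized NumPy sketch of one LSTM step (illustrative, not from the original text; the dictionary-of-parameters layout and the toy usage are assumptions, and the forget-gate bias of 1 follows the advice of Gers et al. (2000) discussed in the next section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step implementing eqs. 10.40-10.44 in vector form.
    p holds weight matrices U*, W* and biases b* for the cell input
    and the forget (f), external input (g) and output (o) gates."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)   # eq. 10.40
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)   # eq. 10.42
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)   # eq. 10.44
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x
                                 + p["W"] @ h_prev)         # eq. 10.41
    h = np.tanh(s) * q                                      # eq. 10.43
    return h, s

# Toy usage with random parameters.
rng = np.random.default_rng(6)
n_in, n_h = 3, 4
p = {}
for k in ("f", "g", "o", ""):
    p["U" + k] = rng.normal(scale=0.1, size=(n_h, n_in))
    p["W" + k] = rng.normal(scale=0.1, size=(n_h, n_h))
    p["b" + k] = np.zeros(n_h)
p["bf"] += 1.0            # bias the forget gate toward remembering
h, s = np.zeros(n_h), np.zeros(n_h)
for x in [rng.normal(size=n_in) for _ in range(5)]:
    h, s = lstm_step(x, h, s, p)
print(h)
```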
10.11.1 Other Gated RNNs

Which pieces of the LSTM architecture are actually necessary? What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units, or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:

h_i^{(t)} = u_i^{(t−1)} h_i^{(t−1)} + (1 − u_i^{(t−1)}) σ( b_i + Σ_j U_{i,j} x_j^{(t−1)} + Σ_j W_{i,j} r_j^{(t−1)} h_j^{(t−1)} ),   (10.45)

where u stands for the "update" gate and r for the "reset" gate. Their value is defined as usual:

u_i^{(t)} = σ( b_i^u + Σ_j U_{i,j}^u x_j^{(t)} + Σ_j W_{i,j}^u h_j^{(t)} )   (10.46)
and
r_i^{(t)} = σ( b_i^r + Σ_j U_{i,j}^r x_j^{(t)} + Σ_j W_{i,j}^r h_j^{(t)} ).   (10.47)

The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it with the new "target state" value (toward which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.

Many more variants around this theme can be designed. For example, the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. Several investigations over architectural variations of the LSTM and GRU, however, found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015). Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
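For comparison with the LSTM sketch above, here is one GRU step. This is an illustrative sketch in which the gates read (x^{(t)}, h^{(t−1)}) and the candidate state uses tanh, following the convention of common implementations; the chapter's printed time indices in equations 10.45–10.47 differ slightly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step in the spirit of eqs. 10.45-10.47."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)   # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)   # reset gate
    h_tilde = np.tanh(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * h_tilde     # conditional leaky mix

rng = np.random.default_rng(7)
n_in, n_h = 3, 4
p = {}
for k in ("u", "r", ""):
    p["U" + k] = rng.normal(scale=0.1, size=(n_h, n_in))
    p["W" + k] = rng.normal(scale=0.1, size=(n_h, n_h))
    p["b" + k] = np.zeros(n_h)
h = np.zeros(n_h)
for x in [rng.normal(size=n_in) for _ in range(5)]:
    h = gru_step(x, h, p)
print(h)
```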
10.12 Optimization for Long-Term Dependencies

Sections 8.2.5 and 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum, with careful initialization, could achieve similar results; see Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning: it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
10.12.1 Clipping Gradients

As discussed in section 8.2.4, strongly nonlinear functions, such as those computed by a recurrent net over many time steps, tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in figures 8.3 and 10.17, in which the objective function (as a function of the parameters) has a "landscape" in which one finds "cliffs": wide and rather flat regions separated by regions where the objective function changes quickly, like a cliff.

[Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two parameters w and b. (Left, "Without clipping") The objective J(w, b) has a cliff; gradient descent without gradient clipping overshoots the bottom of this small ravine, receiving a very large gradient from the cliff face that catapults the parameters far outside the plotted region. (Right, "With clipping") Gradient descent with gradient clipping has a more moderate reaction to the cliff: while it does ascend the cliff face, the step size is restricted so that it cannot be propelled away from the steep region near the solution. Adapted with permission from Pascanu et al. (2013).]

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us only the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside this infinitesimal region, the cost function may begin to curve back upward, so the update must be chosen to be small enough to avoid traversing too much upward curvature.
A simple type of solution, in use by practitioners for many years, is clipping the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al., 2013). One option is to clip the parameter gradient from a minibatch element-wise (Mikolov, 2012), just before the parameter update. Another is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013) just before the parameter update:

if ||g|| > v:   (10.48)
g ← g v / ||g||,   (10.49)

where v is the norm threshold and g is used to update the parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage that it guarantees that each step is still in the gradient direction, but experiments suggest that both forms work similarly. Although the parameter update has the same direction as the true gradient, with gradient norm clipping the parameter update vector norm is now bounded. This bounded gradient avoids performing a detrimental step when the gradient explodes. In fact, even simply taking a random step when the gradient magnitude is above a threshold tends to work almost as well. If the explosion is so severe that the gradient is numerically Inf or NaN (considered infinite or not-a-number), then a random step of size v can be taken and will typically move away from the numerically unstable configuration. It has also been proposed (Graves, 2013) to clip the back-propagated gradient with respect to the hidden units, but no comparison has been published between these variants; we conjecture that all these methods behave similarly.
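Both forms of clipping are one-liners. The following sketch (illustrative; the threshold value is arbitrary) implements equations 10.48–10.49 and the element-wise variant side by side:

```python
import numpy as np

def clip_by_norm(g, v):
    """Norm clipping of eqs. 10.48-10.49: rescale g when ||g|| > v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)   # same direction, bounded length
    return g

def clip_elementwise(g, v):
    """Element-wise clipping in the style of Mikolov (2012)."""
    return np.clip(g, -v, v)

g = np.array([30.0, -40.0])          # exploding gradient, norm 50
print(clip_by_norm(g, 5.0))          # -> [ 3. -4.]: norm 5, direction kept
print(clip_elementwise(g, 5.0))      # -> [ 5. -5.]: direction changed
```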
10.12.2 Regularizing to Encourage Information Flow

Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, we discussed the idea of creating paths in the computational graph of the unrolled recurrent architecture along which the product of gradients associated with arcs is near one. One approach to achieve this is with LSTMs and other self-loop and gating mechanisms, described in section 10.11. Another idea is to regularize or constrain the parameters so as to encourage "information flow." In particular, we would like the gradient vector ∇_{h^{(t)}} L being back-propagated to maintain its magnitude, even if the loss function only penalizes the output at the end of the sequence. Formally, we want

(∇_{h^{(t)}} L) ∂h^{(t)} / ∂h^{(t−1)}   (10.50)

to be as large as

∇_{h^{(t)}} L.   (10.51)

With this objective, Pascanu et al. (2013) propose the following regularizer:

Ω = Σ_t ( || (∇_{h^{(t)}} L) (∂h^{(t)} / ∂h^{(t−1)}) || / || ∇_{h^{(t)}} L || − 1 )².   (10.52)

Computing the gradient of this regularizer may appear difficult, but Pascanu et al. (2013) propose an approximation in which we consider the back-propagated vectors ∇_{h^{(t)}} L as if they were constants (for the purpose of this regularizer, so that there is no need to back-propagate through them). The experiments with this regularizer suggest that, if combined with the norm clipping heuristic (which handles gradient explosion), the regularizer can considerably increase the span of the dependencies that an RNN can learn. Because it keeps the RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly important: without it, gradient explosion prevents learning from succeeding.

A key weakness of this approach is that it is not as effective as the LSTM for tasks where data is abundant, such as language modeling.
10.13 Explicit Memory

Intelligence requires knowledge, and acquiring knowledge can be done via learning, which has motivated the development of large-scale deep architectures. However, there are different kinds of knowledge. Some knowledge can be implicit, subconscious and difficult to verbalize, such as how to walk or how a dog looks different from a cat. Other knowledge can be explicit, declarative and relatively straightforward to put into words, such as everyday commonsense knowledge ("a cat is a kind of animal") or very specific facts that you need to know to accomplish your current goals ("the meeting with the sales team is at 3:00 PM in room 141").

Neural networks excel at storing implicit knowledge, but they struggle to memorize facts. Stochastic gradient descent requires many presentations of the same input before it can be stored in neural network parameters, and even then, that input will not be stored especially precisely. Graves et al. (2014b) hypothesized that this is because neural networks lack the equivalent of the "working memory" system that allows human beings to explicitly hold and manipulate pieces of information that are relevant to achieving a goal. Such explicit memory components would allow our systems not only to rapidly and "intentionally" store and retrieve specific facts, but also to sequentially reason with them.

[Figure 10.18: A schematic of an example of a network with an explicit memory, capturing some of the key design elements of the neural Turing machine. In this diagram we distinguish the "representation" part of the model (the "task network," here a recurrent net in the bottom, controlling the memory) from the "memory" part of the model (the set of memory cells), which can store facts. The task network learns to "control" the memory, deciding where to read from and where to write to within the memory (through the reading and writing mechanisms, indicated by bold arrows pointing at the reading and writing addresses).]

The need for neural networks that can simulate the deliberate, conscious processing of information was articulated earlier, for example by Hinton (1990). To resolve this difficulty, Weston et al. (2014) introduced memory networks that include a set of memory cells that can be accessed via an addressing mechanism. Memory networks originally required a supervision signal instructing them how to use their memory cells. Graves et al. (2014b) introduced the neural Turing machine, which is able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism (see Bahdanau et al. (2015) and section 12.4.5.1). This soft addressing mechanism has become standard with other related architectures emulating algorithmic mechanisms in a way that still allows gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).
Each memory cell can be thought of as an extension of the memory cells in LSTMs and GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address.

It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read to or write from many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells, for example, by producing them via a softmax function. Using these weights with nonzero derivatives allows the functions controlling access to the memory to be optimized using gradient descent.

These memory cells are typically augmented to contain a vector, rather than the single scalar stored by an LSTM or GRU memory cell. There are two reasons for this. One reason is that we have increased the cost of accessing a memory cell; by reading a vector value, we can offset some of this cost. Another reason is that vector-valued memory cells allow for content-based addressing, where the weight used to read from or write to a cell is a function of that cell's content: for example, we can retrieve a complete song from memory if we specify a few words of the song, as in "retrieve the lyrics of the song that has the chorus 'We all live in a yellow submarine.'" Content-based read mechanisms work better if we can make the objects being retrieved large. This is in contrast with location-based addressing, which is not allowed to refer to the content of the memory, as in "retrieve the lyrics of the song in slot 347." Location-based addressing can often be a reasonable mechanism when the memory content is small.

If the content of a memory cell is copied (not forgotten) at most time steps, then the information it contains can be propagated forward in time, and the gradients propagated backward in time, without either vanishing or exploding.

The explicit memory approach is illustrated in figure 10.18, where we see that a "task neural network" is coupled with a memory. Although that task neural network could be feedforward or recurrent, the overall system is a recurrent network. The task network can choose to read from or write to specific memory addresses. Explicit memory seems to allow models to learn tasks that ordinary RNNs or LSTM RNNs cannot learn. One reason for this advantage may be that information and gradients can be propagated (forward in time or backward in time, respectively) for very long durations.

As an alternative to back-propagation through weighted averages of memory cells, we can interpret the memory addressing coefficients as probabilities and stochastically read just one cell (Zaremba and Sutskever, 2015). Optimizing models that make discrete decisions requires specialized optimization algorithms, described in section 20.9.1. So far, training these stochastic architectures that make discrete decisions remains harder than training deterministic algorithms that make soft decisions.

Whether it is soft (allowing back-propagation) or stochastic and hard, the mechanism for choosing an address is in its form identical to the attention mechanism (Bahdanau et al., 2015) discussed in section 12.4.5.1. The idea of attention mechanisms for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with a model that constrained the attention to move only forward through the sequence. In the case of machine translation and memory networks, the attention mechanism was generalized to allow the focus of attention to move to a completely different position from the previous one.

Recurrent neural networks provide a way to extend deep learning to sequential data. They are the last major tool in our deep learning toolbox. Our discussion now moves to how to choose and use these tools, and how to apply them to real-world tasks.