Deep Learning: Chapter 10 — Sequence Modeling: Recurrent and Recursive Nets
(From the Japanese translation of Deep Learning by the matsuolab translation project, rendered here in English.)
Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data. Much as a convolutional network is specialized for processing a grid of values X such as an image, a recurrent neural network is specialized for processing a sequence of values x^{(1)}, …, x^{(τ)}. Just as convolutional networks can readily scale to images with large width and height, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization, and most recurrent networks can also process sequences of variable length.

To go from multilayer networks to recurrent networks, we need to take advantage of an idea found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model. Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and different positions in time.

Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. For example, consider the two sentences "I went to Nepal in 2009" and "In 2009, I went to Nepal." If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information whether it appears in the sixth word or the second word of the sentence. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.

A related idea is the use of convolution across a 1-D temporal sequence, the basis for time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). Convolution allows a network to share parameters across time, but is shallow: each member of the output is a function of a small number of neighboring members of the input. Recurrent networks share parameters in a different way: each member of the output is a function of the previous members of the output, produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

For simplicity, we refer to RNNs as operating on a sequence that contains vectors x^{(t)} with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length τ for each member of the minibatch, and the time step index need not literally refer to the passage of time in the real world.

This chapter extends the idea of a computational graph to include cycles, which represent the influence of the present value of a variable on its own value at a future time step. Such computational graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train and use them. For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).
10.1 Unfolding Computational Graphs

Section 6.5.1 introduced the idea of a computational graph. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

s^{(t)} = f(s^{(t−1)}; θ),   (10.1)

where s^{(t)} is called the state of the system. Equation 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t−1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ−1 times. For example, if we unfold equation 10.1 for τ = 3 time steps, we obtain

s^{(3)} = f(s^{(2)}; θ) = f(f(s^{(1)}; θ); θ).   (10.2, 10.3)

Unfolding the equation by repeatedly applying the definition in this way has yielded an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph.

[Figure 10.1: The classical dynamical system described by equation 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t (…, s^{(t−1)}, s^{(t)}, s^{(t+1)}, …), and the function f maps the state at time t to the state at t+1. The same parameters θ (the same f) are used for all time steps.]

As another example, let us consider a dynamical system driven by an external signal x^{(t)}:

s^{(t)} = f(s^{(t−1)}, x^{(t)}; θ),   (10.4)

where we see that the state now contains information about the whole past sequence.
Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

We can define the hidden units of a recurrent network using equation 10.4 or a similar equation. To indicate that the state is the hidden units of the network, we now rewrite it using the variable h to represent the state:

h^{(t)} = f(h^{(t−1)}, x^{(t)}; θ),   (10.5)

illustrated in figure 10.2. Typical RNNs will add extra architectural features, such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h^{(t)} as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary-length sequence (x^{(t)}, x^{(t−1)}, x^{(t−2)}, …, x^{(2)}, x^{(1)}) to a fixed-length vector h^{(t)}. Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than others. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to store all the information in the input sequence up to time t, but only enough information to predict the rest of the sentence. The most demanding situation is when we ask h^{(t)} to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (chapter 14).

[Figure 10.2: A recurrent network with no outputs. This network just processes the information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram, where a black square indicates a delay of a single time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.]

Equation 10.5 can be drawn in two different ways, as figure 10.2 shows: as a circuit diagram with recurrent connections, or as an unfolded computational graph in which each component is represented by many different variables, one per time step. We can express the unfolded recurrence after t steps with a function g^{(t)}:

h^{(t)} = g^{(t)}(x^{(t)}, x^{(t−1)}, x^{(t−2)}, …, x^{(2)}, x^{(1)})   (10.6)
= f(h^{(t−1)}, x^{(t)}; θ).   (10.7)

The function g^{(t)} takes the whole past sequence (x^{(t)}, x^{(t−1)}, …, x^{(1)}) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^{(t)} into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another state, rather than in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^{(t)} for all possible time steps. Learning a single shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform, and helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.

10.2 Recurrent Neural Networks

Armed with the graph-unrolling and parameter-sharing ideas of section 10.1, we can design a wide variety of recurrent neural networks. Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4.
• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5.
10.8 10.3
10.8
RNN (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996) RNN RNN
1
Siegelmann and Sontag (1995)
886
RNN 10.3
RNN RNN 358
[Figure 10.3: The computational graph for an RNN that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities, and the loss internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. (Left) Circuit diagram. (Right) The same network as an unfolded computational graph, where each node is now associated with one particular time instance.]

We now develop the forward propagation equations for the RNN of figure 10.3. The figure does not specify the choice of activation function for the hidden units; here we assume the hyperbolic tangent. We also assume the output is discrete, as if the RNN is used to predict words or characters, so o^{(t)} gives the unnormalized log probabilities of each possible value of the discrete variable, and we can apply the softmax operation to obtain a vector ŷ^{(t)} of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^{(0)}; then, for each time step from t = 1 to t = τ, we apply the update equations

a^{(t)} = b + W h^{(t−1)} + U x^{(t)},   (10.8)
h^{(t)} = tanh(a^{(t)}),   (10.9)
o^{(t)} = c + V h^{(t)},   (10.10)
ŷ^{(t)} = softmax(o^{(t)}),   (10.11)

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length.
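To make equations 10.8–10.11 concrete, here is a minimal NumPy sketch of this forward pass (not from the original text; the array names `U`, `W`, `V`, `b`, `c` are chosen to mirror the equations, and the toy dimensions are arbitrary):

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Forward pass for the vanilla RNN of eqs. 10.8-10.11.
    x_seq: list of input vectors x^(t); h0: initial hidden state.
    Returns hidden states and softmax outputs for each time step."""
    h = h0
    hs, y_hats = [], []
    for x in x_seq:
        a = b + W @ h + U @ x          # eq. 10.8
        h = np.tanh(a)                 # eq. 10.9
        o = c + V @ h                  # eq. 10.10
        e = np.exp(o - o.max())        # numerically stable softmax
        y_hat = e / e.sum()            # eq. 10.11
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats

# Toy example: 3 input dims, 4 hidden units, 2 output classes.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 4, 2
U = rng.normal(scale=0.1, size=(n_h, n_in))
W = rng.normal(scale=0.1, size=(n_h, n_h))
V = rng.normal(scale=0.1, size=(n_out, n_h))
b, c = np.zeros(n_h), np.zeros(n_out)
x_seq = [rng.normal(size=n_in) for _ in range(5)]
hs, y_hats = rnn_forward(x_seq, np.zeros(n_h), U, W, V, b, c)
print(y_hats[-1])  # distribution over the 2 classes at t = tau
```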
[Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^{(t)}, the hidden-layer activations are h^{(t)}, the outputs are o^{(t)}, the targets are y^{(t)}, and the loss is L^{(t)}. (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by figure 10.3. The RNN of figure 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN of this figure is instead trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward; the previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN of this figure less powerful, but it may be easier to train, because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in section 10.2.1.]
The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^{(t)} is the negative log-likelihood of y^{(t)} given x^{(1)}, …, x^{(t)}, then

L({x^{(1)}, …, x^{(τ)}}, {y^{(1)}, …, y^{(τ)}}) = Σ_t L^{(t)}   (10.12, 10.13)
= −Σ_t log p_model( y^{(t)} | {x^{(1)}, …, x^{(t)}} ),   (10.14)

where p_model( y^{(t)} | {x^{(1)}, …, x^{(t)}} ) is given by reading the entry for y^{(t)} from the model's output vector ŷ^{(t)}.
Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through the unrolled graph of figure 10.3, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential: each time step may be computed only after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time (BPTT) and is discussed further in section 10.2.2. The network with recurrence between hidden units is thus very powerful but also expensive to train. Is there an alternative?

10.2.1 Teacher Forcing and Networks with Output Recurrence

The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in figure 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. It requires that the output units capture all the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can then be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground-truth output y^{(t)} as input at time t+1. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

log p( y^{(1)}, y^{(2)} | x^{(1)}, x^{(2)} )   (10.15)
[Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence, producing o^{(τ)} with loss L^{(τ)} against target y^{(τ)}. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as depicted here), or the gradient on the output o^{(τ)} can be obtained by back-propagating from further downstream modules.]

= log p( y^{(2)} | y^{(1)}, x^{(1)}, x^{(2)} ) + log p( y^{(1)} | x^{(1)}, x^{(2)} ).   (10.16)

In this example, we see that at time t = 2, the model is trained to maximize the conditional probability of y^{(2)} given both the x sequence so far and the previous y value from the training set. Maximum likelihood thus specifies that during training, rather than feeding the model's own output back into itself, these connections should be fed with the target values specifying what the correct output should be. This is illustrated in figure 10.6.

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. Teacher forcing may still be applied to models that have hidden-to-hidden connections, as long as they have connections from the output at one time step to values computed in the next time step. As soon as the hidden units become a function of earlier time steps, however, the BPTT algorithm is necessary; some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network's outputs (or samples from the output distribution) fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time, in "free-running" mode.
[Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^{(t)} drawn from the train set as input to h^{(t+1)}. (Right) When the model is deployed, the true output is generally not known; in this case, we approximate the correct output y^{(t)} with the model's output o^{(t)}, and feed the output back into the model.]

One way to mitigate the gap between the inputs seen at train time and the inputs seen at test time is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in the free-running mode) not seen during training, and how to map the state back toward one that will make the network generate proper outputs after a few steps. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at train time and test time randomly chooses between generated values and actual data values as input, exploiting a curriculum learning strategy that gradually uses more of the generated values as input.
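The contrast between the two input regimes is easy to express in code. The following sketch is not from the original text; it is a minimal illustration for an output-recurrence model, where `step` is a hypothetical one-step transition (names `R`, `step`, `run` are all illustrative):

```python
import numpy as np

def step(h, y_prev, x, params):
    """One hypothetical transition of an output-recurrence RNN:
    h^(t) = tanh(b + W h^(t-1) + R y^(t-1) + U x^(t)),
    followed by a softmax output distribution."""
    W, R, U, V, b, c = params
    h = np.tanh(b + W @ h + R @ y_prev + U @ x)
    o = c + V @ h
    e = np.exp(o - o.max())
    return h, e / e.sum()

def run(x_seq, y_seq, h0, y0, params, teacher_forcing):
    h, y_prev, outputs = h0, y0, []
    for t, x in enumerate(x_seq):
        h, y_hat = step(h, y_prev, x, params)
        outputs.append(y_hat)
        # Train time: feed the ground-truth target into the next step.
        # Test (free-running) time: feed back the model's own output.
        y_prev = y_seq[t] if teacher_forcing else y_hat
    return outputs

rng = np.random.default_rng(8)
n_in, n_h, n_out = 3, 4, 2
params = (rng.normal(scale=0.1, size=(n_h, n_h)),    # W
          rng.normal(scale=0.1, size=(n_h, n_out)),  # R
          rng.normal(scale=0.1, size=(n_h, n_in)),   # U
          rng.normal(scale=0.1, size=(n_out, n_h)),  # V
          np.zeros(n_h), np.zeros(n_out))            # b, c
x_seq = [rng.normal(size=n_in) for _ in range(4)]
y_seq = [np.eye(n_out)[rng.integers(n_out)] for _ in range(4)]
print(run(x_seq, y_seq, np.zeros(n_h), np.zeros(n_out), params, True)[-1])
```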
10.2.2 Computing the Gradient in a Recurrent Neural Network

Computing the gradient through a recurrent neural network is straightforward. One simply applies the generalized back-propagation algorithm of section 6.5.6 to the unrolled computational graph; no specialized algorithms are necessary. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based technique to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above (equations 10.8 and 10.12). The nodes of our computational graph include the parameters U, V, W, b and c, as well as the sequence of nodes indexed by t for x^{(t)}, h^{(t)}, o^{(t)} and L^{(t)}. For each node N we need to compute the gradient ∇_N L recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss:

∂L / ∂L^{(t)} = 1.   (10.17)

In this derivation we assume that the outputs o^{(t)} are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the output, and that the loss is the negative log-likelihood of the true target y^{(t)} given the input so far. The gradient on the outputs at time step t, for all i and t, is then

(∇_{o^{(t)}} L)_i = ∂L / ∂o_i^{(t)} = ŷ_i^{(t)} − 1_{i, y^{(t)}}.   (10.18)

We work our way backward, starting from the end of the sequence. At the final time step τ, h^{(τ)} has only o^{(τ)} as a descendant, so its gradient is simple:

∇_{h^{(τ)}} L = V^⊤ ∇_{o^{(τ)}} L.   (10.19)

We can then iterate backward in time to back-propagate gradients through time, from t = τ−1 down to t = 1, noting that h^{(t)} (for t < τ) has as descendants both o^{(t)} and h^{(t+1)}. Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes b, c, V, W and U by summing the contributions made at every time step, taking care that each parameter is shared across all time steps.
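The whole backward recursion fits in a few lines of NumPy. The following is a minimal sketch, not the book's pseudocode, implementing equations 10.17–10.19 plus the parameter gradients for the tanh/softmax RNN above:

```python
import numpy as np

def rnn_bptt(x_seq, y_seq, h0, U, W, V, b, c):
    """BPTT for the RNN of eqs. 10.8-10.11 with softmax outputs and
    negative log-likelihood loss. y_seq holds integer class labels."""
    tau = len(x_seq)
    # Forward pass, storing states for reuse in the backward pass.
    hs, y_hats = [h0], []
    for x in x_seq:
        h = np.tanh(b + W @ hs[-1] + U @ x)      # eqs. 10.8-10.9
        o = c + V @ h                            # eq. 10.10
        e = np.exp(o - o.max())
        hs.append(h)
        y_hats.append(e / e.sum())               # eq. 10.11
    # Backward pass.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dh_next = np.zeros_like(h0)                  # gradient arriving from t+1
    for t in reversed(range(tau)):
        do = y_hats[t].copy()
        do[y_seq[t]] -= 1.0                      # eq. 10.18: yhat - 1_{i,y}
        dh = V.T @ do + dh_next                  # eq. 10.19 when t = tau
        da = (1.0 - hs[t + 1] ** 2) * dh         # back through tanh
        dV += np.outer(do, hs[t + 1]); dc += do
        dW += np.outer(da, hs[t]);     db += da
        dU += np.outer(da, x_seq[t])
        dh_next = W.T @ da                       # propagate to step t-1
    return dU, dW, dV, db, dc
```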
10.2.3 Recurrent Networks as Directed Graphical Models

In the example recurrent network we have developed so far, the losses L^{(t)} were cross-entropies between training targets y^{(t)} and outputs o^{(t)}. As with a feedforward network, it is in principle possible to use almost any loss with a recurrent network, with the loss chosen based on the task. As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with that distribution to define the loss.

When we use a predictive log-likelihood training objective, such as equation 10.12, we train the RNN to estimate the conditional distribution of the next sequence element y^{(t)} given the past inputs. To keep things simple, consider first the case where the RNN models a sequence of random variables Y = {y^{(1)}, …, y^{(τ)}} with no additional inputs x; the input at time step t is simply the output at time step t−1. The RNN then defines a directed graphical model over the y variables.
[Figure 10.8: Introducing the state variable h^{(t)} in the graphical model of the RNN, even though it is a deterministic function of its inputs, helps to see how we can obtain a very efficient parametrization based on equation 10.5. Every stage in the sequence (for h^{(t)} and y^{(t)}) involves the same structure (the same input size for every node) and can share the same parameters with the other stages.]
We parametrize the joint distribution over the observations using the chain rule of conditional probability (section 3.6):

P(Y) = P(y^{(1)}, …, y^{(τ)}) = Π_{t=1}^{τ} P( y^{(t)} | y^{(t−1)}, y^{(t−2)}, …, y^{(1)} ),   (10.31)

where the right-hand side of the bar is empty for t = 1. Hence the negative log-likelihood of a set of values {y^{(1)}, …, y^{(τ)}} according to such a model is

L = Σ_t L^{(t)},   (10.32)
where
L^{(t)} = −log P( y^{(t)} = y^{(t)} | y^{(t−1)}, y^{(t−2)}, …, y^{(1)} ).   (10.33)

The edges in a graphical model indicate which variables depend directly on other variables. Many graphical models aim to achieve statistical and computational efficiency by omitting edges that do not correspond to strong interactions, for example by making the Markov assumption that the graphical model should contain only edges from {y^{(t−k)}, …, y^{(t−1)}} to y^{(t)}. In some cases, however, we believe that every element of the past may have an influence on the next element of the sequence. RNNs are useful when we believe that the distribution over y^{(t)} may depend on a value of y^{(i)} from the distant past in a way that is not captured by the effect of y^{(i)} on y^{(t−1)}.

One way to interpret an RNN as a graphical model is to view it as defining a graphical model whose structure is the complete graph, able to represent direct dependencies between any pair of y values. Interpreted directly over the y values, such a fully connected model would have an ever-growing number of inputs and parameters for each element of the sequence. RNNs obtain the same full connectivity but with an efficient parametrization, as illustrated in figure 10.8: the state h^{(t)}, although a deterministic function of its inputs, mediates the dependence between any y^{(i)} in the past and y^{(t)} in the future. Incorporating the h^{(t)} nodes in the graphical model decouples the past and the future, with h acting as an intermediate quantity between them.
A variable y^{(i)} in the distant past may influence y^{(t)} via its effect on h. The structure of the graph of figure 10.8 shows that the model can be efficiently parametrized by using the same conditional probability distributions at each time step and that, when the variables are all observed, the probability of the joint assignment of all variables can be evaluated efficiently.

To illustrate the efficiency, suppose y can take k different values. A tabular representation of the conditional distributions over contexts of all lengths up to τ would need on the order of k^τ parameters, whereas the RNN, thanks to parameter sharing, requires a number of parameters that is O(1) as a function of sequence length (the number of parameters may still be adjusted through the dimension of h, trading capacity against the amount of sharing). The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. Equivalently, the assumption is that the conditional probability distribution over the variables at time t+1 given the variables at time t is stationary, meaning that the relationship between the previous time step and the next time step does not depend on t. In principle, it would be possible to use t as an extra input at each time step and let the learner discover any time dependence while sharing as much as it can between different time steps.
To complete the view of an RNN as a graphical model, the model must also have some mechanism for determining the length of the sequence when sampling from it. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012). When that symbol is generated, the sampling process stops. In the training set, we insert this symbol as an extra member of each sequence, immediately after its last element.

Another option is to introduce an extra Bernoulli output to the model that represents the decision to either continue generation or halt generation at each time step. This approach is more general, because it may be applied to any RNN rather than only to RNNs that output a sequence of symbols. The new output unit is usually a sigmoid unit trained with the cross-entropy loss to maximize the log-probability of the correct prediction as to whether the sequence ends at each time step.

A third way to determine the sequence length τ is to add an extra output to the model that predicts the integer τ itself. The model can sample a value of τ and then sample τ steps worth of data. This approach requires adding an extra input to the recurrent update at each time step, so that the recurrent update is aware of whether it is near the end of the generated sequence; this extra input can consist of either the value of τ or the number of remaining time steps τ − t. Without this extra input, the RNN might generate sequences that end abruptly, such as a sentence that ends before it is complete. This approach is based on the decomposition

P(x^{(1)}, …, x^{(τ)}) = P(τ) P(x^{(1)}, …, x^{(τ)} | τ).   (10.34)

The strategy of predicting τ directly is used, for example, by Goodfellow et al. (2014d).
10.2.4 Modeling Sequences Conditioned on Context with RNNs

In the previous section we described how an RNN could correspond to a directed graphical model over a sequence of random variables y^{(t)} with no inputs x. Of course, our development of RNNs as in equation 10.8 included a sequence of inputs x^{(1)}, x^{(2)}, …, x^{(τ)}. In general, RNNs allow the extension of the graphical model view to represent not only a joint distribution over the y variables but also a conditional distribution over y given x. As discussed in the context of feedforward networks in section 6.2.1.1, any model representing a variable P(y; θ) can be reinterpreted as a model representing a conditional distribution P(y | ω) with ω = θ. We can extend such a model to represent a distribution P(y | x) by using the same P(y | ω) as before but making ω a function of x. In the case of an RNN, this can be achieved in different ways; we review here the most common and obvious choices.

Previously, we have discussed RNNs that take a sequence of vectors x^{(t)} for t = 1, …, τ as input. Another option is to take only a single vector x as input. When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence. Some common ways of providing an extra input to an RNN are:

1. as an extra input at each time step, or
2. as the initial state h^{(0)}, or
3. both.

The first and most common approach is illustrated in figure 10.9. The interaction between the input x and each hidden unit vector h^{(t)} is parametrized by a newly introduced weight matrix R that was absent from the model of only the sequence of y values. The same product x^⊤R is added as an additional input to the hidden units at every time step. We can think of the choice of x as determining the value of x^⊤R, which acts effectively as a new bias parameter for each of the hidden units. The weights remain independent of the input. We can think of this model as taking the parameters θ of the nonconditional model and turning them into ω, where the bias parameters within ω are now a function of the input.
[Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences Y. This RNN is appropriate for tasks such as image captioning, where a single image is used as input to a model that then produces a sequence of words describing the image. Each element y^{(t)} of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step).]
Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors x^{(t)} as input. The RNN described in equation 10.8 corresponds to a conditional distribution P(y^{(1)}, …, y^{(τ)} | x^{(1)}, …, x^{(τ)}) that makes a conditional independence assumption that this distribution factorizes as

Π_t P( y^{(t)} | x^{(1)}, …, x^{(t)} ).   (10.35)

To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t+1, as shown in figure 10.10. The model can then represent arbitrary probability distributions over the y sequence. This kind of model representing a distribution over a sequence given another sequence still has one restriction: the length of both sequences must be the same. We describe how to remove this restriction in section 10.4.
[Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence of x values into a distribution over sequences of y values of the same length. Compared with figure 10.3, this RNN contains connections from the previous output to the current state. These connections allow this RNN to model an arbitrary distribution over sequences of y given sequences of x of the same length. The RNN of figure 10.3 is only able to represent distributions in which the y values are conditionally independent from each other given the x values.]
10.3 Bidirectional RNNs

All the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time t captures only information from the past, x^{(1)}, …, x^{(t−1)}, and the present input x^{(t)} (some of the models also incorporate information from past y values, when those are available). In many applications, however, we want to output a prediction of y^{(t)} that may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation, and potentially even on the next few words because of the linguistic dependencies between nearby words.

Bidirectional recurrent neural networks were invented to address that need (Schuster and Paliwal, 1997). They have been extremely successful in applications where that need arises, such as handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999); see also Graves (2012).

As the name suggests, a bidirectional RNN combines an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence. Figure 10.11 illustrates the typical bidirectional RNN, with h^{(t)} standing for the state of the sub-RNN that moves forward through time and g^{(t)} standing for the state of the sub-RNN that moves backward through time. This allows the output units o^{(t)} to compute a representation that depends on both the past and the future, but is most sensitive to the input values around time t, without having to specify a fixed-size window around t.
[Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L^{(t)} at each step t. The h recurrence propagates information forward in time (toward the right), while the g recurrence propagates information backward in time (toward the left). Thus at each point t, the output units o^{(t)} can benefit from a relevant summary of the past in the h^{(t)} input and from a relevant summary of the future in the g^{(t)} input.]

This idea can be naturally extended to two-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point (i, j) of a 2-D grid, an output O_{i,j} could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information. Compared with a convolutional network, RNNs applied to images are typically more expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution to compute the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
10.4 Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in figure 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in figure 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in figures 10.3, 10.4, 10.10 and 10.11 how an RNN can map an input sequence to an output sequence of the same length. Here we discuss how an RNN can be trained to map an input sequence to an output sequence that is not necessarily of the same length.

[Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y^{(1)}, …, y^{(n_y)}) given an input sequence (x^{(1)}, x^{(2)}, …, x^{(n_x)}). It is composed of an encoder RNN that reads the input sequence as well as a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C, which represents a semantic summary of the input sequence and is given as input to the decoder RNN.]
This setting comes up often in applications such as speech recognition, machine translation and question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related). We often call the input to the RNN the "context." We want to produce a representation of this context, C. The context C might be a vector or a sequence of vectors that summarize the input sequence X = (x^{(1)}, …, x^{(n_x)}).

The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this the encoder-decoder and the sequence-to-sequence architecture, shown in figure 10.12. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence and emits the context C, usually as a simple function of its final hidden state; (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just as in figure 10.9) to generate the output sequence Y = (y^{(1)}, …, y^{(n_y)}).

The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs are trained jointly to maximize the average of log P( y^{(1)}, …, y^{(n_y)} | x^{(1)}, …, x^{(n_x)} ) over all the pairs of x and y sequences in the training set. The last state h^{(n_x)} of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.

If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in section 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input: the input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step; these two options can also be combined. There is no constraint that the encoder must have the same size of hidden layer as the decoder.

One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector, and additionally introduced an attention mechanism that learns to associate elements of the sequence C with elements of the output sequence. See section 12.4.5.1 for more details.
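To make the two-stage structure concrete, here is a minimal sketch of the encoder-decoder idea, assuming greedy decoding and a simple tanh RNN for both halves. Everything here (parameter names `enc`, `dec`, the embedding matrix `E`, the toy sizes) is an illustrative assumption, not the architecture of any cited paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h, n_vocab = 3, 8, 5

# Hypothetical parameters; in practice both RNNs are trained jointly
# to maximize log P(y | x) over the training pairs.
enc = {"U": rng.normal(scale=0.1, size=(n_h, n_in)),
       "W": rng.normal(scale=0.1, size=(n_h, n_h))}
dec = {"E": rng.normal(scale=0.1, size=(n_h, n_vocab)),  # embeds y^(t-1)
       "W": rng.normal(scale=0.1, size=(n_h, n_h)),
       "V": rng.normal(scale=0.1, size=(n_vocab, n_h))}

def encode(x_seq):
    h = np.zeros(n_h)
    for x in x_seq:                      # read the whole input sequence
        h = np.tanh(enc["U"] @ x + enc["W"] @ h)
    return h                             # context C = final hidden state

def decode(C, max_len=10, end_token=0):
    h, y, out = C, end_token, []         # C initializes the decoder state
    for _ in range(max_len):
        y_onehot = np.eye(n_vocab)[y]
        h = np.tanh(dec["E"] @ y_onehot + dec["W"] @ h)
        o = dec["V"] @ h
        y = int(np.argmax(o))            # greedy choice of next token
        out.append(y)
        if y == end_token:               # output length n_y need not be n_x
            break
    return out

print(decode(encode([rng.normal(size=n_in) for _ in range(4)])))
```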
10.5 Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of figure 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation, representable by a single layer within a deep MLP.

Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so, and agrees with the idea that we need enough depth to perform the required mappings. See also Schmidhuber (1992), El Hihi and Bengio (1996) and Jaeger (2007a) for earlier work on deep RNNs.

Graves et al. (2013) were the first to show a significant benefit of decomposing the state of an RNN into multiple layers, as in figure 10.13a. We can think of the lower layers in the hierarchy depicted in that figure as playing a role in transforming the raw input into a representation that is more appropriate for the higher levels of the hidden state. Pascanu et al. (2014a) go a step further and propose to have a separate MLP (possibly deep) for each of the three blocks enumerated above, as illustrated in figure 10.13b. Considerations of representational capacity suggest allocating enough capacity in each of these three steps, but doing so by adding depth may hurt learning by making optimization difficult: in general, it is easier to optimize shallower architectures, and adding the extra depth of figure 10.13b makes the shortest path from a variable in time step t to a variable in time step t+1 longer. This can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in figure 10.13c.

[Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu et al., 2014a). (a) The hidden recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts; this may lengthen the shortest path linking different time steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.]
10.6 Recursive Neural Networks

Recursive neural networks*2 represent yet another generalization of recurrent networks, with a different kind of computational graph, structured as a deep tree rather than the chain-like structure of RNNs. A typical computational graph for a recursive network is illustrated in figure 10.14. Recursive neural networks were introduced by Pollack (1990), and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).

[Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. A variable-size sequence x^{(1)}, x^{(2)}, …, x^{(t)} can be mapped to a fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U, V, W). The figure illustrates a supervised learning case in which a target y is provided that is associated with the whole sequence, with loss L.]

One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies. An open question is how to best structure the tree. One option is to have a tree structure that does not depend on the data, such as a balanced binary tree. In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).

Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation; for example, Socher et al. (2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors.

*2 We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with "recurrent neural network."
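The O(log τ) depth claim is easy to see in code. The following is a minimal sketch, assuming a balanced pairwise reduction with one shared composition function (the names `compose` and `reduce_tree` and the weight shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6  # representation size

# One shared weight matrix pair applied at every internal tree node.
W_l = rng.normal(scale=0.3, size=(n, n))
W_r = rng.normal(scale=0.3, size=(n, n))
b = np.zeros(n)

def compose(left, right):
    """Merge two child representations into a parent representation."""
    return np.tanh(W_l @ left + W_r @ right + b)

def reduce_tree(leaves):
    """Combine a sequence pairwise, halving its length at each level,
    so a sequence of length tau is reduced in O(log tau) depth."""
    while len(leaves) > 1:
        nxt = [compose(leaves[i], leaves[i + 1])
               for i in range(0, len(leaves) - 1, 2)]
        if len(leaves) % 2:              # carry an odd leftover upward
            nxt.append(leaves[-1])
        leaves = nxt
    return leaves[0]                     # fixed-size representation

x_seq = [rng.normal(size=n) for _ in range(8)]
print(reduce_tree(x_seq))                # 8 leaves -> 3 levels of compose
```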
10.7 The Challenge of Long-Term Dependencies

The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in section 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other references provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013). In this section, we describe the problem in more detail; the remaining sections describe approaches to overcoming the problem.

Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in figure 10.15. In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

h^{(t)} = W^⊤ h^{(t−1)}   (10.36)

as a very simple recurrent neural network lacking a nonlinear activation function and lacking inputs x. As described in section 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

h^{(t)} = (W^t)^⊤ h^{(0)},   (10.37)

and if W admits an eigendecomposition of the form

W = QΛQ^⊤   (10.38)

with orthogonal Q, the recurrence may be simplified further to
[Figure 10.15: Repeated function composition. When composing many nonlinear functions (like the linear-tanh layer shown here), the result is highly nonlinear, typically with most of the values associated with a tiny derivative, some values with a large derivative, and many alternations between increasing and decreasing. Here, we plot a linear projection of a 100-dimensional hidden state down to a single dimension, plotted on the vertical axis ("Projection of output") against the input coordinate along a random direction in the 100-dimensional space (horizontal axis, "Input coordinate"). We can thus view this plot as a linear cross-section of a high-dimensional function; the curves show the function after different numbers of compositions of the transition function.]

h^{(t)} = Q^⊤ Λ^t Q h^{(0)}.   (10.39)
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h^{(0)} that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. If we instead make a nonrecurrent network that has a different weight w^{(t)} at each time step, however, the situation is different. If the initial state is given by 1, then the state at time t is given by Π_t w^{(t)}. Suppose that the w^{(t)} values are generated randomly, independently from one another, with zero mean and variance v. The variance of the product is O(v^n). To obtain some desired variance v* we may choose the individual weights with variance v = (v*)^{1/n}. Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One might hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long-term dependencies, the gradient of a long-term interaction has exponentially smaller magnitude than the gradient of a short-term interaction, meaning that it can take a very long time to learn long-term dependencies, because the signal about these dependencies will tend to be hidden by the smallest fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we increase the span of the dependencies that need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20. For a deeper treatment of recurrent networks as dynamical systems, see Doya (1993), Bengio et al. (1994), Siegelmann and Sontag (1995) and Pascanu et al. (2013).

The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning. One such approach is the echo state network (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b), described next.
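The behavior of equations 10.36–10.39 can be checked numerically. The following small demonstration (an illustrative sketch, not from the original text; the seed and dimensions are arbitrary) scales a symmetric random matrix to a chosen spectral radius and watches ‖h^{(t)}‖ decay or explode:

```python
import numpy as np

# Numerical illustration of eqs. 10.36-10.39: repeatedly applying the
# same linear map scales each eigendirection by lambda^t.
rng = np.random.default_rng(3)
n = 10
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                 # symmetric, so W = Q Lambda Q^T is real
W *= 0.9 / np.max(np.abs(np.linalg.eigvalsh(W)))  # spectral radius 0.9

h = rng.normal(size=n)
for t in (1, 10, 100):
    ht = np.linalg.matrix_power(W.T, t) @ h       # h^(t) = (W^t)^T h^(0)
    print(t, np.linalg.norm(ht))  # norm decays roughly like 0.9^t

# With spectral radius 1.1 instead, the same product explodes:
W *= 1.1 / 0.9
print(np.linalg.norm(np.linalg.matrix_power(W.T, 100) @ h))
```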
10.8 Echo State Networks

The recurrent weights mapping from h^{(t−1)} to h^{(t)} and the input weights mapping from x^{(t)} to h^{(t)} are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and to learn only the output weights. This is the idea behind echo state networks, or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b), as well as liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009), to denote the fact that the hidden units form a reservoir of temporal features that may capture different aspects of the history of inputs.

One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary-length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h^{(t)}), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then easily be designed to be convex as a function of the output weights.

The important question is: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability (Jaeger, 2003).

The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As explained in section 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians

J^{(t)} = ∂s^{(t)} / ∂s^{(t−1)}.

Of particular importance is the spectral radius of J^{(t)}, defined to be the maximum of the absolute values of its eigenvalues.

To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix J that does not change with t. This happens, for example, when the network is purely linear. Suppose that J has an eigenvector v with corresponding eigenvalue λ. Consider what happens as we propagate a gradient vector backward through time. If we begin with a gradient vector g, then after one step of back-propagation we will have Jg, and after n steps we will have J^n g. Now consider what happens if we instead back-propagate a perturbed version of g, beginning with g + δv. After n steps, we will have J^n (g + δv). From this we see that the two executions have diverged by δ J^n v = δ λ^n v after n steps of back-propagation; that is, the separation grows as δ|λ|^n. When |λ| > 1, the deviation size δ|λ|^n grows exponentially large; when |λ| < 1, it becomes exponentially small.

Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state h^{(t+1)} = (h^{(t)})^⊤ W. When a linear map W^⊤ always shrinks h as measured by the L2 norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from h^{(t)} to h^{(t+1)} is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the past when we use a finite level of precision (such as 32 bits) to store the state vector.

The Jacobian matrix tells us how a small change of h^{(t)} propagates one step forward, or equivalently, how the gradient on h^{(t+1)} propagates one step backward, during back-propagation. Note that neither W nor J need to be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian were applied iteratively). What matters is what happens to the magnitude (complex absolute value) of the possibly complex-valued basis coefficients of h^{(t)}, or of a small variation of h^{(t)}, when we multiply the matrix by the vector: an eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively), and a magnitude smaller than one to shrinking (exponential decay).

With a nonlinear map, the Jacobian is free to change at each step, so the dynamics become more complicated. It remains true, however, that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that back-propagation can retain unbounded dynamics even when forward propagation has bounded dynamics, since back-propagation linearizes around the forward trajectory.

The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in section 8.4.
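A minimal reservoir-computing sketch follows (an illustration, not from the original text; the toy "recall the input from 3 steps ago" task, the spectral radius of 1.2, the washout of 50 steps and all names are assumptions). Only the linear readout is fit, which makes the training problem convex:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_res, T = 1, 50, 300

# Fixed random input and reservoir weights; only the readout is learned.
U = rng.normal(scale=0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))   # set spectral radius

x = rng.uniform(-1, 1, size=(T, n_in))
target = np.roll(x[:, 0], 3)      # toy task: recall the input 3 steps back

h = np.zeros(n_res)
H = np.empty((T, n_res))
for t in range(T):
    h = np.tanh(U @ x[t] + W @ h)  # reservoir states are never trained
    H[t] = h

# Convex problem: fit the linear readout weights by least squares,
# discarding an initial "washout" period of transient states.
w_out, *_ = np.linalg.lstsq(H[50:], target[50:], rcond=None)
print(np.mean((H[50:] @ w_out - target[50:]) ** 2))  # small test error
```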
10.9 Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.

10.9.1 Adding Skip Connections through Time

One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present. The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t+1; it is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in section 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single-step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be represented well in this way.
10.9.2 Leaky Units and a Spectrum of Different Time Scales

Another way to obtain paths on which the product of derivatives is close to one is to have units with linear self-connections and a weight near one on these connections.

When we accumulate a running average μ^{(t)} of some value v^{(t)} by applying the update

μ^{(t)} ← α μ^{(t−1)} + (1 − α) v^{(t)},

the α parameter is an example of a linear self-connection from μ^{(t−1)} to μ^{(t)}. When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.

Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them. Having such leaky units at different time scales appears to be helpful for long-term dependencies (Mozer, 1992; Pascanu et al., 2013).
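The time-constant interpretation of α is easy to see numerically. This short sketch (illustrative, not from the original text) runs the running-average update above with several values of α; the effective memory spans roughly 1/(1−α) time steps:

```python
import numpy as np

# Running-average view of a leaky unit: mu <- alpha*mu + (1-alpha)*v.
# With alpha near 1 the unit integrates over long horizons; with alpha
# near 0 it tracks only the most recent value.
rng = np.random.default_rng(5)
v = rng.normal(size=200)              # some signal to accumulate

for alpha in (0.05, 0.5, 0.99):
    mu = 0.0
    for vt in v:
        mu = alpha * mu + (1 - alpha) * vt
    print(f"alpha={alpha}: final mu={mu:+.3f}, "
          f"time constant ~{1 / (1 - alpha):.0f} steps")
```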
10.9.3 Removing Connections

Another approach to handling long-term dependencies is the idea of organizing the state of the RNN at multiple time scales (El Hihi and Bengio, 1996), with information flowing more easily through long distances at the slower time scales.

This idea differs from the skip connections through time discussed earlier because it involves actively removing length-one connections and replacing them with longer connections. Units modified in such a way are forced to operate on a long time scale. Skip connections through time add edges; units receiving such new connections may learn to operate on a long time scale but may also choose to focus on their other, short-term connections.

There are different ways in which a group of recurrent units can be forced to operate at different time scales. One option is to make the recurrent units leaky, but to have different groups of units associated with different fixed time scales. This was the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013). Another option is to have explicit and discrete updates taking place at different times, with a different frequency for different groups of units. This is the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014); it worked well on a number of benchmark datasets.
10.10 Gated RNNs

As of this writing, the most effective sequence models used in practical applications are called gated RNNs. These include the long short-term memory and networks based on the gated recurrent unit (GRU).

Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or parameters. Gated RNNs generalize this to connection weights that may change at each time step.

Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. Once that information has been used, however, it might be useful for the neural network to forget the old state. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.

10.11 LSTM

The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997).
[Figure 10.16: Block diagram of the LSTM recurrent network "cell." Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units.]

A crucial addition since the original model has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants are output by the model itself.

The LSTM has been found extremely successful in many applications, such as unconstrained handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and parsing (Vinyals et al., 2014a).

The LSTM block diagram is illustrated in figure 10.16. The corresponding forward propagation equations are given below, in the form used in a shallow recurrent network architecture; deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an elementwise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have "LSTM cells" that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN.
Each cell has the same inputs and outputs as an ordinary recurrent network, but also has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)}, which has a linear self-loop similar to the leaky units described in the previous section. Here, however, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), which sets this weight to a value between 0 and 1 via a sigmoid unit:

f_i^{(t)} = σ( b_i^f + Σ_j U_{i,j}^f x_j^{(t)} + Σ_j W_{i,j}^f h_j^{(t−1)} ),   (10.40)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively the biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is then updated as follows, but with a conditional self-loop weight f_i^{(t)}:

s_i^{(t)} = f_i^{(t)} s_i^{(t−1)} + g_i^{(t)} σ( b_i + Σ_j U_{i,j} x_j^{(t)} + Σ_j W_{i,j} h_j^{(t−1)} ),   (10.41)

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

g_i^{(t)} = σ( b_i^g + Σ_j U_{i,j}^g x_j^{(t)} + Σ_j W_{i,j}^g h_j^{(t−1)} ).   (10.42)

The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

h_i^{(t)} = tanh( s_i^{(t)} ) q_i^{(t)},   (10.43)
q_i^{(t)} = σ( b_i^o + Σ_j U_{i,j}^o x_j^{(t)} + Σ_j W_{i,j}^o h_j^{(t−1)} ),   (10.44)

which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in figure 10.16; this would require three additional parameters.

LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial datasets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2001), then on challenging sequence-processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.
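Equations 10.40–10.44 translate directly into code. Below is a minimal vectorized NumPy sketch of one LSTM step (illustrative, not from the original text; the dictionary-of-parameters layout and the toy usage are assumptions, and the forget-gate bias of 1 follows the advice of Gers et al. (2000) discussed in the next section):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step implementing eqs. 10.40-10.44 in vector form.
    p holds weight matrices U*, W* and biases b* for the cell input
    and the forget (f), external input (g) and output (o) gates."""
    f = sigmoid(p["bf"] + p["Uf"] @ x + p["Wf"] @ h_prev)   # eq. 10.40
    g = sigmoid(p["bg"] + p["Ug"] @ x + p["Wg"] @ h_prev)   # eq. 10.42
    q = sigmoid(p["bo"] + p["Uo"] @ x + p["Wo"] @ h_prev)   # eq. 10.44
    s = f * s_prev + g * sigmoid(p["b"] + p["U"] @ x
                                 + p["W"] @ h_prev)         # eq. 10.41
    h = np.tanh(s) * q                                      # eq. 10.43
    return h, s

# Toy usage with random parameters.
rng = np.random.default_rng(6)
n_in, n_h = 3, 4
p = {}
for k in ("f", "g", "o", ""):
    p["U" + k] = rng.normal(scale=0.1, size=(n_h, n_in))
    p["W" + k] = rng.normal(scale=0.1, size=(n_h, n_h))
    p["b" + k] = np.zeros(n_h)
p["bf"] += 1.0            # bias the forget gate toward remembering
h, s = np.zeros(n_h), np.zeros(n_h)
for x in [rng.normal(size=n_in) for _ in range(5)]:
    h, s = lstm_step(x, h, s, p)
print(h)
```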
10.11.1 Other Gated RNNs

Which pieces of the LSTM architecture are actually necessary? What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units, or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:

h_i^{(t)} = u_i^{(t−1)} h_i^{(t−1)} + (1 − u_i^{(t−1)}) σ( b_i + Σ_j U_{i,j} x_j^{(t−1)} + Σ_j W_{i,j} r_j^{(t−1)} h_j^{(t−1)} ),   (10.45)

where u stands for the "update" gate and r for the "reset" gate. Their value is defined as usual:

u_i^{(t)} = σ( b_i^u + Σ_j U_{i,j}^u x_j^{(t)} + Σ_j W_{i,j}^u h_j^{(t)} )   (10.46)
and
r_i^{(t)} = σ( b_i^r + Σ_j U_{i,j}^r x_j^{(t)} + Σ_j W_{i,j}^r h_j^{(t)} ).   (10.47)

The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it with the new "target state" value (toward which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.

Many more variants around this theme can be designed. For example, the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. Several investigations over architectural variations of the LSTM and GRU, however, found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015). Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
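For comparison with the LSTM sketch above, here is one GRU step. This is an illustrative sketch in which the gates read (x^{(t)}, h^{(t−1)}) and the candidate state uses tanh, following the convention of common implementations; the chapter's printed time indices in equations 10.45–10.47 differ slightly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step in the spirit of eqs. 10.45-10.47."""
    u = sigmoid(p["bu"] + p["Uu"] @ x + p["Wu"] @ h_prev)   # update gate
    r = sigmoid(p["br"] + p["Ur"] @ x + p["Wr"] @ h_prev)   # reset gate
    h_tilde = np.tanh(p["b"] + p["U"] @ x + p["W"] @ (r * h_prev))
    return u * h_prev + (1.0 - u) * h_tilde     # conditional leaky mix

rng = np.random.default_rng(7)
n_in, n_h = 3, 4
p = {}
for k in ("u", "r", ""):
    p["U" + k] = rng.normal(scale=0.1, size=(n_h, n_in))
    p["W" + k] = rng.normal(scale=0.1, size=(n_h, n_h))
    p["b" + k] = np.zeros(n_h)
h = np.zeros(n_h)
for x in [rng.normal(size=n_in) for _ in range(5)]:
    h = gru_step(x, h, p)
print(h)
```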
10.12 Optimization for Long-Term Dependencies

Sections 8.2.5 and 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum, with careful initialization, could achieve similar results; see Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning: it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
10.12.1 Clipping Gradients

As discussed in section 8.2.4, strongly nonlinear functions, such as those computed by a recurrent net over many time steps, tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in figures 8.3 and 10.17, in which the objective function (as a function of the parameters) has a "landscape" in which one finds "cliffs": wide and rather flat regions separated by regions where the objective function changes quickly, like a cliff.

[Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two parameters w and b. (Left, "Without clipping") The objective J(w, b) has a cliff; gradient descent without gradient clipping overshoots the bottom of this small ravine, receiving a very large gradient from the cliff face that catapults the parameters far outside the plotted region. (Right, "With clipping") Gradient descent with gradient clipping has a more moderate reaction to the cliff: while it does ascend the cliff face, the step size is restricted so that it cannot be propelled away from the steep region near the solution. Adapted with permission from Pascanu et al. (2013).]

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us only the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside this infinitesimal region, the cost function may begin to curve back upward, so the update must be chosen to be small enough to avoid traversing too much upward curvature.
A simple type of solution, in use by practitioners for many years, is clipping the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al., 2013). One option is to clip the parameter gradient from a minibatch element-wise (Mikolov, 2012), just before the parameter update. Another is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013) just before the parameter update:

if ||g|| > v:   (10.48)
g ← g v / ||g||,   (10.49)

where v is the norm threshold and g is used to update the parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage that it guarantees that each step is still in the gradient direction, but experiments suggest that both forms work similarly. Although the parameter update has the same direction as the true gradient, with gradient norm clipping the parameter update vector norm is now bounded. This bounded gradient avoids performing a detrimental step when the gradient explodes. In fact, even simply taking a random step when the gradient magnitude is above a threshold tends to work almost as well. If the explosion is so severe that the gradient is numerically Inf or NaN (considered infinite or not-a-number), then a random step of size v can be taken and will typically move away from the numerically unstable configuration. It has also been proposed (Graves, 2013) to clip the back-propagated gradient with respect to the hidden units, but no comparison has been published between these variants; we conjecture that all these methods behave similarly.
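Both forms of clipping are one-liners. The following sketch (illustrative; the threshold value is arbitrary) implements equations 10.48–10.49 and the element-wise variant side by side:

```python
import numpy as np

def clip_by_norm(g, v):
    """Norm clipping of eqs. 10.48-10.49: rescale g when ||g|| > v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)   # same direction, bounded length
    return g

def clip_elementwise(g, v):
    """Element-wise clipping in the style of Mikolov (2012)."""
    return np.clip(g, -v, v)

g = np.array([30.0, -40.0])          # exploding gradient, norm 50
print(clip_by_norm(g, 5.0))          # -> [ 3. -4.]: norm 5, direction kept
print(clip_elementwise(g, 5.0))      # -> [ 5. -5.]: direction changed
```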
10.12.2 Regularizing to Encourage Information Flow

Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, we discussed the idea of creating paths in the computational graph of the unrolled recurrent architecture along which the product of gradients associated with arcs is near one. One approach to achieve this is with LSTMs and other self-loop and gating mechanisms, described in section 10.11. Another idea is to regularize or constrain the parameters so as to encourage "information flow." In particular, we would like the gradient vector ∇_{h^{(t)}} L being back-propagated to maintain its magnitude, even if the loss function only penalizes the output at the end of the sequence. Formally, we want

(∇_{h^{(t)}} L) ∂h^{(t)} / ∂h^{(t−1)}   (10.50)

to be as large as

∇_{h^{(t)}} L.   (10.51)

With this objective, Pascanu et al. (2013) propose the following regularizer:

Ω = Σ_t ( || (∇_{h^{(t)}} L) (∂h^{(t)} / ∂h^{(t−1)}) || / || ∇_{h^{(t)}} L || − 1 )².   (10.52)

Computing the gradient of this regularizer may appear difficult, but Pascanu et al. (2013) propose an approximation in which we consider the back-propagated vectors ∇_{h^{(t)}} L as if they were constants (for the purpose of this regularizer, so that there is no need to back-propagate through them). The experiments with this regularizer suggest that, if combined with the norm clipping heuristic (which handles gradient explosion), the regularizer can considerably increase the span of the dependencies that an RNN can learn. Because it keeps the RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly important: without it, gradient explosion prevents learning from succeeding.

A key weakness of this approach is that it is not as effective as the LSTM for tasks where data is abundant, such as language modeling.
10.13 Explicit Memory

Intelligence requires knowledge, and acquiring knowledge can be done via learning, which has motivated the development of large-scale deep architectures. However, there are different kinds of knowledge. Some knowledge can be implicit, subconscious and difficult to verbalize, such as how to walk or how a dog looks different from a cat. Other knowledge can be explicit, declarative and relatively straightforward to put into words, such as everyday commonsense knowledge ("a cat is a kind of animal") or very specific facts that you need to know to accomplish your current goals ("the meeting with the sales team is at 3:00 PM in room 141").

Neural networks excel at storing implicit knowledge, but they struggle to memorize facts. Stochastic gradient descent requires many presentations of the same input before it can be stored in neural network parameters, and even then, that input will not be stored especially precisely. Graves et al. (2014b) hypothesized that this is because neural networks lack the equivalent of the "working memory" system that allows human beings to explicitly hold and manipulate pieces of information that are relevant to achieving a goal. Such explicit memory components would allow our systems not only to rapidly and "intentionally" store and retrieve specific facts, but also to sequentially reason with them.

[Figure 10.18: A schematic of an example of a network with an explicit memory, capturing some of the key design elements of the neural Turing machine. In this diagram we distinguish the "representation" part of the model (the "task network," here a recurrent net in the bottom, controlling the memory) from the "memory" part of the model (the set of memory cells), which can store facts. The task network learns to "control" the memory, deciding where to read from and where to write to within the memory (through the reading and writing mechanisms, indicated by bold arrows pointing at the reading and writing addresses).]

The need for neural networks that can simulate the deliberate, conscious processing of information was articulated earlier, for example by Hinton (1990). To resolve this difficulty, Weston et al. (2014) introduced memory networks that include a set of memory cells that can be accessed via an addressing mechanism. Memory networks originally required a supervision signal instructing them how to use their memory cells. Graves et al. (2014b) introduced the neural Turing machine, which is able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism (see Bahdanau et al. (2015) and section 12.4.5.1). This soft addressing mechanism has become standard with other related architectures emulating algorithmic mechanisms in a way that still allows gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).
Each memory cell can be thought of as an extension of the memory cells in LSTMs and GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address.

It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read to or write from many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells, for example, by producing them via a softmax function. Using these weights with nonzero derivatives allows the functions controlling access to the memory to be optimized using gradient descent.

These memory cells are typically augmented to contain a vector, rather than the single scalar stored by an LSTM or GRU memory cell. There are two reasons for this. One reason is that we have increased the cost of accessing a memory cell; by reading a vector value, we can offset some of this cost. Another reason is that vector-valued memory cells allow for content-based addressing, where the weight used to read from or write to a cell is a function of that cell's content: for example, we can retrieve a complete song from memory if we specify a few words of the song, as in "retrieve the lyrics of the song that has the chorus 'We all live in a yellow submarine.'" Content-based read mechanisms work better if we can make the objects being retrieved large. This is in contrast with location-based addressing, which is not allowed to refer to the content of the memory, as in "retrieve the lyrics of the song in slot 347." Location-based addressing can often be a reasonable mechanism when the memory content is small.

If the content of a memory cell is copied (not forgotten) at most time steps, then the information it contains can be propagated forward in time, and the gradients propagated backward in time, without either vanishing or exploding.

The explicit memory approach is illustrated in figure 10.18, where we see that a "task neural network" is coupled with a memory. Although that task neural network could be feedforward or recurrent, the overall system is a recurrent network. The task network can choose to read from or write to specific memory addresses. Explicit memory seems to allow models to learn tasks that ordinary RNNs or LSTM RNNs cannot learn. One reason for this advantage may be that information and gradients can be propagated (forward in time or backward in time, respectively) for very long durations.

As an alternative to back-propagation through weighted averages of memory cells, we can interpret the memory addressing coefficients as probabilities and stochastically read just one cell (Zaremba and Sutskever, 2015). Optimizing models that make discrete decisions requires specialized optimization algorithms, described in section 20.9.1. So far, training these stochastic architectures that make discrete decisions remains harder than training deterministic algorithms that make soft decisions.

Whether it is soft (allowing back-propagation) or stochastic and hard, the mechanism for choosing an address is in its form identical to the attention mechanism (Bahdanau et al., 2015) discussed in section 12.4.5.1. The idea of attention mechanisms for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with a model that constrained the attention to move only forward through the sequence. In the case of machine translation and memory networks, the attention mechanism was generalized to allow the focus of attention to move to a completely different position from the previous one.

Recurrent neural networks provide a way to extend deep learning to sequential data. They are the last major tool in our deep learning toolbox. Our discussion now moves to how to choose and use these tools, and how to apply them to real-world tasks.