Deep Learning: Chapter 18
September 17, 2017 | Author: matsuolab | Category: N/A
Short Description
Japanese translation of Deep Learning...
Description
18
16.2.2
$$ p(x; \theta) = \frac{1}{Z(\theta)} \tilde{p}(x; \theta) \tag{18.1} $$
$$ \int \tilde{p}(x)\, dx \tag{18.2} $$
$$ \sum_{x} \tilde{p}(x) \tag{18.3} $$
20 p(x)
18.1
18.
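$$ \nabla_\theta \log p(x; \theta) = \nabla_\theta \log \tilde{p}(x; \theta) - \nabla_\theta \log Z(\theta) \tag{18.4} $$
The two terms on the right-hand side are known as the positive phase and the negative phase of learning.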
RBM
RBM 19
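\begin{align}
& \nabla_\theta \log Z \tag{18.5} \\
&= \frac{\nabla_\theta Z}{Z} \tag{18.6} \\
&= \frac{\nabla_\theta \sum_x \tilde{p}(x)}{Z} \tag{18.7} \\
&= \frac{\sum_x \nabla_\theta \tilde{p}(x)}{Z} \tag{18.8}
\end{align}
For models with p(x) > 0 for all x, \tilde{p}(x) can be replaced by \exp(\log \tilde{p}(x)):
\begin{align}
\frac{\sum_x \nabla_\theta \exp(\log \tilde{p}(x))}{Z} \tag{18.9} \\
&= \frac{\sum_x \exp(\log \tilde{p}(x)) \nabla_\theta \log \tilde{p}(x)}{Z} \tag{18.10} \\
&= \frac{\sum_x \tilde{p}(x) \nabla_\theta \log \tilde{p}(x)}{Z} \tag{18.11} \\
&= \sum_x p(x) \nabla_\theta \log \tilde{p}(x) \tag{18.12} \\
&= E_{x \sim p(x)} \nabla_\theta \log \tilde{p}(x) \tag{18.13}
\end{align}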
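The derivation above sums over discrete x; a similar result holds for continuous x, where differentiation under the integral sign gives the identity
$$ \nabla_\theta \int \tilde{p}(x)\, dx = \int \nabla_\theta \tilde{p}(x)\, dx \tag{18.14} $$
This identity applies only under regularity conditions on \tilde{p} and \nabla_\theta \tilde{p}(x):
(i) \tilde{p} must be a Lebesgue-integrable function of x for every value of \theta.
(ii) The gradient \nabla_\theta \tilde{p}(x) must exist for all \theta and almost all x.
(iii) There must exist an integrable function R(x) that bounds \nabla_\theta \tilde{p}(x) in the sense that \max_i \left| \frac{\partial}{\partial \theta_i} \tilde{p}(x) \right| \le R(x) for all \theta and almost all x.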
$$ \nabla_\theta \log Z = E_{x \sim p(x)} \nabla_\theta \log \tilde{p}(x) \tag{18.15} $$
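The identity in equation 18.15 can be checked numerically on a model small enough to enumerate. The sketch below is only an illustration: the log-linear form `log p~(x; theta) = theta . x` over three binary variables and the names `states`, `log_p_tilde` and `log_Z` are assumptions made for this example, not part of the chapter.

```python
import numpy as np

# All 8 joint states of three binary variables.
states = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)],
                  dtype=float)

def log_p_tilde(theta):
    """Assumed toy model: log p~(x; theta) = theta . x, evaluated for every state."""
    return states @ theta

def log_Z(theta):
    """Exact log partition function, obtained by summing over all states."""
    return np.log(np.sum(np.exp(log_p_tilde(theta))))

theta = np.array([0.5, -1.0, 2.0])

# Left-hand side of 18.15: central finite differences of log Z.
h = 1e-5
grad_fd = np.array([(log_Z(theta + h * e) - log_Z(theta - h * e)) / (2 * h)
                    for e in np.eye(3)])

# Right-hand side: E_{x~p}[grad_theta log p~(x)]; here grad_theta (theta . x) = x.
p = np.exp(log_p_tilde(theta) - log_Z(theta))
grad_expectation = p @ states

print(grad_fd)
print(grad_expectation)
```

For this particular toy model the variables are independent, so both vectors should also agree with the elementwise logistic sigmoid of `theta`.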
x log p˜(x)
log p˜(x) 16.7 18.1
18.2 18.15 1 18.1
log p˜
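Algorithm 18.1 A naive MCMC algorithm for maximizing the log-likelihood with an intractable partition function, using gradient ascent.
Set ϵ, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow burn-in (perhaps 100 to train an RBM on a small image patch).
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇θ log p̃(x^(i); θ)
    Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values.
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇θ log p̃(x̃^(i); θ)
    θ ← θ + ϵ g
end while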
MCMC 2 2
18.1 log p˜
log Z
hallucinations
fantasy particles
(Crick and Mitchison, 1983) log p˜ log Z
[Figure 18.1: two panels, "The positive phase" and "The negative phase", each plotting p_model(x) and p_data(x) against x, with p(x) on the vertical axis.]
18.1
The positive phase
The negative phase
log p˜
19.5
18.1 MCMC
contrastive divergence CD CD
CD-k (Hinton, 2000, 2010)
18.2
k
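Algorithm 18.2 The contrastive divergence algorithm, using gradient ascent as the optimization procedure.
Set ϵ, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ) to mix when initialized from p_data (perhaps 1 to 20 to train an RBM on a small image patch).
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇θ log p̃(x^(i); θ)
    for i = 1 to m do
        x̃^(i) ← x^(i)
    end for
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇θ log p̃(x̃^(i); θ)
    θ ← θ + ϵ g
end while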
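The update in Algorithm 18.2 can be sketched for a small binary RBM. Everything below is an assumption made for illustration: the parameterization `(W, b, c)`, the helper names, the toy data and the hyperparameters are not taken from the chapter, and `gibbs_update` is one possible block Gibbs step rather than a prescribed one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_update(v, W, b, c):
    """One block Gibbs step v -> h -> v' for a binary RBM (assumed model)."""
    h = (rng.random((len(v), W.shape[1])) < sigmoid(v @ W + c)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)

def mean_grad_log_p_tilde(v, W, b, c):
    """Batch mean of grad log p~(v) w.r.t. (W, b, c), hidden units summed out."""
    h_prob = sigmoid(v @ W + c)
    return v.T @ h_prob / len(v), v.mean(axis=0), h_prob.mean(axis=0)

def cd_k_step(x, params, eps=0.1, k=1):
    """One parameter update following the shape of Algorithm 18.2."""
    g_pos = mean_grad_log_p_tilde(x, *params)        # positive phase on data
    x_tilde = x.copy()                               # chains start at the data
    for _ in range(k):
        x_tilde = gibbs_update(x_tilde, *params)
    g_neg = mean_grad_log_p_tilde(x_tilde, *params)  # negative phase on samples
    return tuple(p + eps * (gp - gn) for p, gp, gn in zip(params, g_pos, g_neg))

# Toy data: two 4-bit patterns, repeated (hypothetical example).
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 8, dtype=float)
params = (0.01 * rng.standard_normal((4, 3)), np.zeros(4), np.zeros(3))
for _ in range(200):
    params = cd_k_step(data, params, eps=0.1, k=1)
W, b, c = params
```

The only structural difference from Algorithm 18.1 is the line `x_tilde = x.copy()`: the negative-phase chains start from the minibatch instead of from random values, which is why a small `k` can be used.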
CD
CD
spurious modes k
Carreira-Perpiñan and Hinton (2005)
RBM
fully visible Boltzmann machines
CD
18.2
[Figure 18.2: p_model(x) and p_data(x) plotted against x, with p(x) on the vertical axis.]
18.2
R R
x
edit distance 2-D
CD CD MCMC
Bengio and Delalleau (2009)
MCMC CD
RBM DBN
DBM
CD
CD
CD CD CD
CD Sutskever and Tieleman (2010)
CD
CD CD stochastic maximum likelihood SML (Younes, 1998) persistent contrastive divergence PCD PCD
1
k
PCD-k
(Tieleman, 2008)
18.3
SML
CD SML
CD SML
Marlin et al. (2010) SML RBM
RBM
SVM
SML
SML
SML
k
MNIST
ϵ
7 7
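9
Algorithm 18.3 The stochastic maximum likelihood / persistent contrastive divergence algorithm, using gradient ascent as the optimization procedure.
Set ϵ, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ + ϵg) to burn in, starting from samples from p(x; θ) (perhaps 1 for an RBM on a small image patch, or 5 to 50 for a more complicated model such as a DBM).
Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values.
while not converged do
    Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^{m} ∇θ log p̃(x^(i); θ)
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j))
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^{m} ∇θ log p̃(x̃^(i); θ)
    θ ← θ + ϵ g
end while
SML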
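Algorithm 18.3 differs from contrastive divergence in that the negative-phase chains are initialized once and then carried across parameter updates. A minimal sketch for a binary RBM follows; the parameterization, helper names, toy data and hyperparameters are all assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_update(v, W, b, c):
    """One block Gibbs step v -> h -> v' for a binary RBM (assumed model)."""
    h = (rng.random((len(v), W.shape[1])) < sigmoid(v @ W + c)).astype(float)
    return (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)

def mean_grad_log_p_tilde(v, W, b, c):
    """Batch mean of grad log p~(v) w.r.t. (W, b, c), hidden units summed out."""
    h_prob = sigmoid(v @ W + c)
    return v.T @ h_prob / len(v), v.mean(axis=0), h_prob.mean(axis=0)

def pcd_train(data, n_hidden=3, eps=0.05, k=1, n_steps=300, batch=8):
    """SML / PCD training loop following the shape of Algorithm 18.3."""
    n_visible = data.shape[1]
    params = (0.01 * rng.standard_normal((n_visible, n_hidden)),
              np.zeros(n_visible), np.zeros(n_hidden))
    # Persistent chains: initialized once, outside the training loop.
    x_tilde = (rng.random((batch, n_visible)) < 0.5).astype(float)
    for _ in range(n_steps):
        x = data[rng.integers(0, len(data), size=batch)]
        g_pos = mean_grad_log_p_tilde(x, *params)
        for _ in range(k):
            # Continue from the state left by the previous parameter update.
            x_tilde = gibbs_update(x_tilde, *params)
        g_neg = mean_grad_log_p_tilde(x_tilde, *params)
        params = tuple(p + eps * (gp - gn)
                       for p, gp, gn in zip(params, g_pos, g_neg))
    return params, x_tilde

data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 8, dtype=float)
params, chains = pcd_train(data)
```

Because `x_tilde` is never reset, the step size `eps` has to stay small enough that the chains can track the slowly changing model between updates.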
Berglund and Raiko (2013)
CD
SML
CD SML
CD
MCMC MCMC
SML
17
(Desjardins et al., 2010; Cho et al., 2010) MCMC 1 Fast PCD FPCD
(Tieleman and Hinton, 2009)
θ
$$ \theta = \theta^{(\text{slow})} + \theta^{(\text{fast})} \tag{18.16} $$
2 fast
fast fast fast MCMC
1 log p˜
log Z
log Z
log p˜(x) p˜ log Z
18.3
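$$ \frac{p(x)}{p(y)} = \frac{\frac{1}{Z}\tilde{p}(x)}{\frac{1}{Z}\tilde{p}(y)} = \frac{\tilde{p}(x)}{\tilde{p}(y)} \tag{18.17} $$
Partition x into a, b and c, where a holds the variables whose conditional distribution is sought, b the variables conditioned on, and c the variables that are not part of the query:
$$ p(a \mid b) = \frac{p(a, b)}{p(b)} = \frac{p(a, b)}{\sum_{a,c} p(a, b, c)} = \frac{\tilde{p}(a, b)}{\sum_{a,c} \tilde{p}(a, b, c)} \tag{18.18} $$
Computing this quantity requires marginalizing out a, which is efficient when a and c contain few variables; in the extreme case a is a single variable and c is empty.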
1
c p˜
n−1
n
$$ \log p(x) = \log p(x_1) + \log p(x_2 \mid x_1) + \dots + \log p(x_n \mid x_{1:n-1}) \tag{18.19} $$
a c
c
x2:n pseudolikelihood
b
(Besag, 1975)
x−i
xi
$$ \sum_{i=1}^{n} \log p(x_i \mid x_{-i}) \tag{18.20} $$
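The objective in equation 18.20 needs only ratios of unnormalized probabilities, so Z never has to be computed. The sketch below illustrates this for a fully visible Boltzmann machine over binary variables; the energy form `0.5 x^T W x + b^T x`, the random parameters and the function names are assumptions for this example only.

```python
import numpy as np

def log_p_tilde(x, W, b):
    """Unnormalized log-probability of an assumed fully visible Boltzmann machine:
    0.5 x^T W x + b^T x, with W symmetric and zero on the diagonal."""
    return 0.5 * x @ W @ x + b @ x

def pseudo_log_likelihood(x, W, b):
    """sum_i log p(x_i | x_{-i}), using two evaluations of p~ per variable."""
    total = 0.0
    lp = log_p_tilde(x, W, b)
    for i in range(len(x)):
        x_flip = x.copy()
        x_flip[i] = 1.0 - x_flip[i]      # the only other value x_i can take
        lp_flip = log_p_tilde(x_flip, W, b)
        # p(x_i | x_{-i}) = p~(x) / (p~(x) + p~(x_flip)); Z cancels in the ratio.
        total += lp - np.logaddexp(lp, lp_flip)
    return total

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(n)
x = np.array([1, 0, 1, 1, 0], dtype=float)

pll = pseudo_log_likelihood(x, W, b)
print(pll)
```

With n binary variables this costs 2n evaluations of p~, in line with the k × n count for variables taking k values each.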
k
p˜ k
n
k×n
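(Mase, 1995)
The generalized pseudolikelihood estimator (Huang and Ogata, 2002) uses m index sets S^{(i)}, i = 1, \dots, m. With m = 1 and S^{(1)} = 1, \dots, n it recovers the log-likelihood; with m = n and S^{(i)} = \{i\} it recovers the pseudolikelihood. Its objective is
$$ \sum_{i=1}^{m} \log p(x_{S^{(i)}} \mid x_{-S^{(i)}}) \tag{18.21} $$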
p(x)
S
S 19
p˜(x) p˜
SML 1
(Goodfellow et al., 2013b) SML
log Z 1 Marlin and de Freitas (2011)
18.4 (Hyvärinen, 2005)
Z score
∇x log p(x)
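$$ L(x, \theta) = \frac{1}{2} \left\| \nabla_x \log p_{\text{model}}(x; \theta) - \nabla_x \log p_{\text{data}}(x) \right\|_2^2 \tag{18.22} $$
$$ J(\theta) = \frac{1}{2} E_{p_{\text{data}}(x)} L(x, \theta) \tag{18.23} $$
$$ \theta^* = \min_\theta J(\theta) \tag{18.24} $$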
Z
∇x Z = 0
x
Z pdata
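Minimizing the expected value of L(x, \theta) is equivalent to minimizing the expected value of
$$ \tilde{L}(x, \theta) = \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial x_j^2} \log p_{\text{model}}(x; \theta) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p_{\text{model}}(x; \theta) \right)^2 \right) \tag{18.25} $$
where n is the dimensionality of x.
x
x
log p˜(x) log p˜(x)
log p˜(x)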
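Equation 18.25 can be evaluated without Z because only derivatives with respect to x appear. As a worked illustration, the sketch below uses a one-dimensional model with `log p~(x) = -0.5 * lam * (x - mu)**2`; this model family, the synthetic data and the names `mu`, `lam` and `score_matching_objective` are assumptions for the example. For this family the objective is minimized in closed form by the sample mean and the inverse sample variance.

```python
import numpy as np

def score_matching_objective(x, mu, lam):
    """Sample mean of L~(x, theta) from eq. 18.25 for an assumed 1-D model
    with log p~(x) = -0.5 * lam * (x - mu)**2 (so n = 1)."""
    d1 = -lam * (x - mu)              # d/dx log p~(x)
    d2 = -lam * np.ones_like(x)       # d^2/dx^2 log p~(x)
    return np.mean(d2 + 0.5 * d1 ** 2)

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=5000)   # synthetic data, for illustration only

# Setting the derivatives of the objective to zero gives these estimates.
mu_hat = x.mean()
lam_hat = 1.0 / x.var()

j_opt = score_matching_objective(x, mu_hat, lam_hat)
print(mu_hat, lam_hat, j_opt)
```

Neither the objective nor the estimates ever touch the normalizing constant of the model, which is the point of the method.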
(Hyvärinen, 2007a)
CD Lyu (2009) Marlin et al. (2010)
Marlin et al. (2010) 0
(generalized score matching, GSM) ratio matching
(Hyvärinen, 2007b)
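$$ L^{(\text{RM})}(x, \theta) = \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x; \theta)}{p_{\text{model}}(f(x, j); \theta)}} \right)^2 \tag{18.26} $$
where f(x, j) returns x with the bit at position j flipped.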
2 Marlin et al. (2010) SML
GSM p˜ SML
n
1
MCMC
MCMC
n
Dauphin and Bengio (2013)
Marlin and de Freitas (2011)
18.5 pdata
$$ p_{\text{smoothed}}(x) = \int p_{\text{data}}(y)\, q(x \mid y)\, dy \tag{18.27} $$
q(x | y)
x pdata pmodel
q 5.4.5
Kingma and LeCun (2010)
q
14.5.1
18.6 SML
CD
(Noise-contrastive estimation, NCE) (Gutmann and Hyvarinen, 2010)
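$$ \log p_{\text{model}}(x) = \log \tilde{p}_{\text{model}}(x; \theta) + c \tag{18.28} $$
where c is introduced explicitly as an approximation of -\log Z(\theta).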
c θ θ
c
c
log pmodel (x)
c *1
c NCE
c 1
p(x)
noise distribution pnoise (x)
2 x
y
$$ p_{\text{joint}}(y = 1) = \frac{1}{2} \tag{18.29} $$
$$ p_{\text{joint}}(x \mid y = 1) = p_{\text{model}}(x) \tag{18.30} $$
$$ p_{\text{joint}}(x \mid y = 0) = p_{\text{noise}}(x) \tag{18.31} $$
y
x
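x
$$ p_{\text{train}}(y = 1) = \frac{1}{2}, \quad p_{\text{train}}(x \mid y = 1) = p_{\text{data}}(x), \quad p_{\text{train}}(x \mid y = 0) = p_{\text{noise}}(x) $$
pjoint
ptrain
$$ \theta, c = \arg\max_{\theta, c} E_{x, y \sim p_{\text{train}}} \log p_{\text{joint}}(y \mid x) \tag{18.32} $$
*1
NCE
c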
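The distribution p_joint is essentially a logistic regression model applied to the difference in log-probabilities of the model and the noise distribution:
\begin{align}
p_{\text{joint}}(y = 1 \mid x) &= \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)} \tag{18.33} \\
&= \frac{1}{1 + \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)}} \tag{18.34} \\
&= \frac{1}{1 + \exp\left( \log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)} \tag{18.35} \\
&= \sigma\left( -\log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right) \tag{18.36} \\
&= \sigma\left( \log p_{\text{model}}(x) - \log p_{\text{noise}}(x) \right) \tag{18.37}
\end{align}
log p˜model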
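Equations 18.32 and 18.37 together give a quantity that can be estimated from samples. The sketch below is an illustration under stated assumptions: the model family `p~(x) = exp(-theta * x**2 / 2)`, the Gaussian noise distribution with scale 2, the sample sizes and all function names are chosen for this example and do not come from the chapter.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_p_model(x, theta, c):
    """log p~_model(x; theta) + c for an assumed family p~ = exp(-theta x^2 / 2)."""
    return -0.5 * theta * x ** 2 + c

def log_p_noise(x, scale=2.0):
    """Log-density of the assumed noise distribution N(0, scale^2)."""
    return -0.5 * (x / scale) ** 2 - np.log(scale * np.sqrt(2.0 * np.pi))

def nce_objective(x_data, x_noise, theta, c):
    """Monte Carlo estimate of E_{x,y ~ p_train} log p_joint(y | x) (eq. 18.32)."""
    z_data = log_p_model(x_data, theta, c) - log_p_noise(x_data)
    z_noise = log_p_model(x_noise, theta, c) - log_p_noise(x_noise)
    log_post_data = -np.logaddexp(0.0, -z_data)    # log sigma(z), y = 1
    log_post_noise = -np.logaddexp(0.0, z_noise)   # log (1 - sigma(z)), y = 0
    return 0.5 * log_post_data.mean() + 0.5 * log_post_noise.mean()

rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=20000)    # synthetic "data"
x_noise = rng.normal(0.0, 2.0, size=20000)   # samples from the noise distribution

c_true = -0.5 * np.log(2.0 * np.pi)          # -log Z for theta = 1
obj_true = nce_objective(x_data, x_noise, 1.0, c_true)
obj_shifted = nce_objective(x_data, x_noise, 1.0, c_true + 2.0)
print(obj_true, obj_shifted)
```

In this setup the objective is higher at `c_true` than at the shifted value, which is the sense in which maximizing it over c recovers an estimate of -log Z.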
pjoint NCE
pnoise NCE
(Mnih and Kavukcuoglu, 2013) NCE 1 pmodel pnoise pmodel
pnoise pnoise
pmodel NCE
NCE
pjoint (y = 1 | x)
pjoint (y = 0 | x)
pjoint (y = 1 | x)
p˜ pnoise
self-contrastive estimation
NCE
(Goodfellow, 2014)
NCE
(Welling et al., 2003b; Bengio, 2009)
20.10.4
18.7 Z(θ)
$$ p_A(x; \theta_A) = \frac{1}{Z_A} \tilde{p}_A(x; \theta_A), \qquad p_B(x; \theta_B) = \frac{1}{Z_B} \tilde{p}_B(x; \theta_B) $$
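Suppose the test set consists of m examples \{x^{(1)}, \dots, x^{(m)}\}. If
$$ \prod_i p_A(x^{(i)}; \theta_A) > \prod_i p_B(x^{(i)}; \theta_B) $$
or, equivalently, if
$$ \sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) > 0 \tag{18.38} $$
then M_A is said to be a better model than M_B.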
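Rearranging equation 18.38 gives a form that requires only the ratio of the two models' partition functions:
$$ \sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) = \sum_i \left( \log \frac{\tilde{p}_A(x^{(i)}; \theta_A)}{\tilde{p}_B(x^{(i)}; \theta_B)} \right) - m \log \frac{Z(\theta_A)}{Z(\theta_B)} \tag{18.39} $$
M_A can therefore be compared with M_B from the ratio alone. If the ratio r = Z(\theta_B) / Z(\theta_A) and the value of Z(\theta_A) were both known, Z(\theta_B) would follow:
$$ Z(\theta_B) = r Z(\theta_A) = \frac{Z(\theta_B)}{Z(\theta_A)} Z(\theta_A) \tag{18.40} $$
Proposal distribution:
$$ p_0(x) = \frac{1}{Z_0} \tilde{p}_0(x) $$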
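Z0 p˜0 (x)
\begin{align}
Z_1 &= \int \tilde{p}_1(x)\, dx \tag{18.41} \\
&= \int \frac{p_0(x)}{p_0(x)} \tilde{p}_1(x)\, dx \tag{18.42} \\
&= Z_0 \int p_0(x) \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)}\, dx \tag{18.43} \\
\hat{Z}_1 &= \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} \quad \text{s.t.}: x^{(k)} \sim p_0 \tag{18.44}
\end{align}
The last line is a Monte Carlo estimator \hat{Z}_1 built from samples drawn from p_0(x), each weighted by the ratio of the unnormalized \tilde{p}_1 to the proposal. Dividing by Z_0 gives an estimate of the ratio of partition functions:
$$ \frac{1}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} \quad \text{s.t.}: x^{(k)} \sim p_0 \tag{18.45} $$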
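The estimators in equations 18.44 and 18.45 are short enough to try on a case where the answer is known. In the sketch below both distributions are one-dimensional Gaussians, chosen purely for illustration so that Z_1 = sqrt(2*pi) can be checked; the function names and the sample size K are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def p0_tilde(x):
    """Unnormalized proposal: N(0, 2^2) without its normalizing constant."""
    return np.exp(-x ** 2 / 8.0)

Z0 = 2.0 * np.sqrt(2.0 * np.pi)       # known partition function of the proposal

def p1_tilde(x):
    """Unnormalized target whose partition function is to be estimated."""
    return np.exp(-x ** 2 / 2.0)

K = 100_000
x = rng.normal(0.0, 2.0, size=K)      # x^(k) ~ p_0
ratios = p1_tilde(x) / p0_tilde(x)    # importance weights p~_1 / p~_0

ratio_hat = ratios.mean()             # eq. 18.45: estimate of Z_1 / Z_0
Z1_hat = Z0 * ratio_hat               # eq. 18.44: estimate of Z_1

print(Z1_hat, np.sqrt(2.0 * np.pi))
```

The proposal here is deliberately wider than the target, so the weights stay bounded; a proposal that misses regions where p~_1 is large would give the high-variance behaviour the section goes on to discuss.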
18.39 p0
2 18.44
p1
(Minka, 2005)
p1 p1 p0
p0
p1
p0
18.44
p1
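Zˆ1
$$ \hat{\mathrm{Var}}\left( \hat{Z}_1 \right) = \frac{Z_0}{K^2} \sum_{k=1}^{K} \left( \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} - \hat{Z}_1 \right)^2 \tag{18.46} $$
$\frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})}$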
2 p0
p1
p0
p1
intermediate distributions
bridge the gap
18.7.1 DKL (p0 ∥p1 )
p0
p1
annealed importance sampling AIS (Jarzynski, 1997; Neal, 2001)
$p_{\eta_0}, \dots, p_{\eta_n}$
p0
p1
0 = η0 < η1 < · · · < ηn−1 < ηn = 1 RBM RBM
2 RBM
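The ratio Z_1 / Z_0 can then be written as
\begin{align}
\frac{Z_1}{Z_0} &= \frac{Z_1}{Z_0} \frac{Z_{\eta_1}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-1}}} \tag{18.47} \\
&= \frac{Z_{\eta_1}}{Z_0} \frac{Z_{\eta_2}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-2}}} \frac{Z_1}{Z_{\eta_{n-1}}} \tag{18.48} \\
&= \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}} \tag{18.49}
\end{align}
Provided the distributions $p_{\eta_j}$ and $p_{\eta_{j+1}}$ are sufficiently close for all 0 ≤ j ≤ n−1, each factor $Z_{\eta_{j+1}} / Z_{\eta_j}$ can be estimated reliably by simple importance sampling, and the factors combined into an estimate of Z_1 / Z_0.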
$p_0, p_{\eta_1}, \dots, p_{\eta_{n-1}}, p_1$
p0
$$ p_{\eta_j} \propto p_1^{\eta_j} p_0^{1 - \eta_j} \tag{18.50} $$
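Each transition operator $T_{\eta_j}(x' \mid x)$ gives the conditional probability of moving to $x'$ from the current state $x$, and is defined to leave $p_{\eta_j}(x)$ invariant:
$$ p_{\eta_j}(x) = \int p_{\eta_j}(x')\, T_{\eta_j}(x \mid x')\, dx' \tag{18.51} $$
AIS sampling, starting from p_0 and ending with samples from p_1:
• for k = 1 . . . K
  – $x_{\eta_1}^{(k)} \sim p_0(x)$
  – $x_{\eta_2}^{(k)} \sim T_{\eta_1}(x_{\eta_2}^{(k)} \mid x_{\eta_1}^{(k)})$
  – ...
  – $x_{\eta_{n-1}}^{(k)} \sim T_{\eta_{n-2}}(x_{\eta_{n-1}}^{(k)} \mid x_{\eta_{n-2}}^{(k)})$
  – $x_{\eta_n}^{(k)} \sim T_{\eta_{n-1}}(x_{\eta_n}^{(k)} \mid x_{\eta_{n-1}}^{(k)})$
• end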
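For sample k, chaining the importance weights for the jumps between the intermediate distributions in equation 18.49 gives
$$ w^{(k)} = \frac{\tilde{p}_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde{p}_0(x_{\eta_1}^{(k)})} \frac{\tilde{p}_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde{p}_{\eta_1}(x_{\eta_2}^{(k)})} \cdots \frac{\tilde{p}_1(x_1^{(k)})}{\tilde{p}_{\eta_{n-1}}(x_{\eta_n}^{(k)})} \tag{18.52} $$
To avoid numerical problems, $\log w^{(k)}$ is computed by adding and subtracting log-probabilities rather than computing $w^{(k)}$ by multiplying and dividing probabilities. With the weights of equation 18.52, the ratio of partition functions is estimated as
$$ \frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} w^{(k)} \tag{18.53} $$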
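The sampling loop and the weights of equations 18.50, 18.52 and 18.53 can be sketched in one dimension, where the true ratio is known. Everything specific below is an assumption made for illustration: the two Gaussian endpoints, the linear schedule of η values, the random-walk Metropolis transition (one of many operators that leave p_η invariant), and the values of K and n.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p0_tilde(x):
    """Unnormalized N(0, 1); Z_0 = sqrt(2*pi)."""
    return -0.5 * x ** 2

def log_p1_tilde(x):
    """Unnormalized N(3, 0.5^2); Z_1 = 0.5 * sqrt(2*pi)."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def log_p_eta(x, eta):
    """Geometric intermediate distribution of eq. 18.50, unnormalized."""
    return eta * log_p1_tilde(x) + (1.0 - eta) * log_p0_tilde(x)

def metropolis_step(x, eta, step=0.5):
    """Random-walk Metropolis move that leaves p_eta invariant."""
    proposal = x + step * rng.standard_normal(x.shape)
    log_accept = log_p_eta(proposal, eta) - log_p_eta(x, eta)
    accept = np.log(rng.random(x.shape)) < log_accept
    return np.where(accept, proposal, x)

def ais_log_ratio(K=2000, n=500):
    """Estimate log(Z_1 / Z_0) with K independent AIS runs of n steps each."""
    etas = np.linspace(0.0, 1.0, n + 1)
    x = rng.standard_normal(K)                  # x_{eta_1} ~ p_0
    log_w = np.zeros(K)
    for j in range(1, n + 1):
        # One factor of eq. 18.52, accumulated in the log domain.
        log_w += log_p_eta(x, etas[j]) - log_p_eta(x, etas[j - 1])
        if j < n:
            x = metropolis_step(x, etas[j])     # x_{eta_{j+1}} ~ T_{eta_j}
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))   # log of eq. 18.53

log_ratio_hat = ais_log_ratio()
print(log_ratio_hat, np.log(0.5))
```

The two endpoints barely overlap, so simple importance sampling from p_0 would rely on a handful of samples in the tail; the intermediate distributions spread that gap over many small steps.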
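The AIS procedure corresponds to simple importance sampling on an extended state space, with points sampled over the product space $[x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1]$ (Neal, 2001). The distribution over the extended space is defined as
\begin{align}
\tilde{p}(x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1) &= \tilde{p}_1(x_1)\, \tilde{T}_{\eta_{n-1}}(x_{\eta_{n-1}} \mid x_1)\, \tilde{T}_{\eta_{n-2}}(x_{\eta_{n-2}} \mid x_{\eta_{n-1}}) \tag{18.54} \\
&\quad \cdots \tilde{T}_{\eta_1}(x_{\eta_1} \mid x_{\eta_2}) \tag{18.55}
\end{align}
where $\tilde{T}_a$ is the reverse of the transition operator defined by $T_a$:
$$ \tilde{T}_a(x' \mid x) = \frac{p_a(x')}{p_a(x)} T_a(x \mid x') = \frac{\tilde{p}_a(x')}{\tilde{p}_a(x)} T_a(x \mid x') \tag{18.56} $$
Substituting this into equation 18.55:
\begin{align}
& \tilde{p}(x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1) \tag{18.57} \\
&= \tilde{p}_1(x_1) \frac{\tilde{p}_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \prod_{i=1}^{n-2} \frac{\tilde{p}_{\eta_i}(x_{\eta_i})}{\tilde{p}_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}) \tag{18.58} \\
&= \frac{\tilde{p}_1(x_1)}{\tilde{p}_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}})\, \tilde{p}_{\eta_1}(x_{\eta_1}) \prod_{i=1}^{n-2} \frac{\tilde{p}_{\eta_{i+1}}(x_{\eta_{i+1}})}{\tilde{p}_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}) \tag{18.59}
\end{align}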
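The sampling scheme generates samples from the joint proposal distribution q over the extended state space:
$$ q(x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1) = p_0(x_{\eta_1})\, T_{\eta_1}(x_{\eta_2} \mid x_{\eta_1}) \cdots T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \tag{18.60} $$
With the joint distribution on the extended space given by equation 18.59 and q as the proposal, the importance weights are
$$ w^{(k)} = \frac{\tilde{p}(x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1)}{q(x_{\eta_1}, \dots, x_{\eta_{n-1}}, x_1)} = \frac{\tilde{p}_1(x_1^{(k)})}{\tilde{p}_{\eta_{n-1}}(x_1^{(k)})} \cdots \frac{\tilde{p}_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde{p}_{\eta_1}(x_{\eta_2}^{(k)})} \frac{\tilde{p}_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde{p}_0(x_{\eta_1}^{(k)})} \tag{18.61} $$
These are the same weights as those proposed for AIS.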
AIS
Jarzynski (1997)
Neal (2001)
(Salakhutdinov and Murray, 2008) AIS
Neal (2001)
18.7.2 bridge sampling (Bennett, 1976)
AIS
1 bridge p0
1
Z1 Z1 /Z0
p∗ p1
p˜0
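Bridge sampling estimates the ratio Z_1 / Z_0 as the ratio of the expected importance weights between $\tilde{p}_0$ and $\tilde{p}_*$ and between $\tilde{p}_1$ and $\tilde{p}_*$:
$$ \frac{Z_1}{Z_0} \approx \sum_{k=1}^{K} \frac{\tilde{p}_*(x_0^{(k)})}{\tilde{p}_0(x_0^{(k)})} \Bigg/ \sum_{k=1}^{K} \frac{\tilde{p}_*(x_1^{(k)})}{\tilde{p}_1(x_1^{(k)})} \tag{18.62} $$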
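Equation 18.62 can be tried on two one-dimensional Gaussians, where exact samples from both p_0 and p_1 are available and the true ratio is known. The sketch below is an illustration only: the endpoints, the sample size and in particular the bridge `p~_* = sqrt(p~_0 * p~_1)` are assumptions of this example, and this geometric bridge is a simple choice, not the optimal bridging distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = 0.7   # standard deviation of p_1 (assumed for this example)

def log_p0_tilde(x):
    """Unnormalized N(0, 1); Z_0 = sqrt(2*pi)."""
    return -0.5 * x ** 2

def log_p1_tilde(x):
    """Unnormalized N(1, s1^2); Z_1 = s1 * sqrt(2*pi)."""
    return -0.5 * ((x - 1.0) / s1) ** 2

def log_p_star_tilde(x):
    """Assumed bridge: geometric mean of the two unnormalized densities."""
    return 0.5 * (log_p0_tilde(x) + log_p1_tilde(x))

K = 200_000
x0 = rng.normal(0.0, 1.0, size=K)   # x_0^(k) ~ p_0
x1 = rng.normal(1.0, s1, size=K)    # x_1^(k) ~ p_1

numerator = np.sum(np.exp(log_p_star_tilde(x0) - log_p0_tilde(x0)))
denominator = np.sum(np.exp(log_p_star_tilde(x1) - log_p1_tilde(x1)))
ratio_hat = numerator / denominator  # eq. 18.62: estimate of Z_1 / Z_0

print(ratio_hat, s1)
```

The unknown normalizer of the bridge appears in both sums and cancels, which is why only the unnormalized `p~_*` is needed.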
p∗
p0
support
p1 2
DKL (p0 ∥p1 )
$$ p_*^{(\text{opt})}(x) \propto \frac{\tilde{p}_0(x)\, \tilde{p}_1(x)}{r\, \tilde{p}_0(x) + \tilde{p}_1(x)}, \qquad r = Z_1 / Z_0 $$
r (Neal, 2005) r AIS
p0
DKL (p0 ∥p1 )
AIS
2
p∗
AIS
p0
p1 1
Neal (2005)
p1
linked importance sampling method
AIS
AIS
Desjardins et al. (2011)
AIS RBM RBM
AIS