The EM Algorithm
Michael Collins
In fulfillment of the Written Preliminary Exam II requirement, September 1997
Contents

1 Introduction
2 Preliminaries
   2.1 Notation
   2.2 Maximum-likelihood Estimation
       2.2.1 An example
   2.3 Sufficient Statistics
       2.3.1 An example
   2.4 Exponential Families
       2.4.1 An example: the normal distribution
       2.4.2 Other important properties
3 The EM algorithm
   3.1 An example
   3.2 Proof that $L(\theta)$ is non-decreasing at each iteration
       3.2.1 Proof of equation 25
       3.2.2 Proof of equation 26
   3.3 Proof that $L(\theta)$ is increasing if $\theta$ is not a stationary point of $L$
   3.4 Generalised EM (GEM) algorithms
   3.5 Special Cases of the EM Algorithm
       3.5.1 Exponential Families
       3.5.2 Algebraic Models
   3.6 Summary of the 4 Theorems in DLR
       3.6.1 Theorem 1
       3.6.2 Theorems 2 and 3
       3.6.3 Theorem 4
4 (Wu 83)'s Commentary on the EM algorithm
   4.1 Is $L^*$ a global maximum, local maximum or stationary value?
       4.1.1 Theorem 1
       4.1.2 Theorem 2
       4.1.3 Theorem 3
       4.1.4 Summary of Theorems 1, 2 and 3
       4.1.5 Example of Convergence to a Saddle Point
       4.1.6 Proof of Theorem 1
       4.1.7 Corollary 1
   4.2 Does $\theta^t$ Converge to a point $\theta^*$?
       4.2.1 Theorem 4
       4.2.2 Theorem 5
   4.3 The Non-convergent GEM Algorithm given in (Boyles 83)
5 (Jamshidian and Jennrich 93)
   5.1 Optimisation of Quadratic Functions
       5.1.1 Conjugate Gradient Methods
       5.1.2 Generalised Conjugate Gradient Methods
   5.2 Accelerating EM using Generalised Conjugate Gradients
   5.3 Discussion
6 Conclusions
1 Introduction
The Expectation Maximization (EM) algorithm is a parameter estimation method which falls into the general framework of maximum-likelihood estimation, and is applied in cases where part of the data can be considered to be incomplete, or "hidden". It is essentially an iterative optimisation algorithm which, at least under certain conditions, will converge to parameter values at a local maximum of the likelihood function. Many statistical estimation problems turn out to be special cases of EM, for example: Hidden Markov Models (HMMs) (Baum 71); the generalisation of HMMs to Stochastic Context-Free Grammars (Baker 79); mixture models; and estimation in cases of missing data.
(Dempster, Laird and Rubin 77) (from here on referred to as DLR) defined the EM algorithm, and proved certain properties, in particular that at each iteration the log-likelihood of the observed data is guaranteed to be non-decreasing. That is, if $L(\theta)$ is the log-likelihood of the observed data given parameter values $\theta$, and $\theta^t$, $\theta^{t+1}$ are the parameter values at the $t$'th and $(t+1)$'th iterations respectively, then $L(\theta^{t+1}) \geq L(\theta^t)$. They also defined Generalised EM (GEM) algorithms, which include EM as a special case, and can be more computationally efficient, while still guaranteeing that $L(\theta^{t+1}) \geq L(\theta^t)$.
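To make this monotonicity property concrete, the following is a minimal Python sketch (an illustration added here, not DLR's general formulation) of EM for one simple incomplete-data problem: a mixture of two unit-variance Gaussians with fixed, equal mixing weights and unknown means, where the component label of each point plays the role of the hidden data. The function names and the data-generating setup are illustrative assumptions.

# Minimal EM sketch: two-component Gaussian mixture, unit variances, equal fixed
# mixing weights, unknown means (mu1, mu2). Illustrative only; not from the paper.
import math
import random

def log_likelihood(x, mu1, mu2):
    # Observed-data log-likelihood L(theta) with theta = (mu1, mu2).
    ll = 0.0
    for xi in x:
        p1 = math.exp(-0.5 * (xi - mu1) ** 2) / math.sqrt(2 * math.pi)
        p2 = math.exp(-0.5 * (xi - mu2) ** 2) / math.sqrt(2 * math.pi)
        ll += math.log(0.5 * p1 + 0.5 * p2)
    return ll

def em_step(x, mu1, mu2):
    # E-step: posterior probability that each point came from component 1
    # (the normalising constants cancel, so only the exponentials are needed).
    w = [math.exp(-0.5 * (xi - mu1) ** 2) /
         (math.exp(-0.5 * (xi - mu1) ** 2) + math.exp(-0.5 * (xi - mu2) ** 2))
         for xi in x]
    # M-step: weighted sample means maximise the expected complete-data log-likelihood.
    mu1_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    mu2_new = sum((1 - wi) * xi for wi, xi in zip(w, x)) / sum(1 - wi for wi in w)
    return mu1_new, mu2_new

random.seed(0)
x = [random.gauss(-2, 1) for _ in range(100)] + [random.gauss(2, 1) for _ in range(100)]
mu1, mu2 = -0.5, 0.5
prev = log_likelihood(x, mu1, mu2)
for t in range(20):
    mu1, mu2 = em_step(x, mu1, mu2)
    curr = log_likelihood(x, mu1, mu2)
    assert curr >= prev - 1e-9      # L(theta^{t+1}) >= L(theta^t)
    prev = curr
print(mu1, mu2)                     # the estimates approach -2 and 2

Each pass through em_step is one EM iteration; the assertion checks the non-decreasing log-likelihood property stated above.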
(Wu 83) addressed two issues:

1. Given that $L$ converges to some value $L^*$, is $L^*$ a global maximum, local maximum, saddle point or some other point? It is well known that $L^*$ cannot, in general, be guaranteed to be a global maximum. While $L(\theta^{t+1}) \geq L(\theta^t)$ is one condition for convergence to a stationary point of $L$, (Wu 83) defines additional conditions for convergence of an EM or GEM algorithm to a stationary point. At least for EM algorithms, these conditions are quite mild. He also gave a condition for convergence to a local maximum as opposed to a saddle point, but this condition is difficult to verify in practice (and does not hold in many practical applications).

2. Under what conditions do the parameter estimates $\theta^t$ also converge to some point $\theta^*$? Convergence of $L$ to a point $L^*$ does not guarantee convergence of the parameter estimates to some $\theta^*$, particularly if there is more than one point $\theta$ satisfying $L(\theta) = L^*$.
(JJ 93) emphasise that EM is an optimisation algorithm for $L$, and show that it is approximately a steepest descent algorithm, an optimisation method which often converges slowly. They show that with a relatively minor increase in complexity the EM algorithm can be modified to a conjugate-gradient descent method, which is known to be an improved optimisation algorithm. They give experimental results showing that their algorithm typically converges around 3-10 times faster than standard EM, and can in some cases be 25-100 times faster.

The remainder of this paper gives some background about maximum-likelihood estimation in section 2; considers the major results of DLR, (Wu 83) and (JJ 93) in sections 3, 4 and 5; and concludes in section 6. For a summary of the major points of this paper the reader should refer at this point to the bullet points in section 6.
2 Preliminaries
Most of the results in this section are taken from [BD 77].
2.1 Notation
We use bold-face throughout to denote matrices, normal typeface to denote scalars. Given a vector $X$, we write its $i$'th component as $X_i$. We use the $D$ operator to denote differentiation. Where there is ambiguity regarding which variable differentiation is with respect to, we use superscripts on the $D$ operator. For example, $D^{10} Q(\theta_1, \theta_2)$ is the first derivative of $Q$ w.r.t. $\theta_1$, and $D^{01} Q(\theta_1, \theta_2)$ is the first derivative w.r.t. $\theta_2$.
2.2 Maximum-likelihood Estimation
In general we have:

- A sample $X = \{X_1, X_2, \ldots, X_n\}$, where each $X_i$ is a random variable (a single value, or vector of values).

- A vector of parameters $\theta$, such that we can define the likelihood of the data, $P(X \mid \theta)$.

We can also define the log-likelihood $L(X \mid \theta) = \log P(X \mid \theta)$. Often the $X_i$'s are independently and identically distributed (i.i.d.), so that $L(X \mid \theta) = \sum_{i=1 \ldots n} \log P(X_i \mid \theta)$.

If $\Theta$ is the parameter space, maximum-likelihood (ML) estimation involves setting the ML estimate $\theta_{ML}$ such that

$$ \theta_{ML} = \arg\max_{\theta \in \Theta} L(X \mid \theta) \quad (1) $$
2.2.1 An example
Suppose we toss a coin 6 times, and $X_i = 1$ if the $i$'th toss is heads, 0 if it is tails. Say our sample is $x = \{1, 0, 0, 0, 1, 0\}$. Assume the coin has a probability $p$ of being heads, $1 - p$ of being tails, so that $\theta = p$. Then

$$ L(X = x \mid \theta) = \sum_{i=1}^{n} \log P(X_i = x_i \mid p) = 2 \log p + 4 \log(1 - p) \quad (2) $$
We can maximize $L$ by setting the derivative w.r.t. $p$ equal to 0:

$$ \frac{d\, L(X = x \mid \theta)}{d p} = \frac{2}{p} - \frac{4}{1 - p} = 0 \quad (3) $$

Solving this gives $p = \frac{2}{6}$, which is the "intuitive" estimate for $p$: the proportion of heads which have been seen in the sample.
Another common example of maximum-likelihood estimation is when the components of $X$ are drawn i.i.d. from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$. It is simple enough to prove that the ML estimate for $\mu$ is $\frac{\sum_i X_i}{n}$, i.e., the sample mean.
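As a quick numerical check of these two examples (an illustrative snippet added here, not part of the original paper; the data for the normal case are made up), a grid search over the Bernoulli log-likelihood recovers the closed-form estimate $p = 2/6$, and the grid maximizer of the normal log-likelihood agrees with the sample mean.

# Numerical check of the two ML examples above (illustrative, not from the paper).
import math

# Coin example: sample x = {1,0,0,0,1,0}, so L(p) = 2*log(p) + 4*log(1-p).
x = [1, 0, 0, 0, 1, 0]
def coin_loglik(p):
    return sum(math.log(p if xi == 1 else 1 - p) for xi in x)

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=coin_loglik))        # approximately 2/6 = 0.333

# Normal example with known variance sigma^2 = 1 and made-up data:
# the log-likelihood is -0.5 * sum((y_i - mu)^2), up to an additive constant.
y = [1.2, -0.3, 0.5, 2.1, 0.9]
def normal_loglik(mu):
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

mu_grid = [i / 1000 for i in range(-3000, 3001)]
print(max(mu_grid, key=normal_loglik))   # close to the sample mean...
print(sum(y) / len(y))                   # ...which is 0.88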
2.3 Sufficient Statistics
A statistic $T(X)$ is any real or vector-valued function of the data $X$. Note that if $T(X_1) = T(X_2)$ for two samples $X_1$ and $X_2$ such that $X_1 \neq X_2$, then $T$ reduces the data, by mapping different samples to the same value. $T$ is sufficient if there are functions $g(T(X), \theta)$ and $h(X)$ s.t.

$$ P(X \mid \theta) = g(T(X), \theta)\, h(X) \quad (4) $$

Typically, $g(T(X), \theta) = P(T(X) \mid \theta)$ and $h(X) = P(X \mid T(X))$. The crucial point is that when maximizing $P(X \mid \theta)$ w.r.t. $\theta$ we can simply maximize $g(T(X), \theta)$, so the sufficient statistics summarise the data: for ML estimation, once we know $T$ we don't need to know anything else about the data.
2.3.1 An example
For the coin-tossing example, if the sample size is $n$ and the number of heads in the sample is $N_h$, then

$$ P(X \mid \theta) = p^{N_h} (1 - p)^{(n - N_h)} \quad (5) $$

So $T = (N_h, n)$ is sufficient.
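To illustrate this (a hypothetical snippet, not from the paper): any two samples with the same value of $T = (N_h, n)$ have identical likelihoods for every $p$, and so lead to identical ML estimates.

# Sufficiency illustration for the coin-tossing example (not from the paper):
# different samples with the same (N_h, n) give the same likelihood for every p.
def likelihood(sample, p):
    n_h = sum(sample)                          # number of heads
    n = len(sample)                            # sample size
    return p ** n_h * (1 - p) ** (n - n_h)     # equation 5

x1 = [1, 0, 0, 0, 1, 0]    # N_h = 2, n = 6
x2 = [0, 0, 1, 1, 0, 0]    # a different sample, but with the same (N_h, n)
for p in [0.1, 0.3, 0.5, 0.9]:
    assert abs(likelihood(x1, p) - likelihood(x2, p)) < 1e-12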
2.4 Exponential Families
An important class of distributions is the exponential family, where the likelihood can be written

$$ P(X \mid \theta) = \left\{ \exp\left[ \sum_i C_i(\theta)\, T_i(X) + d(\theta) + S(X) \right] \right\} I_A(X) \quad (6) $$

$I_A$ is the indicator function over the set $A$, and $A$ cannot depend on $\theta$. Note that $T(X) = \{T_1(X), T_2(X), \ldots, T_n(X)\}$ is sufficient.

If we define parameters $\eta = \{\eta_1, \eta_2, \ldots, \eta_n\}$ such that $C_i(\theta) = \eta_i$, then these are called the natural parameters. This can be a useful simplification: for example, when maximizing $L$ we differentiate w.r.t. the parameters, and for the natural parameters the derivative is a simple function involving $T$.
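As a worked example (added here, not part of the original text), the coin-tossing likelihood of equation 5 can be written in the form of equation 6 for a fixed sample size $n$:

$$ P(X \mid \theta) = p^{N_h} (1 - p)^{n - N_h} = \exp\left[ N_h \log\frac{p}{1-p} + n \log(1 - p) \right] $$

so that $C_1(\theta) = \log\frac{p}{1-p}$, $T_1(X) = N_h$, $d(\theta) = n \log(1 - p)$, $S(X) = 0$, and $I_A(X) = 1$ for every binary sample of length $n$. The natural parameter is therefore $\eta_1 = \log\frac{p}{1-p}$, the log-odds of heads.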