Handbook of Pattern Recognition
… a classifier $C(\mathbf{x})$ that assigns objects to one of the classes $\omega_c$, with $c = 1,\dots,K$. Such a classifier has to be derived from a set of examples $X_{tr} = \{\mathbf{x}_i, i = 1,\dots,N\}$ of known classes $y_i$. $X_{tr}$ will be called the training set and $y_i \in \omega_c$, $c = 1,\dots,K$ a label. Unless otherwise stated it is assumed that $y_i$ is unique (objects belong to just a single class) and is known for all objects in $X_{tr}$. In Section 2 training procedures will be discussed to derive classifiers $C(\mathbf{x})$ from training sets. The performance of these classifiers is usually not just related to the quality of the features (their ability to show class differences) but also to their number, i.e. the dimensionality of the feature space.
Fig. 1. The pattern recognition system (schematic: representation of objects by features or dissimilarities; generalization by classifiers or class models; evaluation and adaptation via feature extraction and prototype selection; larger training sets and better sensors or measurement conditions improve the system).
A growing number of features may increase the class separability, but may also decrease the statistical accuracy of the training procedure. It is therefore important to have a small number of good features. In Section 3 a review is given of ways to reduce the number of features by selection or by combination (so-called feature extraction). The evaluation of classifiers, discussed in Section 4, is an important topic. As the characteristics of new applications are often unknown beforehand, the best algorithms for feature reduction and classification have to be found iteratively on the basis of unbiased and accurate testing procedures.

This chapter builds further on earlier reviews of the area of statistical pattern recognition by Fukunaga 12 and by Jain et al. 16. It is inevitable to partly repeat and summarize them. We will, however, also discuss some new directions like one-class classifiers, combining classifiers, dissimilarity representations and techniques for building good classifiers and reducing the feature space simultaneously. In the last section of this chapter, the discussion, we will return to these new developments.

2. Classifiers

For the development of classifiers, we have to consider two main aspects: the basic assumptions that the classifier makes about the data (which result in a functional form of the classifier), and the optimization procedure to fit the model to the training data. It is possible to consider very complex classifiers, but without efficient methods to fit these classifiers to the data, they are not useful. Therefore, in many cases the functional form of the classifier is restricted by the available optimization routines.

We will start by discussing the two-class classification problem. In the first three sections, 2.1, 2.2 and 2.3, the three basic approaches with their assumptions are given: first, modeling the class posteriors; second, modeling the class conditional probabilities; and finally modeling the classification boundary. In Section 2.4 we discuss how these approaches can be extended to work for more than two classes. In Section 2.5, the special case is considered where just one of the classes is reliably sampled. The last section, 2.6, discusses the possibilities to combine several (non-optimal) classifiers.

2.1. Bayes classifiers and approximations
A classifier should assign a new object $\mathbf{x}$ to the most likely class. In a probabilistic setting this means that the label of the class with the highest posterior probability should be chosen. This class can be found when $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ (for a two-class classification problem) are known. The classifier becomes:

if $p(\omega_1|\mathbf{x}) > p(\omega_2|\mathbf{x})$ assign object $\mathbf{x}$ to $\omega_1$, otherwise to $\omega_2$.   (1)

When we assume that $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ are known, and further assume that misclassifying an object originating from $\omega_1$ to $\omega_2$ is as costly as vice versa, classifier (1) is the theoretically optimal classifier and will make the minimum error. This classifier is called the Bayes optimal classifier.
In practice $p(\omega_1|\mathbf{x})$ and $p(\omega_2|\mathbf{x})$ are not known, only samples $\mathbf{x}_i$ are available, and the misclassification costs might be known only approximately. Therefore approximations to the Bayes optimal classifier have to be made. This classifier can be approximated in several different ways, depending on knowledge of the classification problem. The first way is to approximate the class posterior probabilities $p(\omega_c|\mathbf{x})$. The logistic classifier assumes a particular model for the class posterior probabilities:

$p(\omega_1|\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}, \qquad p(\omega_2|\mathbf{x}) = 1 - p(\omega_1|\mathbf{x}),$   (2)
where $\mathbf{w}$ is a $p$-dimensional weight vector. This basically implements a linear classifier in the feature space. An approach to fit this logistic classifier (2) to training data $X_{tr}$ is to maximize the data likelihood $L$:

$L = \prod_{i}^{N} p(\omega_1|\mathbf{x}_i)^{n_1(\mathbf{x}_i)}\, p(\omega_2|\mathbf{x}_i)^{n_2(\mathbf{x}_i)},$   (3)

where $n_c(\mathbf{x})$ is 1 if object $\mathbf{x}$ belongs to class $\omega_c$, and 0 otherwise. This can be done by, for instance, an iterative gradient ascent method. Weights are iteratively updated using:

$\mathbf{w}_{new} = \mathbf{w}_{old} + \eta\,\frac{\partial L}{\partial \mathbf{w}},$   (4)

where $\eta$ is a suitably chosen learning rate parameter. In Ref. 1 the first (and second) derivatives of $L$ with respect to $\mathbf{w}$ are derived and can be plugged into (4).
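As an illustration only, the gradient-ascent fit of the logistic classifier (2) can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: it maximizes the log-likelihood (equivalent to maximizing $L$ in (3)), absorbs the bias term by appending a constant 1 to each feature vector, and uses an ad hoc learning rate.

```python
import numpy as np

def train_logistic(X, y, eta=0.1, n_iter=1000):
    """Fit the logistic classifier (2) by gradient ascent, cf. update rule (4).
    X is an N x p data matrix, y holds labels in {0, 1} (1 encodes omega_1)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # absorb the bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p1 = 1.0 / (1.0 + np.exp(-Xb @ w))          # p(omega_1 | x), eq. (2)
        grad = Xb.T @ (y - p1)                      # gradient of the log-likelihood
        w += eta * grad                             # update rule (4)
    return w

def classify_logistic(w, X):
    """Assign to omega_1 (label 1) when p(omega_1|x) > 0.5, cf. rule (1)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (1.0 / (1.0 + np.exp(-Xb @ w)) > 0.5).astype(int)
```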
2.2. Class densities and Bayes rule
Assumptions on $p(\omega|\mathbf{x})$ are often difficult to make. Sometimes it is more convenient to make assumptions on the class conditional probability densities $p(\mathbf{x}|\omega)$: they indicate the distribution of the objects which are drawn from one of the classes. When assumptions on these distributions can be made, classifier (1) can be derived using Bayes' decision rule:

$p(\omega|\mathbf{x}) = \frac{p(\mathbf{x}|\omega)\,p(\omega)}{p(\mathbf{x})}.$   (5)

This rule basically rewrites the class posterior probabilities in terms of the class conditional probabilities and the class priors $p(\omega)$. This result can be substituted into (1), resulting in the following form:

if $p(\mathbf{x}|\omega_1)p(\omega_1) > p(\mathbf{x}|\omega_2)p(\omega_2)$ assign $\mathbf{x}$ to $\omega_1$, otherwise to $\omega_2$.   (6)

The term $p(\mathbf{x})$ is ignored because it is constant for a given $\mathbf{x}$. Any monotonically increasing function can be applied to both sides without changing the final decision. In some cases, a suitable choice will simplify the notation significantly. In particular,
using a logarithmic transformation can simplify the classifier when functions from the exponential family are used.

For the special case of a two-class problem the classifiers can be rewritten in terms of a single discriminant function $f(\mathbf{x})$, which is the difference between the left hand side and the right hand side. A few possibilities are:

$f(\mathbf{x}) = p(\omega_1|\mathbf{x}) - p(\omega_2|\mathbf{x}),$   (7)

$f(\mathbf{x}) = p(\mathbf{x}|\omega_1)p(\omega_1) - p(\mathbf{x}|\omega_2)p(\omega_2),$   (8)

$f(\mathbf{x}) = \ln\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)} + \ln\frac{p(\omega_1)}{p(\omega_2)}.$   (9)

The classifier becomes:

if $f(\mathbf{x}) > 0$ assign $\mathbf{x}$ to $\omega_1$, otherwise to $\omega_2$.   (10)
In many cases fitting $p(\mathbf{x}|\omega)$ on training data is relatively straightforward. It is the standard density estimation problem: fit a density on a data sample. To estimate each $p(\mathbf{x}|\omega)$, just the objects from the single class $\omega$ are used. Depending on the functional form of the class densities, different classifiers are constructed. One of the most common approaches is to assume a Gaussian density for each of the classes:

$p(\mathbf{x}|\omega) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right),$   (11)
where $\boldsymbol{\mu}$ is the ($p$-dimensional) mean of the class $\omega$, and $\Sigma$ is the covariance matrix. Further, $|\Sigma|$ indicates the determinant of $\Sigma$ and $\Sigma^{-1}$ its inverse. For the explicit values of the parameters $\boldsymbol{\mu}$ and $\Sigma$ usually the maximum likelihood estimates are plugged in; therefore this classifier is called the plug-in Bayes classifier. Extra complications occur when the sample size $N$ is insufficient to (in particular) compute $\Sigma^{-1}$. In these cases a standard solution is to regularize the covariance matrix such that the inverse can be computed:

$\Sigma_\lambda = \Sigma + \lambda \mathbf{I},$   (12)
where $\mathbf{I}$ is the $p \times p$ identity matrix, and $\lambda$ is the regularization parameter that sets the trade-off between the estimated covariance matrix and the regularizer $\mathbf{I}$. Substituting (11) for each of the classes $\omega_1$ and $\omega_2$ (with their estimated $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ and $\Sigma_1, \Sigma_2$) into (9) results in:

$f(\mathbf{x}) = -\tfrac{1}{2}\mathbf{x}^T(\Sigma_1^{-1} - \Sigma_2^{-1})\mathbf{x} + (\Sigma_1^{-1}\boldsymbol{\mu}_1 - \Sigma_2^{-1}\boldsymbol{\mu}_2)^T\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma_1^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\Sigma_2^{-1}\boldsymbol{\mu}_2 - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2| + \ln\frac{p(\omega_1)}{p(\omega_2)}.$   (13)
This classification rule is quadratic in terms of $\mathbf{x}$, and it is therefore called the normal-based quadratic classifier.

For the quadratic classifier a full covariance matrix has to be estimated for each of the classes. In high dimensional feature spaces it can happen that insufficient data is available to estimate these covariance matrices reliably. By restricting the covariance matrices to have fewer free parameters, the estimates can become more reliable. One approach to reduce the number of parameters is to assume that both classes have an identical covariance structure: $\Sigma_1 = \Sigma_2 = \Sigma$. The classifier then simplifies to:

$f(\mathbf{x}) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_1^T\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(\omega_1)}{p(\omega_2)}.$   (14)
Because this classifier is linear in terms of $\mathbf{x}$, it is called the normal-based linear classifier.
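The plug-in construction behind (13) and (14) can be sketched as follows. This is an illustrative NumPy sketch, not the chapter's implementation: it estimates the class means and regularized covariances of (12) and evaluates the quadratic discriminant (13); passing the same pooled covariance for both classes reduces it to the linear case (14).

```python
import numpy as np

def fit_gaussian(X, lam=1e-3):
    """Maximum-likelihood mean and regularized covariance, cf. eq. (12)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True) + lam * np.eye(X.shape[1])
    return mu, Sigma

def quadratic_discriminant(x, mu1, S1, mu2, S2, prior1=0.5, prior2=0.5):
    """f(x) of eq. (13): log p(x|w1)p(w1) - log p(x|w2)p(w2).
    The common -(p/2) ln(2 pi) term cancels between classes and is omitted."""
    def log_gauss(x, mu, S):
        d = x - mu
        return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.linalg.slogdet(S)[1]
    return (log_gauss(x, mu1, S1) + np.log(prior1)
            - log_gauss(x, mu2, S2) - np.log(prior2))

# assign x to omega_1 if quadratic_discriminant(x, ...) > 0, cf. rule (10)
```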
For the linear and the quadratic classifier, strong class distributional assumptions are made: each class has a Gaussian distribution. In many applications this cannot be assumed, and more flexible class models have to be used. One possibility is to use a 'non-parametric' model. An example is the Parzen density model. Here the density is estimated by summing local kernels with a fixed size $h$ which are centered on each of the training objects:

$p(\mathbf{x}|\omega) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{N}(\mathbf{x}; \mathbf{x}_i, h\mathbf{I}),$   (15)
where $\mathbf{I}$ is the identity matrix and $h$ is the width parameter which has to be optimized. By substituting (15) into (6), the Parzen classifier is defined. The only free parameter in this classifier is the size (or width) $h$ of the kernel. Optimizing this parameter by maximizing the likelihood on the training data will result in the solution $h = 0$. To avoid this, a leave-one-out procedure can be used 9.
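A minimal sketch of the Parzen estimate (15) together with a leave-one-out choice of the width $h$; the grid of candidate widths is an arbitrary illustration, not a recommendation from the chapter.

```python
import numpy as np

def parzen_density(x, X, h):
    """Parzen estimate (15): mean of spherical Gaussians N(x; x_i, h*I),
    with h interpreted as the kernel variance."""
    p = X.shape[1]
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.mean(np.exp(-0.5 * d2 / h) / ((2 * np.pi * h) ** (p / 2)))

def loo_width(X, candidates=(0.01, 0.1, 1.0, 10.0)):
    """Pick h by maximizing the leave-one-out log-likelihood on X."""
    best_h, best_ll = None, -np.inf
    for h in candidates:
        ll = sum(np.log(parzen_density(X[i], np.delete(X, i, axis=0), h) + 1e-300)
                 for i in range(len(X)))
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h
```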
2.3. Boundary methods
Density estimation in high dimensional spaces is difficult. In order to have a reliable estimate, large amounts of training data should be available. Unfortunately, in many cases the number of training objects is limited. Therefore it is not always wise to estimate the class distributions completely. Looking at (1), (6) and (10), it is only of interest which class is to be preferred over the other. This problem is simpler than estimating $p(\mathbf{x}|\omega)$. For a two-class problem, just a function $f(\mathbf{x})$ is needed which is positive for objects of $\omega_1$ and negative otherwise. In this section we will list some classifiers which avoid estimating $p(\mathbf{x}|\omega)$ but try to obtain a suitable $f(\mathbf{x})$ directly.

The Fisher classifier searches for a direction $\mathbf{w}$ in the feature space such that the two classes are separated as well as possible. The degree to which the two classes are separated is measured by the so-called Fisher ratio, or Fisher criterion:

$J = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2}.$   (16)

Here $m_1$ and $m_2$ are the means of the two classes, projected onto the direction $\mathbf{w}$: $m_1 = \mathbf{w}^T\boldsymbol{\mu}_1$ and $m_2 = \mathbf{w}^T\boldsymbol{\mu}_2$. The $s_1^2$ and $s_2^2$ are the variances of the two classes projected onto $\mathbf{w}$. The criterion therefore favors directions in which the means are far apart and the variances are small.
This Fisher ratio can be explicitly rewritten in terms of $\mathbf{w}$. First we rewrite

$s_c^2 = \sum_{\mathbf{x}\in\omega_c} (\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu}_c)^2 = \sum_{\mathbf{x}\in\omega_c} \mathbf{w}^T(\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^T\mathbf{w} = \mathbf{w}^T S_c \mathbf{w}.$

Second we write $(m_1 - m_2)^2 = (\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2)^2 = \mathbf{w}^T(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} = \mathbf{w}^T S_B \mathbf{w}$. The term $S_B$ is also called the between scatter matrix. $J$ becomes:

$J = \frac{|m_1 - m_2|^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_1 \mathbf{w} + \mathbf{w}^T S_2 \mathbf{w}} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}},$   (17)

where $S_W = S_1 + S_2$ is also called the within scatter matrix. In order to optimize (17), we set the derivative of (17) to zero and obtain:

$(\mathbf{w}^T S_B \mathbf{w})\, S_W \mathbf{w} = (\mathbf{w}^T S_W \mathbf{w})\, S_B \mathbf{w}.$   (18)

We are interested in the direction of $\mathbf{w}$ and not in its length, so we drop the scalar terms between brackets. Further, from the definition of $S_B$ it follows that $S_B\mathbf{w}$ is always in the direction $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$. Multiplying both sides of (18) by $S_W^{-1}$ gives:

$\mathbf{w} \sim S_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2).$   (19)
This classifier is known as the Fisher classifier. Note that the threshold $b$ is not defined for this classifier. It is also linear and requires the inversion of the within-scatter matrix $S_W$. This formulation yields an identical shape of $\mathbf{w}$ as the expression in (14), although the classifiers use very different starting assumptions!
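The closed-form direction (19) is simple to compute. The sketch below is illustrative only; since the text notes that the threshold $b$ is not defined by the derivation, it is set here, arbitrarily, halfway between the projected class means, which is one common but by no means the only choice.

```python
import numpy as np

def fisher_classifier(X1, X2):
    """w ~ S_W^{-1} (mu_1 - mu_2), eq. (19), for two data matrices X1, X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)              # class scatter matrices S_c
    S2 = (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S1 + S2, mu1 - mu2)     # S_W = S_1 + S_2
    b = -0.5 * w @ (mu1 + mu2)                  # ad hoc threshold (not from (19))
    return w, b

# assign x to omega_1 if w @ x + b > 0, otherwise to omega_2
```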
Most classifiers which have been discussed so far have a very restricted form of their decision boundary. In many cases these boundaries are not flexible enough to follow the true decision boundaries. A flexible method is the $k$-nearest neighbor rule. This classifier looks locally at which labels are most dominant in the training set. First it finds the $k$ nearest objects in the training set, $\mathcal{N}_k(\mathbf{x})$, and then counts how many of these neighbors, $n_1$ and $n_2$, are from class $\omega_1$ or $\omega_2$:

if $n_1 > n_2$ assign $\mathbf{x}$ to $\omega_1$, otherwise to $\omega_2$.   (20)
Although the training of the $k$-nearest neighbor classifier is trivial (it only has to store all training objects; $k$ can simply be optimized by a leave-one-out estimation), it may become expensive to classify a new object $\mathbf{x}$. For this the distances to all training objects have to be computed, which may be prohibitive for large training sets and high dimensional feature spaces.
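For completeness, a brute-force sketch of the $k$-nearest neighbor rule (20), which also makes the classification cost visible: one distance per training object is computed for every new object.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Brute-force k-nearest-neighbor rule (20) with a majority vote."""
    d2 = np.sum((X_train - x) ** 2, axis=1)        # squared distances to all training objects
    nearest = np.argsort(d2)[:k]                   # indices of the k nearest objects
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # most frequent label among the neighbors
```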
Another classifier which is flexible but does not require the storage of the full training set is the multi-layered feed-forward neural network 4. A neural network is a collection of small processing units, called neurons, which are interconnected by weights $\mathbf{w}$ and $\mathbf{v}$ to form a network. A schematic picture is shown in Figure 2. An input object $\mathbf{x}$ is processed through different layers of neurons, through the hidden layer to the output layer. The output of the $j$-th output neuron becomes:

$o_j(\mathbf{x}) = h\Big(\sum_l v_{jl}\, h(\mathbf{w}_l^T\mathbf{x})\Big)$   (21)

(see Figure 2 for the meaning of the variables). The object $\mathbf{x}$ is now assigned to the class $j$ for which the corresponding output neuron has the highest output $o_j$.
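A sketch of the forward pass (21) with sigmoid activations. The weight matrices W and V below stand for the connections $\mathbf{w}$ and $\mathbf{v}$ of Figure 2; their values would come from training (e.g. by back-propagation), which is not shown here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify_mlp(x, W, V):
    """Forward pass of eq. (21): W maps the input to the hidden layer,
    V maps the hidden layer to the output neurons."""
    hidden = sigmoid(W @ x)         # hidden-layer activations h(w_l^T x)
    outputs = sigmoid(V @ hidden)   # output neurons o_j(x)
    return np.argmax(outputs)       # assign x to the class with the highest output
```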
Fig. 2. A multi-layered feed-forward neural network with inputs $x_1,\dots,x_p$, a hidden layer, and sigmoid units of the form $1/(1+\exp(-\mathbf{v}^T\mathbf{h}))$.
… $p(\mathbf{x}|\omega_i)$, and in practice often single Gaussian distributions for each of the classes are chosen. The reconstruction errors still contain free parameters in the form of a matrix of basis vectors $W$ or a set of prototypes $\boldsymbol{\mu}_k$. These are optimized in their respective procedures, like the Principal Component Analysis or Self-Organizing Maps. These scatter criteria and the supervised measures between distributions are mainly used in feature selection, Section 3.2. The unsupervised reconstruction errors are used in feature extraction, Section 3.3.

Table 1. Feature selection criteria for measuring the difference between two distributions or for measuring a reconstruction error.

Measures using scatter matrices (for the explanation of $S_1$ and $S_2$ see text):
  $J = \mathrm{tr}(S_2^{-1} S_1)$
  $J = \ln|S_2^{-1} S_1|$
  $J = \mathrm{tr}\,S_1 / \mathrm{tr}\,S_2$

Measures between distributions:
  Kolmogorov:          $J = \int |p(\omega_1|\mathbf{x}) - p(\omega_2|\mathbf{x})|\, p(\mathbf{x})\, d\mathbf{x}$
  Average separation:  $J = \sum_{\mathbf{x}_i\in\omega_1} \sum_{\mathbf{x}_j\in\omega_2} \|\mathbf{x}_i - \mathbf{x}_j\|$
  Divergence:          $J = \int \big(p(\mathbf{x}|\omega_1) - p(\mathbf{x}|\omega_2)\big) \ln\frac{p(\mathbf{x}|\omega_1)}{p(\mathbf{x}|\omega_2)}\, d\mathbf{x}$
  Chernoff:            $J = -\log \int p^s(\mathbf{x}|\omega_1)\, p^{1-s}(\mathbf{x}|\omega_2)\, d\mathbf{x}$
  Fisher:              $J = \sqrt{\int \big(p(\mathbf{x}|\omega_1)p(\omega_1) - p(\mathbf{x}|\omega_2)p(\omega_2)\big)^2\, d\mathbf{x}}$

Reconstruction errors:
  PCA:  $E = \|\mathbf{x} - W(W^TW)^{-1}W^T\mathbf{x}\|^2$
  SOM:  $E = \min_k \|\mathbf{x} - \boldsymbol{\mu}_k\|^2$
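As an example of evaluating a scatter-based criterion from Table 1, the sketch below computes $J = \mathrm{tr}(S_2^{-1} S_1)$ for a candidate feature subset. Here $S_1$ is taken as the between-class scatter and $S_2$ as the within-class scatter, which is one common convention; the text that defines $S_1$ and $S_2$ falls outside this excerpt, so this pairing is an assumption.

```python
import numpy as np

def scatter_criterion(X, y, features):
    """J = tr(S_2^{-1} S_1) on the selected feature columns, assuming
    S_1 = between-class scatter and S_2 = within-class scatter."""
    Xf = X[:, features]
    mu = Xf.mean(axis=0)
    S1 = np.zeros((Xf.shape[1], Xf.shape[1]))
    S2 = np.zeros_like(S1)
    for c in np.unique(y):
        Xc = Xf[y == c]
        mc = Xc.mean(axis=0)
        S1 += len(Xc) * np.outer(mc - mu, mc - mu)   # between-class scatter
        S2 += (Xc - mc).T @ (Xc - mc)                # within-class scatter
    return np.trace(np.linalg.solve(S2, S1))
```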
3.2. Feature selection
In feature selection a subset of the original features is chosen. A feature reduction procedure consists of two ingredients: the first is the evaluation criterion to evaluate a given set of features, the second is a search strategy to search over all possible feature subsets 16. Exhaustive search is in many applications not feasible. When we start with 250 features and want to select 10, we have to consider in principle $\binom{250}{10} \approx 2\cdot10^{17}$ different subsets, which is clearly too much. Instead of exhaustive search, a forward selection can be applied. It starts with the single best feature (according to the evaluation criterion) and adds the feature which gives the biggest improvement in performance. This is repeated until the requested number of features $k$ is reached. Instead of forward selection, the opposite approach can be used: backward selection. This starts with the complete set of features and removes the feature whose removal gives the largest increase in performance. These approaches have the significant drawback that they might miss the optimal subset: subsets for which the individual features have poor discriminability but which combined give a very good performance. In order to find these subsets, a more advanced search strategy is required; a sketch of the simple forward strategy is given below.
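A sketch of the greedy forward selection just described. The criterion function J is a hypothetical helper (for instance the scatter-based criterion sketched above), assumed to take a list of feature indices and to return a value where higher is better.

```python
def forward_selection(n_features, k, J):
    """Greedily add the feature that improves the criterion J the most,
    until k features have been selected."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
    return selected
```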
One such strategy is a floating search, where adding and removing features are alternated. Another approach is the branch-and-bound algorithm 12, where all subsets of features are arranged in a search tree. This tree is traversed in such an order that large sub-branches can be disregarded as soon as possible, which shortens the search process significantly. This strategy will yield the optimal subset when the evaluation criterion $J$ is monotone, which means that when a certain feature set obtains a value $J_k$, a subset of these features cannot have a higher value. Criteria like the Bayes error, the Chernoff distance or the functions on the scatter matrices fulfill this.

Currently, other approaches appear which combine the traditional feature selection and the subsequent training of a classifier. One example is a linear classifier (with the functional form of (25)) called LASSO, the Least Absolute Shrinkage and Selection Operator 31. The classification problem is approached as a regression problem with an additional regularization. A linear function is fitted to the data by minimizing the following error:
$\min_{\mathbf{w},b}\ \sum_{i}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i - b)^2 + C\|\mathbf{w}\|_1.$   (31)
The first part measures the deviation of the linear function $\mathbf{w}^T\mathbf{x}_i + b$ from the expected label $y_i$. The second part shrinks the weights $\mathbf{w}$ such that many of them become zero. By choosing a suitable value for $C$, the number of retained features can be changed. This kind of regularization appears to be very effective when the number of features is huge (in the thousands) and the training set size is small (in the tens). A similar solution can be obtained when the term $\mathbf{w}^T\mathbf{w}$ in (26) is replaced by $|\mathbf{w}|$ 3.
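To illustrate (31), scikit-learn's Lasso estimator can be used to see how many weights are driven exactly to zero. This sketch assumes scikit-learn is available and uses synthetic data; its regularization parameter alpha plays the role of $C$ up to a scaling of the squared-error term.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))            # small training set, many features
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=30)

model = Lasso(alpha=0.1).fit(X, y)         # squared error plus alpha * ||w||_1, cf. (31)
retained = np.flatnonzero(model.coef_)     # features with a non-zero weight
print(len(retained), "features retained out of", X.shape[1])
```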
3.3. Feature extraction
Instead of using a subset of the given features, a smaller set of new features may be derived from the old ones. This can be done by linear or nonlinear feature extraction. For the computation of new features usually all original features are used. Feature extraction will therefore almost never reduce the amount of measurements. The optimization criteria are often based on reconstruction errors, as in Table 1.

The most well-known linear extraction method is the Principal Component Analysis (PCA) 17. Each new feature $i$ is a linear combination of the original features: $x_i' = \mathbf{w}_i^T\mathbf{x}$. The new features are optimized to minimize the PCA mean squared reconstruction error of Table 1. PCA basically extracts the directions $\mathbf{w}_i$ in which the data set shows the highest variance. These directions appear to be equivalent to the eigenvectors of the (estimated) covariance matrix $\Sigma$ with the largest eigenvalues. For the $i$-th principal component $\mathbf{w}_i$ it therefore holds that:

$\Sigma\mathbf{w}_i = \lambda_i\mathbf{w}_i, \qquad \lambda_i \ge \lambda_j \text{ if } i < j.$   (32)
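A compact sketch of the PCA projection defined by (32), via the eigendecomposition of the estimated covariance matrix. This is illustrative only; numerically, an SVD of the centered data matrix is often preferred.

```python
import numpy as np

def pca(X, k):
    """Project X onto the k eigenvectors of the covariance matrix
    with the largest eigenvalues, cf. eq. (32)."""
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k leading eigenvectors w_i
    return Xc @ W, W                                # new features x'_i = w_i^T x
```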
An extension of the (linear) PCA is the kernelized version, kernel-PCA 24. Here the standard covariance matrix $\Sigma$ is replaced by a covariance matrix in a feature space. After rewriting, the eigenvalue problem in the feature space reduces to the following eigenvalue problem: $K\boldsymbol{\alpha}_i = \lambda_i\boldsymbol{\alpha}_i$. Here $K$ is an $N \times N$ kernel matrix (like for instance (29)). An object $\mathbf{x}$ is mapped onto the $i$-th principal component by:

$x_i' = \sum_j \alpha_j^i K(\mathbf{x}, \mathbf{x}_j).$   (33)

Although this feature extraction is linear in the kernel space, in the original feature space it will obtain nonlinear combinations of features.
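A sketch of the kernelized version for the training objects themselves. The Gaussian kernel and its width gamma are assumptions for illustration, and the centering of the kernel matrix, which the full derivation requires, is included.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Solve K alpha_i = lambda_i alpha_i for a Gaussian kernel matrix K and
    map the training data onto the first k components, cf. eq. (33)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * d2)                       # N x N kernel matrix
    N = len(X)
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                # center the kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas                            # projections sum_j alpha_j^i K(x, x_j)
```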
There are many other methods for extracting nonlinear features, for instance the Self-Organizing Map (SOM) 20. The SOM is an unsupervised clustering and feature extraction method in which the cluster centers are constrained in their placing. The construction of the SOM is such that all objects in the input space retain as much as possible their distance and neighborhood relations in the mapped space. In other words, the topology is preserved in the mapped space. The mapping is performed by a specific type of neural network, equipped with a special learning rule. Assume that we want to map a $k$-dimensional measurement space to a $k'$-dimensional feature space, where $k' < k$. In fact, often $k' = 1$ or $k' = 2$. In the feature space, we define a finite orthogonal grid with grid points. At each grid point we place a neuron or prototype. Each neuron stores a $k$-dimensional vector $\boldsymbol{\mu}_k$ that serves as a cluster center. By defining a grid for the neurons, each neuron does not only have a neighboring neuron in the measurement space, it also has a neighboring neuron in the grid. During the learning phase, neighboring neurons in the grid are enforced to also be neighbors in the measurement space. By doing so, the local topology will be preserved.

Unfortunately, training a SOM involves the setting of many unintuitive parameters and heuristics (similar to many neural network approaches). A more principled approach to the SOM is the Generative Topographic Mapping, GTM 5. The idea is to find a representation of the original $p$-dimensional data $\mathbf{x}$ in terms of $p'$-dimensional latent variables $\mathbf{z}$. For this a mapping function $\mathbf{y}(\mathbf{z}|W)$ has to be defined. In the GTM it is assumed that the distribution of $\mathbf{z}$ in the latent variable space is a grid of delta functions $\mathbf{z}_k$:

$p(\mathbf{z}) = \frac{1}{K}\sum_{k=1}^{K} \delta(\mathbf{z} - \mathbf{z}_k)$
… where $i, j \in [1, N]$. Rabiner 24 defines the three basic problems of HMMs as follows:

Problem 1. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O|\lambda)$, the probability of the observation sequence given the model?

Problem 2. Given the observation sequence $O = O_1 O_2 \cdots O_T$ and the model $\lambda$, how do we choose a corresponding state sequence $Q = q_1 q_2 \ldots q_T$ which is optimal in some meaningful sense (i.e., best "explains" the observations)?

Problem 3. How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O|\lambda)$?

Problems 1 and 2 are elegantly and efficiently solved by the forward and Viterbi 29,12 algorithms respectively, as described by Rabiner in his tutorial.
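A minimal sketch of the forward algorithm for computing $P(O|\lambda)$ with discrete observations. The notation follows Rabiner: A is the state-transition matrix, B the emission matrix and pi the initial state distribution; the scaling needed for long sequences is omitted.

```python
import numpy as np

def forward_probability(obs, A, B, pi):
    """Forward algorithm: P(O | lambda) for a sequence `obs` of symbol indices,
    transition matrix A (N x N), emission matrix B (N x M) and initial
    distribution pi (length N)."""
    alpha = pi * B[:, obs[0]]                 # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination: sum over final states
```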
The forward algorithm is used to recognise matching HMMs (i.e., highest probability models, MAP) from the observation sequences. Note, again, that this is not a typical approach to pattern classification, as it does not involve matching model with observation attributes. That would involve comparing the model parameters with estimated observation model parameters. MAP does not perform this and so it cannot be as sensitive a measure as exact parameter comparisons. Indeed, a number of reports have already shown that quite different HMMs can have identical emissions (observation sequences) 18,3. The Viterbi algorithm is used less frequently, as we are normally more interested in finding the matching model than in finding the state sequence. However, this algorithm is critical in evaluating the precision of the HMM; in other words, how well the model can reconstruct (predict) the observations.

Rabiner proposes solving Problem 3 via the Baum-Welch algorithm which is, in essence, a gradient ascent algorithm, a method which is guaranteed to find local maxima only. Solving Problem 3 is effectively the problem of learning to recognise new patterns, so it is really the fundamental problem of HMM learning theory; a significant improvement here could boost the performance of all HMM based pattern recognition systems. Therefore it is somewhat surprising that there appear to be relatively few papers devoted to this topic; the vast majority are devoted to applications of the HMM. In the next section we compare a number of alternatives to and variations of Baum-Welch in an attempt to find superior learning strategies.

2. Comparison of Methods for Robust HMM Parameter Estimation

We focus on the problem of reliably learning HMMs from a small set of short observation sequences. The need to learn rapidly from small sets arises quite often in practice. In our case, we are interested in learning hand gestures which are limited to just 25 observations. The limitation arises because we record each video at 25 frames per second and each of our gestures takes less than one second to complete. Moreover, we wish to obtain good recognition performance from small training sets to ensure that new gestures can be rapidly recognised by the system.

Four HMM parameter estimation methods are evaluated and compared by using a train-and-test classification methodology. For these binary classification tests we create two random HMMs and then use each of these to generate test and training data sequences. For normalization, we ensure that each test sequence can be correctly recognized by its true model; thus the true models obtain 100% classification accuracy on the test data by construction. The various learning methods are then used to estimate the two HMMs from their respective training sets, and the recognition performance of the pair of estimated HMMs is then evaluated on the unseen test data sets. This random model generation and evaluation process is repeated 16 times for each data sample to provide meaningful statistical results. Before parameter re-estimation, we initialize with two random HMMs, which should yield 50% recognition performance on average. So an average recognition performance above 50% after re-estimation shows that some degree of learning must have taken place. Clearly, if the learning strategy can perfectly determine both of the HMMs which generated the training data sets, we would have 100% recognition performance on the test sets.

We compare four learning methods: 1) traditional Baum-Welch, 2) ensemble averaging
[Figure: Classification performance averaged over 16 experiments; the true model attains 100% classification by construction.]